# Topic Modeling (Section 1)

### DSCI6003_and_DSCI6004 Final Project

#### Author : Srini Ananthakrishnan
#### Date    : 12/15/2016

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (8.0, 6.0)

### Import GraphLab package

In [2]:
import graphlab as gl
from IPython.display import HTML
from IPython.display import display
from IPython.display import Image
gl.canvas.set_target('ipynb')

### Import pyLDAvis package

In [3]:
import pyLDAvis
import pyLDAvis.graphlab
pyLDAvis.enable_notebook()

### Import pre_processing (custom) package

In [4]:
import pre_processing as pp

## Step 1: Data Exploration
**Understand data --> Class balance --> Visualize and Explore**

### Load file and filter reviews

In [5]:
# Reviews file
REVIEWS_FILE = 'music_reviews.json'

In [6]:
# Load the JSON file into an SFrame
sFrame = pp.load_json_file_to_sframe(REVIEWS_FILE)

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[dict]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [7]:
# Filter out rows with empty review text
reviews = sFrame.filter_by(values = '', column_name = 'reviewText', exclude = True)
# Drop rows where every value is missing (applied to the filtered reviews, not the raw sFrame)
reviews = reviews.dropna(how='all')
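
The loading and filtering steps above can be sketched without GraphLab. This is only an illustration: the custom `pp.load_json_file_to_sframe` helper is not shown in this notebook, so the function name `load_reviews`, the `text_key` parameter, and the assumption that the file holds one JSON object per line are all hypothetical.

```python
import json

def load_reviews(path, text_key="reviewText"):
    """Read one JSON object per line and keep rows with non-empty review text.

    Hypothetical stand-in for the notebook's custom loader plus the
    empty-review filter; it returns a plain list of dicts, not an SFrame.
    """
    reviews = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            row = json.loads(line)
            # Treat a missing, None, or whitespace-only review as empty
            if (row.get(text_key) or "").strip():
                reviews.append(row)
    return reviews
```

Calling `load_reviews('music_reviews.json')` would then return only the rows that carry review text.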

### Peek into reviews data

In [8]:
reviews.show()

### Check for imbalance and balance it

In [9]:
for rating in range(1, 6):
    index = reviews['overall'] == rating
    print (str(rating) + ': Count = '+ str(sum(index)))

1: Count = 2791
2: Count = 3010
3: Count = 6789
4: Count = 16536
5: Count = 35580


In [10]:
reviews.show(view = 'Bar Chart', x = 'overall')

### Undersample to balance the reviews

In [11]:
reviews_balanced = pp.balance_data(reviews,5000)
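
A minimal sketch of undersampling, assuming the goal is to cap each rating class at a fixed number of rows. The custom `pp.balance_data` helper is not shown in this notebook and may behave differently; `undersample`, `label_key`, `max_per_class`, and `seed` are hypothetical names used only for illustration.

```python
import random
from collections import defaultdict

def undersample(rows, label_key, max_per_class, seed=0):
    """Cap every class at max_per_class rows by sampling without replacement.

    Classes that are already at or below the cap are left untouched.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    balanced = []
    for label in sorted(by_label):
        group = by_label[label]
        if len(group) > max_per_class:
            group = rng.sample(group, max_per_class)
        balanced.extend(group)
    return balanced

# Toy data: eight 5-star rows and three 1-star rows, capped at four per class
rows = [{"overall": 5.0} for _ in range(8)] + [{"overall": 1.0} for _ in range(3)]
balanced = undersample(rows, "overall", max_per_class=4)
```

Fixing the seed keeps the sample reproducible between notebook runs.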

### Look at balanced reviews

In [12]:
for rating in range(1, 6):
    idx = reviews_balanced['overall'] == rating
    print (str(rating) + ': Count = '+ str(sum(idx)))

1: Count = 5582
2: Count = 3010
3: Count = 5000
4: Count = 5000
5: Count = 5000


In [13]:
reviews_balanced.show(view = 'Bar Chart', x = 'overall')

### Visualize and explore reviews data

In [14]:
reviews_balanced.show(view = 'Bar Chart', x = 'asin')


## Step 2: Pre-processing
**Tokenize --> Remove stopwords --> Lemmatize**



### Preprocess reviews

In [15]:
reviews_balanced['processedReviews'] = reviews_balanced['reviewText'].apply(pp.preprocess_pipeline)
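
The tokenize, stopword-removal, and lemmatize steps can be sketched as follows. The custom `pp.preprocess_pipeline` is not shown in this notebook, so everything here is an assumption: the tiny `STOPWORDS` set and the `naive_lemma` plural-stripping rule are stand-ins for a full stopword list and a real lemmatizer.

```python
import re

# Tiny illustrative stopword list (assumption); a real pipeline would use a fuller one
STOPWORDS = {"a", "an", "the", "these", "this", "are", "is", "but",
             "not", "my", "of", "and", "to", "out"}

def naive_lemma(token):
    """Stand-in for a real lemmatizer: strips a trailing plural 's' only."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def preprocess(text):
    """Lowercase, tokenize, drop stopwords, and lemmatize a review."""
    # Drop apostrophes so "don't" becomes "dont", then keep alphanumeric runs
    cleaned = text.lower().replace("'", "")
    tokens = re.findall(r"[a-z0-9]+", cleaned)
    kept = [t for t in tokens if t not in STOPWORDS]
    return " ".join(naive_lemma(t) for t in kept)

example = preprocess("Young Buck flat out sucks.")
```

The function returns a single space-joined string, which is the shape the `processedReviews` column appears to have.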

## Step 3: Build N-Gram models
**Unigram --> Bigram --> Trigram**

### Build Unigram, Bigram and Trigram of reviews

In [16]:
reviews_balanced['Unigram'] =  gl.text_analytics.count_ngrams(reviews_balanced['processedReviews'], n=1)
reviews_balanced['Bigram']  =  gl.text_analytics.count_ngrams(reviews_balanced['processedReviews'], n=2)
reviews_balanced['Trigram'] =  gl.text_analytics.count_ngrams(reviews_balanced['processedReviews'], n=3)
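
To make the n-gram counts concrete, here is a small pure-Python sketch of word n-gram counting for a single review. It is not GraphLab's implementation; `count_word_ngrams` is a hypothetical name, and whitespace tokenization is an assumption.

```python
from collections import Counter

def count_word_ngrams(text, n):
    """Count contiguous word n-grams in a whitespace-tokenized string."""
    tokens = text.split()
    # Slide a window of n tokens across the review
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return dict(Counter(grams))

bigrams = count_word_ngrams("young buck flat suck", 2)
trigrams = count_word_ngrams("young buck flat suck", 3)
```

A review shorter than `n` tokens yields an empty dictionary, which is one reason very short reviews contribute little to the bigram and trigram models.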

#### Peek into N-Gram models

In [17]:
reviews_balanced

asin,helpful,overall,reviewText,reviewTime,reviewerID
B0021X515S,"[8, 31]",2.0,These guys are talented but this is not my type ...,"06 19, 2009",A2R8WPPAH1SEAQ
B00005NKYQ,"[3, 4]",3.0,The concept for this CD is high-concept femin ...,"05 12, 2005",AYOO12C9Y2T95
B00065BYAY,"[4, 7]",4.0,I rushed out to get Fantasia's cd in support ...,"01 30, 2005",A35JR4D6FLXYRQ
B000002NG7,"[1, 13]",2.0,When I first heard Faith No More was putting out ...,"12 30, 2004",A1DIFL0333QPEB
B0000TAZS8,"[4, 6]",2.0,Young Buck flat out sucks. 50 can be good or ...,"11 1, 2005",A39XGSL1OGDVDE
B00004TL26,"[0, 0]",5.0,"OMG, what did i tell yall people! Busta wont go ...","01 28, 2003",A3RCTN2UUW6EQ2
B00005NDZC,"[2, 2]",5.0,I bought this album for one song &quot;Hide ...,"02 11, 2003",AGG41GO9FT3FA
B00FAEQ22G,"[0, 0]",5.0,Just a good time tune that has some catchy ...,"01 21, 2014",A1R4HQIWADCTP0
B000BY8278,"[3, 4]",4.0,"As always, The Isley Bros. return as ...","05 12, 2006",A1QEWOSV05RYEO
B00005TPKC,"[1, 2]",4.0,"I own all of Alanis's studio albums, and ...","07 12, 2004",A20OAPE0RCEJ9P

reviewerName,summary,unixReviewTime,processedReviews
J. Casey,Doesn't resonate,1245369600,guy talented type music
"Greg Brady ""columbusboy""",Amos takes a look at how the other half writes ...,1115856000,concept cd high concept feminism amos cover 12 ...
Geminigirl,Free Yourself - 4.5 stars ...,1107043200,rushed get fantasia cd support disappointed ...
Marcus T. Brody,Don't let the title fool you ...,1104364800,first heard faith putting album called album year ...
nick to the wil,2 Stars for 2 Songs,1130803200,young buck flat suck
Zukester,Anarchy,1043712000,omg tell yall people
"Frederick A. Bristol ""Fabj"" ...",Glued to my CD player,1044921600,bought album one song quot hide u quot got ...
"Darrel Poppino ""Have a good day!"" ...","And, we'll never be royal",1390262400,good time tune catchy line song like headline ...
"Michael Brent Faulkner, Jr. ""Brent Faulkner"" ...","Exceptional Album, Isley's voice is a st ...",1147392000,always isley bros return consistent ever another ...
A Customer,Few can do it like Alanis,1089590400,alaniss studio album although least favorite ...

Unigram,Bigram,Trigram
"{'music': 1, 'guy': 1, 'type': 1, 'talented' ...","{'talented type': 1, 'type music': 1, 'guy ...","{'talented type music': 1, 'guy talented type': ..."
"{'concept': 2, 'woman': 1, 'attempt': 1, 'song': ...","{'men attempt': 1, 'something men': 1, ' ...","{'amos cover 12': 1, 'cover 12 song': 1, ..."
"{'rushed': 1, 'support': 1, 'get': 1, 'cd': 1, ...","{'get fantasia': 1, 'cd support': 1, 'fantasia ...","{'fantasia cd support': 1, 'cd support ..."
"{'album': 2, 'faith': 1, 'pumped': 1, 'heard': 1, ...","{'album called': 1, 'album year': 1, 'faith ...","{'called album year': 1, 'first heard faith': 1, ..."
"{'buck': 1, 'suck': 1, 'young': 1, 'flat': 1} ...","{'flat suck': 1, 'young buck': 1, 'buck flat' ...","{'young buck flat': 1, 'buck flat suck': 1} ..."
"{'omg': 1, 'tell': 1, 'yall': 1, 'people': 1} ...","{'yall people': 1, 'tell yall': 1, 'omg tell': 1} ...","{'omg tell yall': 1, 'tell yall people': 1} ..."
"{'album': 1, 'hide': 1, 'song': 1, 'quot': 2, ...","{'treat long': 1, 'quot hide': 1, 'u quot': 1, ...","{'got best treat': 1, 'one song quot': 1, ' ..."
"{'good': 1, 'like': 1, 'song': 1, 'headline' ...","{'song like': 1, 'headline review': 1, ...","{'good time tune': 1, 'time tune catchy': 1, ..."
"{'set': 1, 'love': 2, 'exceptional': 1, ...","{'making music': 1, 'always isley': 1, ...","{'always isley bros': 1, 'isley bros return': 1, ..."
"{'album': 4, 'good': 1, '21': 1, 'song': 1, ...","{'album although': 1, 'album 1': 1, '21 thi ...","{'album still good': 1, 'alaniss studio album': ..."


In [18]:
# Drop rows with any missing value from the balanced data
reviews_balanced = reviews_balanced.dropna()

## Step 4: Apply LDA (Latent Dirichlet Allocation)
**Create models using GraphLab LDA to identify topics**
## Step 5: Visualize Topics
**Visualize Top 5 most relevant words for each topic**

### Create Unigram Topic Model

In [19]:
unigram_topic_model = gl.topic_model.create(reviews_balanced['Unigram'], num_topics=10, num_iterations=200)
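
For intuition about what fitting an LDA model involves, below is a minimal collapsed Gibbs sampler, which is one common way to fit LDA. It is a teaching sketch and not GraphLab's trainer: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the token-list input format (rather than bag-of-words dictionaries) are all assumptions.

```python
import random

def lda_gibbs(docs, num_topics, num_iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA on a list of token lists with collapsed Gibbs sampling.

    Returns (doc_topic_counts, topic_word_counts, vocab).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * V for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(num_iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic
                weights = [
                    (doc_topic[d][t] + alpha) * (topic_word[t][v] + beta)
                    / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                # Draw a new topic in proportion to the weights
                r = rng.random() * sum(weights)
                k, acc = 0, weights[0]
                while r > acc and k < num_topics - 1:
                    k += 1
                    acc += weights[k]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    return doc_topic, topic_word, vocab

toy_docs = [["rap", "hip", "hop", "rap"], ["album", "song", "album"],
            ["rap", "hop"], ["song", "album", "great"]]
doc_topic, topic_word, vocab = lda_gibbs(toy_docs, num_topics=2, num_iterations=20)
```

Each sweep reassigns every token once, so the cost grows with the total token count times the number of topics, which helps explain why larger corpora and the bigger bigram and trigram vocabularies are more expensive to model.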

### Unigram: Top 5 most relevant words for each topic

In [20]:
unigram_topic_model.get_topics().print_rows(100)

+-------+---------+------------------+
| topic |   word  |      score       |
+-------+---------+------------------+
|   0   |    cd   | 0.0578316217822  |
|   0   |   fan   | 0.0334037560094  |
|   0   |  first  | 0.0307929153187  |
|   0   |  album  | 0.0271317364191  |
|   0   |  heard  |  0.027101726756  |
|   1   |  album  | 0.0738048342734  |
|   1   |   song  | 0.0664042468882  |
|   1   |  great  | 0.0328976504592  |
|   1   |   time  | 0.0195532842292  |
|   1   |   good  | 0.0189414246422  |
|   2   |   rap   | 0.0225101713448  |
|   2   |  album  | 0.0181807585353  |
|   2   |   hop   | 0.0147116777585  |
|   2   |  rapper | 0.0147116777585  |
|   2   |   hip   |  0.014656172466  |
|   3   |  music  | 0.0435222064928  |
|   3   |   one   | 0.0233672049332  |
|   3   |  artist | 0.0211477553567  |
|   3   |   like  | 0.0140695107613  |
|   3   |   new   | 0.0103804256544  |
|   4   |   dont  | 0.0310474881308  |
|   4   |   like  | 0.0302677632871  |
|   4   |   get   | 0.025
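
A listing like the one above can be derived from per-topic word counts. The sketch below assumes the score is a smoothed probability of a word within its topic; that interpretation, the function name `top_words`, and the `beta` smoothing parameter are assumptions and not taken from GraphLab.

```python
def top_words(topic_word_counts, vocab, k=5, beta=0.01):
    """Return (topic, word, score) rows for the k highest-scoring words per topic."""
    rows = []
    for topic, counts in enumerate(topic_word_counts):
        total = sum(counts) + beta * len(vocab)
        scored = [((c + beta) / total, w) for c, w in zip(counts, vocab)]
        # Highest score first; ties broken alphabetically for stable output
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        rows.extend((topic, w, s) for s, w in scored[:k])
    return rows

# Toy counts for two topics over a three-word vocabulary
toy_vocab = ["album", "cd", "rap"]
toy_counts = [[3, 6, 1], [2, 0, 8]]
toy_rows = top_words(toy_counts, toy_vocab, k=2, beta=0.0)
```

With `beta=0.0` the scores reduce to plain relative frequencies within each topic.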

### Visualize Unigram Model

In [21]:
pyLDAvis.graphlab.prepare(unigram_topic_model, reviews_balanced['Unigram'])

### Create Bigram Topic Model

In [22]:
bigram_topic_model = gl.topic_model.create(reviews_balanced['Bigram'], num_topics=10, num_iterations=200)

### Bigram: Top 5 most relevant words for each topic

In [23]:
bigram_topic_model.get_topics().print_rows(100)

+-------+-----------------+------------------+
| topic |       word      |      score       |
+-------+-----------------+------------------+
|   0   |      jay z      | 0.00497376187878 |
|   0   |     year ago    | 0.00217193921941 |
|   0   |    new album    | 0.00208767387627 |
|   0   |    album good   | 0.00198234219734 |
|   0   |   studio album  |  0.001624214489  |
|   1   |    good song    | 0.00279005808603 |
|   1   |    long time    | 0.00272717172797 |
|   1   |   pretty good   | 0.00226600510218 |
|   1   |      wan na     | 0.00178387635703 |
|   1   |    sound like   | 0.00151136880543 |
|   2   |       r b       | 0.00318759339905 |
|   2   |    sound like   |  0.001990375221  |
|   2   |   second album  | 0.0015200395082  |
|   2   |   debut album   | 0.00147728171613 |
|   2   |    album came   | 0.00130625054783 |
|   3   |    quot quot    | 0.00642113687658 |
|   3   |    elton john   | 0.00265609987429 |
|   3   |   one favorite  | 0.00232691631125 |
|   3   |    

### Visualize Bigram Model

In [24]:
pyLDAvis.graphlab.prepare(bigram_topic_model, reviews_balanced['Bigram'])

### Create Trigram Topic Model

In [25]:
trigram_topic_model = gl.topic_model.create(reviews_balanced['Trigram'], num_topics=10, num_iterations=200)

### Trigram: Top 5 most relevant words for each topic

In [26]:
trigram_topic_model.get_topics().print_rows(100)

+-------+---------------------+-------------------+
| topic |         word        |       score       |
+-------+---------------------+-------------------+
|   0   |   let start saying  | 0.000611192887365 |
|   0   |    notorious b g    | 0.000414667843196 |
|   0   |       r amp b       | 0.000336057825529 |
|   0   |   breath fresh air  | 0.000336057825529 |
|   0   |    hip hop album    | 0.000296752816695 |
|   1   |    red hot chili    | 0.000589651517911 |
|   1   |      im gon na      | 0.000354574500804 |
|   1   |    one hit wonder   | 0.000334984749378 |
|   1   |  one favorite album | 0.000334984749378 |
|   1   |   hard knock life   | 0.000276215495101 |
|   2   |    hip hop album    | 0.000356973104739 |
|   2   |  greatest hit album | 0.000337250833759 |
|   2   |    one best album   | 0.000258361749838 |
|   2   |      n da hood      | 0.000258361749838 |
|   2   |     album jay z     | 0.000238639478858 |
|   3   |  self titled debut  | 0.000590414879239 |
|   3   |   

### Visualize Trigram Model

In [27]:
pyLDAvis.graphlab.prepare(trigram_topic_model, reviews_balanced['Trigram'])

## Challenges:
**Handling large datasets (> 1 million reviews):**  
**- pyLDAvis could not render visualizations of the large dataset in the IPython notebook.**  
**- Models could not be built on the full dataset because of my computer's limited processing power.**  

## Conclusion:
**- Imbalanced review data biased the topic model toward topics drawn from positive reviews.**  
**- Despite their larger feature space, the bigram and trigram models identified topics in both positive and negative reviews more clearly.**