# Basic Topic Models

In this notebook I take a preliminary pass at fitting a basic topic model to some review data. The reviews are from [here](http://times.cs.uiuc.edu/~wang296/Data/), and I've focused solely on the TVs category, for no other reason than that it was the first one on the list; this is really just about testing some ideas out before moving on to a fuller analysis. The data consist of approximately 243k reviews.

I've fit four topic models to these data, one each with 10, 25, 50, and 100 topics. I've done no model comparison to determine which is best (though such a thing is possible), and have done very little tuning that might improve them (e.g. I should probably remove more of the very common words, such as "tv").
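As a rough sketch of how a very common word such as "tv" could be dropped, one option is to extend scikit-learn's built-in English stop-word list before building the vectorizer. The extra word set, the toy documents, and the names `extra_stop_words` and `cv_custom` are illustrative assumptions, not part of the analysis in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Hypothetical extra stop words; which words to drop is a judgment call
extra_stop_words = {'tv'}
stop_words = list(ENGLISH_STOP_WORDS.union(extra_stop_words))

# Toy documents standing in for df_revs.Content
docs = [
    'great tv with a sharp picture',
    'the tv remote stopped working',
    'picture quality is great',
]

cv_custom = CountVectorizer(stop_words=stop_words)
X_toy = cv_custom.fit_transform(docs)

# vocabulary_ maps each kept term to its column index
vocab_toy = set(cv_custom.vocabulary_)
print(sorted(vocab_toy))
```

The same `stop_words` list could be passed alongside `max_df` and `min_df`; whether removing "tv" actually improves the topics would need to be checked by refitting.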

In principle, we could ask how the topics from these models vary with the product ratings. We could also incorporate the ratings directly into the model estimation (e.g. [supervised LDA](https://arxiv.org/pdf/1003.0783.pdf)).
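One simple way to look at topics against ratings, sketched here on toy numbers, is to average each document's topic proportions within each rating level. The helper `mean_topics_by_rating` and the toy arrays are hypothetical; with a fitted `lda` model the proportions would come from `model.doc_topic_`, and the ratings would come from whichever review field holds them (not shown in this notebook).

```python
import numpy as np

def mean_topics_by_rating(doc_topic, ratings):
    """Average each topic's weight over the documents sharing a rating."""
    doc_topic = np.asarray(doc_topic, dtype=float)
    ratings = np.asarray(ratings)
    return {
        r.item(): doc_topic[ratings == r].mean(axis=0)
        for r in np.unique(ratings)
    }

# Toy stand-ins (made-up values): 4 documents, 2 topics
toy_doc_topic = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.4, 0.6]]
toy_ratings = [1, 1, 5, 5]

by_rating = mean_topics_by_rating(toy_doc_topic, toy_ratings)
for rating, topic_means in sorted(by_rating.items()):
    print(rating, topic_means)
```

In this toy case the first topic dominates the 1-star documents and the second dominates the 5-star ones. This is only a descriptive summary; it does not use the ratings during estimation the way a supervised model would.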

In [62]:
import json
from pprint import pprint
import glob
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import lda
import numpy as np

In [87]:
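def print_topics(model, vocab, n_top_words=10):
    """Return one 'Topic i: w1 - w2 - ...' string per topic, highest-weight words first."""
    vocab = np.array(vocab)
    topics = []
    for i, topic_dist in enumerate(model.topic_word_):
        # argsort is ascending, so the reversed slice takes the n_top_words largest weights
        topic_words = vocab[np.argsort(topic_dist)][:-(n_top_words+1):-1]
        topics.append('Topic {}: {}'.format(i, ' - '.join(topic_words)))

    return topics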
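The slicing in `print_topics` is compact, so here is a small worked example on a made-up five-word vocabulary and topic distribution (both hypothetical). It shows that sorting the indices by weight and then slicing backwards from the end yields the highest-weight words in descending order.

```python
import numpy as np

# Made-up vocabulary and word weights for a single topic
vocab = np.array(['picture', 'remote', 'sound', 'stand', 'wall'])
topic_dist = np.array([0.40, 0.05, 0.30, 0.10, 0.15])
n_top_words = 3

# Indices ordered from smallest to largest weight
order = np.argsort(topic_dist)

# Walk backwards from the last (largest) entry, taking n_top_words items
top_words = vocab[order][:-(n_top_words + 1):-1]
print(list(top_words))  # ['picture', 'sound', 'wall']

# An equivalent, arguably more readable, spelling
top_words_alt = vocab[order[::-1][:n_top_words]]
```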

# Read the data

In [33]:
files = glob.glob('../data/AmazonReviews/TVs/*.json') # data from http://times.cs.uiuc.edu/~wang296/Data/
prods, revs = [], []
for path in files:
    with open(path) as data_file:
        f = json.load(data_file)
    try:
        temp_prods = pd.json_normalize(f['ProductInfo'])
        temp_revs = pd.json_normalize(f['Reviews'])
    except Exception:
        # Skip files that cannot be normalized; otherwise the previous
        # file's frames would be appended a second time
        continue
    prods.append(temp_prods)
    revs.append(temp_revs)

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df_prods = pd.concat(prods)
df_revs = pd.concat(revs)

df_revs.dropna(axis=0, subset=['Content'], inplace=True)

# Create a document-term matrix

In [92]:
cv = CountVectorizer(stop_words='english', max_df=.95, min_df=5)
X = cv.fit_transform(df_revs.Content)
X

<242638x33870 sparse matrix of type '<type 'numpy.int64'>'
	with 11445523 stored elements in Compressed Sparse Row format>
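The figures in that summary line can be turned into two more readable numbers: how sparse the matrix is, and how many distinct vocabulary terms a typical review contains. The arithmetic below uses only the values printed above and assumes every stored element is a nonzero count.

```python
# Values copied from the sparse-matrix summary above
n_docs, n_terms = 242638, 33870
n_stored = 11445523

# Fraction of document-term cells that hold a nonzero count
density = float(n_stored) / (n_docs * n_terms)

# Average number of distinct vocabulary terms per review
terms_per_doc = float(n_stored) / n_docs

print('density: {:.4%}'.format(density))
print('distinct terms per review: {:.1f}'.format(terms_per_doc))
```

So roughly 0.14% of the cells are filled, and a review uses about 47 distinct terms from the 33,870-word vocabulary on average.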

# Define & fit the models

In [76]:
model_10 = lda.LDA(n_topics=10, n_iter=1500, random_state=1)
model_25 = lda.LDA(n_topics=25, n_iter=1500, random_state=1)
model_50 = lda.LDA(n_topics=50, n_iter=1500, random_state=1)
model_100 = lda.LDA(n_topics=100, n_iter=1500, random_state=1)
model_10.fit(X)
model_25.fit(X)
model_50.fit(X)
model_100.fit(X)

<lda.lda.LDA instance at 0x14788e878>
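The four near-identical definitions above could also be written as a loop that stores the fitted models in a dictionary keyed by topic count. This is only a sketch: `fit_models` and `make_model` are hypothetical names, and the `lda.LDA` call appears in a comment so the block does not depend on the package being installed.

```python
def fit_models(X, topic_counts, make_model):
    """Fit one model per topic count and return them keyed by that count.

    make_model is any callable that takes a topic count and returns an
    unfitted object with a fit(X) method.
    """
    models = {}
    for k in topic_counts:
        model = make_model(k)
        model.fit(X)
        models[k] = model
    return models

# Intended use with the settings from the cell above:
# models = fit_models(
#     X, [10, 25, 50, 100],
#     lambda k: lda.LDA(n_topics=k, n_iter=1500, random_state=1))
# print_topics(models[10], cv.get_feature_names())
```

Keeping the models in one dictionary makes it easier to add or drop a topic count without editing several lines in step.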

## Top 10 words from each topic in each topic model

In [88]:
print_topics(model_10, cv.get_feature_names())

['Topic 0: tv - remote - wall - stand - mount - sound - picture - good - like - just',
 'Topic 1: tv - 3d - picture - like - just - great - glasses - really - movies - ray',
 'Topic 2: tv - sound - hdmi - cable - picture - audio - hd - set - speakers - use',
 'Topic 3: tv - great - picture - sound - good - quality - easy - price - love - set',
 'Topic 4: tv - smart - netflix - remote - apps - use - app - amazon - internet - samsung',
 'Topic 5: tv - picture - screen - settings - set - black - color - like - plasma - dark',
 'Topic 6: monitor - 4k - computer - display - swfparams - resolution - video - var - hdmi - pc',
 'Topic 7: tv - samsung - problem - service - vizio - customer - just - warranty - new - called',
 'Topic 8: tv - amazon - delivery - set - box - great - shipping - price - time - did',
 'Topic 9: tv - picture - samsung - quality - sony - price - best - lcd - led - better']

In [89]:
print_topics(model_25, cv.get_feature_names())

['Topic 0: remote - tv - set - button - screen - channel - use - menu - control - turn',
 'Topic 1: tv - service - customer - samsung - called - problem - support - told - amazon - said',
 'Topic 2: tv - samsung - warranty - months - buy - year - years - bought - problem - just',
 'Topic 3: 3d - tv - glasses - ray - blu - movies - 2d - movie - player - picture',
 'Topic 4: screen - tv - picture - room - viewing - set - good - like - light - quality',
 'Topic 5: settings - picture - color - mode - setting - set - tv - contrast - brightness - adjust',
 'Topic 6: sound - tv - speakers - picture - quality - good - great - audio - bar - volume',
 'Topic 7: tv - smart - apps - samsung - remote - use - keyboard - netflix - app - internet',
 'Topic 8: tv - great - picture - room - good - bedroom - bought - perfect - size - love',
 'Topic 9: tv - inch - vizio - samsung - picture - led - lg - hdtv - 32 - quality',
 'Topic 10: monitor - 4k - computer - display - resolution - pc - hdmi - use - scr

In [90]:
print_topics(model_50, cv.get_feature_names())

['Topic 0: remote - tv - button - control - buttons - use - like - menu - channel - screen',
 'Topic 1: ray - blu - player - tv - dvd - hd - picture - blue - movies - set',
 'Topic 2: tv - keyboard - apps - remote - web - smart - browser - use - app - internet',
 'Topic 3: support - manual - product - information - does - amazon - samsung - set - tv - problem',
 'Topic 4: service - tv - customer - samsung - called - told - support - problem - said - tech',
 'Topic 5: tv - turn - problem - power - time - just - issue - times - seconds - turned',
 'Topic 6: vizio - tv - remote - picture - smart - new - just - good - purchased - bought',
 'Topic 7: tv - old - picture - samsung - sharp - lcd - years - inch - year - bought',
 'Topic 8: tv - remote - receiver - box - sound - audio - cable - samsung - hdmi - use',
 'Topic 9: screen - tv - black - dark - light - issue - scenes - like - white - issues',
 'Topic 10: viewing - set - color - image - panel - angle - quality - contrast - lcd - displ

In [91]:
print_topics(model_100, cv.get_feature_names())

['Topic 0: sony - tv - bravia - picture - samsung - quality - better - lcd - xbr - best',
 'Topic 1: hdmi - cable - cables - box - ports - port - need - plug - use - using',
 'Topic 2: tv - just - amazing - blu - away - picture - movie - ray - like - watching',
 'Topic 3: just - like - thing - time - home - life - say - new - house - know',
 'Topic 4: channels - cable - antenna - tv - channel - digital - hd - tuner - air - box',
 'Topic 5: time - just - ll - like - know - way - did - ve - simply - point',
 'Topic 6: support - tv - problem - tech - called - work - vizio - did - help - said',
 'Topic 7: tv - samsung - use - camera - smart - features - skype - internet - feature - set',
 'Topic 8: unit - toshiba - picture - good - quality - tcl - sound - better - set - price',
 'Topic 9: color - colors - picture - black - blacks - look - contrast - dark - bright - settings',
 'Topic 10: delivery - amazon - tv - delivered - service - set - time - shipping - box - day',
 'Topic 11: panasoni