<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

# Topic Modeling
## *Data Science Unit 4 Sprint 1 Assignment 4*

Analyze a corpus of Amazon reviews from Unit 4 Sprint 1 Module 1's lecture using topic modeling: 

- Fit a Gensim LDA topic model on Amazon Reviews
- Select an appropriate number of topics
- Create some dope visualizations of the topics
- Write a few bullets on your findings in markdown at the end
- **Note**: You don't *have* to use generators for this assignment

In [16]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

In [17]:
#Start Here
df = pd.read_csv('data/imbd_keywords.csv')
df

| Unnamed: 0 | review | sentiment | keywords |
| --- | --- | --- | --- |
| 0 | One of the other reviewers has mentioned that ... | positive | ['other shows', 'graphic violence', 'prison ex... |
| 1 | A wonderful little production. The filming tec... | positive | ['halliwell', 'michael sheen', 'realism', 'com... |
| 2 | I thought this was a wonderful way to spend ti... | positive | ['spirited young woman', 'devil wears prada', ... |
| 3 | Basically there's a family where a little boy ... | negative | ['playing parents', 'jake', 'parents', 'descen... |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive | ['mr. mattei', 'good luck', 'mattei', 'human r... |
| ... | ... | ... | ... |
| 40431 | Wow. We watched this film in the hopes that it... | positive | ['cheesy movies', 'rock climbing', 'at least s... |
| 40432 | Bored Londoners Henry Kendall and Joan Barry (... | negative | ['bored londoners henry kendall', 'henry kenda... |
| 40433 | An imagination is a terrible thing to waste ..... | positive | ['jones', 'talented actors', 'cross eyed', 'hu... |
| 40434 | I was just lucky I found this movie. I've been... | positive | ['more people', 'many friends', 'emilio esteve... |


In [23]:
v1 = TfidfVectorizer(stop_words="english")
X_train = v1.fit_transform(df['review'])
y_train = df['sentiment']
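
To make the vectorizer step less of a black box, here is a rough pure-Python sketch of the weighting `TfidfVectorizer` applies. It assumes scikit-learn's documented defaults (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2 normalization per document). The function name `tfidf_rows` and the toy reviews are made up for illustration, and tokenization and stop-word removal are skipped.

```python
import math
from collections import Counter


def tfidf_rows(tokenized_docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    # Smoothed idf (assumed to match scikit-learn's default settings)
    idf = {term: math.log((1 + n_docs) / (1 + freq)) + 1
           for term, freq in doc_freq.items()}
    rows = []
    for doc in tokenized_docs:
        raw = {term: count * idf[term] for term, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({term: w / norm for term, w in raw.items()})
    return rows


# Hypothetical, already-tokenized toy reviews
toy_reviews = [["good", "film"], ["bad", "film"], ["good", "plot"]]
weights = tfidf_rows(toy_reviews)
print(weights[1])
```

In the second toy review, "bad" appears in fewer documents than "film", so it ends up with the larger weight.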

In [24]:
p1 = {
    'n_estimators':[10,20],
    'max_depth':[None, 7]
}

In [25]:
clf = RandomForestClassifier()
gs1 = GridSearchCV(clf, p1, cv=5, n_jobs=-1, verbose=1)
gs1.fit(X_train, y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:   40.0s finished


GridSearchCV(cv=5, estimator=RandomForestClassifier(), n_jobs=-1,
             param_grid={'max_depth': [None, 7], 'n_estimators': [10, 20]},
             verbose=1)
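
The "4 candidates, totalling 20 fits" line in the log above can be checked by hand. A small sketch, using only the standard library, that enumerates the same grid (the names `cv_folds`, `candidates`, and `total_fits` are illustrative):

```python
from itertools import product

# Same grid as p1 above
p1 = {
    'n_estimators': [10, 20],
    'max_depth': [None, 7],
}
cv_folds = 5

# Every combination of one value per parameter is one candidate
candidates = [dict(zip(p1, combo)) for combo in product(*p1.values())]
total_fits = len(candidates) * cv_folds

for candidate in candidates:
    print(candidate)
print(f"{len(candidates)} candidates x {cv_folds} folds = {total_fits} fits")
```

With the default `refit=True`, `GridSearchCV` also trains the best candidate once more on the full training set after the search, which the log line does not count.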

In [None]:
#gs1.predict(["Sample text"])

In [26]:
test_sample = v1.transform(["Sample Text"])
test_sample.shape

(1, 94716)

In [30]:
pred = gs1.predict(test_sample)
pred

array(['positive'], dtype=object)

In [32]:
#df['sentiment'][pred[0]]

In [48]:
# GridSearch with BOTH the vectorizer & classifier
from sklearn.pipeline import Pipeline

# 0. create classifier and vectorizer objects
vect = TfidfVectorizer()
clf = RandomForestClassifier()

# 1. Create a pipeline with a vectorize and classifier
pipe = Pipeline([
            ('vect', vect),
            ('clf', clf)
        ])

params = {
    'vect__max_features': [7000, 11000, 15000],
    'clf__n_estimators': [40, 60, 80],
    'clf__max_depth': [None]
}

# 2. Use Grid Search to optimize the entire pipeline
gs2 = GridSearchCV(pipe, params, cv=3, n_jobs=-1, verbose=1)
gs2.fit(df['review'], y_train)

GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('vect', TfidfVectorizer()),
                                       ('clf', RandomForestClassifier())]),
             n_jobs=-1,
             param_grid={'clf__max_depth': [None],
                         'clf__n_estimators': [40, 60, 80],
                         'vect__max_features': [7000, 11000, 15000]})
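
The keys in `params` follow scikit-learn's `<step name>__<parameter>` convention, which is how one grid can tune both the vectorizer and the classifier. The sketch below only illustrates that naming scheme by splitting the keys by hand; it is not how scikit-learn implements it internally, and `routed` is a made-up name.

```python
params = {
    'vect__max_features': [7000, 11000, 15000],
    'clf__n_estimators': [40, 60, 80],
    'clf__max_depth': [None],
}

# Group each parameter under the pipeline step it belongs to
routed = {}
for key, values in params.items():
    step, _, param = key.partition('__')
    routed.setdefault(step, {})[param] = values

for step, step_params in routed.items():
    print(step, '->', step_params)
```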

In [42]:
pred = gs2.predict(['sample text'])

In [49]:
gs2.best_score_, gs2.best_params_

(0.8422692372960001,
 {'clf__max_depth': None,
  'clf__n_estimators': 80,
  'vect__max_features': 15000})

In [54]:
import re
X_train = df['review'].apply(lambda x: x.strip())
X_train = X_train.apply(lambda x: re.sub(r'From: \S+@\S+', '', x))  # raw string avoids invalid-escape warnings
X_train

0        One of the other reviewers has mentioned that ...
1        A wonderful little production. The filming tec...
2        I thought this was a wonderful way to spend ti...
3        Basically there's a family where a little boy ...
4        Petter Mattei's "Love in the Time of Money" is...
                               ...                        
40431    Wow. We watched this film in the hopes that it...
40432    Bored Londoners Henry Kendall and Joan Barry (...
40433    An imagination is a terrible thing to waste .....
40434    I was just lucky I found this movie. I've been...
40435    When tradition dictates that an artist must pa...
Name: review, Length: 40436, dtype: object
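
The `From: ...` pattern appears to target email-style headers, so it is worth checking what it actually does to review text. A quick sketch with two made-up strings (the address and sentences are placeholders, not rows from the dataset):

```python
import re

pattern = r'From: \S+@\S+'

# Hypothetical inputs: one with an email-style header, one plain review sentence
header_line = "From: someone@example.com Subject: hello"
review_line = "A wonderful little production."

cleaned_header = re.sub(pattern, '', header_line)
cleaned_review = re.sub(pattern, '', review_line)

print(repr(cleaned_header))
print(repr(cleaned_review))
```

Text without a `From: user@host` prefix passes through unchanged, which is consistent with the output above looking identical to the raw reviews.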

In [56]:
from pandarallel import pandarallel #version 1.4.8
pandarallel.initialize(progress_bar=True, nb_workers=4)

INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.


In [61]:
import spacy
nlp = spacy.load('en_core_web_lg')

In [59]:
def tokenize(doc):
    """Return the lemmas of a text, skipping stop words and punctuation."""
    return [token.lemma_ for token in nlp(doc) if not token.is_stop and not token.is_punct]

In [62]:
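# Create 'lemmas' column. X_train is a Series at this point, so promote it to a DataFrame first.
if isinstance(X_train, pd.Series):
    X_train = X_train.to_frame(name='review')
X_train['lemmas'] = df['review'].apply(lambda x: [token.lemma_ for token in nlp(x) if (token.is_stop != True) and (token.is_punct != True)])
#X_train['lemmas'] = df['review'].parallel_apply(lambda x: tokenize(x))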

In [65]:
from gensim import corpora

In [67]:
# Create Dictionary
id2word = corpora.Dictionary(X_train['lemmas'])

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in X_train['lemmas']]
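
For intuition about what the dictionary and the bag-of-words corpus contain, here is a simplified stand-in written with `collections.Counter`. The helper names `build_vocab` and `to_bow` are hypothetical, and the id assignment order is an assumption chosen for illustration. The point is the output shape: each document becomes a list of `(token_id, count)` pairs.

```python
from collections import Counter


def build_vocab(docs):
    """Map each distinct token to an integer id."""
    token2id = {}
    for doc in docs:
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(doc, token2id):
    """Convert a token list to sorted (token_id, count) pairs."""
    counts = Counter(token for token in doc if token in token2id)
    return sorted((token2id[token], n) for token, n in counts.items())


# Hypothetical lemma lists standing in for X_train['lemmas']
toy_docs = [["film", "great", "film"], ["great", "actor", "scene"]]
toy_vocab = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_docs]

print(toy_vocab)
print(toy_corpus)
```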

In [77]:
#Human readable format of corpus
[(id2word[word_id], word_count) for word_id, word_count in corpus[1]]

[('get', 1),
 ('give', 1),
 ('guard', 1),
 ('home', 1),
 ('scene', 1),
 ('set', 1),
 ('thing', 1),
 ('use', 1),
 ('BBC', 1),
 ('Halliwell', 2),
 ('Michael', 1),
 ('Orton', 1),
 ('Sheen', 1),
 ('Williams', 1),
 ('actor', 1),
 ('chosen-', 1),
 ('come', 1),
 ('comedy', 1),
 ('comforting', 1),
 ('concern', 1),
 ('decorate', 1),
 ('diary', 1),
 ('disappear', 1),
 ('discomforting', 1),
 ('dream', 1),
 ('editing', 1),
 ('entire', 1),
 ('entry', 1),
 ('extremely', 1),
 ('fantasy', 1),
 ('fashion', 1),
 ('filming', 1),
 ('flat', 1),
 ('great', 1),
 ('guide', 1),
 ('knowledge', 1),
 ('life', 1),
 ('little', 2),
 ('master', 1),
 ('masterful', 1),
 ('mural', 1),
 ('old', 1),
 ('particularly', 2),
 ('pat', 1),
 ('perform', 1),
 ('piece', 2),
 ('play', 1),
 ('polari', 1),
 ('production', 2),
 ('realism', 2),
 ('reference', 1),
 ('remain', 1),
 ('seamless', 1),
 ('sense', 2),
 ('solid', 1),
 ('surface', 1),
 ('technique', 2),
 ('terribly', 1),
 ('terrificly', 1),
 ('time', 1),
 ('traditional', 1),
 (

In [86]:
import gensim  # needed for the gensim.models.* references below
from gensim import models

In [None]:
# Single-core LDA; the keyword arguments need commas between them
lda_model = models.ldamodel.LdaModel(corpus=corpus,
                                     id2word=id2word,
                                     num_topics=20,
                                     chunksize=100,
                                     passes=10,
                                     per_word_topics=True)
lda_model.save('lda_model.model')

In [82]:
## Part 2: Estimate an LDA model with Gensim
lda_multicore = models.ldamulticore.LdaMulticore(corpus=corpus,
                                                 id2word=id2word,
                                                 num_topics=20,
                                                 chunksize=100,
                                                 passes=10,
                                                 per_word_topics=True,
                                                 workers=12)
lda_multicore.save('lda_multicore.model')

In [90]:
from gensim import models
from gensim.models import CoherenceModel
lda_multicore = models.LdaModel.load('lda_multicore.model')

In [91]:
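# Compute Perplexity
# log_perplexity returns the per-word likelihood bound; gensim reports perplexity as 2**(-bound),
# so a higher (less negative) bound means lower, i.e. better, perplexity.
print('\nPerplexity: ', lda_multicore.log_perplexity(corpus))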

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_multicore,
                                    texts=X_train['lemmas'],
                                    dictionary=id2word,
                                    coherence='c_v')

coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -10.037266845885666

Coherence Score:  0.3720056927467681
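
The value labelled "Perplexity" above is the per-word likelihood bound that `log_perplexity` returns, not the perplexity itself. Gensim logs perplexity as `2 ** (-bound)`, so the conversion can be sketched as follows (the variable names are illustrative; the bound is copied from the output above):

```python
# Per-word likelihood bound printed by log_perplexity(corpus) above
bound = -10.037266845885666

# Gensim reports perplexity as 2 ** (-bound); lower perplexity is better
perplexity = 2 ** (-bound)

print(f"per-word bound: {bound:.3f}")
print(f"perplexity: {perplexity:.1f}")
```

When comparing models on the same corpus, the one with the higher bound has the lower perplexity.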


In [96]:
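# Part 3: Interpret LDA results & select the appropriate number of topics
import pyLDAvis
# Importing the submodule this way avoids shadowing the top-level gensim package.
# In pyLDAvis 3.3 and later the module is named pyLDAvis.gensim_models.
import pyLDAvis.gensim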

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_multicore, corpus, id2word)
vis



In [101]:
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """Fit one LDA model per topic count in range(start, limit, step) and score each with c_v coherence."""
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        # Use the dictionary argument rather than the global id2word
        model = models.ldamulticore.LdaMulticore(corpus=corpus,
                                                 id2word=dictionary,
                                                 num_topics=num_topics,
                                                 chunksize=100,
                                                 passes=10,
                                                 per_word_topics=True,
                                                 workers=12)

        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values



In [None]:
model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=X_train['lemmas'], start=2, limit=40, step=6)



In [None]:
coherence_values2 = [0.5054, 0.5332, 0.5452, 0.564, 0.5678, 0.5518, 0.519]

In [None]:
import matplotlib.pyplot as plt

limit = 40
start = 2
step = 6
x = range(start, limit, step)
plt.plot(x, coherence_values2)
plt.xlabel("num topics")
plt.ylabel('coherence score')
# The trailing comma makes this a one-element tuple rather than a plain string
plt.legend(('coherence_values',), loc='best')
plt.show()

In [None]:
for m, cv in zip(x, coherence_values2):
    print("Num topics =", m, 'has Coherence value of', round(cv,4))
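
Rather than reading the best topic count off the plot, it can be picked programmatically. A minimal sketch using the `coherence_values2` list from above (`topic_counts`, `best_index`, and `best_num_topics` are illustrative names):

```python
coherence_values2 = [0.5054, 0.5332, 0.5452, 0.564, 0.5678, 0.5518, 0.519]

# Topic counts tried: start=2, limit=40, step=6
topic_counts = list(range(2, 40, 6))

# Position of the highest coherence score
best_index = max(range(len(coherence_values2)), key=coherence_values2.__getitem__)
best_num_topics = topic_counts[best_index]

print("Best index:", best_index)
print("Best number of topics:", best_num_topics)
```

`best_index` is the position to use when selecting from `model_list`, provided the models were fit in the same order as the scores.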

In [None]:
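#Select model and print topics
from pprint import pprint

# Index 4 is the 26-topic model (start=2, step=6), the highest score in coherence_values2
optimal_model = model_list[4]
#optimal_model = models.LdaModel.load('optimal_model.model')
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=10))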
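
`show_topics(formatted=False)` returns a list of `(topic_id, [(word, probability), ...])` pairs, which is easier to work with than the formatted strings. The sketch below summarizes a structure of that shape; `example_topics` is made-up stand-in data, not output from the model above, and `top_words` is a hypothetical helper.

```python
# Made-up stand-in with the same shape as show_topics(formatted=False)
example_topics = [
    (0, [("film", 0.031), ("story", 0.012), ("character", 0.010)]),
    (1, [("horror", 0.025), ("scene", 0.014), ("blood", 0.009)]),
]


def top_words(topics, n=2):
    """Return {topic_id: [first n words]} for each topic."""
    return {
        topic_id: [word for word, _ in word_probs[:n]]
        for topic_id, word_probs in topics
    }


summary = top_words(example_topics)
for topic_id, words in summary.items():
    print(f"Topic {topic_id}: {', '.join(words)}")
```

The same helper could be applied to `model_topics` to get a compact list of keywords per topic for the findings write-up.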

## Stretch Goals

* Incorporate Named Entity Recognition in your analysis
* Incorporate some custom pre-processing from our previous lessons (like spaCy lemmatization)
* Analyze a dataset of interest to you with topic modeling