In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDA
import pyLDAvis
from pyLDAvis import sklearn as sklearn_lda
import warnings

warnings.filterwarnings('ignore')

# <span style="color:red"> Topic Classification of Wine Reviews: A Worked Example </span>

## Data Skills for Empirical Research

### Winter 2021

# <span style="color:red"> Overview </span>

## <span style="color:red"> Goals </span>
<center><img src="../figures/wine_bottle.jpg" width="20%" style="border:5px solid #000000"/></center>

* We want to measure how substitutable wines are for online retailers using text reviews: if two wines are substitutes, we expect their reviews to be similar
* To measure similarity, we need to know which broad topics each review discusses; ideally, we also want a quantitative measure of the distance between topics
* Since the corpus of reviews is quite large, labeling reviews by hand is impractical, so we want an unsupervised machine learning algorithm

## <span style="color:red"> Solution: Latent Dirichlet Allocation (LDA) </span>

* Latent Dirichlet Allocation (LDA) is a topic model that treats each document as a mixture of underlying topics; a document's topic mixture is inferred from the words it contains
* Under a bag-of-words assumption (word order is ignored), LDA represents each latent topic as a multinomial distribution over the vocabulary of the corpus; these latent topics are estimated with an unsupervised Bayesian learning algorithm
* Advantages

    * Ease of use with the scikit-learn package
    * Relatively low cost to implement as an unsupervised algorithm - no PoS tagging necessary, for example
    
* Disadvantages

    * Validation of topics must be performed manually
    * Metrics of model fit do not necessarily indicate overall model quality (but other metrics can be used, e.g. Jensen-Shannon Divergence, topic coherence)
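
The Jensen-Shannon divergence mentioned above can also serve as the quantitative distance between the topic mixtures of two reviews. The following is a minimal sketch under stated assumptions: the helper `js_divergence` and the three topic mixtures are hypothetical and are not part of the original project.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)  # midpoint distribution

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing to the sum
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical topic mixtures for three reviews (each sums to 1).
review_a = [0.90, 0.05, 0.05]
review_b = [0.85, 0.10, 0.05]
review_c = [0.05, 0.05, 0.90]

print(js_divergence(review_a, review_b))  # similar mixtures: small value
print(js_divergence(review_a, review_c))  # different mixtures: larger value
```

With base-2 logarithms the divergence lies between 0 (identical mixtures) and 1 (mixtures with no overlap), which makes values comparable across pairs of reviews.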
    

# <span style="color:red"> Procedure </span>

## <span style="color:red"> Pipeline Overview </span>
1. Load the review corpus and fit a count vectorizer to the text
2. Create an LDA model with a specified number of topics and fit it to the vectorized text

    * Optional: run a grid-search cross-validation over the number of topics to find the best fit
    
3. Get the estimated term weights of the top terms in each topic
4. Get the probability that each review belongs to each of the $n$ specified topics
5. Write out an HTML visualization of the topics in the model
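
The optional grid search in step 2 could look roughly like the sketch below. It relies on scikit-learn's `GridSearchCV`, which falls back to the estimator's own `score` method (an approximate log-likelihood for LDA) when no scoring function is given. The toy corpus, the candidate topic counts, and the two-fold split are all assumptions chosen to keep the example small; they are not taken from the wine project.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Toy stand-in corpus (illustrative only, not the wine review data).
toy_reviews = [
    "cherry raspberry strawberry fruit",
    "bright cherry fruit long finish",
    "lemon citrus apple crisp acidity",
    "apple lemon brioche citrus",
    "dry tannins medium body finish",
    "medium body tannins dry palate",
    "great value nice wine",
    "nice wine good value price",
]

toy_counts = CountVectorizer(stop_words="english").fit_transform(toy_reviews)

# Candidate topic counts are an assumption; widen the grid for a real corpus.
param_grid = {"n_components": [2, 3, 4]}
search = GridSearchCV(
    LatentDirichletAllocation(random_state=0),
    param_grid=param_grid,
    cv=2,  # two folds only because the toy corpus is tiny
)
search.fit(toy_counts)

print(search.best_params_)
```

As noted under the disadvantages, a higher held-out likelihood does not guarantee more interpretable topics, so the selected value is best treated as a starting point for manual inspection.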

First, read in the review corpus:

In [2]:
reviews=pd.read_csv('../data/wine_reviews_sample.csv')
print("Number of wines in the sample: %d" % len(reviews['wine_id'].unique()))
print("Number of total reviews: %d" % reviews.shape[0])

Number of wines in the sample: 20
Number of total reviews: 200


After reading in the data, convert the review text into a document-term count matrix with a CountVectorizer, dropping common English stop words:

In [3]:
count_vectorizer=CountVectorizer(stop_words='english')
count_data=count_vectorizer.fit_transform(reviews['text'])
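
To make the bag-of-words representation concrete, here is a small sketch on two made-up reviews (the text and the `toy_` names are illustrative assumptions). Each row of the resulting matrix is a review, each column is a vocabulary term, and each entry counts how often that term appears.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_reviews = ["nice cherry finish", "cherry cherry nose"]  # illustrative text

toy_vectorizer = CountVectorizer(stop_words="english")
toy_counts = toy_vectorizer.fit_transform(toy_reviews)

# vocabulary_ maps each term to its column index (terms are sorted alphabetically).
print(sorted(toy_vectorizer.vocabulary_.items(), key=lambda kv: kv[1]))

# Dense view of the sparse document-term matrix: one row per review.
print(toy_counts.toarray())
```

Word order is discarded: the second review is reduced to "cherry" twice and "nose" once.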

Next, specify some number of topics for the LDA model; in the original wine project, five topics were specified:

In [4]:
num_topics=5
lda_model=LDA(n_components=num_topics)
lda_model.fit(count_data)

LatentDirichletAllocation(n_components=5)
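
LDA is fitted from a random initialization, so repeated runs of the cell above may give different topics and topic numbering. One possible way to make results repeatable is to pass a fixed `random_state`. The sketch below uses a small synthetic count matrix, and the seed value 0 is an arbitrary assumption.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Small synthetic document-term count matrix (6 documents, 4 terms).
toy_counts = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [4, 1, 0, 1],
    [0, 0, 3, 2],
    [0, 1, 2, 3],
    [0, 0, 4, 1],
])

# The seed value 0 is arbitrary; any fixed integer works.
model_a = LatentDirichletAllocation(n_components=2, random_state=0).fit(toy_counts)
model_b = LatentDirichletAllocation(n_components=2, random_state=0).fit(toy_counts)

# Same data and same seed give the same topic-word weights.
print(np.allclose(model_a.components_, model_b.components_))

# Each row of transform() is a topic mixture that sums to 1.
print(model_a.transform(toy_counts).sum(axis=1))
```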

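Now that the model has been fitted to the corpus with the chosen number of topics, we want the 10 highest-weighted words in each topic along with their weights. The weights stored in `components_` are pseudo-counts of how often each word was assigned to a topic, so they serve as estimated term frequencies: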

In [5]:
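n_top_words=10
# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0, call it instead of get_feature_names_out()
words=count_vectorizer.get_feature_names_out()
output_dict={}
for topic_idx,topic in enumerate(lda_model.components_):
    # Indices of the n_top_words largest weights, in descending order
    top_features_ind=topic.argsort()[:-n_top_words-1:-1]
    top_features=[words[i] for i in top_features_ind]
    weights=topic[top_features_ind]
    # Topics are labeled from 1; each topic gets a word column and a weight column
    output_dict[topic_idx+1]=top_features
    output_dict['%d_freq' % (topic_idx+1)]=weights
output_df=pd.DataFrame.from_dict(output_dict)
# Re-index so that ranks start at 1
output_df.index=output_df.index+1
output_df.index=output_df.index.rename('Rank')
print(output_df.head(10))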

           1     1_freq        2     2_freq        3     3_freq        4  \
Rank                                                                       
1       nice  15.687043     good  11.221647    great  13.938013   medium   
2     finish  14.210360     wine   8.209898      red  10.912524   finish   
3       nose  13.199232     just   6.211955    apple  10.166098     body   
4      fruit  13.045087     nose   6.207802  brioche   8.200480   cherry   
5       good  12.568816   finish   5.204493     wine   7.860729    apple   
6     really  12.071051   medium   4.206883   finish   7.634476      red   
7        red  10.503312  tannins   4.205597    lemon   7.200686      med   
8        dry  10.145933    great   4.204606   citrus   6.960853    lemon   
9     cherry   9.473008      red   4.200181    value   6.942235  acidity   
10    palate   8.180290    fruit   4.198380   cherry   6.584335     nose   

         4_freq           5     5_freq  
Rank                                    
1    
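
The slice `argsort()[:-n_top_words-1:-1]` used above is compact but easy to misread. The sketch below walks through it on a made-up topic vector and also shows how the pseudo-counts could be normalized into a word distribution for the topic. The vocabulary and weights are hypothetical.

```python
import numpy as np

# Hypothetical pseudo-counts for one topic over a five-word vocabulary.
vocab = np.array(["apple", "cherry", "finish", "nose", "red"])
topic = np.array([2.0, 9.0, 14.0, 13.0, 10.0])

n_top = 3
# argsort() is ascending; the slice [:-n_top-1:-1] walks back from the end,
# yielding the indices of the n_top largest weights in descending order.
top_idx = topic.argsort()[:-n_top - 1:-1]
print(vocab[top_idx])

# Dividing by the row sum turns pseudo-counts into a word distribution.
topic_probs = topic / topic.sum()
print(topic_probs[top_idx])
```

Normalized weights are comparable across topics, whereas raw pseudo-counts also reflect how many words in the corpus were assigned to each topic.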

Our next task is to compute the probability that each review belongs to each topic and to identify the topic each review most likely belongs to:

In [6]:
reviews_fitted=lda_model.transform(count_vectorizer.transform(reviews['text']))
predicted_topic=np.argmax(reviews_fitted,axis=1)+1
fitted_output=pd.DataFrame(reviews_fitted,columns=(np.arange(num_topics)+1))
fitted_output.columns=['topic_'+str(col) for col in fitted_output.columns]
fitted_output.index=reviews.index
fitted_output=pd.concat([reviews,fitted_output],axis=1)
fitted_output['predicted_topic']=predicted_topic
print(fitted_output.head(10))

   wine_id                                               text   topic_1  \
0  1248751  Bright, fresh, and lighter bodied with a long ...  0.029179   
1  1248751                 cherry raspberry earthy strawberry  0.040417   
2  1248751  A real big surprising Pinot Noir with a bunch ...  0.020126   
3  1248751  Excellent pinot noir, very drinkable, doesn't ...  0.022507   
4  1248751  Nice cherry taste, a little earthy with just e...  0.925697   
5  1248751  Totally good Pinot for Coastal California - in...  0.009381   
6  1248751  Great wine for the price. Delicate on the pall...  0.020281   
7  1248751  Red fruit: strawberry, raspberry, cherry. Cont...  0.015746   
8  1248751  Bright and fresh Sonoma Pinot. Red cherry, str...  0.018380   
9  1248751  Nice Pinot Noir from California. Grapes source...  0.018537   

    topic_2   topic_3   topic_4   topic_5  predicted_topic  
0  0.029268  0.883258  0.029192  0.029102                3  
1  0.040211  0.040081  0.040306  0.838984           

Finally, to get wine-level predicted topics we do the following:

In [7]:
wine_list=[]
mode_topic=[]
for wine in fitted_output['wine_id'].unique():
    wine_list.append(wine)
    # mode() returns all tied values in sorted order, so ties resolve to the lowest topic number
    mode_topic.append(fitted_output[fitted_output['wine_id']==wine]['predicted_topic'].mode().values[0])
wine_topic_df=pd.DataFrame({'wine_id':wine_list,'mode_review_topic':mode_topic})
print(wine_topic_df)

    wine_id  mode_review_topic
0   1248751                  5
1   7122486                  3
2     86684                  4
3   1288838                  5
4   1141123                  5
5   1248038                  3
6      9119                  4
7   1194720                  5
8   1578948                  5
9      3069                  2
10  1153696                  5
11  1399928                  1
12  1380949                  1
13  1733954                  1
14    83195                  4
15  1842942                  5
16    19265                  1
17  1944048                  1
18    15615                  4
19  1209439                  5
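
Taking the mode discards how strongly each review leans toward its topic. An alternative, which is not what the original project did, is to average the review-level topic probabilities for each wine, giving a wine-level topic mixture that can then be compared across wines. The sketch below uses a hypothetical two-topic table with made-up wine IDs.

```python
import pandas as pd

# Hypothetical review-level topic probabilities for two wines.
toy_fitted = pd.DataFrame({
    "wine_id": [101, 101, 101, 202, 202],
    "topic_1": [0.8, 0.7, 0.2, 0.1, 0.2],
    "topic_2": [0.2, 0.3, 0.8, 0.9, 0.8],
})

# Mean topic mixture per wine; sort=False keeps first-appearance order.
wine_mix = toy_fitted.groupby("wine_id", sort=False)[["topic_1", "topic_2"]].mean()
print(wine_mix)
```

Because every review row sums to 1, each averaged row also sums to 1 and can be treated as a distribution over topics.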


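One last useful tool is the pyLDAvis package, which provides an interactive visualization of the LDA model we have fitted to the reviews. Preparing and saving the visualization takes two calls (note that pyLDAvis 3.4 and later renamed the `pyLDAvis.sklearn` module to `pyLDAvis.lda_model`, so the import above may need adjusting):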

In [8]:
LDAVis_prepared=sklearn_lda.prepare(lda_model,count_data,count_vectorizer)
pyLDAvis.save_html(LDAVis_prepared,'../html_output/wine_pyldavis.html')

## <span style="color:red"> References </span>
* Wang, Wenxin, Yi Feng, and Wenqiang Dai. “Topic Analysis of Online Reviews for Two Competitive Products Using Latent Dirichlet Allocation.” Electronic Commerce Research and Applications 29 (May 1, 2018): 142–56. https://doi.org/10.1016/j.elerap.2018.04.003.
* “Topic Extraction with Non-Negative Matrix Factorization and Latent Dirichlet Allocation — Scikit-Learn 0.24.1 Documentation.” Accessed January 29, 2021. https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html.
* “2.5. Decomposing Signals in Components (Matrix Factorization Problems) — Scikit-Learn 0.24.1 Documentation.” Accessed January 29, 2021. https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation.


## <span style="color:red"> Packages Used - For Future Reference </span>

In [9]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDA
import pyLDAvis
from pyLDAvis import sklearn as sklearn_lda