# 5. Topic Modeling with LDA
Use this notebook to practice topic modeling (see the lecture slides for reference). First, we load the review data and take a quick look at a few rows.

In [1]:
import pandas as pd
reviews_df = pd.read_json('data/json/amazon_reviews.json', lines=True, encoding='utf-8') # to prevent error due to
reviews_df.sample(3)

| | review_id | product_id | reviewer_id | stars | review_body | review_title | language | product_category |
|---|---|---|---|---|---|---|---|---|
| 63015 | en_0199099 | product_en_0705066 | reviewer_en_0806719 | 2 | I hate to write bad reviews, but the fact this... | Hidden Thorn, potential dangerous on using the... | en | office_product |
| 132688 | en_0789121 | product_en_0463343 | reviewer_en_0155196 | 4 | You can't see a few of the lighter colored words. | Cute sign | en | home |
| 60394 | en_0844842 | product_en_0841159 | reviewer_en_0242760 | 2 | This product might be great, had it actually s... | Didn't stick | en | wireless |


In [2]:
reviews = reviews_df['review_body'].tolist()
reviews[:3]

["Arrived broken. Manufacturer defect. Two of the legs of the base were not completely formed, so there was no way to insert the casters. I unpackaged the entire chair and hardware before noticing this. So, I'll spend twice the amount of time boxing up the whole useless thing and send it back with a 1-star review of part of a chair I never got to sit in. I will go so far as to include a picture of what their injection molding and quality assurance process missed though. I will be hesitant to buy again. It makes me wonder if there aren't missing structures and supports that don't impede the assembly process.",
 'the cabinet dot were all detached from backing... got me',
 "I received my first order of this product and it was broke so I ordered it again. The second one was broke in more places than the first. I can't blame the shipping process as it's shrink wrapped and boxed."]

In [3]:
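def remove_puncts(review_text, alphanumeric_only=True):
    # Note: alphanumeric_only is currently unused; the function always strips non-alphanumerics.
    # Replace hyphens with spaces so hyphenated words split into separate tokens
    review_text = review_text.replace('-', ' ')
    # Keep only letters, digits, and spaces, then lowercase everything
    clean_review_text = ''.join(e for e in review_text if e.isalnum() or e == ' ').lower()
    # Collapse runs of spaces into a single space
    clean_review_text = ' '.join(clean_review_text.split())
    return clean_review_text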
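
To see what this cleaning step does, here is a minimal sketch that applies the same logic to a sentence adapted from one of the reviews above. The function is repeated in trimmed form (under the hypothetical name `remove_puncts_demo`) only so the snippet runs on its own. Note that apostrophes are dropped, so contractions such as "can't" become "cant", which may no longer match entries in a stop-word list.

```python
def remove_puncts_demo(review_text):
    # Trimmed copy of remove_puncts so this example is self-contained
    review_text = review_text.replace('-', ' ')
    clean = ''.join(e for e in review_text if e.isalnum() or e == ' ').lower()
    return ' '.join(clean.split())

sample = "I can't blame the shipping process as it's shrink-wrapped."
cleaned = remove_puncts_demo(sample)
print(cleaned)
# i cant blame the shipping process as its shrink wrapped
```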

In [4]:
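import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Uncomment on first run to fetch the required NLTK data:
# nltk.download('stopwords')
# nltk.download('punkt')  # tokenizer data for word_tokenize ('punkt_tab' on recent NLTK releases)
stop_words = set(stopwords.words('english'))

def get_words_tokenized_nopunct_nostop(reviews, stop_w=stop_words):
    """Return one list of lowercase, punctuation-free, non-stop-word tokens per review."""
    review_words_list = []
    for review in reviews:
        clean_review = remove_puncts(review)  # already lowercased
        words = word_tokenize(clean_review)
        words_nostop = [word for word in words if word not in stop_w]
        review_words_list.append(words_nostop)
    return review_words_list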

In [5]:
tokenized_reviews_list = get_words_tokenized_nopunct_nostop(reviews)
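
As a rough illustration of the stop-word filtering step, the sketch below runs the same list comprehension on an already-cleaned review. It assumes a tiny, hypothetical stop-word set (`demo_stop_words`) and plain `str.split()` in place of NLTK's stop-word list and `word_tokenize`, so it runs without any downloads; the real pipeline will filter more words.

```python
# Hypothetical mini stop-word list; the notebook uses NLTK's full English list
demo_stop_words = {"i", "the", "it", "was", "so", "of", "this", "and", "my"}

# An already-cleaned (lowercase, punctuation-free) review sentence
clean_review = "i received my first order of this product and it was broke so i ordered it again"

# Whitespace split stands in for word_tokenize on punctuation-free text
tokens = [word for word in clean_review.split() if word not in demo_stop_words]
print(tokens)
# ['received', 'first', 'order', 'product', 'broke', 'ordered', 'again']
```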

## Train an LDA Model
LDA does not discover the number of topics on its own; we have to specify it up front. Here we choose 31, equal to the number of product categories in the dataset, as a starting estimate.

In [6]:
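import gensim

# Map each unique token to an integer id
dictionary = gensim.corpora.Dictionary(tokenized_reviews_list)
# Convert each review to a bag of words: a list of (token_id, count) pairs
corpus = [dictionary.doc2bow(review) for review in tokenized_reviews_list]
topic_count = 31
# passes=50 sweeps over the whole corpus 50 times, so training can take a while
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=topic_count, id2word=dictionary, passes=50)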
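
To make the bag-of-words representation concrete, here is a minimal pure-Python sketch of what a dictionary plus `doc2bow` produce conceptually. It is not gensim's implementation: the names `token2id` and `to_bow` are hypothetical, the two toy documents are made up, and ids are assigned in first-seen order, which may differ from the ids gensim assigns.

```python
from collections import Counter

# Two made-up tokenized reviews
docs = [["arrived", "broken", "chair", "broken"],
        ["chair", "legs", "arrived"]]

# Build a token -> integer id mapping (first-seen order; an assumption of this sketch)
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    # Count occurrences per token id and return sorted (token_id, count) pairs
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

bows = [to_bow(doc) for doc in docs]
print(token2id)  # {'arrived': 0, 'broken': 1, 'chair': 2, 'legs': 3}
print(bows)      # [[(0, 1), (1, 2), (2, 1)], [(0, 1), (2, 1), (3, 1)]]
```

Word order is discarded in this representation; LDA only sees how often each token occurs in each document.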

## Visualize the LDA Model
We use the popular [pyLDAvis](https://pypi.org/project/pyLDAvis/) library to make sense of the topics.

First, we set up the notebook to suppress warnings (you'll get a lot due to package deprecations).

In [7]:
import warnings
warnings.filterwarnings('ignore')

## IMPORTANT: Set the slider below to a $\lambda$ value of 0.4
A $\lambda$ of 1 ranks terms purely by how frequent they are within a topic, so words that are common across the whole corpus dominate; this gives you a global picture of the text but does not help you tell the topics apart. A $\lambda$ of 0.1 favors words that are nearly unique to each topic, which helps you understand the difference between the topics, but not what the topics are about. A $\lambda$ of 0.4 achieves a good balance. See [this paper](https://aclanthology.org/W14-3110.pdf) for more details.
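
The effect of $\lambda$ can be sketched with the relevance score defined in the linked paper: $\lambda \log p(w \mid t) + (1 - \lambda) \log \frac{p(w \mid t)}{p(w)}$. The example below is only an illustration; the three terms and their probabilities are made-up numbers, and `relevance` and `rank` are hypothetical helper names, not part of pyLDAvis.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    # lam weights within-topic probability; (1 - lam) weights lift over the corpus
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# Made-up (p(w|t), p(w)) pairs for a single topic
terms = {
    "great":   (0.050, 0.040),   # frequent in the topic, but frequent everywhere
    "battery": (0.030, 0.004),   # fairly frequent in the topic, rare elsewhere
    "casters": (0.002, 0.0001),  # rare overall, almost exclusive to the topic
}

def rank(lam):
    # Order terms from most to least relevant for the given lambda
    return sorted(terms, key=lambda w: relevance(*terms[w], lam), reverse=True)

print(rank(1.0))  # ['great', 'battery', 'casters']
print(rank(0.4))  # ['battery', 'casters', 'great']
print(rank(0.0))  # ['casters', 'battery', 'great']
```

With these numbers, a high $\lambda$ puts the generic word first, a low $\lambda$ puts the rare word first, and an intermediate value surfaces the term that is both reasonably frequent and distinctive.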

In [12]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

pyLDAvis.enable_notebook()
gensimvis.prepare(ldamodel, corpus, dictionary)

## View topic distribution for a document (review)

In [13]:
ldamodel.get_document_topics(corpus[0])

[(1, 0.03990258),
 (5, 0.020167688),
 (9, 0.046473116),
 (11, 0.25684324),
 (13, 0.08034494),
 (14, 0.045582913),
 (15, 0.020696638),
 (17, 0.02063708),
 (18, 0.07433518),
 (19, 0.0727096),
 (23, 0.2571909),
 (29, 0.02953556),
 (30, 0.024823532)]
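
One way to read this output is to pick the most probable topics for the document. The sketch below copies the list printed above into a variable (the name `doc_topics` is an assumption for illustration) and uses only plain Python. The probabilities sum to slightly less than 1, which is consistent with gensim leaving out topics whose probability falls below its `minimum_probability` threshold.

```python
# Output of ldamodel.get_document_topics(corpus[0]), copied from above
doc_topics = [(1, 0.03990258), (5, 0.020167688), (9, 0.046473116),
              (11, 0.25684324), (13, 0.08034494), (14, 0.045582913),
              (15, 0.020696638), (17, 0.02063708), (18, 0.07433518),
              (19, 0.0727096), (23, 0.2571909), (29, 0.02953556),
              (30, 0.024823532)]

# Sort by probability, highest first, and keep the top three
top_topics = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)[:3]
print(top_topics)
# [(23, 0.2571909), (11, 0.25684324), (13, 0.08034494)]

# Total probability mass of the listed topics
total = sum(prob for _, prob in doc_topics)
print(round(total, 3))
# 0.989
```

Topics 23 and 11 are nearly tied here, so this review is better described as a mix of the two than as belonging to a single topic.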

# Exercise: What are some topics relevant to you as a designer?
Explore the topics and their keywords. 