In [138]:
import gensim
import logging
import nltk
import pandas as pd
import warnings

from nltk.stem import WordNetLemmatizer
from nltk.util import ngrams

nltk.download('wordnet')

pd.set_option('display.max_colwidth', None)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [139]:
df = pd.read_csv("reviews.csv")

In [140]:
df.head()

Unnamed: 0,date,partially_cleaned_text,sentiment,cleaned_text
0,18/6/21,This is a very healthy dog food. Good for their digestion. Also good for small puppies. My dog eats her required amount at every feeding.,1,healthy dog food good digestion also good small puppies dog eats required amount every feeding
1,7/7/21,I've been very pleased with the Natural Balance dog food. Our dogs have had issues with other dog foods in the past and I had someone recommend Natural Balance grain free since it is possible they were allergic to grains. Since switching I haven't had any issues. It is also helpful that have have different kibble size for larger/smaller sized dogs.,1,pleased natural balance dog food dogs issues dog foods past someone recommend natural balance grain free since possible allergic grains since switching issues also helpful different kibble size larger smaller sized dogs
2,18/6/21,"Before I was educated about feline nutrition, I allowed my cats to become addicted to dry cat food. I always offered both canned and dry, but wish I would have fed them premium quality canned food and limited dry food. I have two 15 year old cats and two 5 year old cats. The only good quality dry foods they will eat are Wellness and Innova. Innova's manufacturer was recently purchased by Procter&Gamble. I began looking for a replacement. After once again offering several samples (from my local holistic pet store) Holistic Select was the only one (other than the usual Wellness and Innova) they would eat. For finicky cats, I recommend trying Holistic Select. It is a good quality food that is very palatable for finicky eaters.",1,educated feline nutrition allowed cats become addicted dry cat food always offered canned dry wish would fed premium quality canned food limited dry food two year old cats two year old cats good quality dry foods eat wellness innova innova manufacturer recently purchased procter gamble began looking replacement offering several samples local holistic pet store holistic select one usual wellness innova would eat finicky cats recommend trying holistic select good quality food palatable finicky eaters
3,7/7/21,"My holistic vet recommended this, along with a few other brands. We tried them all, but my cats prefer this (especially the sardine version). The best part is their coats are so soft and clean and their eyes are so clear. AND (and I don't want to be rude, so I'll say this as delicately as I can) their waste is far less odorous than cats who eat the McDonalds junk found in most stores, which is a definite plus for me! The health benefits are so obvious - I highly recommend Holistic Select!",1,holistic vet recommended along brands tried cats prefer especially sardine version best part coats soft clean eyes clear want rude say delicately waste far less odorous cats eat mcdonalds junk found stores definite plus health benefits obvious highly recommend holistic select
4,1/7/21,"I bought this coffee because its much cheaper than the ganocafe and has the organic reishi mushroom as well as other healthy antioxidants. I didn't expect it to taste good, but it actually does! I've only had it for a few days and for $5 its totally worth it. My sisters all take ganocafe but now I'm introducing them to this less expensive similar coffee. I will follow up on this product in a few weeks. :)",1,bought coffee much cheaper ganocafe organic reishi mushroom well healthy antioxidants expect taste good actually days totally worth sisters take ganocafe introducing less expensive similar coffee follow product weeks


# Data Preparation

The words in `remove_common_words` were added gradually as we inspected the model results.

Words that appeared among the top words of almost every topic were added to the list.

We also identified tea, coffee, and price as topics early on.

To help build a better LDA model, we remove these three words so that the model generates other topics, which we then add to the final list of topics alongside tea, coffee, and price.

In [141]:
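def lemmatize(filtered_text):
    # Lemmatize each space-separated token with WordNet (default part of speech: noun).
    # Note: treating every token as a noun mangles some words, e.g. "less" becomes "le".
    lemmatizer = WordNetLemmatizer()
    return " ".join(lemmatizer.lemmatize(word) for word in filtered_text.split(" "))

def remove_common_words(filtered_text):
    # Drop tokens that exactly match a common word; bigrams such as "dog_food" are kept.
    common_words = {'great', 'taste', 'good', 'like', 'product', 'flavor', 'love',
                    'really', 'buy', 'tastes', 'better', 'best', 'tried', 'use',
                    'eat', 'food', 'make', "would", "one", "get", "tea", "coffee",
                    "price", "amazon", "bag", "dog", "cup", "much"}
    return " ".join(word for word in filtered_text.split(" ") if word not in common_words)

def generate_bigrams(filtered_text):
    # Append every adjacent word pair, joined by "_", to the original text.
    result = filtered_text
    for w in ngrams(filtered_text.split(" "), 2):
        result += " " + "_".join(w)
    return result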

In [142]:
df["review"] = df["cleaned_text"].apply(lemmatize)
df["review"] = df["review"].apply(generate_bigrams)
df["review"] = df["review"].apply(remove_common_words)
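
The order of the steps above matters: bigrams are generated before the common words are removed, so joined tokens such as `dog_food` survive the filter even though `dog` and `food` do not. The following is a minimal sketch of that effect on a toy string. It skips lemmatization, uses a shortened, hypothetical common-word list, and builds bigrams with `zip` rather than `nltk.util.ngrams`, so it illustrates the idea rather than reproducing the notebook's exact pipeline.

```python
# Hypothetical subset of the notebook's common-word list, for illustration only.
COMMON_WORDS = {"dog", "food", "good"}

def generate_bigrams(text):
    # Append every adjacent word pair, joined by "_", after the original tokens.
    tokens = text.split(" ")
    pairs = ["_".join(pair) for pair in zip(tokens, tokens[1:])]
    return " ".join(tokens + pairs)

def remove_common_words(text):
    # Only exact token matches are removed, so "dog_food" is not affected.
    return " ".join(tok for tok in text.split(" ") if tok not in COMMON_WORDS)

sample = "healthy dog food good digestion"
processed = remove_common_words(generate_bigrams(sample))
print(processed)
# healthy digestion healthy_dog dog_food food_good good_digestion
```

If the common words should also disappear from bigrams, one option would be to run the removal step before generating bigrams; whether that helps the topics is an open question for this dataset.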

In [143]:
df.head()

Unnamed: 0,date,partially_cleaned_text,sentiment,cleaned_text,review
0,18/6/21,This is a very healthy dog food. Good for their digestion. Also good for small puppies. My dog eats her required amount at every feeding.,1,healthy dog food good digestion also good small puppies dog eats required amount every feeding,healthy digestion also small puppy eats required amount every feeding healthy_dog dog_food food_good good_digestion digestion_also also_good good_small small_puppy puppy_dog dog_eats eats_required required_amount amount_every every_feeding
1,7/7/21,I've been very pleased with the Natural Balance dog food. Our dogs have had issues with other dog foods in the past and I had someone recommend Natural Balance grain free since it is possible they were allergic to grains. Since switching I haven't had any issues. It is also helpful that have have different kibble size for larger/smaller sized dogs.,1,pleased natural balance dog food dogs issues dog foods past someone recommend natural balance grain free since possible allergic grains since switching issues also helpful different kibble size larger smaller sized dogs,pleased natural balance issue past someone recommend natural balance grain free since possible allergic grain since switching issue also helpful different kibble size larger smaller sized pleased_natural natural_balance balance_dog dog_food food_dog dog_issue issue_dog dog_food food_past past_someone someone_recommend recommend_natural natural_balance balance_grain grain_free free_since since_possible possible_allergic allergic_grain grain_since since_switching switching_issue issue_also also_helpful helpful_different different_kibble kibble_size size_larger larger_smaller smaller_sized sized_dog
2,18/6/21,"Before I was educated about feline nutrition, I allowed my cats to become addicted to dry cat food. I always offered both canned and dry, but wish I would have fed them premium quality canned food and limited dry food. I have two 15 year old cats and two 5 year old cats. The only good quality dry foods they will eat are Wellness and Innova. Innova's manufacturer was recently purchased by Procter&Gamble. I began looking for a replacement. After once again offering several samples (from my local holistic pet store) Holistic Select was the only one (other than the usual Wellness and Innova) they would eat. For finicky cats, I recommend trying Holistic Select. It is a good quality food that is very palatable for finicky eaters.",1,educated feline nutrition allowed cats become addicted dry cat food always offered canned dry wish would fed premium quality canned food limited dry food two year old cats two year old cats good quality dry foods eat wellness innova innova manufacturer recently purchased procter gamble began looking replacement offering several samples local holistic pet store holistic select one usual wellness innova would eat finicky cats recommend trying holistic select good quality food palatable finicky eaters,educated feline nutrition allowed cat become addicted dry cat always offered canned dry wish fed premium quality canned limited dry two year old cat two year old cat quality dry wellness innova innova manufacturer recently purchased procter gamble began looking replacement offering several sample local holistic pet store holistic select usual wellness innova finicky cat recommend trying holistic select quality palatable finicky eater educated_feline feline_nutrition nutrition_allowed allowed_cat cat_become become_addicted addicted_dry dry_cat cat_food food_always always_offered offered_canned canned_dry dry_wish wish_would would_fed fed_premium premium_quality quality_canned canned_food food_limited limited_dry dry_food food_two two_year 
year_old old_cat cat_two two_year year_old old_cat cat_good good_quality quality_dry dry_food food_eat eat_wellness wellness_innova innova_innova innova_manufacturer manufacturer_recently recently_purchased purchased_procter procter_gamble gamble_began began_looking looking_replacement replacement_offering offering_several several_sample sample_local local_holistic holistic_pet pet_store store_holistic holistic_select select_one one_usual usual_wellness wellness_innova innova_would would_eat eat_finicky finicky_cat cat_recommend recommend_trying trying_holistic holistic_select select_good good_quality quality_food food_palatable palatable_finicky finicky_eater
3,7/7/21,"My holistic vet recommended this, along with a few other brands. We tried them all, but my cats prefer this (especially the sardine version). The best part is their coats are so soft and clean and their eyes are so clear. AND (and I don't want to be rude, so I'll say this as delicately as I can) their waste is far less odorous than cats who eat the McDonalds junk found in most stores, which is a definite plus for me! The health benefits are so obvious - I highly recommend Holistic Select!",1,holistic vet recommended along brands tried cats prefer especially sardine version best part coats soft clean eyes clear want rude say delicately waste far less odorous cats eat mcdonalds junk found stores definite plus health benefits obvious highly recommend holistic select,holistic vet recommended along brand cat prefer especially sardine version part coat soft clean eye clear want rude say delicately waste far le odorous cat mcdonalds junk found store definite plus health benefit obvious highly recommend holistic select holistic_vet vet_recommended recommended_along along_brand brand_tried tried_cat cat_prefer prefer_especially especially_sardine sardine_version version_best best_part part_coat coat_soft soft_clean clean_eye eye_clear clear_want want_rude rude_say say_delicately delicately_waste waste_far far_le le_odorous odorous_cat cat_eat eat_mcdonalds mcdonalds_junk junk_found found_store store_definite definite_plus plus_health health_benefit benefit_obvious obvious_highly highly_recommend recommend_holistic holistic_select
4,1/7/21,"I bought this coffee because its much cheaper than the ganocafe and has the organic reishi mushroom as well as other healthy antioxidants. I didn't expect it to taste good, but it actually does! I've only had it for a few days and for $5 its totally worth it. My sisters all take ganocafe but now I'm introducing them to this less expensive similar coffee. I will follow up on this product in a few weeks. :)",1,bought coffee much cheaper ganocafe organic reishi mushroom well healthy antioxidants expect taste good actually days totally worth sisters take ganocafe introducing less expensive similar coffee follow product weeks,bought cheaper ganocafe organic reishi mushroom well healthy antioxidant expect actually day totally worth sister take ganocafe introducing le expensive similar follow week bought_coffee coffee_much much_cheaper cheaper_ganocafe ganocafe_organic organic_reishi reishi_mushroom mushroom_well well_healthy healthy_antioxidant antioxidant_expect expect_taste taste_good good_actually actually_day day_totally totally_worth worth_sister sister_take take_ganocafe ganocafe_introducing introducing_le le_expensive expensive_similar similar_coffee coffee_follow follow_product product_week


# Model

## LDA

After tuning, we found that the parameters below gave the best results.

We filter out words and bigrams that appear in fewer than 15 reviews or in more than half of the reviews.

We then set the number of topics to 20 after some trials and a comparison with BERTopic.

LDA does not perform well here, as many topics share similar words, but we extracted some interesting keywords that are potential topics:

1. Tea
2. Coffee
3. Price
4. Sugar
5. Salt
6. Truffle
7. Gluten free
8. Store (Grocery / Local)
9. Peanut butter
10. Coconut water
11. Milk
12. Ice cream
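
The vocabulary filter described above corresponds to the `no_below = 15` and `no_above = 0.5` arguments passed to `filter_extremes` in the next cell. As a rough illustration of the rule, here is a pure-Python sketch on a made-up corpus with smaller thresholds. The helper name `filter_by_document_frequency` is hypothetical, and the sketch leaves out gensim's additional `keep_n` vocabulary cap, so it should be read as an approximation of the behaviour, not as gensim's implementation.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    keep = {tok for tok, n_docs in doc_freq.items() if no_below <= n_docs <= max_docs}
    return [[tok for tok in doc if tok in keep] for doc in docs]

# Made-up tokenized reviews.
docs = [
    ["sugar", "sweet", "also"],
    ["sugar", "salt", "also"],
    ["truffle", "also"],
    ["salt", "also", "truffle"],
]
filtered = filter_by_document_frequency(docs, no_below=2, no_above=0.5)
print(filtered)
# [['sugar'], ['sugar', 'salt'], ['truffle'], ['salt', 'truffle']]
```

Here `sweet` is dropped because it occurs in only one document, and `also` is dropped because it occurs in every document. Note that the thresholds count documents, not total occurrences.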

In [144]:
#Source: https://github.com/marcmuon/nlp_yelp_review_unsupervised/tree/master/notebooks

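def sent_to_words(sentences):
    # Lowercase and tokenize each review; deacc=True strips accent marks.
    # simple_preprocess keeps only tokens of 2 to 15 characters by default,
    # so longer tokens (including long bigrams) are dropped here.
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

def bigrams(words, bi_min = 15):
    # Learn collocations, ignoring words and pairs counted fewer than bi_min times.
    bigram = gensim.models.Phrases(words, min_count = bi_min)
    bigram_mod = gensim.models.phrases.Phraser(bigram)
    return bigram_mod

def get_corpus(df, column):
    words = list(sent_to_words(df[column]))
    bigram_mod = bigrams(words)
    bigram = [bigram_mod[word] for word in words]
    id2word = gensim.corpora.Dictionary(bigram)
    # Keep tokens found in at least 15 reviews and in at most 50% of reviews.
    id2word.filter_extremes(no_below = 15, no_above=0.5)
    id2word.compactify()
    # Bag-of-words representation: (token_id, count) pairs per review.
    corpus = [id2word.doc2bow(text) for text in bigram]

    return corpus, id2word, bigram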

In [145]:
df_corpus, df_id2word, df_bigram = get_corpus(df, "review")

In [146]:
# No random_state is passed, so topic numbering and assignments can differ between runs.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    lda = gensim.models.ldamulticore.LdaMulticore(
                           corpus = df_corpus,
                           num_topics = 20, 
                           id2word = df_id2word,
                           per_word_topics = True)



In [147]:
lda.print_topics(30, num_words = 8)

[(0,
  '0.007*"box" + 0.006*"year_old" + 0.006*"truffle" + 0.006*"package" + 0.006*"cooky" + 0.005*"even" + 0.005*"think" + 0.005*"used"'),
 (1,
  '0.009*"even" + 0.007*"baby" + 0.007*"also" + 0.007*"drink" + 0.006*"well" + 0.006*"corn_syrup" + 0.006*"first" + 0.005*"used"'),
 (2,
  '0.008*"also" + 0.008*"item" + 0.007*"time" + 0.006*"box" + 0.006*"hot_chocolate" + 0.006*"mix" + 0.006*"pack" + 0.005*"even"'),
 (3,
  '0.011*"drink" + 0.009*"cat" + 0.007*"gluten_free" + 0.007*"time" + 0.007*"also" + 0.007*"box" + 0.007*"say" + 0.006*"found"'),
 (4,
  '0.011*"sugar" + 0.010*"water" + 0.006*"order" + 0.006*"also" + 0.005*"sweet" + 0.005*"since" + 0.005*"free" + 0.005*"made"'),
 (5,
  '0.009*"bought" + 0.007*"first" + 0.007*"since" + 0.007*"brand" + 0.006*"time" + 0.006*"treat" + 0.006*"buying" + 0.005*"thought"'),
 (6,
  '0.010*"box" + 0.007*"made" + 0.006*"treat" + 0.006*"made_china" + 0.006*"order" + 0.006*"little" + 0.005*"found" + 0.005*"even"'),
 (7,
  '0.007*"made" + 0.007*"box" + 0.

In [148]:
topic_vec = []
for bow in df_corpus:
    # Full topic distribution for this review, as (topic_id, probability) pairs.
    top_topics = lda.get_document_topics(bow, minimum_probability = 0.0)
    # Keep only the most probable topic.
    topic_values = sorted(top_topics, key = lambda x: x[1])[-1]
    topic_vec.append(topic_values)

topics = [topic_id for topic_id, _ in topic_vec]
topic_probs = [prob for _, prob in topic_vec]
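
The loop above keeps, for each review, the single `(topic_id, probability)` pair with the highest probability. A small sketch of that selection on a made-up distribution is shown here; `dominant_topic` is a hypothetical helper name, and it uses `max` instead of sorting. The two approaches agree unless two topics have exactly the same probability, in which case `max` returns the first such pair and `sorted(...)[-1]` returns the last.

```python
def dominant_topic(topic_distribution):
    # Return the (topic_id, probability) pair with the highest probability.
    return max(topic_distribution, key=lambda pair: pair[1])

# Made-up distribution over three topics for a single review.
toy_distribution = [(0, 0.10), (1, 0.65), (2, 0.25)]

topic_id, prob = dominant_topic(toy_distribution)
by_sorting = sorted(toy_distribution, key=lambda pair: pair[1])[-1]
print(topic_id, prob)
# 1 0.65
```

Reducing each review to one topic discards the rest of the distribution, so a review whose top probability is low is only weakly associated with its assigned topic.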

In [149]:
df["topic"] = topics
df["topic_prob"] = topic_probs

In [150]:
df.head()

Unnamed: 0,date,partially_cleaned_text,sentiment,cleaned_text,review,topic,topic_prob
0,18/6/21,This is a very healthy dog food. Good for their digestion. Also good for small puppies. My dog eats her required amount at every feeding.,1,healthy dog food good digestion also good small puppies dog eats required amount every feeding,healthy digestion also small puppy eats required amount every feeding healthy_dog dog_food food_good good_digestion digestion_also also_good good_small small_puppy puppy_dog dog_eats eats_required required_amount amount_every every_feeding,2,0.495634
1,7/7/21,I've been very pleased with the Natural Balance dog food. Our dogs have had issues with other dog foods in the past and I had someone recommend Natural Balance grain free since it is possible they were allergic to grains. Since switching I haven't had any issues. It is also helpful that have have different kibble size for larger/smaller sized dogs.,1,pleased natural balance dog food dogs issues dog foods past someone recommend natural balance grain free since possible allergic grains since switching issues also helpful different kibble size larger smaller sized dogs,pleased natural balance issue past someone recommend natural balance grain free since possible allergic grain since switching issue also helpful different kibble size larger smaller sized pleased_natural natural_balance balance_dog dog_food food_dog dog_issue issue_dog dog_food food_past past_someone someone_recommend recommend_natural natural_balance balance_grain grain_free free_since since_possible possible_allergic allergic_grain grain_since since_switching switching_issue issue_also also_helpful helpful_different different_kibble kibble_size size_larger larger_smaller smaller_sized sized_dog,0,0.624098
2,18/6/21,"Before I was educated about feline nutrition, I allowed my cats to become addicted to dry cat food. I always offered both canned and dry, but wish I would have fed them premium quality canned food and limited dry food. I have two 15 year old cats and two 5 year old cats. The only good quality dry foods they will eat are Wellness and Innova. Innova's manufacturer was recently purchased by Procter&Gamble. I began looking for a replacement. After once again offering several samples (from my local holistic pet store) Holistic Select was the only one (other than the usual Wellness and Innova) they would eat. For finicky cats, I recommend trying Holistic Select. It is a good quality food that is very palatable for finicky eaters.",1,educated feline nutrition allowed cats become addicted dry cat food always offered canned dry wish would fed premium quality canned food limited dry food two year old cats two year old cats good quality dry foods eat wellness innova innova manufacturer recently purchased procter gamble began looking replacement offering several samples local holistic pet store holistic select one usual wellness innova would eat finicky cats recommend trying holistic select good quality food palatable finicky eaters,educated feline nutrition allowed cat become addicted dry cat always offered canned dry wish fed premium quality canned limited dry two year old cat two year old cat quality dry wellness innova innova manufacturer recently purchased procter gamble began looking replacement offering several sample local holistic pet store holistic select usual wellness innova finicky cat recommend trying holistic select quality palatable finicky eater educated_feline feline_nutrition nutrition_allowed allowed_cat cat_become become_addicted addicted_dry dry_cat cat_food food_always always_offered offered_canned canned_dry dry_wish wish_would would_fed fed_premium premium_quality quality_canned canned_food food_limited limited_dry dry_food food_two two_year 
year_old old_cat cat_two two_year year_old old_cat cat_good good_quality quality_dry dry_food food_eat eat_wellness wellness_innova innova_innova innova_manufacturer manufacturer_recently recently_purchased purchased_procter procter_gamble gamble_began began_looking looking_replacement replacement_offering offering_several several_sample sample_local local_holistic holistic_pet pet_store store_holistic holistic_select select_one one_usual usual_wellness wellness_innova innova_would would_eat eat_finicky finicky_cat cat_recommend recommend_trying trying_holistic holistic_select select_good good_quality quality_food food_palatable palatable_finicky finicky_eater,3,0.482602
3,7/7/21,"My holistic vet recommended this, along with a few other brands. We tried them all, but my cats prefer this (especially the sardine version). The best part is their coats are so soft and clean and their eyes are so clear. AND (and I don't want to be rude, so I'll say this as delicately as I can) their waste is far less odorous than cats who eat the McDonalds junk found in most stores, which is a definite plus for me! The health benefits are so obvious - I highly recommend Holistic Select!",1,holistic vet recommended along brands tried cats prefer especially sardine version best part coats soft clean eyes clear want rude say delicately waste far less odorous cats eat mcdonalds junk found stores definite plus health benefits obvious highly recommend holistic select,holistic vet recommended along brand cat prefer especially sardine version part coat soft clean eye clear want rude say delicately waste far le odorous cat mcdonalds junk found store definite plus health benefit obvious highly recommend holistic select holistic_vet vet_recommended recommended_along along_brand brand_tried tried_cat cat_prefer prefer_especially especially_sardine sardine_version version_best best_part part_coat coat_soft soft_clean clean_eye eye_clear clear_want want_rude rude_say say_delicately delicately_waste waste_far far_le le_odorous odorous_cat cat_eat eat_mcdonalds mcdonalds_junk junk_found found_store store_definite definite_plus plus_health health_benefit benefit_obvious obvious_highly highly_recommend recommend_holistic holistic_select,0,0.314979
4,1/7/21,"I bought this coffee because its much cheaper than the ganocafe and has the organic reishi mushroom as well as other healthy antioxidants. I didn't expect it to taste good, but it actually does! I've only had it for a few days and for $5 its totally worth it. My sisters all take ganocafe but now I'm introducing them to this less expensive similar coffee. I will follow up on this product in a few weeks. :)",1,bought coffee much cheaper ganocafe organic reishi mushroom well healthy antioxidants expect taste good actually days totally worth sisters take ganocafe introducing less expensive similar coffee follow product weeks,bought cheaper ganocafe organic reishi mushroom well healthy antioxidant expect actually day totally worth sister take ganocafe introducing le expensive similar follow week bought_coffee coffee_much much_cheaper cheaper_ganocafe ganocafe_organic organic_reishi reishi_mushroom mushroom_well well_healthy healthy_antioxidant antioxidant_expect expect_taste taste_good good_actually actually_day day_totally totally_worth worth_sister sister_take take_ganocafe ganocafe_introducing introducing_le le_expensive expensive_similar similar_coffee coffee_follow follow_product product_week,12,0.486934


In [151]:
df["topic"].value_counts()

0     361
3     360
7     355
11    323
5     292
2     292
4     289
9     276
13    270
8     267
6     266
12    257
17    257
14    256
10    248
16    236
19    220
18    209
15    205
1     205
Name: topic, dtype: int64