In [7]:
# import libraries  
import numpy as np
import pandas as pd
import seaborn as sns
import string
import pprint
import matplotlib.pyplot as plt
import spacy          
import nltk
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import pyLDAvis
import pyLDAvis.gensim  
from nltk.corpus import stopwords
import warnings

## About this Notebook
This Notebook contains a comprehensive walkthrough of how we use, a very famous Amazon automotive dataset
that contains feature attributes. The dataset used in this notebook has various origin iplications like: flipkart, 
product review and so on.
Here we're performing a simple workshop on this dataset and use it to model various topics and subject of 
interests in ecommerece business. 

### Loading the data
Here, we will load the data into the system and explore to have more in-depth 
knowledge of the significance of data and what it beholds

In [2]:
dFrame = pd.read_csv('../input/ecommerce-product-review-data/Product Review Large Data.csv')
dFrame.head(5)

Unnamed: 0,id,asins,brand,categories,colors,dateAdded,dateUpdated,dimension,ean,keys,...,reviews.rating,reviews.sourceURLs,reviews.text,reviews.title,reviews.userCity,reviews.userProvince,reviews.username,sizes,upc,weight
0,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I initially had trouble deciding between the p...,"Paperwhite voyage, no regrets!",,,Cristina M,,,205 grams
1,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,Allow me to preface this with a little history...,One Simply Could Not Ask For More,,,Ricky,,,205 grams
2,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,4.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I am enjoying it so far. Great for reading. Ha...,Great for those that just want an e-reader,,,Tedd Gardiner,,,205 grams
3,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I bought one of the first Paperwhites and have...,Love / Hate relationship,,,Dougal,,,205 grams
4,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I have to say upfront - I don't like coroporat...,I LOVE IT,,,Miljan David Tanic,,,205 grams


## The first view on dataFrame 
Here, looking at the dataframe, as we can see its a pretty comprehensive dataFrame consisting of many feature categories and 
dataPoints. Initially, we are only looking at the first five entry and every data point in the dataframe 27 columns and thereare altogether total of 10971 data points

In [3]:
print(dFrame.shape)

(10971, 27)


## Data Preprocessing
Now we'll preprocess the data thats given to us before we can pass it for training a machine learning model 
and making any prediction. To remove anamolity, and baises and bugs from the model, data preprocessing is highly recommeded
pre-process in machine learning. During this process, we basically clean data entries, deal with missing values, we normalize the dataset and then finally pass it to the model. But before we can do that, we have to follo these steps to clean our data
1. Word Tokenization
2. Getting rid of stop words
3. Vocabulary Lemmatiation

In [4]:
# This method tokenizes the word using gensim process. Learn more about 
def tokenize(sentences, deacc=True):
    for sentence in sentences: 
        yield(gensim.utils.simple_preprocess(str(sentence)))  
data = dFrame['reviews.text'].values.tolist()
wordsData = list(tokenize(data)) # recursively provess each invocation

In [6]:
# After tokenzation, the bag of word looks something like this
print(wordsData[20])

['almost', 'like', 'reading', 'real', 'book', 'don', 'is', 'crisp', 'sharp', 'and', 'easy', 'to', 'read', 've', 'held', 'out', 'from', 'buying', 'reader', 'for', 'years', 'because', 'could', 'never', 'quite', 'get', 'past', 'the', 'fact', 'it', 'wasn', 'book', 'with', 'this', 'new', 'kindle', 'don', 'even', 'notice', 'am', 'immediately', 'immersed', 'wildly', 'better', 'than', 'expected']


Here, we will create a list of stop words. A stop word in Natural Language Processing is basically a 
list of most commonly used words. For example, in English, “the”, “is” and “and”, would easily qualify as stop words. In NLP and text mining applications, stop words are used to eliminate unimportant words, allowing applications to focus on the important words instead.


In [9]:
stopWordList =  stopwords.words('english') + list(string.punctuation)

After extreacting the stop words from provided python library, we will move further on removing the stop words from our word bank and 
apply the process of lemmatization. Stop words are available in abundance in any human language. By removing these words, we remove the low-level information from our text in order to give more focus to the important information

In [15]:
def chopStopWords(words):
    return [[word for word in simple_preprocess(str(doc)) if word not in words] for doc in words]


In [16]:
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    
    textOutput = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        textOutput.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return textOutput

In the code above, we are performing the process of lemmatization. Lemmatization is a text normalization technique used in Natural Language Processing (NLP), that switches any kind of a word to its base root mode. Lemmatization is responsible for grouping different inflected forms of words into the root form, having the same meaning. Read more here at source: https://www.engati.com/glossary/lemmatization#:~:text=Lemmatization%20is%20a%20text%20normalization,form%2C%20having%20the%20same%20meaning.


Now that we have axullary methods for data preprocessing, we'll perform the call funciton and then remove the stop words form
our word bank which comes from the datasource we're processing

In [18]:
dataWordStop = chopStopWords(wordsData)
# initialize spacy 'en' model, use only tagger since we don't need parsing or NER 
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
# lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(dataWordStop, allowed_postags=['NOUN', 'ADJ'])

Now that we've lemmatizationied data and processed it through the python nlp space, lets 
take a close look at what lemmatizationized data looks like


In [19]:
print(data_lemmatized[30])

['update', 'review', 'year', 'old', 'kindle', 'keyboard', 'new', 'model', 'basic', 'cheapie', 'kindle', 'kindle', 'touch', 'course', 'kindle', 'fire', 'quasi', 'model', 'excellent', 'choice', 'one', 'right', 'preference', 'ink', 'kindle', 'kindle', 'keyboard', 'kindle', 'touch', 'new', 'flagship', 'model', 'basic', 'kindle', 'same', 'display', 'same', 'sharp', 'typeface', 'high', 'contrast', 'ink', 'paper', 'eyestrain', 'kindle', 'keyboard', 'old', 'model', 'first', 'one', 'kindle', 'keyboard', 'day', 'dozen', 'book', 'new', 'model', 'neat', 'feature', 'experience', 'book', 'well', 'bad', 'year', 'old', 'kindle', 'keyboard', 'page', 'turn', 'smooth', 'new', 'kindle', 'difference', 'enough', 'worth', 'cost', 'opinion', 'touch', 'screen', 'interface', 'kindle', 'touch', 'neat', 'ipad', 'kindle', 'book', 'book', 'nice', 'current', 'kindle', 'model', 'touch', 'screen', 'feature', 'late', 'version', 'tech', 'product', 'own', 'reason', 'cheap', 'kindle', 'excellent', 'choice', 'more']


Now that everything is in place, we'll compare the nonstop, and lemmatised version of the original
words that comes from the word bank
As we can see below, all the words are properly lemmatised

In [20]:
print(' '.join(wordsData[10]), '\n')

as reviewed by the wife this is the perfect thing for new mommy who loves to read books as soon as had my baby girl had to stop reading my novels because had to give the baby my full attention but how much time do spend laying in bed on my side with the baby while she feeds especially prior to months when feeding took as long as minutes not to mention cluster feeds during growth spurts book was out of the question because it sooo tiring to hold up in side lying breastfeeding position my new kindle was the solution it light easy to disinfect use baby wipes on the case whenever feel like it not clean enough to go on the bed with the baby and you don need to keep finger in the middle of the pages to keep it from closing you know what mean and for some reason it much easier to go through book with kindle compared to an actual book never thought be converted into the kindle culture but here we are 



In [21]:
# Anfter the process of lemmatization
print(' '.join(data_lemmatized[10]))

wife perfect thing new mommy book baby girl novel baby full attention much time bed side baby month feeding minute cluster growth spurt book question tiring side position new kindle solution easy use baby wipe case clean bed baby finger middle page reason easy book kindle actual book kindle culture


## Dictionary and Corpus
One of the reasons why it’s so hard to learn, practice and experiment with Natural Language Processing is due to the lack of available corpora. Building a gold standard corpus is seriously hard work. Gensim's LDA requires the data in a certain format. Firstly, it needs the corpus as a dicionary of id-word mapping, where each word has a unique numeric ID. This is for computationally efficiency purposes. Secondly, it needs the corpus as a term-document frequency matrix which contains the frequency

In [22]:
#dictionary
id2word = corpora.Dictionary(data_lemmatized)
#corpus
corpus = [id2word.doc2bow(text) for text in data_lemmatized]

In [23]:
#Lets take a look and corpus
print(corpus[3])

[(6, 2), (24, 1), (25, 5), (32, 1), (33, 1), (41, 1), (47, 2), (50, 6), (72, 4), (76, 3), (77, 1), (80, 1), (81, 1), (93, 1), (94, 1), (95, 1), (101, 1), (102, 2), (103, 1), (104, 1), (105, 1), (106, 1), (107, 1), (108, 1), (109, 2), (110, 1), (111, 1), (112, 1), (113, 1), (114, 1), (115, 1), (116, 1), (117, 1), (118, 1), (119, 1), (120, 1), (121, 2), (122, 1), (123, 1), (124, 1), (125, 3), (126, 1), (127, 1), (128, 1), (129, 1), (130, 1), (131, 1), (132, 1), (133, 1), (134, 1), (135, 1), (136, 1), (137, 1), (138, 1), (139, 1), (140, 1), (141, 1), (142, 1), (143, 1), (144, 1), (145, 1)]


In [25]:
#Lets take a look at dictionary 
print(id2word)

Dictionary(6207 unique tokens: ['adjustment', 'auto', 'basis', 'case', 'certain']...)


In [26]:
# Readable format of corpus (term,frequency)
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]

[[('adjustment', 1),
  ('auto', 1),
  ('basis', 1),
  ('case', 1),
  ('certain', 1),
  ('custom', 1),
  ('day', 2),
  ('delivery', 1),
  ('dollar', 1),
  ('easy', 1),
  ('expense', 1),
  ('extra', 1),
  ('fine', 1),
  ('friend', 1),
  ('glad', 1),
  ('great', 1),
  ('hard', 1),
  ('international', 1),
  ('jump', 1),
  ('level', 1),
  ('light', 3),
  ('model', 1),
  ('money', 1),
  ('option', 1),
  ('page', 1),
  ('paperwhite', 4),
  ('party', 1),
  ('ppi', 1),
  ('press', 1),
  ('pricey', 1),
  ('receptive', 1),
  ('review', 1),
  ('same', 1),
  ('screen', 1),
  ('sensitive', 1),
  ('service', 1),
  ('setting', 3),
  ('shipping', 2),
  ('specific', 2),
  ('thing', 1),
  ('third', 1),
  ('time', 3),
  ('touch', 1),
  ('tracking', 1),
  ('trouble', 1),
  ('voyage', 3),
  ('week', 1)]]

### Insight into the top 10 topics 
we print the top 10 topics


In [28]:
# Building lda model for training
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=6, 
                                           random_state=101,
                                           update_every=1,
                                           chunksize=999,
                                           passes=99,
                                           alpha=0.1,
                                           per_word_topics=True)

In [29]:
pprint.pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.124*"good" + 0.064*"quality" + 0.063*"product" + 0.062*"sound" + '
  '0.047*"bass" + 0.023*"nice" + 0.023*"battery" + 0.021*"awesome" + '
  '0.020*"price" + 0.016*"ear"'),
 (1,
  '0.052*"kindle" + 0.027*"fire" + 0.020*"year" + 0.019*"device" + 0.019*"new" '
  '+ 0.018*"book" + 0.015*"screen" + 0.013*"hdx" + 0.013*"model" + '
  '0.012*"last"'),
 (2,
  '0.026*"tv" + 0.020*"fire" + 0.019*"other" + 0.017*"box" + 0.015*"more" + '
  '0.014*"roku" + 0.014*"review" + 0.013*"amazon" + 0.013*"content" + '
  '0.012*"device"'),
 (3,
  '0.047*"headphone" + 0.043*"nice" + 0.020*"apple" + 0.016*"people" + '
  '0.016*"more" + 0.015*"case" + 0.014*"earbud" + 0.013*"year" + '
  '0.012*"magnet" + 0.010*"free"'),
 (4,
  '0.019*"bad" + 0.018*"month" + 0.017*"voice" + 0.017*"great" + 0.016*"prime" '
  '+ 0.015*"speaker" + 0.015*"music" + 0.015*"problem" + 0.015*"echo" + '
  '0.015*"tap"'),
 (5,
  '0.117*"earphone" + 0.042*"price" + 0.030*"range" + 0.028*"quality" + '
  '0.028*"boat" + 0.026*"super

## Model analysis and evluation
Now that we have defined and procceeded the model, afficiency. Its time to 
evaluate the model. Here we'll be using coherence score to judge how well the lda_model performs

In [30]:
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.565607602946545


## Visualization
Lets dig further into analysis with some visualization.
For interactive visualization, here we'll use pyLDAvis library

In [31]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis

  by='saliency', ascending=False).head(R).drop('saliency', 1)


  from imp import reload
  from imp import reload
  from imp import reload
  from imp import reload


References:
(2022, November 10). Modeling a topic using LDA. Kaggle. Retrieved November 28, 2022, from https://www.kaggle.com/code/vivekgediya/topic-modeling-using-lda-on-product-review 
Natural language processing corpora - NLP-for-hackers. NLP. (2018, June 22). Retrieved November 28, 2022, from https://nlpforhackers.io/corpora/ 