**Tasks -**

1. Normalize case

2. Tokenize (using word_tokenize from NLTK)

3. POS tagging using the NLTK pos tagger

4. For the topic model, we would want to include only nouns

      * First, find out all the POS tags that correspond to nouns

5. Lemmatize (you want different forms of the terms to be treated as one, don't worry about providing POS tag to lemmatizer for now)

6. Remove stop words and punctuation

7. Create a topic model using LDA on the cleaned up data with 10 topics

      * choose the topic model parameters carefully

      * what is the coherence of the model?

8. Analyze the topics, which pairs of topics can be combined?

9. Create topic model using LDA with what you think is the optimal number of topics

      * choose the topic model parameters carefully

      * is the coherence better now?

10. The business finally needs to be able to interpret the topics
      
      * name each of the identified topics

      * create a table with the topic name and the top 10 terms in each to present to business
        

**Some code to get you started**

In [2]:
import warnings
warnings.filterwarnings("ignore")

# Importing the usual utilities

import numpy as np, pandas as pd
import re, random, os, string
from pprint import pprint #pretty print
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

**Reading in the reviews file**

In [3]:
# Import the dataset

reviews = pd.read_csv("/content/K8 Reviews v0.2.csv")
reviews.head()

Unnamed: 0,sentiment,review
0,1,Good but need updates and improvements
1,0,"Worst mobile i have bought ever, Battery is dr..."
2,1,when I will get my 10% cash back.... its alrea...
3,1,Good
4,0,The worst phone everThey have changed the last...


In [4]:
reviews.shape

(14675, 2)

**Task 1. Normalize case**

In [5]:
reviews_lower = [sent.lower() for sent in reviews.review.values]
reviews_lower[0]

'good but need updates and improvements'

**Task 2. Tokenize (using word_tokenize from NLTK)**

In [7]:
import nltk
nltk.download('punkt_tab')
reviews_token = [word_tokenize(sent) for sent in reviews_lower]
reviews_token[0]

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


['good', 'but', 'need', 'updates', 'and', 'improvements']

**Task 3. POS tagging using the NLTK pos tagger**

Check out the POS tagging chapter from the NLTK book

http://www.nltk.org/book/ch05.html

Examples -

In [8]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
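# Get the list of tags
# Note: this prints the Brown corpus tagset; nltk.pos_tag itself returns
# Penn Treebank tags (see nltk.help.upenn_tagset() in Task 4)
nltk.download('tagsets_json')
nltk.download('tagsets')
nltk.help.brown_tagset()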

In [12]:
nltk.download('averaged_perceptron_tagger_eng')
nltk.pos_tag(reviews_token[0])

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


[('good', 'JJ'),
 ('but', 'CC'),
 ('need', 'VBP'),
 ('updates', 'NNS'),
 ('and', 'CC'),
 ('improvements', 'NNS')]

In [13]:
sent = "I like to move it".split()
sent_tagged = nltk.pos_tag(sent)

In [14]:
sent_tagged

[('I', 'PRP'), ('like', 'VBP'), ('to', 'TO'), ('move', 'VB'), ('it', 'PRP')]

In [15]:
# Apply on the dataset
reviews_tagged = [nltk.pos_tag(tokens) for tokens in reviews_token]
reviews_tagged[0]

[('good', 'JJ'),
 ('but', 'CC'),
 ('need', 'VBP'),
 ('updates', 'NNS'),
 ('and', 'CC'),
 ('improvements', 'NNS')]

**Task 4. For the topic model, we would want to include only nouns**

* First, find out all the POS tags that correspond to nouns

* Limit the data to only terms with these tags

In [None]:
nltk.download('tagsets')
nltk.help.upenn_tagset()

For each term, the POS tagger returns a tuple: the first element is the term and the second is the tag.

In [17]:
tagged_tuple = nltk.pos_tag(['great'])
tagged_tuple[0]

('great', 'JJ')

In [18]:
pprint(tagged_tuple[0][0])
pprint(tagged_tuple[0][1])

'great'
'JJ'


In [21]:
reviews_noun=[]
for sent in reviews_tagged:
  reviews_noun.append([token for token in sent if re.search("NN.*", token[1])])

reviews_noun[0]

[('updates', 'NNS'), ('improvements', 'NNS')]
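The regular expression above keeps any tag that contains "NN". In the Penn Treebank tagset returned by `nltk.pos_tag`, the noun tags are NN, NNS, NNP and NNPS, so an explicit set is one alternative way to express the same filter. The names `NOUN_TAGS` and `keep_nouns` below are illustrative and not part of the notebook; this is a minimal sketch, not the only way to do it.

```python
# Penn Treebank noun tags: singular, plural, proper singular, proper plural
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def keep_nouns(tagged_sentence):
    """Return only the (term, tag) pairs whose tag is a noun tag."""
    return [(term, tag) for term, tag in tagged_sentence if tag in NOUN_TAGS]

# Tagged output copied from the first review above
example = [("good", "JJ"), ("but", "CC"), ("need", "VBP"),
           ("updates", "NNS"), ("and", "CC"), ("improvements", "NNS")]
nouns = keep_nouns(example)
print(nouns)
```

Applied to the whole dataset, this would be `[keep_nouns(sent) for sent in reviews_tagged]`.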

**Task 5. Lemmatize (you want different forms of the terms to be treated as one; don't worry about providing a POS tag to the lemmatizer for now)**

In [22]:
# Apply lemmatization
nltk.download('wordnet')

lemm = WordNetLemmatizer()
reviews_lemm=[]

for sent in reviews_noun:
  reviews_lemm.append([lemm.lemmatize(word[0]) for word in sent])

[nltk_data] Downloading package wordnet to /root/nltk_data...


In [23]:
reviews_lemm[0]

['update', 'improvement']

**Task 6. Remove stop words and punctuation (if there are any at all after the POS tagging)**

Use the standard NLTK stop word list together with the punctuation marks

In [25]:
nltk.download('stopwords')
from string import punctuation
from nltk.corpus import stopwords
stop_nltk = stopwords.words("english")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
list(punctuation)

In [28]:
stop_updated = stop_nltk + list(punctuation) + ["..."] + [".."]
reviews_sw_removed=[]
for sent in reviews_lemm:
  reviews_sw_removed.append([term for term in sent if term not in stop_updated])

In [29]:
reviews_sw_removed[1]

['mobile',
 'battery',
 'hell',
 'backup',
 'hour',
 'us',
 'idle',
 'discharged.this',
 'lie',
 'amazon',
 'lenove',
 'battery',
 'charger',
 'hour']
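Tokens such as 'discharged.this' survive the stop word filter because they are neither stop words nor single punctuation marks. One possible extra cleaning step, sketched below under the assumption that a useful term starts with a letter and contains only letters and digits, is to keep only tokens that match such a pattern. `TOKEN_PATTERN` and `is_clean` are hypothetical names; note that this drops malformed tokens rather than repairing them.

```python
import re

# Assumption: a usable term is at least two characters long, starts with a
# lowercase letter and contains only lowercase letters and digits (keeps 'k8').
TOKEN_PATTERN = re.compile(r"[a-z][a-z0-9]+")

def is_clean(term):
    """True when the whole term matches the pattern above."""
    return TOKEN_PATTERN.fullmatch(term) is not None

sample = ["mobile", "battery", "discharged.this", "....", "h", "k8", "us"]
cleaned = [term for term in sample if is_clean(term)]
print(cleaned)
```

In the notebook this check could be added to the list comprehension that builds `reviews_sw_removed`.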

**Task 7. Create a topic model using LDA on the cleaned up data with 10 topics**

* what is the coherence of the model?

Use gensim for this task

In [30]:
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
from gensim.models import ldamodel

In [31]:
# Create a Dictionary
id2word = corpora.Dictionary(reviews_sw_removed)
print(len(id2word))

#get the words in Dictionary
print(id2word)

9266
Dictionary<9266 unique tokens: ['improvement', 'update', 'amazon', 'backup', 'battery']...>


In [32]:
# Convert document (a list of words) into the bag-of-words format
# It contains the word id and its frequency in each document.

texts = reviews_sw_removed
corpus = [id2word.doc2bow(text) for text in texts]
print(corpus[200])

[(36, 1), (143, 1), (314, 1), (415, 1), (416, 1)]
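To make the bag-of-words format concrete, the following pure-Python sketch mimics the idea behind `Dictionary` and `doc2bow`: every distinct term gets an integer id, and each document becomes a list of (id, count) pairs. The helper names `build_vocab` and `to_bow` are made up for illustration, and the ids they assign will not match the ids gensim assigns.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct term, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for term in doc:
            vocab.setdefault(term, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (term_id, count) pairs for the terms of doc found in vocab."""
    counts = Counter(vocab[term] for term in doc if term in vocab)
    return sorted(counts.items())

docs = [["update", "improvement"], ["battery", "charger", "battery", "hour"]]
vocab = build_vocab(docs)
bow = to_bow(docs[1], vocab)
print(vocab)
print(bow)
```

The repeated term 'battery' ends up as a single pair with a count of 2, which is the information LDA works with.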


In [34]:
# Apply the LDA model

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=10,
                                            passes=10)  # number of passes through the corpus during training

# Other parameters to experiment with:
# alpha=0.2,
# eta=8.1,
# chunksize=100,  # how many documents are processed at a time by the training algorithm

In [37]:
# show the first 10 words belonging to each topic

for idx, topic in lda_model.show_topics(formatted=False, num_words=10):
  print("Topic: {} \nwords: {}".format(idx, [w[0] for w in topic]))
  print("\n")

Topic: 0 
words: ['phone', 'device', 'music', 'software', 'issue', 'ram', 'android', 'camera', 'apps', 'stock']


Topic: 1 
words: ['product', 'money', 'amazon', 'service', 'waste', 'value', 'lenovo', 'customer', 'replacement', 'day']


Topic: 2 
words: ['phone', '....', '.....', 'screen', 'cast', 'box', 'smartphone', 'set', 'thanks', 'month']


Topic: 3 
words: ['network', 'call', 'phone', 'feature', 'problem', 'sim', 'issue', 'option', 'jio', 'support']


Topic: 4 
words: ['camera', 'quality', 'phone', 'performance', 'display', 'mode', 'feature', 'sound', 'budget', 'glass']


Topic: 5 
words: ['note', 'k8', 'lenovo', 'mobile', 'handset', 'system', 'feature', 'speed', 'model', 'update']


Topic: 6 
words: ['price', 'range', 'speaker', 'mobile', 'work', 'condition', 'offer', 'purchase', 'order', 'thank']


Topic: 7 
words: ['delivery', 'hai', 'h', 'charger', 'super', 'heat', 'ho', 'cell', 'complaint', 'gud']


Topic: 8 
words: ['battery', 'phone', 'mobile', 'problem', 'issue', 'day', '

In [40]:
# Compute the coherence score (c_v measure)
coherence_model_lda = CoherenceModel(model=lda_model, texts=reviews_sw_removed, dictionary=id2word, coherence='c_v')
# Get the coherence score using the get_coherence() method
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.5819247409386898


In [41]:
!pip install pyLDAvis==3.4.1 # To visualize lda model

Collecting pyLDAvis==3.4.1
  Downloading pyLDAvis-3.4.1-py3-none-any.whl.metadata (4.2 kB)
Collecting funcy (from pyLDAvis==3.4.1)
  Downloading funcy-2.0-py2.py3-none-any.whl.metadata (5.9 kB)
Downloading pyLDAvis-3.4.1-py3-none-any.whl (2.6 MB)
Downloading funcy-2.0-py2.py3-none-any.whl (30 kB)
Installing collected packages: funcy, pyLDAvis
Successfully installed funcy-2.0 pyLDAvis-3.4.1


In [42]:
# Visualization

import pyLDAvis
import pyLDAvis.gensim_models  # pyLDAvis 3.x exposes the gensim helper as gensim_models
pyLDAvis.enable_notebook()

vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)
vis

**Task 8. Analyze the topics, which pairs of topics can be combined?**

* you can assume that if a pair of topics has very similar top terms, they are very close and can be combined


**Looking at the top terms of each topic, the following can be combined** -

* Topic possibly talks about 'pricing'

* Topic closely talks about 'battery related issues'

* Topic talks about 'performance'
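One rough way to quantify "very similar top terms" is the Jaccard overlap between the top-term sets of two topics (shared terms divided by all distinct terms). The sketch below uses the top terms printed for topics 0 to 7 of the 10-topic model above; the truncated topic is left out. The function name `jaccard` and the choice of this measure are assumptions for illustration, not something the assignment prescribes.

```python
from itertools import combinations

def jaccard(terms_a, terms_b):
    """Share of distinct terms that two term lists have in common."""
    a, b = set(terms_a), set(terms_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Top terms copied from the 10-topic output above
top_terms = {
    0: ['phone', 'device', 'music', 'software', 'issue', 'ram', 'android', 'camera', 'apps', 'stock'],
    1: ['product', 'money', 'amazon', 'service', 'waste', 'value', 'lenovo', 'customer', 'replacement', 'day'],
    2: ['phone', '....', '.....', 'screen', 'cast', 'box', 'smartphone', 'set', 'thanks', 'month'],
    3: ['network', 'call', 'phone', 'feature', 'problem', 'sim', 'issue', 'option', 'jio', 'support'],
    4: ['camera', 'quality', 'phone', 'performance', 'display', 'mode', 'feature', 'sound', 'budget', 'glass'],
    5: ['note', 'k8', 'lenovo', 'mobile', 'handset', 'system', 'feature', 'speed', 'model', 'update'],
    6: ['price', 'range', 'speaker', 'mobile', 'work', 'condition', 'offer', 'purchase', 'order', 'thank'],
    7: ['delivery', 'hai', 'h', 'charger', 'super', 'heat', 'ho', 'cell', 'complaint', 'gud'],
}

# Score every pair of topics and list the most similar pairs first
pairs = sorted(
    ((jaccard(top_terms[i], top_terms[j]), i, j) for i, j in combinations(top_terms, 2)),
    reverse=True,
)
for score, i, j in pairs[:5]:
    print(f"topics {i} and {j}: overlap {score:.2f}")
```

Pairs with a high overlap are candidates for merging. The pyLDAvis intertopic distance map gives a complementary view, since it compares full topic distributions rather than only the top 10 terms.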


**Task 9. Create a topic model using LDA with what you think is the optimal number of topics**

* is the coherence better now?

In [43]:
# Build LDA model

lda_model6 = gensim.models.ldamodel.LdaModel(corpus = corpus,
                                              id2word=id2word,
                                              num_topics=6,
                                              passes=20,
                                             )



**Printing the coherence of the model**

In [44]:
# Compute Coherence Score

coherence_model_lda = CoherenceModel(model=lda_model6, texts=reviews_sw_removed, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score:', coherence_lda)




Coherence Score: 0.5996337535568715
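Comparing only two models leaves the choice of the number of topics somewhat open, and LDA results vary from run to run unless a seed is fixed. A possible approach, sketched below, is to score several candidate topic counts and keep the one with the highest coherence. `pick_num_topics` and `coherence_for` are hypothetical helpers; `coherence_for` assumes that `corpus`, `id2word` and `reviews_sw_removed` from the earlier cells exist, and the candidate range and `random_state=42` are arbitrary choices.

```python
def pick_num_topics(candidates, score_fn):
    """Score each candidate topic count and return (best_k, scores_by_k)."""
    scores = {k: score_fn(k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores

def coherence_for(k):
    """Hypothetical scorer: train LDA with k topics and return its c_v coherence.

    Relies on corpus, id2word and reviews_sw_removed defined in earlier cells.
    """
    import gensim
    from gensim.models import CoherenceModel
    model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word,
                                            num_topics=k, passes=10,
                                            random_state=42)  # fixed seed for repeatable runs
    cm = CoherenceModel(model=model, texts=reviews_sw_removed,
                        dictionary=id2word, coherence='c_v')
    return cm.get_coherence()

# In the notebook (trains one model per candidate, so it can take a while):
# best_k, scores = pick_num_topics(range(4, 13, 2), coherence_for)
```

Coherence is only a guide here; the final choice should still be checked against how interpretable the topics are.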


In [45]:
# Visualization

import pyLDAvis
import pyLDAvis.gensim_models  # pyLDAvis 3.x exposes the gensim helper as gensim_models
pyLDAvis.enable_notebook()

vis = pyLDAvis.gensim_models.prepare(lda_model6, corpus, id2word)
vis



**Task 10. The business finally needs to be able to interpret the topics**

* name each of the identified topics

* create a table with the topic name and the top 10 terms in each to present to business

In [46]:
x = lda_model6.show_topics(formatted=False)
topics_words = [(tp[0], [wd[0] for wd in tp[1]]) for tp in x]



In [49]:
for topic,words in topics_words:
  print(str(topic)+ "::"+ str(words))
  print()

0::['camera', 'quality', 'mobile', 'price', 'performance', 'feature', 'sound', 'battery', 'video', 'photo']

1::['note', 'k8', 'network', 'call', 'lenovo', 'problem', 'option', 'sim', 'screen', 'app']

2::['phone', 'battery', 'camera', 'backup', 'issue', 'performance', 'price', 'day', 'feature', 'processor']

3::['product', 'phone', 'problem', 'issue', 'battery', 'amazon', 'time', 'day', 'service', 'heating']

4::['waste', '.....', 'device', 'superb', 'dolby', 'money', 'sound', '......', 'atmos', 'ok']

5::['money', 'value', 'hai', 'speaker', 'h', 'earphone', 'worth', 'excellent', 'headphone', 'ho']
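The remaining step of Task 10 is to attach a name to each topic and lay the result out as a table. The sketch below shows one way to structure that, using the top terms printed above. The labels in `topic_names` are placeholders only and are meant to be replaced with names chosen after reading each topic's terms.

```python
# Top terms copied from the 6-topic output above
topics_words = [
    (0, ['camera', 'quality', 'mobile', 'price', 'performance', 'feature', 'sound', 'battery', 'video', 'photo']),
    (1, ['note', 'k8', 'network', 'call', 'lenovo', 'problem', 'option', 'sim', 'screen', 'app']),
    (2, ['phone', 'battery', 'camera', 'backup', 'issue', 'performance', 'price', 'day', 'feature', 'processor']),
    (3, ['product', 'phone', 'problem', 'issue', 'battery', 'amazon', 'time', 'day', 'service', 'heating']),
    (4, ['waste', '.....', 'device', 'superb', 'dolby', 'money', 'sound', '......', 'atmos', 'ok']),
    (5, ['money', 'value', 'hai', 'speaker', 'h', 'earphone', 'worth', 'excellent', 'headphone', 'ho']),
]

# Placeholder labels (hypothetical): replace each with a business-friendly name
topic_names = {topic_id: f"Topic {topic_id} (name TBD)" for topic_id, _ in topics_words}

# One row per topic: its name and its top 10 terms as a single string
rows = [
    {"topic": topic_names[topic_id], "top_terms": ", ".join(words)}
    for topic_id, words in topics_words
]

for row in rows:
    print(f"{row['topic']:<20} | {row['top_terms']}")
```

Since pandas is already imported in the notebook, `pd.DataFrame(rows)` would render the same rows as a table.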



