# Project 1 - Topic Analysis of Review Data

Project 1 

DESCRIPTION

Help a leading mobile brand understand the voice of the customer by analyzing the reviews of their product on Amazon and the topics that customers are talking about. You will perform topic modeling on specific parts of speech. You’ll finally interpret the emerging topics.

Problem Statement: 

A popular mobile phone brand, Lenovo has launched their budget smartphone in the Indian market. The client wants to understand the VOC (voice of the customer) on the product. This will be useful to not just evaluate the current product, but to also get some direction for developing the product pipeline. The client is particularly interested in the different aspects that customers care about. Product reviews by customers on a leading e-commerce site should provide a good view.

Domain: Amazon reviews for a leading phone brand

Analysis to be done: POS tagging, topic modeling using LDA, and topic interpretation

Content: 

Dataset: ‘K8 Reviews v0.2.csv’

Columns:

Sentiment: The sentiment against the review (4,5 star reviews are positive, 1,2 are negative)

Reviews: The main text of the review

Steps to perform:

Discover the topics in the reviews and present it to business in a consumable format. Employ techniques in syntactic processing and topic modeling.

Perform specific cleanup, POS tagging, and restricting to relevant POS tags, then, perform topic modeling using LDA. Finally, give business-friendly names to the topics and make a table for business.

#### Tasks: 

1. Read the .csv file using Pandas. Take a look at the top few records.

2. Normalize casings for the review text and extract the text into a list for easier manipulation.

3. Tokenize the reviews using NLTKs word_tokenize function.

4. Perform parts-of-speech tagging on each sentence using the NLTK POS tagger.

5. For the topic model, we should  want to include only nouns.

    1. Find out all the POS tags that correspond to nouns.

2. Limit the data to only terms with these tags.

6. Lemmatize. 

    1. Different forms of the terms need to be treated as one.

    2. No need to provide POS tag to lemmatizer for now.

7. Remove stopwords and punctuation (if there are any). 

8. Create a topic model using LDA on the cleaned-up data with 12 topics.

    1. Print out the top terms for each topic.

    2. What is the coherence of the model with the c_v metric?

9. Analyze the topics through the business lens.

    1. Determine which of the topics can be combined.

10. Create a topic model using LDA with what you think is the optimal number of topics

    1. What is the coherence of the model?

11. The business should be able to interpret the topics.

    1. Name each of the identified topics.

    2. Create a table with the topic name and the top 10 terms in each to present to the business.

### Importing required Libraries

In [47]:
import numpy as np
import pandas as pd
import nltk
import gensim
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [48]:
from collections import Counter
import re

In [207]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

#### 1. Read the .csv file using Pandas. Take a look at the top few records.

In [191]:
df_reviews = pd.read_csv("K8 Reviews v0.2.csv")
df_reviews.head()

Unnamed: 0,sentiment,review
0,1,Good but need updates and improvements
1,0,"Worst mobile i have bought ever, Battery is dr..."
2,1,when I will get my 10% cash back.... its alrea...
3,1,Good
4,0,The worst phone everThey have changed the last...


#### Function to remove any emojis from the reviews

In [192]:
def cleanse_unicode(s):
    if not s:
        return ""
    
   # temp = single_quote_expr.sub("'", s, re.U)
    temp = emoji_pattern.sub("", s, re.U)
    return temp

def remove_emoji(string):
    emoji_pattern = re.compile("["
                               
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    
    return emoji_pattern.sub(r'', string)

df_reviews['review'] = df_reviews.apply(lambda x: remove_emoji(x['review']),axis=1)


In [193]:
df_reviews.head()

Unnamed: 0,sentiment,review
0,1,Good but need updates and improvements
1,0,"Worst mobile i have bought ever, Battery is dr..."
2,1,when I will get my 10% cash back.... its alrea...
3,1,Good
4,0,The worst phone everThey have changed the last...


#### 2. Normalize casings for the review text and extract the text into a list for easier manipulation.

In [194]:
all_reviews = list(df_reviews['review'])
all_reviews[0]

'Good but need updates and improvements'

In [195]:
all_reviews = [review.lower() for review in all_reviews]
all_reviews[0]

'good but need updates and improvements'

#### 3. Tokenize the reviews using NLTKs word_tokenize function.

In [196]:
tokenized_reviews=[]
tokenized_reviews = [word_tokenize(review) for review in all_reviews]
tokenized_reviews[0]

['good', 'but', 'need', 'updates', 'and', 'improvements']

In [197]:
print(tokenized_reviews[:2])

[['good', 'but', 'need', 'updates', 'and', 'improvements'], ['worst', 'mobile', 'i', 'have', 'bought', 'ever', ',', 'battery', 'is', 'draining', 'like', 'hell', ',', 'backup', 'is', 'only', '6', 'to', '7', 'hours', 'with', 'internet', 'uses', ',', 'even', 'if', 'i', 'put', 'mobile', 'idle', 'its', 'getting', 'discharged.this', 'is', 'biggest', 'lie', 'from', 'amazon', '&', 'lenove', 'which', 'is', 'not', 'at', 'all', 'expected', ',', 'they', 'are', 'making', 'full', 'by', 'saying', 'that', 'battery', 'is', '4000mah', '&', 'booster', 'charger', 'is', 'fake', ',', 'it', 'takes', 'at', 'least', '4', 'to', '5', 'hours', 'to', 'be', 'fully', 'charged.do', "n't", 'know', 'how', 'lenovo', 'will', 'survive', 'by', 'making', 'full', 'of', 'us.please', 'don', ';', 't', 'go', 'for', 'this', 'else', 'you', 'will', 'regret', 'like', 'me', '.']]


#### Pos Tagging

In [198]:
pos_tagged_wordlist=[]
pos_tagged_wordlist = [nltk.pos_tag(token_review) for token_review in tokenized_reviews]

In [357]:
pos_tagged_wordlist[:1]

[[('good', 'JJ'),
  ('but', 'CC'),
  ('need', 'VBP'),
  ('updates', 'NNS'),
  ('and', 'CC'),
  ('improvements', 'NNS')]]

#### Function to get only Nounn words

In [374]:
def GetNounPhrase(pos_tag_sent):    
    noun_phrase_list = ['NN','NNP','NNS','NNPS']
    count = 0
    noun_words = []
    # The pos_tagged_wordlist consist of multiple sentences. it is a List within a List.
#     for pos_tag_sent in pos_tagged_wordlist:
    for word,tag in pos_tag_sent:
#         if tag in noun_phrase_list and len(word) > 1:
        if tag in noun_phrase_list:
            count+=1
            noun_words.append(word.lower())
    
    return noun_words

In [375]:
words_noun_tags = [GetNounPhrase(pos_tag_sent) for pos_tag_sent in pos_tagged_wordlist]

In [376]:
words_noun_tags

[['updates', 'improvements'],
 ['mobile',
  'i',
  'battery',
  'hell',
  'backup',
  'hours',
  'uses',
  'idle',
  'discharged.this',
  'lie',
  'amazon',
  'lenove',
  'battery',
  'charger',
  'hours',
  'don'],
 ['i', '%', 'cash', 'january..'],
 [],
 ['phone', 'everthey', 'phone', 'problem', 'amazon', 'phone', 'amazon'],
 ['camerawaste', 'money'],
 ['phone', 'reason', 'k8'],
 ['battery', 'level'],
 ['problems',
  'phone',
  'hanging',
  'problems',
  'note',
  'station',
  'ahmedabad',
  'years',
  'phone',
  'lenovo'],
 ['lot', 'glitches', 'thing', 'options'],
 ['wrost'],
 ['phone', 'charger', 'damage', 'months'],
 ['item', 'battery', 'life'],
 ['i',
  'battery',
  'problem',
  'motherboard',
  'problem',
  'months',
  'mobile',
  'life'],
 ['phone', 'slim', 'battry', 'backup', 'screen'],
 ['headset'],
 ['time', 'i'],
 ['product',
  'prize',
  'range',
  'specification',
  'comparison',
  'mobile',
  'range',
  'i',
  'phone',
  'seal',
  'i',
  'credit',
  'card',
  'i',
  '..',

#### Preprocessing Removing Stop words and Lemmetinzing already tokenized words

In [377]:
from nltk.stem import WordNetLemmatizer

In [378]:
# Remove Stop Words and Lemmatize 
lemmat = WordNetLemmatizer()
def preprocess(tokenized_text):
    preprocessed_text = []
    
    # 2. Remove Stopwords and words less then 3 characters
    cleaned_text = []
    for token in tokenized_text:
        if token not in stop_words and len(token) > 3:
            cleaned_text.append(token)
    
    # 3. Lemmatize
    for word in cleaned_text:
        lemmatized_word = lemmat.lemmatize(word , pos='n')
        preprocessed_text.append(word)
        
    
    return preprocessed_text

In [379]:
preprocessed_text = [preprocess(token_review) for token_review in tokenized_reviews]
print(preprocessed_text[:2])

[['good', 'need', 'updates', 'improvements'], ['worst', 'mobile', 'bought', 'ever', 'battery', 'draining', 'like', 'hell', 'backup', 'hours', 'internet', 'uses', 'even', 'mobile', 'idle', 'getting', 'discharged.this', 'biggest', 'amazon', 'lenove', 'expected', 'making', 'full', 'saying', 'battery', '4000mah', 'booster', 'charger', 'fake', 'takes', 'least', 'hours', 'fully', 'charged.do', 'know', 'lenovo', 'survive', 'making', 'full', 'us.please', 'else', 'regret', 'like']]


In [380]:
import gensim
from gensim.utils import simple_preprocess

#### Create a Dictionary of nouns from POS tagging.

In [381]:
# Create a Dictionary of nouns from POS tagging.
dictionary = gensim.corpora.Dictionary(words_noun_tags)
dictionary

<gensim.corpora.dictionary.Dictionary at 0x17cb5a38248>

In [382]:
# looking at first 20 items from the noun dictionary
count = 0
for k,v in dictionary.iteritems():
    print(k,v)
    count +=1
    if count > 20:
        break

0 improvements
1 updates
2 amazon
3 backup
4 battery
5 charger
6 discharged.this
7 don
8 hell
9 hours
10 i
11 idle
12 lenove
13 lie
14 mobile
15 uses
16 %
17 cash
18 january..
19 everthey
20 phone


#### Creating Bag of Words

In [383]:
bow_corpus = [dictionary.doc2bow(doc) for doc in preprocessed_text]
document_num = 1
bow_doc_x = bow_corpus[document_num]

bow_doc_x
for i in range(len(bow_doc_x)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_x[i][0], dictionary[bow_doc_x[i][0]], bow_doc_x[i][1]))

Word 2 ("amazon") appears 1 time.
Word 3 ("backup") appears 1 time.
Word 4 ("battery") appears 2 time.
Word 5 ("charger") appears 1 time.
Word 6 ("discharged.this") appears 1 time.
Word 8 ("hell") appears 1 time.
Word 9 ("hours") appears 2 time.
Word 11 ("idle") appears 1 time.
Word 12 ("lenove") appears 1 time.
Word 14 ("mobile") appears 2 time.
Word 15 ("uses") appears 1 time.
Word 29 ("lenovo") appears 1 time.
Word 444 ("internet") appears 1 time.
Word 649 ("regret") appears 1 time.
Word 1345 ("draining") appears 1 time.
Word 1598 ("fake") appears 1 time.
Word 1968 ("worst") appears 1 time.
Word 4552 ("bought") appears 1 time.
Word 6253 ("booster") appears 1 time.
Word 7843 ("making") appears 2 time.
Word 8390 ("else") appears 1 time.


#### Performing LDA with 12 topics initially

In [384]:
lda_model = gensim.models.LdaMulticore(corpus = bow_corpus,num_topics=12, id2word = dictionary, passes =10, workers = 2)

In [385]:
# For each topic, we will explore the words that occur in that topic and its relative weight:
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic ))
    print("\n")

Topic: 0 
Words: 0.105*"battery" + 0.046*"charging" + 0.042*"phone" + 0.027*"fast" + 0.023*"charge" + 0.019*"charger" + 0.019*"hours" + 0.019*"heating" + 0.017*"turbo" + 0.017*"time"


Topic: 1 
Words: 0.085*"phone" + 0.077*"awesome" + 0.050*"super" + 0.021*"perfect" + 0.019*"update" + 0.013*"piece" + 0.012*"performance" + 0.012*"oreo" + 0.010*"need" + 0.009*"till"


Topic: 2 
Words: 0.056*"working" + 0.044*"camera" + 0.032*"speaker" + 0.026*"quality" + 0.024*"phone" + 0.024*"product" + 0.018*"fine" + 0.016*"back" + 0.010*"earphone" + 0.010*"thanks"


Topic: 3 
Words: 0.126*"camera" + 0.075*"good" + 0.039*"quality" + 0.029*"battery" + 0.025*"phone" + 0.021*"front" + 0.020*"mode" + 0.018*"depth" + 0.016*"rear" + 0.016*"average"


Topic: 4 
Words: 0.137*"lenovo" + 0.126*"note" + 0.041*"better" + 0.020*"update" + 0.019*"mobile" + 0.018*"issues" + 0.017*"software" + 0.017*"product" + 0.017*"redmi" + 0.014*"worst"


Topic: 5 
Words: 0.072*"phone" + 0.022*"lenovo" + 0.021*"screen" + 0.012*"g

#### Getting Coherence Score

In [386]:
from gensim.models import CoherenceModel
coherence_model_lda = CoherenceModel(model = lda_model, texts = preprocessed_text, dictionary = dictionary, 
                                     coherence = 'c_v' )
coherence_lda = coherence_model_lda.get_coherence()
print("Coherence Score: ",coherence_lda)

Coherence Score:  0.4990008572811138


#### Naming the 12 topics

Looking at current 12 topics below is what we can list down

- Topic 0 - Handset Performane Satisfied
- Topic 1 - Battery and Customer Service issue
- Topic 2 - Display and Charging Satisfied
- Topic 3 - Camera and Sound Quality
- Topic 4 - Phone call and Network
- Topic 5 - Battery Heating Camera Performance and Price
- Topic 6 - Heating, Charging, and Games
- Topic 7 - Heating, Charging, Good Camera
- Topic 8 - Price and Features
- Topic 9 - Price and Features
- Topic 10 - Camera Battery and Price
- Topic 11 - Battery Features Camera

I think we can reduce it to 7 topics

#### Remodeling with 7 topics

In [249]:
## Remodeling with 7 topics
lda_model2 = gensim.models.LdaMulticore(corpus = bow_corpus,num_topics=7, id2word = dictionary, passes =12, workers = 2)

In [250]:
# For each topic, we will explore the words that occur in that topic and its relative weight:
for idx, topic in lda_model2.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic ))
    print("\n")

Topic: 0 
Words: 0.198*"phone" + 0.053*"good" + 0.028*"price" + 0.013*"worth" + 0.012*"range" + 0.009*"features" + 0.009*"amazing" + 0.009*"camera" + 0.009*"money" + 0.008*"awesome"


Topic: 1 
Words: 0.047*"phone" + 0.031*"battery" + 0.020*"charging" + 0.016*"problem" + 0.015*"mobile" + 0.015*"time" + 0.014*"network" + 0.014*"charge" + 0.014*"lenovo" + 0.013*"issue"


Topic: 2 
Words: 0.101*"lenovo" + 0.092*"note" + 0.034*"awesome" + 0.027*"better" + 0.021*"mobile" + 0.016*"features" + 0.014*"redmi" + 0.012*"superb" + 0.012*"smartphone" + 0.012*"killer"


Topic: 3 
Words: 0.140*"good" + 0.089*"camera" + 0.060*"battery" + 0.036*"quality" + 0.023*"product" + 0.023*"phone" + 0.022*"performance" + 0.019*"backup" + 0.015*"heating" + 0.014*"fast"


Topic: 4 
Words: 0.168*"mobile" + 0.167*"nice" + 0.034*"product" + 0.034*"super" + 0.011*"volte" + 0.010*"delivery" + 0.009*"recorder" + 0.007*"price" + 0.006*"expectations" + 0.006*"cost"


Topic: 5 
Words: 0.029*"call" + 0.024*"update" + 0.024*

#### Naming the new 7 topics

In [251]:
#Topic 0 - Value for Money
#Topic 1 - Battery Charging and network issue
#Topic 2 - Better feature in Lenovo note and Redmi
#Topic 3 - Good Camera and battery life
#Topic 4 - Fast Delivery and met expectation
#Topic 5 - Comparing with Note and Software update
#Topic 6 - Heating issue and Customer service

### Create a table with the topic name and the top 10 terms in each to present to the business.

In [345]:
topic_list = ["Value for Money","Battery Charging and network issue", "Better feature in Lenovo note and Redmi", 
                  "Good Camera and battery life", "Fast Delivery and met expectation","Comparing with Note and Software update",
                  "Heating issue and Customer service"]
topic_list
                  

['Value for Money',
 'Battery Charging and network issue',
 'Better feature in Lenovo note and Redmi',
 'Good Camera and battery life',
 'Fast Delivery and met expectation',
 'Comparing with Note and Software update',
 'Heating issue and Customer service']

In [346]:
# for idx, topic in lda_model2.print_topics(-1):
#     print("Topic: {} \n Words: {}".format(idx, topic))
#     print("\n")

In [347]:
print(lda_model2.get_topics())

[[7.10352970e-06 1.12843909e-03 5.44710970e-03 ... 1.03565231e-04
  7.10159202e-06 7.10160293e-06]
 [4.31280569e-06 9.91523964e-04 8.92002601e-03 ... 4.31126637e-06
  4.31126591e-06 4.31127000e-06]
 [2.53595324e-04 7.40426258e-05 4.09475854e-03 ... 1.42440522e-05
  2.08158774e-04 1.42440922e-05]
 ...
 [1.90649243e-05 1.90855426e-05 6.77437522e-04 ... 1.90649334e-05
  1.90649243e-05 1.90650098e-05]
 [1.16302090e-05 5.24277426e-03 1.27928688e-05 ... 1.16000838e-05
  1.16000801e-05 1.16001111e-05]
 [1.06912676e-05 1.07097503e-05 2.94126030e-02 ... 1.07131855e-05
  1.06912648e-05 1.06912930e-05]]


In [348]:
print(lda_model2.print_topic(0))

0.198*"phone" + 0.053*"good" + 0.028*"price" + 0.013*"worth" + 0.012*"range" + 0.009*"features" + 0.009*"amazing" + 0.009*"camera" + 0.009*"money" + 0.008*"awesome"


In [349]:
print(lda_model2.show_topic(0)[0][0])

phone


In [350]:
topic_word_list = []
data_test={}
for idx, topic in lda_model2.print_topics(-1):
#     print(idx)
#     print(lda_model2.show_topic(idx))
    line_list = []
    for ww in range(0,10):
#         print(lda_model2.show_topic(idx)[ww][0])
        line_list.append(lda_model2.show_topic(idx)[ww][0])
    topic_word_list.append(line_list)
    
    data_test[topic_list[idx]] = line_list
    

In [351]:
line_list

['product',
 'working',
 'heating',
 'amazon',
 'problem',
 'service',
 'worst',
 'return',
 'properly',
 'charger']

In [353]:
df_test = pd.DataFrame(data = topic_list)

In [354]:
df = pd.DataFrame(data_test)

In [355]:
df

Unnamed: 0,Value for Money,Battery Charging and network issue,Better feature in Lenovo note and Redmi,Good Camera and battery life,Fast Delivery and met expectation,Comparing with Note and Software update,Heating issue and Customer service
0,phone,phone,lenovo,good,mobile,call,product
1,good,battery,note,camera,nice,update,working
2,price,charging,awesome,battery,product,screen,heating
3,worth,problem,better,quality,super,lenovo,amazon
4,range,mobile,mobile,product,volte,option,problem
5,features,time,features,phone,delivery,cast,service
6,amazing,network,redmi,performance,recorder,note,worst
7,camera,charge,superb,backup,price,software,return
8,money,lenovo,smartphone,heating,expectations,recording,properly
9,awesome,issue,killer,fast,cost,phone,charger
