This notebook applies topic modelling to positive and negative reviews. I trained a separate model for each of these categories, each with 10 topics, and list the most probable words for every topic.

The aim is to show what positive reviews have in common and what negative reviews have in common. The words under each topic indicate what users pay attention to, and what matters to them, when they leave a positive or a negative comment.

In [1]:
import nltk
import gensim
from nltk.corpus import stopwords

In [2]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Preprocessing

In [3]:
file = "Cell_Phones_&_Accessories.txt"
# read-only access is enough; the context manager closes the file afterwards
with open(file, "r", encoding="UTF-8") as f_hand:
    contents = f_hand.readlines()
# each review record spans 11 lines, so group the lines into one list per review
contents = [contents[x:x+11] for x in range(0, len(contents), 11)]

In [4]:
# split reviews into negative (score 1.0 or 2.0) and positive (score 4.0 or 5.0);
# reviews with score 3.0 are skipped because they are neutral
pos_revs = []
neg_revs = []
for rev in contents:
    if "review/score: 1.0\n" in rev or "review/score: 2.0\n" in rev:
        neg_revs.append(rev)
    elif "review/score: 4.0\n" in rev or "review/score: 5.0\n" in rev:
        pos_revs.append(rev)
    else:
        continue
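
The grouping in cell 3 and the membership tests above rely on every record occupying exactly 11 lines and on the score line matching one of four literal strings. A minimal sketch of a more defensive alternative, assuming each field line has the form `key: value` (as the `review/score: 1.0` line does). The helpers `parse_record` and `label` and the sample record are hypothetical and not part of the original notebook:

```python
def parse_record(lines):
    """Turn the lines of one review record into a {field: value} dict."""
    record = {}
    for line in lines:
        key, sep, value = line.partition(": ")
        if sep:  # skip blank separator lines and anything without "key: value"
            record[key.strip()] = value.strip()
    return record


def label(record):
    """Return 'neg' for scores up to 2, 'pos' for scores from 4, else None."""
    try:
        score = float(record.get("review/score", ""))
    except ValueError:
        return None  # missing or malformed score
    if score <= 2.0:
        return "neg"
    if score >= 4.0:
        return "pos"
    return None  # neutral review


# Hypothetical sample record, shortened to two fields for illustration.
sample = ["review/score: 5.0\n", "review/text: Great cable, easy to use.\n", "\n"]
rec = parse_record(sample)
print(label(rec), "|", rec["review/text"])
```

Looking fields up by name rather than by position means a record with a missing or extra line is skipped instead of silently shifting every later field.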

In [5]:
stop_words = stopwords.words('english')

In [6]:
# preprocessing
def get_data(revs):
    new_revs = [rev[9] for rev in revs]  # index 9 of each record holds the review text line
    processed = []
    for rev in new_revs:
        # rev[12:] drops the leading field label; keep purely alphabetic tokens that are
        # not in stop_words and lowercase them (the stopword test runs before lowercasing)
        rev = [w.lower() for w in rev[12:].split() if w.isalpha() and w not in stop_words]
        processed.append(rev)
    return processed
    

In [7]:
# apply the previous function to pos and neg reviews separately
pos_revs = get_data(pos_revs) 
neg_revs = get_data(neg_revs)

In [8]:
print(pos_revs[0])
print(neg_revs[0])

['great', 'tried', 'others', 'ten', 'compared', 'real', 'easy', 'use', 'definite', 'recommended', 'buy', 'transfer', 'data']
['first', 'company', 'took', 'money', 'sent', 'email', 'telling', 'product', 'a', 'week', 'half', 'later', 'i', 'received', 'another', 'email', 'telling', 'actually', 'i', 'received', 'email', 'telling', 'i', 'finally', 'got', 'money', 'i', 'went', 'another', 'company', 'buy', 'product', 'work', 'even', 'though', 'depicts', 'i', 'sent', 'numerous', 'emails', 'company', 'i', 'actually', 'find', 'phone', 'number', 'website', 'i', 'still', 'gotten', 'kind', 'what', 'kind', 'customer', 'service', 'no', 'one', 'help', 'my', 'advice', 'waste']
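
The negative sample above still contains `i`, `a`, `what`, `no` and `my`. A likely cause is that `get_data` tests `w not in stop_words` before lowercasing, while the NLTK list is lowercase, so capitalised tokens such as `I` or `My` pass the filter. In addition, `w.isalpha()` discards any token with attached punctuation (for example `phone.`), which drops real words. Below is a minimal sketch of one possible fix. It uses a tiny stand-in stop list so that it runs without NLTK; `clean_tokens` and `STOP` are hypothetical names:

```python
import re

# Stand-in for stopwords.words('english'); the real list is much longer.
STOP = {"i", "a", "the", "my", "it", "is", "no", "what"}


def clean_tokens(text, stop_words=STOP):
    """Lowercase first, extract alphabetic runs, then drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]


print(clean_tokens("I received the phone. My advice: it is a waste!"))
# prints ['received', 'phone', 'advice', 'waste']
```

One trade-off of the regular expression is that contractions are split (`doesn't` becomes `doesn` and `t`), so a real pipeline might prefer a proper tokenizer. Rerunning the notebook with such a change would also alter the topics reported below.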


# Create bags of words

In [9]:
# combine both review sets so that a single dictionary covers all words
def get_all(pos, neg):
    # concatenate the processed positive and negative reviews
    return pos + neg
        

In [10]:
all_words = get_all(pos_revs, neg_revs)
print(all_words[:10])

[['great', 'tried', 'others', 'ten', 'compared', 'real', 'easy', 'use', 'definite', 'recommended', 'buy', 'transfer', 'data'], ['works', 'real', 'little', 'hard', 'set', 'part', 'doesnt', 'work', 'thru', 'handset', 'manager', 'go', 'networking', 'turn'], ['the', 'price', 'right', 'cable', 'compared', 'sony', 'ericssons', 'offering', 'there', 'different', 'prices', 'amazon', 'make', 'sure', 'get', 'i', 'popped', 'cd', 'followed', 'line', 'instruction', 'file', 'disk', 'phone', 'started', 'if', 'similar', 'problem', 'may', 'try', 'i', 'previously', 'installed', 'variety', 'downloadable', 'software', 'it', 'took', 'minutes', 'fumbling', 'i', 'made', 'way', 'phone', 'monitor', 'options', 'opening', 'phone', 'from', 'window', 'hit', 'options', 'there', 'may', 'better', 'way', 'get', 'i', 'go', 'com', 'ports', 'tab', 'enable', 'com', 'it', 'told', 'wouldnt', 'enable', 'ir', 'within', 'phone', 'presumably', 'one', 'remaining', 'i', 'may', 'issue', 'i', 'installed', 'software', 'fresh', 'i', '

In [11]:
# create a dictionary
dictionary = gensim.corpora.Dictionary(all_words)

In [12]:
# bags of words for the first 10,000 reviews of each set
dic_pos = [dictionary.doc2bow(rev) for rev in pos_revs[:10000]]
dic_neg = [dictionary.doc2bow(rev) for rev in neg_revs[:10000]]
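
For readers unfamiliar with the representation: `doc2bow` maps a token list to a sparse list of `(token_id, count)` pairs, using the ids stored in the dictionary. The pure-Python sketch below imitates that idea with a hand-built vocabulary. It is an illustration only, not gensim's implementation, and `build_vocab` and `to_bow` are hypothetical helpers:

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs; tokens missing from vocab are skipped."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())


docs = [["great", "phone", "great", "price"], ["bad", "phone"]]
vocab = build_vocab(docs)
print(to_bow(docs[0], vocab))
# prints [(0, 2), (1, 1), (2, 1)]
```

Word order is discarded in this representation; only how often each word occurs in a review is passed on to the topic model.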

# Topics in positive reviews 

In [13]:
# run LDA on the positive reviews
lda_model = gensim.models.ldamodel.LdaModel(dic_pos, 
                                           id2word = dictionary,
                                           num_topics=10, 
                                           random_state=100, 
                                           update_every=1, 
                                           chunksize=100, 
                                           passes=10, 
                                           alpha="auto") 

In [14]:
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")

Topic: 0 
Words: 0.046*"around" + 0.031*"card" + 0.028*"hours" + 0.021*"open" + 0.020*"palm" + 0.020*"access" + 0.016*"simply" + 0.016*"to" + 0.015*"deal" + 0.014*"is"


Topic: 1 
Words: 0.094*"i" + 0.040*"phone" + 0.024*"the" + 0.014*"one" + 0.013*"battery" + 0.013*"great" + 0.013*"it" + 0.011*"this" + 0.011*"good" + 0.010*"use"


Topic: 2 
Words: 0.079*"service" + 0.044*"pictures" + 0.043*"fine" + 0.036*"ringtones" + 0.036*"charging" + 0.028*"range" + 0.020*"live" + 0.017*"thank" + 0.013*"sending" + 0.012*"none"


Topic: 3 
Words: 0.051*"download" + 0.029*"online" + 0.019*"u" + 0.018*"stuff" + 0.018*"notice" + 0.018*"lock" + 0.013*"hardly" + 0.012*"straight" + 0.008*"saves" + 0.005*"smallest"


Topic: 4 
Words: 0.024*"cell" + 0.020*"screen" + 0.019*"camera" + 0.017*"phones" + 0.015*"long" + 0.014*"voice" + 0.012*"features" + 0.012*"car" + 0.012*"button" + 0.010*"hard"


Topic: 5 
Words: 0.028*"delivery" + 0.009*"anymore" + 0.008*"nicer" + 0.000*"outbound" + 0.000*"fone" + 0.000*"schw

Possible topics (= reasons for positive comments) that I deduced from the words:

1) simplicity and accessibility
2) battery life
3) pictures and ringtones
4) cool stuff that can be downloaded
5) good camera
6) fast delivery
7) general look
8) good case quality
9) manageable
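
Some of the printed topics are dominated by function words such as `i`, `the` and `it`, which makes manual labelling harder. The sketch below shows how a string printed by `print_topics` could be parsed into `(word, weight)` pairs and filtered before reading it. The regular expression assumes the `weight*"word"` format shown in the output above, and `parse_topic` and `NOISE` are hypothetical names:

```python
import re

# Function words seen in the printed topics; extend as needed.
NOISE = {"i", "the", "it", "this", "to", "is"}


def parse_topic(topic_str):
    """Parse '0.094*"i" + 0.040*"phone"' into [(word, weight), ...]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_str)
    return [(word, float(weight)) for weight, word in pairs]


# The first six terms of Topic 1 from the positive model.
topic_1 = ('0.094*"i" + 0.040*"phone" + 0.024*"the" + 0.014*"one" + '
           '0.013*"battery" + 0.013*"great"')
content_words = [w for w, _ in parse_topic(topic_1) if w not in NOISE]
print(content_words)
# prints ['phone', 'one', 'battery', 'great']
```

When the model object is still in memory, gensim's `show_topic` returns word and probability pairs directly, so the string parsing is only needed when working from saved output.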

# Topics in negative reviews

In [15]:
lda_model2 = gensim.models.ldamodel.LdaModel(dic_neg, 
                                           id2word = dictionary,
                                           num_topics=10, 
                                           random_state=100, 
                                           update_every=1, 
                                           chunksize=100, 
                                           passes=10, 
                                           alpha="auto")

In [16]:
for idx, topic in lda_model2.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")

Topic: 0 
Words: 0.048*"usb" + 0.046*"clip" + 0.043*"case" + 0.033*"palm" + 0.027*"feature" + 0.026*"belt" + 0.017*"speak" + 0.016*"luck" + 0.013*"holster" + 0.012*"heavy"


Topic: 1 
Words: 0.035*"the" + 0.021*"headset" + 0.017*"it" + 0.015*"use" + 0.014*"ear" + 0.013*"good" + 0.013*"like" + 0.012*"sound" + 0.012*"this" + 0.012*"bluetooth"


Topic: 2 
Words: 0.126*"i" + 0.045*"phone" + 0.016*"one" + 0.014*"would" + 0.014*"get" + 0.011*"battery" + 0.008*"work" + 0.008*"even" + 0.008*"time" + 0.008*"bought"


Topic: 3 
Words: 0.034*"software" + 0.032*"music" + 0.019*"computer" + 0.014*"dont" + 0.013*"needs" + 0.013*"issue" + 0.012*"windows" + 0.012*"iphone" + 0.011*"itunes" + 0.010*"internet"


Topic: 4 
Words: 0.058*"verizon" + 0.040*"treo" + 0.033*"card" + 0.025*"contact" + 0.021*"all" + 0.018*"sim" + 0.016*"pink" + 0.013*"keyboard" + 0.012*"add" + 0.012*"options"


Topic: 5 
Words: 0.032*"remove" + 0.024*"rating" + 0.024*"accessories" + 0.007*"glove" + 0.000*"flipper" + 0.000*"linksy

Possible topics (= reasons for negative comments) that I deduced from the words covered by the topics:

1) too heavy
2) bad battery 
3) issues with software and complicated interface
4) accessories that are difficult to remove?