### Reviews about Tada

In this notebook, we perform topic modelling on reviews of Tada gathered from various review sites, grouped by sentiment.

In [20]:
# Base
import os, re, string
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# from wordcloud import WordCloud

# NLTK
import nltk
nltk.download('stopwords')
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.metrics.distance import edit_distance
from spellchecker import SpellChecker
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.stem.wordnet import WordNetLemmatizer

# Topic Modelling
import pyLDAvis
import pyLDAvis.gensim
import gensim
from gensim import corpora
pyLDAvis.enable_notebook()

[nltk_data] Downloading package stopwords to D:\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Data Preparation

We first locate the relevant CSV files and then extract their contents into a single list of reviews.

In [21]:
# Merge the review text from every CSV file in the folder into one list

path = '.'

filename_sentiment_list = []

for filename in os.listdir(path):
    if filename.endswith(".csv"):
        # Show which files in the folder are being read
        filepath = os.path.join(path, filename)
        print(filepath)

        # Read the file into a DataFrame
        df = pd.read_csv(filepath, encoding="ISO-8859-1")

        # Add the text columns to the list; a missing column raises
        # KeyError, which ends the collection for this file
        try:
            filename_sentiment_list.extend(df['comment'].tolist())
            filename_sentiment_list.extend(df['tweet'].tolist())
            filename_sentiment_list.extend(df['title'].tolist())
        except KeyError:
            pass

print(filename_sentiment_list)

.\gplay_tada_en.csv
.\reddit_tada_singapore.csv
.\reddit_tada_singapore_extracted.csv
.\twitter_tada_mvl.csv
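
The review text sits in a different column depending on the source (`comment`, `tweet` or `title`). A minimal sketch of collecting each column independently, so that a column missing from one file does not affect the others, is given below. It uses only the standard `csv` module; the helper name `collect_text` and the in-memory sample files are hypothetical and stand in for the real exports.

```python
import csv
import io

# Column names taken from the notebook
TEXT_COLUMNS = ("comment", "tweet", "title")

def collect_text(csv_file, columns=TEXT_COLUMNS):
    """Return the non-empty values of every requested column present in the file."""
    reader = csv.DictReader(csv_file)
    present = [c for c in columns if c in (reader.fieldnames or [])]
    texts = []
    for row in reader:
        for col in present:
            value = (row[col] or "").strip()
            if value:
                texts.append(value)
    return texts

# Hypothetical in-memory files standing in for the real CSV exports
play_store = io.StringIO("comment,rating\nCannot find place,1\nGood initiative,5\n")
twitter = io.StringIO("tweet,user\nApp keeps crashing,someone\n")

reviews = collect_text(play_store) + collect_text(twitter)
print(reviews)
```

Checking which columns exist before reading them avoids relying on exceptions for control flow.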


In [22]:
filename_sentiment_list[5:7]

['Gps location inaccurate show far away for driver pickup',
 "I can't get my gps to get registered in and I can't lock the place where I want to go to. it's quite confusing please edit the appð\x9f\x91\x8d and I have a feeling that this TADA will give a tough fight to grab so get the app edited as soon as possible please"]

### Cleaning

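We then clean the reviews: each one is tokenised, stopwords and Twitter or URL fragments are removed, and the remaining words are later lemmatised.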

In [23]:
def clean_list_tokenise(reviews_list):
    # Tokenise the words
    tokenizer = RegexpTokenizer(r'\w+')
    return [tokenizer.tokenize(sentence) for sentence in reviews_list]

def clean_list_lemma(reviews_list):
    # Lemmatization
    lemma = WordNetLemmatizer()
    return [[lemma.lemmatize(word) for word in sentence] for sentence in reviews_list]

def clean_list_stopwords(reviews_list, stop_other=()):
    # Build the stopword set once instead of once per word
    stop_words = set(stopwords.words('english')) | set(stop_other)

    def stopword_condition(word):
        word = word.lower()

        # Keep the word unless it is a stopword, a URL fragment,
        # a retweet marker or a Twitter handle
        return word not in stop_words \
            and not word.startswith('http') \
            and word != 'rt' \
            and not word.startswith('@')

    return [[w for w in s if stopword_condition(w)] for s in reviews_list]
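
The tokenising and stopword steps can be illustrated without any NLTK data. The sketch below assumes a tiny, hypothetical stopword set (the notebook uses NLTK's English list) and the standard `re` module with the same `\w+` pattern; the sample sentence is made up for illustration.

```python
import re

# Hypothetical, tiny stopword set; the notebook uses NLTK's English list instead
STOPWORDS = {"and", "i", "a", "the", "to"}

def tokenise(text):
    # Same pattern as RegexpTokenizer(r'\w+'): runs of word characters
    return re.findall(r"\w+", text)

def drop_stopwords(tokens, stop=STOPWORDS):
    # Case-insensitive membership test against a set
    return [t for t in tokens if t.lower() not in stop]

sample = "30 mins later and I still cannot find a driver"
tokens = drop_stopwords(tokenise(sample))
print(tokens)
# ['30', 'mins', 'later', 'still', 'cannot', 'find', 'driver']
```

Because `\w+` only matches word characters, punctuation such as `@`, `:` and `/` never reaches the stopword filter.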

In [24]:
sentiment_filtered = clean_list_stopwords(clean_list_tokenise(filename_sentiment_list))
len(sentiment_filtered)

155

In [25]:
sentiment_filtered_sentences = [' '.join(s) for s in sentiment_filtered]
sentiment_filtered_sentences[:5]

['Never found driver several uses going',
 '30 mins later still cannot find driver',
 'Change login interface login sign option sign cuz mobile number already used',
 'Fair always super high compare Ryde Grab Planning uninstall soon']

### Segregating by Sentiment

We will segregate the reviews by sentiment. This is achieved by the `analyse_sentiment_vader` function, which uses VADER to score how positive, negative or neutral each review or comment is. VADER also computes a compound score, ranging from -1 (most negative) to 1 (most positive), that summarises the overall sentiment of the text. Further reading on the sentiment analysis is available at:

- https://opensourceforu.com/2016/12/analysing-sentiments-nltk/
- http://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html
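
To give a feel for why the compound score stays between -1 and 1, here is a minimal sketch of the normalisation step as it is understood to work in NLTK's VADER implementation: the summed word valences are squashed with `x / sqrt(x² + alpha)`. The default `alpha=15` and the function name `normalise` are assumptions for illustration, not something shown in this notebook.

```python
import math

def normalise(score, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)

# A few hypothetical raw valence sums
for raw in (-4.0, 0.0, 1.9, 10.0):
    print(raw, round(normalise(raw), 4))
```

A raw sum of 1.9 maps to 0.4404, which happens to equal the compound value reported for "Good initiative" in this notebook's output. That would be consistent with a single scored word, although the lexicon values themselves are not shown here.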

In [26]:
def analyse_sentiment_vader(df, col_name):
    # Score every text in df[col_name] with VADER
    sid = SentimentIntensityAnalyzer()

    # Expand the score dictionaries into neg / neu / pos / compound columns
    scores = df[col_name].apply(sid.polarity_scores).apply(pd.Series)
    return df.join(scores)

In [27]:
df_sentiment_filtered = pd.DataFrame(sentiment_filtered_sentences, columns=['comment'])
df_sentiment_filtered = analyse_sentiment_vader(df_sentiment_filtered, 'comment')
df_sentiment_filtered.head(10)

| | comment | neg | neu | pos | compound |
|---|---|---|---|---|---|
| 0 | Never found driver several uses going | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 30 mins later still cannot find driver | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | Driver made wait 5min accepting canceled witho... | 0.088 | 0.697 | 0.215 | 0.409 |
| 3 | Change login interface login sign option sign ... | 0.0 | 0.894 | 0.106 | 0.0772 |
| 4 | Fair always super high compare Ryde Grab Plann... | 0.0 | 0.563 | 0.437 | 0.7351 |
| 5 | Gps location inaccurate show far away driver p... | 0.0 | 1.0 | 0.0 | 0.0 |
| 6 | get gps get registered lock place want go quit... | 0.195 | 0.558 | 0.247 | 0.1689 |
| 7 | Awesome ride hailing app much better grab | 0.0 | 0.417 | 0.583 | 0.7906 |
| 8 | TADA drivers tendencies turn pick another cust... | 0.282 | 0.653 | 0.065 | -0.6495 |
| 9 | cant even sign n apps say already use | 0.0 | 1.0 | 0.0 | 0.0 |


In [28]:
df_sentiment_filtered.tail(10)

| | comment | neg | neu | pos | compound |
|---|---|---|---|---|---|
| 145 | Cannot find place | 0.0 | 1.0 | 0.0 | 0.0 |
| 146 | find destination location | 0.0 | 1.0 | 0.0 | 0.0 |
| 147 | App crashes creating account Using s8 | 0.0 | 0.694 | 0.306 | 0.296 |
| 148 | publish article CNA app even opened | 0.0 | 1.0 | 0.0 | 0.0 |
| 149 | App keeps crashing launch | 0.0 | 1.0 | 0.0 | 0.0 |
| 150 | Bad app Cannot register | 0.538 | 0.462 | 0.0 | -0.5423 |
| 151 | Good initiative | 0.0 | 0.256 | 0.744 | 0.4404 |
| 152 | Pls fix app auto close rate app properly | 0.0 | 0.843 | 0.157 | 0.0772 |
| 153 | Application keep crashing | 0.0 | 1.0 | 0.0 | 0.0 |
| 154 | Unable open app | 0.0 | 1.0 | 0.0 | 0.0 |


In [29]:
pos_list = list(df_sentiment_filtered[df_sentiment_filtered['compound'] >= 0]['comment'])
neg_list = list(df_sentiment_filtered[df_sentiment_filtered['compound'] < 0]['comment'])

len(pos_list), len(neg_list)

(124, 31)
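
The split above places every review with a compound score of exactly 0 in the positive list. As an alternative, a three-way labelling with a neutral band could be sketched as follows. The ±0.05 cut-off is a commonly suggested convention for VADER and is an assumption here, not the notebook's choice; the function name `label` is hypothetical, and the scores are copied from the `head(10)` output.

```python
def label(compound, threshold=0.05):
    """Map a compound score to 'positive', 'negative' or 'neutral'."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores copied from the head(10) output of this notebook
scores = [0.0, 0.409, 0.7351, -0.6495, 0.0772]
labels = [label(s) for s in scores]
print(labels)
# ['neutral', 'positive', 'positive', 'negative', 'positive']
```

With a neutral band, reviews that VADER could not score (compound 0.0) would no longer be mixed into the positive topics.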

In [30]:
pos_list_clean = clean_list_lemma(clean_list_tokenise(pos_list))
neg_list_clean = clean_list_lemma(clean_list_tokenise(neg_list))

In [31]:
len(pos_list_clean), len(neg_list_clean)

(124, 31)

Ignore code below

### Topic Modelling (Negative)

In [32]:
neg_dict = corpora.Dictionary(neg_list_clean)
doc_term_matrix = [neg_dict.doc2bow(doc) for doc in neg_list_clean]

print(doc_term_matrix[0:10])

[[(0, 1), (1, 1), (2, 2), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 2), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1)], [(8, 1), (18, 1), (19, 1), (20, 1), (21, 1)], [(3, 1), (22, 1), (23, 1), (24, 2), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1)], [(3, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1)], [(8, 1), (31, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1)], [(4, 1), (8, 2), (34, 2), (35, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 2), (51, 1), (52, 1), (53, 1), (54, 2), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1)], [(8, 1), (50, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1)], [(8, 1), (33, 1), (35, 1), (39, 1), (56, 2), (60, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1)], [(31, 1), (50, 1), (84, 1), (85, 1), (86, 1)], [(8, 1), (56, 1), (76, 1), (87, 1), (88, 2), 
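
Each inner list above is a bag of words: pairs of `(token id, count)` for one review. The idea can be sketched in plain Python as shown below. The helper names `build_vocab` and `doc2bow` are hypothetical stand-ins, and ids are assigned in order of first appearance, which may differ from how gensim's `Dictionary` numbers its tokens.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each unique token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def doc2bow(doc, vocab):
    """Return sorted (token_id, count) pairs; tokens not in vocab are ignored."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

# Hypothetical tokenised reviews
docs = [["app", "keep", "crashing", "app"], ["driver", "cancel", "app"]]
vocab = build_vocab(docs)
bows = [doc2bow(d, vocab) for d in docs]
print(bows)
# [[(0, 2), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1)]]
```

Because tokens missing from the vocabulary are silently dropped, a corpus should be encoded with the dictionary that was built from that same corpus.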

In [44]:
# Alias the LDA model class from the gensim library
Lda = gensim.models.ldamodel.LdaModel

# Run and train the LDA model on the document-term matrix.
# doc_term_matrix = term frequencies for every document
# neg_dict = dictionary mapping ids to all unique terms
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word=neg_dict, passes=50)

In [45]:
print(ldamodel.print_topics(num_topics=3, num_words=5))

[(0, '0.039*"keep" + 0.039*"sign" + 0.039*"code" + 0.034*"App" + 0.034*"location"'), (1, '0.108*"app" + 0.040*"ride" + 0.037*"Grab" + 0.034*"try" + 0.029*"using"'), (2, '0.113*"driver" + 0.076*"app" + 0.056*"time" + 0.037*"price" + 0.033*"Grab"')]
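
Each topic is printed as a string of `weight*"word"` terms. For further processing, such a string can be split into `(word, weight)` pairs; the sketch below does this with a regular expression on topic 2 from the output above. The helper name `parse_topic` is hypothetical.

```python
import re

# Topic 2 copied from the print_topics output above
topic = '0.113*"driver" + 0.076*"app" + 0.056*"time" + 0.037*"price" + 0.033*"Grab"'

def parse_topic(topic_string):
    """Turn '0.113*"driver" + ...' into [('driver', 0.113), ...]."""
    pairs = re.findall(r'([\d.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

terms = parse_topic(topic)
print(terms)
```

Note that the output above lists both `App` and `app` as separate terms, which suggests that lower-casing the tokens before building the dictionary might merge them.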


In [46]:
pyLDAvis.gensim.prepare(ldamodel, doc_term_matrix, neg_dict)


### Topic Modelling (Positive)

In [36]:
pos_dict = corpora.Dictionary(pos_list_clean)
# Use pos_dict here so that the token ids match the id2word mapping passed to the LDA model
doc_term_matrix = [pos_dict.doc2bow(doc) for doc in pos_list_clean]

print(doc_term_matrix[0:10])


In [42]:
# Alias the LDA model class from the gensim library
Lda = gensim.models.ldamodel.LdaModel

# Run and train the LDA model on the document-term matrix.
# doc_term_matrix = term frequencies for every document
# pos_dict = dictionary mapping ids to all unique terms
ldamodel = Lda(doc_term_matrix, num_topics=2, id2word=pos_dict, passes=50)

In [43]:
pyLDAvis.gensim.prepare(ldamodel, doc_term_matrix, pos_dict)

