### Reviews about CDG

In this notebook, we perform topic modelling on reviews and comments about ComfortDelGro (CDG) collected from Google Play, Reddit and Twitter, after splitting them by sentiment.

In [1]:
# Base
import os, re, string
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# from wordcloud import WordCloud

# NLTK
import nltk
nltk.download('stopwords')
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.metrics.distance import edit_distance
from spellchecker import SpellChecker
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.stem.wordnet import WordNetLemmatizer

# Topic Modelling
import pyLDAvis
import pyLDAvis.gensim
import gensim
from gensim import corpora
pyLDAvis.enable_notebook()

[nltk_data] Downloading package stopwords to D:\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!




### Data Preparation

We first find the CSV files in the working folder and then extract their text columns into a single list of reviews.

In [2]:
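# Merge the text columns of every CSV file in the folder into one list

path = '.'

filename_sentiment_list = []

for filename in os.listdir(path):
    if filename.endswith(".csv"):
        # Show which files in the folder are being read
        print(os.path.join(path, filename))

        # Read the file into a DataFrame (join with path so other folders work too)
        df = pd.read_csv(os.path.join(path, filename), encoding="ISO-8859-1")

        # Add the text columns to the list. The first missing column raises
        # KeyError and ends the block, so later columns are skipped for that file.
        try:
            filename_sentiment_list.extend(df['comment'].tolist())
            filename_sentiment_list.extend(df['tweet'].tolist())
            filename_sentiment_list.extend(df['title'].tolist())
        except KeyError:
            pass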

print(filename_sentiment_list)

.\gplay_cdge.csv
.\reddit_cdgtaxi.csv
.\reddit_cdgtaxi_en_extracted.csv
.\reddit_comfortdelgro.csv
.\reddit_comfortdelgro_extracted.csv
.\twitter_cdgtaxi.csv
.\twitter_comfortdelgro.csv
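
Because the `try` block above stops at the first missing column, a file that has a `tweet` column but no `comment` column contributes nothing. A minimal sketch of checking each column independently is shown below. It uses the standard-library `csv` module instead of pandas so that it runs on its own, and the helper name `collect_texts` and the in-memory sample files are hypothetical, not part of the original notebook.

```python
import csv
import io

# Column names taken from the loading loop above.
TEXT_COLUMNS = ("comment", "tweet", "title")

def collect_texts(csv_file, columns=TEXT_COLUMNS):
    """Return every non-empty value found in any of the given columns."""
    reader = csv.DictReader(csv_file)
    rows = list(reader)
    # Only keep the columns that this particular file actually has.
    present = [c for c in columns if c in (reader.fieldnames or [])]
    texts = []
    for col in present:
        texts.extend(row[col] for row in rows if row[col])
    return texts

# Hypothetical in-memory files standing in for the real CSVs.
reviews_csv = io.StringIO("comment,score\nEasy to use,5\nGood app.,4\n")
tweets_csv = io.StringIO("tweet,user\nBooked a cab today,alice\n")

all_texts = collect_texts(reviews_csv) + collect_texts(tweets_csv)
print(all_texts)
```

With this approach every file contributes whichever of the three columns it contains, so the size of the resulting list may differ from the one produced by the loop above.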


In [3]:
filename_sentiment_list[5:7]

['Easy to use', 'Good app.']

### Cleaning

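Next we clean the reviews: tokenise them, remove stopwords along with URL and retweet fragments, and (later) lemmatise the remaining words.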

In [4]:
def clean_list_tokenise(reviews_list):
    # Tokenise the words
    tokenizer = RegexpTokenizer(r'\w+')
    return [tokenizer.tokenize(sentence) for sentence in reviews_list]

def clean_list_lemma(reviews_list):
    # Lemmatization
    lemma = WordNetLemmatizer()
    return [[lemma.lemmatize(word) for word in sentence] for sentence in reviews_list]

def clean_list_stopwords(reviews_list, stop_other=None):
    # Build the stop list once instead of reloading it for every word
    stop_words = set(stopwords.words('english')) | set(stop_other or [])

    def stopword_condition(word):
        # Compare in lower case, so every pattern below must be lower case too
        word = word.lower()

        return word not in stop_words \
            and not word.startswith('http') \
            and word != 'rt' \
            and not word.startswith('@')

    return [[w for w in s if stopword_condition(w)] for s in reviews_list]

In [5]:
sentiment_filtered = clean_list_stopwords(clean_list_tokenise(filename_sentiment_list))
len(sentiment_filtered)

1193

In [6]:
sentiment_filtered_sentences = [' '.join(s) for s in sentiment_filtered]
sentiment_filtered_sentences[:5]

['Horrible experience waited 8mins cab arrive cancelled stating show WTH',
 'better Please benchmark Grab get better customer satisfactory eg using credit card ride completed',
 'good',
 'Good one',
 'Unable use app SMS OTP delivered Malaysia number']
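
To see what the two cleaning steps do to a tweet-like string, here is a small standalone sketch. It assumes that `RegexpTokenizer(r'\w+')` behaves much like `re.findall(r'\w+', text)`, and it uses a tiny hypothetical stop list in place of NLTK's English list; the sample sentence is invented for illustration.

```python
import re

def tokenise(text):
    # Keep runs of word characters; punctuation such as '@', ':' and '/' is dropped.
    return re.findall(r"\w+", text)

# Tiny hypothetical stop list; the notebook uses NLTK's English list instead.
STOP = {"the", "to", "is", "rt"}

def keep(word):
    w = word.lower()
    return w not in STOP and not w.startswith("http")

sample = "RT @cdgtaxi: The app is easy to use https://t.co/abc"
tokens = tokenise(sample)
filtered = [w for w in tokens if keep(w)]
print(tokens)
print(filtered)
```

Note that the tokeniser has already removed the `@` sign and split the URL into fragments before any filtering happens, so fragments such as `t`, `co` and `abc` survive unless they are added to the stop list.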

### Segregating by Sentiment

We will segregate the reviews by sentiment using the `analyse_sentiment_vader` function, which applies VADER (through NLTK's `SentimentIntensityAnalyzer`) to score how positive, negative or neutral each review or comment is. VADER also returns a single compound score ranging from -1 (most negative) to 1 (most positive) that summarises the overall sentiment. Further reading on the sentiment analysis is available at:

- https://opensourceforu.com/2016/12/analysing-sentiments-nltk/
- http://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html
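
As a rough illustration of where the compound score comes from: to my understanding of the NLTK implementation, VADER sums the valences of the words in a text and squashes the sum into the range -1 to 1 with `x / sqrt(x^2 + alpha)`, where `alpha` defaults to 15. The sketch below reproduces only that final normalisation step. The valence of 1.9 for the word "good" is an assumption about the VADER lexicon, and the function name `normalise` is chosen here for illustration.

```python
import math

def normalise(score, alpha=15):
    """Map an unbounded valence sum into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)

# Assumed lexicon valence for the single word "good".
good_valence = 1.9
print(round(normalise(good_valence), 4))
```

Under that assumption the result is 0.4404, which matches the compound score reported for the one-word review `good` in the results table later in this notebook.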

In [7]:
def analyse_sentiment_vader(df, col_name):
    sid = SentimentIntensityAnalyzer()
    vader = lambda text: sid.polarity_scores(text)
    
    df['vader'] = df[col_name].apply(vader)
    df = pd.merge(df, df['vader'].apply(pd.Series), left_index=True, right_index=True)
    return df.drop(['vader'], axis=1)

In [8]:
df_sentiment_filtered = pd.DataFrame(sentiment_filtered_sentences, columns=['comment'])
df_sentiment_filtered = analyse_sentiment_vader(df_sentiment_filtered, 'comment')
df_sentiment_filtered.head(10)

| | comment | neg | neu | pos | compound |
|---|---|---|---|---|---|
| 0 | Horrible experience waited 8mins cab arrive ca... | 0.579 | 0.421 | 0.0 | -0.8636 |
| 1 | better Please benchmark Grab get better custom... | 0.0 | 0.405 | 0.595 | 0.9042 |
| 2 | good | 0.0 | 0.0 | 1.0 | 0.4404 |
| 3 | Good one | 0.0 | 0.256 | 0.744 | 0.4404 |
| 4 | Unable use app SMS OTP delivered Malaysia number | 0.0 | 0.843 | 0.157 | 0.0772 |
| 5 | Easy use | 0.0 | 0.256 | 0.744 | 0.4404 |
| 6 | Good app | 0.0 | 0.256 | 0.744 | 0.4404 |
| 7 | easy use | 0.0 | 0.256 | 0.744 | 0.4404 |
| 8 | app working Samsung S7 try sign lot times nobo... | 0.0 | 1.0 | 0.0 | 0.0 |
| 9 | Great app rely grab ridiculous price hikes | 0.216 | 0.431 | 0.353 | 0.3818 |


In [9]:
df_sentiment_filtered.tail(10)

| | comment | neg | neu | pos | compound |
|---|---|---|---|---|---|
| 1183 | write credit 10 account | 0.0 | 0.536 | 0.464 | 0.3818 |
| 1184 | category select writing | 0.0 | 1.0 | 0.0 | 0.0 |
| 1185 | Anyhow wack need get access form fill | 0.0 | 1.0 | 0.0 | 0.0 |
| 1186 | Im curious main gripes current taxi apps | 0.0 | 0.723 | 0.277 | 0.3182 |
| 1187 | using Cabbie years pretty much satisfied bad r... | 0.063 | 0.829 | 0.108 | 0.3612 |
| 1188 | advantage using app smsing directly | 0.0 | 0.667 | 0.333 | 0.25 |
| 1189 | SGTransport mobilityandme sadly Windows Phone ... | 0.049 | 0.82 | 0.131 | 0.6908 |
| 1190 | Also recommend uber smoothest app gps works we... | 0.079 | 0.526 | 0.396 | 0.967 |
| 1191 | wanted recommend uber app works supply enough ... | 0.094 | 0.75 | 0.156 | 0.25 |
| 1192 | true guess thinking app much overall product u... | 0.0 | 0.741 | 0.259 | 0.4215 |


In [10]:
# Note: compound >= 0 places neutral reviews (compound == 0) in the positive list
pos_list = list(df_sentiment_filtered[df_sentiment_filtered['compound'] >= 0]['comment'])
neg_list = list(df_sentiment_filtered[df_sentiment_filtered['compound'] < 0]['comment'])

len(pos_list), len(neg_list)

(844, 349)
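
The split above sends every review with a compound score of exactly 0 to the positive list. If a separate neutral bucket is wanted, one option is a three-way split with a small dead zone around zero. The sketch below assumes thresholds of ±0.05, a commonly used convention for VADER rather than anything this notebook uses, and the function name `label_sentiment` is hypothetical.

```python
def label_sentiment(compound, threshold=0.05):
    """Return 'positive', 'negative' or 'neutral' for a VADER compound score."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores taken from the first rows of the table above.
scores = [-0.8636, 0.9042, 0.4404, 0.0772, 0.0]
labels = [label_sentiment(s) for s in scores]
print(labels)
```

Applied to a DataFrame column, the same function could be passed to `apply` to create a label column before filtering.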

In [11]:
pos_list_clean = clean_list_lemma(clean_list_tokenise(pos_list))
neg_list_clean = clean_list_lemma(clean_list_tokenise(neg_list))

In [12]:
len(pos_list_clean), len(neg_list_clean)

(844, 349)

Ignore code below

### Topic Modelling (Negative)

In [13]:
neg_dict = corpora.Dictionary(neg_list_clean)
doc_term_matrix = [neg_dict.doc2bow(doc) for doc in neg_list_clean]

print(doc_term_matrix[0:10])

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)], [(10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1)], [(16, 1)], [(11, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1)], [(11, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 2), (31, 2), (32, 2), (33, 1), (34, 1), (35, 1), (36, 2), (37, 1), (38, 1), (39, 1), (40, 3), (41, 1), (42, 1), (43, 1), (44, 1)], [(5, 1), (15, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 2)], [(6, 1), (16, 1), (57, 1), (58, 1), (59, 1)], [(60, 1), (61, 1), (62, 1), (63, 1)], [(49, 1), (64, 2), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1)], [(69, 1), (78, 1), (79, 1)]]
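
Each inner list printed above is one review in bag-of-words form: a list of `(token_id, count)` pairs. The standalone sketch below mimics the idea in plain Python. The helper names `build_vocab` and `to_bow` and the toy documents are hypothetical, and IDs are assigned in first-seen order for simplicity, which may differ from the numbering gensim's `Dictionary` produces.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer ID to every distinct token, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs; unknown tokens are dropped."""
    counts = Counter(t for t in doc if t in vocab)
    return sorted((vocab[t], n) for t, n in counts.items())

docs = [["taxi", "driver", "late"], ["app", "taxi", "app"]]
vocab = build_vocab(docs)

bows = [to_bow(doc, vocab) for doc in docs]
unseen = to_bow(["good", "app"], vocab)
print(vocab)
print(bows)
print(unseen)
```

In this sketch, as in gensim's `doc2bow` by default, tokens missing from the vocabulary are silently dropped. That is why a bag-of-words should be built with the same dictionary that is later passed to the model as `id2word`.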


In [29]:
# Alias for the LDA model class from the gensim library
Lda = gensim.models.ldamodel.LdaModel

# Train the LDA model on the document-term matrix.
# doc_term_matrix = term frequencies for every document
# neg_dict = dictionary mapping IDs to all unique terms
ldamodel = Lda(doc_term_matrix, num_topics=2, id2word=neg_dict, passes=50)

In [30]:
print(ldamodel.print_topics(num_topics=2, num_words=5))

[(0, '0.028*"Uber" + 0.026*"Grab" + 0.017*"taxi" + 0.014*"driver" + 0.010*"ride"'), (1, '0.039*"app" + 0.020*"taxi" + 0.017*"booking" + 0.013*"cab" + 0.010*"use"')]
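
The topic strings printed above pack each word and its weight into one formatted string. For further processing, such as tabulating or plotting, it can help to parse them into pairs. The sketch below is one way to do that with a regular expression; the function name `parse_topic` is hypothetical, and the input is the first topic string copied from the output above.

```python
import re

def parse_topic(topic_string):
    """Turn '0.028*"Uber" + 0.026*"Grab"' into [('Uber', 0.028), ('Grab', 0.026)]."""
    pairs = re.findall(r'([\d.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

topic0 = '0.028*"Uber" + 0.026*"Grab" + 0.017*"taxi" + 0.014*"driver" + 0.010*"ride"'
parsed = parse_topic(topic0)
print(parsed)
```

Read this way, topic 0 is dominated by competitor and driver terms (Uber, Grab, driver, ride), while topic 1 centres on the app and the booking process.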


In [31]:
pyLDAvis.gensim.prepare(ldamodel, doc_term_matrix, neg_dict)



### Topic Modelling (Positive)

In [25]:
pos_dict = corpora.Dictionary(pos_list_clean)
# Use pos_dict here so the token IDs match the id2word passed to the LDA model
doc_term_matrix = [pos_dict.doc2bow(doc) for doc in pos_list_clean]

print(doc_term_matrix[0:10])


In [20]:
# Alias for the LDA model class from the gensim library
Lda = gensim.models.ldamodel.LdaModel

# Train the LDA model on the document-term matrix.
# doc_term_matrix = term frequencies for every document
# pos_dict = dictionary mapping IDs to all unique terms
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word=pos_dict, passes=50)

In [21]:
pyLDAvis.gensim.prepare(ldamodel, doc_term_matrix, pos_dict)

