### Reviews about Ryde

In this notebook, we are going to perform topic modelling based on sentiments from various review sites. 

In [1]:
# Base
import os, re, string
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# from wordcloud import WordCloud

# NLTK
import nltk
nltk.download('stopwords')
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.metrics.distance import edit_distance
from spellchecker import SpellChecker
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.stem.wordnet import WordNetLemmatizer

# Topic Modelling
import pyLDAvis
import pyLDAvis.gensim
import gensim
from gensim import corpora
pyLDAvis.enable_notebook()

[nltk_data] Downloading package stopwords to D:\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


scipy.sparse.sparsetools is a private module for scipy.sparse, and should not be used.
  _deprecated()


### Data Preparation

We will first get the relevant files we need and thereafter extract them out into a list of reviews.

In [2]:
# Merge into 

path='.'

filename_sentiment_list=[]

for filename in os.listdir(path):
    if filename.endswith(".csv"):
        # Check what are the files in the folder
        print(os.path.join(path, filename))
        
        #Read the files and put to df
        df = pd.read_csv(filename,encoding = "ISO-8859-1")
        
        #Add it into a list
        try:
            filename_sentiment_list.extend(df['comment'].tolist())
            filename_sentiment_list.extend(df['tweet'].tolist())
            filename_sentiment_list.extend(df['title'].tolist())
        except:
            pass

print(filename_sentiment_list)

.\gplay_ryde_en.csv
.\reddit_rydepool_en.csv
.\reddit_rydesg_en.csv
.\reddit_ryde_en.csv
.\reddit_ryde_en_extracted.csv
.\twitter_ryde.csv
.\twitter_rydepool.csv
.\twitter_rydesg.csv
['Always enjoy te rides.Vry gd service', "don't install..full of crabs", 'Good network', 'Too many mandatory notifications that cannot be disabled. Rides more expensive or same price as Grab.', "Cannot even login when I logged out. Says my email is invalid but since my mobile number is registered, it won't allow me to login!", 'useless piece of xxxx', 'Its a good app but the problem for me is when i book i have to wait so long to get a driver please do something about it to me this app is good cause its cheaper then grab so please do something before i go back to grab', 'Not so fast to get driver. Need to catch up with Grab', 'App does not provide real-time updates on drivers location, always keep having to restart the app', 'No discount no promo', 'Nice to visit you', 'Well', 'difficult to book', 'So lous

  interactivity=interactivity, compiler=compiler, result=result)


In [3]:
filename_sentiment_list[5:7]

['useless piece of xxxx',
 'Its a good app but the problem for me is when i book i have to wait so long to get a driver please do something about it to me this app is good cause its cheaper then grab so please do something before i go back to grab']

### Cleaning

We will also perform cleaning

In [4]:
def clean_list_tokenise(reviews_list):
    # Tokenise the words
    tokenizer = RegexpTokenizer(r'\w+')
    return [tokenizer.tokenize(sentence) for sentence in reviews_list]

def clean_list_lemma(reviews_list):
    # Lemmatization
    lemma = WordNetLemmatizer()
    return [[lemma.lemmatize(word) for word in sentence] for sentence in reviews_list]

def clean_list_stopwords(reviews_list, stop_other=[]):
    def stopword_condition(word):
        word = word.lower()
        
        return word not in stopwords.words('english') \
            and word not in stop_other \
            and word[:5] != 'http' \
            and word[:5] != 'https' \
            and word[:2] != 'RT' \
            and word[0] != '@'
    
    return [[w for w in s if stopword_condition(w)] for s in reviews_list]

In [5]:
sentiment_filtered = clean_list_stopwords(clean_list_tokenise(filename_sentiment_list))
len(sentiment_filtered)

1024

In [6]:
sentiment_filtered_sentences = [' '.join(s) for s in sentiment_filtered]
sentiment_filtered_sentences[:5]

['Always enjoy te rides Vry gd service',
 'install full crabs',
 'Good network',
 'many mandatory notifications cannot disabled Rides expensive price Grab',
 'Cannot even login logged Says email invalid since mobile number registered allow login']

### Segregating by Sentiment

We will segregate the reviews by sentiment. This is achieved by the `analyse_sentiment_vader` function, which uses vader to help us analyse the degree to which the particular review or comment is positive, negative or neutral. It also computes a compound score ranging from -1 to 1 which takes into account all the scores. More reading for the sentiment analysis are available at:

- https://opensourceforu.com/2016/12/analysing-sentiments-nltk/
- http://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html

In [7]:
def analyse_sentiment_vader(df, col_name):
    sid = SentimentIntensityAnalyzer()
    vader = lambda text: sid.polarity_scores(text)
    
    df['vader'] = df[col_name].apply(vader)
    df = pd.merge(df, df['vader'].apply(pd.Series), left_index=True, right_index=True)
    return df.drop(['vader'], axis=1)

In [8]:
df_sentiment_filtered = pd.DataFrame(sentiment_filtered_sentences, columns=['comment'])
df_sentiment_filtered = analyse_sentiment_vader(df_sentiment_filtered, 'comment')
df_sentiment_filtered.head(10)

Unnamed: 0,comment,neg,neu,pos,compound
0,Always enjoy te rides Vry gd service,0.0,0.652,0.348,0.4939
1,install full crabs,0.0,1.0,0.0,0.0
2,Good network,0.0,0.256,0.744,0.4404
3,many mandatory notifications cannot disabled R...,0.0,0.86,0.14,0.0772
4,Cannot even login logged Says email invalid si...,0.0,0.775,0.225,0.296
5,useless piece xxxx,0.583,0.417,0.0,-0.4215
6,good app problem book wait long get driver ple...,0.096,0.534,0.37,0.7717
7,fast get driver Need catch Grab,0.0,1.0,0.0,0.0
8,App provide real time updates drivers location...,0.0,1.0,0.0,0.0
9,discount promo,0.0,1.0,0.0,0.0


In [9]:
df_sentiment_filtered.tail(10)

Unnamed: 0,comment,neg,neu,pos,compound
1014,Shall Thanks EDIT gave paypal call iterated FC...,0.088,0.602,0.31,0.5859
1015,said random accounts selected promo nice suppo...,0.0,0.515,0.485,0.8834
1016,category select contact,0.0,1.0,0.0,0.0
1017,Getting error Gonna try emailing paypal hopefu...,0.201,0.597,0.201,0.0
1018,deleted,0.0,1.0,0.0,0.0
1019,credited 10 USD 10 SGD account spend realize t...,0.0,0.711,0.289,0.743
1020,Sorry sure already paid perhaps chat Paypal se...,0.092,0.563,0.345,0.5574
1021,write credit 10 account,0.0,0.536,0.464,0.3818
1022,category select writing,0.0,1.0,0.0,0.0
1023,Anyhow wack need get access form fill,0.0,1.0,0.0,0.0


In [10]:
pos_list = list(df_sentiment_filtered[df_sentiment_filtered['compound'] >= 0]['comment'])
neg_list = list(df_sentiment_filtered[df_sentiment_filtered['compound'] < 0]['comment'])

len(pos_list), len(neg_list)

(724, 300)

In [11]:
pos_list_clean = clean_list_lemma(clean_list_tokenise(pos_list))
neg_list_clean = clean_list_lemma(clean_list_tokenise(neg_list))

In [12]:
len(pos_list_clean), len(neg_list_clean)

(724, 300)

Ignore code below

### Topic Modelling

In [36]:
neg_dict = corpora.Dictionary(neg_list_clean)
doc_term_matrix = [neg_dict.doc2bow(doc) for doc in neg_list_clean]

print(doc_term_matrix[0:10])

[[(0, 1), (1, 1), (2, 1)], [(3, 1), (4, 1)], [(3, 1), (5, 1), (6, 1), (7, 2), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)], [(12, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 3), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1)], [(7, 1), (22, 2), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1)], [(7, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1)], [(57, 1), (58, 1)], [(19, 1), (22, 1), (49, 1), (59, 1), (60, 1), (61, 1), (62, 2), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1)], [(19, 3), (49, 1), (54, 1), (62, 1), (70, 1), (71, 1), (72, 1), (73, 2), (74, 1), (75, 2), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1)], [(22, 1), (66, 1), (76, 1), (83, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 1), (93, 1)]]


In [41]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and Trainign LDA model on the document term matrix.
#doc_term_matrix = frequency of terms of all documents
#dictionary = all unique terms
ldamodel = Lda(doc_term_matrix, num_topics=2, id2word=neg_dict, passes=50)

In [42]:
print(ldamodel.print_topics(num_topics=2, num_words=5))

[(0, '0.017*"app" + 0.015*"driver" + 0.011*"Grab" + 0.007*"service" + 0.006*"Ryde"'), (1, '0.013*"driver" + 0.010*"grab" + 0.009*"get" + 0.009*"app" + 0.008*"Grab"')]


In [43]:
pyLDAvis.gensim.prepare(ldamodel, doc_term_matrix, neg_dict)

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=True'.


  return pd.concat([default_term_info] + list(topic_dfs))


# Topic Modelling (Positive)

In [31]:
pos_dict = corpora.Dictionary(pos_list_clean)
doc_term_matrix = [neg_dict.doc2bow(doc) for doc in pos_list_clean]

print(doc_term_matrix[0:10])

[[(5, 1), (49, 1), (90, 1), (1897, 1)], [(691, 1)], [(1141, 1)], [(229, 1), (304, 1), (334, 1), (342, 1), (366, 1), (493, 1)], [(78, 1), (79, 1), (143, 1), (258, 1), (264, 1), (272, 1), (273, 2), (329, 1), (955, 1), (1807, 1)], [(3, 1), (7, 1), (13, 1), (22, 2), (109, 1), (154, 2), (196, 2), (354, 1), (415, 1), (432, 1), (675, 2), (1143, 2), (1559, 1), (1834, 1)], [(7, 1), (105, 1), (109, 1), (334, 1), (1516, 1)], [(7, 1), (12, 1), (22, 1), (104, 1), (114, 1), (117, 1), (129, 1), (155, 1), (162, 1), (803, 1)], [(426, 1), (483, 1)], []]


In [32]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and Trainign LDA model on the document term matrix.
#doc_term_matrix = frequency of terms of all documents
#dictionary = all unique terms
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word=pos_dict, passes=50)

In [33]:
pyLDAvis.gensim.prepare(ldamodel, doc_term_matrix, pos_dict)

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=True'.


  return pd.concat([default_term_info] + list(topic_dfs))
