### Reviews about Ryde

In this notebook, we are going to perform topic modelling based on sentiments from various review sites. 

In [21]:
# Base
import os, re, string
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# from wordcloud import WordCloud

# NLTK
import nltk
nltk.download('stopwords')
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.metrics.distance import edit_distance
from spellchecker import SpellChecker
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.stem.wordnet import WordNetLemmatizer

# Topic Modelling
import pyLDAvis
import pyLDAvis.gensim
import gensim
from gensim import corpora
pyLDAvis.enable_notebook()

[nltk_data] Downloading package stopwords to D:\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Data Preparation

We will first get the relevant files we need and thereafter extract them out into a list of reviews.

In [22]:
# Merge into 

path='.'

filename_sentiment_list=[]

for filename in os.listdir(path):
    if filename.endswith(".csv"):
        # Check what are the files in the folder
        print(os.path.join(path, filename))
        
        #Read the files and put to df
        df = pd.read_csv(filename,encoding = "ISO-8859-1")
        
        #Add it into a list
        try:
            filename_sentiment_list.extend(df['comment'].tolist())
            filename_sentiment_list.extend(df['tweet'].tolist())
            filename_sentiment_list.extend(df['title'].tolist())
        except:
            pass

print(filename_sentiment_list)

.\gojek.csv
.\reddit_go-jek_extracted.csv
.\reddit_gojek_extracted.csv
.\twitter_go-jek.csv
.\twitter_gojek.csv
[' \r\r\n\r\r\nIndonesian ride-hailing firm Go-Jek has roped in half a dozen car  rental firms to supply vehicles and sign on drivers in preparation for a  [Singapore launch](https://www.straitstimes.com/singapore/transport/ride-hailing-firm-go-jek-to-launch-in-singapore) set for next month.\r\r\n\r\r\nThe Straits Times understands that the Grab rival - whose investors  include state investment firm Temasek and Internet giant Google - will  enter the market solo.\r\r\n\r\r\nPreviously, the Jakarta app start-up was [in talks with local taxi giant ComfortDelGro](https://www.straitstimes.com/singapore/transport/go-jek-comfortdelgro-in-tie-up-talks-according-to-report) to explore a tie-up.\r\r\n\r\r\nThe latter had earlier entered into an agreement to buy 12,500 cars from Uber\'s Lion City Rentals - [a deal which was terminated ](https://www.straitstimes.com/singapore/transport/c

In [23]:
filename_sentiment_list[5:7]

['just gonna wait for the promo codes',
 'In the long term, the one who wins is the one who can embrace self-driving cars fastest. Taxi and grab drivers are a dying trade. No way they can compete with self-driving cars.  \r\r\nConsumers and companies are the biggest winners. Taxi fares are going to fall though the roof.  \r\r\n']

### Cleaning

We will also perform cleaning

In [24]:
def clean_list_tokenise(reviews_list):
    # Tokenise the words
    tokenizer = RegexpTokenizer(r'\w+')
    return [tokenizer.tokenize(sentence) for sentence in reviews_list]

def clean_list_lemma(reviews_list):
    # Lemmatization
    lemma = WordNetLemmatizer()
    return [[lemma.lemmatize(word) for word in sentence] for sentence in reviews_list]

def clean_list_stopwords(reviews_list, stop_other=[]):
    def stopword_condition(word):
        word = word.lower()
        
        return word not in stopwords.words('english') \
            and word not in stop_other \
            and word[:5] != 'http' \
            and word[:5] != 'https' \
            and word[:2] != 'RT' \
            and word[0] != '@'
    
    return [[w for w in s if stopword_condition(w)] for s in reviews_list]

In [25]:
sentiment_filtered = clean_list_stopwords(clean_list_tokenise(filename_sentiment_list))
len(sentiment_filtered)

405

In [26]:
sentiment_filtered_sentences = [' '.join(s) for s in sentiment_filtered]
sentiment_filtered_sentences[:5]

['Indonesian ride hailing firm Go Jek roped half dozen car rental firms supply vehicles sign drivers preparation Singapore launch www straitstimes com singapore transport ride hailing firm go jek launch singapore set next month Straits Times understands Grab rival whose investors include state investment firm Temasek Internet giant Google enter market solo Previously Jakarta app start talks local taxi giant ComfortDelGro www straitstimes com singapore transport go jek comfortdelgro tie talks according report explore tie latter earlier entered agreement buy 12 500 cars Uber Lion City Rentals deal terminated www straitstimes com singapore transport comfortdelgro uber deal Uber exit Singapore market www straitstimes com business companies markets grab buys ubers south east asia business uber gets 275 stake grab early year According industry sources six rental firms invited Go Jek sign private hire drivers supply vehicles Following Competition Consumer Commission Singapore CCCS non exclusi

### Segregating by Sentiment

We will segregate the reviews by sentiment. This is achieved by the `analyse_sentiment_vader` function, which uses vader to help us analyse the degree to which the particular review or comment is positive, negative or neutral. It also computes a compound score ranging from -1 to 1 which takes into account all the scores. More reading for the sentiment analysis are available at:

- https://opensourceforu.com/2016/12/analysing-sentiments-nltk/
- http://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html

In [27]:
def analyse_sentiment_vader(df, col_name):
    sid = SentimentIntensityAnalyzer()
    vader = lambda text: sid.polarity_scores(text)
    
    df['vader'] = df[col_name].apply(vader)
    df = pd.merge(df, df['vader'].apply(pd.Series), left_index=True, right_index=True)
    return df.drop(['vader'], axis=1)

In [28]:
df_sentiment_filtered = pd.DataFrame(sentiment_filtered_sentences, columns=['comment'])
df_sentiment_filtered = analyse_sentiment_vader(df_sentiment_filtered, 'comment')
df_sentiment_filtered.head(10)

Unnamed: 0,comment,neg,neu,pos,compound
0,Indonesian ride hailing firm Go Jek roped half...,0.032,0.841,0.128,0.9853
1,one people complaining Grab crappy service sta...,0.254,0.376,0.371,0.3612
2,Please help take Grab peg really think theyâ T...,0.0,0.583,0.417,0.6124
3,new player entered game,0.0,1.0,0.0,0.0
4,go,0.0,1.0,0.0,0.0
5,gonna wait promo codes,0.0,1.0,0.0,0.0
6,long term one wins one embrace self driving ca...,0.0,0.748,0.252,0.8442
7,Go Jek soon rename Go Jek Cao,0.0,1.0,0.0,0.0
8,passenger wait,0.0,1.0,0.0,0.0
9,Singapore next per article Thailand Singapore ...,0.0,1.0,0.0,0.0


In [29]:
df_sentiment_filtered.tail(10)

Unnamed: 0,comment,neg,neu,pos,compound
395,need worry grab n uber around taxi less picky ...,0.102,0.81,0.088,-0.1027
396,ComfortDelgro one fav cab service im rush call...,0.0,0.765,0.235,0.5106
397,Comfort Delgro also got phone app works simila...,0.0,0.762,0.238,0.7579
398,im sure option pick,0.0,0.566,0.434,0.3182
399,deleted 0 6431 pastebin com FcrFs94k 97666,0.0,1.0,0.0,0.0
400,support sg grab com go Grab app help centre re...,0.0,0.671,0.329,0.6597
401,complained shit Grab,0.863,0.137,0.0,-0.743
402,show booking id,0.0,1.0,0.0,0.0
403,kindness monetary incentive see driver deliver...,0.111,0.536,0.353,0.9643
404,grab hitch driver spotted,0.0,1.0,0.0,0.0


General Topic Modelling

In [30]:
gen_dict = corpora.Dictionary(df_sentiment_filtered)
doc_term_matrix = [gen_dict.doc2bow(doc) for doc in gen_list_clean]

TypeError: doc2bow expects an array of unicode tokens on input, not a single string

In [None]:
Lda = gensim.models.ldamodel.LdaModel

In [None]:
print(ldamodel.print_topics(num_topics=3, num_words=5))

In [10]:
pos_list = list(df_sentiment_filtered[df_sentiment_filtered['compound'] >= 0]['comment'])
neg_list = list(df_sentiment_filtered[df_sentiment_filtered['compound'] < 0]['comment'])

len(pos_list), len(neg_list)

(293, 112)

In [11]:
pos_list_clean = clean_list_lemma(clean_list_tokenise(pos_list))
neg_list_clean = clean_list_lemma(clean_list_tokenise(neg_list))

In [12]:
len(pos_list_clean), len(neg_list_clean)

(293, 112)

Ignore code below

### Topic Modelling (Negative)

In [13]:
neg_dict = corpora.Dictionary(neg_list_clean)
doc_term_matrix = [neg_dict.doc2bow(doc) for doc in neg_list_clean]

print(doc_term_matrix[0:10])

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2), (6, 1), (7, 1), (8, 2), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1)], [(0, 1), (5, 3), (7, 1), (16, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 3), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 3), (58, 1)], [(5, 2), (8, 2), (15, 1), (17, 1), (20, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1)], [(78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1)], [(1, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 1), (98, 1), (99, 1), (100, 1)], [(100

In [14]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and Trainign LDA model on the document term matrix.
#doc_term_matrix = frequency of terms of all documents
#dictionary = all unique terms
ldamodel = Lda(doc_term_matrix, num_topics=5, id2word=neg_dict, passes=50)

In [15]:
print(ldamodel.print_topics(num_topics=5, num_words=5))

[(0, '0.023*"driver" + 0.020*"Uber" + 0.010*"cancel" + 0.009*"uber" + 0.009*"taxi"'), (1, '0.011*"Go" + 0.011*"need" + 0.010*"price" + 0.009*"Grab" + 0.009*"wrong"'), (2, '0.012*"car" + 0.012*"Grab" + 0.010*"driver" + 0.009*"grab" + 0.008*"cancel"'), (3, '0.012*"get" + 0.011*"driver" + 0.009*"Grab" + 0.009*"ride" + 0.009*"Uber"'), (4, '0.019*"driver" + 0.015*"grab" + 0.011*"u" + 0.008*"car" + 0.008*"drop"')]


In [16]:
pyLDAvis.gensim.prepare(ldamodel, doc_term_matrix, neg_dict)

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=True'.


  return pd.concat([default_term_info] + list(topic_dfs))


# Topic Modelling (Positive)

In [17]:
pos_dict = corpora.Dictionary(pos_list_clean)
doc_term_matrix = [neg_dict.doc2bow(doc) for doc in pos_list_clean]

print(doc_term_matrix[0:10])

[[(1, 1), (5, 7), (7, 2), (8, 3), (10, 1), (16, 1), (23, 1), (43, 2), (46, 1), (57, 3), (69, 1), (72, 1), (78, 2), (88, 6), (95, 1), (97, 4), (101, 11), (108, 8), (109, 11), (110, 4), (123, 1), (126, 1), (140, 2), (141, 1), (146, 2), (166, 4), (186, 5), (211, 5), (219, 2), (224, 5), (226, 3), (243, 1), (244, 1), (255, 1), (256, 1), (261, 2), (263, 1), (266, 1), (272, 1), (292, 1), (314, 1), (315, 1), (323, 1), (328, 1), (332, 1), (335, 3), (338, 2), (343, 3), (351, 1), (358, 1), (372, 1), (410, 1), (421, 1), (424, 1), (432, 1), (434, 2), (446, 1), (485, 1), (504, 1), (571, 4), (574, 2), (580, 1), (583, 1), (612, 1), (627, 1), (711, 6), (734, 1), (768, 1), (772, 2), (774, 2), (775, 2), (785, 1), (805, 2), (835, 1), (839, 3), (891, 1), (961, 1), (987, 2), (1031, 1), (1036, 1), (1115, 1), (1118, 2)], [(108, 1), (140, 1), (154, 1), (319, 1), (372, 1), (639, 1), (680, 1), (855, 1)], [(56, 1), (100, 1), (108, 1), (165, 1), (551, 1), (1084, 1)], [(623, 1), (1014, 1)], [(43, 1)], [(76, 1), (15

In [18]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and Trainign LDA model on the document term matrix.
#doc_term_matrix = frequency of terms of all documents
#dictionary = all unique terms
ldamodel = Lda(doc_term_matrix, num_topics=5, id2word=pos_dict, passes=50)

In [19]:
pyLDAvis.gensim.prepare(ldamodel, doc_term_matrix, pos_dict)

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=True'.


  return pd.concat([default_term_info] + list(topic_dfs))
