In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("googleplaystore_user_reviews.csv")
df.head()

Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
0,10 Best Foods for You,I like eat delicious food. That's I'm cooking ...,Positive,1.0,0.533333
1,10 Best Foods for You,This help eating healthy exercise regular basis,Positive,0.25,0.288462
2,10 Best Foods for You,,,,
3,10 Best Foods for You,Works great especially going grocery store,Positive,0.4,0.875
4,10 Best Foods for You,Best idea us,Positive,1.0,0.3


In [3]:
df.Sentiment.value_counts()

Positive    23998
Negative     8271
Neutral      5163
Name: Sentiment, dtype: int64

In [4]:
df.Sentiment.isna().sum()

26863

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64295 entries, 0 to 64294
Data columns (total 5 columns):
App                       64295 non-null object
Translated_Review         37427 non-null object
Sentiment                 37432 non-null object
Sentiment_Polarity        37432 non-null float64
Sentiment_Subjectivity    37432 non-null float64
dtypes: float64(2), object(3)
memory usage: 2.5+ MB


Tentatively, I am removing the rows with NaN values for the reviews. I cannot run a sentiment analysis if the reviews are missing. I will try and see if there is a way to impute these later.

In [6]:
df = df.dropna(axis=0)
df.head()

Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
0,10 Best Foods for You,I like eat delicious food. That's I'm cooking ...,Positive,1.0,0.533333
1,10 Best Foods for You,This help eating healthy exercise regular basis,Positive,0.25,0.288462
3,10 Best Foods for You,Works great especially going grocery store,Positive,0.4,0.875
4,10 Best Foods for You,Best idea us,Positive,1.0,0.3
5,10 Best Foods for You,Best way,Positive,1.0,0.3


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37427 entries, 0 to 64230
Data columns (total 5 columns):
App                       37427 non-null object
Translated_Review         37427 non-null object
Sentiment                 37427 non-null object
Sentiment_Polarity        37427 non-null float64
Sentiment_Subjectivity    37427 non-null float64
dtypes: float64(2), object(3)
memory usage: 1.7+ MB


In [8]:
df.Sentiment.value_counts()

Positive    23998
Negative     8271
Neutral      5158
Name: Sentiment, dtype: int64

----
I lost 5 Neutral reviews. Other than that, the other figures remained the same.

In [9]:
# df.App.value_counts()

Bowmasters                                            312
Helix Jump                                            273
Angry Birds Classic                                   273
Calorie Counter - MyFitnessPal                        254
Candy Crush Saga                                      240
Duolingo: Learn Languages Free                        240
Garena Free Fire                                      222
8 Ball Pool                                           219
Calorie Counter - Macros                              200
10 Best Foods for You                                 194
CBS Sports App - Scores, News, Stats & Watch Live     192
Google Photos                                         191
Alto's Adventure                                      175
8fit Workouts & Meal Planner                          171
DRAGON BALL LEGENDS                                   167
Candy Crush Soda Saga                                 166
Clash Royale                                          165
Adobe Acrobat 

1074 apps in this list, with at least 30 reviews in each

### Pre-processing

In [10]:
import re

replace_wo_space = re.compile("[.;:!\'?,\"()\[\]]")
replace_with_space = re.compile("(<br\s*/><br\s*/>)|(\-)|(\/)")

reviews = df['Translated_Review'].str.replace(replace_wo_space,'').str.lower()
reivews = reviews.str.replace(replace_with_space,'')
reviews.head(10)

0     i like eat delicious food thats im cooking foo...
1       this help eating healthy exercise regular basis
3            works great especially going grocery store
4                                          best idea us
5                                              best way
6                                               amazing
8                                   looking forward app
9                   it helpful site  it help foods get 
10                                             good you
11    useful information the amount spelling errors ...
Name: Translated_Review, dtype: object

In [11]:
df['Sentiment'] = df['Sentiment'].map({'Positive': 2, 'Neutral': 1, 'Negative':0})
df.head()

Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
0,10 Best Foods for You,I like eat delicious food. That's I'm cooking ...,2,1.0,0.533333
1,10 Best Foods for You,This help eating healthy exercise regular basis,2,0.25,0.288462
3,10 Best Foods for You,Works great especially going grocery store,2,0.4,0.875
4,10 Best Foods for You,Best idea us,2,1.0,0.3
5,10 Best Foods for You,Best way,2,1.0,0.3


In [13]:
reviews.count()

37427

In [15]:
df.Sentiment.count()

37427

In [16]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(binary=True)
cv.fit(reviews)
X = cv.transform(reviews)

print(X.shape)

(37427, 23545)


Each review has been converted into a group of integers that allows the program to access the words using the corresponding integers, i.e. it creates a dictionary of some sort, and reads in the words from there. This way, it does not have to deal with the words directly, and it will be easier for the model to read in those words

In [17]:
from sklearn.model_selection import train_test_split

target = df['Sentiment']
#target = pd.get_dummies(df['Sentiment'])

X_train, X_test, y_train, y_test = train_test_split(X, target, test_size = 0.3, random_state=42)

###  Sentiment Analysis using Logistic Regression

In [20]:
from sklearn.linear_model import LogisticRegressionCV

grid_params = [0.25, 0.5, 0.75, 1]
lr = LogisticRegressionCV(Cs=grid_params, solver='newton-cg', cv=5, multi_class='multinomial', random_state=42)
lr.fit(X_train, y_train)

LogisticRegressionCV(Cs=[0.25, 0.5, 0.75, 1], class_weight=None, cv=5,
           dual=False, fit_intercept=True, intercept_scaling=1.0,
           max_iter=100, multi_class='multinomial', n_jobs=None,
           penalty='l2', random_state=42, refit=True, scoring=None,
           solver='newton-cg', tol=0.0001, verbose=0)

We cannot use the default solver, liblinear, because this is a multiclass problem, i.e. the target variable is not binary; it can be used to classify the review as 'positive', 'neutral' or 'negative'.
We use the 'newton-cg' solver, because we're interested in using the 'l2' regularizer. We would, however, shift to the 'saga' problem if we are slightly underfitted. The Ridge regularizer is used right now because it always gives a definite answer, unlike Lasso, which is slightly more generalized, and may give multiple answers.

In [21]:
lr.scores_

{0: array([[0.91223049, 0.91700057, 0.92043503, 0.92386949],
        [0.90954198, 0.91545802, 0.91832061, 0.91927481],
        [0.91240458, 0.91717557, 0.91965649, 0.91984733],
        [0.92154991, 0.92498568, 0.92689445, 0.92765795],
        [0.91523482, 0.91733486, 0.92038946, 0.92134402]]),
 1: array([[0.91223049, 0.91700057, 0.92043503, 0.92386949],
        [0.90954198, 0.91545802, 0.91832061, 0.91927481],
        [0.91240458, 0.91717557, 0.91965649, 0.91984733],
        [0.92154991, 0.92498568, 0.92689445, 0.92765795],
        [0.91523482, 0.91733486, 0.92038946, 0.92134402]]),
 2: array([[0.91223049, 0.91700057, 0.92043503, 0.92386949],
        [0.90954198, 0.91545802, 0.91832061, 0.91927481],
        [0.91240458, 0.91717557, 0.91965649, 0.91984733],
        [0.92154991, 0.92498568, 0.92689445, 0.92765795],
        [0.91523482, 0.91733486, 0.92038946, 0.92134402]])}

In [22]:
lr.C_

array([1., 1., 1.])

In [23]:
from sklearn.metrics import accuracy_score

y_train_pred = lr.predict(X_train)
print("Training accuracy is {}".format(accuracy_score(y_train, y_train_pred)))

y_test_pred = lr.predict(X_test)
print("Testing accuracy is {}".format(accuracy_score(y_test, y_test_pred)))

Training accuracy is 0.9856095885182075
Testing accuracy is 0.9283996794015495


In [24]:
s = pd.Series(['this is a terrible app','i love this app', 'more to be expected'])
s = cv.transform(s)
lr.predict(s)

array([0, 2, 2], dtype=int64)

From the above analysis, we can see that it has identified the thir

### Sentiment analysis using NLTK

Step 1: Remove stop words, i.e. he,she,they

In [25]:
import nltk

nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Error loading stopwords: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>
[nltk_data] Error loading wordnet: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


False

In [26]:
from nltk.corpus import stopwords

heshe = stopwords.words('english')
reviews_stopworded = reviews.apply(lambda x: ' '.join([word for word in x.split()
                                                          if word not in heshe]))
reviews_stopworded.head()

0    like eat delicious food thats im cooking food ...
1           help eating healthy exercise regular basis
3           works great especially going grocery store
4                                         best idea us
5                                             best way
Name: Translated_Review, dtype: object

Step 2: Normalization through Lemmatization

In [27]:
from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()
reviews_lem = reviews_stopworded.apply(lambda x: ' '.join([lem.lemmatize(word) for word in x.split()]))
reviews_lem.head()

0    like eat delicious food thats im cooking food ...
1           help eating healthy exercise regular basis
3            work great especially going grocery store
4                                          best idea u
5                                             best way
Name: Translated_Review, dtype: object

In [28]:
ngram_cv = CountVectorizer(binary=True, ngram_range=(1,2))
ngram_cv.fit(reviews_lem)
X = ngram_cv.transform(reviews_lem)

#### NLTK with Logistic Regression

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.3, random_state=42)

# grid_params = [0.25, 0.5, 0.75, 1]
# lr_nltk = LogisticRegressionCV(Cs=grid_params, solver='newton-cg', cv=5, multi_class='multinomial', random_state=42)
# lr_nltk.fit(X_train, y_train)

from sklearn.linear_model import LogisticRegression

lr_nltk = LogisticRegression(solver='newton-cg', multi_class='multinomial', C=1.0, random_state=42)
lr_nltk.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=None, penalty='l2', random_state=42, solver='newton-cg',
          tol=0.0001, verbose=0, warm_start=False)

In [30]:
y_train_pred = lr_nltk.predict(X_train)
print("Training accuracy is {}".format(accuracy_score(y_train, y_train_pred)))

y_test_pred = lr_nltk.predict(X_test)
print("Testing accuracy is {}".format(accuracy_score(y_test, y_test_pred)))

Training accuracy is 0.9945797389113673
Testing accuracy is 0.919405111764182


In [31]:
s = pd.Series(['this is a terrible app', 'i love this app', 'more to be expected'])
s = ngram_cv.transform(s)
lr_nltk.predict(s)

array([0, 2, 1], dtype=int64)

#### NLTK with Support Vector Machines

In [32]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

grid_params = {'C':[0.25,0.5,0.75,1.0]}
svm = GridSearchCV(LinearSVC(multi_class='ovr', max_iter=2500, random_state=42),grid_params,cv=5)
svm.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=2500,
     multi_class='ovr', penalty='l2', random_state=42, tol=0.0001,
     verbose=0),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'C': [0.25, 0.5, 0.75, 1.0]}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score='warn', scoring=None, verbose=0)

In [33]:
svm.cv_results_



{'mean_fit_time': array([4.33812947, 6.98477254, 9.5795547 , 9.4163734 ]),
 'std_fit_time': array([0.8104096 , 1.7152252 , 1.93701435, 0.72884227]),
 'mean_score_time': array([0.00654416, 0.00704088, 0.00624142, 0.00626078]),
 'std_score_time': array([0.00206647, 0.00110341, 0.00110827, 0.00209564]),
 'param_C': masked_array(data=[0.25, 0.5, 0.75, 1.0],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'C': 0.25}, {'C': 0.5}, {'C': 0.75}, {'C': 1.0}],
 'split0_test_score': array([0.91203969, 0.91032246, 0.90917764, 0.90803282]),
 'split1_test_score': array([0.91183206, 0.91221374, 0.91183206, 0.91183206]),
 'split2_test_score': array([0.91335878, 0.9139313 , 0.91354962, 0.91278626]),
 'split3_test_score': array([0.91735064, 0.91506013, 0.91601451, 0.91677801]),
 'split4_test_score': array([0.91714395, 0.91828942, 0.91714395, 0.91466208]),
 'mean_test_score': array([0.91434461, 0.9139629 , 0.91354302, 0.91281777]),
 'std_te

In [34]:
svm.best_score_

0.9143446064585082

In [35]:
svm.best_estimator_

LinearSVC(C=0.25, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=2500,
     multi_class='ovr', penalty='l2', random_state=42, tol=0.0001,
     verbose=0)

In [36]:
y_train_pred = svm.predict(X_train)
print("Training accuracy is {}".format(accuracy_score(y_train, y_train_pred)))

y_test_pred = svm.predict(X_test)
print("Testing accuracy is {}".format(accuracy_score(y_test, y_test_pred)))

Training accuracy is 0.996450110695473
Testing accuracy is 0.9218096001424883


In [37]:
s = pd.Series(['this is a terrible app', 'i love this app', 'more to be expected'])
s = ngram_cv.transform(s)
svm.predict(s)

array([0, 2, 1], dtype=int64)

We can use ngram_cv as the model for the X&y-values because they're under the same X_train, y_train models that was used to transform the data for the LinearSVC model.

---

There are 3 methods we used to perform sentiment analysis here.

1. Logistic Regression
2. NLTK pre-processing<br />
    a. Logistic Regression<br />
    b. Linear SVM

NOTE: This was a multi-class classification analysis.

### Reasonings

In Logistic Regression, since this was a multi-class classification, we changed the solver from the default 'liblinear' to 'newton-cg'. We had the option of using other solvers, but the focus was on using the 'l2' penalty since there were limited  features. I would have used 'saga' if we had more multiple features.

Trying to improve upon the model, the data was pre-processed using the Natural Language Toolkit. There are 2 things done here that differed it from the above model. The first was that we removed the stop words, i.e. he, she, they, a, I. After that, we used a Lemmatizer that truncates words into their proper forms to prevent counting the same word in different forms, i.e. -ing, -s, -ed. It was re-vectorized and passed through the same Logistic Regression function, and the score improved.

I also ran the LinearSVM because I was interested in finding out whether using a different model would yield any scores that were significantly different from the results we had achieved above.