### **📝 Instructions**
Sentiment analysis
Naive Bayes models are very useful for analyzing sentiment, classifying texts into topics, or making recommendations, because the characteristics of these problems fit the model's theoretical and methodological assumptions well.

In this project you will practice with a dataset to create a review classifier for the Google Play store.

In [32]:
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import (MinMaxScaler,
                                   StandardScaler,
                                   LabelEncoder,
                                   OneHotEncoder)
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import (chi2,
                                       SelectKBest,
                                       f_regression)
from sklearn.model_selection import (train_test_split,
                                     GridSearchCV) # For hyperparameter optimization
from sklearn.metrics import (accuracy_score,
                             mean_squared_error,
                             confusion_matrix,
                             classification_report)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.naive_bayes import (GaussianNB,
                                 MultinomialNB,
                                 BernoulliNB)
from sklearn.feature_extraction.text import CountVectorizer

# Optimizer
from sklearn.model_selection import RandomizedSearchCV
from pickle import dump

#### **Step 1: Loading the dataset**
The dataset can be found in this project folder under the name `playstore_reviews.csv`. You can load it into the code directly from the link:
``` link
https://raw.githubusercontent.com/4GeeksAcademy/naive-bayes-project-tutorial/main/playstore_reviews.csv
```

Alternatively, download it and add it to your repository manually. In this dataset, you will find the following variables:

- `package_name`. Name of the mobile application (categorical)
- `review`. Comment about the mobile application (categorical)
- `polarity`. Class variable (0 or 1), where 0 is a negative comment and 1 is a positive one (numeric)
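
Before modeling, it can help to check how balanced the two classes are and whether any review is missing. The sketch below is only an illustration: the `sample` frame and its rows are hypothetical and simply mimic the three columns described above.

```python
import pandas as pd

# Hypothetical rows that mimic the dataset's three columns (not real data).
sample = pd.DataFrame({
    "package_name": ["com.example.app", "com.example.app", "com.example.other"],
    "review": ["great app", "crashes on start", None],
    "polarity": [1, 0, 0],
})

# Share of each class; a strong imbalance would make accuracy harder to interpret.
class_share = sample["polarity"].value_counts(normalize=True)

# Number of missing values per column.
missing = sample.isna().sum()

print(class_share)
print(missing)
```

The same two calls could be run on the loaded `df` once it is available.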

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/naive-bayes-project-tutorial/main/playstore_reviews.csv")
df.head()

Unnamed: 0,package_name,review,polarity
0,com.facebook.katana,privacy at least put some option appear offli...,0
1,com.facebook.katana,"messenger issues ever since the last update, ...",0
2,com.facebook.katana,profile any time my wife or anybody has more ...,0
3,com.facebook.katana,the new features suck for those of us who don...,0
4,com.facebook.katana,forced reload on uploading pic on replying co...,0


In [3]:
df_raw = df.copy()
df_raw.to_csv("../data/raw/df_raw_NB.csv", index=False)

#### **Step 2: Study of variables and their content**
In this case, we have only 3 variables: 2 predictors and a binary label. Of the two predictors, only the review text matters, because whether a comment is positive or negative depends on its content, not on the application it was written about. Therefore, the `package_name` variable should be removed.

When we work with text, as in this case, a conventional EDA does not make sense, because the only variable of interest is the one that contains the text. When the text is part of a larger dataset with other numeric predictors and a different prediction target, an EDA does make sense.

However, we cannot work with plain text; it must first be processed. This process consists of several steps:

1. Removing leading and trailing whitespace and converting the text to lowercase:
```py
df["column"] = df["column"].str.strip().str.lower()
```
2. Splitting the dataset into train and test sets: `X_train`, `X_test`, `y_train`, `y_test`.
3. Transforming the text into a word count matrix, which yields numerical features from the text. The transformer is fitted on the training set only and then applied to the test set:
```py
vec_model = CountVectorizer(stop_words = "english")
X_train = vec_model.fit_transform(X_train).toarray()
X_test = vec_model.transform(X_test).toarray()
```
Once this is done, the predictors are ready to train the model.

In [4]:
# Preprocessing
df_interim = (
    df_raw
        .copy()
        .set_axis(
            df_raw.columns.str.replace(' ','_')
                          .str.replace(r'\W', '', regex=True)
                          .str.lower()
                          .str.slice(0, 40), axis=1
        )
        .drop('package_name', axis=1)
)
df_interim.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   review    891 non-null    object
 1   polarity  891 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 14.0+ KB


In [5]:
df_interim['review'] = df_interim['review'].str.strip().str.lower()
df_interim.head()

Unnamed: 0,review,polarity
0,privacy at least put some option appear offlin...,0
1,"messenger issues ever since the last update, i...",0
2,profile any time my wife or anybody has more t...,0
3,the new features suck for those of us who don'...,0
4,forced reload on uploading pic on replying com...,0


In [6]:
df_interim.to_csv("../data/interim/fd_interim_NB.csv", index=False)

In [23]:
# Split into train and test sets
df = df_interim.copy()
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

In [14]:
display(df_train.head())
display(df_test.head())

Unnamed: 0,review,polarity
331,just did the latest update on viber and yet ag...,0
733,keeps crashing it only works well in extreme d...,0
382,the fail boat has arrived the 6.0 version is t...,0
704,"superfast, just as i remember it ! opera mini ...",1
813,installed and immediately deleted this crap i ...,1


Unnamed: 0,review,polarity
709,love/hate has bug and security issues. i tried...,0
439,whatsapp i use this app now that blackberry me...,1
840,usefully verry nice app,1
720,fonts why in the heck is this thing analysing ...,0
39,app doesn't work after latest upgrade the face...,0


In [25]:
X_train = df_train['review'].reset_index(drop=True)
y_train = df_train['polarity'].reset_index(drop=True)
X_test = df_test['review'].reset_index(drop=True)
y_test = df_test['polarity'].reset_index(drop=True)

In [26]:
display(X_train.head())

0    just did the latest update on viber and yet ag...
1    keeps crashing it only works well in extreme d...
2    the fail boat has arrived the 6.0 version is t...
3    superfast, just as i remember it ! opera mini ...
4    installed and immediately deleted this crap i ...
Name: review, dtype: object

In [27]:
vec_model = CountVectorizer(stop_words = "english") # Expects a 1-D Series of strings, not a DataFrame
X_train = vec_model.fit_transform(X_train).toarray()
X_test = vec_model.transform(X_test).toarray()

X_train

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

#### **Step 3: Build a naive bayes model**
Start by choosing which of the three implementations to use (`GaussianNB`, `MultinomialNB`, or `BernoulliNB`), based on what we have studied in the module, and train it. Then train the other two implementations and confirm whether the model you chose is the right one.

In [28]:
model = MultinomialNB()
model.fit(X_train, y_train)

In [29]:
y_pred = model.predict(X_test)
y_pred

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0])

In [30]:
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

Accuracy: 0.8156424581005587
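
Accuracy is a single number and does not show which class the errors fall on. As a hedged sketch, `confusion_matrix` and `classification_report` (both imported at the top of the notebook) can break it down. The label arrays below are hypothetical, not the notebook's actual predictions.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true labels and predictions, for illustration only.
y_true = [0, 0, 0, 1, 1, 0]
y_hat = [0, 0, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_true, y_hat)
print(cm)

# Precision, recall and F1 per class.
print(classification_report(y_true, y_hat))
```

In the notebook, the same calls would take `y_test` and `y_pred` as arguments.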


In [31]:
for model_aux in [GaussianNB(), BernoulliNB()]:
    model_aux.fit(X_train, y_train)
    y_pred_aux = model_aux.predict(X_test)
    print(f"{model_aux} with accuracy: {accuracy_score(y_test, y_pred_aux)}")

GaussianNB() with accuracy: 0.8044692737430168
BernoulliNB() with accuracy: 0.770949720670391


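We confirm that the best model is `MultinomialNB()`: it reaches the highest test accuracy (about 0.816, versus 0.804 for `GaussianNB` and 0.771 for `BernoulliNB`). This matches the nature of the features, which are discrete word counts rather than continuous Gaussian values or binary indicators.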
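
One way to see the difference between the variants is that `BernoulliNB` binarizes its input by default (threshold `0.0`), so it only sees whether a word is present, while `MultinomialNB` uses how many times it occurs. The sketch below uses a hypothetical two-word vocabulary and made-up counts.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical counts for a two-word vocabulary: ["good", "bad"].
X_toy = np.array([[2, 0], [1, 0], [0, 1], [0, 3]])
y_toy = np.array([1, 1, 0, 0])

multinomial = MultinomialNB().fit(X_toy, y_toy)
bernoulli = BernoulliNB().fit(X_toy, y_toy)  # counts > 0 are turned into 1

once = np.array([[1, 0]])    # "good" appears once
thrice = np.array([[3, 0]])  # "good" appears three times

m_once = multinomial.predict_proba(once)
m_thrice = multinomial.predict_proba(thrice)
b_once = bernoulli.predict_proba(once)
b_thrice = bernoulli.predict_proba(thrice)

print("MultinomialNB:", m_once, m_thrice)  # repetition raises the positive probability
print("BernoulliNB:  ", b_once, b_thrice)  # identical, repetition is discarded
```

Whether discarding the counts helps or hurts depends on the data. In this notebook's results, keeping them gave the higher accuracy.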

#### **Step 4: Optimize the previous model**
After training the model in its three implementations, choose the best option and try to optimize its results with a random forest, if possible.
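
For `MultinomialNB`, the main tunable hyperparameters are `alpha` (additive smoothing) and `fit_prior` (whether class priors are learned from the data or taken as uniform). The sketch below illustrates the effect of `alpha` on the estimated probability of a word given a class, using the additive smoothing formula and hypothetical counts. The helper `smoothed_probability` is written for this illustration and is not part of scikit-learn.

```python
def smoothed_probability(word_count, total_count, vocab_size, alpha):
    """Additive smoothing estimate of P(word | class)."""
    return (word_count + alpha) / (total_count + alpha * vocab_size)


# Hypothetical numbers: 100 word occurrences in the class, 50 distinct words.
total_count, vocab_size = 100, 50

for alpha in (0.01, 1.0, 10.0):
    unseen = smoothed_probability(0, total_count, vocab_size, alpha)
    frequent = smoothed_probability(20, total_count, vocab_size, alpha)
    print(f"alpha={alpha}: unseen word={unseen:.4f}, frequent word={frequent:.4f}")
```

A larger `alpha` pulls every estimate toward the uniform value `1 / vocab_size`, so unseen words are penalized less and frequent words count for less.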

In [37]:
hyperparams = {
    'alpha': np.linspace(1.0, 10.0, 100),
    'fit_prior': [True, False]
}

grid_search = RandomizedSearchCV(model, hyperparams, n_iter=50, scoring='accuracy', cv=5, random_state=42)
grid_search

In [38]:
grid_search.fit(X_train, y_train)
print(f'Best hyperparameters: {grid_search.best_estimator_}')

Best hyperparameters: MultinomialNB(alpha=np.float64(1.8181818181818183), fit_prior=False)
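
One caveat when searching over an already vectorized matrix is that the vocabulary was learned from all training rows, including the ones later used as validation folds. A possible alternative, sketched here under the assumption that the raw texts are still available, is to put the vectorizer and the classifier in a `Pipeline` so that each fold fits its own vocabulary. The toy reviews, step names (`vec`, `nb`), and candidate values are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy reviews; a real search would use the raw training texts.
texts = [
    "love this app", "great update", "works great", "love the new design",
    "keeps crashing", "terrible update", "crashing all the time", "bad design",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("vec", CountVectorizer()),
    ("nb", MultinomialNB()),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_distributions = {
    "nb__alpha": [0.1, 0.5, 1.0, 2.0, 5.0],
    "nb__fit_prior": [True, False],
}

search = RandomizedSearchCV(
    pipeline, param_distributions, n_iter=4, scoring="accuracy", cv=2, random_state=42
)
search.fit(texts, labels)

print(search.best_params_)  # dictionary with the selected values
print(search.best_score_)   # mean cross-validated accuracy of that candidate
```

`best_params_` returns only the chosen values, which can be easier to reuse than the printed estimator.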


In [39]:
model = MultinomialNB(alpha=1.8181818181818183, fit_prior=False)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

0.8156424581005587

#### **Step 5: Save the model**
Store the model in the appropriate folder.

In [40]:
with open("../models/naive_bayes_alpha-1.81818181_fit_prior-False_42.sav", "wb") as f:
    dump(model, f)
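
The saved classifier expects inputs with exactly the columns produced by the fitted `CountVectorizer`, so to classify new reviews later the vectorizer has to be stored as well. The sketch below shows one possible way to do that with `pickle`. The toy data, the dictionary layout, and the file name `nb_bundle.pkl` are assumptions for illustration.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the real training set.
texts = ["good app", "love it great", "bad app", "terrible crashes"]
labels = [1, 1, 0, 0]

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(texts), labels)

# Store vectorizer and model together (hypothetical file name, temporary folder).
path = os.path.join(tempfile.mkdtemp(), "nb_bundle.pkl")
with open(path, "wb") as f:
    pickle.dump({"vectorizer": vec, "model": clf}, f)

# Later: load both and classify a new review.
with open(path, "rb") as f:
    bundle = pickle.load(f)

new_review = ["great app"]
before = clf.predict(vec.transform(new_review))
after = bundle["model"].predict(bundle["vectorizer"].transform(new_review))
print(before, after)  # the restored objects reproduce the original prediction
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.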

#### **Step 6: Explore other alternatives**
Which of the other models we have studied could you use to try to improve on the Naive Bayes results? Justify your choice and train the model.
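
As a hedged sketch of what such a comparison could look like, the block below trains two common alternatives for count features, logistic regression and a random forest, on hypothetical toy reviews. Whether these are among the models covered in the module is an assumption, and the data, step names, and settings are illustrative only.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy reviews; the real comparison would use the project dataset.
texts = [
    "love this app", "great update", "works great", "love the new design",
    "keeps crashing", "terrible update", "crashing all the time", "bad design",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

predictions = {}
for name, estimator in candidates.items():
    # Same vectorization step for every candidate, so only the classifier changes.
    pipe = Pipeline([("vec", CountVectorizer()), ("clf", estimator)])
    pipe.fit(texts, labels)
    predictions[name] = pipe.predict(["great app", "keeps crashing"])
    print(name, predictions[name])
```

On the real data, each candidate would be scored on the same test split with `accuracy_score` so the results are comparable with the Naive Bayes figures.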

In [45]:
hyperparams = {
    'alpha': np.linspace(0.01, 10.0, 100),
    'fit_prior': [True, False],
    'class_prior': [None, [0.6, 0.4], [0.5, 0.5], [0.7, 0.3]]
}

grid_search = RandomizedSearchCV(model, hyperparams, n_iter=100, scoring='accuracy', cv=5, random_state=42)
grid_search

grid_search.fit(X_train, y_train)
print(f'Best hyperparameters: {grid_search.best_estimator_}')

best_model = grid_search.best_estimator_

alpha = best_model.alpha
fit_prior = best_model.fit_prior
class_prior = best_model.class_prior

Best hyperparameters: MultinomialNB(alpha=np.float64(1.7254545454545454), class_prior=[0.6, 0.4],
              fit_prior=False)


In [46]:
model = MultinomialNB(alpha=alpha, fit_prior=fit_prior, class_prior=class_prior)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

0.8324022346368715