# Sentiment analysis

Your time to shine!

We will use data from `Amazon Reviews: Unlocked Mobile Phones`, available on the Kaggle platform and described as:

> PromptCloud extracted 400 thousand reviews of unlocked mobile phones sold on Amazon.com to find out insights with respect to reviews, ratings, price and their relationships. \[...\] Data was acquired in December, 2016 by the crawlers build to deliver \[their\] data extraction services.

([source](https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones/data#))

#### Load useful libraries and data

In [None]:
import pandas as pd
import numpy as np

import nltk

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

In [None]:
# Load data
data = pd.read_csv('./data/644_1225_compressed_Amazon_Unlocked_Mobile.csv.zip', compression="zip")

# Let's keep only a fraction of the data to speed up computations
data = data.sample(frac=0.1, random_state=10)

# Drop missing values
data.dropna(inplace=True)

# Remove any 'neutral' ratings equal to 3
# (.copy() avoids a SettingWithCopyWarning when adding a column below)
data = data[data['Rating'] != 3].copy()

# Consider 4 and 5 as positive ratings (encoded as 1)
# and 1 and 2 as negative ones (encoded as 0)
data['positive_rating'] = np.where(data['Rating'] > 3, 1, 0)

In [None]:
data.sample(3)

## Data exploration

Before diving into the sentiment analysis, what can you tell me about the data?
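As a starting point, here is a minimal sketch of two checks that are often useful before modelling: the class balance and the typical review length per class. It runs on a tiny made-up DataFrame (the rows are invented for illustration; only the column names `Reviews`, `Rating` and `positive_rating` come from this notebook), so adapt it to `data` rather than reading anything into its numbers.

```python
import pandas as pd

# Toy stand-in for `data`; the rows are invented for illustration only
toy = pd.DataFrame({
    'Reviews': ['Great phone', 'Stopped working after a week', 'Love it', 'Bad battery'],
    'Rating': [5, 1, 4, 2],
})
toy['positive_rating'] = (toy['Rating'] > 3).astype(int)

# Share of positive (1) and negative (0) reviews
class_balance = toy['positive_rating'].value_counts(normalize=True)
print(class_balance)

# Average review length (in characters) for each class
toy['review_length'] = toy['Reviews'].str.len()
mean_length = toy.groupby('positive_rating')['review_length'].mean()
print(mean_length)
```

A strongly imbalanced class distribution is worth noting, since it affects how a score such as AUC should be interpreted.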

In [None]:
# Your code here


## Sentiment analysis

In [None]:
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data['Reviews'], data['positive_rating'], random_state=0
)

Let's investigate positive / negative review classification.

#### With CountVectorizer

* What would be the AUC score (on test data) of a classifier using CountVectorizer and a Logistic Regression model (with `max_iter=1500`)?
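
The overall pattern (fit the vectorizer on the training texts, train the model, score the test texts) can be sketched on a toy corpus as follows. The texts, labels and `toy_*` names are made up for illustration, and scoring with `predict_proba` rather than hard `predict` labels is a choice made here, not something the exercise prescribes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Toy corpus (invented for illustration); 1 = positive, 0 = negative
train_texts = ["great phone", "love this phone", "excellent battery",
               "terrible phone", "bad battery", "awful screen"]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["great battery", "love it", "bad phone", "terrible screen"]
test_labels = [1, 1, 0, 0]

# Learn the vocabulary on the training texts only
toy_vect = CountVectorizer().fit(train_texts)

# Train on the document-term count matrix
toy_model = LogisticRegression(max_iter=1500)
toy_model.fit(toy_vect.transform(train_texts), train_labels)

# AUC is computed from scores: here, the predicted probability of class 1
test_scores = toy_model.predict_proba(toy_vect.transform(test_texts))[:, 1]
toy_auc = roc_auc_score(test_labels, test_scores)
print(f"Toy AUC: {toy_auc:.3f}")
```

Note that the test texts are only ever passed through `transform`, never `fit`, so words unseen during training are simply ignored.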


In [None]:
# Your code here


* With the classifier built above, which 20 tokens are the most associated with negative reviews?

In [None]:
# Your code here


* With the classifier built above, which 20 tokens are the most associated with positive reviews?

In [None]:
# Your code here


#### With TF-IDF Vectorizer

* What would be the AUC score (on test data) of a classifier using TfidfVectorizer with a minimum document frequency of 3 (`min_df=3`) and a Logistic Regression model (with `max_iter=1500`)?
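
To see what `min_df` does, here is a small sketch on a made-up three-document corpus. With an integer value, tokens that appear in fewer than `min_df` documents are dropped from the vocabulary; the threshold of 2 is picked for this tiny corpus only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (invented for illustration)
toy_texts = ["good phone", "good battery", "bad phone"]

# Default: every token is kept
full_vocab = sorted(TfidfVectorizer().fit(toy_texts).vocabulary_)

# min_df=2: keep only tokens present in at least 2 documents
pruned_vocab = sorted(TfidfVectorizer(min_df=2).fit(toy_texts).vocabulary_)

print(full_vocab)    # ['bad', 'battery', 'good', 'phone']
print(pruned_vocab)  # ['good', 'phone']
```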


In [None]:
# Your code here


* Which tokens have the 10 smallest and 10 largest TF-IDF coefficients?

In [None]:
# Your code here


* With the classifier built above, which 20 tokens are the most associated with negative reviews?

In [None]:
# Your code here


* With the classifier built above, which 20 tokens are the most associated with positive reviews?

In [None]:
# Your code here


## Testing our sentiment analyzer

Can our classifiers tell the following two reviews apart?
* 'not an issue, phone is working'
* 'an issue, phone is not working'

In [None]:
# Assumes `vect` and `model` are the fitted vectorizer and classifier from the cells above
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

What seems to be the issue?

## Toward improving our positive / negative review classifier

_Hint to move forward_: consider the `ngram_range` option of `CountVectorizer`.

In [None]:
# Your code here


Is the issue now fixed?

In [None]:
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))