# DSFM text-as-data workshop

## 4. NLP Deep learning: sentiment analysis

Creator: [Data Science for Managers - EPFL Program](https://www.dsfm.ch)

Source: [https://github.com/dsfm-org/code-bank.git](https://github.com/dsfm-org/code-bank.git)

License: [MIT License](https://opensource.org/licenses/MIT). See open source [license](LICENSE) in the Code Bank repository.

### Overview

In this notebook we will classify reviews as positive or negative. We will start with a simple approach (TF-IDF + Naive Bayes) to establish a baseline, and then improve on it using neural networks.

## Part 1: TF-IDF and Naive Bayes


Q1: Load the `yelp_review.csv` dataset and store it under `df`

In [None]:
import pandas as pd

import numpy as np
# Fix random seed for reproducibility
np.random.seed(42)

df = pd.read_csv("./data/yelp_review.csv")
df.shape

Q2: Let's start by simplifying the problem a bit: transform the `stars` column into a binary feature that encodes positive and negative sentiment.

By looking at the `median` and `mean` values of the `stars` column, pick a good threshold and generate the `target` column. All reviews with a star rating above that threshold will be labeled `1` and the rest will be labeled `0`.

In [None]:
print("Stars median: ", # YOUR CODE HERE #)
print("Stars mean: ", # YOUR CODE HERE #)

In [None]:
df['target'] = # YOUR CODE HERE #
df.head(2)
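
As a hedged illustration of the thresholding step, the sketch below uses a small made-up DataFrame (`toy`) rather than the Yelp data. The ratings and the `threshold` value are assumptions chosen only for the example; the threshold for the real data should come from the statistics printed above.

```python
import pandas as pd

# Hypothetical ratings for illustration only (not the Yelp data).
toy = pd.DataFrame({"stars": [1, 2, 3, 4, 5, 5]})

print("Toy median:", toy["stars"].median())
print("Toy mean:", toy["stars"].mean())

# Assumed threshold for this toy example.
threshold = 3

# Ratings strictly above the threshold become 1 (positive), the rest 0 (negative).
toy["target"] = (toy["stars"] > threshold).astype(int)
print(toy)
```

The comparison yields a boolean Series, and `.astype(int)` converts it to the `0`/`1` encoding the exercise asks for.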

Q3: Look at the target distribution. Is it evenly distributed? What baseline will we need to beat?

In [None]:
# YOUR CODE HERE #
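
One common way to read "baseline" here is the accuracy of always predicting the most frequent class. The sketch below shows that idea on hypothetical labels (`toy_target`), not on the workshop dataset.

```python
import pandas as pd

# Hypothetical labels for illustration only.
toy_target = pd.Series([1, 1, 1, 0, 1, 0, 1, 1])

# Share of each class in the data.
class_share = toy_target.value_counts(normalize=True)
print(class_share)

# A classifier that always predicts the majority class reaches this accuracy.
baseline_accuracy = class_share.max()
print("Majority-class baseline accuracy:", baseline_accuracy)
```

Any model that does not exceed this number has learned nothing beyond the class imbalance.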

Q4: Split the DataFrame into a `train_df` (80%) and `test_df` (20%). You can use scikit-learn `train_test_split`.

In [None]:
from sklearn.model_selection import train_test_split
train_df, test_df = # YOUR CODE HERE #

print("train_df.shape: ", train_df.shape)
print("test_df.shape: ", test_df.shape)
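
For reference, a minimal sketch of an 80/20 split on a made-up DataFrame (`toy`, `toy_train` and `toy_test` are hypothetical names). The fixed `random_state` is an assumption added so the split is reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data for illustration only.
toy = pd.DataFrame({
    "text": [f"review {i}" for i in range(10)],
    "target": [0, 1] * 5,
})

# test_size=0.2 keeps 80% of the rows for training and 20% for testing.
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=42)

print("toy_train.shape:", toy_train.shape)
print("toy_test.shape:", toy_test.shape)
```

When the classes are imbalanced, passing `stratify=toy["target"]` keeps the class proportions similar in both parts.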

Q5: Fit a `TfidfVectorizer` from scikit-learn with `max_features=500` on `train_df['text']` and store the resulting TF-IDF matrix under `X_train`

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = # YOUR CODE HERE #

X_train = # YOUR CODE HERE #

Q6: Transform `test_df['text']` with the vectorizer fitted in Q5 (do not refit it on the test data) and store the result under `X_test`

In [None]:
X_test = # YOUR CODE HERE #
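
The difference between fitting on the training text and only transforming the test text can be sketched on a few made-up sentences. The texts and the small `max_features=5` are assumptions for illustration; the exercise itself uses `max_features=500`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical texts for illustration only.
train_texts = ["great food and great service", "terrible food", "service was slow"]
test_texts = ["great service", "unknown words only"]

vectorizer = TfidfVectorizer(max_features=5)

# fit_transform learns the vocabulary and IDF weights from the training texts.
train_matrix = vectorizer.fit_transform(train_texts)

# transform reuses that vocabulary, so the test matrix has the same columns.
test_matrix = vectorizer.transform(test_texts)

print(train_matrix.shape, test_matrix.shape)
```

Words that never appeared in the training texts are simply ignored at transform time, which is why the test set must not be used for fitting.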

Q7: Get the `y_train` and `y_test` targets as NumPy arrays

In [None]:
y_train = # YOUR CODE HERE #
y_test = # YOUR CODE HERE #

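Q8: Fit a Naive Bayes classifier on the training data using [GaussianNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) and store its predictions for the test set under `y_predicted`. Note that `GaussianNB` expects dense input, so a sparse TF-IDF matrix must first be converted, for example with `.toarray()`.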

In [None]:
from sklearn.naive_bayes import GaussianNB
import numpy as np

gnb = # YOUR CODE HERE #

y_predicted = # YOUR CODE HERE #
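
A minimal sketch of the fit-and-predict pattern on made-up texts (all names and data below are hypothetical). `TfidfVectorizer` returns a sparse matrix, so the sketch converts it to a dense array before passing it to `GaussianNB`; for brevity it predicts on the same texts it was trained on, which the real exercise should not do.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# Hypothetical texts and labels for illustration only.
texts = ["great food", "loved it", "awful service", "terrible food"]
labels = np.array([1, 1, 0, 0])

features = TfidfVectorizer().fit_transform(texts)

# Convert the sparse TF-IDF matrix to a dense array for GaussianNB.
dense_features = features.toarray()

model = GaussianNB()
model.fit(dense_features, labels)

predictions = model.predict(dense_features)
print(predictions)
```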

Q9: Display the accuracy as well as the ROC AUC score (`roc_auc_score`)

In [None]:
from sklearn.metrics import accuracy_score

print(f"Accuracy score: # YOUR CODE HERE #")

from sklearn.metrics import roc_auc_score

print(f"Roc-auc score: # YOUR CODE HERE #")
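
To show how the two metrics are called, here is a sketch on made-up labels and scores (all arrays are hypothetical). `accuracy_score` compares hard predictions, while `roc_auc_score` is usually given a score or probability for the positive class, such as the second column of a classifier's `predict_proba` output.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical ground truth, hard predictions and positive-class scores.
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])
y_score = np.array([0.1, 0.6, 0.7, 0.9])

# Fraction of predictions that match the true labels.
accuracy = accuracy_score(y_true, y_pred)

# Probability that a random positive is scored above a random negative.
auc = roc_auc_score(y_true, y_score)

print(f"Accuracy score: {accuracy:.2f}")
print(f"Roc-auc score: {auc:.2f}")
```

Passing hard `0`/`1` predictions to `roc_auc_score` also works, but it discards the ranking information that the metric is designed to measure.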

## Part 2: Neural Networks

For the next section, we will use [TensorFlow Keras](https://www.tensorflow.org/guide/keras/sequential_model), a well-known framework for deep learning. Training neural networks, however, is computationally very expensive, so a GPU is usually needed to train them in a reasonable time. Google provides a free and easy way to get access to a GPU...

[Colab](https://colab.research.google.com/) is a tool from Google that lets you share and run Google's own flavor of Jupyter notebooks in the cloud. It also lets you use a computer with GPUs for free. You do not need a Google account to __see__ the following notebook, but you will need one (such as a Gmail account) if you want to actually execute the code.

__Click here to continue this example on Google Colab...__ [4-nlp-deep-learning](https://colab.research.google.com/drive/1XMNrfIx0Ttm2DkdlWi668TGMZvNNt9gv#scrollTo=n-GJZv_xcnrj)