# Tutorial 8 (I)
# Text Classification - Sentiment Analysis

One of the common uses for binary classification in machine learning is analyzing text for sentiment: assigning a text string a score from <b>0</b> to <b>1</b>, where <b>0</b> represents negative sentiment and <b>1</b> represents positive sentiment. A restaurant review such as "Best meal we have ever had and awesome service, too!" might score <b>0.9</b>, while a statement such as "Long lines and poor customer service" might score <b>0.1</b>. Marketing departments sometimes use sentiment-analysis models to monitor social-media services for feedback so they can respond quickly if, for example, comments regarding their company suddenly turn negative.

To train a sentiment-analysis model, you need a dataset containing text strings labeled with 0s (for negative sentiment) and 1s (for positive sentiment). Several such datasets are available in the public domain. We will use one containing 50,000 movie reviews, each labeled with a <b>0</b> or <b>1</b>. Once the model is trained, scoring a text string for sentiment is a simple matter of passing it to the model and asking for the probability that the predicted label is <b>1</b>. A probability of 80% means the sentiment score is <b>0.8</b> and that the text is very positive.

## Load and prepare the data

The first step is to load the dataset and prepare it for use in machine learning. Because machine-learning models operate on numbers rather than raw text, we use scikit-learn's [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class to vectorize the training text. Then we split the data for training and testing.

In [1]:
import pandas as pd
 
df = pd.read_csv('Data/reviews.csv', encoding='ISO-8859-1')
df.head()


Find out how many rows the dataset contains and confirm that there are no missing values.

In [None]:
df.info()

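Check for duplicate rows in the dataset. In the output, a `unique` value lower than `count` for a class means that some reviews appear more than once.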

In [None]:
df.groupby('Sentiment').describe()

The dataset contains a few hundred duplicate rows. Let's remove them and check for balance.

In [None]:
df = df.drop_duplicates()
df.groupby('Sentiment').describe()
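
As a complement to reading the `describe()` output, duplicates and class balance can also be counted directly. The following is a minimal sketch on a tiny hypothetical DataFrame (`toy_df` and its contents are made up for illustration; only the `Text` and `Sentiment` column names come from the dataset above).

```python
import pandas as pd

# Hypothetical stand-in for the reviews DataFrame; the real data is loaded from the CSV file.
toy_df = pd.DataFrame({
    'Text': ['good film', 'bad film', 'good film', 'fine film'],
    'Sentiment': [1, 0, 1, 1],
})

# duplicated() marks every repeat of an earlier row, so the sum is the number of extra copies.
n_duplicates = toy_df.duplicated().sum()
print(n_duplicates)

# Drop the repeats, then count how many samples remain in each class.
deduped = toy_df.drop_duplicates()
balance = deduped['Sentiment'].value_counts()
print(balance)
```

If one class turns out to be far larger than the other, accuracy alone becomes a less informative measure of the model's quality.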

Use [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to vectorize the text in the DataFrame's "Text" column using a built-in dictionary of stop words. Set `min_df` to 20 to ignore terms that appear in fewer than 20 documents (here, reviews) in the corpus of training text. This reduces the likelihood of out-of-memory errors and will probably make the model more accurate as well. Also use the `ngram_range` parameter so that `CountVectorizer` considers word pairs as well as individual words.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range = (1, 2), stop_words = 'english', min_df = 20)
x = vectorizer.fit_transform(df['Text'])
y = df['Sentiment']

In addition to creating sparse matrices of vectorized text, `CountVectorizer` converts text to lowercase, removes stop words and punctuation characters, and more. Let's see how it cleans text before vectorizing it by transforming a string, and then reversing the transform.

In [None]:
text = vectorizer.transform(['The long l3ines   and; pOOr customer# service really turned me off...123.'])
text = vectorizer.inverse_transform(text)
print(text)

Split the dataset for training and testing. We'll do a 50/50 split since the dataset contains nearly 50,000 samples.

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=0)  
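
The split above is purely random. As an optional variation that is not part of the original tutorial, `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both halves. The sketch below uses made-up toy data (`toy_x`, `toy_y`) to show the effect.

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 10 negative (0) and 10 positive (1).
toy_x = list(range(20))
toy_y = [0] * 10 + [1] * 10

# stratify=toy_y asks for the same class ratio in the training and test halves.
toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_x, toy_y, test_size=0.5, random_state=0, stratify=toy_y)

# Number of positive labels in each half.
print(sum(toy_y_train), sum(toy_y_test))
```

With a large, roughly balanced dataset such as this one, a random split will usually be close to balanced anyway, so stratification matters more for small or skewed datasets.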

## Train a logistic-regression model

The next step is to train a classifier. We use scikit-learn's [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) classifier, which uses [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression) to fit a model to the data.

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter = 1000, random_state = 0)
model.fit(x_train, y_train)

Validate the trained model with the 50% of the dataset set aside for testing and show a confusion matrix.

In [None]:
%matplotlib inline
from sklearn.metrics import ConfusionMatrixDisplay as cmd

cmd.from_estimator(model, x_test, y_test,
                   display_labels=['Negative', 'Positive'],
                   cmap='Blues', xticks_rotation='vertical')

The model correctly identified 10,795 negative reviews while misclassifying 1,574 of them. It correctly identified 10,966 positive reviews while misclassifying 1,456.
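
The four counts above can be turned into summary metrics with a little arithmetic. The sketch below hard-codes the numbers quoted in the text (a different run or dataset version would produce different counts) and treats "positive" as the class of interest for precision and recall.

```python
# Counts quoted from the confusion matrix described above.
tn, fp = 10_795, 1_574   # negative reviews: correctly identified, misclassified
tp, fn = 10_966, 1_456   # positive reviews: correctly identified, misclassified

total = tn + fp + fn + tp

accuracy = (tp + tn) / total     # share of all test reviews classified correctly
precision = tp / (tp + fp)       # share of "positive" predictions that were right
recall = tp / (tp + fn)          # share of truly positive reviews that were found

print(f"accuracy:  {accuracy:.3f}")    # 0.878
print(f"precision: {precision:.3f}")   # 0.874
print(f"recall:    {recall:.3f}")      # 0.883
```

In other words, the model gets roughly 88% of the test reviews right, with errors spread fairly evenly between the two classes.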

## Use the model to analyze text

Let us score a review by vectorizing the text of that review and passing it to the model's `predict_proba` method. Are the results consistent with what you would expect?

In [None]:
review = 'The long lines and poor customer service really turned me off.'
model.predict_proba(vectorizer.transform([review]))[0][1]
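
The `[0][1]` indexing works because `predict_proba` returns one row per input string and one column per class, with the columns ordered as in the model's `classes_` attribute. The following sketch illustrates this on a tiny hypothetical model (`toy_texts`, `toy_labels`, `toy_vectorizer`, and `toy_model` are invented stand-ins, not the model trained above).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical four-sample training set.
toy_texts = ['great film', 'wonderful acting', 'terrible film', 'awful acting']
toy_labels = [1, 1, 0, 0]

toy_vectorizer = CountVectorizer()
toy_model = LogisticRegression()
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

# One input string gives one row; two classes give two columns.
proba = toy_model.predict_proba(toy_vectorizer.transform(['great wonderful film']))
print(toy_model.classes_)   # column order of the probabilities
print(proba.shape)

# Look up the column for label 1 instead of assuming its position.
positive_col = list(toy_model.classes_).index(1)
score = proba[0][positive_col]
print(score)
```

Each row sums to 1, so the value in the other column is simply the probability of negative sentiment.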

Now score a more positive review and see if the model agrees that the sentiment is positive.

In [None]:
review = 'The food was great and the service was excellent!'
model.predict_proba(vectorizer.transform([review]))[0][1]

Feel free to try sentences of your own and see if you agree with the sentiment scores the model predicts. It’s not perfect, but it’s good enough that if you run hundreds of reviews or comments through it, you should get a reliable indication of the positivity or negativity expressed in the text.
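
Scoring many comments at once and averaging the results might look like the sketch below. It is an illustration under stated assumptions, not the tutorial's own code: the training corpus and comments are made up, and scikit-learn's `make_pipeline` is used here as a convenience to bundle the vectorizer and classifier so raw strings can be passed in directly.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus standing in for the movie-review dataset.
toy_texts = [
    'great food and excellent service',
    'wonderful meal, great staff',
    'excellent and wonderful experience',
    'poor service and long lines',
    'terrible food, awful staff',
    'awful and terrible experience',
]
toy_labels = [1, 1, 1, 0, 0, 0]

# The pipeline vectorizes the text and then applies logistic regression.
toy_pipeline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
toy_pipeline.fit(toy_texts, toy_labels)

# Hypothetical batch of incoming comments.
comments = [
    'great food, excellent service',
    'poor food, terrible service',
    'wonderful experience',
]

# Column 1 holds the probability of label 1 (positive sentiment) for each comment.
scores = toy_pipeline.predict_proba(comments)[:, 1]
mean_score = scores.mean()

for comment, score in zip(comments, scores):
    print(f'{score:.2f}  {comment}')
print(f'mean sentiment: {mean_score:.2f}')
```

Averaging over a batch smooths out individual misclassifications, which is why an aggregate score tends to be more trustworthy than any single prediction.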

## References
* Applied Machine Learning and AI for Engineers, Jeff Prosise, O'Reilly Media, Inc., November 2022, 425 pages.