# Hands-on introduction to ML training
So far, we have worked with simple, tabular data. In this notebook, we will learn how to deal with text in a sentiment analysis problem.

### Step 1: Load and explore data
The first step is figuring out the data source. In this case we will use a pre-existing dataset. We will:
1. Create a folder named 'data'.
2. Download the file from a public GitHub repo using the Python package "requests" and save it as IMDB_Data.csv in the data folder.

In [240]:
%config IPCompleter.greedy=True #Helps with auto-complete

import numpy as np
import pandas as pd
import os

try:
    os.mkdir('data')
except OSError as error:
    print(error)

import requests, csv

url = 'https://raw.githubusercontent.com/techno-nerd/ML_101_Course/main/07%20Unstructured%20Data%20-%20Text/data/IMDB_Data.csv'
r = requests.get(url)
with open('data/IMDB_Data.csv', 'w', newline='', encoding='utf-8') as f: #newline='' lets the csv module control line endings
  writer = csv.writer(f)
  for line in r.iter_lines():
    writer.writerow(line.decode('utf-8').split(','))

[Errno 17] File exists: 'data'


In [241]:
dataset = pd.read_csv('data/IMDB_Data.csv')

In [242]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     1000 non-null   object
 1   sentiment  1000 non-null   object
dtypes: object(2)
memory usage: 15.8+ KB


### [Kaggle Dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)

This dataset contains movie reviews labelled with their sentiment. The original dataset has 50,000 rows; to keep training time short, only the first 1,000 rows have been saved for this notebook.

In [243]:
dataset.head()

                                              review  sentiment
0  One of the other reviewers has mentioned that ...   positive
1  "A wonderful little production. <br /><br />Th...   positive
2  "I thought this was a wonderful way to spend t...   positive
3  Basically there's a family where a little boy ...   negative
4  "Petter Mattei's ""Love in the Time of Money""...   positive


In [244]:
dataset['sentiment'].value_counts()

sentiment
positive    501
negative    499
Name: count, dtype: int64
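
The two classes are almost perfectly balanced, so a model that always predicts the more common class would be right only about half the time. A minimal sketch of that baseline, rebuilt from the counts printed above (the helper name `majority_baseline` is ours and is not part of the notebook):

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Rebuild the label column from the value_counts() output
labels = ['positive'] * 501 + ['negative'] * 499
baseline = majority_baseline(labels)
print(f"Majority-class baseline accuracy: {baseline:.3f}")
```

Any model trained later should beat this number comfortably to be considered useful.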

### Step 2: Data preparation

For Natural Language Processing (NLP), the following steps are usually taken:
1. Removal of HTML content, like the "<br />" tags
2. Removal of punctuation and special characters
3. Removal of stopwords ("is", "the", "a", etc.), which carry little meaning on their own
4. Lemmatizing - Reducing related word forms to a common root. For example, learnt, learning and learn all map to the root learn
5. Vectorisation - Encoding the cleaned text into numerical values
6. Replace "positive" and "negative" with 1 and 0 for the labels (sentiment)

Then, we split the data into training (80%) and testing (20%) sets.

#### Cleaning Step 1:

We will use the Beautiful Soup library to get rid of the HTML content

In [245]:
from bs4 import BeautifulSoup

test_review = dataset['review'][1]
soup = BeautifulSoup(test_review, "html.parser")

In [246]:
print(test_review)

"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only ""has got all the polari"" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done. A w

In [247]:
review = soup.get_text()
print(review)

"A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece. The actors are extremely well chosen- Michael Sheen not only ""has got all the polari"" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. The realism really comes home with the little things: the fantasy of the guard which rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done. A wonderful little production. The film

In [248]:
#Strip the HTML from every review and collect the results in a single 'review' column
reviews_clean = pd.DataFrame({
    'review': [BeautifulSoup(text, "html.parser").get_text() for text in dataset['review']]
})


In [249]:
print(reviews_clean.head())

                                              review
0  One of the other reviewers has mentioned that ...
1  "A wonderful little production. The filming te...
2  "I thought this was a wonderful way to spend t...
3  Basically there's a family where a little boy ...
4  "Petter Mattei's ""Love in the Time of Money""...
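
For intuition about what `get_text()` is doing, here is a rough sketch that uses only Python's standard-library `html.parser` instead of Beautiful Soup. The class `TextExtractor` and the function `strip_tags` are names we made up for illustration; Beautiful Soup handles many more edge cases.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)

sample = "A wonderful little production. <br /><br />The filming technique is very unassuming."
stripped = strip_tags(sample)
print(stripped)
```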


#### Cleaning Step 2:

We will use the re library to get rid of punctuation and special characters

In [250]:
import re

def clean_and_split(text):
    #Replace anything that is not alphabetical with a space
    text = re.sub('[^a-zA-Z]', ' ', text)
    text = text.lower()

    #Split the text into a list for iterating over words later
    return text.split()

#Assign the whole column at once; writing to reviews_clean['review'][i] is chained assignment, which pandas warns about
reviews_clean['review'] = reviews_clean['review'].apply(clean_and_split)

In [251]:
print(reviews_clean.head(2))

                                              review
0  [one, of, the, other, reviewers, has, mentione...
1  [a, wonderful, little, production, the, filmin...
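
To see exactly what the regex keeps and discards, it can help to run it on one short string. The sketch below repeats the same pattern on a made-up sentence; the function name `letters_only_tokens` is ours.

```python
import re

def letters_only_tokens(text):
    # Every character that is not a-z or A-Z becomes a space
    text = re.sub('[^a-zA-Z]', ' ', text)
    return text.lower().split()

tokens = letters_only_tokens("It's a 10/10, old-time-BBC production!")
print(tokens)
```

Note the side effects: digits disappear entirely, hyphenated words are split apart, and "It's" becomes the two tokens "it" and "s".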


#### Cleaning Step 3:

We will use the nltk library for removing stop words

In [252]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english')) #Build the set once, not once per word

reviews_clean['review'] = reviews_clean['review'].apply(
    lambda words: [word for word in words if word not in stop_words])

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/arushgarg/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [253]:
print(reviews_clean.head(2))

                                              review
0  [one, reviewers, mentioned, watching, oz, epis...
1  [wonderful, little, production, filming, techn...
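
The filtering itself is a simple membership test. The sketch below uses a tiny hand-picked stop-word set purely for illustration (NLTK's English list is much longer), and `remove_stop_words` is a hypothetical helper name.

```python
# A toy stop-word set for illustration only; NLTK's list is far larger
TOY_STOP_WORDS = {"is", "the", "a", "of", "not", "and"}

def remove_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    # Keep only the tokens that are not in the stop-word set
    return [token for token in tokens if token not in stop_words]

kept = remove_stop_words("the plot is not a good one".split())
print(kept)
```

One thing worth keeping in mind: NLTK's English list also contains "not", so a phrase like "not good" loses its negation after this step, which can make some negative reviews look positive to the model.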


#### Cleaning Step 4:

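Using the WordNetLemmatizer from nltk, we will reduce each word to its dictionary form (lemma). Called without a part-of-speech tag, `lemmatize` treats every word as a noun, so "reviewers" becomes "reviewer" while verb forms such as "mentioned" are left unchanged.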

In [254]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lem = WordNetLemmatizer()

def lemmatize_words(words):
    #Lemmatize each word, then turn the review back into a string
    return ' '.join(lem.lemmatize(word) for word in words)

reviews_clean['review'] = reviews_clean['review'].apply(lemmatize_words)

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/arushgarg/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [255]:
print(reviews_clean.head())

                                              review
0  one reviewer mentioned watching oz episode hoo...
1  wonderful little production filming technique ...
2  thought wonderful way spend time hot summer we...
3  basically family little boy jake think zombie ...
4  petter mattei love time money visually stunnin...
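
The real `lemmatize` method accepts a `pos` argument that defaults to noun, which is why verbs pass through untouched above. The toy sketch below mimics that behaviour with two small lookup tables. The tables and the function `toy_lemmatize` are invented for illustration and are not how WordNet works internally.

```python
# Toy lookup tables standing in for WordNet (illustrative entries only)
NOUN_LEMMAS = {"reviewers": "reviewer", "episodes": "episode"}
VERB_LEMMAS = {"learnt": "learn", "learning": "learn", "mentioned": "mention"}

def toy_lemmatize(word, pos="n"):
    # Pick the table for the requested part of speech; unknown words pass through
    table = NOUN_LEMMAS if pos == "n" else VERB_LEMMAS
    return table.get(word, word)

words = ["reviewers", "mentioned", "learning"]
as_nouns = [toy_lemmatize(word) for word in words]
as_verbs = [toy_lemmatize(word, pos="v") for word in words]
print(as_nouns)
print(as_verbs)
```

With the real lemmatizer, passing `pos='v'` for verbs should similarly reduce forms like "learning" to "learn", at the cost of needing to know each word's part of speech.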


#### Cleaning Step 5:

We will use a CountVectorizer from sklearn to create an array of the number of times a word appears in the review. This will be used as the input for our model.

This is one of several vectorizers. Others include a binary vectorizer, which only records whether a word is present, and TF-IDF (Term Frequency-Inverse Document Frequency), which down-weights words that appear in many reviews.
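
To make the TF-IDF idea concrete, here is a small pure-Python sketch of the textbook formula (term count multiplied by log(N / document frequency)). This is an assumption-laden illustration: sklearn's `TfidfVectorizer` uses a smoothed variant and normalises each row, so its numbers will differ. The function name `tf_idf` is ours.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Textbook TF-IDF: term count times log(N / document frequency)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({term: count * math.log(n_docs / doc_freq[term])
                        for term, count in counts.items()})
    return weights

docs = ["wonderful film wonderful cast", "boring film"]
weights = tf_idf(docs)
print(weights[0])
```

The word "film" appears in both documents, so its weight is zero, while "wonderful" appears twice in only one document and gets the largest weight.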

In [256]:
from sklearn.feature_extraction.text import CountVectorizer

count_vec = CountVectorizer(ngram_range=(1, 3))

reviews_vec = count_vec.fit_transform(reviews_clean['review'])
reviews_vec = reviews_vec.toarray()
print(reviews_vec[:2])

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
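
The `ngram_range=(1, 3)` argument means that single words, pairs of adjacent words and triples of adjacent words are all counted as separate features, which is why the vocabulary grows so quickly. A simplified sketch of that counting (the helper `word_ngrams` is ours, and `CountVectorizer` applies its own tokenisation rules on top):

```python
from collections import Counter

def word_ngrams(tokens, min_n=1, max_n=3):
    # Collect every run of n adjacent tokens for n = min_n .. max_n
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

tokens = "wonderful little production".split()
counts = Counter(word_ngrams(tokens))
print(counts)
```

Three words already produce six features: three unigrams, two bigrams and one trigram.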


#### Cleaning Step 6:

Using pandas' built-in replace method, we will turn the label (sentiment) into a numerical value

In [257]:
labels = dataset['sentiment'].replace({'positive':1, 'negative':0})
labels = labels.to_numpy()
print(labels[:5])

[1 1 1 0 1]


As always, now that we have the data ready, we split it into train and test sets.

In [258]:
import sklearn.model_selection as ms

train_features, test_features, train_labels, test_labels = ms.train_test_split(reviews_vec, labels, test_size=0.2)
print(train_features.shape)
print(test_features.shape)
print(train_labels.shape)
print(test_labels.shape)

(800, 232922)
(200, 232922)
(800,)
(200,)
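
These shapes are worth a quick sanity check on memory, because `toarray()` stores every zero explicitly. The estimate below assumes 8 bytes per count, which is the 64-bit integer type `CountVectorizer` uses by default; the helper name `dense_size_bytes` is ours.

```python
def dense_size_bytes(n_rows, n_cols, bytes_per_value=8):
    # A dense array stores one value per cell, zeros included
    return n_rows * n_cols * bytes_per_value

total_bytes = dense_size_bytes(1000, 232922)
total_gib = total_bytes / 2**30
print(f"{total_bytes:,} bytes = {total_gib:.2f} GiB")
```

If memory becomes a problem, one option is to skip `toarray()` and keep the sparse matrix returned by `fit_transform`, since both `LinearSVC` and `RandomForestClassifier` accept sparse input.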


### Step 3: Model Selection and Training

Since the feature space is very large (232,922 columns for only 800 training rows), many stand-alone models, such as a single Decision Tree, overfit easily or train slowly. Hence, we will use a Support Vector Machine (SVM) and a Random Forest (an ensemble of decision trees).

### Support Vector Machines (SVM)

SVMs are very powerful for this kind of problem, because they can handle the large feature space. A linear SVM tries to find a hyperplane (a linear boundary) that separates the two classes with the widest possible margin.
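
Once trained, a linear SVM classifies a review by computing a weighted sum of its features plus a bias and checking which side of zero the result falls on. The sketch below shows that decision rule with made-up weights for a three-word vocabulary; the numbers and the function `linear_decision` are hypothetical, not values learned by the model in this notebook.

```python
def linear_decision(weights, bias, features):
    # Score = w . x + b; the sign of the score picks the class
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0

# Hypothetical weights for the vocabulary [wonderful, boring, film]
weights = [1.2, -1.5, 0.1]
bias = -0.2

positive_review = [2, 0, 1]  # "wonderful" twice, "film" once
negative_review = [0, 1, 1]  # "boring" once, "film" once

print(linear_decision(weights, bias, positive_review))
print(linear_decision(weights, bias, negative_review))
```

In sklearn, the learned weights and bias of a fitted `LinearSVC` are available as `coef_` and `intercept_`.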

In [259]:
from sklearn.svm import LinearSVC

svm = LinearSVC(C=0.5) #'C' sets the regularisation strength: smaller values favour a wider margin over fitting every training point
svm = svm.fit(train_features, train_labels)



In [260]:
def ClassifierMetrics(labels, predictions):
    total = labels.size
    result = (labels == predictions)
    correct = result.sum()
    accuracy = (correct)/total

    #Precision (correct '1' prediction / total '1' prediction)
    precision = (result[predictions == 1.0].sum()) / (predictions == 1.0).sum()

    #Recall = (correct '1' predictions / total number of '1's)

    recall = (result[predictions == 1.0].sum()) / (labels == 1.0).sum()

    return [accuracy, precision, recall]
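
To check these formulas by hand, it helps to count true positives, false positives, false negatives and true negatives on a tiny example. The sketch below uses made-up labels and a helper named `confusion_counts`, which is ours and not part of the notebook.

```python
def confusion_counts(labels, predictions):
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # correctly predicted 1
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # predicted 1, actually 0
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # predicted 0, actually 1
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # correctly predicted 0
    return tp, fp, fn, tn

labels      = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 1, 1, 0, 0]

tp, fp, fn, tn = confusion_counts(labels, predictions)
accuracy = (tp + tn) / len(labels)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall)
```

Here the model finds three of the four positive reviews (recall 0.75), but only three of its five positive predictions are right (precision 0.6).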

In [261]:
svm_pred = svm.predict(test_features)
svm_metrics = ClassifierMetrics(test_labels, svm_pred)

print("SVM TEST Metrics:")
print(f"Accuracy: {svm_metrics[0]}")
print(f"Precision: {svm_metrics[1]}")
print(f"Recall: {svm_metrics[2]}")

SVM TEST Metrics:
Accuracy: 0.825
Precision: 0.8041237113402062
Recall: 0.8297872340425532


### Random Forest Classifiers

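Random Forests can also perform well on this kind of problem, because each tree is trained on a random sample of the rows and considers only a random subset of the features at each split, so no single tree has to cope with the whole feature space.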
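
As a rough illustration of the feature sampling, the sketch below draws the candidate features for one split, assuming the square-root rule that sklearn's `RandomForestClassifier` uses by default in recent versions. The helper names are ours, and the real implementation is more involved.

```python
import math
import random

def features_per_split(n_features):
    # Square-root rule: consider about sqrt(n_features) features per split
    return max(1, int(math.sqrt(n_features)))

def sample_candidate_features(n_features, rng):
    # Draw distinct feature indices without replacement
    return rng.sample(range(n_features), features_per_split(n_features))

rng = random.Random(0)
candidates = sample_candidate_features(232922, rng)
print(len(candidates))
```

With 232,922 features, each split looks at only 482 of them, which keeps every tree fast and makes the trees differ from one another.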

In [262]:
from sklearn.ensemble import RandomForestClassifier

r_forest = RandomForestClassifier(n_estimators=500, min_samples_leaf=3)
r_forest = r_forest.fit(train_features, train_labels)

In [263]:
r_forest_pred = r_forest.predict(test_features)
r_forest_metrics = ClassifierMetrics(test_labels, r_forest_pred)

print("Random Forest TEST Metrics:")
print(f"Accuracy: {r_forest_metrics[0]}")
print(f"Precision: {r_forest_metrics[1]}")
print(f"Recall: {r_forest_metrics[2]}")

Random Forest TEST Metrics:
Accuracy: 0.815
Precision: 0.7522123893805309
Recall: 0.9042553191489362


Both models reach better than 80% accuracy on the test set, with the Random Forest trading some precision for higher recall. In the next lesson, we will see how neural networks, a completely different architecture, perform on the same task.

### Try it Yourself!

Replace the "text" variable with your own movie review and see whether the models identify its sentiment correctly.

The preprocess function applies the same regex, stop-word, lemmatizing and vectorizing steps that were used to prepare the training data.

In [264]:
def preprocess(text):
    #Regex
    text = re.sub('[^a-zA-Z]', ' ', text)
    text = text.lower()
    text = text.split()

    #Stopwords (build the set once instead of once per word)
    stop_words = set(stopwords.words('english'))
    text = [word for word in text if word not in stop_words]

    #Lemmatizing
    text = [lem.lemmatize(word) for word in text]
    text = ' '.join(text)

    #Vectorizing

    text = count_vec.transform([text])
    text = text.toarray()

    return text

In [265]:
text = "Oppenheimer is an absolute masterpiece that brilliantly captures the complexity and intrigue of one of history's most enigmatic figures, J. Robert Oppenheimer. The film's storytelling is riveting, with outstanding performances that bring the characters to life. Director Christopher Nolan's meticulous attention to detail and cinematography are breathtaking, creating a visually stunning experience. The film's exploration of science, morality, and the human condition is thought-provoking and leaves a lasting impact. Oppenheimer is a must-see for anyone who appreciates powerful storytelling and outstanding filmmaking."
text = preprocess(text)

svm_user_pred = svm.predict(text)
if svm_user_pred[0] == 0: #predict returns an array with one entry per input row
    print("Your review is negative")
else:
    print("Your review is positive")

Your review is positive


In [266]:
forest_user_pred = r_forest.predict(text)
if forest_user_pred[0] == 0: #predict returns an array with one entry per input row
    print("Your review is negative")
else:
    print("Your review is positive")

Your review is positive
