# Deep Learning Tutorial

## Reading the Data

Import the `pandas` package, then use the `read_csv` function to read the labeled training data.

In [2]:
import pandas as pd

# Read the labeled training data
train = pd.read_csv("../data/labeled_train_data.tsv", header=0, delimiter="\t", quoting=3)

print(train.shape)
print(train.columns.values)
print(train["review"][0])

(25000, 3)
['id' 'sentiment' 'review']
"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actu

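The `quoting=3` argument passed to `read_csv` is the integer value of `csv.QUOTE_NONE` from the standard library, which tells the parser to treat double quotes as ordinary characters. That is why the printed review above still begins with a double quote. The following is a minimal sketch of the same behavior using the standard-library `csv` reader; the sample row is made up for illustration and does not come from the dataset.

```python
import csv
import io

# Hypothetical one-row TSV in the same three-column layout as the training file
sample_tsv = 'id\tsentiment\treview\n"1_8"\t1\t"A "great" film"\n'

# quoting=3 in the read_csv call corresponds to this constant
print(csv.QUOTE_NONE)

# With QUOTE_NONE, quote characters are kept as part of each field
reader = csv.reader(io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
header, first_row = list(reader)
print(header)
print(first_row)
```

Because the quotes are kept, any later cleaning step has to remove them, which the letters-only filtering in the next section does as a side effect.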
## Data Cleaning and Text Preprocessing

Import the Beautiful Soup library. Remove HTML markup from review text.

In [3]:
from bs4 import BeautifulSoup

# Initialize the BeautifulSoup object on a single review
example_review = BeautifulSoup(train["review"][0], "html.parser")

# Print the raw text and the text without tags or markup
print("Raw review text:\n", train["review"][0])
print("Modified review text:\n", example_review.get_text())

Raw review text:
 "With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit w

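As a smaller illustration of what `get_text()` does, the sketch below parses a short fragment of the same review. Passing `"html.parser"` explicitly selects Python's built-in parser; that choice is an assumption made here so the example needs no parser beyond Beautiful Soup itself.

```python
from bs4 import BeautifulSoup

fragment = "drugs are bad m'kay.<br /><br />Visually impressive"

# Parse with the standard-library backend and keep only the text nodes
soup = BeautifulSoup(fragment, "html.parser")
text = soup.get_text()
print(text)
```

Note that the `<br />` tags contribute no text of their own, so the sentences on either side end up directly adjacent.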
For simplicity, remove all punctuation and numbers. However, in sentiment analysis problems, it is important to remember that punctuation and numbers often carry sentiment and should usually be treated as words.
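As a hedged illustration of that alternative (it is not used in the rest of this tutorial), runs of `!` or `?` could be kept as their own tokens instead of being discarded. The pattern and the sample sentence are assumptions chosen for this sketch.

```python
import re

sample = "Worst. Movie. Ever!!! Why?"

# Match either a run of letters or a run of exclamation/question marks
tokens_with_punct = re.findall(r"[a-zA-Z]+|[!?]+", sample)
print(tokens_with_punct)
```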

To remove punctuation and numbers, use regular expressions.

In [4]:
import re

# Use regular expressions to do a find-and-replace
letters_only = re.sub("[^a-zA-Z]", " ", example_review.get_text())

print("Text after removing punctuation and numbers:\n", letters_only)

Text after removing punctuation and numbers:
  With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film b
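The pattern `[^a-zA-Z]` is a negated character class: it matches any single character that is not an ASCII letter, and `re.sub` replaces each match with one space. A short sketch on a made-up sentence shows the effect on apostrophes, digits, and accented letters.

```python
import re

# Made-up sentence; \u00e9 is an accented "e", which is not an ASCII letter
sample = "I've seen it 10 times, caf\u00e9 scene included!"

# Every non-letter character becomes exactly one space
letters_only_sample = re.sub("[^a-zA-Z]", " ", sample)
print(repr(letters_only_sample))

# Splitting on whitespace shows which word fragments survive
print(letters_only_sample.split())
```

Contractions are split into fragments such as `ve`, and non-ASCII letters are dropped, which may matter for text that is not plain English.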

Convert reviews to lowercase and split them into individual words ("tokenization").

In [5]:
# Convert to lower-case and split into words
lower_case = letters_only.lower()
tokens = lower_case.split()

print("Lower-case tokens:\n", tokens)

Lower-case tokens:
 ['with', 'all', 'this', 'stuff', 'going', 'down', 'at', 'the', 'moment', 'with', 'mj', 'i', 've', 'started', 'listening', 'to', 'his', 'music', 'watching', 'the', 'odd', 'documentary', 'here', 'and', 'there', 'watched', 'the', 'wiz', 'and', 'watched', 'moonwalker', 'again', 'maybe', 'i', 'just', 'want', 'to', 'get', 'a', 'certain', 'insight', 'into', 'this', 'guy', 'who', 'i', 'thought', 'was', 'really', 'cool', 'in', 'the', 'eighties', 'just', 'to', 'maybe', 'make', 'up', 'my', 'mind', 'whether', 'he', 'is', 'guilty', 'or', 'innocent', 'moonwalker', 'is', 'part', 'biography', 'part', 'feature', 'film', 'which', 'i', 'remember', 'going', 'to', 'see', 'at', 'the', 'cinema', 'when', 'it', 'was', 'originally', 'released', 'some', 'of', 'it', 'has', 'subtle', 'messages', 'about', 'mj', 's', 'feeling', 'towards', 'the', 'press', 'and', 'also', 'the', 'obvious', 'message', 'of', 'drugs', 'are', 'bad', 'm', 'kay', 'visually', 'impressive', 'but', 'of', 'course', 'this', 'i
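Calling `split()` with no argument matters here, because the previous step leaves runs of consecutive spaces where punctuation used to be. A small sketch on a made-up string contrasts it with splitting on a single space.

```python
spaced = "moonwalker again  maybe i"

# No argument: split on any run of whitespace and drop empty strings
default_split = spaced.split()
print(default_split)

# Explicit separator: consecutive spaces produce empty strings
single_space_split = spaced.split(" ")
print(single_space_split)
```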

Decide how to deal with frequently occurring words that don't carry much meaning ("stop words"), such as "a", "and", "is", and "the".

Import a stop word list from the Python Natural Language Toolkit and remove stop words from the review text.

In [6]:
import nltk
nltk.download("stopwords")

from nltk.corpus import stopwords
print("Stop words in NLTK:", stopwords.words("english"))

# Remove stop words from the review text (build the set once for fast lookups)
stop_words = set(stopwords.words("english"))
tokens = [token for token in tokens if token not in stop_words]
print("Tokens after removing stop words:\n", tokens)

Stop words in NLTK: ['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 's

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/rohanmistry/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
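Membership tests against a list scan it element by element, while a set uses a hash lookup, so it pays to build the stop word collection once as a set before filtering many tokens. The sketch below uses a hypothetical five-word stop list rather than the full NLTK list, so it runs without downloading any corpus.

```python
# Hypothetical mini stop list; each entry also appears in NLTK's English list
mini_stops = {"a", "and", "is", "the", "s"}

sample_tokens = ["mj", "s", "feeling", "towards", "the", "press"]

# Set membership is a hash lookup, so it does not scan the whole collection
kept = [token for token in sample_tokens if token not in mini_stops]
print(kept)
```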


Create a reusable function to clean review text.

In [7]:
stops = set(stopwords.words("english"))

def clean_review_text(review: str) -> str:
    """
    Cleans the review text by removing HTML tags, markup, punctuation, numbers, and stop words.

    Parameters:
        review (str): The raw review text to be cleaned.

    Returns:
        str: The cleaned tokens from the review, joined by single spaces.
    """

    # Strip HTML markup, keep letters only, then lowercase and tokenize
    text = BeautifulSoup(review, "html.parser").get_text()
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()

    # Drop stop words using the precomputed set for fast membership tests
    tokens = [token for token in tokens if token not in stops]

    return " ".join(tokens)

In [8]:
clean_review = clean_review_text(train["review"][0])
print("Cleaned review text:\n", clean_review)

Cleaned review text:
 stuff going moment mj started listening music watching odd documentary watched wiz watched moonwalker maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember going see cinema originally released subtle messages mj feeling towards press also obvious message drugs bad kay visually impressive course michael jackson unless remotely like mj anyway going hate find boring may call mj egotist consenting making movie mj fans would say made fans true really nice actual feature film bit finally starts minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord wants mj dead bad beyond mj overheard plans nah joe pesci character ranted wanted people know supplying drugs etc dunno maybe hates mj music lots cool things like mj turning car robot whole speed demon sequence also director must patience saint came filming kiddy bad sequence usually 

Loop through and clean the entire training set.

In [9]:
# Get the number of reviews in the training set
num_reviews = train["review"].size

# Initialize an empty list to hold the cleaned reviews
cleaned_train_reviews = []

# Iterate through each review in the training set
for i in range(num_reviews):
    if (i + 1) % 1000 == 0:
        print(f"Cleaning review {i + 1} of {num_reviews}")
    
    # Clean the review and append it to the list
    cleaned_review = clean_review_text(train["review"][i])
    cleaned_train_reviews.append(cleaned_review)

Cleaning review 1000 of 25000
Cleaning review 2000 of 25000
Cleaning review 3000 of 25000
Cleaning review 4000 of 25000
Cleaning review 5000 of 25000
Cleaning review 6000 of 25000
Cleaning review 7000 of 25000
Cleaning review 8000 of 25000
Cleaning review 9000 of 25000
Cleaning review 10000 of 25000
Cleaning review 11000 of 25000
Cleaning review 12000 of 25000
Cleaning review 13000 of 25000
Cleaning review 14000 of 25000
Cleaning review 15000 of 25000
Cleaning review 16000 of 25000
Cleaning review 17000 of 25000
Cleaning review 18000 of 25000
Cleaning review 19000 of 25000
Cleaning review 20000 of 25000
Cleaning review 21000 of 25000
Cleaning review 22000 of 25000
Cleaning review 23000 of 25000
Cleaning review 24000 of 25000
Cleaning review 25000 of 25000
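The loop above indexes the reviews with `train["review"][i]`, which looks up the label `i` and therefore relies on the DataFrame having the default 0 to n-1 index. A possible alternative, sketched here on a toy Series, is `Series.apply`, which visits every value regardless of its index label. The `toy_clean` function is a hypothetical stand-in that omits the HTML and stop word steps.

```python
import re

import pandas as pd


def toy_clean(text: str) -> str:
    """Stand-in for clean_review_text: letters only, lowercased, re-joined."""
    return " ".join(re.sub("[^a-zA-Z]", " ", text).lower().split())


# Toy reviews with a non-default index, as might result from filtering rows
reviews = pd.Series(["Good film!", "Bad, bad film."], index=[10, 20])

# apply() calls the function on each value in order, ignoring the labels
cleaned = reviews.apply(toy_clean).tolist()
print(cleaned)
```

This form gives up the progress messages, so the explicit loop may still be preferable for long-running jobs.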


Use a bag-of-words approach to convert the training reviews to a numeric representation for machine learning. Build a vocabulary from all reviews, then create feature vectors with the count of each word in each review.

To limit the size of the feature vectors, use the 5000 most frequent words. Use `CountVectorizer` from scikit-learn's `sklearn.feature_extraction.text` module to create bag-of-words features.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the CountVectorizer with parameters to limit the vocabulary size
vectorizer = CountVectorizer(analyzer="word", tokenizer=None, preprocessor=None, stop_words=None, max_features=5000)

# Fit the vectorizer on the cleaned training reviews, learn the vocabulary, and transform the reviews into feature vectors
train_data_features = vectorizer.fit_transform(cleaned_train_reviews)

# Convert the resulting sparse matrix to a dense format
train_data_features = train_data_features.toarray()

print("Shape of the training data features:", train_data_features.shape)

Shape of the training data features: (25000, 5000)
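To make the bag-of-words idea concrete, the sketch below builds a vocabulary and count vectors by hand for two toy sentences. It illustrates the concept only and does not reproduce `CountVectorizer`'s tokenization rules.

```python
from collections import Counter

sentences = ["the cat sat on the hat", "the dog ate the cat and the hat"]

# Vocabulary: every distinct word across all sentences, in sorted order
vocab = sorted({word for sentence in sentences for word in sentence.split()})


def to_counts(sentence: str) -> list:
    """Return how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocab]


vectors = [to_counts(sentence) for sentence in sentences]
print(vocab)
print(vectors)
```

Each row has one entry per vocabulary word, so the matrix shape is (number of documents, vocabulary size), which matches the (25000, 5000) shape reported above.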


In [11]:
vocabulary = vectorizer.get_feature_names_out()
print(vocabulary)

['abandoned' 'abc' 'abilities' ... 'zombie' 'zombies' 'zone']


Use a Random Forest classifier for supervised learning.

In [12]:
from sklearn.ensemble import RandomForestClassifier

print("Training the random forest classifier...")

# Initialize the Random Forest classifier with 100 trees
forest = RandomForestClassifier(n_estimators=100)

# Fit the forest to the training data features using the bag of words as features and the sentiment labels as the response variable
forest = forest.fit(train_data_features, train["sentiment"])

Training the random forest classifier...
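Random forests are randomized (by default each tree is trained on a bootstrap sample and considers random feature subsets), so the call above can produce a slightly different model on each run. A minimal sketch, assuming reproducibility is wanted, fixes `random_state`; the toy count matrix and labels are invented for illustration.

```python
from sklearn.ensemble import RandomForestClassifier

# Invented toy data: rows are reviews, columns are word counts
toy_features = [[2, 0], [3, 0], [1, 0], [0, 2], [0, 3], [0, 1]]
toy_labels = [1, 1, 1, 0, 0, 0]
new_reviews = [[4, 0], [0, 4]]

# Two forests built with the same seed are trained identically
forest_a = RandomForestClassifier(n_estimators=100, random_state=0)
forest_b = RandomForestClassifier(n_estimators=100, random_state=0)
forest_a.fit(toy_features, toy_labels)
forest_b.fit(toy_features, toy_labels)

# Identical seeds and data give identical predictions
predictions_a = forest_a.predict(new_reviews)
predictions_b = forest_b.predict(new_reviews)
print(predictions_a, predictions_b)
```

The seed value `0` is arbitrary; any fixed integer serves the same purpose.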


Run the trained Random Forest classifier on the test set and create a submission file.

In [13]:
# Read the test data
test = pd.read_csv("../data/test_data.tsv", header=0, delimiter="\t", quoting=3)

# Verify the test data shape
print("Test data shape:", test.shape)

Test data shape: (25000, 2)


In [14]:
# Clean and parse test reviews
num_test_reviews = len(test["review"])
cleaned_test_reviews = []

print("Cleaning test reviews...")
for i in range(num_test_reviews):
    if (i + 1) % 1000 == 0:
        print(f"Cleaning test review {i + 1} of {num_test_reviews}")
    
    cleaned_review = clean_review_text(test["review"][i])
    cleaned_test_reviews.append(cleaned_review)

# Transform the cleaned test reviews into feature vectors
test_data_features = vectorizer.transform(cleaned_test_reviews)
test_data_features = test_data_features.toarray()

# Use the trained Random Forest classifier to predict sentiment for the test set
result = forest.predict(test_data_features)

# Copy results to a DataFrame for submission
output = pd.DataFrame(data={"id": test["id"], "sentiment": result})

# Write the DataFrame to a CSV file for submission
output.to_csv("../data/tutorial_submission.csv", index=False, quoting=3)

Cleaning test reviews...
Cleaning test review 1000 of 25000
Cleaning test review 2000 of 25000
Cleaning test review 3000 of 25000
Cleaning test review 4000 of 25000
Cleaning test review 5000 of 25000
Cleaning test review 6000 of 25000
Cleaning test review 7000 of 25000
Cleaning test review 8000 of 25000
Cleaning test review 9000 of 25000
Cleaning test review 10000 of 25000
Cleaning test review 11000 of 25000
Cleaning test review 12000 of 25000
Cleaning test review 13000 of 25000
Cleaning test review 14000 of 25000
Cleaning test review 15000 of 25000
Cleaning test review 16000 of 25000
Cleaning test review 17000 of 25000
Cleaning test review 18000 of 25000
Cleaning test review 19000 of 25000
Cleaning test review 20000 of 25000
Cleaning test review 21000 of 25000
Cleaning test review 22000 of 25000
Cleaning test review 23000 of 25000
Cleaning test review 24000 of 25000
Cleaning test review 25000 of 25000
