This notebook trains a binary classifier on a dataset of movie reviews, each labelled as expressing either *positive* or *negative* sentiment towards the movie.

First we will install *scikit-learn* (imported as sklearn), which we will use to do the machine learning.

In [41]:
%%capture
pip install scikit-learn

Next we will install the dataset. We will use the IMDB sentiment analysis dataset available from the [huggingface datasets library](https://huggingface.co/datasets/imdb) and described in [Maas et al. 2011](https://aclanthology.org/P11-1015.pdf).

In [42]:
%%capture
pip install datasets

Now let's load the IMDB dataset. We will print out the last training instance once the data has been converted below.

In [3]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")

Let's convert the training data into the format expected by scikit-learn - a list of input vectors (documents) and a list of associated output labels.

In [4]:
train_dataset = imdb_dataset['train']
train_data = []
train_data_labels = []
for item in train_dataset:
  train_data.append(item['text'])
  train_data_labels.append(item['label'])
print(train_data[-1])
print(train_data_labels[-1])

The story centers around Barry McKenzie who must go to England if he wishes to claim his inheritance. Being about the grossest Aussie shearer ever to set foot outside this great Nation of ours there is something of a culture clash and much fun and games ensue. The songs of Barry McKenzie(Barry Crocker) are highlights.
1


We'll use the CountVectorizer class to extract the words in each review as the features the algorithm will learn from. Each document is represented as a 200-dimensional vector of word counts, because only the 200 most frequent words are used in this version.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer='word',max_features=200,lowercase=True)
features = vectorizer.fit_transform(train_data).toarray()

As a sanity check, let's check we have a 2-d array where each row is one of the 25,000 instances and each column is one of 200 words. Print out the words that will be used for classification.

In [6]:
print(features.shape)
print(vectorizer.get_feature_names_out())

(25000, 200)
['10' 'about' 'acting' 'action' 'actors' 'actually' 'after' 'again' 'all'
 'also' 'an' 'and' 'another' 'any' 'are' 'around' 'as' 'at' 'back' 'bad'
 'be' 'because' 'been' 'before' 'being' 'best' 'better' 'between' 'big'
 'both' 'br' 'but' 'by' 'can' 'cast' 'character' 'characters' 'could'
 'did' 'didn' 'director' 'do' 'does' 'doesn' 'don' 'down' 'end' 'enough'
 'even' 'ever' 'every' 'fact' 'few' 'film' 'films' 'find' 'first' 'for'
 'from' 'funny' 'get' 'give' 'go' 'going' 'good' 'got' 'great' 'had' 'has'
 'have' 'he' 'her' 'here' 'him' 'his' 'horror' 'how' 'however' 'if' 'in'
 'into' 'is' 'it' 'its' 'just' 'know' 'life' 'like' 'little' 'long' 'look'
 'lot' 'love' 'made' 'make' 'makes' 'man' 'many' 'may' 'me' 'more' 'most'
 'movie' 'movies' 'much' 'my' 'never' 'new' 'no' 'not' 'nothing' 'now'
 'of' 'off' 'old' 'on' 'one' 'only' 'or' 'original' 'other' 'out' 'over'
 'own' 'part' 'people' 'plot' 'pretty' 'quite' 're' 'real' 'really'
 'right' 'same' 'say' 'scene' 'scenes' 'see'

Split the data into a training set and a validation (dev) set. We'll use 75% of the data for training and hold out the remaining 25% as the validation set to test our model.

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(features,train_data_labels,train_size=0.75,random_state=123)

We will use Logistic Regression to do the classification. Create the model.

In [8]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

Train the model.

In [9]:
model = model.fit(X=X_train,y=y_train)

Test the model on the validation set.

In [10]:
y_pred = model.predict(X_val)

Now let's calculate the accuracy of the model's predictions on the validation set.

In [11]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_val,y_pred))

0.76368


Now let's prepare some test data. We use the same 1,000 test reviews as in the BERT notebook.

In [12]:
test_dataset = imdb_dataset['test'].shuffle(seed=42).select(range(1000))
test_data = []
test_data_labels = []
for item in test_dataset:
  test_data.append(item['text'])
  test_data_labels.append(item['label'])

Apply the model to the test data.

In [13]:
test_pred=model.predict(vectorizer.transform(test_data).toarray())

In [14]:
print(accuracy_score(test_data_labels,test_pred))

0.74


## Possible improvements

In this section, we consider changes to the features and the model that may improve performance.

### Stop word removal

Stop words are words that typically carry little meaning on their own, so they can often be removed from the corpus. Whether to remove them is task dependent: some tasks, such as predicting the next word in a sequence, rely on stop words.

For our use case, sentiment analysis, stop words such as *the*, *at* and *on* are not important for determining the sentiment of a review. Lists of stop words are available through sklearn and nltk.

In [15]:
# Import nltk and the stop words list.
import nltk
from nltk.corpus import stopwords

In [16]:
# Download the stop words list and print the list and the length.
nltk.download('stopwords')
stop_words = stopwords.words('english')
print(stop_words)
print(len(stop_words))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\adamt\AppData\Roaming\nltk_data...


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data]   Unzipping corpora\stopwords.zip.


In [17]:
# Import the stop word list from the public module rather than the private _stop_words module.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Assign the sklearn stopword list to a variable and print the list and the length.
sklearn_stop_words = ENGLISH_STOP_WORDS
print(sklearn_stop_words)
print(len(sklearn_stop_words))

frozenset({'across', 'without', 'onto', 'this', 'due', 'your', 'beyond', 'thru', 'system', 'latterly', 'see', 'otherwise', 'were', 'find', 'eleven', 'what', 'hereafter', 'thus', 'many', 'however', 'more', 're', 'neither', 'someone', 'these', 'to', 'its', 'never', 'himself', 'co', 'even', 'five', 'alone', 'he', 'will', 'couldnt', 'off', 'for', 'not', 'until', 'nor', 'therein', 'get', 'whereby', 'hers', 'yours', 'show', 'very', 'often', 'cannot', 'also', 'while', 'fifty', 'sincere', 'next', 'can', 'been', 'inc', 'became', 'seeming', 'than', 'behind', 'anywhere', 'how', 'whether', 'with', 'am', 'thereby', 'ever', 'some', 'again', 'because', 'why', 'towards', 'ie', 'have', 'six', 'throughout', 'anyway', 'two', 'should', 'else', 'those', 'whereas', 'mill', 'whose', 'against', 'has', 'another', 'either', 'within', 'via', 'anyhow', 'top', 'nowhere', 'during', 'whoever', 'whence', 'thereafter', 'un', 'only', 'anyone', 'namely', 'cry', 'which', 'becomes', 'upon', 'bottom', 'them', 'at', 'cant',

Comparing the two lists, we can see that the sklearn list is longer, with 318 words against 179 in the nltk list. Some of the words in the sklearn list look as though they could matter for determining sentiment, such as *nothing*, *fire*, *sincere* and *cry*.

We can test both stop word lists, but I believe that the nltk list may be better suited to our task.

To remove stop words, we can pass the chosen list to the CountVectorizer through its stop_words parameter, which drops these words from the vocabulary.
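
As a minimal sketch of this, the example below compares the vocabulary with and without a stop word list. The short list and the review are hypothetical stand-ins for the nltk list and the IMDB data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical short stop word list standing in for the nltk list.
toy_stop_words = ["the", "at", "on", "was", "and"]

toy_review = ["The acting was great and the plot was on point"]

# One vectorizer keeps every word, the other drops the stop words.
toy_with_stops = CountVectorizer(lowercase=True)
toy_without_stops = CountVectorizer(lowercase=True, stop_words=toy_stop_words)

vocab_with = sorted(toy_with_stops.fit(toy_review).vocabulary_)
vocab_without = sorted(toy_without_stops.fit(toy_review).vocabulary_)

print(vocab_with)
print(vocab_without)
```

Only the content words remain in the second vocabulary.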

### Word document frequency

Typically, words that appear in many positive and many negative reviews are not very useful for telling the two classes apart. Similarly, words that occur in only a single document are unlikely to help classify other documents.

Both filters can be applied through the max_df and min_df parameters of the CountVectorizer, so that we only consider words that appear in a reasonable number of documents and are likely to be useful for discerning the class.
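
A minimal sketch of document frequency filtering on a hypothetical four-review corpus follows. The thresholds are chosen only to suit this tiny example and are not the values used later in the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "the" and "was" occur in every review,
# while "plot" and "soundtrack" occur in only one review each.
toy_docs = [
    "the film was great",
    "the film was awful",
    "the plot was great",
    "the soundtrack was awful",
]

# max_df=0.75 drops words found in more than 75% of the documents;
# min_df=2 drops words found in fewer than 2 documents.
toy_pruned = CountVectorizer(max_df=0.75, min_df=2)
toy_pruned.fit(toy_docs)

toy_kept = sorted(toy_pruned.vocabulary_)
print(toy_kept)
```

A float value is read as a proportion of documents and an integer as an absolute document count.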

### N-gram count

The CountVectorizer offers the option of creating bigrams and including them in the training corpus through the ngram_range parameter. This involves taking every pair of adjacent words that appears in the corpus and treating the pair as a new word. This can be very useful for terms such as 'special effects', which may help to discern the class.

### TF-IDF features

Sklearn offers a text feature extractor, the TfidfTransformer, that generates tf-idf scores for the words in a corpus. We can pass the output of the CountVectorizer to it to get these scores. The main improvement I expect from this is that raw term frequencies have less impact on the prediction, so a more diverse set of positive words would count for more than many occurrences of the same word.
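
A minimal sketch of the transformation on two hypothetical reviews. It illustrates two effects of these settings: each document vector is scaled to unit length, and a word that occurs in fewer documents receives a higher weight than a word that occurs in every document.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy reviews: "great" and "film" appear in both,
# the remaining words appear only in the second review.
toy_docs = [
    "great great great great film",
    "great acting wonderful story film",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# Convert the raw counts to l2-normalised tf-idf weights.
toy_tfidf = TfidfTransformer(use_idf=True, norm="l2", smooth_idf=True)
toy_weights = toy_tfidf.fit_transform(toy_counts).toarray()

# Each row has unit length after l2 normalisation.
toy_row_norms = np.linalg.norm(toy_weights, axis=1)
print(toy_row_norms)

# In the second review every word occurs once, so the weights differ
# only through the inverse document frequency.
acting = toy_vectorizer.vocabulary_["acting"]
film = toy_vectorizer.vocabulary_["film"]
print(toy_weights[1, acting], toy_weights[1, film])
```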

### Max features

This is the most obvious change that can be made to improve the model's accuracy. Increasing the number of features gives the model more data to work with, but some of it may not be useful. Including every word is not wise, as there are diminishing returns from adding more features: beyond a certain point we trade added complexity for a marginal improvement in accuracy. For this reason, I will focus on limiting the features to those that are important or useful.

To test all of these I will use GridSearchCV to try combinations of these parameters. The only step I do not include in the search is the tf-idf feature extractor, as it would need to be added as a further step in the pipeline. I will evaluate it after I have found the best values for the other parameters.
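
For reference, one possible way of adding a tf-idf step to such a pipeline is sketched below on a hypothetical toy dataset. This is not the search run in this notebook; the step name "tfidf", the toy reviews and the small grid are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy reviews and labels (1 = positive, 0 = negative).
toy_texts = ["great film", "wonderful acting", "great story", "loved it",
             "awful film", "terrible acting", "awful story", "hated it"]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The tf-idf transformer sits between the vectorizer and the classifier.
toy_pipe = Pipeline(steps=[
    ("vectorizer", CountVectorizer(lowercase=True)),
    ("tfidf", TfidfTransformer()),
    ("logistic", LogisticRegression(max_iter=400)),
])

# Parameters of each step are addressed as <step name>__<parameter>.
toy_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": [True, False],
}

toy_search = GridSearchCV(toy_pipe, toy_grid, cv=2)
toy_search.fit(toy_texts, toy_labels)
print(toy_search.best_params_)
```

Because the transformer is refitted inside each fold, the idf weights are learned from the training portion of that fold only.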

My prediction is that a good set of parameters will include an increased number of features and the removal of words that occur too frequently or too infrequently. Stop word removal may help, but if we already remove very common words, most stop words will likely be removed anyway. An n-gram range covering one or two words may be useful for finding multi-word terms. Lastly, tf-idf may strike a balance between recording only whether a word occurs in a text and using its raw count: an increased count still has a bigger impact, but not as much as the raw count would. This will likely give a better picture of how these words and their occurrences affect the classification of a document.

In [18]:
# import GridSearchCV and Pipeline.
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Define a CountVectorizer to use
vectorizer = CountVectorizer(lowercase=True)

# Create a LogisticRegression model and set the max iterations to be higher as
# there were times while testing that the model did not converge in 100 iterations.
logistic = LogisticRegression(max_iter=400)

# Create the pipeline.
pipe = Pipeline(steps=[("vectorizer", vectorizer), ("logistic", logistic)])

# Define the parameters to test.
param_grid = {
    "vectorizer__stop_words": [None, stop_words, sklearn_stop_words],
    "vectorizer__max_df": [0.7, 0.8, 0.9, 1.0],
    "vectorizer__min_df": [0, 0.05, 0.01, 0.02],
    "vectorizer__ngram_range": [(1,1), (1,2)],
    "vectorizer__max_features": [200, 500, 1000],
}

# Initialize the GridSearchCV object.
search = GridSearchCV(pipe, param_grid, n_jobs=8, verbose=3, cv=4)

# Run the GridSearchCV on all of the training data. Once we have the best
# parameters, we will test them on the same subsample as before.
search.fit(train_data, train_data_labels)

# Print the best parameters.
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)

Fitting 4 folds for each of 288 candidates, totalling 1152 fits


216 fits failed out of a total of 1152.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
216 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\adamt\AppData\Roaming\Python\Python39\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\adamt\AppData\Roaming\Python\Python39\site-packages\sklearn\pipeline.py", line 378, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "C:\Users\adamt\AppData\Roaming\Python\Python39\site-packages\sklearn\pipeline.py", line 336, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "D:\ProgramData\anaconda3\lib\site-packages\joblib\memory.py", line 349

Best parameter (CV score=0.850):
{'vectorizer__max_df': 0.9, 'vectorizer__max_features': 1000, 'vectorizer__min_df': 0.02, 'vectorizer__ngram_range': (1, 1), 'vectorizer__stop_words': None}



Now we try these parameters with the original train and test split and also the separate test data.

In [32]:
# Initialize the CountVectorizer with the parameters found by the grid search.
# Note: max_df is left at 1.0 here, whereas the grid search selected 0.9.
vectorizer = CountVectorizer(lowercase=True, max_df=1.0, max_features=1000, min_df=0.02, ngram_range=(1,1))

# Generate the features.
features = vectorizer.fit_transform(train_data)

In [24]:
# Print out the shape of features and the feature names.
print(features.shape)
print(vectorizer.get_feature_names_out())

(25000, 1000)
['10' '20' '30' '80' 'able' 'about' 'above' 'absolutely' 'across' 'act'
 'acted' 'acting' 'action' 'actor' 'actors' 'actress' 'actual' 'actually'
 'add' 'admit' 'after' 'again' 'against' 'age' 'ago' 'agree' 'air' 'all'
 'almost' 'alone' 'along' 'already' 'also' 'although' 'always' 'am'
 'amazing' 'america' 'american' 'among' 'an' 'animation' 'annoying'
 'another' 'any' 'anyone' 'anything' 'anyway' 'apart' 'apparently'
 'appear' 'appears' 'are' 'aren' 'around' 'art' 'as' 'ask' 'at'
 'atmosphere' 'attempt' 'attempts' 'attention' 'audience' 'average'
 'avoid' 'away' 'awful' 'baby' 'back' 'background' 'bad' 'badly' 'based'
 'basically' 'be' 'beautiful' 'beauty' 'became' 'because' 'become'
 'becomes' 'been' 'before' 'begin' 'beginning' 'begins' 'behind' 'being'
 'believable' 'believe' 'best' 'better' 'between' 'beyond' 'big' 'bit'
 'black' 'blood' 'body' 'book' 'boring' 'both' 'box' 'boy' 'br' 'break'
 'brilliant' 'bring' 'brings' 'british' 'brother' 'brought' 'budget'
 'bunch

In [25]:
# Split the data into train and test the same as the demonstration.
X_train, X_val, y_train, y_val = train_test_split(features,train_data_labels,train_size=0.75,random_state=123)

In [26]:
# Create a new LogisticRegression model.
model = LogisticRegression(max_iter=400)

# Train the model on the training data.
model = model.fit(X=X_train,y=y_train)

In [27]:
# Predict the labels for the validation set.
y_pred = model.predict(X_val)

In [28]:
# Get the accuracy of the predictions on the validation set.
print(accuracy_score(y_val,y_pred))

0.85264


In [30]:
# Test the model on the separate test data for comparison with the original model.
test_pred=model.predict(vectorizer.transform(test_data))
print(accuracy_score(test_data_labels,test_pred))

0.84


We can see that these parameters give a large improvement over the original model.

Accuracy has gone from 0.764 to 0.853 on the validation split and from 0.74 to 0.84 on the separate test data.

The only method left to test is the TfidfTransformer, which we can try now.

In [33]:
# Import the TfidfTransformer.
from sklearn.feature_extraction.text import TfidfTransformer

# Initialize the CountVectorizer. Note: these settings differ from the grid search
# result (nltk stop words, max_df=0.8, min_df=0.01, bigrams, no max_features).
vectorizer = CountVectorizer(stop_words=stop_words, lowercase=True, max_df=0.8, min_df=0.01, ngram_range=(1,2))

# Generate the features.
features = vectorizer.fit_transform(train_data)

# Initialize the TfidfTransformer.
tf_idf = TfidfTransformer(use_idf=True,norm='l2', smooth_idf=True)

# Fit the data to the TfidfTransformer.
tf_idf_features = tf_idf.fit_transform(features)

In [34]:
# Print out the shape of tf_idf_features to check the number of documents and features.
print(tf_idf_features.shape)

(25000, 1821)


In [35]:
# Split the data into train and test the same as the demonstration.
X_train, X_val, y_train, y_val = train_test_split(tf_idf_features,train_data_labels,train_size=0.75,random_state=123)

In [36]:
# Create a new LogisticRegression model.
model = LogisticRegression(max_iter=400)

# Train the model on the training data.
model = model.fit(X=X_train,y=y_train)

In [37]:
# Predict the labels for the validation set.
y_pred = model.predict(X_val)

In [38]:
# Get the accuracy of the predictions on the validation set.
print(accuracy_score(y_val,y_pred))

0.87152


In [39]:
# Test the model on the separate test data for comparison with the original model.
test_pred=model.predict(tf_idf.transform(vectorizer.transform(test_data)))
print(accuracy_score(test_data_labels,test_pred))

0.862


Incorporating the TfidfTransformer has further improved the accuracy by about 2 percentage points on both the validation split and the separate test set. Note that this run also changed the CountVectorizer settings, so the gain cannot be attributed to tf-idf alone.

In conclusion, the original model could mainly be improved through the feature representation fed into the model. Choosing a representation that keeps the features most likely to determine the class of a document leads to better results. We have found that a moderately sized vocabulary of around 1,000 words, consisting of words that are not extremely common but still occur in a reasonable share of documents, works best. Removing stop words is a good option, but setting a maximum document frequency will likely remove most of them anyway. Unfortunately, adding bigrams did not prove very useful in the grid search, possibly due to the scarcity of most word pairs across documents. Lastly, considering the tf-idf score worked well, which suggests that repeated occurrences of a positive word are less significant than its first occurrence.

This assignment has been interesting and has helped me get a better understanding of the options when pre-processing textual data for a machine learning task.