# Sentiment Analysis on IMDB Movie Reviews

---

[Article](https://news.machinelearning.sg/posts/sentiment_analysis_on_movie_reviews_with_xlnet) | [Github](https://github.com/eugenesiow/practical-ml/blob/master/notebooks/Sentiment_Analysis_Movie_Reviews.ipynb) | More Notebooks @ [eugenesiow/practical-ml](https://github.com/eugenesiow/practical-ml)

---



This notebook trains an XLNet model to perform sentiment analysis. The [dataset](https://ai.stanford.edu/~amaas/data/sentiment/) is a balanced collection of 50,000 IMDB movie reviews (split 1:1 between train and test) with binary labels, **`positive`** or **`negative`**, from the paper by [Maas et al. (2011)](https://ai.stanford.edu/~ang/papers/acl11-WordVectorsSentimentAnalysis.pdf). At the time of writing, the state-of-the-art model on this dataset is XLNet by [Yang et al. (2019)](https://arxiv.org/pdf/1906.08237.pdf), with a reported accuracy of [96.2%](http://nlpprogress.com/english/sentiment_analysis.html). We get a lower accuracy of 92.2% for three reasons: GPU memory on Colab is limited, so we use XLNet base instead of the large model; we train for only 1 epoch for speed; and we cannot replicate all of the hyperparameters (notably the sequence length).

The notebook is structured as follows:
* Setting up the GPU Environment
* Getting Data
* Training and Testing the Model
* Using the Model (Running Inference)

## Task Description

> Sentiment analysis is the task of classifying the polarity of a given text.

# Setting up the GPU Environment

#### Ensure we have a GPU runtime

If you're running this notebook in Google Colab, select `Runtime` > `Change Runtime Type` from the menubar. Ensure that `GPU` is selected as the `Hardware accelerator`. This will allow us to use the GPU to train the model subsequently.
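As a quick sanity check that a GPU is actually attached, something like the following sketch can be run in a cell. It assumes the NVIDIA `nvidia-smi` tool is on the path (as it normally is on Colab GPU runtimes); the helper name `describe_gpu` is hypothetical and not part of the original notebook.

```python
import shutil
import subprocess

def describe_gpu():
    """Return the GPU name and memory reported by nvidia-smi, or None if no GPU is visible."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=30,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None

gpu = describe_gpu()
print(gpu if gpu else "No GPU detected: check Runtime > Change Runtime Type")
```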

#### Install Dependencies and Restart Runtime

In [None]:
!pip install -q transformers
!pip install -q simpletransformers
!pip install -q datasets

You might see the error `ERROR: google-colab X.X.X has requirement ipykernel~=X.X, but you'll have ipykernel X.X.X which is incompatible` after installing the dependencies. **This is normal** and caused by the `simpletransformers` library.

The **solution** to this will be to **reset the execution environment** now. Go to the menu `Runtime` > `Restart runtime` then continue on from the next section to download and process the data.

# Getting Data

#### Dataset Description

The IMDb dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb), each labeled as positive or negative (this is the polarity). The dataset is balanced: it contains equal numbers of positive and negative reviews. Only highly polarized reviews are included: a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. No more than 30 reviews are included per movie. There are 25,000 reviews for training and 25,000 for testing.
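The labeling rule described above can be written as a small function. This is only an illustration of the thresholds; the dataset ships already labeled, and the helper name `polarity_from_score` is hypothetical.

```python
def polarity_from_score(score):
    """Map a 1-10 review score to a polarity label, or None for reviews left out."""
    if score <= 4:
        return 0  # negative
    if score >= 7:
        return 1  # positive
    return None   # scores of 5 and 6 are not polar enough to be included

for score in (2, 5, 9):
    print(score, polarity_from_score(score))
```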

#### Pulling the data from `huggingface/datasets`

We use Hugging Face's awesome datasets library to get the pre-processed version of the original [IMDB dataset](https://ai.stanford.edu/~amaas/data/sentiment/). 

The code below pulls the train and test datasets from [huggingface/datasets](https://github.com/huggingface/datasets) using `load_dataset('imdb')` and transforms them into `pandas` dataframes for use with the `simpletransformers` library to train the model.

In [None]:
import pandas as pd
from datasets import load_dataset

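# simpletransformers expects the label column to be called 'labels'.
# rename_column returns a new dataset (the in-place rename_column_ was
# removed in later releases of datasets), so we reassign the result.
dataset_train = load_dataset('imdb', split='train')
dataset_train = dataset_train.rename_column('label', 'labels')
train_df = pd.DataFrame(dataset_train)

dataset_test = load_dataset('imdb', split='test')
dataset_test = dataset_test.rename_column('label', 'labels')
test_df = pd.DataFrame(dataset_test)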

Reusing dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/90099cb476936b753383ba2ae6ab2eae419b2e87f71cd5189cb9c8e5814d12a3)
Reusing dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/90099cb476936b753383ba2ae6ab2eae419b2e87f71cd5189cb9c8e5814d12a3)


Once done we can take a look at the `head()` of the training set to check if our data has been retrieved properly.

In [None]:
train_df.head()

Unnamed: 0,labels,text
0,1,Bromwell High is a cartoon comedy. It ran at t...
1,1,Homelessness (or Houselessness as George Carli...
2,1,Brilliant over-acting by Lesley Ann Warren. Be...
3,1,This is easily the most underrated film inn th...
4,1,This is not the typical Mel Brooks film. It wa...


We also double check the dataset properties are exactly the same as those reported in the papers (25,000 train, 25,000 test size, balanced). **`0`** is the **`negative`** polarity class while **`1`** is the **`positive`** polarity class.

In [None]:
# Count the examples of each class (0 = negative, 1 = positive) in both splits.
train_counts = train_df.labels.value_counts()
test_counts = test_df.labels.value_counts()
data = [[train_counts[0], test_counts[0]],
        [train_counts[1], test_counts[1]]]
# Show the per-class sizes of the train and test sets as a table.
pd.DataFrame(data, columns=["Train", "Test"])

Unnamed: 0,Train,Test
0,12500,12500
1,12500,12500


# Training and Testing the Model

#### Set the Hyperparameters

First we set up the hyperparameters, using those specified in the Yang et al. (2019) paper wherever possible (we take the Yelp hyperparameters, as the IMDB ones are not specified). The table below compares our hyperparameters with the paper's. The major difference is that, due to GPU memory limitations, we cannot use a sequence length of 512; instead we use a sliding window with a sequence length of 64. We also train for only 1 epoch, as we want training to complete quickly.

| Parameter     | Ours | Paper |
|---------------|------|-------|
| Epochs        | 1    | ?     |
| Batch Size    | 128  | 128   |
| Seq Length    | 64   | 512   |
| Learning Rate | 1e-5 | 1e-5  |
| Weight decay  | 1e-2 | 1e-2  |

In [None]:
train_args = {
    'reprocess_input_data': True,
    'overwrite_output_dir': True,
    'sliding_window': True,
    'max_seq_length': 64,
    'num_train_epochs': 1,
    'learning_rate': 0.00001,
    'weight_decay': 0.01,
    'train_batch_size': 128,
    'fp16': True,
    'output_dir': '/outputs/',
}
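
To make the effect of `sliding_window` and `max_seq_length` more concrete, here is a minimal sketch of how a long review could be cut into overlapping windows. It is not the Simple Transformers implementation: the function name is hypothetical, the stride of 0.8 × window size is an assumption about the library's default, and special tokens are ignored. The last two lines reuse the feature count reported in the training log further down to estimate the number of batches per epoch.

```python
import math

def sliding_windows(tokens, window_size=64, stride_fraction=0.8):
    """Split a token list into overlapping windows of at most window_size tokens."""
    step = max(1, int(window_size * stride_fraction))  # assumed stride, not confirmed
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break
        start += step
    return windows

tokens = list(range(150))  # stand-in for the token ids of one long review
windows = sliding_windows(tokens)
print([len(w) for w in windows])  # → [64, 64, 48]

# With windowing, batches are drawn from windows (features) rather than reviews.
steps_per_epoch = math.ceil(173143 / 128)  # features in the training log / batch size
print(steps_per_epoch)  # → 1353
```

Each review therefore contributes several training examples, which is why the number of features ends up much larger than the 25,000 reviews.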

#### Train the Model

Once we have set up the hyperparameters in the `train_args` dictionary, the next step is to train the model. We use the [`xlnet-base-cased` model](https://huggingface.co/xlnet-base-cased) from the awesome [Hugging Face Transformers](https://github.com/huggingface/transformers) library, with the [Simple Transformers library](https://simpletransformers.ai/docs/classification-models/) on top of it so that we can train the classification model with just 2 lines of code.

[XLNet](https://arxiv.org/pdf/1906.08237.pdf) is an auto-regressive language model, based on the transformer architecture with recurrence, which outputs the joint probability of a sequence of tokens. Although it is also bigger than BERT and has a (slightly) different architecture, its change in training objective is probably its biggest contribution. Its training objective is to predict each word in a sequence using any combination of the other words in that sequence, which seems to perform better on ambiguous contexts.
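
The idea of predicting each word from "any combination of other words" can be illustrated with a toy sketch: sample a random prediction order, then let each word see only the words that come earlier in that order (wherever they sit in the sentence). This is a simplified illustration with a hypothetical helper name; it leaves out how the actual model implements the objective (for example, its attention masks).

```python
import random

def permutation_contexts(tokens, seed=0):
    """Sample a prediction order and list which positions each target may see."""
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)
    contexts = []
    for step, target in enumerate(order):
        # Positions predicted earlier in the sampled order are visible,
        # regardless of whether they are left or right of the target.
        visible = sorted(order[:step])
        contexts.append((target, visible))
    return order, contexts

tokens = ["the", "plot", "was", "not", "bad"]
order, contexts = permutation_contexts(tokens)
for target, visible in contexts:
    print(f"predict {tokens[target]!r} from {[tokens[i] for i in visible]}")
```

Changing `seed` gives a different order, so over many samples each word is predicted from many different subsets of the other words.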

In [None]:
from simpletransformers.classification import ClassificationModel
from sklearn.metrics import accuracy_score
import logging

# Keep our own logs verbose but only show warnings from transformers.
logging.basicConfig(level=logging.DEBUG)
transformers_logger = logging.getLogger('transformers')
transformers_logger.setLevel(logging.WARNING)

# We use the XLNet base cased pre-trained model.
model = ClassificationModel('xlnet', 'xlnet-base-cased', num_labels=2, args=train_args)

# Train the model. There is no development or validation set for this dataset, see
# https://simpletransformers.ai/docs/tips-and-tricks/#using-early-stopping
model.train_model(train_df)

# Evaluate the model on the test set in terms of accuracy score.
result, model_outputs, wrong_predictions = model.eval_model(test_df, acc=accuracy_score)

DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /xlnet-base-cased/resolve/main/config.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /xlnet-base-cased/resolve/main/pytorch_model.bin HTTP/1.1" 302 0
Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForSequenceClassification: ['lm_loss.weight', 'lm_loss.bias']
- This IS expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceC

INFO:simpletransformers.classification.classification_model: 173143 features created from 25000 samples.





INFO:simpletransformers.classification.classification_model: Training of xlnet model complete. Saved to /outputs/.
INFO:simpletransformers.classification.classification_model: Converting to features started. Cache is not used.
INFO:simpletransformers.classification.classification_model: Sliding window enabled


Exception ignored in: <bound method _MultiProcessingDataLoaderIter.__del__ of <torch.utils.data.dataloader._MultiProcessingDataLoaderIter object at 0x7fdb13174cc0>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1203, in __del__
    self._shutdown_workers()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1177, in _shutdown_workers
    w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
  File "/usr/lib/python3.6/multiprocessing/process.py", line 122, in join
    assert self._parent_pid == os.getpid(), 'can only join a child process'
AssertionError: can only join a child process
INFO:simpletransformers.classification.classification_model: 25000 features created from 25000 samples.





INFO:simpletransformers.classification.classification_model:{'mcc': 0.8431831642048767, 'tp': 11596, 'tn': 11443, 'fp': 1057, 'fn': 904, 'acc': 0.92156, 'eval_loss': 0.3834925892168989}


We see that the output accuracy from the model after training for 1 epoch is **92.2%** ('acc': 0.92156).
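
The other numbers in that log line can be cross-checked from the confusion-matrix counts (`tp`, `tn`, `fp`, `fn`). The sketch below recomputes the accuracy and the Matthews correlation coefficient (`mcc`) with the standard formulas; precision and recall are not in the log and are included only for illustration.

```python
import math

# Counts copied from the evaluation log above.
tp, tn, fp, fn = 11596, 11443, 1057, 904

total = tp + tn + fp + fn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(total)          # → 25000
print(accuracy)       # → 0.92156
print(round(mcc, 4))  # → 0.8432
print(precision, recall)
```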

# Using the Model (Running Inference)

Running the model to do some predictions/inference is as simple as calling `model.predict(input_list)`.

In [None]:
samples = ['The script is nice.Though the casting is absolutely non-watchable.No style. the costumes do not look like some from the High Highbury society. Comparing Gwyneth Paltrow with Kate Beckinsale I can only say that Ms. Beckinsale speaks British English better than Ms. Paltrow, though in Ms. Paltrow\'s acting lies the very nature of Emma Woodhouse. Mr. Northam undoubtedly is the best Mr. Knightley of all versions, he is romantic and not at all sharp-looking and unfeeling like Mr. Knightley in the TV-version. P.S.The spectator cannot see at all Mr. Elton-Ms. Smith relationship\'s development as it was in the motion version, so one cannot understand where was all Emma\'s trying of make a Elton-Smith match (besides of the portrait).']
predictions, _ = model.predict(samples)
label_dict = {0: 'negative', 1: 'positive'}
for idx, sample in enumerate(samples):
  print('{} - {}: {}'.format(idx, label_dict[predictions[idx]], sample))

INFO:simpletransformers.classification.classification_model: Converting to features started. Cache is not used.
INFO:simpletransformers.classification.classification_model: Sliding window enabled


INFO:simpletransformers.classification.classification_model: 1 features created from 1 samples.





0 - negative: The script is nice.Though the casting is absolutely non-watchable.No style. the costumes do not look like some from the High Highbury society. Comparing Gwyneth Paltrow with Kate Beckinsale I can only say that Ms. Beckinsale speaks British English better than Ms. Paltrow, though in Ms. Paltrow's acting lies the very nature of Emma Woodhouse. Mr. Northam undoubtedly is the best Mr. Knightley of all versions, he is romantic and not at all sharp-looking and unfeeling like Mr. Knightley in the TV-version. P.S.The spectator cannot see at all Mr. Elton-Ms. Smith relationship's development as it was in the motion version, so one cannot understand where was all Emma's trying of make a Elton-Smith match (besides of the portrait).
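
The second value returned by `model.predict` (discarded as `_` above) holds the raw model outputs. Assuming these are per-class scores (logits) ordered as `[negative, positive]`, a softmax turns them into probabilities that can serve as a rough confidence. The scores below are made up for illustration and are not output from the model; with the sliding window enabled, the raw outputs may also contain one row per window rather than one per review.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]  # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

hypothetical_logits = [1.2, -0.7]  # made-up [negative, positive] scores
probs = softmax(hypothetical_logits)

label_dict = {0: 'negative', 1: 'positive'}
best = max(range(len(probs)), key=probs.__getitem__)
print(label_dict[best], round(probs[best], 3))
```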


You can connect to Google Drive with the following code to save any files you want to persist. Alternatively, click the `Files` icon on the left panel and then click `Mount Drive` to mount your Google Drive.

The root of your Google Drive will be mounted to `/content/drive/My Drive/`. If you have problems mounting the drive, you can check out this [tutorial](https://towardsdatascience.com/downloading-datasets-into-google-drive-via-google-colab-bcb1b30b0166).

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

You can move the model checkpoint files, which are saved in the `/outputs/` directory, to your Google Drive.

In [None]:
import shutil
shutil.move('/outputs/', "/content/drive/My Drive/outputs/")

More notebooks are available at [eugenesiow/practical-ml](https://github.com/eugenesiow/practical-ml). Do drop us some feedback on how to improve the notebooks on the [Github repo](https://github.com/eugenesiow/practical-ml/).