<a href="https://colab.research.google.com/github/agrudkow/xlnet/blob/master/notebooks/colab_imdb_gpu.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# XLNet IMDB movie review classification project

This notebook is for classifying the [IMDB sentiment dataset](https://ai.stanford.edu/~amaas/data/sentiment/). It can easily be edited to run any of the classification tasks referenced in the [XLNet paper](https://arxiv.org/abs/1906.08237). While you cannot expect to match the paper's state-of-the-art results on a GPU, this model will still score very highly.

## Setup
Install dependencies

In [1]:
! pip install sentencepiece

Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/f5/99/e0808cb947ba10f575839c43e8fafc9cc44e4a7a2c8f79c60db48220a577/sentencepiece-0.1.95-cp37-cp37m-manylinux2014_x86_64.whl (1.2MB)

Download the pretrained XLNet model and unzip it

In [2]:
# only needs to be done once
! wget https://storage.googleapis.com/xlnet/released_models/cased_L-24_H-1024_A-16.zip
! unzip cased_L-24_H-1024_A-16.zip 

--2021-05-04 19:58:43--  https://storage.googleapis.com/xlnet/released_models/cased_L-24_H-1024_A-16.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 172.253.63.128, 142.250.31.128, 172.217.15.80, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.253.63.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1338042341 (1.2G) [application/zip]
Saving to: ‘cased_L-24_H-1024_A-16.zip’


2021-05-04 19:58:54 (111 MB/s) - ‘cased_L-24_H-1024_A-16.zip’ saved [1338042341/1338042341]

Archive:  cased_L-24_H-1024_A-16.zip
   creating: xlnet_cased_L-24_H-1024_A-16/
  inflating: xlnet_cased_L-24_H-1024_A-16/xlnet_model.ckpt.index  
  inflating: xlnet_cased_L-24_H-1024_A-16/xlnet_model.ckpt.data-00000-of-00001  
  inflating: xlnet_cased_L-24_H-1024_A-16/spiece.model  
  inflating: xlnet_cased_L-24_H-1024_A-16/xlnet_model.ckpt.meta  
  inflating: xlnet_cased_L-24_H-1024_A-16/xlnet_config.json  


Clone the XLNet repo for access to `run_classifier.py` and the rest of the xlnet module

In [3]:
! git clone https://github.com/agrudkow/xlnet.git

Cloning into 'xlnet'...
remote: Enumerating objects: 190, done.
remote: Counting objects: 100% (68/68), done.
remote: Compressing objects: 100% (49/49), done.
remote: Total 190 (delta 23), reused 60 (delta 15), pack-reused 122
Receiving objects: 100% (190/190), 3.57 MiB | 29.93 MiB/s, done.
Resolving deltas: 100% (82/82), done.


Switch the Colab runtime to TensorFlow 1.x

In [4]:
%tensorflow_version 1.x

TensorFlow 1.x selected.


## Define Variables
Define all the directories: data, XLNet scripts, and pretrained model.
If you would like to save models, you can authenticate a GCP account and use it for OUTPUT_DIR and CHECKPOINT_DIR; you will need a large amount of storage to fit these models.

Alternatively, it is easy to integrate a Google Drive account. Check out this guide for [I/O in Colab](https://colab.research.google.com/notebooks/io.ipynb), but remember that the models will take up a large amount of storage.
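
As a minimal sketch of the Google Drive option, the cell below mounts Drive with `google.colab.drive.mount` and falls back to a local directory when it is not running on Colab. The helper name `resolve_storage_root` and the folder names `local_runs` and `xlnet_runs` are hypothetical choices for illustration, not part of the original notebook.

```python
import os


def resolve_storage_root(local_fallback="local_runs", drive_subdir="xlnet_runs"):
    """Return a directory for outputs: on Drive when on Colab, else local."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import drive
    except ImportError:
        os.makedirs(local_fallback, exist_ok=True)
        return local_fallback
    drive.mount("/content/drive")  # prompts for authorisation in Colab
    root = os.path.join("/content/drive/MyDrive", drive_subdir)
    os.makedirs(root, exist_ok=True)
    return root


if __name__ == "__main__":
    storage_root = resolve_storage_root()
    print("Saving under:", storage_root)
```

OUTPUT_DIR and CHECKPOINT_DIR could then point at subfolders of `storage_root` so that they survive a runtime disconnect.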


In [13]:
SCRIPTS_DIR = 'xlnet' #@param {type:"string"}
DATA_DIR = 'xlnet/ists/images' #@param {type:"string"}
OUTPUT_DIR = 'proc_data/ists' #@param {type:"string"}
PRETRAINED_MODEL_DIR = 'xlnet_cased_L-24_H-1024_A-16' #@param {type:"string"}
CHECKPOINT_DIR = 'exp/ists' #@param {type:"string"}
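
Before launching a multi-hour run, it can help to confirm that these paths are usable. The sketch below checks the pretrained model directory for the files the training command reads (file names taken from the unzip output above) and creates the output directories. The helper names `missing_pretrained_files` and `prepare_dirs` are hypothetical, and the directory values are repeated here only to keep the example self-contained.

```python
import os

# File names as listed in the unzip output of the pretrained archive.
REQUIRED_FILES = ("spiece.model", "xlnet_config.json", "xlnet_model.ckpt.index")


def missing_pretrained_files(model_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from model_dir."""
    return [
        name for name in required
        if not os.path.isfile(os.path.join(model_dir, name))
    ]


def prepare_dirs(*dirs):
    """Create each directory (and its parents) if it does not exist yet."""
    for d in dirs:
        os.makedirs(d, exist_ok=True)


if __name__ == "__main__":
    missing = missing_pretrained_files("xlnet_cased_L-24_H-1024_A-16")
    if missing:
        print("Missing pretrained files:", missing)
    prepare_dirs("proc_data/ists", "exp/ists")
```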

## Run Model
This will set off the fine tuning of XLNet. There are a few things to note here:


1.   This script will train and evaluate the model.
2.   The results are stored locally on Colab and will be lost when you are disconnected from the runtime.
3.   This uses the large version of the model (the 24-layer `cased_L-24_H-1024_A-16` checkpoint downloaded above).
4.   We are using a max sequence length of 128 with a batch size of 8; please refer to the [README](https://github.com/zihangdai/xlnet#memory-issue-during-finetuning) for the reasons.
5.   This will take approximately 4 hours to run on a GPU.
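
One detail worth checking in the command in the next cell is that `--warmup_steps=500` is larger than `--train_steps=200`. Assuming a linear warmup from zero to `--learning_rate` (an assumption about the schedule; check the code behind `run_classifier.py` for the exact behaviour), the learning rate at the last training step would still be below its configured peak. The function name `linear_warmup_lr` is hypothetical.

```python
def linear_warmup_lr(step, peak_lr, warmup_steps):
    """Learning rate under a simple linear warmup (assumed schedule)."""
    if warmup_steps <= 0:
        return peak_lr
    # Fraction of the warmup completed, capped at 1.0 once warmup is over.
    return peak_lr * (min(step, warmup_steps) / warmup_steps)


# Values taken from the training command in this notebook.
peak_lr, train_steps, warmup_steps = 2e-5, 200, 500

final_lr = linear_warmup_lr(train_steps, peak_lr, warmup_steps)
print(f"LR at step {train_steps}: {final_lr:.1e} "
      f"({final_lr / peak_lr:.0%} of the configured peak)")
```

Under that assumption the run ends at 200/500 of the peak rate. If the full learning rate is wanted, either raise `--train_steps` or lower `--warmup_steps`.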



In [None]:
train_command = "CUDA_VISIBLE_DEVICES=0 python xlnet/run_classifier.py \
  --do_train=True \
  --do_eval=True \
  --eval_all_ckpt=True \
  --eval_split=test \
  --task_name=ists \
  --data_dir="+DATA_DIR+" \
  --output_dir="+OUTPUT_DIR+" \
  --model_dir="+CHECKPOINT_DIR+" \
  --uncased=False \
  --spiece_model_file="+PRETRAINED_MODEL_DIR+"/spiece.model \
  --model_config_path="+PRETRAINED_MODEL_DIR+"/xlnet_config.json \
  --init_checkpoint="+PRETRAINED_MODEL_DIR+"/xlnet_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=8 \
  --eval_batch_size=8 \
  --num_hosts=1 \
  --num_core_per_host=1 \
  --learning_rate=2e-5 \
  --train_steps=200 \
  --warmup_steps=500 \
  --save_steps=100 \
  --iterations=5"

! {train_command}
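
The string concatenation above works, but it is easy to drop a space or a quote when editing it. As an alternative sketch, the flags can be kept in a dict and joined with `shlex.quote`, which also copes with paths that contain spaces. `build_command` and `sketch_command` are hypothetical names; the flag names and values are the ones used in the cell above, with only a subset shown.

```python
import shlex


def build_command(script, flags, env=None):
    """Join environment variables, the script path and --flag=value pairs."""
    env_part = [f"{k}={shlex.quote(str(v))}" for k, v in (env or {}).items()]
    flag_part = [
        f"--{name}={shlex.quote(str(value))}" for name, value in flags.items()
    ]
    return " ".join(env_part + ["python", shlex.quote(script)] + flag_part)


# Subset of the flags from the training cell; extend as needed.
flags = {
    "do_train": True,
    "do_eval": True,
    "task_name": "ists",
    "data_dir": "xlnet/ists/images",
    "max_seq_length": 128,
    "train_batch_size": 8,
    "learning_rate": 2e-5,
}

sketch_command = build_command(
    "xlnet/run_classifier.py", flags, env={"CUDA_VISIBLE_DEVICES": 0}
)
print(sketch_command)
```

In a notebook cell the result can be run the same way as before, with `! {sketch_command}`.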


## Running & Results
These are the results that I got from running this experiment
### Params
*    `--max_seq_length=128`
*    `--train_batch_size=8`

### Times
*   Training: 1 hr 11 min
*   Evaluation: 2.5 hr

### Results
*  The most accurate model was the one saved at the final step
*  Accuracy: 0.92416, eval_loss: 0.31708


### Model

*   The trained model checkpoints are written to `CHECKPOINT_DIR` (`exp/imdb` for the run reported here; the configuration above uses `exp/ists`)

