<a href="https://colab.research.google.com/github/agrudkow/xlnet/blob/master/notebooks/colab_imdb_gpu.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# XLNet IMDB movie review classification project

This notebook fine-tunes XLNet to classify the [IMDB sentiment dataset](https://ai.stanford.edu/~amaas/data/sentiment/). It can be adapted to run any of the classification tasks referenced in the [XLNet paper](https://arxiv.org/abs/1906.08237). Although you cannot expect to reproduce the paper's state-of-the-art results on a single GPU, the model still scores highly (see the results at the end of this notebook).

## Setup
Install dependencies

In [1]:
! pip install sentencepiece

Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/f5/99/e0808cb947ba10f575839c43e8fafc9cc44e4a7a2c8f79c60db48220a577/sentencepiece-0.1.95-cp37-cp37m-manylinux2014_x86_64.whl (1.2MB)

Download the pretrained XLNet model and unzip

In [2]:
# only needs to be done once
#! wget https://storage.googleapis.com/xlnet/released_models/cased_L-24_H-1024_A-16.zip
#! unzip cased_L-24_H-1024_A-16.zip 

In [3]:
# Download and unzip base model
! wget https://storage.googleapis.com/xlnet/released_models/cased_L-12_H-768_A-12.zip
! unzip cased_L-12_H-768_A-12.zip

--2021-05-25 13:12:54--  https://storage.googleapis.com/xlnet/released_models/cased_L-12_H-768_A-12.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.132.128, 74.125.201.128, 74.125.202.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.132.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 433638019 (414M) [application/zip]
Saving to: ‘cased_L-12_H-768_A-12.zip’


2021-05-25 13:12:59 (100 MB/s) - ‘cased_L-12_H-768_A-12.zip’ saved [433638019/433638019]

Archive:  cased_L-12_H-768_A-12.zip
   creating: xlnet_cased_L-12_H-768_A-12/
  inflating: xlnet_cased_L-12_H-768_A-12/xlnet_model.ckpt.index  
  inflating: xlnet_cased_L-12_H-768_A-12/xlnet_model.ckpt.data-00000-of-00001  
  inflating: xlnet_cased_L-12_H-768_A-12/spiece.model  
  inflating: xlnet_cased_L-12_H-768_A-12/xlnet_model.ckpt.meta  
  inflating: xlnet_cased_L-12_H-768_A-12/xlnet_config.json  


Git clone XLNet repo for access to run_classifier and the rest of the xlnet module

In [4]:
! git clone https://github.com/agrudkow/xlnet.git

Cloning into 'xlnet'...
remote: Enumerating objects: 242, done.
remote: Counting objects: 100% (120/120), done.
remote: Compressing objects: 100% (83/83), done.
remote: Total 242 (delta 50), reused 91 (delta 28), pack-reused 122
Receiving objects: 100% (242/242), 3.80 MiB | 15.75 MiB/s, done.
Resolving deltas: 100% (109/109), done.


Switch the runtime to TensorFlow 1.x, which the XLNet scripts are written against

In [5]:
%tensorflow_version 1.x

TensorFlow 1.x selected.


## Define Variables
Define all the directories: data, XLNet scripts, and pretrained model.
If you would like to save models, you can authenticate a GCP account and use it for OUTPUT_DIR and CHECKPOINT_DIR; you will need a large amount of storage to fit these models.

Alternatively, it is easy to integrate a Google Drive account; check out this guide for [I/O in Colab](https://colab.research.google.com/notebooks/io.ipynb), but remember that the checkpoints will take up a large amount of storage.


In [17]:
SCRIPTS_DIR = 'xlnet' #@param {type:"string"}
DATA_DIR = 'xlnet/ists/images' #@param {type:"string"}
OUTPUT_DIR = 'proc_data/ists' #@param {type:"string"}
PRETRAINED_MODEL_DIR = 'xlnet_cased_L-12_H-768_A-12' #@param {type:"string"}
CHECKPOINT_DIR = 'exp/ists' #@param {type:"string"}
PREDICIT_DIR = 'xlnet/pred/ists/images-8000' #@param {type:"string"}
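Before starting a long fine-tuning run, it may help to confirm that the unzipped model directory holds the files the commands below point at. The following is a minimal sketch, not part of the XLNet repo: the helper name `missing_model_files` is hypothetical, and the file names are the ones listed in the unzip output above.

```python
import os

# File names taken from the unzip output earlier in this notebook.
REQUIRED_FILES = (
    "spiece.model",
    "xlnet_config.json",
    "xlnet_model.ckpt.index",
    "xlnet_model.ckpt.meta",
    "xlnet_model.ckpt.data-00000-of-00001",
)


def missing_model_files(model_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from model_dir."""
    return [name for name in required
            if not os.path.isfile(os.path.join(model_dir, name))]


# Assumed directory name, matching PRETRAINED_MODEL_DIR above.
missing = missing_model_files("xlnet_cased_L-12_H-768_A-12")
print("missing files:", missing or "none")
```

An empty list means every expected file is present; anything else names the files to download again.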

## Run Model
This will set off the fine tuning of XLNet. There are a few things to note here:


1.   This script will train and evaluate the model.
2.   The results are stored locally on Colab and will be lost when you are disconnected from the runtime.
3.   This uses the base version of the model (`cased_L-12_H-768_A-12`) downloaded in the setup section; the download of the large version is commented out there.
4.   We use a max sequence length of 128 with a batch size of 8; please refer to the [README](https://github.com/zihangdai/xlnet#memory-issue-during-finetuning) for the reason.
5.   This will take approx. 4 hrs to run on a GPU.
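The command in the next cell is assembled by string concatenation. As an alternative, the same kind of command line could be rendered from a dictionary of flags, which keeps each flag on its own line and quotes values that contain spaces. This is only a sketch: `build_command` is a hypothetical helper, and the dictionary holds just a subset of the flags for illustration.

```python
import shlex


def build_command(script, flags, env=None):
    """Render `script` plus --name=value flags as one shell command line."""
    parts = ["{}={}".format(k, shlex.quote(str(v)))
             for k, v in (env or {}).items()]
    parts += ["python", shlex.quote(script)]
    parts += ["--{}={}".format(k, shlex.quote(str(v)))
              for k, v in flags.items()]
    return " ".join(parts)


# A subset of the flags from the training cell, for illustration only.
train_flags = {
    "do_train": True,
    "task_name": "ists",
    "data_dir": "xlnet/ists/images",
    "max_seq_length": 128,
    "train_batch_size": 8,
}
command = build_command("xlnet/run_classifier.py", train_flags,
                        env={"CUDA_VISIBLE_DEVICES": 0})
print(command)
```

In Colab the resulting string can be run the same way as the concatenated one, with `! {command}`.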



In [16]:
train_command = "CUDA_VISIBLE_DEVICES=0 python xlnet/run_classifier.py \
  --do_train=True \
  --do_eval=True \
  --eval_all_ckpt=True \
  --eval_split=test \
  --task_name=ists \
  --data_dir="+DATA_DIR+" \
  --output_dir="+OUTPUT_DIR+" \
  --model_dir="+CHECKPOINT_DIR+" \
  --uncased=False \
  --spiece_model_file="+PRETRAINED_MODEL_DIR+"/spiece.model \
  --model_config_path="+PRETRAINED_MODEL_DIR+"/xlnet_config.json \
  --init_checkpoint="+PRETRAINED_MODEL_DIR+"/xlnet_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=8 \
  --eval_batch_size=8 \
  --num_hosts=1 \
  --num_core_per_host=1 \
  --learning_rate=2e-5 \
  --train_steps=8000 \
  --warmup_steps=500 \
  --save_steps=2000"

! {train_command}





W0525 14:17:39.571574 140434229835648 module_wrapper.py:139] From xlnet/run_classifier.py:679: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.


W0525 14:17:39.571855 140434229835648 module_wrapper.py:139] From xlnet/run_classifier.py:679: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.


W0525 14:17:39.572097 140434229835648 module_wrapper.py:139] From xlnet/run_classifier.py:704: The name tf.gfile.Exists is deprecated. Please use tf.io.gfile.exists instead.


W0525 14:17:39.629650 140434229835648 module_wrapper.py:139] From /content/xlnet/model_utils.py:27: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.


W0525 14:17:39.630038 140434229835648 module_wrapper.py:139] From /content/xlnet/model_utils.py:36: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

INFO:tensorflow:Single device mode.
I0525 14:17:39.630232 140434
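The `--train_steps`, `--save_steps` and `--train_batch_size` flags above determine how many examples the run processes and which checkpoint steps to expect. The sketch below works through that arithmetic. It assumes a checkpoint is written every `save_steps` steps and once more at the final step; the run may write additional checkpoints (for example at step 0) that this helper does not list.

```python
def checkpoint_steps(train_steps, save_steps):
    """Steps at which a periodic checkpoint is expected, ending at train_steps."""
    steps = list(range(save_steps, train_steps + 1, save_steps))
    if not steps or steps[-1] != train_steps:
        # Assumption: a final checkpoint is also written at the last step.
        steps.append(train_steps)
    return steps


# Values copied from the training command above.
TRAIN_STEPS = 8000
SAVE_STEPS = 2000
TRAIN_BATCH_SIZE = 8

print("expected checkpoints:", checkpoint_steps(TRAIN_STEPS, SAVE_STEPS))
print("examples processed:", TRAIN_STEPS * TRAIN_BATCH_SIZE)
```

With `--eval_all_ckpt=True`, each of these checkpoints is evaluated, which is part of why evaluation takes a while.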

## Predict

In [18]:
predict_command = "CUDA_VISIBLE_DEVICES=0 python xlnet/run_classifier.py \
  --do_train=False \
  --do_predict=True \
  --eval_all_ckpt=True \
  --eval_split=test \
  --task_name=ists \
  --data_dir="+DATA_DIR+" \
  --output_dir="+OUTPUT_DIR+" \
  --model_dir="+CHECKPOINT_DIR+" \
  --predict_dir="+PREDICIT_DIR+" \
  --uncased=False \
  --spiece_model_file="+PRETRAINED_MODEL_DIR+"/spiece.model \
  --model_config_path="+PRETRAINED_MODEL_DIR+"/xlnet_config.json \
  --init_checkpoint="+PRETRAINED_MODEL_DIR+"/xlnet_model.ckpt \
  --max_seq_length=128 \
  --predict_batch_size=8 \
  --num_hosts=1 \
  --num_core_per_host=1"

! {predict_command}




W0525 15:06:29.718904 140336404424576 module_wrapper.py:139] From xlnet/run_classifier.py:679: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.


W0525 15:06:29.719152 140336404424576 module_wrapper.py:139] From xlnet/run_classifier.py:679: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.


W0525 15:06:29.719393 140336404424576 module_wrapper.py:139] From xlnet/run_classifier.py:687: The name tf.gfile.Exists is deprecated. Please use tf.io.gfile.exists instead.


W0525 15:06:29.719647 140336404424576 module_wrapper.py:139] From xlnet/run_classifier.py:688: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead.


W0525 15:06:29.773324 140336404424576 module_wrapper.py:139] From /content/xlnet/model_utils.py:27: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.


W0525 15:06:29.773730 140336404424576 module_wrapper.py:139] From /cont
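The prediction run writes `ists.tsv` and `ists.logits.json` into the prediction directory (both appear in the commit output later in this notebook). A quick sanity check is to count how often each label was predicted. The sketch below assumes the TSV is tab-separated with a header row containing `index` and `prediction` columns, as in the upstream `run_classifier.py`; this fork may use a different layout, so check the file first. The sample text is made up for illustration.

```python
import csv
import io
from collections import Counter


def count_predictions(tsv_text):
    """Count how often each label appears in the 'prediction' column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(row["prediction"] for row in reader)


# Made-up sample in the assumed layout; the real file lives under PREDICIT_DIR.
sample = "index\tprediction\n0\t1\n1\t0\n2\t1\n"
print(count_predictions(sample))
```

To use it on the real output, read the file contents with `open(path).read()` and pass the text to `count_predictions`.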

# Push results to GitHub

#### Check repo status

In [19]:
%cd /content/xlnet &> /dev/null
!git status

%cd /content &> /dev/null

/content/xlnet
On branch master
Your branch is up to date with 'origin/master'.

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	pred/ists/images-8000/

nothing added to commit but untracked files present (use "git add" to track)
/content


#### Check repo diff

In [21]:
%cd /content/xlnet &> /dev/null
!git diff

%cd /content &> /dev/null

/content/xlnet
/content


#### Set up GitHub environment vars

In [19]:
%cd /content/xlnet &> /dev/null

files = '.' #@param {type:"string"}
branch = 'master' #@param {type:"string"}

%cd /content &> /dev/null

#### Commit changes

In [20]:
# &> /dev/null - hide output
%cd /content/xlnet &> /dev/null

from getpass import getpass

uname = getpass('User name:')
email = getpass('Email:')
# token -> https://docs.github.com/en/github/authenticating-to-github/keeping-your-account-and-data-secure/creating-a-personal-access-token
# Ticking the 'Access public repositories' option is sufficient
token = getpass('Token:')

!git config --global user.email $email 

# Change the name
!git config --global --replace-all user.name 'Artur Grudkowski'
!git remote set-url origin https://{uname}:{token}@github.com/agrudkow/xlnet.git &> /dev/null

# check out the branch, stage the files, commit, rebase onto the remote, and push
!git checkout $branch
!git add $files
!git commit -m 'feat(pred): add prediciotns for images' -m "Config: base-xlnet, 8000 steps, 500 warm-up steps" 
!git pull --rebase 
!git push origin $branch

uname = ''
email = ''
token = ''
!git remote set-url origin '' &> /dev/null

%cd /content &> /dev/null


/content/xlnet
User name:··········
Email:··········
Token:··········
Already on 'master'
Your branch is up to date with 'origin/master'.
[master bcbe209] feat(pred): add prediciotns for images
 2 files changed, 24391 insertions(+)
 create mode 100644 pred/ists/images-8000/ists.logits.json
 create mode 100644 pred/ists/images-8000/ists.tsv
Current branch master is up to date.
Counting objects: 7, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (6/6), done.
Writing objects: 100% (7/7), 195.89 KiB | 5.60 MiB/s, done.
Total 7 (delta 1), reused 0 (delta 0)
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
To https://github.com/agrudkow/xlnet.git
   a3be0a9..bcbe209  master -> master
/content/xlnet
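The commit cell places the user name and token directly into the remote URL. If either value contains characters such as `@`, `:` or `/`, the URL would be parsed incorrectly, so percent-encoding them first is a reasonable precaution. The sketch below is one way to do that; `build_remote_url` and `redact` are hypothetical helpers, and the credentials shown are made up.

```python
from urllib.parse import quote


def build_remote_url(user, token, owner, repo):
    """Return an HTTPS remote URL with percent-encoded credentials."""
    return "https://{}:{}@github.com/{}/{}.git".format(
        quote(user, safe=""), quote(token, safe=""), owner, repo)


def redact(url, token):
    """Replace the encoded token so the URL can be printed safely."""
    if not token:
        return url
    return url.replace(quote(token, safe=""), "***")


# Made-up credentials for illustration only.
url = build_remote_url("alice", "ab/c d", "agrudkow", "xlnet")
print(redact(url, "ab/c d"))
```

Only the redacted form should ever be printed or logged; the full URL is meant for `git remote set-url` alone.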


# Zip checkpoints

In [34]:
!pwd
%cd /content/exp/ists/
!zip -r  /content/images-4000-ckpt.zip *.ckpt-4000.*

/content/exp/ists
/content/exp/ists
  adding: model.ckpt-4000.data-00000-of-00001 (deflated 20%)
  adding: model.ckpt-4000.index (deflated 68%)
  adding: model.ckpt-4000.meta (deflated 92%)
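The same archive could be built from Python instead of the `zip` command, which avoids changing the working directory. This is a minimal sketch using the standard `zipfile` and `glob` modules; `zip_checkpoint` is a hypothetical helper, and the paths in the example call are the ones used in the cell above.

```python
import glob
import os
import zipfile


def zip_checkpoint(ckpt_dir, step, archive_path):
    """Archive every file in ckpt_dir whose name matches '*.ckpt-<step>.*'."""
    pattern = os.path.join(ckpt_dir, "*.ckpt-{}.*".format(step))
    files = sorted(glob.glob(pattern))
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            # Store bare file names, as `zip` does when run inside ckpt_dir.
            archive.write(path, arcname=os.path.basename(path))
    return [os.path.basename(path) for path in files]


if os.path.isdir("/content/exp/ists"):
    # Paths taken from the zip cell above.
    print(zip_checkpoint("/content/exp/ists", 4000,
                         "/content/images-4000-ckpt.zip"))
```

The function returns the names it archived, so an empty list signals that no checkpoint exists for the requested step.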


# Copy files to Google drive

### Mount drive

In [31]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Copy selected files

In [35]:
%cp -av "/content/images-4000-ckpt.zip" "/content/drive/MyDrive/nlp"

'/content/images-4000-ckpt.zip' -> '/content/drive/MyDrive/nlp/images-4000-ckpt.zip'


## Running & Results
These are the results that I got from running this experiment.
### Params
*    --max_seq_length=128
*    --train_batch_size=8

### Times
*   Training: 1hr 11mins
*   Evaluation: 2.5hr

### Results
*  Most accurate model on final step
*  Accuracy: 0.92416, eval_loss: 0.31708


### Model

*   The trained model checkpoints can be found in the directory set by `CHECKPOINT_DIR` (`exp/ists` with the values defined above)

