<a href="https://colab.research.google.com/github/aditya-malte/Colab-XLNet-FineTuning/blob/master/notebooks/colab_imdb_tpu.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Run SQuAD fine-tuning in Colab


**XLNet** is a new unsupervised language representation learning method based on a novel generalized permutation language modeling objective. Additionally, XLNet employs [Transformer-XL](https://arxiv.org/abs/1901.02860) as the backbone model, exhibiting excellent performance for language tasks involving long context. Overall, XLNet achieves state-of-the-art (SOTA) results on various downstream language tasks including question answering, natural language inference, sentiment analysis, and document ranking.

For a detailed description of the method and the experimental results, please refer to the original paper:

[XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237)

Zhilin Yang\*, Zihang Dai\*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le


### SQuAD2.0









# Colab TPU Demo on SQuAD2.0 Dataset

## Install sentencepiece


In [2]:
!pip install sentencepiece

Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/14/3d/efb655a670b98f62ec32d66954e1109f403db4d937c50d779a75b9763a29/sentencepiece-0.1.83-cp36-cp36m-manylinux1_x86_64.whl (1.0MB)

## Import dependencies

In [0]:
# Import dependencies
import os
import csv
import tensorflow as tf
import pandas as pd  
import subprocess
import sys

## Set up the TPU and connect to Cloud Bucket

In [4]:
import datetime
import json
import pprint
import random
import string
import sys
import tensorflow as tf

print(os.environ)

# Colab exposes the TPU endpoint through the COLAB_TPU_ADDR environment variable.
assert 'COLAB_TPU_ADDR' in os.environ, 'ERROR: Not connected to a TPU runtime; please see the "Instructions" section of this notebook!'
TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print('TPU address is', TPU_ADDRESS)

from google.colab import auth
auth.authenticate_user()
# tf.Session and tf.contrib exist only in TensorFlow 1.x.
with tf.Session(TPU_ADDRESS) as session:
  print('TPU devices:')
  pprint.pprint(session.list_devices())

  # Upload credentials to TPU.
  with open('/content/adc.json', 'r') as f:
    auth_info = json.load(f)
  tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
  # Now credentials are set for all future sessions on this TPU.

TPU address is grpc://10.0.84.122:8470
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

TPU devices:
[_DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:CPU:0, CPU, -1, 384801044293389941),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 4220751183557260883),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 7517815317676144645),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 12889389932413451109),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 1668683676229607260),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/dev
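
The `assert` used for the TPU check is skipped when Python runs with `-O`, and its message is easy to miss. A minimal sketch of the same check as a small helper follows. The name `tpu_address_from_env` is hypothetical; the only fact taken from the notebook is that Colab publishes the TPU endpoint in `COLAB_TPU_ADDR`.

```python
import os


def tpu_address_from_env(env=None):
    """Return the gRPC address of the Colab TPU, or raise a readable error.

    `env` defaults to os.environ; passing a plain dict makes the helper testable.
    """
    env = os.environ if env is None else env
    addr = env.get("COLAB_TPU_ADDR")
    if not addr:
        raise RuntimeError(
            "Not connected to a TPU runtime: choose Runtime > Change runtime "
            "type and set the hardware accelerator to TPU."
        )
    return "grpc://" + addr


# Example with an explicit mapping instead of the real environment.
print(tpu_address_from_env({"COLAB_TPU_ADDR": "10.0.84.122:8470"}))
```

In the notebook this would be called as `TPU_ADDRESS = tpu_address_from_env()`.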

## Download GitHub Repository

In [5]:
git_url = "https://github.com/gonwi/Colab-XLNet-FineTuning.git"  #@param {type:"string"}
os.system("git clone "+git_url)
%cd Colab-XLNet-FineTuning

/content/Colab-XLNet-FineTuning
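
`os.system` returns the exit status, but the cell above discards it, so a failed clone only shows up when a later cell breaks. A hedged alternative is the standard library's `subprocess.run` with `check=True`. The helper name `run_checked` is hypothetical and not part of the repository.

```python
import subprocess
import sys


def run_checked(cmd):
    """Run `cmd` (a list of arguments) and return its stdout as text.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


# Demonstration with a command that is always available: the current interpreter.
print(run_checked([sys.executable, "-c", "print('clone ok')"]).strip())
```

For the clone step, the call would look like `run_checked(["git", "clone", git_url])`.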


In [6]:
!git pull origin master
#Use if you have updated git repo and want changes to reflect

From https://github.com/gonwi/Colab-XLNet-FineTuning
 * branch            master     -> FETCH_HEAD
Already up to date.


## Check the repository contents


In [7]:
repo_name = 'Colab-XLNet-FineTuning' #@param {type:"string"}
%ls
%cd {repo_name}
!ls

classifier_utils.py  misc/            run_classifier.py  train_gpu.py
data_utils.py        modeling.py      run_race.py        train.py
function_builder.py  model_utils.py   run_squad.py       xlnet.py
gpu_utils.py         notebooks/       scripts/
__init__.py          prepro_utils.py  squad_utils.py
LICENSE              README.md        tpu_estimator.py
[Errno 2] No such file or directory: 'Colab-XLNet-FineTuning'
/content/Colab-XLNet-FineTuning
classifier_utils.py  misc	      run_classifier.py  train_gpu.py
data_utils.py	     modeling.py      run_race.py	 train.py
function_builder.py  model_utils.py   run_squad.py	 xlnet.py
gpu_utils.py	     notebooks	      scripts
__init__.py	     prepro_utils.py  squad_utils.py
LICENSE		     README.md	      tpu_estimator.py


## Download the SQuAD dataset



In [8]:
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
!wget https://raw.githubusercontent.com/allenai/bi-att-flow/master/squad/evaluate-v1.1.py
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json 
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json


--2019-11-03 05:44:50--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.111.153, 185.199.110.153, 185.199.109.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.111.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 30288272 (29M) [application/json]
Saving to: ‘train-v1.1.json’


2019-11-03 05:44:51 (156 MB/s) - ‘train-v1.1.json’ saved [30288272/30288272]

--2019-11-03 05:44:58--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.111.153, 185.199.110.153, 185.199.109.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.111.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4854279 (4.6M) [application/json]
Saving to: ‘dev-v1.1.json’


2019-11-03 05:44:58 (53.7 MB/s) - ‘dev-v1.1.json’ saved [4854279/4854279]

--2019-11-03 05:45:00--  

In [0]:
!mkdir squad
!mv train-v1.1.json squad/
!mv dev-v1.1.json squad/
!mv train-v2.0.json squad/
!mv dev-v2.0.json squad/

In [81]:
!ls squad

dev-v1.1.json  dev-v2.0.json  train-v1.1.json  train-v2.0.json
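
Before preprocessing, it can help to confirm that a downloaded file really is a SQuAD JSON file and not an HTML error page. The sketch below walks the standard SQuAD layout (`data` → `paragraphs` → `qas`) and counts questions. The function name `count_squad_questions` and the toy sample are illustrative assumptions, not real data.

```python
def count_squad_questions(dataset):
    """Count questions in a parsed SQuAD JSON object.

    Returns (total, unanswerable). The `is_impossible` field exists only in
    v2.0 files, so it is treated as False when absent.
    """
    total = unanswerable = 0
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                total += 1
                if qa.get("is_impossible", False):
                    unanswerable += 1
    return total, unanswerable


# Tiny hand-made sample in the SQuAD v2.0 layout (not real data).
sample = {
    "version": "v2.0",
    "data": [{
        "title": "Toy",
        "paragraphs": [{
            "context": "XLNet uses Transformer-XL as its backbone.",
            "qas": [
                {"id": "q1", "question": "What backbone does XLNet use?",
                 "answers": [{"text": "Transformer-XL", "answer_start": 11}],
                 "is_impossible": False},
                {"id": "q2", "question": "Who founded XLNet Inc.?",
                 "answers": [], "is_impossible": True},
            ],
        }],
    }],
}
print(count_squad_questions(sample))  # → (2, 1)
```

On the real files, load the JSON first, for example `json.load(open("squad/train-v2.0.json"))`, and pass the result to the function.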


# XLNet End to End (Fine-tuning + Evaluation) in 5 minutes with Cloud TPU

## Instructions

<h3><a href="https://cloud.google.com/tpu/"><img valign="middle" src="https://raw.githubusercontent.com/GoogleCloudPlatform/tensorflow-without-a-phd/master/tensorflow-rl-pong/images/tpu-hexagon.png" width="50"></a>  &nbsp;&nbsp;Train on TPU</h3>

   1. Create a Cloud Storage bucket for your TensorBoard logs at http://console.cloud.google.com/storage and fill in the BUCKET parameter in the "Parameters" section below.
 
   1. On the main menu, click Runtime and select **Change runtime type**. Set "TPU" as the hardware accelerator.
   1. Click Runtime again and select **Runtime > Run All** (Watch out: the "Colab-only auth for this notebook and the TPU" cell requires user input). You can also run the cells manually with Shift-ENTER.

In [84]:
TASK = 'SQUAD' #@param {type:"string"}

TASK_DATA_DIR = 'squad' #@param {type:"string"}
print('***** Task data directory: {} *****'.format(TASK_DATA_DIR))
!ls $TASK_DATA_DIR

BUCKET = 'xlnet-brainrex' #@param {type:"string"}
assert BUCKET, 'Must specify an existing GCS bucket name'
OUTPUT_DIR = 'gs://{}/xlnet/output/{}'.format(BUCKET, TASK)
MODEL_DIR = 'gs://{}/xlnet/model/{}'.format(BUCKET, TASK)

tf.gfile.MakeDirs(OUTPUT_DIR)
tf.gfile.MakeDirs(MODEL_DIR)

print('***** Model output directory: {} *****'.format(OUTPUT_DIR))



***** Task data directory: squad *****
dev-v1.1.json  dev-v2.0.json  train-v1.1.json  train-v2.0.json
***** Model output directory: gs://xlnet-brainrex/xlnet/output/SQUAD *****


## Download the XLNet-Large model

*   Contains pre-trained weights




In [10]:
os.system("wget https://storage.googleapis.com/xlnet/released_models/cased_L-24_H-1024_A-16.zip")
os.system("unzip cased_L-24_H-1024_A-16.zip")
!ls

cased_L-24_H-1024_A-16.zip  modeling.py        squad
classifier_utils.py	    model_utils.py     squad_utils.py
data_utils.py		    notebooks	       tpu_estimator.py
evaluate-v1.1.py	    prepro_utils.py    train_gpu.py
function_builder.py	    README.md	       train.py
gpu_utils.py		    run_classifier.py  xlnet_cased_L-24_H-1024_A-16
__init__.py		    run_race.py        xlnet.py
LICENSE			    run_squad.py
misc			    scripts


In [11]:
%cd xlnet_cased_L-24_H-1024_A-16
!ls

/content/Colab-XLNet-FineTuning/xlnet_cased_L-24_H-1024_A-16
spiece.model	   xlnet_model.ckpt.data-00000-of-00001  xlnet_model.ckpt.meta
xlnet_config.json  xlnet_model.ckpt.index


In [12]:
file_names = os.listdir(os.getcwd())
print(file_names)

['spiece.model', 'xlnet_model.ckpt.index', 'xlnet_model.ckpt.meta', 'xlnet_config.json', 'xlnet_model.ckpt.data-00000-of-00001']


## Copy the weights to Google Cloud Bucket

In [13]:
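# MODEL_DIR is defined in the parameters cell; run that cell first,
# otherwise the name is undefined here.
for file_name in file_names:
  print(file_name)
  # The trailing slash makes gsutil treat the destination as a directory.
  os.system("gsutil cp " + file_name + " " + MODEL_DIR + "/")
os.system("gsutil ls " + MODEL_DIR)
%cd ..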

spiece.model


NameError: ignored
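
A sketch of how the copy step could validate its inputs before shelling out: it builds the `gsutil cp` invocations as argument lists and fails early when the destination is not a `gs://` URL. The helper name `gsutil_copy_commands` is hypothetical, and the bucket name in the example is a placeholder.

```python
def gsutil_copy_commands(file_names, model_dir):
    """Return one `gsutil cp` argument list per file.

    Raises ValueError if `model_dir` is not a gs:// URL, so a missing or
    misconfigured destination is caught before any command runs.
    """
    if not isinstance(model_dir, str) or not model_dir.startswith("gs://"):
        raise ValueError("model_dir must be a gs:// URL, got {!r}".format(model_dir))
    # Normalise to exactly one trailing slash so the destination is a directory.
    destination = model_dir.rstrip("/") + "/"
    return [["gsutil", "cp", name, destination] for name in file_names]


# Placeholder bucket name, for illustration only.
for cmd in gsutil_copy_commands(["spiece.model", "xlnet_config.json"],
                                "gs://my-bucket/xlnet/model/SQUAD"):
    print(" ".join(cmd))
```

Each argument list can then be passed to `subprocess.run(cmd, check=True)` so that a failed copy raises instead of passing silently.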

In [0]:
!rm run_squad.py

In [39]:
!wget https://raw.githubusercontent.com/gonwi/Colab-XLNet-FineTuning/master/run_squad.py

--2019-11-02 18:28:29--  https://raw.githubusercontent.com/gonwi/Colab-XLNet-FineTuning/master/run_squad.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 46080 (45K) [text/plain]
Saving to: ‘run_squad.py’


2019-11-02 18:28:29 (2.21 MB/s) - ‘run_squad.py’ saved [46080/46080]



## Copy the spiece.model file to local directory

In [88]:
os.system("gsutil cp -r " + MODEL_DIR + "/spiece.model spiece.model")
!ls

cased_L-24_H-1024_A-16.zip  modeling.py        spiece.model
classifier_utils.py	    model_utils.py     squad
data_utils.py		    notebooks	       squad_utils.py
evaluate-v1.1.py	    prepro_utils.py    tpu_estimator.py
function_builder.py	    README.md	       train_gpu.py
gpu_utils.py		    run_classifier.py  train.py
__init__.py		    run_race.py        xlnet_cased_L-24_H-1024_A-16
LICENSE			    run_squad.py       xlnet.py
misc			    scripts


## Choose Hyperparameters

## Preprocess the SQuAD dataset for XLNet
This step takes a long time because every character position in the raw data must be mapped accurately to a SentencePiece position, which is what training uses.
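
To illustrate what mapping character positions to token positions means, here is a deliberately simplified sketch that uses whitespace tokens instead of SentencePiece pieces. It is not the algorithm in `run_squad.py`: real pieces do not line up with whitespace and need a more careful alignment. The function names are hypothetical.

```python
def token_spans(text):
    """Split on whitespace and return (token, start, end) with end exclusive."""
    spans = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        spans.append((token, start, end))
        pos = end
    return spans


def char_span_to_token_span(spans, answer_start, answer_text):
    """Map a character-level answer to inclusive (first, last) token indices."""
    answer_end = answer_start + len(answer_text)  # exclusive
    covered = [
        i for i, (_, start, end) in enumerate(spans)
        if start < answer_end and end > answer_start  # token overlaps the answer
    ]
    if not covered:
        raise ValueError("answer does not overlap any token")
    return covered[0], covered[-1]


context = "XLNet uses Transformer-XL as the backbone model."
answer = "the backbone"
spans = token_spans(context)
first, last = char_span_to_token_span(spans, context.index(answer), answer)
print(first, last, [spans[i][0] for i in range(first, last + 1)])
# → 4 5 ['the', 'backbone']
```

SQuAD stores answers as a character offset (`answer_start`) plus the answer text, while the model is trained on token indices, which is why a mapping of this kind is needed for every example.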


In [46]:
#TODO add multicore processing to preprocessing
prepo_command = "python run_squad.py \
  --use_tpu=True \
  --use_colab_tpu=True \
  --do_prepro \
  --num_proc=4 \
  --proc_id=2 \
  --spiece_model_file=./spiece.model \
  --train_file=./squad/train-v2.0.json \
  --output_dir="+OUTPUT_DIR+" \
  --uncased=False \
  --max_seq_length=512 \
"
prepo_command

'python run_squad.py   --use_tpu=True   --use_colab_tpu=True   --do_prepro   --num_proc=4   --proc_id=2   --spiece_model_file=./spiece.model   --train_file=./squad/train-v2.0.json   --output_dir=gs://xlnet-big/xlnet/output/SQUAD   --uncased=False   --max_seq_length=512 '
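
The command above passes `--num_proc=4 --proc_id=2`. Assuming these flags split the examples into strided shards, that is `examples[proc_id::num_proc]` (my reading of the upstream XLNet `run_squad.py`; please verify against the copy in this repository), a single run would preprocess only one quarter of the training set. The sketch below shows that sharding rule on toy data and builds one command per shard. `shard` and `prepro_commands` are hypothetical names, and the bucket in the example is a placeholder.

```python
import shlex


def shard(examples, num_proc, proc_id):
    """Strided shard: every num_proc-th example, starting at proc_id."""
    return examples[proc_id::num_proc]


def prepro_commands(output_dir, num_proc):
    """Build one preprocessing command per shard, reusing the flags shown above."""
    commands = []
    for proc_id in range(num_proc):
        args = [
            "python", "run_squad.py",
            "--use_tpu=True",
            "--use_colab_tpu=True",
            "--do_prepro",
            "--num_proc={}".format(num_proc),
            "--proc_id={}".format(proc_id),
            "--spiece_model_file=./spiece.model",
            "--train_file=./squad/train-v2.0.json",
            "--output_dir={}".format(output_dir),
            "--uncased=False",
            "--max_seq_length=512",
        ]
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


examples = list(range(10))
shards = [shard(examples, 4, i) for i in range(4)]
print(shards)  # → [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]

for command in prepro_commands("gs://my-bucket/xlnet/output/SQUAD", 4):
    print(command)
```

Under that assumption, all four commands have to run (one after another, or in parallel) before training sees the full dataset.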

In [47]:
!{prepo_command}




W1102 18:34:02.971790 140615083079552 module_wrapper.py:139] From run_squad.py:1156: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.


W1102 18:34:02.972000 140615083079552 module_wrapper.py:139] From run_squad.py:1156: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.


W1102 18:34:02.972150 140615083079552 module_wrapper.py:139] From run_squad.py:1158: The name tf.gfile.Exists is deprecated. Please use tf.io.gfile.exists instead.


W1102 18:34:04.141826 140615083079552 module_wrapper.py:139] From run_squad.py:1132: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

INFO:tensorflow:Read examples from ./squad/train-v2.0.json
I1102 18:34:04.142056 140615083079552 run_squad.py:1132] Read examples from ./squad/train-v2.0.json

W1102 18:34:04.142212 140615083079552 module_wrapper.py:139] From run_squad.py:237: The name tf.gfile.Open is deprecated. Please use tf

## Fine-tune XLNet on SQuAD 2.0 with TPU


In [0]:
# Note: the run_squad.py command in the next cell hardcodes its own values
# and does not read these variables.
TRAIN_BATCH_SIZE = 32
EVAL_BATCH_SIZE = 8
PREDICT_BATCH_SIZE = 8
LEARNING_RATE = 2e-5
MAX_SEQ_LENGTH = 256
NUM_TRAIN_STEPS = 4000
WARMUP_STEPS = 500

# Model configs
SAVE_CHECKPOINTS_STEPS = 500
NUM_ITERATIONS = 500
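
To make `WARMUP_STEPS` and `NUM_TRAIN_STEPS` concrete, here is a sketch of a linear warmup followed by a linear decay to zero. This schedule is common when fine-tuning Transformer models and is assumed here purely for illustration. The schedule that `run_squad.py` actually applies may differ (for example in its decay floor), so check the repository's optimizer code before relying on these numbers.

```python
def learning_rate_at(step, base_lr=2e-5, warmup_steps=500, train_steps=4000):
    """Illustrative schedule: linear warmup to base_lr, then linear decay to 0.

    Assumes train_steps > warmup_steps.
    """
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup phase.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr at the end of warmup to 0 at train_steps.
    remaining = max(train_steps - step, 0)
    return base_lr * remaining / (train_steps - warmup_steps)


for step in (0, 250, 500, 2250, 4000):
    print(step, learning_rate_at(step))
```

With the defaults above, the rate peaks at `2e-5` at step 500 and is back to half of that by step 2250.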

In [93]:
train_squad_command = "python run_squad.py \
  --use_tpu=True \
  --use_colab_tpu=True \
  --data_dir=./"+TASK_DATA_DIR+" \
  --output_dir="+OUTPUT_DIR+" \
  --model_dir="+MODEL_DIR+" \
  --num_hosts=1 \
  --num_core_per_host=8 \
  --spiece_model_file=./spiece.model \
  --model_config_path="+MODEL_DIR+"/xlnet_config.json \
  --init_checkpoint="+MODEL_DIR+"/xlnet_model.ckpt \
  --train_file=./squad/train-v2.0.json \
  --predict_file=./squad/dev-v1.1.json \
  --uncased=False \
  --max_seq_length=512 \
  --do_train=False \
  --train_batch_size=16 \
  --do_predict=True \
  --predict_batch_size=32 \
  --learning_rate=3e-5 \
  --overwrite_data  \
  --adam_epsilon=1e-6 \
  --iterations=1000 \
  --save_steps=1000 \
  --train_steps=10000 \
  --warmup_steps=1000 \
"

print(train_squad_command)

python run_squad.py   --use_tpu=True   --use_colab_tpu=True   --data_dir=./squad   --output_dir=gs://xlnet-brainrex/xlnet/output/SQUAD   --model_dir=gs://xlnet-brainrex/xlnet/model/SQUAD   --num_hosts=1   --num_core_per_host=8   --spiece_model_file=./spiece.model   --model_config_path=gs://xlnet-brainrex/xlnet/model/SQUAD/xlnet_config.json   --init_checkpoint=gs://xlnet-brainrex/xlnet/model/SQUAD/xlnet_model.ckpt   --train_file=./squad/train-v2.0.json   --predict_file=./squad/dev-v1.1.json   --uncased=False   --max_seq_length=512   --do_train=False   --train_batch_size=16   --do_predict=True   --predict_batch_size=32   --learning_rate=3e-5   --overwrite_data    --adam_epsilon=1e-6   --iterations=1000   --save_steps=1000   --train_steps=10000   --warmup_steps=1000 
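
The command string above hardcodes its values, so it can drift from the hyperparameter constants defined in the notebook. One way to keep them in sync is to build the command from a dictionary of flags. This is a sketch: `build_command` is a hypothetical helper, the flag names are the ones already used above, and the bucket is a placeholder.

```python
import shlex


def build_command(script, flags):
    """Join `python <script>` with --name=value flags, shell-quoting each part."""
    parts = ["python", script]
    for name, value in flags.items():
        parts.append("--{}={}".format(name, value))
    return " ".join(shlex.quote(part) for part in parts)


# Values mirror the hyperparameter cell; adjust as needed.
TRAIN_BATCH_SIZE = 32
LEARNING_RATE = 2e-5
MAX_SEQ_LENGTH = 256
NUM_TRAIN_STEPS = 4000
WARMUP_STEPS = 500

command = build_command("run_squad.py", {
    "use_tpu": True,
    "use_colab_tpu": True,
    "output_dir": "gs://my-bucket/xlnet/output/SQUAD",  # placeholder bucket
    "max_seq_length": MAX_SEQ_LENGTH,
    "train_batch_size": TRAIN_BATCH_SIZE,
    "learning_rate": LEARNING_RATE,
    "train_steps": NUM_TRAIN_STEPS,
    "warmup_steps": WARMUP_STEPS,
})
print(command)
```

Building the string in one place also makes it easy to print and inspect the exact command before running it with `!{command}`.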


In [1]:
!{train_squad_command}

/bin/bash: {train_squad_command}: command not found
