<a href="https://colab.research.google.com/github/rahulni/Artificial-Intelligence-Deep-Learning-Machine-Learning-Tutorials/blob/master/Question_Answering_System_using_BERT_%2B_SQuAD_2_0_on_Colab_TPU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This colab file was created by [Pragnakalp Techlabs](https://www.pragnakalp.com/).

You can copy this colab to your Drive and then execute the commands in the given order. For more details, check our blog [NLP Tutorial: Setup Question Answering System using BERT + SQuAD on Colab TPU](https://www.pragnakalp.com/nlp-tutorial-setup-question-answering-system-bert-squad-colab-tpu/)

Check our [BERT based Question and Answering system demo for English and other 8 languages](https://www.pragnakalp.com/demos/BERT-NLP-QnA-Demo/).

You can also [purchase the Demo of our BERT based QnA system including fine-tuned models](https://www.pragnakalp.com/bert-question-n-answering-system-in-python/).

## **BERT Fine-tuning and Prediction on SQuAD 2.0 using Cloud TPU!**

---



### **Overview**
**BERT**, or Bidirectional Encoder Representations from Transformers, is a method of pre-training language representations that obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. The academic paper can be found here: https://arxiv.org/abs/1810.04805.

**SQuAD**, the Stanford Question Answering Dataset, is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. The answer to each question is a segment of text, or span, from the corresponding reading passage, or the question may be unanswerable.

This colab file shows how to fine-tune BERT on the SQuAD dataset and then how to run prediction. Using this, you can create your own **Question Answering System.**

**Prerequisite**: You will need a GCP (Google Cloud Platform) account and a GCS (Google Cloud Storage) bucket to run this colab file.

Please follow the Google Cloud documentation to create a GCP account and a GCS bucket. You get $300 of free credit to get started with any GCP product. You can learn more about it at https://cloud.google.com/tpu/docs/setup-gcp-account

You can create your GCS bucket here: http://console.cloud.google.com/storage.


### **Change Runtime to TPU**

> On the main menu, click on **Runtime** and select **Change runtime type**. Set "**TPU**" as the hardware accelerator.


### **Clone the BERT GitHub repository**


> The first step is to clone the BERT GitHub repository, using the command below.



In [0]:
!git clone https://github.com/google-research/bert.git

Cloning into 'bert'...
remote: Enumerating objects: 336, done.
remote: Total 336 (delta 0), reused 0 (delta 0), pack-reused 336
Receiving objects: 100% (336/336), 283.40 KiB | 3.73 MiB/s, done.
Resolving deltas: 100% (185/185), done.


### **Confirm that BERT repo is cloned properly.**


> `ls -l` produces a long listing. If the BERT repo was cloned properly, you will see the `bert` folder in the current directory.



In [0]:
ls -l

total 8
drwxr-xr-x 3 root root 4096 Dec  7 12:03 bert/
drwxr-xr-x 1 root root 4096 Nov 27 22:38 sample_data/


In [0]:
cd bert

/content/bert


### **BERT repository files**


> Use `ls -l` to check the contents of the `bert` folder; you should see all the files related to BERT.



In [0]:
ls -l

total 396
-rw-r--r-- 1 root root  1323 Dec  7 12:03 CONTRIBUTING.md
-rw-r--r-- 1 root root 16475 Dec  7 12:03 create_pretraining_data.py
-rw-r--r-- 1 root root 13898 Dec  7 12:03 extract_features.py
-rw-r--r-- 1 root root   616 Dec  7 12:03 __init__.py
-rw-r--r-- 1 root root 11358 Dec  7 12:03 LICENSE
-rw-r--r-- 1 root root 37922 Dec  7 12:03 modeling.py
-rw-r--r-- 1 root root  9191 Dec  7 12:03 modeling_test.py
-rw-r--r-- 1 root root 11242 Dec  7 12:03 multilingual.md
-rw-r--r-- 1 root root  6258 Dec  7 12:03 optimization.py
-rw-r--r-- 1 root root  1721 Dec  7 12:03 optimization_test.py
-rw-r--r-- 1 root root 66488 Dec  7 12:03 predicting_movie_reviews_with_bert_on_tf_hub.ipynb
-rw-r--r-- 1 root root 45390 Dec  7 12:03 README.md
-rw-r--r-- 1 root root   110 Dec  7 12:03 requirements.txt
-rw-r--r-- 1 root root 34783 Dec  7 12:03 run_classifier.py
-rw-r--r-- 1 root root 11426 Dec  7 12:03 run_classifier_with_tfhub.py
-rw-r--r-- 1 root root 18667 Dec  7 12:03 run_pretraining.py
-rw-r--r-

### **Download the BERT PRETRAINED MODEL**


BERT Pretrained Model List :


*   [BERT-Large, Uncased (Whole Word Masking)](https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip) : 24-layer, 1024-hidden, 16-heads, 340M parameters
*   [BERT-Large, Cased (Whole Word Masking)](https://storage.googleapis.com/bert_models/2019_05_30/wwm_cased_L-24_H-1024_A-16.zip) : 24-layer, 1024-hidden, 16-heads, 340M parameters
*   [BERT-Base, Uncased](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip) : 12-layer, 768-hidden, 12-heads, 110M parameters
*   [BERT-Large, Uncased](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip) : 24-layer, 1024-hidden, 16-heads, 340M parameters
*   [BERT-Base, Cased](https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip) : 12-layer, 768-hidden, 12-heads, 110M parameters
*   [BERT-Large, Cased](https://storage.googleapis.com/bert_models/2018_10_18/cased_L-24_H-1024_A-16.zip) : 24-layer, 1024-hidden, 16-heads, 340M parameters
*   [BERT-Base, Multilingual Cased (New, recommended)](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip) : 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
*   [BERT-Base, Multilingual Uncased (Orig, not recommended; use Multilingual Cased instead)](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip) : 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
*   [BERT-Base, Chinese](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip) : Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

The BERT authors have released **BERT-Base** and **BERT-Large** models. Uncased means that the text has been lowercased before WordPiece tokenization, e.g., John Smith becomes john smith, whereas Cased means that the true case and accent markers are preserved.

**When using a cased model, make sure to pass --do_lower_case=False at the time of training.**
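
To make the Uncased/Cased distinction concrete, here is a minimal sketch of the normalization an uncased model expects: lowercase the text, then strip accent marks. It only illustrates the idea. The helper name `uncased_normalize` is ours and is not part of the BERT repository, which performs this step inside its own tokenizer.

```python
import unicodedata


def uncased_normalize(text):
    """Lowercase text and strip accent marks (illustrative helper, not BERT's API)."""
    lowered = text.lower()
    # NFD decomposition splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", lowered)
    # Drop the combining marks (Unicode category "Mn").
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(uncased_normalize("John Smith"))    # john smith
print(uncased_normalize("Café Münster"))  # cafe munster
```

A cased model skips this step, which is why the lowercasing flag has to be switched off for it.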

You can download any model of your choice. We have used the **BERT-Large, Uncased** model.


In [0]:
!wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip

--2019-12-07 12:04:02--  https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.201.128, 2607:f8b0:4001:c1b::80
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.201.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1247797031 (1.2G) [application/zip]
Saving to: ‘uncased_L-24_H-1024_A-16.zip’


2019-12-07 12:04:12 (126 MB/s) - ‘uncased_L-24_H-1024_A-16.zip’ saved [1247797031/1247797031]



In [0]:
# Unzip the pretrained model
!unzip uncased_L-24_H-1024_A-16.zip

Archive:  uncased_L-24_H-1024_A-16.zip
   creating: uncased_L-24_H-1024_A-16/
  inflating: uncased_L-24_H-1024_A-16/bert_model.ckpt.meta  
  inflating: uncased_L-24_H-1024_A-16/bert_model.ckpt.data-00000-of-00001  
  inflating: uncased_L-24_H-1024_A-16/vocab.txt  
  inflating: uncased_L-24_H-1024_A-16/bert_model.ckpt.index  
  inflating: uncased_L-24_H-1024_A-16/bert_config.json  
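
Before moving on, it can help to confirm that the archive unpacked completely. The sketch below checks a directory for the five files shown in the unzip listing above. The helper name `missing_model_files` is hypothetical and only for illustration.

```python
import os

# File names taken from the unzip listing above.
EXPECTED_FILES = [
    "bert_config.json",
    "vocab.txt",
    "bert_model.ckpt.index",
    "bert_model.ckpt.meta",
    "bert_model.ckpt.data-00000-of-00001",
]


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from model_dir."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(model_dir, name))]


# Example, run from /content/bert after unzipping:
# print(missing_model_files("uncased_L-24_H-1024_A-16"))
```

An empty list means every expected file is present.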


### **Download the SQuAD 2.0 Dataset**

In [0]:
# Download the SQuAD train and dev datasets
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

--2019-12-07 12:04:49--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.110.153, 185.199.109.153, 185.199.108.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.110.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42123633 (40M) [application/json]
Saving to: ‘train-v2.0.json’


2019-12-07 12:04:50 (152 MB/s) - ‘train-v2.0.json’ saved [42123633/42123633]

--2019-12-07 12:04:51--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.110.153, 185.199.109.153, 185.199.108.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.110.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4370528 (4.2M) [application/json]
Saving to: ‘dev-v2.0.json’


2019-12-07 12:04:51 (47.4 MB/s) - ‘dev-v2.0.json’ saved [4370528/4370528]
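
To get a feel for the file layout, the sketch below walks the nested `data` → `paragraphs` → `qas` structure of a SQuAD 2.0 style dictionary and counts answerable and unanswerable questions. The tiny `sample` is hand-made for illustration and is not taken from the dataset; the function name `count_questions` is ours.

```python
import json


def count_questions(squad):
    """Count (answerable, unanswerable) questions in a SQuAD 2.0 style dict."""
    answerable = unanswerable = 0
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if qa.get("is_impossible", False):
                    unanswerable += 1
                else:
                    answerable += 1
    return answerable, unanswerable


# Hand-made sample in the same shape as train-v2.0.json (illustration only).
sample = {
    "version": "v2.0",
    "data": [{
        "title": "example",
        "paragraphs": [{
            "context": "Google was founded in 1998.",
            "qas": [
                {"id": "a1", "question": "When was Google founded?",
                 "is_impossible": False,
                 "answers": [{"text": "1998", "answer_start": 22}]},
                {"id": "a2", "question": "Who is the CEO?",
                 "is_impossible": True, "answers": []},
            ],
        }],
    }],
}
print(count_questions(sample))  # (1, 1)

# On the downloaded file:
# with open("train-v2.0.json") as f:
#     print(count_questions(json.load(f)))
```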



### **Set up your TPU environment**
*   Verify that you are connected to a TPU device
*   Find out your TPU address, which is used at the time of fine-tuning
*   Perform Google Authentication to access your bucket
*   Upload your credentials to TPU to access your GCS bucket

In [0]:
import datetime
import json
import os
import pprint
import random
import string
import sys
import tensorflow as tf

assert 'COLAB_TPU_ADDR' in os.environ, 'ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!'
TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print('TPU address is => ', TPU_ADDRESS)

from google.colab import auth
auth.authenticate_user()
with tf.Session(TPU_ADDRESS) as session:
  print('TPU devices:')
  pprint.pprint(session.list_devices())

  # Upload credentials to TPU.
  with open('/content/adc.json', 'r') as f:
    auth_info = json.load(f)
  tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
  # Now credentials are set for all future sessions on this TPU.

TPU address is =>  grpc://10.29.165.74:8470
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

TPU devices:
[_DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:CPU:0, CPU, -1, 7635328332459542216),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 2874832984123831798),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 4040080567667641550),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 5679213459339138745),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 13541815098229104680),
 _DeviceAttributes(/job:tpu_worker/replica:0/task

### **Create output directory** 


> You need to create an output directory in your GCS (Google Cloud Storage) bucket, where the fine-tuned model will be saved once training completes. To do so, provide your BUCKET name and OUTPUT DIRECTORY name.

> You also need to move the pre-trained model to the GCS bucket, because the local file system is not supported on TPU. If you don't move your pre-trained model to the bucket, you may face an error.




In [0]:
BUCKET = 'bertnlpdemo' #@param {type:"string"}
assert BUCKET, '*** Must specify an existing GCS bucket name ***'
output_dir_name = 'bert_output' #@param {type:"string"}
BUCKET_NAME = 'gs://{}'.format(BUCKET)
OUTPUT_DIR = 'gs://{}/{}'.format(BUCKET,output_dir_name)
tf.gfile.MakeDirs(OUTPUT_DIR)
print('***** Model output directory: {} *****'.format(OUTPUT_DIR))

***** Model output directory: gs://bertnlpdemo/bert_output *****


### **Move Pretrained Model to GCS Bucket** 


> You need to move the pre-trained model to your GCS (Google Cloud Storage) bucket, because the local file system is not supported on TPU. If you don't move your pre-trained model to the bucket, you may face an error.



> The **gsutil** **mv** command allows you to move data between your local file system and the cloud, move data within the cloud, and move data between cloud storage providers.
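
Since the TPU cannot read local files, a quick sanity check before training is to confirm that the checkpoint and the output directory both point at `gs://` locations. The sketch below is one way to do that; the helper name `non_gcs_paths` and the bucket name `your-bucket` are hypothetical.

```python
def non_gcs_paths(paths):
    """Return the name -> path entries that do not start with gs://."""
    return {name: path for name, path in paths.items()
            if not path.startswith("gs://")}


# Hypothetical bucket name, used only for illustration.
bucket_name = "gs://your-bucket"
tpu_paths = {
    "init_checkpoint": bucket_name + "/uncased_L-24_H-1024_A-16/bert_model.ckpt",
    "output_dir": bucket_name + "/bert_output",
}
print(non_gcs_paths(tpu_paths))  # {}

# A local path is reported so it can be fixed before training starts.
tpu_paths["init_checkpoint"] = "/content/bert/uncased_L-24_H-1024_A-16/bert_model.ckpt"
print(non_gcs_paths(tpu_paths))
```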




In [0]:
!gsutil mv /content/bert/uncased_L-24_H-1024_A-16 $BUCKET_NAME

Copying file:///content/bert/uncased_L-24_H-1024_A-16/bert_config.json [Content-Type=application/json]...
Removing file:///content/bert/uncased_L-24_H-1024_A-16/bert_config.json...
Copying file:///content/bert/uncased_L-24_H-1024_A-16/bert_model.ckpt.index [Content-Type=application/octet-stream]...
Removing file:///content/bert/uncased_L-24_H-1024_A-16/bert_model.ckpt.index...
Copying file:///content/bert/uncased_L-24_H-1024_A-16/vocab.txt [Content-Type=text/plain]...
Removing file:///content/bert/uncased_L-24_H-1024_A-16/vocab.txt...
Copying file:///content/bert/uncased_L-24_H-1024_A-16/bert_model.ckpt.meta [Content-Type=application/octet-stream]...
Removing file:///content/bert/uncased_L-24_H-1024_A-16/bert_model.ckpt.meta...

==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying f

### **Training**

> Below is the command to run the training. To train on a TPU, make sure the following flags are set: `use_tpu` must be true, and `tpu_name` must be the TPU address found above.

1.   --use_tpu=True
2.   --tpu_name=YOUR_TPU_ADDRESS
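
The batch size and epoch count in the training command also determine how many steps the run takes, and therefore the number in the final checkpoint's name. As a rough sketch, assuming `run_squad.py` derives the step count from the number of training questions, and assuming `train-v2.0.json` holds 130,319 questions, the arithmetic looks like this (the function name is ours):

```python
def num_train_steps(num_examples, train_batch_size, num_train_epochs):
    """Total optimizer steps: examples per batch, repeated for each epoch, rounded down."""
    return int(num_examples / train_batch_size * num_train_epochs)


# Assumed question count for train-v2.0.json; batch size and epochs
# match the training command in this notebook.
print(num_train_steps(130319, 24, 2.0))  # 10859
```

Under these assumptions the result matches the `model.ckpt-10859` checkpoint used in the Prediction section. If you change `--train_batch_size` or `--num_train_epochs`, expect a different checkpoint number.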





In [0]:
!python run_squad.py \
  --vocab_file=$BUCKET_NAME/uncased_L-24_H-1024_A-16/vocab.txt \
  --bert_config_file=$BUCKET_NAME/uncased_L-24_H-1024_A-16/bert_config.json \
  --init_checkpoint=$BUCKET_NAME/uncased_L-24_H-1024_A-16/bert_model.ckpt \
  --do_train=True \
  --train_file=train-v2.0.json \
  --do_predict=True \
  --predict_file=dev-v2.0.json \
  --train_batch_size=24 \
  --learning_rate=3e-5 \
  --num_train_epochs=2.0 \
  --use_tpu=True \
  --tpu_name=$TPU_ADDRESS \
  --max_seq_length=384 \
  --doc_stride=128 \
  --version_2_with_negative=True \
  --output_dir=$OUTPUT_DIR




W1203 11:55:59.995234 139660688828288 module_wrapper.py:139] From run_squad.py:1127: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.


W1203 11:55:59.995468 139660688828288 module_wrapper.py:139] From run_squad.py:1127: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.


W1203 11:55:59.995662 139660688828288 module_wrapper.py:139] From /content/bert/modeling.py:93: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.


W1203 11:56:01.214218 139660688828288 module_wrapper.py:139] From run_squad.py:1133: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead.

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related op

### **Create Testing File**


> We create input_file.json as a blank JSON file and then write the data to it in SQuAD format.


*   **touch** is used to create a file
*   **%%writefile** is used to write a file in the colab



> You can put your own questions and context in the file below.
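
If you prefer to generate the file from Python rather than edit JSON by hand, the sketch below builds the same structure from a context string and a list of questions. The helper name `build_squad_input` and the simple `q0`, `q1`, ... ids are our own choices for illustration; any unique id strings should work.

```python
import json


def build_squad_input(title, context, questions):
    """Build a SQuAD 2.0 style dict with one paragraph and the given questions."""
    qas = [
        {"question": question, "id": "q{}".format(i), "is_impossible": False}
        for i, question in enumerate(questions)
    ]
    return {
        "version": "v2.0",
        "data": [{"title": title,
                  "paragraphs": [{"qas": qas, "context": context}]}],
    }


example = build_squad_input(
    "your_title",
    "Google was founded in 1998 by Larry Page and Sergey Brin.",
    ["Who founded google?", "When was google founded?"],
)
print(json.dumps(example, indent=4))

# To save it for the prediction step:
# with open("input_file.json", "w") as f:
#     json.dump(example, f, indent=4)
```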


In [0]:
!touch input_file.json

In [0]:
%%writefile input_file.json
{
    "version": "v2.0",
    "data": [
        {
            "title": "your_title",
            "paragraphs": [
                {
                    "qas": [
                        {
                            "question": "Who is current CEO?",
                            "id": "56ddde6b9a695914005b9628",
                            "is_impossible": ""
                        },
                        {
                            "question": "Who founded google?",
                            "id": "56ddde6b9a695914005b9629",
                            "is_impossible": ""
                        },
                        {
                            "question": "when did IPO take place?",
                            "id": "56ddde6b9a695914005b962a",
                            "is_impossible": ""
                        }
                    ],
                    "context": "Google was founded in 1998 by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University in California. Together they own about 14 percent of its shares and control 56 percent of the stockholder voting power through supervoting stock. They incorporated Google as a privately held company on September 4, 1998. An initial public offering (IPO) took place on August 19, 2004, and Google moved to its headquarters in Mountain View, California, nicknamed the Googleplex. In August 2015, Google announced plans to reorganize its various interests as a conglomerate called Alphabet Inc. Google is Alphabet's leading subsidiary and will continue to be the umbrella company for Alphabet's Internet interests. Sundar Pichai was appointed CEO of Google, replacing Larry Page who became the CEO of Alphabet."                
                 }
            ]
        }
    ]
}

Overwriting input_file.json


### **Prediction**


> Below is the command to run your own custom prediction: edit input_file.json with your own paragraph and questions, then execute the command below.
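
When your paragraph is longer than the model can take in one pass, the `--max_seq_length` and `--doc_stride` flags in the command below decide how it is split into overlapping windows. The sketch below shows that sliding-window idea as we understand it from `run_squad.py`. The function name `doc_spans` is ours, and the query length of 10 tokens is an assumed value for the example.

```python
def doc_spans(num_doc_tokens, max_tokens_for_doc, doc_stride):
    """Return (start, length) windows that cover a document of num_doc_tokens tokens."""
    spans = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_for_doc)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break  # the last window reached the end of the document
        start += min(length, doc_stride)
    return spans


max_seq_length = 384
query_tokens = 10  # assumed question length, for illustration
# Three positions are reserved for the [CLS] token and two [SEP] tokens.
max_tokens_for_doc = max_seq_length - query_tokens - 3

print(doc_spans(500, max_tokens_for_doc, 128))
# [(0, 371), (128, 371), (256, 244)]
```

Each window starts `doc_stride` tokens after the previous one, so neighbouring windows overlap and an answer near a window boundary still appears with context in at least one of them.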



In [0]:
!python run_squad.py \
  --vocab_file=$BUCKET_NAME/uncased_L-24_H-1024_A-16/vocab.txt \
  --bert_config_file=$BUCKET_NAME/uncased_L-24_H-1024_A-16/bert_config.json \
  --init_checkpoint=$OUTPUT_DIR/model.ckpt-10859 \
  --do_train=False \
  --max_query_length=30  \
  --do_predict=True \
  --predict_file=input_file.json \
  --predict_batch_size=8 \
  --n_best_size=3 \
  --max_seq_length=384 \
  --doc_stride=128 \
  --output_dir=output/




W1207 12:10:47.304228 140202182711168 module_wrapper.py:139] From run_squad.py:1127: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.


W1207 12:10:47.304487 140202182711168 module_wrapper.py:139] From run_squad.py:1127: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.


W1207 12:10:47.304673 140202182711168 module_wrapper.py:139] From /content/bert/modeling.py:93: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.


W1207 12:10:48.564931 140202182711168 module_wrapper.py:139] From run_squad.py:1133: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead.

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related op
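
Once prediction has finished, `run_squad.py` writes its answers as JSON files in the output directory, including `predictions.json`, which maps each question id to the predicted answer text. The sketch below pairs those answers with the questions from the input file. The helper name `load_answers` is hypothetical, and the paths in the commented example assume the `--output_dir=output/` setting used above.

```python
import json


def load_answers(input_path, predictions_path):
    """Pair each question in a SQuAD-format input file with its predicted answer."""
    with open(input_path) as f:
        squad = json.load(f)
    with open(predictions_path) as f:
        predictions = json.load(f)  # maps question id -> answer text
    pairs = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # An id missing from the predictions yields an empty answer.
                pairs.append((qa["question"], predictions.get(qa["id"], "")))
    return pairs


# Example, assuming the prediction command above has completed:
# for question, answer in load_answers("input_file.json", "output/predictions.json"):
#     print(question, "->", answer)
```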