<a href="https://colab.research.google.com/github/srulikbd/Machine-Learning-with-Python/blob/master/Fine_Tuning_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning GPT2 on Colab GPU… For Free!

This is a Colab notebook for the [associated Medium article](https://medium.com/p/340468c92ed). This copy applies the same workflow to `bert-base-multilingual-cased` with Spanish text.

## Installing Dependencies
Normally we would run `pip3 install transformers` in Bash, but because this is Colab, we have to prefix the command with `!`.

In [None]:
!pip3 install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/48/35/ad2c5b1b8f99feaaf9d7cdadaeef261f098c6e1a6a2935d4d07662a6b780/transformers-2.11.0-py3-none-any.whl (674kB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers==0.7.0
  Downloading https://files.pythonhosted.org/packages/14/e5/a26eb4716523808bb0a799fcfdceb6ebf77a18169d9591b2f46a9adb87d9/tokenizers-0.7.0-cp36-cp36m-manylinux1_x86_64.whl (3.8MB)

## Getting WikiText Data

WikiText comes in two sizes, WikiText-2 and WikiText-103. We use WikiText-2 because it is smaller, and Colab limits both how long we can run on a GPU and how much data we can load into memory. To download and unpack it, run the cell below (the commands are commented out here because this notebook trains on a Spanish dataset prepared further down):

In [None]:
%%bash
# wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip
# unzip wikitext-2-raw-v1.zip

## Fine-Tuning GPT2

HuggingFace provides a script, `run_language_modeling.py`, to help fine-tune models. We can download it by running:

In [None]:
! wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/language-modeling/run_language_modeling.py

--2020-06-23 13:26:28--  https://raw.githubusercontent.com/huggingface/transformers/master/examples/language-modeling/run_language_modeling.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10399 (10K) [text/plain]
Saving to: ‘run_language_modeling.py’


2020-06-23 13:26:28 (108 MB/s) - ‘run_language_modeling.py’ saved [10399/10399]



Now we are ready to fine-tune.

The script takes many parameters, and running it with `--help` lists them all. I'm just going to go over the important ones for basic training.

- `output_dir` is where the model will be output
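- `model_type` is the architecture to use, for example `gpt2` or `bert`. The script only needs it when training from scratch; the cell below fine-tunes a BERT checkpoint
- `model_name_or_path` is the name or path of the pretrained model to start from. If you want to train from scratch, you can leave this unset. The walkthrough was written for `gpt2`; the cell below uses `bert-base-multilingual-cased`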
- `do_train` tells it to train
- `train_data_file` points to the training file
- `do_eval` tells it to evaluate afterwards. Not always required, but good to have
- `eval_data_file` points to the evaluation file

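Some extra ones you may care about, though you can skip them:
- `save_steps` is how often, in steps, to save checkpoints. If you have limited disk space, you can set this to -1 so it'll skip saving until the end
- `per_gpu_train_batch_size` is the batch size per GPU. You can increase this if your GPU has enough memory. To be safe, start with 1 and ramp it up while memory allows
- `num_train_epochs` is the number of epochs to train. Since we're fine-tuning, I'm going to set this to 2
- `mlm` trains with the masked-language-modeling loss, which BERT-style models need
- `line_by_line` treats each line of the data file as a separate example
- `overwrite_output_dir` lets the script overwrite the contents of an existing output directory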
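
The options above can also be kept in a dictionary and turned into a command line from Python. The following is a minimal sketch rather than part of the HuggingFace script: `build_cli_args` is a hypothetical helper introduced here for illustration, and the option values are the ones used in the training cell of this notebook.

```python
def build_cli_args(options):
    """Turn a dict of options into --key=value / --flag strings (hypothetical helper)."""
    args = []
    for name, value in options.items():
        if value is True:
            args.append(f"--{name}")           # boolean switch such as --do_train
        elif value is False or value is None:
            continue                            # switched off: leave the flag out
        else:
            args.append(f"--{name}={value}")    # ordinary key=value option
    return args


options = {
    "output_dir": "/content/output",
    "model_name_or_path": "bert-base-multilingual-cased",
    "do_train": True,
    "train_data_file": "/content/sample_data/LML-small-train.txt",
    "do_eval": True,
    "eval_data_file": "/content/sample_data/LML-small-test.txt",
    "per_gpu_train_batch_size": 1,
    "save_steps": -1,
    "num_train_epochs": 2,
    "mlm": True,
    "line_by_line": True,
}

# The resulting list could be handed to subprocess.run(command, check=True).
command = ["python", "run_language_modeling.py"] + build_cli_args(options)
print(" ".join(command))
```

Keeping the options in one place like this makes it easier to change a single value, such as the batch size, between runs.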


In [None]:
import pandas as pd
import numpy as np

# Convert each CSV split into a plain-text file with one text per line.
# The "ï»¿" in the preview suggests the files are UTF-8 with a BOM, so reading
# them as latin1 shows accented characters as mojibake.
for split in ("train", "test"):
    data = pd.read_csv(f"/content/sample_data/DB_spanish_clean-{split}.csv",
                       encoding='latin1', header=None)
    print(data.head())
    clean_data = data[0]  # keep only the text column (the test file parses into 6 columns)
    np.savetxt(f'/content/sample_data/LML-small-{split}.txt', clean_data.values, fmt='%s')

                                                   0
0  ï»¿La sociedad civil pide la urgente liberaci?...
1  Espa?a va a la ruina pero aumenta las  transfe...
2  Israel mantiene prisioneros a 200 ni?os con la...
3  Muere un preso palestino por negligencia m?dic...
4  Israel confisca terrenos palestinos de la mezq...
                                                   0  ...        5
0  ï»¿"Espero que se reanude la audiencia de dete...  ...   coâ¦"
1  Estado criminal de Israel ha asesinado un ni?o...  ...      NaN
2  #IAI de #Israel utiliza inteligencia artificia...  ...      NaN
3  Â¿C?mo afecta a #Israel el conflicto entre la ...  ...      NaN
4  #LaCuarentenaMata Instituto cardiovascular anu...  ...      NaN

[5 rows x 6 columns]
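
The `ï»¿` and `Â¿` sequences in this preview are what UTF-8 bytes look like when they are decoded as latin1. As a hedged sketch, assuming that this is the only thing that went wrong, the text can be repaired by re-encoding it as latin1 and decoding it as UTF-8. `repair_mojibake` is a hypothetical helper, not something the notebook defines, and it cannot bring back characters that were already replaced by a literal `?`.

```python
def repair_mojibake(text):
    """Undo a UTF-8-read-as-latin1 mix-up; return the text unchanged otherwise."""
    try:
        # Recover the original bytes, then decode them as UTF-8.
        # "utf-8-sig" also drops a leading byte-order mark (the "ï»¿" prefix).
        return text.encode("latin-1").decode("utf-8-sig")
    except UnicodeError:
        # Not mojibake of this kind (for example, correctly decoded text).
        return text


samples = [
    "\u00ef\u00bb\u00bfLa sociedad civil",   # starts with "ï»¿", a BOM read as latin1
    "\u00c2\u00bfC?mo afecta a #Israel",     # starts with "Â¿", an inverted question mark read as latin1
    "Espa\u00f1a",                           # already correct, must stay as it is
]
for sample in samples:
    print(repr(sample), "->", repr(repair_mojibake(sample)))
```

If the CSV files really are UTF-8 with a BOM, passing `encoding='utf-8-sig'` to `pd.read_csv` would avoid the problem at the source.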


In [None]:
%%bash
# Note: %%bash has to be the first line of the cell.
# The WikiText-2 variant of this cell used TRAIN_FILE=wikitext-2-raw/wiki.train.raw,
# TEST_FILE=wikitext-2-raw/wiki.test.raw and OUTPUT_DIR=output.
export TRAIN_FILE=/content/sample_data/LML-small-train.txt
export TEST_FILE=/content/sample_data/LML-small-test.txt
export MODEL_NAME=bert-base-multilingual-cased
export OUTPUT_DIR=/content/output

# Fine-tune with the masked-language-modeling objective (--mlm), treating each
# line of the input files as one example (--line_by_line).
python run_language_modeling.py \
    --output_dir=$OUTPUT_DIR \
    --model_type=bert \
    --model_name_or_path=$MODEL_NAME \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --per_gpu_train_batch_size=1 \
    --save_steps=-1 \
    --num_train_epochs=2 \
    --mlm \
    --line_by_line \
    --overwrite_output_dir

# Print the full list of options.
python run_language_modeling.py \
    --help

{"loss": 3.4809115029647946, "learning_rate": 4.990124042032077e-05, "epoch": 0.003950383187169155, "step": 500}
{"loss": 3.370515186324716, "learning_rate": 4.980248084064155e-05, "epoch": 0.00790076637433831, "step": 1000}
{"loss": 3.291939713080108, "learning_rate": 4.9703721260962314e-05, "epoch": 0.011851149561507466, "step": 1500}
{"loss": 3.198993044391449, "learning_rate": 4.9604961681283084e-05, "epoch": 0.01580153274867662, "step": 2000}
{"loss": 3.1517792300977745, "learning_rate": 4.950620210160386e-05, "epoch": 0.019751915935845778, "step": 2500}
{"loss": 3.2199107205306645, "learning_rate": 4.940744252192463e-05, "epoch": 0.02370229912301493, "step": 3000}
{"loss": 3.1380331875588743, "learning_rate": 4.93086829422454e-05, "epoch": 0.027652682310184088, "step": 3500}
{"loss": 3.1134904662013434, "learning_rate": 4.920992336256617e-05, "epoch": 0.03160306549735324, "step": 4000}
{"loss": 3.1257499504419974, "learning_rate": 4.911116378288694e-05, "epoch": 0.035553448684522
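
The first log lines are enough to estimate how long the run is. The sketch below parses two of the logged records, derives the number of steps per epoch from the `step` and `epoch` fields, and checks that the logged learning rate matches a linear decay to zero. The base learning rate of 5e-5 and the absence of warmup are assumptions inferred from the logged values, not settings stated in this notebook.

```python
import json

# Two records copied from the training log.
log_lines = [
    '{"loss": 3.4809115029647946, "learning_rate": 4.990124042032077e-05, '
    '"epoch": 0.003950383187169155, "step": 500}',
    '{"loss": 3.370515186324716, "learning_rate": 4.980248084064155e-05, '
    '"epoch": 0.00790076637433831, "step": 1000}',
]
records = [json.loads(line) for line in log_lines]

NUM_TRAIN_EPOCHS = 2   # matches --num_train_epochs=2
BASE_LR = 5e-5         # assumed starting learning rate


def linear_lr(step, total_steps, base_lr=BASE_LR):
    """Learning rate under linear decay to zero without warmup (assumed schedule)."""
    return base_lr * (1 - step / total_steps)


first = records[0]
steps_per_epoch = round(first["step"] / first["epoch"])
total_steps = steps_per_epoch * NUM_TRAIN_EPOCHS

print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)
for record in records:
    predicted = linear_lr(record["step"], total_steps)
    print(record["step"], record["learning_rate"], predicted)
```

With a batch size of 1, the number of steps per epoch is also the number of training examples, which explains why the log is long enough to overflow the notebook output.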

## Results

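To use it, you can run something like the following. This generation code targets a GPT-2 checkpoint and is left commented out, because it does not apply to the BERT masked language model trained above.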

In [None]:
# from transformers import GPT2Tokenizer, GPT2LMHeadModel
# import torch
# import numpy as np

# OUTPUT_DIR = "./output"
# device = 'cpu'
# if torch.cuda.is_available():
#     device = 'cuda'

# tokenizer = GPT2Tokenizer.from_pretrained(OUTPUT_DIR)
# model = GPT2LMHeadModel.from_pretrained(OUTPUT_DIR)
# model = model.to(device)
                                        
# def generate(input_str, length=250, n=5):
#   cur_ids = torch.tensor(tokenizer.encode(input_str)).unsqueeze(0).long().to(device)
#   model.eval()
#   with torch.no_grad():
#     for i in range(length):
#       outputs = model(cur_ids[:, -1024:], labels=cur_ids[:, -1024:])
#       loss, logits = outputs[:2]
#       softmax_logits = torch.softmax(logits[0,-1], dim=0)
#       next_token_id = choose_from_top(softmax_logits.to('cpu').numpy(), n=n)
#       cur_ids = torch.cat([cur_ids, torch.ones((1,1)).long().to(device) * next_token_id], dim=1)
#     output_list = list(cur_ids.squeeze().to('cpu').numpy())
#     output_text = tokenizer.decode(output_list)
#     return output_text

# def choose_from_top(probs, n=5):
#     ind = np.argpartition(probs, -n)[-n:]
#     top_prob = probs[ind]
#     top_prob = top_prob / np.sum(top_prob) # Normalize
#     choice = np.random.choice(n, 1, p = top_prob)
#     token_id = ind[choice][0]
#     return int(token_id)

# generated_text = generate(" = Toronto Raptors = \n")
# print(generated_text)

## Compressing/Zipping Model

To preserve the fine-tuned model, we should compress it and save it somewhere. This can be done with:

In [None]:
! tar -czf bert-base-multilingual-cased-Spanish.tar.gz output/

which creates a file called `bert-base-multilingual-cased-Spanish.tar.gz`.
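
The same archive can be produced from Python with the standard `tarfile` module, which also makes it easy to list the contents afterwards. This is a small sketch under the assumption that the model was written to `output/` in the current directory; `compress_dir` and `list_archive` are hypothetical helper names.

```python
import tarfile
from pathlib import Path


def compress_dir(src_dir, archive_path):
    """Write src_dir into a gzip-compressed tar archive and return the archive path."""
    src = Path(src_dir)
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # arcname keeps only the directory name, not the full path leading to it.
        tar.add(str(src), arcname=src.name)
    return Path(archive_path)


def list_archive(archive_path):
    """Return the names of all members stored in the archive."""
    with tarfile.open(str(archive_path), "r:gz") as tar:
        return tar.getnames()


# Only runs when the training output is actually present.
if Path("output").is_dir():
    archive = compress_dir("output", "bert-base-multilingual-cased-Spanish.tar.gz")
    print(list_archive(archive))
```

Listing the members is a quick way to confirm that the model weights and tokenizer files made it into the archive before the Colab session ends.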

## Saving it to Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Now you can copy your output model to your Google Drive by running

In [None]:
!cp bert-base-multilingual-cased-Spanish.tar.gz /content/drive/My\ Drive/
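
Because a Colab session can end at any time, it can be worth confirming that the copy actually landed on Drive. The sketch below does the same copy from Python and compares file sizes afterwards. `copy_and_verify` is a hypothetical helper, and the Drive path is the one used in the cell above.

```python
import shutil
from pathlib import Path


def copy_and_verify(src, dest_dir):
    """Copy src into dest_dir and check that the copy has the same size."""
    src = Path(src)
    dest_dir = Path(dest_dir)
    if not dest_dir.is_dir():
        raise FileNotFoundError(f"destination directory not found: {dest_dir}")
    # shutil.copy2 returns the path of the newly created file.
    dest = Path(shutil.copy2(str(src), str(dest_dir)))
    if dest.stat().st_size != src.stat().st_size:
        raise IOError(f"size mismatch after copying {src} to {dest}")
    return dest


archive = Path("bert-base-multilingual-cased-Spanish.tar.gz")
drive_dir = Path("/content/drive/My Drive")
# Only runs when the archive exists and Drive is mounted.
if archive.is_file() and drive_dir.is_dir():
    print("copied to", copy_and_verify(archive, drive_dir))
```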