<a href="https://colab.research.google.com/github/rar8393/NLG_tests/blob/main/Conditional_Text_Generation_with_GPT_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Related article: https://www.ivanlai.project-ds.net/post/conditional-text-generation-by-fine-tuning-gpt-2

The preprocessing code is in [this GitHub repository](https://github.com/ivanlai/Conditional_Text_Generation).

### Install and import libraries

In [None]:
%%time
%%capture
!pip install transformers

CPU times: user 21.3 ms, sys: 7.37 ms, total: 28.7 ms
Wall time: 6.25 s


Check how much GPU memory is available (Colab may allocate a GPU with either 12GB or 16GB).

The configuration below works on 16GB. If only 12GB is available, reduce the batch size.
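
As a rough sketch of that adjustment, the helper below picks a per-device batch size from the GPU memory and scales the gradient-accumulation steps so that the effective batch size stays at 64, the value both branches of the configuration cell further down arrive at (4 × 16 and 2 × 32). The name `choose_batch_config` and the 14 GiB threshold are assumptions made for illustration; they are not part of the original notebook.

```python
def choose_batch_config(total_mem_gib, effective_batch=64):
    """Return (per_device_batch_size, gradient_accumulation_steps).

    Hypothetical helper: the 14 GiB threshold is an assumption, chosen so that
    a "16GB" card (a T4 reports about 14.7 GiB) gets the larger batch.
    """
    per_device = 4 if total_mem_gib >= 14 else 2
    return per_device, effective_batch // per_device


# On a GPU runtime the total memory could be read with, for example:
#   torch.cuda.get_device_properties(0).total_memory / 1024**3
print(choose_batch_config(15079 / 1024))  # a T4 reporting 15079MiB
print(choose_batch_config(12.0))          # a 12GB card
```

Halving the batch while doubling the accumulation steps keeps the number of samples per optimizer update unchanged, at the cost of more forward and backward passes per update.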




In [None]:
!nvidia-smi

Mon Jan 25 18:23:45 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   67C    P8    12W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
import os
import io
import requests
import numpy as np
import pandas as pd
import re
import zipfile
import random
import time
import csv
import datetime
from itertools import compress
from collections import Counter, defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from transformers import AutoTokenizer, AutoConfig, AutoModelForPreTraining, \
                         TrainingArguments, Trainer

import torch
from torch.utils.data import Dataset, random_split, DataLoader, \
                             RandomSampler, SequentialSampler

from IPython.display import clear_output

print(f"PyTorch version: {torch.__version__}")

PyTorch version: 1.7.0+cu101


### Configurations

In [None]:
DEBUG           = False

INPUT_DIR       = 'articles'

USE_APEX        = True
APEX_OPT_LEVEL  = 'O1'

MODEL           = 'gpt2' #{gpt2, gpt2-medium, gpt2-large, gpt2-xl}

UNFREEZE_LAST_N = 6 #The last N layers to unfreeze for training

SPECIAL_TOKENS  = { "bos_token": "<|BOS|>",
                    "eos_token": "<|EOS|>",
                    "unk_token": "<|UNK|>",                    
                    "pad_token": "<|PAD|>",
                    "sep_token": "<|SEP|>"}
                    
MAXLEN          = 768  #{768, 1024, 1280, 1600}

TRAIN_SIZE      = 0.8

if USE_APEX:
    TRAIN_BATCHSIZE = 4
    BATCH_UPDATE    = 16
else:
    TRAIN_BATCHSIZE = 2
    BATCH_UPDATE    = 32

EPOCHS          = 4
LR              = 5e-4
EPS             = 1e-8
WARMUP_STEPS    = 100  #Number of optimizer steps, kept as an integer

SEED            = 2020

In [None]:
def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  #Autotuning can pick different kernels per run, which defeats determinism

seed_everything(SEED)

### Download News Aggregator Dataset

https://archive.ics.uci.edu/ml/datasets/News+Aggregator

In [None]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00359/NewsAggregatorDataset.zip'
r = requests.get(url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()

In [None]:
columns = ['ID',
           'TITLE',
           'URL',
           'PUBLISHER',
           'CATEGORY', #News category (b = business, t = science and technology, e = entertainment, m = health)
           'Alphanumeric ID',
           'HOSTNAME Url',
           'TIMESTAMP']

df = pd.read_csv("newsCorpora.csv", sep='\t', header=None, names=columns)
print(f"df size: {len(df) :,}")

df.head(2)

df size: 422,419


Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,Alphanumeric ID,HOSTNAME Url,TIMESTAMP
0,1,"Fed official says weak data caused by weather,...",http://www.latimes.com/business/money/la-fi-mo...,Los Angeles Times,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.latimes.com,1394470370698
1,2,Fed's Charles Plosser sees high bar for change...,http://www.livemint.com/Politics/H2EvwJSK2VE6O...,Livemint,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.livemint.com,1394470371207


### Download articles

In [None]:
#Download news articles
!gdown --id 1FqqDZOVd8_LfOEA_2DEY5W70-DpQVF-F
!unzip -q '/content/news_articles.zip'

Downloading...
From: https://drive.google.com/uc?id=1FqqDZOVd8_LfOEA_2DEY5W70-DpQVF-F
To: /content/news_articles.zip
49.2MB [00:00, 135MB/s] 


In [None]:
%%time

data = dict()            
for root, dirs, files in os.walk(INPUT_DIR, topdown=True):
    t0 = time.time()

    for i, f in enumerate(files):
        #The file name without its 4-character extension is the article ID
        id = int(f[:-4])
        title = df.loc[df.ID == id, 'TITLE'].values[0]

        #Join with root so files in sub-directories are also found
        with open(os.path.join(root, f), "r") as infile:
            text = infile.read()

        #Keywords are appended to each entry in a later cell
        data[id] = [title, text]

        if i%1000==0 and i>0:
            clear_output(wait=True)
            print(f"({os.getpid()}) Items processed: {i :,}/{len(files):,}; {(time.time()-t0)/60 :.1f} minutes")

            if DEBUG:
                break

print(f"Number of articles: {len(data) :,}")

(65) Items processed: 39,000/39,024; 7.7 minutes
Number of articles: 39,024
CPU times: user 7min 34s, sys: 6.07 s, total: 7min 40s
Wall time: 7min 39s


### Download Keywords
The keywords for these articles were extracted offline.

In [None]:
#Download Keywords
!gdown --id 1C1WWvnt2egzhRmVXMSJhARz5GOLDGzMC
!unzip -q '/content/keywords.csv.zip'

Downloading...
From: https://drive.google.com/uc?id=1C1WWvnt2egzhRmVXMSJhARz5GOLDGzMC
To: /content/keywords.csv.zip
  0% 0.00/746k [00:00<?, ?B/s]100% 746k/746k [00:00<00:00, 101MB/s]


In [None]:
def read_keywords():
    keywords = dict()
    with open('keywords.csv', newline='') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:        
            keywords[int(row[0])] = row[1:]          
    print(f"Number of entries in keywords: {len(keywords) :,}")  
    return keywords
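
`read_keywords` treats the first column of each row as the article ID and the remaining columns as that article's keywords. A minimal sketch of that row layout, using made-up rows in an in-memory file instead of `keywords.csv` (the IDs and keywords here are illustrative only):

```python
import csv
import io

# Made-up rows for illustration; the real file holds one row per article.
sample = io.StringIO("101,fed,weather,data\n102,instagram,train\n")

keywords = {}
for row in csv.reader(sample, delimiter=','):
    keywords[int(row[0])] = row[1:]  # article ID -> list of keyword strings

print(keywords)
# {101: ['fed', 'weather', 'data'], 102: ['instagram', 'train']}
```

Rows may have different lengths, since each article can have a different number of keywords.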

In [None]:
%%time
keywords = read_keywords()

all_keywords = set()
for k, v in keywords.items():
    for w in v:
        all_keywords.add(w)
        
for id in data.keys():
    data[id].append(keywords[id])

print(f"Number of unique keywords: {len(all_keywords) :,}")  

Number of entries in keywords: 39,024
Number of unique keywords: 29,291
CPU times: user 108 ms, sys: 4.93 ms, total: 113 ms
Wall time: 109 ms


### Datasets and loaders

In [None]:
class myDataset(Dataset):

    def __init__(self, data, tokenizer, randomize=True):

        title, text, keywords = [], [], []
        for k, v in data.items():
            title.append(v[0])
            text.append(v[1])
            keywords.append(v[2])

        self.randomize = randomize
        self.tokenizer = tokenizer 
        self.title     = title
        self.text      = text
        self.keywords  = keywords  

    #---------------------------------------------#

    @staticmethod
    def join_keywords(keywords, randomize=True):
        N = len(keywords)

        #random sampling and shuffle
        if randomize: 
            M = random.choice(range(N+1))
            keywords = keywords[:M]
            random.shuffle(keywords)

        return ','.join(keywords)

    #---------------------------------------------#

    def __len__(self):
        return len(self.text)

    #---------------------------------------------#
    
    def __getitem__(self, i):
        keywords = self.keywords[i].copy()
        kw = self.join_keywords(keywords, self.randomize)
        
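        #Layout: <BOS> title <SEP> keywords <SEP> article text <EOS>
        input_text = SPECIAL_TOKENS['bos_token'] + self.title[i] + \
                     SPECIAL_TOKENS['sep_token'] + kw + SPECIAL_TOKENS['sep_token'] + \
                     self.text[i] + SPECIAL_TOKENS['eos_token']

        #Use the tokenizer stored on the dataset rather than the global one
        encodings_dict = self.tokenizer(input_text,
                                        truncation=True,
                                        max_length=MAXLEN,
                                        padding="max_length")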
        
        input_ids = encodings_dict['input_ids']
        attention_mask = encodings_dict['attention_mask']
        
        return {'label': torch.tensor(input_ids),
                'input_ids': torch.tensor(input_ids), 
                'attention_mask': torch.tensor(attention_mask)}
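
Before tokenization, each training example is a single string: the title, the comma-joined keywords and the article text, delimited by the special tokens. The sketch below reproduces that string without PyTorch or a tokenizer. `build_input` is a hypothetical helper that mirrors the string construction in `__getitem__`; `join_keywords` repeats the logic of `myDataset.join_keywords`.

```python
import random

SPECIAL_TOKENS = {"bos_token": "<|BOS|>", "eos_token": "<|EOS|>",
                  "sep_token": "<|SEP|>"}


def join_keywords(keywords, randomize=True):
    # Keep a random-length prefix of the keywords, then shuffle it, so the
    # model sees varying keyword subsets and orders during training.
    if randomize:
        m = random.choice(range(len(keywords) + 1))
        keywords = keywords[:m]
        random.shuffle(keywords)
    return ','.join(keywords)


def build_input(title, keywords, text, randomize=False):
    # Hypothetical helper: the string that __getitem__ passes to the tokenizer
    kw = join_keywords(list(keywords), randomize)
    return (SPECIAL_TOKENS['bos_token'] + title +
            SPECIAL_TOKENS['sep_token'] + kw + SPECIAL_TOKENS['sep_token'] +
            text + SPECIAL_TOKENS['eos_token'])


example = build_input("A title", ["train", "lads"], "Body text.")
print(example)
# <|BOS|>A title<|SEP|>train,lads<|SEP|>Body text.<|EOS|>
```

With `randomize=True` the keyword field can be empty, in which case the two separator tokens sit next to each other.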

In [None]:
def split_data(data, S=TRAIN_SIZE):
    # Shuffle ids
    ids = list(data.keys())
    random.shuffle(ids)

    # Split into training and validation sets    
    train_size = int(S * len(data))

    train_ids = ids[:train_size]
    val_ids = ids[train_size:]

    train_data = dict()
    for id in train_ids:
        train_data[id] = data[id]

    val_data = dict()
    for id in val_ids:
        val_data[id] = data[id]

    return train_data, val_data

### Loading Tokenizer, Config and Model

In [None]:
def get_tokenier(special_tokens=None):
    tokenizer = AutoTokenizer.from_pretrained(MODEL) #GPT2Tokenizer

    if special_tokens:
        tokenizer.add_special_tokens(special_tokens)
        print("Special tokens added")
    return tokenizer

def get_model(tokenizer, special_tokens=None, load_model_path=None):

    #GPT2LMHeadModel
    if special_tokens:
        config = AutoConfig.from_pretrained(MODEL, 
                                            bos_token_id=tokenizer.bos_token_id,
                                            eos_token_id=tokenizer.eos_token_id,
                                            sep_token_id=tokenizer.sep_token_id,
                                            pad_token_id=tokenizer.pad_token_id,
                                            output_hidden_states=False)
    else: 
        config = AutoConfig.from_pretrained(MODEL,                                     
                                            pad_token_id=tokenizer.eos_token_id,
                                            output_hidden_states=False)    

    #----------------------------------------------------------------#
    model = AutoModelForPreTraining.from_pretrained(MODEL, config=config)

    if special_tokens:
        #Special tokens added, model needs to be resized accordingly
        model.resize_token_embeddings(len(tokenizer))

    if load_model_path:
        model.load_state_dict(torch.load(load_model_path))

    model.cuda()
    return model

In [None]:
%%time

tokenizer = get_tokenier(special_tokens=SPECIAL_TOKENS)
model = get_model(tokenizer, 
                  special_tokens=SPECIAL_TOKENS,
                #   load_model_path='pytorch_model.bin'
                 )

Special tokens added



CPU times: user 14.7 s, sys: 3.45 s, total: 18.2 s
Wall time: 36.8 s


In [None]:
# - Freeze selective layers:
# - Freeze all layers except last n:
for parameter in model.parameters():
    parameter.requires_grad = False

n_blocks = len(model.transformer.h)  #12 for 'gpt2'; larger checkpoints have more
for i, m in enumerate(model.transformer.h):        
    #Only un-freeze the last n transformer blocks
    if i >= n_blocks - UNFREEZE_LAST_N:
        for parameter in m.parameters():
            parameter.requires_grad = True 

for parameter in model.transformer.ln_f.parameters():        
    parameter.requires_grad = True

for parameter in model.lm_head.parameters():        
    parameter.requires_grad = True
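
To make the selection rule concrete, the sketch below lists which block indices stay trainable. `unfrozen_block_indices` is a hypothetical helper written for illustration; with the 12 blocks of the base `gpt2` checkpoint and `UNFREEZE_LAST_N = 6`, it returns the second half of the stack.

```python
def unfrozen_block_indices(n_blocks, unfreeze_last_n):
    # Hypothetical helper: indices of the transformer blocks left trainable
    # when only the last `unfreeze_last_n` of `n_blocks` are un-frozen.
    return [i for i in range(n_blocks) if i >= n_blocks - unfreeze_last_n]


print(unfrozen_block_indices(12, 6))
# [6, 7, 8, 9, 10, 11]
```

On the real model, the effect can be checked by comparing `sum(p.numel() for p in model.parameters() if p.requires_grad)` with the total parameter count. GPT-2 normally shares the output head's weight with the input token embedding, so un-freezing `lm_head` most likely makes the embedding trainable as well.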

In [None]:
train_data, val_data = split_data(data)

train_dataset = myDataset(train_data, tokenizer)
val_dataset = myDataset(val_data, tokenizer, randomize=False)

f'There are {len(train_dataset) :,} samples for training, and {len(val_dataset) :,} samples for validation testing'

'There are 31,219 samples for training, and 7,805 samples for validation testing'

### Fine-tune GPT2 using Trainer

In [None]:
%%time

training_args = TrainingArguments(
    output_dir="/content/",
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=TRAIN_BATCHSIZE,
    per_device_eval_batch_size=TRAIN_BATCHSIZE,
    gradient_accumulation_steps=BATCH_UPDATE,
    evaluation_strategy="epoch",
    fp16=True,
    fp16_opt_level=APEX_OPT_LEVEL,
    warmup_steps=WARMUP_STEPS,    
    learning_rate=LR,
    adam_epsilon=EPS,
    weight_decay=0.01,        
    save_total_limit=1,
    load_best_model_at_end=True,     
)

#---------------------------------------------------#
trainer = Trainer(
    model=model,
    args=training_args,    
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer
)

#---------------------------------------------------#
trainer.train()
trainer.save_model()    



Epoch,Training Loss,Validation Loss,Runtime,Samples Per Second
0,No log,1.53961,419.4604,18.607
1,2.991300,1.502831,419.394,18.61
2,1.525100,1.479752,419.1184,18.622
3,1.455600,1.469079,420.1237,18.578


CPU times: user 3h 10min 55s, sys: 2h 23min 22s, total: 5h 34min 17s
Wall time: 5h 34min 42s
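
For orientation, the arithmetic behind this run can be sketched as follows. The sample count, batch settings and final validation loss come from the cells above; the assumption that a partial accumulation group at the end of an epoch is dropped reflects my understanding of `Trainer` and is worth checking against the installed version.

```python
import math

train_samples = 31_219   # training split size reported above
batch_size    = 4        # TRAIN_BATCHSIZE with USE_APEX = True
accumulation  = 16       # BATCH_UPDATE with USE_APEX = True
epochs        = 4
warmup_steps  = 100

batches_per_epoch = math.ceil(train_samples / batch_size)
# Assumption: a partial accumulation group at the end of an epoch is dropped
steps_per_epoch = batches_per_epoch // accumulation
total_steps = steps_per_epoch * epochs

print(batches_per_epoch, steps_per_epoch, total_steps)  # 7805 487 1948
print(f"warmup share: {warmup_steps / total_steps:.1%}")

final_val_loss = 1.469079  # last row of the results table
perplexity = math.exp(final_val_loss)
print(f"perplexity: {perplexity:.2f}")  # roughly 4.35
```

Because the labels produced by `myDataset` include the padded positions, this perplexity is averaged over padding as well as real tokens and is not directly comparable with figures computed on real tokens only.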


In [None]:
# Save to G-Drive ----------------------------------#
# !cp -r 'pytorch_model.bin' '/content/drive/MyDrive/Colab Notebooks/Text Generation/pytorch_model_V2.bin'

### Generating text with Fine-tuned GPT-2 model

In [None]:
# !cp -r '/content/drive/MyDrive/Colab Notebooks/Text Generation/pytorch_model_V2.bin' 'pytorch_model.bin' 

In [None]:
tokenizer = get_tokenier(special_tokens=SPECIAL_TOKENS)
model = get_model(tokenizer, 
                  special_tokens=SPECIAL_TOKENS,
                  load_model_path='pytorch_model.bin')

Special tokens added


In [None]:
title = "We got a lot of grief when our photo became a meme"
keywords = ['train', 'lads', 'drinking', 'picture', 'funny', 'instagram']
kw = myDataset.join_keywords(keywords, randomize=False)

prompt = SPECIAL_TOKENS['bos_token'] + title + \
         SPECIAL_TOKENS['sep_token'] + kw + SPECIAL_TOKENS['sep_token']
         
generated = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0)
device = torch.device("cuda")
generated = generated.to(device)

model.eval();

In [None]:
# Top-k and top-p (nucleus) sampling (10 samples):
sample_outputs = model.generate(generated, 
                                do_sample=True,   
                                min_length=50, 
                                max_length=MAXLEN,
                                top_k=30,                                 
                                top_p=0.7,        
                                temperature=0.9,
                                repetition_penalty=2.0,
                                num_return_sequences=10
                                )

for i, sample_output in enumerate(sample_outputs):
    text = tokenizer.decode(sample_output, skip_special_tokens=True)
    a = len(title) + len(','.join(keywords))    
    print("{}: {}\n\n".format(i+1,  text[a:]))

1: It's been an amazing ride so far. It’s just such incredible people that made it to the top and now we have something bigger than anyone could ever dream for!
Now everyone has had their share in this fun moment – from famous rockers like The Rolling Stones on down below… or celebrities who donned costumes (or simply posed as they did at concerts). Here are some highlights:

The train was full-on hilarious after all — but not by accident - here is how one Twitter user posted her own reaction pic...


2: It was fun. It wasn't really funny but it didn’t hurt us at all and we did appreciate the fact that people were laughing out loud about how much they enjoyed what I put on Instagram as well!
The pictures are from last week in which Lindsay Lohan showed off her new drink while driving to work with an Italian businessman named Giannelli Vittorio (who has been linked up by The Sun). There is also this picture taken during one trip back home for my birthday where she had dinner alongside o
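
The sampling call above combines `top_k=30` with `top_p=0.7`. The sketch below shows the idea on a toy distribution in plain Python: keep the `top_k` most likely tokens, then keep the smallest prefix of those whose cumulative probability reaches `top_p`, and renormalize. It is a simplified illustration with made-up probabilities, not the library's implementation, and `top_k_top_p_filter` is a hypothetical name.

```python
def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix of those
    whose renormalized cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / total
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {token: p / norm for token, p in kept}


# Made-up next-token probabilities for illustration
toy = {"train": 0.5, "photo": 0.3, "meme": 0.15, "lads": 0.05}
filtered = top_k_top_p_filter(toy, top_k=3, top_p=0.7)
print(filtered)  # only "train" and "photo" survive, renormalized to sum to 1
```

A lower `top_p` narrows the candidate set and makes the output more conservative; a higher value admits less likely tokens.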

In [None]:
# Beam search with sampling (num_beams=5 combined with do_sample=True):
sample_outputs = model.generate(generated, 
                                do_sample=True,   
                                max_length=MAXLEN,                                                      
                                num_beams=5,
                                repetition_penalty=5.0,
                                early_stopping=True,      
                                num_return_sequences=1
                                )

for i, sample_output in enumerate(sample_outputs):
    text = tokenizer.decode(sample_output, skip_special_tokens=True)
    a = len(title) + len(','.join(keywords))    
    print("{}: {}\n\n".format(i+1,  text[a:]))

1: The world’s most popular dating app crashed on its iOS and Android operating systems this week.

It has since been taken down from Apple’s App Store after users complained that the picture was too close to their faces. The company said in a statement: “This is not an isolated incident. We take great pride in what we do and look forward to working with others to make the sharing experience even better.”

Scroll down for video



Our hearts go out to all those who have lost their lives trying to find love over the past few days - but sadly it looks like there may be more to this tragic story than meets the eye

No-one knows exactly how many people were affected by the crash, which took place between late February and early March last year (stock image)

HOW IT WORKS A simple message will appear on your iPhone or iPad screen telling you if you are having trouble finding someone online. It then lets you meet up with them so they can pick you up where you left off. In other words, as soo
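
Both generation calls pass a `repetition_penalty` (2.0 for sampling, 5.0 for beam search). As far as I understand the rule, tokens that have already been generated get their scores pushed down: positive scores are divided by the penalty and negative scores are multiplied by it. The sketch below applies that rule to made-up logits in plain Python; `apply_repetition_penalty` is a hypothetical name and this is not the library's code.

```python
def apply_repetition_penalty(logits, generated_tokens, penalty):
    # Tokens that already appeared are made less likely: positive scores are
    # divided by the penalty, negative scores are multiplied by it.
    adjusted = dict(logits)
    for token in set(generated_tokens):
        if token in adjusted:
            score = adjusted[token]
            adjusted[token] = score * penalty if score < 0 else score / penalty
    return adjusted


# Made-up logits for illustration
toy_logits = {"photo": 2.0, "meme": -1.0, "train": 0.5}
penalized = apply_repetition_penalty(toy_logits, ["photo", "meme"], penalty=2.0)
print(penalized)
# {'photo': 1.0, 'meme': -2.0, 'train': 0.5}
```

A penalty of 1.0 leaves the scores unchanged, while a large value such as 5.0 strongly discourages repeats.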

### Generating text with raw GPT2

In [None]:
tokenizer = get_tokenier()
model = get_model(tokenizer)

In [None]:
prompt = title

generated = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0)
device = torch.device("cuda")
generated = generated.to(device)

model.eval()
sample_outputs = model.generate(generated, 
                                do_sample=True,   
                                max_length=MAXLEN,                                                      
                                num_beams=5,
                                repetition_penalty=5.0,
                                early_stopping=True,      
                                num_return_sequences=1
                                )

for i, sample_output in enumerate(sample_outputs):
    print("{}: {}\n\n".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

0: We got a lot of grief when our photo became a meme.

"I didn't know what to do with it," she said. "It was just so much fun."


