<a href="https://colab.research.google.com/github/sljm12/machine_learning_notebooks/blob/master/GPT2_for_Data_Augementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook investigates using GPT-2 for data augmentation.

https://huggingface.co/blog/how-to-generate

https://arxiv.org/pdf/2003.02245v1.pdf

https://github.com/falloutdurham/beginners-pytorch-deep-learning/blob/master/chapter9/Chapter9.5.ipynb

https://docs.fast.ai/tutorial.transformers

In [1]:
!pip -qq install transformers

In [2]:
import torch
import os
import pandas as pd
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW  # the AdamW shipped with transformers is deprecated in favor of this one
from transformers import GPT2Tokenizer, GPT2LMHeadModel, get_linear_schedule_with_warmup
from tqdm import tqdm, trange
import torch.nn.functional as F

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained('gpt2',num_beams=4)

input_ids = tokenizer.encode("The night is still young, as the world sleeps ",return_tensors='pt')


# Beam search without do_sample=True ignores top_k/top_p, so they are omitted here.
generated = model.generate(
    input_ids,
    max_length=100,
    num_beams=5,
    early_stopping=True,
    no_repeat_ngram_size=2,
    num_return_sequences=5,
)

for i in generated:
  sequence = tokenizer.decode(i, skip_special_tokens=True)
  print(sequence)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The night is still young, as the world sleeps  and there is no sign of the sun rising. The sun is rising, and the moon is falling. There is nothing to be seen, but the sky is full of stars.
The sun has risen. It is time for us to go back to sleep. We are going to wake up in the morning and we will be able to look at the stars and see what is going on around us. This is the time we need to
The night is still young, as the world sleeps  and there is no sign of the sun rising. The sun is rising, and the moon is falling. There is nothing to be seen, but the sky is full of stars.
The sun has risen. It is time for us to go back to sleep. We are going to wake up in the morning and we will be able to look at the stars and see what is going on around us. This is the time when we should
The night is still young, as the world sleeps  and there is no sign of the sun rising. The sun is rising, and the moon is falling. There is nothing to be seen, but the sky is full of stars.
The sun has risen. It 

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained('gpt2',num_beams=4)

generated = tokenizer.encode("Stephen Lee has two kids Chloe and Caleb.")
context = torch.tensor([generated])
past = None

for i in range(100):
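    # Reuse the cached keys/values so only the newest token is fed in at each step.
    outputs = model(context, past_key_values=past)
    logits, past = outputs[0], outputs[1]
    token = torch.argmax(logits[..., -1, :])

    generated += [token.tolist()]
    context = token.view(1, 1)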

sequence = tokenizer.decode(generated)

print(sequence)

# Finetuning

## Dataset

In [3]:
!wget -qq https://www.dropbox.com/s/duoi46s4db28xac/news-category-dataset.zip
!unzip news-category-dataset.zip

Archive:  news-category-dataset.zip
replace News_Category_Dataset_v2.json? [y]es, [n]o, [A]ll, [N]one, [r]ename: n


In [4]:
import json

# The dataset is in JSON Lines format: one JSON object per line.
data = []
with open("/content/News_Category_Dataset_v2.json") as f:
    for line in f:
        data.append(json.loads(line))

In [5]:
df=pd.DataFrame(data=data)

In [6]:
df.head()

| | category | headline | authors | link | short_description | date |
|---|---|---|---|---|---|---|
| 0 | CRIME | There Were 2 Mass Shootings In Texas Last Week... | Melissa Jeltsen | https://www.huffingtonpost.com/entry/texas-ama... | She left her husband. He killed their children... | 2018-05-26 |
| 1 | ENTERTAINMENT | Will Smith Joins Diplo And Nicky Jam For The 2... | Andy McDonald | https://www.huffingtonpost.com/entry/will-smit... | Of course it has a song. | 2018-05-26 |
| 2 | ENTERTAINMENT | Hugh Grant Marries For The First Time At Age 57 | Ron Dicker | https://www.huffingtonpost.com/entry/hugh-gran... | The actor and his longtime girlfriend Anna Ebe... | 2018-05-26 |
| 3 | ENTERTAINMENT | Jim Carrey Blasts 'Castrato' Adam Schiff And D... | Ron Dicker | https://www.huffingtonpost.com/entry/jim-carre... | The actor gives Dems an ass-kicking for not fi... | 2018-05-26 |
| 4 | ENTERTAINMENT | Julianna Margulies Uses Donald Trump Poop Bags... | Ron Dicker | https://www.huffingtonpost.com/entry/julianna-... | The "Dietland" actress said using the bags is ... | 2018-05-26 |


In [45]:
df["category"].unique()

array(['CRIME', 'ENTERTAINMENT', 'WORLD NEWS', 'IMPACT', 'POLITICS',
       'WEIRD NEWS', 'BLACK VOICES', 'WOMEN', 'COMEDY', 'QUEER VOICES',
       'SPORTS', 'BUSINESS', 'TRAVEL', 'MEDIA', 'TECH', 'RELIGION',
       'SCIENCE', 'LATINO VOICES', 'EDUCATION', 'COLLEGE', 'PARENTS',
       'ARTS & CULTURE', 'STYLE', 'GREEN', 'TASTE', 'HEALTHY LIVING',
       'THE WORLDPOST', 'GOOD NEWS', 'WORLDPOST', 'FIFTY', 'ARTS',
       'WELLNESS', 'PARENTING', 'HOME & LIVING', 'STYLE & BEAUTY',
       'DIVORCE', 'WEDDINGS', 'FOOD & DRINK', 'MONEY', 'ENVIRONMENT',
       'CULTURE & ARTS'], dtype=object)

In [7]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained('gpt2',num_beams=4)

In [30]:
class NewsDataSet(Dataset):
    
    def __init__(self, dataframe, control_code, truncate=False, gpt2_type="gpt2", max_length=768):

        self.tokenizer = GPT2Tokenizer.from_pretrained(gpt2_type)
        self.tweets = []

        # Encode each row as <|control_code|>CATEGORY<|sep|>HEADLINE<|endoftext|>.
        # itertuples() yields the index first, so i[1] is the category and i[2] the headline.
        
        for i in dataframe.itertuples():
              self.tweets.append(torch.tensor(
                  self.tokenizer.encode(f"<|{control_code}|>{i[1]}<|sep|>{i[2]}<|endoftext|>")
              ))
                
        if truncate:
            self.tweets = self.tweets[:20000]
        self.tweet_count = len(self.tweets)
        
    def __len__(self):
        return self.tweet_count

    def __getitem__(self, item):
        return self.tweets[item]

In [31]:
dataset = NewsDataSet(df, 'newsheader')

In [54]:
dataset[1]

tensor([   27,    91, 10827, 25677,    91,    29,  8743,  4176,  5302,  1040,
         6031,   489,    78,   843,  8047,    88,  9986,  1114,   383,  2864,
         2159,  5454,   338, 15934, 10940, 50256])

In [55]:
len(dataset)

200853
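
Each dataset entry is a single string that joins a control code, the category, and the headline with marker tokens. The sketch below restates that layout in plain Python. `format_example` and `make_prompt` are hypothetical helper names that do not appear in the notebook; they only illustrate that a generation prompt is a training string cut off right after `<|sep|>`.

```python
def format_example(control_code, category, headline):
    """Build one training string in the layout used by NewsDataSet."""
    return f"<|{control_code}|>{category}<|sep|>{headline}<|endoftext|>"


def make_prompt(control_code, category):
    """Build the matching generation prompt: everything up to and including <|sep|>."""
    return f"<|{control_code}|>{category}<|sep|>"


example = format_example(
    "newsheader", "ENTERTAINMENT", "Hugh Grant Marries For The First Time At Age 57"
)
prompt = make_prompt("newsheader", "ENTERTAINMENT")
print(example)
print(prompt)
```

The control code and category in the prompt need to be the same strings that were used when the dataset was built, otherwise the model is conditioned on a prefix it never saw during fine-tuning.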

In [32]:
tokenizer.decode(dataset[1])

"<|newsheader|>ENTERTAINMENT<|sep|>Will Smith Joins Diplo And Nicky Jam For The 2018 World Cup's Official Song<|endoftext|>"

In [33]:
def pack_tensor(new_tensor, packed_tensor, max_seq_len):
    if packed_tensor is None:
        return new_tensor, True, None
    if new_tensor.size()[1] + packed_tensor.size()[1] > max_seq_len:
        return packed_tensor, False, new_tensor
    else:
        packed_tensor = torch.cat([new_tensor, packed_tensor[:, 1:]], dim=1)
        return packed_tensor, True, None
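
The idea behind `pack_tensor` is to fill each training sequence with several short headlines instead of feeding them one at a time. The sketch below shows the same greedy packing on plain Python lists. It is a simplified analogue under stated assumptions, not the notebook's function: `pack_sequences` is a hypothetical name, it appends rather than prepends, it keeps every token, and it processes the whole list at once instead of one entry per call.

```python
def pack_sequences(sequences, max_seq_len):
    """Greedily concatenate token-id lists into packs of at most max_seq_len tokens.

    A sequence that is longer than max_seq_len on its own still gets its own pack.
    """
    packs, current = [], []
    for seq in sequences:
        # Flush the current pack when the next sequence would overflow it.
        if current and len(current) + len(seq) > max_seq_len:
            packs.append(current)
            current = []
        current = current + list(seq)
    if current:
        packs.append(current)
    return packs


packs = pack_sequences([[1, 2], [3, 4, 5], [6], [7, 8, 9, 10]], max_seq_len=5)
print(packs)
```

Packing matters here because the headlines are only a few dozen tokens each, so one forward pass per headline would leave most of the context window unused.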

In [34]:
def train(
    dataset,
    model,
    tokenizer,
    batch_size=16,
    epochs=4,
    lr=2e-5,
    max_seq_len=400,
    warmup_steps=5000,
    gpt2_type="gpt2",
    device="cuda",
    output_dir=".",
    output_prefix="wreckgar",
    test_mode=False,
    save_model_on_epoch=False,
):

    acc_steps = 100

    model = model.to(device)
    model.train()

    optimizer = AdamW(model.parameters(), lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=-1
    )

    train_dataloader = DataLoader(dataset, batch_size=1, shuffle=True)

    accumulating_batch_count = 0
    input_tensor = None

    for epoch in range(epochs):

        print(f"Training epoch {epoch}")
        for idx, entry in tqdm(enumerate(train_dataloader)):
            (input_tensor, carry_on, remainder) = pack_tensor(entry, input_tensor, max_seq_len)

            if carry_on and idx != len(train_dataloader) - 1:
                continue

            input_tensor = input_tensor.to(device)
            outputs = model(input_tensor, labels=input_tensor)
            loss = outputs[0]
            loss.backward()

            if (accumulating_batch_count % batch_size) == 0:
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()
                model.zero_grad()

            accumulating_batch_count += 1
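            # Carry over the entry that did not fit so it starts the next pack
            # instead of being dropped.
            input_tensor = remainder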
        if save_model_on_epoch:
            torch.save(
                model.state_dict(),
                os.path.join(output_dir, f"{output_prefix}-{epoch}.pt"),
            )
    return model

In [14]:
!mkdir "/content/trained_models"

In [35]:
model = train(
    dataset,
    GPT2LMHeadModel.from_pretrained("gpt2"),
    GPT2Tokenizer.from_pretrained("gpt2"),
    batch_size=16,
    epochs=1,
    lr=3e-5,
    max_seq_len=140,
    warmup_steps=5000,
    gpt2_type="gpt2",
    device="cuda",
    output_dir="/content/trained_models",
    output_prefix="gpt_newsheadlines_categories",
    save_model_on_epoch=True
)

0it [00:00, ?it/s]

Training epoch 0


200853it [40:35, 82.46it/s]
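
The `train` function wraps the optimizer in `get_linear_schedule_with_warmup`. The sketch below is a plain-Python restatement of the learning-rate multiplier, assuming the usual shape of that schedule: a linear ramp from 0 to 1 over the warmup steps, then a linear decay to 0 at the final step. `linear_warmup_factor` is a hypothetical name and this is a re-derivation for illustration, not the library's code.

```python
def linear_warmup_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear ramp from 0 up to 1 during warmup.
        return step / max(1, warmup_steps)
    # Linear decay from 1 down to 0 between the end of warmup and total_steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


for step in (0, 2500, 5000, 7500, 10000):
    print(step, linear_warmup_factor(step, warmup_steps=5000, total_steps=10000))

# With total_steps=-1, as passed in train(), the multiplier is 0 once warmup ends.
print(linear_warmup_factor(5000, warmup_steps=5000, total_steps=-1))
```

Under that assumption, a run with `num_training_steps=-1` only has a non-zero learning rate while it is still inside the warmup phase, so it is worth checking how many optimizer steps an epoch actually takes relative to `warmup_steps`.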


In [39]:
def generate(
    model,
    tokenizer,
    prompt,
    entry_count=10,
    entry_length=100,
    top_p=0.8,
    temperature=1.,
):

    model.eval()

    generated_num = 0
    generated_list = []

    filter_value = -float("Inf")

    with torch.no_grad():

        for entry_idx in trange(entry_count):

            entry_finished = False

            generated = torch.tensor(tokenizer.encode(prompt)).to("cuda").unsqueeze(0)

            # Using top-p (nucleus sampling): https://github.com/huggingface/transformers/blob/master/examples/run_generation.py

            for i in range(entry_length):
                outputs = model(generated, labels=generated)
                loss, logits = outputs[:2]
                logits = logits[:, -1, :] / (temperature if temperature > 0 else 1.0)

                sorted_logits, sorted_indices = torch.sort(logits, descending=True)
                cumulative_probs = torch.cumsum(
                    F.softmax(sorted_logits, dim=-1), dim=-1
                )

                sorted_indices_to_remove = cumulative_probs > top_p
                sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[
                    ..., :-1
                ].clone()
                sorted_indices_to_remove[..., 0] = 0

                indices_to_remove = sorted_indices[sorted_indices_to_remove]
                logits[:, indices_to_remove] = filter_value

                next_token = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
                generated = torch.cat((generated, next_token), dim=1)

                if next_token in tokenizer.encode("<|endoftext|>"):
                    entry_finished = True

                if entry_finished:

                    generated_num = generated_num + 1

                    output_list = list(generated.to("cpu").squeeze().numpy())
                    output_text = tokenizer.decode(output_list)

                    generated_list.append(output_text)
                    break
            
            if not entry_finished:
                output_list = list(generated.to("cpu").squeeze().numpy())
                output_text = f"{tokenizer.decode(output_list)}<|endoftext|>" 
                generated_list.append(output_text)
                
    return generated_list
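
The inner loop of `generate` implements top-p (nucleus) sampling: it sorts the logits, keeps the smallest set of tokens whose cumulative probability exceeds `top_p`, and masks out the rest before sampling. The sketch below shows the same filtering rule on a plain list of probabilities. `nucleus_filter` is a hypothetical name, and it works on probabilities rather than logits to keep the example free of dependencies.

```python
def nucleus_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability exceeds top_p.

    Returns a dict mapping the kept token indices to renormalized probabilities.
    The token that crosses the threshold is kept, which mirrors the
    shift-by-one step applied to the removal mask in generate().
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for idx in order:
        kept.append(idx)
        cumulative += probs[idx]
        if cumulative > top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


filtered = nucleus_filter([0.15, 0.5, 0.05, 0.3], top_p=0.7)
print(filtered)
```

A lower `top_p` shrinks the candidate set and makes the headlines more conservative; a value close to 1 keeps almost the full distribution.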

In [26]:
torch.save(model.state_dict(), os.path.join("./", "model.pt"))

In [42]:
generated_tweets = generate(model.to('cuda'), GPT2Tokenizer.from_pretrained("gpt2"),"<|newsheadlines|>POLITICS<|sep|>",entry_count=10)

100%|██████████| 10/10 [00:07<00:00,  1.29it/s]


In [43]:
for i in generated_tweets:
  print(i)

<|newsheadlines|>POLITICS<|sep|>Brexit Supporters Rally Against Legislation To Prevent Lawmakers From Pushing Back On Their LGBT Values<|nyt|]<|endoftext|>
<|newsheadlines|>POLITICS<|sep|>Ex-British officer sentenced to 10 years for feeding a child to a stranger over photos <|newsheadlines|>GERMANY <|sep|>Foreign ministry: ISIS' team beat in Libya attack <|newsheadlines|>UK joins Morocco, Saudi Arabia, China on tech summit <|newsheadlines|>Nuclear accord: Trump should impose tougher sanctions against Iran <|newsheadlines|>POTUS defends Trump's travel ban on Muslims <|newsheadlines<|endoftext|>
<|newsheadlines|>POLITICS<|sep|>Sanders Supporters Deny Delay in Throwing His Support To Clinton This Summer<|endoftext|>
<|newsheadlines|>POLITICS<|sep|>Robber Regulators: Obama Warns Against Suing Americans <http://www.washingtonpost.com/blogs/the-fix/wp/2015/04/23/obama-warns-against-suing-americans/> // WaPo // Aiden Brown – April 23, 2015* "Senator Dianne Feinstein, the chairwoman of the Sen
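
Several of the generated strings carry extra markers or a second headline after the first one, so they need cleaning before they can be used as augmented training rows. The sketch below is one possible heuristic, not part of the notebook: `extract_headline` is a hypothetical helper that keeps the text between the first `<|sep|>` and the next `<` or newline, on the assumption that real headlines rarely contain a `<` character.

```python
import re


def extract_headline(text, sep="<|sep|>", eos="<|endoftext|>"):
    """Return (category, headline) from one generated string, or None if malformed."""
    # Drop everything from the first end-of-text marker onwards.
    text = text.split(eos, 1)[0]
    if sep not in text:
        return None
    prefix, rest = text.split(sep, 1)
    # The category is whatever follows the leading <|control_code|> tag.
    category = prefix.split("|>", 1)[-1].strip()
    # Heuristic: cut at the first stray "<" marker or newline the model emitted.
    headline = re.split(r"<|\n", rest, maxsplit=1)[0].strip()
    if not category or not headline:
        return None
    return category, headline


sample = (
    "<|newsheadlines|>POLITICS<|sep|>Sanders Supporters Deny Delay in Throwing "
    "His Support To Clinton This Summer<|endoftext|>"
)
print(extract_headline(sample))
```

Strings that return `None`, or whose category is not one of the dataset's categories, can simply be discarded.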

In [44]:
generated_tweets = generate(model.to('cuda'), GPT2Tokenizer.from_pretrained("gpt2"),"<|newsheadlines|>CRIME<|sep|>",entry_count=10)
for i in generated_tweets:
  print(i)

100%|██████████| 10/10 [00:06<00:00,  1.47it/s]

<|newsheadlines|>CRIME<|sep|>Murder For Which Defendant Doesn't Ask: 'Please Shoot Him'

>SHOT BY 'JACKET' DRIVER <|sep|>FBI Responds: 'Again' Isn't Justice<|endoftext|>
<|newsheadlines|>CRIME<|sep|>5 Egyptians Fight Face Off On One Nearby Border, Claim Pleading For Cop Death<|endoftext|>
<|newsheadlines|>CRIME<|sep|>Young Man Killed By White Helicopter Flying near Cologne (VIDEO)</|src><|endoftext|>
<|newsheadlines|>CRIME<|sep|>Will This Week Be America's Worst Year Ever? <|dailymail.co.uk|>American Horror Story Sets Record for Year on Worst TV Show To Hit 100,000 Visits <|vegas.com|>The Great American Christmas <|url=https://www.facebook.com/events/18555711927102878/>How To Raise Kids For The Perfect Year <|yiannopoulos.net|>The Angry Video Game Nerd Who Just Killed Princess Leia<|endoftext|>
<|newsheadlines|>CRIME<|sep|>Death penalty stays at bay for defendants sentenced to death in 2012 death penalty case case</|newsheadlines|>MIA<|sep|>More being killed in U.S. prisons than ever b




In [50]:
generated_tweets = generate(model.to('cuda'), GPT2Tokenizer.from_pretrained("gpt2"),"<|newsheadlines|>SCIENCE<|sep|>",entry_count=10)
for i in generated_tweets:
  print(i)

100%|██████████| 10/10 [00:05<00:00,  1.74it/s]

<|newsheadlines|>SCIENCE<|sep|>Scientists Get Transcranial Magnetic Stimulation For Strains Shaped By UV Rays <|sep|>STARTS<|sep|>Scientists Speak Up Against UV Rays<|endoftext|>
<|newsheadlines|>SCIENCE<|sep|>Is Staging the Worst Smog Ever Possible?<|endoftext|>
<|newsheadlines|>SCIENCE<|sep|>US student named among last five to die in space disaster -- yet no flight plans saved

<|sep|>Next step: lab builds large-scale solar arrays for climate monitors<|endoftext|>
<|newsheadlines|>SCIENCE<|sep|>Egyptian Cop Smashes Finger Through Heart and Other Shipments On Inauguration Day<|endoftext|>
<|newsheadlines|>SCIENCE<|sep|>The Answer: The Problem With 'Topical Pregnancy Care'<|endoftext|>
<|newsheadlines|>SCIENCE<|sep|>More than 4,500 Scientists Lose Their Testicles as Exotic Surgery Threatens To Get To Good With More Form Of Health<|endoftext|>
<|newsheadlines|>SCIENCE<|sep|>Researchers discover red volcanic ash in Alaska<|endoftext|>
<|newsheadlines|>SCIENCE<|sep|>Huge Outcrops of Shark




In [46]:
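# "WOLRD NEWS" is misspelled, so this prompt does not match the "WORLD NEWS" category in the dataset.
generated_tweets = generate(model.to('cuda'), GPT2Tokenizer.from_pretrained("gpt2"),"<|newsheadlines|>WOLRD NEWS<|sep|>",entry_count=10)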
for i in generated_tweets:
  print(i)

100%|██████████| 10/10 [00:05<00:00,  1.87it/s]

<|newsheadlines|>WOLRD NEWS<|sep|>Black Knight's King Says Black Lives Matter Harbors Racism<|endoftext|>
<|newsheadlines|>WOLRD NEWS<|sep|>Lawmakers' ideas of equal pay raise $12-million dollars<|endoftext|>
<|newsheadlines|>WOLRD NEWS<|sep|>Wisconsin With Almost 30,000 Chickens Following Veto Legalization of Voodoo -- So Many That They Were "Vegetarians".<|endoftext|>
<|newsheadlines|>WOLRD NEWS<|sep|>Democrats Pay Overnight $19 Million In Payday for Chief Defense Secretary With Labor Party Pay Cut<|endoftext|>
<|newsheadlines|>WOLRD NEWS<|sep|>WOLRD U.S. VOTE AGAINST LEGAL SYSTEM IN DOMESTIC FUNDING, RELATING TO HOW IT VOCATES ON DOMESTIC WORK<|sep|>Clinton Trumped Democratic Nominee For Secretaries Of State<|endoftext|>
<|newsheadlines|>WOLRD NEWS<|sep|>Donald Trump Dumps Democrat In Glass Door To Attack Hillary Clinton<|via Nexis]<|endoftext|>
<|newsheadlines|>WOLRD NEWS<|sep|>US national security officials warn Trump administration of impeachment, deepening tensions<|endoftext|>





In [14]:
!ls -lh /content/trained_models

total 487M
-rw-r--r-- 1 root root 487M Nov 24 06:33 newsheadlines-0.pt


In [15]:
!cp /content/trained_models/*.pt '/content/drive/MyDrive/Machine Learning/'

In [20]:
torch.save(model,"/content/gpt_news.pt")

In [21]:
!ls -lh /content

total 618M
drwx------ 4 root root 4.0K Nov 24 06:52 drive
-rw-r--r-- 1 root root 487M Nov 24 07:18 gpt_news.pt
-rw-r--r-- 1 root root  81M Oct  1  2019 News_Category_Dataset_v2.json
-rw-r--r-- 1 root root  26M Nov 24 05:19 news-category-dataset.zip
-rw-r--r-- 1 root root  26M Nov 24 06:00 news-category-dataset.zip.1
drwxr-xr-x 1 root root 4.0K Nov 13 17:33 sample_data
drwxr-xr-x 2 root root 4.0K Nov 24 06:33 trained_models


In [23]:
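# Reload the whole model object saved with torch.save(model, ...) above.
# Recent PyTorch releases may need weights_only=False here, since a full pickled
# model is not a plain state_dict.
model = torch.load("/content/gpt_news.pt")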