# Using GPT-2 to generate fake tweets for the Real or Not competition

Models benefit from larger training datasets. When working with images, the original dataset is often augmented by creating transformations of each image. For example, a single image of a chinchilla can be augmented in the following ways:

<img src="https://raw.githubusercontent.com/aleju/imgaug-doc/master/readme_images/examples_grid.jpg" width="450px">

(credit: https://github.com/aleju/imgaug)

This notebook is my attempt at doing a similar type of augmentation with **text**. Real tweets (after being cleaned by my other notebook [Tweet Cleaner](https://www.kaggle.com/jdparsons/tweet-cleaner)) are sent to the GPT-2 "small" 117M model as the starter text, and the output is assigned the same real-or-not label as the original text. The GPT-2 code is a modified version of the code found in this notebook: https://www.kaggle.com/bkkaggle/generate-your-own-text-with-openai-s-gpt-2-117m.

Here are some example results - good, bad, and hilarious. The generated text for disasters seems to pick up a "news" style of writing, while non-disaster text stays more casual. The GPT-2 default settings seem to produce lots of image credits and Twitter links, which isn't ideal. **If you have any ideas for improving this notebook, please let me know in the comments!**

| Original Text | Fake Text | Label |
| :------------ | :-------------- | :----- |
| Forest fire near La Ronge Sask. Canada |  The fire was brought under control by a single vehicle on Sunday afternoon.  The fire started at about 11:50 p.m | 1 |
| Police investigating after an e-bike collided with a car in Little Portugal. E-bike rider suffered serious non-life threatening injuries. | PITTSBURGH (AP) — A car collided with an e-bike on the side of a street Saturday night in Little Portugal and injured a woman, police said.  Police investigating after an e-bike collided with a | 1 |
| My car is so fast | I can get to work or just leave town. I'm not going anywhere." | 0 |
| My car is so fast | Just like any other car, I start moving quickly and the sound is amazing. I really | 0 |
| These boxes are ready to explode! Exploding Kittens finally arrived! gameofkittens #explodingkittensÛ_ https://t.co/TFGrAyuDC5 | twitter.com/U4YK5XgXqTThis is a rush transcript. Copy may not be in its final form.  AMY GOODMAN: Let's go back to the recent story in the New York Times that New | 0 |
| major rainstorm happening! I'm gonna lie down and drift away to the storming around me for a little while. bebacksoon. | The best way to get your dog and pet out of a storm is to make sure that you have the shelter in place. In some cases, you can even make your dog and pet stay in a park, in a shelter. | 1 |
| I went to pick up my lunch today and the bartender was holding my change hostage because he wanted my number. ?? | At this point I was telling him I didn't want to pay up, so I called back and told him to go out and get it for me. He said I would pay up, but then I called my parents and told them I didn't | 1 (seems misclassified) |
| @AmirKingKhan you would have been annihilated so you might as well thank @FloydMayweather | Mark Cuban (@MarkCuban) July 11, 2016  What's more, he didn't just win the fight.  When he won the heavyweight title, he was the first fighter in UFC history to knock out a UFC | 0 |
| Aftershock ã¢ (2010) Fullã¢ Streaming - YouTube http://t.co/vVE3UsesGf | I think I'm going to be using it right now. I'd like to watch it on my phone just to see if its worth it.  Reply · Report PostIt is hard | 0 |
| That horrible sinking feeling when you've been at home on your phone for a while and you realise its been on 3G this whole time | We have seen all the time that 3G has been an issue, it is our most critical aspect, we do not want to see this happen to our children and our grandchildren when it comes to 3G. | 1 (seems misclassified) |
| Accident on I-24 W #NashvilleTraffic. Traffic moving 8m slower than usual. https://t.co/0GHk693EgJ | NMA Traffic (@NMATraffic) December 14, 2017  The car was spotted in the parking lot by several people in the area at 7:45 p.m., but all were not injured.  There was no word | 1 |

---

Code linted via http://pep8online.com/ and https://yapf.now.sh/ to follow the Google Python style guide (mostly).

In [None]:
import os
import random
import sys
import time

import numpy as np
import pandas as pd
import torch
from nltk.tokenize import word_tokenize

# From https://github.com/graykode/gpt-2-Pytorch (I couldn't find a pip version).
# I uploaded this gpt-2-Pytorch library as a dataset so it permanently resides
# in the input folder, which allowed the notebook to commit successfully.
os.chdir('/kaggle/input/gpt2pytorch/gpt-2-Pytorch')
sys.path.insert(1, '/kaggle/input/gpt2pytorch/gpt-2-Pytorch/')

from GPT2.model import (GPT2LMHeadModel)
from GPT2.utils import load_weight
from GPT2.config import GPT2Config
from GPT2.sample import sample_sequence
from GPT2.encoder import get_encoder

# set pandas preview to use full width of browser
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
# None removes the column width limit (pandas >= 1.0; older versions used -1)
pd.set_option('display.max_colwidth', None)


### Helper to view file paths of imported data

In [None]:
view_local_files = False

if view_local_files:
    for dirname, _, filenames in os.walk('/kaggle/input'):
        for filename in filenames:
            print(os.path.join(dirname, filename))

# Functions to interact with GPT-2

In [None]:
# https://www.kaggle.com/bkkaggle/generate-your-own-text-with-openai-s-gpt-2-117m

state_dict = torch.load(
    '../../../input/gpt2pytorch-modelbin/gpt2-pytorch_model.bin',
    map_location='cpu' if not torch.cuda.is_available() else None)

seed = random.randint(0, 2147483647)
np.random.seed(seed)
torch.random.manual_seed(seed)
torch.cuda.manual_seed(seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load Model
enc = get_encoder()
config = GPT2Config()
model = GPT2LMHeadModel(config)
model = load_weight(model, state_dict)
model.to(device)
model.eval()

def force_period(text):
    """If input string does not end with common punctuation,
    a period is added to the end. credit:
    https://stackoverflow.com/a/41402588
    A dangling word at the end of a sentence that doesn't
    end with punctuation causes GPT-2 to go off topic.
    """
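    # str.endswith accepts a tuple and, unlike text[-1], is safe on an
    # empty string; '?' is included so questions are left untouched.
    if not text.endswith(('!', ',', '.', '?', '\n')):
        text += '.'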
    
    return text

def clean(text):
    """Removes various characters and string patterns
    generated by GPT-2.
    """
    text = text.replace('\n', ' ').replace('<|endoftext|>', '').strip()
    
    return text

def text_generator(state_dict,
                   text,
                   match_length=True,
                   match_length_multiplier=2,
                   length=50,
                   temperature=0.5,
                   top_k=30):
    """code by TaeHwan Jung(@graykode)
    Original Paper and repository here : https://github.com/openai/gpt-2
    GPT2 Pytorch Model : https://github.com/huggingface/pytorch-pretrained-BERT
    Modifications by John David Parsons for the Kaggle "Real or Not?" competition
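    Depends on the module-level `enc`, `model`, and `device` objects
    initialized outside of this function. The `state_dict` argument is
    accepted but not used here.

    Args:
        state_dict: unused; the weights are already loaded into `model`.
        text: sentence to begin with.
        match_length: if True, derive length from the word count of text.
        match_length_multiplier: generated length as a multiple of the
            input word count; raised for short text, lowered for long text.
        length: number of tokens to generate, only read if match_length
            is False. Always capped at 50.
        temperature: values near 0 are close to deterministic, 1.0 is
            wildly creative and risks going off topic. Use a value above 0.
        top_k: sample only from the k most likely next tokens.

    Returns:
        A string of GPT-2 generated text, based on the input text.
    """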
    
    text = force_period(text)
    
    # very short texts benefit from a longer multiplier
    if len(text) < 30:
        match_length_multiplier += 1
    
    # very long texts do not need as much multiplier
    if len(text) > 120 and match_length_multiplier > 1:
        match_length_multiplier -= 1
    
    if match_length:
        length = len(word_tokenize(text)) * match_length_multiplier

    # A tweet is at most 280 characters; cap generation at 50 tokens
    # as a rough equivalent.
    length = min(length, 50)
    unconditional = False

    context_tokens = enc.encode(text)

    out = sample_sequence(
        model=model,
        length=length,
        context=context_tokens if not unconditional else None,
        start_token=enc.encoder['<|endoftext|>'] if unconditional else None,
        batch_size=1,
        temperature=temperature,
        top_k=top_k,
        device=device)
    out = out[:, len(context_tokens):].tolist()

    text = enc.decode(out[0])
    text = clean(text)

    return text


def get_fake_tweets(df, num_samples=10):
    """Generates fake text similar to the original. NOTE:
    enabling the GPU will speed up execution by about 2x.
    60 rows took 85 seconds on a CPU, 45 seconds on a GPU

    Args:
        df: A pandas dataframe with columns 'text' and 'target'
        num_samples: number of rows to generate

    Returns:
        A pandas dataframe containing only the new generated
        text. The dataframe has the following columns:
        'original_text', 'fake_text', 'target'
    """
    
    start_time = time.time()
    expanded_rows = []

    for i, row in df.sample(num_samples).iterrows():
        row_original_text = row['text']
        row_target = row['target']

        generated_text = text_generator(state_dict, row_original_text)
        expanded_row = [row_original_text, generated_text, row_target]
        expanded_rows.append(expanded_row)

    print("--- %s seconds ---" % (time.time() - start_time))

    expanded_df = pd.DataFrame(
        expanded_rows, columns=['original_text', 'fake_text', 'target'])

    return expanded_df
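
The `temperature` and `top_k` arguments of text_generator control how adventurous the sampling is. The sketch below shows the general idea on a toy list of scores using only the standard library. It is not the code from the gpt-2-Pytorch library; the function name `sample_top_k` and the toy scores are made up for illustration.

```python
import math
import random


def sample_top_k(logits, temperature=0.5, top_k=30, rng=random):
    """Pick one index from raw scores using top-k filtering and temperature.

    Hypothetical helper for illustration only.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the top_k highest-scoring candidates.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = ranked[:top_k] if top_k > 0 else ranked
    # Dividing by a temperature below 1 sharpens the distribution,
    # so the best candidates win more often.
    scaled = [logits[i] / temperature for i in kept]
    # Softmax weights, shifted by the max for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(kept, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a four-token vocabulary
rng = random.Random(0)
picks = [
    sample_top_k(logits, temperature=0.5, top_k=2, rng=rng)
    for _ in range(1000)
]
# With top_k=2 only indices 0 and 1 can ever be chosen.
print(sorted(set(picks)))
```

Lowering `temperature` pushes even more of the picks onto index 0, while raising `top_k` lets the lower-scoring candidates back in.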

# Load data

In [None]:
#train_df = pd.read_csv('../../../input/nlp-getting-started/train.csv')
train_df = pd.read_csv('../../../input/tweet-cleaner/train_df_clean.csv')

train_df = train_df[['text', 'target']]

train_df

# Example usage

Do a test run of the text_generator function. It's a lot of fun to set test_tweet to your own prose!

In [None]:
# iloc of interesting test tweets
# 5725 = rescuing bodies in the water
# 333 = Windows is ethics armageddon
# 5678 = Dog buried alive
# 7611 = e-bike crash
# 7 = fire in the woods

test_tweet = train_df.iloc[7611]['text']
#test_tweet = 'Wow, it is super stormy out right now. The lightning woke me up :/'
generated_tweet = text_generator(state_dict, test_tweet, match_length_multiplier=2)

print('ORIGINAL: ' + test_tweet)
print('GPT-2: ' + generated_tweet)
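
To get a feel for how much text will be generated for a given tweet, the length rules inside text_generator can be pulled out into a small standalone function. This is only a sketch: `target_length` is a hypothetical name, and it assumes plain whitespace splitting as a stand-in for nltk's word_tokenize, which counts punctuation as separate tokens and so gives slightly larger numbers.

```python
def target_length(text, multiplier=2, cap=50):
    """Mirror the length rules in text_generator (hypothetical helper)."""
    # Very short texts benefit from a longer multiplier.
    if len(text) < 30:
        multiplier += 1
    # Very long texts do not need as much multiplier.
    if len(text) > 120 and multiplier > 1:
        multiplier -= 1
    # Assumption: whitespace splitting instead of nltk's word_tokenize.
    words = len(text.split())
    return min(words * multiplier, cap)


print(target_length("My car is so fast."))  # 5 words * 3 = 15
print(target_length("tweet " * 10))         # 10 words * 2 = 20
print(target_length("word " * 60))          # 60 words * 1, capped at 50
```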

Starting with the original training data, randomly sample rows to use as the source material for generating new fake tweets. The same original tweet will result in different fake tweets, so it is safe to sample the same row again in a later run. get_fake_tweets returns a dataframe with the following columns: 'original_text', 'fake_text', 'target'. Select just the 'fake_text' column and rename it to 'text' so the new rows can be concatenated to the original training dataframe. If you want to double the size of your training data, set num_samples to len(train_df) and come back in a few hours...

In [None]:
# num_samples=3000 took around 40 min with the GPU
faked_df = get_fake_tweets(train_df, num_samples=10)
faked_df = faked_df[['fake_text', 'target']]
faked_df.columns = ['text', 'target']
faked_df.to_csv('../../../working/faked_df.csv', index=False)
faked_df
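
One idea for dealing with the image credits and Twitter links mentioned at the top: drop generated rows that contain them before saving. The sketch below uses only the standard library. The patterns and the `looks_noisy` helper are my own guesses at what "noisy" output looks like, not something tuned against the full output.

```python
import re

# Assumed patterns: full URLs plus bare twitter.com / t.co links.
URL_PATTERN = re.compile(
    r"https?://\S+|\b(?:twitter\.com|t\.co)/\S+", re.IGNORECASE)
# Assumed patterns for credit lines such as "Image credit" or "Photo by".
CREDIT_PATTERN = re.compile(
    r"\b(?:image|photo)\s*(?:credit|courtesy|by)\b", re.IGNORECASE)


def looks_noisy(text):
    """Return True if text contains a link or a credit line."""
    return bool(URL_PATTERN.search(text) or CREDIT_PATTERN.search(text))


samples = [
    "twitter.com/U4YK5XgXqTThis is a rush transcript.",
    "The fire was brought under control on Sunday afternoon.",
    "Smoke rises over the valley. Image credit: Reuters",
]
kept = [s for s in samples if not looks_noisy(s)]
print(kept)
```

On the dataframe above, the same check could be applied with `faked_df[~faked_df['text'].apply(looks_noisy)]`.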

Finally, save the newly expanded training data to CSV. Download this file and plug it in to your existing pipeline as a bigger training set!

In [None]:
# ignore_index avoids duplicate index labels in the combined dataframe
train_df_combined = pd.concat([train_df, faked_df], ignore_index=True)
train_df_combined.to_csv('../../../working/train_df_combined.csv', index=False)
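
A note on large values of num_samples: pandas' `DataFrame.sample` draws without replacement by default, so asking for more rows than the dataframe holds raises a ValueError. The sketch below shows this on a toy dataframe (`toy_df` is made up for illustration). It also shows `replace=True`, which would be one way to oversample, assuming repeated source tweets are acceptable.

```python
import pandas as pd

# Toy stand-in for train_df with the same two columns.
toy_df = pd.DataFrame({"text": ["a", "b", "c"], "target": [1, 0, 1]})

try:
    # Default is replace=False, so 5 rows cannot be drawn from 3.
    toy_df.sample(5)
    oversample_failed = False
except ValueError:
    oversample_failed = True

# With replacement, the same source row may be drawn more than once.
resampled = toy_df.sample(5, replace=True, random_state=0)

print(oversample_failed)
print(len(resampled))
```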