## **What are we going to do:**

- load the train and test dataset from pickle file
- prepare the 3 different subsets for ``Democratic``,``Republic`` and ``Independent`` candidates
- load the pre-trained GPT-2 model and tokenizer
- initialize ``Trainer`` with ``TrainingArguments``
- train and save the model
- test the model

In [1]:
!pip install transformers

Collecting transformers
[?25l  Downloading https://files.pythonhosted.org/packages/99/84/7bc03215279f603125d844bf81c3fb3f2d50fe8e511546eb4897e4be2067/transformers-4.0.0-py3-none-any.whl (1.4MB)
[K     |████████████████████████████████| 1.4MB 9.1MB/s 
Collecting tokenizers==0.9.4
[?25l  Downloading https://files.pythonhosted.org/packages/0f/1c/e789a8b12e28be5bc1ce2156cf87cb522b379be9cadc7ad8091a4cc107c4/tokenizers-0.9.4-cp36-cp36m-manylinux2010_x86_64.whl (2.9MB)
[K     |████████████████████████████████| 2.9MB 30.8MB/s 
Collecting sacremoses
[?25l  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
[K     |████████████████████████████████| 890kB 47.9MB/s 
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... [?25l[?25hdone
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp36-none-any.whl size=893257 sha256=34a73dd997d2b0c852a

In [None]:
!nvidia-smi

Tue Dec  8 20:47:49 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P8     9W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
# Import pickle files output from EDA phase
train_data = pd.read_pickle('/content/drive/My Drive/train_data.pkl')
test_data = pd.read_pickle('/content/drive/My Drive/test_data.pkl')
train_data.head()

Mounted at /content/drive


Unnamed: 0,File,Text,Label,Party,Discussion,Vote,NumSents,Tokens,Total_tokens,Unique_tokens,lemmas
0,282_400436_1413023_DMN,"mr. speaker , i would like to say a word about...",DMN,D,M,N,17,"[would, like, say, word, illinois, also, proba...",167,127,"[word, illinois, probably, people, opposite, w..."
1,088_400272_2994052_DON,"mr. speaker , today we have some very clear ch...",DON,D,O,N,16,"[today, clear, choices, every, day, face, blac...",196,149,"[today, clear, choice, day, face, black, white..."
2,038_400080_0251064_DON,"mr. speaker , i yield myself such time as i ma...",DON,D,O,N,15,"[may, consume, would, like, briefly, describe,...",152,111,"[consume, briefly, describe, substitute, super..."
3,132_400227_0763073_DON,"mr. chairman , i yield back the balance of my ...",DON,D,O,N,1,"[back, balance]",2,2,[balance]
4,282_400380_1838049_ROY,"madam chairman , will the gentleman yield ? \n",ROY,R,O,Y,1,[],0,0,[]


In [3]:
# Combine lemmas, ignore empty lists due to short speeches of all stopwords
def combine_lemmas(preprocess_lists):
    total_lemma_tokens = []
    
    for lemma_list in preprocess_lists:
        if not lemma_list:
            pass
        else:
            total_lemma_tokens.append(lemma_list)
    return total_lemma_tokens

In [4]:
ind_text = pd.DataFrame(columns = ['Text']) 
dem_text = pd.DataFrame(columns = ['Text']) 
rep_text = pd.DataFrame(columns = ['Text']) 
ind_text['Text'] = train_data[train_data.Party == 'I']['Text']
dem_text['Text'] = train_data[train_data.Party == 'D']['Text']
rep_text['Text'] = train_data[train_data.Party == 'R']['Text']

In [5]:
import re
import pickle
import random

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [6]:
# Create file with speeches for independent candidate
ind_speech = '. '.join(ind_text['Text'])
text_file = open("ind_speech.txt", "w")
n = text_file.write(ind_speech)
text_file.close()

In [7]:
# Create file with speeches for independent candidate
dem_speech = '. '.join(dem_text['Text'])
text_file = open("dem_speech.txt", "w")
n = text_file.write(dem_speech)
text_file.close()

In [8]:
# Create file with speeches for independent candidate
rep_speech = '. '.join(rep_text['Text'])
text_file = open("rep_speech.txt", "w")
n = text_file.write(rep_speech)
text_file.close()

In [9]:
print("Train dataset for independent speakers : "+str(len(ind_speech)))
print("Train dataset for democratic speakers : "+str(len(dem_speech)))
print("Train dataset for republic speakers : "+str(len(rep_speech)))

Train dataset for independent speakers : 27442
Train dataset for democratic speakers : 4928519
Train dataset for republic speakers : 3609384


After we uploaded the file with use `unzip` to extract the recipes.json. 

# Prepare the dataset and build a ``TextDataset``

The next step is to extract the instructions from all recipes and build a `TextDataset`. The `TextDataset` is a custom implementation of the [Pytroch `Dataset` class](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class) implemented by the transformers library. If you want to know more about Dataset in Pytroch you can check out this [youtube video](https://www.youtube.com/watch?v=PXOzkkB5eH0&ab_channel=PythonEngineer).

First, we are going to split the `recipes.json` into a `train` and `test` section and extract `Instructions` from the recipes and write them into a `train_dataset.txt` and `test_dataset.txt`

the next step is to download the tokenizer, which we use. We use the tokenizer from the `openai-gpt` model on [huggingface](https://huggingface.co/openai-gpt).

In [10]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai-gpt")  #("anonymous-german-nlp/german-gpt2")
# ind_speech
ind_path = 'ind_speech.txt'
dem_path = 'dem_speech.txt'
rep_path = 'rep_speech.txt'

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=656.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=815973.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=458495.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=1272610.0, style=ProgressStyle(descript…




In [None]:
from transformers import TextDataset,DataCollatorForLanguageModeling

def load_dataset(train_path,test_path,tokenizer):
    train_dataset = TextDataset(
          tokenizer=tokenizer,
          file_path=train_path,
          block_size=128)
     
    test_dataset = TextDataset(
          tokenizer=tokenizer,
          file_path=test_path,
          block_size=128)   
    
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False,
    )
    return train_dataset,test_dataset,data_collator

train_dataset,test_dataset,data_collator = load_dataset(train_path,test_path,tokenizer)

In [11]:
## Vijay Code 
from transformers import TextDataset,DataCollatorForLanguageModeling

def load_dataset(ind_path,dem_path,rep_path,tokenizer):
    ind_dataset = TextDataset(
          tokenizer=tokenizer,
          file_path=ind_path,
          block_size=128)

    dem_dataset = TextDataset(
          tokenizer=tokenizer,
          file_path=dem_path,
          block_size=128)

    rep_dataset = TextDataset(
          tokenizer=tokenizer,
          file_path=rep_path,
          block_size=128) 
    
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False,
    )
    return ind_dataset,dem_dataset,rep_dataset,data_collator

ind_dataset,dem_dataset,rep_dataset,data_collator = load_dataset(ind_path,dem_path,rep_path,tokenizer)



# Initialize `Trainer` with `TrainingArguments` and GPT-2 model

The [Trainer](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Trainer) class provides an API for feature-complete training. It is used in most of the [example scripts](https://huggingface.co/transformers/examples.html) from Huggingface. Before we can instantiate our `Trainer` we need to download our GPT-2 model and create a [TrainingArguments](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments) to access all the points of customization during training. In the `TrainingArguments`, we can define the Hyperparameters we are going to use in the training process like our `learning_rate`, `num_train_epochs`, or  `per_device_train_batch_size`. A complete list can you find [here](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments).

In [12]:
from transformers import Trainer, TrainingArguments,AutoModelWithLMHead

model = AutoModelWithLMHead.from_pretrained("openai-gpt")

# Training model with independent candidate speeches

training_args = TrainingArguments(
    output_dir="./gpt2-independent_speech", #The output directory
    overwrite_output_dir=True, #overwrite the content of the output directory
    num_train_epochs=3, # number of training epochs
    per_device_train_batch_size=32, # batch size for training
    per_device_eval_batch_size=64,  # batch size for evaluation
    eval_steps = 400, # Number of update steps between two evaluations.
    save_steps=800, # after # steps model is saved 
    warmup_steps=500,# number of warmup steps for learning rate scheduler
    )


trainer_ind = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=ind_dataset
    #eval_dataset=test_dataset,
    #prediction_loss_only=True
)



HBox(children=(FloatProgress(value=0.0, description='Downloading', max=478750579.0, style=ProgressStyle(descri…




Some weights of OpenAIGPTLMHeadModel were not initialized from the model checkpoint at openai-gpt and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:
# Training model with democratic candidate speeches

training_args = TrainingArguments(
    output_dir="./gpt2-democratic_speech", #The output directory
    overwrite_output_dir=True, #overwrite the content of the output directory
    num_train_epochs=3, # number of training epochs
    per_device_train_batch_size=32, # batch size for training
    per_device_eval_batch_size=64,  # batch size for evaluation
    eval_steps = 400, # Number of update steps between two evaluations.
    save_steps=800, # after # steps model is saved 
    warmup_steps=500,# number of warmup steps for learning rate scheduler
    )


trainer_dem = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dem_dataset
    #eval_dataset=test_dataset,
    #prediction_loss_only=True
)



In [14]:
# Training model with republican candidate speeches
training_args = TrainingArguments(
    output_dir="./gpt2-republic_speech", #The output directory
    overwrite_output_dir=True, #overwrite the content of the output directory
    num_train_epochs=3, # number of training epochs
    per_device_train_batch_size=32, # batch size for training
    per_device_eval_batch_size=64,  # batch size for evaluation
    eval_steps = 400, # Number of update steps between two evaluations.
    save_steps=800, # after # steps model is saved 
    warmup_steps=500,# number of warmup steps for learning rate scheduler
    )


trainer_rep = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=rep_dataset
    #eval_dataset=test_dataset,
    #prediction_loss_only=True
)

# Train and save the model

To train the model we can simply run `Trainer.train()`.

In [15]:
trainer_ind.train()

trainer_dem.train()

trainer_rep.train()

Step,Training Loss


Step,Training Loss
500,3.625354


Step,Training Loss
500,3.178641


TrainOutput(global_step=537, training_loss=3.1678639417254058)

After training is done you can save the model by calling `save_model()`. This will save the trained model to our `output_dir` from our `TrainingArguments`.

In [16]:
trainer_ind.save_model()

trainer_dem.save_model()

trainer_rep.save_model()

# Test the model

To test the model we are going to use another [highlight of the transformers library](https://huggingface.co/transformers/main_classes/pipelines.html?highlight=pipelines) called `pipeline`. [Pipelines](https://huggingface.co/transformers/main_classes/pipelines.html?highlight=pipelines) are objects that offer a simple API dedicated to several tasks, among others also `text-generation`

In [17]:
from transformers import pipeline

ind_speech = pipeline('text-generation',model='./gpt2-independent_speech', tokenizer='openai-gpt',config={'max_length':800})

dem_speech = pipeline('text-generation',model='./gpt2-democratic_speech', tokenizer='openai-gpt',config={'max_length':800})

rep_speech = pipeline('text-generation',model='./gpt2-republic_speech', tokenizer='openai-gpt',config={'max_length':800})



In [21]:
#ind_speech('job cut')
ind_speech('employement')[0]['generated_text']

'employement and abuse of life and liberty from the government of countries that have become inhospitable . the question is , in what country has the power to decide ? what nation has the power to regulate the lives and interests of individuals and nations ? our children'

In [19]:
#dem_speech('china')
dem_speech('employement')[0]['generated_text']

"employement of freedom is an issue that should not be addressed with a ` ` bureaucratic mindset ' ' . the rule is a direct problem that needs to be addressed . the problem with the rule should be addressed with a ` ` republican ' ' solution ."

In [20]:
rep_speech('employement')[0]['generated_text']

'employement ? but if you believe in a good job , i would say let us say it and let the committee continue to investigate it because i believe in these cases that a good employer has a job . we all have jobs and that means a lot'

In [22]:
# Model Evaluation
!pip install rouge
from nltk.translate.bleu_score import corpus_bleu
from nltk.translate.bleu_score import sentence_bleu
from rouge import Rouge

Collecting rouge
  Downloading https://files.pythonhosted.org/packages/43/cc/e18e33be20971ff73a056ebdb023476b5a545e744e3fc22acd8c758f1e0d/rouge-1.0.0-py3-none-any.whl
Installing collected packages: rouge
Successfully installed rouge-1.0.0


In [27]:
reference =['employement of freedom is an issue that should not be addressed with a bureaucratic mindset. the rule is a direct problem that needs to be addressed . the problem with the rule should be addressed with a republican solution .'.split()]

# Output generated from democratic speech model
Candidate = 'employement of freedom is an issue that should not be addressed with a ` ` bureaucratic mindset ' ' . the rule is a direct problem that needs to be addressed . the problem with the rule should be addressed with a ` ` republican ' ' solution .'
candidate = Candidate.split()
score = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))
score

0.8666666666666667

In [28]:

reference = 'employement of freedom is an issue that should not be addressed with a bureaucratic mindset. the rule is a direct problem that needs to be addressed . the problem with the rule should be addressed with a republican solution .'
Candidate = 'employement of freedom is an issue that should not be addressed with a ` ` bureaucratic mindset ' ' . the rule is a direct problem that needs to be addressed . the problem with the rule should be addressed with a ` ` republican ' ' solution .'
print('Reference :: ',len(reference))
print('Candidate :: ',len(Candidate))
print(Candidate)
rouge = Rouge()
scores = rouge.get_scores(Candidate, reference)
print(scores)

Reference ::  225
Candidate ::  236
employement of freedom is an issue that should not be addressed with a ` ` bureaucratic mindset  . the rule is a direct problem that needs to be addressed . the problem with the rule should be addressed with a ` ` republican  solution .
[{'rouge-1': {'f': 0.9499999950125001, 'p': 0.9047619047619048, 'r': 1.0}, 'rouge-2': {'f': 0.8974358924490468, 'p': 0.8536585365853658, 'r': 0.9459459459459459}, 'rouge-l': {'f': 0.9787233992575827, 'p': 0.9583333333333334, 'r': 1.0}}]
