# Finetuning GPT2 on Unseen Legal data 
Vedant Shenoy | 26/04/24

Introduction:
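In this notebook, a GPT-2 model is fine-tuned on question-and-answer text collected from Law StackExchange.
The cells below load the base `gpt2` checkpoint (12 layers, hidden size 768, roughly 124 million parameters) and fine-tune it on 31,747 question-answer texts built from the first 20,000 questions in the dataset.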

## Imports

In [16]:
from datasets import load_dataset  # Dataset is imported from torch.utils.data below
import re
import unicodedata
import os
import time
import datetime

import pandas as pd
import seaborn as sns
import numpy as np
import random

import torch
from torch.utils.data import Dataset, DataLoader, random_split, RandomSampler, SequentialSampler
torch.manual_seed(42)

from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Config
from transformers import AdamW, get_linear_schedule_with_warmup

import html
import json

import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Data

The data used for this project was sourced from [HuggingFace](https://huggingface.co/datasets/ymoslem/Law-StackExchange). The dataset contains all StackExchange legal questions and their answers up until August 2023. Every question has several replies with their associated scores. There are a total of 24,400 rows in the dataset.

To balance model performance against limited hardware resources, we use a large subset of the dataset: its first 20,000 rows.

In [2]:
dataset = load_dataset("ymoslem/Law-StackExchange",split="train[0:20000]")

Downloading readme:   0%|          | 0.00/407 [00:00<?, ?B/s]

Downloading data: 100%|██████████| 106M/106M [00:00<00:00, 132MB/s]  


Generating train split: 0 examples [00:00, ? examples/s]

In [3]:
dataset

Dataset({
    features: ['question_id', 'tags', 'question_title', 'answers', 'license', 'question_body', 'link', 'score'],
    num_rows: 20000
})

## Data-Preprocessing

One of the most important aspects of training an LLM is providing it with good-quality training data, even if that comes at the cost of reducing the overall amount of training data.

This project takes a lot of inspiration from the following paper, which trains a Mistral 7B based model for answering legal queries: https://arxiv.org/abs/2403.03883

The first step is to convert the data to a DataFrame to make preprocessing easier.


In [4]:
df = dataset.to_pandas()

In [12]:
df.head()

| | question_id | tags | question_title | answers | license | question_body | link | score |
|---|---|---|---|---|---|---|---|---|
| 0 | 94665 | [criminal-law, driving, sentencing] | Why is drunk driving causing accident punished... | `{'answer_id': 94666, 'body': '<h3>Moral luck</...` | CC BY-SA 4.0 | `<p>When people drink and drive and then cause ...` | https://law.stackexchange.com/questions/94665/... | 23 |
| 0 | 94665 | [criminal-law, driving, sentencing] | Why is drunk driving causing accident punished... | `{'answer_id': 94674, 'body': '<p>Drunk driving...` | CC BY-SA 4.0 | `<p>When people drink and drive and then cause ...` | https://law.stackexchange.com/questions/94665/... | 23 |
| 0 | 94665 | [criminal-law, driving, sentencing] | Why is drunk driving causing accident punished... | `{'answer_id': 94677, 'body': '<p>Drivers are n...` | CC BY-SA 4.0 | `<p>When people drink and drive and then cause ...` | https://law.stackexchange.com/questions/94665/... | 23 |
| 0 | 94665 | [criminal-law, driving, sentencing] | Why is drunk driving causing accident punished... | `{'answer_id': 94669, 'body': '<p>Have you seen...` | CC BY-SA 4.0 | `<p>When people drink and drive and then cause ...` | https://law.stackexchange.com/questions/94665/... | 23 |
| 0 | 94665 | [criminal-law, driving, sentencing] | Why is drunk driving causing accident punished... | `{'answer_id': 94681, 'body': '<p><strong>The q...` | CC BY-SA 4.0 | `<p>When people drink and drive and then cause ...` | https://law.stackexchange.com/questions/94665/... | 23 |


Let's explore the data by looking at the first answer to the first question.

In [8]:
df.answers.iloc[0][0]

{'answer_id': 94666,
 'body': '<h3>Moral luck</h3>\n<p>You have raised the issue of <em>moral luck</em>, a long recognized problem in criminal theory. The classic expositions of this issue are by <a href="https://en.m.wikipedia.org/wiki/Thomas_Nagel" rel="noreferrer">Thomas Nagel</a>, in his chapter, &quot;<a href="https://rintintin.colorado.edu/%7Evancecd/phil1100/Nagel1.pdf" rel="noreferrer">Moral Luck</a>&quot; (1979) and <a href="https://en.m.wikipedia.org/wiki/Bernard_Williams" rel="noreferrer">Bernard Williams</a>, &quot;<a href="https://bibliotecamathom.files.wordpress.com/2012/10/williams_-_moral_luck.pdf" rel="noreferrer">Moral Luck</a>&quot; (1976). Specifically, you are describing what they call <em>outcome</em> luck, or <em>consequential</em> luck.</p>\n<p>Driving while intoxicated vs. driving while intoxicated and causing death is not the only example where moral luck results in a distinction in punishment. Other examples are:</p>\n<ul>\n<li>dangerous driving vs. dangerous

As the data was collected from an online forum, every question has several replies, which are stored as a list of dictionaries where each dictionary contains a response along with some metadata. The simplest approach would be to train the model on each question followed by all of its responses. However, as the model we have selected is an older one, we also have to consider its context window (1,024 positions for GPT-2, with sequences truncated to 768 tokens in this notebook). If the combined text is too long, the model might not learn the proper structure of answering a question.

Therefore, for this project we prepend the question to every response. This ensures that the model sees the question-then-answer structure while also learning to respond in multiple ways to a single question. It also increases the amount of training data, which helps when fine-tuning a model of this size, but it comes at the cost of additional compute.

In [9]:
df = df.explode('answers')
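
To illustrate what `explode` does to the answers column, here is a minimal pure-Python sketch on a made-up two-question table. The rows and the `explode_answers` helper are hypothetical and only mimic the behaviour; they are not taken from the dataset or from pandas.

```python
# Hypothetical toy rows; the real dataset has more columns.
rows = [
    {"question_title": "Q1", "answers": [{"body": "A1"}, {"body": "A2"}]},
    {"question_title": "Q2", "answers": [{"body": "B1"}]},
]

def explode_answers(rows):
    """Return one row per (question, answer) pair."""
    exploded = []
    for row in rows:
        for answer in row["answers"]:
            new_row = dict(row)          # copy the question fields
            new_row["answers"] = answer  # keep a single answer per row
            exploded.append(new_row)
    return exploded

exploded = explode_answers(rows)
print(len(exploded))
for r in exploded:
    print(r["question_title"], r["answers"]["body"])
```

One difference worth noting: this sketch silently skips a question whose answer list is empty, whereas pandas keeps such a row with a missing value in the exploded column, which is one reason a null-dropping step is useful afterwards.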

In [11]:
df.head()

| | question_id | tags | question_title | answers | license | question_body | link | score |
|---|---|---|---|---|---|---|---|---|
| 0 | 94665 | [criminal-law, driving, sentencing] | Why is drunk driving causing accident punished... | `{'answer_id': 94666, 'body': '<h3>Moral luck</...` | CC BY-SA 4.0 | `<p>When people drink and drive and then cause ...` | https://law.stackexchange.com/questions/94665/... | 23 |
| 0 | 94665 | [criminal-law, driving, sentencing] | Why is drunk driving causing accident punished... | `{'answer_id': 94674, 'body': '<p>Drunk driving...` | CC BY-SA 4.0 | `<p>When people drink and drive and then cause ...` | https://law.stackexchange.com/questions/94665/... | 23 |
| 0 | 94665 | [criminal-law, driving, sentencing] | Why is drunk driving causing accident punished... | `{'answer_id': 94677, 'body': '<p>Drivers are n...` | CC BY-SA 4.0 | `<p>When people drink and drive and then cause ...` | https://law.stackexchange.com/questions/94665/... | 23 |
| 0 | 94665 | [criminal-law, driving, sentencing] | Why is drunk driving causing accident punished... | `{'answer_id': 94669, 'body': '<p>Have you seen...` | CC BY-SA 4.0 | `<p>When people drink and drive and then cause ...` | https://law.stackexchange.com/questions/94665/... | 23 |
| 0 | 94665 | [criminal-law, driving, sentencing] | Why is drunk driving causing accident punished... | `{'answer_id': 94681, 'body': '<p><strong>The q...` | CC BY-SA 4.0 | `<p>When people drink and drive and then cause ...` | https://law.stackexchange.com/questions/94665/... | 23 |


Next, we drop rows that contain null values.

In [14]:
df.dropna(inplace=True)

The response body contains a lot of unwanted information in the form of HTML tags, HTML entities and formatting characters. It is essential to remove them properly before proceeding. We also apply Unicode normalisation so that the LLM does not learn separate representations for characters that look the same to us but have different Unicode encodings.

In [17]:
def clean_text_answer(answer):
    # Each answer is a dictionary; the HTML text lives under the "body" key
    html_text = answer["body"]
    # Remove HTML tags
    clean_text = re.sub(r'<.*?>', '', html_text)
    # Decode HTML entities (done after tag removal so decoded "<" or ">" are kept)
    clean_text = html.unescape(clean_text)
    # Remove newline characters
    clean_text = clean_text.replace('\n', ' ')
    # Collapse runs of whitespace into a single space
    clean_text = re.sub(r'\s+', ' ', clean_text).strip()
    # Normalise Unicode so visually identical characters share one representation
    clean_text = unicodedata.normalize('NFKC', clean_text)

    return clean_text

df["answers"] = df["answers"].apply(clean_text_answer)
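
As a quick check of what these cleaning steps do, the sketch below runs the same sequence on a made-up HTML fragment. The `clean_html` helper and the sample string are hypothetical; the helper mirrors the steps above but takes a plain string rather than an answer dictionary, and it skips the separate newline replacement because the whitespace regex already covers newlines.

```python
import html
import re
import unicodedata

def clean_html(raw):
    """Strip tags, decode entities, collapse whitespace, then apply NFKC."""
    text = re.sub(r'<.*?>', '', raw)           # remove HTML tags
    text = html.unescape(text)                 # &quot; -> ", &nbsp; -> U+00A0, ...
    text = re.sub(r'\s+', ' ', text).strip()   # newlines and non-breaking spaces become ' '
    return unicodedata.normalize('NFKC', text) # e.g. the "fi" ligature becomes "fi"

# Made-up sample containing tags, entities, a newline and a ligature (U+FB01).
sample = '<p>The &quot;\ufb01ne&quot;\n is   <em>50 &amp; up</em>&nbsp;only.</p>'
print(clean_html(sample))
```

The ligature in the sample is a single code point that renders like the two letters "fi"; NFKC folds it into the plain letters, which is the kind of look-alike the normalisation step is meant to handle.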


In [19]:
df["answers"].iloc[0]

'Moral luck You have raised the issue of moral luck, a long recognized problem in criminal theory. The classic expositions of this issue are by Thomas Nagel, in his chapter, "Moral Luck" (1979) and Bernard Williams, "Moral Luck" (1976). Specifically, you are describing what they call outcome luck, or consequential luck. Driving while intoxicated vs. driving while intoxicated and causing death is not the only example where moral luck results in a distinction in punishment. Other examples are: dangerous driving vs. dangerous driving that causes death a successful offence vs. an attempted offence (generally resulting in a maximum sentence less than that of the successful offence) Nagel writes: If someone has had too much to drink and his car swerves on to the sidewalk, he can count himself morally lucky if there are no pedestrians in its path. If there were, he would be to blame for their deaths, and would probably be prosecuted for manslaughter. But if he hurts no one, although his reckl

We now apply the same cleaning to both the question_body and question_title columns.

In [17]:
def clean_text_question(html_text):
    # Remove HTML tags
    clean_text = re.sub(r'<.*?>', '', html_text)
    # Decode HTML entities
    clean_text = html.unescape(clean_text)
    # Remove newline characters
    clean_text = clean_text.replace('\n', ' ')
    # Remove multiple whitespaces
    clean_text = re.sub(r'\s+', ' ', clean_text).strip()
    
    clean_text = unicodedata.normalize('NFKC',clean_text)
    
    return clean_text


In [18]:
df["question_body"] = df["question_body"].apply(clean_text_question)


In [19]:
df["question_title"] = df["question_title"].apply(clean_text_question)


Finally, we combine everything in order to generate training data.

In [20]:
df["training_text"] = df["question_title"]+df["question_body"]+df["answers"]
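
Plain `+` concatenation joins the three fields with nothing in between, so the last character of the title runs straight into the body. The sketch below shows one possible variant that inserts a separator. The `build_training_text` helper, the example strings and the newline separator are all assumptions for illustration; the results in this notebook were produced without any separator.

```python
def build_training_text(title, body, answer, sep="\n"):
    """Join the three fields with a separator, skipping empty ones."""
    parts = [title.strip(), body.strip(), answer.strip()]
    return sep.join(p for p in parts if p)

# Made-up example fields.
text = build_training_text(
    "Is jaywalking illegal?",
    "I crossed an empty street.",
    "It depends on the jurisdiction.",
)
print(text)
```

Whether a newline, a space or a dedicated special token works best as the separator is a design choice that would need to be checked against the validation loss.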

In [21]:
train_text = df.training_text.copy()

In [22]:
train_text

0        Why is drunk driving causing accident punished...
0        Why is drunk driving causing accident punished...
0        Why is drunk driving causing accident punished...
0        Why is drunk driving causing accident punished...
0        Why is drunk driving causing accident punished...
                               ...                        
19996    Who has jurisdiction over civilian crimes on a...
19997    Can identification be confirmed over a mobile ...
19998    In the USA, how is a war officially ended from...
19999    How to freelance in the UK without violating t...
19999    How to freelance in the UK without violating t...
Name: training_text, Length: 31747, dtype: object

## Model Training

We need to initialise a GPT-2 tokenizer with custom tokens for the beginning (bos_token), end (eos_token), and padding (pad_token) of sequences.

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2', bos_token='<|startoftext|>', eos_token='<|endoftext|>', pad_token='<|pad|>')  # 'gpt2-medium' is the larger alternative


In [24]:
# We set batch_size of 4 which is the number of training examples used in one iteration of training.
batch_size = 4

As we are using custom data, we need to define a dataset class. During initialisation, it tokenises each text in the list, wrapping it in the special tokens (<|startoftext|> and <|endoftext|>), and truncates or pads each sequence to the specified max_length.

In [25]:
class GPT2Dataset(Dataset):

    def __init__(self, txt_list, tokenizer, gpt2_type="gpt2", max_length=768):

        self.tokenizer = tokenizer
        self.input_ids = []
        self.attn_masks = []

        for txt in txt_list:

            encodings_dict = tokenizer('<|startoftext|>'+ txt + '<|endoftext|>', truncation=True, max_length=max_length, padding="max_length")

            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx] 
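
The truncate-or-pad step that the tokenizer performs can be sketched in plain Python as follows. The helper names, the toy token ids and the pad id of 50258 are assumptions for illustration; `mask_pad_labels` is an optional variant that this notebook does not use.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]                   # truncate long sequences
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    input_ids = ids + [pad_id] * n_pad             # right-pad short sequences
    return input_ids, attention_mask

def mask_pad_labels(input_ids, attention_mask, ignore_index=-100):
    """Optional variant: replace padded positions so the loss ignores them."""
    return [tok if m == 1 else ignore_index
            for tok, m in zip(input_ids, attention_mask)]

# Toy sequence: start token, two ordinary tokens, end token (ids are illustrative).
ids, mask = pad_or_truncate([50257, 11, 22, 50256], max_length=6, pad_id=50258)
print(ids)
print(mask)
print(mask_pad_labels(ids, mask))
```

Two consequences follow from this behaviour. A text longer than max_length loses its tail, including the end-of-text token. And because the training loop later passes the input ids directly as labels, padded positions also count towards the loss unless they are replaced with -100, the index that the Hugging Face loss ignores.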

In [26]:
dataset = GPT2Dataset(train_text, tokenizer, max_length=768)

# Split into training and validation sets
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size

train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

print('{:>5,} training samples'.format(train_size))
print('{:>5,} validation samples'.format(val_size))

28,572 training samples
3,175 validation samples
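
The split sizes above, and the number of optimisation steps per epoch they imply, can be reproduced with a few lines of arithmetic. This is only a worked check using the sample count and batch size from this notebook.

```python
import math

n_samples = 31747   # length of train_text shown earlier
batch_size = 4

train_size = int(0.9 * n_samples)        # 90% for training, rounded down
val_size = n_samples - train_size        # remainder for validation
steps_per_epoch = math.ceil(train_size / batch_size)

print(train_size, val_size, steps_per_epoch)  # 28572 3175 7143
```

The 7,143 steps per epoch match the batch count reported in the training log.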


## Develop dataloaders

In [27]:
train_dataloader = DataLoader(
            train_dataset,  # The training samples.
            sampler = RandomSampler(train_dataset), # Select batches randomly
            batch_size = batch_size # Trains with this batch size.
        )

# For validation the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
            val_dataset, # The validation samples.
            sampler = SequentialSampler(val_dataset), # Pull out batches sequentially.
            batch_size = batch_size # Evaluate with this batch size.
        )

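We configure a GPT-2 model for language modelling by loading its configuration with output_hidden_states=False, which disables returning the hidden states. We then instantiate the model with this configuration, resize its token embeddings to match the tokenizer (which now contains the added <|startoftext|> and <|pad|> tokens), and move it to the GPU for faster computation.

In [36]:
configuration = GPT2Config.from_pretrained('gpt2', output_hidden_states=False)

# instantiate the model
model = GPT2LMHeadModel.from_pretrained("gpt2", config=configuration)

# The tokenizer gained new special tokens, so the embedding matrix must grow to match;
# otherwise the new token ids would index past the end of the embedding table.
model.resize_token_embeddings(len(tokenizer))

# Tell pytorch to run this model on the GPU.
device = torch.device("cuda")

model.cuda()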


GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50259, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50259, bias=False)
)

In [37]:
# some default parameters 
epochs = 1
learning_rate = 6e-4
warmup_steps = 1e2
epsilon = 1e-8

# this produces sample output every 1000 steps

sample_every = 1000

In [38]:
# Note: this AdamW is the class imported from transformers (not torch.optim.AdamW); it is deprecated in recent transformers releases.
optimizer = AdamW(model.parameters(),
                  lr = learning_rate,
                  eps = epsilon
                )

In [39]:
# Total number of training steps is [number of batches] x [number of epochs]. 
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
# This changes the learning rate as the training loop progresses
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = warmup_steps, 
                                            num_training_steps = total_steps)
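
The shape of this schedule can be sketched in plain Python: the learning rate rises linearly from zero over the warmup steps and then decays linearly to zero at the final step. The `linear_warmup_factor` helper is a hypothetical re-implementation for illustration, not the library's code, and the step counts assume the values used in this notebook.

```python
def linear_warmup_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)   # linear ramp-up
    # linear decay from 1.0 at the end of warmup down to 0.0 at total_steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

base_lr = 6e-4
for step in (0, 50, 100, 3000, 7143):
    lr = base_lr * linear_warmup_factor(step, warmup_steps=100, total_steps=7143)
    print(step, f"{lr:.2e}")
```

With 100 warmup steps out of 7,143, the model trains at close to the peak learning rate for only a short stretch before the decay takes over.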

In [53]:
def format_time(elapsed):
    return str(datetime.timedelta(seconds=int(round((elapsed)))))

## Training Loop

In [41]:
total_t0 = time.time()

training_stats = []

model = model.to(device)

for epoch_i in range(0, epochs):

    # ========================================
    #               Training
    # ========================================

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    t0 = time.time()

    total_train_loss = 0

    model.train()

    for step, batch in enumerate(train_dataloader):

        b_input_ids = batch[0].to(device)
        b_labels = batch[0].to(device)
        b_masks = batch[1].to(device)

        model.zero_grad()        

        outputs = model(  b_input_ids,
                          labels=b_labels, 
                          attention_mask = b_masks,
                          token_type_ids=None
                        )

        loss = outputs[0]  

        batch_loss = loss.item()
        total_train_loss += batch_loss

        # Get sample every x batches.
        if step % sample_every == 0 and not step == 0:

            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}. Loss: {:>5,}.   Elapsed: {:}.'.format(step, len(train_dataloader), batch_loss, elapsed))

            model.eval()

            sample_outputs = model.generate(
                                    bos_token_id=random.randint(1,30000),
                                    do_sample=True,   
                                    top_k=50, 
                                    max_length = 200,
                                    top_p=0.95, 
                                    num_return_sequences=1
                                )
            for i, sample_output in enumerate(sample_outputs):
                  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
            
            model.train()

        loss.backward()

        optimizer.step()

        scheduler.step()

    # Calculate the average loss over all of the batches.
    avg_train_loss = total_train_loss / len(train_dataloader)       
    
    # Measure how long this epoch took.
    training_time = format_time(time.time() - t0)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(training_time))
        
    # ========================================
    #               Validation
    # ========================================

    print("")
    print("Running Validation...")

    t0 = time.time()

    model.eval()

    total_eval_loss = 0
    nb_eval_steps = 0

    # Evaluate data for one epoch
    for batch in validation_dataloader:
        
        b_input_ids = batch[0].to(device)
        b_labels = batch[0].to(device)
        b_masks = batch[1].to(device)
        
        with torch.no_grad():        

            outputs  = model(b_input_ids, 
                             attention_mask = b_masks,
                            labels=b_labels)
          
            loss = outputs[0]  
            
        batch_loss = loss.item()
        total_eval_loss += batch_loss        

    avg_val_loss = total_eval_loss / len(validation_dataloader)
    
    validation_time = format_time(time.time() - t0)    

    print("  Validation Loss: {0:.2f}".format(avg_val_loss))
    print("  Validation took: {:}".format(validation_time))

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    )

print("")
print("Training complete!")
print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))


Training...


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


  Batch 1,000  of  7,143. Loss: 1.678596019744873.   Elapsed: 0:08:56.
0:  valveIs it illegal for a non-state person to use fake documents to get a job on a job in another state?I am a professional photographer. I'm not a lawyer but I've studied at universities. I live in Maryland but I'm not employed by another state. The issue I'm facing is not that someone would have obtained my work (which is not legally a crime), but rather that if they did, what do the laws actually look like for that? Is the fake work illegal if it is created by someone other than my employer? If not, is there some sort of affirmative defense against it? Is there some sort of statutory defense against it?I am a professional photographer. I'm not a lawyer but I'm a programmer. I've studied at universities. I live in Maryland but I'm not employed by another state. The issue I'm facing is not that someone would have obtained my work (which is not legally a crime), but rather that if they did, what


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


  Batch 2,000  of  7,143. Loss: 1.3311100006103516.   Elapsed: 0:17:54.
0:  establishedHow can a university legally charge the university to cover costs?A private university is obligated to pay for the costs of providing free legal representation on campus. How can a University legally charge for legal representation on campus? Suppose the university is required to provide a website to students and staff that is designed for a college student (they are free to do that). Then the university cannot charge money for attorneys or any sort of service for students (they are free to do that). What exactly does this say about costs? If students do not need a lawyer, then how can they justify it? Also, how can one claim to own any university costs on their own? So, assuming you can show money is in addition to paying for an attorney, then you can claim to be in a state of Washington, OR Oregon. How can this affect your tuition? How can one claim to take some state's public libraries/public tran

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


  Batch 3,000  of  7,143. Loss: 2.1925065517425537.   Elapsed: 0:26:52.
0: ervedMust it be legal for you to copy your text and email/fax?My wife is allergic to electric electricity so she takes a taxi to a pub, where the electricity is turned on. When the electricity fails, it can damage her eye, and it is impossible to read. However, she is very lucky not to be allergic to electric electricity, so she does not have any troubles with that. Given that this is not a private situation, I can understand why it might not work. It would require a licence and the taxi driver would need to sign them, etc. It could be worth hiring a lawyer - and would be cheaper than hiring a surgeon to do it, since she would be getting a better bill from the taxi. If the taxi driver cannot verify your identity, then there may be no problems - the taxi driver does need proof from you. My only concern would be that your wife will be physically harmed, and if she cannot speak to you, there is no way to get


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


  Batch 4,000  of  7,143. Loss: 1.2311097383499146.   Elapsed: 0:35:50.
0:  murdersAre there specific legal requirements for the UK to be "reasonable" under the GDPR in this case?GDPR Article 5(2) and Article 7(1) say that "processing activities should not be done without specific legal basis". If the data subject, without any such legal basis, wishes to delete data relating to an offence (for example, fraud), is the data subject entitled to the full legal right to delete all the personal data of the data subject? Or is this limited to the law of the case? Or does the data subject's right to erasure come only for the purposes stated by Article 8(2)(b)? Or is this limited to the general procedure that applies when processing activities are carried out for legal purpose. Or does the data subject's right to erasure also come only for the purposes stated by Article 6(3)? Or does the data subject's right to erasure come only for the purposes stated by Article 7(2)(a)? Or does the


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


  Batch 5,000  of  7,143. Loss: 1.4661880731582642.   Elapsed: 0:44:48.
0:  minsHow can US President Obama be punished for lying in the US Senate?In the United States Senate he has been lying to the US Senate about the contents of classified communication regarding US intelligence agency personnel. However, he does not have standing to bring criminal charges against the person. He is not allowed to lie during this time. His statements about this nature were clearly a violation of the First Amendment and have gone against the rules of the Senate. How can the US President be punished for lying in the U.S Senate? Could he be sued for this? What if there is a pattern of dishonesty on the Senate floor that is not intentional but not intentional?The Constitution gives Congress the power to impeach the President for any act or omission, in case it deems that the President lied on oath, and if it deemed he was lying in his own words, and if it did, it could still prosecute him, so long as the 

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


  Batch 6,000  of  7,143. Loss: 1.394681692123413.   Elapsed: 0:53:46.
0: attleHow can one prove someone is innocent, as is most common and I'm sure is not the case?A recent comment by @user6726 states: There is the "legal" question, whether any evidence has been offered by a company that it did not have access to (or does not actually know of) the information that was uncovered. It seems like the relevant law states that a company can charge damages to victims of a crime if a person (or a person) committed a crime but a person did not act in the way that could reasonably have been committed. I.e., if there was no evidence that a defendant is guilty and there was no reasonable expectation of privacy or safety of anyone, then can a person say that he/she is not guilty and there was no expectation of privacy or safety of anyone but a person who does not act in the way that could reasonably have been committed (if there were no evidence that the defendant is guilty and there was no reason

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


  Batch 7,000  of  7,143. Loss: 2.2902512550354004.   Elapsed: 1:02:44.
0:  ConfederWhy is there no "right to a lawyer?"I know that the "right to a lawyer" was only used to protect some people from legal action but what makes a law abiding citizen not defend a legal one from legal action? What's the reason why no person is forced to be a lawyer? Is this because the law is not written to protect everyone who represents others in civil cases? And if it is, why's a lawyer considered to be a "rights-in-the-law"? I'm from India.The "right to a lawyer" is a legal right that can be waived by the right-of-procedure, or a specific type of right, depending on the case. A contract between two parties states the legal rights of each. It's possible that the contract might contain "exception" which will prevent a future court to enforce the contract, but if the contract contains an exception to this, the court will usually enforce it. In this case, the

  Average training loss: 1.94
  Training epoch

In [45]:
# Create a DataFrame from our training statistics.
df_stats = pd.DataFrame(data=training_stats)

# Use the 'epoch' as the row index.
df_stats = df_stats.set_index('epoch')

df_stats

| epoch | Training Loss | Valid. Loss | Training Time | Validation Time |
|---|---|---|---|---|
| 1 | 1.940416 | 1.729608 | 1:04:03 | 0:02:17 |


## Trial Run

In [56]:
model.eval()

prompt = "<|startoftext|> What is something that will land me in prison?"

generated = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0)
generated = generated.to(device)

print(generated)

sample_outputs = model.generate(
                                generated, 
                                #bos_token_id=random.randint(1,30000),
                                do_sample=True,   
                                top_k=50, 
                                max_length = 150,
                                top_p=0.95, 
                                num_return_sequences=1,
                                )

for i, sample_output in enumerate(sample_outputs):
    print("{}: {}\n\n".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


tensor([[50257,  1867,   318,  1223,   326,   481,  1956,   502,   287,  3770,
            30]], device='cuda:0')
0:  What is something that will land me in prison?I can see why many people will want to go to prison. However, I will soon find myself in prison! This is in Germany (and is not legal for me in my home country) What is the situation like for those who would like to go to prison? Would they be able to serve as a witness? If the witness lives in Germany, then they will be able to testify. If they don't, then they will be in legal trouble. And then they are in trouble later? If the witness lives in France, there is no legal trouble but France could sue for money damages. They would be a party in the lawsuit in Germany if the French government wanted to bring the case




In [52]:
output_dir = './model_save/'

# Create output directory if needed
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

print("Saving model to %s" % output_dir)

# Save a trained model, configuration and tokenizer using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

Saving model to ./model_save/


('./model_save/tokenizer_config.json',
 './model_save/special_tokens_map.json',
 './model_save/vocab.json',
 './model_save/merges.txt',
 './model_save/added_tokens.json')

## Evaluation

In [57]:
!pip install -U evaluate

Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl.metadata (9.4 kB)
Collecting responses<0.19 (from evaluate)
  Downloading responses-0.18.0-py3-none-any.whl.metadata (29 kB)
Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84.1/84.1 kB 4.2 MB/s eta 0:00:00
Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Installing collected packages: responses, evaluate
Successfully installed evaluate-0.4.1 responses-0.18.0


In [79]:
def generate_text(prompt,num_return_sequences=1):
    # Put the model in evaluation mode (disables dropout)
    model.eval()

    # Prepare the input
    input_text = "<|startoftext|> " + prompt
    inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)

    # Generate the output
    sample_outputs = model.generate(
        inputs,
        do_sample=True,
        top_k=50,
        max_length=150,
        top_p=0.95,
        num_return_sequences=num_return_sequences,
    )

    # Decode and return the generated text
    generated_text = []
    for output in sample_outputs:
        generated_text.append(tokenizer.decode(output, skip_special_tokens=True))
    return generated_text

In [76]:
predictions = generate_text("What is the punishment for drunk driving?",5)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [77]:
from evaluate import load
perplexity = load("perplexity", module_type="metric")
results = perplexity.compute(predictions=predictions, model_id='gpt2')

  0%|          | 0/1 [00:00<?, ?it/s]

In [78]:
results

{'perplexities': [18.795846939086914,
  18.251562118530273,
  12.350462913513184,
  11.93504810333252,
  13.493307113647461],
 'mean_perplexity': 14.96524543762207}
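
Perplexity is the exponential of the mean per-token cross-entropy, so the reported numbers can be related to each other with a short calculation. The sketch below recomputes the mean of the five values above and, as an additional illustration, converts the validation loss from the training statistics into a perplexity.

```python
import math

perplexities = [18.795846939086914, 18.251562118530273, 12.350462913513184,
                11.93504810333252, 13.493307113647461]
mean_ppl = sum(perplexities) / len(perplexities)

val_loss = 1.729608            # validation loss from the training statistics
val_ppl = math.exp(val_loss)   # perplexity = exp(mean cross-entropy)

print(round(mean_ppl, 2), round(val_ppl, 2))
```

The two figures are not directly comparable. The first scores generated text, while the second scores held-out text with the fine-tuned model, and that validation loss was averaged over padded positions as well, which likely makes it look better than it is.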

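The five generated samples have a mean perplexity of 14.97. Note that `model_id='gpt2'` makes the metric score the text with the base pretrained GPT-2 checkpoint rather than the fine-tuned model, so this number measures how fluent the generations look to base GPT-2, not how well the fine-tuned model predicts held-out legal text. It is a respectable fluency score, but it should not be compared directly with perplexities reported for state-of-the-art models.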


## Conclusion

This notebook fine-tuned a GPT-2 model on Law StackExchange data that it was not specifically trained on. After one epoch the validation loss was 1.73, and the model generates fluent replies, with its samples scoring a mean perplexity of 14.97 under base GPT-2. The main thing to improve is the number of epochs: due to time constraints it was set to 1, and training for more epochs may improve performance.