Connect Colab with drive

In [None]:
from google.colab import drive

drive.mount('/content/gdrive')

Mounted at /content/gdrive


# 1. General

GPT-2 is a large transform-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with one simple goal: to predict the next word, given all previous words in a text. The wide variety of data sets makes this simple objective to accommodate a wide variety of naturally occurring tasks in diverse domains. GPT-2 is a direct upgrade of GPT, with more than 10 times the number of parameters and trained on more than 10 times the amount of data.

## 1.1. Import libraries and set up some config

In [1]:
import numpy as np
import pandas as pd 
import torch
import logging
from tqdm import tqdm
import math
import argparse
import os

In [2]:
!git clone https://github.com/huggingface/transformers
!pip install transformers/
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers.optimization import AdamW, get_linear_schedule_with_warmup

Cloning into 'transformers'...
remote: Enumerating objects: 139946, done.[K
remote: Counting objects: 100% (1927/1927), done.[K
remote: Compressing objects: 100% (744/744), done.[K
remote: Total 139946 (delta 1262), reused 1571 (delta 1009), pack-reused 138019[K
Receiving objects: 100% (139946/139946), 137.96 MiB | 21.48 MiB/s, done.
Resolving deltas: 100% (104790/104790), done.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Processing ./transformers
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m224.5/224.5 kB[0m [31m3.7 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp310-cp310-manyl

## 1.2 Initialization parameters

In [None]:
parser = argparse.ArgumentParser()
parser.add_argument('--seed', type=int, default=88888) # an optional argument to set the random seed for reproducibility
parser.add_argument("--model_name", default="gpt2", type=str) # the name of the pre-trained model to use
parser.add_argument("--max_seq_length", default=512, type=int) # the maximum length of input sequences 
parser.add_argument("--train_batch_size", default=4, type=int) #  the batch size to use during training
parser.add_argument("--valid_batch_size", default=4, type=int) # the batch size to use during validation
parser.add_argument("--num_train_epochs", default=3, type=int) #  the number of epochs to train
parser.add_argument("--warmup", default=0.1, type=float) #  the fraction of training steps to use for learning rate warmup
parser.add_argument("--learning_rate", default=5e-5, type=float) # the learning rate to use for training
parser.add_argument("--input_text_path", default='/content/gdrive/MyDrive/TBT-DL/data', type=str) #  the path to the directory containing the input text files
args, _ = parser.parse_known_args()

# 2. Data Preprocessing

## 2.1 Combine question + answer => story



In [None]:
DATAPATH=args.input_text_path
def combinetext(prompt, story):
    fp=open(os.path.join(DATAPATH,prompt),encoding='utf8')
    fs=open(os.path.join(DATAPATH,story),encoding='utf8')
    prompts=fp.readlines()
    stories=fs.readlines()
    assert len(prompts)==len(stories)
    combine=[]
    for i in range(len(prompts)):
        combine.append(prompts[i].rstrip()+' <sep> '+" ".join(stories[i].split()[:300]))
    return combine

#do a littel text clean with punctuations
def cleanpunctuation(s):
    for p in '!,.:;?':
        s=s.replace(' '+p,p)
    s=s.replace(' '+'n\'t','n\'t')
    s=s.replace(' '+'\'s','\'s')
    s=s.replace(' '+'\'re','\'re')
    s=s.replace(' '+'\'ve','\'ve')
    s=s.replace(' '+'\'ll','\'ll')
    s=s.replace(' '+'\'am','\'am')
    s=s.replace(' '+'\'m','\'m')
    s=s.replace(' '+'\' m','\'m')
    s=s.replace(' '+'\'m','\'m')
    s=s.replace(' '+'\' ve','\'ve')
    s=s.replace(' '+'\' s','\'s')
    s=s.replace('<newline>','\n')
    return s   

train_text=combinetext('valid.wp_source', 'valid.wp_target')
train_text=list(map(cleanpunctuation,train_text))
valid_text=combinetext('test.wp_source', 'test.wp_target')
valid_text=list(map(cleanpunctuation,valid_text))

In [None]:
# Example about the question and thestory
train_text[6]

"[ WP ] Everyone in the world has magic with various levels of mastery over it. You are extremely powerful with almost no control so you find a demon that's very weak but extremely good at controlling his powers. <sep> `` Imagine you're in a field. '' Green extends in all directions. `` You're alone, the earth is flat, and the blue sky touches the horizon. '' Blue shoots from the ground, arcing overhead. `` The sun appears, tiny in the sky. '' There's a bright light, rays casting shadow behind me. `` What color is it? '' \n \n `` Yellow. '' It burns so brightly, winking playfully. \n \n `` Good. '' She licks her chapped lips, the sound distorting my tiny sun's light. `` Look ahead of you. There's a sheep. '' Something soft and downy wanders across the green, its shadow stretching far beyond the horizon. `` What color is it? '' \n \n My brows crease. `` Uh- '' \n \n `` What color is it? '' \n \n The green wavers. Baa baa black sheep, have you any wool? `` Uh. '' Mary had a little lamb, 

In [None]:
print("Length of the story in train text:",len(train_text))
print("Length of the story in valid tex:",len(valid_text))


Length of the story in train text: 15620
Length of the story in valid tex: 15138


## 2.2 Tokenize and download data




In [None]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token=tokenizer.eos_token

inputs_train = tokenizer(train_text, padding=True,truncation=True,max_length=args.max_seq_length)
inputs_valid=tokenizer(valid_text, padding=True,truncation=True,max_length=args.max_seq_length)

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

In [None]:
def create_labels(inputs):
    labels=[]
    for ids,attention_mask in zip(inputs['input_ids'],inputs['attention_mask']):
        label=ids.copy()
        real_len=sum(attention_mask)
        padding_len=len(attention_mask)-sum(attention_mask)
        label[:]=label[:real_len]+[-100]*padding_len
        labels.append(label)
    inputs['labels']=labels
    
create_labels(inputs_train)
create_labels(inputs_valid)


In [None]:
# Example
print(inputs_train['input_ids'][6])
print(inputs_train['attention_mask'][6])
print(inputs_train['labels'][6])


[58, 28993, 2361, 11075, 287, 262, 995, 468, 5536, 351, 2972, 2974, 286, 30677, 625, 340, 13, 921, 389, 4457, 3665, 351, 2048, 645, 1630, 523, 345, 1064, 257, 3222, 326, 338, 845, 4939, 475, 4457, 922, 379, 12755, 465, 5635, 13, 1279, 325, 79, 29, 7559, 18450, 345, 821, 287, 257, 2214, 13, 10148, 3469, 14582, 287, 477, 11678, 13, 7559, 921, 821, 3436, 11, 262, 4534, 318, 6228, 11, 290, 262, 4171, 6766, 18105, 262, 17810, 13, 10148, 4518, 20611, 422, 262, 2323, 11, 610, 2259, 16965, 13, 7559, 383, 4252, 3568, 11, 7009, 287, 262, 6766, 13, 10148, 1318, 338, 257, 6016, 1657, 11, 24823, 13092, 9082, 2157, 502, 13, 7559, 1867, 3124, 318, 340, 30, 10148, 220, 198, 220, 198, 7559, 12550, 13, 10148, 632, 20246, 523, 35254, 11, 266, 8040, 711, 2759, 13, 220, 198, 220, 198, 7559, 4599, 13, 10148, 1375, 300, 3378, 607, 442, 6320, 11914, 11, 262, 2128, 1233, 24707, 616, 7009, 4252, 338, 1657, 13, 7559, 6803, 4058, 286, 345, 13, 1318, 338, 257, 15900, 13, 10148, 13742, 2705, 290, 866, 88, 11569, 36

## 2.3 Story Data

In [None]:
class StoryDataset:
    def __init__(self, inputs):
        self.ids = inputs['input_ids']
        self.attention_mask = inputs['attention_mask']
        self.labels=inputs['labels']

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, item):

        return [torch.tensor(self.ids[item], dtype=torch.long),
                torch.tensor(self.attention_mask[item], dtype=torch.long),
                torch.tensor(self.labels[item], dtype=torch.long)]
            

## 2.4 Create Train and Valid DataLoader

In [None]:
train_batch_size=args.train_batch_size
valid_batch_size=args.valid_batch_size
traindata=StoryDataset(inputs_train)
train_dataloader = torch.utils.data.DataLoader(
    traindata,
    shuffle=False,
    batch_size=train_batch_size)

validdata=StoryDataset(inputs_valid)
valid_dataloader = torch.utils.data.DataLoader(
    validdata,
    shuffle=False,
    batch_size=valid_batch_size)

# 3. Model & Tuning

## 3.1 Create story


In [None]:
model = GPT2LMHeadModel.from_pretrained('gpt2')

Downloading pytorch_model.bin:   0%|          | 0.00/548M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

In [None]:
model.to('cuda')
model.eval()
eval_loss=[]
for inputs in tqdm(valid_dataloader, desc="eval"):
    d1,d2,d3=inputs
    d1=d1.to('cuda')        
    d2=d2.to('cuda')
    d3=d3.to('cuda')

    with torch.no_grad():
        output = model(input_ids=d1, attention_mask=d2,labels=d3)
        batch_loss=output[0]
    eval_loss+=[batch_loss.cpu().item()]
    del batch_loss
eval_loss=np.mean(eval_loss)
perplexity=math.exp(eval_loss)
print(f'The average perplexity for valid dataset before fine-tuning is {perplexity}') 

eval: 100%|██████████| 3785/3785 [12:27<00:00,  5.07it/s]

The average perplexity for valid dataset before fine-tuning is 39.27880379562042





In [None]:
# Print model parameters
print('Number of model parameters: {:,}'.format(sum([p.nelement() for p in model.parameters()])))

Number of model parameters: 124,439,808


## 3.2 Generate story before fine-tuned

In [None]:
prompt=valid_text[300][:valid_text[300].find('<sep>')]
target=valid_text[300][valid_text[300].find('<sep>')+5:]

def generate_story(prompt,target,k=0,p=0.9,output_length=300,temperature=1,num_return_sequences=3,repetition_penalty=1.0):
    print("====prompt====\n")
    print(prompt+"\n")
    print('====target story is as below===\n')
    print(target+"\n")
    encoded_prompt = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
    model.to('cpu')
    model.eval()
    output_sequences = model.generate(
        input_ids=encoded_prompt,
        max_length=output_length,
        temperature=temperature,
        top_k=k,
        top_p=p,
        repetition_penalty=repetition_penalty,
        do_sample=True,
        num_return_sequences=num_return_sequences
    )
    if len(output_sequences.shape) > 2:
        output_sequences.squeeze_()
    for generated_sequence_idx, generated_sequence in enumerate(output_sequences):
        print("=== GENERATED SEQUENCE {} ===".format(generated_sequence_idx + 1))
        generated_sequence = generated_sequence.tolist()
        # Decode text
        text = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True)
        # Remove all text after eos token
        text = text[: text.find(tokenizer.eos_token)]
        print(text)

generate_story(prompt,target)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


====prompt====

Children's logic dictates the way the world works. [ WP ] 

====target story is as below===

 “ That ’ s not an option I ’ m currently willing to exercise. ” 
 
 I pinch the bridge of my nose to stave off the headache building behind my eyes. If this goes on much longer, I ’ m gon na have to start to start cutting back on the vegetables. 
 
 “ She ’ s dangerous, Jimmy. You know that. You ’ ve seen it. Dealt with it first hand. She just doesn ’ t play by anyone ’ s rules. ” 
 
 Ali finished off her sucker and unwrapped a fresh one, offering it to me. I declined. I ’ d sworn off the things after my third cavity scare. That one saw me at the dentist for the third time in as many months. I don ’ t care what my dad says, I know that guy is evil. Who owns a drill like that? A murderer, that ’ s who. I still hear the damn thing in my nightmares. 
 
 While she savored the smooth flavor of blue-raspberry, I pondered her words. We both knew she was right. The situation was spiral

## 3.3 Tuning the parameter

In [None]:
num_train_epochs = args.num_train_epochs
training_steps_per_epoch=len(train_dataloader)
total_num_training_steps = int(training_steps_per_epoch*num_train_epochs)
weight_decay=0
learning_rate=args.learning_rate
adam_epsilon=1e-8
warmup_steps=int(total_num_training_steps*args.warmup)
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
        "weight_decay": weight_decay,
    },
    {
        "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]
optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate, eps=adam_epsilon)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_num_training_steps
)



## 3.4 Train the data 
Train with 3 epoch and 4 both train batch size and valid batch size

In [None]:
print("***** Running training *****")
print("  Total_num_training_step = {}".format(total_num_training_steps))
print("  Num Epochs = {}".format(num_train_epochs))
print(f"  Train_batch_size per device = {train_batch_size}")
print(f"  Valid_batch_size per device = {valid_batch_size}")
model.to('cuda')
for epoch in range(num_train_epochs):
    print(f"Start epoch {epoch+1} of {num_train_epochs}")
    train_loss=0
    epoch_iterator = tqdm(train_dataloader,desc='Iteration')
    model.train()
    model.zero_grad()    
    for _, inputs in enumerate(epoch_iterator):        
        d1,d2,d3=inputs
        d1=d1.to('cuda')
        d2=d2.to('cuda')
        d3=d3.to('cuda')
        output = model(input_ids=d1, attention_mask=d2,labels=d3)
        batch_loss=output[0]
        batch_loss.backward()
        optimizer.step()
        scheduler.step()
        model.zero_grad()
        train_loss+=batch_loss.item()
        epoch_iterator.set_description('(batch loss=%g)' % batch_loss.item())
        del batch_loss
    print(f'Average train loss per example={train_loss/training_steps_per_epoch} in epoch{epoch+1}')    
    print(f'Starting evaluate after epoch {epoch+1}')
    eval_loss=[]    
    model.eval()    
    for inputs in tqdm(valid_dataloader, desc="eval"):
        d1,d2,d3=inputs
        d1=d1.to('cuda')        
        d2=d2.to('cuda')
        d3=d3.to('cuda')
        with torch.no_grad():
            output = model(input_ids=d1, attention_mask=d2,labels=d3)
            batch_loss=output[0]
        eval_loss+=[batch_loss.cpu().item()]
        del batch_loss
    eval_loss=np.mean(eval_loss)
    perplexity=math.exp(eval_loss)
    print(f'Average valid loss per example={eval_loss} in epoch{epoch+1}')    
    print(f'Perplextiy for valid dataset in epoch{epoch+1} is {perplexity}')
    

***** Running training *****
  Total_num_training_step = 11715
  Num Epochs = 3
  Train_batch_size per device = 4
  Valid_batch_size per device = 4
Start epoch 1 of 3


(batch loss=2.864): 100%|██████████| 3905/3905 [38:43<00:00,  1.68it/s]


Average train loss per example=3.3024861756421235 in epoch1
Starting evaluate after epoch 1


eval: 100%|██████████| 3785/3785 [12:28<00:00,  5.06it/s]


Average valid loss per example=3.195605313730681 in epoch1
Perplextiy for valid dataset in epoch1 is 24.424953978503165
Start epoch 2 of 3


(batch loss=2.72499): 100%|██████████| 3905/3905 [38:48<00:00,  1.68it/s]


Average train loss per example=3.1011770549412727 in epoch2
Starting evaluate after epoch 2


eval: 100%|██████████| 3785/3785 [12:29<00:00,  5.05it/s]


Average valid loss per example=3.188751458806752 in epoch2
Perplextiy for valid dataset in epoch2 is 24.258121264105988
Start epoch 3 of 3


(batch loss=2.69443): 100%|██████████| 3905/3905 [38:44<00:00,  1.68it/s]


Average train loss per example=3.0045577795343728 in epoch3
Starting evaluate after epoch 3


eval: 100%|██████████| 3785/3785 [12:29<00:00,  5.05it/s]

Average valid loss per example=3.1864533291467745 in epoch3
Perplextiy for valid dataset in epoch3 is 24.202436965510298





## 3.5 Generate stories
Use the fine-tuened model to generate stories with the same prompt I used before fine-tuning.

In [None]:
prompt=valid_text[300][:valid_text[300].find('<sep>')]
target=valid_text[300][valid_text[300].find('<sep>')+5:]
generate_story(prompt,target)

====prompt====

Children's logic dictates the way the world works. [ WP ] 

====target story is as below===

 “ That ’ s not an option I ’ m currently willing to exercise. ” 
 
 I pinch the bridge of my nose to stave off the headache building behind my eyes. If this goes on much longer, I ’ m gon na have to start to start cutting back on the vegetables. 
 
 “ She ’ s dangerous, Jimmy. You know that. You ’ ve seen it. Dealt with it first hand. She just doesn ’ t play by anyone ’ s rules. ” 
 
 Ali finished off her sucker and unwrapped a fresh one, offering it to me. I declined. I ’ d sworn off the things after my third cavity scare. That one saw me at the dentist for the third time in as many months. I don ’ t care what my dad says, I know that guy is evil. Who owns a drill like that? A murderer, that ’ s who. I still hear the damn thing in my nightmares. 
 
 While she savored the smooth flavor of blue-raspberry, I pondered her words. We both knew she was right. The situation was spiral

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


=== GENERATED SEQUENCE 1 ===
Children's logic dictates the way the world works. [ WP ] 
 - -- - 
 
 We are robots: we are the technicians that make our world safe. Sometimes you have to hand it to us. We serve the government. The machines are not robots: they are *humans*. 
 
 When the drones blasted our world, we could make their advances, and even go back and adapt. The machines were making it seem like *our* community lives. It felt like a community *inside* us. We talked to them. We learned their jokes. We tried to steal their culture. They were alive. We just 'd gone home. Now we needed to help! 
 
 It was nothing too special for us, just another robot locked in a lab and locked out of our own creation. But our power is what Sawyer thought of most: something his people used. A way to not be afraid. A way to improve themselves. 
 
 These other creations aren't so great for us. The same goes for the humans. Our ancestors moved further and farther out west, not to defend our world fr

In [None]:
def get_generation_with_target(prompt,target,k=0,p=0.9,output_length=300,temperature=1,num_return_sequences=3,repetition_penalty=1.0):
    print("====prompt====\n")
    print(prompt+"\n")
    print('====target story is as below===\n')
    print(target+"\n")
    encoded_prompt = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
    model.to('cpu')
    model.eval()
    output_sequences = model.generate(
        input_ids=encoded_prompt,
        max_length=output_length,
        temperature=temperature,
        top_k=k,
        top_p=p,
        repetition_penalty=repetition_penalty,
        do_sample=True,
        num_return_sequences=num_return_sequences
    )
    if len(output_sequences.shape) > 2:
        output_sequences.squeeze_()
    for generated_sequence_idx, generated_sequence in enumerate(output_sequences):
        print("=== GENERATED SEQUENCE {} ===".format(generated_sequence_idx + 1))
        generated_sequence = generated_sequence.tolist()
        # Decode text
        text = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True)
        # Remove all text after eos token
        text = text[: text.find(tokenizer.eos_token)]
        print(text)

In [None]:
promt = "Once upon a time, in a faraway land, there lived a young prince named Alexander. He was kind, brave, and intelligent, and his people loved him dearly. One day, while taking a walk in the forest, he stumbled upon a beautiful maiden in distress. She was being attacked by a fierce dragon, and the prince knew he had to do something to help her. Drawing his sword, he charged at the dragon, ready to engage in battle. [ WP ] "
target = "The dragon was slain by the prince, and the maiden was saved. The prince and the maiden fell in love and got married. They lived happily ever after. [ WP ] "

get_generation_with_target(promt,target,k=0,p=0.9,output_length=300,temperature=1,num_return_sequences=3,repetition_penalty=1.0)

====prompt====

Once upon a time, in a faraway land, there lived a young prince named Alexander. He was kind, brave, and intelligent, and his people loved him dearly. One day, while taking a walk in the forest, he stumbled upon a beautiful maiden in distress. She was being attacked by a fierce dragon, and the prince knew he had to do something to help her. Drawing his sword, he charged at the dragon, ready to engage in battle. [ WP ] 

====target story is as below===

The dragon was slain by the prince, and the maiden was saved. The prince and the maiden fell in love and got married. They lived happily ever after. [ WP ] 



The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


=== GENERATED SEQUENCE 1 ===
Once upon a time, in a faraway land, there lived a young prince named Alexander. He was kind, brave, and intelligent, and his people loved him dearly. One day, while taking a walk in the forest, he stumbled upon a beautiful maiden in distress. She was being attacked by a fierce dragon, and the prince knew he had to do something to help her. Drawing his sword, he charged at the dragon, ready to engage in battle. [ WP ] 
 <sep> Alexander had gone back to his home village, and found himself quickly in trouble. He figured he was near the castle, or at least he thought he was. After the incident, his village was hit by a terrifying maelstrom, and he started getting ill. 
 `` Ah, stranger, suppose you don't need medicine anymore? '' 
 `` No, but I'm still yours, and you don't need to worry. I heard your name, professor. '' 
 `` Oh yeah, I was thinking about you too! '' 
 
 He turned around and hid in the bushes, where his cloak he had been wearing sat concealed i

# 4. Evalute the model
## 4.1 Using BLEU score to evalute 

In [None]:
import torch
torch.cuda.empty_cache() # Release some RAM of GPU 

In [None]:
# Evaluate the model with BLEU score
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
smoothie = SmoothingFunction().method4
model.eval()
model.to('cuda')
bleu_score=[]
for inputs in tqdm(valid_dataloader, desc="eval"):
    d1,d2,d3=inputs
    d1=d1.to('cuda')        
    d2=d2.to('cuda')
    d3=d3.to('cuda')
    with torch.no_grad():
        output = model.generate(input_ids=d1, attention_mask=d2)
    output=output.cpu().numpy()
    output=output.tolist()
    for i in range(len(output)):
        output[i]=tokenizer.decode(output[i], clean_up_tokenization_spaces=True)
        output[i]=output[i][:output[i].find(tokenizer.eos_token)]
        target=valid_text[i][valid_text[i].find('<sep>')+5:]
        bleu_score+=[sentence_bleu([target.split()],output[i].split(),smoothing_function=smoothie)]
        
print(f'\nAverage BLEU score for valid dataset is {np.mean(bleu_score)*100}')

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
eval:  56%|█████▌    | 2119/3785 [07:08<05:39,  4.90it/s]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
Input length of input_ids is 512, but `max_length` is set to 20. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
eval:  56%|█████▌    | 2120/3785 [07:08<05:39,  4.91it/s]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
Input length of input_ids is 512, but `max_length` is set to 20. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
eval:  56%|█████▌    | 2121


Average BLEU score for valid dataset is 0.9755052289628869





# 5 Deploy to Hugging Face


In [None]:
from huggingface_hub import login
login('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')

In [None]:
model.push_to_hub('baotoan2002/GPT-2')


In [None]:
tokenizer.push_to_hub('baotoan2002/GPT-2', private=True, use_auth_token=True)

# 6 Pipeline
You can use this model directly with a pipeline for text generation.

In [5]:
# Pipelines can also be loaded from the hub
from transformers import pipeline
generator = pipeline('text-generation', model='baotoan2002/GPT-2')
generator("Once upon a time,", max_length=30, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Once upon a time, everyone had the ability to gain super powers which could grant them specific abilities based on their life circumstances. When you die, you'},
 {'generated_text': "Once upon a time, you can buy a weapon and it gives you true power! However, because you have the `` power '' it is a little"},
 {'generated_text': 'Once upon a time, the Roman Empire was a vast and oppressive empire. Powerful wizards and villains rose up against the overwhelming power that was the Roman Empire'},
 {'generated_text': 'Once upon a time, mankind had a sentient intelligent life form, but all of its behaviors was completely custom built for the purposes of military culture and entertainment'},
 {'generated_text': 'Once upon a time, the sky was green. The moon was pink. The stars were blue. And there was no sea, no sky. Life'}]