# Using GPT for text generation

In [None]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.11.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    F

In [None]:
# basic generation

In [None]:
from transformers import pipeline

In [None]:
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-1.3B')

In [None]:
text = "Object Detection is an useful skill in computer vision"
result = generator(text, max_length=100, do_sample=True, temperature=0.9, num_beams=5, top_p=1.0)
print(result[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Object Detection is an useful skill in computer vision. It is used to detect objects in images. For example, object detection is used to detect objects in images, such as people, animals, vehicles, and the like.
Object detection can be used in a variety of applications. For example, object detection can be used to detect objects in images, such as people, animals, vehicles, and the like. Object detection can also be used to detect objects in videos. For example, object detection can
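
The `temperature` and `top_p` arguments in the call above reshape the next-token distribution before a token is drawn. The block below is a minimal pure-Python sketch of that idea, not the `transformers` implementation: the function names and the four-element `toy_logits` list are made up for illustration, and the beam search enabled by `num_beams=5` is not modelled.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalise over the kept tokens; all other tokens get probability 0.
    return {i: probs[i] / mass for i in kept}


def sample_next_token(logits, temperature=0.9, top_p=1.0, rng=random):
    """Draw one token index from the scaled and filtered distribution."""
    nucleus = top_p_filter(apply_temperature(logits, temperature), top_p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
print(apply_temperature(toy_logits, 0.9))
print(top_p_filter(apply_temperature(toy_logits, 0.9), 0.8))
print(sample_next_token(toy_logits, temperature=0.9, top_p=0.8))
```

In this sketch, `top_p=1.0` (the value used above) keeps every token, so only the temperature changes the distribution. A temperature below 1 sharpens it and a temperature above 1 flattens it.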


# Fine-tuning

In [None]:
!unzip /content/archive.zip -d archive

Archive:  /content/archive.zip
  inflating: archive/arxivData.json  


In [None]:
import pandas as pd
import torch
from torch.utils.data import Dataset, random_split
from transformers import GPT2Tokenizer, TrainingArguments, Trainer, GPT2LMHeadModel

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium', bos_token='<|startoftext|>',
                                          eos_token='<|endoftext|>', pad_token='<|pad|>')
model = GPT2LMHeadModel.from_pretrained('gpt2-medium').cuda()
model.resize_token_embeddings(len(tokenizer))

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Embedding(50259, 1024)

In [None]:
descriptions = pd.read_json('/content/archive/arxivData.json')['summary']

In [None]:
descriptions.head()

0    We propose an architecture for VQA which utili...
1    Recent approaches based on artificial neural n...
2    We introduce the multiresolution recurrent neu...
3    Multi-task learning is motivated by the observ...
4    We present MILABOT: a deep reinforcement learn...
Name: summary, dtype: object

In [None]:
# Longest abstract in tokens; the result for this dataset is 757 (printed below)
max_length = 0
for description in descriptions:
    max_length = max(len(tokenizer.encode(description)), max_length)

In [None]:
print(max_length)

757


In [None]:
class ArchiveDataset(Dataset):
    """Tokenizes each text once and stores fixed-length ids and attention masks."""

    def __init__(self, txt_list, tokenizer, max_length):
        self.input_ids = []
        self.attn_masks = []
        for txt in txt_list:
            # Wrap the text in start/end tokens, then truncate or pad to max_length
            encodings_dict = tokenizer('<|startoftext|>' + txt + '<|endoftext|>', truncation=True,
                                       max_length=max_length, padding="max_length")
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        # Labels are built later in the data collator from input_ids
        return self.input_ids[idx], self.attn_masks[idx]
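
To make the effect of `padding="max_length"` concrete, here is a small pure-Python sketch of what one padded example and its attention mask look like. The helper names and the token ids are placeholders chosen for illustration, not values read from the tokenizer. The second helper shows an optional variation that this notebook does not use: replacing padded label positions with `-100`, the value PyTorch's cross-entropy loss ignores by default, so that padding would not count toward the training loss.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def pad_example(token_ids, max_length, pad_id):
    """Truncate or right-pad token_ids to max_length and build the matching mask."""
    ids = list(token_ids[:max_length])
    n_real = len(ids)
    n_pad = max_length - n_real
    input_ids = ids + [pad_id] * n_pad
    attention_mask = [1] * n_real + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids as labels, replacing padded positions with IGNORE_INDEX."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]


# Placeholder ids: the first and last stand in for start/end tokens, 50258 for padding.
ids, mask = pad_example([50257, 11, 22, 33, 50256], max_length=8, pad_id=50258)
print(ids)
print(mask)
print(mask_padding_labels(ids, mask))
```

With tensors, the same masking could be written as `labels[attention_mask == 0] = -100` on a copy of `input_ids`. Whether it helps here is an assumption that would need to be checked by retraining.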

In [None]:
# max_length is the 757 computed above; 10% of the examples are held out for validation
dataset = ArchiveDataset(descriptions, tokenizer, max_length=max_length)
train_size = int(0.9 * len(dataset))
train_dataset, val_dataset = random_split(dataset, [train_size, len(dataset) - train_size])

In [None]:
training_args = TrainingArguments(output_dir='./results', num_train_epochs=1, logging_steps=100, save_steps=5000,
                                  per_device_train_batch_size=1, per_device_eval_batch_size=1,
                                  warmup_steps=10, weight_decay=0.05, logging_dir='./logs', report_to = 'none')

In [None]:
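def collate(batch):
    # Each item is (input_ids, attention_mask); labels are the same ids as input_ids
    input_ids = torch.stack([item[0] for item in batch])
    attention_mask = torch.stack([item[1] for item in batch])
    return {'input_ids': input_ids, 'attention_mask': attention_mask, 'labels': input_ids}

Trainer(model=model, args=training_args, train_dataset=train_dataset,
        eval_dataset=val_dataset, data_collator=collate).train()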

***** Running training *****
  Num examples = 36900
  Num Epochs = 1
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 36900


Step    Training Loss
 100    2.8065
 200    0.984
 300    0.9027
 400    0.9743
 500    0.9212
 600    0.9349
 700    0.9056
 800    0.92
 900    0.9389
1000    0.9509


Saving model checkpoint to ./results/checkpoint-5000
Configuration saved in ./results/checkpoint-5000/config.json
Model weights saved in ./results/checkpoint-5000/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-10000
Configuration saved in ./results/checkpoint-10000/config.json
Model weights saved in ./results/checkpoint-10000/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-15000
Configuration saved in ./results/checkpoint-15000/config.json
Model weights saved in ./results/checkpoint-15000/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-20000
Configuration saved in ./results/checkpoint-20000/config.json
Model weights saved in ./results/checkpoint-20000/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-25000
Configuration saved in ./results/checkpoint-25000/config.json
Model weights saved in ./results/checkpoint-25000/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-30000
Configuration saved in ./resu

TrainOutput(global_step=36900, training_loss=0.8322155838219454, metrics={'train_runtime': 30095.459, 'train_samples_per_second': 1.226, 'train_steps_per_second': 1.226, 'total_flos': 5.06673342001152e+16, 'train_loss': 0.8322155838219454, 'epoch': 1.0})
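
The step count and throughput in the log follow directly from the dataset size and batch settings. The block below is a simplified sketch of that arithmetic, assuming a single device; `total_optimization_steps` is a hypothetical helper, not a `transformers` function, and it ignores edge cases the `Trainer` handles internally.

```python
import math


def total_optimization_steps(num_examples, per_device_batch_size, num_devices=1,
                             grad_accum_steps=1, num_epochs=1):
    """Estimate optimizer updates: batches per epoch, grouped by accumulation."""
    effective_batch = per_device_batch_size * num_devices
    batches_per_epoch = math.ceil(num_examples / effective_batch)
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * num_epochs


steps = total_optimization_steps(num_examples=36900, per_device_batch_size=1)
runtime_seconds = 30095.459  # 'train_runtime' from the TrainOutput above

print(steps)                               # 36900
print(round(steps / runtime_seconds, 3))   # 1.226 steps per second
print(round(runtime_seconds / 3600, 2))    # 8.36 hours
```

With a batch size of 1, every example is its own optimizer step, which is why the run takes about eight hours. A larger `per_device_train_batch_size` would cut the step count proportionally, memory permitting.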

In [None]:
generated = tokenizer("<|startoftext|> ", return_tensors="pt").input_ids.cuda()
sample_outputs = model.generate(generated, do_sample=True, top_k=50, 
                                max_length=300, top_p=0.95, temperature=1.9, num_return_sequences=20)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [None]:
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

0:   On December 21-22 2013 General Shatter reported their surprise about our method
for computing $H\{S^C}_{\mu}$ and its complexity, at least theoretically compared
with other Bayesian methods, as an experiment with his method to study the impact in
its use on the accuracy with probability the observed measurements on some problems can
are obtained using one variable and many factors in statistical learning problems. Shatter
may take another option after having this experimenter (who is from physics). If so it may have in
hundreds of points for a simple task that in general is far from trivial to
be used so as to compare and to be compared compared with his approach based, with
severall being considered in various theoretical tools that use a classical
general method: an elliographical, the Euclan metric based statistical tool.
1:   Since time cannot run or travel faster than one does, the time for
applied science's "big data," the amount of time that it is needed to run
analytics or