## GPT for style completion

In [1]:
from transformers import (
    GPT2Tokenizer, TextDataset, DataCollatorForLanguageModeling, GPT2LMHeadModel, pipeline,
    Trainer, TrainingArguments
)


In [2]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')  # load the standard GPT-2 tokenizer

# GPT-2 has no pad token by default, so reuse the EOS token to pad shorter sequences in a batch
tokenizer.pad_token = tokenizer.eos_token

In [3]:
# load up our data into a dataset
pds_data = TextDataset(
    tokenizer=tokenizer,
    file_path='../data/PDS2.txt',  # Principles of Data Science - Sinan Ozdemir
    block_size=64  # length of each chunk of text to use as a datapoint
)



In [4]:
pds_data[0], pds_data[0].shape  # inspect the first point

(tensor([  200, 47231,  6418,   286,  6060,  5800,   198, 12211,  5061,   198,
           198,    32, 31516,   338,  5698,   284, 13905,  7605,   290,  4583,
           284,   198, 11249,   304,   171,   105,   222, 13967,  1366,    12,
         15808,  5479,   198,   198, 46200,   272, 18024,  9536,   343,   198,
         16012,   346, 31250,   671,   198,   198,  3483, 29138,  2751, 33363,
           532,   337,  5883,  4339,    40,   628,   200, 47231,  6418,   286,
          6060,  5800,   198, 12211]),
 torch.Size([64]))

In [255]:
print(tokenizer.decode(pds_data[0]))

Principles of Data Science
Second Edition

A beginner's guide to statistical techniques and theory to
build eﬀective data-driven applications

Sinan Ozdemir
Sunil Kakade

BIRMINGHAM - MUMBAI

Principles of Data Science
Second
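
The decoded block above stops mid-title because the text is cut purely by token count. A minimal sketch of that chunking, assuming the dataset tokenizes the whole file once and slices it into consecutive, non-overlapping blocks while discarding a short tail (`chunk_token_ids` and the toy `stream` are illustrative names, not part of the library):

```python
def chunk_token_ids(token_ids, block_size):
    """Cut one long token stream into consecutive, non-overlapping blocks."""
    blocks = []
    # Stop early enough that every block is full; a short tail is discarded.
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        blocks.append(token_ids[start:start + block_size])
    return blocks


stream = list(range(150))  # stand-in for the tokenized book
blocks = chunk_token_ids(stream, block_size=64)
print(len(blocks), [len(b) for b in blocks])
```

Because block boundaries ignore sentences and paragraphs, a datapoint can start or end anywhere in the text, which is what the first example shows.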


In [6]:
data_collator = DataCollatorForLanguageModeling(
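    tokenizer=tokenizer, mlm=False,
    # mlm=False selects causal (next-token) language modeling, which is what GPT-2 uses;
    # mlm=True would switch to masked language modeling (for BERT + auto-encoding tasks)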
)

In [7]:
# example of how collator pads data dynamically
collator_example = data_collator([tokenizer('I am an input'), tokenizer('So am I')])

collator_example

{'input_ids': tensor([[   40,   716,   281,  5128],
        [ 2396,   716,   314, 50256]]), 'attention_mask': tensor([[1, 1, 1, 1],
        [1, 1, 1, 0]]), 'labels': tensor([[  40,  716,  281, 5128],
        [2396,  716,  314, -100]])}

In [8]:
collator_example.input_ids  # 50256 is our pad token id

tensor([[   40,   716,   281,  5128],
        [ 2396,   716,   314, 50256]])

In [9]:
tokenizer.pad_token_id

50256

In [10]:
collator_example.attention_mask  # Note the 0 in the attention mask where we have a pad token

tensor([[1, 1, 1, 1],
        [1, 1, 1, 0]])
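
The padding behaviour shown above can be sketched in plain Python. This is a simplified illustration under the assumption that the collator pads to the longest sequence in the batch and masks labels wherever the id equals the pad id; `pad_batch` is a hypothetical helper, not the library's implementation:

```python
PAD_ID = 50256        # GPT-2's EOS id, reused as the pad id in this notebook
IGNORE_INDEX = -100   # label value that the loss skips


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask, labels = [], [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        ids = list(seq) + [pad_id] * n_pad
        input_ids.append(ids)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)
        # Labels copy the inputs, except that pad ids are replaced by -100
        labels.append([IGNORE_INDEX if tok == pad_id else tok for tok in ids])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


batch = pad_batch([[40, 716, 281, 5128], [2396, 716, 314]])
print(batch)
```

One consequence of masking by id in this sketch: since the pad token and the EOS token share id 50256, a genuine EOS token in the text would be masked out of the labels as well.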

In [11]:
collator_example.labels  # note the -100 to ignore loss calculation for the padded token
# Labels are shifted inside the GPT model so we don't need to worry about that

tensor([[  40,  716,  281, 5128],
        [2396,  716,  314, -100]])
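
To make the label shift and the `-100` convention concrete, here is a small pure-Python sketch of a causal language-modeling loss. It assumes the usual setup in which the logits at position t are scored against the label at position t + 1 and positions labelled `-100` are skipped; `causal_lm_loss` is an illustrative function, not the model's own code:

```python
import math

IGNORE_INDEX = -100


def causal_lm_loss(logits, labels):
    """Mean cross-entropy where position t predicts the label at t + 1."""
    losses = []
    for t in range(len(labels) - 1):
        target = labels[t + 1]
        if target == IGNORE_INDEX:
            continue  # padded position: contributes nothing to the loss
        row = logits[t]
        # Numerically stable log-sum-exp for the softmax normalizer
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        losses.append(log_z - row[target])
    return sum(losses) / len(losses) if losses else 0.0


vocab_size = 5
labels = [1, 3, 2, IGNORE_INDEX]                      # last position is padding
uniform_logits = [[0.0] * vocab_size for _ in labels]
loss = causal_lm_loss(uniform_logits, labels)
print(loss, math.log(vocab_size))
```

With uniform logits every counted position costs log(vocab_size), so the two printed values agree; the padded position changes nothing.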

In [12]:
model = GPT2LMHeadModel.from_pretrained('gpt2')  # load up a GPT2 model

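# NOTE: `config` is meant for a model config (a name or a config object), so this dict of
# generation settings may be ignored; the samples below look capped near the default
# max_length of 50. Passing the settings as keyword arguments when calling the generator
# is one alternative.
pretrained_generator = pipeline(  # create a generator with built-in params
    'text-generation', model=model, tokenizer='gpt2',
    config={'max_length': 200, 'do_sample': True, 'top_p': 0.9, 'temperature': 0.7, 'top_k': 10}
)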
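
The `temperature`, `top_k`, and `top_p` settings all reshape the next-token distribution before a token is sampled. A simplified sketch of how they interact, assuming the order temperature, then top-k, then top-p; `filter_next_token_probs` is a hypothetical helper and not the library's exact algorithm:

```python
import math


def filter_next_token_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    # Temperature < 1 sharpens the distribution, > 1 flattens it
    scaled = [x / temperature for x in logits]
    # Rank token ids from most to least likely
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]  # keep only the k most likely tokens
    peak = max(scaled[i] for i in order)
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for token_id, p in zip(order, probs):
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {token_id: p / norm for token_id, p in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]
print(filter_next_token_probs(toy_logits, temperature=0.7, top_k=3, top_p=0.9))
```

Lower `temperature`, smaller `top_k`, and smaller `top_p` each push generation toward the most likely tokens, trading variety for coherence.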

In [187]:
print('----------')
for generated_sequence in pretrained_generator('This dataset shows the relationship', num_return_sequences=3):
    print(generated_sequence['generated_text'])
    print('----------')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


----------
This dataset shows the relationship between the number of years that a student has left the room.
In this blog post, we will look at how to use this data to create user profiles for other departments
based on their use of KPI.

----------
This dataset shows the relationship between
weighting and each value for the model
Let's look at some basic relationships to predict the probability of a certain type of event:
data['id'] = z_test.mean()
# predict a
----------
This dataset shows the relationship between gender (as shown by the bar chart) and each major quantitative measure of medical attention
about the population:

[ 463 ]

Basic Statistics

Chapter 18

Now let's look at statistics
----------


In [14]:
training_args = TrainingArguments(
    output_dir="./gpt2_pds",  # the output directory
    overwrite_output_dir=True,  # overwrite the content of the output directory
    num_train_epochs=3,  # number of training epochs
    per_device_train_batch_size=32,  # batch size for training
    per_device_eval_batch_size=32,  # batch size for evaluation
    logging_steps=10,
    load_best_model_at_end=True,
    evaluation_strategy='epoch',
    save_strategy='epoch'
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=pds_data.examples[:int(len(pds_data.examples)*.8)],
    eval_dataset=pds_data.examples[int(len(pds_data.examples)*.8):]
)

trainer.evaluate()  # baseline: validation loss of the pretrained model before any fine-tuning

***** Running Evaluation *****
  Num examples = 470
  Batch size = 32


{'eval_loss': 4.5039801597595215,
 'eval_runtime': 194.2009,
 'eval_samples_per_second': 2.42,
 'eval_steps_per_second': 0.077}

In [15]:
trainer.train()

***** Running training *****
  Num examples = 1878
  Num Epochs = 3
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 177


Epoch,Training Loss,Validation Loss
1,3.3165,3.481012
2,3.0561,3.44698
3,2.896,3.451081


***** Running Evaluation *****
  Num examples = 470
  Batch size = 32
Saving model checkpoint to ./gpt2_pds/checkpoint-59
Configuration saved in ./gpt2_pds/checkpoint-59/config.json
Model weights saved in ./gpt2_pds/checkpoint-59/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 470
  Batch size = 32
Saving model checkpoint to ./gpt2_pds/checkpoint-118
Configuration saved in ./gpt2_pds/checkpoint-118/config.json
Model weights saved in ./gpt2_pds/checkpoint-118/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 470
  Batch size = 32
Saving model checkpoint to ./gpt2_pds/checkpoint-177
Configuration saved in ./gpt2_pds/checkpoint-177/config.json
Model weights saved in ./gpt2_pds/checkpoint-177/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from ./gpt2_pds/checkpoint-118 (score: 3.4469802379608154).


TrainOutput(global_step=177, training_loss=3.169361955028469, metrics={'train_runtime': 8020.7923, 'train_samples_per_second': 0.702, 'train_steps_per_second': 0.022, 'total_flos': 184014913536000.0, 'train_loss': 3.169361955028469, 'epoch': 3.0})
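
The numbers in the training log can be reproduced with a little arithmetic. The sketch below uses only values printed above; the total of 2348 blocks is inferred from the two "Num examples" lines rather than stated anywhere in the notebook:

```python
import math

n_train, n_eval = 1878, 470      # "Num examples" from the logs above
n_total = n_train + n_eval       # inferred size of the full dataset
batch_size, n_epochs = 32, 3

# The 80/20 slice used for train_dataset and eval_dataset
split = int(n_total * 0.8)

# The last batch of each epoch is partial, hence the ceiling
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * n_epochs

# load_best_model_at_end picks the checkpoint with the lowest validation loss
val_loss = {1: 3.481012, 2: 3.44698, 3: 3.451081}
best_epoch = min(val_loss, key=val_loss.get)
best_checkpoint = f"checkpoint-{best_epoch * steps_per_epoch}"

print(split, steps_per_epoch, total_steps, best_checkpoint)
```

This lines up with the checkpoints saved at steps 59, 118, and 177, and with the log line that reloads the epoch 2 checkpoint.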

In [16]:
trainer.evaluate()  # validation loss bottomed out at epoch 2 (3.447) and rose slightly at epoch 3, so the epoch 2 checkpoint was reloaded

***** Running Evaluation *****
  Num examples = 470
  Batch size = 32


{'eval_loss': 3.4469802379608154,
 'eval_runtime': 186.6424,
 'eval_samples_per_second': 2.518,
 'eval_steps_per_second': 0.08,
 'epoch': 3.0}
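
Cross-entropy loss is easier to interpret as perplexity, which is the exponential of the mean per-token loss (assuming the loss is measured in nats, as is standard for this kind of cross-entropy). A short sketch using the two `eval_loss` values reported in this notebook:

```python
import math


def perplexity(mean_cross_entropy):
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


before = perplexity(4.5039801597595215)  # pretrained GPT-2 on the held-out blocks
after = perplexity(3.4469802379608154)   # best fine-tuned checkpoint

print(f"before: {before:.1f}  after: {after:.1f}")
```

Fine-tuning cuts perplexity from roughly 90 to roughly 31, meaning the model is far less surprised by held-out text from the book.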

In [17]:
trainer.save_model()

Saving model checkpoint to ./gpt2_pds
Configuration saved in ./gpt2_pds/config.json
Model weights saved in ./gpt2_pds/pytorch_model.bin


In [18]:
loaded_model = GPT2LMHeadModel.from_pretrained('./gpt2_pds')

finetuned_generator = pipeline(
    'text-generation', model=loaded_model, tokenizer=tokenizer,
    config={'max_length': 200, 'do_sample': True, 'top_p': 0.9, 'temperature': 0.7, 'top_k': 10}
)

loading configuration file ./gpt2_pds/config.json
Model config GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "do_sample": true,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "max_length": 50,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "torch_dtype": "float32",
  "transformers_version": "4.19.4",
  "use_cache": true,
  

In [186]:
# generations are now consistently about data science
print('----------')
for generated_sequence in finetuned_generator('This dataset shows the relationship', num_return_sequences=3):
    print(generated_sequence['generated_text'])
    print('----------')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


----------
This dataset shows the relationship between education and age.

This data was obtained from
pf2scs.kdf using the KdfTree method of classification, which is cross-validated with Python and the
PdfFrame utility
----------
This dataset shows the relationship between COS and our sample distribution, we can easily see on the graph:
This leads us to the next big thing: we can visualize variables as a single line graph. This graph has a cross-sectional length of
----------
This dataset shows the relationship between
height and length of the data set:

(b_height = height[1] == 0)
So, that in our case is about 4200
people in height, which means that
I
----------
