We’ll compare similar-sized versions of both models on the same prompt, so that the main difference is the pretraining dataset, and we’ll use the text-generation pipeline to investigate the model outputs:

In [1]:
from transformers import pipeline, set_seed
generation_gpt = pipeline("text-generation", model="openai-gpt")
generation_gpt2 = pipeline("text-generation", model="gpt2")

Some weights of OpenAIGPTLMHeadModel were not initialized from the model checkpoint at openai-gpt and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Next, let’s create a simple function to count the number of parameters in each model:

In [2]:
def model_size(model):
    return sum(t.numel() for t in model.parameters())
print(f"GPT  size: {model_size(generation_gpt.model)/1000**2:.1f}M parameters")
print(f"GPT2 size: {model_size(generation_gpt2.model)/1000**2:.1f}M parameters")

GPT  size: 116.5M parameters
GPT2 size: 124.4M parameters
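As a rough cross-check on these counts, the totals can be reproduced by hand from the architecture. The sketch below is an estimate, not part of the original notebook: the hyperparameters (vocabulary size, context length, hidden size, number of layers) are assumptions taken from the publicly documented default configurations of the two checkpoints, and the helper name `decoder_param_count` is hypothetical. It also assumes the output projection shares its weights with the token embedding, so it is counted once.

```python
def decoder_param_count(vocab_size, n_positions, d_model, n_layers, final_layer_norm):
    """Estimate the parameter count of a GPT-style decoder (tied output embedding)."""
    # Token embeddings plus learned position embeddings
    embeddings = vocab_size * d_model + n_positions * d_model
    # Each layer norm has a scale and a bias vector
    layer_norm = 2 * d_model
    # Fused query/key/value projection plus the attention output projection (weights + biases)
    attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Two-layer feed-forward network with a 4x wider hidden dimension (weights + biases)
    feed_forward = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    per_layer = 2 * layer_norm + attention + feed_forward
    total = embeddings + n_layers * per_layer
    if final_layer_norm:
        total += layer_norm
    return total

# Assumed hyperparameters (not stated in the text above)
gpt_params = decoder_param_count(40478, 512, 768, 12, final_layer_norm=False)
gpt2_params = decoder_param_count(50257, 1024, 768, 12, final_layer_norm=True)

print(f"GPT  estimate: {gpt_params/1000**2:.1f}M parameters")
print(f"GPT2 estimate: {gpt2_params/1000**2:.1f}M parameters")
```

Under these assumptions the estimates round to the same 116.5M and 124.4M reported by `model_size`. Both models use the same number of parameters per layer, so the gap comes almost entirely from GPT-2's larger vocabulary and longer context window.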


The original GPT model is about the same size as the smallest GPT-2 model. Now we
can generate three different completions from each model, each with the same input
prompt:

In [3]:
def enum_pipeline_outputs(pipe, prompt, num_return_sequences):
    # Generate several completions and format them as a numbered list
    out = pipe(prompt, num_return_sequences=num_return_sequences,
               clean_up_tokenization_spaces=True)
    return "\n".join(f"{i+1}." + s["generated_text"] for i, s in enumerate(out))
prompt = "\nWhen they came back"
print("GPT completions:\n" + enum_pipeline_outputs(generation_gpt, prompt, 3))
print("")
print("GPT-2 completions:\n" + enum_pipeline_outputs(generation_gpt2, prompt, 3))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


GPT completions:
1.
When they came back inside, i was glad they weren't the girls'friend. and they seemed okay. 
 " that was good, " i said to chase after he left. 
 " i like it. they seem happy. " 
 i
2.
When they came back on, they never came back. the other thing was, they were never coming back. 
 the world had moved on to other possibilities. in the meantime, they had been too damn smart for their own good. 
 and now
3.
When they came back. 
 when they got it, they opened the doors. they did not have to wait long to find out who the other person was. they saw the tall stranger with black hair on his head. he was wearing a black suit

GPT-2 completions:
1.
When they came back at their door, in the darkness, Mrs. Smith had her husband's foot on her shoulder. She was surprised so soon. Then suddenly she had a vision. It was a woman's voice that told her that her daughter
2.
When they came back to the main road, they realized the front door was closed, and the police were still there 

In [4]:
from datasets import load_dataset, DownloadConfig
download_config = DownloadConfig(delete_extracted=True)
dataset = load_dataset("./codeparrot", split="train",
                       download_config=download_config)


In [5]:
import psutil
import os

print(f"Number of Python files in dataset : {len(dataset)}")
ds_size = sum(os.stat(f["filename"]).st_size for f in dataset.cache_files)
# os.stat().st_size is expressed in bytes, so we convert to GB
print(f"Dataset size (cache file) : {ds_size / 2**30:.2f} GB")
# Process.memory_info().rss is expressed in bytes, so we shift by 20 bits to convert to MB
print(f"RAM used: {psutil.Process(os.getpid()).memory_info().rss >> 20} MB")

Number of Python files in dataset : 18695559
Dataset size (cache file) : 183.59 GB
RAM used: 742 MB
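The unit conversions in the cell above are worth spelling out. The following is a minimal sketch with hypothetical helper names and a made-up byte count: dividing by `2**30` gives binary gigabytes (strictly, GiB), while `>> 20` is an integer division by `2**20` that discards any fractional megabyte.

```python
def bytes_to_gb(n_bytes):
    # Floating-point division by 2**30 (1,073,741,824 bytes per binary gigabyte)
    return n_bytes / 2**30

def bytes_to_mb(n_bytes):
    # Right-shifting by 20 bits is the same as n_bytes // 2**20
    return n_bytes >> 20

# Hypothetical value for illustration: 3 GB plus 512 MB, expressed in bytes
example_bytes = 3 * 2**30 + 512 * 2**20

print(f"{bytes_to_gb(example_bytes):.2f} GB")
print(f"{bytes_to_mb(example_bytes)} MB")
```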


Some datasets (reaching up to 1 TB or more) will be difficult to fit even on a standard
hard drive. In this case, an alternative to scaling up the server you are using is to
stream the dataset:

In [6]:
streamed_dataset = load_dataset('./codeparrot', split="train", streaming=True)

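To make the idea of streaming concrete, here is a conceptual sketch using only the standard library. It is not how the Datasets library implements `streaming=True`; it only illustrates the principle of decoding one record at a time instead of materializing the whole dataset. The file layout (gzip-compressed JSON lines), the `content` field, and the helper name `stream_records` are all assumptions made for illustration.

```python
import gzip
import json
import os
import tempfile

def stream_records(paths):
    """Lazily yield one record at a time from gzip-compressed JSON-lines files."""
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                yield json.loads(line)

# Build a tiny stand-in shard so the sketch runs on its own (hypothetical data)
tmp_dir = tempfile.mkdtemp()
shard_path = os.path.join(tmp_dir, "shard-000.json.gz")
with gzip.open(shard_path, "wt", encoding="utf-8") as handle:
    for i in range(3):
        handle.write(json.dumps({"content": f"print({i})"}) + "\n")

records = stream_records([shard_path])
first = next(records)      # only the first line has been read and decoded so far
remaining = list(records)  # consuming the generator reads the rest on demand

print(first["content"], len(remaining))
```

Because records are produced on demand, memory use depends on the size of a single record rather than the size of the dataset. The trade-off is that random access and a cheap `len()` are no longer available.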