<a href="https://colab.research.google.com/github/wadra/LLM_from_Scratch/blob/main/code/myedit/GPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GPT

This notebook, based on [Sinan Ozdemir's `Introduction_to_GPT` notebook](https://github.com/sinanuozdemir/oreilly-gpt-hands-on-nlg/blob/main/notebooks/Introduction_to_GPT.ipynb), does two things:

1. Generates text with a GPT model through `transformers` pipeline objects
2. Explores how the GPT-2 tokenizer splits text into tokens

### Load dependencies

In [31]:

! pip install transformers==4.41.2

Collecting transformers==4.41.2
  Downloading transformers-4.41.2-py3-none-any.whl (9.1 MB)
Collecting tokenizers<0.20,>=0.19 (from transformers==4.41.2)
  Downloading tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.13.3
    Uninstalling tokenizers-0.13.3:
      Successfully uninstalled tokenizers-0.13.3
  Attempting uninstall: transformers
    Found existing installation: transformers 4.28.0
    Uninstalling transformers-4.28.0:
      Successfully uninstalled transformers-4.28.0
Successfully installed tokenizers-0.19.1 transformers-4.41.2


In [1]:
from transformers import pipeline, GPT2Tokenizer

### Hello, Pipeline!

Let's use `pipeline` to build a text-generation pipeline around GPT-2.

Besides `"text-generation"`, pipelines can carry out tasks such as:
* `"sentiment-analysis"`
* `"ner"` (named entity recognition)
* `"summarization"`
* `"translation_en_to_fr"`
* `"feature-extraction"`
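As a rough illustration of how one of the other tasks listed above could be run, here is a minimal sketch. It assumes `transformers` is installed and that a default model for the task can be downloaded on first use. The helper name `run_task` is hypothetical and is not part of the library; the import sits inside the function so that nothing is loaded until the helper is actually called.

```python
def run_task(task, inputs):
    """Build a pipeline for `task` and apply it to `inputs`.

    `run_task` is a hypothetical convenience wrapper for this notebook.
    When no model is named, `pipeline` falls back to a default model
    for the task and downloads it on first use.
    """
    from transformers import pipeline  # imported lazily on purpose

    task_pipeline = pipeline(task)
    return task_pipeline(inputs)
```

For example, `run_task("sentiment-analysis", ["I love this notebook!"])` should return a list containing one dict with `label` and `score` keys. The exact values depend on the default model that gets downloaded.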

In [3]:
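# Build a text-generation pipeline backed by GPT-2 (weights are downloaded on first use)
generator = pipeline('text-generation', model='gpt2')

# max_new_tokens caps how many tokens are generated beyond the prompt
generator("The capital of Germany is Berlin. The capital of China is Beijing. The capital of France is",
          max_new_tokens=2)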

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The capital of Germany is Berlin. The capital of China is Beijing. The capital of France is Paris.'}]

In [5]:
generator("The capital of Germany is Berlin. The capital of China is Beijing. The capital of Pakistan is",
          max_new_tokens=2)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The capital of Germany is Berlin. The capital of China is Beijing. The capital of Pakistan is Islamabad.'}]
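To make the role of `max_new_tokens` in the two calls above more concrete, here is a toy sketch of an autoregressive generation loop. It is not how GPT-2 computes its predictions: the `NEXT_TOKEN` table is a hypothetical, hand-written stand-in for the model, and it looks only at the last token, whereas a real model conditions on the whole prompt. The point is only the stopping rule: append one token at a time until the budget is spent.

```python
# Hypothetical lookup table standing in for a language model:
# given the last token, it names the next one.
NEXT_TOKEN = {
    " is": " Paris",
    " Paris": ".",
    ".": " The",
    " The": " capital",
}


def generate(prompt_tokens, max_new_tokens):
    """Append at most `max_new_tokens` tokens to the prompt, one at a time."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN.get(tokens[-1])
        if next_token is None:  # the toy table has no continuation
            break
        tokens.append(next_token)
    return "".join(tokens)


prompt = ["The", " capital", " of", " France", " is"]
print(generate(prompt, max_new_tokens=2))  # The capital of France is Paris.
print(generate(prompt, max_new_tokens=4))  # The capital of France is Paris. The capital
```

With a budget of two, the toy loop stops after the city name and the period. A larger budget simply keeps appending tokens.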

### Exploring tokens

In [11]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2') # load up a tokenizer

In [7]:
'love' in tokenizer.get_vocab()

True

In [8]:
'Sinan' in tokenizer.get_vocab()

False

Encode a string into token ids:

In [12]:
tokenizer.encode('Sinan loves a beautiful day')

[46200, 272, 10408, 257, 4950, 1110]
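Six ids for a five-word string, together with `'Sinan'` being absent from the vocabulary, suggest that the name was split into smaller subword pieces. GPT-2's tokenizer uses byte-pair encoding (BPE), which starts from single symbols and repeatedly merges the adjacent pair with the highest priority. The sketch below shows that idea on plain characters. The merge table `TOY_MERGES` and the function `bpe` are hypothetical and hand-picked for illustration; they are not GPT-2's `merges.txt`, and the real tokenizer also pre-splits text and works on byte-mapped characters.

```python
# Hypothetical merge rules, highest priority first (not GPT-2's real table).
TOY_MERGES = [
    ("a", "n"),
    ("S", "i"),
    ("Si", "n"),
    ("l", "o"),
    ("v", "e"),
    ("lo", "ve"),
]


def bpe(word, merges):
    """Split `word` into subword pieces by applying merges in priority order."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = list(zip(symbols, symbols[1:]))
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:  # no applicable merge is left
            break
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


print(bpe("love", TOY_MERGES))   # ['love']
print(bpe("Sinan", TOY_MERGES))  # ['Sin', 'an']
```

A word whose pieces are fully covered by merge rules collapses into one token, while a rarer word stops at several pieces, each of which gets its own id.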

...then convert the ids into tokens:

In [None]:
tokenizer.convert_ids_to_tokens(tokenizer.encode('Sinan loves a beautiful day'))

(In GPT-2's vocabulary, a leading `Ġ` marks a token that begins with a space, so `Ġday` stands for `" day"`.)
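The `Ġ` comes from the byte-to-character table of GPT-2's byte-level BPE: every byte that is not a printable, non-space character is shifted to a visible code point starting at 256, which sends the space byte (32) to `Ġ` (code point 288). The sketch below re-implements that table in plain Python rather than calling the library. The helper `tokens_to_text` and the list `example_tokens` are hypothetical: the list is written by hand for illustration and is not captured tokenizer output.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a visible Unicode character."""
    # Bytes that are already printable and non-space keep their own character.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {}
    shift = 0
    for byte in range(256):
        if byte in keep:
            mapping[byte] = chr(byte)
        else:
            # Remaining bytes (space, newline, control bytes, ...) are
            # assigned code points 256, 257, ... in increasing byte order.
            mapping[byte] = chr(256 + shift)
            shift += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {char: byte for byte, char in BYTE_TO_CHAR.items()}


def tokens_to_text(tokens):
    """Undo the byte mapping and decode the resulting bytes as UTF-8."""
    raw = bytes(CHAR_TO_BYTE[char] for token in tokens for char in token)
    return raw.decode("utf-8")


example_tokens = ["Ġloves", "Ġa", "Ġbeautiful", "Ġday"]  # hand-written, illustrative
print(BYTE_TO_CHAR[ord(" ")])                 # Ġ
print(repr(tokens_to_text(example_tokens)))   # ' loves a beautiful day'
```

Because the space is stored inside the token itself, joining tokens and reversing the mapping recovers the original spacing without any extra rules.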