# Language Models 1

## 1. Inference

See [here](https://huggingface.co/tasks/text-generation) and [here](https://huggingface.co/docs/transformers/generation_strategies).

If you need to load/save to your drive:

```python
import sys
if 'google.colab' in sys.modules:
    from google.colab import drive
    drive.mount('/content/drive/')

import os
os.chdir('drive/My Drive/2023-DMLAP/DMLAP/python') # change to your directory
```

## Workflow

1. On Colab, you need to install `transformers`:

```python
!pip install -Uq transformers
```

2. Locally, I recommend creating a new environment when working with Hugging Face, simply because it'll be faster and because the preferred library behind HF is PyTorch, which can conflict with TensorFlow.

```
conda create -n dmlap.hug python
conda activate dmlap.hug
```

For some models, for instance the recent Llama 2, you need to be logged into HF.

Please create an account [here](https://huggingface.co/) and then go to ['/settings/tokens'](https://huggingface.co/settings/tokens) to create an access token.

In the case of Llama, you need two things: [request and be granted access by Meta](https://ai.meta.com/llama/), *then* [request and be granted access by HF](https://huggingface.co/meta-llama). With the former you can also download the weights locally (downloading all the models takes **a lot** of space, 330 GB, and most of them are too large to run on an ordinary machine, so I wouldn't recommend it).

```python
from pathlib import Path
from huggingface_hub import notebook_login
# only ask for a login if no token has been saved yet
if not (Path.home()/'.huggingface'/'token').exists():
    notebook_login()
```

In [None]:
from pathlib import Path

import torch
from transformers import pipeline

In [None]:
generator = pipeline(
    'text-generation', # the specific task, which is also the tag on huggingface
    model='gpt2',      # so many more models here: https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads
    device=0           # the default is just cpu, see here: https://huggingface.co/docs/transformers/pipeline_tutorial#device
)

In [None]:
from transformers import GenerationConfig
generation_config = GenerationConfig.from_pretrained("gpt2") # https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig.from_pretrained.example
generation_config.pad_token_id = generation_config.eos_token_id # see here, modified: https://github.com/huggingface/transformers/issues/19853#issuecomment-1290759818

Hugging Face is transitioning towards the use of generation config files (good for automation).

In [None]:
generation_config.max_length = 25
generation_config.do_sample = True
generation_config.top_p = 0.95
generation_config.temperature = .9
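
To build an intuition for `temperature` and `top_p`, here is a minimal pure-Python sketch of what these two parameters do to a next-token distribution. It is an illustration under simplifying assumptions, not the code `transformers` runs internally: the function names and the four toy logits are made up for this example.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=0.95):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalise over that set."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

def sample_next_token(logits, temperature=0.9, top_p=0.95, rng=random):
    """Apply temperature, restrict to the nucleus, then draw one token id."""
    probs = softmax_with_temperature(logits, temperature)
    nucleus = top_p_filter(probs, top_p)
    ids = list(nucleus)
    weights = [nucleus[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]

toy_logits = [4.0, 3.0, 1.0, -2.0]  # hypothetical scores for a 4-token vocabulary
print(softmax_with_temperature(toy_logits, temperature=0.9))
print(top_p_filter(softmax_with_temperature(toy_logits, temperature=0.9), top_p=0.95))
print(sample_next_token(toy_logits))
```

Try a very low temperature (e.g. `0.1`) and a very high one (e.g. `5.0`) and compare the printed probabilities: the first is almost deterministic, the second close to uniform.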

### Quick vocab note:

`bos`: beginning of sentence  
`eos`: end of sentence  
`pad`: padding

These are special tokens that have been inserted into the text at training time.

For instance, in our case the 'beginning' of the text is 'endoftext', as during training the texts are fed to the network one after the other, with this special token between them:
```python
print(generator.tokenizer.bos_token) # '<|endoftext|>'
```
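
As a toy illustration of texts being fed "one after the other, with this special token between them", the sketch below joins a few tokenized documents into one stream and cuts it into fixed-size training blocks. The token ids of the documents are made up, `pack_documents` is a hypothetical helper, and placing the separator before each document is an assumption; this is not GPT-2's actual data pipeline. Only the id `50256` comes from the config shown in this notebook.

```python
EOS_ID = 50256  # id of '<|endoftext|>' in the GPT-2 vocabulary

def pack_documents(tokenized_docs, block_size, sep_id=EOS_ID):
    """Join tokenized documents into one stream, with sep_id in front of each,
    then cut the stream into blocks of block_size tokens."""
    stream = []
    for doc in tokenized_docs:
        stream.append(sep_id)
        stream.extend(doc)
    # the incomplete block at the end, if any, is dropped
    return [
        stream[i:i + block_size]
        for i in range(0, len(stream) - block_size + 1, block_size)
    ]

docs = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]  # made-up token ids
blocks = pack_documents(docs, block_size=4)
for block in blocks:
    print(block)
```

Note how a block can start in one document and end in the next: the special token is the only signal the model gets that a new text has begun.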

In [None]:
generation_config

GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 50256,
  "do_sample": true,
  "eos_token_id": 50256,
  "max_length": 25,
  "pad_token_id": 50256,
  "temperature": 0.9,
  "top_p": 0.95,
  "transformers_version": "4.32.1"
}

In [None]:
# torch.manual_seed(1)
generator(
    "Once upon a time,",
    generation_config=generation_config
)

[{'generated_text': "Once upon a time, I'd just start playing games and play games. There was that same desire that, in other words"}]

Parallel generation!

In [None]:
# torch.manual_seed(1)
generator(
    ["Once upon a time,"] * 2,
    generation_config=generation_config
)

[[{'generated_text': 'Once upon a time, one may think of our great heroes as "good guys." After all, they\'re the ones who'}],
 [{'generated_text': 'Once upon a time, he said, an individual in the United States is, as if someone had just arrived in the United'}]]
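
The output above is a nested list of dictionaries, which is awkward to read. Below is a small sketch of helpers that flatten that structure and wrap each text with the standard-library `textwrap` module. The helper names are hypothetical, and the `sample` data is an abbreviated copy of the output above so the block runs without a model.

```python
import textwrap

def flatten_generations(outputs):
    """Return a flat list of generated strings.
    Accepts the output for a single prompt (list of dicts)
    or for a batch of prompts (list of lists of dicts)."""
    texts = []
    for item in outputs:
        if isinstance(item, dict):
            texts.append(item["generated_text"])
        else:
            texts.extend(d["generated_text"] for d in item)
    return texts

def format_generations(outputs, width=60):
    """Number each strand and wrap it to the given width, indented by four spaces."""
    blocks = []
    for n, text in enumerate(flatten_generations(outputs), start=1):
        wrapped = textwrap.fill(
            text, width=width, initial_indent="    ", subsequent_indent="    "
        )
        blocks.append(f"[{n}]\n{wrapped}")
    return "\n\n".join(blocks)

# abbreviated copy of the batch output shown above
sample = [
    [{"generated_text": 'Once upon a time, one may think of our great heroes as "good guys."'}],
    [{"generated_text": "Once upon a time, he said, an individual in the United States is"}],
]
print(format_generations(sample))
```

With the real pipeline you would pass the return value of `generator(...)` directly to `format_generations`.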

---

## Deeper

What does the pipeline do under the hood?

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained(
    "gpt2",
    pad_token_id=tokenizer.eos_token_id # add the EOS token as PAD token to avoid warnings
).to("cuda") # to GPU

### Note

Hugging Face can pick the right classes for you: instead of `GPT2LMHeadModel` and `GPT2Tokenizer` you can use `AutoModelForCausalLM` and `AutoTokenizer`, which select the matching architecture from the checkpoint name.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id).to("cuda")
```

### The tokenizer

See [this](https://huggingface.co/docs/transformers/preprocessing).

In [None]:
toks = tokenizer.encode("Oh sweet midnight")
print(toks)
print(tokenizer.decode(toks))
print()

toks = tokenizer(["Oh sweet midnight", "harbinger of doom"])
print(toks)
print(tokenizer.batch_decode(toks['input_ids']))

[5812, 6029, 15896]
Oh sweet midnight

{'input_ids': [[5812, 6029, 15896], [9869, 65, 3889, 286, 27666]], 'attention_mask': [[1, 1, 1], [1, 1, 1, 1, 1]]}
['Oh sweet midnight', 'harbinger of doom']
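
The two sequences above have different lengths (3 and 5 tokens), so they cannot be stacked into one rectangular tensor as they are. This is what the `pad` token and the `attention_mask` are for. The sketch below pads by hand in pure Python to show the idea; `pad_batch` is a hypothetical helper, and the tokenizer can do this for you (with `padding=True`) once a pad token has been set. Padding on the left is a common choice when generating with models like GPT-2, so it is the default here.

```python
PAD_ID = 50256  # GPT-2 has no dedicated pad token; this notebook reuses the EOS id

def pad_batch(sequences, pad_id=PAD_ID, side="left"):
    """Pad all sequences to the length of the longest one.
    The attention mask is 1 for real tokens and 0 for padding."""
    max_len = max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        pads, ones = [pad_id] * n_pad, [1] * len(seq)
        if side == "left":
            input_ids.append(pads + list(seq))
            attention_mask.append([0] * n_pad + ones)
        else:
            input_ids.append(list(seq) + pads)
            attention_mask.append(ones + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# the token ids printed above
batch = pad_batch([[5812, 6029, 15896], [9869, 65, 3889, 286, 27666]])
print(batch["input_ids"])
print(batch["attention_mask"])
```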


In [None]:
input_ids = tokenizer.encode('Once upon a time', return_tensors='pt') # pytorch tensors
print(input_ids)

batched_input_ids = torch.tile(input_ids, (4,1)).to("cuda") # just copying the tensor 4 times
print(batched_input_ids)

tensor([[7454, 2402,  257,  640]])
tensor([[7454, 2402,  257,  640],
        [7454, 2402,  257,  640],
        [7454, 2402,  257,  640],
        [7454, 2402,  257,  640]], device='cuda:0')


In [None]:
# encode context the generation is conditioned on
input_ids = tokenizer.encode('Once upon a time', return_tensors='pt') # pytorch tensors

batched_input_ids = torch.tile(input_ids, (4,1)).to("cuda") # copy and place on GPU

# same logic as before
from transformers import GenerationConfig
generation_config = GenerationConfig.from_pretrained("gpt2")
generation_config.pad_token_id = generation_config.eos_token_id
generation_config.max_length = 25
generation_config.do_sample = True
generation_config.top_p = 0.95
generation_config.temperature = .9

# generate text until the output length (which includes the context length) reaches max_length (25 here)
output = model.generate(
    batched_input_ids, # try input_ids as well for a single strand
    generation_config=generation_config,
)

In [None]:
texts = tokenizer.batch_decode(output, skip_special_tokens=True)

for t in texts:
    print(t)
    print("-" * 40)

Once upon a time, as the spirit guides you into the world, you will find that you have a strong sense of self
----------------------------------------
Once upon a time, it was a beautiful world, and so we saw it from it's window. He was always in
----------------------------------------
Once upon a time, when the universe is in a state of entropy, we have a series of infinite, infinite-dimensional
----------------------------------------
Once upon a time, I didn't think there were any problems with the system; now I see there are.


----------------------------------------
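
Conceptually, `model.generate` with sampling repeats one step: ask the model for a distribution over the next token, draw a token, append it, and stop at `max_length` or at the end-of-text token. The sketch below shows that loop with a toy stand-in for the model. Everything in it is hypothetical (a 5-token vocabulary, `toy_next_token_probs`, an EOS id of 0), and it produces the strands one after the other, whereas the real method processes the whole batch in parallel.

```python
import random

EOS_ID = 0  # hypothetical end-of-text id in a toy 5-token vocabulary

def toy_next_token_probs(token_ids):
    """Stand-in for the model: a distribution over 5 tokens that
    favours the id following the last token of the sequence."""
    vocab_size = 5
    favoured = (token_ids[-1] + 1) % vocab_size
    return [0.6 if i == favoured else 0.1 for i in range(vocab_size)]

def generate(prompt_ids, max_length, rng, eos_id=EOS_ID):
    """Sample one token at a time until max_length or EOS is reached."""
    ids = list(prompt_ids)
    while len(ids) < max_length:  # max_length counts the prompt tokens too
        probs = toy_next_token_probs(ids)
        next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        ids.append(next_id)
        if next_id == eos_id:  # the model "decided" the text is over
            break
    return ids

rng = random.Random(0)  # fixed seed, same role as torch.manual_seed
strands = [generate([1, 2], max_length=8, rng=rng) for _ in range(4)]
for s in strands:
    print(s)
```

This also explains why some strands come out shorter than others: a strand ends early as soon as the end-of-text token is sampled.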


---

# Experiments

1. Test everything! Make sure you understand and develop an intuition of:
 - the various parameters: `temperature`, `top_k`, `top_p`;
 - the `tokenizer` object to convert text into tokens and back;
 - how to handle the whole pipeline;
   Also, you can search for different [models](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads)! (Some of them may exceed your GPU capacity, beware). People have finetuned language models on many types of texts.
2. Can you think of a way to introduce computational thinking into this? Ideas:
  - First, you could explore ways of making things look nicer than a raw list of objects. You could write a print function that knows exactly how to take the model output and display it cleanly. The standard-library module for this kind of text formatting is [textwrap](https://docs.python.org/3/library/textwrap.html) (see also [here](https://www.geeksforgeeks.org/textwrap-text-wrapping-filling-python/));
  - Can you think of ways to construct a writing **loop**? By that, I mean:  
    a. Prepare prompt  
    b. Generate one or more strands of text  
    c. Select text from strands, go back to a.  
    This could simply mean writing a system of helper functions and classes to assist you in the writing...
  - One could imagine all sorts of strange ways to work with text, from programmatically chunking the generated text and scrambling it before using it again as a prompt, to exploring what the model does if you use unreasonable parameters (e.g. a very high or low `temperature`).
  - Also, can you think of ways to work with various strands of text (taking advantage of the fact that a model can generate in parallel)?

3. The exploration of the *biases* of the models (and there are tons!) has already been the subject of a lot of debate and controversy. GPT-2 was trained mostly on Internet text, namely web pages linked from well-rated Reddit posts (see [this open-source replication](https://github.com/jcpeterson/openwebtext)). Unsurprisingly, the topics and points of view reflect that corner of human activities...
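
As a starting point for the writing loop in idea 2 (prepare prompt, generate strands, select, repeat), here is one possible skeleton. It is only a sketch: `fake_generate` is a stand-in so the block runs without a model, and `pick_longest` is an arbitrary selection rule; both names are hypothetical.

```python
import random

def fake_generate(prompt, n_strands, rng):
    """Stand-in for the text-generation pipeline:
    returns n_strands continuations of the prompt."""
    endings = [" the sea was calm.", " nobody spoke.", " a door opened.", " it began to rain."]
    return [prompt + rng.choice(endings) for _ in range(n_strands)]

def pick_longest(strands):
    """Arbitrary selection rule; swap in a function that asks you to choose by hand."""
    return max(strands, key=len)

def writing_loop(prompt, rounds, n_strands=3,
                 generate=fake_generate, select=pick_longest, seed=0):
    """Run the prepare / generate / select cycle a fixed number of times."""
    rng = random.Random(seed)
    text = prompt                                 # a. prepare prompt
    history = []
    for _ in range(rounds):
        strands = generate(text, n_strands, rng)  # b. generate strands
        text = select(strands)                    # c. select, then go back to a.
        history.append(text)
    return text, history

final_text, history = writing_loop("Once upon a time,", rounds=3)
print(final_text)
```

To use the real model, pass a `generate` function that wraps the `generator` call and returns a list of strings; the `select` step is where your own writing decisions come in.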

### Document your thought process! Add more cells rather than modify the one you are working with, so that your steps can be retraced.