# Generating Your First Text

This book is for the GPU-poor! We use models that you can run without a big budget or the most expensive GPUs available.

A free instance of **Google Colab** will give you a **T4 GPU** with 16 GB VRAM, which is the minimum amount of VRAM that we suggest.

### Open Models

Open LLMs are models whose weights and architecture are shared publicly. They are still developed by specific organizations, which often also share the code for creating or running the model locally, under licenses that may or may not allow commercial use of the model.

Examples:
- Cohere’s Command R
- the Mistral models
- Microsoft’s Phi
- Meta’s Llama

### Get started with: Phi-3-mini

The main generative model we use throughout the book is [Phi-3-mini](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct). Small yet performant:

- 3.8 billion parameters
- 8 GB of VRAM; 6 GB with quantization

Moreover, it is released under the MIT license, which allows the model to be used for commercial purposes with very few restrictions!
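
As a rough back-of-the-envelope check on these VRAM figures, the memory needed for the weights alone is approximately the number of parameters times the bytes per parameter. The sketch below assumes 16-bit weights for the unquantized model and 8-bit or 4-bit weights for quantized variants; `weight_memory_gb` is a hypothetical helper, and the estimate deliberately ignores activations, the KV cache, and framework overhead.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return num_params * bytes_per_param / 1e9

PHI3_MINI_PARAMS = 3.8e9  # parameter count quoted above

# Assumed precisions: 16-bit (unquantized), 8-bit and 4-bit (quantized).
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(PHI3_MINI_PARAMS, bits):.1f} GB")
```

The 16-bit estimate lands close to the 8 GB figure above. Real usage is higher than the weight-only number, so the quantized estimates are best read as lower bounds.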

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

Note that we load the `model` and the `tokenizer` as separate objects so that we can explore each of them on its own.

In [None]:
# Notice the `<|assistant|>` special token at the end of the prompt
prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|>"

# Tokenize the input prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt").to("cuda")

# Generate the text
generation_output = model.generate(
    input_ids=input_ids,
    max_new_tokens=20
)

# Print the output
print(tokenizer.decode(generation_output[0]))

Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|> Subject: Sincere Apologies for the Gardening Mishap


Dear


In [None]:
print(input_ids)

tensor([[14350,   385,  4876, 27746,  5281,   304, 19235,   363,   278, 25305,
           293, 16423,   292,   286,   728,   481, 29889, 12027,  7420,   920,
           372,  9559, 29889, 32001]], device='cuda:0')


In [None]:
# Decode each token ID on its own to see how the prompt was split
for token_id in input_ids[0]:
    print(tokenizer.decode(token_id))

Write
an
email
apolog
izing
to
Sarah
for
the
trag
ic
garden
ing
m
ish
ap
.
Exp
lain
how
it
happened
.
<|assistant|>


Notice that the output of `generate` is again a tensor of token IDs: the prompt IDs followed by the newly generated ones.

In [None]:
generation_output

tensor([[14350,   385,  4876, 27746,  5281,   304, 19235,   363,   278, 25305,
           293, 16423,   292,   286,   728,   481, 29889, 12027,  7420,   920,
           372,  9559, 29889, 32001,  3323,   622, 29901,   317,  3742,   406,
          6225, 11763,   363,   278, 19906,   292,   341,   728,   481,    13,
            13,    13, 29928,   799]], device='cuda:0')
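
Because the output contains the prompt IDs followed by the generated IDs, the continuation can be isolated by slicing off the prompt. The sketch below uses plain Python lists copied from the tensors printed above, so it runs without a GPU. With the real tensors, the equivalent slice would be `generation_output[0][input_ids.shape[-1]:]`.

```python
# Token IDs copied from the printed tensors above.
prompt_ids = [14350, 385, 4876, 27746, 5281, 304, 19235, 363, 278, 25305,
              293, 16423, 292, 286, 728, 481, 29889, 12027, 7420, 920,
              372, 9559, 29889, 32001]
output_ids = prompt_ids + [3323, 622, 29901, 317, 3742, 406, 6225, 11763,
                           363, 278, 19906, 292, 341, 728, 481, 13,
                           13, 13, 29928, 799]

# The output starts with an exact copy of the prompt...
assert output_ids[:len(prompt_ids)] == prompt_ids

# ...so everything after it is what the model generated.
new_ids = output_ids[len(prompt_ids):]
print(len(new_ids))  # number of generated tokens
```

The number of new IDs equals the `max_new_tokens=20` limit passed to `generate`, which is why the email is cut off right after "Dear".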

We `decode` them to get the corresponding tokens:

In [None]:
print(tokenizer.decode(3323))
print(tokenizer.decode(622))
print(tokenizer.decode([3323, 622]))
print(tokenizer.decode(29901))

Sub
ject
Subject
:


### Using `pipeline`

When you use an LLM, two models are loaded:

- The generative `model` itself
- Its underlying `tokenizer`

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")




Note: `device_map="cuda"` assumes that an NVIDIA GPU is available.

Although we now have enough to start generating text, the `transformers` library offers a shortcut that simplifies the process: `transformers.pipeline`. It wraps the model, the tokenizer, and the text generation process in a single callable:

In [None]:
from transformers import pipeline

# Create a pipeline
generator = pipeline(
    task="text-generation",

    model=model,
    tokenizer=tokenizer,

    return_full_text=False,
    max_new_tokens=500,
    do_sample=True
)

The following parameters are worth mentioning:

- `return_full_text`
  By setting this to `False`, only the output of the model is returned, not the prompt.

- `max_new_tokens`
  The maximum number of tokens the model will generate. By setting a limit, we prevent long and unwieldy output, as some models might otherwise keep generating until they fill their context window.

- `do_sample`
  Whether the model uses a sampling strategy to choose the next token. When this is `False`, the model always selects the most probable next token. Here it is set to `True`, so the output can differ between runs. In Chapter 6, we explore several sampling parameters that invoke some creativity in the model’s output.
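
To make the effect of `do_sample` concrete, here is a minimal sketch of the two selection strategies applied to a single next-token choice. The probability table and the helper names `pick_greedy` and `pick_sampled` are invented for illustration. A real model performs this choice over its whole vocabulary at every generation step, usually with further adjustments that this sketch leaves out.

```python
import random

# Hypothetical next-token probabilities, invented for illustration.
next_token_probs = {"drumsticks": 0.6, "eggs": 0.3, "feathers": 0.1}

def pick_greedy(probs):
    """Analogue of do_sample=False: always take the most probable token."""
    return max(probs, key=probs.get)

def pick_sampled(probs, rng):
    """Analogue of do_sample=True: draw a token in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded so repeated runs of the sketch agree
print(pick_greedy(next_token_probs))
print([pick_sampled(next_token_probs, rng) for _ in range(5)])
```

Greedy selection returns the same token every time, while sampling can return any token with non-zero probability.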

In [None]:
# The prompt (user input / query)
messages = [
    {"role": "user", "content": "Create a funny joke about chickens."}
]

# Generate output
output = generator(messages)
print(output[0]["generated_text"])


 Why did the chicken join the band? Because it had the drumsticks!


In [None]:
print(f'Prompt: {messages[0]["content"]}')
print(f'Output: {output[0]["generated_text"]}')

Prompt: Create a funny joke about chickens.
Output:  Why did the chicken join the band? Because it had the drumsticks!
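
When the pipeline receives a list of role/content messages instead of a raw string, the messages are first turned into a single prompt string by the tokenizer's chat template. The sketch below is a hypothetical stand-in for that step: `format_chat` and the `<|user|>` and `<|end|>` tags are assumptions for illustration, and only the `<|assistant|>` token is confirmed by the earlier prompt. The actual string can be inspected with `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`.

```python
def format_chat(messages, add_generation_prompt=True):
    """Toy chat formatter; the tag layout is an assumption, not the official template."""
    parts = [f"<|{m['role']}|>\n{m['content']}<|end|>\n" for m in messages]
    if add_generation_prompt:
        # Ending on the assistant tag signals that the model should answer next.
        parts.append("<|assistant|>\n")
    return "".join(parts)

messages = [{"role": "user", "content": "Create a funny joke about chickens."}]
print(format_chat(messages))
```

This also explains why the hand-written prompt earlier ended with `<|assistant|>`: it marks the point where the model's reply should begin.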


Let's wrap this in a function:

In [None]:
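def generate_text(prompt):
    """Send a single user prompt to the pipeline and print the result."""
    messages = [{"role": "user", "content": prompt}]
    output = generator(messages)
    print(f"Prompt: {prompt}")
    print(f"Generated Output: {output[0]['generated_text']}")

generate_text("Why is the desert so hot? short answer:")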

Prompt: Why is the desert so hot? short answer:
Generated Output:  The desert is hot due to consistent high solar radiation and minimal cloud cover, which allows for more direct and intense sunlight throughout the day. Low humidity also contributes to efficient heat absorption as there's less moisture in the air to retain the heat.


In [None]:
generate_text("what is the sentiment of the following review? 'I love this product! It's amazing.'")

Prompt: what is the sentiment of the following review? 'I love this product! It's amazing.'
Generated Output:  The sentiment of the given review, 'I love this product! It's amazing,' is positive. The reviewer expresses strong positive feelings by using words like "love" and "amazing," indicating a high level of satisfaction with the product.


In [None]:
generate_text("what is the sentiment of the following review? 'I love this product! It's amazing.' answer with one word, either: POSITIVE or NEGATIVE")

Prompt: what is the sentiment of the following review? 'I love this product! It's amazing.' answer with one word, either: POSITIVE or NEGATIVE
Generated Output:  POSITIVE


In [None]:
generate_text("how do you write Adam backwards?")

Prompt: how do you write Adam backwards?
Generated Output:  To write "Adam" backward, you would reverse the order of its letters. So, the reversed version of "Adam" would be "madA".


### Your turn

---

# Other Architectural Experiments and Improvements for Transformers

Many tweaks to the Transformer architecture are continually being proposed and studied. [“A Survey of Transformers”](https://oreil.ly/3SrG4) highlights a few of the main directions.

Transformer architectures are also constantly being adapted to domains beyond LLMs:

- Computer vision: see [“Transformers in vision: A survey”](https://oreil.ly/35CES) and [“A survey on vision transformer”](https://oreil.ly/0zEbq)
- Robotics: see [“Open X-Embodiment: Robotic learning datasets and RT-X models”](https://oreil.ly/SXAuB)
- Time series: see [“Transformers in time series: A survey”](https://oreil.ly/p9duV)

# Practical Tips on Model Selection

Choosing the right model is not as straightforward as you might think: at the time of writing, the Hugging Face Hub hosts over 60,000 models for text classification and more than 8,000 models that generate embeddings.

### Start simple

> It is highly advised to compare against classic, but strong baselines such as representing text with **TF-IDF** and training a logistic regression classifier on top of that. -- the authors of Hands-on LLMs.
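
To show what such a baseline involves, here is a dependency-free sketch of TF-IDF features feeding a logistic regression classifier trained with plain stochastic gradient descent. The four training sentences, the smoothed IDF formula, and the hyperparameters are assumptions chosen to keep the example tiny. In practice you would more likely use scikit-learn's `TfidfVectorizer` and `LogisticRegression`.

```python
import math
from collections import Counter

# Tiny made-up training set: 1 = positive, 0 = negative.
train = [
    ("a wonderful and moving film", 1),
    ("great acting and a great story", 1),
    ("boring plot and terrible acting", 0),
    ("a dull and terrible film", 0),
]

def tokenize(text):
    return text.lower().split()

# Inverse document frequency learned from the training documents.
docs = [tokenize(text) for text, _ in train]
vocab = sorted({word for doc in docs for word in doc})
doc_freq = Counter(word for doc in docs for word in set(doc))
idf = {word: math.log(len(docs) / doc_freq[word]) + 1.0 for word in vocab}

def tfidf(text):
    """Term count times IDF for each vocabulary word (unknown words are ignored)."""
    counts = Counter(tokenize(text))
    return [counts[word] * idf[word] for word in vocab]

# Logistic regression trained with plain stochastic gradient descent.
weights, bias, learning_rate = [0.0] * len(vocab), 0.0, 0.1
for _ in range(200):
    for text, label in train:
        features = tfidf(text)
        z = bias + sum(w * f for w, f in zip(weights, features))
        error = 1.0 / (1.0 + math.exp(-z)) - label  # predicted probability minus target
        weights = [w - learning_rate * error * f for w, f in zip(weights, features)]
        bias -= learning_rate * error

def predict(text):
    z = bias + sum(w * f for w, f in zip(weights, tfidf(text)))
    return 1 if z > 0 else 0

print(predict("great wonderful story"), predict("terrible boring plot"))
```

A baseline like this trains in milliseconds on a CPU, which makes it a useful reference point before reaching for a much larger model.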

### 1. LLM for Representation vs. Generation

#### 1.1 Representation models

- Encoder (e.g., BERT, legal-BERT).

Generative models like the GPT family are impressive, but encoder-only models also:
- excel in task-specific use cases and
- tend to be significantly smaller in size

Selecting the right model for the job can be a form of art in itself. However, consider these as solid baselines:

- BERT base model (uncased)
- RoBERTa base model
- DistilBERT base model (uncased)
- DeBERTa base model
- bert-tiny
- ALBERT base v2

The [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) is a great place to start looking for **embedding models**.
- Contains open and closed source models
- Benchmarked across several tasks:
    - Classification is the task of assigning a label to a text.
    - Clustering is the task of grouping similar documents together.
    - Retrieval is the task of finding relevant documents for a query.
    - and others.
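
As an illustration of the retrieval task, the sketch below ranks documents by the cosine similarity between their embeddings and a query embedding. The three-dimensional vectors are made up for readability. A real embedding model produces vectors with hundreds of dimensions, but the ranking step works the same way.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings", invented for illustration.
documents = {
    "How to grow tomatoes": [0.9, 0.1, 0.0],
    "Training a neural network": [0.0, 0.2, 0.9],
    "Caring for garden roses": [0.8, 0.3, 0.1],
}
query_embedding = [1.0, 0.2, 0.0]  # stands in for an embedded gardening query

# Most similar document first.
ranked = sorted(
    documents,
    key=lambda title: cosine_similarity(query_embedding, documents[title]),
    reverse=True,
)
print(ranked)
```

The two gardening documents rank above the unrelated one because their vectors point in nearly the same direction as the query.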

#### 1.2 Generation models

- Decoder (e.g., GPT-style models)
- Encoder-decoder (e.g., T5, BART)

Used in tasks like:
- abstractive summarization
- translation
- conversations

### 2. Domain

The closer a model's training data is to your target domain, the better.

**Experiment**

Try the [Twitter-RoBERTa-base for Sentiment Analysis](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest) model (trained on tweets) on a movie review dataset ([`rotten_tomatoes`](https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes)), and compare it to [DistilBERT base uncased finetuned SST-2](https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english) (trained on movie reviews).

Other examples:
- [`legal-bert`](https://huggingface.co/nlpaueb/legal-bert-base-uncased): an embedding model trained on legal documents.
- [`biobert`](https://huggingface.co/dmis-lab/biobert-base-cased-v1.2): an embedding model trained on medical documents.

### 3. Language

English, Arabic, multilingual, or something else?

This affects:

- Size of pre-training data
    - we may need to start from scratch for a very specialized domain such as medicine (if no data exists)
- Availability of supervised fine-tuning data
    - we may need to start annotating a new dataset (if no data exists)

### 4. Performance

#### 4.1 Inference Speed

The importance of inference speed should not be underestimated in real-life solutions. For this reason, we use [`sentence-transformers/all-mpnet-base-v2`](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) as the embedding model throughout this section. It is a small but performant model.
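
One simple way to compare candidate models on speed is to time how many texts each can encode per second. The sketch below is a minimal harness under stated assumptions: `measure_throughput` is a hypothetical helper, and `dummy_encode` stands in for a real model's encoding function so the example runs anywhere. With a sentence-transformers model you would pass its `encode` method instead.

```python
import time

def measure_throughput(encode_fn, texts, repeats=3):
    """Return the best observed texts-per-second over a few repeated runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        encode_fn(texts)
        best = min(best, time.perf_counter() - start)
    return len(texts) / best

def dummy_encode(texts):
    """Stand-in for a real embedding model: returns one fake vector per text."""
    return [[float(len(text)), float(text.count(" "))] for text in texts]

texts = ["an example sentence to embed"] * 1000
print(f"{measure_throughput(dummy_encode, texts):,.0f} texts/second")
```

Taking the best of several runs reduces noise from warm-up effects. For a fair comparison, each model should be measured on the same hardware with the same batch of texts.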


#### 4.2 Inference Correctness

As measured by evaluation metrics such as accuracy, precision, recall, and F1 score.
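
For a binary classification task, the usual metrics can be computed directly from the true labels and the model's predictions. The following is a minimal sketch with made-up labels. `classification_metrics` is a hypothetical helper, and libraries such as scikit-learn provide equivalent functions.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels and predictions for six reviews (1 = positive, 0 = negative).
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
print(classification_metrics(y_true, y_pred))
```

Here four of six predictions are correct, with one false positive and one false negative, so accuracy, precision, recall, and F1 all come out to two thirds.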

### 5. Computation Requirements

Model size or VRAM required.

Can you run the model on your hardware or do you need cloud resources?

### 6. Modality

- text-only (BERT, GPT)
- vision (ViT)
- text + vision (vision-language models, VLMs)
- text + voice