<a href="https://colab.research.google.com/github/simulate111/Textual-Data-Analysis-25/blob/main/llms_using_transformers_library.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Using LLMs with the `transformers` library

This notebook illustrates basic use of large language models using the Hugging Face `transformers` library.

---

### Setup

You can change the model used throughout the notebook here:

In [1]:
MODEL_NAME = 'HuggingFaceTB/SmolLM2-135M-Instruct'
#MODEL_NAME = 'HuggingFaceTB/SmolLM2-1.7B-Instruct'

You can change the prompt used in many of the examples here:

In [2]:
PROMPT = 'The best advice I ever got was:'

By default, the `transformers` library logs a variety of warning messages even when used correctly (e.g. "Setting `pad_token_id` to `eos_token_id`"). Here we configure logging to suppress these messages.

**NOTE**: you should generally _not_ do this as some of the warnings may signal problems in your code.

In [3]:
import transformers

transformers.logging.set_verbosity_error()

---

### Minimal generation example

The [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines) function provides a high-level abstraction for a variety of tasks.

To create a pipeline for text generation, call `pipeline` with the task name `'text-generation'` and the name of a model that supports text generation from the Hugging Face [models repository](https://huggingface.co/models?pipeline_tag=text-generation).

To support loading larger models, we also pass the arguments `device_map='auto'` and `torch_dtype='auto'`. If you're interested in knowing more about this, you can read the `accelerate` documentation on [Loading big models into memory](https://huggingface.co/docs/accelerate/v1.4.0/concept_guides/big_model_inference).

In [4]:
from transformers import pipeline

pipe = pipeline(
    'text-generation',
    MODEL_NAME,
    device_map='auto',
    torch_dtype='auto',
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Invoking a text generation pipeline with a prompt will return generated outputs, a list of dictionaries where the generated text is given as `'generated_text'`.

In [5]:
outputs = pipe(PROMPT)

print(outputs[0]['generated_text'])

The best advice I ever got was: "Don't be afraid to ask for help."

I'm not sure what to do with
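
As the output shows, `'generated_text'` contains the prompt followed by the continuation. The sketch below shows one way to separate the two after the fact. It does not call the model: `example_outputs` is a literal copy of the output above, and `continuation_only` is a hypothetical helper, not part of `transformers`. (The text generation pipeline also accepts `return_full_text=False`, which should give a similar result directly.)

```python
# Literal copy of the pipeline output above, so this runs without a model.
example_prompt = 'The best advice I ever got was:'
example_outputs = [{
    'generated_text': (
        'The best advice I ever got was: "Don\'t be afraid to ask for help."\n\n'
        "I'm not sure what to do with"
    )
}]


def continuation_only(generated_text, prompt):
    """Hypothetical helper: drop the prompt prefix if the text starts with it."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text


continuation = continuation_only(example_outputs[0]['generated_text'], example_prompt)
print(repr(continuation))
```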


---

### Example without `pipeline`

We can run the same generation using the tokenizer and model explicitly, as follows. First, we could directly load these using ["auto" classes](https://huggingface.co/docs/transformers/en/model_doc/auto):

```
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
```

Here, to avoid loading the model twice, we'll instead just grab the tokenizer and model from the pipeline we loaded earlier.

In [6]:
tokenizer = pipe.tokenizer
model = pipe.model

We can then get the same result as the pipeline produced using `tokenizer` and `model.generate` explicitly like this:

In [7]:
prompt = 'The best advice I ever got was:'
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The best advice I ever got was: "Don't be afraid to ask for help."

I'm not sure what to do with


The `pipeline` class takes care of encoding the text into token indices for the model and decoding the output back into text, but you should keep in mind that the model doesn't deal with text but rather these indices:

In [8]:
print(inputs.input_ids)

tensor([[ 504, 1450, 5042,  339, 2042, 3363,  436,   42]])


In [9]:
print(outputs[0])

tensor([  504,  1450,  5042,   339,  2042,  3363,   436,    42,   476,  8084,
          982,   325, 11830,   288,  1998,   327,   724,  1270,   198,   198,
           57,  5248,   441,  2090,   732,   288,   536,   351])


You can see the individual tokens that these indices (ids) correspond to by using `tokenizer.convert_ids_to_tokens`. (For many models you'll see encoded characters such as `Ġ` for space; [this is not an error](https://github.com/facebookresearch/fairseq/issues/1716).)

In [10]:
print(tokenizer.convert_ids_to_tokens(outputs[0]))

['The', 'Ġbest', 'Ġadvice', 'ĠI', 'Ġever', 'Ġgot', 'Ġwas', ':', 'Ġ"', 'Don', "'t", 'Ġbe', 'Ġafraid', 'Ġto', 'Ġask', 'Ġfor', 'Ġhelp', '."', 'Ċ', 'Ċ', 'I', "'m", 'Ġnot', 'Ġsure', 'Ġwhat', 'Ġto', 'Ġdo', 'Ġwith']
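
To make the marker characters concrete, here is a simplified sketch that turns the token strings above back into text by hand. It assumes a GPT-2-style byte-level vocabulary in which `Ġ` stands for a space and `Ċ` for a newline, and it handles only those two markers; `tokens_to_text` is a hypothetical helper, and in practice `tokenizer.decode` takes care of the full mapping.

```python
# Token strings copied from the output above.
example_tokens = [
    'The', 'Ġbest', 'Ġadvice', 'ĠI', 'Ġever', 'Ġgot', 'Ġwas', ':', 'Ġ"', 'Don',
    "'t", 'Ġbe', 'Ġafraid', 'Ġto', 'Ġask', 'Ġfor', 'Ġhelp', '."', 'Ċ', 'Ċ',
    'I', "'m", 'Ġnot', 'Ġsure', 'Ġwhat', 'Ġto', 'Ġdo', 'Ġwith',
]


def tokens_to_text(tokens):
    """Simplified sketch: join tokens and map the two markers seen above."""
    return ''.join(tokens).replace('Ġ', ' ').replace('Ċ', '\n')


text = tokens_to_text(example_tokens)
print(text)
```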


---

### Aside: where is my model?

The `pipeline` class automates many aspects of model loading, including where it is loaded. Here's one way to check:

In [11]:
print(model.device)

cpu


If you intend to run on GPU and see `cpu` above, you might want to try Runtime -> Change runtime type in the Colab menu.

If you've loaded a larger model with `device_map='auto'` or similar, you can get further details on the placement like this:

In [12]:
print(model.hf_device_map)

{'': 'cpu'}


(If you see anything with the value `disk` above, part of the model has been offloaded onto disk, which may make things very slow.)
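
For larger device maps, scanning the printout by eye gets tedious. The following sketch lists the entries placed on disk. Note that `offloaded_modules` is a hypothetical helper and `example_map` is made up for illustration; with a real model you would pass `model.hf_device_map` instead.

```python
def offloaded_modules(device_map, slow_devices=('disk',)):
    """Return the names of modules placed on one of the given slow devices."""
    return sorted(
        name for name, device in device_map.items() if device in slow_devices
    )


# Made-up device map for illustration; GPU placements appear as integer indices.
example_map = {
    'model.embed_tokens': 0,
    'model.layers.0': 0,
    'model.layers.1': 'cpu',
    'model.layers.2': 'disk',
    'lm_head': 'disk',
}

print(offloaded_modules(example_map))   # ['lm_head', 'model.layers.2']
print(offloaded_modules({'': 'cpu'}))   # []
```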

---

### Generation parameters

Both `model.generate` and text generation `pipeline` calls support a broad range of parameters that control generation.

We'll look at some key ones here; for the full list refer to [the documentation](https://huggingface.co/docs/transformers/v4.48.2/en/main_classes/text_generation#transformers.GenerationConfig).

#### Output length

The parameters `min_length`, `min_new_tokens`, `max_length`, and `max_new_tokens` control the minimum and maximum length of the generated output in tokens. `min_length` and `max_length` count the prompt tokens as well as the generated ones, whereas the variants with `_new_` count only the newly generated tokens.
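
The difference between the two kinds of limits is simple arithmetic, sketched below for the 8-token prompt shown earlier. `max_generated_tokens` is a hypothetical helper written for this illustration, not a `transformers` function, and it assumes that `max_new_tokens` takes precedence when both limits are given.

```python
def max_generated_tokens(prompt_len, max_length=None, max_new_tokens=None):
    """Hypothetical helper: upper bound on the number of newly generated tokens."""
    if max_new_tokens is not None:
        # The prompt length is ignored (assumed to win if both limits are set).
        return max_new_tokens
    if max_length is not None:
        # The prompt counts toward the limit; nothing is left if it already exceeds it.
        return max(max_length - prompt_len, 0)
    raise ValueError('set max_length or max_new_tokens')


prompt_len = 8  # number of token ids in the prompt shown earlier

print(max_generated_tokens(prompt_len, max_new_tokens=100))  # 100
print(max_generated_tokens(prompt_len, max_length=100))      # 92
```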

In [13]:
outputs = pipe(
    prompt,
    min_new_tokens=50,
    max_new_tokens=100,
)

print(outputs[0]['generated_text'])

The best advice I ever got was: "Don't be afraid to ask for help."

I'm not sure what to do with my life now. I'm not sure what to do with my life. I'm not sure what to do with my life. I'm not sure what to do with my life. I'm not sure what to do with my life. I'm not sure what to do with my life. I'm not sure what to do with my life. I'm not sure what to do with my life


#### Sampling

By default, generation uses a _greedy_ decoding strategy that always picks the most likely next token, which may result in formulaic and repetitive generations.

When called with `do_sample=True`, the next token is instead sampled from the probability distribution predicted by the model. (Note that this means you'll get different outputs every time you run generation.)

In [14]:
outputs = pipe(
    prompt,
    min_new_tokens=50,
    max_new_tokens=100,
    do_sample=True,
)

print(outputs[0]['generated_text'])

The best advice I ever got was: "Let people go." As many other people have said, it's unrealistic to expect everyone to have it all together. I've had my share of great customers and, as I said, there are always people who deserve a break.

I've tried things like giving them free time and taking them out for lunch, which really puts things in perspective. It feels liberating, even if it's just for a short time.

Don't get me wrong, it's hard to be
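
The difference between the two strategies can be seen on a toy example that involves no model at all. The distribution below is made up for illustration, and `pick_greedy` and `pick_sampled` are hypothetical helpers; this is a sketch of a single decoding step, not of how `generate` is implemented.

```python
import random

# Made-up next-token distribution, for illustration only.
next_token_probs = {' help': 0.5, ' advice': 0.3, ' directions': 0.15, ' cheese': 0.05}


def pick_greedy(probs):
    """Greedy decoding: always take the most likely token."""
    return max(probs, key=probs.get)


def pick_sampled(probs, rng):
    """Sampling: draw a token in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so that the run is repeatable
greedy_picks = {pick_greedy(next_token_probs) for _ in range(1000)}
sampled_picks = [pick_sampled(next_token_probs, rng) for _ in range(1000)]

print(greedy_picks)  # a single token, however often we repeat the step
print({token: sampled_picks.count(token) for token in next_token_probs})
```

Greedy decoding returns the same token on every repetition, while sampling spreads its choices across the tokens roughly in proportion to their probabilities.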


There are a fairly large number of parameters that control the sampling strategy (see "Parameters for manipulation of the model output logits" [here](https://huggingface.co/docs/transformers/v4.48.2/en/main_classes/text_generation#transformers.GenerationConfig) for details).

A key parameter is `temperature`:

* values approaching zero approximate greedy decoding
* values < 1 assign more probability mass to likely tokens
* value 1 samples from the unmodified distribution
* values > 1 assign more probability mass to unlikely tokens
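
The effect described in the list above can be reproduced with a few lines of plain Python. The sketch assumes that temperature is applied by dividing the model's logits by it before the softmax; the logits are made up, and `softmax_with_temperature` is a hypothetical helper rather than a library function.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply a numerically stable softmax."""
    scaled = [logit / temperature for logit in logits]
    largest = max(scaled)
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


example_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens

for temperature in (0.1, 0.5, 1.0, 2.0, 100.0):
    probs = softmax_with_temperature(example_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At low temperatures nearly all probability mass ends up on the highest-scoring token, and at very high temperatures the distribution becomes almost uniform.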

Intuitively, lower values give more predictable output and higher values more "surprising" or "creative" output. Values below the default of 1.0 (e.g. 0.5 or 0.7) are commonly used for `temperature`. Very high values are unlikely to produce useful output, but may provide insight into the generation process by showing how it breaks down:

In [15]:
outputs = pipe(
    prompt,
    min_new_tokens=50,
    max_new_tokens=100,
    do_sample=True,
    temperature=100.0,
)

print(outputs[0]['generated_text'])

The best advice I ever got was: make some sense as though nobody at a local gym talks shop in '5 cars going downtown that drive downtown without any signs there so your a 'road triage operator', right?! Guns or cheese! or either all or not!! -LH, February '3 years ahead.' 
9th March
How awesome about you!!! It sounds terrific but isn’ can really mean more things since what should start it:........and when. Don this, people, have this time at. For


---

### Using chat models

Instruction- or chat-tuned models are commonly trained with special tokens to differentiate user input from model output, for example `<|im_start|>` and `<|im_end|>` in [ChatML](https://github.com/openai/openai-python/blob/release-v0.28.0/chatml.md).

Special tokens and their usage conventions can differ between models, which means that switching models could potentially require writing model-specific code to format input. To make this easier, Hugging Face tokenizers introduced chat templates that implement that formatting.

Basic usage is illustrated below. For more details, see the [Chat Templates documentation](https://huggingface.co/docs/transformers/main/en/chat_templating).

In [16]:
messages = [
    { 'role': 'user', 'content': 'What is the capital of Finland?' }
]

print(tokenizer.apply_chat_template(messages, tokenize=False))

<|im_start|>system
You are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>
<|im_start|>user
What is the capital of Finland?<|im_end|>



Depending on the model, you may also see an added system message above. Here's how that would look for another model:

In [17]:
from transformers import AutoTokenizer

other_model = 'HuggingFaceH4/zephyr-7b-beta'
other_tokenizer = AutoTokenizer.from_pretrained(other_model)

print(other_tokenizer.apply_chat_template(messages, tokenize=False))

<|user|>
What is the capital of Finland?</s>



With `tokenize=False`, the `apply_chat_template` function returns a string that can be tokenized and used for generation as usual. Here we also pass `add_generation_prompt=True` to ensure that the model continues the prompt with an assistant response rather than, e.g., continuing the user message. (For more details on this, see [the documentation](https://huggingface.co/docs/transformers/main/en/chat_templating).)

In [18]:
input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(input_text, return_tensors='pt').to(model.device)
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0]))

<|im_start|>system
You are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>
<|im_start|>user
What is the capital of Finland?<|im_end|>
<|im_start|>assistant
The capital of Finland is Helsinki.<|im_end|>
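
To see what the template is doing, here is a hand-written approximation of the ChatML-style formatting shown above. It is only a sketch: real chat templates are defined per model and can differ in details, `format_chatml` is a hypothetical helper, and the default system message is simply copied from the output above.

```python
def format_chatml(messages, default_system=None, add_generation_prompt=False):
    """Hand-written approximation of a ChatML-style chat template."""
    messages = list(messages)  # copy so that the caller's list is not modified
    # Prepend a default system message if the conversation does not start with one.
    if default_system is not None and (not messages or messages[0]['role'] != 'system'):
        messages.insert(0, {'role': 'system', 'content': default_system})
    text = ''.join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    # Open an assistant turn so that the model answers instead of continuing the user.
    if add_generation_prompt:
        text += '<|im_start|>assistant\n'
    return text


example_messages = [{'role': 'user', 'content': 'What is the capital of Finland?'}]
system_text = 'You are a helpful AI assistant named SmolLM, trained by Hugging Face'

without_prompt = format_chatml(example_messages, default_system=system_text)
with_prompt = format_chatml(
    example_messages, default_system=system_text, add_generation_prompt=True
)
print(with_prompt)
```

The only difference between the two strings is the trailing `<|im_start|>assistant` line, which is what `add_generation_prompt=True` contributes.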


You can of course also use a `pipeline` with text formatted by `apply_chat_template`:

In [19]:
pipe(input_text)[0]['generated_text']

'<|im_start|>system\nYou are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>\n<|im_start|>user\nWhat is the capital of Finland?<|im_end|>\n<|im_start|>assistant\nThe capital of Finland is Helsinki.'

In recent versions of the `transformers` library, you can pass a list of messages directly to a `pipeline` that wraps an instruction- or chat-tuned model and its tokenizer. In this case, the chat template is applied automatically to process input and output:

In [20]:
outputs = pipe(messages)
outputs[0]['generated_text']

[{'role': 'user', 'content': 'What is the capital of Finland?'},
 {'role': 'assistant', 'content': 'The capital of Finland is Helsinki.'}]
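
Since the result is an ordinary list of message dictionaries, picking out the reply or extending the conversation is plain list handling. The sketch below works on a literal copy of the output above, so it runs without a model; the follow-up question is made up for illustration.

```python
# Literal copy of the pipeline output above.
example_outputs = [{
    'generated_text': [
        {'role': 'user', 'content': 'What is the capital of Finland?'},
        {'role': 'assistant', 'content': 'The capital of Finland is Helsinki.'},
    ]
}]

conversation = example_outputs[0]['generated_text']
reply = conversation[-1]  # the last message is the newly generated one
print(reply['role'], '->', reply['content'])

# To continue the chat, append a new user message; the extended list could
# then be passed to the pipeline in the same way as `messages` above.
follow_up = conversation + [{'role': 'user', 'content': 'How many people live there?'}]
print(len(follow_up))
```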

Note that even instruction- or chat-tuned models can revert to standard "continuation" generation if you prompt them without the expected template.

In [21]:
pipe('What is the capital of Finland?', min_new_tokens=10, max_new_tokens=20)[0]['generated_text']

'What is the capital of Finland?\n\nA: Finland is a country in Northern Europe. It is located in the northern part of'