<a href="https://colab.research.google.com/github/theogbrand/learning_llm_handbook/blob/main/notebooks/Week1_LLM_Fundamentals/Week1_LLM_Fundamentals.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 1: LLM Fundamentals and Development Environment Setup

Resource Required: Google Colab (T4) GPU - free

💡 NOTE: You need a GPU to run the examples in this notebook. In Google Colab, go to Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4.
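
After switching the runtime, it can help to confirm that a GPU is actually visible before loading any models. The check below is a minimal sketch: `describe_accelerator` is a hypothetical helper name, and it assumes PyTorch is already installed, as it normally is on Colab.

```python
import importlib.util


def describe_accelerator() -> str:
    """Return a short description of the GPU visible to PyTorch, if any."""
    # Avoid an ImportError on machines where PyTorch is not installed.
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"
    import torch

    if not torch.cuda.is_available():
        return "No CUDA GPU detected; switch the Colab runtime to a T4 GPU"
    return f"CUDA GPU detected: {torch.cuda.get_device_name(0)}"


print(describe_accelerator())
```

If the message reports that no GPU was detected, change the runtime type before continuing, because the Phi-3 cell further down requires one.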

In [1]:
!pip install "transformers>=4.40.1" "accelerate>=0.27.2"

In this course, we will use Hugging Face's Transformers library to train and test our own models. Let's get familiar with the library by loading a pre-trained model and using it to generate text.

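`Phi-3-mini-4k-instruct` is a small, fast, and efficient model from Microsoft, with about 3.8 billion parameters and a 4k-token context window. It is a decoder-only Transformer, the same architecture family as the GPT models, and it is primarily intended for English text.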

The first interesting thing to note is that the model is loaded onto the GPU. This is because the `device_map` argument is set to `"cuda"`, which places the model weights on the GPU.

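The second interesting thing to note is that the model is loaded using the `AutoModelForCausalLM` class. This is a convenience class that reads the checkpoint's configuration, picks the matching model architecture, and loads its weights. It does not load the tokenizer, which is loaded separately with `AutoTokenizer`. Hugging Face offers other `AutoModel...` classes for other model types.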

The third interesting thing to note is that the model and tokenizer are then passed to the `pipeline` function. This is a convenience function that bundles the model and its tokenizer into a single callable for easy inference.

The reason to use the `pipeline` function for simple inference will become more obvious later, when we analyze the sequential steps taken for a model to generate text.

`Phi-3-mini-4k-instruct` is a chat-style/instruction-tuned model, which means it is trained to generate text in a conversational manner. This is why it is capable of chatting with us like a person would.

Chat-style models are trained from base models through a process called instruction tuning, a type of supervised finetuning. The difference between vanilla finetuning and instruction tuning is that vanilla finetuning typically trains the model on one specific task, while instruction tuning trains it on many different tasks phrased as instructions, so that it learns to follow instructions in general.

Before a model is instruction-tuned, it is usually known as a base model. A base model is trained with a single objective, predicting the next token in a sequence, so it simply continues whatever text it is given. We will explore the mechanics of how this works in greater detail later.
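
To make "predicting the next token" concrete, here is a deliberately tiny sketch that does not use a neural network at all. It counts which word follows which in a made-up corpus and then repeatedly appends the most frequent continuation. The corpus and the helper names `predict_next` and `complete` are illustrative assumptions. A real base model learns a far richer version of the same idea over subword tokens.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; a real base model trains on vastly more text.
corpus = "the cat sat on the mat . the cat sat on the rug .".split()

# Count how often each token is followed by each other token.
next_counts = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_counts[current][following] += 1


def predict_next(token: str) -> str:
    """Return the most frequent continuation of `token` seen in the corpus."""
    return next_counts[token].most_common(1)[0][0]


def complete(prompt: str, steps: int) -> str:
    """Extend the prompt one token at a time, like a base model would."""
    tokens = prompt.split()
    for _ in range(steps):
        if not next_counts.get(tokens[-1]):
            break  # no known continuation for this token
        tokens.append(predict_next(tokens[-1]))
    return " ".join(tokens)


print(complete("the", 3))  # → the cat sat on
```

Notice that the toy model never answers anything. It only continues text, which is the behaviour instruction tuning is meant to change.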

In [4]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct", # chosen for its small size and fast inference speed on smaller GPUs
    device_map="cuda", # requires GPU
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=500,
    do_sample=False
)

messages = [
    {"role": "user", "content": "What is the nature of reality?"}
]

output = generator(messages)
print(output[0]["generated_text"])
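
When we pass a list of messages to the pipeline, the messages are first flattened into a single prompt string using the tokenizer's chat template, and only then tokenized and fed to the model. As a rough sketch of that step, the hypothetical `render_chat` helper below wraps each message in role markers. The marker strings are illustrative assumptions rather than something defined in this notebook. To see the template the model actually uses, call `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`.

```python
def render_chat(messages: list[dict[str, str]]) -> str:
    """Flatten chat messages into one prompt string (illustrative markers only)."""
    parts = []
    for message in messages:
        # Each turn is wrapped in a role marker and an end-of-turn marker.
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    # A trailing assistant marker tells the model it is its turn to respond.
    parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [{"role": "user", "content": "What is the nature of reality?"}]
print(render_chat(messages))
```

Seen this way, a chat model is still completing text. The template just arranges the conversation so that the most natural continuation is the assistant's reply.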

Now we will load a base model, GPT-2 by OpenAI, which has not been instruction-tuned, and see what it generates for the same prompt.

When you run the cell below, you will likely see that the model does not produce a coherent answer, and the text is not very meaningful. This is because the model was not trained to respond in a conversational manner, but only to predict the tokens that follow the prompt.

After instruction tuning was introduced, it became the standard way for researchers and developers to train models capable of accomplishing a wide range of tasks. The products we know and love today, like ChatGPT, Claude, and Gemini, are all based on instruction-tuned models.

Instruction tuning was one of the fundamental breakthroughs that made LLMs as useful as they are today.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

tokenizer.pad_token = tokenizer.eos_token # Ignore this for now, we will learn what these are down the line
model.config.pad_token_id = model.config.eos_token_id # Ignore this for now, we will learn what these are down the line

prompt = "What is the nature of reality?"
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=100,
    add_special_tokens=True
)

# Generate text
gen_tokens = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    do_sample=True,
    temperature=0.9,
    max_length=100,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

# Decode and print the generated text
gen_text = tokenizer.batch_decode(gen_tokens, skip_special_tokens=True)[0]
print(gen_text)
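
The two cells above decode differently: the Phi-3 pipeline uses `do_sample=False` (always pick the highest-scoring token), while the GPT-2 cell uses `do_sample=True` with `temperature=0.9` (draw a token at random, weighted by probability). The sketch below illustrates the difference on a single step. The three-word vocabulary and the scores are made up for illustration, and `softmax_with_temperature` is a hypothetical helper, not a Transformers function.

```python
import math
import random


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["reality", "the", "banana"]
logits = [2.0, 1.0, 0.1]  # made-up scores for the next token

# Greedy decoding (do_sample=False): always take the top-scoring token.
greedy_choice = vocab[logits.index(max(logits))]

# Sampling (do_sample=True): draw a token according to its probability.
rng = random.Random(0)  # seeded so repeated runs give the same draw
probs = softmax_with_temperature(logits, temperature=0.9)
sampled_choice = rng.choices(vocab, weights=probs, k=1)[0]

print("greedy:", greedy_choice)
print("probabilities:", [round(p, 3) for p in probs])
print("sampled:", sampled_choice)
```

Because sampling is random, rerunning the GPT-2 cell will usually give different text each time, whereas greedy decoding gives the same output for the same prompt.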