# Lesson 1: Why Pretraining?

## Install dependencies and fix seed

Welcome to Lesson 1!

If you would like to access the `requirements.txt` file for this course, go to `File` and click on `Open`.

In [None]:
# Install any missing packages (uncomment the next line to run it)
# !pip install -q -r ../requirements.txt

In [1]:
# Ignore insignificant warnings (e.g., deprecations)
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Set a seed for reproducibility
import torch

def fix_torch_seed(seed=42):
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

fix_torch_seed()
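
The helper above seeds only PyTorch. As a minimal sketch, assuming you also draw random numbers from Python's built-in `random` module, a broader variant could look like this. The name `fix_all_seeds` is hypothetical and not part of the course code, and the PyTorch import is optional so the snippet also runs where PyTorch is not installed.

```python
import random


def fix_all_seeds(seed=42):
    """Seed Python's RNG and, if installed, PyTorch's CPU and CUDA RNGs."""
    random.seed(seed)
    try:
        import torch
    except ImportError:
        return  # PyTorch not installed; only Python's RNG was seeded
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # safe to call without a GPU


# Re-seeding replays the same sequence of random draws
fix_all_seeds(0)
first = [random.random() for _ in range(3)]
fix_all_seeds(0)
second = [random.random() for _ in range(3)]
print(first == second)  # True
```

Seeding makes random draws repeatable within one environment; results can still differ across hardware or library versions.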

## Load a general pretrained model

**TinySolar-248m-4k** is a small **decoder-only** model with **248M parameters** (similar in scale to GPT-2) and a **4096-token** context window. You can find the model on the [Hugging Face model library](https://huggingface.co/upstage/TinySolar-248m-4k).

We'll load the model in three steps:
1. Specify the path to the model in the Hugging Face model library
2. Load the model using `AutoModelForCausalLM` from the `transformers` library
3. Load the tokenizer for the model from the same model path

In [10]:
checkpoint = "upstage/TinySolar-248m-4k"

In [None]:
from transformers import AutoModelForCausalLM

tiny_general_model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="auto",  # uses a GPU when one is available; set to "cpu" to force CPU
    torch_dtype=torch.bfloat16,
    # use_auth_token=True
)

In [24]:
next(tiny_general_model.parameters()).device

device(type='cuda', index=0)
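
To check the "248M parameters" figure yourself, you can sum the sizes of all parameter tensors. This is a minimal sketch: `count_parameters` is a hypothetical helper name, and it only assumes the model exposes `parameters()` whose items have `numel()`, as PyTorch modules do.

```python
def count_parameters(model):
    """Return the total number of elements across all parameter tensors."""
    return sum(p.numel() for p in model.parameters())


# Usage in this notebook (requires the model loaded above):
# print(f"{count_parameters(tiny_general_model) / 1e6:.0f}M parameters")
```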

In [None]:
from transformers import AutoTokenizer
tiny_general_tokenizer = AutoTokenizer.from_pretrained(
    checkpoint,
    # use_auth_token=True
)

## Generate text samples

Here we'll be generating some text with the model. We'll set a prompt, instantiate a text streamer, and then have the model complete the prompt.
> A **text streamer** prints generated text token by token as it is produced, rather than waiting for the entire output to be completed before displaying anything.

In [15]:
prompt = "I am an engineer. I love"

In [29]:
inputs = tiny_general_tokenizer(
    prompt, return_tensors="pt"
).to(tiny_general_model.device)  # keep inputs on the same device as the model

In [30]:
inputs

{'input_ids': tensor([[    1,   315,   837,   396, 18112, 28723,   315,  2016]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

In [31]:
inputs['input_ids'].device

device(type='cuda', index=0)

In [20]:
from transformers import TextStreamer

streamer = TextStreamer(
    tiny_general_tokenizer,
    skip_prompt=True, # Set to False to print the prompt before the generated text
    skip_special_tokens=True
)
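
To make the idea of streaming concrete, here is a toy sketch in plain Python. It is not how `TextStreamer` is implemented: `fake_token_stream` and `print_streamed` are hypothetical names, and the token list is made up. It only shows the pattern of writing each piece as soon as it arrives.

```python
import sys


def fake_token_stream(tokens):
    """Yield tokens one at a time, as a generation loop would produce them."""
    for token in tokens:
        yield token


def print_streamed(tokens, out=sys.stdout):
    """Write each token as soon as it is available and return the full text."""
    pieces = []
    for token in fake_token_stream(tokens):
        out.write(token)
        out.flush()  # show partial output immediately
        pieces.append(token)
    out.write("\n")
    return "".join(pieces)


text = print_streamed(["I", " love", " to", " travel"])
```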

Let's look at a few important keyword arguments of the **generate** method.
* **use_cache**
  * During generation, models generate tokens one at a time, using self-attention to look back at previous tokens.
  * Without caching, the model recomputes the entire history every time it generates a new token.
  * With caching, the model remembers intermediate results (key, value pairs) from previous steps and reuses them, which speeds things up.
* **do_sample**
  * When `False`, the model picks the token with the highest probability at each step (greedy decoding).
    * The output is deterministic: the same input always gives the same output.
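
The difference between greedy decoding and sampling can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: the function names and the three-token "vocabulary" of scores are made up.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert raw scores into probabilities; temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_greedy(logits):
    """do_sample=False: always return the index of the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])


def pick_sampled(logits, temperature=1.0, rng=random):
    """do_sample=True: draw an index in proportion to its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # made-up scores for a 3-token vocabulary
print(pick_greedy(logits))                # 0
print(softmax(logits, temperature=1.0))
print(softmax(logits, temperature=0.5))   # sharper: more mass on index 0
print(pick_sampled(logits, rng=random.Random(0)))
```

In this sketch the temperature divides the scores, so a value of `0.0` would divide by zero. Temperature is only meaningful when sampling, and there it must be positive.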

In [32]:
outputs = tiny_general_model.generate(
    **inputs, 
    streamer=streamer, 
    use_cache=True,    # Reuses cached key/value pairs from earlier steps
    max_new_tokens=128,
    do_sample=False,   # Controls whether sampling is used to pick next tokens
                       # False means greedy decoding is done - always pick the
                       # highest probability token. True would allow randomness
    temperature=0.0,   # Ignored here: only takes effect when do_sample=True
    repetition_penalty=1.1  # Penalizes tokens that have already appeared
)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


I am an engineer. I love to travel and have been a part of the team since 2013.
I'm a big fan of the music scene in London, so I was really excited when I saw this album on the radio. It's a great song about being a musician and having fun with it. The lyrics are very catchy and catchy, but they also make you feel like you're listening to something new.
The album is called "The Sound of Music" and it's a great track that has a lot of different influences from other genres. It's got some cool songs such as "Sweet
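
To get a feel for what `use_cache=True` saves, the toy count below tallies how many token positions are fed through the model during generation. It is a rough sketch with a hypothetical function name: it ignores that attention over the cached keys and values still grows with sequence length. The numbers 8 and 128 are the prompt length and `max_new_tokens` used above.

```python
def tokens_processed(prompt_len, new_tokens, use_cache):
    """Count token positions fed through the model during generation."""
    total = 0
    seq_len = prompt_len
    for step in range(new_tokens):
        if use_cache and step > 0:
            total += 1        # only the newest token; earlier keys/values are reused
        else:
            total += seq_len  # the whole sequence so far is processed
        seq_len += 1
    return total


print(tokens_processed(8, 128, use_cache=False))  # 9152
print(tokens_processed(8, 128, use_cache=True))   # 135
```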


## Generate Python samples with pretrained general model

We will try to use the model to write a Python function called `find_max()` that finds the maximum value in a list of numbers:

In [33]:
prompt = "def find_max(numbers):"

In [34]:
inputs = tiny_general_tokenizer(
    prompt, return_tensors="pt"
).to(tiny_general_model.device)

streamer = TextStreamer(
    tiny_general_tokenizer, 
    skip_prompt=True, # Set to false to include the prompt in the output
    skip_special_tokens=True
)

In [35]:
outputs = tiny_general_model.generate(
    **inputs, 
    streamer=streamer, 
    use_cache=True, 
    max_new_tokens=128, 
    do_sample=False, 
    temperature=0.0, 
    repetition_penalty=1.1
)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



       """
       Returns the number of times a user has been added to the list.
       """
       return num_users() + 1

   def get_user_id(self, id):
       """
       Returns the number of users that have been added to the list.
       """
       return self._get_user_id(id)

   def get_user_name(self, name):
       """
       Returns the name of the user that has been added to the list.
       """
       return self._get_user_name(name
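
The `repetition_penalty=1.1` argument used in the call above lowers the scores of tokens that have already appeared. The sketch below shows the commonly used formulation on plain Python lists. It is an illustration, not the library's code, which operates on tensors and may differ in detail. The function name and scores are made up.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.1):
    """Lower the scores of tokens that already appear in the sequence.

    Positive scores are divided by the penalty and negative scores are
    multiplied by it, so both become less likely when penalty > 1.
    """
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted


logits = [2.2, -1.0, 0.5]  # made-up scores for a 3-token vocabulary
print(apply_repetition_penalty(logits, seen_token_ids=[0, 1]))
```

Token 2 has not been seen, so its score is unchanged; tokens 0 and 1 both move toward "less likely".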


## Generate Python samples with finetuned Python model

This model has been fine-tuned on code instruction examples.

More information about the fine-tuning datasets is available on the Hugging Face model library at [this link](https://huggingface.co/upstage/TinySolar-248m-4k-code-instruct).

In [37]:
finetuned_checkpoint = "upstage/TinySolar-248m-4k-code-instruct"

In [38]:
tiny_finetuned_model = AutoModelForCausalLM.from_pretrained(
    finetuned_checkpoint,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

tiny_finetuned_tokenizer = AutoTokenizer.from_pretrained(
    finetuned_checkpoint
)

In [39]:
prompt = "def find_max(numbers):"

inputs = tiny_finetuned_tokenizer(
    prompt, return_tensors="pt"
).to(tiny_finetuned_model.device)

streamer = TextStreamer(
    tiny_finetuned_tokenizer,
    skip_prompt=True,
    skip_special_tokens=True
)

outputs = tiny_finetuned_model.generate(
    **inputs,
    streamer=streamer,
    use_cache=True,
    max_new_tokens=128,
    do_sample=False,
    temperature=0.0,
    repetition_penalty=1.1
)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



   if len(numbers) == 0:
       return "Invalid input"
   else:
       return max(numbers)
```

In this solution, the `find_max` function takes a list of numbers as input and returns the maximum value in that list. It then iterates through each number in the list and checks if it is greater than or equal to 1. If it is, it adds it to the `max` list. Finally, it returns the maximum value found so far.
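
Each `generate` call in this lesson repeats the same settings and logs ``Setting `pad_token_id` to `eos_token_id` ``. One way to keep the settings in one place and pass `pad_token_id` explicitly is sketched below. `generation_kwargs` is a hypothetical helper, and the stand-in tokenizer exists only so the snippet runs without downloading a model.

```python
from types import SimpleNamespace


def generation_kwargs(tokenizer, max_new_tokens=128):
    """Collect the generate() settings used in this lesson in one place."""
    return {
        "use_cache": True,
        "max_new_tokens": max_new_tokens,
        "do_sample": False,  # greedy decoding
        "repetition_penalty": 1.1,
        # Passing pad_token_id explicitly mirrors the fallback that the
        # log message describes (pad token set to the EOS token).
        "pad_token_id": tokenizer.eos_token_id,
    }


# Stand-in tokenizer for illustration; real tokenizers expose eos_token_id too
demo_tokenizer = SimpleNamespace(eos_token_id=2)
print(generation_kwargs(demo_tokenizer)["pad_token_id"])  # 2
```

With a real model and tokenizer, the call would then read `model.generate(**inputs, streamer=streamer, **generation_kwargs(tokenizer))`.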


## Generate Python samples with pretrained Python model

Now, we will use a version of TinySolar-248m-4k that has been further pretrained (a process called **continued pretraining**) on a large selection of Python code samples.

The model can be found on Hugging Face at [this link](https://huggingface.co/upstage/TinySolar-248m-4k-py).

In [40]:
pretrained_checkpoint = "upstage/TinySolar-248m-4k-py" 

In [41]:
tiny_custom_model = AutoModelForCausalLM.from_pretrained(
    pretrained_checkpoint,
    device_map="auto",
    torch_dtype=torch.bfloat16,    
)

tiny_custom_tokenizer = AutoTokenizer.from_pretrained(
    pretrained_checkpoint
)

In [42]:
prompt = "def find_max(numbers):"

inputs = tiny_custom_tokenizer(
    prompt, return_tensors="pt"
).to(tiny_custom_model.device)

streamer = TextStreamer(
    tiny_custom_tokenizer,
    skip_prompt=True, 
    skip_special_tokens=True
)

outputs = tiny_custom_model.generate(
    **inputs, streamer=streamer,
    use_cache=True, 
    max_new_tokens=128, 
    do_sample=False, 
    repetition_penalty=1.1
)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



   """Find the maximum number of numbers in a list."""
   max = 0
   for num in numbers:
       if num > max:
           max = num
   return max


def get_min_max(numbers, min_value=1):
   """Get the minimum value of a list."""
   min_value = min_value or 1
   for num in numbers:
       if num < min_value:
           min_value = num
   return min_value



Try running the Python code the model generated above:

In [43]:
def find_max(numbers):
   max = 0
   for num in numbers:
       if num > max:
           max = num
   return max

In [44]:
find_max([1,3,5,1,6,7,2])

7
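
The generated function works for this input, but it starts its running maximum at `0`, so it quietly assumes the list contains at least one non-negative number. The sketch below reproduces the model's logic under a different name and compares it with a hypothetical variant that starts from the first element instead. Both function names are ours, not the model's.

```python
def find_max_generated(numbers):
    """The model's logic: starts from 0, so it assumes non-negative input."""
    largest = 0
    for num in numbers:
        if num > largest:
            largest = num
    return largest


def find_max_checked(numbers):
    """Start from the first element so negative-only lists work too."""
    if not numbers:
        raise ValueError("numbers must not be empty")
    largest = numbers[0]
    for num in numbers[1:]:
        if num > largest:
            largest = num
    return largest


print(find_max_generated([-5, -2, -9]))         # 0, which is not in the list
print(find_max_checked([-5, -2, -9]))           # -2
print(find_max_checked([1, 3, 5, 1, 6, 7, 2]))  # 7
```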