# Lesson 1: Why Pretraining?

## 1. Install dependencies and fix seed

Welcome to Lesson 1!

If you would like to access the `requirements.txt` file for this course, go to `File` and click on `Open`.

In [None]:
# Install any missing packages
# !pip install -q -r ../requirements.txt

In [1]:
# Cases where pretraining is the best option:
# => Note: base models are good at generating text but not at following instructions
# => The model needs to learn a new language
# => Fine-tuning alone is not enough
# => The new knowledge is not well represented in the base model

In [2]:
# Ignore insignificant warnings (e.g., deprecation warnings)
import warnings
warnings.filterwarnings('ignore')

In [3]:
# Set a seed for reproducibility
import torch

def fix_torch_seed(seed=42):
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

fix_torch_seed()
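`fix_torch_seed` seeds only PyTorch's generators. If other code in a notebook also draws from Python's built-in `random` module, that generator would need its own seed. The sketch below illustrates the same reseed-and-repeat idea using only the standard library; `fix_python_seed` is a hypothetical helper name, not part of the course code.

```python
import random

def fix_python_seed(seed=42):
    """Seed Python's built-in random number generator (hypothetical helper)."""
    random.seed(seed)

# Draw three numbers, reseed, and draw again: the sequences should match
fix_python_seed()
first = [random.random() for _ in range(3)]

fix_python_seed()
second = [random.random() for _ in range(3)]

print(first == second)
```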

## 2. Load a general pretrained model

This course works with small models that fit within the memory of the learning platform. TinySolar-248m-4k is a small decoder-only model with 248M parameters (similar in scale to GPT-2) and a 4096-token context window. You can find the model on the Hugging Face model library at [this link](https://huggingface.co/upstage/TinySolar-248m-4k).
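To get a rough feel for why a 248M-parameter model fits in limited memory, the sketch below estimates the size of the weights alone. It assumes exactly 248,000,000 parameters (an approximation taken from the model name) and ignores activations, the KV cache, and framework overhead; `weight_memory_bytes` is a hypothetical helper.

```python
# Bytes needed to store one parameter in each data type
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

def weight_memory_bytes(num_params, dtype="bfloat16"):
    """Estimate the memory taken by the model weights only (hypothetical helper)."""
    return num_params * BYTES_PER_PARAM[dtype]

num_params = 248_000_000  # assumption: rounded from "248M" in the model name

for dtype in ("float32", "bfloat16"):
    gigabytes = weight_memory_bytes(num_params, dtype) / 1e9
    print(f"{dtype}: ~{gigabytes:.2f} GB of weights")
```

Under these assumptions, storing the weights in `bfloat16` takes about half the memory of `float32`.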

You'll load the model in three steps:
1. Specify the path to the model in the Hugging Face model library
2. Load the model using `AutoModelForCausalLM` in the `transformers` library
3. Load the tokenizer for the model from the same model path

In [4]:
model_path_or_name = "./models/TinySolar-248m-4k"

In [5]:
from transformers import AutoModelForCausalLM
tiny_general_model = AutoModelForCausalLM.from_pretrained(
    model_path_or_name,
    device_map="auto",  # places the model on a GPU if one is available; use "cpu" to force CPU
    torch_dtype=torch.bfloat16  # data type used for the model weights
)

In [None]:
# Let's also practice with other small models later in this lesson

In [6]:
from transformers import AutoTokenizer
tiny_general_tokenizer = AutoTokenizer.from_pretrained(
    model_path_or_name
)

## 3. Generate text samples

Here you'll try generating some text with the model. You'll set a prompt, instantiate a text streamer, and then have the model complete the prompt:

In [7]:
prompt = "I am an engineer. I love"

In [8]:
inputs = tiny_general_tokenizer(
    prompt, return_tensors="pt"
).to(tiny_general_model.device)  # keep the inputs on the same device as the model

In [9]:
from transformers import TextStreamer
streamer = TextStreamer(
    tiny_general_tokenizer,
    skip_prompt=True,  # Set to False to include the prompt in the streamed output
    skip_special_tokens=True
)

In [10]:
outputs = tiny_general_model.generate(
    **inputs, 
    streamer=streamer, 
    use_cache=True,
    max_new_tokens=128,
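    do_sample=False,  # greedy decoding: always pick the most likely next token
    temperature=0.0,  # ignored by greedy decoding; recent transformers versions may warn about it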
    repetition_penalty=1.1
)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


to travel and have been a part of the community since 1985.
I'm a big fan of the music scene in New York City, so I was excited to see what the city could do with this new album. The first track on the album is "The Last Time" which features the band's lead singer, guitarist, and bassist, John Lennon. It's a great song that you can hear live at the end of the record.
The second track on the album is "Song for the Wicked" which features the band's vocalist, guitarist,
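The `repetition_penalty=1.1` argument used above discourages the model from repeating tokens it has already produced. The sketch below shows the usual rule on a toy list of scores: positive scores of seen tokens are divided by the penalty and negative ones are multiplied by it. The function name and the scores are made up for illustration, and this is a simplified stand-in rather than the library's actual implementation.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """Return a copy of `logits` where already-generated token ids are made less likely."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Positive scores shrink toward zero; negative scores become more negative
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted

toy_logits = [2.2, -1.1, 0.5, 3.3]  # made-up scores for a 4-token vocabulary
already_generated = [0, 1]          # token ids that have appeared so far

print(apply_repetition_penalty(toy_logits, already_generated))
```

Tokens 2 and 3 have not appeared yet, so their scores are left unchanged.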


In [11]:
# Try other pretrained models, starting with T5

model_path = 'google-t5/t5-small'

In [14]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Define the model path
model_path = 'google-t5/t5-small'

# Load the T5 model
tiny_general_model = AutoModelForSeq2SeqLM.from_pretrained(
    model_path,
    device_map="auto",  
    torch_dtype=torch.bfloat16  
)

# Load the tokenizer
tiny_general_tokenizer = AutoTokenizer.from_pretrained(model_path)



In [15]:

from transformers import AutoTokenizer
tiny_gen = AutoTokenizer.from_pretrained(model_path)

In [35]:
prompt = "Convert the following to german : I am Samridhi?"

In [36]:
inputs = tiny_gen(
    prompt, return_tensors="pt"
).to(tiny_general_model.device)  # keep the inputs on the same device as the model

In [37]:
# Text streamer for the T5 tokenizer (not passed to generate() in the next cell)
from transformers import TextStreamer
streamer = TextStreamer(
    tiny_gen,
    skip_prompt=True,
    skip_special_tokens=True
)

In [38]:
outputs = tiny_general_model.generate(
    **inputs,
    max_new_tokens=128,       # Maximum number of tokens to generate
    do_sample=True,           # Sample from the distribution instead of greedy decoding
    temperature=0.7,          # Values below 1.0 make sampling less random
    repetition_penalty=1.1    # Penalize tokens that have already appeared
)
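The cell above switches from greedy decoding to sampling with `temperature=0.7`. As a rough illustration of what temperature does, the sketch below applies a temperature-scaled softmax to three made-up scores and compares it with the greedy choice. The function names and scores are hypothetical, and the sketch assumes a temperature strictly greater than zero.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Greedy decoding: return the index of the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])

toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

for temperature in (1.0, 0.7, 0.1):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

print("greedy choice:", greedy_pick(toy_logits))
```

As the temperature decreases, more probability mass moves to the highest-scoring token, so sampling behaves more and more like greedy decoding.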


In [39]:
outputs

tensor([[    0,  2974,  3027,    16, 13692,     3,    10,    27,   183,  3084,
          4055,   107,    23,    58,     1]])

In [41]:
generated_text = tiny_gen.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Konvert in german : I am Samridhi?
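The output above mostly echoes the input instead of translating it. One likely reason is that T5 checkpoints were trained with fixed task prefixes such as `translate English to German: `, and the free-form instruction used here is not one of them. The sketch below builds a prompt in that prefix form; `build_t5_prompt` is a hypothetical helper, and whether the translation improves is something to check by rerunning the cells with the new prompt.

```python
def build_t5_prompt(task_prefix, text):
    """Join a T5-style task prefix and the input text as 'prefix: text' (hypothetical helper)."""
    return f"{task_prefix.strip().rstrip(':')}: {text.strip()}"

prompt = build_t5_prompt("translate English to German", "I am Samridhi")
print(prompt)
```

The resulting string can be passed to the tokenizer in place of the original `prompt`.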


Next, try a different kind of model: a DistilBERT checkpoint fine-tuned for sentiment classification.

In [42]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

In [43]:
model_path = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)


In [44]:
sentiment_analyzer = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)


In [50]:
sentence = "i love icecreams so much but hate potatoes"

In [None]:

# Test input
#sentence = "Transformers are amazing tools for NLP!"
result = sentiment_analyzer(sentence)
print(result)


Trying another model: extractive question answering with a DistilBERT checkpoint distilled on SQuAD.


In [52]:
from transformers import pipeline

qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

context = "The Eiffel Tower is located in Paris and was constructed in 1889."
question = "Where is the Eiffel Tower located?"

# Get answer
answer = qa_pipeline(question=question, context=context)
print(answer)



{'score': 0.9808771014213562, 'start': 31, 'end': 36, 'answer': 'Paris'}
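The `start` and `end` values in the result above are character offsets into `context`, which is what makes this an extractive model: the answer is a span copied from the input rather than newly generated text. A quick check, reusing the context and the printed result from the cell above:

```python
context = "The Eiffel Tower is located in Paris and was constructed in 1889."

# Result copied from the question-answering output above
answer = {"score": 0.9808771014213562, "start": 31, "end": 36, "answer": "Paris"}

# Slice the context with the returned character offsets
span = context[answer["start"]:answer["end"]]
print(span)
print(span == answer["answer"])
```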


Next, try a text-generation (causal language) model, GPT-Neo 125M:

In [54]:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_path = "EleutherAI/gpt-neo-125M"

In [56]:
tiny_general_model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.bfloat16
)
tiny_general_tokenizer = AutoTokenizer.from_pretrained(model_path)


In [57]:
prompt = "Once upon a time, there was a cat who"
inputs = tiny_general_tokenizer(
    prompt, return_tensors="pt"
).to(tiny_general_model.device)  # keep the inputs on the same device as the model


In [58]:
outputs = tiny_general_model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,
    repetition_penalty=1.1
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [60]:
generated_text = tiny_general_tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Once upon a time, there was a cat who was dying in the woods. The thought seemed to be that he was dead, but he could not be brought to look for it.

"Don't you know?" the cat asked.

"Not yet. It will be an hour


## 4. Generate Python samples with pretrained general model

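Use the model to write a Python function called `find_max()` that finds the maximum value in a list of numbers. Because `tiny_general_model` and `tiny_general_tokenizer` were reassigned in the previous section, the model used here is GPT-Neo 125M (note `eos_token_id` 50256 in the output below), not TinySolar-248m-4k: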

In [61]:
prompt = "def find_max(numbers):"

In [62]:
inputs = tiny_general_tokenizer(
    prompt, return_tensors="pt"
).to(tiny_general_model.device)

streamer = TextStreamer(
    tiny_general_tokenizer, 
    skip_prompt=True, # Set to false to include the prompt in the output
    skip_special_tokens=True
)

In [63]:
outputs = tiny_general_model.generate(
    **inputs, 
    streamer=streamer, 
    use_cache=True, 
    max_new_tokens=128, 
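    do_sample=False,  # greedy decoding: always pick the most likely next token
    temperature=0.0,  # ignored by greedy decoding; recent transformers versions may warn about it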
    repetition_penalty=1.1
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



        """
        Find the maximum number of numbers in a given list.

        :param list: The list of numbers to find.
        :type list: str
        """
        if not isinstance(numbers, list):
            raise TypeError("list must be a list")
        if not isinstance(numbers, tuple):



## 5. Generate Python samples with finetuned Python model

This model has been fine-tuned on instruction code examples. You can find the model and information about the fine-tuning datasets on the Hugging Face model library at [this link](https://huggingface.co/upstage/TinySolar-248m-4k-code-instruct).

You'll follow the same steps as above to load the model and use it to generate text.

In [64]:
model_path_or_name = "./models/TinySolar-248m-4k-code-instruct"

In [65]:
tiny_finetuned_model = AutoModelForCausalLM.from_pretrained(
    model_path_or_name,
    device_map="cpu",
    torch_dtype=torch.bfloat16,
)

tiny_finetuned_tokenizer = AutoTokenizer.from_pretrained(
    model_path_or_name
)

In [66]:
prompt = "def find_max(numbers):"

inputs = tiny_finetuned_tokenizer(
    prompt, return_tensors="pt"
).to(tiny_finetuned_model.device)

streamer = TextStreamer(
    tiny_finetuned_tokenizer,
    skip_prompt=True,
    skip_special_tokens=True
)

outputs = tiny_finetuned_model.generate(
    **inputs,
    streamer=streamer,
    use_cache=True,
    max_new_tokens=128,
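    do_sample=False,  # greedy decoding: always pick the most likely next token
    temperature=0.0,  # ignored by greedy decoding; recent transformers versions may warn about it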
    repetition_penalty=1.1
)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



   if len(numbers) == 0:
       return "Invalid input"
   else:
       return numbers[i]
```

In this solution, the `find_max` function takes a list of numbers as input and returns the maximum value in that list. It then iterates through each number in the list and checks if it is greater than or equal to 1. If it is, it adds it to the `max` list. Finally, it returns the maximum value found so far.


## 6. Generate Python samples with pretrained Python model

Here you'll use a version of TinySolar-248m-4k that has been further pretrained (a process called **continued pretraining**) on a large selection of Python code samples. You can find the model on Hugging Face at [this link](https://huggingface.co/upstage/TinySolar-248m-4k-py).

You'll follow the same steps as above to load the model and use it to generate text.

In [67]:
model_path_or_name = "./models/TinySolar-248m-4k-py" 

In [68]:
tiny_custom_model = AutoModelForCausalLM.from_pretrained(
    model_path_or_name,
    device_map="cpu",
    torch_dtype=torch.bfloat16,    
)

tiny_custom_tokenizer = AutoTokenizer.from_pretrained(
    model_path_or_name
)

In [69]:
prompt = "def find_max(numbers):"

inputs = tiny_custom_tokenizer(
    prompt, return_tensors="pt"
).to(tiny_custom_model.device)

streamer = TextStreamer(
    tiny_custom_tokenizer,
    skip_prompt=True, 
    skip_special_tokens=True
)

outputs = tiny_custom_model.generate(
    **inputs,
    streamer=streamer,
    use_cache=True,
    max_new_tokens=128,
    do_sample=False,
    repetition_penalty=1.1
)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



   """Find the maximum number of numbers in a list."""
   max = 0
   for num in numbers:
       if num > max:
           max = num
   return max



Next, try another code model, CodeT5-small, which is an encoder-decoder (sequence-to-sequence) model:

In [75]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM


model_path = "Salesforce/codet5-small"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)


prompt = "Write a Python function to calculate the Fibonacci sequence."
inputs = tokenizer(prompt, return_tensors="pt")

# Generate code
outputs = model.generate(
    **inputs, 
    max_new_tokens=300, 
    do_sample=True, 
    temperature=0.1
)

# Decode and print the generated code
generated_code = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_code)


 def fibonacci_sequence ( )


Try running the Python code that the TinySolar-248m-4k-py model generated in Section 6:

In [70]:
def find_max(numbers):
   max = 0
   for num in numbers:
       if num > max:
           max = num
   return max

In [71]:
find_max([1,3,5,1,6,7,2])

7
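The generated function returns the right answer for this list, but it starts from `max = 0`, so it would return 0 for a list that contains only negative numbers. The sketch below reproduces the generated logic under a different name and compares it with one possible fix that starts from the first element. Both function names are hypothetical, and raising `ValueError` for an empty list is a design choice assumed here, not something the model produced.

```python
def find_max_generated(numbers):
    """Same logic as the model-generated code (variable renamed to avoid shadowing `max`)."""
    max_value = 0
    for num in numbers:
        if num > max_value:
            max_value = num
    return max_value

def find_max_safe(numbers):
    """One possible fix: start from the first element instead of 0."""
    if not numbers:
        raise ValueError("numbers must not be empty")
    max_value = numbers[0]
    for num in numbers[1:]:
        if num > max_value:
            max_value = num
    return max_value

negatives = [-3, -1, -7]
print(find_max_generated(negatives))  # the generated logic never updates its starting value of 0
print(find_max_safe(negatives))
```

Checks like this are a simple way to compare generated code across the models in this lesson.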