
# Lesson 1: Why Pretraining?

## 1. Install dependencies and fix seed

Welcome to Lesson 1!

If you would like to access the `requirements.txt` file for this course, go to `File` and click on `Open`.

In [3]:
from google.colab import files
uploaded = files.upload()


Saving requirements.txt to requirements.txt


In [4]:
# Install the course dependencies listed in the uploaded requirements file
!pip install -q -r requirements.txt

In [5]:
# Ignore insignificant warnings (e.g., deprecations)
import warnings
warnings.filterwarnings('ignore')

In [6]:
# Set a seed for reproducibility
import torch

def fix_torch_seed(seed=42):
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

fix_torch_seed()
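
As a quick check of what fixing the seed buys you, the sketch below reseeds PyTorch's global generator and confirms that the same random tensor comes out each time. It only uses `torch.manual_seed` and `torch.rand`; the helper name `sample_after_seed` is made up for this illustration.

```python
import torch

def sample_after_seed(seed=42, size=3):
    # Reseed the global generator, then draw `size` uniform random numbers
    torch.manual_seed(seed)
    return torch.rand(size)

first = sample_after_seed()
second = sample_after_seed()

# The same seed gives identical draws
print(torch.equal(first, second))  # True
```

Without the reseeding step, two consecutive calls to `torch.rand` would continue the same random stream and return different values.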

## 2. Load a general pretrained model

This course works with small models that fit within the memory of the learning platform. TinySolar-248m-4k is a small decoder-only model with 248M parameters (similar in scale to GPT-2) and a 4,096-token context window. You can find the model on the Hugging Face model library at [this link](https://huggingface.co/upstage/TinySolar-248m-4k).
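
To see why a model of this size fits comfortably in memory, here is a back-of-the-envelope estimate of the weight storage alone. It assumes 2 bytes per parameter (the size of a `bfloat16` value) or 4 bytes (`float32`), and it ignores activations, the KV cache, and framework overhead. `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(num_params, bytes_per_param):
    # Weights only: parameters times bytes per parameter, in decimal gigabytes
    return num_params * bytes_per_param / 1e9

tiny_solar_params = 248_000_000  # 248M parameters, as stated above

print(f"bfloat16: {weight_memory_gb(tiny_solar_params, 2):.3f} GB")  # bfloat16: 0.496 GB
print(f"float32:  {weight_memory_gb(tiny_solar_params, 4):.3f} GB")  # float32:  0.992 GB
```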

You'll load the model in three steps:
1. Specify the path to the model in the Hugging Face model library
2. Load the model using `AutoModelForCausalLM` in the `transformers` library
3. Load the tokenizer for the model from the same model path

In [7]:
model_path_or_name = "upstage/TinySolar-248m-4k"  # Model ID on the Hugging Face Hub

In [10]:
from transformers import AutoModelForCausalLM

tiny_general_model = AutoModelForCausalLM.from_pretrained(
    model_path_or_name,
    device_map="cpu",  # Load the weights on the CPU
    torch_dtype="auto",  # Use the dtype recorded in the checkpoint's config
)

In [12]:
from huggingface_hub import login

login()  # Paste your Hugging Face token when prompted



In [18]:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Load the tokenizer and model (the cells in Sections 3 and 4 use these two objects)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto"
)



## 3. Generate text samples

Here you'll try generating some text with the model. You'll set a prompt, instantiate a text streamer, and then have the model complete the prompt:

In [19]:
prompt = "I am an engineer. I love"

In [21]:
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # Keep inputs on the same device as the model

In [22]:
from transformers import TextStreamer
streamer = TextStreamer(
    tokenizer,
    skip_prompt=True,  # Set to False to stream the prompt before the generated text
    skip_special_tokens=True
)

In [23]:
outputs = model.generate(
    **inputs,
    streamer=streamer,
    use_cache=True,
    max_new_tokens=128,
    do_sample=False,  # Greedy decoding; temperature is not used in this mode
    repetition_penalty=1.1
)

to write about science and technology, especially about the latest advancements in artificial intelligence, robotics, and quantum computing. My goal is to provide readers with a clear understanding of these cutting-edge technologies and how they are transforming our world.
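
The `do_sample=False` and `repetition_penalty=1.1` settings can be illustrated on toy numbers. The sketch below is a simplified, pure-Python picture of one decoding step, not the `transformers` implementation: it applies the commonly used penalty rule (divide a positive logit by the penalty, multiply a negative one) to tokens that already appear in the sequence, then picks the highest-scoring token greedily. The logit values and token IDs are invented for the example.

```python
def apply_repetition_penalty(logits, previous_ids, penalty=1.1):
    # Make tokens that already appear in the sequence less likely
    penalized = list(logits)
    for token_id in set(previous_ids):
        score = penalized[token_id]
        penalized[token_id] = score / penalty if score > 0 else score * penalty
    return penalized

def greedy_pick(logits):
    # Greedy decoding: return the index of the highest score
    return max(range(len(logits)), key=lambda i: logits[i])

toy_logits = [2.0, 2.1, -1.0, 0.5]  # made-up scores for a 4-token vocabulary
already_generated = [1]             # token 1 was produced earlier

print(greedy_pick(toy_logits))  # 1
print(greedy_pick(apply_repetition_penalty(toy_logits, already_generated)))  # 0
```

With the penalty applied, the score of token 1 drops from 2.1 to about 1.91, so token 0 wins instead of repeating token 1.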


In [28]:
# Initialize conversation context
conversation_history = "You are a helpful assistant."

# Add user input to the context
user_input = "What is the weather like today?"
conversation_history += "\nUser: " + user_input

# Tokenize the conversation context and move it to the model's device
inputs = tokenizer(conversation_history, return_tensors="pt").to(model.device)

# Generate a response with nucleus sampling
outputs = model.generate(
    **inputs,  # Pass input_ids together with the attention mask
    max_length=200,  # Upper bound on prompt plus generated tokens
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    do_sample=True,
    top_p=0.95,
    temperature=0.7,
)

# Decode the full sequence (prompt plus generated text) and print it
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Assistant:", response)


Assistant: You are a helpful assistant.
User: What is the weather like today?
Assistant: It is a sunny day, with a high temperature of 25 degrees Celsius. The wind speed is 10 km/h from the north-east.
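
The printed response above starts with the prompt because, for a decoder-only model, `generate` returns the prompt tokens followed by the new tokens. A minimal way to keep only the completion is to slice off the first `prompt_length` token IDs before decoding. The helper below is hypothetical and is shown on plain lists with made-up token IDs; with the cell above, the same slice would be `outputs[0][inputs["input_ids"].shape[1]:]`.

```python
def strip_prompt(output_ids, prompt_length):
    # Keep only the token IDs generated after the prompt
    return output_ids[prompt_length:]

prompt_ids = [1, 887, 526]             # made-up IDs standing in for the prompt
output_ids = prompt_ids + [263, 8444]  # prompt followed by two new tokens

new_ids = strip_prompt(output_ids, len(prompt_ids))
print(new_ids)  # [263, 8444]
```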


## 4. Generate Python samples with pretrained general model

Use the model to write a Python function called `find_max()` that finds the maximum value in a list of numbers:

In [24]:
prompt =  "def find_max(numbers):"

In [25]:
inputs = tokenizer(
    prompt, return_tensors="pt"
).to(model.device)

streamer = TextStreamer(
    tokenizer,
    skip_prompt=True,  # Set to False to include the prompt in the output
    skip_special_tokens=True
)

In [27]:
outputs = model.generate(
    **inputs,
    streamer=streamer,
    use_cache=True,
    max_new_tokens=128,
    do_sample=False,  # Greedy decoding; temperature is not used in this mode
    repetition_penalty=1.1
)


   max_num = numbers[0]
   for num in numbers:
       if num > max_num:
           max_num = num
   return max_num
```

In this updated version, we've added a `max_num` variable to store the maximum number found so far. We then loop through each number in the list and compare it with the current maximum value. If the current number is greater than the maximum value, we update the maximum value and reset the current maximum value to the new maximum value. This continues until all numbers have been processed or the maximum value has been
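
Completions like the one above often contain code followed by a closing code fence and a prose explanation. One simple way to recover something runnable, sketched here under the assumption that the code ends at the first fence, is to cut the completion there and prepend the prompt line. `extract_function` is a hypothetical helper, and the sample completion is abbreviated from the output above.

```python
FENCE = "`" * 3  # three backticks, built this way so the example contains no literal fence

def extract_function(prompt, completion):
    # Keep the text before the first code fence and prepend the prompt line
    code_part = completion.split(FENCE, 1)[0]
    body = code_part.strip("\n").rstrip()
    return f"{prompt}\n{body}\n"

sample_completion = (
    "\n"
    "   max_num = numbers[0]\n"
    "   for num in numbers:\n"
    "       if num > max_num:\n"
    "           max_num = num\n"
    "   return max_num\n"
    + FENCE + "\n\n"
    "In this updated version, we've added a `max_num` variable\n"
)

source = extract_function("def find_max(numbers):", sample_completion)
namespace = {}
exec(source, namespace)  # Only execute generated code that you have read first
print(namespace["find_max"]([1, 3, 5, 1, 6, 7, 2]))  # 7
```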


## 5. Generate Python samples with fine-tuned Python model

This model has been fine-tuned on instruction code examples. You can find the model and information about the fine-tuning datasets on the Hugging Face model library at [this link](https://huggingface.co/upstage/TinySolar-248m-4k-code-instruct).

You'll follow the same steps as above to load the model and use it to generate text.
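
Since the load-and-generate steps repeat for every model in this lesson, they can be wrapped in one function. The sketch below reuses the calls and settings from the cells in this notebook; the function name `complete_prompt` and its defaults are assumptions made for illustration. Nothing is downloaded or loaded until the function is called.

```python
def complete_prompt(model_id, prompt, max_new_tokens=128):
    # Imported inside the function so that defining the helper stays cheap
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="cpu",
        torch_dtype=torch.bfloat16,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        use_cache=True,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # Greedy decoding, as in the cells of this notebook
        repetition_penalty=1.1,
    )
    # Decode only the newly generated tokens, not the prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

A call such as `complete_prompt("upstage/TinySolar-248m-4k-code-instruct", "def find_max(numbers):")` would then return the completion as a string instead of streaming it.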

In [29]:
model_path_or_name = "upstage/TinySolar-248m-4k-code-instruct"

In [30]:
tiny_finetuned_model = AutoModelForCausalLM.from_pretrained(
    model_path_or_name,
    device_map="cpu",
    torch_dtype=torch.bfloat16,
)

tiny_finetuned_tokenizer = AutoTokenizer.from_pretrained(
    model_path_or_name
)


In [31]:
prompt =  "def find_max(numbers):"

inputs = tiny_finetuned_tokenizer(
    prompt, return_tensors="pt"
).to(tiny_finetuned_model.device)

streamer = TextStreamer(
    tiny_finetuned_tokenizer,
    skip_prompt=True,
    skip_special_tokens=True
)

outputs = tiny_finetuned_model.generate(
    **inputs,
    streamer=streamer,
    use_cache=True,
    max_new_tokens=128,
    do_sample=False,  # Greedy decoding; temperature is not used in this mode
    repetition_penalty=1.1
)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



   if len(numbers) == 0:
       return "Invalid input"
   else:
       return max(numbers)
```

In this solution, the `find_max` function takes a list of numbers as input and returns the maximum value in that list. It then iterates through each number in the list and checks if it is greater than or equal to 1. If it is, it adds it to the `max` list. Finally, it returns the maximum value found so far.


## 6. Generate Python samples with pretrained Python model

Here you'll use a version of TinySolar-248m-4k that has been further pretrained (a process called **continued pretraining**) on a large selection of Python code samples. You can find the model on Hugging Face at [this link](https://huggingface.co/upstage/TinySolar-248m-4k-py).

You'll follow the same steps as above to load the model and use it to generate text.

In [32]:
model_path_or_name = "upstage/TinySolar-248m-4k-py"

In [33]:
tiny_custom_model = AutoModelForCausalLM.from_pretrained(
    model_path_or_name,
    device_map="cpu",
    torch_dtype=torch.bfloat16,
)

tiny_custom_tokenizer = AutoTokenizer.from_pretrained(
    model_path_or_name
)

In [34]:
prompt = "def find_max(numbers):"

inputs = tiny_custom_tokenizer(
    prompt, return_tensors="pt"
).to(tiny_custom_model.device)

streamer = TextStreamer(
    tiny_custom_tokenizer,
    skip_prompt=True,
    skip_special_tokens=True
)

outputs = tiny_custom_model.generate(
    **inputs, streamer=streamer,
    use_cache=True,
    max_new_tokens=128,
    do_sample=False,
    repetition_penalty=1.1
)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.





Try running Python code generated by the model:

In [35]:
def find_max(numbers):
   max = 0
   for num in numbers:
       if num > max:
           max = num
   return max

In [36]:
find_max([1,3,5,1,6,7,2])

7
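
Running generated code on a single input is a weak check. The loop above starts from `max = 0`, so it returns 0 for a list that contains only negative numbers, and it also shadows Python's built-in `max`. The sketch below reproduces that behaviour and compares it with a variant that starts from the first element; both function names are hypothetical and the test lists are made up.

```python
def find_max_from_zero(numbers):
    # Same logic as the generated function above, with the variable renamed
    largest = 0
    for num in numbers:
        if num > largest:
            largest = num
    return largest

def find_max_from_first(numbers):
    # Start from the first element so that negative-only lists work
    if not numbers:
        raise ValueError("numbers must not be empty")
    largest = numbers[0]
    for num in numbers[1:]:
        if num > largest:
            largest = num
    return largest

print(find_max_from_zero([1, 3, 5, 1, 6, 7, 2]))  # 7
print(find_max_from_zero([-3, -1, -2]))           # 0 (wrong: the true maximum is -1)
print(find_max_from_first([-3, -1, -2]))          # -1
```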