# LLM Zoomcamp - Week 2 Notes

In the second week, we set up a cloud-based GPU environment on Saturn Cloud and explore open source alternatives to OpenAI's platform and models, such as:

Platforms:
- HuggingFace
- Ollama
- Saturn Cloud

Models:
- Google FLAN T5
- Phi 3 Mini
- Mistral-7B

And finally, we put the RAG we built in week 1 into a Streamlit UI.

## 2.2 Using Saturn Cloud for GPU Notebooks
- The main thing not covered is how to give Saturn Cloud access to your GitHub repositories
    - This is fairly straightforward:
        1. In Saturn Cloud, go to `Manage <username>` and create an SSH key pair
        2. Copy the public key Saturn Cloud generates and go to github.com
            - In GitHub, go to Settings -> SSH and GPG keys and click on `New SSH Key`
            - Paste in the public key you copied from Saturn Cloud
        3. Now go back to Saturn Cloud and click on `Git Repositories`
            - Click on `New`
            - Add the URL for the GitHub repository you want Saturn Cloud to have access to
- When creating a new Python VM resource, make sure to install additional libs: `pip install -U transformers accelerate bitsandbytes`
- The rest of it is quite straightforward
- A couple of things I did when setting up the notebook resource that help with development:
    1. I enabled SSH access so that I can ideally connect to this notebook resource from VS Code (and thus take advantage of many things, including GitHub Copilot)
    2. I gave the VM an easy-to-remember name: https://llm-zoomcamp-waleed.community.saturnenterprise.io
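
The repository URL added in step 3 is worth a second look. A minimal sketch, assuming Saturn Cloud expects the SSH-style remote (`git@github.com:owner/repo.git`) when it authenticates with the SSH key created in step 1; that expectation and the helper name `to_ssh_url` are assumptions, so check the Saturn Cloud form before relying on this:

```python
from urllib.parse import urlparse


def to_ssh_url(https_url: str) -> str:
    """Convert an HTTPS repository URL into the SSH-style remote.

    Hypothetical helper for illustration; not part of any Saturn Cloud tooling.
    """
    parsed = urlparse(https_url)
    # Drop the leading/trailing slashes and an optional ".git" suffix
    path = parsed.path.strip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    return f"git@{parsed.netloc}:{path}.git"


print(to_ssh_url("https://github.com/DataTalksClub/llm-zoomcamp"))
# → git@github.com:DataTalksClub/llm-zoomcamp.git
```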

## 2.3 HuggingFace and Google FLAN T5

- In this lesson, we start working with open source models available on [HuggingFace](https://huggingface.co)
    - HuggingFace is a hub where people host all kinds of ML models, not just LLMs (which effectively boils down to hosting model weights)
- This is where our Saturn Cloud GPU notebook from 2.2 comes into play, as we'll need a GPU to work with these models
- We're going to be using Google FLAN T5: https://huggingface.co/google/flan-t5-xl

Let's start by pulling in the minsearch engine we're going to use in our RAG:

In [1]:
!rm -f minsearch.py
!wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py

--2024-07-07 11:10:39--  https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3832 (3.7K) [text/plain]
Saving to: ‘minsearch.py’


2024-07-07 11:10:39 (51.0 MB/s) - ‘minsearch.py’ saved [3832/3832]



The FLAN T5 model card linked above gives some reference code for running the model on a GPU:

```python
# pip install accelerate
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Import a tokenizer to convert text to tokens
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
# Load the model
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl", device_map="auto")

input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
```
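
The reference code hard-codes `.to("cuda")`, which raises an error on a machine without a GPU. A minimal sketch, assuming PyTorch is the backend (as implied by `return_tensors="pt"`), that picks the device at runtime instead; `pick_device` is a hypothetical helper name, not part of `transformers`:

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so there is nothing to run on a GPU
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")

# The tokenizer call from the reference code would then become:
# input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)
```

Keep in mind that a model of this size will be very slow on CPU, so this is mostly a guard against confusing errors rather than a real alternative to the GPU notebook.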

- An important consideration is how Saturn Cloud provisions storage
    - By default, HuggingFace caches downloads in a `.cache` subdirectory of your home directory (`~/.cache/huggingface`) in your Saturn Cloud environment
        - You can change this by setting the `HF_HOME` environment variable
        - A better way to do this would be to set it using `direnv` (helpful blog post on that [here](https://waleedayoub.com/post/managing-dev-environments_local-vs-codespaces/#option-2-github-codespaces))
    ```python
    import os
    # Set this before importing transformers so the cache location takes effect
    os.environ['HF_HOME'] = '/run/cache'
    ```
    - The main change we make to our original FAQ-answering RAG is in the `llm(prompt)` function
    ```python
    def llm(prompt):
        input_text = prompt
        input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
        outputs = model.generate(input_ids)
        results = tokenizer.decode(outputs[0])
        
        return results
    ```
    - By default, FLAN T5's `generate` method caps the length of the response. You can check the default maximum length like this:
    ```python
    print(f"Default max_length: {model.config.max_length}")
    ```
    - This prints `Default max_length: 20`
    - The cap can be raised by passing `max_length` when calling the `generate` method:
    ```python
    outputs = model.generate(input_ids, max_length=200)
    ```
    - Another useful option is passing `skip_special_tokens=True` to the `decode` method, which strips special tokens such as the leading `<pad>` and trailing `</s>`
    ```python
    results = tokenizer.decode(outputs[0], skip_special_tokens=True)
    ```    
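
Putting the two tweaks above together, a minimal sketch of the `llm` function could look like the following. Passing `tokenizer` and `model` in as arguments, and the `device` and `max_length` defaults, are assumptions made here for illustration so the function does not depend on notebook globals; the notebook version simply uses the objects loaded earlier.

```python
def llm(prompt, tokenizer, model, device="cuda", max_length=200):
    """Generate an answer for `prompt` with a seq2seq model such as FLAN T5."""
    # Tokenize the prompt and move the token ids to the same device as the model
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    # Raise the default cap of 20 tokens so longer answers are not cut short
    outputs = model.generate(input_ids, max_length=max_length)
    # Drop special tokens such as <pad> and </s> from the decoded text
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

With the tokenizer and model loaded as shown earlier, calling `llm(prompt, tokenizer, model)` should return the answer text without the `<pad>` and `</s>` markers.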

### So let's put it all together now and modify our RAG from section 1 to use FLAN T5!

In [2]:
import os
os.environ['HF_HOME']='/run/cache'
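
The cell above overwrites `HF_HOME` unconditionally. A slightly more defensive sketch, assuming you may also set the variable outside the notebook (for example with `direnv`) and want that value to win; `configure_hf_cache` is a hypothetical helper name, and `/run/cache` is just the path used in these notes:

```python
import os
from pathlib import Path


def configure_hf_cache(cache_dir: str = "/run/cache") -> Path:
    """Point HF_HOME at `cache_dir` unless it is already set.

    Call this before importing transformers so the setting takes effect.
    """
    # setdefault keeps any value that was already exported in the environment
    os.environ.setdefault("HF_HOME", cache_dir)
    return Path(os.environ["HF_HOME"])


hf_home = configure_hf_cache()
print(f"HuggingFace cache root: {hf_home}")
```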

In [3]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

In [4]:
# Import a tokenizer to convert text to tokens
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
# Load the model
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl", device_map="auto")

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


config.json:   0%|          | 0.00/1.44k [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/53.0k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.45G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

In [5]:
input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

In [6]:
print(input_ids)

tensor([[13959,  1566,    12,  2968,    10,   571,   625,    33,    25,    58,
             1]], device='cuda:0')


In [7]:
outputs = model.generate(input_ids)
print(outputs)
print(tokenizer.decode(outputs[0]))



tensor([[   0, 2739, 4445,  436,  292,   58,    1]], device='cuda:0')
<pad> Wie alt sind Sie?</s>


### Now we run the same RAG as before, but modify the `llm` function to use our local FLAN T5 weights instead of an OpenAI API call

In [8]:
import requests 
import minsearch

docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)

index.fit(documents)

<minsearch.Index at 0x7fd4b21d5b50>

In [9]:
def search(query):
    boost = {'question': 3.0, 'section': 0.5}

    results = index.search(
        query=query,
        filter_dict={'course': 'data-engineering-zoomcamp'},
        boost_dict=boost,
        num_results=5
    )

    return results

In [10]:
def build_prompt(query, search_results):
    prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT: 
{context}
""".strip()

    context = ""
    
    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"
    
    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt

def llm(prompt):
    input_text = prompt
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
    outputs = model.generate(input_ids, max_length=200)
    results = tokenizer.decode(outputs[0])
    
    return results

In [11]:
def rag(query):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer

In [12]:
rag("I just discovered the course, can I still register?")

"<pad> Yes, even if you don't register, you're still eligible to submit the homeworks. Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.</s>"

In [13]:
print(f"Default max_length: {model.config.max_length}")

Default max_length: 20
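
Note that the `rag` answer above still starts with `<pad>` and ends with `</s>`, because the `llm` function in that cell calls `decode` without `skip_special_tokens=True`. Passing that flag is the cleaner fix. For strings that have already been generated, a plain-string cleanup along these lines would also work. This is a hypothetical helper (`strip_special_tokens` is not a `transformers` function), and the token list is an assumption based on the two markers seen in the outputs above:

```python
def strip_special_tokens(text: str, tokens=("<pad>", "</s>")) -> str:
    """Remove the given marker strings from already-decoded model output."""
    for token in tokens:
        text = text.replace(token, "")
    # Trim the whitespace left behind by the removed markers
    return text.strip()


raw = "<pad> Wie alt sind Sie?</s>"
print(strip_special_tokens(raw))
# → Wie alt sind Sie?
```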


## 2.4 Phi 3 Mini
- Not a lot of notes to take here
- We just replaced the FLAN T5 implementation from the previous section with Microsoft's Phi 3 Mini:
    - You can find that model here: https://huggingface.co/microsoft/Phi-3-mini-128k-instruct
- I'm not going to reproduce all the same code for the Phi 3 model and will instead focus on the next section, which uses Mistral-7B

## 2.5 Mistral-7B and HuggingFace Hub Authentication
