## Base code for module-2
This is a condensed version of what we learned in Module 1: index the documents for RAG into a minsearch object.

In [1]:
# !rm -f minsearch.py
# !wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py

In [2]:
import os
if not os.path.isfile('minsearch.py'):
    !python -m wget "https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py"
else:
    print("minsearch.py already exists")

minsearch.py already exists


In [3]:
import requests 
import minsearch

In [4]:
docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)

index.fit(documents)

<minsearch.Index at 0x7fe2642fd8e0>
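To make the flattening step in the cell above concrete, here is a minimal sketch that runs on a tiny in-memory sample. The sample records are made up for illustration (they are not taken from `documents.json`); only the nested shape, a list of courses that each hold a list of documents, follows the notebook.

```python
# Hypothetical sample that mimics the nested shape of documents.json
documents_raw = [
    {"course": "course-a", "documents": [
        {"text": "Yes.", "section": "General", "question": "Can I join late?"},
        {"text": "Use pip.", "section": "Setup", "question": "How do I install?"},
    ]},
    {"course": "course-b", "documents": [
        {"text": "On Mondays.", "section": "General", "question": "When are lectures?"},
    ]},
]

documents = []
for course in documents_raw:
    for doc in course["documents"]:
        # Copy each record and attach the course name, so every flat
        # record can later be filtered by the "course" keyword field
        documents.append({**doc, "course": course["course"]})

print(len(documents))
print(documents[0])
```

Unlike the notebook cell, this variant copies each record instead of mutating it in place; either approach yields the same flat list.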

In [5]:
def search(query):
    boost = {'question': 3.0, 'section': 0.5}

    results = index.search(
        query=query,
        filter_dict={'course': 'data-engineering-zoomcamp'},
        boost_dict=boost,
        num_results=3
    )

    return results
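The roles of `filter_dict` and `boost_dict` can be illustrated with a toy ranking function. This is only a sketch under loud assumptions: `toy_search` and the sample documents are hypothetical, and the word-overlap score is a stand-in for illustration, not minsearch's actual scoring.

```python
def toy_search(docs, query, filter_dict, boost_dict, num_results=3):
    """Toy ranking: per-field word overlap, weighted by boost (illustration only)."""
    query_words = set(query.lower().split())
    scored = []
    for doc in docs:
        # Keyword filter: skip documents that do not match every filter value exactly
        if any(doc.get(key) != value for key, value in filter_dict.items()):
            continue
        score = 0.0
        for field in ("question", "text", "section"):
            overlap = len(query_words & set(doc[field].lower().split()))
            # Fields without an explicit boost get weight 1.0
            score += boost_dict.get(field, 1.0) * overlap
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:num_results]]


# Hypothetical documents, made up for this example
docs = [
    {"question": "Can I join the course late?", "text": "Yes, registration stays open.",
     "section": "General", "course": "course-a"},
    {"question": "How do I install Docker?", "text": "You can join the Slack channel for help.",
     "section": "Setup", "course": "course-a"},
    {"question": "Can I join the course late?", "text": "No.",
     "section": "General", "course": "course-b"},
]

results = toy_search(
    docs,
    query="join course",
    filter_dict={"course": "course-a"},
    boost_dict={"question": 3.0, "section": 0.5},
)
for doc in results:
    print(doc["question"])
```

With the question field boosted, a match in the question outweighs a match in the answer text, and the `course-b` record never appears because the filter removes it before scoring.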

In [6]:
def rag(query):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer

## Replace the OpenAI LLM with the open-source Mistral 7B model from HuggingFace

### Important Note: If you're not running on Saturn Cloud

You need to install these libraries yourself. Make sure you use the latest versions:

```pip install -U transformers accelerate bitsandbytes```

By default, HuggingFace downloads models and tokenizers into the cache directory given by the `HF_HOME` environment variable. If it is not set, the cache lives under your home directory (`~/.cache/huggingface`).

On Saturn Cloud, however, you may not have enough space in your home directory. To check how much space is available, use ```!df -h```

As per Module 2.3, we switch the `HF_HOME` environment variable to "/run/cache", as there is more space there.

In [7]:
os.environ['HF_HOME'] = '/run/cache/'
# equivalent to this terminal cmd: "export HF_HOME='/run/cache' "
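As an alternative to reading the `df -h` output by eye, the free space can be checked from Python. This is a small sketch using only the standard library; `free_gib` is a helper name introduced here for illustration.

```python
import os
import shutil


def free_gib(path):
    """Return the free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1024**3


# Candidate cache locations: /run/cache is the Saturn Cloud path used in this
# notebook, the home directory is where the default cache would live
for candidate in ("/run/cache", os.path.expanduser("~")):
    if os.path.isdir(candidate):
        print(f"{candidate}: {free_gib(candidate):.1f} GiB free")
```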

### Import HuggingFace API key
Note that I set up my HuggingFace API key in the environment variable `HF_KEY`.
Another option is to hardcode it here, but that is not recommended.

In [8]:
# os.environ['HF_TOKEN'] = 'hf_blabla'
from huggingface_hub import login
login(token=os.environ['HF_KEY'])

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /run/cache/token
Login successful


### Tokenizer and LLM

The tokenizer takes in text and turns it into a numeric representation (token IDs), which is then fed into the language model.
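The idea of mapping text to IDs and back can be shown with a toy word-level tokenizer. This is a deliberately simplified sketch: real tokenizers such as the one loaded below split text into subword units and use a fixed, pretrained vocabulary. The functions `build_vocab`, `encode` and `decode` are hypothetical helpers, not part of `transformers`.

```python
def build_vocab(corpus):
    """Assign an integer ID to every distinct lowercase word, in order of appearance."""
    vocab = {}
    for sentence in corpus:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab, unk_id=-1):
    """Turn text into a list of token IDs; unknown words map to unk_id."""
    return [vocab.get(word, unk_id) for word in text.lower().split()]


def decode(ids, vocab):
    """Turn a list of token IDs back into text."""
    inverse = {token_id: word for word, token_id in vocab.items()}
    return " ".join(inverse.get(token_id, "<unk>") for token_id in ids)


vocab = build_vocab(["can i still join the course"])
ids = encode("Can I join", vocab)
print(ids)                 # [0, 1, 3]
print(decode(ids, vocab))  # can i join
```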

In [9]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

In [10]:
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", device_map="auto", load_in_4bit=True
)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_side="left")

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
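The deprecation warning above suggests passing a `BitsAndBytesConfig` through `quantization_config` instead of `load_in_4bit=True`. A minimal sketch of that variant follows; it is not what this notebook ran. The wrapper function `load_quantized_model` is a name introduced here, and the imports sit inside it so that defining it does not require the libraries. Calling it still needs `transformers`, `accelerate`, `bitsandbytes` and a suitable GPU.

```python
def load_quantized_model(model_name="mistralai/Mistral-7B-v0.1"):
    """Load a causal LM in 4-bit precision via an explicit quantization config."""
    # Imported lazily: only needed when the function is actually called
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(load_in_4bit=True)
    return AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        quantization_config=quant_config,
    )

# Usage (downloads the model weights, so it is left commented out here):
# model = load_quantized_model()
```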

In [11]:
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

### Putting everything together

Replace the original prompt and llm functions with versions that use the Mistral 7B model.

In [12]:
# def build_prompt(query, search_results):
#     prompt_template = """
# You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
# Use only the facts from the CONTEXT when answering the QUESTION.

# QUESTION: {question}

# CONTEXT: 
# {context}
# """.strip()

#     context = ""
    
#     for doc in search_results:
#         context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"
    
#     prompt = prompt_template.format(question=query, context=context).strip()
#     return prompt

# def llm(prompt):
#     response = client.chat.completions.create(
#         model='gpt-4o',
#         messages=[{"role": "user", "content": prompt}]
#     )
    
#     return response.choices[0].message.content

In [13]:
def build_prompt(query, search_results):
    prompt_template = """
QUESTION: {question}

CONTEXT:
{context}

ANSWER:
""".strip()

    context = ""
    
    for doc in search_results:
        context = context + f"{doc['question']}\n{doc['text']}\n\n"
    
    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt

def llm(prompt):
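    # The pipeline returns the prompt followed by the generated continuation.
    # Note: max_length counts the prompt tokens as well as the newly generated ones.
    response = generator(prompt, max_length=500, temperature=0.7, top_p=0.95, num_return_sequences=1)
    response_final = response[0]['generated_text']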
    return response_final[len(prompt):].strip()
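The slicing in `llm` works because a text-generation pipeline, by default, returns the prompt with the continuation appended. The sketch below reproduces that behaviour with a stand-in generator, so it runs without a model. Both `fake_generator` and `llm_with` are hypothetical names, and the canned answer is made up.

```python
def fake_generator(prompt, **kwargs):
    """Stand-in for the pipeline: echoes the prompt, then appends a canned answer."""
    return [{"generated_text": prompt + " Yes, you can still join."}]


def llm_with(generator, prompt):
    """Same logic as llm above, with the generator passed in explicitly."""
    response = generator(prompt, max_length=500)
    full_text = response[0]["generated_text"]
    # Drop the echoed prompt so that only the continuation remains
    return full_text[len(prompt):].strip()


prompt = "QUESTION: Can I still join?\n\nCONTEXT:\nRegistration stays open.\n\nANSWER:"
answer = llm_with(fake_generator, prompt)
print(answer)  # Yes, you can still join.
```

Passing the generator as an argument also makes it easy to swap in a stub like this one when testing the rest of the RAG flow.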

In [14]:
print(rag("I just discovered the course. Can I still join it?"))

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Yes, you can still join the course.
