# RAG and LLM Workshop

In this exercise we download a set of documents containing the Q&A from previous Zoomcamp courses. Instead of searching manually for the answer to a specific question, we use Elasticsearch to retrieve several similar questions and their answers. We then pass those results to an LLM (OpenAI here, for simplicity) as context for the prompt, so that it produces a single answer that takes all the retrieved results into account. The result is a quick way to get questions answered.


In [1]:
!pip install elasticsearch openai tqdm




In [2]:
from elasticsearch import Elasticsearch
from openai import OpenAI
from tqdm.auto import tqdm
import json

client = OpenAI(api_key="INSERT API KEY HERE")
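Hard-coding the key in the notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the key is stored in an environment variable named `OPENAI_API_KEY` (the variable the `openai` client also reads by default); the helper name `load_api_key` is hypothetical:

```python
import os

def load_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise a clear error if it is missing."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before running the notebook")
    return key

# Usage (assumes `from openai import OpenAI` as in the cell above):
# client = OpenAI(api_key=load_api_key())
```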



### Downloading the documents

In [3]:
!wget https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json

--2024-07-11 11:51:08--  https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/alexeygrigorev/llm-rag-workshop/main/notebooks/documents.json [following]
--2024-07-11 11:51:08--  https://raw.githubusercontent.com/alexeygrigorev/llm-rag-workshop/main/notebooks/documents.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 658332 (643K) [text/plain]
Saving to: ‘documents.json.6’


2024-07-11 11:51:08 (31.4 MB/s) - ‘documents.json.6’ saved [658332/658332]



In [4]:
!head documents.json

[
  {
    "course": "data-engineering-zoomcamp",
    "documents": [
      {
        "text": "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  \u201cOffice Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon\u2019t forget to register in DataTalks.Club's Slack and join the channel.",
        "section": "General course-related questions",
        "question": "Course - When will the course start?"
      },
      {


### Functions

In [5]:
context_template = """
Section: {section}
Question: {question}
Answer: {text}
""".strip()

prompt_template = """
You're a course teaching assistant.
Answer the user QUESTION based on CONTEXT - the documents retrieved from our FAQ database.
Don't use other information outside of the provided CONTEXT.  

QUESTION: {user_question}

CONTEXT:

{context}
""".strip()

def retrieve_documents(query, index_name="course-questions", max_results=5):
    es = Elasticsearch("http://localhost:9200")
    
    search_query = {
        "size": max_results,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "data-engineering-zoomcamp"
                    }
                }
            }
        }
    }
    
    response = es.search(index=index_name, body=search_query)
    documents = [hit['_source'] for hit in response['hits']['hits']]
    return documents

def build_context(documents):
    context_result = ""
    
    for doc in documents:
        doc_str = context_template.format(**doc)
        context_result += ("\n\n" + doc_str)
    
    return context_result.strip()


def build_prompt(user_question, documents):
    context = build_context(documents)
    prompt = prompt_template.format(
        user_question=user_question,
        context=context
    )
    return prompt

def ask_openai(prompt, model="gpt-4o"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    answer = response.choices[0].message.content
    return answer

def qa_bot(user_question):
    context_docs = retrieve_documents(user_question)
    prompt = build_prompt(user_question, context_docs)
    answer = ask_openai(prompt)
    return answer
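To see what the model actually receives, the two templates can be rendered by hand. The sketch below repeats shortened copies of the templates so that it runs on its own, and uses one made-up FAQ entry (hypothetical, not taken from `documents.json`); `render_prompt` is an illustrative name, not part of the workshop code:

```python
context_template = "Section: {section}\nQuestion: {question}\nAnswer: {text}"
prompt_template = "QUESTION: {user_question}\n\nCONTEXT:\n\n{context}"

def render_prompt(user_question, documents):
    # Join one formatted block per retrieved document, separated by blank lines.
    context = "\n\n".join(context_template.format(**doc) for doc in documents)
    return prompt_template.format(user_question=user_question, context=context)

sample_docs = [{
    "section": "General",
    "question": "Can I join late?",
    "text": "Yes.",
    "course": "demo-course",  # extra keys are ignored by str.format
}]

print(render_prompt("Can I still enrol?", sample_docs))
```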


### Loading the documents into memory

In [6]:
with open('./documents.json', 'rt') as f_in:
    documents_file = json.load(f_in)

documents = []

for course in documents_file:
    course_name = course['course']

    for doc in course['documents']: # flattening of document
        doc['course'] = course_name
        documents.append(doc)

In [7]:
len(documents)

948
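The flattening step copies the course name from each top-level entry onto every FAQ record, so a single flat list can be indexed and later filtered by course. A small self-contained sketch on made-up data (the helper name `flatten_courses` and the sample records are illustrative, not part of the workshop code):

```python
def flatten_courses(courses):
    """Turn [{'course': ..., 'documents': [...]}, ...] into one flat list of records."""
    flat = []
    for course in courses:
        for doc in course["documents"]:
            # Copy the record so the input structure is left unmodified.
            flat.append({**doc, "course": course["course"]})
    return flat

sample = [
    {"course": "course-a", "documents": [{"question": "Q1", "text": "A1", "section": "S"}]},
    {"course": "course-b", "documents": [{"question": "Q2", "text": "A2", "section": "S"},
                                         {"question": "Q3", "text": "A3", "section": "S"}]},
]

flat = flatten_courses(sample)
print(len(flat))          # 3
print(flat[1]["course"])  # course-b
```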

### Connecting to the Elasticsearch instance

In [8]:
es = Elasticsearch("http://localhost:9200")
es.info()

ObjectApiResponse({'name': 'd2531231a5d9', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'sp2nFfSzS_O6uReijtqobg', 'version': {'number': '8.4.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '42f05b9372a9a4a470db3b52817899b99a76ee73', 'build_date': '2022-10-04T07:17:24.662462378Z', 'build_snapshot': False, 'lucene_version': '9.3.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

### Defining the index

In [9]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} 
        }
    }
}

index_name = "course-questions"
response = es.indices.create(index=index_name, body=index_settings)

response

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})
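Re-running the cell above against an index that already exists makes Elasticsearch reject the request. A hedged sketch of one way to make the step repeatable, using the client's `indices.exists`, `indices.delete`, and `indices.create` calls; the helper name `recreate_index` is hypothetical, and deleting the index discards everything stored in it:

```python
def recreate_index(es, index_name, index_settings):
    """Drop `index_name` if it exists, then create it with the given settings and mappings."""
    if es.indices.exists(index=index_name):
        es.indices.delete(index=index_name)
    return es.indices.create(
        index=index_name,
        settings=index_settings["settings"],
        mappings=index_settings["mappings"],
    )

# Usage with the objects defined above:
# recreate_index(es, "course-questions", index_settings)
```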

### Indexing the documents in Elasticsearch

In [10]:
for doc in tqdm(documents):
    es.index(index=index_name, document=doc)
    

100%|██████████| 948/948 [00:21<00:00, 43.17it/s]
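Indexing one document per request took about 21 seconds for the 948 documents in the run above. The `elasticsearch.helpers.bulk` helper sends many documents per request and is usually faster. A sketch, with `bulk_actions` and `bulk_index` as hypothetical helper names:

```python
def bulk_actions(docs, index_name):
    """Yield one bulk action per document, targeting `index_name`."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}

def bulk_index(es, docs, index_name):
    """Index all documents with the bulk helper; returns (success_count, errors)."""
    # Imported here so the generator above can be used without the client installed.
    from elasticsearch.helpers import bulk
    return bulk(es, bulk_actions(docs, index_name))

# Usage with the objects defined above:
# bulk_index(es, documents, index_name)
```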


### Example Usage

In [11]:
user_question = "How do I join the course after it has started?"

response = retrieve_documents(user_question)

for doc in response:
    print(f"Section: {doc['section']}")
    print(f"Question: {doc['question']}")
    print(f"Answer: {doc['text'][:60]}...\n")

Section: General course-related questions
Question: Course - Can I still join the course after the start date?
Answer: Yes, even if you don't register, you're still eligible to su...

Section: General course-related questions
Question: Course - Can I follow the course after it finishes?
Answer: Yes, we will keep all the materials after the course finishe...

Section: General course-related questions
Question: Course - What can I do before the course starts?
Answer: You can start by installing and setting up all the dependenc...

Section: General course-related questions
Question: How do I use Git / GitHub for this course?
Answer: After you create a GitHub account, you should clone the cour...

Section: Workshop 1 - dlthub
Question: How do I install the necessary dependencies to run the code?
Answer: Answer: To run the provided code, ensure that the 'dlt[duckd...



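These are the documents Elasticsearch retrieved. It matches the words of the query, such as ‘join’, ‘course’, ‘after’, and ‘started’, against the indexed fields. In general, the more often these words appear in a document, the more relevant Elasticsearch considers it, and matches in the `question` field are weighted three times higher because of the `question^3` boost in the query.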
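Elasticsearch's default relevance scoring is BM25, which rewards query terms that occur in a document and gives more weight to terms that are rare across the collection. The toy sketch below implements the textbook BM25 formula on three made-up sentences. It is only an illustration under simplifying assumptions: whitespace tokenization, no analyzers, no field boosts, and no claim to reproduce Elasticsearch's exact scores. The function name `bm25_scores` is hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with the textbook BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in set(query.lower().split()):
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "can i join the course after it has started",
    "how do i install docker",
    "the course materials stay online after the course",
]
scores = bm25_scores("join course after started", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best)  # 0
```

The first sentence wins because it contains all four query terms, including the two rare ones (`join`, `started`); the second shares no terms with the query and scores zero.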

Now we integrate the LLM. The prompt combines the user's question with the context, which is built from the documents retrieved by Elasticsearch.

In [12]:
print(qa_bot("How do I join the course after it has started?"))

Yes, you can still join the course after it has started. You are eligible to submit the homeworks even if you did not register at the beginning. However, please note that there are deadlines for turning in the final projects, so it's important not to delay too much.


In [13]:
print(qa_bot("I'm getting invalid reference format: repository name must be lowercase"))

It looks like you're encountering the error "invalid reference format: repository name must be lowercase" when working with Docker, particularly with mounting volumes on Windows. This issue often arises due to differences in handling file paths between different operating systems or formats. Here are some solutions you can try:

1. **Move your data to a folder without spaces:**
   If your project directory contains spaces (e.g., "C:/Users/Alexey Grigorev/git/..."), move it to a location without spaces (e.g., "C:/git/...").

2. **Different volume mapping options:**
   Try replacing the `-v` part of your Docker command with any of the following options:
   ```
   -v /c:/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data
   -v //c:/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data
   -v /c/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data
   -v //c/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data
   --volume //driveletter/path/ny_taxi_postgres_data/:/var/lib/po