# Homework 1 of LLM Zoomcamp
This notebook shows my work towards completing [Homework 1](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/cohorts/2025/01-intro/homework.md) of the 2025 cohort of the course LLM Zoomcamp.

## Question 1
With [Gemini's help](https://g.co/gemini/share/12e380bc651a), I was able to run Elasticsearch in a Docker container. The build hash was `dbcbbbd0bc4924cfeb28929dc05d82d662c527b7`.

Next, we get the FAQ data on which the questions will be based.

In [None]:
import requests
import tiktoken
from elasticsearch import Elasticsearch

In [None]:
docs_url = "https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1"
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course["course"]

    for doc in course["documents"]:
        doc["course"] = course_name
        documents.append(doc)

In [None]:
documents[0]
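
To make the flattening step concrete, here is a minimal sketch on a tiny made-up sample with the same nesting as `documents.json`. The `flatten_documents` helper and the sample entries are illustrative assumptions, not part of the homework. Unlike the loop above, it copies each entry instead of mutating it.

```python
def flatten_documents(documents_raw):
    """Return one flat dict per FAQ entry, tagged with its course name."""
    return [
        {**doc, "course": course["course"]}
        for course in documents_raw
        for doc in course["documents"]
    ]


# Made-up sample that mimics the nesting of documents.json.
sample_raw = [
    {
        "course": "course-a",
        "documents": [
            {"text": "Answer 1", "section": "General", "question": "Question 1"},
            {"text": "Answer 2", "section": "General", "question": "Question 2"},
        ],
    },
    {
        "course": "course-b",
        "documents": [
            {"text": "Answer 3", "section": "Setup", "question": "Question 3"},
        ],
    },
]

sample_documents = flatten_documents(sample_raw)
print(len(sample_documents))
print(sample_documents[0])
```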

## Question 2
The Docker container runs Elasticsearch `8.17.6`, so we install a matching 8.17 release of the Python client, `elasticsearch==8.17.0`.

We first create an Elasticsearch client and then define the settings for an index that will hold our FAQ documents. The `course` field is mapped as a keyword, while the remaining fields are text.

In [None]:
es_client = Elasticsearch("http://localhost:9200")
es_client.info()
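
The version number and build hash mentioned above can be read from the response of `es_client.info()`. The sketch below assumes the response has the usual `version.number` and `version.build_hash` fields and uses a hand-written dictionary (filled with the values reported in this notebook) in place of a live cluster. `describe_cluster` is a hypothetical helper name.

```python
def describe_cluster(info):
    """Pull the version number and build hash out of an info() response."""
    version = info["version"]
    return version["number"], version["build_hash"]


# Stand-in for es_client.info(), using the values reported in this notebook.
mock_info = {
    "version": {
        "number": "8.17.6",
        "build_hash": "dbcbbbd0bc4924cfeb28929dc05d82d662c527b7",
    }
}

number, build_hash = describe_cluster(mock_info)
print(f"Elasticsearch {number}, build {build_hash}")
```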

In [None]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"},
        }
    },
}

In [None]:
index_name = "llm-zoomcamp-docs"
es_client.indices.create(index=index_name, body=index_settings)

Now, using the `index()` method, we can add the data to Elasticsearch.

In [None]:
for doc in documents:
    es_client.index(index=index_name, document=doc)
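
Indexing one document per request is fine for a dataset of this size, but the same work can be batched. The following is a minimal sketch, assuming the `bulk` helper from `elasticsearch.helpers`. `make_bulk_actions` and `index_in_bulk` are hypothetical names, and only the action generator is exercised here because the bulk call needs a running cluster.

```python
def make_bulk_actions(docs, index_name):
    """Yield one bulk action per document for the given index."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}


def index_in_bulk(client, docs, index_name):
    """Send all documents in batched requests (needs a live cluster)."""
    # Imported here so the generator above works without the package installed.
    from elasticsearch import helpers

    return helpers.bulk(client, make_bulk_actions(docs, index_name))


# Made-up documents, only to show the shape of the generated actions.
sample_docs = [
    {"question": "Question 1", "text": "Answer 1", "course": "course-a"},
    {"question": "Question 2", "text": "Answer 2", "course": "course-a"},
]
actions = list(make_bulk_actions(sample_docs, "llm-zoomcamp-docs"))
print(actions[0])
```

With the client created earlier, the call would be `index_in_bulk(es_client, documents, index_name)`.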

## Question 3
Next, we search Elasticsearch for the documents most relevant to our query. We specify the query and its parameters, giving the `question` field a boost of 4, and then read off the highest `_score` in the response, which is `44.50556`.

In [None]:
query = "How do execute a command on a Kubernetes pod?"

In [None]:
search_query = {
    "size": 5,
    "query": {
        "bool": {
            "must": {
                "multi_match": {
                    "query": query,
                    "fields": ["question^4", "text"],
                    "type": "best_fields",
                }
            }
        }
    },
}

In [None]:
response = es_client.search(index=index_name, body=search_query)
response["hits"]["hits"]
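
The raw hit list is verbose, so it can help to reduce it to scores and questions. This sketch assumes the standard response layout (`hits.hits`, with each hit carrying `_score` and `_source`) and runs on a made-up response so that it works without a cluster. `summarize_hits` is a hypothetical helper, and the scores in the mock are placeholders, not real results.

```python
def summarize_hits(response):
    """Return (score, question) pairs, highest score first."""
    hits = response["hits"]["hits"]
    pairs = [(hit["_score"], hit["_source"]["question"]) for hit in hits]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


# Made-up response with placeholder scores and questions.
mock_response = {
    "hits": {
        "hits": [
            {"_score": 1.0, "_source": {"question": "Placeholder question B"}},
            {"_score": 2.0, "_source": {"question": "Placeholder question A"}},
        ]
    }
}

top_score, top_question = summarize_hits(mock_response)[0]
print(top_score, top_question)
```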

## Question 4
We now restrict the search to the `machine-learning-zoomcamp` course and look at the three documents retrieved for the question *How do copy a file to a Docker container?*.

In [None]:
question = "How do copy a file to a Docker container?"

In [None]:
filter_search_query = {
    "size": 3,
    "query": {
        "bool": {
            "must": {
                "multi_match": {
                    "query": question,
                    "fields": ["question^4", "text"],
                    "type": "best_fields",
                }
            },
            "filter": {"term": {"course": "machine-learning-zoomcamp"}},
        }
    },
}

The third question in the response is *How do I copy files from a different folder into docker container’s working directory?*.

In [None]:
filter_response = es_client.search(index=index_name, body=filter_search_query)
filter_response["hits"]["hits"]
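
The queries for Questions 3 and 4 differ only in the result size and the optional course filter, so they can be produced by one function. This is a minimal sketch: `build_search_query` and its parameters are hypothetical names, and the body it returns mirrors the dictionaries written out above.

```python
def build_search_query(query, size=5, course=None):
    """Build a multi_match query body, optionally filtered by course."""
    bool_clause = {
        "must": {
            "multi_match": {
                "query": query,
                "fields": ["question^4", "text"],
                "type": "best_fields",
            }
        }
    }
    # Only add the term filter when a course is given.
    if course is not None:
        bool_clause["filter"] = {"term": {"course": course}}
    return {"size": size, "query": {"bool": bool_clause}}


unfiltered = build_search_query("How do execute a command on a Kubernetes pod?")
filtered = build_search_query(
    "How do copy a file to a Docker container?",
    size=3,
    course="machine-learning-zoomcamp",
)
print(filtered["query"]["bool"]["filter"])
```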

## Question 5
We format each document returned by the previous query with the context template and join the resulting entries with two newlines (`\n\n`).

In [None]:
context_template = """
Q: {question}
A: {text}
""".strip()

context = ""
for resp in filter_response["hits"]["hits"]:
    source = resp["_source"]
    if context:
        context += "\n\n"
    context += context_template.format(question=source["question"], text=source["text"])
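
The same context can be built with `str.join`, which avoids the explicit separator check. The sketch below runs on made-up hits. `build_context` is a hypothetical helper, and `context_template` is redefined so that the block is self-contained.

```python
context_template = "Q: {question}\nA: {text}"


def build_context(hits):
    """Format each hit with the template and join entries with a blank line."""
    entries = [
        context_template.format(
            question=hit["_source"]["question"],
            text=hit["_source"]["text"],
        )
        for hit in hits
    ]
    return "\n\n".join(entries)


# Made-up hits with the same shape as the search response entries.
mock_hits = [
    {"_source": {"question": "Question 1", "text": "Answer 1"}},
    {"_source": {"question": "Question 2", "text": "Answer 2"}},
]

sample_context = build_context(mock_hits)
print(sample_context)
```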

We now insert the context into a prompt template and check the length of the resulting prompt, which comes out to *1446* characters.

In [None]:
prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()

In [None]:
len(prompt_template.format(question=question, context=context))

## Question 6
Using `tiktoken`, we fetch the encoding used by the GPT-4o model, which maps text to a sequence of integer token IDs. We then encode our prompt and count the tokens, which gives a length of *320 tokens*.

In [None]:
encoding = tiktoken.encoding_for_model("gpt-4o")

In [None]:
len(encoding.encode(prompt_template.format(question=question, context=context)))
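
Dividing the two lengths reported above gives a rough characters-per-token ratio for this particular prompt. This is plain arithmetic on the notebook's own numbers and should not be read as a general property of the tokenizer.

```python
prompt_chars = 1446   # character length reported in Question 5
prompt_tokens = 320   # token count reported in Question 6

# Average number of characters covered by one token in this prompt.
chars_per_token = prompt_chars / prompt_tokens
print(round(chars_per_token, 2))  # 4.52
```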