Run the command below to start a single-node Elasticsearch instance in Docker:

docker run -it --rm --name elasticsearch -m 4GB -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" -e "xpack.security.enabled=false" docker.elastic.co/elasticsearch/elasticsearch:8.17.6

In [1]:
from elasticsearch import Elasticsearch

In [2]:
es_client = Elasticsearch('http://localhost:9200') 

In [3]:
es_client.info()

ObjectApiResponse({'name': '31bf376eb9e4', 'cluster_name': 'docker-cluster', 'cluster_uuid': '5YqUafsCQsKKgZyjvov8Qw', 'version': {'number': '8.17.6', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': 'dbcbbbd0bc4924cfeb28929dc05d82d662c527b7', 'build_date': '2025-04-30T14:07:12.231372970Z', 'build_snapshot': False, 'lucene_version': '9.12.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

#### Question 1

What's the version.build_hash value?

In [4]:
es_client.info()['version']['build_hash']

'dbcbbbd0bc4924cfeb28929dc05d82d662c527b7'

#### Continuation

In [5]:
import requests 

docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)
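
A quick sanity check on the flattening step is to count the documents per course. The sketch below runs on a tiny hand-made sample in the same nested shape as documents.json (the sample data and the `flatten_documents` name are assumptions for illustration, not part of the notebook).

```python
from collections import Counter

def flatten_documents(documents_raw):
    """Copy each document and tag it with the name of its course."""
    flat = []
    for course in documents_raw:
        for doc in course["documents"]:
            flat.append({**doc, "course": course["course"]})
    return flat

# Hypothetical sample in the same nested shape as documents.json
sample_raw = [
    {"course": "course-a", "documents": [
        {"text": "t1", "section": "s1", "question": "q1"},
        {"text": "t2", "section": "s1", "question": "q2"},
    ]},
    {"course": "course-b", "documents": [
        {"text": "t3", "section": "s2", "question": "q3"},
    ]},
]

sample_docs = flatten_documents(sample_raw)
per_course = Counter(doc["course"] for doc in sample_docs)
print(per_course)
```

Unlike the loop above, this variant copies each document instead of mutating the raw data in place, which keeps `documents_raw` reusable.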

In [6]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {  # "course" is a keyword for exact-match filtering; the other fields are analyzed text
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} 
        }
    }
}

index_name = "course-questions-homework" # Just a name for the index

es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions-homework'})
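
Re-running the create cell against an index that already exists raises an error, so when iterating on a notebook it can help to drop and recreate the index. The helper below is a minimal sketch: `recreate_index` is a hypothetical name, and it assumes an 8.x Python client where `indices.create` accepts `settings` and `mappings` keyword arguments. Note that it deletes all data already in the index.

```python
def recreate_index(client, index_name, index_settings):
    """Drop the index if it already exists, then create it from scratch."""
    if client.indices.exists(index=index_name):
        client.indices.delete(index=index_name)
    return client.indices.create(
        index=index_name,
        settings=index_settings["settings"],
        mappings=index_settings["mappings"],
    )

# Usage in this notebook (requires the running Elasticsearch container):
# recreate_index(es_client, index_name, index_settings)
```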

#### Q2. Indexing the data


Index the data in the same way as was shown in the course videos. Make the course field a keyword and the rest should be text. Which function do you use for adding your data to elastic?

insert\
index\
put\
add

The answer is index: the client's index function adds a document to Elasticsearch.

In [7]:
from tqdm.auto import tqdm

In [8]:
for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

  0%|          | 0/948 [00:00<?, ?it/s]
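
The loop above sends one request per document, which is what the homework asks for. As an alternative (not the method used in this notebook), the client library ships a bulk helper that batches documents into fewer requests. The sketch below assumes `elasticsearch.helpers.bulk`; `generate_actions` and `bulk_index` are illustrative names.

```python
def generate_actions(documents, index_name):
    """Yield one bulk action per document, targeting the given index."""
    for doc in documents:
        yield {"_index": index_name, "_source": doc}

def bulk_index(client, documents, index_name):
    """Index all documents through the bulk helper; returns (successes, errors)."""
    # Imported lazily so generate_actions stays usable without the client library
    from elasticsearch.helpers import bulk
    return bulk(client, generate_actions(documents, index_name))

# Usage in this notebook (requires the running Elasticsearch container):
# bulk_index(es_client, documents, index_name)
```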

#### Question 3

Now let's search in our index.

We will execute the query "How do execute a command on a Kubernetes pod?".

Use only question and text fields and give question a boost of 4, and use "type": "best_fields".

What's the score for the top ranking result?

In [9]:
# Search the index with a multi_match query over the question and text fields
def elastic_search(query):
    search_query = {
        "size": 5,  # Number of results to return
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^4", "text"],  # Boost the question field by 4
                        "type": "best_fields"
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)  # Run the query against the index

    return response

In [10]:
query = "How do execute a command on a Kubernetes pod?"

In [11]:
elastic_search(query)['hits']['max_score']

44.50556
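
With "type": "best_fields", a document's score comes from its best-matching field, and the `^4` boost multiplies the question field's score before the comparison. The sketch below imitates that combination with made-up per-field scores; `best_fields_score` is a hypothetical helper that follows the dis_max rule (best score plus `tie_breaker` times the remaining scores, with `tie_breaker` defaulting to 0), not Elasticsearch's actual implementation.

```python
def best_fields_score(field_scores, boosts, tie_breaker=0.0):
    """Combine per-field scores the way a dis_max query does.

    field_scores: raw relevance score per field (made-up numbers here)
    boosts: multiplier per field, e.g. 4 for "question^4"
    """
    boosted = [score * boosts.get(field, 1.0) for field, score in field_scores.items()]
    best = max(boosted)
    others = sum(boosted) - best
    return best + tie_breaker * others

# Hypothetical raw scores for one document
scores = {"question": 2.5, "text": 6.0}
boosts = {"question": 4}

print(best_fields_score(scores, boosts))                   # 10.0
print(best_fields_score(scores, boosts, tie_breaker=0.5))  # 13.0
```

In this example the text field has the higher raw score, but the boost lets the question field win, which is the effect the homework's boost of 4 is meant to produce.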

#### Question 4. Filtering

Now let's limit the search to questions from machine-learning-zoomcamp.

Return 3 results. What's the 3rd question returned by the search engine?

How do I debug a docker container?\
How do I copy files from a different folder into docker container’s working directory?\
How do Lambda container images work?\
How can I annotate a graph?

In [12]:
query = "How do copy a file to a Docker container?"

In [13]:
# Function to search the documents based on the query using ElasticSearch
def elastic_search(query):
    search_query = {
        "size": 3, # Number of results to return 
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^4", "text"], # Boosting the question field by 4
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "machine-learning-zoomcamp" # Filtering the results based on the course
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query) # Searching the index with the query
    
    return response


In [14]:
elastic_search(query)['hits']['hits'][2]['_source']['question']

'How do I copy files from a different folder into docker container’s working directory?'
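
The Q3 and Q4 searches differ only in the result size and the course filter, so the request body can be produced by one function. This is a sketch of that refactoring; `build_search_query` is a hypothetical name, and the body it returns has the same structure as the dictionaries used above.

```python
def build_search_query(query, course=None, size=5):
    """Build the multi_match request body, adding a course filter only when asked."""
    bool_query = {
        "must": {
            "multi_match": {
                "query": query,
                "fields": ["question^4", "text"],  # Boost the question field by 4
                "type": "best_fields",
            }
        }
    }
    if course is not None:
        # A term filter on a keyword field matches the exact value and does not affect scoring
        bool_query["filter"] = {"term": {"course": course}}
    return {"size": size, "query": {"bool": bool_query}}

q3_body = build_search_query("How do execute a command on a Kubernetes pod?")
q4_body = build_search_query(
    "How do copy a file to a Docker container?",
    course="machine-learning-zoomcamp",
    size=3,
)
```

Either body can then be passed to the client, for example `es_client.search(index=index_name, body=q4_body)`.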

#### Q5. Building a prompt


Now we're ready to build a prompt to send to an LLM.

Take the records returned from Elasticsearch in Q4 and use the context template (context_template in the cells below) to build the context. Separate context entries with two line breaks (\n\n).

Now use that context, along with the Q4 question "How do copy a file to a Docker container?" (the value of query that the code below passes in), to construct a prompt using the prompt template (prompt_template) below.

What's the length of the resulting prompt? (use the len function)


In [15]:
context_template = """
Q: {question}
A: {text}
""".strip()

In [16]:
prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()

In [17]:
context_pieces = []

for hit in elastic_search(query)['hits']['hits']:
    doc = hit['_source']
    context_piece = context_template.format(**doc)
    context_pieces.append(context_piece)

context = '\n\n'.join(context_pieces)

In [18]:
len(prompt_template.format(question=query, context=context))

1446
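
The context and prompt steps can also be wrapped into small functions, which makes them easy to check without a running cluster. The sketch below is an illustration under assumptions: `build_context` and `build_prompt` are hypothetical names, the hits are hand-made stand-ins for `response['hits']['hits']`, and the templates are shortened versions of the ones above.

```python
def build_context(hits, context_template):
    """Format each hit's _source with the template and join entries with a blank line."""
    return "\n\n".join(
        context_template.format(
            question=hit["_source"]["question"],
            text=hit["_source"]["text"],
        )
        for hit in hits
    )

def build_prompt(question, hits, prompt_template, context_template):
    """Fill the prompt template with the question and the joined context."""
    context = build_context(hits, context_template)
    return prompt_template.format(question=question, context=context)

# Hypothetical hits in the shape of response['hits']['hits']
sample_hits = [
    {"_source": {"question": "q1", "text": "a1"}},
    {"_source": {"question": "q2", "text": "a2"}},
]
sample_prompt = build_prompt(
    "my question",
    sample_hits,
    prompt_template="QUESTION: {question}\n\nCONTEXT:\n{context}",
    context_template="Q: {question}\nA: {text}",
)
print(sample_prompt)
```

Passing the two fields explicitly, rather than unpacking the whole document with `**doc`, makes it visible which fields the template depends on.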

#### Q6. Tokens

In [19]:
import tiktoken

In [20]:
encoding = tiktoken.encoding_for_model("gpt-4o")

In [21]:
len(encoding.encode(prompt_template.format(question=query, context=context)))

320

In [22]:
encoding.decode_single_token_bytes(63842)

b"You're"
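
To see how a prompt is split, it can help to list the first few token ids next to the bytes they decode to. The helper below is a small sketch: `first_tokens` is a hypothetical name, and it only relies on the `encode` and `decode_single_token_bytes` methods already used in the cells above.

```python
def first_tokens(encoding, text, limit=5):
    """Return (token_id, token_bytes) pairs for the first `limit` tokens of text."""
    token_ids = encoding.encode(text)[:limit]
    return [(tid, encoding.decode_single_token_bytes(tid)) for tid in token_ids]

# Usage in this notebook (requires tiktoken and its downloaded encoding data):
# first_tokens(encoding, prompt_template.format(question=query, context=context))
```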