# Q1. Running Elastic

Run Elasticsearch 8.4.3 and get the cluster information. If it runs on localhost, query it like this:

`curl localhost:9200`

What's the version.build_hash value?

Output:

```json
{
  "name" : "69bf270d13e5",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "x95q-KD2Qgq-yB7VBR0opA",
  "version" : {
    "number" : "8.4.3",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "42f05b9372a9a4a470db3b52817899b99a76ee73",
    "build_date" : "2022-10-04T07:17:24.662462378Z",
    "build_snapshot" : false,
    "lucene_version" : "9.3.0",
    "minimum_wire_compatibility_version" : "7.17.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "You Know, for Search"
}
```

**Answer:** 42f05b9372a9a4a470db3b52817899b99a76ee73
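
The same lookup can be done from Python. This is a minimal sketch, assuming the cluster is reachable over plain HTTP without authentication, as in the Docker setup above. `fetch_cluster_info` and `get_build_hash` are illustrative helper names, not part of any library.

```python
import json
from urllib.request import urlopen


def fetch_cluster_info(url="http://localhost:9200"):
    """GET the root endpoint and parse the JSON body (needs a running cluster)."""
    with urlopen(url) as response:
        return json.load(response)


def get_build_hash(cluster_info):
    """Pull version.build_hash out of the parsed cluster information."""
    return cluster_info["version"]["build_hash"]


# Offline check against a trimmed copy of the output shown above.
sample_info = {
    "version": {
        "number": "8.4.3",
        "build_hash": "42f05b9372a9a4a470db3b52817899b99a76ee73",
    }
}
print(get_build_hash(sample_info))
```

With Elasticsearch running, `get_build_hash(fetch_cluster_info())` reads the value from the live cluster instead of the sample.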

# Getting the data

In [2]:
import requests 

docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)
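
To make the flattening step concrete, here is the same loop wrapped in a function and run on a tiny hand-written input. The sample records are made up for illustration and only mimic the shape of `documents.json` (a list of courses, each with a `documents` list). `flatten_documents` is a hypothetical helper name.

```python
def flatten_documents(documents_raw):
    """Return one flat list of documents, each tagged with its course name."""
    documents = []
    for course in documents_raw:
        for doc in course["documents"]:
            # Copy the record instead of mutating the input, then add the course.
            documents.append({**doc, "course": course["course"]})
    return documents


# Made-up input that only mirrors the nesting of documents.json.
sample_raw = [
    {
        "course": "course-a",
        "documents": [
            {"text": "t1", "section": "s1", "question": "q1"},
            {"text": "t2", "section": "s1", "question": "q2"},
        ],
    },
    {
        "course": "course-b",
        "documents": [{"text": "t3", "section": "s2", "question": "q3"}],
    },
]

flat = flatten_documents(sample_raw)
print(len(flat))
print(flat[-1])
```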

# Q2. Indexing the data

Index the data the same way as shown in the course videos. Make the `course` field a keyword; the rest should be text.

In [3]:
from elasticsearch import Elasticsearch

In [4]:
es_client = Elasticsearch('http://localhost:9200')

In [5]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} 
        }
    }
}

index_name = "course-questions"

es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})

In [6]:
from tqdm.auto import tqdm

In [7]:
for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

100%|████████████████████████████████████████████████████████████████████████████████| 948/948 [00:07<00:00, 124.37it/s]


Which function do you use for adding your data to Elasticsearch?  
**Answer:** index
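
Calling `index` once per document sends one HTTP request each. As an alternative sketch (not the approach asked for in the question), the client's `elasticsearch.helpers.bulk` helper can send documents in batches. `make_bulk_actions` and `bulk_index` are illustrative names. The import is kept inside the function so the action builder can be tried without the client installed or a cluster running.

```python
def make_bulk_actions(documents, index_name):
    """Yield one bulk action per document, targeting index_name."""
    for doc in documents:
        yield {"_index": index_name, "_source": doc}


def bulk_index(es_client, documents, index_name):
    """Send all documents through the bulk helper (needs a running cluster)."""
    from elasticsearch.helpers import bulk

    return bulk(es_client, make_bulk_actions(documents, index_name))


# Offline look at what a single action looks like.
actions = list(make_bulk_actions([{"text": "example"}], "course-questions"))
print(actions)
```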

# Q3. Searching

Now let's search in our index.

We will execute a query "How do I execute a command in a running docker container?".

Use only the `question` and `text` fields, give `question` a boost of 4, and use `"type": "best_fields"`.

What's the score for the top ranking result?

* 94.05
* 84.05
* 74.05
* 64.05


Look at the _score field.

In [18]:
def elastic_search(query, n = 1):
    search_query = {
        "size": n,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^4", "text"],
                        "type": "best_fields"
                    }
                },
                #"filter": {
                #    "term": {
                #        "course": "data-engineering-zoomcamp"
                #    }
                #}
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)
    
    return response

In [21]:
response = elastic_search("How do I execute a command in a running docker container?")
best_result = response['hits']['hits'][0]
print(f"""
Best score: {best_result['_score']}
Best result: {best_result['_source']['text']}
""")


Best score: 84.050095
Best result: Launch the container image in interactive mode and overriding the entrypoint, so that it starts a bash command.
docker run -it --entrypoint bash <image>
If the container is already running, execute a command in the specific container:
docker ps (find the container-id)
docker exec -it <container-id> bash
(Marcos MJD)



**Answer:** 84.05
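
The commented-out filter in the search cell above suggests the query is meant to be reused with and without a course restriction. One way to do that is to build the request body in a separate function. This is a sketch that assumes only the optional `course` filter varies; `build_search_query` is a hypothetical name.

```python
def build_search_query(query, n=1, course=None):
    """Build a multi_match request body, optionally restricted to one course."""
    bool_query = {
        "must": {
            "multi_match": {
                "query": query,
                "fields": ["question^4", "text"],  # question is boosted by 4
                "type": "best_fields",
            }
        }
    }
    if course is not None:
        # A term filter narrows the hits without contributing to _score.
        bool_query["filter"] = {"term": {"course": course}}
    return {"size": n, "query": {"bool": bool_query}}


question = "How do I execute a command in a running docker container?"
unfiltered = build_search_query(question)
filtered = build_search_query(question, n=3, course="machine-learning-zoomcamp")
print(sorted(unfiltered["query"]["bool"]))
print(sorted(filtered["query"]["bool"]))
```

The returned dictionary can be passed as `body=` to `es_client.search`, exactly as in the search cell above.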

# Q4. Filtering

Now let's limit the search to questions from `machine-learning-zoomcamp`.

Return 3 results. What's the 3rd question returned by the search engine?

* How do I debug a docker container?
* How do I copy files from a different folder into docker container’s working directory?
* How do Lambda container images work?
* How can I annotate a graph?

In [29]:
def elastic_search_ml(query, n = 1):
    search_query = {
        "size": n,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^4", "text"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "machine-learning-zoomcamp"
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)
    
    return response


def get_n_answer(query, n):
    """Print the text of the n-th hit (1-based) for `query`."""
    response = elastic_search_ml(query, n)
    nth_result = response['hits']['hits'][n-1]
    print("""
    Result: {res[_source][text]}
    """.format(res=nth_result))

In [30]:
get_n_answer("How do I debug a docker container?", 3)


    Result: I wanted to understand how lambda container images work in depth and how lambda functions are initialized, for this reason, I found the following documentation
https://docs.aws.amazon.com/lambda/latest/dg/images-create.html
https://docs.aws.amazon.com/lambda/latest/dg/runtimes-api.html
Added by Alejandro aponte
    


In [31]:
get_n_answer("How do I copy files from a different folder into docker container’s working directory?", 3)


    Result: Launch the container image in interactive mode and overriding the entrypoint, so that it starts a bash command.
docker run -it --entrypoint bash <image>
If the container is already running, execute a command in the specific container:
docker ps (find the container-id)
docker exec -it <container-id> bash
(Marcos MJD)
    


In [32]:
get_n_answer("How do Lambda container images work?", 3)


    Result: Launch the container image in interactive mode and overriding the entrypoint, so that it starts a bash command.
docker run -it --entrypoint bash <image>
If the container is already running, execute a command in the specific container:
docker ps (find the container-id)
docker exec -it <container-id> bash
(Marcos MJD)
    


In [33]:
get_n_answer("How can I annotate a graph?", 3)


    Result: Welcome to the course! Go to the course page (http://mlzoomcamp.com/), scroll down and start going through the course materials. Then read everything in the cohort folder for your cohort’s year.
Click on the links and start watching the videos. Also watch office hours from previous cohorts. Go to DTC youtube channel and click on Playlists and search for {course yyyy}. ML Zoomcamp was first launched in 2021.
Or you can just use this link: http://mlzoomcamp.com/#syllabus
    

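Since Q4 asks for the 3rd question rather than the answer text, it can help to list the `question` field of every hit for a single query. A small sketch: `hit_questions` is a hypothetical helper, and the response below is a hand-written stand-in with placeholder strings that only mimics the `hits.hits[]._source` shape of a real search response.

```python
def hit_questions(response):
    """Return the question of every hit, in ranking order."""
    return [hit["_source"]["question"] for hit in response["hits"]["hits"]]


# Hand-written stand-in for a search response; the strings are placeholders.
fake_response = {
    "hits": {
        "hits": [
            {"_score": 3.0, "_source": {"question": "placeholder question 1", "text": "..."}},
            {"_score": 2.0, "_source": {"question": "placeholder question 2", "text": "..."}},
            {"_score": 1.0, "_source": {"question": "placeholder question 3", "text": "..."}},
        ]
    }
}

for rank, question in enumerate(hit_questions(fake_response), start=1):
    print(rank, question)
```

With a running cluster, `hit_questions(elastic_search_ml(query, 3))[2]` would give the 3rd question for `query`, using the `elastic_search_ml` defined in this section (the version that returns the raw response).
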

# Q5. Building a prompt

In [42]:
def elastic_search_ml(query, n = 1):
    search_query = {
        "size": n,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^4", "text"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "machine-learning-zoomcamp"
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)

    result_docs = []
    
    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])
    
    return result_docs

In [46]:
def build_prompt(query, search_results):
    prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT: 
{context}
""".strip()

    context = ""
    
    for doc in search_results:
        context = context + f"Q: {doc['question']}\nA: {doc['text']}\n\n"
    
    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt

In [47]:
def rag1(query): 
    search_results = elastic_search_ml(query, 3)
    prompt = build_prompt(query, search_results)
    return prompt

In [49]:
prompt = rag1("How do I execute a command in a running docker container?")
len(prompt)

1463

What's the length of the resulting prompt? (use the len function)

* 962
* **1462** (closest to the computed 1463)
* 1962
* 2462

**Answer**: 1462

# Q6. Tokens

In [50]:
pip install tiktoken

Successfully installed regex-2024.5.15 tiktoken-0.7.0

In [52]:
import tiktoken

In [53]:
encoding = tiktoken.encoding_for_model("gpt-4o")

In [55]:
len(encoding.encode(prompt))

323

Use the encode function. How many tokens does our prompt have?

* 122
* 222
* **322** (closest to the computed 323)
* 422

**Answer:** 322

Note: to decode a token back into a word, you can use the `decode_single_token_bytes` function:

In [59]:
encoding.decode_single_token_bytes(100)

b'\xa7'
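
To see how a whole string breaks into tokens, the same call can be applied to every token id. A minimal sketch: `token_pieces` is a hypothetical helper that works with any object exposing `encode` and `decode_single_token_bytes`, such as the `encoding` created above.

```python
def token_pieces(encoding, text):
    """Return the raw bytes behind each token id that `encoding` produces for `text`."""
    return [
        encoding.decode_single_token_bytes(token)
        for token in encoding.encode(text)
    ]
```

For example, `token_pieces(encoding, prompt)[:10]` shows the first ten pieces of the prompt. Joining all pieces with `b"".join(...)` and decoding the result as UTF-8 should give back the original text.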