### Step 1: Prepare documents

In [1]:
import json

with open("../01-intro/documents.json", "rt") as f:
    docs = json.load(f)

In [2]:
docs

[{'course': 'data-engineering-zoomcamp',
  'documents': [{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
    'section': 'General course-related questions',
    'question': 'Course - When will the course start?'},
   {'text': 'GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites',
    'section': 'General course-related questions',
    'question': 'Course - What are the prerequisites for this course?'},
   {'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in

In [3]:
documents = []

for course_dict in docs:
    for doc in course_dict["documents"]:
        doc["course"] = course_dict["course"]
        documents.append(doc)

documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}
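
The loop above adds the `course` key to the dictionaries inside `docs` in place. If the original `docs` structure should stay untouched, a non-mutating variant could look like the sketch below. The helper name `flatten_documents` and the toy input are made up for illustration.

```python
def flatten_documents(courses):
    """Return one flat list of FAQ entries, each tagged with its course name."""
    return [
        {**doc, "course": course["course"]}  # copy the entry, then add the course
        for course in courses
        for doc in course["documents"]
    ]


# Toy input (invented) in the same shape as documents.json
sample = [
    {
        "course": "demo-course",
        "documents": [{"text": "t", "section": "s", "question": "q"}],
    }
]
flat = flatten_documents(sample)
```

With the real data this would be called as `documents = flatten_documents(docs)`.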

### Step 2: Create Embeddings using Pretrained Models

This step uses a library that was not needed in the previous modules.
Install it with `pip install sentence_transformers==2.7.0`, then import it:
```
from sentence_transformers import SentenceTransformer
```

**Note:**
If you get an error, do the following:
1. Uninstall `numpy`: `pip uninstall numpy`
2. Uninstall `torch`: `pip uninstall torch`
3. Install `numpy`: `pip install numpy==1.26.4`
4. Install `torch`: `pip install torch`

After completing the steps above, the import should work.
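
To see which versions are actually installed before and after the reinstall, a small check with the standard library's `importlib.metadata` may help. This is only a sketch; the helper name `installed_version` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Print the versions of the packages mentioned in the note above
for name in ("numpy", "torch", "sentence-transformers"):
    print(name, installed_version(name))
```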

In [4]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")

In [5]:
# let's check the model
model.encode("I am learning LLMs")

array([-1.46834739e-02, -1.45687670e-01, -1.11717647e-02, -4.17774823e-03,
       -4.69497740e-02,  4.05647382e-02,  1.46392137e-02, -1.13837598e-02,
        1.49467895e-02,  1.39439609e-02,  3.87493260e-02,  3.34818824e-03,
        6.49927836e-03, -5.10478243e-02,  5.26150949e-02,  2.74828705e-03,
        7.30563179e-02, -9.74272657e-03, -7.23577812e-02,  1.04203671e-02,
       -5.67392725e-03, -2.78357398e-02, -2.89573912e-02, -3.97306867e-02,
       -4.78648953e-02, -4.50870246e-02, -3.09748445e-02,  3.33431810e-02,
        2.25111451e-02, -2.04426274e-02,  4.08244319e-02,  2.45035117e-05,
        2.86329444e-02, -4.12123613e-02,  1.61245384e-06, -1.89224500e-02,
       -1.36324586e-02, -1.44837871e-02, -3.68302055e-02,  2.33760327e-02,
       -1.37301199e-02,  7.57615715e-02,  8.50398187e-03,  4.21982817e-02,
       -7.40093365e-03,  6.98372871e-02,  3.79895382e-02,  2.83731353e-02,
       -1.43681839e-02,  8.28124955e-02,  3.51101793e-02, -3.13635468e-02,
       -2.71269325e-02, -

In [6]:
# let's check the output embedding size
len(model.encode("I am learning LLMs"))

768
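
Embeddings are useful because sentences with similar meaning tend to get vectors that point in similar directions. The index created in Step 4 compares vectors with cosine similarity; the sketch below shows that measure in plain Python on toy vectors. The function name and the toy values are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # unrelated directions
```

The same function should also accept the output of `model.encode(...)`, for example to compare the embeddings of two questions.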

In [7]:
documents[2]

{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
 'section': 'General course-related questions',
 'question': 'Course - Can I still join the course after the start date?',
 'course': 'data-engineering-zoomcamp'}

In [8]:
# create the dense vector using the pre-trained model

operations = []
for document in documents:
    document["text_vector"] = model.encode(
        document["text"]
    ).tolist()  # convert to list (numpy -> list)
    operations.append(document)
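
Encoding one document at a time works, but `model.encode` also accepts a list of strings, which lets the model process texts in batches. A minimal sketch, assuming the encoder is passed in as a callable (the helper name `add_text_vectors` and its `batch_size` default are illustrative):

```python
def add_text_vectors(documents, encode, batch_size=32):
    """Attach a `text_vector` list to each document, encoding texts in batches.

    `encode` is any callable that maps a list of strings to a sequence of
    vectors, for example `model.encode`.
    """
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        vectors = encode([doc["text"] for doc in batch])
        for doc, vector in zip(batch, vectors):
            # Plain Python floats so the document can be serialized as JSON
            doc["text_vector"] = [float(x) for x in vector]
    return documents
```

With the model loaded above, the call would be `add_text_vectors(documents, model.encode)`.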

In [9]:
operations[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp',
 'text_vector': [-0.03570352494716644,
  -0.06891420483589172,
  -0.04448334500193596,
  0.006250137463212013,
  -0.04224714636802673,
  -0.006454011891037226,
  0.027548108249902725,
  -0.018610386177897453,
  -0.052037570625543594,
  -0.0024046599864959717,
  0.057454511523246765,
  -0.015592562034726143,
  0.015488838776946068,
  0.09870167821645737,
  0.015649672597646713,
  

### Step 3: Set up Elasticsearch connection

We need to run the Elasticsearch Docker container using the following command:

```bash
docker run -it \
    --rm \
    --name elasticsearch \
    -p 9200:9200 \
    -p 9300:9300 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    docker.elastic.co/elasticsearch/elasticsearch:8.4.3
```

In [10]:
from elasticsearch import Elasticsearch

es_client = Elasticsearch("http://localhost:9200")

es_client.info()

ObjectApiResponse({'name': '709c1cb2dae4', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'KQOcYMoATwuQJQ1QNziYFA', 'version': {'number': '8.4.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '42f05b9372a9a4a470db3b52817899b99a76ee73', 'build_date': '2022-10-04T07:17:24.662462378Z', 'build_snapshot': False, 'lucene_version': '9.3.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})
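
The container can take a while to start, so the first `es_client.info()` call may fail with a connection error. One way to handle this is to poll until the server answers. The sketch below takes the check as a callable; the helper name `wait_until_ready` and its defaults are made up for illustration.

```python
import time


def wait_until_ready(ping, attempts=30, delay=1.0):
    """Call `ping` until it returns True; give up after `attempts` tries."""
    for _ in range(attempts):
        try:
            if ping():
                return True
        except Exception:
            # Treat any error as "not ready yet" and retry
            pass
        time.sleep(delay)
    return False
```

With the client above, this could be used as `wait_until_ready(es_client.ping)` before calling `es_client.info()`.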

### Step 4: Create Mappings and Index

- Mapping is the process of defining how a document, and the fields it contains, are stored and indexed.
- Each document is a collection of fields, which each have their own data type.
- We can compare a mapping to a database schema: it describes the fields and properties that documents hold, the data type of each field (e.g., integer or date), and how those fields should be indexed and stored.

In [11]:
index_settings = {
    "settings": {"number_of_shards": 1, "number_of_replicas": 0},
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"},
            "text_vector": {
                "type": "dense_vector",
                "dims": 768,
                "index": True,
                "similarity": "cosine",
            },
        }
    },
}
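
The `dims` value has to match the embedding size of the model (768 for the model used here). To avoid hard-coding it, the settings can be built from a parameter, as in the sketch below. The helper name `build_index_settings` is illustrative; the fields mirror the mapping above.

```python
def build_index_settings(dims, similarity="cosine"):
    """Return index settings whose dense_vector field has `dims` dimensions."""
    return {
        "settings": {"number_of_shards": 1, "number_of_replicas": 0},
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "section": {"type": "text"},
                "question": {"type": "text"},
                "course": {"type": "keyword"},
                "text_vector": {
                    "type": "dense_vector",
                    "dims": dims,
                    "index": True,
                    "similarity": similarity,
                },
            }
        },
    }


settings_768 = build_index_settings(768)
```

With the model loaded, the size could be read from it via `model.get_sentence_embedding_dimension()` instead of typing 768.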

In [12]:
index_name = "course-questions"

es_client.indices.delete(index=index_name, ignore_unavailable=True)
es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})

### Step 5: Add documents into index

In [13]:
for document in documents:
    try:
        es_client.index(index=index_name, document=document)
    except Exception as e:
        print(f"Exception occurred: {e}")
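
Indexing one document per request is fine for a small FAQ, but each call is a separate HTTP round trip. The `elasticsearch.helpers.bulk` helper can send many documents per request. A minimal sketch, with illustrative helper names `generate_actions` and `bulk_index`:

```python
def generate_actions(index_name, documents):
    """Yield one bulk action per document, in the shape the bulk helper expects."""
    for doc in documents:
        yield {"_index": index_name, "_source": doc}


def bulk_index(es_client, index_name, documents):
    """Index all documents through the bulk helper and return its result."""
    # Imported here so the generator above can be used without the client installed
    from elasticsearch.helpers import bulk

    return bulk(es_client, generate_actions(index_name, documents))
```

With the objects from this notebook the call would be `bulk_index(es_client, index_name, documents)`.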

### Step 6: Create end user query

In [14]:
search_term = "should I use windows?"
vector_search_term = model.encode(search_term)

In [15]:
# create a search query
query = {
    "field": "text_vector",
    "query_vector": vector_search_term,
    "k": 5,
    "num_candidates": 1000,
}
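
In this query, `k` is the number of nearest neighbours to return and `num_candidates` is how many candidates are considered per shard before the top `k` are selected, so `k` should not exceed `num_candidates`. The sketch below wraps the query in a helper that checks this and converts the vector to a plain list. The helper name `build_knn_query` is made up for illustration.

```python
def build_knn_query(query_vector, k=5, num_candidates=1000, field="text_vector"):
    """Build the dictionary passed as `knn=` to `es_client.search`."""
    if k > num_candidates:
        raise ValueError("k must not exceed num_candidates")
    return {
        "field": field,
        # Plain floats keep the query JSON-serializable
        "query_vector": [float(x) for x in query_vector],
        "k": k,
        "num_candidates": num_candidates,
    }


toy_query = build_knn_query([0.1, 0.2], k=2, num_candidates=10)
```

For the search term above this would be `build_knn_query(model.encode(search_term))`.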

In [16]:
result = es_client.search(
    index=index_name, knn=query, source=["text", "section", "question", "source"]
)
result["hits"]["hits"]

[{'_index': 'course-questions',
  '_id': 'E908xJAB1xkangf6ZHol',
  '_score': 0.7810613,
  '_source': {'question': 'Environment - Is the course [Windows/mac/Linux/...] friendly?',
   'section': 'General course-related questions',
   'text': 'Yes! Linux is ideal but technically it should not matter. Students last year used all 3 OSes successfully'}},
 {'_index': 'course-questions',
  '_id': 'A908xJAB1xkangf6Y3rY',
  '_score': 0.6734718,
  '_source': {'question': 'Environment - Is Python 3.9 still the recommended version to use in 2024?',
   'section': 'General course-related questions',
   'text': 'Yes, for simplicity (of troubleshooting against the recorded videos) and stability. [source]\nBut Python 3.10 and 3.11 should work fine.'}},
 {'_index': 'course-questions',
  '_id': 'QN08xJAB1xkangf6b3w1',
  '_score': 0.6678438,
  '_source': {'question': 'How to install WSL on Windows 10 and 11 ?',
   'section': '5. Deploying Machine Learning Models',
   'text': 'It is quite simple, and you ca

### Step 7: Perform Keyword Search with Semantic Search (Hybrid / Advanced Search)

In [17]:
# response = es_client.search(
#     index=index_name,
#     query={
#         "bool": {
#             "must": {
#                 "multi_match":
#                     {
#                         "query": "windows or python?",
#                         "fields": ["text", "question", "course", "title"],
#                         "type": "best_fields"
#                     }
#                 },
#             "filter": {
#                     "term":
#                         {
#                             "course": "data-engineering-zoomcamp"
#                         }
#             }

#         }
#     }
# )
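
A sketch of how a keyword query like the commented-out one could be combined with a vector query, with both restricted to one course. The helper name `build_hybrid_search` is illustrative, the field list is limited to text fields that exist in the mapping from Step 4, and the `filter` key inside `knn` is an assumption that should be checked against the documentation for the Elasticsearch version in use.

```python
def build_hybrid_search(query_text, query_vector, course, k=5, num_candidates=1000):
    """Return keyword arguments for `es_client.search` combining keyword and kNN search."""
    course_filter = {"term": {"course": course}}
    return {
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query_text,
                        "fields": ["text", "question", "section"],
                        "type": "best_fields",
                    }
                },
                "filter": course_filter,
            }
        },
        "knn": {
            "field": "text_vector",
            "query_vector": [float(x) for x in query_vector],
            "k": k,
            "num_candidates": num_candidates,
            "filter": course_filter,  # assumed to be supported by the knn option
        },
        "size": k,
    }


toy_search = build_hybrid_search("windows or python?", [0.1, 0.2], "data-engineering-zoomcamp")
```

The result is meant to be unpacked into the call, for example `es_client.search(index=index_name, **build_hybrid_search(search_term, vector_search_term, "data-engineering-zoomcamp"))`.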

In [18]:
knn_query = {
    "field": "text_vector",
    "query_vector": vector_search_term,
    "k": 5,
    "num_candidates": 1000,
}

response = es_client.search(
    index=index_name,
    query={
        "match": {"course": "data-engineering-zoomcamp"},
    },
    knn=knn_query,
    size=5,
    explain=True,
)

In [None]:
response["hits"]["hits"]

[{'_index': 'course-questions',
  '_id': 'pd2tw5AB1xkangf6kXbH',
  '_score': 7.728908,
  '_source': {'text': 'Problem: If you have already installed pgcli but bash doesn\'t recognize pgcli\nOn Git bash: bash: pgcli: command not found\nOn Windows Terminal: pgcli: The term \'pgcli\' is not recognized…\nSolution: Try adding a Python path C:\\Users\\...\\AppData\\Roaming\\Python\\Python39\\Scripts to Windows PATH\nFor details:\nGet the location: pip list -v\nCopy C:\\Users\\...\\AppData\\Roaming\\Python\\Python39\\site-packages\n3. Replace site-packages with Scripts: C:\\Users\\...\\AppData\\Roaming\\Python\\Python39\\Scripts\nIt can also be that you have Python installed elsewhere.\nFor me it was under c:\\python310\\lib\\site-packages\nSo I had to add c:\\python310\\lib\\Scripts to PATH, as shown below.\nPut the above path in "Path" (or "PATH") in System Variables\nReference: https://stackoverflow.com/a/68233660',
   'section': 'Module 1: Docker and Terraform',
   'question': 'PGCLI - pg