Query: I just discovered the course. Can I still join?
Relevant documents: doc1, doc5, doc3



``` python

# Pseudocode (not runnable as-is):
# for record in faq:
#     generate 5 questions for this record

```

Q1 -> q1_result in array form []

Q2 -> q2_result in array form []

Q3 -> q3_result in array form []

Q4 -> q4_result in array form []

Q5 -> q5_result in array form []



> So if there are 10 records, we would have 50 questions. We run the search for each of those 50 questions, collect the results, and then compute evaluation metrics to measure the quality of those results.


> The evaluation is done using `ground_truth_data`: we compare our results against the expected results (`ground_truth_data`).
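
As a rough sketch of the flow described above, the snippet below builds ground-truth rows from a couple of FAQ records. `generate_questions` is a hypothetical stand-in for whatever produces the questions (for example an LLM call); here it returns canned text so the example runs on its own. The `question`, `course`, and `document` field names mirror the ones read from `ground-truth-data.csv` later in these notes; the sample records are made up.

```python
# Hypothetical sample records; the real ones come from documents-with-ids.json
faq = [
    {"id": "doc1", "course": "demo-course", "question": "When will the course start?"},
    {"id": "doc2", "course": "demo-course", "question": "Can I still join?"},
]


def generate_questions(record, n=5):
    # Stand-in for a real question generator (e.g. an LLM prompt).
    return [f"Variant {i + 1} of: {record['question']}" for i in range(n)]


ground_truth_rows = []
for record in faq:
    for question in generate_questions(record, n=5):
        ground_truth_rows.append({
            "question": question,
            "course": record["course"],
            "document": record["id"],  # the ID the search is expected to return
        })

print(len(ground_truth_rows))  # 2 records * 5 questions = 10
```

Each row pairs one generated question with the ID of the record it came from, which is what makes the later True/False relevance check possible.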


### How do we pick `Ground Truth Data`?


- Manual Annotation:

    - Experts or crowdworkers label relevant documents. Accurate but costly and slow.

- Implicit User Feedback:

    - Use user actions like clicks or watch time as relevance signals. Scalable but noisy.

- Explicit User Feedback:

    - Users give ratings or likes. More reliable but sparse.

- Public Datasets:

    - Use existing benchmark datasets if they fit your needs.

- Online A/B Testing:

    - Validate system effectiveness with real users.



In [10]:
import json

with open('documents-with-ids.json', 'rt') as f_in:
    documents = json.load(f_in)

# Inspect the first document
print(documents[0])


{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.", 'section': 'General course-related questions', 'question': 'Course - When will the course start?', 'course': 'data-engineering-zoomcamp', 'id': 'c02e79ef'}


#### Here is the format of the documents we will work on

``` json
{
    "text": "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  \u201cOffice Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon\u2019t forget to register in DataTalks.Club's Slack and join the channel.",
    "section": "General course-related questions",
    "question": "Course - When will the course start?",
    "course": "data-engineering-zoomcamp",
    "id": "c02e79ef"
}
```

In [11]:
from elasticsearch import Elasticsearch

# create our local client 

es_client = Elasticsearch('http://localhost:9200') 

# define the index settings and mappings, then (re)create the index so it is ready for our data

index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"},
            "id": {"type": "keyword"},
        }
    }
}

index_name = "course-questions"

es_client.indices.delete(index=index_name, ignore_unavailable=True)
es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})

In [12]:
from tqdm.auto import tqdm

for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

  0%|          | 0/948 [00:00<?, ?it/s]

In [13]:

# search the Elasticsearch index (our knowledge base)
def elastic_search(query, course):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": course
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)
    
    result_docs = []
    
    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])
    
    return result_docs

In [14]:

# let's run a query (what we search for); we only look at documents in "data-engineering-zoomcamp"
elastic_search(
    query="I just discovered the course. Can I still join?",
    course="data-engineering-zoomcamp"
)

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp',
  'id': '7842b56a'},
 {'text': 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (installed with Anaconda)\nTerraform\nGit\nLook over the prerequisites and syllabus to see if you are comfortable with these subjects.',
  'section': 'General course-related questions',
  'question': 'Course - What can I do before the course starts?',
  'course': 'data-engineering-zoomcamp',
  'id': '63394d91'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it fin

In [15]:
import pandas as pd

# here is the ground truth data we prepared for the evaluation

df_ground_truth = pd.read_csv('ground-truth-data.csv')

In [16]:
# convert it to a list of dicts (one per row)
ground_truth = df_ground_truth.to_dict(orient='records')

In [17]:
# Perform Elasticsearch query for each question in the ground truth.
# For each result, check if the retrieved document ID matches the expected document ID.
# Save the relevance (True/False) list for each query into a 2D array `relevance_total`.

relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    results = elastic_search(query=q['question'], course=q['course'])
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)

  0%|          | 0/4627 [00:00<?, ?it/s]

In [18]:
# this is just an example, not actual output

example = [
    [True, False, False, False, False], # 1, 
    [False, False, False, False, False], # 0
    [False, False, False, False, False], # 0 
    [False, False, False, False, False], # 0
    [False, False, False, False, False], # 0 
    [True, False, False, False, False], # 1
    [True, False, False, False, False], # 1
    [True, False, False, False, False], # 1
    [True, False, False, False, False], # 1
    [True, False, False, False, False], # 1 
    [False, False, True, False, False],  # 1/3
    [False, False, False, False, False], # 0
]

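# reciprocal rank by position of the relevant document:
# rank 1 => 1
# rank 2 => 1 / 2 = 0.5
# rank 3 => 1 / 3 = 0.3333
# rank 4 => 1 / 4 = 0.25
# rank 5 => 1 / 5 = 0.2
# in general: rank => 1 / rank
# not found => 0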

In [19]:
def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

In [20]:
def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank, is_relevant in enumerate(line, start=1):
            if is_relevant:
                total_score += 1 / rank
                break  # reciprocal rank uses only the first relevant result

    return total_score / len(relevance_total)

In [21]:
hit_rate(example)

0.5833333333333334

In [22]:

mrr(example)


0.5277777777777778
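
To see where these two numbers come from, here is a small check of the arithmetic, assuming the same 12-row `example` as above (seven rows contain a `True`: six at rank 1 and one at rank 3). It uses `fractions.Fraction` only to keep the sums exact; that is an illustrative choice, not part of the original notebook.

```python
from fractions import Fraction

# 1-based position of the relevant document per query, None if it was not retrieved.
# Mirrors the `example` list: six hits at rank 1, one at rank 3, five misses.
ranks = [1, None, None, None, None, 1, 1, 1, 1, 1, 3, None]

# Hit rate: share of queries where the relevant document appears at all.
hits = sum(r is not None for r in ranks)
hit_rate_value = Fraction(hits, len(ranks))

# MRR: average of 1 / rank, counting 0 for queries with no hit.
reciprocal_ranks = [Fraction(1, r) if r is not None else Fraction(0) for r in ranks]
mrr_value = sum(reciprocal_ranks) / len(ranks)

print(hit_rate_value, float(hit_rate_value))  # exact value is 7/12
print(mrr_value, float(mrr_value))            # exact value is 19/36
```

So the hit rate is 7/12 and the MRR is (6 + 1/3) / 12 = 19/36, matching the floats printed above.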

### Now let's use our output `relevance_total`

- hit-rate (recall)
- Mean Reciprocal Rank (mrr)

In [23]:
hit_rate(relevance_total), mrr(relevance_total)

(0.7395720769397017, 0.6029788920106625)

### So you can also create your own search tool and evaluate as needed 
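
One way to make that reusable is to wrap the evaluation loop in a function that accepts any search callable. The sketch below is self-contained: `evaluate`, `toy_search`, and the tiny in-memory data are illustrative names and values, not part of the notebook above. A real run would pass a wrapper around `elastic_search` and the full `ground_truth` list.

```python
def evaluate(ground_truth, search_function):
    """Return hit rate and MRR for any search function.

    search_function takes one ground-truth row and returns a ranked
    list of documents, each with an 'id' key.
    """
    relevance_total = []
    for q in ground_truth:
        results = search_function(q)
        relevance_total.append([d["id"] == q["document"] for d in results])

    hits = sum(any(line) for line in relevance_total)
    reciprocal = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line, start=1):
            if is_relevant:
                reciprocal += 1 / rank
                break  # only the first relevant result counts

    n = len(relevance_total)
    return {"hit_rate": hits / n, "mrr": reciprocal / n}


# Toy data (made up) so the sketch runs without Elasticsearch.
docs = [
    {"id": "a", "text": "join the course late"},
    {"id": "b", "text": "install docker"},
    {"id": "c", "text": "course certificate"},
]
toy_ground_truth = [
    {"question": "course certificate", "document": "c"},
    {"question": "install docker", "document": "b"},
    {"question": "refund policy", "document": "z"},  # not in docs: always a miss
]


def toy_search(q):
    # Rank documents by how many query words occur (as substrings) in the text.
    words = q["question"].lower().split()
    scored = sorted(docs, key=lambda d: -sum(w in d["text"] for w in words))
    return scored[:2]


scores = evaluate(toy_ground_truth, toy_search)
print(scores)
```

With the notebook's objects, the call would look like `evaluate(ground_truth, lambda q: elastic_search(q['question'], q['course']))`, which lets you compare different search setups on the same ground truth.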

# done! 