## 1. Introduction
This notebook gives step-by-step instructions for preparing and running an embedder test. The steps are:
- analyzing the MDX files that the embedder uses
- generating questions for the MDX files by connecting to ChatGPT and saving them to a file
- passing these questions to the embedder to see whether it finds the relevant MDX files
- evaluating the results


In [None]:
#!pip install -r requirements.txt

First, we take a look at the MDX files that we will work with.
We will:
1) count how many files there are in total, and how many are Czech and how many English,
2) check whether each file contains a title,
3) check for duplicate filenames or titles,
4) see how big the largest file is.
These MDX files contain the documentation that the embedder learns from.

In [None]:
!python analyze_mdx_files.py
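
The checks listed above can be sketched in a few lines of Python. This is a minimal illustration, not the contents of `analyze_mdx_files.py`: the function names `extract_title` and `analyze_mdx` are hypothetical, and the sketch assumes that each title is declared as `title: ...` inside a leading `---` front-matter block. It does not split the count by language, because the directory layout that separates Czech from English files is not shown here.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: titles live in a leading "---" front-matter block as "title: ...".
TITLE_RE = re.compile(r"^title:[ \t]*(.+)$", re.MULTILINE)


def extract_title(text):
    """Return the front-matter title, or None if the file has none."""
    if not text.startswith("---"):
        return None
    parts = text.split("---", 2)  # ["", front matter, body]
    if len(parts) < 3:
        return None
    match = TITLE_RE.search(parts[1])
    return match.group(1).strip() if match else None


def analyze_mdx(root):
    """Collect simple statistics about the .mdx files below root."""
    files = sorted(Path(root).rglob("*.mdx"))
    titles = {p: extract_title(p.read_text(encoding="utf-8")) for p in files}
    name_counts = Counter(p.name for p in files)
    title_counts = Counter(t for t in titles.values() if t is not None)
    largest = max(files, key=lambda p: p.stat().st_size, default=None)
    return {
        "total": len(files),
        "missing_title": [p.name for p, t in titles.items() if t is None],
        "duplicate_names": sorted(n for n, c in name_counts.items() if c > 1),
        "duplicate_titles": sorted(t for t, c in title_counts.items() if c > 1),
        "largest": (largest.name, largest.stat().st_size) if largest else None,
    }
```

Calling `analyze_mdx("docs")` on a hypothetical `docs` directory returns a dictionary with the total file count, the files without a title, the duplicated names and titles, and the name and size in bytes of the largest file.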


## 2. Generate Questions
Let's contact the chatbot with a custom question to check that the API works.

In [None]:
!python contact_chatbot.py --question "What are the key ideas in this document?" --model "gpt-4o"


Second, we use the MDX files to generate questions that can be answered from them.

In [None]:
!python generate_questions.py --language czech --num_questions 1 --max_files 1 --model gpt-4o

## 3. Evaluate Embedder
Let's contact the embedder to check that the API works.

In [10]:
!python contact_embedder.py --question "What is Omero?" --top_k 3 --api_url "https://embedbase-ol.dyn.cloud.e-infra.cz/v1/ceritsc-documentation/search"


--- API Response ---
ID: 4fc2b10eb4d747e3aa4e93d244521c75
Created: 1741086887
Dataset ID: ceritsc-documentation
Query: What is Omero?

--- Similarities ---

Result 1:
  - Score: 0.6329
  - ID: 901ebdd3-fe37-4c69-9dff-bf9ba6b75f71
  - Data: --- title: Spuštění aplikace Omero --- Máme úkol spustit aplikaci [Omero](https://www.openmicroscopy...

Result 2:
  - Score: 0.6246
  - ID: 65d717da-6103-4c2c-a53f-1839d976ccf7
  - Data: --- title: Use case Omero --- We have got a task to run [Omero](https://www.openmicroscopy.org/omero...

Result 3:
  - Score: 0.4498
  - ID: 6d739198-78fd-44b8-924e-ebab85ed8db8
  - Data: --- title: MinIO Operator --- [MinIO Operator](https://min.io/docs/minio/kubernetes/upstream/index.h...
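
A report like the one above could be produced from the decoded JSON response with a small formatting helper. The sketch below is an assumption about how that might look, not the code of `contact_embedder.py`: the function name `format_response` is hypothetical, and the field names (`id`, `created`, `dataset_id`, `query`, `similarities`, `score`, `data`) are inferred from the labels in the printed output rather than taken from the API documentation.

```python
def format_response(payload, max_chars=100):
    """Render a search response as text, truncating each document preview."""
    lines = [
        "--- API Response ---",
        f"ID: {payload.get('id')}",
        f"Created: {payload.get('created')}",
        f"Dataset ID: {payload.get('dataset_id')}",
        f"Query: {payload.get('query')}",
        "",
        "--- Similarities ---",
    ]
    for rank, hit in enumerate(payload.get("similarities", []), start=1):
        # Collapse newlines and repeated spaces so each preview fits on one line.
        data = " ".join(str(hit.get("data", "")).split())
        if len(data) > max_chars:
            data = data[:max_chars] + "..."
        lines += [
            "",
            f"Result {rank}:",
            f"  - Score: {hit.get('score', 0.0):.4f}",
            f"  - ID: {hit.get('id')}",
            f"  - Data: {data}",
        ]
    return "\n".join(lines)
```

Keeping the formatting separate from the HTTP call makes it possible to check the report layout against a hand-written payload without contacting the embedder.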




Third, we pass the previously generated questions to the embedder.

In [11]:
!python embedder_testing.py --embedder_url "https://embedbase-ol.dyn.cloud.e-infra.cz/v1/ceritsc-documentation/search" --num_questions 5 --top_k 3 --num_docs 1 --language czech --output_file "evaluation_results.json"


Results saved to evaluation_results.json




## 4. Embedder Evaluation


In [12]:
!python evaluate_embedder.py --results_file "evaluation_results.json" --output_file "evaluation_2.csv"


Accuracy: 0.00% (0/1)
Mean Position Score: 0.0000
Evaluation results saved to evaluation_2.csv
                                            question  ... correct_found
0  Jaké informace jsou obsaženy v sekci "News" s ...  ...         False

[1 rows x 5 columns]
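
The two summary numbers can be illustrated with a short sketch. The exact definitions used by `evaluate_embedder.py` are not shown in this notebook, so the following is an assumption: accuracy is taken as the share of questions whose source document appears among the returned results, and the position score as the reciprocal of the rank at which it appears (0 when it is missing). The names `position_score` and `summarize` are hypothetical.

```python
def position_score(expected, retrieved):
    """Return 1/rank of the expected document in retrieved, or 0.0 if absent."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc == expected:
            return 1.0 / rank
    return 0.0


def summarize(results):
    """Summarize a list of (expected_doc, retrieved_docs) pairs."""
    scores = [position_score(expected, retrieved) for expected, retrieved in results]
    found = sum(1 for s in scores if s > 0)
    total = len(scores)
    return {
        "accuracy": found / total if total else 0.0,
        "mean_position_score": sum(scores) / total if total else 0.0,
        "found": found,
        "total": total,
    }
```

Under these assumed definitions, a run with a single question whose source document is not among the returned results gives an accuracy of 0 and a mean position score of 0, which is the situation reported in the output above.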
