# Homework

This notebook is a starting point for turning the FAQ data into a searchable, AI-ready knowledge base using Qdrant and dlt.

## Question 1: dlt Version

In this homework, we will load the data from our FAQ into Qdrant.

Let's install dlt with Qdrant support, along with the Qdrant client:
```bash
pip install -q "dlt[qdrant]" "qdrant-client[fastembed]"
```
We load the data from our FAQ to Qdrant so we can perform efficient semantic search and retrieval over the FAQ documents using vector embeddings.

Qdrant is a vector database. It stores not just the text, but also the embeddings (numerical representations) of each document. 
This allows us to:

* Find similar questions or answers even if the wording is different (semantic search).
* Build retrieval-augmented generation (RAG) systems, chatbots, or FAQ assistants that can answer user queries by finding the most relevant FAQ entries.
* Scale to large collections of documents and perform fast, accurate searches.
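
To make the idea of semantic search concrete, here is a minimal sketch of ranking stored questions by cosine similarity to a query. The three-dimensional vectors and the question texts are made up for illustration; real embeddings come from an embedding model and have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (not produced by any real model)
faq_vectors = {
    "How do I enroll in the course?": [0.9, 0.1, 0.0],
    "Which Python version do I need?": [0.1, 0.9, 0.2],
}

# Stand-in vector for a differently worded query such as "Can I still sign up?"
query_vector = [0.8, 0.2, 0.1]

# Pick the stored question whose vector points in the most similar direction
best_match = max(
    faq_vectors,
    key=lambda question: cosine_similarity(query_vector, faq_vectors[question]),
)
print(best_match)
```

A vector database performs this kind of nearest-neighbour lookup for us, using indexes so that it stays fast on large collections.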

What's the version of dlt that you installed?

In [6]:
import dlt
print(dlt.__version__)

1.15.0


### dlt Resource
For reading the FAQ data, we have this helper function, which we annotate with `@dlt.resource`:

In [7]:
# Import requests to fetch the FAQ documents
import requests

# Annotate the helper function with @dlt.resource so dlt recognizes it as a data source
@dlt.resource
def zoomcamp_data():
    # Download the FAQ documents JSON from the provided URL
    docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
    docs_response = requests.get(docs_url)
    documents_raw = docs_response.json()

    # For each course, add the course name to each document and yield it
    for course in documents_raw:
        course_name = course['course']
        for doc in course['documents']:
            doc['course'] = course_name
            yield doc  # Yield each document as a record for dlt (one at a time)
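
The flattening step can be checked without any network access. The sketch below applies the same nested loop to a small hand-written sample; the helper name `flatten_courses`, the course names, and the field values are hypothetical, and the sketch copies each document instead of mutating it in place as the resource above does.

```python
def flatten_courses(documents_raw):
    """Yield one flat record per document, tagged with its course name."""
    for course in documents_raw:
        course_name = course["course"]
        for doc in course["documents"]:
            # Copy the document and attach the course it belongs to
            yield {**doc, "course": course_name}

# Hypothetical sample mimicking the nested layout the resource expects
sample = [
    {
        "course": "course-a",
        "documents": [
            {"question": "Q1", "text": "A1"},
            {"question": "Q2", "text": "A2"},
        ],
    },
    {
        "course": "course-b",
        "documents": [{"question": "Q3", "text": "A3"}],
    },
]

records = list(flatten_courses(sample))
print(len(records))
print(records[0])
```

Each yielded dictionary becomes one row in the destination, so the number of records equals the total number of documents across all courses.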

## Question 2: dlt pipeline

Now let's create a pipeline.

We need to define a destination for that. Let's use the qdrant one:
```python
from dlt.destinations import qdrant

qdrant_destination = qdrant(
  qd_path="db.qdrant", 
)
```
In this case, we tell dlt (and Qdrant) to create a folder for our data, named `db.qdrant`.

The destination is required so dlt knows where and how to load the data after processing it. Using Qdrant as the destination ensures that:

* The FAQ documents and their embeddings are stored in Qdrant.
* You can later perform vector search and retrieval on your data.

How many rows were inserted into the `zoomcamp_data` collection?

Look for "Normalized data for the following tables:" in the trace output.

In [8]:
# Import the Qdrant destination from dlt
from dlt.destinations import qdrant

# Create a Qdrant destination instance, specifying the folder for the database
qdrant_destination = qdrant(
    qd_path="db.qdrant", # The folder where Qdrant will store its data files
)

# Create and run the dlt pipeline:
# dlt will automatically generate embeddings for the documents and store them in Qdrant
pipeline = dlt.pipeline(
    pipeline_name='zoomcamp_pipeline',
    destination=qdrant_destination,
    dataset_name='zoomcamp_tagged_data'
)

# This is where the data is loaded into Qdrant and embeddings are generated automatically
load_info = pipeline.run(zoomcamp_data())

# Print the pipeline trace to see details about the run
print(pipeline.last_trace)

Run started at 2025-08-15 14:24:00.788837+00:00 and COMPLETED in 14.95 seconds with 4 steps.
Step extract COMPLETED in 0.30 seconds.

Load package 1755267851.2368312 is EXTRACTED and NOT YET LOADED to the destination and contains no failed jobs

Step normalize COMPLETED in 0.12 seconds.
Normalized data for the following tables:
- zoomcamp_data: 948 row(s)

Load package 1755267851.2368312 is NORMALIZED and NOT YET LOADED to the destination and contains no failed jobs

Step load COMPLETED in 4.09 seconds.
Pipeline zoomcamp_pipeline load step completed in 4.06 seconds
1 load package(s) were loaded to destination qdrant and into dataset zoomcamp_tagged_data
The qdrant destination used /workspaces/llm-homework/dlt/db.qdrant location to store data
Load package 1755267851.2368312 is LOADED and contains no failed jobs

Step run COMPLETED in 14.95 seconds.
Pipeline zoomcamp_pipeline load step completed in 4.06 seconds
1 load package(s) were loaded to destination qdrant and into dataset zoomcamp
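
Instead of reading the row count off the trace by eye, it can be pulled out with a regular expression. This is a minimal sketch that assumes the trace text keeps the `- <table>: <n> row(s)` layout shown above; `trace_text` here is a pasted excerpt, and in the notebook it could be replaced with `str(pipeline.last_trace)`.

```python
import re

# Excerpt of the trace printed above
trace_text = """Step normalize COMPLETED in 0.12 seconds.
Normalized data for the following tables:
- zoomcamp_data: 948 row(s)
"""

# Match lines of the form "- table_name: 123 row(s)"
row_counts = {
    table: int(count)
    for table, count in re.findall(
        r"^- (\w+): (\d+) row\(s\)$", trace_text, flags=re.MULTILINE
    )
}
print(row_counts)  # {'zoomcamp_data': 948}
```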

## Question 3: Embeddings

When inserting the data, an embedding model was used. Which one?

You can find this out by inspecting the `meta.json` file created in the target folder. During data insertion, a folder named `db.qdrant` is created, and `meta.json` is located inside it.
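
If only the model name is of interest, the lookup can be narrowed to the vector names of each collection. This is a minimal sketch, assuming `meta.json` has a top-level `collections` mapping whose entries each carry a `vectors` mapping keyed by model name. The helper `vector_names` is hypothetical, and the `meta` dict is a trimmed, hand-written stand-in for the real file.

```python
def vector_names(meta):
    """Map each collection name to the sorted names of its vectors."""
    return {
        collection: sorted(config["vectors"])
        for collection, config in meta["collections"].items()
    }

# Trimmed stand-in for the parsed contents of db.qdrant/meta.json
meta = {
    "collections": {
        "zoomcamp_tagged_data": {
            "vectors": {
                "fast-bge-small-en": {"size": 384, "distance": "Cosine"}
            }
        }
    }
}

print(vector_names(meta))
```

With the real file, `meta` would come from `json.load` on `db.qdrant/meta.json`.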

In [9]:
# Inspect the embedding model used by reading db.qdrant/meta.json
import json
with open('db.qdrant/meta.json') as f:
    meta = json.load(f)
print(json.dumps(meta, indent=2))

{
  "collections": {
    "zoomcamp_tagged_data": {
      "vectors": {
        "fast-bge-small-en": {
          "size": 384,
          "distance": "Cosine",
          "hnsw_config": null,
          "quantization_config": null,
          "on_disk": null,
          "datatype": null,
          "multivector_config": null
        }
      },
      "shard_number": null,
      "sharding_method": null,
      "replication_factor": null,
      "write_consistency_factor": null,
      "on_disk_payload": null,
      "hnsw_config": null,
      "wal_config": null,
      "optimizers_config": null,
      "init_from": null,
      "quantization_config": null,
      "sparse_vectors": null,
      "strict_mode_config": null
    },
    "zoomcamp_tagged_data__dlt_loads": {
      "vectors": {
        "fast-bge-small-en": {
          "size": 384,
          "distance": "Cosine",
          "hnsw_config": null,
          "quantization_config": null,
          "on_disk": null,
          "datatype": null,
          "m