# Question 1. dlt Version
In this homework, we will load the data from our FAQ into Qdrant.

Let's install dlt with Qdrant support and Qdrant client:

```bash
pip install -q "dlt[qdrant]" "qdrant-client[fastembed]"
```

In [1]:
!pip install -q "dlt[qdrant]" "qdrant-client[fastembed]"

In [None]:
# check the installed dlt version
import dlt

dlt.__version__

'1.12.3'

## dlt Resource
For reading the FAQ data, we have this helper function:

In [5]:
import dlt
import requests

@dlt.resource(write_disposition="replace", name='dlt-homework')
def zoomcamp_data():
    docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
    docs_response = requests.get(docs_url)
    documents_raw = docs_response.json()

    for course in documents_raw:
        course_name = course['course']

        for doc in course['documents']:
            doc['course'] = course_name
            yield doc

The function is annotated with `@dlt.resource`. We will use it when creating a dlt pipeline.
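To make the flattening step easier to follow, here is a minimal sketch of the same loop without dlt or a network call. The `flatten_courses` name and the sample records are hypothetical, and the field names (`text`, `section`, `question`) are an assumption about the shape of `documents.json`. Unlike the helper above, this version copies each document instead of mutating it in place.

```python
def flatten_courses(documents_raw):
    """Yield one flat record per FAQ entry, tagged with its course name."""
    for course in documents_raw:
        course_name = course["course"]
        for doc in course["documents"]:
            # Build a new dict so the input structure is left untouched.
            yield {**doc, "course": course_name}


# Hypothetical sample that mimics the assumed shape of documents.json.
sample = [
    {
        "course": "course-a",
        "documents": [
            {"text": "Answer 1", "section": "General", "question": "Q1"},
            {"text": "Answer 2", "section": "General", "question": "Q2"},
        ],
    },
    {
        "course": "course-b",
        "documents": [
            {"text": "Answer 3", "section": "Setup", "question": "Q3"},
        ],
    },
]

flat = list(flatten_courses(sample))
print(len(flat))          # 3
print(flat[0]["course"])  # course-a
```

Each yielded record becomes one row in the destination table, which is why the row count later equals the total number of FAQ entries across all courses.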

# Question 2. dlt pipeline
Now let's create a pipeline.

We need to define a destination for it. Let's use the Qdrant one:

In [6]:
from dlt.destinations import qdrant

qdrant_destination = qdrant(
  qd_path="db.qdrant", 
)

In this case, we tell dlt (and Qdrant) to store our data in a local folder named `db.qdrant`.

Let's run it:

In [7]:
pipeline = dlt.pipeline(
    pipeline_name="zoomcamp_pipeline",
    destination=qdrant_destination,
    dataset_name="zoomcamp_tagged_data",
)
load_info = pipeline.run(zoomcamp_data())
print(pipeline.last_trace)

Run started at 2025-07-05 20:20:26.338880+00:00 and COMPLETED in 9.87 seconds with 4 steps.
Step extract COMPLETED in 0.95 seconds.

Load package 1751746828.720088 is EXTRACTED and NOT YET LOADED to the destination and contains no failed jobs

Step normalize COMPLETED in 0.19 seconds.
Normalized data for the following tables:
- dlt_homework: 948 row(s)
- _dlt_pipeline_state: 1 row(s)

Load package 1751746828.720088 is NORMALIZED and NOT YET LOADED to the destination and contains no failed jobs

Step load COMPLETED in 6.35 seconds.
Pipeline zoomcamp_pipeline load step completed in 6.30 seconds
1 load package(s) were loaded to destination qdrant and into dataset zoomcamp_tagged_data
The qdrant destination used c:\Users\chetv\Documents\llm-zoomcamp\workshops\dlt\db.qdrant location to store data
Load package 1751746828.720088 is LOADED and contains no failed jobs

Step run COMPLETED in 9.87 seconds.

How many rows were inserted into the zoomcamp_data collection?

Look for "Normalized data for the following tables:" in the trace output.

**Answer**: 948 rows, as reported for the `dlt_homework` table in the normalize step:

```
Step normalize COMPLETED in 0.19 seconds.
Normalized data for the following tables:
- dlt_homework: 948 row(s)
- _dlt_pipeline_state: 1 row(s)
```
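Instead of reading the trace by eye, the row counts can be extracted from its text. The following is a minimal sketch that assumes the `- table: N row(s)` line format shown in the excerpt above; `parse_row_counts` and `ROW_COUNT_RE` are hypothetical names, not part of dlt.

```python
import re

# Matches lines such as "- dlt_homework: 948 row(s)".
ROW_COUNT_RE = re.compile(r"^- (?P<table>\S+): (?P<rows>\d+) row\(s\)$")


def parse_row_counts(trace_text):
    """Return a {table_name: row_count} dict parsed from trace text."""
    counts = {}
    for line in trace_text.splitlines():
        match = ROW_COUNT_RE.match(line.strip())
        if match:
            counts[match.group("table")] = int(match.group("rows"))
    return counts


# Excerpt copied from the trace output above.
trace_excerpt = """\
Step normalize COMPLETED in 0.19 seconds.
Normalized data for the following tables:
- dlt_homework: 948 row(s)
- _dlt_pipeline_state: 1 row(s)
"""

row_counts = parse_row_counts(trace_excerpt)
print(row_counts["dlt_homework"])  # 948
```

In the notebook, the same function could be applied to `str(pipeline.last_trace)`, assuming the printed format stays the same across dlt versions.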

In [None]:
import json
import gzip


# The load ID (1751746828.720088) identifies the load package inside the dlt "pipelines" folder.
# The completed job is a gzip-compressed JSONL file (despite its .jsonl name),
# so we can open it with gzip and count the rows.
PATH_TO_LOADED = "C:\\Users\\chetv\\.dlt\\pipelines\\zoomcamp_pipeline\\load\\loaded\\1751746828.720088\\completed_jobs\\dlt_homework.e36120b43b.0.jsonl"

with gzip.open(PATH_TO_LOADED, 'rt', encoding='utf-8') as f:
    data = [json.loads(line) for line in f]

print(f"Rows in loaded data: {len(data)}")


Rows in loaded data: 948
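Because the job file carries a `.jsonl` name but is gzip-compressed, the extension alone does not say how to open it. As a sketch, the helper below checks the two-byte gzip magic number before choosing an opener; `count_jsonl_rows` is a hypothetical name, and the demo file is created in a temporary directory rather than taken from a real pipeline.

```python
import gzip
import json
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def count_jsonl_rows(path):
    """Count non-empty lines in a JSONL file, gzip-compressed or plain."""
    with open(path, "rb") as raw:
        is_gzip = raw.read(2) == GZIP_MAGIC
    opener = gzip.open if is_gzip else open
    with opener(path, "rt", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


# Demo: write a small gzip-compressed file that still uses a .jsonl name.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with gzip.open(demo_path, "wt", encoding="utf-8") as f:
        for i in range(3):
            f.write(json.dumps({"id": i}) + "\n")
    demo_rows = count_jsonl_rows(demo_path)

print(demo_rows)  # 3
```

Counting lines this way avoids holding all parsed rows in memory, which matters little for 948 rows but helps with larger load packages.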



# Question 3. Embeddings
When inserting the data, an embedding model was used. Which one?

You can find this out by inspecting the `meta.json` file created in the target folder.
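As a sketch of that inspection, the helper below lists the vector names per collection from a parsed `meta.json`. It assumes the file has a top-level `collections` mapping whose entries hold named vectors under `vectors`; the `vector_names` function, the collection name, and the sample values are hypothetical and only illustrate the assumed layout.

```python
import json


def vector_names(meta):
    """Map each collection name to the sorted names of its vectors."""
    names = {}
    for collection, config in meta.get("collections", {}).items():
        # Assumption: "vectors" maps each vector name to its parameters.
        names[collection] = sorted(config.get("vectors", {}))
    return names


# Hypothetical excerpt that mimics the assumed layout of meta.json.
sample_meta = json.loads("""
{
  "collections": {
    "zoomcamp_tagged_data_dlt_homework": {
      "vectors": {
        "fast-bge-small-en": {"size": 384, "distance": "Cosine"}
      }
    }
  },
  "aliases": {}
}
""")

print(vector_names(sample_meta))
```

With the real file, the dictionary would come from `json.load` on `meta.json` inside the `db.qdrant` folder.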

**Answer**:

`"fast-bge-small-en"`

In [None]:
# the underlying model name can also be read from the destination spec
pipeline.destination.spec.model

'BAAI/bge-small-en'