In [None]:
%pip install -q "dlt[qdrant]" "qdrant-client[fastembed]"

Note: you may need to restart the kernel to use updated packages.


### Question 1. dlt Version
##### What's the version of dlt that you installed?
###### Answer

In [5]:
!pip show dlt
!pip list | grep -E "dlt|qdrant"


Name: dlt
Version: 1.12.3
Summary: dlt is an open-source python-first scalable data loading library that does not require any backend to run.
Home-page: https://github.com/dlt-hub
Author: 
Author-email: "dltHub Inc." <services@dlthub.com>
License-Expression: Apache-2.0
Location: /Users/selamsew/opt/anaconda3/lib/python3.9/site-packages
Requires: click, fsspec, gitpython, giturlparse, hexbytes, humanize, jsonpath-ng, orjson, packaging, pathvalidate, pendulum, pluggy, pytz, pyyaml, requests, requirements-parser, rich-argparse, semver, setuptools, simplejson, sqlglot, tenacity, tomlkit, typing-extensions, tzdata
Required-by: 
dlt                                      1.12.3
qdrant-client                            1.14.2


### dlt use

#### Use dlt to wrap this flow

dlt helps you:

- Make this flow reproducible

- Track state and logs

- Potentially scale or schedule it

In dlt, your job is to define:

- A source → gives you the data

- A resource → prepares and formats it

- A pipeline → runs it and optionally loads it



### dlt Resource

A `resource` is an (optionally async) function that yields data. To create a resource, we add the `@dlt.resource` decorator to that function.

In [8]:
import requests
import dlt

@dlt.resource
def zoomcamp_data():
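    # Download the documents JSON, which is grouped by course
    docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
    docs_response = requests.get(docs_url)
    docs_response.raise_for_status()  # fail early on a bad HTTP response
    documents_raw = docs_response.json()

    for course in documents_raw:
        course_name = course['course']

        # Tag each document with its course name and yield it as one row
        for doc in course['documents']:
            doc['course'] = course_name
            yield doc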
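
To see what the resource yields without touching the network, the flattening step can be sketched as a plain generator over a small hand-made sample. The sample records and the `flatten_courses` name are illustrative assumptions, not part of the workshop data. Unlike the resource above, this sketch copies each document instead of modifying it in place.

```python
def flatten_courses(documents_raw):
    """Yield one dict per document, tagged with its parent course name."""
    for course in documents_raw:
        for doc in course["documents"]:
            # Copy the document so the input structure is left untouched.
            yield {**doc, "course": course["course"]}


# Hypothetical sample mirroring the nested layout the resource iterates over.
sample = [
    {
        "course": "course-a",
        "documents": [
            {"question": "Q1", "text": "A1"},
            {"question": "Q2", "text": "A2"},
        ],
    },
    {
        "course": "course-b",
        "documents": [{"question": "Q3", "text": "A3"}],
    },
]

rows = list(flatten_courses(sample))
print(len(rows))
print(rows[0])
```

Each yielded dict becomes one row in the destination, which is why the loaded row count equals the total number of documents across all courses.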

### Question 2. dlt pipeline

Now let's create a pipeline.

We need to define a destination for that. Let's use the qdrant one:



In [9]:

# Store the data in a local folder named db.qdrant
from dlt.destinations import qdrant

qdrant_destination = qdrant(
    qd_path="db.qdrant",
)

### pipeline

A `pipeline` moves data from your Python code to a destination. The pipeline accepts `dlt` sources or resources, as well as generators, async generators, lists, and any iterables. Once the pipeline runs, all resources are evaluated and the data is loaded at the destination.

##### You instantiate a pipeline by calling the `dlt.pipeline` function with the following arguments:

`pipeline_name`: a name of the pipeline that will be used to identify it in trace and monitoring events and to restore its state and data schemas on subsequent runs. If not provided, dlt will create a pipeline name from the file name of the currently executing Python module.

`destination`: a name of the `destination` to which dlt will load the data. It may also be provided to the `run` method of the pipeline.

`dataset_name`: a name of the dataset to which the data will be loaded. A dataset is a logical group of tables, i.e., a schema in relational databases or a folder grouping many files. It may also be provided later to the `run` or `load` methods of the pipeline. If not provided, it defaults to `{pipeline_name}_dataset` on destinations that require datasets (most warehouses). It stays empty on destinations that do not separate tables into datasets (or database schemas), e.g., vector databases or ClickHouse.

> To load the data, call the `run` method and pass your data in the `data` argument.

In [10]:
pipeline = dlt.pipeline(
    pipeline_name="zoomcamp_pipeline",
    destination=qdrant_destination,
    dataset_name="zoomcamp_tagged_data",
)
# Extract, normalize, and load the resource, then print the run trace
load_info = pipeline.run(zoomcamp_data())
print(pipeline.last_trace)

Run started at 2025-07-07 01:29:41.726168+00:00 and COMPLETED in 16.21 seconds with 4 steps.
Step extract COMPLETED in 0.95 seconds.

Load package 1751851790.558231 is EXTRACTED and NOT YET LOADED to the destination and contains no failed jobs

Step normalize COMPLETED in 0.13 seconds.
Normalized data for the following tables:
- _dlt_pipeline_state: 1 row(s)
- zoomcamp_data: 948 row(s)

Load package 1751851790.558231 is NORMALIZED and NOT YET LOADED to the destination and contains no failed jobs

Step load COMPLETED in 6.32 seconds.
Pipeline zoomcamp_pipeline load step completed in 6.30 seconds
1 load package(s) were loaded to destination qdrant and into dataset zoomcamp_tagged_data
The qdrant destination used /Users/selamsew/Documents/llm-zoomcampl/workshop/db.qdrant location to store data
Load package 1751851790.558231 is LOADED and contains no failed jobs

Step run COMPLETED in 16.21 seconds.
Pipeline zoomcamp_pipeline load step completed in 6.30 seconds
1 load package(s) were loade

### Question 2. dlt pipeline:  Answer -> 948

Normalized data for the following tables:

- _dlt_pipeline_state: 1 row(s)
- zoomcamp_data: 948 row(s)
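
The count can also be pulled out of the printed trace in code rather than read by eye. This is a minimal sketch, assuming the trace text keeps the `- table: N row(s)` format shown above; `parse_row_counts` is a hypothetical helper, not a dlt function.

```python
import re


def parse_row_counts(trace_text):
    """Map table name to row count from lines like '- zoomcamp_data: 948 row(s)'."""
    pattern = re.compile(r"^- (\w+): (\d+) row\(s\)$", re.MULTILINE)
    return {table: int(count) for table, count in pattern.findall(trace_text)}


# The relevant lines copied from the trace output above.
trace_text = """Normalized data for the following tables:
- _dlt_pipeline_state: 1 row(s)
- zoomcamp_data: 948 row(s)
"""

row_counts = parse_row_counts(trace_text)
print(row_counts["zoomcamp_data"])  # 948
```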

### Question 3. Embeddings

#### When inserting the data, an embedding model was used. Which one?

`meta.json` holds the metadata for `db.qdrant`. It shows that the `fast-bge-small-en` embedding model was used.

Here is part of `meta.json`:

```json
"collections": {
    "zoomcamp_tagged_data": {
        "vectors": {
            "fast-bge-small-en": {
                "size": 384,
                "distance": "Cosine",
                "hnsw_config": null,
                "quantization_config": null,
                "on_disk": null,
                "datatype": null,
                "multivector_config": null
            }
        }
```
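
The same lookup can be done in code. This is a minimal sketch, assuming the metadata nests vector names under `collections`, then the collection name, then `vectors`, as in the excerpt above. The inline JSON is a trimmed, hand-closed copy of that excerpt, and `vector_names` is a hypothetical helper.

```python
import json


def vector_names(meta, collection):
    """Return the named vectors configured for one collection in the metadata."""
    return list(meta["collections"][collection]["vectors"])


# Trimmed copy of the excerpt, closed by hand so it parses as valid JSON.
meta_text = """
{
  "collections": {
    "zoomcamp_tagged_data": {
      "vectors": {
        "fast-bge-small-en": {"size": 384, "distance": "Cosine"}
      }
    }
  }
}
"""

meta = json.loads(meta_text)
print(vector_names(meta, "zoomcamp_tagged_data"))  # ['fast-bge-small-en']
```

To run it against the real data, load the metadata file from the `db.qdrant` folder with `json.load` instead of the inline string. The exact file location inside that folder is an assumption here.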


### Answer: `fast-bge-small-en`