# Homework Submission

```ascii
    ┌─────────────┐           ┌─────────────┐
    │             │           │             │
    │  HOME       │ ────────► │  WORK       │
    │             │           │             │
    └─────────────┘           └─────────────┘
```

## Question 1. dlt Version

---


What's the version of dlt that you installed?
- `Version: 1.12.3`

In [1]:
!pip show dlt

Name: dlt
Version: 1.12.3
Summary: dlt is an open-source python-first scalable data loading library that does not require any backend to run.
Home-page: https://github.com/dlt-hub
Author: 
Author-email: "dltHub Inc." <services@dlthub.com>
License-Expression: Apache-2.0
Location: /home/radianv/anaconda3/envs/llm-zoomcamp/lib/python3.10/site-packages
Requires: click, fsspec, gitpython, giturlparse, hexbytes, humanize, jsonpath-ng, orjson, packaging, pathvalidate, pendulum, pluggy, pytz, pyyaml, requests, requirements-parser, rich-argparse, semver, setuptools, simplejson, sqlglot, tenacity, tomlkit, typing-extensions, tzdata
Required-by: cognee
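
The same check can be done from Python instead of the shell. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is an assumption made for illustration and is not part of dlt.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the dlt version string if dlt is installed in this environment, else None.
print(installed_version("dlt"))
```

Returning `None` rather than raising keeps the check usable in environments where dlt has not been installed yet.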


## Question 2. dlt pipeline

---
How many rows were inserted into the `zoomcamp_data` collection?

Look for **"Normalized data for the following tables:"** in the trace output.

- Normalized data for the following tables:
    - zoomcamp_data: 948 row(s)
    - _dlt_pipeline_state: 1 row(s)
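
Rather than reading the counts off the trace by eye, they can be pulled out of the printed text. The sketch below parses the `- <table>: <n> row(s)` lines with a regular expression; `parse_row_counts` and `TRACE_SNIPPET` are hypothetical names introduced here, and the snippet is copied from the normalize step of the trace output in this notebook.

```python
import re

# Lines copied from the normalize step of the trace printed in this notebook.
TRACE_SNIPPET = """\
Step normalize COMPLETED in 0.11 seconds.
Normalized data for the following tables:
- zoomcamp_data: 948 row(s)
- _dlt_pipeline_state: 1 row(s)
"""

# Matches lines such as "- zoomcamp_data: 948 row(s)", with optional indentation.
ROW_COUNT_LINE = re.compile(r"^\s*-\s*(?P<table>\S+):\s*(?P<rows>\d+)\s+row\(s\)\s*$")


def parse_row_counts(trace_text: str) -> dict:
    """Map each table name found in the trace text to its row count."""
    counts = {}
    for line in trace_text.splitlines():
        match = ROW_COUNT_LINE.match(line)
        if match:
            counts[match.group("table")] = int(match.group("rows"))
    return counts


print(parse_row_counts(TRACE_SNIPPET))
# {'zoomcamp_data': 948, '_dlt_pipeline_state': 1}
```

On a live pipeline the same function can be applied to `str(pipeline.last_trace)`, which is the text that `print(pipeline.last_trace)` displays.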

### Step 1 - support functions

In [10]:
import dlt
import requests
import pandas as pd
from datetime import datetime
from qdrant_client import QdrantClient, models
from dlt.destinations import qdrant

In [8]:
qd_client = QdrantClient("http://localhost:6333")  # connect to the local Qdrant instance
EMBEDDING_DIMENSIONALITY = 384  # bge-small-en produces 384-dimensional vectors (see "size" in meta.json, Question 3)
model_handle = "BAAI/bge-small-en"

In [4]:
@dlt.resource(write_disposition="replace", name="zoomcamp_data")
def zoomcamp_data():
    docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
    docs_response = requests.get(docs_url)
    documents_raw = docs_response.json()

    for course in documents_raw:
        course_name = course['course']

        for doc in course['documents']:
            doc['course'] = course_name
            yield doc
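
To see what the resource yields without hitting the network, the flattening step can be tried on an in-memory sample. This is a sketch only: `SAMPLE_COURSES` is hypothetical data that assumes the same `course` / `documents` nesting as the code above, and `flatten_courses` is an illustrative helper, not part of dlt. Unlike the resource, it copies each document instead of mutating it in place.

```python
from typing import Any, Dict, Iterable, Iterator

# Hypothetical miniature of documents.json: a list of courses, each with nested documents.
SAMPLE_COURSES = [
    {
        "course": "course-a",
        "documents": [
            {"text": "Answer one", "section": "General", "question": "Question one?"},
            {"text": "Answer two", "section": "General", "question": "Question two?"},
        ],
    },
    {
        "course": "course-b",
        "documents": [
            {"text": "Answer three", "section": "Setup", "question": "Question three?"},
        ],
    },
]


def flatten_courses(courses: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield one flat record per document, tagged with its parent course name."""
    for course in courses:
        for doc in course["documents"]:
            # Copy the document so the input structure is left untouched.
            yield {**doc, "course": course["course"]}


rows = list(flatten_courses(SAMPLE_COURSES))
print(len(rows))          # 3
print(rows[0]["course"])  # course-a
```

Each yielded dictionary becomes one row in the destination, which is why the row count reported by the normalize step equals the total number of documents across all courses.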

In [None]:
qdrant_destination = qdrant(
    qd_path="db.qdrant",  # store the data in a local on-disk Qdrant database
)

In [15]:
pipeline = dlt.pipeline(
    pipeline_name="zoomcamp_pipeline",
    destination=qdrant_destination,
    dataset_name="zoomcamp_tagged_data",
)
load_info = pipeline.run(zoomcamp_data())
print(pipeline.last_trace)

Run started at 2025-07-07 04:00:38.631966+00:00 and COMPLETED in 8.47 seconds with 4 steps.
Step extract COMPLETED in 0.76 seconds.

Load package 1751860839.8473814 is EXTRACTED and NOT YET LOADED to the destination and contains no failed jobs

Step normalize COMPLETED in 0.11 seconds.
Normalized data for the following tables:
- zoomcamp_data: 948 row(s)
- _dlt_pipeline_state: 1 row(s)

Load package 1751860839.8473814 is NORMALIZED and NOT YET LOADED to the destination and contains no failed jobs

Step load COMPLETED in 6.40 seconds.
Pipeline zoomcamp_pipeline load step completed in 6.38 seconds
1 load package(s) were loaded to destination qdrant and into dataset zoomcamp_tagged_data
The qdrant destination used /home/radianv/projects/data_talk_club/local-llm-zoomcamp/workshops/dlt/code/local/db.qdrant location to store data
Load package 1751860839.8473814 is LOADED and contains no failed jobs

Step run COMPLETED in 8.47 seconds.
Pipeline zoomcamp_pipeline load step completed in 6.38 seconds

## Question 3. Embeddings

---

When inserting the data, an embedding model was used. Which one?

**Note**: You can find this out by inspecting the `meta.json` file created in the target folder.

- `fast-bge-small-en`
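
The model name is the key under `vectors` for the collection. As a sketch of how to extract it without scanning the whole file, the helper below maps each collection to its vector names and sizes. `vector_models` is a hypothetical helper, and `META_SAMPLE` is a trimmed copy of the `meta.json` structure printed in this notebook, with only the fields the helper reads.

```python
from typing import Any, Dict

# Trimmed copy of the meta.json structure printed in this notebook.
META_SAMPLE = {
    "collections": {
        "zoomcamp_tagged_data": {
            "vectors": {
                "fast-bge-small-en": {"size": 384, "distance": "Cosine"},
            },
        },
    },
}


def vector_models(meta: Dict[str, Any]) -> Dict[str, Dict[str, int]]:
    """Return {collection name: {vector name: vector size}} from a parsed meta.json."""
    result = {}
    for collection, config in meta.get("collections", {}).items():
        vectors = config.get("vectors") or {}
        result[collection] = {name: cfg["size"] for name, cfg in vectors.items()}
    return result


print(vector_models(META_SAMPLE))
# {'zoomcamp_tagged_data': {'fast-bge-small-en': 384}}
```

Passing the dictionary returned by `json.load` on the real `meta.json` in place of `META_SAMPLE` should list every collection in the file the same way.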

In [34]:
import json

# Read and parse the JSON file
with open("code/local/db.qdrant/meta.json", "r") as f:
    data = json.load(f)

# Pretty-print the content
print(json.dumps(data, indent=4))

{
    "collections": {
        "zoomcamp_tagged_data": {
            "vectors": {
                "fast-bge-small-en": {
                    "size": 384,
                    "distance": "Cosine",
                    "hnsw_config": null,
                    "quantization_config": null,
                    "on_disk": null,
                    "datatype": null,
                    "multivector_config": null
                }
            },
            "shard_number": null,
            "sharding_method": null,
            "replication_factor": null,
            "write_consistency_factor": null,
            "on_disk_payload": null,
            "hnsw_config": null,
            "wal_config": null,
            "optimizers_config": null,
            "init_from": null,
            "quantization_config": null,
            "sparse_vectors": null,
            "strict_mode_config": null
        },
        "zoomcamp_tagged_data__dlt_pipeline_state": {
            "vectors": {
                "fast-bg