In [1]:
import os
os.chdir("/home/ubuntu/llm-zoomcamp/llm-zoomcamp-2024/workshops-dlt")

# Load the data
Loading the data takes three steps:
1. Use the requests library to get the data
2. Define a dlt resource that yields the individual documents
3. Create a dlt pipeline and run it

In [1]:
import requests
import dlt

dataset_url = "https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1"

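response = requests.get(dataset_url) # 1
response.raise_for_status()  # stop here on an HTTP error instead of parsing an error page
raw_qa_dataset = response.json()  # a list of courses, each with a "documents" list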

@dlt.resource
def extract_documents(): # 2
    for course in raw_qa_dataset:
        yield course["documents"]

pipeline = dlt.pipeline(
    pipeline_name="from_json",
    destination="lancedb",
    dataset_name="qanda"
)

load_info = pipeline.run(extract_documents, table_name="documents")

print(load_info)

Pipeline from_json load step completed in 0.27 seconds
1 load package(s) were loaded to destination LanceDB and into dataset qanda
The LanceDB destination used <dlt.destinations.impl.lancedb.configuration.LanceDBCredentials object at 0x76cbb83fdf90> location to store data
Load package 1721490291.9591303 is LOADED and contains no failed jobs


```python
@dlt.resource
def extract_documents(): # 2
    for course in raw_qa_dataset:
        yield course["documents"]
```
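The `yield` keyword turns `extract_documents` into a generator function: each iteration hands dlt one course's list of documents, and the next list is produced only when dlt asks for it. In this notebook `raw_qa_dataset` is already fully in memory after the `requests` call, so the memory saving is limited here; the pattern pays off when the source itself is read incrementally, for example page by page from an API.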
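
To see the generator behaviour in isolation, here is a minimal sketch in plain Python, without dlt. The `sample_courses` data and the `flatten` helper are made up for illustration; `flatten` only mimics how a consumer could turn each yielded list into individual rows and is not how dlt is implemented.

```python
from itertools import chain

# Hypothetical stand-in for raw_qa_dataset: a list of courses,
# each holding a list of documents.
sample_courses = [
    {"course": "course-a", "documents": [{"question": "q1"}, {"question": "q2"}]},
    {"course": "course-b", "documents": [{"question": "q3"}]},
]

def extract_documents(courses):
    """Yield one course's list of documents at a time."""
    for course in courses:
        yield course["documents"]

def flatten(batches):
    """Turn an iterable of lists into a single stream of items."""
    return chain.from_iterable(batches)

# Calling the generator function runs none of its body yet;
# it only creates a generator object.
batches = extract_documents(sample_courses)

# Each next() call runs the loop body up to the next yield.
first_batch = next(batches)
print(len(first_batch))  # 2

# The remaining batches are consumed lazily, one list at a time.
remaining = [doc["question"] for doc in flatten(batches)]
print(remaining)  # ['q3']
```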

In [2]:
import lancedb

db = lancedb.connect(".lancedb/")
print(db.table_names())

['qanda____dlt_loads', 'qanda____dlt_pipeline_state', 'qanda____dlt_version', 'qanda___dltSentinelTable', 'qanda___documents']


In [3]:
db_table = db.open_table("qanda___documents")

db_table.to_pandas()

Unnamed: 0,id__,text,section,question,_dlt_load_id,_dlt_id
0,402dc43d-a267-57ee-b9dd-cdce747a697f,The purpose of this document is to capture fre...,General course-related questions,Course - When will the course start?,1721490291.9591303,YEeOzZSAdrkVzQ
1,de47230f-15fd-5f54-95c7-28b1ae3bda10,GitHub - DataTalksClub data-engineering-zoomca...,General course-related questions,Course - What are the prerequisites for this c...,1721490291.9591303,4GJ1XvwLuq2LWQ
2,e681ca91-0c1b-59bc-a3d9-d4b8e06bf812,"Yes, even if you don't register, you're still ...",General course-related questions,Course - Can I still join the course after the...,1721490291.9591303,6QhOr1rGIkeS6A
3,e1772d2c-25bd-56a2-9d68-eaaf7a9cfa73,You don't need it. You're accepted. You can al...,General course-related questions,Course - I have registered for the Data Engine...,1721490291.9591303,88xSnKdWrpLP7Q
4,33ec6a0a-7f09-57df-895b-4bdfee2d04c6,You can start by installing and setting up all...,General course-related questions,Course - What can I do before the course starts?,1721490291.9591303,d7rPp2owX6ctlw
...,...,...,...,...,...,...
943,cb9abee1-4715-5371-81d3-9ae5cfb007c3,Problem description\nThis is the step in the c...,Module 6: Best practices,Github actions: Permission denied error when e...,1721490291.9591303,+YJYwFqCORevgw
944,300e98b3-2d2b-5c57-a10f-9e04e46b25f7,Problem description\nWhen a docker-compose fil...,Module 6: Best practices,Managing Multiple Docker Containers with docke...,1721490291.9591303,tF7PDIlOCXlmRQ
945,5a2d6cca-1649-5995-8257-03bd44224578,Problem description\nIf you are having problem...,Module 6: Best practices,AWS regions need to match docker-compose,1721490291.9591303,+QfhqUOX3xzPuA
946,5b68ef5f-736d-5cb3-b12e-42c0fb8df858,Problem description\nPre-commit command was fa...,Module 6: Best practices,Isort Pre-commit,1721490291.9591303,M7HhztArkFeAWQ


# Load and embed the data
Load the same data again (into a new table), but this time embed it during loading with the `lancedb_adapter`. This consists of the following steps:
1. Define the embedding model provider and model via environment variables
2. Define a new pipeline to load the same data and embed the "text" and "question" columns with the `lancedb_adapter`

In [4]:
import os
from dlt.destinations.adapters import lancedb_adapter

os.environ["DESTINATION__LANCEDB__EMBEDDING_MODEL_PROVIDER"] = "sentence-transformers"
os.environ["DESTINATION__LANCEDB__EMBEDDING_MODEL"] = "all-MiniLM-L6-v2"

pipeline = dlt.pipeline(
    pipeline_name="from_json_embedded",
    destination="lancedb",
    dataset_name="qanda_embedded")

load_info = pipeline.run(
    lancedb_adapter(extract_documents, embed=['text', 'question']),
    table_name="documents"
)

print(load_info)

Pipeline from_json_embedded load step completed in 20.35 seconds
1 load package(s) were loaded to destination LanceDB and into dataset qanda_embedded
The LanceDB destination used <dlt.destinations.impl.lancedb.configuration.LanceDBCredentials object at 0x76cbb82c4990> location to store data
Load package 1721490292.9536402 is LOADED and contains no failed jobs
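
The two environment variables set in the cell above follow dlt's naming convention for configuration: the name is upper case and a double underscore separates the config sections. The helper below is a hypothetical illustration of that mapping (the function name is made up; dlt does this resolution internally), showing which config key each variable corresponds to.

```python
def env_to_config_path(env_name):
    """Split an environment variable name on '__' into lowercase config sections."""
    return [part.lower() for part in env_name.split("__")]

path = env_to_config_path("DESTINATION__LANCEDB__EMBEDDING_MODEL_PROVIDER")
print(".".join(path))  # destination.lancedb.embedding_model_provider
```

Single underscores stay inside a key name, which is why `EMBEDDING_MODEL_PROVIDER` ends up as one key rather than three sections.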


In [5]:
db = lancedb.connect("./.lancedb")
print(db.table_names())

['qanda____dlt_loads', 'qanda____dlt_pipeline_state', 'qanda____dlt_version', 'qanda___dltSentinelTable', 'qanda___documents', 'qanda_embedded____dlt_loads', 'qanda_embedded____dlt_pipeline_state', 'qanda_embedded____dlt_version', 'qanda_embedded___dltSentinelTable', 'qanda_embedded___documents']
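
Judging from the names printed above, each table name is the dataset name, a `___` separator, and the table name, and dlt stores its own bookkeeping tables next to the data. The helper below is a hypothetical sketch that picks out the data tables of one dataset; the function name, the separator constant, and the filtering rule are assumptions based only on this output, not a dlt or LanceDB API.

```python
SEPARATOR = "___"  # assumption: inferred from the table names printed above

def data_tables(table_names, dataset_name):
    """Return the tables of one dataset, skipping dlt's bookkeeping tables."""
    prefix = dataset_name + SEPARATOR
    result = []
    for name in table_names:
        if not name.startswith(prefix):
            continue
        table = name[len(prefix):]
        # Bookkeeping tables seen in the output: _dlt_* and dltSentinelTable.
        if table.startswith("_dlt_") or table == "dltSentinelTable":
            continue
        result.append(table)
    return result

names = [
    "qanda____dlt_loads",
    "qanda___dltSentinelTable",
    "qanda___documents",
    "qanda_embedded____dlt_version",
    "qanda_embedded___documents",
]
print(data_tables(names, "qanda"))           # ['documents']
print(data_tables(names, "qanda_embedded"))  # ['documents']
```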


In [6]:
db_table = db.open_table("qanda_embedded___documents")

db_table.to_pandas()

Unnamed: 0,id__,vector__,text,section,question,_dlt_load_id,_dlt_id
0,054ab036-ef30-5d78-be07-5d42a0825408,"[-0.00035095983, -0.062014256, -0.03799991, 0....",The purpose of this document is to capture fre...,General course-related questions,Course - When will the course start?,1721490292.9536402,DUsvwpVXb8mEtg
1,a04227ec-f689-5c75-9294-8db4a057253c,"[0.020011423, -0.011535534, 0.0130172055, -0.0...",GitHub - DataTalksClub data-engineering-zoomca...,General course-related questions,Course - What are the prerequisites for this c...,1721490292.9536402,bcFJnB6B8nM8nQ
2,561f6815-9f4b-51ea-96f0-e0558aae041c,"[0.014857549, -0.06664994, -0.013571203, 0.023...","Yes, even if you don't register, you're still ...",General course-related questions,Course - Can I still join the course after the...,1721490292.9536402,aAFvvZYRv1hINA
3,677351ff-594e-5e03-b9fc-336ad2973ee4,"[-0.023312101, -0.09461491, 0.056361604, -0.00...",You don't need it. You're accepted. You can al...,General course-related questions,Course - I have registered for the Data Engine...,1721490292.9536402,3jYadkeaZtUtew
4,92137555-afb2-5ab0-b2da-f2b3b76a89cd,"[0.026537659, -0.017796658, 0.0021156375, 0.00...",You can start by installing and setting up all...,General course-related questions,Course - What can I do before the course starts?,1721490292.9536402,feaEYNR29C6Vcw
...,...,...,...,...,...,...,...
943,9ff4d092-8dd4-5ec3-818a-434955e8434f,"[0.016619304, -0.033603117, -0.09334721, -0.02...",Problem description\nThis is the step in the c...,Module 6: Best practices,Github actions: Permission denied error when e...,1721490292.9536402,NtYk/8csJP2c8A
944,67b0df83-99f8-5e3c-bef8-bd392f6a00f7,"[0.026872871, -0.0019949335, 0.008369085, -0.0...",Problem description\nWhen a docker-compose fil...,Module 6: Best practices,Managing Multiple Docker Containers with docke...,1721490292.9536402,yMXoxqpNTxfXyg
945,f167c20e-88ae-5c76-94af-0f1a4420c792,"[0.03513752, 0.056265578, 0.024428517, -0.0651...",Problem description\nIf you are having problem...,Module 6: Best practices,AWS regions need to match docker-compose,1721490292.9536402,veTN4Izmbdq1qg
946,168157ae-3f8f-5347-9415-0873208601c2,"[0.03380975, -0.0031219234, 0.0017484598, 0.01...",Problem description\nPre-commit command was fa...,Module 6: Best practices,Isort Pre-commit,1721490292.9536402,adEFD8ox3yyc/A
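
The new `vector__` column holds one embedding per row. As a rough illustration of what such vectors are for, the sketch below compares vectors with cosine similarity. It uses made-up three-dimensional vectors, not real model output, and the names `query`, `rows`, and `cosine_similarity` are hypothetical; a real lookup would normally go through LanceDB's own search rather than a hand-written loop.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for a query embedding and two stored rows.
query = [1.0, 0.0, 0.0]
rows = {
    "doc_a": [2.0, 0.0, 0.0],  # same direction as the query
    "doc_b": [0.0, 3.0, 0.0],  # orthogonal to the query
}

scores = {name: cosine_similarity(query, vec) for name, vec in rows.items()}
best = max(scores, key=scores.get)
print(best)  # doc_a
```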


