https://lu.ma/cnpdoc5n

If you want to experiment with this notebook and edit it later, we highly recommend making a copy, since the link is view-only. Also make sure you're signed in with your Google account so that you can add secrets.

Before going into a more complex example, we will go through a simple example of how to load the course Q&A data into LanceDB.

# Intro dlt -> LanceDB loading example

## Install requirements

To create a json -> lancedb pipeline, we need to install:
1. dlt with lancedb extras
2. sentence-transformers: we need to use an embedding model to vectorize and store data inside LanceDB. For this we choose the open-source model "sentence-transformers/all-MiniLM-L6-v2".
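
To make "vectorize" concrete, here is a minimal sketch that uses a toy bag-of-words embedding in place of all-MiniLM-L6-v2 (which produces dense 384-dimensional vectors). The vocabulary and the helpers `toy_embed` and `cosine_similarity` are illustrative assumptions, not part of dlt, LanceDB, or sentence-transformers.

```python
import math

# Hypothetical fixed vocabulary; a real embedding model learns its own representation.
VOCAB = ["course", "start", "docker", "error", "register"]

def toy_embed(text: str) -> list[float]:
    """Count how often each vocabulary word appears in the text (toy stand-in for a model)."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

query = toy_embed("when does the course start")
doc_close = toy_embed("the course will start in june")
doc_far = toy_embed("docker error on startup")

# The related document scores higher than the unrelated one.
print(cosine_similarity(query, doc_close), cosine_similarity(query, doc_far))
```

Texts that share meaning end up with vectors pointing in similar directions, which is what lets a vector database retrieve relevant rows for a question.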

In [1]:
%%capture
!pip install dlt[lancedb]==0.5.1a0
!pip install sentence-transformers

In [2]:
import requests
import dlt

qa_dataset = requests.get("https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1").json()

@dlt.resource
def qa_documents():
  for course in qa_dataset:
    yield course["documents"]

pipeline = dlt.pipeline(pipeline_name="from_json", destination="lancedb", dataset_name="qanda")

load_info = pipeline.run(qa_documents, table_name="documents")

print(load_info)

_dlt_version
[{'name': 'version', 'data_type': 'bigint', 'nullable': False}, {'name': 'engine_version', 'data_type': 'bigint', 'nullable': False}, {'name': 'inserted_at', 'data_type': 'timestamp', 'nullable': False}, {'name': 'schema_name', 'data_type': 'text', 'nullable': False}, {'name': 'version_hash', 'data_type': 'text', 'nullable': False}, {'name': 'schema', 'data_type': 'text', 'nullable': False}]
_dlt_pipeline_state
[{'name': 'version', 'data_type': 'bigint', 'nullable': False}, {'name': 'engine_version', 'data_type': 'bigint', 'nullable': False}, {'name': 'pipeline_name', 'data_type': 'text', 'nullable': False}, {'name': 'state', 'data_type': 'text', 'nullable': False}, {'name': 'created_at', 'data_type': 'timestamp', 'nullable': False}, {'name': 'version_hash', 'data_type': 'text', 'nullable': True}, {'name': '_dlt_load_id', 'data_type': 'text', 'nullable': False}, {'name': '_dlt_id', 'data_type': 'text', 'nullable': False, 'unique': True}]
_dlt_loads
[{'name': 'load_id', 'da
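
The resource above yields each course's `documents` list, so the `documents` table gets one row per document but no column saying which course it came from. If that matters for your use case, one option is to attach the course name before yielding. This is a minimal sketch on made-up data, assuming each top-level entry has a `course` key next to `documents`; `documents_with_course` is a hypothetical helper, not part of the notebook.

```python
# Made-up sample with the same shape as the course Q&A JSON: a list of courses,
# each holding a list of question/answer documents.
sample_dataset = [
    {
        "course": "course-a",
        "documents": [
            {"text": "Answer 1", "section": "General", "question": "Question 1"},
            {"text": "Answer 2", "section": "General", "question": "Question 2"},
        ],
    },
    {
        "course": "course-b",
        "documents": [
            {"text": "Answer 3", "section": "Module 1", "question": "Question 3"},
        ],
    },
]

def documents_with_course(dataset):
    """Yield one row per document, copying the parent course name onto each row."""
    for course in dataset:
        for doc in course["documents"]:
            yield {**doc, "course": course["course"]}

rows = list(documents_with_course(sample_dataset))
print(len(rows), rows[0])
```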

In [3]:
import lancedb

db = lancedb.connect("./.lancedb")
print(db.table_names())


['qanda____dlt_loads', 'qanda____dlt_pipeline_state', 'qanda____dlt_version', 'qanda___dltSentinelTable', 'qanda___documents']


In [4]:
db_table = db.open_table("qanda___documents")

db_table.to_pandas()

Unnamed: 0,id__,text,section,question,_dlt_load_id,_dlt_id
0,2c38f4bf-f81d-5bb0-b22d-46e70a308a3e,The purpose of this document is to capture fre...,General course-related questions,Course - When will the course start?,1720722955.667708,XN/bSv48xPZ5Fg
1,d4024049-fd87-5699-9b71-3ce91304faa0,GitHub - DataTalksClub data-engineering-zoomca...,General course-related questions,Course - What are the prerequisites for this c...,1720722955.667708,Yr+Gd/3zuMqleg
2,6a9bc826-b8bf-5b7e-9c3a-c05db18e7a4f,"Yes, even if you don't register, you're still ...",General course-related questions,Course - Can I still join the course after the...,1720722955.667708,ijWl3zl877MTXQ
3,3e747fcf-1d70-562a-8fd9-e8b77da6e88f,You don't need it. You're accepted. You can al...,General course-related questions,Course - I have registered for the Data Engine...,1720722955.667708,M1J2TtyrCY9xoQ
4,48795f75-30df-5704-9dd7-709effca892c,You can start by installing and setting up all...,General course-related questions,Course - What can I do before the course starts?,1720722955.667708,Xl354qT2BQgRFA
...,...,...,...,...,...,...
943,df961357-03d8-57c1-ae35-0c35fa49fab7,Problem description\nThis is the step in the c...,Module 6: Best practices,Github actions: Permission denied error when e...,1720722955.667708,i/QXOwWTfqwR1w
944,942d8afc-dfd9-5985-9a0b-e04a5d0fc082,Problem description\nWhen a docker-compose fil...,Module 6: Best practices,Managing Multiple Docker Containers with docke...,1720722955.667708,1nRTT/sdAETpUQ
945,9842d81f-7f34-557e-8c2d-c30302eebdce,Problem description\nIf you are having problem...,Module 6: Best practices,AWS regions need to match docker-compose,1720722955.667708,WttFTTR6M/0drw
946,89c5066f-6960-54ce-ac51-53bea44d9298,Problem description\nPre-commit command was fa...,Module 6: Best practices,Isort Pre-commit,1720722955.667708,Hs9ZTlBDbMY76g


In [5]:
import os
from dlt.destinations.adapters import lancedb_adapter

os.environ["DESTINATION__LANCEDB__EMBEDDING_MODEL_PROVIDER"] = "sentence-transformers"
os.environ["DESTINATION__LANCEDB__EMBEDDING_MODEL"] = "all-MiniLM-L6-v2"

pipeline = dlt.pipeline(pipeline_name="from_json_embedded", destination="lancedb", dataset_name="qanda_embedded")

load_info = pipeline.run(lancedb_adapter(qa_documents, embed=["text", "question"]), table_name="documents")
print(load_info)

_dlt_version
[{'name': 'version', 'data_type': 'bigint', 'nullable': False}, {'name': 'engine_version', 'data_type': 'bigint', 'nullable': False}, {'name': 'inserted_at', 'data_type': 'timestamp', 'nullable': False}, {'name': 'schema_name', 'data_type': 'text', 'nullable': False}, {'name': 'version_hash', 'data_type': 'text', 'nullable': False}, {'name': 'schema', 'data_type': 'text', 'nullable': False}]
_dlt_pipeline_state
[{'name': 'version', 'data_type': 'bigint', 'nullable': False}, {'name': 'engine_version', 'data_type': 'bigint', 'nullable': False}, {'name': 'pipeline_name', 'data_type': 'text', 'nullable': False}, {'name': 'state', 'data_type': 'text', 'nullable': False}, {'name': 'created_at', 'data_type': 'timestamp', 'nullable': False}, {'name': 'version_hash', 'data_type': 'text', 'nullable': True}, {'name': '_dlt_load_id', 'data_type': 'text', 'nullable': False}, {'name': '_dlt_id', 'data_type': 'text', 'nullable': False, 'unique': True}]
_dlt_loads
[{'name': 'load_id', 'da

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


UPLOAD
Pipeline from_json_embedded load step completed in 28.29 seconds
1 load package(s) were loaded to destination LanceDB and into dataset qanda_embedded
The LanceDB destination used <dlt.destinations.impl.lancedb.configuration.LanceDBCredentials object at 0x7d308db080d0> location to store data
Load package 1720723294.4352853 is LOADED and contains no failed jobs


In [6]:
db = lancedb.connect("./.lancedb")
print(db.table_names())

['qanda____dlt_loads', 'qanda____dlt_pipeline_state', 'qanda____dlt_version', 'qanda___dltSentinelTable', 'qanda___documents', 'qanda_embedded____dlt_loads', 'qanda_embedded____dlt_pipeline_state', 'qanda_embedded____dlt_version', 'qanda_embedded___dltSentinelTable', 'qanda_embedded___documents']


In [7]:
db_table = db.open_table("qanda_embedded___documents")

db_table.to_pandas()

Unnamed: 0,id__,vector__,text,section,question,_dlt_load_id,_dlt_id
0,11375bf0-da34-5b21-b4c7-a1c3d9cf77f7,"[-0.00035095983, -0.062014256, -0.03799991, 0....",The purpose of this document is to capture fre...,General course-related questions,Course - When will the course start?,1720723294.4352853,2Z89CCgpMa+b6w
1,bbaaa44e-eeb8-5282-820a-64f569f05eb7,"[0.020011423, -0.011535534, 0.0130172055, -0.0...",GitHub - DataTalksClub data-engineering-zoomca...,General course-related questions,Course - What are the prerequisites for this c...,1720723294.4352853,jR/YemYNHhlXRQ
2,6782512d-c2f9-56c4-869b-06c87ea76c15,"[0.014857549, -0.06664994, -0.013571203, 0.023...","Yes, even if you don't register, you're still ...",General course-related questions,Course - Can I still join the course after the...,1720723294.4352853,PrReLUa1ctIiZg
3,3c81e3fc-782f-5013-bd8e-5b64c4c61395,"[-0.023312101, -0.09461491, 0.056361604, -0.00...",You don't need it. You're accepted. You can al...,General course-related questions,Course - I have registered for the Data Engine...,1720723294.4352853,1bRIomXFhAmgCA
4,a09e8fab-fdb4-5e46-a2a7-7200e9cd5a7c,"[0.026537659, -0.017796658, 0.0021156375, 0.00...",You can start by installing and setting up all...,General course-related questions,Course - What can I do before the course starts?,1720723294.4352853,zVPQN4vbkzAK+Q
...,...,...,...,...,...,...,...
943,16824b0b-57cd-55d0-8cae-978470b31a68,"[0.016619304, -0.033603117, -0.09334721, -0.02...",Problem description\nThis is the step in the c...,Module 6: Best practices,Github actions: Permission denied error when e...,1720723294.4352853,p30WbbSqs1CPag
944,6b5eb030-554e-5163-8d14-eba8e01547a4,"[0.026872871, -0.0019949335, 0.008369085, -0.0...",Problem description\nWhen a docker-compose fil...,Module 6: Best practices,Managing Multiple Docker Containers with docke...,1720723294.4352853,+4cU7m5nb++ksA
945,e0813d19-7b8c-57bb-81ec-88d5d8861d99,"[0.03513752, 0.056265578, 0.024428517, -0.0651...",Problem description\nIf you are having problem...,Module 6: Best practices,AWS regions need to match docker-compose,1720723294.4352853,GPp0VI5/j32/Yg
946,494c51c4-4594-5766-bee1-9329c239b31c,"[0.03380982, -0.0031219546, 0.0017484119, 0.01...",Problem description\nPre-commit command was fa...,Module 6: Best practices,Isort Pre-commit,1720723294.4352853,vezaNgadTQOQLQ


That's all for this intro example! The DB could now be used as a basis for a RAG.

In [8]:
%%capture
!pip install dlt[lancedb]==0.5.1a0
!pip install sentence-transformers

In [9]:
!yes | dlt init rest_api lancedb

Looking up the init scripts in https://github.com/dlt-hub/verified-sources.git...
Cloning and configuring a verified source rest_api (Generic API Source)
Do you want to proceed? [Y/n]: 
Verified source rest_api was added to your project!
* See the usage examples and code snippets to copy from rest_api_pipeline.py
* Add credentials for lancedb and other secrets in ./.dlt/secrets.toml
* requirements.txt was created. Install it with:
pip3 install -r requirements.txt
* Read https://dlthub.com/docs/walkthroughs/create-a-pipeline for more information


### 3. Add API credentials

In [10]:
import os
from google.colab import userdata

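# userdata.get() takes the *name* of a Colab secret, not the token itself.
# Assumes the Notion token is stored under this secret name; never hard-code the token.
os.environ["SOURCES__REST_API__NOTION__API_KEY"] = userdata.get("SOURCES__REST_API__NOTION__API_KEY")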

os.environ["DESTINATION__LANCEDB__EMBEDDING_MODEL_PROVIDER"] = "sentence-transformers"
os.environ["DESTINATION__LANCEDB__EMBEDDING_MODEL"] = "all-MiniLM-L6-v2"

os.environ["DESTINATION__LANCEDB__CREDENTIALS__URI"] = ".lancedb"
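
The environment variable names in this cell follow a pattern: the dotted config key that the code later reads (for example `dlt.secrets["sources.rest_api.notion.api_key"]`) is upper-cased and its dots become double underscores. The helper below is only a sketch of that naming convention; `config_key_to_env_var` is a hypothetical name, and dlt's real lookup also consults other providers such as `secrets.toml`.

```python
def config_key_to_env_var(key: str) -> str:
    """Turn a dotted config key into the matching environment variable name."""
    return key.upper().replace(".", "__")

print(config_key_to_env_var("sources.rest_api.notion.api_key"))
print(config_key_to_env_var("destination.lancedb.embedding_model"))
```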

In [11]:
import dlt
from rest_api import RESTAPIConfig, rest_api_source

from dlt.sources.helpers.rest_client.paginators import BasePaginator, JSONResponsePaginator
from dlt.sources.helpers.requests import Response, Request

from dlt.destinations.adapters import lancedb_adapter

5. Create the pipeline and run it

  With our source configured, we can now define the pipeline and run it.

  Normally, to do this we would run  
  ```python
  pipeline.run(
    rest_api_notion_incremental,
    table_name="employee_handbook",
    write_disposition="merge"
  )
  ```
  and this would load the data into LanceDB as is, without creating any embeddings.

  However, LanceDB can create embeddings automatically and load them alongside the regular data through dlt's native adapter for LanceDB, `lancedb_adapter`. It uses the embedding model that we specified in the credentials.
    
  ```python
  pipeline.run(
    lancedb_adapter(
      rest_api_notion_incremental,
      embed="content" # The column that we'd like to embed
    ),
    table_name="employee_handbook",
    write_disposition="merge"
  )
  ```
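
With `write_disposition="merge"` and a primary key, a row that is loaded again replaces the stored row instead of being appended. The following pure-Python sketch illustrates that upsert idea only; it assumes rows are dicts keyed by `block_id` and that the most recent `last_edited_time` wins. `merge_rows` is a hypothetical helper and not how dlt implements merging internally.

```python
def merge_rows(existing: list[dict], incoming: list[dict], key: str, sort_by: str) -> list[dict]:
    """Upsert incoming rows; for a repeated key, keep the row with the largest sort_by value."""
    merged = {row[key]: row for row in existing}
    for row in incoming:
        current = merged.get(row[key])
        if current is None or row[sort_by] >= current[sort_by]:
            merged[row[key]] = row
    return list(merged.values())

stored = [
    {"block_id": "a", "content": "old text", "last_edited_time": "2024-07-05T22:34:00Z"},
    {"block_id": "b", "content": "unchanged", "last_edited_time": "2024-07-05T22:38:00Z"},
]
new_batch = [
    {"block_id": "a", "content": "new text", "last_edited_time": "2024-07-06T09:00:00Z"},
    {"block_id": "c", "content": "added", "last_edited_time": "2024-07-06T09:05:00Z"},
]

# ISO-8601 timestamps in the same format compare correctly as plain strings.
result = merge_rows(stored, new_batch, key="block_id", sort_by="last_edited_time")
print([(row["block_id"], row["content"]) for row in result])
```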

In [18]:
from datetime import datetime, timezone

class PostBodyPaginator(BasePaginator):
    def __init__(self):
        super().__init__()
        self.cursor = None

    def update_state(self, response: Response) -> None:
        # Assuming the API returns an empty list when no more data is available
        if not response.json():
            self._has_next_page = False
        else:
            self.cursor = response.json().get("next_cursor")
            if self.cursor is None:
                self._has_next_page = False

    def update_request(self, request: Request) -> None:
        if request.json is None:
            request.json = {}

        # Add the cursor to the request body
        request.json["start_cursor"] = self.cursor

@dlt.resource(name="homework")
def rest_api_notion_resource():
    notion_config: RESTAPIConfig = {
        "client": {
            "base_url": "https://api.notion.com/v1/",
            "auth": {
                "token": dlt.secrets["sources.rest_api.notion.api_key"]
            },
            "headers":{
            "Content-Type": "application/json",
            "Notion-Version": "2022-06-28"
            }
        },
        "resources": [
            {
                "name": "search",
                "endpoint": {
                    "path": "search",
                    "method": "POST",
                    "paginator": PostBodyPaginator(),
                    "json": {
                        #"query": "workshop",
                        "query": "homework",
                        "sort": {
                            "direction": "ascending",
                            "timestamp": "last_edited_time"
                        }
                    },
                    "data_selector": "results"
                }
            },
            {
                "name": "page_content",
                "endpoint": {
                    "path": "blocks/{page_id}/children",
                    "paginator": JSONResponsePaginator(),
                    "params": {
                        "page_id": {
                            "type": "resolve",
                            "resource": "search",
                            "field": "id"
                        }
                    },
                }
            }
        ]
    }

    yield from rest_api_source(notion_config,name="homework")

def extract_page_content(response):
    block_id = response["id"]
    last_edited_time = response["last_edited_time"]
    block_type = response.get("type", "Not paragraph")
    if block_type != "paragraph":
        content = ""
    else:
        try:
            content = response["paragraph"]["rich_text"][0]["plain_text"]
        except IndexError:
            content = ""
    return {
        "block_id": block_id,
        "block_type": block_type,
        "content": content,
        "last_edited_time": last_edited_time,
        "inserted_at_time": datetime.now(timezone.utc)
    }

@dlt.resource(
    name="homework",
    write_disposition="merge",
    primary_key="block_id",
    columns={"last_edited_time":{"dedup_sort":"desc"}}
    )
def rest_api_notion_incremental(
    last_edited_time = dlt.sources.incremental("last_edited_time", initial_value="2024-06-26T08:16:00.000Z",primary_key=("block_id"))
):
    # last_value = last_edited_time.last_value
    # print(last_value)

    for block in rest_api_notion_resource.add_map(extract_page_content):
        # Skip blocks that have no text content
        if not block["content"]:
            continue
        yield block

def load_notion() -> None:
    pipeline = dlt.pipeline(
        pipeline_name="company_policies",
        destination="lancedb",
        dataset_name="notion_pages",
        # full_refresh=True
    )

    load_info = pipeline.run(
        lancedb_adapter(
            rest_api_notion_incremental,
            embed="content"
        ),
        table_name="homework",
        write_disposition="merge"
    )
    print(load_info)

load_notion()

Pipeline company_policies load step completed in ---
0 load package(s) were loaded to destination LanceDB and into dataset None
The LanceDB destination used <dlt.destinations.impl.lancedb.configuration.LanceDBCredentials object at 0x7d308c8644f0> location to store data
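
The `PostBodyPaginator` above follows a cursor pattern: each response carries a `next_cursor`, which is sent back as `start_cursor` in the next request body until no cursor is returned. The sketch below imitates that loop against an in-memory fake API. `fake_search`, `fetch_all`, and the page contents are made up for illustration and do not call Notion.

```python
from typing import Optional

# Made-up pages keyed by cursor; None is the cursor of the first request.
FAKE_PAGES = {
    None: {"results": [{"id": "page-1"}, {"id": "page-2"}], "next_cursor": "c1"},
    "c1": {"results": [{"id": "page-3"}], "next_cursor": "c2"},
    "c2": {"results": [{"id": "page-4"}], "next_cursor": None},
}

def fake_search(body: dict) -> dict:
    """Stand-in for a POST search call: returns the page matching start_cursor."""
    return FAKE_PAGES[body.get("start_cursor")]

def fetch_all(query: str) -> list[dict]:
    """Request pages until the response no longer provides a next_cursor."""
    results: list[dict] = []
    cursor: Optional[str] = None
    while True:
        body = {"query": query}
        if cursor is not None:
            body["start_cursor"] = cursor  # the cursor travels in the request body
        response = fake_search(body)
        results.extend(response["results"])
        cursor = response.get("next_cursor")
        if cursor is None:
            break
    return results

all_pages = fetch_all("homework")
print([page["id"] for page in all_pages])
```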


### Question 1

In [21]:
import lancedb

db = lancedb.connect(".lancedb")
# print(db.table_names())
dbtable = db.open_table("notion_pages___homework")

dbtable.to_pandas()

Unnamed: 0,id__,vector__,block_id,block_type,content,last_edited_time,inserted_at_time,_dlt_load_id,_dlt_id
0,c69f1ecf-7b02-5810-8286-3f42659ae9d4,"[-0.024265556, 0.04746074, -0.01179647, 0.0638...",a8196881-ae94-4767-8767-92fe1a327d24,paragraph,We owe our success to our employees. To show o...,2024-07-05 22:34:00+00:00,2024-07-11 18:47:23.789011+00:00,1720723641.3424432,iEt76JOuRgT5JQ
1,f2c18ac0-50f5-5b72-a871-dc5a46780353,"[-0.04966156, 0.10853508, -0.009762607, -0.036...",31fcbf26-2ca5-468a-8af8-d7eb4c2db8c8,paragraph,We want to ensure that private information abo...,2024-07-05 22:38:00+00:00,2024-07-11 18:47:23.791377+00:00,1720723641.3424432,VkPj7H5HwBNRhw
2,4553193e-c655-54df-9a33-cfc570bf34d0,"[-0.06316319, 0.17331503, 0.025351755, -0.0191...",da7721fd-3d0f-4c04-bc5e-825ad60bed1c,paragraph,Employee health is important to us. We don’t d...,2024-07-05 22:52:00+00:00,2024-07-11 18:47:23.791594+00:00,1720723641.3424432,j+VgPTZPsWTjzA
3,791be1a1-6c67-530d-87ab-bd9912500ea5,"[-0.109743185, 0.10586075, 0.003290699, -0.021...",ff36dcf3-5faa-40b4-ad8e-92fdc952201e,paragraph,Our company is dedicated to maintaining a safe...,2024-07-05 22:52:00+00:00,2024-07-11 18:47:23.791758+00:00,1720723641.3424432,dEgufKtAGtf7vw
4,a83497f4-922c-5d62-bab1-53804e93c811,"[0.05242333, -0.064576, 0.06586297, 0.01454380...",a1ff9697-4bb6-4f1e-b464-dda296dbd307,paragraph,If your job doesn’t require you to be present ...,2024-07-05 22:52:00+00:00,2024-07-11 18:47:23.791921+00:00,1720723641.3424432,KJRkvooS3E5Pbg
5,434b71e9-a11a-519d-a9fe-e3ade78d47d6,"[0.00052337867, -0.054883413, 0.043573413, -0....",e4ec9f4d-b687-4c28-a80d-985bfabcc2ba,paragraph,Remote working refers to working from a non-of...,2024-07-05 22:52:00+00:00,2024-07-11 18:47:23.792112+00:00,1720723641.3424432,ozuja+AR1b/AEQ
6,17816363-54b7-5ba7-b8d5-06d871a25414,"[0.03802633, -0.021509705, 0.04752782, 0.06470...",e6e550dc-b59e-4928-abd7-07eace948681,paragraph,There are some expenses that we will pay direc...,2024-07-05 22:52:00+00:00,2024-07-11 18:47:23.792308+00:00,1720723641.3424432,Q5Nb+5j9k95pCg
7,2a434cf9-09d9-5514-a88b-02977f2f953e,"[-0.05858811, -0.07540446, 0.033775203, 0.0096...",a269d0ca-ce14-481b-a5f4-9192d6840d6e,paragraph,Our company operates between 9 a.m. to 7 p.m. ...,2024-07-05 22:52:00+00:00,2024-07-11 18:47:23.792481+00:00,1720723641.3424432,RAms2A/Ym89VvA
8,5f9384fa-7f98-5f52-a06e-05b05f42f69a,"[-0.013599302, 0.0047530197, 0.024835136, 0.01...",5b65f3e7-0a37-429a-818d-f99b53755ebd,paragraph,"In this section, we are going to be covering i...",2024-07-05 23:33:00+00:00,2024-07-11 18:47:23.792657+00:00,1720723641.3424432,bgGu9uqcZoXDPA
9,42af72f6-9db7-54a2-87b2-d466169078ff,"[0.032060888, 0.024244698, 0.008471344, 0.0317...",b27f7d80-f2f1-460e-aa0c-b8e770cf050a,paragraph,Our company observes the following holidays: N...,2024-07-05 22:52:00+00:00,2024-07-11 18:47:23.792811+00:00,1720723641.3424432,hLvPpA4p344qTw


### Question 2

 ---

In [23]:
db = lancedb.connect(".lancedb")
# print(db.table_names())
dbtable = db.open_table("notion_pages___homework")

dbtable.to_pandas()

Unnamed: 0,id__,vector__,block_id,block_type,content,last_edited_time,inserted_at_time,_dlt_load_id,_dlt_id
0,c69f1ecf-7b02-5810-8286-3f42659ae9d4,"[-0.024265556, 0.04746074, -0.01179647, 0.0638...",a8196881-ae94-4767-8767-92fe1a327d24,paragraph,We owe our success to our employees. To show o...,2024-07-05 22:34:00+00:00,2024-07-11 18:47:23.789011+00:00,1720723641.3424432,iEt76JOuRgT5JQ
1,f2c18ac0-50f5-5b72-a871-dc5a46780353,"[-0.04966156, 0.10853508, -0.009762607, -0.036...",31fcbf26-2ca5-468a-8af8-d7eb4c2db8c8,paragraph,We want to ensure that private information abo...,2024-07-05 22:38:00+00:00,2024-07-11 18:47:23.791377+00:00,1720723641.3424432,VkPj7H5HwBNRhw
2,4553193e-c655-54df-9a33-cfc570bf34d0,"[-0.06316319, 0.17331503, 0.025351755, -0.0191...",da7721fd-3d0f-4c04-bc5e-825ad60bed1c,paragraph,Employee health is important to us. We don’t d...,2024-07-05 22:52:00+00:00,2024-07-11 18:47:23.791594+00:00,1720723641.3424432,j+VgPTZPsWTjzA
3,791be1a1-6c67-530d-87ab-bd9912500ea5,"[-0.109743185, 0.10586075, 0.003290699, -0.021...",ff36dcf3-5faa-40b4-ad8e-92fdc952201e,paragraph,Our company is dedicated to maintaining a safe...,2024-07-05 22:52:00+00:00,2024-07-11 18:47:23.791758+00:00,1720723641.3424432,dEgufKtAGtf7vw
4,a83497f4-922c-5d62-bab1-53804e93c811,"[0.05242333, -0.064576, 0.06586297, 0.01454380...",a1ff9697-4bb6-4f1e-b464-dda296dbd307,paragraph,If your job doesn’t require you to be present ...,2024-07-05 22:52:00+00:00,2024-07-11 18:47:23.791921+00:00,1720723641.3424432,KJRkvooS3E5Pbg
5,434b71e9-a11a-519d-a9fe-e3ade78d47d6,"[0.00052337867, -0.054883413, 0.043573413, -0....",e4ec9f4d-b687-4c28-a80d-985bfabcc2ba,paragraph,Remote working refers to working from a non-of...,2024-07-05 22:52:00+00:00,2024-07-11 18:47:23.792112+00:00,1720723641.3424432,ozuja+AR1b/AEQ
6,17816363-54b7-5ba7-b8d5-06d871a25414,"[0.03802633, -0.021509705, 0.04752782, 0.06470...",e6e550dc-b59e-4928-abd7-07eace948681,paragraph,There are some expenses that we will pay direc...,2024-07-05 22:52:00+00:00,2024-07-11 18:47:23.792308+00:00,1720723641.3424432,Q5Nb+5j9k95pCg
7,2a434cf9-09d9-5514-a88b-02977f2f953e,"[-0.05858811, -0.07540446, 0.033775203, 0.0096...",a269d0ca-ce14-481b-a5f4-9192d6840d6e,paragraph,Our company operates between 9 a.m. to 7 p.m. ...,2024-07-05 22:52:00+00:00,2024-07-11 18:47:23.792481+00:00,1720723641.3424432,RAms2A/Ym89VvA
8,5f9384fa-7f98-5f52-a06e-05b05f42f69a,"[-0.013599302, 0.0047530197, 0.024835136, 0.01...",5b65f3e7-0a37-429a-818d-f99b53755ebd,paragraph,"In this section, we are going to be covering i...",2024-07-05 23:33:00+00:00,2024-07-11 18:47:23.792657+00:00,1720723641.3424432,bgGu9uqcZoXDPA
9,42af72f6-9db7-54a2-87b2-d466169078ff,"[0.032060888, 0.024244698, 0.008471344, 0.0317...",b27f7d80-f2f1-460e-aa0c-b8e770cf050a,paragraph,Our company observes the following holidays: N...,2024-07-05 22:52:00+00:00,2024-07-11 18:47:23.792811+00:00,1720723641.3424432,hLvPpA4p344qTw


In [24]:
!curl -fsSL https://ollama.com/install.sh | sh

>>> Downloading ollama...
############################################################################################# 100.0%
>>> Installing ollama to /usr/local/bin...
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.


2. Start Ollama using `ollama serve`. This needs to run in the background, so we run it using `nohup` (to see the output log, open nohup.out).

In [25]:
!nohup ollama serve > nohup.out 2>&1 &

In [26]:
%%capture
!ollama pull llama2-uncensored

In [27]:
!pip install ollama

Collecting ollama
  Downloading ollama-0.2.1-py3-none-any.whl (9.7 kB)
Collecting httpx<0.28.0,>=0.27.0 (from ollama)
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
Collecting httpcore==1.* (from httpx<0.28.0,>=0.27.0->ollama)
  Downloading httpcore-1.0.5-py3-none-any.whl (77 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<0.28.0,>=0.27.0->ollama)
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)
Installing collected packages: h11, httpcore, httpx, ollama
Successfully installed h11-0.14.0 httpcore-1.0.5 httpx-0.27.0 ollama-0.2.1


In [28]:
import ollama

In [29]:
def retrieve_context_from_lancedb(dbtable, question, top_k=2):

    query_results = dbtable.search(query=question).to_list()
    context = "\n".join([result["content"] for result in query_results[:top_k]])

    return context
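
Conceptually, the search in the function above embeds the question and returns the stored rows whose vectors lie closest to it. Here is a minimal sketch of that ranking step on made-up two-dimensional vectors, using Euclidean distance as an assumption about the metric; `top_k_context` is a hypothetical helper, not a LanceDB call.

```python
import math

def top_k_context(rows: list[dict], query_vector: list[float], top_k: int = 2) -> str:
    """Join the content of the top_k rows whose vectors are nearest to the query vector."""
    ranked = sorted(rows, key=lambda row: math.dist(row["vector"], query_vector))
    return "\n".join(row["content"] for row in ranked[:top_k])

# Made-up rows; real vectors from all-MiniLM-L6-v2 have 384 dimensions.
toy_rows = [
    {"content": "PTO policy", "vector": [1.0, 0.0]},
    {"content": "Remote work", "vector": [0.0, 1.0]},
    {"content": "Holidays", "vector": [0.9, 0.1]},
]

toy_context = top_k_context(toy_rows, query_vector=[1.0, 0.0], top_k=2)
print(toy_context)
```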

6. Finally, we define a very basic RAG. We define a simple system prompt, retrieve the relevant context for the user query with the function defined above, and then send the user question and the context to the `llama2-uncensored` model.

In [30]:
def main():
  # Connect to the lancedb table
  db = lancedb.connect(".lancedb")
  dbtable = db.open_table("notion_pages___homework")

  # A system prompt telling ollama to accept input in the form of "Question: ... ; Context: ..."
  messages = [
      {"role": "system", "content": "You are a helpful assistant that helps users understand policies inside a company's employee handbook. The user will first ask you a question and then provide you relevant paragraphs from the handbook as context. Please answer the question based on the provided context. For any details missing in the paragraph, encourage the employee to contact the HR for that information. Please keep the responses conversational."}
  ]

  while True:
    # Accept user question
    question = input("You: ")

    # Retrieve the relevant paragraphs on the question
    context = retrieve_context_from_lancedb(dbtable,question,top_k=2)

    # Create a user prompt using the question and retrieved context
    messages.append(
        {"role": "user", "content": f"Question: '{question}'; Context:'{context}'"}
    )

    # Get the response from the LLM
    response = ollama.chat(
        model="llama2-uncensored",
        messages=messages
    )
    response_content = response['message']['content']
    print(f"Assistant: {response_content}")

    # Add the response into the context window
    messages.append(
        {"role": "assistant", "content":response_content}
    )
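
The loop in `main()` runs until the cell is interrupted. One possible variant, sketched here with hypothetical names (`build_user_message`, `chat_session`), takes the retrieval and chat functions as parameters so it can be tried without Ollama or LanceDB, and stops when the user types `exit`. The stubs at the bottom are placeholders, not real model output.

```python
def build_user_message(question: str, context: str) -> dict:
    """Format the question and retrieved context the way the system prompt expects."""
    return {"role": "user", "content": f"Question: '{question}'; Context:'{context}'"}

def chat_session(questions, retrieve_fn, chat_fn, system_prompt: str) -> list[dict]:
    """Answer each question in turn, keeping the full message history; stop at 'exit'."""
    messages = [{"role": "system", "content": system_prompt}]
    for question in questions:
        if question.strip().lower() == "exit":
            break
        messages.append(build_user_message(question, retrieve_fn(question)))
        answer = chat_fn(messages)
        # Keep the answer in the history so follow-up questions have context
        messages.append({"role": "assistant", "content": answer})
    return messages

# Stubs standing in for the LanceDB retrieval and the LLM call
history = chat_session(
    questions=["How many PTO days do I get?", "exit", "ignored"],
    retrieve_fn=lambda q: "handbook paragraph",
    chat_fn=lambda msgs: f"stub answer to {len(msgs)} messages",
    system_prompt="You answer questions about the employee handbook.",
)
print([message["role"] for message in history])
```

With the real components, `retrieve_fn` would wrap `retrieve_context_from_lancedb(dbtable, q)` and `chat_fn` would wrap the `ollama.chat(...)` call and return `response['message']['content']`.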

### Question 3

How many PTO days are employees entitled to in a year?

In [None]:
main()

You: how many PTO days are the employees entitled to in a year?
Assistant: Thank you for asking! To answer your question, based on the provided context, it looks like employees are entitled to 30 days of paid time off (PTO) per year. You can take this PTO at any time after your first week with our company and use time off that you haven't accrued yet. After your first year with our company, you will earn one additional day per year up to a cap of 25 days overall. If you leave the company or are not terminated for cause, we will compensate your remaining PTO according to local law.
If you have any other questions or need clarification on anything, please don't hesitate to contact HR for more information.


There's a lot more to learn and do with dlt and LanceDB. Find more information in the [dlt docs](https://dlthub.com/docs/) and the [LanceDB docs](https://lancedb.github.io/lancedb/).

If you have questions about this workshop or dlt, feel free to join our [community on Slack](https://dlthub.com/community).

If you're at EuroPython in Prague this week, come see us at our booth!