<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/data_connectors/PathwayReaderDemo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pathway Reader

> [Pathway](https://pathway.com/) is an open data processing framework. It allows you to easily develop data transformation pipelines and Machine Learning applications that work with live data sources and changing data.

This notebook shows how to use Pathway to deploy a live data indexing pipeline that can be queried through `PathwayReader`. You can add documents to Pathway from existing connectors or write your own connector in Python, so your LLM stays up to date with the latest information.

## Prerequisites

In [None]:
!pip install pathway
!pip install llama-index

In [None]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
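
A note on the logging setup above: `basicConfig` normally attaches one handler to the root logger, and the explicit `addHandler` call attaches a second one. That is most likely why each record in the outputs further down appears twice, once with the `INFO:name:` prefix and once bare. The sketch below reproduces the effect with a standalone logger and an in-memory stream; the helper `count_emitted_lines` and the logger names are hypothetical and exist only for this illustration.

```python
import io
import logging


def count_emitted_lines(num_handlers: int) -> int:
    """Log one message through `num_handlers` handlers and count the output lines."""
    stream = io.StringIO()
    logger = logging.getLogger(f"handler_demo_{num_handlers}")
    logger.handlers.clear()   # start from a clean logger on repeated calls
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the demo isolated from the root logger
    for _ in range(num_handlers):
        logger.addHandler(logging.StreamHandler(stream))
    logger.info("hello")
    return len(stream.getvalue().splitlines())


print(count_emitted_lines(1))  # one handler, one line
print(count_emitted_lines(2))  # two handlers, the same record is written twice
```

If the duplicated lines are distracting, dropping either the `basicConfig` call or the `addHandler` call should leave a single copy of each record.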

If there is no Pathway instance running, we need to start one.
For the demo, let's create an instance that listens to local files.

In [None]:
from llama_index.retrievers import PathwayVectorServer

In [None]:
import getpass
import os
import pathway as pw

# omit if embedder of choice is not OpenAI
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

## Define inputs Pathway will listen to

In [None]:
data_sources = []

# This creates a `pathway` connector that tracks
# all the files in the `../data/paul_graham` directory.
data_sources.append(
    pw.io.fs.read(
        "../data/paul_graham",
        format="binary",
        mode="streaming",
        with_metadata=True,
    )
)

# More connectors for other sources and formats are available under `pw.io`.
# The commented-out example below creates a connector that tracks files in Google Drive.
# Follow the instructions at https://pathway.com/developers/tutorials/connectors/gdrive-connector/ to get credentials.

# data_sources.append(
#     pw.io.gdrive.read(object_id="17H4YpBOAKQzEJ93xmC2z170l0bP2npMy", service_user_credentials_file="credentials.json", with_metadata=True))

## Create document transformation pipeline

In [None]:
from llama_index.embeddings import OpenAIEmbedding
from llama_index.node_parser import TokenTextSplitter

embed_model = OpenAIEmbedding(embed_batch_size=10)

transformations_example = [
    TokenTextSplitter(
        chunk_size=150,
        chunk_overlap=10,
        separator=" ",
    ),
    embed_model,
]
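
To make the `chunk_size` and `chunk_overlap` settings above more concrete, here is a minimal sketch of overlapping chunking. It is an illustration only: it counts separator-delimited words rather than tokenizer tokens, and it is not how `TokenTextSplitter` is implemented. The function name `split_with_overlap` is hypothetical.

```python
from typing import List


def split_with_overlap(
    text: str,
    chunk_size: int = 150,
    chunk_overlap: int = 10,
    separator: str = " ",
) -> List[str]:
    """Split `text` into chunks of at most `chunk_size` words,
    where consecutive chunks share `chunk_overlap` words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    words = text.split(separator)
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(separator.join(words[start : start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


print(split_with_overlap("a b c d e f g", chunk_size=3, chunk_overlap=1))
# ['a b c', 'c d e', 'e f g']
```

The overlap means that text near a chunk boundary appears in both neighbouring chunks, so a sentence cut at the boundary can still be retrieved with some of its context.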

## Run the Pathway server

In [None]:
pr = PathwayVectorServer(
    *data_sources,
    transformations=transformations_example,
)

# Define the host and port that Pathway will listen on
PATHWAY_HOST = "127.0.0.1"
PATHWAY_PORT = 8754

pr.run_server(
    host=PATHWAY_HOST, port=PATHWAY_PORT, with_cache=False, threaded=True
)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


<Thread(Thread-5 (run), started 139735756285504)>

INFO:pathway_engine.engine.dataflow:Preparing Pathway computation
Preparing Pathway computation
(Press CTRL+C to quit)
INFO:pathway_engine.connectors.monitoring:FilesystemReader-0: 0 entries (1 minibatch(es)) have been sent to the engine
FilesystemReader-0: 0 entries (1 minibatch(es)) have been sent to the engine
INFO:pathway_engine.connectors.monitoring:PythonReader-1: 0 entries (1 minibatch(es)) have been sent to the engine
PythonReader-1: 0 entries (1 minibatch(es)) have been sent to the engine
78
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP R
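
Because `run_server` is called with `threaded=True`, it returns immediately while the server starts in a background thread. A query sent right away may therefore hit a port that is not open yet. The sketch below polls the port with the standard library until it accepts connections; `wait_for_port` is a hypothetical helper, not part of Pathway or LlamaIndex. An open port only shows that the HTTP server is listening, not that all documents have been indexed.

```python
import socket
import time


def wait_for_port(
    host: str, port: int, timeout: float = 30.0, interval: float = 0.5
) -> bool:
    """Return True once `host:port` accepts TCP connections,
    or False if that does not happen within `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # not up yet; retry after a short pause
    return False
```

In this notebook it could be called as `wait_for_port(PATHWAY_HOST, PATHWAY_PORT)` before creating the reader.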

## Define `reader` client for Pathway

In [None]:
from llama_index.readers.pathway import PathwayReader

In [None]:
reader = PathwayReader(host=PATHWAY_HOST, port=PATHWAY_PORT)

In [None]:
# let us search with some text
reader.load_data(query_text="some search input")

INFO:pathway_engine.connectors.monitoring:PythonReader-1: 1 entries (713 minibatch(es)) have been sent to the engine
PythonReader-1: 1 entries (713 minibatch(es)) have been sent to the engine
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:aiohttp.access:127.0.0.1 [22/Dec/2023:16:18:44 +0100] "POST / HTTP/1.1" 200 2340 "-" "python-requests/2.31.0"
127.0.0.1 [22/Dec/2023:16:18:44 +0100] "POST / HTTP/1.1" 200 2340 "-" "python-requests/2.31.0"


[Document(id_='9a1e1475-0cc8-4e95-86b7-e82748dd4957', embedding=None, metadata={'created_at': None, 'modified_at': 1703258245, 'owner': 'berke', 'path': '/home/berke/experimental_berke/llama_index/docs/examples/data/paul_graham_short/paul_graham_essay.txt'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='1bac9196cb11f02e03562f27c85609182c85c8c8a952152fd201290d8e1adeeb', text='made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district\'s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'),
 Document(id_='9d220ead-70aa-49f2-80b4-d303729bdd57', embedding=None, metadata={'created_at': None, 'mod

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:pathway_engine.connectors.monitoring:PythonReader-1: 1 entries (101 minibatch(es)) have been sent to the engine
PythonReader-1: 1 entries (101 minibatch(es)) have been sent to the engine
INFO:pathway_engine.connectors.monitoring:PythonReader-1: 0 entries (100 minibatch(es)) have been sent to the engine
PythonReader-1: 0 entries (100 minibatch(es)) have been sent to the engine


## Create a summary index with llama-index

In [None]:
docs = reader.load_data(query_text="some search input", k=2)

INFO:pathway_engine.connectors.monitoring:PythonReader-1: 1 entries (787 minibatch(es)) have been sent to the engine
PythonReader-1: 1 entries (787 minibatch(es)) have been sent to the engine


INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:aiohttp.access:127.0.0.1 [22/Dec/2023:16:19:34 +0100] "POST / HTTP/1.1" 200 1170 "-" "python-requests/2.31.0"
127.0.0.1 [22/Dec/2023:16:19:34 +0100] "POST / HTTP/1.1" 200 1170 "-" "python-requests/2.31.0"


INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [None]:
from llama_index.indices.list import SummaryIndex

In [None]:
index = SummaryIndex.from_documents(docs)

In [None]:
query_engine = index.as_query_engine()
response = query_engine.query("What does Paul Graham talk about?")

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


In [None]:
print(response)

Paul Graham talks about his experience with programming and the first programs he tried writing on the IBM 1401.
