<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/data_connectors/PathwayReaderDemo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pathway Reader

> [Pathway](https://pathway.com/) is an open data processing framework. It allows you to easily develop data transformation pipelines and Machine Learning applications that work with live data sources and changing data.

This notebook shows how to use Pathway to deploy a live data indexing pipeline that can be queried through the `PathwayReader`. You can add documents to Pathway from existing connectors or write your own connector in Python, so your LLM stays up to date with the latest information.

## Prerequisites

In [None]:
!pip install pathway
!pip install llama-index

In [None]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

If there is no Pathway instance running, we need to start one.
For the demo, let's create an instance that listens to local files.

In [None]:
from llama_index.retrievers import PathwayVectorServer

In [None]:
import getpass
import os
import pathway as pw

# omit if embedder of choice is not OpenAI
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

## Define inputs Pathway will listen to

In [None]:
data_sources = []
# This creates a `pathway` connector that tracks
# all the files in the `data/paul_graham` directory.
data_sources.append(
    pw.io.fs.read(
        "../data/paul_graham",
        format="binary",
        mode="streaming",
        with_metadata=True,
    )
)

# More connectors for other sources and formats are available under `pw.io`.
# For example, the commented-out code below creates a connector that tracks files in Google Drive.
# To get credentials, follow the instructions at https://pathway.com/developers/tutorials/connectors/gdrive-connector/

# data_sources.append(
#     pw.io.gdrive.read(
#         object_id="17H4YpBOAKQzEJ93xmC2z170l0bP2npMy",
#         service_user_credentials_file="credentials.json",
#         with_metadata=True,
#     )
# )

## Create document transformation pipeline

In [None]:
from llama_index.embeddings import OpenAIEmbedding
from llama_index.node_parser import TokenTextSplitter

embed_model = OpenAIEmbedding(embed_batch_size=10)

transformations_example = [
    TokenTextSplitter(
        chunk_size=150,
        chunk_overlap=10,
        separator=" ",
    ),
    embed_model,
]
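
To make the `chunk_size` and `chunk_overlap` settings above more concrete, here is a minimal sketch of overlapping chunking. It is an illustration only: the helper `split_with_overlap` is hypothetical and counts whitespace-separated words, whereas `TokenTextSplitter` counts tokenizer tokens, so real chunk boundaries will differ.

```python
def split_with_overlap(text, chunk_size=150, chunk_overlap=10, separator=" "):
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `chunk_overlap` words. Hypothetical helper for
    illustration; it is not how TokenTextSplitter is implemented.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = [t for t in text.split(separator) if t]
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(separator.join(tokens[start : start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reached the end of the text
    return chunks


# A synthetic 400-word text, split with the same settings as above.
sample = " ".join(f"w{i}" for i in range(400))
chunks = split_with_overlap(sample, chunk_size=150, chunk_overlap=10)
print(len(chunks), [len(c.split(" ")) for c in chunks])
```

The overlap means that a sentence falling on a chunk boundary still appears with some surrounding context in the neighbouring chunk, at the cost of embedding a few words twice.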

## Run the Pathway server

In [None]:
pr = PathwayVectorServer(
    *data_sources,
    transformations=transformations_example,
)

# Define the host and port that the Pathway server will listen on
PATHWAY_HOST = "127.0.0.1"
PATHWAY_PORT = 8754

pr.run_server(
    host=PATHWAY_HOST, port=PATHWAY_PORT, with_cache=False, threaded=True
)
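
Because the server is started with `threaded=True`, the call returns while the server is still starting in the background. One way to avoid querying too early is to poll the port first. The sketch below uses only the standard library; `wait_for_port` is a hypothetical helper, not part of Pathway or llama-index, and an open port only means the server accepts connections, not that indexing has finished.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds.

    Returns False if no connection could be made within `timeout` seconds.
    Hypothetical helper for illustration.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

In this notebook it could be called as `wait_for_port(PATHWAY_HOST, PATHWAY_PORT)` before the first query.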

## Define `reader` client for Pathway

In [None]:
from llama_index.readers.pathway import PathwayReader

In [None]:
reader = PathwayReader(host=PATHWAY_HOST, port=PATHWAY_PORT)

In [None]:
# Search the index with some text
reader.load_data(query_text="some search input")
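
The search returns a list of documents, which can be hard to read when printed raw. The sketch below shows one possible way to preview them. It assumes each returned object exposes `.text` and `.metadata` attributes; `preview_docs` is a hypothetical helper, and the stand-in objects exist only so the example runs without a live server.

```python
from types import SimpleNamespace


def preview_docs(docs, width=80):
    """Return one summary line per document: index, text snippet, metadata keys."""
    lines = []
    for i, doc in enumerate(docs):
        text = " ".join(doc.text.split())  # collapse newlines and repeated spaces
        snippet = text if len(text) <= width else text[: width - 3] + "..."
        meta = getattr(doc, "metadata", None) or {}
        lines.append(f"[{i}] {snippet} | metadata keys: {sorted(meta)}")
    return lines


# Stand-in objects (assumption: real results have the same two attributes).
fake_docs = [
    SimpleNamespace(text="Hello   world\nfrom Pathway", metadata={"path": "a.txt"}),
]
for line in preview_docs(fake_docs):
    print(line)
```

With a running server, the same helper could be applied to the result of `reader.load_data(...)`.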

## Create a summary index with llama-index

In [None]:
docs = reader.load_data(query_text="some search input", k=2)

In [None]:
from llama_index.indices.list import SummaryIndex

In [None]:
index = SummaryIndex.from_documents(docs)

In [None]:
query_engine = index.as_query_engine()
response = query_engine.query("What does Paul Graham talk about?")

In [None]:
print(response)