[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/weaviate/recipes/blob/main/integrations/data-platforms/aryn/weaviate_blog_post.ipynb)

In [None]:
%load_ext autoreload
%autoreload 2

Tested with:

* Weaviate server: 1.25.0
* Weaviate client: 4.6.3
* Sycamore: 0.1.18

by:

* Name: Benjamin Sowell, CTO, Aryn
* X handle: @aryninc
* Website: www.aryn.ai


## Introduction
In this notebook we will walk through how to prepare and load data into Weaviate using Sycamore. 

## Prerequisites
To run this notebook, you should complete the following setup tasks. 

- **Sycamore**. The Sycamore library can be installed using `pip` with the command `pip install sycamore-ai`. We recommend you install this in a virtual environment to isolate its dependencies.
- **ArynPartitioner**. This notebook uses the Aryn Partitioning Service, which provides an endpoint that integrates with Sycamore for partitioning PDFs. You can sign up for a free API key at [https://www.aryn.ai/get-started](https://www.aryn.ai/get-started). Once you have an API key, export it by setting the `ARYN_API_KEY` environment variable. You can read about other ways to specify your API key [here](https://sycamore.readthedocs.io/en/stable/aryn_cloud/aryn_partitioning_service.html). Alternatively, you can run the partitioning step locally, as described below.
- **Poppler**. Some of Sycamore's PDF processing routines depend on the `poppler` package being available. You can install it with your platform's native package manager. For example, on macOS with Homebrew you can run `brew install poppler`, and on Debian Linux and its derivatives you can run `sudo apt install poppler-utils`. More information about Poppler can be found [here](https://poppler.freedesktop.org/).
- **OpenAI**. The `SummarizeImages` transform uses OpenAI to compute text summaries of images. To use OpenAI, you need an OpenAI API key, which you can get from [here](https://platform.openai.com). This notebook assumes you have set the `OPENAI_API_KEY` environment variable to your key, though we show below how to set the key directly as well.
- **Weaviate**. Weaviate should be accessible via localhost on port 8080 for HTTP and port 50051 for gRPC. To support embedding of queries, you should have the `sentence-transformers-all-MiniLM-L6-v2` model set up. You can find a sample Docker compose file to set this up [here](https://github.com/aryn-ai/sycamore/blob/main/apps/weaviate/compose.yml).
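
Before running the rest of the notebook, it can help to confirm that the prerequisites above are in place. The following is a minimal sketch that uses only the Python standard library. The helper names `port_open` and `check_prerequisites` are our own and are not part of Sycamore or Weaviate, and the check assumes that Poppler's `pdftoppm` tool is the binary to look for on your `PATH`.

```python
import importlib.util
import os
import shutil
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_prerequisites() -> dict:
    """Collect a pass/fail flag for each prerequisite listed above."""
    return {
        "sycamore importable": importlib.util.find_spec("sycamore") is not None,
        "ARYN_API_KEY set": bool(os.environ.get("ARYN_API_KEY")),
        "OPENAI_API_KEY set": bool(os.environ.get("OPENAI_API_KEY")),
        # pdftoppm is one of the command-line tools shipped with Poppler.
        "poppler on PATH": shutil.which("pdftoppm") is not None,
        "Weaviate HTTP (8080)": port_open("localhost", 8080),
        "Weaviate gRPC (50051)": port_open("localhost", 50051),
    }


for name, ok in check_prerequisites().items():
    print(f"{'OK     ' if ok else 'MISSING'}  {name}")
```

An open port only shows that something is listening on it. It does not confirm that Weaviate is healthy or that the transformers module is configured.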

## Getting Started
The first step is to read both files into a DocSet.

In [None]:
import sycamore

paths = ["../data/"] 

context = sycamore.init()
ds = context.read.binary(paths=paths, binary_format="pdf")

ds.show()

Next, we run the ArynPartitioner to segment the documents. We show an example of how one page is partitioned. 

In [None]:
from sycamore.transforms.partition import ArynPartitioner
from sycamore.utils.pdf_utils import show_pages

# Make sure that your Aryn token is available in the environment variable
# ARYN_API_KEY
partitioner = ArynPartitioner(
    extract_table_structure=True, 
    extract_images=True)

# Alternatively, you can uncomment the following to run the ArynPartitioner
# locally. This works best if you have an NVIDIA GPU.
# partitioner = ArynPartitioner(
#     extract_table_structure=True, 
#     extract_images=True,
#     use_partitioning_service=False)

ds = ds.partition(partitioner=partitioner)

# The show_pages utility displays a subset of pages with their bounding 
# boxes after partitioning. This can be useful for understanding and 
# debugging the output of the ArynPartitioner. 
show_pages(ds, limit=1)

At this point we have split the papers into elements, and we can look at the output. Here we look at section headers from the first paper:

In [None]:
docs = ds.filter(lambda doc: doc.properties['path'].endswith("paper01.pdf"))\
         .filter_elements(lambda el: el.type == "Section-header" and el.text_representation is not None)\
         .take_all()

for d in docs:
    for e in d.elements:
        print(e.text_representation.strip())

You can see that the section headers were correctly extracted, though a few of the table titles were also identified as section headers.
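
If the stray table titles are a problem for your use case, one option is to filter them out with a simple text pattern. The sketch below assumes that table titles begin with the word "Table" followed by a number, which may not hold for every paper. The helper `looks_like_table_title` and the sample header strings are hypothetical and are used only for illustration.

```python
import re

# Assumption: table titles start with "Table <number>", e.g. "Table 2: ...".
TABLE_TITLE_RE = re.compile(r"^table\s+\d+", re.IGNORECASE)


def looks_like_table_title(text: str) -> bool:
    """Return True if the text resembles a table title rather than a section header."""
    return bool(TABLE_TITLE_RE.match(text.strip()))


# Made-up header strings standing in for the extracted elements.
headers = [
    "1 Introduction",
    "Table 2: Results on the test set",
    "3.1 Training",
    "TABLE 4. Hyperparameters",
]

section_headers = [h for h in headers if not looks_like_table_title(h)]
print(section_headers)
```

The same predicate could be added to the `filter_elements` lambda above, for example by appending `and not looks_like_table_title(el.text_representation)` to the condition.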

## Entity Extraction and Summarization
In addition to basic partitioning, Sycamore makes it easy to augment your documents with metadata to improve retrieval. For example, the following code extracts the title, authors, and abstract of each paper in the DocSet and saves it as metadata associated with the document. 

In [None]:
from sycamore.llms import OpenAI, OpenAIModels
from sycamore.transforms.extract_schema import OpenAIPropertyExtractor

# Specify a schema name and property types that tell the LLM which properties to extract.
schema_name = "PaperInfo"
schema = {
    "title": "string",
    "authors": "list[string]",
    "abstract": "string"
}

openai = OpenAI(OpenAIModels.GPT_4O) # Reads the OPENAI_API_KEY env var

# Extract the properties and add them under a special key "entity" in the
# document properties. By default this sends the first 10 elements of the
# Document to the LLM.
ds = ds.extract_properties(OpenAIPropertyExtractor(
   llm=openai,
   schema_name=schema_name,
   schema=schema))

ds.show(show_elements=False)
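
Because the properties are produced by an LLM, fields can occasionally be missing or come back in an unexpected shape. The following is a minimal sketch of checking an extracted entity against the schema above in plain Python. `TYPE_CHECKS` and `validate_entity` are hypothetical helpers, not part of Sycamore, and the example values are made up.

```python
# Hypothetical helper, not part of Sycamore: maps the schema's type names
# to simple Python checks.
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "list[string]": lambda v: isinstance(v, list) and all(isinstance(x, str) for x in v),
}


def validate_entity(entity: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the entity matches the schema."""
    problems = []
    for field, type_name in schema.items():
        if field not in entity or entity[field] is None:
            problems.append(f"missing field: {field}")
        elif not TYPE_CHECKS[type_name](entity[field]):
            problems.append(f"wrong type for {field}: expected {type_name}")
    return problems


schema = {"title": "string", "authors": "list[string]", "abstract": "string"}

# Made-up example values, for illustration only.
good = {"title": "A Paper", "authors": ["A. Author", "B. Author"], "abstract": "We study..."}
bad = {"title": "A Paper", "authors": "A. Author and B. Author"}

print(validate_entity(good, schema))
print(validate_entity(bad, schema))
```

In the notebook, the dictionary to validate would be the value stored under the `"entity"` key of each document's properties.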

The following code summarizes each image in the documents using GPT-4o. 

In [None]:
from sycamore.transforms.summarize_images import OpenAIImageSummarizer, SummarizeImages

ds = ds.transform(SummarizeImages)

# By default the SummarizeImages transform will use GPT-4o and pick up
# credentials from the OPENAI_API_KEY environment variable. You
# can use a custom LLM like the following.
#
# summarizer = OpenAIImageSummarizer(openai_model=openai)
# ds = ds.transform(SummarizeImages, summarizer=summarizer)

# Display only the image elements from each document. 
ds.filter_elements(lambda e: e.type == 'Image')\
  .show()

## Writing to Weaviate
The final step is to write the records to Weaviate. The following code configures the Weaviate client assuming that it runs locally, though you can adjust this to point to any Weaviate endpoint. 

In [None]:
from weaviate.client import AdditionalConfig, ConnectionParams
from weaviate.config import Timeout
from weaviate.collections.classes.config import Configure
from weaviate.classes.config import ReferenceProperty

collection = "WeaviateSycamoreDemoCollection"
wv_client_args = {
    "connection_params": ConnectionParams.from_params(
        http_host="localhost",
        http_port=8080,
        http_secure=False,
        grpc_host="localhost",
        grpc_port=50051,
        grpc_secure=False,
    ),
    "additional_config": AdditionalConfig(timeout=Timeout(init=2, query=45, insert=300)),
}
collection_config_params = {
    "name": collection,
    "description": "A collection to demo data-prep with Sycamore",

    # Sycamore can be used to embed document chunks before writing to Weaviate, so this is primarily to 
    # ensure that queries are embedded using the correct model in Weaviate. If you don't need to embed
    # queries or can do so externally, you can change the vectorizer_config to None. 
    "vectorizer_config": [Configure.NamedVectors.text2vec_transformers(name="embedding", source_properties=['text_representation'])],
}


Next, we write the data out from Sycamore into Weaviate. The following code does three things: (1) it associates the "path" and "entity" properties from the top-level documents with each element to simplify queries, (2) it breaks each document into chunks and creates embeddings for each chunk, and (3) it writes the chunks to Weaviate using the configuration defined above.

In [None]:
from sycamore.transforms.embed import SentenceTransformerEmbedder

model_name = "sentence-transformers/all-MiniLM-L6-v2"

ds.spread_properties(["path", "entity"])\
  .explode()\
  .embed(embedder=SentenceTransformerEmbedder(
      model_name=model_name, batch_size=1000))\
  .write.weaviate(
      wv_client_args=wv_client_args,
      collection_name=collection,
      collection_config=collection_config_params,
      flatten_properties=True)
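
To make steps (1) and (2) more concrete, here is a conceptual sketch that mimics them with plain dictionaries. It is not Sycamore's implementation, and real Sycamore documents carry more fields than shown here. The functions below only illustrate the idea of copying document-level properties onto each element and then emitting one record per element. The sample document is made up.

```python
def spread_properties(doc: dict, keys: list) -> dict:
    """Copy the selected document-level properties onto every element."""
    inherited = {k: doc["properties"][k] for k in keys if k in doc["properties"]}
    elements = [
        {**el, "properties": {**el.get("properties", {}), **inherited}}
        for el in doc["elements"]
    ]
    return {**doc, "elements": elements}


def explode(doc: dict) -> list:
    """Turn one document with N elements into N standalone chunk records."""
    return list(doc["elements"])


# Made-up document, for illustration only.
doc = {
    "properties": {"path": "paper01.pdf", "entity": {"title": "A Paper"}},
    "elements": [
        {"type": "Section-header", "text_representation": "1 Introduction",
         "properties": {"page_number": 1}},
        {"type": "Image", "text_representation": "A summary of the image.",
         "properties": {"page_number": 2}},
    ],
}

chunks = explode(spread_properties(doc, ["path", "entity"]))
for chunk in chunks:
    print(chunk["type"], chunk["properties"])
```

After this step each chunk carries its own copy of `path` and `entity`, which is what allows a query to filter or display those fields without looking up the parent document.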

## Querying with Weaviate
Once the data has been loaded into Weaviate, you can query it with the standard client. The following example uses a hybrid search with a filter to find images about skin cancer image classification.

In [None]:
import weaviate
from weaviate.classes.query import Filter

wcl = weaviate.connect_to_local()
demo = wcl.collections.get(collection)

# Utility method for formatting the output in an easily readable way.
def print_search_result(sr):
    for obj in sr.objects:
        print("=" * 80)
        for p in obj.properties:
            print(f"{p: <30}| {obj.properties[p]}")

# Specify the properties to return in the vector search.
get_props = [
    "text_representation",
    "type",
    "properties__path",
    "properties__page_number",
    "properties__entity__authors",
    "properties__entity__title"
]

# Do a hybrid search query with a filter for Image elements.
print_search_result(demo.query.hybrid(
    query="Applications of deep learning to skin cancer image classification.",
    query_properties=["text_representation"],
    target_vector="embedding",
    return_properties=get_props,
    filters=Filter.by_property("type").equal("Image"),
    limit=2
))
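
The property names in `get_props`, such as `properties__entity__title`, suggest that nested properties are flattened into single-level names joined by a double underscore when `flatten_properties=True` is used in the write step. The sketch below illustrates that naming scheme in plain Python. It is our own illustration under that assumption, not Sycamore's code, and the `flatten` helper and sample record are hypothetical.

```python
def flatten(record: dict, separator: str = "__", prefix: str = "") -> dict:
    """Collapse nested dictionaries into one level, joining keys with the separator."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{separator}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, separator, name))
        else:
            flat[name] = value
    return flat


# Made-up chunk record, for illustration only.
chunk = {
    "type": "Image",
    "text_representation": "A summary of the image.",
    "properties": {
        "path": "paper01.pdf",
        "page_number": 2,
        "entity": {"title": "A Paper", "authors": ["A. Author"]},
    },
}

for name in flatten(chunk):
    print(name)
```

Under this scheme, a field nested at `properties -> entity -> title` becomes `properties__entity__title`, matching the names requested in the query above.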