[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/weaviate/recipes/blob/main/integrations/data-platforms/unstructured/unstructured_weaviate.ipynb)

For this demo, we're using version `1.25.4` of [Weaviate](https://weaviate.io/), version `4.6.5` of the Weaviate Python client, and [Unstructured](https://unstructured.io/) `0.14.5`.


Author: **Maria Khalusova** from Unstructured

Maria's X handle: @mariaKhalusova

Maria's LinkedIn: https://www.linkedin.com/in/maria-khalusova-a958aa14/

## Install the dependencies

In [None]:
!pip install -U -q "unstructured[s3, pdf, weaviate, openai]" python-dotenv

## Load environment variables

Mount your Google Drive; a pop-up will ask you to connect your Google Drive account.
Then load the environment variables from a `.env` file. If you prefer another method for loading environment variables, go ahead and use it :)

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import os
import dotenv

dotenv.load_dotenv('/content/drive/MyDrive/.env')

True
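Before moving on, it can help to confirm that the variables the rest of this notebook reads are actually set. The sketch below is one possible check, not part of the original recipe: the helper `missing_env_vars` and the `REQUIRED_VARS` list are illustrative names, and the list simply collects the variable names used in the cells that follow.

```python
import os

# Variable names read later in this notebook; adjust the list to your own setup.
REQUIRED_VARS = [
    "WEAVIATE_URL",
    "WEAVIATE_API_KEY",
    "OPENAI_API_KEY",
    "UNSTRUCTURED_API_KEY",
    "UNSTRUCTURED_URL",
    "AWS_KEY",
    "AWS_SECRET",
    "AWS_S3_NAME",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Catching a missing key here is usually cheaper than debugging an authentication error in the middle of the ingest run.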

## Connect to Weaviate

You can use [Weaviate Cloud](https://console.weaviate.cloud/), [Weaviate Embedded](https://weaviate.io/developers/weaviate/installation/embedded), or a [local instance deployed with Docker](https://weaviate.io/developers/weaviate/installation/docker-compose).

In [None]:
# Weaviate Cloud

import weaviate

# Set these environment variables
URL = os.getenv("WEAVIATE_URL")
APIKEY = os.getenv("WEAVIATE_API_KEY")

# Connect to your WCD instance
client = weaviate.connect_to_wcs(
    cluster_url=URL,
    auth_credentials=weaviate.auth.AuthApiKey(APIKEY),
    headers={
        "X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")  # Replace with your OpenAI key
    }
)

client.is_ready()

True

In [None]:
# Weaviate Embedded (will run in your local runtime)

# import weaviate
# import os

# client = weaviate.connect_to_embedded(
#     headers={
#         "X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")  # Replace with your OpenAI key
#     }
# )

# client.is_ready()

In [None]:
# Connect to your local Weaviate instance deployed with Docker

# import weaviate
# import os

# client = weaviate.connect_to_local(
#     headers={
#         "X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")  # Replace with your OpenAI key
#     }
# )

# client.is_ready()

## Configure your Weaviate Schema
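If you re-run this notebook, the collection created in the next cell may already exist, and creating it a second time will typically fail. One way to handle that is a small helper that drops the collection first. This is a sketch rather than part of the original recipe: `drop_collection_if_exists` is a hypothetical name, and it relies on the `collections.exists` and `collections.delete` methods of the v4 Python client. Note that deleting a collection also deletes all the objects stored in it.

```python
def drop_collection_if_exists(client, name):
    """Delete the collection `name` if present; return True when it was deleted."""
    if client.collections.exists(name):
        # Deleting a collection removes every object stored in it.
        client.collections.delete(name)
        return True
    return False


# Usage with the client created above (uncomment to run):
# drop_collection_if_exists(client, "UnstructuredDemo")
```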

In [None]:
import weaviate.classes.config as wc

client.collections.create(
    name="UnstructuredDemo",

    vectorizer_config=wc.Configure.Vectorizer.text2vec_openai( # specify the vectorizer and model type you're using
        model="ada",
        model_version="002",
        type_="text"
    ),
    generative_config=wc.Configure.Generative.openai(
        model="gpt-4"  # Optional - Defaults to `gpt-3.5-turbo`
    ),

    # Weaviate can infer schema, but it is considered best practice to define it upfront
    properties=[
        wc.Property(name="type", data_type=wc.DataType.TEXT),
        wc.Property(name="element_id", data_type=wc.DataType.TEXT, skip_vectorization=True),
        wc.Property(name="text", data_type=wc.DataType.TEXT),
        wc.Property(name="embeddings", data_type=wc.DataType.NUMBER_ARRAY, skip_vectorization=True),
        wc.Property(name="metadata", data_type=wc.DataType.OBJECT, nested_properties=[
            wc.Property(name="filename", data_type=wc.DataType.TEXT),
            wc.Property(name="filetype", data_type=wc.DataType.TEXT),
            wc.Property(name="languages", data_type=wc.DataType.TEXT_ARRAY),
            wc.Property(name="page_number", data_type=wc.DataType.TEXT, skip_vectorization=True),
        ])
    ],
)

<weaviate.collections.collection.Collection at 0x7f19ab525b40>

## Grab Data from S3 Bucket

In [None]:
from unstructured.ingest.connector.fsspec.s3 import S3AccessConfig, SimpleS3Config
from unstructured.ingest.connector.weaviate import (
    SimpleWeaviateConfig,
    WeaviateAccessConfig,
    WeaviateWriteConfig,
)
from unstructured.ingest.interfaces import (
    ChunkingConfig,
    PartitionConfig,
    ProcessorConfig,
    ReadConfig,
    EmbeddingConfig,
)
from unstructured.ingest.runner import S3Runner
from unstructured.ingest.runner.writers.base_writer import Writer
from unstructured.ingest.runner.writers.weaviate import (
    WeaviateWriter,
)

In [None]:
def get_writer() -> Writer:
    return WeaviateWriter(
        connector_config=SimpleWeaviateConfig(
            access_config=WeaviateAccessConfig(api_key=APIKEY),
            host_url=URL,
            class_name="UnstructuredDemo",
        ),
        write_config=WeaviateWriteConfig(),
    )

writer = get_writer()

output_path = "s3-output"

runner = S3Runner(
    processor_config=ProcessorConfig(
        verbose=True,
        output_dir=output_path,
        num_processes=40, # when processing a large number of documents via Unstructured API, set a larger number of workers/processes here
        ),
    read_config=ReadConfig(),
    partition_config=PartitionConfig(
        partition_by_api=True,
        api_key=os.getenv("UNSTRUCTURED_API_KEY"), # get your Unstructured API key and URL here https://unstructured.io/api-key-hosted
        partition_endpoint=os.getenv("UNSTRUCTURED_URL"),
        ),
    connector_config=SimpleS3Config(
        access_config=S3AccessConfig(  # configure the authentication options for your S3 bucket
            key=os.getenv("AWS_KEY"),
            secret=os.getenv("AWS_SECRET"),
        ),
        remote_url=os.getenv("AWS_S3_NAME"),  # full S3 URL of the bucket, e.g. s3://<bucket-name>/
    ),
    chunking_config=ChunkingConfig(
        chunk_elements=True,
        chunking_strategy="by_title",
        max_characters=8192,  # the chunk size depends on the embedding model you use
        combine_text_under_n_chars=1000,  # Unstructured can combine small elements into a larger chunk if the result fits within max_characters
    ),
    embedding_config=EmbeddingConfig(
        provider="langchain-openai",
        api_key=os.getenv("OPENAI_API_KEY"),  # the embedding model should match the one defined for the Weaviate collection; here the default is text-embedding-ada-002
    ),

    writer=writer,
    writer_kwargs={},
    )

runner.run()


2024-06-19 23:27:58,298 MainProcess DEBUG    updating download directory to: /root/.cache/unstructured/ingest/s3/761c634451
2024-06-19 23:27:58,304 MainProcess INFO     running pipeline: DocFactory -> Reader -> Partitioner -> Chunker -> Embedder -> Writer -> Copier with config: {"reprocess": false, "verbose": true, "work_dir": "/root/.cache/unstructured/ingest/pipeline", "output_dir": "s3-output", "num_processes": 40, "raise_on_error": false}
2024-06-19 23:27:58,415 MainProcess INFO     Running doc factory to generate ingest docs. Source connector: {"processor_config": {"reprocess": false, "verbose": true, "work_dir": "/root/.cache/unstructured/ingest/pipeline", "output_dir": "s3-output", "num_processes": 40, "raise_on_error": false}, "read_config": {"download_dir": "/root/.cache/unstructured/ingest/s3/761c634451", "re_download": false, "preserve_downloads": false, "download_only": false, "max_docs": null}, "connector_config": {"remote_url": "s3://marias-rag-demo/", "uncompress": false
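To build intuition for what `chunking_strategy="by_title"`, `max_characters`, and `combine_text_under_n_chars` do in the pipeline above, here is a deliberately simplified sketch. It is not Unstructured's implementation: the function `chunk_by_title` below is a toy written for this explanation, the sample elements are made up, and it does not split sections that are longer than `max_characters`, which a real chunker would also need to do.

```python
def chunk_by_title(elements, max_characters=8192, combine_text_under_n_chars=1000):
    """Group element texts into chunks, starting a new section at each Title."""
    # Step 1: split the element stream into sections at every Title element.
    sections = []
    for element in elements:
        if element["type"] == "Title" or not sections:
            sections.append([])
        sections[-1].append(element["text"])

    # Step 2: join each section, merging it into the previous chunk when that
    # chunk is still small and the merged text fits within max_characters.
    chunks = []
    for section in sections:
        text = "\n\n".join(section)
        previous_is_small = bool(chunks) and len(chunks[-1]) < combine_text_under_n_chars
        if previous_is_small and len(chunks[-1]) + 2 + len(text) <= max_characters:
            chunks[-1] = chunks[-1] + "\n\n" + text
        else:
            chunks.append(text)
    return chunks


# Made-up sample elements, shaped like partitioned document elements.
elements = [
    {"type": "Title", "text": "Introduction"},
    {"type": "NarrativeText", "text": "Pests reduce crop yields."},
    {"type": "Title", "text": "Methods"},
    {"type": "NarrativeText", "text": "We compared two parasitoid species."},
]

# With tiny limits the two sections stay separate; with the values used in
# this notebook (8192 / 1000) they are merged into a single chunk.
print(len(chunk_by_title(elements, max_characters=60, combine_text_under_n_chars=10)))
print(len(chunk_by_title(elements)))
```

The takeaway is that a larger `combine_text_under_n_chars` yields fewer, longer chunks, while `max_characters` caps how large a merged chunk may grow.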

## Time to Search!

### Aggregate query

In [None]:
# count how many chunks are in the database

documents = client.collections.get("UnstructuredDemo")
response = documents.aggregate.over_all(total_count=True)

print(response.total_count)

794


### Hybrid search (mix of keyword and vector search)
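The `alpha` parameter in the query below balances the two result sets: `alpha=0` uses only the keyword (BM25) scores, `alpha=1` uses only the vector scores, and values in between blend them. The toy sketch here illustrates that idea with a weighted sum of min-max normalized scores. It is an illustration under simplifying assumptions, not Weaviate's fusion code: the function names and all the scores are made up, and Weaviate's own fusion algorithms differ in detail.

```python
def normalize(scores):
    """Min-max scale a {doc_id: score} mapping into the range [0, 1]."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def fuse(keyword_scores, vector_scores, alpha=0.5):
    """Blend two score sets: alpha weights vector scores, (1 - alpha) keyword scores."""
    keyword = normalize(keyword_scores)
    vector = normalize(vector_scores)
    doc_ids = set(keyword) | set(vector)
    fused = {
        doc_id: alpha * vector.get(doc_id, 0.0) + (1 - alpha) * keyword.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Highest fused score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


keyword_scores = {"a": 2.0, "b": 1.0, "c": 0.0}  # made-up keyword scores
vector_scores = {"a": 0.1, "b": 0.9, "c": 0.5}   # made-up similarity scores

for alpha in (0.0, 0.5, 1.0):
    ranking = [doc_id for doc_id, _ in fuse(keyword_scores, vector_scores, alpha)]
    print(alpha, ranking)
```

In this example document `a` wins on keywords alone and `b` wins on vectors alone, so moving `alpha` changes which one ranks first.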

In [None]:
import json

documents = client.collections.get("UnstructuredDemo")

response = documents.query.hybrid(
    query="types of biological pest control",
    alpha=0.5, # equal weighting of BM25 and vector search
    return_properties=['text'],
    auto_limit=2  # autocut after 2 jumps
)

for obj in response.objects:
    print(json.dumps(obj.properties, indent=2))

{
  "text": "6. Conclusions\n\nBiological pest control is a sustainable practice that benefits food production and health. Despite all the biological and environmental appeal for the development of these studies, there is also a need for adequate statistical methodologies to confirm the scientific hypotheses. Interactions between species, as well as changes in behaviour over time require specific methods of analysis to estimate the biological control efficiency of a species, and models for categorical longitudinal data are very useful in this context. In this work, we presented the problem of the soybean pest Euschistus heros and two potential agents for natural control in the field. As a statistical contribution, we developed an extension of multi-state models to compare two parasitoid species by evaluating their behaviours over time. These models allow not only to describe behavioural actions but also the intensity with which they occur. In this context, the method validated the expe

### Vector search
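A `near_text` query embeds the query string and returns the objects whose vectors are closest to it. The sketch below shows that ranking step in plain Python, assuming cosine distance as the metric. The vectors are made up and three-dimensional for readability, and `cosine_distance` and `nearest` are illustrative helpers, not client functions.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest(query_vector, vectors, limit=5):
    """Return up to `limit` (doc_id, distance) pairs, closest first."""
    ranked = sorted(
        ((doc_id, cosine_distance(query_vector, vec)) for doc_id, vec in vectors.items()),
        key=lambda item: item[1],
    )
    return ranked[:limit]


# Made-up vectors; real embeddings have many more dimensions.
vectors = {
    "pest_control": [0.9, 0.1, 0.0],
    "soil_health": [0.5, 0.5, 0.1],
    "tax_law": [0.0, 0.1, 0.9],
}
query_vector = [1.0, 0.0, 0.0]

for doc_id, distance in nearest(query_vector, vectors, limit=2):
    print(doc_id, round(distance, 3))
```

The `limit` argument plays the same role as `limit=5` in the query below: it caps how many of the closest objects are returned.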

In [None]:
documents = client.collections.get("UnstructuredDemo")

response = documents.query.near_text(
    query="types of biological pest control",
    return_properties=['text'],
    limit=5  # return at most 5 objects
)

for obj in response.objects:
    print(json.dumps(obj.properties, indent=2))

{
  "text": "6. Conclusions\n\nBiological pest control is a sustainable practice that benefits food production and health. Despite all the biological and environmental appeal for the development of these studies, there is also a need for adequate statistical methodologies to confirm the scientific hypotheses. Interactions between species, as well as changes in behaviour over time require specific methods of analysis to estimate the biological control efficiency of a species, and models for categorical longitudinal data are very useful in this context. In this work, we presented the problem of the soybean pest Euschistus heros and two potential agents for natural control in the field. As a statistical contribution, we developed an extension of multi-state models to compare two parasitoid species by evaluating their behaviours over time. These models allow not only to describe behavioural actions but also the intensity with which they occur. In this context, the method validated the expe

### Generative search

In [None]:
generateTask = "Please write a short ad on how customers can fight against aphids in their garden."

documents = client.collections.get("UnstructuredDemo")
response = documents.generate.near_text(
    query="types of biological pest control",
    limit=5,
    grouped_task=generateTask
)

print(response.generated)

"Are aphids wreaking havoc in your garden? Don't let these pests ruin your beautiful plants! With our sustainable and effective biological pest control methods, you can fight back against aphids and reclaim your garden. Our methods are not only efficient but also environmentally friendly, reducing the need for harmful chemical pesticides. We also provide comprehensive guidance on integrated pest management, helping you understand the behavior of pests and the best ways to control them. Don't let aphids take over your garden. Contact us today and let's fight back together!"
