[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/weaviate/recipes/blob/main/integrations/llm-agent-frameworks/llamaindex/data-loaders-episode1/episode1.ipynb)

# LlamaIndex Episode 1 🦙

## Overview

* What is LlamaIndex?
    * LlamaHub (data loaders)

* How to set up Weaviate
    * Create schema

* Adding data to Weaviate using LlamaIndex
    * Data loader examples

* Chunking up your data

* Connecting a Weaviate instance to LlamaIndex

* Simple query engine

## What is [LlamaIndex](https://www.llamaindex.ai/)?

#### Framework that enables you to connect LLMs and storage providers together seamlessly.
#### LlamaIndex 🤝 Weaviate ➡ Ultimate RAG stack

#### [LlamaHub](https://llama-hub-ui.vercel.app/): Enables you to connect to a number of external data sources (Notion, Slack, Web pages, and more!)

## Setting up Weaviate

We first need to initialize a Weaviate client and hand it over to LlamaIndex. You can do that in different ways:

1. Embedded - Runs a local Weaviate instance from your Python process. Works on Linux and Mac.

2. WCD - Connects to Weaviate Cloud. You can spin up a free sandbox cluster at https://console.weaviate.cloud/ and get the URL and API key.

3. Docker - Runs Weaviate in a Docker container. You can use our [docker configurator tool](https://weaviate.io/developers/weaviate/installation/docker-compose#configurator) to get started.

Let's first install the Weaviate client and llama-index, along with some other dependencies:

In [None]:
%pip install -U weaviate-client llama-index llama-index-vector-stores-weaviate llama-index-embeddings-openai

In [1]:
# let's catch some logs
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
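
A note on the logging setup above: `basicConfig` already attaches one stdout handler to the root logger, and the extra `addHandler` call attaches a second one, which is why each log line in the outputs of this notebook appears twice. If you prefer a single line per event, a minimal sketch (assuming Python 3.8+ for the `force` argument) could look like this:

```python
import logging
import sys

# force=True (Python 3.8+) removes handlers already attached to the root
# logger, so re-running this cell does not stack up duplicate handlers.
logging.basicConfig(stream=sys.stdout, level=logging.INFO, force=True)

root_logger = logging.getLogger()
print("Root handlers:", len(root_logger.handlers))
```

This only changes how logs are displayed; the rest of the notebook behaves the same either way.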

### Embedded

In [2]:
import weaviate

client = weaviate.connect_to_embedded()

INFO:weaviate-client:Started /Users/dudanogueira/.cache/weaviate-embedded: process ID 80646
Started /Users/dudanogueira/.cache/weaviate-embedded: process ID 80646


{"action":"startup","build_git_commit":"ab0312d5d","build_go_version":"go1.23.1","build_image_tag":"localhost","build_wv_version":"1.26.6","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2024-11-13T15:10:29-03:00"}
{"action":"startup","auto_schema_enabled":true,"build_git_commit":"ab0312d5d","build_go_version":"go1.23.1","build_image_tag":"localhost","build_wv_version":"1.26.6","level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2024-11-13T15:10:29-03:00"}
{"build_git_commit":"ab0312d5d","build_go_version":"go1.23.1","build_image_tag":"localhost","build_wv_version":"1.26.6","level":"info","msg":"No resource limits set, weaviate will use all available memory and CPU. To limit resources, set LIMIT_RESOURCES=true","time":"2024-11-13T15:10:29-03:00"}
{"build_git_commit":"ab0312d5d","build_go_versio

INFO:httpx:HTTP Request: GET http://localhost:8079/v1/.well-known/openid-configuration "HTTP/1.1 404 Not Found"
HTTP Request: GET http://localhost:8079/v1/.well-known/openid-configuration "HTTP/1.1 404 Not Found"
INFO:httpx:HTTP Request: GET http://localhost:8079/v1/meta "HTTP/1.1 200 OK"
HTTP Request: GET http://localhost:8079/v1/meta "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET http://localhost:8079/v1/.well-known/ready "HTTP/1.1 200 OK"
HTTP Request: GET http://localhost:8079/v1/.well-known/ready "HTTP/1.1 200 OK"


{"build_git_commit":"ab0312d5d","build_go_version":"go1.23.1","build_image_tag":"localhost","build_wv_version":"1.26.6","docker_image_tag":"localhost","level":"info","msg":"configured versions","server_version":"1.26.6","time":"2024-11-13T15:10:31-03:00"}
{"action":"grpc_startup","build_git_commit":"ab0312d5d","build_go_version":"go1.23.1","build_image_tag":"localhost","build_wv_version":"1.26.6","level":"info","msg":"grpc server listening at [::]:50050","time":"2024-11-13T15:10:31-03:00"}
{"address":"192.168.28.127:50968","build_git_commit":"ab0312d5d","build_go_version":"go1.23.1","build_image_tag":"localhost","build_wv_version":"1.26.6","level":"info","msg":"current Leader","time":"2024-11-13T15:10:31-03:00"}
{"build_git_commit":"ab0312d5d","build_go_version":"go1.23.1","build_image_tag":"localhost","build_wv_version":"1.26.6","level":"info","msg":"attempting to join","remoteNodes":["192.168.28.127:50968"],"time":"2024-11-13T15:10:31-03:00"}
{"action":"raft","build_git_commit":"ab03

INFO:httpx:HTTP Request: GET https://pypi.org/pypi/weaviate-client/json "HTTP/1.1 200 OK"
HTTP Request: GET https://pypi.org/pypi/weaviate-client/json "HTTP/1.1 200 OK"


{"action":"lsm_recover_from_active_wal_success","build_git_commit":"ab0312d5d","build_go_version":"go1.23.1","build_image_tag":"localhost","build_wv_version":"1.26.6","class":"BlogPost","index":"blogpost","level":"info","msg":"successfully recovered from write-ahead-log","path":"/Users/dudanogueira/.local/share/weaviate/blogpost/sC3Y5dmAGch9/lsm/property_file_name/segment-1731521330669499000.wal","shard":"sC3Y5dmAGch9","time":"2024-11-13T15:10:32-03:00"}
{"action":"lsm_recover_from_active_wal_success","build_git_commit":"ab0312d5d","build_go_version":"go1.23.1","build_image_tag":"localhost","build_wv_version":"1.26.6","class":"BlogPost","index":"blogpost","level":"info","msg":"successfully recovered from write-ahead-log","path":"/Users/dudanogueira/.local/share/weaviate/blogpost/sC3Y5dmAGch9/lsm/property_creation_date/segment-1731521330669381000.wal","shard":"sC3Y5dmAGch9","time":"2024-11-13T15:10:32-03:00"}
{"action":"lsm_recover_from_active_wal_success","build_git_commit":"ab0312d5d"

In [None]:
# let's check the connection by getting the server version
print(f"Client: {weaviate.__version__}, Server: {client.get_meta().get('version')}")

### WCD

In [None]:
import weaviate
import os
  
# Set these environment variables
URL = os.getenv("WCD_URL")
APIKEY = os.getenv("WCD_API_KEY")
  
# Connect to a WCD instance
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=URL,
    auth_credentials=weaviate.auth.AuthApiKey(APIKEY)
)

### Docker

To run Weaviate with Docker, you can use our [docker configurator tool](https://weaviate.io/developers/weaviate/installation/docker-compose#configurator).

Once you have Weaviate running in Docker, you can get the client with:

In [None]:
import weaviate
from weaviate import classes as wvc
  
# Connect to a local instance
client = weaviate.connect_to_local()
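
If you switch between these three options often, one way to avoid editing cells is to pick the connection from environment variables. The sketch below is only an illustration: `WCD_URL` and `WCD_API_KEY` are the variables from the WCD cell above, while `WEAVIATE_USE_DOCKER` and the helper names `choose_connection_mode` and `connect` are made up for this example.

```python
import os
from typing import Mapping


def choose_connection_mode(env: Mapping[str, str]) -> str:
    """Pick a connection mode from environment variables.

    WCD_URL / WCD_API_KEY come from the WCD cell in this notebook.
    WEAVIATE_USE_DOCKER is a hypothetical flag invented for this sketch.
    """
    if env.get("WCD_URL") and env.get("WCD_API_KEY"):
        return "wcd"
    if env.get("WEAVIATE_USE_DOCKER") == "1":
        return "docker"
    return "embedded"


def connect(env: Mapping[str, str] = os.environ):
    """Return a Weaviate client for the chosen mode."""
    # Imported lazily so the mode logic above can run without the client installed.
    import weaviate

    mode = choose_connection_mode(env)
    if mode == "wcd":
        return weaviate.connect_to_weaviate_cloud(
            cluster_url=env["WCD_URL"],
            auth_credentials=weaviate.auth.AuthApiKey(env["WCD_API_KEY"]),
        )
    if mode == "docker":
        return weaviate.connect_to_local()
    return weaviate.connect_to_embedded()


# With no variables set, the sketch falls back to the embedded option.
print(choose_connection_mode({}))
```

Calling `client = connect()` would then replace whichever of the three connection cells you would otherwise run.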

### Collection
Let's create our collection beforehand and specify the embedding model to use. This must be the same model that LlamaIndex is configured with.

In [3]:
from weaviate import classes as wvc
# clean slate
client.collections.delete("BlogPost")

collection = client.collections.create(
    name="BlogPost",
    description="Blog post from the Weaviate website.",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(
        model="text-embedding-3-small"
    ),
    generative_config=wvc.config.Configure.Generative.openai(
        model="gpt-3.5-turbo"
    ),
    properties=[
        wvc.config.Property(name="text", description="Content from the blog post", data_type=wvc.config.DataType.TEXT)
    ]
)

print("Collection was created.")

INFO:httpx:HTTP Request: DELETE http://localhost:8079/v1/schema/BlogPost "HTTP/1.1 200 OK"
HTTP Request: DELETE http://localhost:8079/v1/schema/BlogPost "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:8079/v1/schema "HTTP/1.1 200 OK"
HTTP Request: POST http://localhost:8079/v1/schema "HTTP/1.1 200 OK"
Collection was created.


{"action":"hnsw_prefill_cache_async","build_git_commit":"ab0312d5d","build_go_version":"go1.23.1","build_image_tag":"localhost","build_wv_version":"1.26.6","level":"info","msg":"not waiting for vector cache prefill, running in background","time":"2024-11-13T15:10:39-03:00","wait_for_cache_prefill":false}
{"build_git_commit":"ab0312d5d","build_go_version":"go1.23.1","build_image_tag":"localhost","build_wv_version":"1.26.6","level":"info","msg":"Created shard blogpost_N8UvMi6gvKWy in 2.219667ms","time":"2024-11-13T15:10:39-03:00"}
{"action":"hnsw_vector_cache_prefill","build_git_commit":"ab0312d5d","build_go_version":"go1.23.1","build_image_tag":"localhost","build_wv_version":"1.26.6","count":1000,"index_id":"main","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2024-11-13T15:10:39-03:00","took":68959}
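
Vectors produced by different embedding models are not comparable, so the model named in the collection has to agree with the one LlamaIndex uses when it embeds documents and queries. One simple way to keep the two in sync is to read the name from a single constant and fail early on a mismatch. This is a sketch only: `EMBED_MODEL` and `check_models_match` are hypothetical names and are not part of Weaviate or LlamaIndex.

```python
# Single source of truth for the embedding model name (value taken from the
# collection definition in this notebook).
EMBED_MODEL = "text-embedding-3-small"


def check_models_match(weaviate_model: str, llamaindex_model: str) -> None:
    """Raise early if the two sides are configured with different models.

    Hypothetical helper for this sketch; not a Weaviate or LlamaIndex API.
    """
    if weaviate_model != llamaindex_model:
        raise ValueError(
            f"Embedding model mismatch: Weaviate uses {weaviate_model!r}, "
            f"LlamaIndex uses {llamaindex_model!r}"
        )


# Both sides read the same constant, so the check passes silently.
check_models_match(EMBED_MODEL, EMBED_MODEL)
print("Embedding model in use:", EMBED_MODEL)
```

In practice you would pass `EMBED_MODEL` both to the collection's vectorizer configuration and to the LlamaIndex embedding settings.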


## Adding Data to Weaviate using LlamaIndex

### SimpleDirectoryReader: Read files in your filesystem

In [4]:
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader('./data').load_data()

### SimpleWebPageReader: Web scraper that turns HTML to text

In [None]:
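# Requires: %pip install llama-index-readers-web html2text
from llama_index.readers.web import SimpleWebPageReader

# html_to_text=True converts the fetched HTML to plain text (uses html2text)
loader = SimpleWebPageReader(html_to_text=True)
documents = loader.load_data(urls=['https://weaviate.io/blog/llamaindex-and-weaviate'])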

### NotionPageReader: Loads documents from Notion

In [None]:
%pip install llama-index-readers-notion

In [None]:
from llama_index.readers.notion import NotionPageReader

integration_token = "secret_key"  # replace with your Notion integration token
page_ids = ["40be241cac924a5aa887fa85e945dbf6"]
reader = NotionPageReader(integration_token=integration_token)
documents = reader.load_data(page_ids=page_ids)

### Inspecting the nodes of your documents

In [5]:
from llama_index.core.node_parser import SimpleNodeParser

parser = SimpleNodeParser()
nodes = parser.get_nodes_from_documents(documents)
print("Number of nodes:", len(nodes))
print(nodes[0])

Number of nodes: 9
Node ID: 3777fe8b-7230-4cf3-a700-3089b9e093e2
Text: title: What is Ref2Vec and why you need it for your
recommendation system  Weaviate 1.16 introduced the
[Ref2Vec](/developers/weaviate/modules/retriever-vectorizer-
modules/ref2vec-centroid) module. In this article, we give you an
overview of what Ref2Vec is and some examples in which it can add
value such as recommendations or representing long ...
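
To build some intuition for what "chunking" means here, the following is a deliberately simplified, word-based splitter with overlap. It is not how LlamaIndex's node parser is implemented (the real parser is token- and sentence-aware); `chunk_words`, `chunk_size`, and `overlap` are names chosen for this sketch.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into chunks of at most chunk_size words.

    Consecutive chunks share `overlap` words so that context spanning a
    chunk boundary is not lost entirely.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


# Twelve placeholder words, split into chunks of 5 with 2 words of overlap.
sample = " ".join(f"w{i}" for i in range(12))
for chunk in chunk_words(sample, chunk_size=5, overlap=2):
    print(chunk)
```

Smaller chunks make retrieval more precise but give the LLM less surrounding context; the overlap is a common compromise between the two.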


### Documents to Weaviate

In [6]:
from llama_index.vector_stores.weaviate import WeaviateVectorStore
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, StorageContext, Settings
from llama_index.embeddings.openai import OpenAIEmbedding
import openai
import os

# Global embedding model; must match the model configured on the BlogPost collection
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Let's set the OpenAI key
# os.environ["OPENAI_API_KEY"] = "sk-key"
openai.api_key = os.environ["OPENAI_API_KEY"]

# loading the documents
documents = SimpleDirectoryReader("./data/").load_data()

# Let's name our index properly as BlogPost, as we will need it later.
vector_store = WeaviateVectorStore(
    weaviate_client=client, index_name="BlogPost"
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

INFO:httpx:HTTP Request: GET http://localhost:8079/v1/schema/BlogPost "HTTP/1.1 200 OK"
HTTP Request: GET http://localhost:8079/v1/schema/BlogPost "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET http://localhost:8079/v1/schema "HTTP/1.1 200 OK"
HTTP Request: GET http://localhost:8079/v1/schema "HTTP/1.1 200 OK"


INFO:httpx:HTTP Request: GET http://localhost:8079/v1/nodes "HTTP/1.1 200 OK"
HTTP Request: GET http://localhost:8079/v1/nodes "HTTP/1.1 200 OK"


In [7]:
# Let's check if the objects were created
collection = client.collections.get("BlogPost")
query = collection.query.fetch_objects()
if query.objects:
    print("Objects in this collection:", len(collection))
    print("Object properties example:", query.objects[0].properties)
else:
    print("No objects found in this collection.")

INFO:httpx:HTTP Request: POST http://localhost:8079/v1/graphql "HTTP/1.1 200 OK"
HTTP Request: POST http://localhost:8079/v1/graphql "HTTP/1.1 200 OK"
Objects in this collection: 9
Object properties example: {'file_size': 11641.0, 'last_modified_date': '2024-06-05', '_node_type': 'TextNode', 'text': 'title: What is Ref2Vec and why you need it for your recommendation system\n\nWeaviate 1.16 introduced the [Ref2Vec](/developers/weaviate/modules/retriever-vectorizer-modules/ref2vec-centroid) module. In this article, we give you an overview of what Ref2Vec is and some examples in which it can add value such as recommendations or representing long objects.\n\n## What is Ref2Vec?\nThe name Ref2Vec is short for reference-to-vector, and it offers the ability to vectorize a data object with its cross-references to other objects. The Ref2Vec module currently holds the name ref2vec-**centroid** because it uses the average, or centroid vector, of the cross-referenced vectors to represent the **ref

### Query in LlamaIndex

In [8]:
from llama_index.vector_stores.weaviate import WeaviateVectorStore
from llama_index.core import VectorStoreIndex

vector_store = WeaviateVectorStore(
    weaviate_client=client, index_name="BlogPost"
)

loaded_index = VectorStoreIndex.from_vector_store(vector_store)

query_engine = loaded_index.as_query_engine()
response = query_engine.query("What is the intersection between LLMs and search?")
print(response)

INFO:httpx:HTTP Request: GET http://localhost:8079/v1/schema/BlogPost "HTTP/1.1 200 OK"
HTTP Request: GET http://localhost:8079/v1/schema/BlogPost "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET http://localhost:8079/v1/schema/BlogPost "HTTP/1.1 200 OK"
HTTP Request: GET http://localhost:8079/v1/schema/BlogPost "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET http://localhost:8079/v1/schema/BlogPost "HTTP/1.1 200 OK"
HTTP Request: GET http://localhost:8079/v1/schema/BlogPost "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
The intersection between LLMs and search lies in finding suitable representations for long objects, particularly text documents that exceed the 512 token input limit on Deep T
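
Conceptually, the query engine embeds the question, retrieves the most similar nodes from the vector store, and passes them to the LLM as context (you can see the embeddings and chat completions requests in the log above). The toy sketch below illustrates only the retrieval and prompt-assembly idea, using made-up 3-dimensional vectors instead of real embeddings. The functions `cosine_similarity` and `retrieve`, the sample texts, and the prompt layout are all assumptions for illustration, not LlamaIndex or Weaviate internals.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(
    query_vec: Sequence[float],
    store: Dict[str, Sequence[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the top_k (text, score) pairs, most similar first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up texts and 3-dimensional "embeddings", for illustration only.
toy_store = {
    "Ref2Vec represents objects by their references": [0.9, 0.1, 0.0],
    "Docker Compose starts Weaviate locally": [0.0, 0.2, 0.9],
    "LLMs can rerank search results": [0.7, 0.6, 0.1],
}
query_vector = [1.0, 0.2, 0.0]

hits = retrieve(query_vector, toy_store, top_k=2)
context = "\n".join(text for text, _ in hits)
question = "What is the intersection between LLMs and search?"
prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
print(prompt)
```

In the real pipeline the vectors come from the embedding model and the similarity search runs inside Weaviate; only the overall shape of the flow is the same.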

In [None]:
# let's close the client
client.close()

{"action":"restapi_management","build_git_commit":"ab0312d5d","build_go_version":"go1.23.1","build_image_tag":"localhost","build_wv_version":"1.26.6","docker_image_tag":"localhost","level":"info","msg":"Shutting down... ","time":"2024-11-13T15:11:21-03:00"}
{"action":"restapi_management","build_git_commit":"ab0312d5d","build_go_version":"go1.23.1","build_image_tag":"localhost","build_wv_version":"1.26.6","docker_image_tag":"localhost","level":"info","msg":"Stopped serving weaviate at http://127.0.0.1:8079","time":"2024-11-13T15:11:21-03:00"}
{"action":"telemetry_push","build_git_commit":"ab0312d5d","build_go_version":"go1.23.1","build_image_tag":"localhost","build_wv_version":"1.26.6","level":"info","msg":"telemetry terminated","payload":"\u0026{MachineID:1a57be6c-3d4e-4f09-8c26-dc8e8029edcb Type:TERMINATE Version:1.26.6 NumObjects:0 OS:darwin Arch:arm64 UsedModules:[generative-openai text2vec-openai]}","time":"2024-11-13T15:11:22-03:00"}
{"build_git_commit":"ab0312d5d","build_go_versi
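
If a cell fails midway, the `client.close()` call above may never run. A common pattern is to wrap the work in `try`/`finally` so the client is always closed. The sketch below uses a stand-in `FakeClient` class so it runs without a Weaviate server; `FakeClient` and `run_with_client` are hypothetical names for this example.

```python
class FakeClient:
    """Stand-in for a Weaviate client so the sketch runs without a server."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


def run_with_client(client, work):
    """Run work(client) and always close the client afterwards."""
    try:
        return work(client)
    finally:
        client.close()


fake = FakeClient()
try:
    run_with_client(fake, lambda c: 1 / 0)  # simulate a failing cell
except ZeroDivisionError:
    pass
print("Closed after error:", fake.closed)
```

With a real client you would pass the object returned by one of the `weaviate.connect_to_*` calls in place of `FakeClient()`.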