# Weaviate: Store, index, and search with a vector database

Welcome to this introductory tutorial on Weaviate! Weaviate is an open-source vector database that allows you to store, index, and search through your data based on its semantic meaning. It's a powerful tool for building AI-powered applications like semantic search engines, recommendation systems, and question-answering bots.

In this notebook, we'll cover the most important concepts of Weaviate, using **Weaviate Cloud** for hosting and **OpenAI** for vectorization and generative tasks. By the end of this tutorial, you'll have a solid understanding of how to use Weaviate for your own projects.

## 1. Setup

First, let's install the necessary libraries and set up our environment. We'll need the `weaviate-client` library to interact with Weaviate and `python-dotenv` to manage our API keys securely.

In [None]:
!pip install weaviate-client python-dotenv

Next, create a `.env` file in the same directory as this notebook and add your Weaviate Cloud and OpenAI credentials, using the three variable names shown below (without the leading `#`).

You can get your Weaviate Cloud URL and API Key from the [Weaviate Cloud Console](https://console.weaviate.cloud/) after creating a free sandbox cluster.

In [4]:
#OPENAI_API_KEY=YOUR_OPENAI_API_KEY
#WEAVIATE_CLUSTER_URL=YOUR_WEAVIATE_CLUSTER_URL
#WEAVIATE_API_KEY=YOUR_WEAVIATE_API_KEY

In [5]:
import weaviate
import os
import json
from dotenv import load_dotenv

load_dotenv()

# Get credentials from .env file
openai_api_key = os.getenv("OPENAI_API_KEY")
weaviate_cluster_url = os.getenv("WEAVIATE_CLUSTER_URL")
weaviate_api_key = os.getenv("WEAVIATE_API_KEY")

# Connect to your Weaviate Cloud cluster
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=weaviate_cluster_url,
    auth_credentials=weaviate.auth.AuthApiKey(weaviate_api_key),
    headers={
        "X-OpenAI-Api-Key": openai_api_key
    }
)

print("Client is ready:", client.is_ready())

Client is ready: True
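
If the connection fails instead, a reasonable first check is whether all three variables were actually loaded from `.env`. The helper below is a minimal sketch using only the standard library; `missing_env_vars` is a hypothetical name chosen for illustration, not part of `python-dotenv` or the Weaviate client. It assumes `load_dotenv()` has already run.

```python
import os

# Variable names used in this notebook's .env file
REQUIRED_VARS = ("OPENAI_API_KEY", "WEAVIATE_CLUSTER_URL", "WEAVIATE_API_KEY")

def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given mapping (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All credentials are set.")
```

An empty string counts as missing here, which catches the common case of a variable that is declared in `.env` but never filled in.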


## 2. Schema Configuration

In Weaviate, you organize your data into **collections** (previously called classes), which are similar to tables in a relational database. Each collection has a set of **properties** that define the structure of your data objects. When you define a collection, you can also specify a **vectorizer**, a module that calls an embedding model to convert your data into vector embeddings.

Let's create a collection called `Article` with properties for `title`, `content`, and `url` using the OpenAI vectorizer.

In [7]:
collection_name = "Article"

if client.collections.exists(collection_name):
    client.collections.delete(collection_name)

# Helper classes for configuring the vectorizer, the generative module, and the properties
from weaviate.classes.config import Configure, Property, DataType, Tokenization

articles = client.collections.create(
    name=collection_name,
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    generative_config=Configure.Generative.openai(),
    properties=[
        Property(
            name="title",
            data_type=DataType.TEXT
        ),
        Property(
            name="content",
            data_type=DataType.TEXT
        ),
        Property(
            name="url",
            data_type=DataType.TEXT,
            tokenization=Tokenization.WHITESPACE # Use a simple tokenizer for URLs
        )
    ]
)

print(f"'{collection_name}' collection created successfully.")

'Article' collection created successfully.


## 3. Data Import

Now that we have our collection, let's import some data. We'll use a small dataset of articles. For efficient data ingestion, it's highly recommended to use **batching**.

In [8]:
data = [
    {"title": "The Future of AI", "content": "Artificial intelligence is rapidly evolving, with new breakthroughs happening every day.", "url": "http://example.com/ai"},
    {"title": "Introduction to Vector Databases", "content": "Vector databases are designed to store and query high-dimensional data like vector embeddings.", "url": "http://example.com/vector-db"},
    {"title": "The Rise of Large Language Models", "content": "Large language models, or LLMs, are a type of AI that can understand and generate human-like text.", "url": "http://example.com/llms"},
    {"title": "Exploring the Cosmos", "content": "Space exploration continues to push the boundaries of our knowledge about the universe.", "url": "http://example.com/space"},
]

with articles.batch.dynamic() as batch:
    for item in data:
        batch.add_object(
            properties=item
        )

print("Data imported successfully.")
# Count objects with an aggregation instead of fetching them all
total = articles.aggregate.over_all(total_count=True).total_count
print(f"There are {total} articles in the collection.")

Data imported successfully.
There are 4 articles in the collection.
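
Batch imports can partially fail (for example, if the vectorizer rejects a request), so it is worth inspecting the errors rather than relying on the success message alone. The sketch below assumes the v4 Python client, where `collection.batch.failed_objects` is a list of error records that each expose a `message` attribute. `summarize_batch_failures` is a hypothetical helper, and the stand-in object at the end exists only so the block runs without a cluster.

```python
from types import SimpleNamespace

def summarize_batch_failures(collection, max_shown=5):
    """Print up to `max_shown` batch error messages and return the total failure count."""
    # Assumption: the client fills this list once the batch context manager exits
    failed = collection.batch.failed_objects
    for error in failed[:max_shown]:
        print("Failed object:", error.message)
    return len(failed)

# Stand-in for a real collection, for demonstration only
fake_collection = SimpleNamespace(
    batch=SimpleNamespace(failed_objects=[SimpleNamespace(message="example error")])
)
print("Failures:", summarize_batch_failures(fake_collection))
```

In the notebook, you would call `summarize_batch_failures(articles)` right after the `with articles.batch.dynamic()` block.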


## 4. Vector Search

The core feature of Weaviate is **vector search**, which allows you to find objects based on their semantic similarity. You can perform a vector search using a text query, another object's vector, or, with a multimodal vectorizer, an image. Here we use a text query, which the OpenAI vectorizer embeds for us.

In [9]:
response = articles.query.near_text(
    query="What are language models?",
    limit=2
)

for obj in response.objects:
    print(json.dumps(obj.properties, indent=2))

{
  "url": "http://example.com/llms",
  "title": "The Rise of Large Language Models",
  "content": "Large language models, or LLMs, are a type of AI that can understand and generate human-like text."
}
{
  "url": "http://example.com/vector-db",
  "title": "Introduction to Vector Databases",
  "content": "Vector databases are designed to store and query high-dimensional data like vector embeddings."
}
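
To build intuition for what "semantic similarity" means here, the following sketch ranks a few hand-made vectors by cosine similarity, which is Weaviate's default distance metric. The three-dimensional vectors are invented for illustration; real OpenAI embeddings have far more dimensions, and Weaviate uses an approximate index rather than the brute-force scan shown here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings (made up for illustration, not real model output)
toy_vectors = {
    "The Rise of Large Language Models": [0.9, 0.1, 0.0],
    "Introduction to Vector Databases": [0.6, 0.7, 0.1],
    "Exploring the Cosmos": [0.0, 0.1, 0.9],
}
query_vector = [1.0, 0.2, 0.0]  # stands in for the embedded query text

def top_k(query, vectors, k=2):
    """Return the k (title, score) pairs most similar to the query, best first."""
    scored = [(title, cosine_similarity(query, vec)) for title, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

for title, score in top_k(query_vector, toy_vectors):
    print(f"{score:.3f}  {title}")
```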


## 5. Generative Search

Weaviate can also perform **generative search** (also known as Retrieval-Augmented Generation or RAG), where it combines the search results with a large language model to generate a direct answer to your query. This is a powerful feature for building question-answering systems.

In [10]:
response = articles.generate.near_text(
    query="Explain vector databases in one sentence",
    limit=2,
    single_prompt="Explain the following article in one sentence: {content}"
)

for obj in response.objects:
    print("--- Original Content ---")
    print(obj.properties['content'])
    print("--- Generated Summary ---")
    print(obj.generated)
    print("\n")

--- Original Content ---
Vector databases are designed to store and query high-dimensional data like vector embeddings.
--- Generated Summary ---
Vector databases store high-dimensional vectors (like embeddings) and use specialized indexes and similarity search algorithms (e.g., approximate nearest neighbors) to quickly retrieve items based on semantic or distance-based similarity for tasks such as semantic search, recommendation, and retrieval-augmented generation.


--- Original Content ---
Large language models, or LLMs, are a type of AI that can understand and generate human-like text.
--- Generated Summary ---
Large language models (LLMs) are AI systems trained on vast amounts of text to understand context and generate fluent, human-like language.




You can also use a **grouped task** to get a single, consolidated answer based on all the search results.

In [11]:
response = articles.generate.near_text(
    query="What is the common theme between these articles?",
    limit=3,
    grouped_task="Based on the following articles, what is the main topic?"
)

print(response.generated)

The main topic is artificial intelligence — specifically AI and large language models (two articles focus on AI/LLMs, one is about space).
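
The difference between the two modes is easiest to see in how prompts are assembled: `single_prompt` produces one prompt per retrieved object, while `grouped_task` produces one prompt for the whole result set. The sketch below is a conceptual illustration only. Weaviate builds the actual prompts server-side and its exact format may differ; `build_single_prompts` and `build_grouped_prompt` are hypothetical names.

```python
import json

def build_single_prompts(template, objects):
    """One prompt per object: fill {property} placeholders from that object's properties."""
    return [template.format(**obj) for obj in objects]

def build_grouped_prompt(task, objects):
    """One prompt in total: the task followed by every retrieved object as context."""
    context = "\n".join(json.dumps(obj) for obj in objects)
    return f"{task}\n\n{context}"

retrieved = [
    {"title": "Introduction to Vector Databases",
     "content": "Vector databases are designed to store and query high-dimensional data like vector embeddings."},
    {"title": "The Rise of Large Language Models",
     "content": "Large language models, or LLMs, are a type of AI that can understand and generate human-like text."},
]

single = build_single_prompts("Explain the following article in one sentence: {content}", retrieved)
grouped = build_grouped_prompt("Based on the following articles, what is the main topic?", retrieved)

print(len(single), "single prompts")
print(grouped)
```

This is why `obj.generated` is read per object in the first example, while the grouped example reads a single `response.generated`.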


## 6. Filtering

You can combine vector search with **filters** to narrow down your results based on specific criteria. This allows you to perform complex queries that leverage both semantic and scalar information.

In [12]:
from weaviate.classes.query import Filter

response = articles.query.near_text(
    query="artificial intelligence",
    limit=2,
    filters=Filter.by_property("title").like("*AI*") # Find articles with 'AI' in the title
)

for obj in response.objects:
    print(json.dumps(obj.properties, indent=2))

{
  "url": "http://example.com/ai",
  "title": "The Future of AI",
  "content": "Artificial intelligence is rapidly evolving, with new breakthroughs happening every day."
}
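
Conceptually, a filtered vector search first restricts the candidates to objects that pass the filter and only then ranks the survivors by similarity (Weaviate applies filters before the vector search). The sketch below shows that order of operations on toy data; the vectors are invented, the predicate is a simplified stand-in for the `like` filter, and `filtered_search` is a hypothetical helper rather than a client API.

```python
def dot(a, b):
    """Dot product, used here as a simple similarity score."""
    return sum(x * y for x, y in zip(a, b))

def filtered_search(objects, query_vector, predicate, k=2):
    """Keep objects that satisfy the predicate, then return the k most similar."""
    allowed = [obj for obj in objects if predicate(obj)]
    allowed.sort(key=lambda obj: dot(obj["vector"], query_vector), reverse=True)
    return allowed[:k]

# Toy objects with made-up vectors (not real embeddings)
toy_objects = [
    {"title": "The Future of AI", "vector": [0.8, 0.2]},
    {"title": "The Rise of Large Language Models", "vector": [0.9, 0.1]},
    {"title": "Exploring the Cosmos", "vector": [0.1, 0.9]},
]

# Simplified stand-in for the title filter: the word "ai" appears in the title
has_ai_in_title = lambda obj: "ai" in obj["title"].lower().split()

for obj in filtered_search(toy_objects, [1.0, 0.0], has_ai_in_title):
    print(obj["title"])
```

Note that the LLM article has the highest similarity score in this toy data but is still excluded, because it never passes the filter.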


In [13]:
client.close() # Close the connection to Weaviate

## 7. Conclusion

Congratulations! You've completed this Weaviate tutorial and learned the fundamental concepts of vector databases. You now know how to:

* Connect to Weaviate Cloud
* Configure a schema with an OpenAI vectorizer
* Import data using batching
* Perform vector and generative (RAG) searches
* Apply filters to your queries

This is just the beginning of what you can do with Weaviate. To continue your journey, I recommend exploring the official Weaviate documentation and trying out more advanced features like cross-references and custom vectors. Happy building!