<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Firstname Lastname](https://) for the 2024 Text Analysis Pedagogy Institute, with support from [Constellate](https://constellate.org).

For questions/comments/improvements, email author@email.address.<br />
____

# `Introduction to Semantic Search and Vector Databases` `3`

This is lesson `3` of 3 in the educational series on `Semantic Search and Vector Databases`. This notebook is intended `to teach how to build a Weaviate cluster for RAG systems`.

**Skills:** 
* Data analysis
* Machine learning
* Text analysis
* spaCy
* Vector databases
* Semantic search
* Python

**Audience:** `Teachers` / `Learners` / `Researchers`

**Use case:** `Tutorial` / `How-To` / `Explanation` 

`Include the use case definition from [here](https://constellate.org/docs/documentation-categories)`

**Difficulty:** `Intermediate`

`Beginner assumes users are relatively new to Python and Jupyter Notebooks. The user is helped step-by-step with lots of explanatory text.`
`Intermediate assumes users are familiar with Python and have been programming for 6+ months. Code makes up a larger part of the notebook and basic concepts related to Python are not explained.`
`Advanced assumes users are very familiar with Python and have been programming for years, but they may not be familiar with the process being explained.`

**Completion time:** `90 minutes`

**Knowledge Required:** 
```
* Python basics (variables, flow control, functions, lists, dictionaries)
* Object-oriented programming (classes, instances, inheritance)
* Regular Expressions (`re`, character classes)

These should be general skills but can mention a particular library
```

**Knowledge Recommended:**
```
* Basic file operations (open, close, read, write)
* Data cleaning with `Pandas`
```

**Learning Objectives:**
After this lesson, learners will be able to:
```
1. Describe and implement an XXXX for XXXX
2. Convert XXXX into XXXX for the purpose of XXXX
3. Develop a workflow in order to XXXX
4. Be familiar with XXXXX resources for pursuing the topic
```
**Research Pipeline:**
```
1. Research steps before this notebook
2. **The skills in this notebook**
3. Steps after this notebook
4. Final steps
```
___

# Required Python Libraries
`List out any libraries used and what they are used for`
* [spaCy](https://spacy.io) for splitting texts into sentences before chunking.
* [weaviate-client](https://github.com/weaviate/weaviate-python-client) for creating, populating, and querying the Weaviate cluster.
* [Pandas](https://pandas.pydata.org/) for manipulating and cleaning data.
* `srsly` for reading the JSON data file.
* `tqdm` for displaying progress bars.

## Install Required Libraries

In [None]:
### Install Libraries ###

# Using !pip installs
!pip install spacy==3.7.5 weaviate-client==4.7.1 pandas

In [None]:
!python -m spacy download en_core_web_sm

In [75]:
!pip show weaviate-client

Name: weaviate-client
Version: 4.7.1
Summary: A python native Weaviate client
Home-page: https://github.com/weaviate/weaviate-python-client
Author: Weaviate
Author-email: hello@weaviate.io,
License: BSD 3-clause
Location: /Applications/anaconda3/envs/tap/lib/python3.10/site-packages
Requires: authlib, grpcio, grpcio-health-checking, grpcio-tools, httpx, pydantic, requests, validators
Required-by: 


In [76]:
!pip show spacy

Name: spacy
Version: 3.7.5
Summary: Industrial-strength Natural Language Processing (NLP) in Python
Home-page: https://spacy.io
Author: Explosion
Author-email: contact@explosion.ai
License: MIT
Location: /Applications/anaconda3/envs/tap/lib/python3.10/site-packages
Requires: catalogue, cymem, jinja2, langcodes, murmurhash, numpy, packaging, preshed, pydantic, requests, setuptools, spacy-legacy, spacy-loggers, srsly, thinc, tqdm, typer, wasabi, weasel
Required-by: en-core-web-lg, en-core-web-md, en-core-web-sm, en-core-web-trf, gliner-spacy, spacy-curated-transformers, spacy-llm, spacy-transformers


In [36]:
import os
# Uncomment the line below if you are using a Mac. It works around a bug with spacy-llm and PyTorch on a Mac for now.
# os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
os.environ["OPENAI_API_KEY"] = ""

import weaviate
from weaviate.classes.init import Auth
import weaviate.classes as wvc
from weaviate.collections.classes.filters import Filter

import srsly
import spacy


import pandas as pd
from tqdm import tqdm
from datetime import datetime, timezone


# Introduction

In this notebook, we will learn a bit more about Weaviate. Rather than querying a pre-built Weaviate server, we will instead build our own from scratch. In this tutorial, I'll walk you through each step, from building the cluster on the Weaviate dashboard to populating it with data via the Python API.

Once we have built our cluster, we will learn how to query it, the same as we did in the previous notebook. Querying a RAG system is important, but being able to filter the texts that we send to the LLM is even more important. Therefore, we will also learn how to structure a query with a filter. Finally, we will learn how to do more complex searches by stacking multiple filters.

# Building a Weaviate Cluster

In order to store data on the cloud, you must have a Weaviate cluster. Fortunately, Weaviate provides a free sandbox that expires after 14 days. It's a good way to prototype for free. Once you have an account, you can build a free cluster from your console dashboard. Follow the steps in this video:

![weaviate](../assets/weaviate.gif)


The steps in this video are:

1. Click "+ Create Cluster"
2. Name your cluster
3. Select a region for your server
4. Click `Create`

Once your cluster is created, you will have a server address and an API key. These can then be used to populate that cluster with data. For this notebook, you will need not only an OpenAI API key but also two pieces of information from your Weaviate server: the server URL and the server API key. You will have two options for each. You will want the REST API endpoint and the Admin API key. If you only wanted to query the data, the "read only" key would be enough.

![weaviate](../assets/weaviate-url.gif)

# Access Cluster via Python

Once we have these two pieces of data, we need to either load them from our environment (the best method) or paste them into this notebook. Notice that the variable names use all capital letters. This is the Python convention for constants, but it is entirely optional.

In [67]:
WEAVIATE_URL = ""
WEAVIATE_API_KEY  = ""
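If you prefer the environment route, one possible approach is sketched below. It assumes you exported variables named `WEAVIATE_URL` and `WEAVIATE_API_KEY` before starting Jupyter. `load_weaviate_credentials` is a hypothetical helper written for this illustration, not part of the Weaviate client.

```python
import os


def load_weaviate_credentials(env=None):
    """Read the cluster URL and API key from environment variables.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    names = ("WEAVIATE_URL", "WEAVIATE_API_KEY")  # assumed variable names
    values = {name: env.get(name, "").strip() for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        # Fail early with a clear message instead of a confusing connection error later
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return values["WEAVIATE_URL"], values["WEAVIATE_API_KEY"]


# Usage, once both variables are set in your environment:
# WEAVIATE_URL, WEAVIATE_API_KEY = load_weaviate_credentials()
```

Keeping keys out of the notebook means they are not saved alongside your outputs if you share the file.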

Once we have these variables prepared, we can access our server. To do that, we will use `weaviate.connect_to_weaviate_cloud()`. This takes a few arguments. The first is `cluster_url`, the URL of the Weaviate server. The second is `auth_credentials`, which is your Weaviate API key. Notice that we pass the key through `Auth.api_key()`. Finally, we pass our headers, which include our OpenAI API key.

The below code creates an object called `client` which represents our connection to the Weaviate server. The `client` object, therefore, is the Weaviate server as a whole.

In [14]:
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=WEAVIATE_URL,
    auth_credentials=Auth.api_key(WEAVIATE_API_KEY),
    headers={
        "X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")
    }
)
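An open connection should eventually be closed with `client.close()`. In a standalone script, one way to make sure that happens is a small context manager like the sketch below. `open_client` is a hypothetical helper; it only assumes the client exposes `is_ready()` and `close()`, which the v4 Python client does.

```python
from contextlib import contextmanager


@contextmanager
def open_client(connect):
    """Open a connection with `connect()`, check it, and always close it afterwards."""
    client = connect()
    try:
        if not client.is_ready():
            raise ConnectionError("Weaviate server is not ready")
        yield client
    finally:
        # Runs whether the body succeeded or raised an exception
        client.close()


# Usage (hypothetical), reusing the variables defined above:
# with open_client(lambda: weaviate.connect_to_weaviate_cloud(
#         cluster_url=WEAVIATE_URL,
#         auth_credentials=Auth.api_key(WEAVIATE_API_KEY))) as client:
#     founders = client.collections.get("Founders")
```

In a notebook, where the connection is shared across many cells, it is usually simpler to keep `client` open and call `client.close()` once at the very end.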

# Load Data

In order to start populating the Weaviate server via the `client` object, we need to first grab our data. We will once again be working with a small sample from Founders Online. Let's go ahead and load up the data just like we did in the previous notebook.

In [5]:
data = list(srsly.read_json("../data/processed/sample_1000_42.json"))
data[0]

{'title': 'Thomas Jefferson to Joseph Milligan, 22 December 1815',
 'permalink': 'https://founders.archives.gov/documents/Jefferson/03-09-02-0174',
 'project': 'Jefferson Papers',
 'authors': ['Jefferson, Thomas'],
 'recipients': ['Milligan, Joseph'],
 'date-from': '1815-12-22',
 'date-to': '1815-12-22',
 'content': 'Monticello Dec. 22. 15.\nDear Sir\nOn my return here from Bedford a few days ago, I found the Hutton and Requisite tables, bound to my mind. by this mail I send you an Ovid’s metamorphoses almost entirely worne out & defaced, yet of sovaluable and rareaneditionthat I wish you to put it into as good a state of repair as it is susceptible of. by the next mail I will forward a Cornelius Nepos to be bound. be so good as to procure and forward to me by stage the underwritten books.I salute you with friendship & esteem\nTh: Jefferson\nAinsworth’sLat. & Eng. dict. abridged. to be bound[. . .]\nthe Lat. & Eng in one, & the Eng. & Lat.[. . .]\nOvid’s metamorphoses. the Delphin edn 

# Defining a Data Model and Properties

In order to populate our Weaviate server with data, we need to first create a collection. A good way to think about a collection is as a section of the cluster that can hold a unique set of data. This allows you to work on multiple projects with a single Weaviate cluster. For our project, we will call the collection `Founders`. It's important to give the collection a name that relates to the data it holds.

In [6]:
collection_name = "Founders"

Once we have our collection name, let's check whether the collection already exists and, if it does, delete it. Only execute the cell below if you are certain you want to delete everything within the collection. I will leave it commented out so this cell is not accidentally executed.

In [68]:
# if client.collections.exists(collection_name):  # In case we've created this collection before
#     client.collections.delete(collection_name)  # THIS WILL DELETE ALL DATA IN THE COLLECTION


Once we have a clean and empty collection, we can create a data model for our collection. This is probably one of the most important steps in this whole notebook. Data models structure how you approach the project as a whole. Remember, you may come back to this step several times when working on a project both as your data expands and as you get a sense of what properties of your data users need to access.

Data models are based on structuring the properties of a given text (and its vector). In Weaviate, we have many ways to define properties. Here is a handy reference table.

| Data Type         | Description                  | Example                          |
|-------------------|------------------------------|----------------------------------|
| TEXT              | Text data type.              | "Hello, World!"                  |
| TEXT_ARRAY        | Text array data type.        | ["Hello", "World"]               |
| INT               | Integer data type.           | 42                               |
| INT_ARRAY         | Integer array data type.     | [1, 2, 3, 4, 5]                  |
| BOOL              | Boolean data type.           | true                             |
| BOOL_ARRAY        | Boolean array data type.     | [true, false, true]              |
| NUMBER            | Number data type.            | 3.14159                          |
| NUMBER_ARRAY      | Number array data type.      | [1.1, 2.2, 3.3]                  |
| DATE              | Date data type.              | "2024-08-01"                     |
| DATE_ARRAY        | Date array data type.        | ["2024-08-01", "2023-07-01"]     |
| UUID              | UUID data type.              | "123e4567-e89b-12d3-a456-426614174000" |
| UUID_ARRAY        | UUID array data type.        | ["123e4567-e89b-12d3-a456-426614174000", "987e6543-e21b-34d3-a123-426614174001"] |
| GEO_COORDINATES   | Geo coordinates data type.   | {"latitude": 37.7749, "longitude": -122.4194} |
| BLOB              | Blob data type.              | Binary data like images or files |
| PHONE_NUMBER      | Phone number data type.      | "+1-800-555-5555"                |
| OBJECT            | Object data type.            | {"name": "Alice", "age": 30}     |
| OBJECT_ARRAY      | Object array data type.      | [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}] |

We can create the data model with the following code:

```python
founders = client.collections.create(
    name=collection_name,
    properties=[
        wvc.config.Property(
            name="chunk",
            data_type=wvc.config.DataType.TEXT
        ),
        wvc.config.Property(
            name="date_from",
            data_type=wvc.config.DataType.DATE
        ),
        wvc.config.Property(
            name="date_to",
            data_type=wvc.config.DataType.DATE
        ),
        wvc.config.Property(
            name="authors",
            data_type=wvc.config.DataType.TEXT_ARRAY
        ),
        wvc.config.Property(
            name="recipients",
            data_type=wvc.config.DataType.TEXT_ARRAY
        ),
        wvc.config.Property(
            name="chapter_title",
            data_type=wvc.config.DataType.TEXT
        ),
        wvc.config.Property(
            name="chunk_index",
            data_type=wvc.config.DataType.INT
        ),
        wvc.config.Property(
            name="doc_index",
            data_type=wvc.config.DataType.INT
        ),
    ],
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),
    generative_config=wvc.config.Configure.Generative.openai(model="gpt-3.5-turbo"),
)

```

Let's break down this code section by section:

1. Collection Creation:
   ```python
   founders = client.collections.create(
       name=collection_name,
       ...
   )
   ```
   This line starts the creation of a new collection in Weaviate. The collection is named using the `collection_name` variable, and the result is assigned to the `founders` variable.

2. Properties Definition:
   ```python
   properties=[
       wvc.config.Property(
           name="chunk",
           data_type=wvc.config.DataType.TEXT
       ),
       ...
   ],
   ```
   This section defines the properties (or fields) of the collection. Each property is created using `wvc.config.Property()` and has a name and a data type. The properties defined are:
   - `chunk`: TEXT type, for storing text content
   - `date_from` and `date_to`: DATE type, for the start and end of each document's date range
   - `authors` and `recipients`: TEXT_ARRAY type, for storing multiple text values
   - `chapter_title`: TEXT type
   - `chunk_index` and `doc_index`: INT type, for indexing or ordering purposes

3. Vectorizer Configuration:
   ```python
   vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),
   ```
   This line configures the vectorizer for the collection. It's set to use Weaviate's `text2vec-openai` module, which calls an OpenAI embedding model to convert text data into vector representations.

4. Generative Configuration:
   ```python
   generative_config=wvc.config.Configure.Generative.openai(model="gpt-3.5-turbo"),
   ```
   This configures the generative model for the collection. It's set to use OpenAI's GPT-3.5 Turbo model, which can be used for generating text based on the collection's data.


In [33]:
founders = client.collections.create(
    name=collection_name,
    properties=[
        wvc.config.Property(
            name="chunk",
            data_type=wvc.config.DataType.TEXT
        ),
        wvc.config.Property(
            name="date_from",
            data_type=wvc.config.DataType.DATE
        ),
        wvc.config.Property(
            name="date_to",
            data_type=wvc.config.DataType.DATE
        ),
        wvc.config.Property(
            name="authors",
            data_type=wvc.config.DataType.TEXT_ARRAY
        ),
        wvc.config.Property(
            name="recipients",
            data_type=wvc.config.DataType.TEXT_ARRAY
        ),
        wvc.config.Property(
            name="chapter_title",
            data_type=wvc.config.DataType.TEXT
        ),
        wvc.config.Property(
            name="chunk_index",
            data_type=wvc.config.DataType.INT
        ),
        wvc.config.Property(
            name="doc_index",
            data_type=wvc.config.DataType.INT
        ),
    ],
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),  # Use `text2vec-openai` as the vectorizer
    generative_config=wvc.config.Configure.Generative.openai(model="gpt-3.5-turbo"),  # Use `generative-openai` with the gpt-3.5-turbo model
)
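Because the data model fixes a type for every property, it can help to check a record locally before sending it to the cluster. The sketch below is one possible check in plain Python. `EXPECTED_TYPES` and `find_schema_problems` are hypothetical names, and the mapping assumes dates are sent as strings, as they are later in this notebook.

```python
# Python types we expect for each property in the Founders data model (an assumption
# that mirrors the collection definition above; DATE values are sent as strings here)
EXPECTED_TYPES = {
    "chunk": str,
    "date_from": str,
    "date_to": str,
    "authors": list,
    "recipients": list,
    "chapter_title": str,
    "chunk_index": int,
    "doc_index": int,
}


def find_schema_problems(properties):
    """Return a list of readable mismatches; an empty list means the record looks fine."""
    problems = []
    for name, value in properties.items():
        expected = EXPECTED_TYPES.get(name)
        if expected is None:
            problems.append(f"{name}: not defined in the data model")
        elif not isinstance(value, expected):
            problems.append(f"{name}: expected {expected.__name__}, got {type(value).__name__}")
        elif expected is list and not all(isinstance(item, str) for item in value):
            problems.append(f"{name}: every item should be a string")
    return problems


record = {"chunk": "Dear Sir ...", "authors": "Jefferson, Thomas", "chunk_index": 0}
print(find_schema_problems(record))
# ['authors: expected list, got str']
```

A check like this is cheap to run over every record and surfaces problems such as a single author stored as a string instead of a list.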

# Chunk Data

In the previous notebook, we learned about the different methods of chunking. Here, we will learn how to chunk our data. First, I want to convert our data into a Pandas DataFrame for easier manipulation.

In [69]:
df = pd.DataFrame(data)
df

Unnamed: 0,title,permalink,project,authors,recipients,date-from,date-to,content
0,"Thomas Jefferson to Joseph Milligan, 22 Decemb...",https://founders.archives.gov/documents/Jeffer...,Jefferson Papers,"[Jefferson, Thomas]","[Milligan, Joseph]",1815-12-22,1815-12-22,Monticello Dec. 22. 15.\nDear Sir\nOn my retur...
1,"To Alexander Hamilton from James McHenry, 3 Ma...",https://founders.archives.gov/documents/Hamilt...,Hamilton Papers,"[McHenry, James]","[Hamilton, Alexander]",1791-05-03,1791-05-03,[Baltimore] 3 May 1791.\nMy dear Sir.\nI did n...
2,John Adams to John Quincy Adams and Thomas Boy...,https://founders.archives.gov/documents/Adams/...,Adams Papers,"[Adams, John]","[Adams, John Quincy, Adams, Thomas Boylston]",1794-09-14,1794-09-14,Quincy Septr.14. 1794\nMy dear Sons\nI once mo...
3,From George Washington to Major General Horati...,https://founders.archives.gov/documents/Washin...,Washington Papers,"[Washington, George]","[Gates, Horatio]",1776-12-23,1776-12-23,"Head Quarters [Bucks County, Pa.] 23d Decr 177..."
4,[Diary entry: 5 July 1795],https://founders.archives.gov/documents/Washin...,Washington Papers,"[Washington, George]",[],1795-07-05,1795-07-05,Could not find the main content
...,...,...,...,...,...,...,...,...
995,"From John Adams to Boston Patriot, 4 November ...",https://founders.archives.gov/documents/Adams/...,Adams Papers,"[Adams, John]",[Boston Patriot],1809-11-04,1809-11-04,"Quincy, November 4, 1809.\nSirs,\nIn my last l..."
996,"From John Adams to United States Senate, 14 Ma...",https://founders.archives.gov/documents/Adams/...,Adams Papers,"[Adams, John]",[United States Senate],1798-03-14,1798-03-14,United States March 14th 1798:\nGentlemen of t...
997,"To Benjamin Franklin from William Henly, [Apri...",https://founders.archives.gov/documents/Frankl...,Franklin Papers,"[Henly, William]","[Franklin, Benjamin]",1772-04-01,1772-04-30,"Sunday Eve. [April?, 1772]\nDear Sir:\nI have ..."
998,From George Washington to Major General Alexan...,https://founders.archives.gov/documents/Washin...,Washington Papers,"[Washington, George]","[McDougall, Alexander]",1779-05-20,1779-05-20,Head Quarters Middle Brook May 20th 1779\nDr S...


Next, we will create a blank spaCy English pipeline and add a `sentencizer` to it so that we can split our data into sentences before chunking.

In [37]:
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")


<spacy.pipeline.sentencizer.Sentencizer at 0x348119680>

Now that we have everything we need, we can chunk our data. To do that, we will use the code below. Here's an explanation of each line.

1. Function definition:
```python
def chunk_text(text, chunk_size=10):
```
This defines a function named `chunk_text` that takes two parameters: `text` (the input text to be chunked) and `chunk_size` (default value of 10).

2. Processing the text:
```python
    doc = nlp(text)
```
This line processes the input text with `nlp`, the blank spaCy English pipeline with a `sentencizer` that we created above.

3. Extracting sentences:
```python
    sentences = list(doc.sents)
```
This creates a list of sentence objects from the processed document.

4. Initializing the chunks list:
```python
    temp_chunks = []
```
This creates an empty list to store the text chunks.

5. Chunking loop:
```python
    for i in range(0, len(sentences), chunk_size):
```
This loop iterates over the sentences list, stepping by `chunk_size` each time.

6. Creating a chunk:
```python
        c = sentences[i:i+chunk_size]
```
This slices the sentences list to get `chunk_size` sentences.

7. Joining sentences and adding to chunks:
```python
        temp_chunks.append(" ".join([sent.text for sent in c]))
```
This joins the sentences in the chunk into a single string and adds it to `temp_chunks`.

8. Returning the chunks:
```python
    return temp_chunks
```
This returns the list of text chunks.

9. Initializing the chunked data list:
```python
chunked_data = []
```
This creates an empty list to store the chunked data.

10.  Main processing loop:
```python
for idx, row in tqdm(df.iterrows(), total=len(df), desc="Chunking texts"):
```
This iterates over each row in the DataFrame `df`, using `tqdm` to show a progress bar.

11.  Chunking the content:
```python
    c = chunk_text(row['content'])
```
This applies the `chunk_text` function to the 'content' column of the current row.

12.  Processing each chunk:
```python
    for chunk_idx, chunk in enumerate(c):
```
This loops over each chunk created from the current row's content.

13.  Creating a new data dictionary:
```python
        chunk_data = row.to_dict()
```
This creates a new dictionary with all the data from the current row.

14.  Updating the chunk data:
```python
        chunk_data['content'] = chunk
        chunk_data['document_index'] = idx
        chunk_data['chunk_index'] = chunk_idx
```
These lines update the 'content' with the current chunk, and add 'document_index' and 'chunk_index'.

15.  Adding to chunked data:
```python
        chunked_data.append(chunk_data)
```
This adds the processed chunk data to the `chunked_data` list.

16.  Creating the final DataFrame:
```python
chunked_df = pd.DataFrame(chunked_data)
```
This creates a new DataFrame from the `chunked_data` list.

17.  Displaying the result:
```python
chunked_df
```
This displays or returns the resulting DataFrame with the chunked data.
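To see the slicing logic from steps 5 to 7 in isolation, here is a minimal sketch that runs on a plain list of strings instead of spaCy sentence objects. `group_sentences` and `toy_sentences` are illustrative names only.

```python
def group_sentences(sentences, chunk_size=10):
    """Join every `chunk_size` consecutive sentences into one chunk."""
    chunks = []
    # Step through the list chunk_size items at a time: 0, chunk_size, 2 * chunk_size, ...
    for i in range(0, len(sentences), chunk_size):
        # Slicing past the end of a list is safe, so the last chunk may simply be shorter
        chunks.append(" ".join(sentences[i:i + chunk_size]))
    return chunks


toy_sentences = ["One.", "Two.", "Three.", "Four.", "Five."]
print(group_sentences(toy_sentences, chunk_size=2))
# ['One. Two.', 'Three. Four.', 'Five.']
```

Notice that the final chunk holds only one sentence. The same thing happens with real documents whose sentence count is not a multiple of `chunk_size`.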

In [70]:
# Function to chunk text into groups of `chunk_size` sentences (10 by default)
def chunk_text(text, chunk_size=10):
    doc = nlp(text)
    sentences = list(doc.sents)
    temp_chunks = []
    for i in range(0, len(sentences), chunk_size):
        c = sentences[i:i+chunk_size]
        temp_chunks.append(" ".join([sent.text for sent in c]))
    return temp_chunks

# Create chunks
chunked_data = []
for idx, row in tqdm(df.iterrows(), total=len(df), desc="Chunking texts"):
    c = chunk_text(row['content'])
    for chunk_idx, chunk in enumerate(c):
        chunk_data = row.to_dict()
        chunk_data['content'] = chunk
        chunk_data['document_index'] = idx
        chunk_data['chunk_index'] = chunk_idx
        chunked_data.append(chunk_data)
# Create new DataFrame with chunks
chunked_df = pd.DataFrame(chunked_data)
chunked_df

Chunking texts: 100%|██████████| 1000/1000 [00:01<00:00, 967.18it/s]


Unnamed: 0,title,permalink,project,authors,recipients,date-from,date-to,content,document_index,chunk_index
0,"Thomas Jefferson to Joseph Milligan, 22 Decemb...",https://founders.archives.gov/documents/Jeffer...,Jefferson Papers,"[Jefferson, Thomas]","[Milligan, Joseph]",1815-12-22,1815-12-22,Monticello Dec. 22. 15. \nDear Sir\nOn my retu...,0,0
1,"Thomas Jefferson to Joseph Milligan, 22 Decemb...",https://founders.archives.gov/documents/Jeffer...,Jefferson Papers,"[Jefferson, Thomas]","[Milligan, Joseph]",1815-12-22,1815-12-22,"to be bound[. . .] \nthe Lat. & Eng in one, & ...",0,1
2,"Thomas Jefferson to Joseph Milligan, 22 Decemb...",https://founders.archives.gov/documents/Jeffer...,Jefferson Papers,"[Jefferson, Thomas]","[Milligan, Joseph]",1815-12-22,1815-12-22,\nMair’s Tyro’s dictionary. \nI observe a mrRi...,0,2
3,"To Alexander Hamilton from James McHenry, 3 Ma...",https://founders.archives.gov/documents/Hamilt...,Hamilton Papers,"[McHenry, James]","[Hamilton, Alexander]",1791-05-03,1791-05-03,[Baltimore] 3 May 1791. \nMy dear Sir. \nI did...,1,0
4,"To Alexander Hamilton from James McHenry, 3 Ma...",https://founders.archives.gov/documents/Hamilt...,Hamilton Papers,"[McHenry, James]","[Hamilton, Alexander]",1791-05-03,1791-05-03,\nI then called on Mr. Wm. Smith who with less...,1,1
...,...,...,...,...,...,...,...,...,...,...
1785,"From John Adams to United States Senate, 14 Ma...",https://founders.archives.gov/documents/Adams/...,Adams Papers,"[Adams, John]",[United States Senate],1798-03-14,1798-03-14,United States March 14th 1798:\nGentlemen of t...,996,0
1786,"To Benjamin Franklin from William Henly, [Apri...",https://founders.archives.gov/documents/Frankl...,Franklin Papers,"[Henly, William]","[Franklin, Benjamin]",1772-04-01,1772-04-30,"Sunday Eve. [ April?, 1772]\nDear Sir:\nI have...",997,0
1787,"To Benjamin Franklin from William Henly, [Apri...",https://founders.archives.gov/documents/Frankl...,Franklin Papers,"[Henly, William]","[Franklin, Benjamin]",1772-04-01,1772-04-30,At this instant I saw every one of the wooden ...,997,1
1788,From George Washington to Major General Alexan...,https://founders.archives.gov/documents/Washin...,Washington Papers,"[Washington, George]","[McDougall, Alexander]",1779-05-20,1779-05-20,Head Quarters Middle Brook May 20th 1779\nDr S...,998,0


Now that we have our new DataFrame, you can see that we have all our chunks: `1790` of them. We have everything we need to now convert this data into a chunk list that we can use to populate the Weaviate cluster.

In [43]:
def convert_to_rfc3339(date_value):
    if date_value is None:
        return None
    if isinstance(date_value, str):
        try:
            dt = datetime.fromisoformat(date_value)
        except ValueError:
            dt = datetime.strptime(date_value, '%Y-%m-%d')  # Adjust format as needed
    elif isinstance(date_value, datetime):
        dt = date_value
    else:
        raise ValueError(f"Unsupported date type: {type(date_value)}")
    
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    
    return dt.isoformat()

chunks_list = list()
for idx, row in chunked_df.iterrows():
    # Convert date values to RFC 3339 format
    date_from = convert_to_rfc3339(row['date-from'])
    date_to = convert_to_rfc3339(row['date-to'])
    
    data_properties = {
        "chunk": row["title"] + "\n" + row["content"],
        "authors": row["authors"] if isinstance(row["authors"], list) else [row["authors"]],
        "recipients": row["recipients"] if isinstance(row["recipients"], list) else [row["recipients"]],
        "chapter_title": row["title"],
        "chunk_index": int(row["chunk_index"]),
        "doc_index": int(row["document_index"])
    }
    
    # Only add date properties if they are not None
    if date_from is not None:
        data_properties["date_from"] = date_from
    if date_to is not None:
        data_properties["date_to"] = date_to
    
    data_object = wvc.data.DataObject(properties=data_properties)
    chunks_list.append(data_object)

In [46]:
chunks_list[0]

DataObject(properties={'chunk': 'Thomas Jefferson to Joseph Milligan, 22 December 1815\nMonticello Dec. 22. 15. \nDear Sir\nOn my return here from Bedford a few days ago, I found the Hutton and Requisite tables, bound to my mind. by this mail I send you an Ovid’s metamorphoses almost entirely worne out & defaced, yet of sovaluable and rareaneditionthat I wish you to put it into as good a state of repair as it is susceptible of. by the next mail I will forward a Cornelius Nepos to be bound. be so good as to procure and forward to me by stage the underwritten books. I salute you with friendship & esteem\nTh: Jefferson\nAinsworth’sLat. & Eng. dict. abridged.', 'authors': ['Jefferson, Thomas'], 'recipients': ['Milligan, Joseph'], 'chapter_title': 'Thomas Jefferson to Joseph Milligan, 22 December 1815', 'chunk_index': 0, 'doc_index': 0, 'date_from': '1815-12-22T00:00:00+00:00', 'date_to': '1815-12-22T00:00:00+00:00'}, uuid=None, vector=None, references=None)
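`convert_to_rfc3339` above raises a `ValueError` for anything that is not `None`, a string, or a `datetime`. If your own data has missing dates, pandas may store them as `NaN` (a float), which would trigger that error. The sketch below is a more forgiving variant that returns `None` instead. `to_rfc3339_or_none` is a hypothetical helper, and silently skipping bad values is a design choice you may not want in every project.

```python
from datetime import datetime, timezone


def to_rfc3339_or_none(value):
    """Return an RFC 3339 string for an ISO date string, or None if it cannot be parsed."""
    if not isinstance(value, str) or not value.strip():
        return None  # covers None, NaN, and empty strings
    try:
        dt = datetime.fromisoformat(value.strip())
    except ValueError:
        return None
    if dt.tzinfo is None:
        # Dates without a timezone are treated as UTC, matching the function above
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.isoformat()


print(to_rfc3339_or_none("1815-12-22"))
# 1815-12-22T00:00:00+00:00
print(to_rfc3339_or_none(float("nan")))
# None
```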

# Populate Weaviate Cluster

Now that we have our chunk list, we can populate the Weaviate server! We can do this by accessing our `Founders` collection and using `data.insert_many`. This sends all of the objects in one request, which is a lot faster than inserting them one at a time. Let's just send the first 300 chunks to save on cost.

In [48]:

founders.data.insert_many(chunks_list[:300])

BatchObjectReturn(_all_responses=[UUID('9052c7b4-8621-4e95-b38e-c8f006493bf4'), UUID('8c2ca252-4573-46e4-a898-2472e8e70840'), UUID('deb77c16-a8df-41b2-9b34-89a2b83fd707'), UUID('ae543507-dc4f-408b-b6e3-bdc0da8c410a'), UUID('1b071ae5-ab04-4438-a72f-05c93d422879'), UUID('ce434d25-a47a-496c-8bba-0f5a137866fb'), UUID('f4e4c82c-e495-47fe-9be1-975a3172b836'), UUID('a80ac851-a656-47b9-9774-f072099ea3cb'), UUID('b4ab8e90-d962-4f6d-b71b-2c75689f762e'), UUID('8cc5ecbf-0c9d-4ec3-b531-5fc6f706c3c6'), UUID('1c858723-8fe9-43d7-92bd-bfa07071a5b6'), UUID('09b9f253-1758-4c01-a1be-5c8b524651d4'), UUID('f56e7781-a0a0-40ad-a3de-57a2a416fcea'), UUID('974558fb-09e9-4212-894c-f410bf7ffa15'), UUID('ca32b8e9-6823-4b29-8d7f-43dd324075cf'), UUID('65c542fd-f873-46b9-b976-16455070e50f'), UUID('0c920da0-d542-4da6-b43a-5df1f87c562a'), UUID('6f5bdb08-e870-42c6-bbac-9777264a7de6'), UUID('4bcd93bc-3851-4a76-a00f-b647b3155559'), UUID('87fe4203-3425-4812-a22b-095028dd37c0'), UUID('db23b297-258f-4cbf-8e8c-3b375ec91bcf'), 
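The object returned by `insert_many` also reports failures. The sketch below uploads a list in smaller slices and collects any error messages. `insert_in_slices` is a hypothetical helper. It assumes the returned object has a `has_errors` flag and an `errors` dictionary that maps positions to error objects with a `message` attribute, which is my understanding of the v4 client; check this against the version you have installed.

```python
def insert_in_slices(collection, objects, slice_size=100):
    """Insert `objects` in slices and return {position_in_objects: error_message}."""
    failed = {}
    for start in range(0, len(objects), slice_size):
        result = collection.data.insert_many(objects[start:start + slice_size])
        if result.has_errors:
            # Error positions are relative to the slice, so shift them by `start`
            for offset, error in result.errors.items():
                failed[start + offset] = error.message
    return failed


# Usage (hypothetical), with the objects built above:
# failed = insert_in_slices(founders, chunks_list[:300], slice_size=100)
# print(f"{len(failed)} objects failed")
```

Smaller slices keep each request modest in size and make it easier to see which records need attention.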

# Querying the Database

Once we have our database created, we can query it, just like we did in the previous notebook! If we were working in a different notebook and needed to access the `Founders` collection again, we could use the following code after connecting to the client again.

In [71]:
founders = client.collections.get("Founders")

In [72]:
response = founders.generate.near_text(
    query="What does Abigail Adams say about her sister?",
    limit=1,
)

print(response.objects)

[GenerativeObject(uuid=_WeaviateUUIDInt('0c920da0-d542-4da6-b43a-5df1f87c562a'), metadata=MetadataReturn(creation_time=None, last_update_time=None, distance=None, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=None), properties={'date_from': datetime.datetime(1815, 4, 14, 0, 0, tzinfo=datetime.timezone.utc), 'date_to': datetime.datetime(1815, 4, 14, 0, 0, tzinfo=datetime.timezone.utc), 'recipients': ['Adams, Louisa Catherine Johnson'], 'authors': ['Adams, Abigail Smith'], 'chunk_index': 0, 'doc_index': 10, 'chapter_title': 'From Abigail Smith Adams to Louisa Catherine Johnson Adams, 14 April 1815', 'chunk': 'From Abigail Smith Adams to Louisa Catherine Johnson Adams, 14 April 1815\nQuincy April 14th 1815\nMy dear Daughter\nI address you, altho I know not where to find you, which is, and has been a source of much anxiety to me, four months have elapsed since the signature of the Treaty of Peace; when mr Adams wrote from Ghent, that in ten day’s, he shou
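Printing `response.objects` directly is hard to read. As a minimal sketch, the helper below pulls out a few properties from each returned object. `summarize_objects` is a hypothetical name, and the property names are the ones defined in our data model.

```python
def summarize_objects(objects, preview_chars=80):
    """Return one small dictionary per result with the fields most useful for inspection."""
    rows = []
    for obj in objects:
        props = obj.properties
        rows.append({
            "doc_index": props["doc_index"],
            "chunk_index": props["chunk_index"],
            "title": props["chapter_title"],
            # Only keep the start of the chunk so the output stays short
            "preview": props["chunk"][:preview_chars],
        })
    return rows


# Usage (hypothetical), with the response from the query above:
# for row in summarize_objects(response.objects):
#     print(row)
```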

I want to draw attention to something important. Let's see what document number we returned.

In [73]:
response.objects[0].properties["doc_index"]

10

In many instances, it is necessary to return results only when they meet certain conditions. What if we wanted a document with a `doc_index` higher than 10? This is where filtering comes in.

# Filtering Data

In Weaviate, we have two ways to structure filters. We can structure them by using the filter classes available to us (how we will do it for this tutorial) and via GraphQL (useful for perhaps more complex queries). Since I do not have GraphQL as a prerequisite for students of this course, we will opt for the easier Filter class approach. Weaviate offers numerous ways to filter data natively. Here's a list.

| Operator | Description | Example |
|----------|-------------|---------|
| And | Combines multiple conditions, all of which must be true | `Filter.by_property("wordCount").greater_than(1000) & Filter.by_property("title").like("*economy*")` |
| Or | Combines multiple conditions, at least one of which must be true | `Filter.by_property("wordCount").greater_than(1000) \| Filter.by_property("title").like("*economy*")` |
| Equal | Matches exact values | `Filter.by_property("category").equal("Technology")` |
| NotEqual | Matches all values except the specified one | `Filter.by_property("category").not_equal("Sports")` |
| GreaterThan | Matches values greater than the specified value | `Filter.by_property("wordCount").greater_than(1000)` |
| GreaterThanEqual | Matches values greater than or equal to the specified value | `Filter.by_property("wordCount").greater_or_equal(1000)` |
| LessThan | Matches values less than the specified value | `Filter.by_property("price").less_than(50)` |
| LessThanEqual | Matches values less than or equal to the specified value | `Filter.by_property("price").less_or_equal(50)` |
| Like | Matches text based on partial matches using wildcards | `Filter.by_property("title").like("New *")` |
| WithinGeoRange | Matches geo coordinates within a specified range | `Filter.by_property("location").within_geo_range(GeoCoordinate(latitude=33.7579, longitude=84.3948), distance=10000)` |
| IsNull | Matches null or non-null values | `Filter.by_property("description").is_none(True)` |
| ContainsAny | Matches if any of the specified values are present in an array or text | `Filter.by_property("languages_spoken").contains_any(["Chinese", "French", "English"])` |
| ContainsAll | Matches if all of the specified values are present in an array or text | `Filter.by_property("languages_spoken").contains_all(["Chinese", "French", "English"])` |

**Note: The exact syntax varies with the Weaviate client version and programming language. These examples, like the rest of this notebook, use the Python client v4 syntax.**

## Filtering by Integer

In [27]:

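# Filter lives in the v4 client's query classes (harmless if already imported)
from weaviate.classes.query import Filter

response = founders.generate.near_text(
    query="What does Abigail Adams say about her sister?",
    limit=1,
    # Only consider objects whose doc_index is greater than 10
    filters=Filter.by_property("doc_index").greater_than(10),
    # One prompt applied to the returned objects as a group
    grouped_task="Summarize the letter."
)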

print(response.objects)

[GenerativeObject(uuid=_WeaviateUUIDInt('d9a69243-6a68-4cac-b013-35b42f3a1928'), metadata=MetadataReturn(creation_time=None, last_update_time=None, distance=None, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=None), properties={'chapter_title': 'From Abigail Smith Adams to Mary Smith Gray Otis, 13 May 1814', 'chunk_index': 0, 'chunk': 'From Abigail Smith Adams to Mary Smith Gray Otis, 13 May 1814\nQuincy May 13th 1814\nMy dear Friend\nNext to the Supports of religion; is the sympathy of Friends in affliction, of the first you have abandent sources in the Belief of an all wise goveneur and disposer of events, in whose goodness you can confide, and upon whose word you can repose—as the Husband of the Widow and the Father of the Fatherless, unto that being I commend you, and your dear Children, most sincerely sympathizing with you in the Bereavement You have been calld upon to sustain\nTo know that the Frends we have lost are dear to others, as well as o

In [24]:
response.generated

'Abigail Smith Adams expresses her sympathy and support to her friend Mary Smith Gray Otis, who has recently experienced a loss. She emphasizes the importance of the support of friends in times of affliction, and assures Mary that she is not alone in her grief. Abigail also highlights the comfort that comes from knowing that the virtues and characters of loved ones are remembered and respected by others. She acknowledges the pain and sorrow that Mary and her children are experiencing, and encourages them to find solace in their faith and in the belief that their loved one is now in a better place. Abigail closes the letter by expressing her continued support and sympathy for Mary during this difficult time.'
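
Notice that the filter did not change how the search ranks results. It only changed which documents were eligible, so document 10 dropped out and the next-best eligible match took its place. Here is a minimal plain-Python sketch of that idea. The records and distances are made up for illustration, and `filtered_search` is a hypothetical helper, not part of Weaviate.

```python
# Toy records: each has a doc_index and a made-up distance to the query vector.
records = [
    {"doc_index": 10, "distance": 0.12},
    {"doc_index": 4, "distance": 0.20},
    {"doc_index": 37, "distance": 0.25},
    {"doc_index": 52, "distance": 0.31},
]


def filtered_search(records, min_doc_index, limit):
    """Keep records with doc_index > min_doc_index, then return the closest ones."""
    allowed = [r for r in records if r["doc_index"] > min_doc_index]
    return sorted(allowed, key=lambda r: r["distance"])[:limit]


best = filtered_search(records, min_doc_index=10, limit=1)
print(best)  # [{'doc_index': 37, 'distance': 0.25}]
```

Without the filter, document 10 would win because it has the smallest distance. With the filter, it is never considered.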

## Filtering with Strings

In [50]:

response = founders.generate.near_text(
    query="What does Abigail Adams say about her sister's death?",
    limit=1,
    filters=Filter.by_property("chapter_title").like("*John Adams*")
)

print(response.objects)

[GenerativeObject(uuid=_WeaviateUUIDInt('c1f33c3a-69c7-43fe-89c3-15ae983073bc'), metadata=MetadataReturn(creation_time=None, last_update_time=None, distance=None, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=None), properties={'date_from': datetime.datetime(1797, 1, 1, 0, 0, tzinfo=datetime.timezone.utc), 'chunk_index': 1, 'recipients': ['Adams, John'], 'authors': ['Adams, Abigail'], 'date_to': datetime.datetime(1797, 1, 1, 0, 0, tzinfo=datetime.timezone.utc), 'doc_index': 20, 'chapter_title': 'Abigail Adams to John Adams, 1 January 1797', 'chunk': 'Abigail Adams to John Adams, 1 January 1797\n\nwhom do you think has undertaken to read the Defence! but Deacon Webb, and declares himself well pleasd with the first volm.as cousin Boylstone informs me. \nI fear the Deleware is frozen up So that Brisler will not be able to send me any flower—\nBillings is just recovering from a visit to Stoughten which has lasted him a week, the Second he has made since y
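
To get a feel for what the `*` wildcards in `like("*John Adams*")` are doing, here is a rough plain-Python approximation. It treats `*` as "any run of characters" and `?` as "exactly one character". The `like` function below is a hypothetical helper written for this sketch. It is case sensitive and ignores how Weaviate tokenizes text properties, so the real filter may behave differently in edge cases.

```python
import re


def like(pattern, text):
    """Approximate a wildcard match: '*' = any run of characters, '?' = exactly one."""
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, text, flags=re.DOTALL) is not None


titles = [
    "Abigail Adams to John Adams, 1 January 1797",
    "From John Adams to Boston Patriot, 3 November 1809",
    "From Abigail Smith Adams to Mary Smith Gray Otis, 13 May 1814",
]

# Wildcards on both sides: "John Adams" may appear anywhere in the title.
matches = [t for t in titles if like("*John Adams*", t)]
print(len(matches))  # 2

# Wildcard only at the end: the title must start with "From ".
starts_with_from = [t for t in titles if like("From *", t)]
print(len(starts_with_from))  # 2
```

Where you place the wildcard matters: `"*John Adams*"` matches letters both to and from John Adams, while `"From John Adams*"` would match only titles that begin that way.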

## Filtering with a List of Strings

In [54]:
response = founders.generate.near_text(
    query="Something about peace",
    limit=1,
    filters=Filter.by_property("authors").contains_any(["Adams, John"])
)

print(response.objects)

[GenerativeObject(uuid=_WeaviateUUIDInt('96428597-4577-465d-9846-702f11196fd3'), metadata=MetadataReturn(creation_time=None, last_update_time=None, distance=None, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=None), properties={'date_from': datetime.datetime(1770, 11, 29, 0, 0, tzinfo=datetime.timezone.utc), 'date_to': datetime.datetime(1770, 11, 29, 0, 0, tzinfo=datetime.timezone.utc), 'recipients': [], 'authors': ['Adams, John'], 'chunk_index': 11, 'doc_index': 13, 'chapter_title': 'Adams’ Minutes of Crown Evidence, Concluded, and of Samuel Quincy’s Argument for the Crown: 29 November 1770', 'chunk': 'Adams’ Minutes of Crown Evidence, Concluded, and of Samuel Quincy’s Argument for the Crown: 29 November 1770\nThe Person he killed was in Peace. No Insult offerd to K.\nMarshall. The Street entirely Still. Fewer People there than usual. He had been warned not to go out that Evening. Moon, to the North. Saw a Party come out of the main Guard door. D—n e
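
With a single name in the list, `contains_any` and `contains_all` give the same answer. The difference shows up with two or more names. The sketch below mimics the two operators in plain Python. Both functions are hypothetical stand-ins, and the co-authored letter is invented for illustration.

```python
def contains_any(values, wanted):
    """True if at least one wanted item appears in values."""
    return any(w in values for w in wanted)


def contains_all(values, wanted):
    """True only if every wanted item appears in values."""
    return all(w in values for w in wanted)


# Hypothetical author lists for two letters
solo_letter = ["Adams, John"]
joint_letter = ["Adams, John", "Adams, Abigail Smith"]

wanted = ["Adams, John", "Adams, Abigail Smith"]

print(contains_any(solo_letter, wanted))   # True
print(contains_all(solo_letter, wanted))   # False
print(contains_all(joint_letter, wanted))  # True
```

So `contains_any` is the right choice for "letters by either person", and `contains_all` for "letters written by both together".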

## Filtering with Dates

In [56]:
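# datetime and timezone come from the standard library (harmless if already imported)
from datetime import datetime, timezone

# Define the timezone-aware date we want to filter from
filter_date = datetime(1800, 1, 1, tzinfo=timezone.utc)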

founders = client.collections.get("Founders")
response = founders.generate.near_text(
    query="What does John Adams say about war?",
    limit=1,
    filters=Filter.by_property("date_from").greater_than(filter_date)
)

print(response.objects)

[GenerativeObject(uuid=_WeaviateUUIDInt('9c04919b-cc8b-47fc-b93c-96e0aae88401'), metadata=MetadataReturn(creation_time=None, last_update_time=None, distance=None, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=None), properties={'date_from': datetime.datetime(1809, 11, 3, 0, 0, tzinfo=datetime.timezone.utc), 'date_to': datetime.datetime(1809, 11, 3, 0, 0, tzinfo=datetime.timezone.utc), 'recipients': ['Boston Patriot'], 'authors': ['Adams, John'], 'chunk_index': 10, 'doc_index': 83, 'chapter_title': 'From John Adams to Boston Patriot, 3 November 1809', 'chunk': 'From John Adams to Boston Patriot, 3 November 1809\nAll Europe is in a crisis, and this ingredient thrown in at this time will have more effect, than at any other. At a future time I may enlarge upon this subject. \nAt the foot of this letter to congress I find in my hand writing a note, February 20, 1782. The late evacuation of the barrier towns and demolition of their fortifications, may serve
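
The dates in the output above all carry `tzinfo=datetime.timezone.utc`, which is why we built `filter_date` with a timezone too. This short sketch uses only the standard library to show how Python itself treats timezone-aware and naive dates. How the Weaviate client handles a naive date is not covered here.

```python
from datetime import datetime, timezone

filter_date = datetime(1800, 1, 1, tzinfo=timezone.utc)
letter_date = datetime(1809, 11, 3, tzinfo=timezone.utc)

# Two timezone-aware dates compare without trouble.
is_later = letter_date > filter_date
print(is_later)  # True

# The aware date serializes with an explicit UTC offset.
print(filter_date.isoformat())  # 1800-01-01T00:00:00+00:00

# Ordering an aware date against a naive one is an error in Python.
naive_date = datetime(1800, 1, 1)
try:
    letter_date > naive_date
    comparison_failed = False
except TypeError as err:
    comparison_failed = True
    print("Cannot compare:", err)
```

Keeping every date timezone-aware avoids this class of error and makes it unambiguous which moment the filter refers to.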

# Stacking Filters

In [66]:
# Define the date we want to filter from
filter_date = datetime(1800, 1, 1, tzinfo=timezone.utc)

founders = client.collections.get("Founders")
response = founders.generate.near_text(
    query="What do people say about war?",
    limit=5,  # Increased limit to potentially get more diverse results
    filters=(
        Filter.by_property("date_from").greater_than(filter_date) &
        Filter.by_property("authors").contains_any(["Adams, John"])
    )
)
print(response.objects)

[GenerativeObject(uuid=_WeaviateUUIDInt('a93dc098-0ed6-42ca-b981-a0271a7eac8b'), metadata=MetadataReturn(creation_time=None, last_update_time=None, distance=None, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=None), properties={'date_from': datetime.datetime(1802, 8, 4, 0, 0, tzinfo=datetime.timezone.utc), 'date_to': datetime.datetime(1802, 8, 4, 0, 0, tzinfo=datetime.timezone.utc), 'recipients': ['National Intelligencer'], 'authors': ['Adams, John'], 'chunk_index': 1, 'doc_index': 61, 'chapter_title': 'From John Adams to National Intelligencer, 4 August 1802', 'chunk': 'From John Adams to National Intelligencer, 4 August 1802\n\nAbstract Opinions in favour of Monarchy or Democracy may exist without Injury to the State. Plato & Aristotle declare freely in their Writings a Veneration for Kingly Goverment. Yet, in the most democratical Governments of Greece, they were not persecuted. An End will be put to all Liberty of thought as well as Speech, if Dua
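
The `&` in the cell above is ordinary Python operator overloading: two filter objects combine into a new one. To show the idea without a database, here is a small sketch with a hypothetical `Condition` class. It is not Weaviate's `Filter` class, and the records are made up, but the way `&` and `|` compose is the same in spirit.

```python
class Condition:
    """A hypothetical stand-in for a filter: wraps a function from record to bool."""

    def __init__(self, test):
        self.test = test

    def __and__(self, other):
        # Both conditions must hold
        return Condition(lambda rec: self.test(rec) and other.test(rec))

    def __or__(self, other):
        # At least one condition must hold
        return Condition(lambda rec: self.test(rec) or other.test(rec))

    def __call__(self, rec):
        return self.test(rec)


after_1800 = Condition(lambda rec: rec["year"] > 1800)
by_john = Condition(lambda rec: "Adams, John" in rec["authors"])

records = [
    {"year": 1802, "authors": ["Adams, John"]},
    {"year": 1770, "authors": ["Adams, John"]},
    {"year": 1815, "authors": ["Adams, Abigail Smith"]},
]

both = [r["year"] for r in records if (after_1800 & by_john)(r)]
either = [r["year"] for r in records if (after_1800 | by_john)(r)]

print(both)    # [1802]
print(either)  # [1802, 1770, 1815]
```

When mixing `&` and `|` in one expression, remember that Python evaluates `&` before `|`, so adding parentheses around each group keeps the intent clear.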