## Live coding: play with real data and Weaviate

Let's build our own instance of Weaviate, with real data!

We'll load the data from the Pro Git book<sup>*</sup> and go from there:

> *Pro Git book, written by Scott Chacon and Ben Straub and published by Apress, available under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 license.
>
> - https://git-scm.com/book/en/v2
> - https://github.com/progit/progit2

In [1]:
def get_book_text_objects():
    import requests

    text_objs = list()  # Collected text objects, one per section file

    # Source location
    api_base_url = 'https://api.github.com/repos/progit/progit2/contents/book'  # Book base URL
    chapter_urls = ['/01-introduction/sections', '/02-git-basics/sections']  # List of section URLs

    # Loop through book chapters
    for chapter_url in chapter_urls:
        response = requests.get(api_base_url + chapter_url)  # Get the JSON data for the section files in the chapter
        response.raise_for_status()  # Fail early on HTTP errors (e.g. an API rate limit)

        # Loop through inner files (sections)
        for file_info in response.json():
            if file_info['type'] == 'file':  # Only process files (not directories)
                file_response = requests.get(file_info['download_url'])
                file_response.raise_for_status()

                # Build objects including metadata
                chapter_title = file_info['download_url'].split('/')[-3]
                filename = file_info['download_url'].split('/')[-1]
                text_obj = {
                    "body": file_response.text,
                    "chapter_title": chapter_title,
                    "filename": filename
                }
                text_objs.append(text_obj)
    return text_objs


def load_text_files():
    from pathlib import Path

    data_dir = Path("data/pro_git/")
    all_files = []

    for file_path in data_dir.rglob("*.asc"):
        if file_path.is_file():
            all_files.append(file_path)

    text_objs = list()
    for fpath in all_files:
        with fpath.open(encoding="utf-8") as f:  # Read as UTF-8 regardless of platform default
            text_obj = {
                "body": f.read(),
                "chapter_title": fpath.parent.name,
                "filename": fpath.name
            }
            text_objs.append(text_obj)

    return text_objs


text_objs = get_book_text_objects()
# text_objs = load_text_files()
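
To see how `get_book_text_objects` derives its metadata, here is a minimal sketch of the URL splitting on its own. The URL below is an assumed example of what a `download_url` might look like; the real value comes from the GitHub API response.

```python
# Hypothetical download_url, assumed for illustration only
download_url = (
    "https://raw.githubusercontent.com/progit/progit2/main/"
    "book/01-introduction/sections/about-version-control.asc"
)

parts = download_url.split("/")
chapter_title = parts[-3]  # Two levels above the file: the chapter directory
filename = parts[-1]       # Last path component: the section file

print(chapter_title, filename)
```

If the repository layout changed (for example, if sections were nested one level deeper), the `-3` index would no longer point at the chapter directory.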

In [2]:
text_objs

[{'body': '=== About Version Control\n\n(((version control)))\nWhat is "`version control`", and why should you care?\nVersion control is a system that records changes to a file or set of files over time so that you can recall specific versions later.\nFor the examples in this book, you will use software source code as the files being version controlled, though in reality you can do this with nearly any type of file on a computer.\n\nIf you are a graphic or web designer and want to keep every version of an image or layout (which you would most certainly want to), a Version Control System (VCS) is a very wise thing to use.\nIt allows you to revert selected files back to a previous state, revert the entire project back to a previous state, compare changes over time, see who last modified something that might be causing a problem, who introduced an issue and when, and more.\nUsing a VCS also generally means that if you screw things up or lose files, you can easily recover.\nIn addition, yo

In [3]:
text_obj = text_objs[0]

In [4]:
text_obj

{'body': '=== About Version Control\n\n(((version control)))\nWhat is "`version control`", and why should you care?\nVersion control is a system that records changes to a file or set of files over time so that you can recall specific versions later.\nFor the examples in this book, you will use software source code as the files being version controlled, though in reality you can do this with nearly any type of file on a computer.\n\nIf you are a graphic or web designer and want to keep every version of an image or layout (which you would most certainly want to), a Version Control System (VCS) is a very wise thing to use.\nIt allows you to revert selected files back to a previous state, revert the entire project back to a previous state, compare changes over time, see who last modified something that might be causing a problem, who introduced an issue and when, and more.\nUsing a VCS also generally means that if you screw things up or lose files, you can easily recover.\nIn addition, you

In [5]:
# Set up Weaviate
import weaviate
import os

client = weaviate.Client(
    embedded_options=weaviate.EmbeddedOptions(),
    additional_headers={
        "X-OpenAI-Api-Key": os.environ["OPENAI_APIKEY"]
    }
)
client.is_ready()

Started /Users/jphwang/.cache/weaviate-embedded: process ID 62720


{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2023-08-11T12:13:53+01:00"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2023-08-11T12:13:53+01:00"}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"knowledgeblock_1q1b9BLntORd","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-08-11T12:13:53+01:00","took":277167}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"docchunk_ZzfkiDQ02JIO","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-08-11T12:13:53+01:00","took":243209}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50051","time":"2023-08-11T12:13:53+01:00"}
{"action":"hnsw_vector_cache_prefill","count":30

True

In [6]:
COLLECTION_NAME = "Chunk"

if client.schema.exists(COLLECTION_NAME):
    client.schema.delete_class(COLLECTION_NAME)

Define the collection - you can use the example definition from our docs!

In [7]:
chunk_class_def = {
    "class": COLLECTION_NAME,
    "vectorizer": "text2vec-openai",
    "vectorIndexConfig": {
        "distance": "cosine",
    },
    "moduleConfig": {
        "generative-openai": {}
    },
    "properties": [
        {
            "name": "title",
            "dataType": ["text"]
        },        
        {
            "name": "chunk",
            "dataType": ["text"]
        },
        {
            "name": "chunk_no",
            "dataType": ["int"]
        },
        {
            "name": "url",
            "dataType": ["text"],
            "tokenization": "field",
            "moduleConfig": {
                "text2vec-openai": {
                    "skip": True
                },
            }
        },
    ],
}

client.schema.create_class(chunk_class_def)

{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"chunk_pzhMqCJ2YcSh","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-08-11T12:13:54+01:00","took":75625}


Split the text into fixed-size chunks (300 characters each by default)

In [8]:
from typing import List

def get_chunks(text_in: str, chunk_size: int = 300) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    import math

    chunks = list()
    for i in range(math.ceil(len(text_in) / chunk_size)):
        chunk = text_in[i * chunk_size: (i+1) * chunk_size]
        chunks.append(chunk)
    return chunks

Make sure our chunking looks okay

In [9]:
get_chunks(text_obj["body"])

['=== About Version Control\n\n(((version control)))\nWhat is "`version control`", and why should you care?\nVersion control is a system that records changes to a file or set of files over time so that you can recall specific versions later.\nFor the examples in this book, you will use software source code',
 ' as the files being version controlled, though in reality you can do this with nearly any type of file on a computer.\n\nIf you are a graphic or web designer and want to keep every version of an image or layout (which you would most certainly want to), a Version Control System (VCS) is a very wise thi',
 'ng to use.\nIt allows you to revert selected files back to a previous state, revert the entire project back to a previous state, compare changes over time, see who last modified something that might be causing a problem, who introduced an issue and when, and more.\nUsing a VCS also generally means tha',
 "t if you screw things up or lose files, you can easily recover.\nIn additi

In [10]:
text_obj["body"][-50:]

'centralized systems, such as hierarchical models.\n'
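
The chunk output above shows that fixed-size slicing can cut words in half (for example `thi` / `ng`). One common mitigation is to let consecutive chunks overlap, so text near a boundary appears whole in at least one chunk. The following is a sketch of that variant, not the chunker used in the rest of this notebook; `get_overlapping_chunks` and its `overlap` parameter are hypothetical names.

```python
from typing import List


def get_overlapping_chunks(text_in: str, chunk_size: int = 300, overlap: int = 50) -> List[str]:
    """Slice text into windows of chunk_size characters that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # How far the window advances each time
    chunks = []
    for start in range(0, len(text_in), step):
        chunks.append(text_in[start:start + chunk_size])
        if start + chunk_size >= len(text_in):
            break  # The window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
demo_chunks = get_overlapping_chunks(sample, chunk_size=10, overlap=4)
print(demo_chunks)
```

Overlap increases the number of stored chunks, so it trades some storage and vectorization cost for better context at chunk boundaries.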

In [11]:
chunk_class_def["properties"]

[{'name': 'title', 'dataType': ['text']},
 {'name': 'chunk', 'dataType': ['text']},
 {'name': 'chunk_no', 'dataType': ['int']},
 {'name': 'url',
  'dataType': ['text'],
  'tokenization': 'field',
  'moduleConfig': {'text2vec-openai': {'skip': True}}}]

In [12]:
text_obj.keys()

dict_keys(['body', 'chapter_title', 'filename'])
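
The two cells above line up the collection's property names against the keys available on each source object. As a sketch, the same comparison can be done programmatically; `check_object_keys` is a hypothetical helper, and the definition below is a trimmed copy of the one above.

```python
def check_object_keys(class_def: dict, data_object: dict) -> set:
    """Return keys in data_object that are not declared as properties in class_def."""
    declared = {prop["name"] for prop in class_def["properties"]}
    return set(data_object) - declared


# Trimmed copy of the collection definition (only property names matter here)
example_class_def = {
    "class": "Chunk",
    "properties": [
        {"name": "title", "dataType": ["text"]},
        {"name": "chunk", "dataType": ["text"]},
        {"name": "chunk_no", "dataType": ["int"]},
        {"name": "url", "dataType": ["text"]},
    ],
}

raw_keys = check_object_keys(example_class_def, {"body": "", "chapter_title": "", "filename": ""})
mapped_keys = check_object_keys(example_class_def, {"title": "", "chunk": "", "chunk_no": 0, "url": ""})
print(raw_keys, mapped_keys)
```

The raw source keys do not match the schema, which is why the import cell builds a new `data_object` with the declared property names.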

In [13]:
from weaviate.util import generate_uuid5

with client.batch(batch_size=50) as batch:
    chunks = get_chunks(text_obj["body"])
    for i, chunk in enumerate(chunks):
        data_object = {
            "title": text_obj["filename"],
            "chunk": chunk,
            "chunk_no": i,
            "url": text_obj["chapter_title"] + ": " + text_obj["filename"]
        }   
        batch.add_data_object(
            data_object=data_object,
            class_name=COLLECTION_NAME,
            uuid=generate_uuid5(data_object)
        )

In [14]:
client.query.aggregate(COLLECTION_NAME).with_meta_count().do()

{'data': {'Aggregate': {'Chunk': [{'meta': {'count': 16}}]}}}

In [15]:
response = (
    client.query.get(COLLECTION_NAME, ["chunk", "chunk_no"])
    .with_near_text({"concepts": ["history of git"]})
    .with_limit(3)
    .do()
)

display(response)

{'data': {'Get': {'Chunk': [{'chunk': ' -- whenever you have the entire history of the project in a single place, you risk losing everything.\n\n==== Distributed Version Control Systems\n\n(((version control,distributed)))\nThis is where Distributed Version Control Systems (DVCSs) step in.\nIn a DVCS (such as Git, Mercurial, Bazaar or Darcs), ',
     'chunk_no': 12},
    {'chunk': '=== About Version Control\n\n(((version control)))\nWhat is "`version control`", and why should you care?\nVersion control is a system that records changes to a file or set of files over time so that you can recall specific versions later.\nFor the examples in this book, you will use software source code',
     'chunk_no': 0},
    {'chunk': 'ed local VCSs that had a simple database that kept all the changes to files under revision control.\n\n.Local version control diagram\nimage::images/local.png[Local version control diagram]\n\nOne of the most popular VCS tools was a system called RCS, which is still dist

In [16]:
def near_text(query, limit=3):
    """Vector (semantic) search: print the `limit` chunks closest in meaning to `query`."""
    import json
    response = (
        client.query.get(COLLECTION_NAME, ["chunk", "chunk_no", "title", "url"])
        .with_near_text({"concepts": [query]})
        .with_limit(limit)
        .do()
    )
    for r in response["data"]["Get"][COLLECTION_NAME]:
        print(json.dumps(r, indent=2))
    return None

In [17]:
near_text("origin of git")

{
  "chunk": "=== About Version Control\n\n(((version control)))\nWhat is \"`version control`\", and why should you care?\nVersion control is a system that records changes to a file or set of files over time so that you can recall specific versions later.\nFor the examples in this book, you will use software source code",
  "chunk_no": 0,
  "title": "about-version-control.asc",
  "url": "01-introduction: about-version-control.asc"
}
{
  "chunk": " -- whenever you have the entire history of the project in a single place, you risk losing everything.\n\n==== Distributed Version Control Systems\n\n(((version control,distributed)))\nThis is where Distributed Version Control Systems (DVCSs) step in.\nIn a DVCS (such as Git, Mercurial, Bazaar or Darcs), ",
  "chunk_no": 12,
  "title": "about-version-control.asc",
  "url": "01-introduction: about-version-control.asc"
}
{
  "chunk": "ed local VCSs that had a simple database that kept all the changes to files under revision control.\n\n.Local ve

In [18]:
near_text("version control")

{
  "chunk": "=== About Version Control\n\n(((version control)))\nWhat is \"`version control`\", and why should you care?\nVersion control is a system that records changes to a file or set of files over time so that you can recall specific versions later.\nFor the examples in this book, you will use software source code",
  "chunk_no": 0,
  "title": "about-version-control.asc",
  "url": "01-introduction: about-version-control.asc"
}
{
  "chunk": "centralized)))\nThe next major issue that people encounter is that they need to collaborate with developers on other systems.\nTo deal with this problem, Centralized Version Control Systems (CVCSs) were developed.\nThese systems (such as CVS, Subversion, and Perforce) have a single server that contains",
  "chunk_no": 7,
  "title": "about-version-control.asc",
  "url": "01-introduction: about-version-control.asc"
}
{
  "chunk": "ed local VCSs that had a simple database that kept all the changes to files under revision control.\n\n.Local versio

In [19]:
near_text("git add remote")

{
  "chunk": " clone is really a full backup of all the data.\n\n.Distributed version control diagram\nimage::images/distributed.png[Distributed version control diagram]\n\nFurthermore, many of these systems deal pretty well with having several remote repositories they can work with, so you can collaborate with differ",
  "chunk_no": 14,
  "title": "about-version-control.asc",
  "url": "01-introduction: about-version-control.asc"
}
{
  "chunk": "ed local VCSs that had a simple database that kept all the changes to files under revision control.\n\n.Local version control diagram\nimage::images/local.png[Local version control diagram]\n\nOne of the most popular VCS tools was a system called RCS, which is still distributed with many computers today.",
  "chunk_no": 5,
  "title": "about-version-control.asc",
  "url": "01-introduction: about-version-control.asc"
}
{
  "chunk": " -- whenever you have the entire history of the project in a single place, you risk losing everything.\n\n==== Dist

In [20]:
from weaviate.util import generate_uuid5

with client.batch(batch_size=50) as batch:
    for text_obj in text_objs:
        chunks = get_chunks(text_obj["body"])
        for i, chunk in enumerate(chunks):
            data_object = {
                "title": text_obj["filename"],
                "chunk": chunk,
                "chunk_no": i,
                "url": text_obj["chapter_title"] + ": " + text_obj["filename"]
            }   
            batch.add_data_object(
                data_object=data_object,
                class_name=COLLECTION_NAME,
                uuid=generate_uuid5(data_object)
            )

In [21]:
client.query.aggregate(COLLECTION_NAME).with_meta_count().do()

{'data': {'Aggregate': {'Chunk': [{'meta': {'count': 357}}]}}}
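
This second import loops over every file, including the first one that was already loaded in an earlier cell. Because each ID is derived from the object's content, importing the same chunk again should map to the same UUID rather than create a new object. The sketch below illustrates that idea with the standard library's `uuid.uuid5`; it is not Weaviate's `generate_uuid5` implementation, and `content_uuid` is a hypothetical name.

```python
import json
import uuid


def content_uuid(data_object: dict) -> str:
    """Derive a stable UUID from an object's content (illustrative only)."""
    # Serialize with sorted keys so key order does not change the ID
    serialized = json.dumps(data_object, sort_keys=True)
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, serialized))


obj_a = {"title": "remotes.asc", "chunk": "some text", "chunk_no": 0}
obj_b = {"chunk_no": 0, "chunk": "some text", "title": "remotes.asc"}  # Same content, different key order
obj_c = {"title": "remotes.asc", "chunk": "some text", "chunk_no": 1}

print(content_uuid(obj_a) == content_uuid(obj_b))
print(content_uuid(obj_a) == content_uuid(obj_c))
```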

In [22]:
near_text("git add remote", limit=5)

{
  "chunk": "<url>`:\n\n[source,console]\n----\n$ git remote\norigin\n$ git remote add pb https://github.com/paulboone/ticgit\n$ git remote -v\norigin\thttps://github.com/schacon/ticgit (fetch)\norigin\thttps://github.com/schacon/ticgit (push)\npb\thttps://github.com/paulboone/ticgit (fetch)\npb\thttps://github.com/paulboone/t",
  "chunk_no": 12,
  "title": "remotes.asc",
  "url": "02-git-basics: remotes.asc"
}
{
  "chunk": "----\n$ git remote -v\norigin\thttps://github.com/schacon/ticgit (fetch)\norigin\thttps://github.com/schacon/ticgit (push)\n----\n\nIf you have more than one remote, the command lists them all.\nFor example, a repository with multiple remotes for working with several collaborators might look something like t",
  "chunk_no": 7,
  "title": "remotes.asc",
  "url": "02-git-basics: remotes.asc"
}
{
  "chunk": "\nWe've mentioned and given some demonstrations of how the `git clone` command implicitly adds the `origin` remote for you.\nHere's how to add a new remote expli

In [23]:
def keyword_search(query, limit=3):
    """Keyword (BM25) search: print up to `limit` chunks that contain the query terms."""
    import json
    response = (
        client.query.get(COLLECTION_NAME, ["chunk", "chunk_no", "title", "url"])
        .with_bm25(query=query)
        .with_limit(limit)
        .do()
    )
    for r in response["data"]["Get"][COLLECTION_NAME]:
        print(json.dumps(r, indent=2))
    return None
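
Keyword search matches on the literal tokens of the query, so a word that never appears in the text finds nothing even when related concepts are present. The toy sketch below shows only that matching behavior on two made-up sentences; it is not BM25 (there is no scoring or ranking), and `toy_keyword_match` is a hypothetical helper.

```python
import re


def toy_keyword_match(query: str, documents: list) -> list:
    """Return documents that share at least one whole-word token with the query."""
    query_tokens = set(re.findall(r"\w+", query.lower()))
    matches = []
    for doc in documents:
        doc_tokens = set(re.findall(r"\w+", doc.lower()))
        if query_tokens & doc_tokens:  # Any token in common counts as a match
            matches.append(doc)
    return matches


# Made-up example sentences, loosely modeled on the book text
docs = [
    "You clone a repository with git clone <url>.",
    "Every clone is really a full backup of all the data.",
]

print(toy_keyword_match("clone", docs))      # Both sentences contain the token
print(toy_keyword_match("duplicate", docs))  # Neither sentence contains the token
```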

In [24]:
keyword_search("clone")

{
  "chunk": "erver#_getting_git_on_a_server>> for more details).\n\nYou clone a repository with `git clone <url>`.(((git commands, clone)))\nFor example, if you want to clone the Git linkable library called `libgit2`, you can do so like this:\n\n[source,console]\n----\n$ git clone https://github.com/libgit2/libgit2\n---",
  "chunk_no": 9,
  "title": "getting-a-repository.asc",
  "url": "02-git-basics: getting-a-repository.asc"
}
{
  "chunk": "e worked on or used.\n\nIf you want to clone the repository into a directory named something other than `libgit2`, you can specify the new directory name as an additional argument:\n\n[source,console]\n----\n$ git clone https://github.com/libgit2/libgit2 mylibgit\n----\n\nThat command does the same thing as ",
  "chunk_no": 11,
  "title": "getting-a-repository.asc",
  "url": "02-git-basics: getting-a-repository.asc"
}
{
  "chunk": "amples use a very simple project called \"`simplegit`\".\nTo get the project, run:\n\n[source,console]\n----\n$ git 

In [25]:
keyword_search("duplicate")

In [26]:
near_text("duplicate")

{
  "chunk": "\nIf you and someone else clone at the same time and they push upstream and then you push upstream, your push will rightly be rejected.\nYou'll have to fetch their work first and incorporate it into yours before you'll be allowed to push.\nSee <<ch03-git-branching#ch03-git-branching>> for more detailed",
  "chunk_no": 25,
  "title": "remotes.asc",
  "url": "02-git-basics: remotes.asc"
}
{
  "chunk": " clone is really a full backup of all the data.\n\n.Distributed version control diagram\nimage::images/distributed.png[Distributed version control diagram]\n\nFurthermore, many of these systems deal pretty well with having several remote repositories they can work with, so you can collaborate with differ",
  "chunk_no": 14,
  "title": "about-version-control.asc",
  "url": "01-introduction: about-version-control.asc"
}
{
  "chunk": "t if you screw things up or lose files, you can easily recover.\nIn addition, you get all this for very little overhead.\n\n==== Local Version Con

In [27]:
def hybrid_search(query, limit=3):
    """Hybrid search: print the top `limit` chunks from combined keyword and vector results."""
    import json
    response = (
        client.query.get(COLLECTION_NAME, ["chunk", "chunk_no", "title", "url"])
        .with_hybrid(query=query)
        .with_limit(limit)
        .do()
    )
    for r in response["data"]["Get"][COLLECTION_NAME]:
        print(json.dumps(r, indent=2))
    return None

In [28]:
hybrid_search("complete backup")

{
  "chunk": " clone is really a full backup of all the data.\n\n.Distributed version control diagram\nimage::images/distributed.png[Distributed version control diagram]\n\nFurthermore, many of these systems deal pretty well with having several remote repositories they can work with, so you can collaborate with differ",
  "chunk_no": 14,
  "title": "about-version-control.asc",
  "url": "01-introduction: about-version-control.asc"
}
{
  "chunk": "orking on.\nIf the hard disk the central database is on becomes corrupted, and proper backups haven't been kept, you lose absolutely everything -- the entire history of the project except whatever single snapshots people happen to have on their local machines.\nLocal VCSs suffer from this same problem",
  "chunk_no": 11,
  "title": "about-version-control.asc",
  "url": "01-introduction: about-version-control.asc"
}
{
  "chunk": "clients don't just check out the latest snapshot of the files; rather, they fully mirror the repository, including it

In [29]:
near_text("complete backup")

{
  "chunk": "orking on.\nIf the hard disk the central database is on becomes corrupted, and proper backups haven't been kept, you lose absolutely everything -- the entire history of the project except whatever single snapshots people happen to have on their local machines.\nLocal VCSs suffer from this same problem",
  "chunk_no": 11,
  "title": "about-version-control.asc",
  "url": "01-introduction: about-version-control.asc"
}
{
  "chunk": " clone is really a full backup of all the data.\n\n.Distributed version control diagram\nimage::images/distributed.png[Distributed version control diagram]\n\nFurthermore, many of these systems deal pretty well with having several remote repositories they can work with, so you can collaborate with differ",
  "chunk_no": 14,
  "title": "about-version-control.asc",
  "url": "01-introduction: about-version-control.asc"
}
{
  "chunk": "clients don't just check out the latest snapshot of the files; rather, they fully mirror the repository, including it

In [30]:
keyword_search("complete backup")

{
  "chunk": " clone is really a full backup of all the data.\n\n.Distributed version control diagram\nimage::images/distributed.png[Distributed version control diagram]\n\nFurthermore, many of these systems deal pretty well with having several remote repositories they can work with, so you can collaborate with differ",
  "chunk_no": 14,
  "title": "about-version-control.asc",
  "url": "01-introduction: about-version-control.asc"
}
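
The three "complete backup" searches above return overlapping results in different orders. Hybrid search merges the keyword and vector rankings into a single list. The toy sketch below shows one way such a merge can work, using a weighted sum of rank-based scores; it is an illustration under assumed inputs, not Weaviate's fusion algorithm, and `toy_hybrid_rank` and the way `alpha` is applied here are hypothetical.

```python
def toy_hybrid_rank(keyword_ranked: list, vector_ranked: list, alpha: float = 0.5) -> list:
    """Merge two ranked ID lists; alpha=1 favors the vector list, alpha=0 the keyword list."""
    scores = {}
    for rank, doc_id in enumerate(keyword_ranked):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - alpha) / (rank + 1)
    for rank, doc_id in enumerate(vector_ranked):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha / (rank + 1)
    # Highest combined score first; ties broken by ID for a stable order
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# First visible chunk numbers from the "complete backup" results above
keyword_ranked = [14]
vector_ranked = [11, 14]

merged = toy_hybrid_rank(keyword_ranked, vector_ranked)
print(merged)
```

A chunk that appears in both lists accumulates score from each, which is why it can outrank a chunk that tops only one list.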


In [31]:
def generative_search(query, prompt, limit=3):
    """Hybrid search, then pass the retrieved chunks to the generative module with `prompt`."""
    response = (
        client.query.get(COLLECTION_NAME, ["chunk", "chunk_no", "title", "url"])
        .with_hybrid(query=query)
        .with_limit(limit)
        .with_generate(grouped_task=prompt)
        .do()
    )
    # With grouped_task, the single generated answer is read from the first returned object
    print(response["data"]["Get"][COLLECTION_NAME][0]["_additional"]["generate"]["groupedResult"])
    return None

In [32]:
generative_search("git remote", "based on this, how would I add a remote repository?")

To add a remote repository, you can use the `git remote add` command followed by the name you want to give to the remote and the URL of the repository. 

For example, if you want to add a remote repository named "origin" with the URL "https://github.com/schacon/ticgit", you would run the following command:

```
$ git remote add origin https://github.com/schacon/ticgit
```

After adding the remote, you can verify that it has been added successfully by using the `git remote -v` command. It will list all the remotes associated with your repository:

```
$ git remote -v
origin  https://github.com/schacon/ticgit (fetch)
origin  https://github.com/schacon/ticgit (push)
```

If you have multiple remotes, all of them will be listed.


In [33]:
generative_search(
    "git remote", 
    "based on this, how would I add a remote repository? Respond as a poem"
)

In the realm of Git, where repositories reside,
To add a remote, let me be your guide.
With the command "git remote add",
A new connection we shall have.

But before we proceed, let's take a glance,
At the remotes already in this dance.
Using "git remote show origin" we see,
The Fetch and Push URLs, as they should be.

If only one remote is what you possess,
A simple "git remote -v" will impress.
It shows the origin, both fetch and push,
With URLs that make your repository hush.

But if multiple remotes are what you seek,
Collaborating with others, so unique,
The command will list them, one by one,
For each collaborator, a connection is spun.

And as you delve deeper into Git's embrace,
More information will come to your space.
With "git remote show origin" in your hand,
URLs and branches, you'll understand.

So fear not, my friend, adding a remote is grand,
With Git's commands, you'll surely expand.
Connect and collaborate, with ease and delight,
In this vast world of repositories, sh