# Introduction

This notebook shows how to use GPT Index to read data stored in Weaviate, index it, and query it.

The Weaviate instance in this notebook has been loaded with the podcast clips from the weaviate-podcast-search repository (https://github.com/weaviate/weaviate-podcast-search).

# Imports

In [88]:
import logging
import sys
from weaviate import Client

from gpt_index import (
    GPTListIndex,
    GPTTreeIndex,
    Document,
)

from gpt_index.composability import ComposableGraph

from gpt_index.readers.weaviate.reader import WeaviateReader


In [2]:
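# Make GPT Index verbose: log INFO messages to stdout.
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
# Note: basicConfig already attaches a stdout handler, so this second handler
# makes every message print twice (once with the "INFO:root:" prefix, once without).
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))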


# Useful functions

In [79]:
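def summarize_tree(tree_idx):
    """Summarize a tree index by querying a list index built over its root-node texts."""
    nodes = tree_idx.index_struct.root_nodes.values()
    text = '\n'.join(n.text for n in nodes)
    doc = Document(text)
    # A throwaway list index over one document; its query response is the summary
    summary = GPTListIndex([doc]).query("summarize this conversation").response
    return summary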


# Dataset

What does the dataset in Weaviate look like? Connect to the instance and check the node status:

In [3]:
WEAVIATE_URL = "http://weaviate:8080"


In [4]:
client = Client(WEAVIATE_URL)
client.cluster.get_nodes_status()


[{'gitHash': '5ce21bb',
  'name': 'node1',
  'shards': [{'class': 'PodClip', 'name': '2zbGNa7foWGF', 'objectCount': 394}],
  'stats': {'objectCount': 394, 'shardCount': 1},
  'status': 'HEALTHY',
  'version': '1.17.3'}]
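
The node status is a plain list of dictionaries, so the object counts can be cross-checked in a few lines. The sketch below is only an illustration: `total_objects` is a hypothetical helper, and `sample_status` is a trimmed copy of the output shown above rather than the result of a live call.

```python
def total_objects(nodes_status, class_name):
    """Sum objectCount over all shards of `class_name` across all nodes."""
    return sum(
        shard["objectCount"]
        for node in nodes_status
        for shard in (node.get("shards") or [])
        if shard["class"] == class_name
    )


# Trimmed copy of the node status printed above (not fetched from a server).
sample_status = [{
    "name": "node1",
    "shards": [{"class": "PodClip", "name": "2zbGNa7foWGF", "objectCount": 394}],
    "stats": {"objectCount": 394, "shardCount": 1},
    "status": "HEALTHY",
}]

print(total_objects(sample_status, "PodClip"))
```

With a live instance, the same helper could be applied to the value returned by `client.cluster.get_nodes_status()`.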

Check the schema of PodClip:

In [5]:
client.schema.get("PodClip")["properties"]


[{'dataType': ['text'],
  'description': 'The text content of the podcast clip',
  'moduleConfig': {'text2vec-transformers': {'skip': False,
    'vectorizeClassName': False,
    'vectorizePropertyName': False}},
  'name': 'content',
  'tokenization': 'word'},
 {'dataType': ['string'],
  'description': 'The speaker in the podcast',
  'moduleConfig': {'text2vec-transformers': {'skip': True,
    'vectorizeClassName': False,
    'vectorizePropertyName': False}},
  'name': 'speaker',
  'tokenization': 'word'},
 {'dataType': ['int'],
  'description': 'The podcast number.',
  'moduleConfig': {'text2vec-transformers': {'skip': True,
    'vectorizeClassName': False,
    'vectorizePropertyName': False}},
  'name': 'podNum'}]
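
In this schema only `content` has `skip` set to `False`, which means it is the only property that the vectorizer module embeds. A minimal sketch for reading that off the schema follows; `vectorized_properties` is a hypothetical helper, and `sample_properties` is a shortened copy of the output above.

```python
def vectorized_properties(properties, module="text2vec-transformers"):
    """Return the names of properties that the vectorizer module does not skip."""
    names = []
    for prop in properties:
        config = prop.get("moduleConfig", {}).get(module, {})
        if not config.get("skip", False):
            names.append(prop["name"])
    return names


# Shortened copy of the PodClip properties shown above.
sample_properties = [
    {"name": "content", "dataType": ["text"],
     "moduleConfig": {"text2vec-transformers": {"skip": False}}},
    {"name": "speaker", "dataType": ["string"],
     "moduleConfig": {"text2vec-transformers": {"skip": True}}},
    {"name": "podNum", "dataType": ["int"],
     "moduleConfig": {"text2vec-transformers": {"skip": True}}},
]

print(vectorized_properties(sample_properties))
```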

Load the data from Weaviate into GPT Index documents, one document per podcast:

In [6]:
pod_nums = [25, 26, 27, 28, 30, 31, 32, 33, 34]
docs = []

weaviate_reader = WeaviateReader(WEAVIATE_URL)

for pod_num in pod_nums:
    # Build a raw GraphQL query that fetches the clips of one podcast
    graphql_query = (
        client.query
        .get(class_name="PodClip", properties=["content", "speaker"])
        .with_where({
            "path": ["podNum"],
            "operator": "Equal",
            "valueInt": pod_num,
        })
        .with_limit(1000)
        .build()
    )

    # separate_documents=False merges all returned clips into a single document
    doc = weaviate_reader.load_data(
        graphql_query=graphql_query, separate_documents=False)[0]
    doc.doc_id = pod_num

    docs.append(doc)
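
The `where` filter in the loop differs only in the podcast number, so it can be factored out. The following is a small sketch under that assumption; `pod_num_filter` is a hypothetical helper and not part of the Weaviate client or GPT Index.

```python
def pod_num_filter(pod_num):
    """Build the `where` filter that selects the clips of one podcast."""
    # valueInt expects an integer; bool is excluded because it subclasses int.
    if not isinstance(pod_num, int) or isinstance(pod_num, bool):
        raise TypeError("pod_num must be an int because the filter uses valueInt")
    return {"path": ["podNum"], "operator": "Equal", "valueInt": pod_num}


filters = [pod_num_filter(n) for n in [25, 26, 27]]
print(filters[0])
```

The resulting dictionary has the same shape as the one passed to `.with_where(...)` above.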


# Indexing

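Build a tree index for each document. Only the first two documents are indexed here, because building a tree index calls the LLM and costs tokens 💸💸💸: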

In [44]:
tree_idxs = [GPTTreeIndex([doc]) for doc in docs[:2]]


INFO:root:> Building index from nodes: 3 chunks


> Building index from nodes: 3 chunks


INFO:root:> [build_index_from_documents] Total LLM token usage: 12292 tokens


> [build_index_from_documents] Total LLM token usage: 12292 tokens


INFO:root:> [build_index_from_documents] Total embedding token usage: 0 tokens


> [build_index_from_documents] Total embedding token usage: 0 tokens


INFO:root:> Building index from nodes: 4 chunks


> Building index from nodes: 4 chunks


INFO:root:> [build_index_from_documents] Total LLM token usage: 15552 tokens


> [build_index_from_documents] Total LLM token usage: 15552 tokens


INFO:root:> [build_index_from_documents] Total embedding token usage: 0 tokens


> [build_index_from_documents] Total embedding token usage: 0 tokens
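
Building a tree index means summarizing groups of text chunks into parent nodes, level by level, which is where the LLM tokens in the log above are spent. The toy sketch below illustrates that bottom-up idea only; it is not GPT Index's implementation. `build_tree`, `fake_summarize`, and the group size are all hypothetical, and the stand-in summarizer simply joins truncated text where a real index would call the LLM.

```python
def build_tree(chunks, summarize, num_children=10):
    """Summarize groups of nodes repeatedly until at most `num_children` remain.

    Returns the list of levels, leaves first and root nodes last.
    """
    if num_children < 2:
        raise ValueError("num_children must be at least 2")
    levels = [list(chunks)]
    while len(levels[-1]) > num_children:
        current = levels[-1]
        parents = [
            summarize(current[i:i + num_children])
            for i in range(0, len(current), num_children)
        ]
        levels.append(parents)
    return levels


def fake_summarize(group):
    """Stand-in for an LLM call: join the first 10 characters of each text."""
    return " | ".join(text[:10] for text in group)


levels = build_tree([f"chunk {i}" for i in range(7)], fake_summarize, num_children=2)
print([len(level) for level in levels])
```

Each pass shrinks the number of nodes by roughly a factor of `num_children`, so the number of summarization calls grows with the number of chunks.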


Summarize each tree index and set the summary as its text, so that the tree can be composed under another index:

In [86]:
for tree_idx in tree_idxs:
    summary = summarize_tree(tree_idx)
    tree_idx.set_text(summary)


INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens


> [build_index_from_documents] Total LLM token usage: 0 tokens


INFO:root:> [build_index_from_documents] Total embedding token usage: 0 tokens


> [build_index_from_documents] Total embedding token usage: 0 tokens


INFO:root:> [query] Total LLM token usage: 636 tokens


> [query] Total LLM token usage: 636 tokens


INFO:root:> [query] Total embedding token usage: 0 tokens


> [query] Total embedding token usage: 0 tokens


INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens


> [build_index_from_documents] Total LLM token usage: 0 tokens


INFO:root:> [build_index_from_documents] Total embedding token usage: 0 tokens


> [build_index_from_documents] Total embedding token usage: 0 tokens


INFO:root:> [query] Total LLM token usage: 877 tokens


> [query] Total LLM token usage: 877 tokens


INFO:root:> [query] Total embedding token usage: 0 tokens


> [query] Total embedding token usage: 0 tokens
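
To keep an eye on cost, the token counts can be summed from the log output. This is a minimal sketch, assuming the log format shown above stays as it is; `sum_token_usage` is a hypothetical helper. Because every message is printed twice in this notebook, it counts only the copies that carry the `INFO:root:` prefix.

```python
import re

TOKEN_RE = re.compile(r"Total (LLM|embedding) token usage: (\d+) tokens")


def sum_token_usage(log_lines, prefix="INFO:root:"):
    """Sum token counts per kind, counting only lines that start with `prefix`."""
    totals = {"LLM": 0, "embedding": 0}
    for line in log_lines:
        if not line.startswith(prefix):
            continue  # skip the duplicate copy printed without the prefix
        match = TOKEN_RE.search(line)
        if match:
            totals[match.group(1)] += int(match.group(2))
    return totals


# Lines copied from the output of the summarization step above.
log_lines = [
    "INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens",
    "INFO:root:> [query] Total LLM token usage: 636 tokens",
    "> [query] Total LLM token usage: 636 tokens",
    "INFO:root:> [query] Total embedding token usage: 0 tokens",
    "INFO:root:> [query] Total LLM token usage: 877 tokens",
    "> [query] Total LLM token usage: 877 tokens",
]

print(sum_token_usage(log_lines))
```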


Put a list index on top of the tree indices and wrap it in a composable graph:

In [90]:
list_idx = GPTListIndex(tree_idxs)
data_in_gpt_index = ComposableGraph.build_from_index(list_idx)


INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens


> [build_index_from_documents] Total LLM token usage: 0 tokens


INFO:root:> [build_index_from_documents] Total embedding token usage: 0 tokens


> [build_index_from_documents] Total embedding token usage: 0 tokens


# Querying

In [96]:
query = "What is Mosaic ML?"
answer = data_in_gpt_index.query(query)


INFO:root:> Starting query: What is Mosaic ML?


> Starting query: What is Mosaic ML?


INFO:root:> [query] Total LLM token usage: 620 tokens


> [query] Total LLM token usage: 620 tokens


INFO:root:> [query] Total embedding token usage: 0 tokens


> [query] Total embedding token usage: 0 tokens


INFO:root:> Starting query: What is Mosaic ML?


> Starting query: What is Mosaic ML?


INFO:root:>[Level 0] Selected node: [1]/[1]


>[Level 0] Selected node: [1]/[1]


INFO:root:>[Level 1] Selected node: [2]/[2]


>[Level 1] Selected node: [2]/[2]


INFO:root:> [query] Total LLM token usage: 4983 tokens


> [query] Total LLM token usage: 4983 tokens


INFO:root:> [query] Total embedding token usage: 0 tokens


> [query] Total embedding token usage: 0 tokens


INFO:root:> [query] Total LLM token usage: 5887 tokens


> [query] Total LLM token usage: 5887 tokens


INFO:root:> [query] Total embedding token usage: 0 tokens


> [query] Total embedding token usage: 0 tokens
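
The `[Level N] Selected node` lines come from the tree index walking from a root node down to a leaf, choosing one child per level. The toy sketch below mimics that traversal under a loud assumption: GPT Index asks the LLM to pick the child, whereas this stand-in scores nodes by word overlap. `Node`, `overlap`, and `traverse` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    children: list = field(default_factory=list)


def overlap(query, text):
    """Stand-in relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def traverse(roots, query):
    """Walk from the best-matching root down to a leaf, one choice per level."""
    candidates, level, chosen = roots, 0, None
    while candidates:
        chosen = max(candidates, key=lambda node: overlap(query, node.text))
        print(f"[Level {level}] selected: {chosen.text!r}")
        candidates, level = chosen.children, level + 1
    return chosen


leaf_a = Node("mosaic ml trains large language models")
leaf_b = Node("vector search trades recall for latency")
root = Node("summary: mosaic ml and vector search", [leaf_a, leaf_b])

best = traverse([root], "what is mosaic ml")
```

Only the nodes along the chosen path are read, which is why a tree query touches far less text than the full document.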


Show the answer:

In [104]:
print(answer.response)




ANSWER: Mosaic ML is an open source library that provides efficient methods for training large language models such as GPT-3. It also includes an orchestration stack as part of its Mosaic cloud, which allows users to train GPT-3 models for a starting price of $450,000. The goal of Mosaic ML is to drive the cost of training GPT-3 models down to as close to zero as possible.


Show the source documents:

In [114]:
sources = answer.source_nodes

for i, source in enumerate(sources, 1):
    print(f"Source {i}:")
    print(source.doc_id)
    print(source.source_text)
    print()


Source 1:
5e890958-4af2-43b0-9770-c80788e1e95d

Erik Bernhardsson and Etienne Dilocker discussed the power of vector models and the new databases and embedding models that have been developed in the past few years. They discussed the importance of making the trade-off between recall and latency explicit, and how to configure parameters to achieve high recall. They also discussed the use of mini-batching and matrix algebra for high performance model serving, the use of vector search providers and hybrid search, as well as two-stage pipeline approaches such as question-answer extraction. They concluded by discussing the importance of finding the right approach for the user's use case.

Source 2:
d40755b8-1c59-4308-877d-23295a43017a

Connor Shorten and Jonathan Frankle discussed the latest update with Mosaic ML Cloud and training large language models, such as GPT-3. They discussed the importance of data volume, transfer learning, and pre-training BERT. They also discussed the application