# LlamaIndex + Pinecone 

In this tutorial, we show how to use LlamaIndex with Pinecone to answer complex queries over multiple data sources.  
* While Pinecone provides a powerful and efficient retrieval engine,
it remains challenging to answer complex questions that require multi-step reasoning and synthesis over many data sources.
* With LlamaIndex, we combine the power of vector similarity search and multi-step reasoning to deliver higher-quality, richer responses.


Here, we show two specific use cases:
1. Compare-and-contrast queries over Wikipedia articles about different cities.
2. Temporal queries that require reasoning over time.

In [1]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

#### Creating a Pinecone Index

In [2]:
import pinecone



In [3]:
# The API key is read from the PINECONE_API_KEY environment variable when api_key is not passed.
pinecone.init(environment="eu-west1-gcp")

In [None]:
# create index if it does not already exist
# dimensions are for text-embedding-ada-002
if "quickstart-index" not in pinecone.list_indexes():
    pinecone.create_index("quickstart-index", dimension=1536, metric="euclidean", pod_type="p1")
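
A note on the `metric` choice: the index is created with the Euclidean metric. For unit-length vectors, squared Euclidean distance and cosine similarity are tied by d² = 2 − 2·cos, so both metrics rank neighbors identically. OpenAI describes its embeddings as normalized to length 1, but treat that as an assumption to verify for whichever embedding model you use. The sketch below checks the identity on made-up 3-dimensional vectors (illustrative values, not real embeddings).

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    # The dot product equals cosine similarity because both inputs are unit length.
    return sum(x * y for x, y in zip(a, b))

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Made-up vectors, standing in for embeddings.
query = normalize([1.0, 2.0, 3.0])
docs = {
    "a": normalize([1.0, 2.0, 3.5]),
    "b": normalize([3.0, 2.0, 1.0]),
    "c": normalize([-1.0, 0.0, 1.0]),
}

# Highest cosine first vs. smallest distance first.
by_cosine = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
by_euclidean = sorted(docs, key=lambda name: squared_euclidean(query, docs[name]))
print(by_cosine, by_euclidean)
```

If your vectors are not normalized, the two metrics can disagree, and the metric should be chosen to match how the embedding model was trained.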

In [4]:
pinecone_index = pinecone.Index("quickstart-index")

# Use-Case 1: Compare and Contrast

#### Load Dataset

Fetch and load Wikipedia pages

In [5]:
from llama_index import SimpleDirectoryReader



In [6]:
wiki_titles = ["Toronto", "Seattle", "San Francisco", "Chicago", "Boston", "Washington, D.C.", "Cambridge, Massachusetts", "Houston"]

In [7]:
from pathlib import Path
import requests

data_path = Path('data_wiki')
# create the output directory once, before the download loop
data_path.mkdir(exist_ok=True)

for title in wiki_titles:
    response = requests.get(
        'https://en.wikipedia.org/w/api.php',
        params={
            'action': 'query',
            'format': 'json',
            'titles': title,
            'prop': 'extracts',
            'explaintext': True,
        }
    ).json()
    # the API keys pages by page ID, so take the first (and only) entry
    page = next(iter(response['query']['pages'].values()))
    wiki_text = page['extract']

    # write as UTF-8 so non-ASCII characters in the article text are saved reliably
    with open(data_path / f"{title}.txt", 'w', encoding='utf-8') as fp:
        fp.write(wiki_text)
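
The download loop indexes straight into `response['query']['pages']` and `page['extract']`, which raises a bare `KeyError` if a title is misspelled or the page does not exist. Below is a minimal sketch of a more defensive parser. The helper name `extract_page_text` and the two sample responses are hypothetical; the samples are hand-written in the shape the loop relies on, not captured API output.

```python
def extract_page_text(api_response: dict) -> str:
    """Return the plain-text extract of the first page in a MediaWiki query response."""
    pages = api_response.get("query", {}).get("pages", {})
    if not pages:
        raise ValueError("response contains no pages")
    # Pages are keyed by page ID; a single-title query yields one entry.
    page = next(iter(pages.values()))
    # Missing pages carry a "missing" marker and no "extract" field.
    if "missing" in page or "extract" not in page:
        raise ValueError(f"no extract for page {page.get('title')!r}")
    return page["extract"]

# Hand-written sample responses (assumed shapes, for illustration only).
found = {"query": {"pages": {"12345": {"pageid": 12345, "title": "Toronto", "extract": "Toronto is ..."}}}}
missing = {"query": {"pages": {"-1": {"title": "Torontoo", "missing": ""}}}}

print(extract_page_text(found))
try:
    extract_page_text(missing)
except ValueError as err:
    print("skipped:", err)
```

Failing early with a clear message keeps an empty or partial file from being written and indexed later.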


In [8]:
# Load all wiki documents
city_docs = {}
all_docs = []
for wiki_title in wiki_titles:
    city_docs[wiki_title] = SimpleDirectoryReader(input_files=[data_path / f"{wiki_title}.txt"]).load_data()
    all_docs.extend(city_docs[wiki_title])


#### Build Indices

In [9]:
from llama_index import GPTVectorStoreIndex, StorageContext
from llama_index.vector_stores import PineconeVectorStore

In [10]:
# Build index for each city document
city_indices = {}
index_summaries = {}
for wiki_title in wiki_titles:
    print(f"Building index for {wiki_title}")
    # create storage context
    vector_store = PineconeVectorStore(pinecone_index=pinecone_index, namespace=wiki_title)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    
    # build index
    city_indices[wiki_title] = GPTVectorStoreIndex.from_documents(city_docs[wiki_title], storage_context=storage_context)

    # set summary text for city
    index_summaries[wiki_title] = f"Wikipedia articles about {wiki_title}"

Building index for Toronto
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 20744 tokens
> [build_index_from_nodes] Total embedding token usage: 20744 tokens
Building index for Seattle
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 16942 tokens
> [build_index_from_nodes] Total embedding token usage: 16942 tokens
Building index for San Francisco
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_inde

#### Build Graph Query Engine for Compare & Contrast Query

In [11]:
from llama_index.indices.composability import ComposableGraph
from llama_index.indices.keyword_table.simple_base import GPTSimpleKeywordTableIndex

In [12]:
graph = ComposableGraph.from_indices(
    GPTSimpleKeywordTableIndex,
    list(city_indices.values()),
    list(index_summaries.values()),
    max_keywords_per_chunk=50
)

INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 0 tokens
> [build_index_from_nodes] Total embedding token usage: 0 tokens
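
The root keyword-table index decides which city indices a query is sent to by matching query keywords against keywords taken from each index summary. The following is a rough pure-Python sketch of that routing idea, not LlamaIndex's implementation: the `keywords` tokenizer, the `route` helper, and the rule for ignoring keywords shared by every summary are all assumptions made for illustration.

```python
import re

def keywords(text):
    """Lowercase alphabetic tokens; a crude stand-in for keyword extraction."""
    return set(re.findall(r"[a-z]+", text.lower()))

cities = ["Toronto", "Seattle", "Houston", "Chicago"]
summaries = {city: f"Wikipedia articles about {city}" for city in cities}

# Keyword table: keyword -> names of the indices whose summary contains it.
table = {}
for city, summary in summaries.items():
    for kw in keywords(summary):
        table.setdefault(kw, set()).add(city)

def route(query):
    """Return the indices whose summaries share a distinguishing keyword with the query."""
    hits = set()
    for kw in keywords(query):
        matched = table.get(kw, set())
        # Keywords present in every summary carry no routing signal.
        if len(matched) < len(summaries):
            hits |= matched
    return sorted(hits)

routed = route("Compare and contrast the demographics in Seattle, Houston, and Toronto.")
print(routed)
```

Only the three cities named in the query are selected; Chicago is never consulted, which is what keeps a compare-and-contrast query from touching every namespace.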


In [13]:
from llama_index.indices.query.query_transform.base import DecomposeQueryTransform
from llama_index.query_engine.transform_query_engine import TransformQueryEngine

decompose_transform = DecomposeQueryTransform(verbose=True)

custom_query_engines = {}
for wiki_title in wiki_titles:
    index = city_indices[wiki_title]
    query_engine = index.as_query_engine()
    query_engine = TransformQueryEngine(
        query_engine,
        query_transform=decompose_transform,
        transform_extra_info={'index_summary': index_summaries[wiki_title]},
    )
    custom_query_engines[index.index_id] = query_engine

custom_query_engines[graph.root_id] = graph.root_index.as_query_engine(
    retriever_mode='simple',
    response_mode='tree_summarize',
)


In [14]:
# with query decomposition in subindices
query_engine = graph.as_query_engine(custom_query_engines=custom_query_engines)
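
To make the control flow of the composed engine easier to follow, here is a sketch with plain functions in place of the LLM-backed components. Everything in it is a stand-in: `decompose`, `answer_from_index`, `synthesize`, and `run_graph_query` are hypothetical names, and the string templates only mimic what the real query transform, per-city query engines, and `tree_summarize` step would produce.

```python
def decompose(query, index_summary):
    """Stand-in for the LLM call that rewrites a query for one sub-index."""
    topic = index_summary.rsplit("about ", 1)[-1]
    return f"{query} (answer only for {topic})"

def answer_from_index(sub_query, city):
    """Stand-in for retrieval plus answer generation on one city's vector index."""
    return f"[{city}] answer to: {sub_query}"

def synthesize(partial_answers):
    """Stand-in for the root engine's summarization over the partial answers."""
    return "\n".join(partial_answers)

def run_graph_query(query, routed_cities, summaries):
    partials = []
    for city in routed_cities:
        # Each sub-index gets its own, narrower question.
        sub_query = decompose(query, summaries[city])
        partials.append(answer_from_index(sub_query, city))
    # The root combines the per-city answers into one response.
    return synthesize(partials)

summaries = {city: f"Wikipedia articles about {city}" for city in ["Seattle", "Houston"]}
result = run_graph_query("Compare the demographics.", ["Seattle", "Houston"], summaries)
print(result)
```

The point of the decomposition step is that each city index only ever sees a question it can answer from its own documents; the comparison itself happens at the root.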

#### Run Compare & Contrast Query

In [15]:
response = query_engine.query("Compare and contrast the demographics in Seattle, Houston, and Toronto.")

INFO:llama_index.indices.keyword_table.retrievers:> Starting query: Compare and contrast the demographics in Seattle, Houston, and Toronto.
> Starting query: Compare and contrast the demographics in Seattle, Houston, and Toronto.
INFO:llama_index.indices.keyword_table.retrievers:query keywords: ['demographics', 'houston', 'contrast', 'seattle', 'toronto', 'compare']
query keywords: ['demographics', 'houston', 'contrast', 'seattle', 'toronto', 'compare']
INFO:llama_index.indices.keyword_table.retrievers:> Extracted keywords: ['houston', 'seattle', 'toronto']
> Extracted keywords: ['houston', 'seattle', 'toronto']
> Current query: Compare and contrast the demographics in Seattle, Houston, and Toronto.
> New query:  What is the population of Houston?
INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens
> [retrieve] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [retrieve] Total embed

In [16]:
from llama_index.response.pprint_utils import pprint_response

pprint_response(response)

Final Response: Seattle, Houston, and Toronto are all large cities
with diverse populations. Houston is the largest of the three cities,
with a population of 2,304,580 according to the 2020 U.S. census.
Seattle is the second largest, with an estimated population of 730,000
people. Toronto is the third largest, with a population of 6,202,225
in 2021. All three cities have a diverse population, with a mix of
different ethnicities, cultures, and religions. Houston is known for
its large Hispanic population, while Seattle is known for its large
Asian population. Toronto is known for its multiculturalism, with a
large population of immigrants from all over the world.


# Use-Case 2: Temporal Query

Temporal queries such as "what happened after X" are intuitive to humans, but can often confuse vector databases.  

This is because the vector embedding focuses on the subject "X" rather than the important temporal cue. The result is irrelevant and misleading context that harms the final answer.  

LlamaIndex addresses this by explicitly maintaining node relationships and leveraging an LLM to automatically perform query expansion to find more relevant context.  
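
To see why the temporal cue gets lost, consider how little "after" and "before" change a query's vector. The sketch below uses a bag-of-words vector as a crude stand-in for a learned embedding (an assumption for illustration; real embedding models behave differently in detail) and compares the two questions used later in this notebook.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector: token -> count."""
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

after = bow("What did the author do after handing off Y Combinator to Sam Altman?")
before = bow("What did the author do before handing off Y Combinator to Sam Altman?")

# 12 of the 13 distinct tokens are shared, so the similarity is 12/13.
similarity = cosine(after, before)
print(round(similarity, 3))
```

With a similarity of about 0.92, both questions would most likely retrieve the same chunk about the handoff itself, even though they ask about opposite directions in time.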

In [26]:
from llama_index import SimpleDirectoryReader, StorageContext, GPTVectorStoreIndex
from llama_index.vector_stores import PineconeVectorStore


# load documents
documents = SimpleDirectoryReader('../data/paul_graham').load_data()

# define storage context
vector_store = PineconeVectorStore(pinecone_index=pinecone_index, namespace='pg_essay_0.6.0')
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# build index 
index = GPTVectorStoreIndex.from_documents(
    documents, 
    storage_context=storage_context,
    # override to store Node in document store in addition to vector store, necessary for the node postprocessor
    store_nodes_override=True  
)

INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 20729 tokens
> [build_index_from_nodes] Total embedding token usage: 20729 tokens


We can define an auto prev/next node postprocessor that uses LLM reasoning to decide whether to expand the retrieved context with neighboring (previous or next) nodes.

In [27]:
from llama_index.indices.postprocessor.node import AutoPrevNextNodePostprocessor

# define postprocessor
node_postprocessor = AutoPrevNextNodePostprocessor(
    docstore=index.storage_context.docstore, 
    service_context=index.service_context,
    num_nodes=3,
    verbose=True
)

# define query engine
query_engine = index.as_query_engine(
    similarity_top_k=1,
    node_postprocessors=[node_postprocessor],
)
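
The expansion step itself is simple once the direction is known. The sketch below is a toy model of it, not the LlamaIndex implementation: the `Node` dataclass, `build_chain`, and `expand` are hypothetical, and the `mode` argument is passed in by hand where the real postprocessor would ask the LLM to predict it.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    node_id: str
    text: str
    prev_id: Optional[str] = None
    next_id: Optional[str] = None

def build_chain(chunks):
    """Link consecutive chunks of one document with prev/next pointers."""
    nodes = {}
    for i, text in enumerate(chunks):
        nodes[f"n{i}"] = Node(
            node_id=f"n{i}",
            text=text,
            prev_id=f"n{i - 1}" if i > 0 else None,
            next_id=f"n{i + 1}" if i < len(chunks) - 1 else None,
        )
    return nodes

def expand(nodes, start_id, mode, num_nodes):
    """Collect up to num_nodes neighbors of the retrieved node in document order."""
    result = [nodes[start_id]]
    current = nodes[start_id]
    for _ in range(num_nodes if mode != "none" else 0):
        link = current.next_id if mode == "next" else current.prev_id
        if link is None:
            break  # reached the start or end of the document
        current = nodes[link]
        if mode == "next":
            result.append(current)
        else:
            result.insert(0, current)
    return [node.node_id for node in result]

docstore = build_chain([f"chunk {i}" for i in range(6)])
after_ids = expand(docstore, "n2", "next", 3)
before_ids = expand(docstore, "n2", "previous", 3)
print(after_ids, before_ids)
```

With `similarity_top_k=1` only one node is retrieved, so the neighbors pulled in by this step supply nearly all of the context the LLM sees for an "after" or "before" question.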

#### Example 1

In [28]:
# Infer that we need to search nodes after current one
response = query_engine.query(
    "What did the author do after handing off Y Combinator to Sam Altman?", 
)

INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens
> [retrieve] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 17 tokens
> [retrieve] Total embedding token usage: 17 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 1153 tokens
> [get_response] Total LLM token usage: 1153 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens
> [get_response] Total embedding token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 1153 tokens
> [get_response] Total LLM token usage: 1153 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens
> [get_response] Total embedding token usage: 0 tokens
> Postprocessor Predicted mode: next
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM

In [29]:
from llama_index.response.pprint_utils import pprint_response

pprint_response(response)

Final Response: After handing off Y Combinator to Sam Altman, the
author decided to take a break and pursue a completely different
activity. He chose to paint and spent most of the rest of 2014
painting. However, in November he ran out of steam and stopped
painting. He then started writing essays again and wrote a few that
weren't about startups. In March 2015, he started working on Lisp
again, and spent the next four years writing a new Lisp called Bel in
itself in Arc. He had to ban himself from writing essays during most
of this time, or he would never have finished. In late 2015 he spent 3
months writing essays, and when he went back to working on Bel he
could barely understand the code.


In comparison, naive top-k retrieval returns irrelevant context and leads to a hallucinated answer.

In [33]:
# define query engine
naive_query_engine = index.as_query_engine(
    similarity_top_k=1,
)

response = naive_query_engine.query(
    "What did the author do after handing off Y Combinator to Sam Altman?", 
)

INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens
> [retrieve] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 17 tokens
> [retrieve] Total embedding token usage: 17 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 1028 tokens
> [get_response] Total LLM token usage: 1028 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens
> [get_response] Total embedding token usage: 0 tokens


In [34]:
pprint_response(response, show_source=True)

Final Response: After handing off Y Combinator to Sam Altman, the
author went on to found OpenAI, a research laboratory dedicated to
artificial intelligence. He also wrote a book, "The Launch Pad: Inside
Y Combinator, Silicon Valley's Most Exclusive School for Startups,"
and became a partner at Founders Fund, a venture capital firm.
______________________________________________________________________
Source Node 1/1
Document ID: 204a1bd3-95cd-4421-accb-469fbd876a00
Similarity: 0.839429557
Text: in. We also noticed that the startups were becoming one
another's customers. We used to refer jokingly to the "YC GDP," but as
YC grows this becomes less and less of a joke. Now lots of startups
get their initial set of customers almost entirely from among their
batchmates.  I had not originally intended YC to be a full-time job. I
was going to ...


#### Example 2

In [35]:
# Infer that we need to search nodes before current one
response = query_engine.query(
    "What did the author do before handing off Y Combinator to Sam Altman?", 
)

INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens
> [retrieve] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 17 tokens
> [retrieve] Total embedding token usage: 17 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 1131 tokens
> [get_response] Total LLM token usage: 1131 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens
> [get_response] Total embedding token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 1131 tokens
> [get_response] Total LLM token usage: 1131 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens
> [get_response] Total embedding token usage: 0 tokens
> Postprocessor Predicted mode: previous
INFO:llama_index.token_counter.token_counter:> [get_response] Total

In [37]:
pprint_response(response, show_source=True)

Final Response: Before handing off Y Combinator to Sam Altman, the
author worked on several different projects. He wrote essays and
published them online, worked on spam filters, painted, cooked for
groups, bought a building in Cambridge, and started Y Combinator. He
also gave a talk at a Lisp conference, wrote a postscript file of the
talk and posted it online, and started angel investing. He also
schemed with Robert and Trevor about projects they could work on
together, and then he and Jessica Livingston started their own
investment firm. They created the Summer Founders Program, which was a
summer program for undergrads to start startups instead of getting
temporary jobs at tech companies. They also created the batch model,
which was to fund a bunch of startups all at once, twice a year, and
then to spend three months focusing intensively on trying to help
them. He also worked on a new version of Arc with Robert, wrote to
test the new Arc, and noticed the advantages of scale as YC g