# LlamaIndex + Pinecone 

In this tutorial, we show how to use LlamaIndex with Pinecone to answer complex queries over multiple data sources.  
* Pinecone provides a powerful and efficient retrieval engine, but it remains challenging to answer complex questions that require multi-step reasoning and synthesis over many data sources.
* With LlamaIndex, we combine the power of vector similarity search and multi-step reasoning to deliver higher-quality, richer responses.


Here, we show two specific use cases:
1. Compare-and-contrast queries over Wikipedia articles about different cities.
2. Temporal queries that require reasoning over time.

In [1]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

#### Creating a Pinecone Index

In [2]:
import pinecone



In [3]:
import os
pinecone.init(api_key=os.environ["PINECONE_API_KEY"], environment="eu-west1-gcp")  # key is read from an environment variable

In [None]:
# create index if it does not already exist
# dimensions are for text-embedding-ada-002
if "quickstart-index" not in pinecone.list_indexes():
    pinecone.create_index("quickstart-index", dimension=1536, metric="euclidean", pod_type="p1")

In [4]:
pinecone_index = pinecone.Index("quickstart-index")
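
The index above is created with `metric="euclidean"`. As a side note, for unit-length vectors Euclidean distance and cosine similarity yield the same ranking, because the squared distance equals 2 - 2·cos. The sketch below checks this on small hand-made vectors; the vectors and helper names are purely illustrative, and whether a given embedding model returns normalized vectors is an assumption worth verifying for your setup.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


# Illustrative 3-dimensional vectors (real embeddings here have 1536 dimensions)
query = normalize([1.0, 2.0, 3.0])
candidates = {
    "a": normalize([1.0, 2.0, 2.5]),
    "b": normalize([-1.0, 0.5, 0.0]),
    "c": normalize([3.0, 1.0, 0.5]),
}

# Smallest distance first vs. largest similarity first
by_distance = sorted(candidates, key=lambda k: euclidean(query, candidates[k]))
by_cosine = sorted(candidates, key=lambda k: cosine(query, candidates[k]), reverse=True)

print(by_distance, by_cosine)
```

If the embeddings are not normalized, the two metrics can rank results differently, so the metric chosen at index creation matters.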

# Use-Case 1: Compare and Contrast

#### Load Dataset

Fetch and load Wikipedia pages

In [None]:
from llama_index import SimpleDirectoryReader


In [None]:
wiki_titles = ["Toronto", "Seattle", "San Francisco", "Chicago", "Boston", "Washington, D.C.", "Cambridge, Massachusetts", "Houston"]

In [None]:
from pathlib import Path
import requests

data_path = Path('data_wiki')
# create the output directory once, before downloading
data_path.mkdir(exist_ok=True)

for title in wiki_titles:
    # fetch the plain-text extract of the page from the MediaWiki API
    response = requests.get(
        'https://en.wikipedia.org/w/api.php',
        params={
            'action': 'query',
            'format': 'json',
            'titles': title,
            'prop': 'extracts',
            # 'exintro': True,
            'explaintext': True,
        }
    ).json()
    page = next(iter(response['query']['pages'].values()))
    wiki_text = page['extract']

    # write as UTF-8 so non-ASCII characters are saved consistently across platforms
    with open(data_path / f"{title}.txt", 'w', encoding='utf-8') as fp:
        fp.write(wiki_text)


In [None]:
# Load all wiki documents
city_docs = {}
all_docs = []
for wiki_title in wiki_titles:
    city_docs[wiki_title] = SimpleDirectoryReader(input_files=[data_path / f"{wiki_title}.txt"]).load_data()
    all_docs.extend(city_docs[wiki_title])


#### Build Indices

In [12]:
from llama_index import GPTVectorStoreIndex, StorageContext
from llama_index.vector_stores import PineconeVectorStore
from IPython.display import Markdown, display

In [13]:
# Build index for each city document
city_indices = {}
index_summaries = {}
for wiki_title in wiki_titles:
    print(f"Building index for {wiki_title}")
    # create storage context
    vector_store = PineconeVectorStore(pinecone_index=pinecone_index, namespace=wiki_title)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    
    # build index
    city_indices[wiki_title] = GPTVectorStoreIndex.from_documents(city_docs[wiki_title], storage_context=storage_context)

    # set summary text for city
    index_summaries[wiki_title] = f"Wikipedia articles about {wiki_title}"

Building index for Toronto
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 20744 tokens
> [build_index_from_nodes] Total embedding token usage: 20744 tokens
Building index for Seattle
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 16942 tokens
> [build_index_from_nodes] Total embedding token usage: 16942 tokens
Building index for San Francisco
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_inde

#### Build Graph Query Engine for Compare & Contrast Query

In [14]:
from llama_index.indices.composability import ComposableGraph
from llama_index.indices.keyword_table.simple_base import GPTSimpleKeywordTableIndex

In [15]:
graph = ComposableGraph.from_indices(
    GPTSimpleKeywordTableIndex,
    [index for _, index in city_indices.items()], 
    [summary for _, summary in index_summaries.items()],
    max_keywords_per_chunk=50
)

INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 0 tokens
> [build_index_from_nodes] Total embedding token usage: 0 tokens


In [16]:
from llama_index.indices.query.query_transform.base import DecomposeQueryTransform
from llama_index.query_engine.transform_query_engine import TransformQueryEngine

decompose_transform = DecomposeQueryTransform(verbose=True)

custom_query_engines = {}
for wiki_title in wiki_titles:
    index = city_indices[wiki_title]
    query_engine = index.as_query_engine()
    query_engine = TransformQueryEngine(
        query_engine,
        query_transform=decompose_transform,
        transform_extra_info={'index_summary': index_summaries[wiki_title]},
    )
    custom_query_engines[index.index_id] = query_engine

custom_query_engines[graph.root_id] = graph.root_index.as_query_engine(
    retriever_mode='simple',
    response_mode='tree_summarize',
)


In [17]:
# with query decomposition in subindices
query_engine = graph.as_query_engine(custom_query_engines=custom_query_engines)
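
As a rough mental model of the routing step, the keyword-table root index can be pictured as matching keywords from the query against the summary of each sub-index and forwarding the query only to the matches. The sketch below is a simplified illustration and not LlamaIndex's actual implementation; `extract_keywords` and `route_query` are hypothetical helpers, and the query rewriting done by `DecomposeQueryTransform` (which requires an LLM) is left out.

```python
import re


def extract_keywords(text):
    """Lowercased alphabetic tokens; a crude stand-in for real keyword extraction."""
    return set(re.findall(r"[a-z]+", text.lower()))


def route_query(query, summaries):
    """Return the titles whose summary shares at least one keyword with the query."""
    query_keywords = extract_keywords(query)
    return [
        title
        for title, summary in summaries.items()
        if query_keywords & extract_keywords(summary)
    ]


# Summaries follow the same pattern as index_summaries in the tutorial
titles = ["Toronto", "Seattle", "San Francisco", "Chicago", "Boston", "Houston"]
summaries = {title: f"Wikipedia articles about {title}" for title in titles}

selected = route_query(
    "Compare and contrast the demographics in Seattle, Houston, and Toronto.",
    summaries,
)
print(selected)
```

Only the sub-indices for the cities named in the query are selected; each would then receive its own decomposed sub-question.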

#### Run Compare & Contrast Query

In [18]:
response = query_engine.query("Compare and contrast the demographics in Seattle, Houston, and Toronto.")

INFO:llama_index.indices.keyword_table.retrievers:> Starting query: Compare and contrast the demographics in Seattle, Houston, and Toronto.
> Starting query: Compare and contrast the demographics in Seattle, Houston, and Toronto.
INFO:llama_index.indices.keyword_table.retrievers:query keywords: ['contrast', 'demographics', 'seattle', 'toronto', 'compare', 'houston']
query keywords: ['contrast', 'demographics', 'seattle', 'toronto', 'compare', 'houston']
INFO:llama_index.indices.keyword_table.retrievers:> Extracted keywords: ['seattle', 'toronto', 'houston']
> Extracted keywords: ['seattle', 'toronto', 'houston']
> Current query: Compare and contrast the demographics in Seattle, Houston, and Toronto.
> New query:  What is the population of Seattle?
INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens
> [retrieve] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [retrieve] Total embed

In [19]:
from llama_index.response.notebook_utils import display_response

display_response(response, show_source=False)

**`Final Response:`** Seattle, Houston, and Toronto are all large cities with diverse populations. Seattle has the smallest population of the three cities, with 753,675 people as of 2021. Houston has the second largest population, with 2,304,580 people according to the 2020 U.S. census. Toronto has the largest population of the three cities, with 6,202,225 people in 2021. All three cities have a mix of different ethnicities and cultures, with Seattle having the most diverse population. Houston and Toronto have larger immigrant populations than Seattle. All three cities have a mix of different economic classes, with Toronto having the highest median household income.

# Use-Case 2: Temporal Query

Temporal queries such as "what happened after X" are intuitive to humans, but can often confuse vector databases.  

This is because the vector embedding focuses on the subject "X" rather than the important temporal cue. The retrieved context is then irrelevant or misleading, which harms the final answer.  

LlamaIndex addresses this by explicitly maintaining node relationships and leveraging an LLM to automatically perform query expansion to find more relevant context.  
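
To see why the temporal cue gets lost, consider a toy experiment that uses bag-of-words vectors as a crude stand-in for real embeddings (an assumption made only for illustration; real embedding models behave differently in detail). The "after" and "before" versions of a question differ in a single word, so they are nearly identical as vectors and select the same chunk. The chunk texts below are made-up placeholders.

```python
import math
import re
from collections import Counter


def bow(text):
    """Bag-of-words vector: counts of lowercased alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_chunk(query, chunks):
    """Index of the chunk most similar to the query (top-1 retrieval)."""
    query_vec = bow(query)
    return max(range(len(chunks)), key=lambda i: cosine(query_vec, bow(chunks[i])))


after_q = "What did the author do after handing off Y Combinator to Sam Altman?"
before_q = "What did the author do before handing off Y Combinator to Sam Altman?"

# Placeholder chunks in document order
chunks = [
    "I handed off Y Combinator to Sam Altman",
    "I spent most of 2014 painting",
    "I worked on a new version of Arc",
]

similarity = cosine(bow(after_q), bow(before_q))
print(f"query similarity: {similarity:.3f}")
print(best_chunk(after_q, chunks), best_chunk(before_q, chunks))
```

Both queries land on the chunk about the hand-off itself, which says nothing about what came before or after it.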

In [7]:
from llama_index import SimpleDirectoryReader, StorageContext, GPTVectorStoreIndex
from llama_index.vector_stores import PineconeVectorStore


# load documents
documents = SimpleDirectoryReader('../data/paul_graham').load_data()

# define storage context
vector_store = PineconeVectorStore(pinecone_index=pinecone_index, namespace='pg_essay')
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# build index 
index = GPTVectorStoreIndex.from_documents(documents, storage_context=storage_context, store_nodes_override=True)

INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 20729 tokens
> [build_index_from_nodes] Total embedding token usage: 20729 tokens


We can define an auto prev/next node postprocessor that uses LLM reasoning to decide whether to expand the retrieved context with preceding or following nodes.

In [8]:
from llama_index.indices.postprocessor.node import AutoPrevNextNodePostprocessor

# define postprocessor
node_postprocessor = AutoPrevNextNodePostprocessor(
    docstore=index.storage_context.docstore, 
    service_context=index.service_context,
    num_nodes=3,
    verbose=True
)

# define query engine
query_engine = index.as_query_engine(
    similarity_top_k=1,
    node_postprocessors=[node_postprocessor],
)
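
The expansion step itself can be sketched without any LLM. In the toy version below, each node records the ids of its previous and next chunk, and once a direction is known we follow those links to collect up to `num_nodes` neighbours. This is a simplified illustration, not LlamaIndex's implementation: the `Node` class, `build_nodes`, and `expand` are hypothetical names, the chunk texts are placeholders, and the direction is passed in by hand rather than predicted by a model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    node_id: int
    text: str
    prev_id: Optional[int] = None
    next_id: Optional[int] = None


def build_nodes(chunks):
    """Link consecutive chunks through prev/next ids."""
    last = len(chunks) - 1
    return {
        i: Node(
            node_id=i,
            text=text,
            prev_id=i - 1 if i > 0 else None,
            next_id=i + 1 if i < last else None,
        )
        for i, text in enumerate(chunks)
    }


def expand(nodes, start_id, mode, num_nodes):
    """Return the start node plus up to num_nodes neighbours, in document order."""
    if mode not in ("next", "previous"):
        raise ValueError(f"unknown mode: {mode}")
    current = nodes[start_id]
    result = [current]
    for _ in range(num_nodes):
        neighbour_id = current.next_id if mode == "next" else current.prev_id
        if neighbour_id is None:  # reached the start or end of the document
            break
        current = nodes[neighbour_id]
        result.append(current)
    return sorted(result, key=lambda n: n.node_id)


# Placeholder chunks in document order
chunks = ["started YC", "ran YC", "handed YC to Sam", "painted", "wrote essays", "built Bel"]
nodes = build_nodes(chunks)

# Suppose top-1 retrieval returned node 2
after = [n.text for n in expand(nodes, start_id=2, mode="next", num_nodes=3)]
before = [n.text for n in expand(nodes, start_id=2, mode="previous", num_nodes=3)]
print(after)
print(before)
```

Returning the nodes in document order keeps the expanded context readable as a continuous passage when it is passed on for answer synthesis.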

#### Example 1

In [9]:
# Infer that we need to search nodes after current one
response = query_engine.query(
    "What did the author do after handing off Y Combinator to Sam Altman?", 
)

INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens
> [retrieve] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 17 tokens
> [retrieve] Total embedding token usage: 17 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 1153 tokens
> [get_response] Total LLM token usage: 1153 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens
> [get_response] Total embedding token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 1153 tokens
> [get_response] Total LLM token usage: 1153 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens
> [get_response] Total embedding token usage: 0 tokens
> Postprocessor Predicted mode: next
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM

In [10]:
from llama_index.response.notebook_utils import display_response

display_response(response)

**`Final Response:`** After handing off Y Combinator to Sam Altman, the author decided to take a break and focus on painting. He spent most of the rest of 2014 painting, but eventually ran out of steam and stopped working on it. He then started writing essays again and wrote a bunch of new ones over the next few months. In March 2015, he started working on Lisp again and spent the next four years writing a new Lisp called Bel. During this time, he had to ban himself from writing essays in order to focus on the project. In the summer of 2016, he and his family moved to England and he wrote most of Bel there. In the fall of 2019, Bel was finally finished and he wrote a bunch of essays about topics he had stacked up. He then wrote an essay for himself to answer the question of how he should choose what to do next. After taking advice from Robert Morris, he decided to move on from Y Combinator and focus on other projects. He has since been involved in a variety of projects, including writing essays, working on Lisp, and investing in startups.

In comparison, naive top-k retrieval returns irrelevant context and produces a hallucinated answer:

In [13]:
# define query engine
naive_query_engine = index.as_query_engine(
    similarity_top_k=1,
)

response = naive_query_engine.query(
    "What did the author do after handing off Y Combinator to Sam Altman?", 
)

display_response(response, show_source=True, source_length=1000)


INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens
> [retrieve] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 17 tokens
> [retrieve] Total embedding token usage: 17 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 1028 tokens
> [get_response] Total LLM token usage: 1028 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens
> [get_response] Total embedding token usage: 0 tokens


**`Final Response:`** After handing off Y Combinator to Sam Altman, the author went on to found OpenAI, a research laboratory dedicated to artificial intelligence. He also wrote a book, "The Launch Pad: Inside Y Combinator, Silicon Valley's Most Exclusive School for Startups," and became a partner at Founders Fund, a venture capital firm.

---

**`Source Node 1/1`**

**Document ID:** b06bb792-633e-498a-84ce-84ec68308a82<br>**Similarity:** 0.83942616<br>**Text:** in. We also noticed that the startups were becoming one another's customers. We used to refer jokingly to the "YC GDP," but as YC grows this becomes less and less of a joke. Now lots of startups get their initial set of customers almost entirely from among their batchmates.

I had not originally intended YC to be a full-time job. I was going to do three things: hack, write essays, and work on YC. As YC grew, and I grew more excited about it, it started to take up a lot more than a third of my attention. But for the first few years I was still able to work on other things.

In the summer of 2006, Robert and I started working on a new version of Arc. This one was reasonably fast, because it was compiled into Scheme. To test this new Arc, I wrote Hacker News in it. It was originally meant to be a news aggregator for startup founders and was called Startup News, but after a few months I got tired of reading about nothing but startups. Plus it wasn't startup founders we wanted to reach. ...<br>

#### Example 2

In [14]:
# Infer that we need to search nodes before current one
response = query_engine.query(
    "What did the author do before handing off Y Combinator to Sam Altman?", 
)

INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens
> [retrieve] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 17 tokens
> [retrieve] Total embedding token usage: 17 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 1131 tokens
> [get_response] Total LLM token usage: 1131 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens
> [get_response] Total embedding token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 1131 tokens
> [get_response] Total LLM token usage: 1131 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens
> [get_response] Total embedding token usage: 0 tokens
> Postprocessor Predicted mode: previous
INFO:llama_index.token_counter.token_counter:> [get_response] Total

In [15]:
display_response(response)

**`Final Response:`** Before handing off Y Combinator to Sam Altman, the author wrote essays, worked on spam filters, painted, cooked for groups, bought a building in Cambridge, went on dates with Jessica Livingston, gave talks about how to start a startup, worked on a new version of Arc, and wrote a book called Hackers & Painters. He also schemed with Robert and Trevor about projects they could work on together, and worked on a subset of a new architecture that could be done as an open source project. He also wrote essays about a variety of topics, and started angel investing. He also recognized the potential of the internet to allow anyone to publish anything, and began to focus on writing essays online. He worked on projects that weren't prestigious, such as Lisp, Still Life painting, Viaweb, and Y Combinator, which he believed had the potential to be discovered.