# Simple Index Demo

#### Load documents, build the GPTSimpleVectorIndex

In [1]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from gpt_index import (
    GPTSimpleVectorIndex,
    SimpleDirectoryReader,
    LLMPredictor,
)
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI
from IPython.display import Markdown, display

In [2]:
# LLMPredictor (gpt-3)
llm_predictor_gpt3 = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-davinci-003"))

# LLMPredictor (gpt-4)
llm_predictor_gpt4 = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-4"))

In [4]:
# load documents
documents = SimpleDirectoryReader('../paul_graham_essay/data').load_data()

In [5]:
index = GPTSimpleVectorIndex(documents)

INFO:gpt_index.token_counter.token_counter:> [build_index_from_documents] Total LLM token usage: 0 tokens
> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:gpt_index.token_counter.token_counter:> [build_index_from_documents] Total embedding token usage: 17598 tokens
> [build_index_from_documents] Total embedding token usage: 17598 tokens


In [5]:
# save index to disk
index.save_to_disk('index_simple.json')

In [3]:
# load index from disk
index = GPTSimpleVectorIndex.load_from_disk('index_simple.json')
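
Saving and reloading the index avoids paying for the embedding pass again on every run. A minimal sketch of that "load if present, otherwise build and save" pattern follows. The helper `load_or_build` and the three stub functions are hypothetical names introduced only for illustration; in this notebook the callables would presumably be `lambda: GPTSimpleVectorIndex(documents)`, `lambda idx, p: idx.save_to_disk(p)`, and `GPTSimpleVectorIndex.load_from_disk`.

```python
import json
import os
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_build(
    path: str,
    build: Callable[[], T],
    save: Callable[[T, str], None],
    load: Callable[[str], T],
) -> T:
    """Return the object stored at `path`, building and saving it if missing."""
    if os.path.exists(path):
        return load(path)
    obj = build()
    save(obj, path)
    return obj


# Hypothetical stand-ins for the index constructor and its save/load methods.
build_calls = []


def build_stub():
    build_calls.append(1)  # record each (expensive) build
    return {"docs": ["essay"]}


def save_stub(obj, path):
    with open(path, "w") as f:
        json.dump(obj, f)


def load_stub(path):
    with open(path) as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index_stub.json")
    first = load_or_build(path, build_stub, save_stub, load_stub)   # builds and saves
    second = load_or_build(path, build_stub, save_stub, load_stub)  # loads from disk

print(first == second, len(build_calls))
```

The second call never reaches `build`, which is the point of persisting the index between sessions.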

#### Query Index

In [4]:
from gpt_index.indices.query.query_transform.base import StepDecomposeQueryTransform

# step-decompose transform backed by gpt-4
step_decompose_transform = StepDecomposeQueryTransform(
    llm_predictor_gpt4, verbose=True
)

# step-decompose transform backed by gpt-3
step_decompose_transform_gpt3 = StepDecomposeQueryTransform(
    llm_predictor_gpt3, verbose=True
)

In [5]:
index.set_text("Used to answer questions about the author")

In [6]:
# set the logging level to DEBUG for more detailed output
response_gpt4 = index.query(
    "Who was in the first batch of the accelerator program the author started?",
    query_transform=step_decompose_transform,
    llm_predictor=llm_predictor_gpt4
)

[33;1m[1;3m> Current query: Who was in the first batch of the accelerator program the author started?
[0m[38;5;200m[1;3m> Formatted prompt: The original question is as follows: Who was in the first batch of the accelerator program the author started?
We have an opportunity to answer some, or all of the question from a knowledge source. Context information for the knowledge source is provided below, as well as previous reasoning steps.
Given the context and previous reasoning, return a question that can be answered from the context. This question can be the same as the original question, or this question can represent a subcomponent of the overall question.It should not be irrelevant to the original question.
If we cannot extract more information from the context, provide 'None' as the answer. Some examples are given below: 

Question: How many Grand Slam titles does the winner of the 2020 Australian Open have?
Knowledge source context: Provides names of the winners of the 2020 Aus

In [10]:
display(Markdown(f"<b>{response_gpt4}</b>"))

<b>

The first batch of the Y Combinator accelerator program included startups such as Reddit (founded by Steve Huffman and Alexis Ohanian), Loopt, Weebly, Twitch (founded by Justin Kan and Emmett Shear), and Aaron Swartz (who had already helped write the RSS spec and would later become a martyr for open access). This initial group of startups was supported by Y Combinator with funding, mentorship, and resources to help them grow their businesses, and Sam Altman, who would later become the second president of YC, was also part of this batch.</b>

In [21]:
# each sub_qa entry pairs a sub-question with the response object that answered it
sub_qa = response_gpt4.extra_info["sub_qa"]
tuples = [(t[0], t[1].response) for t in sub_qa]
print(tuples)

[('What accelerator program did the author start?', 'The author started the accelerator program called Y Combinator (YC), which focuses on supporting and funding new startups every six months.'), ("Who was in the first batch of Y Combinator's accelerator program?", "The first batch of Y Combinator's accelerator program, which started in 2005, included startups like Reddit, founded by Steve Huffman and Alexis Ohanian; Justin Kan and Emmett Shear, who later founded Twitch; Aaron Swartz, who had already helped write the RSS spec and would later become a martyr for open access; and Sam Altman, who would later become the second president of YC. These startups and their founders faced various challenges during the program, and their problems became Y Combinator's problems, leading to a highly engaging and dynamic work environment."), ("Who were the founders and startups in the first batch of Y Combinator's accelerator program?", "The first batch of Y Combinator's accelerator program, which t
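
The `sub_qa` trace above, together with the formatted prompt, suggests an iterative loop: propose a sub-question, answer it against the index, add the pair to the reasoning so far, and stop once the decomposer returns 'None'. The sketch below illustrates that loop in plain Python. It is not gpt_index's implementation: `multi_step_query`, `fake_decompose`, and `fake_answer` are hypothetical names, the stubs stand in for the LLM call and for `index.query`, and the `max_steps` cap is an assumption added to guarantee termination.

```python
from typing import Callable, List, Optional, Tuple

SubQA = Tuple[str, str]  # (sub-question, answer)


def multi_step_query(
    question: str,
    decompose: Callable[[str, List[SubQA]], Optional[str]],
    answer: Callable[[str], str],
    max_steps: int = 5,
) -> List[SubQA]:
    """Collect (sub-question, answer) pairs until `decompose` returns None."""
    trace: List[SubQA] = []
    for _ in range(max_steps):
        # Ask for the next sub-question given the original question
        # and the reasoning gathered so far.
        sub_question = decompose(question, trace)
        if sub_question is None:
            break
        trace.append((sub_question, answer(sub_question)))
    return trace


# Hypothetical stub: replays two sub-questions seen in the trace above.
def fake_decompose(question: str, trace: List[SubQA]) -> Optional[str]:
    steps = [
        "What accelerator program did the author start?",
        "Who was in the first batch of Y Combinator's accelerator program?",
    ]
    return steps[len(trace)] if len(trace) < len(steps) else None


# Hypothetical stub: a real version would query the index here.
def fake_answer(sub_question: str) -> str:
    return f"(answer to: {sub_question})"


trace = multi_step_query(
    "Who was in the first batch of the accelerator program the author started?",
    fake_decompose,
    fake_answer,
)
for q, a in trace:
    print(f"Q: {q}\nA: {a}\n")
```

Printing the pairs one per block, as the final loop does, is also easier to read than the raw list of tuples.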

In [8]:
response_gpt4 = index.query(
    "In which city did the author found his first company, Viaweb?",
    query_transform=step_decompose_transform,
    llm_predictor=llm_predictor_gpt4
)

[33;1m[1;3m> Current query: In which city did the author found his first company, Viaweb?
[0m[38;5;200m[1;3m> Formatted prompt: The original question is as follows: In which city did the author found his first company, Viaweb?
We have an opportunity to answer some, or all of the question from a knowledge source. Context information for the knowledge source is provided below, as well as previous reasoning steps.
Given the context and previous reasoning, return a question that can be answered from the context. This question can be the same as the original question, or this question can represent a subcomponent of the overall question.It should not be irrelevant to the original question.
If we cannot extract more information from the context, provide 'None' as the answer. Some examples are given below: 

Question: How many Grand Slam titles does the winner of the 2020 Australian Open have?
Knowledge source context: Provides names of the winners of the 2020 Australian Open
Previous re

In [9]:
print(response_gpt4)



Answer: Paul Graham founded his first company, Viaweb, in Cambridge, Massachusetts.


In [21]:
response_gpt3 = index.query(
    "In which city did the author found his first company, Viaweb?",
    query_transform=step_decompose_transform_gpt3,
    llm_predictor=llm_predictor_gpt3
)

[33;1m[1;3m> Current query: In which city did the author found his first company, Viaweb?
[0m[38;5;200m[1;3m> Formatted prompt: The original question is as follows: In which city did the author found his first company, Viaweb?
We have an opportunity to answer some, or all of the question from a knowledge source. Context information for the knowledge source is provided below, as well as previous reasoning steps.
Given the context and previous reasoning, return a question that can be answered from the context. This question can be the same as the original question, or this question can represent a subcomponent of the overall question.It should not be irrelevant to the original question.
If we cannot extract more information from the context, provide 'None' as the answer. Some examples are given below: 

Question: How many Grand Slam titles does the winner of the 2020 Australian Open have?
Knowledge source context: Provides names of the winners of the 2020 Australian Open
Previous re

In [23]:
print(response_gpt3)



The author founded his first company, Viaweb, in Cambridge, Massachusetts, which is located in the Greater Boston area of the northeastern United States, directly north of Boston, across the Charles River. Paul Graham, the creator of the Lisp programming language, bought a house in Cambridge in the spring of 1995 to work on a new dialect of Lisp called Arc, and used it as a base for his work on the language.
