## GPT Keyword Table Index Comparisons

This notebook compares three keyword table indices: GPTSimpleKeywordTableIndex, GPTRAKEKeywordTableIndex, and GPTKeywordTableIndex.

- GPTSimpleKeywordTableIndex - uses a simple regex to extract keywords.
- GPTRAKEKeywordTableIndex - uses RAKE (Rapid Automatic Keyword Extraction) to extract keywords.
- GPTKeywordTableIndex - uses GPT (an LLM call) to extract keywords.
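
To make the regex-based approach concrete, here is a minimal sketch of what such an extractor might look like. It is not the gpt_index implementation: `simple_extract_keywords` and the tiny `STOPWORDS` set are assumptions made up for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real extractors use a much larger one).
STOPWORDS = {"a", "an", "and", "after", "at", "did", "do", "his", "in", "of", "the", "to", "what"}


def simple_extract_keywords(text, max_keywords=10):
    """Return the most frequent non-stopword tokens found in `text`."""
    # Lowercase the text and pull out runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    # Count every token that is not a stopword.
    counts = Counter(token for token in tokens if token not in STOPWORDS)
    return [word for word, _ in counts.most_common(max_keywords)]


keywords = simple_extract_keywords("What did the author do after his time at YC?")
print(keywords)
```

The extractor needs no model call, so it is cheap, but it can only return words that literally appear in the text.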

#### GPTSimpleKeywordTableIndex

In [1]:
import sys
sys.path.append("../../")
from gpt_index import GPTSimpleKeywordTableIndex, SimpleDirectoryReader
from IPython.display import Markdown, display

In [2]:
# build keyword index
documents = SimpleDirectoryReader('data').load_data()

In [35]:
from gpt_index import (
    GPTKeywordTableIndex,
    SimpleDirectoryReader,
    LLMPredictor,
    PromptHelper,
)
from langchain import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# use a local Hugging Face model (GPT-2) instead of the default LLM
model_id = "gpt2"
# NOTE: gpt2's context window is 1024 tokens, so 4096 overstates what the model can read
max_input_size = 4096
num_output = 200
num_chunk_overlap = 20

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=num_output,
)

# tell the index how to size prompts and chunks for this model
prompt_helper = PromptHelper(
    max_input_size,
    num_output,
    num_chunk_overlap,
    chunk_size_limit=512,
)

# wrap the pipeline so gpt_index can call it through LangChain
hf = HuggingFacePipeline(pipeline=pipe)
llm_predictor = LLMPredictor(llm=hf)


In [36]:
# hf("tell me a joke")

In [37]:
index = GPTSimpleKeywordTableIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)

> Adding chunk: 		

What I Worked On

February 2021

Before col...
> Adding chunk: microcomputers, everything changed. Now you cou...
> Adding chunk: Mistress, which featured an intelligent compute...
> Adding chunk: At the time this bothered me, but now it seems ...
> Adding chunk: book about something to help you learn it. The ...
> Adding chunk: it had to be possible to make enough to survive...
> Adding chunk: I'll give you something to read in a few days."...
> Adding chunk: foundation that summer, but I still don't know ...
> Adding chunk: they don't sit very still. So the traditional m...
> Adding chunk: and my money was running out, so at the end of ...
> Adding chunk: learned that it's better for technology compani...
> Adding chunk: belonged to, seemed to be pretty rigorous. No d...
> Adding chunk: out not to be controlled by the Romans. You can...
> Adding chunk: now in grad school at Harvard. It seemed to me ...
> Adding chunk: For some reason there was no bed frame or shee

In [38]:
response = index.query("What did the author do after his time at YC?")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


> Starting query: What did the author do after his time at YC?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


query keywords: ['tell', 'words', 'hope', 'put', 'title', "you are asked to type in two different words: 'yes_three_to_get_yes_three'", 'text', 'keywords', 'first', 'easily', 'talking', 'type:\nthis question has 4 answers. in the following sections', 'placing', 'key', 'following', 'list', 'got', 'asked', 'find', 'next', 'keyword', 'extracted', '4', 'sections', 'questions', "the keyword you're looking for", 'word', 'exact', 'want', 'answers', 'ones', "and 'no_three_to_get_no_three'. type in your exact keyword by placing a '@' next to it. this will give us a list of the keywords in the text.\nso we have four keywords in yc who will tell us who they're talking about. let's say they've got a title. a word is always a word – that's how many people know a key word 'hope'.\nlet's say we've extracted from them a title with a long title. we could easily extract four different words and put them one at a time in this case. the first four words are the ones we want", 'give', 'could', 'the other t

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


> [query] Total LLM token usage: 9216 tokens
> [query] Total embedding token usage: 0 tokens
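
The sentence-length "keywords" in the log above are consistent with a free-form completion being split on commas. The following is a hypothetical sketch of that failure mode, not the gpt_index parser: `parse_keyword_response` and the `KEYWORDS:` prefix are assumptions for illustration.

```python
def parse_keyword_response(response, start_token="KEYWORDS:"):
    """Split a model completion into a set of lowercase keywords on commas."""
    body = response.strip()
    # Drop the expected prefix if the model actually produced it.
    if body.startswith(start_token):
        body = body[len(start_token):]
    return {part.strip().lower() for part in body.split(",") if part.strip()}


# A completion that follows the requested format.
well_formed = "KEYWORDS: Y Combinator, essays, Lisp"
# A rambling completion, as a small model that ignores the format might produce.
rambling = "Type in your exact keyword by placing a '@' next to it. So we have four keywords, let's say"

print(parse_keyword_response(well_formed))
print(parse_keyword_response(rambling))
```

When the model ignores the requested format, every comma-separated fragment becomes a "keyword", so the lookup is driven by text unrelated to the question.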


In [39]:
display(Markdown(f"<b>{response}</b>"))

<b> Otherwise, see if you can do a good job using the original version, adding more context.

The one bit that is a little different to traditional m.o. is that it doesn't give a clear direction on the direction or how it may be interpreted. This is not necessarily good, but it's more likely than not. The fact that you should use the following (in your own notes and later in in an answer): the direction you're supposed to stick to based on a person's history (the first time you put the topic in question in the context that the idea is the one to be considered) or to set your own own way to draw.

We don't know for sure what happens next, but the idea of something that is supposed to be "correct" is a pretty interesting idea. Here is a good link to the Wikipedia article.

- David D. Auerbach (2007)

(via Wikimedia Commons) - You Can't Always</b>

In [40]:
index2 = GPTKeywordTableIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)
response = index2.query("What did the author do after his time at YC?")
display(Markdown(f"<b>{response}</b>"))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


> Adding chunk: 		

What I Worked On

February 2021

Before col...
> Adding chunk: microcomputers, everything changed. Now you cou...
> Adding chunk: Mistress, which featured an intelligent compute...
> Adding chunk: At the time this bothered me, but now it seems ...
> Adding chunk: book about something to help you learn it. The ...
> Adding chunk: it had to be possible to make enough to survive...
> Adding chunk: I'll give you something to read in a few days."...
> Adding chunk: foundation that summer, but I still don't know ...
> Adding chunk: they don't sit very still. So the traditional m...
> Adding chunk: and my money was running out, so at the end of ...
> Adding chunk: learned that it's better for technology compani...
> Adding chunk: belonged to, seemed to be pretty rigorous. No d...
> Adding chunk: out not to be controlled by the Romans. You can...
> Adding chunk: now in grad school at Harvard. It seemed to me ...
> Adding chunk: For some reason there was no bed frame or shee

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

> [build_index_from_documents] Total LLM token usage: 26955 tokens
> [build_index_from_documents] Total embedding token usage: 0 tokens
> Starting query: What did the author do after his time at YC?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


query keywords: ['words', 'select', 'prefixs', 'character', 'result', 'matter', 'expression', 'provide', "add the 'prefix' if you have one of the many comma-separated 'prefixs' set to '\\u' (this doesn't matter if it's a string that has both a double quote and a regular expression). for a list of all of the available keyword sets", 'text', "'d'", 'prefix', 'following', "or 'i' separators. if you're just looking for a name", 'u', 'search', 'use', 'list', 'regular', 'created', 'comma', 'shown', 'keyword', "of the type '*'. try to search 'keyword sets' with the single 's'", 'separated', 'full', 'finally', 'available', 'string', 'try', 'double', 'set', 'quote', "select 'keypenis' and click on 'keyword sets'. the search result will be shown", 'possible', 'click', "then extract the following words from the text and add the '|' or '|' in a single colons. 'keypenis' will provide you with the full character set. finally", 'if any', 'sets', 'then use \'keyword sets\'" \'keypenis\' will be create

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


> [query] Total LLM token usage: 9295 tokens
> [query] Total embedding token usage: 0 tokens



In [5]:
display(Markdown(f"<b>{response}</b>"))

<b>

The author went on to write essays and work on other projects, including a new version of the Arc programming language and Hacker News. He also started painting, but stopped after a few months. In 2015, he started working on a new Lisp programming language, which he finished in 2019. The author then moved to England in 2016 with his family and continued writing essays. In 2019, he finished Bel and wrote a bunch of essays on various topics.

The author also worked on building online stores in 1995 after finishing ANSI Common Lisp. He ran the software on servers and let users control it by clicking on links, which was a new concept at the time. In 1996, he co-founded Viaweb with Robert Morris, which was later acquired by Yahoo in 1998. After leaving Yahoo, the author moved back to New York and started painting again. In 2000, he had the idea for a web application that would let people edit code on a server and host the resulting applications, which later became known as "Reddit".</b>

#### GPTRAKEKeywordTableIndex
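
For orientation, here is a minimal sketch of the RAKE idea: split the text into candidate phrases at stopwords and punctuation, then score each phrase by summing degree/frequency over its words. This is a simplified illustration, not the library used by gpt_index; `rake_keywords` and the small `STOPWORDS` set are assumptions.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; RAKE normally uses a full one).
STOPWORDS = {"a", "an", "and", "at", "did", "do", "his", "in", "of", "on", "the", "to", "what"}


def rake_keywords(text):
    """Return (phrase, score) pairs sorted by descending RAKE-style score."""
    # Candidate phrases are maximal runs of non-stopwords inside a fragment.
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9']+", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)

    # freq: how often a word occurs; degree: co-occurrences within phrases, itself included.
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)

    scores = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked = rake_keywords("The author wrote essays and worked on a new Lisp programming language.")
for phrase, score in ranked:
    print(f"{score:5.1f}  {phrase}")
```

Longer phrases score higher because each word's degree grows with the length of the phrase it appears in.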

In [1]:
from gpt_index import GPTRAKEKeywordTableIndex, SimpleDirectoryReader
from IPython.display import Markdown, display

[nltk_data] Downloading package stopwords to /home/jerry/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# build keyword index
documents = SimpleDirectoryReader('data').load_data()
index = GPTRAKEKeywordTableIndex(documents)

In [10]:
response = index.query("What did the author do after his time at YC?")

> Starting query: What did the author do after his time at YC?
Extracted keywords: []


In [11]:
display(Markdown(f"<b>{response}</b>"))

<b>Empty response</b>

#### GPTKeywordTableIndex

In [7]:
from gpt_index import GPTKeywordTableIndex, SimpleDirectoryReader
from IPython.display import Markdown, display

In [None]:
# build keyword index
documents = SimpleDirectoryReader('data').load_data()
index = GPTKeywordTableIndex(documents)

In [None]:
response = index.query("What did the author do after his time at Y Combinator?")

In [10]:
display(Markdown(f"<b>{response}</b>"))

<b>

After a few years, the author decided to step away from Y Combinator to focus on other projects, such as painting and writing essays. In 2013, he handed over control of Y Combinator to Sam Altman. The author's mother passed away in 2014, and after taking some time to grieve, he returned to writing essays and working on Lisp. He continued working on Lisp until 2019, when he finally completed the project.

In 2015, the author decided to move to England with his family. They originally intended to only stay for a year, but ended up liking it so much that they remained there. The author wrote Bel while living in England. In 2019, he finally finished the project. After completing Bel, the author wrote a number of essays on various topics. He continued writing essays through 2020, but also started thinking about other things he could work on.</b>

## GPT Keyword Table Query Comparisons
Compare query results on the same index across the three query modes: mode="default", mode="simple", and mode="rake".
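
As a rough mental model, a keyword table maps each keyword to the chunks it was extracted from, and a query retrieves the chunks that share the most keywords with the question. The sketch below is hypothetical: `build_keyword_table`, `retrieve`, and the chunk ids are made up for illustration and are not the gpt_index API.

```python
from collections import Counter, defaultdict


def build_keyword_table(chunk_keywords):
    """Invert {chunk_id: keywords} into {keyword: set of chunk_ids}."""
    table = defaultdict(set)
    for chunk_id, keywords in chunk_keywords.items():
        for keyword in keywords:
            table[keyword].add(chunk_id)
    return table


def retrieve(table, query_keywords, top_k=2):
    """Rank chunks by how many query keywords point at them."""
    hits = Counter()
    for keyword in query_keywords:
        for chunk_id in table.get(keyword, ()):
            hits[chunk_id] += 1
    return [chunk_id for chunk_id, _ in hits.most_common(top_k)]


# Hypothetical chunk ids and keywords, for illustration only.
table = build_keyword_table({
    "chunk_a": {"y combinator", "combinator", "funding"},
    "chunk_b": {"combinator", "england"},
    "chunk_c": {"lisp", "bel"},
})

# Two keyword sets of the kind different extractors might produce for one question.
print(retrieve(table, ["y combinator", "combinator"]))
print(retrieve(table, ["combinator"]))
```

Under this model, extractors that return overlapping keywords end up selecting the same chunks, which matches the identical `Querying with idx` lines in the logs below.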

In [None]:
# build table with default GPTKeywordTableIndex
from gpt_index import GPTKeywordTableIndex, SimpleDirectoryReader
from IPython.display import Markdown, display

documents = SimpleDirectoryReader('data').load_data()
index = GPTKeywordTableIndex(documents)

In [3]:
# default
response = index.query("What did the author do after his time at Y Combinator?", mode="default")
display(Markdown(f"<b>{response}</b>"))

> Starting query: What did the author do after his time at Y Combinator?
Extracted keywords: ['y combinator', 'combinator']
> Querying with idx: 235042210695008001: of excluding them, because there were so many s...
> Querying with idx: 7029274505691774319: it was like living in another country, and sinc...
> Querying with idx: 1773317813360405038: browser, and then host the resulting applicatio...
> Querying with idx: 3866067077574405334: person, and from those we picked 8 to fund. The...


<b>

The author went on to write a book about his experiences at Y Combinator, and then moved to England. He started writing essays again and also began working on a new Lisp programming language. He also wrote an essay about how he chooses what to work on.</b>

In [4]:
# simple
response = index.query("What did the author do after his time at Y Combinator?", mode="simple")
display(Markdown(f"<b>{response}</b>"))

> Starting query: What did the author do after his time at Y Combinator?
Extracted keywords: ['combinator']
> Querying with idx: 235042210695008001: of excluding them, because there were so many s...
> Querying with idx: 7029274505691774319: it was like living in another country, and sinc...
> Querying with idx: 1773317813360405038: browser, and then host the resulting applicatio...
> Querying with idx: 3866067077574405334: person, and from those we picked 8 to fund. The...


<b>

The author went on to write a book about his experiences at Y Combinator, and then moved to England. He started writing essays again and also began working on a new Lisp programming language. He also wrote an essay about how he chooses what to work on.</b>

In [5]:
# rake
response = index.query("What did the author do after his time at Y Combinator?", mode="rake")
display(Markdown(f"<b>{response}</b>"))

> Starting query: What did the author do after his time at Y Combinator?
Extracted keywords: ['combinator']
> Querying with idx: 235042210695008001: of excluding them, because there were so many s...
> Querying with idx: 7029274505691774319: it was like living in another country, and sinc...
> Querying with idx: 1773317813360405038: browser, and then host the resulting applicatio...
> Querying with idx: 3866067077574405334: person, and from those we picked 8 to fund. The...


[nltk_data] Downloading package punkt to /home/jerry/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


<b>

The author went on to write a book about his experiences at Y Combinator, and then moved to England. He started writing essays again and also began working on a new Lisp programming language. He also wrote an essay about how he chooses what to work on.</b>