# State of the Union question answering example

This notebook demonstrates how to use the Momento Vector Index langchain integration to answer questions about President Joe Biden's 2023 State of the Union address. We show how to load a dataset, index it into Momento Vector Index, run a simple query, and build a full-fledged question answering system.

# Setup

Before we begin, we need to read our API tokens from the environment. Two are required:
- `MOMENTO_AUTH_TOKEN`: This is your Momento API token. You can get one by signing up at https://console.gomomento.com.
- `OPENAI_API_KEY`: This is your OpenAI API key. You can get one by signing up at https://openai.com.

You can store these in a `.env` file, in your environment, or set them directly here. We use dotenv to read the values from a `.env` file.

In [None]:
%load_ext dotenv
%dotenv

In [None]:
import os

# You can set the environment variables directly here if you don't want to use a .env file:
# os.environ["MOMENTO_AUTH_TOKEN"] = "<your token here>"
# os.environ["OPENAI_API_KEY"] = "<your key here>"

# Check that the environment variables are set
if os.environ.get('MOMENTO_AUTH_TOKEN') is None:
    raise ValueError("MOMENTO_AUTH_TOKEN is not set")

if os.environ.get('OPENAI_API_KEY') is None:
    raise ValueError("OPENAI_API_KEY is not set")
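
The check above stops at the first missing variable. As a minimal sketch, the same idea can report every missing variable at once. The helper name `missing_env_vars` is hypothetical and not part of any library. Unlike the cell above, it also treats an empty string as missing, which is an assumption about what you want.

```python
import os
from typing import List, Mapping


def missing_env_vars(names: List[str], environ: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in `environ`, in input order.

    Hypothetical helper for illustration; empty values count as missing.
    """
    return [name for name in names if not environ.get(name)]


# Demonstrate with a stand-in mapping rather than the real environment.
fake_env = {"MOMENTO_AUTH_TOKEN": "token-placeholder"}
missing = missing_env_vars(["MOMENTO_AUTH_TOKEN", "OPENAI_API_KEY"], fake_env)
print(missing)
```

Calling `missing_env_vars(["MOMENTO_AUTH_TOKEN", "OPENAI_API_KEY"])` with no second argument checks the real environment instead.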

We'll import the libraries we need for data loading, indexing, and querying.

In [None]:
# For setting up the Momento Vector Index and langchain Vector Store
from mvi_langchain import MomentoVectorIndex
from momento import VectorIndexConfigurations, CredentialProvider
from langchain.embeddings.openai import OpenAIEmbeddings

# For reading data and chunking it into smaller segments
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

# For doing QA
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI


# Load the data

Load a dataset and chunk it into smaller pieces for question answering.

Here we use the State of the Union transcript. You can substitute in your dataset of choice. Explore the langchain document loaders for a rich ecosystem of ingestors.

In [None]:
raw_documents = TextLoader('data/sample').load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)

In [None]:
len(raw_documents)

1

In [None]:
raw_documents[0].page_content[:500]

'The United States Capitol\n\nMr. Speaker. Madam Vice President. Our First Lady and Second Gentleman.\n\nMembers of Congress and the Cabinet. Leaders of our military.\n\nMr. Chief Justice, Associate Justices, and retired Justices of the Supreme Court.\n\nAnd you, my fellow Americans.\n\nI start tonight by congratulating the members of the 118th Congress and the new Speaker of the House, Kevin McCarthy.\n\nMr. Speaker, I look forward to working together.\n\nI also want to congratulate the new leader of the Hous'
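
To build intuition for what the splitting step does, here is a rough pure-Python sketch of separator-based chunking: split on blank lines, then greedily pack paragraphs into chunks up to a size limit. This is not the `CharacterTextSplitter` implementation. The function name `split_into_chunks` is hypothetical, and details such as overlap handling are left out.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int, separator: str = "\n\n") -> List[str]:
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    Illustrative sketch only: a single piece longer than chunk_size is kept whole.
    """
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # The piece still fits, or it is the first piece of a new chunk.
            current = candidate
        else:
            # The piece would overflow the chunk: close it and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "Mr. Speaker.\n\nMadam Vice President.\n\nAnd you, my fellow Americans."
chunks = split_into_chunks(sample, chunk_size=40)
for chunk in chunks:
    print(repr(chunk))
```

With a 40-character limit the first two paragraphs fit in one chunk and the third starts a new one. Smaller chunks give more focused search results but carry less surrounding context.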

# Index the document chunks into MVI

We will use OpenAI to generate text embeddings. We will create an index called "sample-text" in Momento Vector Index to store the embeddings and metadata.

First we instantiate the Momento Vector Index langchain vector store:

In [None]:
db = MomentoVectorIndex(
    embedding_function=OpenAIEmbeddings(),
    configuration=VectorIndexConfigurations.Default.latest(),
    credential_provider=CredentialProvider.from_environment_variable("MOMENTO_AUTH_TOKEN"),
    index_name="sample-text",
)

Then we index the document chunks into the vector store:

In [None]:
_ = db.add_documents(documents=documents, ids=[f"sotu-chunk-{i}" for i in range(len(documents))])

We could also have created the db with `MomentoVectorIndex.from_documents`, combining the two steps into one.

We can search directly against the index to get an idea of the document fragments that match the question. Note that the fragments:
- possibly contain the answer to the question;
- possibly do not; and
- usually contain irrelevant information.

We will improve on this in the next section.
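
Conceptually, a similarity search embeds the query and ranks the stored chunks by how close their vectors are to the query vector. The toy sketch below uses hand-made three-dimensional vectors and cosine similarity. Both are assumptions for illustration: real embeddings have many more dimensions, and the metric configured on an index may differ. The names `cosine_similarity` and `top_k` are hypothetical helpers, not part of the Momento or langchain APIs.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: List[float], index: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Return the k (id, score) pairs with the highest similarity to the query."""
    scored = [(item_id, cosine_similarity(query, vector)) for item_id, vector in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up vectors standing in for chunk embeddings.
toy_index = {
    "sotu-chunk-0": [1.0, 0.0, 0.0],
    "sotu-chunk-1": [0.8, 0.6, 0.0],
    "sotu-chunk-2": [0.0, 0.0, 1.0],
}
query_vector = [0.9, 0.1, 0.0]
results = top_k(query_vector, toy_index, k=2)
print([item_id for item_id, _ in results])
```

A vector index performs this ranking at scale, usually with approximate rather than exhaustive search.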

In [None]:
docs = db.similarity_search("What did the president say about small business?", k=2)
len(docs)

2

In [None]:
print(docs[0].page_content)

Here at home, gas prices are down $1.50 a gallon since their peak.

Food inflation is coming down.

Inflation has fallen every month for the last six months while take home pay has gone up.

Additionally, over the last two years, a record 10 million Americans applied to start a new small business.

Every time somebody starts a small business, it’s an act of hope.

And the Vice President will continue her work to ensure more small businesses can access capital and the historic laws we enacted.

Standing here last year, I shared with you a story of American genius and possibility.

Semiconductors, the small computer chips the size of your fingertip that power everything from cellphones to automobiles, and so much more. These chips were invented right here in America.

America used to make nearly 40% of the world’s chips.

But in the last few decades, we lost our edge and we’re down to producing only 10%. We all saw what happened during the pandemic when chip factories overseas shut down.

In [None]:
print(docs[1].page_content)

Buy American has been the law of the land since 1933. But for too long, past administrations have found ways to get around it.

Not anymore.

Tonight, I’m also announcing new standards to require all construction materials used in federal infrastructure projects to be made in America.

American-made lumber, glass, drywall, fiber optic cables.

And on my watch, American roads, American bridges, and American highways will be made with American products.

My economic plan is about investing in places and people that have been forgotten. Amid the economic upheaval of the past four decades, too many people have been left behind or treated like they’re invisible.

Maybe that’s you, watching at home.

You remember the jobs that went away. And you wonder whether a path even exists anymore for you and your children to get ahead without moving away.

I get it.

That’s why we’re building an economy where no one is left behind.


# Use a QA chain to generate fluent answers

Here we add on to the above example by using a special prompt to generate fluent answers. We use the `RetrievalQA` chain from langchain, which is a simple question answering workflow.

It uses the following steps:
- retrieval: Retrieve the top `k` documents from the index.
- question answering: Use a question answering prompt to generate an answer from the original query and retrieved documents.
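
The second step can be pictured as plain string assembly: the retrieved chunks are concatenated into a context block and placed in a prompt alongside the question (the default "stuff" chain type in `RetrievalQA.from_chain_type` works along these lines). The sketch below is an illustration only. The prompt wording and the function name `build_stuff_prompt` are assumptions, not the template langchain ships.

```python
from typing import List


def build_stuff_prompt(question: str, retrieved_chunks: List[str]) -> str:
    """Combine retrieved chunks and the question into one prompt string.

    Hypothetical template for illustration; langchain's actual wording differs.
    """
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


example_chunks = [
    "A record 10 million Americans applied to start a new small business.",
    "Every time somebody starts a small business, it's an act of hope.",
]
prompt = build_stuff_prompt("What did the president say about small business?", example_chunks)
print(prompt)
```

Because every retrieved chunk is pasted into one prompt, `k` and the chunk size together determine how much of the model's context window is used.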

In [None]:
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
qa_chain = RetrievalQA.from_chain_type(llm, retriever=db.as_retriever())

In [None]:
qa_chain({"query": "What did the president say about small business?"})

{'query': 'What did the president say about small business?',
 'result': "The President mentioned that over the last two years, a record 10 million Americans applied to start a new small business. He also emphasized that every time somebody starts a small business, it's an act of hope. The Vice President will continue her work to ensure more small businesses can access capital and the historic laws they enacted."}

In [None]:
qa_chain({"query": "What did the president say about credit card fees?"})

{'query': 'What did the president say about credit card fees?',
 'result': 'The president said that they have reduced credit card late fees by 75%, from $30 to $8.'}

# Cleanup

In [None]:
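# Delete the index created above. Note that `_client` is a private attribute
# of the vector store, so this access may break in future versions.
db._client.delete_index(index_name="sample-text")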

DeleteIndex.Success()