## Refining Retrieval for QA
Let's reproduce an example from the previous [notebook](QAoverFedMinutesHistory.ipynb).

In [8]:
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
model_name = 'sentence-transformers/all-mpnet-base-v2'
model_kwargs = {'device': 'mps'}  # 'mps' targets Apple Silicon GPUs; use 'cpu' or 'cuda' on other hardware
encode_kwargs = {'normalize_embeddings': False}
embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs)
vector_s = FAISS.load_local("MINUTES_FOMC_HISTORY", embeddings)

query = "What was the staff economic outlook during the FOMC meeting of November 2022? Please include some of the political developments discussed in the meeting and why they were important to the economic outlook. Also include some statistical indicators supporting the views when possible."
docs_ans = vector_s.similarity_search(query, k=7)

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2


Let's see the metadata from the retrieved documents.

In [9]:
for doc in docs_ans:
    print(doc.metadata)

{'source': 'Minutes/fomcminutes20211215.pdf', 'page': 7, 'year': 2021, 'month': 12, 'day': 15}
{'source': 'Minutes/fomcminutes20220921.pdf', 'page': 6, 'year': 2022, 'month': 9, 'day': 21}
{'source': 'Minutes/fomcminutes20151216.pdf', 'page': 4, 'year': 2015, 'month': 12, 'day': 16}
{'source': 'Minutes/fomcminutes20140618.pdf', 'page': 6, 'year': 2014, 'month': 6, 'day': 18}
{'source': 'Minutes/fomcminutes20110126.pdf', 'page': 11, 'year': 2011, 'month': 1, 'day': 26}
{'source': 'Minutes/fomcminutes20121212.pdf', 'page': 5, 'year': 2012, 'month': 12, 'day': 12}
{'source': 'Minutes/fomcminutes20130918.pdf', 'page': 4, 'year': 2013, 'month': 9, 'day': 18}


As we can see, none of the retrieved documents refers to November 2022, even though a meeting did take place that month.

### Metadata filtering with recurring search expansion

Assume for a moment that we know the query relates to events that can be traced to the metadata of the documents, say the year and month. We can then search the vector store with an increasingly large k, keeping only the documents whose metadata match the filtering criteria, until we have collected the desired number of documents. From the query, we know we are interested in November 2022.
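The helper search_with_filter used in the next cell comes from a local module that is not shown here. As a rough sketch of the idea only, assuming the store exposes similarity_search(query, k=...) and each document carries a metadata dict, an expanding search could look like the code below. The names Doc, ToyStore, matches and expanding_filtered_search, as well as the target and max_k parameters, are hypothetical and are not the notebook's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (assumption for this sketch)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class ToyStore:
    """Toy vector store: returns the first k documents of a fixed ranking."""

    def __init__(self, ranked_docs):
        self.ranked_docs = ranked_docs

    def similarity_search(self, query, k=4):
        return self.ranked_docs[:k]


def matches(doc, filter_dict):
    """True if every key/value pair of the filter appears in the metadata."""
    return all(doc.metadata.get(key) == value for key, value in filter_dict.items())


def expanding_filtered_search(store, query, filter_dict, target=5,
                              init_k=200, step=300, max_k=5000):
    """Hypothetical sketch: widen k until `target` documents pass the filter."""
    k = init_k
    while True:
        candidates = store.similarity_search(query, k=k)
        kept = [doc for doc in candidates if matches(doc, filter_dict)]
        # Stop when we have enough matches, when the store returned fewer
        # than k documents (it is exhausted), or when k reaches the cap.
        if len(kept) >= target or len(candidates) < k or k >= max_k:
            return kept[:target]
        k += step
```

Because each pass re-runs the whole search with a larger k, the cost grows with every expansion. The max_k cap in this sketch is one way to keep a filter that matches nothing from looping forever.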

In [10]:
from filterminutes import search_with_filter
filtered_context = search_with_filter(vector_s, query, init_k=200, step=300, filter_dict={'year':2022,  'month': 11})

INFO:filter methods:Context contains 1 documents
INFO:filter methods:Expanding search with k=200
INFO:filter methods:Context contains 2 documents
INFO:filter methods:Expanding search with k=500
INFO:filter methods:Context contains 3 documents
INFO:filter methods:Expanding search with k=800
INFO:filter methods:Context contains 3 documents
INFO:filter methods:Expanding search with k=1100
INFO:filter methods:Context contains 4 documents
INFO:filter methods:Expanding search with k=1400
INFO:filter methods:Context contains 4 documents
INFO:filter methods:Expanding search with k=1700
INFO:filter methods:Done. Context contains 5 Documents matching the filtering criteria


### Add Prompt and QA chain

In [5]:
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv('.env.rtf'))

In [11]:
from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
prompt_template = """You are a research analyst at a federal reserve bank and you are trying to answer questions or provide colour on statements. Use the following pieces of context to answer the question at the end. Explain the rationale behind your answer.
If you don't have all the elements to answer the query, say it explicitly.

{context}

Question: {question}
"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)
fed_chain = load_qa_chain(llm=ChatOpenAI(model_name='gpt-3.5-turbo'), chain_type='stuff', prompt=PROMPT)

We can now feed the results of the filtered context to the chain.

In [12]:
results = fed_chain({"input_documents": filtered_context, "question": query})
print(results['output_text'])
for doc in results['input_documents']:
    print(doc.metadata)

During the FOMC meeting of November 2022, the staff economic outlook was that U.S. real gross domestic product (GDP) had increased at a moderate pace in the third quarter after declining over the first half of the year. The staff projected that inflation would decline over the next two years, with both total and core PCE price inflation expected to be 2 percent in 2025. However, the staff continued to view the risks to the inflation projection as skewed to the upside, as inflation remained stubbornly high. 

The staff also noted that political developments, specifically Russia's war against Ukraine, were causing tremendous human and economic hardship. These events were creating additional upward pressure on inflation and weighing on global economic activity. The war and its related events were important to the economic outlook as they were contributing to the supply and demand imbalances that were impacting inflation. 

Some statistical indicators supporting the staff's views included 

All the documents passed to the chain refer to November 2022. Let's see what happens when the documents retrieved by the original search are not filtered.

In [13]:
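# Note: FAISS.similarity_search takes `k`, not `top_k`. The unrecognized
# keyword has no effect here, so the default k=4 applies (four documents below).
simple_retrieval = vector_s.similarity_search(query, top_k=5)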
results_simple = fed_chain({"input_documents": simple_retrieval, "question":query})
print(results_simple['output_text'])
for doc in results_simple['input_documents']:
    print(doc.metadata)

The given context does not provide information about the staff economic outlook during the FOMC meeting of November 2022.
{'source': 'Minutes/fomcminutes20211215.pdf', 'page': 7, 'year': 2021, 'month': 12, 'day': 15}
{'source': 'Minutes/fomcminutes20220921.pdf', 'page': 6, 'year': 2022, 'month': 9, 'day': 21}
{'source': 'Minutes/fomcminutes20151216.pdf', 'page': 4, 'year': 2015, 'month': 12, 'day': 16}
{'source': 'Minutes/fomcminutes20140618.pdf', 'page': 6, 'year': 2014, 'month': 6, 'day': 18}


As expected, the model cannot answer: none of the retrieved documents comes from the November 2022 meeting, so the context does not contain the relevant information.

### Adding a method for automatically creating the filter parameters
Let's recap the methodology for filtering when a query contains time-related information (we already have a prompt):
1. Extract the date information in natural language from the query.
2. Compose a dictionary with the parameters to perform the search.
3. Expand the search until we get the desired number of top_k documents, **after filtering**.
4. Feed it into the chain.

We now need to automate steps 1 and 2. For that we will use a specific template.

In [17]:
from langchain import LLMChain
template_extract_date_filters = """Extract the date elements from the following question in numeric format. If there is mention of both year and month, use the datetime string format %Y-%m. Example: 1) The meeting took place in October 2022 ->> 2022-10. Then put them in a dictionary like so:
1) The meeting took place in October 2022 ->> (year: 2022, month: 10)
2) During 1 November 1968 ->> (year: 1968, month: 11, day: 1). Use json format. You are allowed to use the keys 'year', 'month', 'day', and 'page'.

{question}
"""
prompt = PromptTemplate.from_template(
    template_extract_date_filters)

date_extractor = LLMChain(prompt=prompt, llm=ChatOpenAI(temperature=0, model_name='gpt-3.5-turbo'))

In [19]:
date = date_extractor.run('What was the staff economic outlook during the FOMC meeting of November 2022? Please include some of the political developments discussed in the meeting and why they were important to the economic outlook. Also include some figures to illustrate.')
date



'{\n  "year": 2022,\n  "month": 11\n}'

### Load into a JSON object


In [21]:
import json
filter_date = json.loads(date)

In [22]:
filter_date.keys()

dict_keys(['year', 'month'])

We can now use the filter_date directly to search.
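Calling json.loads directly works when the model returns a bare JSON object, as it does above. As a defensive sketch, assuming the reply might occasionally be wrapped in extra text or contain keys outside the ones the prompt allows, a small parser could look like this. The function name parse_date_filter and the constant ALLOWED_KEYS are hypothetical and not part of the notebook.

```python
import json
import re

# Keys the extraction prompt allows (taken from the template above).
ALLOWED_KEYS = {"year", "month", "day", "page"}


def parse_date_filter(raw: str) -> dict:
    """Parse the extractor's reply into a filter dict, tolerating extra text."""
    # Grab the span from the first "{" to the last "}" in case the model
    # surrounds the JSON object with prose or code fences.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"No JSON object found in: {raw!r}")
    parsed = json.loads(match.group(0))
    # Keep only the allowed keys and coerce the values to int.
    return {key: int(value) for key, value in parsed.items() if key in ALLOWED_KEYS}


print(parse_date_filter('{\n  "year": 2022,\n  "month": 11\n}'))
# {'year': 2022, 'month': 11}
```

Coercing to int matters because the metadata stores year and month as integers, so a string such as "2022" would never match in an equality filter.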

## Putting it all together

In [37]:
query = "What was the economic outlook from the staff presented in the meeting of April 2009 with respect to Labor market conditions and industrial production? Write three paragraphs about this topic. Then write three paragraphs about the staff view of the Financial Situation."

print('Extracting the date in numeric format..')
filter_date = json.loads(date_extractor.run(query))

print(f'Date parameters retrieved: {filter_date}')
print('Running the qa with filtered context..')
filtered_context = search_with_filter(vector_s, query, init_k=200, step=300, filter_dict=filter_date)

print(20*'-' + 'Metadata for the documents to be used' + 20*'-' )
for doc in filtered_context:
    print(doc.metadata)

print('Computing Final output..' )
answer = fed_chain({"input_documents": filtered_context, "question": query})
print(20*'-' + 'Final output' + 20*'-' )
print(answer['output_text'])

Extracting the date in numeric format..
Date parameters retrieved: {'year': 2009, 'month': 4}
Running the qa with filtered context..


INFO:filter methods:Context contains 0 documents
INFO:filter methods:Expanding search with k=200
INFO:filter methods:Context contains 0 documents
INFO:filter methods:Expanding search with k=500
INFO:filter methods:Context contains 0 documents
INFO:filter methods:Expanding search with k=800
INFO:filter methods:Context contains 1 documents
INFO:filter methods:Expanding search with k=1100
INFO:filter methods:Context contains 1 documents
INFO:filter methods:Expanding search with k=1400
INFO:filter methods:Context contains 1 documents
INFO:filter methods:Expanding search with k=1700
INFO:filter methods:Context contains 1 documents
INFO:filter methods:Expanding search with k=2000
INFO:filter methods:Context contains 2 documents
INFO:filter methods:Expanding search with k=2300
INFO:filter methods:Context contains 3 documents
INFO:filter methods:Expanding search with k=2600
INFO:filter methods:Context contains 3 documents
INFO:filter methods:Expanding search with k=2900
INFO:filter methods:Context contains 3 documents
INFO:filter methods:Expanding search with k=3200
INFO:filter methods:Context contains 3 documents
INFO:filter methods:Expanding search with k=3500
INFO:filter methods:Context contains 3 documents
INFO:filter methods:Expanding search with k=3800
INFO:filter methods:Context contains 3 documents
INFO:filter methods:Expanding search with k=4100
INFO:filter methods:Done. Context contains 5 Documents matching the filtering criteria


--------------------Metadata for the documents to be used--------------------
{'source': 'Minutes/fomcminutes20090429.pdf', 'page': 5, 'year': 2009, 'month': 4, 'day': 29}
{'source': 'Minutes/fomcminutes20090429.pdf', 'page': 5, 'year': 2009, 'month': 4, 'day': 29}
{'source': 'Minutes/fomcminutes20090429.pdf', 'page': 12, 'year': 2009, 'month': 4, 'day': 29}
{'source': 'Minutes/fomcminutes20090429.pdf', 'page': 10, 'year': 2009, 'month': 4, 'day': 29}
{'source': 'Minutes/fomcminutes20090429.pdf', 'page': 12, 'year': 2009, 'month': 4, 'day': 29}
Computing Final output..
--------------------Final output--------------------
According to the staff economic outlook presented in the April 2009 meeting, the labor market conditions were still contracting and the decline in industrial production was rapid. Despite stabilization in consumer purchases and the abating decline in the housing sector, the contraction in the labor market persisted into March and industrial production continued to fall