### Fed Minutes with LangChain and Hugging Face

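We load the Fed minutes, stored as PDF files in the `Minutes/` directory, into a list of LangChain documents. `PyPDFDirectoryLoader` returns one document per PDF page, so the count below is a number of pages, not a number of meetings.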

In [2]:
from langchain.document_loaders import PyPDFDirectoryLoader
loader = PyPDFDirectoryLoader("Minutes/")
data = loader.load()

In [3]:
print(f'There are {len(data)} documents')

There are 29939 documents


### Save and re-load the documents

In [2]:
import pickle

In [22]:
with open('minutes_individual_docs.pkl', 'wb') as f:
    pickle.dump(data, f)

In [3]:
with open('minutes_individual_docs.pkl', 'rb') as f:
    data = pickle.load(f)
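
Parsing the PDFs is slow, so the save and re-load steps above can be folded into a small cache helper. The following is a minimal sketch using only the standard library; `load_or_build` and `fake_load` are hypothetical names introduced for illustration, and `fake_load` merely stands in for `loader.load()`.

```python
import os
import pickle
import tempfile

def load_or_build(cache_path, build_fn):
    """Return the pickled object at cache_path if it exists;
    otherwise call build_fn(), pickle the result, and return it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    obj = build_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(obj, f)
    return obj

# Toy stand-in for loader.load(); it records how often it is called.
calls = []
def fake_load():
    calls.append(1)
    return ["page one", "page two"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "docs.pkl")
    first = load_or_build(path, fake_load)   # builds and writes the cache
    second = load_or_build(path, fake_load)  # reads the cache, no rebuild
```

Pickle files can execute arbitrary code when loaded, so this pattern is only appropriate for files we wrote ourselves.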

### Split the documents into smaller chunks
We try two splitters: one that splits by character count and one that splits by token count, using the tokenizer of a sentence-transformer model.
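
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of fixed-width chunking with overlap. It is a simplification and not LangChain's algorithm: `RecursiveCharacterTextSplitter` first tries to break on separators such as paragraphs and line breaks, and the token splitter counts tokens rather than characters. The function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters, where each window
    repeats the last chunk_overlap characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
# chunks == ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap means a sentence cut at a chunk boundary still appears partly in the next chunk, at the cost of some duplicated text in the index.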

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter
splitter_recursive = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
spliiter_tokens = SentenceTransformersTokenTextSplitter(chunk_overlap=15, model_name='sentence-transformers/all-mpnet-base-v2')

In [6]:
spliiter_tokens.split_documents(data[:2])[:2]

[Document(page_content='a meeting of the federal open market committee was held in the offices of the board of governors of the federal reserve system in washington on monday, november 10, 1958, at 10 : 00 a. m. present : mr. mr. mr. mr. mr. mr. mr. mr. mr. mr. mr. martin, chairman hayes, vice chairman balderston fulton irons leach mangels mills robertson shepardson szymczak messrs. erickson, allen, alternate members of market committee messrs. bopp, bryan, and the federal reserve atlanta, and kansasjohns, and deming, the federal open leedy, presidents of banks of philadelphia, city, respectively mr. riefler, secretary mr. thurston, assistant secretary mr. sherman, assistant secretary mr. hackley, general counsel mr. solomon, assistant general counsel mr. thomas, economist messrs. daane, walker, and young, associate economists mr. rouse, manager, system open market account mr. molony, special assistant to the board of governors mr. koch, associate adviser, division of research and stat

In [6]:
splitter_recursive.split_documents(data[:2])[:2]

[Document(page_content='A meeting of the Federal Open Market Committee was held in \nthe offices of the Board of Governors of the Federal Reserve System \nin Washington on Monday, November 10, 1958, at 10:00 a.m.\nPRESENT: Mr.  \nMr.  \nMr.  \nMr.  \nMr.  \nMr.  \nMr.  \nMr.  \nMr.  \nMr.  \nMr.Martin, Chairman \nHayes, Vice Chairman \nBalderston \nFulton \nIrons \nLeach \nMangels \nMills \nRobertson \nShepardson \nSzymczak\nMessrs. Erickson, Allen, \nAlternate Members of \nMarket Committee \nMessrs. Bopp, Bryan, and \nthe Federal Reserve \nAtlanta, and KansasJohns, and Deming, \nthe Federal Open \nLeedy, Presidents of \nBanks of Philadelphia, \nCity, respectively\nMr. Riefler, Secretary \nMr. Thurston, Assistant Secretary \nMr. Sherman, Assistant Secretary \nMr. Hackley, General Counsel \nMr. Solomon, Assistant General Counsel \nMr. Thomas, Economist \nMessrs. Daane, Walker, and Young, Associate \nEconomists \nMr. Rouse, Manager, System Open Market Account \nMr. Molony, Special Assist

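Since we will compute embeddings with the same model (all-mpnet-base-v2), we split the text with its tokenizer, so that chunk boundaries are measured in the tokens the model actually sees. Note that the token splitter returns lowercased, whitespace-normalized text, as the first output above shows, whereas the character splitter preserves the original casing and line breaks.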

In [18]:
docs = spliiter_tokens.split_documents(data)
len(docs)

34966

In [8]:
docs_characters = splitter_recursive.split_documents(data)

In [9]:
len(docs_characters)

71921

### Embed the documents

In [8]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
model_name = 'sentence-transformers/all-mpnet-base-v2'
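model_kwargs = {'device':'mps'}  # Apple Silicon GPU; use 'cpu' or 'cuda' on other hardware
encode_kwargs = {'normalize_embeddings': False}  # keep the vectors as the model returns them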
embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs)
vector_store = FAISS.from_documents(docs, embeddings)

In [13]:
vector_store.save_local(folder_path='fed_minutes_vector_store', index_name='index')
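
The `normalize_embeddings` flag matters because of how distances behave. As a small worked sketch with toy 2-D vectors standing in for embeddings: once vectors are scaled to unit length, the squared Euclidean distance equals `2 - 2 * cosine`, so ranking by L2 distance and ranking by cosine similarity give the same order. Without normalization the two rankings can differ. The helper names below are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance between a and b."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, not real embeddings.
a = [3.0, 4.0]
b = [1.0, 0.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
lhs = squared_l2(a_unit, b_unit)
rhs = 2 - 2 * cosine(a, b)
```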

### Testing Semantic Search
We embed the query with the same model and retrieve the five chunks whose embeddings are closest to it.
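
Conceptually, a similarity search embeds the query and returns the `k` stored vectors nearest to it. The sketch below does this by brute force over toy 2-D vectors, assuming Euclidean (L2) distance, which we believe is the default metric of LangChain's FAISS wrapper. It illustrates the idea only; FAISS itself uses optimized index structures, and `top_k` is a hypothetical helper.

```python
import heapq
import math

def top_k(query_vec, doc_vecs, k):
    """Return (index, distance) pairs for the k vectors closest to
    query_vec, ordered from nearest to farthest."""
    scored = ((i, math.dist(query_vec, v)) for i, v in enumerate(doc_vecs))
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])

# Toy vectors standing in for chunk embeddings.
doc_vecs = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
hits = top_k([1.0, 1.0], doc_vecs, k=2)
# The nearest vectors are at index 1 (identical to the query) and index 3.
```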

In [24]:
vector_s = FAISS.load_local("fed_minutes_vector_store", embeddings)
query = 'What happened with the economy in 1968?'
docs_ans = vector_s.similarity_search(query, k=5)

In [25]:
docs_ans

[Document(page_content='8 / 18 / 70 - 44 nature of current policy. he thought the desk and the staff were to be commended for the manner in which they had adapted to the new type of directive, and he personally would be unhappy if the committee were to return to directives of the old type. mr. francis commented that it had been popular to criti cize stabilization actions and the performance of the economy over the past year. however, given the strong inflationary momentum gradually built up from 1964 through 1968, he believed that stabi lization actions had, on the whole, been applied satisfactorily and that the economy had performed as well as could reasonably have been expected. the rate of price increase had not been slowed much, if at all, mr. francis remarked. however, the rise had stopped accelerating, and all econometric models now indicated that a moderation of the upward price movement was likely this fall. cutbacks in real output had been much less than in other periods when 

As we can see, the sentence-transformer splitter produces relatively large chunks. However, we can use an LLM to extract the relevant information from the retrieved chunks and produce an answer that 'probabilistically' makes sense. We do that in the following notebook.
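
As a rough illustration of that next step, retrieved chunks are typically concatenated into a prompt together with the question. The sketch below is an assumption about what such a step could look like, not the code of the following notebook; `build_prompt`, its `max_chars` budget, and the placeholder excerpts are all hypothetical.

```python
def build_prompt(question, chunks, max_chars=2000):
    """Number the retrieved chunks, keep as many as fit in max_chars,
    and append the question after them."""
    context_parts = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk.strip()}"
        # Stop before the context exceeds the character budget.
        if used + len(entry) > max_chars:
            break
        context_parts.append(entry)
        used += len(entry)
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the excerpts below.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Placeholder strings standing in for the page_content of retrieved documents.
retrieved = ["first excerpt text", "second excerpt text"]
prompt = build_prompt("What happened with the economy in 1968?", retrieved)
```

Capping the context length is one simple way to cope with the large chunks noted above; the resulting string would then be sent to whichever LLM is used.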