In [7]:
from urllib.request import urlretrieve
import os
urls =[
    'https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/2022-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2022/ar/2021-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2020/ar/2019-Shareholder-Letter.pdf',
]
filenames = [
    '2022-Shareholder-letter.pdf',
    '2021-Shareholder-letter.pdf',
    '2019-Shareholder-letter.pdf',
]
metadata = [
    {'source': '2022-Shareholder-letter.pdf', 'year': 2022, 'reported_currency': 'USD','title':'Shareholder-letter-2022'},
    {'source': '2021-Shareholder-letter.pdf', 'year': 2021, 'reported_currency': 'INR','title':'Shareholder-letter-2021'},
    {'source': '2019-Shareholder-letter.pdf', 'year': 2019, 'reported_currency': 'CAD','title':'Shareholder-letter-2019'},
]
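data_root = "data/"
# Create the download directory first; urlretrieve does not create missing directories.
os.makedirs(data_root, exist_ok=True)
for url, filename in zip(urls, filenames):
    filepath = os.path.join(data_root, filename)
    urlretrieve(url, filepath)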

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader

documents = []
for idx,file in enumerate(filenames):
    loader = PyPDFLoader(data_root + file)
    document = loader.load()
    for document_fragment in document:
        document_fragment.metadata['source'] = metadata[idx]['source']
        document_fragment.metadata['title'] = metadata[idx]['title']
        document_fragment.metadata['reported_currency'] = metadata[idx]['reported_currency']
        document_fragment.metadata['year'] = metadata[idx]['year']

    print(f"Document {idx+1} has {len(document)} pages")
    documents += document

Document 1 has 10 pages
Document 2 has 9 pages
Document 3 has 8 pages


In [9]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    add_start_index=True,
)

In [10]:
docs = text_splitter.split_documents(documents)
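
The effect of `chunk_size`, `chunk_overlap`, and `add_start_index` can be illustrated with a toy splitter. This is only a minimal sketch: `sliding_window_chunks` is a hypothetical helper that cuts fixed-size windows, whereas `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and sentences, so its real chunk boundaries will differ.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Hypothetical helper: fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        # Keep the start offset, analogous to the 'start_index' metadata field.
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for start, chunk in chunks:
    print(start, chunk)
```

Each window repeats the last four characters of the previous one, which is why a sentence that straddles a chunk boundary can still be retrieved in one piece.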

In [12]:
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

In [13]:
from langchain.schema import Document
from langchain.vectorstores import FAISS

In [14]:
db = FAISS.from_documents(docs, embeddings)

In [15]:
db.save_local("faiss_hf_index")

In [None]:
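# Recent LangChain releases require an explicit opt-in before loading the pickled docstore;
# only enable it for index files you created yourself.
new_db = FAISS.load_local("faiss_hf_index", embeddings, allow_dangerous_deserialization=True)
db = new_db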

In [19]:
query = "How was AWS evolved?"

## Basic Similarity Search

In [20]:
results_with_scores = db.similarity_search_with_score(query)
for doc, score in results_with_scores:
    print(f"Content: {doc.page_content}, \nMetadata: {doc.metadata}, \nScore: {score}")

Content: at a significantly lower cost. We’re not close to being done innovating here,and this long-term investment should prove fruitful for both customers and AWS. AWS is still in the earlystages of its evolution, and has a chance for unusual growth in the next decade., 
Metadata: {'source': '2022-Shareholder-letter.pdf', 'page': 2, 'title': 'Shareholder-letter-2022', 'reported_currency': 'USD', 'year': 2022, 'start_index': 4966}, 
Score: 0.781078040599823
Content: In 2008, AWS was still a fairly small, fledgling business.We knew we were on to something, but it still required substantial capital investment. There were voicesinside and outside of the company questioning why Amazon (known mostly as an online retailer then) wouldbe investing so much in cloud computing. But, we knew we were inventing something special that couldcreate a lot of value for customers and Amazon in the future. We had a head start on potential competitors;and if anything, we wanted to accelerate our pace of in

## Similarity Search with Metadata Filtering

In [21]:
filter = dict(year=2022)

In [22]:
results_with_scores = db.similarity_search_with_score(query,filter=filter)
for doc, score in results_with_scores:
    print(f"Content: {doc.page_content}, \nMetadata: {doc.metadata}, \nScore: {score}")

Content: at a significantly lower cost. We’re not close to being done innovating here,and this long-term investment should prove fruitful for both customers and AWS. AWS is still in the earlystages of its evolution, and has a chance for unusual growth in the next decade., 
Metadata: {'source': '2022-Shareholder-letter.pdf', 'page': 2, 'title': 'Shareholder-letter-2022', 'reported_currency': 'USD', 'year': 2022, 'start_index': 4966}, 
Score: 0.781078040599823
Content: In 2008, AWS was still a fairly small, fledgling business.We knew we were on to something, but it still required substantial capital investment. There were voicesinside and outside of the company questioning why Amazon (known mostly as an online retailer then) wouldbe investing so much in cloud computing. But, we knew we were inventing something special that couldcreate a lot of value for customers and Amazon in the future. We had a head start on potential competitors;and if anything, we wanted to accelerate our pace of in

## Top-K Matching
Top-K matching is a filtering technique that uses a two-stage approach:
1. Perform a similarity search, returning the top `fetch_k` matches.
2. Apply the metadata filter to that smaller result set and keep at most `k` results.

Note: If `fetch_k` is too small, none of the fetched candidates may pass the filter, and the query returns no results.

Top-K matching takes two values:
- `k`, the maximum number of results to return at the end of the query
- `fetch_k`, the maximum number of results to fetch from the similarity search before the filter is applied
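
The two stages and the empty-result caveat can be sketched in plain Python. This is an illustration under stated assumptions, not the FAISS implementation: `CANDIDATES` and `top_k_with_filter` are hypothetical, the scores are made up, and a lower score is assumed to mean a closer match.

```python
# Made-up (score, metadata) pairs; a lower score is assumed to mean a closer match.
CANDIDATES = [
    (0.10, {"year": 2019}),
    (0.20, {"year": 2021}),
    (0.30, {"year": 2022}),
    (0.40, {"year": 2019}),
    (0.50, {"year": 2022}),
    (0.60, {"year": 2022}),
]


def top_k_with_filter(candidates, metadata_filter, k, fetch_k):
    """Hypothetical helper: fetch the fetch_k nearest, filter them, keep at most k."""
    # Stage 1: similarity search, ignoring metadata.
    nearest = sorted(candidates, key=lambda pair: pair[0])[:fetch_k]
    # Stage 2: keep only candidates whose metadata equals every filter entry.
    matching = [
        pair for pair in nearest
        if all(pair[1].get(key) == value for key, value in metadata_filter.items())
    ]
    return matching[:k]


year_filter = {"year": 2022}
too_small = top_k_with_filter(CANDIDATES, year_filter, k=2, fetch_k=2)
just_enough = top_k_with_filter(CANDIDATES, year_filter, k=2, fetch_k=5)
print(len(too_small))
print([score for score, _ in just_enough])
```

With `fetch_k=2` the two nearest candidates are from 2019 and 2021, so nothing survives the filter even though matching 2022 entries exist further down the ranking. Raising `fetch_k` to 5 lets two of them through.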

In [24]:
results = db.similarity_search(query,filter=filter, k=2,fetch_k=4)
for doc in results:
    print(f"Content: {doc.page_content}, \nMetadata: {doc.metadata}\n\n")

Content: at a significantly lower cost. We’re not close to being done innovating here,and this long-term investment should prove fruitful for both customers and AWS. AWS is still in the earlystages of its evolution, and has a chance for unusual growth in the next decade., 
Metadata: {'source': '2022-Shareholder-letter.pdf', 'page': 2, 'title': 'Shareholder-letter-2022', 'reported_currency': 'USD', 'year': 2022, 'start_index': 4966}


Content: In 2008, AWS was still a fairly small, fledgling business.We knew we were on to something, but it still required substantial capital investment. There were voicesinside and outside of the company questioning why Amazon (known mostly as an online retailer then) wouldbe investing so much in cloud computing. But, we knew we were inventing something special that couldcreate a lot of value for customers and Amazon in the future. We had a head start on potential competitors;and if anything, we wanted to accelerate our pace of innovation. We made the lon