## **VECTORSTORE AND EMBEDDINGS**

### Loading Environment Variable

In [1]:
from secret_key import hugging_facehub_key
import os
os.environ['HUGGINGFACEHUB_API_TOKEN'] = hugging_facehub_key

### Document Splitter

In [2]:
from langchain.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Duplicate documents on purpose - messy data
    PyPDFLoader("MachineLearning-Lecture01.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [3]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

In [4]:
splits = text_splitter.split_documents(docs)

In [5]:
len(splits)

57

### Embeddings

In [6]:
# !pip install sentence-transformers

In [7]:
from langchain.embeddings import OpenAIEmbeddings, HuggingFaceEmbeddings

In [8]:
embeddings = HuggingFaceEmbeddings()



In [9]:
text = "This is a test document to check the embeddings."

In [10]:
text_embedding = embeddings.embed_query(text)

In [11]:
print(f'Embeddings lenght: {len(text_embedding)}')
print (f"Here's a sample: {text_embedding[:5]}...")

Embeddings lenght: 768
Here's a sample: [-0.0027273022569715977, -0.09290839731693268, -0.02543785609304905, 0.07693515717983246, 0.034590646624565125]...


### Vector Store

In [12]:
# ! pip install langchain-chroma

In [13]:
from langchain_chroma import Chroma

In [14]:
db = Chroma.from_documents(splits, embeddings)

In [15]:
print(db._collection.count())

57


### Similarity Search

In [16]:
question = "is there an email i can ask for help"

In [17]:
docs = db.similarity_search(question,k=3)

In [18]:
len(docs)

3

In [19]:
docs[0].page_content

"So all right, online resources. The class has a home page, so it's in on the handouts. I \nwon't write on the chalkboard — http:// cs229.stanford.edu. And so when there are \nhomework assignments or things like that, we  usually won't sort of — in the mission of \nsaving trees, we will usually not give out many handouts in class. So homework \nassignments, homework solutions will be posted online at the course home page.  \nAs far as this class, I've also written, a nd I guess I've also revised every year a set of \nfairly detailed lecture notes that cover the te chnical content of this  class. And so if you \nvisit the course homepage, you'll also find the detailed lecture notes that go over in detail \nall the math and equations and so on  that I'll be doing in class.  \nThere's also a newsgroup, su.class.cs229, also written on the handout. This is a \nnewsgroup that's sort of a forum for people in  the class to get to  know each other and \nhave whatever discussions you want to ha 

## Failure modes

In [20]:
def search(query, k=5):
    # Perform similarity search
    docs = db.similarity_search(question,k=5)
    
    # Print metadata of top search results
    for doc in docs:
        print(doc.metadata)
    
    # Return the content of the last document
    return docs[-1].page_content

# Example usage
question1 = "what did they say about matlab?"
print(search(question1))

question2 = "what did they say about regression in the third lecture?"
print(search(question2))


{'page': 5, 'source': 'MachineLearning-Lecture01.pdf'}
{'page': 4, 'source': 'MachineLearning-Lecture01.pdf'}
{'page': 14, 'source': 'MachineLearning-Lecture01.pdf'}
{'page': 7, 'source': 'MachineLearning-Lecture01.pdf'}
{'page': 5, 'source': 'MachineLearning-Lecture01.pdf'}
cs229-qa@cs.stanford.edu. This goes to an acc ount that's read by all the TAs and me. So 
rather than sending us email individually, if you send email to this account, it will 
actually let us get back to you maximally quickly with answers to your questions.  
If you're asking questions about homework probl ems, please say in the subject line which 
assignment and which question the email refers to, since that will also help us to route 
your question to the appropriate TA or to me  appropriately and get the response back to 
you quickly.  
Let's see. Skipping ahead — let's see — for homework, one midterm, one open and term 
project. Notice on the honor code. So one thi ng that I think will help you to succeed and 