In [113]:
import os
import numpy as np

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain.llms import HuggingFaceHub
from getpass import getpass
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_chroma import Chroma

In [114]:
os.environ["HUGGINGFACEHUB_API_TOKEN"] = getpass("HF Token:")
# from llama_index.embeddings.huggingface import HuggingFaceEmbedding

HF Token: ········


## Loading files

The files used are the first three lecture transcripts of the Stanford CS229 course.

In [116]:
fileLoc = "D:\\github\\LLMS\\transcripts"
files = os.listdir(fileLoc)

# Build one loader per transcript (the first three files in the directory listing)
loaders = [PyPDFLoader(os.path.join(fileLoc, name)) for name in files[:3]]

# Load every page of each PDF into a single list of documents
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [117]:
textSplitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

splits = textSplitter.split_documents(docs)
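
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-window chunking with overlap. It is a simplification, not what `RecursiveCharacterTextSplitter` does internally (that splitter also tries to break on separators such as paragraphs and sentences); the helper name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=1500, chunk_overlap=150):
    """Split text into windows of chunk_size characters, where each window
    starts chunk_size - chunk_overlap characters after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# Tiny example: windows of 4 characters that share 1 character with their neighbour
example_chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(example_chunks)
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of storing some text twice.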

### Loading a model for vector embedding

In [119]:
EMBEDDING_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
# model_name = "Snowflake/snowflake-arctic-embed-m"

embeddings = HuggingFaceEmbeddings(model_name = EMBEDDING_MODEL_NAME, encode_kwargs={"normalize_embeddings": True})



Set the persist directory and remove any files left in it from a previous run.

In [121]:
import shutil

persist_directory = "D:\\github\\LLMS\\PersistDir"
# Portable removal; `rm -rf` is not available in the default Windows shell
shutil.rmtree(persist_directory, ignore_errors=True)

### Storing the embeddings in a vector database

In [123]:
vectorDB = Chroma.from_documents(
    documents = splits,
    embedding = embeddings,
    persist_directory = persist_directory
)

print(vectorDB._collection.count())

1816


### Retrieving relevant information from the stored vector database

In [125]:
# Answer retrieval through simple semantic search

question = "What are the names of the Course TA?"

ans = vectorDB.similarity_search(question, k = 3)
print(len(ans))  ### The search has returned three chunks as answers

3


In [126]:
ans[0].page_content

"learning algorithms to teach a car how to  drive at reasonably high speeds off roads \navoiding obstacles.  \nAnd on the lower right, that's a robot program med by PhD student Eva Roshen to teach a \nsort of somewhat strangely configured robot how to get on top of an obstacle, how to get \nover an obstacle. Sorry. I know the video's kind of small. I hope you can sort of see it. \nOkay?  \nSo I think all of these are robots that I thi nk are very difficult to hand-code a controller \nfor by learning these sorts of l earning algorithms. You can in relatively short order get a \nrobot to do often pretty amazing things.  \nOkay. So that was most of what I wanted to say today. Just a couple more last things, but \nlet me just check what questions you have righ t now. So if there are no questions, I'll just \nclose with two reminders, which are after class today or as you start to talk with other \npeople in this class, I just encourage you again to start to form project partners, to try to

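The returned result is a chunk from the documents stored in the vector database that was judged closest to the question by semantic similarity. Plain semantic search has limitations: it always returns the k nearest chunks, even when none of them answers the question. Here, the top chunk does not mention the course TAs at all.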
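
The ranking step behind a similarity search can be sketched as follows. This is an illustration under stated assumptions, not Chroma's implementation: the three-dimensional vectors are made-up stand-ins for real embeddings, `normalize` and `top_k` are hypothetical helpers, and cosine similarity is assumed as the metric (for unit-length vectors it equals the dot product).

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def top_k(query_vec, chunk_vecs, k=3):
    """Return (index, score) pairs for the k chunks most similar to the query."""
    q = normalize(query_vec)
    scores = [sum(a * b for a, b in zip(q, normalize(c))) for c in chunk_vecs]
    order = sorted(range(len(chunk_vecs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]


# Toy "embeddings" (assumed values, for illustration only)
chunks = [[0.9, 0.1, 0.0], [0.2, 0.9, 0.1], [0.7, 0.6, 0.2]]
query = [1.0, 0.2, 0.0]

ranked = top_k(query, chunks, k=2)
print(ranked)
```

Note that `top_k` always returns k results: the scores say which chunks are nearest, not whether any of them actually contains the answer.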

In [128]:
question = "What was told about supervised learning?"

ans = vectorDB.similarity_search(question, k = 4)

In [129]:
print("Retrieved context 1")
print(ans[0].page_content)

Retrieved context 1
So I just want to start by showing you a f un video. Remember at the last lecture, the 
initial lecture, I talk ed about supervised learning. And supervised learning was this 
machine-learning problem where I said we'r e going to tell the algorithm what the close 
right answer is for a number of examples, a nd then we want the algorithm to replicate 
more of the same.  
So the example I had at the first lecture was the problem of predicting housing prices, 
where you may have a training set, and we tell the algorithm what the "right" housing 
price was for every house in the training set. And then you want the algorithm to learn the 
relationship between sizes of houses and the pr ices, and essentially produce more of the 
"right" answer.  
So let me show you a video now. Load the bi g screen, please. So I'll show you a video 
now that was from Dean Pomerleau at some work he did at Carnegie Mellon on applied 
supervised learning to get a car to drive itself . This is

### LLM aided Retrieval

In [132]:
llm = HuggingFaceHub(
    repo_id="huggingfaceh4/zephyr-7b-alpha",
    model_kwargs={"temperature": 0.01, "max_length": 64, "max_new_tokens": 512}
)

compressor = LLMChainExtractor.from_llm(llm)

The compressor is an extractor built on the LLM. The base retriever first fetches candidate chunks from the vector database using **Maximal Marginal Relevance** (MMR), which favours chunks that are relevant to the question yet different from one another. The compressor then passes each retrieved chunk through the LLM and keeps only the parts that are relevant to the question.
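
The idea behind Maximal Marginal Relevance can be sketched in a few lines. This is a simplified illustration, not LangChain's or Chroma's code: `cosine` and `mmr_select` are hypothetical helpers, the two-dimensional vectors are made-up stand-ins for embeddings, and the weight `lambda_mult` is an illustrative parameter that trades relevance (1.0) against diversity (0.0).

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, balancing relevance to the query
    against similarity to the documents already picked."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already selected (0 if nothing yet)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Documents 0 and 1 are near-duplicates; document 2 is less relevant but different
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.6, 0.8]]

diverse = mmr_select(query, docs, k=2, lambda_mult=0.3)
relevance_only = mmr_select(query, docs, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With the weight on pure relevance, both near-duplicates are returned; shifting weight towards diversity swaps the second duplicate for the chunk that adds new information.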

In [134]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor = compressor,
    base_retriever = vectorDB.as_retriever(search_type = "mmr")
)

In [152]:
question = "What was told about supervised learning?"

ans = compression_retriever.get_relevant_documents(question)

In [170]:
print(ans[0].page_content)

Given the following question and context, extract any part of the context *AS IS* that is relevant to answer the question. If none of the context is relevant return NO_OUTPUT. 

Remember, *DO NOT* edit the extracted parts of the context.

> Question: What was told about supervised learning?
> Context:
>>>
So I just want to start by showing you a f un video. Remember at the last lecture, the 
initial lecture, I talk ed about supervised learning. And supervised learning was this 
machine-learning problem where I said we'r e going to tell the algorithm what the close 
right answer is for a number of examples, a nd then we want the algorithm to replicate 
more of the same.  
So the example I had at the first lecture was the problem of predicting housing prices, 
where you may have a training set, and we tell the algorithm what the "right" housing 
price was for every house in the training set. And then you want the algorithm to learn the 
relationship between sizes of houses and the pr ice