# Vector DB Experimentations

Here we will try to leverage the Cohere Embeddings generator to create our embeddings and then use Pinecone to store the vectors before using the Cohere API again to create a QA on a PDF file that we have chosen to be our source data.

## Importing Libraries

In [1]:
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import CohereEmbeddings
from langchain.llms import Cohere
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from ApiSecrets import ApiSecrets
import os

## Creating a PDF Directory as our Retrieval Source

In [2]:
loader = PyPDFDirectoryLoader("pdfs")
source_data = loader.load()

In [3]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 500, chunk_overlap = 20)
text_chunks = text_splitter.split_documents(source_data)

In [4]:
print(text_chunks[1].page_content)

aidan@cs.toronto.eduŁukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,


## Embeddings and Pinecone

In [5]:
from pinecone import Pinecone
os.environ["COHERE_API_KEY"] = ApiSecrets.COHERE_API_KEY
os.environ["PINECONE_API_KEY"] = ApiSecrets.PINECONE_API_KEY

In [6]:
embeddings = CohereEmbeddings(model="embed-english-v3.0")
text = "this is a test document"
sample_embed = embeddings.embed_query(text)
len(sample_embed)

  warn_deprecated(


1024

In [7]:
pc = Pinecone(api_key=os.environ.get("PINECONE_API_KEY"))
index_name = "testing-vec-db"
index = pc.Index(index_name)

### Creating Embeddings from each chunk from PDF

In [8]:
from langchain.vectorstores import Pinecone as LC_Pinecone

In [9]:
vecstore = LC_Pinecone.from_texts([chunk.page_content for chunk in text_chunks], embeddings, index_name=index_name)
vecstore.as_retriever()

VectorStoreRetriever(tags=['Pinecone', 'CohereEmbeddings'], vectorstore=<langchain_community.vectorstores.pinecone.Pinecone object at 0x000001C82DF99010>)

In [10]:
simi_prompt = "what is attention?"
simi_result = vecstore.similarity_search_with_score(simi_prompt)
print(f"Answer: {simi_result[0][0].page_content}\n Score: {simi_result[0][1]}")

Answer: described in section 3.2.
Self-attention, sometimes called intra-attention is an attention mechanism relating different positions
of a single sequence in order to compute a representation of the sequence. Self-attention has been
used successfully in a variety of tasks including reading comprehension, abstractive summarization,
textual entailment and learning task-independent sentence representations [4, 27, 28, 22].
 Score: 0.599071622


### Creating a Retrieval QA Chain

In [11]:
llm = Cohere(cohere_api_key = os.getenv("COHERE_API_KEY"))
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=vecstore.as_retriever())

  warn_deprecated(


In [38]:
qa_prompt = "what is multi-head attention?"
qa_result = qa.run(qa_prompt)
print(qa_result)

 The methodological strengths of the provided text lie in the thorough and insightful analysis of the performance of a particular model for parsing tasks. The analysis compares the model's performance to other previously reported models in the field, noting where it outperforms them. 

The text additionally highlights the benefits of the model's attention mechanism, which enables the model to handle long-range dependencies and capture global patterns, therefore improving its performance. 


In [49]:
print("Type 'exit' to quit")
while True:
    user_input = input("Enter Prompt: ")
    if user_input == "exit" or user_input == "Exit":
        break
    if user_input == '':
        continue
    res = qa({"query": user_input})
    print(f"Ans: {res["result"]}")

Type 'exit' to quit
Ans:  From the provided context, a Transformer is a model architecture that substitutes recurrence and self-attention mechanisms for drawing global dependencies between input and output. Specifically, The Transformer presents an updated approach to sequence transduction models that utilize multi-headed self-attention mechanisms, replacing the use of recurrent layers in encoder-decoder architectures. The model allows for more parallelization and can reach a new state of the art in translation tasks. 
Ans:  This paper was presented at the 31st Conference on Neural Information Processing Systems (NIPS 2017) and was published on August 2, 2023, as stated in the copyright section of the paper. 
The authors' names and affiliations are listed on the paper, and the identity of the specific author who wrote the paper may be included in this information in some cases. 
However, I don't have access to real-time data on the internet, so I cannot search for any subsequent update

## Embeddings and ChromaDB

In [39]:
from langchain.vectorstores import Chroma

### Download news article data
Download Commands:
- Windows: `Invoke-WebRequest -Uri "https://www.dropbox.com/s/vs6ocyvpzzncvwh/new_articles.zip" -OutFile "new_articles.zip"`
- Unix: `wget -q https://www.dropbox.com/s/vs6ocyvpzzncvwh/new_articles.zip`

Unzip Commands:
- Windows: `Expand-Archive -Path "new_articles.zip" -DestinationPath "news_articles"`
- Unix: `unzip -q new_articles.zip -d new_articles`

<b>NOTE:</b> These commands are correct. But this is not working correctly in this case, maybe due to file format issues in Dropbox. Here, I have done it manually using GUI.

### Load Files into list of Documents
We can use `DirectoryLoader` paired with `TextLoader` to complete the task in one line, but due to some encoding errors, I was not able to process it using that path. Explicit encoding declaration was required and so I chose the path used below. In Unix systems, this is usually not an issue and the following command can be used to do the same job with prebuilt funcs: 
```python
docs = DirectoryLoader("./news_articles", glob="./*.txt", loader_cls=TextLoader).load()
```

In [38]:
from langchain.document_loaders import TextLoader

In [32]:
dirpath = "./news_articles"
txt_files = os.listdir(dirpath)
filelist = [file for file in txt_files if file.endswith(".txt")]
docs = []
for file in filelist:
    filepath = os.path.join(dirpath, file)
    filetext = TextLoader(filepath, encoding='utf-8').load()[0]
    docs.append(filetext)


In [33]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
text = text_splitter.split_documents(docs)

In [36]:
print(text[0].page_content)

Signaling that investments in the supply chain sector remain robust, Pando, a startup developing fulfillment management technologies, today announced that it raised $30 million in a Series B round, bringing its total raised to $45 million.

Iron Pillar and Uncorrelated Ventures led the round, with participation from existing investors Nexus Venture Partners, Chiratae Ventures and Next47. CEO and founder Nitin Jayakrishnan says that the new capital will be put toward expanding Pando’s global sales, marketing and delivery capabilities.

“We will not expand into new industries or adjacent product areas,” he told TechCrunch in an email interview. “Great talent is the foundation of the business — we will continue to augment our teams at all levels of the organization. Pando is also open to exploring strategic partnerships and acquisitions with this round of funding.”


In [37]:
len(text)

233

### Creating Chroma Vector DB

In [40]:
persist_dirname = "vecdb"
vectordb = Chroma.from_documents(documents=docs, embedding=embeddings, persist_directory=persist_dirname)

In [41]:
vectordb.persist()

In [42]:
vectordb = None

In [43]:
vectordb = Chroma(persist_directory=persist_dirname, embedding_function=embeddings)
vectordb

<langchain_community.vectorstores.chroma.Chroma at 0x1c82f7c9b20>

### Creating QA Retriever using Chroma VecDB

In [44]:
retriever = vectordb.as_retriever()
prompt_res = retriever.get_relevant_documents("What is the relation between databricks and okera?")

In [46]:
print(prompt_res[0].page_content)

Databricks today announced that it has acquired Okera, a data governance platform with a focus on AI. The two companies did not disclose the purchase price. According to Crunchbase, Okera previously raised just under $30 million. Investors include Felicis, Bessemer Venture Partners, Cyber Mentor Fund, ClearSky and Emergent Ventures.

Data governance was already a hot topic, but the recent focus on AI has highlighted some of the shortcomings of the previous approach to it, Databricks notes in today’s announcement. “Historically, data governance technologies, regardless of sophistication, rely on enforcing control at some narrow waist layer and require workloads to fit into the ‘walled garden’ at this layer,” the company explains in a blog post. That approach doesn’t work anymore in the age of large language models (LLMs) because the number of assets is growing too quickly (in part because so much of it is machine-generated) and because the overall AI landscape is changing so quickly, st

In [48]:
len(prompt_res)

4

In [51]:
retriever = vectordb.as_retriever(search_kwargs={'k':2})
prompt_res = retriever.get_relevant_documents("What is the relation between databricks and okera?")
print(len(prompt_res))
print(retriever.search_kwargs)
print(prompt_res[0].page_content)

2
{'k': 2}
Databricks today announced that it has acquired Okera, a data governance platform with a focus on AI. The two companies did not disclose the purchase price. According to Crunchbase, Okera previously raised just under $30 million. Investors include Felicis, Bessemer Venture Partners, Cyber Mentor Fund, ClearSky and Emergent Ventures.

Data governance was already a hot topic, but the recent focus on AI has highlighted some of the shortcomings of the previous approach to it, Databricks notes in today’s announcement. “Historically, data governance technologies, regardless of sophistication, rely on enforcing control at some narrow waist layer and require workloads to fit into the ‘walled garden’ at this layer,” the company explains in a blog post. That approach doesn’t work anymore in the age of large language models (LLMs) because the number of assets is growing too quickly (in part because so much of it is machine-generated) and because the overall AI landscape is changing so 

In [52]:
def cited_answer(llm_response):
    print(llm_response['result'])
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])

In [53]:
qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True)

In [54]:
query = "What is the relation between databricks and okera?"
response = qa_chain(query)

  warn_deprecated(


In [55]:
response

{'query': 'What is the relation between databricks and okera?',
 'result': " Databricks has recently acquired Okera, a data governance platform. Databricks plans to integrate Okera's technology into its own platform, specifically its Unity Catalog. The acquisition will also enable Databricks to expose additional APIs that its data governance partners can use to provide solutions for their customers. ",
 'source_documents': [Document(page_content='Databricks today announced that it has acquired Okera, a data governance platform with a focus on AI. The two companies did not disclose the purchase price. According to Crunchbase, Okera previously raised just under $30 million. Investors include Felicis, Bessemer Venture Partners, Cyber Mentor Fund, ClearSky and Emergent Ventures.\n\nData governance was already a hot topic, but the recent focus on AI has highlighted some of the shortcomings of the previous approach to it, Databricks notes in today’s announcement. “Historically, data governan

In [56]:
cited_answer(response)

 Databricks has recently acquired Okera, a data governance platform. Databricks plans to integrate Okera's technology into its own platform, specifically its Unity Catalog. The acquisition will also enable Databricks to expose additional APIs that its data governance partners can use to provide solutions for their customers. 


Sources:
./news_articles\05-03-databricks-acquires-ai-centric-data-governance-platform-okera.txt
./news_articles\05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt


### Deleting Vector DB

In [None]:
!zip -r vecdb.zip ./vecdb

In [None]:
vectordb.delete_collection()
vectordb.persist()

In [None]:
!rm -rf vecdb/