<a href="https://colab.research.google.com/github/tysonbarreto/VectorDatabases/blob/main/VectorDatabases.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## ChromaDB Demo

In [1]:
!pip -q install chromadb openai langchain langchain_community tiktoken

In [2]:
!pip show chromadb

Name: chromadb
Version: 0.5.11
Summary: Chroma.
Home-page: https://github.com/chroma-core/chroma
Author: 
Author-email: Jeff Huber <jeff@trychroma.com>, Anton Troynikov <anton@trychroma.com>
License: 
Location: /usr/local/lib/python3.10/dist-packages
Requires: bcrypt, build, chroma-hnswlib, fastapi, grpcio, httpx, importlib-resources, kubernetes, mmh3, numpy, onnxruntime, opentelemetry-api, opentelemetry-exporter-otlp-proto-grpc, opentelemetry-instrumentation-fastapi, opentelemetry-sdk, orjson, overrides, posthog, pydantic, pypika, PyYAML, rich, tenacity, tokenizers, tqdm, typer, typing-extensions, uvicorn
Required-by: 


In [5]:
%pip install --upgrade --quiet pypdf rapidocr_onnxruntime langchain-openai langchain-huggingface transformers

## Import Libraries

In [18]:
# Vector store, PDF loader, and local embedding / LLM wrappers
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain_huggingface import HuggingFaceEmbeddings, HuggingFacePipeline
# Chunking and the retrieval QA chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
# Local seq2seq model used as the LLM
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import pipeline
import torch

In [7]:
loader = PyPDFLoader("./CC.pdf", extract_images=True)
document = loader.load()

In [8]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=200
)

In [9]:
text = text_splitter.split_documents(document)
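
To make the effect of `chunk_size=500` and `chunk_overlap=200` concrete, here is a minimal sketch of overlapping chunking. It is a simplified fixed-window splitter, not the separator-aware algorithm that `RecursiveCharacterTextSplitter` uses, and `split_with_overlap` is a hypothetical helper name introduced only for illustration.

```python
def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 200) -> list[str]:
    """Simplified fixed-window splitter (illustration only, not LangChain's algorithm)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this many characters after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghij" * 100  # 1000 characters of dummy text
chunks = split_with_overlap(sample)
print(len(chunks), [len(c) for c in chunks])
```

With these settings, consecutive chunks share 200 characters, so a sentence that straddles a boundary is likely to appear whole in at least one chunk.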

## Creating DB

In [10]:
persist_directory = "db"

embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [11]:
vecotor_db = Chroma.from_documents(
    documents=text,
    embedding=embedding,
    persist_directory=persist_directory
)

In [12]:
vecotor_db.persist()  # deprecated: chromadb >= 0.4 persists automatically
vecotor_db = None  # drop the in-memory handle before reloading from disk

In [13]:
vecotor_db = Chroma(
    embedding_function=embedding,
    persist_directory=persist_directory
)


## Retriever

In [14]:
retriever = vecotor_db.as_retriever()

In [15]:
retriever.invoke("What is climate change?")  # replaces the deprecated get_relevant_documents()

[Document(metadata={'page': 1, 'source': './CC.pdf'}, page_content='climate back into geological history, 100,000s of years ago!\nClimate change (sometimes called global warming) is the \nprocess of our planet heating up. Our planet has already \nwarmed by an average of 1°C in the last 100 years and if \nthings don’t change, it could increase by a lot more than \nthat. This warming causes harmful impacts such as the \nmelting of Arctic sea ice, more severe weather events like \nheatwaves, floods and hurricanes, rising sea levels, spread'),
 Document(metadata={'page': 1, 'source': './CC.pdf'}, page_content='climate back into geological history, 100,000s of years ago!\nClimate change (sometimes called global warming) is the \nprocess of our planet heating up. Our planet has already \nwarmed by an average of 1°C in the last 100 years and if \nthings don’t change, it could increase by a lot more than \nthat. This warming causes harmful impacts such as the \nmelting of Arctic sea ice, more 

In [16]:
retriever = vecotor_db.as_retriever(search_kwargs={"k": 2})  # k = number of chunks returned per query
retriever.invoke("What is climate change?")

[Document(metadata={'page': 1, 'source': './CC.pdf'}, page_content='climate back into geological history, 100,000s of years ago!\nClimate change (sometimes called global warming) is the \nprocess of our planet heating up. Our planet has already \nwarmed by an average of 1°C in the last 100 years and if \nthings don’t change, it could increase by a lot more than \nthat. This warming causes harmful impacts such as the \nmelting of Arctic sea ice, more severe weather events like \nheatwaves, floods and hurricanes, rising sea levels, spread'),
 Document(metadata={'page': 1, 'source': './CC.pdf'}, page_content='climate back into geological history, 100,000s of years ago!\nClimate change (sometimes called global warming) is the \nprocess of our planet heating up. Our planet has already \nwarmed by an average of 1°C in the last 100 years and if \nthings don’t change, it could increase by a lot more than \nthat. This warming causes harmful impacts such as the \nmelting of Arctic sea ice, more 
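
The two hits above carry identical `page_content`. One possible cause, which this output alone cannot confirm, is that the persisted collection was populated more than once. Whatever the cause, a small post-processing step can drop repeated chunks before they reach the LLM. The sketch below uses a hypothetical `Doc` dataclass as a stand-in for LangChain's `Document`, and `dedupe_by_content` is an illustrative helper, not a LangChain API.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document (hypothetical, for illustration)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def dedupe_by_content(docs):
    """Keep the first occurrence of each distinct page_content, preserving order."""
    seen = set()
    unique = []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique


hits = [
    Doc("same text", {"page": 1}),
    Doc("same text", {"page": 1}),
    Doc("other text", {"page": 2}),
]
unique_hits = dedupe_by_content(hits)
print(len(unique_hits))
```

The same function should work on real retriever results, since it only reads the `page_content` attribute.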

In [17]:
retriever.search_kwargs, retriever.search_type

({'k': 2}, 'similarity')
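
Conceptually, a `similarity` search with `k=2` ranks stored vectors by closeness to the query vector and returns the two best matches. The sketch below illustrates that idea with cosine similarity on toy 2-D vectors. The metric is an assumption made for illustration (the distance Chroma actually uses depends on the collection's settings), and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = sorted(
        ((cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)),
        reverse=True,  # highest similarity first
    )
    return [idx for _, idx in scored[:k]]


doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy "embeddings"
nearest = top_k([1.0, 0.0], doc_vecs, k=2)
print(nearest)
```

A real vector database avoids this brute-force scan by using an approximate nearest-neighbour index, but the ranking idea is the same.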

## Chains

In [19]:
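# Load the LaMini-T5 checkpoint on CPU in full precision
basemodel = AutoModelForSeq2SeqLM.from_pretrained(
    "MBZUAI/LaMini-T5-738M",
    device_map=torch.device("cpu"),
    torch_dtype=torch.float32
)

# The tokenizer must come from the same checkpoint as the model
tokenizer = AutoTokenizer.from_pretrained(
    "MBZUAI/LaMini-T5-738M"
)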

In [20]:
pipe = pipeline(
    "text2text-generation",
    model=basemodel,
    tokenizer=tokenizer,
    max_length=256,
    do_sample=True,
    temperature=0.3,
    top_p=0.95
)

In [21]:
local_llm = HuggingFacePipeline(
    pipeline=pipe
)

In [23]:
qa_chain = RetrievalQA.from_chain_type(
    llm=local_llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)
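
The `"stuff"` chain type places all retrieved chunks into a single prompt alongside the question. A rough sketch of that idea follows. The prompt wording here is a hypothetical template written for illustration, not the template LangChain ships, and `build_stuff_prompt` is an assumed helper name.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate every retrieved chunk into one context block, then append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"  # hypothetical wording
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt("What is climate change?", ["chunk one", "chunk two"])
print(prompt)
```

Because every chunk is included verbatim, the prompt grows with `k` and with `chunk_size`, which matters for models with a short input window.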

In [24]:
query = "What is climate change?"
llm_reponse = qa_chain.invoke({"query": query})  # replaces the deprecated qa_chain(query) call

In [25]:
llm_reponse

{'query': 'What is climate change?',
 'result': 'Climate change is the process of our planet heating up, causing harmful impacts such as melting of Arctic sea ice, more severe weather events like heatwaves, floods and hurricanes, rising sea levels, and spreading climate back into geological history, 100,000s of years ago.',
 'source_documents': [Document(metadata={'page': 1, 'source': './CC.pdf'}, page_content='climate back into geological history, 100,000s of years ago!\nClimate change (sometimes called global warming) is the \nprocess of our planet heating up. Our planet has already \nwarmed by an average of 1°C in the last 100 years and if \nthings don’t change, it could increase by a lot more than \nthat. This warming causes harmful impacts such as the \nmelting of Arctic sea ice, more severe weather events like \nheatwaves, floods and hurricanes, rising sea levels, spread'),
  Document(metadata={'page': 1, 'source': './CC.pdf'}, page_content='climate back into geological history, 
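
The raw response dictionary is hard to read at a glance. A small formatter along the following lines can print the answer followed by its sources. It assumes only the keys visible in the output above (`result`, `source_documents`) and the `metadata` attribute of each document. `format_response` is a hypothetical helper, and `SimpleNamespace` stands in for a real `Document` in the example.

```python
from types import SimpleNamespace


def format_response(response):
    """Render the answer, then one line per source document."""
    lines = [response["result"], "", "Sources:"]
    for doc in response["source_documents"]:
        meta = doc.metadata
        lines.append(f"- {meta.get('source', '?')} (page {meta.get('page', '?')})")
    return "\n".join(lines)


# Stand-in response shaped like the chain output shown above
example = {
    "query": "q",
    "result": "An answer.",
    "source_documents": [
        SimpleNamespace(metadata={"page": 1, "source": "./CC.pdf"}, page_content="..."),
    ],
}
report = format_response(example)
print(report)
```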

In [26]:
query = "What is the solution to climate change?"
llm_reponse = qa_chain.invoke({"query": query})

In [27]:
llm_reponse

{'query': 'What is the solution to climate change?',
 'result': 'The solution to climate change is to reduce greenhouse gas emissions by transitioning to renewable energy sources, reducing food waste, and using energy-efficient appliances.',
 'source_documents': [Document(metadata={'page': 1, 'source': './CC.pdf'}, page_content='climate back into geological history, 100,000s of years ago!\nClimate change (sometimes called global warming) is the \nprocess of our planet heating up. Our planet has already \nwarmed by an average of 1°C in the last 100 years and if \nthings don’t change, it could increase by a lot more than \nthat. This warming causes harmful impacts such as the \nmelting of Arctic sea ice, more severe weather events like \nheatwaves, floods and hurricanes, rising sea levels, spread'),
  Document(metadata={'page': 1, 'source': './CC.pdf'}, page_content='climate back into geological history, 100,000s of years ago!\nClimate change (sometimes called global warming) is the \npr

## Delete DB

In [28]:
vecotor_db.delete_collection()

In [29]:
vecotor_db.persist()  # deprecated: has no effect on chromadb >= 0.4

In [30]:
!rm -rf db/
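
Outside a notebook, or on a platform without `rm`, the same cleanup can be done from Python. The sketch below is one way to do it with the standard library. `remove_persist_directory` is a hypothetical helper, and the demo runs against a temporary directory rather than the real `db/` folder.

```python
import shutil
import tempfile
from pathlib import Path


def remove_persist_directory(path):
    """Delete the directory tree at `path`; return True if something was removed."""
    target = Path(path)
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False


# Demo on a throwaway directory so nothing real is deleted
demo_dir = Path(tempfile.mkdtemp()) / "db"
demo_dir.mkdir()
removed = remove_persist_directory(demo_dir)        # directory exists, so it is deleted
removed_again = remove_persist_directory(demo_dir)  # already gone, nothing to do
print(removed, removed_again)
```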