<a href="https://colab.research.google.com/github/rvraghvender/ChromaDB_vectorDatabase/blob/main/ChromaDB_Vector_Database.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# !pip -q install chromadb openai langchain tiktoken

In [2]:
!pip show chromadb

Name: chromadb
Version: 0.4.22
Summary: Chroma.
Home-page: 
Author: 
Author-email: Jeff Huber <jeff@trychroma.com>, Anton Troynikov <anton@trychroma.com>
License: 
Location: /usr/local/lib/python3.10/dist-packages
Requires: bcrypt, build, chroma-hnswlib, fastapi, grpcio, importlib-resources, kubernetes, mmh3, numpy, onnxruntime, opentelemetry-api, opentelemetry-exporter-otlp-proto-grpc, opentelemetry-instrumentation-fastapi, opentelemetry-sdk, overrides, posthog, pulsar-client, pydantic, pypika, PyYAML, requests, tenacity, tokenizers, tqdm, typer, typing-extensions, uvicorn
Required-by: 


In [3]:
!wget -q https://www.dropbox.com/s/vs6ocyvpzzncvwh/new_articles.zip

In [4]:
!rm -rf new_articles
!unzip -q new_articles.zip -d new_articles

In [5]:
from google.colab import userdata

In [6]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders import TextLoader

## Loading data

In [7]:
loader = DirectoryLoader('/content/new_articles', glob="./*.txt", loader_cls=TextLoader)

In [8]:
document = loader.load()

In [9]:
document[0]
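
As a rough illustration of what the directory-loading step produces, the sketch below reads every `.txt` file in a folder into a record holding the text and its source path. It uses only the standard library; `load_text_files` and the demo file names are hypothetical and are not part of LangChain.

```python
import tempfile
from pathlib import Path


def load_text_files(directory, pattern="*.txt"):
    """Return one record per matching file, keeping the path as metadata."""
    records = []
    for path in sorted(Path(directory).glob(pattern)):
        records.append({
            "page_content": path.read_text(encoding="utf-8"),
            "metadata": {"source": str(path)},
        })
    return records


# Demo on a throwaway directory (file names are made up for illustration).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first article", encoding="utf-8")
    Path(tmp, "b.txt").write_text("second article", encoding="utf-8")
    Path(tmp, "notes.md").write_text("ignored", encoding="utf-8")
    loaded = load_text_files(tmp)

print(len(loaded))
print(loaded[0]["page_content"])
```

Keeping the source path next to the text is what later allows an answer to be traced back to the article it came from.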



## Converting data into chunks

In [10]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [11]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
text = text_splitter.split_documents(document)

In [12]:
text[0]

Document(page_content='Well that was fast. The U.K.’s competition watchdog has announced an initial review of “AI foundational models”, such as the large language models (LLMs) which underpin OpenAI’s ChatGPT and Microsoft’s New Bing. Generative AI models which power AI art platforms such as OpenAI’s DALL-E or Midjourney will also likely fall in scope.\n\nThe Competition and Markets Authority (CMA) said its review will look at competition and consumer protection considerations in the development and use of AI foundational models — with the aim of understanding “how foundation models are developing and producing an assessment of the conditions and principles that will best guide the development of foundation models and their use in the future”.\n\nIt’s proposing to publish the review in “early September”, with a deadline of June 2 for interested stakeholders to submit responses to inform its work.', metadata={'source': '/content/new_articles/05-04-cma-generative-ai-review.txt'})

In [13]:
text[1]

Document(page_content='It’s proposing to publish the review in “early September”, with a deadline of June 2 for interested stakeholders to submit responses to inform its work.\n\n“Foundation models, which include large language models and generative artificial intelligence (AI), that have emerged over the past five years, have the potential to transform much of what people and businesses do. To ensure that innovation in AI continues in a way that benefits consumers, businesses and the UK economy, the government has asked regulators, including the [CMA], to think about how the innovative development and deployment of AI can be supported against five overarching principles: safety, security and robustness; appropriate transparency and explainability; fairness; accountability and governance; and contestability and redress,” the CMA wrote in a press release.”', metadata={'source': '/content/new_articles/05-04-cma-generative-ai-review.txt'})

In [14]:
text[2]

Document(page_content='Stanford University’s Human-Centered Artificial Intelligence Center’s Center for Research on Foundation Models is credited with coining the term “foundational models”, back in 2021, to refer to AI systems that focus on training one model on a huge amount of data and adapting it to many applications.\n\n“The development of AI touches upon a number of important issues, including safety, security, copyright, privacy, and human rights, as well as the ways markets work. Many of these issues are being considered by government or other regulators, so this initial review will focus on the questions the CMA is best placed to address — what are the likely implications of the development of AI foundation models for competition and consumer protection?” the CMA added.\n\nIn a statement, its CEO, Sarah Cardell, also said:', metadata={'source': '/content/new_articles/05-04-cma-generative-ai-review.txt'})

In [15]:
text[3]

Document(page_content='In a statement, its CEO, Sarah Cardell, also said:\n\nAI has burst into the public consciousness over the past few months but has been on our radar for some time. It’s a technology developing at speed and has the potential to transform the way businesses compete as well as drive substantial economic growth. It’s crucial that the potential benefits of this transformative technology are readily accessible to UK businesses and consumers while people remain protected from issues like false or misleading information. Our goal is to help this new, rapidly scaling technology develop in ways that ensure open, competitive markets and effective consumer protection.\n\nSpecifically, the U.K. competition regulator said its initial review of AI foundational models will:\n\nexamine how the competitive markets for foundation models and their use could evolve\n\nexplore what opportunities and risks these scenarios could bring for competition and consumer protection', metadata={'

In [16]:
len(text)

233
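
The effect of `chunk_overlap` is visible in the outputs above: `text[1]` begins with the final paragraph of `text[0]`. The sketch below shows the idea with a plain fixed-size sliding window. It is a simplification, not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on paragraph, line, and word boundaries; `split_with_overlap` is a hypothetical helper.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.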

## Creating db object

In [17]:
persist_directory = 'db'

embeddings = OpenAIEmbeddings(openai_api_key=userdata.get('OPENAI_API_KEY'))


In [18]:
vectordb = Chroma.from_documents(documents=text,
                               embedding=embeddings,
                               persist_directory=persist_directory)

In [19]:
# persist the db to disk
vectordb.persist()
vectordb = None

In [20]:
# Load the persisted database from disk and use it as a normal database
vectordb = Chroma(persist_directory=persist_directory,
                  embedding_function=embeddings
                  )
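
The persist, discard, and reload steps above amount to a round trip through disk. The toy store below makes that round trip explicit with the standard-library `sqlite3` module. It is only an analogy under stated assumptions: the table layout, file name, and helper functions are hypothetical and say nothing about how Chroma lays out its own files.

```python
import json
import os
import sqlite3
import tempfile


def save_vectors(db_path, rows):
    """Write (text, vector) pairs to a SQLite file, replacing earlier contents."""
    conn = sqlite3.connect(db_path)
    with conn:  # commits the transaction on success
        conn.execute("DROP TABLE IF EXISTS vectors")
        conn.execute("CREATE TABLE vectors (text TEXT, vector TEXT)")
        conn.executemany(
            "INSERT INTO vectors VALUES (?, ?)",
            [(text, json.dumps(vector)) for text, vector in rows],
        )
    conn.close()


def load_vectors(db_path):
    """Read every (text, vector) pair back from the SQLite file."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT text, vector FROM vectors ORDER BY rowid"
    ).fetchall()
    conn.close()
    return [(text, json.loads(vector)) for text, vector in rows]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_store.sqlite3")
    store = [("chunk one", [0.1, 0.2]), ("chunk two", [0.3, 0.4])]
    save_vectors(path, store)
    store = None                   # drop the in-memory copy
    restored = load_vectors(path)  # rebuild it from disk

print(restored)
```

The practical benefit is the same as in the notebook: embeddings are computed once and reused across sessions instead of being recomputed on every run.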

## Make a retriever

In [21]:
retriever = vectordb.as_retriever()

In [22]:
docs = retriever.get_relevant_documents('How much money did Microsoft made')

In [23]:
docs

[Document(page_content='According to a study from the University of Cambridge, at least half of developers’ efforts are spent debugging and not actively programming, which costs the software industry an estimated $312 billion per year. But so far, only a handful of code-generating AI systems have been made freely available to the public — reflecting the commercial incentives of the organizations building them (see: Replit).\n\nStarCoder, which by contrast is licensed to allow for royalty-free use by anyone, including corporations, was trained on over 80 programming languages as well as text from GitHub repositories, including documentation and programming notebooks. StarCoder integrates with Microsoft’s Visual Studio Code code editor and, like OpenAI’s ChatGPT, can follow basic instructions (e.g., “create an app UI”) and answer questions about code.', metadata={'source': '/content/new_articles/05-04-hugging-face-and-servicenow-release-a-free-code-generating-model.txt'}),
 Document(page

In [24]:
retriever = vectordb.as_retriever(search_kwargs={'k':2})

In [25]:
retriever.search_type

'similarity'
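
A `'similarity'` search with `k=2` returns the two stored chunks whose embeddings are closest to the query embedding. The sketch below does this by brute force with cosine similarity on made-up three-dimensional vectors. Both choices are assumptions for illustration: the metric Chroma applies depends on how the collection is configured, and the `chroma-hnswlib` dependency listed earlier suggests it searches an index rather than scanning every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed, k):
    """Return the texts of the k entries most similar to query_vector."""
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for text, vector in indexed
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "embeddings"; real ones have hundreds or thousands of dimensions.
toy_index = [
    ("funding round", [0.9, 0.1, 0.0]),
    ("code generation", [0.1, 0.9, 0.0]),
    ("regulation", [0.0, 0.2, 0.9]),
]
query_vector = [0.8, 0.3, 0.0]
results = top_k(query_vector, toy_index, k=2)
print(results)
```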

**So far we have only retrieved relevant context from the database; we have yet to get a smart answer.**

# Use an LLM to get a smart answer

## Make a chain

In [26]:
from langchain.chains import RetrievalQA

In [32]:
# Create the chain that answers questions

qa_chain = RetrievalQA.from_chain_type(llm=OpenAI(api_key=userdata.get('OPENAI_API_KEY')),
                                       chain_type='stuff',
                                       retriever=retriever,
                                       return_source_documents=True)
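
With `chain_type='stuff'`, all retrieved chunks are placed ("stuffed") into a single prompt together with the question. The sketch below shows that assembly step only. The template wording and the `build_stuff_prompt` helper are hypothetical, not LangChain's actual prompt, and the two context strings are shortened from the retrieved article text.

```python
def build_stuff_prompt(question, documents):
    """Join all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = [
    "The size of Microsoft's investment is believed to be around $10 billion.",
    "Altogether the VCs have put in just over $300 million.",
]
prompt = build_stuff_prompt("How much did Microsoft raise?", retrieved)
print(prompt)
```

Because every chunk goes into one prompt, this approach only works while the combined text fits in the model's context window, which is one reason to keep `k` small.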

In [33]:
query = 'How much did Microsoft raise?'
llm_response = qa_chain(query)
llm_response


{'query': 'How much did Microsoft raise?',
 'result': ' Microsoft raised around $10 billion from their investment in OpenAI.',
 'source_documents': [Document(page_content='April 28, 2023\n\nVC firms including Sequoia Capital, Andreessen Horowitz, Thrive and K2 Global are picking up new shares, according to documents seen by TechCrunch. A source tells us Founders Fund is also investing. Altogether the VCs have put in just over $300 million at a valuation of $27 billion to $29 billion. This is separate to a big investment from Microsoft announced earlier this year, a person familiar with the development told TechCrunch, which closed in January. The size of Microsoft’s investment is believed to be around $10 billion, a figure we confirmed with our source.\n\nApril 25, 2023\n\nCalled ChatGPT Business, OpenAI describes the forthcoming offering as “for professionals who need more control over their data as well as enterprises seeking to manage their end users.”', metadata={'source': '/conten

In [34]:
def process_llm_response(llm_response):
    print(llm_response['result'])
    print('\n\nSources: ')
    for source in llm_response['source_documents']:
        print(source.metadata['source'])

In [35]:
process_llm_response(llm_response)

 Microsoft raised around $10 billion from their investment in OpenAI.


Sources: 
/content/new_articles/05-03-chatgpt-everything-you-need-to-know-about-the-ai-powered-chatbot.txt
/content/new_articles/05-03-chatgpt-everything-you-need-to-know-about-the-ai-powered-chatbot.txt
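
The same file is listed twice above because both retrieved chunks came from one article. If a single line per article is preferred, the paths can be de-duplicated while keeping their order. A minimal sketch, with `unique_sources` as a hypothetical helper:

```python
def unique_sources(source_paths):
    """Drop repeated paths while keeping first-seen order."""
    return list(dict.fromkeys(source_paths))


paths = [
    "/content/new_articles/05-03-chatgpt-everything-you-need-to-know-about-the-ai-powered-chatbot.txt",
    "/content/new_articles/05-03-chatgpt-everything-you-need-to-know-about-the-ai-powered-chatbot.txt",
]
for path in unique_sources(paths):
    print(path)
```

Inside `process_llm_response`, the input list would be built from `source.metadata['source']` for each returned document.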


## Deleting the database

In [36]:
# !zip -r db.zip ./db

In [37]:
# Clean up: delete the collection, then remove the database directory

vectordb.delete_collection()
vectordb.persist()

!rm -rf db