

1.   Indexing Documents with LangChain Utilities in Chroma DB
2.   Retrieving Semantically Similar Documents for a Specific Query
3.   Persistence in Chroma DB
4.   Integrating Chroma DB with LLM (OpenAI Chat Models)
5.   Using Question-Answering Chain to Extract Answers from Documents
6.   Utilizing RetrievalQA Chain

YouTube video: https://youtu.be/5NG8mefEsCU

In [26]:
!pip install  openai langchain sentence_transformers -q
!pip install chromadb -q

In [27]:
!pip install unstructured -q

Files Used : https://github.com/PradipNichite/Youtube-Tutorials/tree/main/chroma_db/pets

In [91]:
from langchain.document_loaders import DirectoryLoader

directory = '/content/Docs/'

def load_docs(directory):
  loader = DirectoryLoader(directory)
  documents = loader.load()
  return documents

documents = load_docs(directory)
len(documents)

3

https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter

In [92]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

def split_docs(documents, chunk_size=1000, chunk_overlap=20):
  text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
  docs = text_splitter.split_documents(documents)
  return docs

docs = split_docs(documents)
print(len(docs))

87
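
To make the chunk_size and chunk_overlap parameters concrete, here is a minimal sketch of fixed-width chunking with overlap. It is a simplification and not the library's algorithm: RecursiveCharacterTextSplitter also tries to break on separators such as paragraph and line boundaries. The helper name `sliding_chunks` and the sample string are made up for illustration.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=20):
    """Split text into fixed-width chunks whose neighbours share chunk_overlap characters.

    Hypothetical helper: a simplified stand-in, not LangChain's implementation.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk starts with the last three characters of the previous one, which is what keeps a sentence that straddles a boundary retrievable from either side.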


In [93]:
# Alternative (needs an OpenAI API key):
# from langchain.embeddings.openai import OpenAIEmbeddings
# embeddings = OpenAIEmbeddings()
from langchain.embeddings import SentenceTransformerEmbeddings
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

In [94]:
from langchain.vectorstores import Chroma
db = Chroma.from_documents(docs, embeddings)

In [47]:
query = "When was satyam formed?"
matching_docs = db.similarity_search(query)

In [48]:
matching_docs[0]

Document(page_content='Background story of the Satyam fraud case In the Indian outsourced IT-services market, Satyam Computer Services Limited was a rising star. Mr. Ramalinga Raju established the firm in Hyderabad in 1987. The company began with 20 workers and quickly expanded to become a worldwide company with operations in 65 countries across the world. Satyam was the first Indian business to be listed on three global stock exchanges, namely New York Stock Exchange (NYSE), DOW Jones, and EURONEXT.', metadata={'source': '/content/Docs/Satyam Case.txt'})

In [49]:
print(matching_docs[0].page_content)

Background story of the Satyam fraud case In the Indian outsourced IT-services market, Satyam Computer Services Limited was a rising star. Mr. Ramalinga Raju established the firm in Hyderabad in 1987. The company began with 20 workers and quickly expanded to become a worldwide company with operations in 65 countries across the world. Satyam was the first Indian business to be listed on three global stock exchanges, namely New York Stock Exchange (NYSE), DOW Jones, and EURONEXT.


In [50]:
matching_docs = db.similarity_search_with_score(query,k=2)
matching_docs

[(Document(page_content='Background story of the Satyam fraud case In the Indian outsourced IT-services market, Satyam Computer Services Limited was a rising star. Mr. Ramalinga Raju established the firm in Hyderabad in 1987. The company began with 20 workers and quickly expanded to become a worldwide company with operations in 65 countries across the world. Satyam was the first Indian business to be listed on three global stock exchanges, namely New York Stock Exchange (NYSE), DOW Jones, and EURONEXT.', metadata={'source': '/content/Docs/Satyam Case.txt'}),
  0.6587530970573425),
 (Document(page_content='After TCS, Infosys, and Wipro, it was recognized as India’s fourth-largest software exporter. The corporation had significant expansion in the 1990s. Satyam Renaissance, Satyam Info way, Satyam Spark Solutions, and Satyam Enterprise Solutions were formed as a result of the same. Satyam Info Way (Sify) was the first Indian internet business to be listed on the NASDAQ. In the new centur
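
The number returned next to each document is a distance, so a lower score means a closer match. As a rough sketch of how such a score ranks results, assuming the collection uses squared L2 distance (believed to be Chroma's default, but treat that as an assumption about the configuration), with made-up two-dimensional vectors standing in for real embeddings:

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy vectors for illustration only; real embeddings have hundreds of dimensions.
query_vec = [1.0, 0.0]
doc_vecs = {"close": [0.9, 0.1], "far": [0.0, 1.0]}

# Sort ascending: the smallest distance is the best match.
ranked = sorted(doc_vecs, key=lambda name: squared_l2(query_vec, doc_vecs[name]))
print(ranked)
```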

Persist a ChromaDB instance

In [95]:
persist_directory = "chroma_db"

vectordb = Chroma.from_documents(
    documents=docs, embedding=embeddings, persist_directory=persist_directory
)

In [96]:
vectordb.persist()

In [97]:
new_db = Chroma(persist_directory=persist_directory, embedding_function=embeddings)

In [54]:
matching_docs = new_db.similarity_search_with_score(query)
matching_docs[0]

(Document(page_content='Background story of the Satyam fraud case In the Indian outsourced IT-services market, Satyam Computer Services Limited was a rising star. Mr. Ramalinga Raju established the firm in Hyderabad in 1987. The company began with 20 workers and quickly expanded to become a worldwide company with operations in 65 countries across the world. Satyam was the first Indian business to be listed on three global stock exchanges, namely New York Stock Exchange (NYSE), DOW Jones, and EURONEXT.', metadata={'source': '/content/Docs/Satyam Case.txt'}),
 0.6587530970573425)

## LLM

In [17]:
import os
os.environ["OPENAI_API_KEY"] = "sk-..."  # replace with your own key; never commit a real key
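
A key written directly into a notebook cell leaks as soon as the notebook is shared. One possible alternative, sketched under the assumption that the key is either already exported in the environment or can be typed in at a prompt; the helper name `get_openai_key` is hypothetical:

```python
import os
from getpass import getpass

def get_openai_key(env=os.environ, prompt_fn=getpass):
    """Return the key from env, prompting for it (without echo) only if it is missing."""
    key = env.get("OPENAI_API_KEY")
    if not key:
        key = prompt_fn("OpenAI API key: ")
        env["OPENAI_API_KEY"] = key  # make it visible to libraries that read the env var
    return key

# Demo with a stand-in mapping so nothing is prompted here.
demo_env = {"OPENAI_API_KEY": "sk-placeholder"}
print(get_openai_key(env=demo_env) == "sk-placeholder")
```

In the notebook itself, calling `get_openai_key()` with no arguments would read or populate `os.environ`.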

In [18]:
from langchain.chat_models import ChatOpenAI
model_name = "gpt-3.5-turbo"
llm = ChatOpenAI(model_name=model_name)

### Document QA

https://python.langchain.com/docs/modules/chains/additional/question_answering

https://python.langchain.com/docs/modules/chains/document/

In [57]:
from langchain.chains.question_answering import load_qa_chain
# chain = load_qa_chain(llm, chain_type="stuff")
chain = load_qa_chain(llm, chain_type="stuff",verbose=True)

In [78]:
query = "who are listed in Copies to"
matching_docs = db.similarity_search(query)
answer =  chain.run(input_documents=matching_docs, question=query)
answer



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
System: Use the following pieces of context to answer the users question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
DATED: ~Ju I fl w~Jningtbn, D.C.

Copies to:

Jan M. Folena, Esquire

Assistant Chief Litigation Counsel

Securities and Exchange Commission

100 F Street, NE

Washington, D.C. 20549

(202) 551-4738 (telephone)

(202) 772-9245 (facsimile)

[I Lz /-__, J l!UvCJ c____

UNITED STATES DISTRICT JUDGE

12

Lawrence A. West, Esquire

LATHAM & WATKINS LLP

555 Eleventh Street, N.W.

Washington, DC 20004-1304

(202) 637-2135 (telephone)

(202) 637-2201 (facsimile)

DATED: ~Ju I fl w~Jningtbn, D.C.

Copies to:

Jan M. Folena, Esquire

Assistant Chief Litigation Counsel

Securities and Exchange Commission

100 F Street, NE

Washington, D.C. 20549

(202) 551-4738 (telephone)

(202) 77

'Jan M. Folena, Esquire\n\nAssistant Chief Litigation Counsel\n\nSecurities and Exchange Commission\n\n100 F Street, NE\n\nWashington, D.C. 20549\n\n(202) 551-4738 (telephone)\n\n(202) 772-9245 (facsimile)\n\nLawrence A. West, Esquire\n\nLATHAM & WATKINS LLP\n\n555 Eleventh Street, N.W.\n\nWashington, DC 20004-1304\n\n(202) 637-2135 (telephone)\n\n(202) 637-2201 (facsimile)'
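
The verbose output shows what the "stuff" chain type does: it places the text of every matching document into a single prompt and sends that to the model in one call. A minimal sketch of that idea, using the system text printed above; the `Doc` class, the function name `build_stuff_prompt`, and the blank-line separator between documents are assumptions for illustration, not LangChain's own code:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str

SYSTEM_TEMPLATE = (
    "Use the following pieces of context to answer the users question. \n"
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n"
    "----------------\n"
    "{context}"
)

def build_stuff_prompt(docs, question):
    """Concatenate all documents into one context block and pair it with the question."""
    context = "\n\n".join(d.page_content for d in docs)
    return [("system", SYSTEM_TEMPLATE.format(context=context)), ("human", question)]

messages = build_stuff_prompt([Doc("First chunk."), Doc("Second chunk.")], "What is in the chunks?")
for role, text in messages:
    print(f"{role}: {text}")
```

Because everything goes into one prompt, this approach only works while the retrieved chunks fit within the model's context window.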

### Retrieval QA

In [111]:
from langchain.chains import RetrievalQA
retrieval_chain = RetrievalQA.from_chain_type(llm, chain_type="stuff", retriever=db.as_retriever())
query = "Who is K. Surender"
retrieval_chain.run(query)

"There is no information provided in the context regarding K. Surender. Therefore, I don't know who K. Surender is."
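
When the chain answers that it does not know, it helps to see which chunks the retriever actually handed to the model. A hedged sketch, assuming the chain was built with `return_source_documents=True` and is called with a `{"query": ...}` dict; the helper `answer_with_sources` and the `fake_chain` stand-in are hypothetical, used so the example runs without an API key:

```python
from types import SimpleNamespace

# With a real chain this would be something like:
# qa = RetrievalQA.from_chain_type(llm, chain_type="stuff",
#                                  retriever=db.as_retriever(),
#                                  return_source_documents=True)

def answer_with_sources(qa_chain, query):
    """Run the chain and return its answer plus the source file of each retrieved chunk."""
    result = qa_chain({"query": query})
    sources = [doc.metadata.get("source") for doc in result["source_documents"]]
    return result["result"], sources

def fake_chain(inputs):
    """Stand-in that mimics the assumed output shape with made-up data."""
    return {
        "result": "I don't know.",
        "source_documents": [SimpleNamespace(metadata={"source": "example.txt"})],
    }

answer, sources = answer_with_sources(fake_chain, "Who is K. Surender")
print(answer, sources)
```

If the listed sources do not include the file that mentions the person, the problem lies in retrieval (chunking, embeddings, or the number of results) and not in the model.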