## Imports

In [42]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.document_loaders import PyPDFLoader


## Load documents
Load the documents to run question answering over. To use your own documents, replace this section.

In [43]:
loader = PyPDFLoader("Mode-2021-Modern-Data-Architecture.pdf")
documents = loader.load()

## Split documents

Split the documents into small chunks so that we can find the chunks most relevant to a query and pass only those to the LLM.

In [37]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
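
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal pure-Python sketch of fixed-size chunking. It is an illustration only: `chunk_text` is a hypothetical helper, not how `RecursiveCharacterTextSplitter` is implemented (the real splitter also tries to break on separators such as paragraphs and sentences).

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> List[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    Hypothetical helper for illustration; not part of LangChain.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


no_overlap = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=0)
with_overlap = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(no_overlap)
print(with_overlap)
```

With `chunk_overlap=0`, as in the cell above, a sentence that straddles a chunk boundary is cut in two; a small overlap trades some duplicated text for a better chance that each chunk is self-contained.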

## Initialize ChromaDB

Create an embedding for each chunk and insert the embeddings into the Chroma vector database.

In [12]:
openai_api_key='...'

In [26]:
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
# Build the vector store once; every from_documents call re-embeds all chunks.
docsearch = Chroma.from_documents(texts, embeddings)


Using embedded DuckDB without persistence: data will be transient
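
The idea behind retrieving "the most relevant chunks" from a vector store can be sketched with plain Python. The example below ranks toy vectors by cosine similarity to a query vector. The vectors, `cosine_similarity`, and `top_k` are all made up for illustration; the store's actual index structure and distance metric may differ.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          chunk_vecs: Sequence[Sequence[float]],
          k: int = 2) -> List[Tuple[int, float]]:
    """Return (chunk index, similarity) for the k most similar chunks."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; real embeddings have many more dimensions.
chunk_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
query_vec = [1.0, 0.0, 0.0]

best = top_k(query_vec, chunk_vecs, k=2)
print(best)
```

Chunks 0 and 2 point in nearly the same direction as the query, so they rank ahead of chunk 1.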


## Create the chain

Initialize the chain we will use for question answering.

In [27]:
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(openai_api_key=openai_api_key),
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
)
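
Conceptually, the `"stuff"` chain type places all retrieved chunks into a single prompt alongside the question. The sketch below shows that idea only; `PROMPT_TEMPLATE` and `build_stuff_prompt` are hypothetical, and the real chain uses its own prompt wording.

```python
from typing import List

# Hypothetical template for illustration; not the prompt LangChain ships.
PROMPT_TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_stuff_prompt(question: str, chunks: List[str]) -> str:
    """Concatenate every retrieved chunk into one context block."""
    context = "\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


prompt = build_stuff_prompt(
    "What is a modular data stack?",
    ["Chunk A text.", "Chunk B text."],
)
print(prompt)
```

Because everything goes into one prompt, the retrieved chunks plus the question have to fit within the model's context window.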


## Ask questions

In [30]:
query = "Can you summarize this article in four paragraphs?"
qa.run(query)

' No, I cannot. This article is too long for me to summarize in four paragraphs.'

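## Summarize the document

The retrieval chain passes only the few chunks it retrieves for a query to the LLM, so it never sees the whole article and is a poor fit for summarizing it. A summarization chain is built for this task: it condenses one or more long documents into a shorter summary of their core information.

The recommended way to get started with a summarization chain is: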

In [44]:
# OpenAI is already imported above; the summarization chain only needs the LLM.
llm = OpenAI(openai_api_key=openai_api_key)

In [51]:
from langchain.chains.summarize import load_summarize_chain

# Reload and re-split the PDF with the same settings as above.
loader = PyPDFLoader("Mode-2021-Modern-Data-Architecture.pdf")
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# map_reduce summarizes each chunk, then combines the partial summaries.
chain = load_summarize_chain(llm, chain_type="map_reduce")
# To summarize only part of the document, pass a slice such as texts[4:7].
chain.run(texts)

' This article provides businesses with advice on how to upgrade their data stack to make better and faster business decisions. It explains why modular data stacks are more scalable and flexible than monolithic solutions, and discusses how companies can future-proof their data strategies with modern data architectures. It also provides tips on when to improve a data stack, how to embed analytics, and how to build a strong data culture. Mode and Sisu are advanced analytics platforms to help businesses get to insights faster.'
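
The control flow behind the `map_reduce` chain type can be sketched without any LLM. In the example below, `fake_llm_summarize` is a stand-in that simply keeps the first three words of its input, and `map_reduce_summarize` is a hypothetical helper. This is a simplified picture under those assumptions: it performs a single reduce step, whereas the real chain may need extra steps when the combined partial summaries are too long for one prompt.

```python
from typing import Callable, List

calls: List[str] = []  # records every input sent to the stand-in "LLM"


def fake_llm_summarize(text: str) -> str:
    """Stand-in for an LLM call: keep only the first three words."""
    calls.append(text)
    return " ".join(text.split()[:3])


def map_reduce_summarize(chunks: List[str], summarize: Callable[[str], str]) -> str:
    # Map: summarize each chunk independently of the others.
    partial_summaries = [summarize(chunk) for chunk in chunks]
    # Reduce: summarize the concatenated partial summaries once more.
    return summarize("\n".join(partial_summaries))


chunks = [
    "Modular stacks scale well. They are flexible.",
    "Data culture matters. Start small.",
]
summary = map_reduce_summarize(chunks, fake_llm_summarize)
print(len(calls), "calls made")
print(summary)
```

The number of model calls grows with the number of chunks (one per chunk plus the final combine), which is the price paid for covering the entire document rather than only a few retrieved chunks.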