## 0. Installation and Setup

In [None]:
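# langchain alone is not enough for the cells below: the loaders, embeddings, vector store and SVM retriever need these extra packages.
! pip install langchain openai chromadb pypdf tiktoken beautifulsoup4 scikit-learn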

## 1. Load Data
In LangChain, we use document loaders to load our data. We import the loader that matches our data source from langchain.document_loaders:
1. Folder: DirectoryLoader
2. Azure: AzureBlobStorageContainerLoader
3. CSV file: CSVLoader
4. Google Drive: GoogleDriveLoader
5. Website: WebBaseLoader (UnstructuredHTMLLoader for local HTML files)
6. PDF: PyPDFLoader
7. YouTube: YoutubeLoader

For more data loaders, refer to the following link:
https://python.langchain.com/docs/modules/data_connection/document_loaders.html
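
To make the loader idea concrete, here is a minimal sketch in plain Python, with no LangChain dependency. It assumes a loader's job is to turn each source file into a record holding the text (`page_content`) and some `metadata`. The `SimpleDocument` class and `load_text_directory` function are hypothetical names used for illustration only, not LangChain APIs.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_directory(folder):
    """Read every .txt file in `folder` into one SimpleDocument each."""
    docs = []
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        docs.append(SimpleDocument(text, {"source": path.name}))
    return docs


# Demo on a temporary folder containing two small text files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first file", encoding="utf-8")
    Path(tmp, "b.txt").write_text("second file", encoding="utf-8")
    demo_docs = load_text_directory(tmp)

for doc in demo_docs:
    print(doc.metadata["source"], "->", doc.page_content)
```

Keeping the source in the metadata is what later lets an answer be traced back to the file it came from.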

In [None]:
# Take a PDF as an example. This is helpful if we download documents directly from a company website.

from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("example_data/layout-parser-paper.pdf")
pages = loader.load_and_split()


# We can also host our original data on GitHub and load it as a website.

from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://name.github.io/folder_name/document_name")
data = loader.load()

## 2. Split the data
Once we have loaded the documents, we need to transform them to better suit our application. The simplest example is splitting a long document into smaller chunks that fit into our model's context window. The most common splitters in LangChain include:

1. RecursiveCharacterTextSplitter()
2. CharacterTextSplitter()

The parameters of the above splitters:
 - length_function: how the length of chunks is calculated. Defaults to just counting number of characters, but it's pretty common to pass a token counter here.
 - chunk_size: the maximum size of your chunks (as measured by the length function).
 - chunk_overlap: the maximum overlap between chunks. It can be nice to have some overlap to maintain some continuity between chunks (e.g. do a sliding window).
 - add_start_index: whether to include the starting position of each chunk within the original document in the metadata.
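
The interaction between chunk_size, chunk_overlap and the start index can be sketched with a plain character window. This is only an illustration under simplifying assumptions: it measures length in characters and cuts at fixed offsets, whereas RecursiveCharacterTextSplitter prefers to cut at separators such as paragraph and line breaks. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap=0):
    """Cut `text` into character windows of at most `chunk_size`,
    where consecutive windows share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        # Record the start offset, similar in spirit to add_start_index.
        chunks.append({"text": text[start:start + chunk_size], "start_index": start})
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


overlapping = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
disjoint = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=0)
print([c["text"] for c in overlapping])
print([c["text"] for c in disjoint])
```

With an overlap of 1, the last character of each chunk is repeated at the start of the next one, which is the continuity the chunk_overlap parameter is meant to provide.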

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)

## 3. Vectorstores
Similarity search operates on vectors rather than on raw characters, so we need to map the text data into a vector space (an embedding). Several vector databases are already available, such as ChromaDB, Milvus, and pgvector.

Before we load the data into a vector database, we need an embedding model. The Embeddings class is designed for interfacing with text embedding models, and there are many embedding model providers (OpenAI, Cohere, Hugging Face, etc.).

https://python.langchain.com/en/latest/modules/indexes/vectorstores.html
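
As a rough intuition for what an embedding gives us, the sketch below maps texts to word-count vectors over a tiny fixed vocabulary and compares them with cosine similarity. This is a toy stand-in, not how OpenAI or any real embedding model works; `toy_embed` and the vocabulary are assumptions made up for this example.

```python
import math
from collections import Counter


def toy_embed(text, vocabulary):
    """Toy embedding: count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


vocabulary = ["cat", "dog", "pet", "stock", "market"]
v_pets = toy_embed("my cat and my dog are each a pet", vocabulary)
v_finance = toy_embed("the stock market fell", vocabulary)
v_query = toy_embed("which pet is a cat", vocabulary)

print(cosine_similarity(v_query, v_pets))
print(cosine_similarity(v_query, v_finance))
```

The query shares words with the pet sentence and none with the finance sentence, so the first score is higher. Real embedding models capture meaning beyond exact word overlap, but the comparison step works the same way.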

In [None]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Initialize OpenAI's embeddings object
embeddings = OpenAIEmbeddings()  # API key needed (e.g. via the OPENAI_API_KEY environment variable)

# Compute the embedding vector of each chunk with the OpenAI embeddings object and store it in a temporary Chroma vector database for later similarity queries.

vectorstore = Chroma.from_documents(documents=all_splits, embedding=embeddings)

## 4. Retrieve
Retrieve the splits relevant to a question using similarity search. There are several ways to do retrieval; a vector store combined with similarity_search is the most common. We can also use the SVM retriever.
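
Conceptually, similarity search ranks the stored chunk vectors against the query vector and returns the best matches. The sketch below assumes cosine similarity as the score and uses small hand-written vectors; the metric a real vector store applies depends on its configuration, and `top_k` is a hypothetical helper, not a LangChain function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, stored, k=2):
    """Return the texts of the k stored (text, vector) pairs most similar to the query."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Hand-written vectors standing in for embedded chunks.
stored = [
    ("chunk about pricing", [0.9, 0.1, 0.0]),
    ("chunk about refunds", [0.1, 0.9, 0.1]),
    ("chunk about shipping", [0.0, 0.2, 0.9]),
]
query_vector = [0.8, 0.3, 0.0]

results = top_k(query_vector, stored, k=2)
print(results)
```

A linear scan like this is fine for a handful of chunks; vector databases exist because they can index the vectors and avoid comparing the query against every stored item.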

In [None]:
question = "Our question here"

# Vector store + similarity_search
docs = vectorstore.similarity_search(question)


# SVM Retriever
from langchain.retrievers import SVMRetriever

svm_retriever = SVMRetriever.from_documents(all_splits, OpenAIEmbeddings())
docs_svm = svm_retriever.get_relevant_documents(question)

## 5. Generate Answer
The key class in this part is RetrievalQA. We pass our model, retriever, and prompt to RetrievalQA.from_chain_type to create a Q&A chain.

For details on RetrievalQA, refer to
https://api.python.langchain.com/en/latest/chains/langchain.chains.retrieval_qa.base.RetrievalQA.html
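
The role of the prompt can be pictured as plain string filling: the retrieved chunks are joined into one context string and substituted, together with the question, into the template. The sketch below assumes this simple concatenation strategy; `build_prompt` is a hypothetical helper and the chain's real internals may differ.

```python
template = """Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
Helpful Answer:"""


def build_prompt(template, retrieved_chunks, question):
    """Join the retrieved chunks and fill the {context} and {question} slots."""
    context = "\n\n".join(retrieved_chunks)
    return template.format(context=context, question=question)


prompt_text = build_prompt(
    template,
    ["Chunk one.", "Chunk two."],
    "What do the chunks say?",
)
print(prompt_text)
```

Because every retrieved chunk ends up in the prompt, the chunk size chosen during splitting and the number of retrieved chunks together determine how much of the model's context window is consumed.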

In [None]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Use an answer template as the prompt fed into the model
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)


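# Load our model (ChatOpenAI has to be imported first; an OpenAI API key is needed)
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)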

# Create the Q&A chain from the model, the retriever, and the prompt
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore.as_retriever(),
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

# Feed our question and get the answer.
result = qa_chain({"query": question})
result["result"]


Reference:
https://python.langchain.com/docs/use_cases/question_answering/#step-4-retrieve