# Overview

Question answering over documents consists of four steps:
1. Create an index
2. Create a Retriever from that index
3. Create a question answering chain
4. Ask questions!

# 1 Create an index
Okay, so what's actually going on? How is this index getting created? Most of the magic is hidden inside `VectorstoreIndexCreator`.

After the documents are loaded, `VectorstoreIndexCreator` performs three main steps:
1. Splitting documents into chunks
2. Creating embeddings for each chunk
3. Storing documents and embeddings in a vectorstore
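
The three steps above can be sketched in plain Python. This is a toy stand-in, not what `VectorstoreIndexCreator` actually runs: the helper names (`split_text`, `embed`, `build_index`), the character-window splitter, and the bag-of-words "embedding" are all assumptions made for illustration. The real pipeline uses a text splitter, an embedding model, and Chroma.

```python
import re
from collections import Counter


def split_text(text, chunk_size=40, overlap=10):
    """Step 1: cut text into overlapping character windows (hypothetical splitter)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


def embed(text):
    """Step 2: toy embedding, a word-count vector instead of a learned one."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(documents, chunk_size=40, overlap=10):
    """Step 3: store every chunk next to its vector and its source path."""
    index = []
    for source, text in documents.items():
        for chunk in split_text(text, chunk_size, overlap):
            index.append({"source": source, "text": chunk, "vector": embed(chunk)})
    return index


# Hypothetical one-note corpus, keyed by file path
toy_index = build_index({"notes/example.md": "vector space " * 5})
print(len(toy_index), toy_index[0]["source"])
```

Keeping the source path on every entry is what later allows an answer to be reported together with the files it came from.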

In [1]:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.document_loaders import DirectoryLoader
from langchain.indexes import VectorstoreIndexCreator
from pathlib import Path

loader = DirectoryLoader("Obsidian_DB/", glob="**/*.md", show_progress=True)
docs = loader.load()
print(f"len of docs {len(docs)}")

100%|██████████| 419/419 [00:03<00:00, 120.18it/s]

len of docs 419





In [2]:
# Each Document object has two fields: page_content and metadata
dict(docs[0])

{'page_content': 'start_dt: 2022-05-31\nend_dt:\ntags: project/life\n\nProject Zero\n\nLinks:\n\n1 Overview\n\nBoost personal health (physical and sexual) through a trail of exercise and routines\n\n2 Routines\n\nEat 3 dark chocolates every day with lunch\n\nCold shower\n\n3 Fitness\n\n| Day of Week | Plan     |\n| ----------- | -------- |\n| Mon         | Workout  |\n| Tue         | Run      |\n| Wed         | Rest     |\n| Thu         | Workout  |\n| Fri         | Rest     |\n| Sat         | Workout  |\n| Sun         | Long run |\n\n4 Log\n\nFeb 28 Tur shoulder and chest\nWarmup\nMachine declined shoulder press lvl 10 6x3\nmachine shoulder press lvl 6 6x3\nPull ups 5x3\nRunning 1.5 miles\n\nFeb 15 Wed\nShoulder press 30ib 6 x3\ncore training v shape 10x 3\nkettle swinging 10 x 3\n\nJan 19 thur\nBench press DB 35ib\nBar shoulder lift 5ib each side\nCycle gear 14 for 45 min (hr 150)\n\nJan 17 Tue\nRun 30 min\nGoal: run 3 hours every week\n\nJan 14  (w/ Mymy and Julien)\nSlope 7-9 speed

In [8]:
# By default, LangChain uses Chroma as the vectorstore to index and search embeddings.
index = VectorstoreIndexCreator().from_loaders([loader])  # returns a `VectorStoreIndexWrapper`

# Show which vector store implementation backs the index
print(index.vectorstore)
# Show the retriever, which defines how relevant chunks are looked up
print(index.vectorstore.as_retriever())

100%|██████████| 419/419 [00:02<00:00, 147.25it/s]


<langchain.vectorstores.chroma.Chroma object at 0x7f826be583d0>
vectorstore=<langchain.vectorstores.chroma.Chroma object at 0x7f826be583d0> search_type='similarity' search_kwargs={}


# 2 Create a Retriever from index

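This step is handled inside `query_with_sources`, which turns the vectorstore into a retriever (the `as_retriever()` call printed above), so no retriever has to be created by hand.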
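
To make `search_type='similarity'` concrete, here is a minimal sketch of a similarity retriever. It is not Chroma's implementation: the bag-of-words vectors, the `cosine` and `retrieve` helpers, and the three note paths are hypothetical and exist only for this example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a real setup would call an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, index, k=2):
    """Return the k stored entries most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(index, key=lambda e: cosine(query_vector, e["vector"]), reverse=True)
    return ranked[:k]


# Hypothetical notes, keyed by made-up file paths
notes = {
    "notes/vector_space.md": "A vector space is a set of vectors closed under addition.",
    "notes/bleu.md": "BLEU score evaluates machine translation with n-gram precision.",
    "notes/wer.md": "Word error rate evaluates speech recognition models.",
}
index_entries = [{"source": s, "text": t, "vector": embed(t)} for s, t in notes.items()]

top = retrieve("what is a vector space", index_entries, k=1)
print(top[0]["source"])
```

The retriever only ranks and returns stored chunks. Generating an answer from them is the job of the question answering chain in the next step.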


# 3 Create a question answering chain

`query_with_sources` builds a `RetrievalQAWithSourcesChain` for this step.
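
Roughly, a question answering chain with sources puts the retrieved chunks and their source paths into one prompt, asks the model to answer, and separates the cited sources from the answer text. The sketch below shows that idea under loud assumptions: the prompt wording, the `SOURCES:` marker convention, and the `build_prompt`, `answer_with_sources`, and `fake_llm` names are invented here and are not LangChain's actual prompt or code.

```python
import re


def build_prompt(question, retrieved):
    """Place every retrieved chunk and its source into a single prompt."""
    parts = [f"Content: {d['text']}\nSource: {d['source']}" for d in retrieved]
    context = "\n\n".join(parts)
    return (
        "Answer the question using only the extracts below, "
        "then list the sources you used after 'SOURCES:'.\n\n"
        f"{context}\n\nQUESTION: {question}\nFINAL ANSWER:"
    )


def answer_with_sources(question, retrieved, llm):
    """Call the model and split its reply into answer text and sources."""
    raw = llm(build_prompt(question, retrieved))
    answer, _, sources = raw.partition("SOURCES:")
    return {"question": question, "answer": answer.strip(), "sources": sources.strip()}


def fake_llm(prompt):
    """Stand-in for a real model call: cites the first source found in the prompt."""
    source = re.search(r"Source: (.+)", prompt).group(1)
    return f"A canned answer.\nSOURCES: {source}"


# Hypothetical retrieved chunk
retrieved = [{"text": "A vector space is a set of vectors.", "source": "notes/vector_space.md"}]
result = answer_with_sources("What is a vector space?", retrieved, fake_llm)
print(result)
```

The returned dictionary has the same three keys (`question`, `answer`, `sources`) as the outputs shown in the next section.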

# 4 Ask questions!

In [11]:
# If no llm is passed, query_with_sources falls back to LangChain's default OpenAI model (text-davinci-003); here it is passed explicitly with temperature=0
index.query_with_sources("What's a vector storage?", llm=OpenAI(temperature=0))

{'question': "What's a vector storage?",
 'answer': ' A vector storage is a set of vectors closed under addition and scalar multiplication.\n',
 'sources': 'Obsidian_DB/02-SlipBox/Vector Space.md'}

In [17]:
index.query_with_sources("What's the difference between word error rate (WER) and BLEU score?")

{'question': "What's the difference between word error rate (WER) and BLEU score?",
 'answer': ' Word Error Rate (WER) is a metric used to evaluate speech recognition models, while BLEU Score is a metric used to evaluate machine translation models. WER is computed using the Levenshtein Distance Algorithm, while BLEU Score is computed using a brevity penalty and n-gram precision.\n',
 'sources': 'Obsidian_DB/02-SlipBox/WER.md, Obsidian_DB/02-SlipBox/BLEU Score.md, Obsidian_DB/02-SlipBox/Levenshtein Distance Algorithm.md, Obsidian_DB/02-SlipBox/ML Model Evaluation.md'}

In [12]:
from langchain_visualizer.jupyter import visualize

async def qa_with_docs():
    return index.query_with_sources("what's NLP?")

visualize(qa_with_docs)

2023-06-01 18:14.29.330768 [info     ] Trace: http://0.0.0.0:8935/traces/01H1WGDKPJ541YG98G6SNTAAFB
2023-06-01 18:14.29.387791 [info     ] Starting server, set OUGHT_ICE_AUTO_SERVER=0 to disable.
2023-06-01 18:14.29.825481 [info     ] Server started! Run `python -m ice.server stop` to stop it.
Rendering http://127.0.0.1:8935/traces/01H1WGDKPJ541YG98G6SNTAAFB in notebook


[1m{[0m
    [32m'question'[0m: [32m"what's NLP?"[0m,
    [32m'answer'[0m: [32m' Natural Language Processing [0m[32m([0m[32mNLP[0m[32m)[0m[32m is a process where we learn statistics of language using different means [0m[32m([0m[32mfrequentist counting method, Bayesian statistics, neural language models[0m[32m)[0m[32m.\n'[0m,
    [32m'sources'[0m: [32m'Obsidian_DB/02-SlipBox/NLP.md'[0m
[1m}[0m
