# Document Q&A using LangChain

This notebook shows how to use LangChain's Writer integration to answer questions based on the contents of a set of documents. It accomplishes roughly the same result as the File Q&A example in this repository, but more simply, by relying on LangChain's built-in functionality.

### Dependencies

Make sure you have a virtual environment selected if you don't want to install these globally.

In [80]:
%pip install langchain chromadb python-dotenv pypdf

Note: you may need to restart the kernel to use updated packages.


In [81]:
from langchain.llms import Writer
from langchain.chains.question_answering import load_qa_chain
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import TokenTextSplitter
from langchain.vectorstores import Chroma
from dotenv import load_dotenv
import json

### Extracting document data

The first step is to extract all of the text from our documents. This example only handles PDFs, but it can easily be extended to handle plaintext, markdown, etc.

In [82]:
from pypdf import PdfReader
import os

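def extract_text_from_file(directory: str, filename: str) -> str:
    """Return the text of a PDF file, or "" for unsupported file types."""
    _, ext = os.path.splitext(filename)
    text = ""  # defined up front so non-PDF files return "" instead of raising

    if ext.lower() == ".pdf":
        # The context manager closes the file even if parsing fails.
        with open(os.path.join(directory, filename), "rb") as file:
            reader = PdfReader(file)
            for page in reader.pages:
                text += page.extract_text() or ""

    return text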
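
As a rough illustration of how the extraction step could be extended beyond PDFs, here is a minimal sketch for plaintext and markdown files. The function name `extract_plain_text` and the `PLAIN_TEXT_SUFFIXES` set are hypothetical, not part of the original notebook, and UTF-8 is assumed as the file encoding.

```python
from pathlib import Path

# Suffixes treated as plain text. This set is an assumption; extend as needed.
PLAIN_TEXT_SUFFIXES = {".txt", ".md"}


def extract_plain_text(path: str) -> str:
    """Return the contents of a plaintext or markdown file, or "" if unsupported."""
    p = Path(path)
    if p.suffix.lower() not in PLAIN_TEXT_SUFFIXES:
        return ""
    # errors="replace" substitutes undecodable bytes instead of raising.
    return p.read_text(encoding="utf-8", errors="replace")
```

A dispatcher could call this for `.txt` and `.md` files and fall back to the PDF path otherwise.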

Next, we need to split our text into chunks and find the ones most relevant to the query. In the File Q&A example we did this manually, but LangChain has built-in functionality that makes this easier:

In [83]:
text_splitter = TokenTextSplitter(encoding_name="cl100k_base", chunk_size=150, chunk_overlap=50)
embeddings = SentenceTransformerEmbeddings()

def get_relevant_chunks(text: str, query: str):
    chunks = text_splitter.split_text(text)

    docsearch = Chroma.from_texts(chunks, embeddings).as_retriever()

    relevant_chunks = docsearch.get_relevant_documents(query)
    return relevant_chunks
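
To make the `chunk_size=150, chunk_overlap=50` settings above more concrete, the following sketch shows the general sliding-window idea on a list of token ids. It illustrates the concept only and is not LangChain's actual implementation; `split_with_overlap` is a hypothetical helper, and plain integers stand in for real `cl100k_base` token ids.

```python
def split_with_overlap(tokens, chunk_size=150, chunk_overlap=50):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # each window starts this far after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


tokens = list(range(400))  # stand-in for real token ids
chunks = split_with_overlap(tokens)
print([len(c) for c in chunks])
```

With these settings each window starts 100 tokens after the previous one, so consecutive chunks share 50 tokens. The overlap reduces the chance that a sentence relevant to the query is cut in half at a chunk boundary.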

### Getting an answer

Now all we need to do is feed our query and the relevant chunks to a Writer model. We do this by first creating a QA chain of type "stuff", which places all of the relevant chunks into a single prompt, and giving it a Writer instance, then running that chain.

In [84]:
FILE_DIR = "documents"
load_dotenv()
org_id = os.environ.get("WRITER_ORG_ID")
model_id = 'palmyra-instruct'

chain = load_qa_chain(
    Writer(
        base_url=f"https://enterprise-api.writer.com/llm/organization/{org_id}/model/{model_id}/completions", 
        tokens_to_generate=500
    ), 
    chain_type="stuff"
)

def run_query(query: str):
    context = ""
    for filename in os.listdir(FILE_DIR):
        context += extract_text_from_file(FILE_DIR, filename)

    relevant_chunks = get_relevant_chunks(context, query)

    answer = chain.run(input_documents=relevant_chunks, question=query)
    return json.loads(answer)["choices"][0]["text"]
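
For intuition about what a "stuff" chain does, the sketch below concatenates every retrieved chunk into one prompt ahead of the question. The helper `build_stuff_prompt` and the template wording are hypothetical; LangChain's real prompt template differs, and this only shows the general shape.

```python
def build_stuff_prompt(chunks, question):
    """Concatenate every chunk into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    ["Chunk one text.", "Chunk two text."],  # placeholder chunks
    "How do I enable wifi?",
)
print(prompt)
```

Because everything goes into a single prompt, this approach only works while the combined chunks fit in the model's context window, which is one reason to keep the chunks small.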

And then just ask it a question:

In [86]:
run_query("How do I enable wifi?")

invalid pdf header: b'\n%PDF'
Using embedded DuckDB without persistence: data will be transient


' You can enable wifi by choosing "Turn AirPort on" from the AirPort'
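
`run_query` indexes directly into `json.loads(answer)["choices"][0]["text"]`, which raises an unhelpful error if the response has a different shape. A more defensive variant might look like the sketch below. The function name `extract_completion_text` is hypothetical, and the sample payload is invented purely to mirror the shape that the code above assumes.

```python
import json


def extract_completion_text(raw: str) -> str:
    """Pull choices[0]["text"] out of a JSON completion payload, with clear errors."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc

    choices = payload.get("choices") if isinstance(payload, dict) else None
    if not isinstance(choices, list) or not choices:
        raise ValueError("response contains no choices")

    first = choices[0]
    if not isinstance(first, dict) or "text" not in first:
        raise ValueError("first choice has no text field")
    # Strip the leading whitespace that completions often start with.
    return first["text"].strip()


# Hypothetical payload, shaped like what run_query expects.
sample = '{"choices": [{"text": " You can enable wifi"}]}'
print(extract_completion_text(sample))
```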