# Document Q&A using Langchain

This notebook shows how to use Langchain's Writer integration to answer questions based on the contents of a set of documents. It accomplishes roughly the same result as the File Q&A example in this repository, with the added simplicity of Langchain's built-in functionality.

### Dependencies

Make sure you have a virtual environment selected if you don't want to install these globally.

In [1]:
%pip install -q --disable-pip-version-check\
    langchain chromadb setuptools tiktoken python-dotenv pypdf

Note: you may need to restart the kernel to use updated packages.


The Writer Langchain integration looks for two environment variables: `WRITER_ORG_ID` and `WRITER_API_KEY`. If you're running this notebook, make sure you have a `.env` file in the parent directory that looks like this:
```
WRITER_ORG_ID=<your org ID>
WRITER_API_KEY=<your API key>
```
and run the following cell. Any other method of setting environment variables works too, as long as both variables end up set.

If you would rather not use environment variables at all, you can instead pass these values to the Langchain `Writer` object when constructing it.

In [8]:
from dotenv import load_dotenv
load_dotenv()

True
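Once the variables are loaded, a quick check can confirm that both are actually present before any model call is made. The sketch below is only an illustration: `missing_env_vars` and `REQUIRED_VARS` are hypothetical names introduced here and are not part of Langchain or python-dotenv.

```python
import os
from typing import Iterable, List, Mapping

# The two variables the text above says the Writer integration looks for.
REQUIRED_VARS = ("WRITER_ORG_ID", "WRITER_API_KEY")


def missing_env_vars(
    names: Iterable[str] = REQUIRED_VARS,
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the names that are unset or empty in `env` (hypothetical helper)."""
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing a plain dictionary as `env` makes the helper easy to try without touching the real environment.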

In [3]:
from langchain.llms import Writer
from langchain.chains.question_answering import load_qa_chain
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import TokenTextSplitter
from langchain.vectorstores import Chroma
import json

### Extracting document data

The first step is to extract all of the written information from our documents. This example only handles PDFs, but it can easily be extended to handle plaintext, markdown, etc.

In [4]:
from pypdf import PdfReader
import os

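def extract_text_from_file(directory: str, filename: str) -> str:
    """Return the text of a supported file, or an empty string otherwise."""
    _, ext = os.path.splitext(filename)
    text = ""  # stays empty for unsupported extensions

    if ext.lower() == ".pdf":
        # The context manager closes the file even if parsing fails.
        with open(os.path.join(directory, filename), "rb") as file:
            reader = PdfReader(file)
            for page in reader.pages:
                text += page.extract_text()

    return text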
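As a rough idea of how the extraction step could be extended beyond PDFs, the sketch below reads plaintext and markdown files directly. The function name `extract_plain_text` and the set `PLAIN_TEXT_EXTENSIONS` are assumptions made for illustration, as is the choice to treat these files as UTF-8.

```python
from pathlib import Path

# Assumption: files with these extensions can be read as UTF-8 text as-is.
PLAIN_TEXT_EXTENSIONS = {".txt", ".md"}


def extract_plain_text(directory: str, filename: str) -> str:
    """Return the contents of a plaintext or markdown file, else an empty string."""
    path = Path(directory) / filename
    if path.suffix.lower() not in PLAIN_TEXT_EXTENSIONS:
        return ""
    # errors="replace" keeps going if a file contains bytes that are not valid UTF-8.
    return path.read_text(encoding="utf-8", errors="replace")
```

A combined extractor could try the PDF branch first and fall back to a helper like this one for other extensions.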

Next, we need to split our text into chunks and find the ones most relevant to the query. In the File Q&A example we did this manually, but Langchain has built-in functionality that makes this easier:

In [5]:
text_splitter = TokenTextSplitter(encoding_name="cl100k_base", chunk_size=150, chunk_overlap=50)
embeddings = SentenceTransformerEmbeddings()

def get_relevant_chunks(text: str, query: str):
    chunks = text_splitter.split_text(text)

    docsearch = Chroma.from_texts(chunks, embeddings).as_retriever()

    relevant_chunks = docsearch.get_relevant_documents(query)
    return relevant_chunks
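To make the `chunk_size=150, chunk_overlap=50` settings more concrete, here is a minimal sketch of overlapping windows over a token list. It is not Langchain's implementation: `split_with_overlap` is a hypothetical helper, and any list of strings (for example whitespace-separated words) stands in for the `cl100k_base` tokens the real splitter uses.

```python
from typing import List


def split_with_overlap(
    tokens: List[str], chunk_size: int = 150, chunk_overlap: int = 50
) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # Each new window starts this many tokens after the previous one.
    stride = chunk_size - chunk_overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With these defaults each window begins 100 tokens after the previous one, so the last 50 tokens of one chunk reappear at the start of the next. The overlap reduces the chance that a sentence relevant to the query is cut in half at a chunk boundary.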


### Getting an answer

Now all we need to do is feed our query and the relevant chunks to a Writer model. We do this by first creating a QA chain of type "stuff", which places all of the relevant chunks into a single prompt, giving it a Writer instance, and then running that chain.

In [6]:
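FILE_DIR = "documents"
# Note: org_id and model_id are defined here but not passed to Writer below;
# the integration reads WRITER_ORG_ID from the environment on its own.
org_id = os.environ.get("WRITER_ORG_ID")
model_id = "palmyra-instruct"

chain = load_qa_chain(
    Writer(
        max_tokens=500,
        temperature=0,
    ),
    chain_type="stuff",
)

# Concatenate the text of every file in FILE_DIR into one string.
context = ""
for filename in os.listdir(FILE_DIR):
    context += extract_text_from_file(FILE_DIR, filename)

def run_query(query: str):
    relevant_chunks = get_relevant_chunks(context, query)

    answer = chain.run(input_documents=relevant_chunks, question=query)
    # The chain returns the raw JSON response as a string, so parse out the text.
    return json.loads(answer)["choices"][0]["text"]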

invalid pdf header: b'\n%PDF'
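Conceptually, a "stuff" chain concatenates all of the retrieved chunks into one prompt together with the question. The sketch below illustrates that idea only: `build_stuffed_prompt` is a hypothetical helper, and the template wording is an assumption, not the prompt Langchain actually sends.

```python
from typing import List


def build_stuffed_prompt(chunks: List[str], question: str) -> str:
    """Combine every chunk and the question into a single prompt string."""
    # Separate chunks with blank lines so their boundaries stay visible.
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


print(build_stuffed_prompt(["First chunk.", "Second chunk."], "How do I enable wifi?"))
```

Because everything goes into one prompt, this approach only works while the combined chunks fit in the model's context window, which is one reason the chunks are kept small.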


And then just ask it a question:

In [7]:
run_query("How do I enable wifi?")

Using embedded DuckDB without persistence: data will be transient


' Choose "Turn AirPort on" from the AirPort (Z) status menu in the menu bar. AirPort will then detect available wireless networks.'
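The answer above begins with a space because `run_query` returns the model text unmodified. A slightly more defensive way to pull the text out of the response might look like the sketch below. `extract_answer_text` is a hypothetical helper, and the sample payload only mirrors the `choices[0]["text"]` shape that `run_query` indexes into; it is not a recorded API response.

```python
import json


def extract_answer_text(raw_response: str) -> str:
    """Parse a JSON response string and return the first choice's text, trimmed."""
    payload = json.loads(raw_response)
    choices = payload.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0].get("text", "").strip()


# Made-up payload with the same shape that run_query expects.
sample = json.dumps({"choices": [{"text": " Example answer. "}]})
print(extract_answer_text(sample))
```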