# File Q&A
This notebook shows you how to create an application that answers your questions based on information in a group of files. This can be useful if you have hundreds of pages of information that you don't want to search through yourself. Because matching is based on semantic similarity rather than keywords, your question doesn't need to use the same wording as your documents.

At a high level, this application will:
* Split your source material into relatively small chunks of text
* Assign [embeddings](https://txt.cohere.com/text-embeddings/) to each chunk
* Compare your query to each chunk and assign a relevance score
* Give your query along with the most relevant chunks of info to a Writer model to summarize it into an understandable answer
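
To make the scoring step concrete, here is a minimal sketch that ranks a few toy chunks against a query. It assumes a deliberately crude stand-in for embeddings (word counts), so unlike the real embeddings used later it only matches shared words, not meaning. The names `bag_of_words`, `cosine_similarity`, and `rank_chunks` are hypothetical and are not used elsewhere in this notebook.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts (an assumption for illustration)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(query: str, chunks: list[str]) -> list[tuple[float, str]]:
    """Return (score, chunk) pairs, most relevant first."""
    query_vec = bag_of_words(query)
    scored = [(cosine_similarity(query_vec, bag_of_words(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


toy_chunks = [
    "open system preferences and click network",
    "the battery charges while the laptop is plugged in",
    "select a wifi network from the list",
]
ranking = rank_chunks("join a wifi network", toy_chunks)
for score, chunk in ranking:
    print(f"{score:.2f}  {chunk}")
```

The chunk about selecting a wifi network scores highest because it shares the most words with the query. A real embedding model replaces `bag_of_words` so that differently worded but related text also scores well.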

In [144]:
import tiktoken
import os

from writer import Writer
from writer.models import operations, shared
from sentence_transformers import SentenceTransformer, util
from PyPDF2 import PdfReader

First, we have to give the library our Writer API credentials. Make sure you have a file named `.env` in the parent directory containing the lines
```
WRITER_ORG_ID=YOUR_ORG_ID
WRITER_API_KEY=YOUR_API_KEY
```
or simply set the variables in the cell below.

In [145]:
from dotenv import load_dotenv

# load_dotenv expects the path to the .env file itself, not to its directory
load_dotenv("../.env")
org_id = os.environ.get("WRITER_ORG_ID")
api_key = os.environ.get("WRITER_API_KEY")

writer = Writer(
    security=shared.Security(
        api_key=api_key
    ),
    organization_id=org_id
)
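
If either variable is missing, `os.environ.get` silently returns `None` and the failure only shows up later, when the API call is made. One possible way to fail earlier is a small helper like the sketch below. `require_env` is a hypothetical name, not part of the Writer SDK or `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to .env or export it first")
    return value
```

It could be used in place of the plain lookups above, for example `api_key = require_env("WRITER_API_KEY")`.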

To get started, we need a function to extract text from a file. This example only handles PDFs, but it can easily be extended to other formats.

In [146]:
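def extract_text_from_file(dir: str, filename: str):
    _, ext = os.path.splitext(filename)
    # Default to an empty string so unsupported file types are skipped instead of raising an error
    text = ""

    if ext.lower() == ".pdf":
        # The context manager closes the file even if parsing fails
        with open(os.path.join(dir, filename), "rb") as file:
            reader = PdfReader(file)
            for page in reader.pages:
                # extract_text() can come back empty for pages without a text layer
                text += page.extract_text() or ""

    return text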
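
As one way the extraction step might be extended, the sketch below dispatches on the file extension through a small registry. The registry `EXTRACTORS` and the helpers `read_plain_text` and `extract_text` are hypothetical names introduced for illustration. Only plain-text formats are shown, and a PDF handler built on `PdfReader` could be registered the same way.

```python
import os
from typing import Callable


def read_plain_text(path: str) -> str:
    """Read a UTF-8 text file, replacing undecodable bytes."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()


# Hypothetical registry: file extension -> function that turns a path into text.
EXTRACTORS: dict[str, Callable[[str], str]] = {
    ".txt": read_plain_text,
    ".md": read_plain_text,
}


def extract_text(path: str) -> str:
    """Dispatch on the file extension; unsupported types yield an empty string."""
    _, ext = os.path.splitext(path)
    extractor = EXTRACTORS.get(ext.lower())
    return extractor(path) if extractor else ""
```

Supporting a new format then only requires adding one entry to the registry, rather than another `if` branch.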

Next, we need a function to split the extracted text into chunks. For now, a chunk size of 50 tokens should work, but tuning it for your documents could yield better results.

In [147]:
CHUNK_TOKENS = 50

def chunk_text(text: str):
    tokenizer = tiktoken.get_encoding("cl100k_base")
    # We can tokenize the text to count the number of tokens for chunking
    tokens = tokenizer.encode(text)

    chunks = [
        tokenizer.decode(tokens[i : i + CHUNK_TOKENS])
        for i in range(0, len(tokens), CHUNK_TOKENS)
    ]
    
    return chunks
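
Fixed-size chunks can cut a sentence in half. A common variation, sketched here as an assumption rather than as what this notebook does, is to let consecutive chunks overlap by a few tokens. The hypothetical `chunk_with_overlap` works on any list of tokens, such as the output of `tokenizer.encode(text)`, and each window could then be passed to `tokenizer.decode`.

```python
def chunk_with_overlap(tokens: list, size: int, overlap: int) -> list[list]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start : start + size])
        # Stop once a window reaches the end, so no redundant tail window is produced
        if start + size >= len(tokens):
            break
    return windows


windows = chunk_with_overlap(list(range(10)), size=4, overlap=1)
print(windows)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

More overlap preserves more context across boundaries, at the cost of more chunks to embed.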

Now we need to determine which chunks of text are actually relevant to the question. To do this, we use `all-MiniLM-L6-v2` (available on Hugging Face [here](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)), a sentence-embedding model suited to measuring sentence similarity. It's relatively lightweight to run locally, but the direct calls can be replaced with API calls if you wish. Other models with similar abilities could be used instead.

In this function, `TOP_N` is the number of top-scoring chunks to return. A higher number gives the model more context, but some of it may be irrelevant, so a balance is needed.

In [148]:
TOP_N = 10

def get_relevant_text(query: str, chunks: list[str]):
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    text_embeddings = model.encode(chunks)
    query_embeddings = model.encode(query)

    cos_sims = util.cos_sim(query_embeddings, text_embeddings)

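    # Sort (score, index) pairs in ascending order, so the most relevant chunks come last
    sorted_chunk_nums = sorted(zip(cos_sims.tolist()[0], range(len(chunks))))
    # We also take the chunk before and after each relevant chunk.
    # Since the text is split arbitrarily, this is just in case some important context got cut off.
    # The slice start is clamped at 0 and slicing stops at the end of the list,
    # so the first and last chunks neither wrap around nor raise an IndexError.
    relevant_text = "\n".join(
        "".join(chunks[max(i - 1, 0) : i + 2])
        for _, i in sorted_chunk_nums[-TOP_N:]
    )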

    return relevant_text
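
When two top-scoring chunks sit next to each other, taking the neighbors of each one repeats text in the context. A possible refinement, sketched below under the assumption that document order matters more than relevance order, is to collect the neighbor indices into a set first. `expand_with_neighbors` is a hypothetical helper, not part of `sentence_transformers`.

```python
def expand_with_neighbors(indices: list[int], n_chunks: int, radius: int = 1) -> list[int]:
    """Return the sorted, de-duplicated chunk indices within `radius` of any selected index."""
    selected = set()
    for i in indices:
        lo = max(0, i - radius)             # clamp at the first chunk
        hi = min(n_chunks - 1, i + radius)  # clamp at the last chunk
        selected.update(range(lo, hi + 1))
    return sorted(selected)


picked = expand_with_neighbors([0, 4, 5], n_chunks=6)
print(picked)  # [0, 1, 3, 4, 5]
```

The resulting indices could then be joined with `"".join(chunks[i] for i in picked)`, which keeps each chunk once and in its original order.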

This function will send your question and relevant chunks along with an instructional prompt to Writer's `palmyra-instruct` model for summarization.

In [149]:
def get_answer(query: str, text: str):
    prompt = (
        # Each piece ends with a space or newline so adjacent sentences do not run together
        "Given the following context information, try to answer the following question with as much detail as possible. "
        "Use only the given context. Do not use outside information.\n"
        "If no answer can be found in the given context, output \"Could not find an answer in your files.\"\n"
        f"Question: {query} \n\nContext:\n{text}\nAnswer: "
    )

    req = operations.CreateCompletionRequest(
        completion_request=shared.CompletionRequest(
            prompt=prompt, max_tokens=2000, temperature=1.1
        ),
        model_id="palmyra-instruct",
    )

    res = writer.completions.create(req)
    if res.completion_response is not None:
        return res.completion_response.choices[0].text
    else:
        print(res.fail_response)

Finally, we create a function to combine everything:

In [150]:
FILE_DIR = "files"

def run_query(query: str):
    context = ""
    for filename in os.listdir(FILE_DIR):
        context += extract_text_from_file(FILE_DIR, filename)

    chunks = chunk_text(context)

    relevant_text = get_relevant_text(query, chunks)

    answer = get_answer(query, relevant_text)
    return answer

Now we can ask it a question! For this example, I put an old MacBook Pro user manual in the `files` directory.

In [151]:
answer = run_query("How do I add a wifi network step by step?")
print(answer)

incorrect startxref pointer(1)


 To add a wifi network step by step, you should:
1. Choose Apple ( ) > System Preferences and click Network. 
2. Select Wifi in the list of the interfaces and click Create Network. 
3. Provide a Network Name and optionally a Password, select the type of encryption you wish to use and press OK.
4. Make sure the Airport is ON and click Advanced. 
5. Configure any settings and click OK to save the changes and reconnect to the wifi network.
