# Gain valuable corporate insights by questioning a 10-k report in 5 easy steps with dense passage retrieval (DPR) and extractive question answering (eQA) using deepset.ai's Haystack

In this article, we help the sales account executive Ben extract useful knowledge from corporate annual report PDFs with the help of deep learning, namely dense passage retrieval (DPR) and extractive question answering (eQA).

![hf](images/dset.png)

### The Setup: A Sales Account Executive's Approach to Cold Calling

Ben works for a big digital corporation as an executive account manager. With thirty big accounts a year, he is very careful about whom he engages with, and he makes sure he knows everything about them. Therefore, he reads corporate annual reports ([10-k](https://www.investopedia.com/terms/1/10-k.asp)) and looks for certain aspects such as:

'Does the target company make digital experiences strategic initiatives?'

If they do, then he has a reason to contact someone in that target company. Yet there are 10 people who are relevant to the purchase of his offer. That makes every approach a risk, so Ben believes in being very well prepared before cold calling them.

Mediums such as mail and LinkedIn messages don't really work for him, since he deems them too impersonal and considers his target managers too busy to respond. _Hence_, he told me that he would really benefit from a tool that could find key insights for him in these reports by answering the questions he cares about.

### Questions Ben would ask are: 

- Do they mention digital experiences in the report?
- What are the strategic priorities of the target company?
- Is the target company growing?
- Does the target company have a positive cashflow? 
- Are there aspects in the report that I have not considered yet?

Let us find answers to them!

### The business benefit of DPR and eQA

For the purpose of this tutorial, I downloaded five arbitrary 10-k business reports as PDFs (Apple, Broadcom, Pfizer, Salesforce, Walmart) to test our setup. These sample reports and all accompanying code can be viewed in my [GitHub repo](https://github.com/seduerr91/dpr_eqa.git).

Our five business reports total 483 pages at approximately 450 words per page, which comes to around 220,000 words. Assuming an average reader manages 300 words per minute, reading all of them takes about 12 hours.

Yet, an implementation as presented in this article can cut the time to identify & extract key findings for Ben dramatically.
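The reading-time estimate above is simple arithmetic and can be checked in a few lines. The page count, words per page, and reading speed are the rough figures quoted in the text, not measurements.

```python
# Back-of-the-envelope reading time for the five reports.
pages = 483             # total pages across the five 10-k reports
words_per_page = 450    # rough average, as assumed in the text
words_per_minute = 300  # assumed reading speed

total_words = pages * words_per_page
reading_hours = total_words / words_per_minute / 60

print(f"{total_words:,} words, roughly {reading_hours:.1f} hours of reading")
# prints: 217,350 words, roughly 12.1 hours of reading
```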

### The Answers we will achieve

Here is what the results for Ben's question 'What is surprising?' look like for the Pfizer annual report:

![pfizer-answers](images/result.png)


With these answers, Ben can immediately start a conversation with the CMO of Pfizer and show this manager how his digital experience product could help Pfizer market its drug-pricing initiatives to states.

### Implementing a tool to auto-extract knowledge from 10-k annual reports


In the following, we take five steps to help Ben get answers to his questions with deepset.ai's Haystack framework.
In case there are any issues with installing Haystack, there are some instructions at the end of this article.

![process](images/process.png)

__In Step 1__: We load the 10-k reports as PDFs and convert them to text.

__In Step 2__: We preprocess the texts by cleaning them.

__In Step 3__: We set up the retriever with our requirements for 10-k reports.

__In Step 4__: We set up the reader with our requirements for 10-k reports.

__In Step 5__: We use both the retriever and the reader to query the 10-k report of Pfizer.

First, we have to import our dependencies.

In [None]:
from haystack.utils import convert_files_to_docs, export_answers_to_csv
from haystack.nodes import FARMReader, DensePassageRetriever, PreProcessor
from haystack.document_stores import FAISSDocumentStore
from haystack.pipelines import ExtractiveQAPipeline
import pandas as pd

![1](images/process/1.png)

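Second, we convert our PDF reports to text. The function 'convert_files_to_docs' reads .pdf, .txt, or .docx files and parses them to text. Honestly, this is so cool: we can read in strictly formatted business reports with a single call. Note that for PDFs with an embedded text layer, like these reports, this is text extraction (via the pdftotext tool from the appendix) rather than true OCR; OCR should only be needed for scanned documents. Kudos to deepset.ai for this!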

In [None]:
# Here we only use the Pfizer sample report; to process the other reports, change "report_sample" to "reports"
all_docs = convert_files_to_docs(dir_path="report_sample/")

![2](images/process/2.png)

Next, we preprocess the text of the docs with the 'PreProcessor', setting a few parameters that should improve the results further down the line. The parameters have the following purposes:

- clean_empty_lines: Removes surplus empty lines where more than two occur in a row.
- clean_whitespace: Removes leading and trailing whitespace from each line of the text.
- clean_header_footer: Uses a heuristic to remove repeated headers and footers from a document.
- split_by: The unit by which a document is split. This can be word, sentence or passage.
- split_length: The maximum number of split_by units per resulting document (here, at most 150 words).
- split_respect_sentence_boundary: Avoids splitting in the middle of a sentence.

In [None]:
# Setting our parameters for the preprocessing
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    split_by="word",
    split_length=150,
    split_respect_sentence_boundary=True,
)
# Actually preprocessing
preprocessed_docs = preprocessor.process(all_docs)
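To make the effect of split_by, split_length and split_respect_sentence_boundary more tangible, here is a simplified, standalone sketch of word-based splitting that keeps sentences intact. It is not Haystack's implementation: the function name `split_by_words` and the naive sentence regex are assumptions made purely for illustration.

```python
import re

def split_by_words(text, split_length=150):
    """Greedily pack whole sentences into chunks of at most split_length words.

    A single sentence longer than split_length still becomes its own
    (oversized) chunk, because we never cut inside a sentence.
    """
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n_words = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the limit.
        if current and count + n_words > split_length:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks

sample = "One two three. Four five. Six seven eight nine."
for chunk in split_by_words(sample, split_length=5):
    print(chunk)
```

With a limit of five words, the first two sentences fit into one chunk and the third starts a new one.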

In [None]:
# Instantiate the document store
document_store = FAISSDocumentStore(faiss_index_factory_str="Flat")
# Save all preprocessed documents to the document store
document_store.write_documents(preprocessed_docs)

![3](images/process/3.png)

At this point, we have our preprocessed documents in the document store. Next, we need a way to fetch the passages that are relevant to a question before they are analyzed by the deep neural net. Retrievers narrow down the scope for the reader by selecting a small set of candidate passages.

This 'narrowing down' can be done with

- _Sparse retrievers_: algorithms based on counting the occurrences of words (bag-of-words) or
- _Dense retrievers_: neural network models that create "dense" embedding vectors (embeddings are vector representations in which texts with similar meaning end up with similar vectors).

We'll use a dense retriever in what follows to get more sophisticated results.
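The difference between the two retriever families can be illustrated with a toy example. The sketch below scores two passages against one question, once by word overlap and once by the dot product of embedding vectors (the similarity DPR is trained with). The function names and the three-dimensional vectors are made up for illustration; a real DPR encoder produces 768-dimensional vectors from the text itself.

```python
from collections import Counter

def sparse_score(query, passage):
    """Bag-of-words overlap: how many query words also occur in the passage."""
    q_counts = Counter(query.lower().split())
    p_counts = Counter(passage.lower().split())
    return sum(min(count, p_counts[word]) for word, count in q_counts.items())

def dense_score(query_vec, passage_vec):
    """Dot product of two embedding vectors."""
    return sum(q * p for q, p in zip(query_vec, passage_vec))

query = "is the company growing"
passages = {
    "a": "revenue increased strongly this year",
    "b": "the company is located in new york",
}

# Hand-made toy embeddings (assumption): passage "a" points in a similar
# direction to the query, passage "b" does not.
query_vec = [0.9, 0.1, 0.0]
passage_vecs = {"a": [0.8, 0.2, 0.1], "b": [0.1, 0.9, 0.3]}

for key, text in passages.items():
    print(key, sparse_score(query, text), round(dense_score(query_vec, passage_vecs[key]), 2))
```

The sparse score prefers passage "b" because it shares three words with the question, although only passage "a" is about growth. With embeddings that capture meaning, the dense score ranks "a" first.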

In [None]:
# We set up our retriever with the preferred parameters.
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    max_seq_len_query=64,
    max_seq_len_passage=256,
    batch_size=16,
    use_gpu=True,
    embed_title=True,
    use_fast_tokenizers=True,
)
# We calculate the embeddings for all of our documents in the document store.
document_store.update_embeddings(retriever)

![4](images/process/4.png)

Next we need a Reader. A reader scans the texts given by retrievers in detail and extracts the best answers. Readers are based on deep learning models.

In [None]:
# After having initialized our retriever, we initialize our reader
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=False)
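How a reader arrives at an answer can be sketched without any model. Extractive QA models of the SQuAD style used here typically assign every token a start score and an end score and return the span with the best combined score. The function `best_span`, the example sentence, and the scores below are made up for illustration; a real reader computes the scores with a transformer and handles many more details.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end, score) of the best span with start <= end."""
    best = None
    for i, start in enumerate(start_scores):
        # Only consider spans of at most max_answer_len tokens.
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            score = start + end_scores[j]
            if best is None or score > best[2]:
                best = (i, j, score)
    return best

tokens = "The company grew revenue by ten percent last year".split()
# Made-up per-token scores for the question "How much did revenue grow?"
start_scores = [0.1, 0.0, 0.2, 0.1, 0.3, 3.0, 0.5, 0.0, 0.1]
end_scores = [0.0, 0.1, 0.1, 0.0, 0.2, 0.4, 2.5, 0.1, 0.6]

start, end, score = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)  # prints: ten percent
```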

![5](images/process/5.png)

Finally, we put our building blocks together with a _Haystack Pipeline_.
We use an _ExtractiveQAPipeline_ that combines a retriever and a reader to answer our questions. Further pipelines can be found [here](https://haystack.deepset.ai/docs/latest/pipelinesmd).

In [None]:
pipe = ExtractiveQAPipeline(reader, retriever)

We are almost done! Now, we bring in Ben's questions, so that our reader can use them to search in our documents.

In [None]:
questions = ['Are there efforts for digital experiences?', 'What are the strategic priorities?',
"What is the company's growth?",'What is the cash level?', 'What is suprising?']

We can configure how many candidates the retriever and the reader return for our questions. A higher top_k for the retriever gives the reader more passages to search, which tends to improve the answers but also makes the pipeline slower.

In [None]:
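for question in questions:
    prediction = pipe.run(query=question, params={"Retriever": {"top_k": 4}, "Reader": {"top_k": 4}})
    # Name each file after the full question (minus the "?"). The first 12 characters alone are
    # not unique: two of our questions start with "What is the " and would overwrite each other.
    export_answers_to_csv(output_file="answers/" + question.rstrip("?") + ".csv", agg_results=prediction)

Now let us take a look at the answers to the question 'What is surprising?' on the business report of Pfizer.

In [None]:
# The file name follows the spelling used in the questions list above
sample_answers = pd.read_csv('answers/What is suprising.csv', sep=',')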
sample_answers.columns = ['query', 'Answer to: What is surprising?', 'Rank', 'prediction_context']
sample_answers[['Rank', 'Answer to: What is surprising?']].style.set_properties(**{'text-align': 'left'})
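The CSV export is convenient, but the prediction returned by `pipe.run` can also be inspected directly. The sketch below assumes Haystack 1.x, where `prediction["answers"]` is a list of answer objects with `.answer` and `.score` attributes. The helper name `confident_answers` and the 0.5 threshold are my own choices, and the stand-in objects only mimic that shape so that the snippet runs without a model.

```python
from types import SimpleNamespace

def confident_answers(prediction, min_score=0.5):
    """Return (answer, score) pairs whose reader score is at least min_score."""
    return [
        (ans.answer, ans.score)
        for ans in prediction["answers"]
        if ans.score is not None and ans.score >= min_score
    ]

# Stand-in for a pipeline result (assumed shape, placeholder values).
fake_prediction = {
    "query": "What is suprising?",
    "answers": [
        SimpleNamespace(answer="placeholder answer A", score=0.81),
        SimpleNamespace(answer="placeholder answer B", score=0.32),
    ],
}

for answer, score in confident_answers(fake_prediction):
    print(f"{score:.2f}  {answer}")
```

In the notebook, the same helper could be called on the real `prediction` inside the loop over the questions.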

### Interpretation of the results

In general, the current results are good. They are a starting point for finding specifics in a given text (or an extensive report), and the accompanying context lets a human make sense of them. Personally, I think the answers to 'What is surprising?' are exciting.
The reader is quite confident about the results for this question, as the confidence score is usually above 50%. Hence, the surprising insights do seem relevant.
It seems that narrow questions (such as those around digital initiatives) yield less confident results than broader questions, which makes sense. Let us look at that per question.

__Analysis per question__

- Are there efforts for digital experiences? The answers generally touched on digital aspects; for this question, I would also consider a plain keyword search.
- What are the strategic priorities? This question was answered well.
- What is the cash level? This question was tricky because Pfizer presents its business figures in table format. Since our neural net of choice in this article (roberta-base-squad2) was not trained on tables, the results are confusing. Yet, as we explore in the next section, there are deep neural nets that are optimized for such tables.
- What is surprising? As noted above, the reader was quite confident here, and I found these answers the most exciting.

__Avenues for improvements of deep neural answers__

Deepset.ai provides a plethora of useful techniques that can be used to further improve these results: 

- Use a reader model that is trained or, at least, fine-tuned on 10-k, 10-q, financial reports. An example for that can be found in [tutorial 2 by deepset.ai](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial2_Finetune_a_model_on_your_data.ipynb).
- Use models that are fine-tuned on tabular data like Google Tapas (an example for that can be found in [tutorial 15 by deepset.ai](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial15_TableQA.ipynb)) since business reports have a lot of tables.
- Improve the phrasing of the questions, either by paraphrasing them in different ways or by using a deep neural net that helps with phrasing questions (e.g., via question generation in [tutorial 13 by deepset.ai](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial13_Question_generation.ipynb)).
- Use different retrieval algorithms (sparse vs. dense), as done in [tutorial 1 (sparse) by deepset.ai](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial1_Basic_QA_Pipeline.ipynb) and [tutorial 6 (dense) by deepset.ai](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_DPR.ipynb).

__Avenues for further development of such a tool to make it accessible for Ben__

- Deploy this as a micro-service to a cloud platform & expose it as an API.
- Build a front end to interact with the API in the backend.
- Integrate a data flywheel to continuously fine-tune the neural net with qualified data inputs.
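As a rough idea of what the first two points could look like, here is a minimal sketch of a JSON endpoint built only with Python's standard library. Everything in it is an assumption for illustration: the names `handle_payload`, `make_handler` and `serve`, the request format, and the `answer_fn` callback that would wrap the pipeline. A production service would rather use a proper web framework and add authentication and error handling.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def handle_payload(raw_body, answer_fn):
    """Turn a raw JSON request body into an (HTTP status, response dict) pair."""
    try:
        question = json.loads(raw_body)["question"]
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "expected a JSON object with a 'question' field"}
    return 200, {"question": question, "answers": answer_fn(question)}

def make_handler(answer_fn):
    """Build a request handler class that answers POSTed questions."""
    class QAHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            status, body = handle_payload(self.rfile.read(length), answer_fn)
            data = json.dumps(body).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)
    return QAHandler

def serve(answer_fn, host="127.0.0.1", port=8000):
    """Block and serve POST requests until interrupted."""
    HTTPServer((host, port), make_handler(answer_fn)).serve_forever()

# The request logic can be tried without starting a server:
print(handle_payload(b'{"question": "Is the company growing?"}', lambda q: ["(answers go here)"]))
```

To connect this to the notebook, `answer_fn` could be a small function that calls `pipe.run` and returns the answer strings as a list, and `serve(answer_fn)` would then start the service locally.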

###  Summary 

Based on the business need of a senior sales account executive, we used the Haystack framework from deepset.ai to query annual financial reports of S&P 500 companies with natural-language questions.
The results of DPR are pretty impressive. While they give a first indication and scale massively, there are still aspects to keep in mind:

- Judging by the answers, the user has to be trained in how to ask questions in order to get useful results.
- Special care must be taken when defining the context of a question to ensure that the results are relevant.

Since I cannot fully identify what is most useful to the manager Ben, I am not going to refine my 'question phrasing skills' any further in this tutorial. Still, I am amazed by the results.

## Thank you deepset.ai


Thank you @deepset.ai for wonderful tutorials that I used as a base for developing this business use case.

[Deepset.ai](deepset.ai) is an AI startup from Berlin. They are 'building a semantic layer for the modern tech stack — driven by the latest NLP and open source.' 
I was happy to go through their tutorials, since I really like how clearly they are written.


## About the author: Seb

I am Seb, an NLP enthusiast who believes that deepset, spaCy and Hugging Face are shaping the future of how people communicate in writing.

## Appendix

Below are instructions to run this code locally and to fix the bugs that occurred on my machine.

In [None]:
## Dependencies to install on a Mac OS M1X chip.

## In order to run this script on my local machine, I had to download / install the following dependencies.

## Download & install xpdf tools (pdftotext) to extract text from PDF documents
# !wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-mac-4.03.tar.gz
# !tar -xvf xpdf-tools-mac-4.03.tar.gz && sudo cp xpdf-tools-mac-4.03/bin64/pdftotext /usr/local/bin

## Download & install ocr (optical character recognition to read in pdfs with Python)
# !pip install 'farm-haystack[ocr]' -q

## Download & install FAISS (i.e., Facebook AI Similarity Search, a library that allows developers to quickly search for embeddings of multimedia documents that are similar to each other.)
# !pip install 'farm-haystack[faiss]' -q

## Reset the FAISS document store by deleting it if it throws an error. FAISS is a library for efficient similarity search and clustering of dense vectors.
# !rm faiss_document_store.db

## Silence the noisy parallelism warning of the tokenizers library
# import os
# os.environ["TOKENIZERS_PARALLELISM"] = "false"

## configuration to have nicer printing in pandas
# pd.set_option('display.max_colwidth', None)