# Notebook [2]: Using the PDF converter



This notebook shows how to use the PDF converter to create an input dataframe for the cdQA pipeline from a directory of PDF files.
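
For orientation, the converter output shown later in this notebook has one row per PDF, with a `title` column and a `paragraphs` column. The sketch below builds such a frame by hand and checks it. It assumes that each `paragraphs` cell is a list of strings. `check_input_frame` is a hypothetical helper written for this illustration, not part of cdQA, and the sample strings are placeholders.

```python
import pandas as pd


def check_input_frame(df):
    """Hypothetical helper: verify the two columns this notebook relies on."""
    missing = {"title", "paragraphs"} - set(df.columns)
    if missing:
        raise ValueError("missing columns: {}".format(sorted(missing)))
    # Every cell in `paragraphs` is assumed to be a list of strings.
    return all(
        isinstance(cell, list) and all(isinstance(s, str) for s in cell)
        for cell in df["paragraphs"]
    )


# Hand-built stand-in for the converter output; the values are placeholders.
example_df = pd.DataFrame(
    {
        "title": ["doc-a", "doc-b"],
        "paragraphs": [
            ["First paragraph of doc-a.", "Second paragraph of doc-a."],
            ["Only paragraph of doc-b."],
        ],
    }
)

is_valid = check_input_frame(example_df)
print(is_valid)
```

A check like this can be run on any hand-made frame before passing it to the retriever, so that a malformed column is caught early.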


***Note:*** *To run this notebook you will need access to a GPU. If you are using Colab, install `cdQA` by executing `!pip install cdqa` in a cell.*

In [1]:
!pip install cdqa

Collecting cdqa
  Downloading https://files.pythonhosted.org/packages/39/f5/af831b7ee653aa6bace99e39ec6b2754b1adb10bb60a1296f5e16f1f24ee/cdqa-1.3.9.tar.gz (45kB)
     |████████████████████████████████| 51kB 2.7MB/s
Collecting flask_cors==3.0.8
  Downloading https://files.pythonhosted.org/packages/78/38/e68b11daa5d613e3a91e4bf3da76c94ac9ee0d9cd515af9c1ab80d36f709/Flask_Cors-3.0.8-py2.py3-none-any.whl
Collecting joblib==0.13.2
  Downloading https://files.pythonhosted.org/packages/cd/c1/50a758e8247561e58cb87305b1e90b171b8c767b15b12a1734001f41d356/joblib-0.13.2-py2.py3-none-any.whl (278kB)
     |████████████████████████████████| 286kB 8.1MB/s
Collecting pandas==0.25.0
  Downloading https:/

In [2]:
import os
import pandas as pd
from ast import literal_eval

from cdqa.utils.converters import pdf_converter
from cdqa.utils.filters import filter_paragraphs
from cdqa.pipeline import QAPipeline
from cdqa.utils.download import download_model



### Download pre-trained reader model and PDF files

In [3]:
# Download model
download_model(model='bert-squad_1.1', dir='./models')


Downloading trained model...


In [4]:
# Download PDF files from BNP Paribas public news
def download_pdf():
    import wget  # third-party package, only needed for this download step

    directory = './data/pdf/'
    pdf_urls = [
      'https://invest.bnpparibas.com/documents/1q19-pr-12648',
      'https://invest.bnpparibas.com/documents/4q18-pr-18000',
      'https://invest.bnpparibas.com/documents/4q17-pr'
    ]

    print('\nDownloading PDF files...')

    # Create the target directory if it does not exist yet
    os.makedirs(directory, exist_ok=True)
    for url in pdf_urls:
        wget.download(url=url, out=directory)

download_pdf()


Downloading PDF files...


### Convert the PDF files into a DataFrame for cdQA pipeline

In [5]:
df = pdf_converter(directory_path='./data/pdf/')
df.head()

2019-12-27 19:11:04,422 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar to /tmp/tika-server.jar.
2019-12-27 19:11:04,971 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar.md5 to /tmp/tika-server.jar.md5.
2019-12-27 19:11:05,359 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...


| | title | paragraphs |
|---|---|---|
| 0 | 4q17-pr | [2017 FULL YEAR RESULTS PRESS RELEASE Paris,... |
| 1 | 4q18-pr2 | [2018 FULL YEAR RESULTS PRESS RELEASE Paris,... |
| 2 | 1q19-pr-12648 | [FIRST QUARTER 2019 RESULTS PRESS RELEASE Pa... |
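
Before fitting the retriever, it can help to see how many paragraphs each PDF produced. The sketch below is one way to do that with plain pandas, assuming each `paragraphs` cell is a list. It runs on a small stand-in frame (`df_example`, with placeholder values) so that it works without the PDFs; the same two lines could be applied to `df`.

```python
import pandas as pd

# Stand-in frame with the same two columns as the converter output;
# the titles and paragraph strings are placeholders.
df_example = pd.DataFrame(
    {
        "title": ["doc-a", "doc-b"],
        "paragraphs": [["p1", "p2", "p3"], ["p1"]],
    }
)

# One row per paragraph (DataFrame.explode requires pandas >= 0.25).
per_paragraph = df_example.explode("paragraphs").reset_index(drop=True)

# Paragraph count per document, useful for spotting PDFs that yielded little text.
counts = per_paragraph.groupby("title")["paragraphs"].count()
print(counts)
```

A document with a very low count may point to a PDF whose text could not be extracted cleanly.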


### Instantiate the cdQA pipeline from a pre-trained reader model

In [6]:
cdqa_pipeline = QAPipeline(reader='./models/bert_qa.joblib', max_df=1.0)

# Fit Retriever to documents
cdqa_pipeline.fit_retriever(df=df)

100%|██████████| 231508/231508 [00:00<00:00, 1245724.61B/s]


QAPipeline(reader=BertQA(adam_epsilon=1e-08, bert_model='bert-base-uncased',
                         do_lower_case=True, fp16=False,
                         gradient_accumulation_steps=1, learning_rate=5e-05,
                         local_rank=-1, loss_scale=0, max_answer_length=30,
                         n_best_size=20, no_cuda=False,
                         null_score_diff_threshold=0.0, num_train_epochs=3.0,
                         output_dir=None, predict_batch_size=8, seed=42,
                         server_ip='', server_po..._size=8,
                         verbose_logging=False, version_2_with_negative=False,
                         warmup_proportion=0.1, warmup_steps=0),
           retrieve_by_doc=False,
           retriever=BM25Retriever(b=0.75, floor=None, k1=2.0, lowercase=True,
                                   max_df=1.0, min_df=2, ngram_range=(1, 2),
                                   preprocessor=None, stop_words='english',
                                   t
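
The output above shows a `BM25Retriever` with `k1=2.0` and `b=0.75`. To give an intuition for what those two parameters control, here is a minimal sketch of the textbook Okapi BM25 score on whitespace tokens. It is not cdQA's implementation: the function `bm25_scores`, the smoothed IDF variant, and the toy documents are all assumptions made for illustration, and it ignores the n-gram, stop-word and `min_df` settings listed in the output.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=2.0, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs_tokens for term in set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Smoothed IDF; the "1 +" inside the log keeps it positive.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # k1 caps the effect of repeated terms, b scales length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy documents, made up for this sketch.
docs = [
    "contracts sold as at 31 march 2019".split(),
    "the retriever ranks paragraphs by term overlap".split(),
    "a reader model extracts the answer span".split(),
]
scores = bm25_scores("contracts sold".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

In this formulation a larger `k1` lets repeated query terms keep adding to the score for longer, while `b` closer to 1 penalizes long paragraphs more strongly.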

### Execute a query

In [7]:
query = 'How many contracts did BNP Paribas Cardif sell in 2019?'
prediction = cdqa_pipeline.predict(query)

### Explore predictions

In [8]:
print('query: {}'.format(query))
print('answer: {}'.format(prediction[0]))
print('title: {}'.format(prediction[1]))
print('paragraph: {}'.format(prediction[2]))

query: How many contracts did BNP Paribas Cardif sell in 2019?
answer: 140,000
title: 1q19-pr-12648
paragraph: 3 Excluding PEL/CEL effects of +2 million euros compared to +1 million euros in the first quarter 2018    5 RESULTS AS AT 31 MARCH 2019   The new property and casualty offering launched in May 2018 as part of the partnership between  BNP Paribas Cardif and Matmut (Cardif IARD) recorded good growth with already  140,000 contracts sold as at 31 March 2019.   The business is accelerating individual customers’ mobile uses and developing self-care features with the roll-out of the conversational chatbots Telmi in the Mes comptes BNP Paribas app and Helloïz at Hello bank!.  
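
The cell above reads the prediction by position. A small sketch of unpacking it by name and showing the answer in context follows. It uses a stand-in tuple rather than a live pipeline call: the answer and title are copied from the output above, the paragraph is shortened, and the assumption is that the first three elements are answer, title and paragraph, as the indexing above suggests.

```python
# Stand-in for the value returned by cdqa_pipeline.predict(query); answer and
# title are copied from the output above, the paragraph is shortened.
prediction = (
    "140,000",
    "1q19-pr-12648",
    "... recorded good growth with already 140,000 contracts sold as at 31 March 2019.",
)

# Unpack the first three fields by name instead of indexing by position.
answer, title, paragraph = prediction[:3]

# Locate the answer inside its paragraph and keep a little surrounding context.
start = paragraph.find(answer)
if start != -1:
    context = paragraph[max(0, start - 30): start + len(answer) + 30]
else:
    context = ""

print("answer: {}".format(answer))
print("title: {}".format(title))
print("context: {}".format(context))
```

Slicing with `[:3]` keeps the unpacking valid even if the returned tuple carries additional elements.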
