# Notebook [1]: First steps with cdQA

This notebook shows how to use the `cdQA` pipeline to perform question answering on a custom dataset.

***Note:*** *If you are using Colab, install `cdQA` first by running `!pip install cdqa` in a cell.*

In [1]:
!pip install cdqa

import os
import pandas as pd
from ast import literal_eval

from cdqa.utils.filters import filter_paragraphs
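from cdqa.pipeline.cdqa_sklearn import QAPipeline

### Download pre-trained reader model and example dataset

The cells below assume that the pre-trained reader is saved at `./models/bert_qa_vCPU-sklearn.joblib` and the example dataset at `./data/bnpp_newsroom_v1.1/bnpp_newsroom-v1.1.csv`.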

### Read a sample text file

In [2]:
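# Build a one-document dataset: one title and one list of paragraphs per document
data = {'title': [], 'paragraphs': []}
with open('test_emails/test_email_1.txt', 'r') as f:  # close the file after reading
    txt = f.read()
data['title'].append('test_email')
data['paragraphs'].append(txt.split("\n"))  # one paragraph per line
data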

{'title': ['test_email'],
 'paragraphs': [['Hi,',
   '',
   'Please do an transfer of $15000 from checking 1243 as follows:',
   '',
   'Bank:\t\t\tChase Bank',
   'Routing:\t\t0123445466',
   'Account #:\t\t987654412',
   'Account Name: Sample company LLC',
   'Ref: \t\t\tInvoice 6754',
   '',
   'Thank you',
   'Tameka',
   '',
   'John Sam',
   'Executive Assistant',
   'SSC corp']]}
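
Splitting on `"\n"` keeps the blank lines and the tab padding visible in the output above. A minimal sketch of one way to tidy the paragraphs before indexing, using only the standard library. The helper name `clean_paragraphs` is hypothetical and not part of cdQA, and the sample string is a shortened stand-in for the email file; whether this cleaning helps depends on the retriever and reader in use.

```python
import re


def clean_paragraphs(text):
    """Split text into lines, collapse whitespace runs, and drop empty lines.

    Hypothetical helper for illustration; not part of cdQA.
    """
    paragraphs = []
    for line in text.split("\n"):
        # Collapse tabs and repeated spaces into a single space.
        line = re.sub(r"\s+", " ", line).strip()
        if line:
            paragraphs.append(line)
    return paragraphs


# Shortened stand-in for the contents of the email file.
sample = "Hi,\n\nBank:\t\t\tChase Bank\nRouting:\t\t0123445466\n"
cleaned = clean_paragraphs(sample)
print(cleaned)
```

The result could be appended to `data['paragraphs']` in place of `txt.split("\n")`.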

In [3]:
# Convert the dict to a DataFrame (before any paragraph filtering)
test_df = pd.DataFrame.from_dict(data)
test_df.iloc[0]['paragraphs']

['Hi,',
 '',
 'Please do an transfer of $15000 from checking 1243 as follows:',
 '',
 'Bank:\t\t\tChase Bank',
 'Routing:\t\t0123445466',
 'Account #:\t\t987654412',
 'Account Name: Sample company LLC',
 'Ref: \t\t\tInvoice 6754',
 '',
 'Thank you',
 'Tameka',
 '',
 'John Sam',
 'Executive Assistant',
 'SSC corp']

In [4]:
# df after filtering (step left disabled in this run; uncomment to apply it)
# test_df = filter_paragraphs(test_df)
# test_df
# test_df.iloc[0]['paragraphs']
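
The filtering step is left commented out in this run. As a rough illustration of what a length-based paragraph filter does to a document like the email above, here is a plain-Python sketch. The function name `filter_by_length` and the `min_length` and `max_length` values are assumptions chosen for illustration; the sketch does not reproduce the behaviour of `filter_paragraphs`.

```python
def filter_by_length(paragraphs, min_length=5, max_length=300):
    """Keep paragraphs whose stripped character count is within the bounds.

    Hypothetical helper for illustration; not cdQA's filter_paragraphs.
    """
    return [p for p in paragraphs if min_length <= len(p.strip()) <= max_length]


# A few lines taken from the email shown earlier.
email_lines = ['Hi,', '', 'Routing:\t\t0123445466', 'Thank you', 'Tameka']
kept = filter_by_length(email_lines, min_length=10)
print(kept)
```

A threshold this strict also removes short lines, so it is worth inspecting what survives before fitting the retriever.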

In [5]:
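# The CSV stores each paragraph list as a string; literal_eval parses it back into a list
other_df = pd.read_csv('./data/bnpp_newsroom_v1.1/bnpp_newsroom-v1.1.csv', converters={'paragraphs': literal_eval})
# Keep the first four articles and only the columns shared with test_df
other_df = other_df.head(4)[['title', 'paragraphs']]
# Stack the email on top of the articles and renumber the index
df = pd.concat([test_df, other_df], ignore_index=True)
df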

| | title | paragraphs |
|---|---|---|
| 0 | test_email | [Hi,, , Please do an transfer of $15000 from c... |
| 1 | BNP Paribas at #VivaTech : discover the progra... | [From may 16, 2019 to may 18, 2019, VivaTechno... |
| 2 | The banking jobs : Assistant Vice President – ... | [When Artificial Intelligence participates in ... |
| 3 | BNP Paribas at #VivaTech : discover the progra... | [From may 16, 2019 to may 18, 2019, VivaTechno... |
| 4 | "The bank with an IT budget of more than EUR6 ... | [Nordic region: an opportunity for Europe?, In... |
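
The `converters={'paragraphs': literal_eval}` argument is needed because a CSV file stores each list of paragraphs as its string representation. A small standard-library sketch of that round trip, using an in-memory CSV with made-up content in place of the BNP Paribas file:

```python
import csv
import io
from ast import literal_eval

# Write one row whose 'paragraphs' field is a Python list rendered as text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "paragraphs"])
writer.writeheader()
writer.writerow({
    "title": "demo",
    "paragraphs": str(["first paragraph", "second, with a comma"]),
})

# Read it back: the field is a plain string until literal_eval parses it.
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row["paragraphs"]
parsed = literal_eval(raw)

print(type(raw).__name__, type(parsed).__name__, len(parsed))
```

Without the converter, each `paragraphs` cell would reach the pipeline as one long string instead of a list of paragraphs.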


### Instantiate the cdQA pipeline from a pre-trained CPU reader

In [6]:
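# Load the pre-trained reader from disk
cdqa_pipeline = QAPipeline(reader='./models/bert_qa_vCPU-sklearn.joblib')
# Fit the retriever on the combined documents (the email plus four articles)
cdqa_pipeline.fit_retriever(X=df)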

QAPipeline(reader=BertQA(bert_model='bert-base-uncased', do_lower_case=True,
                         fp16=False, gradient_accumulation_steps=1,
                         learning_rate=3e-05, local_rank=-1, loss_scale=0,
                         max_answer_length=30, n_best_size=20, no_cuda=False,
                         null_score_diff_threshold=0.0, num_train_epochs=2,
                         output_dir=None, predict_batch_size=8, seed=42,
                         server_ip='', server_port='', train_batch_size=8,
                         verbose_logging=False, version_2_with_negative=False,
                         warmup_proportion=0.1))

### Execute a query

In [7]:
query = "what is routing?"
prediction = cdqa_pipeline.predict(X=query)
print('query: {}'.format(query))
print('answer: {}'.format(prediction[0]))
print('title: {}'.format(prediction[1]))
print('paragraph: {}'.format(prediction[2]))

3it [00:00, 859.08it/s]


query: what is routing?
answer: 0123445466
title: test_email
paragraph: Routing:		0123445466


In [8]:
query = "what is the bank name?"
prediction = cdqa_pipeline.predict(X=query)
print('query: {}'.format(query))
print('answer: {}'.format(prediction[0]))
print('title: {}'.format(prediction[1]))
print('paragraph: {}'.format(prediction[2]))

3it [00:00, 1832.64it/s]


query: what is the bank name?
answer: Chase Bank
title: test_email
paragraph: Bank:			Chase Bank
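
Both query cells repeat the same four print statements. A small helper could remove that duplication. This sketch assumes, as in the cells above, that the prediction is indexable with the answer, title, and paragraph at positions 0, 1, and 2. The name `format_prediction` is hypothetical, and a hard-coded tuple stands in for the pipeline output so the example runs without a model.

```python
def format_prediction(query, prediction):
    """Return the query and the first three prediction fields as a text report.

    Hypothetical helper; assumes prediction[0], prediction[1], prediction[2]
    hold the answer, the document title, and the source paragraph.
    """
    answer, title, paragraph = prediction[0], prediction[1], prediction[2]
    return "\n".join([
        "query: {}".format(query),
        "answer: {}".format(answer),
        "title: {}".format(title),
        "paragraph: {}".format(paragraph),
    ])


# Stand-in for the pipeline output, copied from the first query above.
fake_prediction = ("0123445466", "test_email", "Routing:\t\t0123445466")
report = format_prediction("what is routing?", fake_prediction)
print(report)
```

With a fitted pipeline, the call would be `print(format_prediction(query, cdqa_pipeline.predict(X=query)))`.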
