# Notebook [1]: First steps with cdQA

This notebook shows how to use the `cdQA` pipeline to perform question answering on a custom dataset.

***Note:*** *If you are using Colab, install `cdQA` first by running `!pip install cdqa` in a cell.*

In [16]:
import os
import pandas as pd
import numpy as np
from nltk import tokenize
from ast import literal_eval

from cdqa.utils.filters import filter_paragraphs
from cdqa.pipeline import QAPipeline

### Download pre-trained reader model

In [2]:
from cdqa.utils.download import download_model

# Fetch the BERT reader fine-tuned on SQuAD 1.1 and store it under ./models
download_model(model='bert-squad_1.1', dir='./models')


Downloading trained model...


### Load and inspect the dataset

In [8]:
df = pd.read_csv('m6l1_transcripts.csv')
df.head()

| Unnamed: 0 | title | paragraphs |
|---|---|---|
| 0 | Intro | Analytics is something every business requires... |
| 1 | Ecosystem | When we look at the marketing ecosystem, today... |
| 2 | Data Collection | Now, the ecosystem of data is no doubt impress... |
| 3 | Layering in Data | One of my first jobs out of high school was wo... |
| 4 | Analytics Process | In the world of business and in marketing, we ... |


In [18]:
# Split each document into a list of sentences, so every `paragraphs` cell holds a list of strings
df['paragraphs'] = df['paragraphs'].apply(tokenize.sent_tokenize)
df.loc[0].paragraphs

['Analytics is something every business requires to stay competitive.',
 'The more information you can use to inform your actions, the better.',
 'Analytics and the techniques to properly use them support both strategic marketing decisions such as how much to spend or what a customer is worth, and more tactical campaign decisions, such as targeting the right customer with the right message at the right time.',
 "No matter what industry you're business is in, it's important to understand the information surrounding your market, your competition, and the journey your customers take.",
 "Marketing analytics benefits your entire organization and it's one of the most fundamental skills any marketer or business owner must hone.",
 "Now, I'm sure you're aware there's a lot of data available to us.",
 "So it's important that we really measure what matters.",
 'We have to go beyond getting those metrics, we must sort out how to take action with that data.',
 "We're going to spend a lot of time 

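After this step, each `paragraphs` cell holds a Python list rather than a string. If the tokenized table is later written to CSV, each list is stored as its string representation, which is presumably why `literal_eval` is imported at the top of the notebook even though no cell uses it. The sketch below illustrates that round trip with the standard library only; the two sentences are taken from the output above, and the in-memory buffer is an assumption standing in for a file on disk.

```python
import csv
import io
from ast import literal_eval

# One row shaped like the notebook's DataFrame: a title plus a list of sentences.
sentences = [
    "Analytics is something every business requires to stay competitive.",
    "The more information you can use to inform your actions, the better.",
]

# Writing to CSV stores the list as its string representation.
buffer = io.StringIO()  # stands in for a CSV file on disk
writer = csv.writer(buffer)
writer.writerow(["title", "paragraphs"])
writer.writerow(["Intro", str(sentences)])

# Reading it back yields a plain string, not a list.
buffer.seek(0)
row = next(csv.DictReader(buffer))
assert isinstance(row["paragraphs"], str)

# literal_eval safely parses the string back into a list of sentences.
restored = literal_eval(row["paragraphs"])
```

With pandas, the same conversion can be applied at load time through `pd.read_csv(path, converters={'paragraphs': literal_eval})`.
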
In [20]:
# Rejoin the sentences into one string (the '' separator adds no space between them)
''.join(df.loc[0].paragraphs)

"Analytics is something every business requires to stay competitive.The more information you can use to inform your actions, the better.Analytics and the techniques to properly use them support both strategic marketing decisions such as how much to spend or what a customer is worth, and more tactical campaign decisions, such as targeting the right customer with the right message at the right time.No matter what industry you're business is in, it's important to understand the information surrounding your market, your competition, and the journey your customers take.Marketing analytics benefits your entire organization and it's one of the most fundamental skills any marketer or business owner must hone.Now, I'm sure you're aware there's a lot of data available to us.So it's important that we really measure what matters.We have to go beyond getting those metrics, we must sort out how to take action with that data.We're going to spend a lot of time together talking about analytics, so let'

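The separator passed to `str.join` determines how the sentences are glued back together. A small sketch, using the first two sentences from the output above, compares an empty separator with a single space:

```python
sentences = [
    "Analytics is something every business requires to stay competitive.",
    "The more information you can use to inform your actions, the better.",
]

# An empty separator runs the sentences together ("competitive.The more ...").
glued = "".join(sentences)

# A single space keeps the sentence boundaries readable.
spaced = " ".join(sentences)

print(glued)
print(spaced)
```
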
### Instantiate the cdQA pipeline from a pre-trained reader model

In [21]:
cdqa_pipeline = QAPipeline(reader='./models/bert_qa.joblib')
cdqa_pipeline.fit_retriever(df=df)

100%|██████████| 231508/231508 [00:00<00:00, 978462.99B/s]


QAPipeline(reader=BertQA(adam_epsilon=1e-08, bert_model='bert-base-uncased',
                         do_lower_case=True, fp16=False,
                         gradient_accumulation_steps=1, learning_rate=5e-05,
                         local_rank=-1, loss_scale=0, max_answer_length=30,
                         n_best_size=20, no_cuda=False,
                         null_score_diff_threshold=0.0, num_train_epochs=3.0,
                         output_dir=None, predict_batch_size=8, seed=42,
                         server_ip='', server_po...size=8,
                         verbose_logging=False, version_2_with_negative=False,
                         warmup_proportion=0.1, warmup_steps=0),
           retrieve_by_doc=False,
           retriever=BM25Retriever(b=0.75, floor=None, k1=2.0, lowercase=True,
                                   max_df=0.85, min_df=2, ngram_range=(1, 2),
                                   preprocessor=None, stop_words='english',
                                   t
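
The representation above shows that the pipeline pairs the BERT reader with a `BM25Retriever` configured with `k1=2.0` and `b=0.75`. As a rough illustration of what those two parameters control, here is a minimal sketch of the textbook Okapi BM25 score in plain Python. It is not cdQA's implementation: the whitespace tokenizer, the smoothed IDF variant, and the names `tokenize_simple` and `bm25_scores` are assumptions made for this example, and it ignores the retriever's n-gram, stop-word, and document-frequency settings.

```python
import math
from collections import Counter


def tokenize_simple(text):
    """Lowercase and split on whitespace (hypothetical stand-in tokenizer)."""
    return text.lower().split()


def bm25_scores(query, documents, k1=2.0, b=0.75):
    """Score each document against the query with textbook Okapi BM25.

    k1 controls how quickly repeated terms saturate; b controls how strongly
    scores are normalized by document length.
    """
    tokenized = [tokenize_simple(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in tokenize_simple(query):
            tf = term_freq[term]
            if tf == 0:
                continue
            # Smoothed IDF; the +1 inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Sentences taken from the notebook's dataset output.
paragraphs = [
    "Analytics is something every business requires to stay competitive.",
    "So it's important that we really measure what matters.",
    "Everything that we do in marketing analytics is about being actionable.",
]
scores = bm25_scores("marketing analytics", paragraphs)
best = max(range(len(paragraphs)), key=scores.__getitem__)
print(paragraphs[best])
```

The retriever's job is only to shortlist candidate paragraphs; the reader then extracts the answer span from them.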

### Execute a query

In [24]:
query = 'Who fall victim to common mistakes?'
prediction = cdqa_pipeline.predict(query)

### Explore predictions

In [25]:
# The prediction is indexable: answer first, then the source title, then the paragraph
print('query: {}'.format(query))
print('answer: {}'.format(prediction[0]))
print('title: {}'.format(prediction[1]))
print('paragraph: {}'.format(prediction[2]))

query: Who fall victim to common mistakes?
answer: we don't fall victim to common mistakes.
title: Intro
paragraph: We also want to figure out how to watch out for data reporting pitfalls so we don't fall victim to common mistakes.


In [26]:
query2 = "What is analytics?"
prediction2 = cdqa_pipeline.predict(query2)

In [27]:
print('query: {}'.format(query2))
print('answer: {}'.format(prediction2[0]))
print('title: {}'.format(prediction2[1]))
print('paragraph: {}'.format(prediction2[2]))

query: What is analytics?
answer: being actionable
title: Intro
paragraph: Everything that we do in marketing analytics is about being actionable.
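
The two print cells above differ only in their variable names, so the pattern can be factored into a small helper. The sketch below assumes, as the indexing in this notebook suggests, that the first three elements of a prediction are the answer, the document title, and the paragraph. `format_prediction` is a hypothetical name, and the tuple is copied from the output above so the example runs without a fitted pipeline.

```python
def format_prediction(query, prediction):
    """Return a printable summary of an (answer, title, paragraph, ...) result."""
    # Slice rather than unpack directly, in case the result has extra fields.
    answer, title, paragraph = prediction[:3]
    return "\n".join([
        "query: {}".format(query),
        "answer: {}".format(answer),
        "title: {}".format(title),
        "paragraph: {}".format(paragraph),
    ])


# Values copied from the notebook output; with a fitted pipeline this would be
# the result of cdqa_pipeline.predict(query2).
example_prediction = (
    "being actionable",
    "Intro",
    "Everything that we do in marketing analytics is about being actionable.",
)
summary = format_prediction("What is analytics?", example_prediction)
print(summary)
```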
