# Base Model Trained on Example Website

source: https://towardsdatascience.com/how-to-create-your-own-question-answering-system-easily-with-python-2ef8abc8eb5

In [2]:
domains = ["waterboards.ca.gov"]
websites = ["https://www.waterboards.ca.gov/sanfranciscobay/water_issues/programs/agriculture/Cannabis/index.html"]


In [3]:
titles = ["Cannabis Cultivation Regulatory Program"]
paragraphs = [
      ["Welcome to the San Francisco Bay Regional Water Board’s cannabis cultivation regulatory program website. We regulate water quality impacts from cannabis grows in the hydrologic region of the San Francisco Bay Area, which approximately corresponds to the nine counties of the Bay Area (map of Regional Water Boards).",
       "In October 2017, the State Water Board adopted requirements for cannabis cultivation to reduce impacts from discharges of waste and water diversions associated with cannabis cultivation activities. Cannabis cultivators are now required to obtain licenses and meet all state and local environmental regulations including the California Water Code and Basin Plan, as well as the new Cannabis Policy and Cannabis General Order. Cannabis cultivation can cause significant environmental damage, including discharges of polluted wastes to surface water and groundwater, erosion and sedimentation of surface water bodies, and illegal diversions of surface water.",
       "Visit the Cannabis Cultivation Programs Registration Portal to enroll for coverage under the Cannabis General Order or to file for a cannabis Small Irrigation Use Registration water right.",
       "State Water Board Cannabis Cultivation Homepage For information about the Cannabis Policy and Cannabis General Order, trainings around the State, and enforcement for cannabis cultivators discharging waste to waters of the State without a permit or diverting water without an appropriate water right, see the State Water Board’s Cannabis Cultivation Programs homepage.",
       "State Water Board's Cannabis Cultivation Policy For information on the Cannabis Policy, which establishes principles and guidelines for the diversion and use of water, land disturbances, and the activities related to cannabis cultivation to protect water quantity and quality, see the State Water Board’s Cannabis Cultivation Policy homepage.",
       "State Water Board's Cannabis Cultivation Water Quality For information on cannabis cultivation water quality programs to address water quality impacts from cannabis cultivation and associated activities on private property, see the State Water Board’s Cannabis Cultivation Water Quality homepage.",
       "State Water Board's Cannabis Cultivation Water Rights For information on the development of principles and guidelines for diversion and use of water for cannabis cultivation and water rights information related to cannabis cultivation, see the State Water Board’s Cannabis Cultivation Water Rights homepage.",
       "Other State and Regional Water Board Permits That May Apply This list is not exhaustive. Please contact the Regional Water Board for assistance if you have questions about whether an activity impacts water quality and needs a permit.",
       "Construction stormwater general permit: https://www.waterboards.ca.gov/water_issues/programs/stormwater/construction.shtml Waste discharge requirements for stream or wetland fill or impacts: https://www.waterboards.ca.gov/sanfranciscobay/certs.html Waste discharge requirements for discharges to land: https://www.waterboards.ca.gov/sanfranciscobay/water_issues/programs/permits.html Registration Portal, Policy, and General Order Questions General Order Enrollment DWQ.Cannabis@waterboards.ca.gov 916-341-5580 Small Irrigation Use Registration CannabisReg@waterboards.ca.gov 916-319-9427 Policy CannabisWR@waterboards.ca.gov",
       "General Order wb-dwr-cango@waterboards.ca.gov San Francisco Bay Regional Water Board Contacts Mailbox for Cannabis Questions SanFranciscoBay.Cannabis@waterboards.ca.gov",
       "Sami Harper Water Resource Control Engineer samantha.harper@waterboards.ca.gov 510-622-2415 Josh Hoeflich Engineering Geologist joshua.hoeflich@waterboards.ca.gov 510-622-2370 James Ponton Senior Engineering Geologist james.ponton@waterboards.ca.gov 510-622-2492"]
]
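
The paragraphs above are hard-coded. As a rough sketch of how a similar list could be built from a page's HTML, the snippet below collects the text of `<p>` elements with Python's standard `html.parser`. The `ParagraphExtractor` class and the inline `sample_html` string are hypothetical stand-ins: the actual markup of the Water Board page is not shown in this notebook and would likely need extra cleanup.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the visible text of each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph strings
        self._buffer = None    # text pieces of the <p> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            # Collapse runs of whitespace and skip empty paragraphs
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None


# Illustrative HTML only; not the actual markup of the Water Board page
sample_html = """
<html><body>
  <h1>Cannabis Cultivation Regulatory Program</h1>
  <p>Welcome to the <b>cannabis cultivation</b> regulatory program website.</p>
  <p>   </p>
  <p>Visit the Registration Portal
     to enroll for coverage.</p>
</body></html>
"""

extractor = ParagraphExtractor()
extractor.feed(sample_html)

# One inner list per page, mirroring the shape of `paragraphs` above
extracted_paragraphs = [extractor.paragraphs]
print(extracted_paragraphs)
```

Text inside nested tags such as `<b>` is kept, while headings and whitespace-only paragraphs are dropped.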

In [4]:
import pandas as pd
from ast import literal_eval

from cdqa.utils.filters import filter_paragraphs
from cdqa.utils.download import download_model, download_bnpp_data
from cdqa.pipeline.cdqa_sklearn import QAPipeline



In [5]:
# Download data and models
download_bnpp_data(dir='./data/bnpp_newsroom_v1.1/')
download_model(model='bert-squad_1.1', dir='./models')


Downloading BNP data...
bnpp_newsroom-v1.1.csv already downloaded

Downloading trained model...
bert_qa.joblib already downloaded


In [6]:
# Loading data and filtering / preprocessing the documents
df_test = pd.read_csv('data/bnpp_newsroom_v1.1/bnpp_newsroom-v1.1.csv', converters={'paragraphs': literal_eval})
df_test = filter_paragraphs(df_test)
df_test

Unnamed: 0,date,title,category,link,abstract,paragraphs
0,13.05.2019,The banking jobs : Assistant Vice President – ...,Careers,https://group.bnpparibas/en/news/banking-jobs-...,Within the Group’s Corporate and Institutional...,[I manage a team in charge of designing and im...
1,13.05.2019,BNP Paribas at #VivaTech : discover the progra...,Innovation,https://group.bnpparibas/en/news/bnp-paribas-v...,"From Thursday 16 to Saturday 18 May 2019, join...","[With François Hollande, Chairman of French fo..."
2,13.05.2019,"""The bank with an IT budget of more than EUR6 ...",Group,https://group.bnpparibas/en/news/the-bank-budg...,"Interview with Jean-Laurent Bonnafé, Director ...","[We did the groundwork between 2012 and 2016, ..."
3,10.05.2019,BNP Paribas at #VivaTech : discover the progra...,Innovation,https://group.bnpparibas/en/news/bnp-paribas-v...,"From Thursday 16 to Saturday 18 May 2019, join...","[As part of the ‘United Tech of Europe’ theme,..."
4,10.05.2019,When Artificial Intelligence participates in r...,Careers,https://group.bnpparibas/en/news/artificial-in...,As the competition to attract talent intensifi...,[Online recruitment is already the norm. Accor...
5,09.05.2019,Dream Up in Marseille: showcasing dance and pe...,Corporate philanthropy,https://group.bnpparibas/en/news/dream-marseil...,The BNP Paribas Foundation is working closely ...,[ADOLéDANSE allows a 6th grade class at Edgar-...
6,07.05.2019,BGL BNP Paribas joins the social entrepreneurs...,Entrepreneurship,https://group.bnpparibas/en/news/bgl-bnp-parib...,"BGL BNP Paribas, a key player in social entrep...","[BGL BNP Paribas, a key player in social entre..."
7,07.05.2019,Viva Technology 2019: Dive into the world of p...,Innovation,https://group.bnpparibas/en/news/viva-technolo...,"Viva Technology, the international innovation ...","[Viva Technology, the international innovation..."
8,06.05.2019,Mr. Christian Noyer is appointed as non-voting...,Press release,https://group.bnpparibas/en/press-release/mr-c...,,"[Mr. Christian Noyer, 68 years old, is a membe..."
9,03.05.2019,"To be sustainable, growth must be inclusive",Economy,https://group.bnpparibas/en/news/sustainable-g...,In the three latest episodes of our podcast Ma...,[Growth is inclusive when it narrows social in...
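
The `converters={'paragraphs': literal_eval}` argument matters because a CSV file can only store each list of paragraphs as text. Without the converter, pandas would leave that column as plain strings. A minimal standard-library illustration of the same idea, using a made-up one-row CSV rather than the BNP Paribas file:

```python
import csv
import io
from ast import literal_eval

# Made-up CSV content: the `paragraphs` cell holds the text form of a Python list
csv_text = """title,paragraphs
Example page,"['First paragraph.', 'Second paragraph.']"
"""

row = next(csv.DictReader(io.StringIO(csv_text)))
raw = row["paragraphs"]        # still a single string at this point
parsed = literal_eval(raw)     # safely evaluated into a real list of strings

print(type(raw).__name__, type(parsed).__name__)
print(parsed)
```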


In [7]:
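# Build a one-row DataFrame for the example page,
# with a `title` column and a `paragraphs` column (a list of strings)
column_names = ["title", "paragraphs"]
df = pd.DataFrame(columns=column_names)
df.loc[0, "title"] = titles[0]
df.loc[0, "paragraphs"] = paragraphs[0]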
df

Unnamed: 0,title,paragraphs
0,Cannabis Cultivation Regulatory Program,[Welcome to the San Francisco Bay Regional Wat...
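
If more pages are added later, the same two-column layout can be built in one step from parallel lists instead of assigning cells one at a time. This is only a sketch: `example_titles` and `example_paragraphs` are hypothetical placeholder data, not content from the website.

```python
import pandas as pd

# Hypothetical two-page example; in the notebook, `titles` and `paragraphs`
# hold a single page
example_titles = ["Page A", "Page B"]
example_paragraphs = [
    ["First paragraph of A.", "Second paragraph of A."],
    ["Only paragraph of B."],
]

# Each inner list becomes one cell of the `paragraphs` column
example_df = pd.DataFrame({"title": example_titles, "paragraphs": example_paragraphs})
print(example_df.shape)
print(example_df.loc[1, "paragraphs"])
```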


In [11]:
# Loading QAPipeline with CPU version of BERT Reader pretrained on SQuAD 1.1
cdqa_pipeline = QAPipeline(reader='models/bert_qa.joblib')

# Fitting the retriever to the list of documents in the dataframe
cdqa_pipeline.fit_retriever(df=df)

# Sending a question to the pipeline and getting the top 3 predictions
query = 'What are all the permits I need?'
prediction = cdqa_pipeline.predict(query=query, n_predictions=3)

# With n_predictions=3, each element of `prediction` is a full
# (answer, title, paragraph, score) tuple, as the output shows
print('query: {}\n'.format(query))
print('prediction 1: {}\n'.format(prediction[0]))
print('prediction 2: {}\n'.format(prediction[1]))
print('prediction 3: {}\n'.format(prediction[2]))

query: What are all the permits I need?

prediction 1: ('California Water Code and Basin Plan', 'Cannabis Cultivation Regulatory Program', 'In October 2017, the State Water Board adopted requirements for cannabis cultivation to reduce impacts from discharges of waste and water diversions associated with cannabis cultivation activities. Cannabis cultivators are now required to obtain licenses and meet all state and local environmental regulations including the California Water Code and Basin Plan, as well as the new Cannabis Policy and Cannabis General Order. Cannabis cultivation can cause significant environmental damage, including discharges of polluted wastes to surface water and groundwater, erosion and sedimentation of surface water bodies, and illegal diversions of surface water.', 3.0687530279159545)

prediction 2: ('cannabis Small Irrigation Use Registration water right', 'Cannabis Cultivation Regulatory Program', 'Visit the Cannabis Cultivation Programs Registration Portal to enroll for cov