### Walkthrough

In [1]:
import os
os.chdir('..')

import pandas as pd

pdpa_raw = pd.read_csv('data/pdpa/raw.csv')
pdpa_raw.head()

Unnamed: 0,question,meta,answer
0,What is personal data?,"Organisations, General\n","Personal data refers to data, whether true or ..."
1,When did the PDPA come into force?,"Organisations, General\n",The PDPA was implemented in phases to allow ti...
2,What are the objectives of the PDPA?,"Organisations, General\n",The PDPA aims to safeguard individuals’ person...
3,How does the PDPA benefit business?,"Organisations, General\n",The PDPA will strengthen Singapore's overall e...
4,How will the PDPA impact business costs?,"Organisations, General\n",The provisions of the PDPA were formulated kee...


### 1. Generate Keywords

In [2]:
text = pdpa_raw.answer[0]
text

'Personal data refers to data, whether true or not, about an individual who can be identified from that data; or from that data and other information to which the organisation has or is likely to have access.This includes unique identifiers (e.g. NRIC number, passport number); photographs or video images of an individual (e.g. CCTV images); as well as any set of data (e.g. name, age, address, telephone number, occupation, etc), which when taken together would be able to identify the individual. For example, Jack Lim, 36 years old, civil servant, lives at Blk 123 Bishan St 23.'

In [3]:
from src.kw_extractor import kw_extractor
kwe = kw_extractor()

2020-05-19 16:34:00 INFO: Loading these models for language: en (English):
| Processor | Package |
-----------------------
| tokenize  | ewt     |
| pos       | ewt     |

2020-05-19 16:34:00 INFO: Use device: cpu
2020-05-19 16:34:00 INFO: Loading: tokenize
2020-05-19 16:34:00 INFO: Loading: pos
2020-05-19 16:34:01 INFO: Done loading processors!


In [11]:
kw, kw_rank, alias, sentences = kwe.extract_kw(text, beta=0.9)
context_w_kw = pd.DataFrame(kw, columns=['keywords']).assign(context=text)

In [24]:
def get_kw_sentence_pairs(kw, sentences):
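    """
    Pair each keyword with every sentence that contains it
    (case-insensitive substring match).
    A keyword may be paired with multiple sentences.

    args:
    ----
        kw: (list of str) extracted keywords
        sentences: (list of str) sentences to search
    return:
    ------
        list of (keyword, sentence) tuples
    """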
    return [(keyword, sentence) 
            for keyword in kw 
            for sentence in sentences 
            if keyword.lower() in sentence.lower()]

kw_sentence_pairs = get_kw_sentence_pairs(kw, sentences)
kw_sentence_pairs = pd.DataFrame(kw_sentence_pairs, columns = ['keyword', 'sentence'])
kw_sentence_pairs.head()

Unnamed: 0,keyword,sentence
0,personal data,"Personal data refers to data, whether true or ..."
1,unique identifiers,This includes unique identifiers (e.g. NRIC nu...
2,cctv images,This includes unique identifiers (e.g. NRIC nu...
3,data,"Personal data refers to data, whether true or ..."
4,data,This includes unique identifiers (e.g. NRIC nu...
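
The pairing above uses a plain substring check, so a short keyword such as `data` would also match inside a longer word such as "database". The sketch below shows one possible stricter variant that matches whole words only. The function name `get_kw_sentence_pairs_wb` and the toy inputs are hypothetical and not part of `src.kw_extractor`; the `\b` anchors assume keywords start and end with word characters.

```python
import re

def get_kw_sentence_pairs_wb(kw, sentences):
    """Hypothetical variant: pair keywords with sentences on whole-word matches only."""
    pairs = []
    for keyword in kw:
        # re.escape guards against regex metacharacters inside the keyword
        pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", flags=re.IGNORECASE)
        for sentence in sentences:
            if pattern.search(sentence):
                pairs.append((keyword, sentence))
    return pairs

# Toy inputs (assumed for illustration, not taken from the PDPA data)
toy_sentences = [
    "Personal data refers to data about an individual.",
    "The database stores CCTV images.",
]
toy_keywords = ["data", "cctv images"]

pairs = get_kw_sentence_pairs_wb(toy_keywords, toy_sentences)
for keyword, sentence in pairs:
    print(keyword, "->", sentence)
```

With these toy inputs, `data` is paired only with the first sentence, because "database" no longer counts as a match.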


In [25]:
kw_sentence_context = kw_sentence_pairs.assign(context = text)
kw_sentence_context.head()

Unnamed: 0,keyword,sentence,context
0,personal data,"Personal data refers to data, whether true or ...","Personal data refers to data, whether true or ..."
1,unique identifiers,This includes unique identifiers (e.g. NRIC nu...,"Personal data refers to data, whether true or ..."
2,cctv images,This includes unique identifiers (e.g. NRIC nu...,"Personal data refers to data, whether true or ..."
3,data,"Personal data refers to data, whether true or ...","Personal data refers to data, whether true or ..."
4,data,This includes unique identifiers (e.g. NRIC nu...,"Personal data refers to data, whether true or ..."


In [42]:
generated_questions = kw_sentence_context.copy()

sentence_qg_input = kw_sentence_context.sentence + ' [SEP] ' + kw_sentence_context.keyword
generated_questions = generated_questions.assign(sentence_qg_input=sentence_qg_input)
print(f"sentence_input sample: {sentence_qg_input[0]}")

context_qg_input = kw_sentence_context.context + ' [SEP] ' + kw_sentence_context.keyword
generated_questions = generated_questions.assign(context_qg_input=context_qg_input)
print(f"\ncontext_input sample: {context_qg_input[0]}")

sentence_input sample: Personal data refers to data, whether true or not, about an individual who can be identified from that data; or from that data and other information to which the organisation has or is likely to have access. [SEP] personal data

context_input sample: Personal data refers to data, whether true or not, about an individual who can be identified from that data; or from that data and other information to which the organisation has or is likely to have access.This includes unique identifiers (e.g. NRIC number, passport number); photographs or video images of an individual (e.g. CCTV images); as well as any set of data (e.g. name, age, address, telephone number, occupation, etc), which when taken together would be able to identify the individual. For example, Jack Lim, 36 years old, civil servant, lives at Blk 123 Bishan St 23. [SEP] personal data
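
The two input columns above are built with the same `text + ' [SEP] ' + keyword` pattern. A minimal sketch of how that step could be factored into one helper follows. The name `build_qg_inputs` and the toy DataFrame are assumptions for illustration; only the `' [SEP] '` separator and the column names come from the cells above.

```python
import pandas as pd

SEP = " [SEP] "  # separator string used in the cells above

def build_qg_inputs(df, text_col, kw_col="keyword"):
    """Hypothetical helper: join a text column and the keyword column with [SEP]."""
    return df[text_col] + SEP + df[kw_col]

# Toy frame with the same column names as kw_sentence_context (contents are made up)
toy = pd.DataFrame({
    "keyword": ["personal data"],
    "sentence": ["Personal data refers to data about an individual."],
    "context": ["Personal data refers to data about an individual. "
                "This includes unique identifiers."],
})

toy = toy.assign(
    sentence_qg_input=build_qg_inputs(toy, "sentence"),
    context_qg_input=build_qg_inputs(toy, "context"),
)
print(toy.sentence_qg_input[0])
print(toy.context_qg_input[0])
```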


### 2. Generate Questions

In [37]:
from src.q_gen import q_gen
qg = q_gen()

Better speed can be achieved with apex installed from https://www.github.com/artitw/apex.
qg_model.bin found in current directory.
***** Recover model: %s ***** qg_model.bin


In [38]:
# sentence-based inputs built in the previous section
qg.predict(sentence_qg_input)

100%|██████████| 1/1 [00:43<00:00, 43.92s/it]


[('What is the term for data about an individual who can be identified from what?',
  'personal data'),
 ('What is a NRIC number?', 'unique identifiers'),
 ('What is an example of a CCTV image?', 'cctv images'),
 ('What is personal data?', 'data'),
 ('Name a set of what?', 'data'),
 ('What is an example of a unique identifier?', 'passport number'),
 ('What type of data is personal data?', 'individual'),
 ('CCTV images can identify what?', 'individual'),
 ('What is an example of a unique identifier?', 'nric number'),
 ("What is Jack Lim's occupation?", 'civil servant'),
 ('What is one example of a unique identifier that can be used to identify an individual?',
  'telephone number'),
 ('What is personal data?', 'other information')]

In [49]:
gen_questions_kw_pair = qg.predict(context_qg_input)

100%|██████████| 1/1 [01:01<00:00, 61.90s/it]


In [51]:
gen_questions_kw_pair

[('What is data about an individual that can be identified from what?',
  'personal data'),
 ('What type of identifiers are NRIC numbers and passport number?',
  'unique identifiers'),
 ('What type of images can be used to identify an individual?', 'cctv images'),
 ('What is personal data?', 'data'),
 ('What is personal data?', 'data'),
 ('What is an example of a unique identifier?', 'passport number'),
 ('What is personal data?', 'individual'),
 ('What is personal data?', 'individual'),
 ('What is an example of a unique identifier?', 'nric number'),
 ("What is Jack Lim's occupation?", 'civil servant'),
 ('What is one example of a personal data identifier?', 'telephone number'),
 ('What is included in personal data?', 'other information')]

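### The context-based inputs give better-formed questions than the sentence-based inputs

For example, for the keyword `unique identifiers` the first run produced 'What is a NRIC number?', while the context-based run produced 'What type of identifiers are NRIC numbers and passport number?'.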

In [50]:
gen_questions_str = [tup[0] for tup in gen_questions_kw_pair]
generated_questions = generated_questions.assign(gen_questions = gen_questions_str)
generated_questions
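
Several generated questions repeat across keywords (for instance 'What is personal data?' appears more than once in the output above), so a deduplication step may be useful before the questions are used downstream. The sketch below is one possible approach using `DataFrame.drop_duplicates`. The toy frame copies the first six pairs of the context-based output, and keeping the first occurrence is an assumption, not something the notebook prescribes.

```python
import pandas as pd

# First six (question, keyword) pairs copied from the context-based output above
pairs = [
    ("What is data about an individual that can be identified from what?", "personal data"),
    ("What type of identifiers are NRIC numbers and passport number?", "unique identifiers"),
    ("What type of images can be used to identify an individual?", "cctv images"),
    ("What is personal data?", "data"),
    ("What is personal data?", "data"),
    ("What is an example of a unique identifier?", "passport number"),
]
df = pd.DataFrame(pairs, columns=["gen_questions", "keyword"])

# Keep the first row for each distinct question text and renumber the index
deduped = (
    df.drop_duplicates(subset="gen_questions", keep="first")
      .reset_index(drop=True)
)
print(f"{len(df)} rows before, {len(deduped)} rows after deduplication")
```

Deduplicating on the question text alone discards the extra keyword for a repeated question; if the keyword matters, `subset=["gen_questions", "keyword"]` would keep one row per distinct pair instead.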