corpus2question tutorial
====================

This notebook demonstrates how the `corpus2question` technique works and how you can apply it to your own corpus.

## Setup

### Basic NLP-torch stack


In [1]:
!pip install torch transformers sentencepiece nltk tqdm pandas

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.


### Model Download

Download the pretrained model from its repository and load it using the transformers library. corpus2question is based on doc2query.

In [34]:
! wget -nc https://storage.googleapis.com/doctttttquery_git/t5-base.zip
! unzip -o t5-base.zip

File ‘t5-base.zip’ already there; not retrieving.

Archive:  t5-base.zip
  inflating: model.ckpt-1004000.data-00000-of-00002  
  inflating: model.ckpt-1004000.data-00001-of-00002  
  inflating: model.ckpt-1004000.index  
  inflating: model.ckpt-1004000.meta  


In [54]:
from typing import List, Iterable

import nltk
import torch
import pandas as pd
from tqdm.notebook import tqdm
from transformers import T5Config, T5Tokenizer, T5ForConditionalGeneration


nltk.download('punkt')

# Define the target device. Use GPU if available.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

[nltk_data] Downloading package punkt to /home/gsurita/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [55]:
# Instantiate the QG model and move it to the target device (GPU if available).
qg_tokenizer = T5Tokenizer.from_pretrained('t5-base')
qg_config = T5Config.from_pretrained('t5-base')
qg_model = T5ForConditionalGeneration.from_pretrained('model.ckpt-1004000', from_tf=True, config=qg_config)

qg_model.to(device)

True

True

## Generation Pipeline

Here we define our preprocessing and generation functions. The versions below are the ones used in the paper, but you may customize them for your needs.

In [26]:
def preprocess(document: str, span=10, stride=5) -> List[str]:
    """
    Define your preprocessing function.
    
    This function should take a corpus document and output a list of generation
    spans. This is required so we can match the expected sequence size of the
    generation model.
    """
    
    sentences = nltk.tokenize.sent_tokenize(document)
    chunks = [" ".join(sentences[i:i+span]) for i in range(0, len(sentences), stride)]

    return chunks
    


def generate_questions(text: str) -> List[str]:
    """
    Define your generation function. 
    
    This function should take a text passage and generate a list of questions.
    With the current configuration it always generates one question per passage.
    
    You may copy this example to use the same configuration as the paper. 
    You may also configure the generation parameters (such as using sampling and
    generating multiple questions) for other use cases.
    """
    
    # Append an end of sequence token </s> after the context.
    doc_text = f"{text} </s>"

    input_ids = qg_tokenizer.encode(doc_text, return_tensors='pt').to(device)
    outputs = qg_model.generate(
        input_ids=input_ids,
        max_length=64,
        do_sample=False,
        num_beams=4,
    )

    return [qg_tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
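
To see how `span` and `stride` interact in `preprocess`, the sketch below applies the same slicing to a list of placeholder sentence labels instead of real text. The helper name `sliding_windows` is hypothetical and not part of the original notebook; it only mirrors the list comprehension used in `preprocess`.

```python
from typing import List, Sequence


def sliding_windows(items: Sequence[str], span: int = 10, stride: int = 5) -> List[List[str]]:
    """Hypothetical helper: same slicing as `preprocess`, without sentence splitting."""
    return [list(items[i:i + span]) for i in range(0, len(items), stride)]


# Twelve placeholder "sentences" labelled s0..s11.
sentences = [f"s{i}" for i in range(12)]
windows = sliding_windows(sentences, span=10, stride=5)

# Show the first and last sentence of every window, plus its size.
for window in windows:
    print(window[0], "...", window[-1], f"({len(window)} sentences)")
```

Because the window start advances by `stride` until the end of the list, consecutive windows overlap whenever `stride < span`, and the last windows are shorter than `span`. Keeping those short tail windows matches the behaviour of the function above; dropping them is a reasonable alternative if near-duplicate spans are a concern.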

### Provide a corpus

This section provides a corpus to the model. Here we provide an in-memory toy example, but you may read the text from other sources. We expect the corpus to be a list or iterable object of strings. You may also filter out non-natural language symbols before feeding the text into the model.

In [60]:
corpus = [
    # Extracted from https://en.wikipedia.org/wiki/Lorem_ipsum
    """
    In publishing and graphic design, Lorem ipsum is a placeholder text commonly used to demonstrate 
    the visual form of a document or a typeface without relying on meaningful content. Lorem ipsum 
    may be used before final copy is available, but it may also be used to temporarily replace copy 
    in a process called greeking, which allows designers to consider form without the meaning of the
    text influencing the design.

    Lorem ipsum is typically a corrupted version of De finibus bonorum et malorum, a first-century BC
    text by the Roman statesman and philosopher Cicero, with words altered, added, and removed to make
    it nonsensical, improper Latin.

    Versions of the Lorem ipsum text have been used in typesetting at least since the 1960s, when it 
    was popularized by advertisements for Letraset transfer sheets. Lorem ipsum was introduced to the 
    digital world in the mid-1980s when Aldus employed it in graphic and word-processing templates for 
    its desktop publishing program PageMaker. Other popular word processors including Pages and Microsoft 
    Word have since adopted Lorem ipsum as well. 
    """,
    
    # Extracted from https://www.lipsum.com/
    """
    Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the 
    industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and 
    scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap
    into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the 
    release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing 
    software like Aldus PageMaker including versions of Lorem Ipsum.

    It is a long established fact that a reader will be distracted by the readable content of a page when 
    looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution
    of letters, as opposed to using 'Content here, content here', making it look like readable English. 
    Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, 
    and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Various versions have 
    evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like).

    Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical 
    Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at 
    Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a 
    Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the 
    undoubtable source. Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum" 
    (The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics, 
    very popular during the Renaissance. The first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes 
    from a line in section 1.10.32.
    
    The standard chunk of Lorem Ipsum used since the 1500s is reproduced below for those interested. Sections 
    1.10.32 and 1.10.33 from "de Finibus Bonorum et Malorum" by Cicero are also reproduced in their exact original
    form, accompanied by English versions from the 1914 translation by H. Rackham.

    There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration
    in some form, by injected humour, or randomised words which don't look even slightly believable. If you are
    going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the
    middle of text. All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary,
    making this the first true generator on the Internet. It uses a dictionary of over 200 Latin words, combined 
    with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable. The generated
    Lorem Ipsum is therefore always free from repetition, injected humour, or non-characteristic words etc.
    """,
]
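
If your corpus contains markup residue or other non-natural-language symbols, a small cleaning step can be applied before preprocessing. The sketch below is only one possible set of rules, and the function name `clean_text` is an assumption made for illustration: it drops URLs, removes control and invisible formatting characters, and collapses whitespace.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Hypothetical cleaning step; adapt the rules to your own corpus."""
    # Drop URLs, which rarely help question generation.
    text = re.sub(r"https?://\S+", " ", text)
    # Remove control and formatting characters (Unicode categories starting with "C"),
    # but keep newlines and tabs so they are normalized as whitespace below.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )
    # Collapse runs of whitespace, including the indentation of triple-quoted strings.
    return re.sub(r"\s+", " ", text).strip()


raw = "  Lorem ipsum\x00 is a placeholder\u200b text.\n    See https://example.com/page for details.  "
print(clean_text(raw))
```

With a helper like this, the corpus could be cleaned in one pass with `corpus = [clean_text(doc) for doc in corpus]` before it is handed to the generation step.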

### Generate the questions

Here we apply the preprocessing and generation functions defined earlier. You may save questions into a list if your source is small. For large datasets we recommend adding some sort of checkpointing.

In [None]:
questions = [
    [generate_questions(span) for span in preprocess(doc)] 
    for doc in tqdm(corpus)
]

questions
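
For larger corpora, one simple way to add checkpointing is to append the questions of each document to a JSON Lines file and skip already processed documents on restart. The sketch below is an illustration, not the method used in the paper: the helper name `generate_with_checkpoint` and the checkpoint path are hypothetical, and it assumes the corpus is iterated in the same order on every run.

```python
import json
import os
from typing import Callable, Iterable, List


def generate_with_checkpoint(
    corpus: Iterable[str],
    preprocess_fn: Callable[[str], List[str]],
    generate_fn: Callable[[str], List[str]],
    path: str,
) -> List[List[List[str]]]:
    """Hypothetical helper: resume generation from a JSON Lines checkpoint file."""
    results = []
    # Reload the documents that were already processed in a previous run.
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            results = [json.loads(line) for line in f if line.strip()]

    with open(path, "a", encoding="utf-8") as f:
        for idx, doc in enumerate(corpus):
            if idx < len(results):
                continue  # Already in the checkpoint, skip it.
            doc_questions = [generate_fn(span) for span in preprocess_fn(doc)]
            f.write(json.dumps(doc_questions) + "\n")
            f.flush()  # Hand the record to the OS before moving on.
            results.append(doc_questions)
    return results
```

With the functions defined in this notebook, the call would look like `generate_with_checkpoint(corpus, preprocess, generate_questions, "questions.jsonl")`, where the file name is arbitrary. Note that this sketch does not recover from a partially written last line, so a run interrupted in the middle of a write may require trimming the file by hand.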

### Aggregate with Pandas

Pandas is a convenient way to aggregate the generations. In this example we assign document, span, and generation ids to every question, group the rows by question text, and then count the unique documents, spans, and generations for each question.

In [50]:
question_df = pd.DataFrame([
    dict(
        document_id=doc_idx,
        span_id=f"{doc_idx}:{span_idx}",
        gen_id=f"{doc_idx}:{span_idx}:{gen_idx}",
        question=question,
    )
    for doc_idx, document_gen in enumerate(questions)
    for span_idx, span_gen in enumerate(document_gen)
    for gen_idx, question in enumerate(span_gen)
])

question_df

|   | document_id | span_id | gen_id | question |
|---|---|---|---|---|
| 0 | 0 | 0:0 | 0:0:0 | what is lorem ipsum |
| 1 | 0 | 0:1 | 0:1:0 | what is lorem ipsum |
| 2 | 1 | 1:0 | 1:0:0 | what is lorem ipsum |
| 3 | 1 | 1:1 | 1:1:0 | where does lorem ipsum come from |
| 4 | 1 | 1:2 | 1:2:0 | where does the word ipsum come from |
| 5 | 1 | 1:3 | 1:3:0 | what is lorem ipsum |
| 6 | 1 | 1:4 | 1:4:0 | what is lorem ipsum |


In [59]:
# Group the results by question, count unique results and order by generation id counts.
question_df \
    .groupby("question") \
    .nunique() \
    .sort_values("gen_id", ascending=False)

| question | document_id | span_id | gen_id | question |
|---|---|---|---|---|
| what is lorem ipsum | 2 | 5 | 5 | 1 |
| where does lorem ipsum come from | 1 | 1 | 1 | 1 |
| where does the word ipsum come from | 1 | 1 | 1 | 1 |


These results suggest that the most frequent question in the corpus is `what is lorem ipsum`, which was generated for 5 spans across 2 documents.
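
To make explicit what the grouped count computes, the same aggregation can be spelled out with the standard library alone. This is a sketch for illustration, not a replacement for the pandas cell; the rows are copied from the `question_df` output shown earlier, and the variable names are assumptions.

```python
from collections import defaultdict

# Rows copied from the `question_df` output: (document_id, span_id, gen_id, question).
rows = [
    (0, "0:0", "0:0:0", "what is lorem ipsum"),
    (0, "0:1", "0:1:0", "what is lorem ipsum"),
    (1, "1:0", "1:0:0", "what is lorem ipsum"),
    (1, "1:1", "1:1:0", "where does lorem ipsum come from"),
    (1, "1:2", "1:2:0", "where does the word ipsum come from"),
    (1, "1:3", "1:3:0", "what is lorem ipsum"),
    (1, "1:4", "1:4:0", "what is lorem ipsum"),
]

# For each question, collect the distinct documents, spans, and generations it appears in.
seen = defaultdict(lambda: {"document_id": set(), "span_id": set(), "gen_id": set()})
for document_id, span_id, gen_id, question in rows:
    seen[question]["document_id"].add(document_id)
    seen[question]["span_id"].add(span_id)
    seen[question]["gen_id"].add(gen_id)

# Turn the sets into counts and rank questions by number of generations.
counts = {q: {key: len(ids) for key, ids in groups.items()} for q, groups in seen.items()}
ranked = sorted(counts.items(), key=lambda item: item[1]["gen_id"], reverse=True)

for question, question_counts in ranked:
    print(question, question_counts)
```

Counting distinct ids rather than raw rows matters once several questions are generated per span, since the same question text can then appear more than once for a single span.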