<br><br>

<center><font size="5">📝 Question-Answering Starter pack with 🤗transformers</font></center>
   
<br>

<center>
<font size="3">
  In this notebook, we use <a href="https://github.com/huggingface/transformers">transformers</a> to frame the learning problem as a <strong>question-answering</strong> task.
  
  <br><br>
  
  The code and the notebook format are designed to be easy to understand for beginners, and hopefully also useful for advanced Kagglers.
  
  <br><br>
  
  Any comments or feedback are much appreciated. Disclaimer: this is a work in progress; I will add new resources and comments soon.
  
    
</font>
</center>


### 1. Problem formulation

We formulate the task as a question answering problem: given a question and a context, we train a transformer model to find the **answer** in the `text` column (the context).

We have:
 1. Question: `sentiment` column (`positive`, `negative` or `neutral`)
 2. Context:  `text` column
 3. Answer: `selected_text` column


### 2. Getting started with QA

A great resource for a quick recap of question answering is this Stanford lecture: [Question Answering](https://web.stanford.edu/class/cs124/lec/watsonqa.pdf).

#### 2.1 Other free online resources:

 - [YouTube: Stanford CS224N - Question Answering](https://www.youtube.com/watch?v=yIdF-17HwSk)
 - [Medium: Building a Question-Answering System from Scratch— Part 1](https://towardsdatascience.com/building-a-question-answering-system-part-1-9388aadff507)
 - [GitHub: awesome question answering](https://github.com/seriousran/awesome-qa)


### 3. Learning QA from scratch

The final project of the Stanford course CS224n, **Natural Language Processing with Deep Learning**, consists of building a question-answering system (almost) from scratch using deep neural networks and transformers. The 24-page handout is available [here](https://web.stanford.edu/class/cs224n/project/default-final-project-handout.pdf). For the most enthusiastic readers: consider doing this project and implementing your own question-answering system. It is probably the best way to fully understand what QA is about.

### 4. Model: DistilBERT + SQuAD

The current version of the notebook makes use of the `distilbert-base-uncased-distilled-squad` model.

DistilBERT paper: [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108)

> As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.

The DistilBERT model used here has already been fine-tuned on a question-answering dataset: SQuAD, the [Stanford Question Answering Dataset](https://rajpurkar.github.io/SQuAD-explorer/). This is the main reason it already performs well out of the box (a score of 0.666 on the leaderboard).
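For context, the leaderboard score in this competition is a word-level Jaccard similarity between the predicted and the true `selected_text`. The following is a minimal sketch, assuming lowercasing and whitespace tokenization; the function name `jaccard` and the guard for two empty strings are choices made for this illustration.

```python
def jaccard(str1: str, str2: str) -> float:
    """Word-level Jaccard similarity between two strings."""
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    if not a and not b:
        # Guard added for this sketch: treat two empty strings as identical
        return 1.0
    c = a & b
    return len(c) / (len(a) + len(b) - len(c))


# Predicting the whole tweet when only two words were selected scores low
print(jaccard("Sooo SAD", "Sooo SAD I will miss you here in San Diego!!!"))  # 0.2
```

A function like this can be used to score predictions on a held-out part of `train.csv` before submitting.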

#### 4.1 Training time

Thanks to the small size of the transformer model, the notebook runs quickly: training takes about 20 minutes on a _GPU_.


### 5. Publicly available datasets

#### 5.1 DistilBERT + SQuAD model
Because notebooks for the Tweet Sentiment Extraction competition must run with internet access switched off, I downloaded the transformer model and stored it in a public Kaggle dataset: [Transformers pre-trained distilBERT models](https://www.kaggle.com/jonathanbesomi/transformers-pretrained-distilbert). In the future, I plan to upload all [distilBERT pre-trained models](https://huggingface.co/transformers/pretrained_models.html) to the same dataset so that we can easily experiment with many models and configurations.

#### 5.2 Simple Transformers PyPI

To keep the code to the point, this notebook uses an external Python package: [simpletransformers](https://github.com/ThilinaRajapakse/simpletransformers). For your convenience, the wheel files needed to install the package are stored in this Kaggle dataset: [Simple Transformers PyPI](https://www.kaggle.com/jonathanbesomi/simple-transformers-pypi).


### 6. Acknowledgement

- [RoBERTa Baseline Starter (+ simple postprocessing)](https://www.kaggle.com/cheongwoongkang/roberta-baseline-starter-simple-postprocessing)

In [1]:
"""
LOAD DATA
"""

import numpy as np 
import pandas as pd 
import json


train_df = pd.read_csv('/kaggle/input/tweet-sentiment-extraction/train.csv')
test_df = pd.read_csv('/kaggle/input/tweet-sentiment-extraction/test.csv')
sub_df = pd.read_csv('/kaggle/input/tweet-sentiment-extraction/sample_submission.csv')

train = np.array(train_df)
test = np.array(test_df)

!mkdir -p data

"""
SETTINGS
"""

use_cuda = True # whether to use GPU or not

In [2]:
train_df.head()

|   | textID | text | selected_text | sentiment |
|---|--------|------|---------------|-----------|
| 0 | cb774db0d1 | I`d have responded, if I were going | I`d have responded, if I were going | neutral |
| 1 | 549e992a42 | Sooo SAD I will miss you here in San Diego!!! | Sooo SAD | negative |
| 2 | 088c60f138 | my boss is bullying me... | bullying me | negative |
| 3 | 9642c003ef | what interview! leave me alone | leave me alone | negative |
| 4 | 358bd9e861 | Sons of ****, why couldn`t they put them on t... | Sons of ****, | negative |


### Prepare data in QA format

Example format:

```
train_data = [
    {
        'context': "This tweet sentiment extraction challenge is great",
        'qas': [
            {
                'id': "00001",
                'question': "positive",
                'answers': [
                    {
                        'text': "is great",
                        'answer_start': 42
                    }
                ]
            }
        ]
    }
]
```
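The `answer_start` field is the 0-based character offset of the answer inside the context. The following is a minimal sketch of how one record in this format could be built with `str.find`; the helper name `make_record` is hypothetical and not part of the notebook.

```python
def make_record(qid: str, context: str, question: str, answer: str) -> dict:
    """Build one SQuAD-style record (hypothetical helper for illustration)."""
    start = context.find(answer)  # 0-based character offset, -1 if absent
    if start == -1:
        raise ValueError("answer must be a substring of context")
    return {
        'context': context,
        'qas': [{
            'id': qid,
            'question': question,
            'answers': [{'text': answer, 'answer_start': start}],
        }],
    }


record = make_record("088c60f138", "my boss is bullying me...", "negative", "bullying me")
answer = record['qas'][0]['answers'][0]
print(answer['answer_start'])  # 11
```

Slicing the context with `context[start:start + len(answer)]` should always give back the answer text.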

In [3]:
%%time

"""
Prepare training data in QA-compatible format
"""

# Adapted from https://www.kaggle.com/cheongwoongkang/roberta-baseline-starter-simple-postprocessing
def find_all(input_str, search_str):
    l1 = []
    length = len(input_str)
    index = 0
    while index < length:
        i = input_str.find(search_str, index)
        if i == -1:
            return l1
        l1.append(i)
        index = i + 1
    return l1

def do_qa_train(train):
    """Convert training rows (textID, text, selected_text, sentiment) to SQuAD-style records."""
    output = []
    for line in train:
        qid = line[0]        # textID
        context = line[1]    # text (the tweet)
        answer = line[2]     # selected_text
        question = line[-1]  # sentiment

        # Skip rows with missing values (NaN is read as float) and report them
        if not all(isinstance(v, str) for v in (context, answer, question)):
            print(context, type(context))
            print(answer, type(answer))
            print(question, type(question))
            continue

        # Keep only the first occurrence of the answer in the context
        answers = []
        answer_starts = find_all(context, answer)
        if answer_starts:
            answers.append({'answer_start': answer_starts[0], 'text': answer.lower()})

        qas = [{'question': question, 'id': qid, 'is_impossible': False, 'answers': answers}]
        output.append({'context': context.lower(), 'qas': qas})

    return output

qa_train = do_qa_train(train)

with open('data/train.json', 'w') as outfile:
    json.dump(qa_train, outfile)

nan <class 'float'>
nan <class 'float'>
neutral <class 'str'>
CPU times: user 1.23 s, sys: 24.6 ms, total: 1.26 s
Wall time: 1.25 s
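Because `do_qa_train` computes the offsets on the original text but stores the lowercased context, it can be worth checking that every stored offset still points at the stored answer text (lowercasing can change the length of a few Unicode characters). The following is a minimal sketch; `check_offsets` is a hypothetical helper, and it assumes the record layout shown in the example format above.

```python
def check_offsets(records):
    """Return ids of questions whose answer_start does not match the answer text."""
    bad_ids = []
    for record in records:
        context = record['context']
        for qa in record['qas']:
            for ans in qa['answers']:
                start = ans['answer_start']
                end = start + len(ans['text'])
                if context[start:end] != ans['text']:
                    bad_ids.append(qa['id'])
    return bad_ids


records = [
    {'context': "sooo sad i will miss you", 'qas': [
        {'id': "ok", 'question': "negative",
         'answers': [{'text': "sooo sad", 'answer_start': 0}]}]},
    {'context': "my boss is bullying me...", 'qas': [
        {'id': "shifted", 'question': "negative",
         'answers': [{'text': "bullying me", 'answer_start': 12}]}]},
]
print(check_offsets(records))  # ['shifted']
```

On the notebook's data this would be called as `check_offsets(qa_train)`; an empty list means all offsets are consistent.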


In [4]:
%%time

"""
Prepare testing data in QA-compatible format
"""

def do_qa_test(test):
    output = []
    for line in test:
        context = line[1]
        qas = []
        question = line[-1]
        qid = line[0]
        if type(context) != str or type(question) != str:
            print(context, type(context))
            print(question, type(question))
            continue
        # The test set has no labels: use a placeholder answer to keep the same record format
        answers = []
        answers.append({'answer_start': 1000000, 'text': '__None__'})
        qas.append({'question': question, 'id': qid, 'is_impossible': False, 'answers': answers})
        output.append({'context': context.lower(), 'qas': qas})
    return output

qa_test = do_qa_test(test)

with open('data/test.json', 'w') as outfile:
    json.dump(qa_test, outfile)

CPU times: user 140 ms, sys: 3.9 ms, total: 144 ms
Wall time: 143 ms


Install [simple-transformers](https://github.com/ThilinaRajapakse/simpletransformers), a tool that makes it easy to train and test transformer models.

In [5]:
!pip install '/kaggle/input/simple-transformers-pypi/seqeval-0.0.12-py3-none-any.whl' -q
!pip install '/kaggle/input/simple-transformers-pypi/simpletransformers-0.22.1-py3-none-any.whl' -q

### Train model

Train the `distilbert-base-uncased-distilled-squad` model

In [6]:
%%time


from simpletransformers.question_answering import QuestionAnsweringModel

MODEL_PATH = '/kaggle/input/transformers-pretrained-distilbert/distilbert-base-uncased-distilled-squad/'

# Create the QuestionAnsweringModel
model = QuestionAnsweringModel('distilbert', 
                               MODEL_PATH, 
                               args={'reprocess_input_data': True,
                                     'overwrite_output_dir': True,
                                     'learning_rate': 5e-5,
                                     'num_train_epochs': 3,
                                     'max_seq_length': 192,
                                     'doc_stride': 64,
                                     'fp16': False,
                                    },
                              use_cuda=use_cuda)

model.train_model('data/train.json')

100%|██████████| 27480/27480 [00:52<00:00, 528.13it/s]


Training progress: 3 epochs, 3435 iterations per epoch

Running loss: 4.023317



Running loss: 0.751208



Running loss: 0.510467



Running loss: 0.684219
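The progress output above can be sanity-checked with simple arithmetic: 27480 training examples and 3435 iterations per epoch are consistent with a batch size of 8. The batch size is an inference from these numbers, not a setting that appears in the notebook.

```python
import math

num_examples = 27480   # from the conversion progress bar in the training output
num_epochs = 3         # 'num_train_epochs' in the model args
batch_size = 8         # assumption inferred from the logged iteration count

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 3435 10305
```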

### Submission

In [7]:
%%time

predictions = model.predict(qa_test)
predictions_df = pd.DataFrame.from_dict(predictions)

sub_df['selected_text'] = predictions_df['answer']

sub_df.to_csv('submission.csv', index=False)

print("File submitted successfully.")

100%|██████████| 3534/3534 [00:05<00:00, 620.79it/s]




File submitted successfully.
CPU times: user 32.9 s, sys: 3.55 s, total: 36.5 s
Wall time: 36.6 s
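The submission cell assigns predictions to `sub_df` by row position, which relies on `predictions` having exactly one entry per test row, in the same order. A more defensive variant looks answers up by `textID`. The following is a minimal sketch in plain Python, assuming each prediction is a dict with `id` and `answer` keys (the `answer` key is used in the cell above; the `id` key is an assumption). `align_answers` is a hypothetical helper.

```python
def align_answers(text_ids, predictions, default=""):
    """Return one answer per text id, in the order of `text_ids`."""
    answer_by_id = {p['id']: p['answer'] for p in predictions}
    return [answer_by_id.get(text_id, default) for text_id in text_ids]


text_ids = ["a1", "b2", "c3"]
predictions = [
    {'id': "c3", 'answer': "leave me alone"},
    {'id': "a1", 'answer': "sooo sad"},
]
print(align_answers(text_ids, predictions))
# ['sooo sad', '', 'leave me alone']
```

With the notebook's data this could look like `sub_df['selected_text'] = align_answers(sub_df['textID'], predictions)`, so that a skipped test row gets an empty answer instead of shifting all later rows.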
