# Sequence Post-Processing V2

This is a fork of Chris Deotte's [notebook](https://www.kaggle.com/cdeotte/pytorch-bigbird-ner-cv-0-615). Thanks Chris!

This notebook contains an ensemble of 5 BigBird models trained on different subsets of the dataset. It also contains a post-processing pipeline that uses XGBoost to convert token-level predictions into sequence-level predictions. Because models like BigBird use a per-token loss function, and this competition uses a per-sequence scoring metric, post-processing token or word predictions into sequence predictions can yield significant performance gains. Each model in the ensemble is trained on a randomly chosen 93% subset of the entire train set, with a distinct 7% subset left out from each training set.
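
As a rough sketch of the 93%/7% scheme described above, the snippet below builds five train/hold-out splits whose hold-out sets do not overlap. The function name `make_holdout_folds`, the seed, and the text count of 1000 are assumptions for illustration; the notebook itself does the equivalent with NumPy.

```python
import random

def make_holdout_folds(n_texts, n_folds=5, holdout_frac=0.07, seed=42):
    """Return (train_idx, valid_idx) pairs with pairwise-disjoint hold-out sets."""
    indices = list(range(n_texts))
    random.Random(seed).shuffle(indices)
    valid_len = int(holdout_frac * n_texts)
    folds = []
    for fold in range(n_folds):
        # Each fold takes the next consecutive slice of the shuffled indices
        valid_idx = indices[fold * valid_len:(fold + 1) * valid_len]
        held_out = set(valid_idx)
        train_idx = [i for i in range(n_texts) if i not in held_out]
        folds.append((train_idx, valid_idx))
    return folds

# Hypothetical dataset size, chosen only to make the arithmetic easy to check
folds = make_holdout_folds(n_texts=1000)
```

Because every model leaves out a different slice, each text in those slices is unseen by exactly one ensemble member.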

Additionally, this notebook shows how this competition can be treated as 7 separate classification tasks. This is because, to my knowledge, the scoring metric does not take discourse type/class intersections into account. See this [discussion](https://www.kaggle.com/c/feedback-prize-2021/discussion/300001). Each class is predicted with a separate sequence classifier, and no measures are taken to prevent sequence intersections between classes.
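
To make the "7 separate tasks" view concrete, here is a minimal sketch of how the 15-way word probabilities can be reduced to one score per discourse type. It combines the `B-` and `I-` probabilities in the same way as the `prob_or` lambda used later in this notebook. The helper names and the example probabilities are made up for illustration.

```python
OUTPUT_LABELS = ['O', 'B-Lead', 'I-Lead', 'B-Position', 'I-Position',
                 'B-Claim', 'I-Claim', 'B-Counterclaim', 'I-Counterclaim',
                 'B-Rebuttal', 'I-Rebuttal', 'B-Evidence', 'I-Evidence',
                 'B-Concluding Statement', 'I-Concluding Statement']

def class_columns(disc_type):
    """Column indices of the Begin and Inside labels for one discourse type."""
    return (OUTPUT_LABELS.index(f"B-{disc_type}"),
            OUTPUT_LABELS.index(f"I-{disc_type}"))

def begin_or_inside(word_probs, disc_type):
    """Per-word score 1 - (1 - p_B) * (1 - p_I) for a single discourse type."""
    b, i = class_columns(disc_type)
    return [1 - (1 - p[b]) * (1 - p[i]) for p in word_probs]

# Two words with invented 15-class probabilities (each row sums to 1)
word_probs = [
    [0.3, 0, 0, 0, 0, 0.2, 0.5, 0, 0, 0, 0, 0, 0, 0, 0],
    [0.9, 0, 0, 0, 0, 0.0, 0.1, 0, 0, 0, 0, 0, 0, 0, 0],
]
claim_scores = begin_or_inside(word_probs, "Claim")
```

Each discourse type gets its own score vector, so the classes never compete with each other at this stage.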

Secondary datasets are created for which each sample is a sub-sequence of words in a text. Class probability predictions from BigBird are used to generate features for each of these samples. A gradient boosting classifier is trained to predict the probability of a true positive for each discourse type. It was discovered that training these classifiers on out-of-sample predictions did not yield a significant performance increase, so they were trained on as much of the dataset as practically possible (50% at present). 

Most of the post-processing code and description is at the bottom of this notebook.

Currently this notebook uses

* backbone BigBird  (with HuggingFace's head for TokenClassification)
* NER formulation (with `is_split_into_words=False` tokenization)
* 5 models trained on 93% of training data

This notebook uses many code cells from Raghavendrakotala's great notebook [here][1]. Don't forget to upvote Raghavendrakotala's notebook :-)

[1]: https://www.kaggle.com/raghavendrakotala/fine-tunned-on-roberta-base-as-ner-problem-0-533
[2]: https://www.kaggle.com/cdeotte/tensorflow-longformer-ner-cv-0-617
[3]: https://arxiv.org/abs/2007.14062

**Changes in V2:**
* BigBird ensemble
* larger sequence classifier training set
* tuned XGBoost model
* sequence position features
* more inclusive heuristic constraints for considered word sub-sequences
* resampling samples in sequence classification to 1:1
* AdamW net training
* more granular features (7 quantiles instead of 5) to represent distribution of class probabilities for a sequence
* increased iterations (100) of sequence classifier probability threshold tuning
* computation efficiency improvements

# [Discussion: Ideas for Future Work](https://www.kaggle.com/c/feedback-prize-2021/discussion/308511)

# Configuration
This notebook can either train a new model or load a previously trained model (made by a previous notebook version). Likewise, it can either create new NER labels or load existing NER labels (also made by a previous notebook version). In this notebook version, we load both the model and the NER labels.

The notebook can also load HuggingFace files (like tokenizers) from a Kaggle dataset, or download them from the internet. If it downloads them, you can save them to a Kaggle dataset so that next time you can turn internet off.

In [None]:
import os, sys
# DECLARE HOW MANY GPUS YOU WISH TO USE. 
# KAGGLE ONLY HAS 1, BUT OFFLINE, YOU CAN USE MORE
os.environ["CUDA_VISIBLE_DEVICES"]="0" #0,1,2,3 for four gpu

# VERSION FOR SAVING MODEL WEIGHTS
VER=26

# IF VARIABLE IS NONE, THEN NOTEBOOK COMPUTES TOKENS
# OTHERWISE NOTEBOOK LOADS TOKENS FROM PATH
LOAD_TOKENS_FROM = '../input/py-bigbird-v26'

# IF VARIABLE IS NONE, THEN NOTEBOOK TRAINS A NEW MODEL
# OTHERWISE IT LOADS YOUR PREVIOUSLY TRAINED MODEL
LOAD_MODEL_FROM = '../input/fullensemble'

# Use the entire ensemble.
ENSEMBLE_IDS = list(range(5))

# Setting Fold = None leaves out an arbitrary 10% of the dataset for sequence classifier training.
# Setting Fold to one of [0,1,2,3,4] leaves out the portion of the dataset not trained on by the corresponding ensemble model.
# 'half' leaves out an arbitrary 50%.
FOLD = None

# IF FOLLOWING IS NONE, THEN NOTEBOOK 
# USES INTERNET AND DOWNLOADS HUGGINGFACE 
# CONFIG, TOKENIZER, AND MODEL
DOWNLOADED_MODEL_PATH = '../input/py-bigbird-v26' 

if DOWNLOADED_MODEL_PATH is None:
    DOWNLOADED_MODEL_PATH = 'model'    
MODEL_NAME = 'google/bigbird-roberta-base'

# Tune the probability threshold for sequence classifiers to maximize F1
TRAIN_SEQ_CLASSIFIERS = False

# A cache of the BigBird predictions for the validation/sequence training set and the corresponding sequence dataset
KAGGLE_CACHE = '../input/feedbackcache2'

cache = 'cache'
cacheExists = os.path.exists(cache)
if not cacheExists:
  os.makedirs(cache)

In [None]:
# skopt optimizer has a bug when scipy is installed with its default version
if TRAIN_SEQ_CLASSIFIERS:
    os.system('pip install --no-dependencies scipy==1.5.2 ')

In [None]:
from torch import cuda
config = {'model_name': MODEL_NAME,   
         'max_length': 1024,
         'train_batch_size':4,
         'valid_batch_size':4,
         'epochs':5,
         'learning_rates': [2.5e-5, 2.5e-5, 2.5e-6, 2.5e-6, 2.5e-7],
         'max_grad_norm':10,
         'device': 'cuda' if cuda.is_available() else 'cpu'}

# How To Submit PyTorch Without Internet
Many people ask me, how do I submit PyTorch models without internet? With HuggingFace Transformers, it's easy. Just download the following 3 things: (1) model weights, (2) tokenizer files, (3) config file, and upload them to a Kaggle dataset. The code below shows how to get the files from HuggingFace for Google's BigBird-base. The same code can download any other transformer, for example roberta-base.

In [None]:
from transformers import *
if DOWNLOADED_MODEL_PATH == 'model':
    os.mkdir('model')
    
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, add_prefix_space=True)
    tokenizer.save_pretrained('model')

    config_model = AutoConfig.from_pretrained(MODEL_NAME) 
    config_model.num_labels = 15
    config_model.save_pretrained('model')

    backbone = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, 
                                                               config=config_model)
    backbone.save_pretrained('model')

# Load Data and Libraries
In addition to loading the train dataframe, we will load all the train and text files and save them in a dataframe.

In [None]:
import numpy as np, os 
from scipy import stats
import pandas as pd, gc 
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForTokenClassification, AdamW


from torch.utils.data import Dataset, DataLoader
import torch
from sklearn.metrics import accuracy_score
from torch.cuda import amp

In [None]:
train_df = pd.read_csv('../input/feedback-prize-2021/train.csv')
print( train_df.shape )
train_df.head()

In [None]:
# https://www.kaggle.com/raghavendrakotala/fine-tunned-on-roberta-base-as-ner-problem-0-533
test_names, test_texts = [], []
for f in list(os.listdir('../input/feedback-prize-2021/test')):
    test_names.append(f.replace('.txt', ''))
    test_texts.append(open('../input/feedback-prize-2021/test/' + f, 'r').read())
test_texts = pd.DataFrame({'id': test_names, 'text': test_texts})

SUBMISSION = False
if len(test_names) > 5:
      SUBMISSION = True

test_texts.head()

In [None]:
# https://www.kaggle.com/raghavendrakotala/fine-tunned-on-roberta-base-as-ner-problem-0-533
# Use a separate list for train ids so the test ids collected above are not overwritten
train_names, train_texts = [], []
for f in tqdm(list(os.listdir('../input/feedback-prize-2021/train'))):
    train_names.append(f.replace('.txt', ''))
    train_texts.append(open('../input/feedback-prize-2021/train/' + f, 'r').read())
train_text_df = pd.DataFrame({'id': train_names, 'text': train_texts})
train_text_df.head()

# Convert Train Text to NER Labels
We will now convert all text words into NER labels and save in a dataframe.
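
A small, self-contained sketch of the conversion may help. It turns `predictionstring` word indices into `B-`/`I-` labels for one text, mirroring the loop in the cell below. The function name `to_bio_labels`, the example text, and the annotations are invented for illustration.

```python
def to_bio_labels(text, annotations):
    """annotations: list of (discourse_type, predictionstring) pairs for one text."""
    labels = ["O"] * len(text.split())
    for discourse_type, predictionstring in annotations:
        word_idx = [int(x) for x in predictionstring.split()]
        # First word of a discourse element is Begin, the rest are Inside
        labels[word_idx[0]] = f"B-{discourse_type}"
        for k in word_idx[1:]:
            labels[k] = f"I-{discourse_type}"
    return labels

text = "Phones are useful. They help in emergencies."
annotations = [("Position", "0 1 2"), ("Evidence", "3 4 5 6")]
labels = to_bio_labels(text, annotations)
```

Words that no annotation covers keep the `O` label.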

In [None]:
if not LOAD_TOKENS_FROM:
    all_entities = []
    for ii,i in enumerate(train_text_df.iterrows()):
        if ii%100==0: print(ii,', ',end='')
        total = len(i[1]['text'].split())
        entities = ["O"]*total
        for j in train_df[train_df['id'] == i[1]['id']].iterrows():
            discourse = j[1]['discourse_type']
            list_ix = [int(x) for x in j[1]['predictionstring'].split(' ')]
            entities[list_ix[0]] = f"B-{discourse}"
            for k in list_ix[1:]: entities[k] = f"I-{discourse}"
        all_entities.append(entities)
    train_text_df['entities'] = all_entities
    train_text_df.to_csv('train_NER.csv',index=False)
    
else:
    from ast import literal_eval
    train_text_df = pd.read_csv(f'{LOAD_TOKENS_FROM}/train_NER.csv')
    # pandas saves lists as string, we must convert back
    train_text_df.entities = train_text_df.entities.apply(lambda x: literal_eval(x) )
    
print( train_text_df.shape )
train_text_df.head()

In [None]:
# CREATE DICTIONARIES THAT WE CAN USE DURING TRAIN AND INFER
output_labels = ['O', 'B-Lead', 'I-Lead', 'B-Position', 'I-Position', 'B-Claim', 'I-Claim', 'B-Counterclaim', 'I-Counterclaim', 
          'B-Rebuttal', 'I-Rebuttal', 'B-Evidence', 'I-Evidence', 'B-Concluding Statement', 'I-Concluding Statement']

labels_to_ids = {v:k for k,v in enumerate(output_labels)}
ids_to_labels = {k:v for k,v in enumerate(output_labels)}
disc_type_to_ids = {'Evidence':(11,12),'Claim':(5,6),'Lead':(1,2),'Position':(3,4),'Counterclaim':(7,8),'Rebuttal':(9,10),'Concluding Statement':(13,14)}

In [None]:
labels_to_ids

# Define the dataset function
Below is our PyTorch dataset function. It always outputs tokens and attention. During training it also provides labels. And during inference it also provides word ids to help convert token predictions into word predictions.

Note that we use `text.split()` and `is_split_into_words=True` when we convert train text to labeled train tokens. This is how the HuggingFace tutorial does it. However, this removes characters like `\n` (new paragraph). If you want your model to see new paragraphs, then you need to map words to tokens yourself using `return_offsets_mapping=True`. See my TensorFlow notebook [here][1] for an example.
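
As a rough illustration of mapping tokens back to `split()` words through character offsets, the sketch below builds a character-to-word lookup and picks, for a token's `(start, end)` span, the word that covers most of its characters. The names `char_to_word_index` and `word_for_token` are hypothetical, and the spans in the example are written by hand rather than produced by a real tokenizer.

```python
import re
from collections import Counter

def char_to_word_index(text):
    """For each character, the index of the whitespace-separated word it belongs to (-1 for whitespace)."""
    mapping = [-1] * len(text)
    for word_i, match in enumerate(re.finditer(r"\S+", text)):
        for pos in range(match.start(), match.end()):
            mapping[pos] = word_i
    return mapping

def word_for_token(mapping, start, end):
    """Word sharing the most characters with the span [start, end), or -1 if none."""
    covered = [w for w in mapping[start:end] if w != -1]
    if not covered:
        return -1
    return Counter(covered).most_common(1)[0][0]

text = "First paragraph.\n\nSecond one."
mapping = char_to_word_index(text)

# A token whose span starts on the space before "paragraph." still maps to word 1
paragraph_word = word_for_token(mapping, 5, 16)
# A token covering only the two newlines maps to no word
newline_word = word_for_token(mapping, 16, 18)
```

With this kind of lookup the tokenizer can see the raw text, including newlines, while labels and predictions stay aligned with `split()` word indices.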

Some of the following code comes from the example at HuggingFace [here][2]. However I think the code at that link is wrong. The HuggingFace original code is [here][3]. With the flag `LABEL_ALL` we can either label just the first subword token (when one word has more than one subword token). Or we can label all the subword tokens (with the word's label). In this notebook version, we label all the tokens. There is a Kaggle discussion [here][4]

[1]: https://www.kaggle.com/cdeotte/tensorflow-longformer-ner-cv-0-617
[2]: https://huggingface.co/docs/transformers/custom_datasets#tok_ner
[3]: https://github.com/huggingface/transformers/blob/86b40073e9aee6959c8c85fcba89e47b432c4f4d/examples/pytorch/token-classification/run_ner.py#L371
[4]: https://www.kaggle.com/c/feedback-prize-2021/discussion/296713

In [None]:
# Return an array that maps character index to index of word in list of split() words
def split_mapping(unsplit):
    splt = unsplit.split()
    offset_to_wordidx = np.full(len(unsplit),-1)
    txt_ptr = 0
    for split_index, full_word in enumerate(splt):
        while unsplit[txt_ptr:txt_ptr + len(full_word)] != full_word:
            txt_ptr += 1
        offset_to_wordidx[txt_ptr:txt_ptr + len(full_word)] = split_index
        txt_ptr += len(full_word)
    return offset_to_wordidx

In [None]:
class dataset(Dataset):
  def __init__(self, dataframe, tokenizer, max_len, get_wids):
        self.len = len(dataframe)
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len
        self.get_wids = get_wids # for validation

  def __getitem__(self, index):
        # GET TEXT AND WORD LABELS 
        text = self.data.text[index]        
        word_labels = self.data.entities[index] if not self.get_wids else None

        # TOKENIZE TEXT
        encoding = self.tokenizer(text,
                             return_offsets_mapping=True, 
                             padding='max_length', 
                             truncation=True, 
                             max_length=self.max_len)
        
        word_ids = encoding.word_ids()  
        split_word_ids = np.full(len(word_ids),-1)
        offset_to_wordidx = split_mapping(text)
        offsets = encoding['offset_mapping']
        
        # CREATE TARGETS AND MAPPING OF TOKENS TO SPLIT() WORDS
        label_ids = []
        # Iterate in reverse to label whitespace tokens until a Begin token is encountered
        for token_idx, word_idx in reversed(list(enumerate(word_ids))):
            
            if word_idx is None:
                if not self.get_wids: label_ids.append(-100)
            else:
                if offsets[token_idx] != (0,0):
                    #Choose the split word that shares the most characters with the token if any
                    split_idxs = offset_to_wordidx[offsets[token_idx][0]:offsets[token_idx][1]]
                    split_index = stats.mode(split_idxs[split_idxs != -1]).mode[0] if len(np.unique(split_idxs)) > 1 else split_idxs[0]
                    
                    if split_index != -1: 
                        if not self.get_wids: label_ids.append( labels_to_ids[word_labels[split_index]] )
                        split_word_ids[token_idx] = split_index
                    else:
                        # Even if we don't find a word, continue labeling 'I' tokens until a 'B' token is found
                        if label_ids and label_ids[-1] != -100 and ids_to_labels[label_ids[-1]][0] == 'I':
                            split_word_ids[token_idx] = split_word_ids[token_idx + 1]
                            if not self.get_wids: label_ids.append(label_ids[-1])
                        else:
                            if not self.get_wids: label_ids.append(-100)
                else:
                    if not self.get_wids: label_ids.append(-100)
        
        encoding['labels'] = list(reversed(label_ids))

        # CONVERT TO TORCH TENSORS
        item = {key: torch.as_tensor(val) for key, val in encoding.items()}
        if self.get_wids: 
            item['wids'] = torch.as_tensor(split_word_ids)
        
        return item

  def __len__(self):
        return self.len

# Create Train and Validation Dataloaders
We will use the same train and validation subsets as my TensorFlow notebook [here][1]. Then we can compare results. And/or experiment with ensembling the validation fold predictions.

[1]: https://www.kaggle.com/cdeotte/tensorflow-longformer-ner-cv-0-617

In [None]:
# CHOOSE VALIDATION INDEXES (that match my TF notebook)
IDS = train_df.id.unique()


np.random.seed(42)

if FOLD == 'half':
    train_idx = np.random.choice(np.arange(len(IDS)),int(0.5*len(IDS)),replace=False)
    valid_idx = np.setdiff1d(np.arange(len(IDS)),train_idx)
    
elif FOLD is not None:
    print('There are',len(IDS),'train texts. We will split 93% 7% for ensemble training.')
    shuffled_ids = np.arange(len(IDS))
    np.random.shuffle(shuffled_ids)

    valid_len = int(.07 * len(IDS))
    valid_idx = shuffled_ids[FOLD*valid_len:(FOLD+1)*valid_len]
    train_idx = np.setdiff1d(np.arange(len(IDS)),valid_idx)
    
else:
    print('There are',len(IDS),'train texts. We will split 90% 10% for ensemble training.')
    train_idx = np.random.choice(np.arange(len(IDS)),int(0.9*len(IDS)),replace=False)
    valid_idx = np.setdiff1d(np.arange(len(IDS)),train_idx)

np.random.seed(None)

In [None]:
# CREATE TRAIN SUBSET AND VALID SUBSET
data = train_text_df[['id','text', 'entities']]
train_dataset = data.loc[data['id'].isin(IDS[train_idx]),['text', 'entities']].reset_index(drop=True)
test_dataset = data.loc[data['id'].isin(IDS[valid_idx])].reset_index(drop=True)

print("FULL Dataset: {}".format(data.shape))
print("TRAIN Dataset: {}".format(train_dataset.shape))
print("TEST Dataset: {}".format(test_dataset.shape))

tokenizer = AutoTokenizer.from_pretrained(DOWNLOADED_MODEL_PATH) 
training_set = dataset(train_dataset, tokenizer, config['max_length'], False)
testing_set = dataset(test_dataset, tokenizer, config['max_length'], True)

In [None]:
# TRAIN DATASET AND VALID DATASET
train_params = {'batch_size': config['train_batch_size'],
                'shuffle': True,
                'num_workers': 2,
                'pin_memory':True
                }

test_params = {'batch_size': config['valid_batch_size'],
                'shuffle': False,
                'num_workers': 2,
                'pin_memory':True
                }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

# TEST DATASET
test_texts_set = dataset(test_texts, tokenizer, config['max_length'], True)
test_texts_loader = DataLoader(test_texts_set, **test_params)

# Train Model
The PyTorch train function is taken from Raghavendrakotala's great notebook [here][1]. I assume it uses a masked loss which avoids computing loss when target is `-100`. If not, we need to update this.
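
For reference, `torch.nn.CrossEntropyLoss` has `ignore_index=-100` as its default, so targets equal to `-100` do not contribute to the loss. The pure-Python sketch below shows what such a masked mean cross-entropy computes. It is an illustration, not the code the model runs, and returning `0.0` for a fully masked batch is a choice made only for this sketch.

```python
import math

IGNORE_INDEX = -100

def masked_cross_entropy(logits, targets, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over tokens whose target is not ignore_index."""
    total, count = 0.0, 0
    for token_logits, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padding, special tokens, or unlabeled tokens
        # Numerically stable log-sum-exp
        max_logit = max(token_logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in token_logits))
        total += log_sum_exp - token_logits[target]
        count += 1
    return total / count if count else 0.0

logits = [[0.0, 0.0], [2.0, 0.0], [5.0, -5.0]]
targets = [0, IGNORE_INDEX, IGNORE_INDEX]
loss = masked_cross_entropy(logits, targets)
```

Only the first token counts here, so the loss equals `log(2)` no matter what the masked rows contain.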

[1]: https://www.kaggle.com/raghavendrakotala/fine-tunned-on-roberta-base-as-ner-problem-0-533

In [None]:
# https://www.kaggle.com/raghavendrakotala/fine-tunned-on-roberta-base-as-ner-problem-0-533
def train(epoch):
    tr_loss, tr_accuracy = 0, 0
    nb_tr_examples, nb_tr_steps = 0, 0
    #tr_preds, tr_labels = [], []
    
    # put model in training mode
    model.train()
    
    for idx, batch in enumerate(training_loader):
        
        ids = batch['input_ids'].to(config['device'], dtype = torch.long)
        mask = batch['attention_mask'].to(config['device'], dtype = torch.long)
        labels = batch['labels'].to(config['device'], dtype = torch.long)

        with amp.autocast():
            loss, tr_logits = model(input_ids=ids, attention_mask=mask, labels=labels,
                                   return_dict=False)
        tr_loss += loss.item()

        nb_tr_steps += 1
        nb_tr_examples += labels.size(0)
        
        if idx % 200==0:
            loss_step = tr_loss/nb_tr_steps
            print(f"Training loss after {idx:04d} training steps: {loss_step}")
           
        # compute training accuracy
        flattened_targets = labels.view(-1) # shape (batch_size * seq_len,)
        active_logits = tr_logits.view(-1, model.num_labels) # shape (batch_size * seq_len, num_labels)
        flattened_predictions = torch.argmax(active_logits, axis=1) # shape (batch_size * seq_len,)
        
        # only compute accuracy at active labels
        active_accuracy = labels.view(-1) != -100 # shape (batch_size * seq_len,)
        #active_labels = torch.where(active_accuracy, labels.view(-1), torch.tensor(-100).type_as(labels))
        
        labels = torch.masked_select(flattened_targets, active_accuracy)
        predictions = torch.masked_select(flattened_predictions, active_accuracy)
        
        #tr_labels.extend(labels)
        #tr_preds.extend(predictions)

        tmp_tr_accuracy = accuracy_score(labels.cpu().numpy(), predictions.cpu().numpy())
        tr_accuracy += tmp_tr_accuracy
    
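        # backward pass
        optimizer.zero_grad()
        scaler.scale(loss).backward()

        # gradient clipping: must run after backward() so it acts on this step's gradients;
        # unscale first so the norm is measured on the true (unscaled) gradients
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(
            parameters=model.parameters(), max_norm=config['max_grad_norm']
        )

        # optimizer step
        scaler.step(optimizer)
        scaler.update()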

    epoch_loss = tr_loss / nb_tr_steps
    tr_accuracy = tr_accuracy / nb_tr_steps
    print(f"Training loss epoch: {epoch_loss}")
    print(f"Training accuracy epoch: {tr_accuracy}")

In [None]:
# CREATE MODEL
scaler = amp.GradScaler()
config_model = AutoConfig.from_pretrained(DOWNLOADED_MODEL_PATH+'/config.json') 
model = AutoModelForTokenClassification.from_pretrained(
                   DOWNLOADED_MODEL_PATH+'/pytorch_model.bin',config=config_model)
model.to(config['device'])
optimizer = AdamW(params=model.parameters(), lr=config['learning_rates'][0])

In [None]:
import warnings

warnings.filterwarnings('ignore', '.*__floordiv__ is deprecated.*',)

# LOOP TO TRAIN MODEL (or load model)
if not LOAD_MODEL_FROM:
    for epoch in range(config['epochs']):
        
        print(f"### Training epoch: {epoch + 1}")
        for g in optimizer.param_groups: 
            g['lr'] = config['learning_rates'][epoch]
        lr = optimizer.param_groups[0]['lr']
        print(f'### LR = {lr}\n')
        
        train(epoch)
        torch.cuda.empty_cache()
        gc.collect()
        
    torch.save(model.state_dict(), f'bigbird_v{VER}.pt')

# Inference and Validation Code
We will infer in batches using our data loader which is faster than inferring one text at a time with a for-loop. The metric code is taken from Rob Mulla's great notebook [here][2]. Our model achieves validation F1 score 0.615! 

During inference our model will make predictions for each subword token. Some single words consist of multiple subword tokens. In the code below, we average the class probabilities of all subword tokens that belong to a word and use the result as the prediction for the entire word. We can try other approaches, like using only the first subword token or taking `B` labels before `I` labels etc.
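
The `inference` function in the next cell returns, per word, the mean class probability over that word's tokens. A minimal sketch of that aggregation step, with made-up two-class probabilities and hypothetical helper names, could look like this:

```python
def mean_vector(rows):
    """Column-wise mean of a list of equally sized probability vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def average_tokens_per_word(token_probs, word_ids):
    """Average consecutive tokens that share a word id; -1 means 'no word'."""
    word_probs, buffer, previous = [], [], None
    for probs, word_id in zip(token_probs, word_ids):
        if word_id == -1:
            continue
        if word_id != previous and buffer:
            # A new word starts: flush the tokens collected for the previous one
            word_probs.append(mean_vector(buffer))
            buffer = []
        buffer.append(probs)
        previous = word_id
    if buffer:
        word_probs.append(mean_vector(buffer))
    return word_probs

token_probs = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8], [0.0, 1.0]]
word_ids = [0, 0, -1, 1]  # tokens 0 and 1 form word 0, token 2 maps to no word
word_probs = average_tokens_per_word(token_probs, word_ids)
```

Averaging keeps information from every subword, which is useful here because the later sequence classifier consumes probabilities rather than hard labels.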

[1]: https://www.kaggle.com/raghavendrakotala/fine-tunned-on-roberta-base-as-ner-problem-0-533
[2]: https://www.kaggle.com/robikscube/student-writing-competition-twitch

In [None]:

# Returns per-word, mean class prediction probability over all tokens corresponding to each word
def inference(data_loader, model_ids):
    
    gc.collect()
    torch.cuda.empty_cache()
    
    ensemble_preds = np.zeros((len(data_loader.dataset), config['max_length'], len(labels_to_ids)), dtype=np.float32)
    wids = np.full((len(data_loader.dataset), config['max_length']), -100)
    for model_i, model_id in enumerate(model_ids):
        
        model.load_state_dict(torch.load(f'{LOAD_MODEL_FROM}/ensemble-{model_id}.pt', map_location=config['device']))
        
        # put model in evaluation mode
        model.eval()
        for batch_i, batch in enumerate(data_loader):
            
            if model_i == 0: wids[batch_i*config['valid_batch_size']:(batch_i+1)*config['valid_batch_size']] = batch['wids'].numpy()

            # MOVE BATCH TO GPU AND INFER
            ids = batch["input_ids"].to(config['device'])
            mask = batch["attention_mask"].to(config['device'])
            with amp.autocast():
                outputs = model(ids, attention_mask=mask, return_dict=False)
            all_preds = torch.nn.functional.softmax(outputs[0], dim=2).cpu().detach().numpy() 
            ensemble_preds[batch_i*config['valid_batch_size']:(batch_i+1)*config['valid_batch_size']] += all_preds
            
            del ids
            del mask
            del outputs
            del all_preds
            
        gc.collect()
        torch.cuda.empty_cache()
            
    ensemble_preds /= len(model_ids)
    predictions = []
    # ITERATE THROUGH EACH TEXT AND GET PRED
    for text_i in range(ensemble_preds.shape[0]):
        token_preds = ensemble_preds[text_i]
        
        prediction = []
        previous_word_idx = -1
        prob_buffer = []
        word_ids = wids[text_i][wids[text_i] != -100]
        for idx,word_idx in enumerate(word_ids):                            
            if word_idx == -1:
                pass
            elif word_idx != previous_word_idx:              
                if prob_buffer:
                    prediction.append(np.mean(prob_buffer, dtype=np.float32, axis=0))
                    prob_buffer = []
                prob_buffer.append(token_preds[idx])
                previous_word_idx = word_idx
            else: 
                prob_buffer.append(token_preds[idx])
        prediction.append(np.mean(prob_buffer, dtype=np.float32, axis=0))
        predictions.append(prediction)
            
    gc.collect()
    torch.cuda.empty_cache()
    return predictions

In [None]:
# from Rob Mulla @robikscube
# https://www.kaggle.com/robikscube/student-writing-competition-twitch
def calc_overlap(row):
    """
    Calculates the overlap between prediction and
    ground truth and overlap percentages used for determining
    true positives.
    """
    set_pred = set(row.predictionstring_pred.split(' '))
    set_gt = set(row.predictionstring_gt.split(' '))
    # Length of each and intersection
    len_gt = len(set_gt)
    len_pred = len(set_pred)
    inter = len(set_gt.intersection(set_pred))
    overlap_1 = inter / len_gt
    overlap_2 = inter/ len_pred
    return [overlap_1, overlap_2]


def score_feedback_comp(pred_df, gt_df):
    """
    A function that scores for the kaggle
        Student Writing Competition
        
    Uses the steps in the evaluation page here:
        https://www.kaggle.com/c/feedback-prize-2021/overview/evaluation
    """
    gt_df = gt_df[['id','discourse_type','predictionstring']] \
        .reset_index(drop=True).copy()
    pred_df = pred_df[['id','class','predictionstring']] \
        .reset_index(drop=True).copy()
    pred_df['pred_id'] = pred_df.index
    gt_df['gt_id'] = gt_df.index
    # Step 1. all ground truths and predictions for a given class are compared.
    joined = pred_df.merge(gt_df,
                           left_on=['id','class'],
                           right_on=['id','discourse_type'],
                           how='outer',
                           suffixes=('_pred','_gt')
                          )
    joined['predictionstring_gt'] = joined['predictionstring_gt'].fillna(' ')
    joined['predictionstring_pred'] = joined['predictionstring_pred'].fillna(' ')

    joined['overlaps'] = joined.apply(calc_overlap, axis=1)

    # 2. If the overlap between the ground truth and prediction is >= 0.5, 
    # and the overlap between the prediction and the ground truth >= 0.5,
    # the prediction is a match and considered a true positive.
    # If multiple matches exist, the match with the highest pair of overlaps is taken.
    joined['overlap1'] = joined['overlaps'].apply(lambda x: x[0])
    joined['overlap2'] = joined['overlaps'].apply(lambda x: x[1])


    joined['potential_TP'] = (joined['overlap1'] >= 0.5) & (joined['overlap2'] >= 0.5)
    joined['max_overlap'] = joined[['overlap1','overlap2']].max(axis=1)
    tp_pred_ids = joined.query('potential_TP') \
        .sort_values('max_overlap', ascending=False) \
        .groupby(['id','predictionstring_gt']).first()['pred_id'].values

    # 3. Any unmatched ground truths are false negatives
    # and any unmatched predictions are false positives.
    fp_pred_ids = [p for p in joined['pred_id'].unique() if p not in tp_pred_ids]

    matched_gt_ids = joined.query('potential_TP')['gt_id'].unique()
    unmatched_gt_ids = [c for c in joined['gt_id'].unique() if c not in matched_gt_ids]

    # Get numbers of each type
    TP = len(tp_pred_ids)
    FP = len(fp_pred_ids)
    FN = len(unmatched_gt_ids)
    #calc microf1
    my_f1_score = TP / (TP + 0.5*(FP+FN))
    return my_f1_score
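
To see the matching rule from the scoring code on small numbers, the sketch below checks the two overlap ratios for a single prediction/ground-truth pair and evaluates the same F1 formula. The function names and the example strings are invented, and both strings are assumed to be non-empty.

```python
def overlap_ratios(pred_string, gt_string):
    """Share of the ground truth covered by the prediction, and vice versa."""
    pred = set(pred_string.split())
    gt = set(gt_string.split())
    inter = len(pred & gt)
    return inter / len(gt), inter / len(pred)

def is_potential_true_positive(pred_string, gt_string, threshold=0.5):
    over_gt, over_pred = overlap_ratios(pred_string, gt_string)
    return over_gt >= threshold and over_pred >= threshold

def f1_from_counts(tp, fp, fn):
    return tp / (tp + 0.5 * (fp + fn))

# 3 shared words: 3/6 of the ground truth and 3/4 of the prediction
close_match = is_potential_true_positive("3 4 5 6", "4 5 6 7 8 9")
# 2 shared words: 2/3 of the ground truth but only 2/10 of the prediction
too_long = is_potential_true_positive("0 1 2 3 4 5 6 7 8 9", "8 9 10")
example_f1 = f1_from_counts(tp=3, fp=1, fn=2)
```

A prediction that swallows a ground-truth span but is much longer than it fails the second ratio, which is why span boundaries matter so much for this metric.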

Aggregate the per-word, mean class probability predictions from BigBird for validation and submit sets.

In [None]:
import pickle
valid = train_df.loc[train_df['id'].isin(IDS[valid_idx])]

print('Predicting with BigBird...')
if not SUBMISSION:
    try:
        with open( KAGGLE_CACHE + "/valid_preds.p", "rb" ) as validFile:
            valid_word_preds = pickle.load( validFile )    
    except Exception:
        # No usable cache: run the ensemble on the validation set
        valid_word_preds = inference(testing_loader, ENSEMBLE_IDS)
else: valid_word_preds = []
        
test_word_preds = inference(test_texts_loader, ENSEMBLE_IDS)
    
with open( cache + "/valid_preds.p", "wb" ) as validFile:
    pickle.dump( valid_word_preds, validFile )
print('Done.')

uniqueValidGroups = range(len(valid_word_preds))
uniqueSubmitGroups = range(len(test_word_preds))


# Sequence Datasets
We will create datasets that, instead of describing individual words or tokens, describe sequences of words. Within some heuristic constraints, every possible sub-sequence of words in a text will be converted to a dataset sample with the following attributes:

* features- sequence length, position, and various kinds of class probability predictions/statistics
* labels- whether the sequence matches exactly a discourse instance
* truePos- whether the sequence matches a discourse instance by competition criteria for true positive
* groups- the integer index of the text where the sequence is found
* wordRanges- the start and end word index of the sequence in the text

Sequence datasets are generated for each discourse type and for validation and submission datasets. 
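
The enumeration can be sketched in plain Python as follows. A span may only start at a word whose `B` probability exceeds a minimum, its length is capped, and a sorted list of class probabilities is maintained incrementally so quantiles never require a re-sort. This is a simplified illustration with hypothetical names and toy numbers; it omits the edge-word and begin-probability features the notebook also uses.

```python
from bisect import insort

def quantiles_of_sorted(values, qs):
    """Linearly interpolated quantiles of an already sorted list."""
    n = len(values)
    out = []
    for q in qs:
        index = (n - 1) * q
        left = int(index)
        right = min(left + 1, n - 1)
        fraction = index - left
        out.append(values[left] + (values[right] - values[left]) * fraction)
    return out

def candidate_spans(begin_probs, class_probs, min_begin, max_len, n_quantiles=7):
    """Yield ((start, end), features) for each candidate word range [start, end)."""
    num_words = len(begin_probs)
    qs = [i / (n_quantiles - 1) for i in range(n_quantiles)]
    for start in range(num_words):
        if begin_probs[start] <= min_begin:
            continue  # unlikely to be the first word of a discourse element
        sorted_probs = []
        for end in range(start + 1, min(num_words, start + max_len) + 1):
            insort(sorted_probs, class_probs[end - 1])  # keep the list sorted
            features = [end - start, start / num_words, end / num_words]
            features += quantiles_of_sorted(sorted_probs, qs)
            yield (start, end), features

begin_probs = [0.9, 0.05, 0.1, 0.6]   # toy B-probabilities per word
class_probs = [0.9, 0.8, 0.7, 0.6]    # toy B-or-I scores per word
spans = list(candidate_spans(begin_probs, class_probs, min_begin=0.5, max_len=2))
```

The begin-probability filter and the length cap are what keep the number of candidate sequences manageable for long essays.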

In [None]:
from collections import Counter
from bisect import bisect_left

# Percentile code taken from https://www.kaggle.com/vuxxxx/tensorflow-longformer-ner-postprocessing
# Thanks Vu!
#
# Use the 99.5th percentile of the length distribution for a discourse type as the maximum.
# Increasing this constraint makes this step slower but generally increases performance.
MAX_SEQ_LEN = {}
train_df['len'] = train_df['predictionstring'].apply(lambda x:len(x.split()))
max_lens = train_df.groupby('discourse_type')['len'].quantile(.995)
for disc_type in disc_type_to_ids:
    MAX_SEQ_LEN[disc_type] = int(max_lens[disc_type])

#The minimum probability prediction for a 'B'egin class for which we will evaluate a word sequence
MIN_BEGIN_PROB = {
    'Claim': .35,
    'Concluding Statement': .15,
    'Counterclaim': .04,
    'Evidence': .1,
    'Lead': .32,
    'Position': .25,
    'Rebuttal': .01,
}
        
class SeqDataset(object):
    
    def __init__(self, features, labels, groups, wordRanges, truePos):
        
        self.features = np.array(features, dtype=np.float32)
        self.labels = np.array(labels)
        self.groups = np.array(groups, dtype=np.int16)
        self.wordRanges = np.array(wordRanges, dtype=np.int16)
        self.truePos = np.array(truePos)

# Adapted from https://stackoverflow.com/questions/60467081/linear-interpolation-in-numpy-quantile
# This is used to prevent re-sorting to compute quantile for every sequence.
def sorted_quantile(array, q):
    array = np.array(array)
    n = len(array)
    index = (n - 1) * q
    left = np.floor(index).astype(int)
    fraction = index - left
    right = left
    right = right + (fraction > 0).astype(int)
    i, j = array[left], array[right]
    return i + (j - i) * fraction
        
def seq_dataset(disc_type, pred_indices=None, submit=False):
    word_preds = valid_word_preds if not submit else test_word_preds
    window = pred_indices if pred_indices else range(len(word_preds))
    X = np.empty((int(1e6),13), dtype=np.float32)
    X_ind = 0
    y = []
    truePos = []
    wordRanges = []
    groups = []
    for text_i in tqdm(window):
        text_preds = np.array(word_preds[text_i])
        num_words = len(text_preds)
        disc_begin, disc_inside = disc_type_to_ids[disc_type]
        
        # The probability that a word corresponds to either a 'B'-egin or 'I'-nside token for a class
        prob_or = lambda word_preds: (1-(1-word_preds[:,disc_begin]) * (1-word_preds[:,disc_inside]))
        
        if not submit:
            gt_idx = set()
            gt_arr = np.zeros(num_words, dtype=int)
            text_gt = valid.loc[valid.id == test_dataset.id.values[text_i]]
            disc_gt = text_gt.loc[text_gt.discourse_type == disc_type]
            
            # Represent the discourse instance locations in a hash set and an integer array for speed
            for row_i, row in enumerate(disc_gt.iterrows()):
                splt = row[1]['predictionstring'].split()
                start, end = int(splt[0]), int(splt[-1]) + 1
                gt_idx.add((start, end))
                gt_arr[start:end] = row_i + 1
            gt_lens = np.bincount(gt_arr)
        
        # Iterate over every sub-sequence in the text
        quants = np.linspace(0,1,7)
        prob_begins = np.copy(text_preds[:,disc_begin])
        min_begin = MIN_BEGIN_PROB[disc_type]
        for pred_start in range(num_words):
            prob_begin = prob_begins[pred_start]
            if prob_begin > min_begin:
                begin_or_inside = []
                for pred_end in range(pred_start+1,min(num_words+1, pred_start+MAX_SEQ_LEN[disc_type]+1)):
                    
                    # Extract the scalar so the sorted list only ever holds plain floats
                    new_prob = prob_or(text_preds[pred_end-1:pred_end])[0]
                    insert_i = bisect_left(begin_or_inside, new_prob)
                    begin_or_inside.insert(insert_i, new_prob)

                    # Generate features for a word sub-sequence

                    # The length and position of start/end of the sequence
                    features = [pred_end - pred_start, pred_start / float(num_words), pred_end / float(num_words)]
                    
                    # 7 evenly spaced quantiles of the distribution of relevant class probabilities for this sequence
                    features.extend(list(sorted_quantile(begin_or_inside, quants)))

                    # The probability that words on either edge of the current sub-sequence belong to the class of interest
                    features.append(prob_or(text_preds[pred_start-1:pred_start])[0] if pred_start > 0 else 0)
                    features.append(prob_or(text_preds[pred_end:pred_end+1])[0] if pred_end < num_words else 0)

                    # The probability that the first word corresponds to a 'B'-egin token
                    features.append(text_preds[pred_start,disc_begin])

                    exact_match = (pred_start, pred_end) in gt_idx if not submit else None

                    if not submit:
                        true_pos = False
                        for match_cand, count in Counter(gt_arr[pred_start:pred_end]).most_common(2):
                            if match_cand != 0 and count / float(pred_end - pred_start) >= .5 and float(count) / gt_lens[match_cand] >= .5: true_pos = True
                    else: true_pos = None

                    # For efficiency, store features in a preallocated numpy array (rather than a list) that doubles in size when full, which keeps "append" amortized constant time
                    if X_ind >= X.shape[0]:
                        new_X = np.empty((X.shape[0]*2,13), dtype=np.float32)
                        new_X[:X.shape[0]] = X
                        X = new_X
                    X[X_ind] = features
                    X_ind += 1
                    
                    y.append(exact_match)
                    truePos.append(true_pos)
                    wordRanges.append((np.int16(pred_start), np.int16(pred_end)))
                    groups.append(np.int16(text_i))

    return SeqDataset(X[:X_ind], y, groups, wordRanges, truePos)
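
The inner loop above keeps `begin_or_inside` sorted as the candidate window grows, so the quantile features never require a re-sort. A minimal sketch of that idea, checked against `np.quantile`. The helper name `running_quantiles` and the sample probabilities are made up for illustration; `sorted_quantile` is a lightly restated copy of the function above.

```python
from bisect import insort

import numpy as np


def sorted_quantile(sorted_values, q):
    """Linearly interpolated quantiles of an already-sorted sequence."""
    arr = np.asarray(sorted_values, dtype=float)
    index = (len(arr) - 1) * np.asarray(q, dtype=float)
    left = np.floor(index).astype(int)
    fraction = index - left
    right = left + (fraction > 0).astype(int)
    return arr[left] + (arr[right] - arr[left]) * fraction


def running_quantiles(probs, quants):
    """Hypothetical helper: yield the quantiles of probs[:k] for k = 1..len(probs)."""
    window = []
    for p in probs:
        insort(window, p)  # insert in sorted position instead of re-sorting
        yield sorted_quantile(window, quants)


quants = np.linspace(0, 1, 7)
probs = [0.9, 0.2, 0.75, 0.4, 0.95]  # made-up word probabilities
results = list(running_quantiles(probs, quants))
```

Each element of `results` should agree with `np.quantile(probs[:k], quants)` for the corresponding prefix length `k`.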


In [None]:
from joblib import Parallel, delayed
from multiprocessing import Manager

manager = Manager()


def sequenceDataset(disc_type, submit=False):
    if not submit: validSeqSets[disc_type] = seq_dataset(disc_type) if not SUBMISSION else None
    else: submitSeqSets[disc_type] = seq_dataset(disc_type, submit=True)

try:
    with open( KAGGLE_CACHE + "/valid_seqds.p", "rb" ) as validFile:
        validSeqSets = pickle.load( validFile )  
except Exception:  # Cache is missing or unreadable, so rebuild the datasets
    print('Making validation sequence datasets...')
    validSeqSets = manager.dict()
    Parallel(n_jobs=-1, backend='multiprocessing')(
            delayed(sequenceDataset)(disc_type, False) 
           for disc_type in disc_type_to_ids
        )
    print('Done.')
    
    
print('Making submit sequence datasets...')
submitSeqSets = manager.dict()
Parallel(n_jobs=-1, backend='multiprocessing')(
        delayed(sequenceDataset)(disc_type, True) 
       for disc_type in disc_type_to_ids
    )
print('Done.')
    
with open( cache + "/valid_seqds.p", "wb" ) as validFile:
    pickle.dump( dict(validSeqSets), validFile )

In [None]:
NEGATIVE_SAMPLE_RATIO = 1

# Downsample negative samples to a 1:1 ratio with positives for efficiency/ease. There are many samples, and a performance increase was observed.
def resample(y):
    global resample_call
    counts = np.bincount(y)
    np.random.seed((resample_call+counts[0]) % 2**32)
    
    neg_sample_count = NEGATIVE_SAMPLE_RATIO*counts[1]
    indices = np.concatenate((
        np.random.choice(np.arange(len(y))[y==0], neg_sample_count, replace=False),
        np.arange(len(y))[y==1]
    ))
    indices.sort()
    resample_call += 1
    return indices

resample_call = 0
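
For reference, the same 1:1 downsampling can be sketched with a seeded `np.random.default_rng` instead of a global call counter. This is an illustrative variant, not the code used to train the saved classifiers; `downsample_negatives` and its arguments are hypothetical names.

```python
import numpy as np


def downsample_negatives(y, ratio=1, seed=0):
    """Return sorted indices keeping all positives and ratio * n_pos random negatives."""
    y = np.asarray(y).astype(bool)
    pos_idx = np.flatnonzero(y)
    neg_idx = np.flatnonzero(~y)
    # Cap at the number of available negatives so sampling without replacement cannot fail
    n_neg = min(len(neg_idx), ratio * len(pos_idx))
    rng = np.random.default_rng(seed)
    keep_neg = rng.choice(neg_idx, size=n_neg, replace=False)
    return np.sort(np.concatenate((keep_neg, pos_idx)))


labels = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0])  # made-up labels
kept = downsample_negatives(labels, ratio=1, seed=42)
```

Passing an explicit seed keeps each call reproducible without relying on module-level state.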

# Sequence Classifier Tuning

Every discourse type/class is trained in its own optimization loop, and the loops run in parallel if you have enough CPUs.

During a training iteration, the validation set is split into 8 folds. For every fold, a classifier is trained on the other 7 folds to predict the probability that a word sub-sequence in the held-out fold is a true positive. To compose a set of predictions for a text, sub-sequences are added in descending order of predicted true-positive probability, so long as they do not intersect with previously included sub-sequences. No further sub-sequences are added once the predicted probability falls below a threshold. Tuning this per-class threshold is the objective of this tuning stage.
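
The greedy composition step can be illustrated in isolation. The sketch below assumes half-open word ranges `(start, end)` and made-up scores. `select_spans` is a hypothetical name, though it follows the same sorted-interval bookkeeping as `predict_strings` further down.

```python
from bisect import bisect_left


def select_spans(candidates, threshold):
    """Greedily keep the highest-scoring, mutually non-overlapping spans.

    candidates: iterable of (probability, (start, end)) with half-open ranges.
    """
    starts, ends, chosen = [], [], []
    for prob, (start, end) in sorted(candidates, reverse=True):
        if prob < threshold:
            break  # every remaining candidate scores lower still
        pos = bisect_left(starts, start)
        fits_left = pos == 0 or ends[pos - 1] <= start
        fits_right = pos == len(starts) or starts[pos] >= end
        if fits_left and fits_right:
            starts.insert(pos, start)
            ends.insert(pos, end)
            chosen.append((start, end))
    return chosen


candidates = [
    (0.95, (10, 25)),
    (0.90, (20, 40)),  # overlaps the first span, so it is skipped
    (0.80, (25, 40)),  # starts exactly where the first span ends, so it fits
    (0.60, (0, 10)),
    (0.30, (40, 50)),  # below the threshold
]
picked = select_spans(candidates, threshold=0.5)
```

Keeping `starts` and `ends` sorted means each overlap check only has to look at the two neighbouring intervals found by the binary search.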

Once a set of predicted sub-sequences is composed for each text in each fold, the competition Macro F1 score is computed for the entire validation set. `skopt.gp_minimize`, a Bayesian optimizer that tolerates noisy objectives, is used to find the optimal probability threshold described above for each class in as few iterations as possible. After tuning is complete, the classifier from each fold is saved to disk. The mean of the probability predictions from this ensemble of classifiers is used during submission.
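
For context, the labelling code earlier in this notebook counts a candidate as a true positive when its overlap with a ground-truth span covers at least half of the candidate and at least half of the ground truth. A minimal sketch of that rule and of F1 computed from counts, using hypothetical helpers `is_match` and `f1_score` with made-up spans. It is not the competition scorer itself, which is `score_feedback_comp` in this notebook.

```python
def is_match(pred_range, gt_range):
    """True if the overlap covers >= 50% of both half-open word ranges."""
    pred_start, pred_end = pred_range
    gt_start, gt_end = gt_range
    overlap = max(0, min(pred_end, gt_end) - max(pred_start, gt_start))
    return (overlap / (pred_end - pred_start) >= 0.5
            and overlap / (gt_end - gt_start) >= 0.5)


def f1_score(num_tp, num_pred, num_gt):
    """F1 from counts: TP / (TP + 0.5 * (FP + FN))."""
    fp, fn = num_pred - num_tp, num_gt - num_tp
    denom = num_tp + 0.5 * (fp + fn)
    return num_tp / denom if denom else 0.0


half_overlap = is_match((0, 10), (5, 15))   # 5 shared words out of 10 and 10
too_short = is_match((0, 4), (0, 20))       # covers only 20% of the ground truth
example_f1 = f1_score(num_tp=2, num_pred=3, num_gt=4)
```

Raising the probability threshold reduces `num_pred`, trading false positives for false negatives, which is why the threshold is tuned against F1 directly.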

In [None]:
from joblib import Parallel, delayed
from multiprocessing import Manager
from sklearn.model_selection import cross_val_score, GroupKFold
from sklearn.ensemble import GradientBoostingClassifier
from skopt.space import Real
from skopt import gp_minimize
import sys
import xgboost

NUM_FOLDS = 8

warnings.filterwarnings('ignore', '.*ragged nested sequences*',)

seq_cache = {}  # For each fold and each text, cache the sub-sequence predictions sorted by score
clfs = []  # Each fold will add its classifier here
# Predict sub-sequences for a discourse type and set of train/test texts
def predict_strings(disc_type, probThresh, test_groups, train_ind=None, submit=False):
    string_preds = []
    validSeqDs = validSeqSets[disc_type]
    submitSeqDs = submitSeqSets[disc_type]
    
    # Average the probability predictions of a set of classifiers
    get_tp_prob = lambda testDs, classifiers: np.mean([clf.predict_proba(testDs.features)[:,1] for clf in classifiers], axis=0) if testDs.features.shape[0] > 0 else np.array([])
    
    if not submit:
        # Point to validation set values
        predict_df = test_dataset
        text_df = train_text_df
        groupIdx = np.isin(validSeqDs.groups, test_groups)
        testDs = SeqDataset(validSeqDs.features[groupIdx], validSeqDs.labels[groupIdx], validSeqDs.groups[groupIdx], validSeqDs.wordRanges[groupIdx], validSeqDs.truePos[groupIdx])
        
        # Cache the classifier predictions to speed up tuning iterations
        seq_key = (disc_type, tuple(test_groups), tuple(train_ind))
        if seq_key in seq_cache:
            text_to_seq = seq_cache[seq_key]
        else:

            clf = xgboost.XGBClassifier(
                learning_rate = 0.05,
                n_estimators=200,
                max_depth=7,
                min_child_weight=5,
                gamma=0,
                subsample=0.7,
                reg_alpha=.0005,
                colsample_bytree=0.6,
                scale_pos_weight=1,
                use_label_encoder=False,
                eval_metric='logloss',
                tree_method='hist'
            )
            
            resampled = resample(validSeqDs.truePos[train_ind])
            clf.fit(validSeqDs.features[train_ind][resampled],validSeqDs.truePos[train_ind][resampled])
            clfs.append(clf)
            prob_tp = get_tp_prob(testDs, [clf])
        
    else:
        # Point to submission set values
        predict_df = test_texts
        text_df = test_texts
        groupIdx = np.isin(submitSeqDs.groups, test_groups)
        testDs = SeqDataset(submitSeqDs.features[groupIdx], submitSeqDs.labels[groupIdx], submitSeqDs.groups[groupIdx], submitSeqDs.wordRanges[groupIdx], submitSeqDs.truePos[groupIdx])
        
        # Classifiers are always loaded from disk during submission
        with open( f"../input/seqclassifiers6/{disc_type}_clf.p", "rb" ) as clfFile:
            classifiers = pickle.load( clfFile )  
        prob_tp = get_tp_prob(testDs, classifiers)
        
    if submit or seq_key not in seq_cache:
        text_to_seq = {}
        for text_idx in test_groups:
            # The probability of true positive and (start,end) of each sub-sequence in the current text
            prob_tp_curr = prob_tp[testDs.groups == text_idx]
            word_ranges_curr = testDs.wordRanges[testDs.groups == text_idx]
            # Sort candidates by descending predicted probability
            sorted_seqs = sorted(zip(prob_tp_curr, [tuple(wr) for wr in word_ranges_curr]), reverse=True)
            text_to_seq[text_idx] = sorted_seqs
        if not submit: seq_cache[seq_key] = text_to_seq
    
    for text_idx in test_groups:
        
        i = 1
        split_text = text_df.loc[text_df.id == predict_df.id.values[text_idx]].iloc[0].text.split()
        
        # Start and end word indices of sequence candidates kept in sorted order for efficiency
        starts = []
        ends = []
        
        # Include the sub-sequence predictions in order of predicted probability
        for prob, wordRange in text_to_seq[text_idx]:
            
            # Until the predicted probability is lower than the tuned threshold
            if prob < probThresh: break
                
            # Binary search already-placed word sequence intervals, and insert the new word sequence interval if it does not intersect an existing interval.
            insert = bisect_left(starts, wordRange[0])
            if (insert == 0 or ends[insert-1] <= wordRange[0]) and (insert == len(starts) or starts[insert] >= wordRange[1]):
                starts.insert(insert, wordRange[0])
                ends.insert(insert, wordRange[1])
                string_preds.append((predict_df.id.values[text_idx], disc_type, ' '.join(map(str, list(range(wordRange[0], wordRange[1]))))))
                i += 1     
    return string_preds

def sub_df(string_preds):
    return pd.DataFrame(string_preds, columns=['id','class','predictionstring'])
    
# Map skopt's uniformly sampled scalar to a probability threshold that decays exponentially from 100% to 0%, so thresholds near 100% are sampled more densely
def prob_thresh(x): 
    return .01*(100-np.exp(100*x))

# Inverse of prob_thresh: convert a probability threshold back to the scalar supplied by skopt
def skopt_thresh(x): 
    return np.log((x/.01-100.)/-1.)/100.
    
# This function is called every tuning iteration.
# It takes skopt's raw threshold scalar as input and returns the negated F1 score, because skopt minimizes its objective
def score_fmin(arr, disc_type):
    validSeqDs = validSeqSets[disc_type]
    string_preds = []
    folds = np.array(list(GroupKFold(n_splits=NUM_FOLDS).split(validSeqDs.features, groups=validSeqDs.groups)))
    gt_indices = []
    for ind in folds[:,1]: gt_indices.extend(ind)
        
    # Texts that have no samples in our dataset for this class
    unsampled_texts = np.array(np.array_split(list(set(uniqueValidGroups).difference(set(np.unique(validSeqDs.groups)))), NUM_FOLDS))
    
    gt_texts = test_dataset.id.values[np.unique(validSeqDs.groups[np.array(gt_indices, dtype=int)]).astype(int)]
    
    # Generate predictions from each fold of the validation set
    for fold_i, (train_ind, test_ind) in enumerate(folds):
        string_preds.extend(predict_strings(disc_type, prob_thresh(arr[0]), np.concatenate((np.unique(validSeqDs.groups[test_ind]), unsampled_texts[fold_i])).astype(int), train_ind))
    boost_df = sub_df(list(string_preds))
    gt_df = valid.loc[np.bitwise_and(valid['discourse_type']==disc_type, valid.id.isin(gt_texts))].copy()
    f1 = score_feedback_comp(boost_df.copy(), gt_df)
    return -f1


def train_seq_clfs(disc_type):
    # The optimization bounds on the tuned probability threshold 
    space_start = skopt_thresh(.999)
    space_end = skopt_thresh(0)
    space  = [Real(space_start,space_end)]
    
    # Maximize F1 by minimizing its negation
    score_fmin_disc = lambda arr: score_fmin(arr, disc_type)
    res_gp = gp_minimize(score_fmin_disc, space, n_calls=100, x0=[skopt_thresh(.5)])
    
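    # Use the Gaussian process approximation of f(threshold) -> -F1 to select the minimum.
    # Candidates are evaluated on a [0, 1] grid and mapped back to the raw search bounds below.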
    thresh_cand = np.rot90([np.linspace(0,1,1000)])
    cand_scores = res_gp.models[-1].predict(thresh_cand)
    best_thresh_raw = space_start + (space_end - space_start)*thresh_cand[np.argmin(cand_scores)][0]
    best_thresh = prob_thresh(best_thresh_raw)
    exp_score = -np.min(cand_scores)
    
    # Make predictions at the inferred function minimum
    pred_thresh_score = -score_fmin_disc([best_thresh_raw])
    
    # And the best iteration in the optimization run
    best_iter_score = -score_fmin_disc(res_gp.x)
    
    # Save the trained classifiers to disk
    with open( f"{disc_type}_clf.p", "wb" ) as clfFile:
        pickle.dump( clfs, clfFile )
        
    # Save the tuning run results to file
    with open( f"{disc_type}_res.p", "wb" ) as resFile:
        pickle.dump( 
            {
                'pred_thresh': best_thresh,  # The location of the minimum of the gaussian function inferred by skopt
                'min_thresh': prob_thresh(res_gp.x[0]),  # The threshold which produces the best score
                'pred_score': exp_score,  # The minimum of the gaussian function inferred by skopt
                'min_score': best_iter_score, # The best score in the tuning run
                'pred_thresh_score': pred_thresh_score  # The score produced by 'pred_thresh'
            }, 
            resFile 
        )
    print('Done training', disc_type)
    
if TRAIN_SEQ_CLASSIFIERS and not SUBMISSION:
    print('Training sequence classifiers... (This takes a long time.)')
    Parallel(n_jobs=-1, backend='multiprocessing')(
            delayed(train_seq_clfs)(disc_type) 
           for disc_type in disc_type_to_ids
    )
    print('Done training all sequence classifiers.')
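
The two threshold mappings in the cell above are inverses of each other, which is what allows the search bounds to be written as probabilities. A quick numerical check of that claim. The function bodies are copied from the cell above and the sample probabilities are arbitrary.

```python
import numpy as np


def prob_thresh(x):
    return .01 * (100 - np.exp(100 * x))


def skopt_thresh(p):
    return np.log((p / .01 - 100.) / -1.) / 100.


probs = np.array([0.0, 0.25, 0.5, 0.9, 0.999])
raw = skopt_thresh(probs)      # scalars in the space searched by the optimizer
round_trip = prob_thresh(raw)  # mapped back to probabilities

# Equal steps in the raw space land mostly on probabilities close to 1
grid = prob_thresh(np.linspace(skopt_thresh(.999), skopt_thresh(0), 5))
```

Because the mapping is exponential, evenly spaced points in the raw search space concentrate on high probability thresholds, where small changes in the threshold are more likely to change which sub-sequences are kept.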

Load the tuned probability thresholds from tuning result files, and make sub-sequence predictions!

In [None]:
thresholds = {}
for disc_type in disc_type_to_ids:
    with open( f"../input/seqclassifiers6/{disc_type}_res.p", "rb" ) as res_file:
        train_result = pickle.load( res_file )  
    thresholds[disc_type] = train_result['pred_thresh']
    print(disc_type, train_result)
sub = pd.concat([sub_df(predict_strings(disc_type, thresholds[disc_type], uniqueSubmitGroups, submit=True)) for disc_type in disc_type_to_ids ]).reset_index(drop=True)

In [None]:
sub.to_csv("submission.csv", index=False)

In [None]:
print(sub)