## OneStopEnglish corpus 


One of the challenges of this competition is fine-tuning very large models (350-400M parameters) on a small sample (~2,800 rows).

One way to avoid overfitting the train/validation set is to check how your model generalizes to a completely different dataset.

The OneStopEnglish corpus (https://aclanthology.org/W18-0535/) is a dataset of texts written at three reading levels.

The corpus consists of 189 texts, each available in three versions: elementary, intermediate and advanced:


<img src="https://d3i71xaburhd42.cloudfront.net/f6d485c14786abbab731b0cf5e1f4de6b69dc57b/5-Table1-1.png" alt="Example sentences for three reading levels" style="width:600px;"/>


This dataset is not designed specifically for children's readability, but accuracy on it can indicate how well your model assesses readability.


The accuracy on this dataset can be defined as follows:
- score 1 if *pred* for the elementary text is greater than *pred* for the intermediate text, else 0 (the elementary text is predicted to be more readable than the intermediate text)
- score 1 if *pred* for the intermediate text is greater than *pred* for the advanced text, else 0 (the intermediate text is predicted to be more readable than the advanced text)

The accuracy is the mean of these scores over all comparisons (two per text).
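As a minimal sketch of this metric, the snippet below scores a handful of made-up predictions. The function name `pairwise_accuracy` and the toy values are illustrative assumptions, not part of the notebook; higher predictions are taken to mean "more readable".

```python
def pairwise_accuracy(preds):
    """Share of correctly ordered (ele > inter) and (inter > adv) pairs.

    preds: list of (ele, inter, adv) predicted readability scores,
    one tuple per text; a higher score means easier to read.
    """
    hits = 0
    for ele, inter, adv in preds:
        hits += int(ele > inter)   # elementary should score above intermediate
        hits += int(inter > adv)   # intermediate should score above advanced
    return hits / (2 * len(preds))


# Hypothetical predictions for three texts
toy_preds = [
    (0.5, -0.2, -1.1),   # both comparisons correct
    (0.1, 0.3, -0.9),    # ele vs inter wrong, inter vs adv correct
    (-0.4, -0.6, -0.5),  # ele vs inter correct, inter vs adv wrong
]
print(f"{pairwise_accuracy(toy_preds):.3f}")  # 4 of 6 comparisons are correct
```

Because only the ordering within each text matters, the metric is insensitive to any shift or rescaling of the model outputs.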


In this notebook we show how some of our models prepared for the CommonLit competition generalize well to this dataset (**83.5%-88.6%** accuracy), as well as how a linear combination of these models with classical readability features outperforms the accuracy reported in the literature for this dataset, reaching **98.9%**.

|Model|OneStopEnglish (Accuracy)|CommonLit CV (RMSE)|CommonLit Public LB (RMSE)|CommonLit Private LB (RMSE)|
|:----|:-------------------------:|:-----------------:|:------------------------:|:-------------------------:|
|clrp-deberta-large-ppln4-atthead|0.85979|0.48395|0.472|0.469|
|clrp-deberta-large-2-se|0.88624|0.48962|0.472|0.471|
|clrp-deberta-large-4-se|0.87831|0.49059|0.468|0.470|
|clrp-roberta-large-2f-se|0.83598|0.48927|0.472|0.470|
|clrp-roberta-large-2h-atthead-se|0.84656|0.49703|NA|NA|
|**ensemble**|**0.98942**|**0.45964**|**0.458**|**0.454**|


### Download OneStopEnglish corpus dataset 

We download the OneStopEnglish corpus from GitHub and assemble it into a dataset.

In [None]:
!git clone https://github.com/nishkalavallabhi/OneStopEnglishCorpus.git

In [None]:
import numpy as np 
import pandas as pd
import os

name = []
ele = []
inter = []
adv= []

path_adv = './OneStopEnglishCorpus/Texts-SeparatedByReadingLevel/Adv-Txt'
path_ele = './OneStopEnglishCorpus/Texts-SeparatedByReadingLevel/Ele-Txt'
path_int = './OneStopEnglishCorpus/Texts-SeparatedByReadingLevel/Int-Txt'

for dirname, _, filenames in os.walk(path_adv):
    for filename in filenames:
        path =  dirname +"/" +filename
        if ".txt" in path:
            name.append (filename.replace("-adv.txt",""))
            with open(path) as f:
                content = " ".join(f.readlines()) 
                adv.append(content)
            with open(path_ele + "/" + filename.replace("-adv","-ele")) as f:
                content = " ".join(f.readlines()) 
                ele.append(content)
            with open(path_int + "/" + filename.replace("-adv","-int")) as f:
                content = " ".join(f.readlines()[1:]) 
                inter.append(content)
                
df_1 = pd.DataFrame ({
    "name": name,
    "excerpt": ele
})
df_1["level"] = "ele"

df_2 = pd.DataFrame ({
    "name": name,
    "excerpt": inter
})
df_2["level"] = "inter"

df_3 = pd.DataFrame ({
    "name": name,
    "excerpt": adv
})
df_3["level"] = "adv"

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the three levels the same way
df = pd.concat([df_1, df_2, df_3])

df.to_csv("OneStopEnglishCorpus.csv", index=False)

df.head()

## Predict readability on OneStopEnglish corpus

### Single Models

Here we use five of our competition models to predict readability on the OneStopEnglish corpus. Note that we first write a Python script for each model prediction pipeline to disk, then execute each script separately with a shell command. This avoids problems caused by heavy, repeated use of memory and GPU within the same process.
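The same isolation idea can be sketched with the standard library instead of notebook `!` commands. The helper name `run_isolated` and the inline `-c` command are illustrative assumptions that stand in for a real prediction script; the point is only that each run lives in its own process, so the operating system reclaims its memory (including GPU memory) when it exits.

```python
import subprocess
import sys


def run_isolated(args):
    """Run a Python command in a fresh interpreter and return its exit code."""
    completed = subprocess.run([sys.executable, *args], check=False)
    return completed.returncode


# Hypothetical stand-in for one prediction script; in the notebook this
# would be a script name followed by the model name and weights folder.
exit_code = run_isolated(["-c", "print('predictions written')"])
print(exit_code)
```

Checking the exit code makes a failed run visible before its predictions are read back from disk.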

In [None]:
%%writefile ensemble.py

import numpy as np 
import pandas as pd 
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt


import torch
import transformers

import random
import os
import sys

from tqdm import tqdm

class CLRPDataset():
    def __init__(self,df,max_len, tokenizer):
        self.excerpt = df['excerpt'].values
        self.max_len = max_len
        self.tokenizer = tokenizer 


        if "target" in df.columns:
            self.target = df['target'].values
        else:
            self.target = None
    
    def __getitem__(self,index):
        encode = self.tokenizer(self.excerpt[index],
                                return_tensors='pt',
                                max_length=self.max_len,
                                padding='max_length',
                                return_token_type_ids = True,
                                truncation=True)  

        #token_ids = encode['input_ids'].squeeze(0)
        #attn_masks = encode['attention_mask'].squeeze(0)
        #token_type_ids = encode['token_type_ids'].squeeze(0)

        token_ids = encode['input_ids'][0]
        attn_masks = encode['attention_mask'][0]
        token_type_ids = encode['token_type_ids'][0]
        
        
        if self.target is None:
            return token_ids, attn_masks, token_type_ids


        target = self.target[index]
        target = torch.tensor(target).float()    

        return token_ids, attn_masks, token_type_ids, target  


    def __len__(self):
        return len(self.excerpt)
    
class BertRegreesion(torch.nn.Module):

    def __init__(self, dropout, bert_model, model_path,  freeze_bert=False):
        super(BertRegreesion, self).__init__()
        
        self.bert_layer = transformers.AutoModel.from_pretrained(model_path)
        
        # Set the hidden-state size of the encoder outputs (to add other pre-trained models, look up their encoder output size)
        if bert_model == "roberta-base":
            hidden_size = 768
        elif bert_model == "roberta-large":
            hidden_size = 1024
        elif bert_model == "microsoft/deberta-large":
            hidden_size = 1024
        else:
            # Fail early instead of hitting an undefined hidden_size below
            raise ValueError(f"Unsupported bert_model: {bert_model}")
            
        # Freeze bert layers and only train the regression layer weights
        if freeze_bert:
            for p in self.bert_layer.parameters():
                p.requires_grad = False

        # ReGression layer
        self.dropout = torch.nn.Dropout(p=dropout)
        self.head = torch.nn.Linear(hidden_size, 1)
        self.bert_model = bert_model
    
    def forward(self, input_ids, attn_masks, token_type_ids):

        # Feeding the inputs to the BERT-based model to obtain contextualized representations
        if  self.bert_model == "microsoft/deberta-large":
            output = self.bert_layer(input_ids, attn_masks, token_type_ids)
            output = output[0]
            output = output[:,0,:].squeeze(1)
        else:  
            cont_reps, output = self.bert_layer(input_ids, attn_masks, token_type_ids,  return_dict=False)

        output = self.head(self.dropout(output))

        return output
    
BATCH_SIZE = 4
FOLDS = 5
NUM_WORKERS = 4
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

class Config:
    epochs = 20 
    eval_per_epoch = 8 #4
    dropout = 0.4
    max_len = 256
    bert_model = "roberta-base"
    warm_up_steps = 300

def get_preds (bert_model, instances_path, max_len, df_test):
    print(instances_path)
    model_path = "/kaggle/input/transformers/" + bert_model + "-hf"
    print(f"model_path:{model_path}")

    vocab_path = model_path
    instances_path = "/kaggle/input/" + instances_path
    instance_name = bert_model.replace("/","_")

    device = DEVICE
    tokenizer = transformers.AutoTokenizer.from_pretrained(vocab_path)

    test_data = CLRPDataset(df_test,max_len, tokenizer=tokenizer)
    test_loader = torch.utils.data.DataLoader(test_data,
                                          batch_size=BATCH_SIZE,
                                          shuffle=False,
                                          num_workers=NUM_WORKERS)

    p = np.zeros((len(df_test),))
    for fold in range(FOLDS): 
        preds = []

        model = BertRegreesion (dropout = Config.dropout, bert_model=bert_model, model_path= model_path, freeze_bert=False)
        
        filename = f"{instances_path}/{instance_name}_{fold}.pt"
        model.load_state_dict(torch.load(filename, map_location=torch.device(device)))
        model.to(device)
        model.eval()
    
        with torch.no_grad():
            for token_ids, attn_masks, token_type_ids in tqdm(test_loader):
                token_ids = token_ids.to(device)
                attn_masks = attn_masks.to(device)
                token_type_ids = token_type_ids.to(device)

                output = model.forward(token_ids, attn_masks, token_type_ids)
                output = output.detach().cpu()[:,0]

                preds.append(output)
        preds = np.concatenate(preds)
        p += preds
        del model
    
    return p/FOLDS

def main():
    from numba import cuda 
    device = cuda.get_current_device()
    device.reset()
    
    df_test = pd.read_csv("OneStopEnglishCorpus.csv")
    args = sys.argv[1:]
    
    preds = get_preds (args[0],args[1], max_len = Config.max_len, df_test = df_test)
    df_test [args[1]] = preds
    
    df_test[[args[1]]].to_pickle(args[1]+".pkl")

if __name__ == "__main__":
    main()            
            

In [None]:
%%writefile ensemble_atthead.py

import numpy as np 
import pandas as pd 
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import StratifiedKFold
import matplotlib.pyplot as plt


import torch
import transformers

import random
import os
import sys

from tqdm import tqdm

class CLRPDataset():
    def __init__(self,df,max_len, tokenizer):
        self.excerpt = df['excerpt'].values
        self.max_len = max_len
        self.tokenizer = tokenizer 


        if "target" in df.columns:
            self.target = df['target'].values
        else:
            self.target = None
    
    def __getitem__(self,index):
        encode = self.tokenizer(self.excerpt[index],
                                return_tensors='pt',
                                max_length=self.max_len,
                                padding='max_length',
                                return_token_type_ids = True,
                                truncation=True)  

        #token_ids = encode['input_ids'].squeeze(0)
        #attn_masks = encode['attention_mask'].squeeze(0)
        #token_type_ids = encode['token_type_ids'].squeeze(0)

        token_ids = encode['input_ids'][0]
        attn_masks = encode['attention_mask'][0]
        token_type_ids = encode['token_type_ids'][0]
        
        
        if self.target is None:
            return token_ids, attn_masks, token_type_ids


        target = self.target[index]
        target = torch.tensor(target).float()    

        return token_ids, attn_masks, token_type_ids, target  


    def __len__(self):
        return len(self.excerpt)
    
class AttentionHead(torch.nn.Module):
    def __init__(self, in_features, hidden_dim, num_targets):
        super().__init__()
        self.in_features = in_features
        self.middle_features = hidden_dim
        self.W = torch.nn.Linear(in_features, hidden_dim)
        self.V = torch.nn.Linear(hidden_dim, 1)
        self.out_features = hidden_dim

    def forward(self, features):
        att = torch.tanh(self.W(features))
        score = self.V(att)
        attention_weights = torch.softmax(score, dim=1)
        context_vector = attention_weights * features
        context_vector = torch.sum(context_vector, dim=1)

        return context_vector


class BertRegreesion(torch.nn.Module):

    def __init__(self, dropout, bert_model, model_path,  freeze_bert=False):
        super(BertRegreesion, self).__init__()
        
        self.bert_layer = transformers.AutoModel.from_pretrained(model_path, output_hidden_states=True)

        # Set the hidden-state size of the encoder outputs (to add other pre-trained models, look up their encoder output size)
        if bert_model == "roberta-base":
            hidden_size = 768
        elif bert_model == "roberta-large":
            hidden_size = 1024
        elif bert_model == "microsoft/deberta-large":
            hidden_size = 1024
        else:
            # Fail early instead of hitting an undefined hidden_size below
            raise ValueError(f"Unsupported bert_model: {bert_model}")

        # Freeze bert layers and only train the regression layer weights
        if freeze_bert:
            for p in self.bert_layer.parameters():
                p.requires_grad = False

        self.head = AttentionHead(hidden_size,hidden_size,1)
                
        # ReGression layer
        self.dropout = torch.nn.Dropout(p=dropout)
        self.linear = torch.nn.Linear(hidden_size, 1)
        self.bert_model = bert_model
    
    def forward(self, input_ids, attn_masks, token_type_ids):

        # Feeding the inputs to the BERT-based model to obtain contextualized representations
        if  self.bert_model == "microsoft/deberta-large":
            output = self.bert_layer(input_ids, attn_masks, token_type_ids)
            output = output[0]

        else:  
            output = self.bert_layer(input_ids, attn_masks, token_type_ids,  return_dict=False)
            output = output[0]
        
        output = self.head(output)
        output = self.dropout(output)
        output = self.linear(output)

        return output
    
BATCH_SIZE = 4
FOLDS = 5
NUM_WORKERS = 4
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

class Config:
    epochs = 20 
    eval_per_epoch = 8 #4
    dropout = 0.4
    max_len = 256
    bert_model = "roberta-base"
    warm_up_steps = 300

def get_preds (bert_model, instances_path, max_len, df_test):
    print(instances_path)
    model_path = "/kaggle/input/transformers/" + bert_model + "-hf"
    print(f"model_path:{model_path}")

    vocab_path = model_path
    instances_path = "/kaggle/input/" + instances_path
    instance_name = bert_model.replace("/","_")

    
    device = DEVICE
    tokenizer = transformers.AutoTokenizer.from_pretrained(vocab_path)

    test_data = CLRPDataset(df_test,max_len, tokenizer=tokenizer)
    test_loader = torch.utils.data.DataLoader(test_data,
                                          batch_size=BATCH_SIZE,
                                          shuffle=False,
                                          num_workers=NUM_WORKERS)

    p = np.zeros((len(df_test),))
    for fold in range(FOLDS): 
        preds = []

        model = BertRegreesion (dropout = Config.dropout, bert_model=bert_model, model_path= model_path, freeze_bert=False)
        
        filename = f"{instances_path}/{instance_name}_{fold}.pt"
        model.load_state_dict(torch.load(filename, map_location=torch.device(device)))
        model.to(device)
        model.eval()
    
        with torch.no_grad():
            for token_ids, attn_masks, token_type_ids in tqdm(test_loader):
                token_ids = token_ids.to(device)
                attn_masks = attn_masks.to(device)
                token_type_ids = token_type_ids.to(device)

                output = model.forward(token_ids, attn_masks, token_type_ids)
                output = output.detach().cpu()[:,0]

                preds.append(output)
        preds = np.concatenate(preds)
        p += preds
        del model
    
    return p/FOLDS

def main():
    from numba import cuda 
    device = cuda.get_current_device()
    device.reset()
    
    df_test = pd.read_csv("OneStopEnglishCorpus.csv")
    args = sys.argv[1:]
    
    preds = get_preds (args[0],args[1], max_len = Config.max_len, df_test = df_test)
    df_test [args[1]] = preds
    
    df_test[[args[1]]].to_pickle(args[1]+".pkl")

if __name__ == "__main__":
    main()            
            

In [None]:
!python ensemble_atthead.py "roberta-large" "clrp-roberta-large-2h-atthead-se"

!python ensemble_atthead.py "microsoft/deberta-large" "clrp-deberta-large-ppln4-atthead"
!python ensemble.py "roberta-large" "clrp-roberta-large-2f-se"

!python ensemble.py "microsoft/deberta-large" "clrp-deberta-large-2-se"
!python ensemble.py "microsoft/deberta-large" "clrp-deberta-large-4-se"

### Ensemble

In this part of the code, we retrieve the out-of-fold (OOF) predictions on the CommonLit dataset, add a series of readability measures (Syllable Count, Gunning Fog, Automated Readability Index, Coleman-Liau Index, Text Standard, Dale-Chall Readability Score) and combine them all in a linear model with L2 regularization (a Ridge regression). Although this blend is fitted on the CommonLit dataset, we apply it to the OneStopEnglish corpus to obtain the ensemble predictions.
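To make the blending step concrete, here is a small sketch on synthetic data. It is not the notebook's pipeline: it uses a closed-form ridge solution in NumPy rather than `sklearn.linear_model.Ridge`, fits in-sample instead of per fold, and all names (`fit_ridge`, `X_oof`, `rmse`) and the random data are assumptions made for illustration.

```python
import numpy as np


def fit_ridge(X, y, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    n_features = X.shape[1]
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


rng = np.random.default_rng(0)
n = 200
target = rng.normal(size=n)  # synthetic stand-in for the readability target

# Three noisy "model predictions" plus one weakly informative "readability feature"
X_oof = np.column_stack([
    target + rng.normal(scale=0.5, size=n),
    target + rng.normal(scale=0.5, size=n),
    target + rng.normal(scale=0.5, size=n),
    0.3 * target + rng.normal(size=n),
])

# Standardize with statistics from the OOF matrix; the same mean and std
# would be reused to scale features computed on a new dataset.
mean, std = X_oof.mean(axis=0), X_oof.std(axis=0)
X_scaled = (X_oof - mean) / std

coef, intercept = fit_ridge(X_scaled, target, alpha=5.0)
blend = X_scaled @ coef + intercept

single_rmse = rmse(target, X_oof[:, 0])
blend_rmse = rmse(target, blend)
print(f"single: {single_rmse:.3f}  blend: {blend_rmse:.3f}")
```

Averaging several predictors whose errors are partly independent reduces the error of the blend, which is the effect the ensemble relies on.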

In [None]:
!pip install textstat

In [None]:
import numpy as np 
import pandas as pd 
import sklearn.linear_model
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

import textstat

In [None]:
FOLDS = 5

def read_oof (name, return_all = True):
    oof = pd.read_csv(f"/kaggle/input/{name}/oof.csv")
    
    if "pred_x" in oof.columns:
        oof = oof.rename (columns={"pred_x":name})
    else:
        oof = oof.rename (columns={"pred":name})
        
    if return_all:
        return oof
    else:
        return oof[["id",name]]


name_1 = "clrp-roberta-large-2f-se"
name_2 = "clrp-deberta-large-2-se"
name_3 = "clrp-deberta-large-4-se"
name_4 = "clrp-deberta-large-ppln4-atthead"
name_5 = "clrp-roberta-large-2h-atthead-se"

oof = read_oof (name_1, return_all =True)
tmp = read_oof (name_2, return_all =False)
oof = oof.merge(tmp, on ="id")
tmp = read_oof (name_3, return_all =False)
oof = oof.merge(tmp, on ="id")
tmp = read_oof (name_4, return_all =False)
oof = oof.merge(tmp, on ="id")
tmp = read_oof (name_5, return_all =False)
oof = oof.merge(tmp, on ="id")

for name in [name_1, name_2, name_3, name_4, name_5]:
    # squared=False was removed in recent scikit-learn releases; take the square root explicitly
    score = np.sqrt(mean_squared_error(oof["target"], oof[name]))
    print(f"rmse {name}: {score:.5f}"  )

oof_textstats = pd.read_csv("/kaggle/input/clrp-textstats/textstats.csv")
oof = oof.merge(oof_textstats, on="id")

oof["t1"] = oof["syllable_count"]**2
oof["t2"] = oof["coleman_liau_index"]**2

textstats_feats = [
    'syllable_count', 
    'gunning_fog', 'automated_readability_index', 'coleman_liau_index', 'text_standard', 
    "dale_chall_readability_score",
    "t1", 
    "t2", 
] 

In [None]:
df_test = pd.read_csv("OneStopEnglishCorpus.csv")

pred = pd.read_pickle(name_1 + ".pkl")
df_test[name_1] = pred.values

pred = pd.read_pickle(name_2+ ".pkl")
df_test[name_2] = pred.values

pred = pd.read_pickle(name_3+ ".pkl")
df_test[name_3] = pred.values

pred = pd.read_pickle(name_4+ ".pkl")
df_test[name_4] = pred.values

pred = pd.read_pickle(name_5+ ".pkl")
df_test[name_5] = pred.values

df_test["syllable_count"] = df_test ["excerpt"].map(lambda x: textstat.syllable_count(x))
df_test["gunning_fog"] = df_test ["excerpt"].map(lambda x: textstat.gunning_fog(x))
df_test["automated_readability_index"] = df_test ["excerpt"].map(lambda x: textstat.automated_readability_index(x))
df_test["coleman_liau_index"] = df_test ["excerpt"].map(lambda x: textstat.coleman_liau_index(x))
df_test["text_standard"] = df_test ["excerpt"].map(lambda x: textstat.text_standard(x, float_output=True))
df_test["dale_chall_readability_score"] = df_test ["excerpt"].map(lambda x: textstat.dale_chall_readability_score(x))


df_test["t1"] = df_test["syllable_count"]**2
df_test["t2"] = df_test["coleman_liau_index"]**2

df_test["target"] = 0

In [None]:
cols = [name_1, name_2, name_3, name_4, name_5] + textstats_feats

In [None]:
scaler = StandardScaler().fit(oof[cols])

oof[cols] = scaler.transform(oof[cols])
df_test[cols] = scaler.transform(df_test[cols])

oof_pred = []
oof_target = []

X_test = df_test[cols].values

for fold in range(FOLDS):
    df_val = oof.query("kfold == @fold")
    X_val = df_val[cols].values
    y_val = df_val["target"].values
    
    df_train = oof.query("kfold != @fold")
    X_train = df_train[cols].values
    y_train = df_train["target"].values
    
    model = sklearn.linear_model.Ridge(alpha=5.0)
    model.fit (X_train, y_train)
    p_val = model.predict (X_val)
    
    df_test["target"] += model.predict(X_test)
    
    oof_pred.append(p_val)
    oof_target.append (y_val)
    score = np.sqrt(mean_squared_error(y_val, p_val))  # per-fold RMSE

oof_pred = np.concatenate (oof_pred)
oof_target = np.concatenate (oof_target)
ens_score = np.sqrt(mean_squared_error(oof_target, oof_pred))  # overall OOF RMSE of the blend
print(f"rmse ens: {ens_score:.5f}"  )

In [None]:
df_test["target"] /= FOLDS  

## Evaluate OneStopEnglish accuracy on single models and ensemble

At this point we just need to evaluate the performance of each single model and of their ensemble (a blend by ridge regression fitted on out-of-fold predictions from the CommonLit dataset) on the OneStopEnglish corpus. Notice how the ensemble outperforms every single model.

In [None]:
def eval_onecorpus_accuracy(df_test, model_name):
    df_ele =  df_test.query("level == 'ele'")[["name",model_name]].copy().rename(columns={model_name:"ele"})
    df_inter =  df_test.query("level == 'inter'")[["name",model_name]].copy().rename(columns={model_name:"inter"})
    df_adv =  df_test.query("level == 'adv'")[["name",model_name]].copy().rename(columns={model_name:"adv"})

    df_test = df_ele.merge(df_inter, on="name").merge(df_adv, on="name")
    ## accuracy
    df_test["flag_1"] =  (df_test["ele"] > df_test["inter"]).astype(int) 
    df_test["flag_2"] =  (df_test["inter"] > df_test["adv"]).astype(int)
    accuracy = (df_test["flag_1"].sum() + df_test["flag_2"].sum())/(2*df_test.shape[0])
    return accuracy    

In [None]:
for model_name in [name_1, name_2, name_3, name_4, name_5]:
    accuracy = eval_onecorpus_accuracy(df_test, model_name)
    print (f"model:{model_name}, accuracy:{accuracy:.5f}")

accuracy = eval_onecorpus_accuracy(df_test, "target")
print (f"model:ensemble, accuracy:{accuracy:.5f}")