# Sentence Comparison Approach - CommonLit Readability Comp - CV 0.495
All public discussions and notebooks so far use regression to predict the Bradley-Terry target directly. In this notebook, we instead convert the Bradley-Terry values back into the probability that one excerpt is easier to read than another. We then train a standard binary classification model on pairs of excerpts to predict this probability, and finally convert the predicted probability back into a Bradley-Terry score. This model achieves a solid CV of 0.495.

Bradley-Terry is referenced by the host [here][1] and explained on Wikipedia [here][2].

We will build a transformer model that takes two excerpts as input and outputs the probability that the first excerpt is more difficult to read than the second. Given two excerpts with Bradley-Terry scores `t1` and `t2`, the probability that excerpt 1 is more difficult to read than excerpt 2 is

$$ \text{prob} = \frac{\exp(t_1-t_2)}{1+\exp(t_1-t_2)} $$

We will use this formula to encode the competition targets during sentence-pair training. During inference we invert it to decode: given a predicted probability and the known score `t2`, we recover `t1 = t2 + log(prob / (1 - prob))`.
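
As a quick illustration of the encode and decode steps, here is a minimal NumPy sketch. The helper names `encode_pair` and `decode_pair` are hypothetical and are not used elsewhere in this notebook, and the clipping constant `eps` is an added assumption that keeps the logit finite when a probability reaches 0 or 1.

```python
import numpy as np

def encode_pair(t1, t2):
    """Turn two Bradley-Terry scores into the pair's target probability."""
    return 1.0 / (1.0 + np.exp(-(t1 - t2)))  # same as exp(d) / (1 + exp(d))

def decode_pair(prob, t2, eps=1e-6):
    """Invert the formula: recover t1 from the probability and the known t2."""
    prob = np.clip(prob, eps, 1.0 - eps)      # assumption: guard against log(0)
    return t2 + np.log(prob / (1.0 - prob))

p = encode_pair(0.5, -1.0)    # label for a pair with scores 0.5 and -1.0
t1 = decode_pair(p, -1.0)     # round trip back to the first score
print(p, t1)
```

Two excerpts with equal scores encode to a probability of exactly 0.5, so the model is only asked to express relative, not absolute, difficulty.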

[1]: https://www.kaggle.com/c/commonlitreadabilityprize/discussion/240423
[2]: https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model

# Load Libraries

In [None]:
VER = 162
VER2 = 0

# if you set a path here, we load a trained model instead of training one
# to train a model in this notebook, use None instead of a path
LOAD_MODEL_PATH = '../input/roberta-base-162/' #or None
COMPUTE_OOF = False
PREDICT_TEST = True

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt, seaborn as sns
from sklearn.model_selection import KFold
from transformers import RobertaTokenizer, TFRobertaModel

import tensorflow as tf
print('TF',tf.__version__)

# Enable GPU mixed precision
For faster training and larger batch sizes, we will use mixed precision. On Kaggle's P100 GPU we do not get much benefit, but a V100 or A100 will benefit.

In [None]:
tf.config.optimizer.set_experimental_options({"auto_mixed_precision": True})
print('mixed precision enabled')

# Load Train and Test Data

In [None]:
train = pd.read_csv('../input/commonlitreadabilityprize/train.csv')
print( train.shape )
train.head()

In [None]:
test = pd.read_csv('../input/commonlitreadabilityprize/test.csv')
print( test.shape )
test.head()

In [None]:
if len(test)>7: 
    COMPUTE_OOF = False
    PREDICT_TEST = True

# Display Bradley-Terry values (i.e. comp targets)

In [None]:
plt.hist(train.target.values,bins=100)
plt.title('Bradley-Terry values',size=16)
plt.show()

# DataLoader
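We need a dataloader that feeds random pairs of excerpts into our NLP model. For each excerpt in a batch, it picks a second excerpt at random and converts the two Bradley-Terry targets into the pair's target probability.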

In [None]:
MAX_TOK = 250
tokenizer = RobertaTokenizer.from_pretrained('../input/tfroberta-base')

class DataGenerator(tf.keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, df, batch_size=32, shuffle=False, tokenizer=tokenizer): 

        self.df = df.reset_index(drop=True)
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.tokenizer = tokenizer
        self.on_epoch_end()
        
    def __len__(self):
        'Denotes the number of batches per epoch'
        ct = int( np.ceil( len(self.df) / self.batch_size ) )
        return ct

    def __getitem__(self, index):
        'Generate one batch of data'
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        X, y = self.__data_generation(indexes)        
        return X,y

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange( len(self.df ) )
        if self.shuffle: np.random.shuffle(self.indexes)
            
    def __data_generation(self, indexes):
        'Generates data containing batch_size samples' 
        
        X = np.ones((len(indexes),MAX_TOK*2),dtype='int32')
        y = np.zeros(len(indexes),dtype='float32')
        
        y1 = np.zeros(len(indexes),dtype='float32')
        y2 = np.zeros(len(indexes),dtype='float32')
        
        df = self.df.loc[indexes]
        text = df.excerpt.values
            
        for k in range(len(text)):
            # ENCODE FIRST EXCERPT
            use = text[k]
            tk = self.tokenizer.encode(use)
            ln = min(MAX_TOK,len(tk))
            X[k,:ln] = tk[:MAX_TOK]
            
            # RANDOM PICK AND ENCODE SECOND EXCERPT FOR COMPARISON
            rw = np.random.randint(0,len(self.df))
            use = self.df.excerpt.values[rw]
            tk = self.tokenizer.encode(use)
            ln2 = min(MAX_TOK,len(tk))
            X[k,ln] = 2
            X[k,ln+1:ln+ln2] = tk[1:MAX_TOK]
            y2[k] = self.df.target.values[rw]
        y1 = df.target.values
            
        for k in range(len(indexes)):
            # CONVERT BRADLEY-TERRY VALUES INTO PROBABILITIES
            # THAT FIRST EXCERPT IS MORE DIFFICULT THAN SECOND EXCERPT
            t = np.exp( y1[k]-y2[k] )
            y[k,] = t/(t+1)
          
        return X,y

# Display Dataloader
Our dataloader returns pairs of encoded sentences. The RoBERTa tokenizer encodes a sentence pair as a CLS token `[0]`, then the first sentence, then two separator tokens `[2][2]`, then the second sentence, and then a final separator token `[2]`. The cells below check that our dataloader produces the same layout as the tokenizer.
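
The packing logic can be sketched without the tokenizer. This is a minimal sketch, assuming each excerpt has already been encoded on its own (so it starts with `0` and ends with `2`); the function name `join_pair` and the token ids `101`, `102`, `201` are made up for illustration.

```python
import numpy as np

PAD, SEP = 1, 2   # RoBERTa padding and separator token ids used in this notebook

def join_pair(tk_a, tk_b, max_tok=250):
    """Pack two encoded excerpts as [0] A [2][2] B [2], padded with PAD."""
    x = np.full(max_tok * 2, PAD, dtype='int32')
    ln = min(max_tok, len(tk_a))
    ln2 = min(max_tok, len(tk_b))
    x[:ln] = tk_a[:max_tok]                # first excerpt, with its CLS and SEP
    x[ln] = SEP                            # second separator between excerpts
    x[ln + 1:ln + ln2] = tk_b[1:max_tok]   # second excerpt without its CLS
    return x

a = [0, 101, 102, 2]   # hypothetical encoding of the first excerpt
b = [0, 201, 2]        # hypothetical encoding of the second excerpt
x = join_pair(a, b, max_tok=8)
mask = (x != PAD).astype('int32')   # attention mask: 1 for real tokens, 0 for padding
print(x[:9], mask[:9])
```

Padding with `1` is what lets the model derive its attention mask from the token ids alone.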

In [None]:
fake_data = pd.DataFrame(columns=['id','excerpt','target'])
fake_data.loc[0] = [1,'Sentence one',1]
fake_data.loc[1] = [2,'Sentence two',-1]
fake_data.head()

In [None]:
trn = DataGenerator(fake_data)
for b in trn:
    break
b[0][0,:16]

In [None]:
tokenizer.encode('Sentence one','Sentence two')

# Make Sentence Compare Model

In [None]:
def build_model(use_cls_token=False):
    
    tokens = tf.keras.layers.Input(shape=(MAX_TOK*2,), name = 'tokens', dtype=tf.int32)
    masks = tf.cast( tokens!=1, dtype=tf.int32 )
    
    bert_model = TFRobertaModel.from_pretrained('../input/tfroberta-base') 
    
    x = bert_model(tokens, attention_mask=masks)  

    # BUILD HEAD WITH EITHER CLS TOKEN OR MEAN LAST LAYER TOKENS
    if use_cls_token:
        x = x[0][:,0,:]
        x = tf.keras.layers.Dense(768,activation='tanh')(x)
    else: #use mean last layer
        x = tf.keras.layers.GlobalAveragePooling1D()(x[0])
        
    x = tf.keras.layers.Dense(1, activation='sigmoid', dtype='float32')(x)
    
    model = tf.keras.Model(inputs=tokens, outputs=x)
    model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 2e-5),
                  loss = [tf.keras.losses.BinaryCrossentropy()]) 
    
    return model, bert_model

# Train Model

In [None]:
if LOAD_MODEL_PATH is None:
    skf = KFold(n_splits=5, random_state=42, shuffle=True)

    for fold, (idx_t, idx_v) in enumerate(skf.split(train)):
    
        print('#'*25)
        print('### FOLD',fold+1)
        print('#'*25)
    
        model,_ = build_model()
    
        # NOTE: DataGenerator HAS NO `augment` ARGUMENT, SO WE DO NOT PASS ONE
        train_gen = DataGenerator(train.iloc[idx_t], shuffle=True, batch_size=8)
        val_gen = DataGenerator(train.iloc[idx_v], batch_size=16)
        
        sv = tf.keras.callbacks.ModelCheckpoint(
            'RoBERTa_Base_v%i_v%i_fold%i.h5'%(VER,VER2,fold), monitor='val_loss', verbose=1, 
            save_best_only=True, save_weights_only=True, mode='auto', save_freq='epoch'
        )
        rop = tf.keras.callbacks.ReduceLROnPlateau(
            monitor='val_loss', factor=0.3, patience=2, verbose=1,
            mode='auto', min_delta=0.0001, cooldown=0, min_lr=0
        )
        
        model.fit(train_gen, validation_data = val_gen,
            epochs=10, verbose=1, callbacks=[sv,rop],
            use_multiprocessing=True, workers=2)

# Inference
Inference is a little tricky. For each test excerpt that we wish to predict, we compare it with 16 train excerpts: 8 with the test excerpt in the first position and 8 with it in the second position. We then deduce its Bradley-Terry value from those 16 comparisons.
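
The decoding step can be sketched in NumPy as follows. This is a minimal sketch with a simulated, perfect model rather than real predictions; the names `logit` and `estimate_score` are hypothetical, and clipping with `eps` is an added assumption to avoid infinite logits.

```python
import numpy as np

def logit(p, eps=1e-6):
    p = np.clip(p, eps, 1.0 - eps)   # assumption: keep the log finite
    return np.log(p / (1.0 - p))

def estimate_score(p_first, t_first, p_second, t_second):
    """Average the Bradley-Terry estimates from both orderings.

    p_first:  model outputs with the unknown excerpt in the first position
    p_second: model outputs with the unknown excerpt in the second position
    t_first, t_second: known targets of the paired train excerpts
    """
    est_a = t_first + logit(p_first)     # p = sigmoid(t_unknown - t_ref)
    est_b = t_second - logit(p_second)   # p = sigmoid(t_ref - t_unknown)
    return float(np.mean(np.concatenate([est_a, est_b])))

def sigmoid(d):
    return 1.0 / (1.0 + np.exp(-d))

# Simulate a perfect model for an excerpt whose true score is -0.7
rng = np.random.default_rng(0)
true_t = -0.7
t_first = rng.normal(-1.0, 0.5, size=8)
t_second = rng.normal(-1.0, 0.5, size=8)
est = estimate_score(sigmoid(true_t - t_first), t_first,
                     sigmoid(t_second - true_t), t_second)
print(est)
```

With a real model each of the 16 estimates is noisy, which is why averaging over several comparisons in both orderings helps.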

In [None]:
mn = train.target.mean()
st = train.target.std()
RANGE_LOW = mn-st
RANGE_HIGH = mn+st
COMPARE = 8

In [None]:
class DataGenerator2(tf.keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, df1, df2, batch_size=COMPARE, tokenizer=tokenizer, log=None): 

        self.df1 = df1.reset_index(drop=True)
        self.df2 = df2.reset_index(drop=True)
        self.batch_size = batch_size
        self.tokenizer = tokenizer
        # AVOID A SHARED MUTABLE DEFAULT; CALLERS PASS THEIR OWN LIST TO COLLECT TARGETS
        self.log = log if log is not None else []
        
    def __len__(self):
        'Denotes the number of batches per epoch'
        return len(self.df1)*2

    def __getitem__(self, index):
        'Generate one batch of data'
        X = self.__data_generation(index)    
        return X
    
    def on_epoch_end(self):
        'Updates indexes after each epoch'
        pass

    def __data_generation(self,index):
        'Generates data containing batch_size samples' 
        
        idx = index//2
        mode = index%2
        X = np.ones((self.batch_size,MAX_TOK*2),dtype='int32')
        
        use1 = self.df1.excerpt.values[idx]
        if mode==0:
            tk = self.tokenizer.encode(use1)
            ln = min(MAX_TOK,len(tk))
            for k in range(self.batch_size):
                
                # FIRST SENTENCE
                X[k,:ln] = tk[:MAX_TOK]
            
                # SECOND SENTENCE
                rw = self.df2.loc[(self.df2.target>RANGE_LOW)&(self.df2.target<RANGE_HIGH)].sample(1).index[0]
                use = self.df2.excerpt.values[rw]
                tk2 = self.tokenizer.encode(use)
                ln2 = min(MAX_TOK,len(tk2))
                X[k,ln] = 2
                X[k,ln+1:ln+ln2] = tk2[1:MAX_TOK]
                
                # RECORD TARGET
                t = self.df2.target.values[rw]
                self.log.append(t)
            
        else:
            tk = self.tokenizer.encode(use1)
            ln2 = min(MAX_TOK,len(tk))
            for k in range(self.batch_size):
                
                # FIRST SENTENCE
                rw = self.df2.loc[(self.df2.target>RANGE_LOW)&(self.df2.target<RANGE_HIGH)].sample(1).index[0]
                use = self.df2.excerpt.values[rw]
                tk2 = self.tokenizer.encode(use)
                ln = min(MAX_TOK,len(tk2))
                X[k,:ln] = tk2[:MAX_TOK]
            
                # SECOND SENTENCE
                X[k,ln] = 2
                X[k,ln+1:ln+ln2] = tk[1:MAX_TOK]
                
                # RECORD TARGET
                t = self.df2.target.values[rw]
                self.log.append(t)
                                  
        return X

# Predict Test Data

In [None]:
if PREDICT_TEST:
    FOLDS = 5
    skf = KFold(n_splits=FOLDS, random_state=42, shuffle=True)
    test_preds = np.zeros(len(test))

    for fold, (idx_t, idx_v) in enumerate(skf.split(train)):

        log = []
        test_gen = DataGenerator2(test,train.iloc[idx_t], log=log)
        model,_ = build_model()

        if LOAD_MODEL_PATH is not None:
            model.load_weights(LOAD_MODEL_PATH+'RoBERTa_Base_v%i_v%i_fold%i.h5'%(VER,VER2,fold))
        else:
            model.load_weights('RoBERTa_Base_v%i_v%i_fold%i.h5'%(VER,VER2,fold))

        pr = model.predict(test_gen, verbose=1)
        log = log[COMPARE:] #remove first batch since dataloader double wrote
    
        # COMPUTE PREDICTIONS
        log = np.array(log)
        for k in range(len(test)):
            #if k%10==0: print(k,', ',end='')
        
            a = COMPARE*2*k
            preds2 = []
    
            t1 = log[a:a+COMPARE]
            p1 = pr[a:a+COMPARE,0]
            for j in range(COMPARE):
                tmp = np.log( p1[j] / (1-p1[j]) ) + t1[j]
                preds2.append(tmp)
    
            a += COMPARE
            t2 = log[a:a+COMPARE]
            p2 = pr[a:a+COMPARE,0]
            for j in range(COMPARE):
                tmp = t2[j] - np.log( p2[j] / (1-p2[j]) ) 
                preds2.append(tmp)
    
            test_preds[k] += np.mean(preds2) / FOLDS

# Write Submission File

In [None]:
test['target'] = test_preds
test[['id','target']].to_csv('submission.csv',index=False)

In [None]:
sub = pd.read_csv('submission.csv')
sub.head()

# Compute OOF Score
For each excerpt in the validation fold, we will compare it with 16 excerpts from the train fold. Based on these 16 comparisons, we will compute its Bradley-Terry score.

In [None]:
if COMPUTE_OOF:
    skf = KFold(n_splits=5, random_state=42, shuffle=True)
    oof = np.zeros(len(train))

    for fold, (idx_t, idx_v) in enumerate(skf.split(train)):

        log = []
        val_gen = DataGenerator2(train.iloc[idx_v],train.iloc[idx_t], log=log)
        model,_ = build_model()

        if LOAD_MODEL_PATH is not None:
            model.load_weights(LOAD_MODEL_PATH+'RoBERTa_Base_v%i_v%i_fold%i.h5'%(VER,VER2,fold))
        else:
            model.load_weights('RoBERTa_Base_v%i_v%i_fold%i.h5'%(VER,VER2,fold))

        pr = model.predict(val_gen, verbose=1)
        log = log[COMPARE:] #remove first batch since dataloader double wrote
    
        # COMPUTE PREDICTIONS
        preds = np.zeros(len(idx_v))
        log = np.array(log)
        for k in range(len(idx_v)):
            #if k%10==0: print(k,', ',end='')
        
            a = COMPARE*2*k
            preds2 = []
    
            t1 = log[a:a+COMPARE]
            p1 = pr[a:a+COMPARE,0]
            for j in range(COMPARE):
                tmp = np.log( p1[j] / (1-p1[j]) ) + t1[j]
                preds2.append(tmp)
    
            a += COMPARE
            t2 = log[a:a+COMPARE]
            p2 = pr[a:a+COMPARE,0]
            for j in range(COMPARE):
                tmp = t2[j] - np.log( p2[j] / (1-p2[j]) ) 
                preds2.append(tmp)
    
            preds[k] = np.mean(preds2) 
 
        rmse = np.sqrt(np.mean( (train.target.values[idx_v] - preds)**2 ))
        print(); print(f'FOLD {fold+1} OOF rmse',rmse)
    
        oof[idx_v] = preds
    
    
    print(); print('#'*25)
    rmse = np.sqrt(np.mean( (train.target.values - oof)**2 ))
    print('OOF rmse',rmse)
    print('#'*25)

# Visualize Sentence Comparison Inference
To predict one test excerpt, we infer many comparisons. Let's plot those comparisons. Each plot below shows the inference for one validation (OOF) excerpt. We compare each validation excerpt with 64 train excerpts. (We don't need this many for inference, but more comparisons make better plots.) We then plot (1) the density of the targets of the excerpts among the 64 that the model predicts are easier to read, (2) the density of the targets of those it predicts are more difficult to read, and (3) a black line marking the true target of the OOF excerpt being compared.
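
The split into the two groups can be sketched as follows. This is a minimal sketch with a simulated, perfect model; the function name `split_over_under` is hypothetical, while the names `over` and `under` follow the plotting cell below. With a perfect model, the true target falls exactly between the two groups, which is the pattern to look for in the plots.

```python
import numpy as np

def split_over_under(p_first, t_first, p_second, t_second):
    """Split reference targets into those predicted over and under the unknown excerpt."""
    # Unknown excerpt first: p < 0.5 means the reference scores higher.
    # Unknown excerpt second: p > 0.5 means the reference scores higher.
    over = np.concatenate([t_first[p_first < 0.5], t_second[p_second > 0.5]])
    under = np.concatenate([t_first[p_first > 0.5], t_second[p_second < 0.5]])
    return over, under

def sigmoid(d):
    return 1.0 / (1.0 + np.exp(-d))

true_t = -0.7                              # made-up true target
t_first = np.linspace(-3.0, 1.0, 32)       # made-up reference targets
t_second = np.linspace(-2.9, 1.1, 32)
over, under = split_over_under(sigmoid(true_t - t_first), t_first,
                               sigmoid(t_second - true_t), t_second)
print(len(over), len(under), under.max(), over.min())
```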

In [None]:
COMPARE = 32 # we will compare twice this many
DISPLAY = 5 # we will display thrice this many

if COMPUTE_OOF:
    # GET FIRST FOLD
    skf = KFold(n_splits=5, random_state=42, shuffle=True)
    for fold, (idx_t, idx_v) in enumerate(skf.split(train)): break
        
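    # PREDICT 15 EXCERPTS FROM FIRST FOLD
    # LOAD FIRST-FOLD WEIGHTS SO THESE EXCERPTS ARE TRULY OUT OF FOLD
    # (OTHERWISE `model` STILL HOLDS THE LAST FOLD'S WEIGHTS FROM THE OOF LOOP)
    model.load_weights((LOAD_MODEL_PATH or '')+'RoBERTa_Base_v%i_v%i_fold%i.h5'%(VER,VER2,fold))
    log = [] # save compared targets here
    val_gen = DataGenerator2(train.iloc[idx_v[:DISPLAY*3]],train.iloc[idx_t], batch_size=COMPARE, log=log)
    pr = model.predict(val_gen, verbose=1)
    log = log[COMPARE:] # remove first batch of targets because dataloader wrote twice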

In [None]:
if COMPUTE_OOF:
    log = np.array(log)
    for k in range(DISPLAY*3): 
        if k%3==0: plt.figure(figsize=(20,5))

        # FIND ALL EXCERPTS OVER AND UNDER
        # WITH LEFT COMPARE
        a = COMPARE*2*k
        t1 = log[a:a+COMPARE]
        p1 = pr[a:a+COMPARE,0]
        over1 = t1[p1<0.5]
        under1 = t1[p1>0.5]
    
        # FIND ALL EXCERPTS OVER AND UNDER
        # WITH RIGHT COMPARE
        a += COMPARE
        t2 = log[a:a+COMPARE]
        p2 = pr[a:a+COMPARE,0]
        over2 = t2[p2>0.5]
        under2 = t2[p2<0.5]    
        over = np.concatenate([over1,over2])
        under = np.concatenate([under1,under2])
    
        # PLOT EXCERPTS OVER AND UNDER
        plt.subplot(1,3,k%3+1)
        sns.kdeplot(under,label='under')
        sns.kdeplot(over,label='over')
        x = train.target.values[idx_v[k]]
        yy = plt.ylim()
        plt.plot([x,x],[yy[0],yy[1]],color='black')
        plt.legend()
        if k%3==2: plt.show()