The English text contains some errors. We'll see how they came about, and try to correct them.

# Data cleanup

First we need to download and import the necessary libraries, and the datasets.

In [None]:
pip install pyspellchecker

In [None]:
import pandas as pd
from spellchecker import SpellChecker
import regex as re

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
train = pd.read_csv('../input/contradictory-my-dear-watson/train.csv')
test = pd.read_csv('../input/contradictory-my-dear-watson/test.csv')

Here we'll just be working with the English entries. We could also do this for the non-English entries after translating.

In [None]:
train

In [None]:
eng = train.loc[train.language == 'English'].copy()  # copy so that later column assignments don't write to a view of train

Now we'll load the spellchecker. The default list is incomplete so we'll load a dictionary from a separate text file from https://github.com/dwyl/english-words

In [None]:
spell = SpellChecker()  # loads default word frequency list
spell.word_frequency.load_text_file('../input/english-words/words_alpha.txt')
spell.word_frequency.load_words([''])  # register the empty string so the empty tokens left by splitting are not flagged

We want to check for misspelled words, but first we split on punctuation to get just the words back. We also remove things like numbers and dollar signs as things like $10 will be flagged as misspelled. Additionally, we remove the square brackets and apostrophes as this makes processing the words into the spellchecker easier.
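
To see what this tokenisation does to a single sentence, here is a small sketch using only the standard library. The sample sentence and the tiny `known` word set are made up for illustration, and the set lookup stands in for the spellchecker, so this is not what pyspellchecker does internally.

```python
import re

# Characters we split on (punctuation, whitespace, '$' and digits) and
# characters we strip beforehand (apostrophes and square brackets).
SPLIT_PATTERN = r'[!:;,.\-% \s()"/$0-9]'
STRIP_PATTERN = r"['\[\]]"

def tokenize(sentence):
    """Strip apostrophes/brackets, split on punctuation, drop empty tokens."""
    cleaned = re.sub(STRIP_PATTERN, '', sentence)
    return [tok for tok in re.split(SPLIT_PATTERN, cleaned) if tok]

def unknown_words(sentence, known):
    """Return the lower-cased tokens that are not in the known-word set."""
    return {tok.lower() for tok in tokenize(sentence)} - known

# Hypothetical sentence and dictionary, for illustration only.
sample = "It costs $10 [approx.], doesn't it? Caf??s are open."
known = {"it", "costs", "approx", "doesnt", "are", "open"}

tokens = tokenize(sample)
flagged = unknown_words(sample, known)
print(tokens)
print(flagged)
```

Because the question mark is not in the split pattern, both a word followed by '?' and a word containing '??' end up flagged, which is the behaviour the premise check relies on.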

In [None]:
eng['premise_misspelled'] = eng.premise.apply(lambda sentence: tuple(spell.unknown(re.split('[!\:;,.\-\% \b\s()\"/$0-9]',re.sub('[\'\[\]]', '', sentence)))))

In [None]:
misspelled_df = eng.loc[eng.premise_misspelled != ()]
list(misspelled_df.premise_misspelled)

By looking through the list, we see one source of misspellings is when we have a double question mark, which we will see comes from accented characters. Another source is a '?\xad' (which specifies a soft hyphen).
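
As a quick check of what '\xad' is, the sketch below looks the character up in the Unicode database and strips a '?\xad' pair from a made-up string (the sample text is an assumption, not a row from the dataset).

```python
import unicodedata

# '\xad' is code point U+00AD; it is usually invisible when rendered.
soft_hyphen = "\xad"
name = unicodedata.name(soft_hyphen)

# Hypothetical sample containing the '?' + soft hyphen pair.
sample = "Flam?\xadboy?\xadant Gothic"
cleaned = sample.replace("?\xad", "")

print(name)
print(repr(sample), "->", repr(cleaned))
```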

In [None]:
pd.set_option('display.max_colwidth', None) #None (rather than the deprecated -1) lets us see the full sentences of the dataframes

In [None]:
err1 = misspelled_df.loc[misspelled_df.premise.str.contains(r'\?\?')] #Let the double question mark errors be err1
err2 = misspelled_df.loc[misspelled_df.premise.str.contains('\xad')] #and those containing \xad be err2
print(len(err1), len(err2))

There are not too many of these, and we can fix them individually. Let's see what is left after removing these guys.

In [None]:
reduced_1 = pd.concat([misspelled_df, err1, err1]).drop_duplicates(keep=False)
reduced_2 = pd.concat([reduced_1, err2, err2]).drop_duplicates(keep=False)
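
The concat-then-drop-duplicates trick above is a set difference on rows. A minimal sketch of an equivalent boolean-mask version follows, on a made-up four-row frame (the sentences and index values are assumptions); it only illustrates that the two approaches agree when every row is unique.

```python
import pandas as pd

# Hypothetical miniature of misspelled_df.
df = pd.DataFrame(
    {"premise": ["Caf??s are open", "Cham?\xadbord castle",
                 "Teh cat sat", "A croseover"]},
    index=[3, 7, 12, 20],
)

has_double_q = df.premise.str.contains(r"\?\?")
has_soft_hyphen = df.premise.str.contains("\xad")

# Keep the rows that match neither error pattern.
remaining = df.loc[~(has_double_q | has_soft_hyphen)]

# Same result via the concat / drop_duplicates(keep=False) trick.
err = df.loc[has_double_q | has_soft_hyphen]
via_concat = pd.concat([df, err, err]).drop_duplicates(keep=False)

print(remaining)
```

The mask version does not depend on rows being unique, whereas `drop_duplicates(keep=False)` would also discard any genuinely repeated rows.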

We can look through misspellings once the common errors are removed.

In [None]:
list(reduced_2.premise_misspelled)

Looking through, we see that the marked mistakes are mostly either an actual word followed by a question mark, proper nouns, acronyms or words joined together. Most of these we are fine to leave as is. Only the words joined together might be worth correcting, but after tokenisation we hope that the meaning is mostly captured there. 

Regarding spelling errors, I've only spotted one so far: in index 10307, we have 'Behind the cathedral, croseover the Rue de la Republique to the 15th-century Eglise Saint-Maclou, the richest example of Flam­boy­ant Gothic in the country.', where 'croseover' should be 'cross over', but that's not to say there aren't more errors.

So only the mistakes identified should change the meaning.
From here, we just need to correct mistakes of the first kind:

In [None]:
err1

Replacing the ?? with an 'e' fixes most of these. There are just 9 corrections left: rhone (occurring three times), ataturk (occurring twice), madrileno (occurring twice), alacahoyuk and alcudia. We also see why these mistakes appeared: all of these words should be accented. Most need the French 'é', but we should also have 'Rhône', 'Atatürk', 'Madrileño', 'Alacahöyük' and 'Alcúdia'.

In [None]:
correction = err1.premise.apply(lambda sentence: re.sub('\?\?', 'e', sentence))
correct_rhone = correction.apply(lambda sentence: re.sub('Rhene', 'Rhone', sentence))
correct_ataturk = correct_rhone.apply(lambda sentence: re.sub('Ataterk', 'Ataturk', sentence))
correct_madrileno = correct_ataturk.apply(lambda sentence: re.sub('Madrileeo', 'Madrileno', sentence))
correct_alacahoyuk = correct_madrileno.apply(lambda sentence: re.sub('Alacaheyek', 'Alacahoyuk', sentence))
correct_alcudia = correct_alacahoyuk.apply(lambda sentence: re.sub('Alcedia', 'Alcudia', sentence))
correction1 = correct_alcudia
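
The same chain of substitutions can also be written as a lookup table, which is easier to extend if more exceptions turn up. This is a sketch only: the helper name `fix_double_question_marks` and the sample sentence are made up, while the five word pairs are the ones used above.

```python
import re

# Words where '??' -> 'e' gives the wrong letter, mapped to their fix.
SPECIAL_CASES = {
    "Rhene": "Rhone",
    "Ataterk": "Ataturk",
    "Madrileeo": "Madrileno",
    "Alacaheyek": "Alacahoyuk",
    "Alcedia": "Alcudia",
}
_special = re.compile("|".join(map(re.escape, SPECIAL_CASES)))

def fix_double_question_marks(sentence):
    """Replace '??' with 'e', then repair the known exceptions."""
    replaced = sentence.replace("??", "e")
    return _special.sub(lambda m: SPECIAL_CASES[m.group(0)], replaced)

# Hypothetical sentence, for illustration only.
fixed = fix_double_question_marks("The caf?? by the Rh??ne honours Atat??rk.")
print(fixed)
```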

And then we update the dataframe

In [None]:
for i in correction1.index:
    eng.loc[eng.index == i, 'premise'] = correction1.loc[correction1.index == i]
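
Since the corrected Series keeps the index of the rows it came from, the row-by-row loop can also be expressed as one index-aligned assignment. A minimal sketch on a made-up frame (the values and index labels are assumptions):

```python
import pandas as pd

# Hypothetical frame standing in for eng.
frame = pd.DataFrame({"premise": ["a??", "b", "c??"]}, index=[10, 11, 12])

# Corrected values for the affected rows only; the index is preserved.
mask = frame.premise.str.contains(r"\?\?")
fixed = frame.loc[mask, "premise"].str.replace("??", "e", regex=False)

# Write the corrections back in one step, matched on index labels.
frame.loc[fixed.index, "premise"] = fixed
print(frame)
```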

And now for corrections of the second kind

In [None]:
err2 = eng.loc[eng.premise.str.contains("\?\xad", case = False)]
err2

It looks like somehow one of the double question marks was missed, in index 7534, 'Cham??bord'! Fortunately, it seems that we can simply remove the question marks, as with all the other question marks. (The only iffy one is 'Arab?', in index 1449, which should be Punta Arabí, but it looks like the mistake has been carried over into the test, so we won't bother.)

In [None]:
correction2 = err2.premise.apply(lambda sentence: re.sub('\?\xad', '', sentence))
for i in correction2.index:
    eng.loc[eng.index == i, 'premise'] = correction2.loc[correction2.index == i]

How about the hypotheses?

In [None]:
eng['hypothesis_misspelled'] = eng.hypothesis.apply(lambda sentence: tuple(spell.unknown(re.split('[\?!\:;,.\-\% \b\s()\"/$0-9]',re.sub('[\'\[\]]', '', sentence)))))
misspelled_hyp_df = eng.loc[eng.hypothesis_misspelled != ()]
list(misspelled_hyp_df.hypothesis_misspelled) #contains the 'misspelled' words

Again, some typos can be spotted, like 'availalbe' at id 11997 and 'asssess'. We'll hope that whatever encoding is being used will not get too tripped up by these. Taking the length of this list gives 636 flagged entries, far fewer than for the premises.

After searching for mistakes of the kind we fixed in the 'premise', it looks like we don't get any errors of the form we corrected earlier. There is a small correction worth making for the hypotheses though: 'Ile de Re' is recorded as 'Ile de R' in the hypotheses where the isle is mentioned.

In [None]:
eng.loc[eng.hypothesis.str.contains('Ile de R', case = False)]

In [None]:
eng.loc[eng.hypothesis == 'Ile de R is no longer part of the attraction.', 'hypothesis'] = 'Ile de Re is no longer part of the attraction.'
eng.loc[eng.hypothesis == 'Ile de R.', 'hypothesis'] = 'Ile de Re.'

One last thing we can do is remove brackets where they occur, noting that the meaning doesn't change if they are taken out, and remove the &amp from the premise and replace with the word 'and'.

In [None]:
square_brackets = eng.loc[eng.premise.str.contains('[\[\]]', case = False)]
for i in square_brackets.index:
     eng.loc[eng.index == i, 'premise'] = re.sub('[\[\]]', '', str(eng.loc[eng.index == i].premise.values[0]))
ampersands = eng.loc[eng.premise.str.contains('\&amp', case = False)]
for i in ampersands.index:
    eng.loc[eng.index == i, 'premise'] = re.sub('\&amp', ' and ', str(eng.loc[eng.index == i].premise.values[0]))
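
For comparison, the same two clean-ups can be done with vectorised string methods. The sketch below runs on a made-up Series; the final whitespace-collapsing step is an extra assumption that the loops above do not perform.

```python
import pandas as pd

# Hypothetical premises, for illustration only.
premises = pd.Series([
    "He [the driver] left.",
    "Salt &amp pepper",
    "Plain text.",
])

cleaned = (
    premises
    .str.replace(r"[\[\]]", "", regex=True)   # drop square brackets
    .str.replace("&amp", " and ", regex=False)  # spell out the ampersand
    .str.replace(r"\s+", " ", regex=True)     # collapse doubled spaces (extra step)
)
print(cleaned.tolist())
```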

Finally, we drop the misspelled columns and update the training frame with the cleaned data.

In [None]:
eng = eng.drop(columns=['premise_misspelled', 'hypothesis_misspelled'])  # drop returns a new frame, so reassign
for i in eng.index:
    train.loc[train.index == i] = eng.loc[eng.index == i]

**Cleaning the test set**

In [None]:
eng_test = test.loc[test.language == 'English'].copy()  # copy so that the assignments below don't write to a view of test

In [None]:
err1 = eng_test.loc[eng_test.premise.str.contains('\?\?', case = False)]
correction = err1.premise.apply(lambda sentence: re.sub('\?\?', 'e', sentence))
correct_alacahoyuk = correction.apply(lambda sentence: re.sub('Alacaheyek', 'Alacahoyuk', sentence))
correct_madrileno = correct_alacahoyuk.apply(lambda sentence: re.sub('Madrileeo', 'Madrileno', sentence))
correct_alcudia = correct_madrileno.apply(lambda sentence: re.sub('Alcedia', 'Alcudia', sentence))
correct_ataturk = correct_alcudia.apply(lambda sentence: re.sub('Ataterk', 'Ataturk', sentence))
for i in correct_ataturk.index:
    eng_test.loc[eng_test.index == i, 'premise'] = correct_ataturk.loc[correct_ataturk.index == i]

In [None]:
err2 = eng_test.loc[eng_test.premise.str.contains('\?\xad', case = False)]
correction = err2.premise.apply(lambda sentence: re.sub('\?\xad', '', sentence))
for i in correction.index:
    eng_test.loc[eng_test.index == i, 'premise'] = correction.loc[correction.index == i]

In [None]:
square_brackets = eng_test.loc[eng_test.premise.str.contains('[\[\]]', case = False)]
for i in square_brackets.index:
     eng_test.loc[eng_test.index == i, 'premise'] = re.sub('[\[\]]', '', str(eng_test.loc[eng_test.index == i].premise.values[0]))

In [None]:
ampersands = eng_test.loc[eng_test.premise.str.contains('\&amp', case = False)]
for i in ampersands.index:
    eng_test.loc[eng_test.index == i, 'premise'] = re.sub('\&amp', ' and ', str(eng_test.loc[eng_test.index == i].premise.values[0]))

In [None]:
for i in eng_test.index:
    test.loc[test.index == i] = eng_test.loc[eng_test.index == i]

Cool, so that's English. I wonder what the other languages are like regarding the two main errors identified?
It turns out that after checking the other languages, we don't get the same kinds of mistakes.
Let's save our cleaned dataframes.

In [None]:
train.to_csv('train_cleaned.csv',index=False)
test.to_csv('test_cleaned.csv',index=False)

# Data Augmentation

**Translation augmentation**

One way we can augment the data is by translating the premise-hypothesis pairs into a different language, following JohnM's notebook.

In [None]:
!pip install git+https://github.com/ssut/py-googletrans.git

In [None]:
from googletrans import Translator
from dask import bag, diagnostics
import numpy as np

In [None]:
def translate(words, dest):
    dest_choices = ['zh-cn',
                    'ar',
                    'fr',
                    'sw',
                    'ur',
                    'vi',
                    'ru',
                    'hi',
                    'el',
                    'th',
                    'es',
                    'de',
                    'tr',
                    'bg'
                    ]
    if not dest:
        dest = np.random.choice(dest_choices)
        
    translator = Translator()
    decoded = translator.translate(words, dest=dest).text
    return decoded


#TODO: use a dask dataframe instead of all this
def trans_parallel(df, dest):
    premise_bag = bag.from_sequence(df.premise.tolist()).map(translate, dest)
    hypo_bag =  bag.from_sequence(df.hypothesis.tolist()).map(translate, dest)
    with diagnostics.ProgressBar():
        premises = premise_bag.compute()
        hypos = hypo_bag.compute()
    df[['premise', 'hypothesis']] = list(zip(premises, hypos))
    return df

    
eng_trans = train.loc[train.lang_abv == "en"].copy() \
           .pipe(trans_parallel, dest=None)

non_eng_trans =  train.loc[train.lang_abv != "en"].copy() \
                .pipe(trans_parallel, dest='en')

#These two lines are not in JohnM's notebook and are here to update the language and lang_abv column for the new dataframes
eng_trans[['lang_abv', 'language']] = [['mx', 'Mixed']]*len(eng_trans)
non_eng_trans[['lang_abv', 'language']] = [['en', 'English']]*len(non_eng_trans)

#DataFrame.append was removed in pandas 2.0; concat with ignore_index also resets the index
train = pd.concat([train, eng_trans, non_eng_trans], ignore_index=True)
train.shape

**Synonym Augmentation**

We could also make new samples by swapping out words for synonyms, where synonymous words have been learned by some selected language model. We'll use the nlpaug library to do this.

In [None]:
!pip install git+https://github.com/makcedward/nlpaug

In [None]:
import nlpaug.augmenter.word as naw
import nlpaug.flow as nafc
from nlpaug.util import Action

There are a number of different things we can do: insert or substitute words based on contextual embeddings, replace words with synonyms, etc. However, we have to be careful not to change the meaning of the sentence too much. Substitution based on contextual embeddings *could* give a sentence with a similar meaning, but could also change the meaning completely, especially if the word being substituted is integral to the meaning of the sentence.

The two types of augmentation that are least likely to change the meaning too much are synonym substitution and contextual insertion. Let's take a look at them in action.

Insertion

There are a large number of models that nlpaug can use for contextual insertion: BERT, DistilBERT, RoBERTa and XLNet. Within these models you may choose cased or uncased, the size of the model, etc.

In [None]:
text = train.premise.values[0]
model = 'distilbert-base-uncased'
ins_aug = naw.ContextualWordEmbsAug(
    model_path=model, action="insert")
print("Original:")
print(text)
print("Augmented Text:")
for i in range(10):
    augmented_text = ins_aug.augment(text)
    print(augmented_text)

It looks as though most of these have the same meaning, but a couple have an opposite meaning, and some might not make sense. Additionally, tokens can be split, which introduces a text error.

Synonym

In [None]:
syn_aug = naw.SynonymAug(aug_src='wordnet')
augmented_text = syn_aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
for i in range(10):
    augmented_text = syn_aug.augment(text)
    print(augmented_text)

Synonym substitution is less likely to negate the meaning. Negation is more likely with contextual insertion, because the model inserts whatever sounds plausible, and the negation of a statement can be just as plausible as the statement itself. Here we are more likely to get nonsense, though, as the synonym is selected blindly and doesn't take into account the different usages of the same word.

The most striking example I found was when 'he' was replaced with 'atomic number 2', having interpreted 'he' as helium!
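
To make the failure mode concrete, here is a toy sketch of blind synonym substitution. The two-entry thesaurus and the function `naive_synonym_swap` are made up for illustration; this is not how nlpaug or WordNet are implemented, only the general idea of picking a synonym without looking at context.

```python
import random

# Hypothetical thesaurus: 'he' is listed with the senses of the symbol He.
TOY_THESAURUS = {
    "he": ["helium", "atomic number 2"],
    "big": ["large", "great"],
}

def naive_synonym_swap(sentence, thesaurus, rng):
    """Replace every word that has an entry with a randomly chosen synonym."""
    out = []
    for word in sentence.split():
        options = thesaurus.get(word.lower())
        out.append(rng.choice(options) if options else word)
    return " ".join(out)

rng = random.Random(0)  # seeded so repeated runs give the same choice
augmented = naive_synonym_swap("he bought a big house", TOY_THESAURUS, rng)
print(augmented)
```

The pronoun is always replaced by a chemistry term here because nothing in the lookup considers which sense of the word is meant.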

However, we'll press on. After all, machine learning is more of an art than a science!

In [None]:

def insert_augment(words, model):
    ins_aug = naw.ContextualWordEmbsAug(
        model_path=model, action="insert")    
    augmented_text = ins_aug.augment(words)
    return augmented_text

def ins_aug_parallel(df, model):
    premise_bag = bag.from_sequence(df.premise.tolist()).map(insert_augment, model)
    hypo_bag =  bag.from_sequence(df.hypothesis.tolist()).map(insert_augment, model)
    with diagnostics.ProgressBar():
        premises = premise_bag.compute()
        hypos = hypo_bag.compute()
    df[['premise', 'hypothesis']] = list(zip(premises, hypos))
    return df

eng_ins_aug = train.loc[train.lang_abv == "en"].copy() \
           .pipe(ins_aug_parallel, model='distilbert-base-uncased')

train = pd.concat([train, eng_ins_aug])  # DataFrame.append was removed in pandas 2.0

train.to_csv('train_cleaned_ins.csv',index=False)



In [None]:

def synonym_augment(words):
    syn_aug = naw.SynonymAug(
        aug_src = 'wordnet')    
    augmented_text = syn_aug.augment(words)
    return augmented_text

def syn_aug_parallel(df):
    premise_bag = bag.from_sequence(df.premise.tolist()).map(synonym_augment)
    hypo_bag =  bag.from_sequence(df.hypothesis.tolist()).map(synonym_augment)
    with diagnostics.ProgressBar():
        premises = premise_bag.compute()
        hypos = hypo_bag.compute()
    df[['premise', 'hypothesis']] = list(zip(premises, hypos))
    return df

eng_syn_aug = train.loc[train.lang_abv == "en"].copy() \
           .pipe(syn_aug_parallel)

train = pd.concat([train, eng_syn_aug])  # DataFrame.append was removed in pandas 2.0

train.to_csv('train_cleaned_ins_syn.csv',index=False)


# Train using TPU

We'll follow Shahules' kernel.

Import libraries:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from kaggle_datasets import KaggleDatasets
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
import transformers
from transformers import TFAutoModel, AutoTokenizer
from sklearn.model_selection import StratifiedKFold
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings('ignore')
import os

Set up TPUs:

In [None]:
# Detect hardware, return appropriate distribution strategy
try:
    # TPU detection. No parameters necessary if TPU_NAME environment variable is
    # set: this is always the case on Kaggle.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
    print('Running on TPU ', tpu.master())
except ValueError:
    # Default distribution strategy in Tensorflow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)

Define variables:

In [None]:
MODEL = 'jplu/tf-xlm-roberta-large'
EPOCHS = 8
MAX_LEN = 96

# Our batch size will depend on number of replicas
BATCH_SIZE= 16 * strategy.num_replicas_in_sync
AUTO = tf.data.experimental.AUTOTUNE

Skip loading datasets as we already have the ones we want.

So next we encode the training data:

In [None]:
#Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL)

In [None]:
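def quick_encode(df,maxlen=MAX_LEN):
    # Encode premise-hypothesis pairs as fixed-length input ids.
    # maxlen defaults to MAX_LEN so the arrays match the Input shape in build_model.
    values = df[['premise','hypothesis']].values.tolist()
    # padding/truncation replace the deprecated pad_to_max_length argument (transformers >= 3.0)
    tokens=tokenizer.batch_encode_plus(values,max_length=maxlen,padding='max_length',truncation=True)
    
    return np.array(tokens['input_ids'])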

x_train = quick_encode(train)
x_test = quick_encode(test)
y_train = train.label.values
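
The reason for a fixed maximum length is that every row of the id array must have the same number of columns. The toy function below sketches that idea on plain lists. It is not the tokenizer's implementation (the real one also keeps the closing special token when truncating), the ids are made up, and `pad_id=1` is an assumption about the padding id.

```python
def pad_or_truncate(token_ids, max_len, pad_id=1):
    """Return exactly max_len ids: cut long inputs, right-pad short ones."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))

# Hypothetical token ids, for illustration only.
short = pad_or_truncate([0, 581, 7515, 2], max_len=6)
long = pad_or_truncate(list(range(10)), max_len=6)

print(short)
print(long)
```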

Convert to tf.data.Dataset

In [None]:
def create_dist_dataset(X, y,val,batch_size= BATCH_SIZE):
    
    
    dataset = tf.data.Dataset.from_tensor_slices((X,y)).shuffle(len(X))
          
    if not val:
        dataset = dataset.repeat().batch(batch_size).prefetch(AUTO)
    else:
        dataset = dataset.batch(batch_size).prefetch(AUTO)

    
    
    return dataset



test_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_test))
    .batch(BATCH_SIZE)
)

Train the model

In [None]:
def build_model(transformer,max_len):
    
    input_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
    sequence_output = transformer(input_ids)[0]
    cls_token = sequence_output[:, 0, :]
    out = Dense(3, activation='softmax')(cls_token)

    # It's time to build and compile the model
    model = Model(inputs=input_ids, outputs=out)
    model.compile(
        Adam(learning_rate=1e-5),
        loss='sparse_categorical_crossentropy', 
        metrics=['accuracy']
    )
    
    return model

In [None]:
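# Single hold-out run; the 80/20 split is an assumption, not taken from the original kernel
x_tr, x_val, y_tr, y_val = train_test_split(
    x_train, y_train, test_size=0.2, stratify=y_train, random_state=777)
train_dataset = create_dist_dataset(x_tr, y_tr, val=False)
valid_dataset = create_dist_dataset(x_val, y_val, val=True)
with strategy.scope():
    model = build_model(TFAutoModel.from_pretrained(MODEL), max_len=MAX_LEN)
n_steps = len(x_tr) // BATCH_SIZE

train_history = model.fit(
    train_dataset,
    steps_per_epoch=n_steps,
    validation_data=valid_dataset,
    epochs=EPOCHS
)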

# K-fold validation

Following Shahules' kernel
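
As a reminder of what the stratified split guarantees, the sketch below runs `StratifiedKFold` with the same settings on made-up, perfectly balanced labels and counts the classes in each validation fold. The toy labels are an assumption; the real label distribution will differ.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical data: 10 examples for each of the 3 classes.
labels = np.array([0] * 10 + [1] * 10 + [2] * 10)
features = np.arange(len(labels)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=777)

# Count how many examples of each class land in every validation fold.
fold_counts = [
    np.bincount(labels[valid_ind], minlength=3).tolist()
    for _, valid_ind in skf.split(features, labels)
]
print(fold_counts)
```

Each validation fold keeps the class proportions of the full set, so the per-fold validation accuracies are comparable with each other.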

In [None]:
pred_test=np.zeros((test.shape[0],3))
skf = StratifiedKFold(n_splits=5,shuffle=True,random_state=777)
val_score=[]
history=[]


for fold,(train_ind,valid_ind) in enumerate(skf.split(x_train,y_train)):
    
    if fold < 4:
    
        print("fold",fold+1)
        
       
        tf.tpu.experimental.initialize_tpu_system(tpu)
        
        train_data = create_dist_dataset(x_train[train_ind],y_train[train_ind],val=False)
        valid_data = create_dist_dataset(x_train[valid_ind],y_train[valid_ind],val=True)
    
        Checkpoint=tf.keras.callbacks.ModelCheckpoint(f"roberta_base.h5", monitor='val_loss', verbose=0, save_best_only=True,
        save_weights_only=True, mode='min')
        
        with strategy.scope():
            transformer_layer = TFAutoModel.from_pretrained(MODEL)
            model = build_model(transformer_layer, max_len=MAX_LEN)
            
        

        n_steps = len(train_ind)//BATCH_SIZE
        print("training model {} ".format(fold+1))

        train_history = model.fit(
        train_data,
        steps_per_epoch=n_steps,
        validation_data=valid_data,
        epochs=EPOCHS,callbacks=[Checkpoint],verbose=0)
        
        print("Loading model...")
        model.load_weights(f"roberta_base.h5")
        
        

        print("fold {} validation acc {}".format(fold+1,np.mean(train_history.history['val_accuracy'])))
        print("fold {} validation loss {}".format(fold+1,np.mean(train_history.history['val_loss'])))
        
        history.append(train_history)

        val_score.append(np.mean(train_history.history['val_accuracy']))
        
        print('predict on test....')
        preds=model.predict(test_dataset,verbose=1)

        pred_test+=preds/4
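
The line `pred_test+=preds/4` averages the class probabilities of the four trained folds. A small numeric sketch of that averaging, followed by the argmax used for the final labels, is given below; the probability values are made up.

```python
import numpy as np

# Hypothetical softmax outputs from two folds for two test examples.
fold_preds = [
    np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]),
    np.array([[0.2, 0.7, 0.1], [0.1, 0.3, 0.6]]),
]

# Accumulate each fold's share, as in the training loop.
pred_test = np.zeros((2, 3))
for preds in fold_preds:
    pred_test += preds / len(fold_preds)

# Pick the class with the highest averaged probability.
labels = np.argmax(pred_test, axis=1)
print(pred_test)
print(labels)
```

Averaging probabilities rather than hard labels lets a confident fold outweigh an uncertain one, as happens for the first example here.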

# Evaluation

In [None]:
plt.figure(figsize=(15,10))

for i,hist in enumerate(history):

    plt.subplot(2,2,i+1)
    plt.plot(np.arange(EPOCHS),hist.history['accuracy'],label='train acc')
    plt.plot(np.arange(EPOCHS),hist.history['val_accuracy'],label='validation acc')
    plt.gca().title.set_text(f'Fold {i+1} accuracy curve')
    plt.legend()

In [None]:
plt.figure(figsize=(15,10))

for i,hist in enumerate(history):

    plt.subplot(2,2,i+1)
    plt.plot(np.arange(EPOCHS),hist.history['loss'],label='train loss')
    plt.plot(np.arange(EPOCHS),hist.history['val_loss'],label='validation loss')
    plt.gca().title.set_text(f'Fold {i+1} loss curve')
    plt.legend()

# Prediction

In [None]:
submission = pd.read_csv('/kaggle/input/contradictory-my-dear-watson/sample_submission.csv')
submission['prediction'] = np.argmax(pred_test,axis=1)
submission.head()

In [None]:
submission.to_csv('submission.csv',index=False)