# Watson's monolingual love

The competition dataset is multilingual, so the model should see a comparable number of samples for every language in order to generalize well. If it aligns to a single language, our estimates of its performance will be misleading.
The provided dataset contains a comparatively large share of English samples (Watson's monolingual love). In this notebook we upsample the other languages using the [XNLI corpus](https://cims.nyu.edu/~sbowman/xnli/).

Que. Why not BERT but RoBERTa and its successors?

Ans. Multilingual BERT has been very successful at producing multilingual embeddings, and it works well when a) there is a large number of samples and b) multilingual embeddings are sufficient. When we have either a small amount of data or need [crosslingual embeddings](https://www.reddit.com/r/LanguageTechnology/comments/f3epzu/what_is_the_difference_between_multilingual_and/), RoBERTa's successors such as XLM-R tend to perform better: they use a comparatively larger architecture and are trained on crosslingual embeddings, which take all languages into account for a given word vector.

### Scope for tweaks:

There are many things you could try, but here are a few that I think are worth attempting. If you have already tried these approaches, please suggest new ones in the comments.

1. <b>LR Scheduling:</b> Adding a learning-rate schedule can improve model performance.
2. <b>Model architecture:</b> Although the architecture used in this notebook (XLM-R) already performs well, you could try replacing it with one of the architectures listed [here](https://huggingface.co/transformers/pretrained_models.html). (As we use TFAutoModel, you only need to change the MODEL_NAME parameter to your model's name.)
3. <b>One Hot labels</b>: To use the categorical_crossentropy loss, the labels must be one-hot encoded.
4. <b>TTA</b>: We can use test-time augmentation to improve the model's predictions. (See References.)
5. <b>StratifiedKFold</b>: Cross-validation usually improves model performance. I am currently unable to run it due to resource limitations.
6. <b>More training data</b>: Adding more data could improve model performance, but keep the ratio of languages in the dataset in mind.
7. <b>Stop Words</b>: Stop words for some languages are not available in the nltk library, so we can either build them from word frequencies or use some better approach.
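
As a rough illustration of item 1, the sketch below prints the learning rate per epoch for a ramp-up-then-decay schedule. The constants are copied from `get_lr_callback`, defined later in this notebook. The function name `lr_schedule` and the batch size of 128 (16 per replica on an assumed 8 replicas) are assumptions made only for this example.

```python
def lr_schedule(epoch, batch_size=128, lr_start=1e-6, lr_min=1e-8,
                ramp_epochs=5, sustain_epochs=0, decay=0.8):
    """Linear ramp-up, optional plateau, then exponential decay."""
    lr_max = 1.25e-6 * batch_size  # peak learning rate scales with batch size
    if epoch < ramp_epochs:
        # Linear interpolation from lr_start towards lr_max
        return (lr_max - lr_start) / ramp_epochs * epoch + lr_start
    if epoch < ramp_epochs + sustain_epochs:
        return lr_max
    # Exponential decay from lr_max towards lr_min
    steps = epoch - ramp_epochs - sustain_epochs
    return (lr_max - lr_min) * decay ** steps + lr_min


schedule = [lr_schedule(epoch) for epoch in range(8)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.2e}")
```

Printing the values before training is a cheap way to check that the peak rate is sensible for the chosen batch size, since the peak grows linearly with it.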

**Updates:**
1. Added Wordcloud plot for multiple languages 
2. Used one hot encoded labels
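
To make the one-hot update concrete, here is a small plain-Python sketch of what a one-hot target looks like and how label smoothing changes it. The smoothing formula `y * (1 - eps) + eps / num_classes` is the one Keras documents for `CategoricalCrossentropy(label_smoothing=...)`, and the value 0.05 matches the one used in `build_model` below. The helper names `one_hot` and `smooth` are hypothetical.

```python
def one_hot(label, num_classes=3):
    """Encode an integer class label as a one-hot list."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def smooth(target, epsilon=0.05):
    """Move a little probability mass from the true class to all classes."""
    num_classes = len(target)
    return [t * (1.0 - epsilon) + epsilon / num_classes for t in target]


target = smooth(one_hot(2))
print(one_hot(2))
print([round(t, 4) for t in target])
```

The smoothed target still sums to 1, but the true class no longer has probability exactly 1, which discourages over-confident predictions.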

In [None]:
# Installing nlp library for loading external dataset.
!pip install -q nlp
!pip install -q wordcloud

In [None]:
# Importing Libs
import nlp
import numpy as np
import pandas as pd
import tensorflow as tf
import random, os, math, cv2
import matplotlib.pyplot as plt
import tensorflow.keras.backend as K
from wordcloud import WordCloud, STOPWORDS
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold, train_test_split
from transformers import BertTokenizer, TFBertModel, AutoTokenizer, TFAutoModel

# Text visualization
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords

In [None]:
# PREPARING TPU
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)

except ValueError:
    strategy = tf.distribute.get_strategy() # for CPU and single GPU

print('Number of replicas:', strategy.num_replicas_in_sync)

In [None]:
# Loading original data
train_csv = pd.read_csv("../input/contradictory-my-dear-watson/train.csv")
test_csv  = pd.read_csv("../input/contradictory-my-dear-watson/test.csv")

In [None]:
# CONFIGURATION

AUTO = tf.data.experimental.AUTOTUNE
MODEL_NAME = "jplu/tf-xlm-roberta-base"
REPLICAS  = strategy.num_replicas_in_sync
TOKENIZER  = AutoTokenizer.from_pretrained(MODEL_NAME)

# HYPER-PARAMS
BATCH_SIZE = 16 * REPLICAS
MAX_LEN = 192
EPOCHS = 8 # Limited by notebook running time. Please try to train the model for at least 5-10 epochs.
SEED = 48
FONT_DIR = "../input/font-dataset/FontScripts/"

np.random.seed(SEED)
random.seed(SEED)

In [None]:
def prepare_input_v2(sentences):
    """ Converts the premise and hypothesis to the input format required by model"""
    sen_enc = TOKENIZER.batch_encode_plus(sentences,
                                          pad_to_max_length=True,
                                          return_attention_mask=False,
                                          return_token_type_ids=False,
                                          max_length=MAX_LEN)
    return np.array(sen_enc["input_ids"])

def get_dataset(features, labels=None, labelled=True, batch_size=8, repeat=True, shuffle=True):
    """Generates a tf.data pipeline from the encoded sentences."""
    if labelled:
        ds = tf.data.Dataset.from_tensor_slices((features, labels))
    else:
        ds = tf.data.Dataset.from_tensor_slices(features)

    if repeat:
        ds = ds.repeat()
        
    if shuffle:
        ds = ds.shuffle(2048)
        
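    # Note: the batch size passed in is multiplied by the replica count here.
    ds = ds.batch(batch_size*REPLICAS)
    ds = ds.prefetch(AUTO)  # prefetch returns a new dataset, so reassign it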
    return ds


def build_model():
    """Prepare the model for fine-tuning."""
    encoder = TFAutoModel.from_pretrained(MODEL_NAME)
    input_word_ids = tf.keras.layers.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_word_ids")
    
    # Passing input to pretrained model.
    embeddings = encoder(input_word_ids)[0]
    x = embeddings[:, 0, :]
    
    output = tf.keras.layers.Dense(3, activation="softmax")(x)
    
    model = tf.keras.models.Model(inputs=input_word_ids, outputs=output)
    
    model.compile(loss=tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.05),
                  optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
                  metrics=["accuracy"])
    return model

def ratio_languages(df):
    """Plots a pie chart with the share of each language in the dataset"""
    languages = np.unique(df.language)
    total = df.language.value_counts().sum() 
    ratios = {}
    for e in languages:
        ratios[e] = round((df.language.value_counts().loc[e] / total), 2)*100
    
    ratios = sorted(ratios.items(), key=lambda x: (x[1],x[0]), reverse=True)
    
    languages = []
    values = []
    for e in ratios:
        languages.append(e[0])
        values.append(e[1])
    _, texts, _ = plt.pie(values, explode=[0.2]*(len(values)), labels=languages, autopct="%.2i%%", radius=2, 
                             rotatelabels=True)
    for e in texts:
        e.set_fontsize(15)
        e.set_fontfamily('fantasy')
    plt.show()

def get_lr_callback(batch_size):
    """Returns a LearningRateScheduler: linear ramp-up, then exponential decay."""
    lr_start = 0.000001
    lr_max   = 0.00000125 * batch_size
    lr_min   = 0.00000001
    lr_sus_epoch = 0
    lr_decay = 0.80
    lr_ramp_ep = 5
    
    def lrfn(epoch):
        if epoch < lr_ramp_ep:
            lr = (lr_max- lr_start)/lr_ramp_ep * epoch + lr_start
        elif epoch < (lr_ramp_ep + lr_sus_epoch):
            lr = lr_max
        else:
            lr = (lr_max - lr_min)*lr_decay**(epoch - lr_ramp_ep - lr_sus_epoch)+ lr_min
        return lr
    
    lr_callback = tf.keras.callbacks.LearningRateScheduler(lrfn, verbose=True)
    return lr_callback

def plot_wordcloud(df, col):
    """Function to plot word cloud for multiple languages"""
    langs = ["English", "Hindi", "Urdu", "German"]

    # Font file for each script; languages not listed use WordCloud's default font.
    fonts = {
        "Hindi": "Hindi.ttf", "French": "French.ttf", "Russian": "Russian.ttf",
        "Arabic": "Arabic.ttf", "Chinese": "Chinese.otf", "Swahili": "Swahili.ttf",
        "Urdu": "Urdu.ttf", "Vietnamese": "Vietnamese.ttf", "Greek": "Greek.ttf",
        "Thai": "Thai.ttf", "Spanish": "Spanish.ttf", "German": "German.ttf",
        "Turkish": "Turkish.ttf", "Bulgarian": "Bulgarian.ttf",
    }

    fig, axes = plt.subplots(nrows=2, ncols=2)
    fig.set_size_inches(12, 12)

    for ax, lang in zip(axes.ravel(), langs):
        # Collect the text of this language only. Previously a single string
        # was shared across the loop, so each cloud also contained the words
        # of every language plotted before it.
        lines = df[df.language == lang][col].values
        words = " ".join(" ".join(line.lower().split()) for line in lines)

        font_file = fonts.get(lang)
        font_path = FONT_DIR + font_file if font_file else None

        # Note: STOPWORDS only covers English.
        wordcloud = WordCloud(font_path=font_path, width=800, height=800,
                              background_color="black",
                              min_font_size=10,
                              stopwords=STOPWORDS).generate(words)

        ax.imshow(wordcloud)
        ax.axis("off")
        ax.set_title(f"Language: {lang}", fontsize=14)

Word clouds could not be created properly for the other languages. That is left for a future version.
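
One limitation for non-English word clouds is that `STOPWORDS` from the wordcloud package only covers English. As suggested in the tweaks list, stop words for another language can be approximated from word frequencies. The sketch below is a minimal version of that idea. The helper name `frequent_words`, its thresholds and the German sample sentences are all made up for illustration.

```python
from collections import Counter


def frequent_words(sentences, top_k=2, min_doc_ratio=0.5):
    """Return up to top_k words that occur in at least min_doc_ratio of the sentences."""
    doc_freq = Counter()
    for sentence in sentences:
        # Count each word once per sentence (document frequency)
        doc_freq.update(set(sentence.lower().split()))
    threshold = min_doc_ratio * len(sentences)
    common = [word for word, count in doc_freq.most_common() if count >= threshold]
    return set(common[:top_k])


sample = [
    "der Hund schläft im Garten",
    "der Mann liest im Garten ein Buch",
    "die Frau sieht der Katze zu",
    "der Zug fährt im Winter",
]
stop_words = frequent_words(sample, top_k=2)
print(stop_words)
```

The resulting set could be passed to the `stopwords` argument of `WordCloud` for that language. Whitespace splitting is a simplification and will not work for scripts without spaces between words, such as Chinese or Thai.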

# Preprocessing Data
Let us first analyze the training dataset to see how many samples there are for each language.

In [None]:
# Value counts of samples for each language.
print(train_csv["language"].value_counts())

# Printing ratio of different language in the dataset
print()
ratio_languages(train_csv)

As can be seen, English samples make up a much larger share (56%), while the other languages each have less than 5%. Hence we need to upsample the other languages to reduce this imbalance in the data.

I can think of two ways to do this:

1. Create synthetic samples for the low-resource languages from the original data.
2. Import external data for the low-resource languages.

Both approaches have their place. In this notebook we use the second one, i.e. loading external data.

Below we load the cross-lingual NLI corpus [XNLI](https://cims.nyu.edu/~sbowman/xnli/). It contains data in different languages with the same labels (entailment, contradiction and neutral) as our dataset.

Reference [[1]](https://www.kaggle.com/yihdarshieh/more-nli-datasets-hugging-face-nlp-library#The-Cross-Lingual-NLI-Corpus-(XNLI)).

In [None]:
# Load xnli dataset
xnli = nlp.load_dataset(path="xnli")

# This dataset does not store each sentence pair in flat
# (premise, hypothesis) columns, so we need to extract them.
buff = {}
buff["premise"] = []
buff["hypothesis"] = []
buff["label"] = []
buff["language"] = []

# Making a set to map our dataset language abbreviations to 
# their complete names.
uniq_lang = set()
for e in xnli["test"]:
  for i in e["hypothesis"]["language"]:
    uniq_lang.add(i)

# Creating a dict that maps abv to their complete names. 
language_map = {}

# Taken test_csv just to use lang_abv column and nothing else.
for e in uniq_lang:
  language_map[e] = test_csv.loc[test_csv.lang_abv==e, "language"].iloc[0]

# Preparing the dataset with the required columns.
for x in xnli['test']:
    label = x['label']
    for idx, lang in enumerate(x['hypothesis']['language']):
        
        # Skipping english samples as we don't want to upsample the samples
        # corresponding to english language.
        if lang=="en":
            continue
            
        hypothesis = x['hypothesis']['translation'][idx]
        premise = x['premise'][lang]
        buff['premise'].append(premise)
        buff['hypothesis'].append(hypothesis)
        buff['label'].append(label)
        buff['language'].append(language_map[lang])

# A pandas DataFrame for the prepared dataset.
xnli_df = pd.DataFrame(buff)

Let's see the number of samples per language in this dataset.

In [None]:
xnli_df["language"].value_counts()

In [None]:
# Extract the required columns from the dataset.
# Note: train_df and xnli_df must have the same columns because we merge
# them for upsampling.
train_df = train_csv[["premise", "hypothesis", "label", "language"]]
train_df.head()

In [None]:
# Concatenate the complete dataset
new_df = pd.concat([train_df, xnli_df], axis=0)
new_df.sample(5)

Now let's check whether we accidentally added some test samples to our dataset as a result of upsampling.

In [None]:
pd.merge(new_df, test_csv, how="inner")

As we can see, we accidentally added some samples from the test dataset (present in xnli_df) to our training dataset.
If we trained the model on this data, a good accuracy would be meaningless because we would be giving the model the labels for test samples.

So we need to remove these samples from the dataset, which we do in the following cell.

In [None]:
new_df = new_df.merge(pd.merge(new_df, test_csv, how="inner"), how="left", indicator=True)
new_df = new_df[new_df._merge=="left_only"]
new_df = new_df.drop(["id", "lang_abv", "_merge"], axis=1)
new_df.info()
# No null instances in the dataset.

In [None]:
pd.merge(new_df, test_csv, how="inner")

Since the intersection of the train and test datasets is now empty, the samples are separated and we can train on this dataset.
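
The filtering step above is an anti-join: keep every training row that has no match in the test set. The plain-Python sketch below shows the same logic on toy rows, matching on premise, hypothesis and language as the pandas merge does. The function name `drop_leaked` and the example rows are made up for illustration.

```python
def drop_leaked(train_rows, test_rows):
    """Keep training rows whose (premise, hypothesis, language) key is not in the test rows."""
    test_keys = {(r["premise"], r["hypothesis"], r["language"]) for r in test_rows}
    return [
        r for r in train_rows
        if (r["premise"], r["hypothesis"], r["language"]) not in test_keys
    ]


train_rows = [
    {"premise": "A man eats.", "hypothesis": "Someone eats.", "language": "English", "label": 0},
    {"premise": "It rains.", "hypothesis": "It is sunny.", "language": "English", "label": 2},
]
test_rows = [
    {"premise": "It rains.", "hypothesis": "It is sunny.", "language": "English"},
]

clean = drop_leaked(train_rows, test_rows)
print(len(train_rows), "->", len(clean))
```

Running the same filter a second time on `clean` should remove nothing, which mirrors the empty intersection checked above.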

Before training let us have a look at the ratio of languages after upsampling. 

In [None]:
ratio_languages(new_df)

In [None]:
new_df.language.value_counts()

In [None]:
# LOAD EXTRA DATA END

# Plotting Wordcloud

- Let's plot wordclouds for some languages in our **new_df** dataframe.

In [None]:
plot_wordcloud(new_df, "premise")

In [None]:
plot_wordcloud(new_df, "hypothesis")

# Training the Model

In [None]:
# Copy the slice so that adding a column later does not trigger SettingWithCopyWarning.
X, y = new_df[["premise", "hypothesis"]].copy(), new_df.label

Since the dataset has multiple languages, we want the model to be trained on each language equally. Hence we need to split the train and validation samples so that both get the same ratio of all languages. An even split of labels is also necessary for good learning. Both can be achieved by adding a column that combines the label and the language, and stratifying the split on that column.
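
The same combined key could also drive the cross-validation mentioned in the tweaks list. The sketch below is a plain-Python illustration of stratified fold assignment on such a key. It is not sklearn's `StratifiedKFold`, only a simplified round-robin version of the idea, and the function name `stratified_folds` and the toy data are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(keys, n_splits=5, seed=48):
    """Assign sample indices to folds so each fold gets a similar share of every key."""
    rng = random.Random(seed)
    by_key = defaultdict(list)
    for idx, key in enumerate(keys):
        by_key[key].append(idx)

    folds = [[] for _ in range(n_splits)]
    for key in sorted(by_key):
        indices = by_key[key]
        rng.shuffle(indices)
        # Deal the shuffled indices of this stratum round-robin into the folds
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)
    return folds


languages = ["English"] * 10 + ["Hindi"] * 10
labels = [0, 1] * 10
keys = [f"{lang}_{label}" for lang, label in zip(languages, labels)]

folds = stratified_folds(keys, n_splits=5)
for i, fold in enumerate(folds):
    print(f"fold {i}:", sorted(keys[idx] for idx in fold))
```

Each fold would serve once as the validation set while the remaining folds are used for training. In practice, `StratifiedKFold` from sklearn (already imported in this notebook) with `language_label` as the stratification target is the more usual tool.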

In [None]:
X["language_label"] = new_df.language.astype(str) + "_" + new_df.label.astype(str)

In [None]:
print("Splitting Data...")

# Using train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, stratify=X.language_label, test_size=0.2, random_state=SEED)

y_train = tf.one_hot(y_train, depth=3)
y_test  = tf.one_hot(y_test, depth=3)

print("Preparing Input...")
train_input = prepare_input_v2(x_train[["premise", "hypothesis"]].values.tolist())
valid_input = prepare_input_v2(x_test[["premise", "hypothesis"]].values.tolist())

print("Preparing Dataset...")
train_dataset = get_dataset(train_input, y_train, labelled=True, batch_size=BATCH_SIZE, repeat=True, 
                            shuffle=True)
valid_dataset   = get_dataset(valid_input, y_test, labelled=True, batch_size=BATCH_SIZE//REPLICAS, repeat=False,
                            shuffle=False)

print("Downloading and Building Model...")
with strategy.scope():
    model  = build_model()

# Callbacks
#reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", mode="min", factor=0.4, patience=3,
#                                                 verbose=1)

lr_callback = get_lr_callback(BATCH_SIZE)
checkpoint = tf.keras.callbacks.ModelCheckpoint("XLM-R-base.h5", save_weights_only=True,
                                                save_best_only=True, save_freq="epoch", monitor="val_loss",
                                                mode="min")

print("Training...")
model.fit(train_dataset, 
         steps_per_epoch=x_train.shape[0] // BATCH_SIZE,  # must be a whole number of steps
         validation_data=valid_dataset,
         epochs=EPOCHS,
         callbacks=[lr_callback, checkpoint])

The gap between validation accuracy and training accuracy is not large, but the model still needs more regularization to reduce overfitting.

# Making Submission

In [None]:
test_input = prepare_input_v2(test_csv[["premise", "hypothesis"]].values.tolist())
test_dataset = get_dataset(test_input, None, labelled=False, batch_size=BATCH_SIZE, repeat=False, shuffle=False) 

In [None]:
preds = model.predict(test_dataset)

In [None]:
preds = preds.argmax(axis=1)

In [None]:
submission = pd.read_csv("../input/contradictory-my-dear-watson/sample_submission.csv")
submission.head()

In [None]:
submission["prediction"] = preds

In [None]:
submission.sample(10)

In [None]:
submission.to_csv("submission.csv", header=True, index=False)

Although we got a good validation accuracy, the model is slightly overfitting and needs more regularization. This could be done by adding a BatchNormalization or Dropout layer to the model architecture, which is left for later versions. In the meantime, if anyone succeeds in reducing overfitting for this model, please share it with the community.

# References:

- https://www.kaggle.com/yihdarshieh/more-nli-datasets-hugging-face-nlp-library
- https://www.kaggle.com/tuckerarrants/xlm-r-back-translation-tta
- https://www.kaggle.com/anasofiauzsoy/tutorial-notebook
- https://huggingface.co/transformers/pretrained_models.html
- https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04
- https://ai.facebook.com/blog/-xlm-r-state-of-the-art-cross-lingual-understanding-through-self-supervision/

Hope you enjoyed the Notebook. Please UPVOTE if so. THANKS!