----------------------------------------------------------------------------------------------------------

## XLM RoBERTa Large + Semi-Supervised Learning

**Description:** I forked the notebook below because the TensorFlow wiring and some parameter tuning were already done (no point in reinventing the wheel).  To this base code I added external data for 1 round of self-training.  In my prior work, excessive self-training has not shown promise, but 3 or fewer rounds have proven useful.  For deep learning, 1 round of self-training has proven useful when its predictions are blended in, rather than just taking the final predictions after all training is completed.  In this version of self-training, I'm including more than just the most confident examples.  There are other ideas that could have been tried here as well, but I will leave those to the reader.

**Forked:** https://www.kaggle.com/yeayates21/fork-contra-watson-concise-keras-xlm-r-on-tpu

**Additions:** 
 - Added Twitter data and performed 1 round of self-training

**Acknowledgments:** 
 - [xhlulu](https://www.kaggle.com/xhlulu)

**Learning Resources:**
 - See self-training sections:  
   - http://pages.cs.wisc.edu/~jerryzhu/pub/sslicml07.pdf
   - http://www.cs.cmu.edu/~10701/slides/17_SSL.pdf
 
 ----------------------------------------------------------------------------------------------------------

# Imports

In [None]:
import os

import numpy as np
import pandas as pd
from kaggle_datasets import KaggleDatasets
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
import transformers
from transformers import TFAutoModel, AutoTokenizer
from tqdm.notebook import tqdm
import plotly.express as px

## Setting up the TPUs

### original comment:

This line is necessary in order to initialize the TPUs. 

Here, "replicas" simply means number of "cores". In the case of GPUs or CPUs, the number of replicas will be 1. [Read this](https://cloud.google.com/tpu/docs/tpus#replicas) for more information.

In [None]:
# Detect hardware, return appropriate distribution strategy
try:
    # TPU detection. No parameters necessary if TPU_NAME environment variable is
    # set: this is always the case on Kaggle.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
    print('Running on TPU ', tpu.master())
except ValueError:
    # Default distribution strategy in Tensorflow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)

## Define variables

### original comment:

Make sure to keep those variables in mind as you navigate this notebook! They are all placed below so you can easily change and rerun this notebook.

Don't worry about the model right now. We will come back to it later.

In [None]:
model_name = 'jplu/tf-xlm-roberta-large'
n_epochs = 10
max_len = 80

# Our batch size will depend on number of replicas
batch_size = 16 * strategy.num_replicas_in_sync
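
To make the batch-size arithmetic concrete, here is a small sketch. The replica count of 8 is an assumption (it is what a TPU with 8 cores would report). Only the factor of 16 comes from the cell above.

```python
# Hypothetical numbers for illustration; only the factor of 16 comes from the notebook.
per_replica_batch = 16
num_replicas = 8   # assumption: a TPU with 8 cores; this is 1 on CPU or a single GPU

# The global batch is what gets passed to .batch(); it is split evenly across replicas.
global_batch = per_replica_batch * num_replicas
print(global_batch)  # 128

# Each replica still processes 16 examples per step.
assert global_batch // num_replicas == per_replica_batch

# On CPU or a single GPU the same formula falls back to a batch of 16.
print(per_replica_batch * 1)  # 16
```

Scaling the batch with the replica count keeps the per-core workload constant, so the same notebook runs unchanged on different hardware.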

## Load datasets

### original comment:

Just regular CSV files. Nothing scary here!

In [None]:
train = pd.read_csv('/kaggle/input/contradictory-my-dear-watson/train.csv')
test = pd.read_csv('/kaggle/input/contradictory-my-dear-watson/test.csv')
submission = pd.read_csv('/kaggle/input/contradictory-my-dear-watson/sample_submission.csv')

### New Comment! 

Loading our Twitter data (the Sentiment140 dataset).

In [None]:
twitter = pd.read_csv('/kaggle/input/sentiment140/training.1600000.processed.noemoticon.csv',
                      encoding = 'latin1',
                      names=['polarity','id','date','flag','user','text'])

In [None]:
print("twitter data shape: ", len(twitter), len(twitter.columns))

In [None]:
twitter.head()

### New Comment! 

Here we split the data in half at random and pair the halves together, so one half represents our 'premise' and the other half our 'hypothesis'.  Since we're doing this at random, I imagine that most of these pairs will have nothing to do with one another, but in some cases they might result in a nice 'premise' and 'hypothesis' pair.  A good error analysis would be to look at the pairs that get a high score.

In [None]:
twitter_premise, twitter_hypothesis = train_test_split(twitter['text'], test_size=0.5, random_state=2020)

In [None]:
print(len(twitter_premise))
print(len(twitter_hypothesis))

In [None]:
twitter_df = pd.DataFrame()
twitter_df['premise'] = twitter_premise.values
twitter_df['hypothesis'] = twitter_hypothesis.values
twitter_df.head()
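
The random pairing above can be sketched on a toy list. This is only an illustration: it uses Python's `random` module rather than `train_test_split`, so it will not reproduce the exact split used in the notebook, and the `tweets` list is made up.

```python
import random

# Toy stand-in for twitter['text']; the real column is far larger.
tweets = [f"tweet {i}" for i in range(10)]

rng = random.Random(2020)   # fixed seed, in the spirit of random_state=2020
shuffled = tweets[:]        # copy so the original order is untouched
rng.shuffle(shuffled)

# First half becomes the premises, second half the hypotheses.
half = len(shuffled) // 2
premises, hypotheses = shuffled[:half], shuffled[half:]

# Pair the i-th premise with the i-th hypothesis.
pairs = list(zip(premises, hypotheses))
for premise, hypothesis in pairs:
    print(premise, "|", hypothesis)
```

Every tweet lands in exactly one pair, and no tweet is ever paired with itself, because the two halves are disjoint.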

## Encode Training data

### original comment:

Now, we need to encode the training and test data into `tokens`, which are numerical representations of our words. To learn more, [read this](https://huggingface.co/transformers/main_classes/tokenizer.html).

In [None]:
# First load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
%%time

# Convert the text so that we can feed it to `batch_encode_plus`
train_text = train[['premise', 'hypothesis']].values.tolist()
test_text = test[['premise', 'hypothesis']].values.tolist()
twitter_text = twitter_df[['premise', 'hypothesis']].values.tolist()

# Now, we use the tokenizer we loaded to encode the text
train_encoded = tokenizer.batch_encode_plus(
    train_text,
    pad_to_max_length=True,
    max_length=max_len
)

test_encoded = tokenizer.batch_encode_plus(
    test_text,
    pad_to_max_length=True,
    max_length=max_len
)

twitter_encoded = tokenizer.batch_encode_plus(
    twitter_text,
    pad_to_max_length=True,
    max_length=max_len
)
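
What fixing the length at `max_len` amounts to can be sketched with a hypothetical helper, `pad_or_truncate`. This is a simplification, not the tokenizer's actual implementation: the real tokenizer also handles sentence pairs and special tokens when truncating. The token ids and `pad_id=1` are assumptions for illustration (check `tokenizer.pad_token_id` for the real value).

```python
def pad_or_truncate(ids, max_len, pad_id):
    """Return a list of exactly max_len ids: cut long inputs, right-pad short ones."""
    clipped = ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))

# Hypothetical token ids, not real tokenizer output.
short_ids = [0, 581, 7515, 2]
long_ids = list(range(100))

print(pad_or_truncate(short_ids, 8, pad_id=1))       # [0, 581, 7515, 2, 1, 1, 1, 1]
print(len(pad_or_truncate(long_ids, 8, pad_id=1)))   # 8
```

Giving every row the same length is what lets the encoded inputs be stacked into a single rectangular tensor of shape `(n_examples, max_len)`.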

### original comment:

Train and validation split happens here.

### New Comment!

We split our Twitter data into 2 training sets.  Why?  Because this will give us more control over our model/prediction blending in the end.

In [None]:
### train
x_train, x_valid, y_train, y_valid = train_test_split(
    train_encoded['input_ids'], train.label.values, 
    test_size=0.2, random_state=2020
)

### test
x_test = test_encoded['input_ids']

### twitter
x_twitter_train1, x_twitter_train2 = train_test_split(
    twitter_encoded['input_ids'], 
    test_size=0.5, random_state=2020
)

## Convert to tf.data.Dataset

### original comment:

`tf.data.Dataset` is one of many different ways to define the input to our models. Here, it is a good choice since it is easily compatible with TPUs. Read more about it [in this article](https://towardsdatascience.com/how-to-use-dataset-in-tensorflow-c758ef9e4428).

In [None]:
%%time

auto = tf.data.experimental.AUTOTUNE

train_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_train, y_train))
    .repeat()
    .shuffle(2048)
    .batch(batch_size)
    .prefetch(auto)
)

valid_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_valid, y_valid))
    .batch(batch_size)
    .cache()
    .prefetch(auto)
)

test_dataset = (
    tf.data.Dataset
    .from_tensor_slices(x_test)
    .batch(batch_size)
)

### for creating pseudo labels ###
train_twitter_dataset1 = (
    tf.data.Dataset
    .from_tensor_slices(x_twitter_train1)
    .batch(batch_size)
)

### for creating pseudo labels ###
train_twitter_dataset2 = (
    tf.data.Dataset
    .from_tensor_slices(x_twitter_train2)
    .batch(batch_size)
)

## Train the model

### original comment:

It's time to teach our lovely XLM-Roberta how to infer natural language. Notice here we are using `strategy.scope()`. We need to load `transformer_encoder` inside this scope in order to tell Tensorflow that we want our model on the TPUs. Otherwise, it will try to load it in your CPU machine!

XLM-Roberta is one of the best models out there for multilingual classification tasks. Essentially, it is a model that was trained on inherently multilingual text, using methods that helped it become larger and train longer on more data! I highly recommend reading [this blog post by the authors](https://ai.facebook.com/blog/-xlm-r-state-of-the-art-cross-lingual-understanding-through-self-supervision/), as well as the [Huggingface docs](https://huggingface.co/transformers/model_doc/xlmroberta.html) on the subject.

### New Comment!

A great article on the logic of picking out the classification token's output (illustrated below):  http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/

![](http://jalammar.github.io/images/distilBERT/bert-output-tensor-selection.png)

In [None]:
with strategy.scope():
    # First load the transformer layer
    transformer_encoder = TFAutoModel.from_pretrained(model_name)

    # This will be the input tokens 
    input_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_ids")

    # Now, we encode the text using the transformers we just loaded
    last_hidden_states = transformer_encoder(input_ids)[0]

    # Only extract the token used for classification, which is <s>
    cls_token = last_hidden_states[:, 0, :]

    # Finally, pass it through a 3-way softmax, since there are 3 possible labels
    out = Dense(3, activation='softmax')(cls_token)

    # It's time to build and compile the model
    model = Model(inputs=input_ids, outputs=out)
    model.compile(
        Adam(learning_rate=1e-5),
        loss='sparse_categorical_crossentropy', 
        metrics=['accuracy']
    )

model.summary()
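
The slicing step `last_hidden_states[:, 0, :]` is easier to see with plain NumPy shapes. This is a sketch with random numbers standing in for the transformer output and for the `Dense` layer's weights. The hidden size of 16 is a toy value chosen for illustration, not the real model's.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden, n_classes = 4, 80, 16, 3   # hidden=16 is a toy size

# Stand-in for transformer_encoder(input_ids)[0]: one vector per token.
last_hidden_states = rng.normal(size=(batch, seq_len, hidden))

# Keep only position 0 (the <s> token) for every example in the batch.
cls_token = last_hidden_states[:, 0, :]
print(cls_token.shape)   # (4, 16)

# Stand-in for Dense(3, activation='softmax') with random weights.
W = rng.normal(size=(hidden, n_classes))
b = np.zeros(n_classes)
logits = cls_token @ W + b
logits -= logits.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(probs.shape)       # (4, 3)
```

Each row of `probs` sums to 1, so it can be read as the model's probability for each of the 3 labels.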

In [None]:
n_steps = len(x_train) // batch_size

train_history1 = model.fit(
    train_dataset,
    steps_per_epoch=n_steps,
    validation_data=valid_dataset,
    epochs=n_epochs
)
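
Because `train_dataset` uses `.repeat()`, it never runs out of data, so `steps_per_epoch` is what tells Keras where one epoch ends. The pure-Python sketch below imitates that behaviour with a hypothetical `batches` generator (shuffling is left out to keep the output readable).

```python
from itertools import cycle, islice

def batches(examples, batch_size):
    """Yield batches forever, like .repeat().batch(batch_size) without shuffling."""
    stream = cycle(examples)
    while True:
        yield list(islice(stream, batch_size))

examples = list(range(10))   # toy dataset of 10 examples
batch_size = 4
n_steps = len(examples) // batch_size   # same formula as in the cell above

stream = batches(examples, batch_size)
for epoch in range(2):
    epoch_batches = [next(stream) for _ in range(n_steps)]
    print(epoch, epoch_batches)
# 0 [[0, 1, 2, 3], [4, 5, 6, 7]]
# 1 [[8, 9, 0, 1], [2, 3, 4, 5]]
```

The examples left over by the integer division (8 and 9 here) are not lost: they show up at the start of the next epoch.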

### New Comment!

We're going to blend in preds from SSL (semi-supervised learning), so we need to save the predictions from our 1st run.  This will make more sense as we go.

In [None]:
test_preds1 = model.predict(test_dataset, verbose=1)

### New Comment!

So now we take our model, trained on the competition data only, and predict on our Twitter data to get pseudo-labels for 1 round of self-training.  The hope is that our unlabeled data can help identify our decision boundaries more accurately.  In a sense, we're acting like an EM algorithm, except we're not running EM for very long at all.  Our original model guesses where these points should land, and then we update our model using those predictions as pseudo-labels.

In [None]:
twitter_preds = model.predict(train_twitter_dataset1, verbose=1)
twitter_pseudo_labels_train1 = twitter_preds.argmax(axis=1)
twitter_preds = model.predict(train_twitter_dataset2, verbose=1)
twitter_pseudo_labels_train2 = twitter_preds.argmax(axis=1)
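
Here is a small sketch of what the `argmax` call does, plus one way to rank pairs by confidence for the error analysis suggested earlier. The probabilities are made up. This notebook keeps every pseudo-label, so the confidence ranking is only an optional inspection aid, not part of the training pipeline.

```python
import numpy as np

# Hypothetical softmax outputs for 4 tweet pairs over the 3 classes.
probs = np.array([
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.10, 0.15, 0.75],
    [0.34, 0.33, 0.33],
])

pseudo_labels = probs.argmax(axis=1)   # same call as in the cell above
confidence = probs.max(axis=1)         # probability assigned to the chosen class

print(pseudo_labels)                   # [0 0 2 0]

# Indices sorted from most to least confident, for manual inspection.
most_confident_first = np.argsort(-confidence)
print(most_confident_first)            # [0 2 1 3]
```

Note that rows 1 and 3 receive a label even though the model is close to undecided, which is exactly the "more than just the most confident examples" choice made here.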

### New Comment! 

Now we need to create new `tf.data.Dataset` objects with our newly created pseudo-labels.  We'll use these to do 1 more round of modeling, i.e. 1 round of self-training to update our model based on our pseudo-labels.

In [None]:
twitter_dataset_ssl1 = (
    tf.data.Dataset
    .from_tensor_slices((x_twitter_train1, twitter_pseudo_labels_train1))
    .repeat()
    .shuffle(2048)
    .batch(batch_size)
    .prefetch(auto)
)

twitter_dataset_ssl2 = (
    tf.data.Dataset
    .from_tensor_slices((x_twitter_train2, twitter_pseudo_labels_train2))
    .repeat()
    .shuffle(2048)
    .batch(batch_size)
    .prefetch(auto)
)

### New Comment!

So we run `model.fit` twice, each time making predictions on the test set.  Why?  This will give us more control over model/prediction blending at the end and will help keep us from overfitting.

In [None]:
n_steps = len(x_twitter_train1) // batch_size
n_steps_val = len(x_valid) // batch_size

model.fit(
    twitter_dataset_ssl1,
    steps_per_epoch=n_steps,
    validation_data=valid_dataset,
    validation_steps=n_steps_val,
    epochs=2
)

test_preds2 = model.predict(test_dataset, verbose=1)

In [None]:
n_steps = len(x_twitter_train2) // batch_size
n_steps_val = len(x_valid) // batch_size

model.fit(
    twitter_dataset_ssl2,
    steps_per_epoch=n_steps,
    validation_data=valid_dataset,
    validation_steps=n_steps_val,
    epochs=2
)

test_preds3 = model.predict(test_dataset, verbose=1)
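
The overall pattern (train, snapshot test predictions, pseudo-label, update, snapshot again) can be tried on a toy problem. This is a sketch under loud assumptions: a nearest-centroid classifier on 2-D Gaussian blobs stands in for XLM-R, the model is refit from scratch on the pseudo-labels instead of being fine-tuned further, there is a single self-training stage instead of two, and the 0.9/0.1 blend weights are made up.

```python
import numpy as np

rng = np.random.default_rng(2020)

def make_blobs(n_per_class):
    """Two Gaussian blobs in 2-D, a toy stand-in for encoded text."""
    x0 = rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(n_per_class, 2))
    x1 = rng.normal(loc=[2.0, 0.0], scale=1.0, size=(n_per_class, 2))
    x = np.vstack([x0, x1])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return x, y

def fit_centroids(x, y):
    """'Train' a nearest-centroid classifier: one mean vector per class."""
    return np.stack([x[y == c].mean(axis=0) for c in (0, 1)])

def predict_proba(centroids, x):
    """Softmax over negative squared distances to each centroid."""
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

x_labeled, y_labeled = make_blobs(5)    # small labeled set
x_unlabeled, _ = make_blobs(200)        # labels discarded on purpose
x_test, y_test = make_blobs(100)

# Stage 1: supervised model and its test predictions.
model = fit_centroids(x_labeled, y_labeled)
test_preds1 = predict_proba(model, x_test)

# Stage 2: pseudo-label the unlabeled data, refit, predict again.
pseudo = predict_proba(model, x_unlabeled).argmax(axis=1)
model = fit_centroids(x_unlabeled, pseudo)
test_preds2 = predict_proba(model, x_test)

# Blend the two snapshots, giving most weight to the supervised one.
blended = 0.9 * test_preds1 + 0.1 * test_preds2
print("blended accuracy:", (blended.argmax(axis=1) == y_test).mean())
```

Keeping a snapshot of test predictions after each stage is what makes the blending possible: no stage overwrites what an earlier one produced.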

### New Comment!

Here we blend the predictions on the test set and submit.  We give the 1st model the most weight.  The predictions from our self-training are given less weight.  Why?  Honestly, because these weights work.  One intuition is that the self-training clearly has some new information to add, but it's not doing all the heavy lifting, so giving too much weight to the self-training rounds would cause overfitting or model distortion.

In [None]:
test_preds = (0.92)*test_preds1 + (0.05)*test_preds2 + (0.03)*test_preds3
test_preds = test_preds.argmax(axis=1)
submission['prediction'] = test_preds
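
To see how much influence the small weights actually have, here is a sketch of the same weighted blend on made-up probabilities for two test rows. Only the weights come from the cell above. The three `preds` arrays are hypothetical.

```python
import numpy as np

weights = np.array([0.92, 0.05, 0.03])   # weights from the cell above
assert np.isclose(weights.sum(), 1.0)    # so the blend is still a probability distribution

# Hypothetical class probabilities from the three prediction snapshots, for 2 rows.
preds1 = np.array([[0.70, 0.20, 0.10], [0.36, 0.34, 0.30]])
preds2 = np.array([[0.10, 0.80, 0.10], [0.05, 0.90, 0.05]])
preds3 = np.array([[0.10, 0.80, 0.10], [0.05, 0.90, 0.05]])

stacked = np.stack([preds1, preds2, preds3])       # shape (3, rows, classes)
blended = np.tensordot(weights, stacked, axes=1)   # weighted sum over the snapshots

print(preds1.argmax(axis=1))    # [0 0]
print(blended.argmax(axis=1))   # [0 1]
```

With these weights, the self-training snapshots can only flip a decision when the first model's lead over its runner-up is below 0.08 / 0.92, roughly 0.087. That is why row 1 (lead of 0.02) changes and row 0 (lead of 0.50) does not: the blend acts as a tie-breaker on uncertain rows rather than overriding confident ones.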

In [None]:
submission.to_csv('submission.csv', index=False)
submission.head()

## Visualize Training History

### original comment:

With Plotly Express, this can be done in one function call:

In [None]:
hist = train_history1.history

In [None]:
px.line(
    hist, x=range(1, len(hist['loss'])+1), y=['accuracy', 'val_accuracy'], 
    title='Model Accuracy', labels={'x': 'Epoch', 'value': 'Accuracy'}
)

In [None]:
px.line(
    hist, x=range(1, len(hist['loss'])+1), y=['loss', 'val_loss'], 
    title='Model Loss', labels={'x': 'Epoch', 'value': 'Loss'}
)

### New Comment!

We don't add training plots for the self-training because we're not running it for many rounds and we're just blending in the predictions.