# RoBERTa model for NLP

In this kernel, we want to rate the complexity of literary passages. This will allow students to choose a text suited to their reading level.

From [the dataset](https://www.kaggle.com/c/commonlitreadabilityprize), we will mainly use the excerpt text and the score we want to predict. Since the score is a continuous value, this is a regression problem.

To accomplish that, we will preprocess the text and pass it to a pretrained [RoBERTa model](https://arxiv.org/abs/1907.11692).

Don't hesitate to ask if you have questions or if you see improvements that could be made.

In [None]:
import numpy as np 
import pandas as pd 
import tensorflow as tf

# Import everything from tensorflow.keras: mixing it with the standalone
# keras package or with private tensorflow.python modules can break across versions
from tensorflow import keras
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Flatten, Embedding, Input, Dropout, SpatialDropout1D, Bidirectional
from tensorflow.keras.layers import TimeDistributed, LSTM
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.optimizers import Adam
import tensorflow.keras.backend as K
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

from tensorflow.keras.initializers import Constant
from tensorflow.keras.metrics import RootMeanSquaredError
from sklearn.model_selection import train_test_split
from os import path

In [None]:
# Read the data
df_train = pd.read_csv("../input/commonlitreadabilityprize/train.csv")
df_test = pd.read_csv("../input/commonlitreadabilityprize/test.csv")
df_sample = pd.read_csv("../input/commonlitreadabilityprize/sample_submission.csv")

# Visualize some data

In [None]:
df_train['excerpt'][0]

In [None]:
# Remove unused columns
df_train = df_train.drop(columns=['url_legal', 'license'])
df_train.head()

In [None]:
# Remove unused columns
df_test = df_test.drop(columns=['url_legal', 'license'])
df_test.head()

In [None]:
# Get the maximum number of whitespace-separated words in any excerpt

max_length_training = int(df_train["excerpt"].str.split().str.len().max())
max_length_testing = int(df_test["excerpt"].str.split().str.len().max())

print("Maximum length of text in training set:", max_length_training, "| in the testing set:", max_length_testing)

# Preprocess the data

To preprocess the data, we are going to:

- Tokenize into words: break each sentence down into the words that compose it.
- Lowercase: normalize each word.
- Remove punctuation and digits.
- (optional) Remove stopwords: drop words that carry little meaning.
- (optional) Stemming: reduce each word to its stem, the root form of the word. (Example: fishing, fished, fisher => fish)
- (optional) Lemmatization: reduce each word to its lemma.

In this approach, I keep every word, because I think the connections between words are relevant. To change that, uncomment the optional lines in the processing function (stopword removal, stemming, lemmatization).  
Also, because RoBERTa uses its own subword tokenizer, stemming and lemmatization can be skipped.


In [None]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words("english"))
porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    
    # Extract all the words in the phrase : get a list 
    tokens = word_tokenize(text)
    
    # Lowercase the words
    tokens = [word.lower() for word in tokens]
    
    # Remove all tokens that are not alphabetic
    words = [word for word in tokens if word.isalpha()]
    
    # (optional) Remove stop words
    # words = [word for word in words if word not in stop_words]

    # (optional) Reduce each word to its stem
    # words = [porter.stem(word) for word in words]
    
    # (optional) Lemmatize each word
    # words = [lemmatizer.lemmatize(word) for word in words]

    return " ".join(words)
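
To get a feel for what the lowercase and alphabetic filter keeps and drops, here is a small sketch that needs no NLTK download. It is only an approximation: `simple_preprocess` and its regular expression are my own stand-ins, not `word_tokenize`, so contractions and hyphenated words may come out differently than with `preprocess_text`.

```python
import re

def simple_preprocess(text):
    """Rough, NLTK-free approximation of the lowercase + alphabetic filter."""
    # Split into runs of letters, runs of digits, or single punctuation marks
    tokens = re.findall(r"[A-Za-z]+|\d+|[^\w\s]", text)
    # Lowercase every token
    tokens = [t.lower() for t in tokens]
    # Keep only purely alphabetic tokens (drops digits and punctuation)
    words = [t for t in tokens if t.isalpha()]
    return " ".join(words)

sample = "In 1905, the Wright brothers flew 24 miles!"
cleaned = simple_preprocess(sample)
print(cleaned)  # in the wright brothers flew miles
```

Note that the year and the distance disappear entirely. Since punctuation, numbers, and casing may carry readability information, it could be worth comparing this pipeline against feeding the raw excerpt to the tokenizer.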

In [None]:
# Apply the preprocessing on the text
df_train['preprocess_text'] = df_train.excerpt.apply(preprocess_text)
df_test['preprocess_text'] = df_test.excerpt.apply(preprocess_text)

In [None]:
# Get the list of unique words
unique_words = list(df_train.preprocess_text.str.split(' ', expand=True).stack().unique())
print("Number of unique words : ", len(unique_words))

# Build our model

In [None]:
import tokenizers
from transformers import RobertaConfig, TFRobertaModel
from transformers import RobertaTokenizer, TFRobertaForSequenceClassification

roberta_path = '../input/tf-roberta/'

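# Use the longest excerpt (in words) from the analysis above as the sequence length.
# RoBERTa splits words into subword tokens, so the longest excerpts may still be truncated.
MAX_INPUT_LENGTH = max(max_length_training, max_length_testing)

# Load the pretrained tokenizer
tok = RobertaTokenizer.from_pretrained('../input/roberta-base')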
print("Vocabulary size : ", tok.vocab_size)

In [None]:
# Encoder for our input data
def roberta_encode(texts, tokenizer, max_len=MAX_INPUT_LENGTH):
    # Pre-fill with 1, the <pad> token id in RoBERTa's vocabulary
    all_tokens = np.ones((len(texts), max_len), dtype='int32')
    all_masks = np.zeros((len(texts), max_len), dtype='int32')
    
    for k, text in enumerate(texts):
        # Tokenize, then truncate or pad each text to exactly max_len ids
        encoded = tokenizer.encode_plus(
            text,                
            add_special_tokens=True,
            max_length=max_len,
            truncation=True,
            padding='max_length',
            return_attention_mask=True,
        )
        
        all_tokens[k, :max_len] = encoded['input_ids']
        all_masks[k, :max_len] = encoded['attention_mask']
        
    return all_tokens, all_masks

# Create our model
def build_roberta(max_len=MAX_INPUT_LENGTH):
    
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    
    # Get the pretrained roberta model
    config = RobertaConfig.from_pretrained(roberta_path + 'config-roberta-base.json')
    roberta_model = TFRobertaModel.from_pretrained(roberta_path + 'pretrained-roberta-base.h5', config=config)
    
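    # Take the hidden state of the first token (<s>) as a summary of the excerpt,
    # so the head outputs one score per excerpt instead of one per token
    x = roberta_model([input_word_ids, input_mask])[0][:, 0, :]
    x = tf.keras.layers.Dropout(0.2)(x)
    out = tf.keras.layers.Dense(1, activation='linear')(x)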
    
    model = Model(inputs = [input_word_ids, input_mask], outputs=out)
    model.compile(Adam(learning_rate=1e-5), 
                  loss="mean_squared_error", 
                  metrics=['mse', 'mae', RootMeanSquaredError()])
    
    return model
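
What `roberta_encode` hands to the model can be illustrated with plain NumPy. The sketch below is not the Hugging Face tokenizer: `pad_and_mask` is a hypothetical helper and the word-piece ids are made up. Only the special ids (0 for `<s>`, 1 for `<pad>`, 2 for `</s>`) follow the roberta-base vocabulary, which is why the encoder pre-fills its id array with ones.

```python
import numpy as np

PAD_ID = 1  # <pad> id in the roberta-base vocabulary

def pad_and_mask(token_ids, max_len, pad_id=PAD_ID):
    """Truncate or right-pad one sequence and build its attention mask."""
    ids = list(token_ids)[:max_len]          # naive truncation
    input_ids = np.full(max_len, pad_id, dtype="int32")
    attention_mask = np.zeros(max_len, dtype="int32")
    input_ids[:len(ids)] = ids               # real tokens first
    attention_mask[:len(ids)] = 1            # 1 = attend, 0 = ignore padding
    return input_ids, attention_mask

# Made-up ids: 0 = <s>, 2 = </s>, the rest stand in for word pieces
example_ids = [0, 713, 16, 10, 2]
ids, mask = pad_and_mask(example_ids, max_len=8)
print(ids.tolist())   # [0, 713, 16, 10, 2, 1, 1, 1]
print(mask.tolist())  # [1, 1, 1, 1, 1, 0, 0, 0]
```

One simplification to keep in mind: when the real tokenizer truncates, it keeps the closing `</s>` token, whereas this sketch simply cuts the list.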

In [None]:
# Create our training and validation set
X = df_train['preprocess_text']
Y = df_train['target'].values

X_final = df_test['preprocess_text']

X_train, X_val, y_train, y_val = train_test_split(X, Y, test_size=0.1, random_state=42)

In [None]:
# Encode our data
X_train_encode = roberta_encode(X_train, tok, MAX_INPUT_LENGTH)
X_val_encode = roberta_encode(X_val, tok, MAX_INPUT_LENGTH)

# Set to True to load previously trained weights instead of training
PRETRAINED = False

# Build our model
K.clear_session()
model = build_roberta(MAX_INPUT_LENGTH)

if not PRETRAINED:

    # Save the best model
    check = ModelCheckpoint('roberta_model.h5', 
                            monitor='val_loss', 
                            verbose=1, 
                            save_best_only=True,
                            save_weights_only=True, 
                            mode='auto', 
                            save_freq='epoch')

    # Train our model
    history = model.fit(X_train_encode, 
                        y_train, 
                        validation_data=(X_val_encode, y_val),
                        epochs=4, 
                        batch_size=8, 
                        verbose=1, 
                        callbacks=[check])
    
    # Load the best model
    model.load_weights('roberta_model.h5')

else:
    # Load previously trained weights
    model.load_weights('../input/robertapretrained/roberta_model.h5')
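
The competition is scored on root mean squared error, so after reloading the best weights it can be useful to compute that number on the validation set yourself. Here is a minimal sketch with NumPy. The `rmse` helper and the demo arrays are illustrative only and do not come from the dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays with one value per sample."""
    y_true = np.asarray(y_true, dtype="float64").ravel()
    y_pred = np.asarray(y_pred, dtype="float64").ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up targets and predictions, just to exercise the function
y_val_demo = np.array([-0.5, 0.0, 1.0, -2.0])
y_hat_demo = np.array([-0.5, 0.5, 1.0, -1.0])

score = rmse(y_val_demo, y_hat_demo)
print(round(score, 4))  # 0.559
```

In this notebook the same helper could be applied to `y_val` and the validation predictions, once those predictions have been reduced to one value per excerpt.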

# Make the prediction

In [None]:
# Encode our final data
X_final_encoding = roberta_encode(X_final, tok, max_len=MAX_INPUT_LENGTH)

# Make the prediction
y_pred = model.predict(X_final_encoding)

# Reduce to one score per excerpt (averages over any extra output dimensions)
y_mean = y_pred.reshape(len(y_pred), -1).mean(axis=1)

# Save to our submission file
df_sample['target'] = y_mean
df_sample.to_csv("submission.csv", index=False)

# Improvement

- One possible improvement is to use K-fold cross-validation and train, say, five RoBERTa models, one per fold. The predictions of the five models can then be combined (for example, averaged) to generate the final prediction.
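
The K-fold idea can be sketched without any deep learning code. In the example below, a least-squares linear model stands in for RoBERTa and the data is synthetic. `kfold_indices`, `fit_linear`, and `predict_linear` are hypothetical helpers written for this illustration, and averaging is just one possible way to combine the folds.

```python
import numpy as np

def kfold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train_idx, val_idx) pairs for a shuffled K-fold split."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_splits)
    for i in range(n_splits):
        val_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

def fit_linear(X, y):
    """Placeholder model: least-squares fit with a bias term."""
    A = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def predict_linear(w, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ w

# Synthetic data standing in for encoded excerpts and readability targets
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.3
X_test_demo = rng.normal(size=(10, 3))

fold_preds = []
for train_idx, val_idx in kfold_indices(len(X_demo), n_splits=5):
    # val_idx is where each fold's model would be validated and checkpointed
    w = fit_linear(X_demo[train_idx], y_demo[train_idx])
    fold_preds.append(predict_linear(w, X_test_demo))

# Combine the five models by averaging their test predictions
ensemble_pred = np.mean(fold_preds, axis=0)
print(ensemble_pred.shape)  # (10,)
```

With the real model, each fold would call `build_roberta`, train with its own checkpoint file, and predict on the encoded test set. `sklearn.model_selection.KFold` provides the same kind of split out of the box.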


# References

- https://www.kaggle.com/msafi04/tensorflow-roberta-commonlit-readability
