<center><img src="https://github.com/dimitreOliveira/MachineLearning/blob/master/Kaggle/CommonLit%20Readability%20Prize/banner.png?raw=true" width="1000"></center>
<br>
<center><h1>CommonLit Readability - EDA & RoBERTa TF baseline</h1></center>
<br>

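# Introduction
**Goal:** Build an algorithm that rates the complexity of a text as a real-valued score, for reading passages used in grades 3 to 12 (US school grades, roughly the third year of elementary school through the final year of high school in Japan).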


**Data:** 
> **train.csv / test.csv** 
> - ```id``` - ID
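> - ```url_legal``` - URL of the text's source
> - ```license``` - license of the source material
> - ```excerpt``` - the text (excerpt) whose readability is to be predicted
> - ```target``` - readability score (the prediction target)
> - ```standard_error``` - a measure of the spread of readability scores across multiple raters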

**Note:** ```url_legal```, ```license``` and ```standard_error``` are not available in the test data

**Evaluation metric:** Root Mean Squared Error (RMSE)
> $$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}{\big(y_i - \hat{y}_i\big)^2}}$$
> where 
> * $y_i$ : original value
> * $\hat{y}_i$ : predicted value
> * $n$ : number of rows in the test data

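A minimal NumPy sketch of the metric, on toy values rather than competition data. The `sigma` argument is an assumption added for illustration: leaving it at `None` gives the plain RMSE, while passing per-row standard errors divides each residual by them before squaring.

```python
import numpy as np


def rmse(y_true, y_pred, sigma=None):
    """Root mean squared error; `sigma` is an optional, hypothetical per-row scale."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    if sigma is not None:
        # Scale each residual by its per-row standard error (must be non-zero).
        residual = residual / np.asarray(sigma, dtype=float)
    return float(np.sqrt(np.mean(residual ** 2)))


# Toy values, not competition data.
y_true = [0.0, -1.0, -2.0]
y_pred = [0.5, -1.0, -1.0]
plain = rmse(y_true, y_pred)
scaled = rmse(y_true, y_pred, sigma=[0.5, 0.5, 0.5])
print(plain, scaled)
```

With a constant `sigma`, the scaled score is just the plain RMSE divided by that constant, so the two only rank models differently when `sigma` varies across rows.
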
## Useful discussions (top of the Most Votes ranking)
- **My Experience So Far and Further Improvements** (@torch) https://www.kaggle.com/c/commonlitreadabilityprize/discussion/241029
    - Describes how the author improved prediction accuracy step by step
    - Lists the notebooks used as references
    - Offers ideas for improving accuracy further
- **Grandmaster Series on NLP** (@Dieter) https://www.kaggle.com/c/commonlitreadabilityprize/discussion/245004
    - Kaggle grandmasters @cdeotte, @cpmpml, @boliu0, and @Dieter talk about NLP and transformers
    - Video URL: https://youtu.be/PXc_SlnT2g0
- **Best Single Model** (@Mr_KnowNothing) https://www.kaggle.com/c/commonlitreadabilityprize/discussion/236645
    - Shares brief parameter settings and the score of the single model used
    - May indicate the upper bound on what a single model can achieve
- **Readability : what is it?** (@kkiller) https://www.kaggle.com/c/commonlitreadabilityprize/discussion/236626
    - Defines what readability means
    - Expresses readability as a formula based on the average sentence length (ASL) and the average number of syllables per word (ASW)
     
     $$206.835 - (1.015 \times ASL ) - (84.6 \times ASW )$$
      
- **How I improved my transformer from 0.53 to 0.48 with couple of lines of code** (@Bignesh Baskaran) https://www.kaggle.com/c/commonlitreadabilityprize/discussion/246817
    - Explains what caused a 0.05 score gap between runs that use the same RoBERTa model
    - Might be useful for hyperparameter tuning
- **the magic? read more!!** (@Kamal Das) https://www.kaggle.com/c/commonlitreadabilityprize/discussion/241081
    - Shares a collection of discussions worth reading for beginners
- **Augmented dataset** (@Konrad Banachewicz) https://www.kaggle.com/c/commonlitreadabilityprize/discussion/237182
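    - Describes enlarging the dataset by translating texts into another language and back again (back-translation)
    - Because this competition scores readability and back-translation tends to simplify the text, it may not help here
        - Some people who tried it reported a lower LB score
        - Needs further investigation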

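The readability formula quoted in the list above, $206.835 - 1.015 \times ASL - 84.6 \times ASW$, can be sketched in plain Python. The syllable counter here is a crude assumption (it counts groups of consecutive vowels), and the helper names `count_syllables` and `flesch_reading_ease` are hypothetical, so the scores are only rough approximations.

```python
import re


def count_syllables(word):
    """Very rough syllable estimate: count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text):
    """Apply 206.835 - 1.015 * ASL - 84.6 * ASW to a piece of English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    asl = len(words) / len(sentences)  # average sentence length, in words
    asw = sum(count_syllables(w) for w in words) / len(words)  # syllables per word
    return 206.835 - 1.015 * asl - 84.6 * asw


easy = "The cat sat on the mat. The dog ran to the park."
hard = "Photosynthesis converts electromagnetic radiation into chemical energy within chloroplasts."
print(flesch_reading_ease(easy), flesch_reading_ease(hard))
```

Higher scores mean easier text, so the short, monosyllabic example scores well above the technical sentence. A score like this could serve as a hand-crafted feature alongside the transformer, though whether it helps is untested here.
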
## Dependencies

In [None]:
import random, os, warnings, math
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
import tensorflow as tf
import tensorflow.keras.layers as L
import tensorflow.keras.backend as K
from tensorflow.keras import optimizers, losses, metrics, Model
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, LearningRateScheduler
from transformers import TFAutoModelForSequenceClassification, TFAutoModel, AutoTokenizer


def seed_everything(seed=0):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    os.environ['TF_DETERMINISTIC_OPS'] = '1'

seed = 0
seed_everything(seed)
sns.set(style='whitegrid')
warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth', 150)

### Hardware configuration

In [None]:
# TPU or GPU detection
# Detect hardware, return appropriate distribution strategy
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print(f'Running on TPU {tpu.master()}')
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.TPUStrategy(tpu)  # non-experimental name of the same strategy
else:
    strategy = tf.distribute.get_strategy()

AUTO = tf.data.AUTOTUNE
REPLICAS = strategy.num_replicas_in_sync
print(f'REPLICAS: {REPLICAS}')

# Load data

In [None]:
train_filepath = '/kaggle/input/commonlitreadabilityprize/train.csv'
test_filepath = '/kaggle/input/commonlitreadabilityprize/test.csv'

train = pd.read_csv(train_filepath)
test = pd.read_csv(test_filepath)

print(f'Train samples: {len(train)}')
display(train.head())

print(f'Test samples: {len(test)}')
display(test.head())

# removing unused columns
train.drop(['url_legal', 'license'], axis=1, inplace=True)
test.drop(['url_legal', 'license'], axis=1, inplace=True)

# Model parameters

In [None]:
BATCH_SIZE = 8 * REPLICAS
LEARNING_RATE = 1e-5 * REPLICAS
EPOCHS = 35
ES_PATIENCE = 7
PATIENCE = 2
N_FOLDS = 5
SEQ_LEN = 256 #300
BASE_MODEL = '/kaggle/input/huggingface-roberta/roberta-base/'

## Auxiliary functions

In [None]:
# Datasets utility functions
def custom_standardization(text):
    text = text.lower() # if encoder is uncased
    text = text.strip()
    return text


def sample_target(features, target):
    mean, stddev = target
    sampled_target = tf.random.normal([], mean=tf.cast(mean, dtype=tf.float32), 
                                      stddev=tf.cast(stddev, dtype=tf.float32), dtype=tf.float32)
    
    return (features, sampled_target)
    

def get_dataset(pandas_df, tokenizer, labeled=True, ordered=False, repeated=False, 
                is_sampled=False, batch_size=32, seq_len=128):
    """
        Return a Tensorflow dataset ready for training or inference.
    """
    text = [custom_standardization(text) for text in pandas_df['excerpt']]
    
    # Tokenize inputs
    tokenized_inputs = tokenizer(text, max_length=seq_len, truncation=True, 
                                 padding='max_length', return_tensors='tf')
    
    if labeled:
        dataset = tf.data.Dataset.from_tensor_slices(({'input_ids': tokenized_inputs['input_ids'], 
                                                      'attention_mask': tokenized_inputs['attention_mask']}, 
                                                      (pandas_df['target'], pandas_df['standard_error'])))
        if is_sampled:
            dataset = dataset.map(sample_target, num_parallel_calls=tf.data.AUTOTUNE)
    else:
        dataset = tf.data.Dataset.from_tensor_slices({'input_ids': tokenized_inputs['input_ids'], 
                                                      'attention_mask': tokenized_inputs['attention_mask']})
        
    if repeated:
        dataset = dataset.repeat()
    if not ordered:
        dataset = dataset.shuffle(1024)
    dataset = dataset.batch(batch_size)
    dataset = dataset.prefetch(tf.data.AUTOTUNE)
    
    return dataset


def plot_metrics(history):
    metric_list = list(history.keys())
    size = len(metric_list)//2
    fig, axes = plt.subplots(size, 1, sharex='col', figsize=(20, size * 5))
    axes = axes.flatten()
    
    for index in range(len(metric_list)//2):
        metric_name = metric_list[index]
        val_metric_name = metric_list[index+size]
        axes[index].plot(history[metric_name], label='Train %s' % metric_name)
        axes[index].plot(history[val_metric_name], label='Validation %s' % metric_name)
        axes[index].legend(loc='best', fontsize=16)
        axes[index].set_title(metric_name)

    plt.xlabel('Epochs', fontsize=16)
    sns.despine()
    plt.show()

# EDA

### Looking at a few examples

In [None]:
display(train.head())

### Now the examples with the 5 lowest `target` values

In [None]:
display(train.sort_values(by=['target']).head())

### Now the examples with the 5 highest `target` values

In [None]:
display(train.sort_values(by=['target'], ascending=False).head())

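At a glance, it is hard to tell what `target` score a text should get.

However, the excerpts with low scores look hard to read because they contain a mix of grammatical, semantic, and symbol-related errors.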

## Label distribution

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(20, 6))
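# histplot replaces the deprecated distplot; stat='density' keeps the same scale
sns.histplot(train['target'], kde=True, stat='density', ax=ax)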
plt.show()

The `target` distribution looks roughly normal, but its median is negative.

Negative values go down to -4, while positive values only reach about 2.

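The two observations above (a negative median and a longer negative tail) can also be checked numerically. The sketch below uses a handful of toy values in place of `train['target']`, and `summarize_target` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def summarize_target(target):
    """Return a few location and shape statistics for a numeric Series."""
    return {
        "min": float(target.min()),
        "median": float(target.median()),
        "max": float(target.max()),
        "skew": float(target.skew()),
    }


# Toy values standing in for train['target']; not real competition data.
toy_target = pd.Series([-3.5, -2.0, -1.2, -0.9, -0.4, 0.1, 1.5])
summary = summarize_target(toy_target)
print(summary)
```

A negative `skew` indicates that the left tail is longer than the right one, which is what the plot suggests for the real column.
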
In [None]:
fig, ax = plt.subplots(1, 1, figsize=(20, 6))
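# histplot replaces the deprecated distplot; stat='density' keeps the same scale
sns.histplot(train['standard_error'], kde=True, stat='density', ax=ax)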
plt.show()

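The `standard_error` column seems to have some outliers with values lower than `0.4`.

`standard_error` reflects how much the readability ratings for an excerpt differ between raters.

→ Could it be used for prediction? Probably not, because the test data does not include this feature.

**Shouldn't the model be evaluated with an RMSE weighted by `standard_error`?**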

In [None]:
print(f"standard_error values >= 0.4: {len(train[train['standard_error'] >= 0.4])}")
print(f"standard_error values < 0.4: {len(train[train['standard_error'] < 0.4])}")

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
sns.scatterplot(x=train['target'], y=train['standard_error'], s=10, color=".15")
sns.kdeplot(x=train['target'], y=train['standard_error'], levels=5, color="r", linewidths=1)
plt.ylim([0.4, None])
plt.show()

Scatter plot of `target` against `standard_error`

Raters disagree more on texts that are very easy or very hard to read (`target` near 2 or -4).

↑ This seems intuitive.

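One way to quantify the pattern seen in the scatter plot is to bin `target` and average `standard_error` within each bin. The sketch below runs on synthetic data built to have larger spread at the extremes, so it only illustrates the mechanics; `error_by_target_bin` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def error_by_target_bin(df, n_bins=6):
    """Mean `standard_error` within equal-width bins of `target`."""
    bins = pd.cut(df["target"], bins=n_bins)
    return df.groupby(bins, observed=True)["standard_error"].mean()


# Synthetic data in which rater spread grows with distance from the centre.
rng = np.random.default_rng(0)
target = rng.uniform(-4.0, 2.0, size=2000)
standard_error = 0.45 + 0.03 * np.abs(target + 1.0) + rng.normal(0.0, 0.01, size=2000)
toy = pd.DataFrame({"target": target, "standard_error": standard_error})

per_bin = error_by_target_bin(toy)
print(per_bin)
```

Applied to the real `train` frame, a U-shaped result across the bins would support the reading that extreme texts get less consistent ratings.
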
## `excerpt` text distribution

### `excerpt` length

In [None]:
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
train['excerpt_len'] = train['excerpt'].str.len()
# split() with no argument also splits on newlines and ignores repeated spaces
train['excerpt_wordCnt'] = train['excerpt'].apply(lambda x: len(x.split()))
train['excerpt_tokenCnt'] = train['excerpt'].apply(lambda x : len(tokenizer.encode(x, add_special_tokens=False)))

fig, ax = plt.subplots(1, 1, figsize=(20, 6))
sns.histplot(train['excerpt_len'], kde=True, stat='density', ax=ax).set_title('Excerpt length')
plt.show()

### `excerpt` word count

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(20, 6))
sns.histplot(train['excerpt_wordCnt'], kde=True, stat='density', ax=ax).set_title('Excerpt word count')
plt.show()

### `excerpt` token count (after using the tokenizer)

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(20, 6))
sns.histplot(train['excerpt_tokenCnt'], kde=True, stat='density', ax=ax).set_title('Excerpt token count')
plt.show()

# Model

In [None]:
def model_fn(encoder, seq_len=256):
    input_ids = L.Input(shape=(seq_len,), dtype=tf.int32, name='input_ids')
    input_attention_mask = L.Input(shape=(seq_len,), dtype=tf.int32, name='attention_mask')
    
    outputs = encoder({'input_ids': input_ids, 
                       'attention_mask': input_attention_mask})
    
    model = Model(inputs=[input_ids, input_attention_mask], outputs=outputs)

    optimizer = optimizers.Adam(learning_rate=LEARNING_RATE)  # `lr` is a deprecated alias
    model.compile(optimizer=optimizer, 
                  loss=losses.MeanSquaredError(), 
                  metrics=[metrics.RootMeanSquaredError()])
    
    return model


with strategy.scope():
    encoder = TFAutoModelForSequenceClassification.from_pretrained(BASE_MODEL, num_labels=1)
    model = model_fn(encoder, SEQ_LEN)
    
model.summary()

# Training

In [None]:
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
skf = KFold(n_splits=N_FOLDS, shuffle=True, random_state=seed)
oof_pred = []; oof_labels = []; history_list = []; test_pred = []

for fold,(idxT, idxV) in enumerate(skf.split(train)):
    if tpu: tf.tpu.experimental.initialize_tpu_system(tpu)
    print(f'\nFOLD: {fold+1}')
    print(f'TRAIN: {len(idxT)} VALID: {len(idxV)}')

    # Model
    K.clear_session()
    with strategy.scope():
        encoder = TFAutoModelForSequenceClassification.from_pretrained(BASE_MODEL, num_labels=1)
        model = model_fn(encoder, SEQ_LEN)
        
    model_path = f'model_{fold}.h5'
    es = EarlyStopping(monitor='val_root_mean_squared_error', mode='min', 
                       patience=ES_PATIENCE, restore_best_weights=True, verbose=1)
    checkpoint = ModelCheckpoint(model_path, monitor='val_root_mean_squared_error', mode='min', 
                                 save_best_only=True, save_weights_only=True)

    # Train
    # KFold yields positional indices, so select rows with iloc
    history = model.fit(x=get_dataset(train.iloc[idxT], tokenizer, repeated=True, is_sampled=True, 
                                      batch_size=BATCH_SIZE, seq_len=SEQ_LEN), 
                        validation_data=get_dataset(train.iloc[idxV], tokenizer, ordered=True, 
                                                    batch_size=BATCH_SIZE, seq_len=SEQ_LEN), 
                        steps_per_epoch=50, 
                        callbacks=[es, checkpoint], 
                        epochs=EPOCHS,  
                        verbose=2).history
      
    history_list.append(history)
    # Load the best weights saved by the checkpoint callback
    model.load_weights(model_path)
    
    # Results
    print(f"#### FOLD {fold+1} OOF RMSE = {np.min(history['val_root_mean_squared_error']):.4f}")

    # OOF predictions
    valid_ds = get_dataset(train.iloc[idxV], tokenizer, ordered=True, batch_size=BATCH_SIZE, seq_len=SEQ_LEN)
    oof_labels.append([target[0].numpy() for sample, target in iter(valid_ds.unbatch())])
    x_oof = valid_ds.map(lambda sample, target: sample)
    oof_pred.append(model.predict(x_oof)['logits'])

    # Test predictions
    test_ds = get_dataset(test, tokenizer, labeled=False, ordered=True, batch_size=BATCH_SIZE, seq_len=SEQ_LEN)
    x_test = test_ds.map(lambda sample: sample)
    test_pred.append(model.predict(x_test)['logits'])

## Model loss and metrics graph

In [None]:
for fold, history in enumerate(history_list):
    print(f'\nFOLD: {fold+1}')
    plot_metrics(history)

# Model evaluation

The model is evaluated on its out-of-fold (OOF) predictions.

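As a reminder of what out-of-fold evaluation means, here is a small self-contained sketch. It swaps the transformer for scikit-learn's `Ridge` and uses synthetic data, so the model and the numbers are stand-ins chosen only to show how OOF predictions are assembled.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Synthetic regression data standing in for text features and readability targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

oof_pred = np.zeros(len(y))
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in kfold.split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    # Each row is predicted exactly once, by the model that never saw it.
    oof_pred[valid_idx] = model.predict(X[valid_idx])

oof_rmse = float(np.sqrt(np.mean((y - oof_pred) ** 2)))
print(f"OOF RMSE: {oof_rmse:.4f}")
```

Because every training row receives a prediction from a model that was not trained on it, the OOF RMSE uses the whole training set while still approximating performance on unseen data.
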
## OOF metrics

In [None]:
y_true = np.concatenate(oof_labels)
y_preds = np.concatenate(oof_pred)


for fold, history in enumerate(history_list):
    print(f"FOLD {fold+1} RMSE: {np.min(history['val_root_mean_squared_error']):.4f}")
    
# Take the square root explicitly; the `squared` argument is deprecated in recent scikit-learn
print(f'OOF RMSE: {np.sqrt(mean_squared_error(y_true, y_preds)):.4f}')

### **Error analysis**, label x prediction distribution

Here we compare the distribution of the labels with that of the predicted values. In a perfect scenario the two should align.



In [None]:
preds_df = pd.DataFrame({'Label': y_true, 'Prediction': y_preds[:,0]})

fig, ax = plt.subplots(1, 1, figsize=(20, 6))
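# histplot replaces the deprecated distplot; distinct colors keep the two series apart
sns.histplot(preds_df['Label'], kde=True, stat='density', color='C0', ax=ax, label='Label')
sns.histplot(preds_df['Prediction'], kde=True, stat='density', color='C1', ax=ax, label='Prediction')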
ax.legend()
plt.show()

In [None]:
sns.jointplot(data=preds_df, x='Label', y='Prediction', kind='reg', height=10)
plt.show()

# Test set predictions

In [None]:
submission = test[['id']].copy()  # copy to avoid assigning into a view of `test`
submission['target'] = np.mean(test_pred, axis=0).ravel()  # average the folds, flatten to 1-D
submission.to_csv('submission.csv', index=False)
display(submission.head(10))