# CommonLit Readability Feature Importances

This competition is interesting because it can be approached from several angles: tabular features, NLP, CV, etc. In this notebook, I'd like to explore the tabular features and their importances. I'll use the `readability` package (thanks to @takamichitoda for the demo notebook). The feature importance investigation follows the process from the book **Deep Learning for Coders with Fastai and PyTorch** by @jhoward and @sgugger.

Sources:
- https://github.com/fastai/fastbook/blob/master/09_tabular.ipynb
- https://www.kaggle.com/takamichitoda/commonlit-classical-methods-for-text-readability
- https://github.com/andreasvc/readability/

In [None]:
!pip install ../input/readability-package -qq

!mkdir -p /tmp/pip/cache/
!cp ../input/syntok/wheels/syntok-1.3.1.xyz /tmp/pip/cache/syntok-1.3.1.tar.gz
!cp ../input/syntok/wheels/regex-2021.4.4-cp37-cp37m-manylinux2014_x86_64.whl /tmp/pip/cache/
!pip install --no-index --find-links /tmp/pip/cache/ syntok

In [None]:
import readability
import numpy as np
import pandas as pd 
import os
import syntok.segmenter as segmenter
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from tqdm.auto import tqdm
import scipy.stats  # explicit submodule import; cluster_columns calls scipy.stats.spearmanr
from scipy.cluster import hierarchy as hc
from matplotlib import pyplot as plt

In [None]:
# https://www.kaggle.com/takamichitoda/commonlit-classical-methods-for-text-readability
def _calc_readability(text):
    tokenized = '\n\n'.join(
         '\n'.join(' '.join(token.value for token in sentence)
            for sentence in paragraph)
         for paragraph in segmenter.analyze(text))
    return readability.getmeasures(tokenized, lang='en')

def _extract_feat(row):
    dic = {}
    for k, v in row.items():
        for kk, vv in v.items():
            key = f'{"_".join(k.split())}_{kk}'
            dic.update({key: vv})
    return dic

# https://github.com/fastai/fastbook/blob/master/utils.py
def cluster_columns(df, figsize=(10,6), font_size=12):
    corr = np.round(scipy.stats.spearmanr(df).correlation, 4)
    corr_condensed = hc.distance.squareform(1-corr)
    z = hc.linkage(corr_condensed, method='average')
    fig = plt.figure(figsize=figsize)
    hc.dendrogram(z, labels=df.columns, orientation='left', leaf_font_size=font_size)
    plt.show()
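
To make the column naming concrete, here is a minimal sketch of the flattening that `_extract_feat` performs. The nested dict is a made-up stand-in for the output of `readability.getmeasures` (the values are invented), and `flatten_measures` is a hypothetical name used only for this illustration.

```python
# Toy stand-in for the {group: {measure: value}} structure; values are invented.
measures = {
    "readability grades": {"Kincaid": 5.2, "ARI": 6.1},
    "sentence info": {"characters_per_word": 4.3},
}

def flatten_measures(nested):
    """Flatten {group: {name: value}} into {'group_name': value}."""
    flat = {}
    for group, values in nested.items():
        prefix = "_".join(group.split())  # 'sentence info' -> 'sentence_info'
        for name, value in values.items():
            flat[f"{prefix}_{name}"] = value
    return flat

flat = flatten_measures(measures)
print(flat)
```

This is why the feature columns later in the notebook have names such as `readability_grades_Kincaid` and `sentence_info_characters_per_word`.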

## Random Forest Regressor

Let's start by building a simple random forest regressor with all the readability features and check its RMSE on a single validation fold (using folds by @abhishek).

In [None]:
df = pd.read_csv('../input/step-1-create-folds/train_folds.csv')
df['readability'] = df['excerpt'].map(_calc_readability)

df_features = pd.DataFrame(df['readability'].map(_extract_feat).tolist())
df_features['kfold'] = df['kfold']
df_features['target'] = df['target']

df_train = df_features[df_features.kfold != 0].reset_index(drop=True)
df_valid = df_features[df_features.kfold == 0].reset_index(drop=True)

train_features = df_train.drop(['kfold', 'target'], axis=1)
valid_features = df_valid.drop(['kfold', 'target'], axis=1)
train_labels = df_train.target.values
valid_labels = df_valid.target.values
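
The `kfold` column above comes from a precomputed file. As a rough sketch of how such a column could be built, the snippet below assigns folds with a plain shuffled `KFold` on a toy frame. This is an assumption for illustration only: the actual fold file may have been created differently (for example with stratification), and `toy_df` is a made-up stand-in for the training data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Toy frame standing in for the training data (20 rows, invented targets).
toy_df = pd.DataFrame({"target": np.linspace(-3, 1, 20)})
toy_df["kfold"] = -1

# Assumption: plain shuffled 5-fold split; each row gets the fold in which it is validated.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (_, valid_idx) in enumerate(kf.split(toy_df)):
    toy_df.loc[valid_idx, "kfold"] = fold

print(toy_df["kfold"].value_counts().sort_index())
```

With a column like this, `kfold != 0` selects the training rows and `kfold == 0` the validation rows, as in the cell above.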

In [None]:
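def rf(train_features, valid_features, train_labels, valid_labels):
    # Fit a default random forest and report RMSE on the validation fold
    model = RandomForestRegressor(random_state=42)
    model.fit(train_features, train_labels)
    valid_preds = model.predict(valid_features)
    # Take the square root explicitly: `squared=False` is deprecated in newer scikit-learn releases
    rmse = np.sqrt(mean_squared_error(valid_labels, valid_preds))
    return model, rmse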

model, rmse = rf(train_features, valid_features, train_labels, valid_labels)
rmse

## Feature Importances

We can now check the relative feature importances and plot them. 

In [None]:
def rf_feat_importance(m, df):
    return pd.DataFrame({'cols':df.columns, 'imp':m.feature_importances_}
                       ).sort_values('imp', ascending=False)

In [None]:
fi = rf_feat_importance(model, train_features)
fi[:5]

In [None]:
def plot_fi(fi):
    return fi.plot(x='cols', y='imp', kind='barh', figsize=(12,7), legend=False)

plot_fi(fi);
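
The `feature_importances_` attribute used above is impurity-based and computed from the training data, and it can be misleading for some kinds of features. As a possible cross-check (not part of the original workflow), scikit-learn's `permutation_importance` measures how much the validation score drops when a column is shuffled. The sketch below runs on synthetic data; `toy_model`, the column names, and the data are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: 'signal' drives the target, 'weak' contributes a little, 'noise' not at all.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["signal", "weak", "noise"])
y = 3.0 * X["signal"] + 0.3 * X["weak"] + rng.normal(scale=0.1, size=300)

X_train, X_valid = X.iloc[:200], X.iloc[200:]
y_train, y_valid = y.iloc[:200], y.iloc[200:]

toy_model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle each column of the validation set a few times and record the mean score drop.
result = permutation_importance(toy_model, X_valid, y_valid, n_repeats=5, random_state=0)
perm_fi = pd.DataFrame(
    {"cols": X.columns, "perm_imp": result.importances_mean}
).sort_values("perm_imp", ascending=False)
print(perm_fi)
```

In this notebook the same call could be tried with `model`, `valid_features`, and `valid_labels` to compare against the impurity-based ranking.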

Let's drop the features with an importance of 0.01 or less and see the effect on the score.

In [None]:
to_keep = fi[fi.imp>0.01].cols

train_features_imp = train_features[to_keep]
valid_features_imp = valid_features[to_keep]

model, rmse = rf(train_features_imp, valid_features_imp, train_labels, valid_labels)
rmse

The score improved! Fewer features mean a smaller risk of overfitting, so this is promising.

## Similar Features

We can now see which features are highly correlated with each other, and check whether removing any of them gives us a better score.

In [None]:
cluster_columns(train_features_imp)
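
The dendrogram is built from Spearman rank correlations converted to distances (`1 - corr`). For a numeric view of the same information, one option is to list the most correlated column pairs directly. The sketch below uses a synthetic frame; `toy` and its column names are invented, and pandas' `corr(method="spearman")` is used in place of `scipy.stats.spearmanr`.

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "a_monotone": np.exp(base),  # monotone transform of 'a': identical ranks
    "b": rng.normal(size=200),   # independent column
})

# Rank correlation matrix; 1 - corr is the distance that the clustering works on.
corr = toy.corr(method="spearman")

# Every unordered column pair, sorted from most to least correlated.
pairs = sorted(
    ((corr.loc[a, b], a, b) for a, b in combinations(toy.columns, 2)),
    reverse=True,
)
for value, a, b in pairs:
    print(f"{a} vs {b}: {value:.3f}")
```

Pairs near the top of this list correspond to leaves that merge early (far to the right) in the dendrogram.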

In [None]:
def get_score(train_feats, f):
    model = RandomForestRegressor(random_state=42)
    model.fit(train_feats.drop(f, axis=1), train_labels)
    valid_preds = model.predict(valid_features_imp.drop(f, axis=1))
    rmse = np.sqrt(mean_squared_error(valid_labels, valid_preds))  # explicit sqrt; `squared=False` is deprecated
    return rmse

In [None]:
{f:get_score(train_features_imp, f) for f in (
    'sentence_info_characters_per_word', 'readability_grades_Coleman-Liau', 'readability_grades_Kincaid',
    'readability_grades_ARI', 'readability_grades_LIX', 'readability_grades_RIX')}

The biggest improvement comes from removing the `readability_grades_RIX` feature, which incidentally is also the most important feature! I'm glad we discovered this. As explained, for example, [here](https://readable.com/blog/the-lix-and-rix-readability-formulas), the LIX and RIX features are highly correlated, so it should be safe to remove RIX from our feature set. Let's also review our final feature importances.
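
To see why the two scores move together, here is a rough sketch of LIX and RIX as they are commonly defined: LIX is words per sentence plus the percentage of long words, and RIX is long words per sentence, where a long word has more than 6 letters. The regex tokenization below is a simplifying assumption and is not how the `readability` package counts words and sentences; `lix_rix` is a hypothetical helper.

```python
import re

def lix_rix(text):
    """Return (LIX, RIX) using a naive regex tokenizer (illustration only)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    long_words = [w for w in words if len(w) > 6]  # long word: more than 6 letters
    lix = len(words) / len(sentences) + 100 * len(long_words) / len(words)
    rix = len(long_words) / len(sentences)
    return lix, rix

easy = "The cat sat on the mat. The dog ran to the park."
hard = "Institutional frameworks frequently complicate international negotiations."

print(lix_rix(easy))
print(lix_rix(hard))
```

Both formulas depend on the same long-word and sentence counts, so a text that scores high on one tends to score high on the other.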

In [None]:
train_features_final = train_features_imp.drop('readability_grades_RIX', axis=1)
valid_features_final = valid_features_imp.drop('readability_grades_RIX', axis=1)

model, rmse = rf(train_features_final, valid_features_final, train_labels, valid_labels)
rmse

In [None]:
plot_fi(rf_feat_importance(model, train_features_final));

## 5-fold Training and Test Inference

Let's confirm that our feature removal also holds up when evaluated in a 5-fold CV setting. We'll compute the mean RMSE for both the *all features* and the *final features* settings.

In [None]:
test_df = pd.read_csv('../input/commonlitreadabilityprize/test.csv')
test_df['readability'] = test_df['excerpt'].map(_calc_readability)
test_df_features = pd.DataFrame(test_df['readability'].map(_extract_feat).tolist())

In [None]:
def get_preds(features):
    test_features = test_df_features[features]

    fold_scores = []
    fold_preds = []

    for fold in tqdm(range(5)):
        df_train = df_features[df_features.kfold != fold].reset_index(drop=True)
        df_valid = df_features[df_features.kfold == fold].reset_index(drop=True)

        train_features = df_train[features]
        valid_features = df_valid[features]

        train_labels = df_train.target.values
        valid_labels = df_valid.target.values

        model, rmse = rf(train_features, valid_features, train_labels, valid_labels)
        fold_scores.append(rmse)

        test_preds = model.predict(test_features)
        fold_preds.append(test_preds)

    return np.mean(fold_scores), fold_scores, fold_preds

In [None]:
features = [x for x in df_features.columns.tolist() if x not in ['kfold', 'target']]
score, fold_scores, fold_preds = get_preds(features)
score

In [None]:
features = train_features_final.columns.tolist()
score, fold_scores, fold_preds = get_preds(features)
score

Our final features perform slightly better than all features, so we will use them to make our final submission.

In [None]:
preds = np.stack(fold_preds).mean(axis=0)
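
The line above averages the per-fold test predictions element-wise. A tiny illustration with made-up numbers (three hypothetical folds, three test rows) shows the shapes involved:

```python
import numpy as np

# Invented predictions: one array per fold, one value per test row.
toy_fold_preds = [
    np.array([-1.0, 0.0, 1.0]),
    np.array([-0.5, 0.5, 1.5]),
    np.array([-1.5, 0.1, 0.5]),
]

stacked = np.stack(toy_fold_preds)  # shape: (n_folds, n_test_rows)
avg = stacked.mean(axis=0)          # average over folds -> one value per test row
print(stacked.shape, avg)
```

Each test row ends up with the mean of the predictions from the models trained on the different folds.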

## Submission

In [None]:
sub = pd.read_csv('../input/commonlitreadabilityprize/sample_submission.csv')
sub.target = preds
sub.to_csv('submission.csv', index=False)
sub.head()