# What & why

This notebook provides a function that computes a number of text features for every excerpt in the CommonLit Readability challenge dataset and adds them as new columns.

Depending on your model, these features may be worth including. Although they are rough compared with more sophisticated methods such as transformer-based models, some of these metrics are correlated with excerpt complexity. In addition, adding a model based on them to your prediction ensemble might improve performance.

# How

This function is based on the package `textstat`, which provides a very easy-to-use API for extracting a number of readability features from text. For more details on these metrics, please refer to the `textstat` documentation at https://textstat.readthedocs.io/en/latest/.
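
To give a feel for what these metrics measure, here is a rough, hand-rolled sketch of the Flesch Reading Ease formula. It is not `textstat`'s implementation: the helper names and the vowel-group syllable counter are simplifying assumptions made for illustration, so the numbers will differ from what `textstat` returns.

```python
import re

def count_syllables(word: str) -> int:
    # Crude assumption: one syllable per run of consecutive vowels
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def rough_flesch_reading_ease(text: str) -> float:
    # Naive sentence and word splitting (an assumption, not textstat's tokenizer)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    # Flesch Reading Ease: higher scores mean easier text
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )

easy = rough_flesch_reading_ease("The cat sat. The dog ran.")
hard = rough_flesch_reading_ease(
    "Institutional accountability necessitates comprehensive documentation."
)
print(easy, hard)
```

Short sentences made of short words score high, while long, polysyllabic sentences score low, which is why metrics of this kind can carry some signal about excerpt complexity.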

# How to use

You can either:
1) import the dataset produced by this notebook (`augmented_df.csv`) into your inference notebook; or
2) apply the function `include_text_features` to your own dataset.

If you choose (1), please consider using a good validation scheme. If you choose (2), applying the function separately to your training, validation and test datasets could also help with performance.
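
For option (2), applying the function split by split could look roughly like the sketch below. To keep it runnable without `textstat`, it uses a hypothetical stand-in (`toy_text_features`) and made-up excerpts; in practice you would call `include_text_features` on each split instead. Resetting the index of each split keeps the column-wise `pd.concat` aligned row by row.

```python
import pandas as pd

def toy_text_features(excerpt: pd.Series) -> pd.DataFrame:
    # Hypothetical stand-in for text_features so the sketch runs without textstat
    return pd.DataFrame(
        {
            "n_chars": excerpt.str.len(),
            "n_words": excerpt.str.split().str.len(),
        },
        index=excerpt.index,
    )

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    # Same idea as include_text_features: original columns plus feature columns
    return pd.concat([df, toy_text_features(df["excerpt"])], axis=1)

# Made-up excerpts, for illustration only
data = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "excerpt": ["A short one.", "Another short excerpt here.", "Tiny.", "Two words."],
})

# Hypothetical split; reset the index so each part starts from 0
train_part = data.iloc[:3].reset_index(drop=True)
valid_part = data.iloc[3:].reset_index(drop=True)

# Each split gets its features computed independently
train_aug = add_features(train_part)
valid_aug = add_features(valid_part)
```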

Also, if you pass the values to a neural network, consider subtracting the batch mean and dividing by the batch standard deviation, since neural networks generally train better on normalised inputs.
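
A minimal sketch of that normalisation with NumPy is shown below. The function name `standardise_batch`, the `eps` guard and the sample values are assumptions for illustration; most deep learning frameworks offer an equivalent layer or transform.

```python
import numpy as np

def standardise_batch(batch: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Standardise each feature column of a (batch_size, n_features) array."""
    mean = batch.mean(axis=0, keepdims=True)
    std = batch.std(axis=0, keepdims=True)
    # eps guards against division by zero when a feature is constant in the batch
    return (batch - mean) / (std + eps)

# Made-up values for two features (illustrative only)
batch = np.array([
    [80.0, 5.0],
    [60.0, 9.0],
    [40.0, 13.0],
])
scaled = standardise_batch(batch)
```

After this step every feature column has roughly zero mean and unit standard deviation within the batch, so metrics on very different scales (for example `flesch_reading_ease` and `difficult_words`) contribute comparably.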

# Ok, that's it...

Good luck in the competition, and leave a comment below with your feedback and your impressions of using these features!

# Code

In [None]:
!pip install textstat

In [None]:
import pandas as pd
import textstat

In [None]:
file_path = "../input/commonlitreadabilityprize/"
train_df = pd.read_csv(file_path+"train.csv")
train_df = train_df[['id', 'excerpt', 'target']]
train_df = train_df.sample(frac=1).reset_index(drop=True)

In [None]:
def text_features(excerpt):
    """Compute textstat readability metrics for each excerpt in a Series."""
    feat = excerpt.map(lambda x: {
        "flesch_reading_ease": textstat.flesch_reading_ease(x),
        "smog_index": textstat.smog_index(x),
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(x),
        "coleman_liau_index": textstat.coleman_liau_index(x),
        "automated_readability_index": textstat.automated_readability_index(x),
        "dale_chall_readability_score": textstat.dale_chall_readability_score(x),
        "difficult_words": textstat.difficult_words(x),
        "linsear_write_formula": textstat.linsear_write_formula(x),
        "gunning_fog": textstat.gunning_fog(x),
        # "text_standard": textstat.text_standard(x),  # left disabled
    })
    # Expand the Series of dicts into one column per metric, keeping the
    # original index so the rows line up with the source DataFrame
    return pd.DataFrame(feat.tolist(), index=excerpt.index)

def include_text_features(df):
    """Return df with the text feature columns appended."""
    return pd.concat([df, text_features(df["excerpt"])], axis=1)

In [None]:
augmented_df = include_text_features(train_df)
augmented_df

In [None]:
# index=False avoids writing the row index as an extra unnamed column
augmented_df.to_csv("augmented_df.csv", index=False)