### **Introduction**

In this notebook I do exploratory data analysis, data cleaning and some basic feature engineering using NLP techniques. Finally, I build an sklearn pipeline to find a reasonably good model.

### **Import modules and packages**

In [None]:
!pip install pandarallel
!pip install xgboost
!pip install catboost

In [None]:
import numpy as np
import pandas as pd 
import os
import re
import pickle
import seaborn as sns
import matplotlib.pyplot as plt

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 

from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True)

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold, RandomizedSearchCV
from sklearn.feature_selection import RFECV, RFE
from sklearn.metrics import mean_squared_error, make_scorer, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

import xgboost
import lightgbm
import catboost

#### **Load the data**

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
train_df = pd.read_csv('/kaggle/input/commonlitreadabilityprize/train.csv')
test_df = pd.read_csv('/kaggle/input/commonlitreadabilityprize/test.csv')

### **Exploratory data analysis**

During this first step, I'm going to take a look at basic information about the variables we have in the dataset like the share of missing values, unique values and the distribution of values in case of numeric variables.

**Variables**
- *id* - unique ID for excerpt
- *url_legal* - URL of source - this is blank in the test set.
- *license* - license of source material - this is blank in the test set.
- *excerpt* - text to predict reading ease of
- *target* - reading ease
- *standard_error* - measure of spread of scores among multiple raters for each excerpt. Not included for test data.

In [None]:
def get_basic_info(df):
    """ Get basic information and statistics on a given dataframe.
    
    :param df: dataframe to be analyzed
    """
    print('Basic info: \n')
    print(df.info())
    print('\n Basic stats: \n')
    print(df.describe(include='all'))
    
    num_cols = df.loc[:, df.dtypes != object].columns
    
    if len(num_cols) > 0:
        # squeeze=False keeps `axes` two-dimensional even with a single numeric column
        f, axes = plt.subplots(1, len(num_cols), figsize=(16, 6), sharex=False, squeeze=False)

        for i, col in enumerate(num_cols):
            sns.histplot(df[col], color="skyblue", ax=axes[0, i])

In [None]:
get_basic_info(train_df)

As we can see, 4 out of 6 columns have no missing values, while in the remaining two, *url_legal* and *license*, around 70% of the values are missing.

All of the ids and excerpts are unique. 

As for the distributions of the numeric features, *target* is very close to normally distributed, while *standard_error* is a bit trickier. It has some 0 values, yet the 25th percentile is at 0.4685, so judging by the range of values alone one could first assume a negatively (left-) skewed distribution. However, the median is somewhat smaller than the mean, which is not what we would expect from a negatively skewed distribution. The histogram on the right confirms this suspicion: the distribution is positively skewed. The zeros are therefore presumably outliers.
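
The mean-versus-median reasoning above can be checked numerically. Here is a minimal sketch on synthetic data (the `values` series is made up for illustration and is not the competition data), using pandas' `Series.skew()`:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (illustrative only, not the competition data).
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=0.0, sigma=0.5, size=5000))

mean, median, skew = values.mean(), values.median(), values.skew()

# In a positively skewed distribution the long right tail pulls the mean above the median.
print(f"mean={mean:.3f}, median={median:.3f}, skewness={skew:.3f}")
```

A positive skewness together with a mean above the median is the pattern described for *standard_error*.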

In [None]:
get_basic_info(test_df)

Regarding the test set's variables, we can make similar observations as all ids and excerpts are unique and there are some missing values in the *url_legal* and *license* columns.

**The Target**

The *target* variable holds reading ease scores that range from -3.676 (supposedly the most difficult text) to 1.711 (the easiest one).

As we have seen above, values in the target variable follow a distribution very close to normal.

Let's see some example texts from either end of the scale!

In [None]:
# Look the rows up by position of the extreme values instead of hard-coded index labels
hardest = train_df.loc[train_df["target"].idxmin()]
easiest = train_df.loc[train_df["target"].idxmax()]

print("Min Target:", hardest["target"], "\nText:", hardest["excerpt"], "\n")
print("Max Target:", easiest["target"], "\nText:", easiest["excerpt"])

**The Standard Error**

Each excerpt was rated by multiple people which makes this variable a measure of the spread of scores among these raters.

When it comes to rating the difficulty of texts, the task is actually more complex than it might seem at first. There is no well-defined set of factors that raters can use to map how easily a text can be read to values on a scale. This makes these ratings very subjective.

This subjectivity is reflected in the *standard_error* variable: as the histogram above shows, even the 25th percentile is 0.4685! The mean standard error is 0.4914, which means that raters disagreed on the difficulty of a particular text by almost half a point on average. This is a huge disagreement if we take into account that the target only spans a range of around 5.387.
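
To put these numbers into perspective, the arithmetic can be spelled out. This small sketch only uses the figures quoted in the text; the variable names are mine:

```python
# Figures quoted in the text above.
target_min, target_max = -3.676, 1.711
mean_standard_error = 0.4914

target_range = target_max - target_min
share_of_range = mean_standard_error / target_range

print(f"range = {target_range:.3f}")
print(f"mean standard error as a share of the range = {share_of_range:.1%}")
```

So the average standard error amounts to roughly 9% of the whole scale.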

**Let's check out the correlation between the target and the standard error!**

In [None]:
sns.scatterplot(x='target', y='standard_error', data=train_df)

Based on this plot we can say that, on average, people tended to disagree more on texts towards either end of the reading ease scale. At the same time, there is no clear linear correlation between the target and the standard error. We also have an outlier whose target and standard error are both 0. This is the observation we saw earlier on the distribution plot; it should be excluded from the dataset.
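
Higher disagreement at both ends of the scale and a missing correlation are not contradictory, because the Pearson coefficient only measures linear association. A minimal sketch on a synthetic U-shaped relationship (made-up data, not the competition data):

```python
import numpy as np

# Synthetic U-shaped relationship (illustrative only): y is fully determined by x,
# yet the linear (Pearson) correlation is essentially zero.
x = np.linspace(-2.0, 2.0, 401)
y = x ** 2

pearson_r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {pearson_r:.4f}")
```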

In [None]:
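# .copy() avoids SettingWithCopyWarning when new columns are added to the filtered frame later
train_df = train_df[train_df['standard_error'] != 0.0].copy()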

**Segmenting**

Now let's do some segmenting in terms of difficulty to understand the data a bit more. I'm dividing the dataset into 3 parts based on percentiles: easy, moderate and hard.

In [None]:
train_df['reading_difficulty'] = pd.cut(
    train_df['target'], bins=[min(train_df['target']),
                              np.percentile(train_df['target'], 33),
                              np.percentile(train_df['target'], 67),
                              max(train_df['target'])], 
    labels=['hard', 'moderate', 'easy'],
    include_lowest=True)  # otherwise the row holding the minimum target falls outside the first bin
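
Percentile-based segmenting can also be written with `pd.qcut`, which derives the bin edges from quantiles. This is a sketch on synthetic scores, not the method used in this notebook; note that `q=3` cuts at exact thirds rather than at the 33rd and 67th percentiles:

```python
import numpy as np
import pandas as pd

# Synthetic scores standing in for the target column (illustrative only).
rng = np.random.default_rng(42)
scores = pd.Series(rng.normal(loc=-1.0, scale=1.0, size=900))

# qcut derives the bin edges from quantiles, so no observation falls outside the bins.
segments = pd.qcut(scores, q=3, labels=['hard', 'moderate', 'easy'])

print(segments.value_counts())
print("unassigned rows:", segments.isna().sum())
```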

In [None]:
dummy_segments = pd.get_dummies(train_df['reading_difficulty'])

train_df['reading_difficulty_easy'] = dummy_segments.easy
train_df['reading_difficulty_moderate'] = dummy_segments.moderate
train_df['reading_difficulty_hard'] = dummy_segments.hard

In [None]:
# Only the last expression of a cell is displayed, so the shares are printed explicitly
print(train_df['reading_difficulty'].value_counts(normalize = True) * 100)
sns.countplot(x = 'reading_difficulty', data = train_df, color = 'teal')

So, now the distribution of the target variable per segment looks like this:

In [None]:
sns.catplot(x = 'reading_difficulty', y = 'target', kind = 'violin', split = True, data = train_df)

As we saw on the scatter plot above, the standard error is slightly lower for texts of moderate difficulty. The average standard error differs little between the segments, however: it is roughly 0.5 in each of them.

In [None]:
sns.catplot(x = 'reading_difficulty', y = 'standard_error', kind = 'violin', split = True, data = train_df)

### **Data cleaning and text preprocessing**

Since there is nothing we can do about imputing values for the missing ones in the *url_legal* and *license* columns, I'm going to leave them as they are and just clean the texts in the *excerpt* column. I'm also going to tokenize this text, get rid of the stop words and lemmatize the word tokens.

In [None]:
def get_cleaned_text(text):
    """ Clean the provided text: get rid of special characters, convert to lowercase.
    
    :param text: string to be cleaned
    :returns cleaned text as a string
    """
    text = re.sub(r"[^a-zA-Z0-9]+", ' ', text)
    text = text.lower()
    return text
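
To see what this cleaning step does in practice, here is a small sketch that applies the same regular expression to a made-up sample sentence:

```python
import re

def get_cleaned_text(text):
    """Replace runs of non-alphanumeric characters with a space, then lowercase."""
    text = re.sub(r"[^a-zA-Z0-9]+", ' ', text)
    return text.lower()

sample = "Hello, World! It's 2021."
cleaned = get_cleaned_text(sample)
print(repr(cleaned))  # 'hello world it s 2021 '
```

Note that apostrophes split contractions into separate tokens ("it s") and that the final period leaves a trailing space.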

In [None]:
train_df['excerpt_cleaned'] = train_df['excerpt'].apply(get_cleaned_text)

In [None]:
train_df['excerpt_tokenized'] = train_df['excerpt_cleaned'].apply(
    lambda text: word_tokenize(text))

In [None]:
stop_words = set(stopwords.words('english')) 
train_df['excerpt_without_stopwords'] = train_df['excerpt_tokenized'].apply(
    lambda text: [word for word in text if word not in stop_words])

In [None]:
lemmatizer = WordNetLemmatizer()
train_df['excerpt_lemmatized'] = train_df['excerpt_without_stopwords'].apply(
    lambda text: [lemmatizer.lemmatize(word, pos='v') for word in text])

### **Feature engineering**

First of all, I'm going to extract some basic characteristics of the excerpts into features.

In [None]:
train_df['paragraph_length'] = train_df['excerpt_cleaned'].apply(
    lambda text: len(text))

In [None]:
train_df['avg_sentence_length'] = train_df['excerpt'].apply(
    lambda text: round((sum([len(sentence) for sentence in text.split('.')])/len(text.split('.'))),2))

In [None]:
train_df['avg_word_length'] = train_df['excerpt_tokenized'].apply(
    lambda text: round((sum([len(word) for word in text])/len(text)),2))

In [None]:
train_df['sentence_count'] = train_df['excerpt'].apply(
    lambda text: len(text.split('.')))
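
Splitting on `'.'` is only an approximation of sentence boundaries: it ignores `?` and `!` and counts the empty piece after a final period. The sketch below compares it with a regex-based split on a made-up text. The regex is a hypothetical alternative, not what the features above use, and it still mishandles abbreviations such as "Dr."; NLTK's `sent_tokenize` would be a more robust option.

```python
import re

text = "It rained. Was he late? No! He left."

# Approach used above: every '.' is a boundary and the trailing empty piece is counted too.
naive_count = len(text.split('.'))

# Hypothetical alternative: split on '.', '!' or '?' followed by whitespace or the end
# of the text, then drop empty pieces.
pieces = [s for s in re.split(r'[.!?]+(?:\s+|$)', text) if s.strip()]
regex_count = len(pieces)

print(naive_count, regex_count)  # 3 4
```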

In [None]:
train_df['word_count'] = train_df['excerpt_tokenized'].apply(
    lambda text: len(text))

In [None]:
train_df['unique_word_count'] = (
    train_df['excerpt_lemmatized'].apply(
        lambda text: len(set(text))
    )
)

Next, using NLTK's parts of speech tags, I'm going to count certain parts of speech in the excerpts.

In [None]:
def get_parts_of_speech_count(text, list_of_pos):
    """ Count the specified parts of speech in a tokenized text.
    
    :param text: list of word tokens in which the specified parts of speech are counted
    :param list_of_pos: list of the part-of-speech tags to count
    :returns the number of tokens tagged with one of the given tags, as an integer
    """
    # nltk.pos_tag returns a list of (token, tag) tuples
    text_pos_list = nltk.pos_tag(text)

    return sum(1 for _, pos_tag in text_pos_list if pos_tag in list_of_pos)
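
Since several tag groups are counted per excerpt, one option is to tag a text once and derive all counts from the same tally. A minimal sketch, assuming the `(token, tag)` pairs have already been produced by `nltk.pos_tag`; the `tagged` list is hand-written and `count_tags` is a hypothetical helper:

```python
from collections import Counter

# Hand-written (token, tag) pairs in the shape returned by nltk.pos_tag (illustrative only).
tagged = [('dog', 'NN'), ('chase', 'VBP'), ('cats', 'NNS'), ('quickly', 'RB'), ('run', 'VB')]

# Tally every tag once.
tag_counts = Counter(tag for _, tag in tagged)

def count_tags(tag_counts, list_of_pos):
    """Sum the counts of every tag in list_of_pos."""
    return sum(tag_counts[tag] for tag in list_of_pos)

noun_count = count_tags(tag_counts, ['NN', 'NNS'])
verb_count = count_tags(tag_counts, ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'])
print(noun_count, verb_count)  # 2 2
```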

In [None]:
train_df['noun_count'] = (
    train_df['excerpt_lemmatized'].apply(
        get_parts_of_speech_count, 
        list_of_pos=['NN', 'NNS'])
)

train_df['proper_noun_count'] = (
    train_df['excerpt_lemmatized'].apply(
        get_parts_of_speech_count, 
        list_of_pos=['NNP', 'NNPS'])
)

train_df['verb_count'] = (
    train_df['excerpt_lemmatized'].apply(
        get_parts_of_speech_count, 
        list_of_pos=['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'])
)

train_df['adjective_count'] = (
    train_df['excerpt_lemmatized'].apply(
        get_parts_of_speech_count, 
        list_of_pos=['JJ', 'JJR', 'JJS'])
)

train_df['cardinal_digit_count'] = (
    train_df['excerpt_lemmatized'].apply(
        get_parts_of_speech_count, 
        list_of_pos=['CD'])
)

train_df['adverb_count'] = (
    train_df['excerpt_lemmatized'].apply(
        get_parts_of_speech_count, 
        list_of_pos=['RB', 'RBR', 'RBS'])
)

train_df['preposition_count'] = (
    train_df['excerpt_lemmatized'].apply(
        get_parts_of_speech_count, 
        list_of_pos=['IN'])
)

train_df['foreign_word_count'] = (
    train_df['excerpt_lemmatized'].apply(
        get_parts_of_speech_count, 
        list_of_pos=['FW'])
)

Let's calculate word frequencies within each text as well and create some more features out of it!

*For this part @andradaolteanu's awesome notebook has been a major inspiration.*

In [None]:
word_freq = pd.read_csv('../input/english-word-frequency/unigram_freq.csv')

word_freq = dict(zip(word_freq['word'], word_freq['count']))
available_words = set(word_freq.keys())

train_df['word_freq_in_text'] = train_df['excerpt_lemmatized'].parallel_apply(
    lambda x: [word_freq.get(word, 0) for word in list(x) if word in available_words])

In [None]:
train_df['mean_word_freq_in_text'] = train_df['word_freq_in_text'].apply(lambda x: np.mean(x))
train_df['std_word_freq_in_text'] = train_df['word_freq_in_text'].apply(lambda x: np.std(x))
train_df['min_word_freq_in_text'] = train_df['word_freq_in_text'].apply(lambda x: np.min(x))
train_df['max_word_freq_in_text'] = train_df['word_freq_in_text'].apply(lambda x: np.max(x))
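
One edge case to keep in mind: if none of the words of an excerpt appear in the frequency table, the list is empty and `np.min` / `np.max` raise a `ValueError`. A minimal sketch of a guarded summary; `summarize_frequencies` is a hypothetical helper and not part of the pipeline above:

```python
import numpy as np

def summarize_frequencies(freqs):
    """Return mean, std, min and max of a list of word frequencies (NaN when the list is empty)."""
    if len(freqs) == 0:
        return {'mean': np.nan, 'std': np.nan, 'min': np.nan, 'max': np.nan}
    arr = np.asarray(freqs, dtype=float)
    return {'mean': arr.mean(), 'std': arr.std(), 'min': arr.min(), 'max': arr.max()}

print(summarize_frequencies([10, 20, 30]))
print(summarize_frequencies([]))
```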

**Distributions and correlations**

In [None]:
plt.figure(figsize=(16, 6))

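# np.bool no longer exists in recent NumPy, and corr() needs the non-numeric columns dropped in recent pandas
corr = train_df.select_dtypes(include=['number', 'bool']).corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
heatmap = sns.heatmap(corr, mask=mask, vmin=-1, vmax=1, annot=True, cmap='BrBG')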
heatmap.set_title('Variable Correlation Heatmap', fontdict={'fontsize':18}, pad=16);

We can see that the target variable has a moderate positive correlation with the verb count and with the mean word frequency per text.

It also has a moderate negative correlation with the average word length, the paragraph length and the average sentence length. In the case of these three variables there is a high chance that we are dealing with multicollinearity: the longer the words are, the longer the sentences and the paragraphs are, so these three variables are probably correlated with each other. During modelling I will include a dimensionality reduction step (PCA) as one way of dealing with this multicollinearity.
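
A common way to quantify multicollinearity is the variance inflation factor (VIF), which can be read off the diagonal of the inverse correlation matrix. This sketch uses synthetic features, not the ones engineered above:

```python
import numpy as np

# Synthetic features (illustrative only): b is almost a copy of a, c is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = a + rng.normal(scale=0.1, size=1000)
c = rng.normal(size=1000)

corr = np.corrcoef(np.column_stack([a, b, c]), rowvar=False)

# The diagonal of the inverse correlation matrix holds each feature's VIF.
vif = np.diag(np.linalg.inv(corr))
print(dict(zip(['a', 'b', 'c'], vif.round(2))))
```

A VIF close to 1 indicates an almost independent feature, while values far above 10 (as for `a` and `b` here) signal strong collinearity.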

In [None]:
ax = sns.pairplot(train_df, 
                  vars=['paragraph_length','sentence_count','avg_sentence_length',
                        'avg_word_length','word_count','unique_word_count'], 
                  hue='reading_difficulty', 
                  palette='Set2', 
                  diag_kind='kde', 
                  height=2.5)

In [None]:
ax = sns.pairplot(train_df, 
                  vars=['mean_word_freq_in_text','std_word_freq_in_text','min_word_freq_in_text',
                        'max_word_freq_in_text','noun_count','verb_count'], 
                  hue='reading_difficulty', 
                  palette='Set2', 
                  diag_kind='kde', 
                  height=2.5)

### **Modelling**

#### **Train-test split**

In [None]:
X = train_df[['paragraph_length', 'avg_sentence_length',
                       'avg_word_length', 'sentence_count', 'word_count', 'unique_word_count',
                       'noun_count', 'proper_noun_count', 'verb_count', 'adjective_count',
                       'cardinal_digit_count', 'adverb_count', 'preposition_count',
                       'foreign_word_count', 'mean_word_freq_in_text',
                       'std_word_freq_in_text', 'min_word_freq_in_text',
                       'max_word_freq_in_text']].copy()

X = X.astype(float)

scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

y = train_df['target'].to_numpy()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

#### **Make root mean squared error a scoring function**

In [None]:
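# Built-in negated RMSE: the search maximises the score, so an error metric has to be negated.
# make_scorer(mean_squared_error) would reward the models with the LARGEST squared error.
scoring_func = 'neg_root_mean_squared_error'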
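
scikit-learn's search classes treat higher scores as better, so an error metric is usually wrapped with `greater_is_better=False`, which flips its sign. A minimal sketch on toy data; the `rmse` helper and the `DummyRegressor` are only there for illustration:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_squared_error

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# greater_is_better=False makes the scorer return -RMSE, so "higher is better" still holds.
rmse_scorer = make_scorer(rmse, greater_is_better=False)

X = np.zeros((4, 1))
y = np.array([0.0, 0.0, 2.0, 2.0])
model = DummyRegressor(strategy='mean').fit(X, y)  # always predicts 1.0

score = rmse_scorer(model, X, y)
print(score)  # -1.0
```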

#### **Create pipeline**

In [None]:
pipe = Pipeline([('scaler', RobustScaler()),
                 ('dimension_reduction', PCA()),
                 ('regression', RandomForestRegressor())])

#### **Create space of candidate learning algorithms and their hyperparameters**

In [None]:
search_space = [{'scaler' : [None, StandardScaler(), RobustScaler(), MinMaxScaler()],
                 'dimension_reduction' : [None, PCA(0.97), PCA(0.95), PCA(0.9)],
                 'regression' : [LinearRegression()]
                },
                {'scaler' : [None, MinMaxScaler()],
                 'dimension_reduction' : [None, PCA(0.97), PCA(0.95), PCA(0.9)],
                 'regression': [RandomForestRegressor()],
                 'regression__n_estimators' : [10, 100, 200, 300],
                 'regression__max_features' : ['sqrt', 1.0],  # 1.0 = all features, which is what 'auto' meant for regressors
                 'regression__min_samples_split' :[2, 5, 10],
                 'regression__min_samples_leaf' :[1,2,4],
                 'regression__max_depth' : [10, 20, 50, 60,90, 100],
                 'regression__bootstrap' : [True]
                },
                {'scaler' : [None, MinMaxScaler()],
                 'dimension_reduction' : [None, PCA(0.97), PCA(0.95), PCA(0.9)],
                 'regression': [xgboost.XGBRegressor()],
                 'regression__n_estimators' : [10, 100, 200, 300],
                 'regression__learning_rate' : [0.05, 0.10, 0.20, 0.30 ] ,
                 'regression__max_depth' : [ 3, 8, 12],
                 'regression__min_child_weight' : [ 1, 3, 7 ],
                 'regression__gamma' : [ 0.0,  0.2, 0.4 ]
                },
                {'scaler' : [None, MinMaxScaler()],
                 'dimension_reduction' : [None, PCA(0.97), PCA(0.95), PCA(0.9)],
                 'regression' : [lightgbm.LGBMRegressor()],
                 'regression__learning_rate' : [0.05, 0.10, 0.20, 0.30],
                 'regression__max_depth' : [-1, 3, 8, 12],
                 'regression__n_estimators' : [ 100, 200, 300, 500],
                 # scale_pos_weight only applies to classification, so it is not searched here
                 'regression__boosting_type' : ['gbdt', 'dart', 'goss']
                }, 
                {'scaler' : [None, MinMaxScaler()],
                 'dimension_reduction' : [None, PCA(0.97), PCA(0.95), PCA(0.9)],
                 'regression' : [catboost.CatBoostRegressor()],
                 'regression__learning_rate' : [0.05, 0.10, 0.20, 0.30],
                 'regression__iterations' : [100,200,500],
                 'regression__depth' : [2,3,5,6],
                 'regression__grow_policy' : ['SymmetricTree', 'Lossguide']
                }
               ]

#### **Randomized search**

*In contrast to GridSearchCV, in this case not all parameter values are tried out, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by n_iter.*

**n_iter** trades off runtime vs. quality of the solution. First I set it to 100 and **cv** to 5, which meant 500 model fits, but it took quite a few hours to finish. After experimenting a bit, I settled on 50 as **n_iter** and 3-fold cv.
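
The runtime trade-off is easy to quantify. The sketch below counts the model fits for both settings and, for comparison, the size of an exhaustive grid over the random forest block of the search space above. The helper and variable names are mine; the list lengths are taken from that block.

```python
from math import prod

def total_fits(n_iter, cv):
    """Number of model fits in a randomized search, excluding the final refit."""
    return n_iter * cv

print(total_fits(100, 5))  # 500
print(total_fits(50, 3))   # 150

# Number of options per hyperparameter in the random forest block of the search space.
rf_space_sizes = {'scaler': 2, 'dimension_reduction': 4, 'n_estimators': 4,
                  'max_features': 2, 'min_samples_split': 3, 'min_samples_leaf': 3,
                  'max_depth': 6, 'bootstrap': 1}

rf_grid_size = prod(rf_space_sizes.values())
print(rf_grid_size)        # 3456 candidates
print(rf_grid_size * 3)    # 10368 fits with 3-fold cv
```

An exhaustive grid search over just this one block would need far more fits than the whole randomized search.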

In [None]:
clf = RandomizedSearchCV(pipe, param_distributions = search_space, n_iter = 50, 
                         cv = 3, scoring=scoring_func, verbose = 2, n_jobs = -1)

In [None]:
clf.fit(X_train, y_train)
clf_df = pd.DataFrame(clf.cv_results_)

In [None]:
clf.best_estimator_.named_steps['regression']

In [None]:
y_pred = clf.best_estimator_.predict(X_test)
print('RMSE: %1.4f\n' % (np.sqrt(mean_squared_error(y_test, y_pred))))

In my run the RMSE was above 1! Having checked the leaderboard of the competition, I have to say that there is plenty of room for improvement. That is not so surprising, since the techniques I've implemented in this notebook are rather basic. For the next attempt I should probably include tf-idf vectorization and definitely experiment with transformer models.

#### **Re-train the chosen model on all data**

In [None]:
X_train = X_scaled.copy()
y_train = train_df['target'].to_numpy()

In [None]:
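# Keep the whole selected pipeline (scaler, PCA and regressor), not just its last step,
# so that the re-trained model matches the configuration chosen by the search
clf_final = clf.best_estimator_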

In [None]:
clf_final.fit(X_train, y_train)

#### **Prediction**

Before we can predict on the test set, we need to create the same features that we have in the training data.

In [None]:
y_test = test_df[['excerpt']]

In [None]:
test_df['excerpt_cleaned'] = test_df['excerpt'].apply(lambda text: get_cleaned_text(text))

test_df['excerpt_tokenized'] = test_df['excerpt_cleaned'].apply(
    lambda text: word_tokenize(text))

test_df['excerpt_without_stopwords'] = test_df['excerpt_tokenized'].apply(
    lambda text: [word for word in text if word not in stop_words])

test_df['excerpt_lemmatized'] = test_df['excerpt_without_stopwords'].apply(
    lambda text: [lemmatizer.lemmatize(word, pos='v') for word in text])

test_df['paragraph_length'] = test_df['excerpt_cleaned'].apply(
    lambda text: len(text))

test_df['avg_sentence_length'] = test_df['excerpt'].apply(
    lambda text: round((sum([len(sentence) for sentence in text.split('.')])/len(text.split('.'))),2))

test_df['avg_word_length'] = test_df['excerpt_tokenized'].apply(
    lambda text: round((sum([len(word) for word in text])/len(text)),2))

test_df['sentence_count'] = test_df['excerpt'].apply(
    lambda text: len(text.split('.')))

test_df['word_count'] = test_df['excerpt_tokenized'].apply(
    lambda text: len(text))

test_df['unique_word_count'] = (
    test_df['excerpt_lemmatized'].apply(
        lambda text: len(set(text))
    )
)

test_df['noun_count'] = (
    test_df['excerpt_lemmatized'].apply(
        get_parts_of_speech_count, 
        list_of_pos=['NN', 'NNS'])
)

test_df['proper_noun_count'] = (
    test_df['excerpt_lemmatized'].apply(
        get_parts_of_speech_count, 
        list_of_pos=['NNP', 'NNPS'])
)

test_df['verb_count'] = (
    test_df['excerpt_lemmatized'].apply(
        get_parts_of_speech_count, 
        list_of_pos=['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'])
)

test_df['adjective_count'] = (
    test_df['excerpt_lemmatized'].apply(
        get_parts_of_speech_count, 
        list_of_pos=['JJ', 'JJR', 'JJS'])
)

test_df['cardinal_digit_count'] = (
    test_df['excerpt_lemmatized'].apply(
        get_parts_of_speech_count, 
        list_of_pos=['CD'])
)

test_df['adverb_count'] = (
    test_df['excerpt_lemmatized'].apply(
        get_parts_of_speech_count, 
        list_of_pos=['RB', 'RBR', 'RBS'])
)

test_df['preposition_count'] = (
    test_df['excerpt_lemmatized'].apply(
        get_parts_of_speech_count, 
        list_of_pos=['IN'])
)

test_df['foreign_word_count'] = (
    test_df['excerpt_lemmatized'].apply(
        get_parts_of_speech_count, 
        list_of_pos=['FW'])
)

test_df['word_freq_in_text'] = test_df['excerpt_lemmatized'].parallel_apply(
    lambda x: [word_freq.get(word, 0) for word in list(x) if word in available_words])

test_df['mean_word_freq_in_text'] = test_df['word_freq_in_text'].apply(lambda x: np.mean(x))
test_df['std_word_freq_in_text'] = test_df['word_freq_in_text'].apply(lambda x: np.std(x))
test_df['min_word_freq_in_text'] = test_df['word_freq_in_text'].apply(lambda x: np.min(x))
test_df['max_word_freq_in_text'] = test_df['word_freq_in_text'].apply(lambda x: np.max(x))
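
Because every feature has to be computed identically for both datasets, it can help to wrap the steps in one function and call it on each frame. A minimal sketch covering only three of the features; `add_basic_features` is a hypothetical helper, and it uses a plain whitespace split instead of `word_tokenize` to stay self-contained:

```python
import pandas as pd

def add_basic_features(df):
    """Return a copy of df with a few length-based features computed from 'excerpt_cleaned'."""
    out = df.copy()
    tokens = out['excerpt_cleaned'].str.split()
    out['paragraph_length'] = out['excerpt_cleaned'].str.len()
    out['word_count'] = tokens.str.len()
    out['avg_word_length'] = tokens.apply(
        lambda words: round(sum(len(w) for w in words) / len(words), 2) if words else 0.0)
    return out

demo_train = pd.DataFrame({'excerpt_cleaned': ['the cat sat', 'a dog']})
demo_test = pd.DataFrame({'excerpt_cleaned': ['birds fly south']})

# The same function is applied to both frames, so their features cannot drift apart.
demo_train = add_basic_features(demo_train)
demo_test = add_basic_features(demo_test)
print(demo_test[['paragraph_length', 'word_count', 'avg_word_length']])
```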

In [None]:
X_test = test_df[['paragraph_length', 'avg_sentence_length',
                       'avg_word_length', 'sentence_count', 'word_count', 'unique_word_count',
                       'noun_count', 'proper_noun_count', 'verb_count', 'adjective_count',
                       'cardinal_digit_count', 'adverb_count', 'preposition_count',
                       'foreign_word_count', 'mean_word_freq_in_text',
                       'std_word_freq_in_text', 'min_word_freq_in_text',
                       'max_word_freq_in_text']].copy()

X_test = X_test.astype(float)

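# Reuse the scaler fitted on the training features; refitting it on the small test set
# would put the test features on a different scale than the one the model was trained on
X_test_scaled = scaler.transform(X_test)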
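
A model trained on standardized features expects new data to be transformed with the same mean and variance. The sketch below, on synthetic numbers, shows how much the result changes when a scaler is refitted on a tiny test set; all names ending in `_demo` are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrices (illustrative only).
rng = np.random.default_rng(1)
X_train_demo = rng.normal(loc=100.0, scale=20.0, size=(500, 1))
X_test_demo = np.array([[90.0], [100.0], [110.0]])

# Fit on the training data once, then reuse the same statistics for the test data.
demo_scaler = StandardScaler().fit(X_train_demo)
reused = demo_scaler.transform(X_test_demo)

# Refitting on the tiny test set rescales it by its own mean and variance instead.
refit = StandardScaler().fit_transform(X_test_demo)

print(reused.ravel().round(2))
print(refit.ravel().round(2))
```

With the refitted scaler the three test rows are stretched to about ±1.22, although they lie only about half a training standard deviation from the mean.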

In [None]:
data = [test_df['id'], pd.Series(clf_final.predict(X_test_scaled))]
headers = ['id', 'target']
submission = pd.concat(data, axis=1, keys = headers)

print(submission)

submission.to_csv('submission.csv', index = False)