In this notebook, I try to solve the [CommonLit Readability Prize](https://www.kaggle.com/c/commonlitreadabilityprize/overview) competition using [Ridge Regression](https://en.wikipedia.org/wiki/Ridge_regression).

I have created similar models using [Decision Tree](https://en.wikipedia.org/wiki/Decision_tree), [Support Vector Machine](https://en.wikipedia.org/wiki/Support-vector_machine) and [Random Forest](https://en.wikipedia.org/wiki/Random_forest), which scored 0.941, 0.820 and 0.771 respectively.

The notebooks for these models are:
 - [Decision Tree with score 0.941](https://www.kaggle.com/aniketsharma00411/commonlit-readability-decision-tree)
 - [Support Vector Machine with score 0.820](https://www.kaggle.com/aniketsharma00411/commonlit-readability-svr)
 - [Random Forest with score 0.771](https://www.kaggle.com/aniketsharma00411/commonlit-readability-random-forest)
 
I have also created a notebook containing insights gathered from the dataset, and I will use those insights in this notebook as well. [Here](https://www.kaggle.com/aniketsharma00411/commonlit-readability-data-observations) is the link to that notebook.

This notebook will be similar to the [Support Vector Machine](https://www.kaggle.com/aniketsharma00411/commonlit-readability-svr) one.

# Initialization

I am using the [readability](https://pypi.org/project/readability/) and [syntok](https://pypi.org/project/syntok/) packages to extract features from the excerpts.

In [None]:
! pip install -q /kaggle/input/readability/readability-0.3.1-py3-none-any.whl
! pip install -q /kaggle/input/syntok/syntok-1.3.1-py3-none-any.whl
import readability
import syntok.segmenter as segmenter

from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
train_data = pd.read_csv('/kaggle/input/commonlitreadabilityprize/train.csv')
test_data = pd.read_csv('/kaggle/input/commonlitreadabilityprize/test.csv')

In [None]:
train_data.info()
train_data.head()

In [None]:
test_data.info()
test_data.head()

In [None]:
pd.read_csv('/kaggle/input/commonlitreadabilityprize/sample_submission.csv')

# Functions

In [None]:
def tokenize(text):
    """Tokenizing and creating excerpts in the format suggested in the README of readability project."""
    return '\n\n'.join(
        '\n'.join(
            ' '.join(token.value for token in sentence)
            for sentence in paragraph)
        for paragraph in segmenter.analyze(text))
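
For a concrete picture of the layout `tokenize` produces, the sketch below applies the same joins to a hand-tokenized input. The `format_tokens` name and the `paragraphs` list are hypothetical stand-ins for the output of `segmenter.analyze`, so the snippet runs without syntok.

```python
def format_tokens(paragraphs):
    """Join tokens with spaces, sentences with newlines, paragraphs with blank lines."""
    return '\n\n'.join(
        '\n'.join(' '.join(sentence) for sentence in paragraph)
        for paragraph in paragraphs)


# Hypothetical pre-tokenized text: two paragraphs, three sentences in total.
paragraphs = [
    [['The', 'sun', 'rose', '.'], ['Birds', 'sang', '.']],
    [['It', 'was', 'a', 'new', 'day', '.']],
]

formatted = format_tokens(paragraphs)
print(formatted)
```

Each sentence ends up on its own line and a blank line separates the paragraphs.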

# Creating Features

In [None]:
train_data['readability_object'] = train_data['excerpt'].apply(lambda text: readability.getmeasures(tokenize(text), lang='en'))

In [None]:
train_data.info()
train_data.head()

I will use the **SMOGIndex** readability grade, as it was found to be the best in [this notebook](https://www.kaggle.com/aniketsharma00411/commonlit-readability-data-observations). I am also leaving out (not creating) some features based on insights from the same notebook.

In [None]:
X = pd.DataFrame(train_data['id'])
X.loc[:,'readability'] = train_data.apply(lambda row: row.readability_object['readability grades']['SMOGIndex'], axis=1)
X.loc[:,'syll_per_word'] = train_data.apply(lambda row: row.readability_object['sentence info']['syll_per_word'], axis=1)
X.loc[:,'words_per_sentence'] = train_data.apply(lambda row: row.readability_object['sentence info']['words_per_sentence'], axis=1)
X.loc[:,'type_token_ratio'] = train_data.apply(lambda row: row.readability_object['sentence info']['type_token_ratio'], axis=1)
X.loc[:,'syllables'] = train_data.apply(lambda row: row.readability_object['sentence info']['syllables'], axis=1)
X.loc[:,'words'] = train_data.apply(lambda row: row.readability_object['sentence info']['words'], axis=1)
X.loc[:,'wordtypes'] = train_data.apply(lambda row: row.readability_object['sentence info']['wordtypes'], axis=1)
X.loc[:,'sentences'] = train_data.apply(lambda row: row.readability_object['sentence info']['sentences'], axis=1)
X.loc[:,'complex_words_dc'] = train_data.apply(lambda row: row.readability_object['sentence info']['complex_words_dc'], axis=1)
X.loc[:,'tobeverb'] = train_data.apply(lambda row: row.readability_object['word usage']['tobeverb'], axis=1)
X.loc[:,'auxverb'] = train_data.apply(lambda row: row.readability_object['word usage']['auxverb'], axis=1)
X.loc[:,'conjunction'] = train_data.apply(lambda row: row.readability_object['word usage']['conjunction'], axis=1)
X.loc[:,'pronoun'] = train_data.apply(lambda row: row.readability_object['word usage']['pronoun'], axis=1)
X.loc[:,'preposition'] = train_data.apply(lambda row: row.readability_object['word usage']['preposition'], axis=1)
X.loc[:,'nominalization'] = train_data.apply(lambda row: row.readability_object['word usage']['nominalization'], axis=1)
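
The column-by-column extraction above follows one pattern: look up a group, then a measure inside it. A minimal sketch of that pattern as a reusable helper is shown below. `FEATURE_KEYS`, `extract_features` and the sample values are hypothetical, and only three of the features are listed to keep it short.

```python
# Hypothetical mapping: output column -> (group, measure) in the nested result.
FEATURE_KEYS = {
    'readability': ('readability grades', 'SMOGIndex'),
    'syll_per_word': ('sentence info', 'syll_per_word'),
    'tobeverb': ('word usage', 'tobeverb'),
}


def extract_features(measures, feature_keys=FEATURE_KEYS):
    """Flatten one nested measures dict into a single-level feature dict."""
    return {name: measures[group][key]
            for name, (group, key) in feature_keys.items()}


# Made-up values shaped like the nested dicts indexed above.
sample_measures = {
    'readability grades': {'SMOGIndex': 8.5},
    'sentence info': {'syll_per_word': 1.3},
    'word usage': {'tobeverb': 12},
}

features = extract_features(sample_measures)
print(features)
```

With a helper like this, something along the lines of `pd.DataFrame(train_data['readability_object'].map(extract_features).tolist())` could build the feature table, and the same call could be reused for the test set.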

In [None]:
X.info()
X.head()

In [None]:
# Drop the non-numeric 'id' column first; recent pandas versions raise an error if corr() sees string columns.
tar_corr = X.drop(columns='id').join(train_data['target']).corr()['target']
tar_corr

We will remove every feature whose correlation with the target lies between -0.1 and 0.1, since such features have almost no linear relationship with it.

In [None]:
to_remove = ['id']
for val in tar_corr.index:
    if -0.1 < tar_corr[val] < 0.1:
        to_remove.append(val)

to_remove
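
To make the filtering rule concrete, here is a minimal sketch of the same idea in plain Python: compute the Pearson correlation (the default that `corr()` uses) of each feature with the target and collect the features whose absolute correlation falls below 0.1. The `pearson` helper and the toy data are illustrative assumptions, not values from the competition dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up toy data: one feature tracks the target, the other does not.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    'informative': [2.0, 4.1, 5.9, 8.2, 9.8],
    'uninformative': [1.0, -1.0, 0.0, -1.0, 1.0],
}

threshold = 0.1
dropped = [name for name, values in features.items()
           if abs(pearson(values, target)) < threshold]
print(dropped)
```

Note that this only measures linear association, so a feature with a strong non-linear relationship to the target could still be dropped by such a rule.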

In [None]:
X = X.drop(to_remove, axis=1)

In [None]:
X.info()
X.head()

In [None]:
y = train_data['target']

In [None]:
y.describe()

# Training

In [None]:
# Hold out a validation set; by default train_test_split keeps 25% of the rows for it.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

Using Grid Search to find the optimal values of hyperparameters.

In [None]:
model = Ridge()
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
# Only `alpha` is tuned: the `normalize` option of Ridge was removed in scikit-learn 1.2.
space = {'alpha': [1e-3, 1e-2, 1e-1, 1, 10]}
search = GridSearchCV(model, space, cv=cv, scoring='neg_root_mean_squared_error', n_jobs=-1)
result = search.fit(train_X, train_y)

In [None]:
# best_score_ is the negated RMSE, so values closer to 0 are better.
print(result.best_score_)
print(result.best_params_)

In [None]:
# Rebuild the model with the best hyperparameters found by the search.
model = Ridge(**result.best_params_)

model.fit(train_X, train_y)
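
For intuition about what the `alpha` values in the grid control, here is a minimal sketch of ridge regression in its simplest form: one feature and no intercept, where minimizing `sum((y - w*x)**2) + alpha * w**2` gives `w = sum(x*y) / (sum(x*x) + alpha)`. The `ridge_slope` function and the toy data are illustrative assumptions; scikit-learn's `Ridge` handles many features and an intercept.

```python
def ridge_slope(xs, ys, alpha):
    """Closed-form ridge weight for a single feature without an intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


# Toy data lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

slopes = {alpha: ridge_slope(xs, ys, alpha)
          for alpha in [0, 1e-3, 1e-2, 1e-1, 1, 10]}
for alpha, w in slopes.items():
    print(f'alpha={alpha:<6} w={w:.4f}')
```

With `alpha=0` the ordinary least-squares slope is recovered, and larger values of `alpha` shrink the weight toward zero.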

# Evaluating the result

In [None]:
train_preds = model.predict(train_X)
# Take the square root so the value is an RMSE, comparable with the grid-search score.
np.sqrt(mean_squared_error(train_y, train_preds))

In [None]:
val_preds = model.predict(val_X)
# Take the square root so the value is an RMSE, comparable with the grid-search score.
np.sqrt(mean_squared_error(val_y, val_preds))
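
The grid search above ranks candidates by root mean squared error (RMSE), which, to my understanding, is also the metric the competition uses. It is the square root of the value `mean_squared_error` returns by default. A minimal sketch for reference follows; the `rmse` helper and the numbers are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the mean squared error."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Made-up targets and predictions; every prediction is off by 0.5.
y_true = [0.0, -1.0, -2.0, 1.0]
y_pred = [0.5, -1.5, -1.5, 0.5]

score = rmse(y_true, y_pred)
print(score)
```

Because RMSE is in the same units as the target, it is easier to compare with the spread of `target` shown by `y.describe()`.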

# Creating features for test set and predicting results

In [None]:
test_data['readability_object'] = test_data['excerpt'].apply(lambda text: readability.getmeasures(tokenize(text), lang='en'))

In [None]:
test_data.info()
test_data.head()

In [None]:
X_test = pd.DataFrame(test_data['id'])
X_test.loc[:,'readability'] = test_data.apply(lambda row: row.readability_object['readability grades']['SMOGIndex'], axis=1)
X_test.loc[:,'syll_per_word'] = test_data.apply(lambda row: row.readability_object['sentence info']['syll_per_word'], axis=1)
X_test.loc[:,'words_per_sentence'] = test_data.apply(lambda row: row.readability_object['sentence info']['words_per_sentence'], axis=1)
X_test.loc[:,'type_token_ratio'] = test_data.apply(lambda row: row.readability_object['sentence info']['type_token_ratio'], axis=1)
X_test.loc[:,'syllables'] = test_data.apply(lambda row: row.readability_object['sentence info']['syllables'], axis=1)
X_test.loc[:,'words'] = test_data.apply(lambda row: row.readability_object['sentence info']['words'], axis=1)
X_test.loc[:,'wordtypes'] = test_data.apply(lambda row: row.readability_object['sentence info']['wordtypes'], axis=1)
X_test.loc[:,'sentences'] = test_data.apply(lambda row: row.readability_object['sentence info']['sentences'], axis=1)
X_test.loc[:,'complex_words_dc'] = test_data.apply(lambda row: row.readability_object['sentence info']['complex_words_dc'], axis=1)
X_test.loc[:,'tobeverb'] = test_data.apply(lambda row: row.readability_object['word usage']['tobeverb'], axis=1)
X_test.loc[:,'auxverb'] = test_data.apply(lambda row: row.readability_object['word usage']['auxverb'], axis=1)
X_test.loc[:,'conjunction'] = test_data.apply(lambda row: row.readability_object['word usage']['conjunction'], axis=1)
X_test.loc[:,'pronoun'] = test_data.apply(lambda row: row.readability_object['word usage']['pronoun'], axis=1)
X_test.loc[:,'preposition'] = test_data.apply(lambda row: row.readability_object['word usage']['preposition'], axis=1)
X_test.loc[:,'nominalization'] = test_data.apply(lambda row: row.readability_object['word usage']['nominalization'], axis=1)

In [None]:
X_test = X_test.drop(to_remove, axis=1)

In [None]:
test_preds = model.predict(X_test)

In [None]:
solution = pd.DataFrame(test_data['id'])
solution.loc[:, 'target'] = test_preds

In [None]:
solution.info()

In [None]:
solution.to_csv('submission.csv', index=False)