In this notebook, I try to solve the [CommonLit Readability Prize](https://www.kaggle.com/c/commonlitreadabilityprize/overview) competition using a [Random Forest](https://en.wikipedia.org/wiki/Random_forest).

I previously built a similar model using a [Decision Tree](https://en.wikipedia.org/wiki/Decision_tree), which got a score of 0.941. The notebook for that model is [here](https://www.kaggle.com/aniketsharma00411/commonlit-readability-decision-tree).

I will extract the following features from the excerpts to train the Random Forest model:
- Readability (Flesch Reading Ease score)
- Length (word count and character count)
- Sentiment (TextBlob polarity)

For my next approach, I will try extracting more features to gather more insights from the excerpts.

# Initialization

I am using the [readability Python package](https://pypi.org/project/readability/) to evaluate the readability of each excerpt and [textblob](https://pypi.org/project/textblob/) for sentiment analysis.
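
For intuition about the readability feature: the measure I take from this package later is the Flesch Reading Ease score, which combines average sentence length with average syllables per word. The sketch below is not the package's implementation. `count_syllables` and `flesch_reading_ease` are hypothetical helpers that use a naive vowel-group syllable heuristic and assume non-empty English text, so their numbers will differ from what the package returns.

```python
import re


def count_syllables(word):
    """Naive heuristic (an assumption): one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores mean easier text."""
    # Crude sentence and word splitting, good enough for illustration only
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


# Made-up sample texts, not taken from the competition data
simple = "The cat sat on the mat. It was a good day."
dense = "Institutional heterogeneity complicates comparative methodological evaluation."

print(flesch_reading_ease(simple) > flesch_reading_ease(dense))  # True
```

Short sentences with short words score high, while long, polysyllabic sentences score low, which is why this measure is a plausible predictor of the reading-difficulty target.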


In [None]:
! pip install -q /kaggle/input/readability/readability-0.3.1-py3-none-any.whl
from textblob import TextBlob
import readability

from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

import pandas as pd

In [None]:
train_data = pd.read_csv('/kaggle/input/commonlitreadabilityprize/train.csv')
test_data = pd.read_csv('/kaggle/input/commonlitreadabilityprize/test.csv')

In [None]:
train_data.info()
train_data.head()

In [None]:
test_data.info()
test_data.head()

In [None]:
pd.read_csv('/kaggle/input/commonlitreadabilityprize/sample_submission.csv')

# Functions

In [None]:
def readability_analysis(text):
    rd = readability.getmeasures(text, lang='en')
    return rd['readability grades']['FleschReadingEase']

In [None]:
def length_analysis_words(text):
    return len(text.split())

In [None]:
def length_analysis_chars(text):
    return len(text)

In [None]:
def sentiment_analysis(text):
    return TextBlob(text).sentiment.polarity

# Creating Features

In [None]:
# Build one feature column per analysis function, applied to each excerpt
X = pd.DataFrame(train_data['id'])
X['readability'] = train_data['excerpt'].apply(readability_analysis)
X['len_words'] = train_data['excerpt'].apply(length_analysis_words)
X['len_chars'] = train_data['excerpt'].apply(length_analysis_chars)
X['sentiment'] = train_data['excerpt'].apply(sentiment_analysis)

In [None]:
X.info()
X.head()

In [None]:
X = X[['readability', 'len_words', 'len_chars', 'sentiment']]

In [None]:
y = train_data['target']

In [None]:
y.describe()

# Training

In [None]:
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

I use grid search with repeated 5-fold cross-validation to find the best values of the `n_estimators` and `max_depth` hyperparameters.
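
To get a sense of how expensive this search is, the following sketch counts the model fits implied by the grid in the next cell. It only does the arithmetic with the standard library and assumes the same grid and cross-validation settings as that cell; `candidates`, `total_fits`, and `total_trees` are names I made up for illustration.

```python
from itertools import product

# Same grid and CV settings as the search cell (copied here as an assumption)
space = {'n_estimators': [100, 300, 1000, 3000, 10000],
         'max_depth': [3, 10, 30, 100, 300, 1000]}
n_splits, n_repeats = 5, 3

# Every combination of hyperparameter values is one candidate model
candidates = [dict(zip(space, values)) for values in product(*space.values())]

# RepeatedKFold yields n_splits * n_repeats train/validation splits per candidate
folds_per_candidate = n_splits * n_repeats
total_fits = len(candidates) * folds_per_candidate

# Number of individual trees grown across all cross-validation fits
total_trees = sum(c['n_estimators'] for c in candidates) * folds_per_candidate

print(len(candidates), folds_per_candidate, total_fits)  # 30 15 450
print(total_trees)  # 1296000
```

With the default `refit=True`, `GridSearchCV` adds one more fit of the best candidate on the full training split, so a smaller grid may be worth trying first if the run time is a concern.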

In [None]:
model = RandomForestRegressor(random_state=0)
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
space = {'n_estimators': [100, 300, 1000, 3000, 10000],
        'max_depth': [3, 10, 30, 100, 300, 1000]}
search = GridSearchCV(model, space, cv=cv, scoring='neg_root_mean_squared_error', n_jobs=-1)
result = search.fit(train_X, train_y)

In [None]:
print(result.best_score_)
print(result.best_params_)

In [None]:
# Rebuild the model with the best hyperparameters found by the grid search
model = RandomForestRegressor(random_state=0, **result.best_params_)

model.fit(train_X, train_y)

# Evaluating the result
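
The competition is scored with root mean squared error (RMSE), which is also what the grid search optimizes through `neg_root_mean_squared_error`. As a quick reminder of how RMSE relates to MSE, here is a minimal sketch on made-up numbers; `mse`, `rmse`, and the values are hypothetical and only for illustration.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE, expressed in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))


# Hypothetical targets and predictions, not taken from the competition data
y_true = [0.0, -1.0, -2.0, 1.0]
y_pred = [0.5, -1.5, -1.0, 1.0]

print(mse(y_true, y_pred))   # 0.375
print(rmse(y_true, y_pred))  # about 0.612
```

Because RMSE is in the same units as `target`, it is easier to compare against the leaderboard than raw MSE.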

In [None]:
train_preds = model.predict(train_X)
mean_squared_error(train_y, train_preds) ** 0.5  # RMSE, the competition metric

In [None]:
val_preds = model.predict(val_X)
mean_squared_error(val_y, val_preds) ** 0.5  # RMSE on the held-out validation split

# Creating features for test set and predicting results

In [None]:
# Same feature extraction as for the training set
X_test = pd.DataFrame(test_data['id'])
X_test['readability'] = test_data['excerpt'].apply(readability_analysis)
X_test['len_words'] = test_data['excerpt'].apply(length_analysis_words)
X_test['len_chars'] = test_data['excerpt'].apply(length_analysis_chars)
X_test['sentiment'] = test_data['excerpt'].apply(sentiment_analysis)

In [None]:
X_test.info()
X_test.head()

In [None]:
test_preds = model.predict(X_test[['readability', 'len_words', 'len_chars', 'sentiment']])

In [None]:
solution = pd.DataFrame(X_test['id'])
solution.loc[:, 'target'] = test_preds

In [None]:
solution.info()

In [None]:
solution.to_csv('submission.csv', index=False)