<div align='center'><font size="5" color='#353B47'>CommonLit readability prize</font></div>
<div align='center'><font size="4" color="#353B47">A first approach using XGBoost with hyperopt</font></div>
<br>
<hr>

**<font color="blue" size="4">Context</font>**

> Currently, most educational texts are matched to readers using traditional readability methods or commercially available formulas. However, each has its issues. Tools like Flesch-Kincaid Grade Level are based on weak proxies of text decoding (i.e., characters or syllables per word) and syntactic complexity (i.e., number of words per sentence). As a result, they lack construct and theoretical validity. At the same time, commercially available formulas, such as Lexile, can be cost-prohibitive, lack suitable validation studies, and suffer from transparency issues when the formula's features aren't publicly available.
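
To make the "weak proxies" concrete, here is a minimal sketch of the Flesch-Kincaid Grade Level formula. The coefficients are the standard published ones; the syllable counter is my own rough heuristic (it counts groups of vowels) and the two sample sentences are made up for illustration.

```python
import re

def count_syllables(word: str) -> int:
    # Assumption: approximate syllables by counting groups of consecutive vowels
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text: str) -> float:
    # Split into sentences on ., ! or ? and into words on letter runs
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    # Only two signals are used: words per sentence and syllables per word
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

simple = "The cat sat on the mat. The dog ran to the park."
dense = ("Notwithstanding considerable methodological heterogeneity, "
         "longitudinal investigations consistently demonstrate substantial improvements.")

print(flesch_kincaid_grade(simple))
print(flesch_kincaid_grade(dense))
```

The score depends only on sentence length and word length, so it has no notion of cohesion or meaning, which is the gap this competition targets.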

**<font color="blue" size="4">What is CommonLit?</font>**

> CommonLit, Inc., is a nonprofit education technology organization serving over 20 million teachers and students with free digital reading and writing lessons for grades 3-12. Together with Georgia State University, an R1 public research university in Atlanta, they are challenging Kagglers to improve readability rating methods.

**<font color="blue" size="4">What does this competition consist of?</font>**

> The purpose of this competition is to build algorithms to rate the complexity of reading passages for grade 3-12 classroom use. To accomplish this, you'll pair your machine learning skills with a dataset that includes readers from a wide variety of age groups and a large collection of texts taken from various domains. Winning models will be sure to incorporate text cohesion and semantics.

**<font color="blue" size="4">What if it works well?</font>**

> You'll aid administrators, teachers, and students. Literacy curriculum developers and teachers who choose passages will be able to quickly and accurately evaluate works for their classrooms. Plus, these formulas will become more accessible for all. Perhaps most importantly, students will benefit from feedback on the complexity and readability of their work, making it far easier to improve essential reading skills.

# <div id="summary">Summary</div>

**<font size="2"><a href="#chap1">1. Import libraries</a></font>**
**<br><font size="2"><a href="#chap2">2. EDA</a></font>**
**<br><font size="2"><a href="#chap3">3. Preprocessing</a></font>**
**<br><font size="2"><a href="#chap4">4. Training</a></font>**

# <div id="chap1">1. Import libraries</div>

In [None]:
import numpy as np
import pandas as pd
import os
from sklearn.preprocessing import LabelEncoder
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, KFold
import xgboost as xgb
from sklearn.metrics import mean_squared_error
import math
from hyperopt import tpe, hp, Trials
from hyperopt.fmin import fmin
from functools import partial
import matplotlib.pyplot as plt
import plotly.graph_objects as go

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Load pretrained spacy model for lemmatization
sp = spacy.load('en_core_web_sm')

**<font size="2"><a href="#summary">Back to summary</a></font>**

----

# <div id="chap2">2. EDA</div>

In [None]:
# Import train and test
train = pd.read_csv("../input/commonlitreadabilityprize/train.csv")
test = pd.read_csv("../input/commonlitreadabilityprize/test.csv")

In [None]:
train.info()

In [None]:
# Check stats on target column
print(train.target.mean())
print(train.target.std())

train.hist(['target'], bins=20)

In [None]:
fig = go.Figure(data=[go.Histogram(x=train.target.values,
                                   marker_line_width=1, 
                                   marker_line_color="midnightblue", 
                                   xbins_size = 0.2)])

fig.update_layout(title_text='Distribution of reading ease score')
fig.show()

In [None]:
fig = go.Figure(data = [go.Box(y=train.target.values, name = "Reading ease score")])

fig.update_layout(title_text='Boxplot of reading ease score')
fig.show()

In [None]:
# Check missing values
train.isnull().sum()

In [None]:
train.url_legal.value_counts()

In [None]:
train.license.value_counts()

In [None]:
train.excerpt.head()

In [None]:
# url_legal_seen = train.url_legal.values
# for idx in range(len(test)):
#     if test.loc[idx, "url_legal"] not in url_legal_seen:
#         test.loc[idx, "url_legal"] = "unknown"

# le = LabelEncoder()
# train["url_legal_encoded"] = le.fit_transform(train.url_legal.values)
# test["url_legal_encoded"] = le.transform(test.url_legal.values)

In [None]:
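# Encode the license column; missing licenses are mapped to an explicit
# "unknown" category so the encoder never has to compare NaN with strings
le2 = LabelEncoder()
train["license_encoded"] = le2.fit_transform(train.license.fillna("unknown").values)
test["license_encoded"] = le2.transform(test.license.fillna("unknown").values)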

In [None]:
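# Restrict to numeric columns; recent pandas versions raise an error on text columns
train.corr(numeric_only=True)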

**<font size="2"><a href="#summary">Back to summary</a></font>**

----

# <div id="chap3">3. Preprocessing</div>

In [None]:
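# Use a new name so the imported nltk `stopwords` module is not overwritten
# (re-running the cell would otherwise fail); a set makes lookups fast
stop_words = set(stopwords.words('english'))

def preprocessing_excerpt(text):
    text = text.lower()
    text = word_tokenize(text)
    text = [x for x in text if x not in stop_words]
    text = " ".join(text)
    # Lemmatize with spaCy; str(sp(text)) would return the text unchanged
    return " ".join(token.lemma_ for token in sp(text))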

In [None]:
train['excerpt'] = train['excerpt'].apply(preprocessing_excerpt)
test['excerpt'] = test['excerpt'].apply(preprocessing_excerpt)

In [None]:
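# Fit TF-IDF on the training excerpts only, then reuse the fitted vocabulary on test
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(train.excerpt.values)
# get_feature_names() was removed in recent scikit-learn; use get_feature_names_out() (>= 1.0)
feature_names = vectorizer.get_feature_names_out()
dense = vectors.todense()
denselist = dense.tolist()

test_vectors = vectorizer.transform(test.excerpt.values)
test_feature_names = feature_names  # same vocabulary as train
test_dense = test_vectors.todense()
test_denselist = test_dense.tolist()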

In [None]:
X = pd.DataFrame(denselist)
y = train.target.values

X_test = test_denselist
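
To show what each column of `X` holds, here is a small pure-Python sketch of the weighting that, as far as I know, `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The `tfidf` helper and the toy corpus are hypothetical and only for illustration.

```python
import math
import re
from collections import Counter

def tfidf(docs):
    # Lowercase and keep tokens of two or more word characters
    tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        # L2-normalise so every document vector has unit length
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

docs = ["the cat sat", "the dog sat", "the bird flew away"]
vocab, rows = tfidf(docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term:>5s} {weight:.3f}")
```

Words that appear in every document (here "the") get the lowest weight, while words unique to one document get the highest.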

**<font size="2"><a href="#summary">Back to summary</a></font>**

----

# <div id="chap4">4. Training</div>

In [None]:
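# Objective function for hyperopt: mean RMSE over 5 folds
def optimize(params, x, y):
    """Return the mean 5-fold RMSE of an XGBoost regressor built with `params`."""
    regressor = xgb.XGBRegressor(**params,
                                 n_estimators = 100,
                                 tree_method='gpu_hist', gpu_id=0)

    # No shuffling: folds are contiguous blocks of rows
    kf = KFold(n_splits = 5)

    rmses = []

    for train_idx, test_idx in kf.split(X=x):

        # KFold yields positional indices, so index by position
        xtrain = x.iloc[train_idx, :]
        ytrain = y[train_idx]
        xtest = x.iloc[test_idx, :]
        ytest = y[test_idx]

        # fit() retrains from scratch on every fold
        regressor.fit(xtrain, ytrain)

        preds = regressor.predict(xtest)

        mse = mean_squared_error(ytest, preds)
        fold_rmse = math.sqrt(mse)
        rmses.append(fold_rmse)

    return np.mean(rmses)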
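
The score returned by `optimize` is easier to judge against a reference point. Below is a minimal sketch, in plain Python, of the same fold-averaged RMSE for a trivial baseline that always predicts the mean of the training fold. The helper names and the toy targets are hypothetical; the fold layout is meant to mimic `KFold` without shuffling (contiguous blocks).

```python
import math

def kfold_indices(n_samples, n_splits):
    # Contiguous folds; the first (n_samples % n_splits) folds get one extra row
    sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
             for i in range(n_splits)]
    start = 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

def rmse(y_true, y_pred):
    # Root mean squared error between two equal-length sequences
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def cv_rmse_mean_baseline(y, n_splits=5):
    scores = []
    for train_idx, val_idx in kfold_indices(len(y), n_splits):
        # Baseline "model": the mean target of the training fold
        mean_pred = sum(y[i] for i in train_idx) / len(train_idx)
        scores.append(rmse([y[i] for i in val_idx], [mean_pred] * len(val_idx)))
    return sum(scores) / len(scores)

# Made-up targets, only to exercise the functions
toy_targets = [-1.0, 0.5, -2.0, 0.0, -0.5, 1.0, -1.5, 0.2, -0.8, -0.3]
print(cv_rmse_mean_baseline(toy_targets, n_splits=5))
```

Running the same function on `list(y)` would give a floor that any tuned model should beat.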

In [None]:
# seed = 42

# param_space ={'eta': hp.choice('eta', np.arange(0.05, 0.31, 0.05)),
#               'max_depth': hp.choice('max_depth', np.arange(5, 16, 1, dtype=int)),
#               'colsample_bytree': hp.choice('colsample_bytree', np.arange(0.3, 0.8, 0.1)),
#               'min_child_weight': hp.choice('min_child_weight', np.arange(1, 8, 1, dtype=int)),
#               'subsample': hp.uniform('subsample', 0.8, 1)
#               }

# opt_f = partial(optimize,
#                 x = X,
#                 y = y)
    
# trials = Trials()

    
# hopt = fmin(fn = opt_f,
#             space = param_space,
#             algo = tpe.suggest,
#             max_evals = 10,
#             trials = trials,
#             return_argmin=False,
#             rstate = np.random.RandomState(seed))

# print(hopt)
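
For readers without hyperopt installed, the idea of the search can be sketched with plain random search. Note that this is a swapped-in, simpler technique and not the TPE algorithm used above: it samples the space blindly instead of learning from previous trials. The grid roughly mirrors `param_space`, and `toy_objective` is a hypothetical stand-in for the cross-validated RMSE returned by `optimize`.

```python
import random

# Discrete choices roughly mirroring param_space (subsample is sampled uniformly below)
grid = {
    "eta": [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    "max_depth": list(range(5, 16)),
    "colsample_bytree": [0.3, 0.4, 0.5, 0.6, 0.7],
    "min_child_weight": list(range(1, 8)),
}

def sample_params(rng):
    params = {name: rng.choice(values) for name, values in grid.items()}
    params["subsample"] = rng.uniform(0.8, 1.0)
    return params

def random_search(objective, n_evals, seed=42):
    # Keep the parameter set with the lowest objective value
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_evals):
        params = sample_params(rng)
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_objective(params):
    # Hypothetical loss surface, NOT a real model score
    return abs(params["eta"] - 0.1) + abs(params["max_depth"] - 8) / 10

best_params, best_score = random_search(toy_objective, n_evals=10)
print(best_params, best_score)
```

In the notebook itself, `toy_objective` would be replaced by `partial(optimize, x=X, y=y)`, exactly as in the hyperopt cell.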

In [None]:
# f, ax = plt.subplots(1)
# xs = [t['tid'] for t in trials.trials]
# ys = [t['misc']['vals']['eta'] for t in trials.trials]
# ax.set_xlim(xs[0]-10, xs[-1]+10)
# ax.scatter(xs, ys, s=20, linewidth=0.01, alpha=0.75)
# ax.set_title('$x$ $vs$ $t$ ', fontsize=18)
# ax.set_xlabel('$t$', fontsize=16)
# ax.set_ylabel('$x$', fontsize=16)

In [None]:
hopt = {'colsample_bytree': 0.4, 'eta': 0.2, 'max_depth': 12, 'min_child_weight': 4, 'subsample': 0.8820207917706627}

In [None]:
eta = hopt['eta']
md = hopt['max_depth']
cbt = hopt['colsample_bytree']
mcw = hopt['min_child_weight']
ss = hopt['subsample']

regressor = xgb.XGBRegressor(eta = eta,
                             max_depth = md,
                             colsample_bytree = cbt,
                             min_child_weight = mcw,
                             subsample = ss,
                             tree_method='gpu_hist', gpu_id=0)

regressor.fit(pd.DataFrame(X), y)

In [None]:
y_pred = regressor.predict(pd.DataFrame(X_test))
test_ids = test['id'].values

submission = pd.DataFrame({
    'id': test_ids,
    'target': y_pred
})

submission.to_csv('submission.csv', index=False)

**<font size="2"><a href="#summary">Back to summary</a></font>**

----

# References

* https://medium.com/district-data-labs/parameter-tuning-with-hyperopt-faa86acdfdce

* Approaching (Almost) Any Machine Learning Problem - Abhishek Thakur

<hr>
<br>
<div align='justify'><font color="#353B47" size="4">Thank you for taking the time to read this notebook. I hope it answered your questions or satisfied your curiosity, and that it was easy to follow. <u>Any constructive comments are welcome</u>. They help me improve and motivate me to share better content. Above all, I am passionate about advancing my own knowledge as well as that of others. If you liked it, feel free to <u>upvote and share my work.</u> </font></div>
<br>
<div align='center'><font color="#353B47" size="3">Thank you and may passion guide you.</font></div>