# About This Notebook
This is a first run through the competition data to understand the dataset and frame the problem at hand with some quick EDA and classical ML methods.  
**If you found this notebook useful and use parts of it in your work, please don't forget to show your appreciation by upvoting this kernel. That keeps me motivated and inspires me to write and share these public kernels.** 😊

# Problem Statement
* Currently, most educational texts are matched to readers using traditional readability methods or commercially available formulas.
* Tools like Flesch-Kincaid Grade Level are based on weak proxies of text decoding (i.e., characters or syllables per word) and syntactic complexity (i.e., number of words per sentence).
* They lack construct and theoretical validity.
* Commercially available formulas, such as Lexile, can be cost-prohibitive, lack suitable validation studies, and suffer from transparency issues when the formula's features aren't publicly available.

# Why this competition?
As is evident from the problem statement, this competition presents an interesting angle on the use of NLP and has the potential to make a real-life contribution.  

If successful, you'll aid administrators, teachers, and students. Literacy curriculum developers and teachers who choose passages will be able to quickly and accurately evaluate works for their classrooms. Plus, these models will become more accessible for all. Perhaps most importantly, students will benefit from feedback on the complexity and readability of their work, making it far easier to improve essential reading skills.

# Expected Outcome
Loosely speaking, ***Given an excerpt of text, we need to rate the complexity of reading passages for grade 3-12 classroom use.***

# Data Description
The training file contains the following features:-
* `id` - unique ID for excerpt
* `url_legal` - URL of source - this is blank in the test set.
* `license` - license of source material - this is blank in the test set.
* `excerpt` - text to predict reading ease of
* `target` - reading ease
* `standard_error` - measure of spread of scores among multiple raters for each excerpt. Not included for test data.

# Grading Metric
Submissions are scored on the root mean squared error. RMSE is defined as:  
$$ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} $$

where $ \hat{y} $ is the predicted value, $ y $ is the original value, and $ n $ is the number of rows in the test data.
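
To make the metric concrete, here is a minimal sketch of the formula above in plain Python. The helper name `rmse` and the sample numbers are made up for illustration; they are not competition data.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up targets and predictions, for illustration only.
y_true = [0.0, -1.0, -2.0, 1.0]
y_pred = [0.5, -1.0, -1.0, 1.5]
print(rmse(y_true, y_pred))
```

Because the errors are squared before averaging, a single large miss (the third row here) dominates the score.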

# Problem Category:-
From the data and objective it is evident that this is a **Regression Problem** in the NLP Domain.

So without further ado, let's now start with some basic imports to take us through this:-

In [None]:
import sys
sys.path.append('../input/autokeras-april-2021')

In [None]:
# Aesthetics
import warnings
import sklearn.exceptions
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings("ignore", category=sklearn.exceptions.UndefinedMetricWarning)

# General
from scipy.stats import pearsonr, spearmanr, kendalltau
from tqdm.autonotebook import tqdm
from collections import Counter
import pandas as pd
import numpy as np
import os
import random
import string
import re
pd.set_option('display.max_columns', None)

# Visualizations
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style="whitegrid")
from PIL import Image
from wordcloud import WordCloud, STOPWORDS
import plotly.graph_objs as go
import plotly.figure_factory as ff
import plotly.express as px
from plotly.subplots import make_subplots
from plotly.offline import iplot

# NLP
import spacy
nlp = spacy.load('en_core_web_lg', disable=['parser', 'ner'])

# Machine Learning
# Utils
from sklearn.model_selection import StratifiedKFold, cross_val_score, RepeatedKFold
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
#Models
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from lightgbm import LGBMRegressor
import autokeras as ak
#Metrics
from sklearn.metrics import mean_squared_error

# Random Seed Initialize
RANDOM_SEED = 42

def seed_everything(seed=RANDOM_SEED):
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    random.seed(seed)
    
seed_everything()

In [None]:
data_dir = '../input/commonlitreadabilityprize'

train_file_path = os.path.join(data_dir, 'train.csv')
test_file_path = os.path.join(data_dir, 'test.csv')
sample_sub_file_path = os.path.join(data_dir, 'sample_submission.csv')

print(f'Train file: {train_file_path}')
print(f'Test file: {test_file_path}')
print(f'Sample submission file: {sample_sub_file_path}')

# EDA

In [None]:
train_df = pd.read_csv(train_file_path)
test_df = pd.read_csv(test_file_path)
sub_df = pd.read_csv(sample_sub_file_path)

In [None]:
train_df.sample(10)

In [None]:
train_df.describe().T

In [None]:
test_df.head()

In [None]:
sub_df.head()

## Word Count Distribution

In [None]:
word_count = [len(x.split()) for x in train_df['excerpt'].tolist()]
barplot_dim = (12, 6)
fig, ax = plt.subplots(figsize=barplot_dim)  # plt.subplots returns (figure, axes)
sns.distplot(word_count, kde=False, ax=ax)
ax.set_ylabel('No. of Observations', size=15)
ax.set_xlabel('No. of Words', size=15)
ax.set_title('Word Count Distribution', size=20);

Let's see how the word count varies across each range of readability (target):-

In [None]:
num_bins = 1 + (3.322*np.log10(train_df.shape[0])) # Sturges' Rule
print(f'Number of bins: {num_bins}')
train_df['target_binned'] = pd.cut(train_df['target'], bins=int(num_bins), labels=False)
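
As a quick sanity check on the binning above, the sketch below evaluates Sturges' rule for a hypothetical sample size. The function name `sturges_bins` and the value `n = 1000` are assumptions for illustration, not the size of the training set. The constant 3.322 in the cell above is approximately `1 / log10(2)`, so the two forms agree closely.

```python
import math

def sturges_bins(n_samples):
    """Number of histogram bins suggested by Sturges' rule: 1 + log2(n)."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    return 1 + math.log2(n_samples)

n = 1000  # hypothetical sample size, for illustration only
k = sturges_bins(n)
print(k, int(k))
```

Note that `int()` truncates rather than rounds, so the number of bins passed to `pd.cut` is the rule's value rounded down.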

In [None]:
barplot_dim = (20, 30)
plt.figure(figsize=barplot_dim)
for i in range(int(num_bins)):
    temp_df = train_df[train_df['target_binned'] == i]
    word_count = [len(x.split()) for x in temp_df['excerpt'].tolist()]
    plt.subplot(4, 3, i+1)
    ax = sns.distplot(word_count, kde=False);
    ax.set_ylabel('No. of Observations', size=15)
    ax.set_xlabel('No. of Words', size=15)
    ax.set_title(f'Word Count Distribution (Bin: {i})', size=20);
    plt.xlim([140, 220])
plt.show();

In [None]:
train_df['excerpt_word_count'] = train_df['excerpt'].apply(lambda x: len(x.split()))

pearson_corr, _ = pearsonr(train_df['excerpt_word_count'], train_df['target'])
spearman_corr, _ = spearmanr(train_df['excerpt_word_count'], train_df['target'])
tau_corr, _ = kendalltau(train_df['excerpt_word_count'], train_df['target'])

print('Pearsons correlation: %.3f' % pearson_corr)
print('Spearmans correlation: %.3f' % spearman_corr)
print('Kendall Tau correlation: %.3f' % tau_corr)
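
The cell above reports three coefficients because they measure different things: Pearson looks for a linear relationship, while Spearman and Kendall only look at rank order. The pure-Python sketch below illustrates the difference on made-up data. The helpers `pearson`, `ranks` and `spearman` are hypothetical names, and the rank helper assumes there are no ties to keep things short.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def ranks(values):
    """1-based ranks of the values; assumes no ties (a simplification)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up data: y grows monotonically with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print(f'Pearson:  {pearson(x, y):.3f}')
print(f'Spearman: {spearman(x, y):.3f}')
```

If all three coefficients are close to zero on the real data, neither a linear nor a monotonic relationship between word count and target is visible.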

**As we can infer from this, word count is not strongly related to the readability of the excerpt.**  
Let's look at excerpts with high scores and low scores in a word cloud and see if something jumps out...

In [None]:
temp_df = train_df[(train_df['target_binned'] >= 0) & (train_df['target_binned'] < 3)]

text = ' '.join(temp_df['excerpt'])
wordcloud = WordCloud(background_color='white', stopwords=STOPWORDS, width=2560, height=1440).generate(text)

barplot_dim = (15, 15)
fig, ax = plt.subplots(figsize=barplot_dim, facecolor='w')
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

In [None]:
temp_df = train_df[train_df['target_binned'] >= 9]

text = ' '.join(temp_df['excerpt'])
wordcloud = WordCloud(background_color='white', stopwords=STOPWORDS, width=2560, height=1440).generate(text)

barplot_dim = (15, 15)
fig, ax = plt.subplots(figsize=barplot_dim, facecolor='w')
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

So nothing obvious stands out when we look at the words in isolation either.  
Thus, a **sequential method is likely to work better on this dataset than any bag-of-words method**.
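
To illustrate what a bag-of-words representation throws away, the sketch below compares two made-up sentences that contain the same words in a different order. The helper `bag_of_words` is a hypothetical name used only for this example.

```python
from collections import Counter

def bag_of_words(text):
    """Count each lower-cased, whitespace-separated token, ignoring order."""
    return Counter(text.lower().split())

# Two made-up sentences with identical words but different meaning.
a = "the dog bit the man"
b = "the man bit the dog"

same_bag = bag_of_words(a) == bag_of_words(b)
same_sequence = a.split() == b.split()
print(f'Same bag of words: {same_bag}')
print(f'Same word sequence: {same_sequence}')
```

A model that only sees the counts cannot tell these two sentences apart, whereas a sequence model receives the tokens in order.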

# Text Cleaning

In [None]:
def text_cleaning(text):
    '''
    Strips punctuation, converts the text to lower case, replaces any
    remaining non-letter characters with spaces, collapses repeated
    spaces, and removes emojis.
    text - Sentence that needs to be cleaned
    '''
    text = ''.join([k for k in text if k not in string.punctuation])
    text = str(text).lower()
    text = re.sub('[^a-zA-Z]', ' ', text)
    text = re.sub(' +', ' ', text)
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)
    return text
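
To see what the cleaning steps do to a concrete string, the sketch below mirrors the first four steps of the function above under a hypothetical name, `basic_clean`, and runs it on a made-up sentence. The emoji step is left out here because the non-letter substitution already turns any emoji into a space before that step is reached.

```python
import re
import string

def basic_clean(text):
    """Mirror of the punctuation, lower-casing and whitespace steps above."""
    # Drop ASCII punctuation outright (so "It's" becomes "Its").
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    text = text.lower()
    # Replace anything that is not a letter (digits, emojis, ...) with a space.
    text = re.sub('[^a-zA-Z]', ' ', text)
    # Collapse runs of spaces into a single space.
    text = re.sub(' +', ' ', text)
    return text

sample = "Hello, World!! It's 2021."  # made-up example text
print(repr(basic_clean(sample)))
```

The result keeps a trailing space where the digits used to be, and contractions lose their apostrophe, which is worth keeping in mind when comparing cleaned tokens against a vocabulary.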

In [None]:
tqdm.pandas()
train_df['excerpt'] = train_df['excerpt'].progress_apply(text_cleaning)

In [None]:
test_df['excerpt'] = test_df['excerpt'].progress_apply(text_cleaning)

# Text Preparation

In [None]:
def prepare_text(text, nlp=nlp):
    '''
    Returns the text after stop-word removal and lemmatization.
    text - Sentence to be processed
    nlp - Spacy NLP model
    '''
    doc = nlp(text)
    lemma_list = [token.lemma_ for token in doc if not token.is_stop]
    lemmatized_sentence = ' '.join(lemma_list)
        
    return lemmatized_sentence

In [None]:
train_df['excerpt'] = train_df['excerpt'].progress_apply(prepare_text)

In [None]:
test_df['excerpt'] = test_df['excerpt'].progress_apply(prepare_text)

# KFolds

In [None]:
# From https://github.com/abhishekkrthakur/approachingalmost
NUM_SPLITS = 5

train_df["kfold"] = -1
train_df = train_df.sample(frac=1, random_state=RANDOM_SEED).reset_index(drop=True)  # seeded shuffle for reproducible folds
y = train_df.target_binned.values
kf = StratifiedKFold(n_splits=NUM_SPLITS)
for f, (t_, v_) in enumerate(kf.split(X=train_df, y=y)):
    train_df.loc[v_, 'kfold'] = f
    
train_df.head()
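
The cell above stratifies on `target_binned` so that each fold sees a similar spread of easy and hard excerpts. The sketch below shows the idea with a simplified round-robin assignment in plain Python. It is not scikit-learn's algorithm, and the function name `stratified_folds` and the bin labels are made up for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(bin_labels, n_splits, seed=42):
    """Assign a fold id to each row so every bin is spread evenly over folds."""
    rng = random.Random(seed)
    rows_by_bin = defaultdict(list)
    for row, label in enumerate(bin_labels):
        rows_by_bin[label].append(row)

    folds = [-1] * len(bin_labels)
    for label in sorted(rows_by_bin):
        rows = rows_by_bin[label]
        rng.shuffle(rows)
        # Deal the rows of this bin out to the folds like playing cards.
        for position, row in enumerate(rows):
            folds[row] = position % n_splits
    return folds

# Made-up bin labels: 10 rows in bin 0, 20 in bin 1, 10 in bin 2.
bin_labels = [0] * 10 + [1] * 20 + [2] * 10
folds = stratified_folds(bin_labels, n_splits=5)
for fold in range(5):
    members = [bin_labels[i] for i, f in enumerate(folds) if f == fold]
    print(fold, sorted(Counter(members).items()))
```

Every fold ends up with the same proportion of each bin, which keeps validation scores comparable across folds.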

In [None]:
train_df = train_df[['id', 'excerpt', 'target', 'kfold']]

# Vectorization

In [None]:
doc = nlp.pipe(train_df['excerpt'])
x_train_stv = np.array([text.vector for text in doc])
doc = nlp.pipe(test_df['excerpt'])
x_test_stv = np.array([text.vector for text in doc])
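
By default, spaCy's `Doc.vector` is the average of the document's token vectors, so each excerpt is reduced to a single fixed-length vector. The sketch below reproduces that mean pooling with a tiny made-up embedding table. The names `toy_embeddings`, `mean_vector` and `embed` are hypothetical, and the three-dimensional vectors are only for illustration.

```python
def mean_vector(token_vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / count for d in range(dim)]

# Made-up 3-dimensional word vectors, for illustration only.
toy_embeddings = {
    "cat": [1.0, 0.0, 2.0],
    "sat": [0.0, 1.0, 0.0],
    "mat": [2.0, 2.0, 1.0],
}

def embed(text):
    """Average the vectors of the known tokens in a whitespace-split text."""
    tokens = [tok for tok in text.split() if tok in toy_embeddings]
    return mean_vector([toy_embeddings[tok] for tok in tokens])

print(embed("cat sat mat"))
print(embed("mat sat cat") == embed("cat sat mat"))
```

Averaging is order-insensitive, so features built this way carry no information about word order, much like a bag-of-words representation.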

# Models

## 1. Classical ML Models
Although the EDA above suggests that sequence models should perform better on this task, let's create a quick and dirty order-insensitive model (built on averaged spaCy word vectors) just as a baseline.

In [None]:
def get_stacking():
    level0 = []
    level0.append(('knn', KNeighborsRegressor()))
    level0.append(('svr', SVR()))
    level0.append(('Ridge', Ridge()))
    level0.append(('LR', LinearRegression()))
    level0.append(('RF', RandomForestRegressor(random_state=42)))
    level0.append(('Lgbm', LGBMRegressor(metric='rmse',
                                         objective='regression',
                                         learning_rate=0.01,
                                         seed=42)))
    
    level1 = LinearRegression()
    model = StackingRegressor(estimators=level0, final_estimator=level1, cv=None)
    return model

In [None]:
def get_models():
    models = dict()
    models['knn'] = KNeighborsRegressor()
    models['svr'] = SVR()
    models['Ridge'] = Ridge()
    models['LR'] = LinearRegression()
    models['RF'] = RandomForestRegressor()
    models['Lgbm'] = LGBMRegressor()
    models['Stacked'] = get_stacking()
    
    return models

In [None]:
def evaluate_model(model, X, y):
    cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
    scores = cross_val_score(model, X, y, scoring='neg_root_mean_squared_error', cv=cv, n_jobs=-1, error_score='raise')
    return scores

In [None]:
%%time

X = x_train_stv
y = np.array(train_df['target'])

models = get_models()
results = []
names = []

for name, model in models.items():
    scores = -evaluate_model(model, X, y)
    results.append(scores)
    names.append(name)
    print(f'{name} : {round(np.mean(scores),3)} ({round(np.std(scores),3)})')

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
ax.boxplot(results, labels=names, showmeans=True)
plt.show()

In [None]:
mean_scores = []
for score in results:
    mean_scores.append(round(np.mean(score),3))
min_index = mean_scores.index(min(mean_scores))
model_name = names[min_index]

In [None]:
print(f'Best Score: {mean_scores[min_index]}')
print(f'Best Model: {model_name}')

## 2. Auto Keras

In [None]:
%%time

scores = []

for i in range(NUM_SPLITS):
    train = train_df[train_df['kfold'] != i].copy()
    valid = train_df[train_df['kfold'] == i].copy()
    
    train['excerpt'] = train['excerpt'].apply(lambda x: '<START> ' + x + ' <END>')
    valid['excerpt'] = valid['excerpt'].apply(lambda x: '<START> ' + x + ' <END>')
    
    auto_reg = ak.TextRegressor(overwrite=True, max_trials=4)
    auto_reg.fit(train['excerpt'].values, train['target'].values, epochs=10)
    preds = auto_reg.predict(valid['excerpt'].values)
    
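    # RMSE as the square root of MSE; avoids the `squared` argument, which newer scikit-learn versions no longer accept
    loss = np.sqrt(mean_squared_error(valid['target'].values, preds))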
    scores.append(loss)
    print(f'Fold {i+1}: {loss}')
print('')
print(f'Mean Loss: {np.mean(scores)}')

# Submission

In [None]:
mean_scores = []
for score in results:
    mean_scores.append(round(np.mean(score),3))
min_index = mean_scores.index(min(mean_scores))
model_name = names[min_index]

In [None]:
print(f'Best Score: {mean_scores[min_index]}')
print(f'Best Model: {model_name}')

In [None]:
%%time

#ML
doc = nlp.pipe(train_df['excerpt'])
x_train_stv = np.array([text.vector for text in doc])
doc = nlp.pipe(test_df['excerpt'])
x_test_stv = np.array([text.vector for text in doc])

models = get_models()
reg = models[model_name]
X = x_train_stv
y = np.array(train_df['target'])
reg.fit(X, y)
preds_ml = reg.predict(x_test_stv)

#Auto Keras
train_df['excerpt'] = train_df['excerpt'].apply(lambda x: '<START> ' + x + ' <END>')
test_df['excerpt'] = test_df['excerpt'].apply(lambda x: '<START> ' + x + ' <END>')
auto_reg = ak.TextRegressor(overwrite=True, max_trials=4)
auto_reg.fit(train_df['excerpt'].values, train_df['target'].values, epochs=10)
preds_ak = auto_reg.predict(test_df['excerpt'].values)

# Final Prediction
weights = [5, 1]
preds = (weights[0]*preds_ml + weights[1]*preds_ak.squeeze())/(sum(weights))

submission = pd.DataFrame()
submission['id'] = test_df['id']
submission['target'] = preds
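
The final prediction above is a weighted average of the two models' outputs, with weights `[5, 1]` favouring the classical model. The sketch below shows the same arithmetic on made-up numbers. The helper `weighted_blend` and the prediction values are assumptions for illustration.

```python
def weighted_blend(predictions, weights):
    """Weighted average of several equal-length prediction lists."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per prediction list")
    total = sum(weights)
    n_rows = len(predictions[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions)) / total
        for i in range(n_rows)
    ]

# Made-up predictions from two hypothetical models.
preds_a = [0.0, -1.2, 0.6]
preds_b = [0.6, -0.6, 0.0]
blend = weighted_blend([preds_a, preds_b], weights=[5, 1])
print([round(value, 3) for value in blend])
```

With weights of 5 and 1, the blend moves only one sixth of the way from the first model's prediction towards the second's.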

In [None]:
submission.head()

In [None]:
submission.to_csv("submission.csv",index=False)
