# CommonLit Readability Prize

---
Can machine learning identify the appropriate reading level of a passage of text, and help inspire learning? Reading is an essential skill for academic success. When students have access to engaging passages offering the right level of challenge, they naturally develop reading skills.


In this competition, we're predicting the reading ease of excerpts from literature. 
We've provided excerpts from several time periods and a wide range of reading ease scores. 
Note that the test set includes a slightly larger proportion of modern texts (the type of texts we want to generalize to) than the training set.
---


Dataset Link: https://www.kaggle.com/c/commonlitreadabilityprize/data

Study reference material:

https://www.kaggle.com/ruchi798/commonlit-readability-prize-eda-baseline

https://www.kaggle.com/manishkc06/text-pre-processing-data-wrangling




---

**Columns:**

- `id`: Unique ID for each excerpt.

- `url_legal`: URL of the source. This field is blank in the test set.

- `license`: License of the source material. This field is blank in the test set.

- `excerpt`: The text for which we want to predict the reading ease.

- `target`: The reading ease score for each excerpt.

- `standard_error`: A measure of the spread of scores among multiple raters for each excerpt. This field is not included in the test data.

---

# Importing required libraries

In [1]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from spacy.lang.en.stop_words import STOP_WORDS
import string

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor
import xgboost as xgb
from sklearn.metrics import mean_squared_error,r2_score

In [2]:
# Uncomment on first run to download the NLTK tokenizer models and stop word list
# nltk.download('punkt')
# nltk.download('stopwords')


In [3]:
df = pd.read_csv('_6_2_train.csv',usecols=['excerpt','target','standard_error'])
df

Unnamed: 0,excerpt,target,standard_error
0,When the young people returned to the ballroom...,-0.340259,0.464009
1,"All through dinner time, Mrs. Fayre was somewh...",-0.315372,0.480805
2,"As Roger had predicted, the snow departed as q...",-0.580118,0.476676
3,And outside before the palace a great garden w...,-1.054013,0.450007
4,Once upon a time there were Three Bears who li...,0.247197,0.510845
...,...,...,...
2829,When you think of dinosaurs and where they liv...,1.711390,0.646900
2830,So what is a solid? Solids are usually hard be...,0.189476,0.535648
2831,The second state of matter we will discuss is ...,0.255209,0.483866
2832,Solids are shapes that you can actually touch....,-0.215279,0.514128


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2834 entries, 0 to 2833
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   excerpt         2834 non-null   object 
 1   target          2834 non-null   float64
 2   standard_error  2834 non-null   float64
dtypes: float64(2), object(1)
memory usage: 66.5+ KB


In [5]:
df.nunique()

excerpt           2834
target            2834
standard_error    2834
dtype: int64

In [6]:
df.isna().sum()

excerpt           0
target            0
standard_error    0
dtype: int64

# Data preprocessing

In [7]:
# lowercasing

df['excerpt'] = df['excerpt'].str.lower()

In [8]:
# Tokenization

df['excerpt'] = df['excerpt'].apply(word_tokenize)

In [9]:
# Removing Punctuation

df['excerpt'] = df['excerpt'].apply(lambda tokens: [i for i in tokens if i not in string.punctuation])
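One subtlety in the cell above: `string.punctuation` is a single string, so `i not in string.punctuation` is a substring test rather than a per-character membership test. Multi-character tokens, such as the two-backtick and two-apostrophe tokens that `word_tokenize` emits for double quotes, are therefore not filtered out. A minimal sketch of a stricter filter, assuming the goal is to drop any token made up entirely of punctuation (the helper name `is_punctuation` and the sample tokens are hypothetical, not part of the notebook):

```python
import string

PUNCT_CHARS = set(string.punctuation)

def is_punctuation(token: str) -> bool:
    """True if the token is non-empty and every character is punctuation."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)

tokens = ["``", "hello", ",", "world", "''", "...", "do", "n't"]

# Filter used in the notebook: substring test against the punctuation string.
substring_filtered = [t for t in tokens if t not in string.punctuation]

# Stricter filter: drop tokens whose characters are all punctuation.
char_filtered = [t for t in tokens if not is_punctuation(t)]

print(substring_filtered)
print(char_filtered)
```

Because `TfidfVectorizer`'s default token pattern keeps only runs of two or more word characters, such leftover tokens probably never reach the feature matrix in this notebook. The difference matters mainly if the token lists are used directly.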

In [10]:
# Removing Stop Words

# stopwords_list = set(stopwords.words('english'))
stopwords_list = set(STOP_WORDS)
df['excerpt'] = df['excerpt'].apply(lambda tokens: [token for token in tokens if token not in stopwords_list])

In [11]:
# Stemming

stemmer = PorterStemmer()
df['excerpt'] = df['excerpt'].apply(lambda tokens: [stemmer.stem(token) for token in tokens])

# Splitting into input and target features

In [12]:
x = df['excerpt'].astype('U')
y = df['target']
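Here `astype('U')` appears to turn each token list into its string representation (for example `"['young', 'peopl']"`), which `TfidfVectorizer` later re-tokenizes. A minimal sketch, assuming the vectorizer's default `token_pattern` of `(?u)\b\w\w+\b`, suggesting that this yields the same tokens as explicitly joining with spaces (the sample tokens are made up for illustration):

```python
import re

# Default token pattern of scikit-learn's text vectorizers (assumed unchanged).
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

tokens = ["young", "peopl", "return", "ballroom", "a"]

as_repr = str(tokens)         # string form of the list, brackets and quotes included
as_joined = " ".join(tokens)  # plain space-separated text

from_repr = TOKEN_PATTERN.findall(as_repr)
from_joined = TOKEN_PATTERN.findall(as_joined)

print(from_repr)
print(from_joined)
```

Both forms drop the single-character token, since the pattern requires at least two word characters. An explicit join such as `df['excerpt'].apply(' '.join)` would state the intent more directly.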

# Splitting into train and test data

In [13]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2, random_state=42)

# Vectorization

In [14]:
vectorizer = TfidfVectorizer(min_df=1)

In [15]:
x_train_feature = vectorizer.fit_transform(x_train)
x_test_feature = vectorizer.transform(x_test)
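The vectorizer is fitted on the training split only and then reused on the test split, so the test features share the training vocabulary and words unseen during fitting are ignored. A toy sketch of that behaviour (the short documents and the name `toy_vectorizer` are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["cat sat mat", "dog sat log"]
test_docs = ["cat zebra", "zebra"]

toy_vectorizer = TfidfVectorizer(min_df=1)
train_features = toy_vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_features = toy_vectorizer.transform(test_docs)        # reuses them, no refit

vocabulary = sorted(toy_vectorizer.vocabulary_)
print(vocabulary)
print(train_features.shape, test_features.shape)

# Number of non-zero features per test document; "zebra" was never seen in training.
nonzero_per_doc = test_features.getnnz(axis=1).tolist()
print(nonzero_per_doc)
```

Calling `fit_transform` on the test split instead would build a different vocabulary, and the resulting columns would no longer line up with what the model was trained on.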

# Model training

1st model: LinearRegression

In [16]:
lr = LinearRegression()
lr.fit(x_train_feature,y_train)

In [17]:
y_pred = lr.predict(x_test_feature)

In [18]:
print('r2_Score: ',r2_score(y_true = y_test,y_pred= y_pred),'\nmse :',mean_squared_error(y_true = y_test,y_pred= y_pred))

r2_Score:  0.4248943940722181 
mse : 0.6021858027048967
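For reference, both metrics printed above can be computed by hand: MSE is the mean of the squared residuals, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. A small sketch with made-up numbers (not the notebook's data); the RMSE line is an extra, obtained as the square root of MSE:

```python
import math

y_true = [1.0, 0.0, -1.0, -2.0]
y_pred = [0.5, 0.0, -1.5, -1.0]

n = len(y_true)
residual_ss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
mean_true = sum(y_true) / n
total_ss = sum((t - mean_true) ** 2 for t in y_true)

mse = residual_ss / n             # mean squared error
rmse = math.sqrt(mse)             # root mean squared error
r2 = 1 - residual_ss / total_ss   # coefficient of determination

print(mse, rmse, r2)
```

Lower MSE and higher R² both indicate a better fit, so the two metrics rank models fitted on the same split in the same order.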


2nd model: KNeighborsRegressor

In [19]:
knr = KNeighborsRegressor(n_neighbors=20, weights='distance')
knr.fit(x_train_feature,y_train)

In [20]:
kn_pred = knr.predict(x_test_feature)

In [21]:
print('r2_Score: ',r2_score(y_true = y_test,y_pred= kn_pred),'\nmse :',mean_squared_error(y_true = y_test,y_pred= kn_pred))

r2_Score:  0.3025981317951034 
mse : 0.7302406714247832


3rd model: GradientBoostingRegressor

In [22]:
gb = GradientBoostingRegressor(learning_rate=0.2)
gb.fit(x_train_feature,y_train)

In [23]:
gb_pred = gb.predict(x_test_feature)

In [24]:
print('r2_Score: ',r2_score(y_true = y_test,y_pred= gb_pred),'\nmse :',mean_squared_error(y_true = y_test,y_pred= gb_pred))

r2_Score:  0.34920144695246824 
mse : 0.6814429298317877


4th model: XGBRegressor

In [25]:
xg = xgb.XGBRegressor()
xg.fit(x_train_feature,y_train)

In [26]:
xg_pred = xg.predict(x_test_feature)

In [27]:
print('r2_Score: ',r2_score(y_true = y_test,y_pred= xg_pred),'\nmse :',mean_squared_error(y_true = y_test,y_pred= xg_pred))

r2_Score:  0.3413065761670129 
mse : 0.6897095491312488


# The 1st model, LinearRegression, performs best on this data (highest R², lowest MSE on the held-out split)
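The comparison can be made explicit by collecting the held-out scores reported above and ranking them. A minimal sketch (the scores are copied from the outputs above and rounded to four decimals; the dictionary layout is just one convenient choice):

```python
# (r2, mse) on the 20% held-out split, rounded from the outputs above.
scores = {
    "LinearRegression": (0.4249, 0.6022),
    "KNeighborsRegressor": (0.3026, 0.7302),
    "GradientBoostingRegressor": (0.3492, 0.6814),
    "XGBRegressor": (0.3413, 0.6897),
}

# Rank by MSE, lowest (best) first.
ranking = sorted(scores, key=lambda name: scores[name][1])
best_by_mse = ranking[0]
best_by_r2 = max(scores, key=lambda name: scores[name][0])

for name in ranking:
    r2, mse = scores[name]
    print(f"{name:<26} r2={r2:.4f}  mse={mse:.4f}")
```

These figures come from a single 80/20 split with mostly default hyperparameters, so the ranking may not hold under cross-validation or tuning.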

# Predicting Test Data Using LinearRegression

In [28]:
# importing Test Data

In [29]:
test_df = pd.read_csv('_6_2_test.csv',usecols=['excerpt'])
test_df

Unnamed: 0,excerpt
0,My hope lay in Jack's promise that he would ke...
1,Dotty continued to go to Mrs. Gray's every nig...
2,It was a bright and cheerful scene that greete...
3,Cell division is the process by which a parent...
4,Debugging is the process of finding and resolv...
5,"To explain transitivity, let us look first at ..."
6,Milka and John are playing in the garden. Her ...


In [30]:
# data_preprocessing
test_df['excerpt'] = test_df['excerpt'].str.lower()
test_df['excerpt'] = test_df['excerpt'].apply(word_tokenize)
test_df['excerpt'] = test_df['excerpt'].apply(lambda tokens: [i for i in tokens if i not in string.punctuation])
test_df['excerpt'] = test_df['excerpt'].apply(lambda tokens: [token for token in tokens if token not in stopwords_list])
test_df['excerpt'] = test_df['excerpt'].apply(lambda tokens: [stemmer.stem(token) for token in tokens])
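The test-time cell above repeats the training-time steps line by line, which makes it easy for the two to drift out of sync. One option is a single helper applied to both frames. A minimal sketch: the function name `preprocess` and its parameters are hypothetical, and the defaults are simple stand-ins so the snippet runs without NLTK data; in the notebook one would pass `word_tokenize`, `stopwords_list`, and `stemmer.stem` instead.

```python
import string
from typing import Callable, Iterable, List

def preprocess(
    text: str,
    tokenize: Callable[[str], List[str]] = str.split,
    stop_words: Iterable[str] = (),
    stem: Callable[[str], str] = lambda token: token,
) -> List[str]:
    """Lowercase, tokenize, drop punctuation and stop words, then stem."""
    stop_set = set(stop_words)
    tokens = tokenize(text.lower())
    # Same substring-style punctuation test as the notebook cells.
    tokens = [t for t in tokens if t not in string.punctuation]
    tokens = [t for t in tokens if t not in stop_set]
    return [stem(t) for t in tokens]

# Stand-in arguments for illustration only.
demo = preprocess(
    "The cats were running , quickly !",
    stop_words={"the", "were"},
    stem=lambda token: token.rstrip("s"),
)
print(demo)
```

With such a helper, both frames could be processed with the same call, for example `df['excerpt'].apply(lambda s: preprocess(s, word_tokenize, stopwords_list, stemmer.stem))`, applied to the raw excerpts.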

In [31]:
temp1 = test_df['excerpt'].astype('U')
temp = vectorizer.transform(temp1)

In [32]:
lr.predict(temp)

array([-2.05459278, -0.39786707, -0.61500775, -2.03023551, -0.81989995,
       -0.99775614,  1.48879904])