# <center>CommonLit Readability Prize</center>


**Goal**
* Predict the complexity of reading passages for grade 3-12 classroom use.

**About the data**
> * id - unique ID for the excerpt
> * url_legal - URL of the source; blank in the test set.
> * license - license of the source material; blank in the test set.
> * excerpt - the text whose reading ease is to be predicted
> * target - reading ease score
> * standard_error - measure of the spread of scores among multiple raters for each excerpt; not included in the test data.


**Special Notes**
* `url_legal`, `license`, and `standard_error` are not available in the test data.

### Imports

In [None]:
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np 
import pandas as pd
import nltk
import re
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

### Reading Data

In [None]:
# Training data
train_data = pd.read_csv('/kaggle/input/commonlitreadabilityprize/train.csv')

# Test data
test_data = pd.read_csv('/kaggle/input/commonlitreadabilityprize/test.csv')


train_data.head()

### Histogram of token length

In [None]:
# Token count per excerpt (nltk.word_tokenize needs NLTK's Punkt tokenizer data to be available)
len_v = train_data['excerpt'].apply(nltk.word_tokenize).apply(len)
len_v.head()

In [None]:

plt.figure(figsize=(10,10))
sns.histplot(len_v)
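
The histogram shows the shape of the token-length distribution; a few summary statistics make it easier to quote. The following is a minimal sketch, assuming whitespace splitting as a rough stand-in for `nltk.word_tokenize` and three made-up excerpts in place of `train_data['excerpt']`.

```python
import statistics

# Hypothetical stand-ins for train_data['excerpt']; the real excerpts are much longer.
excerpts = [
    "The cat sat on the mat.",
    "Photosynthesis converts light energy into chemical energy in plants.",
    "It rained all day.",
]

# Whitespace splitting only approximates nltk.word_tokenize,
# which would also split punctuation into separate tokens.
token_counts = [len(text.split()) for text in excerpts]

summary = {
    "min": min(token_counts),
    "max": max(token_counts),
    "mean": statistics.mean(token_counts),
    "median": statistics.median(token_counts),
}
print(token_counts)
print(summary)
```

On the real data, `len_v.describe()` would give the same kind of summary directly from the pandas Series.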

In [None]:
# Hold out part of the training data for evaluation (train_test_split defaults to a 75/25 split)
X_train,X_test,y_train,y_test = train_test_split(train_data['excerpt'],train_data['target'],random_state=42)
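
Because no `test_size` is passed, `train_test_split` holds out 25% of the rows by default. The idea can be sketched in plain Python as below; `holdout_split` is a hypothetical helper written for illustration, and its shuffle order will not match scikit-learn's for the same seed.

```python
import math
import random


def holdout_split(items, test_fraction=0.25, seed=42):
    """Shuffle indices and split them into train and test parts."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_fraction * len(items))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]


# Made-up sample names standing in for the excerpts.
samples = [f"excerpt_{i}" for i in range(8)]
train_part, test_part = holdout_split(samples)
print(len(train_part), len(test_part))
```

Fixing the seed keeps the split reproducible, which matters when two models are compared on the same held-out rows.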

### Universal Sentence Encoder

* The Universal Sentence Encoder (USE) encodes text into fixed-length 512-dimensional vectors, which can then be used for text classification, clustering, semantic similarity, and other downstream tasks.
* It comes in two variants: one trained with a Transformer encoder and one trained with a Deep Averaging Network (DAN).
* TF Hub provides both versions. This notebook compares the two.
* A useful property of USE is that it can embed whole paragraphs, not just single sentences.
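
To give some intuition for the DAN variant: it averages the input embeddings and passes the result through a feed-forward network, so texts of any length map to a vector of the same size. The sketch below shows only the averaging step, with made-up 3-dimensional word vectors and a hypothetical `average_embedding` helper; the real model uses learned embeddings and further layers.

```python
# Made-up 3-dimensional word vectors; real USE embeddings are learned.
toy_vectors = {
    "the": [0.1, 0.0, 0.2],
    "cat": [0.9, 0.3, 0.1],
    "sat": [0.2, 0.8, 0.4],
}


def average_embedding(sentence, vectors, dim=3):
    """Average the vectors of known words; unknown words are skipped."""
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


short = average_embedding("the cat", toy_vectors)
longer = average_embedding("the cat sat", toy_vectors)

# Both inputs produce a vector of the same length, regardless of word count.
print(len(short), len(longer))
```

Averaging discards word order, which is one reason the Transformer variant is usually described as more accurate but more expensive.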



#### DAN model

In [None]:
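model = tf.keras.models.Sequential()
# Frozen USE (DAN) encoder: maps each raw string to an embedding vector
model.add(hub.KerasLayer("/kaggle/input/universalsentenceencoder/",input_shape=[],trainable=False,dtype=tf.string))
# Regression head; no activation is set, so these Dense layers are linear
model.add(tf.keras.layers.Dense(128))
model.add(tf.keras.layers.Dense(64))
model.add(tf.keras.layers.Dense(32))
# Single output unit for the predicted reading-ease score
model.add(tf.keras.layers.Dense(1))

model.compile(optimizer='adam',loss = "mean_squared_error")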
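
One thing to keep in mind about this head: the `Dense` layers are added without an `activation` argument, and Keras applies no non-linearity by default. A stack of linear layers can only represent what a single linear layer can. The sketch below checks this numerically with made-up weights for a tiny 2 -> 2 -> 1 stack; whether adding an activation such as ReLU would improve the score here is untested.

```python
def dense(x, weights, bias):
    """Linear layer: one output per row of `weights`, no activation."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


# Made-up weights for a 2 -> 2 -> 1 stack of linear layers.
W1, b1 = [[1.0, 2.0], [0.5, -1.0]], [0.1, 0.2]
W2, b2 = [[3.0, -2.0]], [0.5]

# Collapse the stack into one layer: W = W2 @ W1, b = W2 @ b1 + b2.
W = [[sum(W2[0][k] * W1[k][j] for k in range(2)) for j in range(2)]]
b = [sum(W2[0][k] * b1[k] for k in range(2)) + b2[0]]

x = [0.7, -1.3]
stacked = dense(dense(x, W1, b1), W2, b2)
single = dense(x, W, b)

# The two outputs agree up to floating-point rounding.
print(stacked, single)
```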

#### Transformer Model

In [None]:
model2 = tf.keras.models.Sequential()
# Frozen USE (Transformer, "large") encoder: maps each raw string to an embedding vector
model2.add(hub.KerasLayer("/kaggle/input/universalsentenceencoderlarge/",input_shape=[],trainable=False,dtype=tf.string))
# Same regression head as the DAN model, so only the encoder differs
model2.add(tf.keras.layers.Dense(128))
model2.add(tf.keras.layers.Dense(64))
model2.add(tf.keras.layers.Dense(32))
model2.add(tf.keras.layers.Dense(1))
model2.compile(optimizer='adam',loss = "mean_squared_error")

In [None]:
model.fit(X_train,y_train,epochs=20)
preds_dan = model.predict(X_test)

In [None]:
model2.fit(X_train,y_train,epochs=20)
# Predict with the Transformer model (model2), not the DAN model
preds_trans = model2.predict(X_test)

In [None]:
print(f"RMSE for DAN Model: {np.sqrt(mean_squared_error(y_test, preds_dan))}")

print(f"RMSE for Transformer Model: {np.sqrt(mean_squared_error(y_test, preds_trans))}")

* **We can see that both models have approximately the same RMSE, so the extra compute cost of the Transformer encoder does not pay off here. The comparison is only meaningful when `preds_trans` comes from `model2`.**
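
For reference, RMSE is the square root of the mean squared difference between true and predicted values. A minimal worked example with made-up numbers (not results from this notebook):

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up targets and predictions.
y_true = [0.0, -1.0, -2.0, 1.0]
y_pred = [0.5, -1.0, -1.0, 1.5]

# Squared errors are 0.25, 0, 1, 0.25; their mean is 0.375.
score = rmse(y_true, y_pred)
print(score)
```

Lower is better, and the value is in the same units as `target`.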

In [None]:
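# model.predict returns an array of shape (n, 1); flatten it into a 1-D column
test_data['target'] = model.predict(test_data['excerpt']).ravel()
# Keep only the columns needed for the submission: id and target
test_data.drop(['url_legal','license','excerpt'],axis=1,inplace=True)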

test_data.to_csv('/kaggle/working/submission.csv',index=False)
test_data.head()
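
Before submitting, it can help to check that the file has exactly the `id` and `target` columns and that every prediction is a finite number. The sketch below does this on an in-memory CSV with made-up rows; the ids and values are hypothetical, not taken from `test.csv`.

```python
import csv
import io
import math

# Hypothetical rows; real ids come from test.csv.
rows = [{"id": "abc123", "target": -0.75}, {"id": "def456", "target": 0.2}]

# Write the rows to an in-memory CSV instead of /kaggle/working/submission.csv.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "target"])
writer.writeheader()
writer.writerows(rows)

# Read the file back and validate its structure and values.
buffer.seek(0)
parsed = list(csv.DictReader(buffer))
header_ok = list(parsed[0].keys()) == ["id", "target"]
targets_ok = all(math.isfinite(float(row["target"])) for row in parsed)
print(header_ok, targets_ok)
```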
