### In this competition we are going to predict a readability score for every excerpt in the test set.
I generally like to ask why and where a solution is used. To understand the **why** part, we need to understand the problem statement. Here, Kaggle wants us to estimate a reading-ease score for each text, so that texts can be matched to readers such as non-native speakers, people without a scientific background, and non-technical business decision makers.

Traditional methods such as the **Flesch-Kincaid** score are based on weak proxies of text decoding, and other metrics such as **Lexile** are paid.
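For reference, the Flesch-Kincaid grade level uses only three counts: words, sentences, and syllables. The sketch below illustrates why that is a weak proxy. The function names and the vowel-group syllable counter are assumptions made for this example; real implementations use better syllable rules or pronunciation dictionaries.

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic (assumption): one syllable per run of vowels, minimum 1."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level from word, sentence, and syllable counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59

# Two made-up sentences: short common words vs. long rare words.
easy = "The cat sat on the mat. The dog ran to the cat."
hard = "Comprehensive readability estimation necessitates sophisticated linguistic characterization."
print(flesch_kincaid_grade(easy), flesch_kincaid_grade(hard))
```

The formula sees only word and sentence lengths, so it cannot tell whether a short word is rare or whether a sentence has a complex structure.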

Here I am going to provide an EDA and a baseline solution using traditional machine learning methods.

### Import Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### Read Data

In [None]:
datapath = "/kaggle/input/commonlitreadabilityprize/"
sub_df = pd.read_csv(f"{datapath}/sample_submission.csv")
train_df = pd.read_csv(f"{datapath}/train.csv")
test_df = pd.read_csv(f"{datapath}/test.csv")
train_df.shape, test_df.shape, sub_df.shape

### Details about the dataset

In [None]:
train_df.head()

In [None]:
test_df.head()

In [None]:
sub_df.head()

In [None]:
train_df.info()

In [None]:
test_df.info()

In [None]:
train_df.describe()

### Let's check the excerpt

In [None]:
train_df['excerpt'].head()

In [None]:
train_df['excerpt'][0]

In [None]:
train_df['target'][0]

### Understanding the target

Here the target is the reading-ease score. The training set provides this score for each excerpt, and we need to predict it for each excerpt in the test set. The target variable shows the readability level: the higher the score, the easier the excerpt is to read, and the lower the score, the more difficult it becomes.

### Check the train data ('excerpt') column

In [None]:
train_df['word_count'] = train_df['excerpt'].apply(lambda x:len(x.split()))

In [None]:
train_df['word_count'].min(), train_df['word_count'].max()

So here we have **excerpts** with a minimum of 135 tokens and a maximum of 205 tokens, where a token is a whitespace-separated word.
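The same counting approach extends to other simple length features, and we can then check how they relate to the target. This is a minimal sketch on a made-up toy frame. `toy_df`, its values, and the `mean_word_len` feature are hypothetical and are not part of the competition data.

```python
import pandas as pd

# Hypothetical toy frame standing in for train_df (excerpts and targets are made up).
toy_df = pd.DataFrame({
    "excerpt": [
        "The cat sat on the mat.",
        "A dog ran far away from home today.",
        "Photosynthesis converts light energy into chemical energy.",
    ],
    "target": [0.5, 0.1, -1.8],
})

# Whitespace token count, same definition as word_count above.
toy_df["word_count"] = toy_df["excerpt"].str.split().str.len()

# Average token length in characters (punctuation included, for simplicity).
toy_df["mean_word_len"] = toy_df["excerpt"].apply(
    lambda text: sum(len(w) for w in text.split()) / len(text.split())
)

# Pearson correlation of each feature with the target.
print(toy_df[["word_count", "mean_word_len", "target"]].corr()["target"])
```

On the real data, replace `toy_df` with `train_df`. With only three toy rows the correlations here mean nothing and only show the mechanics.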

In [None]:
train_df['target'].min(), train_df['target'].max()

### Machine Learning Model Building

1. TFIDF + Linear Regression
2. TFIDF + Linear Regression + stopwords removal
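The cells that follow build the first variant. For the second variant, one possible approach is to pass scikit-learn's built-in English stop-word list to `TfidfVectorizer`. This is a minimal sketch on made-up toy texts. The variable names and data are assumptions, and the built-in list is only one of several possible stop-word choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Made-up toy data standing in for train_df['excerpt'] and train_df['target'].
texts = [
    "The cat sat on the mat.",
    "The dog ran to the park.",
    "Photosynthesis converts light into chemical energy.",
    "Quantum mechanics describes the behaviour of particles.",
]
scores = [0.6, 0.4, -1.5, -2.0]

# stop_words="english" drops common words such as "the" and "on" before weighting.
model = make_pipeline(TfidfVectorizer(stop_words="english"), LinearRegression())
model.fit(texts, scores)

vocab_filtered = model.named_steps["tfidfvectorizer"].vocabulary_
vocab_full = TfidfVectorizer().fit(texts).vocabulary_
print(len(vocab_full), len(vocab_filtered))
```

A pipeline keeps the vectorizer and the regressor together, so the same vocabulary is applied at fit and predict time.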

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.linear_model import LinearRegression

In [None]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_df['excerpt'])

In [None]:
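# Fit the TF-IDF vocabulary on the training excerpts only
X_train = vectorizer.fit_transform(train_df['excerpt'])
# Reuse the fitted vocabulary for the test excerpts (no refitting)
X_test = vectorizer.transform(test_df['excerpt'])
y = train_df['target']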

In [None]:
LR_tfidf = LinearRegression().fit(X_train, y)
# R^2 on the training data itself (not a held-out estimate)
LR_tfidf.score(X_train, y)

In [None]:
LR_tfidf.coef_, LR_tfidf.intercept_

In [None]:
y_train_pred = LR_tfidf.predict(X_train)

#### Check the error metrics on the training set

In [None]:
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y, y_train_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y, y_train_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y, y_train_pred)))
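These metrics are computed on the same rows the model was fit on, so they are likely to be optimistic. K-fold cross-validation gives a rough held-out estimate. The sketch below uses made-up toy data and plain `KFold`, because `StratifiedKFold` expects class labels rather than a continuous target. The fold count and all variable names are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up toy data standing in for train_df['excerpt'] and train_df['target'].
texts = [
    "The cat sat on the mat.",
    "The dog ran to the park.",
    "A bird sang in the tree.",
    "Photosynthesis converts light into chemical energy.",
    "Quantum mechanics describes the behaviour of particles.",
    "Thermodynamics governs the exchange of heat and work.",
]
scores = np.array([0.6, 0.4, 0.5, -1.5, -2.0, -1.8])

# The pipeline refits TF-IDF inside each fold, so validation text never leaks into the vocabulary.
pipeline = make_pipeline(TfidfVectorizer(), LinearRegression())
cv = KFold(n_splits=3, shuffle=True, random_state=0)

# scikit-learn reports negated MSE so that larger is always better; undo the sign first.
neg_mse = cross_val_score(pipeline, texts, scores, cv=cv, scoring="neg_mean_squared_error")
rmse_per_fold = np.sqrt(-neg_mse)
print("RMSE per fold:", rmse_per_fold, "mean:", rmse_per_fold.mean())
```

On the real data, pass `train_df['excerpt']` and `train_df['target']` in place of the toy lists.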

### Submission

In [None]:
y_test_pred = LR_tfidf.predict(X_test)
sub_df['target'] = y_test_pred

In [None]:
sub_df.head()

In [None]:
sub_df.to_csv('submission.csv', index=False)