<a id="table"></a>
<h1 style="background-color:pink;font-family:newtimeroman;font-size:350%;text-align:center;border-radius: 15px 50px;">Table of Contents</h1>


* [1. Problem definition](#1)
* [2. Initial analysis](#2)
* [3. Meta features](#3)
* [4. Exploratory data analysis](#4)

<a id="1"></a>
<h3 style="background-color:pink;font-family:newtimeroman;font-size:200%;text-align:center;border-radius: 15px 50px;">Problem definition</h3>


### Can machine learning identify the appropriate reading level of a passage of text, and help inspire learning? 


- It is important to understand the problem before solving it, so let us first understand the question.
- Most of us have seen reading passages in school; this competition provides a set of such passages.
- Passages differ in reading complexity: some are easy to read, others are not.
- The task is to build an algorithm that rates the complexity of reading passages for grade 3-12 classroom use.
- In simple terms: how easy is a given passage (from literature) to read, and can data science answer that?
- This competition will help students, teachers and administrators.

#### What are the features in the dataset, and what do they mean?

- id - unique ID for excerpt
- url_legal - URL of source (this is blank in the test set.)
- license - license of source material (this is blank in the test set.)
- excerpt - text to predict reading ease of
- target - reading ease
- standard_error - measure of spread of scores among multiple raters for each excerpt (not included in the test data)

In [None]:
try:
    # Data handling and general utilities
    import pandas as pd
    import numpy as np
    import time
    import os
    import string

    # Plotting
    import seaborn as sns
    import matplotlib.pyplot as plt
    from scipy.stats import gaussian_kde
    print("Libraries loaded successfully.")

    # Show long excerpts in full instead of truncating them
    pd.set_option('display.max_colwidth', 10000)
except ImportError as e:
    print(f"Some libraries are missing: {e}")

In [None]:
ROOT_DIR = '../input/commonlitreadabilityprize'
train_df = pd.read_csv(os.path.join(ROOT_DIR,"train.csv"),
                    dtype={'target':np.float32,'standard_error':np.float32})

<a id="2"></a>
<h3 style="background-color:pink;font-family:newtimeroman;font-size:200%;text-align:center;border-radius: 15px 50px;">Initial analysis </h3>

- In the initial analysis we will try to understand the dataset at a basic level.

In [None]:
train_df.head()

In [None]:
train_df.info()

In [None]:
train_df.describe(include='all')

In [None]:
# Number of distinct values in each column
print(f"Ids: {len(train_df.id.unique())}")
print(f"Excerpt: {len(train_df.excerpt.unique())}")
print(f"Target: {len(train_df.target.unique())}")
print(f"Standard error: {len(train_df.standard_error.unique())}")

- id, excerpt and target have a one-to-one relation: there are no duplicates, and every row has a distinct value in each.
- standard_error has 3 duplicate values.
- Two columns have missing values (url_legal and license).
- Because there were multiple raters, standard_error measures the spread of scores among the raters for each excerpt; it is heavily skewed.
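
The checks behind these observations can be sketched on a small, made-up frame. The values below are illustrative assumptions, not rows from train.csv; only the column names follow the dataset. `nunique` counts distinct values per column, `isna().sum()` counts missing entries, and `skew` measures the asymmetry of standard_error.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that mimics the column layout of train.csv.
toy_df = pd.DataFrame({
    "id": ["a1", "b2", "c3", "d4", "e5"],
    "url_legal": [np.nan, "https://example.org/x", np.nan, np.nan, np.nan],
    "license": [np.nan, "CC BY 4.0", np.nan, np.nan, np.nan],
    "excerpt": ["One.", "Two.", "Three.", "Four.", "Five."],
    "target": [0.5, -1.2, 0.0, -2.4, 1.1],
    "standard_error": [0.45, 0.45, 0.0, 0.52, 0.47],
})

unique_counts = toy_df.nunique()            # distinct non-null values per column
missing_counts = toy_df.isna().sum()        # missing entries per column
se_skew = toy_df["standard_error"].skew()   # sample skewness (0 means symmetric)

print(unique_counts)
print(missing_counts)
print(f"standard_error skewness: {se_skew:.3f}")
```

On the real data the same three calls can be applied to train_df directly; a column whose `nunique` is lower than the row count contains repeated values.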

### Let's have a look at the passages based on their readability level
- The easiest passage has the highest positive value (1.711)
- The baseline has a value of zero
- The hardest passage has the lowest negative value (-3.676)

In [None]:
train_df.sort_values('target', ascending=False).head(1)[['id','excerpt','standard_error','target']]

In [None]:
# Baseline excerpt: the row where target equals standard_error (the baseline has a target of zero)
train_df[train_df['target'] == train_df['standard_error']][['id','excerpt','standard_error','target']]

In [None]:
train_df.sort_values('target', ascending=True).head(1)[['id','excerpt','standard_error','target']]

- The excerpts are taken from various sources and are ranked by ease of reading relative to a baseline excerpt. The hardest and easiest excerpts are displayed above along with the baseline excerpt.
- The easiest excerpt (25ca8f498) is very plain, easy-to-read text. Sentences are short, and the diction and syntax are at an elementary level. It does not use excessive conjunctions or punctuation.

- The hardest excerpt (4626100d8) is indeed hard to understand. Sentences are long, and the diction and syntax are at an academic level. Many sentences are joined by conjunctions, and there is a lot of punctuation.
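
The qualitative differences noted above (sentence length, amount of punctuation) can be quantified roughly. The following is a minimal sketch, not part of the original analysis: `describe_text` is a hypothetical helper, the sentence splitting is deliberately naive, and the two sample strings are invented rather than taken from the dataset.

```python
import string

def describe_text(text):
    """Return rough readability indicators for one passage."""
    # Naive sentence split: treat '.', '!' and '?' as sentence terminators.
    normalized = text.replace("!", ".").replace("?", ".")
    sentences = [s.strip() for s in normalized.split(".") if s.strip()]
    words = text.split()
    return {
        "sentences": len(sentences),
        "words_per_sentence": len(words) / max(len(sentences), 1),
        "punctuation_marks": sum(ch in string.punctuation for ch in text),
    }

# Hypothetical examples, not excerpts from the dataset.
simple = "The cat sat. The dog ran. They played."
dense = "Although the cat, which was old, sat quietly; the dog, nevertheless, ran."

print(describe_text(simple))
print(describe_text(dense))
```

Applied to the excerpt column (for example with `train_df['excerpt'].apply(describe_text)`), indicators like these would allow the easiest and hardest passages to be compared numerically.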

<a id="3"></a>
<h3 style="background-color:pink;font-family:newtimeroman;font-size:200%;text-align:center;border-radius: 15px 50px;">Meta features </h3>

- Let's create some features that will help with further analysis.
- length_passage (feature): the length of each passage in characters
- difficulty (feature): reading difficulty score, 0 for easy passages and 1 for difficult passages


In [None]:
train_df['length_passage'] = train_df['excerpt'].apply(len)

In [None]:
# Higher target means easier to read, so positive targets are easy (0) and the rest difficult (1)
train_df['difficulty'] = np.where(train_df['target'] > 0, 0, 1)
train_df.head()

<a id="4"></a>
<h3 style="background-color:pink;font-family:newtimeroman;font-size:200%;text-align:center;border-radius: 15px 50px;">Exploratory data analysis</h3>


In [None]:
x = train_df['target']
y = train_df['standard_error']

bins = 100

plt.figure(figsize=(8,6))
sns.set_theme(style="whitegrid")
plt.hist(x, bins, alpha=.7,color='red', label='target')
plt.hist(y, bins, alpha=.7, color='black',label='standard_error')
plt.legend(loc='upper right')
plt.show()

In [None]:
plt.figure(figsize=(8,6))
sns.set_theme(style="whitegrid")
plt.title('standard error vs target', fontsize=10)
sns.histplot(data=train_df, x="target",y="standard_error", bins=250)

- The target is approximately normally distributed.
- The standard error is highly skewed.
- As the graph shows, the standard error is lowest near the middle of the target range and increases toward both ends. In other words, whether an excerpt is very complex or very simple, the raters' scores vary more widely, so the standard error rises at both ends.
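
One way to check the last observation numerically is to bin the target and average the standard error per bin. The sketch below uses synthetic data that is constructed to have this U shape (an assumption made purely for illustration), so it demonstrates the technique and does not reproduce the competition data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the real columns: standard_error is built to be
# smallest near the middle of the target range and larger toward both ends.
target = rng.uniform(-3.5, 1.5, size=500)
standard_error = 0.45 + 0.02 * (target + 1.0) ** 2 + rng.normal(0, 0.005, size=500)
toy_df = pd.DataFrame({"target": target, "standard_error": standard_error})

# Cut the target range into 5 equal-width bins and average the error per bin.
target_bin = pd.cut(toy_df["target"], bins=5)
per_bin = toy_df.groupby(target_bin, observed=True)["standard_error"].mean()

print(per_bin)
```

Replacing toy_df with train_df gives the per-bin averages for the real data; a U-shaped pattern across the bins would support the reading of the plot.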

In [None]:
plt.figure(figsize=(8,6))
sns.set_theme(style="whitegrid")
plt.title('Easy and difficult passage counts', fontsize=10)
ax = sns.countplot(x="difficulty", data=train_df,palette='husl')

In [None]:

plt.figure(figsize=(8,6))
sns.set_theme(style="whitegrid")
bins=100

plt.hist(train_df[train_df['difficulty']==0]['length_passage'], bins, range=[400,1500],
         alpha=0.6, label='easy to read passage')
plt.hist(train_df[train_df['difficulty']==1]['length_passage'], bins, range=[400,1500],
         alpha=0.6, label='difficult to read passage')
plt.title('Easy and difficult passage lengths', fontsize=10)
plt.xlabel('Length of passage', fontsize=10)
plt.ylabel('Samples', fontsize=10)
plt.legend(fontsize=10)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.show()
