# CommonLit EDA

## Video Tutorial

This EDA is accompanied by a video tutorial; watch it [here](https://www.youtube.com/watch?v=HwZkxUNbWgI&list=PL_49VD9KwQ_OJCqZOeOlSUQKcr1MyifOc&index=1).

In [None]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()
%matplotlib inline

# Data

## Files
- **train.csv** - the training set
- **test.csv** - the test set
- **sample_submission.csv** - a sample submission file in the correct format

## Columns
- **id** - unique ID for excerpt
- **url_legal** - URL of source - this is blank in the test set.
- **license** - license of source material - this is blank in the test set.
- **excerpt** - text whose reading ease is to be predicted
- **target** - reading ease
- **standard_error** - measure of spread of scores among multiple raters for each excerpt. Not included for test data.

## Load Data

In [None]:
data_dir = '/kaggle/input/commonlitreadabilityprize'
train_data_path = os.path.join(data_dir, 'train.csv')
test_data_path = os.path.join(data_dir, 'test.csv')

train_df = pd.read_csv(train_data_path)
test_df = pd.read_csv(test_data_path)

print(len(train_df))
print(len(test_df))

In [None]:
train_df.head()

In [None]:
train_df.tail()

In [None]:
test_df.head()

# Looking at Examples

In [None]:
example = train_df.iloc[-1]

print(example)
print(example['excerpt'])

In [None]:
for i, excerpt in enumerate(train_df['excerpt'][:5]):
    print(f'Excerpt #{i}')
    print(excerpt + '\n')

## Excerpt Length (characters)

In [None]:
train_df['excerpt_len'] = train_df['excerpt'].apply(len)
# distplot is deprecated in recent seaborn releases; histplot draws a comparable histogram
sns.histplot(train_df['excerpt_len'])

## Unique Characters

In [None]:
# Collect every distinct character that appears in any excerpt
all_chars = set(''.join(train_df['excerpt']))
for c in sorted(all_chars):
    print(c + ' ', end='')

print('\n\n')

# Print each character next to its Unicode code point
for c in sorted(all_chars):
    print(f'({c}, {ord(c)}) ', end='')

# Hard Characters (Code Points 176 - 339)

In [None]:
# Make a boolean column for excerpts with "hard characters"

# Characters whose code points fall in the range 176-339 (inclusive)
hard_chars = {c for c in all_chars if 176 <= ord(c) <= 339}

print(hard_chars)

train_df['has_hard_char'] = train_df['excerpt'].apply(lambda x: any(c in hard_chars for c in x))
# Number of excerpts containing at least one hard character
train_df['has_hard_char'].sum()
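
The printed set of hard characters can be hard to read at a glance. As a minimal sketch, the standard-library `unicodedata` module can attach an official Unicode name to each character. The helper name `describe_chars` and the three sample characters are illustrative assumptions, not part of the original notebook; in practice you would pass in `hard_chars`.

```python
import unicodedata


def describe_chars(chars):
    """Return (character, code point, Unicode name) tuples sorted by code point."""
    return [
        (c, ord(c), unicodedata.name(c, 'UNKNOWN'))
        for c in sorted(chars)
    ]


# Illustrative input only (code points 176, 233, 339); in the notebook, pass `hard_chars`.
sample_chars = {'\u00b0', '\u00e9', '\u0153'}
for char, code_point, name in describe_chars(sample_chars):
    print(f'{char!r} {code_point} {name}')
```

Seeing the names makes it easier to judge whether a character is an accented letter, a typographic symbol, or something else.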

## Excerpt Length (words/tokens)

In [None]:
# Count whitespace-separated tokens; histplot replaces the deprecated distplot
sns.histplot(train_df['excerpt'].apply(lambda x: len(x.split())))

# Excerpt Target Distribution

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
# histplot replaces the deprecated distplot
sns.histplot(train_df['target'])

# Target Correlation with "Hard Characters"

In [None]:
plt.subplots(figsize=(3, 7))

# Recent seaborn versions require x and y to be passed as keyword arguments
sns.violinplot(x=train_df['has_hard_char'], y=train_df['target'], palette=['b', 'r'])
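
The violin plot compares the two groups visually. A per-group summary puts numbers on that comparison. The sketch below assumes a frame with the notebook's `has_hard_char` and `target` columns; the helper name `summarize_target_by_flag` and the toy values are made up for illustration and are not CommonLit data.

```python
import pandas as pd


def summarize_target_by_flag(df, flag_col='has_hard_char', target_col='target'):
    """Count, mean, and standard deviation of the target for each flag value."""
    return df.groupby(flag_col)[target_col].agg(['count', 'mean', 'std'])


# Tiny made-up frame for illustration; these are not CommonLit values.
toy_df = pd.DataFrame({
    'has_hard_char': [False, False, True, True],
    'target': [0.5, -0.5, -1.0, -2.0],
})
print(summarize_target_by_flag(toy_df))
```

In the notebook, `summarize_target_by_flag(train_df)` would give the same table for the real data. The group sizes matter here: a small group with hard characters makes its mean less reliable.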

# Target Correlation with Excerpt Length

In [None]:
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(x=train_df['excerpt_len'], y=train_df['target'], alpha=0.4)

In [None]:
correlation_matrix = np.corrcoef(train_df['excerpt_len'], train_df['target'])
correlation_xy = correlation_matrix[0,1]
r_squared = correlation_xy**2

print('Linear fit r^2:', r_squared)
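
The squared Pearson correlation above equals the coefficient of determination of a least-squares line through the same points. The sketch below shows that equivalence by fitting the line explicitly with `np.polyfit` and computing 1 - SS_res / SS_tot. The helper name `linear_fit_r_squared` and the sample numbers are illustrative assumptions.

```python
import numpy as np


def linear_fit_r_squared(x, y):
    """Fit y = slope * x + intercept by least squares and return (slope, intercept, r^2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)          # variation left unexplained by the line
    ss_tot = np.sum((y - y.mean()) ** 2)     # total variation around the mean
    return slope, intercept, 1.0 - ss_res / ss_tot


# Made-up numbers for illustration only.
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [1.1, 1.9, 3.2, 3.8, 5.1]
slope, intercept, r_squared_demo = linear_fit_r_squared(x_demo, y_demo)
print('slope:', slope, 'intercept:', intercept, 'r^2:', r_squared_demo)
```

In the notebook, calling `linear_fit_r_squared(train_df['excerpt_len'], train_df['target'])` should reproduce the value printed above while also exposing the slope of the fitted line.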

# Standard Error Distribution

In [None]:
# histplot replaces the deprecated distplot
sns.histplot(train_df['standard_error'])

# Target Correlation with Standard Error

In [None]:
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(x=train_df['target'], y=train_df['standard_error'], alpha=0.4)
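
One way to put numbers on the shape of this scatter plot is to average the standard error within bins of the target. This is a rough sketch: the helper name `mean_se_by_target_bin`, the choice of five equal-width bins, and the synthetic data are all assumptions for illustration, and the synthetic curve is not fitted to the competition data.

```python
import numpy as np
import pandas as pd


def mean_se_by_target_bin(df, n_bins=5, target_col='target', se_col='standard_error'):
    """Average standard error within equal-width bins of the target."""
    bins = pd.cut(df[target_col], bins=n_bins)
    return df.groupby(bins, observed=True)[se_col].mean()


# Synthetic data for illustration only: the spread is built to grow away from target = -1.
toy_df = pd.DataFrame({'target': np.linspace(-3.0, 1.0, 50)})
toy_df['standard_error'] = 0.45 + 0.03 * (toy_df['target'] + 1.0) ** 2
print(mean_se_by_target_bin(toy_df))
```

With the real data, `mean_se_by_target_bin(train_df)` would show whether the outer bins carry higher average standard errors than the central ones.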

## Finding the outlier

In [None]:
train_df[train_df['target'] == 0]

# Key Findings
- The length of excerpts is ~700-1300 characters or ~140-200 words
- The excerpts are written in English, with the exception of some special characters
- The presence of special characters seems to correlate with the difficulty of the excerpts
- The target difficulty in training examples is within the range: -4 < x < 2
- The distribution of targets is roughly Gaussian
- Standard errors are all within the range: 0.4 < x < 0.7
- The distribution of standard errors is left skewed
- Targets near the extremes tend to have higher standard errors in a fairly predictable pattern
- There is one clear outlier on row #106, and high standard error data points could also potentially be considered outliers
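
The range and skewness statements above were read off the plots. As a hedged sketch, they can also be checked numerically with pandas. The helper name `summarize_column` and the toy series are illustrative assumptions. A negative `skew` means a longer tail on the left, and a positive one means a longer tail on the right.

```python
import pandas as pd


def summarize_column(series):
    """Range, mean, and sample skewness for a numeric column."""
    return {
        'min': series.min(),
        'max': series.max(),
        'mean': series.mean(),
        'skew': series.skew(),
    }


# Made-up values with a long right tail, for illustration only.
toy = pd.Series([1.0, 2.0, 2.0, 3.0, 10.0])
print(summarize_column(toy))
```

In the notebook, `summarize_column(train_df['target'])` and `summarize_column(train_df['standard_error'])` would give the corresponding figures for the training set. A single extreme row can dominate the skewness, so it may be worth computing it with and without the outlier.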