<a href="https://www.kaggle.com/code/yaaangzhou/linking-eda-and-baseline-model?scriptVersionId=145023746" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

**Created by Yang Zhou**

**[Linking]EDA and Baseline Model**

**3 Oct 2023**

# <center style="font-family: consolas; font-size: 32px; font-weight: bold;">[Linking]EDA and Baseline Model</center>
<p><center style="color:#949494; font-family: consolas; font-size: 20px;">Use typing behavior to predict essay quality</center></p>

***

# <center style="font-family: consolas; font-size: 32px; font-weight: bold;">Insights and Tricks</center>



# <center style="font-family: consolas; font-size: 32px; font-weight: bold;">Version Detail</center>

# 0. Imports

In [None]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# 1. Load Data

In [None]:
train_logs = pd.read_csv('/kaggle/input/linking-writing-processes-to-writing-quality/train_logs.csv')
train_scores = pd.read_csv('/kaggle/input/linking-writing-processes-to-writing-quality/train_scores.csv')
train = pd.merge(train_logs,train_scores,on='id')

test_logs = pd.read_csv('/kaggle/input/linking-writing-processes-to-writing-quality/test_logs.csv')
submission = pd.read_csv('/kaggle/input/linking-writing-processes-to-writing-quality/sample_submission.csv')

In [None]:
print('The shape of the train data:', train.shape)

print('The shape of the test data:', test_logs.shape)
print('The shape of the sample submission:', submission.shape)

In [None]:
train.head(3)

In [None]:
num_var = ['down_time','up_time','action_time','cursor_position','word_count']
target = 'score'

In [None]:
train.info()

According to the competition's data description, the columns are interpreted as follows:

- `id` - The unique ID of the essay
- `event_id` - The index of the event, ordered chronologically
- `down_time` - The time of the down event in milliseconds
- `up_time` - The time of the up event in milliseconds
- `action_time` - The duration of the event (the difference between down_time and up_time)
- `activity` - The category of activity which the event belongs to
    - `Nonproduction` - The event does not alter the text in any way
    - `Input` - The event adds text to the essay
    - `Remove/Cut` - The event removes text from the essay
    - `Paste` - The event changes the text through a paste input
    - `Replace` - The event replaces a section of text with another string
    - `Move From [x1, y1] To [x2, y2]` - The event moves a section of text spanning character index x1, y1 to a new location x2, y2
- `down_event` - The name of the event when the key/mouse is pressed
- `up_event` - The name of the event when the key/mouse is released
- `text_change` - The text that changed as a result of the event (if any)
- `cursor_position` - The character index of the text cursor after the event
- `word_count` - The word count of the essay after the event
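The stated relationship between `action_time`, `down_time` and `up_time` is easy to sanity-check. Below is a minimal sketch on a toy frame; the values and the names `toy_logs`, `recomputed` and `mismatches` are made up for illustration and are not taken from the competition data.

```python
import pandas as pd

# Toy keystroke log with invented values (not competition data).
toy_logs = pd.DataFrame({
    'id': ['a', 'a', 'b'],
    'event_id': [1, 2, 1],
    'down_time': [4526, 4558, 10200],
    'up_time': [4557, 4962, 10310],
    'action_time': [31, 404, 110],
})

# action_time should equal up_time - down_time for every event.
recomputed = toy_logs['up_time'] - toy_logs['down_time']
mismatches = toy_logs[recomputed != toy_logs['action_time']]
print('Rows where action_time != up_time - down_time:', len(mismatches))
```

On the real logs the same comparison could be run as `(train['up_time'] - train['down_time'] == train['action_time']).all()`.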

# 2. Basic EDA

In [None]:
train.describe().T\
    .style.bar(subset=['mean'], color=px.colors.qualitative.G10[2])\
    .background_gradient(subset=['std'], cmap='Blues')\
    .background_gradient(subset=['50%'], cmap='BuGn')

In [None]:
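def summary(df):
    # Per-column overview: dtype, missing values, unique values, non-null count.
    # The local name avoids shadowing the built-in sum().
    summ = pd.DataFrame(df.dtypes, columns=['dtypes'])
    summ['missing#'] = df.isna().sum()
    summ['missing%'] = df.isna().mean() * 100  # share of missing rows, in percent
    summ['uniques'] = df.nunique().values
    summ['count'] = df.count().values
    #summ['skew'] = df.skew().values
    return summ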

summary(train).style.background_gradient(cmap='Blues')

Let's take a look at the `activity` column. Its `Move From [x1, y1] To [x2, y2]` values embed coordinates in the string, so I'm considering creating new columns to store the start and end coordinates.

In [None]:
train.activity.value_counts()

The coordinates in `Move From [x1, y1] To [x2, y2]` appear to be basically random, so each distinct string ends up as its own rare category in the counts.
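The idea of storing the start and end coordinates in separate columns could look roughly like this. It is only a sketch on invented values; the column names `move_x1`, `move_y1`, `move_x2`, `move_y2` and `activity_clean` are hypothetical, and the regular expression assumes the strings follow the `Move From [x1, y1] To [x2, y2]` layout exactly.

```python
import pandas as pd

# Toy activity values; the coordinates are invented for illustration.
toy = pd.DataFrame({'activity': [
    'Input',
    'Move From [284, 292] To [282, 290]',
    'Remove/Cut',
]})

# Capture the four integers of a "Move From [x1, y1] To [x2, y2]" string.
# Rows that do not match the pattern get NaN in every coordinate column.
pattern = r'Move From \[(\d+), (\d+)\] To \[(\d+), (\d+)\]'
coords = toy['activity'].str.extract(pattern).apply(pd.to_numeric)
coords.columns = ['move_x1', 'move_y1', 'move_x2', 'move_y2']  # hypothetical names

# Collapse every move event into a single 'Move' category and
# keep the coordinates in their own columns.
toy['activity_clean'] = toy['activity'].where(coords['move_x1'].isna(), 'Move')
toy = pd.concat([toy, coords], axis=1)
print(toy)
```

Collapsing the strings this way keeps `activity` a low-cardinality categorical, while the coordinates remain available as numeric features if they turn out to be useful.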

## Distribution of numeric variables

In [None]:
# Stack train and test numeric columns, tagging each row with its source
df = pd.concat([train[num_var].assign(Source='Train'),
                test_logs[num_var].assign(Source='Test')],
               axis=0, ignore_index=True)

fig, axes = plt.subplots(len(num_var), 3, figsize=(16, len(num_var) * 4.2),
                         gridspec_kw={'hspace': 0.35, 'wspace': 0.3, 'width_ratios': [0.80, 0.20, 0.20]})

for i, col in enumerate(num_var):
    # Left panel: KDE of train vs. test
    ax = axes[i, 0]
    sns.kdeplot(data=df[[col, 'Source']], x=col, hue='Source', ax=ax, linewidth=2.1)
    ax.set_title(f"\n{col}", fontsize=9, fontweight='bold')
    ax.grid(visible=True, which='both', linestyle='--', color='lightgrey', linewidth=0.75)
    ax.set(xlabel='', ylabel='')

    # Middle panel: box plot of the train values
    ax = axes[i, 1]
    sns.boxplot(data=df.loc[df.Source == 'Train', [col]], y=col, width=0.25,
                saturation=0.90, linewidth=0.90, fliersize=2.25, color='#037d97',
                ax=ax)
    ax.set(xlabel='', ylabel='')
    ax.set_title("Train", fontsize=9, fontweight='bold')

    # Right panel: box plot of the test values
    ax = axes[i, 2]
    sns.boxplot(data=df.loc[df.Source == 'Test', [col]], y=col, width=0.25,
                saturation=0.6, linewidth=0.90, fliersize=2.25, color='#E4591E',
                ax=ax)
    ax.set(xlabel='', ylabel='')
    ax.set_title("Test", fontsize=9, fontweight='bold')

plt.tight_layout()
plt.show()

## Distribution of target

In [None]:
# score is defined once per essay, so count essays per score value from train_scores
sns.countplot(data=train_scores, x=target)

I can also check how the numeric variables correlate with each other and with the target.

## Correlation Plot

In [None]:
# Include the target so its correlation with each numeric variable is shown
corr_matrix = train[num_var + [target]].corr()
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

plt.figure(figsize=(15, 12))
sns.heatmap(corr_matrix, mask=mask, annot=False, cmap='Blues', fmt='.2f', linewidths=1, square=True, annot_kws={"size": 9} )
plt.title('Correlation Matrix', fontsize=15)
plt.show()
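Because `score` is assigned per essay while the logs hold one row per event, another way to look at correlation with the target is to aggregate the events per `id` first. The following is a rough sketch on invented values, not the method used in this notebook; the aggregate names `n_events`, `mean_action_time` and `final_word_count` are hypothetical, and `'last'` assumes the events are already in chronological order.

```python
import pandas as pd

# Toy event log and scores with invented values (not competition data).
toy_logs = pd.DataFrame({
    'id': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c'],
    'action_time': [100, 120, 80, 200, 220, 90, 110, 100],
    'word_count': [1, 2, 3, 1, 2, 1, 2, 4],
})
toy_scores = pd.DataFrame({'id': ['a', 'b', 'c'], 'score': [3.5, 2.0, 4.5]})

# One row per essay: number of events, mean action time, word count after the last event.
per_essay = toy_logs.groupby('id').agg(
    n_events=('action_time', 'size'),
    mean_action_time=('action_time', 'mean'),
    final_word_count=('word_count', 'last'),
).reset_index()

# Attach the score and correlate each aggregate with it.
per_essay = per_essay.merge(toy_scores, on='id')
corr_with_target = per_essay.drop(columns='id').corr()['score'].drop('score')
print(corr_with_target)
```

With only three toy essays the numbers themselves mean nothing; the point is the shape of the computation, which gives each essay equal weight instead of weighting essays by their number of events.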