# What

> - The DARWIN dataset contains handwriting data collected while participants performed 25 different tasks.  
> - It contains 452 columns (an ID, 25 × 18 = 450 features, and the class label) and 174 rows.  
> - The last column is the target class: P - Patient and H - Healthy   
> - 89 Patient records and 85 Healthy records 
> - For each task, 18 features were extracted. Each column is named after its feature, followed by the number of the task the feature was extracted from. 
> - E.g., the column with the header "total_time8" holds the values of the "total time" feature extracted from task #8. 
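
The naming scheme above can be split back into a feature name and a task number with a small regular expression. This is a minimal sketch that assumes every feature column ends in its task number and that feature names themselves never end in a digit; `split_column` and `COLUMN_PATTERN` are illustrative names, not part of the notebook.

```python
import re

# A feature name followed by the task number at the very end, e.g. "total_time8".
COLUMN_PATTERN = re.compile(r"^(?P<feature>.+?)(?P<task>\d+)$")


def split_column(name):
    """Return (feature, task) for a feature column, or None for columns such as ID or class."""
    match = COLUMN_PATTERN.match(name)
    if match is None:
        return None
    return match.group("feature"), int(match.group("task"))


for column in ["total_time8", "mean_jerk_in_air25", "ID", "class"]:
    print(column, "->", split_column(column))
```

Columns without a trailing number, such as `ID` and `class`, return `None`, which makes it easy to separate them from the 450 feature columns.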

# Why

> This dataset was created to support the development of better solutions for diagnosing Alzheimer's disease. Since Alzheimer's affects cognitive and motor functions, handwriting analysis can reveal early signs of impairment. By studying handwriting tasks, we can identify patterns that differentiate healthy individuals from patients, potentially enabling earlier and more accurate diagnosis.

# How

> The problem: binary classification, and more specifically risk-sensitive classification, because a false negative (a patient classified as healthy) is far more costly than a false positive.  
> The solution: a decision tree ensemble trained with gradient boosting, because each new tree is fit to the errors of the current ensemble and therefore concentrates on the cases that are still hard to classify.
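
One way to make the false-negative preference concrete is to attach a cost to each error type and choose the decision threshold that minimizes the total cost. The sketch below is not part of the notebook: the cost values (`fn_cost`, `fp_cost`), the toy labels and scores, and the function names are all illustrative assumptions, with label 1 standing for Patient and 0 for Healthy.

```python
def count_errors(y_true, y_score, threshold):
    """Return (false_negatives, false_positives) at the given threshold.

    Label 1 means Patient, 0 means Healthy; a score >= threshold predicts Patient.
    """
    false_negatives = 0
    false_positives = 0
    for truth, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if truth == 1 and predicted == 0:
            false_negatives += 1
        elif truth == 0 and predicted == 1:
            false_positives += 1
    return false_negatives, false_positives


def total_cost(y_true, y_score, threshold, fn_cost=5.0, fp_cost=1.0):
    """Weighted error cost; fn_cost and fp_cost are made-up illustrative values."""
    fn, fp = count_errors(y_true, y_score, threshold)
    return fn_cost * fn + fp_cost * fp


def pick_threshold(y_true, y_score, candidates, fn_cost=5.0, fp_cost=1.0):
    """Return the candidate threshold with the lowest total cost."""
    return min(candidates, key=lambda t: total_cost(y_true, y_score, t, fn_cost, fp_cost))


# Toy labels and predicted patient probabilities (invented for illustration).
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.6, 0.35, 0.4, 0.2, 0.1]
chosen = pick_threshold(y_true, y_score, candidates=[0.3, 0.5, 0.7])
print(chosen, count_errors(y_true, y_score, chosen))
```

With a missed patient costing five times as much as a false alarm, the lowest candidate threshold wins on this toy data: it accepts one false positive in exchange for zero false negatives.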

# Imports

In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.preprocessing import LabelEncoder


# Plot Config

In [16]:
%matplotlib inline
%config InlineBackend.figure_format='retina'
plt.style.use('default')
plt.rcParams['axes.grid'] = True
plt.rcParams['axes.grid.axis'] = 'y'
plt.rcParams['grid.linestyle'] = ':'
plt.rcParams['grid.alpha'] = 0.7
plt.rcParams['axes.spines.top'] = False
plt.rcParams['axes.spines.right'] = False

# Data Preprocessing

In [17]:
df = pd.read_csv('../data/darwin/DARWIN.csv')
df.head()

|  | ID | air_time1 | disp_index1 | gmrt_in_air1 | gmrt_on_paper1 | max_x_extension1 | max_y_extension1 | mean_acc_in_air1 | mean_acc_on_paper1 | mean_gmrt1 | ... | mean_jerk_in_air25 | mean_jerk_on_paper25 | mean_speed_in_air25 | mean_speed_on_paper25 | num_of_pendown25 | paper_time25 | pressure_mean25 | pressure_var25 | total_time25 | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | id_1 | 5160 | 1.3e-05 | 120.804174 | 86.853334 | 957 | 6601 | 0.3618 | 0.217459 | 103.828754 | ... | 0.141434 | 0.024471 | 5.596487 | 3.184589 | 71 | 40120 | 1749.278166 | 296102.7676 | 144605 | P |
| 1 | id_2 | 51980 | 1.6e-05 | 115.318238 | 83.448681 | 1694 | 6998 | 0.272513 | 0.14488 | 99.383459 | ... | 0.049663 | 0.018368 | 1.665973 | 0.950249 | 129 | 126700 | 1504.768272 | 278744.285 | 298640 | P |
| 2 | id_3 | 2600 | 1e-05 | 229.933997 | 172.761858 | 2333 | 5802 | 0.38702 | 0.181342 | 201.347928 | ... | 0.178194 | 0.017174 | 4.000781 | 2.392521 | 74 | 45480 | 1431.443492 | 144411.7055 | 79025 | P |
| 3 | id_4 | 2130 | 1e-05 | 369.403342 | 183.193104 | 1756 | 8159 | 0.556879 | 0.164502 | 276.298223 | ... | 0.113905 | 0.01986 | 4.206746 | 1.613522 | 123 | 67945 | 1465.843329 | 230184.7154 | 181220 | P |
| 4 | id_5 | 2310 | 7e-06 | 257.997131 | 111.275889 | 987 | 4732 | 0.266077 | 0.145104 | 184.63651 | ... | 0.121782 | 0.020872 | 3.319036 | 1.680629 | 92 | 37285 | 1841.702561 | 158290.0255 | 72575 | P |


In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 174 entries, 0 to 173
Columns: 452 entries, ID to class
dtypes: float64(300), int64(150), object(2)
memory usage: 614.6+ KB


> Checking for missing values: the dataset has none.

In [27]:
pd.set_option('display.max_rows', None)
print(df.isna().sum())

ID                       0
air_time1                0
disp_index1              0
gmrt_in_air1             0
gmrt_on_paper1           0
max_x_extension1         0
max_y_extension1         0
mean_acc_in_air1         0
mean_acc_on_paper1       0
mean_gmrt1               0
mean_jerk_in_air1        0
mean_jerk_on_paper1      0
mean_speed_in_air1       0
mean_speed_on_paper1     0
num_of_pendown1          0
paper_time1              0
pressure_mean1           0
pressure_var1            0
total_time1              0
air_time2                0
disp_index2              0
gmrt_in_air2             0
gmrt_on_paper2           0
max_x_extension2         0
max_y_extension2         0
mean_acc_in_air2         0
mean_acc_on_paper2       0
mean_gmrt2               0
mean_jerk_in_air2        0
mean_jerk_on_paper2      0
mean_speed_in_air2       0
mean_speed_on_paper2     0
num_of_pendown2          0
paper_time2              0
pressure_mean2           0
pressure_var2            0
total_time2              0
...
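
Scanning 452 printed counts by eye is error-prone. A more compact check, sketched here on a made-up frame rather than the real `df`, lists only the columns that contain missing values and reports the grand total. `missing_summary` is a hypothetical helper name, and the toy values are invented for illustration.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.Series:
    """Return the per-column missing-value counts, keeping only columns with at least one."""
    counts = frame.isna().sum()
    return counts[counts > 0]


# Toy frame with one deliberately missing value (not real DARWIN data).
toy = pd.DataFrame({
    "air_time1": [5160.0, 51980.0, None],
    "total_time1": [100, 200, 300],
    "class": ["P", "P", "H"],
})

summary = missing_summary(toy)
total_missing = int(toy.isna().sum().sum())
print(summary)
print("total missing:", total_missing)
```

On a frame with no missing values, `summary` is empty and `total_missing` is 0, so the whole check fits on one or two lines of output.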

> Using a label encoder to convert the categorical (object-dtype) columns to integer codes.

In [20]:
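# Encode every object-dtype column (here ID and class) as integer codes.
# LabelEncoder sorts the labels, so class becomes H -> 0 and P -> 1.
lb = LabelEncoder()
categorical_columns = df.select_dtypes(include="object").columns
for col in categorical_columns:
    # fit_transform refits lb on each column, so lb ends up holding only the last column's classes
    df[col] = lb.fit_transform(df[col])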
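
Because `ID` is also an object column, the loop above encodes it too, turning each participant identifier into an arbitrary integer. If the identifier should not reach the model, one alternative is to drop `ID` and map the class labels explicitly. This is a different approach from the notebook's, sketched on a made-up frame; `CLASS_CODES`, `features`, and `target` are illustrative names.

```python
import pandas as pd

# Toy frame with the same column roles as the notebook's df (values invented).
toy = pd.DataFrame({
    "ID": ["id_1", "id_2", "id_3"],
    "total_time1": [100, 200, 300],
    "class": ["P", "H", "P"],
})

# Assumed mapping: Healthy -> 0, Patient -> 1, so the positive class is Patient.
CLASS_CODES = {"H": 0, "P": 1}

# ID is an identifier rather than a measurement, so it is left out of the features.
features = toy.drop(columns=["ID", "class"])
target = toy["class"].map(CLASS_CODES)

# map() leaves NaN for any label missing from CLASS_CODES, so check before casting.
if target.isna().any():
    raise ValueError("unexpected class label")
target = target.astype(int)

print(features.columns.tolist(), target.tolist())
```

An explicit mapping keeps the meaning of 0 and 1 visible in the code, which matters later when reading recall or false-negative counts for the Patient class.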