<a id = "toc"></a>

# Table of Contents

* [1. Import the Libraries and Dataset](#import_libsAndData)
* [2. Exploratory Data Analysis](#eda)
* [3. Data Pre-Processing](#data_preprocessing)
    * [3.1 Numerical Features](#numerical_features)
    * [3.2 Categorical Features](#categorical_features)
* [4. Feature Selection](#feature_selection)
* [5. Model Training](#model_training)
* [6. Performance Assessment](#performance_assesment)





# 1. Import the Libraries and Dataset <a class="anchor" id="import_libsAndData"></a>

In [89]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from math import ceil
from ydata_profiling import ProfileReport

In [90]:
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

In [None]:
train_data = pd.read_csv('../Data/train_data.csv', index_col='Claim Identifier')


In [None]:
train_data.tail()

In [None]:
train_data.shape


A large number of the features have dtype `object`.

In [None]:
train_data.info()

In [None]:
train_data.describe(include='all')

We have duplicate rows that need to be removed.

In [None]:
#duplicated values
train_data.duplicated().sum()
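
A minimal sketch of how the duplicates could be removed later on, using a small made-up frame in place of `train_data`. Note that `duplicated()` compares column values only, so two rows with different _`Claim Identifier`_ index values can still count as duplicates. The toy values and the choice of `keep="first"` are assumptions, not a decision taken in this notebook.

```python
import pandas as pd

# toy stand-in for train_data (hypothetical values), indexed like the real frame
toy = pd.DataFrame(
    {
        "Assembly Date": ["2020-01-01", "2020-01-01", "2020-01-02"],
        "Age at Injury": [31.0, 31.0, 45.0],
    },
    index=pd.Index([1, 2, 3], name="Claim Identifier"),
)

# duplicated() ignores the index, so the second row is flagged as a duplicate
n_duplicates = toy.duplicated().sum()

# keep the first occurrence of each duplicated row
toy_dedup = toy.drop_duplicates(keep="first")
print(n_duplicates, toy_dedup.shape)
```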

Since all the claims have an _`Assembly Date`_, we can assume that every claim always has a _`Claim Identifier`_ (which we use as the index) and an _`Assembly Date`_. So let's check for rows that only have _`Assembly Date`_ filled, i.e., rows that are otherwise empty.

In [None]:

only_assembly_date = train_data.drop(columns=['Assembly Date']).isnull().all(axis=1) & train_data['Assembly Date'].notnull()


num_only_assembly_date_filled = only_assembly_date.sum()
print(f"Number of rows with only 'Assembly Date' filled: {num_only_assembly_date_filled}")
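
If these otherwise empty rows are to be dropped during pre-processing, the same mask can be reused. The sketch below runs on a small hypothetical frame (the values are invented for illustration) and only shows the mechanics.

```python
import numpy as np
import pandas as pd

# toy stand-in for train_data (hypothetical values)
toy = pd.DataFrame(
    {
        "Assembly Date": ["2020-01-01", "2020-01-02", "2020-01-03"],
        "Accident Date": ["2019-12-30", np.nan, np.nan],
        "Age at Injury": [31.0, np.nan, 45.0],
    },
    index=pd.Index([1, 2, 3], name="Claim Identifier"),
)

# same mask as above: every column except 'Assembly Date' is missing
only_assembly_date = (
    toy.drop(columns=["Assembly Date"]).isnull().all(axis=1)
    & toy["Assembly Date"].notnull()
)

# keep only the rows that carry some information besides the assembly date
toy_clean = toy.loc[~only_assembly_date]
print(toy_clean.shape)
```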

---



## Percentage of missing values per feature


In [None]:
train_data.isnull().sum()/train_data.shape[0]*100
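
The same percentages are easier to read when sorted. A possible variant, shown on a hypothetical toy frame: `isnull().mean()` gives the fraction of missing entries per column, and a filter picks out the columns above a chosen threshold (50% here is an assumption matching the observations further down).

```python
import numpy as np
import pandas as pd

# toy stand-in for train_data (hypothetical values)
toy = pd.DataFrame(
    {
        "IME-4 Count": [np.nan, np.nan, np.nan, 2.0],
        "C-3 Date": [np.nan, "2020-01-05", np.nan, "2020-02-01"],
        "Assembly Date": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"],
    }
)

# fraction of missing entries per column, as a percentage, largest first
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)

# columns where more than half of the entries are missing
mostly_missing = missing_pct[missing_pct > 50].index.tolist()
print(missing_pct)
print(mostly_missing)
```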

In [None]:
train_data.describe(include='O')
  

### Observations

#### Missing Values ####
_`OIICS Nature of Injury Description`_ has no values.

_`IME-4 Count`_, _`First Hearing Date`_ and _`C-3 Date`_ have >50% of entries missing (77.6%, 74.5% and 68.4%, respectively).

#### Single Value Feature ####
The feature _`WCB Decision`_ has only one value across the whole dataset (excluding missing values).

#### Categorical features that could be represented as boolean ####
Some categorical variables that only present 2 unique values, usually '1s and 0s' or 'Y or N' could be changed to boolean. Since we are not doing data pre-processing yet, these changes would have to preserve any NaN data. The variables are:
- Agreement Reached (0s and 1s)
- Attorney/Representative (Y or N)
- COVID-19 Indicator (Y or N)
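
Single-value and two-value features like the ones listed above can also be found programmatically with `nunique`. This is a sketch on a hypothetical toy frame; the values and the `Other Feature` column are invented for illustration.

```python
import numpy as np
import pandas as pd

# toy stand-in for train_data (hypothetical values)
toy = pd.DataFrame(
    {
        "WCB Decision": ["decision A", "decision A", np.nan, "decision A"],
        "Attorney/Representative": ["Y", "N", np.nan, "N"],
        "Agreement Reached": [0.0, 1.0, np.nan, 0.0],
        "Other Feature": ["A", "B", "C", np.nan],
    }
)

# number of distinct non-missing values per column
n_unique = toy.nunique(dropna=True)

single_value_cols = n_unique[n_unique == 1].index.tolist()  # carry no information
binary_cols = n_unique[n_unique == 2].index.tolist()        # boolean candidates
print(single_value_cols, binary_cols)
```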

---

# Type conversion
Here we convert the categorical variables that can be represented as boolean, meaning they only have two unique values, while still preserving the NaN entries present in the dataset.

In [102]:
#function to transform Y and N into boolean while preserving the NaNs
def transform_strings_in_bool(train_data, col_names):
    for col_name in col_names:
        train_data[col_name] = train_data[col_name].map({'Y': True, 'N': False, np.nan: np.nan})
    return train_data

In [103]:
# Agreement Reached only has values of 0 and 1, so let's convert it to the nullable boolean dtype
train_data['Agreement Reached'] = train_data['Agreement Reached'].astype("boolean")


In [None]:

train_data = transform_strings_in_bool(train_data, ['Attorney/Representative','COVID-19 Indicator'])
print(train_data['Attorney/Representative'].unique(), train_data['COVID-19 Indicator'].unique())
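
One thing to keep in mind: mapping to `True`/`False`/`NaN` as above leaves the column with dtype `object`, whereas _`Agreement Reached`_ uses the nullable `"boolean"` dtype. If a consistent dtype is wanted, one option (an alternative sketch, not what the function above does) is to chain `astype("boolean")` after the mapping:

```python
import numpy as np
import pandas as pd

# toy column (hypothetical values)
s = pd.Series(["Y", "N", np.nan, "Y"], name="COVID-19 Indicator")

# values missing from the mapping (including NaN) stay missing;
# the nullable "boolean" dtype then stores them as <NA>
as_bool = s.map({"Y": True, "N": False}).astype("boolean")
print(as_bool.dtype, as_bool.isna().sum())
```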


---

# Visual Exploration

## Numerical Features - univariate analysis

In [None]:
# numerical features only
num_feat = list(train_data.select_dtypes(include='number').columns)
print(num_feat)

# remove categorical variables that are stored as numeric codes
num_feat = [col for col in num_feat if 'Code' not in col]
num_feat = [col for col in num_feat if 'Description' not in col]
num_feat


### Age at Injury

In [None]:
#14 is the minimum age to work in New York
sns.histplot(train_data[train_data['Age at Injury'] > 13]['Age at Injury'],)
plt.show()

sns.boxplot(train_data[train_data['Age at Injury'] > 13]['Age at Injury'])

### Average Weekly Wage

We can see that this feature has a lot of outliers that will need to be treated later in the project.

In [None]:
sns.histplot(train_data['Average Weekly Wage'], log_scale=True)
plt.show()

sns.boxplot(train_data['Average Weekly Wage'])
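
To put a number on the outliers seen in the boxplot, one common option is the 1.5 × IQR rule, which is also what the boxplot whiskers are based on by default. The sketch below uses invented wages; the rule itself is an assumption, since the notebook has not yet settled on an outlier treatment.

```python
import pandas as pd

# toy wages (hypothetical values)
wages = pd.Series(
    [0.0, 500.0, 600.0, 700.0, 800.0, 900.0, 50000.0],
    name="Average Weekly Wage",
)

# interquartile range and the usual 1.5 * IQR fences
q1, q3 = wages.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# flag everything outside the fences
is_outlier = (wages < lower) | (wages > upper)
print(lower, upper, is_outlier.sum())
```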

### Birth Year

Here we can see that _`Birth Year`_ has 25081 entries with the value 0

In [None]:
sns.histplot(train_data[train_data['Birth Year'] > 0]['Birth Year'],)
plt.show()

sns.boxplot(train_data[train_data['Birth Year'] > 0]['Birth Year'])
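
A birth year of 0 is not a real value, so it most likely stands for missing data. A possible treatment, sketched on invented values, is to count the zeros and then turn them into NaN so they are handled together with the other missing entries. Whether this is the right treatment is an assumption to be revisited in pre-processing.

```python
import numpy as np
import pandas as pd

# toy column (hypothetical values)
birth_year = pd.Series([1980.0, 0.0, 1975.0, np.nan, 0.0], name="Birth Year")

# how many entries are exactly 0
n_zeros = (birth_year == 0).sum()

# treat 0 as missing
birth_year_clean = birth_year.replace(0, np.nan)
print(n_zeros, birth_year_clean.isna().sum())
```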

Here we can see that the values for _`Average Weekly Wage`_ seem fine, but due to the large number of zeros, the automatically assigned values will not work here.

In [None]:
print(train_data['Average Weekly Wage'].describe())

### Number of Dependents

In [None]:
sns.countplot(x='Number of Dependents', data=train_data)
plt.xlabel('Number of Dependents')
plt.ylabel('Count')
plt.title('Distribution of Number of Dependents')
plt.show()

sns.boxplot(train_data['Number of Dependents'])

### IME-4 Count

In [None]:
# distribution of IME-4 Count

sns.countplot(x='IME-4 Count', data=train_data)
plt.xlabel('IME-4 Count')
plt.ylabel('Count')
plt.xticks(rotation=90) 
plt.title('Distribution of IME-4 Count')
plt.show()

sns.boxplot(train_data['IME-4 Count'])

In [None]:
fig = plt.figure(figsize=(10, 8))


corr = train_data[num_feat].corr(method="pearson")


sns.heatmap(data=corr, annot=True, )


plt.show()
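
With many features the heatmap gets hard to read, so it can help to list the strongest pairs directly. A sketch on a hypothetical toy frame (values invented so that age and birth year are perfectly related): keep the upper triangle of the correlation matrix, stack it into pairs, and sort by absolute value.

```python
import numpy as np
import pandas as pd

# toy stand-in for train_data[num_feat] (hypothetical values)
toy = pd.DataFrame(
    {
        "Age at Injury": [25.0, 40.0, 55.0, 33.0, 61.0],
        "Birth Year": [1995.0, 1980.0, 1965.0, 1987.0, 1959.0],
        "Number of Dependents": [1.0, 4.0, 0.0, 6.0, 2.0],
    }
)

corr = toy.corr(method="pearson")

# boolean mask for the strict upper triangle, so each pair appears once
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# one row per feature pair, strongest absolute correlation first
pairs = corr.where(upper).stack().dropna().sort_values(key=np.abs, ascending=False)
print(pairs)
```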

----

## Categorical Features

In [None]:
# select categorical features
train_data_cat = train_data.select_dtypes(include='object').columns.tolist()

# add columns that contain 'Code' or 'Description' in their name
train_data_cat += [col for col in train_data.columns if 'Code' in col or 'Description' in col]

# remove any duplicates (in case a column is already in both categories)
train_data_cat = list(set(train_data_cat))

train_data_cat

### Assembly Date

In [None]:
train_datac = train_data.copy()
train_datac['Assembly Date'] = pd.to_datetime(train_datac['Assembly Date'])  
train_datac['year_month'] = train_datac['Assembly Date'].dt.to_period('M')  

train_datac['year_month'].value_counts().sort_index().plot(kind='bar', figsize=(10, 6))
plt.xlabel('Month-Year')
plt.ylabel('Frequency')
plt.title('Frequency of Assembly Date by Month-Year')
plt.xticks(rotation=45)
plt.show()


### Accident Date

In [None]:
train_datac = train_data.copy()
train_datac['Accident Date'] = pd.to_datetime(train_datac['Accident Date'])  

train_datac = train_datac[train_datac['Accident Date'] >= '1961-01-01']

train_datac['year'] = train_datac['Accident Date'].dt.year
print(train_datac['Accident Date'].min(), train_datac['Accident Date'].max())

In [None]:
# consider only dates from 1961 onwards (because of the min value on the cell above)
# .copy() avoids pandas' SettingWithCopyWarning when adding the column below
train_datac = train_datac[train_datac['Accident Date'] >= '1961-01-01'].copy()

# create a group for the first 59 years due to the low frequency
train_datac['year_group'] = train_datac['year'].apply(lambda x: '1961-2019' if x <= 2019 else str(x))

# count the frequency of the accident date by year
yearly_grouped_counts = train_datac['year_group'].value_counts().sort_index()

# plot the frequency of the accident date by year
yearly_grouped_counts.plot(kind='bar', figsize=(10, 6))
plt.xlabel('Year/Group')
plt.ylabel('Frequency')
plt.title('Frequency of Accident Date by Year/Group (1961-2019 grouped)')
plt.xticks(rotation=45)
plt.show()
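
The "group the early years, keep the recent ones" pattern is repeated for several date columns in this section, so it could be pulled into a small helper. `group_early_years` is a hypothetical name, and the dates below are invented; this is only a sketch of the idea.

```python
import pandas as pd


def group_early_years(dates: pd.Series, cutoff_year: int) -> pd.Series:
    """Label years up to cutoff_year as one 'min-cutoff' group, later years individually."""
    years = dates.dropna().dt.year
    label = f"{years.min()}-{cutoff_year}"
    return years.apply(lambda y: label if y <= cutoff_year else str(y))


# toy dates (hypothetical values), including one missing entry
dates = pd.to_datetime(
    pd.Series(["1961-09-06", "2018-03-01", "2020-05-10", "2021-07-22", "2021-08-01", None])
)

counts = group_early_years(dates, cutoff_year=2019).value_counts().sort_index()
print(counts)
```

The resulting `counts` can be passed to `.plot(kind='bar')` exactly like `yearly_grouped_counts` above.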

### C-2 and C-3 Date

In [None]:
train_datac = train_data.copy()
train_datac['C-2 Date'] = pd.to_datetime(train_datac['C-2 Date'])  


train_datac['year'] = train_datac['C-2 Date'].dt.year
print(train_datac['C-2 Date'].min(), train_datac['C-2 Date'].max())

In [None]:
# consider only dates from 1996 onwards (because of the min value on the cell above)
# .copy() avoids pandas' SettingWithCopyWarning when adding the column below
train_datac = train_datac[train_datac['C-2 Date'] >= '1996-01-01'].copy()

# create a single group for 1996-2019 due to the low frequency
# int(x) keeps the labels as '2020' rather than '2020.0' if 'year' was stored as float
train_datac['year_group'] = train_datac['year'].apply(lambda x: '1996-2019' if x <= 2019 else str(int(x)))

# count the frequency of the C-2 date by year
yearly_grouped_counts = train_datac['year_group'].value_counts().sort_index()

# plot the frequency of the C-2 date by year
yearly_grouped_counts.plot(kind='bar', figsize=(10, 6))
plt.xlabel('Year/Group')
plt.ylabel('Frequency')
plt.title('Frequency of C-2 Date by Year/Group (1996-2019 grouped)')
plt.xticks(rotation=45)
plt.show()

In [None]:
train_datac = train_data.copy()

# Convert 'C-2 Date' and 'C-3 Date' to datetime
train_datac['C-2 Date'] = pd.to_datetime(train_datac['C-2 Date'])
train_datac['C-3 Date'] = pd.to_datetime(train_datac['C-3 Date'])

print(train_datac['C-2 Date'].min(), train_datac['C-2 Date'].max())
print(train_datac['C-3 Date'].min(), train_datac['C-3 Date'].max())



In [None]:
# consider only dates from 1996 onwards (because of the min value on the cell above) for C-2
# .copy() avoids pandas' SettingWithCopyWarning when adding columns below
train_datac_c2 = train_datac[train_datac['C-2 Date'] >= '1996-01-01'].copy()
train_datac_c2['year_c2'] = train_datac_c2['C-2 Date'].dt.year

# consider only dates from 1992 onwards (because of the min value on the cell above) for C-3
train_datac_c3 = train_datac[train_datac['C-3 Date'] >= '1992-01-01'].copy()
train_datac_c3['year_c3'] = train_datac_c3['C-3 Date'].dt.year


# group from the min year to 2019 and then by year due to the low frequency
train_datac_c2['year_group_c2'] = train_datac_c2['year_c2'].apply(lambda x: '1996-2019' if x <= 2019 else str(x))
train_datac_c3['year_group_c3'] = train_datac_c3['year_c3'].apply(lambda x: '1992-2019' if x <= 2019 else str(x))

# Calculate the frequency of the C-2 and C-3 dates by year group
yearly_grouped_counts_c2 = train_datac_c2['year_group_c2'].value_counts().sort_index()
yearly_grouped_counts_c3 = train_datac_c3['year_group_c3'].value_counts().sort_index()

In [None]:
# Plot both graphs side by side using subplots
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14, 6))

# plot for C-2 Date
yearly_grouped_counts_c2.plot(kind='bar', ax=axes[0], color='blue')
axes[0].set_xlabel('Year/Group')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Frequency of C-2 Date by Year/Group (1996-2019 grouped)')
axes[0].tick_params(axis='x', rotation=45)

# plot for C-3 Date
yearly_grouped_counts_c3.plot(kind='bar', ax=axes[1], color='green')
axes[1].set_xlabel('Year/Group')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Frequency of C-3 Date by Year/Group (1992-2019 grouped)')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

### First Hearing Date

In [None]:
train_datac = train_data.copy()
train_datac['First Hearing Date'] = pd.to_datetime(train_datac['First Hearing Date'])  
train_datac['year'] = train_datac['First Hearing Date'].dt.year

train_datac['year'].value_counts().sort_index().plot(kind='bar', figsize=(10, 6))
plt.xlabel('Year')
plt.ylabel('Frequency')
plt.title('Frequency of First Hearing Date by Year')
plt.xticks(rotation=45)
plt.show()

### WCB Decision
Oh no, this categorical feature only has one value

In [None]:
wcb_decision_counts = train_data['WCB Decision'].value_counts()

plt.figure(figsize=(8, 8))
wcb_decision_counts.plot.pie(autopct='%1.1f%%', startangle=90,)
plt.ylabel('')
plt.title('Distribution of WCB Decision')
plt.show()
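
A pie chart with a single slice hides the missing entries. A plain count with `dropna=False` shows both the single value and how often the field is empty. The sketch below uses a placeholder label (`"decision A"` is hypothetical, not the actual value in the dataset).

```python
import numpy as np
import pandas as pd

# toy column (hypothetical values)
wcb = pd.Series(["decision A", "decision A", np.nan, "decision A"], name="WCB Decision")

# include missing entries in the count
counts = wcb.value_counts(dropna=False)

# True when the feature has exactly one distinct non-missing value
is_constant = wcb.nunique(dropna=True) == 1
print(counts)
print(is_constant)
```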