# **<h3 align="center">Machine Learning - Project</h3>**
## **<h3 align="center">3. Feature Engineering & Encoding</h3>**
### **<h3 align="center">Group 30 - Project</h3>**


### Group Members
| Name              | Email                        | Student ID |
|-------------------|------------------------------|------------|
| Alexandra Pinto   | 20211599@novaims.unl.pt      | 20211599   |
| Gonçalo Peres     | 20211625@novaims.unl.pt      | 20211625   |
| Leonor Mira       | 20240658@novaims.unl.pt      | 20240658   |
| Miguel Natário    | 20240498@novaims.unl.pt      | 20240498   |
| Nuno Bernardino   | 20211546@novaims.unl.pt      | 20211546   |

---

### **3. Feature Engineering & Encoding Notebook**
**Description:**  
This notebook builds upon the preprocessed dataset from the Preprocessing & Cleaning notebook to prepare features for hierarchical classification. Key steps include:  
- **Feature Engineering:** Create or transform features to enhance predictive power, including interaction terms, date-based calculations, and aggregations.  
- **Encoding Categorical Variables:** Apply encoding techniques suited to the cardinality and nature of categorical variables, such as ordinal encoding, one-hot encoding, or frequency encoding.  
- **Tailored Feature Preparation:** Prepare separate datasets for each level of hierarchical classification, ensuring optimal feature sets for Level 1 (binary classification) and Level 2 (binary and multi-class classification).  
- **Output:** Save the feature-engineered datasets (in CSV or Pickle format) for modeling in subsequent notebooks.  

This notebook ensures the dataset is optimally prepared for hierarchical classification, balancing feature relevance and computational efficiency.  

---

<a id = "toc"></a>

## Table of Contents
* [1. Import the Libraries](#chapter1)
* [2. Load and Prepare Datasets](#chapter2)
* [3. Feature Engineering](#chapter3)
    * [3.1. Carrier-District Interaction](#section_3_1)
    * [3.2. Income Category](#section_3_2)
    * [3.3. Days_To_First_Hearing](#section_3_3)
    * [3.4. Accident Quarter](#section_3_4)
    * [3.5. Accident Year](#section_3_5)
    * [3.6. Accident on Day and Weekend](#section_3_6)
    * [3.7. Age Group](#section_3_7)
    * [3.8. Promptness_category](#section_3_8)
    * [3.9. promptness_C2_category](#section_3_9)
    * [3.10. Zip_Code_Simplified](#section_3_10)
    * [3.11. Carrier Type Merged](#section_3_11)
    * [3.12. Carrier_Name_Simplified](#section_3_12)
    * [3.13. Body_Part_Category](#section_3_13)
    * [3.14. Injury_Nature_Category](#section_3_14)
    * [3.15. Injury_Cause_Category](#section_3_15)
    * [3.16. Risk of Each Job](#section_3_16)
    * [3.17. Relation between Salary and Dependents](#section_3_17)
* [4. Encoding](#chapter4)
* [5. Save Dataset for Modelling](#chapter5)



# 1. Import the Libraries 📚<a class="anchor" id="chapter1"></a>

[Back to ToC](#toc)<br>

In this section, we import the libraries needed for this notebook.

In [None]:
# --- Standard Libraries ---
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import zipfile
import re
import os


# --- Scikit-Learn Modules for Data Partitioning and Preprocessing ---
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder, MinMaxScaler, RobustScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer

# --- Warnings ---
import warnings
warnings.filterwarnings('ignore')

#Import functions from utils
# from utils import analyze_numerical_outliers

# 2. Load and Prepare Datasets 📁<a class="anchor" id="chapter2"></a>
[Back to ToC](#toc)<br>

Now, we will load the dataset prepared in **Notebook 2: Preprocessing & Cleaning**, where we addressed key inconsistencies such as missing values and outliers. This preprocessed dataset serves as the foundation for the feature engineering steps in this notebook.

In [None]:
# Load the datasets saved from Notebook 2
X_train = pd.read_csv("/mnt/data/X_train_cleaned.csv", index_col="Claim Identifier")
X_val = pd.read_csv("/mnt/data/X_val_cleaned.csv", index_col="Claim Identifier")
df_test = pd.read_csv("/mnt/data/df_test_cleaned.csv", index_col="Claim Identifier")
y_train = pd.read_csv("/mnt/data/y_train_cleaned.csv", index_col="Claim Identifier")
y_val = pd.read_csv("/mnt/data/y_val_cleaned.csv", index_col="Claim Identifier")

# Verify the datasets are loaded successfully
X_train.head(), X_val.head(), df_test.head(), y_train.head(), y_val.head()
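
Beyond inspecting `head()`, a quick structural check can confirm that the five tables fit together. The sketch below is illustrative only: `check_splits` is a hypothetical helper, and the toy frames stand in for the CSV files loaded above.

```python
import pandas as pd

def check_splits(X_train, y_train, X_val, y_val, df_test):
    """Hypothetical sanity check for the five loaded tables."""
    # Features and targets must describe the same claims, in the same order
    assert X_train.index.equals(y_train.index), "train features/target misaligned"
    assert X_val.index.equals(y_val.index), "validation features/target misaligned"
    # Train and validation must expose identical feature columns
    assert list(X_train.columns) == list(X_val.columns), "train/val columns differ"
    # Report (rather than fail on) training columns absent from the test set
    return sorted(set(X_train.columns) - set(df_test.columns))

# Toy frames standing in for the CSVs loaded above (illustrative values only)
idx = pd.Index([101, 102], name="Claim Identifier")
X_tr = pd.DataFrame({"Age at Injury": [30, 41], "Carrier Type": ["PRIVATE", "SIF"]}, index=idx)
y_tr = pd.DataFrame({"target": [0, 1]}, index=idx)
X_va, y_va = X_tr.copy(), y_tr.copy()
test = pd.DataFrame({"Age at Injury": [55]}, index=pd.Index([201], name="Claim Identifier"))

missing_in_test = check_splits(X_tr, y_tr, X_va, y_va, test)
print(missing_in_test)
```

Any column returned here would need attention before the same feature engineering steps can be applied to the test set.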


# 3. Feature Engineering <a class="anchor" id="chapter3"></a>
[Back to ToC](#toc)<br>


Feature engineering is the process of preparing data for machine learning models by transforming raw data into meaningful features that enhance model performance. In this section, we create, select, and modify variables to capture significant patterns within the data, making it more informative and useful for the model’s learning process. Through these transformations, we aim to improve the model’s accuracy and effectiveness.

## 3.1. Carrier-District Interaction <a class="anchor" id="section_3_1"></a>
[Back to 3. Feature Engineering ](#chapter3)<br>

Combining **Carrier Type** with **District Name** may reveal regional preferences for certain insurance carriers, which could be useful in understanding regional biases or regulations.

In [None]:
# Creating a new feature by combining Carrier Type and District Name
X_train['Carrier_District_Interaction'] = X_train['Carrier Type'] + "_" + X_train['District Name']

# Apply to the validation set
X_val['Carrier_District_Interaction'] = X_val['Carrier Type'] + "_" + X_val['District Name']

# Apply to the test set
df_test['Carrier_District_Interaction'] = df_test['Carrier Type'] + "_" + df_test['District Name']
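
One detail worth knowing about this kind of concatenation is that missing values propagate: if either part is missing, the combined value is missing too. The sketch below illustrates this on toy data. The helper name `add_interaction` and the example values are assumptions made for illustration, not part of the notebook.

```python
import pandas as pd

def add_interaction(df, left, right, new_col, sep="_"):
    """Hypothetical helper: return a copy of df with new_col = left + sep + right."""
    out = df.copy()
    # Casting to the pandas string dtype makes missing values propagate as <NA>
    out[new_col] = out[left].astype("string") + sep + out[right].astype("string")
    return out

# Toy rows (illustrative values only); the last one has no carrier type
toy = pd.DataFrame({
    "Carrier Type": ["PRIVATE", "SIF", None],
    "District Name": ["NYC", "ALBANY", "NYC"],
})
toy = add_interaction(toy, "Carrier Type", "District Name", "Carrier_District_Interaction")
print(toy["Carrier_District_Interaction"].tolist())
```

Wrapping the step in a function also keeps the train, validation, and test transformations identical by construction.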

## 3.2. Income Category  <a class="anchor" id="section_3_2"></a>
[Back to 3. Feature Engineering ](#chapter3)<br>

Creating categories for **Average Weekly Wage** can simplify the continuous nature of income into meaningful segments such as Low, Average, and High, which could help the model understand different socioeconomic statuses.

In [None]:
# Calculate key percentiles
percentiles = X_train['Average Weekly Wage'].quantile([0.25, 0.5, 0.75, 0.9])
print(percentiles)

0.25     858.0
0.50    1198.0
0.75    1506.0
0.90    1683.0
Name: Average Weekly Wage, dtype: float64


In [None]:
# Defining the bins and labels for categorizing income
# NOTE: these hard-coded edges differ from the percentiles printed above; re-derive them if the training data changes
income_bins = [0, 764.0, 1056.0, 1455.0, 1895.0, float('inf')]  # float('inf') allows us to set an open-ended range
income_labels = ['Low Income', 'Lower-Middle Income', 'Middle Income', 'Upper-Middle Income', 'High Income']

# Creating the new feature for income categories for the train set
X_train['Income_Category'] = pd.cut(X_train['Average Weekly Wage'], bins=income_bins, labels=income_labels)

# Apply to the val set
X_val['Income_Category'] = pd.cut(X_val['Average Weekly Wage'], bins=income_bins, labels=income_labels)

# Apply to the test set
df_test['Income_Category'] = pd.cut(df_test['Average Weekly Wage'], bins=income_bins, labels=income_labels)
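
As an alternative to hard-coded edges, the bin boundaries can be derived directly from the training quantiles and then reused unchanged for validation and test. This is a minimal sketch under assumptions: the wages are toy values, and `include_lowest=True` is an added choice so that a wage of exactly 0 lands in the first bin instead of becoming NaN.

```python
import pandas as pd

# Toy wages standing in for X_train['Average Weekly Wage'] (illustrative values only)
train_wage = pd.Series([0, 400, 800, 1000, 1200, 1400, 1600, 2000, 2500, 3000], dtype=float)

# Inner edges come from the training distribution only
inner_edges = train_wage.quantile([0.25, 0.5, 0.75, 0.9]).tolist()
income_bins = [0, *inner_edges, float("inf")]
income_labels = ['Low Income', 'Lower-Middle Income', 'Middle Income',
                 'Upper-Middle Income', 'High Income']

# include_lowest=True keeps the left edge (0) inside the first bin
income_cat = pd.cut(train_wage, bins=income_bins, labels=income_labels, include_lowest=True)
print(income_cat.value_counts(sort=False))
```

If two quantiles coincide (for example, when many wages are 0), `pd.cut` raises an error about duplicate edges, so the edges should be checked before use.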

Once this categorical feature exists, the original Average Weekly Wage column becomes a candidate for removal, since its information is summarized by Income_Category. The drop below is left commented out, so the column is kept for now.

In [None]:
# # Drop the 'Average Weekly Wage' column as it's represented by 'Income_Category'
# X_train = X_train.drop(columns=['Average Weekly Wage'])
# X_val = X_val.drop(columns=['Average Weekly Wage'])
# df_test = df_test.drop(columns=['Average Weekly Wage'])

## 3.3. Days_To_First_Hearing  <a class="anchor" id="section_3_3"></a>
[Back to 3. Feature Engineering ](#chapter3)<br>


The feature **Days_To_First_Hearing** was created to capture the number of days between the Accident Date and the First Hearing Date. If a First Hearing Date is available, the feature represents the time elapsed, which can help the model understand the speed of the claim process. If the First Hearing Date is missing, it is represented as 0, indicating that a hearing has not occurred yet. This approach provides more nuanced information than simply indicating whether the hearing occurred or not, allowing the model to learn from both the presence and timing of the first hearing.

In [None]:
# List of columns to convert to datetime
date_columns = ['First Hearing Date', 'C-2 Date', 'C-3 Date']

# Replace 0 with NaT and convert to datetime
for col in date_columns:
    if col in X_train.columns:  # Check if the column exists in the DataFrame
        X_train[col] = pd.to_datetime(X_train[col].replace(0, pd.NaT), errors='coerce')
    if col in X_val.columns:
        X_val[col] = pd.to_datetime(X_val[col].replace(0, pd.NaT), errors='coerce')
    if col in df_test.columns:
        df_test[col] = pd.to_datetime(df_test[col].replace(0, pd.NaT), errors='coerce')

# Verify the conversion
print("First Hearing Date column after conversion:")
print(X_train['First Hearing Date'].head())
print("C-2 Date column after conversion:")
print(X_train['C-2 Date'].head())
print("C-3 Date column after conversion:")
print(X_train['C-3 Date'].head())

First Hearing Date column after conversion:
0          NaT
1   2022-08-29
2   2023-03-24
3   2022-11-14
4          NaT
Name: First Hearing Date, dtype: datetime64[ns]
C-2 Date column after conversion:
0   2020-05-08
1   2022-07-14
2   2021-11-08
3   2022-02-04
4   2021-10-29
Name: C-2 Date, dtype: datetime64[ns]
C-3 Date column after conversion:
0          NaT
1   2022-06-16
2   2021-11-04
3          NaT
4          NaT
Name: C-3 Date, dtype: datetime64[ns]


In [None]:
# Convert date columns to datetime
date_columns = ['First Hearing Date', 'Accident Date']
for col in date_columns:
    X_train[col] = pd.to_datetime(X_train[col], errors='coerce')
    X_val[col] = pd.to_datetime(X_val[col], errors='coerce')
    df_test[col] = pd.to_datetime(df_test[col], errors='coerce')

# Function to calculate days to first hearing
def calculate_hearing_days(row):
    if pd.notna(row['First Hearing Date']) and pd.notna(row['Accident Date']):
        return (row['First Hearing Date'] - row['Accident Date']).days
    return 0  # If no hearing date exists, return 0

# Apply the function to create the new feature
X_train['Days_To_First_Hearing'] = X_train.apply(calculate_hearing_days, axis=1)
X_val['Days_To_First_Hearing'] = X_val.apply(calculate_hearing_days, axis=1)
df_test['Days_To_First_Hearing'] = df_test.apply(calculate_hearing_days, axis=1)

# Verify the new feature
print("Days_To_First_Hearing in Train Set:")
print(X_train['Days_To_First_Hearing'].describe())

Days_To_First_Hearing in Train Set:
count    401156.000000
mean         87.779585
std         265.765399
min        -429.000000
25%           0.000000
50%           0.000000
75%          71.000000
max       16373.000000
Name: Days_To_First_Hearing, dtype: float64
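
The row-wise `apply` above can also be expressed as a column subtraction, which is usually much faster on several hundred thousand rows. The sketch below uses toy dates and keeps the same convention (a missing interval becomes 0); it is an illustration, not the notebook's own implementation.

```python
import pandas as pd

# Toy rows (illustrative values only): one complete pair, one without hearing, one without accident date
toy = pd.DataFrame({
    "Accident Date": pd.to_datetime(["2022-01-10", "2022-03-01", None]),
    "First Hearing Date": pd.to_datetime(["2022-02-09", None, "2022-05-01"]),
})

# Subtracting two datetime columns gives timedeltas; .dt.days is NaN where either date is NaT
delta_days = (toy["First Hearing Date"] - toy["Accident Date"]).dt.days

# Same convention as calculate_hearing_days: missing intervals become 0
toy["Days_To_First_Hearing"] = delta_days.fillna(0).astype(int)
print(toy["Days_To_First_Hearing"].tolist())
```

Note that 0 then means both "no hearing yet" and "hearing on the day of the accident", and that negative values (such as the minimum of -429 above) pass through unchanged.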


After creating this numeric feature, we can drop the original First Hearing Date column from the training, validation, and test sets.

In [None]:
# Drop First Hearing Date from the train, val, and test sets.
# drop() returns new DataFrames; these become the *_processed sets used in the rest of the notebook.
X_train_processed = X_train.drop(columns=['First Hearing Date'])
X_val_processed = X_val.drop(columns=['First Hearing Date'])
df_test_processed = df_test.drop(columns=['First Hearing Date'])

## 3.4. Accident Quarter  <a class="anchor" id="section_3_4"></a>
[Back to 3. Feature Engineering ](#chapter3)<br>


Temporal data can often influence outcomes. Extracting the quarter of the accident (e.g., 1st, 2nd, etc.) helps the model capture seasonal patterns that may impact accidents.

In [None]:
# Extracting the quarter of the Accident Date
X_train_processed['Accident_Quarter'] = pd.to_datetime(X_train_processed['Accident Date'], errors='coerce').dt.quarter

# Apply to the val set
X_val_processed['Accident_Quarter'] = pd.to_datetime(X_val_processed['Accident Date'], errors='coerce').dt.quarter

# Apply to the test set
df_test_processed['Accident_Quarter'] = pd.to_datetime(df_test_processed['Accident Date'], errors='coerce').dt.quarter


## 3.5. Accident Year <a class="anchor" id="section_3_5"></a>
[Back to 3. Feature Engineering ](#chapter3)<br>


The accident year lets the model capture year-over-year effects, such as long-term trends or changes in how claims are handled. Within-year seasonality is covered by the Accident Quarter feature.

In [None]:
# Extracting the year from the Accident Date
X_train_processed['Accident_Year'] = pd.to_datetime(X_train_processed['Accident Date'], errors='coerce').dt.year

# Apply to the val set
X_val_processed['Accident_Year'] = pd.to_datetime(X_val_processed['Accident Date'], errors='coerce').dt.year

# Apply to the test set
df_test_processed['Accident_Year'] = pd.to_datetime(df_test_processed['Accident Date'], errors='coerce').dt.year

## 3.6. Accident on Day and Weekend <a class="anchor" id="section_3_6"></a>
[Back to 3. Feature Engineering ](#chapter3)<br>


The day of the accident could be significant, as weekends might have different risk factors compared to weekdays. We will extract the day of the week and create a feature to indicate if the accident occurred on a weekend.

In [None]:
# Extracting the day of the week and creating a feature to indicate if the accident occurred on a weekend
X_train_processed['Accident Day'] = pd.to_datetime(X_train_processed['Accident Date'], errors='coerce').dt.dayofweek
X_train_processed['Accident on Weekend'] = X_train_processed['Accident Day'].apply(lambda x: 1 if x >= 5 else 0)

# Apply to the val set
X_val_processed['Accident Day'] = pd.to_datetime(X_val_processed['Accident Date'], errors='coerce').dt.dayofweek
X_val_processed['Accident on Weekend'] = X_val_processed['Accident Day'].apply(lambda x: 1 if x >= 5 else 0)

# Apply to the test set
df_test_processed['Accident Day'] = pd.to_datetime(df_test_processed['Accident Date'], errors='coerce').dt.dayofweek
df_test_processed['Accident on Weekend'] = df_test_processed['Accident Day'].apply(lambda x: 1 if x >= 5 else 0)
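
The quarter, year, weekday, and weekend features from sections 3.4 to 3.6 all derive from the same column, so they can be produced in one pass with a single date parse. The following is a sketch under assumptions: `add_accident_date_parts` is a hypothetical helper and the dates are toy values.

```python
import pandas as pd

def add_accident_date_parts(df, date_col="Accident Date"):
    """Hypothetical helper: add quarter, year, weekday and weekend flag in one pass."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col], errors="coerce")  # unparseable values become NaT
    out["Accident_Quarter"] = dates.dt.quarter
    out["Accident_Year"] = dates.dt.year
    out["Accident Day"] = dates.dt.dayofweek  # Monday=0 ... Sunday=6
    # A missing weekday compares as False, so it is flagged 0 (same as the lambda above)
    out["Accident on Weekend"] = (out["Accident Day"] >= 5).astype(int)
    return out

# Toy rows (illustrative values only): a Saturday, a Wednesday, and an invalid date
toy = pd.DataFrame({"Accident Date": ["2022-01-08", "2022-07-13", "not a date"]})
toy = add_accident_date_parts(toy)
print(toy)
```

Because a missing date yields a weekend flag of 0, rows without an Accident Date are indistinguishable from weekday accidents in that column.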


## 3.7. Age Group <a class="anchor" id="section_3_7"></a>
[Back to 3. Feature Engineering ](#chapter3)<br>

Grouping ages can help simplify the model’s understanding of different age demographics (e.g., Youth, Young Adult, Middle Age, Senior). This could potentially improve model interpretability and performance.

In [None]:
# Summary statistics for 'Age at Injury' to understand the range
X_train_processed['Age at Injury'].describe()

count    362830.000000
mean         42.961731
std          13.621215
min           5.000000
25%          31.000000
50%          43.000000
75%          54.000000
max          82.000000
Name: Age at Injury, dtype: float64

In [None]:
# Creating bins and labels for age groups
age_bins = [0, 25, 45, 65, float('inf')]
age_labels = ['Youth', 'Young Adult', 'Middle Age', 'Senior']

# Creating a new feature for age groups
X_train_processed['Age Group'] = pd.cut(X_train_processed['Age at Injury'], bins=age_bins, labels=age_labels)

# Apply to the val set
X_val_processed['Age Group'] = pd.cut(X_val_processed['Age at Injury'], bins=age_bins, labels=age_labels)

# Apply to the test set
df_test_processed['Age Group'] = pd.cut(df_test_processed['Age at Injury'], bins=age_bins, labels=age_labels)
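
By default `pd.cut` builds right-closed intervals, so each upper edge belongs to the lower group and the leftmost edge (0) is excluded. The short sketch below makes the boundary behaviour of the bins above explicit, using a handful of hand-picked ages.

```python
import pandas as pd

age_bins = [0, 25, 45, 65, float('inf')]
age_labels = ['Youth', 'Young Adult', 'Middle Age', 'Senior']

# Boundary ages chosen to show which side of each edge is included
ages = pd.Series([0, 25, 26, 45, 65, 66])
groups = pd.cut(ages, bins=age_bins, labels=age_labels)
print(pd.DataFrame({"age": ages, "group": groups}))
```

An age of exactly 0 therefore becomes NaN, while 25, 45, and 65 fall into Youth, Young Adult, and Middle Age respectively.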

In [None]:
# # Drop 'Age at Injury' from the train, val and test set
# X_train_processed = X_train_processed.drop(columns=['Age at Injury'])
# X_val_processed = X_val_processed.drop(columns=['Age at Injury'])
# df_test_processed = df_test_processed.drop(columns=['Age at Injury'])

## 3.8. Promptness_category <a class="anchor" id="section_3_8"></a>
[Back to 3. Feature Engineering ](#chapter3)<br>


The `promptness_category` feature categorizes the time taken between key events in the claims process, specifically measuring the difference between the `Accident Date` and the `Assembly Date`. This feature quantifies the speed or delay in assembling the claim and provides insight into how promptly claims are processed.

In [None]:
def categorize_promptness(df, date1_col, date2_col, new_col_name):
    """
    Calculate and categorize promptness between two date columns.

    Parameters:
    - df: The DataFrame to process.
    - date1_col: The column representing the first date (e.g., Assembly Date).
    - date2_col: The column representing the second date (e.g., Accident Date).
    - new_col_name: The name of the new categorical column for promptness.

    Returns:
    - Updated DataFrame with new categorized promptness column.
    """
    # Ensure the date columns are datetime
    df[date1_col] = pd.to_datetime(df[date1_col], errors='coerce')
    df[date2_col] = pd.to_datetime(df[date2_col], errors='coerce')

    # Calculate the difference in days
    df['Days_Difference'] = (df[date1_col] - df[date2_col]).dt.days

    # Assign categories based on conditions
    def assign_category(row):
        if pd.isna(row[date1_col]) or row['Days_Difference'] <= 0:
            return 'Form Not Received'
        elif row['Days_Difference'] <= 7:
            return 'Until 1 week'
        elif row['Days_Difference'] <= 14:
            return 'Between 1 and 2 weeks'
        elif row['Days_Difference'] <= 30:
            return 'Between 2 weeks and 1 month'
        elif row['Days_Difference'] <= 90:
            return '1 to 3 months'
        elif row['Days_Difference'] <= 180:
            return '3 to 6 months'
        elif row['Days_Difference'] <= 365:
            return '6 months to 1 year'
        else:
            return 'More than 1 year'

    # Apply the function to assign categories
    df[new_col_name] = df.apply(assign_category, axis=1)

    # Drop the intermediate column
    df.drop(columns=['Days_Difference'], inplace=True)

    return df

# Apply the function to X_train_processed
X_train_processed = categorize_promptness(X_train_processed, 'Assembly Date', 'Accident Date', 'promptness_category')

# Apply the function to X_val_processed
X_val_processed = categorize_promptness(X_val_processed, 'Assembly Date', 'Accident Date', 'promptness_category')

# Apply the function to df_test_processed
df_test_processed = categorize_promptness(df_test_processed, 'Assembly Date', 'Accident Date', 'promptness_category')

In [None]:
# Display value counts for the new column
X_train_processed['promptness_category'].value_counts()

promptness_category
Until 1 week                   159623
Between 1 and 2 weeks           85224
Between 2 weeks and 1 month     71377
1 to 3 months                   50045
3 to 6 months                   15345
More than 1 year                 9990
6 months to 1 year               9547
Form Not Received                   5
Name: count, dtype: int64

These categories show how promptly claims are assembled. The largest group, about 40% of training claims, falls within "Until 1 week", indicating that many claims are assembled swiftly. However, roughly a fifth take longer than a month, and a small subset takes more than a year. This feature can expose patterns of delay or rapid processing, possibly indicating areas for improvement in claim management.
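
The same categorization can be sketched without a row-wise `apply` by handing the day differences to `pd.cut`. This is a hypothetical variant, not the function used above: the name `categorize_promptness_vectorized` is an assumption, and, unlike the original, it treats any missing interval (including a missing second date) as "Form Not Received".

```python
import numpy as np
import pandas as pd

PROMPTNESS_EDGES = [-np.inf, 0, 7, 14, 30, 90, 180, 365, np.inf]
PROMPTNESS_LABELS = [
    'Form Not Received', 'Until 1 week', 'Between 1 and 2 weeks',
    'Between 2 weeks and 1 month', '1 to 3 months', '3 to 6 months',
    '6 months to 1 year', 'More than 1 year',
]

def categorize_promptness_vectorized(df, date1_col, date2_col):
    """Hypothetical vectorized variant; returns the category Series without mutating df."""
    days = (pd.to_datetime(df[date1_col], errors='coerce')
            - pd.to_datetime(df[date2_col], errors='coerce')).dt.days
    # Right-closed bins: (-inf, 0] -> 'Form Not Received', (0, 7] -> 'Until 1 week', ...
    cats = pd.cut(days, bins=PROMPTNESS_EDGES, labels=PROMPTNESS_LABELS)
    # Assumption: any missing interval is treated as 'Form Not Received'
    return cats.fillna('Form Not Received')

# Toy rows (illustrative values only): 4 days, 45 days, missing form, form dated before the accident
toy = pd.DataFrame({
    'Accident Date': ['2022-01-01', '2022-01-01', '2022-01-01', '2022-01-01'],
    'C-2 Date':      ['2022-01-05', '2022-02-15', None,         '2021-12-20'],
})
result = categorize_promptness_vectorized(toy, 'C-2 Date', 'Accident Date')
print(result.tolist())
```

Keeping the edges and labels in two module-level lists makes it easy to see that every threshold in the `if`/`elif` chain has a matching bin.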

## 3.9. promptness_C2_category <a class="anchor" id="section_3_9"></a>
[Back to 3. Feature Engineering ](#chapter3)<br>

The "promptness_C2_category" feature tracks the time taken to register the C-2 Date (the receipt of the employer's report of work-related injury/illness) after the Accident Date. It evaluates employers' promptness in reporting accidents, offering insights into compliance and potential administrative delays.


In [None]:
# Ensure 'C-2 Date' and 'Accident Date' are datetime
X_train_processed['C-2 Date'] = pd.to_datetime(X_train_processed['C-2 Date'], errors='coerce')
X_train_processed['Accident Date'] = pd.to_datetime(X_train_processed['Accident Date'], errors='coerce')

# Count the number of rows where 'C-2 Date' is earlier than 'Accident Date'
num_negative_values = (X_train_processed['C-2 Date'] < X_train_processed['Accident Date']).sum()

# Print the number of rows with this condition
print(f"Number of rows where 'C-2 Date' is earlier than 'Accident Date': {num_negative_values}")

Number of rows where 'C-2 Date' is earlier than 'Accident Date': 627


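This should not happen: a C-2 Date cannot legitimately precede the Accident Date. Because `categorize_promptness` maps any non-positive day difference to "Form Not Received", these 627 rows will fall into that category.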

In [None]:
# Apply the function to X_train_processed
X_train_processed = categorize_promptness(X_train_processed, 'C-2 Date', 'Accident Date', 'promptness_C2_category')

# Apply the function to X_val_processed
X_val_processed = categorize_promptness(X_val_processed, 'C-2 Date', 'Accident Date', 'promptness_C2_category')

# Apply the function to df_test_processed
df_test_processed = categorize_promptness(df_test_processed, 'C-2 Date', 'Accident Date', 'promptness_C2_category')


In [None]:
# Display value counts for the new column
X_train_processed['promptness_C2_category'].value_counts()

promptness_C2_category
Until 1 week                   160468
Between 1 and 2 weeks           76055
Between 2 weeks and 1 month     60347
1 to 3 months                   49633
3 to 6 months                   18602
Form Not Received               13641
6 months to 1 year              12550
More than 1 year                 9860
Name: count, dtype: int64

## 3.10. promptness_C3_category <a class="anchor" id="section_3_10"></a>
[Back to 3. Feature Engineering ](#chapter3)<br>

The "promptness_C3_category" feature tracks the time taken to register the C-3 Date (the receipt of the Employee Claim Form) after the Accident Date. It evaluates workers' promptness in filing a claim, complementing the employer-side view given by the C-2 feature.


In [None]:
# Apply the function to X_train_processed
X_train_processed = categorize_promptness(X_train_processed, 'C-3 Date', 'Accident Date', 'promptness_C3_category')

# Apply the function to X_val_processed
X_val_processed = categorize_promptness(X_val_processed, 'C-3 Date', 'Accident Date', 'promptness_C3_category')

# Apply the function to df_test_processed
df_test_processed = categorize_promptness(df_test_processed, 'C-3 Date', 'Accident Date', 'promptness_C3_category')

In [None]:
# Display value counts for the new column
X_train_processed['promptness_C3_category'].value_counts()

promptness_C3_category
Form Not Received              273433
1 to 3 months                   31694
Between 2 weeks and 1 month     30784
Between 1 and 2 weeks           19363
Until 1 week                    18328
3 to 6 months                   12393
6 months to 1 year               8614
More than 1 year                 6547
Name: count, dtype: int64

After creating new features based on the existing date columns, we will remove the original date features to avoid redundancy and simplify the dataset. We believe that the impact of these date features is adequately captured in the newly engineered features.

In [None]:
X_train_processed = X_train_processed.drop(columns=['Accident Date', 'Assembly Date', 'C-2 Date','C-3 Date'])

#Apply to the val set
X_val_processed = X_val_processed.drop(columns=['Accident Date','Assembly Date', 'C-2 Date','C-3 Date'])

# Apply to the test set
df_test_processed = df_test_processed.drop(columns=['Accident Date', 'Assembly Date', 'C-2 Date', 'C-3 Date'])

## 3.10. Zip_Code_Simplified <a class="anchor" id="section_3_10"></a>
[Back to 3. Feature Engineering ](#chapter3)<br>

To reduce the dimensionality of the Zip Code feature, we will create a new feature called Zip_Code_Simplified. This feature groups all zip codes that appear fewer than 1,000 times in the training dataset into a category labeled 'OTHER'. By doing this, we effectively reduce the number of unique zip codes, simplifying the model while retaining the most significant information.

In [None]:
# Print the most frequent Zip Codes along with their counts
most_frequent_zipcode = X_train_processed['Zip Code'].value_counts().head(25)  # Adjust the number if you need more
print("Most frequent Zip Codes with their counts:")
print(most_frequent_zipcode)

Most frequent Zip Codes with their counts:
Zip Code
11236.0    3836
11717.0    3779
11434.0    3636
11550.0    2849
10467.0    2794
10940.0    2335
10701.0    2176
10029.0    2089
14150.0    2005
10314.0    1751
14609.0    1703
11706.0    1629
11207.0    1551
12601.0    1541
11368.0    1541
11212.0    1523
11208.0    1519
12550.0    1519
11226.0    1450
11234.0    1365
11203.0    1364
10466.0    1347
13440.0    1285
11385.0    1280
10462.0    1279
Name: count, dtype: int64


In [None]:
# Create a new feature called 'Zip_Code_Simplified' based on 'Zip Code' for train, validation, and test sets
X_train_processed['Zip_Code_Simplified'] = X_train_processed['Zip Code']
X_val_processed['Zip_Code_Simplified'] = X_val_processed['Zip Code']
df_test_processed['Zip_Code_Simplified'] = df_test_processed['Zip Code']

# Identify zip codes that occur fewer than 1000 times in X_train_processed
zipcode_counts = X_train_processed['Zip Code'].value_counts()
zipcode_to_replace = zipcode_counts[zipcode_counts < 1000].index

# Replace zip codes with fewer than 1000 occurrences with 'OTHER' in all datasets, using the zip codes identified from X_train
for dataset in [X_train_processed, X_val_processed, df_test_processed]:
    dataset['Zip_Code_Simplified'] = dataset['Zip_Code_Simplified'].replace(zipcode_to_replace, 'OTHER')

# Print the counts of the simplified zip codes in X_train_processed to verify the result
print("Counts of 'Zip_Code_Simplified' feature in X_train_processed:")
print(X_train_processed['Zip_Code_Simplified'].value_counts())

Counts of 'Zip_Code_Simplified' feature in X_train_processed:
Zip_Code_Simplified
OTHER      324113
11236.0      3836
11717.0      3779
11434.0      3636
11550.0      2849
10467.0      2794
10940.0      2335
10701.0      2176
10029.0      2089
14150.0      2005
10314.0      1751
14609.0      1703
11706.0      1629
11207.0      1551
11368.0      1541
12601.0      1541
11212.0      1523
12550.0      1519
11208.0      1519
11226.0      1450
11234.0      1365
11203.0      1364
10466.0      1347
13440.0      1285
11385.0      1280
10462.0      1279
10456.0      1246
14094.0      1237
10469.0      1208
11003.0      1179
12180.0      1172
11757.0      1146
10977.0      1143
11413.0      1142
11758.0      1137
11520.0      1121
13090.0      1120
11746.0      1099
10473.0      1098
11221.0      1096
11722.0      1095
11233.0      1093
10453.0      1091
11704.0      1091
11772.0      1083
12303.0      1077
12603.0      1071
11373.0      1066
13021.0      1039
11756.0      1038
11377.0      1009


In [None]:
# Display unique counts to compare the dimensionality reduction
print(f"Original ZIP Code uniqueness: {X_train_processed['Zip Code'].nunique()}")
print(f"Simplified ZIP Code uniqueness: {X_train_processed['Zip_Code_Simplified'].nunique()}")

Original ZIP Code uniqueness: 9705
Simplified ZIP Code uniqueness: 51


This transformation retains regional information for the most frequent zip codes while reducing the feature's cardinality from 9,705 to 51 values, which can benefit model interpretability and efficiency. The original Zip Code column is now largely redundant; the cell below would drop it, but it is left commented out, so the column is kept for now.
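
One limitation of replacing only the rare values found in training is that a zip code appearing solely in validation or test is left untouched. A possible alternative, sketched here with hypothetical helper names and toy values, is to keep an allow-list of frequent training values and send everything else to 'OTHER', while leaving missing values as NaN.

```python
import pandas as pd

def fit_frequent_categories(train_col, min_count):
    """Return the set of values seen at least min_count times in the training column."""
    counts = train_col.value_counts()
    return set(counts[counts >= min_count].index)

def simplify_categories(col, keep, other_label='OTHER'):
    """Keep frequent values, map all other values (including unseen ones) to other_label.
    Missing values are left as NaN."""
    is_kept = col.isin(keep) | col.isna()
    # Cast to object first so numeric codes and the string label can share one column
    return col.astype(object).where(is_kept, other_label)

# Toy columns (illustrative values only); 99999 never appears in training
train_zip = pd.Series([11236.0, 11236.0, 11236.0, 11717.0, 11717.0, 10001.0, None])
val_zip = pd.Series([11236.0, 99999.0, None])

keep = fit_frequent_categories(train_zip, min_count=2)
train_simple = simplify_categories(train_zip, keep)
val_simple = simplify_categories(val_zip, keep)
print(train_simple.tolist())
print(val_simple.tolist())
```

The same two functions would apply to any high-cardinality column, with `min_count` chosen per feature.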

In [None]:
# X_train_processed = X_train_processed.drop(columns=['Zip Code'])
# # Apply to the val set
# X_val_processed = X_val_processed.drop(columns=['Zip Code'])
# # Apply to the test set
# df_test_processed = df_test_processed.drop(columns=['Zip Code'])

## 3.11. Carrier Type Merged <a class="anchor" id="section_3_11"></a>
[Back to 3. Feature Engineering ](#chapter3)<br>


Since there are several categories under "Special Fund" with very few occurrences, combining them into a single category can reduce noise in the data and make the feature more manageable for the model.

After merging, we observe the following distribution of Carrier Type Merged values in the training dataset:

In [None]:
# Creating a new feature that merges all 'Special Fund' categories into a single category for train, validation, and test sets
for dataset in [X_train_processed, X_val_processed, df_test_processed]:
    dataset['Carrier Type Merged'] = dataset['Carrier Type'].replace({
        'SPECIAL FUND - UNKNOWN': 'SPECIAL FUND',
        'SPECIAL FUND - POI CARRIER WCB MENANDS': 'SPECIAL FUND',
        'SPECIAL FUND - CONS. COMM. (SECT. 25-A)': 'SPECIAL FUND'
    })

# Verifying the updated column for X_train_processed
print(X_train_processed['Carrier Type Merged'].value_counts())

Carrier Type Merged
PRIVATE         199662
SELF PUBLIC      85264
SIF              77456
SELF PRIVATE     36835
UNKNOWN           1212
SPECIAL FUND       727
Name: count, dtype: int64
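
The explicit mapping above only merges the three listed labels. If further "Special Fund" variants could appear, a prefix match is one way to catch them all. This is a sketch on toy values and assumes that every such label starts with the text `SPECIAL FUND`.

```python
import pandas as pd

# Toy column (illustrative values only), including a missing carrier type
carrier_type = pd.Series([
    'PRIVATE', 'SPECIAL FUND - UNKNOWN', 'SIF',
    'SPECIAL FUND - POI CARRIER WCB MENANDS', None,
])

# na=False treats missing carrier types as "not a Special Fund", so they stay missing
is_special_fund = carrier_type.str.startswith('SPECIAL FUND', na=False)
carrier_type_merged = carrier_type.mask(is_special_fund, 'SPECIAL FUND')
print(carrier_type_merged.tolist())
```

The trade-off is that a prefix match may also capture labels that were not meant to be merged, so the resulting value counts are worth checking.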


In [None]:
# # Now let's delete Carrier Type from the train, val and test set
# X_train_processed = X_train_processed.drop(columns=['Carrier Type'])
# # Apply to the val set
# X_val_processed = X_val_processed.drop(columns=['Carrier Type'])
# # Apply to the test set
# df_test_processed = df_test_processed.drop(columns=['Carrier Type'])

## 3.12. Carrier_Name_Simplified <a class="anchor" id="section_3_12"></a>
[Back to 3. Feature Engineering ](#chapter3)<br>

The 'Carrier Name' feature has high cardinality, with 1,966 unique values in the training set. This level of uniqueness can complicate machine learning models, especially if some categories have very few instances. To simplify the analysis and potentially improve model performance, we will group carrier names with fewer than 500 occurrences under a single category called 'OTHER'.


In [None]:
# Print the most frequent Carrier Names along with their counts
most_frequent_carriers = X_train_processed['Carrier Name'].value_counts().head(25)  # Adjust the number if you need more
print("Most frequent Carrier Names with their counts:")
print(most_frequent_carriers)

Most frequent Carrier Names with their counts:
Carrier Name
STATE INSURANCE FUND             77456
POLICE, FIRE, SANITATION         15026
AMERICAN ZURICH INSURANCE CO     12178
CHARTER OAK FIRE INS CO          12021
INDEMNITY INS. OF N AMERICA      10095
SAFETY NATIONAL CASUALTY CORP     9778
NEW HAMPSHIRE INSURANCE CO        8962
LM INSURANCE CORP                 8557
A I U INSURANCE COMPANY           7761
INDEMNITY INSURANCE CO OF         6369
NYC TRANSIT AUTHORITY             5826
HARTFORD ACCIDENT & INDEMNITY     5311
NEW YORK BLACK CAR OPERATORS'     5080
ARCH INDEMNITY INSURANCE CO.      4652
AIU INSURANCE CO                  4507
CNY OTHER THAN ED, HED WATER      4454
HEALTH & HOSPITAL CORP.           3914
ARCH INDEMNITY INSURANCE CO       3731
PENNSYLVANIA MANUFACTURERS'       3396
PUBLIC EMPLOYERS RISK MGMT.       3231
ACE AMERICAN INSURANCE CO.        3171
OLD REPUBLIC INSURANCE CO.        3046
MEMIC INDEMNITY COMPANY           2908
WAL-MART ASSOCIATES, INC.         2764
COUN

In [None]:
# Create a new feature called 'Carrier_Name_Simplified' based on 'Carrier Name' for train, validation, and test sets
X_train_processed['Carrier_Name_Simplified'] = X_train_processed['Carrier Name']
X_val_processed['Carrier_Name_Simplified'] = X_val_processed['Carrier Name']
df_test_processed['Carrier_Name_Simplified'] = df_test_processed['Carrier Name']

# Identify carrier names that occur fewer than 500 times in X_train_processed
carrier_counts = X_train_processed['Carrier Name'].value_counts()
carriers_to_replace = carrier_counts[carrier_counts < 500].index

# Replace carrier names with fewer than 500 occurrences with 'OTHER' in all datasets using the identified carriers from X_train
for dataset in [X_train_processed, X_val_processed, df_test_processed]:
    dataset['Carrier_Name_Simplified'] = dataset['Carrier_Name_Simplified'].replace(carriers_to_replace, 'OTHER')

# Print the counts of the simplified carrier names in X_train_processed to verify the result
print("Counts of 'Carrier_Name_Simplified' feature in X_train_processed:")
print(X_train_processed['Carrier_Name_Simplified'].value_counts())

Counts of 'Carrier_Name_Simplified' feature in X_train_processed:
Carrier_Name_Simplified
OTHER                             85446
STATE INSURANCE FUND              77456
POLICE, FIRE, SANITATION          15026
AMERICAN ZURICH INSURANCE CO      12178
CHARTER OAK FIRE INS CO           12021
                                  ...  
EVEREST NATIONAL INS COMPANY        519
TRAVELERS INDEMNITY CO OF AMER      517
VISITING NURSE SERVICE OF NY        513
HARTFORD INSURANCE COMPANY          508
HARTFORD CASUALTY INSURANCE CO      501
Name: count, Length: 106, dtype: int64


In [None]:
# Print the number of unique values in the original 'Carrier Name' feature
print(f"Number of unique values in 'Carrier Name': {X_train_processed['Carrier Name'].nunique()}")

# Print the number of unique values in the simplified 'Carrier_Name_Simplified' feature
print(f"Number of unique values in 'Carrier_Name_Simplified': {X_train_processed['Carrier_Name_Simplified'].nunique()}")

Number of unique values in 'Carrier Name': 1966
Number of unique values in 'Carrier_Name_Simplified': 106
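
One caveat of the `replace` approach is that it only rewrites names that were seen (rarely) in the training set; a carrier that appears only in the validation or test set keeps its original name and ends up as a category that was never present during training. A minimal sketch of an alternative, using toy data and a hypothetical helper `collapse_rare` (not part of this notebook), keeps only the names that are frequent in training and sends everything else to 'OTHER':

```python
import pandas as pd

def collapse_rare(train_col, other_cols, min_count, other_label="OTHER"):
    """Keep categories seen at least `min_count` times in training; map the rest to `other_label`."""
    counts = train_col.value_counts()
    frequent = counts[counts >= min_count].index

    def simplify(col):
        # Series.where keeps values where the condition is True and replaces the others
        return col.where(col.isin(frequent), other_label)

    return simplify(train_col), [simplify(col) for col in other_cols]

# Toy data (hypothetical carrier names, threshold lowered to 2 for illustration)
train = pd.Series(["A", "A", "A", "B", "B", "C"])
val = pd.Series(["A", "C", "D"])  # "D" never appears in training

train_simple, (val_simple,) = collapse_rare(train, [val], min_count=2)
print(train_simple.tolist())
print(val_simple.tolist())
```

Note that, in this sketch, missing values are also mapped to 'OTHER' because `isin` returns False for NaN; whether that is desirable depends on how missing carriers are handled elsewhere.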


In [None]:
# # Drop the 'Carrier Name' column after creating 'Carrier_Name_Simplified'
# X_train_processed = X_train_processed.drop(columns=['Carrier Name'])

# # Apply to the val set
# X_val_processed = X_val_processed.drop(columns=['Carrier Name'])

# # Apply to the test set
# df_test_processed = df_test_processed.drop(columns=['Carrier Name'])

In [None]:
# Total number of NaN values in the train, validation, and test sets
X_train_processed.isnull().sum().sum(), X_val_processed.isnull().sum().sum(), df_test_processed.isnull().sum().sum()

(133752, 58457, 143202)

## 3.13. Body_Part_Category <a class="anchor" id="section_3_13"></a>
[Back to 3. Feature Engineering ](#chapter3)<br>

The `Body_Part_Category` feature groups the `WCIO Part Of Body Code` values into broader categories. Each range of codes represents a specific body region (e.g., codes 10 to 19 represent the head), so we map the codes to the corresponding regions, such as "Head", "Neck", etc.

In [None]:
# Mapping of WCIO Part of Body codes to broader categories
part_of_body_mapping = {
    **dict.fromkeys(range(10, 20), 'Head'),
    **dict.fromkeys(range(20, 30), 'Neck'),
    **dict.fromkeys(range(30, 40), 'Upper Extremities'),
    **dict.fromkeys(list(range(40, 50)) + list(range(60, 64)), 'Trunk'),
    **dict.fromkeys(range(50, 60), 'Lower Extremities'),
    **dict.fromkeys([64, 65, 66, 90, 91, 99], 'Multiple Body Parts'),
    **dict.fromkeys([101], 'NonClassificable')

}

# Creating the Body_Part_Category column by mapping Part of Body codes to categories
X_train_processed['Body_Part_Category'] = X_train_processed['WCIO Part Of Body Code'].map(part_of_body_mapping)

# Apply to the val set (mapped from the validation set's own codes)
X_val_processed['Body_Part_Category'] = X_val_processed['WCIO Part Of Body Code'].map(part_of_body_mapping)

# Apply to the test set
df_test_processed['Body_Part_Category'] = df_test_processed['WCIO Part Of Body Code'].map(part_of_body_mapping)

In [None]:
X_train_processed['Body_Part_Category'].value_counts()

Body_Part_Category
Upper Extremities      124656
Lower Extremities       84322
Trunk                   70778
Head                    39893
Multiple Body Parts     32530
NonClassificable        29322
Neck                     8398
Name: count, dtype: int64

In [None]:
# Check the number of missing values in 'Body_Part_Category'
missing_body_part_count = X_train_processed['Body_Part_Category'].isna().sum()
print(f"Number of missing values in 'Body_Part_Category': {missing_body_part_count}")

# Filter rows where 'Body_Part_Category' is missing
missing_body_part_rows = X_train_processed[X_train_processed['Body_Part_Category'].isna()]

# Investigate the corresponding 'WCIO Part Of Body Code' values
missing_body_part_wcio_codes = missing_body_part_rows['WCIO Part Of Body Code'].value_counts(dropna=False)
print("WCIO Part Of Body Code distribution for missing 'Body_Part_Category':")
missing_body_part_wcio_codes


Number of missing values in 'Body_Part_Category': 11257
WCIO Part Of Body Code distribution for missing 'Body_Part_Category':


WCIO Part Of Body Code
0.0    11257
Name: count, dtype: int64

In [None]:
X_train_processed['Body_Part_Category'].isna().sum()

11257
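
The check above can be wrapped in a small helper so that it is reusable for the other WCIO code columns. This is only a sketch on toy data; `report_unmapped` is a hypothetical name, and the toy mapping is a shortened version of the one used in this section:

```python
import pandas as pd

def report_unmapped(codes, mapping):
    """Return the counts of code values (including NaN) that have no entry in `mapping`."""
    unmapped = codes[~codes.isin(list(mapping))]
    return unmapped.value_counts(dropna=False)

# Shortened, illustrative mapping and codes
toy_mapping = {
    **dict.fromkeys(range(10, 20), "Head"),
    **dict.fromkeys(range(20, 30), "Neck"),
}
toy_codes = pd.Series([10.0, 25.0, 0.0, 0.0, 18.0])

unmapped_counts = report_unmapped(toy_codes, toy_mapping)
print(unmapped_counts)
```

Running the helper before calling `.map` makes it explicit which codes will turn into missing values in the new category column.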

## 3.14. Injury_Nature_Category <a class="anchor" id="section_3_14"></a>
The `Injury_Nature_Category` feature groups the `WCIO Nature of Injury Code` values into broader categories: codes 1 to 59 are specific injuries, codes 60 to 80 are occupational diseases or cumulative injuries, code 83 is COVID-19, and codes 90 and 91 are multiple injuries. Codes outside these ranges (such as 0) are left unmapped and become missing values.

In [None]:
sorted(X_train_processed['WCIO Nature of Injury Code'].unique())

[0.0,
 1.0,
 2.0,
 3.0,
 4.0,
 7.0,
 10.0,
 13.0,
 16.0,
 19.0,
 22.0,
 25.0,
 28.0,
 30.0,
 31.0,
 32.0,
 34.0,
 36.0,
 37.0,
 38.0,
 40.0,
 41.0,
 42.0,
 43.0,
 46.0,
 47.0,
 49.0,
 52.0,
 53.0,
 54.0,
 55.0,
 58.0,
 59.0,
 60.0,
 61.0,
 62.0,
 63.0,
 64.0,
 65.0,
 66.0,
 67.0,
 68.0,
 69.0,
 70.0,
 71.0,
 72.0,
 73.0,
 74.0,
 75.0,
 76.0,
 77.0,
 78.0,
 79.0,
 80.0,
 83.0,
 90.0,
 91.0]

In [None]:
# Mapping of WCIO Nature of Injury codes to broader categories
nature_of_injury_mapping = {
    **dict.fromkeys(range(1, 60), 'Specific Injury'),
    **dict.fromkeys(range(60, 81), 'Occupational Disease or Cumulative Injury'),
    **dict.fromkeys([83], 'COVID-19 Injury'),
    **dict.fromkeys([90, 91], 'Multiple Injuries')
}

# Creating the Injury_Nature_Category column by mapping Nature of Injury codes to categories
X_train_processed['Injury_Nature_Category'] = X_train_processed['WCIO Nature of Injury Code'].map(nature_of_injury_mapping)

#Apply to the val set
X_val_processed['Injury_Nature_Category'] = X_val_processed['WCIO Nature of Injury Code'].map(nature_of_injury_mapping)

# Apply to the test set
df_test_processed['Injury_Nature_Category'] = df_test_processed['WCIO Nature of Injury Code'].map(nature_of_injury_mapping)


In [None]:
X_train_processed['Injury_Nature_Category'].value_counts()

Injury_Nature_Category
Specific Injury                              353666
COVID-19 Injury                               18028
Occupational Disease or Cumulative Injury      9966
Multiple Injuries                              9202
Name: count, dtype: int64

In [None]:
X_train_processed['Injury_Nature_Category'].isna().sum()

10294

## 3.15. Injury_Cause_Category <a class="anchor" id="section_3_15"></a>
The Injury_Cause_Category feature will classify the WCIO_Cause_of_Injury_Code values into broader cause categories. For example, codes related to burns or scalds can be grouped together, as well as those for falls or motor vehicle accidents.

In [None]:
# Mapping of WCIO Cause of Injury codes to broader categories
cause_of_injury_mapping = {
    **dict.fromkeys(list(range(1, 10)) + [11, 14, 84], 'Burn or Scald'),
    **dict.fromkeys([10, 12, 13, 20], 'Caught In, Under, or Between'),
    **dict.fromkeys(list(range(15, 20)), 'Cut, Puncture, Scrape'),
    **dict.fromkeys(list(range(25, 34)), 'Fall, Slip, or Trip'),
    **dict.fromkeys(list(range(40, 51)), 'Motor Vehicle'),
    **dict.fromkeys(list(range(52, 62)) + [97], 'Strain or Injury By'),
    **dict.fromkeys(list(range(65, 71)), 'Striking Against or Stepping On'),
    **dict.fromkeys(list(range(74, 82)) + [85, 86], 'Struck or Injured by'),
    **dict.fromkeys(list(range(94, 96)), 'Rubbed or Abraded by'),
    **dict.fromkeys(list(range(87, 94)) + [96, 98, 99, 82], 'Miscellaneous Causes'),
    **dict.fromkeys([83], 'COVID-19 Injury')
}


# Creating the Injury_Cause_Category column by mapping Cause of Injury codes to categories
X_train_processed['Injury_Cause_Category'] = X_train_processed['WCIO Cause of Injury Code'].map(cause_of_injury_mapping)

# Apply to the val set
X_val_processed['Injury_Cause_Category'] = X_val_processed['WCIO Cause of Injury Code'].map(cause_of_injury_mapping)

# Apply to the test set
df_test_processed['Injury_Cause_Category'] = df_test_processed['WCIO Cause of Injury Code'].map(cause_of_injury_mapping)


In [None]:
X_train_processed['Injury_Cause_Category'].value_counts()

Injury_Cause_Category
Strain or Injury By                103853
Fall, Slip, or Trip                 83895
Struck or Injured by                70277
Miscellaneous Causes                33947
Cut, Puncture, Scrape               28247
COVID-19 Injury                     17547
Motor Vehicle                       16811
Striking Against or Stepping On     13598
Caught In, Under, or Between        13462
Burn or Scald                        8547
Rubbed or Abraded by                  689
Name: count, dtype: int64

In [None]:
X_train_processed['Injury_Cause_Category'].isna().sum()

10283
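
Because the mapping above is assembled with `**` unpacking, a code listed under two categories would silently be assigned to whichever category comes last. A small sketch of a guard against that, assuming a hypothetical `build_code_mapping` helper and a shortened toy version of the category table:

```python
def build_code_mapping(groups):
    """Flatten {category: iterable of codes} into {code: category}, rejecting duplicate codes."""
    mapping = {}
    for category, codes in groups.items():
        for code in codes:
            if code in mapping:
                raise ValueError(
                    f"Code {code} is assigned to both {mapping[code]!r} and {category!r}"
                )
            mapping[code] = category
    return mapping

# Shortened, illustrative subset of the cause-of-injury groups
toy_groups = {
    "Cut, Puncture, Scrape": range(15, 20),
    "Fall, Slip, or Trip": range(25, 34),
    "COVID-19 Injury": [83],
}
toy_mapping = build_code_mapping(toy_groups)
print(toy_mapping[16], "|", toy_mapping[83])
```

The resulting dictionary can be passed to `.map` exactly like the hand-built one; the only difference is that an accidental overlap raises an error instead of going unnoticed.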

Since we have created new categorical features (`Injury_Nature_Category`, `Body_Part_Category`, `Injury_Cause_Category`) that summarize the original codes, it would make sense to remove the original code features: keeping them adds redundancy and increases the dimensionality of the dataset, which may affect model performance. The cell below, which drops them, is currently commented out, so the original code columns are kept for now.

In [None]:
# # Removing the code features from train, validation, and test datasets
# X_train_processed = X_train_processed.drop(columns=[
#     'WCIO Cause of Injury Code',
#     'WCIO Nature of Injury Code',
#     'WCIO Part Of Body Code'
# ])

# X_val_processed = X_val_processed.drop(columns=[
#     'WCIO Cause of Injury Code',
#     'WCIO Nature of Injury Code',
#     'WCIO Part Of Body Code'
# ])

# df_test_processed = df_test_processed.drop(columns=[
#     'WCIO Cause of Injury Code',
#     'WCIO Nature of Injury Code',
#     'WCIO Part Of Body Code'
# ])

## 3.16. Risk of Each Job <a class="anchor" id="section_3_16"></a>
[Back to 3. Feature Engineering ](#chapter3)<br>


In [None]:
high_risk = [11, 21, 23, 31, 32, 33, 48, 49]

medium_risk = [22, 42, 44, 45, 56, 62, 71, 72, 81, 92]

low_risk = [51, 52, 53, 54, 55, 61]

In [None]:
# Define a function to assign risk levels based on the industry code
def assign_risk(industry_code):
    if industry_code in high_risk:
        return 'High Risk'
    elif industry_code in medium_risk:
        return 'Medium Risk'
    elif industry_code in low_risk:
        return 'Low Risk'
    else:
        return 'Unknown Risk'

# Apply the function to create the 'Industry Risk' column for the train, validation, and test datasets
X_train_processed['Industry Risk'] = X_train_processed['Industry Code'].apply(assign_risk)
X_val_processed['Industry Risk'] = X_val_processed['Industry Code'].apply(assign_risk)
df_test_processed['Industry Risk'] = df_test_processed['Industry Code'].apply(assign_risk)

# Display a preview to verify
print(X_train_processed[['Industry Code', 'Industry Risk']].head())
print(df_test_processed[['Industry Code', 'Industry Risk']].head())


   Industry Code Industry Risk
0           11.0     High Risk
1           31.0     High Risk
2           33.0     High Risk
3           31.0     High Risk
4           62.0   Medium Risk
   Industry Code Industry Risk
0           48.0     High Risk
1           45.0   Medium Risk
2           56.0   Medium Risk
3           48.0     High Risk
4           55.0      Low Risk
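
The same assignment can be written without a row-wise `apply` by turning the three lists into a lookup dictionary and using `Series.map`. A sketch on toy data, reusing the risk lists defined above; the name `risk_lookup` is introduced here for illustration, and codes that are missing or not listed fall back to 'Unknown Risk':

```python
import pandas as pd

high_risk = [11, 21, 23, 31, 32, 33, 48, 49]
medium_risk = [22, 42, 44, 45, 56, 62, 71, 72, 81, 92]
low_risk = [51, 52, 53, 54, 55, 61]

# One dictionary from industry code to risk label
risk_lookup = {
    **dict.fromkeys(high_risk, "High Risk"),
    **dict.fromkeys(medium_risk, "Medium Risk"),
    **dict.fromkeys(low_risk, "Low Risk"),
}

# Toy industry codes, including an unlisted code and a missing value
toy_codes = pd.Series([11.0, 62.0, 55.0, 99.0, None])
toy_risk = toy_codes.map(risk_lookup).fillna("Unknown Risk")
print(toy_risk.tolist())
```

On large datasets a dictionary lookup of this kind is usually faster than calling a Python function per row, and it keeps the code-to-label table in one place.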


POSSIBLE ALTERNATIVE

In [None]:
# Check unique industry descriptions
#unique_industries = X_train_processed['Industry Code Description'].unique()
#print(f"Unique Industry Descriptions: {len(unique_industries)}")
#print(unique_industries[:10])  # Display the first 10 industry descriptions

In [None]:
# Group by 'Industry Code Description' and calculate the frequency of claims
#industry_injury_counts = X_train_processed.groupby('Industry Code Description')['Claim Injury Type'].count()

# Normalize the injury frequencies to assign risk scores (1 = Low, 2 = Medium, 3 = High)
#min_count = industry_injury_counts.min()
#max_count = industry_injury_counts.max()
#industry_injury_normalized = (industry_injury_counts - min_count) / (max_count - min_count)

# Assign risk levels based on normalized frequencies
#industry_risk_levels = industry_injury_normalized.apply(lambda x: 1 if x < 0.33 else (2 if x < 0.66 else 3))

# Create a mapping dictionary
#industry_risk_mapping = industry_risk_levels.to_dict()

In [None]:
# Add the new "Job Risk Level" column to the dataset
#X_train_processed['Job Risk Level'] = X_train_processed['Industry Code Description'].map(industry_risk_mapping)
#X_val_processed['Job Risk Level'] = X_val_processed['Industry Code Description'].map(industry_risk_mapping)
#df_test_processed['Job Risk Level'] = df_test_processed['Industry Code Description'].map(industry_risk_mapping)

# Verify the new column
#print(X_train_processed[['Industry Code Description', 'Job Risk Level']].head())
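
To make the commented-out alternative concrete, here is a minimal runnable sketch of the same idea on toy data: count claims per industry, min-max normalize the counts, and bin them with the 0.33 and 0.66 cut-offs. The industry names and counts below are invented for illustration only.

```python
import pandas as pd

# Toy training data: 10, 5, and 1 claims for three hypothetical industries
toy = pd.DataFrame({
    "Industry Code Description":
        ["CONSTRUCTION"] * 10 + ["RETAIL TRADE"] * 5 + ["INFORMATION"] * 1,
})

# Claim counts per industry, min-max normalized to [0, 1]
counts = toy["Industry Code Description"].value_counts()
normalized = (counts - counts.min()) / (counts.max() - counts.min())

# 1 = Low, 2 = Medium, 3 = High, using the same 0.33 / 0.66 cut-offs
risk_levels = normalized.apply(lambda x: 1 if x < 0.33 else (2 if x < 0.66 else 3))

toy["Job Risk Level"] = toy["Industry Code Description"].map(risk_levels.to_dict())
print(risk_levels.to_dict())
```

Two points to keep in mind if this alternative is pursued: the counts should be computed on the training set only and then mapped onto the validation and test sets, and a raw claim count reflects how many claims an industry files rather than the risk per worker.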

## 3.17. Relation between Salary and Dependents <a class="anchor" id="section_3_17"></a>
[Back to 3. Feature Engineering ](#chapter3)<br>

The variable `Salary_Per_Dependent` is the `Average Weekly Wage` divided by the `Number of Dependents` plus one. The added one can be read as counting the claimant, and it also avoids a division by zero for claims with no dependents. The feature approximates the income available per household member and may capture financial pressures that correlate with the type and severity of injury claims.

In [None]:
# Creating a new feature: dividing the salary (Average Weekly Wage) by the number of dependents plus one
X_train_processed['Salary_Per_Dependent'] = X_train_processed['Average Weekly Wage'] / (X_train_processed['Number of Dependents'] + 1)

# Apply the same transformation to the validation set
X_val_processed['Salary_Per_Dependent'] = X_val_processed['Average Weekly Wage'] / (X_val_processed['Number of Dependents'] + 1)

# Apply the same transformation to the test set
df_test_processed['Salary_Per_Dependent'] = df_test_processed['Average Weekly Wage'] / (df_test_processed['Number of Dependents'] + 1)
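
As a quick worked example of the formula, the sketch below applies it to three invented rows; the values are illustrative and not taken from the dataset:

```python
import pandas as pd

# Toy rows: no dependents, two dependents, and a zero wage
toy = pd.DataFrame({
    "Average Weekly Wage": [900.0, 900.0, 0.0],
    "Number of Dependents": [0, 2, 3],
})

# Same formula as above: wage divided by (dependents + 1)
toy["Salary_Per_Dependent"] = toy["Average Weekly Wage"] / (toy["Number of Dependents"] + 1)
print(toy["Salary_Per_Dependent"].tolist())  # [900.0, 300.0, 0.0]
```

With zero dependents the feature equals the wage itself, so no row produces an infinite or undefined value.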


# 4. Encoding <a class="anchor" id="chapter4"></a>


# 5. Save Dataset for Modelling <a class="anchor" id="chapter5"></a>
