# WiDS Datathon - HistGradientBoostingClassifier Walkthrough


The aim of this notebook is to provide a basic machine learning competition tutorial using the WiDS Datathon 2024 (Challenge #1): https://www.kaggle.com/competitions/widsdatathon2024-challenge1/overview

## 0. Breaking Down the Problem

Going through the problem description, we can see that what we have on our hands is a binary classification task. We want to use the given data to predict if a diagnosis will be given within 90 days or after 90 days, so we'll need to take that into account when processing our data. 

We can start off with any classification model we want, but we'll want to adjust it to get a final model that provides the highest possible classification accuracy. We'll take a look at a couple of approaches: standard binary classification methods, and a boosting method, HistGradientBoosting, specifically its classifier: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html

In [2]:
#imports - usual culprits 
import pandas as pd 
import numpy as np 

#model imports 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder   
from sklearn.ensemble import HistGradientBoostingClassifier 

### Preprocessing

In [3]:
#load data 
data = pd.read_csv("~/Downloads/widsdatathon2024-challenge2/train.csv", header = None, low_memory = False)
df_test = pd.read_csv("~/Downloads/widsdatathon2024-challenge2/test.csv", header = None, low_memory = False) 

In [4]:
#take a look at our data: 
data.head(25)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,142,143,144,145,146,147,148,149,150,151
0,patient_id,patient_race,payer_type,patient_state,patient_zip3,Region,Division,patient_age,patient_gender,bmi,...,Average of Apr-18,Average of May-18,Average of Jun-18,Average of Jul-18,Average of Aug-18,Average of Sep-18,Average of Oct-18,Average of Nov-18,Average of Dec-18,metastatic_diagnosis_period
1,268700,,COMMERCIAL,AR,724,South,West South Central,39,F,,...,52.55,74.77,79.96,81.69,78.30,74.56,59.98,42.98,41.18,191
2,484983,White,,IL,629,Midwest,East North Central,55,F,35.36,...,49.30,72.87,77.40,77.43,75.83,72.64,58.36,39.68,39.71,33
3,277055,,COMMERCIAL,CA,925,West,Pacific,59,F,,...,68.50,70.31,78.61,87.24,85.52,80.75,70.81,62.67,55.58,157
4,320055,Hispanic,MEDICAID,CA,900,West,Pacific,59,F,,...,63.34,63.10,67.45,75.86,75.24,71.10,68.95,65.46,59.46,146
5,190386,,COMMERCIAL,CA,934,West,Pacific,71,F,,...,59.45,60.24,64.77,69.81,70.13,68.10,65.38,60.72,54.08,286
6,559027,,COMMERCIAL,IN,461,Midwest,East North Central,63,F,,...,45.86,71.10,74.27,74.89,74.57,70.70,55.43,37.13,35.43,73
7,293747,White,MEDICARE ADVANTAGE,OH,448,Midwest,East North Central,57,F,33.1,...,42.62,65.91,71.26,74.03,73.94,69.12,53.50,36.43,34.10,59
8,517596,White,COMMERCIAL,DE,198,South,South Atlantic,56,F,31.05,...,48.41,65.17,70.63,75.82,76.17,70.00,56.65,40.90,37.68,316
9,533188,,COMMERCIAL,LA,706,South,West South Central,65,F,,...,63.74,77.51,81.80,83.07,82.46,80.32,71.56,56.24,53.39,86


Columns 0 through 150 are our features, and 151, metastatic_diagnosis_period, is our target column. 

In [5]:
#drop the row of labels, as we won't need to feed this to the model   
data_d = data.drop(index = 0) 
data_d
#we note that in various columns we have NaN or missing values.  

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,142,143,144,145,146,147,148,149,150,151
1,268700,,COMMERCIAL,AR,724,South,West South Central,39,F,,...,52.55,74.77,79.96,81.69,78.30,74.56,59.98,42.98,41.18,191
2,484983,White,,IL,629,Midwest,East North Central,55,F,35.36,...,49.30,72.87,77.40,77.43,75.83,72.64,58.36,39.68,39.71,33
3,277055,,COMMERCIAL,CA,925,West,Pacific,59,F,,...,68.50,70.31,78.61,87.24,85.52,80.75,70.81,62.67,55.58,157
4,320055,Hispanic,MEDICAID,CA,900,West,Pacific,59,F,,...,63.34,63.10,67.45,75.86,75.24,71.10,68.95,65.46,59.46,146
5,190386,,COMMERCIAL,CA,934,West,Pacific,71,F,,...,59.45,60.24,64.77,69.81,70.13,68.10,65.38,60.72,54.08,286
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13169,588544,Hispanic,MEDICAID,PA,191,Northeast,Middle Atlantic,59,F,,...,48.81,66.12,70.38,77.18,77.53,70.90,56.53,41.46,37.49,106
13170,393047,,COMMERCIAL,TX,757,South,West South Central,73,F,30.67,...,62.03,77.82,84.52,85.35,84.61,78.50,67.24,52.16,50.01,92
13171,790904,,COMMERCIAL,CA,928,West,Pacific,19,F,,...,66.20,66.04,70.87,80.68,79.75,75.27,71.40,66.01,59.20,0
13172,455518,,COMMERCIAL,MI,481,Midwest,East North Central,52,F,,...,39.93,63.56,68.68,72.13,72.55,66.17,49.79,34.16,32.28,330
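
As an aside, the label row only needs dropping because the file was read with `header = None`. Below is a minimal sketch of the alternative, using a made-up three-row stand-in for train.csv (the values are illustrative, not real rows). Letting pandas parse the header keeps the column names and infers numeric dtypes. The rest of this notebook keeps the integer column indices, so this is an alternative rather than what is done here.

```python
import io
import pandas as pd

# Tiny stand-in for train.csv: column names mirror the real file,
# the values are made up for illustration.
csv_text = """patient_id,patient_race,patient_age,bmi,metastatic_diagnosis_period
268700,,39,,191
484983,White,55,35.36,33
277055,,59,,157
"""

# header=0 (the default) uses the first row as column names,
# so there is no label row to drop afterwards.
named = pd.read_csv(io.StringIO(csv_text))

print(named.dtypes)            # numeric columns are parsed as numbers
print(named.columns.tolist())  # columns can now be selected by name
```

With named columns, a feature can be selected as `named["patient_age"]` instead of by position.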


## 1. Processing

#### Feature Engineering Strategies: 
The goal of feature engineering is to provide our model with data that it can use, as well as to adjust our data for issues such as outliers and missing values. We have two main considerations for this problem. The first is data types: from the HistGradientBoosting documentation (or trial and error), we know that we need to convert all of our data to integer, float, or boolean values, and the data in our table is a mixture of numerical and string values. The second is missing values: inspecting the table, we can see many NaN values that could affect how our model works. 
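
Both considerations can be checked programmatically rather than by eye. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not the competition data) to count missing values per column and list the columns that still need encoding.

```python
import numpy as np
import pandas as pd

# Made-up frame with the same kinds of problems as the real table:
# a string column, a numeric column, and missing values.
toy = pd.DataFrame({
    "patient_race": ["White", np.nan, "Hispanic", np.nan],
    "patient_age": [39, 55, 59, 71],
    "bmi": [np.nan, 35.36, np.nan, 33.1],
})

# Number of missing values in each column
missing_counts = toy.isna().sum()

# Columns that are not numeric and therefore still need encoding
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

print(missing_counts)
print(non_numeric)
```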

Below, we write a function to encode columns of data using LabelEncoder, along with two other simple functions that convert string values to integer and float values. Then, we create a new dataframe to feed our model. We'll also binarize our target column. 

In [14]:
#write a function to label encode a column 
def encode_column(col): 
    le = LabelEncoder() 
    le.fit(col)
    col_enc = le.transform(col)
    return col_enc 

def to_int(col): 
    col_int = col.astype(int)
    return col_int

def to_float(col): 
    col_float = col.astype(float)
    return col_float

In [15]:
#label encode columns 1, 3, 5, 6 with previously defined function 
col_1 = encode_column(data_d[1].values)
col_3 = encode_column(data_d[3].values)
col_5 = encode_column(data_d[5].values)
col_6 = encode_column(data_d[6].values)

In [16]:
#we can check what one of the columns looks like encoded as integers: 
col_1

array([5, 4, 5, ..., 5, 5, 2])
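
To see which integer stands for which label, the fitted encoder can be inspected. This is a minimal sketch with made-up labels (the `races` array is illustrative only). LabelEncoder sorts the labels and assigns integers in that order, so the codes carry no meaning beyond identity. scikit-learn documents LabelEncoder as intended for target labels; OrdinalEncoder is its counterpart for feature columns.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up labels for illustration
races = np.array(["White", "Hispanic", "Black", "White", "Asian"])

le = LabelEncoder()
codes = le.fit_transform(races)

# classes_ is sorted; the label at position i is encoded as integer i
print(dict(enumerate(le.classes_)))
print(codes)

# inverse_transform recovers the original labels from the codes
print(le.inverse_transform(codes))
```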

In [17]:
#apply function to columns 4 and 7
col_4 = to_int(data_d[4].values)
col_7 = to_int(data_d[7].values)

In [18]:
#sanity check
print(col_4)
print(col_7)

[724 629 925 ... 928 481 900]
[39 55 59 ... 19 52 63]


In [19]:
len(col_7)

13173

In [20]:
#combine all columns, just making a dataframe from scratch here
dataset_processed = pd.DataFrame({'col_1':col_1, 
                                  'col_3': col_3,
                                  'col_4': col_4,
                                  'col_5': col_5, 
                                  'col_6': col_6, 
                                  'col_7': col_7
                                 }) 

In [21]:
dataset_processed.reset_index(drop = True)

Unnamed: 0,col_1,col_3,col_4,col_5,col_6,col_7
0,5,2,724,2,7,39
1,4,13,629,0,0,55
2,5,4,925,3,4,59
3,2,4,900,3,4,59
4,5,4,934,3,4,71
...,...,...,...,...,...,...
13168,2,33,191,1,2,59
13169,5,37,757,2,7,73
13170,5,4,928,3,4,19
13171,5,19,481,0,0,52


In [22]:
data_dropped_1 = data_d.drop([0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], axis = 1)
data_dropped = data_dropped_1.reset_index(drop = True) 
data_dropped

Unnamed: 0,16,17,18,19,20,21,22,23,24,25,...,142,143,144,145,146,147,148,149,150,151
0,82.63,42.58,11.61,13.03,10.87,11.80,12.29,13.22,13.47,10.07,...,52.55,74.77,79.96,81.69,78.30,74.56,59.98,42.98,41.18,191
1,51.79,43.54,11.22,12.19,11.45,11.01,11.35,14.39,14.15,9.17,...,49.30,72.87,77.40,77.43,75.83,72.64,58.36,39.68,39.71,33
2,700.34,36.28,13.27,15.66,13.49,13.45,12.40,11.58,10.47,6.38,...,68.50,70.31,78.61,87.24,85.52,80.75,70.81,62.67,55.58,157
3,5294.33,36.65,9.76,11.27,17.23,17.44,13.09,12.30,9.41,5.67,...,63.34,63.10,67.45,75.86,75.24,71.10,68.95,65.46,59.46,146
4,400.48,41.78,10.03,16.43,12.97,11.29,10.09,11.56,13.28,8.78,...,59.45,60.24,64.77,69.81,70.13,68.10,65.38,60.72,54.08,286
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13168,5512.17,35.72,10.85,10.95,18.16,17.35,11.65,11.10,10.64,5.94,...,48.81,66.12,70.38,77.18,77.53,70.90,56.53,41.46,37.49,106
13169,204.69,40.87,11.27,14.64,12.11,10.93,10.94,14.12,12.84,7.85,...,62.03,77.82,84.52,85.35,84.61,78.50,67.24,52.16,50.01,92
13170,2295.94,38.20,11.88,13.35,14.23,13.42,13.33,14.06,10.25,5.95,...,66.20,66.04,70.87,80.68,79.75,75.27,71.40,66.01,59.20,0
13171,743.56,41.47,10.94,13.59,12.67,11.61,12.14,14.65,12.73,7.93,...,39.93,63.56,68.68,72.13,72.55,66.17,49.79,34.16,32.28,330


In [23]:
#convert each remaining column to float and append it to the processed dataframe 
for column in data_dropped: 
    col_floatified = to_float(data_dropped[column])
    dataset_processed = pd.concat([dataset_processed, col_floatified], axis = 1)

In [24]:
dataset_processed

Unnamed: 0,col_1,col_3,col_4,col_5,col_6,col_7,16,17,18,19,...,142,143,144,145,146,147,148,149,150,151
0,5,2,724,2,7,39,82.63,42.58,11.61,13.03,...,52.55,74.77,79.96,81.69,78.30,74.56,59.98,42.98,41.18,191.0
1,4,13,629,0,0,55,51.79,43.54,11.22,12.19,...,49.30,72.87,77.40,77.43,75.83,72.64,58.36,39.68,39.71,33.0
2,5,4,925,3,4,59,700.34,36.28,13.27,15.66,...,68.50,70.31,78.61,87.24,85.52,80.75,70.81,62.67,55.58,157.0
3,2,4,900,3,4,59,5294.33,36.65,9.76,11.27,...,63.34,63.10,67.45,75.86,75.24,71.10,68.95,65.46,59.46,146.0
4,5,4,934,3,4,71,400.48,41.78,10.03,16.43,...,59.45,60.24,64.77,69.81,70.13,68.10,65.38,60.72,54.08,286.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13168,2,33,191,1,2,59,5512.17,35.72,10.85,10.95,...,48.81,66.12,70.38,77.18,77.53,70.90,56.53,41.46,37.49,106.0
13169,5,37,757,2,7,73,204.69,40.87,11.27,14.64,...,62.03,77.82,84.52,85.35,84.61,78.50,67.24,52.16,50.01,92.0
13170,5,4,928,3,4,19,2295.94,38.20,11.88,13.35,...,66.20,66.04,70.87,80.68,79.75,75.27,71.40,66.01,59.20,0.0
13171,5,19,481,0,0,52,743.56,41.47,10.94,13.59,...,39.93,63.56,68.68,72.13,72.55,66.17,49.79,34.16,32.28,330.0


In [25]:
X = dataset_processed.values
#display some of the data. Note the omitted columns (eg. patient number, data we don't need) 
X

### Further Processing 

In [27]:
#processing data - apply lambda function to target column to make binary. 
df_train_binary = data_d.copy() 
df_train_binary[151] = df_train_binary[151].apply(lambda x: int(int(x) > 90))

In [28]:
#take a look at column 151 - we've converted the column to 1s and 0s. 
df_train_binary.head(40)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,142,143,144,145,146,147,148,149,150,151
1,268700,,COMMERCIAL,AR,724,South,West South Central,39,F,,...,52.55,74.77,79.96,81.69,78.3,74.56,59.98,42.98,41.18,1
2,484983,White,,IL,629,Midwest,East North Central,55,F,35.36,...,49.3,72.87,77.4,77.43,75.83,72.64,58.36,39.68,39.71,0
3,277055,,COMMERCIAL,CA,925,West,Pacific,59,F,,...,68.5,70.31,78.61,87.24,85.52,80.75,70.81,62.67,55.58,1
4,320055,Hispanic,MEDICAID,CA,900,West,Pacific,59,F,,...,63.34,63.1,67.45,75.86,75.24,71.1,68.95,65.46,59.46,1
5,190386,,COMMERCIAL,CA,934,West,Pacific,71,F,,...,59.45,60.24,64.77,69.81,70.13,68.1,65.38,60.72,54.08,1
6,559027,,COMMERCIAL,IN,461,Midwest,East North Central,63,F,,...,45.86,71.1,74.27,74.89,74.57,70.7,55.43,37.13,35.43,0
7,293747,White,MEDICARE ADVANTAGE,OH,448,Midwest,East North Central,57,F,33.1,...,42.62,65.91,71.26,74.03,73.94,69.12,53.5,36.43,34.1,0
8,517596,White,COMMERCIAL,DE,198,South,South Atlantic,56,F,31.05,...,48.41,65.17,70.63,75.82,76.17,70.0,56.65,40.9,37.68,1
9,533188,,COMMERCIAL,LA,706,South,West South Central,65,F,,...,63.74,77.51,81.8,83.07,82.46,80.32,71.56,56.24,53.39,0
10,639484,White,COMMERCIAL,CA,922,West,Pacific,60,F,,...,70.91,74.48,83.59,91.04,89.79,85.1,70.73,60.59,53.04,1
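
The same binarization can also be written without a lambda. This is a small sketch on made-up values (the `period` series is illustrative) using vectorized pandas operations. A period of exactly 90 days maps to 0, matching the `> 90` rule above.

```python
import pandas as pd

# Made-up diagnosis periods, stored as strings like the raw column
period = pd.Series(["191", "33", "157", "90", "0"])

# Convert to numbers, compare against the 90-day threshold, cast booleans to 0/1
is_late = (pd.to_numeric(period) > 90).astype(int)

print(is_late.tolist())
```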


In [29]:
#define target variable using our new binarized column: 
target_column = df_train_binary[151]
y = target_column.values 

In [30]:
# see our feature matrix X and target vector y, which we defined in the previous steps: 
X, y

(array([[  5.  ,   2.  , 724.  , ...,  42.98,  41.18, 191.  ],
        [  4.  ,  13.  , 629.  , ...,  39.68,  39.71,  33.  ],
        [  5.  ,   4.  , 925.  , ...,  62.67,  55.58, 157.  ],
        ...,
        [  5.  ,   4.  , 928.  , ...,  66.01,  59.2 ,   0.  ],
        [  5.  ,  19.  , 481.  , ...,  34.16,  32.28, 330.  ],
        [  2.  ,   4.  , 900.  , ...,  65.46,  59.46,   0.  ]]),
 array([1, 0, 1, ..., 0, 1, 0]))

In [31]:
#train test split to create training and evaluation sets 
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size = 0.20, random_state = 1) 

In [32]:
#see our current training set: note that it is a randomly chosen subset of X.  
X_train

array([[  5.  ,  13.  , 611.  , ...,  29.8 ,  26.24, 121.  ],
       [  5.  ,  37.  , 760.  , ...,  51.69,  47.3 ,   2.  ],
       [  5.  ,  13.  , 601.  , ...,  33.29,  31.79, 175.  ],
       ...,
       [  5.  ,  14.  , 472.  , ...,  38.99,  37.62,   0.  ],
       [  5.  ,   9.  , 300.  , ...,  47.25,  44.86, 303.  ],
       [  5.  ,  20.  , 554.  , ...,  27.46,  25.11,   6.  ]])

In [33]:
y_eval

array([0, 0, 1, ..., 0, 1, 1])
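
A possible refinement, not used in this notebook, is to pass `stratify` to `train_test_split` so that both subsets keep roughly the same class balance. The sketch below shows this on synthetic data (`X_toy` and `y_toy` are illustrative names).

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 5))
y_toy = (rng.random(1000) < 0.3).astype(int)  # imbalanced labels, roughly 30% ones

# stratify=y_toy keeps the share of each class the same in both subsets
X_tr, X_ev, y_tr, y_ev = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=1, stratify=y_toy
)

print(y_toy.mean(), y_tr.mean(), y_ev.mean())
```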

## 3. Ensembling Method: HistGradientBoosting

We're going to use a HistGradientBoostingClassifier. I decided to try this model after some experimentation because it can handle NaN values, which occur frequently in our data.

In [34]:
model = HistGradientBoostingClassifier(learning_rate = 0.1, verbose = 0) 
model.fit(X_train, y_train)

In [35]:
#create predictions, see how our model performed on validation set 
y_preds = model.predict(X_eval)
accuracy = np.sum(y_preds == y_eval)/len(y_eval)
print(accuracy)

1.0


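A perfect score like this should be treated with suspicion rather than as luck. Here it points to target leakage: column 151, the raw metastatic_diagnosis_period, was never dropped from the features (it is the last column of X above), so the model only has to learn the 90-day threshold. The target column should be removed from X before training to get a realistic validation score, and even then we could expect some more trial and error when it comes to test set performance. We can look at our model's parameters using .get_params: 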

In [36]:
#see model parameters 
model.get_params() 

{'categorical_features': None,
 'class_weight': None,
 'early_stopping': 'auto',
 'interaction_cst': None,
 'l2_regularization': 0.0,
 'learning_rate': 0.1,
 'loss': 'log_loss',
 'max_bins': 255,
 'max_depth': None,
 'max_iter': 100,
 'max_leaf_nodes': 31,
 'min_samples_leaf': 20,
 'monotonic_cst': None,
 'n_iter_no_change': 10,
 'random_state': None,
 'scoring': 'loss',
 'tol': 1e-07,
 'validation_fraction': 0.1,
 'verbose': 0,
 'warm_start': False}

At this stage, we would generate test set predictions and format them into a CSV file. Then, depending on performance, we could adjust our model in various ways: fine-tune the hyperparameters above, conduct further feature engineering, or experiment with other models. Some questions to consider given test set performance: are all the features we considered relevant to solving the problem? Is this model the best one, or should we consider others? 

We often can't know the answers to these questions without either doing some data visualization or directly running experiments, so it is good to have a codebase ready to go with commonly used algorithms for specific tasks. 

## 4. Discussion, Previous Iterations

When initially writing up a solution, I experimented with K-Nearest Neighbors and Logistic Regression classifiers as well as LightGBM, but HistGradientBoosting was the easiest to work with because it handles NaN values natively, reducing the amount of feature engineering required on our end. 