# Training

After several iterations between exploratory data analysis and model training, the following setup gave the best results.
- Preprocessing: Adding a column that distinguishes ICD-9 from ICD-10 codes reduced the RMSE by 10. Combining this with a column of metastatic cancer diagnosis descriptions and a categorized patient BMI helped achieve a private score of 81.
- Features: 'patient_age', 'breast_cancer_diagnosis_code', 'bmi_category', 'ICD_code', 'metastatic_cancer_diagnosis_desc', 'patient_zip3', and 'metastatic_cancer_diagnosis_code' gave the best results. In particular, 'ICD_code' contributed the most to the score improvement and has the highest CatBoost feature importance.
- Model: Based on the feature selection methodology, CatBoost was the best-performing model and was used for this study.
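
The ICD-9 versus ICD-10 flag described above is built inside `helper.cleaning_and_null_handling`, which is not shown in this notebook. The snippet below is only a minimal sketch of one way such a flag could be derived. It assumes that ICD-10-CM codes start with a letter and that the ICD-9-CM codes of interest start with a digit (ICD-9 supplementary V and E codes would break this heuristic). The function name `icd_version`, the labels `"ICD9"`/`"ICD10"`/`"Missing"`, and the toy codes are hypothetical.

```python
import pandas as pd


def icd_version(code) -> str:
    """Hypothetical helper: label a diagnosis code by its ICD revision."""
    if pd.isna(code):
        return "Missing"
    # Assumption: a leading letter means ICD-10, anything else is treated as ICD-9
    return "ICD10" if str(code).strip()[:1].isalpha() else "ICD9"


# Toy rows for illustration only; these are not taken from train.csv
toy_codes = pd.DataFrame(
    {"breast_cancer_diagnosis_code": ["C50919", "1749", "C50412", "1748", None]}
)
toy_codes["ICD_code"] = toy_codes["breast_cancer_diagnosis_code"].map(icd_version)
print(toy_codes)
```

The actual values stored in 'ICD_code' by the helper may differ from these labels.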

Current Best Private Score on Test Data: 81.00
Current Validation Score: 82.28

In [1]:
import pandas as pd
import numpy as np
from helper import cleaning_and_null_handling

from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error
from catboost import CatBoostRegressor

import warnings

warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
ICD_codes_df = pd.read_csv("ICD-CM-Codes.csv")

features = df.columns.to_list()

In [3]:
# Data cleaning and null handling
remove_cols = ["metastatic_diagnosis_period", "patient_id"]
df, test_df = cleaning_and_null_handling(
    df, test_df, ICD_codes_df,
    features=features, remove_cols=remove_cols, bmi_groups=True,
)

new_features = [
    'patient_age', 'breast_cancer_diagnosis_code',
    "ICD_code", "metastatic_cancer_diagnosis_desc", 'patient_zip3',
    "metastatic_cancer_diagnosis_code",
    "payer_type",
    "patient_id", "metastatic_diagnosis_period"
]

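# Copy the selections so the column assignments below modify independent frames
df = df[new_features].copy()
# The test set has no target, so drop the last entry ("metastatic_diagnosis_period")
test_df = test_df[new_features[:-1]].copy()

# Columns that CatBoost should treat as categorical
cat_features = df.select_dtypes(include=['object', 'category']).columns.tolist()

# Turn missing values into an explicit "Missing" level in both frames
for column in df.select_dtypes(include='object').columns:
    df[column] = pd.Categorical(df[column].fillna("Missing"))
    test_df[column] = pd.Categorical(test_df[column].fillna("Missing"))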

breast_cancer_diagnosis_code and breast_cancer_diagnosis_desc cleaning done.
Assigned metastatic descriptions
BMI categorization done.
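
The BMI categorization reported in the output above also happens inside the helper, so its exact bins are not visible here. As a rough sketch of what such a step might look like, the snippet below bins a numeric BMI column with `pd.cut` using the standard WHO adult cut-offs (18.5, 25, 30). The column name `bmi`, the category labels, and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Assumed WHO adult cut-offs: <18.5, 18.5 to <25, 25 to <30, >=30
bmi_bins = [-np.inf, 18.5, 25.0, 30.0, np.inf]
bmi_labels = ["underweight", "normal", "overweight", "obese"]

# Toy values for illustration only
toy_bmi = pd.DataFrame({"bmi": [17.2, 22.0, 27.5, 34.1, np.nan]})

# right=False makes every bin closed on the left, e.g. [18.5, 25)
toy_bmi["bmi_category"] = pd.cut(
    toy_bmi["bmi"], bins=bmi_bins, labels=bmi_labels, right=False
)

# pd.cut leaves missing BMI as NaN; give it an explicit level instead
toy_bmi["bmi_category"] = (
    toy_bmi["bmi_category"].cat.add_categories("Missing").fillna("Missing")
)
print(toy_bmi)
```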


In [4]:
#Data split
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

X_train = train_df.drop(columns=["metastatic_diagnosis_period", "patient_id"])
y_train = train_df["metastatic_diagnosis_period"]

X_val = val_df.drop(columns=["metastatic_diagnosis_period", "patient_id"])
y_val = val_df["metastatic_diagnosis_period"]

ids = test_df[["patient_id"]].copy()
X_test = test_df.drop(columns=["patient_id"])

In [5]:
# Train model
model = CatBoostRegressor(
    iterations=100, depth=6, learning_rate=0.1,
    loss_function='RMSE', random_state=42,
    verbose=0, cat_features=cat_features,
)
model.fit(X_train, y_train)

<catboost.core.CatBoostRegressor at 0x1e6206bf310>

In [6]:
# Generate validation RMSE (clip negative predictions to 0, round to whole numbers)
y_val_pred = np.clip(model.predict(X_val), 0, None).round().astype(np.uint16)
val_rmse = root_mean_squared_error(y_val, y_val_pred)
print(f"Validation RMSE: {round(val_rmse,2)}")

Validation RMSE: 82.28
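
To make the metric concrete, here is a small NumPy-only sketch of the same post-processing (clip at zero, round, cast) followed by a hand-written RMSE. The function names `postprocess` and `rmse` and the three toy values are made up for illustration; the notebook itself uses `sklearn.metrics.root_mean_squared_error`.

```python
import numpy as np


def postprocess(raw_pred):
    """Clip negatives to 0 and round to whole numbers, mirroring the cell above."""
    return np.clip(raw_pred, 0, None).round().astype(np.uint16)


def rmse(y_true, y_pred):
    # Cast to float first so unsigned integers cannot wrap around on subtraction
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


raw = np.array([-3.2, 10.4, 99.6])    # made-up raw model outputs
y_true = np.array([0, 14, 100])       # made-up targets
pred = postprocess(raw)               # becomes [0, 10, 100]
print(pred, rmse(y_true, pred))       # errors are [0, 4, 0], so RMSE = sqrt(16 / 3)
```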


In [7]:
feature_importance_df = pd.DataFrame({
            'Feature': X_train.columns,
            'Importance': model.feature_importances_
        })
feature_importance_df.sort_values(by='Importance', ascending=False).head(15)

|   | Feature | Importance |
|---|---|---|
| 2 | ICD_code | 87.911868 |
| 0 | patient_age | 4.326794 |
| 4 | patient_zip3 | 2.058083 |
| 3 | metastatic_cancer_diagnosis_desc | 1.749359 |
| 5 | metastatic_cancer_diagnosis_code | 1.589866 |
| 6 | payer_type | 1.427957 |
| 1 | breast_cancer_diagnosis_code | 0.936073 |
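
The importances printed above sum to 100, which is consistent with CatBoost's default normalization for this kind of loss, so each value can be read as a percentage. The sketch below simply re-enters the printed numbers and adds a running total to show how quickly the importance concentrates in 'ICD_code'. The `Cumulative` column is an addition for illustration and is not part of the original output.

```python
import pandas as pd

# Values copied from the feature importance output above
importance = pd.DataFrame({
    "Feature": [
        "ICD_code", "patient_age", "patient_zip3",
        "metastatic_cancer_diagnosis_desc", "metastatic_cancer_diagnosis_code",
        "payer_type", "breast_cancer_diagnosis_code",
    ],
    "Importance": [
        87.911868, 4.326794, 2.058083, 1.749359, 1.589866, 1.427957, 0.936073,
    ],
})

# Running total of importance, from most to least important feature
importance["Cumulative"] = importance["Importance"].cumsum()
total = importance["Importance"].sum()

print(importance)
print(f"Total importance: {total:.2f}")
```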


In [8]:
# Generate test predictions and submission.csv
test_pred = np.clip(model.predict(X_test), 0, None).round().astype(np.uint16)

submission = ids.copy()
submission['metastatic_diagnosis_period'] = test_pred
submission.to_csv('submission.csv', index=False)
submission.head()


|   | patient_id | metastatic_diagnosis_period |
|---|---|---|
| 0 | 730681 | 207 |
| 1 | 334212 | 51 |
| 2 | 571362 | 208 |
| 3 | 907331 | 225 |
| 4 | 208382 | 37 |
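
Before uploading, it can be worth running a quick sanity check on the submission frame. The sketch below is one possible check, not part of the original pipeline. The function name `check_submission` is hypothetical, and the expected column layout is assumed from the output above rather than from any official competition specification.

```python
import numpy as np
import pandas as pd


def check_submission(sub: pd.DataFrame) -> None:
    """Hypothetical sanity check; raises ValueError on the first problem found."""
    expected = ["patient_id", "metastatic_diagnosis_period"]
    if list(sub.columns) != expected:
        raise ValueError(f"Unexpected columns: {list(sub.columns)}")
    if not sub["patient_id"].is_unique:
        raise ValueError("Duplicate patient_id values")
    if sub["metastatic_diagnosis_period"].isna().any():
        raise ValueError("Missing predictions")
    if (sub["metastatic_diagnosis_period"] < 0).any():
        raise ValueError("Negative predictions")


# The five rows shown in the output above
sample = pd.DataFrame({
    "patient_id": [730681, 334212, 571362, 907331, 208382],
    "metastatic_diagnosis_period": np.array([207, 51, 208, 225, 37], dtype=np.uint16),
})
check_submission(sample)  # passes silently when nothing is wrong
```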
