<a href="https://colab.research.google.com/github/rllevy/MMAI-Bae/blob/main/day-1-base-line_dropping_health_insurance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# (MJH - DAY1 TESTING) MMAI 2025 869: Team Project Template
*Updated May 3, 2024*

This notebook serves as a template for the Team Project. Teams can use this notebook as a starting point, and update it successively with new ideas and techniques to improve their model results.

Note that it is not required to use this template. Teams may also alter this template in any way they see fit.


# Preliminaries: Inspect and Set up environment

No action is required on your part in this section. These cells print out helpful information about the environment, just in case.

In [1]:
! pip install --user pandas
! pip install --user numpy
! pip install --user scikit-learn
! pip install --user xgboost



In [2]:
import datetime
import pandas as pd
import numpy as np

In [3]:
print(datetime.datetime.now())

2024-05-21 14:08:29.375318


In [4]:
!which python

/Users/mhoy/.pyenv/versions/3.11.8/bin/python


In [5]:
!python --version

Python 3.11.8


In [6]:
!echo $PYTHONPATH




In [7]:
# TODO: if you need to install any package, do so here. For example:
#pip install unidecode

# 0. Data Loading and Inspection

## 0.1: Load data

The file containing the labeled training data is conveniently located on the cloud at the address below. Let's load it up and take a look.

In [8]:
df = pd.read_csv("https://drive.google.com/uc?export=download&id=1eYCKuqJda4bpzXBVnqXylg0qQwvpUuum")
# df = pd.read_csv("file:///Users/mhoy/ownCloud - mhoy@owncloud-new.markjhoy.com/Smith_MMAI_2025/Courses/MMAI 869 - Intro to AI and ML/cleaned_test2.csv")

## 0.2: Simple Exploratory Data Analysis

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21365 entries, 0 to 21364
Data columns (total 36 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   h1n1_concern                 21292 non-null  float64
 1   h1n1_knowledge               21274 non-null  float64
 2   behavioral_antiviral_meds    21306 non-null  float64
 3   behavioral_avoidance         21202 non-null  float64
 4   behavioral_face_mask         21351 non-null  float64
 5   behavioral_wash_hands        21329 non-null  float64
 6   behavioral_large_gatherings  21293 non-null  float64
 7   behavioral_outside_home      21306 non-null  float64
 8   behavioral_touch_face        21263 non-null  float64
 9   doctor_recc_h1n1             19629 non-null  float64
 10  doctor_recc_seasonal         19629 non-null  float64
 11  chronic_med_condition        20594 non-null  float64
 12  child_under_6_months         20710 non-null  float64
 13  health_worker   

In [10]:
# Let's print some descriptive statistics for all the numeric features.

df.describe().T

                               count      mean       std  min  25%  50%  75%  max
h1n1_concern                 21292.0  1.618026  0.909311  0.0  1.0  2.0  2.0  3.0
h1n1_knowledge               21274.0  1.265018  0.617816  0.0  1.0  1.0  2.0  2.0
behavioral_antiviral_meds    21306.0  0.049329  0.216559  0.0  0.0  0.0  0.0  1.0
behavioral_avoidance         21202.0  0.724507  0.446773  0.0  0.0  1.0  1.0  1.0
behavioral_face_mask         21351.0  0.070348  0.255739  0.0  0.0  0.0  0.0  1.0
behavioral_wash_hands        21329.0  0.823574  0.381192  0.0  1.0  1.0  1.0  1.0
behavioral_large_gatherings  21293.0  0.357864  0.479383  0.0  0.0  0.0  1.0  1.0
behavioral_outside_home      21306.0  0.337464  0.472856  0.0  0.0  0.0  1.0  1.0
behavioral_touch_face        21263.0  0.675728  0.468113  0.0  0.0  1.0  1.0  1.0
doctor_recc_h1n1             19629.0  0.221662  0.415375  0.0  0.0  0.0  0.0  1.0


In [11]:
# What is the number of unique values in all the categorical features? And what is
# the value with the highest frequency?

df.describe(include=object).T
# df.describe().T

                   count  unique                        top   freq
age_group          21365       5                  65+ Years   5454
education          20240       4           College Graduate   8063
race               21365       4                      White  16974
sex                21365       2                     Female  12748
income_poverty     17851       3  <= $75,000, Above Poverty  10301
marital_status     20245       2                    Married  10880
rent_or_own        19737       2                        Own  15012
employment_status  20203       3                   Employed  10886
hhs_geo_region     21365      10                   lzgpxyit   3406
census_msa         21365       3    MSA, Not Principle City   9268


In [12]:
# How much missing data is in each feature?

df.isna().sum()

h1n1_concern                      73
h1n1_knowledge                    91
behavioral_antiviral_meds         59
behavioral_avoidance             163
behavioral_face_mask              14
behavioral_wash_hands             36
behavioral_large_gatherings       72
behavioral_outside_home           59
behavioral_touch_face            102
doctor_recc_h1n1                1736
doctor_recc_seasonal            1736
chronic_med_condition            771
child_under_6_months             655
health_worker                    643
health_insurance                9858
opinion_h1n1_vacc_effective      318
opinion_h1n1_risk                311
opinion_h1n1_sick_from_vacc      321
opinion_seas_vacc_effective      371
opinion_seas_risk                410
opinion_seas_sick_from_vacc      431
age_group                          0
education                       1125
race                               0
sex                                0
income_poverty                  3514
marital_status                  1120
r
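
Raw missing counts are easier to compare as fractions of the rows. In the output above, `health_insurance` is missing in 9858 of 21365 rows, roughly 46%. The sketch below shows one way to rank columns by missing share. The small `demo` frame and its values are made up for illustration; in the notebook you would apply the same expression to `df`.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame standing in for df (values are made up).
demo = pd.DataFrame({
    "health_insurance": [1.0, np.nan, np.nan, 0.0],
    "age_group": ["65+ Years", "65+ Years", "65+ Years", "65+ Years"],
    "education": ["College Graduate", None, "College Graduate", "College Graduate"],
})

# isna() marks missing cells; the column-wise mean of those booleans is the
# fraction of rows that are missing in each column.
missing_share = demo.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Columns with a very high missing share are candidates for `cols_to_drop`, although that is a modelling choice to check with cross-validation rather than a rule.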

In [13]:
# For convenience, let's save the names of all numeric features to a list,
# and the names of all categorical features to another list.

numeric_features = [
          "h1n1_concern",
          "h1n1_knowledge",
          "behavioral_antiviral_meds",
          "behavioral_avoidance",
          "behavioral_face_mask",
          "behavioral_wash_hands",
          "behavioral_large_gatherings",
          "behavioral_outside_home",
          "behavioral_touch_face",
          "doctor_recc_h1n1",
          "doctor_recc_seasonal",
          "chronic_med_condition",
          "child_under_6_months",
          "health_worker",
          "health_insurance",
          "opinion_h1n1_vacc_effective",
          "opinion_h1n1_risk",
          "opinion_h1n1_sick_from_vacc",
          "opinion_seas_vacc_effective",
          "opinion_seas_risk",
          "opinion_seas_sick_from_vacc",
          "household_adults",
          "household_children",
]

categorical_features = [
    "age_group",
    "education",
    "race",
    "sex",
    "income_poverty",
    "marital_status",
    "rent_or_own",
    "employment_status",
    "hhs_geo_region",
    "census_msa",
    "employment_industry",
    "employment_occupation",
]

cols_to_drop = [
#    "employment_industry",
#    "employment_occupation",
#    "health_insurance",
]

categorical_features = [x for x in categorical_features if x not in cols_to_drop]


In [14]:
# TODO: Can add more EDA here, as desired

# 1. Train/Test Split

Now we randomly split the available data into train and test subsets.

The training data will later be used to build and assess the model on various combinations of hyperparameters.

The testing data will be used as a "final estimate" of a model's performance.
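
The notebook imports `train_test_split` later on, but no cell in this section actually calls it. A minimal sketch of the split described above might look like the following. `X_demo` and `y_demo` are hypothetical stand-ins for the real `X` and `y`, and the 80/20 ratio and `stratify` choice are assumptions, not settings taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real features and label: 100 rows, 20% positives.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"a": rng.normal(size=100), "b": rng.integers(0, 3, size=100)})
y_demo = pd.Series(np.repeat([0, 1], [80, 20]), name="label")

# Hold out 20% of the rows. stratify keeps the class ratio the same in both
# parts, which matters when one class is much rarer than the other.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```

With a held-out test set, any cleaning that learns values from the data (imputation means, scaling ranges) would be fitted on the training part only.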

# 2. Model 1 (A simple DecisionTree model)

As a baseline, we'll do the absolute bare minimum data cleaning and then quickly build a simple Decision Tree.

In [15]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

In [16]:
# Scikit-learn needs us to put the features in one dataframe, and the label in another.
# It's tradition to name these variables X and y, but it doesn't really matter.

X = df.drop('h1n1_vaccine', axis=1)
y = df['h1n1_vaccine']

## 1.1 Cleaning and FE

In [17]:
# instead of dropping all categorical items (as we'll 1-hot encode these later),
# we drop only specific columns
#
# for now - this is empty (see above)

def drop_bad_columns(df):
    return df.drop(cols_to_drop, axis=1, errors='ignore')

def drop_categorical_columns(df):
    return df.drop(categorical_features, axis=1, errors='ignore')

In [18]:
# perform one hot encoding of categorical data

def one_hot_encode_categories(df):
    return pd.get_dummies(df, columns = categorical_features)
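
One thing to watch with `pd.get_dummies`: it only creates columns for the category levels present in the frame it is given. If the competition data lacks a level that the training data has (or the reverse), the two encoded frames end up with different columns. The sketch below shows one possible way to align them with `reindex`; the `train` and `comp` frames are made-up examples, not the real data.

```python
import pandas as pd

# Hypothetical train and competition frames with one categorical column.
train = pd.DataFrame({"score": [1.0, 2.0, 3.0], "sex": ["Female", "Male", "Female"]})
comp = pd.DataFrame({"score": [4.0, 5.0], "sex": ["Female", "Female"]})

train_enc = pd.get_dummies(train, columns=["sex"])
comp_enc = pd.get_dummies(comp, columns=["sex"])

# comp_enc has no "sex_Male" column because that level never occurs in comp.
# reindex adds the missing columns (filled with 0) and drops any extras, so
# both frames end up with the same columns in the same order.
comp_aligned = comp_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(comp_aligned.columns))
```

scikit-learn's `OneHotEncoder(handle_unknown="ignore")` is another option that remembers the training levels for you.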

In [19]:
# impute the missing data from the rest of the dataset

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.impute import SimpleImputer

def impute_missing_values(df):
    # imputer = IterativeImputer(random_state=100, max_iter=6, verbose=2)
    # ^^ DOES NOT HELP MORE THAN SIMPLE IMPUTATION! and takes longer...

    imputer = SimpleImputer(strategy='mean')
    imputer.fit(df)
    imputed_df = imputer.transform(df)
    return pd.DataFrame(imputed_df, columns=df.columns)
#     return pd.DataFrame(imputer.fit_transform(imputed_x), columns=df.columns)
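
`impute_missing_values` fits a fresh imputer on whatever frame it receives, so the competition data would be filled with its own column means rather than the training means. A common alternative, sketched here on made-up frames, is to fit the imputer once and reuse it. The `train` and `new` frames are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training frame and a frame of new, unseen rows.
train = pd.DataFrame({"x": [1.0, 3.0, np.nan], "z": [0.0, np.nan, 4.0]})
new = pd.DataFrame({"x": [np.nan, 10.0], "z": [np.nan, np.nan]})

imputer = SimpleImputer(strategy="mean")
imputer.fit(train)  # learns the column means from the training rows only

# The same fitted imputer fills gaps in both frames with the training means.
train_filled = pd.DataFrame(imputer.transform(train), columns=train.columns)
new_filled = pd.DataFrame(imputer.transform(new), columns=new.columns)
print(new_filled)
```

Reusing the fitted imputer also handles a column that is entirely missing in the new data, which a freshly fitted mean imputer could not fill.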


In [20]:
# Here we rescale every column to the 0.0 to 1.0 range. Note that this
# formula maps each column's max to 0.0 and its min to 1.0 (the reverse of
# the usual min-max convention). Scaling mainly helps distance- and
# gradient-based models (e.g. KNN, logistic regression); tree-based models
# are largely insensitive to it.

def normalize_dataframe(df):
    return df.apply(lambda col: (col.max() - col) / (col.max() - col.min()))
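
The hand-written formula above computes `(max - x) / (max - min)`, so the largest value maps to 0 and the smallest to 1, and a column with a single constant value produces `0 / 0`, i.e. NaN. As a hedged alternative, scikit-learn's `MinMaxScaler` applies the conventional `(x - min) / (max - min)`, handles constant columns, and can be fitted once and reused. The frames below are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training frame (including a constant column) and new rows.
train = pd.DataFrame({"concern": [0.0, 1.0, 3.0], "constant": [5.0, 5.0, 5.0]})
new = pd.DataFrame({"concern": [2.0, 3.0], "constant": [5.0, 5.0]})

scaler = MinMaxScaler()
scaler.fit(train)  # learns each column's min and max from the training rows

# Both frames are scaled with the training min and max, so a given raw value
# always maps to the same scaled value.
train_scaled = pd.DataFrame(scaler.transform(train), columns=train.columns)
new_scaled = pd.DataFrame(scaler.transform(new), columns=new.columns)
print(train_scaled)
print(new_scaled)
```

Swapping this in could change the cross-validation scores reported later in this notebook, so it would need to be re-run and compared.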


In [21]:
# This (currently unused) method selects the best k features
# from our dataset and returns a DataFrame that only uses those
#
# Through experimentation, this seems to perform slightly worse
# so, the method is not used at the moment

from sklearn.feature_selection import chi2, SelectKBest

def select_best_features(df, kvalue = 20):
    selector = SelectKBest(score_func=chi2,k=kvalue)
    ret_val = selector.fit_transform(df, y)
    cols_idxs = selector.get_support(indices=True)
    feature_columns = list(df.iloc[:,cols_idxs].columns)
    return pd.DataFrame(ret_val, columns=feature_columns)
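
`select_best_features` scores the features against the full label vector `y` before any cross-validation happens, so the selection step has already seen the rows that later serve as validation folds. One way to avoid that, sketched below on synthetic data, is to put the selector inside a `Pipeline` so it is refitted on each training fold. The dataset, `k=5`, and the decision tree are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for cleaned_data and y.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=4, random_state=0
)

pipe = Pipeline([
    ("scale", MinMaxScaler()),                      # chi2 needs non-negative inputs
    ("select", SelectKBest(score_func=chi2, k=5)),  # refitted inside every fold
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
])

# cross_validate fits the whole pipeline on each training fold, so feature
# selection never sees the corresponding validation fold.
cv_results = cross_validate(pipe, X_demo, y_demo, cv=5, scoring="f1_macro")
print(cv_results["test_score"].mean())
```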


In [22]:
# Our model cleaning process.
#
# through experimentation, I've observed overall so far:
#  - 1-hot encoding to keep the categorical columns is good.
#  - imputation is necessary
#  - normalization helps
#  - selecting best features only provides slightly worse results

import re

def model_cleaning(df):
    # simple cleaning
    ret = drop_bad_columns(df)
    # ret = drop_categorical_columns(ret)

    # encoding
    ret = one_hot_encode_categories(ret)

    # imputation
    ret = impute_missing_values(ret)
    
    # normalization
    ret = normalize_dataframe(ret)

    # feature reduction
    # ret = select_best_features(ret) # does not help

    # rename some columns to get rid of characters ('[', ']', '<') that
    # XGBoost does not accept in feature names
    regex = re.compile(r"\[|\]|<")
    ret.columns = [regex.sub("_", str(col)) for col in ret.columns]

    return ret

cleaned_data = model_cleaning(X)


## 1.2 Model Creation, Hyperparameter Tuning, and Validation

In [23]:
def run_decision_tree(data, depth):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(data, y)
    cv_results = cross_validate(clf, data, y, cv=5, scoring="f1_macro")
    print(f"depth: {depth} -- The mean CV score is: {np.mean(cv_results['test_score'])}")

print("DecisionTreeClassifier")

for n in range(0,10):
    run_decision_tree(cleaned_data, n+1)

DecisionTreeClassifier
depth: 1 -- The mean CV score is: 0.6959383371869742
depth: 2 -- The mean CV score is: 0.6689248144736056
depth: 3 -- The mean CV score is: 0.6439743444016648
depth: 4 -- The mean CV score is: 0.7049483195810098
depth: 5 -- The mean CV score is: 0.7262965153012413
depth: 6 -- The mean CV score is: 0.7194319711678351
depth: 7 -- The mean CV score is: 0.7272765843219517
depth: 8 -- The mean CV score is: 0.716825215256117
depth: 9 -- The mean CV score is: 0.7207166619789527
depth: 10 -- The mean CV score is: 0.717628084783215
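
The manual loop over `max_depth` above can also be expressed with `GridSearchCV`, which runs the same cross-validation for every candidate and records the best one. This is a sketch on synthetic data (`X_demo`, `y_demo` are stand-ins for `cleaned_data` and `y`), so the depth it picks says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for cleaned_data and y.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Try depths 1 through 10, scoring each with 5-fold macro F1 as in the loop above.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": list(range(1, 11))},
    cv=5,
    scoring="f1_macro",
)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

`search.cv_results_` holds the per-depth mean scores if you want the full table rather than only the winner.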



### Testing MLPClassifier (neural network)

This is commented out at the moment because it is slow, and the best score achieved with it was about 0.7154.

In [24]:
# from sklearn.neural_network import MLPClassifier

# test_model = MLPClassifier(random_state=1, max_iter=1000).fit(cleaned_data, y)
# grid_search_params = {
#     "hidden_layer_sizes": [(25,), (50,), (100,), (150,)]
# }
# print(grid_search_params)
# grid_search = GridSearchCV(test_model, grid_search_params, cv=5, scoring="f1_macro")
# grid_search.fit(cleaned_data, y)
# display(grid_search.best_score_)
# display(grid_search.best_params_)

# layer_size = len(list(cleaned_data.columns))
# display(f"layer size: {layer_size}")
# clf = MLPClassifier(hidden_layer_sizes=(layer_size,), random_state=1, max_iter=4096).fit(cleaned_data,y)
# cv_results = cross_validate(clf, cleaned_data, y, cv=5, scoring="f1_macro")
# print(f"The mean CV score is: {np.mean(cv_results['test_score'])}")


In [25]:
from sklearn.neighbors import KNeighborsClassifier
layer_size = len(list(cleaned_data.columns))
clf = KNeighborsClassifier(n_neighbors=5).fit(cleaned_data,y)
cv_results = cross_validate(clf, cleaned_data, y, cv=5, scoring="f1_macro")
print(f"The mean CV score is: {np.mean(cv_results['test_score'])}")

# test_model = KNeighborsClassifier().fit(cleaned_data, y)
# grid_search_params = {
#     "n_neighbors": list(range(1, 8)),
#     "p": list(np.linspace(1, 4, num=6))
# }
# print(grid_search_params)
# grid_search = GridSearchCV(test_model, grid_search_params, cv=5, scoring="f1_macro")
# grid_search.fit(cleaned_data, y)
# display(grid_search.best_score_)
# display(grid_search.best_params_)


The mean CV score is: 0.6147344123238534


In [26]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=1).fit(cleaned_data,y)
cv_results = cross_validate(clf, cleaned_data, y, cv=5, scoring="f1_macro")
print(f"The mean CV score is: {np.mean(cv_results['test_score'])}")

# test_model = RandomForestClassifier(random_state=1).fit(cleaned_data, y)
# grid_search_params = {
#     "criterion": ["gini", "entropy", "log_loss"],
#     "max_leaf_nodes": [10, 20, 30, 40, None],

# }
# print(grid_search_params)
# grid_search = GridSearchCV(test_model, grid_search_params, cv=5, scoring="f1_macro")
# grid_search.fit(cleaned_data, y)
# display(grid_search.best_score_)
# display(grid_search.best_params_)
# {'criterion': 'entropy', 'max_leaf_nodes': None}

The mean CV score is: 0.7236249769651738


In [27]:
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

def run_gradient_boost(data, learning_rate):
    # clf = GradientBoostingClassifier(loss='log_loss', max_depth=3, learning_rate=learning_rate, random_state=1).fit(data, y)
    clf = HistGradientBoostingClassifier(learning_rate=learning_rate, random_state=1).fit(data, y)
    cv_results = cross_validate(clf, data, y, cv=5, scoring="f1_macro")
    print(f"learning_rate: {learning_rate} -- The mean CV score is: {np.mean(cv_results['test_score'])}")

print("GradientBoostingClassifier")
min_rate = 0.1
max_rate = 0.6
# cleaned_data_all_features = cleaned_data # model_cleaning(X)
# print("-- with all features categories:")

#for learning_rate in np.linspace(min_rate, max_rate, num=10):
#    run_gradient_boost(cleaned_data, learning_rate)    
run_gradient_boost(cleaned_data, 0.1)    


GradientBoostingClassifier
learning_rate: 0.1 -- The mean CV score is: 0.7536670551779296


In [28]:
# test_model = HistGradientBoostingClassifier(random_state=1).fit(cleaned_data, y)
# grid_search_params = {
#     "learning_rate": list(np.linspace(min_rate, max_rate, num=15)),
#     "max_leaf_nodes": [10, 20, 30, 40, None],

# }
# print(grid_search_params)
# grid_search = GridSearchCV(test_model, grid_search_params, cv=5, scoring="f1_macro")
# grid_search.fit(cleaned_data, y)
# display(grid_search.best_score_)
# display(grid_search.best_params_)

In [29]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=1, max_iter=512).fit(cleaned_data, y)
cv_results = cross_validate(clf, cleaned_data, y, cv=5, scoring="f1_macro")
print(f"-- The mean CV score is: {np.mean(cv_results['test_score'])}")


-- The mean CV score is: 0.7127663225531284


In [58]:
import xgboost as xgb

def run_xgboost_boost(data, learning_rate, subsample):
   params = {"eta": learning_rate, 
             "sampling_method": "uniform",
             "subsample": subsample,
             "booster": "gbtree" }
   # dtrain_reg is only needed by the commented-out xgb.train / xgb.cv experiments below
   dtrain_reg = xgb.DMatrix(data.values, y.values)
   clf = xgb.XGBClassifier(objective="binary:hinge", **params).fit(data.to_numpy(), y.to_numpy())
   # model = xgb.train(
   #    params=params,
   #    dtrain=dtrain_reg,
   #    num_boost_round=n,
   # )
   # clf = model.get_booster()
   # score the `data` argument rather than the global cleaned_data
   cv_results = cross_validate(clf, data, y, cv=5, scoring="f1_macro")
   # cv_results = xgb.cv(params, dtrain_reg, 5, nfold=5, metrics={"auc"}, seed=0)
   # print(cv_results)
   # print(f"learning_rate: {learning_rate} -- The mean CV score is: {np.mean(cv_results['test-auc-mean'])}")
   print(f"learning_rate: {learning_rate}, subsample: {subsample} -- The mean CV score is: {np.mean(cv_results['test_score'])}")

print("XGBoost")
# min_rate = 1
# max_rate = 10
# min_subsample = 95
# max_subsample = 100
# for learning_rate in range(min_rate, max_rate):
#    for subsample in range(min_subsample, max_subsample)
#   run_xgboost_boost(cleaned_data, learning_rate*0.05, subsample=1)

test_model = xgb.XGBClassifier(random_state=1).fit(cleaned_data.to_numpy(), y.to_numpy())
grid_search_params = {
    "objective": ["binary:logistic"],
    "learning_rate": [0.11],
    "subsample": [0.4888888888888889], # list(np.linspace(0.48, 0.49, num=50)), # [0.4888888888888889],
    "max_depth": [6],
    "booster": ['gbtree'],
    "n_estimators": [100],
    "eval_metric": ['rmse', 'rmsle', 'mae', 'mape', 'mphe', 'error']
}
print(grid_search_params)
grid_search = GridSearchCV(test_model, grid_search_params, cv=5, scoring="f1_macro")
grid_search.fit(cleaned_data, y)
display(grid_search.best_score_)
display(grid_search.best_params_)

# {'learning_rate': 0.11,
# 'objective': 'binary:logistic',
# 'subsample': 0.4888888888888889
# }



XGBoost
{'objective': ['binary:logistic'], 'learning_rate': [0.11], 'subsample': [0.4888888888888889], 'max_depth': [6], 'booster': ['gbtree'], 'n_estimators': [100]}


0.7602104399043436

{'booster': 'gbtree',
 'learning_rate': 0.11,
 'max_depth': 6,
 'n_estimators': 100,
 'objective': 'binary:logistic',
 'subsample': 0.4888888888888889}

In [31]:
from sklearn.naive_bayes import GaussianNB

def run_naive_bayes(data, smoothing):
    clf = GaussianNB(var_smoothing=smoothing).fit(data, y)
    cv_results = cross_validate(clf, data, y, cv=5, scoring="f1_macro")
    print(f"var_smoothing: {smoothing} -- The mean CV score is: {np.mean(cv_results['test_score'])}")

print("GaussianNB")
for n in range(0, 5):
    run_naive_bayes(cleaned_data, n*0.1)

GaussianNB
var_smoothing: 0.0 -- The mean CV score is: 0.6072375462125044
var_smoothing: 0.1 -- The mean CV score is: 0.6734576349055209
var_smoothing: 0.2 -- The mean CV score is: 0.6902031874263501
var_smoothing: 0.30000000000000004 -- The mean CV score is: 0.6984086514933967
var_smoothing: 0.4 -- The mean CV score is: 0.7008396652419232


### SVM Test

In [32]:
# from sklearn.svm import SVC

# test_model = SVC(random_state=1).fit(cleaned_data.to_numpy(), y.to_numpy())
# grid_search_params = {
#     "C": list(np.linspace(0.5, 2.0, num=10)),
# }
# print(grid_search_params)
# grid_search = GridSearchCV(test_model, grid_search_params, cv=5, scoring="f1_macro")
# grid_search.fit(cleaned_data, y)
# display(grid_search.best_score_)
# display(grid_search.best_params_)


## Feature Selection Tests

**bottom line: does not help! use all the available features**

In [33]:
def run_feature_selection(data, k):
    df = select_best_features(data, kvalue = k)
    # display(list(df.columns))
    clf = xgb.XGBClassifier(random_state=1, objective="binary:logistic", learning_rate=0.11, subsample=0.4888888888888889, max_depth=6).fit(df.to_numpy(), y.to_numpy())
    cv_results = cross_validate(clf, df, y, cv=5, scoring="f1_macro")
    print(f"Feature K: {k} - {np.mean(cv_results['test_score'])}")

for n in range(10, 100, 20):
    run_feature_selection(cleaned_data, n)

for n in range(100, len(cleaned_data.columns)+1):
    run_feature_selection(cleaned_data, n)


Feature K: 10 - 0.735319208061668
Feature K: 30 - 0.7502488646501894
Feature K: 50 - 0.7543781829222638
Feature K: 70 - 0.7568676719239618
Feature K: 90 - 0.7564772301591769
Feature K: 100 - 0.7554416014886319
Feature K: 101 - 0.7563581217023747
Feature K: 102 - 0.753964013088962
Feature K: 103 - 0.754355559591985
Feature K: 104 - 0.7602104399043436
Feature K: 105 - 0.7602104399043436


## day 1 baseline

In [34]:
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
cv_results = cross_validate(clf, cleaned_data, y, cv=5, scoring="f1_macro")
print(np.mean(cv_results['test_score']))

0.6439743444016648


## Final Train Eval

In [35]:
# The best model found so far is XGBClassifier with the grid-search parameters above
# (mean CV f1_macro of about 0.760); earlier candidates are kept below for reference.
# clf = HistGradientBoostingClassifier(learning_rate=0.1, max_leaf_nodes=30, random_state=1).fit(cleaned_data, y)
# clf = GradientBoostingClassifier(loss='log_loss', max_depth=3, learning_rate=0.11, random_state=1).fit(cleaned_data, y)

clf = xgb.XGBClassifier(random_state=1, objective="binary:logistic", learning_rate=0.11, subsample=0.4888888888888889, max_depth=6).fit(cleaned_data.to_numpy(), y.to_numpy())


In [36]:
# We use cross_validate to perform K-fold cross validation for us.

# baseline
cv_results = cross_validate(clf, cleaned_data, y, cv=5, scoring="f1_macro")
print(np.mean(cv_results['test_score']))

# TODO: can also add hyperparameter tuning to explore different values of the algorithms
# hyperparameters, and see how much those affect results.
# See GridSearchCV or RandomizedSearchCV.


0.7602104399043436


In [37]:
# Now that cross validation has completed, we can see what it estimates the performance
# of our model to be.
display(cv_results)
print("The mean CV score is:")
print(np.mean(cv_results['test_score']))

{'fit_time': array([0.17403483, 0.17582321, 0.16091108, 0.14946914, 0.20155978]),
 'score_time': array([0.00780106, 0.01023889, 0.00797486, 0.00858307, 0.00877404]),
 'test_score': array([0.74823176, 0.76650626, 0.77260423, 0.75474824, 0.75896171])}

The mean CV score is:
0.7602104399043436
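
The mean alone hides how much the score moves from fold to fold. A small sketch that reports the spread alongside the mean, using the five `test_score` values printed above:

```python
import numpy as np

# Per-fold f1_macro scores copied from the cross-validation output above.
fold_scores = np.array([0.74823176, 0.76650626, 0.77260423, 0.75474824, 0.75896171])

mean_score = fold_scores.mean()
std_score = fold_scores.std(ddof=1)  # sample standard deviation across folds

print(f"f1_macro: {mean_score:.4f} +/- {std_score:.4f}")
```

When two models differ by less than this fold-to-fold spread, the difference may not be meaningful.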


## 1.4: Create Predictions for Competition Data

Once we are happy with the estimated performance of our model, we can move on to the final step.

First, we train our model one last time, using all available training data (unlike CV, which always uses a subset). This final training will give our model the best chance at the highest performance.

Then, we must load in the (unlabeled) competition data from the cloud and use our model to generate predictions for each instance in that data. We will then output those predictions to a CSV file. We will then send that file to Steve, and he can then tell us how well we did (because he knows the right answers!).

In [38]:
# Our model's "final form"
clf = clf.fit(cleaned_data, y)

In [39]:
X_comp = pd.read_csv("https://drive.google.com/uc?export=download&id=1SmFBoNh7segI1Ky92mfeIe6TpscclMwQ")

# Importantly, we need to perform the same cleaning/transformation steps
# on this competition data as we did on the training data. Otherwise, we will
# get an error and/or unexpected results.

X_comp = model_cleaning(X_comp)

# Use your model to make predictions
pred_comp = clf.predict(X_comp)

my_submission = pd.DataFrame({'predicted': pred_comp})

# Let's take a peek at the results (as a sanity check)
display(my_submission.head(10))

# You could use any filename.
my_submission.to_csv('my_submission.csv', index=False)

# You can now download the above file from Colab (see menu on the left)

   predicted
0          0
1          0
2          0
3          0
4          0
5          1
6          0
7          0
8          0
9          1
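
Before sending the file off, a few automated checks can catch a malformed submission. The helper below is a hypothetical sketch: `check_submission` is not part of the template, and the exact format the competition expects (a single `predicted` column of 0/1 labels, one row per competition instance) is assumed from the cell above.

```python
import numpy as np
import pandas as pd

def check_submission(submission: pd.DataFrame, expected_rows: int) -> None:
    """Raise AssertionError if the submission looks malformed (hypothetical helper)."""
    assert list(submission.columns) == ["predicted"], "unexpected columns"
    assert len(submission) == expected_rows, "row count differs from competition data"
    assert submission["predicted"].isin([0, 1]).all(), "labels must be 0 or 1"

# Made-up predictions standing in for clf.predict(X_comp).
pred_demo = np.array([0, 0, 1, 0, 1])
submission_demo = pd.DataFrame({"predicted": pred_demo})

check_submission(submission_demo, expected_rows=len(pred_demo))

# Share of each predicted class; a model that predicts almost only one class
# is worth a second look.
print(submission_demo["predicted"].value_counts(normalize=True))
```

In the notebook this would be called as `check_submission(my_submission, expected_rows=len(X_comp))`.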


# Model 2 (Your idea Here!)

Here, you can do all the above, but try different ideas:

- Different ML algorithms (e.g., RandomForestClassifier, LGBM, NN)
- Different data cleaning steps (Ordinal encoding, One Hot Encoding, etc.)
- Hyperparameter tuning (using, e.g., GridSearchCV or RandomizedSearchCV)
- Ensembles
- .... anything you can think of!
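
As one concrete starting point for the "Ensembles" idea in the list above, scikit-learn's `VotingClassifier` combines several models by averaging their predicted probabilities. This is a sketch on synthetic data; the choice of base models and their settings is an assumption, not a recommendation from the template.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for cleaned_data and y.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# voting="soft" averages the class probabilities of the base models, so every
# base model must implement predict_proba.
ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
        ("forest", RandomForestClassifier(n_estimators=50, random_state=1)),
        ("logreg", LogisticRegression(max_iter=512)),
    ],
    voting="soft",
)

cv_results = cross_validate(ensemble, X_demo, y_demo, cv=5, scoring="f1_macro")
print(cv_results["test_score"].mean())
```

Ensembles tend to help most when the base models make different kinds of errors, so mixing model families is usually more useful than combining several similar trees.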


Steve's GitHub page is a great place for ideas:

https://github.com/stepthom/869_course

In [40]:
# TODO: Win the competition here!

# Model 3 (Your next idea here!)

In [41]:
# TODO: Win the competition here, too!