## Instructions:

- Put the parts of your code under the corresponding sections. (0.25/2 points will be taken off for not doing this.)
- Do not include any redundant/irrelevant code, text or comments. (0.5/2 points will be taken off for not doing this.)
- **Your code must run without any errors or runtime issues.** (Failure to meet this condition will result in a 0.)
- **Your code must return your Public Leaderboard score.** (Failure to meet this condition will result in a 0.)

## 1) Libraries

Put all the Python libraries and tools you imported here.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score, accuracy_score
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import TunedThresholdClassifierCV
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.metrics import root_mean_squared_error


## 2) Data

- This section must include the code that reads, cleans, and preprocesses the datasets.
- Note that both the training and test datasets should undergo the same sequence of operations.
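
One way to keep the two datasets on the same sequence of operations is to learn the fill values once from the training data and reuse them for the test data. The cells below instead fill each dataset from its own statistics, so this is an alternative, not the notebook's method. It is a minimal sketch: the helper `fit_fill_values` and the toy frames are hypothetical and exist only for illustration.

```python
import pandas as pd


def fit_fill_values(df, median_cols, mode_cols):
    """Learn per-column fill values from one frame (hypothetical helper)."""
    fills = {col: df[col].median() for col in median_cols}
    fills.update({col: df[col].mode()[0] for col in mode_cols})
    return fills


# Toy frames standing in for the real train/test data; values are illustrative only.
toy_train = pd.DataFrame({
    "beds": [1.0, 2.0, None, 3.0],
    "room_type": ["a", "a", None, "b"],
})
toy_test = pd.DataFrame({
    "beds": [None, 4.0],
    "room_type": [None, "b"],
})

# Fit on the training frame only, then apply the same values to both frames.
fills = fit_fill_values(toy_train, median_cols=["beds"], mode_cols=["room_type"])
toy_train_filled = toy_train.fillna(fills)
toy_test_filled = toy_test.fillna(fills)
```

Passing a dict to `fillna` fills each column with its own value, so the test data never contributes to the statistics used for imputation.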

In [3]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [4]:
predictors = train.drop(columns=["host_is_superhost"])
response = train["host_is_superhost"]

In [5]:
# dropping ID and host_about
# host_location is not useful for prediction because it is the host's area of residence, not necessarily the location of the listing
# latitude and longitude provide redundant information about listing location
# the min/max night columns are not useful since they have very arbitrary, often meaningless values
columns_to_drop = ["id", "host_about", "description", "host_location", "longitude", "latitude", "amenities", "host_verifications", "host_neighbourhood", "property_type", "first_review", "last_review", "bathrooms_text", "has_availability", 'maximum_nights',
       'minimum_minimum_nights', 'maximum_minimum_nights',
       'minimum_maximum_nights', 'maximum_maximum_nights',
       'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm']
initial_cleaned = predictors.drop(columns=columns_to_drop)
test = test.drop(columns=columns_to_drop)


In [6]:
review_columns = [
    'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',
    'review_scores_checkin', 'review_scores_communication', 'review_scores_location',
    'review_scores_value'
]

# fill NaN values in review_columns with median
initial_cleaned[review_columns] = initial_cleaned[review_columns].fillna(initial_cleaned[review_columns].median())
test[review_columns] = test[review_columns].fillna(test[review_columns].median())

# fill NaN values in listing_location with mode
initial_cleaned["listing_location"] = initial_cleaned["listing_location"].fillna(initial_cleaned["listing_location"].mode()[0])
test["listing_location"] = test["listing_location"].fillna(test["listing_location"].mode()[0])

# impute NaN values in host_response_time with mode
initial_cleaned["host_response_time"] = initial_cleaned["host_response_time"].fillna(initial_cleaned["host_response_time"].mode()[0])
test["host_response_time"] = test["host_response_time"].fillna(test["host_response_time"].mode()[0])

# convert host_response_rate and host_acceptance_rate to numeric
initial_cleaned["host_response_rate"] = initial_cleaned["host_response_rate"].str.replace("%", "").astype(float) / 100
initial_cleaned["host_acceptance_rate"] = initial_cleaned["host_acceptance_rate"].str.replace("%", "").astype(float) / 100
test["host_response_rate"] = test["host_response_rate"].str.replace("%", "").astype(float) / 100
test["host_acceptance_rate"] = test["host_acceptance_rate"].str.replace("%", "").astype(float) / 100

# impute NaN values in host_response_rate and host_acceptance_rate with median
initial_cleaned["host_response_rate"] = initial_cleaned["host_response_rate"].fillna(initial_cleaned["host_response_rate"].median())
initial_cleaned["host_acceptance_rate"] = initial_cleaned["host_acceptance_rate"].fillna(initial_cleaned["host_acceptance_rate"].median())
test["host_response_rate"] = test["host_response_rate"].fillna(test["host_response_rate"].median())
test["host_acceptance_rate"] = test["host_acceptance_rate"].fillna(test["host_acceptance_rate"].median())

# impute NaN values in bedrooms and beds with median
initial_cleaned["bedrooms"] = initial_cleaned["bedrooms"].fillna(initial_cleaned["bedrooms"].median())
initial_cleaned["beds"] = initial_cleaned["beds"].fillna(initial_cleaned["beds"].median())
test["bedrooms"] = test["bedrooms"].fillna(test["bedrooms"].median())
test["beds"] = test["beds"].fillna(test["beds"].median())

# impute NaN values in host_listings_count and host_total_listings_count with median
initial_cleaned["host_listings_count"] = initial_cleaned["host_listings_count"].fillna(initial_cleaned["host_listings_count"].median())
initial_cleaned["host_total_listings_count"] = initial_cleaned["host_total_listings_count"].fillna(initial_cleaned["host_total_listings_count"].median())
test["host_listings_count"] = test["host_listings_count"].fillna(test["host_listings_count"].median())
test["host_total_listings_count"] = test["host_total_listings_count"].fillna(test["host_total_listings_count"].median())

# impute NaN values in host_has_profile_pic and host_identity_verified with mode
initial_cleaned["host_has_profile_pic"] = initial_cleaned["host_has_profile_pic"].fillna(initial_cleaned["host_has_profile_pic"].mode()[0])
initial_cleaned["host_identity_verified"] = initial_cleaned["host_identity_verified"].fillna(initial_cleaned["host_identity_verified"].mode()[0])
test["host_has_profile_pic"] = test["host_has_profile_pic"].fillna(test["host_has_profile_pic"].mode()[0])
test["host_identity_verified"] = test["host_identity_verified"].fillna(test["host_identity_verified"].mode()[0])

# impute the NaN value in host_since column with the mode
initial_cleaned["host_since"] = initial_cleaned["host_since"].fillna(initial_cleaned["host_since"].mode()[0])
test["host_since"] = test["host_since"].fillna(test["host_since"].mode()[0])

# impute NaN values in reviews_per_month with median
initial_cleaned["reviews_per_month"] = initial_cleaned["reviews_per_month"].fillna(initial_cleaned["reviews_per_month"].median())
test["reviews_per_month"] = test["reviews_per_month"].fillna(test["reviews_per_month"].median())




In [7]:
# convert host_since to datetime, extract the year, and drop the original column
initial_cleaned["host_since_year"] = pd.to_datetime(initial_cleaned["host_since"]).dt.year
initial_cleaned.drop(columns=["host_since"], inplace=True)
test["host_since_year"] = pd.to_datetime(test["host_since"]).dt.year
test.drop(columns=["host_since"], inplace=True)

In [8]:
from sklearn.preprocessing import OrdinalEncoder

one_hot_encoder = OneHotEncoder(sparse_output=False, drop='first')

# one hot encode host_response_time, listing_location, room_type
one_hot_columns = ["host_response_time", "listing_location", "room_type"]
encoded_one_hot_train = one_hot_encoder.fit_transform(initial_cleaned[one_hot_columns])
encoded_one_hot_df = pd.DataFrame(
    encoded_one_hot_train,
    columns=one_hot_encoder.get_feature_names_out(one_hot_columns),
    index=initial_cleaned.index
)
encoded_one_hot_test = one_hot_encoder.transform(test[one_hot_columns])
encoded_one_hot_test_df = pd.DataFrame(
    encoded_one_hot_test,
    columns=one_hot_encoder.get_feature_names_out(one_hot_columns),
    index=test.index
)

# ordinal encode neighbourhood_cleansed and host_since_year
ordinal_encode_cols = ["neighbourhood_cleansed", "host_since_year"]
ordinal_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
ordinal_encoder.fit(initial_cleaned[ordinal_encode_cols])
initial_cleaned[ordinal_encode_cols] = ordinal_encoder.transform(initial_cleaned[ordinal_encode_cols])
test[ordinal_encode_cols] = ordinal_encoder.transform(test[ordinal_encode_cols])


# convert boolean columns to int
bool_columns = [
    "host_has_profile_pic",
    "host_identity_verified",
    "instant_bookable"
]
for col in bool_columns:
    initial_cleaned[col] = initial_cleaned[col].astype(int)
    test[col] = test[col].astype(int)


# concatenate the encoded columns with the original dataframe
cleaned_data_train = pd.concat([initial_cleaned, encoded_one_hot_df], axis=1)
cleaned_data_test = pd.concat([test, encoded_one_hot_test_df], axis=1)

# drop original columns after encoding
cols_to_drop = one_hot_columns
cleaned_data_train.drop(columns=cols_to_drop, inplace=True)
cleaned_data_test.drop(columns=cols_to_drop, inplace=True)

In [9]:
X_train = cleaned_data_train
y_train = response

In [10]:
X_test = cleaned_data_test

In [14]:
from sklearn.ensemble import BaggingClassifier

param_grid = {
    'n_estimators':      [50, 100, 150, 200],
    'max_samples':       [0.5, 0.75, 1.0],
    'max_features':      [0.3, 0.5, 0.75, 1.0],
    'bootstrap':         [True],              
    'bootstrap_features':[True, False],
}

base_model = DecisionTreeClassifier(random_state=42)
bagging_model = BaggingClassifier(
    estimator=base_model,
    oob_score=True,
    random_state=42,
    n_jobs=-1
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

gscv = GridSearchCV(
    bagging_model,
    param_grid=param_grid,
    scoring='roc_auc',
    cv=cv,
    n_jobs=-1,
    verbose=2,
)

gscv.fit(X_train, y_train)

print("Best params:", gscv.best_params_)
print("Best CV ROC AUC:", gscv.best_score_)

Fitting 5 folds for each of 96 candidates, totalling 480 fits

## 3) Machine Learning Model

- This section must train the **already tuned** model and use it to obtain the test predictions (or prediction probabilities).
- As stated in the instructions, your code must not have any runtime issues, so **do NOT include your grid search here!** You still need to tune your model to pass the thresholds, but keep that tuning as your personal work.

In [15]:
best_model = gscv.best_estimator_
y_test_proba = best_model.predict_proba(X_test)[:, 1]

## 4) Exporting the Predictions

Include the code that (1) puts the predictions in the format that Kaggle understands and (2) exports it as a csv file.
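
Before uploading, it may help to confirm that the exported file has the expected shape. The sketch below assumes the two-column layout used in this notebook (`id` and `predicted`). The ids and probabilities are made-up values, and the CSV is written to an in-memory buffer rather than to disk.

```python
import io

import numpy as np
import pandas as pd

# Illustrative ids and probabilities; the real ones come from test.csv and the model.
demo_ids = pd.Series([101, 102, 103], name="id")
demo_proba = np.array([0.12, 0.87, 0.5])

demo_submission = pd.DataFrame({"id": demo_ids, "predicted": demo_proba})

# Round-trip through CSV in memory to check what would be uploaded.
buffer = io.StringIO()
demo_submission.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Basic sanity checks: column names, row count, and valid probabilities.
assert list(reloaded.columns) == ["id", "predicted"]
assert len(reloaded) == len(demo_ids)
assert reloaded["predicted"].between(0, 1).all()
```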

In [16]:
submission = pd.read_csv('test.csv')
submission = submission[["id"]]
submission["predicted"] = y_test_proba

submission.to_csv('submission.csv', index=False)