## Instructions:

- Put the parts of your code under the corresponding sections. (0.25/2 points will be taken off for not doing this.)
- Do not include any redundant/irrelevant code, text or comments. (0.5/2 points will be taken off for not doing this.)
- **Your code must run without any errors or runtime issues.** (Failure to meet this condition will result in a 0.)
- **Your code must return your Public Leaderboard score.** (Failure to meet this condition will result in a 0.)
- **Submit both your ipynb and your html file for grading purposes.**

## 1) Libraries

Put all the Python libraries and tools you imported here.

In [241]:
import pandas as pd
import numpy as np
from xgboost import XGBClassifier

## 2) Data

- This section is required to include the code that reads, cleans and preprocesses the datasets.
- Note that both the training and test datasets should undergo the same sequence of operations.
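
One way to guarantee that both frames go through the same steps is to wrap each step in a function and call it on train and test alike. The sketch below is illustrative only: `percent_to_float` and the toy frames are hypothetical and are not part of the assignment data.

```python
import pandas as pd

def percent_to_float(df, cols):
    """Return a copy of df with '97%'-style strings in cols parsed to floats."""
    out = df.copy()
    for col in cols:
        out[col] = out[col].str.rstrip('%').astype(float)
    return out

# Toy frames standing in for the real train/test data (hypothetical values).
toy_train = pd.DataFrame({'host_response_rate': ['100%', '85%', None]})
toy_test = pd.DataFrame({'host_response_rate': ['90%', None]})

# The same function is applied to both frames, so the steps cannot drift apart.
rate_cols = ['host_response_rate']
toy_train = percent_to_float(toy_train, rate_cols)
toy_test = percent_to_float(toy_test, rate_cols)
```

Because the helper returns a copy, the original frames are left untouched until they are reassigned.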

In [242]:
train = pd.read_csv("train_classification.csv")
test = pd.read_csv("test_classification.csv")

### 2-1 Pre-cleaning

In [243]:
# host_response_rate, host_acceptance_rate - remove % and convert to float
for col in ['host_response_rate', 'host_acceptance_rate']:
    train[col] = train[col].str.rstrip('%').astype(float)
    test[col] = test[col].str.rstrip('%').astype(float)

# convert datetime objects to integers
date_cols = ['host_since', 'last_review', 'first_review']
for col in date_cols:
    train[col] = pd.to_datetime(train[col], errors='coerce')
    test[col] = pd.to_datetime(test[col], errors='coerce')

today = train['last_review'].max()
for col in date_cols:
    train[col] = (today - train[col]).dt.days
    test[col] = (today - test[col]).dt.days

train['host_since'] = train['host_since'] / 365
test['host_since'] = test['host_since'] / 365

In [244]:
relevant_columns = train.columns.to_list()
relevant_columns.remove('id') # unique identifier does not help
relevant_columns.remove('host_location') # host location does not matter
relevant_columns.remove('host_neighbourhood') # host neighbourhood does not matter
relevant_columns.remove('host_is_superhost') # target variable
relevant_columns.remove('has_availability') # only one value
relevant_columns.remove('first_review') # superhost depends on most recent activity

cols_to_remove = [
    'availability_30', 'availability_60', 'availability_90', 'availability_365',
    'number_of_reviews_l30d',
    'calculated_host_listings_count_entire_homes',
    'calculated_host_listings_count_private_rooms',
    'calculated_host_listings_count_shared_rooms',
    'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights'
] # redundancy

relevant_columns = [col for col in relevant_columns if col not in cols_to_remove]


In [245]:
# For train
train['complete'] = (
    train[['description', 'host_response_rate', 'host_acceptance_rate', 
           'review_scores_rating', 'beds']]
    .notna()
    .all(axis=1)
    .astype(int)
)

# For test
test['complete'] = (
    test[['description', 'host_response_rate', 'host_acceptance_rate', 
          'review_scores_rating', 'beds']]
    .notna()
    .all(axis=1)
    .astype(int)
)
relevant_columns.append('complete')

### 2-2 Missing Value Imputation

In [246]:
missing_cols = train[relevant_columns].columns[train[relevant_columns].isnull().any()].tolist()
numeric_missing_cols = [col for col in missing_cols if np.issubdtype(train[col].dtype, np.number)]
non_numeric_missing_cols = [col for col in missing_cols if not np.issubdtype(train[col].dtype, np.number)]

train[numeric_missing_cols] = train[numeric_missing_cols].apply(lambda x: x.fillna(x.median()))
test[numeric_missing_cols] = test[numeric_missing_cols].apply(lambda x: x.fillna(x.median()))

for col in non_numeric_missing_cols:
    mode_value_train = train[col].mode()[0]
    train[col] = train[col].fillna(mode_value_train)
    mode_value_test = test[col].mode()[0] 
    test[col] = test[col].fillna(mode_value_test)

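
The cell above fills each frame with its own medians and modes. A common alternative is to compute the fill values on the training data only and reuse them for the test data, so both frames receive identical replacements. The following is a minimal sketch of that alternative on toy data. `fit_fill_values` and `apply_fill_values` are hypothetical helpers, and switching to this approach would change the submitted predictions.

```python
import pandas as pd

def fit_fill_values(df, cols):
    """Median for numeric columns, most frequent value otherwise."""
    fill = {}
    for col in cols:
        if pd.api.types.is_numeric_dtype(df[col]):
            fill[col] = df[col].median()
        else:
            fill[col] = df[col].mode()[0]
    return fill

def apply_fill_values(df, fill):
    """Return a copy of df with missing values replaced column by column."""
    return df.fillna(value=fill)

# Toy frames with hypothetical values.
toy_train = pd.DataFrame({
    'beds': [1.0, 2.0, None, 3.0],
    'room_type': ['Private room', None, 'Private room', 'Entire home/apt'],
})
toy_test = pd.DataFrame({
    'beds': [None, 4.0],
    'room_type': [None, 'Entire home/apt'],
})

# Statistics come from the training frame only and are reused for test.
fill_values = fit_fill_values(toy_train, ['beds', 'room_type'])
toy_train = apply_fill_values(toy_train, fill_values)
toy_test = apply_fill_values(toy_test, fill_values)
```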


### 2-3 Object Type Predictor Feature Engineering

In [247]:
# description
train['description'] = train['description'].fillna('').apply(lambda x: len(x.split()))
test['description'] = test['description'].fillna('').apply(lambda x: len(x.split()))

# host_about
train['host_about'] = train['host_about'].fillna('').apply(lambda x: len(x.split()))
test['host_about'] = test['host_about'].fillna('').apply(lambda x: len(x.split()))

# host_location
def cleanse_host_location(location):
    if pd.isna(location):
        return np.nan
    location = location.lower()
    if "il" in location or "illinois" in location:
        return "chicago"
    elif "north carolina" in location or "nc" in location:
        return "asheville"
    elif "hi" in location or "hawaii" in location:
        return "kauai"
    else:
        return "other"

train['host_location_cleansed'] = train['host_location'].apply(cleanse_host_location)
test['host_location_cleansed'] = test['host_location'].apply(cleanse_host_location)

train['host_same_state'] = (train['host_location_cleansed'] == train['listing_location']).astype(int)
test['host_same_state'] = (test['host_location_cleansed'] == test['listing_location']).astype(int)

train['same_state_chicago'] = (
    (train['listing_location'] == 'chicago') & (train['host_same_state'] == 1)
).astype(int)

train['same_state_kauai'] = (
    (train['listing_location'] == 'kauai') & (train['host_same_state'] == 1)
).astype(int)

test['same_state_chicago'] = (
    (test['listing_location'] == 'chicago') & (test['host_same_state'] == 1)
).astype(int)

test['same_state_kauai'] = (
    (test['listing_location'] == 'kauai') & (test['host_same_state'] == 1)
).astype(int)

relevant_columns.append('host_same_state')
relevant_columns.append('same_state_chicago')
relevant_columns.append('same_state_kauai')

# listing_location -- Leave as it is

# host_response_time
ordinal_map = {
    "a few days or more": 0,
    "within a day": 1,
    "within a few hours": 2,
    "within an hour": 3
}
train["host_response_time"] = train["host_response_time"].map(ordinal_map)
test["host_response_time"] = test["host_response_time"].map(ordinal_map)

# host_verifications
import ast
train['host_verifications_list'] = train['host_verifications'].apply(ast.literal_eval)
test['host_verifications_list'] = test['host_verifications'].apply(ast.literal_eval)

train['email'] = train['host_verifications_list'].apply(lambda x: int('email' in x))
train['phone'] = train['host_verifications_list'].apply(lambda x: int('phone' in x))
train['work_email'] = train['host_verifications_list'].apply(lambda x: int('work_email' in x))

test['email'] = test['host_verifications_list'].apply(lambda x: int('email' in x))
test['phone'] = test['host_verifications_list'].apply(lambda x: int('phone' in x))
test['work_email'] = test['host_verifications_list'].apply(lambda x: int('work_email' in x))

relevant_columns.append('email')
relevant_columns.append('phone')
relevant_columns.append('work_email')
relevant_columns.remove('host_verifications')

# neighbourhood_cleansed
neighbourhood_counts = train['neighbourhood_cleansed'].value_counts()
global_mean = train['host_is_superhost'].mean()
neighbourhood_means = train.groupby('neighbourhood_cleansed')['host_is_superhost'].mean()

train['neighbourhood_cleansed_group'] = train['neighbourhood_cleansed'].map(
    lambda x: neighbourhood_means[x] if neighbourhood_counts[x] >= 10 else global_mean)

test['neighbourhood_cleansed_group'] = test['neighbourhood_cleansed'].map(
    lambda x: neighbourhood_means.get(x, global_mean))

relevant_columns.append('neighbourhood_cleansed_group')
relevant_columns.remove('neighbourhood_cleansed')

# property_type
property_counts = train['property_type'].value_counts()
property_means = train.groupby('property_type')['host_is_superhost'].mean()

train['property_type_group'] = train['property_type'].map(
    lambda x: property_means[x] if property_counts[x] >= 10 else global_mean)

test['property_type_group'] = test['property_type'].map(
    lambda x: property_means.get(x, global_mean))

relevant_columns.append('property_type_group')
relevant_columns.remove('property_type')

# room_type - leave as it is

# bathrooms_text
train['bathrooms_text'] = train['bathrooms_text'].str.lower().str.replace(r'half[- ]bath', '0.5', regex=True)
test['bathrooms_text'] = test['bathrooms_text'].str.lower().str.replace(r'half[- ]bath', '0.5', regex=True)

train['num_bathrooms'] = train['bathrooms_text'].str.extract(r'(\d+\.?\d*)').astype(float)
test['num_bathrooms'] = test['bathrooms_text'].str.extract(r'(\d+\.?\d*)').astype(float)

def bathroom_type(bathroom):
    # True when the bathroom description mentions "shared"
    return "shared" in str(bathroom).lower()

train['shared_bathroom'] = train['bathrooms_text'].apply(bathroom_type)
test['shared_bathroom'] = test['bathrooms_text'].apply(bathroom_type)

relevant_columns.append('num_bathrooms')
relevant_columns.append('shared_bathroom')
relevant_columns.remove('bathrooms_text')

# amenities - new
def parse_amenities(x):
    if isinstance(x, list):
        return [item.lower() for item in x]
    elif isinstance(x, str):
        try:
            items = ast.literal_eval(x)
            return [item.lower() for item in items]
        except Exception:  # unparseable amenities string
            return []
    else:
        return []

def add_amenity_indicators(df):
    df = df.copy()
    df['parsed_amenities'] = df['amenities'].apply(parse_amenities)
    df['has_netflix'] = df['parsed_amenities'].apply(lambda x: int(any('netflix' in amenity for amenity in x)))
    df['has_tv_wifi'] = df['parsed_amenities'].apply(lambda x: int(all(amenity in x for amenity in ['tv', 'wifi'])))
    df['self_check_in'] = df['parsed_amenities'].apply(lambda x: int(any('check-in' in amenity for amenity in x)))
    df['has_coffee'] = df['parsed_amenities'].apply(lambda x: int(any('coffee' in amenity for amenity in x)))
    df['has_kitchen'] = df['parsed_amenities'].apply(lambda x: int(any('kitchen' in amenity for amenity in x)))
    df['has_tub'] = df['parsed_amenities'].apply(lambda x: int(any('tub' in amenity for amenity in x)))
    df.drop(columns='parsed_amenities', inplace=True)
    return df

train = add_amenity_indicators(train)
test = add_amenity_indicators(test)


new_features = ['has_netflix','has_tv_wifi','self_check_in','has_coffee','has_kitchen','has_tub']
relevant_columns += new_features

# amenities - convert to len of list
train['amenities'] = train['amenities'].apply(lambda x: x.split(', '))
test['amenities'] = test['amenities'].apply(lambda x: x.split(', '))
train['amenities'] = train['amenities'].apply(len)
test['amenities'] = test['amenities'].apply(len)
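
`cleanse_host_location` above uses plain substring tests, so the short codes "il", "nc", and "hi" also match inside unrelated words. For example, "hi" occurs in "ohio", so a host in Ohio is labelled "kauai". If that is not intended, one option is to match whole words with a regular expression. This is a sketch of that variant, not the function used for the submitted score. `cleanse_host_location_strict` is a hypothetical name, it returns `None` rather than `np.nan` for missing input, and adopting it would change the engineered features.

```python
import re

# Whole-word patterns, using the same keywords as the original function.
LOCATION_PATTERNS = [
    ("chicago", re.compile(r"\b(il|illinois)\b")),
    ("asheville", re.compile(r"\b(nc|north carolina)\b")),
    ("kauai", re.compile(r"\b(hi|hawaii)\b")),
]

def cleanse_host_location_strict(location):
    """Map a free-text host location to a market label using whole-word matches."""
    if location is None or location != location:  # None or NaN
        return None
    text = location.lower()
    for label, pattern in LOCATION_PATTERNS:
        if pattern.search(text):
            return label
    return "other"

print(cleanse_host_location_strict("Chicago, IL"))     # chicago
print(cleanse_host_location_strict("Columbus, Ohio"))  # other
print(cleanse_host_location_strict("Kapaa, HI"))       # kauai
```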

### 2-4 New Variables

In [248]:
# Combine according to Superhost criteria
train['superhost_criteria'] = (
    (train['host_response_rate'] >= 90) & 
    (train['review_scores_rating'] >= 4.8)
).astype(int)

test['superhost_criteria'] = (
    (test['host_response_rate'] >= 90) & 
    (test['review_scores_rating'] >= 4.8)
).astype(int)

relevant_columns.append('superhost_criteria')

# active
train['active'] = train['last_review'].apply(lambda x: 1 if x <= 100 else 0)
test['active'] = test['last_review'].apply(lambda x: 1 if x <= 100 else 0)
relevant_columns.append('active')

### 2-5 Encoding

In [249]:
# Booleans
booleans = train[relevant_columns].select_dtypes(include='bool').columns.tolist()
for col in booleans:
    train[col] = train[col].astype(int)
    test[col] = test[col].astype(int)

# Other
cols_to_dummy = train[relevant_columns].select_dtypes(include='object').columns.tolist()

X_train_onehot = pd.get_dummies(train[cols_to_dummy], columns=cols_to_dummy).astype(int)
X_test_onehot = (
    pd.get_dummies(test[cols_to_dummy], columns=cols_to_dummy)
    .astype(int)
    .reindex(columns=X_train_onehot.columns, fill_value=0)  # match the training columns
)

train = pd.concat([train, X_train_onehot], axis=1)
test = pd.concat([test, X_test_onehot], axis=1)

relevant_columns = relevant_columns + X_train_onehot.columns.tolist()
relevant_columns = [col for col in relevant_columns if col not in cols_to_dummy]
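
Encoding the two frames separately yields different dummy columns whenever a category appears in only one of them. Below is a minimal sketch of aligning the test dummies to the training columns with `DataFrame.reindex`. The toy frames and their category values are hypothetical.

```python
import pandas as pd

toy_train = pd.DataFrame({'room_type': ['Entire home/apt', 'Private room', 'Shared room']})
toy_test = pd.DataFrame({'room_type': ['Private room', 'Hotel room']})

train_onehot = pd.get_dummies(toy_train, columns=['room_type']).astype(int)
test_onehot = pd.get_dummies(toy_test, columns=['room_type']).astype(int)

# Keep exactly the training columns: categories missing from test become 0,
# and categories seen only in test are dropped.
test_onehot = test_onehot.reindex(columns=train_onehot.columns, fill_value=0)
```

After alignment, selecting the training feature names from the test frame cannot raise a `KeyError` for a missing dummy column.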

## 3) Machine Learning Model

- This section is required to train the **already tuned** model and obtain the test predictions (or prediction probabilities) with it.
- As written in the instructions, your code must not have any runtime issues, so **do NOT include your grid search here!** You will still need to tune your model to pass the thresholds. However, you need to keep that as your personal work and should NOT include the grid search here.

In [250]:
X_train = train[relevant_columns]
X_test = test[relevant_columns]
y_train = train['host_is_superhost']

In [251]:
class_counts = y_train.value_counts()
scale_pos_weight = class_counts[0] / class_counts[1]
print(scale_pos_weight)

1.0111089829534572


In [252]:
model = XGBClassifier(random_state = 12, objective = 'binary:logistic', scale_pos_weight = scale_pos_weight)
model.fit(X_train, y_train)
importances = model.feature_importances_

feature_importance_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance': importances
})

feature_importance_df = feature_importance_df.sort_values(by='importance', ascending=False)

selected = feature_importance_df[feature_importance_df['importance'] > 0]['feature']
X_train_selected = X_train[selected]

In [253]:
model = XGBClassifier(
    random_state = 12, 
    objective = 'binary:logistic', 
    scale_pos_weight = scale_pos_weight,
    colsample_bytree = 0.5,
    learning_rate = 0.1, 
    max_depth = 8,
    n_estimators = 250,
    reg_lambda = 1,
    subsample =  0.75
    )

model.fit(X_train_selected, y_train)

## 4) Exporting the Predictions

Include the code that (1) puts the predictions in the format that Kaggle understands and (2) exports it as a csv file.

In [254]:
y_prob = model.predict_proba(X_test[selected])[:, 1]
test_ids = test['id']
predictions_df = pd.DataFrame({'id': test_ids, 'predicted': y_prob})
predictions_df.to_csv('classification_predictions.csv', index=False)
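
Before uploading, it can help to check that the file has the expected shape. The sketch below assumes the format used above (an `id` column and a `predicted` probability column). `check_submission` is a hypothetical helper, and the exact Kaggle requirements should be taken from the competition page.

```python
import pandas as pd

def check_submission(df, expected_ids):
    """Raise AssertionError if the submission frame looks malformed."""
    assert list(df.columns) == ['id', 'predicted'], "unexpected columns"
    assert df['id'].is_unique, "duplicate ids"
    assert set(df['id']) == set(expected_ids), "ids do not match the test set"
    # between() is False for NaN, so missing predictions are caught as well.
    assert df['predicted'].between(0, 1).all(), "probabilities outside [0, 1]"
    return True

# Toy submission with hypothetical ids and probabilities.
toy_submission = pd.DataFrame({'id': [101, 102, 103],
                               'predicted': [0.12, 0.87, 0.50]})
result = check_submission(toy_submission, expected_ids=[101, 102, 103])
```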