## Instructions:

- Put the parts of your code under the corresponding sections. (0.25/2 points will be taken off for not doing this.)
- Do not include any redundant/irrelevant code, text or comments. (0.5/2 points will be taken off for not doing this.)
- **Your code must run without any errors or runtime issues.** (Failure to meet this condition will result in a 0.)
- **Your code must return your Public Leaderboard score.** (Failure to meet this condition will result in a 0.)
- **Submit both your ipynb and your html file for grading purposes.**

## 1) Libraries

Put all the Python libraries and tools you imported here.

In [1]:
import pandas as pd
import numpy as np
from xgboost import XGBRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

## 2) Data

- This section is required to include the code that reads, cleans and preprocesses the datasets.
- Note that both the training and test datasets should undergo the same sequence of operations.
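
One way to satisfy the second point is to wrap each cleaning step in a function and call it on both frames. The sketch below is illustrative only: `clean_percent_columns` is a hypothetical helper, and the two small frames are made-up data, not the competition files.

```python
import pandas as pd

def clean_percent_columns(df, cols):
    """Return a copy of df with '95%'-style strings in cols converted to floats."""
    out = df.copy()
    for col in cols:
        out[col] = out[col].str.rstrip('%').astype(float)
    return out

# Made-up stand-ins for the real train/test files
demo_train = pd.DataFrame({'host_response_rate': ['95%', '80%', None]})
demo_test = pd.DataFrame({'host_response_rate': ['100%', None, '50%']})

# Same function, same arguments, applied to both frames
demo_train = clean_percent_columns(demo_train, ['host_response_rate'])
demo_test = clean_percent_columns(demo_test, ['host_response_rate'])
```

Because the helper returns a copy, the original frames are left untouched until the result is assigned back.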

In [2]:
train = pd.read_csv("train_regression.csv")
test = pd.read_csv("test_regression.csv")

### 2-1 Pre-cleaning

In [3]:
# host_response_rate, host_acceptance_rate - remove % and convert to float
for col in ['host_response_rate', 'host_acceptance_rate']:
    train[col] = train[col].str.rstrip('%').astype(float)
    test[col] = test[col].str.rstrip('%').astype(float)

# price - remove $ and convert to float
train['price'] = train['price'].replace(r'[\$,]', '', regex=True).astype(float)

# convert datetime objects to integers
date_cols = ['host_since', 'first_review', 'last_review']
for col in date_cols:
    train[col] = pd.to_datetime(train[col], errors='coerce')
    test[col] = pd.to_datetime(test[col], errors='coerce')

today = train['last_review'].max()
for col in date_cols:
    train[col] = (today - train[col]).dt.days
    test[col] = (today - test[col]).dt.days

train['host_since'] = train['host_since'] / 365
test['host_since'] = test['host_since'] / 365

# bathrooms_text - extract the number of bathrooms and a shared-bathroom flag
train['bathrooms_text'] = train['bathrooms_text'].str.lower().str.replace(r'half[- ]bath', '0.5', regex=True)
test['bathrooms_text'] = test['bathrooms_text'].str.lower().str.replace(r'half[- ]bath', '0.5', regex=True)

train['num_bathrooms'] = train['bathrooms_text'].str.extract(r'(\d+\.?\d*)').astype(float)
test['num_bathrooms'] = test['bathrooms_text'].str.extract(r'(\d+\.?\d*)').astype(float)

def bathroom_type(bathroom):
    # True when the bathroom description mentions "shared"
    return "shared" in str(bathroom).lower()

train['shared_bathroom'] = train['bathrooms_text'].apply(bathroom_type)
test['shared_bathroom'] = test['bathrooms_text'].apply(bathroom_type)

In [4]:
relevant_columns = train.columns.to_list()
relevant_columns.remove('id') # unique identifier does not help
relevant_columns.remove('description') # hard to extract info
relevant_columns.remove('host_location') # host location does not matter
relevant_columns.remove('host_about') # hard to extract info
relevant_columns.remove('host_neighbourhood') # host neighbourhood does not matter
relevant_columns.remove('price') # target variable
relevant_columns.remove('has_availability') # only one value
relevant_columns.remove('bathrooms_text') # already extracted information

### 2-2 Missing Value Imputation

In [5]:
missing_cols = [
    col for col in relevant_columns
    if train[col].isnull().any() or test[col].isnull().any()
]
numeric_missing_cols = [col for col in missing_cols if np.issubdtype(train[col].dtype, np.number)]
non_numeric_missing_cols = [col for col in missing_cols if not np.issubdtype(train[col].dtype, np.number)]

train[numeric_missing_cols] = train[numeric_missing_cols].apply(lambda x: x.fillna(x.median()))
test[numeric_missing_cols] = test[numeric_missing_cols].apply(lambda x: x.fillna(x.median()))

for col in non_numeric_missing_cols:
    mode_value_train = train[col].mode()[0]
    train[col] = train[col].fillna(mode_value_train)
    mode_value_test = test[col].mode()[0] 
    test[col] = test[col].fillna(mode_value_test)
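
The cell above fills each dataset with its own medians and modes. A common alternative, shown here only as a sketch and not as the method used for this submission, is to compute the fill values on the training data and reuse them for the test data, so both sets receive identical values. `fit_fill_values` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

def fit_fill_values(df, cols):
    """Median for numeric columns, mode for all other columns, computed on df."""
    fills = {}
    for col in cols:
        if pd.api.types.is_numeric_dtype(df[col]):
            fills[col] = df[col].median()
        else:
            fills[col] = df[col].mode()[0]
    return fills

# Made-up data with gaps in both frames
toy_train = pd.DataFrame({
    'beds': [1.0, 2.0, np.nan, 3.0],
    'room_type': ['Private room', 'Entire home/apt', 'Entire home/apt', None],
})
toy_test = pd.DataFrame({
    'beds': [np.nan, 4.0],
    'room_type': [None, 'Private room'],
})

# Learn the fill values on train only, then apply them to both frames
fills = fit_fill_values(toy_train, ['beds', 'room_type'])
toy_train = toy_train.fillna(fills)
toy_test = toy_test.fillna(fills)
```

Switching the notebook to this scheme would change the imputed test values and therefore the predictions, so it is a design option to evaluate separately.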



In [6]:
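# Completeness flag. Note: this cell runs after the imputation in 2-2, so of the
# columns listed below only `description` can still be missing at this point.
# For train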
train['complete'] = (
    train[['description', 'host_response_rate', 'host_acceptance_rate', 
           'review_scores_rating', 'beds']]
    .notna()
    .all(axis=1)
    .astype(int)
)

# For test
test['complete'] = (
    test[['description', 'host_response_rate', 'host_acceptance_rate', 
          'review_scores_rating', 'beds']]
    .notna()
    .all(axis=1)
    .astype(int)
)
relevant_columns.append('complete')

### 2-3 Object Type Predictor Feature Engineering

In [7]:
# listing_location -- Leave as it is

# host_response_time -- Leave as it is

# host_verifications
import ast
train['host_verifications_list'] = train['host_verifications'].apply(ast.literal_eval)
test['host_verifications_list'] = test['host_verifications'].apply(ast.literal_eval)

train['email'] = train['host_verifications_list'].apply(lambda x: int('email' in x))
train['phone'] = train['host_verifications_list'].apply(lambda x: int('phone' in x))
train['work_email'] = train['host_verifications_list'].apply(lambda x: int('work_email' in x))

test['email'] = test['host_verifications_list'].apply(lambda x: int('email' in x))
test['phone'] = test['host_verifications_list'].apply(lambda x: int('phone' in x))
test['work_email'] = test['host_verifications_list'].apply(lambda x: int('work_email' in x))

relevant_columns.append('email')
relevant_columns.append('phone')
relevant_columns.append('work_email')
relevant_columns.remove('host_verifications')

# neighbourhood_cleansed
neighbourhood_counts = train['neighbourhood_cleansed'].value_counts()
global_mean = train['price'].mean()
neighbourhood_means = train.groupby('neighbourhood_cleansed')['price'].mean()

train['neighbourhood_cleansed_price'] = train['neighbourhood_cleansed'].map(
    lambda x: neighbourhood_means[x] if neighbourhood_counts[x] >= 10 else global_mean)

# Neighbourhoods not seen in train fall back to the global mean
test['neighbourhood_cleansed_price'] = test['neighbourhood_cleansed'].map(
    lambda x: neighbourhood_means.get(x, global_mean))

relevant_columns.append('neighbourhood_cleansed_price')
relevant_columns.remove('neighbourhood_cleansed')


# property_type
property_counts = train['property_type'].value_counts()
property_means = train.groupby('property_type')['price'].mean()

train['property_type_price'] = train['property_type'].map(
    lambda x: property_means[x] if property_counts[x] >= 10 else global_mean)

# Property types not seen in train fall back to the global mean
test['property_type_price'] = test['property_type'].map(
    lambda x: property_means.get(x, global_mean))

relevant_columns.append('property_type_price')
relevant_columns.remove('property_type')

# room_type - leave as it is
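
The neighbourhood and property-type encodings above follow one pattern: replace a category with its mean training price, and fall back to the global mean when the category is rare or unseen. A minimal sketch of that pattern as a reusable function follows. `mean_encode`, its `min_count` argument and the toy data are hypothetical, and unlike the cell above this function applies the count threshold to both frames.

```python
import pandas as pd

def mean_encode(train_df, test_df, col, target, min_count=10):
    """Map each category in col to its mean target value in train_df.

    Categories with fewer than min_count training rows, or absent from
    train_df, are mapped to the global training mean instead.
    """
    global_mean = train_df[target].mean()
    stats = train_df.groupby(col)[target].agg(['mean', 'count'])
    # Keep only the means that are backed by enough rows
    reliable = stats.loc[stats['count'] >= min_count, 'mean']
    train_enc = train_df[col].map(reliable).fillna(global_mean)
    test_enc = test_df[col].map(reliable).fillna(global_mean)
    return train_enc, test_enc

# Made-up data: 'A' is frequent, 'B' is rare, 'C' appears only in test
enc_train = pd.DataFrame({
    'area': ['A', 'A', 'A', 'B'],
    'price': [100.0, 200.0, 300.0, 1000.0],
})
enc_test = pd.DataFrame({'area': ['A', 'B', 'C']})

train_enc, test_enc = mean_encode(enc_train, enc_test, 'area', 'price', min_count=2)
```

Here `A` maps to its own mean of 200, while the rare `B` and the unseen `C` both map to the global mean of 400.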

In [8]:
def haversine(lat1, lon1, lat2, lon2):
    R = 6371.0
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1

    a = np.sin(dlat/2)**2 + np.cos(lat1)*np.cos(lat2)*np.sin(dlon/2)**2
    return 2 * R * np.arcsin(np.sqrt(a))

def get_distance_from_city_center(row):
    city = row['listing_location'].lower()
    lat = row['latitude']
    lon = row['longitude']

    # Reference coordinates for each market
    if 'chicago' in city:
        return haversine(lat, lon, 41.9047, -87.6273)
    elif 'asheville' in city:
        return haversine(lat, lon, 35.5951, -82.5515)
    elif 'kauai' in city:
        return haversine(lat, lon, 22.2233, -159.4850)
    return np.nan  # no matching city

train['distance_from_center_km'] = train.apply(get_distance_from_city_center, axis=1)
test['distance_from_center_km'] = test.apply(get_distance_from_city_center, axis=1)
relevant_columns.append('distance_from_center_km')

# amenities - indicator features for selected amenities
def parse_amenities(x):
    if isinstance(x, list):
        return [item.lower() for item in x]
    elif isinstance(x, str):
        try:
            items = ast.literal_eval(x)
            return [item.lower() for item in items]
        except Exception:  # string is not a parsable list of amenity names
            return []
    else:
        return []

def add_amenity_indicators(df):
    df = df.copy()
    df['parsed_amenities'] = df['amenities'].apply(parse_amenities)
    df['has_tv_wifi'] = df['parsed_amenities'].apply(lambda x: int(all(amenity in x for amenity in ['tv', 'wifi'])))
    df['self_check_in'] = df['parsed_amenities'].apply(lambda x: int(any('check-in' in amenity for amenity in x)))
    df['has_pool'] = df['parsed_amenities'].apply(lambda x: int(any('pool' in amenity for amenity in x)))
    df['has_tub'] = df['parsed_amenities'].apply(lambda x: int(any('tub' in amenity for amenity in x)))
    df.drop(columns='parsed_amenities', inplace=True)
    return df

train = add_amenity_indicators(train)
test = add_amenity_indicators(test)


new_features = ['has_tv_wifi', 'self_check_in', 'has_pool', 'has_tub']
relevant_columns += new_features

# amenities - convert to len of list
train['amenities'] = train['amenities'].apply(lambda x: x.split(', '))
test['amenities'] = test['amenities'].apply(lambda x: x.split(', '))
train['amenities'] = train['amenities'].apply(len)
test['amenities'] = test['amenities'].apply(len)

# Dictionary mapping neighborhoods to ZIP codes
neighborhood_to_zip = {
    'North Shore Kauai': '96714',
    'Koloa-Poipu': '96756',
    'Near North Side': '60610',
    'Kapaa-Wailua': '96746',
    '28806': '28806',
    'West Town': '60622',
    'Lihue': '96766',
    'Lake View': '60657',
    'Near West Side': '60607',
    '28801': '28801',
    'Loop': '60601',
    'Logan Square': '60647',
    '28803': '28803',
    '28804': '28804',
    '28805': '28805',
    'Lincoln Park': '60614',
    'Near South Side': '60616',
    'Lower West Side': '60608',
    'Woodlawn': '60637',
    'Uptown': '60640',
    'Bridgeport': '60608',
    'Irving Park': '60618',
    'Avondale': '60618',
    'Edgewater': '60660',
    'Rogers Park': '60626',
    '28704': '28704',
    'North Center': '60618',
    'Hyde Park': '60615',
    'West Ridge': '60645',
    'Grand Boulevard': '60653',
    'Portage Park': '60641',
    'South Shore': '60649',
    'Humboldt Park': '60622',
    'East Garfield Park': '60624',
    '28715': '28715',
    'Lincoln Square': '60625',
    'Armour Square': '60616',
    'Albany Park': '60625',
    'Kenwood': '60615',
    'Austin': '60644',
    'North Lawndale': '60623',
    '28732': '28732',
    'Douglas': '60616',
    'Mckinley Park': '60609',
    'Jefferson Park': '60630',
    'Norwood Park': '60631',
    'South Lawndale': '60623',
    'Belmont Cragin': '60639',
    'New City': '60609',
    'Washington Park': '60637',
    'Dunning': '60634',
    'Hermosa': '60639',
    'Greater Grand Crossing': '60619',
    'Auburn Gresham': '60620',
    'Brighton Park': '60632',
    'Roseland': '60628',
    'Englewood': '60621',
    'Oakland': '60653',
    'West Englewood': '60636',
    'Chatham': '60619',
    'South Chicago': '60617',
    'Waimea-Kekaha': '96752',
    'East Side': '60617',
    'Calumet Heights': '60619',
    'Ohare': '60666',
    'Pullman': '60628',
    'North Park': '60625',
    'Washington Heights': '60628',
    'Beverly': '60643',
    'West Garfield Park': '60624',
    'Clearing': '60638',
    'Montclare': '60639',
    'West Lawn': '60629',
    'Ashburn': '60652',
    'Garfield Ridge': '60638',
    'Morgan Park': '60643',
    'Forest Glen': '60646',
    'Archer Heights': '60632',
    'West Pullman': '60643',
    'South Deering': '60617',
    'Burnside': '60619',
    'Chicago Lawn': '60629',
    'Hegewisch': '60633',
    'Gage Park': '60632',
    'Mount Greenwood': '60655',
    'Fuller Park': '60609',
    'Avalon Park': '60619',
    'Edison Park': '60631',
    'West Elsdon': '60629',
}

# Map ZIP codes to DataFrame
train['zip_code'] = train['neighbourhood_cleansed'].map(neighborhood_to_zip)
test['zip_code'] = test['neighbourhood_cleansed'].map(neighborhood_to_zip)

# Mapping from ZIP codes to median household income
zip_to_income = {
    '60654': 144831,
    '60642': 141179,
    '60614': 139561,
    '60606': 139148,
    '60661': 138831,
    '60603': 128107,
    '60601': 127323,
    '60607': 126307,
    '60622': 124852,
    '60611': 124621,
    '60602': 123047,
    '60605': 118403,
    '60657': 109025,
    '60610': 105649,
    '60647': 102851,
    '60618': 101558,
    '60630': 97134,
    '60613': 91351,
    '60625': 85299,
    '60643': 83015,
    '60641': 81649,
    '60652': 78559,
    '60616': 77841,
    '60640': 71030,
    '60608': 70704,
    '60659': 70033,
    '60660': 66206,
    '60615': 62105,
    '60632': 60576,
    '60612': 60457,
    '60639': 59710,
    '60626': 57452,
    '60609': 54142,
    '60651': 52963,
    '60617': 51237,
    '60628': 49719,
    '60623': 44040,
    '60619': 43403,
    '60637': 42080,
    '60649': 40142,
    '60653': 39565,
    '60644': 37952,
    '60636': 35077,
    '60621': 33235,
    '60624': 32607,
    '28704': 79088,
    '28715': 70950,
    '28732': 79451,
    '28801': 51125,
    '28803': 70638,
    '28804': 79709,
    '28805': 67104,
    '28806': 59925,
    '96714': 132115,
    '96756': 97500,
    '96746': 89681,
    '96766': 96434,
    '96752': 74628,
    '60634': 84997,
    '60646': 111115,
    '60645': 77126,
    '60629': 58813,
    '60633': 56875,
    '60631': 117688,
    '60638': 89191,
    '60620': 48805,
    '60666': 75134,
    '60655': 75134
}

train['median_income'] = train['zip_code'].map(lambda x: zip_to_income.get(str(x), np.nan))
test['median_income'] = test['zip_code'].map(lambda x: zip_to_income.get(str(x), np.nan))

relevant_columns.append('median_income')
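
A quick way to sanity-check the `haversine` helper defined in this cell is to compare it against distances that are known in advance. The sketch below repeats the function so that it runs on its own; the check points are arbitrary and chosen only because their distances are easy to derive.

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    R = 6371.0
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * R * np.arcsin(np.sqrt(a))

# A point is at distance zero from itself
same_point = haversine(41.9047, -87.6273, 41.9047, -87.6273)

# One degree of latitude along a meridian spans R * pi / 180 km
one_degree = haversine(0.0, 0.0, 1.0, 0.0)
expected = 6371.0 * np.pi / 180
```

Every operation in the function is a NumPy ufunc, so it should also accept whole latitude and longitude columns at once instead of one row at a time.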

### 2-4 Encoding

In [9]:
# Booleans
booleans = train[relevant_columns].select_dtypes(include='bool').columns.tolist()
for col in booleans:
    train[col] = train[col].astype(int)
    test[col] = test[col].astype(int)

# Other
cols_to_dummy = train[relevant_columns].select_dtypes(include='object').columns.tolist()

X_train_onehot = pd.get_dummies(train[cols_to_dummy], columns=cols_to_dummy).astype(int)
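# Align the test dummies to the training columns so both frames expose the same features
X_test_onehot = (
    pd.get_dummies(test[cols_to_dummy], columns=cols_to_dummy)
    .astype(int)
    .reindex(columns=X_train_onehot.columns, fill_value=0)
)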

train = pd.concat([train, X_train_onehot], axis=1)
test = pd.concat([test, X_test_onehot], axis=1)

relevant_columns = relevant_columns + X_train_onehot.columns.tolist()
relevant_columns = [col for col in relevant_columns if col not in cols_to_dummy]

## 3) Machine Learning Model

- This section is required to train the **already tuned** model and obtain the test predictions (or prediction probabilities) with it.
- As written in the instructions, your code must not have any runtime issues, so **do NOT include your grid search here!** You will still need to tune your model to pass the thresholds. However, you need to keep that as your personal work and should NOT include the grid search here.

In [10]:
X_train = train[relevant_columns]
X_test = test[relevant_columns]
y_train = train['price']

### 3-2 Optimal Model after Tuning with GridSearchCV

In [11]:
# Feature Selection with XGBRegressor
model = XGBRegressor(random_state = 12, objective='reg:absoluteerror')
model.fit(X_train, y_train)
importances = model.feature_importances_

feature_importance_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance': importances
})

feature_importance_df = feature_importance_df.sort_values(by='importance', ascending=False)

selected = feature_importance_df[feature_importance_df['importance'] > 0]['feature']
X_train_selected = X_train[selected]
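
The selection step above keeps every feature to which the fitted model assigns a non-zero importance. The sketch below shows the same filter on synthetic data. It swaps in a `DecisionTreeRegressor` for `XGBRegressor` so that it only needs scikit-learn, and the column names and data are made up.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(12)
X_demo = pd.DataFrame({
    'signal': rng.normal(size=200),
    'constant': np.ones(200),  # carries no information, so it can never be split on
})
y_demo = 3 * X_demo['signal'] + rng.normal(scale=0.1, size=200)

tree = DecisionTreeRegressor(random_state=12, max_depth=3)
tree.fit(X_demo, y_demo)

# Keep only the features with non-zero importance
importance = pd.Series(tree.feature_importances_, index=X_demo.columns)
selected_demo = importance[importance > 0].index.tolist()
X_demo_selected = X_demo[selected_demo]
```

The same list of selected names has to be used to subset the test features before predicting, as the export cell does with `X_test[selected]`.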

In [12]:
base_model = DecisionTreeRegressor(
    random_state = 12,
    max_depth = 18
)

model = AdaBoostRegressor(
    random_state = 12,
    estimator = base_model, 
    n_estimators = 59,
    learning_rate = 0.1
)

model.fit(X_train_selected, y_train)

## 4) Exporting the Predictions

Include the code that (1) puts the predictions in the format that Kaggle understands and (2) exports it as a csv file.

In [13]:
y_pred = model.predict(X_test[selected])
test_ids = test['id']
predictions_df = pd.DataFrame({'id': test_ids, 'predicted': y_pred})
predictions_df.to_csv('regression_predictions.csv', index=False)
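
Before uploading, it can help to confirm that the exported file has the expected shape. The sketch below runs such a check on a small made-up frame written to an in-memory buffer instead of disk. The `id` and `predicted` column names are taken from the cell above, while `check_submission` is a hypothetical helper.

```python
import io
import pandas as pd

def check_submission(csv_text, expected_ids):
    """Return True if the CSV has exactly the columns id and predicted,
    one row per expected id in order, and no missing predictions."""
    sub = pd.read_csv(io.StringIO(csv_text))
    return (
        list(sub.columns) == ['id', 'predicted']
        and sub['id'].tolist() == list(expected_ids)
        and bool(sub['predicted'].notna().all())
    )

# Made-up predictions standing in for the real export
demo = pd.DataFrame({'id': [101, 102, 103], 'predicted': [120.5, 89.0, 240.25]})
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

ok = check_submission(buffer.getvalue(), [101, 102, 103])
```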