## Instructions {-}

- This is the template for the code and report on the Prediction Problem assignments.

- Your code in steps 1, 3, 4, and 5 will be executed sequentially, and must produce the RMSE / accuracy claimed on Kaggle.

- Your code in step 2 will also be executed, and must produce the optimal hyperparameter values used to train the model.

In [1]:
#importing libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, RandomizedSearchCV
import xgboost as xgb
import time

## Read data

In [2]:
train_data = pd.read_csv('train_regression.csv')
test_data = pd.read_csv('test_regression.csv')

## 1) Data pre-processing

Put the data pre-processing code. You don't need to explain it. You may use the same code from last quarter.

In [3]:
st = time.time()

#host_response_rate: strip the '%' sign and convert to float
train_data['host_response_rate'] = train_data.host_response_rate.astype(str).str.replace('%', '', regex=False).astype(float)

#host_acceptance_rate: strip the '%' sign and convert to float
train_data['host_acceptance_rate'] = train_data.host_acceptance_rate.astype(str).str.replace('%', '', regex=False).astype(float)


#dummy variable host_has_profile_pic
train_data = pd.concat([train_data, pd.get_dummies(train_data['host_has_profile_pic']).rename(columns={'t': 'host_has_profile_pic_t'})], axis = 1)
train_data.drop(labels = ['host_has_profile_pic', 'f'], axis = 1, inplace = True)

#dummy variable host_identity_verified
train_data = pd.concat([train_data, pd.get_dummies(train_data['host_identity_verified']).rename(columns={'t': 'host_identity_verified_t'})], axis = 1)
train_data.drop(labels = ['host_identity_verified', 'f'], axis = 1, inplace = True)

#dummy variable host_is_superhost
train_data = pd.concat([train_data, pd.get_dummies(train_data['host_is_superhost']).rename(columns={'t': 'host_is_superhost_t'})], axis = 1)
train_data.drop(labels = ['host_is_superhost', 'f'], axis = 1, inplace = True)

#number of bathrooms
train_data['Number_of_Bathrooms'] = train_data.bathrooms_text.apply(lambda x: str(x)).apply(lambda x: x.replace('shared baths','').replace('shared bath','').replace('Shared half-bath', '0.5').replace('Private half-bath', '0.5').replace('Half-bath','0.5').replace('private bath','').replace('baths','').replace('bath','')).apply(lambda x: float(x))

#classifying bathrooms as private and public
train_data['shared_bathroom'] = train_data['bathrooms_text'].str.contains('shared', case=False)
train_data['private_bathroom'] = train_data['bathrooms_text'].str.contains('private', case=False)

#dummy variable has_availability
train_data = pd.concat([train_data, pd.get_dummies(train_data['has_availability']).rename(columns={'t': 'has_availability_t'})], axis = 1)
train_data.drop(labels = ['has_availability', 'f'], axis = 1, inplace = True)

#dummy variable instant_bookable
train_data = pd.concat([train_data, pd.get_dummies(train_data['instant_bookable']).rename(columns={'t': 'instant_bookable_t'})], axis = 1)
train_data.drop(labels = ['instant_bookable', 'f'], axis = 1, inplace = True)

#classifying downtown neighborhoods
downtown_neighborhoods = ['River North', 'Loop', 'Gold Coast', 'Streeterville',
                          'South Loop', 'West Loop', 'River West', 'Near North Side']

#new column is_downtown
train_data['is_downtown']= train_data.host_neighbourhood.isin(downtown_neighborhoods).astype(int)

#defining a function to reclassify a row's neighbourhood_cleansed as 'other'
#if it is not among the 42 most frequent neighbourhoods in train_data
def reclassify(row):
    if row['neighbourhood_cleansed'] not in list(train_data.neighbourhood_cleansed.value_counts()[0:42].index):
        row['neighbourhood_cleansed'] = 'other'
    return row

#applying the function to train
train_data = train_data.apply(reclassify, axis = 1)
neigborhood_dummies = pd.get_dummies(train_data['neighbourhood_cleansed'])
neigborhood_dummies.columns = neigborhood_dummies.columns.str.replace(' ', '_')
train_data = pd.concat([train_data, neigborhood_dummies], axis =1)


#converting the datetime columns to numeric
train_data['host_since'] = (pd.Timestamp.now() - pd.to_datetime(train_data.host_since)).dt.days
train_data['first_review'] = (pd.Timestamp.now() - pd.to_datetime(train_data.first_review)).dt.days
train_data['last_review'] = (pd.Timestamp.now() - pd.to_datetime(train_data.last_review)).dt.days

#host_response_time dummies
host_response_time_dummies = pd.get_dummies(train_data['host_response_time'])
host_response_time_dummies.columns = host_response_time_dummies.columns.str.replace(' ', '_')
train_data = pd.concat([train_data, host_response_time_dummies], axis =1)

#host_verification dummies
host_verifications_dummies = pd.get_dummies(train_data['host_verifications'])
host_verifications_dummies.columns = host_verifications_dummies.columns.str.replace(' ', '_')
new_columns = ['email_workemail_phone', 'email_phone', 'email', 'phone_workemail', 'phone']
host_verifications_dummies.columns = new_columns
train_data = pd.concat([train_data, host_verifications_dummies], axis =1)

#room type dummies
room_type_dummies = pd.get_dummies(train_data['room_type'])
room_type_dummies.columns = room_type_dummies.columns.str.replace(' ', '_')
train_data = pd.concat([train_data, room_type_dummies], axis =1)

#defining a function to reclassify a row's property_type as 'other'
#if it is not among the 20 most frequent property types in train_data
def reclassify_ptype(row):
    if row['property_type'] not in list(train_data.property_type.value_counts()[:20].index):
        row['property_type'] = 'other'
    return row

#applying the function 
train_data = train_data.apply(reclassify_ptype, axis = 1)

#prop type dummies
property_type_dummies = pd.get_dummies(train_data['property_type'])
property_type_dummies.columns = property_type_dummies.columns.str.replace(' ', '_')
train_data = pd.concat([train_data, property_type_dummies], axis =1)

#reassigning the extreme (maximum) value of minimum_maximum_nights to 1125
is_max = train_data.minimum_maximum_nights == train_data.minimum_maximum_nights.max()
train_data.loc[is_max, 'minimum_maximum_nights'] = 1125

#defining a function to reclassify a row's host_neighbourhood as 'other'
#if it is not among the 66 most frequent host neighbourhoods in train_data
def reclassify_host_n(row):
    if row['host_neighbourhood'] not in list(train_data.host_neighbourhood.value_counts()[:66].index):
        row['host_neighbourhood'] = 'other'
    return row

#applying the function across the data
train_data = train_data.apply(reclassify_host_n, axis = 1)

#dummy variables for host_neighbourhood
h_neighbourhood_dummies = pd.get_dummies(train_data['host_neighbourhood'])
h_neighbourhood_dummies.columns = 'is_' + h_neighbourhood_dummies.columns.str.replace(' ', '_').str.replace('/', '_')
train_data = pd.concat([train_data, h_neighbourhood_dummies], axis =1)

train_data['price'] = train_data.price.apply(lambda x: x.replace('$','').replace(',','')).apply(lambda x: float(x))

#getting rid of outliers (copy so that the assignments below modify train_data itself)
train_data = train_data.loc[train_data.price < 80000].copy()

#capping host_listings_count at 50
train_data.loc[train_data.host_listings_count > 50, 'host_listings_count'] = 50

#capping host_total_listings_count at 200
train_data.loc[train_data.host_total_listings_count > 200, 'host_total_listings_count'] = 200

#capping beds at 15
train_data.loc[train_data.beds > 15, 'beds'] = 15

#capping minimum_minimum_nights at 200
train_data.loc[train_data.minimum_minimum_nights > 200, 'minimum_minimum_nights'] = 200

#capping minimum_nights_avg_ntm at 400
train_data.loc[train_data.minimum_nights_avg_ntm > 400, 'minimum_nights_avg_ntm'] = 400

#capping number_of_reviews at 300
train_data.loc[train_data.number_of_reviews > 300, 'number_of_reviews'] = 300

#capping number_of_reviews_ltm at 70
train_data.loc[train_data.number_of_reviews_ltm > 70, 'number_of_reviews_ltm'] = 70

#capping number_of_reviews_l30d at 70
train_data.loc[train_data.number_of_reviews_l30d > 70, 'number_of_reviews_l30d'] = 70

#replacing negative first_review values with 125
train_data.loc[train_data.first_review < 0, 'first_review'] = 125

#replacing negative last_review values with 155
train_data.loc[train_data.last_review < 0, 'last_review'] = 155

#replacing calculated_host_listings_count values above 300 with 95
train_data.loc[train_data.calculated_host_listings_count > 300, 'calculated_host_listings_count'] = 95

#replacing calculated_host_listings_count_entire_homes values above 300 with 95
train_data.loc[train_data.calculated_host_listings_count_entire_homes > 300, 'calculated_host_listings_count_entire_homes'] = 95

#capping reviews_per_month at 10
train_data.loc[train_data.reviews_per_month > 10, 'reviews_per_month'] = 10

end = time.time()

end - st

5.919013261795044
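
The capping steps in the cell above can also be written without building index lists. The following is a minimal sketch on a toy DataFrame (the values are made up for illustration and are not taken from the assignment data). `Series.clip` covers the cases where values above a threshold are set to that threshold, and a boolean mask with `.loc` covers the cases where the replacement value differs from the threshold. Both assign through a single indexing call, which avoids chained assignment of the form `df[col][indexes] = value`.

```python
import pandas as pd

# Toy data; the values are illustrative only
toy = pd.DataFrame({
    'beds': [1.0, 2.0, 30.0, None],
    'calculated_host_listings_count': [3, 500, 12, 1],
})

# Values above 15 are set to 15; missing values stay missing
toy['beds'] = toy['beds'].clip(upper=15)

# Values above 300 are replaced by 95 (the replacement differs from the threshold)
toy.loc[toy['calculated_host_listings_count'] > 300,
        'calculated_host_listings_count'] = 95

print(toy)
```

Whether `clip` or a mask reads better is a matter of taste; the mask form is the more general of the two.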

In [4]:
#host_response_rate: strip the '%' sign and convert to float
test_data['host_response_rate'] = test_data.host_response_rate.astype(str).str.replace('%', '', regex=False).astype(float)

#host_acceptance_rate: strip the '%' sign and convert to float
test_data['host_acceptance_rate'] = test_data.host_acceptance_rate.astype(str).str.replace('%', '', regex=False).astype(float)

#dummy variable host_has_profile_pic
test_data = pd.concat([test_data, pd.get_dummies(test_data['host_has_profile_pic']).rename(columns={'t': 'host_has_profile_pic_t'})], axis = 1)
test_data.drop(labels = ['host_has_profile_pic', 'f'], axis = 1, inplace = True)

#dummy variable host_identity_verified
test_data = pd.concat([test_data, pd.get_dummies(test_data['host_identity_verified']).rename(columns={'t': 'host_identity_verified_t'})], axis = 1)
test_data.drop(labels = ['host_identity_verified', 'f'], axis = 1, inplace = True)

#dummy variable host_is_superhost
test_data = pd.concat([test_data, pd.get_dummies(test_data['host_is_superhost']).rename(columns={'t': 'host_is_superhost_t'})], axis = 1)
test_data.drop(labels = ['host_is_superhost', 'f'], axis = 1, inplace = True)

#number of bathrooms
test_data['Number_of_Bathrooms'] = test_data.bathrooms_text.apply(lambda x: str(x)).apply(lambda x: x.replace('shared baths','').replace('shared bath','').replace('Shared half-bath', '0.5').replace('Private half-bath', '0.5').replace('Half-bath','0.5').replace('private bath','').replace('baths','').replace('bath','')).apply(lambda x: float(x))

#classifying bathrooms as private and public
test_data['shared_bathroom'] = test_data['bathrooms_text'].str.contains('shared', case=False)
test_data['private_bathroom'] = test_data['bathrooms_text'].str.contains('private', case=False)

#dummy variable has_availability
test_data = pd.concat([test_data, pd.get_dummies(test_data['has_availability']).rename(columns={'t': 'has_availability_t'})], axis = 1)
test_data.drop(labels = ['has_availability', 'f'], axis = 1, inplace = True)

#dummy variable instant_bookable
test_data = pd.concat([test_data, pd.get_dummies(test_data['instant_bookable']).rename(columns={'t': 'instant_bookable_t'})], axis = 1)
test_data.drop(labels = ['instant_bookable', 'f'], axis = 1, inplace = True)



#classifying downtown neighborhoods
downtown_neighborhoods = ['River North', 'Loop', 'Gold Coast', 'Streeterville',
                          'South Loop', 'West Loop', 'River West', 'Near North Side']

#new column is_downtown
test_data['is_downtown']= test_data.host_neighbourhood.isin(downtown_neighborhoods).astype(int)




#applying the function reclassify to test
#getting dummies for neighbourhood_cleansed
test_data = test_data.apply(reclassify, axis = 1)
neigborhood_dummies_test = pd.get_dummies(test_data['neighbourhood_cleansed'])
neigborhood_dummies_test.columns = neigborhood_dummies_test.columns.str.replace(' ', '_')
test_data = pd.concat([test_data, neigborhood_dummies_test], axis =1)

#converting the datetime columns to numeric
test_data['host_since'] = (pd.Timestamp.now() - pd.to_datetime(test_data.host_since)).dt.days
test_data['first_review'] = (pd.Timestamp.now() - pd.to_datetime(test_data.first_review)).dt.days
test_data['last_review'] = (pd.Timestamp.now() - pd.to_datetime(test_data.last_review)).dt.days


#host_response_time dummies
host_response_time_dummies_test = pd.get_dummies(test_data['host_response_time'])
host_response_time_dummies_test.columns = host_response_time_dummies_test.columns.str.replace(' ', '_')
test_data = pd.concat([test_data, host_response_time_dummies_test], axis =1)

#host_verification dummies
host_verifications_dummies_test = pd.get_dummies(test_data['host_verifications'])
host_verifications_dummies_test.columns = host_verifications_dummies_test.columns.str.replace(' ', '_')
new_columns = ['email_workemail_phone', 'email_phone', 'email', 'phone_workemail', 'phone']
host_verifications_dummies_test.columns = new_columns
test_data = pd.concat([test_data, host_verifications_dummies_test], axis =1)

#room type dummies
room_type_dummies_test = pd.get_dummies(test_data['room_type'])
room_type_dummies_test.columns = room_type_dummies_test.columns.str.replace(' ', '_')
test_data = pd.concat([test_data, room_type_dummies_test], axis =1)

#applying the function reclassify_ptype
test_data = test_data.apply(reclassify_ptype, axis = 1)

#prop type dummies
property_type_dummies_test = pd.get_dummies(test_data['property_type'])
property_type_dummies_test.columns = property_type_dummies_test.columns.str.replace(' ', '_')
test_data = pd.concat([test_data, property_type_dummies_test], axis =1)

#applying the function reclassify_host_n across the data
test_data = test_data.apply(reclassify_host_n, axis = 1)

#dummy variables for host_neighbourhood
h_neighbourhood_dummies_test = pd.get_dummies(test_data['host_neighbourhood'])
h_neighbourhood_dummies_test.columns = 'is_' + h_neighbourhood_dummies_test.columns.str.replace(' ', '_').str.replace('/', '_')
test_data = pd.concat([test_data, h_neighbourhood_dummies_test], axis =1)
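
The row-wise `reclassify*` functions recompute `value_counts()` for every row, which is slow on larger data. Below is a sketch of a vectorized alternative, assuming the list of frequent categories is computed once from the training data and then reused for the test data. The helper name `keep_top_categories` and the toy frames are hypothetical. It may not reproduce the cells above exactly, because those functions look up the training frequencies at the time they are called.

```python
import pandas as pd

def keep_top_categories(series, top_categories, other_label='other'):
    """Replace every value that is not in top_categories with other_label."""
    return series.where(series.isin(top_categories), other_label)

# Toy frames standing in for train_data and test_data
train = pd.DataFrame({'neighbourhood_cleansed': ['Loop', 'Loop', 'Uptown', 'Uptown', 'Austin']})
test = pd.DataFrame({'neighbourhood_cleansed': ['Loop', 'Austin', 'Hyde Park']})

# The 2 most frequent categories, computed from the training data only
top = train['neighbourhood_cleansed'].value_counts().index[:2]

train['neighbourhood_cleansed'] = keep_top_categories(train['neighbourhood_cleansed'], top)
test['neighbourhood_cleansed'] = keep_top_categories(test['neighbourhood_cleansed'], top)

print(train['neighbourhood_cleansed'].tolist())
print(test['neighbourhood_cleansed'].tolist())
```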

In [5]:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)

test_data_numeric = test_data.select_dtypes(include = ['int64', 'float64', 'uint8'])
train_data_numeric = train_data.select_dtypes(include = ['int64', 'float64', 'uint8'])

test_data_numeric = pd.DataFrame(imputer.fit_transform(test_data_numeric),columns = test_data_numeric.columns)
train_data_numeric = pd.DataFrame(imputer.fit_transform(train_data_numeric),columns = train_data_numeric.columns)

train_data_numeric.drop(labels = ['Belmont_Cragin', 'Private_room_in_guest_suite', 'is_East_Colorado_Springs'], axis = 1, inplace = True )

In [6]:
X_train = train_data_numeric.drop(columns = ['id', 'host_id', 'price', 'is_other', 'other'])
y = train_data_numeric.price
X_test = test_data_numeric.drop(columns = ['id', 'host_id', 'is_other', 'other'])
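
Because `pd.get_dummies` is applied to the training and test data separately, the two frames can end up with different dummy columns, which is presumably why a few columns are dropped by hand above. One possible way to align them automatically is `DataFrame.reindex` with `fill_value=0`. The sketch below uses toy frames with hypothetical names (`X_train_toy`, `X_test_toy`), not the assignment data.

```python
import pandas as pd

X_train_toy = pd.DataFrame({'accommodates': [2, 4], 'Loop': [1, 0], 'Uptown': [0, 1]})
X_test_toy = pd.DataFrame({'accommodates': [3], 'Loop': [1], 'Hyde_Park': [0]})

# Keep exactly the training columns, in training order:
# columns missing from the test frame are filled with 0,
# columns that exist only in the test frame are dropped
X_test_aligned = X_test_toy.reindex(columns=X_train_toy.columns, fill_value=0)

print(X_test_aligned)
```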

## 2) Hyperparameter tuning

### How many attempts did it take you to tune the model hyperparameters?

8 or 9 attempts

### Which tuning method did you use (grid search / Bayes search / etc.)?

Randomized Search CV

### What challenges did you face while tuning the hyperparameters, and what actions did you take to address those challenges?

Searching for hyperparameters took a large amount of time, so I used RandomizedSearchCV with a small number of iterations to obtain candidate hyperparameters to test. To reduce the search time further, I also fitted an XGBoost model and kept only the 100 most important features.
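
As an illustration of the idea, here is a minimal sketch of a randomized search with a small `n_iter`. It is not the tuning code used for the submission: the data are synthetic, and scikit-learn's `GradientBoostingRegressor` is used as a stand-in for `xgb.XGBRegressor` so that the example needs only scikit-learn. The parameter ranges are arbitrary assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic data standing in for X_train and y
X_toy, y_toy = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# 4 * 4 * 3 * 3 = 144 possible combinations (ranges chosen arbitrarily)
param_distributions = {
    'max_depth': [2, 3, 4, 5],
    'learning_rate': [0.01, 0.05, 0.1, 0.3],
    'n_estimators': [20, 50, 100],
    'subsample': [0.5, 0.75, 1.0],
}

cv = KFold(n_splits=3, shuffle=True, random_state=1)
search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,          # only 5 of the 144 combinations are sampled and evaluated
    cv=cv,
    scoring='neg_root_mean_squared_error',
    random_state=1,    # makes the sampled combinations repeatable
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(-search.best_score_)  # sign flipped back to a positive RMSE
```

A small `n_iter` trades coverage of the search space for speed, so repeating the search several times, as described above, is one way to see more candidates.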

### How many hours did you spend on hyperparameter tuning?

3 or 4 hours

**Paste the hyperparameter tuning code below. You must show at least one hyperparameter tuning procedure.**

In [59]:
model = xgb.XGBRegressor()
model.fit(X_train,y)

In [60]:
pred = pd.Series(model.feature_importances_, index = list(X_train.columns)).sort_values(ascending = False)[:100]

In [61]:
print(list(pred.index))

['Number_of_Bathrooms', 'is_downtown', 'is_River_West', 'Entire_home/apt', 'calculated_host_listings_count_private_rooms', 'accommodates', 'West_Town', 'minimum_minimum_nights', 'Near_North_Side', 'calculated_host_listings_count_entire_homes', 'Rogers_Park', 'number_of_reviews_ltm', 'within_a_day', 'Private_room_in_rental_unit', 'email_phone', 'is_West_Loop_Greektown', 'maximum_minimum_nights', 'review_scores_communication', 'minimum_nights', 'review_scores_rating', 'availability_365', 'Room_in_boutique_hotel', 'host_total_listings_count', 'review_scores_checkin', 'minimum_maximum_nights', 'last_review', 'Room_in_hotel', 'review_scores_location', 'is_Irving_Park', 'latitude', 'host_is_superhost_t', 'maximum_nights_avg_ntm', 'Entire_home', 'review_scores_cleanliness', 'Hotel_room', 'host_since', 'longitude', 'minimum_nights_avg_ntm', 'maximum_maximum_nights', 'Edgewater', 'is_Pulaski_Park', 'Private_room_in_home', 'is_Back_of_the_Yards', 'is_Old_Town', 'is_Garfield_Park', 'Irving_Park',
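
Rather than copying the printed names into later cells by hand, the selection can be kept in a variable. A minimal sketch, assuming a fitted model that exposes `feature_importances_`; here that attribute is replaced by a made-up array `importances`, and the column names and `k` are illustrative only.

```python
import numpy as np
import pandas as pd

feature_names = ['accommodates', 'beds', 'latitude', 'longitude', 'is_downtown']
# Stand-in for model.feature_importances_ (made-up values)
importances = np.array([0.40, 0.05, 0.10, 0.15, 0.30])

k = 3  # number of features to keep
ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)
top_features = list(ranked.index[:k])

# The same list can then be used to subset both the training and test frames
X_toy = pd.DataFrame(np.zeros((2, 5)), columns=feature_names)
X_toy_selected = X_toy[top_features]

print(top_features)
```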

In [11]:
#the 100 most important features, used to subset both the training and test data
top_features = ['Number_of_Bathrooms', 'is_downtown', 'is_River_West', 'Entire_home/apt', 'calculated_host_listings_count_private_rooms',
                'accommodates', 'West_Town', 'minimum_minimum_nights', 'Near_North_Side', 'calculated_host_listings_count_entire_homes',
                'Rogers_Park', 'number_of_reviews_ltm', 'within_a_day', 'Private_room_in_rental_unit', 'email_phone',
                'is_West_Loop_Greektown', 'maximum_minimum_nights', 'review_scores_communication', 'minimum_nights', 'review_scores_rating',
                'availability_365', 'Room_in_boutique_hotel', 'host_total_listings_count', 'review_scores_checkin', 'minimum_maximum_nights',
                'last_review', 'Room_in_hotel', 'review_scores_location', 'is_Irving_Park', 'latitude',
                'host_is_superhost_t', 'maximum_nights_avg_ntm', 'Entire_home', 'review_scores_cleanliness', 'Hotel_room',
                'host_since', 'longitude', 'minimum_nights_avg_ntm', 'maximum_maximum_nights', 'Edgewater',
                'is_Pulaski_Park', 'Private_room_in_home', 'is_Back_of_the_Yards', 'is_Old_Town', 'is_Garfield_Park',
                'Irving_Park', 'calculated_host_listings_count', 'Bridgeport', 'North_Center', 'South_Chicago',
                'first_review', 'review_scores_value', 'host_has_profile_pic_t', 'host_acceptance_rate', 'Entire_serviced_apartment',
                'phone', 'availability_30', 'email_workemail_phone', 'host_response_rate', 'number_of_reviews_l30d',
                'has_availability_t', 'reviews_per_month', 'number_of_reviews', 'Logan_Square', 'Entire_loft',
                'beds', 'is_Portage_Park', 'phone_workemail', 'is_Lakeview', 'is_Near_North_Side',
                'is_Chicago_Loop', 'availability_60', 'is_West_Town', 'host_listings_count', 'review_scores_accuracy',
                'availability_90', 'is_Near_South_Side', 'is_Rush_&_Division', 'within_a_few_hours', 'within_an_hour',
                'is_Pilsen', 'is_Hyde_Park', 'Near_West_Side', 'Austin', 'Near_South_Side',
                'is_Logan_Square', 'is_West_Loop', 'is_Rogers_Park', 'Lower_West_Side', 'is_Jefferson_Park',
                'Woodlawn', 'is_Sheffield_&_DePaul', 'Portage_Park', 'instant_bookable_t', 'Douglas',
                'maximum_nights', 'Humboldt_Park', 'is_Washington_Park', 'host_identity_verified_t', 'South_Shore']

X_train = X_train[top_features]
X_test = X_test[top_features]

In [15]:
X_train

Unnamed: 0,Number_of_Bathrooms,is_downtown,is_River_West,Entire_home/apt,calculated_host_listings_count_private_rooms,accommodates,West_Town,minimum_minimum_nights,Near_North_Side,calculated_host_listings_count_entire_homes,...,Woodlawn,is_Sheffield_&_DePaul,Portage_Park,instant_bookable_t,Douglas,maximum_nights,Humboldt_Park,is_Washington_Park,host_identity_verified_t,South_Shore
0,1.0,0.0,0.0,0.0,8.0,1.0,0.0,32.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1125.0,0.0,0.0,1.0,0.0
1,3.0,0.0,0.0,0.0,58.0,12.0,0.0,32.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,365.0,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,1.0,0.0,6.0,0.0,2.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,45.0,0.0,0.0,1.0,0.0
3,1.0,0.0,0.0,1.0,0.0,2.0,0.0,2.0,0.0,55.0,...,0.0,0.0,0.0,1.0,0.0,180.0,0.0,0.0,1.0,0.0
4,2.0,0.0,0.0,1.0,0.0,6.0,0.0,2.0,0.0,74.0,...,0.0,0.0,0.0,1.0,0.0,365.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4994,1.0,0.0,0.0,0.0,2.0,2.0,0.0,4.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,90.0,0.0,0.0,1.0,0.0
4995,1.0,0.0,0.0,1.0,0.0,2.0,0.0,2.0,0.0,8.0,...,0.0,0.0,0.0,0.0,0.0,90.0,0.0,0.0,1.0,0.0
4996,1.0,0.0,0.0,1.0,0.0,4.0,0.0,3.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,180.0,0.0,0.0,1.0,0.0
4997,1.0,0.0,0.0,1.0,0.0,4.0,0.0,32.0,0.0,95.0,...,0.0,0.0,0.0,0.0,0.0,1125.0,0.0,0.0,1.0,0.0


In [19]:
model = xgb.XGBRegressor()


start_time = time.time()

param_grid = {'max_depth': np.arange(5,105,5),
              'learning_rate': np.arange(0.0001,1.0,0.0001),
              'reg_lambda': np.arange(0,50),
              'n_estimators': np.arange(100,6000),
              'gamma': np.arange(0,100),
              'subsample': np.arange(0.01,1.01,0.01),
              'colsample_bytree': np.arange(0.01,1.01,0.01)}

cv = KFold(n_splits=5,shuffle=True,random_state=1)
optimal_params = RandomizedSearchCV(estimator=model,
                                    param_distributions = param_grid, n_iter = 60,
                                    n_jobs=-1,
                                    cv = cv, scoring = 'neg_root_mean_squared_error')

optimal_params.fit(X_train,y)
print("Optimal parameter values =", optimal_params.best_params_)
print("Time taken = ", round((time.time()-start_time)/60), " minutes")
print(optimal_params.best_score_)

Optimal parameter values = {'subsample': 0.39, 'reg_lambda': 2, 'n_estimators': 4228, 'max_depth': 30, 'learning_rate': 0.03200000000000001, 'gamma': 30, 'colsample_bytree': 0.39}
Time taken =  10  minutes
-134.83986975034546
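
The score printed above is negative because scikit-learn's `'neg_root_mean_squared_error'` scorer returns the negated RMSE, so that a larger score is always better. For reference, a small sketch of the RMSE itself; the helper `rmse` and the numbers are illustrative only.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Errors are -10, 10 and -30, so the mean squared error is (100 + 100 + 900) / 3
example = rmse([100, 200, 300], [110, 190, 330])
print(example)
```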


In [None]:
#{'subsample': 0.19, 'reg_lambda': 9, 'n_estimators': 3353, 'max_depth': 100, 'learning_rate': 0.008199999999999999, 'gamma': 63, 'colsample_bytree': 0.33}
#-132.63528229181279 #120

#{'subsample': 0.23, 'reg_lambda': 43, 'n_estimators': 2482, 'max_depth': 35, 'learning_rate': 0.036000000000000004, 'gamma': 77, 'colsample_bytree': 0.47000000000000003}
#-132.75268904264857 #111



Below are the other iterations that were run and the optimal hyperparameters each of them found:

In [None]:
# Optimal parameter values = {'subsample': 0.26, 'reg_lambda': 4, 'n_estimators': 5633, 'max_depth': 10, 'learning_rate': 0.001, 'gamma': 20, 'colsample_bytree': 0.8300000000000001}
#-134.64458377049564   (combine this with linear model from last quarter for best score)
#114

#Optimal parameter values = {'subsample': 0.5, 'reg_lambda': 4, 'n_estimators': 5063, 'max_depth': 35, 'learning_rate': 0.001, 'gamma': 83, 'colsample_bytree': 0.78}
#-135.41619435476863
#115


#Optimal parameter values = {'subsample': 0.49, 'reg_lambda': 0, 'n_estimators': 5901, 'max_depth': 80, 'learning_rate': 0.001, 'gamma': 15, 'colsample_bytree': 0.63}
#-136.35907734031733


# {'subsample': 0.27, 'reg_lambda': 42, 'n_estimators': 5104, 'max_depth': 40, 'learning_rate': 0.069, 'gamma': 10, 'colsample_bytree': 0.36000000000000004}
#-133.13994080966233


#{'subsample': 0.64, 'reg_lambda': 22, 'n_estimators': 3662, 'max_depth': 90, 'learning_rate': 0.007200000000000001, 'gamma': 94, 'colsample_bytree': 0.63}
#-133.82687284677652


#Optimal parameter values = {'subsample': 0.43, 'reg_lambda': 27, 'n_estimators': 4529, 'max_depth': 55, 'learning_rate': 0.0078000000000000005, 'gamma': 86, 'colsample_bytree': 0.26}
#-133.0030364875477

#Optimal parameter values = {'subsample': 0.49, 'reg_lambda': 48, 'n_estimators': 4465, 'max_depth': 75, 'learning_rate': 0.01, 'gamma': 2, 'colsample_bytree': 0.53}
#-132.8240724037667

**Paste the optimal hyperparameter values below.**

{'subsample': 0.27, 'reg_lambda': 42, 'n_estimators': 5104, 'max_depth': 40, 'learning_rate': 0.069, 'gamma': 10, 'colsample_bytree': 0.36000000000000004}

## 3) Model

Using the optimal model hyperparameters, train the model, and paste the code below.

In [25]:
model1 = xgb.XGBRegressor(subsample = 0.23, reg_lambda = 43, n_estimators = 2482, max_depth = 35, \
                          learning_rate = 0.036, gamma = 77, colsample_bytree = 0.47).fit(X_train,y)

prices1 = model1.predict(X_test)

## 4) Put any ad-hoc steps for further improving model accuracy
For example, scaling up or scaling down the predictions, capping predictions, etc.

Put code below.

No additional ad-hoc steps were taken.

## 5) Export the predictions in the format required to submit on Kaggle
Put code below.

In [26]:
# Gave Kaggle RMSE of 109.21

pred = pd.Series(prices1)

#pred = (prices1 + prices2)/2


df = pd.concat([test_data['id'],pred], axis = 1)
df.columns = ['id', 'predicted']
df.to_csv('output_file306.csv', index=False)
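
`pd.concat(..., axis = 1)` aligns its inputs on the index, so it relies on `test_data` still having its default 0..n-1 index. A sketch of an alternative that does not depend on the index is shown below, using toy values and an in-memory buffer instead of a file; the names `test_toy`, `predictions` and `submission` are hypothetical.

```python
import io
import numpy as np
import pandas as pd

# Toy stand-ins: a test frame with a non-default index and some predictions
test_toy = pd.DataFrame({'id': [11, 12, 13]}, index=[5, 6, 7])
predictions = np.array([120.5, 80.0, 210.25])

# Build the frame from plain arrays so that rows are paired by position
submission = pd.DataFrame({'id': test_toy['id'].to_numpy(), 'predicted': predictions})

# Write to a buffer here; a file name would be passed to to_csv in practice
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```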