## Instructions {-}

- This is the template for the code and report on the Prediction Problem assignments.

- Your code in steps 1, 3, 4, and 5 will be executed sequentially, and must produce the RMSE / accuracy claimed on Kaggle.

- Your code in step 2 will also be executed, and must produce the optimal hyperparameter values used to train the model.

## Read data

In [1]:
#importing Libraries 
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.ensemble import RandomForestRegressor


In [2]:
train_data = pd.read_csv('train_regression.csv')
test_data = pd.read_csv('test_regression.csv')

## 1) Data pre-processing

Put the data pre-processing code. You don't need to explain it. You may use the same code from last quarter.

In [3]:
train_data['host_response_rate'] = train_data.host_response_rate.apply(lambda x: str(x)).apply(lambda x: x.replace('%','')).apply(lambda x: float(x))

#host acceptance rate
train_data['host_acceptance_rate'] = train_data.host_acceptance_rate.apply(lambda x: str(x)).apply(lambda x: x.replace('%','')).apply(lambda x: float(x))


#dummy variable host_has_profile_pic
train_data = pd.concat([train_data, pd.get_dummies(train_data['host_has_profile_pic']).rename(columns={'t': 'host_has_profile_pic_t'})], axis = 1)
train_data.drop(labels = ['host_has_profile_pic', 'f'], axis = 1, inplace = True)

#dummy variable host_identity_verified
train_data = pd.concat([train_data, pd.get_dummies(train_data['host_identity_verified']).rename(columns={'t': 'host_identity_verified_t'})], axis = 1)
train_data.drop(labels = ['host_identity_verified', 'f'], axis = 1, inplace = True)

#dummy variable host_is_superhost
train_data = pd.concat([train_data, pd.get_dummies(train_data['host_is_superhost']).rename(columns={'t': 'host_is_superhost_t'})], axis = 1)
train_data.drop(labels = ['host_is_superhost', 'f'], axis = 1, inplace = True)

#number of bathrooms
train_data['Number_of_Bathrooms'] = train_data.bathrooms_text.apply(lambda x: str(x)).apply(lambda x: x.replace('shared baths','').replace('shared bath','').replace('Shared half-bath', '0.5').replace('Private half-bath', '0.5').replace('Half-bath','0.5').replace('private bath','').replace('baths','').replace('bath','')).apply(lambda x: float(x))

#classifying bathrooms as private and public
train_data['shared_bathroom'] = train_data['bathrooms_text'].str.contains('shared', case=False)
train_data['private_bathroom'] = train_data['bathrooms_text'].str.contains('private', case=False)

#dummy variable has_availability
train_data = pd.concat([train_data, pd.get_dummies(train_data['has_availability']).rename(columns={'t': 'has_availability_t'})], axis = 1)
train_data.drop(labels = ['has_availability', 'f'], axis = 1, inplace = True)

#dummy variable instant_bookable
train_data = pd.concat([train_data, pd.get_dummies(train_data['instant_bookable']).rename(columns={'t': 'instant_bookable_t'})], axis = 1)
train_data.drop(labels = ['instant_bookable', 'f'], axis = 1, inplace = True)

#classifying downtown neighborhoods
downtown_neighborhoods = ['River North', 'Loop', 'Gold Coast', 'Streeterville', \
                          'South Loop', 'West Loop', 'River West', 'Near North Side']

#new column is_downtown
train_data['is_downtown']= train_data.host_neighbourhood.isin(downtown_neighborhoods).astype(int)

#defining a function to reclassify a neighbourhood as 'other' if it is not among the 42 most frequent neighbourhoods
def reclassify(row):
    if row['neighbourhood_cleansed'] not in (list(train_data.neighbourhood_cleansed.value_counts()[0:42].index)):
        row['neighbourhood_cleansed'] = 'other'
    return(row)

#applying the function to train
train_data = train_data.apply(reclassify, axis = 1)
neigborhood_dummies = pd.get_dummies(train_data['neighbourhood_cleansed'])
neigborhood_dummies.columns = neigborhood_dummies.columns.str.replace(' ', '_')
train_data = pd.concat([train_data, neigborhood_dummies], axis =1)


#converting the datetime columns to numeric
train_data['host_since'] = (pd.Timestamp.now() - pd.to_datetime(train_data.host_since)).dt.days
train_data['first_review'] = (pd.Timestamp.now() - pd.to_datetime(train_data.first_review)).dt.days
train_data['last_review'] = (pd.Timestamp.now() - pd.to_datetime(train_data.last_review)).dt.days

#hostresponse time dummies
host_response_time_dummies = pd.get_dummies(train_data['host_response_time'])
host_response_time_dummies.columns = host_response_time_dummies.columns.str.replace(' ', '_')
train_data = pd.concat([train_data, host_response_time_dummies], axis =1)

#host_verification dummies
host_verifications_dummies = pd.get_dummies(train_data['host_verifications'])
host_verifications_dummies.columns = host_verifications_dummies.columns.str.replace(' ', '_')
new_columns = ['email_workemail_phone', 'email_phone', 'email', 'phone_workemail', 'phone']
host_verifications_dummies.columns = new_columns
train_data = pd.concat([train_data, host_verifications_dummies], axis =1)

#room type dummies
room_type_dummies = pd.get_dummies(train_data['room_type'])
room_type_dummies.columns = room_type_dummies.columns.str.replace(' ', '_')
train_data = pd.concat([train_data, room_type_dummies], axis =1)

#defining a function to reclassify a property_type as 'other' if it is not among the twenty most frequent types
def reclassify_ptype(row):
    if row['property_type'] not in list(train_data.property_type.value_counts()[:20].index):
        row['property_type'] = 'other'
    return(row)

#applying the function 
train_data = train_data.apply(reclassify_ptype, axis = 1)

#prop type dummies
property_type_dummies = pd.get_dummies(train_data['property_type'])
property_type_dummies.columns = property_type_dummies.columns.str.replace(' ', '_')
train_data = pd.concat([train_data, property_type_dummies], axis =1)

#reassigning the extreme (maximum) value of minimum_maximum_nights to 1125; .loc avoids chained assignment
train_data.loc[train_data.minimum_maximum_nights == train_data.minimum_maximum_nights.max(), 'minimum_maximum_nights'] = 1125

#defining a function to reclassify a host neighbourhood as 'other' if it is not among the 66 most frequent host neighbourhoods
def reclassify_host_n(row):
    if row['host_neighbourhood'] not in list(train_data.host_neighbourhood.value_counts()[:66].index):
        row['host_neighbourhood'] = 'other'
    return(row)

#applying the function across the data
train_data = train_data.apply(reclassify_host_n, axis = 1)

#dummy variables for host_neighbourhood
h_neighbourhood_dummies = pd.get_dummies(train_data['host_neighbourhood'])
h_neighbourhood_dummies.columns = 'is_' + h_neighbourhood_dummies.columns.str.replace(' ', '_').str.replace('/', '_')
train_data = pd.concat([train_data, h_neighbourhood_dummies], axis =1)

train_data['price'] = train_data.price.apply(lambda x: x.replace('$','').replace(',','')).apply(lambda x: float(x))

#getting rid of outliers; .copy() makes train_data an independent frame so the assignments below do not act on a slice
train_data = train_data.loc[train_data.price < 80000].copy()

#capping extreme values with .loc instead of chained indexing
train_data.loc[train_data.host_listings_count > 50, 'host_listings_count'] = 50
train_data.loc[train_data.host_total_listings_count > 200, 'host_total_listings_count'] = 200
train_data.loc[train_data.beds > 15, 'beds'] = 15
train_data.loc[train_data.minimum_minimum_nights > 200, 'minimum_minimum_nights'] = 200
train_data.loc[train_data.minimum_nights_avg_ntm > 400, 'minimum_nights_avg_ntm'] = 400
train_data.loc[train_data.number_of_reviews > 300, 'number_of_reviews'] = 300
train_data.loc[train_data.number_of_reviews_ltm > 70, 'number_of_reviews_ltm'] = 70
train_data.loc[train_data.number_of_reviews_l30d > 70, 'number_of_reviews_l30d'] = 70
train_data.loc[train_data.reviews_per_month > 10, 'reviews_per_month'] = 10

#replacing negative day counts with fixed values
train_data.loc[train_data.first_review < 0, 'first_review'] = 125
train_data.loc[train_data.last_review < 0, 'last_review'] = 155

#listing counts above 300 are set to 95
train_data.loc[train_data.calculated_host_listings_count > 300, 'calculated_host_listings_count'] = 95
train_data.loc[train_data.calculated_host_listings_count_entire_homes > 300, 'calculated_host_listings_count_entire_homes'] = 95



In [4]:
#host_response_rate
test_data['host_response_rate'] = test_data.host_response_rate.apply(lambda x: str(x)).apply(lambda x: x.replace('%','')).apply(lambda x: float(x))

#host acceptance rate
test_data['host_acceptance_rate'] = test_data.host_acceptance_rate.apply(lambda x: str(x)).apply(lambda x: x.replace('%','')).apply(lambda x: float(x))

#dummy variable host_has_profile_pic
test_data = pd.concat([test_data, pd.get_dummies(test_data['host_has_profile_pic']).rename(columns={'t': 'host_has_profile_pic_t'})], axis = 1)
test_data.drop(labels = ['host_has_profile_pic', 'f'], axis = 1, inplace = True)

#dummy variable host_identity_verified
test_data = pd.concat([test_data, pd.get_dummies(test_data['host_identity_verified']).rename(columns={'t': 'host_identity_verified_t'})], axis = 1)
test_data.drop(labels = ['host_identity_verified', 'f'], axis = 1, inplace = True)

test_data = pd.concat([test_data, pd.get_dummies(test_data['host_is_superhost']).rename(columns={'t': 'host_is_superhost_t'})], axis = 1)
test_data.drop(labels = ['host_is_superhost', 'f'], axis = 1, inplace = True)

#number of bathrooms
test_data['Number_of_Bathrooms'] = test_data.bathrooms_text.apply(lambda x: str(x)).apply(lambda x: x.replace('shared baths','').replace('shared bath','').replace('Shared half-bath', '0.5').replace('Private half-bath', '0.5').replace('Half-bath','0.5').replace('private bath','').replace('baths','').replace('bath','')).apply(lambda x: float(x))

#classifying bathrooms as private and public
test_data['shared_bathroom'] = test_data['bathrooms_text'].str.contains('shared', case=False)
test_data['private_bathroom'] = test_data['bathrooms_text'].str.contains('private', case=False)

#dummy variable has_availability
test_data = pd.concat([test_data, pd.get_dummies(test_data['has_availability']).rename(columns={'t': 'has_availability_t'})], axis = 1)
test_data.drop(labels = ['has_availability', 'f'], axis = 1, inplace = True)

#dummy variable instant_bookable
test_data = pd.concat([test_data, pd.get_dummies(test_data['instant_bookable']).rename(columns={'t': 'instant_bookable_t'})], axis = 1)
test_data.drop(labels = ['instant_bookable', 'f'], axis = 1, inplace = True)



#classifying downtown neighborhoods
downtown_neighborhoods = ['River North', 'Loop', 'Gold Coast', 'Streeterville', \
                          'South Loop', 'West Loop', 'River West', 'Near North Side']

#new column is_downtown
test_data['is_downtown']= test_data.host_neighbourhood.isin(downtown_neighborhoods).astype(int)




#applying the function reclassify to test
#getting dummies for neighbourhood_cleansed
test_data = test_data.apply(reclassify, axis = 1)
neigborhood_dummies_test = pd.get_dummies(test_data['neighbourhood_cleansed'])
neigborhood_dummies_test.columns = neigborhood_dummies_test.columns.str.replace(' ', '_')
test_data = pd.concat([test_data, neigborhood_dummies_test], axis =1)

#converting the datetime columns to numeric
test_data['host_since'] = (pd.Timestamp.now() - pd.to_datetime(test_data.host_since)).dt.days
test_data['first_review'] = (pd.Timestamp.now() - pd.to_datetime(test_data.first_review)).dt.days
test_data['last_review'] = (pd.Timestamp.now() - pd.to_datetime(test_data.last_review)).dt.days


#hostresponse time dummies
host_response_time_dummies_test = pd.get_dummies(test_data['host_response_time'])
host_response_time_dummies_test.columns = host_response_time_dummies_test.columns.str.replace(' ', '_')
test_data = pd.concat([test_data, host_response_time_dummies_test], axis =1)

#host_verification dummies
host_verifications_dummies_test = pd.get_dummies(test_data['host_verifications'])
host_verifications_dummies_test.columns = host_verifications_dummies_test.columns.str.replace(' ', '_')
new_columns = ['email_workemail_phone', 'email_phone', 'email', 'phone_workemail', 'phone']
host_verifications_dummies_test.columns = new_columns
test_data = pd.concat([test_data, host_verifications_dummies_test], axis =1)

#room type dummies
room_type_dummies_test = pd.get_dummies(test_data['room_type'])
room_type_dummies_test.columns = room_type_dummies_test.columns.str.replace(' ', '_')
test_data = pd.concat([test_data, room_type_dummies_test], axis =1)

#applying the function reclassify_ptype
test_data = test_data.apply(reclassify_ptype, axis = 1)

#prop type dummies
property_type_dummies_test = pd.get_dummies(test_data['property_type'])
property_type_dummies_test.columns = property_type_dummies_test.columns.str.replace(' ', '_')
test_data = pd.concat([test_data, property_type_dummies_test], axis =1)

#applying the function reclassify_host_n across the data
test_data = test_data.apply(reclassify_host_n, axis = 1)

#dummy variables for host_neighbourhood
h_neighbourhood_dummies_test = pd.get_dummies(test_data['host_neighbourhood'])
h_neighbourhood_dummies_test.columns = 'is_' + h_neighbourhood_dummies_test.columns.str.replace(' ', '_').str.replace('/', '_')
test_data = pd.concat([test_data, h_neighbourhood_dummies_test], axis =1)
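
The `reclassify*` helpers above recompute `value_counts()` for every row. A vectorized version of the same idea is sketched below on toy data: the frequent categories are learned once from the training column and then reused for the test column. The names `top_n_categories` and `bucket_rare` are hypothetical and are not part of the notebook.

```python
import pandas as pd

def top_n_categories(series: pd.Series, n: int) -> list:
    """Return the n most frequent values of a (training) column."""
    return list(series.value_counts().index[:n])

def bucket_rare(series: pd.Series, keep: list, other: str = "other") -> pd.Series:
    """Replace every value not in `keep` with `other` (missing values too)."""
    return series.where(series.isin(keep), other)

# Toy data standing in for a column such as neighbourhood_cleansed.
train_col = pd.Series(["Loop", "Loop", "Uptown", "Uptown", "Uptown", "Hyde Park"])
test_col = pd.Series(["Loop", "Lincoln Park", "Uptown"])

# Learn the frequent categories from train only, then apply to both sets.
keep = top_n_categories(train_col, n=2)
train_bucketed = bucket_rare(train_col, keep)
test_bucketed = bucket_rare(test_col, keep)

print(train_bucketed.tolist())
print(test_bucketed.tolist())
```

Computing `keep` once also makes it explicit that the test set is bucketed with the training frequencies.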

In [5]:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)

test_data_numeric = test_data.select_dtypes(include = ['int64', 'float64', 'uint8'])
train_data_numeric = train_data.select_dtypes(include = ['int64', 'float64', 'uint8'])

test_data_numeric = pd.DataFrame(imputer.fit_transform(test_data_numeric),columns = test_data_numeric.columns)
train_data_numeric = pd.DataFrame(imputer.fit_transform(train_data_numeric),columns = train_data_numeric.columns)
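
The cell above fits the imputer separately on the test and training frames. A common alternative, which is not what this notebook does, is to fit the imputer on the training features only and reuse it for the test features, so both sets are filled from the same neighbours. The sketch below uses made-up toy frames and assumes both frames share the same columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy feature frames with identical columns; the values are made up.
train_X = pd.DataFrame({"beds": [1.0, 2.0, np.nan, 4.0],
                        "accommodates": [2.0, 4.0, 6.0, 8.0]})
test_X = pd.DataFrame({"beds": [np.nan, 3.0],
                       "accommodates": [4.0, 6.0]})

knn = KNNImputer(n_neighbors=2)

# Learn the neighbours from the training rows only ...
train_filled = pd.DataFrame(knn.fit_transform(train_X),
                            columns=train_X.columns, index=train_X.index)
# ... and reuse them to fill the gaps in the test rows.
test_filled = pd.DataFrame(knn.transform(test_X),
                           columns=test_X.columns, index=test_X.index)

print(test_filled)
```

Passing `index=` keeps the original row labels, which matters when the frame has been filtered before imputation.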

In [6]:
from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV, LogisticRegressionCV, LogisticRegression
from sklearn.preprocessing import StandardScaler 
from sklearn.metrics import r2_score, accuracy_score
from sklearn.model_selection import cross_val_score, cross_val_predict

train_data_numeric.drop(labels = ['Belmont_Cragin', 'Private_room_in_guest_suite', 'is_East_Colorado_Springs'], axis = 1, inplace = True )
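
The three columns dropped above are presumably dummy columns that exist in the training set but not in the test set. As a hedged alternative to listing such columns by hand, the sketch below aligns the test dummies to the training dummies with `DataFrame.reindex`. The toy `room_type` values are made up.

```python
import pandas as pd

# Toy frames: train and test contain different category levels.
train_raw = pd.DataFrame({"room_type": ["Entire home/apt", "Private room", "Shared room"]})
test_raw = pd.DataFrame({"room_type": ["Private room", "Hotel room"]})

train_dummies = pd.get_dummies(train_raw["room_type"])
test_dummies = pd.get_dummies(test_raw["room_type"])

# Match the train columns: missing levels are added as 0,
# levels seen only in test are dropped.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_aligned.columns) == list(train_dummies.columns))
```

With this approach the model always sees the same feature columns at training and prediction time, whatever levels happen to occur in the test file.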

In [7]:
X_train = train_data_numeric.drop(columns = ['id', 'host_id', 'price'], axis = 1)
y = train_data_numeric.price
X_test = test_data_numeric.drop(columns = ['id', 'host_id'], axis = 1)

## 2) Hyperparameter tuning

### How many attempts did it take you to tune the model hyperparameters?

Around 10 attempts to tune the hyperparameters

### Which tuning method did you use (grid search / Bayes search / etc.)?

Randomized search (`RandomizedSearchCV`), sampling 60 hyperparameter combinations with 5-fold cross-validation

### What challenges did you face while tuning the hyperparameters, and what actions did you take to address those challenges?

I did not face any major challenges while tuning the hyperparameters. I did have difficulty deciding whether to select a subset of variables from the data set, and ultimately decided not to use variable selection.

### How many hours did you spend on hyperparameter tuning?

45 minutes to an hour

**Paste the hyperparameter tuning code below. You must show at least one hyperparameter tuning procedure.**

In [None]:
#Hyperparameter tuning code
from sklearn.model_selection import RandomizedSearchCV, KFold
from sklearn.ensemble import RandomForestRegressor

kf = KFold(n_splits = 5, shuffle = True)

model = RandomForestRegressor(oob_score=True, verbose=False, n_jobs=-1)

params = {'n_estimators': np.arange(100,3000),
          'max_features': np.arange(1,66),
         'max_samples': np.arange(0.01,1.01,0.01)}


gcv = RandomizedSearchCV(model, params, n_iter = 60, cv = kf, scoring = 'neg_root_mean_squared_error', n_jobs = -1)
gcv.fit(X_train,y)

In [91]:
print(gcv.best_params_)
print(gcv.best_score_)

{'max_features': 42, 'max_samples': 0.94, 'n_estimators': 100}
-133.63987461942853


**Paste the optimal hyperparameter values below.**

{'max_features': 42, 'max_samples': 0.94}
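
Because the search above sets no `random_state`, repeated runs can return different optimal values. The sketch below shows, on synthetic data and a deliberately tiny search space, one way to seed the fold splitter, the forest, and the sampler so that a rerun reproduces the same `best_params_`. The helper `run_search`, the seed, and the parameter grid are illustrative assumptions, not the settings used for the submission.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic stand-in for X_train and y.
X_demo, y_demo = make_regression(n_samples=120, n_features=8,
                                 noise=10.0, random_state=0)

def run_search(seed: int) -> dict:
    """Run a small randomized search with every source of randomness seeded."""
    kf = KFold(n_splits=3, shuffle=True, random_state=seed)
    forest = RandomForestRegressor(n_estimators=20, random_state=seed)
    params = {"max_features": [2, 4, 6, 8],
              "max_samples": [0.5, 0.7, 0.9]}
    search = RandomizedSearchCV(forest, params, n_iter=4, cv=kf,
                                scoring="neg_root_mean_squared_error",
                                random_state=seed)
    search.fit(X_demo, y_demo)
    return search.best_params_

first = run_search(seed=1)
second = run_search(seed=1)
print(first == second)
```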

## 3) Model

Using the optimal model hyperparameters, train the model, and paste the code below.

In [8]:
# An earlier run of the search returned max_samples = 0.9 and max_features = 43, so those values are used here

model = RandomForestRegressor(oob_score=True, verbose=False, \
    max_features = 43, max_samples = 0.9, n_estimators = 400, \
                    n_jobs=-1, random_state =45).fit(X_train,y)
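
The model is trained with `oob_score=True`, but the out-of-bag estimate is never read. The sketch below shows how it could be inspected after fitting. It uses synthetic data in place of `X_train` and `y`, and the forest settings are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the training data.
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=5.0, random_state=0)

forest = RandomForestRegressor(n_estimators=100, max_samples=0.9,
                               oob_score=True, random_state=45).fit(X_demo, y_demo)

# oob_score_ is the R^2 computed from out-of-bag predictions.
print(f"OOB R^2: {forest.oob_score_:.3f}")

# oob_prediction_ holds the out-of-bag prediction for each training row,
# so an out-of-bag RMSE can be derived from it.
oob_rmse = np.sqrt(np.mean((y_demo - forest.oob_prediction_) ** 2))
print(f"OOB RMSE: {oob_rmse:.3f}")
```

The out-of-bag RMSE gives a quick error estimate without a separate cross-validation run, though it is not identical to the cross-validated score.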

## 4) Put any ad-hoc steps for further improving model accuracy
For example, scaling up or scaling down the predictions, capping predictions, etc.

Put code below.

No ad-hoc steps
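
For reference only, capping predictions (one of the examples named in the prompt) could look like the sketch below. No such step was applied to this submission, and the bounds and prediction values are hypothetical.

```python
import numpy as np

# Made-up raw predictions, including an implausibly low and high value.
raw_predictions = np.array([-12.0, 85.5, 240.0, 15000.0])

# Hypothetical bounds; in practice they would come from the training prices.
lower, upper = 10.0, 5000.0

# np.clip limits every prediction to the interval [lower, upper].
capped = np.clip(raw_predictions, lower, upper)
print(capped)
```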

## 5) Export the predictions in the format required to submit on Kaggle
Put code below.

In [9]:
prices = model.predict(X_test)
df = pd.concat([test_data['id'],pd.Series(prices)], axis = 1)


df.columns = ['id', 'predicted']
df

|      | id                 | predicted |
|------|--------------------|-----------|
| 0    | 771986218856585018 | 348.5500  |
| 1    | 855276028675941785 | 133.8025  |
| 2    | 48537824           | 309.2675  |
| 3    | 41867473           | 135.8550  |
| 4    | 28361473           | 103.0850  |
| ...  | ...                | ...       |
| 3333 | 22399540           | 147.3650  |
| 3334 | 964949640914649520 | 72.9875   |
| 3335 | 18007859           | 88.7050   |
| 3336 | 736269020394606618 | 102.3475  |
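
The cell above builds the prediction table but does not write it to disk. A minimal sketch of the export step follows, assuming the submission file needs exactly the two columns shown (`id`, `predicted`) and no index column. The file name `submission.csv` and the toy values are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for the `df` built above; ids and predictions are made up.
submission = pd.DataFrame({"id": [101, 102, 103],
                           "predicted": [120.5, 87.25, 301.0]})

# Hypothetical file name, written to a temporary directory for this sketch.
out_path = os.path.join(tempfile.mkdtemp(), "submission.csv")

# index=False keeps the row index out of the file.
submission.to_csv(out_path, index=False)

# Read the file back to confirm its layout.
check = pd.read_csv(out_path)
print(check.columns.tolist())
```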
