# 4 Pre-Processing and Training Data<a id='4_Pre-Processing_and_Training_Data'></a>

## 4.1 Contents<a id='4.1_Contents'></a>
* [4 Pre-Processing and Training Data](#4_Pre-Processing_and_Training_Data)
  * [4.1 Contents](#4.1_Contents)
  * [4.2 Introduction](#4.2_Introduction)
  * [4.3 Imports](#4.3_Imports)
  * [4.4 Load Data](#4.4_Load_Data)
  * [4.5 Create dummy features for room_type](#4.5_Create_dummy_features_for_room_type)
  * [4.6 Standardize numeric features using a scaler](#4.6_Standardize_numeric_features_using_a_scaler)
  * [4.7 Train/Test Split](#4.7_Train/Test_Split)

## 4.2 Introduction<a id='4.2_Introduction'></a>

In preceding notebooks, you performed preliminary assessments of data quality and refined the question to be answered. You found a small number of data values that gave clear choices about whether to replace values or drop a whole row. You determined that predicting the price was your primary aim. You threw away records with missing price data, but not before making the most of the other available data to look for any patterns between the regions. You didn't see any and decided to treat all states equally; the region label didn't seem to be particularly useful.

In this notebook you'll start to build machine learning models. Before training any model, however, start by considering how useful the mean value is as a predictor. This is more than just a pedagogical device. You never want to go to stakeholders with a machine learning model only to have the CEO point out that it performs worse than just guessing the average! Your first model is a baseline performance comparator for any subsequent model. You then build up the process of efficiently and robustly creating and assessing models against it. The development laid out here may be a little slower than in the real world, but this step of the capstone is definitely more than just instructional. It is good practice to build up an understanding that the machine learning pipelines you build work as expected. You can validate steps with your own functions for checking expected equivalence between, say, pandas and sklearn implementations.
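As a minimal sketch of the mean-as-baseline idea, the snippet below compares a hand-computed mean with sklearn's `DummyRegressor`. The five target values are toy numbers chosen for illustration, not results from this dataset, and the all-zero feature matrix is only a placeholder because the dummy model ignores its inputs.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Toy target values (hypothetical, for illustration only)
y_train = np.array([35.0, 149.0, 190.0, 87.0, 300.0])
# Placeholder features: DummyRegressor ignores them entirely
X_train = np.zeros((len(y_train), 1))

# Baseline 1: the plain training mean
mean_price = y_train.mean()

# Baseline 2: sklearn's DummyRegressor using the mean strategy
dummy = DummyRegressor(strategy="mean")
dummy.fit(X_train, y_train)
dummy_pred = dummy.predict(X_train)

# Both approaches should produce the same constant prediction
agree = np.allclose(dummy_pred, mean_price)
print(mean_price, dummy_pred[:3], agree)
```

Any later model can then be scored against this constant prediction with the same metric, for example `mean_absolute_error`.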

## 4.3 Imports<a id='4.3_Imports'></a>

In [1]:
import pandas as pd
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import __version__ as sklearn_version
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
import datetime
from sklearn import preprocessing

#from library.db_utils import save_file

## 4.4 Load Data<a id='4.4_Load_Data'></a>

In [2]:
airbnb_denver_explored = pd.read_csv('airbnb_denver_explored.csv')
denver_summary_explored = pd.read_csv('denver_summary_explored.csv')
airbnb_denver_explored.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,id,description,host_id,area,latitude,longitude,property_type,room_type,...,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,reviews_per_month,price,home_type,price_log
0,0,0,6.07e+17,Home in Denver · 1 bedroom · 1 bed · 1 shared ...,430149575,North Park Hill,39.76039,-104.92968,Private room in home,Private room,...,2.5,2.5,3.0,3.5,4.0,3.0,0.11,35.0,Home,3.583519
1,1,1,5.46e+17,Rental unit in Denver · 2 bedrooms · 3 beds · ...,169214047,Hale,39.72785,-104.93783,Entire rental unit,Entire home/apt,...,,,,,,,,149.0,Rental,5.010635
2,2,3,52429530.0,Townhouse in Denver · ★4.78 · 3 bedrooms · 4 b...,107279139,Five Points,39.75852,-104.98846,Entire townhouse,Entire home/apt,...,4.88,4.62,4.78,4.78,4.93,4.59,2.52,190.0,Townhouse,5.252273
3,3,4,6.32e+17,Townhouse in Denver · ★New · 2 bedrooms · 2 be...,416194740,West Colfax,39.736019,-105.05072,Entire townhouse,Entire home/apt,...,,,,,,,,87.0,Townhouse,4.477337
4,4,5,6.88e+17,Home in Denver · ★5.0 · 2 bedrooms · 2 beds · ...,133612752,Sunnyside,39.77143,-105.02028,Entire home,Entire home/apt,...,5.0,5.0,5.0,5.0,4.92,4.92,0.99,300.0,Home,5.70711


The extra `Unnamed: 0.1` and `Unnamed: 0` columns, which appear to be index columns left over from earlier saves, are dropped here.

In [3]:
airbnb_denver_explored = airbnb_denver_explored.drop(columns=['Unnamed: 0.1', 'Unnamed: 0'])
airbnb_denver_explored.head()

Unnamed: 0,id,description,host_id,area,latitude,longitude,property_type,room_type,bathrooms,bedrooms,...,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,reviews_per_month,price,home_type,price_log
0,6.07e+17,Home in Denver · 1 bedroom · 1 bed · 1 shared ...,430149575,North Park Hill,39.76039,-104.92968,Private room in home,Private room,1.0,1.0,...,2.5,2.5,3.0,3.5,4.0,3.0,0.11,35.0,Home,3.583519
1,5.46e+17,Rental unit in Denver · 2 bedrooms · 3 beds · ...,169214047,Hale,39.72785,-104.93783,Entire rental unit,Entire home/apt,2.0,3.0,...,,,,,,,,149.0,Rental,5.010635
2,52429530.0,Townhouse in Denver · ★4.78 · 3 bedrooms · 4 b...,107279139,Five Points,39.75852,-104.98846,Entire townhouse,Entire home/apt,2.5,4.0,...,4.88,4.62,4.78,4.78,4.93,4.59,2.52,190.0,Townhouse,5.252273
3,6.32e+17,Townhouse in Denver · ★New · 2 bedrooms · 2 be...,416194740,West Colfax,39.736019,-105.05072,Entire townhouse,Entire home/apt,2.5,2.0,...,,,,,,,,87.0,Townhouse,4.477337
4,6.88e+17,Home in Denver · ★5.0 · 2 bedrooms · 2 beds · ...,133612752,Sunnyside,39.77143,-105.02028,Entire home,Entire home/apt,1.0,2.0,...,5.0,5.0,5.0,5.0,4.92,4.92,0.99,300.0,Home,5.70711
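Columns named `Unnamed: 0` typically appear when a DataFrame is written with `to_csv` using the default `index=True` and then read back with `read_csv`. The sketch below reproduces this on a tiny toy frame held in memory; the frame and buffer names are illustrative only.

```python
import io
import pandas as pd

# Tiny toy frame (hypothetical values)
df = pd.DataFrame({"price": [35.0, 149.0]})

# Default to_csv writes the row index as an unnamed first column
buf_with_index = io.StringIO()
df.to_csv(buf_with_index)
buf_with_index.seek(0)
with_index = pd.read_csv(buf_with_index)

# index=False leaves the row index out of the file
buf_without_index = io.StringIO()
df.to_csv(buf_without_index, index=False)
buf_without_index.seek(0)
without_index = pd.read_csv(buf_without_index)

print(list(with_index.columns))
print(list(without_index.columns))
```

Reading with `pd.read_csv(path, index_col=0)` is another way to avoid the extra column when the file already contains a saved index.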


In [4]:
airbnb_denver_explored.shape

(4889, 23)

In [5]:
airbnb_denver_explored.columns

Index(['id', 'description', 'host_id', 'area', 'latitude', 'longitude',
       'property_type', 'room_type', 'bathrooms', 'bedrooms',
       'number_of_reviews', 'last_scraped', 'review_scores_rating',
       'review_scores_accuracy', 'review_scores_cleanliness',
       'review_scores_checkin', 'review_scores_communication',
       'review_scores_location', 'review_scores_value', 'reviews_per_month',
       'price', 'home_type', 'price_log'],
      dtype='object')

## 4.5 Create dummy features for room_type<a id='4.5_Create_dummy_features_for_room_type'></a>

In [6]:
# Copy so that the in-place drop below does not also modify airbnb_denver_explored
scaled_airbnb_denver_explored = airbnb_denver_explored.copy()

In [7]:
scaled_airbnb_denver_explored.drop(columns=['property_type', 'home_type', 'description',
                                            'host_id', 'last_scraped'], inplace=True)

In [8]:
scaled_airbnb_denver_explored['room_type'].unique()

array(['Private room', 'Entire home/apt', 'Hotel room', 'Shared room'],
      dtype=object)

In [9]:
scaled_airbnb_denver_explored = pd.get_dummies(scaled_airbnb_denver_explored, columns=['room_type'], prefix='room')

In [10]:
scaled_airbnb_denver_explored.columns

Index(['id', 'area', 'latitude', 'longitude', 'bathrooms', 'bedrooms',
       'number_of_reviews', 'review_scores_rating', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'reviews_per_month', 'price', 'price_log',
       'room_Entire home/apt', 'room_Hotel room', 'room_Private room',
       'room_Shared room'],
      dtype='object')
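To make the encoding step concrete, here is a small sketch of `pd.get_dummies` on a toy frame that uses the same four `room_type` categories. The prices are made up, and `dtype=int` is an assumption added here so the indicator columns are 0/1 integers regardless of the pandas version's default.

```python
import pandas as pd

# Toy frame with the four room types seen above (prices are hypothetical)
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Hotel room", "Shared room"],
    "price": [35.0, 149.0, 120.0, 40.0],
})

# One indicator column per category, prefixed with "room"
encoded = pd.get_dummies(toy, columns=["room_type"], prefix="room", dtype=int)

# Each row should have exactly one indicator set
one_hot_ok = encoded.filter(like="room_").sum(axis=1).eq(1).all()
print(encoded.columns.tolist(), one_hot_ok)
```

Because the four indicators always sum to 1, linear models may benefit from `drop_first=True`, which removes one redundant column; tree-based models are usually indifferent to it.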

In [11]:
scaled_airbnb_denver_explored.shape

(4889, 21)

In [12]:
scaled_airbnb_denver_explored.head()

Unnamed: 0,id,area,latitude,longitude,bathrooms,bedrooms,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,...,review_scores_communication,review_scores_location,review_scores_value,reviews_per_month,price,price_log,room_Entire home/apt,room_Hotel room,room_Private room,room_Shared room
0,6.07e+17,North Park Hill,39.76039,-104.92968,1.0,1.0,2,3.0,2.5,2.5,...,3.5,4.0,3.0,0.11,35.0,3.583519,0,0,1,0
1,5.46e+17,Hale,39.72785,-104.93783,2.0,3.0,0,,,,...,,,,,149.0,5.010635,1,0,0,0
2,52429530.0,Five Points,39.75852,-104.98846,2.5,4.0,68,4.78,4.88,4.62,...,4.78,4.93,4.59,2.52,190.0,5.252273,1,0,0,0
3,6.32e+17,West Colfax,39.736019,-105.05072,2.5,2.0,0,,,,...,,,,,87.0,4.477337,1,0,0,0
4,6.88e+17,Sunnyside,39.77143,-105.02028,1.0,2.0,12,5.0,5.0,5.0,...,5.0,4.92,4.92,0.99,300.0,5.70711,1,0,0,0


In [13]:
scaled_airbnb_denver_explored_for_fit = scaled_airbnb_denver_explored.copy()

In [14]:
scaled_airbnb_denver_explored_for_fit

Unnamed: 0,id,area,latitude,longitude,bathrooms,bedrooms,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,...,review_scores_communication,review_scores_location,review_scores_value,reviews_per_month,price,price_log,room_Entire home/apt,room_Hotel room,room_Private room,room_Shared room
0,6.070000e+17,North Park Hill,39.760390,-104.92968,1.0,1.0,2,3.00,2.50,2.50,...,3.50,4.00,3.00,0.11,35.0,3.583519,0,0,1,0
1,5.460000e+17,Hale,39.727850,-104.93783,2.0,3.0,0,,,,...,,,,,149.0,5.010635,1,0,0,0
2,5.242953e+07,Five Points,39.758520,-104.98846,2.5,4.0,68,4.78,4.88,4.62,...,4.78,4.93,4.59,2.52,190.0,5.252273,1,0,0,0
3,6.320000e+17,West Colfax,39.736019,-105.05072,2.5,2.0,0,,,,...,,,,,87.0,4.477337,1,0,0,0
4,6.880000e+17,Sunnyside,39.771430,-105.02028,1.0,2.0,12,5.00,5.00,5.00,...,5.00,4.92,4.92,0.99,300.0,5.707110,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4884,6.800000e+17,Sunnyside,39.774110,-105.01761,1.0,2.0,5,5.00,5.00,5.00,...,5.00,5.00,5.00,0.40,78.0,4.369448,1,0,0,0
4885,6.300000e+17,Union Station,39.752110,-104.99469,1.0,2.0,0,,,,...,,,,,125.0,4.836282,1,0,0,0
4886,4.727457e+07,Union Station,39.752320,-105.00347,2.0,2.0,0,,,,...,,,,,300.0,5.707110,1,0,0,0
4887,9.680000e+17,Stapleton,39.803850,-104.87780,1.0,1.0,3,5.00,5.00,5.00,...,5.00,5.00,5.00,1.05,135.0,4.912655,1,0,0,0


In [15]:
scaled_airbnb_denver_explored_for_fit.drop(columns=['id', 'area'], inplace=True)

In [16]:
scaled_airbnb_denver_explored_for_fit

Unnamed: 0,latitude,longitude,bathrooms,bedrooms,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,reviews_per_month,price,price_log,room_Entire home/apt,room_Hotel room,room_Private room,room_Shared room
0,39.760390,-104.92968,1.0,1.0,2,3.00,2.50,2.50,3.00,3.50,4.00,3.00,0.11,35.0,3.583519,0,0,1,0
1,39.727850,-104.93783,2.0,3.0,0,,,,,,,,,149.0,5.010635,1,0,0,0
2,39.758520,-104.98846,2.5,4.0,68,4.78,4.88,4.62,4.78,4.78,4.93,4.59,2.52,190.0,5.252273,1,0,0,0
3,39.736019,-105.05072,2.5,2.0,0,,,,,,,,,87.0,4.477337,1,0,0,0
4,39.771430,-105.02028,1.0,2.0,12,5.00,5.00,5.00,5.00,5.00,4.92,4.92,0.99,300.0,5.707110,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4884,39.774110,-105.01761,1.0,2.0,5,5.00,5.00,5.00,5.00,5.00,5.00,5.00,0.40,78.0,4.369448,1,0,0,0
4885,39.752110,-104.99469,1.0,2.0,0,,,,,,,,,125.0,4.836282,1,0,0,0
4886,39.752320,-105.00347,2.0,2.0,0,,,,,,,,,300.0,5.707110,1,0,0,0
4887,39.803850,-104.87780,1.0,1.0,3,5.00,5.00,5.00,5.00,5.00,5.00,5.00,1.05,135.0,4.912655,1,0,0,0


## 4.6 Standardize numeric features using a scaler<a id='4.6_Standardize_numeric_features_using_a_scaler'></a>

Create a scaler object.

In [17]:
scaler = preprocessing.StandardScaler()

Fit the scaler to the data and transform it.

In [18]:
scaled_df = scaler.fit_transform(scaled_airbnb_denver_explored_for_fit)

In [19]:
scaled_df = pd.DataFrame(scaled_df)

In [20]:
scaled_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,0.585269,0.790847,-0.583829,-0.772870,-0.543745,-5.376181,-7.371623,-6.532121,-6.155251,-4.325604,-2.695267,-4.566858,-0.916039,-0.137443,-1.909972,-2.392676,-0.035054,2.466882,-0.079883
1,-0.451397,0.654544,0.565485,0.494931,-0.562495,,,,,,,,,-0.046506,0.207500,0.417942,-0.035054,-0.405370,-0.079883
2,0.525694,-0.192206,1.140142,1.128832,0.074997,-0.186709,0.049536,-0.588628,-0.380996,-0.367341,0.294430,-0.433851,0.221241,-0.013801,0.566028,0.417942,-0.035054,-0.405370,-0.079883
3,-0.191147,-1.233460,1.140142,-0.138970,-0.562495,,,,,,,,,-0.095963,-0.583777,0.417942,-0.035054,-0.405370,-0.079883
4,0.936983,-0.724373,-0.583829,-0.138970,-0.449996,0.454686,0.423712,0.476715,0.332675,0.312985,0.262283,0.423943,-0.500767,0.073946,1.240888,0.417942,-0.035054,-0.405370,-0.079883
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4884,1.022363,-0.679719,-0.583829,-0.138970,-0.515620,0.454686,0.423712,0.476715,0.332675,0.312985,0.519461,0.631893,-0.779188,-0.103143,-0.743857,0.417942,-0.035054,-0.405370,-0.079883
4885,0.321483,-0.296399,-0.583829,-0.138970,-0.562495,,,,,,,,,-0.065651,-0.051196,0.417942,-0.035054,-0.405370,-0.079883
4886,0.328173,-0.443238,0.565485,-0.138970,-0.562495,,,,,,,,,0.073946,1.240888,0.417942,-0.035054,-0.405370,-0.079883
4887,1.969826,1.658503,-0.583829,-0.772870,-0.534370,0.454686,0.423712,0.476715,0.332675,0.312985,0.519461,0.631893,-0.472453,-0.057674,0.062122,0.417942,-0.035054,-0.405370,-0.079883
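The scaled output above has integer column labels because `fit_transform` returns a NumPy array. A minimal sketch, on toy values rather than the real data, of keeping the column names and checking the result against a pandas computation is shown below. `StandardScaler` uses the population standard deviation, hence `ddof=0`, and it ignores missing values when fitting while leaving them as NaN in the output.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame with one missing value (hypothetical numbers)
toy = pd.DataFrame({
    "bedrooms": [1.0, 3.0, 4.0, 2.0],
    "review_scores_rating": [3.0, np.nan, 4.78, 5.0],
})

# Scale with sklearn, then restore the column names and index
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns, index=toy.index)

# Same computation in pandas; ddof=0 matches StandardScaler
manual = (toy - toy.mean()) / toy.std(ddof=0)

# The two implementations should agree, with NaN staying NaN
matches = np.allclose(scaled.to_numpy(), manual.to_numpy(), equal_nan=True)
print(scaled.columns.tolist(), matches)
```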


## 4.7 Train/Test Split<a id='4.7_Train/Test_Split'></a>

An 80/20 train/test split gives approximately the following partition sizes.

In [21]:
len(scaled_airbnb_denver_explored_for_fit) * .8, len(scaled_airbnb_denver_explored_for_fit) * .2

(3911.2000000000003, 977.8000000000001)

In [22]:
X_train, X_test, y_train, y_test = train_test_split(scaled_airbnb_denver_explored_for_fit.drop(columns='price'), 
                                                    scaled_airbnb_denver_explored_for_fit.price, test_size=0.2, 
                                                    random_state=47)

In [23]:
X_train.shape, X_test.shape

((3911, 18), (978, 18))

In [24]:
y_train.shape, y_test.shape

((3911,), (978,))
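Two points are worth checking when modeling on this split. The feature matrix still contains `price_log`, which is a direct transform of the target, and a scaler fitted on the full dataset has already seen the test rows. The sketch below is one possible leak-free arrangement on synthetic data, not the notebook's own method: the frame `toy`, its coefficients, and the noise level are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): price depends on bedrooms and bathrooms
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "bedrooms": rng.integers(1, 5, size=50).astype(float),
    "bathrooms": rng.integers(1, 4, size=50).astype(float),
})
toy["price"] = 50 * toy["bedrooms"] + 30 * toy["bathrooms"] + rng.normal(0, 5, size=50)
toy["price_log"] = np.log(toy["price"])

# Remove both the target and its log transform from the features
X = toy.drop(columns=["price", "price_log"])
y = toy["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=47)

# Fit the scaler on the training rows only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train_scaled = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Wrapping the scaler and the estimator in `make_pipeline` achieves the same separation automatically during cross-validation.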

In [25]:
scaled_airbnb_denver_explored_for_fit.columns

Index(['latitude', 'longitude', 'bathrooms', 'bedrooms', 'number_of_reviews',
       'review_scores_rating', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'reviews_per_month', 'price', 'price_log',
       'room_Entire home/apt', 'room_Hotel room', 'room_Private room',
       'room_Shared room'],
      dtype='object')

Check the `dtypes` attribute of `X_train` to verify all features are numeric

In [26]:
X_train.dtypes.unique()

array([dtype('float64'), dtype('int64'), dtype('uint8')], dtype=object)

Repeat this check for the test split in `X_test`

In [27]:
X_test.dtypes.unique()

array([dtype('float64'), dtype('int64'), dtype('uint8')], dtype=object)
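Instead of inspecting the unique dtypes by eye, the check can be done programmatically. The following sketch uses `pandas.api.types.is_numeric_dtype` on a toy frame whose columns are illustrative; note that this function also treats boolean columns as numeric, which matters if `get_dummies` returns `bool` indicators, as it does by default in recent pandas versions.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame with one deliberately non-numeric column (hypothetical values)
toy = pd.DataFrame({
    "bedrooms": [1.0, 2.0],
    "number_of_reviews": [2, 0],
    "area": ["Hale", "Sunnyside"],
})

# Collect the names of columns that are not numeric
non_numeric = [col for col in toy.columns if not is_numeric_dtype(toy[col])]
print(non_numeric)
```

An empty list means every feature is numeric; applying the same comprehension to `X_train` and `X_test` gives a single check that can be turned into an assertion.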

In [28]:
scaled_airbnb_denver_explored_for_fit.head()

Unnamed: 0,latitude,longitude,bathrooms,bedrooms,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,reviews_per_month,price,price_log,room_Entire home/apt,room_Hotel room,room_Private room,room_Shared room
0,39.76039,-104.92968,1.0,1.0,2,3.0,2.5,2.5,3.0,3.5,4.0,3.0,0.11,35.0,3.583519,0,0,1,0
1,39.72785,-104.93783,2.0,3.0,0,,,,,,,,,149.0,5.010635,1,0,0,0
2,39.75852,-104.98846,2.5,4.0,68,4.78,4.88,4.62,4.78,4.78,4.93,4.59,2.52,190.0,5.252273,1,0,0,0
3,39.736019,-105.05072,2.5,2.0,0,,,,,,,,,87.0,4.477337,1,0,0,0
4,39.77143,-105.02028,1.0,2.0,12,5.0,5.0,5.0,5.0,5.0,4.92,4.92,0.99,300.0,5.70711,1,0,0,0


In [29]:
scaled_airbnb_denver_explored

Unnamed: 0,id,area,latitude,longitude,bathrooms,bedrooms,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,...,review_scores_communication,review_scores_location,review_scores_value,reviews_per_month,price,price_log,room_Entire home/apt,room_Hotel room,room_Private room,room_Shared room
0,6.070000e+17,North Park Hill,39.760390,-104.92968,1.0,1.0,2,3.00,2.50,2.50,...,3.50,4.00,3.00,0.11,35.0,3.583519,0,0,1,0
1,5.460000e+17,Hale,39.727850,-104.93783,2.0,3.0,0,,,,...,,,,,149.0,5.010635,1,0,0,0
2,5.242953e+07,Five Points,39.758520,-104.98846,2.5,4.0,68,4.78,4.88,4.62,...,4.78,4.93,4.59,2.52,190.0,5.252273,1,0,0,0
3,6.320000e+17,West Colfax,39.736019,-105.05072,2.5,2.0,0,,,,...,,,,,87.0,4.477337,1,0,0,0
4,6.880000e+17,Sunnyside,39.771430,-105.02028,1.0,2.0,12,5.00,5.00,5.00,...,5.00,4.92,4.92,0.99,300.0,5.707110,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4884,6.800000e+17,Sunnyside,39.774110,-105.01761,1.0,2.0,5,5.00,5.00,5.00,...,5.00,5.00,5.00,0.40,78.0,4.369448,1,0,0,0
4885,6.300000e+17,Union Station,39.752110,-104.99469,1.0,2.0,0,,,,...,,,,,125.0,4.836282,1,0,0,0
4886,4.727457e+07,Union Station,39.752320,-105.00347,2.0,2.0,0,,,,...,,,,,300.0,5.707110,1,0,0,0
4887,9.680000e+17,Stapleton,39.803850,-104.87780,1.0,1.0,3,5.00,5.00,5.00,...,5.00,5.00,5.00,1.05,135.0,4.912655,1,0,0,0


In [30]:
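# index=False avoids writing the row index as an extra 'Unnamed: 0' column
scaled_df.to_csv('scaled_airbnb_denver.csv', index=False)

In [31]:
scaled_airbnb_denver_explored.to_csv('airbnb_denver_pre.csv', index=False)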