## Python Workshop III - Intro to Machine Learning

Machine learning is a rapidly growing field that is being applied across a multitude of disciplines. Boiled down to its simplest explanation, machine learning means building a model, or a series of models, that predicts values we have not observed yet. In other words: how do the attributes of something determine its value, and how can we model those attributes to predict the value of something similar that we have not seen yet?

A common application to introduce machine learning is hedonic pricing, or how do the attributes of something determine its price?

In this exercise we will look at the attributes (bedrooms, number of allowable guests, etc.) of AirBnB listings in Boston and how those attributes determine the price of a listing. Once we have a model that predicts the prices of existing listings reasonably well, we will use it to predict the price of a new listing based on the attributes we have identified.

According to Data Quest, machine learning models perform the following tasks in an application such as valuing real estate:

1. Examining a large data set of past home sales (observations)


2. Finding patterns and statistical relationships between a house’s characteristics (features) and its price (the target variable), including patterns that might not be evident to a human who’s looking at the data


3. Using these statistical relationships and patterns to predict the price of any new houses we feed it data on.

------------------------------------------------------------------------

The content for this workshop comes from the following sources:

## Lesson Content - Data Quest

https://www.dataquest.io/blog/machine-learning-tutorial/

## Data for AirBnB Listings

http://insideairbnb.com/get-the-data.html



## ----------------------------------------------------------------------------------------------------------------------------

To import and manipulate our data, we will be using the Pandas package. This is a very common Python package, used predominantly in data science and social science applications (e.g., economics).

In [None]:
import pandas as pd
boston_listings = pd.read_csv(r'filepath\boston_airbnb.csv')
print(boston_listings.shape)
boston_listings.head()

In [None]:
##Remove variables that are not needed / keep ones we want
boston_listings = boston_listings[['host_response_rate','host_acceptance_rate', 'host_listings_count','accommodates','room_type','bedrooms','bathrooms','beds','price','cleaning_fee','security_deposit','minimum_nights','maximum_nights','number_of_reviews','latitude','longitude','city','zipcode','state']]

In [None]:
## What are the dimensions of our new data frame? We will need to take note of the number of observations for later
print(boston_listings.shape)
boston_listings.head()

## The K-nearest neighbors algorithm

The k-nearest neighbors algorithm stores all AirBnB listings and predicts a value for a new listing from the listings that are most similar to it. Similarity is measured on the variables of interest using their Euclidean distance (how far apart they are numerically).

To perform the k-nearest neighbors algorithm:

First, we select the number of similar listings, k, that we want to compare with.

Second, we need to calculate how similar each listing is to ours using a similarity metric.

Third, we rank each listing using our similarity metric and select the first k listings.

Finally, we calculate the mean price for the k similar listings, and use that as our list price.
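
The four steps above can be sketched on a tiny, made-up data set before we touch the real listings. This is only an illustration: the `toy_listings` values and the `knn_predict` helper are hypothetical names invented for this sketch, and the similarity metric is assumed to be the absolute difference in the number of guests.

```python
# Toy data: (accommodates, price) pairs. These values are invented for illustration.
toy_listings = [(1, 60.0), (2, 80.0), (2, 90.0), (3, 120.0), (3, 130.0),
                (4, 150.0), (4, 170.0), (6, 250.0)]

def knn_predict(listings, our_value, k):
    # Step 2: similarity metric = absolute difference in 'accommodates'
    with_distance = [(abs(acc - our_value), price) for acc, price in listings]
    # Step 3: rank by distance and keep the k closest listings
    # (sorted() is stable, so ties keep their original order)
    nearest = sorted(with_distance, key=lambda pair: pair[0])[:k]
    # Step 4: the prediction is the mean price of those k listings
    return sum(price for _, price in nearest) / k

# Step 1: choose k
suggested_price = knn_predict(toy_listings, our_value=3, k=3)
print(suggested_price)
```

Note that ties in distance are broken by row order here, which is one reason the workshop shuffles the data before sorting.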


Let’s start building our real model by defining the similarity metric we’re going to use. Then, we’ll implement the k-nearest neighbors algorithm and use it to suggest a price for a new listing. For the purposes of this tutorial, we’re going to use a fixed k value of 5, but once you become familiar with the workflow of the algorithm you can experiment with this value to see if you get better results with lower or higher k values.


In [None]:
## In this cell we are going to calculate the distance between the first living
## space in the data set and our own. The smallest distance we can achieve is 0

import numpy as np
our_acc_value = 3
first_living_space_value = boston_listings.loc[0,'accommodates']
first_distance = np.abs(first_living_space_value - our_acc_value)
print(first_distance)

In [None]:
## Next, we calculate the distance for every observation in the data set and
## tabulate the results. As you can see, we have 429 observations with a
## distance of 0. This means these 429 listings accommodate the same number of
## guests as ours. Let's use them for our machine learning

boston_listings['distance'] = np.abs(boston_listings.accommodates - our_acc_value)
boston_listings.distance.value_counts().sort_index()

In [None]:
## If we were to use our data in its original order, that would introduce bias
## into our calculations. So, we are going to randomize our data

boston_listings = boston_listings.sample(frac=1,random_state=0)
boston_listings = boston_listings.sort_values('distance')
boston_listings.price.head()

In [None]:
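## See how our price has a dollar sign in it? We need to remove that (and the
## thousands separator) to make it a value that Python can work with.

## regex=True is required in recent pandas versions, where the pattern is
## otherwise treated as a literal string
boston_listings['price'] = boston_listings.price.str.replace(r"[$,]", '', regex=True).astype(float)
mean_price = boston_listings.price.iloc[:5].mean()
mean_price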
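
To make the cleaning step easier to inspect, here is a minimal sketch on a few made-up price strings. The `raw_prices` values are hypothetical; the pattern `[$,]` matches either a dollar sign or a comma, and `regex=True` asks pandas to treat it as a regular expression rather than a literal string.

```python
import pandas as pd

# Hypothetical price strings in the same format as the 'price' column
raw_prices = pd.Series(["$1,250.00", "$85.00", "$110.00"])

# Strip dollar signs and thousands separators, then convert to float
clean_prices = raw_prices.str.replace(r"[$,]", "", regex=True).astype(float)
print(clean_prices.tolist())
```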

We have now made our first prediction. Our KNN model has told us that when we use the accommodates feature to predict price, we get an average price of $168.80 for a three-person listing.

While this is a cool result, we do not know how accurate this result is. So, next we are going to 'train' our model.

To train our model we are going to split our data into 2 partitions, with one group holding 75% of our data and another group holding 25% of our data.

The rows in the training set (train_df) are used to predict the price value for the rows in the test set (test_df).

We then compare the predicted values with the actual price values in the test set to see how accurate they are.

We are also going to drop the 'distance' variable we generated before so that we can make a new one.

In [None]:
boston_listings = boston_listings[['host_response_rate','host_acceptance_rate', 'host_listings_count','accommodates','room_type','bedrooms','bathrooms','beds','price','cleaning_fee','security_deposit','minimum_nights','maximum_nights','number_of_reviews','latitude','longitude','city','zipcode','state']]
train_df = boston_listings.iloc[:2630].copy()
test_df = boston_listings.iloc[2630:].copy()
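
As a hedged alternative to hard-coding the row count, the 75% / 25% split described above can be computed from the length of the data. The sketch below uses a small invented data frame (`toy_df`) and hypothetical variable names, so it can be run on its own.

```python
import pandas as pd

# Hypothetical stand-in for the listings table (20 rows)
toy_df = pd.DataFrame({"accommodates": range(1, 21), "price": range(50, 250, 10)})

# Shuffle once with a fixed seed so the split is reproducible
shuffled = toy_df.sample(frac=1, random_state=0)

# Compute the cut point from the data instead of hard-coding a row count
split_at = int(len(shuffled) * 0.75)
toy_train = shuffled.iloc[:split_at].copy()
toy_test = shuffled.iloc[split_at:].copy()
print(len(toy_train), len(toy_test))
```

Computing `split_at` this way keeps the proportions correct if the number of listings in the downloaded file changes.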

To make things easier for ourselves while we look at metrics, we’ll combine the model we made earlier into a function. We won’t need to worry about randomizing the rows, since they’re still randomized from earlier.

In [None]:
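def predict_price(new_listing_value, feature_column):
    # Work on a copy so the helper 'distance' column never leaks into train_df
    temp_df = train_df.copy()
    # Distance between each training listing and the new listing on one feature
    temp_df['distance'] = np.abs(temp_df[feature_column] - new_listing_value)
    temp_df = temp_df.sort_values('distance')
    # Average the prices of the 5 nearest neighbors
    knn_5 = temp_df.price.iloc[:5]
    predicted_price = knn_5.mean()
    return predicted_price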

We can now use this function to predict values for our test dataset using the accommodates column.

In [None]:
test_df['predicted_price'] = test_df.accommodates.apply(predict_price,feature_column='accommodates')

## Using RMSE to Evaluate Our Model

For many prediction tasks, we want to penalize predicted values that are further away from the actual value much more than those that are closer to the actual value.

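To do this, we can take the mean of the squared error values, called the mean squared error (MSE), and then take its square root, which gives the root mean squared error (RMSE). Here’s the formula for RMSE:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\text{predicted}_i - \text{actual}_i\right)^2}$$

- https://en.wikipedia.org/wiki/Root-mean-square_deviation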

where n represents the number of rows in the test set. This formula might look overwhelming at first, but all we’re doing is:

- Taking the difference between each predicted value and the actual value (or error),

- Squaring this difference (square),

- Taking the mean of all the squared differences (mean), and

- Taking the square root of that mean (root).
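
The four steps can be followed by hand on a few invented numbers. This is a minimal sketch; the `toy_actual` and `toy_predicted` values are made up, and the mean absolute error is included only to show how squaring penalizes the one large miss more heavily.

```python
import math

# Made-up actual and predicted prices for four listings
toy_actual = [100.0, 150.0, 200.0, 250.0]
toy_predicted = [110.0, 140.0, 230.0, 250.0]

# Error: difference between each predicted value and the actual value
toy_errors = [p - a for p, a in zip(toy_predicted, toy_actual)]
# Square: squaring makes every error positive and magnifies large ones
toy_squared = [e ** 2 for e in toy_errors]
# Mean: average of the squared errors (this is the MSE)
toy_mse = sum(toy_squared) / len(toy_squared)
# Root: the square root brings the result back to price units
toy_rmse = math.sqrt(toy_mse)

# For comparison, the mean absolute error treats all misses linearly
toy_mae = sum(abs(e) for e in toy_errors) / len(toy_errors)
print(toy_mse, round(toy_rmse, 2), toy_mae)
```

Here the single error of 30 pulls the RMSE above the mean absolute error, which is the penalizing behavior described above.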


In [None]:
test_df['squared_error'] = (test_df['predicted_price'] - test_df['price'])**(2)
mse = test_df['squared_error'].mean()
rmse = mse ** (1/2)
print('Our Root Mean Square Error is:', round(rmse))

The smaller the RMSE the better, so a value of 182 is not great.

Let us calculate the RMSE for several of our variables to see which one has the smallest value. We can use that one for future predictions.

In [None]:
for feature in ['accommodates','bedrooms','bathrooms','number_of_reviews']:
    # Predict from the column being evaluated, not always from 'accommodates'
    test_df['predicted_price'] = test_df[feature].apply(predict_price,feature_column=feature)
    test_df['squared_error'] = (test_df['predicted_price'] - test_df['price'])**(2)
    mse = test_df['squared_error'].mean()
    rmse = mse ** (1/2)
    print("RMSE for the {} column: {}".format(feature,rmse))

We can see that the 'bedrooms' and 'accommodates' variables have the lowest RMSE. They are still not great, though.

HOWEVER

We can reduce this error by incorporating additional variables into our prediction, that is, by going from a univariate model to a multivariate model.

-------------------------------------------------------------------------------------------------------------------------------

## Normalize Variables

We’re going to read in a cleaned version of this data set so that we can focus on evaluating the models. In our cleaned data set:

- All columns have been converted to numeric values, since we can’t calculate the Euclidean distance of a value with non-numeric characters.
- Non-numeric columns have been removed for simplicity.
- Any listings with missing values have been removed.
- We have normalized the columns, which will give us more accurate results.
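
To see why scaling matters for a distance-based method, here is a minimal sketch on made-up numbers. It uses z-score standardization (subtract the column mean, divide by the column standard deviation), which is one common scaling choice and an assumption of this sketch rather than necessarily what the workshop code does. The array values and the `nearest_to_first` helper are hypothetical.

```python
import numpy as np

# Hypothetical listings; columns are (accommodates, number_of_reviews)
toy_features = np.array([
    [2.0, 10.0],    # row 0: the listing we want neighbors for
    [2.0, 60.0],    # row 1: same size, more reviews
    [8.0, 12.0],    # row 2: much larger, similar review count
    [4.0, 200.0],
    [3.0, 150.0],
    [5.0, 90.0],
])

def nearest_to_first(features):
    # Euclidean distance from row 0 to every other row
    dists = np.linalg.norm(features[1:] - features[0], axis=1)
    # +1 because row 0 was excluded from the candidates
    return int(np.argmin(dists)) + 1

# Without scaling, the wide-ranging review count dominates the distance
raw_nearest = nearest_to_first(toy_features)

# z-score each column so both features contribute on a comparable scale
scaled = (toy_features - toy_features.mean(axis=0)) / toy_features.std(axis=0)
scaled_nearest = nearest_to_first(scaled)
print(raw_nearest, scaled_nearest)
```

With these numbers, the unscaled search picks row 2 (a listing for eight guests) because its review count is close, while the scaled search picks row 1, which accommodates the same number of guests.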

In [None]:
boston_listings = pd.read_csv(r'filepath\boston_airbnb.csv')
boston_listings = boston_listings[['accommodates','bedrooms','bathrooms','beds','price','minimum_nights','maximum_nights','number_of_reviews']]
boston_listings['price'] = boston_listings.price.str.replace(r"[$,]", '', regex=True).astype(float)
print(boston_listings.shape)
boston_listings.describe()

In [None]:
from sklearn import preprocessing
a = [boston_listings.accommodates]
a = np.nan_to_num(a)
b = [boston_listings.bedrooms]
b = np.nan_to_num(b)
c = [boston_listings.bathrooms]
c =np.nan_to_num(c)
d = [boston_listings.beds]
d = np.nan_to_num(d)
e = [boston_listings.price]
e = np.nan_to_num(e)
f = [boston_listings.minimum_nights]
f= np.nan_to_num(f)
g = [boston_listings.number_of_reviews]
g= np.nan_to_num(g)

In [None]:
#Normalizing the data
normalized_a = preprocessing.normalize(a)
normalized_b = preprocessing.normalize(b)
normalized_c = preprocessing.normalize(c)
normalized_d = preprocessing.normalize(d)
normalized_e = preprocessing.normalize(e)
normalized_f = preprocessing.normalize(f)
normalized_g = preprocessing.normalize(g)

In [None]:
# Reconstructing the data frame: write each normalized array back into its column.
# normalize() returned arrays of shape (1, n_rows), so [0] selects the row of values.
boston_listings["accommodates"] = normalized_a[0]
boston_listings["bedrooms"] = normalized_b[0]
boston_listings["bathrooms"] = normalized_c[0]
boston_listings["beds"] = normalized_d[0]
boston_listings["minimum_nights"] = normalized_f[0]
boston_listings["number_of_reviews"] = normalized_g[0]
# price (normalized_e) is left in dollars here so the RMSE values below stay in price units

In [None]:
# Saving the data frame to csv to import later
# index=False keeps the row index from being re-read as an extra 'Unnamed: 0' column
boston_listings.to_csv(r'filepath\boston_airbnb_normalized.csv', index=False)

_______________________________________________________________________________

In [None]:
normalized_listings = pd.read_csv(r'filepath\boston_airbnb_normalized.csv')
# Importing data frame

normalized_listings = normalized_listings.sample(frac=1,random_state=0)
# re-randomize observations

# Check the number of empty observations -- empty observations make an uneven matrix that can't be processed
print(normalized_listings.shape)
normalized_listings.isnull().sum()

In [None]:
## Replacing missing values with the mean -- you can replace them with whatever you like, though (median, mode, zero, etc.)
## Assigning the result back avoids the chained-assignment pitfalls of inplace=True on a single column

bedrooms_mean = normalized_listings['bedrooms'].mean()
normalized_listings['bedrooms'] = normalized_listings['bedrooms'].fillna(bedrooms_mean)

bathrooms_mean = normalized_listings['bathrooms'].mean()
normalized_listings['bathrooms'] = normalized_listings['bathrooms'].fillna(bathrooms_mean)

beds_mean = normalized_listings['beds'].mean()
normalized_listings['beds'] = normalized_listings['beds'].fillna(beds_mean)

normalized_listings.isnull().sum()
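
As a compact variant of the same idea, `DataFrame.fillna` accepts a Series of per-column values, so every column can be filled with its own mean in one call. This is a sketch on an invented frame (`toy_missing`); whether the mean is the right replacement depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column
toy_missing = pd.DataFrame({
    "bedrooms": [1.0, np.nan, 3.0],
    "beds": [1.0, 2.0, np.nan],
})

# toy_missing.mean() is a Series indexed by column name, so each
# column's gaps are filled with that column's own mean
toy_filled = toy_missing.fillna(toy_missing.mean())
print(toy_filled)
print(toy_filled.isnull().sum())
```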

In [None]:
# Re-randomizing data and setting training and test data sets

normalized_listings = normalized_listings.sample(frac=1,random_state=0)
# Start the training slice at row 0 so that the first listing is not dropped
norm_train_df = normalized_listings.iloc[:2630].copy()
norm_test_df = normalized_listings.iloc[2630:].copy()

In [None]:
## Calculating the distance between the first and fifth listings

from scipy.spatial import distance
first_listing = normalized_listings.iloc[0][['accommodates', 'bathrooms']]
## iloc is zero-based, so the fifth listing is at position 4
fifth_listing = normalized_listings.iloc[4][['accommodates', 'bathrooms']]
first_fifth_distance = distance.euclidean(first_listing, fifth_listing)
first_fifth_distance
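
For intuition, the Euclidean distance between two listings is the square root of the summed squared differences across their features. The sketch below works it out by hand on two invented two-feature listings and checks the result against the standard library's `math.dist`; the values are made up so that the arithmetic is easy to follow.

```python
import math

# Two hypothetical listings described by (accommodates, bathrooms)
listing_a = [0.5, 1.0]
listing_b = [3.5, 5.0]

# Differences are 3.0 and 4.0, so the squared differences are 9.0 and 16.0
squared_diffs = [(a - b) ** 2 for a, b in zip(listing_a, listing_b)]
toy_distance = math.sqrt(sum(squared_diffs))
print(toy_distance)

# math.dist (Python 3.8+) computes the same quantity directly
print(math.dist(listing_a, listing_b))
```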

In [None]:
# recalculating our RMSE with our new model with TWO predictor variables

def predict_price_multivariate(new_listing_value, feature_columns):
    # Copy so the helper 'distance' column is not written into norm_train_df
    temp_df = norm_train_df.copy()
    # cdist returns an (n_rows, 1) array; ravel() flattens it into a single column
    temp_df['distance'] = distance.cdist(temp_df[feature_columns], [new_listing_value[feature_columns]]).ravel()
    temp_df = temp_df.sort_values('distance')
    # Average the prices of the 5 nearest neighbors
    knn_5 = temp_df.price.iloc[:5]
    predicted_price = knn_5.mean()
    return predicted_price

cols = ['accommodates', 'bathrooms']

norm_test_df['predicted_price'] = norm_test_df[cols].apply(predict_price_multivariate,feature_columns=cols,axis=1)
norm_test_df['squared_error'] = (norm_test_df['predicted_price'] - norm_test_df['price'])**(2)
mse = norm_test_df['squared_error'].mean()
rmse = mse ** (1/2)
print(round(rmse))

Here we can see that our RMSE for accommodates went from 218 to 113, a pretty good improvement.

HOWEVER

Can we do better?

In [None]:
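from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(algorithm='brute')

# Fit on the training features and prices, then predict for the test features
train_features = norm_train_df[cols]
test_features = norm_test_df[cols]
knn.fit(train_features, norm_train_df['price'])

predictions = knn.predict(test_features)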

In [None]:
## Using a precanned package that runs much faster

knn.fit(norm_train_df[cols],norm_train_df['price'])
two_features_predictions = knn.predict(norm_test_df[cols])

In [None]:
from sklearn.metrics import mean_squared_error
two_features_mse = mean_squared_error(norm_test_df['price'], two_features_predictions)
two_features_rmse = two_features_mse ** (1/2)
print(two_features_rmse)

In [None]:
knn = KNeighborsRegressor(algorithm='auto')
cols = ['accommodates','bedrooms','bathrooms','beds']
knn.fit(norm_train_df[cols], norm_train_df['price'])
four_features_predictions = knn.predict(norm_test_df[cols])
four_features_mse = mean_squared_error(norm_test_df['price'], four_features_predictions)
four_features_rmse = four_features_mse ** (1/2)
four_features_rmse

In [None]:
## Running our alternative model seems to have generated similar or worse outcomes. Perhaps choosing different
## variables would improve it?

## You can play around with the models by swapping in the variables you want to include
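
One way to experiment more systematically is to loop over candidate feature sets and values of k and record the RMSE for each combination. The sketch below does this on synthetic data so that it runs on its own: the `toy` data frame, its price formula, and the candidate lists are all invented for illustration, and the numbers it prints say nothing about the Boston listings.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the listings (not real AirBnB data)
rng = np.random.default_rng(0)
n_rows = 400
toy = pd.DataFrame({
    "accommodates": rng.integers(1, 9, n_rows),
    "bedrooms": rng.integers(0, 5, n_rows),
    "minimum_nights": rng.integers(1, 30, n_rows),
})
# Invented price rule: depends on accommodates and bedrooms, plus noise
toy["price"] = 40 * toy["accommodates"] + 25 * toy["bedrooms"] + rng.normal(0, 20, n_rows)

toy_train, toy_test = toy.iloc[:300], toy.iloc[300:]

candidate_features = [
    ["accommodates"],
    ["accommodates", "bedrooms"],
    ["accommodates", "bedrooms", "minimum_nights"],
]

results = {}
for feature_set in candidate_features:
    for k in (3, 5, 10):
        model = KNeighborsRegressor(n_neighbors=k)
        model.fit(toy_train[feature_set], toy_train["price"])
        preds = model.predict(toy_test[feature_set])
        # RMSE is the square root of the mean squared error
        results[(tuple(feature_set), k)] = mean_squared_error(toy_test["price"], preds) ** 0.5

# Print combinations from lowest to highest RMSE
for combo, score in sorted(results.items(), key=lambda item: item[1]):
    print(combo, round(score, 1))
```

To try this on the workshop data, the same loop could be pointed at `norm_train_df` and `norm_test_df` with the column names used above.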