# Evaluating Model Performance

We'll be working with a dataset from October 3, 2015 on AirBnB listings from Washington, D.C. Each row in the dataset is a specific listing that's available for rent on AirBnB in the Washington, D.C. area. Here are the column descriptions:
- `host_response_rate`: the response rate of the host 
- `host_acceptance_rate`: percentage of requests to the host that convert to rentals 
- `host_listings_count`: number of other listings the host has 
- `latitude`: latitude component of the geographic coordinates 
- `longitude`: longitude component of the geographic coordinates 
- `city`: the city where the living space is located 
- `zipcode`: the zip code where the living space is located 
- `state`: the state where the living space is located 
- `accommodates`: the number of guests the rental can accommodate 
- `room_type`: the type of living space (Private room, Shared room or Entire home/apt) 
- `bedrooms`: number of bedrooms included in the rental 
- `bathrooms`: number of bathrooms included in the rental 
- `beds`: number of beds included in the rental 
- `price`: nightly price for the rental 
- `cleaning_fee`: additional fee used for cleaning the living space after the guest leaves 
- `security_deposit`: refundable security deposit, in case of damages 
- `minimum_nights`: minimum number of nights a guest can stay for the rental 
- `maximum_nights`: maximum number of nights a guest can stay for the rental 
- `number_of_reviews`: number of reviews that previous guests have left

In [1]:
import pandas as pd
import numpy as np

In [2]:
dc_listings = pd.read_csv("dc_airbnb.csv")
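# Strip thousands separators and dollar signs. regex=False treats '$' as a
# literal character rather than the regular-expression end-of-string anchor.
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)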
dc_listings['price'] = stripped_dollars.astype('float')
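
As a quick check of the cleaning step, here is a minimal sketch on made-up price strings. The `toy_prices` values are illustrative and not taken from `dc_airbnb.csv`. It passes `regex=False` so that `$` is matched literally, which is an assumption about the intended behavior, since `$` is otherwise a regular-expression anchor.

```python
import pandas as pd

# Hypothetical price strings in the same "$1,234.00" style as the raw column
toy_prices = pd.Series(['$1,200.00', '$85.00', '$160.00'])

cleaned = (
    toy_prices
    .str.replace(',', '', regex=False)   # drop thousands separators
    .str.replace('$', '', regex=False)   # drop the literal dollar sign
    .astype('float')
)
print(cleaned.tolist())  # [1200.0, 85.0, 160.0]
```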

In [3]:
dc_listings.head()

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,accommodates,room_type,bedrooms,bathrooms,beds,price,cleaning_fee,security_deposit,minimum_nights,maximum_nights,number_of_reviews,latitude,longitude,city,zipcode,state
0,92%,91%,26,4,Entire home/apt,1.0,1.0,2.0,160.0,$115.00,$100.00,1,1125,0,38.890046,-77.002808,Washington,20003,DC
1,90%,100%,1,6,Entire home/apt,3.0,3.0,3.0,350.0,$100.00,,2,30,65,38.880413,-76.990485,Washington,20003,DC
2,90%,100%,2,1,Private room,1.0,2.0,1.0,50.0,,,2,1125,1,38.955291,-76.986006,Hyattsville,20782,MD
3,100%,,1,2,Private room,1.0,1.0,1.0,95.0,,,1,1125,0,38.872134,-77.019639,Washington,20024,DC
4,92%,67%,1,4,Entire home/apt,1.0,1.0,1.0,50.0,$15.00,$450.00,7,1125,0,38.996382,-77.041541,Silver Spring,20910,MD


In [4]:
train_df = dc_listings.iloc[0:2792]
test_df = dc_listings.iloc[2792:]

In [5]:
def predict_price(new_listing):
    dc_listing_copy = train_df.copy()
    dc_listing_copy['distance'] = dc_listing_copy['accommodates'].apply(lambda x: np.abs(x-new_listing))
    dc_listing_copy = dc_listing_copy.sort_values('distance')
    predicted_price = dc_listing_copy.iloc[0:5]['price'].mean()
    return(predicted_price)
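
The function above hard-codes the feature column and the number of neighbors. A minimal sketch of the same idea with both exposed as parameters might look like the following. The name `predict_price_knn`, its arguments, and the `toy_train` data are hypothetical and not part of the original notebook, and `nsmallest` is swapped in for the sort-then-slice step.

```python
import pandas as pd

def predict_price_knn(train, feature, value, k=5):
    """Average the price of the k training rows closest to `value` on `feature`."""
    # Univariate distance: absolute difference on the chosen feature
    distances = (train[feature] - value).abs()
    # Index labels of the k smallest distances (ties keep their original order)
    nearest = distances.nsmallest(k).index
    return train.loc[nearest, 'price'].mean()

# Made-up training data, for illustration only
toy_train = pd.DataFrame({
    'accommodates': [1, 2, 2, 4, 6],
    'price': [60.0, 80.0, 100.0, 160.0, 300.0],
})

# The three nearest rows to accommodates=2 have prices 80, 100 and 60
print(predict_price_knn(toy_train, 'accommodates', 2, k=3))  # 80.0
```

Passing the training frame as an argument also avoids relying on a global `train_df`, which makes it easier to compare models built on different columns.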

In [6]:
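# Copy the slice so that adding a column does not trigger pandas'
# SettingWithCopyWarning, rather than silencing all warnings.
test_df = test_df.copy()
test_df['predicted_price'] = test_df['accommodates'].apply(predict_price)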

In [7]:
test_df.head()

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,accommodates,room_type,bedrooms,bathrooms,beds,price,cleaning_fee,security_deposit,minimum_nights,maximum_nights,number_of_reviews,latitude,longitude,city,zipcode,state,predicted_price
2792,20%,75%,1,2,Entire home/apt,0.0,1.0,1.0,120.0,,,1,1125,8,38.922187,-77.032475,Washington,20009,DC,104.0
2793,100%,25%,2,3,Entire home/apt,2.0,2.0,1.0,140.0,$75.00,$150.00,2,1125,7,38.931681,-77.044739,Washington,20010,DC,177.4
2794,,,1,4,Entire home/apt,2.0,1.0,1.0,299.0,,,2,1125,5,38.933765,-77.031488,Washington,20010,DC,145.8
2795,100%,100%,1,3,Entire home/apt,1.0,1.0,1.0,85.0,$30.00,$250.00,1,92,2,38.925692,-77.032616,Washington,20009,DC,177.4
2796,100%,100%,1,6,Entire home/apt,2.0,2.0,3.0,175.0,$65.00,$850.00,1,1125,62,38.927572,-77.033604,Washington,20009,DC,187.2


### Error Metrics

In [8]:
# Using the mean absolute error

mae = np.absolute(test_df['predicted_price'] - test_df['price']).mean()
mae

56.29001074113876

In [9]:
# Using the mean squared error

mse = ((test_df['predicted_price']-test_df['price'])**2).mean()
mse

18646.525370569325
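
To make the two metrics concrete, here is a minimal sketch that wraps them in small helper functions and applies them to made-up numbers. The function names and the toy values are assumptions for illustration. MAE averages the absolute differences between predicted and actual prices, while MSE averages the squared differences.

```python
import numpy as np

def toy_mae(predicted, actual):
    """Mean absolute error: average of |predicted - actual|."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.abs(predicted - actual).mean()

def toy_mse(predicted, actual):
    """Mean squared error: average of (predicted - actual)**2."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return ((predicted - actual) ** 2).mean()

# Made-up prices: the errors are -10, 20 and 0
toy_predicted = [100.0, 150.0, 200.0]
toy_actual = [110.0, 130.0, 200.0]

print(toy_mae(toy_predicted, toy_actual))  # 10.0
print(toy_mse(toy_predicted, toy_actual))  # 500 / 3, about 166.67
```

Note that MSE is expressed in squared dollars, so it cannot be compared directly with the prices themselves.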

### Training another model 

In [10]:
# Using the bathrooms column instead of the accommodates column to make predictions

train_df = dc_listings.iloc[0:2792]
# Take an explicit copy so new columns can be added to test_df safely
test_df = dc_listings.iloc[2792:].copy()

def predict_price(new_listing):
    dc_listing_copy = train_df.copy()
    dc_listing_copy['distance'] = dc_listing_copy['bathrooms'].apply(lambda x: np.abs(x-new_listing))
    dc_listing_copy = dc_listing_copy.sort_values('distance')
    predicted_price = dc_listing_copy.iloc[0:5]['price'].mean()
    return(predicted_price)

In [11]:
test_df['predicted_price'] = test_df['bathrooms'].apply(predict_price)
test_df['squared_error'] = (test_df['predicted_price'] - test_df['price'])**2
mse = test_df['squared_error'].mean()
mse

18405.444081632548

### Root Mean Squared Error (RMSE)

In [12]:
rmse = np.sqrt(mse)
rmse

135.66666532952209

**With an RMSE of about 135.67, we should expect the model's predicted prices to be off by roughly 135.7 dollars on a typical listing.**

### MAE vs. RMSE

In [13]:
errors_one = pd.Series([5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10])
errors_two = pd.Series([5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 1000])

In [14]:
mae_one = errors_one.sum()/len(errors_one)
rmse_one = np.sqrt((errors_one**2).sum()/len(errors_one))
print(mae_one)
print(rmse_one)

7.5
7.90569415042


In [15]:
mae_two = errors_two.sum()/len(errors_two)
rmse_two = np.sqrt((errors_two**2).sum()/len(errors_two))
print(mae_two)
print(rmse_two)

62.5
235.823026865
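
The two series differ only in their last value (10 versus 1000), yet the MAE rises from 7.5 to 62.5 while the RMSE rises from about 7.9 to about 235.8. Because RMSE squares each error before averaging, a single large error dominates the total. A minimal sketch of that effect, measuring how much of each sum the outlier accounts for (the names `abs_share` and `sq_share` are illustrative):

```python
import pandas as pd

# Same values as errors_two above: eight (5, 10) pairs, then 5 and the outlier 1000
errors_two = pd.Series([5, 10] * 8 + [5, 1000])

# Share of the total absolute error contributed by the largest error
abs_share = errors_two.abs().max() / errors_two.abs().sum()
# Share of the total squared error contributed by the largest error
sq_share = (errors_two ** 2).max() / (errors_two ** 2).sum()

print(round(abs_share, 3))  # 0.889
print(round(sq_share, 3))   # 0.999
```

When the RMSE is much larger than the MAE, as it is here, that gap is a hint that a few large errors are present.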
