### Dive into the Boston and Seattle Airbnb dataset

#### CRISP-DM Process

**Business Understanding**: We want to understand which factors drive Airbnb rental prices in Boston and Seattle. This project focuses on the following three questions.

1. Is there a significant difference in Airbnb listing prices between Boston and Seattle?
2. What are the major factors for predicting Airbnb listing prices in Boston and in Seattle, respectively?
3. What are the top factors that people care about most when they consider an Airbnb listing?

**Data Understanding**: The Boston and Seattle datasets are both inspected before any processing.  
**Prepare Data**: Data cleaning and handling of NaN values.  
**Data Modeling**: Train the model and find the parameters that predict housing price.  
**Model Validation**: Test the model and evaluate its effectiveness.
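
The modeling and validation steps can be sketched on synthetic data as follows. This is only an illustration of the split, fit, and score workflow using the scikit-learn functions imported below; the two features and the coefficients are made up for the example and are not taken from the Airbnb data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for listing features (hypothetical, not real Airbnb columns)
rng = np.random.default_rng(42)
X = rng.uniform(0, 5, size=(200, 2))
# Made-up linear relationship plus noise, used only to exercise the workflow
y = 50 + 30 * X[:, 0] + 10 * X[:, 1] + rng.normal(0, 5, size=200)

# Hold out 30% of the rows for validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Data Modeling: fit on the training split
model = LinearRegression().fit(X_train, y_train)

# Model Validation: score on the held-out split
y_pred = model.predict(X_test)
test_r2 = r2_score(y_test, y_pred)
test_mse = mean_squared_error(y_test, y_pred)
print(f"test R^2: {test_r2:.3f}, test MSE: {test_mse:.1f}")
```

Scoring on held-out rows rather than on the training rows is what makes the R² an estimate of how the model would do on unseen listings.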


In [140]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import MultiLabelBinarizer
import seaborn as sns
%matplotlib inline 

listings_Boston = pd.read_csv("boston_listings.csv")
listings_Seattle = pd.read_csv("seattle_listings.csv")



# Columns present in the Boston data but not in the Seattle data:
# listings_Boston.columns[~listings_Boston.columns.isin(listings_Seattle.columns)]

# Drop URLs, free-text fields, identifiers, and location/date columns that are not used for modeling
listings_Boston.drop(['listing_url', 'scrape_id','last_scraped', 'name', 'summary', 'space',
                     'description','neighborhood_overview', 'notes','transit', 'access', 'interaction','house_rules',
                     'thumbnail_url', 'medium_url', 'picture_url', 'xl_picture_url', 'host_url','host_name','host_location', 'host_about',
                      'host_thumbnail_url', 'host_picture_url','city', 'state', 'smart_location', 'country_code', 'country', 'first_review', 'last_review', 'id', 'host_verifications', 
               'host_id', 'neighbourhood', 'calendar_last_scraped', 'market','street', 'host_since'], axis = 1, inplace = True)



In [141]:
# Drop columns in which more than 75% of the values are NaN
dropping_index = listings_Boston.isnull().mean() > 0.75  # boolean mask over columns
listings_Boston.drop(listings_Boston.columns[dropping_index], axis = 1, inplace = True)


In [142]:
# Fix data types: convert percentage and currency strings to floats
listings_Boston['host_response_rate'] = listings_Boston['host_response_rate'].replace('%','', regex = True).astype(float)
listings_Boston['host_acceptance_rate'] = listings_Boston['host_acceptance_rate'].replace('%','', regex = True).astype(float)
listings_Boston['cleaning_fee'] = listings_Boston['cleaning_fee'].replace(r'\$', '',regex = True).astype(float)
listings_Boston['price'] = listings_Boston['price'].replace({r'\$': '', ',':''}, regex = True).astype(float)
listings_Boston['extra_people'] = listings_Boston['extra_people'].replace({r'\$': '', ',':''}, regex = True).astype(float)
listings_Boston['security_deposit'] = listings_Boston['security_deposit'].replace({r'\$': '', ',':''}, regex = True).astype(float)
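
The repeated conversions above could also be factored into one helper. The sketch below is an alternative under stated assumptions, not the code used in this notebook: `strip_to_float` is a hypothetical name, and the toy frame only mimics the string formats seen in the listings data.

```python
import pandas as pd

def strip_to_float(series: pd.Series) -> pd.Series:
    """Remove '$', ',' and '%' characters from a string column and cast it to float.

    Missing values stay missing (NaN) after the conversion.
    """
    return series.str.replace(r"[$,%]", "", regex=True).astype(float)

# Toy data in the same string formats as the listings columns (values are made up)
toy = pd.DataFrame({
    "price": ["$1,250.00", "$85.00", None],
    "host_response_rate": ["100%", "96%", None],
})

# Apply the helper to every column of the toy frame
cleaned = toy.apply(strip_to_float)
print(cleaned.dtypes)
```

A single character class handles all three symbols, so the same helper works for both the percentage and the currency columns.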

In [143]:
# Keep only the 5-digit ZIP code and store it as a number
if isinstance(listings_Boston.zipcode[0], str):
    listings_Boston.zipcode = listings_Boston.zipcode.str[:5].astype(float)

In [144]:
# Split each raw amenities string on commas and strip the quotes and braces from every name
listings_Boston['amenities'] = listings_Boston['amenities'].map(lambda d: [amenity.replace('"', "").replace("{", "").replace("}", "") for amenity in d.split(",")])

In [145]:
listings_Boston['amenities']

0       [TV, Wireless Internet, Kitchen, Free Parking ...
1       [TV, Internet, Wireless Internet, Air Conditio...
2       [TV, Cable TV, Wireless Internet, Air Conditio...
3       [TV, Internet, Wireless Internet, Air Conditio...
4       [Internet, Wireless Internet, Air Conditioning...
                              ...                        
3580    [Internet, Wireless Internet, Air Conditioning...
3581    [TV, Internet, Wireless Internet, Air Conditio...
3582    [translation missing: en.hosting_amenity_49, t...
3583    [Kitchen, Gym, Family/Kid Friendly, Washer, Dr...
3584    [Wireless Internet, Kitchen, Essentials, trans...
Name: amenities, Length: 3585, dtype: object
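
The parsing step can be illustrated on a single string. The sketch below assumes the raw value looks like `{TV,"Wireless Internet",Kitchen}`, which is inferred from the characters stripped by the cell above; `parse_amenities` is a hypothetical helper. Unlike the one-liner above, it also drops empty names, so an empty `{}` yields an empty list rather than a list holding one empty string.

```python
from typing import List

def parse_amenities(raw: str) -> List[str]:
    """Turn a string such as '{TV,"Wireless Internet"}' into a list of names."""
    # Remove the surrounding braces, then split on commas
    inner = raw.strip().strip("{}")
    items = [item.strip().strip('"') for item in inner.split(",")]
    # Discard empty strings produced by an empty amenities field
    return [item for item in items if item]

print(parse_amenities('{TV,"Wireless Internet",Kitchen}'))
print(parse_amenities("{}"))
```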

In [146]:

# Collect the distinct amenity names across all listings
possible_amenities = set([item for sublist in listings_Boston['amenities'] for item in sublist])
possible_amenities = list(possible_amenities)

In [162]:

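# One-hot encode the amenity lists: one 0/1 column per distinct amenity
mlb = MultiLabelBinarizer()
amenities_result = pd.DataFrame(mlb.fit_transform(listings_Boston['amenities']), columns = mlb.classes_, index = listings_Boston.index)
# Preview the listings with the amenities column replaced by its indicator columns (result is displayed, not stored)
pd.concat([listings_Boston.drop(['amenities'], axis = 1), amenities_result], axis = 1, sort = False)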

Unnamed: 0,experiences_offered,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_neighbourhood,host_listings_count,host_total_listings_count,host_has_profile_pic,host_identity_verified,...,Smoke Detector,Smoking Allowed,Suitable for Events,TV,Washer,Washer / Dryer,Wheelchair Accessible,Wireless Internet,translation missing: en.hosting_amenity_49,translation missing: en.hosting_amenity_50
0,none,,,,f,Roslindale,1,1,t,f,...,1,0,0,1,1,0,0,1,0,0
1,none,within an hour,100.0,100.0,f,Roslindale,1,1,t,t,...,1,0,0,1,1,0,0,1,0,0
2,none,within a few hours,100.0,88.0,t,Roslindale,1,1,t,t,...,1,0,0,1,1,0,0,1,1,1
3,none,within a few hours,100.0,50.0,f,,1,1,t,f,...,1,0,0,1,1,0,0,1,0,0
4,none,within an hour,100.0,100.0,t,Roslindale,1,1,t,t,...,1,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3580,none,within an hour,96.0,100.0,f,Somerville,28,28,t,t,...,1,0,0,0,0,0,0,1,0,0
3581,none,a few days or more,10.0,83.0,f,,2,2,t,t,...,1,1,0,1,1,0,0,1,0,0
3582,none,within a day,78.0,50.0,f,,1,1,t,f,...,0,0,0,0,0,0,0,0,1,1
3583,none,within an hour,100.0,96.0,f,Somerville,4,4,t,t,...,0,0,0,0,1,0,0,0,0,1
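
To make the encoding easier to follow, here is a small sketch of the same `MultiLabelBinarizer` pattern on a three-row toy frame. The rows are made up for illustration. The sketch assigns the concatenated frame to a variable (`encoded`, a name chosen here), which is needed if the indicator columns are to be used in later steps.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy listings with already-parsed amenity lists (values are made up)
toy = pd.DataFrame({
    "price": [100.0, 80.0, 120.0],
    "amenities": [["TV", "Kitchen"], ["Kitchen"], ["TV", "Washer"]],
})

# fit_transform returns a 0/1 matrix with one column per distinct amenity;
# mlb.classes_ holds the amenity names in sorted order
mlb = MultiLabelBinarizer()
flags = pd.DataFrame(
    mlb.fit_transform(toy["amenities"]),
    columns=mlb.classes_,
    index=toy.index,
)

# Replace the list column with the indicator columns and keep the result
encoded = pd.concat([toy.drop(columns="amenities"), flags], axis=1)
print(encoded)
```

Passing `index=toy.index` keeps the indicator rows aligned with the original rows when the two frames are concatenated.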


