# Seattle AirBnB data

## Using data to understand the homeowner's market in Seattle

I approached the data from the perspective of a homeowner in Seattle, whose main objective would be to offer guests a great experience while making a healthy profit. I structured my business understanding questions around these objectives:

### Business Understanding
1. Can we predict what drives higher ratings?
2. When are the most popular times of the year for Seattle home-owners?
3. When are the most profitable times of the year for Seattle home-owners?

### Data Understanding

#### Data Exploration

All data was obtained from Kaggle: https://www.kaggle.com/airbnb/seattle/home

In [48]:
#import libraries and load data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython import display
%matplotlib inline

listings = pd.read_csv('./Project 1/seattle/listings.csv')
calendar = pd.read_csv('./Project 1/seattle/calendar.csv')
reviews = pd.read_csv('./Project 1/seattle/reviews.csv')

In [56]:
#explore columns in datasets
print(listings.columns.values)
print(calendar.columns.values)
print(reviews.columns.values)

['id' 'listing_url' 'scrape_id' 'last_scraped' 'name' 'summary' 'space'
 'description' 'experiences_offered' 'neighborhood_overview' 'notes'
 'transit' 'thumbnail_url' 'medium_url' 'picture_url' 'xl_picture_url'
 'host_id' 'host_url' 'host_name' 'host_since' 'host_location'
 'host_about' 'host_response_time' 'host_response_rate'
 'host_acceptance_rate' 'host_is_superhost' 'host_thumbnail_url'
 'host_picture_url' 'host_neighbourhood' 'host_listings_count'
 'host_total_listings_count' 'host_verifications' 'host_has_profile_pic'
 'host_identity_verified' 'street' 'neighbourhood'
 'neighbourhood_cleansed' 'neighbourhood_group_cleansed' 'city' 'state'
 'zipcode' 'market' 'smart_location' 'country_code' 'country' 'latitude'
 'longitude' 'is_location_exact' 'property_type' 'room_type'
 'accommodates' 'bathrooms' 'bedrooms' 'beds' 'bed_type' 'amenities'
 'square_feet' 'price' 'weekly_price' 'monthly_price' 'security_deposit'
 'cleaning_fee' 'guests_included' 'extra_people' 'minimum_nights'
 'm

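It appears that all three datasets can be merged on the listing ID ('id' in listings, 'listing_id' in calendar and reviews). First, check the shape of each dataset to confirm that columns are variables and rows are individual observations.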

In [57]:
#check no. of rows and columns
print(listings.shape)
print(calendar.shape)
print(reviews.shape)

(3818, 91)
(1393570, 4)
(84849, 6)
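
The observation that the datasets share a listing ID can be checked before any merge is attempted. The following is a minimal sketch on made-up toy frames (`listings_toy` and `calendar_toy` are hypothetical stand-ins, not the real data); it assumes the key is 'id' in listings and 'listing_id' in calendar, as the column lists suggest.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are invented for illustration)
listings_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "room_type": ["Entire home/apt", "Private room", "Shared room"],
})
calendar_toy = pd.DataFrame({
    "listing_id": [1, 1, 2, 3],
    "date": ["2016-01-04", "2016-01-05", "2016-01-04", "2016-01-04"],
    "available": ["t", "f", "t", "t"],
})

# Every calendar row should point at a listing that exists
all_matched = calendar_toy["listing_id"].isin(listings_toy["id"]).all()
print(all_matched)

# A left merge keeps every calendar row; validate raises an error
# if 'id' is not unique in the listings table
merged = calendar_toy.merge(
    listings_toy,
    left_on="listing_id",
    right_on="id",
    how="left",
    validate="many_to_one",
)
print(merged.shape)
```

On the real data, the same two checks would use `listings` and `calendar` in place of the toy frames.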


In [51]:
#check for missing values in the columns for each dataset, get percentages
(listings.isnull().sum()/len(listings)).sort_values(ascending=False)

license                             1.000000
square_feet                         0.974594
monthly_price                       0.602672
security_deposit                    0.511262
weekly_price                        0.473808
notes                               0.420639
neighborhood_overview               0.270299
cleaning_fee                        0.269775
transit                             0.244631
host_about                          0.224987
host_acceptance_rate                0.202462
review_scores_accuracy              0.172342
review_scores_checkin               0.172342
review_scores_value                 0.171818
review_scores_location              0.171556
review_scores_cleanliness           0.171032
review_scores_communication         0.170508
review_scores_rating                0.169460
reviews_per_month                   0.164222
first_review                        0.164222
last_review                         0.164222
space                               0.149031
host_respo

For the listings dataset, a number of columns contain missing values. The license column is completely null, and square_feet is about 97% null.

In [52]:
(calendar.isnull().sum()/len(calendar)).sort_values(ascending=False)

price         0.32939
available     0.00000
date          0.00000
listing_id    0.00000
dtype: float64

For the calendar dataset, about 33% of rows have a null value in the price column.
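
One possible explanation, which is an assumption and not verified here, is that the price is missing on dates when a listing is unavailable. A minimal sketch of how to test this is shown on toy data below; the 't'/'f' encoding of the available column and the price strings are assumed for illustration.

```python
import numpy as np
import pandas as pd

# Toy calendar (invented values; 't'/'f' flags are an assumption)
calendar_toy = pd.DataFrame({
    "listing_id": [1, 1, 2, 2],
    "available": ["t", "f", "t", "f"],
    "price": ["$85.00", np.nan, "$120.00", np.nan],
})

# Share of rows with a missing price, split by availability flag
null_share = (
    calendar_toy["price"].isnull()
    .groupby(calendar_toy["available"])
    .mean()
)
print(null_share)
```

If the share is 1.0 for unavailable dates and 0.0 for available ones on the real `calendar` data, the nulls are structural and not a data-quality problem.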

In [53]:
(reviews.isnull().sum()/len(reviews)).sort_values(ascending=False)

comments         0.000212
reviewer_name    0.000000
reviewer_id      0.000000
date             0.000000
id               0.000000
listing_id       0.000000
dtype: float64

The reviews dataset has almost no missing values.

To prepare the data for the 3 business questions, we need to determine which of the 3 datasets, and which columns within them, are relevant to each question.

We have 3 datasets: listings, calendar and reviews. Based on our brief exploration above, the dataset most relevant to our 1st question on ratings is the listings dataset. The calendar dataset looks more relevant as a supplement to the listings dataset for our 2nd question on popular times and availability.

Meanwhile, the reviews dataset is more relevant for qualitative predictors and consists mainly of unstructured text, so we will only analyse it if the other datasets do not give us enough information to answer our questions.

After determining the datasets that are relevant for answering our questions, we move to preparing the data for our analysis.

### Question 1: Can we predict what drives higher ratings?
#### Part I: Data Preparation

Since the license column is entirely null and is not relevant to the questions above, we can drop it from our analysis dataset.

In [54]:
#drop license column
listings=listings.drop(columns=['license'])

Next, we revisit the question, which is about what drives higher ratings for homes. The target variable (y column) should come from the set of review_scores columns. However, there are several columns with the review_scores prefix.

In [73]:
#check column names that begin with 'review_scores' 
[col for col in listings if col.startswith('review_scores_')]

['review_scores_rating',
 'review_scores_accuracy',
 'review_scores_cleanliness',
 'review_scores_checkin',
 'review_scores_communication',
 'review_scores_location',
 'review_scores_value']

Based on AirBnB's ratings methodology (https://www.airbnb.com/help/article/1257/how-do-star-ratings-work), the overall rating is the one that reflects the overall experience for guests, and so the review_scores_rating column is the one I would set as my target variable.

However, we need to revisit the other columns in the listings dataset, as quite a few of them are redundant.

For example, columns that contain only one unique value provide no predictive power.
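
Because roughly 17% of listings have no review_scores_rating, rows without a target cannot be used to fit a model. The following is a minimal sketch of one way to separate the target from the features; it uses a hypothetical toy frame, and dropping unlabelled rows is an assumption about the approach, not necessarily the step taken later in this analysis.

```python
import numpy as np
import pandas as pd

# Toy listings frame (invented values)
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "review_scores_rating": [95.0, np.nan, 88.0],
    "bedrooms": [1, 2, 1],
})

# Keep only rows where the target is known
labelled = toy.dropna(subset=["review_scores_rating"])

# Split into target (y) and candidate features (X)
y = labelled["review_scores_rating"]
X = labelled.drop(columns=["review_scores_rating"])
print(X.shape, y.shape)
```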

In [119]:
#find columns in dataset that only contain one unique value
one_unique=[col for col in listings.columns.values if listings[col].nunique()==1]
one_unique

['scrape_id',
 'last_scraped',
 'experiences_offered',
 'market',
 'country_code',
 'country',
 'has_availability',
 'calendar_last_scraped',
 'requires_license',
 'jurisdiction_names']
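
One caveat with this check: `nunique()` ignores missing values by default, so a column with a single value plus some nulls is also reported as having one unique value, while an entirely null column (such as license) has zero unique values and is not reported at all. The sketch below shows the difference on a hypothetical toy frame.

```python
import numpy as np
import pandas as pd

# Toy frame (invented values)
toy = pd.DataFrame({
    "constant": ["a", "a", "a"],
    "constant_or_missing": ["a", np.nan, "a"],
    "varied": ["a", "b", "c"],
})

# Default behaviour: NaN is ignored when counting unique values
ignoring_nan = [col for col in toy.columns if toy[col].nunique() == 1]

# dropna=False counts NaN as a value of its own
counting_nan = [col for col in toy.columns
                if toy[col].nunique(dropna=False) == 1]

print(ignoring_nan)
print(counting_nan)
```

Whether a "one value or missing" column is worth keeping depends on whether the missingness itself is informative.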

Columns with 'url' in the name are also irrelevant: they are links, not characteristics of the home, and carry no predictive power for ratings.

In [120]:
#find columns containing 'url' in the name
url_col=[col for col in listings.columns.values if 'url' in col]
url_col

['listing_url',
 'thumbnail_url',
 'medium_url',
 'picture_url',
 'xl_picture_url',
 'host_url',
 'host_thumbnail_url',
 'host_picture_url']

In [122]:
listings.columns.values

array(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name',
       'summary', 'space', 'description', 'experiences_offered',
       'neighborhood_overview', 'notes', 'transit', 'thumbnail_url',
       'medium_url', 'picture_url', 'xl_picture_url', 'host_id',
       'host_url', 'host_name', 'host_since', 'host_location',
       'host_about', 'host_response_time', 'host_response_rate',
       'host_acceptance_rate', 'host_is_superhost', 'host_thumbnail_url',
       'host_picture_url', 'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'street',
       'neighbourhood', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'city', 'state', 'zipcode',
       'market', 'smart_location', 'country_code', 'country', 'latitude',
       'longitude', 'is_location_exact', 'property_type', 'room_type',
       'accommodates', 'bathrooms', 'bedrooms', 'beds', 'bed_type',
       

Columns like 'host_name' or 'host_about' are also unlikely to be predictive of the rating, as they are unstructured and do not describe any characteristic of the home.

There are also redundant details like 'longitude' and 'latitude'.

In [113]:
#columns to drop: URLs, scrape metadata, host identifiers, free-text host fields and redundant location details
drop_cols=['listing_url','scrape_id','last_scraped','thumbnail_url',
       'medium_url', 'picture_url', 'xl_picture_url', 'host_id',
       'host_url', 'host_name','host_about','host_thumbnail_url',
       'host_picture_url','country_code','latitude',
       'longitude','requires_license', 'jurisdiction_names']
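
A hand-typed drop list makes duplicates and omissions easy to introduce. As an alternative, the list could be assembled from the lists computed earlier. The following is a minimal sketch on a hypothetical toy frame; the `extra` list and the use of `errors="ignore"` are illustrative choices, not the original notebook's method.

```python
import pandas as pd

# Toy listings frame (invented values)
toy = pd.DataFrame({
    "id": [1, 2],
    "listing_url": ["u1", "u2"],
    "scrape_id": [7, 7],
    "price": ["$85.00", "$120.00"],
})

# Same two checks as in the cells above
one_unique = [col for col in toy.columns if toy[col].nunique() == 1]
url_col = [col for col in toy.columns if "url" in col]

# Hand-picked columns; these do not exist in the toy frame
extra = ["host_name", "host_about"]

# dict.fromkeys removes duplicates while preserving order
drop_cols = list(dict.fromkeys(one_unique + url_col + extra))

# errors="ignore" skips names that are not present in the frame
trimmed = toy.drop(columns=drop_cols, errors="ignore")
print(trimmed.columns.tolist())
```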