# Exploring Airbnb NYC Listings Through Visuals with Python and Tableau
[Shawn Rodgers](http://www.linkedin.com/in/srodgersnyc) - October 2018

## Contents
1. Overview
2. Importing and Cleaning the Dataset
3. Price Distribution
4. Impact of Hosts
5. Wrapping Up

## 1. Overview
Airbnb is a popular platform for finding alternative places to stay on vacation. It features homes from all over the world, from luxurious mansions in Southern California to treehouses in Bali. In this project, I will highlight the NYC listings using visualizations.

This project is broken into 2 parts:
1. In the first part, section 2, I will import and clean the raw data using Python.
2. In the second part, I will create meaningful visualizations using Tableau. This will span section 3 and section 4.
	* In section 3, we will see how price is distributed, where homes are distributed, and whether the total space available has an effect on price.
	* In section 4, the visuals will tell us how different types of hosts affect price.

The goal of this project is to:
* Understand and highlight the Airbnb NYC home market.
* Conduct an extensive exploratory data analysis. 
* Create story-telling visualizations to illustrate our findings.

We will ultimately learn:
* What affects a listing's price.
* Where the hot spots of listings are.
* The difference between veteran hosts and newer hosts on the platform.
* How the reviews of listings are distributed.
* And more.

Although Python can handle both the cleaning and the visuals through libraries such as Matplotlib or Seaborn, I am demonstrating how well Python and Tableau work together. The two make a great pair for fast analytics and visualization.

Let’s get started.


## 2. Importing and Cleaning the Dataset
In this section we will set up our data by importing the raw CSV file. We will also import all necessary modules. By the end of this section, the data will be cleaned and ready for analysis and visual building.

### 2.1 Libraries
Import the necessary modules.

In [None]:
import pandas as pd # create DataFrames
import numpy as np # working with arrays
import datetime # convert serial dates to datetime object

### 2.2 Importing and Storing Data
This dataset is pulled from [Inside Airbnb](http://insideairbnb.com/get-the-data.html).
Inside Airbnb compiles a monthly listing of Airbnb homes from around the world. I will be working with the NYC dataset from 2018-09-08.

In [None]:
raw_data = pd.read_csv('../input/airbnb_nyc_homes_raw_20180908.csv') # read raw csv file

Right away the raw dataset raises a non-fatal datatype warning: columns 43, 87, and 88 have mixed types. Let's investigate these columns before moving on.

In [None]:
[raw_data.columns.values[x] for x in [43,87,88]] # retrieve columns that raised a warning

We can see that columns `zipcode`, `license`, and `jurisdiction_names` have mixed data types. Before we fix these columns, let's take a look at the rest of the data for missing values and non-essential columns.
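One way to keep this warning from appearing at all is to tell pandas up front that the three flagged columns are text. The sketch below is not what this project does; it uses a made-up three-row CSV held in memory (`sample_csv` is hypothetical) in place of the real 96-column file.

```python
import io
import pandas as pd

# Hypothetical miniature of the raw file; the real CSV has 96 columns.
sample_csv = io.StringIO(
    "id,zipcode,license,jurisdiction_names\n"
    "1,11208,,\n"
    "2,10003-8623,,\n"
    "3,,ABC123,\n"
)

# Reading the three flagged columns as strings avoids mixed-type inference.
text_columns = {"zipcode": str, "license": str, "jurisdiction_names": str}
sample = pd.read_csv(sample_csv, dtype=text_columns)

print(sample.dtypes)
print(sample["zipcode"].tolist())
```

Reading `zipcode` as text also keeps values such as ZIP+4 codes intact until we decide how to clean them.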

### 2.3 A Glimpse at the Dataset
I will pull the columns’ metadata, a view of the head of the DataFrame, and all columns that hold missing values.

In [None]:
raw_data.info() #metadata

In [None]:
pd.set_option('display.max_columns', 96) # enlarge display of DataFrame to show more columns
raw_data.head(3) # head of DataFrame

In [None]:
raw_data.isna().sum()[raw_data.isna().sum() != 0] # columns that hold missing values

* 96 columns 
* 50,220 rows of data
* 53 columns that hold missing values
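The missing-value check above calls `isna().sum()` twice. A small sketch of the same idea that computes the counts once and adds the share of rows affected is shown here on a toy frame; `toy` and its values are invented for illustration and stand in for `raw_data`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for raw_data (values are made up for illustration).
toy = pd.DataFrame({
    "price": [100.0, 80.0, 120.0, 60.0],
    "weekly_price": [np.nan, 500.0, np.nan, np.nan],
    "zipcode": ["11208", None, "10013", "10036"],
})

missing = toy.isna().sum()        # count missing values once
missing = missing[missing > 0]    # keep only the columns with gaps
summary = pd.DataFrame({
    "missing": missing,
    "share": (missing / len(toy)).round(2),
}).sort_values("missing", ascending=False)

print(summary)
```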

In the next sub-section we will truncate the dataset.

### 2.4 Extracting the Necessary Columns
It is always necessary to remove all meaningless data and handle missing values as a preliminary step. Let's start by copying the necessary columns for this project to a new DataFrame.

We should keep in mind that the goal of the project is to draw insights about Airbnb NYC listings, so every column we extract should serve that goal.

In [None]:
data = raw_data[[
'host_id',
'host_since',
'host_is_superhost',
'neighbourhood_cleansed',
'neighbourhood_group_cleansed',
'zipcode',
'latitude',
'longitude',
'room_type',
'bathrooms',
'bedrooms',
'beds',
'price',
'weekly_price',
'monthly_price',
'security_deposit',
'cleaning_fee',
'number_of_reviews',
'review_scores_rating',
'reviews_per_month',
]].copy().reset_index() # extracted columns to new DataFrame

data = data.rename(columns={'index':'id', 
                            'host_is_superhost':'superhost', 
                            'neighbourhood_cleansed':'neighborhood', 
                            'neighbourhood_group_cleansed': 'city'})
 # update column names

Columns extracted (21/96):
* id
* host_id
* host_since
* superhost *(renamed from host_is_superhost)*
* neighborhood *(renamed from neighbourhood_cleansed)*
* city *(renamed from neighbourhood_group_cleansed)*
* zipcode
* latitude
* longitude
* room_type
* bathrooms
* bedrooms
* beds
* price
* weekly_price
* monthly_price
* security_deposit
* cleaning_fee
* number_of_reviews
* review_scores_rating
* reviews_per_month

Working with only the extracted columns will save us time and unnecessary work.

### 2.5 Cleaning and Converting Values
With the extracted columns in place, let's begin handling missing values and columns with incorrect datatypes.

In [None]:
data.dtypes # retrieve datatypes

The datatypes are correct except for `host_since` and `zipcode`, which should be of type `datetime` and `int`, respectively.
To convert `host_since` to a `datetime`, we must first remove the missing values in the column.

To convert `zipcode` into an `int`, we first convert it from an `object` type to a numeric type. Second, we drop any missing values. Finally, we finish the conversion using the `astype()` function.

In [None]:
data.zipcode = pd.to_numeric(data.zipcode, errors='coerce',
                             downcast='integer') 
# changing zipcode to numeric. invalid parsing will be set as NaN.
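To make the effect of `errors='coerce'` concrete, here is a small sketch on made-up zip code strings. It also shows one possible alternative, which is an assumption and not the approach used in this project: extracting the first five digits before converting, so that ZIP+4 style values are kept instead of being turned into `NaN`.

```python
import pandas as pd

# Made-up zipcode strings showing the kinds of values that can appear.
zips = pd.Series(["11208", "10003-8623", "NY 10001", None])

# Anything that is not a plain number becomes NaN.
coerced = pd.to_numeric(zips, errors="coerce")
print(coerced.tolist())

# Alternative (an assumption, not the approach used in this project):
# pull the first run of five digits before converting.
first_five = zips.str.extract(r"(\d{5})")[0]
recovered = pd.to_numeric(first_five, errors="coerce")
print(recovered.tolist())
```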

Let’s view all missing values.

In [None]:
data.isna().sum()[data.isna().sum() != 0] 
# retrieve all columns that have missing values

The weekly price, monthly price, security deposit, and cleaning fee columns are price options specified by each host. Some hosts may specify prices for those columns and some may not. We will leave these columns as they are. If we were to change the missing prices to 0, our analysis would be incorrect because 0 is a value that would impact the aggregates. We will leave the review columns unchanged too because some listings may not have received reviews.

I will drop the missing values only in columns `host_since`, `superhost` and `zipcode` since these values are currently unobtainable and removing those rows will not affect our analysis.

In [None]:
data = data.dropna(subset=['host_since','superhost','zipcode']) 
# create clean dataset

With all missing values removed from `host_since` and `zipcode`, we can successfully convert the datatypes.
The `host_since` column is stored as a serial date, which is read in as a float datatype. This serial format is normally used in MS Excel. To fix this, we will apply a formula that converts the serial format to a datetime object.

In [None]:
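def serial_to_datetime(sdate):
    """Convert an Excel serial date (in days) to a datetime object."""
    # Excel's 1900 date system lines up with a base of 1899-12-30 once its
    # fictitious 1900-02-29 is accounted for; a 1900-01-01 base would
    # return dates two days late.
    base = datetime.datetime(1899, 12, 30)
    delta = datetime.timedelta(days=sdate)
    return base + delta

data['host_since'] = data['host_since'].apply(serial_to_datetime)
# apply to all rows in host since column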
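As a side note, pandas can do this conversion for a whole column at once. The sketch below assumes the values follow Excel's 1900 date system, where serial numbers line up with an origin of 1899-12-30. The two serial numbers are examples chosen for illustration, not values taken from the dataset.

```python
import pandas as pd

# Example serial numbers; 43101 is the Excel serial for 2018-01-01.
serials = pd.Series([39448.0, 43101.0])

# Excel's 1900 date system lines up with an origin of 1899-12-30
# (this absorbs Excel's fictitious 1900-02-29).
converted = pd.to_datetime(serials, unit="D", origin="1899-12-30")
print(converted.dt.date.tolist())
```

The result is a `datetime64` column, so the `.dt` accessor is available for pulling out the year or month later.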

The `zipcode` column is stored as a `float` but zip-codes do not have decimals.

In [None]:
data.zipcode = data.zipcode.astype(int) # change zip to an integer
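One thing to keep in mind with integer zip codes is that leading zeros are lost, so a value such as 07093 is stored as 7093. If the five-character form is ever needed, for display for example, a minimal sketch of padding the values back out looks like this (the three values are illustrative):

```python
import pandas as pd

# Integer zip codes; two of them have lost a leading zero.
zip_ints = pd.Series([10013, 7093, 5340])

# Convert to text and pad on the left to five characters.
zip_text = zip_ints.astype(str).str.zfill(5)
print(zip_text.tolist())  # ['10013', '07093', '05340']
```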

### 2.6 Updating Values
We have successfully converted all columns to their proper datatypes and dropped the missing values we do not need. There are a few more values that need to be fixed manually.

Change `Manhattan` values  to `New York` in the `city` column.

In [None]:
data['city'] = data['city'].apply(
    lambda x: "New York" if x == "Manhattan" else x) 
# change "Manhattan" to "New York"

Updating `zipcode`  and `neighborhood` values.  (Data entry errors from the source)

In [None]:
# Assign with data.loc[row, column] so each change is written to `data`
# itself rather than to a temporary copy of the column.
data.loc[41368, 'zipcode'] = 11208 # reassign this zipcode because of a data entry error. Cypress Hills zipcode is 11208

data.loc[30750, 'zipcode'] = 11367 # Kew Gardens Hills zipcode should be 11367, not 91766

data.loc[35739, 'zipcode'] = 10036 # Hell's Kitchen zipcode should be 10036, not 7093

data.loc[[48316, 48654], 'zipcode'] = 10303 # Port Ivory zipcode should be 10303, not 7206

data.loc[25165, 'zipcode'] = 10013 # SoHo zipcode should be 10013, not 5340

data.loc[39882, 'zipcode'] = 11207 # East New York zipcode should be 11207, not 11954

data.loc[36027, 'neighborhood'] = 'Howard Beach' # update to correct neighborhood
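For reference, here is a sketch of how rows like these could be flagged and then corrected from a single lookup table. The zip code bounds are a rough assumption on my part and are only meant to surface candidates for manual review. The `toy` frame reuses two of the row labels from above with illustrative values.

```python
import pandas as pd

# Toy rows keyed by the same kind of row labels used above; values are illustrative.
toy = pd.DataFrame(
    {
        "neighborhood": ["Kew Gardens Hills", "SoHo", "Tribeca"],
        "zipcode": [91766, 5340, 10013],
    },
    index=[30750, 25165, 100],
)

# Assumed rough bounds for NYC zip codes; treat hits as candidates to review.
suspicious = toy[~toy["zipcode"].between(10001, 11697)]
print(suspicious)

# Keep the manual corrections in one place and apply them by row label.
zip_fixes = {30750: 11367, 25165: 10013}
for row_label, zipcode in zip_fixes.items():
    toy.loc[row_label, "zipcode"] = zipcode

print(toy)
```

Keeping the fixes in a dictionary makes it easier to review them later or to extend the list when a new monthly file is released.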

### 2.7 Saving our Results
The raw dataset that was originally imported has been reviewed and all the necessary information has been extracted and cleansed. The new object  `data` will be the workable DataFrame used to summarize the data and conduct an analysis. 
<br>In the following section we will: 
* Statistically describe the data 
* Create visuals to summarize the data
* Highlight any trends and patterns

As of now, we can save our cleaned dataset as a CSV file and move it into other BI programs, such as Tableau or Excel, or move it into an RDBMS and manage it over a database connection.


In [None]:
data.to_csv('airbnb_nyc_cleaned_20180908.csv', encoding='utf-8', sep=',', index=False) 
# save cleaned dataset csv
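A CSV file does not store datatypes, so a date column comes back as text unless we ask for it to be parsed. The sketch below shows a quick round trip on a two-row toy frame written to an in-memory buffer (both are stand-ins for the real file), using `parse_dates` when reading it back.

```python
import io
import pandas as pd

# Two-row stand-in for the cleaned dataset.
toy = pd.DataFrame({
    "id": [0, 1],
    "host_since": pd.to_datetime(["2012-05-01", "2017-11-15"]),
    "zipcode": [11208, 10013],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime type that the CSV text cannot carry.
restored = pd.read_csv(buffer, parse_dates=["host_since"])
print(restored.dtypes)
```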

We have successfully imported a raw dataset, cleaned it, and saved a static file that can be moved to Tableau. Now, let’s start describing the dataset.

## 3. Price Distribution
In this section, we will use visualizations to break down the pricing distribution of Airbnb NYC homes. Let’s dive right into it.

### 3.1 How is Price Distributed?
The histogram below shows the distribution of price and the share each borough makes up within each price bin. The tallest bin is \$60 to \$79, which tells us that most hosts set their prices in that range. Most Airbnb homes in this bin are in Brooklyn.

We also see the histogram is skewed to the right: most homes are priced below \$100, with a long tail of homes priced well above that. At prices above \$100, Manhattan Airbnb homes surpass the count of homes from Brooklyn, which makes up the largest share of homes in the most frequent bins (below \$100).

<img src="https://image.ibb.co/jLQHvf/Price-Distribution.png" alt="Price-Distribution" border="0">

**Key:**
* Red - Manhattan
* Orange - Brooklyn
* Teal - Queens
* Blue -  Bronx
* Green - Staten Island

In other words, Airbnb homes can easily be found at a price of \$60 to \$79 in Brooklyn. As prices increase, homes in Manhattan become much more common.

Airbnb homes in the Bronx and Staten Island barely contribute to the count of homes in NYC and are seldom priced above \$160.
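The histogram itself was built in Tableau, but the same binning can be sketched in pandas as a sanity check. The prices below are synthetic and invented purely for illustration. The \$20 bin width is an assumption based on the \$60 to \$79 label.

```python
import numpy as np
import pandas as pd

# Synthetic nightly prices, invented purely for illustration.
prices = pd.Series([45, 60, 65, 70, 75, 79, 85, 95, 110, 150, 220, 400, 950])

# 20-dollar bins: [0, 20), [20, 40), ... so [60, 80) covers $60 to $79.
edges = np.arange(0, 1001, 20)
binned = pd.cut(prices, bins=edges, right=False)
counts = binned.value_counts().sort_index()

tallest_bin = counts.idxmax()
print(tallest_bin, counts.max())

# Positive skewness means the long tail points toward high prices.
print(prices.skew())
```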

### 3.2 Where Exactly are Homes Distributed?
Let us take a deeper look at where the Airbnb homes are distributed.
The geographic heat map below shows exactly which parts of the city hold the most Airbnb homes. As we learned from the histogram, Brooklyn and Manhattan account for most of the homes in NYC. In the map below, this is shown by the saturated blue color.

It is important to point out that homes in Brooklyn are most frequent in the part that connects to Manhattan. The further a Brooklyn neighborhood is from Manhattan, the more its frequency of Airbnb homes resembles that of the Bronx and Staten Island.

<img src="https://image.ibb.co/hnxxvf/Home-Distribution.png" alt="Home-Distribution" border="0" width=700 height=1000>

We can draw a lot about price and home distribution from just the histogram and the geographic heat map. Manhattan is the hot spot for Airbnb homes; it is the center of NYC. However, Manhattan homes are the most expensive, and Airbnb homes in neighboring Brooklyn have lower prices, perhaps to stay competitive.

### 3.3 Neighborhood Price Distribution 
Within these boroughs are neighborhoods. Let’s see how their prices compare. 

<img src="https://image.ibb.co/eg4AFf/Top-Neighborhoods-by-Price.png" alt="Top-Neighborhoods-by-Price" border="0" width=700 height=1000>

The treemap above shows us the top 15 neighborhoods by median price. 

Top 5:
1. Westerleigh, Staten Island – \$801.50
2. Fort Wadsworth, Staten Island – \$750
3. Woodrow, Staten Island – \$462.50
4. Todt Hill, Staten Island – \$429
5. Tribeca, Manhattan – \$270

It is interesting that 4 of the top 5 neighborhoods are in Staten Island. This is most likely because of the low count of Airbnb homes in Staten Island. Let’s quickly see if this is true using Python.

In [None]:
data[data['neighborhood'].isin(
    ['Westerleigh', 'Fort Wadsworth', 'Woodrow', 'Todt Hill', 'Tribeca'])].groupby(
    ['neighborhood']).agg({'price': "median", 'id':'count'}).rename(
    columns={'id': 'count of homes', 'price':'median price'}) 
# pull median price and count of top 5 neighborhoods.

That confirms it. Now let’s see what the top median prices are when there are at least 10 homes in the neighborhood.

In [None]:
grouped = data.groupby('neighborhood').agg({'price': 'median', 'id': 'count'})
# keep neighborhoods with at least 10 homes, then pull the top 5 by median price
grouped = grouped[grouped['id'] >= 10]['price'].sort_values(ascending=False).head(5)
grouped

With that filter applied, the top 5 neighborhoods by median price are:
1. Tribeca, New York – \$270
2. NoHo, New York – \$245
3. Battery Park City, New York – \$225
4. Flatiron District, New York – \$220
5. Midtown, New York – \$210

### 3.4 More Space, Higher Prices?
In the scatter chart below, we plot median price against the total space of an Airbnb home (a Tableau calculated field: the sum of beds, bedrooms, and bathrooms).

<img src="https://image.ibb.co/gY4cvf/Total-Space-vs-Price.png" alt="Total-Space-vs-Price" border="0" width=700 height=1000>

We can see from the linear trend line (line of best fit) that there is a slight increase in price as the total space increases. Before we conclude that price increases as total space increases, let’s view the linear equation generated by Tableau.

Our linear equation (y = mx + b) is:
<br>Median Price = 36.01(Total Space) + 12.1185  
<br>Where,
* Median Price is our dependent variable (the value we predict).
* \$36.01 is our coefficient. For each unit increase in Total Space, the median price increases by \$36.01.
* Total Space is our independent variable.
* \$12.12 is the median price if Total Space were 0.
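To see the equation in action, the short sketch below plugs a few Total Space values into the trend line quoted above. The function name `predicted_median_price` is my own; the slope and intercept are the ones reported by Tableau.

```python
def predicted_median_price(total_space):
    """Evaluate the Tableau trend line quoted above."""
    slope = 36.01        # dollars per unit of Total Space
    intercept = 12.1185  # dollars when Total Space is 0
    return slope * total_space + intercept

for space in (0, 1, 4):
    print(space, round(predicted_median_price(space), 2))
```

For example, a home with a Total Space of 4 has a predicted median price of 36.01 × 4 + 12.1185 = 156.1585, or about \$156.16.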

R² tells us how much of the total variation in price is explained by Total Space. It is a value from 0 to 1. The closer R² is to 1, the more of the change in price we can attribute to Total Space. In our linear model, R² is very low at .02, meaning Total Space explains only about 2% of the variation in price. We can conclude that Total Space on its own is a weak predictor of price.

On the other hand, this model is significant, with a p-value less than .05. The low p-value tells us that the linear relationship between median price and Total Space is unlikely to be due to chance, even though it is weak.
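For readers who want to reproduce this kind of fit outside Tableau, here is a minimal sketch with NumPy. The six listings are made up, and the fit is done per listing, not on the median price per Total Space value as in the chart, so the numbers will not match the Tableau output.

```python
import numpy as np
import pandas as pd

# Made-up listings; the column names follow the cleaned dataset.
toy = pd.DataFrame({
    "beds":      [1, 1, 2, 2, 3, 4],
    "bedrooms":  [1, 1, 1, 2, 2, 3],
    "bathrooms": [1.0, 1.0, 1.0, 1.5, 2.0, 2.0],
    "price":     [80, 150, 95, 200, 120, 310],
})

# Same idea as the Tableau calculated field: beds + bedrooms + bathrooms.
toy["total_space"] = toy[["beds", "bedrooms", "bathrooms"]].sum(axis=1)

# Least-squares line: price = slope * total_space + intercept.
slope, intercept = np.polyfit(toy["total_space"], toy["price"], deg=1)

# R² = 1 - (residual sum of squares / total sum of squares).
fitted = slope * toy["total_space"] + intercept
ss_res = ((toy["price"] - fitted) ** 2).sum()
ss_tot = ((toy["price"] - toy["price"].mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

print(slope, intercept, r_squared)
```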

In as few as 4 visuals we were able to see and understand much about Airbnb pricing in NYC. In the next section we will see how hosts and their home reviews have an impact on an Airbnb listing.

## 4. Impact of Hosts
In this section we will explore through visuals what impact, if any, the type of host and reviews have on a listing.

We will start by seeing if the median price changes based on how long a host has been active on Airbnb.

### 4.1 New Host vs Veteran Host Prices

<img src="https://image.ibb.co/k4Uo1L/Age-of-Host-vs-Price.png" alt="Age-of-Host-vs-Price" border="0" width=700 height=1000>

In the line chart above, we see how the price changes from 2008 to 2018.

The downward-sloping line is our regression line. It shows us that newer hosts have lower prices than hosts that have been active longer.

Here is the line equation generated by Tableau:
<br>Median Price = -0.0105959(Months of Host) + 553.183
<br>R² equals .31, which, as we learned earlier, is on the lower end of explanatory power. However, the p-value indicates that our linear model is statistically significant.
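The chart was built in Tableau, but the inputs can be sketched in pandas. The example below uses five made-up hosts and measures tenure in whole months up to 2018-09-08, the date of the dataset. How Tableau defines its "Months of Host" field is not shown here, so this definition is an assumption.

```python
import pandas as pd

# Five made-up hosts; column names follow the cleaned dataset.
toy = pd.DataFrame({
    "host_since": pd.to_datetime(
        ["2009-03-01", "2012-07-15", "2012-09-01", "2017-01-10", "2018-06-30"]
    ),
    "price": [180, 150, 130, 95, 90],
})

snapshot = pd.Timestamp("2018-09-08")  # date of the dataset used here

# Whole calendar months between joining and the snapshot (assumed definition).
toy["months_hosting"] = (
    (snapshot.year - toy["host_since"].dt.year) * 12
    + (snapshot.month - toy["host_since"].dt.month)
)

# Median price by the year the host joined.
median_by_year = toy.groupby(toy["host_since"].dt.year)["price"].median()
print(toy)
print(median_by_year)
```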

### 4.2 Super Host vs. Regular Host
According to Airbnb, “Superhosts are experienced hosts who provide a shining example for other hosts, and extraordinary experiences for their guests.” - [Superhost Explained Airbnb](https://www.airbnb.com/help/article/828/what-is-a-superhost)

With such a grand title as “Superhost”, does a host raise their price much higher?

In the pie chart below, we see a nearly even split in price between superhosts and non-superhosts, with the price charged being slightly higher for superhosts.

<img src="https://image.ibb.co/f3nxvf/Superhost-Pie-Chart.png" alt="Superhost-Pie-Chart" border="0" width=700 height=1000>

Let’s see the count of homes of both groups.

In [None]:
data.groupby(['superhost'])['superhost'].count() 
# pull the count of homes from superhost. 
# 't' for True being super host and 'f' for False not being superhost.

We see that there are 5x more non-superhost listings on Airbnb than superhost listings.
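The count and the price comparison can also be pulled in a single step. The sketch below uses six made-up listings, so the ratio and medians it prints are illustrative only.

```python
import pandas as pd

# Six made-up listings: 't' marks a superhost, 'f' a regular host.
toy = pd.DataFrame({
    "superhost": ["f", "f", "f", "f", "f", "t"],
    "price": [100, 90, 120, 80, 110, 115],
})

# Count of listings and median price for each group.
summary = toy.groupby("superhost")["price"].agg(["count", "median"])

# How many regular-host listings there are per superhost listing.
ratio = summary.loc["f", "count"] / summary.loc["t", "count"]

print(summary)
print(ratio)
```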

### 4.3 Distribution on Reviews
Lastly, let’s see how review ratings of the listings are distributed. In the histogram below, we map out the count at each rating; the bins have a width of 1. We see that a rating of 100 has the highest count by far, followed by a rating of 98 and then a rating of 97.

I think it's worth pointing out the spike at a rating of 80. While this is not a confirmed theory, I believe the spike at 80 occurs because users default to ratings that end in the number 0. We can also see a similar spike at 90, and this may be part of why the count at 100 is so high.

<img src="https://image.ibb.co/gVyRo0/Distribution-of-Reviews.png" alt="Distribution-of-Reviews" border="0" width=700 height=1000>

This distribution of reviews is very informative. At a glance we can see that hosts and their home listings are well received by the Airbnb community.
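The round-number pattern mentioned above is easy to probe in pandas. This sketch counts each rating and measures what share of ratings are multiples of 10. The ratings are invented for illustration, so the printed share says nothing about the real data.

```python
import pandas as pd

# Invented ratings; None stands for a listing with no rating yet.
ratings = pd.Series([100, 100, 98, 97, 90, 80, 80, 93, None, 100])
ratings = ratings.dropna().astype(int)

# Count per rating value, equivalent to a histogram with bins of width 1.
counts = ratings.value_counts().sort_index()

# Share of ratings that end in 0.
round_share = (ratings % 10 == 0).mean()

print(counts)
print(round_share)
```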

We also saw that there is very little difference in price between superhosts and regular hosts, although there are 5x more Airbnb listings by regular hosts. And newer hosts tend to have lower prices than veteran hosts.

## 5. Wrapping Up
Thank you for following along. I hope you found this dataset both interesting and informative about Airbnb listings in NYC. Feel free to leave a comment.

“Always continue to learn!” - 
[Shawn Rodgers](http://www.linkedin.com/in/srodgersnyc)