# Airbnb in New York City - Impact of Neighborhoods

## Table of Contents
* [Introduction](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)



## 1. Introduction <a name="introduction"></a>


### 1.1 Background

#### About Airbnb

> **"Millions of Airbnb Hosts connect curious people to an endlessly interesting world.**
> Guests can discover the perfect place to stay for every getaway and explore new experiences while traveling, or online. Hosts can list their extra space, receive hosting tips and support, and earn money while creating memorable moments for guests.**"**

*- This is how Airbnb describes itself on the Google Play Store*

**Airbnb** is a platform that connects hosts and guests: hosts list their properties to offer lodging and homestays, and guests book them. Founded in 2008 in San Francisco, California, Airbnb has since grown into a service with a global presence.


### 1.2 Problem and Interest

Airbnb's business model is to facilitate the rental of accommodations, lodgings and homestays by providing an online marketplace. The company does not own any of the listed properties; it charges a commission on each booking.

One of the most important aspects, therefore, is the locality of a property: we want to see whether, and how, it affects the property's price or popularity.

Understanding how guests and hosts behave and perform on the platform can support business decisions, for example by guiding marketing initiatives or the introduction of new additional services.



## 2. Data <a name="data"></a>


We now turn to the data required for this analysis.

* We will use the "New York City Airbnb Open Data" dataset available on Kaggle: https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data
* The dataset has around 49,000 entries and 16 columns. Not all columns are needed, so we will clean and wrangle the data to keep only what the analysis requires.
* The columns of the original dataset and their descriptions are as follows:

| Columns                          | Description                                     |
|:---------------------------------|:------------------------------------------------| 
| `id`                             | id of the listing                               | 
| `name`                           | title of the listing                            |  
| `host_id`                        | id of the host who has listed                   | 
| `host_name`                      | name of the host who has listed                 |
| `neighbourhood_group`            | name of the borough                             |
| `neighbourhood`                  | name of the neighborhood                        |  
| `latitude`                       | location latitude of the listing                | 
| `longitude`                      | location longitude of the listing               |
| `room_type`                      | type of room / accommodation                    |
| `price`                          | price of the listing                            |
| `minimum_nights`                 | minimum number of nights to be booked for       |  
| `number_of_reviews`              | total number of reviews for the listing         | 
| `last_review`                    | date of the last review                         |
| `reviews_per_month`              | average reviews per month                       |
| `calculated_host_listings_count` | total no of listings by the host                |  
| `availability_365`               | days per year the listing is available          |

* The dataset already contains the latitude and longitude of each property, which can be used to find nearby venues through the **Foursquare API**.
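As a rough illustration of how the coordinates could feed a venue lookup, the sketch below only assembles the request parameters and does not send anything. It assumes the legacy Foursquare v2 `venues/explore` endpoint; the helper name `build_explore_request`, the placeholder credentials, the version date, and the radius and limit values are hypothetical choices for illustration, not taken from this analysis.

```python
# Sketch only: builds (but does not send) a venue-search request for one listing.
# Assumption: the legacy Foursquare v2 "venues/explore" endpoint; newer API versions differ.

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_request(latitude, longitude, client_id, client_secret,
                          version="20180605", radius=500, limit=50):
    """Return the endpoint URL and query parameters for one latitude/longitude pair."""
    params = {
        "client_id": client_id,            # placeholder credential
        "client_secret": client_secret,    # placeholder credential
        "v": version,                      # API version date (assumed value)
        "ll": f"{latitude},{longitude}",   # "lat,lng" string built from the dataset columns
        "radius": radius,                  # search radius in metres (assumed value)
        "limit": limit,                    # maximum number of venues (assumed value)
    }
    return EXPLORE_URL, params


# Coordinates of the first listing shown in the dataset preview
url, params = build_explore_request(40.64749, -73.97237, "YOUR_ID", "YOUR_SECRET")
print(url)
print(params["ll"])
```

The returned pair could then be handed to an HTTP client, for example `requests.get(url, params=params)`, once real credentials are available.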


Let us load the dataset and do some basic data wrangling and cleaning.

In [1]:
#importing the required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

print('Libraries successfully imported.')

Libraries successfully imported.


In [2]:
#loading the dataset into a pandas dataframe

airbnb_df = pd.read_csv("https://raw.githubusercontent.com/sarkar-kumardipta/Coursera_Capstone/main/AB_NYC_2019.csv")
airbnb_df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


In [3]:
# let us see the size of the dataset

airbnb_df.shape

(48895, 16)


So there are 48,895 rows and 16 columns. Some of the columns contain numerical data while the others contain categorical data.

In [4]:
# let us see the datatypes of the dataframe

airbnb_df.dtypes

id                                  int64
name                               object
host_id                             int64
host_name                          object
neighbourhood_group                object
neighbourhood                      object
latitude                          float64
longitude                         float64
room_type                          object
price                               int64
minimum_nights                      int64
number_of_reviews                   int64
last_review                        object
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
dtype: object
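To make the split between numerical and categorical columns explicit, one option is `select_dtypes`. The sketch below runs on a made-up two-row `sample_df` that stands in for `airbnb_df` with only a few of its columns.

```python
import pandas as pd

# Hypothetical stand-in for airbnb_df with a handful of its columns
sample_df = pd.DataFrame({
    "name": ["Skylit Midtown Castle", "Cozy Entire Floor of Brownstone"],
    "room_type": ["Entire home/apt", "Entire home/apt"],
    "latitude": [40.75362, 40.68514],
    "price": [225, 89],
})

# Numeric columns (int64, float64) versus everything else (here: text columns)
numeric_cols = sample_df.select_dtypes(include="number").columns.tolist()
categorical_cols = sample_df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```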


The host-related data is not required for our analysis, so we can drop the columns `host_id`, `host_name` and `calculated_host_listings_count`.

We can also remove the `last_review` column.

In [5]:
# let us drop the host_id, host_name, calculated_host_listings_count and last_review columns

airbnb_df.drop(['host_id','host_name','calculated_host_listings_count','last_review'], axis=1, inplace=True)

airbnb_df.head()

Unnamed: 0,id,name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,reviews_per_month,availability_365
0,2539,Clean & quiet apt home by the park,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,0.21,365
1,2595,Skylit Midtown Castle,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,0.38,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,365
3,3831,Cozy Entire Floor of Brownstone,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,4.64,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,0.1,0
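The column drop above could also be written without `inplace=True`, by passing the names through the `columns=` keyword and assigning the result to a new name. This is only a sketch on a made-up two-row `toy_df`; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical miniature of the original dataframe
toy_df = pd.DataFrame({
    "id": [2539, 2595],
    "host_id": [2787, 2845],
    "host_name": ["John", "Jennifer"],
    "last_review": ["2018-10-19", "2019-05-21"],
    "calculated_host_listings_count": [6, 2],
    "price": [149, 225],
})

cols_to_drop = ["host_id", "host_name", "calculated_host_listings_count", "last_review"]

# drop() returns a new dataframe; toy_df itself is left unchanged
trimmed_df = toy_df.drop(columns=cols_to_drop)
print(trimmed_df.columns.tolist())
```

Keeping the original frame untouched makes it easier to re-run a cell without hitting a `KeyError` for columns that were already removed.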


Now let us understand the remaining data using the `describe()` method.

In [6]:
# let us now use the describe() method to get a better understanding of the data

airbnb_df.describe(include = 'all')

Unnamed: 0,id,name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,reviews_per_month,availability_365
count,48895.0,48879,48895,48895,48895.0,48895.0,48895,48895.0,48895.0,48895.0,38843.0,48895.0
unique,,47905,5,221,,,3,,,,,
top,,Hillside Hotel,Manhattan,Williamsburg,,,Entire home/apt,,,,,
freq,,18,21661,3920,,,25409,,,,,
mean,19017140.0,,,,40.728949,-73.95217,,152.720687,7.029962,23.274466,1.373221,112.781327
std,10983110.0,,,,0.05453,0.046157,,240.15417,20.51055,44.550582,1.680442,131.622289
min,2539.0,,,,40.49979,-74.24442,,0.0,1.0,0.0,0.01,0.0
25%,9471945.0,,,,40.6901,-73.98307,,69.0,1.0,1.0,0.19,0.0
50%,19677280.0,,,,40.72307,-73.95568,,106.0,3.0,5.0,0.72,45.0
75%,29152180.0,,,,40.763115,-73.936275,,175.0,5.0,24.0,2.02,227.0



From the `count` row we can see that some of the values are less than 48,895, which means that some entries must be null. Let us find out which columns are affected.

In [7]:
# let us find the total number of null values per column

airbnb_df.isnull().sum()

id                         0
name                      16
neighbourhood_group        0
neighbourhood              0
latitude                   0
longitude                  0
room_type                  0
price                      0
minimum_nights             0
number_of_reviews          0
reviews_per_month      10052
availability_365           0
dtype: int64

So we can see that there are 16 null entries in the `name` column and 10,052 null entries in the `reviews_per_month` column.

How should we deal with these?

We can drop the rows where the `name` is null.

For `reviews_per_month`, we can replace the empty values with 0: logically, an empty `reviews_per_month` means that no reviews have been given, so 0 should suffice.
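The reasoning that a missing `reviews_per_month` corresponds to a listing with no reviews can be checked before filling. The sketch below does this on a made-up four-row `check_df`; whether the same result holds for the full dataset is an assumption that would need to be verified on `airbnb_df` itself.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: two listings with reviews, two without
check_df = pd.DataFrame({
    "number_of_reviews": [9, 0, 270, 0],
    "reviews_per_month": [0.21, np.nan, 4.64, np.nan],
})

missing_rate = check_df["reviews_per_month"].isnull()
no_reviews = check_df["number_of_reviews"] == 0

# True only if the two masks agree on every row
assumption_holds = bool((missing_rate == no_reviews).all())
print(assumption_holds)

# If the check passes, filling with 0 is consistent with the review counts
filled = check_df["reviews_per_month"].fillna(0)
print(filled.tolist())
```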

In [8]:
# let us drop the entries with empty name value

airbnb_df.dropna(subset = ["name"], inplace=True)

# Let us replace the empty 'reviews_per_month' with 0

airbnb_df.fillna({'reviews_per_month':0}, inplace=True)

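# reset_index() returns a new dataframe, so assign it back; drop=True discards the old index
airbnb_df = airbnb_df.reset_index(drop=True)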

airbnb_df.isnull().sum()

id                     0
name                   0
neighbourhood_group    0
neighbourhood          0
latitude               0
longitude              0
room_type              0
price                  0
minimum_nights         0
number_of_reviews      0
reviews_per_month      0
availability_365       0
dtype: int64


We can also observe from the previous outputs that some of the values in `availability_365` are 0.

Properties that are never available during the year would create noise for our model, so it is better to get rid of them.

We will therefore remove the entries where `availability_365` is 0.

In [9]:
# Let us see how many 'availability_365' values are 0.

len(airbnb_df.loc[airbnb_df["availability_365"] == 0])

17521

In [10]:
# Let us drop these rows (.copy() makes the result an independent dataframe for later in-place edits)

airbnb_df = airbnb_df[airbnb_df.availability_365 > 0].copy()

print("The dataframe has {} rows and {} columns.".format(airbnb_df.shape[0],airbnb_df.shape[1]))

airbnb_df.head()

The dataframe has 31358 rows and 12 columns.


Unnamed: 0,id,name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,reviews_per_month,availability_365
0,2539,Clean & quiet apt home by the park,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,0.21,365
1,2595,Skylit Midtown Castle,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,0.38,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,0.0,365
3,3831,Cozy Entire Floor of Brownstone,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,4.64,194
5,5099,Large Cozy 1 BR Apartment In Midtown East,Manhattan,Murray Hill,40.74767,-73.975,Entire home/apt,200,3,74,0.59,129
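The row counts are consistent with the earlier steps: 48,879 rows remained after dropping empty names, and 48,879 - 17,521 = 31,358. The gap in the index (0, 1, 2, 3, 5) appears because filtering keeps the original row labels. The sketch below reproduces both effects on a made-up six-row `avail_df`; the single-letter names are placeholders.

```python
import pandas as pd

# Hypothetical miniature: the fifth row (label 4) is never available
avail_df = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E", "F"],
    "availability_365": [365, 355, 365, 194, 0, 129],
})

never_available = avail_df["availability_365"] == 0

# Keep only the rows that are available on at least one day
kept_df = avail_df[~never_available].copy()

# Rows kept plus rows removed should equal the original row count
assert len(kept_df) + never_available.sum() == len(avail_df)

print(len(kept_df))
print(kept_df.index.tolist())
```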



Now for the final step of our data preparation.

Here `price` is our dependent variable and the remaining parameters are independent variables. We will therefore move the `price` column to the last position for easier visualization and understanding.

We will also drop the `id` column, as it is not required for the analysis.

In [11]:
# let us drop the id column

airbnb_df.drop(['id'], axis=1, inplace=True)

# Moving price column to the last

airbnb_df = airbnb_df[['name','neighbourhood_group','neighbourhood','latitude','longitude','room_type','minimum_nights','number_of_reviews','reviews_per_month',
                     'availability_365','price']]

# The prepped dataset
airbnb_df.head()

Unnamed: 0,name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,minimum_nights,number_of_reviews,reviews_per_month,availability_365,price
0,Clean & quiet apt home by the park,Brooklyn,Kensington,40.64749,-73.97237,Private room,1,9,0.21,365,149
1,Skylit Midtown Castle,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,1,45,0.38,355,225
2,THE VILLAGE OF HARLEM....NEW YORK !,Manhattan,Harlem,40.80902,-73.9419,Private room,3,0,0.0,365,150
3,Cozy Entire Floor of Brownstone,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,1,270,4.64,194,89
5,Large Cozy 1 BR Apartment In Midtown East,Manhattan,Murray Hill,40.74767,-73.975,Entire home/apt,3,74,0.59,129,200
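Listing every column by hand works, but the same reordering can be derived from the existing column list, which avoids typos if the set of columns changes. This is a minimal sketch on a made-up one-row `order_df`; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical frame with the target column in the middle
order_df = pd.DataFrame({
    "name": ["Skylit Midtown Castle"],
    "price": [225],
    "room_type": ["Entire home/apt"],
    "availability_365": [355],
})

target = "price"

# All other columns in their current order, followed by the target column
new_order = [col for col in order_df.columns if col != target] + [target]
order_df = order_df[new_order]

print(order_df.columns.tolist())
```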
