# Airbnb Open Data

This data was obtained through Kaggle public platform, wich is available at:
https://www.kaggle.com/datasets/arianazmoudeh/airbnbopendata

According to the owner, the data refers to Airbnb rentals in New York.
We will clean and analyse this dataset to obtain further insights.

The original content is available at:
http://insideairbnb.com/explore/

In [1]:
# Importing essential libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Uploading and checking file structure

airbnb = pd.read_csv("Airbnb_Open_Data.csv")
airbnb.info()

  airbnb = pd.read_csv("Airbnb_Open_Data.csv")


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102599 entries, 0 to 102598
Data columns (total 26 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   id                              102599 non-null  int64  
 1   NAME                            102349 non-null  object 
 2   host id                         102599 non-null  int64  
 3   host_identity_verified          102310 non-null  object 
 4   host name                       102193 non-null  object 
 5   neighbourhood group             102570 non-null  object 
 6   neighbourhood                   102583 non-null  object 
 7   lat                             102591 non-null  float64
 8   long                            102591 non-null  float64
 9   country                         102067 non-null  object 
 10  country code                    102468 non-null  object 
 11  instant_bookable                102494 non-null  object 
 12  cancellation_pol

In [3]:
airbnb.sample(5)

Unnamed: 0,id,NAME,host id,host_identity_verified,host name,neighbourhood group,neighbourhood,lat,long,country,...,service fee,minimum nights,number of reviews,last review,reviews per month,review rate number,calculated host listings count,availability 365,house_rules,license
30808,18016594,HUGE 1 bedroom in Manhattan,71307950635,unconfirmed,J,Manhattan,Stuyvesant Town,40.73101,-73.97347,United States,...,$211,1.0,17.0,6/16/2019,1.12,4.0,2.0,0.0,,
64564,36660033,Tranquil Bedroom in Williamsburg Loft,26601753156,verified,Stephanie,Brooklyn,Williamsburg,40.71068,-73.94852,United States,...,$19,30.0,16.0,8/10/2021,0.58,2.0,2.0,63.0,,
32555,18981463,Comfy Bronx retreat,12747996823,verified,Christopher R,Bronx,Morris Park,40.85548,-73.85585,United States,...,$105,1.0,76.0,6/25/2019,6.02,2.0,1.0,72.0,Please no smoking inside. During the warmer m...,
92463,52068651,"UWS, Luxury Modern 1 BR 1 Bath, Doorman.",28326795888,verified,Lauren,Manhattan,Upper West Side,40.7817,-73.982,United States,...,$89,3.0,0.0,,,4.0,1.0,365.0,No parties. No overnight guests. Please use yo...,
5593,4090350,Stuvesant East,96450868103,verified,Anthony/Joanne,Brooklyn,Bedford-Stuyvesant,40.68439,-73.92811,United States,...,$222,3.0,132.0,6/21/2019,2.53,1.0,2.0,202.0,No pets. No smoking. Please :),


In [4]:
airbnb.isnull().sum().sort_values(ascending=False)

license                           102597
house_rules                        52131
last review                        15893
reviews per month                  15879
country                              532
availability 365                     448
minimum nights                       409
host name                            406
review rate number                   326
calculated host listings count       319
host_identity_verified               289
service fee                          273
NAME                                 250
price                                247
Construction year                    214
number of reviews                    183
country code                         131
instant_bookable                     105
cancellation_policy                   76
neighbourhood group                   29
neighbourhood                         16
long                                   8
lat                                    8
id                                     0
host id         

### Data Cleaning

In [5]:
# Converting column names to snake_case:

airbnb.columns=[col.lower().replace(" ","_") for col in airbnb.columns]
airbnb.columns

Index(['id', 'name', 'host_id', 'host_identity_verified', 'host_name',
       'neighbourhood_group', 'neighbourhood', 'lat', 'long', 'country',
       'country_code', 'instant_bookable', 'cancellation_policy', 'room_type',
       'construction_year', 'price', 'service_fee', 'minimum_nights',
       'number_of_reviews', 'last_review', 'reviews_per_month',
       'review_rate_number', 'calculated_host_listings_count',
       'availability_365', 'house_rules', 'license'],
      dtype='object')

In [6]:
# The "id" and "host_id" columns do not provide any useful insights to the analysis, therefore they will be discarded

airbnb.drop(["id","host_id"], axis = 1, inplace=True)


# We are analysing data from NY city, therefore the "country" and "country_code" columns will also be discarded
airbnb.drop(["country","country_code"], axis = 1, inplace=True)
airbnb.sample(5)

Unnamed: 0,name,host_identity_verified,host_name,neighbourhood_group,neighbourhood,lat,long,instant_bookable,cancellation_policy,room_type,...,service_fee,minimum_nights,number_of_reviews,last_review,reviews_per_month,review_rate_number,calculated_host_listings_count,availability_365,house_rules,license
46644,Garden Room_1,verified,Dameon,Brooklyn,East New York,40.67201,-73.86944,True,moderate,Private room,...,$48,3.0,0.0,,,3.0,2.0,363.0,No Street Shoes allowed in House. No cooking K...,
46350,"Modern 3 BR, 2 BATH Triplex (Washer/Dryer)",unconfirmed,Caroline,Brooklyn,Bedford-Stuyvesant,40.68882,-73.94784,True,moderate,Entire home/apt,...,$221,3.0,1.0,6/30/2019,1.0,2.0,1.0,282.0,Check in 3pm Check out 12pm (Flexible upon req...,
100896,Nice one bedroom apartment,unconfirmed,Manuel,Manhattan,Morningside Heights,40.81387,-73.96205,False,strict,Entire home/apt,...,$100,10.0,0.0,,,5.0,1.0,0.0,No smoking and Please respect each others sp...,
72547,Spacious Sunny Rm in Stylish Duplex,unconfirmed,Keith,Brooklyn,Bedford-Stuyvesant,40.68167,-73.92525,True,strict,Private room,...,$29,3.0,13.0,8/14/2017,0.28,4.0,2.0,333.0,,
27728,AMAZINGLY LOCATED DECORATED ONE BEDROOM CONDO,unconfirmed,Edward,Manhattan,Hell's Kitchen,40.76521,-73.98507,True,flexible,Entire home/apt,...,$175,4.0,0.0,,,5.0,1.0,0.0,,


In [7]:
# Lets check the license column, since there is only two entrys there

airbnb[~airbnb['license'].isnull()]

Unnamed: 0,name,host_identity_verified,host_name,neighbourhood_group,neighbourhood,lat,long,instant_bookable,cancellation_policy,room_type,...,service_fee,minimum_nights,number_of_reviews,last_review,reviews_per_month,review_rate_number,calculated_host_listings_count,availability_365,house_rules,license
11114,"Cozy 1 BR on Bedford Avenue, Wburg",verified,Christina,Brooklyn,Williamsburg,40.71764,-73.95689,True,strict,Private room,...,$140,1.0,1.0,1/3/2016,0.02,1.0,1.0,191.0,"Dear Guest, Thank you for appreciating that I ...",41662/AL
72947,"Cozy 1 BR on Bedford Avenue, Wburg",unconfirmed,Christina,Brooklyn,Williamsburg,40.71764,-73.95689,True,flexible,Private room,...,$140,1.0,1.0,1/3/2016,0.02,1.0,1.0,0.0,,41662/AL


In [8]:
# There seems to have duplicated entrys in the dataset,
# Although, there's a possibility that Christina has 2 rooms available for rent...
# But nevermind that, the licence column will be dropped

airbnb.drop(["license"], axis = 1, inplace=True)

In [21]:
# Filling the null values in the "house_rules" with "not specifyed"

airbnb["house_rules"].fillna("not specifyed", inplace=True)

# Filling the null values in the "minimum nights" column as "0"

airbnb["minimum_nights"].fillna(0, inplace=True)

In [14]:
# Pandas recognized the "price" and "service fee" as objects, whereas they should be considered floats
# We have to remove str simbols ("$" and ",") and convert the file type

airbnb[["price", "service_fee"]] = airbnb[["price", "service_fee"]].replace('[\$,]', '', regex=True).astype(float)

In [24]:
# Converting the "last_review" column to the datetime format

airbnb['last_review'] = pd.to_datetime(airbnb['last_review'])

# Let's check the min and max timestamps

airbnb['last_review'].min(), airbnb['last_review'].max()

(Timestamp('2012-07-11 00:00:00'), Timestamp('2058-06-16 00:00:00'))

In [25]:
# Yeah, we will fill the null values as the minimum date,
# Meanwhile, the time travelers reviews (the ones from the future) will all be set to january, 2022 (how??? I don't know yet).

airbnb["minimum_nights"].fillna("012-07-11 00:00:00", inplace=True)

In [67]:
# Now, lets create a subset to check for duplicates,
# Remember, the same host can have multiple apartments in the same region
# Therefore we will be considering the following subset:

airbnb[airbnb.duplicated(subset=["name", "lat", "long", "room_type", "price", "host_name"])]

Unnamed: 0,name,host_identity_verified,host_name,neighbourhood_group,neighbourhood,lat,long,instant_bookable,cancellation_policy,room_type,...,price,service_fee,minimum_nights,number_of_reviews,last_review,reviews_per_month,review_rate_number,calculated_host_listings_count,availability_365,house_rules
68501,Cheap Small Bedroom w/Desk 10min to JFK & Mall,unconfirmed,Guelma,Queens,Rosedale,40.65251,-73.73650,False,strict,Private room,...,308.0,62.0,7.0,21.0,2018-06-30,0.94,2.0,5.0,297.0,not specifyed
68821,Queens Home short walk to Subway- 2 bedroom 2 ...,verified,Sara,Queens,Richmond Hill,40.70248,-73.81937,False,strict,Entire home/apt,...,136.0,27.0,2.0,31.0,2019-06-25,1.65,4.0,2.0,62.0,not specifyed
69194,Midtown East Sutton Area Entire Apt,verified,Cary,Manhattan,Midtown,40.75816,-73.96457,False,moderate,Entire home/apt,...,435.0,87.0,2.0,16.0,2019-05-09,0.24,1.0,1.0,167.0,not specifyed
69195,Finest Gateway to historic Financial District,verified,Dan,Manhattan,Financial District,40.70537,-74.00992,False,flexible,Entire home/apt,...,70.0,14.0,1.0,36.0,2018-09-26,0.57,5.0,1.0,365.0,not specifyed
69196,Industrial Brooklyn Loft with Tree-Lined Windows,verified,Shell,Brooklyn,Clinton Hill,40.68722,-73.96289,True,strict,Entire home/apt,...,1195.0,239.0,1.0,54.0,2019-03-24,0.65,5.0,4.0,365.0,not specifyed
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102594,Spare room in Williamsburg,verified,Krik,Brooklyn,Williamsburg,40.70862,-73.94651,False,flexible,Private room,...,844.0,169.0,1.0,0.0,NaT,,3.0,1.0,227.0,No Smoking No Parties or Events of any kind Pl...
102595,Best Location near Columbia U,unconfirmed,Mifan,Manhattan,Morningside Heights,40.80460,-73.96545,True,moderate,Private room,...,837.0,167.0,1.0,1.0,2015-07-06,0.02,2.0,2.0,395.0,House rules: Guests agree to the following ter...
102596,"Comfy, bright room in Brooklyn",unconfirmed,Megan,Brooklyn,Park Slope,40.67505,-73.98045,True,moderate,Private room,...,988.0,198.0,3.0,0.0,NaT,,5.0,1.0,342.0,not specifyed
102597,Big Studio-One Stop from Midtown,unconfirmed,Christopher,Queens,Long Island City,40.74989,-73.93777,True,strict,Entire home/apt,...,546.0,109.0,2.0,5.0,2015-10-11,0.10,3.0,1.0,386.0,not specifyed


In [68]:
# Yeah, apparently 1/3 of the table is duplicated (wow)
# Let's drop these values and continue our analysis

airbnb.drop_duplicates(subset=["name", "lat", "long", "room_type", "price", "host_name"], inplace=True)

In [72]:
airbnb.sample(5)

Unnamed: 0,name,host_identity_verified,host_name,neighbourhood_group,neighbourhood,lat,long,instant_bookable,cancellation_policy,room_type,...,price,service_fee,minimum_nights,number_of_reviews,last_review,reviews_per_month,review_rate_number,calculated_host_listings_count,availability_365,house_rules
22531,Cozy Quite Room in NYC Manhattan Upwest,unconfirmed,Wendy,Manhattan,Harlem,40.83053,-73.95048,False,flexible,Private room,...,767.0,153.0,3.0,2.0,2017-05-31,0.08,4.0,1.0,0.0,We're happy to be flexible with check-in and c...
59373,"PRIVATE ENTRANCE, PRIVATE SPACE! BEST LOCATION",verified,Doug,Brooklyn,Williamsburg,40.71678,-73.96321,True,strict,Private room,...,1088.0,218.0,5.0,50.0,2022-01-01,1.22,5.0,3.0,137.0,not specifyed
17112,Private Room near Brooklyn's best park,verified,David,Brooklyn,Prospect Heights,40.67937,-73.96838,False,moderate,Private room,...,899.0,180.0,1.0,0.0,NaT,,5.0,1.0,0.0,not specifyed
19975,Spacious room w/ own bathroom in the East Village,verified,Fayth,Manhattan,East Village,40.7308,-73.98619,True,flexible,Private room,...,482.0,96.0,30.0,4.0,2018-08-31,0.13,3.0,2.0,0.0,- Rooftop deck - 1 Block from South Station
11767,a,unconfirmed,Rob,Manhattan,Kips Bay,40.73975,-73.98073,False,flexible,Entire home/apt,...,849.0,170.0,1.0,1.0,2015-11-09,0.02,3.0,1.0,123.0,#NAME?
