# **New York Airbnb EDA**

## Step-1: Importing the libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

%matplotlib inline

## Step-2: Loading the Data

**About this file**

Variables/Features: id, name, host_id, host_name, neighbourhood_group, neighbourhood, latitude, longitude, room_type, price, minimum_nights, number_of_reviews, last_review, reviews_per_month, calculated_host_listings_count, availability_365, number_of_reviews_ltm, license, rating, bedroom, beds, baths

In [3]:
df = pd.read_csv('datasets.csv')

display(df.head())

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,...,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license,rating,bedrooms,beds,baths
0,1312228.0,Rental unit in Brooklyn · ★5.0 · 1 bedroom,7130382,Walter,Brooklyn,Clinton Hill,40.68371,-73.96461,Private room,55.0,...,20/12/15,0.03,1.0,0.0,0.0,No License,5.0,1,1,Not specified
1,45277540.0,Rental unit in New York · ★4.67 · 2 bedrooms ·...,51501835,Jeniffer,Manhattan,Hell's Kitchen,40.76661,-73.9881,Entire home/apt,144.0,...,01/05/23,0.24,139.0,364.0,2.0,No License,4.67,2,1,1
2,9.71e+17,Rental unit in New York · ★4.17 · 1 bedroom · ...,528871354,Joshua,Manhattan,Chelsea,40.750764,-73.994605,Entire home/apt,187.0,...,18/12/23,1.67,1.0,343.0,6.0,Exempt,4.17,1,2,1
3,3857863.0,Rental unit in New York · ★4.64 · 1 bedroom · ...,19902271,John And Catherine,Manhattan,Washington Heights,40.8356,-73.9425,Private room,120.0,...,17/09/23,1.38,2.0,363.0,12.0,No License,4.64,1,1,1
4,40896610.0,Condo in New York · ★4.91 · Studio · 1 bed · 1...,61391963,Stay With Vibe,Manhattan,Murray Hill,40.75112,-73.9786,Entire home/apt,85.0,...,03/12/23,0.24,133.0,335.0,3.0,No License,4.91,Studio,1,1


## Step-3: Understanding the Data Structure

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20770 entries, 0 to 20769
Data columns (total 22 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              20770 non-null  float64
 1   name                            20770 non-null  object 
 2   host_id                         20770 non-null  int64  
 3   host_name                       20770 non-null  object 
 4   neighbourhood_group             20770 non-null  object 
 5   neighbourhood                   20763 non-null  object 
 6   latitude                        20763 non-null  float64
 7   longitude                       20763 non-null  float64
 8   room_type                       20763 non-null  object 
 9   price                           20736 non-null  float64
 10  minimum_nights                  20763 non-null  float64
 11  number_of_reviews               20763 non-null  float64
 12  last_review                     

In [6]:
# Checking the no of missing values
df.isna().sum()

id                                 0
name                               0
host_id                            0
host_name                          0
neighbourhood_group                0
neighbourhood                      7
latitude                           7
longitude                          7
room_type                          7
price                             34
minimum_nights                     7
number_of_reviews                  7
last_review                        7
reviews_per_month                  7
calculated_host_listings_count     7
availability_365                   7
number_of_reviews_ltm              7
license                            0
rating                             0
bedrooms                           0
beds                               0
baths                              0
dtype: int64

In [7]:
# Statistical summary
df.describe()

Unnamed: 0,id,host_id,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,beds
count,20770.0,20770.0,20763.0,20763.0,20736.0,20763.0,20763.0,20763.0,20763.0,20763.0,20763.0,20770.0
mean,3.033858e+17,174904900.0,40.726821,-73.939179,187.71494,28.558493,42.610605,1.257589,18.866686,206.067957,10.848962,1.723592
std,3.901221e+17,172565700.0,0.060293,0.061403,1023.245124,33.532697,73.523401,1.904472,70.921443,135.077259,21.354876,1.211993
min,2595.0,1678.0,40.500314,-74.24984,10.0,1.0,1.0,0.01,1.0,0.0,0.0,1.0
25%,27072600.0,20411840.0,40.684159,-73.980755,80.0,30.0,4.0,0.21,1.0,87.0,1.0,1.0
50%,49928520.0,108699000.0,40.72289,-73.949597,125.0,30.0,14.0,0.65,2.0,215.0,3.0,1.0
75%,7.22e+17,314399700.0,40.763106,-73.917475,199.0,30.0,49.0,1.8,5.0,353.0,15.0,2.0
max,1.05e+18,550403500.0,40.911147,-73.71365,100000.0,1250.0,1865.0,75.49,713.0,365.0,1075.0,42.0


In [8]:
print(f'The dataset has {df.shape[0]} rows and {df.shape[1]} columns')

The dataset has 20770 rows and 22 columns


## Step 4: Handling Missing Values

In [11]:
# Checking the no of missing values
missing = df.isna().sum()
display(missing[missing > 0])

neighbourhood                      7
latitude                           7
longitude                          7
room_type                          7
price                             34
minimum_nights                     7
number_of_reviews                  7
last_review                        7
reviews_per_month                  7
calculated_host_listings_count     7
availability_365                   7
number_of_reviews_ltm              7
dtype: int64

### Imputing the mean for numerical values and mode for catagorical values

In [43]:
missing_cols = pd.DataFrame(missing[missing > 0]).T

In [49]:
missing_cols.columns

Index(['neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365', 'number_of_reviews_ltm'],
      dtype='object')

In [None]:
for col in df.columns:
    if col in missing_cols:
        if df[col].dtype == ['int64', 'float64']:
            df[col].fillna(df[col].mean(), inplace=True)
        else:
            df[col].fillna(df[col].mode()[0], inplace=True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df[col].fillna(df[col].mode()[0], inplace=True)


In [71]:
print(df.isna().sum())

id                                0
name                              0
host_id                           0
host_name                         0
neighbourhood_group               0
neighbourhood                     0
latitude                          0
longitude                         0
room_type                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
last_review                       0
reviews_per_month                 0
calculated_host_listings_count    0
availability_365                  0
number_of_reviews_ltm             0
license                           0
rating                            0
bedrooms                          0
beds                              0
baths                             0
dtype: int64


In [74]:
# Checking the duplicated entries 
df[df.duplicated()]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,...,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license,rating,bedrooms,beds,baths
6,45277540.0,Rental unit in New York · ★4.67 · 2 bedrooms ·...,51501835,Jeniffer,Manhattan,Hell's Kitchen,40.76661,-73.9881,Entire home/apt,144.0,...,01/05/23,0.24,139.0,364.0,2.0,No License,4.67,2,1,1
7,9.71e+17,Rental unit in New York · ★4.17 · 1 bedroom · ...,528871354,Joshua,Manhattan,Chelsea,40.750764,-73.994605,Entire home/apt,187.0,...,18/12/23,1.67,1.0,343.0,6.0,Exempt,4.17,1,2,1
8,3857863.0,Rental unit in New York · ★4.64 · 1 bedroom · ...,19902271,John And Catherine,Manhattan,Washington Heights,40.8356,-73.9425,Private room,120.0,...,17/09/23,1.38,2.0,363.0,12.0,No License,4.64,1,1,1
9,40896610.0,Condo in New York · ★4.91 · Studio · 1 bed · 1...,61391963,Stay With Vibe,Manhattan,Murray Hill,40.75112,-73.9786,Entire home/apt,85.0,...,03/12/23,0.24,133.0,335.0,3.0,No License,4.91,Studio,1,1
10,49584980.0,Rental unit in New York · ★5.0 · 1 bedroom · 1...,51501835,Jeniffer,Manhattan,Hell's Kitchen,40.75995,-73.99296,Entire home/apt,115.0,...,29/07/23,0.16,139.0,276.0,2.0,No License,5,1,1,1
20736,7.99e+17,Rental unit in New York · 2 bedrooms · 2 beds ...,224733902,CozySuites Copake,Manhattan,Upper East Side,40.76897,-73.957592,Entire home/apt,153.0,...,15/09/23,0.41,8.0,308.0,2.0,No License,No rating,2,2,2
20737,5.93e+17,Rental unit in New York · ★4.79 · 2 bedrooms ·...,23219783,Rob,Manhattan,West Village,40.73022,-74.00291,Entire home/apt,175.0,...,22/11/23,2.03,4.0,129.0,25.0,No License,4.79,2,2,1
20738,9.23e+17,Loft in New York · ★4.33 · 1 bedroom · 2 beds ...,520265731,Rodrigo,Manhattan,Greenwich Village,40.72839,-73.99954,Entire home/apt,156.0,...,02/01/24,2.6,1.0,356.0,9.0,Exempt,4.33,1,2,1
20739,13361610.0,Rental unit in New York · ★4.89 · 2 bedrooms ·...,8961407,Jamie,Manhattan,Harlem,40.8057,-73.94625,Entire home/apt,397.0,...,08/09/23,1.08,3.0,274.0,3.0,No License,4.89,2,2,1
20740,51195660.0,Rental unit in New York · Studio · 1 bed · 1 bath,51501835,Jeniffer,Manhattan,Chinatown,40.71836,-73.99585,Entire home/apt,100.0,...,25/05/23,0.08,139.0,306.0,1.0,No License,No rating,Studio,1,1


In [75]:
# Dropping the duplicated entries
df.drop_duplicates(inplace= True)

In [76]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20758 entries, 0 to 20769
Data columns (total 22 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              20758 non-null  float64
 1   name                            20758 non-null  object 
 2   host_id                         20758 non-null  int64  
 3   host_name                       20758 non-null  object 
 4   neighbourhood_group             20758 non-null  object 
 5   neighbourhood                   20758 non-null  object 
 6   latitude                        20758 non-null  float64
 7   longitude                       20758 non-null  float64
 8   room_type                       20758 non-null  object 
 9   price                           20758 non-null  float64
 10  minimum_nights                  20758 non-null  float64
 11  number_of_reviews               20758 non-null  float64
 12  last_review                     20758

In [77]:
df['id']

0        1.312228e+06
1        4.527754e+07
2        9.710000e+17
3        3.857863e+06
4        4.089661e+07
             ...     
20765    2.473690e+07
20766    2.835711e+06
20767    5.182527e+07
20768    7.830000e+17
20769    5.660000e+17
Name: id, Length: 20758, dtype: float64

In [4]:
# Changing the data types
df['id'] = df['id'].astype('object')
df['host_id'] = df['host_id'].astype('object')
df['last_review'] = pd.to_datetime(df['last_review'])


  df['last_review'] = pd.to_datetime(df['last_review'])


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20758 entries, 0 to 20757
Data columns (total 22 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   id                              20758 non-null  object        
 1   name                            20758 non-null  object        
 2   host_id                         20758 non-null  object        
 3   host_name                       20758 non-null  object        
 4   neighbourhood_group             20758 non-null  object        
 5   neighbourhood                   20758 non-null  object        
 6   latitude                        20758 non-null  float64       
 7   longitude                       20758 non-null  float64       
 8   room_type                       20758 non-null  object        
 9   price                           20758 non-null  float64       
 10  minimum_nights                  20758 non-null  float64       
 11  nu

## Step-5: Calculating some statistics

In [91]:
# Average price by neighborhood group
print("Average price by neighborhood group: \n\n", df.groupby(by= df['neighbourhood_group'])['price'].mean().sort_values(ascending=True))

Average price by neighborhood group: 

 neighbourhood_group
Bronx            118.407798
Staten Island    118.780069
Queens           126.507312
Brooklyn         186.914238
Manhattan        227.474869
Name: price, dtype: float64


In [90]:
# Most expensive neighborhoods
print("Most expensive neighborhoods: \n\n", df.groupby(by=df['neighbourhood'])['price'].mean().sort_values(ascending=False).head(10))

Most expensive neighborhoods: 

 neighbourhood
Tribeca               455.408451
Longwood              424.225806
Civic Center          393.750000
SoHo                  363.507353
NoHo                  351.333333
Theater District      347.257143
West Village          342.938095
Flatiron District     332.615385
Financial District    314.867647
DUMBO                 312.882353
Name: price, dtype: float64


In [93]:
df['number_of_reviews_ltm']

0         0.0
1         2.0
2         6.0
3        12.0
4         3.0
         ... 
20765    12.0
20766     1.0
20767    27.0
20768     7.0
20769    62.0
Name: number_of_reviews_ltm, Length: 20758, dtype: float64

In [105]:
# Most reviewed listings
most_reveiwed = df.sort_values('number_of_reviews', ascending=False).head(10)
display(most_reveiwed[['name', 'neighbourhood_group','neighbourhood', 'number_of_reviews', 'price']])

Unnamed: 0,name,neighbourhood_group,neighbourhood,number_of_reviews,price
2389,Boutique hotel in New York · ★4.54 · 1 bedroom...,Manhattan,East Village,1865.0,144.0
8347,Hotel in New York · ★4.66 · 1 bedroom · 1 bed ...,Manhattan,Theater District,1618.0,163.0
6414,Hotel in New York · ★4.42 · 1 bedroom · 1 bed ...,Manhattan,Financial District,1574.0,148.0
8345,Hotel in New York · ★4.65 · 1 bedroom · 2 beds...,Manhattan,Theater District,1201.0,177.0
12795,Hotel in New York · ★4.36 · 1 bedroom · 1 bed ...,Manhattan,Theater District,1188.0,161.0
19450,Rental unit in New York · ★4.88 · 1 bedroom · ...,Manhattan,Lower East Side,1139.0,106.0
2530,Boutique hotel in New York · ★4.41 · 1 bedroom...,Manhattan,Lower East Side,1128.0,134.0
6575,Loft in New York · ★4.57 · 1 bedroom · 1 bed ·...,Manhattan,Chinatown,1048.0,88.0
6322,Hotel in New York · ★4.70 · 1 bedroom · 1 bed ...,Manhattan,Theater District,991.0,185.0
17597,Hostel in New York · ★4.45 · 1 bedroom · 1 bed...,Manhattan,Chelsea,787.0,71.0


In [6]:
# Saving this cleaned file
df.to_csv("Newyork_Air_bnb_Clean_dataset.csv", index=False)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20758 entries, 0 to 20757
Data columns (total 22 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   id                              20758 non-null  object        
 1   name                            20758 non-null  object        
 2   host_id                         20758 non-null  object        
 3   host_name                       20758 non-null  object        
 4   neighbourhood_group             20758 non-null  object        
 5   neighbourhood                   20758 non-null  object        
 6   latitude                        20758 non-null  float64       
 7   longitude                       20758 non-null  float64       
 8   room_type                       20758 non-null  object        
 9   price                           20758 non-null  float64       
 10  minimum_nights                  20758 non-null  float64       
 11  nu

## Step-6: Exploring Categorical Variables

In [11]:
# Explore neighborhood groups
fig = px.histogram(data_frame= df,
                   x = df['neighbourhood_group'],
                   text_auto= True,
                   title = "Number of listings by the town"
                   )
fig.show()

In [12]:
# Explore room types
fig = px.histogram(data_frame= df,
                   x = df['room_type'],
                   text_auto= True,
                   title= "Distribution of room types"
                   )
fig.show()

In [24]:
# Combine both categorical variables
fig = px.histogram(data_frame= df,
                   x = df['neighbourhood_group'],
                   color= df['room_type'],
                   title = "Room types in each town",
                   text_auto= True,
                   barmode= 'group',
                   height= 500
                   )
fig.show()

## Step-7: Exploring Numerical Variables

In [31]:
# Plot price distribution
fig = px.histogram(data_frame= df,
                   x = df['price'],
                   nbins= 10,
                   title= "Distribution of price",
                   text_auto= True)
fig.show()

In [54]:
# Plot price by neighborhood group
fig = px.box(data_frame= df,
             x = df['neighbourhood_group'],
             y =df['price'],
             title= "Price Distribution by town",
             range_y= (0, 500))
fig.show()

In [66]:
# Plot availability
fig = px.histogram(data_frame= df,
                   x = df['availability_365'],
                   title= "Plot availability throughout the year",
                   nbins= 10,
                   text_auto= True
                   )
fig.show()

## Step-8: Exploring Relationships Between Variables

In [None]:
# Price vs. Room Type
fig = px.box(data_frame= df,
             x = df['room_type'],
             y = df['price'],
             title= "Price by Room Type",
             color='room_type',
             range_y= (0, 500)
             )
fig.show()

In [80]:
# Number of reviews vs. Price
fig = px.scatter(data_frame= df,
                 x= df['price'],
                 y= df['number_of_reviews'],
                 range_x= (0, 10000),
                 title= "Number of Reviews vs. Price")
fig.show()

In [84]:
# Minimum nights vs. Price
fig = px.scatter(data_frame= df,
                 y= df['minimum_nights'],
                 x= df['price'],
                 range_x= (0, 10000),
                 title= "Minimum Nights vs. Price")
fig.show()

In [12]:
# Heatmap - Corrrelation of one variable with others for numerical column
num_cols = df.select_dtypes(include=['int64', 'float64'])
corr_df = num_cols.corr()
corr_df

Unnamed: 0,id,host_id,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,beds
id,1.0,0.42235,0.009435,0.068513,0.002497,-0.073697,-0.282485,0.138342,-0.010699,0.10052,0.037681,0.040977
host_id,0.42235,1.0,0.01179,0.122479,-0.006075,-0.072428,-0.139648,0.170461,-0.049981,0.082528,0.10094,0.056595
latitude,0.009435,0.01179,1.0,0.046123,-0.001216,0.004182,-0.048357,-0.042049,0.069997,-0.004955,-0.041949,-0.070172
longitude,0.068513,0.122479,0.046123,1.0,-0.033387,0.024098,0.006003,0.042783,-0.072118,0.061427,0.033829,0.035519
price,0.002497,-0.006075,-0.001216,-0.033387,1.0,-0.006423,-0.012619,-0.009955,-0.007295,0.020031,-0.011369,0.066174
minimum_nights,-0.073697,-0.072428,0.004182,0.024098,-0.006423,1.0,-0.059242,-0.122541,0.014986,0.034533,-0.092614,-0.0263
number_of_reviews,-0.282485,-0.139648,-0.048357,0.006003,-0.012619,-0.059242,1.0,0.631579,-0.114541,-0.050409,0.605962,0.035554
reviews_per_month,0.138342,0.170461,-0.042049,0.042783,-0.009955,-0.122541,0.631579,1.0,-0.108516,-0.040899,0.849859,0.047526
calculated_host_listings_count,-0.010699,-0.049981,0.069997,-0.072118,-0.007295,0.014986,-0.114541,-0.108516,1.0,0.046258,-0.091491,-0.071028
availability_365,0.10052,0.082528,-0.004955,0.061427,0.020031,0.034533,-0.050409,-0.040899,0.046258,1.0,-0.049307,0.064581


In [20]:
fig = px.imshow(corr_df, x = corr_df.columns, y = corr_df.columns, text_auto= True, aspect= "auto", title="Correlation Matrix Heatmap")
fig.show()

Reviews are strongly connected — Popular listings keep getting more reviews regularly.

In [86]:
df.describe()

Unnamed: 0,latitude,longitude,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,beds
count,20758.0,20758.0,20758.0,20758.0,20758.0,20758,20758.0,20758.0,20758.0,20758.0,20758.0
mean,40.726815,-73.939174,187.588496,28.560844,42.603045,2023-01-16 07:37:46.692359680,1.257414,18.841459,206.062289,10.848299,1.723721
min,40.500314,-74.24984,10.0,1.0,1.0,2011-10-12 00:00:00,0.01,1.0,0.0,0.0,1.0
25%,40.68415,-73.980761,80.0,30.0,4.0,2023-01-12 00:00:00,0.21,1.0,87.0,1.0,1.0
50%,40.722856,-73.949597,125.0,30.0,14.0,2023-07-22 00:00:00,0.65,2.0,215.0,3.0,1.0
75%,40.76312,-73.917463,199.0,30.0,49.0,2023-11-07 00:00:00,1.8,5.0,354.0,15.0,2.0
max,40.911147,-73.71365,100000.0,1250.0,1865.0,2024-05-01 00:00:00,75.49,713.0,365.0,1075.0,42.0
std,0.060294,0.061408,1022.706923,33.535717,73.52771,,1.9047,70.910579,135.093954,21.357409,1.212272


## Conclusion

After exploring the New York Airbnb 2024 dataset, I found some interesting patterns:

-- Room Type: Most people rent out entire homes or apartments, followed by private rooms. Very few listings are for hotel rooms or shared spaces.

-- Popular Areas: Manhattan and Brooklyn have the highest number of Airbnb listings. Staten Island has the least.

-- Availability: Some listings are available for the entire year, while others are listed for just a few days. This means some hosts rent full-time, and some only part-time.

-- Price: Hotel rooms and entire homes are usually more expensive. Private rooms and shared rooms are cheaper options for travelers.

-- Room Types by Area: Manhattan has the most entire homes, while Brooklyn has many private rooms.

**Stats Summary:**

-- Average price is around $188, but some listings are super expensive.

-- Most hosts ask for at least 30 nights stay.

-- Some places have lots of reviews, showing they are popular and trusted.

In short, Airbnb in New York has a variety of choices — from cheap shared rooms to costly hotel-like stays — and most listings are in Manhattan and Brooklyn.