<a href="https://colab.research.google.com/github/swastiika14/EDA_AirBnB/blob/main/SWASTIKA_EDA_Submission_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  Airbnb NYC 2019 - Exploratory Data Analysis




##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **NAME -** SWASTIKA SRIVASTAVA


# **Project Summary -**

**Airbnb NYC 2019 Dataset: Project Overview**

This project analyzes and visualizes the Airbnb NYC 2019 dataset. The primary goal is to derive insights that support decision-making by guests, hosts, and the Airbnb platform as a whole. The dataset covers location, price, availability, room type, and review statistics for listings across the five NYC boroughs.

The project starts with data cleaning and imputation of missing values, specifically in the `reviews_per_month` and `host_name` columns. A heatmap shows where the null values occur, which makes the later data manipulation more reliable. Initial exploratory data analysis (EDA) then establishes the overall structure of the dataset, e.g., the distribution of listings by room type and borough.

Fourteen charts were created to gain insights tied to the business objectives. Bar charts of listing counts by borough and room type showed that Manhattan and Brooklyn are the largest markets, whereas Staten Island and the Bronx have far fewer listings. Distribution plots of price, minimum night stays, and availability were used to examine booking trends: the majority of listings are priced below \$200, and availability clusters at two extremes, with many listings either rarely available or available all year. Additionally, a correlation heatmap and a pair plot were used to examine the relationships among numerical variables, which may be helpful for later model development.

Geospatial analysis used scatter plots of latitude and longitude to show where listings concentrate, especially in central Manhattan. Review frequency and average prices by borough helped in evaluating pricing strategy and market demand. Manhattan listings had higher average prices and review counts, reflecting high tourist activity, while the Bronx and Staten Island showed lower activity, which may indicate potential for market expansion.

Further analysis covers estimated revenue by borough and listing activity, measured by average monthly reviews. These measures allow an evaluation of profitability and consumer demand. The price versus reviews scatter plot revealed a loose negative relationship, indicating that lower-priced listings tend to see more guest activity. In addition, the room type distribution by borough revealed that private rooms are more common in the outer boroughs, while entire apartments still dominate Manhattan.

Each chart was chosen for how effectively it conveys a particular trend or pattern. The resulting insights support better decisions on pricing, availability, and location, and they point to potential business risks such as market saturation or low visibility in crowded markets. For example, overreliance on high-performing boroughs like Manhattan could become a problem if regulations tighten or consumer behavior shifts.

In conclusion, this project gives an analytical perspective on the New York City Airbnb market. It covers short-term optimization, such as pricing and listing strategies, and long-term planning issues, such as market expansion and host recruitment. The results offer evidence-based guidance for businesses working with Airbnb, hosts, and data analysts who want to build predictive models or recommendation systems from this data.

# **GitHub Link -**

https://github.com/swastiika14/EDA_AirBnB/blob/main/SWASTIKA_EDA_Submission_.ipynb

# **Problem Statement**





This project examines the Airbnb NYC 2019 data to identify trends in prices, availability, and listing performance across the five New York City boroughs. The goal is to produce actionable insights that support business decisions benefiting both hosts and the Airbnb platform, and to identify areas of opportunity and risk.

#### **Define Your Business Objective?**



The aim of this project is to analyze the Airbnb NYC 2019 dataset in depth and produce actionable insights for strategic and operational decision-making. By examining listings across New York City's five boroughs, the analysis reveals patterns in pricing, availability, room types, and customer behavior that can help stakeholders increase revenue and occupancy, improve customer satisfaction, and allocate resources efficiently. The project also seeks to surface emerging trends, underperforming locations, and risks such as market oversaturation. These findings support data-driven decisions on pricing strategy, personalized marketing, user engagement, host acquisition and onboarding, and market expansion.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="darkgrid")

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Dataset Loading

In [None]:
# Load Dataset
path='/content/drive/MyDrive/Colab Notebooks/Copy of Airbnb NYC 2019.csv'
df = pd.read_csv(path)

### Dataset First View

In [None]:
# Dataset First Look
df.head()

In [None]:
print("\nColumn names:\n", df.columns)

In [None]:
print("\nData types:\n", df.dtypes)

In [None]:
df.shape

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows, cols = df.shape
print(f"Rows: {rows}, Columns: {cols}")

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing=df.isnull().sum()
print(missing)

In [None]:
missing = missing[missing > 0].sort_values(ascending=True)
sns.barplot(x=missing.values, y=missing.index)
plt.title("Missing Values per Column", fontsize=16, fontweight='bold')
plt.xlabel("Count of Missing Values")
plt.ylabel("Column Name")
plt.grid(axis='x', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(9, 3))
sns.heatmap(df.isnull(), cbar=False, cmap='YlGnBu')
plt.title("Missing Values Heatmap", fontsize=16)
plt.show()

### What did you know about your dataset?



The Airbnb NYC 2019 dataset contains detailed data about Airbnb listings in the five New York City boroughs: hosts, listing names, locations, prices, availability, room types, and customer engagement metrics such as review count and review frequency.

From an initial look at the data, I noticed that:

Listings are spread across the boroughs, with very different listing counts in each.

Listings fall into three room types: Entire home/apt, Private room, and Shared room.

Some columns, such as reviews_per_month and host_name, have missing values.

Prices and availability vary widely, which can affect customer choice and sales.

Listings are most commonly located in Manhattan and Brooklyn.

These initial findings shaped the subsequent focus areas for cleaning, analysis, and visualization.
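The missing-value observation above can be made more concrete by reporting the share of missing entries per column alongside the raw count. The following is a minimal sketch on a tiny stand-in DataFrame; the values are hypothetical and only the column names are taken from the dataset.

```python
import pandas as pd

# Tiny stand-in frame (hypothetical values) reusing column names from the dataset.
sample = pd.DataFrame({
    "host_name": ["Ana", None, "Lee", "Sam"],
    "reviews_per_month": [0.5, None, None, 2.0],
    "price": [100, 80, 250, 60],
})

# Count and percentage of missing values per column.
missing_count = sample.isnull().sum()
missing_pct = (sample.isnull().mean() * 100).round(1)

# Keep only the columns that actually have missing values.
missing_summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
missing_summary = missing_summary[missing_summary["missing"] > 0]
print(missing_summary)
```

Applied to the real `df`, the same pattern would show how large a share of rows is affected by each imputation.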

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description




| Variable Name                         | Description                                                              |
| ------------------------------------- | ------------------------------------------------------------------------ |
| **id**                                | Unique identifier for each Airbnb listing.                               |
| **name**                              | Name/title of the listing written by the host.                           |
| **host\_id**                          | Unique identifier for each host.                                         |
| **host\_name**                        | Name of the host (may be missing in some entries).                       |
| **neighbourhood\_group**              | The borough in which the listing is located (e.g., Manhattan, Brooklyn). |
| **neighbourhood**                     | Specific neighborhood within the borough.                                |
| **latitude**                          | Geographic latitude of the listing.                                      |
| **longitude**                         | Geographic longitude of the listing.                                     |
| **room\_type**                        | Type of room offered: Entire home/apt, Private room, or Shared room.     |
| **price**                             | Price per night (in USD) for the listing.                                |
| **minimum\_nights**                   | Minimum number of nights a guest must stay.                              |
| **number\_of\_reviews**               | Total number of reviews the listing has received.                        |
| **last\_review**                      | Date of the most recent review (may be null if no reviews).              |
| **reviews\_per\_month**               | Average number of reviews per month (calculated, may be null).           |
| **calculated\_host\_listings\_count** | Number of listings the host has in total.                                |
| **availability\_365**                 | Number of days in a year the listing is available for booking (0–365).   |

---


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in df.columns.tolist():
  print("No of unique values in ",i,":",df[i].nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
df['host_name']=df['host_name'].fillna('Unknown')
df['reviews_per_month']=df['reviews_per_month'].fillna(0)
df['name'] = df['name'].fillna('No name')

In [None]:

print("No. of high-priced listings: ", len(df[df['price'] > 500]))
df_high_price = df[df['price'] > 500]

high_price_counts = pd.DataFrame(
    df_high_price.groupby('neighbourhood_group')['price'].count().reset_index(name="Count")
)


print(high_price_counts)



In [None]:
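# Drop listings with a price of 0; .copy() avoids SettingWithCopyWarning when columns are added later
df = df[df['price'] > 0].copy()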

In [None]:
df.describe()

In [None]:
df['neighbourhood_group'].value_counts()


In [None]:
df['room_type'].value_counts()

In [None]:
df.groupby('neighbourhood_group')['price'].mean().sort_values(ascending=False)

In [None]:
df.groupby(['neighbourhood_group', 'room_type']).size().unstack().fillna(0)

In [None]:
df.groupby('neighbourhood_group')['availability_365'].mean()

In [None]:
df.groupby('room_type')['price'].mean()

In [None]:
pop_neighbour= df['neighbourhood'].value_counts().head(10)
pop_neighbour

### What all manipulations have you done and insights you found?

Several data manipulations prepared the Airbnb NYC 2019 dataset for analysis. Missing values in the host_name, name, and reviews_per_month columns were replaced with 'Unknown', 'No name', and 0, respectively, and listings with a price of \$0 were dropped. To keep extreme values from distorting the plots, individual charts restrict price to under \$500 and minimum nights to fewer than 30; the underlying DataFrame keeps those rows. A new feature, estimated_revenue (price × number of reviews), was engineered, and grouping operations were used to analyze borough-wise trends in price, availability, and review patterns.

Through these manipulations, it was observed that Brooklyn and Manhattan dominate the listing count, while Manhattan also leads in average price and estimated revenue. Most listings are either private rooms or entire apartments, and availability varies significantly across listings. Positive trends between reviews and revenue highlight strong customer engagement in certain areas, though price was not strongly correlated with other numerical features, suggesting market-driven pricing strategies. These insights can support pricing optimization, marketing focus, and host engagement strategies.
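The price and minimum-night thresholds used for trend analysis could be collected in one place so that every chart applies the same cut-offs. Below is a minimal sketch; the helper name `filter_typical_listings` and the sample values are hypothetical, and the defaults simply mirror the \$500 and 30-night limits mentioned above.

```python
import pandas as pd

def filter_typical_listings(frame, max_price=500, max_min_nights=30):
    """Return a copy with listings priced above 0 and under max_price,
    and with minimum_nights under max_min_nights."""
    mask = (
        (frame["price"] > 0)
        & (frame["price"] < max_price)
        & (frame["minimum_nights"] < max_min_nights)
    )
    return frame.loc[mask].copy()

# Hypothetical sample rows, for illustration only.
sample = pd.DataFrame({
    "price": [0, 120, 80, 999, 150],
    "minimum_nights": [1, 2, 45, 3, 1],
})

typical = filter_typical_listings(sample)
print(typical)
```

Returning a copy keeps the original DataFrame intact, so totals such as listing counts can still be computed on the full data.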

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
plt.figure(figsize=(8, 4))
sns.countplot(data=df, x='neighbourhood_group', hue='neighbourhood_group', palette='plasma', legend=False)
plt.title("Number of Listings by Area")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

It shows the distribution of listings across NYC’s five boroughs.

##### 2. What is/are the insight(s) found from the chart?

Manhattan has the highest number of listings, followed by Brooklyn. Staten Island has the fewest.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps identify where Airbnb is dominant. A lower count in Staten Island may indicate untapped potential or low demand.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
df['room_type'].value_counts().plot.pie(autopct="%1.1f%%", colors=sns.color_palette('Set2'))
plt.title("Room Type Distribution")
plt.ylabel('count')
plt.show()


##### 1. Why did you pick the specific chart?

To understand the composition of listing types.

##### 2. What is/are the insight(s) found from the chart?

'Entire home/apt' is the most common room type, followed by 'Private room'.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Indicates user preference and guides room-type-specific marketing.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(10, 5))
sns.countplot(data=df, x='neighbourhood_group', hue='room_type', palette='muted')
plt.title("Room Types Across Areas")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Shows how room types are distributed in each borough.

##### 2. What is/are the insight(s) found from the chart?

Entire homes dominate in Manhattan; private rooms are more common in Brooklyn.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps Airbnb adjust supply strategy and host acquisition efforts.
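Because the boroughs differ so much in size, raw counts can hide the room-type mix within each borough. One option is a row-normalized cross-tabulation, sketched here on hypothetical rows; only the column names and category labels come from the dataset.

```python
import pandas as pd

# Hypothetical listings, for illustration only.
sample = pd.DataFrame({
    "neighbourhood_group": [
        "Manhattan", "Manhattan", "Manhattan",
        "Brooklyn", "Brooklyn", "Brooklyn", "Brooklyn",
    ],
    "room_type": [
        "Entire home/apt", "Entire home/apt", "Private room",
        "Private room", "Private room", "Entire home/apt", "Shared room",
    ],
})

# normalize="index" turns each row into shares that sum to 1 per borough.
room_share = pd.crosstab(
    sample["neighbourhood_group"], sample["room_type"], normalize="index"
) * 100
print(room_share.round(1))
```

Each row then reads as the percentage split of room types inside one borough, which makes small boroughs comparable with large ones.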

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(10, 5))
sns.histplot(df[df['price'] < 500]['price'], bins=60, kde=True, color='darkcyan')
plt.title("Price Distribution (Listings Under $500)")
plt.xlabel("Price")
plt.ylabel("Number of Listings")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

To visualize the distribution of typical listing prices and eliminate distortion from extreme outliers.



##### 2. What is/are the insight(s) found from the chart?

Most listings are priced between \$50 and \$200. The price distribution is right-skewed.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Helps hosts set competitive prices.
Negative growth risk: Market saturation in the \$100–\$150 range may reduce bookings.
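The histogram reading can be backed with a few numbers: the share of listings inside a price band, and a comparison of mean and median as a quick check for right skew. This is a minimal sketch on hypothetical prices, not on the actual dataset.

```python
import pandas as pd

# Hypothetical nightly prices, for illustration only.
prices = pd.Series([45, 60, 75, 90, 100, 120, 150, 180, 250, 800], name="price")

# Share of listings priced between $50 and $200 (inclusive on both ends).
share_50_200 = prices.between(50, 200).mean() * 100

# In a right-skewed distribution the mean sits above the median
# and the sample skewness is positive.
print(f"Share priced $50-$200: {share_50_200:.0f}%")
print(f"Median: {prices.median()}, mean: {prices.mean()}")
print(f"Skewness: {prices.skew():.2f}")
```

On the real data, the same three lines applied to `df['price']` would quantify what the chart shows visually.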

#### Chart - 5

In [None]:
# Chart - 5 visualization code
avg_price = df.groupby('neighbourhood_group')['price'].mean().sort_values(ascending=False)

plt.figure(figsize=(8, 5))
sns.barplot(x=avg_price.index,hue=avg_price.index, y=avg_price.values, palette="rocket",legend=False)
plt.title("Average Price by Area")
plt.ylabel("Average Price ($)")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

To compare average Airbnb prices across boroughs in a clear and ranked format.

##### 2. What is/are the insight(s) found from the chart?

Manhattan has the highest average price, followed by Brooklyn. The Bronx has the lowest.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Helps hosts and Airbnb adjust pricing strategies by location.
Negative growth risk: High prices in Manhattan may deter budget travelers, reducing demand.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(10, 5))
sns.histplot(df['availability_365'], bins=40, kde=True, color='skyblue')
plt.title("Availability Over the Year")
plt.xlabel("Availability (Days)")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

To understand how frequently listings are available throughout the year.



##### 2. What is/are the insight(s) found from the chart?

Many listings are either rarely available (0–50 days) or always available (365 days), suggesting two distinct hosting patterns

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Helps identify full-time vs part-time hosts and tailor engagement strategies.
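The two hosting patterns can be counted directly by bucketing `availability_365`. The sketch below uses hypothetical values, and the bucket boundaries and labels are assumptions chosen for illustration rather than thresholds defined in this notebook.

```python
import pandas as pd

# Hypothetical availability values (days per year), for illustration only.
availability = pd.Series(
    [0, 0, 12, 40, 90, 180, 300, 365, 365, 365], name="availability_365"
)

# Assumed cut-offs: 0-50 days, 51-300 days, and 301-365 days.
bins = [-1, 50, 300, 365]
labels = ["rarely (0-50)", "partly (51-300)", "mostly (301-365)"]
buckets = pd.cut(availability, bins=bins, labels=labels)

# Count listings per bucket, keeping the buckets in their natural order.
bucket_counts = buckets.value_counts().reindex(labels)
print(bucket_counts)
```

A large first and last bucket with a thinner middle would confirm the bimodal pattern seen in the histogram.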

#### Chart - 7

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df[df['price'] < 500], x='longitude', y='latitude',
                hue='neighbourhood_group', alpha=0.4, palette='Set1')
plt.title("Geographical Spread of Listings (Under $500)")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

To visualize how budget-friendly listings are distributed across NYC geographically.

##### 2. What is/are the insight(s) found from the chart?

Manhattan and Brooklyn have dense clusters of listings under $500, while Staten Island has fewer options.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Helps target areas for new listings and better pricing strategies.
Negative growth risk: Sparse listings in outer boroughs may reflect low demand or host participation.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
top_neigh = df['neighbourhood'].value_counts().head(20)

plt.figure(figsize=(10, 6))
sns.barplot(y=top_neigh.index, hue=top_neigh.index, x=top_neigh.values, palette='plasma', legend=False)
plt.title("Top 20 Neighbourhoods by Listing Count")
plt.xlabel("Count")
plt.ylabel("Neighbourhood")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

To identify which neighborhoods have the highest Airbnb activity.

##### 2. What is/are the insight(s) found from the chart?

Neighbourhoods like Williamsburg, Bedford-Stuyvesant, and Harlem have the most listings, indicating high host activity and guest demand.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Helps focus marketing and support in high-traffic areas.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(10, 2))
sns.stripplot(x=df[df['reviews_per_month'] > 0]['reviews_per_month'], color='purple', alpha=0.3, jitter=True)
plt.title("Strip Plot of Reviews per Month")
plt.xlabel("Reviews per Month")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

To analyze how frequently listings receive guest reviews each month, excluding those with zero activity.

##### 2. What is/are the insight(s) found from the chart?

Most listings receive fewer than 2 reviews per month, showing that only a small portion have high engagement.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Identifies high-performing listings and helps improve engagement strategies.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(10, 5))
sns.kdeplot(df[df['minimum_nights'] < 30]['minimum_nights'], fill=True, color='teal')
plt.title("Minimum Nights (Under 30 Only)")
plt.xlabel("Minimum Nights")
plt.ylabel("Density")
plt.tight_layout()
plt.show()



##### 1. Why did you pick the specific chart?

To focus on short-term stays by visualizing the common minimum night requirements without long-stay outliers.

##### 2. What is/are the insight(s) found from the chart?

Most listings require 1–3 nights minimum, with 1 night being the most common, indicating a preference for flexibility.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Shows guest-friendly policies dominate, which can attract more short-stay bookings.
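A density curve smooths over the fact that minimum nights are whole numbers, so it can help to report the most common value and the share of short requirements explicitly. The following sketch uses hypothetical values and the same under-30 filter as the chart.

```python
import pandas as pd

# Hypothetical minimum-night requirements, for illustration only.
min_nights = pd.Series([1, 1, 1, 2, 2, 3, 5, 7, 30, 90], name="minimum_nights")

# Same filter as the chart: keep requirements under 30 nights.
short_stays = min_nights[min_nights < 30]

# Most frequent requirement and share of listings asking for 1-3 nights.
most_common = short_stays.mode().iloc[0]
share_1_to_3 = short_stays.between(1, 3).mean() * 100
print(f"Most common minimum: {most_common} night(s)")
print(f"Share requiring 1-3 nights: {share_1_to_3:.0f}%")
```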

#### Chart - 11

In [None]:
plt.figure(figsize=(12, 6))
sns.scatterplot(
    data=df[df['price'] < 500],
    x='number_of_reviews',
    y='price',
    hue='room_type',
    palette='Set1',
    alpha=0.5,
    s=60,
    edgecolor='w',
)

plt.title("Price vs Number of Reviews (Listings Under $500)", fontsize=16, fontweight='bold')
plt.xlabel("Number of Reviews", fontsize=12)
plt.ylabel("Price ($)", fontsize=12)
plt.grid(True, linestyle='--', alpha=0.3)
plt.legend(title='Room Type', loc='upper right')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

To examine the relationship between review count and pricing for reasonably priced listings.

##### 2. What is/are the insight(s) found from the chart?

There’s no strong correlation, but many low-priced listings receive a high number of reviews, suggesting affordability boosts demand.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Suggests lower prices may drive more bookings and reviews, increasing visibility.
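A scatter plot with thousands of overlapping points makes it hard to judge how strong the relationship is, so a correlation coefficient is a useful companion. The sketch below computes Pearson's coefficient and a rank-based (Spearman) coefficient, the latter obtained by correlating the ranks. The sample values are hypothetical and deliberately more strongly related than the real data appears to be.

```python
import pandas as pd

# Hypothetical listings, for illustration only.
sample = pd.DataFrame({
    "price": [50, 70, 90, 120, 200, 350],
    "number_of_reviews": [210, 95, 150, 40, 12, 30],
})

# Pearson measures linear association between the raw values.
pearson = sample["price"].corr(sample["number_of_reviews"])

# Spearman is Pearson applied to ranks; it is less sensitive to outliers
# and captures monotonic (not only linear) relationships.
spearman = sample["price"].rank().corr(sample["number_of_reviews"].rank())

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

Values near zero on the real data would support the "no strong correlation" reading, while a clearly negative rank correlation would support the affordability argument.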

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Estimate revenue = price × number_of_reviews
df['estimated_revenue'] = df['price'] * df['number_of_reviews']

revenue_by_borough = df.groupby('neighbourhood_group')['estimated_revenue'].sum().sort_values(ascending=False)

plt.figure(figsize=(8, 5))
sns.barplot(x=revenue_by_borough.index, hue=revenue_by_borough.index, y=revenue_by_borough.values, palette="crest", legend=False)
plt.title("Estimated Revenue by Area")
plt.ylabel("Estimated Total Revenue ($)")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

To compare total potential earnings from Airbnb listings across boroughs.

##### 2. What is/are the insight(s) found from the chart?

Manhattan generates the highest estimated revenue, followed by Brooklyn. Staten Island contributes the least.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Highlights high-performing markets for investment or host acquisition.
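Since the revenue figure is a proxy (price × number of reviews), its absolute dollar values are less informative than each borough's share of the total. The sketch below computes those shares on hypothetical rows, using the same proxy formula as the chart code.

```python
import pandas as pd

# Hypothetical listings, for illustration only.
sample = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn", "Queens"],
    "price": [200, 150, 100, 80],
    "number_of_reviews": [10, 20, 30, 25],
})

# Same proxy as in the chart: price multiplied by review count.
sample["estimated_revenue"] = sample["price"] * sample["number_of_reviews"]

# Total per borough, then each borough's percentage of the overall total.
revenue = sample.groupby("neighbourhood_group")["estimated_revenue"].sum()
revenue_share = (revenue / revenue.sum() * 100).sort_values(ascending=False)
print(revenue_share.round(1))
```

Note that the proxy counts one booking per review and ignores length of stay, so it is best read as a relative ranking rather than actual earnings.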

#### Chart - 13 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(10,6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='YlGnBu', fmt='.2f')
plt.title("Correlation Heatmap")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

To examine relationships between numerical variables and detect possible trends or redundancies.

##### 2. What is/are the insight(s) found from the chart?

Strong correlation between number_of_reviews and reviews_per_month.

Weak correlation between price and the other numeric variables, suggesting that pricing depends more on factors the correlation matrix does not capture, such as room type and neighbourhood.
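Reading the strongest relationships off a heatmap by eye is error-prone once there are many columns. One way to list them is to keep the upper triangle of the correlation matrix and sort the pairs by absolute value, as in this minimal sketch on hypothetical data.

```python
import numpy as np
import pandas as pd

# Hypothetical listings, for illustration only.
sample = pd.DataFrame({
    "price": [100, 80, 250, 60, 150],
    "number_of_reviews": [10, 40, 2, 55, 20],
    "reviews_per_month": [0.5, 2.1, 0.1, 3.0, 1.0],
    "availability_365": [365, 30, 200, 0, 120],
})

corr = sample.corr(numeric_only=True)

# Keep only the upper triangle (k=1 skips the diagonal) so that each pair
# appears once and self-correlations are excluded.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (row, column) pairs and sort by strength regardless of sign.
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)
print(pairs.round(2))
```

On the real data it may be worth dropping identifier columns such as `id` and `host_id` first, since their correlations carry no business meaning.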


#### Chart - 14 - Pair Plot

In [None]:
# Pair Plot visualization code
# Select relevant numeric features
pairplot_data = df[['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'availability_365']]


pairplot_data = pairplot_data[pairplot_data['price'] < 500]

# Plot
sns.pairplot(pairplot_data, diag_kind='kde', corner=True, plot_kws={'alpha': 0.5, 's': 20, 'edgecolor': 'k'})
plt.suptitle("Pair Plot of Key Numerical Features", y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

To explore potential relationships and interactions between key numerical variables in one consolidated view.

##### 2. What is/are the insight(s) found from the chart?

number_of_reviews and reviews_per_month show a positive trend.

No strong linear relationship between price and other variables.

Most data is concentrated at lower values, especially for minimum_nights.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To achieve its business objective, Airbnb can use borough-level data to manage pricing and availability. Manhattan listings, for instance, have the highest estimated revenue and average prices, so premium experiences are worth emphasizing there. Brooklyn, on the other hand, has a large number of listings and room for increased host engagement. Dynamic pricing and promotional offers based on these local trends can raise revenue and satisfaction levels.

Airbnb can also address underperforming segments by encouraging hosts there to add listings and respond promptly to guests. The data suggests that shared and private rooms appeal to travelers looking to save money. Encouraging guests to leave reviews and hosts to price competitively can increase visibility and trust, which should improve platform performance and meet user expectations across all segments.

# **Conclusion**



1. **Brooklyn and Manhattan dominate listings**: These two boroughs have the highest number of listings, indicating major Airbnb activity is concentrated in these areas.

2. **Entire homes and private rooms dominate**: Entire homes/apartments are the most common room type overall and lead in Manhattan, while **private rooms** are more common in Brooklyn and in lower-priced areas like Queens and the Bronx.

3. **Most listings are affordable**: A majority of listings are priced **under \$200**, with a steep drop-off after that. High-priced listings are rare and mostly in Manhattan.

4. **Availability is polarized**: Many listings are either rarely available or available **throughout the year**, pointing to a mix of part-time and full-time hosts.

5. **Minimum nights required is low**: Most listings require **less than a week’s stay**, making them suitable for short-term travelers.


