<a href="https://colab.research.google.com/github/shreeya09/data-analysis/blob/main/AirBnb_Bookings_Analysis_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - AirBNB EDA



##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary -**

Since 2008, guests and hosts have used Airbnb to expand travel possibilities and to offer a more unique, personalised way of experiencing the world. Today, Airbnb is a one-of-a-kind service that is used and recognised around the world. Analysing the millions of listings provided through Airbnb is crucial for the company. These listings generate a lot of data - data that can be analysed and used for security, business decisions, understanding customers' and providers' (hosts') behaviour and performance on the platform, guiding marketing initiatives, implementing innovative additional services, and much more.

# **GitHub Link -**



# **Problem Statement**


Explore and analyse the Airbnb data to discover key insights.

#### **Define Your Business Objective?**

To uncover insights that can inform business decisions, improve customer and host experiences, and identify trends or anomalies.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart follows the format below.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


### Dataset Loading

In [None]:
# Load Dataset
file_path = "https://raw.githubusercontent.com/shreeya09/data-analysis/main/Airbnb%20NYC%202019.csv"  # Adjust path if running locally
df = pd.read_csv(file_path)


### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(df.isnull().sum())

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cbar=False)

### What did you know about your dataset?

The dataset given is from the hospitality and tourism industry, specifically from the Airbnb platform, focusing on listings in New York City during the year 2019. We aim to analyze the platform's key operational aspects such as host behavior, pricing strategies, customer engagement, and geographic distribution of listings to derive valuable business insights.

Airbnb data analysis can help identify factors affecting price, availability, and popularity of listings, as well as trends across neighborhoods. These insights are useful for business optimization, marketing strategies, policy development, and urban planning.

The dataset consists of 48,895 rows and 16 columns, capturing details of Airbnb listings in New York City for the year 2019. This dataset provides a mix of categorical and numerical features relevant for analyzing host behavior, listing distribution, pricing trends, and guest activity.

There are no duplicate values in the dataset, which ensures data integrity and uniqueness of entries.

However, the dataset does contain missing values in the following columns:

- `name`: 16 missing
- `host_name`: 21 missing
- `last_review`: 10,052 missing
- `reviews_per_month`: 10,052 missing

The majority of the missing values in last_review and reviews_per_month are likely due to listings that have never received a review, which is a common and expected scenario in this domain.
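The explanation above (missing review fields belong to listings with no reviews) can be checked rather than assumed. The following is a minimal sketch; the helper name `unreviewed_missing_overlap` and the tiny sample frame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def unreviewed_missing_overlap(listings: pd.DataFrame) -> dict:
    """Compare missing review fields against listings that have zero reviews."""
    no_reviews = listings["number_of_reviews"] == 0
    missing_rpm = listings["reviews_per_month"].isna()
    missing_last = listings["last_review"].isna()
    return {
        "zero_review_listings": int(no_reviews.sum()),
        "missing_reviews_per_month": int(missing_rpm.sum()),
        "missing_last_review": int(missing_last.sum()),
        # True only if all three masks flag exactly the same rows
        "masks_identical": bool(
            (no_reviews == missing_rpm).all() and (no_reviews == missing_last).all()
        ),
    }

# Tiny made-up frame for illustration only (not real Airbnb rows)
sample = pd.DataFrame({
    "number_of_reviews": [0, 3, 0, 12],
    "reviews_per_month": [np.nan, 0.4, np.nan, 1.1],
    "last_review": [None, "2019-05-01", None, "2019-06-15"],
})
overlap = unreviewed_missing_overlap(sample)
print(overlap)
```

Calling `unreviewed_missing_overlap(df)` on the loaded dataset would show whether the 10,052 missing values line up exactly with the zero-review listings.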

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

id : Unique ID for each listing

name : Title of the listing provided by the host

host_id : Unique ID for each host

host_name : Name of the host

neighbourhood_group : Categorical – Main NYC borough (e.g., Manhattan, Brooklyn)

neighbourhood : Categorical – Specific neighborhood within a borough

latitude : Latitude coordinate of the listing location

longitude : Longitude coordinate of the listing location

room_type : Categorical – Type of room (Entire home/apt, Private room, Shared room)

price : Price per night in USD

minimum_nights : Minimum number of nights required for booking

number_of_reviews : Total number of reviews received

last_review : Date of the most recent review (NaN if no reviews)

reviews_per_month : Average number of reviews per month (NaN if no reviews)

calculated_host_listings_count : Number of listings managed by the same host

availability_365 : Number of days the listing is available per year

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in df.columns:
    print(f"No. of unique values in {col}: {df[col].nunique()}")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# 1. Remove duplicate rows (if any)
df = df.drop_duplicates()

# 2. Convert 'last_review' to datetime format
df['last_review'] = pd.to_datetime(df['last_review'])

# 3. Fill missing values
df['name'] = df['name'].fillna("No name")
df['host_name'] = df['host_name'].fillna("No host")
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)
# 'last_review' is left as NaT for listings that have never been reviewed

# 4. (Optional) Remove listings with 0 price or unrealistic minimum nights
df = df[df['price'] > 0]
df = df[df['minimum_nights'] <= 365]

# 5. Reset index after cleaning
df = df.reset_index(drop=True)

# Preview cleaned data
df.info()



### What all manipulations have you done and insights you found?


**Data Manipulations Performed**

1. Loaded the Airbnb NYC 2019 dataset (~49,000 rows, 16 columns)  
2. Removed duplicate rows → 0 found  
3. Checked and filled missing values:  
   - `name` → filled with `"No name"`  
   - `host_name` → filled with `"No host"`  
   - `reviews_per_month` → filled with `0`  
   - `last_review` → converted to datetime, missing kept as `NaT`  
4. Removed listings with:  
   - `price = 0` (invalid listing)  
   - `minimum_nights > 365` (outlier)  
5. Reset DataFrame index after cleaning  
6. Verified column data types and null counts  

---

**Initial Insights**

- ~10,000 listings have never received a review (`NaT` in `last_review`)  
- No duplicate entries in the dataset  
- All columns except `last_review` now have complete data  
- Removed zero-priced and extreme minimum night listings for more reliable analysis  
- Dataset is now clean and ready for visualizations and deeper EDA
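The cleaning steps listed above could also be wrapped in a single function so the notebook can be re-run safely on a fresh copy of the data. This is a sketch under assumptions: `clean_listings` and its `max_min_nights` parameter are hypothetical names, and the sample rows are made up for illustration.

```python
import pandas as pd

def clean_listings(raw: pd.DataFrame, max_min_nights: int = 365) -> pd.DataFrame:
    """Apply the wrangling steps described above and return a new DataFrame."""
    out = raw.drop_duplicates().copy()
    # Unparseable dates become NaT instead of raising an error
    out["last_review"] = pd.to_datetime(out["last_review"], errors="coerce")
    out = out.fillna({"name": "No name", "host_name": "No host", "reviews_per_month": 0})
    # Drop zero-priced listings and extreme minimum-night requirements
    keep = (out["price"] > 0) & (out["minimum_nights"] <= max_min_nights)
    return out.loc[keep].reset_index(drop=True)

# Made-up rows for illustration only
sample = pd.DataFrame({
    "name": ["Cozy loft", None, "Free stay", "Long lease"],
    "host_name": ["Ann", "Bob", None, "Cy"],
    "last_review": ["2019-05-01", None, "2019-01-10", None],
    "reviews_per_month": [0.5, None, 1.0, None],
    "price": [120, 80, 0, 95],
    "minimum_nights": [2, 1, 3, 500],
})
cleaned = clean_listings(sample)
print(cleaned)
```

Returning a new frame instead of modifying `df` in place keeps the raw data available for comparison before and after cleaning.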

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1- Distribution of room types (Countplot)

In [None]:
# Chart - 1 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Set styles
sns.set(style="whitegrid", palette="pastel")
plt.rcParams["figure.figsize"] = (10, 6)

# 1. Distribution of room types
sns.countplot(data=df, x='room_type', order=df['room_type'].value_counts().index)
plt.title("Room Type Distribution")
plt.xlabel("Room Type")
plt.ylabel("Number of Listings")
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

Why Countplot used?

It visually counts occurrences of each category in a column.

It works best with categorical data like room_type.

It gives a quick sense of which room types are more common and how they compare.

##### 2. What is/are the insight(s) found from the chart?

Entire homes/apartments and private rooms dominate Airbnb listings in NYC, while shared rooms make up only a small share.
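The countplot shows relative bar heights; the exact counts and percentage shares behind it can be tabulated as well. A minimal sketch follows, where `category_shares` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd

def category_shares(listings: pd.DataFrame, column: str) -> pd.DataFrame:
    """Count rows per category and express each count as a percentage of the total."""
    counts = listings[column].value_counts()
    return pd.DataFrame({
        "listings": counts,
        "share_pct": (counts / counts.sum() * 100).round(1),
    })

# Made-up sample for illustration only
sample = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 5 + ["Private room"] * 4 + ["Shared room"]
})
shares = category_shares(sample, "room_type")
print(shares)
```

On the real data this would be called as `category_shares(df, 'room_type')`.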

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

 Knowing that Entire homes and Private rooms dominate can help Airbnb prioritize features, promotions, or pricing models around them.

#### Chart - 2- Listings by neighbourhood group (Pie Chart)

In [None]:
# Chart - 2 visualization code

# Chart - 2: Listings by neighbourhood group

# Data preparation
neighbourhood_counts = df['neighbourhood_group'].value_counts()
labels = neighbourhood_counts.index
sizes = neighbourhood_counts.values
colors = sns.color_palette("pastel")[0:5]

# Plot
plt.figure(figsize=(8, 8))
wedges, texts, autotexts = plt.pie(
    sizes,
    colors=colors,
    startangle=140,
    wedgeprops=dict(edgecolor='black'),
    autopct='%1.1f%%',
    pctdistance=1.15  # Move percentage labels outward
)

# Add legend for neighborhood labels
plt.legend(wedges, labels, title="Neighbourhood Group", loc="center left", bbox_to_anchor=(1, 0.5, 0.5, 1))

# Style adjustments
for text in autotexts:
    text.set_color('black')
    text.set_fontsize(10)

plt.title("Listings Distribution by Neighbourhood Group", fontsize=14)
plt.axis('equal')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart is ideal for showing proportions of a whole.

It helps visually compare how each neighbourhood group contributes to total Airbnb listings.

Especially useful when the number of categories is limited and clearly distinguishable.

Makes it easy to spot dominant groups (like Manhattan and Brooklyn).



##### 2. What is/are the insight(s) found from the chart?

Manhattan and Brooklyn dominate the Airbnb market in NYC, together accounting for over 85% of total listings.

Other boroughs like Queens, Bronx, and Staten Island have significantly fewer listings, indicating lower host or tourist activity in those areas.

This distribution suggests that most Airbnb activity is concentrated in central and tourist-heavy areas, likely due to higher demand and better infrastructure.
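The combined share quoted above can be reproduced directly from the column instead of adding up pie-chart labels by hand. The sketch below assumes a hypothetical helper `top_n_share`; the sample frame is made up.

```python
import pandas as pd

def top_n_share(listings: pd.DataFrame, column: str, n: int = 2) -> float:
    """Percentage of rows covered by the n most frequent values of `column`."""
    shares = listings[column].value_counts(normalize=True)
    return round(float(shares.head(n).sum()) * 100, 1)

# Made-up sample for illustration only
sample = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Brooklyn"] * 4 + ["Queens", "Bronx"]
})
top_two = top_n_share(sample, "neighbourhood_group", n=2)
print(f"Top two boroughs hold {top_two}% of the sample listings")
```

Running `top_n_share(df, 'neighbourhood_group')` on the cleaned dataset gives the figure to compare against the "over 85%" statement.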

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding that Manhattan and Brooklyn dominate listings helps Airbnb and hosts focus marketing, promotions, and premium services in high-demand areas.
It also opens opportunities to expand inventory in underrepresented boroughs like Queens or Bronx, where competition is lower and growth potential exists.

Over-concentration in Manhattan and Brooklyn could lead to market saturation, pricing pressure, and regulatory scrutiny. Neglecting areas like Staten Island or Bronx may result in missed growth opportunities, especially as tourists seek cheaper or more local experiences.

#### Chart - 3- Price distribution by room type (Box Plot)

In [None]:
# Chart - 3 visualization code
# 3. Price distribution by room type
sns.boxplot(data=df, x='room_type', y='price')
plt.ylim(0, 500)  # Limit y-axis for better view, exclude extreme outliers
plt.title("Price Distribution by Room Type")
plt.ylabel("Price (USD)")
plt.show()

##### 1. Why did you pick the specific chart?

A boxplot shows the spread, median, and outliers of price for each room type.

Ideal for comparing distribution across categories (like room_type).

Highlights central tendency (median) and price variability.

Clearly displays extreme values (outliers), helping spot unusually priced listings.

##### 2. What is/are the insight(s) found from the chart?

Entire home/apt listings have the highest median price and a wide price range, as expected due to full property rentals.

Private rooms are significantly more affordable and tightly clustered around a lower price range.

Shared rooms are the cheapest but also the least common.

The plot reveals many high-price outliers, especially in Entire home/apt category, indicating luxury or premium listings.
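Because the y-axis is capped at 500, the box plot hides how many high-price outliers each room type actually has. One way to count them is the same 1.5 × IQR rule that box-plot whiskers use by default. This is a sketch; `price_outliers_by_group` is a hypothetical helper and the prices are made up.

```python
import pandas as pd

def price_outliers_by_group(
    listings: pd.DataFrame, group_col: str = "room_type", value_col: str = "price"
) -> pd.DataFrame:
    """Median, upper whisker fence (Q3 + 1.5 * IQR) and outlier count per group."""
    grouped = listings.groupby(group_col)[value_col]
    q1 = grouped.quantile(0.25)
    q3 = grouped.quantile(0.75)
    upper_fence = q3 + 1.5 * (q3 - q1)
    # Map each row to its own group's fence, then flag rows above it
    is_outlier = listings[value_col] > listings[group_col].map(upper_fence)
    return pd.DataFrame({
        "median": grouped.median(),
        "upper_fence": upper_fence,
        "n_outliers": is_outlier.groupby(listings[group_col]).sum(),
    })

# Made-up prices for illustration only
sample = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 6 + ["Private room"] * 5,
    "price": [100, 120, 150, 180, 200, 2000, 50, 60, 70, 80, 90],
})
outlier_table = price_outliers_by_group(sample)
print(outlier_table)
```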



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Understanding how prices vary by room type helps:

- Hosts price competitively within their category.
- Airbnb recommend optimal pricing for new listings based on market trends.
- Airbnb identify opportunities to promote private rooms as budget-friendly options for price-sensitive users.
- Airbnb tune its dynamic pricing algorithms to account for category-based pricing patterns.

The presence of many high-price outliers, especially in the Entire home/apt category, may create a perception of being overpriced, driving away budget travelers. If shared rooms remain underutilized, they could represent wasted inventory or low host engagement.

#### Chart - 4- Number of reviews vs price (Scatter Plot)

In [None]:
# Chart - 4 visualization code
# 4. Number of reviews vs price (scatter plot)
sns.scatterplot(data=df[df['price'] < 500], x='price', y='number_of_reviews', alpha=0.4)
plt.title("Price vs. Number of Reviews")
plt.xlabel("Price (USD)")
plt.ylabel("Number of Reviews")
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot is ideal for visualizing the relationship between two numerical variables.

Helps identify patterns, trends, or correlations — e.g., whether cheaper listings get more reviews.

Reveals clusters and outliers, like low-priced listings with lots of activity.

Adding alpha=0.4 makes dense areas more visible without clutter.

##### 2. What is/are the insight(s) found from the chart?

Listings with lower prices (under $150) tend to receive more reviews, suggesting they are booked more frequently.

Higher-priced listings generally receive fewer reviews, possibly due to lower affordability or being targeted toward short-term/luxury stays.

A few outliers exist — expensive listings with lots of reviews, likely premium or highly-rated properties.
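A dense scatter plot can make trends hard to judge by eye, so the pattern described above is worth summarising numerically. The sketch below groups listings into price bands and also computes a rank (Spearman-style) correlation by correlating the ranks of the two columns. The function name, the band edges, and the sample values are all illustrative assumptions.

```python
import pandas as pd

def reviews_by_price_band(listings: pd.DataFrame, edges=(0, 50, 100, 150, 250, 500)):
    """Median review count per price band, plus a rank correlation of price vs reviews."""
    bands = pd.cut(listings["price"], bins=list(edges))
    summary = (
        listings.groupby(bands, observed=True)["number_of_reviews"]
        .agg(["count", "median"])
    )
    # Pearson correlation of the ranks is the Spearman rank correlation
    rank_corr = listings["price"].rank().corr(listings["number_of_reviews"].rank())
    return summary, rank_corr

# Made-up values for illustration only
sample = pd.DataFrame({
    "price": [40, 45, 90, 120, 200, 400],
    "number_of_reviews": [30, 25, 20, 10, 5, 1],
})
band_summary, rank_corr = reviews_by_price_band(sample)
print(band_summary)
print(f"Rank correlation: {rank_corr:.2f}")
```

A rank correlation close to zero on the real data would suggest the visual trend is weaker than it looks.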

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The insight that lower-priced listings attract more reviews can guide hosts to optimize their pricing for better visibility and more bookings. Airbnb can use this to enhance dynamic pricing models, suggesting lower rates for new hosts to quickly gain traction. Frequent reviews also build trust, boosting search ranking and conversion rates and leading to increased bookings.

Overemphasis on low pricing may trigger a “race to the bottom”, reducing overall host revenue and platform value. High-quality, higher-priced listings may get overlooked despite offering better service, leading to underutilization and dissatisfied premium hosts.



#### Chart - 5- Availability by neighbourhood group (Box Plot)

In [None]:
# Chart - 5 visualization code
# 5. Availability by neighbourhood group
sns.boxplot(data=df, x='neighbourhood_group', y='availability_365')
plt.title("Availability by Neighbourhood Group")
plt.ylabel("Available Days per Year")
plt.show()

##### 1. Why did you pick the specific chart?

A boxplot is ideal for showing the distribution and spread of a numerical variable (availability_365) across categories (neighbourhood_group).

It reveals:

Median availability (typical listing activity)

Variation within each borough

Outliers, such as listings available all year or rarely available

Helps compare how active listings are in different boroughs at a glance.

##### 2. What is/are the insight(s) found from the chart?

All boroughs show a wide variation in listing availability, ranging from 0 to 365 days.

Many listings have full-year availability (365 days), especially in Manhattan and Brooklyn, suggesting strong host engagement in those areas.

Some boroughs (e.g., Staten Island and Bronx) show lower median availability, which may reflect less consistent host activity or part-time rentals.

Presence of 0-availability listings could indicate inactive or paused listings still on the platform.
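Medians and the share of listings at the two extremes (0 and 365 days) are easier to compare in a table than in overlapping boxes, so the borough-level statements above can be cross-checked with a short summary. This is a sketch; `availability_summary` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd

def availability_summary(listings: pd.DataFrame) -> pd.DataFrame:
    """Median availability and share of listings at 0 and 365 days, per borough."""
    grouped = listings.groupby("neighbourhood_group")["availability_365"]
    table = pd.DataFrame({
        "median_days": grouped.median(),
        "pct_zero": grouped.apply(lambda s: (s == 0).mean() * 100).round(1),
        "pct_full_year": grouped.apply(lambda s: (s == 365).mean() * 100).round(1),
    })
    return table.sort_values("median_days", ascending=False)

# Made-up values for illustration only
sample = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Bronx"] * 3,
    "availability_365": [0, 0, 30, 365, 100, 200, 365],
})
availability_table = availability_summary(sample)
print(availability_table)
```

Sorting by `median_days` makes it easy to see which boroughs actually sit highest and lowest before drawing conclusions about host engagement.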



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding that Manhattan and Brooklyn have many fully available listings suggests these are key markets with active and committed hosts, ideal for promotions, partnerships, or Airbnb Plus expansion. Insights into borough-level availability help Airbnb adjust search algorithms and availability-based recommendations, improving user satisfaction and booking efficiency.

A significant number of listings with 0-day availability may indicate inactive or neglected listings, which can clutter search results and frustrate users. Lower availability in boroughs like Staten Island or Bronx could signal missed market potential or low host retention, which, if ignored, could limit Airbnb’s long-term growth in these areas.

#### Chart - 6- Listing Locations in NYC (Map Plot)

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(10, 8))
sns.scatterplot(
    data=df[df['price'] < 500],  # Limit to reasonable prices
    x='longitude', y='latitude',
    hue='neighbourhood_group',
    palette='Set2',
    alpha=0.4,
    s=20
)
plt.title("Airbnb Listings in NYC by Neighbourhood Group")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.legend(title="Neighbourhood Group")
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot using longitude and latitude visually simulates a geographical map.

Helps identify listing density clusters and how listings are distributed across NYC.

Ideal when real map libraries (e.g., Folium) are not used but location data exists.

##### 2. What is/are the insight(s) found from the chart?

Most listings are concentrated in Manhattan and Brooklyn, forming dense geographic clusters.

Queens and Bronx have fewer, more scattered listings.

Certain areas show very high listing overlap, indicating hotspots of demand.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes.

Oversaturation in dense zones (e.g., Manhattan) may lead to high competition and declining profits for hosts.

Underrepresented areas could indicate missed market opportunities or lack of support, leading to uneven platform growth.

#### Chart - 7- Minimum Nights vs. Number of Reviews (Scatter Plot)

In [None]:
# Chart - 7 visualization code
# Scatter plot to examine how minimum nights affects reviews

plt.figure(figsize=(10, 6))
sns.scatterplot(
    data=df[df['minimum_nights'] < 30],  # Filter extreme outliers
    x='minimum_nights',
    y='number_of_reviews',
    alpha=0.5
)
plt.title("Minimum Nights vs Number of Reviews")
plt.xlabel("Minimum Nights Required")
plt.ylabel("Number of Reviews")
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot shows the relationship between two numeric variables.

Ideal for detecting how minimum night requirements affect guest activity (measured via reviews).

##### 2. What is/are the insight(s) found from the chart?

Listings with lower minimum night requirements (1–5 nights) generally receive more reviews.

As minimum_nights increases, the number of reviews tends to drop sharply.

Suggests that flexible stay options attract more bookings.
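To put numbers on the drop-off described above, listings can be grouped into stay-length bands and their review counts compared. This is a minimal sketch: the band edges (1-5, 6-29, 30+ nights), the helper name `reviews_by_stay_band`, and the sample values are assumptions chosen for illustration.

```python
import pandas as pd

def reviews_by_stay_band(listings: pd.DataFrame) -> pd.DataFrame:
    """Listing count and median review count per minimum-nights band."""
    bands = pd.cut(
        listings["minimum_nights"],
        bins=[0, 5, 29, float("inf")],
        labels=["1-5", "6-29", "30+"],
    )
    return listings.groupby(bands, observed=True)["number_of_reviews"].agg(
        listings="count", median_reviews="median"
    )

# Made-up values for illustration only
sample = pd.DataFrame({
    "minimum_nights": [1, 2, 3, 7, 14, 30, 60],
    "number_of_reviews": [50, 40, 30, 10, 8, 2, 0],
})
stay_summary = reviews_by_stay_band(sample)
print(stay_summary)
```

Note that review counts are only a proxy for bookings: a 30-night stay generates at most one review per month, so some of the gap is mechanical rather than a sign of lower demand.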

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes.

Airbnb can recommend lower minimum nights for new or low-performing hosts to increase engagement.

Hosts can adjust stay requirements to balance booking frequency and operational workload.

Improves guest satisfaction by offering more flexible stay durations.

Listings with very high minimum nights may receive very few bookings, leading to poor visibility and lost revenue.

Rigid policies may push guests to competitors offering shorter stays or more flexibility.

#### Chart - 8- Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', linewidths=0.5)
plt.title("Correlation Between Numerical Features")
plt.show()

##### 1. Why did you pick the specific chart?

A correlation heatmap is ideal for exploring relationships between multiple numerical variables at once.

It helps identify:

Strong or weak linear correlations

Potential predictor variables for modeling

Redundant features that may not add value

Perfect for summarizing complex inter-variable relationships in a single visual.

##### 2. What is/are the insight(s) found from the chart?

Most numerical features correlate only weakly with price, indicating that pricing is influenced by a mix of factors that a linear correlation does not capture (e.g., location, room type, amenities).

Strongest positive correlation:

reviews_per_month ↔ number_of_reviews — logical, since more reviews monthly means more total reviews.

calculated_host_listings_count and availability_365 have very low correlation with other variables, suggesting they behave independently.
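Reading the strongest relationships off an annotated heatmap is error-prone, so it can help to list the top pairs by absolute correlation. The following sketch assumes a hypothetical helper `strongest_pairs`; the sample columns `a`, `b`, `c` are made up.

```python
import numpy as np
import pandas as pd

def strongest_pairs(listings: pd.DataFrame, top: int = 5) -> pd.Series:
    """Return the `top` column pairs with the largest absolute Pearson correlation."""
    corr = listings.corr(numeric_only=True)
    # Keep the strict upper triangle so each pair appears once and self-pairs are dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(top)

# Made-up columns for illustration only
sample = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # perfectly correlated with "a"
    "c": [5, 3, 4, 1, 2],
})
strongest = strongest_pairs(sample)
print(strongest)
```

On the real data, `strongest_pairs(df)` would also include `id`, `host_id`, `latitude`, and `longitude`, since they are numeric; dropping those columns first may give a cleaner ranking.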

#### Chart - 9 - Pair Plot

In [None]:
# Pair Plot visualization code
# Select important numeric columns for pair plot
pairplot_cols = [
    'price',
    'minimum_nights',
    'number_of_reviews',
    'reviews_per_month',
    'calculated_host_listings_count',
    'availability_365'
]

# Filter to remove extreme price values and reduce overplotting
df_pair = df[df['price'] < 500].sample(2000, random_state=1)

# Pair plot
sns.pairplot(df_pair[pairplot_cols], diag_kind='kde', corner=True, plot_kws={'alpha': 0.4, 's': 20})
plt.suptitle("Pair Plot of Key Numerical Features (Sampled)", y=1.02, fontsize=14)
plt.show()


##### 1. Why did you pick the specific chart?

A pair plot is great for visualizing all pairwise relationships between selected numerical features in a single chart.

Helps detect:

Linear or non-linear trends

Outliers or clusters

Possible correlations between variables (e.g., reviews vs. price)

Also shows distributions of each variable along the diagonal, giving insights into skewness or spread.

Useful for feature selection and spotting patterns before modeling.


##### 2. What is/are the insight(s) found from the chart?

Price has no strong linear correlation with other numeric features — confirming it's influenced by a mix of factors (e.g., location, room type).

Number of reviews and reviews per month show a strong positive relationship, as expected.

Minimum nights is highly skewed — most listings have short stays, but a few require long-term bookings (outliers).

Availability shows a dense cluster near 0 and 365 — suggesting many listings are either fully available year-round or rarely active.

Apart from the expected link between number_of_reviews and reviews_per_month, no strong multicollinearity is visible, which is good for further modeling.
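The skew noted above can be quantified by comparing mean, median, and sample skewness for the columns of interest. A short sketch follows, with a hypothetical helper `skew_report` and made-up values.

```python
import pandas as pd

def skew_report(listings: pd.DataFrame, columns) -> pd.DataFrame:
    """Mean, median and sample skewness for the selected numeric columns."""
    subset = listings[list(columns)]
    return pd.DataFrame({
        "mean": subset.mean(),
        "median": subset.median(),
        "skew": subset.skew(),
    })

# Made-up values for illustration only
sample = pd.DataFrame({
    "minimum_nights": [1, 1, 2, 2, 3, 3, 365],
    "availability_365": [0, 0, 0, 180, 365, 365, 365],
})
report = skew_report(sample, ["minimum_nights", "availability_365"])
print(report)
```

A mean far above the median together with a large positive skew indicates a long right tail, which is the pattern described for `minimum_nights`.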

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To enhance customer experience, host performance, and market growth, Airbnb can take the following actions based on EDA insights:

1. **Encourage Flexible Stays**  
   - Listings with lower minimum night requirements tend to get more reviews.  
   - Suggest new or low-performing hosts start with **1–3 night minimums** to increase visibility and bookings.

2. **Optimize Pricing Guidance**  
   - Price does not strongly correlate with other numerical features, indicating multiple influencing factors.  
   - Use **room type and location** to provide smart pricing suggestions to help hosts remain competitive.

3. **Activate Inactive Listings**  
   - Thousands of listings have 0 availability or no reviews.  
   - Prompt hosts to **update calendars**, or **deactivate inactive listings** to improve platform quality and search relevance.

4. **Target Untapped Locations**  
   - Most listings are concentrated in Manhattan and Brooklyn.  
   - Increase outreach and incentives for hosting in **Queens, Bronx, and Staten Island** to balance supply and unlock new demand.

5. **Support High-Performing Hosts**  
   - Hosts with consistent availability and frequent reviews indicate strong engagement.  
   - Offer these hosts **badges, boosts, or rewards** to retain quality supply and motivate others.

# **Conclusion**

This Exploratory Data Analysis (EDA) on the Airbnb NYC 2019 dataset provided valuable insights into listing behavior, pricing trends, guest engagement, and host activity across different boroughs of New York City.

Key takeaways include:
- **Manhattan and Brooklyn** dominate in terms of listing volume and host activity.
- **Entire home/apartment** listings are the most expensive and common, while **private rooms** offer budget-friendly options with high engagement.
- **Lower minimum night requirements** are associated with more reviews, suggesting higher booking frequency.
- A large number of listings have either **no reviews or 0 availability**, pointing to inactive or underutilized properties.
- **Price** shows little correlation with other numerical features, indicating it is influenced by multiple qualitative factors like location, amenities, and room type.

Overall, the insights from this analysis can help Airbnb:
- Improve platform efficiency
- Optimize search and pricing algorithms
- Expand into underrepresented markets
- Guide new hosts to success

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***