# **Project Name**    - Airbnb Bookings Analysis




##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**


This project analyzes Airbnb listings in New York City for the year 2019. The dataset contains detailed information about over 48,000 listings across five boroughs: Manhattan, Brooklyn, Queens, Bronx, and Staten Island. Key attributes include host details, neighborhood, room type, price, availability, and number of reviews. The primary goal of this analysis is to uncover patterns in pricing, availability, and spatial distribution of listings. Descriptive statistics and visualizations are used to explore the popularity of different boroughs and room types. The project also investigates factors influencing listing prices and identifies outliers. We examine which areas have the highest listing density and how prices vary across neighborhoods. Insights are drawn on affordability, host activity, and year-round availability. The findings can support better decision-making for travelers, hosts, and policymakers by clarifying trends in the short-term rental market in NYC.

# **GitHub Link -**

https://github.com/tanvisolanke9-lgtm/Airbnb---Booking---Analysis.git

# **Problem Statement**



This project aims to analyze Airbnb listings in NYC (2019) to identify trends in pricing, availability, and room types across neighborhoods. The goal is to uncover insights that help travelers, hosts, and policymakers make informed decisions in the short-term rental market.




#### **Define Your Business Objective?**


The primary business objective of this project is to perform an in-depth analysis of Airbnb listings in New York City for the year 2019, with the goal of generating meaningful insights that can benefit key stakeholders — including hosts, travelers, and policymakers.  
For Airbnb hosts, the objective is to identify trends in pricing, availability, and customer preferences across different boroughs and room types. This can help them optimize their listings to increase visibility, occupancy rates, and overall revenue.  
For travelers, the analysis aims to highlight the most affordable and popular areas, enabling better decision-making when choosing where to stay based on budget, convenience, and listing quality.  
For city authorities and policymakers, the project seeks to uncover patterns in listing density, pricing outliers, and year-round availability that could inform policy decisions around housing regulations, zoning, and short-term rental restrictions.  
Ultimately, this analysis is designed to make the short-term rental ecosystem more transparent, efficient, and beneficial for everyone involved by using data-driven insights to address key business questions.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load the dataset
df = pd.read_csv('/content/Airbnb NYC 2019.csv')

### Dataset First View

In [None]:
# Dataset First Look
print(df.head())

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(df.shape)

### Dataset Information

In [None]:
# Dataset Info
df.info()  # info() prints its summary directly and returns None, so no print() is needed

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print('Number of duplicated rows:', duplicate_count)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = df.isnull().sum()
missing_values = missing_values[missing_values > 0]
print("Missing values in each column:")
print(missing_values)

In [None]:
# Visualizing the missing values
# Set figure size
plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

The Airbnb NYC 2019 dataset contains detailed information about 48,895 Airbnb listings across New York City’s five boroughs: Manhattan, Brooklyn, Queens, Bronx, and Staten Island. It has 16 columns, covering attributes such as listing ID, name, host details, location, room type, price, number of reviews, and availability. Listings fall into three room types: entire homes/apartments, private rooms, and shared rooms. An initial check shows no duplicate rows. Some values are missing in the name, host_name, last_review, and reviews_per_month columns; the last two are missing in over 10,000 rows, most likely because those listings have never received a review. The columns are a mix of strings, numerical values, and one date field (last_review), which is read as text until it is converted. This dataset is well suited to exploring pricing trends, room availability, neighborhood popularity, and host activity, and can support decisions for hosts, travelers, and city policymakers alike.
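The explanation that the review fields are missing because a listing has no reviews can be checked rather than assumed. The sketch below uses a small made-up frame (`toy` is hypothetical, not real listings) with the same column names; on the real data the same comparison could be run against `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that mimic the relevant columns; not real listings.
toy = pd.DataFrame({
    "number_of_reviews": [0, 12, 0, 3],
    "last_review": [None, "2019-05-01", None, "2018-11-20"],
    "reviews_per_month": [np.nan, 0.8, np.nan, 0.1],
})

no_reviews = toy["number_of_reviews"] == 0
missing_rate = toy["reviews_per_month"].isna()
missing_date = toy["last_review"].isna()

# True only when both review fields are missing on exactly the zero-review rows.
consistent = bool((no_reviews == missing_rate).all() and (no_reviews == missing_date).all())
print("Missing review fields match zero-review listings:", consistent)
```

If the check comes back `False` on the real data, the missing values have some other cause and filling `reviews_per_month` with 0 would need a second look.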

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(df.columns)

In [None]:
# Dataset Describe
print(df.describe())

### Variables Description

The Airbnb NYC 2019 dataset includes 16 columns, each describing one aspect of an individual Airbnb listing in New York City:  
1) id: Unique identifier for the listing.  
2) name: Title or name of the listing (e.g., “Modern Studio in Midtown”).  
3) host_id: Unique identifier for the host.  
4) host_name: Name of the host.  
5) neighbourhood_group: Borough where the listing is located (e.g., Manhattan, Brooklyn).  
6) neighbourhood: Specific neighborhood within the borough.  
7) latitude: Geographical latitude coordinate of the listing.  
8) longitude: Geographical longitude coordinate of the listing.  
9) room_type: Type of accommodation offered: Entire home/apt, Private room, Shared room.  
10) price: Price per night in US dollars.  
11) minimum_nights: Minimum number of nights required per booking.  
12) number_of_reviews: Total number of reviews received for the listing.  
13) last_review: Date when the most recent review was left.  
14) reviews_per_month: Average number of reviews received per month.  
15) calculated_host_listings_count: Number of listings the host has on Airbnb.  
16) availability_365: Number of days the listing is available in a year (0–365).







### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_values = df.nunique()
print('Unique values in each column:\n', unique_values)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
#1  Preview dataset
print("Original shape:", df.shape)
print("Initial missing values:\n", df.isnull().sum())

#2 Drop duplicate rows
df = df.drop_duplicates()

#3 Drop rows where essential columns are missing
df = df.dropna(subset=['name', 'host_name'])

#4 Fill missing values
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)
df['last_review'] = pd.to_datetime(df['last_review'], errors='coerce')

#5  Remove extreme outliers
df = df[df['price'] <= 1000]                # Remove listings priced above $1000
df = df[df['minimum_nights'] <= 365]        # Remove listings with unrealistic minimum stay

#6 Fix data types
df['host_id'] = df['host_id'].astype(str)
df['id'] = df['id'].astype(str)

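#7 Feature Engineering: Create price category
# include_lowest=True keeps a price of exactly 0 in the 'Low' bin instead of leaving it as NaN
df['price_category'] = pd.cut(df['price'],
                              bins=[0, 100, 200, 500, 1000],
                              labels=['Low', 'Medium', 'High', 'Luxury'],
                              include_lowest=True)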

#8 Reset index
df = df.reset_index(drop=True)

#9 Final summary
print("\nCleaned shape:", df.shape)
print("\nRemaining missing values:\n", df.isnull().sum())
print("\nData types:\n", df.dtypes)




### What all manipulations have you done and insights you found?

Data manipulations performed:  
1) Loaded and previewed the dataset: it has 48,895 rows and 16 columns.  
2) Removed duplicate rows: this ensures no exact records are repeated. Result: no duplicate rows were found.  
3) Handled missing values: dropped rows with a missing name or host_name (essential info), filled reviews_per_month with 0 where reviews are missing, and converted last_review to datetime, leaving missing dates as NaT.  
4) Removed outliers: dropped listings with price > $1000 or minimum_nights > 365.  
5) Fixed data types: converted id and host_id to strings for clarity, and parsed last_review to datetime for time-based analysis.  
6) Created a new feature, price_category, by binning prices into Low, Medium, High, and Luxury.  

Insights found so far:  
1) Manhattan and Brooklyn dominate Airbnb listings: most listings are concentrated in these two boroughs.  
2) Room type distribution: the majority of listings are entire homes/apartments and private rooms.  
3) Price distribution is skewed: most listings fall below $200.  
4) Missing reviews: over 10,000 listings have no reviews (last_review is missing).  
5) Host behavior: some hosts own multiple listings, which may indicate commercial operations.  
6) Data quality is reasonably good: after cleaning, the remaining missing values are mainly the last_review dates of listings without reviews (kept as NaT), and data types are consistent.  
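One detail of the price binning is worth illustrating: `pd.cut` builds intervals that are open on the left by default, so a value equal to the first bin edge falls outside every bin unless `include_lowest=True` is passed. The sketch below uses made-up prices (the `prices` series is hypothetical) with the same bin edges and labels as the wrangling step.

```python
import pandas as pd

# Hypothetical prices chosen to sit on and around the bin edges.
prices = pd.Series([0, 100, 101, 200, 500, 1000])
bins = [0, 100, 200, 500, 1000]
labels = ["Low", "Medium", "High", "Luxury"]

# Default behaviour: intervals are (0, 100], (100, 200], ... so 0 is not covered.
default_cut = pd.cut(prices, bins=bins, labels=labels)

# With include_lowest=True the first interval also contains its left edge.
inclusive_cut = pd.cut(prices, bins=bins, labels=labels, include_lowest=True)

comparison = pd.DataFrame({
    "price": prices,
    "default": default_cut,
    "include_lowest": inclusive_cut,
})
print(comparison)
```

Whether a listing priced at 0 should count as "Low" or be dropped as invalid is a judgment call; the point is only that the default silently leaves it uncategorized.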












## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart -1 (univariate analysis) numerical  
1)What is the distribution of price of listings?   

In [None]:
# Chart - 1 visualization code
# Set a reasonable upper limit to remove outliers for better visualization
upper_limit = df['price'].quantile(0.95)
filtered_prices = df[df['price'] <= upper_limit]['price']

#plotting the distribution of price
plt.figure(figsize=(10,6))
plt.hist(filtered_prices, bins=50, edgecolor ='Black')
plt.title('Distribution of Airbnb listing prices (up to the 95th percentile)')
plt.xlabel('Price')
plt.ylabel('Number of Listings')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a histogram because it is a standard and effective tool for visualizing the frequency distribution of a continuous variable like price. It clearly reveals patterns, central tendency, spread, and skewness.  

##### 2. What is/are the insight(s) found from the chart?

Most listings are priced between $50 and $200. The distribution is right-skewed, with a few high-priced outliers. Because of this skewness, the median price is a more reliable summary than the mean. Budget to mid-range listings dominate the NYC Airbnb market.
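The point about the median being more reliable than the mean under right skew can be shown with a few numbers. This is a minimal sketch on made-up prices (the `prices` series is hypothetical); on the real data the same three calls could be applied to `df['price']`.

```python
import pandas as pd

# Hypothetical nightly prices: mostly budget to mid-range, plus one expensive outlier.
prices = pd.Series([60, 75, 90, 100, 120, 150, 180, 950])

mean_price = prices.mean()      # pulled upward by the single high value
median_price = prices.median()  # middle of the sorted values, unaffected by the outlier
skewness = prices.skew()        # positive for a right-skewed distribution

print(f"mean={mean_price:.2f}, median={median_price:.2f}, skew={skewness:.2f}")
```

A mean well above the median together with a positive skew value is the numerical counterpart of the long right tail seen in the histogram.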





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:  
Helps set competitive prices ($50–$200) to attract more bookings. Reveals an opportunity in mid-to-high range pricing. Supports better marketing and pricing strategies.  
Negative Impact:  
Overpricing leads to fewer bookings. Misjudging the market because of outliers may hurt new or luxury listings.

#### Chart - 2 (univariate analysis) numerical
2) What is the average number of reviews_per_month?

In [None]:
# Chart - 2 visualization code
# Drop missing values for visualization
filtered_reviews = df['reviews_per_month'].dropna()

# Calculate the mean of reviews_per_month
mean_reviews = filtered_reviews.mean()

#plot histogram
plt.figure(figsize=(10,6))
plt.hist(filtered_reviews, bins=50, edgecolor= 'Black')
plt.axvline(mean_reviews, color='red', linestyle='dashed', linewidth=2, label=f'Mean = {mean_reviews:.2f}')
plt.title('Distribution of reviews per month')
plt.xlabel('Average Reviews per month')
plt.ylabel('Number of listings')
plt.legend()  # display the label of the mean line
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with a mean line clearly shows the distribution of reviews per month.  

It highlights the average value in context, making it easy to interpret.
This chart reveals skewness, spread, and frequency of review counts.
It's ideal for understanding how typical or unusual the average really is.

##### 2. What is/are the insight(s) found from the chart?

Most listings receive fewer than 2 reviews per month, indicating low review frequency. The distribution is right-skewed, with a few listings getting a very high number of reviews. These high-review outliers pull the average slightly to the right.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:  
Helps hosts understand that frequent reviews are rare, which sets realistic expectations. Encourages new hosts to optimize their listings (photos, descriptions, pricing) to increase bookings and reviews.  
Negative Growth:  
Hosts may overestimate demand, expecting frequent reviews and high occupancy. Listings with few reviews may seem inactive, leading to lower trust and fewer bookings if not managed well.



#### Chart - 3 (univariate analysis) categorical
3)What are the most common room_types in NYC?

In [None]:
# Chart - 3 visualization code
# Count of each room_type
room_type_counts = df['room_type'].value_counts()

#Bar plot
plt.figure(figsize=(8,5))
room_type_counts.plot(kind='bar', color='green', edgecolor='black')
plt.title('Most common room types in nyc airbnb listings')
plt.xlabel('Room types')
plt.ylabel('Number of listings')
plt.xticks(rotation=0)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is ideal for visualizing categorical data like room_type.
It clearly shows the count of listings for each type, making comparisons easy.
This helps identify the most and least common room types in NYC.  
It’s more readable than a pie chart, especially when proportions are close.
The length of the bars directly reflects popularity, enhancing quick insights.
Overall, it's the most effective way to present this type of frequency data.

##### 2. What is/are the insight(s) found from the chart?

Entire home/apartment is the most common room type in NYC Airbnb listings. Private rooms are also popular, indicating demand for budget options. Shared rooms are the least common.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:  
Hosts can align their offering with demand by listing entire homes, which are the most common type. The insights help platforms promote the most popular room types, boosting revenue. Airbnb can design personalized recommendations based on room-type popularity.  
Negative Growth Risks:  
Oversupply of entire homes may lead to market saturation, reducing occupancy and profits. Neglecting less common types (e.g., shared rooms) may miss out on budget-conscious travelers.



#### Chart - 4 (bivariate analysis) categorical - numerical
4) How does the price vary across different room_types?

In [None]:
# Chart - 4 visualization
# Filter out extreme outliers for better visualization (e.g., top 5% prices)
filtered_df = df[df['price'] <= df['price'].quantile(0.95)]

# create a box plot
plt.figure(figsize=(10,6))
sns.boxplot(x='room_type', y='price', data=filtered_df, palette='pastel')
plt.title('Price distribution by room type')
plt.xlabel('Room type')
plt.ylabel('price')
plt.show()


##### 1. Why did you pick the specific chart?

A box plot is the best choice because it shows the spread, median, and outliers of prices for each category.

##### 2. What is/are the insight(s) found from the chart?

Entire home/apartment listings have the highest median price, indicating they are premium offerings. Private rooms are priced significantly lower, making them ideal for budget travelers. Shared rooms are the cheapest, and they make up only a small share of listings. The data confirms that room type has a strong influence on price.
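The medians read off the box plot can also be tabulated directly. The sketch below is one way to do that with `groupby`, run on a small made-up frame (`toy` and its prices are hypothetical); with the real data, `filtered_df` could take its place.

```python
import pandas as pd

# Hypothetical listings; prices are invented for illustration only.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 3 + ["Private room"] * 3 + ["Shared room"] * 2,
    "price": [150, 200, 250, 60, 80, 100, 35, 45],
})

# Per-room-type summary, ordered from the most to the least expensive median.
summary = (
    toy.groupby("room_type")["price"]
    .agg(["count", "median", "min", "max"])
    .sort_values("median", ascending=False)
)
print(summary)
```

Keeping `count` next to the median makes it harder to over-read a category that has only a handful of listings.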



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:  
Helps hosts price competitively based on room type and market expectations.  
Encourages hosts to offer private rooms in costly areas to attract budget travelers.  
Platforms can design targeted promotions for each room type and price band.  

Negative Growth:  
Overpricing private/shared rooms may lead to low occupancy and poor reviews.  
Ignoring room-type-based pricing can result in misaligned value and lost bookings.

#### Chart - 5 (bivariate analysis) numerical - numerical  
5) How does price relate to minimum_nights?

In [None]:
# Chart - 5 visualization code
# Filter out extreme outliers for better visualization
filtered_df = df[
    (df['price'] <= df['price'].quantile(0.95)) &
    (df['minimum_nights'] <= df['minimum_nights'].quantile(0.95))
]

#plotting scatter plot
plt.figure(figsize=(10,6))
plt.scatter(filtered_df['minimum_nights'], filtered_df['price'], alpha=0.5)
plt.title('Price vs Minimum Nights')
plt.xlabel('Minimum Nights')
plt.ylabel('Price')
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot is a good choice because both price and minimum_nights are numerical variables, and a scatter plot is ideal for showing the relationship or pattern between two such variables. Because both variables have extreme outliers, values above the 95th percentile are excluded for clarity.

It helps visualize whether higher minimum stays go with higher or lower prices. Unlike bar or box plots, scatter plots show the individual data points and their spread.



##### 2. What is/are the insight(s) found from the chart?

Most listings have a low minimum night requirement (1–5 nights).
There’s no strong correlation between price and minimum nights overall. A few listings with very high minimum nights tend to have lower or irregular prices. Some outliers exist where listings require many nights but have very low or very high prices.
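A visual impression of "no strong correlation" can be backed by a coefficient. The sketch below computes Pearson's r and a rank-based (Spearman-style) coefficient on made-up values (`toy` is hypothetical); the rank version is obtained by ranking both columns first and then applying the ordinary Pearson correlation, which is less sensitive to outliers.

```python
import pandas as pd

# Hypothetical listings; the values are invented for illustration only.
toy = pd.DataFrame({
    "minimum_nights": [1, 2, 3, 5, 7, 30],
    "price": [120, 80, 150, 95, 110, 100],
})

# Linear association between the raw values.
pearson = toy["minimum_nights"].corr(toy["price"])

# Monotonic association: Pearson correlation of the ranks.
spearman = toy["minimum_nights"].rank().corr(toy["price"].rank())

print(f"Pearson r = {pearson:.3f}, rank correlation = {spearman:.3f}")
```

Coefficients close to 0 support the reading of the scatter plot; they describe association only and say nothing about cause.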





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:  
Hosts can confidently set flexible minimum nights without worrying about price loss.  
Encourages shorter stays, which are more attractive to tourists and can increase booking frequency.  
Helps platforms educate new hosts that pricing isn't heavily dependent on minimum stay.  

Negative Growth:  
Setting high minimum nights may reduce bookings, especially for short-term tourists.

#### Chart - 6 (bivariate analysis) categorical - categorical
6) What is the count of listings by neighbourhood_group and room_type?

In [None]:
# Chart - 6 visualization code
# Create a grouped count table
grouped_counts = df.groupby(['neighbourhood_group', 'room_type']).size().reset_index(name='count')

#plot
plt.figure(figsize=(10,6))
sns.barplot( data=grouped_counts, x='neighbourhood_group', y='count',hue='room_type',palette='Set2')
plt.title('Count of listings by neighbourhood group and room type')
plt.xlabel('Neighbourhood group')
plt.ylabel('Number of listings')
plt.legend(title='Room Type')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

The data involves two categorical variables: neighbourhood_group and room_type. A grouped bar chart clearly shows comparisons within and across categories. It lets us see how each room type is distributed in every neighbourhood group, and it is easier to read than a stacked bar chart when comparing individual values.

##### 2. What is/are the insight(s) found from the chart?

Manhattan has the highest number of listings, especially for entire homes/apartments.  
Brooklyn also has a large share, but private rooms are more common there.  
Queens and Bronx have fewer listings overall, mostly private rooms.  
Staten Island has the least number of listings, across all room types.  
Entire homes dominate in Manhattan, while private rooms are more popular in other boroughs.  
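Statements such as "entire homes dominate in one borough, private rooms in another" are about the mix within each borough, which a row-normalized cross-tabulation shows directly. A minimal sketch with `pd.crosstab` on made-up rows (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical listings; the rows are invented for illustration only.
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan",
                            "Brooklyn", "Brooklyn", "Brooklyn",
                            "Queens", "Queens"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Private room", "Private room", "Entire home/apt",
                  "Private room", "Shared room"],
})

# Raw counts: the same numbers the grouped bar chart plots.
counts = pd.crosstab(toy["neighbourhood_group"], toy["room_type"])

# Share of each room type within a borough (each row sums to 1).
shares = pd.crosstab(
    toy["neighbourhood_group"], toy["room_type"], normalize="index"
).round(2)

print(counts)
print(shares)
```

Shares make boroughs of very different size comparable, which raw counts do not.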

  


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:  
Helps hosts tailor their room offerings based on borough trends (e.g., entire homes in Manhattan).  
Airbnb can focus marketing and investment efforts in high-performing areas like Manhattan and Brooklyn.  
Insights support localized pricing and promotion strategies by borough and room type.  

 Negative Growth:  
 Over-saturating areas like Manhattan may lead to high competition and lower occupancy.



#### Chart - 7 (Univariate analysis)
7) How many listings exist in each neighbourhood group?

In [None]:
# Chart - 7 visualization code
# Count of listings by neighbourhood group
neighbourhood_counts = df['neighbourhood_group'].value_counts()

# plotting the counts
plt.figure(figsize=(8,5))
sns.barplot(x=neighbourhood_counts.index, y=neighbourhood_counts.values,color= 'Black')
plt.title('Count of listings by neighbourhood group')
plt.xlabel('Neighbourhood group')
plt.ylabel('Number of listings')
plt.show()





##### 1. Why did you pick the specific chart?

A bar chart is the most suitable choice for categorical comparison. It makes it easy to compare counts across distinct categories like neighbourhood groups, and it is easy to enhance with labels, sorting, or colors to communicate insights effectively in reports or presentations.

##### 2. What is/are the insight(s) found from the chart?

1) Brooklyn and Manhattan dominate the market  
These two neighbourhood groups have the highest number of listings, indicating they are the most popular areas for Airbnb rentals.  
2) Queens has a moderate number of listings  
While not as dominant as Brooklyn or Manhattan, Queens still contributes a fair number of listings, possibly due to proximity to airports and affordability.  
3) Bronx and Staten Island have the fewest listings  
These areas may be less popular among tourists, have stricter local regulations, or have fewer hosts participating on Airbnb.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:  
1) Target high-demand areas (Brooklyn and Manhattan).  
2) Optimized resource allocation.  
3) Service expansion opportunities.  

Negative Growth:  
1) Market saturation in Brooklyn and Manhattan.  
2) Neglected potential in Bronx and Staten Island, which have very few listings.

#### Chart - 8 (Bivariate analysis)
8) Which neighbourhood groups have the most availability?

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x="neighbourhood_group", y="availability_365", color='green')
plt.title("Availability (per year) by Neighbourhood Group")
plt.xlabel("Neighbourhood Group")
plt.ylabel("Availability in Days")
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A box plot shows distributions and comparisons clearly. It marks the median availability (the line inside the box) for each neighbourhood group, and the height of the box (the interquartile range) shows how spread out the availability values are. Availability ranges from 0 to 365 days, and a box plot is well suited to skewed or uneven data like this.

##### 2. What is/are the insight(s) found from the chart?

Higher median availability:
A group with a higher median line (middle of the box) typically has more listings available for most of the year.  
Spread and outliers:  
A taller box or more outliers means more variability: some listings are available only a few days, others all year.
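The median and interquartile range that the box plot draws can be computed per group, which makes it possible to name the boroughs with the highest availability instead of reading them off the figure. A minimal sketch on made-up values (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical listings; availability values are invented for illustration only.
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Staten Island"] * 4,
    "availability_365": [0, 10, 50, 300, 90, 200, 250, 365],
})

# Quartiles per borough: the same three values a box plot draws as the box.
quartiles = (
    toy.groupby("neighbourhood_group")["availability_365"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["q25", "median", "q75"]

# Interquartile range: the height of the box.
quartiles["iqr"] = quartiles["q75"] - quartiles["q25"]

print(quartiles.sort_values("median", ascending=False))
```

Sorting by the median answers the chart's question (which groups have the most availability) with numbers that can be quoted in the write-up.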

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:  
Optimize Listing Strategy Based on Availability Trends  
Target Low-Availability Areas with High Potential  
Predict Seasonality and Planning  

Negative Growth  
Low Availability in Certain Neighbourhood Groups  
High Availability But Low Occupancy Risk  

#### Chart - 9 (univariate analysis)
9) What is the distribution of number of reviews per listing?

In [None]:
# Chart - 9 visualization code
# Set a style
sns.set(style="whitegrid")

# Create the figure (a single plot, so no subplot grid is needed)
plt.figure(figsize=(10, 6))

# Plot histogram with a KDE overlay
sns.histplot(df['number_of_reviews'], bins=50, kde=True, color='Black')
plt.title('Distribution of number of reviews')
plt.xlabel('Number of reviews')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A histogram is chosen because it clearly shows the distribution and frequency of review counts, helps identify skewness, and makes outliers easy to detect. It's ideal for understanding how reviews are spread across listings.

##### 2. What is/are the insight(s) found from the chart?

1) Most listings have few reviews:  
The majority of listings have fewer than 50 reviews, indicating limited guest engagement or newer hosts.  
2) Right-skewed distribution:  
A small number of listings have very high review counts, showing they are high-performing or long-standing properties.  
3) Presence of outliers:  
Some listings have an extremely high number of reviews; these could be top-rated or highly booked properties worth analyzing further.
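The claim that most listings stay under 50 reviews can be quantified as a share, alongside a couple of percentiles. The sketch below uses a made-up series of review counts (`reviews` is hypothetical); on the real data it would be `df['number_of_reviews']`.

```python
import pandas as pd

# Hypothetical review counts; invented to mimic a long right tail.
reviews = pd.Series([0, 0, 1, 2, 5, 9, 14, 30, 48, 120, 400])

# The mean of a boolean mask is the fraction of listings that satisfy it.
share_under_50 = (reviews < 50).mean()
share_zero = (reviews == 0).mean()

# Median and 95th percentile describe the bulk and the tail of the distribution.
p50, p95 = reviews.quantile([0.5, 0.95])

print(f"under 50 reviews: {share_under_50:.1%}, no reviews: {share_zero:.1%}")
print(f"median = {p50}, 95th percentile = {p95}")
```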

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:  
1) Identify high-performing listings: listings with many reviews are likely popular, well-managed, and highly rated.  
2) Target low-engagement listings: listings with few or no reviews may need better pricing, photos, or guest experience improvements.  

Negative Growth:  
1) Listings with consistently low or no reviews may indicate poor guest experience, lack of visibility, or host inactivity.  
2) If many such listings exist and are ignored, the overall reputation of the platform can suffer.

#### Chart - 10 (Bivariate analysis)
10)Is there a relation between availability and minimum nights?

In [None]:
# Chart - 10 visualization
#Filter extreme outliers for clarity
df_filtered = df[(df['minimum_nights'] <= 30) & (df['availability_365'] <= 365)]

#plotting a scatter plot
plt.figure(figsize =(10,6))
sns.scatterplot(data=df_filtered, x='minimum_nights', y='availability_365', alpha=0.5  )
plt.title('Availability vs minimum nights')
plt.xlabel('minimum nights')
plt.ylabel('availability 365')
plt.grid(True)
plt.show()



##### 1. Why did you pick the specific chart?

1) Both variables are numerical: minimum_nights and availability_365 are numerical variables, which makes a scatter plot ideal for identifying patterns or trends.  
2) Reveals relationships and clusters: it helps detect whether listings with higher minimum stays tend to have lower availability or vice versa.  
3) Identifies outliers: it is easy to spot listings with extremely high minimum nights or unusual availability.  


##### 2. What is/are the insight(s) found from the chart?

1) Most listings have a minimum stay of 1 to 7 nights.  
2) These listings tend to show high availability throughout the year.  
3) The chart suggests a trend: shorter minimum stays go with higher availability.  
4) Listings with long minimum stays (e.g., 30 nights, the upper limit of this plot) usually have low availability.
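Because many points overlap in a scatter plot, a trend between minimum stay and availability is easier to confirm by bucketing the minimum stay and comparing a summary per bucket. The sketch below does this on made-up values (`toy`, the bucket edges, and the labels are all hypothetical choices, not part of the original analysis).

```python
import pandas as pd

# Hypothetical listings; the values are invented for illustration only.
toy = pd.DataFrame({
    "minimum_nights": [1, 2, 3, 7, 14, 30, 30, 5],
    "availability_365": [300, 120, 0, 200, 45, 365, 10, 90],
})

# Bucket the minimum stay into short, medium, and long requirements.
buckets = pd.cut(
    toy["minimum_nights"], bins=[0, 3, 7, 30], labels=["1-3", "4-7", "8-30"]
)

# Listing count and median availability within each bucket.
summary = toy.groupby(buckets, observed=True)["availability_365"].agg(["count", "median"])
print(summary)
```

If the medians do not fall steadily from the shortest to the longest bucket on the real data, the trend stated above should be softened.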

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:  
1) Optimize listing flexibility:
Hosts can reduce minimum night requirements to increase availability and attract more short-term bookings.  
2) Improve occupancy rates:
Listings with shorter stays and higher availability are more likely to be booked frequently, boosting overall revenue.  

Negative Growth:  
1) Listings with long minimum stays and low availability may experience low occupancy and revenue loss.  
2) This reduces platform efficiency and can lead to a poor user experience if guests find fewer flexible options.  



#### Chart - 11 (multi variate analysis)
11) Which combinations of host ID, neighbourhood, and room type result in the most reviews?

In [None]:
# Chart - 11 visualization code

# Group and sum reviews
grouped = df.groupby(['host_id', 'neighbourhood', 'room_type'])['number_of_reviews'].sum().reset_index()

# Sort to find top combinations
top_combinations = grouped.sort_values(by='number_of_reviews', ascending=False).head(10)

# Plot
plt.figure(figsize=(12, 6))
sns.barplot(data=top_combinations,  x='number_of_reviews',  y=top_combinations.apply(lambda row: f"{row['host_id']} | {row['neighbourhood']} | {row['room_type']}",
            axis=1), palette='coolwarm')
plt.title('Top 10 Host-Neighbourhood-Room Type Combinations by Number of Reviews')
plt.xlabel('Total Number of Reviews')
plt.ylabel('Host ID | Neighbourhood | Room Type')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

1) Clear comparison of combinations:
A horizontal bar chart clearly shows which host–neighbourhood–room type combinations have the highest total reviews.  
2) Easy to read long labels:
Since the y-axis includes combined identifiers (Host ID | Neighbourhood | Room Type), a horizontal layout makes the chart more readable.  
3) Ideal for categorical vs numerical comparison:
A bar chart is best when comparing categories (combinations) against a numerical metric (number of reviews).  

##### 2. What is/are the insight(s) found from the chart?

The chart shows that a few host–neighbourhood–room type combinations get the most reviews, mainly in Manhattan and Brooklyn. Top-reviewed listings are usually entire homes or private rooms, showing guests prefer comfort and affordability. High-performing hosts likely offer better service and visibility. These insights highlight what attracts guests and drives engagement.









##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:  
1) Identify top-performing host strategies: Airbnb can analyze what high-review hosts are doing right (pricing, location, amenities) and promote those practices to others.  
2) Focus on high-demand areas: emphasizing listings in areas like Manhattan and Brooklyn can boost platform visibility and bookings.  

Negative Growth:  
1) Over-reliance on a few hosts or areas can lead to market saturation and reduced diversity of listings elsewhere.  
2) Hosts with low visibility or poor review counts may be left behind, leading to lower retention and earnings.

#### Chart - 12 (multivariate analysis)
12) Which neighbourhood and room type combinations have the highest average price?

In [None]:
# Chart - 12 visualization code
# Remove extreme outliers for better visibility
df_filtered = df[df['price'] <= 1000]

# Group by neighbourhood and room_type, and calculate average price
avg_price = df_filtered.groupby(['neighbourhood', 'room_type'])['price'].mean().reset_index()

# Sort by average price
top_combinations = avg_price.sort_values(by='price', ascending=False).head(10)

# Plot
plt.figure(figsize=(12, 6))
# Build one readable label per bar from the two grouping keys
labels = top_combinations.apply(lambda row: f"{row['neighbourhood']} | {row['room_type']}", axis=1)
sns.barplot(data=top_combinations, x='price', y=labels, palette='magma')
plt.title('Top 10 Neighbourhood & Room Type Combinations by Average Price')
plt.xlabel('Average Price')
plt.ylabel('Neighbourhood | Room Type')
plt.tight_layout()
plt.show()





##### 1. Why did you pick the specific chart?

Answer Here: 1) Clear ranking of combinations:
A horizontal bar chart effectively shows which neighbourhood–room type pairs have the highest average prices, making it easy to compare values side by side.  
2) Handles long category labels well:
Combining neighbourhood and room_type results in long labels, which are more readable on the y-axis of a horizontal chart.  
3) Highlights top-performing segments:
The sorted bars immediately show the top 10 most expensive combinations, helping identify premium segments quickly.  

##### 2. What is/are the insight(s) found from the chart?

Answer Here: 1) Premium listings are mostly entire homes/apartments
– These dominate the top combinations, showing guests pay more for full privacy and space.  
2) High-priced areas are mainly in Manhattan
– Manhattan neighbourhoods consistently appear at the top, confirming its status as the most expensive borough.  
3) Private rooms and shared spaces are priced lower
– These room types rarely appear in the top price combinations, highlighting their affordability-focused appeal.  
4) Luxury demand clusters in select areas
– Certain neighbourhoods like Tribeca, SoHo, and Midtown command premium pricing, indicating demand for upscale stays.  
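
A ranking by average price can be dominated by combinations that contain only one or two listings. One way to guard against that is to compute the listing count next to the mean and drop small groups before ranking. The sketch below uses invented rows. The `MIN_LISTINGS` threshold and the `toy`, `stats`, and `reliable` names are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Hypothetical listings; neighbourhood "B" has a single expensive listing.
toy = pd.DataFrame({
    "neighbourhood": ["A", "A", "A", "B", "C", "C"],
    "room_type": ["Entire home/apt"] * 6,
    "price": [200, 220, 240, 900, 300, 320],
})

MIN_LISTINGS = 2  # assumed threshold; tune it to the size of the real dataset

# Mean price and number of listings for every neighbourhood / room type pair.
stats = (
    toy.groupby(["neighbourhood", "room_type"])["price"]
       .agg(avg_price="mean", listings="count")
       .reset_index()
)

# Rank only the pairs that are backed by enough listings.
reliable = (
    stats[stats["listings"] >= MIN_LISTINGS]
    .sort_values("avg_price", ascending=False)
)
print(reliable)
```

Without the count filter, neighbourhood "B" would top the ranking on the strength of one listing.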

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here: Positive business impact:  
1) Promote premium listings in top-priced areas (e.g., Manhattan entire homes).

2) Guide hosts on pricing strategy based on location and room type.

3) Encourage investment in high-demand zones for better returns.

Negative growth:  
1) A focus on premium listings can leave fewer options for budget travelers and reduce their bookings.

2) Over-saturation in premium zones, causing competition and possible price drops.

3) Neglect of diversity, weakening platform appeal across income segments.




#### Chart - 13 (multivariate analysis)
13) What is the relation between price, number of reviews, and availability?

In [None]:
# Chart - 13 visualization code
# Remove extreme outliers for clarity
# (availability_365 is a count of days in a year, so it needs no upper-bound filter)
df_filtered = df[(df['price'] <= 500) & (df['number_of_reviews'] <= 200)]

# Plot
plt.figure(figsize=(12, 6))
scatter = plt.scatter(
    df_filtered['price'],
    df_filtered['number_of_reviews'],
    s=df_filtered['availability_365'] * 0.5 + 5,  # bubble size scaled; +5 keeps zero-availability listings visible
    alpha=0.5,
    c=df_filtered['availability_365'],
    cmap='viridis'
)
plt.colorbar(label='Availability (days)')
plt.xlabel('Price')
plt.ylabel('Number of Reviews')
plt.title('Relationship Between Price, Reviews, and Availability')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here: 1) Visualizes three variables at once:
A bubble chart shows the relationship between price (x-axis), number of reviews (y-axis), and availability (bubble size and color) in a single plot.  
2) Reveals complex patterns clearly:
It helps you quickly spot whether higher-priced listings have more or fewer reviews and how availability relates to that pattern.  
3) Highlights density and spread:
You can easily identify clusters, outliers, and trends across all three variables.

##### 2. What is/are the insight(s) found from the chart?

Answer Here: The chart shows that low-priced, highly available listings tend to get more reviews, indicating strong demand in the budget segment. Higher-priced listings usually receive fewer reviews, suggesting limited bookings. Listings with greater availability (larger bubbles) generally have more guest engagement, while a few expensive, low-reviewed listings may reflect poor value or low demand.
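
A dense bubble chart can hide the trend it is meant to show, so it may help to back the visual reading with a small summary table. The sketch below splits listings into price quartiles with `pd.qcut` and compares the median review count and mean availability per quartile. The data, the `price_band` column, and the band labels are invented for illustration.

```python
import pandas as pd

# Hypothetical listings; values are invented to illustrate the technique.
toy = pd.DataFrame({
    "price": [40, 60, 80, 100, 150, 200, 300, 450],
    "number_of_reviews": [90, 70, 60, 40, 30, 20, 10, 5],
    "availability_365": [300, 200, 250, 150, 100, 180, 60, 30],
})

# Split listings into four equally sized price bands (quartiles).
toy["price_band"] = pd.qcut(
    toy["price"], q=4,
    labels=["Q1 (cheapest)", "Q2", "Q3", "Q4 (priciest)"],
)

# Median reviews and mean availability per price band.
summary = toy.groupby("price_band", observed=True).agg(
    median_reviews=("number_of_reviews", "median"),
    mean_availability=("availability_365", "mean"),
    listings=("price", "size"),
)
print(summary)
```

If the median review count falls steadily from the cheapest to the priciest band on the real data, that would support the reading of the chart above.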

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here: Positive Business Impact:  
1)Hosts can optimize pricing to match demand, especially in the budget segment where reviews and bookings are higher.

2)Increasing availability can boost visibility and engagement, leading to more reviews and better performance.  

Negative growth:  
1)High-priced listings with low reviews may indicate overpricing or poor value, leading to lower occupancy.

2)If availability is limited, even well-priced listings may miss out on bookings, hurting revenue.

#### Chart - 14 - Correlation Heatmap
14) What is the correlation between numerical variables?

In [None]:
# Correlation Heatmap visualization code
# Select relevant numeric columns
num_cols = ['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'availability_365']

# Compute correlation matrix
correlation_matrix = df[num_cols].corr()

# Plot heatmap
plt.figure(figsize=(8, 6))
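# Fix the color scale to [-1, 1] so equal magnitudes get equal color intensity
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm',
            vmin=-1, vmax=1, linewidths=0.5)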
plt.title('Correlation Between Numerical Variables')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here: 1) Best for multiple numerical variables:
A correlation heatmap gives a compact overview of how all numerical variables are related to each other in one visual.  
2) Quickly shows strength and direction:
The color intensity and correlation values make it easy to spot strong positive or negative relationships at a glance.  
3) Easy comparison across variables:
It helps identify which variables are closely related (e.g., reviews and reviews per month) and which are only weakly correlated (e.g., price vs. availability). A correlation near zero rules out a linear relationship, not every kind of dependence.

##### 2. What is/are the insight(s) found from the chart?

Answer Here: 1) number_of_reviews and reviews_per_month show a strong positive correlation: listings with more reviews overall also tend to have more monthly review activity.  
2) minimum_nights has little to no linear correlation with the other variables.  
3) price has weak correlations with all other variables, suggesting pricing is influenced by more complex or non-numeric factors such as location and room type.  
4) availability_365 has a mild positive correlation with reviews_per_month, meaning more available listings tend to get reviewed more often.
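
`DataFrame.corr()` defaults to the Pearson coefficient, which measures linear association and is sensitive to extreme values. Because price is heavily right-skewed, a rank-based coefficient such as Spearman may be a useful complement. The sketch below compares the two on invented data with one extreme price. The `toy` frame is an assumption for illustration only.

```python
import pandas as pd

# Hypothetical data: reviews fall steadily as price rises, with one extreme price.
toy = pd.DataFrame({
    "price": [50, 60, 70, 80, 90, 5000],
    "number_of_reviews": [60, 50, 40, 30, 20, 10],
})

# Pearson measures linear association and is pulled around by the outlier.
pearson = toy.corr(method="pearson").loc["price", "number_of_reviews"]

# Spearman works on ranks, so it only asks whether the relationship is monotonic.
spearman = toy.corr(method="spearman").loc["price", "number_of_reviews"]

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

In this toy example the relationship is perfectly monotonic, so Spearman reports -1 while Pearson is noticeably weaker. A large gap between the two on the real data would suggest that outliers or non-linearity are hiding a relationship.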



#### Chart - 15 - Pair Plot
15) What are the pairwise relationships among numerical variables (e.g., price, minimum_nights, availability_365, number_of_reviews)?

In [None]:
# Pair Plot visualization code
# Select relevant numeric columns
num_cols = ['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'availability_365']
df_filtered = df[(df['price'] <= 500) & (df['minimum_nights'] <= 30) & (df['number_of_reviews'] <= 200)]

# Plot pairwise relationships
sns.pairplot(df_filtered[num_cols], diag_kind='hist', plot_kws={'alpha': 0.5})
plt.suptitle('Pairwise Relationships Among Numerical Variables', y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here: We chose a pair plot because it shows all pairwise relationships between numerical variables in a single visual. It helps identify correlations, clusters, trends, and outliers across multiple variable pairs quickly. The scatter plots show relationships, while the histograms on the diagonal reveal individual distributions, which makes it well suited to exploratory analysis.

##### 2. What is/are the insight(s) found from the chart?

Answer Here: 1) Most listings have a low price, few reviews, and short minimum nights, visible as dense clusters in the lower range.

2) Few strong linear relationships: variables like price and number of reviews show weak or scattered associations.

3) Right-skewed distributions, especially in price and number_of_reviews, with a long tail of high values.

4) Outliers remain visible even after capping price at 500 and minimum nights at 30: some listings still stand out with a high price or a long minimum stay, which makes them useful for deeper analysis.
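
Right-skewed variables tend to squeeze most points into one corner of each scatter panel. A common remedy is to apply a log transform before plotting. The sketch below uses `np.log1p` on invented prices and compares skewness before and after. The `toy` frame and the `log_price` column are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with a long right tail.
toy = pd.DataFrame({"price": [40, 50, 60, 70, 80, 100, 150, 400, 1000]})

# log1p computes log(1 + x), which also handles a price of 0 safely.
toy["log_price"] = np.log1p(toy["price"])

# Sample skewness: positive values indicate a right-skewed distribution.
skew_raw = toy["price"].skew()
skew_log = toy["log_price"].skew()

print(f"Skew of price:     {skew_raw:.2f}")
print(f"Skew of log_price: {skew_log:.2f}")
```

Passing the transformed columns to `sns.pairplot` spreads the dense cluster out, at the cost of axes that are no longer in the original units.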

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Answer Here: To enhance bookings, host performance, and revenue, the following key recommendations are made based on the data analysis:  
1) Optimize Pricing Strategy:  
Lower-priced, competitively priced listings tend to receive more reviews.

Encourage hosts to adopt dynamic pricing based on demand, season, and location.    

2) Increase Listing Availability:  
Listings with higher availability (availability_365) tend to receive more reviews per month, although the correlation is mild.

Suggest hosts keep calendars open year-round or during high-demand seasons.   

3) Target High-Demand Boroughs:  
Manhattan and Brooklyn consistently show high prices and review volumes.

Airbnb should focus marketing efforts and encourage new listings in these areas.  

4) Encourage Shorter Minimum Night Stays:  
Listings with shorter minimum stays (1–3 nights) tend to collect more reviews and engagement, although the linear correlation with minimum_nights is weak.

Suggest hosts reduce minimum stay requirements to increase flexibility and appeal.  

5) Promote Best-Performing Hosts as Role Models:  
Analyze and highlight hosts with high reviews and availability to create training models for new or underperforming hosts.  

6) Improve Listing Quality and Trust:  
Listings with more reviews likely offer a superior guest experience, although the dataset records review counts rather than ratings.

Provide guidance on improving photos, descriptions, cleanliness, and communication.
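
The minimum-stay recommendation can be checked by bucketing `minimum_nights` and comparing review activity per bucket. The sketch below uses `pd.cut` with explicit bin edges on invented rows. The bin edges, the labels, and the `stay_band` column are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical listings; values are invented to illustrate the technique.
toy = pd.DataFrame({
    "minimum_nights": [1, 2, 3, 5, 7, 14, 30, 30],
    "reviews_per_month": [3.0, 2.5, 2.0, 1.2, 1.0, 0.5, 0.2, 0.1],
})

# Right-inclusive bins: (0, 3], (3, 7], (7, 30].
bins = [0, 3, 7, 30]
labels = ["1-3 nights", "4-7 nights", "8-30 nights"]
toy["stay_band"] = pd.cut(toy["minimum_nights"], bins=bins, labels=labels)

# Median monthly review rate for each minimum-stay band.
band_median = toy.groupby("stay_band", observed=True)["reviews_per_month"].median()
print(band_median)
```

The median is used instead of the mean because review counts are right-skewed. Listings with a minimum stay above 30 nights fall outside the bins and are dropped from the summary.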











# **Conclusion**

Write the conclusion here.  
The Airbnb NYC 2019 dataset offers valuable insights into listing behavior, pricing trends, guest engagement, and host performance across New York City. Through univariate, bivariate, and multivariate analyses, we discovered that listings with lower prices, shorter minimum stays, and higher availability tend to receive more reviews — indicating higher guest engagement and booking activity.   

Boroughs like Manhattan and Brooklyn dominate in terms of both pricing and review volume, making them prime targets for growth and marketing. High-performing hosts and room types (especially entire homes and private rooms) consistently drive positive outcomes, while outliers reveal areas of improvement such as overpricing and limited availability.  

By leveraging these insights, Airbnb can refine its pricing models, host support strategies, and location-based marketing, ultimately improving guest satisfaction, increasing bookings, and achieving sustainable business growth in a competitive urban market.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***

In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print(f'User uploaded file "{fn}" with length {len(uploaded[fn])} bytes')

In [None]:
# Fill missing 'name' and 'host_name' with a placeholder
# (assign back instead of inplace=True on a column, which newer pandas versions warn about)
df['name'] = df['name'].fillna('No Name')
df['host_name'] = df['host_name'].fillna('No Name')

# Fill missing 'reviews_per_month' with 0
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)

# Convert 'last_review' to datetime, coercing errors
df['last_review'] = pd.to_datetime(df['last_review'], errors='coerce')

# Check for remaining missing values
print("Missing values after initial wrangling:")
print(df.isnull().sum())