# Optimizing Airbnb NYC Listings: Data-Driven Insights for Hosts & Investors

## Business Question:


### What location, pricing, and availability strategies can hosts adopt to maximize bookings and revenue?

## Data Cleaning 

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt

In [None]:
df=pd.read_csv(r"/Users/turfdiddy/Desktop/Bootcamp_ds:ml/Week_4/Personal_Project/AB_NYC_2019.csv")

In [None]:
df

In [None]:
df.head()

In [None]:
df.duplicated().sum()

In [None]:
df.dtypes

In [None]:
df.isnull().sum()

In [None]:
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

In [None]:
# Strategy: leave missing listing names and host names untouched, since they do not affect the numerical analysis.
# Listings with no reviews get 0 reviews per month. Their missing last_review values are left empty
# and become NaT when the column is parsed as datetime.
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)

In [None]:
df['last_review'] = pd.to_datetime(df['last_review'], errors='coerce')
df['price'] = pd.to_numeric(df['price'], errors='coerce')

In [None]:
df['review_year'] = df['last_review'].dt.year
df['review_month'] = df['last_review'].dt.month
df['review_day'] = df['last_review'].dt.day

In [None]:
import plotly.express as px

# Before filtering
fig_before = px.histogram(
    df,
    x="price",
    nbins=100,
    title="Price Distribution (Before Outlier Removal)",
    labels={"price": "Price (USD)"},
    color_discrete_sequence=["#EF553B"]
)
fig_before.show()

# Filter out prices above $500
df_filtered = df[df["price"] <= 500].copy()

# After filtering
fig_after = px.histogram(
    df_filtered,
    x="price",
    nbins=100,
    title="Price Distribution (After Outlier Removal - Focus on ≤ $500)",
    labels={"price": "Price (USD)"},
    color_discrete_sequence=["#00CC96"]
)
fig_after.show()

In [None]:
df.info()
df.describe()

In [None]:
df.to_csv("airbnb_nyc_cleaned.csv", index=False)

print(" Cleaned Airbnb dataset saved as airbnb_nyc_cleaned.csv")

## Exploratory Data Analysis (EDA)

### Univariate Analysis

In [None]:
# The goal of this EDA is to examine the variables in the dataframe,
# starting with a univariate analysis of the individual columns that may influence price.
# We first generate an interactive map of the Airbnb listings in NYC, colored by neighbourhood group (borough), using Plotly Express.



# Scatter map with neighborhood group color
fig = px.scatter_mapbox(
    df,
    lat="latitude",
    lon="longitude",
    color="neighbourhood_group",  # color by borough
    hover_name="name",            # listing name
    hover_data=["price", "room_type", "number_of_reviews"],
    zoom=10,
    height=600,
    title="NYC Airbnb Listings by Location and Borough",
)

# Set map style
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":30,"l":0,"b":0})

fig.show()

In [None]:
import plotly.express as px

group_counts = df['neighbourhood_group'].value_counts().reset_index()
group_counts.columns = ['neighbourhood_group', 'listing_count']

fig = px.bar(
    group_counts,
    x='neighbourhood_group',
    y='listing_count',
    text='listing_count',
    color='neighbourhood_group',
    title='Number of Listings per Neighbourhood Group',
    color_discrete_sequence=px.colors.qualitative.Set2
)
fig.update_traces(textposition='outside')
fig.show()

In [None]:
# The first variable we analyze is 'price'. The purpose is to understand how prices are distributed across NYC listings
# and to identify common price ranges, outliers, and the overall pricing structure.
# We start with a boxplot of the price distribution.


In [None]:
import plotly.express as px

fig = px.box(df, y="price", title="Boxplot of Airbnb Prices (Including Outliers)")
fig.show()

Insight:
- There appear to be extreme outliers that distort the price distribution.
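
One common way to put a number on "outliers" is Tukey's 1.5 × IQR rule, which is also how boxplot whiskers are usually drawn. The sketch below is only an illustration on made-up prices. It is not the cutoff used in this notebook, which applies a fixed $500 threshold instead.

```python
import pandas as pd

# Toy prices (hypothetical values, not taken from the dataset)
prices = pd.Series([40, 60, 75, 90, 100, 120, 150, 180, 200, 5000], name="price")

# Quartiles and interquartile range
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1

# Tukey fences: values outside these bounds are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = prices[(prices < lower_fence) | (prices > upper_fence)]
print(f"Fences: [{lower_fence:.2f}, {upper_fence:.2f}]")
print(f"Flagged values: {outliers.tolist()}")
```

On heavily right-skewed data such as nightly prices, the IQR rule can flag many listings, so a fixed business threshold is a reasonable alternative.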

In [None]:
#Boxplot after removing the outliers and focusing the dataset on the price range 0-500 USD

fig = px.box(
    df_filtered,
    y="price",
    title="Boxplot of Airbnb Prices (After Removing Outliers)",
    labels={"price": "Price (USD)"},
    color_discrete_sequence=["#636EFA"]
)
fig.show()

In [None]:
fig = px.histogram(
    df,
    x="price",
    nbins=100,
    title="Distribution of Airbnb Prices",
    labels={"price": "Price (USD)"},
    color_discrete_sequence=["#636EFA"]
)
fig.update_layout(xaxis_range=[0, 500])  # Focus on listings under $500 due to the outliers
fig.show()


**Insight:**  
The majority of Airbnb listings in New York are priced under **$500 per night**, with a large concentration between **$50 and $200**. However, there are extreme outliers — some listings priced at several thousand dollars — that could distort statistical measures like the mean. Removing these outliers provides a more accurate view of typical market prices and allows for fairer comparisons across neighborhoods and property types.

**Implications for Hosts & Investors:**  
- For competitive pricing, most hosts would benefit from aligning their nightly rates within the **$50–$200** range where demand is likely higher.  
- Ultra-high pricing may be viable for rare luxury properties but represents a **small niche** of the market.
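
The claims above (a concentration in the $50–$200 band, and a mean pulled upward by outliers) can be checked with a few lines of pandas. The sketch below uses made-up prices, so the numbers it prints say nothing about the real dataset. Swapping in `df["price"]` would give the actual figures.

```python
import pandas as pd

# Toy prices (hypothetical values, not taken from the dataset)
prices = pd.Series([45, 60, 80, 95, 110, 150, 199, 250, 400, 3000], name="price")

# Share of listings priced between $50 and $200 (inclusive on both ends)
share_in_band = prices.between(50, 200).mean()

# Mean vs. median: the mean is sensitive to the single extreme value
mean_all = prices.mean()
median_all = prices.median()

# Same statistics after applying the notebook's <= $500 filter
trimmed = prices[prices <= 500]

print(f"Share in $50-$200 band: {share_in_band:.0%}")
print(f"Mean (all): {mean_all:.1f}, median (all): {median_all:.1f}")
print(f"Mean (<= $500): {trimmed.mean():.1f}, median (<= $500): {trimmed.median():.1f}")
```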

In [None]:
# To make sure the rest of the analysis uses the price-filtered data, keep only listings
# priced at or below 500 USD in the main dataframe.

df = df[df["price"] <= 500].copy()

print(f"Filtered dataset shape: {df.shape}")
print(f"Max price after filtering: {df['price'].max()}")

In [None]:
#The new cleaned dataframe is now saved and stored as a csv to be used for the SQL analysis
df.to_csv("airbnb_nyc_cleaned.csv", index=False)

print(" Cleaned Airbnb dataset saved as airbnb_nyc_cleaned.csv")

In [None]:
# The next variable to analyze is room_type. This matters because hosts and
# investors will want to know which property type dominates the market.
# We start by visualizing the room_type column.

In [None]:
room_type_counts = df_filtered['room_type'].value_counts().reset_index()
room_type_counts.columns = ['room_type', 'count']  # rename columns

# Bar plot
fig = px.bar(
    room_type_counts,
    x='room_type',
    y='count',
    title='Distribution of Airbnb Room Types in NYC',
    labels={'room_type': 'Room Type', 'count': 'Number of Listings'},
    color='room_type',
    color_discrete_sequence=px.colors.qualitative.Set2
)

fig.update_layout(showlegend=False)
fig.show()

**Insight:**
- The majority of Airbnb listings in New York are **Entire home/apartment**, followed by **Private room** listings.  
- **Shared rooms** make up only a small fraction of the market.  
- This distribution indicates a strong preference from both hosts and guests for private spaces, with shared accommodations being a niche offering.

In [None]:
# The next variable is number_of_reviews.
# It serves as a rough proxy for how popular each listing is.

In [None]:
import plotly.express as px

# --- BEFORE OUTLIER REMOVAL ---
fig_before = px.histogram(
    df_filtered,
    x="number_of_reviews",
    nbins=100,
    title="Number of Reviews (Before Outlier Removal)",
    labels={"number_of_reviews": "Number of Reviews"},
    color_discrete_sequence=["#EF553B"]
)
fig_before.show()

# --- DETERMINE OUTLIER CUTOFF ---
q99 = df_filtered['number_of_reviews'].quantile(0.99)  # 99th percentile
print(f"99th percentile cutoff for number_of_reviews: {q99}")

# --- FILTER OUT OUTLIERS ---
df_reviews_filtered = df_filtered[df_filtered['number_of_reviews'] <= q99]

# --- AFTER OUTLIER REMOVAL ---
fig_after = px.histogram(
    df_reviews_filtered,
    x="number_of_reviews",
    nbins=50,
    title="Number of Reviews (After Outlier Removal)",
    labels={"number_of_reviews": "Number of Reviews"},
    color_discrete_sequence=["#00CC96"]
)
fig_after.show()

Insight:
- The vast majority of Airbnb listings in New York have fewer than **100 reviews**, with a steep drop-off after the first few dozen reviews.  
- A small number of listings have extremely high review counts (**several hundred**), which are clear outliers and likely represent long-standing, high-demand properties.  
- Removing these extreme values provides a clearer picture of typical listing performance, avoiding distortion in summary statistics and visualizations.

In [None]:
# The next variable is reviews_per_month. Missing values were already filled with 0 during cleaning,
# so dropna() below is only a safeguard and is not expected to remove any rows.
reviews_df = df.dropna(subset=['reviews_per_month'])

In [None]:

fig = px.histogram(
    reviews_df,
    x='reviews_per_month',
    nbins=50,
    title='Distribution of Reviews per Month',
    labels={'reviews_per_month': 'Reviews per Month'}
)
fig.show()

In [None]:
# Remove NaN and extreme outliers (> 10 reviews/month)
reviews_filtered = df.dropna(subset=['reviews_per_month'])
reviews_filtered = reviews_filtered[reviews_filtered['reviews_per_month'] <= 10]

In [None]:

fig = px.histogram(
    reviews_filtered,
    x='reviews_per_month',
    nbins=50,
    title='Distribution of Reviews per Month (Filtered)',
    labels={'reviews_per_month': 'Reviews per Month'}
)
fig.show()

Insights:
-   The majority of listings have fewer than 3 reviews per month.
-   Many are at or close to 0 (this includes listings with no reviews, whose missing values were filled with 0), suggesting low activity or seasonality.
-   Listings with 3–10 reviews/month are likely high-demand, short-stay rentals.


In [None]:
#The next variable we investigate is the minimum nights variable.

In [None]:
# 1 Before cleaning: show the full distribution (expect heavy skew from outliers)
fig_before = px.histogram(
    df,
    x="minimum_nights",
    nbins=100,
    title="Distribution of Minimum Nights (Before Cleaning)",
    labels={"minimum_nights": "Minimum Nights"},
    color_discrete_sequence=["#636EFA"]
)
fig_before.show()

# 2 Identify a reasonable threshold
# Print summary statistics to check the high values before choosing a cut-off
print(df_filtered["minimum_nights"].describe())

# (Often, Airbnb projects use <= 30 or <= 60 nights as cutoff)
# Let's use 60 nights as threshold for "reasonable stays"
df_filtered = df_filtered[df_filtered["minimum_nights"] <= 60]

#  After cleaning: replot the histogram
fig_after = px.histogram(
    df_filtered,
    x="minimum_nights",
    nbins=60,
    title="Distribution of Minimum Nights (After Removing Outliers > 60 Nights)",
    labels={"minimum_nights": "Minimum Nights"},
    color_discrete_sequence=["#00CC96"]
)
fig_after.show()

Insight: 

- The majority of Airbnb listings in New York require short stays, typically under 10 nights.  
- A small but significant number of listings have extremely high minimum night requirements — in some cases hundreds or even thousands — which are unrealistic for most guests and likely intended for monthly or long-term rentals.  
- Removing stays requiring more than 60 nights provides a more accurate view of the market for short- to medium-term accommodations, which make up the bulk of demand.  

**Implications for Hosts & Investors**  
- Shorter minimum night requirements (e.g., 1–7 nights) can appeal to a wider pool of guests and increase booking frequency.  
- Extremely high minimum stays may limit audience reach and could indicate a niche, long-term rental strategy rather than typical Airbnb usage.
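
To compare short-stay and long-stay strategies more directly, minimum-night requirements can be grouped into bands with `pd.cut`. The sketch below is a minimal illustration on made-up values. The band edges (7 and 30 nights) and the label names are assumptions chosen for the example, not thresholds taken from this analysis.

```python
import pandas as pd

# Toy minimum-night requirements (hypothetical values)
min_nights = pd.Series([1, 2, 3, 7, 14, 30, 31, 90, 365], name="minimum_nights")

# Right-inclusive bins: (0, 7], (7, 30], (30, inf)
stay_type = pd.cut(
    min_nights,
    bins=[0, 7, 30, float("inf")],
    labels=["Short (1-7)", "Medium (8-30)", "Long (31+)"],
)

# Number of listings per band, as a plain dict keyed by label
counts = stay_type.value_counts().to_dict()
print(counts)
```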

In [None]:
# Next we investigate the availability column


seasonal_df = df[(df['availability_365'] >= 1) & (df['availability_365'] <= 364)]
fig = px.histogram(
    seasonal_df,
    x="availability_365",
    nbins=12,
    title="Seasonal Availability Distribution of Airbnb Listings (1–364 days)",
    labels={"availability_365": "Available Days per Year"},
    color_discrete_sequence=["#00CC96"]
)
fig.show()

Insights:
-   Listings are not evenly distributed across the range of available days.
-   There are clear peaks at low availability (1–50 days per year) and near full-year availability (300+ days).
-   This suggests two types of hosts:
    -   Seasonal or occasional hosts, who make their properties available for only a small part of the year.
    -   Year-round hosts, who list properties for nearly the whole year.

In [None]:
fig = px.histogram(
    seasonal_df,
    x="availability_365",
    nbins=12,  # up to 12 bins, each covering roughly a month's worth of available days (not calendar months)
    color="neighbourhood_group",
    barmode="overlay",
    title="Seasonal Availability by Neighbourhood Group (1–364 days)",
    labels={
        "availability_365": "Available Days per Year",
        "neighbourhood_group": "Neighbourhood Group"
    }
)
fig.show()

Insights:
-   Brooklyn dominates across almost all availability ranges, especially in the 1–50 day and near-365 day categories.

-   Queens and Staten Island show more even but smaller distribution across availability periods.

-   Manhattan has a strong presence in the high-availability category (300+ days).

-   Bronx consistently has fewer listings across all ranges.

-   Manhattan & Brooklyn = largest and most flexible markets, with both extreme short-term and long-term hosting common.

-   Queens = strong year-round and mid-season presence.

-   Bronx & Staten Island = smaller markets, but when listings are up, they often stay available for long periods.

-   Availability patterns differ by borough, which may reflect different tourist/tenant cycles and can inform pricing and marketing strategies.

### Bivariate Analysis

To address our business question — *"How do prices vary across different neighbourhood groups in our Airbnb dataset, and can these differences guide decision-making for hosts, investors, or policy makers?"* — we begin with a statistical hypothesis test.  

The goal here is to **move beyond visual impressions** from our earlier univariate analysis and interactive maps, and provide **statistical evidence** for or against price differences between neighbourhood groups.  

**Hypothesis**  
- **H₀ (Null Hypothesis):** The mean Airbnb prices are equal across all neighbourhood groups.  
- **H₁ (Alternative Hypothesis):** At least one neighbourhood group has a different mean price.  

By testing this, we can determine whether any observed price differences are simply due to random variation, or whether they are statistically significant — and thus potentially actionable for business strategy.  
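
For reference, the one-way ANOVA F-statistic is the ratio of between-group to within-group mean squares. The sketch below computes it by hand on three made-up price groups and compares the result with `scipy.stats.f_oneway`. The group values are hypothetical and only serve to show the arithmetic.

```python
import numpy as np
from scipy import stats

# Three toy price groups (hypothetical values, one array per neighbourhood group)
groups = [
    np.array([100.0, 120.0, 140.0]),
    np.array([80.0, 90.0, 100.0]),
    np.array([60.0, 65.0, 70.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = all_values.size  # total number of observations

# Between-group sum of squares: group size times squared distance of group mean from grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: squared distances of observations from their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))

f_scipy, p_value = stats.f_oneway(*groups)
print(f"Manual F: {f_manual:.4f}, SciPy F: {f_scipy:.4f}, p-value: {p_value:.4f}")
```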

In [None]:
from scipy import stats

# Same filtered data
price_filtered = df[df['price'].between(10, 500)].copy()

# Group price values by neighbourhood group
groups_prices = [
    group["price"].values
    for name, group in price_filtered.groupby("neighbourhood_group")
]

In [None]:
f_stat, p_value = stats.f_oneway(*groups_prices)

print(f"F-statistic: {f_stat:.2f}")
print(f"p-value: {p_value:.5f}")

if p_value < 0.05:
    print("Reject H₀: Significant difference in prices between neighbourhood groups.")
else:
    print("Fail to reject H₀: No significant difference in prices between groups.")

In [None]:
# We can now visualise the relationship between price and neighbourhood groups
 

price_filtered = df[df['price'].between(10, 500)]  # Keep listings between $10 and $500

# Create boxplot
fig = px.box(
    price_filtered,
    x="neighbourhood_group",
    y="price",
    title="Price Distribution by Neighbourhood Group",
    labels={
        "neighbourhood_group": "Neighbourhood Group",
        "price": "Price (USD)"
    },
    color="neighbourhood_group"
)
fig.show()

Insight:
-	Manhattan has the highest median price among all neighbourhood groups and a very wide spread, indicating a diverse range of property pricing.

-   Brooklyn shows a lower median price than Manhattan but still has a significant price range, making it a competitive mid-high priced area.

-   Queens and Staten Island have noticeably lower median prices and tighter interquartile ranges, suggesting more affordable markets.

-   Bronx has the lowest overall prices with a small spread, indicating less variation in listing costs.

-   Extreme outliers exist in every group (especially Manhattan and Brooklyn), but most listings fall well below these extreme values.

-   If the platform targets premium travelers in New York, Manhattan and Brooklyn are prime focus areas. For budget-conscious travelers, Bronx and Staten Island offer lower-price entry points.
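
Because the boxplots show skewed prices and unequal spreads across boroughs, a rank-based test can serve as a robustness check alongside ANOVA. The sketch below runs a Kruskal-Wallis test with `scipy.stats.kruskal` on made-up prices. It is an additional check suggested here, not part of the original analysis, and the three lists are hypothetical.

```python
from scipy import stats

# Toy nightly prices per borough (hypothetical values)
manhattan = [150, 180, 200, 220, 450]
brooklyn = [90, 100, 110, 130, 300]
bronx = [50, 60, 65, 70, 120]

# Kruskal-Wallis H-test: compares groups using ranks instead of raw values,
# so a few very expensive listings have limited influence on the result
h_stat, p_value = stats.kruskal(manhattan, brooklyn, bronx)

print(f"H-statistic: {h_stat:.2f}")
print(f"p-value: {p_value:.4f}")
```

With the real data, the same call could be applied to the per-borough price arrays already built for the ANOVA.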


In [None]:
fig = px.bar(
    price_filtered.groupby("neighbourhood_group")["price"].median().reset_index(),
    x="neighbourhood_group",
    y="price",
    title="Median Price by Neighbourhood Group",
    labels={"price": "Median Price (USD)"}
)
fig.show()

Insights:

-   The median price ranking from highest to lowest: Manhattan > Brooklyn > Staten Island > Queens > Bronx.
	
-   Manhattan’s median price is more than double that of Queens and Bronx, signaling a clear luxury positioning.
	
-   Brooklyn sits in the middle — appealing to both mid-range and premium segments.

-   Staten Island, while not the cheapest, may offer niche opportunities for travelers seeking less crowded areas with moderate prices.

-   Bronx’s very low median price shows budget dominance, but may also indicate fewer high-end property options.

-	If the goal is maximizing revenue per booking, focus marketing on Manhattan and Brooklyn. If the goal is increasing booking volume, highlight affordable options in Bronx, Queens, and Staten Island.

In [None]:
fig = px.scatter_mapbox(
    price_filtered,  # Your DataFrame
    lat="latitude",
    lon="longitude",
    color="neighbourhood_group",  # Group color
    size="price",  # Bubble size based on price
    hover_name="name",  # Show listing name
    hover_data=["price", "room_type"],
    mapbox_style="carto-positron",
    zoom=10,
    title="Airbnb Prices by Location and Neighbourhood Group"
)
fig.show()

In [None]:
# The interactive map below makes it possible to view the various boroughs and the price influences

import plotly.express as px
import plotly.graph_objects as go


price_filtered = df[df['price'].between(10, 500)].copy()


groups = sorted(price_filtered['neighbourhood_group'].dropna().unique())
category_orders = {"neighbourhood_group": groups}

# Building the map (Plotly Express creates one trace per group when using `color=`)
fig = px.scatter_mapbox(
    price_filtered,
    lat="latitude",
    lon="longitude",
    color="neighbourhood_group",
    size="price",
    hover_name="name",
    hover_data=["price", "room_type", "neighbourhood_group"],
    mapbox_style="open-street-map",   # no token needed
    zoom=10,
    height=650,
    category_orders=category_orders,
    title="Airbnb Prices by Neighbourhood Group (Interactive)"
)

#  Making dropdown buttons (one per group + 'All')
n_traces = len(fig.data)  # should equal number of groups
buttons = []

# 'All' button -> show all traces
buttons.append(dict(
    label="All",
    method="update",
    args=[{"visible": [True]*n_traces},
          {"title": "Airbnb Prices – All Neighbourhoods"}]
))

# One button per neighbourhood group
for i, grp in enumerate(groups):
    visible = [False]*n_traces
    visible[i] = True  # only this group's trace
    buttons.append(dict(
        label=grp,
        method="update",
        args=[{"visible": visible},
              {"title": f"Airbnb Prices – {grp}"}]
    ))

fig.update_layout(
    updatemenus=[dict(
        type="dropdown",
        x=0.01, y=0.98, xanchor="left", yanchor="top",
        showactive=True,
        buttons=buttons
    )],
    margin=dict(l=10, r=10, t=60, b=10)
)

fig.show()

Business Implication:

-   Airbnb hosts or property investors can prioritize high-price areas (statistically higher) if profitability is the main goal.

-   Policy makers could investigate high-price boroughs for affordability measures

**Business Interpretation:**  
These results suggest that neighbourhood group is an important factor when setting listing prices.  
For hosts, understanding these differences could help in **optimizing pricing strategies**.  
For investors, this could inform **location-based investment decisions** to balance occupancy rates and revenue potential.  
For policy makers, this may help guide **short-term rental regulations** tailored to each borough's market dynamics.  

In [None]:
# Next we want to investigate the relationship between Price and Room Type

We now turn our attention to another key factor in Airbnb pricing — the **type of room offered**.  
Room type is one of the first things a potential guest considers, and it may have a strong influence on price.

**Business Question:**  
Do different types of rooms (*Entire home/apt*, *Private room*, *Shared room*) command significantly different average prices in the Airbnb market?

**Hypotheses:**

- **H₀ (Null Hypothesis):** The mean prices are equal across all room types.  
- **H₁ (Alternative Hypothesis):** At least one room type has a significantly different mean price.
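
With tens of thousands of listings, a very small p-value is almost guaranteed, so an effect size helps judge how much of the price variation room type actually explains. The sketch below computes eta squared (between-group sum of squares divided by total sum of squares) on a made-up table. The prices are hypothetical, and this measure is an addition for illustration rather than something computed in the notebook.

```python
import pandas as pd

# Toy listings (hypothetical prices for the three room types)
toy = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 3 + ["Private room"] * 3 + ["Shared room"] * 3,
    "price": [150, 200, 180, 70, 80, 90, 40, 50, 45],
})

grand_mean = toy["price"].mean()

# Total variation of price around the overall mean
ss_total = ((toy["price"] - grand_mean) ** 2).sum()

# Variation explained by room type: group size times squared gap between group mean and overall mean
group_stats = toy.groupby("room_type")["price"].agg(["mean", "count"])
ss_between = (group_stats["count"] * (group_stats["mean"] - grand_mean) ** 2).sum()

# Eta squared: share of total price variance attributable to room type (between 0 and 1)
eta_squared = ss_between / ss_total
print(f"Eta squared: {eta_squared:.3f}")
```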


In [None]:
price_rt = df[df['price'].between(10, 500)].copy()

price_rt = price_rt.dropna(subset=['price', 'room_type'])

price_rt['room_type'] = price_rt['room_type'].astype('category')


groups = [
    grp['price'].values
    for _, grp in price_rt.groupby('room_type', observed=True)
]

f_stat, p_val = stats.f_oneway(*groups)
print(f"ANOVA F-statistic: {f_stat:.4f}")
print(f"ANOVA p-value:     {p_val:.6f}")

if p_val < 0.05:
    print("Reject H0: At least one room type has a different mean price.")
else:
    print("Fail to reject H0: No evidence of mean price differences across room types.")


In [None]:
import plotly.express as px

# 1) Filter price outliers for better visualization
roomtype_df = df[df['price'].between(10, 500)].copy()

# 2) Boxplot – distribution of prices by room type
fig_box = px.box(
    roomtype_df,
    x="room_type",
    y="price",
    color="room_type",
    title="Price Distribution by Room Type",
    labels={
        "room_type": "Room Type",
        "price": "Price (USD)"
    }
)
fig_box.show()

# 3) Barplot – mean price per room type
mean_prices = roomtype_df.groupby("room_type", as_index=False)['price'].mean()

fig_bar = px.bar(
    mean_prices,
    x="room_type",
    y="price",
    color="room_type",
    title="Average Price by Room Type",
    labels={
        "room_type": "Room Type",
        "price": "Average Price (USD)"
    }
)
fig_bar.show()

In [None]:
#From here we want to investigate the relationship between Price and availability_365

**Business Question**:

-   Does the number of days a listing is available per year influence its average price?  

-   This can provide insights for hosts on whether being available all year leads to higher earnings, or if seasonal availability creates a scarcity effect that allows for premium pricing.

**Hypotheses**: 
- **H₀ (Null Hypothesis):** Mean prices are equal across different availability categories.  
- **H₁ (Alternative Hypothesis):** At least one availability category has a different mean price.
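
The availability categories in these hypotheses depend on how the bin edges are drawn. `pd.cut` uses right-inclusive intervals by default, so an edge list of `[-1, 0, 180, 364, 365]` places 0 days in its own category and 365 days in its own category. The sketch below checks this on boundary values only. The label text is simplified to plain ASCII for the example.

```python
import pandas as pd

# Boundary values for availability_365 (chosen to test each bin edge)
days = pd.Series([0, 1, 180, 181, 364, 365], name="availability_365")

# Right-inclusive bins: (-1, 0], (0, 180], (180, 364], (364, 365]
category = pd.cut(
    days,
    bins=[-1, 0, 180, 364, 365],
    labels=["Unavailable", "Low (1-180)", "Medium (181-364)", "Year-round"],
)

print(pd.DataFrame({"availability_365": days, "category": category}))
```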


In [None]:
# Step 1: Create Availability Categories


df['availability_category'] = pd.cut(
    df['availability_365'],
    bins=[-1, 0, 180, 364, 365],
    labels=['Unavailable', 'Low (1–180)', 'Medium (181–364)', 'Year-round']
)

In [None]:
#Step 2: Filter extreme price outliers for clearer visualization
filtered_df = df[df['price'].between(10, 500)]


In [None]:
# Step 3: Boxplot
fig = px.box(
    filtered_df,
    x='availability_category',
    y='price',
    color='availability_category',
    title="Price Distribution by Availability Category",
    labels={'availability_category': 'Availability Category', 'price': 'Price (USD)'}
)
fig.show()


Insights:

-	Listings with very low availability (close to 0 days) tend to have a wider spread in prices, including a few extremely high-priced outliers.

-	Moderate to high availability (100–300 days) generally corresponds to lower median prices, suggesting that hosts keeping listings open most of the year might adopt more competitive pricing to attract bookings.

-	Fully available listings (365 days) cluster in the mid-to-lower price range, indicating they are likely targeted at consistent occupancy rather than premium pricing.

-	The interquartile range (IQR) narrows as availability increases, showing more price stability for highly available listings.

-	Overall, availability appears to influence pricing strategy, with low-availability listings more likely positioned as premium/short-term rentals, and high-availability listings geared toward steady, budget-conscious demand.

In [None]:
# Step 4: Hypothesis Testing (ANOVA)
groups = [
    filtered_df.loc[filtered_df['availability_category'] == cat, 'price']
    for cat in filtered_df['availability_category'].unique()
]

anova_stat, p_value = stats.f_oneway(*groups)

print(f"ANOVA Statistic: {anova_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# Step 5: Conclusion
if p_value < 0.05:
    print("Reject H₀ → There is a significant difference in mean prices across availability categories.")
else:
    print("Fail to Reject H₀ → No significant difference in mean prices across availability categories.")

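-   If the test rejects H₀, mean prices differ across availability categories, which is consistent with the boxplot pattern. With tens of thousands of listings even small differences can reach significance, so the size of the difference matters as much as the p-value.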

In [None]:
# Our next analysis is a multivariate analysis of price, room type, and availability. The business goal link
# here is to see if listings with more reviews tend

In [None]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

filtered_df = df[df['price'].between(10, 500)].copy()
filtered_df['availability_category'] = np.where(
    filtered_df['availability_365'] > 180, 'High Availability', 'Low Availability'
)

# Model
model = ols('price ~ C(room_type) * C(availability_category)', data=filtered_df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt


sns.boxplot(
    x='room_type',
    y='price',
    hue='availability_category',
    data=filtered_df
)
plt.title('Price by Room Type & Availability')
plt.show()

In [None]:
import plotly.express as px
import pandas as pd

# Create availability category for cleaner grouping
bins = [-1, 0, 100, 200, 365]
labels = ["No availability", "Low (1–100 days)", "Medium (101–200 days)", "High (201–365 days)"]
df['availability_category'] = pd.cut(df['availability_365'], bins=bins, labels=labels)

# Filter extreme prices for better visualization
df_filtered = df[df['price'].between(10, 500)]

# Grouped boxplot: Price vs Availability, split by Room Type
fig = px.box(
    df_filtered,
    x="availability_category",
    y="price",
    color="room_type",
    title="Price Distribution by Availability and Room Type",
    labels={
        "availability_category": "Availability Category",
        "price": "Price (USD)",
        "room_type": "Room Type"
    },
    category_orders={
        "availability_category": labels,
        "room_type": sorted(df['room_type'].unique())
    }
)
fig.show()

In [None]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Filter and prepare data for ANOVA
anova_df = df_filtered.copy()

# Fit Two-Way ANOVA model with interaction term
model = ols('price ~ C(availability_category) * C(room_type)', data=anova_df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)  # Type II ANOVA
print(anova_table)

**Insight Summary – Price, Availability, and Room Type (Statistically Backed)**

1. **Statistical Results**  
   - **Availability Category Effect**:  
     F(3, 47828) = 214.99, **p < 0.0001** → Significant differences in prices exist between different availability categories.

   - **Room Type Effect**:  
     F(2, 47828) = 11,199.55, **p < 0.0001** → Room type is a very strong predictor of price variation.

   - **Interaction Effect (Availability × Room Type)**:  
     F(6, 47828) = 44.91, **p < 0.0001** → The impact of availability on price depends on the room type.

2. **Key Insights**  
   - *Entire home/apartment* listings consistently have the highest median prices across all availability levels.  

   - Listings with **medium-to-high availability** (101–365 days) tend to have greater price variation, especially in the *Entire home/apartment* category.  

   - *Shared rooms* remain the most affordable option regardless of availability.

3. **Notable Observations**  
   - Even in “No availability” listings, prices vary widely, suggesting some inactive or temporarily blocked listings still set premium prices.  

   - Significant interaction means availability’s influence on price is **not uniform** — e.g., high availability boosts prices more for *Entire home/apartment* than for *Private rooms*.

4. **Business Implications**  
   - **For hosts**: Maximizing availability can be leveraged differently depending on room type to optimize revenue.  
   
   - **For the platform**: Premium-tier marketing could target *Entire home/apartment* owners with high availability, as they show the largest spread and highest potential in price.
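
A significant interaction term is easier to interpret next to a table of cell means: if the price gap between room types changes from one availability level to another, the effects are not additive. The sketch below builds such a table with `pivot_table` on made-up data. The prices and the two-level "Low"/"High" availability labels are hypothetical.

```python
import pandas as pd

# Toy listings (hypothetical prices), two room types crossed with two availability levels
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room", "Private room"] * 2,
    "availability_category": ["Low"] * 4 + ["High"] * 4,
    "price": [150, 170, 70, 80, 200, 220, 75, 85],
})

# Mean price for every availability x room type combination
cell_means = toy.pivot_table(
    index="availability_category",
    columns="room_type",
    values="price",
    aggfunc="mean",
)

# Price gap between room types at each availability level;
# a gap that changes across rows is what an interaction looks like
gap = cell_means["Entire home/apt"] - cell_means["Private room"]

print(cell_means)
print(gap)
```

In this toy table the gap is 85 at low availability and 130 at high availability, which is the kind of pattern the interaction term in the two-way ANOVA tests for.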

## Project Summary – Airbnb NYC Price Analysis

### Business Objective
The goal of this analysis was to explore how various factors — such as **neighbourhood location, room type, and availability** — influence Airbnb pricing in New York City. This investigation is aimed at identifying patterns that can help **hosts optimize their pricing strategy** and **assist the platform** in targeting premium listings for strategic promotion.

---

### Approach
The analysis followed a structured **Exploratory Data Analysis (EDA)** pipeline:
1. **Data Cleaning & Preparation**  
   - Excluded listings priced above $500 so that extreme values do not dominate the price analysis.  
   - Filtered out extreme outliers in variables such as minimum nights and number of reviews.  
   - Handled missing values and standardized column names.

2. **Univariate Analysis**  
   - Explored individual variables (price, minimum nights, number of reviews, availability) to understand their distributions.  
   - Identified skewness in price and availability distributions, prompting filtering and categorization.

3. **Bivariate Analysis with Hypothesis Testing**  
   - **Price vs Neighbourhood Group**  
     - *H₀*: Mean prices are equal across neighbourhood groups.  
     - ANOVA showed significant differences (p < 0.0001). Manhattan emerged as the most expensive area.
   - **Price vs Room Type**  
     - *H₀*: Mean prices are equal across room types.  
     - ANOVA showed significant differences (p < 0.0001); *Entire home/apartment* listings have the highest prices.
   - **Price vs Availability**  
     - *H₀*: Mean prices are equal across availability categories.  
     - Significant differences found (p < 0.0001), with higher availability often associated with wider price ranges.
   - **Multivariate Analysis – Price by Room Type & Availability**  
     - Tested for interaction between room type and availability on price.  
     - Significant interaction (p < 0.0001) — availability’s impact on price depends on the room type.

4. **Geospatial Visualization**  
   - Interactive maps revealed clear geographic clustering of high-priced listings in Manhattan and parts of Brooklyn.  
   - Dropdown filters allowed borough-specific price exploration.

---

### Key Findings
- **Location matters**: Manhattan listings are consistently priced higher than other boroughs.  
- **Room type is a major driver of price**: Entire homes/apartments command significantly higher prices compared to private or shared rooms.  
- **Availability has a complex effect**: High availability correlates with greater price variance, especially for Entire homes/apartments.  
- **Interaction effects are important**: The relationship between availability and price is not uniform across room types.

---

### Business Implications
- **For Hosts**:  
  - Position properties in high-demand boroughs (e.g., Manhattan) and maintain high availability for premium pricing potential.  
  - Entire home/apartment owners can benefit most from extended availability.
- **For the Platform**:  
  - Use predictive modeling on combined location, room type, and availability data to identify and promote high-value listings.  
  - Create tailored recommendations for hosts to adjust availability based on room type to maximize earnings.

---

**Conclusion**:  
Through rigorous EDA, hypothesis testing, and visualization, this project demonstrated that Airbnb pricing in NYC is significantly shaped by **location**, **room type**, and **availability**, with strong interaction effects between these factors. These insights can directly inform strategic decisions for both hosts and the platform.