Data Loading and Initial Cleaning:

The AB_NYC_2019.csv dataset (48,895 listings) was successfully loaded.
Missing Values: Key missing values were identified and addressed:
reviews_per_month and last_review NaNs (10,052 each) were found to directly correspond to listings with zero number_of_reviews. reviews_per_month was imputed with 0. last_review NaNs were noted as expected for listings with no review history.
Minor missing values for name (16) and host_name (21) were imputed with "Unknown."
Key Feature Exploration (Informing "Similarity" and "Neighborhood Profiles"):

Price:
Highly right-skewed (median $106, mean $153, max $10,000). A log transformation (log_price) was useful for clearer visualization.
Strongly driven by room_type ('Entire home/apt' median $160 vs. 'Private room' $70) and neighbourhood_group (Manhattan median $150 vs. Bronx $65).
Significant price variation exists within neighbourhood_groups at the neighbourhood level (e.g., Midtown's median $210 vs. Harlem's $89 within Manhattan).
Showed weak linear correlation with number_of_reviews and availability_365.
Room Type:
The market is dominated by 'Entire home/apt' (~52%) and 'Private room' (~46%), with 'Shared room' being a small niche (~2.4%).
The mix of room_types varies significantly by neighbourhood_group (e.g., Manhattan favors 'Entire home/apt'; Queens/Bronx favor 'Private room') and also by specific neighbourhoods within them.
Minimum Nights:
Highly right-skewed (median 3 nights, mean ~7, max 1250). The majority of listings (75%) require 5 nights or less, indicating a focus on short stays.
Median minimum_nights are short (2-3 days) across all neighbourhood_groups, but Manhattan and Brooklyn have higher means due to more listings with very long minimum stay requirements.
Host Profile (calculated_host_listings_count):
Dominated by small-scale hosts: single-listing hosts account for ~66% of listings, and hosts with 1-2 listings account for ~80%.
A "long tail" of professional hosts/multi-property managers exists (max 327 listings by one host).
Host types vary by room_type (e.g., 'Entire home/apt' have both many single hosts and the most listings from very large operators; 'Shared rooms' are more common with mid-size operators). Price ranges appeared more standardized for larger hosts.
Understanding "Busyness" and Demand Indicators:

Availability (availability_365): A large proportion of listings show very low availability (over 35% have 0 days; the median is 45 days). 'Brooklyn' and 'Manhattan' have the lowest average availability at the neighbourhood_group level.
Review Activity (reviews_per_month): Most listings have low review rates (median 0.37 after imputing zeros). Surprisingly, outer neighbourhood_groups like 'Staten Island', 'Queens', and 'Bronx' showed higher average reviews per month per listing than 'Manhattan' and 'Brooklyn'. Specific neighbourhoods, often in these outer groups, had listings with very high average activity.
Listing Density: 'Manhattan' and 'Brooklyn' neighbourhood_groups, and specific neighbourhoods within them (like Williamsburg, Bedford-Stuyvesant, Harlem), overwhelmingly have the highest concentration of listings.
Complexity of "Busyness": Neighborhood-level aggregated metrics for availability (inverse), review activity, and density showed weak and sometimes counterintuitive (negative) linear correlations. This indicates "busyness" is multi-faceted and these indicators capture different, largely independent aspects.
Geographical Variation Summary (neighbourhood_group and neighbourhood levels):

Distinct neighbourhood_group Profiles: Each neighbourhood_group exhibits a unique combination of typical price points, availability patterns, dominant room types, and review activity metrics.
Significant Intra-neighbourhood_group Heterogeneity: Crucially, substantial variation exists for all key metrics (price, room type mix, availability) among neighbourhoods within the same neighbourhood_group. No neighbourhood_group is internally homogenous.
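
One way to put a number on this intra-group heterogeneity is to compute a median per neighbourhood and then look at the spread of those medians inside each neighbourhood_group. The sketch below uses a small made-up table (the prices and the Bronx neighbourhood names are illustrative assumptions, not values from the dataset); the same two groupby steps should apply to df unchanged.

```python
import pandas as pd

# Toy stand-in for the listings table. Values are illustrative only.
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Bronx"] * 4,
    "neighbourhood": ["Midtown", "Midtown", "Harlem", "Harlem",
                      "Fordham", "Fordham", "Riverdale", "Riverdale"],
    "price": [200, 220, 80, 100, 60, 70, 65, 75],
})

# Step 1: median price per neighbourhood, keeping the parent group.
hood_median = (
    toy.groupby(["neighbourhood_group", "neighbourhood"])["price"]
       .median()
       .rename("median_price")
       .reset_index()
)

# Step 2: spread of the neighbourhood medians inside each group.
group_spread = hood_median.groupby("neighbourhood_group")["median_price"].agg(["min", "max"])
group_spread["range"] = group_spread["max"] - group_spread["min"]
print(group_spread)
```

A wide range relative to the group median suggests that the group average hides very different neighbourhoods.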
Overall Conclusion of EDA for Project Goals:

The EDA characterized the dataset's structure, distributions, and interrelationships that matter for the project.
It confirms that "busyness" is not a monolithic concept and will require a clear, potentially multi-faceted definition based on chosen indicators (e.g., focusing on low availability, high density, or high activity, or a combination).
The analysis strongly supports the need to build rich, granular profiles for individual neighbourhoods to assess their "characteristic similarity." These profiles should incorporate aggregated data on price distributions, room type compositions, typical minimum stay requirements, and potentially host profiles.
Relying solely on neighbourhood_group averages would be insufficient for finding genuinely "similar" alternative neighborhoods.
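
The profile idea can be sketched as follows. This is a minimal illustration, not the project's final method: the toy data, the three chosen profile features, and the helper most_similar are all assumptions made for the example, and z-scored Euclidean distance is only one of several reasonable similarity measures.

```python
import numpy as np
import pandas as pd

# Toy listings with hypothetical neighbourhoods A, B, C. Values are illustrative only.
toy = pd.DataFrame({
    "neighbourhood": ["A", "A", "B", "B", "C", "C"],
    "room_type": ["Entire home/apt", "Private room",
                  "Entire home/apt", "Entire home/apt",
                  "Private room", "Private room"],
    "price": [150, 90, 160, 200, 60, 70],
    "minimum_nights": [2, 3, 2, 4, 1, 2],
})

# One row per neighbourhood: typical price, typical minimum stay, room type mix.
profiles = toy.groupby("neighbourhood").agg(
    median_price=("price", "median"),
    median_min_nights=("minimum_nights", "median"),
)
profiles["share_entire_home"] = (
    toy["room_type"].eq("Entire home/apt").groupby(toy["neighbourhood"]).mean()
)

# Standardize each feature so that price does not dominate the distance.
z = (profiles - profiles.mean()) / profiles.std(ddof=0)

def most_similar(name):
    """Return the neighbourhood whose standardized profile is closest to `name`."""
    dist = np.sqrt(((z - z.loc[name]) ** 2).sum(axis=1))
    return dist.drop(name).idxmin()

print(profiles)
print("Closest to A:", most_similar("A"))
```

Host-profile features (for example, the share of listings from single-listing hosts) could be added as further columns of profiles in the same way.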

1. Missing Values: Identify and handle missing values, especially in key columns like reviews_per_month (which might be legitimately null or 0 if number_of_reviews is 0) and last_review.
Han, J., Pei, J. and Tong, H. (2022, Chapter 3: Data Preprocessing) provide comprehensive techniques for handling missing data.


1.1 Load Data

In [None]:
import pandas as pd
import numpy as np
df = pd.read_csv("../data/AB_NYC_2019.csv")  # forward slashes work on Windows, macOS, and Linux
print("Dataset loaded. Shape:", df.shape)
df.head()

1.2 Overall Missing Value Summary

In [None]:
print("--- Overall Missing Value Counts ---")
missing_values_summary = df.isnull().sum()
print(missing_values_summary[missing_values_summary > 0])

1.3 Analyze Missing reviews_per_month

In [None]:
print("--- Analyzing Missing 'reviews_per_month' ---")
if 'reviews_per_month' in df.columns and 'number_of_reviews' in df.columns:
    total_nan_rpm = df['reviews_per_month'].isnull().sum()
    print(f"Total NaN values in 'reviews_per_month': {total_nan_rpm}")

    listings_with_zero_reviews = df[df['number_of_reviews'] == 0]
    print(f"Number of listings with 0 reviews: {len(listings_with_zero_reviews)}")

    nan_rpm_when_zero_reviews = 0
    if not listings_with_zero_reviews.empty:
        nan_rpm_when_zero_reviews = listings_with_zero_reviews['reviews_per_month'].isnull().sum()
    print(f"NaN in 'reviews_per_month' when 'number_of_reviews' is 0: {nan_rpm_when_zero_reviews}")

    if total_nan_rpm > 0 and total_nan_rpm == nan_rpm_when_zero_reviews and len(listings_with_zero_reviews) == total_nan_rpm:
        print("\nObservation: All NaN values in 'reviews_per_month' occur where 'number_of_reviews' is 0.")
        print("Interpretation: This is expected. If there are no reviews, 'reviews_per_month' cannot be calculated and is thus NaN.")
        print("Proposed Strategy for next step: Impute these NaNs with 0.")
    else:
        nan_rpm_with_reviews = df[(df['reviews_per_month'].isnull()) & (df['number_of_reviews'] > 0)]
        count_nan_rpm_with_reviews = len(nan_rpm_with_reviews)
        print(f"\nNumber of listings with 'reviews_per_month' = NaN BUT 'number_of_reviews' > 0: {count_nan_rpm_with_reviews}")
        if count_nan_rpm_with_reviews > 0:
            print("Observation: There are unexpected NaNs in 'reviews_per_month' for listings that DO have reviews. These need further investigation or a specific imputation strategy for them.")
        elif total_nan_rpm > 0:
            print("\nObservation: 'reviews_per_month' has NaN values. The primary cause appears to be listings with 0 reviews.")
            print("Proposed Strategy for next step: For NaNs corresponding to 0 reviews, impute with 0. Investigate any other NaNs if present.")
        else:
            print("\nNo NaN values found in 'reviews_per_month'.")
else:
    print("Error: 'reviews_per_month' or 'number_of_reviews' column not found in DataFrame.")

1.4 Analyze Missing last_review

In [None]:
print("\n--- Analyzing Missing 'last_review' ---")
if 'last_review' in df.columns and 'number_of_reviews' in df.columns:
    total_nan_last_review = df['last_review'].isnull().sum()
    print(f"Total NaN values in 'last_review': {total_nan_last_review}")

    nan_last_review_when_zero_reviews = 0
    # Recompute the zero-review subset here so this cell also works when run independently of 1.3
    listings_with_zero_reviews_for_last_review_check = df[df['number_of_reviews'] == 0]
    if not listings_with_zero_reviews_for_last_review_check.empty:
        nan_last_review_when_zero_reviews = listings_with_zero_reviews_for_last_review_check['last_review'].isnull().sum()
    print(f"NaN in 'last_review' when 'number_of_reviews' is 0: {nan_last_review_when_zero_reviews}")

    if total_nan_last_review > 0 and total_nan_last_review == nan_last_review_when_zero_reviews and len(listings_with_zero_reviews_for_last_review_check) == total_nan_last_review:
        print("\nObservation: All NaN values in 'last_review' occur where 'number_of_reviews' is 0.")
        print("Interpretation: This is expected. If there are no reviews, there is no 'last_review' date.")
        print("Proposed Strategy: These NaNs are informative. No direct imputation of the date is usually needed for these expected NaNs. If creating a 'days_since_last_review' feature, these would become NaN or a special large value.")
    else:
        nan_last_review_with_reviews = df[(df['last_review'].isnull()) & (df['number_of_reviews'] > 0)]
        count_nan_last_review_with_reviews = len(nan_last_review_with_reviews)
        print(f"\nNumber of listings with 'last_review' = NaN BUT 'number_of_reviews' > 0: {count_nan_last_review_with_reviews}")
        if count_nan_last_review_with_reviews > 0:
            print("Observation: There are unexpected NaNs in 'last_review' for listings that DO have reviews. These require investigation.")
        elif total_nan_last_review > 0:
            print("\nObservation: 'last_review' has NaN values. The primary cause appears to be listings with 0 reviews.")
        else:
            print("\nNo NaN values found in 'last_review'.")
else:
    print("Error: 'last_review' or 'number_of_reviews' column not found in DataFrame.")
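
The 'days_since_last_review' feature mentioned in the strategy above could be derived roughly as follows. This is a sketch on a three-row toy frame; using the latest review date as the reference date is an assumption (the dataset's actual snapshot date may be a better choice), and the has_reviews flag is a hypothetical helper column that keeps the "no review history" information explicit.

```python
import pandas as pd

# Toy rows: two listings with reviews, one without. Values are illustrative only.
toy = pd.DataFrame({
    "number_of_reviews": [5, 0, 2],
    "last_review": ["2019-06-30", None, "2019-01-01"],
})

# Parse dates; missing entries become NaT instead of raising.
last_review = pd.to_datetime(toy["last_review"], errors="coerce")

# Assumption: treat the most recent review in the data as "today".
reference_date = last_review.max()

# Listings without reviews keep NaN here, which stays informative.
toy["days_since_last_review"] = (reference_date - last_review).dt.days

# Explicit indicator so downstream models need not infer it from NaN.
toy["has_reviews"] = toy["number_of_reviews"] > 0
print(toy)
```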

1.5 Analyze Other Minor Missing Values (name, host_name)

In [None]:
print("\n--- Analyzing Other Minor Missing Values ('name', 'host_name') ---")
minor_missing_cols_info = []

if 'name' in df.columns:
    missing_name_count = df['name'].isnull().sum()
    if missing_name_count > 0:
        print(f"Missing values in 'name': {missing_name_count} ({(missing_name_count/len(df)*100):.2f}%)")
        minor_missing_cols_info.append("'name'")
    else:
        print("No missing values in 'name'.")
else:
    print("Column 'name' not found.")

if 'host_name' in df.columns:
    missing_host_name_count = df['host_name'].isnull().sum()
    if missing_host_name_count > 0:
        print(f"Missing values in 'host_name': {missing_host_name_count} ({(missing_host_name_count/len(df)*100):.2f}%)")
        minor_missing_cols_info.append("'host_name'")
    else:
        print("No missing values in 'host_name'.")
else:
    print("Column 'host_name' not found.")

if minor_missing_cols_info:
    print(f"\nObservation: Columns {', '.join(minor_missing_cols_info)} have a very small number of missing values.")
    print("Proposed Strategy for next steps: Impute with a placeholder like 'Unknown' or 'Not Specified'.")

1.6 Impute Missing reviews_per_month

In [None]:
print("\n--- Imputing 'reviews_per_month' ---")
if 'reviews_per_month' in df.columns:
    print(f"Missing 'reviews_per_month' before imputation: {df['reviews_per_month'].isnull().sum()}")
    df['reviews_per_month'] = df['reviews_per_month'].fillna(0)  # No reviews means a rate of 0 per month
    print(f"Missing 'reviews_per_month' after imputation with 0: {df['reviews_per_month'].isnull().sum()}")
    print("\nSample of 'number_of_reviews' and 'reviews_per_month' after imputation:")
    # Ensure listings_with_zero_reviews is defined or filter again if needed for this check
    listings_with_zero_reviews = df[df['number_of_reviews'] == 0]
    print(listings_with_zero_reviews[['number_of_reviews', 'reviews_per_month']].head())
else:
    print("Error: 'reviews_per_month' column not found for imputation.")

1.7 Impute Missing name

In [None]:
print("\n--- Imputing 'name' ---")
if 'name' in df.columns:
    print(f"Missing 'name' before imputation: {df['name'].isnull().sum()}")
    df['name'] = df['name'].fillna('Unknown')  # Placeholder keeps the row without inventing a name
    print(f"Missing 'name' after imputation with 'Unknown': {df['name'].isnull().sum()}")
else:
    print("Error: 'name' column not found for imputation.")

1.8 Impute Missing host_name

In [None]:
print("\n--- Imputing 'host_name' ---")
if 'host_name' in df.columns:
    print(f"Missing 'host_name' before imputation: {df['host_name'].isnull().sum()}")
    df['host_name'] = df['host_name'].fillna('Unknown')  # Placeholder keeps the row without inventing a name
    print(f"Missing 'host_name' after imputation with 'Unknown': {df['host_name'].isnull().sum()}")
else:
    print("Error: 'host_name' column not found for imputation.")

1.9 Final Check of Missing Values

In [None]:
print("\n--- Final Check of Missing Values After Imputations ---")
if not df.empty:
    final_missing_summary = df.isnull().sum()
    missing_after_imputation = final_missing_summary[final_missing_summary > 0]
    if missing_after_imputation.empty:
        print("No missing values remain in any column.")
    else:
        print("Remaining missing values:")
        print(missing_after_imputation)
        if 'last_review' in missing_after_imputation.index:
            print("Note: 'last_review' will still show NaNs if listings had 0 reviews, which is expected and typically not imputed with a date.")
else:
    print("DataFrame 'df' is not loaded. Cannot perform final check.")

2. Understanding "Busyness" and Demand Indicators (at neighbourhood and neighbourhood_group levels)

2.1 Setup for Visualizations

In [10]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot')
sns.set_palette("viridis")

2.2 Availability Analysis - Overall Distribution of availability_365

In [None]:
print("--- Availability Analysis: Overall Distribution of availability_365 ---")

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.histplot(df['availability_365'], bins=30, kde=False)
plt.title('Histogram of Overall availability_365')
plt.xlabel('Availability (days out of 365)')
plt.ylabel('Number of Listings')

plt.subplot(1, 2, 2)
sns.boxplot(y=df['availability_365'])
plt.title('Box Plot of Overall availability_365')
plt.ylabel('Availability (days out of 365)')

plt.tight_layout()
plt.show()

print("\nDescription of availability_365:")
print(df['availability_365'].describe())
# Listings with 0 availability might be interesting to investigate separately
print(f"\nNumber of listings with 0 availability: {len(df[df['availability_365'] == 0])}")

2.3 Availability Analysis - Per neighbourhood_group and neighbourhood

In [None]:
print("\n--- Availability Analysis: Per neighbourhood_group and neighbourhood ---")

# Average availability per neighbourhood_group
avg_availability_group = df.groupby('neighbourhood_group')['availability_365'].mean().sort_values(ascending=True)
print("\nAverage availability_365 per neighbourhood_group (Sorted by least available):")
print(avg_availability_group)

plt.figure(figsize=(10, 6))
avg_availability_group.plot(kind='bar')
plt.title('Average availability_365 by Neighbourhood Group')
plt.ylabel('Average Availability (days)')
plt.xlabel('Neighbourhood Group')
plt.xticks(rotation=45)
plt.show()

# Box plot of availability_365 by neighbourhood_group
plt.figure(figsize=(10, 6))
sns.boxplot(x='neighbourhood_group', y='availability_365', data=df)
plt.title('Distribution of availability_365 by Neighbourhood Group')
plt.ylabel('Availability (days)')
plt.xlabel('Neighbourhood Group')
plt.xticks(rotation=45)
plt.show()

# Top 10 neighbourhoods with lowest average availability (potentially "busiest" by this metric)
avg_availability_neighbourhood = df.groupby('neighbourhood')['availability_365'].mean().sort_values(ascending=True)
print("\nTop 10 Neighbourhoods with Lowest Average availability_365:")
print(avg_availability_neighbourhood.head(10))

# Top 10 neighbourhoods with highest average availability
print("\nTop 10 Neighbourhoods with Highest Average availability_365:")
print(avg_availability_neighbourhood.tail(10))

2.4 Overall Distribution (number_of_reviews, reviews_per_month)

In [None]:
print("\n--- Review Metrics: Overall Distribution ---")

plt.figure(figsize=(14, 6))

plt.subplot(1, 2, 1)
sns.histplot(df['number_of_reviews'], bins=50, kde=False)
plt.title('Histogram of number_of_reviews')
plt.xlabel('Total Number of Reviews')
plt.ylabel('Number of Listings')
# May need to set xlim due to skewness for better visualization
# plt.xlim(0, df['number_of_reviews'].quantile(0.95)) # Example: view up to 95th percentile

plt.subplot(1, 2, 2)
sns.histplot(df[df['reviews_per_month'] > 0]['reviews_per_month'], bins=50, kde=False) # Exclude 0 for a better view of active listings
plt.title('Histogram of reviews_per_month (for listings with >0 reviews/month)')
plt.xlabel('Reviews per Month')
plt.ylabel('Number of Listings')
# May need to set xlim due to skewness
# plt.xlim(0, df['reviews_per_month'].quantile(0.95)) # Example

plt.tight_layout()
plt.show()

print("\nDescription of number_of_reviews:")
print(df['number_of_reviews'].describe())
print("\nDescription of reviews_per_month (ensure NaNs were handled, e.g., filled with 0):")
print(df['reviews_per_month'].describe())

2.5 Review Metrics - Per neighbourhood_group and neighbourhood

In [None]:
print("\n--- Review Metrics: Per neighbourhood_group and neighbourhood ---")

# Average reviews_per_month per neighbourhood_group
# Ensure reviews_per_month does not have NaNs before this step (e.g., filled with 0)
avg_reviews_group = df.groupby('neighbourhood_group')['reviews_per_month'].mean().sort_values(ascending=False)
print("\nAverage reviews_per_month per neighbourhood_group (Sorted by most active):")
print(avg_reviews_group)

plt.figure(figsize=(10, 6))
avg_reviews_group.plot(kind='bar')
plt.title('Average reviews_per_month by Neighbourhood Group')
plt.ylabel('Average Reviews per Month')
plt.xlabel('Neighbourhood Group')
plt.xticks(rotation=45)
plt.show()

# Box plot of reviews_per_month by neighbourhood_group (for listings with reviews_per_month > 0)
plt.figure(figsize=(10, 6))
sns.boxplot(x='neighbourhood_group', y='reviews_per_month', data=df[df['reviews_per_month'] > 0])
plt.title('Distribution of reviews_per_month (>0) by Neighbourhood Group')
plt.ylabel('Reviews per Month')
plt.xlabel('Neighbourhood Group')
plt.xticks(rotation=45)
# Consider using plt.ylim to zoom in if outliers are too extreme
# plt.ylim(0, df[df['reviews_per_month'] > 0]['reviews_per_month'].quantile(0.95))
plt.show()

# Top 10 neighbourhoods with highest average reviews_per_month (potentially "busiest" by this metric)
avg_reviews_neighbourhood = df.groupby('neighbourhood')['reviews_per_month'].mean().sort_values(ascending=False)
print("\nTop 10 Neighbourhoods with Highest Average reviews_per_month:")
print(avg_reviews_neighbourhood.head(10))

2.6 Listing Density Analysis

In [None]:
print("\n--- Listing Density Analysis ---")

# Listing density per neighbourhood_group
density_group = df.groupby('neighbourhood_group')['id'].count().sort_values(ascending=False)
print("\nListing Density (Count) per neighbourhood_group:")
print(density_group)

plt.figure(figsize=(10, 6))
density_group.plot(kind='bar')
plt.title('Listing Density by Neighbourhood Group')
plt.ylabel('Number of Listings')
plt.xlabel('Neighbourhood Group')
plt.xticks(rotation=45)
plt.show()

# Top 10 neighbourhoods with highest listing density
density_neighbourhood = df.groupby('neighbourhood')['id'].count().sort_values(ascending=False)
print("\nTop 10 Neighbourhoods with Highest Listing Density:")
print(density_neighbourhood.head(10))

2.7 Correlations for Busyness (at Neighbourhood Level)

In [None]:
print("\n--- Correlations for Busyness (at Neighbourhood Level) ---")

# 1. Create a neighbourhood-level DataFrame
neighbourhood_stats = df.groupby('neighbourhood').agg(
    avg_availability_365=('availability_365', 'mean'),
    avg_reviews_per_month=('reviews_per_month', 'mean'), # Assumes reviews_per_month NaNs handled
    listing_density=('id', 'count')
).reset_index()

print("\nNeighbourhood Level Stats (first 5 rows):")
print(neighbourhood_stats.head())

# 2. Calculate the correlation matrix
# For busyness, we expect:
# - avg_availability_365 to be negatively correlated with busyness (lower availability = busier)
# - avg_reviews_per_month to be positively correlated with busyness
# - listing_density to be positively correlated with busyness
# So, let's make 'inverse_avg_availability' for easier interpretation in correlation
neighbourhood_stats['inverse_avg_availability'] = 1 / (neighbourhood_stats['avg_availability_365'] + 0.001) # Add small constant to avoid division by zero

busyness_features_corr = neighbourhood_stats[['inverse_avg_availability', 'avg_reviews_per_month', 'listing_density']].corr()

print("\nCorrelation matrix for potential busyness indicators:")
print(busyness_features_corr)

# 3. Visualize the correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(busyness_features_corr, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Neighbourhood Busyness Indicators')
plt.show()

print("\nInterpretation of Correlation Matrix:")
print("- Positive values close to 1 indicate strong positive correlation.")
print("- Negative values close to -1 indicate strong negative correlation.")
print("- Values close to 0 indicate weak or no linear correlation.")
print("This helps understand if these metrics move together and could form parts of a composite 'busyness' score.")
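
A note on the reciprocal transform used above: 1 / (availability + 0.001) is non-linear, so a neighbourhood whose average availability is at or near 0 maps to a value around 1000 and can dominate a Pearson correlation. A hedged alternative, sketched below on made-up neighbourhood-level numbers, is the linear quantity 365 - availability (which only flips the sign of the Pearson coefficient) together with a rank-based (Spearman) correlation computed as the Pearson correlation of ranks. The name toy_hoods and its values are assumptions for illustration.

```python
import pandas as pd

# Toy neighbourhood-level stats. Values are illustrative only.
toy_hoods = pd.DataFrame({
    "avg_availability_365": [0.0, 50.0, 120.0, 200.0, 300.0],
    "avg_reviews_per_month": [0.2, 1.5, 1.0, 2.0, 2.5],
})
reviews = toy_hoods["avg_reviews_per_month"]

# Reciprocal transform as in the cell above: the first row becomes ~1000.
reciprocal = 1 / (toy_hoods["avg_availability_365"] + 0.001)

# Linear alternative: average number of unavailable days per year.
unavailable_days = 365 - toy_hoods["avg_availability_365"]

pearson_raw = toy_hoods["avg_availability_365"].corr(reviews)
pearson_reciprocal = reciprocal.corr(reviews)
pearson_linear = unavailable_days.corr(reviews)

# Spearman correlation computed as Pearson on ranks (no extra dependency needed).
spearman_linear = unavailable_days.rank().corr(reviews.rank())

print(f"Pearson, raw availability:   {pearson_raw:.3f}")
print(f"Pearson, reciprocal:         {pearson_reciprocal:.3f}")
print(f"Pearson, 365 - availability: {pearson_linear:.3f}")
print(f"Spearman, 365 - availability: {spearman_linear:.3f}")
```

Because the linear version is just a sign flip of the raw correlation, it keeps the "higher means busier" reading without letting a few zero-availability neighbourhoods drive the result.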

Summary of Step 2

Availability (availability_365): A significant portion of listings (over 35%) have zero availability, and half have 45 days or less. Brooklyn and Manhattan show the lowest average availability at the borough level, while specific neighborhoods across different boroughs show extreme highs and lows.

Review Activity (number_of_reviews, reviews_per_month): Most listings have few total reviews and low monthly review rates, indicating skewed distributions where a minority of listings are highly active. Surprisingly, outer boroughs (Staten Island, Queens, Bronx) showed higher average reviews per month per listing than the denser Manhattan and Brooklyn. Specific neighborhoods, often in these outer boroughs, had listings with very high average review velocity.

Listing Density: Manhattan and Brooklyn neighborhoods overwhelmingly have the highest concentration of listings.

Correlations: The neighborhood-level aggregated metrics for busyness (based on inverse availability, average reviews per month, and listing density) showed weak, and sometimes counterintuitive (negative), linear correlations. This suggests these three aspects of "busyness" don't necessarily increase or decrease together.
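
Since the three indicators do not move together, one possible way to combine them is a weighted sum of standardized indicators. The sketch below is only an illustration of that idea: the toy neighbourhoods, the equal weights, and the use of 365 - availability as the "low availability" indicator are assumptions, and the project may well settle on a different definition.

```python
import pandas as pd

# Toy neighbourhood-level stats for hypothetical neighbourhoods A-D. Values are illustrative only.
toy_stats = pd.DataFrame({
    "neighbourhood": ["A", "B", "C", "D"],
    "avg_availability_365": [30.0, 200.0, 120.0, 90.0],
    "avg_reviews_per_month": [1.0, 2.5, 0.5, 1.5],
    "listing_density": [3000, 150, 800, 1200],
}).set_index("neighbourhood")

# Orient every indicator so that a higher value means "busier".
indicators = pd.DataFrame({
    "low_availability": 365 - toy_stats["avg_availability_365"],
    "review_activity": toy_stats["avg_reviews_per_month"],
    "density": toy_stats["listing_density"],
})

# z-score each indicator so that they share a common scale.
z = (indicators - indicators.mean()) / indicators.std(ddof=0)

# Assumption: equal weights. Changing them changes the definition of busyness.
weights = pd.Series({"low_availability": 1 / 3, "review_activity": 1 / 3, "density": 1 / 3})

busyness = (z * weights).sum(axis=1).sort_values(ascending=False)
print(busyness)
```

Keeping the weights explicit makes it easy to check how sensitive the ranking is to the chosen definition.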

3. Defining "Similarity" of Listings and Neighborhoods

3.1 Price Analysis

3.1.1 Overall Price Distribution & Outlier Check 

In [None]:
print("--- Overall Price Distribution ---")
print("Descriptive statistics for 'price':")
print(df['price'].describe())

plt.figure(figsize=(15, 5))

plt.subplot(1, 2, 1)
sns.histplot(df['price'], bins=100, kde=False)
plt.title('Histogram of Price (Original)')
plt.xlabel('Price')
plt.ylabel('Number of Listings')

plt.subplot(1, 2, 2)
sns.boxplot(x=df['price'])
plt.title('Box Plot of Price (Original)')
plt.xlabel('Price')

plt.tight_layout()
plt.show()


3.1.1 (b) Overall Price Distribution with Log Transformation (for Visualization)

In [None]:
print("\n--- Price Distribution (Log Transformed for Visualization) ---")
positive_price = df.loc[df['price'] > 0, 'price']  # Exclude price == 0 listings from the log view
log_price = np.log1p(positive_price)

plt.figure(figsize=(15, 5))

plt.subplot(1, 2, 1)
sns.histplot(log_price, bins=50, kde=True)
plt.title('Histogram of log(1 + Price)')
plt.xlabel('log(1 + Price)')
plt.ylabel('Number of Listings')

plt.subplot(1, 2, 2)
sns.boxplot(x=log_price)
plt.title('Box Plot of log(1 + Price)')
plt.xlabel('log(1 + Price)')

plt.tight_layout()
plt.show()
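
To back the visual impression with a number, the sample skewness can be compared before and after the transformation. The sketch below uses a short made-up price list (an assumption for illustration, chosen to have a long right tail); on the real data the same two calls would be df['price'].skew() and np.log1p(df['price']).skew().

```python
import numpy as np
import pandas as pd

# Toy prices with a long right tail. Values are illustrative only.
prices = pd.Series([40, 60, 80, 100, 120, 150, 200, 400, 2000, 10000])

raw_skew = prices.skew()             # sample skewness of the raw prices
log_skew = np.log1p(prices).skew()   # sample skewness after log(1 + price)

print(f"Skewness of price:          {raw_skew:.2f}")
print(f"Skewness of log(1 + price): {log_skew:.2f}")
```

A value closer to 0 after the transformation indicates a more symmetric distribution, which is why the log scale is easier to read in histograms and box plots.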

3.1.2 Price Distribution per room_type

In [None]:
print("\n--- Price Distribution per room_type ---")

price_to_plot = np.log1p(df[df['price'] > 0]['price']) 
y_label_plot = 'log(1 + Price)' # Or 'Price'

plt.figure(figsize=(12, 7))
sns.boxplot(x='room_type', y=price_to_plot, data=df[df['price'] > 0]) # Ensure data matches price_to_plot
plt.title(f'Distribution of {y_label_plot} by Room Type')
plt.xlabel('Room Type')
plt.ylabel(y_label_plot)
plt.show()

plt.figure(figsize=(12, 7))
sns.violinplot(x='room_type', y=price_to_plot, data=df[df['price'] > 0])
plt.title(f'Distribution of {y_label_plot} by Room Type (Violin Plot)')
plt.xlabel('Room Type')
plt.ylabel(y_label_plot)
plt.show()

print("\nAverage and Median price per room_type:")
print(df.groupby('room_type')['price'].agg(['mean', 'median', 'count']).sort_values(by='median', ascending=False))

3.1.3 Price Distribution per neighbourhood_group

In [None]:
print("\n--- Price Distribution per neighbourhood_group ---")

plt.figure(figsize=(12, 7))
sns.boxplot(x='neighbourhood_group', y=price_to_plot, data=df[df['price'] > 0])
plt.title(f'Distribution of {y_label_plot} by Neighbourhood Group')
plt.xlabel('Neighbourhood Group')
plt.ylabel(y_label_plot)
plt.xticks(rotation=45)
plt.show()

print("\nAverage and Median price per neighbourhood_group:")
avg_price_borough = df.groupby('neighbourhood_group')['price'].agg(['mean', 'median', 'count']).sort_values(by='median', ascending=False)
print(avg_price_borough)

plt.figure(figsize=(10, 6))
avg_price_borough['median'].plot(kind='bar')
plt.title('Median Price by Neighbourhood Group')
plt.ylabel('Median Price')
plt.xlabel('Neighbourhood Group')
plt.xticks(rotation=45)
plt.show()

3.1.4 Price Distribution per neighbourhood (Focused Analysis)

In [None]:
print("\n--- Price Distribution per neighbourhood (Example: Top 10 most common) ---")

# Find the top 10 most common neighborhoods
top_10_neighbourhoods = df['neighbourhood'].value_counts().nlargest(10).index
print(f"Analyzing price for Top 10 most common neighbourhoods: {list(top_10_neighbourhoods)}")

# Filter the DataFrame for these top 10 neighborhoods
df_top_neighbourhoods = df[df['neighbourhood'].isin(top_10_neighbourhoods)]

# Plot log(1 + price) so that extreme prices do not flatten the boxes
df_top_positive_price = df_top_neighbourhoods[df_top_neighbourhoods['price'] > 0]
price_to_plot_neigh = np.log1p(df_top_positive_price['price'])
y_label_plot_neigh = 'log(1 + Price)'

plt.figure(figsize=(18, 8))
sns.boxplot(x=df_top_positive_price['neighbourhood'], y=price_to_plot_neigh, order=top_10_neighbourhoods)
plt.title(f'Distribution of {y_label_plot_neigh} in Top 10 Common Neighbourhoods')
plt.xlabel('Neighbourhood')
plt.ylabel(y_label_plot_neigh)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

print("\nAverage and Median price for Top 10 most common neighbourhoods:")
avg_price_top_neigh = df_top_neighbourhoods.groupby('neighbourhood')['price'].agg(['mean', 'median', 'count']).reindex(top_10_neighbourhoods) # Keep original order
print(avg_price_top_neigh)

3.2 Room Type Analysis

3.2.1 Overall room_type Distribution

In [None]:
print("--- Overall Room Type Distribution ---")

# Frequency counts
room_type_counts = df['room_type'].value_counts()
print("Frequency of each room_type:")
print(room_type_counts)

# Proportions
room_type_proportions = df['room_type'].value_counts(normalize=True) * 100
print("\nProportion of each room_type (%):")
print(room_type_proportions)

# Visualization
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='room_type', order=room_type_counts.index) # Order by frequency
plt.title('Overall Distribution of Room Types')
plt.xlabel('Room Type')
plt.ylabel('Number of Listings')
plt.show()

3.2.2 room_type Distribution per neighbourhood_group

In [None]:
print("\n--- Room Type Distribution per Neighbourhood Group ---")

# Grouped counts
group_room_counts = df.groupby('neighbourhood_group')['room_type'].value_counts(normalize=False).unstack(fill_value=0)
print("Counts of room_type within each neighbourhood_group:")
print(group_room_counts)

# Proportions for stacked bar chart
group_room_proportions = df.groupby('neighbourhood_group')['room_type'].value_counts(normalize=True).mul(100).unstack(fill_value=0)
print("\nProportions (%) of room_type within each neighbourhood_group:")
print(group_room_proportions)

# Visualization - Stacked Bar Chart for Proportions
group_room_proportions.plot(kind='bar', stacked=True, figsize=(12, 7))
plt.title('Proportion of Room Types within each Neighbourhood Group')
plt.xlabel('Neighbourhood Group')
plt.ylabel('Percentage of Listings (%)')
plt.xticks(rotation=45)
plt.legend(title='Room Type')
plt.tight_layout()
plt.show()

# Alternative Visualization - Grouped Bar Chart for Counts (catplot creates its own figure, so no plt.figure call is needed)
g = sns.catplot(data=df, x='neighbourhood_group', hue='room_type', kind='count', height=6, aspect=1.5, legend_out=False)
g.set_axis_labels("Neighbourhood Group", "Number of Listings")
g.fig.suptitle('Distribution of Room Types by Neighbourhood Group (Counts)', y=1.02) # Adjust title position
plt.xticks(rotation=45)
plt.legend(title='Room Type')
plt.tight_layout() # Apply to the figure of the catplot
plt.show()

3.2.2 (b) room_type Distribution per neighbourhood (Focused)

In [None]:
print("\n--- Room Type Distribution per Neighbourhood (Example: Top 5 Most Common) ---")

# Find the top 5 most common neighborhoods
top_5_neighbourhoods = df['neighbourhood'].value_counts().nlargest(5).index
print(f"Analyzing room type distribution for Top 5 most common neighbourhoods: {list(top_5_neighbourhoods)}")

# Filter the DataFrame for these top 5 neighborhoods (separate name so the top-10 frame from 3.1.4 is not overwritten)
df_top5_neighbourhoods = df[df['neighbourhood'].isin(top_5_neighbourhoods)]

# Proportions for stacked bar chart for these specific neighborhoods
top_hoods_room_proportions = df_top5_neighbourhoods.groupby('neighbourhood')['room_type'].value_counts(normalize=True).mul(100).unstack(fill_value=0)
print("\nProportions (%) of room_type within selected top neighbourhoods:")
print(top_hoods_room_proportions.reindex(top_5_neighbourhoods)) # Keep original order

# Visualization - Stacked Bar Chart for Proportions
if not top_hoods_room_proportions.empty:
    top_hoods_room_proportions.reindex(top_5_neighbourhoods).plot(kind='bar', stacked=True, figsize=(14, 7))
    plt.title('Proportion of Room Types within Top 5 Common Neighbourhoods')
    plt.xlabel('Neighbourhood')
    plt.ylabel('Percentage of Listings (%)')
    plt.xticks(rotation=45, ha='right')
    plt.legend(title='Room Type')
    plt.tight_layout()
    plt.show()
else:
    print("No data to plot for top neighbourhoods.")

3.3 Minimum Nights

3.3.1 Overall minimum_nights Distribution & Outlier Check

In [None]:
print("--- Overall minimum_nights Distribution ---")
print("Descriptive statistics for 'minimum_nights':")
print(df['minimum_nights'].describe())

plt.figure(figsize=(15, 5))

plt.subplot(1, 2, 1)
sns.histplot(df['minimum_nights'], bins=100, kde=False) # Using more bins initially
plt.title('Histogram of minimum_nights (Original Scale)')
plt.xlabel('Minimum Nights')
plt.ylabel('Number of Listings')
# The x-axis might be dominated by outliers. We'll address this.

plt.subplot(1, 2, 2)
sns.boxplot(x=df['minimum_nights'])
plt.title('Box Plot of minimum_nights (Original Scale)')
plt.xlabel('Minimum Nights')

plt.tight_layout()
plt.show()

3.3.1 (b) Focused Visualization of minimum_nights

In [None]:
print("\n--- Focused Visualization of minimum_nights ---")


upper_limit = df['minimum_nights'].quantile(0.99) 

plt.figure(figsize=(15, 5))

plt.subplot(1, 2, 1)
sns.histplot(df[df['minimum_nights'] <= upper_limit]['minimum_nights'], bins=30, kde=False)
plt.title(f'Histogram of minimum_nights (up to {upper_limit:.0f} nights)')
plt.xlabel('Minimum Nights')
plt.ylabel('Number of Listings')

plt.subplot(1, 2, 2)
sns.boxplot(x=df[df['minimum_nights'] <= upper_limit]['minimum_nights'])
plt.title(f'Box Plot of minimum_nights (up to {upper_limit:.0f} nights)')
plt.xlabel('Minimum Nights')

plt.tight_layout()
plt.show()


plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)

temp_min_nights_for_log_viz = df[df['minimum_nights'] > 0]['minimum_nights']
sns.histplot(temp_min_nights_for_log_viz, bins=50, kde=False, log_scale=True)
plt.title('Histogram of minimum_nights (Log Scale on X-axis)')
plt.xlabel('Minimum Nights (Log Scale)')
plt.ylabel('Number of Listings')

plt.subplot(1, 2, 2)

if df['minimum_nights'].min() >= 1:
    sns.boxplot(x=np.log(df['minimum_nights']))
    plt.title('Box Plot of log(minimum_nights)')
    plt.xlabel('log(Minimum Nights)')
elif df['minimum_nights'].min() >= 0: # If it can be 0, use log1p
    sns.boxplot(x=np.log1p(df['minimum_nights']))
    plt.title('Box Plot of log1p(minimum_nights)')
    plt.xlabel('log1p(Minimum Nights)')
    
plt.tight_layout()
plt.show()

3.3.2 minimum_nights Distribution per room_type

In [None]:
print("\n--- minimum_nights Distribution per room_type ---")

upper_limit_plot = df['minimum_nights'].quantile(0.95)

plt.figure(figsize=(12, 7))
sns.boxplot(x='room_type', y='minimum_nights', data=df[df['minimum_nights'] <= upper_limit_plot])
plt.title(f'Distribution of minimum_nights (up to {upper_limit_plot:.0f} nights) by Room Type')
plt.xlabel('Room Type')
plt.ylabel(f'Minimum Nights (listings above {upper_limit_plot:.0f} excluded)')
plt.show()


plt.figure(figsize=(12, 7))
sns.boxplot(x='room_type', y='minimum_nights', data=df)
plt.yscale('log') 
plt.title('Distribution of minimum_nights by Room Type (Log Scale on Y-axis)')
plt.xlabel('Room Type')
plt.ylabel('Minimum Nights (Log Scale)')
plt.show()

print("\nAverage and Median minimum_nights per room_type:")
print(df.groupby('room_type')['minimum_nights'].agg(['mean', 'median', 'count']).sort_values(by='median', ascending=False))

3.3.3 minimum_nights Distribution per neighbourhood_group

In [None]:
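print("\n--- minimum_nights Distribution per neighbourhood_group ---")

# Recompute the 95th-percentile cutoff here so this cell does not depend on the 3.3.2 cell having run
upper_limit_plot = df['minimum_nights'].quantile(0.95)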
plt.figure(figsize=(12, 7))
sns.boxplot(x='neighbourhood_group', y='minimum_nights', data=df[df['minimum_nights'] <= upper_limit_plot])
plt.title(f'Distribution of minimum_nights (up to {upper_limit_plot:.0f} nights) by Neighbourhood Group')
plt.xlabel('Neighbourhood Group')
plt.ylabel(f'Minimum Nights (listings above {upper_limit_plot:.0f} excluded)')
plt.xticks(rotation=45)
plt.show()

plt.figure(figsize=(12, 7))
sns.boxplot(x='neighbourhood_group', y='minimum_nights', data=df)
plt.yscale('log') # Apply log scale to y-axis
plt.title('Distribution of minimum_nights by Neighbourhood Group (Log Scale on Y-axis)')
plt.xlabel('Neighbourhood Group')
plt.ylabel('Minimum Nights (Log Scale)')
plt.xticks(rotation=45)
plt.show()

print("\nAverage and Median minimum_nights per neighbourhood_group:")
avg_min_nights_borough = df.groupby('neighbourhood_group')['minimum_nights'].agg(['mean', 'median', 'count']).sort_values(by='median', ascending=False)
print(avg_min_nights_borough)

plt.figure(figsize=(10, 6))
avg_min_nights_borough['median'].plot(kind='bar')
plt.title('Median minimum_nights by Neighbourhood Group')
plt.ylabel('Median Minimum Nights')
plt.xlabel('Neighbourhood Group')
plt.xticks(rotation=45)
plt.show()

3.4 Host Characteristics

3.4.1 Overall calculated_host_listings_count Distribution

In [None]:
print("--- Overall calculated_host_listings_count Distribution ---")
print("Descriptive statistics for 'calculated_host_listings_count':")
print(df['calculated_host_listings_count'].describe())

plt.figure(figsize=(15, 5))

plt.subplot(1, 2, 1)
sns.histplot(df['calculated_host_listings_count'], bins=100, kde=False)
plt.title('Histogram of Host Listings Count (Original Scale)')
plt.xlabel('Number of Listings by Host')
plt.ylabel('Number of Listings (from those hosts)')
# The x-axis will likely be dominated by hosts with few listings.

plt.subplot(1, 2, 2)
sns.boxplot(x=df['calculated_host_listings_count'])
plt.title('Box Plot of Host Listings Count (Original Scale)')
plt.xlabel('Number of Listings by Host')

plt.tight_layout()
plt.show()

print("\nValue counts for host listings count (Top 10):")
print(df['calculated_host_listings_count'].value_counts().nlargest(10))

print(f"\nPercentage of listings by hosts with only 1 listing: {(df['calculated_host_listings_count'] == 1).mean()*100:.2f}%")
print(f"Percentage of listings by hosts with 1 to 2 listings: {(df['calculated_host_listings_count'] <= 2).mean()*100:.2f}%")
print(f"Percentage of listings by hosts with more than 10 listings: {(df['calculated_host_listings_count'] > 10).mean()*100:.2f}%")


3.4.1 (b) Focused Visualization & Categorization of Host Types

In [None]:
print("\n--- Focused Visualization of calculated_host_listings_count ---")

upper_limit_host_viz = df['calculated_host_listings_count'].quantile(0.95)

plt.figure(figsize=(12, 5))
sns.histplot(df[df['calculated_host_listings_count'] <= upper_limit_host_viz]['calculated_host_listings_count'],
             bins=int(upper_limit_host_viz), kde=False)
plt.title(f'Histogram of Host Listings Count (up to {upper_limit_host_viz:.0f} listings)')
plt.xlabel('Number of Listings by Host')
plt.ylabel('Number of Listings (from those hosts)')
plt.show()

plt.figure(figsize=(12, 5))
sns.histplot(df['calculated_host_listings_count'], bins=50, kde=False, log_scale=(False, True))
plt.title('Histogram of Host Listings Count (Log Scale on Y-axis)')
plt.xlabel('Number of Listings by Host')
plt.ylabel('Number of Listings (Log Scale)')
plt.show()


bins = [0, 1, 2, 5, 10, 50, df['calculated_host_listings_count'].max() + 1]
labels = ['1', '2', '3-5', '6-10', '11-50', '51+']
df['host_type_category'] = pd.cut(df['calculated_host_listings_count'], bins=bins, labels=labels, right=True)

print("\nDistribution of listings by derived host_type_category:")
host_type_counts = df['host_type_category'].value_counts().reindex(labels) # Ensure correct order
print(host_type_counts)

plt.figure(figsize=(10, 6))
host_type_counts.plot(kind='bar')
plt.title('Number of Listings by Host Type Category')
plt.xlabel('Host Type (Number of Listings Managed)')
plt.ylabel('Number of Listings')
plt.xticks(rotation=45)
plt.show()

3.4.2 Host Type Distribution per room_type

In [None]:
print("\n--- Host Type Distribution per room_type ---")

if 'host_type_category' in df.columns:
    # Proportions for stacked bar chart
    room_host_type_proportions = df.groupby('room_type')['host_type_category'].value_counts(normalize=True).mul(100).unstack(fill_value=0)
    print("\nProportions (%) of listings by host_type_category within each room_type:")
    print(room_host_type_proportions)

    room_host_type_proportions.plot(kind='bar', stacked=True, figsize=(12, 7))
    plt.title('Proportion of Listings by Host Type within each Room Type')
    plt.xlabel('Room Type')
    plt.ylabel('Percentage of Listings (%)')
    plt.xticks(rotation=0)
    plt.legend(title='Host Type (Listings Managed)', bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.tight_layout()
    plt.show()
else:
    print("Column 'host_type_category' not created. Skipping this plot.")

# Boxplots of the original 'calculated_host_listings_count' by room_type
plt.figure(figsize=(10, 6))
sns.boxplot(x='room_type', y='calculated_host_listings_count', data=df)
plt.yscale('log') # Apply log scale to y-axis
plt.title('Distribution of Host Listings Count by Room Type (Log Scale on Y-axis)')
plt.xlabel('Room Type')
plt.ylabel('Host Listings Count (Log Scale)')
plt.show()

3.5 Relationships between Features

3.5.1 Correlation Matrix for Numerical Features related to Price

In [None]:
print("--- Correlation Matrix for Price and Related Numerical Features ---")

# Select numerical features that might correlate with price
numerical_features_for_corr = ['price', 'log_price', 'minimum_nights', 'number_of_reviews',
                               'reviews_per_month', 'calculated_host_listings_count', 'availability_365']


# Create log_price if it is missing, then keep only the columns that actually exist in df
if 'log_price' not in df.columns and 'price' in df.columns:
    df['log_price'] = np.log1p(df['price'])
numerical_features_for_corr = [col for col in numerical_features_for_corr if col in df.columns]


correlation_matrix = df[numerical_features_for_corr].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix of Price and Other Numerical Features')
plt.show()

3.5.2 Scatter Plot - log_price vs. number_of_reviews

In [None]:
print("\n--- Scatter Plot: log_price vs. number_of_reviews ---")

if 'log_price' in df.columns and 'number_of_reviews' in df.columns and 'room_type' in df.columns:
    plt.figure(figsize=(12, 7))
    sns.scatterplot(x='number_of_reviews', y='log_price', data=df, hue='room_type', alpha=0.3, s=20) # s for marker size
    plt.title('log_price vs. Number of Reviews (colored by Room Type)')
    plt.xlabel('Number of Reviews')
    plt.ylabel('log(1 + Price)')
    plt.show()
else:
    print("Skipping scatter plot: 'log_price', 'number_of_reviews', or 'room_type' not found.")


3.5.3 Scatter Plot - log_price vs. reviews_per_month

In [None]:
print("\n--- Scatter Plot: log_price vs. reviews_per_month ---")

if 'log_price' in df.columns and 'reviews_per_month' in df.columns and 'room_type' in df.columns:
    plt.figure(figsize=(12, 7))
    sns.scatterplot(x='reviews_per_month', y='log_price', data=df, hue='room_type', alpha=0.3, s=20)
    plt.title('log_price vs. Reviews per Month (colored by Room Type)')
    plt.xlabel('Reviews per Month')
    plt.ylabel('log(1 + Price)')
    plt.show()
else:
    print("Skipping scatter plot: 'log_price', 'reviews_per_month', or 'room_type' not found.")


3.5.4 Scatter Plot - log_price vs. availability_365

In [None]:
print("\n--- Scatter Plot: log_price vs. availability_365 ---")

if 'log_price' in df.columns and 'availability_365' in df.columns and 'room_type' in df.columns:
    plt.figure(figsize=(12, 7))
    sns.scatterplot(x='availability_365', y='log_price', data=df, hue='room_type', alpha=0.3, s=20)
    plt.title('log_price vs. Availability_365 (colored by Room Type)')
    plt.xlabel('Availability (days out of 365)')
    plt.ylabel('log(1 + Price)')
    plt.show()
else:
    print("Skipping scatter plot: 'log_price', 'availability_365', or 'room_type' not found.")

3.5.5 Price vs. calculated_host_listings_count

In [None]:
print("\n--- Price vs. calculated_host_listings_count ---")

if 'log_price' in df.columns and 'calculated_host_listings_count' in df.columns:
    plt.figure(figsize=(12, 7))
    sns.scatterplot(x='calculated_host_listings_count', y='log_price', data=df, alpha=0.3, s=20)
    plt.xscale('log') # Log scale for x-axis due to skewness of host listings count
    plt.title('log_price vs. Host Listings Count (Log Scale on X-axis)')
    plt.xlabel('Number of Listings by Host (Log Scale)')
    plt.ylabel('log(1 + Price)')
    plt.show()

    if 'host_type_category' in df.columns:
        plt.figure(figsize=(12, 7))
        sns.boxplot(x='host_type_category', y='log_price', data=df, order=['1', '2', '3-5', '6-10', '11-50', '51+'])
        plt.title('log_price by Host Type Category')
        plt.xlabel('Host Type (Number of Listings Managed)')
        plt.ylabel('log(1 + Price)')
        plt.xticks(rotation=45)
        plt.show()
else:
    print("Skipping plot: 'log_price' or 'calculated_host_listings_count' not found.")

Summary of Step 3 - Defining "Similarity" of Listings and Neighborhoods (for Neighborhood Characterization)

Price Profile:

The overall price distribution is heavily right-skewed, with a median price of $106 but a mean of $153 and a maximum of $10,000. This skewness persists within different room_types and neighbourhood_groups.
room_type is a primary price driver: 'Entire home/apt' (median $160) is significantly more expensive than 'Private room' (median $70), which is pricier than 'Shared room' (median $45).
neighbourhood_group also dictates price: Manhattan (median $150) is the most expensive borough, followed by Brooklyn (median $90), then Queens and Staten Island (both median $75), and the Bronx (median $65).
Significant price variation exists even within the most common neighborhoods of the same borough (e.g., in Manhattan, Midtown's median price is $210, while Harlem's is $89).
price shows only a weak linear correlation with number_of_reviews and availability_365, but it is strongly influenced by room_type and broad location (neighbourhood_group).
Room Type Composition:

The market is dominated by 'Entire home/apt' (approx. 52%) and 'Private room' (approx. 45.7%), with 'Shared room' being a small niche (approx. 2.4%).
The mix of room_types varies substantially by neighbourhood_group: Manhattan has a higher proportion of 'Entire home/apt' (~61%), while Queens and the Bronx have more 'Private room' listings (~60%). Brooklyn and Staten Island show a more balanced mix.
This variation extends to specific popular neighborhoods (e.g., within Manhattan, Harlem has mostly 'Private room' listings while the Upper West Side has mostly 'Entire home/apt' listings).
Minimum Nights Profile:

The overall distribution of minimum_nights is highly right-skewed (median 3 nights, mean ~7 nights, max 1250 nights). About 75% of listings require a minimum stay of 5 nights or less.
Median minimum_nights are short across all boroughs (2-3 nights). However, Manhattan and Brooklyn show higher mean minimum_nights due to a greater prevalence of listings with very long minimum stay requirements.
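The gap between median and mean described above is what a long right tail produces. A minimal sketch with made-up values (not taken from the dataset) shows how a couple of very long minimum stays pull the mean far above the median while the quartiles barely move:

```python
import pandas as pd

# Hypothetical minimum_nights values: mostly short stays plus two very long ones.
min_nights = pd.Series([1, 1, 2, 2, 3, 3, 3, 4, 5, 30, 365], name="minimum_nights")

summary = {
    "median": float(min_nights.median()),      # robust to the long tail
    "mean": float(min_nights.mean()),          # pulled up by the two long stays
    "p75": float(min_nights.quantile(0.75)),   # 75th percentile stays small
}
print(summary)
```

This is why the profiles in this notebook lean on medians and quantiles rather than means when describing a "typical" minimum stay.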
Host Profile (calculated_host_listings_count):

About two-thirds of listings (~66%) come from hosts managing only a single property; about 80% are from hosts with 1 or 2 listings.
A "long tail" of professional hosts managing many properties exists, though hosts with three or more listings account for only about 20% of total listings.
The distribution of host types varies by room_type: 'Entire home/apt' listings come both from many single-listing hosts and from the largest professional operators; 'Private room' listings are more common among small to mid-size hosts; and 'Shared room' listings often involve hosts managing a few to many listings (but not the very largest operators).
Listings managed by very large hosts appear to have a narrower (more standardized) price range.
Implications for Defining Neighborhood Similarity:
To define "characteristically similar neighborhoods," a profile for each neighborhood should be constructed based on aggregated features reflecting its typical offerings. Key components of this profile will include: its typical price level(s) and distribution, its mix of available room types, its common minimum stay requirements, and potentially the predominant host profile (e.g., prevalence of single-listing vs. professional hosts). The EDA suggests these characteristics vary significantly and are crucial for distinguishing neighborhood types.
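As a rough sketch of what such a per-neighbourhood profile could look like, the function below aggregates listing-level rows into one row per neighbourhood. The function name `build_neighbourhood_profiles`, the output column names, and the toy rows are hypothetical; only the input column names come from the dataset.

```python
import pandas as pd

def build_neighbourhood_profiles(listings: pd.DataFrame) -> pd.DataFrame:
    """Aggregate listing-level rows into one profile row per neighbourhood."""
    grouped = listings.groupby(["neighbourhood_group", "neighbourhood"])

    # Typical price level and spread, typical minimum stay, and host mix.
    profile = grouped.agg(
        listing_count=("price", "size"),
        median_price=("price", "median"),
        price_iqr=("price", lambda s: s.quantile(0.75) - s.quantile(0.25)),
        median_min_nights=("minimum_nights", "median"),
        single_host_share=("calculated_host_listings_count", lambda s: (s == 1).mean()),
    )

    # Room-type mix as row-normalised shares (each row sums to 1).
    room_mix = pd.crosstab(
        [listings["neighbourhood_group"], listings["neighbourhood"]],
        listings["room_type"],
        normalize="index",
    ).add_prefix("share_")

    return profile.join(room_mix)

# Toy rows with made-up values, only to show the shape of the result.
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Queens"] * 2,
    "neighbourhood": ["Harlem", "Harlem", "Midtown", "Midtown", "Astoria", "Astoria"],
    "room_type": ["Private room", "Private room", "Entire home/apt",
                  "Entire home/apt", "Private room", "Entire home/apt"],
    "price": [80, 100, 200, 220, 70, 110],
    "minimum_nights": [2, 3, 1, 30, 2, 2],
    "calculated_host_listings_count": [1, 1, 5, 1, 2, 1],
})
profiles = build_neighbourhood_profiles(toy)
print(profiles.round(2))
```

On the real data the same call would be `build_neighbourhood_profiles(df)`; neighbourhoods with very few listings may need a minimum `listing_count` filter before their profiles are trusted.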

4. Understanding neighbourhood_group and neighbourhood Characteristics

4.1 Price Profile per neighbourhood_group

In [None]:
print("--- Price Profile per Neighbourhood Group ---")

# Define custom quantiles for price range
def q1(x): return x.quantile(0.25)
def q3(x): return x.quantile(0.75)

price_profile_group = df.groupby('neighbourhood_group')['price'].agg(
    ['min', q1, 'median', 'mean', q3, 'max', 'std', 'count']
).sort_values(by='median', ascending=False)

price_profile_group.rename(columns={'q1': '25th_percentile', 'q3': '75th_percentile'}, inplace=True)

print("Price profile (min, 25th, median, mean, 75th, max, std, count) per neighbourhood_group:")
print(price_profile_group)

# Visualization: box plot of price by neighbourhood_group
# Use log_price when available, since raw price is heavily right-skewed
price_col_to_plot = 'log_price' if 'log_price' in df.columns else 'price'
y_label = 'log(1+Price)' if price_col_to_plot == 'log_price' else 'Price'

plt.figure(figsize=(10, 6))
sns.boxplot(x='neighbourhood_group', y=price_col_to_plot, data=df, order=price_profile_group.index)
plt.title(f'Price Distribution by Neighbourhood Group ({y_label})')
plt.xlabel('Neighbourhood Group')
plt.ylabel(y_label)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

4.2 Availability Profile per neighbourhood_group

In [None]:
print("\n--- Availability Profile per Neighbourhood Group ---")

availability_profile_group = df.groupby('neighbourhood_group')['availability_365'].agg(
    ['min', q1, 'median', 'mean', q3, 'max', 'std', 'count']
).sort_values(by='median', ascending=True) # Sorted by typically least available

availability_profile_group.rename(columns={'q1': '25th_percentile', 'q3': '75th_percentile'}, inplace=True)

print("Availability_365 profile (min, 25th, median, mean, 75th, max, std, count) per neighbourhood_group:")
print(availability_profile_group)

# Visualization
plt.figure(figsize=(10, 6))
sns.boxplot(x='neighbourhood_group', y='availability_365', data=df, order=availability_profile_group.index)
plt.title('Availability_365 Distribution by Neighbourhood Group')
plt.xlabel('Neighbourhood Group')
plt.ylabel('Availability (days out of 365)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

4.3 Dominant Room Type(s) per neighbourhood_group

In [None]:
print("\n--- Dominant Room Type(s) per Neighbourhood Group ---")

# Calculate proportions of each room_type within each neighbourhood_group
room_type_proportions_group = pd.crosstab(df['neighbourhood_group'], df['room_type'], normalize='index') * 100
room_type_proportions_group = room_type_proportions_group.round(2) # Round for cleaner display

print("Proportions (%) of room_types within each neighbourhood_group:")
print(room_type_proportions_group)

# To find the top 1 or 2 dominant types programmatically:
print("\nDominant room types (top 1 or 2 based on count):")
for group, data in df.groupby('neighbourhood_group'):
    print(f"\n{group}:")
    top_room_types = data['room_type'].value_counts().nlargest(2) # Get top 2
    for room_type, count in top_room_types.items():
        percentage = (count / len(data)) * 100
        print(f"  - {room_type}: {count} listings ({percentage:.2f}%)")


# Visualization: stacked bar chart of room-type proportions per neighbourhood_group
if not room_type_proportions_group.empty:
    room_type_proportions_group.plot(kind='bar', stacked=True, figsize=(12, 7))
    plt.title('Proportion of Room Types within each Neighbourhood Group')
    plt.xlabel('Neighbourhood Group')
    plt.ylabel('Percentage of Listings (%)')
    plt.xticks(rotation=45)
    plt.legend(title='Room Type', bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.tight_layout()
    plt.show()

4.4 Review Volume Profile per neighbourhood_group

In [None]:
print("\n--- Review Volume Profile per Neighbourhood Group ---")

print("\n--- Profile for 'number_of_reviews' (Total Reviews Received) ---")
review_count_profile_group = df.groupby('neighbourhood_group')['number_of_reviews'].agg(
    ['sum', q1, 'median', 'mean', q3, 'max', 'count'] # Sum might be interesting too
).sort_values(by='median', ascending=False)
review_count_profile_group.rename(columns={'q1': '25th_percentile', 'q3': '75th_percentile'}, inplace=True)
print(review_count_profile_group)

plt.figure(figsize=(10,6))
sns.boxplot(x='neighbourhood_group', y='number_of_reviews', data=df, order=review_count_profile_group.index)
plt.yscale('symlog') # Symlog keeps listings with zero reviews visible; a plain log scale cannot show zeros
plt.title('Distribution of number_of_reviews by Neighbourhood Group (Symlog Scale Y)')
plt.xlabel('Neighbourhood Group')
plt.ylabel('Number of Reviews (Symlog Scale)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


print("\n--- Profile for 'reviews_per_month' (Review Activity Rate) ---")
# Ensure 'reviews_per_month' has NaNs filled with 0 as per earlier EDA steps
rpm_profile_group = df.groupby('neighbourhood_group')['reviews_per_month'].agg(
    ['sum', q1, 'median', 'mean', q3, 'max', 'count']
).sort_values(by='median', ascending=False)
rpm_profile_group.rename(columns={'q1': '25th_percentile', 'q3': '75th_percentile'}, inplace=True)
print(rpm_profile_group)

plt.figure(figsize=(10,6))
sns.boxplot(x='neighbourhood_group', y='reviews_per_month', data=df, order=rpm_profile_group.index)
upper_rpm_viz_limit = df.loc[df['reviews_per_month'] > 0, 'reviews_per_month'].quantile(0.95) # 95th percentile of non-zero values
if pd.notna(upper_rpm_viz_limit) and upper_rpm_viz_limit > 0: # Only limit the axis when the cutoff is valid
    plt.ylim(0, upper_rpm_viz_limit)
    plt.title(f'Distribution of reviews_per_month by Neighbourhood Group (y-axis limited to {upper_rpm_viz_limit:.2f})')
else:
    plt.title('Distribution of reviews_per_month by Neighbourhood Group')
plt.xlabel('Neighbourhood Group')
plt.ylabel('Reviews per Month')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

4.5 Intra-Neighbourhood Variation - Price (log_price)

In [None]:
print("--- Intra-Borough Variation: Price (log_price) ---")

if 'log_price' not in df.columns:
    print("Warning: 'log_price' column not found. Please create it first for better price visualization.")
    # As a fallback, try to use original 'price' but it might be heavily skewed
    price_col_to_plot = 'price'
    price_y_label = 'Price (Original Scale - may be skewed)'
else:
    price_col_to_plot = 'log_price'
    price_y_label = 'log(1 + Price)'


for group_name, group_df in df.groupby('neighbourhood_group'):
    plt.figure(figsize=(15, 8)) # Adjust size as needed
    
    ordered_neighbourhoods = group_df.groupby('neighbourhood')[price_col_to_plot].median().sort_values().index
    
    sns.boxplot(x='neighbourhood', y=price_col_to_plot, data=group_df, order=ordered_neighbourhoods)
    
    plt.title(f'Price Distribution by Neighbourhood in {group_name}\n({price_y_label})')
    plt.xlabel('Neighbourhood')
    plt.ylabel(price_y_label)
    plt.xticks(rotation=90, ha='right', fontsize=8) # Rotate labels for readability
    plt.tight_layout() # Adjust layout to make room for labels
    plt.show()
    
    print(f"Descriptive statistics for {price_col_to_plot} in {group_name} (Top 5 neighbourhoods by median {price_col_to_plot}):")
    top_5_hoods_price_stats = group_df.groupby('neighbourhood')[price_col_to_plot].agg(['median', 'mean', 'count', 'std']).loc[ordered_neighbourhoods[-5:]].sort_values('median', ascending=False)
    print(top_5_hoods_price_stats)
    print("-" * 50)

4.6 Intra-Neighbourhood Variation - Listing Types (Room Type Proportions)

In [None]:
print("\n--- Intra-Borough Variation: Listing Types (Room Type Proportions) ---")

# Calculate room type proportions for each neighbourhood
neighbourhood_room_counts = df.groupby(['neighbourhood_group', 'neighbourhood', 'room_type']).size().unstack(fill_value=0)
neighbourhood_room_proportions = neighbourhood_room_counts.apply(lambda x: x / x.sum() if x.sum() > 0 else x, axis=1).reset_index()

# Get the list of unique room types to plot
room_types_to_plot = df['room_type'].unique()

for group_name in df['neighbourhood_group'].unique():
    group_df_proportions = neighbourhood_room_proportions[neighbourhood_room_proportions['neighbourhood_group'] == group_name]
    
    if group_df_proportions.empty:
        print(f"No data for {group_name} to plot room type proportions.")
        continue
        
    fig, axes = plt.subplots(len(room_types_to_plot), 1, figsize=(15, 5 * len(room_types_to_plot)), sharex=False) # sharex=False might be better if neighborhood lists differ
    if len(room_types_to_plot) == 1: # handles case of single room type if dataset was filtered
        axes = [axes] 

    fig.suptitle(f'Distribution of Room Type Proportions across Neighbourhoods in {group_name}', fontsize=16, y=1.02)
    
    for i, room_type_col in enumerate(room_types_to_plot):
        if room_type_col in group_df_proportions.columns:

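            # Each neighbourhood has exactly one proportion, so a bar plot is used (a box plot would collapse to a single line)
            ordered_hoods_rt = group_df_proportions.sort_values(by=room_type_col)['neighbourhood'].tolist()
            sns.barplot(ax=axes[i], x='neighbourhood', y=room_type_col, data=group_df_proportions, order=ordered_hoods_rt, color='steelblue')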
            axes[i].set_title(f'Proportion of "{room_type_col}" listings')
            axes[i].set_xlabel('') # Avoid repetitive x-labels
            axes[i].set_ylabel('Proportion')
            axes[i].tick_params(axis='x', rotation=90, labelsize=8)
        else:
            axes[i].set_title(f'No "{room_type_col}" listings found in proportions for this group.')
            axes[i].tick_params(axis='x', rotation=90, labelsize=8)


    plt.tight_layout(rect=[0, 0, 1, 0.98]) # Adjust layout to make room for suptitle
    plt.show()
    
    print(f"Summary of 'Entire home/apt' proportions in {group_name} (example):")
    if 'Entire home/apt' in group_df_proportions.columns:
        print(group_df_proportions[['neighbourhood', 'Entire home/apt']].describe())
    print("-" * 50)

Summary of Step 4 - Understanding neighbourhood_group and neighbourhood Characteristics

Distinct neighbourhood_group Level Profiles:

Price & Room Type Dynamics:
A clear price hierarchy exists: 'Manhattan' is the most expensive neighbourhood_group (median price ~$150), followed by 'Brooklyn' (~$90), then 'Queens' and 'Staten Island' (both ~$75), with the 'Bronx' being the most affordable (~$65).
Dominant room_types vary: 'Manhattan' is characterized by a majority of 'Entire home/apt' listings (~61%). 'Brooklyn' and 'Staten Island' show a more balanced mix between 'Entire home/apt' and 'Private room'. 'Queens' and the 'Bronx' lean more towards 'Private room' listings (~60%), with the 'Bronx' also having the highest relative proportion of 'Shared room' listings.
Availability Patterns:
Listings in 'Brooklyn' (median 28 days) and 'Manhattan' (median 36 days) exhibit the lowest median availability, suggesting tighter overall supply or higher occupancy.
Listings in 'Queens' and the 'Bronx' are available for more days on average, while 'Staten Island' listings are typically the most available throughout the year (median 219 days).
Review Volume & Activity:
The typical listing (median) in 'Staten Island' shows the highest number of accumulated reviews and the highest monthly review velocity, despite it having the fewest total listings. The 'Bronx' and 'Queens' follow with robust review activity per listing.
'Manhattan' and 'Brooklyn', despite their high density and prices, show lower median review counts and monthly review rates per listing.
Significant Variation at the neighbourhood Level (Within neighbourhood_groups):

Price Variation: All neighbourhood_groups demonstrate considerable price differences between their constituent neighbourhoods. For example, within Manhattan, median prices vary widely even among its most common neighborhoods (e.g., Harlem ~$89 vs. Midtown ~$210).
Room Type Mix Variation: The proportion of different room_types is not uniform within any neighbourhood_group. For instance, some Manhattan neighbourhoods are heavily skewed towards 'Entire home/apt', while others like Harlem have a much higher share of 'Private room' listings, deviating from the overall neighbourhood_group average.
Availability Variation: Visualizations indicated that availability patterns also differ widely among neighbourhoods within the same neighbourhood_group; some neighbourhoods are far more supply-constrained than others.
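The within-group variation listed above can also be quantified rather than only read off plots. The sketch below computes each neighbourhood's median for a chosen column and then reports how far those medians spread inside each neighbourhood_group. The helper name `intra_group_spread` and the toy values are hypothetical.

```python
import pandas as pd

def intra_group_spread(listings: pd.DataFrame, value_col: str) -> pd.DataFrame:
    """Summarise how much neighbourhood medians differ inside each neighbourhood_group."""
    hood_medians = (
        listings.groupby(["neighbourhood_group", "neighbourhood"])[value_col]
        .median()
        .rename("hood_median")
        .reset_index()
    )
    # min/max of the neighbourhood medians, plus how many neighbourhoods contributed.
    spread = hood_medians.groupby("neighbourhood_group")["hood_median"].agg(["min", "max", "count"])
    spread["range"] = spread["max"] - spread["min"]
    return spread.sort_values("range", ascending=False)

# Made-up availability values, two neighbourhoods per group.
toy_avail = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Queens"] * 4,
    "neighbourhood": ["Harlem", "Harlem", "Midtown", "Midtown",
                      "Astoria", "Astoria", "Flushing", "Flushing"],
    "availability_365": [10, 30, 200, 300, 100, 120, 90, 150],
})
availability_spread = intra_group_spread(toy_avail, "availability_365")
print(availability_spread)
```

The same helper can be pointed at `price` or `minimum_nights` on the full DataFrame, for example `intra_group_spread(df, 'price')`.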
Overall Implication for Your Project:

The key insight is that while neighbourhood_group provides essential high-level context (e.g., general price tier, broad availability pressures), significant heterogeneity exists among neighbourhoods within each neighbourhood_group.
Therefore, to effectively identify "characteristically similar neighborhoods" for your recommendation system, relying solely on neighbourhood_group-level characteristics would be insufficient. It's crucial to develop and use granular, neighbourhood-specific profiles based on price, room type composition, typical availability, and other explored features to capture their unique characters.
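One possible way to compare such neighbourhood-specific profiles, offered as a sketch rather than the project's chosen method, is to standardise each profile feature and rank neighbourhoods by Euclidean distance to a target. The function name `most_similar`, the feature columns, and the `hood_*` rows with their values are all hypothetical.

```python
import numpy as np
import pandas as pd

def most_similar(profiles: pd.DataFrame, target: str, k: int = 2) -> pd.Series:
    """Return the k profiles closest to `target` by Euclidean distance on z-scored features."""
    # Standardise each feature so price (dollars) does not dominate shares (0 to 1).
    std = profiles.std(ddof=0).replace(0, 1.0)  # guard against constant columns
    z = (profiles - profiles.mean()) / std
    distances = np.sqrt(((z - z.loc[target]) ** 2).sum(axis=1))
    return distances.drop(index=target).nsmallest(k)

# Made-up profile rows: two cheaper, private-room-heavy areas and two pricier ones.
toy_profiles = pd.DataFrame(
    {
        "median_price": [90, 95, 210, 200],
        "share_entire_home": [0.30, 0.35, 0.80, 0.75],
        "median_availability": [40, 50, 30, 35],
    },
    index=["hood_a", "hood_b", "hood_c", "hood_d"],
)
nearest = most_similar(toy_profiles, "hood_a", k=1)
print(nearest)
```

Equal weighting of z-scored features is an assumption here; a real system might weight price or room-type mix more heavily, or use a different distance measure.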