## 2. Data Loading, Re-Pre-Processing and Feature Engineering

Our first step in the EDA process is to load the `AB_NYC_2019.csv` dataset. Missing values handled, log_price created, host_type_category potentially created based on findings in the EDA process.

### 2.1. Load Data

In [40]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler 
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors 
import matplotlib.pyplot as plt
import seaborn as sns


try:
    df = pd.read_csv("..\\data\\AB_NYC_2019.csv")
    df['reviews_per_month'] = df['reviews_per_month'].fillna(0)
    df['name'] = df['name'].fillna('Unknown')
    df['host_name'] = df['host_name'].fillna('Unknown')
    if 'price' in df.columns and df['price'].min() >= 0:
        df['log_price'] = np.log1p(df['price'])
    bins_host = [0, 1, 2, 5, 10, 50, df['calculated_host_listings_count'].max() + 1]
    labels_host = ['1', '2', '3-5', '6-10', '11-50', '51+']
    df['host_type_category'] = pd.cut(df['calculated_host_listings_count'], bins=bins_host, labels=labels_host, right=True)
    print("DataFrame loaded and initial cleaning/preparation assumed complete.")
except FileNotFoundError:
    print("Ensure df is loaded and preprocessed in notebook.")
    df = pd.DataFrame({
        'price': np.random.exponential(150, 5000),
        'log_price': np.log1p(np.random.exponential(150, 5000)),
        'room_type': np.random.choice(['Entire home/apt', 'Private room', 'Shared room'], 5000, p=[0.5, 0.45, 0.05]),
        'minimum_nights': np.random.choice([1,2,3,7,30,90], 5000, p=[0.5,0.2,0.1,0.1,0.05,0.05]),
        'calculated_host_listings_count': np.random.choice([1,2,3,10,50,100], 5000, p=[0.6,0.15,0.1,0.05,0.05,0.05]),
        'number_of_reviews': np.random.randint(0,100,5000),
        'availability_365': np.random.randint(0, 366, 5000),
        'reviews_per_month': np.random.rand(5000) * 5,
        'neighbourhood_group': np.random.choice(['Manhattan', 'Brooklyn', 'Queens', 'Bronx', 'Staten Island'], 5000),
        'neighbourhood': [f"Hood_{i%20}_{np.random.choice(['Manhattan', 'Brooklyn', 'Queens', 'Bronx', 'Staten Island'])}" for i in range(5000)],
        'id': range(5000)
    })
    bins_host = [0, 1, 2, 5, 10, 50, df['calculated_host_listings_count'].max() + 1]
    labels_host = ['1', '2', '3-5', '6-10', '11-50', '51+']
    df['host_type_category'] = pd.cut(df['calculated_host_listings_count'], bins=bins_host, labels=labels_host, right=True)
    print("Using dummy data for script execution.")


plt.style.use('ggplot')
sns.set_palette("muted")

DataFrame loaded and initial cleaning/preparation assumed complete.


### 2.1 Define Price Tiers (for Neighborhood Profiling)

In [41]:

print("--- Defining Price Tiers ---")

if 'price' in df.columns:
    # Calculate the required percentiles
    price_q1 = df['price'].quantile(0.25)    # 25th percentile
    price_q3 = df['price'].quantile(0.75)    # 75th percentile
    price_q95 = df['price'].quantile(0.95)   # 95th percentile
    max_price = df['price'].max()

    print(f"Percentiles used for tiers: Q1=${price_q1:.2f}, Q3=${price_q3:.2f}, Q95=${price_q95:.2f}, Max=${max_price:.2f}")

    # Define bins and labels for 4 tiers
    price_bins = [-0.01, price_q1, price_q3, price_q95, max_price + 1]
    price_labels = ['Budget', 'Mid-Range', 'Premium', 'Upper Premium']
    
    # Check for non-monotonic bins which can happen if quantiles are equal
    # (e.g., if Q3 and Q95 are the same due to data distribution)
    is_monotonic = all(price_bins[i] < price_bins[i+1] for i in range(len(price_bins)-2)) # Check up to Q95
    if price_bins[len(price_bins)-2] > price_bins[len(price_bins)-1]: # Check last bin edge with max_price + 1
        is_monotonic = False # max_price + 1 should always be greater unless max_price itself is problematic

    if not is_monotonic or len(set(price_bins)) < len(price_bins):
        print("\nWarning: Bin edges are not strictly monotonic or contain duplicates.")
        print(f"Original calculated bins: {[-0.01, price_q1, price_q3, price_q95, max_price + 1]}")
        # Attempt to create unique sorted bins. This might reduce the number of bins if quantiles are equal.
        price_bins = sorted(list(set([-0.01, price_q1, price_q3, price_q95, max_price + 1])))
        print(f"Adjusted unique sorted bins: {price_bins}")
        
        # If after adjustment we don't have enough bins for 4 labels, we might need to simplify
        if len(price_bins) < 5: # Need 5 edges for 4 labels
            print("Could not form 4 distinct tiers with current quantiles. Consider reviewing thresholds or using fewer tiers.")
        else:
             # If we still have enough bins for 4 labels after sorting unique
             print("Proceeding with adjusted unique bins for 4 tiers.")


    if len(price_bins) == len(price_labels) + 1: # Ensure we have the right number of bins for labels
        df['price_tier'] = pd.cut(df['price'],
                                  bins=price_bins,
                                  labels=price_labels,
                                  right=True,        # (lower_bound, upper_bound]
                                  include_lowest=True) # Ensures min value is included

        print("\nValue counts for 'price_tier' (%):")
        print(df['price_tier'].value_counts(normalize=True).sort_index() * 100)
        
        print("\nPrice Tier Definitions Used:")
        for i in range(len(price_labels)):
            lower_b = price_bins[i]
            if np.isclose(lower_b, -0.01): lower_b = 0.0 # For cleaner display
            upper_b = price_bins[i+1]
            # Adjust display for the last bin to show it includes the max
            inclusive_char = "<=" if i < len(price_labels) -1 else "<=" 
            print(f"  {price_labels[i]}: ${lower_b:.2f} to ${upper_b if i < len(price_labels)-1 else max_price:.2f}")

    else:
        print("\nError: Could not create price tiers. Number of unique bin edges is insufficient for the desired number of labels.")
        print(f"Final bins considered: {price_bins}")
        print(f"Labels: {price_labels}")
else:
    print("Error: 'price' column not found. Cannot create price tiers.")

--- Defining Price Tiers ---
Percentiles used for tiers: Q1=$69.00, Q3=$175.00, Q95=$355.00, Max=$10000.00

Value counts for 'price_tier' (%):
price_tier
Budget           25.301156
Mid-Range        49.794458
Premium          19.912056
Upper Premium     4.992331
Name: proportion, dtype: float64

Price Tier Definitions Used:
  Budget: $0.00 to $69.00
  Mid-Range: $69.00 to $175.00
  Premium: $175.00 to $355.00
  Upper Premium: $355.00 to $10000.00


### 2.3 Aggregate Features to Create Neighborhood Profiles (for Similarity)

In [42]:
print("--- Creating Neighbourhood Profiles for Similarity ---")

if 'host_type_category' not in df.columns and 'calculated_host_listings_count' in df.columns:
    bins_host = [0, 1, 2, 5, 10, 50, df['calculated_host_listings_count'].max() + 1]
    labels_host = ['Host_1', 'Host_2', 'Host_3-5', 'Host_6-10', 'Host_11-50', 'Host_51+'] # Renamed for easier column names
    df['host_type_category'] = pd.cut(df['calculated_host_listings_count'], bins=bins_host, labels=labels_host, right=True)

# Features to aggregate for neighbourhood character
agg_functions = {
    'log_price': ['median', 'std'], # Median log_price, spread of log_price
    'price': ['median'], # Median raw price for easier interpretation
    'minimum_nights': ['median'],
    'number_of_reviews': ['median', 'mean'], # Median total reviews as quality/establishment proxy
    'id': 'count' # Listing density
}

neighbourhood_profiles = df.groupby(['neighbourhood_group', 'neighbourhood']).agg(agg_functions)

# Flatten multi-index columns if any were created by agg
neighbourhood_profiles.columns = ['_'.join(col).strip() if isinstance(col, tuple) else col for col in neighbourhood_profiles.columns.values]
neighbourhood_profiles.rename(columns={'id_count': 'listing_count'}, inplace=True)


# Add room_type proportions
room_type_props = df.groupby(['neighbourhood_group', 'neighbourhood'])['room_type'].value_counts(normalize=True).unstack(fill_value=0)
room_type_props.columns = [f'room_type_prop_{col.replace(" ", "_").replace("/", "_")}' for col in room_type_props.columns] # Clean column names
neighbourhood_profiles = neighbourhood_profiles.join(room_type_props)

# Add price_tier proportions
if 'price_tier' in df.columns:
    price_tier_props = df.groupby(['neighbourhood_group', 'neighbourhood'])['price_tier'].value_counts(normalize=True).unstack(fill_value=0)
    price_tier_props.columns = [f'price_tier_prop_{col}' for col in price_tier_props.columns]
    neighbourhood_profiles = neighbourhood_profiles.join(price_tier_props)

# Add host_type_category proportions (optional, but based on your EDA)
if 'host_type_category' in df.columns:
    host_type_props = df.groupby(['neighbourhood_group', 'neighbourhood'])['host_type_category'].value_counts(normalize=True).unstack(fill_value=0)
    host_type_props.columns = [f'host_type_prop_{col}' for col in host_type_props.columns]
    neighbourhood_profiles = neighbourhood_profiles.join(host_type_props)


neighbourhood_profiles.fillna(0, inplace=True) # Fill any NaNs that might result from unstacking if a category isn't present
neighbourhood_profiles = neighbourhood_profiles.reset_index() # Make neighbourhood_group and neighbourhood actual columns

print("\nNeighbourhood Profiles for Similarity (first 5 rows):")
print(neighbourhood_profiles.head())
print(f"\nShape of neighbourhood_profiles: {neighbourhood_profiles.shape}")

--- Creating Neighbourhood Profiles for Similarity ---

Neighbourhood Profiles for Similarity (first 5 rows):
  neighbourhood_group neighbourhood  log_price_median  log_price_std  \
0               Bronx      Allerton          4.210781       0.585361   
1               Bronx    Baychester          4.330733       0.227715   
2               Bronx       Belmont          3.978589       0.702946   
3               Bronx     Bronxdale          3.931826       0.359313   
4               Bronx   Castle Hill          3.688879       0.477558   

   price_median  minimum_nights_median  number_of_reviews_median  \
0          66.5                    2.0                      27.0   
1          75.0                    3.0                      11.0   
2          52.5                    2.0                       4.5   
3          50.0                    2.0                      14.0   
4          39.0                    2.0                       0.0   

   number_of_reviews_mean  listing_count  room_t

### 2.4 Define "Busyness" for Neighborhoods

In [43]:
print("\n--- Defining Neighbourhood Continuous Busyness Score & New Tiers ---")

# 1. Calculate raw busyness metrics (This part is the same as your existing cell)
neighbourhood_busyness_metrics = df.groupby(['neighbourhood_group', 'neighbourhood']).agg(
    avg_availability_365=('availability_365', 'mean'),
    avg_reviews_per_month=('reviews_per_month', 'mean'), # Assumes NaNs handled
    listing_density=('id', 'count')
).reset_index()

print("Raw busyness metrics for neighbourhoods (first 5):")
print(neighbourhood_busyness_metrics.head())

# 2. Prepare features for the composite score
if 'avg_availability_365' in neighbourhood_busyness_metrics.columns:
    neighbourhood_busyness_metrics['inverse_avg_availability'] = 1 / (neighbourhood_busyness_metrics['avg_availability_365'] + 0.01)
    # These are the features that will make up your composite score
    busyness_components = ['inverse_avg_availability', 'avg_reviews_per_month', 'listing_density']
else:
    busyness_components = ['avg_reviews_per_month', 'listing_density'] 
    print("Warning: 'avg_availability_365' not found, composite score will use fewer features.")

# Ensure selected features are present
final_busyness_components = [f for f in busyness_components if f in neighbourhood_busyness_metrics.columns]
if not final_busyness_components:
    print("Error: No valid features found for composite busyness score. Please check column names.")
else:
    data_to_scale_for_score = neighbourhood_busyness_metrics[final_busyness_components].copy()

    # Safeguard: Impute NaNs if any arose (e.g., from inverse if original availability was -0.01 somehow)
    for col in final_busyness_components:
        if data_to_scale_for_score[col].isnull().any():
            data_to_scale_for_score[col].fillna(data_to_scale_for_score[col].mean(), inplace=True)
            print(f"Warning: NaNs imputed with mean in {col} during busyness score calculation.")

    # 3. Scale these features to a 0-1 range using MinMaxScaler
    # This ensures each component can contribute more equally to a simple average score.
    min_max_scaler = MinMaxScaler()
    scaled_features_for_score = min_max_scaler.fit_transform(data_to_scale_for_score)
    
    # Convert back to DataFrame to easily calculate row-wise mean (or weighted sum)
    scaled_features_df = pd.DataFrame(scaled_features_for_score, columns=final_busyness_components, index=data_to_scale_for_score.index)

    # 4. Calculate Composite Busyness Score (Simple Average Method)
    # Higher score means "busier" as all components are defined such that higher = busier.
    neighbourhood_busyness_metrics['composite_busyness_score'] = scaled_features_df.mean(axis=1)
    
    print("\nComposite Busyness Score calculated (first 5 rows with score):")
    print(neighbourhood_busyness_metrics[['neighbourhood_group', 'neighbourhood', 'composite_busyness_score']].head())

    print("\nDescriptive statistics for 'composite_busyness_score':")
    print(neighbourhood_busyness_metrics['composite_busyness_score'].describe())

    # 5. Optional: Create new categorical busyness tiers based on this continuous score
    # For example, 5 tiers (quintiles) for more granularity than your previous 3 K-Means clusters
    busyness_score_labels_5tiers = ['Very Low Busyness', 'Low Busyness', 'Medium Busyness', 'High Busyness', 'Very High Busyness']
    try:
        neighbourhood_busyness_metrics['busyness_label_from_score'] = pd.qcut(
            neighbourhood_busyness_metrics['composite_busyness_score'],
            q=5, # For 5 quintiles/tiers
            labels=busyness_score_labels_5tiers,
            duplicates='drop' # Important if many neighborhoods have similar scores leading to non-unique bin edges
        )
        print("\nNew 5-tier busyness labels from composite score (value counts):")
        print(neighbourhood_busyness_metrics['busyness_label_from_score'].value_counts().sort_index())
    except Exception as e:
        print(f"\nCould not create 5-tier labels with pd.qcut due to score distribution (e.g., too few unique scores for 5 quantiles): {e}")
        print("Consider using fewer quantiles or manual binning if this occurs, or just use the continuous score.")


--- Defining Neighbourhood Continuous Busyness Score & New Tiers ---
Raw busyness metrics for neighbourhoods (first 5):
  neighbourhood_group neighbourhood  avg_availability_365  \
0               Bronx      Allerton            163.666667   
1               Bronx    Baychester            157.857143   
2               Bronx       Belmont            187.666667   
3               Bronx     Bronxdale            145.421053   
4               Bronx   Castle Hill            159.333333   

   avg_reviews_per_month  listing_density  
0               1.615714               42  
1               1.891429                7  
2               1.573333               24  
3               1.614211               19  
4               0.616667                9  

Composite Busyness Score calculated (first 5 rows with score):
  neighbourhood_group neighbourhood  composite_busyness_score
0               Bronx      Allerton                  0.124309
1               Bronx    Baychester                  0.14194

### 2.5 Combine Similarity Profiles with Busyness Information

In [44]:
print("--- Combining Neighbourhood Similarity Profiles with NEW Busyness Information ---")

if 'neighbourhood_profiles' in locals() or 'neighbourhood_profiles' in globals():
    if 'neighbourhood_busyness_metrics' in locals() or 'neighbourhood_busyness_metrics' in globals():
        
        # Define columns to merge from neighbourhood_busyness_metrics
        # These should be your new busyness metrics
        columns_to_merge = ['neighbourhood_group', 'neighbourhood', 'composite_busyness_score']
        
        # Check if you created the new N-tier label and add it for merging
        if 'busyness_label_from_score' in neighbourhood_busyness_metrics.columns:
            columns_to_merge.append('busyness_label_from_score')
            print("Found 'busyness_label_from_score' to merge.")
        elif 'busyness_label_5tiers' in neighbourhood_busyness_metrics.columns: # Alternative name used in example
            columns_to_merge.append('busyness_label_5tiers')
            print("Found 'busyness_label_5tiers' to merge.")
        else:
            print("Note: New N-tier busyness label (e.g., 'busyness_label_from_score') not found in neighbourhood_busyness_metrics. Merging continuous score only.")

        # Ensure the key columns for merging exist
        actual_cols_to_merge = [col for col in columns_to_merge if col in neighbourhood_busyness_metrics.columns]
        if 'neighbourhood' not in actual_cols_to_merge or 'neighbourhood_group' not in actual_cols_to_merge:
            print("Error: 'neighbourhood' or 'neighbourhood_group' missing in neighbourhood_busyness_metrics for merging.")
        else:
            neighbourhood_busyness_to_merge = neighbourhood_busyness_metrics[actual_cols_to_merge].drop_duplicates(subset=['neighbourhood_group', 'neighbourhood'])

            # Create the final DataFrame.
            # If 'final_neighbourhood_profiles_df' might exist from a previous run with old busyness columns,
            # it's cleaner to merge with the base 'neighbourhood_profiles' which only has character features.
            final_neighbourhood_profiles_df = pd.merge(
                neighbourhood_profiles, # This should be your profiles *without* any old busyness columns
                neighbourhood_busyness_to_merge,
                on=['neighbourhood_group', 'neighbourhood'],
                how='left' 
            )

            print("\nFinal Combined Neighbourhood Profiles (with new busyness score/labels - first 5 rows):")
            print(final_neighbourhood_profiles_df.head())
            print(f"\nShape of final_neighbourhood_profiles_df: {final_neighbourhood_profiles_df.shape}")
            
            print("\nCheck for any missing new busyness information after merge:")
            if 'composite_busyness_score' in final_neighbourhood_profiles_df.columns:
                print(f"Missing 'composite_busyness_score': {final_neighbourhood_profiles_df['composite_busyness_score'].isnull().sum()}")
            if 'busyness_label_from_score' in final_neighbourhood_profiles_df.columns: # Or 'busyness_label_5tiers'
                print(f"Missing 'busyness_label_from_score': {final_neighbourhood_profiles_df['busyness_label_from_score'].isnull().sum()}")
            elif 'busyness_label_5tiers' in final_neighbourhood_profiles_df.columns:
                 print(f"Missing 'busyness_label_5tiers': {final_neighbourhood_profiles_df['busyness_label_5tiers'].isnull().sum()}")


    else:
        print("Error: 'neighbourhood_busyness_metrics' DataFrame (with new scores/labels) not found.")
else:
    print("Error: 'neighbourhood_profiles' DataFrame (character profiles) not found.")

--- Combining Neighbourhood Similarity Profiles with NEW Busyness Information ---
Found 'busyness_label_from_score' to merge.

Final Combined Neighbourhood Profiles (with new busyness score/labels - first 5 rows):
  neighbourhood_group neighbourhood  log_price_median  log_price_std  \
0               Bronx      Allerton          4.210781       0.585361   
1               Bronx    Baychester          4.330733       0.227715   
2               Bronx       Belmont          3.978589       0.702946   
3               Bronx     Bronxdale          3.931826       0.359313   
4               Bronx   Castle Hill          3.688879       0.477558   

   price_median  minimum_nights_median  number_of_reviews_median  \
0          66.5                    2.0                      27.0   
1          75.0                    3.0                      11.0   
2          52.5                    2.0                       4.5   
3          50.0                    2.0                      14.0   
4          39

## 3. Modelling

### 3.1 Feature Selection and Scaling for KNN

In [45]:
print("--- Selecting and Scaling Features for KNN Similarity ---")

# Define columns from final_neighbourhood_profiles_df to be used for calculating similarity.
# These should be features describing the neighborhood's character.
# Exclude identifiers (neighbourhood, neighbourhood_group) and the busyness label/cluster itself.

# Recommended similarity features based on EDA and profile creation:
similarity_feature_columns = [
    'log_price_median',         # Typical price level (log-transformed)
    'log_price_std',            # Price variability within the neighborhood
    'minimum_nights_median',    # Typical minimum stay
    'number_of_reviews_median', # Proxy for typical listing establishment/perceived quality
    'number_of_reviews_mean',
    'listing_count',            # Can also be part of neighborhood character (small vs. large inventory)
    
    # Room type proportions
    'room_type_prop_Entire_home_apt',
    'room_type_prop_Private_room',
    'room_type_prop_Shared_room',
    
    # Price tier proportions
    'price_tier_prop_Budget',
    'price_tier_prop_Mid-Range',
    'price_tier_prop_Premium',
    'price_tier_prop_Upper_Premium',
    
    # Host type proportions
    'host_type_prop_1',
    'host_type_prop_2',
    'host_type_prop_3-5',
    'host_type_prop_6-10',
    'host_type_prop_11-50',
    'host_type_prop_51+'
]
similarity_feature_columns.extend([col for col in final_neighbourhood_profiles_df.columns if 'host_type_prop_' in col])
similarity_feature_columns = sorted(list(set(similarity_feature_columns))) # Ensure uniqueness and order

# Verify that these columns exist in your DataFrame
existing_similarity_features = [col for col in similarity_feature_columns if col in final_neighbourhood_profiles_df.columns]

if not existing_similarity_features:
    print("Error: No similarity feature columns found in 'final_neighbourhood_profiles_df'. Please check column names.")
    features_scaled = None # Ensure this variable is defined for later checks
else:
    print(f"Using these {len(existing_similarity_features)} features for KNN similarity: {existing_similarity_features}")
    
    features_for_knn = final_neighbourhood_profiles_df[existing_similarity_features].copy()
    
    # Fill any potential NaNs in these selected features (e.g., if a proportion was all NaN and then 0)
    # A good practice before scaling, though your aggregation should handle most.
    features_for_knn.fillna(features_for_knn.mean(), inplace=True) # Impute with mean if any NaNs persist
    
    # Scale the features
    scaler = StandardScaler()
    features_scaled = scaler.fit_transform(features_for_knn)
    
    print("\nShape of scaled features for KNN:", features_scaled.shape)
    print("Scaled features (first row example):", features_scaled[0])

--- Selecting and Scaling Features for KNN Similarity ---
Using these 18 features for KNN similarity: ['host_type_prop_1', 'host_type_prop_11-50', 'host_type_prop_2', 'host_type_prop_3-5', 'host_type_prop_51+', 'host_type_prop_6-10', 'listing_count', 'log_price_median', 'log_price_std', 'minimum_nights_median', 'number_of_reviews_mean', 'number_of_reviews_median', 'price_tier_prop_Budget', 'price_tier_prop_Mid-Range', 'price_tier_prop_Premium', 'room_type_prop_Entire_home_apt', 'room_type_prop_Private_room', 'room_type_prop_Shared_room']

Shape of scaled features for KNN: (221, 18)
Scaled features (first row example): [-0.82373718 -0.25850706 -0.72302177  2.12973271 -0.23393207 -0.40643778
 -0.33506963 -0.6290947   0.1013179  -0.30285419  1.28085333  1.29846227
  0.59921022 -0.04477275 -0.67913045 -0.51235558  0.69540609 -0.50440547]


## 3.2 Fit KNN Model

In [46]:
print("\n--- Fitting KNN Model ---")

knn_model = None # Initialize
if 'features_scaled' in locals() and features_scaled is not None and features_scaled.shape[0] > 0:
    # Number of neighbors: target itself + up to 10 others. Max N-1 if N < 11.
    n_neighbors_knn = min(11, features_scaled.shape[0]) 

    knn_model = NearestNeighbors(n_neighbors=n_neighbors_knn, metric='euclidean') # 'cosine' is also a good option
    knn_model.fit(features_scaled)
    
    print(f"KNN model fitted with k={n_neighbors_knn} and '{knn_model.metric}' metric.")
else:
    print("Error: Scaled features for KNN ('features_scaled') not available or empty. KNN model not fitted.")


--- Fitting KNN Model ---
KNN model fitted with k=11 and 'euclidean' metric.


### 3.3 Develop Recommendation Logic Function

In [47]:
print("--- Defining Recommendation Logic Function (v2 - for Continuous Busyness Score) ---")

def get_neighbourhood_recommendations_v2(target_neighbourhood_name, 
                                        target_group_name,
                                        df_profiles, # This is your final_neighbourhood_profiles_df
                                        model_knn, 
                                        all_scaled_features, # Scaled features used for KNN
                                        num_recommendations=3,
                                        busyness_score_column='composite_busyness_score', # Use your new continuous score column
                                        busyness_threshold_factor=1.0): 

    recommendations_df = pd.DataFrame()
    
    target_profile_matches = df_profiles[
        (df_profiles['neighbourhood'] == target_neighbourhood_name) &
        (df_profiles['neighbourhood_group'] == target_group_name)
    ]

    if target_profile_matches.empty:
        print(f"Error: Target neighbourhood '{target_neighbourhood_name}' in group '{target_group_name}' not found.")
        return recommendations_df

    target_index = target_profile_matches.index[0]
    
    if busyness_score_column not in target_profile_matches.columns:
        print(f"Error: Busyness score column '{busyness_score_column}' not found in profiles.")
        return recommendations_df
        
    target_busyness_score = target_profile_matches[busyness_score_column].iloc[0]
    
    print(f"\nTarget: {target_neighbourhood_name} ({target_group_name}), Original Busyness Score: {target_busyness_score:.3f}")

    # Define the threshold for an alternative to be "less busy"
    # If a higher score means "busier", then we want alternatives with a score *lower* than the target's.
    max_acceptable_busyness_score = target_busyness_score * busyness_threshold_factor
    
    # If busyness_threshold_factor is 1.0, it means we want strictly less.
    # If target_busyness_score is very low, max_acceptable_busyness_score could become 0 or negative if not handled.
    # Ensure max_acceptable_busyness_score doesn't go below a logical minimum (e.g. 0 if scores are non-negative)
    if busyness_threshold_factor == 1.0: # Strictly less than target
        max_acceptable_busyness_score = target_busyness_score - 1e-6 # A very small epsilon to ensure it's strictly less
    
    # If target score is already very low, finding something "less busy" might be hard or impossible.
    # You might add a check here: if target_busyness_score < some_absolute_low_threshold: print "Target is already very low busyness"
    
    print(f"Searching for alternatives with busyness score < {max_acceptable_busyness_score:.3f} (target score was {target_busyness_score:.3f})")
        
    target_features_scaled_for_knn = all_scaled_features[target_index].reshape(1, -1)
    distances, indices = model_knn.kneighbors(target_features_scaled_for_knn)
    
    neighbor_profiles_df = df_profiles.iloc[indices[0]].copy()
    neighbor_profiles_df['similarity_distance'] = distances[0]
    
    recommendations_df = neighbor_profiles_df[
        (neighbor_profiles_df['neighbourhood'] != target_neighbourhood_name) & 
        (neighbor_profiles_df['neighbourhood_group'] == target_group_name) &
        (neighbor_profiles_df[busyness_score_column] < max_acceptable_busyness_score) 
    ].sort_values(by='similarity_distance', ascending=True).head(num_recommendations)
    
    if recommendations_df.empty:
        print(f"No suitable less busy (score < {max_acceptable_busyness_score:.3f}), similar neighbourhoods found in '{target_group_name}' for '{target_neighbourhood_name}'.")

    return recommendations_df

# Define the function in your notebook by running this cell
get_neighbourhood_recommendations = get_neighbourhood_recommendations_v2 # Overwrite or use new name

print("\nRecommendation function 'get_neighbourhood_recommendations' (v2 - for continuous score) is now defined.")

--- Defining Recommendation Logic Function (v2 - for Continuous Busyness Score) ---

Recommendation function 'get_neighbourhood_recommendations' (v2 - for continuous score) is now defined.


## Testing

In [48]:
print("\n--- Testing Recommendation Logic with Continuous Busyness Score ---")

if not all(var in locals() or var in globals() for var in ['final_neighbourhood_profiles_df', 'knn_model', 'features_scaled', 'get_neighbourhood_recommendations']):
    print("ERROR: One or more required variables not found. Run previous cells defining these elements.")
elif 'composite_busyness_score' not in final_neighbourhood_profiles_df.columns:
    print("ERROR: 'composite_busyness_score' column not found in 'final_neighbourhood_profiles_df'. Please run the cell that calculates this score.")
else:
    # --- Test Case 1: Target 'Midtown' in 'Manhattan' ---
    target_n1 = 'Midtown'
    target_g1 = 'Manhattan'
    print(f"\n--- Test Case 1: Target: {target_n1}, {target_g1} ---")

    # Define how much "less busy" you want the alternatives to be.
    # factor = 1.0 means strictly less busy than the target.
    # factor = 0.9 means at most 90% of the target's busyness score (i.e., at least 10% less busy).
    # factor = 0.8 means at most 80% of the target's busyness score (i.e., at least 20% less busy).
    threshold_factor_for_midtown = 0.8 

    recommendations1 = get_neighbourhood_recommendations( # Assuming you renamed/updated your function
        target_neighbourhood_name=target_n1,
        target_group_name=target_g1,
        df_profiles=final_neighbourhood_profiles_df,
        model_knn=knn_model,
        all_scaled_features=features_scaled,
        num_recommendations=3,
        busyness_score_column='composite_busyness_score', # Use the continuous score column
        busyness_threshold_factor=threshold_factor_for_midtown 
    )

    if not recommendations1.empty:
        print(f"\nRecommended less busy (score < target_score * {threshold_factor_for_midtown}), similar neighbourhoods in {target_g1} for {target_n1}:")
        # Display relevant columns, including the new score and your N-tier label if you created it
        cols_to_display = ['neighbourhood', 'composite_busyness_score', 'similarity_distance', 'price_median']
        if 'busyness_label_from_score' in recommendations1.columns: # If you created the 5-tier label
            cols_to_display.insert(1, 'busyness_label_from_score')
        elif 'busyness_label_5tiers' in recommendations1.columns: # Alternative name
             cols_to_display.insert(1, 'busyness_label_5tiers')
        display(recommendations1[cols_to_display])
    else:
        # The function get_neighbourhood_recommendations should have printed why no recommendations were made.
        # (e.g., "No suitable less busy ... found")
        # You could add a generic message here if needed, but it might be redundant.
        print(f"Further check: No recommendations returned by the function for {target_n1}.")


    # --- Test Case 2: Target 'Allerton' in 'Bronx' ---
    # Allerton was previously 'Supply Constrained/Busy' with the K-Means labels.
    # Let's see how it behaves with the continuous score.
    target_n2 = 'Allerton'
    target_g2 = 'Bronx'
    print(f"\n--- Test Case 2: Target: {target_n2}, {target_g2} ---")
    
    threshold_factor_for_allerton = 1.0 # Strictly less busy

    recommendations2 = get_neighbourhood_recommendations(
        target_neighbourhood_name=target_n2,
        target_group_name=target_g2,
        df_profiles=final_neighbourhood_profiles_df,
        model_knn=knn_model,
        all_scaled_features=features_scaled,
        num_recommendations=3,
        busyness_score_column='composite_busyness_score',
        busyness_threshold_factor=threshold_factor_for_allerton
    )

    if not recommendations2.empty:
        print(f"\nRecommended less busy (score < target_score * {threshold_factor_for_allerton}), similar neighbourhoods in {target_g2} for {target_n2}:")
        cols_to_display = ['neighbourhood', 'composite_busyness_score', 'similarity_distance', 'price_median']
        if 'busyness_label_from_score' in recommendations2.columns:
            cols_to_display.insert(1, 'busyness_label_from_score')
        elif 'busyness_label_5tiers' in recommendations2.columns:
             cols_to_display.insert(1, 'busyness_label_5tiers')
        display(recommendations2[cols_to_display])
    else:
        print(f"Further check: No recommendations returned by the function for {target_n2}.")


--- Testing Recommendation Logic with Continuous Busyness Score ---

--- Test Case 1: Target: Midtown, Manhattan ---

Target: Midtown (Manhattan), Original Busyness Score: 0.193
Searching for alternatives with busyness score < 0.154 (target score was 0.193)

Recommended less busy (score < target_score * 0.8), similar neighbourhoods in Manhattan for Midtown:


Unnamed: 0,neighbourhood,busyness_label_from_score,composite_busyness_score,similarity_distance,price_median
108,Kips Bay,Low Busyness,0.096633,3.031839,152.0
126,West Village,High Busyness,0.121361,3.57969,200.0
103,Gramercy,Low Busyness,0.093901,3.621641,165.0



--- Test Case 2: Target: Allerton, Bronx ---

Target: Allerton (Bronx), Original Busyness Score: 0.124
Searching for alternatives with busyness score < 0.124 (target score was 0.124)
No suitable less busy (score < 0.124), similar neighbourhoods found in 'Bronx' for 'Allerton'.
Further check: No recommendations returned by the function for Allerton.


## 4. Solution


### 4.1 Imports for Interactive Demo (and PCA)

In [49]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ipywidgets as widgets
from IPython.display import display, Markdown, clear_output
from sklearn.decomposition import PCA # For the PCA plot
from sklearn.preprocessing import StandardScaler

if 'final_neighbourhood_profiles_df' not in locals() or 'knn_model' not in locals() or 'features_scaled' not in locals() or 'get_neighbourhood_recommendations' not in locals():
    print("WARNING: Key components (DataFrame, KNN model, scaled features, or recommendation function) not found.")
    print("This script might not run correctly without them. Ensure they are loaded/defined from your previous work.")
    # Create minimal dummies for script validation if run by the tool without prior state
    final_neighbourhood_profiles_df = pd.DataFrame({
        'neighbourhood_group': ['Manhattan', 'Manhattan', 'Manhattan', 'Brooklyn', 'Brooklyn'],
        'neighbourhood': ['Midtown', 'Harlem', 'Chelsea', 'Williamsburg', 'Bushwick'],
        'log_price_median': np.random.rand(5) * 2 + 4,
        'busyness_label': ['High Density/Saturated', 'Active & Available/Less Busy', 'Supply Constrained/Busy', 'Supply Constrained/Busy', 'Active & Available/Less Busy']
    })
    # Dummy features_scaled for PCA plotting (rows must match df)
    if 'features_scaled' not in locals() or features_scaled is None or features_scaled.shape[0] != len(final_neighbourhood_profiles_df) :
      features_scaled = StandardScaler().fit_transform(np.random.rand(len(final_neighbourhood_profiles_df), 5)) # Dummy with 5 features
    # Dummy knn_model
    if 'knn_model' not in locals() or knn_model is None:
      from sklearn.neighbors import NearestNeighbors
      knn_model = NearestNeighbors(n_neighbors=3).fit(features_scaled)
    # Dummy recommendation function
    if 'get_neighbourhood_recommendations' not in locals():
        def get_neighbourhood_recommendations(target_neighbourhood_name, target_group_name, df_profiles, model_knn, all_scaled_features, **kwargs):
            print(f"Dummy call for {target_neighbourhood_name}, {target_group_name}. In real use, this would return recommendations.")
            return pd.DataFrame({'neighbourhood': ['Dummy Rec 1', 'Dummy Rec 2'], 'busyness_label': ['Active & Available/Less Busy']*2, 'similarity_distance': [0.1,0.2]})


plt.style.use('ggplot')
sns.set_palette("muted")

### 4.2 Prepare Data for PCA Plot

In [50]:
print("--- Preparing 2D coordinates for PCA plot ---")
pca_plot_coords = None # Initialized to None
if 'features_scaled' in locals() and features_scaled is not None and features_scaled.shape[0] > 0 :
    pca = PCA(n_components=2, random_state=42)
    pca_plot_coords = pca.fit_transform(features_scaled) # This is where it's assigned
    print("PCA transformation complete for plotting.")
    final_neighbourhood_profiles_df['pca1'] = pca_plot_coords[:, 0]
    final_neighbourhood_profiles_df['pca2'] = pca_plot_coords[:, 1]
else:
    print("Error: 'features_scaled' not available or empty. Cannot prepare PCA plot coordinates.")

--- Preparing 2D coordinates for PCA plot ---
PCA transformation complete for plotting.


### 4.3 Setup ipywidgets for Interaction

In [51]:
print("--- Setting up Interactive Widgets with Dependent Dropdown ---")

# 1. Create the widgets
group_options = ['-- Select a N\'hood Group --'] + sorted(final_neighbourhood_profiles_df["neighbourhood_group"].unique())
group_dropdown = widgets.Dropdown(
    options=group_options,
    value=group_options[0], # Default to the prompt
    description="N'hood Group:",
    layout=widgets.Layout(width="auto"),
    style={'description_width': 'initial'}
)

neighbourhood_dropdown = widgets.Dropdown(
    options=['-- Select a Neighbourhood --'], # Initial prompt
    value='-- Select a Neighbourhood --',
    description="Neighbourhood:",
    disabled=True, # Start disabled
    layout=widgets.Layout(width="auto"),
    style={'description_width': 'initial'}
)

run_button = widgets.Button(
    description="🔍 Find Alternatives",
    button_style='primary',
    tooltip='Click to get recommendations',
    icon='search',
    layout=widgets.Layout(width="auto", margin='10px 0 0 0')
)

output_area = widgets.Output()

# 2. Define the single function to update the neighbourhood_dropdown
def on_group_select_update_neighbourhood_options(change):
    """
    This function is called when the value of group_dropdown changes.
    It updates the options in neighbourhood_dropdown based on the selected group.
    """
    selected_group = change.new # The new value of the group_dropdown

    # Always reset neighbourhood_dropdown to its initial prompt state first
    current_neighbourhood_options = ['-- Select a Neighbourhood --']
    neighbourhood_dropdown.options = current_neighbourhood_options
    neighbourhood_dropdown.value = current_neighbourhood_options[0] # Set to prompt
    
    if selected_group and selected_group != '-- Select a N\'hood Group --': # If a valid group is selected
        # Filter neighbourhoods for the selected group
        neighbourhoods_in_group = sorted(
            final_neighbourhood_profiles_df[final_neighbourhood_profiles_df["neighbourhood_group"] == selected_group]["neighbourhood"].unique()
        )
        
        if neighbourhoods_in_group:
            # Update options and enable the dropdown
            neighbourhood_dropdown.options = current_neighbourhood_options + neighbourhoods_in_group
            neighbourhood_dropdown.disabled = False
        else:
            # No neighbourhoods found for this group (should ideally not happen with real data)
            neighbourhood_dropdown.disabled = True
    else:
        # If the prompt "-- Select a N'hood Group --" is re-selected, disable neighbourhood_dropdown
        neighbourhood_dropdown.disabled = True
    
    # Clear any previous results shown in the output_area when the group changes
    output_area.clear_output()

# 3. Link the group_dropdown's 'value' change event to the update function
group_dropdown.observe(on_group_select_update_neighbourhood_options, names='value')

print("Widgets created and dependent dropdown logic is set up.")
print("Please select a Neighbourhood Group; the Neighbourhood dropdown will then populate.")

--- Setting up Interactive Widgets with Dependent Dropdown ---
Widgets created and dependent dropdown logic is set up.
Please select a Neighbourhood Group; the Neighbourhood dropdown will then populate.


### 4.4 Define the Main Action Function (triggered by button)

In [52]:
print("--- Defining Main Action Function for Button Click (Using Continuous Score) ---")

def on_find_alternatives_clicked(b): # b is the button event
    with output_area:
        output_area.clear_output(wait=True) 
        
        try: 
            selected_group = group_dropdown.value
            selected_neighbourhood = neighbourhood_dropdown.value

            if not selected_group or selected_group == '-- Select a N\'hood Group --' or \
               not selected_neighbourhood or selected_neighbourhood == '-- Select a Neighbourhood --':
                display(Markdown("<font color='red'>**Error:** Please select a valid Neighbourhood Group and Neighbourhood.</font>"))
                return

            display(Markdown(f"### Processing: Recommending for **{selected_neighbourhood}** in **{selected_group}**..."))
            
            target_profile = final_neighbourhood_profiles_df[
                (final_neighbourhood_profiles_df['neighbourhood_group'] == selected_group) &
                (final_neighbourhood_profiles_df['neighbourhood'] == selected_neighbourhood)
            ]
            
            if target_profile.empty:
                display(Markdown(f"<font color='red'>Error: Profile for {selected_neighbourhood} in {selected_group} not found.</font>"))
                return
            
            # Use the continuous busyness score
            target_busyness_score = target_profile['composite_busyness_score'].iloc[0]
            
            # Define your busyness_threshold_factor for the call
            # Example: 1.0 means strictly less busy. 0.9 means at least 10% less busy.
            # You can make this dynamic based on target_busyness_score or selected_group if desired
            threshold_factor = 0.85 

            recommendations = get_neighbourhood_recommendations( # Assuming this is now the v2 function
                target_neighbourhood_name=selected_neighbourhood,
                target_group_name=selected_group,
                df_profiles=final_neighbourhood_profiles_df,
                model_knn=knn_model,
                all_scaled_features=features_scaled,
                num_recommendations=3,
                busyness_score_column='composite_busyness_score', # Pass the correct column name
                busyness_threshold_factor=threshold_factor 
            )

            display(Markdown(f"#### Selected: **{selected_neighbourhood}** ({selected_group}) - Busyness Score: *{target_busyness_score:.3f}*"))

            if not recommendations.empty:
                display(Markdown("#### ✅ Recommended Less Busy, Similar Alternatives:"))
                # Display relevant columns, including the new score
                cols_to_show = ['neighbourhood', 'composite_busyness_score', 'similarity_distance', 'price_median']
                if 'busyness_label_from_score' in recommendations.columns: # If you created N-tier labels from score
                    cols_to_show.insert(1, 'busyness_label_from_score')
                display(recommendations[cols_to_show])
            else:
                # Check if the target itself was already very "less busy"
                # (The get_neighbourhood_recommendations function handles its own print for this,
                # but we add a clear message here too if no recs were returned for other reasons)
                very_low_busyness_cutoff = final_neighbourhood_profiles_df['composite_busyness_score'].quantile(0.25) # Example: bottom 25%
                if target_busyness_score < very_low_busyness_cutoff and (target_busyness_score * threshold_factor) < target_busyness_score : # and we were actually looking for something less busy
                    display(Markdown(f"<font color='green'>'{selected_neighbourhood}' already has a low busyness score ({target_busyness_score:.3f}). No significantly less busy, similar alternatives were found.</font>"))
                else:
                    display(Markdown(f"<font color='orange'>⚠️ No suitable less busy, similar neighbourhoods found in {selected_group} for {selected_neighbourhood} with the defined criteria (busyness score < {target_busyness_score * threshold_factor:.3f}).</font>"))

            # --- Display PCA Plot ---
            if 'pca1' in final_neighbourhood_profiles_df.columns and 'pca2' in final_neighbourhood_profiles_df.columns:
                plt.figure(figsize=(12, 8))
                
                # For coloring/styling the PCA plot, you can use your new N-tier busyness label
                # if you created one (e.g., 'busyness_label_from_score') or just style by neighbourhood_group
                hue_column_pca = 'busyness_label_from_score' if 'busyness_label_from_score' in final_neighbourhood_profiles_df.columns else 'neighbourhood_group'
                
                sns.scatterplot(
                    x=final_neighbourhood_profiles_df['pca1'], 
                    y=final_neighbourhood_profiles_df['pca2'],
                    hue=final_neighbourhood_profiles_df[hue_column_pca], 
                    style=final_neighbourhood_profiles_df['neighbourhood_group'],
                    # palette=palette_pca, # Add if using N-tier labels with a custom palette
                    alpha=0.6, s=50
                )
                
                selected_coords_df = final_neighbourhood_profiles_df[
                    (final_neighbourhood_profiles_df['neighbourhood'] == selected_neighbourhood) &
                    (final_neighbourhood_profiles_df['neighbourhood_group'] == selected_group)
                ]
                if not selected_coords_df.empty:
                    # Use .loc with index for pca_plot_coords to ensure alignment
                    selected_idx = selected_coords_df.index[0]
                    if 'pca_plot_coords' in globals() and globals().get('pca_plot_coords') is not None:
                         plt.scatter(pca_plot_coords[selected_idx, 0], pca_plot_coords[selected_idx, 1], 
                                    s=250, edgecolor='black', facecolor='yellow', label=f"Selected: {selected_neighbourhood}", zorder=5)

                if not recommendations.empty:
                    rec_indices = recommendations.index
                    if 'pca_plot_coords' in globals() and globals().get('pca_plot_coords') is not None:
                        plt.scatter(pca_plot_coords[rec_indices, 0], 
                                    pca_plot_coords[rec_indices, 1], 
                                    s=180, edgecolor='black', facecolor='blue', marker='*', label="Recommended", zorder=5)
                        for rec_idx in rec_indices: 
                            row = final_neighbourhood_profiles_df.loc[rec_idx]
                            plt.text(row['pca1'] + 0.05, row['pca2'] + 0.05, row['neighbourhood'], fontsize=9, zorder=5)

                plt.title(f"Neighbourhood Similarity Map (PCA Projection)\nTarget: {selected_neighbourhood}")
                plt.xlabel("PCA Component 1")
                plt.ylabel("PCA Component 2")
                plt.legend(title="Legend", bbox_to_anchor=(1.05, 1), loc='upper left')
                plt.grid(True)
                plt.tight_layout(rect=[0, 0, 0.85, 1])
                plt.show()
            else:
                display(Markdown("<font color='brown'>PCA columns ('pca1', 'pca2') not found in `final_neighbourhood_profiles_df`. Plot cannot be generated.</font>"))
        
        except Exception as e:
            import traceback
            display(Markdown(f"<font color='red'>**An error occurred in on_find_alternatives_clicked:**\n```\n{traceback.format_exc()}\n```</font>"))

# Link to button (ensure this is in the same cell or run after run_button is defined)
run_button.on_click(on_find_alternatives_clicked)
print("Action function 'on_find_alternatives_clicked' (Updated for Continuous Score) defined and linked to button.")

--- Defining Main Action Function for Button Click (Using Continuous Score) ---
Action function 'on_find_alternatives_clicked' (Updated for Continuous Score) defined and linked to button.


### 4.5 Display the Interface

In [53]:
print("--- Displaying Interactive Recommendation Interface ---")

# Initialize neighbourhood options for the first selected group
if group_dropdown.options and group_dropdown.options[0] != '': # Check if options exist and first is not blank
    initial_group = group_dropdown.options[0] if group_dropdown.value == '' else group_dropdown.value
    if initial_group: # if a valid group is selected or defaulted
        initial_neighbourhoods = sorted(
            final_neighbourhood_profiles_df[final_neighbourhood_profiles_df["neighbourhood_group"] == initial_group]["neighbourhood"].unique()
        )
        if initial_neighbourhoods:
             neighbourhood_dropdown.options = [''] + initial_neighbourhoods
             # neighbourhood_dropdown.value = '' # Start with blank selection
             neighbourhood_dropdown.disabled = False


# Display the widgets
display(Markdown("## 🗽 NYC Airbnb Neighbourhood Recommender"))
display(Markdown("Select a `Neighbourhood Group` and then a `Neighbourhood` you are interested in:"))
display(widgets.VBox([
    widgets.HBox([group_dropdown, neighbourhood_dropdown]),
    run_button
]))
display(output_area)

print("\nInterface is now active. Interact with the dropdowns and button above.")

--- Displaying Interactive Recommendation Interface ---


## 🗽 NYC Airbnb Neighbourhood Recommender

Select a `Neighbourhood Group` and then a `Neighbourhood` you are interested in:

VBox(children=(HBox(children=(Dropdown(description="N'hood Group:", layout=Layout(width='auto'), options=("-- …

Output()


Interface is now active. Interact with the dropdowns and button above.
