
Load and Preprocess Data
Load the dataset, fill missing values, one-hot encode the categorical columns, and split the data into features (X) and target (y). Adjust the code below to match your dataset structure.

In [None]:
import pandas as pd

!wget  --no-clobber https://raw.githubusercontent.com/yagel2/TDS_NYC_Airbnb/main/AB_NYC_2019.csv

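# Load the dataset (adjust the path or use a dataset URL)
df = pd.read_csv('AB_NYC_2019.csv')

# Drop identifiers, free text, and dates: one-hot encoding them would create
# thousands of near-unique columns. Adjust this list to your dataset.
df = df.drop(columns=['id', 'name', 'host_id', 'host_name', 'last_review'])

# Preprocess: fill missing values, encode categorical columns
df = df.fillna(0)  # Simple placeholder strategy; 0 is not always a sensible fill value
df = pd.get_dummies(df, dtype=int)  # Convert categorical variables into 0/1 dummy variables

# Split into features and target ('price' is assumed here; change it to your actual target column)
target_column = 'price'
X = df.drop(columns=target_column)
y = df[target_column]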

# Show first few rows of the data
df.head()
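
To see what the fill and encode steps do, here is a minimal sketch on a tiny hand-made frame. The column names and values are illustrative assumptions for this example, not rows taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration only.
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room"],
    "reviews_per_month": [0.5, None, 2.0],
    "price": [80, 150, 60],
})

# Fill the missing numeric value with 0, mirroring the fillna step above.
toy["reviews_per_month"] = toy["reviews_per_month"].fillna(0)

# Each category becomes its own 0/1 indicator column; numeric columns pass through.
toy_encoded = pd.get_dummies(toy, dtype=int)

print(toy_encoded.columns.tolist())
```

One categorical column with two distinct values turns into two indicator columns, which is why high-cardinality text columns are better dropped before encoding.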

Feature Selection Methods
A. SHAP-based Feature Selection
SHAP values quantify how much each feature contributes to an individual prediction. Averaging their absolute values across all samples gives a global importance score per feature, which you can compute for models like XGBoost.

In [None]:
import numpy as np
import shap
import xgboost as xgb

# Train an XGBoost model (fixed seed for reproducibility)
model = xgb.XGBRegressor(random_state=42)
model.fit(X, y)

# Create a SHAP explainer and compute SHAP values
explainer = shap.Explainer(model)
shap_values = explainer(X)

# Visualize the SHAP summary plot
shap.summary_plot(shap_values, X)

# Get feature importance as the mean absolute SHAP value per feature.
# shap_values is an Explanation object; .values holds the raw NumPy array.
shap_importance = np.abs(shap_values.values).mean(axis=0)
important_features = X.columns[shap_importance.argsort()[-10:]]  # Top 10 important features
X_selected = X[important_features]

# Display the selected features
X_selected.head()
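
The ranking step can be illustrated without a trained model. The sketch below uses a small hand-written attribution matrix (hypothetical numbers and feature names, not real SHAP output) to show why the absolute value is taken before averaging: positive and negative contributions would otherwise cancel.

```python
import numpy as np

# Hypothetical per-sample attributions: 4 samples x 3 features.
attributions = np.array([
    [ 0.2, -1.0,  0.0],
    [-0.4,  2.0,  0.1],
    [ 0.1, -3.0, -0.1],
    [-0.3,  2.0,  0.2],
])
feature_names = np.array(["f0", "f1", "f2"])  # made-up names

# A plain mean lets opposite signs cancel: f1 averages to exactly 0 here.
signed_mean = attributions.mean(axis=0)

# Mean absolute attribution measures magnitude regardless of direction.
mean_abs = np.abs(attributions).mean(axis=0)

# Indices of the top-k features, most important first.
top_k = 2
order = np.argsort(mean_abs)[::-1][:top_k]
top_features = feature_names[order].tolist()

print(top_features)
```

In this toy matrix `f1` has the largest swings in both directions, so it ranks first by mean absolute value even though its signed mean is zero.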


B. Recursive Feature Elimination (RFE)
RFE repeatedly fits the model, ranks the features by the model's own importance scores, and removes the weakest ones until the requested number of features remains.

In [None]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

# Initialize RandomForestRegressor model
model = RandomForestRegressor()

# RFE: Select top 10 features
selector = RFE(model, n_features_to_select=10)
selector.fit(X, y)

# Get the selected features
selected_features_rfe = X.columns[selector.support_]
X_selected_rfe = X[selected_features_rfe]

# Display selected features
X_selected_rfe.head()
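
The elimination loop itself is simple enough to write by hand. The following is a simplified stand-in, not scikit-learn's implementation: it swaps the random forest for ordinary least squares and ranks features by absolute coefficient, which assumes the features are on comparable scales. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def rfe_least_squares(X, y, n_keep):
    """Refit and drop the feature with the smallest |coefficient| until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Fit a linear model on the features that are still in play.
        coef, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        # Remove the feature the current fit relies on least.
        weakest = int(np.argmin(np.abs(coef)))
        remaining.pop(weakest)
    return remaining

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Only columns 0 and 3 influence the synthetic target.
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

kept = rfe_least_squares(X_demo, y_demo, n_keep=2)
print(kept)
```

Refitting after each removal is what distinguishes this from a one-shot ranking: the importance of the remaining features is re-estimated every round.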


C. Permutation Importance
Permutation importance shuffles the values of one feature at a time and measures how much the model's score drops. The larger the drop, the more the model relies on that feature.

In [None]:
from sklearn.inspection import permutation_importance

# Train a RandomForest model
model = RandomForestRegressor()
model.fit(X, y)

# Compute permutation importance
result = permutation_importance(model, X, y, n_repeats=10, random_state=42)

# Sort features from most to least important (argsort is ascending, so reverse it)
sorted_idx = result.importances_mean.argsort()[::-1]

# Select top 10 features based on permutation importance
top_features_perm = X.columns[sorted_idx[:10]]
X_selected_perm = X[top_features_perm]

# Display selected features
X_selected_perm.head()
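
For intuition, the shuffle-and-rescore idea can be sketched in a few lines of NumPy. This is an illustrative version under stated assumptions, not scikit-learn's code: it measures the increase in mean squared error rather than the drop in the estimator's score, and the `predict` function is a hypothetical model that reads only column 1.

```python
import numpy as np

def permutation_importance_sketch(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in MSE when each column is shuffled independently."""
    rng = np.random.default_rng(seed)

    def mse(y_true, y_pred):
        return float(np.mean((y_true - y_pred) ** 2))

    baseline = mse(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Break the link between column j and the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(mse(y, predict(X_perm)) - baseline)
        importances[j] = np.mean(increases)
    return importances

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 3))
y_demo = 4.0 * X_demo[:, 1]

def predict(X):
    # Hypothetical model that uses only column 1.
    return 4.0 * X[:, 1]

imp = permutation_importance_sketch(predict, X_demo, y_demo)
print(imp)
```

Columns the model never reads get an importance of exactly zero, because shuffling them cannot change any prediction.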

Compare Model Performance
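Now, compare model performance before and after feature selection using R² and RMSE. We train one model on the full feature set and one on the selected features, and score both on a held-out test set so the metrics are not inflated by fitting and evaluating on the same rows.

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import numpy as np

# Hold out 20% of the rows; both models are scored on the same test rows
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model on full features
model_full = RandomForestRegressor(random_state=42)
model_full.fit(X_train, y_train)
y_pred_full = model_full.predict(X_test)
rmse_full = np.sqrt(mean_squared_error(y_test, y_pred_full))
r2_full = r2_score(y_test, y_pred_full)

# Train model on selected features (same rows, SHAP-selected columns)
selected_cols = X_selected.columns
model_selected = RandomForestRegressor(random_state=42)
model_selected.fit(X_train[selected_cols], y_train)
y_pred_selected = model_selected.predict(X_test[selected_cols])
rmse_selected = np.sqrt(mean_squared_error(y_test, y_pred_selected))
r2_selected = r2_score(y_test, y_pred_selected)

# Display the comparison of performance metrics
print(f"RMSE (Full Features): {rmse_full}")
print(f"R² (Full Features): {r2_full}")
print(f"RMSE (Selected Features): {rmse_selected}")
print(f"R² (Selected Features): {r2_selected}")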
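
As a reference for reading these numbers, the two metrics can be written out directly. The sketch below uses the standard definitions on a small made-up example; the helper names `rmse` and `r2` and the arrays are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root of the mean squared residual; same units as the target.
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    # 1 minus the ratio of residual variation to total variation around the mean.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up targets and predictions: three exact hits and one miss by 2.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0])

print(rmse(y_true, y_pred), r2(y_true, y_pred))
```

RMSE is on the scale of the target, so lower is better, while R² compares the model against always predicting the mean, so higher is better and values near zero mean little improvement over that baseline.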

Visualize Feature Importance
It’s helpful to visualize the importance of selected features to understand their impact on the model’s prediction.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Plot feature importance based on SHAP
shap.summary_plot(shap_values, X)

# Alternatively, plot the top 10 features by mean absolute SHAP value
shap_mean_abs = np.abs(shap_values.values).mean(axis=0)
shap_top_idx = shap_mean_abs.argsort()[::-1][:10]
plt.figure(figsize=(10, 6))
sns.barplot(x=shap_mean_abs[shap_top_idx], y=X.columns[shap_top_idx])
plt.title("Top Features Based on SHAP Values")
plt.show()

# Or plot the top 10 features by permutation importance
perm_top_idx = result.importances_mean.argsort()[::-1][:10]
plt.figure(figsize=(10, 6))
sns.barplot(x=result.importances_mean[perm_top_idx], y=X.columns[perm_top_idx])
plt.title("Top Features Based on Permutation Importance")
plt.show()
