### Week 7 Final Project

### Chicago Airbnb Dataset

In [1]:
# Import necessary libraries
import pandas as pd  # For handling data in DataFrames
import numpy as np  # For numerical operations
import matplotlib.pyplot as plt  # For data visualization
import seaborn as sns  # For advanced statistical visualizations
import psycopg2  # For connecting to PostgreSQL databases

# Scikit-learn modules for machine learning and preprocessing
from sklearn.model_selection import train_test_split  # Splitting data into training and testing sets
from sklearn.linear_model import LinearRegression  # Linear regression model
from sklearn.metrics import mean_absolute_error, r2_score  # Model evaluation metrics
from sklearn.preprocessing import StandardScaler, OneHotEncoder  # Data normalization and encoding categorical variables
from sklearn.impute import SimpleImputer  # Handling missing values

# SQLAlchemy for database connection and execution
from sqlalchemy import create_engine  # For managing database connections

In [2]:
# Load and Sample Data  
df = pd.read_csv("listings.csv")  # Load dataset  

# Select a random sample of 100 rows  
data = df.sample(n=100, random_state=42)  # Ensures reproducibility  

# Save the sampled data to a new CSV file  
data.to_csv("listings_data.csv", index=False)  

# Remove the sampled rows from the original DataFrame  
df_remaining = df.drop(data.index)  

# ### Database Connection (Supabase PostgreSQL)  
DATABASE_URL = "postgresql://postgres:fwSMslYZJjGfDstk@db.ahrshioxhhqpfnuwvufg.supabase.co:5432/postgres"  
engine = create_engine(DATABASE_URL)  

# Store the remaining data in PostgreSQL under the "raw" schema  
df_remaining.to_sql("listingstable", engine, schema="raw", if_exists="replace", index=False)  

print("Data successfully stored in PostgreSQL under the 'raw' schema.")  


Data successfully stored in PostgreSQL under the 'raw' schema.


# **Data Definition and Analytical Question**

# Dataset Features  

- **id**: Unique identifier for each listing *(Nominal)*  
- **name**: Listing name *(Nominal)*  
- **host_id**: Unique identifier for the host *(Nominal)*  
- **host_name**: Host's name *(Nominal)*  
- **neighbourhood_group**: Broad geographical area *(Nominal)*  
- **neighbourhood**: Specific neighborhood within the group *(Nominal)*  
- **latitude**: Latitude coordinate *(Continuous)*  
- **longitude**: Longitude coordinate *(Continuous)*  
- **room_type**: Type of accommodation offered *(Nominal)*  
- **price**: Nightly rate *(Continuous)*  
- **minimum_nights**: Minimum required stay *(Discrete)*  
- **number_of_reviews**: Total number of reviews received *(Discrete)*  
- **last_review**: Date of the most recent review *(Nominal)*  
- **reviews_per_month**: Average number of reviews per month *(Continuous)*  
- **calculated_host_listings_count**: Total number of listings per host *(Discrete)*  
- **availability_365**: Number of days the listing is available in a year *(Discrete)*  

---

# Analytical Inquiry  

**"What is the relationship between room type, neighborhood, and pricing in influencing the number of reviews a listing receives?"**  

**Target Variable**: *number_of_reviews*


# **Feature Selection**  

For our analytical question, **"How do room type, neighborhood, and pricing impact the number of reviews a listing receives?"**, we select features that directly influence the target variable: **number_of_reviews**.  

### **Selected Features:**  
- **neighbourhood_group** – Broad geographical classification affecting listing popularity.  
- **neighbourhood** – More specific location data, influencing pricing and demand.  
- **room_type** – Accommodation type (e.g., private room, entire home), impacting customer preference.  
- **price** – Cost per night, which may affect reviews and occupancy.  
- **minimum_nights** – Booking restrictions that could influence total stays and reviews.  
- **reviews_per_month** – Indicates how frequently guests leave feedback.  
- **availability_365** – The number of days a listing is available, influencing the likelihood of receiving reviews.  

### **Excluded Features:**  
- **id, host_id** – Unique identifiers that don’t contribute to predictions.  
- **name, host_name** – Free text fields with minimal predictive value.  
- **last_review** – A date field, better represented by *reviews_per_month*.  
- **calculated_host_listings_count** – More relevant for host analytics than guest review behavior.  

---

# **Data Cleaning and Preparation**  

To ensure high-quality data for analysis and machine learning, we perform the following steps:  

## **1. Handling Missing Values**  
- Identify and address missing values.  
- **Numerical features** – Use mean or median imputation.  
- **Categorical features** – Apply mode imputation or remove records with excessive missing data.  

## **2. Handling Outliers**  
- Detect anomalies in numerical data using **boxplots or Z-scores**.  
- Apply transformations such as **log scaling** or **capping extreme values** where necessary.  

## **3. Encoding Categorical Variables**  
- Convert non-numerical data into numerical form using:  
  - **One-hot encoding** for nominal categories.  
  - **Label encoding** for ordinal categories (if applicable).  

## **4. Normalization & Scaling**  
- Apply **Min-Max Scaling** or **Standardization** to ensure numerical features are on a comparable scale.  

By implementing these steps, we refine the dataset for effective machine learning modeling.  


In [3]:
# Checking for missing values
data.isnull().sum()

id                                  0
name                                0
host_id                             0
host_name                           0
neighbourhood_group               100
neighbourhood                       0
latitude                            0
longitude                           0
room_type                           0
price                               0
minimum_nights                      0
number_of_reviews                   0
last_review                        18
reviews_per_month                  18
calculated_host_listings_count      0
availability_365                    0
dtype: int64

In [4]:
# --- Step 1: Feature Selection ---
selected_features = [
    "neighbourhood_group", "neighbourhood", "room_type", "price",
    "minimum_nights", "number_of_reviews", "reviews_per_month", "availability_365"
]

# Create a copy of the selected data to prevent modifications to the original dataset
df = data[selected_features].copy()

# --- Step 2: Handling Missing Values ---

# Drop 'neighbourhood_group' if it is entirely missing
if df["neighbourhood_group"].isna().all():
    df.drop(columns=["neighbourhood_group"], inplace=True)
else:
    # Fill missing categorical values with the most frequent category
    df["neighbourhood_group"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["neighbourhood_group"]])

# Fill missing values in numerical columns using the median
df["reviews_per_month"] = SimpleImputer(strategy="median").fit_transform(df[["reviews_per_month"]])

print("Data preprocessing completed successfully!")


Data preprocessing completed successfully!


In [5]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

# --- Step 1: Feature Selection ---
selected_features = [
    "neighbourhood_group", "neighbourhood", "room_type", "price",
    "minimum_nights", "number_of_reviews", "reviews_per_month", "availability_365"
]

df = data[selected_features].copy()

# --- Step 2: Handling Missing Values ---

# Drop 'neighbourhood_group' if it is completely missing
if df["neighbourhood_group"].isna().all():
    df.drop(columns=["neighbourhood_group"], inplace=True)

# Define categorical and numerical columns
categorical_cols = [col for col in ["neighbourhood_group", "neighbourhood", "room_type"] if col in df.columns]
numeric_features = ["price", "minimum_nights", "number_of_reviews", "reviews_per_month", "availability_365"]

# Impute missing values for categorical columns
if categorical_cols:
    df[categorical_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[categorical_cols])

# Impute missing values for numerical columns
df[numeric_features] = SimpleImputer(strategy="median").fit_transform(df[numeric_features])

# --- Step 3: Outlier Removal using Z-score ---
def remove_outliers(data, columns, threshold=3):
    """Removes outliers based on Z-score."""
    for col in columns:
        z_scores = np.abs((data[col] - data[col].mean()) / data[col].std())
        data = data[z_scores < threshold]
    return data

df = remove_outliers(df, numeric_features)

# --- Step 4: Encoding Categorical Features ---
if categorical_cols:
    encoder = OneHotEncoder(drop="first", sparse_output=False)
    encoded_df = pd.DataFrame(encoder.fit_transform(df[categorical_cols]), columns=encoder.get_feature_names_out(categorical_cols))

    # Drop original categorical columns and merge encoded features
    df.drop(columns=categorical_cols, inplace=True)
    df = pd.concat([df.reset_index(drop=True), encoded_df], axis=1)

# --- Step 5: Scaling Numerical Features ---
df[numeric_features] = StandardScaler().fit_transform(df[numeric_features])

# --- Save and Display Processed Data ---
df.to_csv("cleaned_data.csv", index=False)

print("Data preprocessing completed successfully!")
print(df.head())

Data preprocessing completed successfully!
      price  minimum_nights  number_of_reviews  reviews_per_month  \
0  1.168854       -0.576693          -0.692832          -1.235100   
1 -0.922748       -0.576693          -0.712220          -0.161991   
2 -0.909340       -0.576693          -0.692832          -0.421995   
3  1.906278       -0.423363          -0.498942          -0.336903   
4 -0.775263       -0.500028           1.672619          -0.024897   

   availability_365  neighbourhood_Armour Square  neighbourhood_Avondale  \
0          1.432073                          0.0                     0.0   
1         -1.278977                          0.0                     0.0   
2          1.372490                          0.0                     0.0   
3          1.260771                          0.0                     0.0   
4          1.424625                          0.0                     0.0   

   neighbourhood_Brighton Park  neighbourhood_Clearing  \
0                          

In [6]:
# Generate a DataFrame to log data preprocessing actions
cleansing_decisions = pd.DataFrame({
    "field_name": ["price", "minimum_nights", "number_of_reviews", "reviews_per_month", "availability_365"],
    "manipulation_type": ["outlier_removal", "outlier_removal", "outlier_removal", "imputation", "outlier_removal"],
    "threshold_or_value": ["Z-score < 3", "Z-score < 3", "Z-score < 3", "Median Imputation", "Z-score < 3"]
})


In [7]:
try:
    # Establish a connection to the PostgreSQL database
    conn = psycopg2.connect(
        dbname="MSDS610",
        user="postgres",
        password="your_new_password",
        host="localhost",
        port="5432"
    )
    cursor = conn.cursor()

    # Define an SQL query to create the 'listingdata' table if it does not already exist
    create_table_query = """
    CREATE TABLE IF NOT EXISTS listingdata (
        id SERIAL PRIMARY KEY,  -- Auto-incrementing primary key
        field_name TEXT NOT NULL,  -- Column to store field names
        manipulation_type TEXT NOT NULL,  -- Column to describe type of data manipulation
        threshold_or_value TEXT  -- Column to store threshold values or specific replacement values
    );
    """

    # Execute the table creation query and commit the changes
    cursor.execute(create_table_query)
    conn.commit()

    # Insert data from the cleansing_decisions DataFrame into the table
    for _, row in cleansing_decisions.iterrows():
        cursor.execute(
            "INSERT INTO listingdata (field_name, manipulation_type, threshold_or_value) VALUES (%s, %s, %s)",
            (row["field_name"], row["manipulation_type"], row["threshold_or_value"])
        )

    # Commit the inserted data
    conn.commit()
    print("Table created and data inserted successfully!")

except Exception as e:
    # Handle any exceptions and display an error message
    print(f"An error occurred: {e}")

finally:
    # Ensure the database connection is closed properly
    cursor.close()
    conn.close()

Table created and data inserted successfully!


In [8]:
# Feature Engineering: Creating new variables to enhance data insights

# Calculate price per night to normalize pricing across different listings
df["price_per_night"] = df["price"] / df["minimum_nights"]

# Compute review rate as a measure of engagement (reviews per available day)
df["review_rate"] = df["number_of_reviews"] / (1 + df["availability_365"])

# Identify luxury listings (top 10% based on price)
luxury_threshold = df["price"].quantile(0.90)
df["is_luxury"] = (df["price"] > luxury_threshold).astype(int)

# Store feature descriptions for documentation
features_info = pd.DataFrame({
    "feature_name": ["price_per_night", "review_rate", "is_luxury"],
    "description": [
        "Computed as price divided by minimum nights to standardize cost impact",
        "Ratio of number of reviews to availability (adjusted to avoid division by zero)",
        "Binary classification for luxury listings (top 10% by price)"
    ]
})

# Export feature details to a CSV file for reference
features_info.to_csv("new_features.csv", index=False)

In [9]:
# Features to scale
features_to_normalize = ["price_per_night", "review_rate", "price", "minimum_nights", "availability_365"]
scaler = StandardScaler()
df[features_to_normalize] = scaler.fit_transform(df[features_to_normalize])


In [10]:
# Connect to PostgreSQL
conn = psycopg2.connect(
    dbname="MSDS610",
    user="postgres",
    password="your_new_password",
    host="localhost",
    port="5432"
)
cursor = conn.cursor()

# Create table if it doesn't exist
cursor.execute("""
    CREATE TABLE IF NOT EXISTS new_features (
        id SERIAL PRIMARY KEY,
        feature_name TEXT NOT NULL,
        description TEXT
    );
""")
conn.commit()

# Insert data into the table
for _, row in features_info.iterrows():
    cursor.execute(
        "INSERT INTO new_features (feature_name, description) VALUES (%s, %s)",
        (row["feature_name"], row["description"])
    )

conn.commit()
print("Data inserted successfully!")

# Close the connection
cursor.close()
conn.close()

Data inserted successfully!


In [11]:
# Define the target variable (what we want to predict) and feature variables (input data)
X = df.drop(columns=["number_of_reviews"])  # Features (all columns except the target)
y = df["number_of_reviews"]  # Target variable (the column we want to predict)

# Split the dataset into training (70%), validation (15%), and testing (15%) sets
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42  # 30% of data reserved for validation and testing
)

# Further split the remaining 30% into validation (15%) and test (15%) sets
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42  # Splitting evenly into 15% validation and 15% test data
)

# Save validation data to CSV files for future model validation
X_val.to_csv("X_val.csv", index=False)  # Save validation features
y_val.to_csv("y_val.csv", index=False)  # Save validation target variable

# **Choosing the Right Model for Prediction**  

For our analysis, we require a **supervised learning algorithm** since we aim to predict a **continuous** target variable: **number_of_reviews**. Given that our dataset includes both categorical and continuous features, we explore regression models best suited for this task.  

---

## **Why Use a Regression Model?**  
- **Continuous Target Variable** – Since *number_of_reviews* is numeric, regression models are ideal.  
- **Multivariable Impact Analysis** – We seek to understand how factors like *price, room type, and neighborhood* influence the number of reviews.  
- **Baseline & Beyond** – A **linear regression model** provides a starting point, but we will explore **tree-based models** if the relationships are complex.  

---

## **Candidate Models**  

### **1. Linear Regression**  
- Simple and interpretable, providing insights into feature impact.  
- Limited for capturing complex patterns and non-linear relationships.  

### **2. Random Forest Regressor**  
- Effectively captures non-linearity and interactions between features.  
- Provides feature importance, helping identify key influencers of review counts.  
- More computationally intensive than linear models.  

### **3. Gradient Boosting (XGBoost)**  
- Optimized for better predictive accuracy through boosting.  
- Handles missing values and outliers more effectively.  
- Requires careful tuning and longer training time for optimal performance.  

---

## **Model Evaluation**  
After training, we will compare model performance using:  
- **Mean Absolute Error (MAE)** – Measures the average difference between predicted and actual values.  
- **R² Score** – Assesses how well the model explains variance in the data.  

By testing multiple models and evaluating their performance, we will identify the most effective approach for predicting *number_of_reviews*.  


In [12]:
# Initialize and train a Linear Regression model
lr_model = LinearRegression()  # Create an instance of the Linear Regression model
lr_model.fit(X_train, y_train)  # Train the model using the training dataset

# Make predictions on the training and validation sets
y_pred_train = lr_model.predict(X_train)  # Predictions on training data
y_pred_val = lr_model.predict(X_val)  # Predictions on validation data

# Evaluate the model's performance using Mean Absolute Error (MAE) and R² score
mae = mean_absolute_error(y_val, y_pred_val)  # Calculate MAE to measure prediction accuracy
r2 = r2_score(y_val, y_pred_val)  # Calculate R² score to evaluate model fit

# Print evaluation metrics
print(f"Linear Regression MAE: {mae}")  # Lower MAE indicates better accuracy
print(f"Linear Regression R² Score: {r2}")  # R² closer to 1 indicates a better fit

Linear Regression MAE: 0.8111819021485752
Linear Regression R² Score: -0.8164122508520433


In [13]:
from sklearn.ensemble import RandomForestRegressor

# Initialize the Random Forest Regressor model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)  
# n_estimators=100 -> The model will use 100 decision trees
# random_state=42 -> Ensures reproducibility of results

# Train the model using the training dataset
rf_model.fit(X_train, y_train)  

# Make predictions on the validation set
y_pred_val = rf_model.predict(X_val)  

# Evaluate model performance using Mean Absolute Error (MAE) and R² score
mae = mean_absolute_error(y_val, y_pred_val)  # Measures the average prediction error
r2 = r2_score(y_val, y_pred_val)  # Determines how well the model explains the variance in data

# Print evaluation metrics
print(f"Random Forest MAE: {mae}")  # Lower MAE indicates better prediction accuracy
print(f"Random Forest R² Score: {r2}")  # R² closer to 1 means better model performance


Random Forest MAE: 0.43948764752532526
Random Forest R² Score: 0.178419419575495


In [14]:
from xgboost import XGBRegressor  # Import the XGBoost Regressor model

# Initialize the XGBoost Regressor model
xgb_model = XGBRegressor(
    n_estimators=100,       # Number of boosting rounds (trees)
    learning_rate=0.1,      # Step size shrinkage to prevent overfitting
    random_state=42         # Ensures reproducibility of results
)

# Train the model using the training dataset
xgb_model.fit(X_train, y_train)  

# Make predictions on the validation set
y_pred_val = xgb_model.predict(X_val)  

# Evaluate model performance using Mean Absolute Error (MAE) and R² score
mae = mean_absolute_error(y_val, y_pred_val)  # Measures the average prediction error
r2 = r2_score(y_val, y_pred_val)  # Determines how well the model explains the variance in data

# Print evaluation metrics
print(f"XGBoost MAE: {mae}")  # Lower MAE indicates better prediction accuracy
print(f"XGBoost R² Score: {r2}")  # R² closer to 1 means better model performance

XGBoost MAE: 0.5089927149215583
XGBoost R² Score: -0.05978837726929731


In [15]:
from xgboost import XGBRegressor

# Train model
xgb_model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
xgb_model.fit(X_train, y_train)

# Predictionsfrom xgboost import XGBRegressor  # Import the XGBoost Regressor model

# Initialize the XGBoost Regressor model with key parameters
xgb_model = XGBRegressor(
    n_estimators=100,      # Number of trees (boosting rounds)
    learning_rate=0.1,     # Controls how much the model adjusts with each step
    random_state=42        # Ensures reproducibility of results
)

# Train the model using the training dataset
xgb_model.fit(X_train, y_train)  

# Make predictions on the validation set
y_pred_val = xgb_model.predict(X_val)  

# Evaluate model performance using Mean Absolute Error (MAE) and R² score
mae = mean_absolute_error(y_val, y_pred_val)  # Measures the average absolute difference between actual and predicted values
r2 = r2_score(y_val, y_pred_val)  # Determines how well the model explains the variance in data

# Print evaluation metrics
print(f"XGBoost MAE: {mae}")  # Lower MAE indicates better prediction accuracy
print(f"XGBoost R² Score: {r2}")  # R² closer to 1 means better model performance

XGBoost MAE: 0.5089927149215583
XGBoost R² Score: -0.05978837726929731


In [16]:
# Loading live data
live_data = pd.read_csv("listings_data.csv")

In [17]:
def preprocess_live_data(df):
    """
    Cleans and preprocesses the live data to match the format of the trained model.
    This function handles missing values, encodes categorical variables, and normalizes numerical columns.
    """
    
    # Step 1: Handle missing values by filling them with 0
    df.fillna(0, inplace=True)

    # Step 2: Convert categorical variables to numerical values using predefined mappings
    # Here, "room_type" is mapped to numerical values for model compatibility
    df["room_type"] = df["room_type"].map({"Entire home/apt": 0, "Private room": 1, "Shared room": 2})

    # Step 3: Normalize numerical features to ensure consistent scale
    # Standardizing the "price" column by subtracting the mean and dividing by the standard deviation
    df["price"] = (df["price"] - df["price"].mean()) / df["price"].std()

    return df  # Return the cleaned and transformed DataFrame

# Apply preprocessing to the incoming live data
live_data_processed = preprocess_live_data(live_data)

In [20]:
# Loading the best-trained model
import joblib
best_model = joblib.load("optimized_xgb.pkl")

In [26]:
# Ensure the model is trained or loaded
if "optimized_xgb" not in locals():
    optimized_xgb = joblib.load("optimized_xgb.pkl")  # Load the trained model if not defined

# Retrieve feature names correctly
trained_feature_names = optimized_xgb.get_booster().feature_names if hasattr(optimized_xgb, "get_booster") else optimized_xgb.feature_names_in_

# One-Hot Encode Categorical Variables in live data
live_data_processed = pd.get_dummies(live_data, columns=["room_type", "neighbourhood"])

# Ensure all necessary features exist in live_data_processed
missing_cols = set(trained_feature_names) - set(live_data_processed.columns)
for col in missing_cols:
    live_data_processed[col] = 0  # Assign 0 to missing columns

# Reorder columns to match trained model
live_data_processed = live_data_processed[trained_feature_names]

# Ensure best_model is correctly assigned (use optimized_xgb)
best_model = optimized_xgb  # Assign the trained model

# Generate predictions
live_data["predicted_reviews"] = best_model.predict(live_data_processed)

In [25]:
# Load the CSV file containing predictions
df = pd.read_csv("live_data_predictions.csv")  

# Display the first few rows of the dataset to verify the data
print(df.head())  

         id  predicted_reviews
0  40182247          -0.241450
1  45438479          -0.205501
2  39793384           0.996668
3  35942729           1.036146
4   1468342           0.991411


# **Key Findings from the Analysis**  

- **Neighborhood and Room Type Influence Reviews**  
  - Listings in specific neighborhoods and room types consistently received higher review counts, supporting our initial hypothesis.  

- **Model Performance and Limitations**  
  - The model effectively captured overall trends but struggled with extreme values, potentially due to outliers or unaccounted factors affecting review frequency.  

- **Impact of Pricing on Reviews**  
  - Price exhibited a moderate correlation with review counts, indicating that affordability may play a role in booking frequency and subsequent guest feedback.  
