# Random Forest Model (California Housing Dataset)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor       # Import the Random Forest Regressor
from sklearn.model_selection import train_test_split     # Good practice for train/test split
from sklearn.metrics import mean_squared_error, r2_score # For evaluating the model
import time # For timing
import math # For math.ceil
import copy # For copy.deepcopy
import joblib

## 1. Introduction
This notebook demonstrates the implementation and evaluation of a Random Forest Regressor for predicting California house values. It builds upon previous analyses (e.g., Linear Regression) by applying targeted feature engineering and leveraging a more flexible, non-linear model to capture complex relationships in the data.

In [None]:
# --- 1. Data Loading and Initial Preparation --- 
# Load the California Housing dataset. This dataset is suitable for regression tasks.
housing = fetch_california_housing()
# Create a Pandas DataFrame from the features, using the feature names for columns.
california_df = pd.DataFrame(housing.data, columns=housing.feature_names)
# Add the target variable (Median House Value) to the DataFrame.
california_df['MedHouseVal'] = housing.target

In [None]:
# Define the initial set of features to be used. These were selected based on
# common sense and initial exploratory data analysis (e.g., from previous linear regression attempts).

initial_feature_names_rf = [
    'MedInc',      # Median Income in block group
    'HouseAge',    # Median house age in block group
    'AveRooms',    # Average number of rooms per household
    'Population'   # Block group population
]

# Create the base feature matrix (X) and target vector (y) for the Random Forest model.
# X_train_rf will be further modified with engineered features.
X_train_rf = california_df[initial_feature_names_rf].values
y_train_rf = california_df['MedHouseVal'].values

# This list will dynamically track all features used in the RF model for plotting labels
X_features_rf = list(initial_feature_names_rf)
#print(f"Base X_train_rf shape (original features): {X_train_rf.shape}")

# --- 2. Feature Engineering for Random Forest ---
# Random Forests are powerful because they can capture non-linearity and interactions
# automatically. However, well-engineered features can still significantly improve
# their performance and help them discover patterns more easily.
# Importantly, tree-based models like Random Forests do NOT require feature scaling/normalization. 

#print("\n2. Performing Feature Engineering (Polynomial, Logarithmic, Interaction) for Random Forest...")

# --- Store original column indices for plotting later ---
# We need to know the original index of certain features in the initial_feature_names_rf list
# before we start adding new columns, as concatenation changes column positions.
medinc_original_column_index_rf = X_features_rf.index('MedInc')
population_original_column_index_rf = X_features_rf.index('Population')
averooms_original_column_index_rf = X_features_rf.index('AveRooms')


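# --- Feature 2.1: Add MedInc_Sq (Polynomial Feature) ---
# Rationale: The earlier Linear Regression analysis showed that 'MedInc' has a non-linear
# relationship with the target, which linear models struggle with. A squared term lets a
# linear model approximate this curve. Since 'MedInc' is non-negative, squaring preserves the
# ordering of values, so tree splits on MedInc_Sq largely mirror splits on 'MedInc' itself.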
medinc_column_data_rf = X_train_rf[:, medinc_original_column_index_rf]
medinc_squared_data_rf = medinc_column_data_rf**2
X_features_rf.append('MedInc_Sq') # Add the new feature name to our list
X_train_rf = np.c_[X_train_rf, medinc_squared_data_rf] # Concatenate as a new column


# --- Feature 2.2: Add Log_Population (Logarithmic Transformation) ---
# Rationale: 'Population' has a highly skewed distribution (many small values, a few very large ones).
# A log transform makes such a distribution more symmetrical and relationships closer to linear,
# which mainly benefits linear models. Tree splits depend only on the ordering of values, so a
# Random Forest gains little from this order-preserving transform.
# np.log1p (log(1 + x)) is used because it handles zero values gracefully.
population_column_data_rf = X_train_rf[:, population_original_column_index_rf]
log_population_data_rf = np.log1p(population_column_data_rf)
X_features_rf.append('Log_Population')
X_train_rf = np.c_[X_train_rf, log_population_data_rf]


# --- Feature 2.3: Add MedInc_x_AveRooms (Interaction Feature) ---
# Rationale: An interaction term captures how the effect of one feature might depend
# on the value of another. For example, the impact of average rooms might be different
# at different median income levels.
# The interaction is computed from the original, untransformed columns of the DataFrame.
medinc_x_averooms_data_rf = california_df['MedInc'].values * california_df['AveRooms'].values
X_features_rf.append('MedInc_x_AveRooms')
X_train_rf = np.c_[X_train_rf, medinc_x_averooms_data_rf]


print(f"X_train_rf shape after adding all new features: {X_train_rf.shape}")
print(f"Final feature list for RF model: {X_features_rf}")
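
The same three engineered features can also be built directly on a DataFrame, which avoids tracking column indices by hand. The following is a minimal sketch, not the method used in this notebook: the helper name `add_engineered_features` and the tiny `demo` frame are hypothetical, and the helper assumes the input has the `MedInc`, `HouseAge`, `AveRooms`, and `Population` columns used above.

```python
import numpy as np
import pandas as pd

BASE_FEATURES = ["MedInc", "HouseAge", "AveRooms", "Population"]

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return the four base columns plus the three engineered ones (hypothetical helper)."""
    out = df[BASE_FEATURES].copy()
    out["MedInc_Sq"] = out["MedInc"] ** 2
    out["Log_Population"] = np.log1p(out["Population"])
    out["MedInc_x_AveRooms"] = out["MedInc"] * out["AveRooms"]
    return out

# Tiny made-up frame, for illustration only (not real census rows).
demo = pd.DataFrame({
    "MedInc": [2.0, 8.0],
    "HouseAge": [30.0, 10.0],
    "AveRooms": [5.0, 6.5],
    "Population": [0.0, 1500.0],
})
features = add_engineered_features(demo)
print(list(features.columns))
print(features.to_numpy().shape)
```

Because the column names travel with the data, the resulting column order can be passed to the model and reused for plot labels without a separately maintained list.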

In [None]:
# --- 3. Split Data into Training and Testing Sets ---
# This is a crucial step for robust model evaluation. Scoring the model on held-out data shows
# how well it generalizes to unseen data and makes overfitting detectable.
print("\n3. Splitting data into training (80%) and testing (20%) sets...")

X_train_split, X_test_split, y_train_split, y_test_split = train_test_split(
    X_train_rf, y_train_rf, test_size=0.2, random_state=42 # random_state for reproducibility
)
print(f"X_train_split shape: {X_train_split.shape}")
print(f"X_test_split shape: {X_test_split.shape}")


# --- 4. Initialize and Train the Random Forest Regressor ---
print("\n4. Initializing and Training Random Forest Regressor...")
# RandomForestRegressor: An ensemble learning method for regression. It constructs
# a multitude of decision trees at training time and outputs the average prediction
# of the individual trees.
# n_estimators: The number of trees in the forest. More trees generally improve accuracy
#               but increase computation time. 100 is a common starting point.
# random_state: Seeds the bootstrap sampling of rows for each tree and the random
#               selection of candidate features at each split. Ensures reproducible results.
# n_jobs=-1: Tells scikit-learn to use all available CPU cores for parallel processing,
#            which significantly speeds up training for larger datasets/models.
rf_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)

# Train the model using the training data.
print("Training the Random Forest model...")
tic_rf = time.perf_counter()  # perf_counter is a monotonic clock intended for timing intervals
rf_model.fit(X_train_split, y_train_split)
toc_rf = time.perf_counter()
print(f"Random Forest training duration: {1000*(toc_rf-tic_rf):.1f} ms")

# Save the trained model and the list of feature names with joblib.
# This creates 'random_forest_model.pkl' and 'feature_names.pkl' in the current working
# directory, so that an API can load them later.
joblib.dump(rf_model, 'random_forest_model.pkl')
joblib.dump(X_features_rf, 'feature_names.pkl')

# --- 5. Make Predictions and Evaluate the Model ---
#print("\n5. Making predictions and evaluating the model...")

# Make predictions on both the training and testing sets.
# Predictions on the test set are critical for assessing generalization performance.
yp_train_rf = rf_model.predict(X_train_split)
yp_test_rf = rf_model.predict(X_test_split)

# Evaluate the model using Root Mean Squared Error (RMSE) and R-squared (R2).
# RMSE: The square root of the mean squared error, expressed in the units of the target
#       (here, hundreds of thousands of dollars). Lower is better.
# R-squared: The proportion of variance in the target that the model explains.
#            Higher (closer to 1) is better.
train_rmse = np.sqrt(mean_squared_error(y_train_split, yp_train_rf))
train_r2 = r2_score(y_train_split, yp_train_rf)
print(f"Random Forest (Train Set) RMSE: {train_rmse:.4f}")
print(f"Random Forest (Train Set) R-squared: {train_r2:.4f}")

test_rmse = np.sqrt(mean_squared_error(y_test_split, yp_test_rf))
test_r2 = r2_score(y_test_split, yp_test_rf)
print(f"Random Forest (Test Set) RMSE: {test_rmse:.4f}")
print(f"Random Forest (Test Set) R-squared: {test_r2:.4f}")

# --- Interpretation of Evaluation ---
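# If Train R2 is much higher than Test R2, it suggests overfitting.
# If both are high and close to each other, it suggests good generalization.
# Note: a Random Forest with fully grown trees usually scores very high on its own
# training data, so some gap between the two is expected.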


# --- 6. Plot Predictions vs. Target for Key Features ---
# Visualizing predictions against original features helps in understanding
# how well the model captures underlying patterns, especially non-linear ones.
print("\n6. Plotting Random Forest predictions vs. original features...")

# Define custom colours for plotting
TARGET_COLOR = 'mediumseagreen' # For the actual target values
PREDICT_COLOR = 'rebeccapurple' # For the model's predictions

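# To visualize predictions on the full dataset (as in previous LR plots),
# we predict on the full X_train_rf, which contains all 7 features. Because 80% of these
# rows were used for training, the plots look more accurate than the test-set metrics alone.
yp_rf_full = rf_model.predict(X_train_rf)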

# Plotting the initial 4 base features (MedInc, HouseAge, AveRooms, Population)
# The predictions now incorporate the effect of all 7 features in the RF model.
fig, ax = plt.subplots(1, len(initial_feature_names_rf), figsize=(16, 4), sharey=True)
for i, feature_name in enumerate(initial_feature_names_rf):
    ax[i].scatter(X_train_rf[:, i], y_train_rf, label='Target', s=5, alpha=0.5, color=TARGET_COLOR)
    ax[i].scatter(X_train_rf[:, i], yp_rf_full, label='Predict (RF)', s=5, alpha=0.7, color=PREDICT_COLOR)
    ax[i].set_xlabel(feature_name)
    ax[i].set_title(feature_name)

ax[0].set_ylabel("Median House Value");
ax[0].legend(); # Legend is shown only on the first subplot to avoid repeating it in every panel
fig.suptitle("Target vs. Prediction: Random Forest Regressor (on original features)", y=1.02)
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

# --- Crucial plot: MedInc (original) vs. predictions with Random Forest ---
# This plot is key to visually assessing how well the Random Forest handles
# the previously observed non-linearity and varying dispersion in 'MedInc'.
plt.figure(figsize=(8, 5))
plt.scatter(X_train_rf[:, medinc_original_column_index_rf], y_train_rf, label='Target', s=5, alpha=0.5, color=TARGET_COLOR)
plt.scatter(X_train_rf[:, medinc_original_column_index_rf], yp_rf_full, label='Predict (RF)', s=5, alpha=0.7, color=PREDICT_COLOR)
plt.xlabel("MedInc (Original)")
plt.ylabel("Median House Value")
plt.title("MedInc (Original) vs. Target/Prediction (Random Forest)")
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()


print("\n--- End of Random Forest Analysis ---")
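
The train/test comparison in step 5 rests on a single 80/20 split. A hedged way to check that the test score is not an artifact of that one split is k-fold cross-validation. The sketch below uses scikit-learn's `cross_val_score` on synthetic stand-in data from `make_regression` so that it runs without downloading anything; the names `X_demo`, `y_demo`, `cv_model`, and `cv_r2` are illustrative, and the forest is kept small only to run quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: replace with the notebook's feature matrix and target).
X_demo, y_demo = make_regression(n_samples=300, n_features=7, noise=10.0, random_state=42)

# Five folds: each fold is held out once while the model trains on the other four.
cv_model = RandomForestRegressor(n_estimators=30, random_state=42)
cv_r2 = cross_val_score(cv_model, X_demo, y_demo, cv=5, scoring="r2")

print(f"R-squared per fold: {cv_r2.round(3)}")
print(f"Mean: {cv_r2.mean():.3f}, std: {cv_r2.std():.3f}")
```

A small spread across folds suggests the single-split test score is representative; a large spread suggests it depends on which rows landed in the test set.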


## 2. Conclusion

The Random Forest Regressor demonstrated a significant improvement over the linear regression models. Its ability to handle non-linear relationships allowed it to capture the complex patterns in the California housing data much more effectively, resulting in superior predictive performance.

**Key advantages of the Random Forest model observed in this analysis:**

- Handling Non-Linearity: Unlike linear regression, Random Forest inherently models non-linear relationships, leading to more accurate predictions, particularly for features like 'MedInc'.

- Computational Efficiency: The Random Forest model trained relatively quickly, especially compared to the iterative process of feature engineering and Gradient Descent required for linear regression.

- No Normalization Required: Random Forest models are tree-based and thus invariant to feature scaling, eliminating the need for normalization and simplifying the data preparation process.
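
The scale-invariance point can be checked empirically. The following is a small sketch on synthetic data, not part of the original analysis: two forests are trained with the same `random_state`, one on raw features and one on the same features multiplied and shifted by constants. Integer-valued features are an assumption made here so that the rescaling is exact in floating point; all variable names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Integer-valued synthetic features keep the rescaling exact in floating point.
X_raw = rng.integers(0, 1000, size=(300, 3)).astype(float)
y_syn = 0.01 * X_raw[:, 0] + np.sin(X_raw[:, 1] / 100.0) + rng.normal(0.0, 0.1, size=300)

# Rescale and shift every column; the ordering of values within each column is unchanged.
X_rescaled = X_raw * 1024.0 + 7.0

rf_raw = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_raw, y_syn)
rf_rescaled = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_rescaled, y_syn)

# Compare the two models on the same rows, each in its own feature scale.
max_diff = np.max(np.abs(rf_raw.predict(X_raw) - rf_rescaled.predict(X_rescaled)))
print(f"Largest prediction difference: {max_diff:.2e}")
```

The two sets of predictions should agree up to floating-point noise, because each tree only asks whether a value falls above or below a threshold, and rescaling moves the thresholds along with the data.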

The success of the Random Forest Regressor in this project makes it an excellent candidate for deployment. To demonstrate this capability and make the model readily accessible, the next step will be to create a simple API (using FastAPI) to serve predictions based on user-provided housing features. This API deployment will be detailed in a separate project.
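
Whatever framework serves the model, the service has to load the two saved files and rebuild the engineered features in the order the model was trained on. The sketch below illustrates that step only and makes several assumptions: `build_feature_row` is a hypothetical helper, the input values are made up, and a stand-in model trained on random numbers is written to a temporary directory so the example runs on its own. In practice the files would be the `random_forest_model.pkl` and `feature_names.pkl` saved by this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

FEATURE_NAMES = ["MedInc", "HouseAge", "AveRooms", "Population",
                 "MedInc_Sq", "Log_Population", "MedInc_x_AveRooms"]

def build_feature_row(raw: dict, feature_names: list) -> np.ndarray:
    """Add the engineered features to raw inputs and order them as the model expects."""
    values = dict(raw)
    values["MedInc_Sq"] = raw["MedInc"] ** 2
    values["Log_Population"] = np.log1p(raw["Population"])
    values["MedInc_x_AveRooms"] = raw["MedInc"] * raw["AveRooms"]
    return np.array([[values[name] for name in feature_names]], dtype=float)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in artifacts trained on random data; its predictions are meaningless.
    rng = np.random.default_rng(0)
    stand_in = RandomForestRegressor(n_estimators=10, random_state=0)
    stand_in.fit(rng.random((50, len(FEATURE_NAMES))), rng.random(50))
    joblib.dump(stand_in, os.path.join(tmp_dir, "random_forest_model.pkl"))
    joblib.dump(FEATURE_NAMES, os.path.join(tmp_dir, "feature_names.pkl"))

    # What a prediction service would do once at startup.
    model = joblib.load(os.path.join(tmp_dir, "random_forest_model.pkl"))
    names = joblib.load(os.path.join(tmp_dir, "feature_names.pkl"))

# What it would do per request: four raw inputs in, one prediction out.
raw_input = {"MedInc": 3.0, "HouseAge": 20.0, "AveRooms": 5.0, "Population": 1000.0}
row = build_feature_row(raw_input, names)
prediction = model.predict(row)
print(row.shape, prediction.shape)
```

Ordering the row by the saved feature-name list, rather than by a hard-coded order, keeps the service consistent with the notebook if features are added or reordered later.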