# House Price Prediction Project

This notebook demonstrates a complete machine learning pipeline for predicting house prices using the California Housing dataset. We'll go through the following steps:

1. Data Loading and Preprocessing
2. Exploratory Data Analysis
3. Feature Engineering
4. Model Selection and Training
5. Model Evaluation
6. Hyperparameter Tuning
7. Final Predictions and Model Persistence

Let's begin by importing the necessary libraries and loading our dataset.

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')

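# Set the style for our plots
# The 'seaborn' matplotlib style name was deprecated in matplotlib 3.6 and
# no longer exists in recent releases, so we let seaborn apply its own theme.
sns.set_theme(style="darkgrid", palette="husl")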

## 1. Data Loading and Preprocessing

We'll use the California Housing dataset from scikit-learn. Each of its 20,640 rows describes one census block group in California. The features are:
- MedInc: Median income in the block group
- HouseAge: Median house age in the block group
- AveRooms: Average number of rooms per household
- AveBedrms: Average number of bedrooms per household
- Population: Block group population
- AveOccup: Average number of household members
- Latitude: Block group latitude
- Longitude: Block group longitude

The target is the median house value of the block group, expressed in units of $100,000.

In [None]:
# Load the California Housing dataset
from sklearn.datasets import fetch_california_housing

# Put the features in a DataFrame and keep the target as a NumPy array
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target

# Display the first few rows and basic information about the dataset
print("Dataset Shape:", X.shape)
print("\nFeature Names:", housing.feature_names)
print("\nFirst few rows of the dataset:")
print(X.head())
print("\nBasic statistics of the target variable (median house value, in units of $100,000):")
print(pd.Series(y).describe())

## 2. Exploratory Data Analysis

Let's analyze our dataset through visualizations and statistical summaries to better understand the relationships between variables and identify any patterns or anomalies.

In [None]:
# Create correlation matrix
plt.figure(figsize=(12, 8))
sns.heatmap(X.corr(), annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix of Features')
plt.tight_layout()
plt.show()

# Distribution of target variable
plt.figure(figsize=(10, 6))
sns.histplot(y, bins=50)
plt.title('Distribution of House Prices')
plt.xlabel('Price (in $100,000)')
plt.ylabel('Count')
plt.show()

# Create scatter plots for important features
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel()

sns.scatterplot(data=X, x='MedInc', y=y, ax=axes[0])
axes[0].set_title('House Price vs. Median Income')

sns.scatterplot(data=X, x='HouseAge', y=y, ax=axes[1])
axes[1].set_title('House Price vs. House Age')

sns.scatterplot(data=X, x='AveRooms', y=y, ax=axes[2])
axes[2].set_title('House Price vs. Average Rooms')

sns.scatterplot(data=X, x='Population', y=y, ax=axes[3])
axes[3].set_title('House Price vs. Population')

plt.tight_layout()
plt.show()
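
The heatmap above only covers correlations among the features themselves. As a complement, here is a minimal sketch of ranking features by their correlation with the target. It runs on synthetic stand-in data, and the column names `strong`, `weak`, and `noise` are made up for illustration; they are not part of the housing dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the California Housing values).
rng = np.random.default_rng(0)
n = 500
features = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
target = 2.0 * features["strong"] + 0.3 * features["weak"] + rng.normal(scale=0.5, size=n)

# Pearson correlation of every feature column with the target.
target_corr = features.corrwith(target)

# Rank by absolute strength while keeping the sign of each correlation.
order = target_corr.abs().sort_values(ascending=False).index
target_corr = target_corr.loc[order]
print(target_corr)
```

In this notebook the same idea would presumably look like `X.corrwith(pd.Series(y, index=X.index))`, since `y` is a plain NumPy array.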

## 3. Feature Engineering

Now that we understand our data better, let's prepare it for modeling by:
1. Creating new features
2. Scaling numerical features
3. Removing outliers

In [None]:
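# Create new features
X['RoomsByBedrooms'] = X['AveRooms'] / X['AveBedrms']
# AveOccup is population per household, so this ratio recovers the number of households
X['PopulationByHousehold'] = X['Population'] / X['AveOccup']

# Scale the features
# Note: the scaler is fitted on the full dataset, before the train/test split,
# so test-set statistics influence the scaling. For a stricter evaluation,
# fit the scaler on the training split only.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)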

# Display the first few rows of scaled features
print("Scaled features:")
print(X_scaled.head())

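# Remove outliers using the IQR method
def remove_outliers(df, columns):
    """Drop rows whose value in any of `columns` lies outside
    [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].

    Columns are filtered one after another, so the quartiles of each column
    are computed on the rows that survived the previous columns.
    """
    df_clean = df.copy()
    for column in columns:
        Q1 = df_clean[column].quantile(0.25)
        Q3 = df_clean[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        df_clean = df_clean[
            (df_clean[column] >= lower_bound) &
            (df_clean[column] <= upper_bound)
        ]
    return df_clean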

# Remove outliers from selected columns
columns_to_clean = ['MedInc', 'AveRooms', 'Population']
X_clean = remove_outliers(X_scaled, columns_to_clean)
y_clean = y[X_clean.index]

print("\nShape before outlier removal:", X_scaled.shape)
print("Shape after outlier removal:", X_clean.shape)
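
`remove_outliers` filters column by column, so later columns see quartiles computed on already-filtered rows. A possible alternative, sketched below, computes all fences once on the unfiltered data and returns a boolean mask, which makes it easy to filter features and target with the same rows. The helper `iqr_mask`, its parameter `k`, and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd

def iqr_mask(df, columns, k=1.5):
    """Return a boolean Series that is True for rows inside the IQR fences of every column."""
    q1 = df[columns].quantile(0.25)
    q3 = df[columns].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Compare each column against its own fences (bounds are aligned on column labels).
    inside = df[columns].ge(lower, axis=1) & df[columns].le(upper, axis=1)
    return inside.all(axis=1)

# Toy data with one outlier per column, in different rows.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "b": [-50, 10, 11, 12, 13, 14, 15, 16, 17, 18],
})
mask = iqr_mask(toy, ["a", "b"])
print(toy[mask])
```

For column `a` the quartiles are 3.25 and 7.75, so the fences are -3.5 and 14.5 and only the value 100 falls outside. Column `b` loses the value -50 in the same way, leaving eight rows.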

## 4. Model Selection and Training

We'll try two different models and compare their performance:
1. Linear Regression (baseline model)
2. Random Forest Regressor (more complex model)

Let's split our data into training and testing sets, then train both models.
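
One detail worth keeping in mind when splitting: if a scaler is fitted before the split, the test rows contribute to its mean and variance. The sketch below shows the stricter order (split first, then fit the scaler on the training rows only) on synthetic data. The names `X_demo`, `y_demo`, and `demo_scaler` are illustrative assumptions, not variables from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features and target (not the housing data).
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

# 1. Split first.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 2. Fit the scaler on the training rows only, then reuse it for the test rows.
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print("train means:", X_tr_scaled.mean(axis=0).round(3))
print("test means: ", X_te_scaled.mean(axis=0).round(3))
```

The training columns end up with mean 0 and unit variance by construction, while the test columns are only approximately standardized, which is what an unseen sample would look like in practice.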

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X_clean, y_clean, test_size=0.2, random_state=42
)

# Train Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Train Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions with both models
lr_pred = lr_model.predict(X_test)
rf_pred = rf_model.predict(X_test)

# Print initial results
print("Linear Regression Results:")
print("R² Score:", r2_score(y_test, lr_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, lr_pred)))
print("\nRandom Forest Results:")
print("R² Score:", r2_score(y_test, rf_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, rf_pred)))
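
To make the two metrics above concrete, here is a small worked example that computes RMSE and R² by hand with NumPy. The four values are invented for illustration and are not model output.

```python
import numpy as np

# Made-up actual and predicted values, in the same units as the target.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual.
mse = np.mean(residuals ** 2)
rmse = np.sqrt(mse)

# R^2: one minus the ratio of residual to total sum of squares.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"RMSE = {rmse:.4f}")  # RMSE = 0.3536
print(f"R^2  = {r2:.2f}")    # R^2  = 0.90
```

Because the target is expressed in units of $100,000, an RMSE of about 0.35 would correspond to a typical error of roughly $35,000.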

## 5. Model Evaluation

Let's evaluate our models in more detail by:
1. Comparing predicted vs actual values
2. Identifying feature importance (for Random Forest)
3. Analyzing residuals

In [None]:
# Create scatter plots of predicted vs actual values
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Linear Regression
ax1.scatter(y_test, lr_pred, alpha=0.5)
ax1.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
ax1.set_xlabel('Actual Price')
ax1.set_ylabel('Predicted Price')
ax1.set_title('Linear Regression: Predicted vs Actual')

# Random Forest
ax2.scatter(y_test, rf_pred, alpha=0.5)
ax2.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
ax2.set_xlabel('Actual Price')
ax2.set_ylabel('Predicted Price')
ax2.set_title('Random Forest: Predicted vs Actual')

plt.tight_layout()
plt.show()

# Plot feature importance for Random Forest
# (use the columns the model was actually trained on)
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(data=feature_importance, x='importance', y='feature')
plt.title('Feature Importance (Random Forest)')
plt.tight_layout()
plt.show()

# Plot residuals
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Linear Regression residuals
residuals_lr = y_test - lr_pred
ax1.scatter(lr_pred, residuals_lr, alpha=0.5)
ax1.axhline(y=0, color='r', linestyle='--')
ax1.set_xlabel('Predicted Price')
ax1.set_ylabel('Residuals')
ax1.set_title('Linear Regression: Residual Plot')

# Random Forest residuals
residuals_rf = y_test - rf_pred
ax2.scatter(rf_pred, residuals_rf, alpha=0.5)
ax2.axhline(y=0, color='r', linestyle='--')
ax2.set_xlabel('Predicted Price')
ax2.set_ylabel('Residuals')
ax2.set_title('Random Forest: Residual Plot')

plt.tight_layout()
plt.show()
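
The `feature_importances_` values plotted above are impurity-based and are computed from the training data. A common cross-check is permutation importance on held-out data, available as `sklearn.inspection.permutation_importance`. The following is a minimal sketch on synthetic data, where feature 0 is constructed to matter most; `demo_rf` and the data are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: feature 0 dominates, feature 1 is weak, feature 2 is pure noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = 3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
demo_rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle one feature at a time on the held-out rows and measure the drop in score.
perm = permutation_importance(demo_rf, X_te, y_te, n_repeats=5, random_state=0)
for i in np.argsort(perm.importances_mean)[::-1]:
    print(f"feature {i}: {perm.importances_mean[i]:.3f} +/- {perm.importances_std[i]:.3f}")
```

In this notebook the equivalent call would presumably be `permutation_importance(rf_model, X_test, y_test, ...)`.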

## 6. Hyperparameter Tuning

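If the Random Forest outperformed Linear Regression in the results above, it is the natural candidate for tuning. We'll use GridSearchCV with 5-fold cross-validation to search over its hyperparameters and see whether performance improves further.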
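
Before launching a grid search it helps to know how many models it will train. The sketch below counts the candidates for the grid used in the next cell with scikit-learn's `ParameterGrid`; the variable `cv_folds` is introduced here only for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell of this notebook.
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
cv_folds = 5  # matches cv=5 in the grid search

n_candidates = len(ParameterGrid(param_grid))  # 3 * 4 * 3 * 3
n_fits = n_candidates * cv_folds
print(n_candidates, "candidates x", cv_folds, "folds =", n_fits, "fits")
```

That is 108 candidates and 540 cross-validation fits, plus one final refit on the full training set, so the cell can take a while to run. Shrinking the grid is an easy way to shorten it.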

In [None]:
# Define parameter grid for Random Forest
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Perform grid search
rf_grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1
)

rf_grid.fit(X_train, y_train)

# Print best parameters and the cross-validated RMSE of the best candidate
# (square root of the mean MSE across folds)
print("Best parameters:", rf_grid.best_params_)
print("Best CV RMSE:", np.sqrt(-rf_grid.best_score_))

# Make predictions with the optimized model
rf_best_pred = rf_grid.predict(X_test)

# Print final results
print("\nOptimized Random Forest Results:")
print("R² Score:", r2_score(y_test, rf_best_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, rf_best_pred)))

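## 7. Final Predictions and Model Persistence

Finally, let's save the tuned model to disk with joblib, reload it, and compare its predictions with the actual values for a few test samples.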

In [None]:
import joblib

# Save the best model found by the grid search
model_filename = 'california_housing_model.joblib'
joblib.dump(rf_grid.best_estimator_, model_filename)
print(f"Model saved as {model_filename}")

# Example of loading and using the model
loaded_model = joblib.load(model_filename)

# Make predictions on a sample
sample_predictions = loaded_model.predict(X_test[:5])
# y_test is a NumPy array here, so it has no .values attribute
actual_prices = np.asarray(y_test[:5])
print("\nSample predictions:")
print("Predicted prices:", sample_predictions)
print("Actual prices:", actual_prices)

# Create a DataFrame with actual vs predicted values
comparison_df = pd.DataFrame({
    'Actual Price': actual_prices,
    'Predicted Price': sample_predictions,
    'Difference': actual_prices - sample_predictions
})
print("\nPrediction Comparison:")
print(comparison_df)
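
Note that the saved estimator was trained on scaled, engineered features, so raw inputs would need the same preprocessing before prediction. One way to keep the two together is to persist a single scikit-learn pipeline. The sketch below illustrates this on synthetic data; `demo_pipeline` and the file name are hypothetical, and this is not how the notebook above stores its model.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the housing data).
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=10.0, scale=3.0, size=(100, 4))
y_demo = X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=100)

# Scaler and model live in one object, so they are saved and loaded together.
demo_pipeline = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=20, random_state=1),
)
demo_pipeline.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipeline.joblib")  # hypothetical file name
    joblib.dump(demo_pipeline, path)
    restored = joblib.load(path)

# The restored pipeline accepts unscaled inputs and reproduces the original predictions.
same = np.allclose(restored.predict(X_demo[:5]), demo_pipeline.predict(X_demo[:5]))
print("Predictions match after reload:", same)
```

With this layout, whoever loads the file only needs to pass unscaled feature rows, because the scaling step is replayed inside `predict`.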