# Kaggle House Price Prediction - Complete Notebook

## Introduction to Machine Learning with Kaggle House Prices

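This notebook walks through a complete machine learning workflow on a Kaggle house price dataset (the Melbourne housing data in `melb_data.csv`): loading and exploring the data, selecting features, and then building and comparing Decision Tree and Random Forest models to predict sale prices.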

## 1. Load and Explore Data

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
import matplotlib.pyplot as plt

# Path of the file to read (Melbourne housing data)
melbourne_file_path = 'melb_data.csv'

# Load the data
home_data = pd.read_csv(melbourne_file_path)

# Display basic information about the dataset
print("Dataset shape:", home_data.shape)
print("\nFirst few rows:")
print(home_data.head())

# Display statistical summary
print("\nDataset statistics:")
print(home_data.describe())

# Display column names
print("\nColumn names:")
print(home_data.columns.tolist())

Dataset shape: (13580, 21)

First few rows:
       Suburb           Address  Rooms Type      Price Method SellerG  \
0  Abbotsford      85 Turner St      2    h  1480000.0      S  Biggin   
1  Abbotsford   25 Bloomburg St      2    h  1035000.0      S  Biggin   
2  Abbotsford      5 Charles St      3    h  1465000.0     SP  Biggin   
3  Abbotsford  40 Federation La      3    h   850000.0     PI  Biggin   
4  Abbotsford       55a Park St      4    h  1600000.0     VB  Nelson   

        Date  Distance  Postcode  ...  Bathroom  Car  Landsize  BuildingArea  \
0  3/12/2016       2.5    3067.0  ...       1.0  1.0     202.0           NaN   
1  4/02/2016       2.5    3067.0  ...       1.0  0.0     156.0          79.0   
2  4/03/2017       2.5    3067.0  ...       2.0  0.0     134.0         150.0   
3  4/03/2017       2.5    3067.0  ...       2.0  1.0      94.0           NaN   
4  4/06/2016       2.5    3067.0  ...       1.0  2.0     120.0         142.0   

   YearBuilt  CouncilArea Lattitude 
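
The preview already shows `NaN` entries in `BuildingArea`, so counting missing values per column is a natural next check. Below is a minimal sketch; `missing_report` is a hypothetical helper, and `demo` is a small stand-in frame holding three columns from the five preview rows. On the full data the call would presumably be `missing_report(home_data)`.

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Return missing-value counts for columns that contain at least one NaN."""
    counts = df.isnull().sum()
    return counts[counts > 0].sort_values(ascending=False)

# Stand-in for home_data: three columns copied from the five preview rows
demo = pd.DataFrame({
    "Rooms": [2, 2, 3, 3, 4],
    "Landsize": [202.0, 156.0, 134.0, 94.0, 120.0],
    "BuildingArea": [np.nan, 79.0, 150.0, np.nan, 142.0],
})

report = missing_report(demo)
print(report)
```

Columns that show up in this report need imputation or removal before they can be passed to most scikit-learn estimators.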

## 2. Prepare Features and Target

In [2]:
# Define the target variable (the Melbourne data stores the sale price in 'Price')
y = home_data['Price']

# Define feature columns: numeric columns visible in the preview above
# ('Lattitude' is the dataset's own spelling)
feature_columns = ['Rooms', 'Distance', 'Bathroom', 'Landsize', 'Lattitude']

# Create feature matrix X
X = home_data[feature_columns]

print("Features shape:", X.shape)
print("Target shape:", y.shape)
print("\nFeatures preview:")
print(X.head())
print("\nTarget preview:")
print(y.head())
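
Selecting columns by name raises a `KeyError` whenever a name is absent from the DataFrame, which is easy to hit when code written for one housing dataset is run against another. One way to surface this early is to check the names before indexing. The sketch below assumes a hypothetical helper `select_features` and a tiny stand-in frame built from preview values.

```python
import pandas as pd

def select_features(df, target, features):
    """Split df into (X, y), raising a readable error if any column is absent."""
    missing = [col for col in [target, *features] if col not in df.columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}. Available: {df.columns.tolist()}")
    return df[features], df[target]

# Stand-in frame using column names and values from the preview
demo = pd.DataFrame({
    "Rooms": [2, 2, 3],
    "Bathroom": [1.0, 1.0, 2.0],
    "Price": [1480000.0, 1035000.0, 1465000.0],
})

X_demo, y_demo = select_features(demo, target="Price", features=["Rooms", "Bathroom"])
print(X_demo.shape, y_demo.shape)

# A name from a different dataset is reported together with the valid options
try:
    select_features(demo, target="SalePrice", features=["Rooms"])
except KeyError as err:
    error_message = str(err)
    print(error_message)
```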

## 3. Split Data into Training and Validation Sets

In [None]:
# Split the data (train_test_split defaults to 75% training, 25% validation)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

print("Training set size:", train_X.shape[0])
print("Validation set size:", val_X.shape[0])
print("Total samples:", len(X))
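
When `test_size` is not passed, `train_test_split` holds out 25% of the rows. Passing `test_size=0.2` gives an 80/20 split instead. The sketch below shows both on synthetic stand-in arrays (100 rows, not the housing data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.arange(100)

# Default split: 25% of the rows are held out for validation
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=1)
default_sizes = (len(tr_X), len(va_X))

# Explicit 80/20 split
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
explicit_sizes = (len(tr_X), len(va_X))

print("default:", default_sizes)
print("test_size=0.2:", explicit_sizes)
```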

## 4. Model 1: Decision Tree Regressor (Baseline)

In [None]:
# Create and train a basic Decision Tree model
dt_model = DecisionTreeRegressor(random_state=1)
dt_model.fit(train_X, train_y)

# Make predictions
dt_predictions = dt_model.predict(val_X)

# Calculate Mean Absolute Error
dt_mae = mean_absolute_error(val_y, dt_predictions)
print("Decision Tree - Validation MAE: ${:,.0f}".format(dt_mae))

# Compare top predictions with actual values
print("\nTop 5 Predictions vs Actual Values:")
comparison_df = pd.DataFrame({'Predicted': dt_predictions[:5], 'Actual': val_y.values[:5]})
print(comparison_df)
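
Mean Absolute Error is the average size of the prediction error, ignoring sign: MAE = (1/n) Σ |actual_i - predicted_i|. The sketch below computes it by hand and checks the result against `mean_absolute_error`. The four prices are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Illustrative prices, not taken from the dataset
actual = np.array([300000.0, 450000.0, 500000.0, 650000.0])
predicted = np.array([320000.0, 430000.0, 560000.0, 650000.0])

# Absolute errors are 20000, 20000, 60000 and 0
manual_mae = np.mean(np.abs(actual - predicted))
sklearn_mae = mean_absolute_error(actual, predicted)

print(f"manual: ${manual_mae:,.0f}  sklearn: ${sklearn_mae:,.0f}")
```

Because MAE is in the same units as the target, it reads directly as "the typical prediction is off by this many dollars".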

## 5. Finding Optimal Tree Size (Hyperparameter Tuning)

In [None]:
# Test different max_leaf_nodes values
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]

# Calculate MAE for each tree size
scores = {}
for leaf_size in candidate_max_leaf_nodes:
    model = DecisionTreeRegressor(max_leaf_nodes=leaf_size, random_state=1)
    model.fit(train_X, train_y)
    preds = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds)
    scores[leaf_size] = mae
    print(f"max_leaf_nodes={leaf_size:4d} --> MAE: ${mae:,.0f}")

# Find the best tree size
best_tree_size = min(scores, key=scores.get)
print(f"\nBest max_leaf_nodes: {best_tree_size} with MAE: ${scores[best_tree_size]:,.0f}")
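
The loop above can also be wrapped in a small reusable function. A tree with very few leaves tends to underfit, while one with very many tends to overfit, so validation MAE is usually lowest somewhere in between. The sketch assumes a hypothetical helper `get_mae` and uses synthetic data from `make_regression` in place of the housing features.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree with the given leaf budget and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=1)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))

# Synthetic regression data standing in for the housing features
X_demo, y_demo = make_regression(n_samples=400, n_features=5, noise=20.0, random_state=1)
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=1)

candidates = [5, 25, 50, 100, 250]
demo_scores = {n: get_mae(n, tr_X, va_X, tr_y, va_y) for n in candidates}
demo_best = min(demo_scores, key=demo_scores.get)

print(f"best max_leaf_nodes on synthetic data: {demo_best}")
```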

## 6. Train Final Decision Tree with Optimal Size

In [None]:
# Train the final model using ALL data with optimal tree size
final_dt_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)
final_dt_model.fit(X, y)

print(f"Final Decision Tree model trained with max_leaf_nodes={best_tree_size}")
print(f"Total samples used: {len(X)}")

## 7. Model 2: Random Forest Regressor

In [None]:
# Create and train Random Forest model
rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(train_X, train_y)

# Make predictions
rf_predictions = rf_model.predict(val_X)

# Calculate Mean Absolute Error
rf_mae = mean_absolute_error(val_y, rf_predictions)
print("Random Forest - Validation MAE: ${:,.0f}".format(rf_mae))

# Compare top predictions
print("\nTop 5 Predictions vs Actual Values:")
rf_comparison_df = pd.DataFrame({'Predicted': rf_predictions[:5], 'Actual': val_y.values[:5]})
print(rf_comparison_df)
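
A `RandomForestRegressor` prediction is the average of the predictions made by its individual trees, which is why it is usually less noisy than any single tree. The sketch below checks this on synthetic data by averaging over `estimators_` (the fitted trees). The data and the choice of 20 trees are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=1)

forest = RandomForestRegressor(n_estimators=20, random_state=1)
forest.fit(X_demo, y_demo)

# Predict with each fitted tree, then average across trees
per_tree = np.array([tree.predict(X_demo[:5]) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)
forest_pred = forest.predict(X_demo[:5])

print("forest prediction equals tree average:", np.allclose(averaged, forest_pred))
```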

## 8. Model Comparison

In [None]:
# Compare all models
print("=" * 60)
print("MODEL COMPARISON")
print("=" * 60)
print(f"Decision Tree (Basic)          MAE: ${dt_mae:,.0f}")
print(f"Decision Tree (Optimized)      MAE: ${scores[best_tree_size]:,.0f}")
print(f"Random Forest                  MAE: ${rf_mae:,.0f}")
print("=" * 60)

# Calculate improvement
improvement = ((dt_mae - rf_mae) / dt_mae) * 100
print(f"\nRandom Forest improvement over baseline: {improvement:.1f}%")

## 9. Feature Importance (Random Forest)

In [None]:
# Get feature importance from Random Forest
feature_importance = pd.DataFrame({
    'Feature': feature_columns,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("Feature Importance (Random Forest):")
print(feature_importance)

# Visualize feature importance
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'], feature_importance['Importance'])
plt.xlabel('Importance')
plt.title('Feature Importance in Random Forest Model')
plt.tight_layout()
plt.show()
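
`feature_importances_` is impurity-based: it is computed from the training data and normalized to sum to 1. A common cross-check is permutation importance, which measures how much the validation score drops when one column is shuffled. The sketch below compares the two on synthetic data. The data and the parameter values are assumptions and are not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 4 features, only 2 of which carry signal
X_demo, y_demo = make_regression(
    n_samples=300, n_features=4, n_informative=2, noise=5.0, random_state=1
)
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=1)

forest = RandomForestRegressor(n_estimators=50, random_state=1).fit(tr_X, tr_y)

# Impurity-based importances from the fitted trees
impurity_importance = forest.feature_importances_

# Permutation importance measured on the held-out validation rows
perm = permutation_importance(forest, va_X, va_y, n_repeats=5, random_state=1)

for i in range(X_demo.shape[1]):
    print(
        f"feature_{i}: impurity={impurity_importance[i]:.3f}  "
        f"permutation={perm.importances_mean[i]:.3f}"
    )
```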

## 10. Sample Predictions on Validation Data

In [None]:
# Example: Make predictions on validation set using Random Forest
sample_predictions = rf_model.predict(val_X.head(10))

print("Sample Predictions on Validation Data (Random Forest):")
sample_df = pd.DataFrame({
    'Predicted Price': sample_predictions,
    'Actual Price': val_y.head(10).values,
    'Difference': sample_predictions - val_y.head(10).values,
    'Error %': ((sample_predictions - val_y.head(10).values) / val_y.head(10).values * 100).round(2)
})
print(sample_df)

## 11. Summary and Key Insights

In [None]:
print("\n" + "=" * 60)
print("MACHINE LEARNING PROJECT SUMMARY")
print("=" * 60)
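print("\nDataset: Kaggle House Prices")
print(f"Total Samples: {len(X)}")
print(f"Features Used: {len(feature_columns)}")
print(f"Target Variable: {y.name}")

print("\nModels Trained:")
print("  1. Decision Tree (Baseline)")
print(f"  2. Decision Tree (Optimized with max_leaf_nodes={best_tree_size})")
print(f"  3. Random Forest ({rf_model.n_estimators} trees)")

# Pick the best model from the measured validation MAEs instead of hard-coding it
validation_maes = {
    "Decision Tree (Baseline)": dt_mae,
    "Decision Tree (Optimized)": scores[best_tree_size],
    "Random Forest": rf_mae,
}
best_model_name = min(validation_maes, key=validation_maes.get)
print(f"\nBest Model: {best_model_name}")
print(f"Best Validation MAE: ${validation_maes[best_model_name]:,.0f}")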

print(f"\nTop 3 Most Important Features:")
for idx, row in feature_importance.head(3).iterrows():
    print(f"  {row['Feature']}: {row['Importance']:.4f}")

print("\n" + "=" * 60)

## 12. Model Evaluation Metrics

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
import math

# Calculate additional metrics for Random Forest
rmse = math.sqrt(mean_squared_error(val_y, rf_predictions))
r2 = r2_score(val_y, rf_predictions)

print("Random Forest Model Evaluation Metrics:")
print(f"  Mean Absolute Error (MAE):      ${rf_mae:,.0f}")
print(f"  Root Mean Squared Error (RMSE): ${rmse:,.0f}")
print(f"  R² Score:                       {r2:.4f}")

# Express MAE as a percentage of the mean validation price
mae_pct_of_mean = (rf_mae / val_y.mean()) * 100
print(f"  MAE as % of mean price:         {mae_pct_of_mean:.2f}%")
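
RMSE squares the errors before averaging, so large misses are penalized more heavily than in MAE: RMSE = sqrt((1/n) Σ (actual_i - predicted_i)²). R² compares the model's squared error with that of always predicting the mean: R² = 1 - SS_res / SS_tot. The sketch below computes both by hand and checks them against scikit-learn. The prices are illustrative and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative prices, not taken from the dataset
actual = np.array([300000.0, 450000.0, 500000.0, 650000.0])
predicted = np.array([320000.0, 430000.0, 560000.0, 650000.0])

residuals = actual - predicted

# RMSE: square root of the mean squared residual
manual_rmse = np.sqrt(np.mean(residuals ** 2))

# R^2: one minus residual sum of squares over total sum of squares
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

sklearn_rmse = np.sqrt(mean_squared_error(actual, predicted))
sklearn_r2 = r2_score(actual, predicted)

print(f"RMSE: ${manual_rmse:,.0f}  R2: {manual_r2:.4f}")
```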

## Conclusion

This notebook demonstrated the complete machine learning workflow:

1. **Data Loading & Exploration** - Understanding the dataset
2. **Feature Selection** - Choosing the columns used as model inputs
3. **Data Splitting** - Separating training and validation data
4. **Model Building** - Creating baseline and advanced models
5. **Hyperparameter Tuning** - Finding optimal parameters
6. **Model Comparison** - Evaluating different approaches
7. **Feature Analysis** - Understanding model decisions
8. **Evaluation** - Assessing model performance

On tabular data like this, a **Random Forest** usually outperforms a single Decision Tree because averaging many trees reduces variance. The comparison in Section 8 reports the actual margin on the validation set.

## Next Steps

- Try other algorithms (Gradient Boosting, XGBoost)
- Engineer new features from the existing columns
- Handle missing values and outliers
- Scale or normalize features if you switch to linear or distance-based models (tree-based models do not need it)
- Use cross-validation for a more robust evaluation
- Run a hyperparameter grid search instead of tuning one parameter by hand
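
Two of these steps, handling missing values and cross-validation, combine naturally in a scikit-learn `Pipeline`, which re-fits the imputer inside each fold so no information leaks from the validation rows. The sketch below is one possible setup and not the notebook's own method. It uses synthetic data with artificially removed entries, and the median strategy, 50 trees and 5 folds are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=1)

# Knock out roughly 10% of the entries to mimic missing values
rng = np.random.default_rng(1)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("model", RandomForestRegressor(n_estimators=50, random_state=1)),
])

# scikit-learn maximizes scores, so MAE is returned as a negative number
neg_mae = cross_val_score(
    pipeline, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
)
cv_mae = -neg_mae

print("MAE per fold:", cv_mae.round(1))
print("Mean MAE:", round(cv_mae.mean(), 1))
```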