# Predicting House Prices Using the Boston Housing Dataset


**Description:**

This Jupyter Notebook builds regression models from scratch (Linear Regression and Random Forest, with XGBoost covered conceptually) to predict house prices on the Boston Housing Dataset. It covers data preprocessing, custom model implementations, performance comparison using RMSE and R², and visualization of feature importance for the tree-based models.


## Dataset Information

**Dataset Used:** Boston Housing Dataset (fetched from [OpenML](https://www.openml.org/d/531))

**Source:**  
The dataset is retrieved from OpenML using the `fetch_openml()` function from `sklearn.datasets`. It is commonly used for regression problems in machine learning and includes various features of residential homes in Boston.

**Dataset Summary:**  
- **Instances:** 506  
- **Features:** 13 numerical predictors  
- **Target Variable:** `MEDV` (Median value of owner-occupied homes in $1000s)

**Features Description:**
- `CRIM`: Per capita crime rate by town  
- `ZN`: Proportion of residential land zoned for lots over 25,000 sq.ft.  
- `INDUS`: Proportion of non-retail business acres per town  
- `CHAS`: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)  
- `NOX`: Nitrogen oxides concentration (parts per 10 million)  
- `RM`: Average number of rooms per dwelling  
- `AGE`: Proportion of owner-occupied units built prior to 1940  
- `DIS`: Weighted distances to five Boston employment centres  
- `RAD`: Index of accessibility to radial highways  
- `TAX`: Full-value property-tax rate per $10,000  
- `PTRATIO`: Pupil-teacher ratio by town  
- `B`: 1000(Bk - 0.63)^2 where Bk is the proportion of Black residents by town  
- `LSTAT`: Percentage of the population classified as lower status  

## Data Preprocessing


In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns

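# Load the Boston Housing Dataset from OpenML
boston = fetch_openml(name='boston', version=1, as_frame=True)
df = boston.frame

# Cast every column to float: OpenML may deliver some predictors
# (for example CHAS and RAD) as categorical columns
X = df.drop('MEDV', axis=1).astype(float)
y = df['MEDV'].astype(float)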

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features to zero mean and unit variance; the scaler is fit on
# the training set only so no test-set statistics leak into training
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert scaled data back to pandas DataFrames, keeping column names and row indices (optional)
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)
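
The standardization step can be illustrated without scikit-learn. The following is a minimal sketch, assuming synthetic data in place of the housing features; the helper name `standardize` is hypothetical and not part of the notebook. It shows that the mean and standard deviation come from the training split only and are then reused for the test split.

```python
import numpy as np

def standardize(train, test):
    """Scale both splits using statistics computed on the training split only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)   # population standard deviation (ddof=0)
    std[std == 0] = 1.0       # guard against constant columns
    return (train - mean) / std, (test - mean) / std

# Synthetic stand-in for the housing features (not the real data)
rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(400, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

train_scaled, test_scaled = standardize(train, test)

# Training columns end up with mean ~0 and standard deviation ~1;
# test columns are only approximately centered because they reuse training statistics
print(np.allclose(train_scaled.mean(axis=0), 0.0))
print(np.allclose(train_scaled.std(axis=0), 1.0))
```

Scaling matters most for the gradient-descent model in the next section, where features on very different scales would force a much smaller learning rate.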


## Model Implementation (From Scratch)


### 1. Linear Regression (From Scratch)


In [4]:
class LinearRegressionScratch:
    """Linear regression trained with batch gradient descent on mean squared error."""

    def __init__(self):
        self.weights = None
        self.bias = None

    def fit(self, X, y, learning_rate=0.01, n_iterations=1000):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        for _ in range(n_iterations):
            y_predicted = np.dot(X, self.weights) + self.bias

            # Gradients of the loss (1 / (2 * n_samples)) * sum((y_predicted - y) ** 2)
            dw = (1 / n_samples) * np.dot(X.T, (y_predicted - y))
            db = (1 / n_samples) * np.sum(y_predicted - y)

            # Update weights and bias
            self.weights -= learning_rate * dw
            self.bias -= learning_rate * db

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias
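
One way to sanity-check a gradient-descent implementation like the one above is to compare it with the closed-form least-squares solution on data where the true relationship is known. The sketch below is an illustration under stated assumptions: the function `gradient_descent_fit`, the synthetic data, and the chosen learning rate are all hypothetical and not part of the notebook. It uses the same update rule as `LinearRegressionScratch.fit`.

```python
import numpy as np

def gradient_descent_fit(X, y, learning_rate=0.01, n_iterations=1000):
    """Minimize (1 / (2n)) * sum((Xw + b - y)^2) by batch gradient descent."""
    n_samples, n_features = X.shape
    weights, bias = np.zeros(n_features), 0.0
    for _ in range(n_iterations):
        residual = X @ weights + bias - y
        weights -= learning_rate * (X.T @ residual) / n_samples
        bias -= learning_rate * residual.mean()
    return weights, bias

# Synthetic, roughly standardized features with a known linear relationship
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
true_w, true_b = np.array([2.0, -1.0, 0.5]), 3.0
y = X @ true_w + true_b + rng.normal(scale=0.1, size=200)

gd_w, gd_b = gradient_descent_fit(X, y, learning_rate=0.1, n_iterations=2000)

# Closed-form least squares on the same data, with a column of ones for the bias
X_aug = np.column_stack([X, np.ones(len(X))])
ls_solution, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

print("weights match:", np.allclose(gd_w, ls_solution[:-1], atol=1e-4))
print("bias matches:", np.isclose(gd_b, ls_solution[-1], atol=1e-4))
```

If the two solutions disagree on well-scaled data, the usual suspects are a learning rate that is too large (divergence) or too few iterations (incomplete convergence).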


### 2. Random Forest (Conceptual Outline & Simplified Implementation)


A full Random Forest implementation builds many decision trees, each trained on a bootstrap sample of the rows (drawn with replacement) and restricted to a random subset of the features. For regression, the final prediction is the average of the individual trees' predictions.

Because a complete implementation is lengthy, the code below is a simplified version intended to illustrate the concept.
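
As a warm-up for the classes that follow, the two core mechanics (bootstrap sampling and prediction averaging) can be sketched in isolation. The per-tree predictions here are made-up numbers used only for illustration, not output from any model in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000

# One bootstrap sample: draw n_samples row indices with replacement
indices = rng.integers(0, n_samples, size=n_samples)
unique_fraction = np.unique(indices).size / n_samples

# Expected fraction of distinct rows is 1 - (1 - 1/n)^n, about 0.632 for large n
expected_fraction = 1 - (1 - 1 / n_samples) ** n_samples
print(f"distinct rows: {unique_fraction:.3f} (expected about {expected_fraction:.3f})")

# Averaging: each row holds one hypothetical tree's predictions for three houses
tree_predictions = np.array([
    [21.0, 30.5, 15.0],
    [23.0, 29.5, 17.0],
    [22.0, 30.0, 16.0],
])
forest_prediction = tree_predictions.mean(axis=0)
print(forest_prediction)  # [22. 30. 16.]
```

Because each tree sees a different bootstrap sample, the trees make partly independent errors, and averaging them reduces the variance of the combined prediction.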


In [5]:
class DecisionTreeRegressorScratch:
    def __init__(self, min_samples_split=2, max_depth=None, random_feature_subset=None):
        self.min_samples_split = min_samples_split
        self.max_depth = max_depth
        self.random_feature_subset = random_feature_subset
        self.tree = None

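    def _split(self, X, y, feature_indices):
        # One straightforward way to fill in the split search: try every candidate
        # feature and threshold, and keep the split with the lowest summed squared
        # error across the two children.
        best_split = {}
        best_sse = np.inf
        for feature_index in feature_indices:
            values = np.unique(X[:, feature_index])
            # Every unique value except the largest is a candidate threshold,
            # which guarantees that both children are non-empty
            for threshold in values[:-1]:
                left_mask = X[:, feature_index] <= threshold
                y_left, y_right = y[left_mask], y[~left_mask]
                sse = np.sum((y_left - y_left.mean()) ** 2) + np.sum((y_right - y_right.mean()) ** 2)
                if sse < best_sse:
                    best_sse = sse
                    best_split = {'feature_index': feature_index, 'threshold': threshold,
                                  'left_mask': left_mask}
        return best_split

    def _grow_tree(self, X, y, depth=0):
        n_samples, n_features = X.shape
        n_labels = len(np.unique(y))

        # Stop and return a leaf (the mean target) once a stopping rule applies
        if (self.max_depth is not None and depth >= self.max_depth) or \
           n_labels == 1 or n_samples < self.min_samples_split:
            return float(np.mean(y))

        # Optionally restrict the search to a random subset of features at this node
        if self.random_feature_subset is not None:
            feature_indices = np.random.choice(n_features, self.random_feature_subset, replace=False)
        else:
            feature_indices = np.arange(n_features)

        best_split = self._split(X, y, feature_indices)
        if not best_split:
            return float(np.mean(y))

        # Recurse into the two children defined by the chosen split
        left_mask = best_split['left_mask']
        return {'feature_index': best_split['feature_index'],
                'threshold': best_split['threshold'],
                'left': self._grow_tree(X[left_mask], y[left_mask], depth + 1),
                'right': self._grow_tree(X[~left_mask], y[~left_mask], depth + 1)}

    def fit(self, X, y):
        # Convert to NumPy arrays so positional and boolean indexing behave as expected
        self.tree = self._grow_tree(np.asarray(X, dtype=float), np.asarray(y, dtype=float))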

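    def predict(self, X):
        # np.asarray makes row iteration work for DataFrames as well as arrays
        return np.array([self._traverse_tree(x, self.tree) for x in np.asarray(X)])

    def _traverse_tree(self, x, node):
        # Internal nodes are dicts; anything else is a leaf value
        if not isinstance(node, dict):
            return node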
        if x[node['feature_index']] <= node['threshold']:
            return self._traverse_tree(x, node['left'])
        return self._traverse_tree(x, node['right'])

class RandomForestRegressorScratch:
    def __init__(self, n_estimators=10, min_samples_split=2, max_depth=None, random_feature_subset='sqrt'):
        self.n_estimators = n_estimators
        self.min_samples_split = min_samples_split
        self.max_depth = max_depth
        self.random_feature_subset = random_feature_subset
        self.trees = []

    def fit(self, X, y):
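        # Work on NumPy arrays: indexing a pandas Series with y[indices] would
        # select by label rather than by position
        X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
        n_samples, n_features = X.shape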
        for _ in range(self.n_estimators):
            indices = np.random.choice(n_samples, n_samples, replace=True)
            X_boot, y_boot = X[indices], y[indices]

            if self.random_feature_subset == 'sqrt':
                n_feat_subset = int(np.sqrt(n_features))
            elif isinstance(self.random_feature_subset, int):
                n_feat_subset = self.random_feature_subset
            else:
                n_feat_subset = n_features

            tree = DecisionTreeRegressorScratch(min_samples_split=self.min_samples_split,
                                                max_depth=self.max_depth,
                                                random_feature_subset=n_feat_subset)
            tree.fit(X_boot, y_boot)
            self.trees.append(tree)

    def predict(self, X):
        predictions = np.array([tree.predict(X) for tree in self.trees])
        return np.mean(predictions, axis=0)


## Performance Comparison


In [6]:
from sklearn.metrics import mean_squared_error, r2_score

# Initialize and train the Linear Regression model
lr_scratch = LinearRegressionScratch()
lr_scratch.fit(X_train_scaled, y_train, learning_rate=0.01, n_iterations=1500)
y_pred_lr_scratch = lr_scratch.predict(X_test_scaled)

rmse_lr_scratch = np.sqrt(mean_squared_error(y_test, y_pred_lr_scratch))
r2_lr_scratch = r2_score(y_test, y_pred_lr_scratch)

print("Linear Regression (From Scratch) Performance:")
print(f"RMSE: {rmse_lr_scratch:.4f}")
print(f"R²: {r2_lr_scratch:.4f}")


Linear Regression (From Scratch) Performance:
RMSE: 4.9917
R²: 0.6602
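
For reference, both metrics can be computed directly with NumPy. The sketch below uses a tiny made-up example rather than the notebook's predictions; the function names `rmse` and `r2` are hypothetical helpers, not part of scikit-learn.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up targets and predictions for illustration
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

print(f"RMSE: {rmse(y_true, y_pred):.4f}")  # RMSE: 0.7071
print(f"R²: {r2(y_true, y_pred):.4f}")      # R²: 0.9000
```

Since `MEDV` is expressed in $1000s, the RMSE reported above corresponds to a typical prediction error of roughly $5,000, and the R² indicates that the model accounts for about 66% of the variance in test-set prices.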
