#### 1. Loading and Preprocessing

In [13]:
from sklearn.datasets import fetch_california_housing

# Load the dataset
housing = fetch_california_housing()

# Optional: view feature names and description
print(housing.feature_names)
print(housing.DESCR)


['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousan

In [19]:
import pandas as pd

# Convert to DataFrame
df = pd.DataFrame(housing.data, columns=housing.feature_names)

# Add the target variable to the DataFrame
df['MedHouseVal'] = housing.target

print(df.head())


   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   

   Longitude  MedHouseVal  
0    -122.23        4.526  
1    -122.22        3.585  
2    -122.24        3.521  
3    -122.25        3.413  
4    -122.25        3.422  


In [11]:
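# .count() returns the row count per column, not the number of missing values;
# use df.isnull().sum() (next cell) to count missing values.
row_counts = df.isnull().count()
print(row_counts)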

MedInc         20640
HouseAge       20640
AveRooms       20640
AveBedrms      20640
Population     20640
AveOccup       20640
Latitude       20640
Longitude      20640
MedHouseVal    20640
dtype: int64


In [21]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Check for missing values
missing_counts = df.isnull().sum()
print("Missing values per column:\n", missing_counts)

# Handle missing values if any (not expected, but included for completeness)
df.fillna(df.mean(), inplace=True)

# Separate features and target
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']

# Standardize the feature data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Convert scaled data back to DataFrame (optional)
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)

# Show a preview of the scaled features
print(X_scaled_df.head())


Missing values per column:
 MedInc         0
HouseAge       0
AveRooms       0
AveBedrms      0
Population     0
AveOccup       0
Latitude       0
Longitude      0
MedHouseVal    0
dtype: int64
     MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  2.344766  0.982143  0.628559  -0.153758   -0.974429 -0.049597  1.052548   
1  2.332238 -0.607019  0.327041  -0.263336    0.861439 -0.092512  1.043185   
2  1.782699  1.856182  1.155620  -0.049016   -0.820777 -0.025843  1.038503   
3  0.932968  1.856182  0.156966  -0.049833   -0.766028 -0.050329  1.038503   
4 -0.012881  1.856182  0.344711  -0.032906   -0.759847 -0.085616  1.038503   

   Longitude  
0  -1.327835  
1  -1.322844  
2  -1.332827  
3  -1.337818  
4  -1.337818  


##### 1. Conversion to a Pandas DataFrame
A DataFrame provides easier handling for data manipulation, visualization, and inspection, such as checking for missing values or generating summary statistics.

##### 2. Handling Missing Values
Many machine learning models, such as Linear Regression and SVR, cannot handle missing values. The California Housing dataset has none, as the `df.isnull().sum()` output above confirms, so the `fillna` call is a no-op here and is kept only as a safeguard. Filling missing values with the column mean is a common, simple imputation method for numerical data.

##### 3. Feature Scaling (Standardization)
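Standardization rescales each feature to zero mean and unit variance, so features measured on large scales (e.g., Population) do not dominate features measured on small scales (e.g., AveBedrms). This matters most for scale-sensitive models such as SVR; tree-based models are largely unaffected by it.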
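
As a rough illustration of what standardization computes, the sketch below applies the z-score formula `(x - mean) / std` to the first five `Population` values from the preview above. It uses only the standard library, and the helper name `standardize` is made up for this example. The resulting values differ from the notebook's scaled output because only five rows are used here instead of the full dataset.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: subtract the mean, divide by the population std."""
    mu = fmean(values)
    sigma = pstdev(values)  # population std (ddof=0), the convention StandardScaler uses
    if sigma == 0:
        sigma = 1.0  # constant column: avoid division by zero
    return [(v - mu) / sigma for v in values]

# First five Population values from the df.head() preview
population = [322.0, 2401.0, 496.0, 558.0, 565.0]
population_z = standardize(population)
print([round(z, 3) for z in population_z])
```

After the transformation the column has mean 0 and standard deviation 1, regardless of its original units.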

These preprocessing steps help ensure the dataset is clean, uniform, and suitable for training accurate and reliable machine learning models.
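
One caveat: the scaler above is fit on the full dataset, and the train/test split in Section 3 is applied to `X_scaled` afterwards, so the test rows influence the scaling statistics. A possible alternative, sketched below under the assumption that scikit-learn is installed, is to put the scaler inside a `Pipeline` so that it is fit on the training split only. The data here is a synthetic stand-in from `make_regression` (so the sketch runs offline), and the names `X_demo`, `X_tr`, and `pipe` are illustrative, not part of the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features (assumption: 8 numeric columns)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

# Split first, so the test rows never reach the scaler's fit step
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# fit() standardizes with training statistics; score() reuses them on the test rows
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)
print("Test R²:", round(pipe.score(X_te, y_te), 3))
```

The same pattern should work with any of the regressors used later in place of `LinearRegression()`.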

#### 2. Regression Algorithm Implementation

#### 1. Linear Regression

Linear Regression models the relationship between the independent variables and the target by fitting a straight line (or hyperplane in higher dimensions) that minimizes the sum of squared differences between predicted and actual values.

##### Suitability:
    a. Works well when the relationship between features and target is approximately linear.
    b. Fast and interpretable baseline model.
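
To make "minimizes the sum of squared differences" concrete, here is a minimal single-feature sketch using the closed-form least-squares solution. The function name `fit_simple_ols` and the toy data are assumptions for illustration; `LinearRegression` solves the multi-feature version of the same problem.

```python
from statistics import fmean

def fit_simple_ols(x, y):
    """Return (slope, intercept) minimizing the sum of squared residuals."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Toy data lying exactly on y = 2x + 1 (made up for illustration)
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_simple_ols(x, y)
print(slope, intercept)  # → 2.0 1.0
```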

In [60]:
from sklearn.linear_model import LinearRegression

lr_model = LinearRegression()
lr_model.fit(X_scaled, y)
lr_preds = lr_model.predict(X_scaled)

print("SCALED VALUES: ", X_scaled) 
print("\n")
print("TARGET VALUES: ", y)
print("\n")
print("PREDICTED VALUES: ", lr_preds)

SCALED VALUES:  [[ 2.34476576  0.98214266  0.62855945 ... -0.04959654  1.05254828
  -1.32783522]
 [ 2.33223796 -0.60701891  0.32704136 ... -0.09251223  1.04318455
  -1.32284391]
 [ 1.7826994   1.85618152  1.15562047 ... -0.02584253  1.03850269
  -1.33282653]
 ...
 [-1.14259331 -0.92485123 -0.09031802 ... -0.0717345   1.77823747
  -0.8237132 ]
 [-1.05458292 -0.84539315 -0.04021111 ... -0.09122515  1.77823747
  -0.87362627]
 [-0.78012947 -1.00430931 -0.07044252 ... -0.04368215  1.75014627
  -0.83369581]]


TARGET VALUES:  0        4.526
1        3.585
2        3.521
3        3.413
4        3.422
         ...  
20635    0.781
20636    0.771
20637    0.923
20638    0.847
20639    0.894
Name: MedHouseVal, Length: 20640, dtype: float64


PREDICTED VALUES:  [4.13164983 3.97660644 3.67657094 ... 0.17125141 0.31910524 0.51580363]


#### 2. Decision Tree Regressor
Decision Trees split the data into regions by asking a series of "if-else" questions on feature values, minimizing the error in each leaf.

##### Suitability:
    a. Captures non-linear relationships and interactions between features.
    b. Easy to interpret, though prone to overfitting.
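
The "if-else" questions are chosen by searching for the threshold that leaves the smallest squared error in the two resulting groups. The sketch below shows that search for a single feature and a single split; the helper names (`sse`, `best_split`) and the toy data are hypothetical, and a real tree repeats this search over every feature and recursively within each region.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, total_sse) for the best single split on feature x."""
    pairs = sorted(zip(x, y))
    best = (None, sse(y))  # baseline: no split at all
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [t for f, t in pairs if f <= threshold]
        right = [t for f, t in pairs if f > threshold]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best

# Toy data: low incomes map to values near 1.0, high incomes to values near 4.0
income = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
value = [1.0, 1.1, 0.9, 4.0, 4.1, 3.9]
threshold, error = best_split(income, value)
print(threshold, round(error, 4))  # → 5.0 0.04
```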

In [58]:
from sklearn.tree import DecisionTreeRegressor

dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_scaled, y)
dt_preds = dt_model.predict(X_scaled)

# X_scaled and y are unchanged from the Linear Regression cell, so only the predictions are printed.
# These are in-sample predictions: an unconstrained tree reproduces the training targets exactly.
print("PREDICTED VALUES: ", dt_preds)

PREDICTED VALUES:  [4.526 3.585 3.521 ... 0.923 0.847 0.894]


#### 3. Random Forest Regressor
An ensemble of Decision Trees trained on different bootstrap samples. Their predictions are averaged to reduce overfitting and improve generalization.

##### Suitability:
    a. Handles non-linear relationships, interactions, and outliers well.
    b. Provides better performance and generalization than a single Decision Tree.
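
The bootstrap-and-average idea can be sketched without any library. In the example below, a 1-nearest-neighbour predictor stands in for a fully grown tree (both memorize their training sample), so this illustrates bagging rather than a real random forest. All names and the toy data are assumptions made for the example.

```python
import random

def fit_1nn(x, y):
    """High-variance base learner: predict the target of the nearest training point."""
    data = list(zip(x, y))
    def predict(q):
        return min(data, key=lambda p: abs(p[0] - q))[1]
    return predict

def fit_bagged(x, y, n_estimators=50, seed=42):
    """Train base learners on bootstrap samples and average their predictions."""
    rng = random.Random(seed)
    n = len(x)
    learners = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        learners.append(fit_1nn([x[i] for i in idx], [y[i] for i in idx]))
    def predict(q):
        return sum(learner(q) for learner in learners) / len(learners)
    return predict

# Toy data: roughly y = x with some hand-written noise
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
y = [0.3, 0.8, 2.4, 2.9, 4.2, 4.7, 6.1, 7.3, 7.8, 9.2]

single = fit_1nn(x, y)
bagged = fit_bagged(x, y)
print(single(4.5), round(bagged(4.5), 3))
```

The single learner returns one memorized target, while the bagged prediction is an average over learners that each saw a different resample of the data.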

In [63]:
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_scaled, y)
rf_preds = rf_model.predict(X_scaled)

# X_scaled and y are unchanged from the Linear Regression cell, so only the predictions are printed.
print("PREDICTED VALUES: ", rf_preds)

PREDICTED VALUES:  [4.4444209 3.8577311 3.7173804 ... 0.88593   0.82361   0.91145  ]


#### 4. Gradient Boosting Regressor
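Builds trees sequentially: each new tree is fit to the residual errors of the current ensemble (for squared-error loss, the residuals are the negative gradient of the loss), and its prediction, scaled by the learning rate, is added to the running total.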

##### Suitability:
    a. Very powerful for tabular data.
    b. Captures complex patterns and generally offers strong predictive performance.
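
A minimal sketch of this sequential procedure for squared-error loss is shown below, using depth-1 trees (stumps) on a single feature. The function names and toy data are hypothetical; scikit-learn's `GradientBoostingRegressor` uses deeper trees and many more options, so this is only meant to show the residual-fitting loop.

```python
def fit_stump(x, residuals):
    """Fit a depth-1 regression tree: one threshold, two leaf means."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda q: lm if q <= t else rm

def fit_boosted(x, y, n_estimators=100, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(y) / len(y)
    stumps = []
    preds = [base] * len(y)
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, preds)]  # negative gradient of squared error
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + learning_rate * stump(xi) for xi, pi in zip(x, preds)]
    return lambda q: base + learning_rate * sum(s(q) for s in stumps)

def mse(model, x, y):
    return sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [0.1, 0.4, 0.9, 1.6, 2.5, 3.6, 4.9, 6.4]  # toy non-linear target (x**2 / 10)

weak = fit_boosted(x, y, n_estimators=1)
strong = fit_boosted(x, y, n_estimators=100)
print(round(mse(weak, x, y), 4), round(mse(strong, x, y), 4))
```

The training error shrinks as more stumps are added, which is also why boosting needs a held-out set to detect overfitting.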

In [66]:
from sklearn.ensemble import GradientBoostingRegressor

gb_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
gb_model.fit(X_scaled, y)
gb_preds = gb_model.predict(X_scaled)

# X_scaled and y are unchanged from the Linear Regression cell, so only the predictions are printed.
print("PREDICTED VALUES: ", gb_preds)

PREDICTED VALUES:  [4.26432728 3.87864519 3.92074556 ... 0.63664692 0.74759279 0.7994969 ]


#### 5. Support Vector Regressor (SVR)
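SVR fits a function that keeps as many training points as possible within a tolerance band (epsilon) around its predictions: errors smaller than epsilon are ignored, larger ones are penalized. The kernel trick (here, the RBF kernel) lets it model non-linear relationships.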

##### Suitability:
    a. Effective in high-dimensional spaces.
    b. Good for datasets where the number of features is not too large compared to the number of samples.
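
The "margin of tolerance" corresponds to the epsilon-insensitive loss, which is zero inside the band and grows linearly outside it. The sketch below computes that loss for a few made-up predictions; the function name is hypothetical, and `epsilon=0.1` matches the default of scikit-learn's `SVR`.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss: 0 inside the epsilon band, linear in the excess outside it."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]

# Made-up targets and predictions for illustration
y_true = [1.0, 1.0, 1.0]
y_pred = [1.05, 1.5, 0.0]
losses = epsilon_insensitive_loss(y_true, y_pred)
print([round(v, 3) for v in losses])  # → [0.0, 0.4, 0.9]
```

The first prediction is off by 0.05, which is inside the band, so it contributes nothing to the loss.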

In [69]:
from sklearn.svm import SVR

svr_model = SVR(kernel='rbf')
svr_model.fit(X_scaled, y)
svr_preds = svr_model.predict(X_scaled)

# X_scaled and y are unchanged from the Linear Regression cell, so only the predictions are printed.
print("PREDICTED VALUES: ", svr_preds)

PREDICTED VALUES:  [4.40193096 4.3042361  4.30026265 ... 0.92076026 0.9246217  0.99824488]


#### 3. Model Evaluation and Comparison

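Mean Squared Error (MSE) is the average of the squared errors, so it penalizes large errors more heavily (lower is better).

Mean Absolute Error (MAE) is the average of the absolute errors, expressed in the same units as the target (lower is better).

R-squared (R²) is the proportion of the target's variance that the model explains (closer to 1 is better).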
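
For reference, the three metrics can be computed by hand as in the sketch below. The function name `regression_metrics` and the toy values are assumptions for illustration; the notebook itself relies on the `sklearn.metrics` functions.

```python
def regression_metrics(y_true, y_pred):
    """Return (MSE, MAE, R²) for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e ** 2 for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)                 # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    return mse, mae, r2

# Toy values (made up): two predictions are off by 1, two are exact
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
mse, mae, r2 = regression_metrics(y_true, y_pred)
print(mse, mae, r2)  # → 0.5 0.5 0.9
```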

In [82]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR

# Split data into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Dictionary of models
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42),
    'SVR': SVR(kernel='rbf')
}

# Evaluate models
results = []

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    
    mse = mean_squared_error(y_test, preds)
    mae = mean_absolute_error(y_test, preds)
    r2 = r2_score(y_test, preds)
    
    results.append((name, mse, mae, r2))

# Display results
results_df = pd.DataFrame(results, columns=['Model', 'MSE', 'MAE', 'R² Score'])
print(results_df.sort_values(by='R² Score', ascending=False))


               Model       MSE       MAE  R² Score
2      Random Forest  0.255498  0.327613  0.805024
3  Gradient Boosting  0.293999  0.371650  0.775643
4                SVR  0.355198  0.397763  0.728941
1      Decision Tree  0.494272  0.453784  0.622811
0  Linear Regression  0.555892  0.533200  0.575788


#### Best-Performing Algorithm: Random Forest Regressor

**<u>Justification:</u>**

    - Highest R² score (~0.81) → best at explaining the variance in the test data.

    - Lowest MSE (~0.26) and MAE (~0.33) → more accurate predictions with fewer large errors.

    - Captures non-linearities and interactions between features that linear models miss.

    - Averaging many trees reduces the variance of a single Decision Tree (R² ~0.62 on its own).

    - Gradient Boosting is a close second (R² ~0.78) with the settings used here; neither model was tuned.

#### Worst-Performing Algorithm: Linear Regression
**<u>Reasoning:</u>**

    - Lowest R² score (~0.58) → fails to capture much of the data's variance.

    - Highest error metrics (MSE ~0.56, MAE ~0.53) of the five models.

    - Assumes a linear relationship, which oversimplifies the complex patterns in housing data (e.g., income, rooms, location all interact in non-linear ways).

#### Summary
Best Model	---> Random Forest Regressor <br>
Why?        ---> Lowest test error, handles complex data relationships <br>
Worst Model ---> Linear Regression <br>
Why?	    ---> Too simplistic, underfits the data <br>