In [1]:
# Load the California Housing dataset using the fetch_california_housing function from sklearn.
# Convert the dataset into a pandas DataFrame for easier handling.

from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np
# Load dataset
data= fetch_california_housing()

# Convert to DataFrame
x = pd.DataFrame(data.data, columns=data.feature_names)

# Store the target (median house value, in units of $100,000) as a separate Series
y = pd.Series(data.target,name="Target")
y


Index,Target
0,4.526
1,3.585
2,3.521
3,3.413
4,3.422
...,...
20635,0.781
20636,0.771
20637,0.923
20638,0.847


In [2]:
# Check for missing values
null_values=x.isnull().sum()
null_values

Feature,Null count
MedInc,0
HouseAge,0
AveRooms,0
AveBedrms,0
Population,0
AveOccup,0
Latitude,0
Longitude,0


There are no null values in any feature column.


In [3]:
# Check skewness to flag features with heavy tails or likely outliers
skewness=x.skew()
skewness

Feature,Skewness
MedInc,1.646657
HouseAge,0.060331
AveRooms,20.697869
AveBedrms,31.316956
Population,4.935858
AveOccup,97.639561
Latitude,0.465953
Longitude,-0.297801


MedInc, AveRooms, AveBedrms, Population and AveOccup are strongly right-skewed (skewness above 1), which suggests outliers.
So cap them using the IQR method.



In [4]:
# Define a function that caps outliers using the IQR rule
def cap_outliers_iqr(series):
    """Caps outliers in a pandas Series using IQR."""
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower_whisker = Q1 - 1.5 * IQR
    upper_whisker = Q3 + 1.5 * IQR
    return np.clip(series, lower_whisker, upper_whisker)

# Columns to transform
columns_to_transform = ['MedInc', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup']

# Apply the capping function to the specified columns
x[columns_to_transform] = x[columns_to_transform].apply(cap_outliers_iqr)
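
To see what the capping does, here is a small sketch on a made-up series. The toy values are an assumption for illustration, not data from the dataset, and the function is repeated (using `Series.clip`) only so the snippet runs on its own.

```python
import pandas as pd

def cap_outliers_iqr(series):
    """Caps values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] to those bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Hypothetical toy data: four ordinary values and one extreme value
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

# Q1 = 2, Q3 = 4, IQR = 2, so the bounds are [-1, 7]
capped = cap_outliers_iqr(toy)
print(capped.tolist())
```

Only the extreme value is moved (to the upper bound); values inside the whiskers are left untouched.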

In [5]:
skew=x.skew()
skew

Feature,Skewness
MedInc,0.735618
HouseAge,0.060331
AveRooms,0.348567
AveBedrms,0.462645
Population,0.842247
AveOccup,0.510453
Latitude,0.465953
Longitude,-0.297801


After capping, the skewness of every transformed column is below 1.

In [6]:
#feature scaling
from sklearn.preprocessing import StandardScaler
# Create scaler object
scaler = StandardScaler()
# Fit and transform features (x)
x_scaled = scaler.fit_transform(x)
x_scaled

array([[ 2.54100555,  0.98214266,  1.34766453, ..., -0.49787057,
         1.05254828, -1.32783522],
       [ 2.54100555, -0.60701891,  0.74902704, ..., -1.14278053,
         1.04318455, -1.32284391],
       [ 2.08515552,  1.85618152,  2.39409751, ..., -0.14091034,
         1.03850269, -1.33282653],
       ...,
       [-1.26748763, -0.92485123, -0.07960306, ..., -0.83054596,
         1.77823747, -0.8237132 ],
       [-1.16661997, -0.84539315,  0.01987977, ..., -1.12343912,
         1.77823747, -0.87362627],
       [-0.85207213, -1.00430931, -0.040142  , ..., -0.40899298,
         1.75014627, -0.83369581]])

 **Explanation of Preprocessing Steps**
1) Converted to DataFrame: Easier data handling.

2) Checked for Missing Values: Ensures data quality.

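3) Outlier Detection and Capping: Limits extreme values to reduce the influence of anomalous data, which can improve model accuracy and stability.

4) Feature Scaling: Standardization rescales each feature to zero mean and unit variance so that no feature dominates because of its units. This matters most for scale-sensitive models such as SVR; tree-based models are largely unaffected by it.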
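
As a rough illustration of what standardization does, the sketch below applies z = (x - mean) / std column by column with NumPy. The small matrix is a made-up example; `StandardScaler` applies the same formula using the population standard deviation.

```python
import numpy as np

# Hypothetical features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)          # population std (ddof=0)
X_std = (X - mean) / std     # each column now has mean 0 and std 1

print(X_std)
```

After scaling, both columns hold identical values, because they differed only by a constant factor.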

**Regression Algorithm Implementation**

In [7]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x_scaled,y)
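
The call above uses the default `test_size` (0.25) and no `random_state`, so the metrics reported later will vary from run to run. One possible variation, sketched here on synthetic stand-in data (the `*_demo` names and values are assumptions, not part of the notebook), fixes the seed and fits the scaler on the training split only, so that test-set statistics do not leak into preprocessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical), used only to keep the sketch self-contained
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

# A fixed random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s = scaler_demo.transform(X_tr)
X_te_s = scaler_demo.transform(X_te)

print(X_tr_s.shape, X_te_s.shape)
```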

1) Linear Regression

In [8]:
#import modules
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train, y_train)
#making predictions
linear_pred = model.predict(x_test)
linear_pred

array([0.37025891, 2.71088519, 2.65143191, ..., 1.38970092, 0.93367935,
       1.38482399])

Linear Regression is a statistical method used for predicting a continuous target variable based on one or more predictor variables. It assumes a linear relationship between the features and the target, meaning that a change in a feature results in a proportional change in the target variable.

Linear Regression might be suitable for the California Housing dataset for the following reasons:

**Interpretability**: Linear Regression models are easy to interpret. The coefficients provide insights into the relationship between each feature and the target variable.

**Simplicity:** Linear Regression is a relatively simple algorithm to implement and understand, making it a good starting point for regression tasks.

**Potential Linear Relationships**: While the relationship between housing features and prices might not be perfectly linear, there could be some underlying linear trends that the model can capture.
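
To make the interpretability point concrete, here is a minimal least-squares sketch on made-up data generated from y = 2x + 1. It uses `np.linalg.lstsq` rather than `LinearRegression` purely to show the underlying fit; the toy data is an assumption.

```python
import numpy as np

# Hypothetical data that follows y = 2x + 1 exactly
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Design matrix with a column of ones for the intercept
A = np.column_stack([x, np.ones_like(x)])

# Ordinary least squares: minimize ||A @ coef - y||^2
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
slope, intercept = coef

print(slope, intercept)
```

The recovered slope is the change in the target per unit change in the feature, which is how each coefficient of a fitted linear model is read.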

2) Random Forest Regression

In [9]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(x_train, y_train)
#making predictions
rf_pred = model.predict(x_test)
rf_pred

array([0.81145  , 2.3696201, 3.39753  , ..., 1.49172  , 1.30761  ,
       1.0067499])

**Explanation**: An ensemble of decision trees, where each tree is trained on a bootstrap sample of the data and splits can be restricted to a random subset of the features. The final prediction is the average of the predictions from all trees.

Suitability for California Housing : Can handle non-linear relationships and interactions between features, making it suitable for the potentially complex housing market data.
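
The averaging idea can be sketched by hand: train several trees on bootstrap samples and average their predictions. This is plain bagging on synthetic data (all names and values here are assumptions), without per-split feature subsampling, so it is a simplification of what `RandomForestRegressor` does.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature data (hypothetical): a noisy sine curve
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 6, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.2, size=200)

trees = []
for seed in range(25):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    tree = DecisionTreeRegressor(max_depth=4, random_state=seed)
    trees.append(tree.fit(X_demo[idx], y_demo[idx]))

# Predict with every tree, then average across trees
X_new = np.array([[1.5], [4.5]])
per_tree = np.stack([t.predict(X_new) for t in trees])  # shape (25, 2)
bagged_pred = per_tree.mean(axis=0)

print(bagged_pred)
```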

3) Decision Tree Regression

In [10]:
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()
model.fit(x_train, y_train)
#making predictions
dt_pred = model.predict(x_test)
dt_pred

array([1.042, 1.775, 3.296, ..., 1.72 , 1.231, 1.292])

Explanation: Builds a tree-like structure to make predictions based on feature thresholds. It recursively splits the data into subsets based on feature values to minimize prediction errors.

Suitability for California Housing: Can handle non-linear relationships and interactions between features, making it suitable for the potentially complex housing market data.
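
A single split can be sketched as a search over thresholds on one feature, choosing the one that minimizes the summed squared error of the two resulting groups. The helper `best_split` and the toy data are hypothetical; a full tree repeats this search recursively over all features.

```python
import numpy as np

def best_split(x, y):
    """Returns (threshold, sse) of the single split that minimizes squared error."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_thr, best_sse = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no threshold fits between equal values
        left, right = y[:i], y[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            # Place the threshold midway between the two neighbouring values
            best_thr, best_sse = (x[i - 1] + x[i]) / 2, sse
    return best_thr, best_sse

# Hypothetical data with an obvious jump between x = 3 and x = 10
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])

thr, sse = best_split(x, y)
print(thr, sse)
```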


4) Gradient Boosting Regressor



In [11]:
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor()
model.fit(x_train, y_train)
#making predictions
gb_pred = model.predict(x_test)
gb_pred

array([0.96195248, 2.3146233 , 3.04811747, ..., 1.5335493 , 1.10260754,
       1.28663209])

**Explanation**: Combines weak learners (typically decision trees) sequentially, where each learner focuses on correcting the errors of the previous learners. This iterative process improves prediction accuracy.

Suitability for California Housing: Known for high prediction accuracy on tabular data and able to capture non-linear relationships, making it a potentially strong candidate for this dataset.
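
The sequential error-correcting idea can be sketched with one-split stumps fitted to residuals under squared-error loss. Everything below (the helpers `fit_stump` and `predict_stump`, the synthetic data, the learning rate of 0.1 and the 50 rounds) is an assumption for illustration, not the internals of `GradientBoostingRegressor`.

```python
import numpy as np

def fit_stump(x, r):
    """Fits a one-split stump to residuals r; returns (threshold, left_value, right_value)."""
    best = (np.inf, None, r.mean(), r.mean())
    for thr in np.unique(x)[:-1]:          # skip the largest value so the right side is non-empty
        left, right = r[x <= thr], r[x > thr]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    thr, left_val, right_val = stump
    return np.where(x <= thr, left_val, right_val)

# Synthetic data (hypothetical): a noisy sine curve
rng = np.random.default_rng(0)
x = np.linspace(0, 6, 80)
y = np.sin(x) + rng.normal(0, 0.1, size=x.size)

# Start from the mean, then repeatedly fit a stump to the current residuals
pred = np.full_like(y, y.mean())
baseline_mse = np.mean((y - pred) ** 2)
learning_rate = 0.1
for _ in range(50):
    stump = fit_stump(x, y - pred)
    pred += learning_rate * predict_stump(stump, x)

final_mse = np.mean((y - pred) ** 2)
print(baseline_mse, final_mse)
```

Each round shrinks the training error a little; the learning rate trades off how fast that happens against the risk of overfitting.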


5) Support Vector Regressor (SVR)

In [25]:
from sklearn.svm import SVR
model = SVR()
model.fit(x_train, y_train)
#making predictions
svr_pred = model.predict(x_test)
svr_pred

array([0.95276978, 2.89683293, 2.80281643, ..., 1.39528704, 1.27111635,
       0.82740057])

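Explanation: Fits a function that keeps as many training points as possible within a margin (epsilon) of its predictions, penalizing only the points that fall outside this tube. The points on or outside the tube are the support vectors that define the model. It can handle non-linear relationships using kernel functions.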


Suitability for California Housing: Effective for non-linear relationships and robust to outliers, which might be present in housing data.
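
SVR's objective is built on the epsilon-insensitive loss, which ignores errors smaller than epsilon and grows linearly beyond it. The sketch below shows only that loss, not the full optimization; the helper name and toy values are assumptions, and 0.1 matches the default `epsilon` of `SVR`.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero inside the epsilon tube, linear in the absolute error outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

# Hypothetical targets and predictions: errors of 0.05, 0.3 and 1.0
y_true = np.array([1.0, 1.0, 1.0])
y_pred = np.array([1.05, 1.3, 0.0])

losses = epsilon_insensitive_loss(y_true, y_pred)
print(losses)
```

Because the penalty grows linearly rather than quadratically, a single large error influences the fit less than it would under squared-error loss.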

MODEL EVALUATION AND COMPARISON

Evaluate the performance of each algorithm using the following metrics:
Mean Squared Error (MSE)

Mean Absolute Error (MAE)

R-squared Score (R²)
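
For reference, the three metrics can be computed by hand as follows. This is a minimal NumPy sketch on made-up values; the notebook itself uses the `sklearn.metrics` functions.

```python
import numpy as np

# Hypothetical true values and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_pred

mse = np.mean(errors ** 2)        # average squared error
mae = np.mean(np.abs(errors))     # average absolute error

ss_res = np.sum(errors ** 2)                      # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1.0 - ss_res / ss_tot                        # share of variance explained

print(mse, mae, r2)
```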


In [13]:
#Import necessary module
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score

1) Linear Regression


In [14]:
#mean squared error metric
mse_linear = mean_squared_error(y_test, linear_pred)
#mean absolute error metric
mae_linear = mean_absolute_error(y_test, linear_pred)  # Calculate MAE
#r2 score
r2_linear = r2_score(y_test, linear_pred)

In [15]:
#PRINT THE RESULT
print("Linear Regression Evaluation : ")
print("Mean_squared Error : ",mse_linear)
print("Mean_Absolute Error",mae_linear)
print("R2 Score ",r2_linear)

Linear Regression Evaluation : 
Mean_squared Error :  0.45370539797782
Mean_Absolute Error 0.5021936620375631
R2 Score  0.658285047961088


2) Random Forest Regression

In [16]:
#mean squared error metric
mse_rf = mean_squared_error(y_test, rf_pred)
#mean absolute error metric
mae_rf= mean_absolute_error(y_test, rf_pred)
#r2 score
r2_rf = r2_score(y_test, rf_pred)

In [17]:
#PRINT THE RESULT
print("Random Forest Regression Evaluation : ")
print("Mean_squared Error : ",mse_rf)
print("Mean_Absolute Error" ,mae_rf)
print("R2 Score ",r2_rf)

Random Forest Regression Evaluation : 
Mean_squared Error :  0.2625646880817721
Mean_Absolute Error 0.3334248555426359
R2 Score  0.8022455095423819


3) Decision Tree Regression

In [20]:
#mean squared error metric
mse_dt = mean_squared_error(y_test, dt_pred)
#mean absolute error metric
mae_dt= mean_absolute_error(y_test, dt_pred)
#r2 score
r2_dt = r2_score(y_test, dt_pred)

In [21]:
#PRINT THE RESULT
print("Decision Tree Regression Evaluation : ")
print("Mean_squared Error : ",mse_dt)
print("Mean_Absolute Error" ,mae_dt)
print("R2 Score ",r2_dt)

Decision Tree Regression Evaluation : 
Mean_squared Error :  0.5204221506011241
Mean_Absolute Error 0.4615872441860465
R2 Score  0.6080363358574272


4) Gradient Boosting Regressor

In [23]:
#mean squared error metric
mse_gb = mean_squared_error(y_test, gb_pred)
#mean absolute error metric
mae_gb= mean_absolute_error(y_test, gb_pred)
#r2 score
r2_gb = r2_score(y_test, gb_pred)

In [24]:
#PRINT THE RESULT
print("Gradient Boosting Regression Evaluation : ")
print("Mean_squared Error : ",mse_gb)
print("Mean_Absolute Error" ,mae_gb)
print("R2 Score ",r2_gb)

Gradient Boosting Regression Evaluation : 
Mean_squared Error :  0.29207407577659245
Mean_Absolute Error 0.3735577211434748
R2 Score  0.7800200763741254


5) Support Vector Regressor (SVR)

In [26]:
#mean squared error metric
mse_svr = mean_squared_error(y_test, svr_pred)
#mean absolute error metric
mae_svr= mean_absolute_error(y_test, svr_pred)
#r2 score
r2_svr= r2_score(y_test, svr_pred)

In [27]:
#PRINT THE RESULT
print("Support Vector Regression Evaluation : ")
print("Mean_squared Error : ",mse_svr)
print("Mean_Absolute Error" ,mae_svr)
print("R2 Score ",r2_svr)

Support Vector Regression Evaluation : 
Mean_squared Error :  0.3153890782571265
Mean_Absolute Error 0.3733279205148143
R2 Score  0.7624600363350773


**Comparing all 5 models :**

In [29]:
model_comparison = pd.DataFrame({
    'Model': ['Linear Regression', 'Random Forest', 'Decision Tree', 'Gradient Boosting', 'SVR'],
    'MSE': [mse_linear, mse_rf, mse_dt, mse_gb, mse_svr],
    'MAE': [mae_linear, mae_rf, mae_dt, mae_gb, mae_svr],
    'R-squared': [r2_linear, r2_rf, r2_dt, r2_gb, r2_svr]
})

model_comparison

Index,Model,MSE,MAE,R-squared
0,Linear Regression,0.453705,0.502194,0.658285
1,Random Forest,0.262565,0.333425,0.802246
2,Decision Tree,0.520422,0.461587,0.608036
3,Gradient Boosting,0.292074,0.373558,0.78002
4,SVR,0.315389,0.373328,0.76246


Best Performing Algorithm

Based on R-squared: The model with the highest R-squared value is generally the best. R-squared represents the proportion of variance in the target variable that is explained by the model. A higher R-squared indicates a better fit.

Based on MSE and MAE: The model with the lowest MSE and MAE values is preferred. These metrics measure the average error between the predicted and actual values. Lower values indicate better accuracy.
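
These criteria can also be applied programmatically. The sketch below rebuilds a small results table from the printed evaluation outputs above, rounded to four decimals (they will differ on another run because the split is not seeded), then ranks the models. The names `results`, `ranked` and `best_by` are assumptions.

```python
import pandas as pd

# Metrics rounded from the printed evaluation outputs of this run
results = pd.DataFrame({
    "Model": ["Linear Regression", "Random Forest", "Decision Tree",
              "Gradient Boosting", "SVR"],
    "MSE": [0.4537, 0.2626, 0.5204, 0.2921, 0.3154],
    "MAE": [0.5022, 0.3334, 0.4616, 0.3736, 0.3733],
    "R-squared": [0.6583, 0.8022, 0.6080, 0.7800, 0.7625],
})

# Rank by R-squared, highest first
ranked = results.sort_values("R-squared", ascending=False).reset_index(drop=True)

# Best model per metric: lowest error, highest R-squared
best_by = {
    "MSE": results.loc[results["MSE"].idxmin(), "Model"],
    "MAE": results.loc[results["MAE"].idxmin(), "Model"],
    "R-squared": results.loc[results["R-squared"].idxmax(), "Model"],
}

print(ranked)
print(best_by)
```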

In this project, **Random Forest Regression** has the **highest R-squared value** (about 0.80) and the **lowest MSE and MAE values** (about 0.26 and 0.33). Therefore, Random Forest is the best-performing algorithm.

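At the other end, according to the printed evaluation results, the **Decision Tree** has the **lowest R-squared value** (about 0.61) and the **highest MSE** (about 0.52), while **Linear Regression** has the **highest MAE** (about 0.50). By R-squared and MSE, the Decision Tree is the worst-performing algorithm on this split, with Linear Regression close behind.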