## Regression Assignment

## 1. Loading and Preprocessing (2 marks):


In [76]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [78]:
# Load dataset
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df["MedHouseVal"] = housing.target
df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


In [80]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   MedInc       20640 non-null  float64
 1   HouseAge     20640 non-null  float64
 2   AveRooms     20640 non-null  float64
 3   AveBedrms    20640 non-null  float64
 4   Population   20640 non-null  float64
 5   AveOccup     20640 non-null  float64
 6   Latitude     20640 non-null  float64
 7   Longitude    20640 non-null  float64
 8   MedHouseVal  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB


In [82]:
# Check for missing values
df.isnull().sum()  # No missing values expected

MedInc         0
HouseAge       0
AveRooms       0
AveBedrms      0
Population     0
AveOccup       0
Latitude       0
Longitude      0
MedHouseVal    0
dtype: int64

In [84]:
# Check for duplicate values
df.duplicated().sum()

0

In [86]:
# Check outliers
df.describe()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | 3.870671 | 28.639486 | 5.429 | 1.096675 | 1425.476744 | 3.070655 | 35.631861 | -119.569704 | 2.068558 |
| std | 1.899822 | 12.585558 | 2.474173 | 0.473911 | 1132.462122 | 10.38605 | 2.135952 | 2.003532 | 1.153956 |
| min | 0.4999 | 1.0 | 0.846154 | 0.333333 | 3.0 | 0.692308 | 32.54 | -124.35 | 0.14999 |
| 25% | 2.5634 | 18.0 | 4.440716 | 1.006079 | 787.0 | 2.429741 | 33.93 | -121.8 | 1.196 |
| 50% | 3.5348 | 29.0 | 5.229129 | 1.04878 | 1166.0 | 2.818116 | 34.26 | -118.49 | 1.797 |
| 75% | 4.74325 | 37.0 | 6.052381 | 1.099526 | 1725.0 | 3.282261 | 37.71 | -118.01 | 2.64725 |
| max | 15.0001 | 52.0 | 141.909091 | 34.066667 | 35682.0 | 1243.333333 | 41.95 | -114.31 | 5.00001 |


In [90]:
df.boxplot()

<Axes: >

In [92]:
df.select_dtypes(include=['float64', 'int64']).skew()

MedInc          1.646657
HouseAge        0.060331
AveRooms       20.697869
AveBedrms      31.316956
Population      4.935858
AveOccup       97.639561
Latitude        0.465953
Longitude      -0.297801
MedHouseVal     0.977763
dtype: float64
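
The large skew values for AveRooms, AveBedrms, and AveOccup suggest heavy right tails. One common way to put a number on this is the 1.5 × IQR rule. The block below is a minimal sketch on toy data, not the housing dataset; the helper name `iqr_outlier_mask` and the sample values are illustrative assumptions.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy data (hypothetical): mostly small values plus one extreme value
sample = np.array([2.1, 2.4, 2.5, 2.8, 3.0, 3.1, 3.3, 1243.3])
mask = iqr_outlier_mask(sample)
n_outliers = int(mask.sum())
print("Outliers flagged:", n_outliers, sample[mask])
```

The same helper could be applied column by column, for example `iqr_outlier_mask(df["AveOccup"]).sum()`, to count flagged rows per feature.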

In [94]:
num_cols = df.select_dtypes(include=['float64', 'int64']).columns
num_cols

Index(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
       'Latitude', 'Longitude', 'MedHouseVal'],
      dtype='object')

## Explanation:
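* StandardScaler standardizes each feature to zero mean and unit variance. This matters most for scale-sensitive models such as SVR. Ordinary Linear Regression predictions do not depend on feature scale, but standardized coefficients are easier to compare.
* Holding out a test split gives an estimate of performance on unseen data, so the models are compared fairly.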
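
To keep the evaluation fair, the scaler is usually fit on the training split only, so that no test-set statistics leak into preprocessing. The following is a minimal sketch of that pattern using a scikit-learn `Pipeline` on synthetic data; the variable names and the `make_regression` settings are assumptions for illustration, not part of the assignment.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical), so the sketch runs without downloads
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits the scaler on Xtr only, then reuses those statistics for Xte
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(Xtr, ytr)
demo_r2 = pipe.score(Xte, yte)

# Training features are centered exactly; test features are only approximately centered
train_means = pipe.named_steps["standardscaler"].transform(Xtr).mean(axis=0)
print("Test R2:", demo_r2)
print("Training feature means after scaling:", train_means)
```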

## 2. Regression Algorithm Implementation (5 marks):


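In [13]:
# Feature scaling (must run before the split, which uses X_scaled)
# Note: the scaler is fit on the full dataset here, so test-set statistics influence the scaling.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df.drop("MedHouseVal", axis=1))
y = df["MedHouseVal"]

In [15]:
# Split dataset into 80% training and 20% test data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)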

### 2.1. Linear Regression

In [100]:
# Linear Regression models the target as a linear combination of the input features.
# The metric functions were already imported in the first cell.
from sklearn.linear_model import LinearRegression
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
lr_pred = lr_model.predict(X_test)
# Metrics
lr_mse = mean_squared_error(y_test, lr_pred)
lr_mae = mean_absolute_error(y_test, lr_pred)
lr_r2 = r2_score(y_test, lr_pred)
print("Linear Regression:")
print("MSE:", lr_mse)
print("MAE:", lr_mae)
print("R2 Score:", lr_r2)

Linear Regression:
MSE: 0.5558915986952442
MAE: 0.5332001304956565
R2 Score: 0.575787706032451


### Analysis:
Linear Regression gives moderate performance. Since housing prices depend on complex, non-linear interactions between features (e.g., location, population), linear models may underfit the data. It's simple and interpretable, but not powerful for capturing complex trends.

### 2.2. Decision Tree Regressor

In [104]:
from sklearn.tree import DecisionTreeRegressor
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train, y_train)
dt_pred = dt_model.predict(X_test)
# Metrics
dt_mse = mean_squared_error(y_test, dt_pred)
dt_mae = mean_absolute_error(y_test, dt_pred)
dt_r2 = r2_score(y_test, dt_pred)
print("Decision Tree Regressor:")
print("MSE:", dt_mse)
print("MAE:", dt_mae)
print("R2 Score:", dt_r2)

Decision Tree Regressor:
MSE: 0.4942716777366763
MAE: 0.4537843265503876
R2 Score: 0.6228111330554302


### Analysis:
Better than Linear Regression because it captures non-linearity. However, a single decision tree grown without a depth limit (the default here) is prone to overfitting, and its test performance varies with tree depth and the splits chosen. It’s fast to train but not the most accurate model in this comparison.
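
The overfitting tendency can be seen by comparing training and test R² for an unrestricted tree and a depth-limited one. This is a minimal sketch on synthetic data; the depth of 3 and the `make_regression` settings are arbitrary assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (hypothetical), not the housing dataset
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

scores = {}
for depth in (None, 3):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(Xtr, ytr)
    scores[depth] = (tree.score(Xtr, ytr), tree.score(Xte, yte))
    print(f"max_depth={depth}: train R2={scores[depth][0]:.3f}, test R2={scores[depth][1]:.3f}")
```

A large gap between the training and test scores of the unrestricted tree is the usual sign that it has memorized noise in the training data.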

### 2.3. Random Forest Regressor

In [111]:
from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
# Metrics
rf_mse = mean_squared_error(y_test, rf_pred)
rf_mae = mean_absolute_error(y_test, rf_pred)
rf_r2 = r2_score(y_test, rf_pred)
print("Random Forest Regressor:")
print("MSE:", rf_mse)
print("MAE:", rf_mae)
print("R2 Score:", rf_r2)

Random Forest Regressor:
MSE: 0.25549776668540763
MAE: 0.32761306601259704
R2 Score: 0.805024407701793


### Analysis:
Excellent performance due to ensemble learning and averaging, which reduces overfitting. Random Forest captures complex patterns, is robust to outliers, and generalizes well.
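
The averaging step can be checked directly: a fitted `RandomForestRegressor` exposes its individual trees through `estimators_`, and its prediction is the mean of their predictions. The sketch below uses synthetic data and an arbitrary forest size of 25 trees, both assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (hypothetical), not the housing dataset
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=1)
forest = RandomForestRegressor(n_estimators=25, random_state=1).fit(X_demo, y_demo)

# Predictions of each individual tree for the first 10 rows: shape (25, 10)
per_tree = np.stack([tree.predict(X_demo[:10]) for tree in forest.estimators_])
manual_avg = per_tree.mean(axis=0)
forest_pred = forest.predict(X_demo[:10])

print("Matches forest.predict:", np.allclose(manual_avg, forest_pred))
print("Spread across trees for row 0:", per_tree[:, 0].std())
```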

### 2.4. Gradient Boosting Regressor

In [115]:
from sklearn.ensemble import GradientBoostingRegressor
gb_model = GradientBoostingRegressor(random_state=42)
gb_model.fit(X_train, y_train)
gb_pred = gb_model.predict(X_test)
# Metrics
gb_mse = mean_squared_error(y_test, gb_pred)
gb_mae = mean_absolute_error(y_test, gb_pred)
gb_r2 = r2_score(y_test, gb_pred)
print("Gradient Boosting Regressor:")
print("MSE:", gb_mse)
print("MAE:", gb_mae)
print("R2 Score:", gb_r2)

Gradient Boosting Regressor:
MSE: 0.29399901242474274
MAE: 0.37165044848436773
R2 Score: 0.7756433164710084


### Analysis:
Almost as good as Random Forest, sometimes even better with careful tuning. It builds models sequentially to correct previous errors. A strong choice for structured/tabular data.
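
The sequential error correction can be observed with `staged_predict`, which yields the ensemble's prediction after each boosting stage. This is a minimal sketch on synthetic data; the 50 stages and the data settings are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data (hypothetical), not the housing dataset
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=2)
gbr = GradientBoostingRegressor(n_estimators=50, random_state=2).fit(X_demo, y_demo)

# Training MSE after each stage: every new tree is fit to the remaining residuals
staged = list(gbr.staged_predict(X_demo))
stage_mse = [mean_squared_error(y_demo, pred) for pred in staged]

print("MSE after stage 1:", stage_mse[0])
print("MSE after stage 50:", stage_mse[-1])
```

On held-out data the curve typically flattens or turns upward at some point, which is one way to choose the number of stages.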

### 2.5. Support Vector Regressor (SVR)

In [112]:
from sklearn.svm import SVR
svr_model = SVR()
svr_model.fit(X_train, y_train)
svr_pred = svr_model.predict(X_test)
# Metrics
svr_mse = mean_squared_error(y_test, svr_pred)
svr_mae = mean_absolute_error(y_test, svr_pred)
svr_r2 = r2_score(y_test, svr_pred)
print("Support Vector Regressor:")
print("MSE:", svr_mse)
print("MAE:", svr_mae)
print("R2 Score:", svr_r2)

Support Vector Regressor:
MSE: 0.3551984619989419
MAE: 0.3977630963437859
R2 Score: 0.7289407597956462


### Analysis:
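With default hyperparameters, SVR reaches an R² of about 0.73: better than Linear Regression and the single Decision Tree, but behind both ensemble methods. SVR is sensitive to feature scaling and to parameter choices such as the kernel, C, and epsilon, so tuning may improve it. Its training time also grows more than quadratically with the number of samples, which makes it slow on datasets of this size.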
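
The sensitivity to feature scaling can be illustrated with a toy problem in which the informative feature has a much smaller scale than an irrelevant one. The sketch below is built on synthetic data under that assumption; the feature scales and sample size are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data (hypothetical): y depends only on the small-scale feature
rng = np.random.default_rng(0)
informative = rng.normal(0.0, 1.0, size=400)
large_noise = rng.normal(0.0, 1000.0, size=400)
X_demo = np.column_stack([informative, large_noise])
y_demo = 3.0 * informative + rng.normal(0.0, 0.1, size=400)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Same default SVR, with and without standardization
raw_r2 = SVR().fit(Xtr, ytr).score(Xte, yte)
scaled_r2 = make_pipeline(StandardScaler(), SVR()).fit(Xtr, ytr).score(Xte, yte)

print("R2 without scaling:", raw_r2)
print("R2 with scaling:", scaled_r2)
```

Without scaling, distances in the RBF kernel are dominated by the large-scale feature, so the model can barely see the informative one.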

In [22]:
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "Support Vector Regressor": SVR()
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    results[name] = {
        "MSE": mse,
        "MAE": mae,
        "R2": r2
    }

In [24]:
# compare model performance
results_df = pd.DataFrame(results).T
results_df = results_df.sort_values("R2", ascending=False)
results_df

| | MSE | MAE | R2 |
|---|---|---|---|
| Random Forest | 0.255498 | 0.327613 | 0.805024 |
| Gradient Boosting | 0.293999 | 0.37165 | 0.775643 |
| Support Vector Regressor | 0.355198 | 0.397763 | 0.728941 |
| Decision Tree | 0.494272 | 0.453784 | 0.622811 |
| Linear Regression | 0.555892 | 0.5332 | 0.575788 |


## 3. Model Evaluation and Comparison (2 marks):



In [122]:
# Create a results DataFrame
comparison_df = pd.DataFrame({
    "Model": ["Linear Regression", "Decision Tree", "Random Forest", "Gradient Boosting", "SVR"],
    "MSE": [lr_mse, dt_mse, rf_mse, gb_mse, svr_mse],
    "MAE": [lr_mae, dt_mae, rf_mae, gb_mae, svr_mae],
    "R2 Score": [lr_r2, dt_r2, rf_r2, gb_r2, svr_r2]
})
# Sort by R2 Score ascending (worst to best)
comparison_df = comparison_df.sort_values(by="R2 Score")
print(comparison_df)

               Model       MSE       MAE  R2 Score
0  Linear Regression  0.555892  0.533200  0.575788
1      Decision Tree  0.494272  0.453784  0.622811
4                SVR  0.355198  0.397763  0.728941
3  Gradient Boosting  0.293999  0.371650  0.775643
2      Random Forest  0.255498  0.327613  0.805024


### Analysis
Among the regression models applied to the California Housing dataset, Random Forest Regressor performed best, with the lowest Mean Squared Error (MSE ≈ 0.255) and Mean Absolute Error (MAE ≈ 0.328) and the highest R² score (≈ 0.805). Gradient Boosting Regressor followed with slightly higher errors and an R² of about 0.776, still indicating a good fit to the data. Support Vector Regressor (SVR) ranked third with an R² of about 0.729; it was run with default hyperparameters and is sensitive to feature scaling and parameter choices, so tuning might narrow the gap. Decision Tree Regressor (R² ≈ 0.623) slightly outperformed Linear Regression thanks to its ability to model non-linear patterns. Linear Regression had the weakest results, with the highest MSE and MAE and the lowest R² (≈ 0.576), which suggests that a purely linear model underfits this data. Overall, the ensemble methods, Random Forest and Gradient Boosting, were the most effective for this regression task.
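
For reference, the three metrics used in this comparison can be computed by hand. The sketch below uses only plain Python and a small made-up example; the function name `regression_metrics` and the sample values are illustrative assumptions, and in practice the scikit-learn functions used above are preferable.

```python
def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE, and R2 from two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n           # mean of squared errors
    mae = sum(abs(e) for e in errors) / n          # mean of absolute errors
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mse, mae, r2

# Made-up values (hypothetical), not model output
y_true_demo = [1.0, 2.0, 3.0, 4.0]
y_pred_demo = [1.5, 2.0, 2.5, 4.0]
mse_demo, mae_demo, r2_demo = regression_metrics(y_true_demo, y_pred_demo)
print(mse_demo, mae_demo, r2_demo)
```

MSE penalizes large errors more heavily than MAE, while R² expresses the share of target variance explained by the model, so a model can rank differently depending on which metric is emphasized.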