# Linear Regression Model – California Housing

This notebook trains and evaluates a **Linear Regression model** on the processed California Housing dataset.  

It includes:  
- Loading the processed dataset  
- Fitting a Linear Regression model  
- Cross-validation evaluation  
- Hyperparameter tuning  
- Saving the trained model  


# Data Loading

In [146]:
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
import joblib

# Paths
PROJECT_DIR = Path("/Users/sukainaalkhalidy/Desktop/CMSE492/ca_housing_project")
TRAIN_PROCESSED_FP = PROJECT_DIR / "data" / "train" / "housing_train_processed.csv"
MODEL_FP = PROJECT_DIR / "models" / "linear_regression_model.pkl"

# Load processed dataset (24 columns: 23 features + target)
housing = pd.read_csv(TRAIN_PROCESSED_FP)
print("Processed train shape:", housing.shape)

X = housing.drop("median_house_value", axis=1)
y = housing["median_house_value"]


Processed train shape: (16512, 24)


# Model Fitting

In [145]:
print("X shape:", X.shape)
print("y shape:", y.shape)

# Check dtypes
print("\nX dtypes:")
print(X.dtypes.value_counts())
print("\nNon-numeric columns:")
print(X.select_dtypes(exclude=[np.number]).head())

# Check missing values
print("\nMissing values per column:")
print(X.isna().sum()[X.isna().sum() > 0])


X shape: (16512, 23)
y shape: (16512,)

X dtypes:
float64    23
Name: count, dtype: int64

Non-numeric columns:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]

Missing values per column:
total_bedrooms            163
bedrooms_per_room         163
bedrooms_per_person       163
population_per_bedroom    163
dtype: int64


In [148]:
from sklearn.impute import SimpleImputer

# Replace missing values in each column with that column's median
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X)
X = pd.DataFrame(X_imputed, columns=X.columns, index=X.index)

# Initialize and fit Linear Regression
lin_reg = LinearRegression()
lin_reg.fit(X, y)

# Basic training results
predictions = lin_reg.predict(X)
mse = mean_squared_error(y, predictions)
rmse = np.sqrt(mse)
print("Training RMSE:", rmse)


Training RMSE: 65914.95687475588


# Cross-Validation
We evaluate the model using **5-fold cross-validation** with RMSE as the metric.
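
For reference, this is the standard definition of RMSE for a fold with $n$ samples, where $y_i$ is the true `median_house_value` and $\hat{y}_i$ is the model's prediction (the symbols are notation introduced here, not names from the notebook):

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

scikit-learn's `"neg_mean_squared_error"` scorer returns the negated quantity under the square root, which is why the cell below negates the scores before taking the square root.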


In [149]:
from sklearn.model_selection import cross_val_score
import numpy as np

scores = cross_val_score(lin_reg, X, y,
                         scoring="neg_mean_squared_error", cv=5)
rmse_scores = np.sqrt(-scores)

print("Cross-validation RMSE scores:", rmse_scores)
print("Mean:", rmse_scores.mean())
print("Standard deviation:", rmse_scores.std())


Cross-validation RMSE scores: [90341.50586327 64873.20259698 66611.36673363 65312.25971465
 70947.35147087]
Mean: 71617.13727588105
Standard deviation: 9605.628212232115
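
One caveat: the imputer above was fitted on the full `X` before cross-validation, so each validation fold contributed to the medians used for imputation. A minimal sketch of one way to avoid this is to wrap the imputer and the model in a scikit-learn pipeline, so the imputer is refit on the training portion of every fold. The data below is synthetic and the names `X_demo`, `y_demo`, and `pipeline` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the housing features (assumption, not the real data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Knock out roughly 5% of the entries to mimic missing values
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

# Imputation happens inside each fold, using only that fold's training rows
pipeline = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())
demo_scores = cross_val_score(pipeline, X_demo, y_demo,
                              scoring="neg_mean_squared_error", cv=5)
demo_rmse = np.sqrt(-demo_scores)

print("Per-fold RMSE:", demo_rmse)
print("Mean RMSE:", demo_rmse.mean())
```

With only 163 missing values per affected column out of 16,512 rows, the effect on the scores reported above is probably small, but the pipeline form keeps the evaluation clean.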


# Hyperparameter Tuning
Plain Linear Regression has no regularization strength to tune, so we use **GridSearchCV** to compare its two constructor options  
(`fit_intercept` and `positive`) under 5-fold cross-validation.


In [153]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "fit_intercept": [True, False],
    "positive": [True, False]
}

grid_search = GridSearchCV(LinearRegression(), param_grid,
                           scoring="neg_mean_squared_error", cv=5)

grid_search.fit(X, y)

print("Best parameters:", grid_search.best_params_)
print("Best score (MSE):", -grid_search.best_score_)


Best parameters: {'fit_intercept': True, 'positive': True}
Best score (MSE): 4894428163.5285845
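
The best score is reported as an MSE (in squared dollars), which is hard to compare with the RMSE values from the earlier cells. A small sketch of the conversion, using the value printed above (`best_mse` and `best_rmse` are illustrative names):

```python
import math

# "Best score (MSE)" as printed by the grid search above
best_mse = 4894428163.5285845

# Taking the square root puts the score back in dollars
best_rmse = math.sqrt(best_mse)
print(f"Best CV RMSE: {best_rmse:.2f}")
```

This comes to roughly 69,960. It is the square root of the mean fold MSE, which is not the same as the mean of the per-fold RMSEs reported in the cross-validation section, so the two numbers are only approximately comparable.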


# Model Saving
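Finally, we save the trained model to the project's `models/` directory for reuse. Note that the saved object is `lin_reg`, the model fitted with default parameters, not the best estimator found by the grid search.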


In [154]:
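# Make sure the target directory exists before writing
MODEL_FP.parent.mkdir(parents=True, exist_ok=True)

# Save the trained model (lin_reg, fitted with default parameters)
joblib.dump(lin_reg, MODEL_FP)
print(f"Model saved to {MODEL_FP}")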



Model saved to /Users/sukainaalkhalidy/Desktop/CMSE492/ca_housing_project/models/linear_regression_model.pkl
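
To reuse a saved model, it can be loaded back with `joblib.load`. The sketch below shows the round trip on a small synthetic model written to a temporary directory. The file name `demo_model.pkl` and the variables `X_demo`, `y_demo`, and `model` are hypothetical and stand in for `MODEL_FP` and the real data.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption, not the housing dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo @ np.array([3.0, -1.0]) + 2.0

model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_fp = Path(tmp_dir) / "demo_model.pkl"  # hypothetical path
    joblib.dump(model, demo_fp)        # save
    reloaded = joblib.load(demo_fp)    # load back

# The reloaded model should reproduce the original predictions
same_predictions = np.allclose(model.predict(X_demo), reloaded.predict(X_demo))
print("Reloaded model reproduces predictions:", same_predictions)
```

Because the notebook's imputer is a separate object from `lin_reg`, new data with missing values would need the same median imputation applied before calling `predict` on the reloaded model.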
