# Create a Regression Model

## Instructions

In this lesson you were shown how to build a model using both Linear and Polynomial Regression. Using this knowledge, find a dataset or use one of Scikit-learn's built-in sets to build a fresh model. Explain in your notebook why you chose the technique you did, and demonstrate your model's accuracy. If it is not accurate, explain why.

## Rubric for grading

| Criteria | Exemplary                                                    | Adequate                   | Needs Improvement               |
| -------- | ------------------------------------------------------------ | -------------------------- | ------------------------------- |
|          | presents a complete notebook with a well-documented solution | the solution is incomplete | the solution is flawed or buggy |

## Solution

### Introduction

✅**Dataset Chosen:** Scikit-learn's California Housing Dataset  

This is a well-known regression dataset whose target, `MedHouseVal` (median house value, expressed in units of 100,000 US dollars), is continuous, which makes it a natural fit for demonstrating regression techniques.

✅**Technique Chosen:** Linear Regression  

Reasons for choosing Linear Regression:  
- It is a simple, interpretable model.  
- It works well when the relationship between the features and the target is approximately linear.  

First, load the data:


In [None]:

from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

housing = fetch_california_housing(as_frame=True)
california_housing = housing.frame

california_housing.head()


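|     | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
| --- | ------ | -------- | -------- | --------- | ---------- | -------- | -------- | --------- | ----------- |
| 0   | 8.3252 | 41.0     | 6.984127 | 1.02381   | 322.0      | 2.555556 | 37.88    | -122.23   | 4.526       |
| 1   | 8.3014 | 21.0     | 6.238137 | 0.97188   | 2401.0     | 2.109842 | 37.86    | -122.22   | 3.585       |
| 2   | 7.2574 | 52.0     | 8.288136 | 1.073446  | 496.0      | 2.80226  | 37.85    | -122.24   | 3.521       |
| 3   | 5.6431 | 52.0     | 5.817352 | 1.073059  | 558.0      | 2.547945 | 37.85    | -122.25   | 3.413       |
| 4   | 3.8462 | 52.0     | 6.281853 | 1.081081  | 565.0      | 2.181467 | 37.85    | -122.25   | 3.422       |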


Check for missing values:

In [None]:
print(california_housing.isnull().sum())

MedInc         0
HouseAge       0
AveRooms       0
AveBedrms      0
Population     0
AveOccup       0
Latitude       0
Longitude      0
MedHouseVal    0
dtype: int64


There are no missing values, so no imputation or row removal is needed before fitting the model.

Calculate correlation coefficients between features and target:

In [None]:
corr = california_housing.corr()["MedHouseVal"].drop("MedHouseVal")
print("Correlation between features and target:")
print(corr)

Correlation between features and target:
MedInc        0.688075
HouseAge      0.105623
AveRooms      0.151948
AveBedrms    -0.046701
Population   -0.024650
AveOccup     -0.023737
Latitude     -0.144160
Longitude    -0.045967
Name: MedHouseVal, dtype: float64


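Only `MedInc` shows a reasonably strong linear correlation with `MedHouseVal` (about 0.69); the remaining features correlate weakly, with absolute values below 0.16. Because the Pearson coefficient captures only linear association, these numbers suggest that Linear Regression is a reasonable baseline driven mainly by median income, not that every feature relates linearly to the target.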
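A weak correlation coefficient does not by itself mean a feature is uninformative. As a minimal sketch on synthetic data (the arrays below are hypothetical and unrelated to the housing dataset), the following compares the Pearson coefficient for an exactly linear relationship with the coefficient for an exactly quadratic one:

```python
import numpy as np

# Synthetic, hypothetical data (not taken from the housing dataset).
x = np.linspace(-3.0, 3.0, 201)   # values symmetric around zero

y_linear = 2.0 * x + 1.0          # exact linear dependence on x
y_quadratic = x ** 2              # exact, but non-linear, dependence on x

# np.corrcoef returns the 2x2 correlation matrix; entry [0, 1] is r(x, y).
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print(f"Linear relationship:    r = {r_linear:.3f}")
print(f"Quadratic relationship: r = {r_quadratic:.3f}")
```

The quadratic case yields a coefficient close to zero even though `y_quadratic` is fully determined by `x`, which is why low values in the table above may hide non-linear structure.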

Let's split the data into training and testing sets:

In [None]:
X = california_housing.drop("MedHouseVal", axis=1)
y = california_housing["MedHouseVal"]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42
)


Next, we train the model:


In [None]:

model = LinearRegression()
model.fit(X_train, y_train)


Our model is trained and ready to make predictions.

Let's test the model and evaluate its performance:

In [None]:
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.3f}")
print(f"R² Score: {r2:.3f}")


RMSE: 0.728
R² Score: 0.596
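For reference, the two metrics reported above can be reproduced by hand. The sketch below uses a handful of hypothetical values (not taken from the housing data) to show how RMSE and R² follow from the residuals; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical true and predicted values, for illustration only.
y_true_demo = np.array([3.0, 1.5, 2.0, 4.5])
y_pred_demo = np.array([2.5, 1.0, 2.5, 4.0])

residuals = y_true_demo - y_pred_demo

# RMSE: square root of the mean squared residual, in the units of the target.
rmse_demo = np.sqrt(np.mean(residuals ** 2))

# R²: 1 minus the ratio of the residual sum of squares to the total sum of
# squares around the mean of the true values.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_demo = 1.0 - ss_res / ss_tot

print(f"RMSE: {rmse_demo:.3f}")      # RMSE: 0.500
print(f"R² Score: {r2_demo:.3f}")    # R² Score: 0.810
```

An R² of 1 would mean the predictions match the targets exactly, while a value of 0 means the model does no better than always predicting the mean.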



### Conclusion

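With an R² of 0.596, the model explains roughly 60% of the variance in `MedHouseVal` on the test set. The RMSE of 0.728 corresponds to a typical error of about 72,800 US dollars, since the target is expressed in units of 100,000 US dollars. This moderate accuracy could indicate the following:  
- The relationships between the features and the target are not purely linear.  
- We did not apply feature selection to identify the features that best predict the target.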
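
One way to probe the first point is Polynomial Regression, the other technique covered in the lesson. The following is a minimal sketch on synthetic data, assuming a single hypothetical feature with a quadratic effect; it is not a result for the California Housing data, and all `*_demo` names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic, hypothetical data: the target depends on x quadratically, plus noise.
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3.0, 3.0, size=(500, 1))
y_demo = 0.5 * X_demo[:, 0] ** 2 + X_demo[:, 0] + rng.normal(0.0, 0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Baseline: a straight-line fit.
linear_demo = LinearRegression().fit(X_tr, y_tr)

# Polynomial Regression: expand the feature to degree 2, then fit linearly.
poly_demo = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(X_tr, y_tr)

r2_linear_demo = r2_score(y_te, linear_demo.predict(X_te))
r2_poly_demo = r2_score(y_te, poly_demo.predict(X_te))

print(f"Linear R²:     {r2_linear_demo:.3f}")
print(f"Polynomial R²: {r2_poly_demo:.3f}")
```

On this synthetic example the degree-2 pipeline scores higher because the data were generated with a squared term. Whether the same holds for the housing features would need to be checked by swapping in `X_train` and `y_train` from the notebook.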
