In [54]:
from sklearn.datasets import fetch_california_housing
import pandas as pd
# Load the housing dataset
housing = fetch_california_housing()

In [55]:
# Features as a DataFrame, one column per feature name
x = pd.DataFrame(housing.data, columns=housing.feature_names)

# Target: median house value per district, in units of $100,000
y = pd.Series(housing.target, name='med_house_value')

In [56]:
# First 5 rows of the feature dataset
print(x.head())

# First 5 values of the target
print("\nTarget (first 5 rows):")
print(y.head())

# Check for missing values
print("\nMissing values in the dataset:")
print(x.isnull().sum())


   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85

   Longitude
0    -122.23
1    -122.22
2    -122.24
3    -122.25
4    -122.25

Target (first 5 rows):
0    4.526
1    3.585
2    3.521
3    3.413
4    3.422
Name: med_house_value, dtype: float64

Missing values in the dataset:
MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
dtype: int64


In [57]:
# Generate summary statistics
print(x.describe())
print(y.describe())

             MedInc      HouseAge      AveRooms     AveBedrms    Population  \
count  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000   
mean       3.870671     28.639486      5.429000      1.096675   1425.476744   
std        1.899822     12.585558      2.474173      0.473911   1132.462122   
min        0.499900      1.000000      0.846154      0.333333      3.000000   
25%        2.563400     18.000000      4.440716      1.006079    787.000000   
50%        3.534800     29.000000      5.229129      1.048780   1166.000000   
75%        4.743250     37.000000      6.052381      1.099526   1725.000000   
max       15.000100     52.000000    141.909091     34.066667  35682.000000   

           AveOccup      Latitude     Longitude  
count  20640.000000  20640.000000  20640.000000  
mean       3.070655     35.631861   -119.569704  
std       10.386050      2.135952      2.003532  
min        0.692308     32.540000   -124.350000  
25%        2.429741     33.930000   -1

In [58]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [59]:
# Split the raw data (80% training, 20% testing)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Initialize and train the linear regression model on unscaled data
lin_reg = LinearRegression()
lin_reg.fit(x_train, y_train)

# Make predictions on the test set
y_pred = lin_reg.predict(x_test)

In [66]:
from sklearn.metrics import mean_squared_error, root_mean_squared_error, r2_score

# Evaluate model performance
mse = mean_squared_error(y_test, y_pred)
rmse  = root_mean_squared_error(y_test, y_pred)
r2  = r2_score(y_test, y_pred)

# View our model's coefficients
print("Model Coefficients (Unscaled):")
print(pd.Series(lin_reg.coef_,
                index=x.columns))

# print values as floats w/ 2 decimal places
print(f"Mean Squared Error: {mse:.2f}")
print(f"Root Mean Squared Error: {rmse:.2f}")
print(f"R² Score: {r2:.2f}")


Model Coefficients (Unscaled):
MedInc        0.431822
HouseAge      0.009615
AveRooms     -0.101645
AveBedrms     0.609838
Population   -0.000002
AveOccup     -0.003443
Latitude     -0.419338
Longitude    -0.432621
dtype: float64
Mean Squared Error: 0.54
Root Mean Squared Error: 0.74
R² Score: 0.59


What does the R² score tell us about model performance?
 - R² is the proportion of variance in the target variable that the model explains using the independent variables. Values close to 1 mean the features account for most of the variation in the target, while values close to 0 mean the model explains very little of it (a poorly fitting model). Here, R² tells us how well the fitted model, using all eight features together, predicts median house values on the test set.
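
The definition above can be checked by hand. The sketch below computes R² as one minus the ratio of unexplained to total variation, using a handful of made-up values (`y_true` and `y_hat` are hypothetical, not taken from the housing data).

```python
# Hypothetical actual and predicted values, in units of $100,000.
y_true = [4.5, 3.6, 3.5, 3.4, 1.2, 0.9]
y_hat = [4.0, 3.9, 3.0, 3.2, 1.8, 1.1]

mean_y = sum(y_true) / len(y_true)

# Residual sum of squares: variation the model fails to explain.
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))

# Total sum of squares: variation of the target around its own mean.
ss_tot = sum((t - mean_y) ** 2 for t in y_true)

r2_manual = 1 - ss_res / ss_tot
print(f"R² = {r2_manual:.3f}")

# A model that predicts the mean for every row explains nothing: R² = 0.
baseline = [mean_y] * len(y_true)
ss_res_baseline = sum((t - p) ** 2 for t, p in zip(y_true, baseline))
r2_baseline = 1 - ss_res_baseline / ss_tot
print(f"Baseline R² = {r2_baseline:.3f}")
```

This is the same quantity `r2_score` reports, so an R² of 0.59 means the residual variation is about 41% of the total variation in the test targets.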

 
Which features seem to have the strongest impact on predictions based on the model’s coefficients?
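- Median income, average bedrooms per household, and latitude/longitude seem to have the strongest impact on price, because their coefficients have the largest absolute values. This ranking is only rough: unscaled coefficients depend on each feature's units, so a feature measured on a large scale, such as Population, tends to get a small coefficient whatever its real influence.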

How well do the predicted values match the actual values?
- The predicted values are only moderately accurate. The R² of about 0.6 means the model explains roughly 60% of the variance in the test set. The target is in units of $100,000, so the RMSE of 0.74 corresponds to a typical prediction error of about $74,000, which is substantial.
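
To make the unit conversion concrete, here is a small sketch with made-up errors (the `errors` list is hypothetical). It also shows why RMSE is a "typical" error, not a plain average: squaring gives large misses extra weight, so RMSE is never smaller than the mean absolute error.

```python
import math

# Hypothetical prediction errors (actual minus predicted), in units of $100,000.
errors = [0.1, -0.2, 0.3, -0.1, 1.5]

# Root mean squared error: square, average, then take the square root.
rmse_demo = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# Mean absolute error: the plain average size of the misses.
mae_demo = sum(abs(e) for e in errors) / len(errors)

# The target is in units of $100,000, so multiply to express errors in dollars.
print(f"RMSE: {rmse_demo:.2f} (about ${rmse_demo * 100_000:,.0f})")
print(f"MAE:  {mae_demo:.2f} (about ${mae_demo * 100_000:,.0f})")

# The RMSE of 0.74 reported above, converted the same way.
reported_rmse_dollars = 0.74 * 100_000
print(f"Reported RMSE in dollars: about ${reported_rmse_dollars:,.0f}")
```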

In [69]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)
x_scaled = pd.DataFrame(x_scaled, columns=x.columns)

# Split scaled data
x_train_scaled, x_test_scaled, y_train_scaled, y_test_scaled = train_test_split(x_scaled, y, test_size=0.2)


# Initialize and train model on scaled data
lin_reg_scaled = LinearRegression()
lin_reg_scaled.fit(x_train_scaled, y_train_scaled)


# Make predictions on the test set
y_pred_scaled = lin_reg_scaled.predict(x_test_scaled)

# Evaluate model performance
mse_scaled = mean_squared_error(y_test_scaled, y_pred_scaled)
r2_scaled = r2_score(y_test_scaled, y_pred_scaled)
rmse_scaled = root_mean_squared_error(y_test_scaled, y_pred_scaled)


# View our model's coefficients
print("Model Coefficients (Scaled):")
print(pd.Series(lin_reg_scaled.coef_,
                index=x.columns))

print(f"Mean Squared Error: {mse_scaled:.2f}")
print(f"Root Mean Squared Error: {rmse_scaled:.2f}")
print(f"R² Score: {r2_scaled:.2f}")


Model Coefficients (Scaled):
MedInc        0.839597
HouseAge      0.126841
AveRooms     -0.278961
AveBedrms     0.343215
Population   -0.000563
AveOccup     -0.036204
Latitude     -0.893186
Longitude    -0.860459
dtype: float64
Mean Squared Error: 0.52
Root Mean Squared Error: 0.72
R² Score: 0.60


Compare the metrics before and after scaling. What changed, and why?
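- The coefficients changed substantially, while the error metrics barely moved (MSE 0.54 to 0.52, RMSE 0.74 to 0.72). After scaling, each coefficient is the change in predicted price for a one standard deviation increase in that feature, which makes it easier to compare the relative importance of features. On that scale, median income, latitude, and longitude are the biggest predictors of housing price. The small shift in the error metrics most likely comes from the two models using different random train/test splits, since no random_state was set, and not from scaling itself.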
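
The link between the two sets of coefficients can be sketched with arithmetic alone: multiplying each unscaled coefficient by its feature's standard deviation gives the effect of a one standard deviation change. The numbers below are copied from the outputs printed earlier in this notebook. Two assumptions apply: `describe()` reports the sample standard deviation while `StandardScaler` uses the population one (a negligible difference at 20,640 rows), and the two models were fit on different random splits, so the results only approximately match the scaled coefficients.

```python
# Unscaled coefficients, copied from the "Model Coefficients (Unscaled)" output.
unscaled_coef = {
    "MedInc": 0.431822,
    "HouseAge": 0.009615,
    "AveRooms": -0.101645,
    "AveBedrms": 0.609838,
    "Population": -0.000002,
    "AveOccup": -0.003443,
    "Latitude": -0.419338,
    "Longitude": -0.432621,
}

# Feature standard deviations, copied from the x.describe() output.
feature_std = {
    "MedInc": 1.899822,
    "HouseAge": 12.585558,
    "AveRooms": 2.474173,
    "AveBedrms": 0.473911,
    "Population": 1132.462122,
    "AveOccup": 10.386050,
    "Latitude": 2.135952,
    "Longitude": 2.003532,
}

# Effect of a one-standard-deviation increase = unscaled coefficient * std.
per_std = {name: unscaled_coef[name] * feature_std[name] for name in unscaled_coef}

# Print features from largest to smallest effect size.
for name, value in sorted(per_std.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:<11}{value:+.3f}")
```

On this common scale, AveBedrms drops well below MedInc, Latitude, and Longitude, which is why the ranking based on raw coefficients is misleading.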

Did the R² score improve? Why or why not?
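- The R² score stayed essentially the same (0.59 versus 0.60). Standardizing shifts and rescales each input, but a linear regression with an intercept absorbs that change into its coefficients, so the fitted predictions and the share of variance explained do not change. The 0.01 difference is consistent with the two models being evaluated on different random splits.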
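
A minimal sketch of why the fit does not change, using a single feature and made-up numbers (`income`, `price`, and the helper `fit_line` are hypothetical, written only for this illustration). Standardizing the feature changes the slope by a factor of the standard deviation, but the predictions are identical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: an income-like feature and a price-like target.
income = [1.5, 2.0, 3.5, 4.0, 6.0, 8.0]
price = [0.9, 1.4, 1.8, 2.6, 3.1, 4.4]

# Standardize the feature: zero mean, unit (population) standard deviation.
mu = sum(income) / len(income)
sigma = (sum((v - mu) ** 2 for v in income) / len(income)) ** 0.5
income_std = [(v - mu) / sigma for v in income]

slope_raw, icpt_raw = fit_line(income, price)
slope_std, icpt_std = fit_line(income_std, price)

pred_raw = [slope_raw * v + icpt_raw for v in income]
pred_std = [slope_std * v + icpt_std for v in income_std]

max_diff = max(abs(a - b) for a, b in zip(pred_raw, pred_std))
print(f"slope (raw): {slope_raw:.4f}, slope (standardized): {slope_std:.4f}")
print(f"largest difference between the two sets of predictions: {max_diff:.2e}")
```

Since the predictions match, every metric computed from them (MSE, RMSE, R²) matches as well.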

What role does feature scaling play in linear regression?
- Feature scaling changes how each coefficient is interpreted: it puts the features on a common scale so that their influence on the dependent variable can be compared. It does not change the accuracy of an ordinary least squares model's predictions.
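
One way to set up a like-for-like comparison is sketched below, under two assumptions that differ from the cells above: a fixed `random_state` so both models see the same split, and a pipeline so the scaler is fit on the training rows only (the cell above fits it on the full dataset before splitting). It uses synthetic data from `make_regression` as a stand-in for the housing data, so the numbers it prints are not the notebook's results.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (hypothetical, for illustration only).
X_demo, y_demo = make_regression(
    n_samples=500, n_features=4, noise=10.0, random_state=0
)

# One fixed split shared by both models, so any difference is due to scaling alone.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Model 1: linear regression on the raw features.
plain = LinearRegression().fit(X_tr, y_tr)

# Model 2: the pipeline fits the scaler on the training rows, then the regression.
piped = make_pipeline(StandardScaler(), LinearRegression()).fit(X_tr, y_tr)

r2_plain = r2_score(y_te, plain.predict(X_te))
r2_piped = r2_score(y_te, piped.predict(X_te))
print(f"R² without scaling: {r2_plain:.4f}")
print(f"R² with scaling:    {r2_piped:.4f}")
```

With the split held fixed, the two R² values agree to within floating-point error, which isolates scaling as having no effect on the fit.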


In [None]:
# Select three features for the simplified model
# (the feature DataFrame is named x, not X)
X_simplified_df = x[['AveRooms', 'AveOccup', 'MedInc']]

# Split the data into training and testing sets (80/20)
X_train_simplified, X_test_simplified, y_train_simplified, y_test_simplified = train_test_split(X_simplified_df, y, test_size=0.2)


# Initialize and train a separate model on the three unscaled features,
# so the full model stored in lin_reg is not overwritten
lin_reg_simplified = LinearRegression()
lin_reg_simplified.fit(X_train_simplified, y_train_simplified)

# Make predictions on the test set
y_pred_simplified = lin_reg_simplified.predict(X_test_simplified)

mse_simplified = mean_squared_error(y_test_simplified, y_pred_simplified)
r2_simplified = r2_score(y_test_simplified, y_pred_simplified)
rmse_simplified = root_mean_squared_error(y_test_simplified, y_pred_simplified)



print(f"Mean Squared Error: {mse_simplified:.2f}")
print(f"Root Mean Squared Error: {rmse_simplified:.2f}")
print(f"R² Score: {r2_simplified:.2f}")

Mean Squared Error: 0.72
Root Mean Squared Error: 0.85
R² Score: 0.45


How does the simplified model compare to the full model?
- The simplified model has a lower R² (0.45 versus 0.59) because it has less information available to explain the variance in the target. It also has a higher RMSE (0.85 versus 0.74), meaning its prediction errors are larger.

Would you use this simplified model in practice? Why or why not?
- I probably would not use the simplified model, because it has a larger error and a lower R². It is useful for highlighting how much a single input like median income contributes, but it is a worse model overall.
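
The size of the gap can be put in concrete terms with the metrics reported above. This is only a rough comparison: the two models were evaluated on different random test splits, so part of the difference may be split-to-split noise.

```python
# Metrics reported earlier in this notebook (target is in units of $100,000).
full = {"mse": 0.54, "rmse": 0.74, "r2": 0.59}
simplified = {"mse": 0.72, "rmse": 0.85, "r2": 0.45}

# Relative increase in typical error when dropping to three features.
rmse_increase = (simplified["rmse"] - full["rmse"]) / full["rmse"]

# Absolute drop in the share of variance explained.
r2_drop = full["r2"] - simplified["r2"]

# Extra typical error, converted from units of $100,000 to dollars.
extra_error_dollars = (simplified["rmse"] - full["rmse"]) * 100_000

print(f"RMSE increase: {rmse_increase:.1%}")
print(f"R² drop: {r2_drop:.2f}")
print(f"Extra typical error: about ${extra_error_dollars:,.0f}")
```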