Here's how to apply **OLS linear regression** to the **California Housing dataset** available in `sklearn`. The dataset suits regression tasks because its target, the median house value of each district, is continuous. The steps below walk through the code:

---

### **1. Load the Dataset**

```python
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
california = fetch_california_housing()
data = pd.DataFrame(california.data, columns=california.feature_names)
data['MedHouseVal'] = california.target  # Target variable

# Display first few rows
print(data.head())
```

---

### **2. Define Features and Target**

Separate the independent variables (`X`) and the dependent variable (`y`).

```python
# Define features and target variable
X = data[['MedInc', 'AveRooms', 'AveOccup']]  # Selecting a few features
y = data['MedHouseVal']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

---

### **3. Fit OLS Linear Regression Model**

We'll use `statsmodels` for detailed output.

```python
# Add a constant term to the features
X_train_sm = sm.add_constant(X_train)

# Fit the model
ols_model = sm.OLS(y_train, X_train_sm).fit()

# Print summary
print(ols_model.summary())
```

---

### **4. Make Predictions and Evaluate**

```python
# Add constant to test set
X_test_sm = sm.add_constant(X_test)

# Predict on the test set
y_pred = ols_model.predict(X_test_sm)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
```

---

### **5. Check Assumptions**

#### a. **Linearity**
Visualize the relationship between predictors and the outcome.

```python
# Pairplot to visualize linear relationships
sns.pairplot(data[['MedHouseVal', 'MedInc', 'AveRooms', 'AveOccup']])
plt.show()
```

#### b. **Residual Analysis**
Check homoscedasticity and normality of residuals.

```python
# Residuals and fitted values on the training set
fitted = ols_model.fittedvalues
residuals = ols_model.resid

# Plot residuals vs. fitted values (homoscedasticity check)
plt.scatter(fitted, residuals, alpha=0.3)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs. Fitted Values")
plt.show()

# Q-Q plot for normality. line='s' scales the reference line to the
# residuals; line='45' is only meaningful for standardized data.
sm.qqplot(residuals, line='s')
plt.show()
```

#### c. **Multicollinearity**
Check Variance Inflation Factor (VIF).

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

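# Calculate VIF on the design matrix that includes the constant.
# Without an intercept column the auxiliary regressions are uncentered
# and the resulting VIFs can be badly inflated.
vif_data = pd.DataFrame()
vif_data["Feature"] = X_train.columns
vif_data["VIF"] = [
    variance_inflation_factor(X_train_sm.values, i)
    for i in range(1, X_train_sm.shape[1])  # skip index 0, the constant
]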

print(vif_data)
```

---

### **6. Address Violations (if needed)**
- **Non-linearity**: Use polynomial features or transformations (e.g., log or sqrt).
- **Heteroscedasticity**: Use weighted least squares, heteroscedasticity-robust standard errors, or transform the dependent variable.
- **Multicollinearity**: Drop or combine highly correlated features.

---

### **Key Notes**
1. The **OLS summary output** provides coefficients, p-values, and R-squared, which help interpret the model.
2. The residual plots and VIF analysis help detect violations of the OLS assumptions; they do not guarantee that the assumptions hold.
3. By using a subset of features, we simplify interpretation while focusing on meaningful predictors.

This code demonstrates how to perform regression and validate assumptions using a real-world dataset.

In [1]:
# !pip install scikit-learn statsmodels matplotlib seaborn
!pip install scikit-learn






In [3]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
california = fetch_california_housing()
data = pd.DataFrame(california.data, columns=california.feature_names)
data['MedHouseVal'] = california.target  # Target variable

# Display first few rows
data.head()


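|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|-------------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |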


In [4]:
X = data[['MedInc', 'AveRooms', 'AveOccup']]  # Selecting a few features
y = data['MedHouseVal']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
# Add a constant term to the features
X_train_sm = sm.add_constant(X_train)

# Fit the model
ols_model = sm.OLS(y_train, X_train_sm).fit()

# Print summary
print(ols_model.summary())

                            OLS Regression Results                            
Dep. Variable:            MedHouseVal   R-squared:                       0.485
Model:                            OLS   Adj. R-squared:                  0.484
Method:                 Least Squares   F-statistic:                     5173.
Date:                Wed, 04 Dec 2024   Prob (F-statistic):               0.00
Time:                        20:17:48   Log-Likelihood:                -20354.
No. Observations:               16512   AIC:                         4.072e+04
Df Residuals:                   16508   BIC:                         4.075e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.6081      0.018     33.134      0.0

In [8]:
# Add constant to test set
X_test_sm = sm.add_constant(X_test)

# Predict on the test set
y_pred = ols_model.predict(X_test_sm)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

Mean Squared Error: 0.700685591222525
R-squared: 0.4652924370503556
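
The MSE is in squared target units, which makes it hard to read directly. Taking the square root gives the RMSE in the same units as `MedHouseVal`, which `sklearn` documents as hundreds of thousands of dollars. A minimal sketch using the MSE printed above:

```python
import math

mse = 0.700685591222525  # test-set MSE printed above
rmse = math.sqrt(mse)    # same units as the target ($100,000s)

print(f"RMSE: {rmse:.3f} (about ${rmse * 100_000:,.0f})")
```

Read this way, the three-feature model's typical prediction error is large relative to the house values shown in the first rows of the data, which is consistent with a test R-squared below 0.5.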
