<a href="https://colab.research.google.com/github/svgoudar/My-Data-Science-Roadmap/blob/main/ML/Supervised%20Learning/Regression/Linear%20Regression/6.Reggression%20with%20OLS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is a summary of the **GeeksforGeeks** article on **using Ordinary Least Squares (OLS) with the `statsmodels` library** in Python.

---

## Overview

**OLS regression** is a fundamental statistical method for estimating the parameters of a **linear regression model**. The article emphasizes that OLS works by **minimizing the sum of squared residuals**, where a residual is the vertical distance between an observed value and the model's prediction.

---

## Key Highlights

1. **Linear Regression Basics**

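   * OLS models the relationship between a dependent variable $y$ and one or more independent variables $x$. The fitted (predicted) value is:

     $$
     \hat{y} = b_0 + b_1 x + \dots
     $$

     where $b_0$ is the intercept and $b_1$ is the slope (coefficient) of $x$.
   * It chooses the coefficients that minimize the sum of squared residuals:

     $$
     S = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
     $$

     where $y_i$ is the observed value and $\hat{y}_i$ is the predicted value for observation $i$.
     ([GeeksforGeeks][1])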

2. **Using `statsmodels` for OLS**

   * You set up your model using `sm.OLS()`, fit it, and then call `.summary()` to view detailed outputs.
   * This includes coefficients, statistical significance (p-values), R-squared, and other performance diagnostics.
     ([GeeksforGeeks][2])

3. **Interpreting the Summary**

   * The summary provides essential model metrics:

     * **Coefficients**: estimated intercept and slopes.
     * **R-squared / Adjusted R-squared**: how well the model explains target variance.
     * **p-values**: statistical significance tests for each predictor.
     * **F-statistic**: overall model significance.
     * **Standard Errors**: measure of estimation precision.
       ([GeeksforGeeks][2])

4. **Why It Matters**

   * OLS in `statsmodels` is not only easy to implement, but also provides deep statistical insights through hypothesis testing and diagnostics, making it ideal for exploratory analysis and model evaluation.

---

### Summary Table

| Step                  | Action                                    |
| --------------------- | ----------------------------------------- |
| Define model          | Use `sm.OLS()` with predictors and target |
| Fit model             | `.fit()` method                           |
| Evaluate output       | `.summary()` gives detailed metrics       |
| Interpret key results | Coefficients, R-squared, p-values, etc.   |


```python
import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import fetch_california_housing

# 1️⃣ Load dataset
data = fetch_california_housing(as_frame=True)
df = data.frame

# Let's pick some features
X = df[["MedInc", "AveRooms", "AveOccup"]]  # Independent variables
y = df["MedHouseVal"]  # Dependent variable

# 2️⃣ Add a constant (intercept) for OLS
X = sm.add_constant(X)  # Adds column of ones for b0

# 3️⃣ Create and fit OLS model
model = sm.OLS(y, X).fit()

# 4️⃣ View summary
print(model.summary())
```


```text
                            OLS Regression Results                            
Dep. Variable:            MedHouseVal   R-squared:                       0.481
Model:                            OLS   Adj. R-squared:                  0.481
Method:                 Least Squares   F-statistic:                     6370.
Date:                Sun, 10 Aug 2025   Prob (F-statistic):               0.00
Time:                        15:26:04   Log-Likelihood:                -25477.
No. Observations:               20640   AIC:                         5.096e+04
Df Residuals:                   20636   BIC:                         5.099e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.6069      0.016     37.444      0.0
```

The fields of the `summary()` output are explained below, using the regression above as the example.

### Model Summary
This section gives you a high-level view of how well the overall model performs.

* **Dep. Variable: MedHouseVal**: This is the dependent variable you're trying to predict—in this case, the median house value.
* **R-squared ($R^2$): 0.481**: This indicates that approximately **48.1%** of the variation in `MedHouseVal` can be explained by the independent variables in your model (`MedInc`, `AveRooms`, and `AveOccup`).
* **Adj. R-squared ($R^2_{adj}$): 0.481**: This is a modified $R^2$ that penalizes additional predictors. It matches the regular $R^2$ to three decimals because the penalty for 3 predictors is negligible with 20,640 observations, so on its own it says little about whether each individual predictor is useful.
* **F-statistic: 6370.**: This is a test for the overall significance of the model. The large value suggests that at least one of your independent variables is significantly related to the dependent variable.
* **Prob (F-statistic): 0.00**: The p-value for the F-statistic. It is displayed as 0.00 because it is smaller than the printed precision, not because it is exactly zero. Being far below 0.05, it indicates that the model as a whole is statistically significant.
* **No. Observations: 20640**: This is the number of data points used to fit the model.
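
The relationship between these fields can be checked with a few lines of arithmetic. This sketch uses only the rounded values printed in the summary, so the recomputed F-statistic agrees with the reported one only up to rounding; the variable names are illustrative.

```python
# Values as printed in the summary above (R-squared is rounded to 3 decimals)
n = 20640    # No. Observations
k = 3        # Df Model (number of predictors, excluding the constant)
r2 = 0.481   # R-squared

# Residual degrees of freedom: observations minus estimated parameters
df_resid = n - k - 1

# Adjusted R-squared shrinks R-squared by the factor (n - 1) / (n - k - 1)
adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid

# F-statistic: explained variance per predictor over
# unexplained variance per residual degree of freedom
f_stat = (r2 / k) / ((1 - r2) / df_resid)

print("Df Residuals:", df_resid)             # 20636, as in the summary
print("Adj. R-squared:", round(adj_r2, 3))   # 0.481
print("F-statistic (from rounded R-squared):", round(f_stat))
```

With more than 20,000 observations and only 3 predictors, the adjustment factor is about 1.0001, which is why the two $R^2$ values are indistinguishable at three decimals.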

---

### Coefficients Table
This table details the individual impact of each predictor on the dependent variable.

* **const (coefficient): 0.6069**: This is the y-intercept. It's the predicted value of `MedHouseVal` when all the independent variables are zero, a combination that does not occur in the data, so it should be read as an anchor for the fitted plane rather than a realistic prediction.
* **MedInc (coefficient): 0.4347**: For every one-unit increase in the median income, the median house value is predicted to increase by **0.4347**, holding all other variables constant. In this dataset `MedInc` is expressed in tens of thousands of dollars and `MedHouseVal` in hundreds of thousands of dollars, so this corresponds to roughly $43,470 per additional $10,000 of median income.
* **AveRooms (coefficient): -0.0383**: For every one-unit increase in the average number of rooms, the median house value is predicted to decrease by **0.0383**, holding all other variables constant.
* **AveOccup (coefficient): -0.0042**: For every one-unit increase in the average number of household members, the median house value is predicted to decrease by **0.0042**, holding all other variables constant.
* **t-statistic**: For each predictor, this value helps determine if the coefficient is statistically different from zero. For example, `MedInc` has a very high t-statistic (**134.806**), which is strong evidence that its coefficient is significant.
* **P > |t|**: The p-value for each coefficient. All of the predictors have a p-value displayed as **0.000**, meaning it is below the printed precision and much less than the typical significance level of 0.05. This means all three predictors are statistically significant in explaining `MedHouseVal`.
* **[0.025, 0.975]**: This is the 95% confidence interval for each coefficient. For `MedInc`, the interval is `[0.428, 0.441]`. Since this range does not include zero, it reinforces that the relationship is statistically significant.
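
The columns of the coefficients table are tied together by simple formulas: the t-statistic is the coefficient divided by its standard error, and the confidence interval is the coefficient plus or minus a critical value times the standard error. The sketch below recovers the `MedInc` interval from the values quoted above. It assumes the normal critical value 1.96, which is a close approximation to the t critical value at 20,636 residual degrees of freedom.

```python
# Values quoted above for MedInc
coef = 0.4347
t_stat = 134.806

# t = coef / std_err, so the standard error can be backed out
std_err = coef / t_stat

# Assumed critical value: 1.96 (normal approximation for a 95% interval)
z_crit = 1.96
lower = coef - z_crit * std_err
upper = coef + z_crit * std_err

print(f"95% CI: [{lower:.3f}, {upper:.3f}]")  # 95% CI: [0.428, 0.441]
```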

---

### Diagnostics Table
This section helps you assess if the underlying assumptions of linear regression are met.

* **Omnibus: 4836.746**: This tests the normality of the residuals (the differences between the actual and predicted values). A large value suggests that the residuals are not normally distributed. Normality is not needed for the OLS coefficient estimates themselves, but exact t- and F-based inference assumes it; with 20,640 observations those tests remain approximately valid.
* **Prob(Omnibus): 0.000**: A p-value this low is strong evidence that the residuals are not normally distributed.
* **Durbin-Watson: 0.693**: This tests for autocorrelation (correlation between consecutive residuals). A value close to 2 indicates no autocorrelation, and values toward 0 indicate positive autocorrelation. The value of **0.693** means residuals of neighbouring rows are positively correlated. Because this dataset is cross-sectional rather than a time series, the statistic depends on row order, so the result may reflect similar rows being stored next to each other or unobserved variables rather than a missing time-series component.
* **Jarque-Bera (JB): 12992.753**: Another test for normality, based on the skewness and kurtosis of the residuals. A high value is further evidence that the residuals are not normally distributed.
* **Skew: 1.256**: This measures the asymmetry of the residuals. A positive skew indicates that the residuals are skewed to the right.
* **Kurtosis: 5.965**: This measures how heavy the tails of the residual distribution are. A value significantly higher than 3 (the kurtosis of a normal distribution) indicates that the residuals have "fat tails," meaning there are more extreme values than you'd expect in a normal distribution.
* **Cond. No.: 31.4**: This checks for multicollinearity (when independent variables are highly correlated) and other numerical problems. A common rule of thumb treats values above 30 as a possible concern, but the number is sensitive to the scale of the predictors. A value of **31.4** is only marginally above that threshold, so it points to mild multicollinearity at most.

This output indicates that the model is statistically significant and that each predictor is individually significant, although an $R^2$ of 0.481 leaves about half of the variance in `MedHouseVal` unexplained. The diagnostics section also reveals potential issues, such as non-normal residuals and positive autocorrelation, that you may need to address for a more robust model.