# Simple Linear Regression

In [1]:
# import numpy and pandas
import numpy as np
import pandas as pd

# import linear_model and datasets from sklearn
from sklearn import linear_model, datasets
from sklearn.datasets import fetch_california_housing

## Loading Data

In this activity, we will work with the “California housing dataset”. This dataset can be downloaded from the internet using the `scikit-learn` module.

Once the dataset is loaded, we display the `housing` object to inspect its contents. `housing` is a dictionary-like object (a scikit-learn `Bunch`). We will then turn the data into our independent variables (X) and dependent variable (y).

In [2]:
# Load data
housing = fetch_california_housing()
housing

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n

In [3]:
X = pd.DataFrame(housing['data'],columns = housing['feature_names'])
y = pd.Series(housing['target'],name=housing['target_names'][0])

We print the shape of X, and inspect the top 5 rows.

In [4]:
X.shape

(20640, 8)

In [5]:
X.head()

|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |


We also print the first 5 rows of y.

In [6]:
y.head()

0    4.526
1    3.585
2    3.521
3    3.413
4    3.422
Name: MedHouseVal, dtype: float64

## Linear Regression Models

Regression analysis is a set of statistical methods used to estimate the relationship between a dependent variable and one or more independent variables.

Linear regression is the most common form of regression analysis. It seeks the line that best fits the data according to a mathematical criterion, most commonly least squares (minimising the sum of squared differences between the observed and predicted values). In simple terms, it uses a straight line to describe the relationship between two variables: one independent variable and one dependent variable. So, if we want to estimate the value of one variable from the value of another, we can use simple linear regression.
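
As a minimal sketch of what "best fit" means under least squares, the slope and intercept for one predictor can be computed directly with NumPy. The data here is synthetic and purely illustrative (it is not the housing data), and the variable names are assumptions made for this example.

```python
import numpy as np

# Synthetic, hypothetical data: y is roughly 2*x + 1 plus random noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=200)

# Least-squares estimates for a single predictor:
#   slope     = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2)
#   intercept = mean(y) - slope * mean(x)
x_centered = x - x.mean()
slope = np.sum(x_centered * (y - y.mean())) / np.sum(x_centered ** 2)
intercept = y.mean() - slope * x.mean()

# Use the fitted line to estimate y for a new value of x.
y_hat = intercept + slope * 4.0
print(f"slope={slope:.3f}, intercept={intercept:.3f}, prediction at x=4: {y_hat:.3f}")
```

The estimates should land close to the values used to generate the data (slope 2, intercept 1), and `np.polyfit(x, y, deg=1)` should return the same line.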

When there are two or more independent variables, the model becomes a multiple linear regression (or multiple regression): the dependent variable is modelled as a linear function of several explanatory variables.

To summarise the difference between simple and multiple regression:

- Simple linear regression: there is just one x and one y variable.
- Multiple linear regression: there is one y variable and two or more x variables. 
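
The distinction in the list above can be sketched with NumPy's least-squares solver: the only change between the two cases is how many predictor columns go into the design matrix. The data is synthetic and the names (`x1`, `x2`, `A_simple`, `A_multi`) are hypothetical choices for this illustration.

```python
import numpy as np

# Synthetic, hypothetical data with two predictors.
rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 + 1.5 * x1 - 2.0 * x2 + rng.normal(0, 0.3, size=n)

# Simple linear regression: a column of ones (intercept) plus one x variable.
A_simple = np.column_stack([np.ones(n), x1])
coef_simple, *_ = np.linalg.lstsq(A_simple, y, rcond=None)

# Multiple linear regression: a column of ones plus two x variables.
A_multi = np.column_stack([np.ones(n), x1, x2])
coef_multi, *_ = np.linalg.lstsq(A_multi, y, rcond=None)

print("simple   (intercept, b1):    ", coef_simple)
print("multiple (intercept, b1, b2):", coef_multi)
```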

A simple linear regression model usually takes the form of: 
![simple linear regression](img/simple_lr.png)
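
In case the image does not render, the usual textbook form of the model is written out below. The symbols are conventional choices and may differ from those used in the image: $\beta_0$ is the intercept, $\beta_1$ the slope, and $\varepsilon_i$ the error term for observation $i$.

$$
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i
$$

Under least squares, the estimates that minimise the sum of squared residuals are:

$$
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
$$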


Given the formula above, several assumptions must hold for simple linear regression to be appropriate:

- Linear relationship: The independent variable, x, and the dependent variable, y, have a linear relationship.
- Independent residuals: The residuals are independent of one another. In particular, in time series data there is no correlation between consecutive residuals.
- Homoscedasticity: At every value of x, the residuals have the same variance.
- Normality: The model’s residuals are normally distributed.


If any of these assumptions is violated, the results of a linear regression can be inaccurate or even misleading.
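
The assumptions above can be checked informally by looking at the residuals of a fitted line. The sketch below uses synthetic data that satisfies the assumptions by construction; the specific checks (lag-1 correlation, a variance ratio between two halves of the data, skewness and excess kurtosis) are rough illustrative diagnostics chosen for this example, not formal statistical tests.

```python
import numpy as np

# Synthetic, hypothetical data that satisfies the assumptions by construction.
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, size=500))
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=500)

# Fit a straight line and compute the residuals (observed minus predicted).
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Independence: correlation between consecutive residuals should be near 0.
lag1_corr = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]

# Homoscedasticity: residual variance for small x vs. large x should be similar,
# so this ratio should be near 1.
half = len(x) // 2
variance_ratio = residuals[half:].var() / residuals[:half].var()

# Normality: skewness and excess kurtosis of the residuals should be near 0.
z = (residuals - residuals.mean()) / residuals.std()
skewness = np.mean(z ** 3)
excess_kurtosis = np.mean(z ** 4) - 3.0

print(f"lag-1 correlation: {lag1_corr:.3f}")
print(f"variance ratio:    {variance_ratio:.3f}")
print(f"skewness:          {skewness:.3f}")
print(f"excess kurtosis:   {excess_kurtosis:.3f}")
```

In practice, plotting the residuals against x (or against the fitted values) is often the quickest way to spot a curved pattern or a changing spread.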

## Creating a Simple Linear Regression Model (using `statsmodels` package)

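The `statsmodels` package fits an ordinary least squares (OLS) model and prints a detailed summary of the fit. Here we regress `MedHouseVal` on a single feature, `AveRooms`.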

In [7]:
import statsmodels.api as sm

In [8]:
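X = sm.add_constant(X)  # adds a 'const' column of ones to X
# Note: selecting only X['AveRooms'] leaves out the 'const' column, so this fits a
# model without an intercept (hence the "uncentered" R-squared in the summary).
# To include the intercept, pass X[['const', 'AveRooms']] instead.
lin_reg = sm.OLS(y, X['AveRooms'])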

In [9]:
model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                                 OLS Regression Results                                
Dep. Variable:            MedHouseVal   R-squared (uncentered):                   0.681
Model:                            OLS   Adj. R-squared (uncentered):              0.681
Method:                 Least Squares   F-statistic:                          4.411e+04
Date:                Fri, 05 Apr 2024   Prob (F-statistic):                        0.00
Time:                        21:25:18   Log-Likelihood:                         -35286.
No. Observations:               20640   AIC:                                  7.057e+04
Df Residuals:                   20639   BIC:                                  7.058e+04
Df Model:                           1                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------
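
The summary above reports an "uncentered" R-squared, which `statsmodels` uses when the model has no constant term: the residual sum of squares is compared against the sum of squared y values rather than against the variation of y around its mean. The NumPy sketch below, on synthetic data where y does not depend on x at all, illustrates why this number can look high for a line forced through the origin. All names and values here are hypothetical.

```python
import numpy as np

# Synthetic, hypothetical data: y has mean 2 and is unrelated to x.
rng = np.random.default_rng(3)
x = rng.uniform(4, 8, size=1000)
y = 2.0 + rng.normal(0, 1.0, size=1000)

# Least-squares fit through the origin (no intercept): b = sum(x*y) / sum(x**2)
b = np.sum(x * y) / np.sum(x ** 2)
ssr = np.sum((y - b * x) ** 2)  # residual sum of squares

# Uncentered R-squared compares against sum(y**2);
# centered R-squared compares against variation around the mean of y.
r2_uncentered = 1.0 - ssr / np.sum(y ** 2)
r2_centered = 1.0 - ssr / np.sum((y - y.mean()) ** 2)

print(f"uncentered R-squared: {r2_uncentered:.3f}")
print(f"centered R-squared:   {r2_centered:.3f}")
```

Because the two definitions use different baselines, an uncentered R-squared from a no-intercept model is not directly comparable to the R-squared of a model that includes a constant.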