# Intro

Regression is a technique to determine the relationship between two or more variables.

You can apply regression to scenarios that require prediction or causal inference.

You can use regression to understand the extent to which the area of a house affects the housing prices.

To regress means to predict one variable from another.

What can regression show?
 - Regression can show how one variable varies with respect to another variable.

 - For example, the price of a wine bottle can vary depending on the average growing season temperature.

What can regression not show?
 - Regression on its own cannot establish a causal relationship between two variables.

 - For example, if the area of a house is the independent variable and the price of the house is the dependent variable, you cannot conclude from the regression alone that a larger area causes a higher price.


Correlation is a measure that describes the strength of the relationship between two variables.

Regression describes this relationship in more detail.

**ERRORS**

The simple linear regression model is y = B0 + B1 x + e, where:

y - Dependent variable

x - Independent variable

e - Error measure

B0 and B1 - Parameters that best fit the model

The actual values are scattered and the predicted values are along the line.

The difference between actual and predicted values gives the error. This is also called the residual error (e).

The parameters (B0 and B1) are chosen to minimize the total squared error between the actual and predicted values.
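As a minimal sketch of how such parameters can be found, the closed-form least-squares estimates for a single independent variable can be computed directly with NumPy. The price and sales figures are the ones used in the regression example in this document; the variable names (`x`, `y`, `b0`, `b1`, `residuals`) are chosen here for illustration only.

```python
import numpy as np

# House prices (x) and number of houses sold (y) from this document's example
x = np.array([160, 180, 200, 220, 240, 260, 280], dtype=float)
y = np.array([126, 103, 82, 75, 82, 40, 20], dtype=float)

# Closed-form least-squares estimates for the line y = b0 + b1 * x
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Residual errors: actual values minus the values predicted by the line
residuals = y - (b0 + b1 * x)

print(f"B0 = {b0:.5f}, B1 = {b1:.7f}")
```

The estimates should agree with the line of best fit reported in the Model Interpretation section (B0 of about 249.857 and B1 of about -0.793).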

You have seen how to fit a model that best describes the data. However, you can never get a perfect fit.

**How will you measure the error/deviation in a model that is fit to the data?**


**SSE**

Sum of Squared Errors (SSE) is a measure of the quality of the regression line.

If there are n data points, then the SSE is the sum of the squares of the n residual errors.

SSE is small for the line of best fit and large for the baseline model.

The line with the minimum SSE is the regression line. SSE is sometimes difficult to interpret because:

 - It depends on the number of values (n).
 - Its units are the square of the units of the dependent variable, which are hard to comprehend.

So, is there a better way to gauge the quality of the regression model?

**RMSE**

At times, the SSE is difficult to interpret and the units are difficult to comprehend. So, an alternative measure of quality is the Root Mean Square Error (RMSE).

RMSE is the square root of SSE divided by the number of observations (n), that is, RMSE = sqrt(SSE / n). This brings the error back to the same units as the dependent variable.
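The two error measures can be sketched in a few lines of NumPy. This is an illustrative example, assuming the house price and sales data from the regression example in this document; `np.polyfit` is used here only as a convenient way to obtain the least-squares line.

```python
import numpy as np

# House prices (x) and number of houses sold (y) from this document's example
x = np.array([160, 180, 200, 220, 240, 260, 280], dtype=float)
y = np.array([126, 103, 82, 75, 82, 40, 20], dtype=float)

# Fit the least-squares line; for degree 1, np.polyfit returns [slope, intercept]
b1, b0 = np.polyfit(x, y, 1)
y_pred = b0 + b1 * x

sse = np.sum((y - y_pred) ** 2)   # Sum of Squared Errors
rmse = np.sqrt(sse / len(y))      # Root Mean Square Error, in the units of y

print(f"SSE = {sse:.2f}, RMSE = {rmse:.2f}")
```

Because RMSE is in the units of y (houses sold), it is easier to compare against the data than SSE.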


**Best Model Vs Baseline Model**


The baseline model predicts the average value of y for every observation.

The SSE of the baseline model is the Total Sum of Squares (SST).

RSquare = 1 - ((SSE) / (SST))
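A minimal sketch of this formula, assuming the house price and sales data from the regression example in this document (the names `sse`, `sst` and `r_square` are illustrative):

```python
import numpy as np

# House prices (x) and number of houses sold (y) from this document's example
x = np.array([160, 180, 200, 220, 240, 260, 280], dtype=float)
y = np.array([126, 103, 82, 75, 82, 40, 20], dtype=float)

# SSE of the least-squares line
b1, b0 = np.polyfit(x, y, 1)
sse = np.sum((y - (b0 + b1 * x)) ** 2)

# SST: the SSE of the baseline model, which predicts the average of y everywhere
sst = np.sum((y - y.mean()) ** 2)

r_square = 1 - sse / sst
print(f"SST = {sst:.2f}, RSquare = {r_square:.3f}")
```

The result should agree with the R-squared of 0.911 reported in the statsmodels summary for this data.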


**R Square(R Sq) Properties**


SSE and SST are sums of squares, so they can never be negative.

For a least-squares line with an intercept, evaluated on the data it was fit to, R Sq lies between 0 and 1.

R Sq is a unitless quantity.

R Sq = 0 means the model is only as good as the baseline and there is no improvement over the baseline model.

R Sq = 1 means it is a perfect model. Ideally, you should strive to get the R Sq close to 1. But some models with a low R Sq are also accepted, depending on the scenario.
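The two boundary cases can be checked with a small helper. This is a sketch; the function name `r_squared` is hypothetical and the sales figures are taken from the example in this document.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Return 1 - SSE/SST for the given actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - sse / sst

y = np.array([126, 103, 82, 75, 82, 40, 20], dtype=float)

# Baseline model: predicts the average everywhere, so SSE equals SST
r2_baseline = r_squared(y, np.full_like(y, y.mean()))

# Perfect model: predicts every value exactly, so SSE is zero
r2_perfect = r_squared(y, y)

print(r2_baseline, r2_perfect)
```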


**Model Interpretation**


This is the equation for the line of best fit:

y = 249.85714 - 0.7928571x

For a unit increase in x, y decreases by about 0.793.

Here, for a unit increase in the price of a house, about 0.793 fewer houses are sold.

B0 is 249.85714

B1 is -0.7928571
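To make the interpretation concrete, the fitted line can be used to predict sales at two prices that differ by one unit. This is a small sketch; the helper name `predict_sales` is hypothetical.

```python
# Coefficients of the line of best fit reported above
B0 = 249.85714
B1 = -0.7928571

def predict_sales(price):
    """Predicted number of houses sold at the given price."""
    return B0 + B1 * price

# Two prices one unit apart: the predictions differ by exactly B1
for price in (200, 201):
    print(price, round(predict_sales(price), 3))

print("change per unit of price:", predict_sales(201) - predict_sales(200))
```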

# Regression

In [3]:
import pandas as pd

# House prices (x) and number of houses sold (y)
price = [160,180,200,220,240,260,280]
sale = [126,103,82,75,82,40,20]

# One single-column DataFrame per variable
priceDF = pd.DataFrame(price, columns=['x'])
saleDF = pd.DataFrame(sale, columns=['y'])

# Combine the two columns side by side
houseDf = pd.concat((priceDF, saleDF),axis=1)

print(houseDf)
print(priceDF)

     x    y
0  160  126
1  180  103
2  200   82
3  220   75
4  240   82
5  260   40
6  280   20
     x
0  160
1  180
2  200
3  220
4  240
5  260
6  280


In [4]:
# statsmodels can take input in the R style (a formula plus a dataframe) or as arrays.
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Fit y as a linear function of x using the formula interface
smfModel = smf.ols('y~x',data=houseDf).fit()
print(smfModel.summary())


                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.911
Model:                            OLS   Adj. R-squared:                  0.893
Method:                 Least Squares   F-statistic:                     50.93
Date:                Tue, 01 Nov 2022   Prob (F-statistic):           0.000838
Time:                        01:08:45   Log-Likelihood:                -26.006
No. Observations:                   7   AIC:                             56.01
Df Residuals:                       5   BIC:                             55.90
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    249.8571     24.841     10.058      0.0



**Understanding the Output**

Dep. Variable: The Dependent Variable

Model: Algorithm used. Here, it is Ordinary Least Squares

Method: Parameter Fitting method. Here, it is Least Squares

No. Observations: Number of rows used for model fitting.

Df Residuals: The degrees of freedom of the residuals (the number of observations minus the number of estimated parameters; here 7 - 2 = 5).

Df Model: The degrees of freedom of the model (the number of parameters estimated in the model, excluding the constant term).

R-squared: Measure of how well the model performs relative to the baseline model.

In [5]:
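# Data prep
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release to run as written.

from sklearn.datasets import fetch_california_housing
from sklearn.datasets import load_boston
import pandas as pd
boston = load_boston()
california = fetch_california_housing()
# Build a DataFrame of the Boston features and append the target column
dataset = pd.DataFrame(boston.data, columns=boston.feature_names)
dataset['target'] = boston.target
print(dataset.head())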


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  \
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0   
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0   
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0   
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0   
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0   

   PTRATIO       B  LSTAT  target  
0     15.3  396.90   4.98    24.0  
1     17.8  396.90   9.14    21.6  
2     17.8  392.83   4.03    34.7  
3     18.7  394.63   2.94    33.4  
4     18.7  396.90   5.33    36.2  


# Multiple Linear Regression

The MLR model is represented as y = B0 + B1 x1 + B2 x2 + ... + Bk xk + e, where:

y - Dependent variable

x1, x2, ..., xk - Independent variables

e - Error measure

B0, B1, B2, ..., Bk - Parameters that best fit the model

**MLR**

Multiple Regression helps in predicting a single variable using multiple independent variables. Adding relevant variables can improve the accuracy of the model.

In practice, a given phenomenon (variable) is usually affected by more than one variable. In such cases, a Multiple Regression Model is the natural choice.

During the model fitting process, some variables will contribute significantly to the model and some might not. It is better to remove variables that are not significant to the model. So, how do we check whether a variable is significant for the output?

**Law of Diminishing Returns**

More variables can increase the accuracy of the model. But sometimes the incremental value of adding each new variable might decrease.

According to the Law of Diminishing Returns, the marginal improvement decreases as new variables are added.

For example,

 - When you include the x1 and x2 variables, the R Sq = 0.8.
 - When you add x3 to the model, the R Sq might become 0.85.
 - Finally, when you add x4 to this model, the R Sq might become 0.87.

In this process, the incremental value has reduced from 0.05 to 0.02.

EXAMPLE

Price(thousands of $) x

Sales of new homes y

Number of red cars z

Data Source : http://www.yale.edu/statlab


**MLR Equation**

The MLR equation is y = 252.85965 - 0.824935 x1 + 0.3592748 x2, where x1 is the price of a house and x2 is the number of cars sold.

The number of houses sold is a linear function of both the price of a house and the number of cars sold.

A unit increase in the number of cars sold increases the number of houses sold by about 0.36, holding the price constant.

A unit increase in the price of a house decreases the number of houses sold by about 0.82, holding the number of cars sold constant.

B0 is 252.85965

B1 is -0.824935

B2 is 0.3592748
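As a sketch of where these coefficients come from, the same least-squares fit can be reproduced with NumPy alone. The price, sales and car figures are the ones used in the code cells of this document; the design-matrix approach with `np.linalg.lstsq` is one of several equivalent ways to do this and is not necessarily how the original numbers were produced.

```python
import numpy as np

# Data from this document's example
price = np.array([160, 180, 200, 220, 240, 260, 280], dtype=float)  # x1
sales = np.array([126, 103, 82, 75, 82, 40, 20], dtype=float)       # y
cars = np.array([0, 9, 19, 5, 25, 1, 20], dtype=float)              # x2

# Design matrix: a column of ones (for B0), then price and cars
X = np.column_stack([np.ones_like(price), price, cars])

# Least-squares solution of X @ beta ~ sales
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
b0, b1, b2 = beta

print(f"y = {b0:.5f} + ({b1:.6f}) * x1 + ({b2:.7f}) * x2")
```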


**What is Multicollinearity?**

Multicollinearity happens when two or more independent variables in a Multiple Regression model are highly correlated with each other. This makes the coefficient estimates unstable and inflates their standard errors, so the individual effects become hard to interpret.

A common way to avoid multicollinearity is to omit one of the independent variables that is highly correlated with another. The variable to omit depends on how the variable behaves in the presence of other variables.


**Best Practices while Fitting MLR**

Determine the correlation matrix of all the independent variables.

Omit terms that have a high correlation with another term.

Remove the terms that do not predict the output significantly.

In [6]:
# Input Data Load
# Let us consider the dataset from the previous topic:
# price of the house, number of units sold, and number of cars sold.
# Let us create a dataframe from the lists using the following code.

import pandas as pd
price = [160,180,200,220,240,260,280]
sale = [126,103,82,75,82,40,20]
cars = [0,9,19,5,25,1,20]
priceDF = pd.DataFrame(price, columns=list('x'))
saleDF = pd.DataFrame(sale, columns=list('y'))
carsDf = pd.DataFrame(cars, columns=list('z'))
houseDf = pd.concat([priceDF,saleDF,carsDf],axis=1)

In [7]:
# Fitting the Model
# Here we fit the model by giving the dependent (number of units sold) and independent variables (price of the house, number of cars sold).

X = houseDf.drop(['y'], axis=1)
y = houseDf.y
Xc = sm.add_constant(X)
linear_regression = sm.OLS(y,Xc)
fitted_model = linear_regression.fit()
fitted_model.summary()



| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.919 |
| Model: | OLS | Adj. R-squared: | 0.879 |
| Method: | Least Squares | F-statistic: | 22.74 |
| Date: | Tue, 01 Nov 2022 | Prob (F-statistic): | 0.00654 |
| Time: | 01:08:48 | Log-Likelihood: | -25.654 |
| No. Observations: | 7 | AIC: | 57.31 |
| Df Residuals: | 4 | BIC: | 57.15 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 252.8597 | 26.812 | 9.431 | 0.001 | 178.417 | 327.302 |
| x | -0.8249 | 0.128 | -6.445 | 0.003 | -1.180 | -0.470 |
| z | 0.3593 | 0.552 | 0.650 | 0.551 | -1.174 | 1.893 |

| | | | |
|---|---|---|---|
| Omnibus: | | Durbin-Watson: | 1.646 |
| Prob(Omnibus): | | Jarque-Bera (JB): | 0.407 |
| Skew: | 0.546 | Prob(JB): | 0.816 |
| Kurtosis: | 2.549 | Cond. No. | 1270.0 |


**Interpreting the Coef of the Model**

The final equation is

y = 252.8597 - 0.8249 * x + 0.3593 * z

The P(>|t|) value for each parameter is:

Constant term (252.8597): 0.001, meaning this term is significant in predicting the output.

x, house price (-0.8249): 0.003, so this term is also significant in predicting the output.

z, car sales (0.3593): 0.551, so this term is not significant in predicting the output.

Because z is not significant, it is a candidate for removal from the model.



**Interpreting the terms**

The coef column gives the values of the estimated coefficients (B0, B1, B2, etc.).

If a coefficient is zero, that independent variable has no linear effect on the dependent variable in the model.

Std err estimates how much each coefficient estimate would vary from sample to sample.

t-value = Estimated coef / std err

P(>|t|) is the probability of seeing a t-value at least this extreme if the true coefficient were zero.

- This value indicates how significant a variable is to the model.
- The smaller the value, the more significant a given variable is to the model.
- It is better to remove variables with higher values of `P(>|t|)`.
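The coef, std err and t columns can be reproduced by hand, which may help in reading the summary table. The sketch below uses the standard ordinary least squares formulas with NumPy on the house data from this document; it does not compute P(>|t|), which would additionally need the t distribution (for example from SciPy).

```python
import numpy as np

# Data from this document's example
price = np.array([160, 180, 200, 220, 240, 260, 280], dtype=float)  # x
sales = np.array([126, 103, 82, 75, 82, 40, 20], dtype=float)       # y
cars = np.array([0, 9, 19, 5, 25, 1, 20], dtype=float)              # z

X = np.column_stack([np.ones_like(price), price, cars])
n, p = X.shape  # number of observations, number of estimated parameters

# coef: solve the normal equations (X'X) beta = X'y
beta = np.linalg.solve(X.T @ X, X.T @ sales)

# Residual variance estimate uses the residual degrees of freedom (n - p)
residuals = sales - X @ beta
sigma2 = residuals @ residuals / (n - p)

# std err: square roots of the diagonal of sigma2 * (X'X)^-1
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# t-value = estimated coef / std err
t_values = beta / std_err

for name, b, se, t in zip(["const", "x", "z"], beta, std_err, t_values):
    print(f"{name:>5}  coef={b:9.4f}  std err={se:7.3f}  t={t:7.3f}")
```

The printed values should agree, up to rounding, with the coef, std err and t columns of the summary table for this model.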


**MLR Model Building**

- Consider that for a given dependent variable y, there are 4 independent variables x1,x2,x3 and x4 that affect the outcome. A possible way of building a Multiple Regression Model is to first use each independent variable separately against the dependent variable and measure the R-squared value.
- Another way of doing this is by incrementally adding each independent variable and measuring the R-squared value for each combination.
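Both approaches can be sketched on the house data from this document, which has two independent variables rather than four. The helper `r_squared` is a hypothetical name; it fits a least-squares model with a constant and returns 1 - SSE/SST.

```python
import numpy as np

def r_squared(columns, y):
    """R-squared of a least-squares fit of y on the given columns plus a constant."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - sse / sst

# Data from this document's example
price = np.array([160, 180, 200, 220, 240, 260, 280], dtype=float)  # x
sales = np.array([126, 103, 82, 75, 82, 40, 20], dtype=float)       # y
cars = np.array([0, 9, 19, 5, 25, 1, 20], dtype=float)              # z

# Each independent variable on its own
r2_x = r_squared([price], sales)
r2_z = r_squared([cars], sales)

# Incrementally adding z to the model that already contains x
r2_xz = r_squared([price, cars], sales)

print(f"x only: {r2_x:.3f}, z only: {r2_z:.3f}, x and z: {r2_xz:.3f}")
```

On this data, price alone explains most of the variation, and adding the number of cars raises R-squared only slightly, which is in line with the two summaries shown in this document (0.911 and 0.919).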


**Handling Multicollinearity**

- A good practice while fitting a multiple regression model is to check whether there is any correlation among the independent variables.
- In Python, for a pandas DataFrame X, the command to find the correlation matrix is X.corr().

**Tips**

Choose the coefficients with a low P(>|t|) value.

Reject a variable whose correlation with any other variable falls outside the range -0.7 to 0.7.
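The correlation check can be sketched with pandas as follows. The data are the independent variables from the house example in this document; the names `features`, `threshold` and `high_pairs` are illustrative, and 0.7 is the cut-off suggested in the tip above.

```python
import pandas as pd

# Independent variables from this document's example
features = pd.DataFrame({
    "x": [160, 180, 200, 220, 240, 260, 280],  # price of the house
    "z": [0, 9, 19, 5, 25, 1, 20],             # number of cars sold
})

corr = features.corr()  # pairwise Pearson correlation matrix
threshold = 0.7

# Collect pairs of distinct variables whose absolute correlation exceeds the threshold
high_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

print(corr.round(3))
print("Highly correlated pairs:", high_pairs)
```

Here the correlation between x and z is well inside the range, so neither variable would be rejected on multicollinearity grounds.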




In [12]:
# Data Prep
# Now that you have seen how to deal with multiple variables and perform multiple
# regression, consider the dataset created using the following code for further practice.
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release.

from sklearn.datasets import fetch_california_housing
from sklearn.datasets import load_boston
boston = load_boston()
california = fetch_california_housing()
dataset = pd.DataFrame(boston.data, columns=boston.feature_names)
dataset['target'] = boston.target

import statsmodels.api as sm
dataset.iloc[:,:-1].corr()


| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CRIM | 1.0 | -0.200469 | 0.406583 | -0.055892 | 0.420972 | -0.219247 | 0.352734 | -0.37967 | 0.625505 | 0.582764 | 0.289946 | -0.385064 | 0.455621 |
| ZN | -0.200469 | 1.0 | -0.533828 | -0.042697 | -0.516604 | 0.311991 | -0.569537 | 0.664408 | -0.311948 | -0.314563 | -0.391679 | 0.17552 | -0.412995 |
| INDUS | 0.406583 | -0.533828 | 1.0 | 0.062938 | 0.763651 | -0.391676 | 0.644779 | -0.708027 | 0.595129 | 0.72076 | 0.383248 | -0.356977 | 0.6038 |
| CHAS | -0.055892 | -0.042697 | 0.062938 | 1.0 | 0.091203 | 0.091251 | 0.086518 | -0.099176 | -0.007368 | -0.035587 | -0.121515 | 0.048788 | -0.053929 |
| NOX | 0.420972 | -0.516604 | 0.763651 | 0.091203 | 1.0 | -0.302188 | 0.73147 | -0.76923 | 0.611441 | 0.668023 | 0.188933 | -0.380051 | 0.590879 |
| RM | -0.219247 | 0.311991 | -0.391676 | 0.091251 | -0.302188 | 1.0 | -0.240265 | 0.205246 | -0.209847 | -0.292048 | -0.355501 | 0.128069 | -0.613808 |
| AGE | 0.352734 | -0.569537 | 0.644779 | 0.086518 | 0.73147 | -0.240265 | 1.0 | -0.747881 | 0.456022 | 0.506456 | 0.261515 | -0.273534 | 0.602339 |
| DIS | -0.37967 | 0.664408 | -0.708027 | -0.099176 | -0.76923 | 0.205246 | -0.747881 | 1.0 | -0.494588 | -0.534432 | -0.232471 | 0.291512 | -0.496996 |
| RAD | 0.625505 | -0.311948 | 0.595129 | -0.007368 | 0.611441 | -0.209847 | 0.456022 | -0.494588 | 1.0 | 0.910228 | 0.464741 | -0.444413 | 0.488676 |
| TAX | 0.582764 | -0.314563 | 0.72076 | -0.035587 | 0.668023 | -0.292048 | 0.506456 | -0.534432 | 0.910228 | 1.0 | 0.460853 | -0.441808 | 0.543993 |


In [13]:
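# No constant column is added here, so the model is fit without an intercept
# and statsmodels reports the uncentered R-squared.
fitted = sm.OLS(dataset.target, dataset.iloc[:,:-1]).fit()
print(fitted.summary())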

                                 OLS Regression Results                                
Dep. Variable:                 target   R-squared (uncentered):                   0.959
Model:                            OLS   Adj. R-squared (uncentered):              0.958
Method:                 Least Squares   F-statistic:                              891.3
Date:                Tue, 01 Nov 2022   Prob (F-statistic):                        0.00
Time:                        01:45:15   Log-Likelihood:                         -1523.8
No. Observations:                 506   AIC:                                      3074.
Df Residuals:                     493   BIC:                                      3128.
Df Model:                          13                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

# Model Improvement

**Occam's razor**

- When you have two Multiple Regression Models fit for a given data set, and one is simple and the other is complex, choose the simple model if both explain the data about equally well.
- Whenever you are in a model-building exercise, start with a simple model and then build complexity on top of it.
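One way to make such a comparison concrete is the adjusted R-squared, which appears in the statsmodels summaries in this document and penalizes each extra predictor. The sketch below applies the standard adjusted R-squared formula to the R-squared values reported for the two house models; the function name `adjusted_r_squared` is hypothetical, and this is offered as an illustration rather than as the only criterion for choosing a model.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared: R-squared penalized for the number of predictors."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# R-squared values as reported in the two summaries in this document (7 observations)
simple_model = adjusted_r_squared(0.911, n_obs=7, n_predictors=1)    # y ~ x
complex_model = adjusted_r_squared(0.919, n_obs=7, n_predictors=2)   # y ~ x + z

print(f"simple: {simple_model:.4f}, complex: {complex_model:.4f}")
```

Although the model with two predictors has the higher R-squared, its adjusted R-squared is lower, which favors the simpler model on this data.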

**TIP**

Do not discard theoretical considerations based on statistical measures.

**Feature Scaling**

- Your data set might contain features (independent variables, or columns) with very different magnitudes. Bringing them to a common scale makes them easier to work with. This process is called feature scaling.
- You can achieve feature scaling with either Normalization or Standardization, depending on the magnitude of the variables.

In [14]:
# Normalization
# preprocessing.normalize re-scales each sample (row) to unit norm (L2 norm by default),
# so every value ends up in the range [-1, 1].
# Python has ready-made packages for re-scaling the data
from sklearn import preprocessing
import numpy as np
sampleData = np.array([[ -3., -1.,  4.]])
normalized_sampleData = preprocessing.normalize(sampleData)
normalized_sampleData

array([[-0.58834841, -0.19611614,  0.78446454]])

In [16]:
# Standardization
# Standardization is the process of removing the arithmetic mean and dividing by the standard deviation.
# StandardScaler works column by column, so the five values are passed as one column (one feature).

from sklearn.preprocessing import StandardScaler
X = np.array([[1],[2],[3],[4],[5]])
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
rescaledX

array([[-1.41421356],
       [-0.70710678],
       [ 0.        ],
       [ 0.70710678],
       [ 1.41421356]])