# Intro

Regression is a technique to determine the relationship between two or more variables.

You can apply regression to prediction problems and, with a suitable study design, as one tool in causal inference.

For example, you can use regression to understand the extent to which the area of a house affects its price.

To regress y on x means to predict y from x.

What can regression show?
 - Regression can show how one variable varies with respect to another variable.

 - For example, the price of a wine bottle can vary depending on the average growing season temperature.

What can regression not show?
 - Regression on its own cannot establish a causal relationship between two variables.

 - For example, if the area of the house is the independent variable and the price of the house is the dependent variable, you cannot conclude from the fit alone that a larger area causes a higher price. The fit only shows that the two vary together.


Correlation is a measure that describes the strength of the relationship between two variables.

Regression goes further: it describes the form of that relationship, for example how much y changes for a unit change in x.
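To make the idea of "strength of relationship" concrete, here is a minimal sketch that computes the Pearson correlation coefficient in plain Python. It reuses the house price and sales figures from the code cells later in this document; the function name `pearson_r` is only illustrative.

```python
import math

# Example data reused from the house-price cells later in this document.
price = [160, 180, 200, 220, 240, 260, 280]   # x
sale = [126, 103, 82, 75, 82, 40, 20]         # y

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

r = pearson_r(price, sale)
print(round(r, 3))  # -0.954
```

A value close to -1 indicates a strong negative linear relationship: sales fall as the price rises. The coefficient says nothing about how steep the fall is; that is what the regression slope adds.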

**ERRORS**

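The simple linear regression model is y = B0 + B1x + e, where:

y - Dependent variable

x - Independent variable

e - Error measure

B0 and B1 - Parameters that best fit the model (B0 is the intercept, B1 is the slope)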

On a scatter plot, the actual values are scattered around the fitted line and the predicted values lie on the line.

The difference between actual and predicted values gives the error. This is also called the residual error (e).

The parameters (B0 and B1) are chosen to minimize the total squared error between the actual and predicted values.
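As a rough illustration of how the two parameters can be chosen, the sketch below uses the standard closed-form least-squares formulas for a single predictor. It is written in plain Python so that each step is visible; `fit_line` is a hypothetical helper name, and the data is the house example used later in this document.

```python
# Example data reused from the house-price cells later in this document.
price = [160, 180, 200, 220, 240, 260, 280]   # x
sale = [126, 103, 82, 75, 82, 40, 20]         # y

def fit_line(xs, ys):
    """Return (b0, b1) that minimize the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b1 = s_xy / s_xx              # slope
    b0 = mean_y - b1 * mean_x     # intercept: the line passes through the means
    return b0, b1

b0, b1 = fit_line(price, sale)
print(round(b0, 4), round(b1, 4))  # 249.8571 -0.7929
```

These values agree with the coefficients quoted in the Model Interpretation section.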

You have seen how to fit a model that best describes the data. However, you can never get a perfect fit.

**How will you measure the error/deviation in a model that is fit to the data?**


**SSE**

Sum of Squared Errors (SSE) is a measure of the quality of the regression line.

If there are n data points, then the SSE is the sum of the squares of the n residual errors.

SSE is small for the line of best fit and large for the baseline model.

The line with the minimum SSE is the regression line. SSE is sometimes difficult to interpret because:

 - It depends on the number of values (n).

 - Its units are the square of the units of y, which are hard to comprehend.
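The definition of SSE can be sketched in a few lines of plain Python. The data is the house example from later in this document, and the coefficients are the ones quoted in the Model Interpretation section; the variable names are illustrative.

```python
# Example data reused from the house-price cells later in this document.
price = [160, 180, 200, 220, 240, 260, 280]   # x
sale = [126, 103, 82, 75, 82, 40, 20]         # y

# Coefficients of the line of best fit quoted in this document.
B0, B1 = 249.85714, -0.7928571

predicted = [B0 + B1 * x for x in price]
residuals = [y - y_hat for y, y_hat in zip(sale, predicted)]

# SSE: square each residual, then add them up.
sse = sum(e ** 2 for e in residuals)
print(round(sse, 1))  # 691.1
```

The result is in squared units of y, and it grows as more data points are added, which is why the value is hard to judge on its own.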

So, is there a better way to gauge the quality of the regression model?

**RMSE**

Because SSE grows with the number of observations and is expressed in squared units, an alternative measure of quality is the Root Mean Square Error (RMSE).

RMSE is the square root of SSE divided by the number of observations (n). Dividing by n removes the dependence on sample size, and the square root brings the error back to the units of y.
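A minimal sketch of the RMSE calculation, again using the house data and the coefficients quoted in this document (variable names are illustrative):

```python
import math

# Example data reused from the house-price cells later in this document.
price = [160, 180, 200, 220, 240, 260, 280]   # x
sale = [126, 103, 82, 75, 82, 40, 20]         # y

# Coefficients of the line of best fit quoted in this document.
B0, B1 = 249.85714, -0.7928571

n = len(price)
sse = sum((y - (B0 + B1 * x)) ** 2 for x, y in zip(price, sale))

# RMSE: average the squared errors, then take the square root.
rmse = math.sqrt(sse / n)
print(round(rmse, 2))  # 9.94
```

Read this as: the fitted line is typically off by roughly 10 houses sold, which is easier to interpret than an SSE of about 691.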


**Best Model Vs Baseline Model**


The baseline model predicts the average value of y for every observation.

The SSE of the baseline model is called the Total Sum of Squares (SST).

R Sq = 1 - (SSE / SST)
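The formula can be checked by hand on the house data from later in this document. The sketch below is plain Python under the assumption that the model coefficients are the ones quoted in the Model Interpretation section.

```python
# Example data reused from the house-price cells later in this document.
price = [160, 180, 200, 220, 240, 260, 280]   # x
sale = [126, 103, 82, 75, 82, 40, 20]         # y

# Coefficients of the line of best fit quoted in this document.
B0, B1 = 249.85714, -0.7928571

mean_y = sum(sale) / len(sale)

# SSE: squared errors of the regression line.
sse = sum((y - (B0 + B1 * x)) ** 2 for x, y in zip(price, sale))

# SST: squared errors of the baseline model, which always predicts mean_y.
sst = sum((y - mean_y) ** 2 for y in sale)

r_sq = 1 - sse / sst
print(round(r_sq, 3))  # 0.911
```

This matches the R-squared value reported in the statsmodels summary for the same data.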


**R Square(R Sq) Properties**


SSE is never negative, and SST must be greater than zero for R Sq to be defined.

For a least-squares line with an intercept, R Sq lies between 0 and 1 on the data used to fit the model.

R Sq is a unitless quantity.

R Sq = 0 means the model is only as good as the baseline, with no improvement over it.

R Sq = 1 means it is a perfect model. Ideally, you should strive to get R Sq close to 1. However, models with a low R Sq, even close to 0, are sometimes accepted depending on the scenario.
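The two boundary cases can be demonstrated with a small helper. This is only a sketch: `r_squared` is a hypothetical function name, and the sales figures are the ones from the house example later in this document.

```python
def r_squared(actual, predicted):
    """R Sq = 1 - SSE / SST for a list of actual and predicted values."""
    mean_y = sum(actual) / len(actual)
    sse = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    sst = sum((y - mean_y) ** 2 for y in actual)
    return 1 - sse / sst

sale = [126, 103, 82, 75, 82, 40, 20]

# Baseline model: predict the average for every observation.
baseline = [sum(sale) / len(sale)] * len(sale)
r_sq_baseline = r_squared(sale, baseline)

# Perfect model: every prediction equals the actual value.
r_sq_perfect = r_squared(sale, sale)

print(r_sq_baseline, r_sq_perfect)  # 0.0 1.0
```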


**Model Interpretation**


This is the equation of the line of best fit for the house data (x is the price, y is the number of houses sold):

y = 249.85714 - 0.7928571x

For a unit increase in x, y decreases by about 0.793.

For a unit increase in the price of the house, about 0.793 fewer houses are sold.

B0 is 249.85714

B1 is -0.7928571
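To see the interpretation in action, the sketch below plugs prices into the fitted equation. `predict_sales` is an illustrative name, not part of any library.

```python
# Coefficients of the line of best fit quoted above.
B0 = 249.85714    # intercept
B1 = -0.7928571   # slope: change in y for a unit increase in x

def predict_sales(price):
    """Predicted number of houses sold at a given price."""
    return B0 + B1 * price

at_200 = predict_sales(200)
change_per_unit = predict_sales(201) - predict_sales(200)

print(round(at_200, 2))           # 91.29
print(round(change_per_unit, 3))  # -0.793
```

The difference between the predictions at 201 and 200 is the slope B1, whichever starting price is chosen. Predictions far outside the observed price range (160 to 280) should be treated with caution, since the line was fitted only to that range.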

# Regression

In [1]:
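import pandas as pd

# House prices (x) and the number of houses sold at each price (y)
price = [160, 180, 200, 220, 240, 260, 280]
sale = [126, 103, 82, 75, 82, 40, 20]

# One single-column DataFrame per variable
priceDF = pd.DataFrame({'x': price})
saleDF = pd.DataFrame({'y': sale})

# Place the two columns side by side
houseDf = pd.concat((priceDF, saleDF), axis=1)

print(houseDf)
print(priceDF)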

     x    y
0  160  126
1  180  103
2  200   82
3  220   75
4  240   82
5  260   40
6  280   20
     x
0  160
1  180
2  200
3  220
4  240
5  260
6  280


In [2]:
# statsmodels accepts R-style formulas with a DataFrame (used here) or plain arrays.
import statsmodels.api as sm            # array-based interface (not used below)
import statsmodels.formula.api as smf   # formula-based interface

# Fit y = B0 + B1*x by ordinary least squares
smfModel = smf.ols('y~x', data=houseDf).fit()
print(smfModel.summary())


                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.911
Model:                            OLS   Adj. R-squared:                  0.893
Method:                 Least Squares   F-statistic:                     50.93
Date:                Sun, 30 Oct 2022   Prob (F-statistic):           0.000838
Time:                        10:36:09   Log-Likelihood:                -26.006
No. Observations:                   7   AIC:                             56.01
Df Residuals:                       5   BIC:                             55.90
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    249.8571     24.841     10.058      0.0



**Understanding the Output**

Dep. Variable: The Dependent Variable

Model: Algorithm used. Here, it is Ordinary Least Squares

Method: Parameter Fitting method. Here, it is Least Squares

No. Observations: Number of rows used for model fitting.

Df Residuals: The degrees of freedom of the residuals (the number of observations minus the number of estimated parameters; here 7 - 2 = 5).

Df Model: The degrees of freedom of the model (the number of parameters estimated in the model, excluding the constant term; here 1).

R-squared: Measure of how well the model performs relative to the baseline model (here 0.911).
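Several entries in the summary can be reproduced by hand, which helps when checking that you read the table correctly. The sketch below is plain Python on the same house data. It uses the usual textbook formulas for adjusted R-squared and the F-statistic of a single-predictor model, and is not a description of how statsmodels computes them internally.

```python
price = [160, 180, 200, 220, 240, 260, 280]   # x
sale = [126, 103, 82, 75, 82, 40, 20]         # y

n = len(price)               # No. Observations
n_params = 2                 # B0 (intercept) and B1 (slope)
df_resid = n - n_params      # Df Residuals
df_model = n_params - 1      # Df Model (constant term excluded)

# Least-squares fit for a single predictor
mean_x = sum(price) / n
mean_y = sum(sale) / n
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(price, sale))
s_xx = sum((x - mean_x) ** 2 for x in price)
b1 = s_xy / s_xx
b0 = mean_y - b1 * mean_x

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(price, sale))
sst = sum((y - mean_y) ** 2 for y in sale)

r2 = 1 - sse / sst                                     # R-squared
adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid             # Adj. R-squared
f_stat = ((sst - sse) / df_model) / (sse / df_resid)   # F-statistic

print(n, df_resid, df_model)                             # 7 5 1
print(round(r2, 3), round(adj_r2, 3), round(f_stat, 2))  # 0.911 0.893 50.93
```

The printed values agree with the corresponding entries in the OLS Regression Results table above.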

In [3]:
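# Data prep
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell only runs on older versions, where it emits the warning shown below.
from sklearn.datasets import fetch_california_housing
from sklearn.datasets import load_boston
import pandas as pd

boston = load_boston()
california = fetch_california_housing()  # alternative dataset; loaded but not used below

# Build a DataFrame of the features and append the target column
dataset = pd.DataFrame(boston.data, columns=boston.feature_names)
dataset['target'] = boston.target
print(dataset.head())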


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  \
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0   
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0   
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0   
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0   
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0   

   PTRATIO       B  LSTAT  target  
0     15.3  396.90   4.98    24.0  
1     17.8  396.90   9.14    21.6  
2     17.8  392.83   4.03    34.7  
3     18.7  394.63   2.94    33.4  
4     18.7  396.90   5.33    36.2  
