# Linear Regression in Python using Statsmodels
https://www.datacourses.com/linear-regression-in-python-using-statsmodels-911/

## What is Regression?
In the simplest terms, regression is a method of finding relationships between different phenomena. It is a statistical technique that is now widely used in various areas of machine learning.

## What is Linear Regression?
Linear regression is the simplest of the regression analysis methods. It determines the linear function, or straight line, that best represents the relationship in your data. For this reason, the fitted line is often called the *line of best fit*.

## Simple Linear Regression
When you need to find the relationship between just two variables (one dependent and one independent), simple linear regression is used.\
The independent variable is usually denoted as X, while the dependent variable is denoted as Y.
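
As a minimal sketch of the idea, the straight line can be written as Y = M*X + C (the M and C notation this article uses later on) and fitted to a handful of points with NumPy's `polyfit`. The points below are made up for illustration and are not part of the homes dataset.

```python
import numpy as np

# Made-up points that lie exactly on the line Y = 2*X + 1 (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# A degree-1 polynomial fit returns the slope (M) first, then the intercept (C)
slope, intercept = np.polyfit(x, y, 1)
print(f"M = {slope:.2f}, C = {intercept:.2f}")
```

Because the points sit exactly on a line, the fitted slope and intercept recover the values used to build them. Real data is noisy, so the line only approximates the points.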

## Using Statsmodels to Perform Simple Linear Regression in Python

In [30]:
# import necessary packages/libraries
import pandas as pd
from pandas import DataFrame
import numpy as np
import statsmodels.api as sm

# read the csv file
myurl = 'https://people.sc.fsu.edu/~jburkardt/data/csv/homes.csv'
df = pd.read_csv(myurl)

# modify the headers
df.columns = ['Sell', 'List', 'Living', 'Rooms', 'Beds', 'Baths', 'Age', 'Acres', 'Taxes']

# check the first few rows
df.head()

   Sell  List  Living  Rooms  Beds  Baths  Age  Acres  Taxes
0   142   160      28     10     5      3   60   0.28   3167
1   175   180      18      8     4      1   12   0.43   4033
2   129   132      13      6     3      1   41   0.33   1471
3   138   140      17      7     3      1   22   0.46   3204
4   232   240      25      8     4      3    5   2.05   3613


### Check how strongly the *selling price* of a house depends on *Taxes*. Let's assign 'Taxes' to the variable X and 'Sell' to the variable Y.

In [31]:
X = df[['Taxes']]
Y = df[['Sell']]

# convert the two columns to float
X = X.astype(float)
Y = Y.astype(float)

# Create an OLS model named 'model' from the variables Y and X.
# Calling fit() on it finds the ideal regression line for X and Y.
# Note: sm.OLS does not add an intercept by itself, so this line passes through the origin.
model = sm.OLS(Y, X).fit()

# the variable model now holds the detailed information about our fitted regression model.
print(model.summary())

                                 OLS Regression Results                                
Dep. Variable:                   Sell   R-squared (uncentered):                   0.970
Model:                            OLS   Adj. R-squared (uncentered):              0.969
Method:                 Least Squares   F-statistic:                              1571.
Date:                Sat, 08 Jul 2023   Prob (F-statistic):                    6.90e-39
Time:                        16:31:22   Log-Likelihood:                         -244.50
No. Observations:                  50   AIC:                                      491.0
Df Residuals:                      49   BIC:                                      492.9
Df Model:                           1                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

### Understanding the Results

**R-squared value:** This measures the proportion of the variation in the dependent variable that the regression line explains. It ranges from 0 to 1, and the higher the value, the better the fit. Because this model has no constant term, statsmodels reports the *uncentered* R-squared, which is not directly comparable to the R-squared of a model with an intercept.
\
**Adj. R-squared:** This is the R-squared value adjusted for the number of input features. Ideally, it should be close to the R-squared value.
\
**Coefficient:** This gives the ‘M’ value for the regression line. It tells us how much the selling price changes with a unit change in Taxes. A positive value means that the selling price rises as Taxes rise. A negative value would have meant that the selling price falls as Taxes rise.
\
**Std error:** This tells us how precisely the coefficient has been estimated. The lower the standard error, the more precise the estimate.
\
**P >|t| :** This is the p-value for the test that the coefficient is zero. A value below 0.05 is conventionally taken to mean that Taxes has a statistically significant relationship with the selling price.

### Making Predictions

In [32]:
predictions = model.predict(X)
#print(predictions)

# if we rely on the model, let's see what the selling price would be if taxes were 3200.0
# first we need to make a dataframe with the value 3200.0
test_tax = pd.DataFrame([3200.0])

predictions = model.predict(test_tax)
print(predictions)

0    144.89358
dtype: float64


## Using Statsmodels to Perform Multiple Linear Regression in Python

Let us see if we get a better prediction by considering a combination of more than one input variable. Let's try using a combination of 'Taxes', 'Living' and 'List'.

In [33]:
new_X = df[['Taxes', 'Living', 'List']]

new_X = new_X.astype(float)

new_X = sm.add_constant(new_X)

new_model = sm.OLS(Y, new_X).fit()

In [34]:
print(new_model.summary())

                            OLS Regression Results                            
Dep. Variable:                   Sell   R-squared:                       0.995
Model:                            OLS   Adj. R-squared:                  0.995
Method:                 Least Squares   F-statistic:                     3383.
Date:                Sat, 08 Jul 2023   Prob (F-statistic):           6.16e-54
Time:                        16:31:22   Log-Likelihood:                -149.77
No. Observations:                  50   AIC:                             307.5
Df Residuals:                      46   BIC:                             315.2
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          9.7904      2.025      4.835      0.0

There are 4 coefficient values instead of one. Three of them are the coefficients (or M values) corresponding to *Taxes*, *Living* and *List*. The fourth is the *constant coefficient*, which is the *C* value (the intercept) in our regression equation. \
\
The R-squared value is 0.995. It's a high value, which means the regression plane fits the real data points quite well. \
Adj. R-squared is equal to the R-squared value, which is a good sign. \
The constant coefficient value (C) is 9.7904. \
\
Let's take a look at each of the independent variables and how they affect the selling price. \
\
The List price (or asking price) coefficient is 0.9764, which means the selling price rises by about 0.98 units for each unit increase in the asking price when the other variables are held constant. The corresponding p-value is also less than 0.05, further reinforcing the relationship. \
\
The Living Area coefficient is -0.4205. This means that, with the other variables held constant, the selling price decreases as living area increases. The p-value for this is also less than 0.05, meaning that this variable has a statistically significant effect on the selling price. \
\
The coefficient for Taxes is negative (-0.0013), which is the opposite sign to what we found with the earlier model. The p-value for Taxes, 0.249, is greater than 0.05, so of the three variables we considered, Taxes is the only one without a statistically significant effect on the selling price in this model.


### Making Predictions

In [35]:
predictions = new_model.predict(new_X)
#print(predictions)

test_input = pd.DataFrame({'const': [1.0], 'Taxes': [3200.0], 'Living': [14.0], 'List':[165.0]})

predictions = new_model.predict(test_input)
print(predictions)

0    160.967981
dtype: float64
