# Linear Regression

We now return to our _House Price Prediction_ notebook, fit a regression there, and explore the results.

## Predicting House Prices with Linear Regression 


We finished with the DataFrame `X`, which has 10 variables. We will fit a _linear regression_ to it and explore whether all of these attributes are significant. There are two main implementations of _linear regression_ in Python:

- statsmodels
- sklearn


### statsmodels

First of all, we will import `statsmodels` into our notebook:

```python
import statsmodels.api as sm
```

We have to add a constant column to our predictor dataset so that the model also estimates an intercept. If we don't, the intercept is fixed at 0 and the fitted line is forced through the origin.

```python
X = sm.add_constant(X) # adding a constant
```


In [1]:
import pandas as pd

X=pd.read_csv('df_final.csv')

In [2]:
import statsmodels.api as sm
X = sm.add_constant(X)

In [3]:
y=X[['Price']]
X=X[['SqFt','Offers','Number_Rooms','Brick_Yes','Neighborhood_East','Neighborhood_North','const']]

In [4]:
X

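| | SqFt | Offers | Number_Rooms | Brick_Yes | Neighborhood_East | Neighborhood_North | const |
|---|---|---|---|---|---|---|---|
| 0 | 1790 | 2 | 4 | 0 | 1 | 0 | 1.0 |
| 1 | 2030 | 3 | 6 | 0 | 1 | 0 | 1.0 |
| 2 | 1740 | 1 | 5 | 0 | 1 | 0 | 1.0 |
| 3 | 1980 | 3 | 5 | 0 | 1 | 0 | 1.0 |
| 4 | 2130 | 3 | 6 | 0 | 1 | 0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 123 | 1900 | 3 | 6 | 1 | 1 | 0 | 1.0 |
| 124 | 2160 | 3 | 7 | 1 | 1 | 0 | 1.0 |
| 125 | 2070 | 2 | 4 | 0 | 0 | 1 | 1.0 |
| 126 | 2020 | 1 | 6 | 0 | 0 | 0 | 1.0 |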


Now, we can create a Python object that will represent _linear regression_:


In [5]:
lin_reg = sm.OLS(y,X)

If we call `type(lin_reg)`, we will see that `lin_reg` is a statsmodels `OLS` object, that is, a linear model estimated with the ordinary least squares method.

As the next step, we will fit the model to our data and print its summary:

In [6]:
model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:                  Price   R-squared:                       0.867
Model:                            OLS   Adj. R-squared:                  0.860
Method:                 Least Squares   F-statistic:                     131.2
Date:                Tue, 30 Apr 2024   Prob (F-statistic):           1.56e-50
Time:                        21:29:21   Log-Likelihood:                -1357.5
No. Observations:                 128   AIC:                             2729.
Df Residuals:                     121   BIC:                             2749.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
SqFt                  54.0050      5

### sklearn

Now, let's fit the same linear regression model with the `sklearn` library.

We need to import our model first:

```python
from sklearn.linear_model import LinearRegression
```

Then we initialize the object and fit the model on our data:


In [7]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X, y)


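Depending on the sklearn version, the notebook then displays the fitted estimator. Recent versions show only parameters that differ from their defaults (so just `LinearRegression()`), and the `normalize` parameter has since been removed. Older versions print the full parameter list: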

```
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
```

This gives us an overview of the parameters we can set for _linear regression_ in sklearn. The most important one is `fit_intercept`. In `sklearn`, we don't have to add a constant column to the dataset: `fit_intercept` defaults to `True`, so an intercept is estimated unless we explicitly set it to `False`.
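The effect of `fit_intercept` can be illustrated with a short sketch. The data here is synthetic and hypothetical, generated with a true intercept of 5:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data with a true intercept of 5.0
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = 5.0 + X_toy @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=100)

# Default behaviour: fit_intercept=True, the intercept is estimated
with_intercept = LinearRegression().fit(X_toy, y_toy)

# The intercept is fixed at zero and only the slopes are estimated
no_intercept = LinearRegression(fit_intercept=False).fit(X_toy, y_toy)

print(with_intercept.intercept_)  # close to the true value of 5
print(no_intercept.intercept_)    # 0.0
```

The estimated intercept is stored in the `intercept_` attribute, separately from the slopes in `coef_`.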

We can check the beta coefficients now:


In [8]:
print(regressor.coef_)

[[    54.00500003  -8350.9116557    5626.33680408  17654.58036008
  -21803.04411941 -20067.94595903      0.        ]]


This shows us a NumPy array with the beta coefficients, in the same order as the columns of `X`. The last one is 0 because it belongs to the constant column we added before the `statsmodels` modeling. That column carries no information for `sklearn`, which estimates the intercept separately (it is stored in `regressor.intercept_`), so we could have dropped it beforehand.

We can see that `statsmodels` presents the results in a much more readable form. A bigger disadvantage of `sklearn` is that it does not report p-values, so we cannot check the statistical significance of individual variables.

If we want to know the R-squared, we can get it with:


In [9]:
regressor.score(X,y)

0.8668096918895594
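For a regressor, `score` returns the coefficient of determination of the predictions. The sketch below, on hypothetical synthetic data, checks it against `sklearn.metrics.r2_score` and against the formula computed by hand:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical synthetic data
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = 1.0 + X_toy @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=100)

reg = LinearRegression().fit(X_toy, y_toy)
pred = reg.predict(X_toy)

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = ((y_toy - pred) ** 2).sum()
ss_tot = ((y_toy - y_toy.mean()) ** 2).sum()
r2_manual = 1 - ss_res / ss_tot

print(reg.score(X_toy, y_toy), r2_score(y_toy, pred), r2_manual)
```

All three values should coincide, because they are different routes to the same quantity.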

The advantage of the `sklearn` implementation is that it is consistent with all other methods and models in this library. We will be using many of them soon.
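As a small illustration of that consistency, the sketch below runs the same `fit` and `score` calls on two different estimators. `Ridge` is picked here only as an arbitrary second model, and the data is synthetic and hypothetical:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Hypothetical synthetic data
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(80, 3))
y_toy = X_toy @ np.array([1.0, 0.5, -2.0]) + rng.normal(scale=0.3, size=80)

# The same calls work regardless of which estimator we use
scores = {}
for estimator in (LinearRegression(), Ridge(alpha=1.0)):
    estimator.fit(X_toy, y_toy)
    scores[type(estimator).__name__] = estimator.score(X_toy, y_toy)

print(scores)
```

Swapping one model for another therefore usually means changing a single line.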

## Conclusion

We have compared two implementations of _linear regression_ in Python, `statsmodels` and `sklearn`, and seen their respective advantages and disadvantages.

