# Feature scaling with sklearn - Exercise

You are given a real estate dataset. 

Real estate is a staple example in regression courses: it is easy to understand, and there is almost always a clear causal relationship to be found.

The data is located in the file: 'real_estate_price_size_year.csv'. 

You are expected to create a multiple linear regression (similar to the one in the lecture), using the new data. This exercise is very similar to a previous one. This time, however, **please standardize the data**.

Apart from that, please:
-  Display the intercept and coefficient(s)
-  Find the R-squared and Adjusted R-squared
-  Compare the R-squared and the Adjusted R-squared
-  Compare the R-squared of this regression and the simple linear regression where only 'size' was used
-  Using the model, make a prediction about an apartment with size 750 sq.ft. from 2009
-  Find the univariate (or multivariate if you wish - see the article) p-values of the two variables. What can you say about them?
-  Create a summary table with your findings

In this exercise, the dependent variable is 'price', while the independent variables are 'size' and 'year'.

Good luck!

## Import the relevant libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import f_regression
from sklearn.preprocessing import StandardScaler

## Load the data

In [4]:
data = pd.read_csv('real_estate_price_size_year.csv')

In [6]:
data.head()

| | price | size | year |
|---|---|---|---|
| 0 | 234314.144 | 643.09 | 2015 |
| 1 | 228581.528 | 656.22 | 2009 |
| 2 | 281626.336 | 487.29 | 2018 |
| 3 | 401255.608 | 1504.75 | 2015 |
| 4 | 458674.256 | 1275.46 | 2009 |


## Create the regression

### Declare the dependent and the independent variables

In [9]:
x = data[['size','year']]
y = data['price']
x.head()

| | size | year |
|---|---|---|
| 0 | 643.09 | 2015 |
| 1 | 656.22 | 2009 |
| 2 | 487.29 | 2018 |
| 3 | 1504.75 | 2015 |
| 4 | 1275.46 | 2009 |


### Scale the inputs

In [14]:
scaler = StandardScaler()
scaler.fit(x)
x_scaled = scaler.transform(x)
x_scaled[:10]


array([[-0.70816415,  0.51006137],
       [-0.66387316, -0.76509206],
       [-1.23371919,  1.14763808],
       [ 2.19844528,  0.51006137],
       [ 1.42498884, -0.76509206],
       [-0.937209  , -1.40266877],
       [-0.95171405,  0.51006137],
       [-0.78328682, -1.40266877],
       [-0.57603328,  1.14763808],
       [-0.53467702, -0.76509206]])
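
As a sanity check on what `StandardScaler` does, the same transformation can be reproduced by hand: subtract each column's mean and divide by its population standard deviation (`ddof=0`). The sketch below is only an illustration; it uses the five rows shown by `data.head()` as a sample, so its numbers will differ from the full-data output above.

```python
import numpy as np

# Illustrative sample: the five rows displayed by data.head() (size, year).
sample = np.array([
    [643.09, 2015],
    [656.22, 2009],
    [487.29, 2018],
    [1504.75, 2015],
    [1275.46, 2009],
])

# Standardization centers each column and divides by the population
# standard deviation (ddof=0), not the sample standard deviation.
col_mean = sample.mean(axis=0)
col_std = sample.std(axis=0, ddof=0)
sample_scaled = (sample - col_mean) / col_std

print(sample_scaled.mean(axis=0))  # approximately 0 for each column
print(sample_scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, both columns are on the same footing, which is what makes the regression weights comparable later on.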

### Regression

In [15]:
reg = LinearRegression()
reg.fit(x_scaled,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

### Find the intercept

In [16]:
reg.intercept_

292289.4701599997

### Find the coefficients

In [17]:
reg.coef_

array([67501.57614152, 13724.39708231])

### Calculate the R-squared

In [21]:
R2 = reg.score(x_scaled,y)
R2

0.7764803683276793

### Calculate the Adjusted R-squared

In [26]:
n = x_scaled.shape[0]
p = x_scaled.shape[1]
(n,p)

(100, 2)

In [30]:
R2Adj = 1 - (1 - R2)*(n - 1)/(n - p - 1)
R2Adj

0.77187171612825
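
The calculation above can be wrapped in a small reusable function. This is a minimal sketch; the name `adjusted_r2` is a hypothetical helper, not part of sklearn, and the inputs are the values obtained in the cells above.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("Need more observations than predictors plus one.")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# R-squared, sample count and predictor count from the cells above.
r2_adj = adjusted_r2(0.7764803683276793, n=100, p=2)
print(r2_adj)
```

Because the correction factor `(n - 1)/(n - p - 1)` grows with `p`, adding a predictor that barely raises the R-squared lowers the adjusted value.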

### Compare the R-squared and the Adjusted R-squared

They are very close and both high: the regression fits well, and we are barely penalized for including the extra variable.

### Compare the Adjusted R-squared with the R-squared of the simple linear regression

They are similar, which suggests that adding 'year' contributes little explanatory power beyond 'size'.

### Making predictions

Find the predicted price of an apartment that has a size of 750 sq.ft. from 2009.

In [31]:
pred_input = scaler.transform([[750,2009]])
pred_input

array([[-0.34752816, -0.76509206]])

In [32]:
reg.predict(pred_input)

array([258330.34465995])
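
To see where this number comes from, the prediction can be rebuilt by hand as the intercept plus the dot product of the weights and the standardized inputs. The sketch below copies the rounded values printed in the cells above, so it matches `reg.predict` only up to rounding.

```python
import numpy as np

# Values copied from the outputs above (rounded as printed).
intercept = 292289.4701599997
coef = np.array([67501.57614152, 13724.39708231])
scaled_input = np.array([-0.34752816, -0.76509206])  # 750 sq.ft., year 2009

# Linear regression prediction: intercept + sum(weight * standardized feature).
manual_prediction = intercept + coef @ scaled_input
print(manual_prediction)
```

Both standardized inputs are negative, meaning the apartment is smaller and older than average, which is why the prediction falls below the intercept.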

### Calculate the univariate p-values of the variables

In [33]:
f_regression(x,y)[1].round(3)

array([0.   , 0.357])

In [34]:
f_regression(x_scaled,y)[1].round(3)

array([0.   , 0.357])
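
The p-values are identical for the raw and the scaled inputs because the univariate F-test depends only on the correlation between each feature and the target, and correlation does not change when a feature is shifted and rescaled. The sketch below illustrates this on the five rows shown by `data.head()`; `univariate_f` is a hypothetical helper that, to my understanding, follows the same F = r²/(1 − r²)·(n − 2) form used by `f_regression` with centering.

```python
import numpy as np

# Illustrative sample: the five rows displayed by data.head().
size = np.array([643.09, 656.22, 487.29, 1504.75, 1275.46])
price = np.array([234314.144, 228581.528, 281626.336, 401255.608, 458674.256])

def univariate_f(feature, target):
    """F-statistic of a one-feature regression, from the Pearson correlation."""
    r = np.corrcoef(feature, target)[0, 1]
    dof = len(target) - 2  # observations minus slope and intercept
    return r**2 / (1 - r**2) * dof

# Standardize the feature and compare the two statistics.
size_scaled = (size - size.mean()) / size.std()
f_raw = univariate_f(size, price)
f_scaled = univariate_f(size_scaled, price)
print(f_raw, f_scaled)  # the two values agree up to floating-point error
```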

### Create a summary table with your findings

In [50]:
summary = pd.DataFrame(data=['bias','size','year'], columns=['Feature'])
summary['Weight']=reg.intercept_,reg.coef_[0],reg.coef_[1]
summary

| | Feature | Weight |
|---|---|---|
| 0 | bias | 292289.47016 |
| 1 | size | 67501.576142 |
| 2 | year | 13724.397082 |


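The regression fits well: both the R-squared and the Adjusted R-squared are high. Both weights are also large and, because the inputs are standardized, they can be compared directly: 'size' has by far the larger effect. However, the univariate p-value for 'year' (0.357) shows that it is not significant, so removing it might yield better results.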