# Feature scaling with sklearn - Exercise

You are given a real estate dataset. 

Real estate is a staple example in regression courses because it is easy to understand and there is almost always a clear causal relationship to be found.

The data is located in the file: 'real_estate_price_size_year.csv'. 

You are expected to create a multiple linear regression (similar to the one in the lecture), using the new data. This exercise is very similar to a previous one. This time, however, **please standardize the data**.

Apart from that, please:
- Display the intercept and coefficient(s)
- Find the R-squared and Adjusted R-squared
- Compare the R-squared and the Adjusted R-squared
- Compare the R-squared of this regression with that of the simple linear regression where only 'size' was used
- Using the model, predict the price of an apartment with a size of 750 sq.ft. built in 2009
- Find the univariate (or multivariate if you wish - see the article) p-values of the two variables. What can you say about them?
- Create a summary table with your findings

In this exercise, the dependent variable is 'price', while the independent variables are 'size' and 'year'.

Good luck!

## Import the relevant libraries

In [21]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

## Load the data

In [6]:
df_data = pd.read_csv('real_estate_price_size_year.csv')

In [9]:
df_data.head()

|   | price | size | year |
|---|---|---|---|
| 0 | 234314.144 | 643.09 | 2015 |
| 1 | 228581.528 | 656.22 | 2009 |
| 2 | 281626.336 | 487.29 | 2018 |
| 3 | 401255.608 | 1504.75 | 2015 |
| 4 | 458674.256 | 1275.46 | 2009 |


## Create the regression

### Declare the dependent and the independent variables

In [10]:
X = df_data.drop('price', axis=1)

In [11]:
y = df_data.price

### Scale the inputs

In [14]:
scaler = StandardScaler()

In [18]:
scaler.fit(X)

In [19]:
sc_X = scaler.transform(X)

In [20]:
sc_X

array([[-0.70816415,  0.51006137],
       [-0.66387316, -0.76509206],
       [-1.23371919,  1.14763808],
       [ 2.19844528,  0.51006137],
       [ 1.42498884, -0.76509206],
       [-0.937209  , -1.40266877],
       [-0.95171405,  0.51006137],
       [-0.78328682, -1.40266877],
       [-0.57603328,  1.14763808],
       [-0.53467702, -0.76509206],
       [ 0.69939906, -0.76509206],
       [ 3.33780001, -0.76509206],
       [-0.53467702,  0.51006137],
       [ 0.52699137,  1.14763808],
       [ 1.51100715, -1.40266877],
       [ 1.77668568, -1.40266877],
       [-0.54810263,  1.14763808],
       [-0.77276222, -1.40266877],
       [-0.58004747, -1.40266877],
       [ 0.58943055,  1.14763808],
       [-0.78365788,  0.51006137],
       [-1.02322731,  0.51006137],
       [ 1.19557293,  0.51006137],
       [-1.12884431,  0.51006137],
       [-1.10378093, -0.76509206],
       [ 0.84424715,  1.14763808],
       [-0.95171405,  1.14763808],
       [ 1.62279723,  0.51006137],
       [-0.58004747,
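
To make the transformation above concrete, here is a minimal sketch of what `StandardScaler` computes: each column has its mean subtracted and is divided by its population standard deviation. The four rows in `X_toy` are made up for illustration and are not taken from the exercise file.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy inputs (made up for illustration): columns are size and year
X_toy = np.array([[650.0, 2015.0],
                  [500.0, 2018.0],
                  [1500.0, 2015.0],
                  [1275.0, 2009.0]])

scaler_toy = StandardScaler().fit(X_toy)
scaled = scaler_toy.transform(X_toy)

# Manual standardization: subtract the column mean, divide by the
# population standard deviation (ddof=0, which is NumPy's default)
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(scaled, manual))  # the two results should agree
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, which is why the coefficients of the fitted model become comparable across features.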

### Regression

In [22]:
X_train, X_test, y_train, y_test = train_test_split(sc_X, y, test_size=0.2, random_state=10)

In [23]:
price_model = LinearRegression()
price_model.fit(X_train, y_train)

In [24]:
price_model.score(X_train,y_train)

0.7995052583576073

### Find the intercept

In [25]:
price_model.intercept_

np.float64(290631.07131090434)

### Find the coefficients

In [27]:
price_model.coef_

array([69242.76564901, 17853.96620656])
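
These coefficients are expressed per standard deviation of each feature, not per square foot or per year. As a hedged sketch of how they could be mapped back to the original units, the example below uses synthetic stand-in data (the names `X_raw`, `y_syn`, `sc` and `model` are hypothetical) and the scaler's `mean_` and `scale_` attributes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in data, NOT the real estate file
X_raw = np.column_stack([rng.uniform(400, 1800, 50),
                         rng.integers(2006, 2019, 50)])
y_syn = (100_000 + 220 * X_raw[:, 0]
         + 2_000 * (X_raw[:, 1] - 2006)
         + rng.normal(0, 10_000, 50))

sc = StandardScaler().fit(X_raw)
model = LinearRegression().fit(sc.transform(X_raw), y_syn)

# y = b0 + sum_j b_j * (x_j - mean_j) / scale_j, so in original units:
coef_raw = model.coef_ / sc.scale_
intercept_raw = model.intercept_ - np.sum(model.coef_ * sc.mean_ / sc.scale_)

x_new = np.array([[750.0, 2009.0]])
pred_scaled = model.predict(sc.transform(x_new))[0]
pred_raw = intercept_raw + x_new[0] @ coef_raw
print(np.isclose(pred_scaled, pred_raw))  # both routes give the same prediction
```

Here `coef_raw[0]` reads as the price change per additional square foot, which is often easier to communicate than a change per standard deviation.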

### Calculate the R-squared

In [28]:
r2 = price_model.score(X_train,y_train)

### Calculate the Adjusted R-squared

In [31]:
n, p = X_train.shape  # n training observations, p predictors (size, year)
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)

In [32]:
r2_adj

0.7942976027305322
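
The calculation above follows the usual formula, adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. A small reusable sketch is shown below; the helper name `adjusted_r2` is hypothetical, and n = 80 is an assumption inferred from the reported values rather than stated in the notebook.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Penalize R-squared for the number of predictors used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# R-squared reported by the notebook; n_samples=80 is an assumption
r2_adj_check = adjusted_r2(0.7995052583576073, n_samples=80, n_features=2)
print(round(r2_adj_check, 6))  # 0.794298, in line with the value above
```

Because the penalty factor (n − 1)/(n − p − 1) is greater than 1 whenever p ≥ 1, the adjusted value is always slightly below the plain R-squared.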

### Compare the R-squared and the Adjusted R-squared

Answer...

### Compare the Adjusted R-squared with the R-squared of the simple linear regression

Answer...

In [39]:
sc_X[:,0]

array([-0.70816415, -0.66387316, -1.23371919,  2.19844528,  1.42498884,
       -0.937209  , -0.95171405, -0.78328682, -0.57603328, -0.53467702,
        0.69939906,  3.33780001, -0.53467702,  0.52699137,  1.51100715,
        1.77668568, -0.54810263, -0.77276222, -0.58004747,  0.58943055,
       -0.78365788, -1.02322731,  1.19557293, -1.12884431, -1.10378093,
        0.84424715, -0.95171405,  1.62279723, -0.58004747,  2.17014356,
        0.5306345 , -0.58004747, -0.8606021 , -1.10378093,  0.015233  ,
       -0.77603429, -0.10057126, -0.95387294, -0.56517136, -0.5219598 ,
        0.56983186, -0.57603328, -0.10057126,  1.62279723,  0.69939906,
       -0.5219598 , -0.7415595 , -0.5219598 , -0.7415595 , -0.79600403,
       -0.69328805,  0.56983186,  0.56983186, -0.42214483, -0.69328805,
        2.21224194,  0.6039356 ,  1.45329055, -0.08495304, -0.95751607,
       -0.08387359, -0.52125142,  1.18939985,  0.56983186, -0.56517136,
       -0.08748299,  0.52699137, -1.02285625, -0.56517136,  2.17

In [42]:
sim_price_model = LinearRegression()
sim_price_model.fit(X_train[:,0].reshape(-1,1), y_train)
sim_price_model.score(X_train[:,0].reshape(-1,1),y_train)

0.7508666992930698

In [50]:
sample_pre = pd.DataFrame([[750, 2009]], columns=['size', 'year'])

In [53]:
sample_pre

|   | size | year |
|---|---|---|
| 0 | 750 | 2009 |


In [54]:
sample_pre = scaler.transform(sample_pre)

In [55]:
sample_pre

array([[-0.34752816, -0.76509206]])

### Making predictions

Find the predicted price of an apartment that has a size of 750 sq.ft. from 2009.

In [57]:
price_model.predict(sample_pre)

array([252907.33290071])
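
The prediction above only works because the new sample was scaled with the same fitted scaler before being passed to the model. One way to avoid forgetting that step is to chain the two with `make_pipeline`; the sketch below is an alternative to the notebook's approach, and it uses synthetic stand-in data (`df_syn` and `price_syn` are hypothetical names), so its numbers are not comparable to the result above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic stand-in data, NOT the real estate file
df_syn = pd.DataFrame({"size": rng.uniform(400, 1800, 60),
                       "year": rng.integers(2006, 2019, 60)})
price_syn = (100_000 + 220 * df_syn["size"]
             + 2_000 * (df_syn["year"] - 2006)
             + rng.normal(0, 10_000, 60))

# The pipeline scales the inputs, then fits the regression on the scaled values
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(df_syn, price_syn)

# New samples are passed in original units; scaling happens inside predict()
new_flat = pd.DataFrame([[750, 2009]], columns=["size", "year"])
pred = pipe.predict(new_flat)

# Equivalent manual route: scale first, then predict
manual_pred = pipe[-1].predict(pipe[0].transform(new_flat))
print(np.allclose(pred, manual_pred))
```

The column order of the new sample must match the order used for training ('size' first, then 'year').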

### Calculate the univariate p-values of the variables

In [59]:
from sklearn.feature_selection import f_regression
p_value = f_regression(X_train, y_train)[1].round(3)

In [67]:
p_value

array([0.   , 0.111])
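
`f_regression` returns a pair of arrays, the F-statistics and the p-values, and each feature is tested on its own in a simple regression against the target. The sketch below illustrates this on synthetic data in which one feature drives the target and the other is pure noise; the data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(2)
n = 80
informative = rng.normal(size=n)    # drives the target
noise_feature = rng.normal(size=n)  # unrelated to the target
X_syn = np.column_stack([informative, noise_feature])
y_target = 3.0 * informative + rng.normal(scale=1.0, size=n)

# One F-statistic and one p-value per column of X_syn
f_stats, p_vals = f_regression(X_syn, y_target)
print(p_vals.round(3))
```

A p-value below a chosen threshold such as 0.05 suggests that the feature, taken alone, is linearly related to the target. Because the tests are univariate, they do not account for what the other features already explain.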

### Create a summary table with your findings

In [66]:
# The bias p-value of 0 is a placeholder: f_regression returns p-values for the features only
coefs = [price_model.intercept_, *price_model.coef_]
summary_table = pd.DataFrame({'features': ['bias', 'size', 'year'], 'coeff': coefs, 'p_values': [0, *p_value]})
summary_table

|   | features | coeff | p_values |
|---|---|---|---|
| 0 | bias | 290631.071311 | 0.0 |
| 1 | size | 69242.765649 | 0.0 |
| 2 | year | 17853.966207 | 0.111 |




Answer...