# Feature scaling with sklearn

Given a real estate dataset, create a multiple linear regression (similar to the one in the lecture). The dependent variable is 'price', while the independent variables are 'size' and 'year'.

-  Display the intercept and coefficient(s)
-  Find the R-squared and Adjusted R-squared
-  Compare the R-squared and the Adjusted R-squared
-  Compare the R-squared of this regression and the simple linear regression where only 'size' was used
-  Using the model, make a prediction about an apartment with size 750 sq.ft. from 2009
-  Find the univariate (or multivariate if you wish - see the article) p-values of the two variables
-  Create a summary table with your findings



## Import the relevant libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.linear_model import LinearRegression

## Load the data

In [3]:
df = pd.read_csv("real_estate_price_size_year.csv")

In [4]:
df.describe(include="all")

| | price | size | year |
|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 |
| mean | 292289.47016 | 853.0242 | 2012.6 |
| std | 77051.727525 | 297.941951 | 4.729021 |
| min | 154282.128 | 479.75 | 2006.0 |
| 25% | 234280.148 | 643.33 | 2009.0 |
| 50% | 280590.716 | 696.405 | 2015.0 |
| 75% | 335723.696 | 1029.3225 | 2018.0 |
| max | 500681.128 | 1842.51 | 2018.0 |


## Create the regression

### Declare the dependent and the independent variables

In [33]:
x = df[["size","year"]]
y = df["price"]

### Scale the inputs

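There are two equivalent ways to standardize the inputs: the `preprocessing.scale` function and the `StandardScaler` class. The second keeps the fitted mean and standard deviation, so the same transformation can be reused on new data when predicting.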

In [6]:
from sklearn import preprocessing

In [12]:
x_scaled = preprocessing.scale(x)


In [13]:
print(x_scaled.mean(axis=0))
print(x_scaled.std(axis=0))

[5.18474152e-16 1.93578487e-14]
[1. 1.]


In [16]:
from sklearn.preprocessing import StandardScaler

In [36]:
scaler = StandardScaler()
scaler.fit(x)
x_scaled = scaler.transform(x)
print(x_scaled.mean(axis=0))
print(x_scaled.std(axis=0))

[5.18474152e-16 1.93578487e-14]
[1. 1.]
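
Both approaches apply the same transformation: subtract each column's mean and divide by its standard deviation. The following is a minimal sketch that checks this by hand. The array `x_demo` holds made-up numbers for illustration only, not rows from the real estate dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up size/year rows for illustration only (not the real estate dataset)
x_demo = np.array([[650.0, 2006.0],
                   [800.0, 2012.0],
                   [1200.0, 2015.0],
                   [1500.0, 2018.0]])

# Manual standardization: (value - column mean) / column standard deviation.
# np.std uses the population formula (ddof=0), as StandardScaler does.
manual = (x_demo - x_demo.mean(axis=0)) / x_demo.std(axis=0)

# StandardScaler keeps the fitted statistics in mean_ and scale_
demo_scaler = StandardScaler().fit(x_demo)
from_sklearn = demo_scaler.transform(x_demo)

print(np.allclose(manual, from_sklearn))
```

The means printed above are not exactly zero (for example `5.18e-16`) only because of floating-point rounding.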


### Regression

In [23]:
reg = LinearRegression()
reg.fit(x_scaled,y)

LinearRegression()

### Intercept

In [37]:
reg.intercept_

292289.4701599997

### Coefficients

In [38]:
reg.coef_

array([67501.57614152, 13724.39708231])
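
Because the inputs were standardized, each coefficient is the change in price for a one-standard-deviation change in that feature, and the intercept equals the mean price (292289.47 in the `describe` output). If coefficients in the original units are wanted, they can be recovered from the scaler's statistics. The sketch below shows one way to do that. It uses synthetic data, and all names (`x_demo`, `y_demo`, `demo_scaler`, and so on) are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic size/year/price data for illustration only
rng = np.random.default_rng(0)
size = rng.uniform(500, 1800, 50)
year = rng.integers(2006, 2019, 50).astype(float)
x_demo = np.column_stack([size, year])
y_demo = 100_000 + 200 * size + 3_000 * (year - 2006) + rng.normal(0, 10_000, 50)

# Fit on standardized inputs, as in the notebook
demo_scaler = StandardScaler().fit(x_demo)
reg_scaled = LinearRegression().fit(demo_scaler.transform(x_demo), y_demo)

# Convert back: slope per original unit = scaled slope / feature std
coef_original = reg_scaled.coef_ / demo_scaler.scale_
intercept_original = reg_scaled.intercept_ - np.sum(coef_original * demo_scaler.mean_)

# Reference fit on the unscaled inputs for comparison
reg_raw = LinearRegression().fit(x_demo, y_demo)
print(np.allclose(coef_original, reg_raw.coef_))
```

Scaling changes how the coefficients are expressed, not the fitted model, so R-squared and predictions are the same either way.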

### R-squared

In [39]:
reg.score(x_scaled,y)

0.7764803683276793

### Adjusted R-squared

In [30]:
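def adj_r2(x,y):
    # Uses the already fitted global model `reg`
    r2 = reg.score(x,y)
    # n = number of observations, p = number of features
    n = x.shape[0]
    p = x.shape[1]
    # Penalize R-squared for the number of features used
    adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
    return adjusted_r2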

In [32]:
adj_r2(x_scaled,y)

0.77187171612825

### R-squared and the Adjusted R-squared

The R-squared (0.776) is only slightly higher than the Adjusted R-squared (0.772), so the penalty for including 2 independent variables is very small.

### Adjusted R-squared vs. the R-squared of the simple linear regression

Comparing the Adjusted R-squared with the R-squared of the simple linear regression (when only 'size' was used, a couple of lectures ago), we see that 'year' adds little explanatory power to the model.

### Predictions

Find the predicted price of an apartment that has a size of 750 sq.ft. from 2009.


In [40]:
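# Use the same column names as the training data so the scaler sees matching features
x_predictions = pd.DataFrame([[750,2009]], columns=["size","year"])
x_predictions_scaled = scaler.transform(x_predictions)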

In [41]:
reg.predict(x_predictions_scaled)

array([258330.34465995])
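
New inputs must go through the same fitted scaler before they reach the model. One way to make that step hard to forget is to bundle the scaler and the regression in a pipeline. The sketch below uses synthetic data rather than the real estate file, and the names `model`, `x_demo` and `y_demo` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic size/year/price data for illustration only
rng = np.random.default_rng(42)
size = rng.uniform(500, 1800, 80)
year = rng.integers(2006, 2019, 80).astype(float)
x_demo = np.column_stack([size, year])
y_demo = 100_000 + 200 * size + 3_000 * (year - 2006) + rng.normal(0, 10_000, 80)

# The pipeline scales the inputs and then fits the regression
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(x_demo, y_demo)

# Raw (unscaled) input: the pipeline applies the fitted scaler itself
new_apartment = np.array([[750.0, 2009.0]])
pipeline_prediction = model.predict(new_apartment)

# Same result as scaling by hand and then predicting
demo_scaler = StandardScaler().fit(x_demo)
manual_reg = LinearRegression().fit(demo_scaler.transform(x_demo), y_demo)
manual_prediction = manual_reg.predict(demo_scaler.transform(new_apartment))

print(np.allclose(pipeline_prediction, manual_prediction))
```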

### The univariate p-values of the variables

In [44]:
from sklearn.feature_selection import f_regression

In [46]:
print("F-score & P-value")
f_regression(x_scaled,y)


F-score & P-value


(array([285.92105192,   0.85525799]), array([8.12763222e-31, 3.57340758e-01]))

In [50]:
p_values = f_regression(x_scaled,y)[1]
p_values

array([8.12763222e-31, 3.57340758e-01])

In [51]:
p_values.round(3)

array([0.   , 0.357])
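
`f_regression` tests each feature on its own: for every column it effectively runs a simple linear regression of the target on that single feature and ignores the others. The sketch below reproduces its F-scores and p-values from the Pearson correlation, using synthetic data. The use of `scipy.stats.pearsonr` and all variable names are choices made for this illustration.

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import f_regression

# Synthetic data for illustration only: price depends on size, not on year
rng = np.random.default_rng(1)
n = 60
size = rng.uniform(500, 1800, n)
year = rng.integers(2006, 2019, n).astype(float)
x_demo = np.column_stack([size, year])
y_demo = 100_000 + 200 * size + rng.normal(0, 20_000, n)

f_scores, p_vals = f_regression(x_demo, y_demo)

# Per feature: F = r^2 / (1 - r^2) * (n - 2), where r is the Pearson correlation
manual_f, manual_p = [], []
for j in range(x_demo.shape[1]):
    r, p = stats.pearsonr(x_demo[:, j], y_demo)
    manual_f.append(r**2 / (1 - r**2) * (n - 2))
    manual_p.append(p)

print(np.allclose(f_scores, manual_f), np.allclose(p_vals, manual_p))
```

Since each test looks at one feature in isolation, these p-values can differ from those of the full multiple regression.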

### Summary table with your findings

In [52]:
reg_summary = pd.DataFrame(data=x.columns.values, columns=['Features'])
reg_summary['Coefficients'] = reg.coef_
reg_summary['p-values'] = p_values.round(3)
reg_summary

| | Features | Coefficients | p-values |
|---|---|---|---|
| 0 | size | 67501.576142 | 0.0 |
| 1 | year | 13724.397082 | 0.357 |


'Year' is not even significant (univariate p-value of 0.357, well above 0.05), therefore we should remove it from the model.
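
The task also allows multivariate p-values, which test each coefficient while the other features stay in the model. sklearn's `LinearRegression` does not report them, so the sketch below computes the usual OLS t-test p-values with NumPy and SciPy. The helper `ols_p_values` is hypothetical, and the demo data is synthetic and already standardized.

```python
import numpy as np
from scipy import stats

def ols_p_values(x, y):
    """Two-sided t-test p-values for the slopes of an OLS fit with intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = x.shape
    design = np.column_stack([np.ones(n), x])      # prepend the intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    dof = n - p - 1                                # residual degrees of freedom
    sigma2 = residuals @ residuals / dof           # unbiased error variance
    cov = sigma2 * np.linalg.inv(design.T @ design)
    t_stats = beta / np.sqrt(np.diag(cov))
    p_values = 2 * stats.t.sf(np.abs(t_stats), dof)
    return p_values[1:]                            # drop the intercept's p-value

# Synthetic standardized features: only the first one drives the target
rng = np.random.default_rng(7)
x_demo = rng.normal(size=(80, 2))
y_demo = 3.0 * x_demo[:, 0] + rng.normal(size=80)

multivariate_p = ols_p_values(x_demo, y_demo)
print(multivariate_p.round(3))
```

On the notebook's data the call would be `ols_p_values(x_scaled, y)`. The result is worth comparing with the univariate p-values before dropping a feature, because the two can disagree.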