# Nonlinear Regression

Do house prices (y) grow nonlinearly with respect to the features (X)?

Is the relationship between X and y continuous?

Technical topics:
+ Ridge, DecisionTree
+ sklearn.preprocessing.PolynomialFeatures

In [1]:
import pandas
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import cross_validate, ShuffleSplit
from sklearn.kernel_ridge import KernelRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor



df = pandas.read_csv('../Datasets/cali_housing.csv')
df = df.dropna()
df = pandas.get_dummies(df)
X = df.drop(columns=['median_house_value'])
y = df.median_house_value

def evaluate(model, X, y, d=1):
    if d>1:
        trans = PolynomialFeatures(degree=d)
        X = trans.fit_transform(X)
    results = cross_validate(model, X, y, cv=ShuffleSplit(n_splits=100))
    print('{}\n\tR2: {}'.format(model, results['test_score'].mean().round(3)))

In [2]:
df.sample(5)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity_<1H OCEAN,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN
14909,-117.04,32.62,26,3620,607.0,2000,593,4.9962,156000,0,0,0,0,1
5728,-118.21,34.18,14,2672,335.0,1113,318,12.1579,500001,1,0,0,0,0
17509,-121.9,37.34,52,241,69.0,385,64,2.619,212500,1,0,0,0,0
4797,-118.35,34.02,52,427,92.0,233,116,3.25,134700,1,0,0,0,0
10224,-117.89,33.87,25,1492,439.0,755,389,3.0893,188200,1,0,0,0,0


In [3]:
m = LinearRegression()
evaluate(m, X, y)

LinearRegression()
	R2: 0.644


In [4]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import ShuffleSplit, cross_validate

def evaluate_model(model, X, y, d = 1):
    if d > 1:
        trans = PolynomialFeatures(degree = d)
        X = trans.fit_transform(X)
    results = cross_validate(model, X, y, cv = ShuffleSplit(n_splits = 200))
    print(results['test_score'].mean(), results['test_score'].std())

In [5]:
model = LinearRegression()
X = df.drop(columns = ['median_house_value'])
y = df['median_house_value']
evaluate_model(model, X, y)

0.6454815220279733 0.018504415414594893


In [6]:
evaluate_model(model, X, y, d = 2)

0.5716508102922395 0.49284782987783593


Reminder: how does linear regression work?

We have (1) the features $X$ and (2) the target variable $y$.

In linear regression, we find a weight vector $w$ and an intercept $b$ such that the sum of squared errors is minimal.

The predictor is: $f(X) = w\cdot X + b$

The error is: $|| y - f(X) ||^2$.
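
As a small check of the reminder above, the following sketch solves the least-squares problem directly with NumPy and compares the result with `LinearRegression`. The data is synthetic (the true weights 3, -2 and intercept 5 are made up for illustration), and the names `X_toy`, `y_toy` and `A` are not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3*x1 - 2*x2 + 5 + small noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = X_toy @ np.array([3.0, -2.0]) + 5.0 + rng.normal(scale=0.1, size=50)

# Append a column of ones so the intercept b is learned as one more weight.
A = np.hstack([X_toy, np.ones((50, 1))])
solution, *_ = np.linalg.lstsq(A, y_toy, rcond=None)
w, b = solution[:2], solution[2]

# scikit-learn solves the same least-squares problem.
lr = LinearRegression().fit(X_toy, y_toy)
print(w, b)
print(lr.coef_, lr.intercept_)
```

Both approaches minimize the same squared error, so the weights and the intercept agree up to numerical precision.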


### Ridge regression

This is a small extension of linear regression.  In ordinary linear regression, we want to minimize $||y-(w\cdot X + b)||^2$.

In Ridge regression, we want to minimize $||y-(w\cdot X + b)||^2 + \alpha \cdot ||w||^2$.

$||w||^2 = \sum w_i^2$

Ridge wants to keep both the errors and the size of the weight vector small; $\alpha$ controls the trade-off between the two.

$w_i$ is the weight of the $i$-th feature, so $w_1$ is the weight of the first feature.
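
To make the effect of $\alpha$ concrete, here is a minimal sketch on synthetic data, assuming the squared form of the error and penalty that scikit-learn's `Ridge` uses. The helper `ridge_closed_form` is hypothetical (it is not a library function); it solves the penalized problem on centered data, which is how the intercept is kept out of the penalty.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([4.0, -3.0, 2.0]) + rng.normal(scale=0.5, size=100)

def ridge_closed_form(X, y, alpha):
    """Solve (Xc^T Xc + alpha*I) w = Xc^T yc on centered data."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n_features = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)

# Larger alpha means a stronger penalty and a shorter weight vector.
norms = []
for alpha in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    w = ridge_closed_form(X_toy, y_toy, alpha)
    norms.append(np.linalg.norm(w))
    print(alpha, np.round(w, 3), round(norms[-1], 3))

# Compare with scikit-learn for one value of alpha.
w_sklearn = Ridge(alpha=10.0).fit(X_toy, y_toy).coef_
w_manual = ridge_closed_form(X_toy, y_toy, 10.0)
print(w_sklearn, w_manual)
```

With `alpha=0.0` this reduces to ordinary least squares; as `alpha` grows, the norm of the weight vector shrinks toward zero.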

In [7]:
m.fit(X,y)

LinearRegression()

In [8]:
print(max(m.coef_))
m.coef_

130113.5963943307


array([-2.68129893e+04, -2.54821848e+04,  1.07252004e+03, -6.19326372e+00,
        1.00556290e+02, -3.79690829e+01,  4.96173261e+01,  3.92595729e+04,
       -2.27883447e+04, -6.20726449e+04,  1.30113596e+05, -2.67423963e+04,
       -1.85102104e+04])

In [9]:
X.columns

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'ocean_proximity_<1H OCEAN', 'ocean_proximity_INLAND',
       'ocean_proximity_ISLAND', 'ocean_proximity_NEAR BAY',
       'ocean_proximity_NEAR OCEAN'],
      dtype='object')

In [10]:
mr = Ridge(alpha=0.05, normalize=True)
evaluate(mr, X, y)

Ridge(alpha=0.05, normalize=True)
	R2: 0.636


In [11]:
mr.fit(X,y)
print(mr.coef_.dot(mr.coef_))

35792956876.27669


**Not all features have the same scale.**

Ridge makes little difference so far: $R^2$ is 0.636, versus 0.644 for plain linear regression.
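
A note on the `normalize=True` argument used in the cells here: it was removed from scikit-learn's linear models in version 1.2, so these cells only run on older versions. A common replacement is to put a `StandardScaler` in front of the model in a pipeline. The sketch below shows that pattern on synthetic data with two features on very different scales. The names `small`, `large` and `scaled_ridge` are illustrative, and `alpha=1.0` is an arbitrary choice: `StandardScaler` does not rescale the data the same way `normalize=True` did, so the old `alpha=0.05` does not carry over directly.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two synthetic features on very different scales.
rng = np.random.default_rng(0)
small = rng.normal(scale=1.0, size=500)
large = rng.normal(scale=1000.0, size=500)
X_toy = np.column_stack([small, large])
y_toy = 50.0 * small + 0.05 * large + rng.normal(scale=5.0, size=500)

# The scaler is refit on the training part of every split,
# so the penalty treats both features evenly.
scaled_ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
cv = ShuffleSplit(n_splits=20, random_state=0)
results = cross_validate(scaled_ridge, X_toy, y_toy, cv=cv)
mean_r2 = results["test_score"].mean()
print(round(mean_r2, 3))
```

Because the pipeline is itself an estimator, it can be passed to the notebook's `evaluate` function in place of a bare `Ridge`.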

In [12]:
X

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity_<1H OCEAN,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN
0,-122.23,37.88,41,880,129.0,322,126,8.3252,0,0,0,1,0
1,-122.22,37.86,21,7099,1106.0,2401,1138,8.3014,0,0,0,1,0
2,-122.24,37.85,52,1467,190.0,496,177,7.2574,0,0,0,1,0
3,-122.25,37.85,52,1274,235.0,558,219,5.6431,0,0,0,1,0
4,-122.25,37.85,52,1627,280.0,565,259,3.8462,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25,1665,374.0,845,330,1.5603,0,1,0,0,0
20636,-121.21,39.49,18,697,150.0,356,114,2.5568,0,1,0,0,0
20637,-121.22,39.43,17,2254,485.0,1007,433,1.7000,0,1,0,0,0
20638,-121.32,39.43,18,1860,409.0,741,349,1.8672,0,1,0,0,0


In [13]:
evaluate(KNeighborsRegressor(n_neighbors=7), X, y)

KNeighborsRegressor(n_neighbors=7)
	R2: 0.29


In [14]:
evaluate(LinearRegression(), X, y)
# evaluate(LinearRegression(normalize=True))
evaluate(Ridge(normalize=True, alpha=0.05), X, y, d=2)
# evaluate(Ridge(normalize=True, alpha=0.05), X, y, d=3)
evaluate(KNeighborsRegressor(), X, y)
evaluate(DecisionTreeRegressor(max_depth=10), X, y)
evaluate(DecisionTreeRegressor(max_depth=10), X, y, d=2)

# explore d with ridge, max_depth with decision trees

LinearRegression()
	R2: 0.645
Ridge(alpha=0.05, normalize=True)
	R2: 0.674
KNeighborsRegressor()
	R2: 0.263
DecisionTreeRegressor(max_depth=10)
	R2: 0.721
DecisionTreeRegressor(max_depth=10)
	R2: 0.704


In [15]:
evaluate(DecisionTreeRegressor(), X, y)

DecisionTreeRegressor()
	R2: 0.654


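Decision trees do well here.  With `max_depth=10`, a tree reaches $R^2$ of 0.721, better than linear regression (0.645); an unrestricted tree is only comparable (0.654).

Ridge on the original features is comparable to linear regression.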
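
The scores above suggest that `max_depth` matters: the depth-limited tree does better than the unrestricted one. A simple way to explore this is to loop over a few depths and cross-validate each. The sketch below does so on synthetic one-dimensional data (a noisy sine curve), so the numbers it prints say nothing about the housing data; the depth values are arbitrary.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear data: a noisy sine curve.
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(600, 1))
y_toy = np.sin(2 * X_toy[:, 0]) + rng.normal(scale=0.3, size=600)

cv = ShuffleSplit(n_splits=20, random_state=0)
scores = {}
for depth in [1, 2, 4, 6, 10, None]:  # None lets the tree grow fully
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    results = cross_validate(tree, X_toy, y_toy, cv=cv)
    scores[depth] = results["test_score"].mean()
    print(depth, round(scores[depth], 3))
```

Very shallow trees underfit, while very deep trees start fitting the noise, so the best depth is usually somewhere in between.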



### Are house prices linearly influenced by the features?

So far, linear regression gives $R^2$ around 0.64.

In [16]:
from sklearn.preprocessing import PolynomialFeatures

df = pandas.read_csv('../Datasets/cali_housing.csv')
df = df.dropna()
df = pandas.get_dummies(df)
X = df.drop(columns=['median_house_value'])
y = df.median_house_value

def evaluate(model, X, y, d=1):
    if d>1:
        trans = PolynomialFeatures(degree=d)
        X = trans.fit_transform(X)
    results = cross_validate(model, X, y, cv=ShuffleSplit(n_splits=100))
    print('{}\n\tR2: {}'.format(model, results['test_score'].mean().round(3)))



In [17]:
evaluate(LinearRegression(), X, y)

LinearRegression()
	R2: 0.648


In [18]:
evaluate(LinearRegression(), X, y, d=2)

LinearRegression()
	R2: 0.622


In [19]:
evaluate(Ridge(normalize=True, alpha=0.05), X, y, d=2)

Ridge(alpha=0.05, normalize=True)
	R2: 0.672


In [28]:
evaluate(DecisionTreeRegressor(max_depth = 5), X, y)

DecisionTreeRegressor(max_depth=5)
	R2: 0.626


In [20]:
evaluate(DecisionTreeRegressor(max_depth=10), X, y, d=2)

DecisionTreeRegressor(max_depth=10)
	R2: 0.705


In [21]:
evaluate(Ridge(normalize=True, alpha=0.05), X, y, d=3)

Ridge(alpha=0.05, normalize=True)
	R2: 0.681


In [22]:
evaluate(DecisionTreeRegressor(max_depth=10), X, y, d=3)

DecisionTreeRegressor(max_depth=10)
	R2: 0.693


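There are definitely nonlinear effects in this dataset.  House prices are not simply linearly related to the features: Ridge on degree-2 and degree-3 polynomial features (0.672 and 0.681) and depth-limited decision trees (about 0.69 to 0.72) all beat plain linear regression (0.648).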
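
The same kind of comparison can be reproduced on data where the answer is known. In this sketch the synthetic target is built from a squared term and an interaction term, so a linear model should fail and a degree-2 model should fit well. The pipeline with `StandardScaler` is an assumption of this example, used instead of `normalize=True`, and `include_bias=False` only drops the constant column.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic target with a squared term and an interaction term.
rng = np.random.default_rng(0)
X_toy = rng.uniform(-2, 2, size=(400, 2))
y_toy = (X_toy[:, 0] ** 2 + X_toy[:, 0] * X_toy[:, 1]
         + rng.normal(scale=0.2, size=400))

cv = ShuffleSplit(n_splits=20, random_state=0)
scores = {}
for d in [1, 2, 3]:
    model = make_pipeline(
        PolynomialFeatures(degree=d, include_bias=False),
        StandardScaler(),
        Ridge(alpha=0.05),
    )
    scores[d] = cross_validate(model, X_toy, y_toy, cv=cv)["test_score"].mean()
    print(d, round(scores[d], 3))
```

A clear jump in $R^2$ from degree 1 to degree 2 is the signature of a nonlinear relationship; the smaller gains on the housing data suggest the nonlinearity there is real but less clean.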

In [23]:
evaluate(KNeighborsRegressor(), X, y)

KNeighborsRegressor()
	R2: 0.256


In [24]:
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

In [25]:
X2 = StandardScaler().fit_transform(X)

In [26]:
evaluate(KNeighborsRegressor(), X2, y)

KNeighborsRegressor()
	R2: 0.723
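
Scaling lifts k-nearest neighbors from about 0.26 to 0.72, because distances are no longer dominated by the columns with the largest values. One caveat: in the cell above the scaler is fit on all of `X` before cross-validation, so the test rows influence the scaling. The effect is probably small here, but a pipeline avoids the question by refitting the scaler on each training split. The sketch below shows the pattern on synthetic data; `x1`, `x2`, `raw_knn` and `scaled_knn` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two synthetic features; x2 is on a scale 1000 times larger than x1.
rng = np.random.default_rng(0)
x1 = rng.normal(scale=1.0, size=1000)
x2 = rng.normal(scale=1000.0, size=1000)
X_toy = np.column_stack([x1, x2])
y_toy = 3.0 * x1 + x2 / 1000.0 + rng.normal(scale=0.1, size=1000)

cv = ShuffleSplit(n_splits=20, random_state=0)

# Without scaling, distances are driven almost entirely by x2.
raw_knn = KNeighborsRegressor()
# With a pipeline, the scaler is fit on the training split only.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsRegressor())

r2_raw = cross_validate(raw_knn, X_toy, y_toy, cv=cv)["test_score"].mean()
r2_scaled = cross_validate(scaled_knn, X_toy, y_toy, cv=cv)["test_score"].mean()
print(round(r2_raw, 3), round(r2_scaled, 3))
```

The pipeline can be passed to `evaluate` directly, for example `evaluate(scaled_knn, X, y)`, in place of the pre-scaled `X2`.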
