Spot-Check Regression Algorithms
Date: 2018/11/09

Spot-checking is a way of discovering which algorithms perform well on your machine learning problem. You cannot know which algorithms are best suited to your problem beforehand.
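As a rough illustration of the idea, the sketch below evaluates a few regressors on the same cross-validation folds and ranks them. The synthetic data from `make_regression`, the `models` dictionary, and the `spot_check` name are assumptions made for this example only, not part of the original notebook.

```python
# Spot-check several regressors on identical folds (synthetic data, illustration only)
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for a real dataset (assumption for this sketch)
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=7)

# Candidate algorithms, all with default hyperparameters
models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
}

# Use the same shuffled 10-fold split for every model so scores are comparable
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

spot_check = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=kfold, scoring="neg_mean_squared_error")
    spot_check[name] = scores.mean()

# Scores are negated mean squared errors, so values closer to zero are better
for name, score in sorted(spot_check.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Reusing one `KFold` object keeps the comparison fair, since every algorithm is scored on exactly the same train/test splits.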

### 12.2 Linear Machine Learning Algorithms
#### 12.2.1 Linear Regression

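Linear regression is often described as assuming that the input variables have a Gaussian distribution; strictly, the classical assumption is that the model's residuals are normally distributed. It is also assumed that the input variables are relevant to the output variable and that they are not highly correlated with each other.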
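One way to check the assumption about correlated inputs is to inspect the pairwise correlation matrix before fitting. The sketch below is a minimal example on made-up data; the column names `a`, `b`, `c` and the 0.9 threshold are arbitrary choices for illustration.

```python
# Flag pairs of input variables that are highly correlated (illustrative data)
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)                   # independent of a
c = a + rng.normal(scale=0.05, size=n)   # almost a copy of a
inputs = pd.DataFrame({"a": a, "b": b, "c": c})

# Absolute Pearson correlation between every pair of columns
corr = inputs.corr().abs()

threshold = 0.9  # arbitrary cut-off chosen for this sketch
cols = list(corr.columns)
correlated_pairs = [
    (cols[i], cols[j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if corr.iloc[i, j] > threshold
]
print(correlated_pairs)
```

On a real dataset, the same check could be run on the input columns of the dataframe before deciding whether to drop or combine variables.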

In [11]:
# Linear Regression
import pandas
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

# Load the whitespace-delimited data file, which has no header row
url = "https://goo.gl/sXleFv"
names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]
dataframe = pandas.read_csv(url, sep=r"\s+", names=names)
array = dataframe.values
X = array[:, 0:13]  # the 13 input attributes
Y = array[:, 13]    # the output attribute, MEDV
num_folds = 10
# Unshuffled 10-fold split; recent scikit-learn versions reject random_state unless shuffle=True
kfold = KFold(n_splits=num_folds)
model = LinearRegression()
scoring = 'neg_mean_squared_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
# Mean negated MSE across the folds (closer to zero is better)
print(results.mean())

-34.705255944524446


#### 12.2.2 Ridge Regression

Ridge regression is an extension of linear regression where the loss function is modified to minimize the complexity of the model, measured as the sum of the squared coefficient values (the squared L2-norm).
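The effect of this penalty can be seen by comparing coefficient sizes as the penalty strength grows. The sketch below uses synthetic data and a hand-picked set of `alpha` values, both assumptions for illustration; `alpha` is scikit-learn's name for the penalty weight in `Ridge`.

```python
# Show how the ridge penalty shrinks the sum of squared coefficients
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic stand-in for a real dataset (assumption for this sketch)
X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=7)

# Baseline: ordinary least squares, with no penalty
ols_norm = float(np.sum(LinearRegression().fit(X, y).coef_ ** 2))
print(f"OLS: sum of squared coefficients = {ols_norm:.2f}")

# Larger alpha means a stronger penalty and smaller coefficients
ridge_norms = {}
for alpha in (1.0, 10.0, 100.0):
    ridge = Ridge(alpha=alpha).fit(X, y)
    ridge_norms[alpha] = float(np.sum(ridge.coef_ ** 2))
    print(f"Ridge(alpha={alpha}): sum of squared coefficients = {ridge_norms[alpha]:.2f}")
```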

In [13]:
# Ridge Regression
import pandas
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import Ridge

# Load the whitespace-delimited data file, which has no header row
url = "https://goo.gl/sXleFv"
names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]
dataframe = pandas.read_csv(url, sep=r"\s+", names=names)
array = dataframe.values
X = array[:, 0:13]  # the 13 input attributes
Y = array[:, 13]    # the output attribute, MEDV
num_folds = 10
# Unshuffled 10-fold split; recent scikit-learn versions reject random_state unless shuffle=True
kfold = KFold(n_splits=num_folds)
model = Ridge()
scoring = 'neg_mean_squared_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
# Mean negated MSE across the folds (closer to zero is better)
print(results.mean())

-34.07824620925937


#### 12.2.3 LASSO Regression
The Least Absolute Shrinkage and Selection Operator (or LASSO for short) is a modification of linear regression, like ridge regression, where the loss function is modified to minimize the complexity of the model, measured as the sum of the absolute coefficient values (also called the L1-norm).
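A practical consequence of the L1 penalty is that some coefficients can be driven to exactly zero, which is where the "selection" in the name comes from. The sketch below contrasts this with ridge regression on synthetic data in which only a few inputs carry signal; the dataset and the `alpha=1.0` setting are assumptions for illustration.

```python
# Compare how many coefficients Lasso and Ridge set to exactly zero
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data where only 3 of the 10 inputs influence the output
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=7
)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Count coefficients that are exactly zero in each fitted model
lasso_zeros = int(np.sum(lasso.coef_ == 0))
ridge_zeros = int(np.sum(ridge.coef_ == 0))
print(f"Lasso zero coefficients: {lasso_zeros}")
print(f"Ridge zero coefficients: {ridge_zeros}")
```

Ridge shrinks coefficients toward zero without eliminating them, so it typically reports no exact zeros here, while Lasso drops some of the uninformative inputs entirely.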

In [16]:
# Lasso Regression
import pandas
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import Lasso

# Load the whitespace-delimited data file, which has no header row
url = "https://goo.gl/sXleFv"
names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]
dataframe = pandas.read_csv(url, sep=r"\s+", names=names)
array = dataframe.values
X = array[:, 0:13]  # the 13 input attributes
Y = array[:, 13]    # the output attribute, MEDV
num_folds = 10
# Unshuffled 10-fold split; recent scikit-learn versions reject random_state unless shuffle=True
kfold = KFold(n_splits=num_folds)
model = Lasso()
scoring = 'neg_mean_squared_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
# Mean negated MSE across the folds (closer to zero is better)
print(results.mean())

-34.46408458830233
