# 1 Ridge Regression

Ridge regression is also a linear model for regression, so the formula it uses to make
predictions is the same one used for ordinary least squares. In ridge regression,
though, the coefficients ($w$) are chosen not only so that they predict well on the training
data, but also to satisfy an additional constraint: the magnitude of the coefficients
should be as small as possible. In other words, all entries of $w$ should be close to
zero. Intuitively, this means each feature should have as little effect on the outcome as
possible (which translates to having a small slope), while still predicting well. This
constraint is an example of regularisation, which means explicitly
restricting a model to avoid overfitting. The particular kind used by ridge regression
is known as L2 regularisation: it penalises the squared Euclidean norm of $w$.
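
To make the idea of the L2 penalty concrete, here is a minimal sketch of the closed-form ridge solution, $w = (X^\top X + \alpha I)^{-1} X^\top y$. It assumes no intercept and uses small synthetic data; the helper `ridge_closed_form` and the `*_demo` variables are illustrative names, not part of this notebook or of scikit-learn.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve min_w ||y - Xw||^2 + alpha * ||w||^2 (no intercept; illustrative helper)."""
    n_features = X.shape[1]
    # Normal equations with alpha added to the diagonal
    A = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Synthetic data (an assumption for this sketch, not the Boston data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5))
w_true = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y_demo = X_demo @ w_true + 0.1 * rng.normal(size=50)

w_ols = ridge_closed_form(X_demo, y_demo, alpha=0.0)     # alpha = 0 is ordinary least squares
w_ridge = ridge_closed_form(X_demo, y_demo, alpha=10.0)  # alpha > 0 shrinks the coefficients

print("OLS coefficient norm:  ", np.linalg.norm(w_ols))
print("Ridge coefficient norm:", np.linalg.norm(w_ridge))
```

With `alpha=0.0` the penalty vanishes and the solution coincides with ordinary least squares; any positive `alpha` yields a coefficient vector with a smaller norm.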

## 1.1 Data

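We use the extended Boston Housing dataset provided by `mglearn`, which has 104 features: the 13 original features plus their pairwise products.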

In [4]:
import mglearn
X, y = mglearn.datasets.load_extended_boston()

## 1.2 Model

**Q1**: Compare the training set and test set scores ($R^2$) of LR and Ridge. Use [*train_test_split*](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) with *random_state=0* to split the data.

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
lr = LinearRegression().fit(X_train, y_train)
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lr.score(X_test, y_test)))

Training set score: 0.95
Test set score: 0.61


In [8]:
from sklearn.linear_model import Ridge

ridge = Ridge().fit(X_train, y_train)
print("Training set score: {:.2f}".format(ridge.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge.score(X_test, y_test)))

Training set score: 0.89
Test set score: 0.75
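
The `score` method used above returns the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. As a minimal sketch of what that number means, the hypothetical helper `r2` below computes it by hand on a few made-up values (these are not taken from the Boston data).

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of predicting the mean
    return 1.0 - ss_res / ss_tot

y_true_demo = [3.0, 5.0, 7.0, 9.0]            # made-up targets, mean is 6.0
print(r2(y_true_demo, y_true_demo))           # perfect predictions
print(r2(y_true_demo, [6.0, 6.0, 6.0, 6.0]))  # always predicting the mean
print(r2(y_true_demo, [9.0, 7.0, 5.0, 3.0]))  # worse than predicting the mean
```

A score of 1 means perfect prediction, 0 means the model is no better than always predicting the mean of the targets, and negative values mean it is worse than that baseline.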


The Ridge model makes a trade-off between the simplicity of the model (near-zero
coefficients) and its performance on the training set. How much importance the
model places on simplicity versus training set performance can be specified by the
user, using the `alpha` parameter. In the previous example, we used the default
`alpha=1.0`. There is no reason to expect this value to give the best trade-off, though:
the optimum setting of `alpha` depends on the particular dataset we are using.
Increasing `alpha` forces the coefficients to move more toward zero, which decreases
training set performance but might help generalisation.
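
The shrinking effect of `alpha` can be checked directly. The following is a minimal sketch on synthetic data (an assumption made so the example is self-contained; the `*_demo` names are illustrative) that fits `Ridge` for several values of `alpha` and prints the Euclidean norm of the learned coefficients.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem: only the first two features matter
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(size=60)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
coef_norms = []
for a in alphas:
    model = Ridge(alpha=a).fit(X_demo, y_demo)
    coef_norms.append(np.linalg.norm(model.coef_))
    print("alpha={:<6} ||w|| = {:.3f}".format(a, coef_norms[-1]))
```

The norm decreases as `alpha` grows. Whether the test score improves along the way depends on the dataset, which is why `alpha` is usually tuned on held-out data.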

**Q2**: Using the same *X_train* and *X_test* as in **Q1**, report the training set and test set scores ($R^2$) of Ridge with `alpha=10` and of Ridge with `alpha=0.1`.

In [9]:
ridge10 = Ridge(alpha=10).fit(X_train, y_train)
print("Training set score: {:.2f}".format(ridge10.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge10.score(X_test, y_test)))

Training set score: 0.79
Test set score: 0.64


In [10]:
ridge01 = Ridge(alpha=0.1).fit(X_train, y_train)
print("Training set score: {:.2f}".format(ridge01.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge01.score(X_test, y_test)))

Training set score: 0.93
Test set score: 0.77


# 2 Lasso Regression

An alternative to Ridge for regularising linear regression is Lasso. As with ridge regression, the lasso also restricts coefficients to be close to zero, but in a slightly different way, called L1 regularisation.

The consequence of L1 regularisation is that when using the lasso, some coefficients are exactly zero. This means some features are entirely ignored by the model. This can be seen as a form of automatic feature selection.

Having some coefficients be exactly zero often makes a model easier to interpret, and can reveal the most important features of your model.
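
Why L1 produces exact zeros while L2 does not can be sketched in the idealised case of orthonormal features, where each penalty acts on every least-squares coefficient separately: L1 applies soft-thresholding, L2 applies a uniform rescaling. The helpers `soft_threshold` and `ridge_shrink` are illustrative names, and the exact threshold and scaling factor depend on how the loss is normalised, so treat the numbers as a simplification.

```python
import numpy as np

def soft_threshold(z, t):
    """L1 case: move each coefficient toward zero by t, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ridge_shrink(z, alpha):
    """L2 case: scale each coefficient by 1 / (1 + alpha); never exactly zero."""
    return z / (1.0 + alpha)

# Hypothetical least-squares coefficients
z = np.array([3.0, 0.5, -0.2, -2.0])

w_l1 = soft_threshold(z, 1.0)
w_l2 = ridge_shrink(z, 1.0)
print("L1 (soft-threshold):", w_l1)
print("L2 (rescale):       ", w_l2)
print("Non-zero after L1:", np.count_nonzero(w_l1), "| after L2:", np.count_nonzero(w_l2))
```

Coefficients whose magnitude is below the threshold are set to exactly zero under L1, which is the mechanism behind the automatic feature selection described above.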

**Q3**: Apply the lasso with default parameters to the Boston Housing dataset and use the same training and test data as in **Q1**. Report the $R^2$ score of the training set and test set, respectively. Also, report the number of features used.

In [14]:
from sklearn.linear_model import Lasso
import numpy as np

lasso = Lasso().fit(X_train, y_train)
print("Training set score: {:.2f}".format(lasso.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lasso.score(X_test, y_test)))
print("Number of features used:", np.sum(lasso.coef_ != 0))

Training set score: 0.29
Test set score: 0.21
Number of features used: 4


**Q4**: Analyse why the lasso performs poorly in **Q3** and suggest a possible solution.

As you can see, Lasso does quite badly, both on the training and the test set. This indicates that we are underfitting, and we find that it used only 4 of the 104 features. Similarly to Ridge, the Lasso also has a regularisation parameter, alpha, that controls how strongly coefficients are pushed toward zero. In the previous example, we used the default of alpha=1.0. To reduce underfitting, let’s try decreasing alpha.
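
The link between `alpha` and the number of features used can be sketched on synthetic data (an assumption to keep the example self-contained; the `*_demo` names and the chosen `alpha` values are illustrative). Only three of the twenty features carry signal here.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 20 features, only the first three influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 20))
y_demo = (3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + X_demo[:, 2]
          + 0.5 * rng.normal(size=100))

n_used = {}
for a in [10.0, 1.0, 0.1, 0.0001]:
    model = Lasso(alpha=a, max_iter=100000).fit(X_demo, y_demo)
    n_used[a] = int(np.sum(model.coef_ != 0))  # count non-zero coefficients
    print("alpha={:<7} features used: {:2d}  train R^2: {:.2f}".format(
        a, n_used[a], model.score(X_demo, y_demo)))
```

A very large `alpha` zeroes out every coefficient (underfitting), while a very small one keeps nearly all features and behaves much like unregularised linear regression.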

**Q5**: Increase `max_iter` from its default to 100,000, set `alpha` to 0.01, and repeat **Q3**.

In [18]:
lasso001 = Lasso(alpha=0.01, max_iter=100000).fit(X_train, y_train)
print("Training set score: {:.2f}".format(lasso001.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lasso001.score(X_test, y_test)))
print("Number of features used:", np.sum(lasso001.coef_ != 0))

Training set score: 0.90
Test set score: 0.77
Number of features used: 33


**Q6**: Increase `max_iter` from its default to 100,000, set `alpha` to 0.000001, and repeat **Q3**. Compare the results to those of LR.

In [28]:
lasso00001 = Lasso(alpha=0.000001, max_iter=100000).fit(X_train, y_train)
print("Training set score: {:.2f}".format(lasso00001.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lasso00001.score(X_test, y_test)))
print("Number of features used:", np.sum(lasso00001.coef_ != 0))

Training set score: 0.95
Test set score: 0.61
Number of features used: 104




In [26]:
LR = LinearRegression().fit(X_train, y_train)
print("Training set score: {:.2f}".format(LR.score(X_train, y_train)))
print("Test set score: {:.2f}".format(LR.score(X_test, y_test)))
print("Number of features used:", np.sum(LR.coef_ != 0))

Training set score: 0.95
Test set score: 0.61
Number of features used: 104


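**Q7**: Analyse the results in **Q6** and explain the likely reason.

If alpha is set too low, the effect of regularisation is essentially removed and the model overfits, giving a result similar to LinearRegression. Here the training and test scores (0.95 and 0.61) and the number of features used (104) are the same as those of LR.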

# 3 PCA

## 3.1 Load Iris Dataset

The Iris dataset is one of the datasets that ship with scikit-learn and can be loaded without downloading any file from an external website. The code below, however, reads a copy of the data from the UCI Machine Learning Repository into a pandas DataFrame.

In [35]:
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
# load dataset into Pandas DataFrame
df = pd.read_csv(url, names=['sepal length','sepal width','petal length','petal width','target'])
df.head()

|   | sepal length | sepal width | petal length | petal width | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
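
Since the data also ships with scikit-learn, an offline alternative is possible. The sketch below uses `load_iris(as_frame=True)`; note that the bundled copy names its columns with a `(cm)` suffix and its classes `setosa`, `versicolor`, `virginica` rather than `Iris-setosa` and so on, so it is not a drop-in replacement for `df` above. The name `df_bundled` is illustrative.

```python
from sklearn.datasets import load_iris

# Load the copy bundled with scikit-learn (requires pandas for as_frame=True)
iris = load_iris(as_frame=True)
df_bundled = iris.frame.copy()

# Replace the integer class codes 0, 1, 2 by the species names
df_bundled["target"] = iris.target_names[iris.target.to_numpy()]

print(df_bundled.shape)
print(df_bundled.head())
```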


## 3.2 Standardize the Data

PCA is affected by scale, so you need to scale the features in your data before applying it. Use `StandardScaler` to standardise the dataset’s features to zero mean and unit variance, which many machine learning algorithms need in order to perform well. To see the negative effect that unscaled data can have, scikit-learn has a section on the [effects of not standardizing your data](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py).

In [40]:
from sklearn.preprocessing import StandardScaler
features = ['sepal length', 'sepal width', 'petal length', 'petal width']
# Separating out the features
x = df.loc[:, features].values
# Separating out the target
y = df.loc[:,['target']].values
# Standardizing the features
x = StandardScaler().fit_transform(x)


StandDf = pd.DataFrame(data = x,columns = features)
StandardDf = pd.concat([StandDf, df[['target']]], axis = 1)
StandardDf.head()

|   | sepal length | sepal width | petal length | petal width | target |
|---|---|---|---|---|---|
| 0 | -0.900681 | 1.032057 | -1.341272 | -1.312977 | Iris-setosa |
| 1 | -1.143017 | -0.124958 | -1.341272 | -1.312977 | Iris-setosa |
| 2 | -1.385353 | 0.337848 | -1.398138 | -1.312977 | Iris-setosa |
| 3 | -1.506521 | 0.106445 | -1.284407 | -1.312977 | Iris-setosa |
| 4 | -1.021849 | 1.26346 | -1.341272 | -1.312977 | Iris-setosa |
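
What `StandardScaler` computes can be sketched by hand: subtract each column's mean and divide by its standard deviation (the population standard deviation, `ddof=0`). The small array `x_demo` below is made up for illustration and is not the Iris data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales
x_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0],
                   [4.0, 700.0]])

# Manual standardisation, column by column
x_manual = (x_demo - x_demo.mean(axis=0)) / x_demo.std(axis=0)

# The same transformation with scikit-learn
x_scaled = StandardScaler().fit_transform(x_demo)

print("Matches manual computation:", np.allclose(x_manual, x_scaled))
print("Column means:", x_scaled.mean(axis=0))
print("Column stds: ", x_scaled.std(axis=0))
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates the variance that PCA tries to capture.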


## 3.3 Project the Data into 2 Dimensions

Project the original 4-dimensional data into 2 dimensions using PCA with the first 2 components, and show the corresponding new data table.

In [42]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(
    data=principalComponents,
    columns=['principal component 1', 'principal component 2'],
)
finalDf = pd.concat([principalDf, df[['target']]], axis = 1)
finalDf.head()

|   | principal component 1 | principal component 2 | target |
|---|---|---|---|
| 0 | -2.264542 | 0.505704 | Iris-setosa |
| 1 | -2.086426 | -0.655405 | Iris-setosa |
| 2 | -2.36795 | -0.318477 | Iris-setosa |
| 3 | -2.304197 | -0.575368 | Iris-setosa |
| 4 | -2.388777 | 0.674767 | Iris-setosa |
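
Two natural follow-up checks are how much variance the two components retain, and that the projection is what a singular value decomposition of the standardised data gives. The sketch below is self-contained and therefore uses scikit-learn's bundled copy of Iris rather than the UCI file, so individual values may differ slightly from the table above; `x_std`, `pca_demo`, `z_pca` and `z_svd` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardise the four features of the bundled Iris data
x_std = StandardScaler().fit_transform(load_iris().data)

# PCA with the first two components
pca_demo = PCA(n_components=2)
z_pca = pca_demo.fit_transform(x_std)
print("Explained variance ratio:", pca_demo.explained_variance_ratio_)
print("Total variance retained: ", pca_demo.explained_variance_ratio_.sum())

# Same projection via SVD: rows of vt are the principal directions
_, _, vt = np.linalg.svd(x_std, full_matrices=False)
z_svd = x_std @ vt[:2].T

# The sign of each component is arbitrary, so compare absolute values
print("Matches SVD projection:", np.allclose(np.abs(z_pca), np.abs(z_svd)))
```

The explained variance ratio indicates how much information is lost by dropping the remaining two components, which is worth checking before relying on the 2-dimensional view.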
