# Assignment 1: Machine Learning Basics
---

## Exercise 1. Dataset Inspection

For this assignment, you will need the packages `sklearn` ([link](https://scikit-learn.org/stable/index.html)) and `matplotlib` ([link](https://matplotlib.org/)). 

Please complete the following steps.


- Load the diabetes dataset with `load_diabetes` from `sklearn.datasets`: set `return_X_y` to `True` and store the features in `diabetes_X` and the labels in `diabetes_y`.


- Check the dimensions of `diabetes_X` and `diabetes_y`.


- Visual inspection is usually the first step after loading a dataset and a crucial step before the actual learning process. Here, however, the feature space is high-dimensional and therefore difficult to visualize. Let’s reduce it to 2-D using the following code.
<br><br>

                    from sklearn import decomposition
                    pca = decomposition.PCA(n_components=2)
                    pca.fit(diabetes_X)
                    X_2d = pca.transform(diabetes_X)

Tip: the first principal component can be accessed as `X_2d[:,0]`; the second principal component can be accessed as `X_2d[:,1]`.
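The loading, dimension-check, and projection steps above can be sketched as follows. This is a minimal sketch, assuming the dataset is loaded with `load_diabetes` from `sklearn.datasets`; `fit_transform` is used here as a shorthand for calling `fit` and then `transform`.

```python
from sklearn import datasets, decomposition

# Load the features and labels as NumPy arrays.
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

# Check the dimensions: 442 samples with 10 features each.
print(diabetes_X.shape)  # (442, 10)
print(diabetes_y.shape)  # (442,)

# Project the features onto the first two principal components.
pca = decomposition.PCA(n_components=2)
X_2d = pca.fit_transform(diabetes_X)
print(X_2d.shape)  # (442, 2)
```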

Now, let's use `matplotlib` to plot `X_2d` and produce Figure 1. Remember to provide the axis labels as they appear in the figure.

<p align="center">
  <img src="https://user-images.githubusercontent.com/96804013/151662229-62615e47-38f6-4024-9396-47e97b11211d.png"/>

</p>

<p align="center">
  <em>Figure 1: Visual inspection of the dataset.</em>
</p>





The default labels of this dataset (i.e., `diabetes_y`) are numerical rather than categorical. This means that if we want to perform classification instead of regression, we need to convert the labels from numerical to categorical. This kind of conversion is common in machine learning projects.

Here is a conversion method: compute the average value of `diabetes_y` and create another variable `y_binary` that stores the binary labels `1` and `-1`. Elements of `diabetes_y` that are greater than the average are assigned `1`; all others are assigned `-1`.
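One way to express this thresholding is with `np.where`. The sketch below assumes `diabetes_y` was loaded with `load_diabetes`; the name `threshold` is introduced only for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes

_, diabetes_y = load_diabetes(return_X_y=True)

# Threshold at the mean: labels above the mean become 1, all others -1.
threshold = diabetes_y.mean()  # hypothetical helper name
y_binary = np.where(diabetes_y > threshold, 1, -1)

# Count how many examples fall in each class.
print((y_binary == 1).sum(), (y_binary == -1).sum())
```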

Let's visualize the dataset again, this time with the positive examples marked as `+` and the negative examples marked as `x`. If successful, you should see Figure 2. Remember to provide the axis labels and legend as they appear in the figure.

Tip: `train_test_split`, imported from `sklearn.model_selection`, will be useful for splitting the data in Exercise 2.

<p align="center">
  <img src="https://user-images.githubusercontent.com/96804013/151720700-525f60a5-a9f4-4496-9c00-a96152d023c2.png" />
</p>

<p align="center">
  <em>Figure 2: Visual inspection of the dataset with binary labels.</em>
</p>
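A plot of this kind can be sketched with two `scatter` calls, one per class. This is only a sketch: the axis labels and legend entries below are placeholders (assumptions), so replace them with the text shown in Figure 2.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA

diabetes_X, diabetes_y = load_diabetes(return_X_y=True)
X_2d = PCA(n_components=2).fit_transform(diabetes_X)
y_binary = np.where(diabetes_y > diabetes_y.mean(), 1, -1)

# Boolean masks selecting the positive and negative examples.
pos = y_binary == 1
neg = ~pos

fig, ax = plt.subplots()
ax.scatter(X_2d[pos, 0], X_2d[pos, 1], marker="+", label="positive")
ax.scatter(X_2d[neg, 0], X_2d[neg, 1], marker="x", label="negative")
# Placeholder text: match the labels and legend to Figure 2.
ax.set_xlabel("First principal component")
ax.set_ylabel("Second principal component")
ax.legend()
# Call plt.show() to display the figure in an interactive session.
```

Calling `scatter` once on all of `X_2d`, without the masks and legend, gives a plot in the style of Figure 1.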

## Exercise 2. Bias-variance Tradeoff


First, let's generate a 1-D dataset using the following code.


                    import numpy as np
                    data_X = np.array([i*np.pi/180 for i in range(1,150,2)])
                    data_y = np.sin(data_X) + np.random.normal(0,0.15,len(data_X))


Split `data_X` and `data_y` into a training set that contains 50 samples and a test set that contains 25 samples.

Note: if you want repeatable experiments, you need to fix the random seed.
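The split and the fixed seed can be sketched as follows. The seed value `0` and the name `seed` are arbitrary choices for illustration; passing an integer as `test_size` requests that many test samples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

seed = 0  # hypothetical seed; any fixed integer gives repeatable runs
np.random.seed(seed)

# 75 points between 1 and 149 degrees (in radians), with Gaussian noise.
data_X = np.array([i * np.pi / 180 for i in range(1, 150, 2)])
data_y = np.sin(data_X) + np.random.normal(0, 0.15, len(data_X))

# An integer test_size is the absolute number of test samples.
X_train, X_test, y_train, y_test = train_test_split(
    data_X, data_y, test_size=25, random_state=seed
)
print(len(X_train), len(X_test))  # 50 25
```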

Use the following code to train an ordinary linear regression model.

                    from sklearn.linear_model import LinearRegression
                    lr = LinearRegression()
                    # scikit-learn expects a 2-D feature array, so reshape the 1-D inputs
                    lr.fit(X_train.reshape(-1, 1), y_train)

The coefficients and intercept can be accessed as `lr.coef_` and `lr.intercept_`. Use them to compute the training error and test error. Plot the fitted line along with the training set and test set as shown in Figure 3 (you may not have the same plot because the dataset is generated randomly).

<p align="center">
  <img src="https://user-images.githubusercontent.com/96804013/151720757-4ea26cc2-17c0-45fb-8273-5024a242ebea.png" />
</p>

<p align="center">
  <em>Figure 3: Ordinary linear regression with the training set and test set.</em>
</p>
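The error computation from `lr.coef_` and `lr.intercept_` can be sketched as below. The assignment does not say which error measure to use, so this sketch assumes the mean squared error; the helper `mse_from_params` and the seed `0` are hypothetical names and values introduced for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

np.random.seed(0)
data_X = np.array([i * np.pi / 180 for i in range(1, 150, 2)])
data_y = np.sin(data_X) + np.random.normal(0, 0.15, len(data_X))
X_train, X_test, y_train, y_test = train_test_split(
    data_X, data_y, test_size=25, random_state=0
)

lr = LinearRegression()
lr.fit(X_train.reshape(-1, 1), y_train)  # features must be 2-D


def mse_from_params(coef, intercept, X, y):
    """Mean squared error of the line y = coef[0] * x + intercept."""
    predictions = coef[0] * X + intercept
    return np.mean((predictions - y) ** 2)


train_error = mse_from_params(lr.coef_, lr.intercept_, X_train, y_train)
test_error = mse_from_params(lr.coef_, lr.intercept_, X_test, y_test)
print(f"training error: {train_error:.4f}, test error: {test_error:.4f}")
```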

<br>

The objective function of ordinary linear regression is the sum of squared errors between the predicted labels and the true labels:

<br>
$$J(\theta,\theta_0) = \sum_{i=1}^{n}\left(\left(\theta \cdot X^{(i)}+\theta_0\right) - y^{(i)}\right)^2.$$
<br>


Another type of linear regression, called "ridge regression," has a regularization term added to the objective function:


<br>
$$J(\theta,\theta_0) = \sum_{i=1}^{n}\left(\left(\theta \cdot X^{(i)}+\theta_0\right) - y^{(i)}\right)^2 + \alpha\lVert\theta\rVert^2.$$
<br>



Let's create a ridge regression model using the following code.
<br>

                    from sklearn.linear_model import Ridge
                    rr = Ridge(alpha=10)
                    # scikit-learn expects a 2-D feature array, so reshape the 1-D inputs
                    rr.fit(X_train.reshape(-1, 1), y_train)

<br>

Lastly, use the coefficients `rr.coef_` and intercept `rr.intercept_` to compute the training error and test error. Plot the fitted line along with the training set and test set as in Figure 3. Do this with `alpha` set to 1, 100, and 1000, respectively. What are the differences? What causes them? And when will we obtain the same line as the ordinary linear regression shown in Figure 3?
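The sweep over `alpha` can be organized as a loop. The sketch below again assumes the mean squared error as the error measure and a fixed seed of `0`; the dictionary `slopes` is a hypothetical helper for comparing the fitted coefficients. Plotting and interpretation are left to you.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

np.random.seed(0)
data_X = np.array([i * np.pi / 180 for i in range(1, 150, 2)])
data_y = np.sin(data_X) + np.random.normal(0, 0.15, len(data_X))
X_train, X_test, y_train, y_test = train_test_split(
    data_X, data_y, test_size=25, random_state=0
)

slopes = {}  # fitted slope for each alpha
for alpha in (1, 100, 1000):
    rr = Ridge(alpha=alpha)
    rr.fit(X_train.reshape(-1, 1), y_train)  # features must be 2-D

    # Errors computed directly from the fitted slope and intercept.
    train_error = np.mean((rr.coef_[0] * X_train + rr.intercept_ - y_train) ** 2)
    test_error = np.mean((rr.coef_[0] * X_test + rr.intercept_ - y_test) ** 2)

    slopes[alpha] = rr.coef_[0]
    print(
        f"alpha={alpha}: slope={rr.coef_[0]:.4f}, intercept={rr.intercept_:.4f}, "
        f"training error={train_error:.4f}, test error={test_error:.4f}"
    )
```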

