### Table of Contents
1. [Question 1](#q1)
2. [Question 2](#q2)
3. [Question 3-4](#q3-4)
4. [Question 5](#q5)
5. [Question 6-7](#q6-7)
6. [Questions 8-10](#q8-10)
7. [Question 11-12](#q11-12)
8. [Questions 13-18](#q13-18)
9. [Questions 19-20](#q19-20)

In [1]:
import math
import numpy as np
from sklearn.linear_model import LinearRegression

### Question 1 <a id="q1" />
<img src="https://github.com/yijieqiu/coursera-ml-foundations/raw/master/assignment4/q1.png" alt="Question 1" style="width: 600px;"/>

**Answer**: Deterministic noise will increase

**Explanation**: Deterministic noise arises from the gap between the complexity of the target function and that of the model. When the model is of lower order than the target function, the best hypothesis in the set approximates the target less well, so deterministic noise tends to increase.
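
As a rough numerical illustration (not part of the original assignment), the sketch below measures deterministic noise as the mean squared gap between a target and its best polynomial approximation of a given degree. The target `np.sin(np.pi * x)`, the grid on $[-1, 1]$ and the degrees 2 and 5 are assumptions chosen only for the demonstration.

```python
import numpy as np

def deterministic_noise(target, degree, n_grid=1001):
    """Mean squared gap between target and its best degree-`degree` polynomial fit on [-1, 1]."""
    x = np.linspace(-1.0, 1.0, n_grid)
    y = target(x)
    # A least-squares fit on a dense, noise-free grid approximates the best hypothesis h*
    coeffs = np.polyfit(x, y, degree)
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

def target(x):
    # Hypothetical target function, assumed for illustration only
    return np.sin(np.pi * x)

noise_low = deterministic_noise(target, degree=2)
noise_high = deterministic_noise(target, degree=5)
print("degree 2: {:.6f}, degree 5: {:.6f}".format(noise_low, noise_high))
```

Because the degree-2 polynomials are a subset of the degree-5 polynomials, the smaller set can never fit the target better, which is the effect the answer describes.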


### Question 2 <a id="q2" />
<img src="https://github.com/yijieqiu/coursera-ml-foundations/raw/master/assignment4/q2.png" alt="Question 2" style="width: 600px;"/>

**Answer**: $\mathcal{H}(10, 3) \subset \mathcal{H}(10,4)$

**Explanation**: Given the definition of the hypothesis set, a larger value of the parameter $d_0$ means the resulting hypotheses can contain more higher-order terms (more non-zero $w_i$), which implies that the hypothesis set is larger.

### Question 3-4  <a id="q3-4"/>
#### Question 3
<img src="https://github.com/yijieqiu/coursera-ml-foundations/raw/master/assignment4/q3.png" alt="Question 3" style="width: 600px;"/>

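**Answer**: $w(t+1) \leftarrow (1 - \frac{2\eta\lambda}{N})w(t) -\eta\nabla E_{in}(w(t))$

**Explanation**: Recall that gradient descent obtains the updated weight $w(t+1)$ by moving a step of size $\eta$ in the **opposite** direction of the current error gradient. With the augmented error $E_{aug}(w) = E_{in}(w) + \frac{\lambda}{N}w^Tw$ (note that $w^Tw$ is the vector form of $w^2$), the gradient is $\nabla E_{aug}(w) = \nabla E_{in}(w) + \frac{2\lambda}{N}w$. Substituting gives $w(t+1) = w(t) - \eta\left(\nabla E_{in}(w(t)) + \frac{2\lambda}{N}w(t)\right)$, which rearranges to the answer. The regularizer therefore shrinks $w(t)$ by a factor slightly below 1 at every step (weight decay).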
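
The weight-decay form $(1 - \frac{2\eta\lambda}{N})w(t) - \eta\nabla E_{in}(w(t))$ can be checked numerically against a plain gradient step on the augmented error. The sketch below is only a sanity check under assumptions: $E_{in}$ is taken to be the mean squared error of a linear model, the augmented error is $E_{in}(w) + \frac{\lambda}{N}w^Tw$, and the data, `eta` and `lam` are arbitrary made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 20, 3
X = rng.normal(size=(N, d))
y = rng.normal(size=N)
w = rng.normal(size=d)
eta, lam = 0.1, 0.5

def e_in(w):
    # Assumed in-sample error: mean squared error of a linear model
    return np.mean((X @ w - y) ** 2)

def e_aug(w):
    # Augmented error with the weight-decay regularizer (lam / N) * w^T w
    return e_in(w) + lam / N * (w @ w)

def numeric_grad(f, w, eps=1e-6):
    # Central finite differences, one coordinate at a time
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

# Plain gradient descent step on E_aug
step_direct = w - eta * numeric_grad(e_aug, w)
# Weight-decay form: shrink w, then step along the negative gradient of E_in
step_decay = (1 - 2 * eta * lam / N) * w - eta * numeric_grad(e_in, w)

print(np.allclose(step_direct, step_decay, atol=1e-6))
```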

#### Question 4
<img src="https://github.com/yijieqiu/coursera-ml-foundations/raw/master/assignment4/q4.png" alt="Question 4" style="width: 600px;"/>

**Answer**: $\|w_{reg}(\lambda)\| \leq \|w_{lin}\|$ for any $\lambda > 0$

**Explanation**: Recall that the optimal solution for unconstrained linear regression is $w_{lin} = (X^TX)^{-1}X^Ty$, whereas the optimal solution with regularization (ridge regression) is $w_{reg} = (X^TX + \lambda I)^{-1}X^Ty$. Adding $\lambda I$ with $\lambda > 0$ enlarges the matrix being inverted, which shrinks every component of the solution along the eigenvectors of $X^TX$. Hence $\|w_{reg}\| \leq \|w_{lin}\|$ for any $\lambda > 0$ (strictly smaller unless $w_{lin} = 0$), with $\|w_{reg}\| = \|w_{lin}\|$ in the special case $\lambda = 0$ (no regularization).
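
A quick numerical check of the shrinkage, offered as a sketch rather than a proof: the random data and the grid of $\lambda$ values below are assumptions, and `ridge_solution` is a hypothetical helper that evaluates the closed form quoted above.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)

def ridge_solution(X, y, lam):
    # Closed form (X^T X + lam * I)^{-1} X^T y; lam = 0 gives w_lin
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_lin = ridge_solution(X, y, 0.0)
lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
norms = [float(np.linalg.norm(ridge_solution(X, y, lam))) for lam in lambdas]

for lam, norm in zip(lambdas, norms):
    print("lambda = {:>6}: ||w_reg|| = {:.6f}".format(lam, norm))
```

The printed norms should never exceed $\|w_{lin}\|$ and should decrease as $\lambda$ grows.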

### Question 5 <a id="q5"/>
<img src="https://github.com/yijieqiu/coursera-ml-foundations/raw/master/assignment4/q5.png" alt="Question 5" style="width: 600px;"/>

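**Answer**: $\sqrt{9 + 4\sqrt{6}}$

**Explanation**: See the code snippet below, which fits a linear regression model for each candidate value of $\rho$ and computes the respective leave-one-out cross-validation error. The constant model $h_0$ has an error of $0.5$, and among the candidates only $\rho = \sqrt{9 + 4\sqrt{6}} \approx 4.336$ gives the linear model $h_1$ the same error (up to floating-point rounding). The tie can also be derived analytically, because each leave-one-out line passes exactly through the two remaining points.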

In [2]:
# Question 5

# When the model is a horizontal line, the intercept must be the average of all y
def q5_const_loocv():
    y = [0.0, 1.0, 0.0]
    loocv = 0
    for i in range(3):
        # Hack to leave out the sample at index i
        y_prime = y[:i] + y[i+1:]
        b0 = np.mean(y_prime)
        loocv += (b0 - y[i])**2
    print("Leave-one-out cross-validation error for h0 is {}".format(loocv/3))

def q5_loocv(rho):
    x = [-1.0, rho,  1.0]
    y = [0.0, 1.0, 0.0]
    loocv = 0
    for i in range(3):
        linreg = LinearRegression(fit_intercept=True)
        # Hack to leave out the sample at index i
        x_prime = np.array(x[:i] + x[i+1:])
        y_prime = np.array(y[:i] + y[i+1:])
        linreg.fit(x_prime[:, np.newaxis], y_prime)
        a1 = linreg.coef_[0]
        b1 = linreg.intercept_
        loocv += (a1 * x[i] + b1 - y[i])**2 
    print("Leave-one-out cross-validation error for h1 when rho = {} is {}".format(rho, loocv/3))

def q5():
    q5_const_loocv()
    rho1 = math.sqrt(4 + math.sqrt(3))
    rho2 = math.sqrt(math.sqrt(3) - 1)
    rho3 = math.sqrt(9 + 4 * math.sqrt(6))
    rho4 = math.sqrt(9 - math.sqrt(6))
    for rho in [rho1, rho2, rho3, rho4]:
        q5_loocv(rho)

q5()

Leave-one-out cross-validation error for h0 is 0.5
Leave-one-out cross-validation error for h1 when rho = 2.3941701709713277 is 1.135043367685941
Leave-one-out cross-validation error for h1 when rho = 0.8555996771673521 is 64.66494840795228
Leave-one-out cross-validation error for h1 when rho = 4.335661307243996 is 0.4999999999999998
Leave-one-out cross-validation error for h1 when rho = 2.5593964634688433 is 0.9868839293305472
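
The leave-one-out errors for this three-point data set can also be written in closed form, which gives a cross-check on the numerical output above. With two points left, the line $h_1$ interpolates them exactly, so the held-out squared errors are $4/(\rho-1)^2$, $1$ and $4/(\rho+1)^2$. Setting their mean equal to the constant model's $0.5$ gives $\rho^4 - 18\rho^2 - 15 = 0$, i.e. $\rho^2 = 9 + 4\sqrt{6}$. The sketch below is a minimal check of that algebra; the function names are hypothetical and it assumes $\rho \neq 1$.

```python
import math

def loocv_constant():
    # Leaving out each point in turn, h0 predicts the mean of the other two y values
    return (0.5 ** 2 + 1.0 ** 2 + 0.5 ** 2) / 3

def loocv_linear(rho):
    # With two remaining points, the fitted line interpolates them exactly (requires rho != 1)
    e_left = (2.0 / (rho - 1.0)) ** 2    # leave out (-1, 0)
    e_mid = 1.0                          # leave out (rho, 1): the line is y = 0
    e_right = (2.0 / (rho + 1.0)) ** 2   # leave out (1, 0)
    return (e_left + e_mid + e_right) / 3

rho_tie = math.sqrt(9 + 4 * math.sqrt(6))
print(loocv_constant(), loocv_linear(rho_tie))
```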




### Question 6-7 <a id="q6-7"/>
#### Question 6
<img src="https://github.com/yijieqiu/coursera-ml-foundations/raw/master/assignment4/q6.png" alt="Question 6" style="width: 600px;"/>

**Answer**: To make sure that at least one person receives correct predictions on all 5 games from the sender, after the first letter 'predicts' the outcome of the first game, the sender should target at least 16 people with the second letter.


**Explanation**:
  * There are $2^5 = 32$ possible outcome combinations for 5 games, so the sender needs to start with 32 letters to ensure that at least one person receives the correct prediction on all 5 games.
  * For each subsequent game, the sender only needs to target the half of the previous game's receivers who received the correct prediction.
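
The halving argument above can be sketched as a small simulation. This is an illustrative sketch only: `run_scheme` is a hypothetical helper, recipients are labelled by integers, and game outcomes are drawn at random with a fixed seed.

```python
import random

def run_scheme(n_games, seed=0):
    rng = random.Random(seed)
    recipients = list(range(2 ** n_games))
    letters_sent = []
    for _ in range(n_games):
        letters_sent.append(len(recipients))
        half = len(recipients) // 2
        # First half is told "home team wins" (1), second half "home team loses" (0)
        predictions = {r: (1 if i < half else 0) for i, r in enumerate(recipients)}
        outcome = rng.choice([0, 1])  # actual result, unknown to the sender
        # Keep only the recipients whose letter turned out to be correct
        recipients = [r for r in recipients if predictions[r] == outcome]
    return letters_sent, recipients

letters_sent, survivors = run_scheme(5)
print(letters_sent, len(survivors))
```

Whatever the outcomes are, exactly half of the recipients see a correct prediction each round, so the letter counts are 32, 16, 8, 4, 2 and one recipient ends up with five correct predictions.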

#### Question 7
<img src="https://github.com/yijieqiu/coursera-ml-foundations/raw/master/assignment4/q7.png" alt="Question 7" style="width: 600px;"/>

**Answer**: 370

**Explanation**: 1000 - (32 + 16 + 8 + 4 + 2 + 1) * 10 = 370
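
The arithmetic can be reproduced directly. The constant names below are hypothetical labels for the numbers in the expression above (10 per letter, 1000 received); the letter counts are taken from the same expression.

```python
COST_PER_LETTER = 10   # assumed meaning of the factor 10 in the expression above
PAYMENT = 1000         # assumed meaning of the leading 1000
letters = [32, 16, 8, 4, 2, 1]

total_letters = sum(letters)
profit = PAYMENT - COST_PER_LETTER * total_letters
print(total_letters, profit)  # 63 370
```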

### Question 8-10 <a id="q8-10"/>
#### Question 8
<img src="https://github.com/yijieqiu/coursera-ml-foundations/raw/master/assignment4/q8.png" alt="Question 8" style="width: 600px;"/>

**Answer**: 1

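**Explanation**: The hypothesis is mathematically derived rather than selected using the data, so the hypothesis set contains a single hypothesis and $M = 1$.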

#### Question 9
<img src="https://github.com/yijieqiu/coursera-ml-foundations/raw/master/assignment4/q9.png" alt="Question 9" style="width: 600px;"/>

**Answer**: 0.271

**Explanation**: The finite-bin Hoeffding inequality for a Bernoulli distribution is $\mathbb{P}(|S_N - \mathbb{E}[S_N]| \ge \epsilon) \leq 2Me^{-2N\epsilon^2}$. Substituting $N = 10{,}000$, $M = 1$ and $\epsilon = 0.01$ gives:
$$ P \leq 2 \exp(-2 \times 10000 \times 0.01^2) = 2e^{-2} \approx 0.271$$
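
The bound is easy to evaluate in code. In this minimal sketch, `hoeffding_bound` is a hypothetical helper name and its arguments mirror $N$, $\epsilon$ and $M$ from the inequality.

```python
import math

def hoeffding_bound(n_samples, epsilon, n_hypotheses=1):
    # Finite-bin Hoeffding bound: 2 * M * exp(-2 * epsilon^2 * N)
    return 2 * n_hypotheses * math.exp(-2 * epsilon ** 2 * n_samples)

bound = hoeffding_bound(n_samples=10000, epsilon=0.01, n_hypotheses=1)
print(round(bound, 3))  # 0.271
```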

#### Question 10
<img src="https://github.com/yijieqiu/coursera-ml-foundations/raw/master/assignment4/q10.png" alt="Question 10" style="width: 600px;"/>

**Answer**: $a(x) \text{ AND } g(x)$

**Explanation**: The training data used to learn $g(x)$ had already been "pre-filtered" by $a(x)$ to include only the approved cases. Therefore the two models must be used in conjunction to provide a satisfactory result.

### Question 11-12 <a id="q11-12"/>
#### Question 11
<img src="https://github.com/yijieqiu/coursera-ml-foundations/raw/master/assignment4/q11.png" alt="Question 11" style="width: 600px;"/>

**Answer**: $(X^TX + \bar{X}^T\bar{X})^{-1}(X^Ty + \bar{X}^T\bar{y})$

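**Explanation**: Extend the optimal solution of linear regression to include the virtual samples. Stacking the virtual examples under the real ones gives a combined data matrix whose Gram matrix is $X^TX + \bar{X}^T\bar{X}$ and whose cross term is $X^Ty + \bar{X}^T\bar{y}$; applying $w_{lin} = (X^TX)^{-1}X^Ty$ to the combined data yields the answer.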
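
As a sanity check, the closed form can be compared with an ordinary least-squares fit on the stacked data. The sketch below uses made-up random data; `X_bar` and `y_bar` are hypothetical virtual examples, not values from the assignment.

```python
import numpy as np

rng = np.random.default_rng(2)
X, y = rng.normal(size=(15, 3)), rng.normal(size=15)
# Hypothetical virtual examples, assumed for illustration only
X_bar, y_bar = rng.normal(size=(4, 3)), rng.normal(size=4)

# Closed form from the answer
w_formula = np.linalg.solve(X.T @ X + X_bar.T @ X_bar, X.T @ y + X_bar.T @ y_bar)

# Ordinary least squares on the stacked data set
X_all = np.vstack([X, X_bar])
y_all = np.concatenate([y, y_bar])
w_stacked, *_ = np.linalg.lstsq(X_all, y_all, rcond=None)

print(np.allclose(w_formula, w_stacked))
```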

#### Question 12
<img src="https://github.com/yijieqiu/coursera-ml-foundations/raw/master/assignment4/q12.png" alt="Question 12" style="width: 600px;"/>

**Answer**: $\bar{X} = \sqrt{\lambda}I, \bar{y} = 0$

**Explanation**: Recall optimal solution of ridge regression:
        $$w_{reg} = (X^TX + \lambda I)^{-1}X^T y$$
Let:
    $$(X^TX + \bar{X}^T\bar{X})^{-1}(X^Ty + \bar{X}^T\bar{y}) = (X^TX + \lambda I)^{-1}X^T y$$