(a)

To calculate the derivative, we need to write down the related equations for forward propagation first.

$$ z_2^{(i)} = w_{1,2}^{[1]} x_1^{(i)} +  w_{2,2}^{[1]} x_2^{(i)} + w_{0,2}^{[1]}$$

$$ h_2^{(i)} = g(z_2^{(i)}) $$

$$z^{(i)} = w_1^{[2]} h_1^{(i)} + w_2^{[2]} h_2^{(i)} + w_3^{[2]} h_3^{(i)} + w_0^{[2]}$$

$$ o^{(i)} = g(z^{(i)})$$

where $g$ is the sigmoid function. The loss of a single example is:

$$l^{(i)} = (o^{(i)} - y^{(i)})^2$$

Using the chain rule, we have

$$
\begin{align}
\frac{\partial l^{(i)}}{\partial w_{1,2}^{[1]}} 
&= \frac{\partial l^{(i)}}{\partial o^{(i)}} \frac{\partial o^{(i)}}{\partial z^{(i)}} \frac{\partial z^{(i)}}{\partial h_2^{(i)}} \frac{\partial h_2^{(i)}}{\partial z_2^{(i)}} \frac{\partial z_2^{(i)}}{\partial w_{1,2}^{[1]}}
\\
&=2(o^{(i)} - y^{(i)}) o^{(i)}(1- o^{(i)}) w_2^{[2]} h_2^{(i)} (1 - h_2^{(i)})x_1^{(i)}
\end{align}
$$

Averaging over the $m$ training examples gives

$$\frac{\partial l}{\partial w_{1,2}^{[1]}}  = \frac{1}{m} \sum_{i=1}^m \frac{\partial l^{(i)}}{\partial w_{1,2}^{[1]}} = \frac{2}{m} \sum_{i=1}^m (o^{(i)} - y^{(i)}) o^{(i)}(1- o^{(i)}) w_2^{[2]} h_2^{(i)} (1 - h_2^{(i)})x_1^{(i)} $$

So the gradient descent update to $w_{1,2}^{[1]}$ is
$$w_{1,2}^{[1]} := w_{1,2}^{[1]} - \alpha \frac{\partial l}{\partial w_{1,2}^{[1]}} = w_{1,2}^{[1]} - \frac{2\alpha}{m} w_2^{[2]} \sum_{i=1}^m (o^{(i)} - y^{(i)}) o^{(i)}(1- o^{(i)}) h_2^{(i)} (1 - h_2^{(i)})x_1^{(i)} $$
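A quick way to sanity-check a hand-derived gradient like this one is to compare it against a finite-difference estimate. The sketch below does that with NumPy under some assumptions: the helper names (`sigmoid`, `forward`, `loss`, `grad_w12`), the random data, and the random weights are all made up for illustration and are not part of the problem set. The weight matrix is stored so that `W1[j-1, k-1]` corresponds to $w_{k,j}^{[1]}$, which means $w_{1,2}^{[1]}$ lives at `W1[1, 0]`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W1, b1, w2, b2, X):
    """X: (m, 2); W1: (3, 2); b1: (3,); w2: (3,); b2: scalar."""
    H = sigmoid(X @ W1.T + b1)   # hidden activations h_j^{(i)}, shape (m, 3)
    o = sigmoid(H @ w2 + b2)     # outputs o^{(i)}, shape (m,)
    return H, o

def loss(W1, b1, w2, b2, X, y):
    """Mean squared error, l = (1/m) * sum_i (o^{(i)} - y^{(i)})^2."""
    _, o = forward(W1, b1, w2, b2, X)
    return np.mean((o - y) ** 2)

def grad_w12(W1, b1, w2, b2, X, y):
    """Analytic derivative of the loss with respect to w_{1,2}^{[1]}."""
    H, o = forward(W1, b1, w2, b2, X)
    h2 = H[:, 1]                 # activations of hidden unit 2
    x1 = X[:, 0]                 # first input feature
    return 2.0 * np.mean((o - y) * o * (1 - o) * w2[1] * h2 * (1 - h2) * x1)

# Hypothetical data and weights, used only to exercise the formula.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = rng.integers(0, 2, size=5).astype(float)
W1 = rng.normal(size=(3, 2))
b1 = rng.normal(size=3)
w2 = rng.normal(size=3)
b2 = rng.normal()

# Central finite difference in the single entry w_{1,2}^{[1]} = W1[1, 0].
eps = 1e-6
W1_plus = W1.copy()
W1_plus[1, 0] += eps
W1_minus = W1.copy()
W1_minus[1, 0] -= eps
numeric = (loss(W1_plus, b1, w2, b2, X, y)
           - loss(W1_minus, b1, w2, b2, X, y)) / (2 * eps)
analytic = grad_w12(W1, b1, w2, b2, X, y)

print(numeric, analytic)
```

If the derivation is right, the two printed values should agree to several decimal places; a large gap would usually point to an indexing or sign mistake in the chain rule.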

(2)

The forward propagation is given by

$$h = f(W^{[1]} x + W_0^{[1]})$$

$$o = f(W^{[2]} h + W_0^{[2]})$$

(b)

Yes, it is possible.

From the scatter plot of the dataset, we can see that when $x_1 < 0.5$ or $x_2 < 0.5$ or $x_1 + x_2 > 4$, the example is in class 1. Otherwise, the example is in class 0.

Since there are 3 neurons in the hidden layer, we can store 3 pieces of information

1. Is $x_1 \leq 0.5$?
2. Is $x_2 \leq 0.5$?
3. Is $x_1 + x_2 \geq 4$?

in the three neurons respectively, and then use these 3 pieces of information to determine which class the example belongs to.

Let's say we use $h_1$, $h_2$ and $h_3$ to store these 3 pieces of information, and analyze $h_1$ first. When $x_1 \leq 0.5$, the class label is $1$, so we want $h_1 = 1$, consistent with the class label. Since $h_1 = f(z_1)$ and $f$ outputs $1$ exactly when its input is non-negative, we want $z_1 \geq 0$ whenever $x_1 \leq 0.5$. To achieve this, we can set

$$z_1 = 0.5 - x_1$$

Similarly, we want $h_2 = 1$ when $x_2 \leq 0.5$ and $h_3 = 1$ when $x_1 + x_2 \geq 4$. From a similar analysis, we can set

$$z_2 = 0.5 - x_2$$

and

$$ z_3 = x_1 + x_2 - 4$$

This leads to
$$
W^{[1]} = 
\begin{pmatrix}
w_{1,1}^{[1]} & w_{2,1}^{[1]}\\
w_{1,2}^{[1]} & w_{2,2}^{[1]}\\
w_{1,3}^{[1]} & w_{2,3}^{[1]}
\end{pmatrix}
=
\begin{pmatrix}
-1 & 0\\
0 & -1\\
1 & 1
\end{pmatrix}
$$

and 

$$
W_0^{[1]} = 
\begin{pmatrix}
w_{0,1}^{[1]}\\
w_{0,2}^{[1]}\\
w_{0,3}^{[1]}
\end{pmatrix}
=
\begin{pmatrix}
0.5\\
0.5\\
-4
\end{pmatrix}
$$

So 

$$z^{[1]} = W^{[1]} x + W_0^{[1]} = 
\begin{pmatrix}
0.5 - x_1\\
0.5 - x_2\\
x_1 + x_2 -4
\end{pmatrix}$$

and 

$$ h = f(z^{[1]}) = 
\begin{pmatrix}
1\{0.5 - x_1 \geq 0\} \\
1\{0.5 - x_2 \geq 0\} \\
1\{x_1 + x_2 - 4 \geq 0\}
\end{pmatrix}$$

Next we construct $W^{[2]}$ and $w_0^{[2]}$. If any element of $h$ is $1$, we want $o$ to be $1$, which means $z^{[2]} \geq 0$. Otherwise (if all elements of $h$ are $0$), we want $o = 0$, which requires $z^{[2]} < 0$. We can achieve this by letting:

$$W^{[2]} = 
\begin{pmatrix}
w_1^{[2]} \; w_2^{[2]} \; w_3^{[2]}
\end{pmatrix} 
=
\begin{pmatrix}
1\;1\;1
\end{pmatrix}$$

and 

$$w_0^{[2]} = -0.5$$

then 

$$z^{[2]} = W^{[2]} h + w_0^{[2]} = 1\{0.5 - x_1 \geq 0\} + 1\{0.5 - x_2 \geq 0\} + 1\{x_1 + x_2 - 4 \geq 0\} - 0.5$$

We can check that if any indicator function equals $1$, then $z^{[2]} \geq 0.5 > 0$, which makes $o=1$. Otherwise all three indicators are $0$, so $z^{[2]} = -0.5 < 0$; the example lies in the triangular region, and $o=0$.
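The construction can be checked numerically. The following is a minimal sketch that plugs the weights chosen above into a two-layer network with the step activation $f(z) = 1\{z \geq 0\}$. The function names `step` and `predict` and the five test points are hypothetical and chosen for illustration; they are not taken from the dataset.

```python
import numpy as np

def step(z):
    """Step activation: 1 where z >= 0, else 0."""
    return (z >= 0).astype(float)

# Weights from the construction above.
W1 = np.array([[-1.0,  0.0],
               [ 0.0, -1.0],
               [ 1.0,  1.0]])
b1 = np.array([0.5, 0.5, -4.0])   # W_0^{[1]}
W2 = np.array([1.0, 1.0, 1.0])
b2 = -0.5                         # w_0^{[2]}

def predict(X):
    """X has shape (m, 2); returns the predicted labels, shape (m,)."""
    H = step(X @ W1.T + b1)       # three indicator features per example
    return step(H @ W2 + b2)      # 1 if any indicator fired, else 0

# Illustrative points: two inside the triangle, three outside it.
X = np.array([
    [1.0, 1.0],   # inside                 -> class 0
    [0.2, 2.0],   # x_1 <= 0.5             -> class 1
    [2.0, 0.2],   # x_2 <= 0.5             -> class 1
    [3.0, 3.0],   # x_1 + x_2 >= 4         -> class 1
    [2.0, 1.5],   # inside                 -> class 0
])
print(predict(X))  # [0. 1. 1. 1. 0.]
```

Running the same `predict` on the actual training set and comparing against its labels would confirm the 100% accuracy claim directly.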

(c)

No, it is not possible.

Since $h = f(z^{[1]}) = z^{[1]}$, and $z^{[1]}$ is a linear function of $x$, $h$ is also a linear function of $x$.

Therefore, since $z^{[2]}$ is a linear function of $h$, it is also a linear function of $x$.

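For positive examples, we require $o = 1$, which means $z^{[2]} \geq 0$. Since $z^{[2]}$ is a linear function of $x$, this leads to a linear decision boundary. The class 0 examples lie in a triangular region with class 1 examples on all three sides, so no single line can separate the two classes, and 100% accuracy cannot be achieved on this dataset.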
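The collapse to a single linear map can also be seen numerically. The sketch below uses randomly drawn weights (an assumption for illustration, as are the names `W_eff` and `b_eff`) and checks that the two-layer pre-activation $z^{[2]}$ equals $W_{\text{eff}}\, x + b_{\text{eff}}$ with $W_{\text{eff}} = W^{[2]} W^{[1]}$ and $b_{\text{eff}} = W^{[2]} W_0^{[1]} + W_0^{[2]}$.

```python
import numpy as np

# Hypothetical weights; any values give the same conclusion.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))
b1 = rng.normal(size=3)
W2 = rng.normal(size=(1, 3))
b2 = rng.normal(size=1)

# Equivalent single-layer parameters.
W_eff = W2 @ W1            # shape (1, 2)
b_eff = W2 @ b1 + b2       # shape (1,)

X = rng.normal(size=(10, 2))

# Two-layer network with identity activation in the hidden layer.
z_two_layer = (X @ W1.T + b1) @ W2.T + b2
# Single linear layer with the collapsed parameters.
z_single = X @ W_eff.T + b_eff

print(np.allclose(z_two_layer, z_single))
```

Because the two computations agree for every input, the set $\{x : z^{[2]} \geq 0\}$ is always a half-plane, no matter how the weights are chosen.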