# Neural Network Basics

### Brief Review of Machine Learning

In supervised learning, parametric models are those where the model is a function of a fixed form with a number of unknown _parameters_.  Together with a loss function and a training set, an optimizer can select parameters to minimize the loss with respect to the training set.  A common optimizer is stochastic gradient descent (SGD): at each step it estimates the gradient of the loss on a small batch of examples from the training set and tweaks the parameters slightly to move the loss "downhill".
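
As a rough illustration of that loop, the sketch below runs minibatch SGD on a toy one-feature linear model with a mean squared error loss. The function name `sgd_fit`, the toy data, and the learning rate, batch size, and epoch count are all assumptions made for this example, not part of the assignment.

```python
import numpy as np

def sgd_fit(x, y, lr=0.1, batch_size=8, epochs=200, seed=0):
    """Fit y ~ x @ W + b with minibatch SGD on mean squared error."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    W = np.zeros((d, 1))
    b = 0.0
    for _ in range(epochs):
        order = rng.permutation(n)                # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            xb, yb = x[idx], y[idx]
            err = xb @ W + b - yb                 # (batch, 1) residuals
            # Gradients of the mean squared error on this batch only
            grad_W = 2 * xb.T @ err / len(idx)
            grad_b = 2 * err.mean()
            # Take a small step "downhill"
            W -= lr * grad_W
            b -= lr * grad_b
    return W, b

# Toy data generated from y = 3x - 2 (noise-free, for illustration only)
rng = np.random.default_rng(1)
x_toy = rng.uniform(-1, 1, size=(64, 1))
y_toy = 3 * x_toy - 2

W_fit, b_fit = sgd_fit(x_toy, y_toy)
print(W_fit.ravel(), b_fit)   # should land close to 3 and -2
```

Each update only sees `batch_size` examples, so a single step is cheap; the price is a noisier gradient estimate than a full pass over the training set would give.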

## Part A:  Linear & Logistic Regression

You've likely seen linear regression before.  In linear regression, we fit a line (technically, hyperplane) that predicts a target variable, $y$, based on some features $x$.  The form of this model is affine (even if we call it "linear"):  

$$\hat{y} = xW + b$$

where $W$ and $b$ are weights and an offset, respectively, and are the parameters of this parametric model.  The loss function that the optimizer uses to fit these parameters is the squared error ($\|\hat{y} - y\|_2^2$) between the prediction and the ground truth in the training set.

You've also likely seen logistic regression, which is closely related to linear regression.  Logistic regression also fits a line, this time one that separates the positive and negative examples of a binary classification problem.  The form of this model is similar: 

$$\hat{y} = \sigma(xW + b)$$

where again $W$ and $b$ are the parameters of this model, and $\sigma$ is the [sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function) which maps un-normalized scores ("logits") to values $\hat{y} \in (0,1)$ that can be interpreted as probabilities. The loss function that the optimizer uses to fit these parameters is the [cross entropy](../a1/information_theory.ipynb) between the prediction and the ground truth in the training set.
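
To make the role of $\sigma$ concrete, here is a minimal sketch of the sigmoid written directly in NumPy. The hand-rolled `sigmoid` below is an assumption for illustration; it is fine for moderate logits, and a library routine is usually a safer choice for extreme values.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps any real-valued logit into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([-4.0, 0.0, 4.0])
probs = sigmoid(logits)
print(probs)

# A logit of 0 sits exactly on the decision boundary
print(sigmoid(0.0))                               # 0.5
# Symmetry: sigmoid(-z) = 1 - sigmoid(z)
print(np.allclose(sigmoid(-logits), 1 - probs))   # True
```

Negative logits map below 0.5 and positive logits above it, which is why the line $xW + b = 0$ acts as the boundary between the two classes.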

This pattern of an affine transform, $xW + b$, occurs over and over in machine learning.

**We'll use logistic regression as our running example for the rest of this part.**


### Short Answer Questions

Imagine you want to implement logistic regression:

* `z = xW + b`
* `y_hat = sigmoid(z)`

Where:
1.  `x` is an 11-dimensional feature vector
2.  `W` is the weight vector
3.  `b` is the bias term

What are the dimensions of `W` and `b`?  Recall that in logistic regression, `z` is just a scalar (commonly referred to as the "logit").

Sketch a picture of the whole equation using rectangles to illustrate the dimensions of `x`, `W`, and `b`.  See examples below for inspiration (though please label each dimension).  We don't ask you to submit this, but make sure you can do it!  It's the "print" debugging statement of neural networks!  It's also useful for reading papers... if you can't draw the shapes of all the tensors, you don't (yet) know what's going on!

#### Answer: x is of dimension (1, 11), W is of dimension (11, 1) and b is a scalar
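
One quick way to check a shape sketch like this is to build placeholder arrays and let NumPy report the result. The zero-filled values below are placeholders chosen for illustration, not trained parameters.

```python
import numpy as np

x = np.zeros((1, 11))   # one example with 11 features (placeholder values)
W = np.zeros((11, 1))   # one weight per feature
b = 0.0                 # scalar bias

z = x @ W + b           # inner dimensions (11 and 11) must match
print(x.shape, W.shape, z.shape)   # (1, 11) (11, 1) (1, 1)
```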

## Part B: Batching

Let's say we want to perform inference using your model (parameters `W` and `b`) above on multiple examples instead of just one. On modern hardware (especially GPUs), we can do this efficiently by *batching*.

To do this, we stack up the feature vectors in `x` as in the diagram below.  Note that changing the number of examples you run on (i.e. your batch size) *does not* affect the number of parameters in your model.  You're just running the same thing in parallel (instead of running the above one feature vector at a time).

![](batchaffine.png)

The red (# features) and blue (batch size) lines represent dimensions that are the same.

### Short Answer Questions

If we have 11 features and run the model in parallel on 30 examples, what are the dimensions of:

1. `W` ?
2. `b` ?
3. `x` ?
4. `z` ?

_Hint:_ remember that your model parameters stay fixed!

#### Answer:

1. $W=[11,1]$
2. $b=[1,1]$ (a scalar, broadcast across the batch)
3. $x=[30,11]$
4. $z=[30,1]$
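
The sketch below checks these shapes numerically and confirms that a batched affine transform gives the same logits as running the examples one at a time. The random values and the bias of 0.5 are arbitrary assumptions; only the shapes matter here.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(30, 11))   # 30 examples stacked as rows
W = rng.normal(size=(11, 1))    # same parameters as in the single-example case
b = 0.5                         # scalar bias, broadcast over the batch

z_batch = x @ W + b             # one logit per example

# Running the examples one at a time gives the same logits
z_loop = np.vstack([x[i:i + 1] @ W + b for i in range(30)])

print(z_batch.shape)                  # (30, 1)
print(np.allclose(z_batch, z_loop))   # True
```

The parameter count is still 11 weights plus 1 bias, no matter how many rows `x` has.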

## Part C: Logistic Regression - NumPy Implementation

In this section, we'll implement logistic regression by hand and compute a few values to make sure we understand what's going on!

Let's say your model has the following parameters:

In [2]:
import numpy as np

W = np.array([46,6,3,24,-1])
b = 5

If you want to run the model on the following three examples:

* [1, 2, 3, 4, 5]
* [0, 0, 0, 0, 5]
* [-3, -4, -12, -1, 1]

Construct the x matrix **such that you compute the answer all in one big batch** and compute the probability of the positive class for each.

In [94]:
# Import sigmoid.
from scipy.special import expit as sigmoid

### YOUR CODE HERE

#define the input vector x with 3 examples
x = np.array([[1, 2, 3, 4, 5], [0, 0, 0, 0, 5], [-3, -4, -12, -1, 1]])

#compute the logits z for the whole batch
z = x.dot(W) + b
print('z = xW + b:', z)

#compute the probability of the positive class
h = sigmoid(z)

#print the probabilities
print('Sigmoid(z):', h)
### END YOUR CODE

z = xW + b: [ 163    0 -218]
Sigmoid(z): [1.00000000e+00 5.00000000e-01 2.10767161e-95]


### Short Answer Questions

1. What is the probability of the positive class for the second (middle) example?
2. What is the cross-entropy loss in Base 2 of the second example if its label is positive?

#### Answers:

1. Probability of the positive class for the second example = $0.5$

2. Cross Entropy:

$$ \text{CE}(P,Q) = \sum_{x} - P(x) \log_2 Q(x) = \mathrm{E}_{x \sim P(x)}\left[ -\log_2 Q(x) \right] $$

In [86]:
import pandas as pd

p_data = [0, 1, 0]
#p = pd.DataFrame(p_data)

# use the predicted probabilities computed above
q_data = h
#q = pd.DataFrame(q_data)

print('P(x):', p_data)
print('Q(x):', q_data)

print(np.log2(q_data))

cross_entropy = round(-1*sum(p_data*np.log2(q_data)), 5)
print("Cross Entropy = ", cross_entropy)

P(x): [0, 1, 0]
Q(x): [1.00000000e+00 5.00000000e-01 2.10767161e-95]
[   0.           -1.         -314.50751891]
Cross Entropy =  1.0
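
For a single binary example, the cross-entropy can also be written directly over the two classes (negative, positive). The helper below is a minimal sketch of that view; the name `binary_cross_entropy_bits` is hypothetical and not part of the assignment.

```python
import numpy as np

def binary_cross_entropy_bits(p_positive, label):
    """Cross-entropy in bits (base 2) for one example with a 0/1 label."""
    p_true = [1.0 - label, float(label)]       # P over (negative, positive)
    q_pred = [1.0 - p_positive, p_positive]    # Q over (negative, positive)
    # Skip zero-probability terms of P so that 0 * log(0) is treated as 0
    return -sum(p * np.log2(q) for p, q in zip(p_true, q_pred) if p > 0)

# Second example: predicted probability 0.5, label positive
print(binary_cross_entropy_bits(0.5, 1))   # 1.0
```

With a one-hot label only the term for the true class survives, so the loss reduces to $-\log_2$ of the probability the model assigned to the correct class.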


## Part D: NumPy Feed Forward Neural Network

Let's do the same procedure for a simple feed-forward neural network.

Imagine you have a 3-layer network (hint: # of affines = # of layers).  Each hidden layer is size 10.  Just like before, you've already trained your model and you just want to run it forward.  For this exercise, let's say that each weight matrix is `np.ones(...)` and each bias term is [-1, -2, -3, ..., -n] if the bias term is $n$ long.  Compute the probability of the positive class for the three examples above, again in a single batch.

**Hint:  Draw the shapes of the matrices at each layer out on a piece of paper!  Include it with any questions you post to Piazza.**

Assume your model uses a sigmoid as the nonlinearity for all layers.

#### Answer: 3-layer network:

![image.png](attachment:image.png)

In [79]:
### YOUR CODE HERE

print('x shape:', x.shape)
#print(x[0])

#First Hidden Layer
W_1 = np.ones((5, 10))
print('W1 shape:', W_1.shape)
#print(W_1)

b_1 = np.asarray([np.arange(-1, -11, -1), np.arange(-1, -11, -1), np.arange(-1, -11, -1)])
print('b1 shape', b_1.shape)
#print(b_1)

#Hidden Layer 1 output
H_1 = sigmoid(x.dot(W_1) + b_1)
print('H1 shape:', H_1.shape)
#print(H_1)

#Second Hidden Layer
W_2 = np.ones((10, 10))
print('W2 shape:', W_2.shape)
#print(W_2)

b_2 = b_1
print('b2 shape', b_2.shape)
#print(b_2)

#Hidden Layer 2 output
H_2 = sigmoid(H_1.dot(W_2) + b_2)
print('H2 shape:', H_2.shape)
#print(H_2)

#Output Layer
W_3 = np.ones((10, 1))
print('W3 shape:', W_3.shape)
#print(W_3)

b_3 = np.arange(-1, -2, -1)
print('b3 shape', b_3.shape)
print(b_3)

O = sigmoid(H_2.dot(W_3) + b_3)
print('O shape:', O.shape)
print('Probabilities:', O)

### END YOUR CODE

x shape: (3, 5)
W1 shape: (5, 10)
b1 shape (3, 10)
H1 shape: (3, 10)
W2 shape: (10, 10)
b2 shape (3, 10)
H2 shape: (3, 10)
W3 shape: (10, 1)
b3 shape (1,)
[-1]
O shape: (3, 1)
Probabilities: [[0.99967432]
 [0.95354128]
 [0.3691505 ]]
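
The same forward pass can be written more compactly by storing each bias as a length-$n$ vector and letting NumPy broadcast it across the batch, rather than stacking one copy per example. The sketch below is one possible arrangement; the helpers `forward` and `make_layer` and the hand-written `sigmoid` are assumptions introduced for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Run a batch through a list of (W, b) affine layers, each followed by a sigmoid."""
    h = x
    for W, b in layers:
        h = sigmoid(h @ W + b)   # b has shape (n,) and broadcasts over the batch
    return h

def make_layer(n_in, n_out):
    # All-ones weights and biases [-1, -2, ..., -n_out], as specified in the exercise
    return np.ones((n_in, n_out)), -np.arange(1, n_out + 1)

x = np.array([[1, 2, 3, 4, 5], [0, 0, 0, 0, 5], [-3, -4, -12, -1, 1]])
layers = [make_layer(5, 10), make_layer(10, 10), make_layer(10, 1)]

probs = forward(x, layers)
print(probs.shape)     # (3, 1)
print(probs.ravel())   # approximately [0.99967 0.95354 0.36915]
```

Because the biases are not tied to the batch size, the same `layers` list works unchanged for 1, 3, or 30 examples.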


### Short Answer Questions

1.  What is the probability of the positive class for the third example?
2.  What is the cross-entropy loss if its label is negative?

#### Answers:

1. Probability of the positive class for the third example = $0.36915$

2. Cross Entropy:

$$ \text{CE}(P,Q) = \sum_{x} - P(x) \log_2 Q(x) = \mathrm{E}_{x \sim P(x)}\left[ -\log_2 Q(x) \right] $$

In [95]:
# The third example's label is negative, so the true distribution over
# the two classes (positive, negative) is P = [0, 1]
p_data_2 = np.array([0, 1])

# predicted distribution Q = [P(positive), P(negative)] for the third example
q_data_2 = np.array([O[2][0], 1 - O[2][0]])

print('P(x):', p_data_2)
print('Q(x):', np.round(q_data_2, 5))

print(np.round(np.log2(q_data_2), 5))

cross_entropy_2 = round(-1*sum(p_data_2*np.log2(q_data_2)), 5)
print("Cross Entropy = ", cross_entropy_2)

P(x): [0 1]
Q(x): [0.36915 0.63085]
[-1.43772 -0.66463]
Cross Entropy =  0.66463


## Part E: Softmax

Recall that softmax(z) is a vector with the same length as z, and whose components are:  $\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$.
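
A minimal NumPy sketch of this formula follows. Subtracting the maximum logit before exponentiating is a common numerical-stability trick added here as an assumption; it cancels in the ratio and leaves the result unchanged.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis, shifted by the max for numerical stability."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max(axis=-1, keepdims=True)   # does not change the result
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

print(softmax([1.0, 1.0]))         # equal logits share the mass: [0.5 0.5]
print(softmax([1000.0, 1000.0]))   # large logits do not overflow: [0.5 0.5]
```

Only the differences between logits matter, so adding the same constant to every logit gives the same probabilities.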

### Short Answer Questions

1. If the logits coming from the main body of the network are [4, 5, 6], what is the probability of the middle class?
2. What is the cross-entropy loss if the correct class is the last one (i.e. corresponding to logit=6)?
3. If you had such a three-class classification problem, what would the dimensions of W and b be for the last layer of the feed forward neural network above? 

In [96]:
#import softmax function
from scipy.special import softmax

logits = [4, 5, 6]
s_z = softmax(logits)

print(s_z)
print('Sum of probabilities:', sum(s_z))

[0.09003057 0.24472847 0.66524096]
Sum of probabilities: 0.9999999999999997


#### Answers:

1. Probability of middle class: $0.24473$

In [89]:
#2. Cross Entropy

import pandas as pd

p_data_3 = [0, 0, 1]
#p = pd.DataFrame(p_data)

# use predicted probabilities from 1. above
q_data_3 = s_z
#q = pd.DataFrame(q_data)

print('P(x):', p_data_3)
print('Q(x):', q_data_3)

print(np.log2(q_data_3))

cross_entropy_3 = round(-1*sum(p_data_3*np.log2(q_data_3)), 5)
print("2. Cross Entropy = ", cross_entropy_3)

P(x): [0, 0, 1]
Q(x): [0.09003057 0.24472847 0.66524096]
[-3.47344119 -2.03074614 -0.5880511 ]
2. Cross Entropy =  0.58805


3. $W$ has dimension (10, 3) and $b$ has dimension (3,)
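
These shapes can be sanity-checked with placeholder arrays. In the sketch below, `H`, `W_out`, and `b_out` are hypothetical names and the values are arbitrary; only the dimensions reflect the network above.

```python
import numpy as np

batch, hidden, n_classes = 3, 10, 3
H = np.full((batch, hidden), 0.5)       # placeholder activations from the last hidden layer
W_out = np.ones((hidden, n_classes))    # (10, 3): one column of weights per class
b_out = np.zeros(n_classes)             # (3,): one bias per class, broadcast over the batch

logits = H @ W_out + b_out              # (3, 3): one row of class scores per example

# Row-wise softmax turns each row of logits into a distribution over the classes
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)

print(logits.shape)        # (3, 3)
print(probs.sum(axis=1))   # each row sums to 1
```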