# Grokking Deep Learning

> Andrew Trask

Notes by Tobias Reaper

---
---

## Chapter 2: Fundamental Concepts

* What are deep learning, machine learning, and artificial intelligence?
* What are parametric models and nonparametric models?
* What are supervised learning and unsupervised learning?
* How can machines learn?

---

### What is deep learning?

* Deep learning is a subset of machine learning
  * Subset of methods in the ML toolbox primarily using ANNs

#### What is machine learning?

* Subfield of computer science
* Deals with teaching machines to perform tasks for which they were not explicitly programmed
* Machines observe a pattern and attempt to imitate it, either directly or indirectly
  * i.e. supervised or unsupervised

---

### Supervised vs unsupervised

#### Supervised

* Transform one dataset into another
  * Input is what is known
  * Output is what one wants to know

#### Unsupervised

* Also transforms one dataset into another
* However, the output dataset is not previously known
* No "right answer"
  * Finding patterns in the data
* Ex: clustering datapoints into cluster labels

> In general, all forms of unsupervised learning are, roughly speaking, forms of clustering
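
To make "clustering datapoints into cluster labels" concrete, here is a minimal sketch using a one-dimensional k-means loop. K-means is one common clustering method, chosen here for illustration and not named in the book's text at this point. The function name `kmeans_1d`, the points, and the starting centers are all made up for this example.

```python
def kmeans_1d(points, centers, iterations=10):
    """Group 1-D points around the nearest of a few centers (illustrative)."""
    centers = list(centers)  # copy so the caller's list is not modified
    labels = []
    for _ in range(iterations):
        # Assign each point the label of its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: abs(p - centers[k]))
            for p in points
        ]
        # Move each center to the mean of the points assigned to it.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels, centers

# Hypothetical data: no "right answer" is given, only the raw points.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels, centers = kmeans_1d(points, centers=[0.0, 6.0])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The output dataset (the list of labels) was not known beforehand; the labels themselves carry no meaning beyond "these points belong together".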

---

### Parametric vs nonparametric learning

* Supervision ~ the type of pattern being learned
* Parametricism ~ the way the learning is stored and the method for learning
* Characterization by number of parameters
  * Parametric model has a fixed number of parameters
    * Tend to use trial and error
  * Nonparametric model has potentially infinite number of parameters
    * Tend to use counting
    
#### Supervised parametric learning

* "Trial and error learning using knobs"
  * Trial-and-error is a common property of parametric models (with exceptions)
* Machines with a fixed number of knobs (parameters)
  * Learning means turning the knobs
* Input data is processed based on the settings of the knobs, transformed into a prediction
* Steps
  * 1. Predict
  * 2. Compare to truth pattern
  * 3. Learn the pattern
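
The predict / compare / learn steps above can be sketched with a single knob. This is only an illustration of trial and error: the input, the target, the step size, and the rule "try turning the knob each way and keep whichever reduces the error" are assumptions made for this example, not the book's final learning method.

```python
def predict(input_value, knob):
    # The machine's prediction depends only on the input and the knob setting.
    return input_value * knob

input_value = 2.0  # what is known (illustrative)
truth = 1.0        # what the machine should output (illustrative)
knob = 0.1         # initial knob setting
step = 0.01        # how far to turn the knob on each trial

for _ in range(200):
    # 1. Predict
    pred = predict(input_value, knob)
    # 2. Compare to the truth pattern
    error = (pred - truth) ** 2
    # 3. Learn: try the knob a little higher and a little lower,
    #    and keep whichever direction lowers the error.
    up_error = (predict(input_value, knob + step) - truth) ** 2
    down_error = (predict(input_value, knob - step) - truth) ** 2
    if up_error < error and up_error <= down_error:
        knob += step
    elif down_error < error:
        knob -= step

print(round(knob, 2))  # 0.5
```

Once neither direction improves the error, the knob stops moving, which here happens at the setting that maps the input 2.0 to the target 1.0.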
  
#### Unsupervised parametric learning

* Uses knobs to group data
  * Knobs in each group map the input's "affinity" to a particular group
* Each group's machine transforms the input data to a number between 0 and 1
  * Probability that the input is a member of that group

#### Nonparametric learning

* "Counting-based methods"
* Number of parameters/knobs is based on the data, rather than predefined
* There is some overlap / a blurry boundary between parametric and nonparametric models
  * Because the number of parameters in a parametric model is still influenced by the data (e.g. by the number of classes)
* "Parameters" is a generic term
  * Set of numbers used to model a pattern
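
A minimal sketch of a counting-based method, assuming made-up (weather, outcome) observations. The point is only that the number of stored counts depends on how many distinct inputs appear in the data, rather than being fixed in advance. The data and the `predict` helper are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical observations: (weather, outcome) pairs.
observations = [
    ("sunny", "win"), ("sunny", "win"), ("sunny", "lose"),
    ("rainy", "lose"), ("rainy", "lose"), ("snowy", "win"),
]

# One Counter per distinct input value: the "parameters" are the counts,
# and there are as many of them as the data requires.
counts = defaultdict(Counter)
for weather, outcome in observations:
    counts[weather][outcome] += 1

def predict(weather):
    seen = counts.get(weather)
    if not seen:
        return None  # this input was never observed
    # Predict the outcome counted most often for this input.
    return seen.most_common(1)[0][0]

print(predict("sunny"))  # win
print(len(counts))       # 3
```

Adding an observation with a new weather value would add a new counter, which is what separates this from a machine with a predefined number of knobs.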

---

### Review and solidify

> Record myself explaining (ELI5) the following topics

* [ ] Machine learning and deep learning
* [ ] Supervised vs unsupervised learning
* [ ] Parametric vs nonparametric learning

---
---

## Chapter 3: Forward Propagation

* A simple network making a prediction
* What is a neural network and what does it do?
* Making a prediction with...
  * Multiple inputs
  * Multiple outputs
  * Multiple inputs and outputs
* Predicting on predictions

---

### Step 1: Predict

* Number of datapoints processed at a time does a lot to determine the structure of the NN
* How many to process at a time?
  * Can the neural network be accurate with the (batch of) data it's given?
  * E.g. predicting on an image requires the entire image data, not just a single pixel
  * Rule of thumb: always present "enough" (how much a human might need to make the same prediction)
* Network defined by the shape of the input and output
  * Start with a single datapoint making a single prediction

#### Making a prediction with a single input

In [1]:
# === First neural network === #
weight = 0.1

def neural_network(input, weight):
    pred = input * weight
    return pred

In [4]:
number_of_fingers = [9.5, 10, 15.5, 11]
input = number_of_fingers[0]

pred = neural_network(input, weight)
pred

0.9500000000000001

#### What does this NN do?

* Multiplies the input by a weight
  * Scales the input by a certain amount
* Input variable = information
* Weight variable = knowledge
* Prediction = output

> Though they might get more complex, this same underlying concept always applies.

* Another way to look at it:
  * Weight value is a measure of sensitivity between the input and the net's prediction
    * Large weights amplify the range of predictions
    * Small weights "understate" or reduce the range of predictions
  * "Volume knob"
* Inputs, predictions can be negative
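
A quick way to see the "volume knob" idea is to run the same inputs through the single-weight network with different weights and compare how wide the resulting predictions are. The inputs and weights below are arbitrary values chosen for illustration.

```python
def neural_network(input, weight):
    return input * weight

inputs = [-2.0, 0.5, 4.0]  # illustrative inputs, including a negative one

spreads = {}
for weight in (0.01, 1.0, 10.0):
    preds = [neural_network(x, weight) for x in inputs]
    # Range of predictions produced by this weight.
    spreads[weight] = max(preds) - min(preds)
    print(weight, preds)

# A larger weight stretches the same inputs over a wider range.
print(spreads[0.01] < spreads[1.0] < spreads[10.0])  # True
```

The negative input produces a negative prediction for every positive weight, which matches the note that inputs and predictions are not restricted to positive values.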

#### Making a prediction with multiple inputs

* Combining intelligence from multiple datapoints
  * Multiply each input by its own weight
  * Sum the products (weighted sum)
    * Dot product

In [6]:
def w_sum(a, b):
    assert(len(a) == len(b))
    output = 0
    for i in range(len(a)):
        output += (a[i] * b[i])
    return output

def neural_network(input, weights):
    pred = w_sum(input, weights)
    return pred

In [8]:
weights = [0.1, 0.2, 0]

toes = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65, 0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]

input = [toes[0], wlrec[0], nfans[0]]

pred = neural_network(input, weights)
pred

0.9800000000000001

##### Vector math

* Elementwise operation
  * Mathematical operation between two vectors of equal length
  * Pair up values according to their positions in their vectors

In [10]:
def elementwise_multiplication(vec_a, vec_b):
    assert len(vec_a) == len(vec_b)
    vec_out = [0 for i in range(len(vec_a))]
    for i in range(len(vec_a)):
        vec_out[i] = vec_a[i] * vec_b[i]
    return vec_out

In [11]:
def elementwise_addition(vec_a, vec_b):
    assert len(vec_a) == len(vec_b)
    vec_out = [0 for i in range(len(vec_a))]
    for i in range(len(vec_a)):
        vec_out[i] = vec_a[i] + vec_b[i]
    return vec_out

In [12]:
def vector_sum(vec_a):
    output = 0
    for v in vec_a:
        output += v
    return output

In [13]:
def vector_average(vec_a):
    output = 0
    for v in vec_a:
        output += v
    return output / len(vec_a)

In [14]:
def dot_product(vec_a, vec_b):
    vec_c = elementwise_multiplication(vec_a, vec_b)
    return vector_sum(vec_c)

##### Dot product intuition

The intuition behind how and why a dot product (weighted sum) works is one of the most important parts of understanding how NNs make predictions.

> A dot product gives you a notion of similarity between two vectors

Among vectors of the same magnitude, the highest dot product is between two identical vectors.

What does it mean when a neural network makes a prediction?

> The network gives a high score to the inputs based on how similar they are to the weights.
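
The similarity intuition can be checked with a few small vectors. The vectors below are illustrative; `w_sum` is the same weighted sum as before, rewritten with `zip` for brevity.

```python
def w_sum(a, b):
    assert len(a) == len(b)
    return sum(x * y for x, y in zip(a, b))

a = [0, 1, 0, 1]
b = [1, 0, 1, 0]
c = [0, 1, 1, 0]
d = [0.5, 0, 0.5, 0]
e = [0, 1, -1, 0]

print(w_sum(a, b))  # 0   -> no positions where both are nonzero
print(w_sum(b, c))  # 1   -> one overlapping position
print(w_sum(c, c))  # 2   -> identical vectors, full overlap
print(w_sum(b, d))  # 1.0 -> same pattern as b, but smaller values
print(w_sum(c, e))  # 0   -> the negative entry cancels the positive one
```

A negative weight means that a high input pushes the score down, so one matching and one opposing position can cancel out to zero.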

In [1]:
# === Multiple inputs using numpy === #
import numpy as np

weights = np.array([0.1, 0.2, 0])
def neural_network(input, weights):
    pred = input.dot(weights)
    return pred

toes = np.array([8.5, 9.5, 9.9, 9.0])
wlrec = np.array([0.65, 0.8, 0.8, 0.9])
nfans = np.array([1.2, 1.3, 0.5, 1.0])

input = np.array([toes[0], wlrec[0], nfans[0]])
pred = neural_network(input, weights)
print(pred)

0.9800000000000001


#### Making a prediction with multiple outputs

Equivalent to three disconnected single-weight networks that share the same input.

In [3]:
def ele_mul(number,vector):
    output = [0] * len(vector)
    assert(len(output) == len(vector))
    for i in range(len(vector)):
        output[i] = number * vector[i]
    return output

def neural_network(input, weights):
    pred = ele_mul(input, weights)
    return pred

weights = [0.3, 0.2, 0.9]

wlrec = [0.65, 0.8, 0.8, 0.9]
input = wlrec[0]
pred = neural_network(input, weights)
print(pred)

[0.195, 0.13, 0.5850000000000001]


#### Multiple inputs and multiple outputs

Three independent weighted sums of the input -> three predictions.

In [14]:
def w_sum(a, b):
    assert len(a) == len(b)
    output = 0
    for i in range(len(a)):
        output += (a[i] * b[i])
    return output

def vect_mat_mul(vect, matrix):
    # One weighted sum per matrix row, so the output has len(matrix) entries
    assert all(len(row) == len(vect) for row in matrix)
    output = [0] * len(matrix)
    for i in range(len(matrix)):
        output[i] = w_sum(vect, matrix[i])
    return output

def neural_network(input, weights):
    pred = vect_mat_mul(input, weights)
    return pred

weights = [
    [0.1, 0.1, -0.3], # hurt?
    [0.1, 0.2, 0.0],  # win?
    [0.0, 1.3, 0.1],  # sad?
]

toes = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65, 0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]

input = [toes[0], wlrec[0], nfans[0]]
pred = neural_network(input, weights)
pred

[0.555, 0.9800000000000001, 0.9650000000000001]
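
The same three weighted sums can also be written as one matrix-vector product in NumPy. This is a sketch assuming each row of `weights` holds the weights for one output, as in the list version; the input values are the first entries of `toes`, `wlrec`, and `nfans`.

```python
import numpy as np

weights = np.array([
    [0.1, 0.1, -0.3],  # hurt?
    [0.1, 0.2, 0.0],   # win?
    [0.0, 1.3, 0.1],   # sad?
])

input = np.array([8.5, 0.65, 1.2])  # toes[0], wlrec[0], nfans[0]

# (3, 3) matrix dotted with a length-3 vector:
# each output is the dot product of one row with the input.
pred = weights.dot(input)
print(pred)
```

Up to floating-point rounding, this gives the same values as the list-based version: 0.555, 0.98, and 0.965.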

#### Predictions on predictions

Take the output of one network and use it as input to another.

In [15]:
ih_wgt = [
    [0.1, 0.2, -0.1], # hid[0]
    [-0.1,0.1, 0.9],  # hid[1]
    [0.1, 0.4, 0.1],  # hid[2]
]

hp_wgt = [
    [0.3, 1.1, -0.3], # hurt?
    [0.1, 0.2, 0.0],  # win?
    [0.0, 1.3, 0.1],  # sad?
]

weights = [ih_wgt, hp_wgt]

def neural_network(input, weights):
    hid = vect_mat_mul(input, weights[0])
    pred = vect_mat_mul(hid, weights[1])
    return pred

neural_network(input, weights)

[0.21350000000000002, 0.14500000000000002, 0.5065]

In [18]:
# === numpy version === #
import numpy as np

ih_wgt = np.array([
    [0.1, 0.2, -0.1], # hid[0]
    [-0.1,0.1, 0.9],  # hid[1]
    [0.1, 0.4, 0.1],  # hid[2]
])

hp_wgt = np.array([
    [0.3, 1.1, -0.3], # hurt?
    [0.1, 0.2, 0.0],  # win?
    [0.0, 1.3, 0.1],  # sad?
])

weights = [ih_wgt, hp_wgt]

input = np.array([toes[0], wlrec[0], nfans[0]])

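def neural_network(input, weights):
    # Each row of a weight matrix holds one node's weights, so transpose
    # before the dot product to match the pure-Python version above
    hid = input.dot(weights[0].T)
    pred = hid.dot(weights[1].T)
    return pred

neural_network(input, weights)

array([0.2135, 0.145 , 0.5065])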

For a dot product, write the two (rows, columns) shapes next to each other: the inner dimensions must match, and the outer dimensions give the shape of the result.
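
A small check of the shape rule, using zero-filled arrays whose shapes are chosen only for illustration.

```python
import numpy as np

a = np.zeros((1, 4))
b = np.zeros((4, 3))

# (1, 4) next to (4, 3): the inner 4s match, the result is (1, 3).
c = a.dot(b)
print(c.shape)  # (1, 3)

# (4, 3) next to (1, 4): inner dimensions 3 and 1 differ, so NumPy raises.
mismatch_raised = False
try:
    b.dot(a)
except ValueError as err:
    mismatch_raised = True
    print("shape mismatch:", err)
```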

---

### Chapter 3: Review and solidify

> Record myself explaining (ELI5) the following topics

* [ ] What is a neural network and what does it do?
* [ ] What is the intuition behind the dot product of two vectors?
* [ ] What dimensions have to match when taking the dot product?

---
---

## Chapter 4: Gradient Descent

* Do neural networks make accurate predictions?
* Why measure error?
* Hot and cold learning
* Calculating both direction and amount from error
* Gradient descent
* Learning is just reducing error
* Derivatives and how to use them to learn
* Divergence and alpha

> Or, the "compare" and "learn" part of the "predict, compare, and learn" process.

Compare provides a measurement of how far off a prediction was, while learning tells each weight how it can change to reduce the error.

### Compare

Does your network make good predictions?

* Measuring error simplifies the problem
* Different ways of measuring error prioritize error differently
  * Mean squared error amplifies large errors (greater than 1) and shrinks small errors (less than 1)
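
A short sketch of how squaring changes the weight given to different errors, followed by a mean squared error over several predictions. The helper name `mean_squared_error` and the sample values are assumptions made for this example.

```python
# Squaring shrinks raw errors below 1 and grows raw errors above 1.
for raw_error in (0.1, 0.5, 2.0, 10.0):
    print(raw_error, "->", raw_error ** 2)

def mean_squared_error(preds, goals):
    """Average of the squared differences between predictions and goals."""
    assert len(preds) == len(goals)
    return sum((p - g) ** 2 for p, g in zip(preds, goals)) / len(preds)

# One prediction is off by 1, the other by 2: (1 + 4) / 2
print(mean_squared_error([1.0, 3.0], [0.0, 1.0]))  # 2.5
```

Squaring also makes every error positive, so misses above and below the goal do not cancel each other when averaged.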

In [19]:
knob_weight = 0.5
input = 0.5
goal_pred = 0.8

pred = input * knob_weight
error = (pred - goal_pred) ** 2

print(error)

0.30250000000000005
