## Logistic Regression

Using an example very similar to the one in the "Intro to Machine Learning" lesson that covers *data surfaces*, we learn that this process of determining whether a point *passes* or *fails* a test (based on historical data) is __Logistic Regression__.

- The example (with the similar data surface) is of student grades and entrance test scores, colored for acceptance
  - green dots on the right hand of the D.S. are (supposedly) accepted
  - red dots on the left hand are subsequently rejected

Another way to think about this is: "how easily can we separate the data?" (with a line)
- In this lesson, we'll gradually reduce our losses via __gradient descent__
  - The *log-loss function* will have an output whose error we want to minimize

Answer this question:
  - In which direction can we rotate (or move) the line for the maximum error reduction?
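
A rough sketch of what that log-loss measures, assuming a sigmoid has already turned the line's score into a probability. The function name `log_loss` and the sample numbers are mine, made up for illustration, not from the lesson:

```python
import numpy as np

def log_loss(y_true, y_prob, eps=1e-12):
    """Mean binary log-loss; y_prob holds predicted probabilities of class 1."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip the probabilities so log(0) never happens
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
    return float(-np.mean(losses))

# Hypothetical labels: 1 = accepted, 0 = rejected
labels = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]
mostly_wrong = [0.1, 0.9, 0.2, 0.8]

print(log_loss(labels, mostly_right))  # small: the line separates the data well
print(log_loss(labels, mostly_wrong))  # large: the line needs to move
```

Moving the line changes the probabilities, which changes this number; the "best" move is whichever one shrinks it fastest.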

# Neural Networks

> It looks like we're proposing too many acceptances, because some of these students have low grades with good test scores...

> Hmmmm.

> Maybe a circle?

> How about two lines? Yes, __let's use two lines__

But how do we find the two lines?
- Gradient descent still does the job.
- We'll calculate against the log-loss function, looking for two different positions

## This is a neural network
1. Is this point over the horizontal line?
2. Is this point over the vertical line?
3. Are the answers to both 1 and 2 yes?
  - This yields only a single "yes"


---

This feels like the makings of a perceptron...
- Doing bare-bones, basic boolean ANDs

I'm wondering how we translate this truth-table-like problem into a system of nodes

---

And then teach breaks it down for us ;)
> Let's graph each question as a small node

### A breakdown
(To better form an intuition?)

1. Is the point over the horizontal line?
    - And two input nodes: test score and grades
    - We can plot this like a coordinate pair
    - Outputs a Boolean
2. Is the point over the vertical line?
    - Outputs a Boolean to answer the question
3. Add a node (the __output node__), one layer "in"?
    - This node receives the previous two inputs and logical ANDs them together
    - Outputs the result of that Boolean AND (closer to Boolean multiplication than addition)
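
The three questions can be sketched as three tiny functions. The cutoffs (`min_grade`, `min_test`) and which axis each line belongs to are my assumptions; the lesson only draws the lines:

```python
def over_horizontal_line(test, grade, min_grade=5):
    """Node 1: is the point above the (hypothetical) grade cutoff?"""
    # Both inputs flow into the node, but this one only cares about the grade
    return grade >= min_grade

def over_vertical_line(test, grade, min_test=5):
    """Node 2: is the point to the right of the (hypothetical) test cutoff?"""
    return test >= min_test

def admit(test, grade):
    """Output node: logical AND of the two line tests."""
    return over_horizontal_line(test, grade) and over_vertical_line(test, grade)

print(admit(test=1, grade=8))  # False: good grades, poor test score
print(admit(test=7, grade=8))  # True: passes both line tests
```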

---
But __how__ is this a neural network?

Well just look at the layers like this:

\begin{equation}
\begin{vmatrix}
Test: & 1 \\
Grade: & 8
\end{vmatrix} {_{->}^{->}}
\begin{vmatrix}
horizLineTestNode \\
vertLineTestNode
\end{vmatrix} {_{->}^{->}}
\begin{vmatrix}
AND
\end{vmatrix} ->
Output
\end{equation}

- picture the *test* and *grade* flowing into both test nodes (four arrows instead of two)
- That's the general idea
  - Data flows left-to-right
  - Remember: a given neuron takes, as input, the output of other neurons
    - We use numbers at this level
    - But soon, we'll do so much more

I really don't want to type out a truth table, or a binary grid
  - Please just remember this
  - Or just review [this](http://kias.dyndns.org/comath/21.html)

# Perceptron

---
Vocabulary:
  - __Perceptrons__, or neurons, are individual nodes in an interconnected network
    - They are the basic unit of neural networks
    - Function: *Look at input data and decide how to categorize said data*

---

With the previous example of school acceptance
- inputs either passed the threshold for grades and test scores, or didn't
- The outputs - "yes" and "no" - were the __categories__

Those categories then combine to produce a decision!
- Back in example land, that's whether or not the student is granted admission.

---
But wait a second....

How the heck do these nodes know what's important in checking that threshold?
> When initialized, we don't know what information will be most important.

> So... We have the network __learn for itself__. And even let it adjust how it considers the data.

All of this is done with a little something called...

### Weights

---
My turn.

Perceptrons (aka "neurons", "nodes") have weights for all the inputs that they receive
- Think of this like a person who values certain people's opinions more than others

Based on the inputs (and associated weights), an output can be determined.
- And since neurons will only be connected to a set number of other neurons, it is safe to have a fixed number of weights!
  - It all makes sense in my brain: 
    - each "opinion" has more/less sway on the node, in its categorization
  - Let's see how the course explains it.

---

> When input data is received, it gets multiplied by a weight that is assigned to this particular input.
- If we have our example of school acceptance
  - `tests` could be an input name for test scores
  - `grades` could be an input name for the student's average grades

---
Pretty much what I said, with a different tone of voice, and vocabulary.

---

__Note__ that a neuron's weight for a given input starts out random (adding up to 1.00? because percentages???)
- Over time, as the neural network learns more about the kind of input data that leads to student acceptance
  - The network adjusts weights based on errors in categorization (from previous results)
  - This is called __training__

---

Vocabulary:
  - __training__: The process of improving a network's individual node-level weights, to produce the desired output

---

- An __extreme__ example of weights would be if the test scores had *no effect at all* on the university acceptance
  - The weight of "test score"-input would be zero and have no effect on the output of the perceptron

## Processing Input (at the perceptron level)

Just to review:

- Each input has a weight that represents its importance
- Weights are determined during the learning process (aka, __training__)

### Summing the Input Data

---

> Note that weights will *always* be represented by some type of the letter __w__. It will be capitalized - __W__ - when it represents a __matrix__
- Subscript will specify *which* weights.

Just remember the variable naming conventions: both math and code



---

They set this equation as the relationship between weights and inputs:

\begin{equation}
w_{grades} \cdot x_{grades} + w_{test} \cdot x_{test} = -1 \cdot x_{grades} - 0.2 \cdot x_{test}
\end{equation}

> The perceptron applies these weights to the inputs and sums them in a __linear combination__.

---
The next section is verbose as it tries to skirt around the math notation - it's just summations
- Of `x_input` and `w_input`, which is the __input__ and the __weight__ associated with said __input__ 
  - `x` differentiates it, nothing more.

- Summatively, that looks like:

\begin{equation}
\sum_{i=1}^m w_{i} \cdot x_{i}
\end{equation}

- Where *m* is the number of inputs
  - This can be omitted
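
The summation is just a loop over (weight, input) pairs. A minimal sketch, using the lesson's example weights and a student I made up:

```python
def linear_combination(weights, inputs):
    """Sum of w_i * x_i over all m inputs."""
    if len(weights) != len(inputs):
        raise ValueError("need exactly one weight per input")
    return sum(w * x for w, x in zip(weights, inputs))

# Weights from the lesson's example: w_grades = -1, w_test = -0.2
weights = [-1.0, -0.2]
# Hypothetical student: grades = 8, test = 5
inputs = [8.0, 5.0]

print(linear_combination(weights, inputs))  # -1*8 + -0.2*5 = -9.0
```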

#### Please Note
that the relative size of weights is what's most important, not the weights themselves

- He uses an example of
  - w_grades = -1
  - w_test = -.2
- And points out that the ratio of 5:1 is what's important
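
A quick check of the ratio claim, assuming a step activation and no bias (both assumptions on my part at this point in the lesson). Scaling both weights by the same positive factor keeps the 5:1 ratio and should not change a single decision. The sample points are made up:

```python
def step(h):
    """1 if h is non-negative, else 0."""
    return 1 if h >= 0 else 0

def perceptron(weights, inputs):
    """Bias-free perceptron: step of the weighted sum."""
    return step(sum(w * x for w, x in zip(weights, inputs)))

original = [-1.0, -0.2]   # w_grades, w_test from the lesson
scaled = [-5.0, -1.0]     # same 5:1 ratio, five times bigger

# Hypothetical (grades, test) points on both sides of the boundary
points = [(1.0, -6.0), (1.0, -4.0), (-1.0, 4.0), (-1.0, 6.0), (0.0, 0.0)]
for p in points:
    # Both columns agree for every point
    print(p, perceptron(original, p), perceptron(scaled, p))
```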

__In brief__, multiply an input by its weight

- Take an opinion by its importance factor
- It's that easy.
  - Now to learn about these activation functions

---

### Calculating Output: an Activation Function

First the analogy that makes the most sense in my mind:
> Consider an actual neuron in the brain:
- It absorbs charge from its neighbors
- After a certain amount of input, it discharges

> When the activation function fires, the neuron discharges

__He shows__ an activation function as merely a conditional function:

\begin{equation}
f(h) = \begin{cases} 0 & \text{if } h < 0 \\ 1 & \text{if } h \geq 0 \end{cases}
\end{equation}

- Apparently, this function is called the __Heaviside Step Function__
  - This sucks for university acceptance :P
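
In code the step function is a one-liner. A minimal sketch with NumPy (the name `heaviside_step` is mine):

```python
import numpy as np

def heaviside_step(h):
    """Return 1 where h >= 0 and 0 where h < 0."""
    return np.where(np.asarray(h) >= 0, 1, 0)

print(heaviside_step([-2.0, -0.1, 0.0, 0.1, 2.0]))  # [0 0 1 1 1]
```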

---
This guy is really making this seem too easy.
- I get that the math is simple, but his round-about way of explaining it is frustrating
  - I know how to graph on a chart!

---

So we've written a dodgy function (only poor students activate). We'll fix it
- The instructor's finally getting to the point!

### Bias, or "shifting the goods"

Suppose we have a good shape (to the graph of acceptable inputs), but we don't like the selection.
- Just add __bias__

---
Vocabulary:

  - __bias__ is a translation of the graph of acceptable inputs.
    - Simple *adding* to the result of the base equation
    - First demonstrated on the y-axis.

---

---
For feedback:

> This feels like a bad way to explain this. Sure, not everyone remembers mathematics as I do. But this doesn't do the equations justice, and I think it would be inappropriate for students to feel confident with these hand-wave-y definitions

- The math is clear, but the words are __not__.
  - Almost has me doubting my math

---

More to my point, I'm just going to copy the instructor's paragraph...

---
Note to self:

- I keep slipping in and out of the Feynman technique.
- I want to rearticulate it, but there's nothing to say
  - __ADD__ a bias. Literally.
    - Some number, by the science of neural networks, will be generated to shift the good shape to have the right coverage
  - I'm not going to re-explain [what summation is](https://en.wikipedia.org/wiki/Summation)
    - I've been familiar with this concept for longer than I've been programming.
    - Here's my attempt at it:
      - For every *single* number between \*points to start and end of the summation\*
      - Solve the equation with __that__ number plugged in
      - Add __all__ of those results together.

---

![perceptron formula](https://d17h27t6h515a5.cloudfront.net/topher/2017/February/58951180_perceptron-equation-2/perceptron-equation-2.gif "Perceptron Formula")

Here's the aforementioned, poor, presumptive summary:

> This formula returns $1$ if the input ($x_{1}$,$x_{2}$,...,$x_{m}$) belongs to the accepted-to-university category or returns 0 if it doesn't. The input is made up of one or more real numbers, each one represented by $x_i$, where $m$ is the number of inputs.

> Then the neural network starts to learn! Initially, the weights ( $w_{i}$ ) and bias ( $b$ ) are assigned a random value, and then they are updated using a learning algorithm like gradient descent. The weights and biases change so that the next training example is more accurately categorized, and patterns in data are "learned" by the neural network.

> Now that you have a good understanding of perceptrons, let's put that knowledge to use. In the next section, you'll create the AND perceptron from the Neural Networks video by setting the values for weights and bias.

---

While I feel like I'm growing comfortable with Neural Networks, I'm seriously disappointed in this section
- "Uhhh... We can't figure out how to teach this to you, so we're going to give you an impartial __text__ lecture that is heavy on mathematics, but without the explanations of their necessity"
- It's like Siraj's videos, but less goofy; more stoic
  - That is __not__ a good thing.

## Quizzes, or "suffering will be your teacher"

---
This first quiz feels almost unfair, given the abysmal wrap up of the last section.

- I'm about to read another resource to get together a more decent understanding of the activation.
  - I can appreciate the ambiguity of "you define it however you want", but what is the threshold for "effective"?
    - How could you ever know beforehand?
  - I don't imagine the "Heaviside step function" will always suffice for an activation function

It's a "fill in the blank", with the weights of the inputs, a bunch of graphs showing desired behavior, and frustration...
- They want the AND function first..
  - It's contrived instruction, to get us used to using the $b$
- `AND` can be programmed much more simply.

---
So, somehow, changing the weights and bias for these logic gate neural networks is relaxing my mind.

- My brain is just treating them like puzzles.
  - I still don't know if this is a good thing
- After I gave the answer *it wanted* for "how would we shift this graph to go from AND to OR?"
  - For the record: "Increase the weights" and "Decrease the magnitude of the bias"

__ALL__ of these are still using the aforementioned biased step function.
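
For my own reference, one set of values that behaves like the quiz answers. These exact numbers are my picks (many others draw the same boundaries), not necessarily what the course expects:

```python
import itertools

def perceptron(weights, bias, inputs):
    """Step-function perceptron: 1 if w . x + b >= 0, else 0."""
    h = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if h >= 0 else 0

# Hypothetical settings for each gate
AND_GATE = {"weights": (1.0, 1.0), "bias": -1.5}
OR_BY_BIAS = {"weights": (1.0, 1.0), "bias": -0.5}     # smaller bias magnitude
OR_BY_WEIGHTS = {"weights": (2.0, 2.0), "bias": -1.5}  # bigger weights

for a, b in itertools.product((0, 1), repeat=2):
    outputs = [perceptron(g["weights"], g["bias"], (a, b))
               for g in (AND_GATE, OR_BY_BIAS, OR_BY_WEIGHTS)]
    print(a, b, outputs)  # columns: AND, OR (via bias), OR (via weights)
```

Either change slides the same line far enough that a single "yes" input is already over it.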

---
Back at it:
> A neural network is like any tool. You have to know when to use it.

Thankful they didn't make me play the "weight puzzle game" with XOR - it's always a pain.
- Thankfully, the XOR is a composition of
  - two NOT gates
  - two AND gates
  - one OR gate
  - a few passthroughs to give the illusion of "layers".
- It's __very__ similar to the circuit.
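
A sketch of that composition built only from step-function perceptrons. The wiring is the textbook `(a AND NOT b) OR (NOT a AND b)`; whether it matches the course's diagram exactly is an assumption, and the weights and biases are again my own picks:

```python
def perceptron(weights, bias, inputs):
    """Step-function perceptron: 1 if w . x + b >= 0, else 0."""
    h = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if h >= 0 else 0

def not_gate(a):
    return perceptron((-1.0,), 0.5, (a,))

def and_gate(a, b):
    return perceptron((1.0, 1.0), -1.5, (a, b))

def or_gate(a, b):
    return perceptron((1.0, 1.0), -0.5, (a, b))

def xor_gate(a, b):
    """Two NOTs, two ANDs, one OR, stacked in layers."""
    return or_gate(and_gate(a, not_gate(b)), and_gate(not_gate(a), b))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_gate(a, b))
```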

> The power of a neural network isn't building it by hand, like we were doing. It's the ability to learn from examples. In the next few sections, you'll learn how a neural network sets its own weights and biases.

... Yeah. Hopefully.
- I feel like I've learned a lot of vocabulary, and a few theoretical applications of math I've known for ages...
  - And refreshed my Linear Algebra, which is always good.
- I've seen glimpses of the applications of that Matrix Math in these neural systems
  - I want to wield that power...
  - There, I said it.

---

### Mat returns

He gives a breakdown of all that goes into a neuron:
- Inputs
  - And associated weights
- A bias
- resultant input, $h$
- Activation function $f(h)$
- Output, $y$

Drawn as such:

![image](https://d17h27t6h515a5.cloudfront.net/topher/2017/February/589366f0_simple-neuron/simple-neuron.png "Circles are units, boxes are operations")

This is rather decent, leading to some insights:

- Construction of the activation function argument
  - Weighted input, with bias
  - In other words: $x_1 \cdot w_1 + b$
- The argument to the activation function: $h$
- The exit - and output - is the __activation function__'s result: $y = f(h)$

---
My mind understands composition.

This is a layer, a composition, of trivial operations

- Multiplication
- Addition
- Identity

... Along with some arbitrary activation function.

---

Mat has us program the [Sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function) and they keep repeating this:

> stacking units will let you model linearly inseparable data, impossible to do with regression models.

This phrase "linearly inseparable data" has been said four times now.
- I haven't heard it before
- But I don't think many operations are above decomposition.
  - I actually think there are infinitely many operations which can be decomposed

---
I'm not going to run the code, but here it is:

```python
import numpy as np

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1 / (1 + np.exp(-x))

inputs = np.array([0.7, -0.3])
weights = np.array([0.1, 0.8])
bias = -0.1

# Weighted sum of the inputs plus the bias, passed through the activation
output = sigmoid(np.dot(weights, inputs) + bias)

print('Output:', output)
```

# Gradient Descent
## Learning Weights

If we don't know the correct weights, we have to learn them from history.

- Example data

Then we can predict (produce outputs) with those weights

### Know when you're wrong

Since we start with random weights, we need to adjust those weights towards what is proper

- *iterate*, if you will

__So we *measure*__ how wrong we are, to gather some sense of how to course correct

- i.e. "Too big", "too small", "too hot", "too cold", "just right"
  - Thanks, Goldilocks

One of the best methods for doing this is the "[Sum of Squared Errors](https://en.wikipedia.org/wiki/Residual_sum_of_squares)":

\begin{equation}
E = \frac 1 2 \sum_{\mu} \sum_j [y_j^{\mu} - \hat{y}_j^{\mu}]^2
\end{equation}

#### Let's break it down

Pseudo-academia (me):
> Take half of the sum of the sum of:
  - each prediction subtracted from its true value, squared, for every data point $\mu$ and every output unit $j$

Layman's terms:
> For every data point (in our training set) calculate the inner sum of 
- The squared difference of
- The true value (categorization [think back]) of a data point, and the prediction for that data point
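
In NumPy the double sum collapses to one call, since `np.sum` runs over every record and every output unit at once. A minimal sketch with made-up targets and predictions:

```python
import numpy as np

def sum_of_squared_errors(targets, predictions):
    """E = 1/2 * sum over records (mu) and output units (j) of (y - y_hat)^2."""
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    return 0.5 * float(np.sum((targets - predictions) ** 2))

# Hypothetical data: 3 records (rows), 1 output unit (column)
y = [[1.0], [0.0], [1.0]]
y_hat = [[0.8], [0.3], [0.5]]

# 0.5 * (0.04 + 0.09 + 0.25), so roughly 0.19
print(sum_of_squared_errors(y, y_hat))
```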

---
We're building a graph and we need to get the shape right.
- The squaring is probably inherited from statistical models that fit to a curve. 
  - Or correlation against a best fit line

---

Finished at 11:09 PM...

Having studied hard, had an awful work day, and endured ill-suited-for-me instruction ("intellectual babble", that uses all the right words sans substance), I just want to go to sleep.
- My chest is heavy.
  - It hurts to feel this much angst
  - It's disappointing that getting started is such a labor
- Was this really worth \$600?
  - When I've got so many things I could have read for free?
  - When the, arguably, most important lesson in this course isn't ringing true like the previous math lesson?
    - Why? What do I do? :(

### April 20, 2017 - Reviewing

First off: yes, it is worth it.

While the teaching may not be my style - "look what we can do. And here's some math... Ooo! Shiny, cool things" - there is genuinely good content here.

- The lesson on matrices and Linear Algebra is learning a semester's worth of work in a few hours.
  - Concise and effective
- Mat's visual-driven lessons are actually effective, which I'm grateful for.
  - Human brains are naturally potent at visual comprehension
    - Which is why teaching a computer to see is so dang impressive

What I need is patience with this, and do what I do best:

- Put it together in my head, in a way that makes sense to me.
  - This curriculum is designed to hit for everyone
  - Not like [some courses](http://neuralnetworksanddeeplearning.com/chap1.html#perceptrons), which cater to my style

---
Let's talk about that other "course" (it's a book) for a moment.

In about 10 minutes of reading, he has addressed almost __everything__ that this Udacity course has waved its hands at

- Node: perceptron and Sigmoid neuron are __different__
  - Perceptron does binary output
  - Sigmoid neuron outputs the result of a Sigmoid against some input
- Gradient descent
  - He doesn't refer to it like we already know what it is
  - Instead, he talks about the math and its utility.
- Others that I (hopefully) will fill out later.

At work, I tried to employ the Feynman technique to teach all that I know about Neural Networks.

- I will attempt to do so again here...

---

### Neural Networks (Feynman self-Review)

---

A neural network is a system of interconnected (specialized) neurons, organized in columns we call __layers__.

- The two most common types of neuron (henceforth, "node") are:
  - Perceptron, which outputs binary (typically via some Heaviside Step function)
  - Sigmoid neuron, which outputs the result of a Sigmoid (see above) function

__Perceptrons__, thanks to their binary nature, can parallel any logical operation.

- A single perceptron can become an AND, NAND, OR, or NOT gate; XOR takes a small composition of them
  - Parallel to circuitry, and equally composable
- Because of this logical similarity, neural networks can perform any computation
  - Treating the nodes as logic gates
    - In-memory circuitry, if you will
    

---
Hypothesis: This proved to be really slow, which is why we didn't get very far with neural network-based Machine Learning for decades.

---

(Perceptron) Nodes take input, and __categorize__ it: determine whether it exceeds a threshold

- That "determining" process is just a '$\geq$' comparison; nothing complicated
- Nodes treat their input as humans take opinions
  - "With a grain of salt"
  - A unique weight factor ("factor", think "multiplication") is applied to each input
    - $x_{input1}$, is the actual input
      - Mathematicians love this variable. It's so easy to write ;)
    - $w_{input1}$, is the weight associated with it
    - Let's use an intermediary $h$ to represent that product of $x_{input1} * w_{input1}$
  - The result, $h$, and some (optional) bias $b$ are added together
- Now __we finally have__ some number to compare
  - $h + b \ge 0$? output $1$
  - otherwise, output $0$

A human analogy:

- I think Bob is a bit of a tosspot, so I don't really value what he says
  - "I'm only going to pay partial attention when he's talking to me"
- However, Jenny has been my crush for 3 years, so she has more sway
  - "She might say she'll date me, so I'm going to listen hard"


Bob's *input* is undervalued __relative__ to Jenny. That's the extent of inter-input-relation

- One input means nothing to another

---
This gets marginally more complicated with more inputs

- Most networks have multiple inputs, which are all processed by all nodes
- In this case, we sum up every input's $h$, then add the bias $b$ once
  - We usually write it in the long form as: $\sum_{i=1}^{m}{(x_i * w_i)} + b$
    - Where $i$ in $x_i * w_i$ indexes a given input value in a list of input values
      - Remember, we've got multiple inputs here
      - With programming, this would be like an array index
    - We swap $h$ back out for what it's equal to
  - This can also be expressed as a dot product
    - $weights = [1, 1, 3]$.
      - Let's say these are Bob, Joe, and Jenny's opinions
    - $inputs = [5, 6, 3]$
      - Let's say these are the number of words in their responses
        - And a certain three-word sentence would make someone's life
    - $bias = 3$, in a "learning mood"
    - $weights \cdot inputs = 1*5 + 1*6 + 3*3 = 20$
    - Adding the bias: $20 + 3 = 23$
  - Now we just test that $23$ against our threshold, and we've got output!
- That's all there is to calculation.
  - The funny part is when we get to *calculating weights on our own*
  - No foreknowledge.
    - Just your wits about you
    - And math ;)
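
The Bob, Joe, and Jenny numbers can be checked in a few lines. I'm assuming the step function with a threshold of zero, and adding the bias on top of the dot product, as in the $h + b$ comparison earlier:

```python
import numpy as np

weights = np.array([1, 1, 3])   # Bob, Joe, Jenny
inputs = np.array([5, 6, 3])    # number of words in each response
bias = 3

weighted_sum = np.dot(weights, inputs)  # 1*5 + 1*6 + 3*3 = 20
h = weighted_sum + bias                 # 20 + 3 = 23
output = 1 if h >= 0 else 0             # step function, threshold at 0

print(weighted_sum, h, output)  # 20 23 1
```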

__There are infinitely many__ ways to approximate appropriate weights.

BUT BEFORE ANYTHING ELSE, you have to have effective data to learn from. Otherwise, we're just going crazy here.

Here is what I know:

1. Our guesses are initially random
2. They are wrong to *some degree*
3. We must correct against the wrongness


Let's look at each of these in turn

__1. Initially random__

- We really just pick a number
  - The idea is "you gotta start somewhere"
- Very simple to implement programmatically
```python
import random as rand
print(rand.randint(0,10))
```

__2. Our guess is wrong *to some degree*__

It would be amazing if our guess were always right.

- Heck, that's practically what we're trying to build here, with the network

But, it is very likely that our guess is __not__ right.

To measure how wrong we are several calculations can be employed.

One such calculation is the "Sum of Squared Errors", which is really a sum of sums, but that's overcomplicating the matter.

1. We take the difference of our guessed output (for a given input) and the "true" output
  - $y - \hat{y}$, where $y$ is the true output and $\hat{y}$ is a guess
2. Square it
  - Why?
    - So we can have __a.__ a positive number to work with and __b.__ accentuate big differences
3. Do this for every output node of the network, then sum them
  - Yes, there can be multiple output nodes
    - Not ___really___ in perceptrons
  - This is that first sum
4. Do this for all data points (each one an input-target pair), and __sum__ them
  - This is that second summation

And here's the crazy math equation, where $E$ is "error":

\begin{equation}
E = \frac 1 2 \sum_{\mu} \sum_j [y_j^{\mu} - \hat{y}_j^{\mu}]^2
\end{equation}

---
__Think about how this relates__ to the perceptron equations, and it actually makes sense:

A perceptron only has __1__ output node, so we only did that summation ___once___.

- If there had been more output nodes, we would have done it those $m$ times!

Eureka

---

__3. We must correct against our wrongness__

One of the most common methods of correction is (stochastic) __Gradient Descent__

- I haven't learned this yet, but I do know a few things about it

Namely:

- It effectively tests extremes in a "too hot, too cold" manner
  - It hops side-to-side, downwards towards the bottom of the "well" of this curve
  - Every time it jumps, it decreases the intensity, so as to not jump *strictly* side-to-side
- It is sensitive to local minima:
  - ![caveat local minima](https://d17h27t6h515a5.cloudfront.net/topher/2017/January/587c5ebd_local-minima/local-minima.png)

Determining which is "correct" is difficult from this image, but there are __two__ distinct valleys

- Either of which would be reported as "correct"

Lastly, some ___Math facts___ to ease the pain

- "Gradient" is just another word for rate of change
  - In simple plots, this is just the slope.
  - In other words
    - Where a function takes $n$ input variables,
    - Its gradient holds $n$ rates of change: one partial derivative per variable
- Multi-variable calculus is encouraged because it does the standard (easy) calculations across an n-variable space
  - Hurray for applied mathematics!
- (Review) you can find the slope at a point one of two ways
  - Infinitely approach it, from both sides, taking the slopes and approximating
  - Taking the derivative, and computing for the given $x$-value
    - $f(x) = x^2$
    - $f'(x) = 2x$ (apply the power rule)
    - We want the slope at $x = 2$, so just basic algebra:
      - $f'(2) = 2(2) = 4$
      - That's the slope/"rate of change"/gradient of the $x^2$ plotted parabola
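
Both ways of finding the slope can be compared in code. A small sketch (the function names and the step size are my choices):

```python
def f(x):
    return x ** 2

def slope_by_approach(func, x, step=1e-6):
    """Central difference: the slope between points just left and just right of x."""
    return (func(x + step) - func(x - step)) / (2 * step)

def f_prime(x):
    """Power rule applied to x^2."""
    return 2 * x

print(slope_by_approach(f, 2.0))  # very close to 4
print(f_prime(2.0))               # 4.0
```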

---
Mat says there's a way to avoid said valleys, [like momentum](http://sebastianruder.com/optimizing-gradient-descent/index.html#momentum) (as you roll through the well)

- And this turns out to be [a pretty epic resource](http://sebastianruder.com/optimizing-gradient-descent/index.html)

---

# Gradient Descent (returns)

---
I can't help but think, "why are we playing with perceptrons, when practically all the behavior we want will be achieved via a Sigmoid neuron?"

- Then I remind myself how Udacity doles this stuff out:
  - "Just enough to keep you going"
  - "We'll give you the rest later. We swear"

Gotta keep my eye on the prize, and just keep going

---

## The Math

Mat repeats a lot of what I covered in the "Review", with some additions

In regards to the Sum of Squared Errors:

1) Why not take the absolute value of the difference, instead of the square?
> The square penalizes larger errors ... and makes the math nice later on

2) Clarification of the activation function
- Again, like I covered in the Review, it's simply the thing that determines the output of the node
  - Takes into account the $bias$

3) Trimmer definition of the SSE:

\begin{equation}
E = \frac 1 2 \sum_{\mu} (y^{\mu} - \hat{y}^{\mu})^2
\end{equation}

... and given that $\hat{y} = \sum_i w_i x_i^{\mu}$ ...

\begin{equation}
E = \frac 1 2 \sum_{\mu} (y^{\mu} - (\sum_i w_i x_i^{\mu}))^2
\end{equation}

__Our data records__ are represented by $\mu$

- You can think of these as
  - Two tables
  - ... arrays
  - ... matrices
  - Whatever you want


If it needed reiterating:

1) The $\sum_{\mu}$ is just iterating the "two tables", summing up the SSE
  - Two because you'll have your __inputs__ ($x$) and __targets__ ($y$)
    - Gotta hit _some_ mark
  - The SSE, because that's how we train - in this case

2) (Like in the review) The SSE is a measure of how wrong we are
  - If the SSE is high, we're making lots of bad predictions
  - If it's low, we're on the road to good predictions
    - We also don't want to change that much
    - We've found the sweet spot, no need to move out of it.

---
Mat then asserts what I've derived:

- With only one data record, there isn't a summation
  - Because $\sum_{i=1}^1 f(x_i) = f(x_1)$, for any $f(x)$
- With only one output, there's only the inner summation
  - And if we've only got one record, there's no $\sum$ to speak of
  - It's clear as rain

---

Before I get to the upcoming mic drop...

### Mat makes simple Gradient Descent

Imagine a graph of $(7x - 8)^2 + 3$; a very thin parabola that doesn't have a *zero*

- But its $y$-axis is labeled $E$, for our error
- And its $x$-axis is labeled $w$, for our weight

---
Partial derivative seems to just be the derivative in one direction

- Instead of the full slope, we only look at how the function changes along one variable, holding the others fixed
  - Thus, a "partial"

---

We want to *head towards the minimum*, which is in the direction opposite the slope

- $\Delta w = -gradient$

> If we take increasingly small steps *down the gradient*, eventually, the weight will find the minimum of the error function
- This is __Gradient Descent__

Then we just redefine the weights:

\begin{equation}
w_{i+1} = w_i + \Delta w_i
\end{equation}
, where 

- $\Delta w_i$ is the determined step
- $w_i$ is what was used during the current iteration
- $w_{i + 1}$ will be what we want to use next iteration
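
Trying the update rule on the $(7w - 8)^2 + 3$ parabola from earlier. One loud assumption: I scale the step by a `learning_rate` of 0.01, which is not in my notes above. On a parabola this steep, stepping by the raw negative gradient overshoots and bounces further away each time, so some kind of small step is needed:

```python
def error(w):
    """The example error curve: a thin parabola with its minimum at w = 8/7."""
    return (7 * w - 8) ** 2 + 3

def gradient(w):
    """dE/dw by the chain rule: 2 * (7w - 8) * 7."""
    return 14 * (7 * w - 8)

def gradient_descent(w, learning_rate=0.01, steps=50):
    """Repeatedly step against the slope (learning_rate is my assumption)."""
    for _ in range(steps):
        delta_w = -learning_rate * gradient(w)  # head opposite the slope
        w = w + delta_w                         # w_{i+1} = w_i + delta_w_i
    return w

w_final = gradient_descent(w=0.0)
print(w_final, error(w_final))  # w approaches 8/7, E approaches 3
```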

Wow...

### Mat Speaks Math

He derived a one line equation that represents the 

## The Code

## Implementation Quiz

## Multilayered Perceptrons

## Backpropagation

### Implementation

# Finale

## Resources