## Logistic Regression

Using an example very similar to the one in the "Intro to Machine Learning" lesson that covers *data surfaces*, we learn that this process of determining whether a point *passes* or *fails* a test (based on historical data) is __Logistic Regression__.

- The example (with the similar data surface) is of student grades and entrance test scores, colored for acceptance
  - green dots on the right hand of the D.S. are (supposedly) accepted
  - red dots on the left hand are subsequently rejected

Another way to think about this is: "how easily can we separate the data?" (with a line)
- In this lesson, we'll gradually reduce our losses via __gradient descent__
  - The *log-loss function* outputs the error we want to minimize

Answer this question:
  - In which direction can we rotate (or move) the line for the maximum error reduction?
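
To make that question concrete for myself, here is a minimal sketch of one gradient descent step on the log-loss. The four data points, the starting line (all zeros), and the learning rate are made up by me, and the sigmoid it uses only shows up later in these notes, so treat it as an assumption-laden illustration rather than the course's code.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into (0, 1), read here as "probability of acceptance"
    return 1 / (1 + np.exp(-z))

def log_loss(weights, bias, points, labels):
    # Mean log-loss of the line described by weights and bias
    predictions = sigmoid(points @ weights + bias)
    return -np.mean(labels * np.log(predictions)
                    + (1 - labels) * np.log(1 - predictions))

# Hypothetical data: columns are (test score, grade), scaled to 0..1
points = np.array([[0.9, 0.8], [0.8, 0.6], [0.3, 0.4], [0.2, 0.1]])
labels = np.array([1, 1, 0, 0])  # 1 = accepted (green), 0 = rejected (red)

weights = np.zeros(2)
bias = 0.0
learning_rate = 0.1  # assumed step size

loss_before = log_loss(weights, bias, points, labels)

# The gradient tells us which way to rotate (weights) and move (bias) the line
errors = sigmoid(points @ weights + bias) - labels
weights -= learning_rate * (points.T @ errors) / len(labels)
bias -= learning_rate * errors.mean()

loss_after = log_loss(weights, bias, points, labels)
print(loss_before, loss_after)
```

The second number printed should be smaller than the first: stepping against the gradient is the direction of fastest error reduction, at least for a small enough step.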

# Neural Networks

> It looks like we're proposing too many acceptances, because some of these students have low grades with good test scores...

> Hmmmm.

> Maybe a circle?

> How about two lines? Yes, __let's use two lines__

But how do we find the two lines?
- Gradient descent still does the job.
- We'll calculate against the log-loss function, looking for two different positions

## This is a neural network
1. Is this point over the horizontal line?
2. Is this point over the vertical line?
3. Are the answers to both 1 and 2 yes?
  - This yields only a single "yes"


---

This feels like the makings of a perceptron...
- Doing bare-bones, basic boolean ANDs

I'm wondering how we translate this truth-table-like problem into a system of nodes

---

And then teach breaks it down for us ;)
> Let's graph each question as a small node

### A breakdown
(To better form an intuition?)

1. Is the point over the horizontal line?
    - Has two input nodes: test score and grades
    - We can plot these like a coordinate pair
    - Outputs a Boolean
2. Is the point over the vertical line?
    - Outputs a Boolean to answer the question
3. Add a node (the __output node__), one layer "in"?
    - This node receives the previous two outputs and logical ANDs them together
    - Outputs the result of that Boolean AND (Boolean multiplication, not addition)

---
But __how__ is this a neural network?

Well just look at the layers like this:

\begin{equation}
\begin{vmatrix}
\text{Test:} & 1 \\
\text{Grade:} & 8
\end{vmatrix}
\rightrightarrows
\begin{vmatrix}
\text{horizLineTestNode} \\
\text{vertLineTestNode}
\end{vmatrix}
\rightrightarrows
\begin{vmatrix}
\text{AND}
\end{vmatrix}
\rightarrow
\text{Output}
\end{equation}

- picture the *test* and *grade* flowing into both test nodes (four arrows instead of two)
- That's the general idea
  - Data flows left-to-right
  - Remember: a given neuron takes, as input, the output of other neurons
    - We use numbers at this level
    - But soon, we'll do so much more
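
The left-to-right flow in the diagram can be sketched as three tiny functions. The cutoff values are hypothetical (the lesson never says where the two lines sit), and the function names are mine.

```python
# Hypothetical cutoffs; the notes do not say where the two lines actually sit
TEST_CUTOFF = 5.0
GRADE_CUTOFF = 5.0

def over_horizontal_line(test, grade):
    # First line-test node: is the grade above the horizontal line?
    return grade > GRADE_CUTOFF

def over_vertical_line(test, grade):
    # Second line-test node: is the test score past the vertical line?
    return test > TEST_CUTOFF

def output_node(test, grade):
    # Output node: AND the answers of the two line-test nodes
    return over_horizontal_line(test, grade) and over_vertical_line(test, grade)

print(output_node(test=1, grade=8))  # the diagram's example point
print(output_node(test=8, grade=8))
```

Both inputs reach both line-test nodes (the "four arrows"), and only a point that clears both lines gets a "yes".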

I really don't want to type out a truth table, or a binary grid
  - Please just remember this
  - Or just review [this](http://kias.dyndns.org/comath/21.html)

# Perceptron

---
Vocabulary:
  - __Perceptrons__, or neurons, are individual nodes in an interconnected network
    - They are the basic unit of neural networks
    - Function: *Look at input data and decide how to categorize said data*

---

With the previous example of school acceptance
- inputs either passed the threshold for grades and test scores, or didn't
- The outputs - "yes" and "no" - were the __categories__

Those categories then combine to produce a decision!
- Back in example land, whether or not the student is granted admission.

---
But wait a second....

How the heck do these nodes know what's important in checking that threshold?
> When initialized, we don't know what information will be most important.

> So... We have the network __learn for itself__. And even let it adjust how it considers the data.

All of this is done with a little something called...

### Weights

---
My turn.

Perceptrons (aka "neurons", "nodes") have weights for all the inputs that they receive
- Think of this like a person who values certain people's opinions more than others

Based on the inputs (and associated weights), an output can be determined.
- And since neurons will only be connected to a set number of other neurons, it is safe to have a fixed number of weights!
  - It all makes sense in my brain: 
    - each "opinion" has more/less sway on the node, in its categorization
  - Let's see how the course explains it.

---

> When input data is received, it gets multiplied by a weight that is assigned to this particular input.
- If we have our example of school acceptance
  - `tests` could be an input name for test scores
  - `grades` could be an input name for the student's average grades

---
Pretty much what I said, with a different tone of voice, and vocabulary.

---

__Note__ that a neuron's weight for a given input starts out random (the weights do not need to add up to 1.00; they are not percentages)
- Over time, as the neural network learns more about the kind of input data that leads to student acceptance
  - The network adjusts weights based on errors in categorization (from previous results)
  - This is called __training__

---

Vocabulary:
  - __training__: The process of improving a network's individual node-level weights, to produce the desired output

---

- An __extreme__ example of weights would be if the test scores had *no effect at all* on the university acceptance
  - The weight of the "test score" input would be zero and have no effect on the output of the perceptron

## Processing Input (at the perceptron level)

Just to review:

- Each input has a weight that represents its importance
- Weights are determined during the learning process (aka, __training__)

### Summing the Input Data

---

> Note that weights will *always* be represented by some type of the letter __w__. It will be capitalized - __W__ - when it represents a __matrix__
- Subscript will specify *which* weights.

Just remember the variable naming conventions: both math and code



---

They set this equation as the relationship between weights and inputs:

\begin{equation}
w_{grades} \cdot x_{grades} + w_{test} \cdot x_{test} = -1 \cdot x_{grades} - 0.2 \cdot x_{test}
\end{equation}

> The perceptron applies these weights to the inputs and sums them in a __linear combination__.

---
The next section is verbose as it tries to skirt around the math notation - it's just summations
- Of `x_input` and `w_input`, which is the __input__ and the __weight__ associated with said __input__ 
  - `x` differentiates it, nothing more.

- Summatively, that looks like:

\begin{equation}
\sum_{i=1}^m w_{i} \cdot x_{i}
\end{equation}

- Where *m* is the number of inputs
  - This can be omitted
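
As a quick check on the summation, here is a small sketch using the weights from the equation above. The student's input values are made up.

```python
import numpy as np

# Weights from the equation above; the input values are hypothetical
weights = np.array([-1.0, -0.2])  # w_grades, w_test
inputs = np.array([8.0, 1.0])     # x_grades, x_test

# The summation written out term by term...
explicit_sum = sum(w * x for w, x in zip(weights, inputs))
# ...and the same linear combination as a dot product
dot_product = np.dot(weights, inputs)

print(explicit_sum, dot_product)
```

Both lines compute the same number; the dot product is just the summation in one call.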

#### Please Note
that the relative size of weights is what's most important, not the weights themselves

- He uses an example of
  - w_grades = -1
  - w_test = -.2
- And points out that the ratio of 5:1 is what's important
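
A sketch of why the ratio matters, assuming the decision only depends on which side of zero the weighted sum lands (which is what the step function in the next section does). The input points are made up.

```python
import numpy as np

weights = np.array([-1.0, -0.2])  # w_grades, w_test from the example
scaled = 10 * weights             # -10 and -2: same 5:1 ratio

# Hypothetical (grades, test) pairs
points = np.array([[8.0, 1.0], [-1.0, 2.0], [0.5, -4.0]])

# Which side of the line each point falls on, for both weight sets
print(np.sign(points @ weights))
print(np.sign(points @ scaled))
```

Both prints agree: scaling every weight by the same positive factor stretches the sum but never flips its sign. Once a bias enters the picture, it would have to be scaled by the same factor for this to hold.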

__In brief__, multiply an input by its weight

- Take an opinion by its importance factor
- It's that easy.
  - Now to learn about these activation functions

---

### Calculating Output: an Activation Function

First the analogy that makes the most sense in my mind:
> Consider an actual neuron in the brain:
- It absorbs charge from its neighbors
- After a certain amount of input, it discharges

> When the activation function fires, the neuron discharges

__He shows__ an activation function as merely a conditional function:

\begin{equation}
f(h) = \begin{cases} 0 & \text{if } h < 0 \\ 1 & \text{if } h \geq 0 \end{cases}
\end{equation}

- Apparently, this function is called the __Heaviside Step Function__
  - This sucks for university acceptance :P
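
A minimal sketch of the step function in code, using the example weights from earlier and three made-up students. The `np.where` formulation is my choice, not the instructor's.

```python
import numpy as np

def heaviside_step(h):
    # 1 where h >= 0, otherwise 0
    return np.where(h >= 0, 1, 0)

weights = np.array([-1.0, -0.2])  # w_grades, w_test from the earlier example
# Hypothetical students, one per row: (grades, test)
students = np.array([[8.0, 1.0], [0.0, 0.0], [-1.0, 2.0]])

h = students @ weights
print(heaviside_step(h))
```

With these negative weights the student with actual grades gets a 0, which is exactly why this sucks for university acceptance.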

---
This guy is really making this seem too easy.
- I get that the math is simple, but his round-about way of explaining it is frustrating
  - I know how to graph on a chart!

---

So we've written a dodgy function (only poor students activate). We'll fix it
- The instructor's finally getting to the point!

### Bias, or "shifting the goods"

Suppose we have a good shape (to the graph of acceptable inputs), but we don't like the selection.
- Just add __bias__

---
Vocabulary:

  - __bias__ is a translation of the graph of acceptable inputs.
    - Simply *adding* to the result of the base equation
    - First demonstrated on the y-axis.

---

---
For feedback:

> This feels like a bad way to explain this. Sure, not everyone remembers mathematics as I do. But this doesn't do the equations justice, and I think it would be inappropriate for students to feel confident with these hand-wave-y definitions

- The math is clear, but the words are __not__.
  - Almost has me doubting my math

---

More to my point, I'm just going to copy the instructor's paragraph...

---
Note to self:

- I keep slipping in and out of the Feynman technique.
- I want to rearticulate it, but there's nothing to say
  - __ADD__ a bias. Literally.
    - Some number, by the science of neural networks, will be generated to shift the good shape to have the right coverage
  - I'm not going to re-explain [what summation is](https://en.wikipedia.org/wiki/Summation)
    - I've been familiar with this concept for longer than I've been programming.
    - Here's my attempt at it:
      - For every *single* integer between \*points to start and end of the summation\* (inclusive)
      - Evaluate the expression with __that__ number plugged in
      - Add __all__ of those results together.

---

![perceptron formula](https://d17h27t6h515a5.cloudfront.net/topher/2017/February/58951180_perceptron-equation-2/perceptron-equation-2.gif "Perceptron Formula")

Here's the aforementioned, poor, presumptive summary:

> This formula returns $1$ if the input ($x_{1}$,$x_{2}$,...,$x_{m}$) belongs to the accepted-to-university category or returns 0 if it doesn't. The input is made up of one or more real numbers, each one represented by $x_i$, where $m$ is the number of inputs.

> Then the neural network starts to learn! Initially, the weights ( $w_{i}$ ) and bias ( $b$ ) are assigned a random value, and then they are updated using a learning algorithm like gradient descent. The weights and biases change so that the next training example is more accurately categorized, and patterns in data are "learned" by the neural network.

> Now that you have a good understanding of perceptrons, let's put that knowledge to use. In the next section, you'll create the AND perceptron from the Neural Networks video by setting the values for weights and bias.

---

While I feel like I'm growing comfortable with Neural Networks, I'm seriously disappointed in this section
- "Uhhh... We can't figure out how to teach this to you, so we're going to give you an impartial __text__ lecture that is heavy on mathematics, but without the explanations of their necessity"
- It's like Siraj's videos, but less goofy; more stoic
  - That is __not__ a good thing.

## Quizzes, or "suffering will be your teacher"

---
This first quiz feels almost unfair, given the abysmal wrap up of the last section.

- I'm about to read another resource to get together a more decent understanding of the activation.
  - I can appreciate the ambiguity of "you define it however you want", but what is the threshold for effective?
    - How could you ever know beforehand?
  - I don't imagine the "Heaviside step function" will always suffice for an activation function

It's a "fill in the blank", with the weights of the inputs, a bunch of graphs showing desired behavior, and frustration...
- They want the AND function first..
  - It's contrived instruction, to get us used to using the $b$
- `AND` can be programmed much more simply.
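
For my own reference, a sketch of an AND perceptron built on the step function. The weights and bias are one choice of many that work, picked by me; they are not necessarily the quiz's expected answer.

```python
import numpy as np

def perceptron(inputs, weights, bias):
    # Heaviside step applied to the weighted sum plus bias
    return 1 if np.dot(weights, inputs) + bias >= 0 else 0

# One of many weight/bias choices that behaves like AND (my pick)
and_weights = np.array([1.0, 1.0])
and_bias = -1.5

# Print the full truth table
for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, perceptron(np.array([a, b]), and_weights, and_bias))
```

Only the `1 1` row clears the bias of `-1.5`, so only that row outputs 1.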

---
So, somehow, changing the weights and bias for these logic gate neural networks is relaxing my mind.

- My brain is just treating them like puzzles.
  - I still don't know if this is a good thing
- After I gave the answer *it wanted* for "how would we shift this graph to go from AND to OR?"
  - For the record: "Increase the weights" and "Decrease the magnitude of the bias"
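
Both answers can be checked with a quick sketch. The starting AND values and the two adjusted variants are hypothetical numbers of mine; `truth_table` is a helper I made up for the comparison.

```python
import numpy as np

def perceptron(inputs, weights, bias):
    # Heaviside step applied to the weighted sum plus bias
    return 1 if np.dot(weights, inputs) + bias >= 0 else 0

def truth_table(weights, bias):
    # Outputs for the inputs (0,0), (0,1), (1,0), (1,1), in that order
    rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
    return [perceptron(np.array(row), np.array(weights), bias) for row in rows]

and_gate = truth_table([1.0, 1.0], -1.5)        # hypothetical AND starting point
bigger_weights = truth_table([2.0, 2.0], -1.5)  # "increase the weights"
smaller_bias = truth_table([1.0, 1.0], -0.5)    # "decrease the magnitude of the bias"

print(and_gate, bigger_weights, smaller_bias)
```

Either change lets a single active input clear the threshold, which turns AND into OR.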

__ALL__ of these are still using the aforementioned biased step function.

---
Back at it:
> A neural network is like any tool. You have to know when to use it.

Thankful they didn't make me play the "weight puzzle game" with XOR - it's always a pain.
- Thankfully, the XOR is a composition of
  - two NOT gates
  - two AND gates
  - one OR gate
  - a few passthroughs to give the illusion of "layers".
- It's __very__ similar to the circuit.
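
A sketch of that composition, wiring XOR as `OR(AND(a, NOT b), AND(NOT a, b))`. The weights and biases for each gate are my own picks, and this particular wiring is the textbook circuit, which may not match the course's diagram node for node.

```python
import numpy as np

def perceptron(inputs, weights, bias):
    # Heaviside step applied to the weighted sum plus bias
    return 1 if np.dot(weights, inputs) + bias >= 0 else 0

# Gate parameters are hypothetical choices that produce the right truth tables
def not_gate(a):
    return perceptron(np.array([a]), np.array([-1.0]), 0.5)

def and_gate(a, b):
    return perceptron(np.array([a, b]), np.array([1.0, 1.0]), -1.5)

def or_gate(a, b):
    return perceptron(np.array([a, b]), np.array([1.0, 1.0]), -0.5)

def xor_gate(a, b):
    # Two NOTs, two ANDs, one OR
    return or_gate(and_gate(a, not_gate(b)), and_gate(not_gate(a), b))

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_gate(a, b))
```

No single perceptron here can do XOR on its own; it only works because the gates are stacked in layers.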

> The power of a neural network isn't building it by hand, like we were doing. It's the ability to learn from examples. In the next few sections, you'll learn how a neural network sets its own weights and biases.

... Yeah. Hopefully.
- I feel like I've learned a lot of vocabulary, and a few theoretical applications of math I've known for ages...
  - And refreshed my Linear Algebra, which is always good.
- I've seen glimpses of the applications of that Matrix Math in these neural systems
  - I want to wield that power...
  - There, I said it.

---

### Mat returns

He gives a breakdown of all that goes into a neuron:
- Inputs
  - And associated weights
- A bias
- resultant input, $h$
- Activation function $f(h)$
- Output, $y$

Drawn as such:

![image](https://d17h27t6h515a5.cloudfront.net/topher/2017/February/589366f0_simple-neuron/simple-neuron.png "Circles are units, boxes are operations")

This is rather decent, leading to some insights:

- Construction of the activation function argument
  - Weighted inputs, with bias
  - In other words: $h = \sum_i w_i \cdot x_i + b$
- The argument to the activation function: $h$
- The exit - and output - is the result of the __activation function__: $y = f(h)$

---
My mind understands composition.

This is a layer, a composition, of trivial operations

- Multiplication
- Addition
- Identity

... Along with some arbitrary activation function.

---

Mat has us program the [Sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function) and they keep repeating this:

> stacking units will let you model linearly inseparable data, impossible to do with regression models.

This phrase "linearly inseparable data" has been said four times now.
- I haven't heard it before
- But I don't think many operations are above decomposition.
  - I actually think there are infinitely many operations which can be decomposed

---
I'm not going to run the code, but here it is:

```python
import numpy as np

def sigmoid(x):
    # Sigmoid activation: squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-x))

inputs = np.array([0.7, -0.3])
weights = np.array([0.1, 0.8])
bias = -0.1

# Neuron output: weighted sum of the inputs plus the bias, passed through the sigmoid
output = sigmoid(np.dot(weights, inputs) + bias)

print('Output:', output)

```

# Gradient Descent
## Learning Weights

If we don't know the correct weights, we have to learn them from history.

- Example data

We can then predict (produce outputs) with those weights

### Know when you're wrong

Since we start with random weights, we need to adjust those weights towards what is proper

- *iterate*, if you will

__So we *measure*__ how wrong we are, to gather some sense of how to course correct

- i.e. "Too big", "too small", "too hot", "too cold", "just right"
  - Thanks, Goldilocks

One of the best methods for doing this is the "[Sum of Squared Errors](https://en.wikipedia.org/wiki/Residual_sum_of_squares)":

\begin{equation}
E = \frac{1}{2} \sum_{\mu} \sum_{j} \left[ y_j^{\mu} - \hat{y}_j^{\mu} \right]^2
\end{equation}

#### Let's break it down

Pseudo-academia (me):
> Take half of the sum, over every data point $\mu$ and every output unit $j$, of:
  - the squared difference between the true value $y_j^\mu$ and the prediction $\hat{y}_j^\mu$

Layman's terms:
> For every data point (in our training set) calculate the inner sum of 
- The squared difference of
- The true value (categorization [think back]) of a data point, and the prediction for that data point
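
The same formula as a short sketch, with rows standing in for data points ($\mu$) and columns for output units ($j$). The target and prediction values are made up.

```python
import numpy as np

def sum_of_squared_errors(targets, predictions):
    # Half the sum, over all data points and output units, of the squared differences
    return 0.5 * np.sum((targets - predictions) ** 2)

# Hypothetical values: three data points (rows), one output unit (column)
targets = np.array([[1.0], [0.0], [1.0]])
predictions = np.array([[0.8], [0.3], [0.6]])

error = sum_of_squared_errors(targets, predictions)
print(error)
```

Squaring keeps every term positive and punishes big misses more than small ones, so a lower number always means predictions closer to the truth.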

---
We're building a graph and we need to get the shape right.
- The squaring is probably inherited from statistical models that fit to a curve. 
  - Or correlation against a best fit line

---

Finished at 11:09 PM...

Having studied hard, had an awful work day, and endured ill-suited-for-me instruction ("intellectual babble", that uses all the right words sans substance), I just want to go to sleep.
- My chest is heavy.
  - It hurts to feel this much angst
  - It's disappointing that getting started is such a labor
- Was this really worth \$600?
  - When I've got so many things I could have read for free?
  - When the, arguably, most important lesson in this course isn't ringing true like the previous math lesson?
    - Why? What do I do? :(