# Table of Contents
* [1) Regression fundamentals](#1%29-Regression-fundamentals)
    * [1) Data and model](#1%29-Data-and-model)
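    * [2) Block diagram](#2%29-Block-diagram)
* [2) The simple linear regression model, its use, and interpretation](#2%29-The-simple-linear-regression-model,-its-use,-and-interpretation)
    * [1) The simple linear regression model](#1%29-The-simple-linear-regression-model)
    * [2) The cost of using a given line](#2%29-The-cost-of-using-a-given-line)
    * [3) Using the fitted line](#3%29-Using-the-fitted-line)
    * [4) Interpreting the fitted line](#4%29-Interpreting-the-fitted-line)
* [3) An aside on optimization: one dimensional objectives](#3%29-An-aside-on-optimization:-one-dimensional-objectives)
    * [1) Defining our least squares optimization objective](#1%29-Defining-our-least-squares-optimization-objective)
    * [2) Finding maxima or minima analytically](#2%29-Finding-maxima-or-minima-analytically)
    * [3) Finding the max via hill climbing](#3%29-Finding-the-max-via-hill-climbing)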

# 1) Regression fundamentals

## 1) Data and model

- x: input
- y: output
- f(x): functional relationship, expected relationship between x and y

$y_{i} = f(x_{i}) + e_{i}$
- $e_{i}$: error term

You can easily imagine that there are errors in this model, because you can have two houses that have exactly the same number of square feet, but sell for very different prices because they could have sold at different times. They could have had different numbers of bedrooms, or bathrooms, or size of the yard, or specific location, neighborhoods, school districts. Lots of things that we might not have taken into account in our model. 
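
As a minimal sketch of $y_{i} = f(x_{i}) + e_{i}$, the snippet below builds two houses with the same square footage but different error terms. The function `f`, its coefficients, and the error values are all made up for illustration; they are not from the course.

```python
def f(square_feet):
    # Hypothetical "true" relationship between size and price (assumed for illustration).
    return 50_000 + 280 * square_feet

# Two houses with identical size but different (made-up) error terms e_i.
sizes = [2000, 2000]
errors = [15_000, -30_000]

# Observed price: y_i = f(x_i) + e_i
prices = [f(x) + e for x, e in zip(sizes, errors)]
print(prices)  # [625000, 580000]
```

Both houses share the same $f(x_{i})$; the entire gap in sale price comes from the error terms, which stand in for everything the model leaves out.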

<img src="images/lec1_pic01.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/M6cMZ/regression-fundamentals-data-model) 6:45*

<!--TEASER_END-->

## 2) Block diagram

- $\hat{y}$: predicted house sales price
- $\hat{f}$: estimated function
- y: actual sales price

We're gonna compare the actual sales price to the predicted sales price for any given $\hat{f}$, and the quality metric will tell us how well we do. So there's gonna be some error in our predicted values, and the machine learning algorithm we use to fit these regression models is gonna try to minimize that error. So it's gonna search over all these functions to reduce the error in these predicted values.

<img src="images/lec1_pic02.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/fKsPh/regression-ml-block-diagram) 3:48*

<!--TEASER_END-->

# 2) The simple linear regression model, its use, and interpretation

## 1) The simple linear regression model

**What's the equation of a line?**
- It's just intercept + slope * our variable of interest, so we're gonna say that's $f(x) = w_{0} + w_{1}x$

And what this regression model then specifies is that each one of our observations $y_{i}$ is that function evaluated at $x_{i}$, plus an error term $\epsilon_{i}$:
$$y_{i} = w_{0} + w_{1}x_{i} + \epsilon_{i}$$

- $\epsilon_{i}$: error term, the distance from our specific observation back down to the line
- $w_{0}, w_{1}$: intercept and slope respectively, they are called regression coefficients.

<img src="images/lec1_pic03.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/N8p7w/the-simple-linear-regression-model) 1:12*

<!--TEASER_END-->

## 2) The cost of using a given line

**What is the difference between residual and error?**

https://www.coursera.org/learn/ml-regression/lecture/WYPGc/the-cost-of-using-a-given-line/discussions/Lx0xn5j1EeW0dw6k4EUmPw
- Residual is the difference between the observed value and the predicted value. Error is the difference between the observed value and the (often unknown) true value. As such, residuals refer to samples whereas errors refer to populations.
- There is a true function f(x) that we want to learn, and the observed values $y_{i}$ we have are in fact $y_{i} = f(x_{i}) + e_{i}$, because we need to assume our measurements have some error (or noise if you prefer this term). This $e_{i}$ is the real error, because the real value is $f(x_{i})$. On the other hand, the residual is $y_{i} - \hat f(x_{i})$, where $\hat f$ is our approximation (estimate) of the real f(x).

**Residual sum of squares (RSS)**

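Take the difference between each actual value and its predicted value (the residual), square each difference, and then sum the squares: $RSS(w_{0}, w_{1}) = \sum_{i=1}^{N} (y_{i} - [w_{0} + w_{1}x_{i}])^{2}$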
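
A small sketch of RSS as the sum of squared residuals, comparing the cost of two candidate lines on the same data. The data points and the helper name `rss` are invented for illustration.

```python
def rss(w0, w1, xs, ys):
    """Residual sum of squares of the line w0 + w1*x on the data (xs, ys)."""
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))

# Made-up observations.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 7.0]

rss_a = rss(0.0, 2.0, xs, ys)  # line y = 2x      -> residuals 0, 0, 1
rss_b = rss(1.0, 2.0, xs, ys)  # line y = 1 + 2x  -> residuals -1, -1, 0
print(rss_a, rss_b)  # 1.0 2.0
```

The line with the lower RSS (here the first one) is the better fit under this quality metric.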

<img src="images/lec1_pic04.png">
<img src="images/lec1_pic05.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/WYPGc/the-cost-of-using-a-given-line) 3:26*

<!--TEASER_END-->

## 3) Using the fitted line

- A model is defined in terms of some parameters, and a fitted line is a specific example within that model class.


**Why the hat notation?**

https://www.coursera.org/learn/ml-regression/lecture/RjYbf/using-the-fitted-line/discussions/QOsWrZkGEeWKNwpBrKr_Fw

- In statistics, the hat notation is used to denote an estimated or predicted value.

e.g. $\hat{y}$ (Y-hat) stands for the predicted values of y (house price).

http://mathworld.wolfram.com/Hat.html

https://en.wikipedia.org/wiki/Hat_operator

- The hat denotes a predicted value, as contrasted with an observed value. For our purposes right now, I think of the hat value as the value that sits on the regression line, because that's the value our regression analysis would predict. So, for example, the residual for a particular observation is $y_{i} - \hat{y}_{i}$, where $y_{i}$ is the actual observed outcome at a particular observed value of x and $\hat{y}_{i}$ is the value that our regression analysis predicts for y at that same x value.

<img src="images/lec1_pic06.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/RjYbf/using-the-fitted-line) 2:00*

<!--TEASER_END-->

## 4) Interpreting the fitted line

- $\hat{w}_{1}$: predicted change in the output per unit change in the input.

One thing I want to make very, very clear is that the magnitude of this slope depends both on the units of our input and on the units of our output. So, in this case, the units of the slope are dollars per square foot. And so, if I gave you a house that was measured in some other unit, then this coefficient would no longer be appropriate for that.

For example, if the input is square feet and you have another house that was measured in square meters instead of square feet, well, clearly I can't just plug that number into the equation.
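
To make the units point concrete, here is a sketch with a hypothetical fitted line (the intercept and slope values are assumptions, not course results). It shows two valid ways to handle a house measured in square meters, and what goes wrong if the number is plugged in unconverted.

```python
SQFT_PER_SQM = 10.7639  # approximate number of square feet in one square meter

w0_hat = 50_000.0        # hypothetical intercept, in dollars
w1_hat_per_sqft = 280.0  # hypothetical slope, in dollars per square foot

def predict_from_sqft(square_feet):
    return w0_hat + w1_hat_per_sqft * square_feet

area_sqm = 200.0

# Option 1: convert the input into the units the line was fit in.
price_via_input = predict_from_sqft(area_sqm * SQFT_PER_SQM)

# Option 2: convert the slope into dollars per square meter.
w1_hat_per_sqm = w1_hat_per_sqft * SQFT_PER_SQM
price_via_slope = w0_hat + w1_hat_per_sqm * area_sqm

# Wrong: treating 200 square meters as if it were 200 square feet.
price_wrong = predict_from_sqft(area_sqm)
```

The two converted predictions agree, while the unconverted one badly underestimates the price because the slope is interpreted per square foot.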

<img src="images/lec1_pic07.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/x8ohF/interpreting-the-fitted-line) 3:54*

<!--TEASER_END-->

# 3) An aside on optimization: one dimensional objectives

## 1) Defining our least squares optimization objective

$RSS(w_{0}, w_{1}) = g(w_{0}, w_{1})$

Our objective here is to minimize $g$ over all possible combinations of $w_{0}$ and $w_{1}$.
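
One way to picture "minimizing over all combinations" is a brute-force sketch that evaluates $g(w_{0}, w_{1})$ on a small grid of candidates and keeps the best one. This is only an illustration with made-up data and an arbitrary grid, not the method the course uses to fit the model.

```python
def rss(w0, w1, xs, ys):
    """g(w0, w1): residual sum of squares of the line w0 + w1*x."""
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))

# Made-up data lying exactly on the line y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

# A coarse, arbitrary grid of (w0, w1) candidates.
candidates = [(w0, w1)
              for w0 in (0.0, 0.5, 1.0, 1.5, 2.0)
              for w1 in (1.0, 1.5, 2.0, 2.5, 3.0)]

# Keep the candidate with the smallest objective value.
best_w0, best_w1 = min(candidates, key=lambda w: rss(w[0], w[1], xs, ys))
print(best_w0, best_w1)  # 1.0 2.0
```

A grid only finds the best point it happens to contain, which is one reason to look for analytic or iterative ways of locating the minimum.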

<img src="images/lec1_pic08.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/a1QCT/defining-our-least-squares-optimization-objective) 2:35*

<!--TEASER_END-->

## 2) Finding maxima or minima analytically

- Concave function: take any two values of w, a and b, and draw a line between the points $(a, g(a))$ and $(b, g(b))$. For a concave function, that line lies below g(w) everywhere between a and b.
- Convex function: the opposite property, where the line connecting $(a, g(a))$ and $(b, g(b))$ lies above g(w) everywhere between a and b.
- There are functions that are neither concave nor convex, where the line lies both below and above g(w).
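
The chord picture above can be checked numerically. The sketch below samples points along a single chord, so it tests one pair (a, b) only; a function is concave or convex only if the property holds for every pair. The helper names and the tolerance are assumptions made for this illustration.

```python
def chord_is_below(g, a, b, steps=50, tol=1e-9):
    """True if the chord from (a, g(a)) to (b, g(b)) never rises above g on [a, b]."""
    for k in range(steps + 1):
        t = k / steps
        w = a + t * (b - a)
        chord = g(a) + t * (g(b) - g(a))
        if chord > g(w) + tol:
            return False
    return True

def chord_is_above(g, a, b, steps=50, tol=1e-9):
    """True if the chord never drops below g on [a, b] (checked by flipping the sign of g)."""
    return chord_is_below(lambda w: -g(w), a, b, steps, tol)

concave = lambda w: 5 - (w - 10) ** 2
convex = lambda w: (w - 10) ** 2
neither = lambda w: w ** 3

print(chord_is_below(concave, 0, 20), chord_is_above(concave, 0, 20))  # True False
print(chord_is_below(convex, 0, 20), chord_is_above(convex, 0, 20))    # False True
print(chord_is_below(neither, -2, 2), chord_is_above(neither, -2, 2))  # False False
```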

<img src="images/lec1_pic09.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/RUtxG/finding-maxima-or-minima-analytically) 3:49*

<!--TEASER_END-->

- For a concave function, the point where g(w) reaches its max is the point where the derivative = 0. Same thing for a convex function: if we want to find the minimum over all w of g(w), at the minimum point the derivative = 0.

Example: $g(w) = 5 - (w - 10)^{2}$

$\frac{d g(w)}{d w} = 0 - 2(w-10) \cdot 1 = -2w + 20$

When we draw the function g(w), we can see it has a concave form. **How do I find this maximum?**
- I take the derivative and set it equal to 0: $-2w + 20 = 0$, which we can solve to get w = 10.
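
As a sanity check on this example, the sketch below compares the analytic derivative $-2w + 20$ with a central finite-difference estimate and confirms that the derivative vanishes at w = 10. The helper `numeric_derivative` and its step `h` are assumptions introduced for illustration.

```python
def g(w):
    return 5 - (w - 10) ** 2

def dg_dw(w):
    # Analytic derivative via the chain rule: -2(w - 10) = -2w + 20
    return -2 * w + 20

def numeric_derivative(func, w, h=1e-5):
    # Central finite-difference approximation of the derivative at w.
    return (func(w + h) - func(w - h)) / (2 * h)

# Solving -2w + 20 = 0 gives the candidate maximum.
w_star = 20 / 2

print(dg_dw(3.0), numeric_derivative(g, 3.0))  # both approximately 14
print(w_star, dg_dw(w_star), g(w_star))        # 10.0 0.0 5.0
```

Because g is concave, the point where the derivative is zero is the maximum: g is smaller on either side of w = 10.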

<img src="images/lec1_pic10.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/RUtxG/finding-maxima-or-minima-analytically) 6:45*

<!--TEASER_END-->

## 3) Finding the max via hill climbing

If we're looking at these concave situations and our interest is in finding the max over all w of g(w), one thing we can look at is something called a **hill-climbing algorithm**. It's an iterative algorithm where we start somewhere in this space of possible w's and then keep changing w, hoping to get closer and closer to the optimum.

Okay, so, let's say that we start w here. And a question is: **should I increase w, moving w to the right, or should I decrease w, moving w to the left, to get closer to the optimum?**
- What I can do is I can look at the function at w and I can take the derivative and if the derivative is positive like it is here, this is the case where I want to increase w. If the derivative is negative, then I want to decrease w.
- So, we can actually divide the space into two. 
    - Where on the left of the optimal, we have that the derivative of g with respect to w is greater than 0. And these are the cases where we're gonna wanna increase w. 
    - And on the right-hand side of the optimum we have that the derivative of g with respect to w is negative. And these are cases where we're gonna wanna decrease w. 
    - If I'm exactly at the optimum, which maybe I'll call $w^{*}$. I do not want to move to the right or the left, because I want to stay at the optimum. The derivative at this point is 0.
    
So, again, the derivative is telling me what I wanna do. We can write down this climbing algorithm:
- While not converged, I'm gonna take my previous w, where I was at iteration t, so t is the iteration counter. And I'm gonna move in the direction indicated by the derivative. So, if the derivative of the function is positive, I'm going to be increasing w, and if the derivative is negative, I'm going to be decreasing w, and that's exactly what I want to be doing. 
- But instead of moving exactly the amount specified by the derivative at that point, we can introduce something I'll call eta ($\eta$). And $\eta$ is what's called a step size.
- So, when I go to compute my next w value, I'm gonna take my previous w value and move an amount based on the derivative, scaled by the step size: $w^{(t+1)} \leftarrow w^{(t)} + \eta \left.\frac{d g}{d w}\right|_{w^{(t)}}$

Example: Let's say I happen to start on this left-hand side at this w value here. At this point the derivative is pretty large, because the function's pretty steep, so I'm going to be taking a big step. Then I compute the derivative again, and I'm still taking a fairly big step. I keep increasing w, taking a step and recomputing the derivative each time, and as I get closer to the optimum the size of the derivative decreases, so the steps get smaller.
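
The hill-climbing loop described above can be sketched as follows, using the earlier concave example $g(w) = 5 - (w - 10)^{2}$. The notes only say "while not converged", so the stopping rule here (derivative magnitude below a tolerance, plus an iteration cap) and the fixed step size of 0.1 are assumptions made for illustration.

```python
def dg_dw(w):
    # Derivative of g(w) = 5 - (w - 10)^2
    return -2 * (w - 10)

def hill_climb(derivative, w_start, eta=0.1, tol=1e-6, max_iters=10_000):
    """Move w in the direction of the derivative, scaled by the step size eta."""
    w = w_start
    for _ in range(max_iters):
        slope = derivative(w)
        if abs(slope) < tol:  # derivative near 0: treat as converged (assumed criterion)
            break
        w = w + eta * slope   # positive slope increases w, negative slope decreases w
    return w

w_from_left = hill_climb(dg_dw, w_start=0.0)    # starts left of the optimum, climbs right
w_from_right = hill_climb(dg_dw, w_start=25.0)  # starts right of the optimum, climbs left
print(round(w_from_left, 4), round(w_from_right, 4))  # 10.0 10.0
```

Far from the optimum the derivative is large and so are the steps; near w = 10 the derivative shrinks and the updates become tiny, matching the behavior described in the example.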

<img src="images/lec1_pic11.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/O4j1e/finding-the-max-via-hill-climbing) 3:34*

<!--TEASER_END-->