# Machine Learning Math 1

This series begins by introducing the math that lets you *interpret* a linear regression.

## Machine Learning Intuition

### Plain English

In much of computer science, we reason about algorithms by assuming an oracle machine: you feed it inputs and it produces the correct output every time. Essentially, it is black magic hiding inside a black box.

* [Oracle Machine](https://en.wikipedia.org/wiki/Oracle_machine)

Can we create something that isn't a black box that will produce the correct output every time? We can't be sure. Maybe no oracle machine can be built. Maybe building such an oracle machine is extremely complex and time-consuming.

What if we could step back and ask the question, "What if an almost-oracle is good enough?"

At a very abstract level, if we have many data points, we can theoretically summarize the data points with a model that says something meaningful or illuminating about the data. While it isn't necessarily 100% correct, this model can be used to understand both the data points that we've seen as well as data points we've not yet seen.

* [George Box quote](https://en.wikipedia.org/wiki/All_models_are_wrong)

The process of creating these almost-oracles based on data points is known as machine learning.

In machine learning, you choose some class of models you have reason to believe will provide you with almost correct outputs given what you understand about the inputs. You feed your model with the input data, you provide some hints, and then you ask the model to train itself. By observing the outputs it creates, the model adjusts itself to provide increasingly correct answers as outputs.

To give you a better sense of what this means in practice, we'll talk about the model you created in the machine learning workshop: linear regression.

### Matrix Transformation

Let's start by pretending that you want to describe an object, say a house. To create this description, you might capture many aspects of the house. You might record values like the size of the house, the size of the backyard, the number of bedrooms and bathrooms in the house. Each of these values that you choose to collect is referred to as a *feature*.

Assume you have $m$ different objects that you've described with $n$ features. You can label the specific feature of a specific object as $x_{i,j}$, where $i$ corresponds to the ID of the object, and $j$ corresponds to the index of the feature. This allows us to represent these $m$ data points as an $m \times n$ matrix.

$$
\begin{bmatrix}
x_{1,1} & \cdots & x_{1,n} \\
x_{2,1} & \cdots & x_{2,n} \\
\vdots & \ddots & \vdots \\
x_{m,1} & \cdots & x_{m,n}
\end{bmatrix}
$$

Let's assume that there is an additional feature that is intuitively related to the features we collected. In our case, it might be the price of the house in question. Because we have $m$ different values for this additional feature, we can represent these as an $m \times 1$ column vector.

$$
\begin{bmatrix}
y_1 \\
y_2 \\
\vdots \\
y_m
\end{bmatrix}
$$
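As a minimal sketch of this layout, the snippet below stores a feature matrix and its matching output vector as plain Python lists. The house values, the column order, and the helper name `feature` are all made up for illustration; they do not come from the workshop data.

```python
# Hypothetical data: each row is one house, each column is one feature.
# Columns: [house size (sq ft), backyard size (sq ft), bedrooms, bathrooms]
X = [
    [1400.0, 500.0, 3.0, 2.0],
    [2000.0, 800.0, 4.0, 3.0],
    [900.0, 0.0, 2.0, 1.0],
]

# One price per house, in the same row order as X (made-up values).
y = [240000.0, 355000.0, 150000.0]

m = len(X)     # number of objects (rows)
n = len(X[0])  # number of features (columns)


def feature(i, j):
    """Return x_{i,j} using the 1-based indices from the text."""
    # Python lists are 0-indexed, so shift both indices by one.
    return X[i - 1][j - 1]


print(m, n)           # dimensions of the feature matrix
print(feature(2, 3))  # number of bedrooms in the second house
```

Keeping the rows of `X` and the entries of `y` in the same order is what ties each description to its price.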

To restate the idea of almost-oracles more mathematically: machine learning is the process of deriving a transformation that maps the $m \times n$ input matrix to the $m \times 1$ output vector.

## Define Incorrectness

Let's take a quick step back. You will ask the model to train itself so that it provides increasingly correct answers as outputs. How does the model know which adjustments make it more correct?

Essentially, it knows because you have a definition of "degree of incorrectness". In machine learning, this is referred to as a *cost function*.

Intuitively, the cost function gives the model a way to detect whether it got better or worse. You take a variety of considerations into account when saying one model is better than another, but one of the more obvious elements of the cost function is the difference between the model's guess $\hat{y_i}$ and the true answer $y_i$.

The notion of a guess can vary between machine learning models. Some guess at a number in the range $(-\infty, +\infty)$. Some guess in a more bounded range, such as $[0, 1]$ or $[-1, 1]$. Still others might have multiple outputs.
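To make "degree of incorrectness" concrete, here is a small sketch of three per-example measures that this lesson names elsewhere (zero-one, absolute difference, squared difference), each summed over all examples to give a cost. The function names and the sample numbers are assumptions chosen for illustration.

```python
def zero_one_loss(y_true, y_guess):
    """Wrong is wrong: every incorrect guess costs exactly 1."""
    return 0.0 if y_guess == y_true else 1.0


def absolute_loss(y_true, y_guess):
    """Cost grows in proportion to the distance from the true answer."""
    return abs(y_true - y_guess)


def squared_loss(y_true, y_guess):
    """Cost grows with the square of the distance, so big misses dominate."""
    return (y_true - y_guess) ** 2


def total_cost(loss, ys, guesses):
    """Sum a per-example loss over every (true answer, guess) pair."""
    return sum(loss(y, g) for y, g in zip(ys, guesses))


# Made-up true answers and model guesses.
ys = [3.0, 5.0, 8.0]
guesses = [3.0, 4.0, 10.0]

for loss in (zero_one_loss, absolute_loss, squared_loss):
    print(loss.__name__, total_cost(loss, ys, guesses))
```

The same set of guesses scores differently under each definition, which is why the choice of cost function has to be made before you can say one model is better than another.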

<span style="color: crimson">**TODO**: understanding the output variable $(-\infty, +\infty)$, different loss functions (zero-one, absolute difference, square difference)</span>

## Baseline Model

Once you've defined a cost function, the natural next question is: what is the simplest almost-oracle I can think of for this definition of incorrectness?

* [How to Get Baseline Results and Why They Matter](http://machinelearningmastery.com/how-to-get-baseline-results-and-why-they-matter/)

Answering this question gives you a meaningful starting point and it allows you to say whether your self-adjusting almost-oracle is actually any better than the simplest almost-oracle.

For a baseline model, you will often choose an almost-oracle that is equivalent to the central tendency measure connected to your definition of incorrectness.

* [Modes, Medians, Means: A Unifying Perspective](http://www.johnmyleswhite.com/notebook/2013/03/22/modes-medians-and-means-an-unifying-perspective/)

To summarize the article above:

* If your definition of incorrectness is that wrong is wrong (all wrong answers are penalized equally), you would choose the mode of the output values as the baseline model.
* If your definition of incorrectness is absolute difference from the output values, you would choose the median of the output values.
* If your definition of incorrectness is the [sum of squared errors](https://en.wikipedia.org/wiki/Residual_sum_of_squares), the [mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error), or the [root-mean-square error](https://en.wikipedia.org/wiki/Root-mean-square_deviation), you would choose the mean of the output values.
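The three cases above can be checked numerically with a small sketch. It assumes a made-up list of outputs and only tries whole-number constant guesses from 0 to 7, so it is a spot check on one dataset rather than a proof.

```python
import statistics

# Hypothetical observed outputs; chosen so that mode, median, and mean differ.
ys = [1, 1, 2, 4, 7]


def zero_one_cost(c):
    """Number of outputs that the constant guess c gets wrong."""
    return sum(1 for y in ys if y != c)


def absolute_cost(c):
    """Sum of absolute differences between each output and the guess c."""
    return sum(abs(y - c) for y in ys)


def squared_cost(c):
    """Sum of squared differences between each output and the guess c."""
    return sum((y - c) ** 2 for y in ys)


# Try every whole-number constant guess and keep the cheapest one per cost.
candidates = range(0, 8)
best_zero_one = min(candidates, key=zero_one_cost)
best_absolute = min(candidates, key=absolute_cost)
best_squared = min(candidates, key=squared_cost)

print(best_zero_one, statistics.mode(ys))    # cheapest guess vs. mode
print(best_absolute, statistics.median(ys))  # cheapest guess vs. median
print(best_squared, statistics.mean(ys))     # cheapest guess vs. mean
```

A baseline built this way ignores the input features entirely, which is what makes it a useful floor to compare a trained model against.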

### Checkpoint: Create the Baseline Model

## Linear Model Overview

When you choose linear regression as a model, you essentially say that the output variable is a linear combination of the input variables. More explicitly, we have a vector of *weights* with $n + 1$ entries: one weight $\beta_j$ for each of the $n$ features, plus an additional intercept term $\beta_0$:

$$
\begin{bmatrix}
\beta_0 \\
\vdots \\
\beta_n
\end{bmatrix}
$$

If we were to take the numerical value we gave to each feature $x_{i,j}$ and multiply it by the corresponding $\beta_j$ and then sum these values along with the intercept term $\beta_0$, we have an estimate of the result variable $\hat{y_i}$.

$$
\hat{y_i} =
\begin{bmatrix}
1 & x_{i,1} & \cdots & x_{i,n}
\end{bmatrix}
\times
\begin{bmatrix}
\beta_0 \\
\vdots \\
\beta_n
\end{bmatrix}
$$

Choosing a linear regression model is equivalent to saying that we can estimate the output vector by transforming the input matrix (with a leading column of ones for the intercept) with a single matrix multiplication.

$$
\begin{bmatrix}
\hat{y_1} \\
\hat{y_2} \\
\vdots \\
\hat{y_m}
\end{bmatrix}
=
\begin{bmatrix}
1 & x_{1,1} & \cdots & x_{1,n} \\
1 & x_{2,1} & \cdots & x_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{m,1} & \cdots & x_{m,n}
\end{bmatrix}
\times
\begin{bmatrix}
\beta_0 \\
\vdots \\
\beta_n
\end{bmatrix}
$$
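The matrix product above can be sketched in plain Python so that each multiplication and sum is visible. The helper names `add_intercept_column` and `predict`, the feature values, and the weights are all hypothetical; in particular, the weights are picked by hand rather than learned.

```python
def add_intercept_column(X):
    """Prepend a 1 to every row so that beta_0 is multiplied by 1."""
    return [[1.0] + row for row in X]


def predict(X, beta):
    """Return the estimates y_hat, one per row of X.

    beta has one more entry than X has columns: beta[0] is the intercept.
    """
    X1 = add_intercept_column(X)
    # Each estimate is the dot product of one row with the weight vector.
    return [sum(x * b for x, b in zip(row, beta)) for row in X1]


# Made-up inputs: m = 3 objects, n = 2 features.
X = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]
beta = [10.0, 2.0, 0.5]  # [beta_0, beta_1, beta_2], chosen by hand

y_hat = predict(X, beta)
print(y_hat)
```

For the first row, the estimate is $10 \cdot 1 + 2 \cdot 1 + 0.5 \cdot 2 = 13$, which matches the row-by-row formula given earlier.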

## Linear Model Assumptions

The assumptions you make with a linear model are summarized in the following discussion.

* [Linear Models](http://www.stat.berkeley.edu/~aditya/resources/LectureFOUR.pdf)

In order to know that this model is valid, we assume that input variables are non-random, but we accept that the result variables are random. If those terms aren't familiar to you, it's good to do a quick refresher of drawing samples from a probability distribution.

* [Chapter 1 of Introduction to Stochastic Processes](https://www.ma.utexas.edu/users/gordanz/notes/introduction_to_stochastic_processes.pdf)

We also allow the actual $y_i$ to vary from the estimate $\hat{y_i}$ by a white noise error term $\epsilon_i$.

* [White Noise](https://en.wikipedia.org/wiki/White_noise#Mathematical_definitions)

Assuming we satisfy all of the previous assumptions, we would also be making the following claim about $y_i$ values that we have actually observed relative to the $\hat{y_i}$ values that our model estimates.

$$
\begin{bmatrix}
y_1 \\
y_2 \\
\vdots \\
y_m
\end{bmatrix}
=
\begin{bmatrix}
\hat{y_1} \\
\hat{y_2} \\
\vdots \\
\hat{y_m}
\end{bmatrix}
+
\begin{bmatrix}
\epsilon_1 \\
\epsilon_2 \\
\vdots \\
\epsilon_m
\end{bmatrix}
$$
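One way to build intuition for this decomposition is to simulate it. The sketch below assumes a single feature, hand-picked weights, and independent Gaussian draws as the white noise; Gaussian noise is only one possible choice, and every number here is made up.

```python
import random

random.seed(0)  # fixed seed so the simulation is repeatable

# Hypothetical weights and fixed (non-random) inputs.
beta_0, beta_1 = 2.0, 0.5
xs = [float(i) for i in range(1000)]

# The model's estimates depend only on the inputs and the weights.
y_hat = [beta_0 + beta_1 * x for x in xs]

# Assumed white noise: independent draws with mean 0 and constant spread.
eps = [random.gauss(0.0, 1.0) for _ in xs]

# The observed outputs are the estimates plus the noise.
y = [yh + e for yh, e in zip(y_hat, eps)]

# Subtracting the estimates from the observations recovers the noise terms.
residuals = [yi - yh for yi, yh in zip(y, y_hat)]
mean_eps = sum(eps) / len(eps)

print(round(mean_eps, 3))  # should sit near 0 for zero-mean noise
```

Here the inputs `xs` are fixed while `y` changes with every new draw of `eps`, which mirrors the assumption that input variables are non-random and result variables are random.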

## Indicator/Boolean Variables

### Checkpoint: Indicator/Boolean Variables

## Linear Regression, Take 1

$$
y_i = \widehat{\beta_0} + \widehat{\beta_1} x_{i,1} + \widehat{\epsilon_i}
$$
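As a hedged illustration of how to read this equation, the sketch below fits it to a made-up dataset in which the single feature is an indicator variable. It assumes a squared-error cost and uses the standard closed-form least-squares estimates for one feature, which this lesson has not derived; the garage example and all numbers are hypothetical.

```python
# Hypothetical data: x is an indicator (1 = has a garage, 0 = no garage),
# y is the price in thousands. All values are made up for illustration.
x = [0, 0, 0, 1, 1]
y = [100.0, 110.0, 120.0, 150.0, 170.0]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


x_bar, y_bar = mean(x), mean(y)

# Closed-form least-squares estimates for a single feature.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta_1_hat = s_xy / s_xx
beta_0_hat = y_bar - beta_1_hat * x_bar

# Residuals: the part of each observation the fitted line does not explain.
residuals = [yi - (beta_0_hat + beta_1_hat * xi) for xi, yi in zip(x, y)]

# Group means, for comparison with the fitted coefficients.
mean_without = mean([yi for xi, yi in zip(x, y) if xi == 0])
mean_with = mean([yi for xi, yi in zip(x, y) if xi == 1])

print(beta_0_hat, mean_without)            # intercept vs. mean when x = 0
print(beta_1_hat, mean_with - mean_without)  # slope vs. gap between groups
```

With an indicator feature, the fitted intercept lines up with the average output when the indicator is 0, and the fitted slope lines up with the difference between the two group averages.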

## Meaningful Numeric Variables

### Checkpoint: Meaningful Numeric Variables

## Less Meaningful Numeric Variables

### Checkpoint: Less Meaningful Numeric Variables

## Linear Regression, Take 2

$$
y_i = \widehat{\beta_0} + \widehat{\beta_1} x_{i,1} + \dotsb + \widehat{\beta_n} x_{i,n} + \widehat{\epsilon_i}
$$

## Transformed Variables

### Checkpoint: Transformed Variables

## Interaction Terms

### Checkpoint: Interaction Terms

## Linear Regression, Take 3

$$
y_i = \widehat{\beta_0} + \widehat{\beta_1} x_{i,1} + \dotsb + \widehat{\beta_n} x_{i,n} + \widehat{\epsilon_i}
$$

## Text Vectorization

### Checkpoint: Text Vectorization

## Linear Regression, Take 4

$$
y_i = \widehat{\beta_0} + \widehat{\beta_1} x_{i,1} + \dotsb + \widehat{\beta_n} x_{i,n} + \widehat{\epsilon_i}
$$

## Closing Thoughts

Hopefully you have now become curious about linear regression.

You might wonder about simple extensions to linear regression, such as linear spline regression, in which the coefficients change at chosen boundary points.

* [An Introduction to Splines](http://www.statpower.net/Content/313/Lecture%20Notes/Splines.pdf)

You might also wonder why we talked about cost functions at the start and if choosing a different cost function might change the way the regression works.

* [Quantile Regression: An Introduction](http://www.econ.uiuc.edu/~roger/research/intro/rq3.pdf)

It's likely that you've also been wondering about applying transformations of the input and output variables in order to overcome the constraints of the linear relationship between variables.

* [Transformations in Regression](http://people.stern.nyu.edu/jsimonof/classes/2301/pdf/transfrm.pdf)

You can also follow examples that use input types we haven't talked about (such as geospatial data), which will let you apply linear regression to other problems that involve predicting a continuous variable with range $(-\infty, +\infty)$.

* [AirBnb Properties in Boston](https://github.com/ResidentMario/boston-airbnb-geo/blob/master/notebooks/boston-airbnb-geo.ipynb)

However, before we let you get into any of that, the next important thing we want you to understand is the math behind the cost function(s) that might be applied when performing linear regression. That will be our next lesson.

* [Usual assumptions for linear regression](http://stats.stackexchange.com/questions/16381/what-is-a-complete-list-of-the-usual-assumptions-for-linear-regression)
* Convexity of cost functions
* Stochastic gradient descent
* Different interpretations of distance (euclidean, etc.)
* Regularization

## Additional Resources