# 1.3 Training, Validating, and Testing a Model

So, what is *Training*? I've mentioned it a few times over the past few sections, but have glossed over what it is and how it works. Maybe you also noticed in my code some mention of a "Test set". What is *Testing*? 

We'll go over Training, Testing, and a tangentially related concept, Validation, in this section.

### 1.3.1 Understanding and Defining Training

Here I will explain the mathematical concepts of training through the example of a simple Machine Learning algorithm: the Linear Regression. We've encountered this algorithm before in section 1.1, but here I'll go in depth on how it works, and define some key training concepts along the way. You'll get a general sense of how training works for most machine learning algorithms by understanding how it works for this fundamental algorithm.

It's probably best that we return to the definition of a Machine Learning algorithm from Chapter 1.1: a set of instructions, which can change and improve with experience, and that a computer can follow to solve a problem. **Training** is the process in which an algorithm changes and improves, given some experience in the form of data. Most models have mathematical terms called **parameters** that are tuned (fit) during training to follow the trends of the dataset.

Remember the familiar formula for the Linear Regression that I showed you in the first section, y = m * x + b? Here m is the slope of the line, and b is the y-intercept. These are model parameters! So the question of how training works is really the question of how the computer figures out values for m and b.
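To make "parameters" concrete, here is a minimal sketch in Python. The function name `predict` and the specific values of m and b are made up for illustration; the point is only that the same formula gives different predictions depending on which parameter values it holds.

```python
def predict(x, m, b):
    """Evaluate the line y = m * x + b for a single input x."""
    return m * x + b

xs = [0.0, 1.0, 2.0]

# Two hypothetical parameter settings: training is the search for a good one.
guess_a = [predict(x, m=1.0, b=0.0) for x in xs]
guess_b = [predict(x, m=2.0, b=1.0) for x in xs]

print(guess_a)
print(guess_b)
```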

#### 1.3.1.1 Linear Algebra and Notation Overview: Linear Regression

If you hate math, bear with this section for a bit. This stuff is kinda necessary if you wanna understand the inner workings of training algorithms. That being said, you *could* skip to 1.3.2 if you just want to get to how to train (and test) without the derivations, but you won't understand how the process actually works.

I should start off by mentioning that I lied a little bit. The formula for Linear Regression is a little more complicated than y = m * x + b. While, yes, that is the formula for 1D input data (1 dimensional input), where the input features are a single value, there exists a general formula for all dimensions.

\begin{align}
\ y & = X \beta \\
\end{align}

This equation might look a little scary with all the Greek symbols. But it's actually pretty similar to y = m * x + b - we'll derive that soon. The major difference is the notation: the Greek symbols and the capital versus lowercase letters all have specific meanings, which differ from the meanings of slope (m) and y-intercept (b).

*Typically* x and y are used to describe feature variables and target variables, respectively. The beta variable almost always represents the model's parameters (also called coefficients); its first entry, beta_0, is the one often called the "bias" or intercept.

*Typically* capital letters indicate a **Matrix**, and lowercase letters represent a **Vector**. You may sometimes see a mix of bold and normal font; typically, if some symbols are bolded, then anything *not* bolded is a single value.

A matrix is an array (or grouping) of numbers arranged in rows and columns. Rows typically represent a group of features describing a single data point. Columns are dimensions of the data, with each column representing a different feature. Convention is that rows are counted with the symbol n, and columns are counted with the symbol m. Here's an example:

\begin{align}
\begin{bmatrix}
\ 20 & 130 & 8 \\
\ 35 & 155 & 10 \\
\ 15 & 100 & 7
\end{bmatrix}
\end{align}

Let's pretend here that each row represents a person, and each column represents a feature - in this case the first column is age, the second column is weight and the third column is shoe size. Also in this case, there are n = 3 people, and m = 3 features. This could be an example of an ***X*** matrix, like one that could be used in the above equation, but it's actually typical to pad the left side of ***X*** with 1s - and you'll see why soon.

\begin{align}
\begin{bmatrix}
\ 1 & 20 & 130 & 8 \\
\ 1 & 35 & 155 & 10 \\
\ 1 & 15 & 100 & 7
\end{bmatrix}
\end{align}
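In code, the padding step might look like the sketch below. It assumes NumPy, and the variable names `features` and `ones` are my own choices for illustration.

```python
import numpy as np

# One row per person: age, weight, shoe size
features = np.array([
    [20, 130, 8],
    [35, 155, 10],
    [15, 100, 7],
])

# A column of 1s with one entry per row, stacked on the left of the features
ones = np.ones((features.shape[0], 1))
X = np.hstack([ones, features])

print(X.shape)  # (3, 4)
```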

A vector is an array of numbers with only a single column (or sometimes only a single row). Here, each row corresponds to a single data point. Here's an example:

\begin{align}
\begin{bmatrix}
\ 5.5 \\
\ 5.9 \\
\ 4.8
\end{bmatrix}
\end{align}

Pretending again that each row is a person, we can consider each row to represent a different person's height. In this case, there are also n = 3 people. This is a good example of a ***y*** vector. 

In the above equation, beta is a vector of parameters: one intercept term plus one coefficient per feature, so m + 1 values once ***X*** is padded with 1s.

\begin{align}
\begin{bmatrix}
\ \beta_{0} \\
\ \beta_{1} \\
\ \beta_{2} \\
\ \beta_{3}
\end{bmatrix}
\end{align}

If we were to matrix multiply X and beta together, like below, then the result is a vector of n values, and each value is a sum of m + 1 terms:

\begin{align}
\begin{bmatrix}
\ y_{0} \\
\ y_{1} \\
\ y_{2}
\end{bmatrix} =
\begin{bmatrix}
\ 1 & 20 & 130 & 8 \\
\ 1 & 35 & 155 & 10 \\
\ 1 & 15 & 100 & 7
\end{bmatrix}
\begin{bmatrix}
\ \beta_{0} \\
\ \beta_{1} \\
\ \beta_{2} \\
\ \beta_{3}
\end{bmatrix}
\end{align}

\begin{align}
\begin{bmatrix}
\ y_{0} \\
\ y_{1} \\
\ y_{2}
\end{bmatrix} =
\begin{bmatrix}
\ (\beta_{0} * 1) + (\beta_{1} * 20) + (\beta_{2} * 130) + (\beta_{3} * 8) \\
\ (\beta_{0} * 1) + (\beta_{1} * 35) + (\beta_{2} * 155) + (\beta_{3} * 10) \\
\ (\beta_{0} * 1) + (\beta_{1} * 15) + (\beta_{2} * 100) + (\beta_{3} * 7)
\end{bmatrix}
\end{align}
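The expansion above can be checked numerically. This is a small sketch assuming NumPy; the beta values are hypothetical and chosen only to make the arithmetic concrete. It compares the built-in matrix product with the same sums written out term by term.

```python
import numpy as np

X = np.array([
    [1, 20, 130, 8],
    [1, 35, 155, 10],
    [1, 15, 100, 7],
], dtype=float)

# Hypothetical parameter values, for illustration only
beta = np.array([0.5, 0.1, 0.02, 0.3])

# Matrix-vector product: one output value per row of X
y_matmul = X @ beta

# The same thing by hand: for each row, sum beta_j * x_ij over the columns
y_by_hand = np.array([
    sum(beta[j] * X[i, j] for j in range(X.shape[1]))
    for i in range(X.shape[0])
])

print(y_matmul.shape)  # (3,)
print(np.allclose(y_matmul, y_by_hand))  # True
```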

Remember when I said that the ***X*** matrix typically had 1s padded on the left? That's because it allows the beta_0 value to become the y-intercept! And each feature has its own "regression coefficient", beta_1 through beta_m.

I'll make it a bit clearer... let's start with a simpler ***X*** matrix, with no actual values:

\begin{align}
\begin{bmatrix}
\ 1 & x_{1,1}
\end{bmatrix}
\end{align}

Here, our X matrix has n = 1 row and m = 1 feature, plus the 1 padded column, for 2 columns in total. So, 1 by 2. Let's plug that into the formula.

\begin{align}
\begin{bmatrix}
\ y_{0} 
\end{bmatrix} =
\begin{bmatrix}
\ 1 & x_{1,1}
\end{bmatrix}
\begin{bmatrix}
\ \beta_{0} \\
\ \beta_{1} 
\end{bmatrix}
\end{align}

\begin{align}
\begin{bmatrix}
\ y_{0}
\end{bmatrix} =
\begin{bmatrix}
\ (\beta_{0} * 1) + (\beta_{1} * x_{1,1})
\end{bmatrix}
\end{align}

Let's get rid of some of the scary looking stuff and...

\begin{align}
y = \beta_{0} + (\beta_{1} * x)
\end{align}

\begin{align}
y = b + (m * x)
\end{align}

\begin{align}
y = m * x + b
\end{align}

There we go! We have that formula we saw before - this shows how the general linear regression formula, applied to 1D data, is equal to y = m * x + b.

The question still stands though, how do you get m and b? Or, now that we know the correct notation, how do we get the beta vector?

#### 1.3.1.2 Ordinary Least Squares

Many types of models have formulas like the Linear Regression model does. And these formulas all have parameters that need to be fit. Fitting happens during training. Training is performed through different types of methods, depending on the model's formula and structure. These methods work by minimizing the amount of error the model produces when predicting over a set of data.

Take the formula for Linear Regression:

\begin{align}
\ y & = X \beta \\
\end{align}

Now let's imagine an unfit form of the model, but where we tack on another vector, epsilon, for errors. With epsilon included, the equation holds exactly for every data point, even with incorrectly fit beta values, because epsilon absorbs whatever the model gets wrong. Let's make up some random values for beta and reuse the data points from before:

\begin{align}
\beta =
\begin{bmatrix}
\ 1.5 \\
\ 1.2 \\
\ 0.9 \\
\ 1.1
\end{bmatrix}
\end{align}

\begin{align}
X =
\begin{bmatrix}
\ 1 & 20 & 130 & 8 \\
\ 1 & 35 & 155 & 10 \\
\ 1 & 15 & 100 & 7
\end{bmatrix}
\end{align}

\begin{align}
y =
\begin{bmatrix}
\ 5.5 \\
\ 5.9 \\
\ 4.8
\end{bmatrix}
\end{align}

Now let's plug that into our formula, with the error vector tacked on, and solve for errors.

\begin{align}
\ y & = X \beta + \epsilon \\
\end{align}

\begin{align}
\begin{bmatrix}
\ 5.5 \\
\ 5.9 \\
\ 4.8
\end{bmatrix} =
\begin{bmatrix}
\ 1 & 20 & 130 & 8 \\
\ 1 & 35 & 155 & 10 \\
\ 1 & 15 & 100 & 7
\end{bmatrix}
\begin{bmatrix}
\ 1.5 \\
\ 1.2 \\
\ 0.9 \\
\ 1.1
\end{bmatrix} + 
\begin{bmatrix}
\ \epsilon_{0}  \\
\ \epsilon_{1}  \\
\ \epsilon_{2}
\end{bmatrix}
\end{align}

\begin{align}
\begin{bmatrix}
\ 5.5 \\
\ 5.9 \\
\ 4.8
\end{bmatrix} =
\begin{bmatrix}
\ 151.3 \\
\ 194 \\
\ 117.2 
\end{bmatrix} + 
\begin{bmatrix}
\ \epsilon_{0}  \\
\ \epsilon_{1}  \\
\ \epsilon_{2}
\end{bmatrix}
\end{align}

\begin{align}
\begin{bmatrix}
\ \epsilon_{0}  \\
\ \epsilon_{1}  \\
\ \epsilon_{2}
\end{bmatrix} = 
\begin{bmatrix}
\ -145.8 \\
\ -188.1 \\
\ -112.4
\end{bmatrix}
\end{align}
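The same arithmetic can be reproduced in a few lines. This is a sketch assuming NumPy, using the made-up beta values and the padded ***X*** from above; the names `predictions` and `errors` are my own.

```python
import numpy as np

X = np.array([
    [1, 20, 130, 8],
    [1, 35, 155, 10],
    [1, 15, 100, 7],
], dtype=float)
beta = np.array([1.5, 1.2, 0.9, 1.1])  # the made-up, unfit values
y = np.array([5.5, 5.9, 4.8])          # true heights

# What the unfit model predicts: roughly 151.3, 194.0, 117.2
predictions = X @ beta

# epsilon = y - X * beta: roughly -145.8, -188.1, -112.4
errors = y - predictions

print(np.round(predictions, 1))
print(np.round(errors, 1))
```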

The errors for each row (data point) were all pretty large and negative. Remember, we are calculating height with this Linear Regression model. The original values from X * beta were in the 100s. No one is 100 feet tall. Our error values quantitatively show us how far off we were.

That quantitative measure is really helpful actually. It's our starting point for training the Linear Regression. In an ideal scenario, the errors that we calculate for *any* point would be very close to 0, not in the 100s. Our goal is to find values of beta such that the errors are minimized.

Let's solve for error symbolically:

\begin{align}
\ y & = X \beta + \epsilon \\
\end{align}


\begin{align}
\ y - X \beta & = \epsilon \\
\end{align}

Remember that our errors can be negative, and what we want is errors as close to 0 as possible in either direction. One way to handle this is to square the error terms, making them all positive. There is also a more complicated motive for this, described by the **Normal Equations**. Perhaps I will write a whole Linear Algebra script at some point, but for now, let me point you to some resources that explain the Normal Equations and their relevance in training: https://www.youtube.com/watch?v=3g-e2aiRfbU and https://www.youtube.com/watch?v=xVgqM35YSDY.

Let's write the sum of squared errors as a function of the choice of beta, where x_i is the i-th row of ***X***:

\begin{align}
S(\beta) = \sum_{i=1}^n \epsilon_i^2 = \sum_{i=1}^n (y_i - x_i \beta)^2 \\
\end{align}
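As a function in code, the sum of squared errors might look like the sketch below. It assumes NumPy, and the name `sum_squared_errors` is my own; it is evaluated here on the made-up beta values from earlier.

```python
import numpy as np

def sum_squared_errors(X, y, beta):
    """Return the sum over all rows of (y_i - x_i . beta)^2."""
    residuals = y - X @ beta
    return float(residuals @ residuals)

X = np.array([
    [1, 20, 130, 8],
    [1, 35, 155, 10],
    [1, 15, 100, 7],
], dtype=float)
y = np.array([5.5, 5.9, 4.8])

# The made-up beta from earlier gives a very large total squared error
bad_beta = np.array([1.5, 1.2, 0.9, 1.1])
print(sum_squared_errors(X, y, bad_beta))
```

Training then amounts to searching for the beta that makes this number as small as possible.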

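And we can take this a step further. In matrix form, the sum of squared errors is:

\begin{align}
S(\beta) = \sum_{i=1}^n (y_i - x_i \beta)^2 = (y-X\beta)^\mathrm{T}(y-X\beta) \\
\end{align}

Setting the derivative of S with respect to beta to zero reveals the **Normal Equation** (well, one form of it), and solving it for beta gives the value that minimizes the error:

\begin{align}
X^\mathrm{T}X\beta = X^\mathrm{T}y \quad \Rightarrow \quad \beta = (X^\mathrm{T}X)^{-1}X^\mathrm{T}y \\
\end{align}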

And that's the equation you use to find beta! You might find that to be a bit of a jump. If that is the case, check out the references that I linked above.

*Note: There are some metrics, such as **R squared**, that are commonly used to understand how well beta was fit that you may want to know about. More on that here: https://en.wikipedia.org/wiki/Ordinary_least_squares*

Let's jump into Python and try to train our model using this formula.
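If the jump feels too big, here is a sketch of the steps in between. It assumes that X^T X can be inverted, which requires (among other things) at least as many data points as parameters. Expand the squared error, take the derivative with respect to beta, set it to zero, and solve:

\begin{align}
S(\beta) & = (y - X\beta)^\mathrm{T}(y - X\beta) \\
& = y^\mathrm{T}y - 2\beta^\mathrm{T}X^\mathrm{T}y + \beta^\mathrm{T}X^\mathrm{T}X\beta \\
\frac{\partial S}{\partial \beta} & = -2X^\mathrm{T}y + 2X^\mathrm{T}X\beta = 0 \\
X^\mathrm{T}X\beta & = X^\mathrm{T}y \\
\beta & = (X^\mathrm{T}X)^{-1}X^\mathrm{T}y
\end{align}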

In [4]:
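import numpy as np

# Targets (heights) and the padded feature matrix (1s, age, weight, shoe size)
y = np.array([5.5, 5.9, 4.8])
X = np.array([[1,20,130,8],[1,35,155,10],[1,15,100,7]])

# beta = (X^T X)^-1 X^T y
# Caution: X has 3 rows but 4 columns, so X^T X is singular (rank 3 at most)
# and inv() is numerically unreliable here; np.linalg.pinv or
# np.linalg.lstsq is the safer choice on a dataset this small.
X_plus = np.linalg.inv(np.dot(X.T,X))
X_star = np.dot(X_plus,X.T)
beta = np.dot(X_star,y)
print("B0 = ", beta[0])
print("B1 = ", beta[1])
print("B2 = ", beta[2])
print("B3 = ", beta[3])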

B0 =  6.9
B1 =  -0.49296874999999996
B2 =  -0.04121093749999999
B3 =  1.9249999999999998


Now we have some actual values for beta! Let's test it out by calculating the errors:

\begin{align}
\ y & = X \beta + \epsilon \\
\end{align}

\begin{align}
\begin{bmatrix}
\ 5.5 \\
\ 5.9 \\
\ 4.8
\end{bmatrix} =
\begin{bmatrix}
\ 1 & 20 & 130 & 8 \\
\ 1 & 35 & 155 & 10 \\
\ 1 & 15 & 100 & 7
\end{bmatrix}
\begin{bmatrix}
\ 6.9 \\
\ -0.49 \\
\ -0.04 \\
\ 1.92
\end{bmatrix} + 
\begin{bmatrix}
\ \epsilon_{0}  \\
\ \epsilon_{1}  \\
\ \epsilon_{2}
\end{bmatrix}
\end{align}

In [6]:
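import numpy as np

# Same targets and padded feature matrix as before
y = np.array([5.5, 5.9, 4.8])
X = np.array([[1,20,130,8],[1,35,155,10],[1,15,100,7]])
# Fitted beta values from above, rounded to two decimals
beta = np.array([6.9,-.49,-.04,1.92])

# Errors are the true targets minus the model's predictions
errors = y - (np.dot(X,beta))

print("E0 = ", errors[0])
print("E1 = ", errors[1])
print("E2 = ", errors[2])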

E0 =  -1.7599999999999998
E1 =  3.1499999999999986
E2 =  -4.19


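That's so much better than the error we got before! The errors still aren't close to 0, though. With only 3 data points and 4 parameters, X^T X can't truly be inverted, so this beta is not the best possible fit. With many more data points than parameters, the formula is far better behaved.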
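On a dataset this tiny, one way to sidestep the matrix inversion is a dedicated least-squares solver. The sketch below assumes NumPy's `np.linalg.lstsq`, which returns a minimum-norm least-squares solution when there are fewer data points than parameters. It is an alternative to the formula above, not what the earlier cell does.

```python
import numpy as np

y = np.array([5.5, 5.9, 4.8])
X = np.array([
    [1, 20, 130, 8],
    [1, 35, 155, 10],
    [1, 15, 100, 7],
], dtype=float)

# lstsq returns the solution, the residual sums, the rank of X, and its singular values
beta, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

# Errors are the true targets minus the predictions, as before
errors = y - X @ beta

print("rank of X:", rank)
print("largest absolute error:", np.max(np.abs(errors)))
```

With 3 independent rows and 4 free parameters, a beta that fits these 3 points essentially exactly exists, so the errors here come out at the level of floating-point noise. That is a sign of too little data, not of a good model.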

So, for Linear Regression, the typical approach to fitting/training a model is to solve that equation over a large dataset. The matrix multiplication stuff happens in the background when you fit a Linear Regression model using some Python package.


To recap, this is just how to solve one type of model: a Linear Regression model. There was a function describing the model, parameters we wanted to fit to reduce the amount of error produced by the function, and an approach we used to solve the problem of fitting these parameters. That's generally true of most Machine Learning algorithms - there's a function to describe how they operate, parameters to fit, and an approach to minimizing the error. Chapter 2 will be a survey of Machine Learning algorithms, so you'll see the different parameters of many different functions, and the different approaches to minimizing error. Some approaches are not so clean cut, and are iterative processes that take time to perform. There's actually an iterative process for Linear Regression too: IRLS. You can read about that here if you'd like: https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares. Otherwise, we will touch on iterative methods later. For now, just know that training is the process of fitting model parameters to minimize the error of the model.

### 1.3.2 Understanding and Defining Testing

The story doesn't end once you've trained your model (i.e. fit parameters of the model to a specific dataset). How do you know your model is any good? By that I mean, how do you know your model will actually give you a good prediction on data that it hasn't seen before?

**Testing** is the process of determining the efficacy of a model in an unbiased way. The idea is that we can use our trained model to predict target values on a set of data for which we already have the real target values.

Imagine we have a dataset of the temperature of each day in 2019 in New York City. In this dataset we have many variables for each day, including the date, the earth's tilt towards the sun, the cloud coverage, and a bunch of other possibly relevant stuff. We also have access to the target variable, temperature. We could train an algorithm on a random two thirds of the dataset, mapping the variables to the target value. Now we have one third of the data left over, and we can use this data to *test* the trained model. We can push the variables we have for X through our trained model, and see what the model predicts for y, the target variable. Then we can evaluate how close the predictions were to the actual values, which we have access to.
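Here is a rough sketch of that two-thirds / one-third idea. It assumes NumPy and uses synthetic stand-in data rather than real weather records: the features, the "true" coefficients and the noise level are all made up so that the example is self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 300 "days", 2 made-up features, linear target plus noise
n_days = 300
features = rng.normal(size=(n_days, 2))
true_beta = np.array([50.0, 10.0, -3.0])  # made-up intercept and coefficients
X = np.hstack([np.ones((n_days, 1)), features])
y = X @ true_beta + rng.normal(scale=1.0, size=n_days)

# Shuffle the row indices, then take two thirds for training and the rest for testing
shuffled = rng.permutation(n_days)
n_train = (2 * n_days) // 3
train_idx, test_idx = shuffled[:n_train], shuffled[n_train:]

# Train: fit beta using the training rows only
beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)

# Test: predict on rows the model has never seen and measure the error
test_errors = y[test_idx] - X[test_idx] @ beta
test_mse = float(np.mean(test_errors ** 2))
print("test mean squared error:", test_mse)
```

Because the test rows played no part in fitting beta, the error measured on them is an honest estimate of how the model behaves on new data.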

### 1.3.3 How to Practically Train and Test