# Assessing a learning algorithm

## What happens as we vary K in KNN?

- For K = N, every query returns the average of all the training values, so the model is just a constant (a horizontal line).
- For K = 1, every query returns the value of its single nearest neighbor, so the model passes through every training point; in other words, it works too hard to fit every data point.

### In KNN, as we increase K are we more likely to overfit?
- No. With K = 1 we overfit; with K = N we get the same constant for every query. So higher values of K make overfitting less likely, not more.

### Example:
![KNN Eg](./images/3-3_1_KNN.png)


- Notice that beyond the boundary of the data, the line is constant, because the nearest neighbors no longer change. We cannot extrapolate with KNN.
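
The three behaviours above (K = 1, K = N, and no extrapolation) can be checked with a small sketch. This is a minimal, assumed implementation for one-dimensional x; the function name `knn_predict` and the toy data are made up for illustration and are not from the lecture.

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k):
    """Predict y for each query as the mean y of its k nearest training points (1-D x)."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.atleast_1d(np.asarray(x_query, dtype=float))
    # Distance from every query to every training point: shape (n_query, n_train)
    dist = np.abs(x_query[:, None] - x_train[None, :])
    # Indices of the k closest training points for each query
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

# Hypothetical toy data
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

# K = 1: querying the training points reproduces the training values exactly
pred_k1 = knn_predict(x_train, y_train, x_train, k=1)

# K = N: every query gets the mean of all training values (3.0 here)
pred_kn = knn_predict(x_train, y_train, [0.5, 2.5, 10.0], k=len(x_train))

# Beyond the data boundary the neighbors never change, so the prediction is flat
pred_far = knn_predict(x_train, y_train, [10.0, 100.0], k=2)
```

With K = 2, both far-away queries use the training points at x = 3 and x = 4, so they get the same prediction no matter how far out they are.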

## Considering a parametric model (polynomial model)

## Suppose we are using a polynomial of degree d.

- Match Q1 with model on side
- Answer Q2

![Parametric Eg](./images/3-3_2_Parameteric.png)

### Q1:
- For d = 1, it is a linear model (the equation of a line, such as y = mx + c), and it matches graph c
- For d = 2, it is a quadratic equation of the form y = ax<sup>2</sup> + bx + c. Graph a matches this
- For d = 3, it has to be b

### Q2:
- In the graphs above, going from d = 1 to d = 3 the curve passes through more and more of the data points. Therefore, as the order of the polynomial increases, we are more likely to hit every data point and therefore to overfit

### Beyond boundaries (or edges of data):
- Using the parametric model, we can extrapolate in the direction the data seems to be going, which we couldn't do with KNN
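
As a rough illustration of extrapolation with a parametric model, the sketch below fits a degree-1 polynomial with `np.polyfit` and evaluates it well outside the training range. The data is made up and exactly linear, which is an assumption chosen so the result is easy to verify by hand.

```python
import numpy as np

# Hypothetical training data lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Fit a polynomial of degree d = 1; coefficients come back highest power first
coeffs = np.polyfit(x, y, deg=1)   # approximately [2.0, 1.0]
model = np.poly1d(coeffs)

# Query far beyond the edge of the data: the line keeps following the trend
y_beyond = model(10.0)             # approximately 21.0
```

Whether the extrapolated value is any good still depends on the trend continuing outside the data; the model only assumes that it does.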

## Metric 1: RMS Error

- Standard way to measure error
<br/> <br />
![RMS-1](./images/3-3_3_RMS_1.png)
<br/> <br />  
- Suppose we use the training data (green dots in the plot) to build the model, say a linear model. We can evaluate the model at each real data point and measure the difference between the y value of the data point and the model's value; this difference is the error. We get an error at every single data point, as in the image below
<br/> <br />
![RMS-2](./images/3-3_3_RMS_2.png)
<br/> <br />

- To measure RMS error, take the error at each data point, square every error, sum the squares, and divide by the number of points to average them. Finally, take the square root of the result.
- This is roughly an average of the errors, but because of the squaring we end up emphasizing larger errors
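
The recipe above is RMSE = sqrt(sum((y<sub>i</sub> - ŷ<sub>i</sub>)<sup>2</sup>) / n). A minimal sketch in NumPy follows; the helper name `rmse` and the sample values are illustrative assumptions. The mean absolute error is computed next to it only to show how squaring weights the one large error more heavily.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between real values and model predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    # square -> average -> square root
    return np.sqrt(np.mean(errors ** 2))

# Hypothetical values: three perfect predictions and one that is off by 4
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 8.0]

value = rmse(y_true, y_pred)                                  # sqrt(16 / 4) = 2.0
mean_abs = np.mean(np.abs(np.array(y_true) - np.array(y_pred)))  # 4 / 4 = 1.0
```

Here the plain average of absolute errors is 1.0 while the RMSE is 2.0, which is the "emphasizing larger errors" effect.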

## In Sample vs Out of Sample

- On the sample data (training data) we can build a model that fits the training set exactly
- But the more important question is "What is our error out of sample?"
  - Out of sample means we train on the training set but test on a separate set of data, the testing set
  - To measure our out-of-sample error, we measure on the testing set (not on the training set)
- In the graph below, take the blue data points and plug them into our RMSE equation
<br/><br/>

![OOS](./images/3-3_4_Out_of_Sample.png)

## Quiz?

Q: Which error would you expect to be larger?
<br/>
[A] in sample error (training set)
<br/>
[B] out of sample error (test set)


=> B

## Cross Validation

- When evaluating a learning algorithm, researchers split their data into 2 chunks: training (60% of the data set) and testing (40% of the data set)
- You train on the training chunk and then test on the testing chunk; this is one trial. In many cases that's enough: you measure your RMS error, that is the assessment of your algorithm, and you can compare it against another algorithm
<br/><br/>
![CV](./images/3-3_5-CV_1.png)

<br/>
- However, sometimes the `problem encountered is that researchers do not have enough data` to analyze their algorithm. In such cases, they can effectively create more data by slicing it up and running more trials.
- One approach is to slice the data into 5 chunks and use 80% of it for training and the remaining 20% for testing. Suppose we use the first 4 chunks for training and the last one for testing; this is one trial.
- Then we can switch things up: say we reserve the first chunk for testing and use the last four for training; this is another trial
- We can switch things up again, and so on. With this slicing into 5 chunks, we can get 5 different trials out of this one data set.
<br/><br/>

![CV](./images/3-3_5-CV_2.png)
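
The 5-chunk slicing can be sketched as an index generator. This is a minimal illustration under assumptions: the function name `k_fold_indices` is made up, chunks are contiguous, and no shuffling is done.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_idx, test_idx) pairs, one trial per chunk held out for testing."""
    indices = np.arange(n_samples)
    # Split the row indices into n_folds contiguous chunks
    chunks = np.array_split(indices, n_folds)
    for i in range(n_folds):
        test_idx = chunks[i]
        # Every chunk except the i-th is used for training
        train_idx = np.concatenate([c for j, c in enumerate(chunks) if j != i])
        yield train_idx, test_idx

# 10 rows, 5 chunks: each trial trains on 8 rows (80%) and tests on 2 rows (20%)
trials = list(k_fold_indices(10, n_folds=5))
```

Each trial would then train a model on `train_idx`, compute the RMS error on `test_idx`, and the 5 results can be compared or averaged.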

## Roll forward cross validation:

- Cross validation is a great tool, but it doesn't fit financial data applications well because it can permit peeking into the future. For instance, if our training slice comes after the test slice, we are peeking into the future relative to our test, which can produce unrealistically optimistic results.
- With this sort of financial data we need to avoid peeking; one way to avoid it is roll forward cross validation, which means the training data always comes before the test data. Even in this case we can have multiple trials by rolling our data forward until we run out of data
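
One possible way to generate roll forward splits is sketched below. The function name `roll_forward_splits`, the fixed window sizes, and stepping forward by one test window are all assumptions; the only property taken from the notes is that training data always comes before test data.

```python
def roll_forward_splits(n_samples, train_size, test_size):
    """Yield (train_idx, test_idx) pairs where training rows always precede test rows."""
    start = 0
    # Keep rolling forward until a full train + test window no longer fits
    while start + train_size + test_size <= n_samples:
        train_idx = list(range(start, start + train_size))
        test_idx = list(range(start + train_size, start + train_size + test_size))
        yield train_idx, test_idx
        start += test_size  # assumed step: advance by one test window

# 10 time-ordered rows, train on 4, test on the next 2
splits = list(roll_forward_splits(10, train_size=4, test_size=2))
# Trial 1: train rows 0-3, test rows 4-5
# Trial 2: train rows 2-5, test rows 6-7
# Trial 3: train rows 4-7, test rows 8-9
```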

<img alt="3-3_5-RF_1.png" src="images/3-3_5-RF_1.png" width="800">


<img alt="3-3_5-RF_2.png" src="images/3-3_5-RF_2.png" height="300" width="150">
<img alt="3-3_5-RF_3.png" src="images/3-3_5-RF_3.png" height="300" width="150">
<img alt="3-3_5-RF_4.png" src="images/3-3_5-RF_4.png" height="300" width="150">

## Metric 2: Correlation

- Another way to visualize and evaluate the accuracy of a regression algorithm is to look at the relationship between the predicted and the actual values of our dependent variable y
- Y<sub>predict</sub> = result from our model for a given X<sub>test</sub>
- The real value of y is Y<sub>test</sub>
- Now draw a scatter plot of Y<sub>predict</sub> against Y<sub>test</sub> and fit a line
- Correlation: are the data points close to the line?
- We can measure this property quantitatively using correlation.
- `NumPy` provides the function `np.corrcoef`, which returns a correlation matrix; the off-diagonal entry is the correlation between the two inputs and ranges from -1 to +1. Values towards +1 mean positively correlated; values towards -1 mean inversely correlated; values towards 0 mean little or no linear correlation

![corr1](./images/3-3_6-corr_1.png)
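
A short sketch of the measurement with `np.corrcoef`, using made-up values for Y<sub>test</sub> and Y<sub>predict</sub>. The last line scales the predictions by 10 to illustrate that correlation is not the slope of the fitted line: the slope changes, the correlation does not.

```python
import numpy as np

# Hypothetical real values and model predictions
y_test = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_predict = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation
corr_matrix = np.corrcoef(y_predict, y_test)
corr = corr_matrix[0, 1]

# Perfectly linear but with slope 10: correlation is still 1
corr_scaled = np.corrcoef(10.0 * y_test, y_test)[0, 1]
```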

Correlation is not a slope. See this awesome 2 part video if you have no idea what correlation means: <br/>
- https://www.youtube.com/watch?v=qtaqvPAeEJY
- https://www.youtube.com/watch?v=xZ_z8KWkhXE

## Question: Correlation and RMSError

![Quiz](./images/3-3_7_corr_rmse_quiz.png)

-> In general, an increase in RMS error means the predicted values are further off from the real values; hence, correlation decreases

(But it is also possible to construct examples where correlation increases as RMSE increases, so Option 3 is also OK to pick)

## Overfitting

- Let's create multiple polynomial models, increasing the degree by 1 each time; so we start from d = 1 and go to 2, 3, 4 and so on
- Let's graph this with the degree of the polynomial (or degrees of freedom) on the x axis and the error of our model on the y axis.
- Graph for in sample error and out of sample error: <br/>  <br/>
![OF](./images/3-3_8-overfitting.png)

<br/>

- For in-sample error: as our degree increases, our error drops
- For out-of-sample error: as our degree increases, our error drops at first, but eventually we reach a point where it starts to increase again (and it may increase strongly).
- The region where in-sample error is decreasing but out-of-sample error is increasing is the region where we start `overfitting`
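
The degree sweep can be sketched as follows. Everything here is an assumption for illustration: the data is synthetic (a noisy sine curve), the train/test split simply interleaves points, and `rmse` is a local helper. On least-squares polynomial fits the in-sample error should not go up as the degree grows; whether and where the out-of-sample error turns upward depends on the data and the noise.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# Synthetic data: a smooth curve plus noise (fixed seed for repeatability)
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 40)
y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=x.size)

# Assumed split: even-indexed points for training, odd-indexed for testing
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

degrees = list(range(1, 9))
in_sample, out_sample = [], []
for d in degrees:
    model = np.poly1d(np.polyfit(x_train, y_train, deg=d))
    in_sample.append(rmse(y_train, model(x_train)))   # error on training set
    out_sample.append(rmse(y_test, model(x_test)))    # error on testing set

for d, e_in, e_out in zip(degrees, in_sample, out_sample):
    print(f"d={d}  in-sample={e_in:.3f}  out-of-sample={e_out:.3f}")
```

Plotting `in_sample` and `out_sample` against `degrees` gives the kind of graph shown in the image above.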

## KNN Overfitting

![KNN Overfitting Quiz](./images/3-3_9_quiz.png)

### My thoughts
In sample:
- For K = 1, it fits every data point, so error is low
- For K = N, it's a constant and misses by a lot, so error is high
- Starts low and ends high

Out of sample: hmm...
- For K = 1, it describes the training data set perfectly, which leaves little hope for data it hasn't seen; so error must be high
- For K = N, it is still a constant and must miss a ton
- Starts high, ends high

=> b

### Given explanation
When k = 1, the model fits the training data perfectly, therefore in-sample error is low (ideally, zero). Out-of-sample error can be quite high.

As k increases, the model becomes more generalized, thus out-of-sample error decreases at the cost of slightly increasing in-sample error.

After a certain point, the model becomes too general and starts performing worse on both training and test data.

## A few other considerations for evaluating a learning algorithm

Which algorithm is better - Linear Regression or KNN?

![qz](./images/3-3_10_quiz.png)