# Regression

Notes on supervised regression learning.

Goal: predict a numerical output from a set of numerical inputs.

## Parametric Regression:

- A way of building a model by representing it with a small number of parameters
- The classic approach is to fit a line to the data; this is called linear regression
  <img alt="3-2_1.png" src="images/3-2_1.png" width="500">
- Equation of a line: y = mx + b:
  - x is the value on one axis (the input) and y is the value on the other axis (the output)
  - <i>m & b are the parameters of our model</i>
  - our model can be fully described by these two parameters
  - To predict: plug an x value into the equation together with the model parameters m & b. The result is our estimate of y

- `Linear regression` is the procedure that finds m & b from the data
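
As a minimal sketch of the points above, the snippet below fits a line and then predicts from m and b alone. It assumes NumPy is available and uses `np.polyfit` with degree 1 as the least-squares line fit; the data values and the `predict` helper are made up for illustration.

```python
import numpy as np

# Made-up data lying exactly on y = 2x + 1 (illustrative values only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

# A degree-1 polyfit is a least-squares line fit; it returns [m, b]
m, b = np.polyfit(x, y, 1)


def predict(x_new):
    """Estimate y by plugging x_new into y = m * x + b."""
    return m * x_new + b


print(m, b, predict(10.0))
```

Here the two numbers `m` and `b` are the whole model; `predict` never touches the original data.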
  

### Polynomial model
- For the data above we can also try to fit a polynomial instead of a line
  <img alt="3-2_2.png" src="images/3-2_2.png" width="500">
- The more complex the model, the more parameters it has

- Once we have our parameters, the model is fully captured by them. We can then `throw away the data`, because the `model can be represented by the parameters!`
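
A short sketch of the same idea for a polynomial, assuming NumPy: `np.polyfit` with degree 2 returns three coefficients, and `np.polyval` evaluates the fitted polynomial from those coefficients only. The quadratic data here is invented for illustration.

```python
import numpy as np

# Made-up data generated from y = x^2 - 3x + 2 (illustrative values only)
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x**2 - 3.0 * x + 2.0

# Degree-2 fit: three parameters, highest power first
coeffs = np.polyfit(x, y, 2)

# The training data is no longer needed once the parameters are known
del x, y

# Predict at x = 3 using only the parameters
y_hat = np.polyval(coeffs, 3.0)
print(coeffs, y_hat)
```

The line model had two parameters; the quadratic has three, and each extra degree adds one more.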


## K Nearest neighbor (KNN) [Non-Parametric]:

- Another approach to regression
- A data-centric (instance-based) approach: we keep the data, unlike the parametric approach
- Suppose K = 3: for a query, find the 3 historical data points nearest to the query (hence "nearest neighbors") and take the mean of their y values as the estimate. Each of these data points contributes equally
- Can follow data of any shape, whether a line or a complex curve, just by tracking the mean of the neighbors
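
The K = 3 procedure above can be sketched as follows. This is an illustrative version that assumes NumPy, a single 1-D input feature, and absolute difference as the distance; `knn_predict` and the data values are hypothetical.

```python
import numpy as np


def knn_predict(x_train, y_train, x_query, k=3):
    """Estimate y at x_query as the mean y of the k nearest training points."""
    distances = np.abs(x_train - x_query)   # distance from the query to every point
    nearest = np.argsort(distances)[:k]     # indices of the k closest points
    return y_train[nearest].mean()          # each neighbor contributes equally


# Made-up historical data (illustrative values only)
x_train = np.array([1.0, 2.0, 3.0, 6.0, 7.0])
y_train = np.array([10.0, 20.0, 30.0, 60.0, 70.0])

estimate = knn_predict(x_train, y_train, x_query=2.2, k=3)
print(estimate)
```

For the query 2.2 the three nearest x values are 2, 3 and 1, so the estimate is the mean of 20, 30 and 10.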

## Kernel regression [Non-Parametric]:
- Like KNN, it keeps the data around
- Similar to KNN, except that kernel regression weights the contribution of each nearby data point according to its distance from the query (whereas in KNN all K neighbors carry the same weight)
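
One possible sketch of distance weighting, assuming NumPy and a 1-D input. The Gaussian kernel and the `bandwidth` parameter are assumptions chosen for illustration, not something these notes specify. This version weights every stored point, so distant points get weights close to zero rather than being cut off at K.

```python
import numpy as np


def kernel_predict(x_train, y_train, x_query, bandwidth=1.0):
    """Distance-weighted average of y_train (Gaussian kernel, an assumed choice)."""
    distances = np.abs(x_train - x_query)
    # Closer points get larger weights; the weight decays smoothly with distance
    weights = np.exp(-0.5 * (distances / bandwidth) ** 2)
    return np.sum(weights * y_train) / np.sum(weights)


# Made-up data (illustrative values only)
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([10.0, 20.0, 30.0])

print(kernel_predict(x_train, y_train, 2.0))  # symmetric neighbors around the query
print(kernel_predict(x_train, y_train, 1.0))  # pulled toward the nearest point's y
```

A smaller `bandwidth` makes the estimate follow the closest points more tightly; a larger one smooths over more of the data.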


## Parametric or non-parametric?

- Is the problem mathematically well defined, so that it can be modelled with an equation?
  - YES: Parametric
  - NO: Non-Parametric
    - Overall, if you have no guess as to what the mathematical model might be, start with a non-parametric learner because it can fit data of any shape

## Pros and Cons of these two approaches:

- For a parametric model we don't have to store the original data, so it is space efficient. However, we can't easily update the model as more data is gathered; usually we have to re-run the whole algorithm to update it. Thus for parametric models training is slow but querying is fast.

- For an instance-based model, we have to store all the data points, so it is hard to apply to huge data sets. However, new evidence can be added easily since no parameters need to be learned: adding data points takes almost no extra time, which makes training fast. Querying, however, is slow. Because non-parametric methods avoid assuming a particular type of model, be it linear, quadratic, or so on, they are suitable for fitting complex patterns where we don't really know what the underlying model looks like.

## Training and Testing

### Out of sample testing:
- The procedure of separating the data into training and testing sets
- Input to the model during training is the training data, x & y
- Then test the accuracy of the model by querying it with the test x data; out comes some predicted y. The question is: how close is the model's output to the actual test y?
- For time series data, train on earlier data and test on later data
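
The procedure above might look like this in practice. The sketch assumes NumPy, rows already sorted by time, and RMSE as the measure of "how close" (one common choice, not one prescribed by these notes). `time_ordered_split` and `rmse` are hypothetical helper names.

```python
import numpy as np


def time_ordered_split(x, y, train_fraction=0.5):
    """Split so that the earlier rows are for training and the later rows for testing."""
    split = int(len(x) * train_fraction)
    return x[:split], y[:split], x[split:], y[split:]


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up time-ordered data on the line y = 2x + 1 (illustrative values only)
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

x_train, y_train, x_test, y_test = time_ordered_split(x, y)

# Train on the earlier half only, then score on the later half
m, b = np.polyfit(x_train, y_train, 1)
error = rmse(y_test, m * x_test + b)
print(error)
```

The model never sees `x_test` or `y_test` during fitting, so `error` reflects out-of-sample performance.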

## API Interface ideas:

### For Linear Regression:

```
learner = LinRegLearner()
learner.train(xtrain, ytrain)
y = learner.query(xtest)  # the y that we compare with the actual y to evaluate our model
```

### For KNN

```
learner = KNNLearner(k=3)
learner.train(xtrain, ytrain)
y = learner.query(xtest)  # the y that we compare with the actual y to evaluate our model
```

Both learners have the same interface; only the constructor differs.

## Example for linear regression - pseudocode

```
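import numpy as np


class LinRegLearner:
    def __init__(self):
        self.m = None  # slope, set by train()
        self.b = None  # intercept, set by train()

    def train(self, x, y):
        # Fit a least-squares line to the 1-D training data.
        # np.polyfit stands in for favorite_linreg_algorithm here.
        self.m, self.b = np.polyfit(x, y, 1)

    def query(self, x):
        # Predict by plugging x into y = m * x + b.
        return self.m * np.asarray(x) + self.b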
```

Next step: build a KNN learner with the same interface.
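
One way a KNN learner with the same `train`/`query` interface could look is sketched below. It assumes NumPy, a single 1-D input feature, and absolute difference as the distance; it is an illustration, not a reference implementation.

```python
import numpy as np


class KNNLearner:
    def __init__(self, k=3):
        self.k = k

    def train(self, x, y):
        # Instance-based learning: "training" only stores the data
        self.x = np.asarray(x, dtype=float)
        self.y = np.asarray(y, dtype=float)

    def query(self, x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        # distances[i, j] = |query i - training point j|
        distances = np.abs(x[:, None] - self.x[None, :])
        # Indices of the k nearest training points for each query
        nearest = np.argsort(distances, axis=1)[:, :self.k]
        # Unweighted mean of the neighbors' y values
        return self.y[nearest].mean(axis=1)


learner = KNNLearner(k=2)
learner.train([0.0, 1.0, 2.0, 3.0], [0.0, 10.0, 20.0, 30.0])
pred = learner.query([0.4, 2.6])
print(pred)
```

Note that `train` does no fitting at all, while `query` does all the work, which matches the fast-training, slow-querying trade-off described earlier.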

