## Machine Learning Overview

Supervised Regression Learning:
- Numerical prediction from a model trained on examples of inputs and their known outputs
- As opposed to classification learning, which predicts a category rather than a number

Types of regression learners:
- Linear regression (parametric)
- K nearest neighbor (instance based)
- Decision trees
- Decision forests (lots and lots of decision trees taken together)

Learning with stock data:
- given a dataframe of features (measurable predictive factors) for a set of stocks over time
- feature data serves as the input X
- for training, use historical prices and historical features
- pair the feature set at time t with the price observed later, at t+5, and record these (feature set(t), price(t+5)) pairs as the data used to train the model
- depth and breadth of the data are given by the time period over which training occurs and the stock universe we look at
- once the model is trained, start making predictions
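
The pairing of features at time t with the price at t+5 can be sketched as follows. This is a minimal illustration in plain Python for a single stock; the names `make_training_pairs`, `features`, `prices`, and `horizon` are hypothetical, and the numbers are made up only to show the shape of the data.

```python
def make_training_pairs(features, prices, horizon=5):
    """Pair the feature set at day t with the price at day t + horizon.

    features: list of per-day feature lists (hypothetical layout)
    prices:   list of per-day prices, same length as features
    """
    if len(features) != len(prices):
        raise ValueError("features and prices must cover the same days")
    X, Y = [], []
    # The last `horizon` days have no known future price, so they are skipped.
    for t in range(len(prices) - horizon):
        X.append(features[t])
        Y.append(prices[t + horizon])
    return X, Y


# Illustrative data only: 8 days, two made-up features per day.
features = [[0.1 * t, 1.0 - 0.05 * t] for t in range(8)]
prices = [100.0 + t for t in range(8)]

X, Y = make_training_pairs(features, prices, horizon=5)
print(len(X), Y)
```

Note that the most recent `horizon` days cannot be used for training, because their future price is not yet known.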

Problems with regression:
- forecasts are noisy and uncertain
- challenging to estimate confidence in forecasts
- holding time and allocation are uncertain
- these issues can be addressed by reinforcement learning
    

### Supervised Regression Learning
- Using data to build a model that predicts a numerical output from a set of numerical inputs.

#### Parametric regression: Simple Regression
- represent the model with a number of parameters
- for example, fitting a line to data (through linear regression on the line y = mx + b)
    - parameters given by m and b
- in theory, you could fit a polynomial of any degree by adding parameters, to describe the behavior of the data more accurately
- in practice, once the model is parameterized you usually throw away the training data and use only the parameters for predictions
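
As a rough sketch of what "finding m and b" means, the closed-form least-squares fit for a single input can be written in plain Python. The function name `fit_line` and the sample data are assumptions for illustration; any library routine that fits a line would serve the same purpose.

```python
def fit_line(xs, ys):
    """Return (m, b) minimizing squared error for the line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of x-y co-deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


# Illustrative data lying on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

m, b = fit_line(xs, ys)

# After fitting, only m and b are needed to predict; xs and ys can be discarded.
prediction = m * 10.0 + b
print(m, b, prediction)
```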

#### Instance regression: K Nearest Neighbor
- instead, you could use a data-centric approach where you keep the data and use it to inform your predictions
- find the K nearest data points to the query and use them to estimate the output
- take the mean Y value of the K nearest neighbors as the prediction
- repeating this process at many query points along X gives a model that follows the data more closely
- a similar method is kernel regression, which weights each neighbor's contribution by its distance from the query X value (e.g. Euclidean distance)
- non-parametric approaches are good for relationships that are hard to approximate or derive mathematically and are better suited to numerical methods
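
The difference between a plain K-nearest-neighbor average and a distance-weighted (kernel) average can be sketched as below. This assumes a single numeric input and uses a Gaussian weighting function as one possible kernel; the names `knn_predict`, `kernel_predict`, and `bandwidth` are hypothetical and the data is illustrative.

```python
import math


def knn_predict(train_x, train_y, query_x, k=3):
    """Mean Y of the k training points closest to query_x (1-D inputs)."""
    by_distance = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - query_x))
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / len(nearest)


def kernel_predict(train_x, train_y, query_x, bandwidth=1.0):
    """Weighted mean of all Y values; closer points get larger weights."""
    # Gaussian weights are an assumed choice; other kernels are possible.
    weights = [math.exp(-((x - query_x) ** 2) / (2 * bandwidth ** 2)) for x in train_x]
    total = sum(weights)
    if total == 0:
        raise ValueError("query is too far from all training points for this bandwidth")
    return sum(w * y for w, y in zip(weights, train_y)) / total


# Illustrative data only.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [1.0, 4.0, 9.0, 16.0, 25.0]

knn_value = knn_predict(train_x, train_y, 3.0, k=3)
kernel_value = kernel_predict(train_x, train_y, 3.0, bandwidth=0.5)
print(knn_value, kernel_value)
```

With the kernel version every training point contributes, but points far from the query receive weights close to zero, so no hard cutoff at K neighbors is needed.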

#### Training and Testing:
- we have data on prices and features for our stocks
- we first separate the data into training and testing sets, so we can check how the model behaves on data it was not trained on
- put the training data through the machine learning algorithm to derive the model, then put the testing data through the model and compare its output to the true prices we already know to see whether the model has been successful
- generally, train on older data and test on newer data 
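
A minimal sketch of a time-ordered split and an out-of-sample comparison is shown below. It assumes the rows are already sorted oldest to newest; the helper names `time_ordered_split` and `rmse`, the 75/25 split, and the choice of root-mean-square error as the comparison metric are all assumptions for illustration, and the stand-in "model" simply predicts the training mean.

```python
import math


def time_ordered_split(X, Y, train_fraction=0.75):
    """Split rows that are already sorted oldest to newest."""
    cut = int(len(X) * train_fraction)
    # Older rows go to training, newer rows to testing.
    return X[:cut], Y[:cut], X[cut:], Y[cut:]


def rmse(predicted, actual):
    """Root-mean-square error between predictions and known values."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


# Illustrative data only, ordered oldest to newest.
X = [float(t) for t in range(8)]
Y = [2.0 * x + 1.0 for x in X]

Xtrain, Ytrain, Xtest, Ytest = time_ordered_split(X, Y, train_fraction=0.75)

# Stand-in model: always predict the mean of the training outputs.
mean_y = sum(Ytrain) / len(Ytrain)
predictions = [mean_y for _ in Xtest]

error = rmse(predictions, Ytest)
print(len(Xtrain), len(Xtest), error)
```

Splitting by time rather than at random keeps information from the future out of the training set, which matches the "train on older, test on newer" guideline.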

#### Learning APIs:
- we will need to build APIs for implementing the learners

##### Linear Regression:
- learner = LinRegLearner()
- learner.train(Xtrain,Ytrain)
- Y = learner.query(Xtest) --> compare to Ytest

<code>
import numpy as np

class LinRegLearner:
    def __init__(self):
        pass

    def train(self, X, Y):
        # fit a line to the data: find m and b, the parameters of the linear model
        # np.polyfit is one option; X is assumed to be a 1-D array (single feature)
        self.m, self.b = np.polyfit(X, Y, 1)

    def query(self, X):
        # predict from the fitted parameters only; the training data is not needed
        Y = self.m * np.asarray(X) + self.b
        return Y
</code>

##### K-Nearest Neighbor:
- learner = KNNLearner(k=3) --> arg = number of neighbors
- learner.train(Xtrain,Ytrain)
- Y = learner.query(Xtest) --> compare to Ytest

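<code>
import numpy as np

class KNNLearner:
    def __init__(self, k):
        self.k = k  # number of neighbors to average over

    def train(self, X, Y):
        # don't really have to train much: just keep the data for query time
        # X is assumed to be a 1-D array (single feature)
        self.X = np.asarray(X, dtype=float)
        self.Y = np.asarray(Y, dtype=float)

    def query(self, X):
        # for each query point, average the Y values of its k nearest neighbors
        X = np.atleast_1d(np.asarray(X, dtype=float))
        dist = np.abs(X[:, None] - self.X[None, :])
        nearest = np.argsort(dist, axis=1)[:, :self.k]
        return self.Y[nearest].mean(axis=1)
</code>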
    
