# Notes

## Chapter 1

### Types of Machine Learning
- $\textbf{Supervised Learning}$
    - Labeled Data
    - Direct Feedback
    - Predict outcome/future
    - Subcategories:
        - Classification - predict categorical class
        - Regression - predict continuous outcomes
- $\textbf{Unsupervised Learning}$
    - No labels/targets
    - No feedback
    - Find hidden structure in data
    - Subcategories:
        - Clustering - organize data into meaningful subgroups
        - Dimensionality Reduction - compress the data onto a smaller dimensional subspace while retaining most of the relevant information
- $\textbf{Reinforcement Learning}$
    - Decision process
    - Reward system
    - Learn series of actions
    - Agent improves its performance based on interactions with the environment


## Chapter 2
<img src="machineLearningRoadmap.png" width="50%"/>

### Perceptron

- Rosenblatt's thresholded perceptron model mimics the function of a single neuron in the brain
- Use perceptron implementation to create a decision boundary to classify data into categories (binary)
- $\textbf{OvR}$: one-versus-rest - technique that allows us to extend any binary classifier to multi-class problems
- Takes a combination of certain inputs and a corresponding weight vector to predict the outcome

- Learning rule:
    1. Initialize the weights and bias unit to 0 or small random numbers
    2. For each training example, $x^{(i)}$:
        - Compute the output value, $\hat{y}^{(i)}$ (predicted class label)
        - Update the weights and bias unit
            - $w_{j} := w_{j} + \Delta w_{j}$ where $\Delta w_{j} = \eta(y^{(i)} - \hat{y}^{(i)})x_{j}^{(i)}$
            - $b := b + \Delta b$ where $\Delta b = \eta(y^{(i)} - \hat{y}^{(i)})$
            - $\eta$ is the learning rate
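
The learning rule above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes class labels 0 and 1, a unit step that predicts 1 when the net input is at least 0, and zero-initialized weights. The function names, the toy dataset (logical AND), and the default values of `eta` and `n_epochs` are all hypothetical choices made for this example.

```python
def predict(x, w, b):
    """Unit step on the net input z = w.x + b; returns class label 1 or 0."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return 1 if z >= 0.0 else 0


def train_perceptron(X, y, eta=0.5, n_epochs=10):
    """Apply the perceptron learning rule for a fixed number of passes."""
    w = [0.0] * len(X[0])  # step 1: initialize the weights to 0
    b = 0.0                # ... and the bias unit
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):  # step 2: visit each training example
            # eta * (y - y_hat) is zero whenever the prediction is correct
            update = eta * (y_i - predict(x_i, w, b))
            w = [w_j + update * x_j for w_j, x_j in zip(w, x_i)]
            b += update
    return w, b


# Toy, linearly separable data (logical AND), chosen purely for illustration.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = [0, 0, 0, 1]
w, b = train_perceptron(X, y)
predictions = [predict(x_i, w, b) for x_i in X]
print(predictions)
```

Because the weights only change on misclassified examples, the updates stop once every training example is classified correctly, which can only happen if the classes are linearly separable.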


<img src="PerceptronWeightsAndBias.png" width="50%"/>

### Adaptive Linear Neurons (Adaline)

<img src="AdalineWeightsAndBias.png" width="50%"/>

- Weights are updated based on a linear activation function rather than a unit step function like in the perceptron
- Threshold function is still used to make the final prediction
- $\textbf{Objective Function}$ - optimized during the learning process; often a loss or cost function that we want to minimize
- Adaline loss function is the mean squared error (MSE) between the calculated outcome and the true class label:
    - $L(w, b) = \frac{1}{n} \sum\limits_{i=1}^{n} (y^{(i)} - \sigma(z^{(i)}))^{2}$
- Because the activation function is linear, the loss function is differentiable and convex, allowing us to use the powerful $\textbf{gradient descent}$ optimization algorithm
    - Step in the opposite direction of the gradient
    - $\Delta w_{j} = -\eta \frac{\partial L}{\partial w_{j}} = \eta \frac{2}{n} \sum\limits_{i} ( y^{(i)} - \sigma(z^{(i)}))x_{j}^{(i)}$
    - $\Delta b = -\eta \frac{\partial L}{\partial b} = \eta \frac{2}{n} \sum\limits_{i} ( y^{(i)} - \sigma(z^{(i)}))$
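
The two update formulas can be turned into a small full-batch training loop. The sketch below assumes the identity activation $\sigma(z) = z$, class labels 0 and 1, and a prediction threshold of 0.5 on the net input. Those conventions, the function name, and the one-feature toy data are assumptions made for illustration rather than anything fixed by the notes.

```python
def train_adaline(X, y, eta=0.1, n_epochs=50):
    """Full-batch gradient descent on the MSE loss with identity activation."""
    n, m = len(X), len(X[0])
    w, b = [0.0] * m, 0.0
    losses = []
    for _ in range(n_epochs):
        # Linear activation: sigma(z) = z = w.x + b
        outputs = [sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b for x_i in X]
        errors = [y_i - out for y_i, out in zip(y, outputs)]
        # Delta w_j = eta * (2/n) * sum_i error_i * x_j^(i)
        for j in range(m):
            w[j] += eta * 2.0 / n * sum(e * x_i[j] for e, x_i in zip(errors, X))
        # Delta b = eta * (2/n) * sum_i error_i
        b += eta * 2.0 / n * sum(errors)
        # MSE measured before this epoch's update was applied
        losses.append(sum(e * e for e in errors) / n)
    return w, b, losses


# Hypothetical one-feature dataset, already centered around zero.
X = [[-1.0], [-0.5], [0.5], [1.0]]
y = [0, 0, 1, 1]
w, b, losses = train_adaline(X, y)
# The threshold function is only used for the final prediction.
predictions = [1 if w[0] * x_i[0] + b >= 0.5 else 0 for x_i in X]
print(predictions, losses[0], losses[-1])
```

Unlike the perceptron, the weights are updated once per pass using all training examples, and the size of the update depends on the continuous error rather than on a thresholded prediction.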
 
<img src="GradientDescent.png" width="40%"/>

- Gradient descent converges much more quickly if we implement feature scaling
- $\textbf{Standardization}$:
    - Shifts the mean of each feature so that it is centered at zero and each feature has a standard deviation of 1 (unit variance)
    - $x^{'}_{j} = \frac{x_{j} - \mu_{j}}{\sigma_{j}}$
    - where $x_{j}$ is a vector consisting of the $j$th feature values of all training examples, and $\mu_{j}$ and $\sigma_{j}$ are the mean and standard deviation of that feature
- $\textbf{Stochastic Gradient Descent}$:
    - Instead of calculating the gradient from the entire dataset, we update the weights using one randomly picked training example at a time
    - Updates the weights much more frequently; although each step uses only an approximation of the actual gradient, it typically converges much faster
    - The fixed learning rate is often replaced by an adaptive learning rate that decreases over time: $\frac{c_{1}}{\text{[number of iterations]} + c_{2}}$, where $c_{1}$ and $c_{2}$ are constants
    - $\textbf{Online Learning}$ - model is trained on the fly as new training data arrives and data can be discarded after updating the model if storage space is an issue
- $\textbf{Mini-batch Gradient Descent}$:
    - Middle ground between SGD and full-batch gradient descent
    - Apply full-batch gradient descent to small subsets (mini-batches) of the training data
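
Standardization, the decaying learning rate, and mini-batching can be combined in one short sketch. Everything named here is an assumption made for illustration: the helper names, the constants `c1` and `c2`, the single-feature Adaline model with identity activation, and the raw feature values. Setting `batch_size=1` gives SGD, and setting it to the number of examples gives full-batch gradient descent.

```python
import random
import statistics


def standardize(column):
    """Return (x - mean) / std for one feature column (population std)."""
    mu = statistics.mean(column)
    sigma = statistics.pstdev(column)
    return [(x - mu) / sigma for x in column]


def learning_rate(iteration, c1=1.0, c2=10.0):
    """Decaying rate c1 / (iteration + c2); c1 and c2 are illustrative."""
    return c1 / (iteration + c2)


def minibatches(n_examples, batch_size, rng):
    """Yield lists of shuffled indices, one list per mini-batch."""
    indices = list(range(n_examples))
    rng.shuffle(indices)
    for start in range(0, n_examples, batch_size):
        yield indices[start:start + batch_size]


def train_adaline_minibatch(x, y, batch_size=2, n_epochs=20, seed=1):
    """Adaline with one feature, trained on shuffled mini-batches."""
    rng = random.Random(seed)
    w, b, iteration = 0.0, 0.0, 0
    for _ in range(n_epochs):
        for batch in minibatches(len(x), batch_size, rng):
            eta = learning_rate(iteration)
            errors = [y[i] - (w * x[i] + b) for i in batch]
            # Same update as full-batch, averaged over the mini-batch only
            w += eta * 2.0 / len(batch) * sum(e * x[i] for e, i in zip(errors, batch))
            b += eta * 2.0 / len(batch) * sum(errors)
            iteration += 1
    return w, b


# Hypothetical raw feature values on a scale far from zero.
x_raw = [150.0, 160.0, 170.0, 180.0]
y = [0, 0, 1, 1]
x_std = standardize(x_raw)  # now has mean 0 and standard deviation 1
w, b = train_adaline_minibatch(x_std, y, batch_size=2)
print(x_std, w, b)
```

Reshuffling the indices at the start of every epoch keeps the order of the updates from repeating, which is a common way to realize the "randomly pick" step without sampling with replacement.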


## Chapter 3

### Classifiers 

- No free lunch theorem - no single classifier works best across all possible scenarios
- It is recommended that you compare the performance of at least a handful of different learning algorithms to select the best model for the particular problem. Problems may differ in:
    - The number of features or examples
    - The amount of noise in the dataset
    - Whether the classes are linearly separable
 
- The five main steps involved in training a supervised machine learning algorithm:
    1. Selecting features and collecting labeled training examples
    2. Choosing a performance metric
    3. Choosing a learning algorithm and training a model
    4. Evaluating the performance of the model
    5. Changing the settings of the algorithm and tuning the model

### Logistic Regression

- Linear model used for binary classification
- Very easy to implement and performs very well on linearly separable classes
- The activation function becomes the sigmoid function, which maps any real number into the (0, 1) range
- The output of the sigmoid function is then interpreted as the probability of a particular example belonging to class 1 (the positive class)
- This probability is often just as interesting as the predicted class label
- It is recommended to use more advanced optimization approaches than regular SGD, such as the solvers available in scikit-learn (newton-cg, lbfgs, liblinear, sag, saga)
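
To make the role of the sigmoid concrete, here is a minimal sketch of how a net input is turned into a probability and then into a class label. The weights and bias are hand-picked, hypothetical values (a trained model would learn them), and the 0.5 decision threshold is an assumed convention.

```python
import math


def sigmoid(z):
    """Logistic sigmoid 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def predict_proba(x, w, b):
    """Estimated probability that x belongs to class 1."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return sigmoid(z)


def predict(x, w, b, threshold=0.5):
    """Class label obtained by thresholding the probability."""
    return 1 if predict_proba(x, w, b) >= threshold else 0


# Hypothetical parameters, not learned from data.
w, b = [2.0, -1.0], 0.5
x = [1.0, 0.5]                # net input z = 2.0 - 0.5 + 0.5 = 2.0
p = predict_proba(x, w, b)    # roughly 0.88
label = predict(x, w, b)
print(p, label)
```

A net input of 0 maps to a probability of exactly 0.5, so thresholding the probability at 0.5 is equivalent to thresholding the net input at 0.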

<img src="LogisticRegression.png" width="50%"/>

### Overfitting

- $\textbf{Overfitting}$ - a model performs well on training data but does not generalize well to unseen data (test data)
- Such a model is said to have high variance, which can be caused by having too many parameters, leading to a model that is too complex for the underlying data
- $\textbf{Underfitting}$ - (high bias) the model is not complex enough to capture the pattern in the training data well and suffers from low performance on unseen data

<img src="Overfitting.png" width="50%"/>

- $\textbf{bias-variance tradeoff}$:
    - high variance refers to overfitting
    - high bias refers to underfitting
 
- One way of finding a good bias-variance tradeoff is to tune the complexity of the model via regularization
- $\textbf{Regularization}$ - introduce additional information to penalize extreme parameter (weight) values. Useful for:
    - handling collinearity (high correlation among features)
    - filtering out noise from the data
    - preventing overfitting
- $\textbf{L2 regularization}$ - (L2 shrinkage / weight decay) most common form of regularization. Adds the following term to the loss function:
    - $\frac{\lambda}{2n} ||\textbf{w} ||^{2} = \frac{\lambda}{2n} \sum\limits_{j=1}^{m} w_{j}^{2}$
    - Regularization parameter, $\lambda$ - used to control how closely we fit the training data, while keeping the weights small
    - Parameter C of the LogisticRegression class in scikit-learn comes from a convention in SVMs and is inversely proportional to $\lambda$
    - If the regularization strength is set too high the weight coefficients approach zero and the model can perform poorly due to underfitting
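
The effect of the L2 term can be checked with a few lines of arithmetic. The sketch below computes the penalty from the formula above and its gradient with respect to each weight, $\frac{\lambda}{n} w_{j}$, which is what pulls the weights toward zero during gradient descent. The function names and the example numbers are hypothetical, and the bias unit is left unpenalized by assumption.

```python
def l2_penalty(w, lam, n):
    """L2 term lambda / (2n) * sum_j w_j^2 (the bias unit is not included)."""
    return lam / (2.0 * n) * sum(w_j ** 2 for w_j in w)


def l2_gradient(w, lam, n):
    """Derivative of the L2 term with respect to each weight: (lambda / n) * w_j."""
    return [lam / n * w_j for w_j in w]


w = [3.0, -4.0]  # squared norm: 9 + 16 = 25
n = 5            # hypothetical number of training examples

for lam in (0.0, 1.0, 10.0):
    print(lam, l2_penalty(w, lam, n), l2_gradient(w, lam, n))

# One gradient step on the penalty alone shrinks every weight toward zero,
# which is why L2 regularization is also called weight decay.
eta, lam = 0.1, 10.0
w_decayed = [w_j - eta * g_j for w_j, g_j in zip(w, l2_gradient(w, lam, n))]
print(w_decayed)
```

With a larger $\lambda$ the same weights incur a larger penalty, so the optimizer is pushed harder toward small weights. Under the inverse relationship noted above, this corresponds to a smaller C.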
 
### Support Vector Machines (SVM)

### Decision Tree

### Random Forest

### K-nearest neighbors