# Supervised learning

- Training with labeled data.
- Ex. linear regression, logistic regression, naive Bayes, KNN, SVM, decision tree, random forest, boosting tree, MLP, CNN, RNN, LSTM.

## Classification

- Finite number of outputs
- Ex. logistic regression, decision tree, random forests.
- Evaluation
    - Accuracy
    - Precision
    - Recall
    - F1 score (between $0$ and $1$)

## Imbalanced data in classification?

1. Collect more data.
2. Undersample from over-represented class.
3. Change performance metric
    - Accuracy is not the right metric to use when data is imbalanced.
    - Look at precision / recall / F1 score.
4. Data augmentation 
    - For example, crop/rotate images.
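
The point about accuracy can be sketched on a made-up dataset. The class counts and the always-negative "model" below are hypothetical, chosen only to show how accuracy can look good while recall is zero.

```python
# Hypothetical imbalanced dataset: 990 negatives (0) and 10 positives (1).
y_true = [0] * 990 + [1] * 10
# A useless "model" that always predicts the majority class.
y_pred = [0] * 1000

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
false_neg = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = true_pos / (true_pos + false_neg)

print(f"accuracy = {accuracy:.2f}")  # 0.99, looks great
print(f"recall   = {recall:.2f}")    # 0.00, no positive was caught
```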

## Evaluate classification model

- True Negative: ground truth was negative and prediction was negative.
- True Positive: ground truth was positive and prediction was positive.
- False Negative: ground truth was positive but prediction was negative.
- False Positive: ground truth was negative but prediction was positive.
- Confusion table shows TP, FP, TN, FN.
- In perfectly separable data, both precision and recall can be $1$.
- But in the real world, shifting the decision boundary increases one and decreases the other.

Precision
- Correctness on predicted positive.
- What percentage of positive predictions were correct?
    - Ex. Of examples recognized as cat, what % actually are cats?
- True Positive / (True Positive + False Positive)

Recall
- Correctness of actual positive.
- What percentage of positive cases did you catch?
    - Ex. What % of actual cats are correctly recognized.
- True Positive / (True Positive + False Negative)

F1 score
- Harmonic mean of precision and recall.
- $\dfrac{2}{\dfrac{1}{P}+\dfrac{1}{R}}$

Accuracy
- What percentage of predictions were correct?
- (True Positive + True Negative) / (True Negative + True Positive + False Negative + False Positive)
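
A minimal sketch of the four formulas above, computed from confusion-matrix counts. The function name and the example counts are hypothetical, and the sketch assumes every denominator is non-zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (precision, recall, f1, accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 / (1 / precision + 1 / recall)  # harmonic mean of P and R
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy


# Hypothetical cat detector: 40 TP, 10 FP, 45 TN, 5 FN.
p, r, f1, acc = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f} accuracy={acc:.3f}")
```

Because F1 is a harmonic mean, it is never larger than the arithmetic mean of precision and recall, and it drops sharply when either one is low.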

False Positive Vs. False Negative
- In a medical exam, a False Negative is threatening to the patient (a condition goes undetected), so erring toward False Positives is preferred.
- In spam filtering, a False Positive is annoying to the user (a legitimate email is flagged), so erring toward False Negatives is preferred.

## One hot encoding

- Represent a categorical variable in a numerical vector space.
- The vectors of any two categories are the same distance apart.
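
A small sketch of the idea, using only the standard library. The helper `one_hot` and the color categories are hypothetical; the ordering of categories (sorted alphabetically here) is an arbitrary choice.

```python
import math


def one_hot(categories):
    """Map each distinct category to a vector containing a single 1."""
    ordered = sorted(set(categories))
    return {
        cat: [1 if i == j else 0 for j in range(len(ordered))]
        for i, cat in enumerate(ordered)
    }


encoding = one_hot(["red", "green", "blue", "green"])
for cat, vec in encoding.items():
    print(cat, vec)

# Every pair of distinct categories is sqrt(2) apart.
print(math.dist(encoding["red"], encoding["green"]))
print(math.dist(encoding["red"], encoding["blue"]))
```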

# Unsupervised learning

- Detect patterns in data without labels.
- Ex. clustering (k-means), PCA, autoencoder, GAN.

## K-means

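1. Partition points into $k$ initial subsets.
2. Compute the centroid of each subset in the current partitioning.
3. Assign each point to the cluster with the nearest centroid.
4. Repeat steps 2 and 3 until the assignments stop changing.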
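
The loop above can be sketched in plain Python. This is a minimal illustration, not a production implementation: it assumes the initial partition comes from picking $k$ random points as centroids, uses Euclidean distance, and keeps the old centroid when a cluster ends up empty. The function name and toy points are hypothetical.

```python
import math
import random


def k_means(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k random points
    assignments = None
    for _ in range(iterations):
        # Assign each point to the cluster with the nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # converged: nothing changed
        assignments = new_assignments
        # Recompute the centroid of each cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the previous centroid if the cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignments


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, labels = k_means(points, k=2)
print(centroids, labels)
```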

## Gaussian mixture model Vs K-means

K-means
- Data point must belong to one cluster.
- Computes distance.

GMM
- Probability of point belonging to each cluster.
- Computes weighted distance.
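
A sketch of the difference in assignment only (no fitting), for one-dimensional data. The cluster means, standard deviations and mixture weights are hypothetical and assumed to be already known; a real GMM would estimate them, typically with EM.

```python
import math


def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def hard_assignment(x, means):
    """K-means style: index of the single nearest cluster center."""
    return min(range(len(means)), key=lambda k: abs(x - means[k]))


def soft_assignment(x, means, stds, weights):
    """GMM style: probability that x belongs to each cluster."""
    likelihoods = [w * gaussian_pdf(x, m, s) for m, s, w in zip(means, stds, weights)]
    total = sum(likelihoods)
    return [value / total for value in likelihoods]


means, stds, weights = [0.0, 4.0], [1.0, 1.0], [0.5, 0.5]
print(hard_assignment(1.5, means))                 # one cluster index
print(soft_assignment(1.5, means, stds, weights))  # one probability per cluster
```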

# Feature selection

- Remove unneeded, irrelevant, redundant attribute from data.
- Redundant features can mislead the model. 
    - Especially, k-nearest neighbors.
- Irrelevant features can overfit the model. 
- Ex. PCA (strictly, PCA builds new combined features rather than selecting a subset of the original ones).

## Filter method

- Assign score to each feature.
- Often considers features independent.
- Ex. chi squared test, information gain, correlation coefficient scores

## Embedded method

- Learn which features are contributing to the accuracy of model.
- Ex. regularization (LASSO, elastic net, ridge regression)

# Decision tree

- Used for classification (a regression tree stores a numeric value in each leaf instead of a class label).
- Internal node: test on attribute.
- Branch: test outcome.
- Leaf: class label.
- Main parameters: maximum tree depth, minimum samples per tree node, impurity criterion.

# Random forest

- Used for regression and classification.
- Consists of many decision trees.

# Dimensionality

Curse of dimensionality
- High dimensional data is extremely sparse.
- It's hard to do machine learning on sparse data.

Singular value decomposition
- Factor a matrix into three pieces: left matrix, diagonal matrix, right matrix.

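Principal component analysis
- Closely related to SVD: PCA can be computed by applying SVD to the mean-centered data matrix $X$.
- Left and right matrices hold the eigenvectors of $XX^{T}$ and $X^{T}X$.
- Diagonal matrix holds the singular values, whose squares are the corresponding eigenvalues.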
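
The link between SVD and PCA can be written out as a short derivation. It assumes $X$ is an $n \times d$ data matrix whose columns have been mean-centered, and that the covariance uses the $\frac{1}{n-1}$ convention; both are assumptions added here for illustration.

$$X = U \Sigma V^{T} \quad\Longrightarrow\quad \dfrac{1}{n-1}X^{T}X = \dfrac{1}{n-1}V \Sigma^{T} U^{T} U \Sigma V^{T} = V \, \dfrac{\Sigma^{2}}{n-1} \, V^{T}$$

So the columns of $V$ are eigenvectors of the covariance matrix (the principal directions), and each eigenvalue is a squared singular value divided by $n-1$.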

# Recommender system

Baseline
- Relevant and personalized information.
- Should not be something users know well.
- Diverse suggestions.
- Users should explore new items.

## Collaborative filtering

- Recommendation is calculated as average of other experiences.
- Does not work well on sparse data, also has cold start problem.

## Cold start problem
- Cannot make recommendation for new item.
- Cannot find similarity with other users for new user.
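
A deliberately simple sketch of the "average of other experiences" idea and of the cold start for a new item. The function, users, items and ratings are all hypothetical; practical collaborative filtering usually weights other users by similarity rather than averaging them equally.

```python
def predict_rating(ratings, user, item):
    """Average of the ratings that other users gave to `item`, or None if there are none."""
    others = [
        user_ratings[item]
        for other, user_ratings in ratings.items()
        if other != user and item in user_ratings
    ]
    if not others:
        return None  # cold start: nobody has rated this item yet
    return sum(others) / len(others)


ratings = {
    "alice": {"matrix": 5, "titanic": 2},
    "bob": {"matrix": 4, "inception": 5},
    "carol": {"matrix": 3, "titanic": 4, "inception": 4},
}
print(predict_rating(ratings, "alice", "inception"))    # 4.5
print(predict_rating(ratings, "alice", "new_release"))  # None
```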

## Content-based filtering
- An approach to solve cold start problem.
- Recommend items that are similar to items that user liked already.

# Time series

- Observations ordered in time.
- Prediction depending on input vs prediction depending on certain pattern over time.

# Normalization

- Helps gradient-based training converge faster.

## Batch normalization

- Normalize activations.
    - Given some intermediate values in neural network $z^{(1)} \dots z^{(m)}$
    - $\mu = \dfrac{1}{m}\displaystyle\sum_{i}z^{(i)}$
    - $\sigma^{2} = \dfrac{1}{m}\displaystyle\sum_{i}(z^{(i)}-\mu)^{2}$
    - $z_{norm}^{(i)} = \dfrac{z^{(i)}-\mu}{\sqrt{\sigma^{2}+\epsilon}}$
    - $\tilde{z}^{(i)} = \gamma z_{norm}^{(i)} + \beta$
- For example, if $\gamma = \sqrt{\sigma^{2}+\epsilon}, \beta = \mu$, then $\tilde{z}^{(i)} = z^{(i)}$ (the normalization is undone).
- Use $\tilde{z}^{(i)}$ instead of ${z}^{(i)}$ 
- But unlike inputs, you don't want to force activation to be ~ $N(0,1)$

$X \xrightarrow{w^{[1]}, b^{[1]}} z^{[1]} \xrightarrow{\beta^{[1]}, \gamma^{[1]}} \tilde{z}^{[1]} \rightarrow a^{[1]} = g^{[1]}(\tilde{z}^{[1]}) \xrightarrow{w^{[2]}, b^{[2]}} z^{[2]} \xrightarrow{\beta^{[2]}, \gamma^{[2]}} \tilde{z}^{[2]} \rightarrow a^{[2]} \rightarrow \dots$ 
- parameters: $w, b, \beta, \gamma$
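
A minimal sketch of the forward computation for a single unit over one mini-batch, following the formulas above. The function name, the sample values and the default $\epsilon$ are assumptions for illustration. The last call checks the special case $\gamma = \sqrt{\sigma^{2}+\epsilon}, \beta = \mu$, which recovers the original $z$.

```python
import math


def batch_norm(z, gamma, beta, eps=1e-8):
    """Normalize the values z over the mini-batch, then scale by gamma and shift by beta."""
    m = len(z)
    mu = sum(z) / m
    var = sum((zi - mu) ** 2 for zi in z) / m
    z_norm = [(zi - mu) / math.sqrt(var + eps) for zi in z]
    z_tilde = [gamma * zn + beta for zn in z_norm]
    return z_tilde, mu, var


z = [1.0, 2.0, 3.0, 4.0]  # hypothetical pre-activations for one unit
z_tilde, mu, var = batch_norm(z, gamma=1.0, beta=0.0)
print(z_tilde)  # mean ~0, variance ~1

eps = 1e-8
identity, _, _ = batch_norm(z, gamma=math.sqrt(var + eps), beta=mu, eps=eps)
print(identity)  # approximately the original z
```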

Working with mini-batches
- Parameters: $w, \beta, \gamma$ (no need for $b$)
- $z^{[l]} = w^{[l]}a^{[l-1]}$
- $\tilde{z}^{[l]} = \gamma^{[l]}z_{norm}^{[l]} + \beta^{[l]}$
- For $t = 1 \dots$ num_mini_batches
    - Compute forward prop on $X^{\{t\}}$ 
        - In each layer, use BN to replace $z^{[l]}$ with $\tilde{z}^{[l]}$
    - Use backprop to compute $dw^{[l]}, d\beta^{[l]}, d\gamma^{[l]}$ (no need for $db^{[l]}$)
    - Update $w^{[l]} = w^{[l]} - \alpha dw^{[l]}, \beta^{[l]} = \beta^{[l]} - \alpha d\beta^{[l]}, \gamma^{[l]} = \gamma^{[l]} - \alpha d\gamma^{[l]}$
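
Why $b$ can be dropped: subtracting the mini-batch mean cancels any constant added to every $z^{(i)}$. A small numerical check, with hypothetical values for $w a$ and $b$:

```python
import math


def normalize(z, eps=1e-8):
    """Subtract the mini-batch mean and divide by the mini-batch standard deviation."""
    mu = sum(z) / len(z)
    var = sum((zi - mu) ** 2 for zi in z) / len(z)
    return [(zi - mu) / math.sqrt(var + eps) for zi in z]


wa = [0.5, -1.2, 3.0, 0.7]  # hypothetical values of w * a for one unit over a mini-batch
b = 10.0                    # hypothetical bias

without_bias = normalize(wa)
with_bias = normalize([value + b for value in wa])
print(without_bias)
print(with_bias)  # same values up to floating-point error
```

The shift is then handled entirely by $\beta$.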
    
Batch normalization as regularization
- Each mini-batch is scaled by mean/variance computed on just that mini-batch.
- This adds some noise to $z^{[l]}$
- This has slight regularization effect.

Batch normalization at test time
- $\mu, \sigma^{2}$: estimate using exponentially weighted average (across mini-batches)
- $X^{\{1\}} \rightarrow \mu^{\{1\}[l]}, \sigma^{2\{1\}[l]}, X^{\{2\}} \rightarrow \mu^{\{2\}[l]}, \sigma^{2\{2\}[l]}, X^{\{3\}} \rightarrow \mu^{\{3\}[l]}, \sigma^{2\{3\}[l]}, \dots$
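
A sketch of the running estimate, assuming a decay factor of 0.9 and initial values of 0 and 1 (both are illustrative assumptions, as are the function name and the per-batch statistics).

```python
def update_running_stats(running_mu, running_var, batch_mu, batch_var, momentum=0.9):
    """Exponentially weighted average of the mini-batch mean and variance."""
    new_mu = momentum * running_mu + (1 - momentum) * batch_mu
    new_var = momentum * running_var + (1 - momentum) * batch_var
    return new_mu, new_var


running_mu, running_var = 0.0, 1.0                  # assumed initial values
batch_stats = [(2.0, 4.0), (2.2, 3.8), (1.8, 4.2)]  # hypothetical (mu, sigma^2) per mini-batch
for batch_mu, batch_var in batch_stats:
    running_mu, running_var = update_running_stats(
        running_mu, running_var, batch_mu, batch_var
    )

print(running_mu, running_var)  # used in place of batch statistics at test time
```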

## Softmax regression

- Let $C$ be number of classes.
- Last layer (softmax layer) has $n^{[L]}= C$ units.
    - $z^{[L]} = w^{[L]}a^{[L-1]} + b^{[L]}$
    - $t = e^{(z^{[L]})}$ (element-wise)
    - $a^{[L]} = \dfrac{e^{(z^{[L]})}}{\displaystyle\sum_{j}t_{j}}, a_{i}^{[L]} = \dfrac{t_{i}}{\displaystyle\sum_{j}t_{j}}$
- Softmax regression generalizes logistic regression to $C$ classes.
- Loss function
    - $L(\hat{y}, y) = -\displaystyle\sum_{j}y_{j}\log\hat{y}_{j}$
- Cost function
    - $J(w^{[1]}, b^{[1]}, \dots) = \dfrac{1}{m}\displaystyle\sum_{i=1}^{m}L(\hat{y}^{(i)}, y^{(i)})$
    
$z^{[L]} \rightarrow a^{[L]} = \hat{y} \rightarrow L(\hat{y}, y)$
- Backprop: $dz^{[L]} = \hat{y} - y$
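
A minimal sketch of the softmax layer, the loss, and the gradient with respect to $z^{[L]}$ for one example with $C = 3$. The logits and label are hypothetical. Subtracting the maximum before exponentiating is an addition for numerical stability; it does not change the result.

```python
import math


def softmax(z):
    shift = max(z)  # stability trick (an addition here); the output is unchanged
    t = [math.exp(zi - shift) for zi in z]
    total = sum(t)
    return [ti / total for ti in t]


def cross_entropy(y_hat, y):
    """L(y_hat, y) = -sum_j y_j * log(y_hat_j) for a one-hot label y."""
    return -sum(yj * math.log(pj) for yj, pj in zip(y, y_hat))


z = [2.0, 1.0, 0.1]  # hypothetical logits z^[L]
y = [1, 0, 0]        # one-hot label
y_hat = softmax(z)
loss = cross_entropy(y_hat, y)
dz = [pj - yj for pj, yj in zip(y_hat, y)]  # dz^[L] = y_hat - y

print(y_hat, loss, dz)
```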

# Activation function

- Let the network learn complex non-linear functions.
- Without them, we are just stacking linear layers, which leads to learning a linear function.
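
To see why, write out two layers with no activation between them (reusing the $w, b$ notation from the softmax section, with $x$ as the input):

$$a^{[2]} = w^{[2]}\left(w^{[1]}x + b^{[1]}\right) + b^{[2]} = \left(w^{[2]}w^{[1]}\right)x + \left(w^{[2]}b^{[1]} + b^{[2]}\right)$$

The result is again of the form $w'x + b'$, so the extra layer adds no expressive power.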

## Sigmoid
- Used in binary classification.
- An activation function.
- Limits the output range to between $0$ and $1$.
- Derivative for large positive or negative inputs is near $0$.
    - Leads to vanishing gradients.
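
A quick numerical look at the derivative, $\sigma'(x) = \sigma(x)(1 - \sigma(x))$. The sample inputs are arbitrary.

```python
import math


def sigmoid(x):
    return 1 / (1 + math.exp(-x))


def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)


for x in (-10, 0, 10):
    print(x, sigmoid_derivative(x))  # largest at 0, close to 0 at both extremes
```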
    
## Softmax
- Used in multiclass classification.    
    
## ReLU (Rectified linear unit)
- Used for non-linearity of model.
- Mitigates the vanishing gradient problem (the gradient is $1$ for positive inputs).
- For roughly zero-centered inputs, about half of the units output $0$, so fewer neurons pass inputs to the next layer, making the network lighter.

## Linear
- Used in regression models.

# Bias and variance

Bias
- How far off model prediction is from correct value.
- Error from approximating the true underlying function.
- Difference between predicted and actual value.

Variance
- Variability of model prediction for given data point.
- Sensitivity to changes in training data.
- Overfitting: model works well on training data, but doesn't generalize well on unseen data.

Ex. election survey
- Surveying from a phonebook is source of bias.
- Small sample size is source of variance.  

Need to find right balance without overfitting or underfitting the data.

## Why human-level performance

- While ML is worse than human, you can
    - Get labeled data from human.
    - Gain insight from manual error analysis. (why did a person get this right?)
    - Better analysis of bias/variance.
    
## Avoidable bias

- Human error as a proxy for Bayes error.
- Gap between human and training error: avoidable bias.
- Gap between training and dev error: variance.
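
The two gaps can be computed directly. The function name and the error rates below are hypothetical; the rule of focusing on whichever gap is larger is a simplification.

```python
def diagnose(human_error, train_error, dev_error):
    """Return (avoidable_bias, variance, which gap to work on first)."""
    avoidable_bias = train_error - human_error
    variance = dev_error - train_error
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus


# Hypothetical error rates: human, training, dev.
print(diagnose(0.01, 0.08, 0.10))   # large gap to human level: focus on bias
print(diagnose(0.075, 0.08, 0.10))  # training is near human level: focus on variance
```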

## Two fundamental assumptions of supervised learning

- You can fit the training set pretty well ~ avoidable bias.
- Training set performance generalizes pretty well to dev/test set ~ variance.
- Avoidable bias
    - Train a bigger model.
    - Train longer / use better optimization algorithms.
    - NN architecture / hyperparameters search.
- Dev error
    - More data.
    - Regularization.
    - NN architecture / hyperparameters search.
- Increasing the regularization strength $\lambda$ decreases variance; decreasing $\lambda$ decreases bias.
- More features decrease bias but increase variance. Fewer features decrease variance but increase bias.

## Approaches

Linear model
- Regularization is used to decrease variance at the cost of increasing bias.

Neural network
- Variance increases and bias decreases with number of hidden units. Regularization is used.

K-nearest neighbor
- High $k$ leads to high bias and low variance.

Decision tree
- Depth of trees increases variance. Trees are pruned to control variance.

# Sparse data

- L1 regularization.
- Linear regression if linear relationship.
- One-hot encoding.

# Statistical power

- Likelihood that a study will find an effect when there is in fact an effect.
- The higher the statistical power, the less likely a false negative.

# Outlier

- Can be removed during data preparation using standard deviation.

# Anomaly

- About 68% of normally distributed data lies within one standard deviation of the mean.
- About 95% lies within two standard deviations.
- About 99.7% lies within three standard deviations.

Statistical method
- Consider a data point with z-score $|z| \ge 3$ an outlier and a likely anomaly.
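
A minimal sketch of the z-score rule using the standard library. The function name and readings are hypothetical, and the sketch assumes the values are not all identical (otherwise the standard deviation is zero).

```python
import statistics


def z_score_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score is at least `threshold`."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mean) / std >= threshold]


# Hypothetical readings: 19 values near 10 and one extreme value.
values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 10, 11, 9, 10, 12, 10, 11, 9, 10, 10, 50]
print(z_score_outliers(values))
```

Note that the extreme value itself inflates the mean and standard deviation, so in very small samples no point may reach a z-score of 3.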

Metric method
- A point is considered anomaly if removing it significantly improves the model.
- Outlier score is a degree that a point doesn't belong to a cluster.

# Gradient boosting

- Relies on regression trees, each fitted to minimize MSE.
- Greedy algorithm: each tree is built starting from the root. For each leaf, the split is selected to minimize MSE at this step.
- Build collection of trees one by one. Then, predictions of individual trees are summed.
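
A compact sketch of the idea on one-dimensional data. It makes several simplifying assumptions that are not in the notes above: each tree is a depth-1 stump, each new tree is fitted to the current residuals (the negative gradient of squared error), and tree outputs are scaled by a learning rate before being summed. All names and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree: one threshold on x, chosen to minimize squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, split, left_value, right_value = best
    return lambda x: left_value if x <= split else right_value


def gradient_boost(xs, ys, n_trees=20, learning_rate=0.5):
    base = sum(ys) / len(ys)  # start from the mean prediction
    predictions = [base] * len(ys)
    trees = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)  # greedy: best split for this step only
        trees.append(tree)
        predictions = [p + learning_rate * tree(x) for p, x in zip(predictions, xs)]
    # Final prediction: base value plus the (scaled) sum of all trees.
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
mean_y = sum(ys) / len(ys)
baseline_mse = mse(lambda x: mean_y, xs, ys)
boosted_mse = mse(gradient_boost(xs, ys), xs, ys)
print(baseline_mse, boosted_mse)  # training error drops as trees are added
```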

# Hyperparameters

- Should use random sampling to choose the number of layers, number of features, etc.
- Scale parameters accordingly.
    - For example, $\alpha = 0.0001 \dots 1$
        - Use log scale such that $0.0001, 0.001, 0.01, 0.1, 1$
    - For example, $\beta = 0.9 \dots 0.999$
        - Use $1-\beta$ such that $0.1, 0.01, 0.001$
- Panda: babysit one model.
- Caviar: train many models in parallel.
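
A sketch of the two scaling tricks above: sample the exponent uniformly instead of the value itself. The function names are hypothetical; the ranges are the ones listed above.

```python
import math
import random


def sample_learning_rate(rng, low=1e-4, high=1.0):
    """Sample alpha uniformly on a log scale between low and high."""
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return 10 ** exponent


def sample_beta(rng, low=0.9, high=0.999):
    """Sample 1 - beta on a log scale, so values near 1 get enough resolution."""
    exponent = rng.uniform(math.log10(1 - high), math.log10(1 - low))
    return 1 - 10 ** exponent


rng = random.Random(0)
alphas = [sample_learning_rate(rng) for _ in range(5)]
betas = [sample_beta(rng) for _ in range(5)]
print(alphas)
print(betas)
```

With uniform sampling on the raw scale, about 90% of the $\alpha$ samples would fall between $0.1$ and $1$; on the log scale each decade gets an equal share.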

## Hyperparameter tuning
- Grid search
- Random search
- Bayesian optimization (often needs fewer trials than the two above, because it uses past results to choose the next configuration)

# Project workflow

1. What is business objective?
    - Increase revenue, win more customers?
2. Define problem
    - Outline the gap we are trying to solve. 
3. Can the problem be solved without data science?
    - For example, just recommend top N items based on very simple logic.
4. Review existing ML
    - No need to re-invent the wheel.
5. Setup metrics.
    - What does it mean to be successful and not successful?
6. Exploratory data analysis
    - See what data is like via lots of plotting.
7. Partition data into 3 sets.
    - Train/dev/test.
8. Preprocess
    - data cleaning, transformation, etc.
9. Feature engineering
    - Requires domain knowledge. Can be minimum if using deep learning.
10. Develop model
    - Choose algorithms, hyperparameters, etc.
11. Ensemble
    - Beware. Some ensembles are too complex to put into production.
12. Deploy/Monitor model
    - Continue iterating afterwards.
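
Step 7 can be sketched as a shuffled three-way split. The function name, the 80/10/10 proportions and the fixed seed are assumptions for illustration; for time series, a chronological split is usually more appropriate than shuffling.

```python
import random


def train_dev_test_split(data, dev_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle the data once, then cut it into train, dev and test sets."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_dev = int(len(items) * dev_fraction)
    n_test = int(len(items) * test_fraction)
    dev = items[:n_dev]
    test = items[n_dev:n_dev + n_test]
    train = items[n_dev + n_test:]
    return train, dev, test


train, dev, test = train_dev_test_split(range(100))
print(len(train), len(dev), len(test))  # 80 10 10
```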