### Machine Learning
Typical workflow: identify problem, analyze problem, create detection algo for problem, program can flag issue areas, repeat until good enough to launch!
ML also good for problems w/ no algo to solve + can help humans to learn by showing us patterns found thru large amounts of data (data mining)
ML good at adapting to new environments, so can analyze fluctuating data too 

Applications:
1. image classification (CNNs, ch14)
2. detecting tumors in brain scans (Semantic segmentation, CNNs, ch14)
3. classify news articles (NLP, ch16)
4. forecasting revenue (regression, ch4, ch 10, ch 15/16)
5. react to voice commands (speech recognition, ch15/16)
6. detect credit card fraud (anomaly detection, ch9)
7. segment clients based on purchasing patterns (clustering, ch9)
8. represent complex n-dim dataset clearly in diagram (ch 8)
9. recommend product for client based on past purchases (artificial neural network, ch 10)
10. intelligent bot for game (reinforcement learning, ch 18)

### Types of ML Systems
1. trained w/ human supervision or not (supervised, unsupervised, semisupervised, reinforcement learning)
2. can learn incrementally (online vs batch learning)
3. work by comparing new data to old OR detect patterns in training data then build predictive model (instance-based vs model-based learning)

Can combine multiple criteria!

#### Supervised/unsupervised Learning
1. supervised learning
    - feed algorithm desired solutions ("labels")
    - e.g. classification, predicting a target numeric value (regression, needs many examples of predictors and their labels)
    - important algos:
       - k-nearest neighbors
       - linear regression
       - logistic regression
       - support vector machines (SVMs)
       - decision trees and random forests
       - neural networks

2. unsupervised learning
    - training data is unlabeled (not given any solutions)
    - important algos:
          - Clustering
              - k-means
              - DBSCAN
              - hierarchical cluster analysis (HCA)
          - anomaly detection and novelty detection
                  - system is trained to see normal instances, so when it sees new instance, can tell if it looks like normal
              - one-class SVM
              - isolation forest
          - visualization and dimensionality reduction
                  - visualization: feed lots of complex data, and can output 2d/3d representation to be plotted, tries to preserve as much structure as possible
                  - dim red: simplify data w/o losing too much info, e.g. combine stuff
                      - good to use dim red b4 feeding data to another ML algo 
              - principal component analysis (PCA)
              - kernel PCA
              - locally linear embedding (LLE)
              - t-distributed stochastic neighbor embedding (t-SNE)
          - association rule learning
                  - find interesting relations between attributes in a lot of data
              - apriori
              - eclat

3. semisupervised learning
    - labeling data is very costly/time consuming, so have only a few labeled and most unlabeled
    - e.g. Google Photos automatically recognizes same person showing up in multiple photos (unsupervised clustering), then needs only 1 label per person to name them in every photo (supervised part)

4. reinforcement learning
    - very diff from (un)(semi)supervised learning
        - learning system = agent
        - agent observes environment + performs actions + can get rewards or penalties
        - agent needs to learn by itself what is the best policy/strategy to get most reward
    - e.g. AlphaGo analyzed millions of games, then played many games against itself
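
To make the unsupervised idea above concrete, here is a minimal k-means sketch in plain Python. The data points and the starting centroids are made up for illustration, and this is a toy version of the algorithm, not a library implementation. The point is that the algo groups the data without ever seeing a label.

```python
import math

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points]

def kmeans(points, centroids, max_iters=20):
    """Plain k-means; `centroids` holds the starting guesses (one tuple per cluster)."""
    for _ in range(max_iters):
        labels = assign(points, centroids)
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                # move the centroid to the mean of its members
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # empty cluster: leave centroid where it is
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    return centroids, assign(points, centroids)

# Made-up, unlabeled 2-D data (assumption): there is no "answer" column anywhere
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, labels = kmeans(points, centroids=[(0, 0), (10, 10)])
print(labels)
```

The cluster indices that come out are arbitrary IDs; a human (or a few labeled examples, as in the semisupervised case) still has to say what each cluster means.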
  

#### Batch/online Learning
1. Batch learning
    - system cannot learn incrementally, needs to be trained w/ all avail data
    - usually train offline b/c takes a long time
    - if want batch system to know about new data, need to train new system from scratch on WHOLE dataset, then replace old sys
       - can be automated so not so bad
    - cons:
        - lots of computing power (CPU, mem, i/o)
        - not good for rapidly changing data sets
2. Online learning
    - train system incrementally by feeding it data instances sequentially (individual or mini batch)
    - fast and cheap for each step
    - can also train systems on huge datasets that can't fit onto machine main mem (can work thru it incrementally)
    - learning rate = how fast sys adapts to changing data
          - high learning rate = high turnover (learn new quickly, forget old quickly)
          - low learning rate = more inertia (learns slower), less sensitive to noise in new data or outliers
    - con:
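          - if bad data is fed in, performance gradually goes down and clients notice (need to monitor input + performance and react fast, e.g. switch learning off or revert to earlier state)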
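
A rough sketch of online learning and the learning-rate tradeoff described above, using plain per-instance gradient descent on a one-feature linear model. The synthetic stream, the target functions, and the learning-rate values are all assumptions picked for illustration.

```python
def online_fit(stream, learning_rate, w=0.0, b=0.0):
    """Update the model y ≈ w*x + b one instance at a time (plain SGD on squared error)."""
    for x, y in stream:
        error = (w * x + b) - y
        w -= learning_rate * error * x
        b -= learning_rate * error
    return w, b

def mse(model, data):
    """Mean squared error of a (w, b) model on a list of (x, y) pairs."""
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

# Synthetic stream (assumption): x cycles through 0.0 .. 0.9, target is y = 2x + 1
xs = [(i % 10) / 10 for i in range(3000)]
old_stream = [(x, 2 * x + 1) for x in xs]
w0, b0 = online_fit(old_stream, learning_rate=0.1)

# The data then changes to y = 4x + 1; feed only 50 new instances
new_stream = [(x, 4 * x + 1) for x in xs[:50]]
fast = online_fit(new_stream, learning_rate=0.5, w=w0, b=b0)
slow = online_fit(new_stream, learning_rate=0.01, w=w0, b=b0)

print("error on new data, high learning rate:", mse(fast, new_stream))
print("error on new data, low learning rate: ", mse(slow, new_stream))
```

The high-rate model adapts to the new data much sooner. The flip side is that it would chase noisy or bad instances just as eagerly, which is the con noted above.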


#### Instance-Based/Model-Based Learning
1. Instance-based learning
    - define a measure of similarity between a new case and the known examples, e.g. spam filter counts the words a new email has in common w/ known spam emails
    - system learns the examples by heart, then generalizes to new cases by using the similarity measure to compare them to the examples it knows the answer to

2. Model-based learning
    - build model using examples + use model to make prediction (way to generalize from set of examples)
    - even with noisy data, can still generalize the data by fitting it with a best-fit model, e.g. linear
    - will have model parameters, tweak params to make model represent any linear function
          - need to define params b4 using model
          - specify performance measure: utility function (defines how good model is) OR cost function (how bad)
              - linear regression = usually use cost
              - performance measure helps you find out what to set params to for best performance
    - linear regression algo fed training examples, algo will find params to make linear model fit best to data (training model)
        - training model = run algo to find params for model for it to be a best fit for training data
    - now that figure out params, can run model to make predictions
    - if after running model, the model doesn't make good predictions, options:
          - use more attributes (e.g. employment rate as factor affecting gdp, instead of only looking at exports)
          - get better quality training data
          - get more powerful model (e.g. polynomial regression model)
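
The two approaches above can be put side by side in a small sketch. The model-based part fits a line by closed-form least squares (i.e. minimizing a mean-squared-error cost). The instance-based part is a k-nearest-neighbors regressor that just averages the closest stored examples. The data values are hypothetical.

```python
def fit_line(xs, ys):
    """Training = find slope/intercept that minimize the squared-error cost."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def knn_predict(xs, ys, x_new, k=3):
    """No training step: keep all examples, average the k most similar ones."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k

# Hypothetical training data (assumption), roughly y = 2x + 1 plus noise
xs = [1, 2, 3, 4, 5, 6]
ys = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]

slope, intercept = fit_line(xs, ys)   # model params found by training
x_new = 3.2
print("model-based prediction:   ", slope * x_new + intercept)
print("instance-based prediction:", knn_predict(xs, ys, x_new))
```

Here the similarity measure is just the absolute distance between x values; for something like emails it would be a different measure, e.g. a count of shared words.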


### Main Challenges of ML 
1. Insufficient quantity of training data
    - even for simple problems need thousands of examples
    - need millions of examples for complex problems (e.g. image, speech recognition) unless can reuse parts of existing model
      
2. Nonrepresentative Training data
    - training data needs to be representative of the new cases you want model to generalize to (no matter instance-based or model-based learning)
    - e.g. if missing a few countries when finding model of GDP and life satisfaction, then linear line model is skewed and gives inaccurate predictions
    - training set NEEDS to represent cases you want to generalize to
          - set too small = sampling noise (nonrepresentative data b/c of chance)
          - even a very large set can have sampling bias (nonrep if sampling method is flawed)

3. Poor quality data
    - data with errors, outliers, noise (bad measurements) = hard to detect patterns
    - need to clean up training data (throw away bad ones or manually clean)
    - if many samples missing an attribute (e.g. most ppl don't fill out age on survey), then decide if want to ignore the attribute, or fill it w/ median value, or train one model with the attrib and one w/o
  
4. Irrelevant Features
    - sys can only learn if training data have enough relevant features, not too many irrelevant ones
    - need feature engineering: selecting useful features to train on, combining existing features to create more useful one (e.g. use dim red algos), creating new features by gathering new data

5. Overfitting the Training Data
    - bad to overgeneralize ("overfit"), e.g. if 1 Canadian rips you off, doesn't necessarily mean that ALL canadians will rip you off
    - e.g. high-degree polynomial might fit training data closely, but not be as good a predictor as linear model
    - complex models, e.g. neural networks, can detect small patterns in data (e.g. find patterns in noise)
        - e.g. just so happens that many countries w/ highest GDP have "w" in name != that having "w" means high GDP
        - model doesn't know if the "w" rule happened by chance from noise, or if it is a meaningful pattern
    - solutions to overfitting:
        1. simplify model w/ fewer params, fewer attributes in training data, or constraining model
            - "regularization"
                - 2 params = 2 degrees of freedom
                - if set params equal to each other, only 1 degree of freedom: simplified!
                - if allow algo to modify one param, but force keep it small, then algo has b/w 1 and 2 degrees of freedom: simplified!
            - control amt of regularization using hyperparam = param of learning algo (NOT of model)
                - not affected by learning algo itself; must be set b4 training + stays constant during training
                - very large hyperparam = almost flat model (slope close to 0) --> won't overfit, but less likely to find a good solution
        2. more training data
        3. less noise in training data (fix data errors and remove outliers)
     
6. Underfitting the Training Data
    - model too simple to match underlying structure of data
    - solutions:
        1. more powerful model (more params)
        2. feed better features to learning algo (feature engineering)
        3. reduce constraints on model (e.g. lower regularization hyperparam)
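
A small sketch of the regularization hyperparam idea from the overfitting notes above. The notes don't name a specific penalty, so this assumes a ridge-style (L2) penalty on the slope only, with a hypothetical hyperparam called `alpha`. The data is made up.

```python
def fit_ridge_line(xs, ys, alpha):
    """Minimize sum((y - slope*x - intercept)**2) + alpha * slope**2.

    alpha is the regularization hyperparameter: it is chosen before training
    and is not changed by the fit itself. alpha = 0 is plain linear regression.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / (sxx + alpha)          # the penalty shrinks the slope toward 0
    return slope, y_mean - slope * x_mean

# Hypothetical training data (assumption)
xs = [1, 2, 3, 4, 5, 6]
ys = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]

for alpha in [0, 10, 1000, 1e6]:
    slope, intercept = fit_ridge_line(xs, ys, alpha)
    print(f"alpha={alpha:>9}: slope={slope:.4f}, intercept={intercept:.4f}")
```

As `alpha` grows, the slope shrinks toward 0 and the model ends up nearly flat at the mean of the targets. That is the "too much regularization" end of the scale, where the risk flips from overfitting to underfitting.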
     
### Testing and Validating
Need to try out model on new cases and monitor how it does
- need to split data into training set + test set (80%, 20%)
- generalization error = error rate on new cases
- low training error, but high generalization error = overfitting training data
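
A minimal sketch of the 80/20 split and of estimating generalization error on the test set. The helper names, the seed, and the synthetic data are assumptions; the model is a simple least-squares line.

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data, then hold out `test_ratio` of it as the test set."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)   # fixed seed so the split is repeatable
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(points):
    """Closed-form least squares for y ≈ slope*x + intercept."""
    n = len(points)
    x_mean = sum(x for x, _ in points) / n
    y_mean = sum(y for _, y in points) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    sxx = sum((x - x_mean) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def mse(model, points):
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in points) / len(points)

# Synthetic data (assumption): y = 2x + 1 plus Gaussian noise
rng = random.Random(0)
data = [(x, 2 * x + 1 + rng.gauss(0, 1)) for x in range(100)]

train_set, test_set = train_test_split(data)
model = fit_line(train_set)              # the model never sees the test set
print("training error:", mse(model, train_set))
print("test error (estimate of generalization error):", mse(model, test_set))
```

If the training error came out low while the test error was much higher, that would be the overfitting signal described above.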

#### Hyperparameter Tuning and Model Selection
1. how to decide between 2 models: train both and compare how well they generalize using test set
    1. if linear model generalizes better:
        - apply regularization so not overfitting
        - how to choose regularization hyperparam? could train 100 diff models with 100 diff hyperparams 
        - eventually find one hyperparam value where generalization error least 
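        - not good!!! measured generalization error many times on the SAME test set, so model + hyperparam got adapted to that test set; will not work as well on new data (generalization error higher than predicted)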
        - solution: 
            - holdout validation = hold out part of training set to evaluate candidate models, then choose best one 
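            - validation set (aka dev set) = the held-out part; train the candidate models (diff hyperparams) on the reduced training set (full training set minus validation set), then pick the one that does best on the validation set
            - once you get the model w/ best hyperparam value, train best model on FULL training set (including validation set) = final model, then evaluate it on test set to estimate generalization error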
            - summary:
                - training set --> hold out validation set --> train 100 models w/ diff hyperparam values on the reduced training set --> choose the one with lowest error on validation set --> retrain that model on the WHOLE training set (incl. validation set) --> evaluate once on test set to estimate generalization error
        - problems with solution:
            - validation set too small, and model evaluations inaccurate
            - validation set too large, and not enough remaining data to train final model
        - solution to problem:
            - use cross validation: e.g. many small validation sets, every model evaluated once per validation set after trained on rest of data
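
The model-selection steps above, sketched with cross-validation. This assumes a ridge-style penalty with a hypothetical hyperparam `alpha`, synthetic data, and 5 folds. The test set is set aside first and is touched only once, at the very end.

```python
import random

def fit_ridge_line(xs, ys, alpha):
    """Least-squares line with an alpha * slope**2 penalty (ridge-style, assumed)."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / (sxx + alpha)
    return slope, y_mean - slope * x_mean

def mse(model, xs, ys):
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cross_val_mse(xs, ys, alpha, n_folds=5):
    """Average validation error; each fold is held out exactly once."""
    errors = []
    for fold in range(n_folds):
        held_out = set(range(fold, len(xs), n_folds))
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        val_x = [xs[i] for i in sorted(held_out)]
        val_y = [ys[i] for i in sorted(held_out)]
        model = fit_ridge_line(train_x, train_y, alpha)   # train on the rest
        errors.append(mse(model, val_x, val_y))           # evaluate on the fold
    return sum(errors) / n_folds

# Synthetic data (assumption): y = 2x + 1 plus noise; last 20% kept as test set
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(60)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]
train_x, test_x = xs[:48], xs[48:]
train_y, test_y = ys[:48], ys[48:]

candidates = [0, 1, 10, 100, 1e6]        # hyperparam values to compare
best_alpha = min(candidates, key=lambda a: cross_val_mse(train_x, train_y, a))
final_model = fit_ridge_line(train_x, train_y, best_alpha)   # retrain on FULL training set
print("chosen alpha:", best_alpha)
print("test error:", mse(final_model, test_x, test_y))
```

The cost of cross-validation is training time: every candidate gets trained once per fold, so 5 candidates with 5 folds means 25 fits here.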
         
#### Data Mismatch
1. sometimes easy to get a lot of training data, but that data won't perfectly represent data used in production
2. **most important!!!** make sure validation set AND test set are representative of data used in production
3. if train model and result bad, idk if problem is with overfitting training set, or bc mismatch of data
    - solution:
        - hold out part of the training data as a train-dev set; after model trained on training set (NOT on train-dev), evaluate it on train-dev set
        - if does well on train-dev set but poorly on validation set, then problem is from data mismatch
            - can try to preprocess training data so it looks more like production data
            - can try to retrain model using more representative data
        - if does poorly on train-dev set, then overfit training set
            - simplify or regularize model, get more training data, clean up training data
4. No Free Lunch Theorem: no model is a priori guaranteed to work better (all have their own tradeoffs)
    - just have to make reasonable assumptions about data + evaluate only few reasonable models 
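
The train-dev check from the data mismatch notes above can be written down as a tiny decision helper. The function name, the `gap` threshold, and the example error rates are all hypothetical; in practice "poorly" is a judgment call, not a fixed number.

```python
def diagnose(train_err, train_dev_err, val_err, gap=0.05):
    """Rough diagnosis from three error rates (fractions between 0 and 1).

    `gap` is an assumed threshold for "noticeably worse".
    """
    if train_dev_err - train_err > gap:
        # bad on held-out data drawn from the SAME source as the training set
        return "overfitting"
    if val_err - train_dev_err > gap:
        # fine on train-dev, bad on data that looks like production
        return "data mismatch"
    return "ok"

print(diagnose(train_err=0.02, train_dev_err=0.15, val_err=0.16))
print(diagnose(train_err=0.02, train_dev_err=0.03, val_err=0.20))
print(diagnose(train_err=0.02, train_dev_err=0.03, val_err=0.04))
```

The order of the checks matters: overfitting is ruled out first, because a model that already fails on train-dev tells you nothing yet about mismatch.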