# Chapter 1 - General ML ideas

Some important terms & answers to exercises:

- **Machine learning**: programming computers so that they can learn/make inferences from the data (rather than the programmer having to define every rule / `if` statement) - i.e. they can generalise


- **When is ML useful?**

    - When the problem has complex solutions that are hard to express traditionally
    - When the solution requires a lot of fine-tuning, or long lists of rules
    - When adaptation to new data is required
    - When you want new insights into complex problems


- **Classification**: classify a data point with a label


- **Regression**: predict a target numeric value of a new data point, given a set of features


- **Supervised** vs **unsupervised** vs **semisupervised** learning:

    - **Supervised**: Training data is labelled
    - **Unsupervised**: training data is not labelled; program tries to find structure on its own, e.g. clustering, dimensionality reduction, association rule learning, visualisation
    - **Semisupervised**: mixture of the above, typically a little labelled data plus a lot of unlabelled data, e.g. tagging photos: some people identified, others not.
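
To make the unsupervised case above concrete, here is a minimal sketch of clustering: a tiny 1-D k-means in plain Python. The function name `kmeans_1d`, the data and the starting centroids are all made up for illustration; a real project would normally use a library implementation.

```python
def kmeans_1d(points, centroids, n_iter=10):
    """Tiny 1-D k-means: no labels are used, the groups emerge from the data."""
    centroids = list(centroids)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its old centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabelled data with two obvious groups.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.5]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```

Nothing in `points` says which group a value belongs to; the algorithm only sees distances, which is the key difference from the supervised case.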


- **Reinforcement learning**: "agent" (learning system) performs actions in an environment and gets rewarded or penalised -> learns a "policy" (strategy) that gets the most reward over time, e.g. playing games


- **Batch** vs **online** learning:

    - **Batch**: can't learn incrementally - must be trained with all data, then launched without learning any more (offline learning). Can be slow, resource-hungry.
    - **Online**: train incrementally - feed it smaller chunks of data: mini-batches. Good for big datasets that can't fit in memory (out-of-core learning). Has a learning rate: high rate = quick adaptation, but also quickly forgets older data; low rate = slower to learn, but less sensitive to noise in new data.
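
A minimal sketch of the online idea, assuming a simple linear model `y = w*x + b` trained by gradient descent on mini-batches. The helper names (`sgd_step`, `mini_batches`), the synthetic data stream and the learning rate of 0.3 are illustrative assumptions, not something from the chapter.

```python
def sgd_step(w, b, batch, learning_rate):
    """One gradient step for y = w*x + b on a mini-batch (mean squared error)."""
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return w - learning_rate * grad_w, b - learning_rate * grad_b


def mini_batches(stream, size):
    """Yield consecutive chunks so the full dataset never has to sit in memory."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


# Hypothetical data stream following y = 2x + 1, with x in [0, 1).
stream = ((i % 10 / 10, 2 * (i % 10 / 10) + 1) for i in range(5000))

w, b = 0.0, 0.0
for batch in mini_batches(stream, size=10):
    # The model is updated after every chunk; old chunks are never revisited.
    w, b = sgd_step(w, b, batch, learning_rate=0.3)
print(w, b)
```

The `learning_rate` argument is the knob described above: a larger value makes each new mini-batch move the parameters further, so the model adapts faster but also lets recent data override what it learned earlier.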
    

- **Instance-based** vs **model-based** learning: (how does the system generalise?)

    - **Instance-based**: memorise the training examples, then for a new data point find the most "similar" learned examples (using a similarity measure) and make the inference from those (e.g. k-nearest neighbours)
    - **Model-based**: build a model from all training data, then use model to make predictions
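
The two approaches can be contrasted in a few lines. This is a rough sketch only: `knn_predict` and `fit_line` are hypothetical helper names, the data is invented, and k-nearest-neighbours averaging and least-squares line fitting are just one example of each family.

```python
def knn_predict(train, x_new, k=3):
    """Instance-based: average the targets of the k stored examples nearest to x_new."""
    nearest = sorted(train, key=lambda xy: abs(xy[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k


def fit_line(train):
    """Model-based: fit y = slope*x + intercept by ordinary least squares."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical training data lying exactly on y = 2x.
train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0), (5.0, 10.0)]
slope, intercept = fit_line(train)

# Inside the training range both approaches agree.
knn_inside = knn_predict(train, x_new=3.0)    # neighbours x = 2, 3, 4
line_inside = slope * 3.0 + intercept

# Far outside the range they differ: k-NN can only reuse stored targets,
# while the fitted line extrapolates its learned trend.
knn_outside = knn_predict(train, x_new=10.0)  # neighbours x = 3, 4, 5
line_outside = slope * 10.0 + intercept
print(knn_inside, line_inside, knn_outside, line_outside)
```

Note that the instance-based version needs the whole training set at prediction time, whereas the model-based version only keeps two numbers (`slope`, `intercept`).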


- **Challenges**:

    - **Too little training data**: need thousands of examples even for basic problems, many more for complex ones. c.f. "Unreasonable effectiveness of data": more data can matter more than a more complex model
    - **Non-representative training data**: model has to be able to generalise: training data must be representative of new data you want to generalise to. 
        - _Sampling noise_ : when the sample is too small, it can be non-representative just by chance
        - _Sampling bias_ : when sampling method is flawed - bias in the dataset (e.g. who tends to fill out questionnaires?)
    - **Poor-quality data**: crap in = crap out. Data needs cleaning before modelling (missing values? clear outliers?)
        - _Feature engineering_ : coming up with a good set of features to train on (covers the two below)
        - _Feature selection_ : selecting the most useful of the existing features
        - _Feature extraction_ : creating new features from existing ones (e.g. via dimensionality reduction, or physical principles)
    - **Overfitting training data**: model performs well on training data, but doesn't generalise well (too complex relative to the amount & noisiness of the data)
        - _Regularisation_ : constraining a model to make it simpler/more generalisable, reducing the chance of overfitting
    - **Underfitting training data**: model is too simple to learn the underlying structure of the data
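
As a small illustration of regularisation, the sketch below uses an L2 (ridge) penalty on a one-parameter model `y = w*x`. The notes don't say which form of regularisation is meant, so the ridge penalty, the name `ridge_slope` and the data are assumptions chosen for simplicity.

```python
def ridge_slope(data, alpha):
    """Slope w minimising sum((y - w*x)^2) + alpha * w^2 (no intercept term).

    Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + alpha).
    """
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + alpha)


# Hypothetical noisy data roughly following y = x.
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2)]
for alpha in (0.0, 1.0, 10.0):
    # alpha is a hyperparameter: larger alpha = stronger constraint on w.
    print(alpha, ridge_slope(data, alpha))
```

With `alpha = 0` this is plain least squares; increasing `alpha` pulls the slope towards zero, i.e. towards a simpler model. Too large a value tips over into underfitting.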
    
    
- How to **test & validate model**? Need to try it out on "new" cases. Use whole dataset, but partition into 3:

    - **Training set**: to train model on, should be large
    - **Validation set**: to compare models & tune hyperparameters (a hyperparameter is a parameter of the learning algorithm, not of the model)
    - **Test set**: to actually evaluate performance on new data, once parameters & hyperparameters determined
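
The three-way partition might look roughly like this in plain Python. The 60/20/20 proportions, the fixed seed and the function name `train_val_test_split` are arbitrary choices for the sketch, not recommendations from the chapter.

```python
import random


def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle a copy of rows, then cut it into train / validation / test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed so the split is reproducible
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test


rows = list(range(100))  # stand-in for 100 data points
train, val, test = train_val_test_split(rows)
print(len(train), len(val), len(test))
```

Every row ends up in exactly one of the three sets, so nothing used for tuning or final evaluation has been seen during training.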


- Need **validation set** to avoid optimising for the specific test set


- What about natural statistical variation? **Cross-validation**: repeat the process several times, each time holding out a different part of the dataset as the validation set and training on the rest -> average the scores
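
A minimal sketch of k-fold cross-validation, assuming a deliberately trivial "model" that just predicts the mean of its training targets, scored by mean absolute error. The names `k_fold_indices` and `cross_val_mae` and the data are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is held out exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def cross_val_mae(targets, k):
    """Average validation error of a 'predict the training mean' model over k folds."""
    fold_errors = []
    for train_idx, val_idx in k_fold_indices(len(targets), k):
        prediction = sum(targets[i] for i in train_idx) / len(train_idx)
        mae = sum(abs(targets[i] - prediction) for i in val_idx) / len(val_idx)
        fold_errors.append(mae)
    return sum(fold_errors) / k


targets = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(cross_val_mae(targets, k=3))
```

The folds here are taken in order without shuffling, which keeps the sketch deterministic; for data with no meaningful ordering you would normally shuffle first.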


- Subtleties when splitting dataset: e.g. for time-dependent data, a random split mixes future data into the training set and can give misleadingly good results
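
For time-dependent data, one common alternative to a random split is a chronological one: train on the past, test on the most recent data. A minimal sketch, assuming each record is a `(timestamp, value)` tuple (the record layout and the name `time_split` are made up here):

```python
def time_split(records, train_frac=0.8):
    """Sort by timestamp, train on the earliest part, test on the most recent."""
    ordered = sorted(records, key=lambda r: r[0])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]


# Hypothetical (timestamp, value) records arriving out of order.
records = [(5, "e"), (1, "a"), (4, "d"), (2, "b"), (3, "c")]
train, test = time_split(records)
print(train)
print(test)
```

Every test record is later than every training record, so the evaluation mimics predicting the future from the past.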


- What about **imbalanced classification** / data mismatch (more of one class than another; often the smaller class is the "true" thing to model)? Ensure the validation & test sets are all/predominantly the "true" thing

    - But then the training set differs from the validation/test sets - if the model performs poorly on validation, how to tell if it's (a) overfitting the training set, or (b) not coping with the mismatch between the training data and the "true" thing?
    - Split training set into 2: training set & train-dev set. Train model on training set, then test on train-dev set (which _does_ have a v. similar composition to the training set). If it does poorly on the train-dev set -> overfitting. If it does well on the train-dev set but poorly on validation -> the problem is the data mismatch
    - See https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
    - Make sure the metric takes into account the overwhelming number of one type vs the other (e.g. if 90% A vs 10% B, always labelling everything as A instantly gets 90% accuracy)
    - Adapt the metric (e.g. AUC), e.g. compute precision per class, or only for the minority class
    - Drop some of one type to rebalance (**under-sampling**)? Or augment smaller type with modified copies? (e.g. rotate, stretch, crop images) (**over-sampling**)
    - Penalise models more for classifying minor type incorrectly
    - Convert into anomaly / outlier detection model?
    - https://www.reddit.com/r/MachineLearning/comments/12evgi/classification_when_80_of_my_training_set_is_of/
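
Two of the points above (the misleading 90% accuracy, and under-sampling) can be shown with a short sketch. The helper names `accuracy`, `recall` and `under_sample` are hypothetical, and random under-sampling down to the size of the rarest class is only one of several possible rebalancing schemes.

```python
import random
from collections import Counter


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of the true `label` examples that were predicted as `label`."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in relevant) / len(relevant)


def under_sample(labels, seed=0):
    """Return indices of a subset where every class is as frequent as the rarest one."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    kept = []
    for idx in by_class.values():
        kept.extend(rng.sample(idx, n_min))  # randomly drop the surplus
    return sorted(kept)


# 90% A vs 10% B, and a "classifier" that always answers A.
y_true = ["A"] * 90 + ["B"] * 10
y_pred = ["A"] * 100

print(accuracy(y_true, y_pred))     # 0.9, looks fine
print(recall(y_true, y_pred, "B"))  # 0.0, never finds the minority class

balanced = under_sample(y_true)
print(Counter(y_true[i] for i in balanced))
```

The per-class recall exposes what the overall accuracy hides. Under-sampling throws away majority-class data, so it is mainly an option when there is plenty of it.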
    

- **No Free Lunch**: a priori, with no assumptions about the data, no model is guaranteed to work better than any other - you have to evaluate them explicitly, or make some reasonable assumptions about the data