# Chapter 4: Training Data

## Sampling

- To train anything, you need data. But training on everything is expensive and slow
- So you sample your data to iterate faster
- There are many ways to sample, but the entire point of sampling is that the sample should be representative of the population of interest. Make sure you don't end up with sampling bias!

- Types of sampling
    - Non-probability sampling: Selection is not based on any probability criteria, so these are all ways of saying that you sample arbitrarily
        - Convenience sample: Whatever is available
        - Snowball sample: Future samples based on current samples
        - Judgement sample: Someone decides what to sample
        - Quota sample: Sample based on quota, but no randomisation

    - Simple random sample
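    - Stratified sample: Split the population into groups (strata) and sample from each group separately, so that every group is represented
    - Weighted sample: Give each item a weight that determines its probability of being selected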
    - Reservoir sample: This is an interesting one, especially for streaming data
        - Problem: You have an incoming stream of tweets, and you want to sample some $K$ of them. You don't know how many tweets there are, and you cannot fit them all into memory. How can you make sure that every tweet has equal probability of selection, no matter when you stop the sampling?
        - Solution:
            - Make an array of size $K$. Put the first $K$ elements into the array
            - For each later element, the $n$-th in the stream, generate a random integer $i$ such that $1 \le i \le n$
            - If $1 \le i \le K$, replace the $i$-th element of the array with the new element. Otherwise, discard the new element
            - After $N$ elements have arrived, every element seen so far has a $\frac{K}{N}$ chance of being in the array

    - Importance sample: See subsection
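
The reservoir sampling steps above can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: the function name `reservoir_sample` and the optional `rng` parameter are assumptions made for the example, and the "stream" is simulated with a `range`.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items drawn uniformly at random from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Draw i uniformly from 1..n, so the new item enters with probability k/n.
            i = rng.randint(1, n)
            if i <= k:
                reservoir[i - 1] = item
    return reservoir


# Simulated stream of 10,000 "tweets" (here just integers), seeded for repeatability.
sample = reservoir_sample(range(10_000), k=5, rng=random.Random(0))
print(sample)
```

Only the $K$-sized array is ever held in memory, and you can stop consuming the stream at any point and still have a uniform sample of everything seen so far.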

### Importance Sampling

- Importance sampling is a very important concept to learn, so it gets its own subsection

## Labeling

- Labelling is a core function of ML, because supervised learning is entirely dependent on label quality
- How to label?
    - Hand labelling: Expensive, not private, and slow
    - Label multiplicity: To augment hand-labelling, you often gather labels from multiple sources with different levels of expertise and accuracy. Worse still, the labels may contradict each other
        - Data Lineage: Without a good sense of your data lineage (where each sample and its label came from), you don't know when your data is reliable. This can cause your model to fail without you understanding why
    - Natural labels: As far as possible, try to get labels from "nature", or at least from actions that people perform spontaneously
    - Window length: Your labels may not arrive instantaneously. People read books/watch videos at different speeds, so whether a label is present depends on quality, but also on how long you wait for people to finish consuming the item!

- How to handle missing labels?

| Method | How | Ground truths required? |
| --- | --- | --- |
| Weak supervision | Use heuristics to generate labels | No, but a small number of labels is recommended to guide the development of heuristics |
| Semi-supervision | Structural assumptions to generate labels | Yes, a small number of initial labels as seeds. Use the small amount of labelled data to train a model, predict on unlabelled data, and add the newly labelled data (with high predicted confidence) to the training set |
| Transfer learning | Use pre-trained models for new task | No if zero-shot learning. Yes for fine-tuning, but less than supervised case |
| Active learning | Label data most useful to your model | Yes. The idea here is to allow your machine learning model to send back queries to the human labeller to ask for more information on samples it is unsure about. |
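
As an illustration of the weak supervision row, here is a minimal sketch of heuristic labelling with a majority vote. Everything in it is hypothetical: the spam/ham task, the three labelling functions, and the rule of abstaining on ties are assumptions chosen for the example, not a prescribed method.

```python
from collections import Counter

ABSTAIN = None  # a heuristic returns this when it has no opinion


# Hypothetical heuristics for a spam/ham task; each one is noisy on its own.
def lf_contains_free(text):
    return "spam" if "free" in text.lower() else ABSTAIN


def lf_many_exclamations(text):
    return "spam" if text.count("!") >= 3 else ABSTAIN


def lf_greeting(text):
    return "ham" if text.lower().startswith(("hi", "hello")) else ABSTAIN


LABELLING_FUNCTIONS = [lf_contains_free, lf_many_exclamations, lf_greeting]


def weak_label(text):
    """Combine heuristic votes by majority; abstain on no votes or a tie."""
    votes = [lf(text) for lf in LABELLING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    (top, top_count), *rest = Counter(votes).most_common()
    if rest and rest[0][1] == top_count:
        return ABSTAIN  # conflicting heuristics, leave unlabelled
    return top


for text in ["FREE money!!!", "Hello, see you at lunch", "Meeting at 3pm"]:
    print(text, "->", weak_label(text))
```

A small set of hand-labelled examples is still useful here, since it lets you check how often each heuristic is wrong before trusting its votes.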

## Class Imbalance

- The usual imbalanced class problems apply
    - Need for better evaluation (e.g. if the minority class is 0.01% of the data, predicting the majority class for everything gives you 99.99% accuracy)
    - Insufficient data to learn the minority class
    - Asymmetric cost of error: mispredicting the rare class may be much more costly than mispredicting the majority class
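
To make the evaluation problem concrete, the sketch below scores a "model" that always predicts the majority class. The dataset (10,000 examples with a single positive) is made up for illustration, and treating precision and F1 as 0 when they are undefined is a convention assumed here.

```python
# Hypothetical dataset: 10,000 examples, only 1 of them in the rare (positive) class.
y_true = [1] + [0] * 9_999
# A "model" that always predicts the majority class.
y_pred = [0] * 10_000


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class; undefined ratios are reported as 0.0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"accuracy={accuracy:.4f} precision={precision} recall={recall} f1={f1}")
# accuracy=0.9999 precision=0.0 recall=0.0 f1=0.0
```

Accuracy looks excellent while recall on the rare class is zero, which is why accuracy alone is a poor metric under heavy imbalance.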

- Solutions to imbalance
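    - Brute force: Bigger NNs seem to build better world representations, so they deal with imbalance better
    - Using better evaluation metrics: F1, precision, and recall instead of accuracy
    - Resampling data (oversampling the minority class or undersampling the majority class): Please don't. Undersampling throws away data, oversampling risks overfitting to the minority class, and the results are terrible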
    - Modify loss function: The usual loss function is $L(X; \theta) = \sum_x \frac{1}{N} L(x; \theta)$, which weights every observation equally
        - Cost sensitive learning: $L(x; \theta) = \sum_j C_{ij} P(j | x; \theta)$ for an observation $x$ with true label $i$
            - Compute the cost of classifying an observation with label $i$ as label $j$...
            - ...weighted by the probability of labelling it $j$
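        - Class-weighted loss: Same idea as cost sensitive learning, but weight each observation's loss inversely to the prevalence of its class, so rare classes count for more, e.g. $L(X; \theta) = \sum_i W_{y_i} L(x_i; \theta)$ with $W_c = \frac{N}{N_c}$, where $N_c$ is the number of observations in class $c$
        - Focal loss: Increase the weight of observations that the model has a lower probability of getting right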
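
The three loss modifications above can be sketched in plain Python. This is an illustrative sketch under assumptions: the function names, the example cost matrix, and the default `gamma=2.0` are choices made here. Focal loss is written in its common form $-(1 - p_t)^{\gamma} \log p_t$, where $p_t$ is the probability the model assigns to the true class.

```python
import math
from collections import Counter


def cost_sensitive_loss(true_class, probs, cost):
    """Expected cost for one observation: sum_j C[i][j] * P(j | x), with i the true class."""
    return sum(cost[true_class][j] * p for j, p in enumerate(probs))


def class_weights(labels):
    """W_c = N / N_c, so rarer classes get larger weights."""
    counts = Counter(labels)
    return {c: len(labels) / n_c for c, n_c in counts.items()}


def cross_entropy(p_true):
    """Standard per-observation loss given the probability of the true class."""
    return -math.log(p_true)


def class_weighted_loss(labels, p_true, weights):
    """Mean cross-entropy with each observation scaled by its class weight."""
    total = sum(weights[y] * cross_entropy(p) for y, p in zip(labels, p_true))
    return total / len(labels)


def focal_loss(p_true, gamma=2.0):
    """Cross-entropy scaled by (1 - p_true)^gamma, shrinking the loss on easy examples."""
    return -((1 - p_true) ** gamma) * math.log(p_true)


# Hypothetical costs: missing the rare class 1 is 10x worse than a false alarm.
cost = [[0, 1], [10, 0]]
print(cost_sensitive_loss(1, [0.3, 0.7], cost))

labels = [0] * 9 + [1]
weights = class_weights(labels)
print(weights)  # class 1 gets weight 10.0, class 0 gets about 1.11

# An easy example (p=0.9) is down-weighted far more than a hard one (p=0.1).
print(focal_loss(0.9) / cross_entropy(0.9), focal_loss(0.1) / cross_entropy(0.1))
```

With `gamma=0` the focal loss reduces to plain cross-entropy, which is a quick sanity check when experimenting with it.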

## Data Augmentation

- Label preserving augmentation
    - Rotate/crop/modify some image, while preserving labels

- Perturbation
    - Randomly add noise to the data while retaining the same label

- Data synthesis
    - Use conversational AI to bootstrap training data in NLP
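
As a minimal sketch of the perturbation idea, the code below adds Gaussian noise to numeric features while copying the label unchanged. The toy dataset, the function names, and `noise_std=0.05` are assumptions for illustration; whether the label really survives the perturbation depends on how large the noise is relative to your features.

```python
import random


def perturb(features, label, noise_std=0.05, rng=None):
    """Return a noisy copy of the features paired with the same label."""
    rng = rng or random.Random()
    noisy = [x + rng.gauss(0.0, noise_std) for x in features]
    return noisy, label


def augment(dataset, copies=2, noise_std=0.05, seed=0):
    """Keep the original examples and append `copies` perturbed versions of each."""
    rng = random.Random(seed)
    augmented = list(dataset)
    for features, label in dataset:
        for _ in range(copies):
            augmented.append(perturb(features, label, noise_std, rng))
    return augmented


# Hypothetical two-example dataset of (features, label) pairs.
dataset = [([0.2, 0.8], "cat"), ([0.9, 0.1], "dog")]
augmented = augment(dataset, copies=2)
print(len(augmented))  # 6: the 2 originals plus 2 noisy copies of each
```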