### Genetic Algorithm
- a non-greedy algorithm for finding the optimal feature set
- good for when brute forcing is impractical
- tends to converge toward an optimal solution given enough time (no guarantee of reaching the global optimum)
- genetic algorithms rely on mutation, crossover, and selection
- process:
    - start with an initial population of individuals (each individual is a candidate set of features)
    - each iteration is called a generation
    - in each generation, the fitness of every individual is evaluated
        - fitness is usually the value of the objective function in the optimization problem (in our case, the accuracy of the model)
    - the individuals with the highest fitness are selected
        - take the n most fit individuals
    - the selected individuals are mutated and crossed over
        - mutation: randomly change some of the features
            - defends against getting trapped in local optima
        - crossover: combine features from two individuals to create a new individual
    - terminate after a certain number of generations or when the fitness of the population reaches a certain threshold
        - in our case, we will terminate after the maximum number of features is reached
    - else, start the next generation: evaluate fitness again and select the n most fit individuals
- crossover
    - can be done in many ways
    - one way is union and intersection
        - union: take all the best features from both individuals
        - intersection: take only the best features that are in both individuals
- mutation
    - can be done in many ways
    - one way is to randomly add, remove, or swap features
        - randomly choose whether to add, remove, or swap
        - then randomly choose which feature to add, remove, or swap
    - the add/swap pool should not include features that are already in the individual
        - i.e. you should never have duplicate features in an individual

- advantages
    - easy to explain
    - does not make assumptions about the data
    - fast search technique
    - "close" to optimal solutions in "reasonable" time
    - suitable for parallel implementation
        - can calculate fitness of individuals in parallel
    - fairly simple to implement
- disadvantages
    - none, according to Dr. Johnson's experience
- tips
    - crossover rate should be high, e.g. 0.8 or 0.95
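
The process above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: individuals are stored as `frozenset`s of feature names (so duplicates are impossible), crossover picks union or intersection at random, and `toy_fitness`, `ALL_FEATURES`, and `USEFUL` are hypothetical stand-ins for a real model-accuracy evaluation. Termination here is a fixed number of generations.

```python
import random

def crossover_union(a, b):
    """Child gets every feature that appears in either parent."""
    return a | b

def crossover_intersection(a, b):
    """Child gets only the features that appear in both parents."""
    return a & b

def mutate(individual, all_features, rng):
    """Randomly add, remove, or swap one feature (sets cannot hold duplicates)."""
    child = set(individual)
    # the add/swap pool excludes features already in the individual
    pool = sorted(set(all_features) - child)
    ops = []
    if pool:
        ops.append("add")
    if len(child) > 1:
        ops.append("remove")
    if pool and child:
        ops.append("swap")
    if not ops:
        return frozenset(child)
    op = rng.choice(ops)
    if op in ("remove", "swap"):
        child.remove(rng.choice(sorted(child)))
    if op in ("add", "swap"):
        child.add(rng.choice(pool))
    return frozenset(child)

def evolve(all_features, fitness, pop_size=8, n_keep=4, generations=20,
           crossover_rate=0.8, seed=0):
    rng = random.Random(seed)
    features = sorted(all_features)
    # initial population: random non-empty feature subsets
    population = [frozenset(rng.sample(features, rng.randint(1, len(features))))
                  for _ in range(pop_size)]
    for _ in range(generations):
        # selection: keep the n most fit individuals
        parents = sorted(population, key=fitness, reverse=True)[:n_keep]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = a
            if rng.random() < crossover_rate:
                cross = rng.choice([crossover_union, crossover_intersection])
                child = cross(a, b) or a  # empty intersection: fall back to parent a
            children.append(mutate(child, all_features, rng))
        population = parents + children
    return max(population, key=fitness)

# hypothetical toy fitness standing in for model accuracy
ALL_FEATURES = {"SL", "SW", "PL", "PW"}
USEFUL = {"PL", "PW"}

def toy_fitness(individual):
    return len(individual & USEFUL) - 0.1 * len(individual)

best = evolve(ALL_FEATURES, toy_fitness)
print(sorted(best))
```

In a real run, `fitness` would train and score a model on the selected features, which is the expensive part and the natural place to parallelize.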


### In Class 10Oct23
- 1. A $\cup$ B = $\lbrace z_1, SL, PL, PW, z_2 \rbrace$
- 2. D $\cap$ E = $\lbrace z_1, z_2, z_3, z_4 \rbrace$
- 3. new C = $\lbrace z_1, z_2, z_3, SW, PL, SL \rbrace$
- 4. new C = $\lbrace z_1, z_2, z_3, SW \rbrace$
- 5. new C = $\lbrace z_1, z_2, z_3, SW, SL \rbrace$
- 6. 50 sets k folded at 0.1 seconds each = 5 seconds
- 7. 72 sets k folded at 0.1 seconds each = 7.2 seconds

### Unbalanced Datasets
- not all datasets will be balanced
    - i.e. the number of samples in each class will not be equal
    - e.g. a child running into the street is rare but critical to autonomous vehicle training
- methods for dealing with imbalanced data
    - collect more data
        - not always possible, but probably the best solution if it is
        - concentrate on collecting more data for the minority class
        - even if not balanced, more data is always better
            - can be used in oversampling and undersampling (discussed below)
    - change the performance metric
        - accuracy is not a good metric for imbalanced data
        - there is no consensus on what the best metric is
            - Class Balance Accuracy (CBA)
                - minimum between precision and recall is computed for each class
                - then the average of the minimums is computed
                - this is the CBA
            - Balanced Accuracy (BA)
                - average of the recall and specificity is computed for each class
                - then the average of the averages is computed
                - this is the BA
        - specificity is the true negative rate
            - $specificity = \frac{TN}{TN + FP}$
            - <img src="images/specificity.png" width="800">
            - TN is all of the non-fish instances that ***are not*** classified as fish
                - $TN_{fish} = 4 + 3 + 1 + 6 = 14$
            - FP is all of the non-fish instances that ***are*** classified as fish
                - $FP_{fish} = 1 + 0 = 1$
            - specificity = $\frac{14}{14 + 1} = 0.933$
        - `balanced_accuracy_score` is a scikit-learn function that is different from CBA and BA as defined above (it averages the recall of each class)
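
The definitions of specificity, CBA, and BA above can be checked with a short sketch. The confusion matrix `CM` below is hypothetical: it was chosen only so that the fish counts match the ones quoted above ($TN_{fish} = 14$, $FP_{fish} = 1$), and it is not taken from the figure. Rows are actual classes and columns are predicted classes; the function names are illustrative.

```python
LABELS = ["cat", "fish", "hen"]
# hypothetical confusion matrix: CM[i][j] = samples of actual class i predicted as class j
CM = [
    [4, 1, 1],
    [6, 2, 2],
    [3, 0, 6],
]

def counts(cm, k):
    """Return (TP, FP, FN, TN) for class index k."""
    n = len(cm)
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(n) if i != k)
    fn = sum(cm[k][j] for j in range(n) if j != k)
    tn = sum(cm[i][j] for i in range(n) for j in range(n) if i != k and j != k)
    return tp, fp, fn, tn

def specificity(cm, k):
    tp, fp, fn, tn = counts(cm, k)
    return tn / (tn + fp)

def class_balance_accuracy(cm):
    """Average over classes of min(precision, recall)."""
    mins = []
    for k in range(len(cm)):
        tp, fp, fn, tn = counts(cm, k)
        mins.append(min(tp / (tp + fp), tp / (tp + fn)))
    return sum(mins) / len(mins)

def balanced_accuracy(cm):
    """Average over classes of (recall + specificity) / 2, as defined in these notes."""
    avgs = []
    for k in range(len(cm)):
        tp, fp, fn, tn = counts(cm, k)
        avgs.append((tp / (tp + fn) + tn / (tn + fp)) / 2)
    return sum(avgs) / len(avgs)

fish = LABELS.index("fish")
print(counts(CM, fish), round(specificity(CM, fish), 3))
print(round(class_balance_accuracy(CM), 3), round(balanced_accuracy(CM), 3))
```

The sketch assumes every class has at least one actual and one predicted sample; otherwise the divisions need a guard.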

### Oversampling
- random oversampling
    - randomly duplicate samples from the minority class until it is as large as the majority classes
        - randomly chosen with replacement
    - this is the simplest method
    - can lead to overfitting
- SMOTE oversampling
    - Synthetic Minority Oversampling Technique
    - instead of duplicating samples, create new samples that are similar to the minority class
    - for each sample in the minority class, find the k nearest neighbors
        - k is a hyperparameter
    - randomly choose one of the k nearest neighbors
    - create a new sample that is a random combination of the sample and the randomly chosen neighbor
        - the new sample is the original sample plus the vector from the sample to the neighbor multiplied by a random number between 0 and 1
        - i.e. the new sample will be somewhere on the line between the sample and the neighbor
    - repeat until the minority class is as large as the majority classes
- ADASYN oversampling
    - Adaptive Synthetic Sampling
    - similar to SMOTE
    - finds nearest neighbors
    - instead of treating every minority sample the same, it uses a probability distribution to decide where synthetic samples are generated
        - in the standard formulation, minority samples with more majority class neighbors (the harder ones) get more synthetic samples
        - the same sample or neighbor can be chosen multiple times
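
The SMOTE interpolation step can be sketched as follows. This is a simplified sketch, not a reference implementation: `interpolate` and `smote` are illustrative names, samples are plain tuples, neighbors are found by brute-force Euclidean distance, and the starting sample is drawn at random rather than looping over every minority sample.

```python
import math
import random

def interpolate(sample, neighbor, rn):
    """New point on the line from sample to neighbor; rn is in [0, 1]."""
    return tuple(s + rn * (n - s) for s, n in zip(sample, neighbor))

def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic samples from a list of minority samples (needs at least 2)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        sample = rng.choice(minority)
        others = [m for m in minority if m is not sample]
        # k nearest neighbors of the chosen sample within the minority class
        neighbors = sorted(others, key=lambda m: math.dist(sample, m))[:k]
        neighbor = rng.choice(neighbors)
        synthetic.append(interpolate(sample, neighbor, rng.random()))
    return synthetic

minority = [(5.1, 3.5, 1.4, 0.2), (4.9, 3.0, 1.4, 0.2), (4.7, 3.2, 1.3, 0.2)]
new_points = smote(minority, n_new=3)
print(interpolate(minority[0], minority[1], 0.4))
```

Every synthetic point lies between two existing minority samples, so it never leaves the region the minority class already occupies.
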

### Undersampling
- random undersampling
    - randomly remove samples from the majority class until it is as small as the minority classes
    - this is the simplest method
    - can increase variance
    - can discard useful and important samples
- cluster undersampling
    - replaces a cluster of samples with a single sample at the centroid of the cluster
    - centroid is calculated by taking the mean of each feature
        - this is via k-means clustering
- Tomek Links Undersampling
    - finds pairs of samples from different classes that are each other's nearest neighbors
    - removes the majority class sample from the pair
    - this is repeated until the desired balance is achieved
    - i.e. it cleans up the boundary between the classes by removing majority samples only
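
A single pass of Tomek link removal might look like the sketch below. The names `nearest`, `tomek_links`, and `remove_tomek_majority` are illustrative, the data is a made-up one-dimensional toy set, and neighbors are found by brute force, which is only reasonable for small datasets.

```python
import math

def nearest(i, X):
    """Index of the nearest neighbor of X[i] (excluding itself)."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))

def tomek_links(X, y):
    """Pairs (i, j) of mutual nearest neighbors with different labels."""
    links = []
    for i in range(len(X)):
        j = nearest(i, X)
        if i < j and y[i] != y[j] and nearest(j, X) == i:
            links.append((i, j))
    return links

def remove_tomek_majority(X, y, majority_label):
    """Drop the majority class member of every Tomek link."""
    drop = set()
    for pair in tomek_links(X, y):
        for idx in pair:
            if y[idx] == majority_label:
                drop.add(idx)
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]

# made-up toy data: one feature, two classes
X = [(0.0,), (1.0,), (1.2,), (5.0,), (5.3,), (9.0,)]
y = ["maj", "maj", "min", "maj", "maj", "min"]
X_clean, y_clean = remove_tomek_majority(X, y, "maj")
print(tomek_links(X, y), X_clean)
```

Repeating the pass until the desired balance is reached, as described above, would mean calling `remove_tomek_majority` in a loop until no links remain or the class counts are acceptable.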

### IC 12Oct23
- 1. - Cat:
        - Precision: $\frac{TP}{TP + FP} = \frac{4}{4+9} = \frac{4}{13} = 0.308$
        - Recall: $\frac{TP}{TP + FN} = \frac{4}{4+2} = \frac{2}{3} = 0.667$
    - Fish:
        - Precision: $\frac{TP}{TP + FP} = \frac{2}{2+1} = \frac{2}{3} = 0.667$
        - Recall: $\frac{TP}{TP + FN} = \frac{2}{2+8} = \frac{1}{5} = 0.2$
    - Hen:
        - Precision: $\frac{TP}{TP + FP} = \frac{6}{6+3} = \frac{2}{3} = 0.667$
        - Recall: $\frac{TP}{TP + FN} = \frac{6}{6+3} = \frac{2}{3} = 0.667$
    - Class balance accuracy: $\frac{0.308 + 0.2 + 0.667}{3} = 0.392$
- 2. - Cat:
        - Specificity: $\frac{TN}{TN + FP} = \frac{9}{9+2} = \frac{9}{11} = 0.818$
    - Fish:
        -  
            - 
- 3. $d = \sqrt{(5.1-4.9)^2+(3.5-3)^2+(1.4-1.4)^2+(.2-.2)^2}$
        - $d = \sqrt{.04+.25+0+0}$
        - $d = \sqrt{.29}$
        - $d = 0.539$
- 4. RN = 0.4
    - synthetic sample $= (x_r + RN(x_n - x_r),\ y_r + RN(y_n-y_r),\ z_r + RN(z_n-z_r),\ a_r + RN(a_n-a_r))$
        - $x = 5.1 + 0.4(4.9-5.1) = 5.1 + 0.4(-0.2) = 5.1 - 0.08 = 5.02$
        - $y = 3.5 + 0.4(3-3.5) = 3.5 + 0.4(-0.5) = 3.5 - 0.2 = 3.3$
        - $z = 1.4 + 0.4(1.4-1.4) = 1.4 + 0.4(0) = 1.4$
        - $a = 0.2 + 0.4(0.2-0.2) = 0.2 + 0.4(0) = 0.2$
        - synthetic sample $= (5.02, 3.3, 1.4, 0.2)$
- 5. 

### Ensemble Learning
- ensemble learning is the process of combining multiple models to solve a problem
    - generally attain higher accuracy
    - no single ML model is the best for all problems
    - ensemble learning is a way to combine the strengths of multiple models
    - questions
        - how do you combine outputs of multiple models?
        - how do you generate base learners that complement each other?
- fusion
    - combination of outputs
    - categorical
        - vote by most frequent output
        - ties broken by random choice
    - value
        - average of outputs
            - round if class is an integer
            - classification = $\frac{\text{sum of outputs}}{\text{number of models}}$
        - median of outputs
            - round if class is an integer
            - if even number of models, take average of middle two values
    - in python
        - scikit-learn has a `VotingClassifier` class
            - supports majority (hard) voting and averaging of predicted probabilities (soft voting), optionally weighted
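
The fusion rules above can be sketched with the standard library alone. The function names `vote`, `mean_fusion`, and `median_fusion` are illustrative and not part of scikit-learn; the weighted variant of the mean is an assumption added for completeness.

```python
from collections import Counter
import random
import statistics

def vote(outputs, rng=random):
    """Most frequent categorical output; ties are broken by random choice."""
    counts = Counter(outputs)
    top = max(counts.values())
    winners = [label for label, c in counts.items() if c == top]
    return rng.choice(winners)

def mean_fusion(outputs, weights=None):
    """Plain or weighted average of numeric outputs."""
    if weights is None:
        return statistics.mean(outputs)
    return sum(w * o for w, o in zip(weights, outputs)) / sum(weights)

def median_fusion(outputs):
    """Median; with an even number of models this averages the middle two values."""
    return statistics.median(outputs)

print(vote(["A", "B", "A"]))
print(mean_fusion([2, 3, 7, 2, 9, 4]))
print(median_fusion([2, 3, 7, 2, 9, 4]))
```

If the class labels are integers, the result of `mean_fusion` or `median_fusion` would still need rounding, as noted above.
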

### Generating Base-Learners
- pairwise coupling method 1
    - multiclass classifiers are usually worse than binary classifiers
    - break multiclass problem into multiple binary problems
    - only works with models that output normalized probabilities
    - <img src="images/pairwise.png">
    - basically, each model is trained to detect a single class (commonly called one-vs-rest)
        - in the e.g. above, C1 just says setosa or not setosa
- pairwise coupling method 2
    - each base learner compares a pair of classes and all possible pairs are combined
    - <img src="images/pairwise2.png">
    - basically, each model is trained to detect two classes (commonly called one-vs-one)
        - other classes are misclassified by that model as one of the two classes
        - the assumption is that the output probability would be close to 50% for those misclassifications and 1 or 0 for the classes it is trained to detect
        - in theory, the fusion step weeds out the misclassifications  
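
The two schemes differ in how many base learners they need: one per class for method 1, and one per pair of classes for method 2, i.e. $K(K-1)/2$ for $K$ classes. A small sketch that enumerates the binary problems (the function names are illustrative):

```python
from itertools import combinations

def one_per_class(classes):
    """Method 1: one binary problem per class (that class vs. everything else)."""
    return [(c, "not " + c) for c in classes]

def all_pairs(classes):
    """Method 2: one binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))

classes = ["A", "B", "C", "D"]
print(one_per_class(classes))
print(all_pairs(classes))
```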

### Bagging
- bootstrap aggregating
    - the random subspace method used in random forests is sometimes also called bagging
- bootstrap aggregating is a method of generating multiple models from a single dataset
- each model is trained on a slightly different sample of the data
- bootstrapping is a method of sampling with replacement
    - i.e. each sample can be chosen multiple times
    - each sample has an equal probability of being chosen
- typically implemented with decision trees and neural networks
    - can improve unstable models
        - i.e. models whose output changes a lot when the training data changes slightly
    - can also degrade stable models
        - i.e. models whose output barely changes when the training data changes slightly
- in python
    - scikit-learn has a `BaggingClassifier` class to perform bagging
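
The bootstrapping step can be sketched with the standard library. `bootstrap_samples` is an illustrative name, and the ten integer "samples" are made up; each model would then be trained on its own list.

```python
import random

def bootstrap_samples(data, n_models, seed=0):
    """One bootstrap sample per model: same size as data, drawn with replacement."""
    rng = random.Random(seed)
    return [rng.choices(data, k=len(data)) for _ in range(n_models)]

data = list(range(1, 11))  # made-up sample ids 1..10
for i, sample in enumerate(bootstrap_samples(data, n_models=4), start=1):
    print(f"model {i}: {sample}")
```

Because sampling is with replacement, some ids usually appear more than once in a sample and others not at all.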

### In Class 19Oct23
- 1. a 
        - 2(.12) + 3(.18) + 7(.21) + 2(.18) + 9(.17) + 4(.14) = 4.7
        - the weights already sum to 1, so the weighted average is 4.7/1 = 4.7
- 1. b 
        - median(2, 2, 3, 4, 7, 9) = 3.5
- 2. 
        - AB, AC, AD, BC, BD, CD
- 3. a
        - $p_1(A) = 0.6, p_1(B) = 0.4$
        - $p_2(A) = 0.7, p_2(C) = 0.3$ 
        - $p_3(B) = 0.5, p_3(D) = 0.5$
        - $P(A) = 0.6 * 0.7 = 0.42$
        - $P(B) = 0.4 * 0.5 = 0.2$
        - $P(C) = 0.3 * 0.5 = 0.15$
- 3. b
        - class A
- 4. 
        - net 1: 1, 4, 3, 4, 8 - replacement evident in the `4`
        - net 2: 7, 2, 6, 3, 4
        - net 3: 10, 6, 7, 2, 7
        - net 4: 8, 4, 5, 3, 5

### Boosting
- in bagging, generation of base learners is left to chance and instability, i.e. random
- in boosting, generation of base learners is not random and each new base learner is generated to correct the errors of the previous base learners
- training data is randomly divided into subsets
    - train the next models on the subsets that the previous models misclassified
        - i.e. the next model is better at the things that the previous models were bad at
- e.g. if you have 3 sets, X1, X2, and X3
    - train model d1 on X1
    - then test d1 on X2
    - take the misclassified samples from X2 and train d2 on those and an equal number of correctly classified samples from X2
    - then test d1 and d2 on X3
    - the samples on which d1 and d2 disagree are the ones that d3 will be trained on
    - this process continues until all samples are classified correctly or the maximum number of models is reached
    - in this example, d1 is trained on X1, d2 is trained on d1's mistakes on X2, and d3 is trained on d1 and d2's mistakes on X3
- output
    - if d1 and d2's outputs agree, that is the response
    - else, d3's output is the response
- can also be done recursively
- disadvantages
    - sample is divided into 3 sets and results in 3 models
    - each later model is trained on successively smaller data sets
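
The three-model scheme above can be sketched as plain functions. The models `d1`, `d2`, `d3` are assumed to be callables that return a class label, and taking the first correctly classified samples for d2's training set is an arbitrary choice made here for determinism (the notes only require an equal number).

```python
def d2_training_set(x2, misclassified_by_d1):
    """d1's mistakes on X2 plus an equal number of correctly classified samples."""
    wrong = [s for s in x2 if s in misclassified_by_d1]
    right = [s for s in x2 if s not in misclassified_by_d1]
    return wrong + right[:len(wrong)]

def d3_training_set(x3, d1, d2):
    """Samples of X3 on which d1 and d2 disagree."""
    return [s for s in x3 if d1(s) != d2(s)]

def boosted_predict(x, d1, d2, d3):
    """If d1 and d2 agree, use their answer; otherwise defer to d3."""
    y1, y2 = d1(x), d2(x)
    return y1 if y1 == y2 else d3(x)

print(d2_training_set(["f", "g", "h", "i", "j"], {"i", "j"}))
```
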

### AdaBoost - Adaptive Boosting
- designed to improve on boosting
- many variants of AdaBoost exist; here we discuss AdaBoost.M1
    - <img src="images/adaboostM1.png">
    - all training samples are available for each model
    - probability of each sample being chosen is calculated
        - later models are more likely to have samples that were misclassified by previous models
    - the later models should be good at the hard situations
    - fusion achieved by weighted average
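        - weights increase with the model's accuracy on the training data
            - in AdaBoost.M1, a model with weighted training error $\epsilon$ gets weight $\log\frac{1-\epsilon}{\epsilon}$, so a 90% accurate model ($\epsilon = 0.1$) gets weight $\log 9 \approx 2.2$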
- all models share the whole data pool, not limited to 3 models 
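
One round of the sample-probability update can be sketched as below. This follows the commonly published AdaBoost.M1 rule (error $\epsilon$, $\beta = \epsilon / (1 - \epsilon)$, correctly classified samples scaled by $\beta$, then normalization); whether it matches the figure above line for line is an assumption, and `adaboost_m1_step` is an illustrative name.

```python
import math

def adaboost_m1_step(probs, correct):
    """Update sample probabilities after one model and return its vote weight.

    probs:   current probability of drawing each training sample (sums to 1)
    correct: True where the current model classified the sample correctly
    """
    eps = sum(p for p, ok in zip(probs, correct) if not ok)  # weighted error
    if eps == 0 or eps > 0.5:
        raise ValueError("AdaBoost.M1 stops when the error is 0 or above 1/2")
    beta = eps / (1 - eps)
    # shrink the probability of samples the model already gets right
    updated = [p * beta if ok else p for p, ok in zip(probs, correct)]
    total = sum(updated)
    new_probs = [p / total for p in updated]
    vote_weight = math.log(1 / beta)
    return new_probs, vote_weight

probs = [0.2] * 5
correct = [True, True, True, True, False]
new_probs, weight = adaboost_m1_step(probs, correct)
print([round(p, 3) for p in new_probs], round(weight, 3))
```

After the update, the misclassified sample is much more likely to be drawn for the next model, which is how later models end up focusing on the hard cases.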

### IC 24Oct23
- X1 = {a, b, c, d, e}
- X2 = {f, g, h, i, j}
- X3 = {k, l, m, n, o}
- in iteration 1, d1 misclassified {i, j} of X2
1. d2 training sample = {i, j, g, h}