# evaluating classification accuracy
## accuracy, precision, and recall
__Accuracy__: the fraction of data points that were classified correctly. HOWEVER: if the dataset is not well balanced, or if different mistakes have different costs, accuracy is not a good measure of performance. 

if a dataset consists of pictures of cats or dogs, and the objective is to find the dogs (cats are irrelevant), then two kinds of mistakes can be made: 
1. the ml model misses a dog (false negative)
2. the model wrongly identifies a cat as a dog (false positive)
where correctly identified dogs are true positives and correctly identified cats are true negatives. 
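
a minimal sketch of the four outcomes, using made-up labels (1 = dog, 0 = cat). the helper name `confusion_counts` and the data are assumptions for illustration only. it also shows why accuracy misleads on an unbalanced set: a model that never predicts "dog" still scores 70%.

```python
# 1 = dog (positive class), 0 = cat (negative class); labels are made up
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # dog correctly found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # cat called a dog
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # dog missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # cat correctly ignored
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print(tp, fp, fn, tn, accuracy)  # 2 1 1 6 0.8

# a "lazy" model that always answers cat: finds no dogs, yet accuracy looks fine
lazy_pred = [0] * len(y_true)
lazy_tp, lazy_fp, lazy_fn, lazy_tn = confusion_counts(y_true, lazy_pred)
lazy_accuracy = (lazy_tp + lazy_tn) / len(y_true)
print(lazy_tp, lazy_accuracy)  # 0 0.7
```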

then instead of accuracy, these two concepts can be used to evaluate model relevancy:

### precision
found by dividing true positives by all predicted positives. can also be read as the probability that a randomly selected item the model labeled positive is actually positive.
<br> precision quantifies the fraction of positive class predictions that actually belong to the positive class.
$$ \text{precision} = \frac{\text{true positives (TP)}}{\text{true positives (TP)} + \text{false positives (FP)}} $$

<br> for all identified positives, how many are correctly positive
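
the formula above as a small function. the counts are hypothetical, and returning 0.0 when there are no positive predictions is a convention chosen for this sketch (precision is technically undefined in that case).

```python
def precision(tp, fp):
    """TP / (TP + FP): of everything flagged positive, how much really is positive."""
    predicted_positives = tp + fp
    if predicted_positives == 0:
        return 0.0  # assumption: treat "no positive predictions" as 0.0
    return tp / predicted_positives

# hypothetical counts: the model flagged 8 pictures as dogs, 6 of them really are dogs
print(precision(tp=6, fp=2))  # 0.75
```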

### recall
recall is the measure of relevant elements that were detected. it divides true positives by the total number of relevant elements. 
i.e. # of correctly predicted dogs / total known # of dogs in the dataset
<br> for a single item, recall gives the probability that a randomly selected relevant item from the dataset will be detected.
<br> recall quantifies the fraction of all positive examples in the dataset that were predicted as positive.
$$ \text{recall} = \frac{\text{true positives (TP)}}{\text{true positives (TP)} + \text{false negatives (FN)}} $$

<br> for ALL positives, which ones were correctly identified
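
the same idea as a small function, with hypothetical counts. returning 0.0 when the dataset has no positives is a convention chosen for this sketch.

```python
def recall(tp, fn):
    """TP / (TP + FN): of all real positives, how many the model found."""
    actual_positives = tp + fn
    if actual_positives == 0:
        return 0.0  # assumption: treat "no positives in the data" as 0.0
    return tp / actual_positives

# hypothetical counts: 10 dogs in the dataset, the model found 6 and missed 4
print(recall(tp=6, fn=4))  # 0.6
```

note that the numerator is the same as for precision; only the denominator changes (predicted positives vs. actual positives).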

### F1 measure
$$ F1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} $$
result is between 0.0 and 1.0, where 1.0 is the best.
<br> F1 is the harmonic mean of precision and recall, so it gives one score that balances both and is pulled down when either of them is low.
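
a quick sketch of how F1 behaves, with made-up precision/recall values. the function name `f1_score` here is just a local helper, and returning 0.0 when both inputs are 0 is a convention chosen for this sketch.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumption: define F1 as 0.0 when both inputs are 0
    return 2 * (precision * recall) / (precision + recall)

balanced = f1_score(0.75, 0.75)  # equal inputs give the same value back
lopsided = f1_score(1.0, 0.1)    # about 0.18, far below the plain average of 0.55
print(balanced, round(lopsided, 2))
```

this is why F1 is preferred over a plain average: perfect precision cannot hide terrible recall.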

# cross validation 
* validation: determining whether numerical results (which quantify hypothetical relationships) are acceptable descriptions of data
* cross validation is used to get a better idea of the effectiveness of a model
* main point of cross validation is detecting overfitting: it estimates how well the model performs on data it was not trained on
* simplest validation form is train-test split, where around 70% of the data is used as training data, and the other 30% is held out and used as validation data once the model is done training.
* for a regression model, mean squared error (MSE) is calculated between the predicted values and the actual values of the test set. 
* however, with only one training set and test set, MSE can be vastly different depending on what section of the dataset is chosen for train vs test
* this is why we use cross validation instead
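
a rough demonstration of that instability, assuming a toy regression problem (y = 2x plus noise) and a hand-written least-squares line fit. the data, the seeds, and the helper names are all made up for this sketch.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given points."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def split_mse(xs, ys, seed, train_frac=0.7):
    """Shuffle with the given seed, train on the first 70%, return test MSE."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    cut = int(train_frac * len(idx))
    train, test = idx[:cut], idx[cut:]
    model = fit_line([xs[i] for i in train], [ys[i] for i in train])
    return mse(model, [xs[i] for i in test], [ys[i] for i in test])

# toy data: a noisy straight line
rng = random.Random(0)
xs = [float(i) for i in range(30)]
ys = [2.0 * x + rng.gauss(0, 5) for x in xs]

# same data, same model, five different 70/30 splits
scores = [split_mse(xs, ys, seed) for seed in range(5)]
print([round(s, 2) for s in scores])
```

the printed test MSEs differ from split to split even though nothing about the model changed, which is the motivation for averaging over several splits.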

### k fold validation
* k fold cross validation divides the training data into k folds/subsets of roughly equal size, where one fold at a time is used as validation data (rest is training data)
* k tests are run, each using a different fold as validation data so that each test is exposed to a different set of "new" data
* results of each test are averaged to produce a better performance estimate
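
the steps above can be sketched as follows. to keep it short, the "model" is a placeholder that just predicts the mean of the training targets; the data and helper names are assumptions for illustration.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_mse(ys, k=5):
    """Average validation MSE of a 'predict the training mean' model over k folds."""
    folds = k_fold_indices(len(ys), k)
    fold_scores = []
    for i, val_idx in enumerate(folds):
        # every fold except fold i is training data
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        prediction = sum(ys[j] for j in train_idx) / len(train_idx)
        fold_mse = sum((ys[j] - prediction) ** 2 for j in val_idx) / len(val_idx)
        fold_scores.append(fold_mse)
    return sum(fold_scores) / k, fold_scores

ys = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 5.0, 9.0, 4.0, 6.0]
mean_mse, per_fold = cross_val_mse(ys, k=5)
print(round(mean_mse, 3), [round(s, 3) for s in per_fold])
```

every point lands in exactly one validation fold, so each point is used for validation once and for training k-1 times.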

### leave one out CV: finding MSE of dataset
* dataset is split into test/train, but only ONE datapoint out of n points is held as the testing set (n-1 points used for training)
* model is trained on the training set and the squared error is calculated on the single held-out point. 
* this process is repeated n times, where each time a different singular datapoint is held as the test set 
* total MSE is found as average of the n test runs
* cons: very time consuming and can be computationally expensive
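
a minimal sketch of the loop, written so any model can be plugged in through `fit` and `predict` (both are hypothetical callables for this example). the placeholder model just predicts the mean of the training targets.

```python
def loocv_mse(xs, ys, fit, predict):
    """Hold out each point once, train on the other n-1, average the squared errors."""
    n = len(xs)
    squared_errors = []
    for i in range(n):
        train_x = xs[:i] + xs[i + 1:]   # everything except point i
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        squared_errors.append((predict(model, xs[i]) - ys[i]) ** 2)
    return sum(squared_errors) / n

# placeholder model: ignore x and always predict the mean of the training targets
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)

def predict_mean(model, x):
    return model

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
print(round(loocv_mse(xs, ys, fit_mean, predict_mean), 2))  # 8.89
```

the model is refit n times, which is where the computational cost comes from.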

### bootstrapping
*bootstrapping* is a sampling technique that creates new samples from the original dataset by drawing with replacement (items drawn from the original dataset are not removed and can be drawn again, so a sample can contain duplicates). Each sample then stands in for a random sample from the entire population. 
1. draw a sample of size N from the original dataset, with replacement
2. repeat S times, so there are S bootstrap samples
3. compute the estimate (or fit the model) on each of the S samples, so there are S estimates
4. combine these estimates to get a better estimate (or model)
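
the four steps can be sketched like this, assuming the quantity being estimated is the mean. the data values and the helper name `bootstrap_estimates` are made up for illustration.

```python
import random
import statistics

def bootstrap_estimates(data, num_samples, estimator, seed=0):
    """Draw num_samples bootstrap samples (size N, with replacement); apply estimator to each."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(num_samples):              # step 2: repeat S times
        sample = rng.choices(data, k=n)       # step 1: size N, with replacement
        estimates.append(estimator(sample))   # step 3: one estimate per sample
    return estimates

data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9]
estimates = bootstrap_estimates(data, num_samples=1000, estimator=statistics.mean)

combined = statistics.mean(estimates)   # step 4: combine the S estimates
spread = statistics.stdev(estimates)    # variability of the estimate across samples
print(round(combined, 2), round(spread, 2))
```

the spread of the S estimates is often the more useful output, since it gives a rough idea of how uncertain the estimate is.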

*Out Of Bag (OOB) score*: used with bagged models such as random forests, where one tree is trained per bootstrap sample. after each bootstrap sample is selected, the points that were not chosen from the original set are given to the tree trained on that sample as "unseen data" (test data). For every point, the trees whose sample did not include that point are used to predict the class of that data point. the final prediction is determined by majority vote of these trees. the final OOB score is found by aggregating all OOB predictions and comparing them to the true labels. 
<br> NOTE: due to the nature of sampling with replacement, a bootstrap sample of size N contains on average only about 63.2% of the original points (1 - 1/e for large N). that leaves about 36.8% of the points to be used as OOB data for that sample.
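
a quick simulation of where the 63.2% comes from. each point has a (1 - 1/N)^N chance of never being drawn in N draws, which approaches 1/e (about 0.368) as N grows. the dataset size and seed below are arbitrary choices for this sketch.

```python
import random

def in_bag_fraction(n, seed=0):
    """Fraction of n original points that appear at least once in one bootstrap sample."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}  # n draws with replacement
    return len(drawn) / n

n = 10_000
fraction = in_bag_fraction(n)
expected = 1 - (1 - 1 / n) ** n   # exact expected in-bag fraction for this n

print(round(fraction, 3), round(expected, 3))
print(round(1 - fraction, 3))     # share left over as OOB data, close to 0.368
```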

### AUC 

# QUESTIONS
* what is the difference between jackknife and LOO? every time i look jackknife up, LOO is the only thing that comes up 
* what does the difference between precision and recall actually mean?
