# Modeling

### Implemented Classifiers
- Logistic Regression `41-log-regression.ipynb`
- Gaussian Naive Bayes `42-naive-bayes.ipynb`
- Support Vector Machine `43-support-vector-machine.ipynb`
- Tree-Based Methods `44-trees.ipynb`
    - Decision Tree
    - Random Forest
    - XGBoost

### Other Possible Classifiers
- Bernoulli Naive Bayes
- Complement Naive Bayes
- Linear Discriminant Analysis
- Quadratic Discriminant Analysis
- K-Nearest Neighbors
- Multi-layer Perceptron
- Extra Trees
- AdaBoost 

## Approach
The goal of classification is to assign a given data point to one of a set of possible classes. In this case, we have measurements of a particle, and we assign that particle to one of two classes: stone microdebitage or soil.

## Evaluating Classifiers
- Although we want an accurate model, high accuracy alone is not an adequate metric, especially when the classes are imbalanced.
- As calculated below, a classifier can achieve 93% accuracy by classifying every observation as negative (i.e. soil). However, if we chose this model, we'd never identify any of the stone microdebitage at ancient toolmaking sites.

```python
from utils import custom

# load data
X_train, y_train, X_test, y_test = custom.load_data(verbose=False)

# calculate negative %
(y_train.value_counts()[0] / len(y_train)).round(2)
```

```
0.93
```
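
To make the pitfall concrete, here is a minimal sketch using synthetic labels rather than the project data. It assumes 93 soil and 7 microdebitage particles per 100, mirroring the proportion computed above, and scores a "classifier" that always predicts soil.

```python
# Synthetic labels (illustrative only, not the project data):
# 0 = soil, 1 = stone microdebitage.
y_true = [0] * 93 + [1] * 7

# A "classifier" that always predicts the majority (negative) class.
y_pred = [0] * len(y_true)

# Fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Number of actual positives the classifier managed to flag.
found_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(f"accuracy: {accuracy:.2f}")
print(f"microdebitage identified: {found_positives} of {sum(y_true)}")
```

The accuracy looks strong, yet none of the positive particles are found, which is why the metrics below are needed.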

### Confusion Matrix
- A matrix of correctly classified and misclassified observations is helpful in considering the strengths and weaknesses of a model holistically. True positives and true negatives are correct. False positives and false negatives are incorrect.


| Actual / Predicted | Positive | Negative |
| ------------------ | -------- | -------- |
| **Positive**       |    TP    |    FN    |
| **Negative**       |    FP    |    TN    | 
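
As a rough illustration of how the four cells are counted, the sketch below tallies them in plain Python. The helper `confusion_counts` and the labels are hypothetical, made up for this example; they do not come from the notebooks.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, FP, TN) for binary labels."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1  # positive correctly flagged
            else:
                fn += 1  # positive missed
        else:
            if predicted == positive:
                fp += 1  # negative wrongly flagged
            else:
                tn += 1  # negative correctly ignored
    return tp, fn, fp, tn


# Made-up labels for illustration: 1 = microdebitage, 0 = soil.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)

# Print in the same layout as the table: rows are actual, columns are predicted.
print(f"actual positive: TP={tp} FN={fn}")
print(f"actual negative: FP={fp} TN={tn}")
```

If scikit-learn is used instead, note that `sklearn.metrics.confusion_matrix` orders rows and columns by sorted label, so with 0/1 labels the negative class comes first (`[[TN, FP], [FN, TP]]`); passing `labels=[1, 0]` yields the layout of the table above.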

### Precision, Recall, and F1
- Precision is the accuracy of the positive predictions (i.e. of all the observations predicted positive, what proportion is correct). This measure penalizes false positives.
- Recall (sensitivity) is the true positive rate, or percentage of actual positive cases captured (i.e. of all the observations actually positive, what proportion is predicted positive). This measure penalizes false negatives.
- Where precision and recall are both important, the F1 score can be used, which is their harmonic mean.
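
In terms of the confusion matrix cells, precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is 2 · precision · recall / (precision + recall). The sketch below computes all three from counts; the function names and the example counts are assumptions chosen for illustration, and returning 0.0 for an empty denominator is one common convention rather than the only one.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual positives that were predicted positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts: 40 true positives, 10 false positives, 30 false negatives.
tp, fp, fn = 40, 10, 30

print(f"precision: {precision(tp, fp):.3f}")
print(f"recall:    {recall(tp, fn):.3f}")
print(f"F1:        {f1_score(tp, fp, fn):.3f}")
```

Because the harmonic mean is pulled toward the smaller value, F1 stays low unless precision and recall are both reasonably high. scikit-learn provides equivalents in `sklearn.metrics` (`precision_score`, `recall_score`, `f1_score`).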

### ROC and AUC
- The receiver operating characteristic (ROC) curve is a visualization that plots the true positive rate against the false positive rate at various classification thresholds. The area under the curve (AUC) summarizes how well the classifier separates the classes: 1.0 indicates perfect separation, and 0.5 is no better than random guessing.

![test](https://docs.eyesopen.com/python_modules/cookbook/python/_images/roc-theory-small.png)
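
The construction of the curve can be sketched in a few lines of plain Python: lower the decision threshold through each distinct score, record the resulting (false positive rate, true positive rate) pair, and integrate with the trapezoidal rule. The helpers `roc_points` and `auc` and the four scores are hypothetical, and the sketch assumes 0/1 labels with both classes present.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists, starting at (0, 0), one point per distinct threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    fpr, tpr = [0.0], [0.0]
    # Lower the threshold through each distinct score, highest first.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )


# Made-up scores for illustration: higher means "more likely positive".
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr = roc_points(y_true, scores)
print(list(zip(fpr, tpr)))
print(f"AUC: {auc(fpr, tpr):.2f}")
```

In practice, `sklearn.metrics.roc_curve` and `sklearn.metrics.roc_auc_score` compute the same quantities from a model's predicted scores.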