## Classification

- Classification assigns unknown items to one of a discrete set of categories, or "classes"
- The target is a categorical variable
- The goal is to determine the class label of an unlabeled test case
- A sample classification problem: detecting whether an email is spam by looking at its text

### K-Nearest Neighbors (KNN)
- **Classifies cases based on their similarity to other cases.**
- Example: we have a set of customers whose classes are known, and we want to classify the nth customer, whose "class" is unknown.

#### How does it work?
- Pick a value for K.
- Calculate the distance from the unknown case to every case in the training data.
- Select the K observations in the training data that are "nearest" to the unknown data point.
- Predict the response of the unknown data point, using the most popular response from the K-nearest neighbors.
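
The four steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `knn_predict`, the toy customer data, and the class names are all made up for the example, and Euclidean distance is assumed as the distance measure.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Predict the class of `query` by majority vote among its k nearest training points."""
    # Step 2: distance from the unknown case to every training case (Euclidean, assumed)
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Step 3: keep the k observations with the smallest distance
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 4: the most common class among those neighbors is the prediction
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two features per customer, two classes
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
train_y = ["basic", "basic", "basic", "premium", "premium", "premium"]

# Step 1: pick K, then classify an unknown customer
prediction = knn_predict(train_X, train_y, query=(2.0, 2.0), k=3)
print(prediction)  # basic
```

Sorting every distance is fine for small data sets; larger ones usually call for a spatial index or a library implementation.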

#### How to find the similarity?
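- With n features, we measure similarity with a distance. A common choice is the Euclidean distance, which is the Minkowski distance with p = 2:
- $\mathrm{Dis}(x_1, x_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}$
- The smaller the distance, the more similar the two cases are.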
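
As a rough sketch, the distance can be written as a small Python function. The name `minkowski_distance` and the parameter `p` are chosen for this example; with `p=2` it gives the Euclidean distance used here, and `p=1` gives the Manhattan distance.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("both points need the same number of features")
    # Sum the p-th powers of the per-feature differences, then take the p-th root
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1 / p)

print(minkowski_distance((0, 0), (3, 4)))       # Euclidean (p=2): 5.0
print(minkowski_distance((0, 0), (3, 4), p=1))  # Manhattan (p=1): 7.0
```

Features on very different scales can dominate the sum, so it is common to normalize the data before computing distances.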

#### Choosing the right K value
- A low K value causes overfitting (the model follows noise in the training data), while a high K value causes underfitting (the model becomes too general)
- Start with a small K and iteratively increase it, measuring accuracy on held-out test data (data not used for training) each time.
- The K that gives the highest accuracy on the held-out data is the most suitable value
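
One way to sketch this search in plain Python is shown below. The helper names (`knn_predict`, `accuracy_for_k`), the toy data, and the range of K values tried are assumptions made for the illustration. In practice the held-out set would come from a proper train/test split.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy_for_k(train_X, train_y, test_X, test_y, k):
    """Fraction of held-out cases that are classified correctly for a given k."""
    correct = sum(knn_predict(train_X, train_y, x, k) == y for x, y in zip(test_X, test_y))
    return correct / len(test_y)

# Hypothetical toy data: training set and a separate held-out test set
train_X = [(1, 1), (1, 2), (2, 1), (2, 2), (6, 6), (6, 7), (7, 6), (7, 7), (4, 4)]
train_y = [0, 0, 0, 0, 1, 1, 1, 1, 0]
test_X = [(1.5, 1.5), (6.5, 6.5), (3, 3), (5, 5)]
test_y = [0, 1, 0, 1]

# Try K = 1..7 and record the accuracy of each
scores = {k: accuracy_for_k(train_X, train_y, test_X, test_y, k) for k in range(1, 8)}
best_k = max(scores, key=scores.get)  # on ties, the smallest K wins

for k, score in scores.items():
    print(f"K={k}: accuracy={score:.2f}")
print("best K:", best_k)
```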

### Evaluation Metrics in Classification

#### Jaccard Index (Jaccard Similarity Coefficient)
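- Measures the overlap between the actual and the predicted labels; 1.0 means a perfect match, 0.0 means no overlap
- y: Actual labels
- Y: Predicted labels
- J(y, Y) = |y ∩ Y| / |y ∪ Y| = |y ∩ Y| / (|y| + |Y| - |y ∩ Y|)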
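
A small sketch of this computation follows. It assumes that the intersection counts the positions where the predicted label equals the actual label. The function name `jaccard_index` and the label lists are made up for the example. Per-class definitions of the Jaccard index, as some libraries use, can give different numbers.

```python
def jaccard_index(actual, predicted):
    """Jaccard index, assuming the intersection is the number of matching positions."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    # |y ∩ Y|: positions where the prediction matches the actual label
    matches = sum(a == p for a, p in zip(actual, predicted))
    # |y ∪ Y| = |y| + |Y| - |y ∩ Y|
    union = len(actual) + len(predicted) - matches
    return matches / union

actual    = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
predicted = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
print(round(jaccard_index(actual, predicted), 2))  # 8 / (10 + 10 - 8) -> 0.67
```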

#### F1 Score (Usage in Binary Classification)
- Let's assume there are two classes: 0 and 1.
- We can create a confusion matrix from the four (predicted, actual) combinations:
    - (Predicted 1, Actual 1): True Positives (TP)
    - (Predicted 1, Actual 0): False Positives (FP)
    - (Predicted 0, Actual 1): False Negatives (FN)
    - (Predicted 0, Actual 0): True Negatives (TN)
- **Precision = TP/(TP + FP)**
- **Recall = TP / (TP + FN)**
- **F1-Score = 2 * (Precision * Recall) / (Precision + Recall)** (harmonic mean of precision and recall; ranges from 0 to 1, where 1 is best)
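
The formulas above can be checked with a short sketch. The function names and the example labels are made up for illustration, class 1 is assumed to be the positive class, and returning 0.0 when a denominator is zero is a convention chosen here, not something the formulas dictate.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN, TN for a binary problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted 1, actual 1
        elif p == positive:
            fp += 1  # predicted 1, actual 0
        elif a == positive:
            fn += 1  # predicted 0, actual 1
        else:
            tn += 1  # predicted 0, actual 0
    return tp, fp, fn, tn

def f1_score(actual, predicted, positive=1):
    """Harmonic mean of precision and recall (0.0 when undefined, by assumption)."""
    tp, fp, fn, _ = confusion_counts(actual, predicted, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(confusion_counts(actual, predicted))  # (3, 1, 1, 5)
print(f1_score(actual, predicted))          # precision = recall = 0.75, so F1 = 0.75
```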

#### Log Loss
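- Sometimes the output of a classifier is a probability rather than a class label, as in Logistic Regression.
- In that case, the predicted value Y is between 0 and 1, and log loss measures how far those probabilities are from the actual labels y (lower is better). Here is the formula:
- **LogLoss = Average of: -(y * log(Y) + (1 - y) * log(1 - Y))**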
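
A minimal sketch of the formula in Python is given below. The function name `log_loss` and the example values are made up for illustration. The small `eps` clipping is an added assumption that keeps `log(0)` from occurring when a predicted probability is exactly 0 or 1.

```python
import math

def log_loss(actual, predicted_prob, eps=1e-15):
    """Average of -(y * log(Y) + (1 - y) * log(1 - Y)) over all cases."""
    total = 0.0
    for y, p in zip(actual, predicted_prob):
        # Clip the probability away from 0 and 1 (assumption, avoids log(0))
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(actual)

actual = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]  # probabilities close to the actual labels
uncertain = [0.6, 0.4, 0.5, 0.5]  # probabilities close to 0.5

print(log_loss(actual, confident))
print(log_loss(actual, uncertain))
```

Predictions that are confident and correct give a loss near 0, while confident but wrong predictions are penalized heavily.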