# Classification

### Use cases
- Email filtering
- Speech recognition
- Handwriting recognition
- Biometric identification
- Document classification

### Types
- Binary classification
- Multiclass classification

### Algorithms
- Decision trees (ID3, C4.5, C5.0)
- Naive Bayes
- Linear discriminant analysis
- k-nearest neighbors (KNN)
- Logistic regression
- Neural networks
- Support vector machines (SVM)

### KNN
KNN (k-nearest neighbors) is a method for classifying cases based on their similarity to other cases.
- Cases that are near each other are called "neighbors".
- It assumes that similar cases, which share the same class label, lie near each other.

The algorithm:
1. Pick a value for k.
2. Calculate the distance from the unknown case to every case in the training data.
3. Select the k observations in the training data that are nearest to the unknown data point.
4. Predict the response of the unknown data point using the most common response value among its k nearest neighbors.
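
The steps above can be sketched in plain Python. This is a minimal illustration, not how scikit-learn implements it: the function name `knn_predict`, the Euclidean distance, and the toy data are all assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x_new, k):
    """Classify x_new by majority vote among its k nearest training points."""
    # Step 2: distance from the unknown case to every training case
    # (Euclidean distance is an assumption; other metrics are possible)
    distances = [(math.dist(x, x_new), label) for x, label in zip(X_train, y_train)]
    # Step 3: keep the k training cases with the smallest distance
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 4: the most common label among those neighbors is the prediction
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data, made up for illustration: two well-separated groups
X_train = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
y_train = ["a", "a", "a", "b", "b", "b"]

# Step 1: pick k, then classify two unknown points
print(knn_predict(X_train, y_train, (2.0, 2.0), k=3))
print(knn_predict(X_train, y_train, (6.5, 7.0), k=3))
```

An odd k avoids tied votes in the binary case; this sketch makes no special effort to break ties.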

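### Evaluation metrics
- Jaccard index (the simplest): $J(y, \hat{y}) = \frac{|y \cap \hat{y}|}{|y \cup \hat{y}|}$, where $y$ are the true labels and $\hat{y}$ the predicted labels; 1 is the best, 0 is the worst
- Confusion matrix: counts of true positives (tp), false positives (fp), false negatives (fn), and true negatives (tn)
- F1-score
  - precision: $\frac{tp}{tp + fp}$
  - recall: $\frac{tp}{tp + fn}$
  - score: $F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
  - 1 is the best, 0 is the worst
  - can be used for multiclass classifiers as well
- Log loss (logarithmic loss)
  - equation: $-\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$, where $\hat{y}_i$ is the predicted probability that sample $i$ belongs to class 1
  - 0 is the best; larger values are worse, with no upper bound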
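
As a rough check of the formulas above, the three metrics can be computed by hand for a binary problem. The helper names (`confusion_counts`, `jaccard_index`, `f1_score`, `log_loss`) and the sample labels are assumptions for this sketch. The Jaccard index here is taken over the set of samples labelled positive, so it reduces to tp / (tp + fp + fn). The code assumes at least one true and one predicted positive, so no denominator is zero.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn) for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, fp, fn

def jaccard_index(y_true, y_pred, positive=1):
    # |intersection| = tp, |union| = tp + fp + fn for the positive class
    tp, fp, fn = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fp + fn)

def f1_score(y_true, y_pred, positive=1):
    tp, fp, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * (precision * recall) / (precision + recall)

def log_loss(y_true, y_prob, eps=1e-15):
    """Binary log loss; y_prob holds predicted probabilities of class 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Made-up labels and probabilities for illustration
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
y_prob = [0.9, 0.8, 0.4, 0.2, 0.1, 0.6, 0.7, 0.3]

print("Jaccard:", jaccard_index(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print("Log loss:", log_loss(y_true, y_prob))
```

Log loss needs predicted probabilities rather than hard labels, which is why it takes `y_prob` instead of `y_pred`.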

In [None]:
# import the library
from sklearn.neighbors import KNeighborsClassifier

# training (X_train, y_train, X_test, y_test are assumed to come from an earlier train/test split)
k = 4
neigh = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
print(neigh)

# predicting
yhat = neigh.predict(X_test)
print(yhat[0:5])

# accuracy evaluation
from sklearn import metrics
# For single-label (binary or multiclass) targets, accuracy_score is the fraction of correct predictions.
# In multilabel classification it computes subset accuracy. It is not the same as metrics.jaccard_score.
print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat))

In [None]:
'''
How can we choose the right value for k?
The general approach is to reserve part of your data for testing the accuracy of the model.
Start with k = 1, fit the model on the training part, and calculate the prediction accuracy over all samples in the test set.
Repeat this process with increasing k and keep the k that gives the best accuracy.
'''
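
The search described above might look like the following sketch. To keep it runnable without extra dependencies it uses a small hand-written `knn_predict` helper and synthetic two-cluster data, both of which are assumptions for illustration. With the notebook's own data, `KNeighborsClassifier(n_neighbors=k)` would take the helper's place inside the loop.

```python
import math
import random
from collections import Counter

def knn_predict(X_train, y_train, x_new, k):
    """Majority vote among the k training points closest to x_new."""
    nearest = sorted(zip(X_train, y_train), key=lambda xy: math.dist(xy[0], x_new))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def make_cluster(center, label, n):
    """Synthetic 2-D points scattered around a center (illustrative data only)."""
    return [((random.gauss(center[0], 1.0), random.gauss(center[1], 1.0)), label)
            for _ in range(n)]

random.seed(0)
data = make_cluster((0.0, 0.0), 0, 60) + make_cluster((3.0, 3.0), 1, 60)
random.shuffle(data)

# Reserve 20% of the data for testing
split = int(0.8 * len(data))
train, test = data[:split], data[split:]
X_train, y_train = [x for x, _ in train], [y for _, y in train]
X_test, y_test = [x for x, _ in test], [y for _, y in test]

# Try k = 1..10 and record the test accuracy for each value
accuracies = {}
for k in range(1, 11):
    preds = [knn_predict(X_train, y_train, x, k) for x in X_test]
    accuracies[k] = sum(p == t for p, t in zip(preds, y_test)) / len(y_test)

best_k = max(accuracies, key=accuracies.get)
print(accuracies)
print("best k:", best_k)
```

When several values of k tie for the best accuracy, `max` returns the smallest of them, since the dictionary is filled in increasing order of k.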
