# Classification

### Use cases
- Email filtering
- Speech recognition
- Handwriting recognition
- Biometric identification
- Document classification

### Types
- Binary classification
- Multiclass classification

### Algorithms
- Decision trees (ID3, C4.5, C5.0)
- Naive Bayes
- Linear discriminant analysis
- k-nearest neighbors (KNN)
- Logistic regression
- Neural networks
- Support vector machines (SVM)

### KNN
KNN (k-nearest neighbors):
- A method for classifying cases based on their similarity to other cases.
- Cases that are near each other are said to be "neighbors".
- It assumes that similar cases with the same class label lie near each other.

The algorithm:
1. Pick a value for k.
2. Calculate the distance from the unknown case to every case in the training data.
3. Select the k observations in the training data that are nearest to the unknown data point.
4. Predict the response of the unknown data point using the most common response value among the k nearest neighbors.
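
The four steps above can be sketched in plain Python. This is a minimal illustration, not how scikit-learn implements it: the function name `knn_predict`, the Euclidean distance choice, and the toy data are all assumptions made for this example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from the unknown case to every training case
    distances = [math.dist(point, query) for point in train_points]
    # Step 3: indices of the k training cases with the smallest distance
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Step 4: most common label among those k neighbors
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two well-separated clusters
toy_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
              (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
toy_labels = [0, 0, 0, 1, 1, 1]

# Step 1 is the choice of k, passed in here as k=3
print(knn_predict(toy_points, toy_labels, (1.1, 1.0), k=3))
print(knn_predict(toy_points, toy_labels, (5.1, 5.0), k=3))
```

Computing the distance to every training case is the simplest approach; it becomes slow on large datasets, which is one reason library implementations offer tree-based neighbor searches.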

Evaluation Metrics
- Jaccard index (simplest): J(y, y_hat) = |y ∩ y_hat| / |y ∪ y_hat|; 1 is the best, 0 is the worst
- Confusion matrix: counts of correct and incorrect predictions per class (tp, fp, fn, tn)
- F1-score
  - precision: tp/(tp+fp)
  - recall: tp/(tp+fn)
  - score equation: 2*(prc * rec)/(prc+rec)
  - 1 is the best, 0 is the worst
  - can be used in multiclass classifiers as well
- Log loss (logarithmic loss)
  - equation: -(1/n)*sum((y*log(y_hat)+(1-y)*log(1-y_hat))), where y_hat is the predicted probability of class 1
  - 0 is the best; larger values are worse (there is no upper bound)
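
As a worked example, the formulas above can be evaluated by hand on a small binary problem. The labels and probabilities below are made up for illustration. The Jaccard index is computed for the positive class, where it reduces to tp/(tp+fp+fn), and the log loss assumes the natural logarithm with probabilities clipped away from 0 and 1 to avoid log(0).

```python
import math

# Made-up true labels, hard predictions, and predicted probabilities of class 1
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
y_prob = [0.9, 0.1, 0.4, 0.8, 0.2, 0.7, 0.6, 0.3]

# Confusion-matrix counts for the positive class
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

# |y ∩ y_hat| = tp and |y ∪ y_hat| = tp + fp + fn for the positive class
jaccard = tp / (tp + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

def log_loss(y, p_hat, eps=1e-15):
    """Average negative log-likelihood of the true labels."""
    total = 0.0
    for yi, pi in zip(y, p_hat):
        pi = min(max(pi, eps), 1 - eps)  # clip so log() never sees 0
        total += yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return -total / len(y)

print("Jaccard:", jaccard)
print("Precision:", precision, "Recall:", recall, "F1:", f1)
print("Log loss:", log_loss(y_true, y_prob))
```

scikit-learn provides `jaccard_score`, `f1_score`, and `log_loss` in `sklearn.metrics` for the same calculations on real data.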

In [None]:
# import the library
from sklearn.neighbors import KNeighborsClassifier

# training
k = 4
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
print(neigh)

# predicting
yhat = neigh.predict(X_test)
print(yhat[0:5])

# accuracy evaluation
from sklearn import metrics
# accuracy_score returns the fraction of samples whose predicted label matches the true label.
# In multilabel classification it computes subset accuracy (every label of a sample must match).
print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat))

In [None]:
''' 
How can we choose the right value for k?
The general solution is to reserve part of your data for testing the accuracy of the model.
Start with k = 1, train on the training part, and calculate the prediction accuracy on all samples in the test set.
Repeat this process for increasing values of k and pick the k that gives the best accuracy.
'''
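
The search over k described above might look like the following sketch. The Iris dataset, the 80/20 split, the `random_state` value, and the upper limit `max_k = 10` are assumptions chosen only to make the example self-contained.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Example data (an assumption; substitute your own X and y)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

max_k = 10
accuracies = []
for k in range(1, max_k + 1):
    # Train on the training part, score on the held-out part
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    accuracies.append(metrics.accuracy_score(y_test, model.predict(X_test)))

# accuracies[0] belongs to k = 1, so shift the index by one
best_k = accuracies.index(max(accuracies)) + 1
print("Best k:", best_k, "with accuracy", max(accuracies))
```

Because k is tuned on the same held-out data that is used to report accuracy, the reported number tends to be optimistic. A separate validation split or cross-validation (for example `cross_val_score`) is a common alternative.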


### Standardization (z-score normalization)
##### Formula: z = (x - mu) / sigma
##### Effect: 
- Transforms the feature to have a mean of 0 and a standard deviation of 1.
- Helps handle features with different units or scales.
##### When to use
- When the data is approximately normally distributed, or when features have very different units, scales, or variances.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

### Min-Max Scaling (Normalization)

##### Formula
x' = [x - min(x)] / [max(x) - min(x)]

##### Effect
Rescales each feature to the range [0, 1], so all features share the same range and comparisons are straightforward.

##### When to use
When you want all features on a comparable scale without changing the shape of their distribution.

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

### Robust Scaling

##### Formula:
x' = [x - median(x)] / IQR

##### Effect
Scales data based on median and interquartile range, making it robust to outliers

##### When to use
When the dataset has many outliers that could skew standardization or normalization



In [None]:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
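
To see how the three scalers differ, the sketch below applies each one to a single made-up feature that contains one large outlier, and checks the results against the formulas given in each section. The data values and variable names are assumptions for illustration. Note that `StandardScaler` uses the population standard deviation, which is also the NumPy default.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature, five samples, the last one an outlier (made-up values)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

z = StandardScaler().fit_transform(x)
m = MinMaxScaler().fit_transform(x)
r = RobustScaler().fit_transform(x)

# The same transforms written out from the formulas
z_manual = (x - x.mean()) / x.std()
m_manual = (x - x.min()) / (x.max() - x.min())
q1, median, q3 = np.percentile(x, [25, 50, 75])
r_manual = (x - median) / (q3 - q1)

print("standardized:", z.ravel())
print("min-max:     ", m.ravel())
print("robust:      ", r.ravel())
```

With min-max scaling the outlier takes the value 1 and the four ordinary samples are squeezed close to 0. With robust scaling the ordinary samples stay evenly spread around 0, because the median and IQR are barely affected by the outlier.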

### Decision Tree
Decision trees are built by splitting the training set into distinct nodes,
where each node ideally contains all or most of one category of the data.

Decision trees work by testing an attribute and branching the cases based on the result of the test:
- Each internal node corresponds to a test.
- Each branch corresponds to a result of the test.
- Each leaf node assigns a case to a class.

A tree is constructed by considering the attributes one by one. The algorithm is:
1. Choose an attribute from the dataset.
2. Calculate the significance of the attribute in splitting the data.
3. Split the data based on the value of the best attribute.
4. Go to each branch and repeat for the rest of the attributes.

Decision trees are built using recursive partitioning to classify the data.
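
A minimal scikit-learn sketch of training such a tree follows. The Iris dataset, the split sizes, and `max_depth=3` are assumptions for illustration. Note also that scikit-learn's `DecisionTreeClassifier` is CART-based and makes binary splits, so it is not an ID3 or C4.5 implementation; `criterion="entropy"` only selects entropy as the impurity measure.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Example data (an assumption; substitute your own features and labels)
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=3
)

# Recursive partitioning, choosing each split by entropy reduction
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

# Text view of the learned tests (internal nodes) and classes (leaves)
print(export_text(tree, feature_names=list(iris.feature_names)))

accuracy = metrics.accuracy_score(y_test, tree.predict(X_test))
print("Test set accuracy:", accuracy)
```

Limiting `max_depth` keeps the printed tree small and reduces overfitting; leaving it unset lets the tree grow until every leaf is pure.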

A good split attribute gives:
- more predictiveness
- less impurity
- lower entropy

Entropy is the amount of information disorder, or randomness, in the data.
It is used to measure the homogeneity of the samples in a node.
The lower the entropy, the less uniform the class distribution and the purer the node:
entropy is 0 when a node contains a single class and 1 when two classes are split evenly.

Entropy formula for two classes A and B: -p(A)log2(p(A)) - p(B)log2(p(B))

Information gain is the information that increases the level of certainty after splitting.
The better tree is the one with the higher information gain after splitting.

information gain = (entropy before split) - (weighted entropy after split)
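
The entropy and information gain formulas can be checked with a short sketch. The helper names `entropy` and `information_gain` and the example label lists are assumptions for illustration. The weights in the weighted entropy are the fractions of samples that fall into each branch.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, branches):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(parent)
    weighted = sum(len(b) / n * entropy(b) for b in branches)
    return entropy(parent) - weighted

# Made-up node with five cases of class A and five of class B
parent = ["A"] * 5 + ["B"] * 5

# A split that separates the classes completely
perfect_split = [["A"] * 5, ["B"] * 5]
# A split that leaves both branches almost as mixed as the parent
poor_split = [["A", "A", "B", "B", "A"], ["B", "B", "A", "A", "B"]]

print("Parent entropy:", entropy(parent))
print("Gain (perfect split):", information_gain(parent, perfect_split))
print("Gain (poor split):", information_gain(parent, poor_split))
```

The perfect split removes all of the parent's entropy, so its gain equals the parent entropy, while the poor split gains almost nothing. A tree builder would choose the first split.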

Information gain and entropy move in opposite directions: as the weighted entropy after the split decreases, the information gain increases.

