# Machine Learning Techniques Explained


## Table of Contents


&emsp;&emsp;[1 Basics](#chp1) <br>
&emsp;&emsp;&emsp;&emsp;1.1 Core concepts <br>
&emsp;&emsp;&emsp;&emsp;1.2 Strategies <br>
&emsp;&emsp;2 Classifiers <br>
&emsp;&emsp;&emsp;&emsp;2.1 logistic regression  <br>
&emsp;&emsp;&emsp;&emsp;2.2 K nearest neighbor classifier (kNN) <br>
&emsp;&emsp;&emsp;&emsp;2.3 K-means <br>
&emsp;&emsp;&emsp;&emsp;2.4 Expectation Maximization (EM) <br>
&emsp;&emsp;&emsp;&emsp;2.5 Density-based spatial clustering of applications with noise (DBSCAN) <br>
&emsp;&emsp;&emsp;&emsp;2.6 Support vector machine classifier (SVM-classifier) <br>
&emsp;&emsp;&emsp;&emsp;2.7 Decision tree classifier (DTree-classifier) <br>
&emsp;&emsp;&emsp;&emsp;2.8 Random forest classifier (RForest-classifier) <br>
&emsp;&emsp;&emsp;&emsp;2.9 Gradient boost tree classifier (GBoost classifier) <br>
&emsp;&emsp;&emsp;&emsp;2.10 Extreme boost tree classifier (XGBoost classifier) <br>
&emsp;&emsp;&emsp;&emsp;2.11 Adaptive boosting classifier (AdaBoost classifier) <br>
&emsp;&emsp;3 Regressors <br>
&emsp;&emsp;&emsp;&emsp;3.1 Linear regression <br>
&emsp;&emsp;&emsp;&emsp;3.2 Linear mixed model <br>
&emsp;&emsp;&emsp;&emsp;3.3 Penalized regression <br>
&emsp;&emsp;&emsp;&emsp;3.4 Support vector machine regression (SVM-regression) <br>
&emsp;&emsp;&emsp;&emsp;3.5 Decision tree regression (DTree-regression) <br>
&emsp;&emsp;&emsp;&emsp;3.6 Random forest regression (RForest-regression) <br>
&emsp;&emsp;&emsp;&emsp;3.7 Gradient boost tree regression (GBoost regression) <br>
&emsp;&emsp;&emsp;&emsp;3.8 Extreme boost tree regression (XGBoost regression) <br>
&emsp;&emsp;&emsp;&emsp;3.9 Adaptive boosting regression (AdaBoost regression) <br>
&emsp;&emsp;4 Data Preprocessing and Feature Engineering <br>
&emsp;&emsp;&emsp;&emsp;4.1 tabular data <br>
&emsp;&emsp;&emsp;&emsp;4.2 images <br>
&emsp;&emsp;&emsp;&emsp;4.3 time-series <br>
&emsp;&emsp;&emsp;&emsp;4.4 text <br>
       

<a id='chp1'></a>
# Basics

> - [General concepts](#sec1-1) 
- [Classification metrics](#sec1-2) 

<a id='sec1-1'></a>
* ***What is the difference between supervised learning and unsupervised learning?***

> Supervised learning trains a model on a dataset that contains both input features and targets, so that the model learns a mapping from the input space to the output space. Unsupervised learning, on the other hand, does not require targets. It trains only on the input features and tries to extract patterns from the input space; these patterns can then be used to group the data or to engineer features.

* ***What is the bias-variance trade off?***

> Bias is the error (unexplained residual) of a model that comes from oversimplified assumptions in the model hypothesis. A high-bias model tends to under-fit the data, which results in poor prediction accuracy. Variance, in contrast, grows with the complexity of the model. A very complex hypothesis tends to over-fit the training data and learn its noise, so its fitted parameters can vary drastically across different random samples from the same population. Such a model is said to have high variance and does not generalize well. In practice, data scientists often need to tune model complexity to balance the trade-off between bias and variance, or in other words, between under-fitting and over-fitting.
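
The trade-off can be illustrated with a small simulation. The sketch below is only an illustration under assumed settings: the target function `sin(x)`, the noise level, the sample size, and the two toy models (a constant-mean predictor and a 1-nearest-neighbor predictor) are all hypothetical choices, not taken from the text above. It refits each model on many resampled training sets and estimates the squared bias and the variance of the prediction at one query point.

```python
import math
import random
import statistics


def true_f(x):
    # Assumed ground-truth function (hypothetical, for illustration only).
    return math.sin(x)


def sample_dataset(rng, n=30, noise_sd=0.5):
    """Draw n noisy observations of true_f on [0, pi]."""
    xs = [rng.uniform(0.0, math.pi) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def predict_mean(xs, ys, x0):
    """Very simple model: ignore x0 and predict the average target."""
    return sum(ys) / len(ys)


def predict_1nn(xs, ys, x0):
    """Very flexible model: copy the target of the nearest training point."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]


def bias2_and_variance(predict, x0, trials=500, seed=0):
    """Estimate squared bias and variance of `predict` at x0 over resampled training sets."""
    rng = random.Random(seed)
    preds = []
    for _ in range(trials):
        xs, ys = sample_dataset(rng)
        preds.append(predict(xs, ys, x0))
    bias2 = (statistics.mean(preds) - true_f(x0)) ** 2
    variance = statistics.pvariance(preds)
    return bias2, variance


x0 = math.pi / 2
mean_bias2, mean_var = bias2_and_variance(predict_mean, x0)
nn_bias2, nn_var = bias2_and_variance(predict_1nn, x0)
print(f"mean model: bias^2={mean_bias2:.3f}, variance={mean_var:.3f}")
print(f"1-NN model: bias^2={nn_bias2:.3f}, variance={nn_var:.3f}")
```

Under these assumptions the constant-mean model should show the larger squared bias (it under-fits), while the 1-nearest-neighbor model should show the larger variance (it follows the noise).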

<a id='sec1-2'></a>
* ***What is a confusion matrix?***

> A confusion matrix is a 2 by 2 table that contains the four possible outcomes of a binary classification. The columns are the predicted classes and the rows are the actual classes. Element (0, 0) is the number of samples that are predicted as positive and are actually positive, called **true positive (TP)**. Element (0, 1) is the number of samples that are actually positive but predicted as negative, called **false negative (FN)**. Element (1, 0) is the number of negative samples that are predicted as positive, called **false positive (FP)**. Element (1, 1) is the number of negative samples that are correctly predicted as negative, called **true negative (TN)**. <br>
<img src="pics/1-2.png" align="center"/>
[Figure 1-2. Confusion matrix and associated binary classification metrics](https://en.wikipedia.org/wiki/F1_score)
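
As a minimal sketch of the layout described above (rows are actual classes, columns are predicted classes), the four cells can be counted directly from paired labels. The function name `confusion_matrix_2x2` and the label lists are hypothetical and only serve the illustration.

```python
def confusion_matrix_2x2(y_true, y_pred, positive=1):
    """Return [[TP, FN], [FP, TN]]: rows are actual classes, columns are predicted classes."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1  # actual positive, predicted positive
            else:
                fn += 1  # actual positive, predicted negative
        else:
            if predicted == positive:
                fp += 1  # actual negative, predicted positive
            else:
                tn += 1  # actual negative, predicted negative
    return [[tp, fn], [fp, tn]]


# Hypothetical labels, 1 = positive and 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
matrix = confusion_matrix_2x2(y_true, y_pred)
print(matrix)  # [[3, 1], [1, 3]]
```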

> Many binary classification metrics are derived from the confusion matrix: <br>
1. **Error rate**: (FP + FN) / Total_Samples
2. **Overall accuracy**: (TP + TN) / Total_Samples
3. **Sensitivity (recall or true positive rate)**: TP / Total_Positives
4. **Fall-out (false positive rate)**: FP / Total_Negatives
5. **Specificity (true negative rate)**: TN / Total_Negatives
6. **Precision (positive predicted value)**: TP / (TP + FP)
7. **F-score (weighted harmonic mean of precision and recall)**: (1 + $\beta^2$) $\times$ (Precision $\times$ Recall) / ($\beta^2 \times$ Precision + Recall). When $\beta$ is 1, it is called the F1-score, which weights precision and recall equally. When $\beta$ is between 0 and 1, it puts more weight on precision, so false positives are penalized more heavily. When $\beta$ is above 1, it puts more weight on recall, so false negatives are penalized more heavily.
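
The formulas in the list above can be sketched as one small function. The function name `binary_metrics` and the counts passed to it are hypothetical, and the sketch assumes every denominator is non-zero (at least one actual positive, one actual negative, and one positive prediction).

```python
def binary_metrics(tp, fn, fp, tn, beta=1.0):
    """Compute the listed metrics from the four confusion-matrix counts."""
    total = tp + fn + fp + tn
    recall = tp / (tp + fn)          # sensitivity, true positive rate
    precision = tp / (tp + fp)       # positive predictive value
    b2 = beta ** 2
    return {
        "error_rate": (fp + fn) / total,
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "fall_out": fp / (fp + tn),      # false positive rate
        "specificity": tn / (fp + tn),   # true negative rate
        "precision": precision,
        "f_score": (1 + b2) * precision * recall / (b2 * precision + recall),
    }


# Hypothetical counts for illustration only.
counts = dict(tp=40, fn=10, fp=20, tn=30)
f1 = binary_metrics(**counts)
f2 = binary_metrics(**counts, beta=2.0)
f_half = binary_metrics(**counts, beta=0.5)
for name, value in f1.items():
    print(f"{name}: {value:.3f}")
```

With these counts recall (0.8) is higher than precision (about 0.667), so the F-score rises as $\beta$ moves above 1 and falls as $\beta$ moves below 1.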

> In image processing and remote sensing, the confusion matrix is extended to multi-class classification. Unlike the binary confusion matrix above, the columns are normally defined as the references and the rows as the predictions. <br>
<img src="pics/1-3.png" height=300 width=300 align="center"/>
[Figure 1-3. An example of confusion matrix used for image classification](http://gis.humboldt.edu/OLM/Courses/GSP_216_Online/lesson6-2/metrics.html)

> The related metrics are:
1. **Overall accuracy and error (overall)**: Overall_Accuracy = Total_Correct_Pred / Total_Samples, and Overall_Error = 1 - Overall_Accuracy
2. **Omission error (per class)**: Incorrect_Classified_Ref / Total_Ref
3. **Commission error (per class)**: Incorrect_Classified_Pred / Total_Pred
4. **Producer's accuracy (per class)**: Correct_Classified_Ref / Total_Ref = 1 - Omission_Error
5. **User's accuracy (per class)**: Correct_Classified_Pred / Total_Pred = 1 - Commission_Error
6. **Kappa (both overall and conditional on each class)** compares the classification outcome with the agreement expected from a random classification. The Kappa index ranges from -1 to 1, where 0 indicates agreement no better than random, values below 0 indicate worse than random, and values above 0 indicate better than random. **Conditional Kappa** for class $i$ is given as:  
\begin{equation*}
Kappa(i) = \frac{P_{i,i}(y=i) - P_i(y=i) P(y=i)} {P_i(y=i) -  P_i(y=i) P(y=i) }
\end{equation*}
 <br> where $P_{i,i}(y=i)$ is the probability of agreement on class $i$, i.e. the number of correctly classified samples of class $i$ divided by the total number of reference samples;
 <br> $P_i(y=i)$ is the probability that a reference sample belongs to class $i$, i.e. the number of reference samples of class $i$ divided by the total number of reference samples;
 <br> and $P(y=i)$ is the probability that a sample is classified as $i$, i.e. the number of predictions of class $i$ divided by the total number of reference samples. 
<br> **Overall Kappa** is given as:
\begin{equation*}
Kappa = \frac{\sum{P_{i,i}(y=i)} - \sum{P_i(y=i)P(y=i)} } {1 - \sum{P_i(y=i) P(y=i)} }
\end{equation*}
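
The per-class accuracies and the overall Kappa can be sketched from a multi-class confusion matrix as follows. The 3-class matrix and the function name `multiclass_metrics` are hypothetical. The sketch assumes the layout described above (rows are predictions, columns are references) and computes the chance agreement as the sum over classes of the prediction share times the reference share, which is the standard form of Cohen's Kappa.

```python
def multiclass_metrics(matrix):
    """matrix[r][c] = number of samples predicted as class r whose reference class is c."""
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    pred_totals = [sum(matrix[i]) for i in range(k)]                      # row sums
    ref_totals = [sum(matrix[r][i] for r in range(k)) for i in range(k)]  # column sums
    correct = [matrix[i][i] for i in range(k)]

    overall_accuracy = sum(correct) / total
    producers = [correct[i] / ref_totals[i] for i in range(k)]   # 1 - omission error
    users = [correct[i] / pred_totals[i] for i in range(k)]      # 1 - commission error

    # Agreement expected by chance: sum of (reference share * prediction share).
    chance = sum(ref_totals[i] * pred_totals[i] for i in range(k)) / total ** 2
    kappa = (overall_accuracy - chance) / (1 - chance)
    return {
        "overall_accuracy": overall_accuracy,
        "producers_accuracy": producers,
        "users_accuracy": users,
        "omission_error": [1 - p for p in producers],
        "commission_error": [1 - u for u in users],
        "kappa": kappa,
    }


# Hypothetical 3-class confusion matrix (rows: predictions, columns: references).
example = [
    [50, 5, 5],
    [10, 40, 5],
    [0, 5, 30],
]
result = multiclass_metrics(example)
print(f"overall accuracy: {result['overall_accuracy']:.3f}")
print(f"kappa: {result['kappa']:.3f}")
```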



* ***What is an ROC curve?***

> The ROC (Receiver Operating Characteristic) curve plots the true positive rate against the false positive rate at various classification thresholds. It is often used to assess the trade-off between the true positive rate and the false positive rate of a model's predictions. <br>
<img src="pics/1-4.png" height=300 width=300 align="center"/>
[Figure 1-4. An example of ROC curve for different models](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
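
A minimal sketch of how the points of such a curve can be obtained: sweep the decision threshold over every distinct predicted score and record the false positive rate and true positive rate at each step. The labels, scores, and function names below are hypothetical. The area under the curve, computed here with the trapezoidal rule, is a common single-number summary that goes slightly beyond the text above.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, one per distinct threshold, starting from (0, 0)."""
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        # A sample is predicted positive when its score reaches the threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the piecewise-linear curve through the points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels (1 = positive) and predicted scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
auc = area_under_curve(points)
print(points)
print(f"AUC = {auc:.3f}")  # AUC = 0.889
```

The sketch assumes the data contain at least one positive and one negative sample; otherwise one of the rates is undefined.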

* ***What is selection bias?***

