<br>
<br>
<br>
<br>

# DAV 6150 Module 8: "Distance-based" Machine Learning Algorithms
<br>
<br>
<br>

# Midterm Exam Comments

Data science practitioners are expected to have a clear understanding of each of these fundamental concepts, so it is recommended that you review the M1 - M7 materials if needed.

- Logistic Regression requires that explanatory variables be __independent of one another__.


- When two numeric variables exhibit a linear relationship it is possible to __Calculate and interpret their correlation coefficient__.


- Overfitting can be the result of each of the following: a) __Use of a model that is too sophisticated relative to the data set__; b) __High Variance__; c) __outliers__.


- A numeric explanatory variable that has a great deal of variance should be __Included in a model as long as it is not collinear with other explanatory variables__.


- The difference between an interaction feature and an indicator variable is: __Interaction features are derived from attributes already present within a data set while indicator variables are used to introduce external domain knowledge to a data set__.


- Letter grades are an example of an __ordinal categorical variable__, i.e., A > B > C > D, etc. Therefore, neither linear models nor binary or multinomial logistic regression models should be used for purposes of creating a model that attempts to predict letter grades.


- The output ‘V’ of a binary logistic regression model is guaranteed to fall within the range of (0 <= V <= 1) because: __Logistic regression models incorporate a logistic (sigmoid) function that limits the range of their output__.


- Standard errors and RMSE differ in that: __RMSE measures the distance between actual and predicted values while standard errors are a measure of how certain we are regarding the value of the coefficients of explanatory variables__.


- Covariance vs. Correlation: a) __Covariance values are unbounded while correlation values are bounded__; b) __Covariance tells us only the direction of the linear relationship between two variables while correlation measures both the strength and direction of the relationship between two variables__; c) __Covariance can be affected by a change in the scale of variables while correlation is not affected by changes in the scale of variables__; d) __Covariance values can be infinitely negative while correlation values cannot be less than -1__.


- If your response variable is an overdispersed whole number, a __negative binomial regression model__ should be used.


- If a model underfits training data, it suffers from __high bias__.


- __Sensitivity__ and __Recall__ are different names for the exact same metric, so a model having high sensitivity can also be said to have high recall.


- Aggregating the output of multiple models in an attempt to improve the quality of your predictions is a method used by __ensemble models__, of which a __random forest__ is one example.


- In general, we should prefer regression models that have the following characteristics: a) __Higher log likelihood, higher F1, and lower AIC scores__; b) __Higher adjusted R^2, lower BIC, and higher F1 scores__; c) __Lower BIC and higher log likelihood scores__.


- __Dimensionality Reduction__ and __Feature Selection__ are not the same thing! __Dimensionality Reduction__ techniques include __PCA__ and __Singular Value Decomposition__. By contrast, __Feature Selection__ techniques include __variance thresholds, wrapper methods, correlation thresholds, stepwise search, and recursive feature elimination__.

## Module 7 Assignment Review

The __TARGET__ response variable is an __imbalanced class__: approximately 72.67% of its values (1 - (3,008/11,008), calculated from the non-duplicate observations within the data set) are 'no' while approximately 27.33% are 'yes'. A model that simply predicts that a customer did not buy an additional insurance product for every observation would therefore be 72.67% accurate; equivalently, the __null error rate__ (how often we would be wrong if we always predicted the majority class) is 0.2733. Obviously, such a model wouldn't be very useful for purposes of deciding whether or not an insurance company customer is likely to purchase an additional product from the company.
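The arithmetic behind the class proportions can be checked with a few lines of Python. This is only a sketch that reuses the two counts quoted above; the variable names are illustrative and are not taken from the assignment notebook.

```python
# Counts quoted in the paragraph above.
n_total = 11008   # non-duplicate observations
n_yes = 3008      # observations where TARGET is 'yes'

share_yes = n_yes / n_total
share_no = 1 - share_yes

# Accuracy of a "model" that predicts the majority class for every observation.
majority_baseline_accuracy = max(share_yes, share_no)

print(f"share of 'no':  {share_no:.4f}")
print(f"share of 'yes': {share_yes:.4f}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.4f}")
```

Any candidate model should be compared against this majority-class baseline before its accuracy is treated as meaningful.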

We learned in __Module 5__ that reliance on an __accuracy__ metric for purposes of comparing similar models __is not appropriate when the response variable is an imbalanced class__. 

Which metrics should we rely on?

- __precision__: TP / (TP + FP) If we are trying to __minimize the number of false positives (FP)__, we should use __precision__ as one of our primary model performance metrics.


- __recall__: TP / (TP + FN) If we are trying to __maximize the number of true positives (TP)__ or __minimize the number of false negatives (FN)__, we should use __recall__ as one of our primary model performance metrics. Why? Because the number of actual positives (TP + FN) is fixed, maximizing TP necessarily minimizes FN.


- __F1 Score__: 2 * (precision * recall) / (precision + recall) is the __harmonic mean of precision and recall__. Therefore, F1 score __is a measure of how well a model handles both false positives and false negatives__. Remember: models with relatively larger F1 scores are preferable to models having relatively smaller F1 scores.
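The three formulas above can be computed directly from confusion-matrix counts. The sketch below uses a hypothetical helper, `precision_recall_f1`, and made-up counts purely for illustration; they do not come from the Module 7 data.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 60 true positives, 20 false positives, 40 false negatives.
precision, recall, f1 = precision_recall_f1(tp=60, fp=20, fn=40)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Note that the F1 score always falls between the precision and recall values and sits closer to the smaller of the two, so a model cannot earn a high F1 by doing well on only one of them.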


#### How else might we handle imbalanced classes?

__Synthetic Minority Oversampling Technique ("SMOTE")__: The concept of SMOTE was explained in the Module 3 Assigned readings (see __Machine Learning Pocket References, Chapter 9__). SMOTE works by __synthesizing__ new examples from the minority class. The process works as follows:


- A random example from the minority class is chosen. 


- k of the nearest neighbors for that example are found (typically k=5). 


- One of those k-nearest neighbors is randomly selected. Then, a synthetic example is created at a randomly selected point between the two examples in feature space.


This process is repeated as many times as needed to balance out the classifications for the imbalanced variable.
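The three steps can be sketched in plain Python as follows. This is a simplified illustration and not the reference SMOTE implementation: the function name `smote_sketch`, its parameters, and the sample points are all hypothetical, and Euclidean distance is assumed for the neighbor search. In practice a library implementation (for example, the `SMOTE` class in the imbalanced-learn package, which is not part of the assigned materials) would normally be used.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points from a list of minority-class points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Step 1: choose a random example from the minority class.
        base = rng.choice(minority)
        # Step 2: find its k nearest minority-class neighbors (excluding itself).
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        # Step 3: pick one neighbor and interpolate at a random point
        # on the line segment between the two examples.
        neighbor = rng.choice(neighbors)
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical minority-class observations with two numeric features.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.8), (1.3, 1.0)]
new_points = smote_sketch(minority, n_new=4, k=3)
print(new_points)
```

Because each synthetic point lies on a line segment between two existing minority examples, it always stays within the region of feature space already occupied by the minority class.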

The approach is effective because the new synthetic examples created for the minority class are __plausible__, meaning their features are relatively close to those of existing examples from the minority class.

When finished, we then have a balanced class. So if our variable is a binary categorical feature, the __null error rate__ for the data that includes the synthesized observations will be __.50__. With a .50 null error rate, an accuracy metric can be a very effective tool for comparing models.

For more details + examples see this link: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/

#### So should we always use SMOTE if we have an imbalanced response variable?

Not necessarily. It can be effective in some instances but is not guaranteed to improve model performance. As always, we need to test it empirically to judge its efficacy relative to a specific data set.



# K-Nearest Neighbors

__K-Nearest Neighbors (KNN)__ is a __supervised machine learning algorithm__ most frequently used for solving __classification__ problems.


KNN has two underlying assumptions:

- 1) We can use a distance metric to calculate the “distance” between any two given data observations within a data set


- 2) Data observations that are “near” to one another are likely to be similar to each other. 


__“K”__ is a constant representing the number of nearby / neighboring training set data points (or data observations) to be used to predict a classification for a given data point.
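To make the idea concrete, here is a minimal from-scratch sketch of KNN classification by majority vote. It assumes Euclidean distance (one of the metrics discussed later in this section); the function `knn_predict` and the toy data are hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Distance from the query point to every training observation.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    # Labels of the k closest training observations.
    k_labels = [label for _, label in distances[:k]]
    # The most common label among those neighbors is the prediction.
    return Counter(k_labels).most_common(1)[0][0]


# Hypothetical training data: two well-separated groups.
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (6.5, 7.0), (7.0, 6.0)]
train_y = ["a", "a", "a", "b", "b", "b"]

pred_near_a = knn_predict(train_X, train_y, (1.8, 1.5), k=3)
pred_near_b = knn_predict(train_X, train_y, (6.2, 6.4), k=3)
print(pred_near_a, pred_near_b)
```

Note that there is no training step beyond storing the data: all of the work happens at prediction time, which is why KNN can become slow on large data sets.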


We can select from a wide variety of distance metrics for use within a KNN implementation. __Minkowski Distance__ is __a generalized distance formula__ that can be used as a framework for calculating a variety of distance measures.

### The Minkowski Distance generalized formula is as follows:

## $\left (\sum _{i=1}^{n} |x_{i} - y_{i}|^p \right) ^{1/p}$

<br>

We can manipulate the value of $p$ in the above formula to derive different distance metrics, each of which is explained graphically __here__: http://www.ieee.ma/uaesb/pdf/distances-in-classification.pdf



#### p = 1: Manhattan Distance - Calculate Distance via a Grid-Like Path

## $d = {\sum _{i=1}^n |x_{i} - y_{i}| } $ 

<br>

####  p = 2: Euclidean Distance - Calculate "As the Crow Flies" Distance Between 2 Points on a 2-D Plane

This is the classic Euclidean formula: 

## $d(x,y) = \sqrt{\sum _{i=1}^{n} \left(x_{i}-y_{i}\right)^2 }$

<br>


#### p = $\infty$:  Chebyshev Distance - Calculate the Largest Difference Along Any Single Dimension Between Two Vectors

Chebyshev Distance is also sometimes referred to as __chessboard distance__.


## $d_{chebyshev} (x,y) = \max\limits_i (|x_{i} - y_{i}|) $

<br>
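The relationship between the generalized formula and its three special cases can be checked numerically. The sketch below uses a hypothetical helper named `minkowski` and two made-up vectors; it treats $p = \infty$ as the maximum absolute difference, as in the Chebyshev formula above.

```python
def minkowski(x, y, p):
    """Minkowski distance between two equal-length vectors for a given p."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == float("inf"):
        # Limiting case: the largest single-coordinate difference.
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1 / p)


x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)    # coordinate differences: 3, 4, 0

manhattan = minkowski(x, y, 1)              # 3 + 4 + 0
euclidean = minkowski(x, y, 2)              # sqrt(9 + 16 + 0)
chebyshev = minkowski(x, y, float("inf"))   # max(3, 4, 0)
large_p = minkowski(x, y, 50)               # approaches the Chebyshev value

print(manhattan, euclidean, chebyshev, large_p)
```

As $p$ grows, the largest coordinate difference dominates the sum, which is why the result for a large finite $p$ is already very close to the Chebyshev distance.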


#### Other Distance Metrics

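__Mahalanobis Distance__: For a given data point and distribution $D$, measure how many standard deviations away the point is from the mean of $D$. In the formula that follows, $\overrightarrow{\mu}$ is the mean of $D$ and $S$ is its covariance matrix.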

#### $D_{M}(\overrightarrow{x}) = \sqrt{(\overrightarrow{x} - \overrightarrow{\mu})^T S^{-1} (\overrightarrow{x} - \overrightarrow{\mu})  } $

<br>
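A small NumPy sketch of the Mahalanobis formula is shown here. It assumes the distribution is estimated from a sample, with the mean and the sample covariance matrix computed from the data; the function name `mahalanobis` and the sample points are hypothetical.

```python
import numpy as np


def mahalanobis(x, data):
    """Mahalanobis distance from point x to the distribution estimated from `data`."""
    data = np.asarray(data, dtype=float)
    mu = data.mean(axis=0)              # estimated mean vector
    S = np.cov(data, rowvar=False)      # sample covariance matrix (rows = observations)
    diff = np.asarray(x, dtype=float) - mu
    return float(np.sqrt(diff @ np.linalg.inv(S) @ diff))


# Hypothetical sample: widely spread along the first feature, tightly along the second.
data = [[0.0, 0.0], [4.0, 0.0], [0.0, 1.0], [4.0, 1.0]]

# Both points are exactly 2 units from the mean (2.0, 0.5) in Euclidean terms.
d_along_wide = mahalanobis([4.0, 0.5], data)
d_along_narrow = mahalanobis([2.0, 2.5], data)
print(d_along_wide, d_along_narrow)
```

The second point is much farther away in Mahalanobis terms because a 2-unit move along the tightly spread feature is far more unusual than the same move along the widely spread one.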


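__Cosine Distance__: Most frequently used to measure the similarity of documents; applied to term frequency vectors constructed from the content of documents. The formula that follows gives the __cosine similarity__ of two vectors; the corresponding cosine distance is $1 - \cos\theta$.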

## $\cos\theta = \frac {\overrightarrow{a} \cdot \overrightarrow{b}} {||\overrightarrow{a}|| ||\overrightarrow{b}||} $

Compare the result of the equation to the following cosine angle values to determine how similar your documents are:

- $\cos\theta = 1$ : Vectors are pointing in the same direction => documents are very similar


- $\cos\theta = 0$ : Vectors are orthogonal => documents share no terms and are unlikely to be related to one another


- $\cos\theta = (- 1)$ : Vectors are pointing in opposite directions => documents are completely dissimilar (this value cannot occur for raw term frequency vectors, since their entries are never negative)
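The cosine calculation for documents can be sketched with the standard library alone. The helpers `term_frequency_vectors` and `cosine_similarity` are hypothetical names, the tokenization is a naive whitespace split, and the example sentences are made up for illustration.

```python
import math
from collections import Counter


def term_frequency_vectors(doc_a, doc_b):
    """Build aligned term-frequency vectors over the combined vocabulary."""
    counts_a = Counter(doc_a.lower().split())
    counts_b = Counter(doc_b.lower().split())
    vocab = sorted(set(counts_a) | set(counts_b))
    return [counts_a[t] for t in vocab], [counts_b[t] for t in vocab]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same word proportions, different lengths: vectors point in the same direction.
sim_same = cosine_similarity(*term_frequency_vectors("the cat sat the cat sat", "the cat sat"))
# No shared words: vectors are orthogonal.
sim_none = cosine_similarity(*term_frequency_vectors("red green blue", "cat dog"))
print(sim_same, sim_none)
```

The first pair shows why cosine measures are popular for text: repeating a document changes the length of its vector but not its direction, so the similarity is unaffected by document length.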


Unfortunately, __there is no single "rule of thumb" or specific set of guidelines for determining which distance function to apply__.  Therefore, __apply your empirical skills__ and test various distance functions to derive a KNN model that works best relative to your data. 

### Implementing KNN in Python

__sklearn__ includes a pre-built KNN classifier: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html


An example from the assigned readings for Module 8: https://towardsdatascience.com/importance-of-distance-metrics-in-machine-learning-modelling-e51395ffe60d
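As a hedged illustration of testing distance functions empirically, the sketch below fits `KNeighborsClassifier` with three different metrics. The choice of the iris data set, the 70/30 split, `n_neighbors=5`, and the standardization step are all assumptions made for this example rather than requirements from the readings; features are standardized because distance calculations are sensitive to differences in feature scale.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit the scaler on the training data only, then apply it to both subsets.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Minkowski with p=1 is Manhattan distance; p=2 is Euclidean distance.
metric_settings = {
    "manhattan": {"metric": "minkowski", "p": 1},
    "euclidean": {"metric": "minkowski", "p": 2},
    "chebyshev": {"metric": "chebyshev"},
}

knn_scores = {}
for name, params in metric_settings.items():
    model = KNeighborsClassifier(n_neighbors=5, **params)
    model.fit(X_train_s, y_train)
    knn_scores[name] = model.score(X_test_s, y_test)  # test-set accuracy

print(knn_scores)
```

On a real project, cross-validation over both the metric and the value of K would give a more reliable comparison than a single train/test split.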

# Support Vector Machines


__Support Vector Machines (SVM)__ are __supervised machine learning algorithms__ most frequently used for solving __classification__ problems.


- SVM uses the concept of __margin classification__, wherein we attempt to identify classifications within a data set by deriving a decision boundary that maximizes the distance between groups of data points. 


- SVM identifies __parallel hyperplanes__ that separate the classes of data __via the maximum distance possible__ relative to the constraints of the data set. 


- The region bounded by the hyperplanes is called the __"margin"__, and __the maximum-margin hyperplane is the hyperplane that lies halfway between them__. The __“Support Vectors”__ are __the data points that lie along the edges of the margin__, i.e., on the two parallel hyperplanes. From the assigned readings: https://towardsdatascience.com/https-medium-com-pupalerushikesh-svm-f4b42800e989


- SVM can be used for both linear + non-linear classification tasks.


- SVM is well suited for use with small to medium size, relatively complex data sets.


### Using SVM with Non-Linear Data

SVM can be successfully applied to non-linear data by adding polynomial features to a model. However, adding a large number of features would negatively impact our ability to implement an effective model. Instead, we make use of a __kernel trick__, which allows us to achieve the results of including many new features without actually adding them to our data. 


__How the "kernel trick" approach works__:  We map our non-linearly separable data into a higher dimensional space via a mathematical function. We then try to find a hyperplane within that higher dimensional space that can effectively separate the samples.
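A small numeric check can make the "trick" less mysterious. The sketch below assumes a degree-2 polynomial kernel $K(x, z) = (x \cdot z)^2$ on two-dimensional inputs, for which one valid explicit feature map is $\phi(x) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)$. The function names and sample points are hypothetical.

```python
import math


def dot(a, b):
    """Ordinary dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def phi(v):
    """Explicit degree-2 feature map from 2-D space into 3-D space."""
    x1, x2 = v
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(a, b):
    """Degree-2 polynomial kernel computed entirely in the original 2-D space."""
    return dot(a, b) ** 2


x, z = (1.0, 2.0), (3.0, 0.5)

explicit = dot(phi(x), phi(z))    # map to 3-D first, then take the dot product
via_kernel = poly_kernel(x, z)    # never leaves 2-D space

print(explicit, via_kernel)
```

Both routes give the same value, which is the point: the kernel returns the dot product that the higher dimensional space would have produced, without ever constructing the new features.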


__What types of kernel tricks are commonly used?__: 

- Polynomial


- Radial Basis Function (RBF)


- Gaussian (a special case of RBF)


- Sigmoid


See this link for a detailed discussion of kernel tricks: https://towardsdatascience.com/understanding-support-vector-machine-part-2-kernel-trick-mercers-theorem-e1e6848c6c4d


Unfortunately, __there is no single "rule of thumb" or specific set of guidelines for determining which of the non-linear kernel tricks to apply__.


If your data is non-linear, defaulting to the use of the __RBF__ kernel trick is often suggested. However, the __polynomial__ kernel can be effective in many instances. Therefore, __apply your empirical skills__ and test various combinations of kernel tricks + SVM tuning parameter settings to derive an SVM model that works best relative to your data. 


__How do we implement SVM + kernel tricks in Python?__:  We can use a pre-built SVM classifier provided within the __scikit-learn__ library:

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html


The __sklearn.svm.SVC__ class includes a "kernel" parameter that allows us to select the kernel function to be applied within the SVM classifier. We can choose from ‘linear’ (use when you have data that is known to be linearly separable), ‘poly’ (polynomial), ‘rbf’ (radial basis function, the default), or ‘sigmoid’. We can also construct + use our own kernel function if we prefer (simply set the "kernel =" parameter to the name of your Python function).
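The sketch below compares the built-in kernels and one user-defined kernel on a small non-linear data set. Everything about the setup is an assumption made for illustration: the `make_moons` synthetic data, the split, the default tuning parameters, and the hypothetical `quadratic_kernel` function, which must accept two arrays of observations and return their Gram matrix.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two interleaving half-circles: not linearly separable.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)


def quadratic_kernel(A, B):
    """Hypothetical custom kernel: (A . B + 1)^2, returned as a Gram matrix."""
    return (A @ B.T + 1.0) ** 2


kernels = {
    "linear": "linear",
    "poly": "poly",
    "rbf": "rbf",
    "sigmoid": "sigmoid",
    "custom": quadratic_kernel,   # a Python callable passed as the kernel
}

svm_scores = {}
for name, kernel in kernels.items():
    # Standardize features before fitting, since SVM is sensitive to feature scale.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    model.fit(X_train, y_train)
    svm_scores[name] = model.score(X_test, y_test)  # test-set accuracy

print(svm_scores)
```

The scores here use default settings only; in practice each kernel should be tuned (for example, `C`, `gamma`, and `degree`) before concluding that one kernel outperforms another on a given data set.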

## Module 8 Assignment Guidelines / Requirements