# Machine Learning Overview
This notebook gives a general overview of machine learning as it pertains to Data Science.

---

## What is Machine Learning?
Broadly speaking, Machine Learning (or ML) is a subset of AI that gives machines the ability to learn automatically and improve from experience without being explicitly programmed to do so.

In the field of data science, however, it is more helpful to think of machine learning as a means of building models of data.

Fundamentally, ML involves building mathematical models to help understand data. 'Learning' happens when these models are given tunable parameters that can be adapted to the observed data; in this way a program can be considered to be 'learning' from the data. Once these models have been 'fit' to previously seen data, they can be used to predict and understand aspects of new, unseen data.
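
As a rough illustration of "tunable parameters adapted to observed data", the sketch below fits the two parameters of a straight line (slope and intercept) to a handful of points and then predicts for an unseen input. The data values and the function names `fit_line` and `predict` are made up for this example.

```python
# Toy illustration: "learning" as fitting two tunable parameters to data.
# The data values and function names are invented for this sketch.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form ordinary least squares for a single feature
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, x):
    """Apply the fitted parameters to a new input."""
    return slope * x + intercept

# Previously seen data (roughly y = 2x + 1 with a little noise)
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [3.1, 4.9, 7.2, 9.0, 10.8]

slope, intercept = fit_line(hours, scores)       # the "learning" step
new_prediction = predict(slope, intercept, 6.0)  # prediction for unseen input
print(slope, intercept, new_prediction)
```

The two numbers returned by `fit_line` are the only thing the program has "learned"; everything it predicts afterwards follows from them.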

This article does a great job of explaining machine learning in layman's terms. All images in this notebook came from this site:

https://vas3k.com/blog/machine_learning/

---

## The Machine Learning Process
No matter what algorithm or model is used to solve a problem with ML, the overall process is mostly the same and generally follows these 7 steps:

1. Define the Objective
2. Data Gathering
3. Preparing Data
4. Data Exploration
5. Building a Model
6. Model Evaluation
7. Predictions
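
The seven steps above can be sketched end to end with scikit-learn. This is only a minimal outline: the iris dataset, the logistic regression model, and the 70:30 split are assumptions chosen for brevity, not choices made by this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Objective (assumed for this sketch): predict the iris species from measurements.
# 2. Data gathering: load a small built-in dataset.
X, y = load_iris(return_X_y=True)

# 3. Data preparation: hold out 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# 4. Data exploration: a quick look at the size of the training set.
print(X_train.shape)

# 5. Building a model: fit on the training rows only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 6. Model evaluation: score on rows the model has not seen.
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(test_accuracy)

# 7. Predictions: classify a single measurement vector.
new_flower = [[5.1, 3.5, 1.4, 0.2]]
predicted_class = model.predict(new_flower)[0]
print(predicted_class)
```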

---

## Categories of Machine Learning

<img src='data/images/ml1.png'>


At the most fundamental level, ML can be divided into five broad categories:
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
4. Ensemble Methods
5. Neural Networks and Deep Learning

---

## Classical Machine Learning
The first two main methods of machine learning (Supervised and Unsupervised) are often referred to as classical machine learning, as they are well established and have been around since the 1950s. Classical machine learning models tend to work best with **simple data that has clear features**.

<img src='data/images/ml3.png'>

---

## 1 - Supervised Learning 
Supervised Learning involves modeling the relationship between measured 'features' of data and some 'label' (or desired outcome) associated with the data. There are two main types of Supervised Learning: Classification and Regression.

---

### Important ML and Supervised Learning Definitions:

**Algorithm** - A set of rules and statistical techniques used to learn patterns from data

**Model** - The result of training a machine learning algorithm on data; the model is what is then used to make predictions

**Predictor (aka Feature) Variables** - These are the independent (X) variables used in supervised analysis in order to predict an outcome variable

**Label (aka Target or Outcome) Variables** - The dependent y variable that is the overall desired outcome of the ML analysis and is calculated using 1 or more feature variables

**Training Data** - The observations in the training data set form the experience that the algorithm uses to learn; the training data therefore effectively builds the ML model.

**Testing Data** - A set of observations used to evaluate the performance of the model using some performance metric.

**Over-Fitting** - occurs when an overly complex model (too many features, etc.) predicts well on the training data but performs poorly on new examples (testing set or real-world data). **Regularization** can be applied to many models to reduce over-fitting (see the Lasso, Ridge, and Elastic-Net notebook for examples of this).

**Under-Fitting** - occurs when an ML model is not complex enough to accurately capture relationships between a dataset's features and target variable.

**Validation/Hold-Out Data** - in addition to the training and testing sets, sometimes a third set called a validation set is required. This set is used to tune variables called **hyperparameters**, which control how the model is learned.

**Cross-Validation** - partitions the training data, trains the algorithm on all but one of the partitions, and uses the held-out partition for testing. The partitions are then rotated several times so that the algorithm is trained and evaluated on all of the data. (Example: with the initial data split into 4 subsets (A, B, C, D), the first partition trains on A, B, C and tests on D; the next rotation trains on B, C, D and tests on A; and so on.)
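
The rotation described for cross-validation can be sketched in a few lines of plain Python. The helper name `rotate_partitions` is hypothetical, and the partitions are just labels standing in for real subsets of data.

```python
def rotate_partitions(partitions):
    """Yield (training_partitions, testing_partition), holding out each partition once."""
    for i, held_out in enumerate(partitions):
        training = partitions[:i] + partitions[i + 1:]
        yield training, held_out

rotations = list(rotate_partitions(["A", "B", "C", "D"]))
for training, testing in rotations:
    print("train on", training, "test on", testing)
```

Each subset is used for testing exactly once and for training in every other rotation.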


### ML Model Bias, Variance, and Error
When an ML model (such as a classification or regression model) makes predictions from a set of data, the performance of the model can be described in terms of the prediction error on all examples not used to train the model. This is referred to as the **model error**.

**Model Error** can be decomposed into three sources of error: the **variance** of the model, the **bias** of the model, and the variance of the **irreducible error** in the data. Conceptually, Model Error is therefore equal to:
* Error(Model) = Variance(Model) + Bias(Model) + Variance(Irreducible Error)
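
For squared-error loss, this decomposition is usually written with the bias term squared. The form below is a sketch in common textbook notation, which is an assumption here since the notebook defines no symbols of its own: $f$ is the unknown true function, $\hat{f}$ is the fitted model, the data follow $y = f(x) + \varepsilon$ with noise variance $\sigma^2$, and the expectation is taken over training sets and noise.

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
+ \underbrace{\sigma^2}_{\text{Irreducible error}}
$$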

**Bias** is a measure of how closely a model can approximate the mapping function between the inputs and outputs of a data set. A **bias error** occurs due to incorrect assumptions about the data, such as assuming that data is linear when in reality it follows a complex function.
* **Unbiased Models** - make weak or no assumptions about the form of the unknown underlying function that maps inputs to outputs in a dataset. Unbiased models treat all variables equally and become more complex as new variables are added. These models tend to have lower bias (they fit the line well) and higher variance (unreliable predictions). Unbiased models are prone to overfitting (adding unnecessary 'noise' to the analysis).

* **Biased Models** - make strong assumptions about the form of the unknown underlying function that maps inputs to outputs in a dataset. These models tend to have higher bias (they don't fit the line as well) and lower variance (consistent predictions). OLS linear regression is an example of one such model, as it is used to predict the best-fit line formula (i.e. the unknown underlying function). Biased models are prone to underfitting.


**Variance** - is a measure of how sensitive a model is when it is fitted to new training data.
* **Low Variance** models change only slightly with new training data
* **High Variance** models change substantially with new training data

**Irreducible Error** is error that cannot be removed with any model (elements outside of control, such as statistical noise).

The **bias-variance trade-off** is a useful conceptualization for selecting and configuring models, although it generally cannot be computed directly, as that requires full knowledge of the problem domain, which we do not have. Nevertheless, in some cases, we can estimate the error of a model and divide the error into bias and variance components, which may provide insight into a given model's behavior.

The ideal ML algorithm has low bias (can accurately model the true relationship) and low variance (produces consistent predictions across different datasets). In reality this is very challenging and is really the goal of applied machine learning for a given predictive modeling problem.

The 'trade-off' is that reducing bias can be achieved by increasing variance and, conversely, reducing variance can be achieved by increasing bias. This relationship is referred to as the bias-variance trade-off, which is a conceptual framework for thinking about how to choose models and model configurations.

For example, a model can be chosen based on its bias or variance. 
* **Simple models** such as linear regression and logistic regression, generally **have a high bias and a low variance**
* **Complex models** such as random forest, generally **have a low bias but a high variance**

Model configurations can be chosen based on their effect on the bias and variance of the model. For example, the k hyperparameter in k-nearest neighbors controls the bias-variance trade-off in that model. Small values, such as k=1, result in a low bias and a high variance, whereas large k values, such as k=21, result in a high bias and a low variance.
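
The effect of k can be seen by comparing training and testing accuracy for k=1 and k=21. The sketch below uses scikit-learn's `KNeighborsClassifier` on synthetic two-class data; the dataset, noise level, and split are assumptions made for the example, so the exact scores are not meaningful beyond the comparison.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, noisy two-class data (an assumption for this sketch)
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scores = {}
for k in (1, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = {
        "train": knn.score(X_train, y_train),  # accuracy on data the model has seen
        "test": knn.score(X_test, y_test),     # accuracy on held-out data
    }
    print(k, scores[k])
```

With k=1 every training point is its own nearest neighbor, so the training accuracy is perfect, which is the low-bias, high-variance end of the trade-off. With k=21 the model smooths over individual points and no longer fits the training set exactly.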

High bias is not always bad, nor is high variance, but they can lead to poor results.

We often must test a suite of different models and model configurations in order to discover what works best for a given dataset. A model with a large bias may be too rigid and underfit the problem. Conversely, a large variance may overfit the problem.

We may decide to increase the bias or the variance as long as it decreases the overall estimate of model error.

Technically, there is no single performance metric that can calculate the bias-variance trade-off, because the true 'mapping' function for a predictive modeling problem is unknown. In some cases, however, the bias and variance components can be estimated.

See this link for more details on this: https://machinelearningmastery.com/calculate-the-bias-variance-trade-off/


The next section breaks down bias and variance visually:

<img src='data/images/bias1.png' width=400px>

The image below shows a different ML model that follows the true arc of the data relationship with a curvy line:

<img src='data/images/bias2.png' width=400px>

To compare how well the straight and curvy lines fit the training data set, calculate their sums of squares. In other words, measure the distances from the fit lines to the data, square them, and add them up.

<img src='data/images/bias3.png'>

Note that the curvy line fits the data so well that the distances between the line and the data are all 0. So for the training data set, the curvy line outperforms the straight line. But remember, there is also a testing set:

<img src='data/images/bias4.png'>

If the sum of squares is calculated for the testing set, the straight line outperforms the curvy line:

<img src='data/images/bias5.png'>

So the lesson here is that even though the curvy line did a great job of fitting the training set, it did a horrible job of fitting the testing set.


In the above example, the **complex (curvy line) model has a low bias** due to its flexibility and ability to adapt to the curve, yet it has **high variance** because it produces vastly different sums of squares for different datasets, leading to potentially unpredictable results between datasets. Because the curvy line fits the training set really well, but not the testing set, it is overfit.

In contrast, the **simple (straight line) model has a high bias**, since it cannot capture the curve in the data relationship, yet it has **low variance** because the sums of squares are very similar for different datasets. Therefore, the straight line may give only good (not great) predictions, but they will be consistent.
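
The straight-versus-curvy comparison can be reproduced numerically. The sketch below is not the data behind the images; it assumes a made-up square-root relationship with noise, fits a degree-1 and a degree-5 polynomial to six training points using NumPy, and prints the sum of squares on the training and testing sets.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_curve(x):
    # Assumed "true" relationship for this sketch: a gentle curve
    return np.sqrt(x)

# Small training and testing sets drawn from the same noisy process
x_train = np.linspace(0.5, 8.0, 6)
x_test = np.linspace(1.0, 7.5, 6)
y_train = true_curve(x_train) + rng.normal(0, 0.3, size=x_train.size)
y_test = true_curve(x_test) + rng.normal(0, 0.3, size=x_test.size)

def sum_of_squares(coeffs, x, y):
    """Sum of squared distances between the fitted line and the data."""
    return float(np.sum((np.polyval(coeffs, x) - y) ** 2))

straight = np.polyfit(x_train, y_train, deg=1)  # simple model
curvy = np.polyfit(x_train, y_train, deg=5)     # passes through all 6 training points

for name, coeffs in (("straight", straight), ("curvy", curvy)):
    print(
        name,
        "train SS:", round(sum_of_squares(coeffs, x_train, y_train), 4),
        "test SS:", round(sum_of_squares(coeffs, x_test, y_test), 4),
    )
```

The curvy fit drives the training sum of squares to essentially zero. Its testing sum of squares depends on the random draw, but it will often be worse than the straight line's, which is the overfitting pattern described above.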



The bottom line here is that even though a model might fit the training data well (low bias), it may be inaccurate from a predictive point of view.


Here is another link explaining the bias/variance concept: https://www.geeksforgeeks.org/bias-vs-variance-in-machine-learning/

---

### Classification 
Supervised learning models that attempt to identify which category an object belongs to. They have discrete or categorical labels (yes/no, 1/0, red/blue, rain/no rain, etc.).

**What is Classification used for?**
* Filtering (spam, etc.)
* Language Detection
* Document Search
* Sentiment Analysis
* Handwriting Recognition
* Fraud Detection

**Popular Classification Algorithms**
* Naive Bayes
* Decision Trees
* Logistic Regression
* K-Nearest Neighbors (KNN)
* Support Vector Machines (SVM)

---
### Regression
Supervised learning models that estimate the relationship between a dependent variable (y: the label, outcome, or target variable) and one or more independent variables (X: the predictor, covariate, or feature variables). Regression models predict continuous values and are used to forecast some numerical value.

**What is Regression Used For?**
* Stock Price Forecasts
* Demand and Sales Volume Analysis
* Medical Diagnosis
* Any time-series correlations

**Popular Regression Algorithms**
* Linear Regression
* Ridge/Lasso/Elastic-Net Regression
* Polynomial Regression


**Note: many ML models can be used with both regression and classification tasks**
 
---

## 2 - Unsupervised Learning
Unsupervised learning involves modeling the features of a dataset without reference to any label (no pre-determined desired outcome values). There are two main types of unsupervised learning: Clustering and Dimensionality Reduction.

---

### Clustering
Clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups. Basically, clustering models attempt to divide objects into groups based on unknown features.

**What is Clustering Used For?**
* Market segmentation
* Merge close points on a map
* Image compression
* Analyze and label new data
* Detect abnormal behavior

**Popular Clustering Algorithms**
* K-Means
* Mean-Shift
* Hierarchical Clustering
* DBSCAN
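
As a small illustration of clustering, the sketch below runs scikit-learn's `KMeans` on two obviously separated groups of made-up 2-D points. No labels are given to the algorithm; it only receives the coordinates and the number of clusters to look for.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two made-up groups of 2-D points; no labels are supplied to the algorithm
group_a = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8], [1.1, 1.3]])
group_b = np.array([[8.0, 8.2], [7.9, 8.1], [8.3, 7.8], [8.1, 8.4]])
points = np.vstack([group_a, group_b])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print(labels)                   # cluster index per point (which group is 0 or 1 is arbitrary)
print(kmeans.cluster_centers_)  # learned group centres
```

Real data is rarely this cleanly separated, and the number of clusters usually has to be chosen or estimated rather than known in advance.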

---

### Dimensionality Reduction
Dimensionality Reduction is the transformation of data from a high-dimensional space into a low-dimensional space (e.g. reducing the number of random variables to consider) so that the low-dimensional representation retains some meaningful properties of the original data.
    
**What is Dimensionality Reduction Used For?**
* Recommendation systems
* Beautiful visualizations
* Topic modeling (and similar document search)
* Fake image analysis
* Risk management

**Popular Dimensionality Reduction Algorithms**
* Principal Component Analysis (PCA)
* Singular Value Decomposition (SVD)
* Latent Semantic Analysis (LSA, pLSA, GLSA)
* Latent Dirichlet Allocation (LDA)
* t-SNE (for visualization)
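
A minimal PCA sketch with scikit-learn is shown below. The data is synthetic and constructed so that the third feature is almost a combination of the first two, which is an assumption made to show how two components can retain nearly all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Made-up data: 100 samples, 3 features, where the third feature is
# almost a linear combination of the first two (so it adds little information)
base = rng.normal(size=(100, 2))
third = base[:, 0] + base[:, 1] + rng.normal(scale=0.01, size=100)
X = np.column_stack([base, third])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print(pca.explained_variance_ratio_.sum())  # share of variance kept by 2 components
```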
 
---

## 3 - Reinforcement Learning
Reinforcement learning is a type of ML where an agent learns to behave in an unknown environment by performing actions and learning from the results. Unlike supervised learning, there is no expected output in this type of learning and no training data either, as the AI agent learns as it goes through trial and error.

**What is Reinforcement Learning Used For?**
* Self-driving cars
* Robot vacuums
* Games
* Automated trading
* Enterprise resource management

**Popular Reinforcement Learning Algorithms**
* Q-Learning
* SARSA
* DQN
* A3C
* Genetic Algorithm
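
To make the trial-and-error idea concrete, here is a small tabular Q-Learning sketch in plain Python. The environment is invented for this example: a corridor of five states where the agent starts at the left end and receives a reward of 1 for reaching the right end. The constants and function names are illustrative choices, not values from this notebook.

```python
import random

N_STATES = 5        # states 0..4 on a line; state 4 is the goal
ACTIONS = (-1, +1)  # move left or move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2  # learning rate, discount, exploration rate

def step(state, action):
    """Return (next_state, reward, done) for the toy corridor."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def choose_action(q, state, rng):
    """Epsilon-greedy choice with random tie-breaking."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    best = max(q[state].values())
    return rng.choice([a for a in ACTIONS if q[state][a] == best])

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = {s: {a: 0.0 for a in ACTIONS} for s in range(N_STATES)}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = choose_action(q, state, rng)
            next_state, reward, done = step(state, action)
            # Q-learning update: move the estimate toward
            # reward + discounted value of the best next action
            best_next = 0.0 if done else max(q[next_state].values())
            target = reward + GAMMA * best_next
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q

q_table = train()
# Greedy policy for the non-terminal states: the action with the highest Q-value
policy = {s: max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)}
print(policy)
```

The agent is never told that moving right is correct; it discovers this from the rewards it collects while exploring.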

---

## 4 - Ensemble Methods
Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.

**What are Ensemble Methods Used For?**
* Everything that fits classical ML algorithm approaches (ensemble methods usually work better, however)
* Search systems
* Computer vision
* Object detection

**Popular Ensemble Algorithms**
* Random Forest
* Gradient Boosting
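
The core idea of combining several models can be shown with a simple majority vote. The sketch below is a contrived toy, not Random Forest or Gradient Boosting: the labels and the three sets of "model predictions" are made up so that each hypothetical model is wrong on a different sample.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model predictions (one list per model) by majority vote per sample."""
    combined = []
    for votes in zip(*predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

def accuracy(predicted, actual):
    """Fraction of samples predicted correctly."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Made-up labels and predictions from three hypothetical models,
# each of which is wrong on a different sample
actual = [1, 0, 1, 1, 0, 1]
model_a = [0, 0, 1, 1, 0, 1]  # wrong on sample 0
model_b = [1, 1, 1, 1, 0, 1]  # wrong on sample 1
model_c = [1, 0, 0, 1, 0, 1]  # wrong on sample 2

ensemble = majority_vote([model_a, model_b, model_c])
print([accuracy(m, actual) for m in (model_a, model_b, model_c)])
print(accuracy(ensemble, actual))
```

The vote beats every individual model here only because their mistakes do not overlap. When models make the same errors, combining them helps much less.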

---

## 5 - Neural Networks and Deep Learning
Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning. **Neural Network Models can be supervised, semi-supervised or unsupervised**.

**Neural Networks** are basically a collection of neurons (nodes) and the connections between them. A single neuron is a function with a number of inputs and one output and its task is to take all the numbers from its input/s, perform a function on them, and send the result to the output. 

**Simple Example of a single neuron in action:** sum up all numbers from the inputs and if that sum is bigger than N return 1, else return 0. 
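
The single-neuron example translates almost directly into code. In this sketch the function name `neuron` and the optional `weights` argument are illustrative additions; with no weights given, it simply sums the inputs and compares the sum to the threshold N.

```python
def neuron(inputs, threshold, weights=None):
    """Return 1 if the (optionally weighted) sum of the inputs exceeds the threshold, else 0."""
    if weights is None:
        weights = [1.0] * len(inputs)  # unweighted sum by default
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

print(neuron([0.5, 0.2, 0.9], threshold=1.0))              # sum is 1.6, above the threshold
print(neuron([0.1, 0.2, 0.3], threshold=1.0))              # sum is 0.6, below the threshold
print(neuron([1, 1], threshold=1.0, weights=[0.4, 0.4]))   # weighted sum is 0.8
```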

**Single Layer Neural Network (Perceptron)** is the simplest form of neural network, in which there is only one layer of input nodes that send weighted inputs to a subsequent layer of receiving nodes, or in some cases, one receiving node. This single-layer design was part of the foundation for systems which have now become much more complex.

**Multi-Layer Neural Network (Multilayer Perceptron)** contains more than one layer of artificial neurons or nodes. They differ widely in design. It is important to note that while single-layer neural networks were useful early in the evolution of AI, the vast majority of networks used today have a multi-layer model.

In Multi-Layered Neural Networks, neurons are linked between layers but not interlinked within each layer. See the image below, where each vertical line of circles represents a layer (input layer, hidden layers, output layer), and note that the nodes in each layer are separate from one another but connected to nodes in adjacent layers:

<img src='data/images/ml4.png'>

**What are Neural Networks and Deep Learning Used For?**
* Replacement of all algorithms mentioned above
* Photo and video object identification
* Speech recognition and synthesis
* Image processing, style transfer
* Machine translation

**Popular Neural Network Algorithms**
* Perceptron
* Convolutional Network (CNN)
* Recurrent Networks (RNN)
* Autoencoders

---

## 6 - Model Evaluation
In sklearn, there are 3 different APIs for evaluating the quality of a model’s predictions. 

* **Scoring Parameter** - model evaluation tools (e.g. cross-validation) rely on an internal sklearn scoring strategy, set through the `scoring` parameter, which follows the convention that higher scores are better than lower scores; see the link for more info:
    * https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
* **Metric Functions** - the metrics module implements functions that assess prediction error for specific purposes, grouped by task (i.e. regression metrics, classification metrics, clustering metrics, and multilabel ranking metrics)
* **Estimator Score Method** - estimators have a score method providing a default evaluation criterion for the problem they are designed to solve. 
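
The three evaluation APIs can be seen side by side in a short sketch. The dataset (iris), the model (logistic regression), and the choice of accuracy as the metric are assumptions made for the example; the calls themselves are standard scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = LogisticRegression(max_iter=1000)

# 1. Scoring parameter: cross_val_score accepts a named scoring strategy
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")

model.fit(X_train, y_train)

# 2. Metric function: compare true and predicted labels directly
metric_accuracy = accuracy_score(y_test, model.predict(X_test))

# 3. Estimator score method: the default criterion (mean accuracy for classifiers)
score_accuracy = model.score(X_test, y_test)

print(cv_scores.mean(), metric_accuracy, score_accuracy)
```

For a classifier, options 2 and 3 give the same number here because the default `score` method is mean accuracy; for regressors the default is R² instead.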

The link below explains the overall model evaluation process in detail:

https://scikit-learn.org/stable/modules/model_evaluation.html

---

## 7 - Improving model accuracy
Here is the link that contains most of the info in this section:
https://www.analyticsvidhya.com/blog/2015/12/improve-machine-learning-results/

There are many different methodologies used for improving model accuracy including:
1. **Adding more data**
2. **Treat missing values** - there are many ways to do this:
    * **Deletion** - simply delete the entire sample with a missing value
    * **Replacement** - replace the value with a sensible value (mean/median/mode for continuous variables)
    * **Prediction** - create a predictive model that estimates the best value to put in the missing value's place
    * **KNN Imputation** - replaces (or imputes) the missing values of an attribute using a given number of the samples most similar to the sample with the missing value (good for classification)
3. **Removing Unnecessary Outliers**
4. **Feature Engineering** - is the process of extracting new features from existing data and/or features. There are two steps to feature engineering:
    * **Feature Transformation** - there are 3 main scenarios where feature transformation can be required:
        * **Normalization** is used to change the scale of the data to between 0 and 1 (for example, when 3 variables have differing scales such as meters, centimeters, and kilometers)
        * **Remove Skewness** (for models that work well with normally distributed data) using log, square root, or inverse of the values
        * **Create Bins** - good for dealing with outlier data that cannot be deleted, numeric data can also be made discrete this way
    * **Feature Creation** - is when new variables are derived from existing variables. In many cases this helps to increase a model's accuracy by bringing out hidden relationships between variables that may not have been obvious in isolation
5. **Feature Selection** is the process of finding the subset of attributes that best explains the relationship between the independent variables and the target variable.
6. **Using Multiple ML Algorithms** - in many cases, using multiple algorithms for a single problem can be useful for seeing which is best suited for a given problem
7. **Algorithm Parameter Tuning** - dialing in the optimum parameter values can greatly increase model performance
8. **Ensemble Learning** - is the art of combining a diverse set of learners (individual models) to improve the stability and predictive power of the overall model (this technique is used in many machine learning competitions). There are three commonly used ensemble learning techniques:
    * **Bagging** - attempts to implement similar learners on small sample populations and then takes a mean of all the predictions. This method helps decrease variance errors
    * **Boosting** - an iterative technique which adjusts the weight of an observation based on the last classification. This method in general decreases the bias error (but can sometimes lead to overfitting)
    * **Stacking (Stacked Generalization)** - explores a space of different models for the same problem. The idea is that you can attack a learning problem with different types of models which are capable of learning some part of the problem, but not the whole space of the problem. So, you can build multiple different learners and use them to build an intermediate prediction (one prediction for each learned model) then add a new model which learns from the intermediate predictions the same target. This final model is said to be stacked on the top of the others, hence the name. Thus, you might improve your overall performance, and often you end up with a model which is better than any individual intermediate model.
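
The three feature transformations listed under Feature Engineering (normalization, skew removal, and binning) can be sketched in plain Python. The income values, bin edges, and function names below are made up for illustration; in practice a library such as scikit-learn or pandas would usually do this.

```python
import math

def min_max_normalize(values):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def log_transform(values):
    """Reduce right skew with log(1 + x); assumes non-negative values."""
    return [math.log1p(v) for v in values]

def to_bins(values, edges, labels):
    """Assign each value the label of the first bin whose upper edge it does not exceed."""
    binned = []
    for v in values:
        for edge, label in zip(edges, labels):
            if v <= edge:
                binned.append(label)
                break
        else:
            binned.append(labels[-1])  # values above the last edge go in the top bin
    return binned

incomes = [20_000, 35_000, 50_000, 65_000, 1_000_000]  # made-up, with one outlier

normalized = min_max_normalize(incomes)
logged = log_transform(incomes)
binned = to_bins(incomes, edges=[30_000, 60_000], labels=["low", "mid", "high"])

print([round(v, 3) for v in normalized])
print([round(v, 2) for v in logged])
print(binned)
```

Note how the outlier squeezes the normalized values of the other incomes toward 0, while binning simply places it in the top bin alongside 65,000.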

### Note that all of the above methods can improve the accuracy of a model, but HIGHER ACCURACY MODELS DO NOT ALWAYS PERFORM BETTER ON NEW, UNSEEN DATA POINTS. Overfitting can also cause misleading improvements in a model's accuracy. To avoid this pitfall, cross-validation should be used.

9. **Cross-Validation** - one of the most important concepts in data modeling. There are two broad types of cross-validation:

**Non-Exhaustive CV Methods**

Non-exhaustive cv methods do not compute all the possible ways of splitting up the original data. Below are a few non-exhaustive cv methods:


* **Holdout Method** - the most basic example, this approach divides the entire dataset into two parts (training and testing). Usually the training set is more than twice the size of the testing data, with a typical ratio of 70:30 or 80:20.
    * In this approach the data is shuffled randomly and the model is trained on a different combination of data points each time, so the model can give different results every time it is trained (which can cause instability). In addition, one can never be assured that the random training set is truly representative of the data set as a whole. This method can be good to use with a very large dataset, if you're in a hurry, or just as an initial starting point to feel a model out.
* **K-Fold Cross-Validation** - this method makes the score of the model much less dependent on the way the training and test sets are picked. The process for this method is as follows:
     1. Randomly split entire dataset into k number of subsets (or folds)
     2. For each subset, build the model on the other k-1 subsets of the dataset, then test the model on the kth fold (the one held out)
     3. This process is repeated until each of the subsets has been used as the test set
     4. The average of each iteration's score is called the cross-validation accuracy score and serves as the performance metric for the model
    * Because k-fold cv ensures that every observation from the original dataset has the chance of appearing in both a training and a test set, it usually results in a less biased estimate of model performance and is one of the best approaches if there is **limited input data**. The main disadvantage is that it can be really slow to run
* **Stratified K-Fold Cross-Validation** - because the above k-fold cv uses random data selection for subsets, in many cases there can be highly imbalanced folds, which can create training bias. Stratification rearranges the data to ensure that each fold is a good representation of the whole.
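
The difference between plain and stratified folds shows up clearly on imbalanced labels. The sketch below uses scikit-learn's `KFold` and `StratifiedKFold` on made-up labels that are deliberately sorted, and leaves `KFold` unshuffled (its default), so the imbalance is exaggerated for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Made-up imbalanced labels: 16 samples of class 0 followed by 4 of class 1
y = np.array([0] * 16 + [1] * 4)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

def positives_per_fold(splitter):
    """Count class-1 samples in the test part of each fold."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]

plain = positives_per_fold(KFold(n_splits=4))
stratified = positives_per_fold(StratifiedKFold(n_splits=4))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

With plain folds, all the class-1 samples land in a single test fold, so three of the four evaluations never see that class. Stratification spreads them evenly.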

**Exhaustive Methods**

Exhaustive cv methods test on all of the possible ways to divide the original data sample into training and testing sets. Below are a few exhaustive cv methods:

* **Leave-P-Out Cross-Validation**
* **Leave-One-Out Cross-Validation**
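
Why these methods are called exhaustive becomes clear when the splits are enumerated. The sketch below generates every leave-p-out split for a tiny dataset of five samples using the standard library; the helper name `leave_p_out_splits` is hypothetical, and leave-one-out is treated as the p = 1 case.

```python
from itertools import combinations
from math import comb

def leave_p_out_splits(n_samples, p):
    """Yield (train_indices, test_indices) for every way of holding out p samples."""
    indices = range(n_samples)
    for test in combinations(indices, p):
        train = [i for i in indices if i not in test]
        yield train, list(test)

loo = list(leave_p_out_splits(5, 1))  # leave-one-out is the p = 1 case
lpo = list(leave_p_out_splits(5, 2))  # leave-two-out

print(len(loo), len(lpo), comb(5, 2))
print(loo[0], lpo[0])
```

The number of splits grows as "n choose p", which is why leave-p-out quickly becomes impractical for anything but small datasets.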

**Note that usually either cross-validation or a single train/test split is used, but not both.**

Below is a link to a good article on cross-validation:
https://machinelearningmastery.com/k-fold-cross-validation/

---
# Choosing an Algorithm
It is imperative to choose an algorithm that best fits the data and desired predictions. Below is a diagram that helps with this process:

<img src='data/images/choose.png'>
<img src='data/images/choose2.png'>