# 1. Problem Definition

The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians given medical details.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. Missing values are believed to be encoded with zero values. The variable names are as follows:
- Number of times pregnant.
- Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
- Diastolic blood pressure (mm Hg).
- Triceps skinfold thickness (mm).
- 2-Hour serum insulin (mu U/ml).
- Body mass index (weight in kg/(height in m)^2).
- Diabetes pedigree function.
- Age (years).
- Class variable (0 or 1).

<u>Goal</u>: Predict the onset of diabetes within 5 years in Pima Indians given medical details.

We are going to cover the following steps:
2. Load our data
3. Understand our data with descriptive statistics
4. Understand our data with visualization
5. Prepare our data
6. Feature Selection
7. Evaluate the Performance of Algorithms with Resampling
8. Algorithm Performance Metrics
9. Spot-Check Algorithms
10. Compare Algorithms
11. Automate Workflows with Pipelines
12. Improve Performance with Ensembles
13. Improve Performance with Algorithm Tuning
14. References and Credits

# 2. Load data

Let's start off by loading the libraries required for this project.

## 2.1 Import libraries

First, let's import all of the modules, functions and objects we are going to use in this project.

In [None]:
# Load libraries
import seaborn as sns
import numpy
from numpy import arange
from numpy import set_printoptions
from matplotlib import pyplot
from pandas import read_csv
from pandas import set_option
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import Binarizer
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import RFE
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

import warnings
warnings.filterwarnings('ignore')

## 2.2 Load data

In [None]:
# Load dataset
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

filename = '/kaggle/input/pima-indians-diabetes-database/diabetes.csv'
data = read_csv(filename)

# df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'}, inplace=True)
data.rename(columns={'Pregnancies': 'preg', 'Glucose': 'plas', 'BloodPressure':'pres', 'SkinThickness':'skin', 'Insulin':'test', 'BMI':'mass', 'DiabetesPedigreeFunction':'pedi', 'Age':'age', 'Outcome':'class'}, 
            inplace=True)

# 3. Understand our data with descriptive statistics

We are going to cover the following steps:
1. Take a peek at our raw data.
2. Review the dimensions of our dataset.
3. Review the data types of attributes in our data.
4. Summarize the distribution of instances across classes in our dataset.
5. Summarize our data using descriptive statistics.
6. Understand the relationships in our data using correlations.
7. Review the skew of the distributions of each attribute.

## 3.1 Peek at our data

Let's review the first five rows of the data.

In [None]:
peek = data.head(5)
print(peek)

## 3.2 Dimensions of our data

In [None]:
shape = data.shape
print(shape)

- We can see that the dataset has 768 rows and 9 columns.

## 3.3 Data Type For Each Attribute

In [None]:
types = data.dtypes
print(types)

- We can see that most of the attributes are integers and that mass and pedi are floating point types.

## 3.4 Descriptive Statistics

In [None]:
# Statistical Summary
set_option('display.width', 100)
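# use the fully qualified option name; the bare 'precision' key is ambiguous in recent pandas versions
set_option('display.precision', 3)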
description = data.describe()
print(description)

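- The summary shows no NaN/NA values. However, a minimum of 0 for attributes such as plas, pres, skin, test and mass is not physiologically plausible. As noted in the problem definition, missing values are believed to be encoded as zeros, so imputation may still be worth considering.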
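
To make the zero-encoded gaps visible, the zeros can be converted to NaN before counting. The following is a minimal sketch on a small hypothetical frame that reuses the notebook's renamed columns; the choice of which columns treat zero as missing is an assumption, not something the dataset documents.

```python
import numpy
from pandas import DataFrame

# Hypothetical sample rows (not read from diabetes.csv).
sample = DataFrame({
    'preg': [6, 1, 8, 0],
    'plas': [148, 85, 0, 137],
    'pres': [72, 66, 64, 0],
    'skin': [35, 29, 0, 35],
    'test': [0, 0, 0, 168],
    'mass': [33.6, 26.6, 23.3, 0.0],
})

# Assumption: a zero in these columns is implausible and marks a missing value.
# Zero pregnancies, by contrast, is a valid value, so 'preg' is left alone.
zero_as_missing = ['plas', 'pres', 'skin', 'test', 'mass']

cleaned = sample.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, numpy.nan)

# Count the values now flagged as missing in each column.
missing_counts = cleaned.isnull().sum()
print(missing_counts)
```

Running the same replacement on `data` would show how many values per attribute are affected before deciding whether to impute or drop them.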

## 3.5 Class Distribution (Classification Only)
On classification problems we need to know how balanced the class values are. Highly imbalanced problems (a lot more observations for one class than another) are common and may need special handling in the data preparation stage of our project.

In [None]:
# Class Distribution
class_counts = data.groupby('class').size()
print(class_counts)

- We can see that there are nearly double the number of observations with class 0 (no onset of diabetes) than there are with class 1 (onset of diabetes).

## 3.6 Correlations Between Attributes

Correlation refers to the relationship between two variables and how they may or may not change together. The most common method for calculating correlation is Pearson's Correlation Coefficient, which assumes a normal distribution of the attributes involved. A correlation of -1 or 1 shows a full negative or positive correlation respectively, whereas a value of 0 shows no correlation at all. Some machine learning algorithms like linear and logistic regression can suffer poor performance if there are highly correlated attributes in our dataset. As such, it is a good idea to review all of the pairwise correlations of the attributes in our dataset.

In [None]:
# Pairwise Pearson correlations
correlations = data.corr(method='pearson')
print(correlations)

- preg and age are positively correlated (i.e. > 0.5)
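
For intuition about what `data.corr(method='pearson')` computes for each pair, here is a small sketch that evaluates Pearson's coefficient by hand and compares it with NumPy's result. The two arrays are hypothetical values, not columns from the dataset.

```python
import numpy

# Hypothetical paired measurements.
x = numpy.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = numpy.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson's r: the sum of products of centered values, divided by the
# square root of the product of the two sums of squared centered values.
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / numpy.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum())

# NumPy's built-in estimate, for comparison.
r_numpy = numpy.corrcoef(x, y)[0, 1]
print("manual: %.4f  numpy: %.4f" % (r_manual, r_numpy))
```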

## 3.7 Skew of Univariate Distributions
Skew refers to a distribution that is assumed Gaussian (normal or bell curve) that is shifted or squashed in one direction or another. Many machine learning algorithms assume a Gaussian distribution. Knowing that an attribute has a skew may allow us to perform data preparation to correct the skew and later improve the accuracy of our models.

In [None]:
# Skew for each attribute
skew = data.skew()
print(skew)

- The skew results show a positive (right) or negative (left) skew. Values closer to zero show less skew.
- preg, test, pedi, age, class are positively skewed (i.e. > 0.5)
- pres is negatively skewed (i.e. < -0.5)
- plas, skin, mass can be classified as not skewed (i.e. they are between +0.5 and -0.5)

- Questions
    - What should we do with the columns which are skewed? 
    - How should we transform them so that their skewed nature does not have an adverse effect on our prediction?
    - Which transformation should we apply for left-skewed columns and which transformation should we apply for right-skewed data?
- Possible Answers: The following section has been taken from https://rcompanion.org/handbook/I_12.html
    - For right-skewed data (tail on the right, positive skew), common transformations include square root, cube root, and log.
    - For left-skewed data (tail on the left, negative skew), common transformations include square root (constant - x), cube root (constant - x), and log (constant - x).
    - Because log(0) is undefined, as is the log of any negative number, a constant should be added to all values to make them all positive before a log transformation. It is also sometimes helpful to add a constant when using other transformations.
    - Another approach is to use a general power transformation, such as Tukey’s Ladder of Powers or a Box–Cox transformation.  These determine a lambda value, which is used as the power coefficient to transform values.  X.new = X ^ lambda for Tukey, and X.new = (X ^ lambda – 1) / lambda for Box–Cox.
    - The function transformTukey in the rcompanion package finds the lambda which makes a single vector of values—that is, one variable—as normally distributed as possible with a simple power transformation. 
    - The Box–Cox procedure is included in the MASS package with the function boxcox.  It uses a log-likelihood procedure to find the lambda to use to transform the dependent variable for a linear model (such as an ANOVA or linear regression).  It can also be used on a single vector.
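
The handbook quoted above describes R functions. A rough Python counterpart for the simple transformations might look like the sketch below. The values are hypothetical, and using `log1p` (log of 1 + x) as the "add a constant" step is an assumption made for illustration.

```python
import numpy
from pandas import Series

# Hypothetical right-skewed values that include a zero, similar in shape
# to a column like 'test'.
values = Series([0, 1, 2, 3, 5, 8, 13, 40, 120, 400], dtype=float)

# Square root is the milder transformation; log1p compresses the tail more
# and handles zeros without a separate shift.
sqrt_values = numpy.sqrt(values)
log_values = numpy.log1p(values)

print("raw skew:  %.3f" % values.skew())
print("sqrt skew: %.3f" % sqrt_values.skew())
print("log skew:  %.3f" % log_values.skew())
```

On this sample the skew shrinks with each stronger transformation. Whether that improves a model on the real data still has to be checked with the evaluation methods later in the notebook.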

# 4. Understand our data with visualization

We are going to cover the following visualizations:
1. Univariate Plots (Histograms, Density Plots, Box and Whisker Plots)
2. Multivariate Plots (Correlation Matrix Plot, Scatter Plot Matrix)

## 4.1.1 Univariate Plots (Histograms)

In [None]:
# Univariate Histograms
data.hist()
pyplot.show()

- age, pedi, preg, skin, test: Exponential-like distribution
- class: Bimodal distribution
- mass, plas, pres: Gaussian-like distribution


## 4.1.2 Univariate Plots (Density Plots)

In [None]:
# Univariate Density Plots
data.plot(kind='density', subplots=True, layout=(3,3), sharex=False)
pyplot.show()

- We can see the distribution for each attribute is clearer than the histograms.

## 4.1.3 Univariate Plots (Box and Whisker Plots)

Another useful way to review the distribution of each attribute is to use Box and Whisker Plots or boxplots for short. Boxplots summarize the distribution of each attribute, drawing a line for the median (middle value) and a box around the 25th and 75th percentiles (the middle 50% of the data). The whiskers give an idea of the spread of the data, and dots outside of the whiskers show candidate outlier values (values that lie more than 1.5 times the interquartile range, i.e. the spread of the middle 50% of the data, beyond the edges of the box).

In [None]:
# Box and Whisker Plots
data.plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False)
pyplot.show()

- We can see that the spread of attributes is quite different. Some like age, test and skin appear quite skewed towards smaller values.
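
The outlier rule behind the dots in a boxplot can be reproduced numerically. This is a minimal sketch on hypothetical values, using the conventional 1.5 x IQR fences.

```python
import numpy

# Hypothetical attribute values; 95 is deliberately extreme.
values = numpy.array([21, 22, 24, 25, 27, 29, 31, 33, 41, 95])

# Edges of the box and the interquartile range (IQR).
q1, q3 = numpy.percentile(values, [25, 75])
iqr = q3 - q1

# Conventional boxplot fences: 1.5 * IQR beyond each edge of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print("fences: %.3f to %.3f" % (lower_fence, upper_fence))
print("candidate outliers:", outliers)
```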

## 4.2.1 Multivariate Plots (Correlation Matrix Plot)
Correlation gives an indication of how related the changes are between two variables. If two variables change in the same direction they are positively correlated. If they change in opposite directions together (one goes up, one goes down), then they are negatively correlated. We can calculate the correlation between each pair of attributes. This is called a correlation matrix. We can then plot the correlation matrix and get an idea of which variables have a high correlation with each other. This is useful to know, because some machine learning algorithms like linear and logistic regression can have poor performance if there are highly correlated input variables in our data.

In [None]:
# Correlation Matrix Plot
correlations = data.corr()
# plot correlation matrix
fig = pyplot.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
# place one tick per attribute so that the labels line up with the cells
ticks = numpy.arange(0,9,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(data.columns)
ax.set_yticklabels(data.columns)
pyplot.show()

- We can see that the matrix is symmetrical, i.e. the bottom left of the matrix is the same as the top right. 
- This is useful as we can see two different views on the same data in one plot. 
- We can also see that <u>each variable is perfectly positively correlated with itself</u> along the diagonal line from top left to bottom right.
- There are <u>no strong negative correlations</u>

## 4.2.2 Multivariate Plots (Scatter Plot Matrix)

A scatter plot shows the relationship between two variables as dots in two dimensions, one axis for each attribute. We can create a scatter plot for each pair of attributes in our data. Drawing all these scatter plots together is called a scatter plot matrix. Scatter plots are useful for spotting structured relationships between variables, like whether we could summarize the relationship between two variables with a line. Attributes with structured relationships may also be correlated and good candidates for removal from our dataset.

In [None]:
# Scatterplot Matrix
g = sns.PairGrid(data, diag_sharey=False)
g.map_upper(sns.scatterplot)
g.map_lower(sns.kdeplot, colors="C0")
g.map_diag(sns.kdeplot, lw=2)

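- Unlike the Correlation Matrix Plot, this grid is not symmetrical: the upper triangle shows scatter plots and the lower triangle shows density contours for the same attribute pairs. This is useful to look at the pairwise relationships from different perspectives.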

# 5. Prepare data

Many machine learning algorithms make assumptions about our data. It is often a very good idea to prepare our data in such a way as to best expose the structure of the problem to the machine learning algorithms that we intend to use.

We are going to cover the following steps:
1. Rescale data.
2. Standardize data.
3. Normalize data.

## Need For Data Pre-processing

Different algorithms make different assumptions about our data and may require different transforms. Further, even when we follow all of the rules and prepare our data, algorithms can sometimes deliver better results without pre-processing.

## 5.1 Rescale Data
When our data is comprised of attributes with varying scales, many machine learning algorithms can benefit from rescaling the attributes to all have the same scale. Often this is referred to as normalization and attributes are often rescaled into the range between 0 and 1. This is useful for optimization algorithms used in the core of machine learning algorithms like gradient descent. It is also useful for algorithms that weight inputs like regression and neural networks and algorithms that use distance measures like k-Nearest Neighbors.

In [None]:
# Rescale data (between 0 and 1)
array = data.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]

scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# summarize transformed data
set_printoptions(precision=3)
print(rescaledX[0:5,:])

- After rescaling we can see that all of the values are in the range between 0 and 1.
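
The rescaling applied here is, per column, (x - min) / (max - min). A small sketch on a hypothetical two-column array shows the arithmetic directly.

```python
import numpy

# Hypothetical input with two columns on very different scales.
X_demo = numpy.array([[1.0, 100.0],
                      [2.0, 300.0],
                      [5.0, 500.0]])

# Min-max rescaling, column by column: (x - min) / (max - min).
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
rescaled_demo = (X_demo - col_min) / (col_max - col_min)
print(rescaled_demo)
```

Each column now runs from 0 to 1 regardless of its original units.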

## 5.2 Standardize Data

Standardization is a useful technique to transform attributes with a Gaussian distribution and differing means and standard deviations to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1. It is most suitable for techniques that assume a Gaussian distribution in the input variables and work better with rescaled data, such as linear regression, logistic regression and linear discriminant analysis.

In [None]:
# Standardize data (0 mean, 1 stdev)
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
# summarize transformed data
set_printoptions(precision=3)
print(rescaledX[0:5,:])

- The values for each attribute now have a mean value of 0 and a standard deviation of 1.
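
As a check on what standardization does, the z-score can be computed by hand: subtract each column's mean and divide by its standard deviation. The array below is hypothetical. The population standard deviation (NumPy's default, `ddof=0`) is used, which is also what `StandardScaler` uses.

```python
import numpy

# Hypothetical input with two columns on very different scales.
X_demo = numpy.array([[1.0, 100.0],
                      [2.0, 300.0],
                      [5.0, 500.0]])

# z-score per column: (x - mean) / standard deviation.
standardized_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print("column means:", standardized_demo.mean(axis=0))
print("column stds: ", standardized_demo.std(axis=0))
```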

## 5.3 Normalize Data
Normalizing in scikit-learn refers to rescaling each observation (row) to have a length of 1 (called a unit norm or a vector with the length of 1 in linear algebra). This pre-processing method can be useful for sparse datasets (lots of zeros) with attributes of varying scales when using algorithms that weight input values such as neural networks and algorithms that use distance measures such as k-Nearest Neighbors.

In [None]:
# Normalize data (length of 1)
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
# summarize transformed data
set_printoptions(precision=3)
print(normalizedX[0:5,:])

- The rows are normalized to length 1.
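
Unlike the two column-wise transforms above, normalization works row by row. The sketch below divides each hypothetical row by its Euclidean (L2) length, which is the default norm of `Normalizer`.

```python
import numpy

# Hypothetical observations; the third row is the first row scaled by 2.
X_demo = numpy.array([[3.0, 4.0],
                      [1.0, 0.0],
                      [6.0, 8.0]])

# Divide every row by its own Euclidean length.
row_norms = numpy.linalg.norm(X_demo, axis=1, keepdims=True)
normalized_demo = X_demo / row_norms
print(normalized_demo)
```

The first and third rows become identical after normalization: the transform keeps the direction of each observation and discards its magnitude, which is worth keeping in mind before applying it.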

# 6. Feature Selection
Feature selection is a process where we automatically select those features in our data that contribute most to the prediction variable or output in which we are interested. Having irrelevant features in our data can decrease the accuracy of many models, especially linear algorithms like linear and logistic regression. Three benefits of performing feature selection before modeling our data are:
- Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
- Improves Accuracy: Less misleading data means modeling accuracy improves.
- Reduces Training Time: Less data means that algorithms train faster.

## 6.1 Univariate Feature Selection
Statistical tests can be used to select those features that have the strongest relationship with the output variable. Below we use the chi-squared (chi2) statistical test for non-negative features to select the 4 best features.

In [None]:
# Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)
# feature extraction
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
# summarize scores
set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
# summarize selected features
print(features[0:5,:])

- We can see the scores for each attribute and the 4 attributes chosen (those with the highest scores): plas, test, mass and age.
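
Reading the chosen attributes off a raw score array is error-prone. `SelectKBest` exposes a boolean mask through `get_support()`, which can be zipped with the column names. The sketch below uses synthetic data and hypothetical feature names to show the pattern.

```python
import numpy
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic, non-negative data: two columns depend on the label, two do not.
rng = numpy.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
names = ['strong', 'noise_a', 'weak', 'noise_b']  # hypothetical names
X_demo = numpy.column_stack([
    10 * y_demo + rng.random(200),  # strongly related to the label
    rng.random(200),                # unrelated
    5 * y_demo + rng.random(200),   # more weakly related
    rng.random(200),                # unrelated
])

selector = SelectKBest(score_func=chi2, k=2).fit(X_demo, y_demo)

# Map the boolean support mask back to feature names.
selected = [name for name, keep in zip(names, selector.get_support()) if keep]
print(selected)
```

In the notebook, the same pattern with `fit.get_support()` and `data.columns[0:8]` would list the selected attribute names.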

## 6.2 Recursive Feature Elimination
The Recursive Feature Elimination (or RFE) works by recursively removing attributes and building a model on those attributes that remain. It uses the model accuracy to identify which attributes (and combination of attributes) contribute the most to predicting the target attribute. Below, we use RFE with the logistic regression algorithm to select the top 3 features.

In [None]:
# Feature Extraction with RFE
# feature extraction
model = LogisticRegression()
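# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(model, n_features_to_select=3)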
fit = rfe.fit(X, Y)
print(("Num Features: %d") % fit.n_features_)
print(("Selected Features: %s") % fit.support_)
print(("Feature Ranking: %s") % fit.ranking_)

- We can see that RFE chose the top 3 features as preg, mass and pedi. These are marked True in the support array and marked with a choice 1 in the ranking array.

## 6.3 Principal Component Analysis
Principal Component Analysis (or PCA) uses linear algebra to transform the dataset into a compressed form. Generally this is called a data reduction technique. A property of PCA is that we can choose the number of dimensions or principal components in the transformed result. Below, we use PCA and select 3 principal components.

In [None]:
# Feature Extraction with PCA
# feature extraction
pca = PCA(n_components=3)
fit = pca.fit(X)
# summarize components
print(("Explained Variance: %s") % fit.explained_variance_ratio_)
print(fit.components_)

## 6.4 Feature Importance
Bagged decision trees like Random Forest and Extra Trees can be used to estimate the importance of features. Below, we construct an ExtraTreesClassifier.

In [None]:
# Feature Importance with Extra Trees Classifier
# feature extraction
model = ExtraTreesClassifier()
model.fit(X, Y)
print(model.feature_importances_)

- We are given an importance score for each attribute where the larger the score, the more important the attribute. 
- The scores suggest the importance of plas, age and mass.

# 7. Evaluate the Performance of Algorithms with Resampling

We are going to try the following four methods:
- Train and Test Sets.
- k-fold Cross Validation.
- Leave One Out Cross Validation.
- Repeated Random Test-Train Splits.

## 7.1 Split into Train and Test Sets

The simplest method that we can use to evaluate the performance of a machine learning algorithm is to use different training and testing datasets. We can take our original dataset and split it into two parts. Train the algorithm on the first part, make predictions on the second part and evaluate the predictions against the expected results. The size of the split can depend on the size and specifics of our dataset, although it is common to use 67% of the data for training and the remaining 33% for testing.

This algorithm evaluation technique is very fast. It is ideal for large datasets (millions of records) where there is strong evidence that both splits of the data are representative of the underlying problem. Because of the speed, it is useful to use this approach when the algorithm we are investigating is slow to train. A downside of this technique is that it can have a high variance. This means that differences in the training and test dataset can result in meaningful differences in the estimate of accuracy. Below, we split our data into 67%/33% splits for training and test and evaluate the accuracy of a Logistic Regression model.

In [None]:
# Evaluate using a train and a test set
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
model = LogisticRegression()
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)
print(("Accuracy: %.3f%%") % (result*100.0))

We can see that the estimated accuracy for the model was approximately 78.7%. In addition to specifying the size of the split, we also specify the random seed. Because the split of the data is random, we want to ensure that the results are reproducible. By specifying the random seed we ensure that we get the same random numbers each time we run the code and in turn the same split of data. This is important if we want to compare this result to the estimated accuracy of another machine learning algorithm or the same algorithm with a different configuration: to ensure the comparison is apples-for-apples, we must ensure that they are trained and tested on exactly the same data.

## 7.2 K-fold Cross Validation
Cross validation is an approach that we can use to estimate the performance of a machine learning algorithm with less variance than a single train-test set split. It works by splitting the dataset into k-parts (e.g. k = 5 or k = 10). Each split of the data is called a fold. The algorithm is trained on k - 1 folds with one held back and tested on the held back fold. This is repeated so that each fold of the dataset is given a chance to be the held back test set. After running cross validation we end up with k different performance scores that we can summarize using a mean and a standard deviation.

The result is a more reliable estimate of the performance of the algorithm on new data. It is more accurate because the algorithm is trained and evaluated multiple times on different data. The choice of k must allow the size of each test partition to be large enough to be a reasonable sample of the problem, whilst allowing enough repetitions of the train-test evaluation of the algorithm to provide a fair estimate of the algorithm's performance on unseen data. For modest sized datasets in the thousands or tens of thousands of records, k values of 3, 5 and 10 are common. Below, we use 10-fold cross validation.

In [None]:
# Evaluate using Cross Validation
num_folds = 10
seed = 7
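# shuffle=True is required for random_state to have an effect (recent scikit-learn raises an error otherwise)
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)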
model = LogisticRegression()
results = cross_val_score(model, X, Y, cv=kfold)
print(("Accuracy: %.3f%% (%.3f%%)") % (results.mean()*100.0, results.std()*100.0))

- We report both the mean and the standard deviation of the performance measure.
- When summarizing performance measures, it is a good practice to summarize the distribution of the measures, in this case assuming a Gaussian distribution of performance (a very reasonable assumption) and recording the mean and standard deviation.
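
The fold mechanics described above can be sketched without scikit-learn: shuffle the row indices, cut them into k parts, and let each part serve once as the test set. This illustrates the idea only and is not how `KFold` is implemented internally. The sample size and seed are arbitrary.

```python
import numpy

n_samples, k = 20, 5
rng = numpy.random.default_rng(7)

# Shuffle the row indices once, then cut them into k folds.
indices = rng.permutation(n_samples)
folds = numpy.array_split(indices, k)

seen_in_test = []
overlaps = []
for i, test_idx in enumerate(folds):
    # Train on every fold except the one held back for testing.
    train_idx = numpy.concatenate([f for j, f in enumerate(folds) if j != i])
    overlaps.append(len(numpy.intersect1d(train_idx, test_idx)))
    seen_in_test.extend(test_idx.tolist())
    print("fold %d: train=%d test=%d" % (i, len(train_idx), len(test_idx)))
```

Every row ends up in exactly one test fold, and no row is in the training and test part of the same iteration.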

## 7.3 Leave One Out Cross Validation
We can configure cross validation so that the size of the fold is 1 (k is set to the number of observations in our dataset). This variation of cross validation is called leave-one-out cross validation. The result is a large number of performance measures that can be summarized in an effort to give a more reasonable estimate of the accuracy of our model on unseen data.
A downside is that it can be a computationally more expensive procedure than k-fold cross validation. Below, we use leave-one-out cross validation.

In [None]:
# Evaluate using Leave One Out Cross Validation
loocv = LeaveOneOut()
model = LogisticRegression()
results = cross_val_score(model, X, Y, cv=loocv)
print(("Accuracy: %.3f%% (%.3f%%)") % (results.mean()*100.0, results.std()*100.0))

- We can see in the standard deviation that the score has more variance than the k-fold cross validation results described above.
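
The large standard deviation follows from how leave-one-out scores look: each fold tests a single observation, so every score is either 0 or 1. For such scores the standard deviation is sqrt(p * (1 - p)), where p is the mean accuracy. The sketch below uses hypothetical scores with p = 0.77.

```python
import numpy

# Hypothetical leave-one-out results: 77 correct and 23 incorrect predictions.
scores = numpy.array([1] * 77 + [0] * 23, dtype=float)

p = scores.mean()
std_from_scores = scores.std()
std_from_formula = numpy.sqrt(p * (1 - p))

print("mean accuracy: %.3f" % p)
print("std of scores: %.3f" % std_from_scores)
print("sqrt(p*(1-p)): %.3f" % std_from_formula)
```

The spread therefore reflects the 0/1 nature of the per-fold scores more than instability of the model.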

## 7.4 Repeated Random Test-Train Splits

Another variation on k-fold cross validation is to create a random split of the data like the train/test split described above, but repeat the process of splitting and evaluation of the algorithm multiple times, like cross validation. This has the speed of using a train/test split and the reduction in variance in the estimated performance of k-fold cross validation. We can also repeat the process many more times as needed to improve the accuracy. A downside is that repetitions may include much of the same data in the train or the test split from run to run, introducing redundancy into the evaluation. Below, we split the data into a 67%/33% train/test split and repeat the process 10 times.

In [None]:
# Evaluate using Shuffle Split Cross Validation
n_splits = 10
test_size = 0.33
seed = 7
kfold = ShuffleSplit(n_splits=n_splits, test_size=test_size, random_state=seed)
model = LogisticRegression()
results = cross_val_score(model, X, Y, cv=kfold)
print(("Accuracy: %.3f%% (%.3f%%)") % (results.mean()*100.0, results.std()*100.0))

## What Techniques to Use When
- Generally k-fold cross validation is the gold standard for evaluating the performance of a machine learning algorithm on unseen data with k set to 3, 5, or 10.
- Using a train/test split is good for speed when using a slow algorithm and produces performance estimates with lower bias when using large datasets.
- Techniques like leave-one-out cross validation and repeated random splits can be useful intermediates when trying to balance variance in the estimated performance, model training speed and dataset size.

# 8. Algorithm Performance Metrics

## Classification Metrics
Classification problems are perhaps the most common type of machine learning problems and as such there are a myriad of metrics that can be used to evaluate predictions for these problems.
Below, we will demonstrate how to use the following metrics:
- Classification Accuracy.
- Logarithmic Loss.
- Area Under ROC Curve.
- Confusion Matrix.
- Classification Report.

## 8.1 Classification Accuracy
Classification accuracy is the number of correct predictions made as a ratio of all predictions made. This is the most common evaluation metric for classification problems, and it is also the most misused. It is really only suitable when there are an equal number of observations in each class (which is rarely the case) and when all predictions and prediction errors are equally important, which is often not the case.

In [None]:
# Cross Validation Classification Accuracy
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = LogisticRegression()
scoring = 'accuracy'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print(("Accuracy: %.3f (%.3f)") % (results.mean(), results.std()))

- We can see that the ratio is reported. This can be converted into a percentage by multiplying the value by 100, giving an accuracy of approximately 77%.
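
Because the classes are imbalanced, it helps to compare accuracy against the score of a trivial model that always predicts the majority class. The sketch below assumes class counts of 500 and 268, which is consistent with the "nearly double" imbalance noted earlier but should be checked against the output of the class distribution cell.

```python
# Assumed class counts (verify against class_counts printed in section 3.5).
n_negative = 500  # class 0, no onset of diabetes
n_positive = 268  # class 1, onset of diabetes

# Accuracy of always predicting the larger class.
baseline_accuracy = max(n_negative, n_positive) / (n_negative + n_positive)
print("Majority-class baseline accuracy: %.3f" % baseline_accuracy)
```

Under that assumption, a model needs to beat roughly 65% accuracy before it shows any skill at all.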

## 8.2 Logarithmic Loss
Logarithmic loss (or logloss) is a performance metric for evaluating the predictions of probabilities of membership to a given class. The scalar probability between 0 and 1 can be seen as a measure of confidence for a prediction by an algorithm. Predictions that are correct or incorrect are rewarded or punished proportionally to the confidence of the prediction.

In [None]:
# Cross Validation Classification LogLoss
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = LogisticRegression()
scoring = 'neg_log_loss'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print(("Logloss: %.3f (%.3f)") % (results.mean(), results.std()))

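- Smaller logloss is better, with 0 representing a perfect logloss. Note that scikit-learn's `neg_log_loss` scoring reports the negated value, so the printed scores are negative and values closer to 0 are better.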
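
To see how confidence is rewarded or punished, the log loss for a binary problem can be computed by hand as the negative mean of y*log(p) + (1 - y)*log(1 - p). The labels and probabilities below are hypothetical, and `log_loss_manual` is a helper written only for this illustration.

```python
import numpy

def log_loss_manual(y, p):
    """Binary log loss: negative mean log-probability assigned to the true class."""
    return -numpy.mean(y * numpy.log(p) + (1 - y) * numpy.log(1 - p))

# Hypothetical true labels and predicted probabilities of class 1.
y_true = numpy.array([1, 0, 1, 0])
confident_right = numpy.array([0.9, 0.1, 0.8, 0.2])
confident_wrong = numpy.array([0.1, 0.9, 0.2, 0.8])

loss_right = log_loss_manual(y_true, confident_right)
loss_wrong = log_loss_manual(y_true, confident_wrong)
print("confident and right: %.3f" % loss_right)
print("confident and wrong: %.3f" % loss_wrong)
```

Confident but wrong predictions are penalized far more heavily than confident and correct ones are rewarded.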

## 8.3 Area Under ROC Curve
Area under ROC Curve (or AUC for short) is a performance metric for binary classification problems. The AUC represents a model's ability to discriminate between positive and negative classes. An area of 1.0 represents a model that made all predictions perfectly. An area of 0.5 represents a model that is as good as random. ROC can be broken down into sensitivity and specificity. A binary classification problem is really a trade-off between sensitivity and specificity.
- Sensitivity is the true positive rate, also called the recall. It is the proportion of instances from the positive class that were predicted correctly.
- Specificity is also called the true negative rate. It is the proportion of instances from the negative class that were predicted correctly.

In [None]:
# Cross Validation Classification ROC AUC
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = LogisticRegression()
scoring = 'roc_auc'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print(("AUC: %.3f (%.3f)") % (results.mean(), results.std()))

- We can see the AUC is relatively close to 1 and greater than 0.5, suggesting some skill in the predictions

## 8.4 Confusion Matrix
The confusion matrix is a handy presentation of the accuracy of a model with two or more classes. The table presents predictions on the x-axis and actual outcomes on the y-axis. The cells of the table are the number of predictions made by a machine learning algorithm. For example, a machine learning algorithm can predict 0 or 1 and each prediction may actually have been a 0 or 1. Predictions for 0 that were actually 0 appear in the cell for prediction = 0 and actual = 0, whereas predictions for 0 that were actually 1 appear in the cell for prediction = 0 and actual = 1. And so on.

In [None]:
# Cross Validation Classification Confusion Matrix
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
model = LogisticRegression()
model.fit(X_train, Y_train)
predicted = model.predict(X_test)
matrix = confusion_matrix(Y_test, predicted)
print(matrix)

- Although the array is printed without headings, we can see that the majority of the predictions fall on the diagonal line of the matrix (which are correct predictions).
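
The four cells of a binary confusion matrix are enough to derive sensitivity, specificity and accuracy. The sketch below uses a hypothetical matrix laid out the way scikit-learn prints it (rows are actual classes, columns are predicted classes, labels ordered 0 then 1).

```python
import numpy

# Hypothetical confusion matrix: rows = actual, columns = predicted.
matrix_demo = numpy.array([[140, 20],
                           [35, 60]])

# With labels ordered [0, 1], the flattened cells are TN, FP, FN, TP.
tn, fp, fn, tp = matrix_demo.ravel()

sensitivity = tp / (tp + fn)  # true positive rate (recall)
specificity = tn / (tn + fp)  # true negative rate
accuracy = (tp + tn) / matrix_demo.sum()

print("Sensitivity: %.3f" % sensitivity)
print("Specificity: %.3f" % specificity)
print("Accuracy:    %.3f" % accuracy)
```

In this made-up case the model looks reasonable on accuracy while missing more than a third of the positive cases, which is why the matrix is worth inspecting alongside the single accuracy figure.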

## 8.5 Classication Report
The scikit-learn library provides a convenience report when working on classification problems to give us a quick idea of the accuracy of a model using a number of measures. The classification report() function displays the precision, recall, F1-score and support for each class.

In [None]:
# Cross Validation Classification Report
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
model = LogisticRegression()
model.fit(X_train, Y_train)
predicted = model.predict(X_test)
report = classification_report(Y_test, predicted)
print(report)

- We can see good precision and recall for the algorithm.

# 9. Spot-Check Algorithms

Let's take a look at six classification algorithms that we can spot-check on our data. 
- Linear Machine Learning Algorithms (Logistic Regression, Linear Discriminant Analysis)
- Non-linear Machine Learning Algorithms (k-Nearest Neighbors, Naive Bayes, Classification and Regression Trees, Support Vector Machines)

## 9.1 Logistic Regression

Logistic regression assumes a Gaussian distribution for the numeric input variables and can model binary classification problems.

In [None]:
# Logistic Regression Classification
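num_folds = 10
# shuffle=True is needed for random_state to have an effect;
# recent scikit-learn versions raise an error if it is omitted
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=7)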
model = LogisticRegression()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
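
For intuition, here is a minimal sketch of how a fitted logistic regression model turns inputs into a class prediction. The coefficients, bias and input row are hypothetical and not fitted to the diabetes data; `sigmoid` and `predict_proba` are illustrative helpers, not scikit-learn functions.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of class 1 under a logistic model with the given coefficients."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients and inputs, not fitted to the diabetes data
weights = [0.8, -0.5]
bias = -0.2
p = predict_proba([1.0, 0.4], weights, bias)
label = 1 if p >= 0.5 else 0
print("p(class=1) = %.3f, predicted class = %d" % (p, label))
```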

## 9.2 Linear Discriminant Analysis

Linear Discriminant Analysis or LDA is a statistical technique for binary and multiclass classification. It assumes a Gaussian distribution for the numerical input variables within each class.

In [None]:
# LDA Classification
num_folds = 10
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=7)  # shuffle is required when random_state is set
model = LinearDiscriminantAnalysis()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

## 9.3 k-Nearest Neighbors
The k-Nearest Neighbors algorithm (or KNN) uses a distance metric to find the k most similar instances in the training data for a new instance. For classification it takes the majority class of those neighbors as the prediction (for regression it would take their mean outcome).

In [None]:
# KNN Classification
num_folds = 10
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=7)  # shuffle is required when random_state is set
model = KNeighborsClassifier()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
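
The idea can be sketched in a few lines of plain Python. This is an illustrative version with Euclidean distance and a majority vote, not scikit-learn's implementation; the helper `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training rows."""
    # Euclidean distance from the query to every training row
    distances = [(math.dist(row, query), label) for row, label in zip(train_X, train_y)]
    # Keep the labels of the k closest rows
    nearest = [label for _, label in sorted(distances, key=lambda d: d[0])[:k]]
    # Majority class among the neighbors
    return Counter(nearest).most_common(1)[0][0]

# Hypothetical two-feature training set with two well-separated classes
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_X, train_y, (1.1, 1.0)))
print(knn_predict(train_X, train_y, (5.1, 5.0)))
```

Note that `KNeighborsClassifier` uses k = 5 by default, whereas this sketch uses k = 3 to suit the tiny dataset.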

## 9.4 Naive Bayes
Naive Bayes calculates the probability of each class and the conditional probability of each input value given each class. These probabilities are estimated for new data and multiplied together, assuming that they are all independent (a simple or naive assumption). When working with real-valued data, a Gaussian distribution is assumed to easily estimate the probabilities for input variables using the Gaussian Probability Density Function.

In [None]:
# Gaussian Naive Bayes Classification
kfold = KFold(n_splits=10, shuffle=True, random_state=7)  # shuffle is required when random_state is set
model = GaussianNB()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
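
The calculation described above can be sketched as follows. The per-class means, standard deviations and priors are hypothetical values rather than estimates from the diabetes data, and `gaussian_pdf` and `class_score` are illustrative helpers.

```python
import math

def gaussian_pdf(x, mean, std):
    """Gaussian probability density of x for the given mean and standard deviation."""
    exponent = -((x - mean) ** 2) / (2 * std ** 2)
    return math.exp(exponent) / (std * math.sqrt(2 * math.pi))

def class_score(row, prior, stats):
    """Class prior times the product of per-feature likelihoods (unnormalized posterior)."""
    score = prior
    for x, (mean, std) in zip(row, stats):
        score *= gaussian_pdf(x, mean, std)
    return score

# Hypothetical per-class (mean, std) for two features, and class priors
stats_by_class = {0: [(1.0, 0.5), (2.0, 1.0)], 1: [(3.0, 0.5), (4.0, 1.0)]}
priors = {0: 0.65, 1: 0.35}
row = (1.2, 2.5)

scores = {c: class_score(row, priors[c], stats_by_class[c]) for c in priors}
predicted = max(scores, key=scores.get)  # class with the largest score wins
print(scores, predicted)
```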

## 9.5 Classification and Regression Trees
Classification and Regression Trees (CART or just decision trees) construct a binary tree from the training data. Split points are chosen greedily by evaluating each attribute and each value of each attribute in the training data in order to minimize a cost function (like the Gini index).

In [None]:
# CART Classification
kfold = KFold(n_splits=10, shuffle=True, random_state=7)  # shuffle is required when random_state is set
model = DecisionTreeClassifier()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
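
To illustrate the cost function mentioned above, the sketch below computes the Gini impurity of a set of labels and the size-weighted impurity of a candidate split. The helpers `gini` and `split_gini` and the label lists are hypothetical; a real tree learner would evaluate many such candidate splits and keep the one with the lowest value.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left, right):
    """Size-weighted Gini impurity of a candidate split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical candidate splits of the same eight labels
before = gini([0, 0, 0, 0, 1, 1, 1, 1])            # impurity before splitting
perfect = split_gini([0, 0, 0, 0], [1, 1, 1, 1])   # classes fully separated
imperfect = split_gini([0, 0, 0, 1], [0, 1, 1, 1]) # one label misplaced on each side
print(before, perfect, imperfect)
```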

## 9.6 Support Vector Machines
Support Vector Machines (or SVM) seek a line (more generally, a hyperplane) that best separates two classes. Those data instances that are closest to the line that best separates the classes are called support vectors and influence where the line is placed. SVM has been extended to support multiple classes. Of particular importance is the use of different kernel functions via the kernel parameter. A Radial Basis Function (RBF) kernel is used by default.

In [None]:
# SVM Classification
kfold = KFold(n_splits=10, shuffle=True, random_state=7)  # shuffle is required when random_state is set
model = SVC()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
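
As a small illustration of what a kernel function does, the sketch below evaluates a Radial Basis Function kernel between pairs of points. The helper `rbf_kernel`, the points and the value `gamma=0.5` are arbitrary illustrative choices, not the settings `SVC` would pick.

```python
import math

def rbf_kernel(a, b, gamma=0.5):
    """Radial Basis Function kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)

# Hypothetical points; similarity decays as the points move apart
same = rbf_kernel((1.0, 2.0), (1.0, 2.0))     # identical points
near = rbf_kernel((1.0, 2.0), (1.5, 2.5))     # nearby points
far = rbf_kernel((1.0, 2.0), (6.0, 9.0))      # distant points
print(same, near, far)
```

Because the kernel depends on distances between rows, features measured on large scales can dominate it, which is one reason rescaling is often tried with SVMs.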

# 10. Compare Algorithms

## 10.1 Compare Machine Learning Algorithms Consistently
The key to a fair comparison of machine learning algorithms is ensuring that each algorithm is evaluated in the same way on the same data. We can achieve this by forcing each algorithm to be evaluated on a consistent test harness.

The 10-fold cross validation procedure is used to evaluate each algorithm, importantly configured with the same random seed to ensure that the same splits to the training data are performed and that each algorithm is evaluated in precisely the same way. Each algorithm is given a short name, useful for summarizing results afterward.

In [None]:
# Compare Algorithms
# prepare models
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
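    # same seed for every model, so each one sees identical folds;
    # shuffle=True is required when random_state is set
    kfold = KFold(n_splits=10, shuffle=True, random_state=7)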
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
# boxplot algorithm comparison
fig = pyplot.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names)
pyplot.show()

- The output above provides a box and whisker plot showing the spread of the accuracy scores across the cross validation folds for each algorithm.
- These results suggest that both logistic regression and linear discriminant analysis are perhaps worthy of further study on this problem.

Let's rescale the data and check whether or not the accuracy improves.

In [None]:
# Rescale data (between 0 and 1)
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)

# Compare Algorithms
# prepare models
models = []
models.append(('ScaledLR', LogisticRegression()))
models.append(('ScaledLDA', LinearDiscriminantAnalysis()))
models.append(('ScaledKNN', KNeighborsClassifier()))
models.append(('ScaledCART', DecisionTreeClassifier()))
models.append(('ScaledNB', GaussianNB()))
models.append(('ScaledSVM', SVC()))
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = KFold(n_splits=10, shuffle=True, random_state=7)  # shuffle is required when random_state is set
    cv_results = cross_val_score(model, rescaledX, Y, cv=kfold, scoring=scoring) # note that we have replaced X with rescaledX
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

- Mean Estimated Accuracy on raw data (X, Y)
    - LR: 0.773
    - LDA: 0.773
    - KNN: 0.727
    - CART: 0.693
    - NB: 0.755
    - SVM: 0.760
- Mean Estimated Accuracy on re-scaled data (rescaledX, Y)
    - ScaledLR: 0.768
    - ScaledLDA: 0.773
    - ScaledKNN: 0.745
    - ScaledCART: 0.700
    - ScaledNB: 0.755
    - ScaledSVM: 0.771
- Observations:
    - the accuracy of our best performing model (i.e. LR in this case) has decreased, why?
    - accuracy of LDA (linear) and NB (non-linear) remained the same
    - the accuracy of non-linear models (KNN, CART and SVM) has become better, why?
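
One plausible reason distance-based models such as KNN and SVM respond to rescaling is that features on large scales otherwise dominate the distance calculation. The sketch below applies the min-max formula by hand to a small hypothetical matrix; it mirrors the formula only and is not a substitute for `MinMaxScaler`.

```python
import numpy as np

# Hypothetical feature matrix: two columns on very different scales
data = np.array([[1.0, 100.0],
                 [2.0, 300.0],
                 [4.0, 500.0]])

# Min-max rescaling, column by column: (x - min) / (max - min)
col_min = data.min(axis=0)
col_max = data.max(axis=0)
rescaled = (data - col_min) / (col_max - col_min)
print(rescaled)
```

After rescaling, every column spans 0 to 1, so no single feature dominates purely because of its units.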
    
Let's standardize the data and check whether or not the accuracy improves.

In [None]:
# Standardize data (0 mean, 1 stdev)
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)

# Compare Algorithms
# prepare models
models = []
models.append(('StandardizedLR', LogisticRegression()))
models.append(('StandardizedLDA', LinearDiscriminantAnalysis()))
models.append(('StandardizedKNN', KNeighborsClassifier()))
models.append(('StandardizedCART', DecisionTreeClassifier()))
models.append(('StandardizedNB', GaussianNB()))
models.append(('StandardizedSVM', SVC()))
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = KFold(n_splits=10, shuffle=True, random_state=7)  # shuffle is required when random_state is set
    cv_results = cross_val_score(model, rescaledX, Y, cv=kfold, scoring=scoring) # note that we have replaced X with rescaledX
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

- Mean Estimated Accuracy on raw data (X, Y)
    - LR: 0.773
    - LDA: 0.773
    - KNN: 0.727
    - CART: 0.693
    - NB: 0.755
    - SVM: 0.760
- Mean Estimated Accuracy on standardized data (rescaledX, Y)
    - StandardizedLR: 0.780
    - StandardizedLDA: 0.773
    - StandardizedKNN: 0.742
    - StandardizedCART: 0.694
    - StandardizedNB: 0.755
    - StandardizedSVM: 0.766
- Observations:
    - the accuracy of LR, KNN and SVM has improved
    - the accuracy of LDA, CART and NB has remained essentially the same
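
For reference, standardization can be written out by hand as below. The matrix is hypothetical; the population standard deviation (`ddof=0`) is used because that is what `StandardScaler` computes.

```python
import numpy as np

# Hypothetical feature matrix: two columns on very different scales
data = np.array([[1.0, 100.0],
                 [2.0, 300.0],
                 [4.0, 500.0]])

# Standardization, column by column: (x - mean) / std
col_mean = data.mean(axis=0)
col_std = data.std(axis=0)  # population standard deviation (ddof=0)
standardized = (data - col_mean) / col_std
print(standardized)
print(standardized.mean(axis=0), standardized.std(axis=0))
```

Unlike min-max rescaling, the result is not bounded to a fixed range, but every column ends up with mean 0 and standard deviation 1.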

Let's normalize the data and check whether or not the accuracy improves.

In [None]:
# Normalize data (length of 1)
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)

# Compare Algorithms
# prepare models
models = []
models.append(('NormalizedLR', LogisticRegression()))
models.append(('NormalizedLDA', LinearDiscriminantAnalysis()))
models.append(('NormalizedKNN', KNeighborsClassifier()))
models.append(('NormalizedCART', DecisionTreeClassifier()))
models.append(('NormalizedNB', GaussianNB()))
models.append(('NormalizedSVM', SVC()))
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = KFold(n_splits=10, shuffle=True, random_state=7)  # shuffle is required when random_state is set
    cv_results = cross_val_score(model, normalizedX, Y, cv=kfold, scoring=scoring) # note that we have replaced X with normalizedX
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

- Mean Estimated Accuracy on raw data (X, Y)
    - LR: 0.773
    - LDA: 0.773
    - KNN: 0.727
    - CART: 0.693
    - NB: 0.755
    - SVM: 0.760
- Mean Estimated Accuracy on normalized data (normalizedX, Y)
    - NormalizedLR: 0.650
    - NormalizedLDA: 0.672
    - NormalizedKNN: 0.689
    - NormalizedCART: 0.628
    - NormalizedNB: 0.646
    - NormalizedSVM: 0.654
- Observations:
    - the accuracy of all of our models has decreased, why?
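
One possible explanation, offered as a hypothesis rather than a diagnosis: `Normalizer` rescales each row (observation) to unit length instead of rescaling each column, so the overall magnitude of an observation is discarded. The sketch below uses hypothetical rows to show two observations that differ by a factor of ten collapsing onto the same point.

```python
import numpy as np

# Hypothetical observations; the second is ten times the first
rows = np.array([[1.0, 2.0, 2.0],
                 [10.0, 20.0, 20.0],
                 [3.0, 0.0, 4.0]])

# L2 normalization, row by row: divide each row by its Euclidean length
lengths = np.linalg.norm(rows, axis=1, keepdims=True)
normalized = rows / lengths
print(normalized)
```

If the absolute level of measurements such as glucose or BMI carries predictive signal, removing it in this way could plausibly hurt every model at once.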

# 11. Automate Workflows with Pipelines

## 11.1 Automating Machine Learning Workflows 
There are standard workflows in applied machine learning; they are standard because they overcome common problems like data leakage in our test harness. Python scikit-learn provides a Pipeline utility to help automate machine learning workflows. Pipelines work by allowing for a linear sequence of data transforms to be chained together culminating in a modeling process that can be evaluated. The goal is to ensure that all of the steps in the pipeline are constrained to the data available for the evaluation, such as the training dataset or each fold of the cross validation procedure.

## 11.2 Data Preparation and Modeling Pipeline
An easy trap to fall into in applied machine learning is leaking data from our test dataset into our training dataset. To avoid this trap we need a robust test harness with strong separation of training and testing. This includes data preparation. Data preparation is one easy way to leak knowledge of the whole dataset to the algorithm. For example, preparing our data using normalization or standardization on the entire dataset before learning would not be a valid test because the training dataset would have been influenced by the scale of the data in the test set.

Pipelines help us prevent data leakage in our test harness by ensuring that data preparation like standardization is constrained to each fold of our cross validation procedure. Below, we show this important data preparation and model evaluation workflow on our dataset. The pipeline is defined with two steps:
1. Standardize the data.
2. Learn a Linear Discriminant Analysis model.

The pipeline is then evaluated using 10-fold cross validation.

In [None]:
# Create a pipeline that standardizes the data then creates a model
# create pipeline
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('lda', LinearDiscriminantAnalysis()))
model = Pipeline(estimators)
# evaluate pipeline
kfold = KFold(n_splits=10, shuffle=True, random_state=7)  # shuffle is required when random_state is set
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
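
To show what the pipeline is doing inside each fold, the sketch below writes the same steps out by hand and compares the result with `cross_val_score` on a pipeline. It runs on a synthetic stand-in dataset from `make_classification` (an assumption made so the example is self-contained); the names `X_demo`, `y_demo` and `demo_kfold` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 8 features (not the diabetes data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, n_informative=4,
                                     n_redundant=0, random_state=7)
demo_kfold = KFold(n_splits=10, shuffle=True, random_state=7)

# The per-fold work, written out by hand
manual_scores = []
for train_idx, test_idx in demo_kfold.split(X_demo):
    # Scaling statistics come from the training fold only
    scaler = StandardScaler().fit(X_demo[train_idx])
    lda = LinearDiscriminantAnalysis()
    lda.fit(scaler.transform(X_demo[train_idx]), y_demo[train_idx])
    # The held-out fold is transformed with the training-fold statistics
    manual_scores.append(lda.score(scaler.transform(X_demo[test_idx]), y_demo[test_idx]))

# The same procedure expressed as a pipeline
pipeline = Pipeline([('standardize', StandardScaler()),
                     ('lda', LinearDiscriminantAnalysis())])
pipeline_scores = cross_val_score(pipeline, X_demo, y_demo, cv=demo_kfold)
print(np.mean(manual_scores), pipeline_scores.mean())
```

The two sets of fold scores should agree, which is the point: the pipeline saves us from writing and maintaining the loop ourselves.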

## 11.3 Feature Extraction and Modeling Pipeline
Feature extraction is another procedure that is susceptible to data leakage. Like data preparation, feature extraction procedures must be restricted to the data in our training dataset. The pipeline module provides a handy tool called FeatureUnion which allows the results of multiple feature selection and extraction procedures to be combined into a larger dataset on which a model can be trained. Importantly, all the feature extraction and the feature union occur within each fold of the cross validation procedure. Below, we demonstrate the pipeline defined with four steps:
1. Feature Extraction with Principal Component Analysis (3 features).
2. Feature Extraction with Statistical Selection (6 features).
3. Feature Union.
4. Learn a Logistic Regression Model.

The pipeline is then evaluated using 10-fold cross validation.

In [None]:
# Create a pipeline that extracts features from the data then creates a model
# create feature union
from sklearn.pipeline import FeatureUnion
features = []
features.append(('pca', PCA(n_components=3)))
features.append(('select_best', SelectKBest(k=6)))
feature_union = FeatureUnion(features)
# create pipeline
estimators = []
estimators.append(('feature_union', feature_union))
estimators.append(('logistic', LogisticRegression()))
model = Pipeline(estimators)
# evaluate pipeline
kfold = KFold(n_splits=10, shuffle=True, random_state=7)  # shuffle is required when random_state is set
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

- We can notice how the FeatureUnion is its own Pipeline that in turn is a single step in the final Pipeline used to feed Logistic Regression.
- In this manner, we can embed pipelines within pipelines.
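
A quick way to see what the feature union produces is to inspect the shape of its output. The sketch below uses a synthetic stand-in dataset (an assumption, so the example runs on its own); `X_demo` and `y_demo` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Synthetic stand-in with 8 features (not the diabetes data)
X_demo, y_demo = make_classification(n_samples=100, n_features=8, n_informative=4,
                                     n_redundant=0, random_state=7)

# 3 principal components and 6 selected original features, side by side
union = FeatureUnion([('pca', PCA(n_components=3)),
                      ('select_best', SelectKBest(k=6))])
combined = union.fit_transform(X_demo, y_demo)
print(X_demo.shape, '->', combined.shape)
```

The outputs of the two transformers are concatenated column-wise, so the model downstream is trained on 3 + 6 = 9 columns.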

# 12. Improve Performance with Ensembles

Ensembles can give us a boost in the accuracy on our dataset. We will step through Boosting, Bagging and Majority Voting and demonstrate how we can continue to ratchet up the accuracy of the models on our own datasets.

## 12.1 Combine Models Into Ensemble Predictions
The three most popular methods for combining the predictions from different models are:
- Bagging. Building multiple models (typically of the same type) from different subsamples of the training dataset.
- Boosting. Building multiple models (typically of the same type) each of which learns to fix the prediction errors of a prior model in the sequence of models.
- Voting. Building multiple models (typically of differing types) and combining their predictions with simple statistics (like calculating the mean).

## 12.2 Bagging Algorithms
Bootstrap Aggregation (or Bagging) involves taking multiple samples from our training dataset (with replacement) and training a model for each sample. The final output prediction is averaged across the predictions of all of the sub-models. We are going to cover the following three bagging models:
- Bagged Decision Trees.
- Random Forest.
- Extra Trees.
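
The sampling step behind bagging can be sketched in plain Python. The dataset size and seed below are arbitrary; on average roughly 63% of the original rows appear in any one bootstrap sample, and the rest are left out of that sample.

```python
import random

rng = random.Random(7)   # fixed seed so the sketch is repeatable
n_rows = 1000            # hypothetical dataset size

# A bootstrap sample draws n_rows row indices with replacement
bootstrap_indices = rng.choices(range(n_rows), k=n_rows)
unique_fraction = len(set(bootstrap_indices)) / n_rows
print("rows drawn: %d, distinct rows: %.1f%%" % (len(bootstrap_indices), 100 * unique_fraction))
```

Because each sub-model sees a different sample, their errors are less correlated, which is what makes averaging their predictions useful.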

### 12.2.1 Bagged Decision Trees
Bagging performs best with algorithms that have high variance. A popular example is decision trees, often constructed without pruning. Below, we use the BaggingClassifier with the Classification and Regression Trees algorithm (DecisionTreeClassifier). A total of 100 trees are created.

In [None]:
# Bagged Decision Trees for Classification
from sklearn.ensemble import BaggingClassifier
seed = 7
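kfold = KFold(n_splits=10, shuffle=True, random_state=seed)  # shuffle is required when random_state is set
cart = DecisionTreeClassifier()
num_trees = 100
# The base model is passed positionally because its keyword name changed
# from base_estimator to estimator in newer scikit-learn releases
model = BaggingClassifier(cart, n_estimators=num_trees, random_state=seed)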
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

### 12.2.2 Random Forest
Random Forests is an extension of bagged decision trees. Samples of the training dataset are taken with replacement, but the trees are constructed in a way that reduces the correlation between individual classifiers. Specifically, rather than greedily choosing the best split point in the construction of each tree, only a random subset of features are considered for each split. We can construct a Random Forest model for classification using the RandomForestClassifier class. Below, we demonstrate using Random Forest for classification with 100 trees and split points chosen from a random selection of 3 features.

In [None]:
# Random Forest Classification
from sklearn.ensemble import RandomForestClassifier
num_trees = 100
max_features = 3
kfold = KFold(n_splits=10, shuffle=True, random_state=7)  # shuffle is required when random_state is set
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

### 12.2.3 Extra Trees
Extra Trees are another modification of bagging where random trees are constructed from samples of the training dataset. We can construct an Extra Trees model for classification using the ExtraTreesClassifier class. Below, we demonstrate Extra Trees with the number of trees set to 100 and splits chosen from 7 random features.

In [None]:
# Extra Trees Classification
from sklearn.ensemble import ExtraTreesClassifier
num_trees = 100
max_features = 7
kfold = KFold(n_splits=10, shuffle=True, random_state=7)  # shuffle is required when random_state is set
model = ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

## 12.3 Boosting Algorithms
Boosting ensemble algorithms create a sequence of models that attempt to correct the mistakes of the models before them in the sequence. Once created, the models make predictions which may be weighted by their demonstrated accuracy and the results are combined to create a final output prediction. The two most common boosting ensemble machine learning algorithms are:
- AdaBoost.
- Stochastic Gradient Boosting.

### 12.3.1 AdaBoost
AdaBoost was perhaps the first successful boosting ensemble algorithm. It generally works by weighting instances in the dataset by how easy or difficult they are to classify, allowing the algorithm to pay more or less attention to them in the construction of subsequent models. We can construct an AdaBoost model for classification using the AdaBoostClassifier class. Below, we demonstrate the construction of 30 decision trees in sequence using the AdaBoost algorithm.

In [None]:
# AdaBoost Classification
from sklearn.ensemble import AdaBoostClassifier
num_trees = 30
seed=7
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)  # shuffle is required when random_state is set
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
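
The instance re-weighting idea can be sketched with one round of the classic textbook binary AdaBoost update. This is an illustration of the principle and not necessarily what `AdaBoostClassifier` does internally; the helper `adaboost_reweight` and the example weights are hypothetical, and the function assumes the weighted error lies strictly between 0 and 1.

```python
import math

def adaboost_reweight(weights, is_correct):
    """One round of the classic binary AdaBoost weight update.

    Assumes the weighted error is strictly between 0 and 1.
    """
    # Weighted error of the current weak learner
    error = sum(w for w, ok in zip(weights, is_correct) if not ok)
    # The learner's vote strength: larger when the error is smaller
    alpha = 0.5 * math.log((1 - error) / error)
    # Up-weight mistakes, down-weight correct predictions
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, is_correct)]
    # Renormalize so the weights sum to 1 again
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Hypothetical round: five equally weighted rows, the last one misclassified
start_weights = [0.2] * 5
is_correct = [True, True, True, True, False]
alpha, new_weights = adaboost_reweight(start_weights, is_correct)
print(alpha, new_weights)
```

After the update the single misclassified row carries as much weight as the four correct rows combined, so the next model in the sequence is pushed to get it right.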

### 12.3.2 Stochastic Gradient Boosting
Stochastic Gradient Boosting (also called Gradient Boosting Machines) is one of the most sophisticated ensemble techniques. It is also proving to be perhaps one of the best techniques available for improving performance via ensembles. We can construct a Gradient Boosting model for classification using the GradientBoostingClassifier class. Below, we demonstrate Stochastic Gradient Boosting for classification with 100 trees.

In [None]:
# Stochastic Gradient Boosting Classification
from sklearn.ensemble import GradientBoostingClassifier
seed = 7
num_trees = 100
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)  # shuffle is required when random_state is set
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

## 12.4 Voting Ensemble
Voting is one of the simplest ways of combining the predictions from multiple machine learning algorithms. It works by first creating two or more standalone models from our training dataset. A Voting Classifier can then be used to wrap our models and combine the predictions of the sub-models when asked to make predictions for new data, either by majority vote on the predicted labels (hard voting, the default) or by averaging predicted probabilities (soft voting). The predictions of the sub-models can be weighted, but specifying the weights for classifiers manually or even heuristically is difficult. More advanced methods can learn how to best weight the predictions from sub-models; this is called stacking (stacked generalization) and is available in scikit-learn 0.22 and later as the StackingClassifier class.

We can create a voting ensemble model for classification using the VotingClassifier class. In the code below, we combine the predictions of logistic regression, classification and regression trees and support vector machines together for a classification problem.

In [None]:
# Voting Ensemble for Classification
from sklearn.ensemble import VotingClassifier
kfold = KFold(n_splits=10, shuffle=True, random_state=7)  # shuffle is required when random_state is set
# create the sub models
estimators = []
model1 = LogisticRegression()
estimators.append(('logistic', model1))
model2 = DecisionTreeClassifier()
estimators.append(('cart', model2))
model3 = SVC()
estimators.append(('svm', model3))
# create the ensemble model
ensemble = VotingClassifier(estimators)
results = cross_val_score(ensemble, X, Y, cv=kfold)
print(results.mean())
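
The two usual ways of combining sub-model outputs can be sketched in plain Python. The helpers `hard_vote` and `soft_vote` and the sub-model outputs are hypothetical; `VotingClassifier` uses hard (majority) voting unless `voting='soft'` is requested.

```python
from collections import Counter

def hard_vote(predictions):
    """Majority class among the sub-models' predicted labels."""
    return Counter(predictions).most_common(1)[0][0]

def soft_vote(probabilities):
    """Average the sub-models' class-1 probabilities, then threshold at 0.5."""
    mean_p = sum(probabilities) / len(probabilities)
    return (1 if mean_p >= 0.5 else 0), mean_p

# Hypothetical outputs of three sub-models for a single row:
# two models lean weakly towards class 1, one is confident in class 0
hard_label = hard_vote([1, 0, 1])
soft_label, mean_p = soft_vote([0.55, 0.10, 0.60])
print(hard_label, soft_label, round(mean_p, 3))
```

The two schemes can disagree, as they do here: the majority favors class 1, while the averaged probability falls below 0.5. Soft voting also requires every sub-model to produce probabilities, which for `SVC` means constructing it with `probability=True`.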

# 13. Improve Performance with Algorithm Tuning
Machine learning models are parameterized so that their behavior can be tuned for a given problem. Models can have many parameters and finding the best combination of parameters can be treated as a search problem.

## 13.1 Machine Learning Algorithm Parameters
Algorithm tuning is a final step in the process of applied machine learning before finalizing our model. It is sometimes called hyperparameter optimization where the algorithm parameters are referred to as hyperparameters, whereas the coefficients found by the machine learning algorithm itself are referred to as parameters. Optimization suggests the search-nature of the problem. Phrased as a search problem, we can use different search strategies to find a good and robust parameter or set of parameters for an algorithm on a given problem. Python scikit-learn provides two simple methods for algorithm parameter tuning:
- Grid Search Parameter Tuning.
- Random Search Parameter Tuning.

## 13.2 Grid Search Parameter Tuning
Grid search is an approach to parameter tuning that will methodically build and evaluate a model for each combination of algorithm parameters specified in a grid. We can perform a grid search using the GridSearchCV class. Below, we evaluate different alpha values for the Ridge Regression algorithm. This is a one-dimensional grid search. Note that Ridge is a regression algorithm, so the score reported here is the coefficient of determination (R^2) rather than classification accuracy.

In [None]:
# Grid Search for Algorithm Tuning
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
alphas = numpy.array([1,0.1,0.01,0.001,0.0001,0])
param_grid = dict(alpha=alphas)
model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=param_grid)
grid.fit(X, Y)
print(grid.best_score_)
print(grid.best_estimator_.alpha)

- The above code lists out the optimal score achieved and the set of parameters in the grid that achieved that score.
- In this case, the best alpha value is 1.0.
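
To see how quickly a grid grows once more than one parameter is involved, the sketch below enumerates the combinations of a small two-dimensional grid with `ParameterGrid`. The grid itself (`demo_grid`) is a hypothetical example and not one used elsewhere in this project.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical two-dimensional grid for Ridge: 3 alphas x 2 intercept settings
demo_grid = {'alpha': [1.0, 0.1, 0.01], 'fit_intercept': [True, False]}
combinations = list(ParameterGrid(demo_grid))
for params in combinations:
    print(params)
print("models to fit per cross validation fold:", len(combinations))
```

Every combination is fitted once per fold, so the cost of a grid search is the product of the grid dimensions times the number of folds.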

## 13.3 Random Search Parameter Tuning
Random search is an approach to parameter tuning that samples algorithm parameters from a random distribution (e.g. uniform) for a fixed number of iterations. A model is constructed and evaluated for each combination of parameters chosen. We can perform a random search for algorithm parameters using the RandomizedSearchCV class. Below, we evaluate random alpha values for the Ridge Regression algorithm. A total of 100 iterations are performed, with alpha drawn uniformly from the range 0 to 1 (the default range of scipy's uniform distribution; alpha itself can take any non-negative value).

In [None]:
# Randomized Search for Algorithm Tuning
from scipy.stats import uniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
param_grid = {'alpha': uniform()}
model = Ridge()
rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100, random_state=7)
rsearch.fit(X, Y)
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)

- The above code produces results much like those in grid search. An optimal alpha value near 1.0 is discovered.
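
Since the best alpha sits at the edge of the sampled range, it may be worth sampling from a wider range. The sketch below shows how the `loc` and `scale` arguments of scipy's `uniform` control that range, using `ParameterSampler` to draw a few candidates. The wider range of 0 to 10 is an arbitrary illustrative choice.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

# uniform(loc, scale) samples from [loc, loc + scale]; the defaults give [0, 1]
narrow = list(ParameterSampler({'alpha': uniform()}, n_iter=5, random_state=7))
wide = list(ParameterSampler({'alpha': uniform(loc=0, scale=10)}, n_iter=5, random_state=7))

print([round(p['alpha'], 3) for p in narrow])
print([round(p['alpha'], 3) for p in wide])
```

The same distribution objects can be passed to `RandomizedSearchCV` through `param_distributions`.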

# 14. References and Credits
- Thank you to Jason Brownlee https://machinelearningmastery.com/
- Used the following link to get column names in the correlation matrix https://www.geeksforgeeks.org/how-to-get-column-names-in-pandas-dataframe/
- For visualization https://seaborn.pydata.org/examples/scatterplot_matrix.html and https://seaborn.pydata.org/examples/pair_grid_with_kde.html
- For transformations on skewed data https://rcompanion.org/handbook/I_12.html
