# Machine Learning Classifier Models: A Brief Tour II

In the previous notebook we experimented with the following 10 Machine Learning (ML) classifier models on a synthetic two-dimensional dataset:

1. K-Nearest Neighbors (KNN)
2. Logistic Regression
3. Polynomial Logistic Regression
4. Naive Bayes
5. Linear Support Vector Machine (Linear SVM)
6. Non-Linear Support Vector Machine (SVM Gaussian Radial Basis Function)
7. Decision Tree
8. Ensemble Method: Random Forest
9. Ensemble Method: Voting Classifier
10. Multi-Layer Perceptron (MLP)



Although the dataset was non-linear, we were able to achieve over 90% test accuracy. However, in many practical problems the datasets are significantly more non-linear and high-dimensional. As a consequence, achieving over 90% test accuracy is challenging.

In this notebook we will explore this challenge by training the above models on a **large, high-dimensional dataset**. We will see that even the best-performing model from the previous notebook does not reach 85% test accuracy.

The takeaway lesson from the previous notebook was the need for a scientific understanding of the models and for well-chosen hyperparameter settings. The current notebook reinforces this lesson by underscoring the fact that **in practical scenarios, problems are significantly more challenging** and therefore require a scientific understanding of the models.

In [1]:
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.svm import LinearSVC, SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline, make_pipeline

# Multi-Dimensional Dataset

URL: https://archive.ics.uci.edu/ml/datasets/wine+quality

The dataset is related to the white variants of the Portuguese "Vinho Verde" wine. It provides physicochemical (input) and sensory (output) variables.

The dataset can be treated as either a classification or a regression task. The classes are ordered and not balanced (e.g., there are many more normal wines than excellent or poor ones).

The dataset is **11-dimensional**: it consists of characteristics of white wine (e.g., alcohol content, density, amount of citric acid, pH, etc.), with the target variable "quality" representing the rating of the wine.

The target variable ("quality" rating of wine) takes integer values from 3 to 9 in the white wine data. We will convert it into a two-category variable consisting of "good" (quality > 5) and "bad" (quality <= 5). The target vector should have 0s (representing "bad" quality wine) and 1s (representing "good" quality wine).

Given the characteristics of a new, unlabeled wine, the classification task is to predict its "quality" (0 or 1).
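
The binarization described above can be sketched on a handful of made-up scores. The `quality` Series here is hypothetical and only stands in for the real `df['quality']` column; the last line shows one way to check how balanced the two classes are.

```python
import pandas as pd

# Hypothetical quality scores standing in for df['quality']
quality = pd.Series([3, 5, 6, 7, 4, 6, 8, 5], name="quality")

# 1 for "good" wine (quality > 5), 0 for "bad" wine (quality <= 5)
y_demo = (quality > 5).astype(int)

print(y_demo.tolist())

# Fraction of samples in each class, useful for spotting imbalance
print(y_demo.value_counts(normalize=True).sort_index())
```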

Input variables (based on physicochemical tests):
- fixed acidity
- volatile acidity
- citric acid
- residual sugar
- chlorides
- free sulfur dioxide
- total sulfur dioxide
- density
- pH
- sulphates
- alcohol

Output variable (based on sensory data): 
- quality (score between 0 and 10)

In [2]:
# Read the CSV file containing the dataset as a Pandas DataFrame object
df = pd.read_csv('/Users/hasan/datasets/winequality-white.csv')

# Get the target column from the DataFrame
y_quality = df['quality'] # 1D target vector

# Create the 1D Target Vector
y = (y_quality > 5).astype(int)  # 1 if Good Wine, else 0 (Bad Wine); np.int was removed in NumPy 1.24

# Create the data matrix containing all features excluding the target
X = df.drop(columns='quality')  

## Create Training and Test Dataset

In [3]:
# Split the dataset into training and test subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(3918, 11)
(980, 11)
(3918,)
(980,)
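
Because the classes are not balanced, one option worth knowing about is a stratified split, which keeps the class proportions the same in both subsets. The sketch below uses random stand-in data with an assumed 65/35 class ratio (not the wine data) and the `stratify` argument of `train_test_split`. It is an illustration, not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 samples, 11 features, 65% positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 11))
y_demo = np.array([1] * 650 + [0] * 350)

# stratify=y_demo preserves the class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Fraction of positives in the training and test subsets
print(y_tr.mean(), y_te.mean())
```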


## K-Nearest Neighbors

In [4]:
# Create the model
knn = KNeighborsClassifier(n_neighbors=10)
    
# Fit the model
knn.fit(X_train, y_train)

# Compute accuracy on the training set
train_accuracy_knn = knn.score(X_train, y_train)

# Compute accuracy on the test set
test_accuracy_knn = knn.score(X_test, y_test) 

print("Train Accuracy: ", train_accuracy_knn)
print("Test Accuracy: ", test_accuracy_knn)

Train Accuracy:  0.7570188871873405
Test Accuracy:  0.686734693877551
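
KNN relies on distances, so features measured on very different scales can dominate the neighbor search. A minimal sketch of standardizing the features inside a pipeline is shown below. It uses synthetic data from `make_classification` with one artificially inflated feature (an assumption for illustration), so it makes no claim about the effect on the wine data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 11 features
X_demo, y_demo = make_classification(
    n_samples=500, n_features=11, n_informative=5, random_state=0
)
X_demo[:, 0] *= 100.0  # inflate one feature to mimic mismatched units

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The scaler is fit on the training data only, then applied before KNN
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=10))
knn_scaled.fit(X_tr, y_tr)

acc = knn_scaled.score(X_te, y_te)
print("Test Accuracy: ", acc)
```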


## Logistic Regression

In [5]:
# Create the model
lg_reg = LogisticRegression()
    
# Fit the model
lg_reg.fit(X_train, y_train)

# Compute accuracy on the training set
train_accuracy_lg_reg = lg_reg.score(X_train, y_train)

# Compute accuracy on the test set
test_accuracy_lg_reg = lg_reg.score(X_test, y_test) 

print("Train Accuracy: ", train_accuracy_lg_reg)
print("Test Accuracy: ", test_accuracy_lg_reg)

Train Accuracy:  0.756253190403267
Test Accuracy:  0.7255102040816327


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
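
The warning above suggests two remedies: scaling the data and raising `max_iter`. A minimal sketch of both, on synthetic data from `make_classification` rather than the wine data, could look as follows. The value `max_iter=1000` is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with one badly scaled feature
X_demo, y_demo = make_classification(
    n_samples=500, n_features=11, n_informative=5, random_state=0
)
X_demo[:, 0] *= 1000.0

# Standardize the features and allow more solver iterations
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_demo, y_demo)

# Number of iterations the solver actually used
n_iter = clf.named_steps["logisticregression"].n_iter_[0]
print("Iterations used: ", n_iter)
print("Train Accuracy: ", clf.score(X_demo, y_demo))
```

If the number of iterations used stays below `max_iter`, the solver stopped because it converged rather than because it hit the limit.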


## Polynomial Logistic Regression

In [6]:
optimal_poly_degree = 3


# Create the model
lg_reg_poly = Pipeline([
    ("poly_features", PolynomialFeatures(degree=optimal_poly_degree)), # Add polynomial terms with the feature vector
    ("scaler", StandardScaler()), # Scale the features
    ("clf", LogisticRegression())
    ])

    
# Fit the model
lg_reg_poly.fit(X_train, y_train)

# Compute accuracy on the training set
train_accuracy_lg_reg_poly = lg_reg_poly.score(X_train, y_train)

# Compute accuracy on the test set
test_accuracy_lg_reg_poly = lg_reg_poly.score(X_test, y_test) 

print("Train Accuracy: ", train_accuracy_lg_reg_poly)
print("Test Accuracy: ", test_accuracy_lg_reg_poly)

Train Accuracy:  0.7856049004594181
Test Accuracy:  0.7622448979591837


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


## Naive Bayes Classifier

In [7]:
# Create the model
nb = GaussianNB()

# Fit the model
nb.fit(X_train, y_train)

# Compute accuracy on the training set
train_accuracy_nb = nb.score(X_train, y_train)

# Compute accuracy on the test set
test_accuracy_nb = nb.score(X_test, y_test) 

print("Train Accuracy: ", train_accuracy_nb)
print("Test Accuracy: ", test_accuracy_nb)

Train Accuracy:  0.7120980091883614
Test Accuracy:  0.6826530612244898


## Support Vector Machine (Linear)

In [8]:
# Create the model: Linear SVM
svm_linear = Pipeline([
        ("scaler", StandardScaler()),
        ("linear_svc", LinearSVC()),
    ])

# Fit the model
svm_linear.fit(X_train, y_train)

# Compute accuracy on the training set
train_accuracy_svm_linear = svm_linear.score(X_train, y_train)

# Compute accuracy on the test set
test_accuracy_svm_linear = svm_linear.score(X_test, y_test) 

print("Train Accuracy: ", train_accuracy_svm_linear)
print("Test Accuracy: ", test_accuracy_svm_linear)

Train Accuracy:  0.7552322613578356
Test Accuracy:  0.7306122448979592




## Non-Linear Support Vector Machine (Gaussian Radial Basis Function)

In [9]:
# Create the model: Gaussian Radial Basis Function (RBF) based SVM
svm = SVC(kernel="rbf", gamma=0.3, C=100)

# Fit the model
svm.fit(X_train, y_train)

# Compute accuracy on the training set
train_accuracy_svm = svm.score(X_train, y_train)

# Compute accuracy on the test set
test_accuracy_svm = svm.score(X_test, y_test) 

print("Train Accuracy: ", train_accuracy_svm)
print("Test Accuracy: ", test_accuracy_svm)

Train Accuracy:  1.0
Test Accuracy:  0.7326530612244898
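
A training accuracy of 1.0 next to a much lower test accuracy suggests that `gamma=0.3, C=100` overfits the unscaled training data. One common way to look for better settings is a cross-validated grid search over `C` and `gamma`. The sketch below runs on synthetic data, and its grid values are arbitrary assumptions rather than tuned values for the wine dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with 11 features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=5, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf")),
])

# Hypothetical grid; "svc__" routes each parameter to the SVC step
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]}

# 5-fold cross-validation over all 9 combinations
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print("Best CV Accuracy: ", search.best_score_)
```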


## Decision Tree

In [10]:
# Create the model
dtree = DecisionTreeClassifier(max_depth=20)
    
# Fit the model
dtree.fit(X_train, y_train)

# Compute accuracy on the training set
train_accuracy_dtree = dtree.score(X_train, y_train)

# Compute accuracy on the test set
test_accuracy_dtree = dtree.score(X_test, y_test) 

print("Train Accuracy: ", train_accuracy_dtree)
print("Test Accuracy: ", test_accuracy_dtree)

Train Accuracy:  0.9997447677386422
Test Accuracy:  0.7744897959183673


## Ensemble Method: Random Forest

In [11]:
# Create the model
rndforest = RandomForestClassifier(n_estimators=100, max_depth=20)
    
# Fit the model
rndforest.fit(X_train, y_train)

# Compute accuracy on the training set
train_accuracy_rndforest = rndforest.score(X_train, y_train)

# Compute accuracy on the test set
test_accuracy_rndforest = rndforest.score(X_test, y_test) 

print("Train Accuracy: ", train_accuracy_rndforest)
print("Test Accuracy: ", test_accuracy_rndforest)

Train Accuracy:  1.0
Test Accuracy:  0.8193877551020409


## Ensemble Method: Voting Classifier

In [12]:
# Create the model
voting_clf = VotingClassifier(
    estimators=[('K-NN', knn), ('Naive Bayes', nb), 
                ('Support Vector Machine', svm), ('Random Forest', rndforest)],
    voting='hard')

# Fit the model
voting_clf.fit(X_train, y_train)

# Compute accuracy on the training set
train_accuracy_voting_clf = voting_clf.score(X_train, y_train)

# Compute accuracy on the test set
test_accuracy_voting_clf = voting_clf.score(X_test, y_test) 

print("Train Accuracy: ", train_accuracy_voting_clf)
print("Test Accuracy: ", test_accuracy_voting_clf)

Train Accuracy:  0.9561000510464522
Test Accuracy:  0.7795918367346939


## Artificial Neural Network (Multi-Layer Perceptron)


In [13]:
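# Create the model
# Note: in scikit-learn, early_stopping and n_iter_no_change only take effect
# with solver='sgd' or 'adam'; with solver='lbfgs' they are ignored.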
mlp = Pipeline([
        ("scaler", StandardScaler()),
        ("MLP", MLPClassifier(hidden_layer_sizes=(300,100), alpha=0.2, 
                              activation='relu', solver='lbfgs',
                              early_stopping=True, n_iter_no_change=10, max_iter=1000)),
    ])
    
# Fit the model
mlp.fit(X_train, y_train)

# Compute accuracy on the training set
train_accuracy_mlp = mlp.score(X_train, y_train)

# Compute accuracy on the test set
test_accuracy_mlp = mlp.score(X_test, y_test) 

print("Train Accuracy: ", train_accuracy_mlp)
print("Test Accuracy: ", test_accuracy_mlp)

Train Accuracy:  1.0
Test Accuracy:  0.789795918367347


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
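
In scikit-learn, `early_stopping` and `n_iter_no_change` are only effective with the `sgd` and `adam` solvers, so they have no effect in the `lbfgs` configuration above. A minimal sketch of an MLP where early stopping is active is shown below. It uses synthetic data, and the layer size and `validation_fraction` are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 11 features
X_demo, y_demo = make_classification(
    n_samples=400, n_features=11, n_informative=5, random_state=0
)

mlp_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("mlp", MLPClassifier(hidden_layer_sizes=(50,), solver="adam",
                          early_stopping=True, validation_fraction=0.1,
                          n_iter_no_change=10, max_iter=500, random_state=0)),
])
mlp_demo.fit(X_demo, y_demo)

# With early stopping, a validation score is recorded after each epoch
net = mlp_demo.named_steps["mlp"]
print("Epochs run: ", net.n_iter_)
print("Validation scores recorded: ", len(net.validation_scores_))
```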


## Performance Comparison of 10 Models

In [14]:
data = [
    ["KNN", train_accuracy_knn, test_accuracy_knn],
    ["Logistic Regression", train_accuracy_lg_reg, test_accuracy_lg_reg],
    ["Polynomial Logistic Regression", train_accuracy_lg_reg_poly, test_accuracy_lg_reg_poly],
    ["Naive Bayes", train_accuracy_nb, test_accuracy_nb],
    ["Support Vector Machine (Linear)", train_accuracy_svm_linear, test_accuracy_svm_linear],
    ["Support Vector Machine (Gaussian RBF)", train_accuracy_svm, test_accuracy_svm],
    ["Decision Tree", train_accuracy_dtree, test_accuracy_dtree],
    ["Ensemble: Random Forest", train_accuracy_rndforest, test_accuracy_rndforest],
    ["Ensemble: Voting Classifier", train_accuracy_voting_clf, test_accuracy_voting_clf],
    ["Multi-Layer Perceptron", train_accuracy_mlp, test_accuracy_mlp],
    
    ]
pd.DataFrame(data, columns=["Model", "Train Accuracy", "Test Accuracy"])

| | Model | Train Accuracy | Test Accuracy |
|---|---|---|---|
| 0 | KNN | 0.757019 | 0.686735 |
| 1 | Logistic Regression | 0.756253 | 0.725510 |
| 2 | Polynomial Logistic Regression | 0.785605 | 0.762245 |
| 3 | Naive Bayes | 0.712098 | 0.682653 |
| 4 | Support Vector Machine (Linear) | 0.755232 | 0.730612 |
| 5 | Support Vector Machine (Gaussian RBF) | 1.000000 | 0.732653 |
| 6 | Decision Tree | 0.999745 | 0.774490 |
| 7 | Ensemble: Random Forest | 1.000000 | 0.819388 |
| 8 | Ensemble: Voting Classifier | 0.956100 | 0.779592 |
| 9 | Multi-Layer Perceptron | 1.000000 | 0.789796 |
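
To make the comparison easier to read, we can rank the models by test accuracy and add the gap between train and test accuracy as a rough indicator of overfitting. The sketch below copies the numbers from the table above. The `Gap` column and the `results` and `ranked` names are additions for illustration.

```python
import pandas as pd

# Accuracies copied from the comparison table above
results = pd.DataFrame(
    [
        ["KNN", 0.757019, 0.686735],
        ["Logistic Regression", 0.756253, 0.725510],
        ["Polynomial Logistic Regression", 0.785605, 0.762245],
        ["Naive Bayes", 0.712098, 0.682653],
        ["Support Vector Machine (Linear)", 0.755232, 0.730612],
        ["Support Vector Machine (Gaussian RBF)", 1.0, 0.732653],
        ["Decision Tree", 0.999745, 0.774490],
        ["Ensemble: Random Forest", 1.0, 0.819388],
        ["Ensemble: Voting Classifier", 0.956100, 0.779592],
        ["Multi-Layer Perceptron", 1.0, 0.789796],
    ],
    columns=["Model", "Train Accuracy", "Test Accuracy"],
)

# A large train/test gap hints at overfitting
results["Gap"] = results["Train Accuracy"] - results["Test Accuracy"]

# Rank the models from best to worst test accuracy
ranked = results.sort_values("Test Accuracy", ascending=False).reset_index(drop=True)
print(ranked)
```

On these numbers, Random Forest ranks first on test accuracy, while the Gaussian RBF SVM shows the largest train/test gap.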
