The dataset I'm using is called water_potability, which essentially tries to determine whether or not a specific body of water is a viable source for drinking water based on a number of factors, one of which include pH (though the dataset is broken and has some nulls for that column, and I don't want to ruin the dataset by inputting a random value or putting zeros everywhere).

It is being classified based on whether a specific body of water is a viable drinking source of water, with 0 meaning no and 1 meaning yes. The features I'm using are the hardness of the water (the capacity of water to precipitate soap in mg/L) and the solids (total dissolved solids in ppm). 

In terms of the classifiers I used, I used two of the classifiers used in the example notebook, the KNeighborsClassifier and the DecisionTreeClassifier. The only thing that's changed is instead of using the Perceptron, I've used a Stochastic Gradient Descent instead. For the first ensembles set, I've set SGD to max_iter = 100 and tol = 1e-3, I've set DecisionTree to max_depth = 6, and I've set KNeighbors to n_neighbors = 20 and p = 3. For the second ensembles set, I've set SGD to max_iter = 1000 and tol = 1e-3, I've set DecisionTree to max_depth = 12, and I've set KNeighbors to n_neighbors = 40 and p = 6.

For both cross-validation attempts, the decision tree was the most accurate across the board, even more so compared to using a Majority Voting model. The standard deviation was highest for the SGDClassifier, though that may be due to its nature using a certain number of iterations as a parameter. Due to its higher standard deviation, it may explain why SGD across both cross-validation attempts had a lower accuracy, even at 10 times as many iterations. 

For both attempts at testing the testing data, the voting classifier actually did worse than the SGD" in the first ensembles set in terms of accuracy, but did slightly better than it on the second set. With the second ensemble set parameters, the individual classifiers were much more... volatile? than their first ensemble counterparts, leading to interesting variance especially in the standard deviation (i.e. DecisionTree was .097 vs a .139 during one test). Overall, the standard deviation of the voting classifier was higher than that of the individual classifiers, and its accuracy was higher than the individual classifiers overall.

In [8]:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

from sklearn.model_selection import train_test_split

In [2]:
import os
import pandas as pd

try:
    s = 'water_potability.csv'

    print('From URL:', s)
    df = pd.read_csv(s,
                     header=None,
                     encoding='utf-8')
#   print (df[0:50])
    
except HTTPError:
    s = 'iris.data'
    df = pd.read_csv(s,
                     header=None,
                     encoding='utf-8')
    
df.tail()

From URL: water_potability.csv


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
3272,4.668101687405915,193.68173547507868,47580.99160333534,7.166638935482532,359.94857436696,526.4241709223593,13.894418518194527,66.68769478539706,4.4358209095098,1
3273,7.808856017557415,193.55321164822675,17329.802160103376,8.061361987849569,,392.4495795653845,19.90322518345954,,2.7982428424180505,1
3274,9.41951031641321,175.76264629629543,33155.578218312294,7.350233233214412,,432.0447830453679,11.039069688154314,69.84540029205144,3.298875498646556,1
3275,5.126762923351532,230.60375750846123,11983.869376336364,6.303356534249105,,402.883113121781,11.1689462210565,77.48821310275477,4.708658467526655,1
3276,7.874671357791283,195.10229858610904,17404.17706105066,7.509305856927908,,327.4597604610721,16.140367626166324,78.69844632549504,2.309149056634923,1


In [3]:
y =  df.iloc[1:501, 9].values
X = df.iloc[1:501, [1, 2]]
X = X.astype(float)
X_train, X_test, y_train, y_test =\
        train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

In [4]:
pipe1 = make_pipeline(StandardScaler(), SGDClassifier(max_iter = 100, tol = 1e-3))

pipe2 = make_pipeline(DecisionTreeClassifier(max_depth=6,
                                             criterion='entropy',
                                             random_state=0))

pipe3 = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=20,
                                                             p=3,
                                                             metric='minkowski'))

clf_labels = ['SGD', 'Decision tree', 'KNN']

print('10-fold cross validation:\n')
for clf, label in zip([pipe1, pipe2, pipe3], clf_labels):
    scores = cross_val_score(estimator=clf,
                             X=X_train,
                             y=y_train,
                             cv=10,
                             scoring='accuracy')
    print("Accuracy: " + str(round(scores.mean(), 2)) + 
          " Stdev: " + str(round(scores.std(), 3)) +
          " [" + label + "]")

10-fold cross validation:

Accuracy: 0.63 Stdev: 0.12 [SGD]
Accuracy: 0.68 Stdev: 0.048 [Decision tree]
Accuracy: 0.67 Stdev: 0.034 [KNN]


In [5]:
from sklearn.ensemble import VotingClassifier

mv_clf = VotingClassifier(estimators=[('sgd', pipe1), ('dt', pipe2), ('kn', pipe3)])

clf_labels += ['Majority voting']
all_clf = [pipe1, pipe2, pipe3, mv_clf]

for clf, label in zip(all_clf, clf_labels):
    scores = cross_val_score(estimator=clf,
                             X=X_train,
                             y=y_train,
                             cv=10,
                             scoring='accuracy')
    print("Accuracy: " + str(round(scores.mean(), 2)) + 
          " Stdev: " + str(round(scores.std(), 3)) +
          " [" + label + "]")

Accuracy: 0.66 Stdev: 0.072 [SGD]
Accuracy: 0.68 Stdev: 0.048 [Decision tree]
Accuracy: 0.67 Stdev: 0.034 [KNN]
Accuracy: 0.68 Stdev: 0.042 [Majority voting]


In [6]:
pipe4 = make_pipeline(StandardScaler(), SGDClassifier(max_iter = 1000, tol = 1e-3))

pipe5 = make_pipeline(DecisionTreeClassifier(max_depth=12,
                                             criterion='entropy',
                                             random_state=0))

pipe6 = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=40,
                                                             p=6,
                                                             metric='minkowski'))

clf2_labels = ['SGD', 'Decision tree', 'KNN']

print('10-fold cross validation:\n')
for clf, label in zip([pipe1, pipe2, pipe3], clf_labels):
    scores = cross_val_score(estimator=clf,
                             X=X_train,
                             y=y_train,
                             cv=10,
                             scoring='accuracy')
    print("Accuracy: " + str(round(scores.mean(), 2)) + 
          " Stdev: " + str(round(scores.std(), 3)) +
          " [" + label + "]")

10-fold cross validation:

Accuracy: 0.59 Stdev: 0.085 [SGD]
Accuracy: 0.68 Stdev: 0.048 [Decision tree]
Accuracy: 0.67 Stdev: 0.034 [KNN]


In [7]:
mv_clf2 = VotingClassifier(estimators=[('sgd', pipe4), ('dt', pipe5), ('kn', pipe6)])

clf2_labels += ['Majority voting']
all_clf2 = [pipe4, pipe5, pipe6, mv_clf2]

for clf, label in zip(all_clf2, clf2_labels):
    scores = cross_val_score(estimator=clf,
                             X=X_train,
                             y=y_train,
                             cv=10,
                             scoring='accuracy')
    print("Accuracy: " + str(round(scores.mean(), 2)) + 
          " Stdev: " + str(round(scores.std(), 3)) +
          " [" + label + "]")

Accuracy: 0.68 Stdev: 0.054 [SGD]
Accuracy: 0.62 Stdev: 0.056 [Decision tree]
Accuracy: 0.68 Stdev: 0.013 [KNN]
Accuracy: 0.66 Stdev: 0.042 [Majority voting]
