# Test of scikit-learn Classifiers on iris Dataset
We run the most commonly used scikit-learn classifiers on the iris dataset to see how they perform. We first load the dataset from `sklearn.datasets` by calling `load_iris()`.

In [2]:
import numpy as np
import sklearn.datasets as skds
data = skds.load_iris()

x = np.array( data.data )
y = np.array( data.target )
target_names = data.target_names
nx, ndim = x.shape
print( 'X data number: %d' % nx )
print( 'X dimension number: %d' % ndim )
print( 'Y cat '+ str( [i for i in set(y)] ) )

X data number: 150
X dimension number: 4
Y cat [0, 1, 2]
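
Before splitting the data, it can help to check that the three classes are balanced. The following is a minimal sketch of such a check; the names `counts` and `class_counts` are illustrative and not part of the original notebook.

```python
import numpy as np
import sklearn.datasets as skds

data = skds.load_iris()
y = np.array(data.target)

# Count how many samples fall into each integer class label (0, 1, 2).
counts = np.bincount(y)

# Pair each count with its human-readable species name.
class_counts = {str(name): int(n) for name, n in zip(data.target_names, counts)}
print(class_counts)
```

Equal counts per class mean that plain accuracy is a reasonable first metric for this dataset.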


All features in $X$ are continuous, while the target $Y$ is categorical with 3 classes. The dataset has no missing values, so no data cleaning is needed. The first step is to split it into training and test sets.

In [3]:
# Note: in scikit-learn 0.18 and later, train_test_split lives in sklearn.model_selection
import sklearn.cross_validation as skcv

xtrain, xtest, ytrain, ytest = skcv.train_test_split( x, y, test_size=0.33, random_state=0)

def output(clf, xtrain, ytrain, xtest, ytest):
    print( 'Train Score:%6.4f Test Score:%6.4f' % (clf.score(xtrain, ytrain), clf.score(xtest, ytest)) )
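
On a dataset this small, a purely random split can leave the classes slightly unbalanced between the two sets. As a hedged sketch, assuming scikit-learn 0.18 or later (where `train_test_split` lives in `sklearn.model_selection`), the `stratify` argument keeps the class proportions similar in both sets. The `_s`-suffixed names are illustrative only; the rest of this notebook keeps using the unstratified split.

```python
import numpy as np
import sklearn.datasets as skds
from sklearn.model_selection import train_test_split  # assumes scikit-learn >= 0.18

data = skds.load_iris()
x, y = data.data, data.target

# Same test fraction and seed as the split above, but stratified on the labels.
xtrain_s, xtest_s, ytrain_s, ytest_s = train_test_split(
    x, y, test_size=0.33, random_state=0, stratify=y)

# Per-class sample counts in each part of the split.
print('train class counts:', np.bincount(ytrain_s))
print('test class counts: ', np.bincount(ytest_s))
```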

## First Round with default settings
Next we fit models to the training set. Since the target is categorical, this is a classification task, so we choose classifier models.

In [121]:
import sklearn.naive_bayes as sknb
clf = sknb.MultinomialNB( )
# fit_prior=True
clf = clf.fit(xtrain, ytrain)
output(clf, xtrain, ytrain, xtest, ytest)

Train Score:0.8200 Test Score:0.7000


In [122]:
import sklearn.naive_bayes as sknb
clf = sknb.GaussianNB()
# unlike MultinomialNB, GaussianNB takes no fit_prior argument
clf = clf.fit(xtrain, ytrain)
output(clf, xtrain, ytrain, xtest, ytest)

Train Score:0.9600 Test Score:0.9600


In [123]:
import sklearn.linear_model as sklm
clf = sklm.LogisticRegression()
# penalty='l1'/'l2', C=1.0
# solver='newton-cg'/'lbfgs'/'liblinear'/'sag', tol=
clf = clf.fit(xtrain, ytrain)
output(clf, xtrain, ytrain, xtest, ytest)

Train Score:0.9500 Test Score:0.9000


In [124]:
import sklearn.tree as sktree
clf = sktree.DecisionTreeClassifier()
# max_depth=None, min_samples_split=2, random_state=0
clf = clf.fit(xtrain, ytrain)
output(clf, xtrain, ytrain, xtest, ytest)

Train Score:1.0000 Test Score:0.9600


In [125]:
import sklearn.ensemble as skens
clf = skens.BaggingClassifier()
# base_estimator=default is DecisionTreeClassifier
# n_estimators=10, max_samples=1.0/0.5/10, max_features=1.0/0.5/10
# n_jobs=1/n/-1
clf = clf.fit(xtrain, ytrain)
output(clf, xtrain, ytrain, xtest, ytest)

Train Score:1.0000 Test Score:0.9600


In [126]:
import sklearn.ensemble as skens
clf = skens.AdaBoostClassifier()
# base_estimator = default is DecisionTreeClassifier
# n_estimators=50
clf = clf.fit(xtrain, ytrain)
output(clf, xtrain, ytrain, xtest, ytest)

Train Score:0.9600 Test Score:0.9000


In [127]:
import sklearn.ensemble as skens
clf = skens.RandomForestClassifier(n_estimators=200)
# n_estimators=10, max_features="auto"/"sqrt"/"log2"/n/None
# max_depth=None, min_samples_split=2, bootstrap=True, random_state=None
# n_jobs=1/n/-1
clf = clf.fit(xtrain, ytrain)
output(clf, xtrain, ytrain, xtest, ytest)

Train Score:1.0000 Test Score:0.9600


In [128]:
import sklearn.ensemble as skens
clf = skens.ExtraTreesClassifier(n_estimators=100)
# n_estimators=10, max_depth=None, min_samples_split=2, random_state=0
# n_jobs=1/n/-1
clf = clf.fit(xtrain, ytrain)
output(clf, xtrain, ytrain, xtest, ytest)

Train Score:1.0000 Test Score:0.9600


In [129]:
import sklearn.ensemble as skens
clf = skens.GradientBoostingClassifier( )
# n_estimators=10, max_depth=3, min_samples_split=2, random_state=0
# max_features="auto"/"sqrt"/"log2"/n/None
clf = clf.fit(xtrain, ytrain)
output(clf, xtrain, ytrain, xtest, ytest)

Train Score:1.0000 Test Score:0.9600


In [130]:
import sklearn.ensemble as skens
import sklearn.tree as sktree
import sklearn.naive_bayes as sknb
clf = skens.VotingClassifier(
    estimators=[('lr', sklm.LogisticRegression()),
                ('rf', skens.RandomForestClassifier()),
                ('gnb', sknb.GaussianNB())],
    voting='hard')
# estimators=[ ('clf1', clf1 ), ('clf2', clf2),... ]
# voting='hard'/'soft'
# weights=[w1,w2,w3]
clf = clf.fit(xtrain, ytrain)
output(clf, xtrain, ytrain, xtest, ytest)

Train Score:0.9800 Test Score:0.9600
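
The comments in the cell above mention `voting='soft'` and `weights`. As a rough sketch, assuming scikit-learn 0.18 or later for the `sklearn.model_selection` import, soft voting averages the predicted class probabilities of the base models instead of counting their votes. The weights `[1, 2, 1]`, `max_iter=1000`, and the fixed `random_state` are arbitrary illustrative choices, not values from the original notebook.

```python
import numpy as np
import sklearn.datasets as skds
import sklearn.ensemble as skens
import sklearn.linear_model as sklm
import sklearn.naive_bayes as sknb
from sklearn.model_selection import train_test_split  # assumes scikit-learn >= 0.18

data = skds.load_iris()
xtrain, xtest, ytrain, ytest = train_test_split(
    data.data, data.target, test_size=0.33, random_state=0)

# Soft voting needs base estimators that implement predict_proba.
soft_clf = skens.VotingClassifier(
    estimators=[('lr', sklm.LogisticRegression(max_iter=1000)),
                ('rf', skens.RandomForestClassifier(n_estimators=100, random_state=0)),
                ('gnb', sknb.GaussianNB())],
    voting='soft',
    weights=[1, 2, 1])  # hypothetical weights: the random forest counts double
soft_clf.fit(xtrain, ytrain)

# Weighted average of the base models' class probabilities, one row per test sample.
proba = soft_clf.predict_proba(xtest)
soft_test_score = soft_clf.score(xtest, ytest)
print('Test Score:%6.4f' % soft_test_score)
```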


In [131]:
import sklearn.svm as sksvm
clf = sksvm.SVC( )
# C=1, kernel='rbf'/'linear'/'poly'/'sigmoid'
clf = clf.fit(xtrain, ytrain)
output(clf, xtrain, ytrain, xtest, ytest)

Train Score:0.9800 Test Score:0.9800
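
The cells in this round all follow the same fit-then-score pattern, so they can be condensed into one loop. The sketch below assumes scikit-learn 0.18 or later and covers only three of the classifiers; the `classifiers` and `results` dictionaries are illustrative names, and `random_state=0` on the tree is added only to make the run repeatable.

```python
import sklearn.datasets as skds
import sklearn.naive_bayes as sknb
import sklearn.svm as sksvm
import sklearn.tree as sktree
from sklearn.model_selection import train_test_split  # assumes scikit-learn >= 0.18

data = skds.load_iris()
xtrain, xtest, ytrain, ytest = train_test_split(
    data.data, data.target, test_size=0.33, random_state=0)

# A small subset of the classifiers tried above, with default settings
# (apart from the fixed seed on the decision tree).
classifiers = {
    'GaussianNB': sknb.GaussianNB(),
    'DecisionTree': sktree.DecisionTreeClassifier(random_state=0),
    'SVC': sksvm.SVC(),
}

results = {}
for name, clf in classifiers.items():
    clf.fit(xtrain, ytrain)
    # Store (train accuracy, test accuracy) for each model.
    results[name] = (clf.score(xtrain, ytrain), clf.score(xtest, ytest))
    print('%-14s Train Score:%6.4f Test Score:%6.4f' % ((name,) + results[name]))
```

Exact scores can differ from the outputs above depending on the scikit-learn version and its defaults.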


## Classifier Metrics
Besides overall accuracy, a confusion matrix shows how well the model performs on each class. `sklearn.metrics` provides many metrics for model evaluation.

In [8]:
import sklearn.metrics as skmt

import sklearn.svm as sksvm
clf = sksvm.SVC( )
# C=1, kernel='rbf'/'linear'/'poly'/'sigmoid'
clf = clf.fit(xtrain, ytrain)

predtrain = clf.predict( xtrain )
predtest  = clf.predict( xtest )

print( skmt.confusion_matrix(ytrain, predtrain) )
print( skmt.classification_report(ytrain, predtrain) )

[[34  0  0]
 [ 0 29  2]
 [ 0  0 35]]
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        34
          1       1.00      0.94      0.97        31
          2       0.95      1.00      0.97        35

avg / total       0.98      0.98      0.98       100
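
To see where the numbers in the report come from, the per-class precision, recall, and F1 can be recomputed by hand from the confusion matrix printed above. This is a small worked example with NumPy only; the variable names are illustrative.

```python
import numpy as np

# Training-set confusion matrix printed above:
# rows are true classes, columns are predicted classes.
cm = np.array([[34, 0, 0],
               [0, 29, 2],
               [0, 0, 35]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # divide by column sums (predicted counts)
recall = true_positives / cm.sum(axis=1)     # divide by row sums (true counts)
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

print(np.round(precision, 2))
print(np.round(recall, 2))
print(np.round(f1, 2))
print('accuracy %6.4f' % accuracy)
```

The two class-1 samples predicted as class 2 lower the recall of class 1 (29/31) and the precision of class 2 (35/37), which matches the report.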



## Cross Validation
Among the most common scikit-learn classifiers with their default parameter settings, SVM gets the best test score (0.98). However, other classifiers may beat SVM once their hyperparameters are tuned by cross-validation. Let's first look at how scikit-learn handles cross-validation.

In [135]:
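# Note: in scikit-learn 0.18 and later, cross_val_score lives in sklearn.model_selection
import sklearn.cross_validation as skcv
# liblinear supports the l1 penalty; newer releases need it set explicitly
clf = sklm.LogisticRegression(penalty="l1", solver="liblinear")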
scores = skcv.cross_val_score( clf, xtrain, ytrain)
# cv=3 (3-fold)/ n (n-fold)
# n_jobs=1/n/-1
print( scores )
print( 'CV score %6.4f' % np.mean(scores) )
clf = clf.fit( xtrain, ytrain )
output(clf, xtrain, ytrain, xtest, ytest)

[ 0.9143  0.9697  0.9062]
CV score 0.9301
Train Score:0.9700 Test Score:0.9000
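
What `cross_val_score` does can be sketched as an explicit loop over folds: fit a fresh copy of the model on all folds but one, score it on the held-out fold, and repeat. The sketch below assumes scikit-learn 0.18 or later and uses `StratifiedKFold` with 3 folds; `manual_scores` and `helper_scores` are illustrative names, and `random_state=0` is added only for repeatability.

```python
import numpy as np
import sklearn.datasets as skds
import sklearn.linear_model as sklm
from sklearn.base import clone
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)  # assumes scikit-learn >= 0.18

data = skds.load_iris()
xtrain, xtest, ytrain, ytest = train_test_split(
    data.data, data.target, test_size=0.33, random_state=0)

clf = sklm.LogisticRegression(penalty='l1', solver='liblinear', random_state=0)
skf = StratifiedKFold(n_splits=3)

manual_scores = []
for fit_idx, val_idx in skf.split(xtrain, ytrain):
    fold_clf = clone(clf)                           # fresh, unfitted copy per fold
    fold_clf.fit(xtrain[fit_idx], ytrain[fit_idx])  # train on the other folds
    manual_scores.append(fold_clf.score(xtrain[val_idx], ytrain[val_idx]))
manual_scores = np.array(manual_scores)

# The helper function run on the same folds, for comparison.
helper_scores = cross_val_score(clf, xtrain, ytrain, cv=skf)
print(manual_scores)
print(helper_scores)
```

Note that the test set is never touched inside the loop; it stays reserved for the final evaluation.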


Next we use `GridSearchCV` from `sklearn.grid_search` to find an optimal parameter for `sklm.LogisticRegression`. We feed C = 0.5, 2, 10, 50, and 100 to `GridSearchCV` and let it choose the best C by cross-validation. After fitting, it acts as a classifier that uses the best C. With the tuned C, `sklm.LogisticRegression` now reaches a test score of 0.98, the same as SVM.

In [137]:
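# Note: in scikit-learn 0.18 and later, GridSearchCV lives in sklearn.model_selection
import sklearn.grid_search as skgs
# liblinear supports the l1 penalty; newer releases need it set explicitly
clf = sklm.LogisticRegression(penalty="l1", solver="liblinear")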
params = {'C':[0.5, 2., 10, 50, 100] }
gsclf = skgs.GridSearchCV( clf, param_grid=params)
# cv=3 (3-fold)/ n (n-fold)
# n_jobs=1/n/-1
clf = clf.fit(xtrain, ytrain)
gsclf = gsclf.fit(xtrain, ytrain)
output(clf, xtrain, ytrain, xtest, ytest)
output(gsclf, xtrain, ytrain, xtest, ytest)

Train Score:0.9700 Test Score:0.9000
Train Score:0.9900 Test Score:0.9800
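
After fitting, a `GridSearchCV` object also records which parameter value won and its cross-validated score. The following is a hedged sketch, assuming scikit-learn 0.18 or later (where `GridSearchCV` is imported from `sklearn.model_selection`); `cv=3` and `random_state=0` are set explicitly here for repeatability and are assumptions, not settings taken from the cell above.

```python
import sklearn.datasets as skds
import sklearn.linear_model as sklm
from sklearn.model_selection import GridSearchCV, train_test_split  # scikit-learn >= 0.18

data = skds.load_iris()
xtrain, xtest, ytrain, ytest = train_test_split(
    data.data, data.target, test_size=0.33, random_state=0)

params = {'C': [0.5, 2., 10, 50, 100]}
gsclf = GridSearchCV(
    sklm.LogisticRegression(penalty='l1', solver='liblinear', random_state=0),
    param_grid=params, cv=3)
gsclf.fit(xtrain, ytrain)

# best_params_ holds the winning grid point; best_score_ is its mean CV accuracy.
print('best C:', gsclf.best_params_['C'])
print('best CV score: %6.4f' % gsclf.best_score_)
# score() delegates to the model refit on the full training set with the best C.
print('test score: %6.4f' % gsclf.score(xtest, ytest))
```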
