# Choose the best classification algorithm

### Use k-fold cross-validation to choose the best classification algorithm

From the scikit-learn documentation concerning [k-fold cross-validation](http://scikit-learn.org/stable/modules/cross_validation.html):

>To avoid it ["overfitting"], it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test.

>In the basic approach, called *k-fold CV*, the training set is split into k smaller sets... The following procedure is followed for each of the k “folds”:

> * A model is trained using k-1 of the folds as training data;
> * the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).
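
The quoted procedure can be sketched in plain Python. This is only an illustration of the idea, not how scikit-learn implements it: the helper names (`k_fold_indices`, `majority_label`, `cross_validate`), the toy majority-class "model", and the `y_demo` labels are all made up for this example, and scikit-learn's own splitters differ in detail (for example, they can stratify by class).

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def majority_label(labels):
    """Toy 'model': always predict the most common training label."""
    return max(set(labels), key=labels.count)


def cross_validate(y, k=5):
    """Return one accuracy score per fold."""
    scores = []
    for held_out in k_fold_indices(len(y), k):
        held_out_set = set(held_out)
        # Train on the k-1 folds that are not held out.
        train_labels = [y[i] for i in range(len(y)) if i not in held_out_set]
        prediction = majority_label(train_labels)
        # Validate on the remaining fold.
        correct = sum(1 for i in held_out if y[i] == prediction)
        scores.append(correct / len(held_out))
    return scores


# Hypothetical labels: 80 matches and 20 non-matches.
y_demo = [1] * 80 + [0] * 20
fold_scores = cross_validate(y_demo, k=5)
print(fold_scores)
```

Each sample is held out exactly once, so the mean of the fold scores estimates how the model behaves on data it was not trained on.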

The following code uses this technique to evaluate the relative performance of various ML classification algorithms on the training data.

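RandomForest is one of the best choices: in the results at the end of this notebook, random forest and kNN share the highest accuracy, 0.98 (+/- 0.01).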

In [22]:
# Initialize

import subprocess

import numpy as np
import pandas as pd
from IPython.display import display

# Show the versions of all installed packages to help debug incompatibilities.
# pip runs outside Python, so use subprocess to capture the command's output.
# IPython's display() is used to print the full output.

try:
    display(subprocess.check_output(['pip', 'freeze']))
except subprocess.CalledProcessError as err:
    display(err)
    

b'apturl==0.5.2\nasn1crypto==0.24.0\natomicwrites==1.1.5\nattrs==18.1.0\nbackcall==0.1.0\nbeautifulsoup4==4.6.0\nbleach==2.1.3\nblinker==1.4\nBrlapi==0.6.4\ncertifi==2018.4.16\ncffi==1.11.5\nchardet==3.0.4\ncheckbox-support==0.38.0\ncommand-not-found==0.3\ncookies==2.2.1\ncryptography==2.2.2\ncycler==0.10.0\ndecorator==4.3.0\ndefer==1.0.6\nentrypoints==0.2.3\nfeedparser==5.2.1\nfido2==0.3.0\nfuzzywuzzy==0.16.0\nguacamole==0.9.2\nhtml5lib==1.0.1\nhttplib2==0.11.3\nidna==2.7\nipykernel==4.8.2\nipython==6.4.0\nipython-genutils==0.2.0\nipywidgets==7.2.1\njedi==0.12.1\nJinja2==2.10\njsonschema==2.6.0\njupyter==1.0.0\njupyter-client==5.2.3\njupyter-console==5.2.0\njupyter-core==4.4.0\nkiwisolver==1.0.1\nlanguage-selector==0.1\nlouis==2.6.4\nlxml==4.2.2\nMako==1.0.7\nMarkupSafe==1.0\nmatplotlib==2.2.2\nmistune==0.8.3\nmore-itertools==4.2.0\nmpmath==1.0.0\nnbconvert==5.3.1\nnbformat==4.4.0\nnotebook==5.5.0\nnumpy==1.14.5\noauthlib==2.1.0\nonboard==1.2.0\npadme==1.1.1\npandas==0.23.1\npandocfil

## Read in the vendor training data

Read in the manually labelled vendor training data.

Format it and convert to two numpy arrays for input to the scikit-learn ML algorithm.

* For Docker, the file is _"/home/jovyan/work/vulnmine/vulnmine_data/label_vendors.csv"_
* For PyCharm / IPython, the current working directory is *"~/PycharmProjects/vulnmine"*. The file is *"~/src/git/vulnmine/vulnmine_data/label_vendors.csv"*
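
One way to avoid editing the path by hand for each environment is to try the candidate locations in order. This is a minimal sketch under assumptions: the helper `first_existing` and the names `CANDIDATE_PATHS` and `label_vendors_path` are hypothetical, and the two paths are simply the ones listed above.

```python
import os

# Candidate locations taken from the notes above; adjust for your own setup.
CANDIDATE_PATHS = [
    "/home/jovyan/work/vulnmine/vulnmine_data/label_vendors.csv",
    "~/src/git/vulnmine/vulnmine_data/label_vendors.csv",
]


def first_existing(paths):
    """Return the first path that is an existing file (after expanding '~'), else None."""
    for path in paths:
        expanded = os.path.expanduser(path)
        if os.path.isfile(expanded):
            return expanded
    return None


# None here means that neither location exists on this machine.
label_vendors_path = first_existing(CANDIDATE_PATHS)
print(label_vendors_path)
```

The returned path could then be passed to `read_csv` in place of a hard-coded string.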

In [24]:
try:
    # error_bad_lines / warn_bad_lines skip malformed rows and warn about them.
    # (pandas >= 1.3 replaces both with on_bad_lines='warn'.)
    df_label_vendors = pd.read_csv(
        "~/src/git/vulnmine/vulnmine/vulnmine_data/label_vendors.csv",
        error_bad_lines=False,
        warn_bad_lines=True,
        quotechar='"',
        encoding='utf-8')
except IOError as e:
    display('\n\n***I/O error({0}): {1}\n\n'.format(
        e.errno, e.strerror))
    # Re-raise: without the data frame, the rest of the notebook cannot run.
    raise
except Exception as e:
    display('\n\n***Unexpected error: {0}\n\n'.format(type(e)))
    raise

# Number of records / columns

df_label_vendors.shape

(10110, 13)

In [28]:
# Format training data as "X" == features, "y" == target.
# The target value is the 1st column.
df_match_train1 = df_label_vendors[[
    'match',
    'fz_ptl_ratio', 'fz_ptl_tok_sort_ratio', 'fz_ratio',
    'fz_tok_set_ratio', 'fz_uwratio',
    'ven_len', 'pu0_len']]

# Convert into 2 numpy arrays for the scikit-learn ML classification algorithms.
np_match_train1 = np.asarray(df_match_train1)
X, y = np_match_train1[:, 1:], np_match_train1[:, 0]

display(X.shape, y.shape)

(10110, 7)

(10110,)

In [37]:
# set up for k-fold cross-validation to choose best model

# from sklearn import cross_validation  # old module, superseded by sklearn.model_selection
from sklearn.model_selection import cross_val_score

from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB



for clf, clf_name in (
        (RidgeClassifier(alpha=1.0), "Ridge Classifier"),
        (RidgeClassifier(tol=1e-2, solver="lsqr"), "Ridge Classifier #2"),
        (Perceptron(max_iter=100), "Perceptron"),
        (PassiveAggressiveClassifier(max_iter=100), "Passive-Aggressive"),
        (KNeighborsClassifier(n_neighbors=10), "kNN"),
        (NearestCentroid(), "Nearest Centroid"),
        (RandomForestClassifier(n_estimators=100, class_weight="balanced"), "Random forest"),
        (SGDClassifier(alpha=.0001, penalty="l2"), "SGD / SVM"),
        (MultinomialNB(alpha=.01), "Naive Bayes")):

    # 5-fold cross-validation; report mean accuracy and twice the standard deviation.
    scores = cross_val_score(clf, X, y, cv=5)
    display("%s, Accuracy: %0.2f (+/- %0.2f)" % (clf_name, scores.mean(), scores.std() * 2))

'Ridge Classifier, Accuracy: 0.97 (+/- 0.02)'

'Ridge Classifier #2, Accuracy: 0.97 (+/- 0.02)'

'Perceptron, Accuracy: 0.93 (+/- 0.03)'

'Passive-Aggressive, Accuracy: 0.87 (+/- 0.16)'

'kNN, Accuracy: 0.98 (+/- 0.01)'

'Nearest Centroid, Accuracy: 0.90 (+/- 0.04)'

'Random forest, Accuracy: 0.98 (+/- 0.01)'



'SGD / SVM, Accuracy: 0.93 (+/- 0.02)'

'Naive Bayes, Accuracy: 0.78 (+/- 0.15)'
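
The "+/-" figure in each line above is twice the standard deviation of the per-fold scores, not a formal confidence interval. The arithmetic can be reproduced with the standard library alone. In this sketch the helper `summarize` and the five fold scores are invented for illustration (they are not results from this notebook); `statistics.pstdev` is used because NumPy's `std()` defaults to the population standard deviation.

```python
from statistics import mean, pstdev


def summarize(name, scores):
    """Format per-fold scores as mean accuracy plus or minus two standard deviations."""
    return "%s, Accuracy: %0.2f (+/- %0.2f)" % (
        name, mean(scores), pstdev(scores) * 2)


# Made-up scores for five folds, for illustration only.
example_scores = [0.90, 0.92, 0.94, 0.92, 0.92]
print(summarize("Example model", example_scores))
# prints: Example model, Accuracy: 0.92 (+/- 0.03)
```

A wide spread, as with Passive-Aggressive and Naive Bayes above, means the score depends heavily on which fold is held out, so the mean alone can be misleading.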