# Choose the best classification algorithm

### Use a k-fold cross-validation to choose the best classification algorithm

From the scikit-learn documentation concerning [k-fold cross-validation](http://scikit-learn.org/stable/modules/cross_validation.html):

> To avoid it [overfitting], it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set `X_test, y_test`.

> In the basic approach, called *k-fold CV*, the training set is split into k smaller sets... The following procedure is followed for each of the k "folds":
>
> * A model is trained using k-1 of the folds as training data;
> * the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).
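
The procedure quoted above can be sketched in plain Python. This is only an illustration of the idea, not scikit-learn's implementation: the helper names (`k_fold_indices`, `cross_validate`, `fit_majority`, `accuracy`) and the toy data are hypothetical, and the folds are simple contiguous slices with no shuffling or stratification.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) index lists for each of the k folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


def cross_validate(fit, score, X, y, k=5):
    """Train on k-1 folds, score on the held-out fold, return the k scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model,
                            [X[i] for i in test_idx],
                            [y[i] for i in test_idx]))
    return scores


# Toy "model" (hypothetical): always predict the most common training label.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def accuracy(model, X_test, y_test):
    return sum(1 for label in y_test if label == model) / len(y_test)


X_demo = list(range(10))
y_demo = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
fold_scores = cross_validate(fit_majority, accuracy, X_demo, y_demo, k=5)
print(fold_scores)  # -> [1.0, 0.5, 1.0, 0.5, 1.0]
```

Each sample is used for validation exactly once, and the k scores are then averaged. For classifiers, `cross_val_score` with an integer `cv` uses stratified folds rather than the plain contiguous split shown here, so fold membership will differ in practice.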

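The following code uses this technique to compare the performance of several scikit-learn classification algorithms on the labelled training data.

Random forest is one of the best choices: in the results at the end of this section it ties with kNN for the highest mean accuracy (0.98) and the smallest spread across folds.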

In [7]:
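# Initialize

import pandas as pd
import numpy as np
import pip  # needed to use the pip functions

# Show versions of all installed software to help debug incompatibilities.
# Note: pip.get_installed_distributions() works with the pip 9.x listed in the
# output below; it is no longer available from pip 10 onwards.

for dist in pip.get_installed_distributions(local_only=True):
    print(dist)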

xlrd 1.1.0
widgetsnbextension 3.2.1
wheel 0.31.0
webencodings 0.5
wcwidth 0.1.7
vincent 0.4.4
urllib3 1.23
traitlets 4.3.2
tornado 5.0.2
toolz 0.9.0
testpath 0.3.1
terminado 0.8.1
sympy 1.1.1
statsmodels 0.9.0
SQLAlchemy 1.2.8
six 1.11.0
simplegeneric 0.8.1
setuptools 39.2.0
Send2Trash 1.5.0
seaborn 0.8.1
scipy 1.1.0
scikit-learn 0.19.1
scikit-image 0.14.0
ruamel-yaml 0.15.40
requests 2.19.1
pyzmq 17.0.0
PyYAML 3.12
PyWavelets 0.5.2
pytz 2018.4
python-oauth2 1.0.1
python-editor 1.0.3
python-dateutil 2.7.3
PySocks 1.6.8
pyparsing 2.2.0
pyOpenSSL 18.0.0
Pygments 2.2.0
pycparser 2.18
pycosat 0.6.3
ptyprocess 0.5.2
protobuf 3.5.2
prompt-toolkit 1.0.15
pip 9.0.3
Pillow 5.1.0
pickleshare 0.7.4
pexpect 4.6.0
patsy 0.5.0
parso 0.2.1
pandocfilters 1.4.2
pandas 0.23.1
pamela 0.3.0
packaging 17.1
olefile 0.45.1
numpy 1.13.3
numexpr 2.6.5
numba 0.38.1
notebook 5.5.0
networkx 2.1
nbformat 4.4.0
nbconvert 5.3.1
mistune 0.8.3
matplotlib 2.2.2
MarkupSafe 1.0
llvmlite 0.23.0
kiwisolver 1.0.1
jupyterlab
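
On newer environments, where `pip.get_installed_distributions()` is no longer available, a similar listing can be produced with the standard library. The following is a minimal sketch assuming Python 3.8 or later; `installed_versions` is a hypothetical helper name, not part of the original notebook.

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_versions():
    """Return a sorted list of (name, version) pairs for installed packages."""
    pairs = set()
    for dist in metadata.distributions():
        try:
            name = dist.metadata["Name"]
            version = dist.version
        except Exception:
            # Skip distributions whose metadata cannot be read.
            continue
        if name:
            pairs.add((name, version))
    # Case-insensitive sort by package name.
    return sorted(pairs, key=lambda pair: pair[0].lower())


for name, version in installed_versions():
    print(name, version)
```

Unlike the `local_only=True` call above, this sketch lists every distribution visible on `sys.path`, so the output may include globally installed packages as well.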

## Read in the vendor training data

Read in the manually labelled vendor training data.

Format it and convert to two numpy arrays for input to the scikit-learn ML algorithm.

In [8]:
import sys  # used by the generic error handler below

try:
    df_label_vendors = pd.read_csv(
        "/home/jovyan/work/vulnmine/vulnmine_data/label_vendors.csv",
        # Skip malformed rows but warn about them. (pandas 1.3+ replaces
        # these two flags with on_bad_lines='warn'.)
        error_bad_lines=False,
        warn_bad_lines=True,
        quotechar='"',
        encoding='utf-8')
except IOError as e:
    print('\n\n***I/O error({0}): {1}\n\n'.format(
        e.errno, e.strerror))
    raise  # without the data, the following cells cannot run
except Exception:
    print(
        '\n\n***Unexpected error: {0}\n\n'.format(
            sys.exc_info()[0]))
    raise

# Number of records / columns

df_label_vendors.shape

(10110, 13)

In [9]:
# Format training data as "X" == features, "y" == target.
# The target value ('match') is the 1st column.
df_match_train1 = df_label_vendors[[
    'match',
    'fz_ptl_ratio', 'fz_ptl_tok_sort_ratio', 'fz_ratio',
    'fz_tok_set_ratio', 'fz_uwratio', 'ven_len', 'pu0_len']]

# Convert into 2 numpy arrays for the scikit-learn ML classification algorithms.
np_match_train1 = np.asarray(df_match_train1)
X, y = np_match_train1[:, 1:], np_match_train1[:, 0]

print(X.shape, y.shape)

(10110, 7) (10110,)


In [10]:
# Set up k-fold cross-validation to choose the best model.

# cross_val_score lives in sklearn.model_selection (the older
# sklearn.cross_validation module is deprecated).
from sklearn.model_selection import cross_val_score

from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB



for clf, clf_name in (
        (RidgeClassifier(alpha=1.0), "Ridge Classifier"),
        (RidgeClassifier(tol=1e-2, solver="lsqr"), "Ridge Classifier #2"),
        (Perceptron(max_iter=100), "Perceptron"),
        (PassiveAggressiveClassifier(max_iter=100), "Passive-Aggressive"),
        (KNeighborsClassifier(n_neighbors=10), "kNN"),
        (NearestCentroid(), "Nearest Centroid"),
        (RandomForestClassifier(n_estimators=100, class_weight="balanced"), "Random forest"),
        (SGDClassifier(alpha=.0001, max_iter=50, penalty="l2"), "SGD / SVM"),
        (MultinomialNB(alpha=.01), "Naive Bayes")):

    # 5-fold cross-validation; report mean accuracy +/- two standard deviations.
    scores = cross_val_score(clf, X, y, cv=5)
    print("%s, Accuracy: %0.2f (+/- %0.2f)" % (clf_name, scores.mean(), scores.std() * 2))

Ridge Classifier, Accuracy: 0.97 (+/- 0.02)
Ridge Classifier #2, Accuracy: 0.97 (+/- 0.02)
Perceptron, Accuracy: 0.93 (+/- 0.03)
Passive-Aggressive, Accuracy: 0.92 (+/- 0.07)
kNN, Accuracy: 0.98 (+/- 0.01)
Nearest Centroid, Accuracy: 0.90 (+/- 0.04)
Random forest, Accuracy: 0.98 (+/- 0.01)
SGD / SVM, Accuracy: 0.91 (+/- 0.10)
Naive Bayes, Accuracy: 0.78 (+/- 0.15)
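
To make the "(+/- ...)" figures above easier to read: each line reports the mean of the five fold accuracies and twice their standard deviation. The sketch below reproduces that arithmetic with the standard library. The fold scores are hypothetical values chosen for illustration, not the notebook's actual per-fold results, and `summarize` is an assumed helper name. `statistics.pstdev` is used because it matches the population standard deviation that NumPy's `std()` computes by default.

```python
from statistics import mean, pstdev


def summarize(scores):
    """Format fold scores as mean accuracy +/- two standard deviations."""
    return "Accuracy: %0.2f (+/- %0.2f)" % (mean(scores), 2 * pstdev(scores))


# Hypothetical per-fold accuracies for one classifier (illustration only).
fold_scores = [0.97, 0.98, 0.99, 0.98, 0.98]
print(summarize(fold_scores))  # -> Accuracy: 0.98 (+/- 0.01)
```

A small spread, as for kNN and random forest, suggests the score is stable across folds; a large one, as for Naive Bayes (+/- 0.15), suggests the estimate depends heavily on which samples land in the validation fold.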
