# Train vendor matching algorithm

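The ML classification algorithm has to be retrained from time to time, e.g. after a major scikit-learn release upgrade, because a model pickled under one scikit-learn version is not guaranteed to load or behave identically under another.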

The specific algorithm used is a [RandomForest classifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier). In initial testing using a k-fold cross-validation approach, this algorithm outperformed several other simple classification algorithms.
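
The original comparison is not recorded in this notebook. As a rough sketch only, a k-fold comparison of this kind could look like the following. The synthetic data, the choice of competing algorithms, and all names ending in `_demo` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the labelled vendor data (7 features, binary target).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=7, n_informative=4, random_state=0)

# Hypothetical candidates; the algorithms originally compared are not listed here.
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# 5-fold cross-validation: record the mean accuracy over the folds per candidate.
cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    cv_scores[name] = scores.mean()
    print("{0}: {1:.3f} (std: {2:.3f})".format(name, scores.mean(), scores.std()))
```

The candidate with the highest mean score (and an acceptable spread) would then be the one carried forward to tuning.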

The training proceeds as follows:

* First, the algorithm is tuned for typical data by using a grid search.
* Next, the ML classifier is run on the training data using the optimum parameters.
* Finally, the trained model is stored for future use.

In [3]:
# Initialize

import pandas as pd
import numpy as np
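import pkg_resources  # from setuptools; lists the installed distributions

# Show versions of all installed software to help debug incompatibilities.
# (pip.get_installed_distributions() was removed in pip 10, so the
# setuptools working set is used instead.)

for dist in pkg_resources.working_set:
    print(dist)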

zict 0.1.3
xlrd 1.1.0
widgetsnbextension 3.0.8
wheel 0.30.0
webencodings 0.5
wcwidth 0.1.7
vincent 0.4.4
urllib3 1.22
traitlets 4.3.2
tornado 4.5.2
toolz 0.8.2
testpath 0.3.1
terminado 0.8.1
tblib 1.3.2
sympy 1.0
statsmodels 0.8.0
SQLAlchemy 1.1.13
sortedcontainers 1.5.7
six 1.11.0
simplegeneric 0.8.1
setuptools 36.6.0
seaborn 0.7.1
scipy 0.19.1
scikit-learn 0.18.2
scikit-image 0.12.3
ruamel-yaml 0.11.14
requests 2.18.4
pyzmq 16.0.2
PyYAML 3.12
pytz 2017.3
python-oauth2 1.0.1
python-editor 1.0.3
python-dateutil 2.6.1
PySocks 1.6.7
pyparsing 2.2.0
pyOpenSSL 17.2.0
Pygments 2.2.0
pycparser 2.18
pycosat 0.6.2
ptyprocess 0.5.2
psutil 5.4.0
protobuf 3.5.0
prompt-toolkit 1.0.15
pip 9.0.1
Pillow 4.2.1
pickleshare 0.7.4
pexpect 4.3.0
patsy 0.4.1
partd 0.3.8
pandocfilters 1.4.1
pandas 0.19.2
pamela 0.3.0
olefile 0.44
numpy 1.12.1
numexpr 2.6.4
numba 0.31.0
notebook 5.2.2
networkx 2.0
nbformat 4.4.0
nbconvert 5.3.1
msgpack-python 0.4.8
mistune 0.8.3
matplotlib 2.0.2
MarkupSafe 1.0
locket 0.2.0
...

## Read in the vendor training data

Read in the manually labelled vendor training data.

Format it and convert to two numpy arrays for input to the scikit-learn ML algorithm.

In [4]:
import sys

try:
    # Note: for pandas >= 1.3, use on_bad_lines='warn' instead of
    # error_bad_lines / warn_bad_lines.
    df_label_vendors = pd.read_csv(
        "/home/jovyan/work/vulnmine/vulnmine_data/label_vendors.csv",
        error_bad_lines=False,
        warn_bad_lines=True,
        quotechar='"',
        encoding='utf-8')
except IOError as e:
    print('\n\n***I/O error({0}): {1}\n\n'.format(
        e.errno, e.strerror))
    # Re-raise: without the data, the following cells cannot run.
    raise

# except ValueError:
#    self.logger.critical('Could not convert data to an integer.')
except Exception:
    print(
        '\n\n***Unexpected error: {0}\n\n'.format(
            sys.exc_info()[0]))
    raise

# Number of records / columns

df_label_vendors.shape

(10110, 13)

In [5]:
# Print out some sample values

df_label_vendors.sample(5)

| Unnamed: 0.1 | Unnamed: 0 | fz_ptl_ratio | fz_ptl_tok_sort_ratio | fz_ratio | fz_tok_set_ratio | fz_uwratio | pub0_cln | publisher0 | ven_cln | vendor_X | match | ven_len | pu0_len |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6228 | 6228 | 75 | 100 | 38 | 24 | 90 | nprof community | nprof community | ni | ni | 0 | 2 | 15 |
| 8701 | 8701 | 50 | 100 | 38 | 24 | 90 | bot productions | bot productions | ti | ti | 0 | 2 | 15 |
| 2909 | 2909 | 50 | 100 | 42 | 27 | 90 | intellibreeze | intellibreeze software | ez | ez | 0 | 2 | 22 |
| 9555 | 9555 | 100 | 100 | 45 | 33 | 90 | ubisoft toronto | ubisoft toronto | ubi | ubi | 1 | 3 | 15 |
| 9396 | 9396 | 75 | 100 | 73 | 57 | 90 | atmel | atmel | tm | tm software | 0 | 11 | 5 |


In [6]:
# Check that all rows are labelled

# (Should return "False")

df_label_vendors['match'].isnull().any()

False
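
Beyond missing labels, it can also be worth confirming that every label is a valid 0/1 value and seeing how the two classes are balanced, since a strong skew is one reason to try the `class_weight` option during tuning. The sketch below uses a small hypothetical frame (`df_demo`) in place of the real data.

```python
import pandas as pd

# Hypothetical miniature of the labelled data; only the 'match' column matters here.
df_demo = pd.DataFrame({"match": [0, 0, 0, 1, 0, 1, 0, 0]})

# No missing labels, and every label is either 0 or 1.
assert not df_demo["match"].isnull().any()
assert df_demo["match"].isin([0, 1]).all()

# Share of each class in the labelled data.
class_share = df_demo["match"].value_counts(normalize=True)
print(class_share)
```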

In [7]:
# Format training data as "X" == features, "y" == target.
# The target value is the 1st column.
df_match_train1 = df_label_vendors[['match','fz_ptl_ratio', 'fz_ptl_tok_sort_ratio', 'fz_ratio', 'fz_tok_set_ratio', 'fz_uwratio','ven_len', 'pu0_len']]

# Convert into 2 numpy arrays for the scikit-learn ML classification algorithms.
np_match_train1 = np.asarray(df_match_train1)
X, y = np_match_train1[:, 1:], np_match_train1[:, 0]

print(X.shape, y.shape)

(10110, 7) (10110,)


## Use a grid search to tune the ML algorithm

Once the best algorithm has been determined, it should be tuned for optimal performance with the data.

This is done using a grid search. From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/grid_search.html):

> Parameters that are not directly learnt within estimators can be set by searching a parameter space for the best cross-validation score... Any parameter provided when constructing an estimator may be optimized in this manner.

Rather than do a compute-intensive search of the entire parameter space, a randomized search is done to find reasonably efficient parameters.

This code was modified from the [scikit-learn sample code](http://scikit-learn.org/stable/auto_examples/model_selection/randomized_search.html#sphx-glr-auto-examples-model-selection-randomized-search-py).

In [8]:
# Now find optimum parameters for the model using a randomized search

from time import time
from scipy.stats import randint as sp_randint

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# build a classifier
clf = RandomForestClassifier()

# Utility function to report best scores
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")
            

# specify parameters and distributions to sample from
param_dist = {"n_estimators": sp_randint(20, 100),
              "max_depth": [3, None],
              "max_features": sp_randint(1, 7),
              "min_samples_split": sp_randint(2, 7),
              "min_samples_leaf": sp_randint(1, 7),
              "bootstrap": [True, False],
              # 'balanced' replaces the deprecated 'auto' setting
              "class_weight": ['balanced', None],
              "criterion": ["gini", "entropy"]}

# run randomized search
n_iter_search = 40
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search)

start = time()
random_search.fit(X, y)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
      " parameter settings." % ((time() - start), n_iter_search))
report(random_search.cv_results_)

RandomizedSearchCV took 44.59 seconds for 40 candidates parameter settings.
Model with rank: 1
Mean validation score: 0.980 (std: 0.005)
Parameters: {'bootstrap': False, 'class_weight': None, 'criterion': 'entropy', 'max_depth': 3, 'max_features': 3, 'min_samples_leaf': 2, 'min_samples_split': 3, 'n_estimators': 58}

Model with rank: 1
Mean validation score: 0.980 (std: 0.005)
Parameters: {'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_depth': 3, 'max_features': 3, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 78}

Model with rank: 3
Mean validation score: 0.979 (std: 0.005)
Parameters: {'bootstrap': False, 'class_weight': None, 'criterion': 'entropy', 'max_depth': 3, 'max_features': 6, 'min_samples_leaf': 4, 'min_samples_split': 6, 'n_estimators': 73}
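
To put the 40 sampled candidates in perspective, the size of the full grid implied by `param_dist` can be counted by hand. The sketch below is plain arithmetic on the ranges used in the search above; the only assumption is that scipy's `randint(low, high)` excludes `high`, so each such range contributes `high - low` values.

```python
from math import prod

# Number of distinct values per parameter in param_dist.
# scipy's randint(low, high) draws from low..high-1, i.e. high - low values.
n_values = {
    "n_estimators": 100 - 20,
    "max_depth": 2,
    "max_features": 7 - 1,
    "min_samples_split": 7 - 2,
    "min_samples_leaf": 7 - 1,
    "bootstrap": 2,
    "class_weight": 2,
    "criterion": 2,
}

full_grid_size = prod(n_values.values())
n_iter_search = 40

print(full_grid_size)                  # 230400
print(n_iter_search / full_grid_size)  # fraction of the grid actually sampled
```

An exhaustive grid search would therefore have to evaluate 230,400 parameter combinations, each with cross-validation, against the 40 sampled here.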




## Run the ML classifier with optimum parameters on the training data

Based on the above, and ignoring default values, the optimum set of parameters would be something like the following:

    'bootstrap': True, 'min_samples_leaf': 4, 'n_estimators': 55, 'min_samples_split': 4, 'criterion': 'entropy', 'max_features': 3, 'max_depth': 3, 'class_weight': None

The RandomForest classifier is now trained on the training data to produce the model.
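
The parameters above were picked by hand from the top-ranked candidates. As an alternative, and only as a sketch, the winning settings can be read programmatically from the search object's `best_params_` attribute. The example runs a much smaller search on synthetic data; `X_demo`, `y_demo`, `param_dist_demo` and `search_demo` are illustrative names, not part of this notebook.

```python
from scipy.stats import randint as sp_randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for X, y.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=0)

# A reduced search space, for illustration only.
param_dist_demo = {
    "n_estimators": sp_randint(20, 40),
    "max_depth": [3, None],
    "min_samples_leaf": sp_randint(1, 7),
}

search_demo = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist_demo,
    n_iter=5, cv=3, random_state=0)
search_demo.fit(X_demo, y_demo)

# best_params_ holds the top-ranked settings; unpack them into a new classifier.
best_params = search_demo.best_params_
clf_demo = RandomForestClassifier(random_state=0, **best_params)
clf_demo.fit(X_demo, y_demo)
print(best_params)
```

With the default `refit=True`, `search_demo.best_estimator_` is already refitted on the full data, so the explicit re-construction is only needed when the parameters are to be adjusted by hand first.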

In [11]:
clf = RandomForestClassifier(
    bootstrap=True,
    min_samples_leaf=4,
    n_estimators=55,
    min_samples_split=4,
    criterion='entropy',
    max_features=3,
    max_depth=3,
    class_weight=None
)

# Train model on original training data
clf.fit(X, y)

# save model for future use

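# sklearn.externals.joblib was removed in later scikit-learn releases;
# prefer the standalone joblib package when it is available.
try:
    import joblib
except ImportError:
    from sklearn.externals import joblib

joblib.dump(clf, '/home/jovyan/work/vulnmine/vulnmine_data/vendor_classif_trained_Rdm_Forest.pkl.z')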

['/home/jovyan/work/vulnmine/vulnmine_data/vendor_classif_trained_Rdm_Forest.pkl.z']

In [12]:
# Test loading

clf = joblib.load('/home/jovyan/work/vulnmine/vulnmine_data/vendor_classif_trained_Rdm_Forest.pkl.z')
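
Loading without an error does not by itself show that the stored model behaves like the one that was trained. A slightly stronger check, sketched here on synthetic data and a temporary file, compares predictions before and after the round trip. It assumes the standalone `joblib` package is installed; the `_demo` names and the temporary path are illustrative only.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data and a small demo model.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=7, n_informative=4, random_state=0)
clf_demo = RandomForestClassifier(
    n_estimators=20, max_depth=3, random_state=0).fit(X_demo, y_demo)

# Write to a temporary directory rather than the real model path.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl.z")
    joblib.dump(clf_demo, model_path)
    clf_loaded = joblib.load(model_path)

# The reloaded model should reproduce the original predictions exactly.
same_predictions = np.array_equal(
    clf_demo.predict(X_demo), clf_loaded.predict(X_demo))
print(same_predictions)
```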