# About 
This notebook demonstrates the **MatrixNet** server wrapper provided by the __Reproducible Experiment Platform (REP)__ package. The service is available to CERN users.

To get access to MatrixNet:
* Go to https://yandex-apps.cern.ch/
* Log in with your CERN account
* Click Add token in the left panel
* Choose the MatrixNet service and click Create token
* Create a `~/.rep-matrixnet.config.json` file with the following content (the path to the config file can be changed in the constructor of the wrappers):

```
{
   "url": "https://ml.cern.yandex.net/v1",
   "token": "<your_token>"
}
```
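The configuration file described above can also be written from Python. The following is a minimal sketch, not part of REP: the helper name `write_matrixnet_config` is hypothetical, and the token stays a placeholder that you replace with your own.

```python
import json
import os


def write_matrixnet_config(token, path="~/.rep-matrixnet.config.json",
                           url="https://ml.cern.yandex.net/v1"):
    """Write the JSON config shown above (hypothetical helper, not part of REP)."""
    full_path = os.path.expanduser(path)
    with open(full_path, "w") as config_file:
        json.dump({"url": url, "token": token}, config_file, indent=3)
    # The token is a credential, so restrict the file to the current user (POSIX systems).
    os.chmod(full_path, 0o600)
    return full_path
```

Calling `write_matrixnet_config("<your_token>")` writes the file to the default path; pass `path=...` if you plan to point the wrapper at a different config file.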


### In this notebook we show how to:
* train a classifier on the server
* build predictions
* measure quality

# Loading data

### Download the particle identification data set from UCI

In [1]:
!cd toy_datasets; wget -O MiniBooNE_PID.txt -nc --no-check-certificate https://archive.ics.uci.edu/ml/machine-learning-databases/00199/MiniBooNE_PID.txt

File `MiniBooNE_PID.txt' already there; not retrieving.


In [2]:
import numpy, pandas
from rep.utils import train_test_split
from sklearn.metrics import roc_auc_score

data = pandas.read_csv('toy_datasets/MiniBooNE_PID.txt', sep=r'\s+', skiprows=[0], header=None, engine='python')
labels = pandas.read_csv('toy_datasets/MiniBooNE_PID.txt', sep=' ', nrows=1, header=None)
labels = [1] * labels[1].values[0] + [0] * labels[2].values[0]
data.columns = ['feature_{}'.format(key) for key in data.columns]
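The label construction above relies on the layout of `MiniBooNE_PID.txt`: the first line holds two counts, and the loading code treats the first count as the number of leading rows labeled 1 and the second as the number of following rows labeled 0. A small sketch of the same idea on a made-up three-row file (all numbers are invented for illustration):

```python
# Toy stand-in for the file: a header line with two counts, then one row per event.
toy_file = """ 2 1
  0.1  0.2
  0.3  0.4
  0.5  0.6
"""

lines = toy_file.strip().splitlines()
n_first, n_second = (int(count) for count in lines[0].split())
toy_rows = [[float(value) for value in line.split()] for line in lines[1:]]

# The first n_first rows get label 1, the remaining n_second rows get label 0.
toy_file_labels = [1] * n_first + [0] * n_second

# Sanity check: one label per data row.
assert len(toy_file_labels) == len(toy_rows)
```

The same sanity check, `len(labels) == len(data)`, is a cheap way to confirm that the real file was parsed as expected.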

### First rows of our data

In [3]:
data.head()

,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,...,feature_40,feature_41,feature_42,feature_43,feature_44,feature_45,feature_46,feature_47,feature_48,feature_49
0,2.59413,0.468803,20.6916,0.322648,0.009682,0.374393,0.803479,0.896592,3.59665,0.249282,...,101.174,-31.373,0.442259,5.86453,0.0,0.090519,0.176909,0.457585,0.071769,0.245996
1,3.86388,0.645781,18.1375,0.233529,0.030733,0.361239,1.06974,0.878714,3.59243,0.200793,...,186.516,45.9597,-0.478507,6.11126,0.001182,0.0918,-0.465572,0.935523,0.333613,0.230621
2,3.38584,1.19714,36.0807,0.200866,0.017341,0.260841,1.10895,0.884405,3.43159,0.177167,...,129.931,-11.5608,-0.297008,8.27204,0.003854,0.141721,-0.210559,1.01345,0.255512,0.180901
3,4.28524,0.510155,674.201,0.281923,0.009174,0.0,0.998822,0.82339,3.16382,0.171678,...,163.978,-18.4586,0.453886,2.48112,0.0,0.180938,0.407968,4.34127,0.473081,0.25899
4,5.93662,0.832993,59.8796,0.232853,0.025066,0.233556,1.37004,0.787424,3.66546,0.174862,...,229.555,42.96,-0.975752,2.66109,0.0,0.170836,-0.814403,4.67949,1.92499,0.253893


### Splitting into train and test

In [4]:
# Get train and test data
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.5)
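To make the effect of `train_size=0.5` concrete, here is a rough sketch of a random half-and-half split on toy data. It is not the implementation of `rep.utils.train_test_split`; the helper `simple_split` and the toy values are hypothetical and only mirror the order of the returned values used above.

```python
import random


def simple_split(rows, labels, train_size=0.5, seed=0):
    """Illustrative random split; NOT the implementation of rep.utils.train_test_split."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_train = int(round(train_size * len(rows)))
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    return ([rows[i] for i in train_idx], [rows[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])


split_rows = [[float(i)] for i in range(10)]
split_labels = [1] * 4 + [0] * 6
train_rows, test_rows, train_y, test_y = simple_split(split_rows, split_labels)
```

Every row ends up in exactly one of the two parts, and each label travels with its row.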

# Variables used in training

In [5]:
variables = list(data.columns)[:10]

# MatrixNet wrapper

In [6]:
from rep.estimators import MatrixNetClassifier

In [7]:
print MatrixNetClassifier.__doc__

MatrixNet classification model. 

    This is a wrapper around **MatrixNet (specific BDT)** technology developed at **Yandex**,
    which is available for CERN people using authorization.
    Trained estimator is downloaded and stored at your computer, so you can use it at any time.

    :param train_features: features used in training
    :type train_features: list[str] or None
    :param api_config_file: path to the file with remote api configuration in the json format::

                {"url": "https://ml.cern.yandex.net/v1", "token": "<your_token>"}

    :type api_config_file: str

    :param int iterations: number of constructed trees (default=100)
    :param float regularization: regularization number (default=0.01)
    :param intervals: number of bins for features discretization or dict with borders
     list for each feature for its discretisation (default=8)
    :type intervals: int or dict(str, list)
    :param int max_features_per_iteration: depth (default=6, supports 1 <= 

In [8]:
# configuring classifier (take configuration from $HOME/.rep-matrixnet.config.json)
mn = MatrixNetClassifier(features=variables, iterations=300, sync=False)
# training classifier
mn.fit(train_data, train_labels)
# note: we set sync=False, so training is asynchronous:
# the dataset has been sent to the server, and you can keep working in Python while the classifier is trained remotely
print('asynchronous training started')

asynchronous training started


In [9]:
import time
# Check status of training
print 'Is training complete?', mn.training_status()
time.sleep(15)
# get number of iterations
print 'Number of iterations which are done', mn.get_iterations
# Synchronize (wait until the training is complete)
mn.synchronize()
print 'Is training complete?', mn.training_status()

Is training complete? False
Number of iterations which are done None
Is training complete? True


**Note**: if training fails, call
`mn.resubmit()`
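`mn.synchronize()` blocks until training is complete. If you would rather impose your own time limit, the status check can be wrapped in a polling loop. The sketch below is hypothetical: `wait_for_training` is not part of REP, and it assumes that `training_status()` returns `True` once training is complete, as the output above suggests. A stand-in object replaces the real estimator so the example runs without server access.

```python
import time


def wait_for_training(estimator, poll_seconds=15, timeout_seconds=3600):
    """Poll estimator.training_status() until it returns True or the timeout expires.

    Hypothetical helper; assumes training_status() returns True when training is complete.
    """
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        if estimator.training_status():
            return True
        time.sleep(poll_seconds)
    return False


class FakeEstimator(object):
    """Stand-in for a remote estimator: reports completion on the third status check."""

    def __init__(self):
        self.checks = 0

    def training_status(self):
        self.checks += 1
        return self.checks >= 3


fake = FakeEstimator()
finished = wait_for_training(fake, poll_seconds=0.01, timeout_seconds=5)
```

With the real classifier the call would be `wait_for_training(mn)`; a `False` result means the time limit passed before the server reported completion.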

### Predict probabilities and estimate quality

In [10]:
# predict probabilities for each class
prob = mn.predict_proba(test_data)
print prob

[[ 0.98460066  0.01539934]
 [ 0.30781675  0.69218325]
 [ 0.98838825  0.01161175]
 ..., 
 [ 0.53682432  0.46317568]
 [ 0.97292185  0.02707815]
 [ 0.51366281  0.48633719]]


In [11]:
print 'AUC', roc_auc_score(test_labels, prob[:, 1])

AUC 0.955400767638
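For intuition about the number above: ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way on toy data. It is for illustration only; the notebook itself uses `sklearn.metrics.roc_auc_score`, and the quadratic loop here would be too slow for the full test set.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    correct = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                correct += 1.0
            elif pos_score == neg_score:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


auc_labels = [0, 0, 1, 1]
auc_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(auc_labels, auc_scores)  # 3 of 4 pairs are ordered correctly -> 0.75
```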


### Predictions of classes

In [12]:
mn.predict(test_data)

array([0, 1, 0, ..., 0, 0, 0])
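The class predictions above are consistent with picking the more probable class in each row of `prob`. The sketch below checks this on the six probability rows that were printed earlier. The argmax rule is an assumption inferred from the printed outputs; the notebook does not state how the wrapper derives classes.

```python
# The six probability rows printed above (first three and last three).
shown_prob = [
    [0.98460066, 0.01539934],
    [0.30781675, 0.69218325],
    [0.98838825, 0.01161175],
    [0.53682432, 0.46317568],
    [0.97292185, 0.02707815],
    [0.51366281, 0.48633719],
]

# Assumption: the predicted class is the column index with the largest probability.
shown_classes = [row.index(max(row)) for row in shown_prob]
```

The result, `[0, 1, 0, 0, 0, 0]`, matches the first three and last three entries of the printed array.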

### Feature importances: three different measures

In [13]:
mn.get_feature_importances()

,effect,efficiency,information
feature_0,0.83389,0.876547,0.951335
feature_1,0.66448,0.698462,0.951348
feature_2,1.0,1.0,1.0
feature_3,0.412722,0.433819,0.951368
feature_4,0.071242,0.087024,0.818652
feature_5,0.173965,0.196312,0.886167
feature_6,0.092704,0.097442,0.951382
feature_7,0.091081,0.095742,0.951326
feature_8,0.173274,0.182133,0.951356
feature_9,0.068185,0.07167,0.951377
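A common next step is to rank the features by one of the measures. The sketch below does this for the `effect` column using the values copied from the table above; it works on a plain dictionary, so it makes no assumption about the type of object that `get_feature_importances()` returns.

```python
# Importance values copied from the table above: (effect, efficiency, information).
importances = {
    "feature_0": (0.83389, 0.876547, 0.951335),
    "feature_1": (0.66448, 0.698462, 0.951348),
    "feature_2": (1.0, 1.0, 1.0),
    "feature_3": (0.412722, 0.433819, 0.951368),
    "feature_4": (0.071242, 0.087024, 0.818652),
    "feature_5": (0.173965, 0.196312, 0.886167),
    "feature_6": (0.092704, 0.097442, 0.951382),
    "feature_7": (0.091081, 0.095742, 0.951326),
    "feature_8": (0.173274, 0.182133, 0.951356),
    "feature_9": (0.068185, 0.07167, 0.951377),
}

# Sort feature names by the first measure (effect), largest first.
ranked_by_effect = sorted(importances, key=lambda name: importances[name][0], reverse=True)
top_three = ranked_by_effect[:3]
```

Here `top_three` is `['feature_2', 'feature_0', 'feature_1']`, and `feature_9` ranks last by `effect`.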
