# !Link to challenge!

# https://www.kaggle.com/t/b3ced76a60b94572a90740756f778fc8

### Metric

For binary classification with a true label y $\in \{0,1\}$ and a probability estimate p = $\operatorname{Pr}(y = 1)$, the log loss per sample is the negative log-likelihood of the classifier given the true label:
$$
L_{\log}(y, p) = -\log \operatorname{Pr}(y|p) = -(y \log (p) + (1 - y) \log (1 - p))$
$$

This extends to the multiclass case as follows. Let the true labels for a set of samples be encoded as a 1-of-K binary indicator matrix Y, i.e., $y_{i,k} = 1$ if sample i has label k taken from a set of K labels. Let P be a matrix of probability estimates, with $p_{i,k} = \operatorname{Pr}(t_{i,k} = 1)$. Then the log loss of the whole set is

$$
L_{\log}(Y, P) = -\log \operatorname{Pr}(Y|P) = - \frac{1}{N} \sum_{i=0}^{N-1} \sum_{k=0}^{K-1} y_{i,k} \log p_{i,k}
$$

# Grading

#### Firstly, to get any mark, you must beat medium baseline score

Your grade after challenge ends will be calculated as this:
$$
Grade = \frac{score - mid\_baseline\_score}{\#1\_score - mid\_baseline\_score} * 10
$$

where score will be taken from private part results.

## About

In this notebook we prepare a simple solution.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

### Read training and test files

In [2]:
data = pd.read_csv('training.csv')
test = pd.read_csv('test.csv')

In [3]:
data.head()

Unnamed: 0,TrackP,TrackNDoFSubdetector2,BremDLLbeElectron,MuonLooseFlag,FlagSpd,SpdE,EcalDLLbeElectron,DLLmuon,RICHpFlagElectron,EcalDLLbeMuon,...,TrackNDoF,RICHpFlagMuon,RICH_DLLbeKaon,RICH_DLLbeElectron,HcalE,MuonFlag,FlagMuon,PrsE,RICH_DLLbeMuon,RICH_DLLbeProton
0,74791.156263,15.0,0.232275,1.0,1.0,3.2,-2.505719,6.604153,1.0,1.92996,...,28.0,1.0,-7.2133,-0.2802,5586.589846,1.0,1.0,10.422315,-2.081143e-07,-24.8244
1,2738.489989,15.0,-0.357748,0.0,1.0,3.2,1.864351,0.263651,1.0,-2.061959,...,32.0,1.0,-0.324317,1.707283,-7e-06,0.0,1.0,43.334935,2.771583,-0.648017
2,2161.409908,17.0,-999.0,0.0,0.0,-999.0,-999.0,-999.0,0.0,-999.0,...,27.0,0.0,-999.0,-999.0,-999.0,0.0,0.0,-999.0,-999.0,-999.0
3,15277.73049,20.0,-0.638984,0.0,1.0,3.2,-2.533918,-8.724949,1.0,-3.253981,...,36.0,1.0,-35.202221,-14.742319,4482.803707,0.0,1.0,2.194175,-3.070819,-29.291519
4,7563.700195,19.0,-0.638962,0.0,1.0,3.2,-2.087146,-7.060422,1.0,-0.995816,...,33.0,1.0,25.084287,-10.272412,5107.55468,0.0,1.0,1.5e-05,-5.373712,23.653087


In [4]:
data[data.DLLmuon < -998].Label.head()

2         Ghost
111       Ghost
125        Kaon
139      Proton
268    Electron
Name: Label, dtype: object

In [5]:
test.head()

Unnamed: 0,TrackP,TrackNDoFSubdetector2,BremDLLbeElectron,MuonLooseFlag,FlagSpd,SpdE,EcalDLLbeElectron,DLLmuon,RICHpFlagElectron,EcalDLLbeMuon,...,RICHpFlagMuon,RICH_DLLbeKaon,RICH_DLLbeElectron,HcalE,MuonFlag,FlagMuon,PrsE,RICH_DLLbeMuon,RICH_DLLbeProton,ID
0,55086.199233,18.0,-0.438763,0.0,1.0,3.2,-1.843821,-4.579244,1.0,-1.732886,...,1.0,18.674086,-1.355015,24510.990244,0.0,1.0,9.325265,-0.250015,35.408585,0
1,3393.820071,17.0,-0.554341,0.0,1.0,0.0,-0.883237,-6.203035,1.0,-0.097206,...,1.0,16.536804,-17.601196,778.675303,0.0,1.0,-6e-06,-6.646096,14.011904,1
2,18341.359361,12.0,-0.554339,0.0,1.0,0.0,-2.653786,-3.922639,1.0,0.936484,...,1.0,-1.306109,-4.536409,7915.21242,0.0,1.0,1.371346,-2.132609,-5.617409,2
3,27486.710933,7.0,-0.492411,1.0,1.0,3.2,-999.0,2.034453,1.0,-999.0,...,1.0,-4.222793,3.149207,-999.0,1.0,1.0,61.985428,0.946207,-8.657193,3
4,6842.249996,16.0,0.098706,0.0,1.0,3.2,2.644499,-1.471364,1.0,-2.90947,...,1.0,-3.425113,23.147387,-1.3e-05,0.0,1.0,2.468453,2.614987,-5.713513,4


In [23]:
data.shape

(1200000, 50)

In [27]:
data.loc[:10]

Unnamed: 0,TrackP,TrackNDoFSubdetector2,BremDLLbeElectron,MuonLooseFlag,FlagSpd,SpdE,EcalDLLbeElectron,DLLmuon,RICHpFlagElectron,EcalDLLbeMuon,...,TrackNDoF,RICHpFlagMuon,RICH_DLLbeKaon,RICH_DLLbeElectron,HcalE,MuonFlag,FlagMuon,PrsE,RICH_DLLbeMuon,RICH_DLLbeProton
0,74791.156263,15.0,0.232275,1.0,1.0,3.2,-2.505719,6.604153,1.0,1.92996,...,28.0,1.0,-7.2133,-0.2802,5586.589846,1.0,1.0,10.422315,-2.081143e-07,-24.8244
1,2738.489989,15.0,-0.357748,0.0,1.0,3.2,1.864351,0.263651,1.0,-2.061959,...,32.0,1.0,-0.324317,1.707283,-7e-06,0.0,1.0,43.334935,2.771583,-0.648017
2,2161.409908,17.0,-999.0,0.0,0.0,-999.0,-999.0,-999.0,0.0,-999.0,...,27.0,0.0,-999.0,-999.0,-999.0,0.0,0.0,-999.0,-999.0,-999.0
3,15277.73049,20.0,-0.638984,0.0,1.0,3.2,-2.533918,-8.724949,1.0,-3.253981,...,36.0,1.0,-35.202221,-14.742319,4482.803707,0.0,1.0,2.194175,-3.070819,-29.291519
4,7563.700195,19.0,-0.638962,0.0,1.0,3.2,-2.087146,-7.060422,1.0,-0.995816,...,33.0,1.0,25.084287,-10.272412,5107.55468,0.0,1.0,1.5e-05,-5.373712,23.653087
5,62641.62109,17.0,0.976355,0.0,1.0,3.2,-2.649216,-3.767491,1.0,1.282086,...,40.0,1.0,29.475203,-3.059098,20529.441404,0.0,1.0,2.468433,-1.194598,1.010202
6,18872.81057,14.0,2.345886,0.0,1.0,3.2,-3.027858,-5.173245,1.0,0.750181,...,26.0,1.0,26.711504,-3.326296,19248.388672,0.0,1.0,2.742722,-1.859796,13.021704
7,1993.550048,3.0,0.170659,0.0,1.0,0.0,1.864349,0.101,1.0,0.382705,...,13.0,0.0,6e-06,-37.474493,694.30664,0.0,1.0,-1.5e-05,-0.244894,6e-06
8,90635.296871,8.0,-999.0,0.0,1.0,3.2,-999.0,2e-06,1.0,-999.0,...,22.0,1.0,-1.552902,0.561498,-999.0,0.0,1.0,119.85675,0.08879832,-3.197502
9,11633.669941,16.0,0.976349,0.0,1.0,0.0,-2.479154,-0.631769,1.0,0.449661,...,28.0,1.0,9.489098,-0.643303,913.806574,0.0,1.0,22.764572,-0.2466028,9.954897


In [33]:
data.loc[(data.BremDLLbeElectron < -998 np.logical_or data.SpdE < -998 np.logical_or data.EcalDLLbeElectron < -998 np.logical_or data.DLLmuon< -998)]

SyntaxError: invalid syntax (<ipython-input-33-b412d1fa5ca6>, line 1)

### Look at the labels set

In [7]:
set(data.Label)

{'Electron', 'Ghost', 'Kaon', 'Muon', 'Pion', 'Proton'}

### Define training features

Exclude `Label` from the features set

In [8]:
features = list(set(data.columns) - {'Label'})
features

['TrackNDoFSubdetector1',
 'RICH_DLLbeProton',
 'TrackNDoFSubdetector2',
 'DLLmuon',
 'FlagRICH1',
 'FlagSpd',
 'FlagRICH2',
 'DLLelectron',
 'GhostProbability',
 'TrackP',
 'RICHpFlagProton',
 'TrackDistanceToZ',
 'HcalDLLbeElectron',
 'TrackQualityPerNDoF',
 'FlagPrs',
 'TrackNDoF',
 'MuonLLbeMuon',
 'FlagBrem',
 'EcalDLLbeElectron',
 'EcalDLLbeMuon',
 'FlagHcal',
 'EcalShowerLongitudinalParameter',
 'RICHpFlagKaon',
 'MuonLLbeBCK',
 'Calo3dFitQuality',
 'FlagMuon',
 'BremDLLbeElectron',
 'MuonFlag',
 'MuonLooseFlag',
 'DLLproton',
 'RICH_DLLbeMuon',
 'RICH_DLLbeKaon',
 'PrsDLLbeElectron',
 'DLLkaon',
 'EcalE',
 'FlagEcal',
 'RICHpFlagElectron',
 'Calo2dFitQuality',
 'HcalE',
 'TrackQualitySubdetector1',
 'PrsE',
 'HcalDLLbeMuon',
 'RICH_DLLbeElectron',
 'RICH_DLLbeBCK',
 'TrackPt',
 'RICHpFlagMuon',
 'TrackQualitySubdetector2',
 'RICHpFlagPion',
 'SpdE']

### Divide training data into 2 parts

In [10]:
training_data, validation_data = train_test_split(data, random_state=11, train_size=0.10, test_size = 0.9)

In [11]:
len(training_data), len(validation_data)

(120000, 1080000)

### Simple logistic regression forest from `sklearn` training

train multiclassification model

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

In [13]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train = scaler.fit_transform(training_data[features])

In [14]:
%%time

clf = LogisticRegression(penalty='l2', n_jobs=-1, solver='saga', multi_class='multinomial', random_state=42)
param_grid = {'C': [0.1, 1]}

gscv = GridSearchCV(clf, param_grid, scoring='neg_log_loss', cv=3, n_jobs=-1, verbose=1)
gscv.fit(X_train, training_data.Label)

Fitting 3 folds for each of 2 candidates, totalling 6 fits


[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:   46.7s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:   46.7s finished


CPU times: user 27.6 s, sys: 285 ms, total: 27.8 s
Wall time: 1min 13s




In [15]:
gscv.cv_results_

{'mean_fit_time': array([ 27.65927505,  20.48001496]),
 'mean_score_time': array([ 0.24192667,  0.16194979]),
 'mean_test_score': array([-0.87774634, -0.87361161]),
 'mean_train_score': array([-0.87553757, -0.87138096]),
 'param_C': masked_array(data = [0.1 1],
              mask = [False False],
        fill_value = ?),
 'params': [{'C': 0.1}, {'C': 1}],
 'rank_test_score': array([2, 1], dtype=int32),
 'split0_test_score': array([-0.90848259, -0.90595279]),
 'split0_train_score': array([-0.90506437, -0.90247521]),
 'split1_test_score': array([-0.81828495, -0.81098105]),
 'split1_train_score': array([-0.81632204, -0.80901035]),
 'split2_test_score': array([-0.90647142, -0.90390092]),
 'split2_train_score': array([-0.9052263 , -0.90265731]),
 'std_fit_time': array([ 0.0621607 ,  4.89105325]),
 'std_score_time': array([ 0.00608489,  0.05237692]),
 'std_test_score': array([ 0.04205356,  0.04429441]),
 'std_train_score': array([ 0.04187175,  0.04410274])}

Train best model:

In [16]:
c = 1
clf = LogisticRegression(penalty='l2', C=c, n_jobs=-1, solver='saga', multi_class='multinomial')
clf.fit(X_train, training_data.Label)

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=-1, penalty='l2', random_state=None, solver='saga',
          tol=0.0001, verbose=0, warm_start=False)

### Evaluate predictions on the validation sample

In [17]:
# predict each track
X_val = scaler.fit_transform(validation_data[features])
proba = clf.predict_proba(X_val)

### Log loss on the cross validation sample

In [18]:
log_loss(validation_data.Label, proba)

0.87079445046779702

## Prepare submission to kaggle

In [53]:
# predict test sample
X_test = scaler.fit_transform(test[features])
kaggle_proba = clf.predict_proba(X_test)
kaggle_ids = test.ID

In [56]:
from IPython.display import FileLink

def create_solution(ids, proba, names, filename='baseline.csv'):
    """saves predictions to file and provides a link for downloading """
    solution = pd.DataFrame({'ID': ids})
    
    for name in ['Ghost', 'Electron', 'Muon', 'Pion', 'Kaon', 'Proton']:
        solution[name] = proba[:, np.where(names == name)[0]]
    
    solution.to_csv('{}'.format(filename), index=False)
    return FileLink('{}'.format(filename))
    
create_solution(kaggle_ids, kaggle_proba, clf.classes_)