# Introduction


## Challenge Large Scale Machine Learning

### Authors: 
#### Pavlo Mozharovskyi (pavlo.mozharovskyi@telecom-paris.fr), Umut Şimşekli (umut.simsekli@telecom-paris.fr)


### Fusion of algorithms for face recognition

The increasingly ubiquitous presence of biometric solutions, and of face recognition in particular, in everyday life requires adapting them to practical scenarios. When several solutions are available and a global decision has to be made, any single solution can be far less efficient than a combination tailored to the complexity of each image.

In this challenge, the goal is to build a fusion of algorithms that yields the best-suited solution for comparing a pair of images. This fusion will be driven by qualities computed on each image.

Comparing two images is done in two steps. First, a vector of features is computed for each image. Second, a simple function produces a vector of scores for the pair of images. The goal is to create a function that compares a pair of images based on the information mentioned above and decides whether the two images belong to the same person.

You are provided with a labeled training set and a test set without labels. You should submit a .csv file with a predicted score for each entry of this test set.

# The properties of the dataset:


### Training data: 


The training set consists of two files, **xtrain_challenge.csv** and **ytrain_challenge.csv**.

File **xtrain_challenge.csv** contains one observation per row, with the following entries based on a pair of images:
 * columns 1-13 - 13 qualities on first image;
 * columns 14-26 - 13 qualities on second image;
 * columns 27-37 - 11 matching scores between the two images.
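
To illustrate this layout, the following minimal sketch splits an array with 37 columns into the three blocks. The array `x` holds randomly generated placeholder values, not challenge data, and the names `qualities_first`, `qualities_second` and `match_scores` are chosen here purely for illustration.

```python
import numpy as np

# Placeholder for a few rows of xtrain_challenge.csv (random values, 37 columns).
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 37))

# The description counts columns from 1; NumPy slices count from 0.
qualities_first = x[:, 0:13]    # columns 1-13: qualities of the first image
qualities_second = x[:, 13:26]  # columns 14-26: qualities of the second image
match_scores = x[:, 26:37]      # columns 27-37: matching scores of the pair

print(qualities_first.shape, qualities_second.shape, match_scores.shape)
```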

File **ytrain_challenge.csv** contains one entry per observation in **xtrain_challenge.csv**, in the same order; the entry is '1' if the pair of images belongs to the same person and '0' otherwise.

For each of these 38 variables (37 inputs and 1 label), there are in total 9,800,713 training observations.

### Test data:

File **xtest_challenge.csv** has the same structure as file **xtrain_challenge.csv**.

There are in total 3,768,311 test observations.
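
These sizes matter when loading the files. As a rough estimate, and assuming the data is held as a dense array of 64-bit floats (the default dtype returned by `np.loadtxt`), the training inputs take about 2.7 GiB of memory. The helper `dense_size_gib` below is a hypothetical name introduced for this calculation.

```python
n_train = 9_800_713  # training observations
n_test = 3_768_311   # test observations
n_features = 37      # columns in xtrain_challenge.csv and xtest_challenge.csv

def dense_size_gib(n_rows, n_cols, bytes_per_value):
    """Size in GiB of a dense n_rows x n_cols array."""
    return n_rows * n_cols * bytes_per_value / 2**30

train_f64 = dense_size_gib(n_train, n_features, 8)  # float64
train_f32 = dense_size_gib(n_train, n_features, 4)  # float32 halves the footprint
test_f64 = dense_size_gib(n_test, n_features, 8)

print(f"train float64: {train_f64:.2f} GiB")
print(f"train float32: {train_f32:.2f} GiB")
print(f"test  float64: {test_f64:.2f} GiB")
```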

## The performance criterion

Consider the problem of the supervised classification with two classes labeled '0' and '1'. Many methods for supervised classification assign a new observation $\boldsymbol{x}$ to a class using the following rule:

\begin{align}
g(\boldsymbol{x}) = \begin{cases}1 & \text{ if } f(\boldsymbol{x}) \ge t, \\ 0 & \text{ otherwise}.\end{cases}
\end{align}

The threshold $t$ is then chosen according to specific needs, managing the trade-off between the true positive rate (TPR) and the false positive rate (FPR) depending on the cost of the corresponding mistakes.
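
The decision rule $g$ can be sketched in a few lines. The function name `classify` and the score values are illustrative assumptions; the scores stand in for the values $f(\boldsymbol{x})$ produced by any classifier.

```python
import numpy as np

def classify(scores, t):
    """Apply g(x) = 1 if f(x) >= t, else 0, to an array of scores f(x)."""
    return (np.asarray(scores) >= t).astype(int)

# Hypothetical scores f(x) for five observations.
scores = np.array([0.05, 0.40, 0.40, 0.90, 0.99])

labels_loose = classify(scores, t=0.40)   # lower threshold: more predicted positives
labels_strict = classify(scores, t=0.95)  # higher threshold: fewer predicted positives
print(labels_loose, labels_strict)
```

Raising $t$ can only turn predicted positives into predicted negatives, which is why both TPR and FPR decrease (or stay equal) as $t$ grows.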

Here, the performance criterion is **TPR for the value of FPR = $10^{-4}$**; in other words, one needs to maximize the value of the receiver operating characteristic (ROC) at the point FPR = $10^{-4}$. The submitted solution file should thus contain the score for each observation.
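
One possible way to compute this criterion is sketched below. It is not the official scoring code: the function name `tpr_at_fpr` is an assumption, both classes are assumed to be present, and tied scores are ordered arbitrarily rather than handled specially. The toy labels and scores are made up for the example.

```python
import numpy as np

def tpr_at_fpr(y_true, scores, max_fpr=1e-4):
    """Largest TPR reachable while keeping FPR <= max_fpr."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    # Visit observations from the highest score to the lowest.
    order = np.argsort(-scores, kind="stable")
    y_sorted = y_true[order]
    # Running counts of true and false positives as the threshold is lowered.
    tp = np.cumsum(y_sorted == 1)
    fp = np.cumsum(y_sorted == 0)
    n_pos, n_neg = tp[-1], fp[-1]
    within_budget = fp / n_neg <= max_fpr
    if not within_budget.any():
        return 0.0  # the top-scored observation is already a false positive
    return float(tp[within_budget][-1] / n_pos)

# Toy data: 4 positives and 4 negatives.
y_toy = np.array([0, 1, 0, 0, 0, 1, 1, 1])
s_toy = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])

tpr_strict = tpr_at_fpr(y_toy, s_toy)             # no false positive allowed here
tpr_full = tpr_at_fpr(y_toy, s_toy, max_fpr=1.0)  # every observation accepted
print(tpr_strict, tpr_full)
```

With fewer than 10,000 negatives, as in this toy example, a budget of FPR $\le 10^{-4}$ allows no false positive at all, so the criterion reduces to the share of positives ranked above the highest-scoring negative.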

# Training Data

Training data, input (file **xtrain_challenge.csv**): 

https://www.dropbox.com/s/618rb0wev4q84kj/xtrain_challenge.csv

Training data, output (file **ytrain_challenge.csv**): 

https://www.dropbox.com/s/oph3w9sn3nmu376/ytrain_challenge.csv


# Test Data 

Test data, input (file **xtest_challenge.csv**): 

https://www.dropbox.com/s/fezxb6lrzass556/xtest_challenge.csv

# Example submission

In [1]:
%matplotlib inline
import numpy as np
import sys
import os
import matplotlib.pyplot as plt
import math
from sklearn.linear_model import LogisticRegression

## Load and investigate the data

In [4]:
# Load training data
nrows_train = 49
nrows_valid = 51
xtrain = np.loadtxt('../../../data_challenge/xtrain_challenge.csv', delimiter=',', skiprows = 1, max_rows = nrows_train + nrows_valid)
ytrain = np.loadtxt('../../../data_challenge/ytrain_challenge.csv', delimiter=',', skiprows = 1, max_rows = nrows_train + nrows_valid)
ytrain = np.array(ytrain).reshape(nrows_train + nrows_valid)
# Check the number of observations and properties
print(xtrain[:3,])
print(ytrain[:10])
print(xtrain.shape)
print(ytrain.shape)

[[ 1.00000e+00  0.00000e+00  0.00000e+00 -6.24000e+00 -5.27000e+00
  -1.86000e+00  6.30000e-01  3.27000e+00  8.90000e-01  3.50980e+02
   6.48600e+01  0.00000e+00  1.00000e+00  1.00000e+00  0.00000e+00
   0.00000e+00 -9.44000e+00 -1.25300e+01  8.40000e-01  2.59000e+00
   1.53000e+00  1.03000e+00  2.76020e+02  5.80200e+01  0.00000e+00
   1.00000e+00  2.40594e+03  1.98109e+03  2.67784e+03  2.47044e+03
   1.57939e+03  2.18579e+03  2.11877e+03  2.58099e+03  2.49804e+03
   3.18058e+03  2.71829e+03]
 [ 1.00000e+00  0.00000e+00  0.00000e+00 -4.20000e-01 -4.50000e+00
  -4.31000e+00  1.61000e+00  1.72000e+00  2.76000e+00  3.47060e+02
   2.88500e+01  0.00000e+00  9.90000e-01  1.00000e+00  0.00000e+00
   0.00000e+00 -3.50000e-01 -1.89700e+01 -1.31000e+00  1.23000e+00
   8.40000e-01  3.40000e-01  2.63840e+02  3.05000e+01  0.00000e+00
   1.00000e+00  3.24137e+03  2.01524e+03  3.98719e+03  3.34353e+03
   2.89469e+03  2.94734e+03  2.68942e+03  3.76351e+03  2.54422e+03
   3.51558e+03  3.24749e+03]
 [ 1

## Train a simple classifier

In [5]:
# Train the classifier on a part of the data set
clf = LogisticRegression(solver='lbfgs')
clf.fit(xtrain[:nrows_train], ytrain[:nrows_train])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [6]:
xtrain[:nrows_train].shape

(49, 37)

In [7]:
ytrain[:nrows_train].shape

(49,)

In [9]:
xtrain[nrows_train:(nrows_train + nrows_valid)].shape

(51, 37)

In [10]:
# Check it on another part of the data set
yvalid = clf.predict_proba(xtrain[nrows_train:(nrows_train + nrows_valid)])[:,clf.classes_ == 1][:,0]
print(yvalid)

[7.22245093e-08 1.37566227e-09 9.99258613e-01 2.23355535e-06
 1.03031229e-10 3.04856672e-10 1.12137416e-11 1.26164702e-07
 1.53594797e-11 1.15437390e-09 1.71801441e-11 1.00000000e+00
 7.22535600e-09 1.81026398e-09 9.99403757e-09 2.10589727e-08
 3.53614605e-09 4.60942703e-08 3.61415261e-13 3.31623715e-09
 1.05706163e-08 4.30388022e-07 1.06614632e-12 3.29254055e-10
 3.22364656e-09 2.38831261e-11 9.69647451e-08 4.19342298e-12
 2.09900564e-11 1.37808491e-06 1.94527252e-08 3.98905846e-10
 1.00000000e+00 4.04981714e-11 1.68247587e-11 4.23396558e-11
 6.61319556e-11 1.34666504e-12 1.04895576e-08 9.02744839e-08
 4.80341458e-08 3.81114530e-10 2.62666249e-11 6.09709652e-09
 4.56042052e-08 8.09745055e-08 1.00000000e+00 1.41549496e-07
 8.61558868e-11 3.35364327e-12 1.24649049e-11]


In [11]:
clf.classes_

array([0., 1.])

In [25]:
# Predicted probability of class 1
clf.predict_proba(xtrain[nrows_train:(nrows_train + nrows_valid)])[:,clf.classes_ == 1]

array([[7.22245093e-08],
       [1.37566227e-09],
       [9.99258613e-01],
       [2.23355535e-06],
       [1.03031229e-10],
       [3.04856672e-10],
       [1.12137416e-11],
       [1.26164702e-07],
       [1.53594797e-11],
       [1.15437390e-09],
       [1.71801441e-11],
       [1.00000000e+00],
       [7.22535600e-09],
       [1.81026398e-09],
       [9.99403757e-09],
       [2.10589727e-08],
       [3.53614605e-09],
       [4.60942703e-08],
       [3.61415261e-13],
       [3.31623715e-09],
       [1.05706163e-08],
       [4.30388022e-07],
       [1.06614632e-12],
       [3.29254055e-10],
       [3.22364656e-09],
       [2.38831261e-11],
       [9.69647451e-08],
       [4.19342298e-12],
       [2.09900564e-11],
       [1.37808491e-06],
       [1.94527252e-08],
       [3.98905846e-10],
       [1.00000000e+00],
       [4.04981714e-11],
       [1.68247587e-11],
       [4.23396558e-11],
       [6.61319556e-11],
       [1.34666504e-12],
       [1.04895576e-08],
       [9.02744839e-08],


In [23]:
np.argsort(yvalid)

array([18, 22, 37, 49, 27,  6, 50,  8, 34, 10, 28, 25, 42, 33, 35, 36, 48,
        4,  5, 23, 41, 31,  9,  1, 13, 24, 19, 16, 43, 12, 14, 38, 20, 30,
       15, 44, 17, 40,  0, 45, 39, 26,  7, 47, 21, 29,  3,  2, 11, 46, 32])

In [68]:
# Calculate the performance metric
# Reorder the true validation labels by increasing predicted score (yvalid)
yvalid_scoreordered = ytrain[nrows_train:(nrows_train + nrows_valid)][np.argsort(yvalid)]

print(yvalid_scoreordered)
N = np.sum(ytrain[nrows_train:(nrows_train + nrows_valid)] == 0)
P = np.sum(ytrain[nrows_train:(nrows_train + nrows_valid)] == 1)
FP = 0
TP = 0
for i in range(nrows_valid - 1, -1, -1):
    if (yvalid_scoreordered[i] == 1):
        TP = TP + 1
        print("TP")
    else:
        FP = FP + 1 # counted tentatively; undone below if the FPR budget is exceeded
        print("FP")
    if (FP / N > 10**-4):
        FP = FP - 1
        break
print("For the smallest FPR <= 10^-4 (i.e., ", FP / N, ") TPR = ", TP / P, ".", sep = "")

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
 1. 1. 1.]
TP
TP
TP
TP
FP
For the smallest FPR <= 10^-4 (i.e., 0.0) TPR = 0.8.


In [58]:
N # 51 - 5

46

In [42]:
np.sort(yvalid) # sorted predictions

array([3.61415261e-13, 1.06614632e-12, 1.34666504e-12, 3.35364327e-12,
       4.19342298e-12, 1.12137416e-11, 1.24649049e-11, 1.53594797e-11,
       1.68247587e-11, 1.71801441e-11, 2.09900564e-11, 2.38831261e-11,
       2.62666249e-11, 4.04981714e-11, 4.23396558e-11, 6.61319556e-11,
       8.61558868e-11, 1.03031229e-10, 3.04856672e-10, 3.29254055e-10,
       3.81114530e-10, 3.98905846e-10, 1.15437390e-09, 1.37566227e-09,
       1.81026398e-09, 3.22364656e-09, 3.31623715e-09, 3.53614605e-09,
       6.09709652e-09, 7.22535600e-09, 9.99403757e-09, 1.04895576e-08,
       1.05706163e-08, 1.94527252e-08, 2.10589727e-08, 4.56042052e-08,
       4.60942703e-08, 4.80341458e-08, 7.22245093e-08, 8.09745055e-08,
       9.02744839e-08, 9.69647451e-08, 1.26164702e-07, 1.41549496e-07,
       4.30388022e-07, 1.37808491e-06, 2.23355535e-06, 9.99258613e-01,
       1.00000000e+00, 1.00000000e+00, 1.00000000e+00])

In [56]:
yvalid_scoreordered # corresponding ytrain for the sorted predictions

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1.])

## Prepare a file for submission

In [6]:
# Load test data
xtest = np.loadtxt('xtest_challenge.csv', delimiter=',', skiprows = 1)
# Classify the provided test data
ytest = clf.predict_proba(xtest)[:,clf.classes_ == 1][:,0]
print(ytest.shape)
np.savetxt('ytest_challenge_student.csv', ytest, fmt = '%1.15f', delimiter=',')

(3768311,)
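
Before submitting, it can be worth reading the file back to confirm it has the expected shape. The following is a sketch of such a check, assuming one score per row as written by `np.savetxt` above; the scores and the file name `ytest_check.csv` are placeholders, and the check runs in a temporary directory so that no real file is touched.

```python
import os
import tempfile
import numpy as np

# Placeholder scores standing in for ytest.
scores = np.array([0.0, 0.25, 0.999999999999999, 1.0])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ytest_check.csv")  # hypothetical file name
    # Same format string as in the submission cell.
    np.savetxt(path, scores, fmt="%1.15f", delimiter=",")
    reloaded = np.loadtxt(path, delimiter=",")

same_length = reloaded.shape == scores.shape
all_finite = bool(np.isfinite(reloaded).all())
max_abs_diff = float(np.max(np.abs(reloaded - scores)))
print(same_length, all_finite, max_abs_diff)
```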


#### Now it's your turn. Good luck! :)