In [1]:
# Useful starting lines
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
%load_ext autoreload
%autoreload 2

# Exploratory Data Analysis

In [5]:
# Import data and helper methods
# IMPORTANT: this notebook must be run from inside the /notebook subdirectory for the import below to work
import numpy as np
import matplotlib.pyplot as plt
# Path hack.
import sys, os
sys.path.insert(0, os.path.abspath('..'))
import scripts.helpers as hp

In [7]:
y_train, x_train, ids_train = hp.load_csv_data('../data/train.csv', sub_sample=False)

## Five-number summary of the training values for each feature
1. the sample minimum
2. the lower quartile (first quartile)
3. the median
4. the upper quartile (third quartile)
5. the sample maximum

In [8]:
# Print the five-number summary of the data set, one value per feature
def fiveNumberSummary(x_t):
    np.set_printoptions(suppress=True)
    print('---the sample minimum for each feature---')
    print(np.percentile(x_t,0,axis=0))
    print('')
    print('---the lower quartile for each feature---')
    print(np.percentile(x_t,25,axis=0))
    print('')
    print('---the median for each feature---')
    print(np.percentile(x_t,50,axis=0))
    print('')
    print('---the upper quartile for each feature---')
    print(np.percentile(x_t,75,axis=0))
    print('')
    print('---the sample maximum for each feature---')
    print(np.percentile(x_t,100,axis=0))
    


In [9]:
fiveNumberSummary(x_train)

---the sample minimum for each feature---
[-999.       0.       6.329    0.    -999.    -999.    -999.       0.208
    0.      46.104    0.047   -1.414 -999.      20.      -2.499   -3.142
   26.      -2.505   -3.142    0.109   -3.142   13.678    0.    -999.    -999.
 -999.    -999.    -999.    -999.       0.   ]

---the lower quartile for each feature---
[  78.10075   19.241     59.38875   14.06875 -999.      -999.      -999.
    1.81       2.841     77.55       0.883     -1.371   -999.        24.59175
   -0.925     -1.575     32.375     -1.014     -1.522     21.398     -1.575
  123.0175     0.      -999.      -999.      -999.      -999.      -999.
 -999.         0.     ]

---the median for each feature---
[ 105.012    46.524    73.752    38.4675 -999.     -999.     -999.
    2.4915   12.3155  120.6645    1.28     -0.356  -999.       31.804
   -0.023    -0.033    40.516    -0.045     0.086    34.802    -0.024
  179.739     1.       38.96     -1.872    -2.093  -999.     -999.     -999.


Several features have -999 as their median, meaning that at least half of their values are equal to -999, which probably distorts the results. The last feature also looks unusual: its lower quartile is 0, so at least 25% of its values are equal to 0.
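
One way to keep the -999 values from dominating the summary is to treat them as missing before computing the percentiles. The sketch below assumes that -999 is used only as a missing-value marker. The helper name `five_number_summary_ignoring_missing` and the toy array are made up for illustration and are not part of `scripts.helpers`.

```python
import numpy as np

def five_number_summary_ignoring_missing(x, missing=-999.0):
    """Return a (5, n_features) array with min, Q1, median, Q3 and max per
    column, computed only over entries that are not equal to `missing`."""
    x = np.asarray(x, dtype=float)
    # Replace the placeholder with NaN so that nanpercentile skips it.
    masked = np.where(x == missing, np.nan, x)
    return np.nanpercentile(masked, [0, 25, 50, 75, 100], axis=0)

# Toy data (hypothetical): the second column is mostly -999.
toy = np.array([[1.0, -999.0],
                [2.0, -999.0],
                [3.0,   10.0],
                [4.0, -999.0],
                [5.0,   20.0]])
summary = five_number_summary_ignoring_missing(toy)
print(summary)
```

With this approach the second toy column is summarised from its two valid entries only. A column that is entirely -999 would yield NaN together with a NumPy warning, so such columns need separate handling.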

## Meaningless values (-999)

Looking at the data, every feature is continuous except PRI_jet_num, which is categorical and represents the number of jets. It takes values in the set {0,1,2,3}. The -999 values seem to be linked to the number of jets, so let's check which features are -999 for each number of jets:

In [10]:
features = ['DER_mass_MMC','DER_mass_transverse_met_lep','DER_mass_vis','DER_pt_h','DER_deltaeta_jet_jet','DER_mass_jet_jet','DER_prodeta_jet_jet','DER_deltar_tau_lep','DER_pt_tot','DER_sum_pt','DER_pt_ratio_lep_tau','DER_met_phi_centrality','DER_lep_eta_centrality','PRI_tau_pt','PRI_tau_eta','PRI_tau_phi','PRI_lep_pt','PRI_lep_eta','PRI_lep_phi','PRI_met','PRI_met_phi','PRI_met_sumet','PRI_jet_num','PRI_jet_leading_pt','PRI_jet_leading_eta','PRI_jet_leading_phi','PRI_jet_subleading_pt','PRI_jet_subleading_eta','PRI_jet_subleading_phi','PRI_jet_all_pt']

# Return, for each feature, the number of rows with the given number of jets
# in which that feature equals -999 (i.e. is meaningless).
def nullFeatureDependingOnJetNmbr(nmr_jet):
    # Column 22 is PRI_jet_num.
    rows = x_train[x_train[:,22] == nmr_jet]
    return np.sum(rows == -999, axis=0)

def printNullFeatureDependingOnJet():
    for i in range(4):
        nullFeaturesnm = nullFeatureDependingOnJetNmbr(i)
        numberRowsWithJet = x_train[x_train[:,22]==i].shape[0]
        print('Features -999 when the number of jets is:',i)
        print('The number of rows with the number of jets being',i,'is',numberRowsWithJet)
        for j in range(30):
            if nullFeaturesnm[j] != 0:
                print(features[j],'has',int(nullFeaturesnm[j]),'/', numberRowsWithJet,'meaningless values')
        print('')
        print('')
printNullFeatureDependingOnJet()

Features -999 when the number of jets is: 0
The number of rows with the number of jets being 0 is 99913
DER_mass_MMC has 26123 / 99913 meaningless values
DER_deltaeta_jet_jet has 99913 / 99913 meaningless values
DER_mass_jet_jet has 99913 / 99913 meaningless values
DER_prodeta_jet_jet has 99913 / 99913 meaningless values
DER_lep_eta_centrality has 99913 / 99913 meaningless values
PRI_jet_leading_pt has 99913 / 99913 meaningless values
PRI_jet_leading_eta has 99913 / 99913 meaningless values
PRI_jet_leading_phi has 99913 / 99913 meaningless values
PRI_jet_subleading_pt has 99913 / 99913 meaningless values
PRI_jet_subleading_eta has 99913 / 99913 meaningless values
PRI_jet_subleading_phi has 99913 / 99913 meaningless values


Features -999 when the number of jets is: 1
The number of rows with the number of jets being 1 is 77544
DER_mass_MMC has 7562 / 77544 meaningless values
DER_deltaeta_jet_jet has 77544 / 77544 meaningless values
DER_mass_jet_jet has 77544 / 77544 meaningless values
D

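11 of the 30 features contain -999 values, and 10 of them are jet-related. The exception is DER_mass_MMC: it is -999 more often when the number of jets is lower (26123 / 99913 rows with 0 jets versus 7562 / 77544 rows with 1 jet), but it is never -999 for every row of a given jet count.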

There are 10 features that are always -999 when the number of jets is 0:

* DER_deltaeta_jet_jet
* DER_mass_jet_jet
* DER_prodeta_jet_jet 
* DER_lep_eta_centrality 
* PRI_jet_leading_pt
* PRI_jet_leading_eta
* PRI_jet_leading_phi 
* PRI_jet_subleading_pt
* PRI_jet_subleading_eta 
* PRI_jet_subleading_phi

There are 7 features that are always -999 when the number of jets is 1:

* DER_deltaeta_jet_jet
* DER_mass_jet_jet
* DER_prodeta_jet_jet
* DER_lep_eta_centrality
* PRI_jet_subleading_pt
* PRI_jet_subleading_eta
* PRI_jet_subleading_phi

There are no features that are always -999 when the number of jets is 2 or 3.
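
The per-jet lists above can also be derived programmatically. The following is a minimal sketch, assuming -999 is the only missing-value marker. The function name `always_missing_by_jet` and the toy array are hypothetical and only illustrate the idea.

```python
import numpy as np

def always_missing_by_jet(x, jet_col, missing=-999.0):
    """Map each jet count to the indices of the columns that equal
    `missing` in every row with that jet count."""
    x = np.asarray(x, dtype=float)
    result = {}
    for jet in np.unique(x[:, jet_col]):
        rows = x[x[:, jet_col] == jet]
        # A column qualifies only if all of its entries in this group are missing.
        always = np.all(rows == missing, axis=0)
        result[int(jet)] = np.flatnonzero(always).tolist()
    return result

# Toy data (hypothetical): column 0 is the jet count, column 1 mimics a
# leading-jet feature and column 2 a subleading-jet feature.
toy = np.array([[0, -999.0, -999.0],
                [0, -999.0, -999.0],
                [1,   35.0, -999.0],
                [1,   50.0, -999.0],
                [2,   60.0,   31.0]])
missing_map = always_missing_by_jet(toy, jet_col=0)
print(missing_map)
# {0: [1, 2], 1: [2], 2: []}
```

In this notebook, a call such as `always_missing_by_jet(x_train, jet_col=22)` should return column indices that can be mapped back to names through the `features` list.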

## Zero values

Looking at the data, the feature PRI_jet_all_pt seems to be always zero when the number of jets is 0. Let's check that hypothesis:

In [11]:
# Count the rows with 0 jets in which PRI_jet_all_pt (column 29) is exactly zero.
rowsWithZeroJets = x_train[x_train[:,22]==0]
numberRowsWithJet = rowsWithZeroJets.shape[0]
count = int(np.sum(rowsWithZeroJets[:,29] == 0))
print(features[29],'has',count,'/', numberRowsWithJet,'zero values')

PRI_jet_all_pt has 99913 / 99913 zero values


## Correlation

We compute the Pearson product-moment correlation coefficient between each feature and the label y:

In [12]:
for i in range(30):
    print('Correlation between y and',features[i], np.corrcoef(y_train,x_train[:,i])[0][1])

Correlation between y and DER_mass_MMC 0.239149057892
Correlation between y and DER_mass_transverse_met_lep -0.351427955862
Correlation between y and DER_mass_vis -0.0140552737849
Correlation between y and DER_pt_h 0.192526328569
Correlation between y and DER_deltaeta_jet_jet 0.141645992566
Correlation between y and DER_mass_jet_jet 0.191766088075
Correlation between y and DER_prodeta_jet_jet 0.140554400465
Correlation between y and DER_deltar_tau_lep 0.0122454812855
Correlation between y and DER_pt_tot -0.0152874266878
Correlation between y and DER_sum_pt 0.153235932476
Correlation between y and DER_pt_ratio_lep_tau -0.195397896183
Correlation between y and DER_met_phi_centrality 0.271751877052
Correlation between y and DER_lep_eta_centrality 0.141345988596
Correlation between y and PRI_tau_pt 0.235237975878
Correlation between y and PRI_tau_eta -0.000943251058212
Correlation between y and PRI_tau_phi -0.00440253868639
Correlation between y and PRI_lep_pt -0.0319475868053
Correlation 

We also compute the Pearson product-moment correlation coefficients between every pair of features and list the pairs whose absolute correlation exceeds 0.95:

In [13]:
# Each column of x_train is one variable, hence rowvar=False.
corrBetweenVariables = np.corrcoef(x_train, rowvar=False)
threshold = 0.95
# Report every pair of distinct features whose absolute correlation exceeds the threshold.
for i in range(x_train.shape[1]):
    for j in range(i+1,x_train.shape[1]):
        if abs(corrBetweenVariables[i][j]) > threshold:
            print('Correlation between',features[i],' and ',features[j])

Correlation between DER_deltaeta_jet_jet  and  DER_prodeta_jet_jet
Correlation between DER_deltaeta_jet_jet  and  DER_lep_eta_centrality
Correlation between DER_deltaeta_jet_jet  and  PRI_jet_subleading_pt
Correlation between DER_deltaeta_jet_jet  and  PRI_jet_subleading_eta
Correlation between DER_deltaeta_jet_jet  and  PRI_jet_subleading_phi
Correlation between DER_prodeta_jet_jet  and  DER_lep_eta_centrality
Correlation between DER_prodeta_jet_jet  and  PRI_jet_subleading_pt
Correlation between DER_prodeta_jet_jet  and  PRI_jet_subleading_eta
Correlation between DER_prodeta_jet_jet  and  PRI_jet_subleading_phi
Correlation between DER_sum_pt  and  PRI_jet_all_pt
Correlation between DER_lep_eta_centrality  and  PRI_jet_subleading_pt
Correlation between DER_lep_eta_centrality  and  PRI_jet_subleading_eta
Correlation between DER_lep_eta_centrality  and  PRI_jet_subleading_phi
Correlation between PRI_jet_leading_pt  and  PRI_jet_leading_eta
Correlation between PRI_jet_leading_pt  and  PR

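The correlations computed above are not conclusive yet. The -999 values are included in the computation, so features that are -999 in the same rows (such as the jet-related features) most likely appear strongly correlated mainly because they share that placeholder. The dataset needs to be cleaned before the correlations can be interpreted.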
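
As one possible cleaning step, the correlation with y can be recomputed per feature using only the rows where that feature is not -999. This is a sketch under the assumption that -999 always marks a missing value. The function name `corr_with_target_ignoring_missing` and the toy data are hypothetical.

```python
import numpy as np

def corr_with_target_ignoring_missing(x, y, missing=-999.0):
    """Pearson correlation between y and each column of x, using only the
    rows where that column is not equal to `missing`. Columns for which
    the coefficient is undefined are reported as NaN."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    corr = np.full(x.shape[1], np.nan)
    for i in range(x.shape[1]):
        valid = x[:, i] != missing
        # The coefficient needs at least two rows and non-constant values.
        if valid.sum() > 1 and np.std(x[valid, i]) > 0 and np.std(y[valid]) > 0:
            corr[i] = np.corrcoef(y[valid], x[valid, i])[0, 1]
    return corr

# Toy data (hypothetical): the second column has one missing entry.
toy_x = np.array([[1.0, -999.0],
                  [2.0,    3.0],
                  [3.0,    2.0],
                  [4.0,    1.0]])
toy_y = np.array([1.0, 2.0, 3.0, 4.0])
corr = corr_with_target_ignoring_missing(toy_x, toy_y)
print(corr)
```

In the toy example the second column is perfectly anti-correlated with y once the -999 row is excluded, whereas including that row would give a positive coefficient. This illustrates how much the placeholder can skew the result.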