### General Information
#### Data Set Information
> Machine learning is used in high-energy physics experiments to search for the signatures of exotic particles. These signatures are learned from Monte Carlo simulations of the collisions that produce these particles and the resulting decay products. In each of the three data sets here, the goal is to separate particle-producing collisions from a background source. [...] The data is separated into a training set of 7 million examples and a test set of 3.5 million for each.

([UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/HEPMASS#))

#### Attribute Information
> The first column is the class label (1 for signal, 0 for background), followed by the 27 normalized features (22 low-level features then 5 high-level features), and a 28th mass feature.

([UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/HEPMASS#))

| label |      f0      |       f1     | ... |      f25     |      f26     |            mass              |
|:-----:|:------------:|:------------:|:---:|:------------:|:------------:|:----------------------------:|
| {0,1} | $\mathbb{R}$ | $\mathbb{R}$ | ... | $\mathbb{R}$ | $\mathbb{R}$ | {500, 750, 1000, 1250, 1500} |
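
The layout above can be split into label, features, and mass with a few lines of pandas. The following is a minimal sketch, assuming the CSV header names the first column `# label` (as the cells below suggest) and the last column `mass`; the helper name `split_columns` and the tiny in-memory sample with only three features are hypothetical stand-ins for the real file.

```python
import io

import pandas as pd

# Tiny synthetic stand-in for the real file (hypothetical values, 3 features only).
SAMPLE_CSV = """# label,f0,f1,f2,mass
1,0.5,-1.2,0.3,500
0,-0.7,0.4,1.1,750
1,1.9,0.0,-0.6,1000
"""


def split_columns(df):
    """Split a HEPMASS-style frame into (labels, features, mass)."""
    # Give the label column a friendlier name before selecting it.
    df = df.rename(columns={"# label": "label"})
    labels = df["label"].astype(int)
    mass = df["mass"]
    # Everything that is neither label nor mass is a feature column.
    features = df.drop(columns=["label", "mass"])
    return labels, features, mass


frame = pd.read_csv(io.StringIO(SAMPLE_CSV))
labels, features, mass = split_columns(frame)
print(list(features.columns))
print(labels.tolist())
```

For the real data, `frame` would instead come from reading `data/all_test.csv`.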

### Statistics

In [59]:
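import dask.dataframe as dd

data = dd.read_csv('data/all_test.csv')
# rename() returns a new dataframe, so the result must be assigned back.
data = data.rename(columns={'# label': 'label'})
# NOTE: compute() loads every row into memory before describe() runs,
# so this might freeze your computer for a while :)
data.drop(['label', 'mass'], axis=1).compute().describe()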

|       | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | ... | f17 | f18 | f19 | f20 | f21 | f22 | f23 | f24 | f25 | f26 |
|:-----:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| count | 3500000.0 | 3500000.0 | 3500000.0 | 3500000.0 | 3500000.0 | 3500000.0 | 3500000.0 | 3500000.0 | 3500000.0 | 3500000.0 | ... | 3500000.0 | 3500000.0 | 3500000.0 | 3500000.0 | 3500000.0 | 3500000.0 | 3500000.0 | 3500000.0 | 3500000.0 | 3500000.0 |
| mean | 0.01677517 | -0.0005882622 | -0.0002703736 | 0.01170023 | -5.277635e-05 | 0.002556547 | 0.01815479 | 0.0002004137 | -0.0001961444 | -0.006334786 | ... | 0.00467755 | 0.01147895 | -0.0008455537 | 4.723695e-05 | -3.806618e-05 | 0.01228957 | 0.009204575 | 0.006408718 | -0.0008033496 | 0.01511048 |
| std | 1.005118 | 0.9972289 | 0.9996684 | 0.9945796 | 0.9997896 | 1.000173 | 0.9869383 | 0.9962253 | 0.999917 | 1.001789 | ... | 1.00095 | 1.00265 | 1.000688 | 1.000289 | 0.9999775 | 1.009107 | 1.00382 | 1.011991 | 0.9850043 | 0.9818347 |
| min | -1.960549 | -2.365346 | -1.732165 | -8.895442 | -1.732137 | -1.054221 | -2.93288 | -2.757863 | -1.732359 | -1.325801 | ... | -0.8154401 | -1.728284 | -2.281867 | -1.731758 | -0.5736825 | -3.612524 | -4.271224 | -16.78043 | -2.800613 | -2.654076 |
| 25% | -0.7284369 | -0.734565 | -0.8662397 | -0.6083473 | -0.8656481 | -1.054221 | -0.7570114 | -0.7015647 | -0.8661835 | -1.325801 | ... | -0.8154401 | -0.7430429 | -0.7223305 | -0.8665953 | -0.5736825 | -0.5409764 | -0.5113137 | -0.3543733 | -0.6924303 | -0.7937275 |
| 50% | -0.03874749 | -0.000288312 | -0.001403588 | 0.02056307 | 0.0001922509 | -0.005983562 | -0.1508666 | 0.000504004 | 0.0002682766 | 0.7542607 | ... | -0.8154401 | -0.09024961 | -0.0008028708 | 0.0003705295 | -0.5736825 | -0.1602321 | -0.314292 | -0.3265146 | -0.3561128 | -0.08932617 |
| 75% | 0.6912319 | 0.7329645 | 0.8650876 | 0.6797863 | 0.8653622 | 0.8504885 | 0.7687392 | 0.7013005 | 0.8657596 | 0.7542607 | ... | 1.226331 | 0.6419688 | 0.7207651 | 0.8661937 | -0.5736825 | 0.4809416 | 0.1620571 | -0.2329328 | 0.4770004 | 0.7598977 |
| max | 4.037127 | 2.365296 | 1.73237 | 3.622688 | 1.731978 | 4.482618 | 3.933915 | 2.758563 | 1.73145 | 0.7542607 | ... | 1.226331 | 5.537915 | 2.282209 | 1.73274 | 1.743123 | 7.326623 | 9.357253 | 15.55813 | 5.008558 | 4.613183 |
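
The cell above loads all 3.5 million rows into memory before summarizing them. As a lighter alternative, the mean and standard deviation can be accumulated chunk by chunk. The following is a minimal sketch using pandas' `chunksize` option; the helper name `streaming_mean_std`, the chunk size, and the tiny synthetic sample are assumptions for illustration, not part of the original notebook.

```python
import io

import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the real file (hypothetical values).
SAMPLE_CSV = """# label,f0,f1,mass
1,1.0,2.0,500
0,2.0,4.0,750
1,3.0,6.0,1000
0,4.0,8.0,1250
"""


def streaming_mean_std(chunks, drop=("# label", "mass")):
    """Accumulate per-column mean and sample std over an iterable of DataFrames."""
    n = 0
    total = None
    total_sq = None
    for chunk in chunks:
        values = chunk.drop(columns=list(drop))
        n += len(values)
        s = values.sum()
        sq = (values ** 2).sum()
        total = s if total is None else total + s
        total_sq = sq if total_sq is None else total_sq + sq
    mean = total / n
    # Sample variance (ddof=1), matching pandas' describe().
    # The sum-of-squares form is adequate for roughly normalized features,
    # but can lose precision when the mean is large relative to the spread.
    var = (total_sq - n * mean ** 2) / (n - 1)
    return mean, np.sqrt(var)


# With chunksize set, read_csv yields DataFrames of at most that many rows.
reader = pd.read_csv(io.StringIO(SAMPLE_CSV), chunksize=2)
mean, std = streaming_mean_std(reader)
print(mean)
print(std)
```

For the real file, the reader would be something like `pd.read_csv('data/all_test.csv', chunksize=100_000)`. Quantiles (25%, 50%, 75%) cannot be accumulated this way and would still need either the full data or an approximation.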


In [21]:
def scatterplot_matrix(data, **kwargs):
    """Plots a scatterplot matrix of subplots.  Each column of "data" is plotted
    against every other column, resulting in an ncols by ncols grid of subplots
    with the diagonal subplots labeled with the column names.  Additional keyword
    arguments are passed on to matplotlib's "plot" command. Returns the
    matplotlib figure object containing the subplot grid."""

    columns = data.columns
    numvars = len(columns)
    print(numvars)
    print(columns)
    fig, axes = plt.subplots(nrows=numvars, ncols=numvars, figsize=(27,27))
    fig.subplots_adjust(hspace=0.05, wspace=0.05)
    
    for ax in axes.flat:
        # Hide all ticks and labels
        ax.xaxis.set_visible(False)
        ax.yaxis.set_visible(False)

        # Set up ticks only on one side for the "edge" subplots...
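        # (Axes.is_first_col() and friends were removed in Matplotlib 3.6;
        # the SubplotSpec methods below are their replacement.)
        ss = ax.get_subplotspec()
        if ss.is_first_col():
            ax.yaxis.set_ticks_position('left')
        if ss.is_last_col():
            ax.yaxis.set_ticks_position('right')
        if ss.is_first_row():
            ax.xaxis.set_ticks_position('top')
        if ss.is_last_row():
            ax.xaxis.set_ticks_position('bottom')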

    # Plot the data.
    for i, j in zip(*np.triu_indices_from(axes, k=1)):
        for x, y in [(i,j), (j,i)]:
            axes[y,x].plot(data[columns[x]], data[columns[y]], **kwargs)

    # Label the diagonal subplots...
    for i, label in enumerate(columns):
        axes[i,i].annotate(label, (0.5, 0.5), xycoords='axes fraction',
                ha='center', va='center')

    # Turn on the proper x or y axes ticks.
    for i, j in zip(range(numvars), itertools.cycle((-1, 0))):
        axes[j,i].xaxis.set_visible(True)
        axes[i,j].yaxis.set_visible(True)

    return fig

#---------------------------------------------------------------------------------
import dask.dataframe as dd
import matplotlib.pyplot as plt
%matplotlib notebook
import numpy as np
import itertools
dask_df = dd.read_csv('data/all_test.csv')
# Dask only supports sampling by fraction, so pass it explicitly as frac.
subsample = dask_df.sample(frac=0.0001).compute()
print("Sampled")
fig = scatterplot_matrix(subsample.drop(['# label', 'mass'], axis=1),
            linestyle='none', marker='o', color='black', mfc='none')
fig.suptitle('Scatterplot Matrix')
plt.show()

Sampled
27
Index(['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10',
       'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20',
       'f21', 'f22', 'f23', 'f24', 'f25', 'f26'],
      dtype='object')


[Figure: "Scatterplot Matrix", a 27 × 27 grid of pairwise scatterplots of features f0 to f26]
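
A similar grid can be produced with less code using `pandas.plotting.scatter_matrix`, which is a substitute for the hand-written `scatterplot_matrix` helper above, not a reproduction of it. The following is a minimal sketch on synthetic data; the random sample, the four-column frame, and the `Agg` backend (chosen so the sketch runs outside a notebook) are assumptions for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs outside a notebook

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic stand-in for a subsample of the feature columns (hypothetical values).
rng = np.random.default_rng(0)
sample = pd.DataFrame(rng.normal(size=(200, 4)), columns=["f0", "f1", "f2", "f3"])

# scatter_matrix returns a 2-D array of Axes, one per pair of columns;
# the diagonal shows a histogram of each column instead of a label.
axes = scatter_matrix(sample, figsize=(6, 6), diagonal="hist", marker="o", alpha=0.5)
fig = axes[0, 0].get_figure()
fig.suptitle("Scatterplot Matrix")
print(axes.shape)
```

In a notebook, the `matplotlib.use("Agg")` line would be dropped in favour of the `%matplotlib notebook` magic already used above, and `sample` would be the subsampled feature frame.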