### General Information
#### Data Set Information
> Machine learning is used in high-energy physics experiments to search for the signatures of exotic particles. These signatures are learned from Monte Carlo simulations of the collisions that produce these particles and the resulting decay products. In each of the three data sets here, the goal is to separate particle-producing collisions from a background source. [...] The data is separated into a training set of 7 million examples and a test set of 3.5 million for each.

([UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/HEPMASS#))

#### Attribute Information
> The first column is the class label (1 for signal, 0 for background), followed by the 27 normalized features (22 low-level features then 5 high-level features), and a 28th mass feature.

([UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/HEPMASS#))

| label |      f0      |       f1     | ... |      f25     |      f26     |            mass              |
|:-----:|:------------:|:------------:|:---:|:------------:|:------------:|:----------------------------:|
| {0,1} | $\mathbb{R}$ | $\mathbb{R}$ | ... | $\mathbb{R}$ | $\mathbb{R}$ | {500, 750, 1000, 1250, 1500} |
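
The column layout above can be sketched in code. This is a minimal illustration, assuming the CSV header names the label column `# label` (as the rename in the next cell suggests) and the features `f0` to `f26`; the helper `split_columns` and the toy frame are hypothetical and are not part of the dataset or its tooling.

```python
import numpy as np
import pandas as pd


def split_columns(df):
    """Split a HEPMASS-style frame into (label, features, mass).

    Hypothetical helper: assumes a '# label' (or 'label') column,
    27 feature columns, and a trailing 'mass' column.
    """
    df = df.rename(columns={'# label': 'label'})
    label = df['label'].astype(int)                # 1 = signal, 0 = background
    mass = df['mass']                              # one of the discrete mass values
    features = df.drop(columns=['label', 'mass'])  # f0 .. f26
    return label, features, mass


# Tiny synthetic stand-in for the real CSV (all values are made up).
rng = np.random.default_rng(0)
n = 8
toy = pd.DataFrame(rng.normal(size=(n, 27)),
                   columns=[f'f{i}' for i in range(27)])
toy.insert(0, '# label', rng.integers(0, 2, size=n).astype(float))
toy['mass'] = rng.choice([500, 750, 1000, 1250, 1500], size=n)

label, features, mass = split_columns(toy)
print(features.shape)
```

The same function should work on a pandas frame obtained from a Dask subsample, since it only relies on column names.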

### Statistics

In [22]:
import dask.dataframe as dd
import matplotlib.pyplot as plt
%matplotlib notebook
import numpy as np
import itertools

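# Lazily read the test split; nothing is loaded until .compute() is called.
data = dd.read_csv('data/all_test.csv')
# The CSV header names the first column '# label'; give it a clean name.
data = data.rename(columns={'# label': 'label'})
# Keep only the 27 feature columns.
features = data.drop(['label', 'mass'], axis=1)
# features.compute().describe()  # NOTE: loads the full table into memory and may freeze your computer for a while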
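
If loading the whole table for `describe()` is too heavy, one alternative is to stream the CSV in chunks and accumulate running sums. The sketch below swaps in plain pandas chunked reading instead of Dask; `streaming_mean_std` is a hypothetical helper, the toy CSV is synthetic, and the sum-of-squares formula is less numerically stable than a one-shot computation, so treat it as an approximation.

```python
import io

import numpy as np
import pandas as pd


def streaming_mean_std(csv_source, chunksize=1000, drop=('# label', 'mass')):
    """Per-column mean and sample std, computed one chunk at a time."""
    count = 0
    total = None
    total_sq = None
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        # Drop the non-feature columns if they are present.
        chunk = chunk.drop(columns=[c for c in drop if c in chunk.columns])
        count += len(chunk)
        s = chunk.sum()
        sq = (chunk ** 2).sum()
        total = s if total is None else total + s
        total_sq = sq if total_sq is None else total_sq + sq
    mean = total / count
    # Sample variance (ddof=1) from the accumulated sums.
    var = (total_sq - count * mean ** 2) / (count - 1)
    return pd.DataFrame({'mean': mean, 'std': np.sqrt(var.clip(lower=0))})


# Synthetic in-memory CSV standing in for data/all_test.csv.
rng = np.random.default_rng(1)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=['f0', 'f1', 'f2'])
toy.insert(0, '# label', rng.integers(0, 2, size=50))
toy['mass'] = 1000
stats = streaming_mean_std(io.StringIO(toy.to_csv(index=False)), chunksize=7)
print(stats)
```

Only one chunk is held in memory at a time, so peak memory is governed by `chunksize` rather than by the file size.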

In [26]:
def scatterplot_matrix(data, **kwargs):
    """Plot a scatterplot matrix.

    Each column of ``data`` is plotted against every other column, giving an
    ncols-by-ncols grid of subplots whose diagonal cells are labeled with the
    column names. Additional keyword arguments are passed on to matplotlib's
    ``plot`` command. Returns the matplotlib figure containing the subplot grid.
    """

    numvars = len(data.columns)
    columns = data.columns
    fig, axes = plt.subplots(nrows=numvars, ncols=numvars, figsize=(27,27))
    fig.subplots_adjust(hspace=0.05, wspace=0.05)
    
    for ax in axes.flat:
        # Hide all ticks and labels
        ax.xaxis.set_visible(False)
        ax.yaxis.set_visible(False)

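        # Set up ticks only on one side for the "edge" subplots...
        # (Axes.is_first_col() and friends were removed in Matplotlib 3.6;
        # the SubplotSpec methods are the replacement.)
        spec = ax.get_subplotspec()
        if spec.is_first_col():
            ax.yaxis.set_ticks_position('left')
        if spec.is_last_col():
            ax.yaxis.set_ticks_position('right')
        if spec.is_first_row():
            ax.xaxis.set_ticks_position('top')
        if spec.is_last_row():
            ax.xaxis.set_ticks_position('bottom')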

    # Plot the data.
    for i, j in zip(*np.triu_indices_from(axes, k=1)):
        for x, y in [(i,j), (j,i)]:
            axes[y,x].plot(data[columns[x]], data[columns[y]], **kwargs)

    # Label the diagonal subplots...
    for i, label in enumerate(columns):
        axes[i,i].annotate(label, (0.5, 0.5), xycoords='axes fraction',
                ha='center', va='center')

    # Turn on the proper x or y axes ticks.
    for i, j in zip(range(numvars), itertools.cycle((-1, 0))):
        axes[j,i].xaxis.set_visible(True)
        axes[i,j].yaxis.set_visible(True)

    return fig

#---------------------------------------------------------------------------------
# Draw roughly 0.01% of the rows; Dask only supports sampling by fraction.
features_subsample = features.sample(frac=0.0001).compute()
fig = scatterplot_matrix(features_subsample, linestyle='none', marker='o', color='black', mfc='none')
fig.suptitle('Scatterplot Matrix')
plt.show()

In [25]:
features_subsample.plot.box()

In [27]:
subsample = data.sample(frac=0.0001).compute()

In [28]:
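# Group the subsample by class (0 = background, 1 = signal), ignoring mass.
by_class = subsample.drop(['mass'], axis=1).groupby('label')
for name, group in by_class:
    # One figure per class, with a histogram for every feature.
    group.drop(['label'], axis=1).hist(grid=False)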
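
The per-class histograms can be complemented with a numeric summary. As a rough sketch, the hypothetical helper `class_mean_gap` below ranks features by the absolute difference between their signal and background means; the toy frame is synthetic, and a large gap in means is only a crude indicator of separability, not a measure used by the dataset authors.

```python
import numpy as np
import pandas as pd


def class_mean_gap(df, label_col='label', ignore=('mass',)):
    """Rank features by |mean(signal) - mean(background)|, largest first."""
    feats = df.drop(columns=[c for c in ignore if c in df.columns])
    means = feats.groupby(label_col).mean()   # one row per class
    gap = (means.loc[1] - means.loc[0]).abs()
    return gap.sort_values(ascending=False)


# Synthetic example: f0 is shifted for the signal class, f1 is not.
rng = np.random.default_rng(2)
n = 200
labels = np.repeat([0, 1], n // 2)
toy = pd.DataFrame({
    'label': labels,
    'f0': rng.normal(size=n) + 2.0 * labels,
    'f1': rng.normal(size=n),
    'mass': 1000,
})
gap = class_mean_gap(toy)
print(gap)
```

Applied to `subsample`, this would give a quick ordering of which features to inspect first in the histograms.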
