## A Python data structure to improve handling of datasets in machine learning workflows

This class is well suited for neuroimaging applications (or any other domain) where each sample needs to be uniquely identified with a subject ID (or something similar).

Key-level correspondence across data, labels (1 or 2) and class names ('healthy', 'disease') helps maintain data integrity and improves provenance, in addition to enabling traceback to the original sources from which the features were derived.
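To make the idea of key-level correspondence concrete, here is a minimal sketch using plain dictionaries. It is not the MLDataset implementation; the names `features`, `labels`, `class_names` and `check_correspondence`, as well as the sample values, are made up for illustration.

```python
# Hypothetical illustration: three dictionaries that share the same keys (sample IDs).
features = {'subj_01': [2.1, 2.4, 1.9], 'subj_02': [1.7, 2.0, 2.2]}
labels = {'subj_01': 1, 'subj_02': 2}
class_names = {'subj_01': 'healthy', 'subj_02': 'disease'}


def check_correspondence(*tables):
    """Return True if every table has exactly the same set of keys."""
    key_sets = [set(table) for table in tables]
    return all(keys == key_sets[0] for keys in key_sets)


consistent = check_correspondence(features, labels, class_names)
print(consistent)

# Everything known about a sample can be traced back through its ID.
sample_id = 'subj_02'
record = (features[sample_id], labels[sample_id], class_names[sample_id])
print(record)
```

Because every piece of information hangs off the same key, a mismatch (a label without features, say) can be detected with a simple set comparison.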

An example application is shown below:


In [1]:
import sys, os
import numpy as np
import pickle  # cPickle exists only in Python 2; the standard pickle module works in both

Importing the class definition:

In [2]:
from mldataset import MLDataset

We can now instantiate it and give it a description:

In [3]:
dataset = MLDataset()
dataset.description = 'ADNI1 baseline: cortical thickness features from Freesurfer v4.3, QCed.'

In [4]:
dataset

ADNI1 baseline: cortical thickness features from Freesurfer v4.3, QCed.
Empty dataset.

You can see the dataset has a description attached to it, but we know it is still empty. This can be verified in a boolean context, as shown below:

In [5]:
bool(dataset)

False
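For readers curious how an object can behave this way in a boolean context, one common Python approach is to define `__len__` (or `__bool__`). The class `TinyDataset` below is a hypothetical stand-in to show the mechanism, not the actual MLDataset code.

```python
class TinyDataset:
    """Hypothetical stand-in showing how a container can be truthy or falsy."""

    def __init__(self):
        self._data = {}

    def add_sample(self, sample_id, features):
        self._data[sample_id] = list(features)

    def __len__(self):
        # When __bool__ is not defined, Python falls back to __len__:
        # a length of zero makes the object falsy.
        return len(self._data)


tiny = TinyDataset()
empty_before = bool(tiny)

tiny.add_sample('subj_01', [0.1, 0.2])
nonempty_after = bool(tiny)

print(empty_before, nonempty_after)
```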

Let's add samples to this dataset, which is where this implementation becomes really handy. Before we do that, we will define some convenience routines, just to illustrate a simple yet common use of this dataset.

In [6]:
def read_thickness(path):
    """Dummy function to mimic a data reader."""

    # in your actual routine, this might be:
    #   pysurfer.read_thickness(path).values()
    return np.random.random(8)


def get_features(work_dir, subj_id):
    """Returns the whole brain cortical thickness for a given subject ID."""

    # extension to identify the data file; this could be .curv, anything else you choose
    ext_thickness = '.thickness'

    thickness = dict()
    for hemi in ['lh', 'rh']:
        path_thickness = os.path.join(work_dir, subj_id, hemi + ext_thickness)
        thickness[hemi] = read_thickness(path_thickness)

    # concatenating them to build a whole brain feature set
    thickness_wb = np.concatenate([thickness['lh'], thickness['rh']])

    return thickness_wb

So now we have IO routines to read the data for us. Let's define where the data will come from:

In [7]:
work_dir = '/project/ADNI/FreesurferThickness_v4p3'
class_set = ['Ctrl', 'Alzr', 'Another']

The paths and class names will differ for your application, but these are sufficient to illustrate the point.

Let's look at what methods this dataset offers us:

In [38]:
dir(dataset)

['add_classes',
 'add_sample',
 'class_set',
 'class_sizes',
 'classes',
 'data',
 'data_matrix',
 'del_sample',
 'description',
 'extend',
 'get_class',
 'get_feature_subset',
 'get_subset',
 'glance',
 'keys',
 'num_classes',
 'num_features',
 'num_samples',
 'sample_ids',
 'save',
 'target']

## Constructor

You can see there are a few methods such as add_sample and get_subset. The most important one is add_sample, which is key to constructing this dataset. Let's go ahead and add some samples:

In [9]:
for class_index, class_id in enumerate(class_set):
    print('Working on class {:>5}'.format(class_id))

    target_list_path = os.path.join(work_dir,'scripts','test_sample.{}'.format(class_id))
    with open(target_list_path,'r') as tf:
        target_list = tf.readlines()
        target_list = [sub.strip() for sub in target_list]

    for subj_id in target_list:
        print('\t reading subject {:>15}'.format(subj_id))
        thickness_wb = get_features(work_dir, subj_id)

        # adding the sample to the dataset
        dataset.add_sample(subj_id, thickness_wb, class_index, class_id)

Working on class  Ctrl
	 reading subject      011_S_0005
	 reading subject      011_S_0008
	 reading subject      022_S_0014
	 reading subject      100_S_0015
	 reading subject      011_S_0016
	 reading subject      067_S_0019
	 reading subject      011_S_0021
	 reading subject      011_S_0022
	 reading subject      011_S_0023
	 reading subject      023_S_0031
Working on class  Alzr
	 reading subject      031_S_1209
	 reading subject      007_S_1248
	 reading subject      007_S_1304
	 reading subject      009_S_1334
	 reading subject      007_S_1339
	 reading subject      005_S_1341
	 reading subject      057_S_1371
	 reading subject      057_S_1379
	 reading subject      041_S_1391
	 reading subject      094_S_1402
Working on class Another
	 reading subject      130_S_1200
	 reading subject      130_S_1201
	 reading subject      130_S_1290
	 reading subject      130_S_1337
	 reading subject      131_S_0123
	 reading subject      131_S_0319
	 reading subject      131_S_0384
	 reading s

**Nice. Isn't it?**

So what's nice about this, you say? *The simple fact that you are constructing a dataset as you read the data* in its most elemental form (in the units of the dataset such as the subject ID in our neuroimaging application). You're done as soon as you're done reading the features from disk.
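Building the dataset one sample at a time also gives a natural place for sanity checks. The sketch below is an assumption about the kind of checks such a container could perform (duplicate IDs, inconsistent feature counts). The class `CheckedDataset` is hypothetical, and this is not a description of what MLDataset actually does internally.

```python
class CheckedDataset:
    """Hypothetical container that validates each sample as it is added."""

    def __init__(self):
        self.data = {}
        self.labels = {}
        self.classes = {}
        self.num_features = None

    def add_sample(self, sample_id, features, label, class_id):
        features = list(features)
        # Each sample ID may appear only once.
        if sample_id in self.data:
            raise ValueError('{} is already in the dataset'.format(sample_id))
        # The first sample fixes the feature count; later ones must match it.
        if self.num_features is None:
            self.num_features = len(features)
        elif len(features) != self.num_features:
            raise ValueError('expected {} features, got {}'.format(
                self.num_features, len(features)))
        self.data[sample_id] = features
        self.labels[sample_id] = label
        self.classes[sample_id] = class_id


checked = CheckedDataset()
checked.add_sample('subj_01', [0.5, 0.7], 0, 'Ctrl')

try:
    checked.add_sample('subj_01', [0.1, 0.2], 0, 'Ctrl')
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

print(duplicate_rejected)
```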

What's more - you can inspect the dataset in an intuitive manner, as shown below:

In [10]:
dataset

ADNI1 baseline: cortical thickness features from Freesurfer v4.3, QCed.
30 samples and 16 features.
Class    Alzr : 10 samples.
Class Another : 10 samples.
Class    Ctrl : 10 samples.

Even better, right? No more typing several commands to get a complete and concise sense of the dataset.

## Convenient attributes

If you would like, you can always get more specific information, such as:

In [11]:
dataset.num_samples

30

In [12]:
dataset.num_features

16

In [13]:
dataset.class_set

{'Alzr', 'Another', 'Ctrl'}

In [14]:
dataset.class_sizes

Counter({'Alzr': 10, 'Another': 10, 'Ctrl': 10})

In [15]:
dataset.class_sizes['Ctrl']

10
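The `Counter` output above suggests that class sizes are simply counts of class names across samples. A minimal sketch of that idea follows; the `classes` mapping from sample ID to class name is hypothetical, and this is not taken from the MLDataset source.

```python
from collections import Counter

# Hypothetical per-sample class names, keyed by sample ID.
classes = {
    's1': 'Ctrl',
    's2': 'Ctrl',
    's3': 'Alzr',
    's4': 'Another',
    's5': 'Alzr',
}

# Counting the values gives the size of each class.
class_sizes = Counter(classes.values())

# The unique values give the set of classes.
class_set = set(classes.values())

print(class_sizes['Ctrl'])
print(sorted(class_set))
```

A `Counter` returns 0 for a class name it has never seen, which is convenient when probing for classes that may be absent from a subset.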

If you'd like to take a look at the data for a few subjects, shall we call it a glance?

In [16]:
dataset.glance()

{'011_S_0005': array([ 0.71906724,  0.69474234,  0.81541508,  0.68290433,  0.48515202,
         0.05169716,  0.35661796,  0.29255153,  0.77603053,  0.76953204,
         0.75151331,  0.6310391 ,  0.16664214,  0.45568029,  0.77235658,
         0.129734  ]),
 '011_S_0008': array([ 0.23475367,  0.87192348,  0.74838111,  0.14198781,  0.54580507,
         0.60608685,  0.96056912,  0.54054964,  0.12188444,  0.66777379,
         0.03865748,  0.39415703,  0.14964127,  0.08273157,  0.35624855,
         0.14643187]),
 '011_S_0016': array([ 0.0470204 ,  0.79071877,  0.05989734,  0.25284974,  0.49609748,
         0.24351157,  0.74269742,  0.69888173,  0.49533741,  0.6436561 ,
         0.98329797,  0.82819635,  0.99460435,  0.1030755 ,  0.28432574,
         0.58462271]),
 '022_S_0014': array([ 0.55777521,  0.25868491,  0.57849707,  0.80919437,  0.29650012,
         0.13068381,  0.20903501,  0.00799854,  0.54577376,  0.28578138,
         0.45626358,  0.23603352,  0.60335395,  0.02585778,  0.45984704,

We can control the number of items to glance:

In [17]:
dataset.glance(2)

{'011_S_0005': array([ 0.71906724,  0.69474234,  0.81541508,  0.68290433,  0.48515202,
         0.05169716,  0.35661796,  0.29255153,  0.77603053,  0.76953204,
         0.75151331,  0.6310391 ,  0.16664214,  0.45568029,  0.77235658,
         0.129734  ]),
 '011_S_0008': array([ 0.23475367,  0.87192348,  0.74838111,  0.14198781,  0.54580507,
         0.60608685,  0.96056912,  0.54054964,  0.12188444,  0.66777379,
         0.03865748,  0.39415703,  0.14964127,  0.08273157,  0.35624855,
         0.14643187])}
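A glance like this can be sketched with `itertools.islice` over a dictionary. The function below is a hypothetical illustration on plain dictionaries (relying on insertion-ordered dicts, Python 3.7+), and its default of 5 items is an assumption rather than the library's documented default.

```python
from itertools import islice


def glance(data, nitems=5):
    """Return the first `nitems` entries of an ID-keyed dictionary."""
    # Clamp the request between 1 and the number of available samples.
    nitems = max(1, min(nitems, len(data)))
    return dict(islice(data.items(), nitems))


toy_data = {
    'subj_01': [0.7, 0.6],
    'subj_02': [0.2, 0.8],
    'subj_03': [0.0, 0.7],
}

first_two = glance(toy_data, 2)
print(first_two)
```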

Or you may be wondering what the subject IDs in the dataset are. Here they are:

In [18]:
dataset.sample_ids

['011_S_0005',
 '011_S_0008',
 '022_S_0014',
 '100_S_0015',
 '011_S_0016',
 '067_S_0019',
 '011_S_0021',
 '011_S_0022',
 '011_S_0023',
 '023_S_0031',
 '031_S_1209',
 '007_S_1248',
 '007_S_1304',
 '009_S_1334',
 '007_S_1339',
 '005_S_1341',
 '057_S_1371',
 '057_S_1379',
 '041_S_1391',
 '094_S_1402',
 '130_S_1200',
 '130_S_1201',
 '130_S_1290',
 '130_S_1337',
 '131_S_0123',
 '131_S_0319',
 '131_S_0384',
 '131_S_0409',
 '131_S_0436',
 '131_S_0441']

## Subset selection

In addition to the structured way of obtaining the various properties of this dataset, this implementation really comes in handy when you have to slice and dice a dataset (with a large number of classes and features) into smaller subsets (e.g. for binary classification). Let's see how we can retrieve the data for a single class:

In [19]:
ctrl = dataset.get_class('Ctrl')

That's it: obtaining the data for a given class is a single call away.

Now let's see what it looks like:

In [20]:
ctrl


 Subset derived from: ADNI1 baseline: cortical thickness features from Freesurfer v4.3, QCed.
10 samples and 16 features.
Class Ctrl : 10 samples.

The description is even updated automatically to indicate its history. Let's see some data from controls:

In [21]:
ctrl.glance(2)

{'011_S_0005': array([ 0.71906724,  0.69474234,  0.81541508,  0.68290433,  0.48515202,
         0.05169716,  0.35661796,  0.29255153,  0.77603053,  0.76953204,
         0.75151331,  0.6310391 ,  0.16664214,  0.45568029,  0.77235658,
         0.129734  ]),
 '011_S_0008': array([ 0.23475367,  0.87192348,  0.74838111,  0.14198781,  0.54580507,
         0.60608685,  0.96056912,  0.54054964,  0.12188444,  0.66777379,
         0.03865748,  0.39415703,  0.14964127,  0.08273157,  0.35624855,
         0.14643187])}

### Let's see how we can retrieve specific samples by their IDs (for which there are many use cases):

In [22]:
data = dataset.get_subset(['022_S_0014','023_S_0031','023_S_0031','131_S_0409'])

It is as simple as that.

In [23]:
data


 Subset derived from: ADNI1 baseline: cortical thickness features from Freesurfer v4.3, QCed.
3 samples and 16 features.
Class Another : 1 samples.
Class    Ctrl : 2 samples.
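Note that four IDs were passed to get_subset above, with '023_S_0031' listed twice, yet the subset holds three samples. With ID-keyed storage this falls out naturally, as the following sketch shows. It uses plain dictionaries and a hypothetical `select_subset` helper; skipping IDs that are not present is an assumption of this sketch, not a statement about MLDataset.

```python
def select_subset(data, subset_ids):
    """Pick samples by ID from an ID-keyed dictionary."""
    # Dictionary keys are unique, so a repeated ID collapses into one entry.
    # IDs that are not in `data` are skipped (an assumption of this sketch).
    return {sid: data[sid] for sid in subset_ids if sid in data}


toy_data = {
    '022_S_0014': [0.5, 0.2],
    '023_S_0031': [0.1, 0.9],
    '131_S_0409': [0.4, 0.3],
    '011_S_0005': [0.7, 0.6],
}

requested = ['022_S_0014', '023_S_0031', '023_S_0031', '131_S_0409']
subset = select_subset(toy_data, requested)

print(len(requested), len(subset))
```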

### A more useful case would be to select a subset of classes from an original large dataset:

In [24]:
binary_dataset = dataset.get_class(['Ctrl','Alzr'])

In [25]:
binary_dataset


 Subset derived from: ADNI1 baseline: cortical thickness features from Freesurfer v4.3, QCed.
20 samples and 16 features.
Class Alzr : 10 samples.
Class Ctrl : 10 samples.

How about selecting a subset of features from all samples?

In [26]:
binary_dataset.get_feature_subset(range(10))  # xrange exists only in Python 2

Subset features derived from: 
 
 Subset derived from: ADNI1 baseline: cortical thickness features from Freesurfer v4.3, QCed.
20 samples and 10 features.
Class Alzr : 10 samples.
Class Ctrl : 10 samples.

**Great.** Isn't it? You can also see the two-time-point history (initial subset in classes, followed by a subset in features).
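Conceptually, a feature subset keeps every sample but only some columns. Here is a minimal sketch of that operation on plain dictionaries; the helper `select_features` is hypothetical and not the library's implementation.

```python
def select_features(data, indices):
    """Keep only the features at the given positions, for every sample."""
    indices = list(indices)  # allows ranges and other iterables
    return {sid: [feat[i] for i in indices] for sid, feat in data.items()}


toy_data = {
    'subj_01': [0.7, 0.6, 0.8, 0.4],
    'subj_02': [0.2, 0.8, 0.7, 0.1],
}

first_two_features = select_features(toy_data, range(2))
print(first_two_features)
```

The sample IDs are untouched, so the reduced dataset still lines up key by key with the original labels and class names.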

## Serialization

Once you have this dataset, you can save and load it trivially using your favourite serialization module. Let's do some pickling:

In [27]:
out_file = os.path.join(work_dir,'binary_dataset_Ctrl_Alzr_Freesurfer_thickness_v4p3.pkl')
binary_dataset.save(out_file)

That's it - it is saved.

Let's reload it from disk and make sure we can indeed retrieve it:

In [28]:
reloaded = MLDataset(filepath=out_file) # another form of the constructor!

Loading the dataset from: /project/ADNI/FreesurferThickness_v4p3/binary_dataset_Ctrl_Alzr_Freesurfer_thickness_v4p3.pkl


In [29]:
reloaded


 Subset derived from: ADNI1 baseline: cortical thickness features from Freesurfer v4.3, QCed.
20 samples and 16 features.
Class Alzr : 10 samples.
Class Ctrl : 10 samples.
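Since the text mentions pickling, here is what a generic round trip with the standard `pickle` module looks like. The `payload` dictionary and the file name are hypothetical, and this sketch does not claim to mirror the file layout that `save` writes.

```python
import os
import pickle
import tempfile

# Hypothetical payload: the kind of ID-keyed content a dataset might hold.
payload = {
    'description': 'toy dataset',
    'data': {'subj_01': [0.5, 0.7], 'subj_02': [0.1, 0.9]},
    'classes': {'subj_01': 'Ctrl', 'subj_02': 'Alzr'},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy_dataset.pkl')

    # Pickle files must be opened in binary mode.
    with open(out_path, 'wb') as out_file:
        pickle.dump(payload, out_file)

    with open(out_path, 'rb') as in_file:
        restored = pickle.load(in_file)

round_trip_ok = (restored == payload)
print(round_trip_ok)
```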

## Dataset Arithmetic

You might wonder how you can combine two different types of features (thickness and shape) from the dataset. Piece of cake, see below.

To concatenate two datasets, we first make a second dataset:

In [30]:
dataset_two = MLDataset(in_dataset=dataset) # yet another constructor: in its copy form!

How can you check if they are "functionally identical"? As in same keys, same data and classes for each key... Easy:

In [31]:
dataset_two == dataset

True

Now let's try the arithmetic:

In [32]:
combined = dataset + dataset_two

Identical keys found. Trying to horizontally concatenate features for each sample.


Great. The add method recognized the identical set of keys and performed a horizontal concatenation, as can be seen from the doubled number of features in the combined dataset:

In [33]:
combined


30 samples and 32 features.
Class    Alzr : 10 samples.
Class Another : 10 samples.
Class    Ctrl : 10 samples.
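The horizontal concatenation can be pictured as joining the feature vectors of matching IDs. Below is a minimal sketch on plain dictionaries; `combine_features` is a hypothetical helper that handles only the identical-keys case, and it is not how MLDataset implements addition.

```python
def combine_features(first, second):
    """Concatenate features sample by sample when both share the same IDs."""
    if set(first) != set(second):
        raise ValueError('this sketch only handles identical sets of sample IDs')
    return {sid: list(first[sid]) + list(second[sid]) for sid in first}


# Hypothetical feature sets of two different types for the same subjects.
thickness = {'subj_01': [2.1, 2.4], 'subj_02': [1.7, 2.0]}
shape = {'subj_01': [0.3], 'subj_02': [0.9]}

combined_toy = combine_features(thickness, shape)
print(combined_toy)
```

Matching by ID rather than by row position means the two inputs do not need to be in the same order.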

We can also do some removal in similar fashion:

In [34]:
smaller = combined - dataset

011_S_0005 removed.
011_S_0008 removed.
022_S_0014 removed.
100_S_0015 removed.
011_S_0016 removed.
067_S_0019 removed.
011_S_0021 removed.
011_S_0022 removed.
011_S_0023 removed.
023_S_0031 removed.
031_S_1209 removed.
007_S_1248 removed.
007_S_1304 removed.
009_S_1334 removed.
007_S_1339 removed.
005_S_1341 removed.
057_S_1371 removed.
057_S_1379 removed.
041_S_1391 removed.
094_S_1402 removed.
130_S_1200 removed.
130_S_1201 removed.
130_S_1290 removed.
130_S_1337 removed.
131_S_0123 removed.
131_S_0319 removed.
131_S_0384 removed.
131_S_0409 removed.
131_S_0436 removed.
131_S_0441 removed.




The data structure even produces a warning to let you know the resulting output would be empty! We can verify that:

In [35]:
bool(smaller)

False
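Removal by subtraction amounts to dropping every sample whose ID also appears in the other dataset. A minimal sketch on plain dictionaries follows; the helper `remove_samples` is hypothetical and not the library's code.

```python
def remove_samples(data, other):
    """Return the samples of `data` whose IDs do not appear in `other`."""
    return {sid: feat for sid, feat in data.items() if sid not in other}


full = {'subj_01': [0.5, 0.7], 'subj_02': [0.1, 0.9], 'subj_03': [0.4, 0.3]}
to_remove = {'subj_01': [0.5, 0.7], 'subj_03': [0.4, 0.3]}

remaining = remove_samples(full, to_remove)
nothing_left = remove_samples(full, full)

print(sorted(remaining))
print(bool(nothing_left))
```

Subtracting a dataset that has the same IDs leaves nothing behind, which matches the empty result seen above.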

This is all well and good. How does it interact with other packages out there, you might ask? It is as simple as you can imagine:

In [36]:
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.)

In [37]:
clf.fit(binary_dataset.data_matrix, binary_dataset.target)

SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
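What matters for such interoperability is that row i of the feature matrix and entry i of the target vector refer to the same sample. The sketch below shows one way an ID-keyed store could be flattened while preserving that alignment. The helper `to_matrix_and_target` is hypothetical, uses plain lists instead of NumPy arrays, and is not taken from MLDataset.

```python
def to_matrix_and_target(data, labels):
    """Flatten ID-keyed features and labels into aligned row-wise lists."""
    # Fix one ordering of IDs and use it for both outputs.
    sample_ids = list(data)
    matrix = [list(data[sid]) for sid in sample_ids]
    target = [labels[sid] for sid in sample_ids]
    return sample_ids, matrix, target


toy_data = {'subj_01': [0.5, 0.7], 'subj_02': [0.1, 0.9], 'subj_03': [0.4, 0.3]}
toy_labels = {'subj_03': 1, 'subj_01': 0, 'subj_02': 0}  # deliberately in another order

ids, matrix, target = to_matrix_and_target(toy_data, toy_labels)

for sid, row, label in zip(ids, matrix, target):
    print(sid, row, label)
```

Returning the ID order alongside the matrix keeps predictions traceable back to individual subjects.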

There you have it, a simple example to show you the utility and convenience of this dataset.

## Thanks for checking it out. I would appreciate it if you could give me feedback on improving or sharpening it further.