This is a Python class to encapsulate a machine learning dataset based on dictionaries.

This class is well suited for neuroimaging applications (or any other domain) where each sample needs to be uniquely identified with a subject ID (or something similar).

Key-level correspondence across data, labels (1 or 2), class names ('healthy', 'disease') and related attributes helps maintain data integrity, and it enables tracing each sample back to the original sources from which the features were derived.
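
To make the idea of key-level correspondence concrete, here is a minimal sketch using plain dictionaries. It is not the MLDataset implementation (which is not shown here); the IDs, feature values and the helper `is_consistent` are made up for illustration.

```python
# Three dictionaries keyed by the same (hypothetical) subject IDs.
data = {'subj_A': [2.1, 2.4, 1.9], 'subj_B': [1.7, 2.0, 1.5]}
labels = {'subj_A': 1, 'subj_B': 2}
classes = {'subj_A': 'healthy', 'subj_B': 'disease'}


def is_consistent(data, labels, classes):
    """Return True when all three dictionaries share exactly the same keys."""
    return set(data) == set(labels) == set(classes)


consistent = is_consistent(data, labels, classes)

# Everything known about a sample is reachable from its ID alone.
subj_b_record = (data['subj_B'], labels['subj_B'], classes['subj_B'])
```

Because every attribute hangs off the same key, a sample can never silently lose its label or class name when the dataset is reordered or subset.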

An example application is shown below:


In [1]:
import sys, os
import numpy as np
import pickle  # Python 3; on Python 2 use "import cPickle as pickle"

Importing the class definition:

In [2]:
from mldataset import MLDataset

We are now going to instantiate it and give it a description.

In [3]:
dataset = MLDataset()
dataset.description = 'ADNI1 baseline: cortical thickness features from Freesurfer v4.3, QCed.'

In [4]:
dataset

ADNI1 baseline: cortical thickness features from Freesurfer v4.3, QCed.

You can see the dataset has a description attached to it; however, we know it is empty. This can be verified in a boolean context, as shown below:

In [5]:
bool(dataset)

False
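
One common way for a container class to behave like this in a boolean context is to define `__len__`, since Python treats an object of length zero as false. The sketch below uses a hypothetical stand-in class, `TinyDataset`; whether MLDataset does it this way is an assumption.

```python
class TinyDataset(object):
    """Hypothetical stand-in for illustration; not the MLDataset class."""

    def __init__(self, description=''):
        self.description = description
        self._data = {}

    def add_sample(self, sample_id, features):
        self._data[sample_id] = features

    def __len__(self):
        # With no __bool__ defined, bool(obj) falls back to len(obj) != 0.
        return len(self._data)


empty = TinyDataset('described, but no samples yet')
filled = TinyDataset('one sample')
filled.add_sample('subj_A', [2.1, 2.4])

is_empty_truthy = bool(empty)
is_filled_truthy = bool(filled)
```

A description alone does not make the object truthy here; only stored samples do.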

Let's add samples to this dataset, which is when this dataset implementation becomes really handy. Before we do that, we will define some convenience routines just to illustrate a simple yet common use of this dataset.

In [6]:
def read_thickness(path):
    """Dummy function to mimic a data reader."""

    # in your actual routine, this might be:
    #   pysurfer.read_thickness(path).values()
    return np.random.random(8)


def get_features(work_dir, subj_id):
    """Returns the whole brain cortical thickness for a given subject ID."""

    # extension to identify the data file; this could be .curv or anything else you choose
    ext_thickness = '.thickness'

    thickness = dict()
    for hemi in ['lh', 'rh']:
        path_thickness = os.path.join(work_dir, subj_id, hemi + ext_thickness)
        thickness[hemi] = read_thickness(path_thickness)

    # concatenating them to build a whole brain feature set
    thickness_wb = np.concatenate([thickness['lh'], thickness['rh']])

    return thickness_wb

So now we have I/O functions to read the data for us. Let's define where the data will come from:

In [7]:
work_dir = '/project/ADNI/FreesurferThickness_v4p3'
class_set = ['Ctrl', 'Alzr', 'Another']

This would obviously change for your applications, but it has sufficient properties to illustrate the point.

Let's look at what methods this dataset offers us:

In [8]:
dir(dataset)

['add_classes',
 'add_sample',
 'class_set',
 'class_sizes',
 'classes',
 'data',
 'data_matrix',
 'description',
 'get_class',
 'get_subset',
 'keys',
 'labels',
 'num_classes',
 'num_features',
 'num_samples',
 'subject_ids']

You can see there are a few methods such as add_sample, get_subset etc. The most important one is add_sample, which is key to constructing this dataset. Let's go ahead and add some samples:

In [9]:
for class_index, class_id in enumerate(class_set):
    print('Working on class {:>5}'.format(class_id))

    target_list_path = os.path.join(work_dir,'scripts','test_sample.{}'.format(class_id))
    with open(target_list_path,'r') as tf:
        target_list = tf.readlines()
        target_list = [sub.strip() for sub in target_list]

    for subj_id in target_list:
        print('\t reading subject {:>15}'.format(subj_id))
        thickness_wb = get_features(work_dir, subj_id)

        # adding the sample to the dataset
        dataset.add_sample(subj_id, thickness_wb, class_index, class_id)

Working on class  Ctrl
	 reading subject      011_S_0005
	 reading subject      011_S_0008
	 reading subject      022_S_0014
	 reading subject      100_S_0015
	 reading subject      011_S_0016
	 reading subject      067_S_0019
	 reading subject      011_S_0021
	 reading subject      011_S_0022
	 reading subject      011_S_0023
	 reading subject      023_S_0031
Working on class  Alzr
	 reading subject      031_S_1209
	 reading subject      007_S_1248
	 reading subject      007_S_1304
	 reading subject      009_S_1334
	 reading subject      007_S_1339
	 reading subject      005_S_1341
	 reading subject      057_S_1371
	 reading subject      057_S_1379
	 reading subject      041_S_1391
	 reading subject      094_S_1402
Working on class Another
	 reading subject      130_S_1200
	 reading subject      130_S_1201
	 reading subject      130_S_1290
	 reading subject      130_S_1337
	 reading subject      131_S_0123
	 reading subject      131_S_0319
	 reading subject      131_S_0384
	 reading s

**Nice. Isn't it?**

So what's nice about this, you say? *The simple fact that you are constructing a dataset as you read the data* in its most elemental form (in the units of the dataset such as the subject ID in our neuroimaging application). You're done as soon as you're done reading the features from disk.
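
One practical detail in the loop above: stripping each line of the ID list turns a blank line into an empty subject ID. A small reader that skips blank lines might look like the sketch below; `read_id_list` is a hypothetical helper and the IDs are placeholders written to a throwaway file.

```python
import os
import tempfile


def read_id_list(path):
    """Return the non-empty, whitespace-stripped lines of a text file."""
    with open(path, 'r') as id_file:
        return [line.strip() for line in id_file if line.strip()]


# Demonstration on a temporary file containing a blank line and padded IDs.
tmp_dir = tempfile.mkdtemp()
list_path = os.path.join(tmp_dir, 'test_sample.Ctrl')
with open(list_path, 'w') as id_file:
    id_file.write('subj_A\n\n  subj_B  \n')

id_list = read_id_list(list_path)
```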

What's more - you can inspect the dataset in an intuitive manner, as shown below:

In [10]:
dataset

ADNI1 baseline: cortical thickness features from Freesurfer v4.3, QCed.
30 samples and 16 features.
Class    Alzr : 10 samples.
Class Another : 10 samples.
Class    Ctrl : 10 samples.

Even better, right? No more typing several commands to get a complete and concise sense of the dataset.

If you would like, you can always get more specific information, such as:

In [11]:
dataset.num_samples

30

In [12]:
dataset.num_features

16

In [13]:
dataset.class_set

{'Alzr', 'Another', 'Ctrl'}

In [14]:
dataset.class_sizes

Counter({'Alzr': 10, 'Another': 10, 'Ctrl': 10})

In [15]:
dataset.class_sizes['Ctrl']

10
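
For intuition, summaries like these are cheap to derive when class names are stored per sample ID. The following sketch works on a plain dictionary with made-up IDs; it is an assumption about how such summaries could be computed, not a description of the MLDataset internals.

```python
from collections import Counter

# Hypothetical mapping from sample ID to class name.
classes = {'subj_A': 'Ctrl', 'subj_B': 'Alzr', 'subj_C': 'Ctrl'}

class_set = set(classes.values())        # unique class names
class_sizes = Counter(classes.values())  # number of samples per class
num_samples = len(classes)
```

A `Counter` returns 0 for a class that is absent, so looking up an unknown class name does not raise a `KeyError`.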

In [23]:
dataset.subject_ids

['011_S_0005',
 '011_S_0008',
 '022_S_0014',
 '100_S_0015',
 '011_S_0016',
 '067_S_0019',
 '011_S_0021',
 '011_S_0022',
 '011_S_0023',
 '023_S_0031',
 '031_S_1209',
 '007_S_1248',
 '007_S_1304',
 '009_S_1334',
 '007_S_1339',
 '005_S_1341',
 '057_S_1371',
 '057_S_1379',
 '041_S_1391',
 '094_S_1402',
 '130_S_1200',
 '130_S_1201',
 '130_S_1290',
 '130_S_1337',
 '131_S_0123',
 '131_S_0319',
 '131_S_0384',
 '131_S_0409',
 '131_S_0436',
 '131_S_0441']

In addition to the structured way of obtaining the various properties of this dataset, this implementation really comes in handy when you have to slice and dice the dataset (with a large number of classes and features) into smaller subsets (e.g. for binary classification). Let's see how we can retrieve the data for a single class:

In [16]:
ctrl = dataset.get_class('Ctrl')

That's it, a simple call.

Now let's see what it looks like:

In [17]:
ctrl


 Subset derived from: ADNI1 baseline: cortical thickness features from Freesurfer v4.3, QCed.
10 samples and 16 features.
Class Ctrl : 10 samples.

The description is even updated automatically to indicate its history.

## Let's see how we can retrieve specific subject IDs (for which there are many use cases):

In [25]:
data = dataset.get_subset(['022_S_0014','023_S_0031','023_S_0031','131_S_0409'])

It's as simple as that.

In [26]:
data


 Subset derived from: ADNI1 baseline: cortical thickness features from Freesurfer v4.3, QCed.
3 samples and 16 features.
Class Another : 1 samples.
Class    Ctrl : 2 samples.
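
Note that the ID list above repeats '023_S_0031', yet the subset holds 3 samples. With dictionary-keyed storage this falls out naturally, because a key can appear only once. The sketch below shows the idea on a plain dictionary; `subset_by_ids` is a hypothetical helper, and this is consistent with the output above without claiming to be how get_subset is implemented.

```python
def subset_by_ids(mapping, wanted_ids):
    """Keep only the entries whose key is among wanted_ids."""
    wanted = set(wanted_ids)  # duplicates in the request collapse here
    return {key: value for key, value in mapping.items() if key in wanted}


# Made-up IDs and feature values.
data = {'subj_A': [2.1, 2.4], 'subj_B': [1.7, 2.0], 'subj_C': [2.3, 2.2]}

subset = subset_by_ids(data, ['subj_A', 'subj_C', 'subj_C'])
```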

### A more useful case would be to select a subset of classes from an original large dataset:

In [19]:
binary_dataset = dataset.get_class(['Ctrl','Alzr'])

In [20]:
binary_dataset


 Subset derived from: ADNI1 baseline: cortical thickness features from Freesurfer v4.3, QCed.
20 samples and 16 features.
Class Alzr : 10 samples.
Class Ctrl : 10 samples.

**Great.**
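
A binary subset like this usually ends up as a feature matrix and a label vector for a classifier. The sketch below builds both from plain dictionaries so that row order and label order stay aligned. The dir() listing earlier shows a data_matrix attribute, but its exact behaviour is not shown here, so this example does not rely on it; all IDs and values are made up.

```python
import numpy as np

# Hypothetical per-sample features and numeric labels, keyed by sample ID.
data = {
    'subj_A': np.array([2.1, 2.4]),
    'subj_B': np.array([1.7, 2.0]),
    'subj_C': np.array([2.3, 2.2]),
}
labels = {'subj_A': 0, 'subj_B': 1, 'subj_C': 0}

# Fix one ordering of IDs and use it for both the rows and the labels.
sample_ids = sorted(data)
feature_matrix = np.vstack([data[sid] for sid in sample_ids])
label_vector = np.array([labels[sid] for sid in sample_ids])
```

Keeping `sample_ids` alongside the matrix means any row, and any prediction made for it, can still be traced back to its subject.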

Once you have this dataset, you can save and load it trivially using your favourite serialization module. Let's do some pickling:

In [21]:
import pickle  # Python 3; on Python 2 use "import cPickle as pickle"

In [22]:
out_file = os.path.join(work_dir,'binary_dataset_Ctrl_Alzr_Freesurfer_thickness_v4p3.pkl')

# saving the binary dataset to disk (matching the file name above)
path = os.path.abspath(out_file)
with open(path, 'wb') as df:
    pickle.dump(binary_dataset, df)
print('saved.')

saved.
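
Loading is the mirror image of saving. The round trip below uses a plain dictionary and a temporary file so that it runs on its own; the payload and file name are placeholders. For a pickled MLDataset object, the class definition would also need to be importable at load time.

```python
import os
import pickle
import tempfile

# Placeholder payload standing in for a dataset object.
payload = {'description': 'toy dataset', 'data': {'subj_A': [2.1, 2.4]}}

tmp_path = os.path.join(tempfile.mkdtemp(), 'toy_dataset.pkl')

# Save in binary mode ...
with open(tmp_path, 'wb') as out_file_handle:
    pickle.dump(payload, out_file_handle, protocol=pickle.HIGHEST_PROTOCOL)

# ... and load it back, also in binary mode.
with open(tmp_path, 'rb') as in_file_handle:
    restored = pickle.load(in_file_handle)
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.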


There you have it, a simple example to show you the utility and convenience of this dataset.

## Thanks for checking it out. I would appreciate it if you could give me feedback on improving or sharpening it further.