# Preprocess the raw Zooniverse datasets

In this notebook we will load the raw Zooniverse datasets and convert them for easier handling later on.

In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [2]:
from pyclouds.imports import *
from pyclouds.zooniverse import *

## Import and understand the entire dataset

The dataset from 16 December contains the MPI and the LMD classification data, for the real dataset as well as for the practice dataset.

To handle the data, we will use some functions from `pyclouds.zooniverse`.

In [3]:
!ls ../zooniverse_raw/

README.md
sugar-flower-fish-or-gravel-classifications_18_11_02.csv
sugar-flower-fish-or-gravel-classifications_18_11_30.csv
sugar-flower-fish-or-gravel-classifications_18_12_16.csv
sugar-flower-fish-or-gravel-subjects_18_11_05.csv
sugar-flower-fish-or-gravel-subjects_19_01_14.csv


In [4]:
clas_fn = '../zooniverse_raw/sugar-flower-fish-or-gravel-classifications_18_12_16.csv'
subj_fn = '../zooniverse_raw/sugar-flower-fish-or-gravel-subjects_19_01_14.csv'

In [34]:
%%time
# Version 24.13 belongs to the 'Practice' workflow ('Full dataset' only has version 13.11)
clas_prac = split_classification_df(clas_fn, 'Practice', 24.13, drop_nli=False, subj_df_or_fn=subj_fn)

CPU times: user 29.1 s, sys: 1.84 s, total: 30.9 s
Wall time: 30.9 s


In [33]:
clas_prac.head()

NameError: name 'clas_prac' is not defined

### The classification dataset

The classification dataset contains one row per user and image, holding all of that user's labels for that image.

In [24]:
clas = parse_classifications(clas_fn, json_columns=['metadata', 'annotations', 'subject_data'])

In [6]:
len(clas)

34616

In [7]:
clas.tail()

Unnamed: 0,classification_id,user_name,user_id,user_ip,workflow_id,workflow_name,workflow_version,created_at,gold_standard,expert,metadata,annotations,subject_data,subject_ids
34611,136647387,onnoq,1834851.0,af96bca21f888f9ee82c,8073,Full dataset,13.11,2018-12-14 16:09:42 UTC,,,"{'source': 'api', 'session': '67f9a58fcd31b031...","{'task': 'T0', 'task_label': 'Draw bounding bo...","{'27142706': {'retired': None, 'fn': '/project...",27142706
34612,136647412,onnoq,1834851.0,af96bca21f888f9ee82c,8073,Full dataset,13.11,2018-12-14 16:09:54 UTC,,,"{'source': 'api', 'session': '67f9a58fcd31b031...","{'task': 'T0', 'task_label': 'Draw bounding bo...","{'27144789': {'retired': None, 'fn': '/project...",27144789
34613,136647449,onnoq,1834851.0,af96bca21f888f9ee82c,8073,Full dataset,13.11,2018-12-14 16:10:08 UTC,,,"{'source': 'api', 'session': '67f9a58fcd31b031...","{'task': 'T0', 'task_label': 'Draw bounding bo...","{'27163862': {'retired': None, 'fn': '/project...",27163862
34614,136647469,onnoq,1834851.0,af96bca21f888f9ee82c,8073,Full dataset,13.11,2018-12-14 16:10:14 UTC,,,"{'source': 'api', 'session': '67f9a58fcd31b031...","{'task': 'T0', 'task_label': 'Draw bounding bo...","{'27143251': {'retired': {'id': 24671973, 'wor...",27143251
34615,136647512,onnoq,1834851.0,af96bca21f888f9ee82c,8073,Full dataset,13.11,2018-12-14 16:10:28 UTC,,,"{'source': 'api', 'session': '67f9a58fcd31b031...","{'task': 'T0', 'task_label': 'Draw bounding bo...","{'27147693': {'retired': None, 'fn': '/project...",27147693


In [8]:
clas.user_id.nunique(), clas.user_name.nunique(), clas.subject_ids.nunique()   
# Number of unique users and images = subjects

(72, 91, 9801)

`user_id` is only set for logged-in users, so 91 - 72 = 19 of the user names belong to sessions that were not logged in. Let's see how many labels correspond to those users.

In [9]:
len(clas[clas.user_name.apply(lambda u: 'not-logged-in' in u)])

444

So only a small fraction of the labels (444 of 34616, about 1.3%) comes from not-logged-in users. For ease of analysis, we should probably remove those.

In [10]:
clas.workflow_id.unique()

array([8072, 8073, 8104, 8109, 8414])

In [11]:
clas.workflow_name.unique()

array(['Practice', 'Full dataset', 'Tutorial Gold standard',
       'Tutorial Random', 'Validation'], dtype=object)

We really only care about the `Practice` and `Full dataset` workflows.

In [12]:
clas[clas.workflow_name == 'Practice'].workflow_version.unique()

array([19.13, 24.13])

In [13]:
clas[clas.workflow_name == 'Full dataset'].workflow_version.unique()

array([13.11])

Let's now also convert the `created_at` column to an actual datetime object that we can work with.

Plotted over time, the classifications show two spikes associated with the MPI and LMD labeling days. The LMD day seems to have a larger "tail".
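The conversion and the per-day counts behind such a time series can be sketched as follows. This is a minimal example on made-up rows, not the notebook's original cell: the `toy` DataFrame and the `per_day` name are assumptions, and only the timestamp format is taken from the `created_at` column shown above.

```python
import pandas as pd

# Toy stand-in for the classification table (hypothetical rows); the strings
# follow the 'YYYY-MM-DD HH:MM:SS UTC' layout of the `created_at` column.
toy = pd.DataFrame({
    'created_at': [
        '2018-11-02 09:15:00 UTC',
        '2018-11-02 10:30:00 UTC',
        '2018-11-29 14:00:00 UTC',
    ]
})

# Parse the strings into timezone-naive timestamps; the trailing ' UTC'
# is matched literally by the format string.
toy['datetime'] = pd.to_datetime(toy['created_at'], format='%Y-%m-%d %H:%M:%S UTC')

# Count classifications per calendar day; labeling days show up as spikes.
per_day = toy.groupby(toy['datetime'].dt.date).size()
print(per_day)
```

On the real table, the same two steps applied to `clas` would give the counts to plot, for instance with `per_day.plot()`.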

### The subject dataset

In [25]:
subj = load_classifications(subj_fn)

In [26]:
len(subj)

10179

In [27]:
subj.tail()

Unnamed: 0,subject_id,project_id,workflow_id,subject_set_id,metadata,locations,classifications_count,retired_at,retirement_reason,created_at,updated_at
10174,27164038,7699,8073.0,60835,{'fn': '/project/meteo/work/S.Rasp/cloud-class...,"{""0"":""https://panoptes-uploads.zooniverse.org/...",4,2018-11-29 14:26:48 UTC,classification_count,2018-10-29 12:15:26 UTC,2018-10-29 12:15:26 UTC
10175,27164039,7699,8073.0,60835,{'fn': '/project/meteo/work/S.Rasp/cloud-class...,"{""0"":""https://panoptes-uploads.zooniverse.org/...",4,2018-11-30 13:26:41 UTC,classification_count,2018-10-29 12:15:27 UTC,2018-10-29 12:15:27 UTC
10176,27164040,7699,8073.0,60835,{'fn': '/project/meteo/work/S.Rasp/cloud-class...,"{""0"":""https://panoptes-uploads.zooniverse.org/...",4,2018-12-30 13:00:38 UTC,classification_count,2018-10-29 12:15:28 UTC,2018-10-29 12:15:28 UTC
10177,27164041,7699,8073.0,60835,{'fn': '/project/meteo/work/S.Rasp/cloud-class...,"{""0"":""https://panoptes-uploads.zooniverse.org/...",4,2018-12-01 22:35:27 UTC,classification_count,2018-10-29 12:15:31 UTC,2018-10-29 12:15:31 UTC
10178,27164042,7699,8073.0,60835,{'fn': '/project/meteo/work/S.Rasp/cloud-class...,"{""0"":""https://panoptes-uploads.zooniverse.org/...",3,,,2018-10-29 12:15:33 UTC,2018-10-29 12:15:33 UTC


In [28]:
subj.set_index('subject_id', inplace=True)

In [29]:
subj.head()

Unnamed: 0_level_0,project_id,workflow_id,subject_set_id,metadata,locations,classifications_count,retired_at,retirement_reason,created_at,updated_at
subject_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
26976345,7699,,60713,{'fn': '/project/meteo/work/S.Rasp/cloud-class...,"{""0"":""https://panoptes-uploads.zooniverse.org/...",0,,,2018-10-24 16:42:28 UTC,2018-10-24 16:42:28 UTC
26976346,7699,,60713,{'fn': '/project/meteo/work/S.Rasp/cloud-class...,"{""0"":""https://panoptes-uploads.zooniverse.org/...",0,,,2018-10-24 16:42:29 UTC,2018-10-24 16:42:29 UTC
26976347,7699,,60713,{'fn': '/project/meteo/work/S.Rasp/cloud-class...,"{""0"":""https://panoptes-uploads.zooniverse.org/...",0,,,2018-10-24 16:42:31 UTC,2018-10-24 16:42:31 UTC
26976348,7699,,60713,{'fn': '/project/meteo/work/S.Rasp/cloud-class...,"{""0"":""https://panoptes-uploads.zooniverse.org/...",0,,,2018-10-24 16:42:32 UTC,2018-10-24 16:42:32 UTC
26976349,7699,,60713,{'fn': '/project/meteo/work/S.Rasp/cloud-class...,"{""0"":""https://panoptes-uploads.zooniverse.org/...",0,,,2018-10-24 16:42:33 UTC,2018-10-24 16:42:33 UTC


In [22]:
clas[clas.subject_ids == 26976410]

Unnamed: 0,classification_id,user_name,user_id,user_ip,workflow_id,workflow_name,workflow_version,created_at,gold_standard,expert,metadata,annotations,subject_data,subject_ids,datetime
13,128932300,raspstephan,1814911.0,ece34b7062ff27190425,8073,Full dataset,13.11,2018-10-28 10:00:31 UTC,,,"{'source': 'api', 'session': '72e7236d3736b33e...","{'task': 'T0', 'task_label': 'Draw bounding bo...","{'26976410': {'retired': None, 'fn': '/project...",26976410,2018-10-28 10:00:31


In [23]:
clas['subject_set_id'] = clas.subject_ids.apply(
    lambda i: subj.loc[i].subject_set_id if i in subj.index else np.nan)

The first row is still from a now deleted dataset, so let's remove that!

In [60]:
clas.dropna(subset=['subject_set_id'], inplace=True)

In [61]:
#EXPORT
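def add_subject_set_id_to_clas_df(clas_df, subj_df):
    """Add `subject_set_id` and the image filename `fn` to `clas_df` (modified in place)."""
    s = subj_df.set_index('subject_id')
    # HARDCODED: Only using the actual datasets, not practice, etc.
    s = s[s.subject_set_id.apply(lambda set_id: set_id in subj_id2name)]
    # Membership test on the index itself avoids rebuilding a list for every row
    clas_df['subject_set_id'] = clas_df.subject_ids.apply(
        lambda i: s.loc[i].subject_set_id if i in s.index else np.nan)
    # Drop classifications whose subject is not in one of the actual datasets
    clas_df.dropna(subset=['subject_set_id'], inplace=True)
    # Also add filename, stripping the first 48 characters of the stored path
    clas_df['fn'] = clas_df.subject_ids.apply(lambda i: s.loc[i].metadata['fn'][48:])
    return clas_df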

In [62]:
subj = load_classifications(subj_fn)

In [37]:
%time clas = add_subject_set_id_to_clas_df(clas, subj)

CPU times: user 21.9 s, sys: 12 ms, total: 21.9 s
Wall time: 21.9 s
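
The row-wise `apply` lookup takes about 22 s here. A possible alternative, shown only as a sketch on toy data, is to let `Series.map` do the lookup in a single pass. The `toy_subj`, `toy_clas` and `lookup` names are hypothetical, and the sketch assumes every `subject_id` appears only once in the subject table, which may not hold for the real export.

```python
import pandas as pd

# Toy subject table with one row per subject (an assumption for this sketch).
toy_subj = pd.DataFrame({
    'subject_id': [101, 102, 103],
    'subject_set_id': [60811, 60815, 60835],
})

# Toy classification table; subject 999 is missing from the subject table.
toy_clas = pd.DataFrame({'subject_ids': [102, 999, 101, 102]})

# Series.map looks up every id in the lookup Series and returns NaN for
# ids that are not part of its index.
lookup = toy_subj.set_index('subject_id')['subject_set_id']
toy_clas['subject_set_id'] = toy_clas['subject_ids'].map(lookup)

# Drop classifications whose subject is unknown, as in the cells above.
toy_clas = toy_clas.dropna(subset=['subject_set_id'])
print(toy_clas)
```

If a subject can belong to several subject sets, the lookup Series has duplicate index entries and would need to be deduplicated first.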


In [38]:
clas

Unnamed: 0,classification_id,user_name,user_id,user_ip,workflow_id,workflow_name,workflow_version,created_at,gold_standard,expert,metadata,annotations,subject_data,subject_ids,datetime,subject_set_id,fn
23,129485502,not-logged-in-80bdc4acf6d39d1ea32e,,80bdc4acf6d39d1ea32e,8073,Full dataset,13.11,2018-11-01 08:15:14 UTC,,,"{'source': 'api', 'session': '3cf8bb39dc9ac314...","{'task': 'T0', 'task_label': 'Draw bounding bo...","{'27144341': {'retired': {'id': 24673754, 'wor...",27144341,2018-11-01 08:15:14,60815,Region2_DJF_Aqua/Aqua_CorrectedReflectance2010...
24,129485553,not-logged-in-80bdc4acf6d39d1ea32e,,80bdc4acf6d39d1ea32e,8073,Full dataset,13.11,2018-11-01 08:16:38 UTC,,,"{'source': 'api', 'session': '3cf8bb39dc9ac314...","{'task': 'T0', 'task_label': 'Draw bounding bo...","{'27146358': {'retired': {'id': 24675851, 'wor...",27146358,2018-11-01 08:16:38,60817,Region3_DJF_Aqua/Aqua_CorrectedReflectance2010...
25,129485578,not-logged-in-80bdc4acf6d39d1ea32e,,80bdc4acf6d39d1ea32e,8073,Full dataset,13.11,2018-11-01 08:16:53 UTC,,,"{'source': 'api', 'session': '3cf8bb39dc9ac314...","{'task': 'T0', 'task_label': 'Draw bounding bo...","{'27161730': {'retired': {'id': 24706423, 'wor...",27161730,2018-11-01 08:16:53,60819,Region3_SON_Aqua/Aqua_CorrectedReflectance2017...
26,129485596,not-logged-in-80bdc4acf6d39d1ea32e,,80bdc4acf6d39d1ea32e,8073,Full dataset,13.11,2018-11-01 08:17:11 UTC,,,"{'source': 'api', 'session': '3cf8bb39dc9ac314...","{'task': 'T0', 'task_label': 'Draw bounding bo...","{'27139667': {'retired': {'id': 24666988, 'wor...",27139667,2018-11-01 08:17:11,60811,Region1_DJF_Aqua/Aqua_CorrectedReflectance2007...
27,129485619,not-logged-in-80bdc4acf6d39d1ea32e,,80bdc4acf6d39d1ea32e,8073,Full dataset,13.11,2018-11-01 08:17:32 UTC,,,"{'source': 'api', 'session': '3cf8bb39dc9ac314...","{'task': 'T0', 'task_label': 'Draw bounding bo...","{'27145958': {'retired': None, 'fn': '/project...",27145958,2018-11-01 08:17:32,60816,Region2_DJF_Terra/Terra_CorrectedReflectance20...
28,129486316,raspstephan,1814911.0,80bdc4acf6d39d1ea32e,8073,Full dataset,13.11,2018-11-01 08:28:49 UTC,,,"{'source': 'api', 'session': 'f83f28c58ce8e6e9...","{'task': 'T0', 'task_label': 'Draw bounding bo...","{'27142057': {'retired': None, 'fn': '/project...",27142057,2018-11-01 08:28:49,60813,Region1_MAM_Aqua/Aqua_CorrectedReflectance2007...
29,129486328,raspstephan,1814911.0,80bdc4acf6d39d1ea32e,8073,Full dataset,13.11,2018-11-01 08:29:02 UTC,,,"{'source': 'api', 'session': 'f83f28c58ce8e6e9...","{'task': 'T0', 'task_label': 'Draw bounding bo...","{'27146798': {'retired': {'id': 24676302, 'wor...",27146798,2018-11-01 08:29:02,60817,Region3_DJF_Aqua/Aqua_CorrectedReflectance2015...
30,129486340,raspstephan,1814911.0,80bdc4acf6d39d1ea32e,8073,Full dataset,13.11,2018-11-01 08:29:18 UTC,,,"{'source': 'api', 'session': 'f83f28c58ce8e6e9...","{'task': 'T0', 'task_label': 'Draw bounding bo...","{'27163527': {'retired': None, 'fn': '/project...",27163527,2018-11-01 08:29:18,60835,Region3_SON_Terra/Terra_CorrectedReflectance20...
31,129486353,raspstephan,1814911.0,80bdc4acf6d39d1ea32e,8073,Full dataset,13.11,2018-11-01 08:29:32 UTC,,,"{'source': 'api', 'session': 'f83f28c58ce8e6e9...","{'task': 'T0', 'task_label': 'Draw bounding bo...","{'27140064': {'retired': {'id': 24667644, 'wor...",27140064,2018-11-01 08:29:32,60811,Region1_DJF_Aqua/Aqua_CorrectedReflectance2011...
32,129487308,raspstephan,1814911.0,80bdc4acf6d39d1ea32e,8073,Full dataset,13.11,2018-11-01 08:40:30 UTC,,,"{'source': 'api', 'session': '7c9ad7c01df71e1f...","{'task': 'T0', 'task_label': 'Draw bounding bo...","{'27140103': {'retired': {'id': 24667709, 'wor...",27140103,2018-11-01 08:40:30,60811,Region1_DJF_Aqua/Aqua_CorrectedReflectance2011...


## Splitting the dataset

To work with the dataset, we would like to have easy ways of splitting the data and removing outliers.

First up, we can throw out all not-logged-in users.

Then, we want to split the practice and full dataset workflows, and throw out the 19.13 workflow for the practice dataset which was not live but just for testing. 

Finally, we can split the MPI and LMD datasets by time.

At the end we can also write a function that does all this in one step for use in other notebooks.

In [63]:
clas_drop = clas[clas.user_name.apply(lambda u: 'not-logged-in' not in u)]

In [64]:
len(clas_drop)

34160

In [65]:
clas_full = clas_drop[clas_drop.workflow_name == 'Full dataset']
clas_prac = clas_drop[(clas_drop.workflow_name == 'Practice') & (clas_drop.workflow_version == 24.13)]

In [65]:
len(clas_full), len(clas_prac)

(30311, 3015)

In [71]:
split_date = np.datetime64('2018-11-28')

In [73]:
clas_full_MPI = clas_full[clas_full.datetime.dt.date < split_date]
clas_full_LMD = clas_full[clas_full.datetime.dt.date > split_date]

In [74]:
len(clas_full_MPI), len(clas_full_LMD)

(19898, 10410)
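
The two subsets add up to 19898 + 10410 = 30308 rows, three fewer than the 30311 rows in `clas_full`. A likely explanation (an assumption, not checked against the data) is that both comparisons are strict, so rows dated exactly on the split day end up in neither subset. The sketch below reproduces this effect on made-up timestamps. It compares normalized timestamps with a `pd.Timestamp` instead of comparing `.dt.date` with `np.datetime64` as in the cell above, and all names in it are hypothetical.

```python
import pandas as pd

# Made-up timestamps; the middle one falls on the split day itself.
toy = pd.DataFrame({'datetime': pd.to_datetime([
    '2018-11-02 09:15:00',
    '2018-11-28 12:00:00',
    '2018-11-29 14:00:00',
])})

split = pd.Timestamp('2018-11-28')

# Reduce every timestamp to midnight of its calendar day.
day = toy['datetime'].dt.normalize()

# Strict comparisons on both sides: the row on the split day matches neither.
before = toy[day < split]
after = toy[day > split]

# Making one side inclusive puts every row into exactly one subset.
after_inclusive = toy[day >= split]

print(len(before), len(after), len(after_inclusive))
```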

In [210]:
#EXPORT
def split_classification_df(raw_df, workflow_name=None, workflow_version=None, date_range=None, drop_nli=False):
    """
    Takes as input the raw classification dataframe that comes out of parse_classifications().
    Adds a datetime column. If not None, returns only rows with workflow_name, workflow_version.
    Optionally, returns only labels in a certain date range. Dates must be in string format 'yyyy-mm-dd'.
    Optionally, drops all labels of users that were not-logged-in (nli).
    """
    df = raw_df.copy()
    df['datetime'] = pd.to_datetime(df['created_at'])
    if workflow_name is not None:
        df = df[df.workflow_name == workflow_name]
    if workflow_version is not None:
        df = df[df.workflow_version == workflow_version]
    if date_range is not None:
        df = df[(df.datetime.dt.date > np.datetime64(date_range[0])) & 
                (df.datetime.dt.date < np.datetime64(date_range[1]))]
    if drop_nli:
        df = df[df.user_name.apply(lambda u: 'not-logged-in' not in u)]
    return df

In [94]:
%%time
clas_full_LMD_test = split_classification_df(
    clas, 
    workflow_name='Full dataset', 
    date_range=('2018-11-28', '2019-01-01'),
    drop_nli=True
)

CPU times: user 4.47 s, sys: 0 ns, total: 4.47 s
Wall time: 4.47 s


In [95]:
len(clas_full_LMD_test)

10410

This function is now also in `pyclouds.zooniverse`.

## Extracting the label information

The label information is hidden in the `annotations` column, whose entries are dictionaries. There can be more than one annotation/label for each user and image.

In [118]:
clas.metadata.iloc[-1]

{'source': 'api',
 'session': '67f9a58fcd31b0313ab8dde7d0ded6a74e2732ab43444073c767a66ba164960a',
 'viewport': {'width': 1462, 'height': 1266},
 'started_at': '2018-12-14T16:10:12.234Z',
 'user_agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:63.0) Gecko/20100101 Firefox/63.0',
 'utc_offset': '-3600',
 'finished_at': '2018-12-14T16:10:26.570Z',
 'live_project': True,
 'interventions': {'opt_in': True, 'message': False},
 'user_language': 'en',
 'user_group_ids': [],
 'subject_dimensions': [{'clientWidth': 956,
   'clientHeight': 637,
   'naturalWidth': 2100,
   'naturalHeight': 1400}],
 'workflow_translation_id': '7573'}

In [102]:
clas.annotations.iloc[-1]

{'task': 'T0',
 'task_label': 'Draw bounding boxes around cloud regions',
 'value': [{'x': 1059.060791015625,
   'y': 254.72518920898438,
   'tool': 1,
   'frame': 0,
   'width': 924.7423095703125,
   'height': 542.5449523925781,
   'details': [],
   'tool_label': 'Flower'},
  {'x': 327.6136474609375,
   'y': 202.00827026367188,
   'tool': 0,
   'frame': 0,
   'width': 610.6375122070312,
   'height': 1140.0032043457031,
   'details': [],
   'tool_label': 'Sugar'}]}

To make the data easier to handle later, we will now create a new DataFrame that contains one row per bounding box. In the process, we will also extract the coordinate data plus some more metadata we might need.
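
Before walking through the real data, here is the idea on a toy example. It is only a sketch and uses a different technique from the cells that follow: it collects one dictionary per box and builds the DataFrame at the end, instead of allocating the DataFrame first and filling it cell by cell. The names `toy_clas`, `box_keys`, `rows` and `toy_annos` are hypothetical, and the nested layout mirrors the `annotations` entry printed above.

```python
import pandas as pd

# Toy classification table with the same nesting as the `annotations` column.
toy_clas = pd.DataFrame({
    'classification_id': [1, 2],
    'annotations': [
        {'task': 'T0', 'value': [
            {'x': 10.0, 'y': 20.0, 'width': 100.0, 'height': 50.0, 'tool_label': 'Flower'},
            {'x': 30.0, 'y': 40.0, 'width': 200.0, 'height': 80.0, 'tool_label': 'Sugar'},
        ]},
        {'task': 'T0', 'value': []},  # classification without any box
    ],
})

box_keys = ['x', 'y', 'width', 'height', 'tool_label']
rows = []
for _, row in toy_clas.iterrows():
    # Keep one placeholder row (all None) for classifications without a box.
    boxes = row['annotations']['value'] or [dict.fromkeys(box_keys)]
    for box in boxes:
        rows.append({'classification_id': row['classification_id'],
                     **{k: box[k] for k in box_keys}})

# One row per bounding box, plus one empty row per box-free classification.
toy_annos = pd.DataFrame(rows)
print(toy_annos)
```

Keeping a placeholder row for classifications without boxes preserves the information that a user looked at an image and chose not to draw anything.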

In [89]:
# We need to figure out first how many items we have in order to allocate the new DataFrame
count = 0
for i, row in clas_prac.iterrows():
    if len(row.annotations['value']) == 0: count += 1
    for anno in row.annotations['value']:
        count += 1
count

5070

In [90]:
annos = pd.DataFrame(
    columns=list(clas_prac.columns) + ['x', 'y', 'width', 'height', 'tool_label', 'started_at', 'finished_at'],
    index=np.arange(count)
)

In [91]:
j = 0
for i, row in clas_prac.iterrows():    
    coords = row.annotations['value']
    if len(coords) == 0:
        coords = [{'x': None, 'y': None, 'width': None, 'height': None, 'tool_label': None}]
    for anno in coords:
        for c in clas_prac.columns:
            annos.iloc[j][c] = row[c]
        for coord in ['x', 'y', 'width', 'height', 'tool_label']:
            annos.iloc[j][coord] = anno[coord]
        for meta in ['started_at', 'finished_at']:
            annos.iloc[j][meta] = row.metadata[meta]
        j += 1

In [92]:
j, i

(5070, 33332)

In [93]:
annos

Unnamed: 0,classification_id,user_name,user_id,user_ip,workflow_id,workflow_name,workflow_version,created_at,gold_standard,expert,...,subject_ids,datetime,subject_set_id,x,y,width,height,tool_label,started_at,finished_at
0,129516659,lpaccini,1.83006e+06,dd3ab09f3c57140838ac,8072,Practice,24.13,2018-11-01 14:04:58 UTC,,,...,27139777,2018-11-01 14:04:58,60811,129.2,809.198,136.238,133.176,Flower,2018-11-01T13:59:06.938Z,2018-11-01T14:04:56.893Z
1,129516659,lpaccini,1.83006e+06,dd3ab09f3c57140838ac,8072,Practice,24.13,2018-11-01 14:04:58 UTC,,,...,27139777,2018-11-01 14:04:58,60811,518.013,747.968,168.384,151.545,Flower,2018-11-01T13:59:06.938Z,2018-11-01T14:04:56.893Z
2,129516659,lpaccini,1.83006e+06,dd3ab09f3c57140838ac,8072,Practice,24.13,2018-11-01 14:04:58 UTC,,,...,27139777,2018-11-01 14:04:58,60811,26.6388,816.852,99.4994,143.891,Flower,2018-11-01T13:59:06.938Z,2018-11-01T14:04:56.893Z
3,129516659,lpaccini,1.83006e+06,dd3ab09f3c57140838ac,8072,Practice,24.13,2018-11-01 14:04:58 UTC,,,...,27139777,2018-11-01 14:04:58,60811,12.862,773.991,699.558,163.791,Flower,2018-11-01T13:59:06.938Z,2018-11-01T14:04:56.893Z
4,129516659,lpaccini,1.83006e+06,dd3ab09f3c57140838ac,8072,Practice,24.13,2018-11-01 14:04:58 UTC,,,...,27139777,2018-11-01 14:04:58,60811,1408.92,294.863,609.243,361.259,Sugar,2018-11-01T13:59:06.938Z,2018-11-01T14:04:56.893Z
5,129516659,lpaccini,1.83006e+06,dd3ab09f3c57140838ac,8072,Practice,24.13,2018-11-01 14:04:58 UTC,,,...,27139777,2018-11-01 14:04:58,60811,1153.28,53.0026,462.29,223.491,Gravel,2018-11-01T13:59:06.938Z,2018-11-01T14:04:56.893Z
6,129516659,lpaccini,1.83006e+06,dd3ab09f3c57140838ac,8072,Practice,24.13,2018-11-01 14:04:58 UTC,,,...,27139777,2018-11-01 14:04:58,60811,482.805,83.6178,491.374,433.205,Gravel,2018-11-01T13:59:06.938Z,2018-11-01T14:04:56.893Z
7,129516659,lpaccini,1.83006e+06,dd3ab09f3c57140838ac,8072,Practice,24.13,2018-11-01 14:04:58 UTC,,,...,27139777,2018-11-01 14:04:58,60811,20.5158,-43.4353,414.836,249.514,Sugar,2018-11-01T13:59:06.938Z,2018-11-01T14:04:56.893Z
8,129516820,lpaccini,1.83006e+06,dd3ab09f3c57140838ac,8072,Practice,24.13,2018-11-01 14:05:59 UTC,,,...,27140534,2018-11-01 14:05:59,subject_id 27140534 60811 27140534 60902...,1555.87,354.562,538.828,930.702,Sugar,2018-11-01T14:05:02.931Z,2018-11-01T14:05:58.902Z
9,129516820,lpaccini,1.83006e+06,dd3ab09f3c57140838ac,8072,Practice,24.13,2018-11-01 14:05:59 UTC,,,...,27140534,2018-11-01 14:05:59,subject_id 27140534 60811 27140534 60902...,681.804,565.807,382.69,446.982,Flower,2018-11-01T14:05:02.931Z,2018-11-01T14:05:58.902Z


In [165]:
annos[(annos.subject_ids == 27140443) & (annos.user_name == 'raspstephan')]

Unnamed: 0,classification_id,user_name,user_id,user_ip,workflow_id,workflow_name,workflow_version,created_at,gold_standard,expert,...,subject_data,subject_ids,datetime,x,y,width,height,tool_label,started_at,finished_at
0,134717413,raspstephan,1814910.0,08ce726fe14db3753a3a,8072,Practice,24.13,2018-12-05 16:28:26 UTC,,,...,"{'27140443': {'retired': None, 'fn': '/project...",27140443,2018-12-05 16:28:26,593.934,183.632,1464.12,771.5,Fish,2018-12-05T16:27:54.218Z,2018-12-05T16:28:26.348Z


In [94]:
#EXPORT
def convert_clas_to_annos_df(clas_df):
    """
    Converts a classification pd.DataFrame parsed from the raw Zooniverse file to a pd.DataFrame
    that has one row per bounding box. 
    Additionally, extracts coordinate and metadata information
    """
    # We need to figure out first how many items we have in order to allocate the new DataFrame
    count = 0
    for i, row in clas_df.iterrows():
        if len(row.annotations['value']) == 0: count += 1
        for anno in row.annotations['value']:
            count += 1
    # Allocate new dataframe
    annos_df = pd.DataFrame(
        columns=list(clas_df.columns) + ['x', 'y', 'width', 'height', 'tool_label', 'started_at', 'finished_at'],
        index=np.arange(count)
    )
    # go through each annotation
    j = 0
    for i, row in clas_df.iterrows():
        coords = row.annotations['value']
        if len(coords) == 0:
            coords = [{'x': None, 'y': None, 'width': None, 'height': None, 'tool_label': None}]
        for anno in coords:
            for c in clas_df.columns:
                annos_df.iloc[j][c] = row[c]
            for coord in ['x', 'y', 'width', 'height', 'tool_label']:
                annos_df.iloc[j][coord] = anno[coord]
            for meta in ['started_at', 'finished_at']:
                annos_df.iloc[j][meta] = row.metadata[meta]
            j += 1
    # Convert start and finish times to datetime
    for meta in ['started_at', 'finished_at']:
        annos_df[meta] = pd.to_datetime(annos_df[meta])
    return annos_df

In [95]:
%%time
annos_prac = convert_clas_to_annos_df(clas_prac)

CPU times: user 11.6 s, sys: 0 ns, total: 11.6 s
Wall time: 11.7 s


In [96]:
annos_prac

Unnamed: 0,classification_id,user_name,user_id,user_ip,workflow_id,workflow_name,workflow_version,created_at,gold_standard,expert,...,subject_ids,datetime,subject_set_id,x,y,width,height,tool_label,started_at,finished_at
0,129516659,lpaccini,1.83006e+06,dd3ab09f3c57140838ac,8072,Practice,24.13,2018-11-01 14:04:58 UTC,,,...,27139777,2018-11-01 14:04:58,60811,129.2,809.198,136.238,133.176,Flower,2018-11-01 13:59:06.938,2018-11-01 14:04:56.893
1,129516659,lpaccini,1.83006e+06,dd3ab09f3c57140838ac,8072,Practice,24.13,2018-11-01 14:04:58 UTC,,,...,27139777,2018-11-01 14:04:58,60811,518.013,747.968,168.384,151.545,Flower,2018-11-01 13:59:06.938,2018-11-01 14:04:56.893
2,129516659,lpaccini,1.83006e+06,dd3ab09f3c57140838ac,8072,Practice,24.13,2018-11-01 14:04:58 UTC,,,...,27139777,2018-11-01 14:04:58,60811,26.6388,816.852,99.4994,143.891,Flower,2018-11-01 13:59:06.938,2018-11-01 14:04:56.893
3,129516659,lpaccini,1.83006e+06,dd3ab09f3c57140838ac,8072,Practice,24.13,2018-11-01 14:04:58 UTC,,,...,27139777,2018-11-01 14:04:58,60811,12.862,773.991,699.558,163.791,Flower,2018-11-01 13:59:06.938,2018-11-01 14:04:56.893
4,129516659,lpaccini,1.83006e+06,dd3ab09f3c57140838ac,8072,Practice,24.13,2018-11-01 14:04:58 UTC,,,...,27139777,2018-11-01 14:04:58,60811,1408.92,294.863,609.243,361.259,Sugar,2018-11-01 13:59:06.938,2018-11-01 14:04:56.893
5,129516659,lpaccini,1.83006e+06,dd3ab09f3c57140838ac,8072,Practice,24.13,2018-11-01 14:04:58 UTC,,,...,27139777,2018-11-01 14:04:58,60811,1153.28,53.0026,462.29,223.491,Gravel,2018-11-01 13:59:06.938,2018-11-01 14:04:56.893
6,129516659,lpaccini,1.83006e+06,dd3ab09f3c57140838ac,8072,Practice,24.13,2018-11-01 14:04:58 UTC,,,...,27139777,2018-11-01 14:04:58,60811,482.805,83.6178,491.374,433.205,Gravel,2018-11-01 13:59:06.938,2018-11-01 14:04:56.893
7,129516659,lpaccini,1.83006e+06,dd3ab09f3c57140838ac,8072,Practice,24.13,2018-11-01 14:04:58 UTC,,,...,27139777,2018-11-01 14:04:58,60811,20.5158,-43.4353,414.836,249.514,Sugar,2018-11-01 13:59:06.938,2018-11-01 14:04:56.893
8,129516820,lpaccini,1.83006e+06,dd3ab09f3c57140838ac,8072,Practice,24.13,2018-11-01 14:05:59 UTC,,,...,27140534,2018-11-01 14:05:59,subject_id 27140534 60811 27140534 60902...,1555.87,354.562,538.828,930.702,Sugar,2018-11-01 14:05:02.931,2018-11-01 14:05:58.902
9,129516820,lpaccini,1.83006e+06,dd3ab09f3c57140838ac,8072,Practice,24.13,2018-11-01 14:05:59 UTC,,,...,27140534,2018-11-01 14:05:59,subject_id 27140534 60811 27140534 60902...,681.804,565.807,382.69,446.982,Flower,2018-11-01 14:05:02.931,2018-11-01 14:05:58.902
