# I downloaded my classifications; now what?

This notebook helps you take a first look at your data and get basic information about it: how many classifications there are, how many classifiers (signed in and not signed in), and so on. It was originally developed with Python 2.7; the output shown here comes from a run under Python 3.6.

There are scripts to do this, but first let's just get a sense of what the data looks like.

Before we begin, though, we will need the following packages:

In [12]:
import sys, os
import numpy as np
import pandas as pd

print("Python version: %d.%d.%d, numpy version: %s, pandas version: %s. \nOriginally developed using Py 2.7.11, np v1.11.0, pd v0.19.2" %(sys.version_info[0], sys.version_info[1], sys.version_info[2], np.__version__, pd.__version__))
print("If these versions don't match and stuff breaks, that's probably why.")

Python version: 3.6.3, numpy version: 1.15.3, pandas version: 0.20.3. 
Originally developed using Py 2.7.11, np v1.11.0, pd v0.19.2
If these versions don't match and stuff breaks, that's probably why.


First, set your project's name. The example here uses `western-montana-wildlife`. We'll make that a variable below, because every file we need to access (classifications file, workflow contents file, etc.) starts with that name.

In [10]:
project_name = "western-montana-wildlife"

classification_file = project_name + "-classifications.csv"

print(classification_file)

western-montana-wildlife-classifications.csv


In [9]:
!ls -la

total 1204928
drwxr-xr-x  7 fresh_171228  staff        224 Nov 14 18:03 .
drwxr-xr-x  6 fresh_171228  staff        192 Nov 14 17:57 ..
drwxr-xr-x  3 fresh_171228  staff         96 Nov 14 18:00 .ipynb_checkpoints
-rw-r--r--  1 fresh_171228  staff      14730 Nov 14 18:02 00 - First Look at Classifications (Notebook - Python 2.7).ipynb
-rw-r--r--@ 1 fresh_171228  staff  606815708 Nov 14 17:54 western-montana-wildlife-classifications.csv
-rw-r--r--@ 1 fresh_171228  staff     386709 Nov 14 17:54 western-montana-wildlife-workflow_contents.csv
-rw-r--r--@ 1 fresh_171228  staff    8921639 Nov 14 17:54 western-montana-wildlife-workflows.csv
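
The same naming pattern covers all three export files in the listing above. As a minimal sketch (the `suffixes` list and `export_files` dictionary are names introduced here for illustration), the filenames can be built from `project_name` and checked before anything is read:

```python
import os

project_name = "western-montana-wildlife"

# Suffixes taken from the directory listing above
suffixes = ["classifications", "workflow_contents", "workflows"]

# Map each kind of export to its expected filename
export_files = {s: "%s-%s.csv" % (project_name, s) for s in suffixes}

for kind, path in export_files.items():
    status = "found" if os.path.isfile(path) else "MISSING"
    print("%-18s %s (%s)" % (kind, path, status))
```

Checking up front gives a clearer message than the error `pd.read_csv` raises when a file is missing or the notebook is started from the wrong directory.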


Now let's read in that file, using a package called `pandas` which is designed to handle large tables.

In [13]:
classifications_all = pd.read_csv(classification_file)
n_class = len(classifications_all)

print("File %s read with %d rows." % (classification_file, n_class))

File western-montana-wildlife-classifications.csv read with 492111 rows.




The number of rows, which we've saved as `n_class`, is the same as the total number of classifications recorded in this file. 

**Note:** The more classifications in your file, the more memory your computer needs to work with them using `pandas`. In my experience, a few million rows isn't a problem as long as you have at least 8 GB of RAM. If your file has far more rows than that, you may need a machine with more memory than a laptop, a script that doesn't hold every row in memory at once, or a package meant to be parallelized, like `dask`.
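
One way to avoid holding every row in memory is the `chunksize` argument of `pd.read_csv`, which reads the file in pieces. The sketch below uses a tiny in-memory CSV with made-up rows as a stand-in for a real export; in practice you would pass `classification_file` instead and pick a much larger chunk size.

```python
import io
import pandas as pd

# Stand-in for a large classification export (rows are made up);
# in practice, pass the path held in classification_file instead.
sample_csv = io.StringIO(
    "classification_id,user_name,workflow_id\n"
    "1,alice,3101\n"
    "2,bob,3101\n"
    "3,not-logged-in-abc123,3101\n"
    "4,alice,3101\n"
    "5,bob,3101\n"
)

n_rows = 0
users_seen = set()

# With chunksize set, read_csv returns an iterator of DataFrames,
# so only one chunk is held in memory at a time.
for chunk in pd.read_csv(sample_csv, chunksize=2):
    n_rows += len(chunk)
    users_seen.update(chunk["user_name"].unique())

print("%d rows from %d classifiers" % (n_rows, len(users_seen)))
```

Running totals like these work well with chunks. Anything that needs the whole table at once, such as sorting, does not, and that is where something like `dask` may be the better fit.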

What does each classification actually contain? Here are the column headers:

In [14]:
classifications_all.columns

Index(['classification_id', 'user_name', 'user_id', 'user_ip', 'workflow_id',
       'workflow_name', 'workflow_version', 'created_at', 'gold_standard',
       'expert', 'metadata', 'annotations', 'subject_data', 'subject_ids'],
      dtype='object')

Each row in the file (i.e., each classification) includes:

 - **classification_id** - the unique ID assigned to each classification
 - **user_name** - the username the classifier chose when they registered on the site (this is public-facing as it's what they're identified with when they post on Talk)
 - **user_id** - the user's ID number in the Zooniverse database (this is not public; in the example file they've been hashed)
 - **user_ip** - a hashed version of the user's IP address
 - **workflow_id** - the ID number of the workflow this classification was recorded in
 - **workflow_name** - the text name of the workflow this classification was recorded in
 - **workflow_version** - the version number (format `major.minor`) of the workflow
 - **created_at** - the timestamp from when the classification was recorded
 - **gold_standard** - a flag set when the classification was submitted in gold standard mode (empty for the rows shown in this notebook)
 - **expert** - a flag set when the classification was submitted in expert mode (also empty for the rows shown in this notebook)
 - **metadata** - metadata from the classification such as browser information, operating system used
 - **annotations** - the actual information from the classification (answers / clicks / species identifications / etc, specific to this workflow id+version)
 - **subject_data** - the data on the subject that was uploaded as part of the subject upload
 - **subject_ids** - the unique identifier of all subjects classified in this classification (typically 1 subject)
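
The **metadata**, **annotations** and **subject_data** columns are stored as JSON text, so they need decoding before their contents can be used. Here is a minimal sketch with a hypothetical one-row table; the JSON layout imitates the fragments visible in this export, but the real structure depends on your workflow.

```python
import json
import pandas as pd

# Hypothetical one-row example; the JSON layouts are illustrative
# and the real structure depends on the workflow.
example = pd.DataFrame({
    "classification_id": [1],
    "annotations": ['[{"task": "T1", "value": [{"choice": "MOUNTAINLION"}]}]'],
    "metadata": ['{"source": "api", "session": "abc"}'],
})

# Decode the JSON strings into Python lists / dicts
example["annotations_parsed"] = example["annotations"].apply(json.loads)
example["metadata_parsed"] = example["metadata"].apply(json.loads)

# The first task of the first classification
first_annotation = example["annotations_parsed"].iloc[0][0]
print(first_annotation["task"], first_annotation["value"][0]["choice"])
print(example["metadata_parsed"].iloc[0]["source"])
```

Decoding every row of a large export takes a while, so it may be worth filtering down to the rows you care about first.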
 
We can also quickly look at the last few rows in raw form:

In [16]:
classifications_all.tail()

Unnamed: 0,classification_id,user_name,user_id,user_ip,workflow_id,workflow_name,workflow_version,created_at,gold_standard,expert,metadata,annotations,subject_data,subject_ids
492106,128749350,SIUWildlifer87,1833099.0,2aeb98d9ed7f8226c468,3101,Winter Eagle Project,225.146,2018-10-26 15:21:33 UTC,,,"{""source"":""api"",""session"":""84f9bac0aacdd40cfc6...","[{""task"":""T1"",""value"":[{""choice"":""IMMATUREGOLD...","{""24087619"":{""retired"":null,""url"":""https://zoo...",24087619
492107,128749520,SIUWildlifer87,1833099.0,2aeb98d9ed7f8226c468,3101,Winter Eagle Project,225.146,2018-10-26 15:22:13 UTC,,,"{""source"":""api"",""session"":""84f9bac0aacdd40cfc6...","[{""task"":""T1"",""value"":[{""choice"":""MOUNTAINLION...","{""24068453"":{""retired"":null,""url"":""https://zoo...",24068453
492108,128750378,SIUWildlifer87,1833099.0,2aeb98d9ed7f8226c468,3101,Winter Eagle Project,225.146,2018-10-26 15:25:52 UTC,,,"{""source"":""api"",""session"":""84f9bac0aacdd40cfc6...","[{""task"":""T1"",""value"":[{""choice"":""BLACKBILLEDM...","{""24084790"":{""retired"":null,""url"":""https://zoo...",24084790
492109,128750965,SallySue,1822809.0,8f36c65862263c0fc770,3101,Winter Eagle Project,225.146,2018-10-26 15:28:49 UTC,,,"{""source"":""api"",""session"":""141704adf080be49b33...","[{""task"":""T1"",""value"":[{""choice"":""MOUNTAINLION...","{""24066647"":{""retired"":null,""url"":""https://zoo...",24066647
492110,128757714,SallySue,1822809.0,8f36c65862263c0fc770,3101,Winter Eagle Project,225.146,2018-10-26 16:22:38 UTC,,,"{""source"":""api"",""session"":""141704adf080be49b33...","[{""task"":""T1"",""value"":[{""choice"":""BLACKBILLEDM...","{""24066690"":{""retired"":null,""url"":""https://zoo...",24066690


Even if you ignore the classification annotations themselves, there's still a lot of information in this classification file. Let's find out some other basic information about the classifications. 

In [17]:
users_all = classifications_all['user_name'].unique()
n_users = len(users_all)

# if the classification is from a classifier who isn't signed in, the user_name field has "not-logged-in-[user_ip]"
is_unreg = np.array([q.startswith("not-logged-in") for q in users_all])
is_reg   = np.invert(is_unreg)

n_unreg = sum(is_unreg)
n_reg   = sum(is_reg)

print("%d classifications from %d classifiers, of which %d (%.0f percent) were signed-in and %d (%.0f percent) were not signed in.\n" % (n_class, n_users, n_reg, (float(n_reg)/float(n_users)*100.), n_unreg, (float(n_unreg)/float(n_users)*100.)))

print("Average classifications per user: %.1f" % (float(n_class)/float(n_users)))

492111 classifications from 7228 classifiers, of which 4503 (62 percent) were signed-in and 2725 (38 percent) were not signed in.

Average classifications per user: 68.1
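
The mean alone says little about how classifications are spread across classifiers. As a sketch on a toy table (the user names are made up, and `toy` stands in for `classifications_all`), `value_counts` gives the number of classifications per classifier, from which a median follows directly:

```python
import pandas as pd

# Toy stand-in for classifications_all; user names are made up.
toy = pd.DataFrame({
    "user_name": ["alice", "alice", "alice", "bob",
                  "not-logged-in-abc123", "not-logged-in-abc123"],
})

# Number of classifications per classifier
counts = toy["user_name"].value_counts()

# Vectorized check for classifiers who were not signed in
unreg_mask = counts.index.str.startswith("not-logged-in")

print("Mean per classifier:   %.1f" % counts.mean())
print("Median per classifier: %.1f" % counts.median())
print("Not signed in: %d of %d classifiers" % (unreg_mask.sum(), len(counts)))
```

The `.str.startswith` call does the same job as the list comprehension above, applied to the whole index at once.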


In [18]:
# use created_at to print date range for classifications
# (takes the first and last rows, so this assumes the file is in chronological order)
print("Classifications registered between %s and %s." % (classifications_all['created_at'][classifications_all.index[0]], classifications_all['created_at'][classifications_all.index[-1]]))

Classifications registered between 2016-12-09 21:40:26 UTC and 2018-10-26 16:22:38 UTC.
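
Taking the first and last rows only gives the true range if the file is in chronological order. A sketch that does not rely on row order converts `created_at` to real timestamps first. The format string below is an assumption based on the timestamps printed in this notebook, and `created` is a toy stand-in for the column.

```python
import pandas as pd

# Toy stand-in for the created_at column, deliberately out of order.
created = pd.Series([
    "2018-10-26 16:22:38 UTC",
    "2016-12-09 21:40:26 UTC",
    "2017-05-01 08:00:00 UTC",
])

# The literal "UTC" suffix is matched by the format string;
# utc=True then marks the resulting timestamps as UTC.
created_ts = pd.to_datetime(created, format="%Y-%m-%d %H:%M:%S UTC", utc=True)

first, last = created_ts.min(), created_ts.max()
print("Classifications registered between %s and %s." % (first, last))
print("Span: %d days" % (last - first).days)
```

Once the column holds timestamps rather than strings, differences between them (such as the span above) come for free.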


In [19]:
# print out the classification ID of the last classification (useful in some cases)
print("Latest classification ID in this file: %d" % classifications_all['classification_id'][classifications_all.index[-1]])

Latest classification ID in this file: 128757714


There's more we could do here: compute medians as well as averages, figure out the typical time it takes for a user to complete a classification, work out how many hours of human effort were spent classifying, etc. We could also clean the classification export of duplicate and non-live classifications, and isolate classifications from just the workflow ID + version that we want to actually analyze.

However, that's for the next notebook!
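
As a small preview, here is a rough sketch of two of those cleaning steps on a toy table. It treats a repeated `classification_id` as a duplicate and selects one workflow ID plus one major version; both choices are assumptions made for illustration, and the values are made up.

```python
import pandas as pd

# Toy stand-in for classifications_all; values are illustrative only.
toy = pd.DataFrame({
    "classification_id": [1, 2, 2, 3, 4],
    "workflow_id":       [3101, 3101, 3101, 3101, 2000],
    "workflow_version":  [225.146, 225.146, 225.146, 224.1, 1.1],
})

# Drop rows that repeat a classification_id, keeping the first copy
deduped = toy.drop_duplicates(subset="classification_id", keep="first")

# Keep only one workflow ID and the major version of interest
wanted_id, wanted_major = 3101, 225
in_workflow = deduped["workflow_id"] == wanted_id
in_version = deduped["workflow_version"].astype(int) == wanted_major

subset = deduped[in_workflow & in_version]
print("%d of %d rows kept" % (len(subset), len(toy)))
```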