# I downloaded my classifications; now what?

This notebook helps you take a first look at your data and extract basic information from it, such as the number of classifications and the number of classifiers (signed in and not signed in). It uses Python 2.7.

There are scripts to do this, but first let's just get a sense of what the data looks like.

Before we begin, though, we will need the following packages:

In [1]:
import sys, os
import numpy as np
import pandas as pd

print("Python version: %d.%d.%d, numpy version: %s, pandas version: %s. \nOriginally developed using Py 2.7.11, np v1.11.0, pd v0.19.2" %(sys.version_info[0], sys.version_info[1], sys.version_info[2], np.__version__, pd.__version__))
print("If these versions don't match and stuff breaks, that's probably why.")

Python version: 2.7.11, numpy version: 1.11.0, pandas version: 0.19.2. 
Originally developed using Py 2.7.11, np v1.11.0, pd v0.19.2
If these versions don't match and stuff breaks, that's probably why.


First, let's say your project is called "My Project", so its export files start with the prefix `my-project`. We'll store that prefix in a variable below, because all of the files we need to access (classifications file, workflow contents file, etc.) start with it.

In [2]:
project_name = "my-project"

classification_file = project_name + "-classifications.csv"

print(classification_file)

my-project-classifications.csv


Now let's read in that file, using a package called `pandas` which is designed to handle large tables.

In [9]:
classifications_all = pd.read_csv(classification_file)
n_class = len(classifications_all)

print("File %s read with %d rows." % (classification_file, n_class))

File my-project-classifications.csv read with 50000 rows.


The number of rows, which we've saved as `n_class`, is the same as the total number of classifications recorded in this file. 

**Note:** The more classifications in your file, the more memory your computer needs to work with them in `pandas`. In my experience, a few million rows isn't a problem as long as you have at least 8 GB of RAM. If your file has many more rows than that, you may need a machine with more memory than a laptop, a script that doesn't try to hold all the rows in memory at once, or a package built for parallel processing, like `dask`.
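As a rough illustration of the "don't hold everything in memory" option, `pd.read_csv` accepts a `chunksize` argument and then yields the file in pieces. The sketch below uses a tiny invented CSV (`demo_csv`) in place of the real export so that it runs on its own; with real data you would pass `classification_file` instead and choose a much larger chunk size.

```python
import io
import pandas as pd

# Stand-in for the real export; in practice pass classification_file instead.
# (This tiny CSV is invented for the example.)
demo_csv = io.StringIO(
    u"classification_id,user_name\n"
    u"1,alice\n"
    u"2,bob\n"
    u"3,alice\n"
    u"4,not-logged-in-abc123\n"
    u"5,bob\n"
)

n_rows = 0
user_names = set()

# With chunksize set, read_csv returns an iterator of DataFrames,
# each holding at most that many rows.
for chunk in pd.read_csv(demo_csv, chunksize=2):
    n_rows += len(chunk)
    user_names.update(chunk["user_name"].unique())

print("%d rows from %d classifiers" % (n_rows, len(user_names)))
```

Only running totals are kept between chunks, so peak memory depends on the chunk size rather than on the size of the whole file.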

What does each classification actually contain? Here are the column headers (*note: this example file may not have all the columns your actual data file will contain*):

In [10]:
classifications_all.columns

Index([u'classification_id', u'user_name', u'user_id', u'user_ip',
       u'workflow_id', u'workflow_version', u'created_at', u'metadata',
       u'subject_ids'],
      dtype='object')

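In a full export, each row (i.e., each classification) includes the columns below. Of these, `workflow_name`, `annotations` and `subject_data` are absent from the example file: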

 - **classification_id** - the unique ID assigned to each classification
 - **user_name** - the username the classifier chose when they registered on the site (this is public-facing as it's what they're identified with when they post on Talk)
 - **user_id** - the user's ID number in the Zooniverse database (this is not public; in the example file they've been hashed)
 - **user_ip** - a hashed version of the user's IP address
 - **workflow_id** - the ID number of the workflow this classification was recorded in
 - **workflow_name** - the text name of the workflow this classification was recorded in
 - **workflow_version** - the version number (format `major.minor`) of the workflow
 - **created_at** - the timestamp from when the classification was recorded
 - **metadata** - metadata from the classification such as browser information, operating system used
 - **annotations** - the actual information from the classification (answers / clicks / species identifications / etc, specific to this workflow id+version)
 - **subject_data** - the data on the subject that was uploaded as part of the subject upload
 - **subject_ids** - the unique identifier of all subjects classified in this classification (typically 1 subject)
 
We can also quickly look at the first few rows in raw form:

In [11]:
classifications_all.head()

| | classification_id | user_name | user_id | user_ip | workflow_id | workflow_version | created_at | metadata | subject_ids |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 71432114 | klmasters | 9762938000.0 | 152e3f84bcf873903f81 | 4958 | 17.6 | 2017-09-21 20:38:58 UTC | {"session":"a72e45416496889a98c50417d2ad7e5aa8... | 13093797 |
| 1 | 71432154 | klmasters | 9762938000.0 | 152e3f84bcf873903f81 | 4958 | 17.6 | 2017-09-21 20:39:19 UTC | {"session":"a72e45416496889a98c50417d2ad7e5aa8... | 13094432 |
| 2 | 71432192 | klmasters | 9762938000.0 | 152e3f84bcf873903f81 | 4958 | 17.6 | 2017-09-21 20:39:43 UTC | {"session":"a72e45416496889a98c50417d2ad7e5aa8... | 13094624 |
| 3 | 71432209 | klmasters | 9762938000.0 | 152e3f84bcf873903f81 | 4958 | 17.6 | 2017-09-21 20:39:53 UTC | {"session":"a72e45416496889a98c50417d2ad7e5aa8... | 13094977 |
| 4 | 71432225 | klmasters | 9762938000.0 | 152e3f84bcf873903f81 | 4958 | 17.6 | 2017-09-21 20:40:01 UTC | {"session":"a72e45416496889a98c50417d2ad7e5aa8... | 13098523 |


*Note: user IDs have been changed for this example file and are not valid Zooniverse user IDs.*
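The `metadata` values shown above are JSON stored as strings, so they need to be parsed before individual fields can be used. Here is a minimal sketch of one way to do that with the standard `json` module. The `demo` table is invented, and because the real strings are truncated in the output above, every key other than `session` (here `user_agent`) is hypothetical.

```python
import json
import pandas as pd

# Invented stand-in rows: the real metadata strings are truncated in the
# output above, so the "user_agent" key and all values are hypothetical.
demo = pd.DataFrame({
    "classification_id": [1, 2],
    "metadata": [
        '{"session": "abc", "user_agent": "Firefox"}',
        '{"session": "def", "user_agent": "Chrome"}',
    ],
})

# Turn each JSON string into a Python dict.
demo["metadata_parsed"] = demo["metadata"].apply(json.loads)

# Pull one key out into its own column; .get() returns None if it is missing.
demo["session"] = demo["metadata_parsed"].apply(lambda m: m.get("session"))

print(demo[["classification_id", "session"]])
```

Parsing every row this way can be slow on a large file, so it is usually worth doing only for the columns and keys you actually need.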

Even if you ignore the classification annotations themselves (as this example file does), there's still a lot of information in this classification file. Let's find out some other basic information about the classifications.

In [12]:
users_all = classifications_all['user_name'].unique()
n_users = len(users_all)

# if the classification is from a classifier who isn't signed in, the user_name field has "not-logged-in-[user_ip]"
is_unreg = np.array([q.startswith("not-logged-in") for q in users_all])
is_reg   = np.invert(is_unreg)

n_unreg = sum(is_unreg)
n_reg   = sum(is_reg)

print("%d classifications from %d classifiers, of which %d (%.0f percent) were signed-in and %d (%.0f percent) were not signed in.\n" % (n_class, n_users, n_reg, (float(n_reg)/float(n_users)*100.), n_unreg, (float(n_unreg)/float(n_users)*100.)))

print("Average classifications per user: %.1f" % (float(n_class)/float(n_users)))

50000 classifications from 919 classifiers, of which 742 (81 percent) were signed-in and 177 (19 percent) were not signed in.

Average classifications per user: 54.4
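Note that the percentages above are fractions of classifiers, not of classifications. If you also want to know how many classifications each group contributed, one possible approach is to build the same `not-logged-in` test per row instead of per unique name. The sketch below runs on an invented `demo` table; with real data you would use `classifications_all` in its place.

```python
import pandas as pd

# Invented stand-in for classifications_all; only user_name matters here.
demo = pd.DataFrame({
    "user_name": ["alice", "alice", "alice", "bob",
                  "not-logged-in-abc123", "not-logged-in-abc123"],
})

# Boolean mask with one entry per classification (row), not per classifier.
row_is_unreg = demo["user_name"].str.startswith("not-logged-in")

n_class_unreg = int(row_is_unreg.sum())
n_class_reg = len(demo) - n_class_unreg

# Number of classifications made by each classifier, largest first.
class_per_user = demo["user_name"].value_counts()

print("%d signed-in and %d not-signed-in classifications" % (n_class_reg, n_class_unreg))
print(class_per_user)
```

The `class_per_user` series is also a convenient starting point for looking at how unevenly classifications are spread across classifiers.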


In [13]:
# use created_at to print date range for classifications
# (takes the first and last rows, so this assumes the file is sorted by created_at)
print("Classifications registered between %s and %s." % (classifications_all['created_at'].iloc[0], classifications_all['created_at'].iloc[-1]))

Classifications registered between 2017-09-21 20:38:58 UTC and 2017-09-22 16:41:16 UTC.
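Reading the range from the first and last rows only works if the rows are in chronological order. A sketch of an order-independent alternative is to parse `created_at` into real timestamps and take the minimum and maximum. The format string below is an assumption based on the timestamps printed above, and `demo` is an invented stand-in for `classifications_all`.

```python
import pandas as pd

# Invented stand-in rows, deliberately out of chronological order.
demo = pd.DataFrame({
    "created_at": [
        "2017-09-22 16:41:16 UTC",
        "2017-09-21 20:38:58 UTC",
        "2017-09-21 20:39:19 UTC",
    ],
})

# Assumption: every timestamp looks like "YYYY-MM-DD HH:MM:SS UTC".
created = pd.to_datetime(demo["created_at"],
                         format="%Y-%m-%d %H:%M:%S UTC", utc=True)

first, last = created.min(), created.max()
print("Classifications registered between %s and %s." % (first, last))
```

Once the column is parsed, differences between timestamps are also easy to compute, which helps with the timing questions mentioned at the end of this notebook.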


In [14]:
# print out the classification ID of the last classification (useful in some cases)
print("Latest classification ID in this file: %d" % classifications_all['classification_id'][classifications_all.index[-1]])

Latest classification ID in this file: 71523948


There's more we could do here: compute medians as well as averages, figure out the typical time it takes for a user to complete a classification, work out how many hours of human effort were spent classifying, etc. We could also clean the classification export of duplicate and non-live classifications, and isolate classifications from just the workflow ID + version that we want to actually analyze.
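As a small preview of the last of these steps, here is a minimal sketch of dropping duplicates and isolating one workflow ID + version. It runs on an invented `demo` table, and it assumes that rows sharing a `classification_id` are duplicates, which may not be the only kind of duplicate you care about. The workflow ID and version simply echo the sample output above.

```python
import pandas as pd

# Invented stand-in rows; the workflow ID and version echo the sample output.
demo = pd.DataFrame({
    "classification_id": [10, 11, 11, 12, 13],
    "workflow_id":       [4958, 4958, 4958, 4958, 5000],
    "workflow_version":  [17.6, 17.6, 17.6, 16.2, 3.1],
})

workflow_id = 4958
workflow_version = 17.6

# Assumption: rows sharing a classification_id are duplicates of one another.
deduped = demo.drop_duplicates(subset="classification_id")

# Keep only the workflow ID + version to be analyzed.
keep = ((deduped["workflow_id"] == workflow_id) &
        (deduped["workflow_version"] == workflow_version))
selected = deduped[keep]

print("%d of %d rows kept" % (len(selected), len(demo)))
```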

However, that's for the next notebook!