# Citizen scientists accuracy

The following scripts calculate how accurately citizen scientists identify deep water corals. First, we retrieve the citizen scientists' annotations; then we compare them with the annotations provided by an expert diver.

# Requirements

### Install required packages

We use the "panoptes_client" package to communicate with Zooniverse. It depends on "python-magic", which in turn needs the libmagic system library. If you don't have them installed, run the commands below.

In [1]:
!pip install panoptes_client
!apt-get install libmagic-dev
!pip install python-magic

Collecting panoptes_client
  Downloading https://files.pythonhosted.org/packages/55/6d/09aee478aedcbdc87825eb39bb8593392dc1743b3066d25ba9ec35aa75b0/panoptes_client-1.3.0.tar.gz
Collecting python-magic<0.5,>=0.4
  Downloading https://files.pythonhosted.org/packages/59/77/c76dc35249df428ce2c38a3196e2b2e8f9d2f847a8ca1d4d7a3973c28601/python_magic-0.4.18-py2.py3-none-any.whl
Collecting redo>=1.7
  Downloading https://files.pythonhosted.org/packages/f0/df/6eaeece84b3b6a51663075ae25089ec9b49e90b687ddca6f1fe0f93ab091/redo-2.0.4.tar.gz
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: redo
  Building wheel for redo (PEP 517) ... done
  Created wheel for redo: filename=redo-2.0.4-cp36-none-any.whl size=11931 sha256=2b2b4c5647ce348e1708804639a1d0bfdc9550ed9aed6cca16dbf96973d2e02e
  Stored in directory: /root/.cache/pip/wheels/7e/ca/39/

### Load required libraries

In [2]:
import io
import getpass
import zipfile
import json
import gzip
import pandas as pd
import numpy as np

from google.colab import drive
from datetime import date
from sklearn import metrics
from panoptes_client import (
    SubjectSet,
    Subject,
    Project,
    Panoptes,
) 

### Connect to Zooniverse

In [3]:
# Your user name and password for Zooniverse. 
zoo_user = getpass.getpass('Enter your Zooniverse user')
zoo_pass = getpass.getpass('Enter your Zooniverse password')


# Connect to Zooniverse with your username and password
auth = Panoptes.connect(username=zoo_user, password=zoo_pass)

# AuthenticationError was never defined or imported, so raise a built-in exception
if not auth.logged_in:
    raise RuntimeError("Your credentials are invalid. Please try again.")

# Connect to the Zooniverse project (our project # is 9747)
project = Project(9747)

Enter your Zooniverse user··········
Enter your Zooniverse password··········


# Download Zooniverse classifications information

In [4]:
# Get classifications from Zooniverse
export = project.get_export("classifications")

# Save the response as pandas data frame
class_df = pd.read_csv(
    io.StringIO(export.content.decode("utf-8")),
    usecols=[
        "subject_ids",
        "classification_id",
        "workflow_id",
        "workflow_version",
        "annotations",
        "created_at",
        "user_name",
    ],
)

## Specify the video and frame workflows

In [5]:
workflow_clip = 11767
workflow_clip_version = 227
workflow_frame = 12852
workflow_frame_version = 21.85 #Should this be 21.43?
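
One way to settle which version number to use is to list the versions that actually occur in the export. The sketch below is only an illustration: `toy_class_df` and its values are made up to stand in for `class_df`, and the same `groupby` can be run on the real data frame.

```python
import pandas as pd

# Hypothetical stand-in for class_df; the real one comes from the Zooniverse export
toy_class_df = pd.DataFrame(
    {
        "workflow_id": [11767, 11767, 12852, 12852, 12852],
        "workflow_version": [227.0, 227.0, 21.43, 21.85, 21.85],
    }
)

# Count the classifications submitted under each workflow id and version
version_counts = (
    toy_class_df.groupby(["workflow_id", "workflow_version"])
    .size()
    .reset_index(name="n_classifications")
)

print(version_counts)
```

The version with the bulk of the classifications is usually the one to filter on.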

### Format video annotations

In [6]:
# Filter clip classifications (exact workflow id, this version or newer)
class_clip = class_df[
    (class_df.workflow_id == workflow_clip)
    & (class_df.workflow_version >= workflow_clip_version)
].reset_index()

# Create an empty list
rows_list = []

# Loop through each classification submitted by the users
for index, row in class_clip.iterrows():
    # Load annotations as json format
    annotations = json.loads(row["annotations"])

    # Select the information from the species identification task
    for ann_i in annotations:
        if ann_i["task"] == "T4":

            # Select each species annotated and flatten the relevant answers
            for value_i in ann_i["value"]:
                choice_i = {}
                # Default to blank follow-up answers so that values never
                # carry over from a previous choice
                f_time = ""
                inds = ""
                # If choice = species, flatten follow-up answers
                # ('nothing here' has no follow-up answers and stays blank)
                if value_i["choice"] != "NOTHINGHERE":
                    answers = value_i["answers"]
                    for k in answers.keys():
                        if "FIRSTTIME" in k:
                            f_time = answers[k].replace("S", "")
                        if "INDIVIDUAL" in k:
                            inds = answers[k]

                # Save the classification id, the species chosen and the follow-up answers
                choice_i.update(
                    {
                        "classification_id": row["classification_id"],
                        "label": value_i["choice"],
                        "first_seen": f_time,
                        "how_many": inds,
                    }
                )

                rows_list.append(choice_i)

# Create a data frame with annotations as rows
class_clips_df = pd.DataFrame(
    rows_list, columns=["classification_id", "label", "first_seen", "how_many"]
)

# Specify the type of columns of the df
class_clips_df["how_many"] = pd.to_numeric(class_clips_df["how_many"])
class_clips_df["first_seen"] = pd.to_numeric(class_clips_df["first_seen"])

# Add subject id to each annotation
class_clips_df = pd.merge(
    class_clips_df,
    class_clip.drop(columns=["annotations"]),
    how="left",
    on="classification_id",
)
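
To make the flattening step easier to follow, here is a minimal sketch of the same logic applied to a single toy classification. The function name `flatten_clip_annotations`, the answer keys (`FIRSTTIMESEEN`, `HOWMANYINDIVIDUALS`) and the values are hypothetical; only the overall shape (a list of tasks, each with a list of choices and their answers) is taken from the code above.

```python
import json


def flatten_clip_annotations(classification_id, annotations_json, task="T4"):
    """Return one dict per species choice found in the given task."""
    rows = []
    for ann in json.loads(annotations_json):
        if ann["task"] != task:
            continue
        for value in ann["value"]:
            # Blank defaults cover 'nothing here' and missing answers
            first_seen, how_many = "", ""
            for key, answer in value.get("answers", {}).items():
                if "FIRSTTIME" in key:
                    first_seen = answer.replace("S", "")
                if "INDIVIDUAL" in key:
                    how_many = answer
            rows.append(
                {
                    "classification_id": classification_id,
                    "label": value["choice"],
                    "first_seen": first_seen,
                    "how_many": how_many,
                }
            )
    return rows


# Hypothetical annotation string with one species and one 'nothing here' choice
toy_annotations = json.dumps(
    [
        {"task": "T0", "value": "ignored"},
        {
            "task": "T4",
            "value": [
                {
                    "choice": "DEEPWATERCORAL",
                    "answers": {"FIRSTTIMESEEN": "3S", "HOWMANYINDIVIDUALS": "2"},
                },
                {"choice": "NOTHINGHERE", "answers": {}},
            ],
        },
    ]
)

rows = flatten_clip_annotations(1, toy_annotations)
print(rows)
```

Each choice becomes one row, so a classification with two species yields two rows that share the same `classification_id`.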

## Select subjects classified by an expert and citizen scientists

In [28]:
# Find out subjects classified by expert (CallieSanDiego)
subjects_comparison = class_clips_df[(class_clips_df.user_name == "CallieSanDiego")].subject_ids.unique()

# Select subjects classified by expert
class_comparison = class_clips_df[(class_clips_df.subject_ids.isin(subjects_comparison))]

# Create dataset of citizen scientists
class_comparison_cc = class_comparison[class_comparison.user_name != "CallieSanDiego"].reset_index()

# Create dataset of expert
class_comparison_exp = class_comparison[class_comparison.user_name == "CallieSanDiego"].reset_index()

# Find out subjects classified as deep water corals by the expert and citizen scientists

In [48]:
# Create a list for all classifications except deep water corals
all_class = class_comparison.label.unique()
other_classifications = np.delete(all_class, np.where(all_class == 'DEEPWATERCORAL'))

# Calculate the number of citizen scientists that classified each subject
class_comparison_cc["n_users"] = class_comparison_cc.groupby("subject_ids")["classification_id"].transform("nunique")

# Select a subset of subjects that have been classified by 8 volunteers
# (copy to avoid pandas' SettingWithCopyWarning when adding columns below)
class_comparison_cc_8 = class_comparison_cc[class_comparison_cc.n_users == 8].copy()

# Calculate the number of users that agreed on their annotations
class_comparison_cc_8["class_n"] = class_comparison_cc_8.groupby(["subject_ids", "label"])["classification_id"].transform("count")

# Specify user type
class_comparison_cc_8['user'] = 'Volunteers'

# Keep labels chosen by at least one volunteer (a threshold of 1 keeps every
# label; raise it to require more agreement)
class_comparison_cc_8_agg = class_comparison_cc_8[class_comparison_cc_8.class_n >= 1]

# Select relevant columns
class_comparison_cc_8_agg = class_comparison_cc_8_agg[['subject_ids', 'user', 'label']]

# Select only those subjects classified by 8 citizen scientists
# (copy to avoid pandas' SettingWithCopyWarning when adding columns below)
class_comparison_exp = class_comparison_exp[class_comparison_exp.subject_ids.isin(class_comparison_cc_8.subject_ids.unique())].copy()

# Specify user type
class_comparison_exp['user'] = 'Expert'

# Select relevant columns
class_comparison_exp = class_comparison_exp[['subject_ids', 'user', 'label']]

# Merge all the classifications together
coral_subjects = pd.concat([class_comparison_cc_8_agg, class_comparison_exp])

# Replace classifications to deepwater coral or something else 
coral_subjects['label'] = coral_subjects['label'].replace(other_classifications, "Other")

# Sort dataframe by label in each subject/user (DEEPWATERCORAL sorts before Other)
coral_subjects = coral_subjects.sort_values(["subject_ids", "user", "label"]).reset_index(drop=True)

# Select only one user classification per subject: the first label after sorting,
# so a subject counts as deep water coral if any of its labels is DEEPWATERCORAL
coral_subjects = coral_subjects.groupby(["subject_ids","user"]).head(1)

# Reshape to have one subject per line
coral_subjects = coral_subjects.pivot(index='subject_ids', columns='user',values='label').fillna("Other")
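
The effect of the sort, `head(1)` and `pivot` steps can be checked on a small example. The sketch below uses a made-up data frame (`toy`, with three hypothetical subjects) and shows that a subject is labelled as coral for the volunteers as soon as one volunteer chose DEEPWATERCORAL.

```python
import pandas as pd

# Hypothetical labels: two volunteer rows and one expert row per subject
toy = pd.DataFrame(
    {
        "subject_ids": [1, 1, 1, 2, 2, 2, 3, 3, 3],
        "user": ["Volunteers", "Volunteers", "Expert"] * 3,
        "label": [
            "Other", "DEEPWATERCORAL", "DEEPWATERCORAL",  # subject 1
            "Other", "Other", "Other",                    # subject 2
            "DEEPWATERCORAL", "Other", "Other",           # subject 3
        ],
    }
)

# Keep the alphabetically first label per subject and user type
first_label = (
    toy.sort_values(["subject_ids", "user", "label"])
    .groupby(["subject_ids", "user"])
    .head(1)
)

# One row per subject, one column per user type
wide = first_label.pivot(index="subject_ids", columns="user", values="label")
print(wide)
```

Subject 3 illustrates the consequence of this rule: a single volunteer's coral label outweighs the other volunteer's "Other", while the expert chose "Other".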


# Compare the accuracy of the expert and citizen scientists

In [49]:
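# Rows are the expert's labels and columns are the volunteers' labels,
# both in alphabetical order: DEEPWATERCORAL, then Other
metrics.confusion_matrix(coral_subjects['Expert'], coral_subjects['Volunteers'])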

array([[ 558,   20],
       [ 509, 1507]])
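
The matrix can be turned into summary figures with a few lines of arithmetic. This is a sketch that assumes scikit-learn's default behaviour (rows are the first argument, here the expert, and labels are sorted so DEEPWATERCORAL comes before Other) and treats the expert's labels as the reference. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows = expert, columns = volunteers
cm = np.array([[558, 20], [509, 1507]])

# With DEEPWATERCORAL as the positive class
tp, fn = cm[0]  # expert: coral    -> volunteers: coral / other
fp, tn = cm[1]  # expert: other    -> volunteers: coral / other

accuracy = (tp + tn) / cm.sum()   # share of subjects where both agree
recall = tp / (tp + fn)           # share of expert corals found by volunteers
precision = tp / (tp + fp)        # share of volunteer corals confirmed by expert

print(f"accuracy={accuracy:.3f} recall={recall:.3f} precision={precision:.3f}")
```

Under these assumptions the volunteers agree with the expert on about 80% of subjects and recover about 97% of the expert's corals, but only about 52% of the subjects they flag as coral are confirmed by the expert.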

In [10]:
# END