<a href="https://colab.research.google.com/github/victor-wildlife/taxinomitis-docs/blob/master/frog_zooniverse.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook contains the scripts to upload photos of Archey's frogs to a Zooniverse project and download labels of the landmarks of the frogs to train ML algorithms.

# Requirements

We use the "panoptes_client" package to communicate with Zooniverse. If you don't have it installed, run the command below.

In [None]:
!pip install panoptes_client

Load generic libraries

In [None]:
import zipfile
import pandas as pd

from google.colab import drive
from panoptes_client import (
    SubjectSet,
    Subject,
    Project,
    Panoptes,
) 

# Download frog photos

### Add shortcuts to the compressed photos

To download the photos of the frogs into this Google Colab notebook, you first need to add shortcuts in your Google Drive to the [five zipped folders](https://drive.google.com/file/d/1XXSrATFX1l-J0CUE4m6UfoOBp9zv3XOr/view?usp=sharing) that contain the photos.

To add the shortcuts:
* go to the "Shared with me" section in your Google Drive,
* find the five zipped folders,
* click on "Add shortcut to Drive" and
* save the shortcuts (we created a folder called "frog_photos" and saved them there).

*Specify* the folder in your Google Drive where you saved the shortcuts to the photos (in our case "frog_photos").

In [None]:
dir_shortcuts = "/content/drive/My Drive/frog_photos/"

If you can't access the five zipped folders, please [email Victor](mailto:victor@wildlife.ai).
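
Once the drive has been mounted, it can help to confirm that all five shortcuts are where the notebook expects them. The following is a minimal sketch, assuming the archive names used in the download cell of this notebook; `find_missing_archives` is a hypothetical helper written for illustration, not part of any library.

```python
import os

# Archive names taken from the download cell of this notebook.
EXPECTED_ARCHIVES = [
    "whareorino_a.zip",
    "whareorino_b.zip",
    "whareorino_c.zip",
    "whareorino_d.zip",
    "pukeokahu.zip",
]


def find_missing_archives(folder, expected=EXPECTED_ARCHIVES):
    """Return the expected archive names that are not present in `folder`."""
    return [
        name
        for name in expected
        if not os.path.isfile(os.path.join(folder, name))
    ]
```

Calling `find_missing_archives(dir_shortcuts)` after mounting returns an empty list when all five archives are found.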

### Download the compressed photos

To download the five zipped folders with the photos, you will need to grant access to Google Drive File Stream when prompted.

In [None]:
drive.mount('/content/drive/')


def open_photo_archive(zip_name):
    """Open a zip archive of photos and list the frog photos it contains."""
    archive = zipfile.ZipFile(dir_shortcuts + zip_name, 'r')
    # Keep files under 'Individual Frogs', skipping folders, .db files and .DS_Store
    photos = pd.DataFrame(
        [x for x in archive.namelist() if 'Individual Frogs' in x and not x.endswith(('.db', '/', 'Store'))]
    )
    return archive, photos


whareorino_a, whareorino_a_pd = open_photo_archive("whareorino_a.zip")
whareorino_b, whareorino_b_pd = open_photo_archive("whareorino_b.zip")
whareorino_c, whareorino_c_pd = open_photo_archive("whareorino_c.zip")
whareorino_d, whareorino_d_pd = open_photo_archive("whareorino_d.zip")
pukeokahu, pukeokahu_pd = open_photo_archive("pukeokahu.zip")

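
To see what the filter on the archive listing keeps and drops, here is a small self-contained sketch. It builds an in-memory zip with made-up entry names (the folder and file names are illustrative assumptions, not taken from the real archives) and applies the same condition as the cell above.

```python
import io
import zipfile


def list_frog_photos(archive):
    """Return archive members under 'Individual Frogs' that look like photos."""
    # Skip database files (e.g. Thumbs.db), folder entries and .DS_Store files
    skip_suffixes = (".db", "/", "Store")
    return [
        name
        for name in archive.namelist()
        if "Individual Frogs" in name and not name.endswith(skip_suffixes)
    ]


# Build a tiny in-memory archive with hypothetical entries
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("Grid A/Individual Frogs/F001/F001-2019_1.jpg", b"")
    demo.writestr("Grid A/Individual Frogs/F001/Thumbs.db", b"")
    demo.writestr("Grid A/Individual Frogs/.DS_Store", b"")
    demo.writestr("Grid A/Other/notes.txt", b"")

with zipfile.ZipFile(buffer) as demo:
    photos = list_frog_photos(demo)

print(photos)  # only the .jpg entry under 'Individual Frogs' remains
```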


# Create a data frame with frog information

Create a data frame to keep track of the photos uploaded to Zooniverse.

### Prepare information related to the photos

In [78]:
import numpy as np

# Combine the different grids into a single data frame
pdList = [whareorino_a_pd,
          whareorino_b_pd,
          whareorino_c_pd,
          whareorino_d_pd,
          pukeokahu_pd]

frog_df = pd.concat(pdList, ignore_index=True)

# Rename the column
frog_df = frog_df.rename(columns={0: "file_path"})

# Add new columns based on the directory and filename of the photos
directories = frog_df['file_path'].str.split("/", n=4, expand=True)

# Take the grid, frog id and filename from the parts of the path
frog_df["grid"] = directories[0]
frog_df["frog_id"] = directories[2]
frog_df["filename"] = directories[3]

# Drop the extension and the last "-"/"_" separated chunk of the filename
frog_df["capture"] = (
    frog_df["filename"]
    .str.split(".", n=1, expand=True)[0]
    .str.replace('_', '-')
    .str.rsplit("-", n=1, expand=True)[0]
)

# Sort the columns to match the database
frog_df = frog_df[
        ["filename", "file_path", "capture", "frog_id", "grid"]
    ]

# Placeholder for the Zooniverse subject id of each photo
frog_df["subject_id"] = np.nan
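
The column derivation above can be easier to follow on a single path. This is a minimal plain-Python sketch of the same string operations; the example path is made up for illustration, and `parse_photo_path` is a hypothetical helper.

```python
def parse_photo_path(file_path):
    """Split an archive path into the fields stored in frog_df."""
    parts = file_path.split("/", 4)
    filename = parts[3]
    # Drop the extension, unify separators, then drop the last chunk
    stem = filename.split(".", 1)[0]
    capture = stem.replace("_", "-").rsplit("-", 1)[0]
    return {
        "grid": parts[0],
        "frog_id": parts[2],
        "filename": filename,
        "capture": capture,
    }


# Hypothetical path following the <grid>/<folder>/<frog_id>/<filename> layout
example = parse_photo_path("Grid A/Individual Frogs/F001/F001-2019_1.jpg")
print(example)
```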




### Prepare information related to Zooniverse subjects

You need to specify your Zooniverse username and password. Only users with access to the project can upload information to, or download information from, Zooniverse.

In [None]:
zoo_user = "usern"
zoo_pass = "pass"
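
If you would rather not leave a password in a shared notebook, one option is to prompt for it at run time. This is a minimal sketch using the standard-library `getpass` module; `read_credentials` and its two parameters are hypothetical names introduced here so the prompts can be swapped out.

```python
import getpass


def read_credentials(ask_user=input, ask_password=getpass.getpass):
    """Prompt for a Zooniverse username and password without echoing the password."""
    username = ask_user("Zooniverse username: ").strip()
    password = ask_password("Zooniverse password: ")
    return username, password
```

For example, `zoo_user, zoo_pass = read_credentials()` would replace the two hard-coded assignments above.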

In [80]:
import io
import json


class AuthenticationError(Exception):
    """Raised when the Zooniverse login fails."""


# Connect to Zooniverse with your username and password
auth = Panoptes.connect(username=zoo_user, password=zoo_pass)

if not auth.logged_in:
    raise AuthenticationError("Your credentials are invalid. Please try again.")

# Connect to the Zooniverse project (our frog project # is 13355)
project = Project(13355)

# Get info of subjects uploaded to the project
export = project.get_export("subjects")

# Save the subjects info as pandas data frame
subjects_df = pd.read_csv(
    io.StringIO(export.content.decode("utf-8")),
    usecols=[
        "subject_id",
        "metadata",
    ],
)

# Reset index of df
subj_df = subjects_df.reset_index(drop=True)

# Flatten the metadata from the uploaded subjects
meta_df = pd.json_normalize(subj_df.metadata.apply(json.loads))

# Keep the subject_id next to the filename stored in the metadata
# (assumes photos are uploaded with a "filename" metadata field)
subj_df = pd.concat(
    [subj_df[["subject_id"]], meta_df.reindex(columns=["filename"])], axis=1
).drop_duplicates("filename")

# Add the subject_id of photos already uploaded to Zooniverse
frog_df = pd.merge(
    frog_df.drop(columns="subject_id", errors="ignore"),
    subj_df,
    how="left",
    on="filename",
)


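
Matching local photos to subjects already on Zooniverse comes down to reading a filename out of each subject's JSON metadata. Below is a minimal plain-Python sketch of that idea, assuming the metadata stored with each subject carries a `filename` key; the rows are made-up examples, and `subject_ids_by_filename` is a hypothetical helper.

```python
import json


def subject_ids_by_filename(subject_rows):
    """Map each uploaded photo's filename to its Zooniverse subject_id.

    `subject_rows` is an iterable of (subject_id, metadata_json) pairs,
    i.e. the two columns read from the subject export.
    """
    lookup = {}
    for subject_id, metadata_json in subject_rows:
        metadata = json.loads(metadata_json)
        filename = metadata.get("filename")  # assumed metadata key
        if filename is not None:
            lookup[filename] = subject_id
    return lookup


# Hypothetical export rows: the second one has no filename in its metadata
rows = [
    (101, '{"filename": "F001-2019_1.jpg", "grid": "Grid A"}'),
    (102, '{"grid": "Grid B"}'),
]
lookup = subject_ids_by_filename(rows)
print(lookup)
```

Photos whose filename is missing from `lookup` have not been uploaded yet.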

# Upload new photos to Zooniverse

In [None]:
from datetime import date

# Select n number of photos to upload to Zooniverse
photos_upload = frog_df[frog_df['subject_id'].isnull()].sample(n = 3)

# Select the columns that will appear as metadata associated to the subjects
photos_upload = photos_upload[
            [
             "file_path",
             "filename",
             "capture",
             "frog_id",
             "grid"
            ]
        ]

# Save the df as the subject metadata (keyed by the path of each photo)
subject_metadata = photos_upload.set_index('file_path').to_dict('index')

# Create a subject set in Zooniverse to host the photos
subject_set = SubjectSet()

subject_set.links.project = project
# Name the set with a prefix and today's date ("frogs" is a placeholder prefix)
subject_set.display_name = "frogs" + date.today().strftime("_%d_%m_%Y")

subject_set.save()

print("Zooniverse subject set created")
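
The rows selected for upload point to paths inside the zip archives, whereas the upload loop further below passes a file path to `subject.add_location`. One way to bridge the two is to extract the selected photos to a local folder first. This is a minimal sketch with the standard-library `zipfile` module; `extract_photos` and the output folder are hypothetical, and you would need to pass the archive each photo came from.

```python
import os
import zipfile


def extract_photos(archive, members, out_dir):
    """Extract the given archive members into out_dir and return their local paths."""
    os.makedirs(out_dir, exist_ok=True)
    # ZipFile.extract returns the path of the file it created
    return [archive.extract(member, path=out_dir) for member in members]
```

For instance, `extract_photos(whareorino_a, paths_from_that_archive, "/content/frog_uploads")` would return local paths that can be used in place of the archive paths.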


In [None]:
import argparse, os, cv2, re
import utils.db_utils as db_utils
import pandas as pd
import numpy as np
import pims

from PIL import Image
from datetime import date
from utils.zooniverse_utils import auth_session
from panoptes_client import (
    SubjectSet,
    Subject,
    Project,
    Panoptes,
)

# Function to identify up to n number of frames per classified clip
# that contains species of interest after the first time seen
def get_species_frames(species_id, conn, n_frames):

    # Find classified clips that contain the species of interest
    frames_df = pd.read_sql_query(
        f"SELECT subject_id, first_seen FROM agg_annotations_clip WHERE agg_annotations_clip.species_id={species_id}",
        conn,
    )

    # Add species id to the df
    frames_df["frame_exp_sp_id"] = species_id
    
    # Get start time of the clips and ids of the original movies
    (frames_df["clip_start_time"], frames_df["movie_id"],) = list(
        zip(
            *pd.read_sql_query(
                f"SELECT clip_start_time, movie_id FROM subjects WHERE id IN {tuple(frames_df['subject_id'].values)} AND subject_type='clip'",
                conn,
            ).values
        )
    )

    # Identify the second of the original movie when the species first appears
    frames_df["first_seen_movie"] = frames_df["clip_start_time"] + frames_df["first_seen"]

    # Get the filepath and fps of the original movies
    f_paths = pd.read_sql_query(f"SELECT id, fpath, fps FROM movies", conn)

    # Ensure swedish characters don't cause issues
    f_paths["fpath"] = f_paths["fpath"].apply(
        lambda x: str(x) if os.path.isfile(str(x)) else db_utils.unswedify(str(x))
    )
    
    # Include movies' filepath and fps to the df
    frames_df = frames_df.merge(f_paths, left_on="movie_id", right_on="id")
    
    # Specify if original movies can be found
    frames_df['exists'] = frames_df['fpath'].map(os.path.isfile)
    
    if len(frames_df[~frames_df.exists]) > 0:
        print(
            f"There are {len(frames_df) - frames_df.exists.sum()} out of {len(frames_df)} frames with a missing movie"
        )
        
    # Select only frames from movies that can be found
    frames_df = frames_df[frames_df.exists]
    
    # Identify the ordinal number of the frames expected to be extracted
    frames_df["frame_number"] = frames_df[
        ["first_seen_movie", "fps"]
    ].apply(
        lambda x: [
            int((x["first_seen_movie"] + j) * x["fps"])
            for j in range(n_frames)
        ],
        1,
    )
    
    # Reshape df to have each frame as rows
    lst_col = 'frame_number'

    frames_df = pd.DataFrame({
        col:np.repeat(frames_df[col].values, frames_df[lst_col].str.len())
        for col in frames_df.columns.difference([lst_col])
    }).assign(**{lst_col:np.concatenate(frames_df[lst_col].values)})[frames_df.columns.tolist()]
    
    # Drop unnecessary columns
    frames_df.drop(["subject_id"], inplace=True, axis=1)
    
    return frames_df
    
# Function to extract frames 
def extract_frames(df, frames_folder):    

    # Get movies filenames from their path (match ".mov" literally, not as a regex)
    df["movie_filename"] = df["fpath"].str.split('/').str[-1].str.replace(".mov", "", regex=False)
    
    
    
    # Set the filename of the frames
    df["frame_path"] = (frames_folder
                      + df["movie_filename"].astype(str)
                      + "_frame_"
                      + df["frame_number"].astype(str)
                      + "_"
                      + df["frame_exp_sp_id"].astype(str)
                      + ".jpg"
                     )
    
    
    # Read all original movies
    video_dict = {k: pims.Video(k) for k in df["fpath"].unique()}

    # Save the frame as matrix    
    df["frames"] = df[["fpath", "frame_number", "fps"]].apply(
        lambda x: video_dict[x["fpath"]][
            np.arange(
                int(x["frame_number"]),
                int(x["frame_number"])
                    + int(x["fps"]),
                int(x["fps"]),
            )
        ],
        1,
    )
    
    # Extract and save frames
    for frame, filename in zip(df["frames"].explode(), df["frame_path"].explode()):
        Image.fromarray(frame).save(f"{filename}")
        
    print("Frames extracted successfully")
    return df["frame_path"]


def main():

    "Handles argument parsing and launches the correct function."
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--user", "-u", help="Zooniverse username", type=str, required=True
    )
    parser.add_argument(
        "--password", "-p", help="Zooniverse password", type=str, required=True
    )
    parser.add_argument(
        "--species", "-l", help="Species to upload", type=str, required=True
    )
    parser.add_argument(
        "-db",
        "--db_path",
        type=str,
        help="the absolute path to the database file",
        default=r"koster_lab.db",
        required=True,
    )
    parser.add_argument(
        "-fp",
        "--frames_folder",
        type=str,
        help="the absolute path to the folder to store frames",
        default=r"./frames",
        required=True,
    )
    parser.add_argument(
        "-t",
        "--testing",
        help="add flag if testing",
        required=False,
        action="store_true",
    )
    parser.add_argument(
        "-nf",
        "--n_frames",
        type=int,
        help="number of frames to create per clip",
        default=2,
        required=False,
    )

    args = parser.parse_args()

    # Connect to koster_db
    conn = db_utils.create_connection(args.db_path)

    # Connect to Zooniverse
    koster_project = auth_session(args.user, args.password)

    # Get id of species of interest
    species_id = pd.read_sql_query(
        f"SELECT id FROM species WHERE label='{args.species}'", conn
    ).values[0][0]

    # Identify n number of frames per classified clip that contains species of interest 
    sp_frames_df = get_species_frames(species_id, conn, args.n_frames)

    # Get info of frames already uploaded
    uploaded_frames_df = pd.read_sql_query(
        f"SELECT movie_id, frame_number, frame_exp_sp_id FROM subjects WHERE frame_exp_sp_id='{species_id}' and subject_type='frame'",
        conn,
    )

    # Filter out frames that have already been uploaded
    if len(uploaded_frames_df) > 0 and not args.testing:

        # Exclude frames that have already been uploaded
        sp_frames_df = sp_frames_df[
            ~(sp_frames_df["movie_id"].isin(uploaded_frames_df["movie_id"]))
            & ~(sp_frames_df["frame_number"].isin(uploaded_frames_df["frame_number"]))
            & ~(
                sp_frames_df["frame_exp_sp_id"].isin(
                    uploaded_frames_df["frame_exp_sp_id"]
                )
            )
        ]

    # Upload frames to Zooniverse that have not been uploaded
    if len(sp_frames_df) == 0:
        # A bare `raise` has no active exception here, so raise an explicit error
        raise ValueError(
            "There are no subjects to upload, this may be because all of the subjects have already been uploaded"
        )
     
    else:
        # Create the folder to store the frames if not exist
        if not os.path.exists(args.frames_folder):
            os.mkdir(args.frames_folder)
        
        # Extract the frames and save them
        sp_frames_df["frame_path"] = extract_frames(sp_frames_df, args.frames_folder)
 
        # Select koster db metadata associated with each frame
        sp_frames_df["label"] = args.species
        sp_frames_df["subject_type"] = "frame"

        sp_frames_df = sp_frames_df[
            [
                "frame_path",
                "fpath",
                "frame_number",
                "fps",
                "movie_id",
                "label",
                "frame_exp_sp_id",
                "subject_type",
            ]
        ]
        
        # Save the df as the subject metadata
        subject_metadata = sp_frames_df.set_index('frame_path').to_dict('index')
        
        # Create a subject set in Zooniverse to host the frames
        subject_set = SubjectSet()

        subject_set.links.project = koster_project
        subject_set.display_name = args.species + date.today().strftime("_%d_%m_%Y")

        subject_set.save()

        print("Zooniverse subject set created")
        
        
        # Upload frames to Zooniverse (with metadata)
        new_subjects = []

        for filename, metadata in subject_metadata.items():
            subject = Subject()

            subject.links.project = koster_project
            subject.add_location(filename)

            subject.metadata.update(metadata)

            subject.save()
            new_subjects.append(subject)

        # Upload frames
        subject_set.add(new_subjects)
        
        print("Subjects uploaded to Zooniverse")
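
The filter in `main()` joins three separate `isin` checks with `&`, so a candidate frame is kept only if none of its three values appears anywhere among the uploaded frames. If the intent is to skip only exact repeats, comparing the full key is one alternative. This is a minimal plain-Python sketch of that idea, not the script's current behaviour; `drop_uploaded_frames` is a hypothetical helper and the records are made up.

```python
def drop_uploaded_frames(candidates, uploaded):
    """Keep candidate frames whose (movie_id, frame_number, species id) key is new."""
    key_fields = ("movie_id", "frame_number", "frame_exp_sp_id")
    seen = {tuple(row[field] for field in key_fields) for row in uploaded}
    return [
        row
        for row in candidates
        if tuple(row[field] for field in key_fields) not in seen
    ]


# Hypothetical records: only the first candidate was uploaded before
uploaded = [{"movie_id": 1, "frame_number": 250, "frame_exp_sp_id": 7}]
candidates = [
    {"movie_id": 1, "frame_number": 250, "frame_exp_sp_id": 7},
    {"movie_id": 1, "frame_number": 275, "frame_exp_sp_id": 7},
    {"movie_id": 2, "frame_number": 250, "frame_exp_sp_id": 7},
]
new_frames = drop_uploaded_frames(candidates, uploaded)
print(new_frames)
```

With three independent `isin` checks, all three candidates here would be dropped because they share the species id with the uploaded frame.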


# Download Zooniverse annotations

In [None]:
import os 
import pandas as pd

# Create a df of the photos found in the tmp folder
data = []
# Loop through each folder in the tmp directory
for grid in os.listdir('../tmp/'):
  if 'Grid' in grid:
    grid_path = '../tmp/' + grid
    # Loop through each subfolder in the 'Grid' directories
    for subfolder in os.listdir(grid_path):
      if 'Individual' in subfolder:
        subfolder_path = grid_path + "/" + subfolder
        # Loop through each individual frog in the "individual frog" directory
        for ind in os.listdir(subfolder_path):
          if not ind.endswith('db'):
            ind_path = subfolder_path + "/" + ind
            # Loop through each photo of the "individual" frog
            for doc in os.listdir(ind_path):
              #Save information about the photo and the frog
              if not doc.endswith('db'):
                fpath = ind_path + "/" + doc
                capt = doc.split(".",1)[0].replace('_', '-').rsplit("-",1)[1]
                data.append((doc, fpath, capt, ind, grid))

df = pd.DataFrame(data,columns = ['filename', 'file_path', 'capture', 'frog_id', 'grid'])
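
The nested loops above can also be written as a single glob pattern. The following is a rough sketch with `pathlib`, assuming the same `<Grid>/<Individual ...>/<frog_id>/<photo>` layout; it leaves out the `capture` field, and `list_photos` is a hypothetical helper.

```python
from pathlib import Path


def list_photos(root):
    """Return (filename, file_path, frog_id, grid) tuples for the photos under root."""
    rows = []
    # One pattern level per folder: grid / individual-frogs folder / frog / photo
    for photo in sorted(Path(root).glob("*Grid*/*Individual*/*/*")):
        frog_dir = photo.parent
        # Skip database files at both the frog and the photo level
        if frog_dir.name.endswith("db") or photo.name.endswith("db"):
            continue
        if photo.is_file():
            grid = frog_dir.parent.parent.name
            rows.append((photo.name, str(photo), frog_dir.name, grid))
    return rows
```

The result can be passed to `pd.DataFrame(rows, columns=['filename', 'file_path', 'frog_id', 'grid'])`.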

In [None]:
movies_df = db_utils.download_csv_from_google_drive(movies_file_id)

# Include server's path of the movie files
movies_df["Fpath"] = movies_path + "/" + movies_df["FilenameCurrent"] + ".mov"

# Standardise the filename
movies_df["FilenameCurrent"] = movies_df["FilenameCurrent"].str.normalize("NFD")

# Set up sites information
sites_db = movies_df[
    ["SiteDecription", "CentroidLat", "CentroidLong"]
].drop_duplicates("SiteDecription")

    