# Introduction
This notebook documents the filtering and construction of the _Frogs_ dataset, as well as a basic exploratory analysis.

## Source of the data
We pulled the data through the web interface provided by [iNaturalist](), filtering for "Frogs" observations with "Research-grade" quality from the past two years. The export is a `.csv` file containing metadata about each observation, including the web links from which we can download the actual images.



In [1]:
import os
import sys

import pandas as pd
from numba import prange  # requires the numba package

# custom modules (add the project root to the import path first)
sys.path.append("/home/e/e0425222/CS4243/")
from utils.dataset_utils.filter_utils import (
    visualize_images_from_df, retrieve_word_embeddings, compute_similarity_score,
    print_topk_words, groupby_plot_hist, filter_by_threshold_counts, download_image,
)

ModuleNotFoundError: No module named 'numba'

In [2]:
frogs_df = pd.read_csv(os.path.join(os.getcwd(), "frogs_metadata.csv"), sep=",")
frogs_df.head()

Unnamed: 0,id,observed_on_string,observed_on,time_observed_at,time_zone,user_id,user_login,created_at,updated_at,quality_grade,...,geoprivacy,taxon_geoprivacy,coordinates_obscured,positioning_method,positioning_device,species_guess,scientific_name,common_name,iconic_taxon_name,taxon_id
0,73992831,Thu Apr 15 2021 07:10:05 GMT+0900 (GMT+9),2021-04-15,2021-04-14 22:10:05 UTC,Tokyo,460572,norio_nomura,2021-04-14 23:15:23 UTC,2021-04-28 12:19:04 UTC,research,...,,open,False,,,ニホンアマガエル,Hyla japonica,Japanese Tree Frog,Amphibia,23951
1,73992869,Thu Apr 15 2021 07:16:24 GMT+0900 (GMT+9),2021-04-15,2021-04-14 22:16:24 UTC,Tokyo,460572,norio_nomura,2021-04-14 23:15:42 UTC,2021-04-28 05:38:57 UTC,research,...,,open,False,,,Japanese Tree Frog,Hyla japonica,Japanese Tree Frog,Amphibia,23951
2,73999658,2021/04/15 10:04 AM AEST,2021-04-15,2021-04-15 00:04:00 UTC,Brisbane,1771883,graham_winterflood,2021-04-15 00:34:04 UTC,2022-02-04 13:11:14 UTC,research,...,,open,False,,,White-lipped Tree Frog,Nyctimystes infrafrenatus,White-lipped Tree Frog,Amphibia,517066
3,74005755,Thu Apr 15 2021 07:08:31 GMT+1000 (GMT+10),2021-04-15,2021-04-14 21:08:31 UTC,Brisbane,2579853,megahertzia,2021-04-15 01:53:49 UTC,2022-01-04 06:01:13 UTC,research,...,,open,False,,,Desert Tree Frog,Litoria rubella,Desert Tree Frog,Amphibia,23611
4,74006270,2021-04-15 10:38:32 AM GMT+10:00,2021-04-15,2021-04-15 00:38:32 UTC,Brisbane,2235434,kimradnell,2021-04-15 02:01:22 UTC,2021-04-15 13:03:12 UTC,research,...,,open,False,gps,gps,Eastern Dwarf Tree Frog,Litoria fallax,Eastern Dwarf Tree Frog,Amphibia,23656


In [3]:
print("Columns are: \n", frogs_df.columns)
print("Rows:", len(frogs_df))

Columns are: 
 Index(['id', 'observed_on_string', 'observed_on', 'time_observed_at',
       'time_zone', 'user_id', 'user_login', 'created_at', 'updated_at',
       'quality_grade', 'license', 'url', 'image_url', 'sound_url', 'tag_list',
       'description', 'num_identification_agreements',
       'num_identification_disagreements', 'captive_cultivated',
       'oauth_application_id', 'place_guess', 'latitude', 'longitude',
       'positional_accuracy', 'private_place_guess', 'private_latitude',
       'private_longitude', 'public_positional_accuracy', 'geoprivacy',
       'taxon_geoprivacy', 'coordinates_obscured', 'positioning_method',
       'positioning_device', 'species_guess', 'scientific_name', 'common_name',
       'iconic_taxon_name', 'taxon_id'],
      dtype='object')
Rows: 193005


## Basic cleaning

In [4]:
# drop observations without an image URL, then drop duplicated image URLs
frogs_df = frogs_df.dropna(subset=['image_url']).drop_duplicates(subset=['image_url'])

# remove non-image types: URLs ending in "m.gif" (e.g. gifs)
frogs_df = frogs_df[~frogs_df['image_url'].str.endswith('m.gif')]

print(f"Number of observations left: {len(frogs_df)}")

Number of observations left: 186902


## Filtering by description
From inspection, the dataset contains many noisy images (dead animals, tadpoles, etc.) that are all classified under "frogs". To remove these instances, we filter out the images whose descriptions contain one of the top-_k_ words closest to a set of filter keywords (e.g. _dead_).

We start by getting word embeddings of all words in all descriptions.

In [5]:
frogs_df = frogs_df.dropna(subset=['description'])
print("Number after removing NaN image or description:", len(frogs_df))


Number after removing NaN image or description: 20964


In [6]:
# get unique words (raw string so that \w is not treated as a string escape)
words = frogs_df['description'].str.lower().str.findall(r"\w+")
all_words = set()
for word_list in words:
    all_words.update(word_list)
all_words = list(all_words)
print("Number of unique words:", len(all_words))

# get embeddings
all_embeddings = retrieve_word_embeddings(all_words)

Number of unique words: 19401


Batches:   0%|          | 0/607 [00:00<?, ?it/s]

We then use word embeddings to find samples with words related to our filter keywords via cosine similarity.

In [7]:
seed_words = [
    'dead',
    'spawn',
    'egg',
    'tadpole',
    'nest',
    'brood'
]

# embed, find top k similar words for each seed word
seed_embeddings = retrieve_word_embeddings(seed_words)
scores = compute_similarity_score(seed_embeddings, all_embeddings)
sorted_word_list = print_topk_words(
    query_words = seed_words, 
    scores = scores, 
    key_words = all_words, 
    k = 20, 
    threshold = 0.75)


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Word 1: dead
[1.00000] - dead
[0.80411] - deceased
[0.77243] - died
[0.76028] - killed
[0.73655] - alive
[0.73514] - dies
[0.71713] - death
[0.70168] - lifeless
[0.68495] - die
[0.66714] - dying
[0.61315] - killing
[0.60742] - lives
[0.60381] - buried
[0.59859] - demise
[0.59787] - revive
[0.59340] - kill
[0.58957] - funeral
[0.58482] - lived
[0.56640] - living
[0.55941] - live

Word 2: spawn
[1.00000] - spawn
[0.85648] - spawning
[0.85348] - spawned
[0.53508] - create
[0.53012] - emerge
[0.50721] - nests
[0.50456] - brood
[0.49716] - frogspawn
[0.49280] - reproduce
[0.48963] - nest
[0.48769] - populated
[0.48695] - creating
[0.47868] - emergent
[0.47711] - generated
[0.47677] - reproduction
[0.47651] - emergents
[0.47553] - swarming
[0.47350] - swarm
[0.47272] - feeder
[0.47156] - generates

Word 3: egg
[1.00000] - egg
[0.91213] - eggs
[0.59257] - chicken
[0.56302] - chickens
[0.55838] - breeding
[0.55598] - hatchling
[0.55007] - duck
[0.54836] - feathers
[0.54708] - hatching
[0.54578
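
As a rough illustration of the similarity step, the sketch below computes cosine similarity with NumPy and keeps, for each query word, the top-_k_ candidates that score at or above a threshold. The helper names `cosine_similarity_matrix` and `topk_above_threshold` and the 2-D toy vectors are hypothetical. The actual `compute_similarity_score` and `print_topk_words` in `filter_utils` may behave differently, for example at the threshold boundary.

```python
import numpy as np


def cosine_similarity_matrix(queries, keys):
    """Return an (n_queries, n_keys) matrix of cosine similarities."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    return q @ k.T


def topk_above_threshold(scores, key_words, k, threshold):
    """For each query row, take the k best words and keep those >= threshold."""
    kept = []
    for row in scores:
        top = np.argsort(row)[::-1][:k]  # indices of the k highest scores
        kept.extend((key_words[i], float(row[i])) for i in top if row[i] >= threshold)
    return kept


# Toy 2-D "embeddings", made up purely for illustration.
toy_words = ["dead", "deceased", "alive", "pond"]
toy_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.6, 0.8], [0.0, 1.0]])

# Use "dead" (row 0) as the only query word.
toy_scores = cosine_similarity_matrix(toy_embeddings[:1], toy_embeddings)
toy_kept = topk_above_threshold(toy_scores, toy_words, k=3, threshold=0.75)
print(toy_kept)
```

In this toy case "alive" falls within the top 3 but is dropped by the threshold, which mirrors how "alive" (0.73655) appears in the printed list for _dead_ but not among the final filter words.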

From these retrieved words, we keep only those whose similarity score passes the threshold (0.75 above).

In [8]:
# collect the unique top-k words whose similarity passes the threshold
unique_filter_words = {wordscore[0] for wordscore in sorted_word_list}
print("Unique filter words are:\n", unique_filter_words)
print("No. words:", len(unique_filter_words))

Unique filter words are:
 {'deceased', 'nest', 'killed', 'nests', 'eggs', 'egg', 'spawning', 'tadpole', 'nesting', 'spawn', 'tadpoles', 'neste', 'spawned', 'tadpolee', 'nesters', 'died', 'brood', 'dead'}
No. words: 18


We then visualize samples whose descriptions contain each of these words.

In [9]:
for word in unique_filter_words:
    df_word = frogs_df[(frogs_df['description'].str.lower().str.contains(word))]
    print(f"For {word}, {len(df_word)} samples.")
    visualize_images_from_df(df_word, 16)

For deceased, 25 samples.


For nest, 37 samples.


For killed, 16 samples.


For nests, 3 samples.


For eggs, 165 samples.


For egg, 278 samples.


For spawning, 5 samples.


For tadpole, 325 samples.


For nesting, 2 samples.


For spawn, 65 samples.


For tadpoles, 222 samples.


For neste, 2 samples.


For spawned, 1 samples.


For tadpolee, 1 samples.


For nesters, 1 samples.


For died, 22 samples.


For brood, 1 samples.


For dead, 301 samples.


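We then remove the samples whose lowercased description matches one of these filter words. Note that `isin` tests the whole description for equality, so this step drops only descriptions that consist of exactly one filter word (66 samples here); descriptions that merely contain a filter word are kept.

In [10]:
# isin() is an exact match on the full description, not a substring search
frogs_df = frogs_df[~frogs_df['description'].str.lower().isin(unique_filter_words)]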
print("Number after removing:", len(frogs_df))


Number after removing: 20898
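
If the intent is to drop every description that contains a filter word, in the same way the visualization cell selects samples with `str.contains`, the removal could be written along the following lines. This is a minimal sketch on a made-up toy frame (`toy_df` and `toy_filter_words` are hypothetical), and the counts reported in this notebook were not produced with it.

```python
import re

import pandas as pd

# Toy data, made up for illustration only.
toy_df = pd.DataFrame({
    "description": [
        "Dead",
        "found dead on the road",
        "calling from a pond",
        "Tadpoles in a puddle",
    ]
})
toy_filter_words = {"dead", "tadpole"}

lowered = toy_df["description"].str.lower()

# Exact match: True only when the whole description equals a filter word.
exact_mask = lowered.isin(toy_filter_words)

# Substring match: True when any filter word occurs anywhere in the description.
pattern = "|".join(re.escape(word) for word in sorted(toy_filter_words))
contains_mask = lowered.str.contains(pattern, regex=True)

kept_exact = toy_df[~exact_mask]
kept_contains = toy_df[~contains_mask]
print(len(kept_exact), len(kept_contains))
```

Plain substring matching also catches unrelated words (for example "nest" inside "honest"); wrapping the pattern in `\b` word boundaries is one way to tighten it.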


## Filtering by classes
The frogs dataset was collected by people in different time zones (which we use as a proxy for location) and covers many species. There are also several other markers of quality. In this section, we use these categorical variables to examine the dataset and apply filters accordingly.

Here, we look at the distribution of categorical values for the given columns.

In [11]:
count_threshold = 150
column_names = ["common_name", "time_zone"]

for column in column_names:
    groupby_plot_hist(frogs_df, column, count_threshold = count_threshold)

American Toad                 1328
American Bullfrog             1189
Green Frog                    1097
Gulf Coast Toad                897
Green Treefrog                 649
European Common Frog           559
European Toad                  542
Wood Frog                      529
Western Leopard Toad           496
Northern Leopard Frog          463
Spring Peeper                  390
Cuban Tree Frog                379
Gray Treefrog                  371
Northern Pacific Tree Frog     358
Western Toad                   355
Gray Treefrog Complex          290
Southern Toad                  277
Cane Toad                      265
Asian Common Toad              260
Southern Leopard Frog          248
Pickerel Frog                  242
Blanchard's Cricket Frog       239
Fowler's Toad                  195
Squirrel Tree Frog             194
Sierran Tree Frog              194
Giant Toad                     192
Clicking Stream Frog           190
Woodhouse's Toad               189
Australian Green Tre

We note that several classes have extremely low representation (some with as few as 10 samples). We therefore set a count threshold (150 here) and remove the samples that belong to these underrepresented classes.

In [12]:
frogs_df = filter_by_threshold_counts(frogs_df, column_names, count_threshold)
for column in column_names:
    groupby_plot_hist(frogs_df, column, count_threshold = count_threshold)

After filtering, left with 11517 samples.
American Toad                 1309
American Bullfrog             1138
Green Frog                    1085
Gulf Coast Toad                883
Green Treefrog                 642
Wood Frog                      506
Western Leopard Toad           478
Northern Leopard Frog          452
Spring Peeper                  388
Gray Treefrog                  369
Cuban Tree Frog                369
Northern Pacific Tree Frog     353
Western Toad                   346
European Common Frog           321
European Toad                  299
Gray Treefrog Complex          287
Southern Toad                  274
Southern Leopard Frog          244
Pickerel Frog                  240
Blanchard's Cricket Frog       236
Fowler's Toad                  195
Sierran Tree Frog              193
Squirrel Tree Frog             191
Woodhouse's Toad               183
Cope's Gray Treefrog           182
Clicking Stream Frog           181
Southern Cricket Frog          173
Name: common_
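
The count-based filter can be sketched in plain pandas as follows. The function `filter_by_min_count` and the toy frame are hypothetical, and the sketch assumes that columns are filtered one after another and that classes with exactly `min_count` samples are kept; the actual `filter_by_threshold_counts` may differ on both points.

```python
import pandas as pd


def filter_by_min_count(df, columns, min_count):
    """Keep rows whose value in each column occurs at least min_count times."""
    for column in columns:
        # Map every row to the frequency of its own category in this column.
        counts = df[column].map(df[column].value_counts())
        df = df[counts >= min_count]
    return df


# Toy data, made up for illustration only.
toy_df = pd.DataFrame({
    "common_name": ["American Toad"] * 3 + ["Green Frog"] * 2 + ["Rare Frog"],
    "time_zone": ["Tokyo", "Tokyo", "Brisbane", "Tokyo", "Tokyo", "Tokyo"],
})

toy_filtered = filter_by_min_count(toy_df, ["common_name", "time_zone"], min_count=2)
print(toy_filtered)
```

Because the columns are filtered in sequence, removing rows for a rare time zone also lowers the per-species counts, which is consistent with species counts shrinking between the two histograms above.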

## Saving
We are now done cleaning our dataset. After a final check with a visualization, we save the files.

In [13]:
visualize_images_from_df(frogs_df, 64)

### Save indices

In [14]:
metadata_filename = 'frogs_filtered.csv'
frogs_df.to_csv(metadata_filename, sep=',', index=False)

# sanity check: reload the saved file
loaded = pd.read_csv(metadata_filename, sep=",")
loaded.head()


Unnamed: 0,id,observed_on_string,observed_on,time_observed_at,time_zone,user_id,user_login,created_at,updated_at,quality_grade,...,geoprivacy,taxon_geoprivacy,coordinates_obscured,positioning_method,positioning_device,species_guess,scientific_name,common_name,iconic_taxon_name,taxon_id
0,74049910,Thu Apr 15 2021 11:00:16 GMT-0400 (EDT),2021-04-15,2021-04-15 15:00:16 UTC,Eastern Time (US & Canada),3435284,jprice12,2021-04-15 15:43:21 UTC,2021-04-17 19:39:18 UTC,research,...,,,False,,,Green Frog,Lithobates clamitans,Green Frog,Amphibia,65982
1,74050194,Thu Apr 15 2021 10:59:16 GMT-0400 (EDT),2021-04-15,2021-04-15 14:59:16 UTC,Eastern Time (US & Canada),3435284,jprice12,2021-04-15 15:46:21 UTC,2021-04-17 19:39:11 UTC,research,...,,,False,,,Wood Frog,Lithobates sylvaticus,Wood Frog,Amphibia,66012
2,74064526,Thu Apr 15 2021 13:18:00 GMT-0500 (CDT),2021-04-15,2021-04-15 18:18:00 UTC,Central Time (US & Canada),2494733,chelsea_ellen,2021-04-15 18:31:17 UTC,2021-04-26 04:27:16 UTC,research,...,,open,False,,,Gulf Coast Toad,Incilius nebulifer,Gulf Coast Toad,Amphibia,65849
3,74072407,2021/04/15 11:24 AM CDT,2021-04-15,2021-04-15 16:24:00 UTC,Central Time (US & Canada),3113294,chdphoto,2021-04-15 19:55:39 UTC,2021-04-15 20:50:48 UTC,research,...,,,False,,,Green Frog,Lithobates clamitans,Green Frog,Amphibia,65982
4,74072408,2021/04/15 11:25 AM CDT,2021-04-15,2021-04-15 16:25:00 UTC,Central Time (US & Canada),3113294,chdphoto,2021-04-15 19:55:40 UTC,2021-04-15 23:56:14 UTC,research,...,,,False,,,Green Frog,Lithobates clamitans,Green Frog,Amphibia,65982


In [15]:
id_url_filename = "frogs_id_url.txt"
frogs_id_url_df = frogs_df[['id', 'image_url']]
frogs_id_url_df.to_csv(id_url_filename, header=['id', 'image_url'], index=False, sep=",")
frogs_id_url_df.head()

Unnamed: 0,id,image_url
75,74049910,https://inaturalist-open-data.s3.amazonaws.com...
76,74050194,https://inaturalist-open-data.s3.amazonaws.com...
93,74064526,https://static.inaturalist.org/photos/12178992...
102,74072407,https://inaturalist-open-data.s3.amazonaws.com...
103,74072408,https://inaturalist-open-data.s3.amazonaws.com...


### Download

In [16]:
frogs_id_url_df = pd.read_csv(id_url_filename, sep = ",")
frogs_id_url_df.head()

Unnamed: 0,id,image_url
0,74049910,https://inaturalist-open-data.s3.amazonaws.com...
1,74050194,https://inaturalist-open-data.s3.amazonaws.com...
2,74064526,https://static.inaturalist.org/photos/12178992...
3,74072407,https://inaturalist-open-data.s3.amazonaws.com...
4,74072408,https://inaturalist-open-data.s3.amazonaws.com...


In [17]:
save_path = os.path.join(os.getcwd(), "frog_images")

# download every (id, image_url) pair; a plain loop suffices because
# prange only parallelizes inside numba-jitted functions
for id_url in frogs_id_url_df[["id", "image_url"]].itertuples(index=False, name=None):
    download_image(id_url, save_path)

frogs_90113118.jpeg was downloaded...
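
For readers without access to `filter_utils`, a download helper might look roughly like the sketch below. Everything here is an assumption: the name `download_image_sketch`, the `fetch` parameter (added so the function can be exercised offline), and the `frogs_<id><extension>` naming, which is only inferred from the log line above. The real `download_image` may work differently.

```python
import os
import tempfile
from urllib.request import urlopen


def download_image_sketch(id_url, save_dir, fetch=None):
    """Save the image behind an (id, url) pair as frogs_<id><ext> in save_dir."""
    obs_id, url = id_url
    # Take the extension from the URL path, falling back to .jpeg.
    ext = os.path.splitext(url.split("?")[0])[1] or ".jpeg"
    os.makedirs(save_dir, exist_ok=True)
    path = os.path.join(save_dir, f"frogs_{obs_id}{ext}")
    if os.path.exists(path):
        return path  # already downloaded, skip the request
    if fetch is None:
        with urlopen(url, timeout=30) as response:
            data = response.read()
    else:
        data = fetch(url)
    with open(path, "wb") as handle:
        handle.write(data)
    return path


# Offline demo with a stand-in fetcher and a placeholder URL (no network access).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = download_image_sketch(
        (123, "https://example.org/photos/1/medium.jpeg"),
        tmp_dir,
        fetch=lambda url: b"fake-bytes",
    )
    demo_name = os.path.basename(demo_path)
    with open(demo_path, "rb") as handle:
        demo_bytes = handle.read()
print(demo_name)
```

Skipping files that already exist makes the loop safe to re-run after an interruption, which matters when downloading thousands of images.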