# MS Classifier for Plains Zebras Collection Classification
- The MS classifier **does** distinguish between Grevy's zebra (Equus grevyi) and plains zebra (Equus quagga):
    - https://speciesclassification.westus2.cloudapp.azure.com/
- Collections to run the classifier on:
   - plains zebras general
   - plains zebra general africa bbox
   

In [1]:
# automatically reload imported modules so that changes made in them take effect
%load_ext autoreload
%autoreload 2

## Keys

In [2]:
DB_KEY =  "mongodb+srv://user:BCLobB4rLJucVXG2@wildbook-cmmya.mongodb.net/test?retryWrites=true&w=majority" # connect to database here (see owners for access)
# MS_key = '3c313eb853de41788b3e35e9bcf1ba2e'

In [8]:
import os, sys
# sys.path.append(os.path.join(sys.path[0], '../'))
sys.path.append(os.path.join(os.path.abspath(os.getcwd()), '../'))

#distance visualization
import plotly.graph_objects as go
import plotly.io as pio
import numpy as np
from itertools import chain
import pandas as pd
import matplotlib.pyplot as plt

#import flickr and db modules
from wildbook_social import Flickr, Database

#set up
db = Database(DB_KEY, 'flickr_june_2019')
# db = Database(DB_KEY, 'imgs_for_species_classifier')
fr = Flickr(db)

In [9]:
from wildbook_social import SpeciesClassifier 
from wildbook_social import Image

## create instances of the MS Species Classification API wrapper and of the Image class, which reformats Flickr data for the API
sc = SpeciesClassifier()
img = Image()

# Select MongoDB Collection + MS Classifier Setup

In [None]:
# saveTo = 'plains zebra general - 1000 demo'
saveTo = 'plains zebra general - 1000 testing' #Vi-an

print('You are working with the collection: ', saveTo)

In [None]:
## rename 'url_l' field in docs to just 'url'
db.renameField(saveTo, 'url_l', 'url')

In [None]:
## get the current mongoDB database collection object
db_obj = db.getDB()

## Demo - Classifying Images with the MS Classifier

In [None]:
# numToClassify = 20 #set number of images you want to classify
# species_keyword = 'Plains Zebra'
# confidence = 0.0 

# flickr_img_dicts = img.get_flickr_img_dicts(db_obj, saveTo, numToClassify)
# sc.predict_image_relevancy(db_obj, saveTo, flickr_img_dicts, species_keyword, confidence)

# Batch Relevance Filtration with MS Classifier
- Automatically run the classifier over the unlabeled images and mark an image as relevant if the species is in frame
- Run only one of the two options below: (A) classify the entire collection or (B) classify smaller subsets of the collection

In [None]:
res = db_obj[saveTo].find({'relevant':None})
res_list = list(res)
len(res_list)

### (A) Classify Entire Collection ...

In [None]:
numToClassify = len(res_list)  # can also be set manually (e.g. 100) if you don't want to classify the entire collection in one go
confidence = 0.0
species_keyword = 'Plains Zebra'

#encode the metadata in a form that fits the MS classifier
flickr_img_dicts = img.get_flickr_img_dicts(db_obj, saveTo, numToClassify)
print(len(flickr_img_dicts))

#begin running the classifier on our images in the collection
sc.predict_image_relevancy(db_obj, saveTo, flickr_img_dicts, species_keyword, confidence)
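
How `predict_image_relevancy` decides relevance is not shown in this notebook. Below is a minimal sketch of the kind of rule that `species_keyword` and `confidence` suggest. It assumes (hypothetically) that the classifier returns a list of `(species_name, score)` pairs per image; `is_relevant` and that prediction format are illustrative and not part of `wildbook_social` or the MS API.

```python
def is_relevant(predictions, species_keyword, confidence):
    """Return True if any prediction names the species with a high enough score.

    predictions: list of (species_name, score) pairs (hypothetical format).
    """
    keyword = species_keyword.lower()
    return any(
        keyword in name.lower() and score >= confidence
        for name, score in predictions
    )


# made-up predictions for two images
print(is_relevant([("Plains zebra", 0.91), ("Horse", 0.04)], "Plains Zebra", 0.0))  # True
print(is_relevant([("Grevy's zebra", 0.88)], "Plains Zebra", 0.0))  # False
```

Under this rule, `confidence = 0.0` as in the cell above means that any prediction whose name contains the keyword marks the image as relevant.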

### (B) ... Or Classify Smaller Subsets of Collection to avoid Timeout

In [None]:
# for i in range(0,10):
#     print(i)
#     flickr_img_dicts = img.get_flickr_img_dicts(db_obj, saveTo, numToClassify)
#     print(len(flickr_img_dicts))
#     sc.predict_image_relevancy(db_obj, saveTo, flickr_img_dicts, species_keyword, confidence)
# print('Done with set of 10')
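
The commented-out loop above runs a fixed ten rounds. A minimal sketch of deriving the batch sizes from the number of unlabeled images instead is shown here. `batch_sizes` is a hypothetical helper, and the commented usage assumes that `get_flickr_img_dicts` only returns images that are still unlabeled, which this notebook does not confirm.

```python
def batch_sizes(total, batch_size):
    """Split `total` items into consecutive batch sizes of at most `batch_size`."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    full, rest = divmod(total, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


print(batch_sizes(250, 100))  # [100, 100, 50]

# Possible use with the objects defined earlier in this notebook:
# for n in batch_sizes(len(res_list), 100):
#     flickr_img_dicts = img.get_flickr_img_dicts(db_obj, saveTo, n)
#     sc.predict_image_relevancy(db_obj, saveTo, flickr_img_dicts, species_keyword, confidence)
```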

FIXME: currently, the humpback whale specific - 30 full collection stores its relevant and wild bool values as strings, so our get_flickr_img_dicts
function returns nothing (the query would have to match relevant: "null" instead of a real null).
This bool -> string conversion happened when we exported our data from the flickr db to a csv and imported it into this dummy collection.
We need to go back and convert the values in these fields back to bool values.
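
One way to approach the fix described above, as a sketch only: map the string forms back to Python values before writing them to the collection. The exact strings produced by the CSV export are an assumption here ("True", "False", "null" and similar), and `restore_value` and `restore_doc` are hypothetical helpers.

```python
# Assumed string forms left behind by the CSV round trip (not confirmed).
STRING_TO_VALUE = {"true": True, "false": False, "null": None, "none": None, "": None}


def restore_value(value):
    """Convert a stringified bool/null back to bool/None; leave anything else as is."""
    if isinstance(value, str):
        key = value.strip().lower()
        if key in STRING_TO_VALUE:
            return STRING_TO_VALUE[key]
    return value


def restore_doc(doc, fields=("relevant", "wild")):
    """Return a copy of `doc` with only the listed fields restored."""
    return {k: restore_value(v) if k in fields else v for k, v in doc.items()}


print(restore_doc({"_id": 1, "relevant": "null", "wild": "False"}))
```

The restored values could then be written back with pymongo's `update_one` and a `$set` update. That step is left out here because it modifies the collection.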


# Visualizing MS Species Classifier Results

In [None]:
import ipyplot

In [None]:
## get images labeled as relevant and irrelevant
images = db_obj[saveTo].find({"relevant": True}, {"url": 1})
images_irrel = db_obj[saveTo].find({"relevant": False}, {"url": 1})

In [None]:
list_of_imgs = list(images)
list_of_imgs_irrel = list(images_irrel)

In [None]:
imgs_url = [dic['url'] for dic in list_of_imgs]
imgs_url_irrel = [dic['url'] for dic in list_of_imgs_irrel]

In [None]:
labels = [dic['_id'] for dic in list_of_imgs]
labels_irrel = [dic['_id'] for dic in list_of_imgs_irrel]

## Compare counts
- tabulate the count of relevant vs. irrelevant images for each collection

In [None]:
count_rel = len(imgs_url)
count_irrel = len(imgs_url_irrel)

data = {'relevant': count_rel, 'irrelevant': count_irrel}
df_counts = pd.DataFrame(data, index=[0])
print(df_counts)

## Plot Images in A Grid

In [None]:
ipyplot.plot_images(imgs_url, labels, max_images = 600, img_width=100)

## Double Checking Relevant Images and Labeling Truly Relevant Images as Wild/Not Wild

When you are done with the entire collection, update the table at https://mramir71.quip.com/ag3gALrvbh6K/Wildlife-Social-Media-Bias-Meeting-Notes by entering 'yes' in the column **filtered for wild/unknown/captive**.

## Select the Collection You Want to Filter

In [19]:
# saveTo = 'plains zebra general'
# saveTo = 'plains zebra general africa bbox'
saveTo = 'plains zebra specific'
# saveTo = 'plains zebra specific africa bbox'

print('You are working with the collection: ', saveTo)

You are working with the collection:  plains zebra specific africa bbox


In [20]:
## rename field and get the current mongoDB database collection object
db.renameField(saveTo, 'url_l', 'url')
db_obj = db.getDB()

In [None]:
## run this cell to see how many relevant images you have left to double check
amt_remaining_to_check = db_obj[saveTo].count_documents({"$and": [{"relevant": True}, {"double_checked": False}]})
print(amt_remaining_to_check)

### Run the cell below to start double checking the filtration. The steps of the update filtration process are:
1. Mark if the image is truly relevant (contains a real Plains zebra)
2. If the image is relevant, mark if it is a wild/unknown/captive encounter
    - **wild**: you can definitely tell that the Plains zebra is in the wild/national park. You can use the location coordinates (if available) to double check
    - **unknown**: you cannot tell if the Plains zebra is in the wild or a zoo. 
    - **captive**: you can definitely tell that the Plains zebra is in captivity/zoo. Look for "zoo" in the tags/description/title, and check whether the location coordinates point to an area where Plains zebras don't typically live
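
The "zoo" check in the **captive** step can be partly automated as a hint for the reviewer. Below is a minimal sketch, assuming each document carries `title`, `description` and `tags` fields. These field names are assumptions about the stored Flickr metadata, and `mentions_zoo` is a hypothetical helper. It only flags candidates and does not replace looking at the image.

```python
import re

# whole-word match, so "zoom" or "zoology" do not count as a zoo mention
ZOO_PATTERN = re.compile(r"\bzoos?\b", flags=re.IGNORECASE)


def mentions_zoo(doc, fields=("title", "description", "tags")):
    """Return True if any of the given text fields mentions a zoo."""
    for field in fields:
        value = doc.get(field) or ""
        if isinstance(value, (list, tuple)):
            value = " ".join(str(item) for item in value)
        if ZOO_PATTERN.search(str(value)):
            return True
    return False


print(mentions_zoo({"title": "Zebra at San Diego Zoo"}))
print(mentions_zoo({"title": "Zoom lens zebra", "tags": ["safari", "kenya"]}))
```

Because the match is on whole words, run-together tags such as "sandiegozoo" are not caught, so a missing hit does not mean that the encounter is wild.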

In [None]:
## run this cell to filter through the images
amount = 1
db.doubleCheckRelevantImages(saveTo, amount, first_round = False)

In [None]:
db.close()