# Acquiring images (Module 3)

This notebook is designed to be a standalone tutorial demonstrating how you can acquire images from the [iDigBio portal](https://www.idigbio.org/portal/search) based on your own research scope. For the remainder of the lessons in today's workshop we will be using a standard set of images provided by the workshop instructors. This module will show you how that set of images was originally acquired, and provide an opportunity for you to test out downloading images yourself.

Our goals are for you to be able to:
1. Define search parameters
1. Retrieve specimen media records that meet those parameters
1. Download media associated with those specimen records
1. Review media and conduct quality control

We will be accessing data from the iDigBio Portal via an Application Programming Interface (API). All that you need to know about APIs for this workshop is that they are essentially a way for an organization to allow external users to interact with their systems. If you would like a slightly more detailed overview, check out this video on [Reading data directly into your analysis script: Introduction to APIs](https://vimeo.com/444924504).

Before we begin, you will need to install the following packages if you have not already:

In [None]:
!pip install idigbio

In [None]:
!pip install pandas

In [None]:
!pip install requests

In [None]:
# shutil is part of the Python standard library, so it does not need to be installed with pip

## Define search parameters & retrieve specimen media records

First, we need to find the media records whose media files we want to download. Do this using the `search_media` function from the idigbio package, which lets you search for media records based on data contained in linked specimen records, such as species or collecting locality. You can learn more about this function from the [iDigBio API documentation](https://github.com/iDigBio/idigbio-search-api/wiki) and [iDigBio Python package documentation](https://pypi.org/project/idigbio/). In this example, we want to search for images of herbarium specimens of species in the genus Acer that were collected in California, United States.

In [15]:
# Load idigbio package
import idigbio

# Load pandas package
import pandas

# Specify that we want to return results as a dataframe
api = idigbio.pandas()

# Execute the search with a limit on retrieving only 10 results
mediarecords = api.search_media(rq={'genus': 'acer',
                                    'country': 'united states',
                                    'stateprovince': 'california',
                                    'hasImage': True},
                                limit = 10)

As a result of the code above, we now have a dataframe called `mediarecords`. Here's a peek at some of the important fields contained in this dataframe:

In [16]:
mediarecords[['accessuri','rights','format','records']]

uuid,accessuri,rights,format,records
e29dac8e-bb8b-4b9a-a466-ab60737de977,http://n2t.net/ark:/65665/m3d2d57e8d-4524-4cc9...,CC0,"tiff, jpeg, jpeg, jpeg, jpeg, jpeg",[c7591b04-ed68-49cc-b81e-625dfb759eb1]
a15c893e-885c-4213-9e21-eed63a8396ef,http://n2t.net/ark:/65665/m352c5c3c4-420c-4341...,CC0,"tiff, jpeg, jpeg, jpeg, jpeg, jpeg",[d320b657-4b96-48c1-a0a3-25a14941edca]
c327a5f5-eb40-4cdb-af17-5d9a48813e12,http://n2t.net/ark:/65665/m3e5ce0eef-d64d-4f80...,CC0,"tiff, jpeg, jpeg, jpeg, jpeg, jpeg",[01956fa0-1041-4966-822e-d4efd040b4ac]
4f107f1c-33b7-44b0-bbcb-5ba9ccf6dd40,http://n2t.net/ark:/65665/m328733e5b-c0a4-4ada...,CC0,"tiff, jpeg, jpeg, jpeg, jpeg, jpeg",[01956fa0-1041-4966-822e-d4efd040b4ac]
ad9d24c6-0c02-456e-8f2f-ad4f277e1016,http://n2t.net/ark:/65665/m369ad5509-8d9e-4472...,CC0,"tiff, jpeg, jpeg, jpeg, jpeg, jpeg",[fe270751-f9ab-4279-88d6-30a2aaa56dae]
2d8a7d7e-2b62-4ea5-825a-99e13542ec52,http://n2t.net/ark:/65665/m356d8049d-54a9-4412...,CC0,"tiff, jpeg, jpeg, jpeg, jpeg, jpeg",[f3c793c5-0bc5-46d6-a5de-212591d3d193]
0b1f8a56-84f1-4fef-b6c0-d25bda9a5274,http://n2t.net/ark:/65665/m3b77f80b1-0475-4ad9...,CC0,"tiff, jpeg, jpeg, jpeg, jpeg, jpeg",[41a1c12e-a8fa-49f7-ab8b-ff76b26e6a9d]
ed5d42e1-c255-40dc-8c94-c8dbf8646d8b,http://n2t.net/ark:/65665/m3119d46db-005c-4d6e...,CC0,"tiff, jpeg, jpeg, jpeg, jpeg, jpeg",[085d0883-a595-4d91-b55f-7f46a2405570]
9d6b9063-c930-4c3d-a2db-bb78186811e9,http://n2t.net/ark:/65665/m394f87783-6efe-463d...,CC0,"tiff, jpeg, jpeg, jpeg, jpeg, jpeg",[ad800b93-f818-4c51-a27a-317b6e26b504]
bb30376f-71f6-481a-a118-0fd760f1b3fa,http://n2t.net/ark:/65665/m37ac9f2ad-a9c2-428c...,CC0,"tiff, jpeg, jpeg, jpeg, jpeg, jpeg",[41a1c12e-a8fa-49f7-ab8b-ff76b26e6a9d]
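In the output above, some specimen record UUIDs in the `records` column appear on more than one row, meaning that several media records can link to the same specimen. If you want only one image per specimen, the sketch below shows one possible way to filter the dataframe. It uses a small made-up dataframe (`toy`, with invented identifiers) in place of `mediarecords`, and assumes each `records` entry is a list holding a single specimen UUID, as in the output above.

```python
import pandas as pd

# Toy stand-in for `mediarecords`: the index is the media record UUID and
# `records` holds a list with the UUID of the linked specimen record.
# All identifiers and URLs below are made up for illustration.
toy = pd.DataFrame(
    {
        "accessuri": [
            "http://example.org/a",
            "http://example.org/b",
            "http://example.org/c",
        ],
        "records": [["specimen-1"], ["specimen-1"], ["specimen-2"]],
    },
    index=pd.Index(["media-a", "media-b", "media-c"], name="uuid"),
)

# Lists cannot be hashed, so pull the first specimen UUID out as a plain string
specimen_id = toy["records"].map(lambda recs: recs[0])

# Keep only the first media record encountered for each specimen
one_per_specimen = toy[~specimen_id.duplicated()]

print(one_per_specimen.index.tolist())
```

Whether to keep the first, the last, or all media records per specimen depends on your research question, so treat this filter as optional.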


## Download media associated with these specimen records

Now that we know what media records are of interest to us, we need to isolate the URLs that link to the actual media files so that we can download them. In this example, we will demonstrate how to download files that are cached on the iDigBio server, as well as the original files hosted externally by the data provider. It is unlikely that you need to download two sets of images, so you can choose to execute either the steps related to "iDigBio" or to "external," depending on your preference.

_The code block below will assemble a list of download URLs for media files on the iDigBio server._

In [12]:
# Pull UUID values out of the `mediarecords` dataframe and into a list data structure
uuids = mediarecords.index.tolist()

# Set standard URL prefix for files cached on the iDigBio server
prefix_str = 'https://api.idigbio.org/v2/media/'

# Define a URL suffix to specify we want to download the full size images
# You can download lower resolution images by changing this string to '?size=webview'
suffix_str = '?size=fullsize'

# Create list of iDigBio media URLs by joining the prefix, each UUID from `mediarecords`, and the suffix
mediaurl_idigbio = [prefix_str + uuid + suffix_str for uuid in uuids]

# Show us what we just made
for url in mediaurl_idigbio:
    print(url)

https://api.idigbio.org/v2/media/7af81430-dc0f-4966-afe5-0a668c3250a9?size=fullsize
https://api.idigbio.org/v2/media/75ec8a37-8daf-4e74-b2a5-5c303c4a43f7?size=fullsize
https://api.idigbio.org/v2/media/bb2fdea6-e3e4-411d-a0a4-c22ead467250?size=fullsize
https://api.idigbio.org/v2/media/1f2dbb2b-75ba-48cb-b34c-1ca003b4a38d?size=fullsize
https://api.idigbio.org/v2/media/ea53c2af-184d-458b-9a4a-1a095235b854?size=fullsize
https://api.idigbio.org/v2/media/f7ff8c3e-0e26-4ba1-8b47-37211f845120?size=fullsize
https://api.idigbio.org/v2/media/e66162cb-7b7a-412d-82cf-b923aacc80c2?size=fullsize
https://api.idigbio.org/v2/media/0bd110c1-5a9f-48fc-9fd2-138c7c6ee15a?size=fullsize
https://api.idigbio.org/v2/media/1c5f5caa-12a9-4888-94f4-6026e40582af?size=fullsize
https://api.idigbio.org/v2/media/79925f66-bcf5-46f9-bc23-5512a9c21d6d?size=fullsize
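If you expect to switch between image sizes, the URL assembly can be wrapped in a small helper. The sketch below is one way to do that; `build_media_url` and `IDIGBIO_MEDIA_BASE` are names invented for this example, and the only `size` values assumed are the two mentioned in this notebook (`fullsize` and `webview`).

```python
from urllib.parse import urlencode

# URL prefix for files cached on the iDigBio server (same value as used above)
IDIGBIO_MEDIA_BASE = "https://api.idigbio.org/v2/media/"


def build_media_url(uuid, size="fullsize"):
    """Return the iDigBio media URL for one media record UUID."""
    return IDIGBIO_MEDIA_BASE + uuid + "?" + urlencode({"size": size})


# UUID taken from the first URL printed above
example_uuid = "7af81430-dc0f-4966-afe5-0a668c3250a9"

print(build_media_url(example_uuid))
print(build_media_url(example_uuid, size="webview"))
```

A list of URLs for the whole dataframe could then be built with `[build_media_url(u) for u in mediarecords.index]`.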


_The code block below will assemble a list of download URLs for data providers' original media files stored on external servers._

In [17]:
# Create list of external media URLs by pulling values for `accessuri` out of `mediarecords`
mediaurl_external = mediarecords.accessuri.tolist()

# Show us what we just made
for i in mediaurl_external: print(i)

http://n2t.net/ark:/65665/m3d2d57e8d-4524-4cc9-9bf3-0218332279fb
http://n2t.net/ark:/65665/m352c5c3c4-420c-4341-9b6f-0bb2691a4fa3
http://n2t.net/ark:/65665/m3e5ce0eef-d64d-4f80-9341-bb61cae08444
http://n2t.net/ark:/65665/m328733e5b-c0a4-4ada-8ab4-a4f56f510b39
http://n2t.net/ark:/65665/m369ad5509-8d9e-4472-914e-3a681bc7c01e
http://n2t.net/ark:/65665/m356d8049d-54a9-4412-ba49-b2bfa93a0cc3
http://n2t.net/ark:/65665/m3b77f80b1-0475-4ad9-b6bb-dd33dce8cb8e
http://n2t.net/ark:/65665/m3119d46db-005c-4d6e-a79b-a5be91585fb3
http://n2t.net/ark:/65665/m394f87783-6efe-463d-84b6-9d6afb5083ae
http://n2t.net/ark:/65665/m37ac9f2ad-a9c2-428c-affc-d9930d322fa2


## Downloading images

We can use the download URLs that we assembled in the step above to go and download each media file. For clarity, we will place files in two different folders, based on whether we downloaded them from the iDigBio server or an external server.

_The code block below will download media files from the iDigBio server._

In [None]:
# Load os package (used to create the download folder and build file paths)
import os

# Load requests package
import requests

# Load shutil package
import shutil

# Create a new directory to save media files in
# (the folder name 'images_idigbio' is an arbitrary choice; change it as you like)
out_dir = 'images_idigbio'
os.makedirs(out_dir, exist_ok=True)

# Initiate loop that will iterate through our list of URLs and download the file found at each
for image_url in mediaurl_idigbio:

    # Define a filename based on the UUID of the media record,
    # dropping the '?size=...' suffix so it does not end up in the filename
    uuid = image_url.split('/')[-1].split('?')[0]
    filename = os.path.join(out_dir, uuid + '.jpg')

    # Begin the process of downloading a file
    r = requests.get(image_url, stream=True)

    # Check that the file can be retrieved successfully
    if r.status_code == 200:

        # Decode any compressed transfer encoding; without this the saved file may be empty or unreadable
        r.raw.decode_content = True

        # Open a local file and copy the downloaded content into it
        with open(filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)

        # Report back on how things went
        print('Image successfully downloaded:', filename)
    else:
        print('Image could not be retrieved:', image_url)

_The code block below will download media files hosted externally by data providers._

In [None]:
# Load os package (used to create the download folder and build file paths)
import os

# Load requests package
import requests

# Load shutil package
import shutil

# Create a new directory to save media files in
# (the folder name 'images_external' is an arbitrary choice; change it as you like)
out_dir = 'images_external'
os.makedirs(out_dir, exist_ok=True)

# Initiate loop that will iterate through our list of URLs and download the file found at each
for image_url in mediaurl_external:

    # Define a filename based on the last segment of the provider's access URI
    # Note: the '.jpg' extension is assumed; a provider may serve a different format
    filename = os.path.join(out_dir, image_url.split('/')[-1] + '.jpg')

    # Begin the process of downloading a file
    r = requests.get(image_url, stream=True)

    # Check that the file can be retrieved successfully
    if r.status_code == 200:

        # Decode any compressed transfer encoding; without this the saved file may be empty or unreadable
        r.raw.decode_content = True

        # Open a local file and copy the downloaded content into it
        with open(filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)

        # Report back on how things went
        print('Image successfully downloaded:', filename)
    else:
        print('Image could not be retrieved:', image_url)
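The download loops above save every file with a `.jpg` extension, yet the `format` column shown earlier lists TIFF as well as JPEG. One possible refinement is to choose the extension from the `Content-Type` header of the response. The helper below is a minimal sketch of that idea; `extension_for` and the `EXTENSIONS` mapping are invented for this example and cover only a few common image types.

```python
# Hypothetical mapping from media type to file extension (extend as needed)
EXTENSIONS = {
    "image/jpeg": ".jpg",
    "image/tiff": ".tif",
    "image/png": ".png",
}


def extension_for(content_type, default=".jpg"):
    """Return a file extension for a Content-Type header value."""
    if not content_type:
        return default
    # Drop any parameters such as '; charset=...' and normalise the case
    media_type = content_type.split(";")[0].strip().lower()
    return EXTENSIONS.get(media_type, default)


print(extension_for("image/tiff"))
print(extension_for("image/jpeg; charset=binary"))
print(extension_for(None))
```

Inside a download loop, the header value would be available as `r.headers.get('Content-Type')` on the `requests` response, assuming the server sends that header.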

## Reviewing media and conducting quality control

- Remove images that do not depict specimens
- Address EXIF metadata issues
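Before the manual review steps listed above, a simple automated check can catch files that are not images at all, for example an HTML error page that was saved with a `.jpg` extension. The sketch below inspects the first bytes of each file against the well-known signatures of JPEG, PNG, and TIFF. The function name `sniff_image_format` is invented for this example, and the demonstration runs on two throwaway files rather than real downloads.

```python
import os
import tempfile

# Leading bytes ("magic numbers") of common image formats
SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"II*\x00": "tiff",  # little-endian TIFF
    b"MM\x00*": "tiff",  # big-endian TIFF
}


def sniff_image_format(path):
    """Return 'jpeg', 'png' or 'tiff' based on the file's first bytes, else None."""
    with open(path, "rb") as f:
        head = f.read(8)
    for signature, name in SIGNATURES.items():
        if head.startswith(signature):
            return name
    return None


# Demonstration on two throwaway files created in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.jpg")
    bad = os.path.join(tmp, "bad.jpg")
    with open(good, "wb") as f:
        f.write(b"\xff\xd8\xff\xe0" + b"\x00" * 16)  # starts like a JPEG
    with open(bad, "wb") as f:
        f.write(b"<html>Not found</html>")  # an error page, not an image
    results = {
        "good.jpg": sniff_image_format(good),
        "bad.jpg": sniff_image_format(bad),
    }

print(results)
```

A file for which the function returns `None` is a candidate for removal or re-download. This check says nothing about whether an image shows a specimen, so it complements the visual review rather than replacing it.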