<h1>Observation Data Scraper for the iNaturalist API</h1>

This notebook contains a scraper for retrieving data from the [iNaturalist API](https://www.inaturalist.org/pages/api+reference). The scraper collects observations based on user-specified parameters, such as species names, location, or media types (photos, sounds, etc.).

This script uses the "window" method of requesting data from the API. Because iNaturalist returns at most 10,000 observations for any one query, you must first tailor your parameters so that each search window matches no more than 10,000 results. Do not raise the request rate above 60 requests per minute, and do not send more than 10,000 requests per day (though that is difficult to reach). Download volume, which the program monitors, is capped at 5 GB per hour and 24 GB per day under iNaturalist's usage policy. The data meter starts from zero each time the program starts, so if you start and stop it several times in one day, monitor your total data usage manually. Should an error occur, the data collected so far is saved automatically.
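
As a rough illustration of the limits above, the hourly data meter comes down to two small checks. This is a minimal sketch, not the notebook's own implementation: the names `should_pause` and `seconds_until_reset` and the 0.15 GB safety margin are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

HOURLY_CAP_GB = 5  # hourly cap quoted in the paragraph above


def should_pause(gb_this_hour, cap_gb=HOURLY_CAP_GB, margin_gb=0.15):
    """True once the hourly total is within `margin_gb` of the cap (margin is an assumption)."""
    return gb_this_hour > cap_gb - margin_gb


def seconds_until_reset(hour_started, now, window=timedelta(hours=1)):
    """Seconds left in the current metering hour; 0.0 if that hour has already ended."""
    remaining = (hour_started + window - now).total_seconds()
    return max(0.0, remaining)


# Example: the hour began at 12:00 and it is now 12:45.
started = datetime(2025, 1, 1, 12, 0)
print(should_pause(4.9))                                               # True
print(seconds_until_reset(started, datetime(2025, 1, 1, 12, 45)))      # 900.0
```

Using `total_seconds()` rather than the `.seconds` attribute keeps the result correct when the difference is negative or longer than a day.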

The program assembles the data by stepping through incremental search "windows" that together cover every possible observation ID, then compiles and formats the results into a .csv file. A timer and data meter keep it within the download limits, so it can run in the background for as long as the job takes. For example, processing 700k observations may take several days. Multiple IPs, proxies, or VPNs would make this much faster, but using them violates the usage policy.
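
The window idea can be sketched as a generator that walks from the newest ID down to zero. The function name `id_windows` is hypothetical; the arithmetic mirrors how this notebook computes "id_above" and "id_below" from the latest ID and the window width.

```python
import math


def id_windows(latest_id, ids_per_window):
    """Yield (id_above, id_below) pairs, newest window first, covering IDs down to 0."""
    number_of_windows = math.ceil(latest_id / ids_per_window)
    for window in range(number_of_windows):
        id_below = latest_id - window * ids_per_window
        id_above = max(0, latest_id - (window + 1) * ids_per_window)
        yield id_above, id_below


# Toy numbers for readability; real runs use the latest observation ID.
for bounds in id_windows(250, 100):
    print(bounds)
# (150, 250)
# (50, 150)
# (0, 50)
```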

Note: iNaturalist's API is not intended for data scraping. While this script complies with their policies, I cannot condone its usage. I created this program for educational and proof-of-concept purposes. It is intended especially for retrieving information that is unavailable through their formal data export tool, and given enough time it can collect more data than that tool allows. Be sure to include appropriate attribution when using photos licensed under Creative Commons.<br>
<a href="https://www.inaturalist.org/pages/api+recommended+practices">iNaturalist's API recommended practices</a> <br>
<a href="https://www.inaturalist.org/observations/export">iNaturalist's formal data export tool</a>


<h2>Features:</h2>
<ul>
<li>Fetches observations from iNaturalist based on configurable parameters</li>
<li>Filters observation data according to a configurable field list</li>
<li>Handles data issues in the currently present fields</li>
<li>Fully complies with iNaturalist's API use and data downloading policy</li>
<li>Saves results to CSV for further analysis</li>
</ul>

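<h2>Prerequisites:</h2>
<ul>
<li>Python 3.12</li>
<li>Third-party libraries (install these first):
    <ul>
        <li>requests</li>
        <li>pandas</li>
    </ul>
</li>
<li>Standard-library modules (included with Python, nothing to install):
    <ul>
        <li>time</li>
        <li>json</li>
        <li>math</li>
        <li>os</li>
        <li>datetime</li>
    </ul>
</li>
</ul>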
<h3>Scroll down to set up your parameters to begin</h3>

In [2]:
#<--Click Run after prerequisite libraries are installed
import requests
import pandas as pd
import time
import json
import math
from datetime import datetime, timedelta
import os


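<h3>Step One: Define your search parameters and personal details, excluding "id_above" and "id_below".</h3>
Run the cell that defines Get_Latest_ID() (under Step Three below) before running this cell. If that function is not defined yet, total_ids falls back to a hard-coded ID that may be out of date.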
You may also add or remove any parameters as you desire. For a complete list of observation query parameters, see <a href="https://www.inaturalist.org/pages/api%2Breference#get-observations">the full documentation.</a>
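
To see what a parameter dictionary turns into on the wire, the sketch below builds the query URL with the standard library. The helper name `build_query_url` is hypothetical; the `requests` library performs an equivalent encoding when the dictionary is passed as `params=`.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.inaturalist.org/v1/observations"  # same endpoint as `url` in this notebook


def build_query_url(params):
    """Return the full request URL that a parameter dictionary would produce."""
    return f"{BASE_URL}?{urlencode(params)}"


example_params = {"place_id": 11, "per_page": 200, "created_before": "2025-02-04"}
print(build_query_url(example_params))
# https://api.inaturalist.org/v1/observations?place_id=11&per_page=200&created_before=2025-02-04
```

Pasting the printed URL into a browser is a quick way to check a new parameter before running the full scraper.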

In [3]:
#Change these request headers and parameters as needed. Add or remove fields in the Get_Info() function below as desired.

request_headers={   #change in step 1 to your own request headers (recommended). This is a generic headers sample.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
    "Accept": "application/json, text/html;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.5",
    "Connection": "keep-alive",
}

Your_File_Path = r'C:\YourFilePath'+'\\' #Change in step 1 (the r string only). filepath to folder of this notebook
url="https://api.inaturalist.org/v1/observations"
ids_per_window=100000    #change in step 2
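try: #No need to change
    total_ids = Get_Latest_ID() #defined in the large code block under Step Three
except NameError: #Get_Latest_ID() has not been defined yet, so fall back to a recent known ID
    total_ids = 260916966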
number_of_windows=math.ceil(total_ids/ids_per_window)
initial_window_number=0   #(As Necessary) Change this value to start a search at that window number. In case your execution is cut short, 
#you may resume at the next initial_window_number, following the last value in the last saved csv file name. The initial search always begins at 0.
per_page=200 #results per page request (maximum allowed=200)


the_parameters = {   #change in step 1
    "place_id": 11, #Use the codeblock below to decide which place_id to use, if any
    "per_page":per_page, 
    "created_before":"2025-02-04", #Keep in mind that the observation date could be long before the date it was posted to INaturalist (creation date)
    "id_above":total_ids-(initial_window_number+1)*ids_per_window, 
    "id_below":total_ids-(initial_window_number)*ids_per_window,
}




In [31]:
Get_Place_ID('new york') #Enter your desired location. Examples: Hawaii, USA, San Diego, Earth, Australia, new york city. The function prints its matches itself, so no print() wrapper is needed.




<h3>Step Two:</h3>
Run the large code block under Step Three (the one that starts with Get_Info()) first. Then run the following print statement and note the total results. As long as that number is below 10,000, this program will work. If it exceeds 10,000, decrease 'ids_per_window' (set just above the_parameters in Step One) until the total is around 8,000, to be safe. If it is far below 10,000, increase 'ids_per_window' by an order of magnitude or two to make the search more efficient. After changing 'ids_per_window', rerun the Step One cell so that the_parameters picks up the new window. The total results returned for your search parameters vary from window to window, but the windows covering the highest ID ranges (the most recent observations) generally yield the most results.

This program is currently set up to search through all possible iNaturalist IDs. If you only want data from a certain period, say everything before 2020, add the parameter "created_before":"YYYY-MM-DD" to the function Get_Latest_ID() below, then run that cell as well as the one containing the parameters. This avoids unnecessary requests for IDs that fall outside your date range.

You can also open the testdata.json file that saves into the program folder to see the total_results as well as a sample of the raw data returned by the API.
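
The tuning advice above can be turned into a first guess with simple proportional scaling. This is a sketch under a loud assumption: it treats results as evenly spread across IDs, which the API does not guarantee, so the suggestion should be rechecked with Get_Total_Results() afterwards. The function name `suggest_ids_per_window` is hypothetical.

```python
def suggest_ids_per_window(ids_per_window, total_results, target=8000):
    """Scale the window width so a similar window would hold about `target` results."""
    if total_results <= 0:
        return ids_per_window  # nothing to scale from; keep the current width
    return max(1, int(ids_per_window * target / total_results))


print(suggest_ids_per_window(100000, 16000))  # 50000   (too many results: halve the window)
print(suggest_ids_per_window(100000, 800))    # 1000000 (far too few: widen tenfold)
```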


In [None]:
print(Get_Total_Results())

<h3>Step Three: Choose your target fields</h3>
You can find a complete list of return fields <a href="https://www.inaturalist.org/pages/api+reference#get-observations">here</a>.
You may add or remove them from Get_Info() as necessary, following the current format, and they will be added into the final csv. In future updates, a feature to make this easier may be added.
Be sure to match the order of the data_columns with the Get_Info() function.
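
One possible way to make adding fields less repetitive is a small nested-lookup helper. This is a sketch, not part of the notebook: the name `safe_get` is hypothetical, and it returns the same 'None' placeholder string that Get_Info() uses for missing values.

```python
def safe_get(record, *path, default='None'):
    """Follow keys and list indexes into nested data; return `default` if any step is missing."""
    current = record
    for step in path:
        try:
            current = current[step]
        except (KeyError, TypeError, IndexError):
            return default
    return current


# A trimmed, made-up observation used only to exercise the helper.
observation = {
    'id': 5,
    'taxon': {'name': 'Quercus alba', 'conservation_status': None},
    'observation_photos': [],
}
print(safe_get(observation, 'taxon', 'name'))                                # Quercus alba
print(safe_get(observation, 'taxon', 'conservation_status', 'status_name'))  # None
print(safe_get(observation, 'observation_photos', 0, 'photo', 'url', default=None))  # None
```

With a helper like this, each optional field becomes a single line, and a required field can be checked by passing `default=None` and skipping the observation when the result is None.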

In [None]:
#Takes only the desired datapoints from each observation uploaded to INaturalist
def Get_Info(observation):
    extensions = ['.jpg','.jpeg','.png','.webp','.gif']
    result=[] 
    if observation.get('captive'): #only returns wild specimen observations; a missing 'captive' key is treated as wild
        return None
    try:
        result.append(observation['taxon']['name']) #Most specific taxon name given. Observation is voided altogether if there is no name.
    except(KeyError, TypeError):
        return None
    try:
        result.append(observation['id']) #Observation ID number
    except (KeyError, TypeError):
        result.append('None')
    try:
        result.append(observation['location']) #Location
    except (KeyError, TypeError):
        result.append('None')
    try:
        result.append(observation['positional_accuracy']) #Accuracy, in meters, of the location
    except (KeyError, TypeError):
        result.append('None')
    try:
        result.append(observation['observed_on']) #date of observation
    except (KeyError, TypeError):
        result.append('None')
    try:
        result.append(observation['species_guess']) 
    except (KeyError, TypeError):
        result.append('None')
    try:
        result.append(observation['taxon']['rank']) #rank of the taxon (family, genus, order, etc.)
    except (KeyError, TypeError):
        result.append('None')
    try:
        result.append(observation['taxon']['preferred_common_name']) #preferred common name of the specimen. May be in a native language or different than the common name
    except (KeyError, TypeError):
        result.append('None')
    try: #Retrieve the original-size photo url for any of the listed extensions. If not found, this observation is voided altogether
        photo_url=observation['observation_photos'][0]['photo']['url']
        matched = False
        for ext in extensions:
            if photo_url.lower().endswith(ext):
                result.append(photo_url[:-(len(ext)+6)]+'original'+ ext) #drops the 6-character size name before the extension ('square' is assumed) and substitutes 'original'
                matched = True
                break
        if not matched:
            return None
    except (KeyError, TypeError, IndexError):
        return None
    try: #Creative Commons photo copyright attribution
        result.append((observation['observation_photos'][0]['photo']['attribution'])) #Copyright attribution for fair use of the original photo
    except (KeyError, TypeError, IndexError):
        result.append('None')
    try:
        result.append(observation['taxon']['conservation_status']['status_name']) #conservation status. ex:threatened, vulnerable, etc.
    except (KeyError, TypeError):
        result.append('None')
    try: #Native, Introduced, or Endemic
        result.append(observation['taxon']['establishment_means']['establishment_means']) #Known Establishment means. ex: native, endemic, introduced, none, etc. 
    except (KeyError, TypeError):
        result.append('None')
    try:
        result.append(observation['obscured']) #exact location may be obscured by several hundred or thousand meters, due to user privacy or taxon privacy (in the case of rare or endangered species)
    except (KeyError, TypeError):
        result.append('None')
    try:
        result.append(observation['taxon']['extinct']) #Extinct or not (Boolean)
    except (KeyError, TypeError):
        result.append('None') 
    return result   

data_columns = ['taxon_name','id','location','positional_accuracy','observed_on','species_guess','taxon_rank','preferred_common_name',
                'photo_url','copyright','conservation_status','establishment_means','location_obscured','rank_extinct'] #csv column names



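def Merge_Files(file_1_name, file_2_name):
    #Your_File_Path already ends with a path separator, so the file names are appended directly
    alpha = pd.read_csv(f"{Your_File_Path}{file_1_name}")
    beta = pd.read_csv(f"{Your_File_Path}{file_2_name}")
    merged = pd.concat([alpha, beta], ignore_index=True, sort=False)
    #Each saved file carries its own "index" column, so the same observation can differ in that column; de-duplicate on the observation id instead
    merged.drop_duplicates(subset="id").to_csv(f"{Your_File_Path}All_Data.csv", index=False)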


def Get_Total_Results(): #returns the total results for the search window currently set in the_parameters
    check_response = requests.get(url, params=the_parameters, headers=request_headers)
    if check_response.status_code == 200: #response from API received
        test_data = check_response.json()
        with open("testdata.json", "w") as json_file:
            json.dump(test_data, json_file, indent=4)
        return (test_data["total_results"])
    print(f"Error: {check_response.status_code}") #a failed request returns None, so report why

#retrieve the ID of the most recent observation (according to the date parameters) uploaded to INaturalist
def Get_Latest_ID():
    parameters = {
    "per_page":1,
    #"created_before":"YYYY-MM-DD"
    }
    check_response = requests.get(url, params=parameters, headers=request_headers)
    if check_response.status_code == 200: #response from API received
        test_data = check_response.json()
        return int(test_data["results"][0]['id'])
    
#Retrieve the proper location ID 
def Get_Place_ID(place):
    places_url = "https://api.inaturalist.org/v1/places/autocomplete"
    params = {"q": f"{place}"}
    response = requests.get(places_url, params=params)
    if response.status_code == 200:
        data = response.json()
        for place in data["results"]:
            print(f"Place Name: {place['display_name']} - Place ID: {place['id']}")
    else:
        print(f"Error: {response.status_code}")

#returns [False, seconds_remaining] while the time n hours after start_time is still ahead, otherwise [True] (that time has already passed).
def CompareTime(start_time, n):
    remaining=(start_time+timedelta(hours=n)-datetime.now()).total_seconds() #total_seconds() is signed and includes whole days, unlike .seconds
    if remaining>0:
        return [False,int(remaining)] #returns the time (in seconds) until the next hour interval
    else:
        print(f'It has been {abs(remaining)/3600:.2f} hours since {n} hours from the start time. Resetting hourly data consumption meter.')
        return [True]
    
#Retrieves the raw data from the API and saves it to a .csv
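def Get_Data():
    global initial_window_number #this name is reassigned below; without the global declaration Python treats it as a local and raises UnboundLocalError on the next line
    init_num=initial_window_number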
    download_size = 0 #ID window current download size
    total_download_size = 0 #total daily download size
    hrs=1 #Hours (1-6) counter
    days=0 #days counter
    all_data=pd.DataFrame(columns=data_columns)
    start_time=datetime.now()
    try:
        while True:
            while initial_window_number<number_of_windows: #cycle through each window of IDs (windows are numbered 0 to number_of_windows-1)
                page=1
                page_limit=1
                while (page<=(page_limit)): #cycle through each page of that window
                    the_parameters["page"]=page
                    response = requests.get(url, params=the_parameters, headers=request_headers)
                    if response.status_code == 200: #response from API received
                        data = response.json()
                        with open("data.json", "w") as json_file:
                            json.dump(data, json_file, indent=4)
                        page_limit = (math.ceil(data['total_results']/per_page)) #Max amount of pages in that window
                        for observation in data['results']:
                            if (x := Get_Info(observation)) is not None: #voids the observations that are missing important data points, which return None in Get_Info()
                                if (len(x) == len(data_columns)): #in case any columns are mismatched somehow
                                    all_data.loc[len(all_data)]=x
                    else: #response from API not received
                        print(f"Error: {response.status_code}",'\n', response.text)
                        all_data.to_csv(f'all_data(failed){init_num}-{initial_window_number-1}outof{number_of_windows}.csv', index_label="index") 
                        raise SystemExit("Stopping execution")

                    download_size+= (s:=os.path.getsize(f"{Your_File_Path}data.json")/1024/1024/1024) #count each page request size in GB
                    total_download_size +=s
                    print(f"\rDownloaded data this hour: {download_size:.2f} GB. ", end="") #download size live update
                    page+=1
                    if(download_size)>4.85: #if more than 4.85 GB have been downloaded and an hour hasn't passed, pause until the next hour begins.
                        all_data.to_csv(f'alldata_windows_hr{hrs}_{init_num}_{initial_window_number-1}out_of{number_of_windows}.csv', index_label="index") #save once per hour block.
                        wait=CompareTime(start_time,hrs) #call once so the reported and slept durations agree
                        if wait[0]==False:
                            print(f'{wait[1]//60} minutes until the next 5GB cycle.')
                            time.sleep(wait[1]) #wait until the next hour interval, then continue
                        download_size = 0
                        hrs+=1
                    elapsed_hours=(datetime.now()-start_time).total_seconds()/3600 #total_seconds() keeps counting past 24 hours, unlike .seconds
                    if elapsed_hours>=hrs: #hour number 'hrs' ended before 5 GB was downloaded: move to the current hour and reset the hourly meter
                        hrs=int(elapsed_hours)+1
                        download_size = 0
                    if((total_download_size-days*23.85)>=23.85): #daily download threshold is reached.
                        print(f"Max Daily Data reached. Window {initial_window_number} incomplete. Continuing in 19 hours.")
                        download_size=0
                        days+=1 #count the completed day; otherwise this pause would repeat after every later page
                        time.sleep(68400)
                    time.sleep(1) #Maximum allowed request rate is 1 per second
                print(f'\rWindow {initial_window_number} Complete: {(initial_window_number+1)/number_of_windows:.2%}')
                initial_window_number+=1 #Increment to the next "window" of ID's 
                the_parameters["id_below"]=max(1,total_ids-(initial_window_number)*ids_per_window) #same window arithmetic as the_parameters in Step One, instead of hard-coded values
                the_parameters["id_above"]=max(0,total_ids-(initial_window_number+1)*ids_per_window)
            all_data.to_csv(f'all_data_windows{init_num}-{initial_window_number-1}outof{number_of_windows}.csv', index_label="index") #save final csv file once the end is reached. A '/' in the name would be read as a folder separator.
            print("Data Collection Complete.")
            return #raising SystemExit here would be caught by the handler below and reported as an error
    except BaseException as err: #saves progress in case of an error or interruption of any kind (BaseException already includes KeyboardInterrupt)
        all_data.to_csv(f"all_data_windows(failed){init_num}-{max(init_num,initial_window_number-1)}outof{number_of_windows}.csv", index_label="index")
        print(f'Error ({type(err).__name__}). Processed data has been saved.')


<h3>Step 4: Run the program</h3>
Run the following code block. Your data will be saved into csv files named after the range of search windows it represents. 

In [20]:
Get_Data()
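
If a run is interrupted, the window to resume from can be read off the last saved file name. The sketch below assumes the naming patterns used by Get_Data() (for example `...0-41outof2610.csv` or `..._0_41out_of2610.csv`); the helper name `next_window_from_filename` is hypothetical. One caveat: when a run fails inside its very first window, the file name still repeats the starting number, so check that case by hand.

```python
import re

# Matches "<first>-<last>outof<total>.csv" and "<first>_<last>out_of<total>.csv"
_WINDOW_RANGE = re.compile(r'(\d+)[-_](\d+)out_?of(\d+)\.csv$')


def next_window_from_filename(filename):
    """Return the window number to resume from, or None if the name has no window range."""
    match = _WINDOW_RANGE.search(filename)
    if match is None:
        return None
    last_completed = int(match.group(2))
    return last_completed + 1


print(next_window_from_filename('all_data_windows(failed)0-41outof2610.csv'))  # 42
print(next_window_from_filename('alldata_windows_hr3_0_41out_of2610.csv'))     # 42
```

The returned number is the value to assign to initial_window_number in Step One before rerunning.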




<h3>Step 5: Compile</h3>
Your data may be split across multiple csv files, depending on how much you download and how many interruptions occur. They will need to be merged together without duplicates. Use the following code with the proper file names to do so.

In [None]:
Merge_Files("File1.csv","File2.csv")
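
When there are more than two files, a loop over all of them may be more convenient. The following is a minimal sketch using only the standard-library `csv` module rather than pandas; the function name `merge_csv_files` is hypothetical, and it assumes every file has an "id" column as produced by this notebook. It keeps all rows in memory, so it suits files of moderate size.

```python
import csv


def merge_csv_files(input_paths, output_path, key="id"):
    """Concatenate CSV files, keeping the first row seen for each `key` value."""
    seen = set()
    fieldnames = None
    rows = []
    for path in input_paths:
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            if fieldnames is None:
                fieldnames = reader.fieldnames  # header of the first file sets the columns
            for row in reader:
                if row[key] not in seen:
                    seen.add(row[key])
                    rows.append(row)
    if fieldnames is None:
        raise ValueError("no readable input files were given")
    with open(output_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)  # number of unique observations written
```

A call such as `merge_csv_files(["File1.csv", "File2.csv", "File3.csv"], "All_Data.csv")` would write one de-duplicated file and return its row count.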