# Data Mining US National Park API
The [US National Park Service has an API](https://www.nps.gov/subjects/developer/api-documentation.htm) that can be used to get a variety of information about campgrounds, parks, events, and much more. In order to access the API, [you must register](https://www.nps.gov/subjects/developer/get-started.htm) with a name and e-mail and receive an API key. There is [a guide to using the API](https://www.nps.gov/subjects/developer/guides.htm); however, I had some issues with the information provided, and I suspect the API has been changed without the documentation being updated. Below I go over what worked for me.

We will be using Python's requests library to make HTTP requests to the API. I began this project using urllib but eventually switched to the requests library because it is easier to use.

In [1]:
import requests # Used to make HTTP requests to the API
import json # Used to process JSON strings

### Accessing the 'parks' endpoint
Below we use our API key and access the 'parks' endpoint. NOTE that the parameters dict uses "api_key" rather than the "Authorization" key shown in the [Python code snippet on the NPS website](https://www.nps.gov/subjects/developer/guides.htm).

We are using v1 of the API because v0 is deprecated. Also, you can only make 500 API requests per hour. The default number of results returned is 50. The data is returned in JSON format with 4 key/value pairs in the root object: limit, total, start, data.

- __limit__ is how many entries in data were returned. By default 50 entries are requested.
- __total__ is how many entries are available to be requested (496 as of this writing)
- __start__ works like an offset into the data. We start at 1 by default, but can request to start elsewhere.
- __data__ is the data that we requested (more on this below)

In [2]:
API_Key_NPS = "YOUR_API_KEY_HERE" # Insert your API key here
endpoint = "https://developer.nps.gov/api/v1/parks?"

parameters = {"api_key":API_Key_NPS} # use "api_key" not "Authorization"
response = requests.get(endpoint, params=parameters)
print("response status code: " + str(response.status_code))
data_all = response.json() # Interpret bytes as JSON format and convert to a Python dict
print('limit: ' + str(data_all['limit'])) # str() keeps this working whether the API returns strings or numbers
print('total: ' + str(data_all['total']))
print('start: ' + str(data_all['start']))

response status code: 200
limit: 50
total: 496
start: 1
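
The paging fields may arrive as strings or as integers depending on the API version (an assumption worth guarding against). Here is a minimal sketch that normalizes them to integers and works out how many requests a full download would take. The helper name `paging_info` is hypothetical, and the sample dict simply mirrors the values printed above.

```python
import math

def paging_info(payload):
    """Return (limit, total, start, requests_needed) as integers.

    Assumes `payload` has the 'limit', 'total' and 'start' keys described
    above, each either an int or a string of digits.
    """
    limit = int(payload["limit"])
    total = int(payload["total"])
    start = int(payload["start"])
    # Number of requests of size `limit` needed to cover `total` entries
    requests_needed = math.ceil(total / limit) if limit > 0 else 0
    return limit, total, start, requests_needed

# Sample payload mirroring the printed values (no API call is made here)
sample = {"limit": "50", "total": "496", "start": "1", "data": []}
print(paging_info(sample))  # (50, 496, 1, 10)
```

With the default limit of 50, ten requests would cover all 496 entries, well under the 500 requests per hour allowance.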


### What data do we get back?
Let's take a closer look at the data that we got back. The data key within data_all is a list of all of the parks. We can access each park by index and print out the value of each key. We get useful information like description, name, latitude/longitude, weather, park code, etc.

In [3]:
list_of_parks = data_all['data']
for key in list_of_parks[0].keys(): # iterate over all information about the first park in the list
    print(key + ': ' + list_of_parks[0][key])

directionsUrl: http://www.nps.gov/afam/planyourvisit/directions.htm
description: Over 200,000 African-American soldiers and sailors served in the U.S. Army and Navy during the Civil War. Their service helped to end the war and free over four million slaves. The African American Civil War Memorial honors their service and sacrifice.
designation: 
url: https://www.nps.gov/afam/index.htm
states: DC
latLong: lat:38.916554, long:-77.025977
parkCode: afam
fullName: African American Civil War Memorial
weatherInfo: Washington DC gets to see all four seasons. Humidity will make the temps feel hotter in summer and colder in winter.

Spring (March - May) Temp: Average high is 65.5 degrees with a low of 46.5 degrees

Summer (June - August) Temp: Average high is 86 degrees with a low of 68.5 degrees

Fall (September - November) Temp: Average high is 68 degrees with a low of 51.5 degrees

Winter (December - February) Temp: Average high is 45 degrees with a low of 30 degrees

(Source: www.usclimateda
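
The latLong value comes back as a single string such as `lat:38.916554, long:-77.025977`, so it needs parsing before it can be used for mapping. Here is a minimal sketch, assuming every non-empty value follows that `lat:..., long:...` pattern. The function name `parse_lat_long` is hypothetical, and returning `None` for anything unparseable is a design choice made for this example.

```python
def parse_lat_long(lat_long):
    """Parse a string like 'lat:38.916554, long:-77.025977' into floats.

    Returns (latitude, longitude), or None if the string is empty or not
    in the expected 'lat:..., long:...' form.
    """
    parts = {}
    for piece in lat_long.split(","):
        name, sep, value = piece.partition(":")
        if sep:  # only keep pieces that actually contain a colon
            parts[name.strip()] = value.strip()
    try:
        return float(parts["lat"]), float(parts["long"])
    except (KeyError, ValueError):
        return None

print(parse_lat_long("lat:38.916554, long:-77.025977"))  # (38.916554, -77.025977)
print(parse_lat_long(""))                                 # None
```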

### Get ALL the data!
In order to get all the data, we will need to send a request with limit >= 496. We can set this limit within our parameters dictionary as shown below. NOTE: If you get a status code of 500 when attempting this, see the [Extra: Get all the data! (a little bit at a time)](#get-data-a-little-bit-at-a-time) section below.

In [4]:
parameters = {"api_key":API_Key_NPS, "limit":500} # limit is arbitrary value larger than total number of parks
response = requests.get(endpoint, params=parameters)
print("response status code: " + str(response.status_code))
data_all = response.json() # Interpret bytes as JSON format and convert to a Python dict
print('limit: ' + str(data_all['limit'])) # str() keeps this working whether the API returns strings or numbers
print('total: ' + str(data_all['total']))
print('start: ' + str(data_all['start']))

response status code: 200
limit: 500
total: 496
start: 1


### Convert data to pandas DataFrame and pickle
If we want to process the data easily, pandas is a good option. We can also store the data on our local hard drive as a pickle file (.pkl) so that we don't have to access the API again.

In [5]:
import pandas as pd

df_parks = pd.DataFrame(data_all['data']) # convert list of json objects to dataframe
df_parks = df_parks.drop_duplicates() # remove duplicate rows, in case the API returned any park more than once
print(df_parks)
df_parks.to_pickle('all_park_data.pkl')

                                           description  \
0    For over a century people from around the worl...   
1    Acadia National Park protects the natural beau...   
2    From the sweet little farm at the foot of Penn...   
3    Over 200,000 African-American soldiers and sai...   
4    African Burial Ground is the oldest and larges...   
5    During the 1890s, scientists rediscovered what...   
6    Established in 2000 for the preservation, prot...   
7    The headwaters of Alagnak Wild River lie withi...   
8    Alaska’s parks, forests, and refuges are rich ...   
9    Alcatraz Island offers a close-up look at the ...   
10   During World War II the remote Aleutian Island...   
11   13,000 years ago this site was already well-kn...   
12   The first railroad to circumvent the Allegheny...   
13   American Memorial Park honors the American and...   
14   An oasis in the desert, Amistad National Recre...   
15   Welcome to Anacostia Park, your neighborhood n...   
16   The Camp 

### Visualize data
It's important to know what the data looks like, so we take a quick look at its columns and designations. We can see that there are many different types of NPS designations and some overlap with each other. There are separate designations for 'National Park', 'National and State Parks', and 'National Park & Preserve'. This makes it difficult to sort out the National Parks that may also have other designations.

In [6]:
print(df_parks.columns) # what information is available on each park
print(df_parks.shape)

print(df_parks['designation'].value_counts()) # Determine the frequency of each 'designation'

# Attempt to sort out just the 'National Parks'
nationalParks = df_parks[df_parks['fullName'].str.contains("National Park")]
print(nationalParks.shape)
print(nationalParks.fullName)

Index([u'description', u'designation', u'directionsInfo', u'directionsUrl',
       u'fullName', u'id', u'latLong', u'name', u'parkCode', u'states', u'url',
       u'weatherInfo'],
      dtype='object')
(496, 12)
National Monument                                   80
National Historic Site                              79
National Historical Park                            57
National Park                                       50
                                                    35
National Heritage Area                              22
National Memorial                                   18
National Historic Trail                             17
National Recreation Area                            16
National Battlefield                                11
Park                                                11
National Seashore                                   10
National Preserve                                    9
National Military Park                               9
National Park & Pr
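
One alternative to matching on fullName is to filter on the designation field against an explicit set of designations. The sketch below uses plain Python on a list of records so it runs without the API. The sample records are made up for illustration, and which designations count as a "National Park" is a judgment call, not something the API defines.

```python
from collections import Counter

# Designations treated as "National Park" for this sketch. The three names
# come from the text above; which ones to include is a judgment call.
NATIONAL_PARK_DESIGNATIONS = {
    "National Park",
    "National and State Parks",
    "National Park & Preserve",
}

def select_national_parks(parks):
    """Keep records whose 'designation' is in NATIONAL_PARK_DESIGNATIONS."""
    return [p for p in parks if p.get("designation") in NATIONAL_PARK_DESIGNATIONS]

# Made-up sample records (not real API output)
sample_parks = [
    {"fullName": "Example A National Park", "designation": "National Park"},
    {"fullName": "Example B National and State Parks", "designation": "National and State Parks"},
    {"fullName": "Example C National Park & Preserve", "designation": "National Park & Preserve"},
    {"fullName": "Example D National Parks Office", "designation": ""},
    {"fullName": "Example E", "designation": "National Monument"},
]

selected = select_national_parks(sample_parks)
print(Counter(p["designation"] for p in selected))
print([p["fullName"] for p in selected])
```

Note that "Example D" would match a substring search for "National Park" in fullName but is excluded here because its designation is empty. With the DataFrame from this notebook, the equivalent filter would be `df_parks[df_parks['designation'].isin(NATIONAL_PARK_DESIGNATIONS)]`.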

### Extra: Get all the data! (a little bit at a time) <a class="anchor" id="get-data-a-little-bit-at-a-time"></a>
When I initially tried to request all 496 entries at once from the API, I kept getting response status codes of 500 (Internal Server Error). If this happens to you, you can use the work-around below to only request a few entries at a time and append the data as you go along.

In [7]:
responseList = [] # store json responses in a list

for i in range(0, 510, 10): # There are currently 507 parks returned; step through them 10 at a time.
    parameters = {"api_key":API_Key_NPS, 'limit':'10', 'start':str(i)} # request 10 entries per call, starting at offset i
    response = requests.get(endpoint, params=parameters)
    print("request #" + str(i) + " response status code: " + str(response.status_code))
    data = response.json() # Interpret bytes as JSON format and convert to a Python dict
    responseList.append(data["data"])
    print("length of response list " + str(len(responseList)))

request #0 response status code: 200
length of response list 1
request #10 response status code: 200
length of response list 2
request #20 response status code: 200
length of response list 3
request #30 response status code: 200
length of response list 4
request #40 response status code: 200
length of response list 5
request #50 response status code: 200
length of response list 6
request #60 response status code: 200
length of response list 7
request #70 response status code: 200
length of response list 8
request #80 response status code: 200
length of response list 9
request #90 response status code: 200
length of response list 10
request #100 response status code: 200
length of response list 11
request #110 response status code: 200
length of response list 12
request #120 response status code: 200
length of response list 13
request #130 response status code: 200
length of response list 14
request #140 response status code: 200
length of response list 15
request #150 response status c
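
The loop above hard-codes the upper bound of 510. A possible variation is to read `total` from each response and stop once it has been covered. Below is a minimal sketch of that idea. `fetch_all` and `fetch_page` are hypothetical names, and the fake fetcher stands in for the real API so the example runs offline without a key.

```python
def fetch_all(fetch_page, page_size=10, first_start=0):
    """Collect every entry by requesting `page_size` entries at a time.

    `fetch_page(start, limit)` is a hypothetical callable that returns the
    decoded JSON dict for one request (with 'total' and 'data' keys).
    """
    entries = []
    start = first_start
    while True:
        payload = fetch_page(start, page_size)
        page = payload["data"]
        entries.extend(page)  # extend keeps one flat list of parks
        start += page_size
        # Stop on an empty page or once the reported total is covered
        if not page or start - first_start >= int(payload["total"]):
            break
    return entries

# Offline stand-in for the API: 25 made-up records
FAKE_PARKS = [{"parkCode": "p%03d" % n} for n in range(25)]

def fake_fetch_page(start, limit):
    return {"total": str(len(FAKE_PARKS)), "data": FAKE_PARKS[start:start + limit]}

all_parks = fetch_all(fake_fetch_page)
print(len(all_parks))  # 25
```

Against the real API, `fetch_page` could wrap `requests.get(endpoint, params={"api_key": API_Key_NPS, "limit": limit, "start": start})`, call `response.raise_for_status()` so that a 500 error is raised rather than parsed, and return `response.json()`.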

### Convert responseList to pandas DataFrame
Now we can convert the list of responses into a pandas DataFrame. We also store it as a pickle on the hard drive so we don't have to send requests to the sketchy NPS servers anymore.

In [8]:
parks = pd.concat([pd.DataFrame(x) for x in responseList], ignore_index=True) # convert list of json objects to dataframe
parks = parks.drop_duplicates() # remove duplicate parks (since we requested 510 and there are only 507)
print(parks)
parks.to_pickle('all_park_data_2.pkl')

                                           description  \
0    Hoodoos (irregular columns of rock) exist on e...   
1    The mounds preserved here are considered sacre...   
2    First Ladies National Historic Site consists o...   
3    Fort Davis is one of the best surviving exampl...   
4    For centuries, the Oneida Carrying Place, a si...   
5    Politics before the Civil War was a whirlwind ...   
6    Aviation is chock-full of tradition & history ...   
7    Welcome to National Capital Parks-East. We inv...   
8    After 10,000 years, the people of South Texas ...   
9    Discover what it took for the United States to...   
10   Home to the National Woman's Party for nearly ...   
11   Though a short distance from the urban areas o...   
12   Almost 70 miles (113 km) west of Key West lies...   
14   Fort Bowie witnessed almost 25 years of confli...   
15   Albert Gallatin is best remembered for his thi...   
16   Harriet Tubman was guided by a deep faith and ...   
17   The "Monu
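
If you would rather de-duplicate before building the DataFrame, the list of pages can be flattened and filtered in plain Python. This is a minimal sketch, assuming parkCode uniquely identifies a park (an assumption, not something confirmed by the API documentation). `flatten_unique` is a hypothetical helper and the park codes below are placeholders.

```python
from itertools import chain

def flatten_unique(pages, key="parkCode"):
    """Flatten a list of pages into one list, keeping the first record per key."""
    seen = set()
    unique = []
    for record in chain.from_iterable(pages):
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique

# Two placeholder pages that overlap on one record
pages = [
    [{"parkCode": "aaaa"}, {"parkCode": "bbbb"}],
    [{"parkCode": "bbbb"}, {"parkCode": "cccc"}],
]
print([r["parkCode"] for r in flatten_unique(pages)])  # ['aaaa', 'bbbb', 'cccc']
```

The result can be passed straight to `pd.DataFrame(...)`, which avoids concatenating one small DataFrame per page.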