# Assignment 1-2: Data Collection Using Web APIs

## Objective

Many websites (such as Twitter, Yelp, and Spotify) provide free APIs that let users access their data. *API wrappers* simplify the use of these APIs by wrapping API calls into easy-to-use Python functions. At SFU, we are developing a unified API wrapper, called [DataPrep.Connector](https://docs.dataprep.ai/user_guide/connector/introduction.html#userguide-connector), which offers a single programming interface for collecting data from a variety of Web APIs.

In this assignment, you will learn the following:

* How to ask insightful questions about data
* How to collect data from Web APIs using DataPrep.Connector

**Requirements:**

1. Please use [pandas.DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) rather than spark.DataFrame to manipulate data.

2. Please follow the Python code style (https://www.python.org/dev/peps/pep-0008/). If the TA finds your code hard to read, you will lose points. This requirement applies for the whole semester.

## Preliminary

DataPrep.Connector is easy to learn. After watching this 10-minute [PyData Global 2020 talk](https://www.youtube.com/watch?v=56qu-0Ka-dA), you should know how to use it.

If you want to know more, below are some other useful resources.

* [Quick Introduction](https://github.com/sfu-db/dataprep#connector)
* [User Guide](https://sfu-db.github.io/dataprep/user_guide/connector/connector.html) 
* [Examples](https://github.com/sfu-db/dataprep/tree/develop/examples)
* [Fetch and analyze COVID-19 tweets using DataPrep](https://www.youtube.com/watch?v=vvypQB3Vp1o)

## Overview

This is a **group** assignment. 
Please check your group in this [PDF file](https://coursys.sfu.ca/2022sp-cmpt-733-g1/pages/Web-API-Assignment/view).

To do this assignment, your group needs to go through four steps:

1. Select a new Web API that is not listed on https://github.com/sfu-db/APIConnectors. 
2. Create a configuration file for the API (see tutorials at [link1](https://github.com/sfu-db/APIConnectors/blob/develop/CONTRIBUTING.md#add-configuration-files) and [link2](https://github.com/sfu-db/EZHacks-tutorial/blob/master/2.%20Tutorial.ipynb)). 
3. Come up with four questions about the API. 
4. Write code to answer these questions one by one.

For Step 3, please make sure your questions are **good**.

## What are "good questions"?

Please use the following criteria to judge whether your questions are good.

1. Good questions need to be useful. That is, they are either common questions asked about the API or are exploring novel use cases.
2. Good questions need to be diverse. That is, they cover different aspects of the API. 
3. Good questions cover different difficulty levels. That is, they include both easy and difficult questions, where difficulty can be measured by the number of lines of code or the number of input parameters.

The following shows four good questions about the Yelp API. The corresponding code can be found at this [link](https://github.com/sfu-db/DataConnectorConfigs#yelp----collect-local-business-data).

* Q1. What's the phone number of Capilano Suspension Bridge Park?
* Q2. Which yoga store has the highest review count in Vancouver?
* Q3. How many Starbucks stores are in Seattle and where are they?
* Q4. What are the ratings for a list of restaurants?

**Why are they useful?**
* Q1 is useful because "Capilano Suspension Bridge Park" is one of the most popular tourist attractions in Vancouver.
* Q2 is useful because a yoga fan wants to find out the most popular yoga store in Vancouver. 
* Q3 is useful because Starbucks was founded in Seattle.
* Q4 is useful because people often rely on Yelp ratings to decide which restaurant to go to.

**Why are they diverse?**

This is because the [code](yelp-code.png) written to answer them has different inputs or outputs.
* Q1 takes `term` and `location` as input and returns 1 record with attributes `name` and `phone` 
* Q2 takes `categories`, `location`, and `sort_by` as input and returns 1 record with attributes `name` and `review_count`
* Q3 takes `term` and `location` as input and returns n records with attributes `name`, `address`, `city`, `state`, `country`, `zipcode`
* Q4 takes a list of restaurant `names` as input and returns n records with attributes `name`, `rating`, `city`

**Why are they increasingly difficult?**
* Q1 (4 lines of code, 2 query parameters)
* Q2 (4 lines of code, 3 query parameters)
* Q3 (5 lines of code, 2 query parameters)
* Q4 (11 lines of code, 2 query parameters)
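
As a rough way to check these two metrics for your own answers, the sketch below counts non-blank, non-comment lines and the keyword arguments passed to any `.query(...)` call. The helper names `count_code_lines` and `count_query_params`, and the sample snippet itself, are illustrative assumptions; they are not part of DataPrep.Connector or of the linked Yelp solution. The snippet omits `await` only to keep the example minimal.

```python
import ast


def count_code_lines(source):
    """Count lines that are neither blank nor pure comments."""
    lines = (line.strip() for line in source.splitlines())
    return sum(1 for line in lines if line and not line.startswith("#"))


def count_query_params(source):
    """Count keyword arguments across all `.query(...)` calls in the source."""
    total = 0
    for node in ast.walk(ast.parse(source)):
        is_query_call = (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "query"
        )
        if is_query_call:
            total += len(node.keywords)
    return total


# Hypothetical snippet in the style of Q1 (all names are placeholders)
snippet = '''
# look up one business
df = connector.query("businesses", term="Capilano", location="Vancouver")
result = df[["name", "phone"]]
'''

code_lines = count_code_lines(snippet)
query_params = count_query_params(snippet)
print(code_lines, query_params)
```

These counts are only a proxy: a question can be hard for reasons that neither metric captures, such as needing to combine several tables.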

Please note that you have to use DataPrep.Connector to get data from the Web API. If DataPrep.Connector cannot meet your needs, please post your questions on Teams (Channel: Assignment 1). We will help you. 

## Now, it's your turn. :) 

Please write down your questions and the corresponding code for your assigned API. 

In [5]:
!cat ebird/_meta.json

{
    "tables": [
        "observation",
        "product",
        "geo",
        "hotspot",
        "taxonomy",
        "region",
        "historicObservation",
        "speciesList",
        "recentNearbyObservations",
        "nearestObservation"
    ]
}

In [6]:
## Provide your API key here for TAs to reproduce your results
from dataprep.connector import Connector
import pandas as pd
dataConnector = Connector('./ebird', _auth={"access_token": "p6otupunpquj"})


### Q1 Bird sightings around Whistler in the last 30 days

In [7]:
## Write your code

latitude = 50.116322
longitude = -122.957359

whistlerSpotting = await dataConnector.query("recentNearbyObservations", lat=latitude, lng=longitude)

whistlerSpotting

,speciesCode,comName,sciName,locId,locName,obsDt,howMany,lat,lng,obsValid,obsReviewed,locationPrivate,subId
0,stejay,Steller's Jay,Cyanocitta stelleri,L22210587,Louis’s house,2023-01-15 07:58,1,50.087557,-122.98701,True,False,True,S126204802
1,whtpta1,White-tailed Ptarmigan,Lagopus leucura,L1119167,Whistler Mountain,2023-01-14 14:45,1,50.063889,-122.956389,True,False,False,S126157422
2,gockin,Golden-crowned Kinglet,Regulus satrapa,L4415811,Whistler--Lost Lake,2023-01-14 14:41,4,50.126411,-122.937398,True,False,False,S126164433
3,varthr,Varied Thrush,Ixoreus naevius,L4415811,Whistler--Lost Lake,2023-01-14 14:41,2,50.126411,-122.937398,True,False,False,S126164433
4,spotow,Spotted Towhee,Pipilo maculatus,L4415811,Whistler--Lost Lake,2023-01-14 14:41,1,50.126411,-122.937398,True,False,False,S126164433
5,amedip,American Dipper,Cinclus mexicanus,L441796,Whistler--Calcheak Camp,2023-01-14 12:57,2,50.069591,-123.09391,True,False,False,S126170984
6,gryjay,Canada Jay,Perisoreus canadensis,L1119167,Whistler Mountain,2023-01-04 15:11,3,50.063889,-122.956389,True,False,False,S125427756
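
The cell above relies on top-level `await`, which works in Jupyter and Colab but not in a plain Python script. A minimal sketch of the script-side pattern follows. `fake_query` is a hypothetical stand-in for an awaitable query call so that the example runs without an API key or network access; it is not the Connector's real implementation.

```python
import asyncio


async def fake_query(table, **params):
    """Hypothetical stand-in for an awaitable query; it just echoes its inputs."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {"table": table, "params": params}


async def main():
    # Inside a coroutine, `await` works the same way as in a notebook cell
    return await fake_query(
        "recentNearbyObservations", lat=50.116322, lng=-122.957359
    )


# A script has no running event loop, so start one explicitly
result = asyncio.run(main())
print(result["table"])
```

In a notebook, keep using `await` directly, because `asyncio.run` raises an error when an event loop is already running.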


### Q2 Find the nearest locations where the Western Screech-Owl (now extinct in BC) has been spotted recently.

In [8]:
## Write your code
# Latitude and longitude set to the approximate center of BC
latitude = 54.470038
longitude = -125.332031
screechOwlCode = "wesowl1"

nearestScreechOwlSpotting = await dataConnector.query("nearestObservation", speciesCode=screechOwlCode, lat=latitude, lng=longitude)

nearestScreechOwlSpotting
# All of the locations shown in the output are outside BC

,speciesCode,comName,sciName,locId,locName,obsDt,howMany,lat,lng,obsValid,obsReviewed,locationPrivate,subId
0,wesowl1,Western Screech-Owl,Megascops kennicottii,L351484,Marymoor Park,2023-01-15 06:00,1,47.662576,-122.120520,True,False,False,S126249700
1,wesowl1,Western Screech-Owl,Megascops kennicottii,L493054,Henry Hagg Lake Park (Scoggins Valley Park),2023-01-14 06:11,1,45.473855,-123.213558,True,False,False,S126110763
2,wesowl1,Western Screech-Owl,Megascops kennicottii,L22232874,Hagg Lake to 26,2023-01-16 13:12,1,45.469885,-123.192184,True,False,True,S126325965
3,wesowl1,Western Screech-Owl,Megascops kennicottii,L22114596,"1612 Southeast Salmon Street, Portland, Oregon...",2023-01-08 09:52,1,45.514208,-122.649421,True,False,True,S125706798
4,wesowl1,Western Screech-Owl,Megascops kennicottii,L22205957,Friends' Home,2023-01-14 19:48,1,45.507675,-122.649368,True,False,True,S126180397
...,...,...,...,...,...,...,...,...,...,...,...,...,...
161,wesowl1,Western Screech-Owl,Megascops kennicottii,L22101514,Sendero A La Sierra -- Parte Segundo,2023-01-05 18:10,2,23.535369,-110.024223,True,False,True,S125577855
162,wesowl1,Western Screech-Owl,Megascops kennicottii,L22101521,Locale De Megascops,2023-01-06 06:00,1,23.546323,-110.001724,True,False,True,S125577928
163,wesowl1,Western Screech-Owl,Megascops kennicottii,L22174145,"Baja California Sur, MX (23.533, -110.024)",2023-01-05 18:54,1,23.532637,-110.023572,True,False,True,S126004266
164,wesowl1,Western Screech-Owl,Megascops kennicottii,L22101419,"Baja California Sur, MX (23.505, -110.05)",2023-01-06 18:22,1,23.504614,-110.050218,True,False,True,S125577299
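
To judge how far a returned sighting is from the query point, the great-circle distance can be computed from the `lat` and `lng` columns. The sketch below uses the standard haversine formula with the coordinates of the query point and of the first row above (Marymoor Park). The function name `haversine_km` is an assumption introduced for illustration and is not part of DataPrep.Connector or the eBird API.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two (lat, lng) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


query_point = (54.470038, -125.332031)    # center of BC used in the query
marymoor_park = (47.662576, -122.120520)  # first row of the output above

distance_km = haversine_km(*query_point, *marymoor_park)
print(round(distance_km))
```

Applying the same function to every row would allow the result to be sorted by distance. Distance alone does not establish which province or state a point falls in.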


### Q3 Which bird species have been seen in both Arizona and BC in the last 30 days?

In [9]:
## Write your code


BC = await dataConnector.query("observation", regionCode="CA-BC")
AZ = await dataConnector.query("observation", regionCode="US-AZ")

# Keep only the species-identifying columns before merging
dropped_columns = ['locId', 'locName', 'obsDt', 'howMany', 'lat', 'lng', 'obsValid', 'obsReviewed', 'locationPrivate', 'subId']
BC = BC.drop(columns=dropped_columns)
AZ = AZ.drop(columns=dropped_columns)

commonSpecies = pd.merge(BC, AZ, on='speciesCode')
commonSpecies

,speciesCode,comName_x,sciName_x,comName_y,sciName_y
0,comrav,Common Raven,Corvus corax,Common Raven,Corvus corax
1,norfli,Northern Flicker,Colaptes auratus,Northern Flicker,Colaptes auratus
2,rebnut,Red-breasted Nuthatch,Sitta canadensis,Red-breasted Nuthatch,Sitta canadensis
3,eucdov,Eurasian Collared-Dove,Streptopelia decaocto,Eurasian Collared-Dove,Streptopelia decaocto
4,amegfi,American Goldfinch,Spinus tristis,American Goldfinch,Spinus tristis
...,...,...,...,...,...
141,wessan,Western Sandpiper,Calidris mauri,Western Sandpiper,Calidris mauri
142,leasan,Least Sandpiper,Calidris minutilla,Least Sandpiper,Calidris minutilla
143,musduc,Muscovy Duck,Cairina moschata,Muscovy Duck,Cairina moschata
144,horlar,Horned Lark,Eremophila alpestris,Horned Lark,Eremophila alpestris
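
The `_x` and `_y` columns in this output are duplicates, because the merge key is only `speciesCode` while both frames also carry `comName` and `sciName`. One possible alternative, sketched below, merges on all three columns so that each name appears once. The two frames are toy data in the same shape as the query results; the `onlybc` and `onlyaz` rows are placeholders, not real eBird records.

```python
import pandas as pd

# Toy frames in the shape used above (illustrative rows, not real query results)
bc = pd.DataFrame({
    "speciesCode": ["comrav", "norfli", "onlybc"],
    "comName": ["Common Raven", "Northern Flicker", "Placeholder BC-only species"],
    "sciName": ["Corvus corax", "Colaptes auratus", "Placeholder bc"],
})
az = pd.DataFrame({
    "speciesCode": ["norfli", "comrav", "onlyaz"],
    "comName": ["Northern Flicker", "Common Raven", "Placeholder AZ-only species"],
    "sciName": ["Colaptes auratus", "Corvus corax", "Placeholder az"],
})

key_columns = ["speciesCode", "comName", "sciName"]

# An inner merge on all identifying columns keeps one copy of each name column
common_species = pd.merge(bc, az, on=key_columns).drop_duplicates()
print(common_species)
```

This assumes the common and scientific names for a given species code are identical in both regions, which is the case for every row shown above.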




### Q4 Compare bird sightings in the first week of 2023 vs. 2013 (Note: This requires several cells)

In [46]:
## Write your code
import datetime

years_tested = ['2013', '2023']
frames_2013 = []
frames_2023 = []

def get_date_frame(year):
    """Return the dates from January 1 to January 7 of the given year."""
    start = datetime.datetime.strptime('01-01-' + year, "%d-%m-%Y")
    end = datetime.datetime.strptime('07-01-' + year, "%d-%m-%Y")
    return pd.date_range(start, end)

for year in years_tested:
    for day in get_date_frame(year):
        date = day.strftime("%Y/%m/%d")
        temp = await dataConnector.query("historicObservation", regionCode="CA-BC", date=date)
        # Route each day's result to the list for its year
        if year == '2013':
            frames_2013.append(temp)
        else:
            frames_2023.append(temp)

data2013 = pd.concat(frames_2013)
data2023 = pd.concat(frames_2023)
totalSpotting2013 = data2013.groupby(['speciesCode','comName','sciName'])['howMany'].sum().to_frame()
totalSpotting2023 = data2023.groupby(['speciesCode','comName','sciName'])['howMany'].sum().to_frame()
species_spotting_merged = pd.merge(totalSpotting2013, totalSpotting2023, on=['speciesCode','comName','sciName']).rename(columns={'howMany_x': 'howMany_2013', 'howMany_y': 'howMany_2023'})
species_spotting_merged['differenceSpotting'] = species_spotting_merged['howMany_2023'] - species_spotting_merged['howMany_2013']
species_spotting_merged = species_spotting_merged.sort_values(by=['differenceSpotting'])

species_spotting_merged

speciesCode,comName,sciName,howMany_2013,howMany_2023,differenceSpotting
whwsco2,White-winged Scoter,Melanitta deglandi,1013.0,291.0,-722.0
calgul,California Gull,Larus californicus,563.0,31.0,-532.0
bohwax,Bohemian Waxwing,Bombycilla garrulus,1245.0,720.0,-525.0
mallar3,Mallard,Anas platyrhynchos,790.0,267.0,-523.0
gresca,Greater Scaup,Aythya marila,472.0,47.0,-425.0
...,...,...,...,...,...
norpin,Northern Pintail,Anas acuta,213.0,879.0,666.0
bongul,Bonaparte's Gull,Chroicocephalus philadelphia,45.0,1022.0,977.0
snogoo,Snow Goose,Anser caerulescens,461.0,1725.0,1264.0
y00475,American Coot,Fulica americana,191.0,1585.0,1394.0
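
Note that `pd.merge` performs an inner join by default, so the comparison above only contains species recorded in both years. If species seen in only one of the two years should also appear, one option is to align the per-year totals with an outer join and treat missing counts as zero. The sketch below uses toy totals with placeholder species codes; the numbers are illustrative and are not eBird data.

```python
import pandas as pd

# Toy per-species totals (placeholder codes and values, not real eBird counts)
totals_2013 = pd.Series({"speciesA": 790, "speciesB": 563}, name="howMany_2013")
totals_2023 = pd.Series({"speciesA": 267, "speciesC": 1725}, name="howMany_2023")

# Outer-align the two years; a species missing from one year gets a count of 0
comparison = pd.concat([totals_2013, totals_2023], axis=1).fillna(0)
comparison["differenceSpotting"] = (
    comparison["howMany_2023"] - comparison["howMany_2013"]
)
comparison = comparison.sort_values("differenceSpotting")
print(comparison)
```

Whether a missing species should count as zero is a judgment call: it may reflect a lack of submitted checklists rather than a true absence.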


## Submission

Complete this notebook, rename it to `A1-2-[WEB API Name].ipynb`, and submit it along with your config files to the CourSys activity `Assignment 1 - Part 2`. For example, if your group works on Yelp, then **every member of your group** needs to submit the same notebook named `A1-2-Yelp.ipynb` and the config files named `config.zip`.