# Lecture 15: APIs

April 17, 2024

## 1 Accessing data through APIs

### APIs

- An API (Application Programming Interface) is an interface that sits on top of a computer-based system and simplifies certain tasks, such as extracting subsets of data from a large repository or database.
- APIs can also support real-time server-to-browser communication.
- An API is code that allows software programs to communicate.
- Web APIs allow you to access data available via an internet web interface.
- Often you can access data from web APIs using a URL that contains sets of parameters that specify the type and particular subset of data that you are interested in.
- Web APIs are a way to strip away all the extraneous visual interface that you don't care about and get the data that you want.

### JSON files

- Data from APIs are often returned as JSON (JavaScript Object Notation).
- Structured Machine-readable files: Files that can be stored in a text format but are hierarchical and structured in some way that optimizes machine readability. JSON files are an example of structured machine-readable files.
- They are stored in human-readable text.
- They are 'lightweight' for storing and transferring data, which makes them quick and easy to work with. The format is compact, which keeps down the amount of data sent between client and server.
- Python is particularly good at reading in JSON files.
- It usually makes sense to load these into Python as dictionaries.
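
As a small illustration of the last point, the sketch below parses a JSON string into a Python dictionary with the standard `json` module. The sample text is made up for illustration (it only mimics the shape of a typical API response) and is not real API output.

```python
import json

# Hypothetical JSON text, shaped like a typical API response (not real data).
raw_text = (
    '{"message": "success", "number": 2, '
    '"people": [{"name": "A. Example", "craft": "ISS"}, '
    '{"name": "B. Sample", "craft": "ISS"}]}'
)

# json.loads parses JSON text into Python objects:
# JSON objects become dicts and JSON arrays become lists.
data = json.loads(raw_text)

print(type(data).__name__)        # dict
print(data["number"])             # 2
print(data["people"][0]["name"])  # A. Example
```

Once the text is parsed, nested values are reached with ordinary dictionary keys and list indexes.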

### Why do we use APIs?

Among other things, APIs allow us to:
- Get information that would be time-consuming to get otherwise.
- Get information that you can't get otherwise.
- Automate analytical workflows that require continuously updated data.
- Access data using a more direct interface.

There are many different types of web APIs. One of the most common types is a REST, or RESTful, API.

A RESTful API is a web API that uses URL arguments to specify what information you want returned through the API.

### Using APIs

- An API is actually a very simple tool that allows anyone to access information from a given website. Some APIs require certain headers, but others need only the URL.
- Data REQUEST: You try to access a URL in your browser.
- Data processing: A web server somewhere uses that URL to query a specified dataset.
- Data RESPONSE: That web server then sends you back some content.

### Python libraries for accessing APIs

- As part of accessing the API content and getting the data into a .CSV file, we'll have to import a number of Python Libraries.
- The `requests` library helps us get the content from the API by using the `get()` method. The response's `json()` method parses the JSON body into Python objects (dictionaries and lists) for easy handling.
- `json` library is needed so that we can work with the JSON content we get from the API.
- `pandas` library helps to create a DataFrame which we can format with proper headings and indexing, and then analyze.
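
The request step described above can be wrapped in a small helper. This is a minimal sketch, not part of the original lecture code: the name `fetch_json`, its `get` argument, and the 10-second timeout are all assumptions made for illustration. The `get` argument exists only so that the helper can be tried offline with a stand-in function.

```python
def fetch_json(url, params=None, get=None, timeout=10):
    """Request a URL and return the parsed JSON body.

    `get` defaults to requests.get; passing a stand-in function lets the
    helper be tried without a network connection.
    """
    if get is None:
        import requests  # imported here so the sketch can be exercised without the library
        get = requests.get

    response = get(url, params=params, timeout=timeout)
    response.raise_for_status()   # raise an exception for 4xx/5xx status codes
    return response.json()        # parse the JSON body into Python dicts/lists
```

Calling `fetch_json(url)` with any of the URLs used later in these notes should return the same object as `requests.get(url).json()`, except that a failed request raises an error instead of passing silently.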

## 2 Open Notify API example

- Now collect data from the Open Notify API. This example follows https://www.dataquest.io/blog/python-api-tutorial/.
- The Open Notify API gives access to data about the International Space Station. It's a great API for learning because it has a very simple design and doesn't require authentication.
- The first endpoint we'll use is http://api.open-notify.org/astros.json, which returns data about astronauts currently in space.

In [None]:
import numpy as np
import pandas as pd
import requests
import json
import os

response = requests.get("http://api.open-notify.org/astros.json")
print(response.status_code)

We received a '200' code which tells us our request was successful. The documentation tells us that the API response we'll get is in JSON format. Next, let's use the response.json() method to see the data we received back from the API:

In [None]:
print(response.json())

### Convert this to a pandas DataFrame

In [None]:
iss_df = pd.DataFrame(response.json())
print(iss_df)

This is not how we would like our data frame to look. The message and number columns are unnecessary. Remove them and convert the people column to a data frame.

In [None]:
print(response.json()['people'])

In [None]:
iss_df = pd.DataFrame(response.json()['people'])
print(iss_df)

## 3 Colorado Population Projections Example

See what the dataset looks like at the bottom of this page: https://data.colorado.gov/Demographics/Population-Projections-in-Colorado/q5vp-adf3

At the top of the page, you will notice an API button. If you click on that, it gives you the URL for accessing the API: https://data.colorado.gov/resource/q5vp-adf3.json

We will use this URL in our code.

In [None]:
url = "https://data.colorado.gov/resource/q5vp-adf3.json"
requests.get(url)

### Understanding the code

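- `requests.get(url).json()` requests the URL and returns the JSON data parsed into Python objects. In a notebook, the result is displayed if it is the last expression in a cell.
- We will save this as JSON_data: `JSON_data = requests.get(url).json()`
- `json.dumps(JSON_data)` converts the data back to a JSON string without indentation. Pass that string to `print` to show it in the console.
- Notice that the data looks like a list of Python dictionaries!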

In [None]:
JSON_data = requests.get(url).json()
my_data = json.dumps(JSON_data)
# print(my_data)
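
To make the effect of `json.dumps` concrete, here is a short sketch comparing the default compact output with the indented output produced by the `indent` argument. The records are invented for illustration and are not taken from the Colorado dataset.

```python
import json

# Hypothetical records, shaped like rows returned by a data portal (not real data).
sample = [
    {"county": "Boulder", "year": "2020", "age": "25"},
    {"county": "Boulder", "year": "2021", "age": "25"},
]

compact = json.dumps(sample)            # one long line, no indentation
pretty = json.dumps(sample, indent=2)   # one key per line, easier to read

print(compact)
print(pretty)

# Both strings describe the same data: parsing either gives back the original list.
assert json.loads(compact) == json.loads(pretty) == sample
```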

- Recall that `JSON_data = requests.get(url).json()`
- To convert the JSON_data to a pandas DataFrame, simply use `pd.DataFrame`: `pop_df = pd.DataFrame(JSON_data)`

In [None]:
pop_df = pd.DataFrame(JSON_data)
print(pop_df)

## 4 RESTful APIs

- There are ways of doing more complex extractions using the API string. I will show these in the next few slides. These are RESTful APIs.
- REST (REpresentational State Transfer) is an architectural style and an approach to communications that is often used in the development of web services. REST is often preferred over the more heavyweight SOAP (Simple Object Access Protocol) style because it typically uses less bandwidth, which makes it a better fit for use over the Internet.
- However, unless there is a huge amount of data, we are more confident doing this filtering in Python.

### 4.1 Using REST API for Colorado Population Projections Example

In [None]:
url = 'https://data.colorado.gov/resource/tv8u-hswn.json?county=Boulder&$where=age between 20 and 40 and year between 2016 and 2025&$select=year,age,femalepopulation'

JSON_data = requests.get(url).json()
# print(JSON_data)

### Create the dataset

In [None]:
# Create the dataset
pop_df_filter = pd.DataFrame(JSON_data)
print(pop_df_filter)

### Breaking down the API string

Notice that the API URL in the cell above starts with data.colorado.gov but then has various parameters attached to the end of the URL that specify the particular type of information that you are looking for.

The parameters in the URL are:
- The dataset itself: `/tv8u-hswn.json`
- AGE: `age between 20 and 40` (first part of the `$where` clause)
- YEAR: `year between 2016 and 2025` (second part of the `$where` clause, joined to the first with `and`)
- COUNTY: `county=Boulder`
- Columns to get: `$select=year,age,femalepopulation`
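
The same API string can be assembled from its parts instead of being typed as one long line. The sketch below uses the standard library's `urlencode` to do this. The variable names are illustrative, and it assumes the server treats percent-encoded characters (for example `%20` for a space) the same as the raw ones, which is the usual behavior for web servers.

```python
from urllib.parse import urlencode, quote

base_url = "https://data.colorado.gov/resource/tv8u-hswn.json"

# One entry per URL parameter; the $-prefixed keys are the query clauses.
params = {
    "county": "Boulder",
    "$where": "age between 20 and 40 and year between 2016 and 2025",
    "$select": "year,age,femalepopulation",
}

# urlencode joins the pairs with & and percent-encodes spaces and other
# special characters so the result is a valid URL.
query_string = urlencode(params, quote_via=quote)
url = base_url + "?" + query_string
print(url)
```

Keeping the parameters in a dictionary makes it easy to change one filter at a time. `requests.get(base_url, params=params)` accepts the same dictionary and performs the encoding itself.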

## 5 Exercise

Do your own filtering of data from the Colorado population projection data using an API string.

## 6 Newspaper search example

- We will now look at the Chronicling America API. Details on how to use it are at the following link: https://chroniclingamerica.loc.gov/about/api/
- The base URL for the API is: https://chroniclingamerica.loc.gov/
- The Chronicling America API allows access to metadata and text for millions of scanned newspaper pages. Unlike many other APIs, it does not require authentication, so we can explore the available data immediately without signing up for an account.
- In our example, we will try to find data on when Castleblayney was mentioned in an American paper.
- From the about API page, we see that the URL for creating a request to the API is: http://chroniclingamerica.loc.gov/search/pages/results/
- We see that this contains all results.
- We want to search for Castleblayney in JSON format. To do this add `?andtext=castleblayney&format=json`: https://chroniclingamerica.loc.gov/search/pages/results/?andtext=castleblayney&format=json
- If we request this URL, parse the response as JSON, and try to convert it to a DataFrame, the result is not in the format we would like.
- Notice that the items column appears to contain the dictionary we would like:

In [None]:
url = "https://chroniclingamerica.loc.gov/search/pages/results/?andtext=castleblayney&format=json"
# JSONContent = requests.get(url, headers={'content-type':'application/json'}).json()

JSONContent = requests.get(url).json()

blayney_df = pd.DataFrame(JSONContent)
print(blayney_df)
print(blayney_df.iloc[0:2,0:4])
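
When a response does not convert cleanly, it helps to look at its top-level keys before building a DataFrame. The helper below is a sketch for that purpose. The name `summarize_top_level` and the sample dictionary are assumptions for illustration, and the sample's field names only imitate a search response.

```python
def summarize_top_level(content):
    """Return {key: short description} for each top-level key of a parsed JSON object."""
    summary = {}
    for key, value in content.items():
        if isinstance(value, (list, dict)):
            # Containers: report the type and how many entries they hold.
            summary[key] = f"{type(value).__name__} with {len(value)} entries"
        else:
            # Scalars: report the type and the value itself.
            summary[key] = f"{type(value).__name__}: {value!r}"
    return summary


# Hypothetical response with one nested list of records (field names are illustrative).
sample_content = {
    "totalItems": 2,
    "items": [
        {"title": "Example Gazette", "date": "19000101"},
        {"title": "Sample Herald", "date": "19000102"},
    ],
}

for key, description in summarize_top_level(sample_content).items():
    print(key, "->", description)
```

A key whose value is a list of dictionaries is usually the one to pass to `pd.DataFrame`.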

- If we apply `pd.DataFrame` to `JSONContent['items']`, we get the dataset we would like.
- There are some missing values and strangely formatted columns. These would have to be tidied up.
- This only gives the data for the first 20 results. How do you think we will access the data for the other results?

In [None]:
blayney_df = pd.DataFrame(JSONContent['items'])
# blayney_df
print(blayney_df.head())
# print(blayney_df.iloc[0])

- Use the page argument to see results other than the first 20: https://chroniclingamerica.loc.gov/search/pages/results/?andtext=castleblayney&format=json&page=2

In [None]:
url = "https://chroniclingamerica.loc.gov/search/pages/results/?andtext=castleblayney&format=json&page=2"
JSONContent2 = requests.get(url).json()

blayney_df2 = pd.DataFrame(JSONContent2['items'])
print(blayney_df2)

### Putting all of the pages together into one data frame

- We will use a for loop to loop through the page numbers, and add the data to a DataFrame each time.
- First, we must create an empty DataFrame: `blayney_df_all = pd.DataFrame()`
- We will specify a range in our for loop of range(1, 6) because we want to go from pages 1 to 5 inclusive.
- We will add the number on as a string at the end of the URL to specify the page number.

In [None]:
blayney_df_all = pd.DataFrame()

for i in range(1,6):
    JSONContent = requests.get("https://chroniclingamerica.loc.gov/search/pages/results/?andtext=castleblayney&format=json&page="+str(i), 
                              headers={'content-type':'application/json'}).json()
    blayney_df_page = pd.DataFrame(JSONContent['items'])
    blayney_df_all = pd.concat([blayney_df_all, blayney_df_page], ignore_index=True)  # renumber rows across pages

blayney_df_all
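
The loop above fixes the number of pages at five. A possible refinement is to work out the page count from the first response. This sketch assumes the response contains `totalItems` and `itemsPerPage` fields (check `JSONContent.keys()` to confirm). The helper names and the `fetch_page` callback are hypothetical.

```python
import math


def page_count(first_page, max_pages=None):
    """Number of pages needed to cover all results, optionally capped at max_pages."""
    total = int(first_page["totalItems"])       # assumed field name
    per_page = int(first_page["itemsPerPage"])  # assumed field name
    pages = math.ceil(total / per_page)
    if max_pages is not None:
        pages = min(pages, max_pages)
    return pages


def collect_items(fetch_page, n_pages):
    """Call fetch_page(1..n_pages) and gather every page's 'items' into one list."""
    all_items = []
    for page in range(1, n_pages + 1):
        all_items.extend(fetch_page(page)["items"])
    return all_items
```

Here `fetch_page` would be a function that requests the search URL with a given `page` number and returns the parsed JSON. Passing the combined list to `pd.DataFrame` once avoids calling `pd.concat` on every pass through the loop. Capping the count with `max_pages` keeps a large search from sending hundreds of requests.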

## 7 Accessing an API with authorization: YouTube

https://developers.google.com/youtube/v3

https://developers.google.com/youtube/v3/getting-started

Create a project at the credentials page (link on this page: https://developers.google.com/youtube/registering_an_application)

Click on Create Credentials and request an API key.

Enable the YouTube Data API v3.

Install the `google-api-python-client` package, which provides the `googleapiclient` module (I have commented the command out to avoid installing multiple times):

In [None]:
# pip install --upgrade google-api-python-client

In [None]:
# import necessary libraries
from googleapiclient.discovery import build
import pandas as pd
import seaborn as sns

In [None]:
api_key = 'YOUR_API_KEY_HERE'
youtube = build('youtube', 'v3', developerKey=api_key)
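
Pasting a real key into a notebook makes it easy to share the key by accident. One common alternative, sketched here, is to read it from an environment variable. The variable name `YOUTUBE_API_KEY` and the helper `load_api_key` are assumptions for illustration, not something the YouTube API requires.

```python
import os


def load_api_key(var_name="YOUTUBE_API_KEY"):
    """Read the API key from an environment variable instead of hard-coding it."""
    key = os.environ.get(var_name)
    if not key:
        # Fail early with a clear message rather than sending an empty key to the API.
        raise RuntimeError(f"Set the {var_name} environment variable to your API key.")
    return key
```

With the variable set in the shell before starting Python, `api_key = load_api_key()` can replace the hard-coded string.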

We will need to find the channel_ids of the channels we want to find details on.

See details here on how to find channel ids: https://www.youtube.com/watch?v=qPKmPaNaCmE

1. Go to the YouTube channel homepage.
2. Right-click anywhere on the page and click 'View page source'.
3. Use Ctrl-F to find 'channel_id=' on the page. The channel_id will appear after this text.
4. Copy the channel_id into the Python code as seen below.

In [None]:
channel_ids = ['UCNAf1k0yIjyGu3k9BwAg3lg',  # Sky Sports Premier League
              'UCWw6scNyopJ0yjMu1SyOEyw',  # Talksport
              'UCjXIw1GlwaY1IzpW_jN9iCQ',  # The Overlap
             ]

Let's run some code to see what data we can get on the channels.

In [None]:
request = youtube.channels().list(
    part="snippet,contentDetails,statistics",
    id=','.join(channel_ids))
response = request.execute()
response

Define a function called channel_stats to get channel statistics:

In [None]:
def channel_stats(youtube, channel_ids):
    all_data = []
    request = youtube.channels().list(
        part="snippet,contentDetails,statistics",
        id=','.join(channel_ids))
    response = request.execute()

    for i in range(len(response['items'])):
        data = {
            'channel_name': response['items'][i]['snippet']['title'],
            'num_Subscribers': response['items'][i]['statistics']['subscriberCount'],
            'num_views': response['items'][i]['statistics']['viewCount'],
            'num_vids': response['items'][i]['statistics']['videoCount'],
            'playlist_ID': response['items'][i]['contentDetails']['relatedPlaylists']['uploads']
        }
        all_data.append(data)
    return all_data

In [None]:
channel_stats(youtube, channel_ids)

In [None]:
channel = channel_stats(youtube, channel_ids)
channelStats = pd.DataFrame(channel)  # convert to pandas df
channelStats
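
The counts in the API response typically arrive as strings (for example `'56000'`), so they need converting before any arithmetic or plotting. The sketch below converts chosen fields to integers before the DataFrame is built. The helper name `to_int_fields` and the sample rows are made up for illustration.

```python
def to_int_fields(rows, fields):
    """Return copies of the row dicts with the named fields converted to int."""
    converted = []
    for row in rows:
        new_row = dict(row)                      # copy, so the input is left unchanged
        for field in fields:
            if new_row.get(field) is not None:   # skip fields that are missing
                new_row[field] = int(new_row[field])
        converted.append(new_row)
    return converted


# Hypothetical rows in the shape produced by channel_stats (values are made up).
sample_rows = [
    {"channel_name": "Channel A", "num_Subscribers": "1200", "num_views": "56000", "num_vids": "40"},
    {"channel_name": "Channel B", "num_Subscribers": "300", "num_views": "9000", "num_vids": "12"},
]

numeric_rows = to_int_fields(sample_rows, ["num_Subscribers", "num_views", "num_vids"])
print(numeric_rows[0]["num_views"] + numeric_rows[1]["num_views"])  # 65000
```

An alternative is to convert after building the DataFrame, for example by applying `pd.to_numeric` to the count columns.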

Now we will get details on the videos in the uploads playlist of the channel 'The Overlap'. First, we extract the playlist_id from the channelStats df:

In [None]:
channel_name = 'The Overlap'
playlist_id = channelStats.loc[channelStats['channel_name'] == channel_name, 'playlist_ID'].iloc[0]
playlist_id

Next, create a function to extract up to 50 video IDs from the playlist.

In [None]:
def get_vid_id(youtube, playlist_id):
    request = youtube.playlistItems().list(  # playlistItems().list() is the YouTube Data API method for listing the items in a playlist
        part="contentDetails",
        playlistId=playlist_id,
        maxResults=50)  # to increase the results per page from the default 5 to the max 50
    response = request.execute()

    video_ids = []
    for i in range(len(response['items'])):
        video_ids.append(response['items'][i]['contentDetails']['videoId'])

    return video_ids
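
`get_vid_id` returns only the first page of results. To go past 50 videos, one option is to follow the `nextPageToken` value that the API includes when more pages exist. The function below is a sketch of that idea. The name `get_all_video_ids` and the `max_pages` safety cap are assumptions added for illustration.

```python
def get_all_video_ids(youtube, playlist_id, max_pages=5):
    """Collect video IDs across several result pages of a playlist."""
    video_ids = []
    page_token = None  # None requests the first page

    for _ in range(max_pages):
        request = youtube.playlistItems().list(
            part="contentDetails",
            playlistId=playlist_id,
            maxResults=50,
            pageToken=page_token,
        )
        response = request.execute()

        for item in response["items"]:
            video_ids.append(item["contentDetails"]["videoId"])

        # The token is absent on the last page, which ends the loop.
        page_token = response.get("nextPageToken")
        if page_token is None:
            break

    return video_ids
```

The cap limits how many requests are sent, which matters because each call counts against the daily API quota.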

Run this function on our playlist_id:

In [None]:
video_ids = get_vid_id(youtube, playlist_id)
video_ids

Now we need to get the video details from the video IDs:

In [None]:
def get_vid_details(youtube, video_ids):
    combvidestats = []

    for video_id in video_ids:
        request = youtube.videos().list(
            part='snippet,statistics',
            id=video_id
        )
        response = request.execute()

        for video in response['items']:
            video_stats = {
                'Title': video['snippet']['title'],
                'Publish_date': video['snippet']['publishedAt'],
                'num_views': video['statistics']['viewCount'],
                'num_likes': video['statistics'].get('likeCount'),  # may be absent if likes are hidden
                'num_comm': video['statistics'].get('commentCount')  # may be absent if comments are disabled
            }
            combvidestats.append(video_stats)

    return combvidestats

In [None]:
video_details = get_vid_details(youtube, video_ids)
video_details_df = pd.DataFrame(video_details)
video_details_df
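
`get_vid_details` sends one request per video. A possible variation, sketched below, sends the IDs in batches. It assumes `videos().list` accepts a comma-separated `id` string of up to 50 IDs per call. The names `chunked` and `get_vid_details_batched` are hypothetical.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def get_vid_details_batched(youtube, video_ids, batch_size=50):
    """Fetch details for many videos using one request per batch of IDs."""
    rows = []
    for batch in chunked(video_ids, batch_size):
        request = youtube.videos().list(
            part="snippet,statistics",
            id=",".join(batch),   # several IDs in a single call
        )
        response = request.execute()

        for video in response["items"]:
            stats = video["statistics"]
            rows.append({
                "Title": video["snippet"]["title"],
                "Publish_date": video["snippet"]["publishedAt"],
                "num_views": stats.get("viewCount"),
                "num_likes": stats.get("likeCount"),     # .get avoids a KeyError if a count is missing
                "num_comm": stats.get("commentCount"),
            })
    return rows
```

For the 50 IDs collected earlier this would make a single request instead of 50, and the result can be passed to `pd.DataFrame` in the same way as before.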

## 8 Exercises

1. Find an API online, and create a DataFrame in Python from the API data.
2. Use the YouTube API to get details on videos from a particular YouTube channel.