<a href="https://colab.research.google.com/github/rskrisel/api_workshop/blob/main/API_workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting started with APIs

![PlanningPME_API___how_the_dedicated_API_works_.png](attachment:PlanningPME_API___how_the_dedicated_API_works_.png)

In this workshop, we will learn how to retrieve data from an application programming interface (API).

We will start by making a simple data request from The Metropolitan Museum of Art Collection API, reading the response in JSON format, and finally wrangling the data into a dataframe using the Pandas library.  

Next, we will create our unique access API keys to work with News API to retrieve news articles. We will also use the Wordcloud Python library to visualize our results in a word cloud.

### Acknowledgements

This workshop is adapted from the following tutorials:

- Python API Tutorial: Getting Started with APIs: https://www.dataquest.io/blog/python-api-tutorial/

- Accessing the News API in Python: https://www.datacareer.de/blog/accessing-the-news-api-in-python/

- How to create a Pandas Dataframe from an API Endpoint in a Jupyter Notebook: https://deallen7.medium.com/how-to-create-a-pandas-dataframe-from-an-api-endpoint-in-a-jupyter-notebook-f2561f766ca3

### How do APIs work?

In the previous workshop, we learned how to collect internet data by scraping the surface of web pages. APIs offer a different method for data collection by allowing users to programmatically extract and interact with data under the hood of websites, social networks, applications, and projects.

We make a request to an API server for data, and the server responds to our request:

![Screen+Shot+2022-11-07+at+10.34.46+AM.png](attachment:Screen+Shot+2022-11-07+at+10.34.46+AM.png)

## 1. Using an API without Keys: The Metropolitan Museum of Art Collection API

In this workshop, we will first work with The Metropolitan Museum of Art Collection API (MetMuseum API, https://metmuseum.github.io), which provides select datasets of information on more than 470,000 artworks in its Collection for unrestricted commercial and noncommercial use. The MetMuseum API is ideal for learning how to make data requests: it has a simple design and, unlike other APIs such as Twitter's, doesn't require authentication. More on that later.

### 1.1 Installing the "requests" library

To work with APIs, we need tools to make requests. The most common library for making requests and working with APIs in Python is "requests".

Since the "requests" library is not part of the standard Python library, we need to install it first.

Working from your computer Terminal (make sure you see the % symbol), type the following command:
```
conda install requests
```

Then, press enter. When it prompts you to respond with y or n (i.e., yes or no), type y and press enter.

When the installation is complete, your terminal should return to its base prompt with the % symbol.

### 1.2 Importing the "requests" library

In [None]:
import requests

### 1.3 Making an API request

A "GET" request is the most common type of API request to retrieve data.
The API responds to the GET request with a "response code", letting us know if the request was successful.

In order to work with specific APIs, it's important to consult the documentation so you understand what kinds of data requests you can make and the format to use. Reading the documentation for any programming package can seem daunting at times, so I recommend reading this how-to guide first: https://bit.ly/3TnOOMw

If we take a look at the MetMuseum API documentation (https://metmuseum.github.io), we can see it has more than one API on its server. Each of these is called an "endpoint."

The MetMuseum API has four endpoints:

![Latest_Updates___The_Metropolitan_Museum_of_Art_Collection_API.png](attachment:Latest_Updates___The_Metropolitan_Museum_of_Art_Collection_API.png)

Let's start with the "departments" endpoint, which returns a listing of all departments: https://collectionapi.metmuseum.org/public/collection/v1/departments

In [None]:
response = requests.get("https://collectionapi.metmuseum.org/public/collection/v1/departments")

It looks like we got a silent success, but let's use the "response.status_code" attribute to receive the status code for our request:

In [None]:
print(response.status_code)

The 200 code tells us our request was successful. Read here for more on status codes: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
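
To make the meaning of a status code easier to read at a glance, we could translate it into a short label. The sketch below is only an illustration: the helper name `describe_status` is our own invention, and it relies on Python's built-in `http.HTTPStatus` rather than on anything specific to the MetMuseum API.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short, human-readable label for an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes Python does not know
    if 200 <= code < 300:
        outcome = "success"
    elif 300 <= code < 400:
        outcome = "redirect"
    elif 400 <= code < 500:
        outcome = "client error"
    elif 500 <= code < 600:
        outcome = "server error"
    else:
        outcome = "informational"
    return f"{code} {status.phrase} ({outcome})"

# Try a few common codes
for code in (200, 404, 500):
    print(describe_status(code))
```

In the notebook, you could call `describe_status(response.status_code)` right after making a request.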

### 1.4 Reading API responses in JSON

Looking at the API documentation (https://metmuseum.github.io), we know that the API response is in JSON (JavaScript Object Notation) format. JSON is a way to encode data structures that ensures that they are easily readable by machines. JSON is the primary format in which data is passed back and forth to APIs, and most API servers will send their responses in JSON format.

We can use the response.json() method to see the data we got from the API:

In [None]:
print(response.json())

### 1.5 Reading the JSON output

The JSON output we got from the API looks like it contains Python dictionaries, lists, strings and integers. JSON is a combination of these objects represented as strings.

To work with JSON data in Python, we can use the JSON package, which is part of the standard library, so we don’t have to install anything to use it.

The JSON library has two main functions:

- json.dumps() — Takes in a Python object, and converts (dumps) it to a string.

- json.loads() — Takes a JSON string, and converts (loads) it to a Python object.
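
As a quick illustration of the two functions above, here is a minimal round trip. The `department` record is sample data we typed in by hand to resemble one entry from the "departments" endpoint; it is not fetched from the API.

```python
import json

# Sample record shaped like one "departments" entry (illustrative data only)
department = {"departmentId": 1, "displayName": "American Decorative Arts"}

as_text = json.dumps(department)   # Python dict -> JSON string
restored = json.loads(as_text)     # JSON string -> Python dict

print(as_text)
print(type(as_text).__name__, type(restored).__name__)
print(restored == department)
```

The string produced by `json.dumps()` is what travels over the network; `json.loads()` turns it back into an object we can index with keys.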



First, let's import the JSON library:

In [None]:
import json

Next, let's create a formatted string of the Python JSON object. We will define a new function, jprint, which takes 'obj' as its input variable and prints it as an indented string with sorted keys:

In [None]:
def jprint(obj):
    text = json.dumps(obj, sort_keys=True, indent=4)
    print(text)

Then, let's apply the new function to our response.json()

In [None]:
jprint(response.json())

Converting our output into a string makes it easier to understand the structure of the data. There are 21 departments at the Metropolitan Museum, with their names existing as dictionaries inside a list.

### 1.6 Converting JSON data into a Pandas Dataframe

It may be easier to work with the API data if it were stored in tabular format (i.e., a spreadsheet). We can use the Pandas Python library for data analysis and manipulation (https://pandas.pydata.org) to wrangle the JSON formatted data into a dataframe.  

To start, let's bring the Pandas library into our Python environment:

In [None]:
import pandas as pd

Next, let's explore our JSON using the keys() method and the type() function.

The keys() method returns a view object that contains the keys of the dictionary (https://www.w3schools.com/python/ref_dictionary_keys.asp).

Let's create a new variable 'json_results' equal to our API response

In [None]:
json_results=response.json()

json_results.keys()


Calling the keys() method on the json_results variable gives us the keys that we can use to explore the JSON, similar to how you would select a column in a Pandas Dataframe. In this case, we just have a single key. Later in this workshop we will work with a JSON with multiple keys.

In [None]:
json_results['departments']

We could also check the data type for this key:

In [None]:
type(json_results['departments'])

Because our 'departments' key is a list, we can simply add an index next to the key, and test what type of data is listed:

In [None]:
type(json_results['departments'][0])

It looks like we have a list of "dicts", which are easily transformed into a dataframe. Dictionaries are used to store data values in key:value pairs (https://www.w3schools.com/python/python_dictionaries.asp).

In [None]:
df = pd.DataFrame(json_results['departments'])
df

Converting the data from JSON to a dataframe makes it much easier to read! It also wrangles the data into a format that can be easily used for data analysis and manipulation. See Pandas workshop for a refresher: https://gc-dri.github.io/Dhrift-GC/workshops/pandas/
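
The manual exploration we did with keys() and type() can also be automated. The sketch below is one possible approach, not part of the original workshop: `outline` is a hypothetical helper of our own, and `sample` is hand-typed data shaped like the "departments" response so the example runs without a network connection.

```python
def outline(obj, indent=0):
    """Return lines summarising the nested structure of parsed JSON."""
    pad = "  " * indent
    lines = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(f"{pad}{key}: {type(value).__name__}")
            lines.extend(outline(value, indent + 1))
    elif isinstance(obj, list) and obj:
        # Only the first element is inspected, assuming the list is uniform
        lines.append(f"{pad}[0]: {type(obj[0]).__name__} (list of {len(obj)})")
        lines.extend(outline(obj[0], indent + 1))
    return lines

# Illustrative sample, not live API output
sample = {
    "departments": [
        {"departmentId": 1, "displayName": "American Decorative Arts"},
        {"departmentId": 3, "displayName": "Ancient Near Eastern Art"},
    ]
}

print("\n".join(outline(sample)))
```

A list of dictionaries that all share the same keys, as in this outline, is the shape that `pd.DataFrame()` converts most directly into rows and columns.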

## 2. Using an API with Keys: NewsAPI

NewsAPI.org is an easy-to-use API to get news from over 30,000 sources all over the world. NewsAPI offers both free and paid plans.

To connect with the NewsAPI (https://newsapi.org/), we will need a “Client Access Token,” which is like a password assigned to you.

Many APIs require authentication keys to gain access to them, in part so they can keep track of the requests we are making.

To get your NewsAPI key, fill out the info at: https://newsapi.org/register.

You’ll be asked to sign up for a News API account, required to gain API access. Signing up for a NewsAPI account is free and easy. You need your first name, an email address, and a password.

![Screen+Shot+2022-11-26+at+7.46.33+AM.png](attachment:Screen+Shot+2022-11-26+at+7.46.33+AM.png)

Once you’re signed in, you should be taken to https://newsapi.org/account, where you will find your personal API Key.

![Account_-_News_API.png](attachment:Account_-_News_API.png)

### 2.1 Installing the NewsAPI

Before we can communicate with the NewsAPI from our Python environment, we need to install it. You can run "!pip install newsapi-python" directly from your Jupyter Notebook:  

In [None]:
# !pip install newsapi-python

### 2.2 Importing the "requests" and "pprint" libraries

In addition to importing our "requests" library, we will also use the Pretty Print "pprint" module. Pretty Print in Python is a utility module that you can use to print data structures in a readable, pretty way. It’s a part of the standard library that’s especially useful for debugging code dealing with API requests, large JSON files, and data in general.

In [None]:
import pprint
import requests

### 2.3 Saving your NewsAPI key as a variable

When working with API keys, it's always a good idea to set them equal to a variable that you can easily call on when connecting to the API.

You can find your NewsAPI key at: https://newsapi.org/docs/authentication.

Here, I add a dummy key. Make sure to replace it with your own key:

In [None]:
secret= '571e874fe6674690a5ea658e5937d47c' #replace with your key
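
If you plan to share your notebook, you may prefer not to type the key into a cell at all. One common alternative, sketched here under our own assumptions, is to read it from an environment variable. The variable name `NEWSAPI_KEY` and the helper `load_api_key` are hypothetical choices, not something NewsAPI prescribes.

```python
import os

def load_api_key(env_var="NEWSAPI_KEY"):
    """Read an API key from an environment variable (the name is our own choice)."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key

# Demonstration only: set a placeholder so this example runs end to end.
# In practice you would set the variable outside the notebook.
os.environ["NEWSAPI_KEY"] = "your-key-here"

secret = load_api_key()
print(len(secret))
```

This way the key never appears in the saved notebook, and the rest of the code can keep using the `secret` variable unchanged.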

### 2.4 Defining your endpoint and specifying the query

NewsAPI offers three endpoints:

- '/v2/top-headlines', for the most important headlines per country and category

- '/v2/everything', for all the news articles from over 30,000 sources

- '/v2/sources', for information on the various sources

We will use the 'everything' endpoint, to get news about 'New York City'.

Let's define the endpoint

In [None]:
url = 'https://newsapi.org/v2/everything?'

Next, we need to specify the query and the number of results to return:

In [None]:
parameters = {
    'q': 'new york city', # query phrase
    #'sources': 'new york times', #specify sources, if desired
    'pageSize': 20,  # maximum is 100
    'apiKey': secret # your own API key
    }
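
It can help to see what these parameters look like once they are attached to the endpoint. The sketch below uses the standard library's `urlencode` to preview the kind of URL that gets requested; it is only an approximation of what "requests" does internally, and `YOUR_KEY` is a placeholder.

```python
from urllib.parse import urlencode

url = "https://newsapi.org/v2/everything?"
parameters = {
    "q": "new york city",   # query phrase
    "pageSize": 20,         # number of results to return
    "apiKey": "YOUR_KEY",   # placeholder, not a real key
}

# Each key=value pair is joined with "&"; spaces become "+"
full_url = url + urlencode(parameters)
print(full_url)
# → https://newsapi.org/v2/everything?q=new+york+city&pageSize=20&apiKey=YOUR_KEY
```

Because the key ends up in the URL itself, avoid pasting full request URLs into shared documents or screenshots.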

### 2.5 Retrieve the news with the requests package

Let's make the request

In [None]:
response = requests.get(url, params=parameters)

Then, parse the JSON response into a Python dictionary and pretty print it:

In [None]:
response_json = response.json()
pprint.pprint(response_json)

### 2.6 Print just the titles with a loop:

In [None]:
for i in response_json['articles']:
    print(i['title'])

Congrats on making your first NewsAPI query! Try out some other queries and the different endpoints. You can find the documentation at https://newsapi.org/docs.
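
When you experiment with other queries, some requests may fail, for example if the key is mistyped. As far as we can tell from the NewsAPI documentation, a failed request comes back with a "status" of "error" plus a "code" and a "message". The sketch below checks for that before looping over the articles. The helper `extract_titles` is hypothetical, and both payloads are hand-written samples rather than real API output.

```python
def extract_titles(payload):
    """Return article titles, or raise ValueError if the payload reports an error."""
    if payload.get("status") != "ok":
        code = payload.get("code", "unknown")
        message = payload.get("message", "no message")
        raise ValueError(f"{code}: {message}")
    return [article["title"] for article in payload.get("articles", [])]

# Hand-written samples that mimic a successful and a failed response
ok_payload = {
    "status": "ok",
    "totalResults": 2,
    "articles": [{"title": "Headline A"}, {"title": "Headline B"}],
}
error_payload = {
    "status": "error",
    "code": "apiKeyInvalid",
    "message": "Your API key is invalid.",
}

print(extract_titles(ok_payload))

try:
    extract_titles(error_payload)
except ValueError as err:
    error_text = str(err)
    print("Request failed:", error_text)
```

In the notebook, you would pass `response_json` to the function instead of the sample payloads.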

### 2.7 Visualizing Results in a Word Cloud

A word cloud is a visual representation of information or data. It shows the popularity of words or phrases by making the most frequently used words appear larger or bolder compared with the other words around them.
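
To get a feel for what a word cloud measures, we can count word frequencies by hand. This is a simplified sketch using Python's built-in `Counter` on two made-up headlines; the "wordcloud" package does its own text processing, so its counts may differ.

```python
from collections import Counter

# Made-up headlines for illustration
headlines = [
    "New York City opens new park",
    "City council votes on New York budget",
]

# Lowercase everything so "New" and "new" count as the same word
words = " ".join(headlines).lower().split()
counts = Counter(words)

print(counts["new"], counts["york"], counts["park"])
print(counts.most_common(1))
```

The words with the highest counts are the ones a word cloud would draw largest.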

To visualize the results of our NewsAPI query, we can create a word cloud using the "wordcloud" package for Python. Since the "wordcloud" package is not part of the standard Python library, we need to install it first. You can run "!pip install wordcloud" directly from your Jupyter Notebook.

We will also use Matplotlib, a comprehensive library for creating static, animated, and interactive visualizations in Python.

### 2.8 Importing the "wordcloud" and "matplotlib" libraries

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

### 2.9 Combine headlines into one string:

Let's start by creating an empty string

In [None]:
text_combined = ''

Loop through all the headlines and add them to 'text_combined.' We can make sure to add a space after every headline, so the first and last words are not glued together. Finally, let's print the first 300 characters to screen for inspection

In [None]:
for i in response_json['articles']:
    text_combined += i['title'] + ' '
print(text_combined[0:300])
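
The loop above assumes every article has a title. If a title happens to be missing (stored as None), adding it to a string would raise an error. A slightly more defensive variant, sketched here with made-up articles, joins only the titles that are present:

```python
# Made-up articles; the second one has no title
articles = [
    {"title": "First headline"},
    {"title": None},
    {"title": "Second headline"},
]

# Keep only non-empty titles and join them with a single space
text_combined = " ".join(a["title"] for a in articles if a.get("title"))
print(text_combined)
# → First headline Second headline
```

In the notebook, you would replace `articles` with `response_json['articles']`.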

### 2.10 Create a Word Cloud:

In [None]:
wordcloud = WordCloud(max_font_size=40).generate(text_combined)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

### 2.11 Converting JSON data into a Pandas Dataframe

As noted in section 1.6, it may be easier to work with the API data if it were stored in tabular format (i.e., a spreadsheet). We can use the Pandas Python library for data analysis and manipulation (https://pandas.pydata.org) to wrangle the JSON formatted data into a dataframe.

Let's follow the same steps as section 1.6:

In [None]:
import pandas as pd

In [None]:
response_json.keys()

In [None]:
response_json['status']

In [None]:
response_json['totalResults']

In [None]:
response_json['articles']

In [None]:
type(response_json['status'])

In [None]:
type(response_json['totalResults'])

In [None]:
type(response_json['articles'])

In [None]:
type(response_json['articles'][0])

In [None]:
df = pd.DataFrame(response_json['articles'])

In [None]:
df

### 2.12 Next Step: Data Cleaning!

Looking at the dataframe above, it's clear that the data received from the NewsAPI lacks uniformity. For example, in the author column, some authors are listed with their emails, others with their locations. In addition, the source column includes both the source 'id' and 'name'.

This is a common experience when collecting data from APIs. Even though the data is shared in a structured way based on specific query parameters, this does not mean the data is clean.

Remember that the data process starts with data collection, followed by data wrangling (e.g., converting data from JSON to a dataframe) and data cleaning, which includes making sure your data is stored in consistent formats.

For example, if we wanted to clean our source column so as to just keep the name of the publication, we could first define a function:

In [None]:
def dict_to_value(source):
    # Each 'source' cell is a dict such as {'id': ..., 'name': ...};
    # keep only the publication name as a string
    return str(source['name'])

Then, apply it to our dataframe:

In [None]:
df['source'] = df['source'].apply(dict_to_value)
df
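
If you would rather keep both the source 'id' and 'name', another option is to flatten the nested dictionary before building the dataframe. The sketch below shows the idea in plain Python on one made-up article. The helper `flatten_source` and the new field names `source_id` and `source_name` are our own choices.

```python
def flatten_source(article):
    """Return a copy of the article with 'source' split into two flat fields."""
    flat = {key: value for key, value in article.items() if key != "source"}
    source = article.get("source") or {}   # tolerate a missing or empty source
    flat["source_id"] = source.get("id")
    flat["source_name"] = source.get("name")
    return flat

# Made-up article for illustration
article = {
    "source": {"id": "example-news", "name": "Example News"},
    "author": "A. Writer",
    "title": "Sample headline",
}

flat_article = flatten_source(article)
print(flat_article)
```

Pandas also offers `pd.json_normalize()`, which flattens nested dictionaries into columns with dotted names such as 'source.name'.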

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
path = "/content/drive/MyDrive/web_scraping_api" #change path to match your directory


In [None]:
df.to_csv(f"{path}/news_articles.csv", encoding='utf-8', index=False)