
# Workshop: Collecting data from the Internet: API

## Today's plan

API
* Introduction
* Examples: OpenWeather, Reddit
* Demos: Twitter, Finnhub

## API 

An API (application programming interface) is a set of rules and specifications that software programs follow to communicate with each other.
* It serves as an interface between different software programs and facilitates their interaction
* Example:
    * A client application initiates an API call (sends a request to an API)
    * If the request is valid, the API makes a call to the external program or web server
    * The server sends a response to the API with the requested information
    * The API transfers the data to the initial requesting application
    
Sound familiar?

## API: Advantages
* Advantages of APIs over web scraping:
    * API calls are usually faster than web scraping
    * Data from an API comes in a more standard, structured format (e.g. JSON)
    * An API is an authorised way to access data

## API: Limitations

* Availability: not all websites have APIs. For example:
<center><img src="figs/glassdoor_api.png" width="800"/></center>
* Most APIs have a limited usage policy
    * Rate limit: number of API calls a user can make within a given time period
    * Can be different for different types of users (e.g. free and paid users)
        * e.g. with a normal Twitter developer account, users can only retrieve tweets from the last seven days
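
A minimal sketch of how a client might cope with a rate limit, assuming the API signals it with the HTTP status code 429 (Too Many Requests), as many APIs do. The helper `call_with_backoff` and the `FakeResponse` class are hypothetical and only stand in for real API calls:

```python
import time


def call_with_backoff(send_request, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call send_request() until it returns a status other than 429.

    send_request is any zero-argument function returning an object with a
    .status_code attribute (for example a requests.Response).
    """
    for attempt in range(max_retries + 1):
        response = send_request()
        if response.status_code != 429:  # 429 = Too Many Requests
            return response
        if attempt < max_retries:
            # Wait 1x, 2x, 4x ... the base delay before trying again
            sleep(base_delay * 2 ** attempt)
    return response  # still rate-limited after all retries


# Hypothetical stand-in for a real API: rate-limited twice, then succeeds
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code


_statuses = iter([429, 429, 200])
waits = []
result = call_with_backoff(lambda: FakeResponse(next(_statuses)),
                           sleep=waits.append)  # record the waits instead of sleeping
print(result.status_code, waits)
```

With a real API, `send_request` could be something like `lambda: requests.get(url)`. Always check the provider's documentation for the actual limits and the recommended waiting time.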

## Example 1: OpenWeather API

OpenWeather provides an API for retrieving weather data, including real-time and historical weather.
* See the whole list of data provided [here](https://openweathermap.org/price#weather)

## How to use OpenWeather API

1. Before using it, please first sign up [here](https://home.openweathermap.org/users/sign_up) and then log in
2. At the top right, click your account -> My API keys: <center><img src="figs/weather_api_key_1.png" width="500"/></center>
3. Copy your API key from the page: <center><img src="figs/weather_api_key_2.png" width="500"/></center>

## API Keys


To prevent unauthorised use, an API only accepts calls that provide proper authentication credentials, typically in the form of an _API key_.
* API key: a unique identifier that authenticates a user or an app and associates API calls with them
    * For now, you can think of an API key as your login details. Just like your other login details, you should NOT share it with or expose it to others!

## API Keys (continued)

Instead of putting the API key directly in the code, we will store it in a JSON file `keys.json` with content like the following:

```
{"open_weather": {
  "api_key": "XXXXXXXXXX"
  },
  "reddit": {
    "app_id": "XXXXXXXXXX",
    "app_secret": "XXXXXXXXXX",
    "user_name": "XXXXXXXXXX",
    "password": "XXXXXXXXXX"
  }
}
```

By doing so, we can "hide" our API key from people who have access to the code.
* When handing in your work, please do NOT submit the `keys.json` file!

## API Keys (continued)

To read the API key back from the file, we use the `json` module to parse it. For example:

In [1]:
import requests
import json

with open('keys.json') as f:
    keys = json.load(f)
api_key = keys['open_weather']['api_key'] 
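
If `keys.json` is missing or a field is misspelled, the code above fails with a fairly terse error. A possible sketch of a small helper that reports what went wrong; the function name `load_api_key` is hypothetical, and the demonstration writes a throwaway file with a placeholder instead of a real key:

```python
import json
import os
import tempfile


def load_api_key(path, service, field='api_key'):
    """Return keys[service][field] from a JSON key file, with clear errors."""
    try:
        with open(path) as f:
            keys = json.load(f)
    except FileNotFoundError:
        raise FileNotFoundError(f'{path} not found: create it before calling the API')
    try:
        return keys[service][field]
    except KeyError:
        raise KeyError(f'{service}/{field} is missing from {path}')


# Demonstration with a throwaway file holding a placeholder (not a real key)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'keys.json')
    with open(demo_path, 'w') as f:
        json.dump({'open_weather': {'api_key': 'XXXXXXXXXX'}}, f)
    demo_key = load_api_key(demo_path, 'open_weather')

print(demo_key)
```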

## OpenWeather API: Get the current weather information

Read the instructions in the [official documentation](https://openweathermap.org/current#name) to understand how to make the API call:

<center><img src="figs/weather_api_current.png" width="300"/></center>

## Get the current weather information (continued)

Get the current weather of London using the given API call:

In [2]:
city_name = 'London'
url = f'https://api.openweathermap.org/data/2.5/weather?q={city_name}&appid={api_key}'
r = requests.get(url)
r

<Response [200]>

In [3]:
r.text

'{"coord":{"lon":-0.1257,"lat":51.5085},"weather":[{"id":803,"main":"Clouds","description":"broken clouds","icon":"04n"}],"base":"stations","main":{"temp":284.81,"feels_like":284.41,"temp_min":283.76,"temp_max":285.45,"pressure":1009,"humidity":91},"visibility":10000,"wind":{"speed":5.66,"deg":230,"gust":11.32},"clouds":{"all":75},"dt":1707862401,"sys":{"type":2,"id":2075535,"country":"GB","sunrise":1707808778,"sunset":1707844201},"timezone":0,"id":2643743,"name":"London","cod":200}'
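
The URL above is assembled with an f-string. As an alternative sketch, the query string can be built from a dictionary with the standard library, which also takes care of characters such as spaces. The key below is a placeholder, not a real one:

```python
from urllib.parse import urlencode

base_url = 'https://api.openweathermap.org/data/2.5/weather'
params = {'q': 'London', 'appid': 'XXXXXXXXXX'}  # placeholder, not a real key

# urlencode joins the pairs with & and escapes special characters
url = f'{base_url}?{urlencode(params)}'
print(url)

# A city name containing a space is encoded automatically
print(urlencode({'q': 'New York'}))
```

`requests.get(base_url, params=params)` accepts the same dictionary and performs the encoding for you.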

## Get the current weather information (continued)

The received data is in JSON format, and we can parse it using `json.loads()`:

In [4]:
weather = json.loads(r.text)

In [5]:
weather

{'coord': {'lon': -0.1257, 'lat': 51.5085},
 'weather': [{'id': 803,
   'main': 'Clouds',
   'description': 'broken clouds',
   'icon': '04n'}],
 'base': 'stations',
 'main': {'temp': 284.81,
  'feels_like': 284.41,
  'temp_min': 283.76,
  'temp_max': 285.45,
  'pressure': 1009,
  'humidity': 91},
 'visibility': 10000,
 'wind': {'speed': 5.66, 'deg': 230, 'gust': 11.32},
 'clouds': {'all': 75},
 'dt': 1707862401,
 'sys': {'type': 2,
  'id': 2075535,
  'country': 'GB',
  'sunrise': 1707808778,
  'sunset': 1707844201},
 'timezone': 0,
 'id': 2643743,
 'name': 'London',
 'cod': 200}
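
The parsed result is an ordinary nested dictionary. As a small sketch using a subset of the response above, the fields `dt`, `sunrise` and `sunset` are Unix timestamps in UTC and can be converted to readable dates with the standard library:

```python
from datetime import datetime, timezone

# Subset of the response shown above
weather = {
    'main': {'temp': 284.81, 'humidity': 91},
    'weather': [{'main': 'Clouds', 'description': 'broken clouds'}],
    'dt': 1707862401,
    'sys': {'country': 'GB', 'sunrise': 1707808778, 'sunset': 1707844201},
    'name': 'London',
}

description = weather['weather'][0]['description']  # 'weather' holds a list
observed_at = datetime.fromtimestamp(weather['dt'], tz=timezone.utc)
sunrise = datetime.fromtimestamp(weather['sys']['sunrise'], tz=timezone.utc)

print(weather['name'], description, observed_at.isoformat())
print('sunrise:', sunrise.isoformat())
```

With a `requests` response, `r.json()` gives the same dictionary as `json.loads(r.text)`.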

## Get the current weather information (continued)

We can then retrieve the data we want, say the temperature:

In [6]:
weather['main']['temp']

284.81

Why is the number so high? Go back and read the documentation carefully!

## Get the current weather information (continued)

Provide an additional argument, `units=metric`, to get the weather details in the right units:

In [7]:
url = f'https://api.openweathermap.org/data/2.5/weather'\
    f'?q={city_name}&appid={api_key}&units=metric'
r = requests.get(url)
r

<Response [200]>

Get the temperature again:

In [8]:
weather = json.loads(r.text)
weather['main']['temp']

4.09

Lesson: Always read the official documentation - do not assume it will do what you _think_ it will do.
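
For reference, OpenWeather returns temperatures in Kelvin when no `units` argument is given. A short sketch of the conversion for the first reading (the helper name `kelvin_to_celsius` is hypothetical). The converted value need not match the second reading, presumably because the two calls were made at different times:

```python
def kelvin_to_celsius(kelvin):
    """Convert a temperature from Kelvin to degrees Celsius."""
    return kelvin - 273.15


first_reading = 284.81  # value returned without the units argument
print(round(kelvin_to_celsius(first_reading), 2))
```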

## Recall: URL

* URL form: `<scheme>://<host>:<port>/<path><;parameters><?query><#fragment>`
    * Scheme (e.g., http, https)
    * Location of the server, e.g. a hostname or IP address
    * [optional] port: communication endpoint
    * Path to the resource on the server
    * [optional] parameters (;something), query string (?something), fragment identifier (#something)
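
A brief sketch of how these components can be inspected with Python's standard library. The port `:443` and the fragment `#top` are added here purely for illustration and are not part of the OpenWeather call:

```python
from urllib.parse import urlsplit, parse_qs

url = 'https://api.openweathermap.org:443/data/2.5/weather?q=London&units=metric#top'
parts = urlsplit(url)

print(parts.scheme)    # scheme
print(parts.hostname)  # location of the server
print(parts.port)      # port
print(parts.path)      # path to the resource
print(parts.query)     # query string
print(parts.fragment)  # fragment identifier

# The query string can be split further into a dictionary of lists
print(parse_qs(parts.query))
```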

## Activity 8.1

1. Update the code above to get the current weather information of a city you are interested in
2. Read https://openweathermap.org/api/geocoding-api and create an API call to get the latitude and longitude of Paris, France

In [8]:
# write your code for question 1 here

In [9]:
# write your code for question 2 here
# hint: use json.loads(r.text) and locate the index of longitude and latitude inside the API documentation

## Example 2: Reddit API

[Reddit API](https://www.reddit.com/dev/api/) allows users to automate what they can do manually on the Reddit website with the use of API calls.

In this workshop, we will follow the [Quick Start Example](https://github.com/reddit-archive/reddit/wiki/OAuth2-Quick-Start-Example) from Reddit.

## Reddit API: Sign up and set up the application

1. Sign up and log in to a Reddit account
2. Go to https://www.reddit.com/prefs/apps. Hit `are you a developer? create an app...` and enter the details like the following:
<center><img src="figs/reddit_app.png" width="700"/></center>
You should then get the following:
<center><img src="figs/reddit_id_secret.png" width="400"/></center>

## Reddit API: Sign up and set up the application (continued)

3. Go to https://www.reddit.com/wiki/api. Hit `Read the full API terms and sign up for usage` and fill in the form with the app id you got in (2)
4. Update `keys.json` with your Reddit username, password, app id and app secret from (2)

## Reddit API: Request a token

To request a token, we need to:
1. Provide the application id and secret (also called the client id and secret) for authentication
2. Provide Reddit username and password 

We first retrieve the required id, secret, etc from `keys.json`:

In [10]:
with open('keys.json') as f:
    keys = json.load(f)
    
app_id = keys['reddit']['app_id']
app_secret = keys['reddit']['app_secret']
username = keys['reddit']['user_name']
password = keys['reddit']['password']

## Reddit API: Request a token (continued)

We then follow the example from the [Reddit API GitHub page](https://github.com/reddit-archive/reddit/wiki/OAuth2-Quick-Start-Example#python-example) to request a token:

In [11]:
import requests.auth

client_auth = requests.auth.HTTPBasicAuth(app_id, app_secret)
post_data = {'grant_type': 'password', 
             'username': username, 
             'password': password}
headers = {'User-Agent': f'testing lse/0.0.1 by {username}'}

r = requests.post('https://www.reddit.com/api/v1/access_token',
                  auth=client_auth, data=post_data, headers=headers)

In this example:
* We use `POST` instead of `GET`
* Headers are included
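
The `HTTPBasicAuth` object above sends the app id and secret in an `Authorization` header. A rough sketch of how HTTP Basic authentication builds that header value, using placeholder credentials (the helper name `basic_auth_header` is hypothetical, and `requests` does the equivalent for you):

```python
import base64


def basic_auth_header(client_id, client_secret):
    """Build the Authorization header value used by HTTP Basic authentication."""
    raw = f'{client_id}:{client_secret}'.encode('utf-8')
    return 'Basic ' + base64.b64encode(raw).decode('ascii')


# Placeholder credentials, not real ones
header_value = basic_auth_header('my_app_id', 'my_app_secret')
print(header_value)

# Decoding shows that Base64 is an encoding, not encryption
decoded = base64.b64decode(header_value.split(' ', 1)[1]).decode('utf-8')
print(decoded)
```

Because the header can be decoded by anyone who sees it, such requests should only be sent over `https`.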

## Recap: HTTP requests

* An HTTP request has the following components:
    * Request line, which includes
        * URL 
        * Method (e.g. GET)
        * Protocol version
    * Headers: meta-information about a request
    * Message body (optional)
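
As an illustrative sketch, the standard library can build a request object without sending it, which makes the components visible. The client name in the `User-Agent` header is hypothetical, and the protocol version is chosen by the library when the request is actually sent, so it does not appear here:

```python
from urllib.parse import urlencode
from urllib.request import Request

body = urlencode({'grant_type': 'password'}).encode('ascii')  # message body
request = Request(
    'https://www.reddit.com/api/v1/access_token',    # URL
    data=body,
    headers={'User-Agent': 'example-client/0.0.1'},  # hypothetical client name
    method='POST',
)

# Nothing is sent over the network; we only inspect the object
print(request.get_method())    # method
print(request.full_url)        # URL
print(request.header_items())  # headers
print(request.data)            # message body
```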

## Recap: HTTP request methods

HTTP request methods indicate the desired action to be performed on the identified resource. 
* Some example request methods:

| Method      |Description                                                                      |
|-------------|---------------------------------------------------------------------------------|
| GET         | Retrieve information from the specified URL                                    |
| HEAD        | Retrieve the meta information from the header of the specified URL             |
| POST        | Send the attached information to the specified URL for processing (e.g. appending)  |
| PUT         | Send the attached information for replacing the resource at the specified URL  |
| DELETE      | Delete the information identified by the URL                                    |


## Reddit API: Request a token (continued)

Check if we have successfully got the token:

In [12]:
r

<Response [200]>

The response contains the `access_token`, which we will use later:

In [13]:
access_token = json.loads(r.text)['access_token']
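
If the credentials are wrong, the response body may not contain an `access_token` at all, and the line above would raise a `KeyError`. A defensive sketch, using hypothetical response bodies shaped like an OAuth2 token response (the helper name `extract_access_token` is also hypothetical):

```python
import json


def extract_access_token(response_text):
    """Return the access token from a token response body, or raise ValueError."""
    payload = json.loads(response_text)
    if 'access_token' not in payload:
        raise ValueError(f'no access token in response: {payload}')
    return payload['access_token']


# Hypothetical response bodies, not real Reddit output
ok_body = '{"access_token": "abc123", "token_type": "bearer", "expires_in": 3600}'
bad_body = '{"error": "invalid_grant"}'

print(extract_access_token(ok_body))
try:
    extract_access_token(bad_body)
except ValueError as err:
    print('failed:', err)
```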

## Reddit API: use the access token to send a request

Now we use the access token to create an API call to show the hot posts from the subreddit `datascience`:

In [14]:
headers = {"Authorization": f"bearer {access_token}", 
           'User-Agent': f'lse/0.0.1 by {username}'}

r = requests.get("https://oauth.reddit.com/r/datascience/hot?limit=3",
                 headers=headers)
r

<Response [200]>

Note:
* Here we make use of `headers` to provide the access token
* Here we set `limit=3`, which means the maximum number of items to return is 3

To learn how to make this API call, see [here](https://www.reddit.com/dev/api/#GET_{sort}).

## Reddit API: have a look at the data we have got

In [15]:
for post in json.loads(r.text)['data']['children']:
    print(post['data']['title'], post['data']['url'])

Weekly Entering &amp; Transitioning - Thread 12 Feb, 2024 - 19 Feb, 2024 https://www.reddit.com/r/datascience/comments/1aos1w7/weekly_entering_transitioning_thread_12_feb_2024/
How do land a causal inference focused DS job? https://www.reddit.com/r/datascience/comments/1apwzdn/how_do_land_a_causal_inference_focused_ds_job/
Essential Math for Data Science VS Math for machine learning, which is a better book? https://www.reddit.com/r/datascience/comments/1aq3q1h/essential_math_for_data_science_vs_math_for/
Refreshing math skills https://www.reddit.com/r/datascience/comments/1apycq9/refreshing_math_skills/
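
The first title above contains `&amp;`, an HTML-escaped ampersand. A short sketch of turning such entities back into plain characters with the standard library:

```python
import html

title = 'Weekly Entering &amp; Transitioning - Thread 12 Feb, 2024 - 19 Feb, 2024'
print(html.unescape(title))
```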


## Activity 8.2

1. Read the first section of https://www.reddit.com/dev/api to understand what `after` / `before`, `limit`, `count`, and `show` mean
2. Read https://www.reddit.com/dev/api/#GET_search and use the API to get 100 posts related to python from the subreddit `datascience`. Print out the title and the score of each post

In [None]:
# write your code for question 2 here
# hint: use for loop to print out the title and score of the post in parsed_data['data']['children']

##  API remark

* Providers decide how their API works, and clients have to read the documentation to know how to use it

# General feedback on project proposal

## Project proposal: data source

* Unless you have prior experience, please do NOT use web scraping on Facebook / Instagram with or without the use of third party libraries
    * Your accounts are likely to be blocked!
* Supported APIs: Twitter and Reddit
    * You are allowed to use other APIs, but I may not be able to provide much help
    * For students who use Twitter API, you are strongly advised to apply for a developer account as soon as possible
* Please don't be too ambitious
    * Unless you are experienced, using one API / web scraping _should_ be enough for you to show your data collection skills
    * Three data sources _should_ be more than enough in this project

## Project proposal: data source (continued)

* For many of the APIs, there is a limitation on how many posts you can get
    * Twitter: posts from the last seven days if you have a normal account
    * Reddit: not more than 1000 posts from a subreddit
    * You are advised to start collecting the data earlier so that you can get more data
* Because of these limitations, you may not be able to answer questions like "how the public opinion on xxx changed in the past z years" using these APIs
    * Instead, you may answer questions like "how the public opinion differs between xxx and yyy"

## Project proposal: data source (continued)

* You will NOT be penalised because you can only get a limited amount of data due to the limitation of the data source, but in your report you need to:
    * Highlight the limitation and what you have tried (but failed)
        * Your effort will be taken into account
    * Reduce the scope of your question based on the limitation on the data you can collect 
        * "how the public opinion on xxx changed in the past z years" -> "how the public opinion differs for xxx and yyy in the last week"
        * "Whether Instagram or Twitter is a better platform to engage students" -> "Is Twitter an effective platform to engage students?"
* Remember that the data science lifecycle is an iterative process!

## Data science lifecycle

<center><img src="figs/ds_process_2.png" width="500"/></center>

## Project proposal: Some more feedback

* We use the project to assess your ability to perform the data science lifecycle and its main steps and tasks (except modelling) using real-life data to answer real-world questions
* List of steps and tasks:
    * Formulate questions to answer
    * Collect relevant data
    * Explore data
        * Data wrangling
        * EDA (with visualisation)
    * Report the result
        * with visualisation

## Project proposal: Some more feedback (continued)

* To get a good grade, your project should demonstrate your ability in performing _most_ (if not all) of the steps and tasks above
* Try to use your time wisely and not to spend too much time on one of the areas
    * For example, if you have spent a lot of time figuring out how to use an API to collect the data, you may want to stop and use another source instead so that you can spend enough time in other areas
* You can do modelling if you want, but make sure you have put enough effort into the main steps and tasks listed above
    * This is a first-year course: modelling is allowed but NOT required

# API demo

## Further examples on API

Here I provide two more examples of using APIs to collect data as a _demonstration_. The purpose of the demo is as follows:
1. To show how you can get data from Twitter using their API (can be useful for students who analyse Twitter data in their project)
2. To show how you can collect real-time financial data through an API (can be useful for students who are interested in jobs in the financial industry)

The use of the Twitter and Finnhub APIs will NOT appear in problem set questions. Do not worry if you are not able to run the code from here.
* Getting a Twitter developer account can be difficult...

## Demo 1: Twitter API

Before you can use Twitter API, you need to apply for a developer account. 

Follow the [link](https://developer.twitter.com/en/docs/tutorials/step-by-step-guide-to-making-your-first-request-to-the-twitter-api-v2) to:
1. Sign up for a developer account
2. Create a Project and connect to an app
    * In this workshop, we will only use the bearer token to connect to an app

## Twitter API: demo 

Get information about a Twitter account:

In [32]:
bearer_token = keys['twitter']['bearer_token']
headers = {
    'Authorization': f"Bearer {bearer_token}"
}
r = requests.get('https://api.twitter.com/2/users/by/username/elonmusk', headers=headers)

In [33]:
r.text

'{\n  "title": "Unauthorized",\n  "type": "about:blank",\n  "status": 401,\n  "detail": "Unauthorized"\n}'

## Twitter API: demo (continued)

Get a few tweets from the given user using the user id:

In [34]:
r = requests.get('https://api.twitter.com/2/users/44196397/tweets', headers=headers)
r

<Response [401]>

## Twitter API: demo (continued)

In [19]:
json.loads(r.text)

{'title': 'Unauthorized',
 'type': 'about:blank',
 'status': 401,
 'detail': 'Unauthorized'}
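
The responses above carry the status code 401 instead of 200, so it is worth checking the status before parsing the body. A small sketch that maps a status code to its standard description (the helper name `describe_status` is hypothetical):

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return a short description of an HTTP status code."""
    status = HTTPStatus(status_code)
    outcome = 'success' if 200 <= status_code < 300 else 'failure'
    return f'{status_code} {status.phrase} ({outcome})'


for code in (200, 400, 401):
    print(describe_status(code))
```

With `requests`, `r.status_code` holds the code, `r.ok` is `True` for codes below 400, and `r.raise_for_status()` raises an exception for error codes.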

## Twitter API: demo (continued)

Get the details of a tweet given its id. 
* You can specify what field you want under `tweet.fields=...`
    * Here we get the `author_id` and `public_metrics` (measure Tweet engagement)
    * See [here](https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet) for more details

In [20]:
r = requests.get('https://api.twitter.com/2/tweets?ids=1501796086255693827&' \
                 'tweet.fields=author_id,public_metrics', 
                 headers=headers)

In [21]:
json.loads(r.text)

{'title': 'Unauthorized',
 'type': 'about:blank',
 'status': 401,
 'detail': 'Unauthorized'}

## Twitter API: demo (continued)

Get a collection of relevant Tweets matching a specified query - here search for "elon musk":

In [22]:
# %20 is a percent-encoded space in a URL
api_call = 'https://api.twitter.com/1.1/search/tweets.json?q=elon%20musk&count=100' 
r = requests.get(api_call, 
                 headers=headers)
r

<Response [400]>

In [23]:
tweets = json.loads(r.text)

In [29]:
len(tweets['statuses'])

KeyError: 'statuses'

See [here](https://developer.twitter.com/en/docs/twitter-api/v1/tweets/search/api-reference/get-search-tweets) to learn more about how to get search results using Twitter API, and see [here](https://en.wikipedia.org/wiki/Percent-encoding) to learn about percent encoding.
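
Instead of writing `%20` by hand, the percent encoding can be produced with the standard library. A short sketch:

```python
from urllib.parse import quote, unquote

query = 'elon musk'
encoded = quote(query)   # the space becomes %20
print(encoded)
print(unquote(encoded))  # back to the original text
```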

## Twitter API: demo (continued)

Get a collection of the most recent Tweets posted by the user - here with the given user id of Elon Musk:

In [25]:
url='https://api.twitter.com/1.1/statuses/user_timeline.json?user_id=44196397&count=200'
r=requests.get(url, headers=headers)
r

<Response [400]>

In [26]:
tweets = json.loads(r.text)

In [27]:
len(tweets)

1

See [here](https://developer.twitter.com/en/docs/twitter-api/v1/tweets/timelines/api-reference/get-statuses-user_timeline) to learn more about how to get the most recent Tweets posted by a user.

## Demo 2: receiving real-time financial data

The code of the demo is under `finnhub.ipynb`.

## Select files to submit on GitHub

Unselect `keys.json` before you commit and push:

<center><img src="figs/github.png" width="500"/></center>

## Summary

* API examples and exercises
* API demo to:
    * Help you to prepare for the project
    * See how we can collect real-time data