<a href="https://colab.research.google.com/github/toisTareq/DvDa/blob/main/03_Collecting_Data_API_Script_Lecture.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Collection
One of the first steps of data analysis is typically to get some data. If you're lucky, someone provides you with a CSV (or other) file that you can start working on. Sometimes you have to write some SQL code to query databases (see Database lecture next year). Sometimes no data repository or database is available, and you have to either query an API or scrape a webpage yourself. This notebook focuses on the latter two methods.

## Collecting data from web-based sources
Gathering data from a web-based source is one of the more common forms of querying data.

### Issuing HTTP requests
The communication between a client (your Python program or your browser) and a server (where your web data is hosted) is governed by HTTP (HyperText Transfer Protocol), a set of rules and standards for how requests and responses are exchanged on the web. To collect web-based data, you issue an HTTP request to the web server. The web server responds based on the request method and its parameters.

Python provides the [requests library](https://requests.readthedocs.io/en/latest/user/quickstart/) to communicate and issue queries to a web server. The following code will load data from the THI webpage.

In [None]:
import requests
response = requests.get("http://www.thi.de")

print("Status Code: ", response.status_code)
print("Headers: ", response.headers)

Status Code:  200
Headers:  {'Date': 'Mon, 01 Dec 2025 14:49:00 GMT', 'Server': 'Apache', 'Content-Language': 'de-DE', 'Expires': 'Mon, 01 Dec 2025 16:07:43 GMT', 'Cache-Control': 'max-age=4723', 'Pragma': 'public', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'sameorigin', 'Strict-Transport-Security': 'max-age=31536000', 'X-XSS-Protection': '1; mode=block', 'X-UA-Compatible': 'IE=edge', 'Referrer-Policy': 'strict-origin-when-cross-origin', 'Permissions-Policy': 'accelerometer=(), camera=(), geolocation=(), gyroscope=(), magnetometer=(), microphone=(), payment=(), usb=(), interest-cohort=()', 'Content-Length': '31579', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'text/html; charset=utf-8'}


This code imports the `requests` library and issues an HTTP GET request to the web server to retrieve the requested URL. The `response` object contains information about the server's response and the content of the website itself. The `response.status_code` field, for example, contains the status code of the response; 200 indicates that the request was successful. The `response.headers` field shows metadata such as the content language (`de-DE`) or the timestamp at which the server generated the response (`Date`). To look at the content, use `response.content` (the raw bytes) or `response.text` (the content decoded into a string).
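The status code is a three-digit number whose first digit indicates the class of outcome. As a small sketch using only the standard library's `http.HTTPStatus` (the helper name `describe_status` is made up for this example), status codes can be translated into readable text:

```python
from http import HTTPStatus

# Map the first digit of a status code to its class as defined by HTTP.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    phrase = HTTPStatus(code).phrase
    category = STATUS_CLASSES[code // 100]
    return f"{code} {phrase} ({category})"

for code in (200, 301, 404, 500):
    print(describe_status(code))
```

With `requests`, the shortcut `response.ok` is `True` for any status code below 400.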

In [None]:
# Show the first 400 bytes of the content (response.content is a bytes object)
response.content[:400]
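The difference between raw bytes and decoded text can be illustrated without any network access. The following sketch uses a made-up HTML snippet as a stand-in for a response body; it only assumes that the server encodes its pages as UTF-8, as the `Content-Type` header above suggests.

```python
# Hypothetical response body: a short HTML snippet containing German umlauts,
# encoded as UTF-8 the way a server would send it.
raw_body = "<title>Grüße von der THI</title>".encode("utf-8")

# Bytes are printed with a b'' prefix and escape non-ASCII characters.
print(raw_body)

# Decoding with the right charset gives a regular Python string.
text = raw_body.decode("utf-8")
print(text)

# The lengths differ because "ü" and "ß" each take two bytes in UTF-8.
print(len(raw_body), len(text))
```

This is roughly what separates `response.content` (bytes) from `response.text` (string): the latter has already been decoded for you.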

### Parameters
You probably have seen URLs like these before.

https://www.google.com/search?q=python+download+url+content&source=chrome

The `https://www.google.com/search` string is the URL, and everything after the `?` is the list of parameters. Each parameter has the form "param=value", and several parameters are separated by an ampersand `&`. Many special characters need to be encoded in these parameters, for example spaces are replaced by "%20" or "+". The `requests` library handles this encoding for you. You can simply pass all the parameters as a Python dictionary.

In [None]:
# The parameter names must match the ones the server expects ("q" in the URL above)
params = {"q": "python download url content", "source": "chrome"}
response = requests.get("https://www.google.com/search", params=params)
print(response.status_code)
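To see what this encoding looks like, the query string can also be built by hand. The sketch below uses the standard library's `urllib.parse` rather than `requests`, so it runs without sending anything; it is meant as an illustration of the encoding step, not as a description of how `requests` is implemented internally.

```python
from urllib.parse import urlencode, quote, parse_qs

params = {"q": "python download url content", "source": "chrome"}

# urlencode builds the query string; spaces become "+" in this form-style encoding.
query_string = urlencode(params)
print(query_string)

# quote uses percent-encoding instead, so a space becomes "%20".
print(quote("python download"))

# The full URL is the base URL, a "?", and the query string.
full_url = "https://www.google.com/search?" + query_string
print(full_url)

# parse_qs reverses the encoding and returns each value as a list.
print(parse_qs(query_string))
```

After a call to `requests.get(..., params=params)`, the attribute `response.url` shows the final URL that was actually requested, which is handy for checking the encoding.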

Besides GET, there are other common HTTP methods (POST, PUT, DELETE), each of which is available through the corresponding function in the library (`requests.post`, `requests.put`, `requests.delete`).

## RESTful APIs

The response object above returns the HTML of the requested website. Often, you only want specific information such as the title or some text content. Many web-based services therefore provide an API (application programming interface) that clients can use to request exactly the data they need. The currently most widely used style for such web APIs is REST (Representational State Transfer). In practice, you should remember the following points about REST APIs:
1. You call REST APIs using standard HTTP methods: GET, POST, DELETE, PUT. You will probably see GET and POST used most frequently.
2. REST servers don’t store state. This means that each time you issue a request, you need to include all relevant information like your account key, etc.
3. REST calls usually return information in a structured format, typically JSON. The `json()` method of the `requests` response object parses it into a Python dictionary (or list) with the relevant data.

For more details refer to the corresponding lecture slides.   

In [None]:
api_url="https://swapi.dev/api/people/1/"
response = requests.get(api_url)
response.json()

Here, the URI "https://swapi.dev/api/people/1/" is called to retrieve information about Luke Skywalker. The `json()` method parses the JSON body of the response into a Python dictionary.
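Once the JSON has been parsed, the result behaves like any other Python dictionary. The following sketch works on an abbreviated, illustrative payload in the style of the SWAPI response (only three fields are included here), so it runs without a network connection:

```python
import json

# Abbreviated, illustrative payload in the style of the SWAPI response above.
raw = '{"name": "Luke Skywalker", "height": "172", "films": ["https://swapi.dev/api/films/1/"]}'

person = json.loads(raw)  # JSON object -> Python dict

print(type(person).__name__)
print(person["name"])

# Numbers may arrive as strings and need an explicit conversion.
height_cm = int(person["height"])
print(height_cm)

# JSON arrays become Python lists.
print(len(person["films"]))

# .get() avoids a KeyError for fields that may be missing.
print(person.get("nickname", "n/a"))
```

`response.json()` does the same as `json.loads(response.text)`, so the access patterns shown here carry over directly.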

## Authentication
Most APIs require an API key (or another form of authentication). An API key is a token assigned to a client upon registration with the service. The client provides the key when making API calls.

The key can be sent in the request header:
```
GET /something HTTP/1.1
X-API-Key: asdfeweradsfa
```
A problem with this kind of authentication is that the key can be picked up by anyone if any point along the network path is insecure. Some APIs use more sophisticated methods, such as OAuth 2.0. See [this blog entry](https://blog.restcase.com/4-most-used-rest-api-authentication-methods/) for further information.
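Because the key is a secret, it is usually kept out of the notebook itself. A minimal sketch, assuming the key is stored in an environment variable: the variable name `EXAMPLE_API_KEY`, the helper `build_auth_headers`, and the dummy value are all made up for this example, and the header name `X-API-Key` is taken from the snippet above (real APIs document their own header or parameter name).

```python
import os

def build_auth_headers(env_var="EXAMPLE_API_KEY"):
    """Build request headers with an API key read from an environment variable."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return {"X-API-Key": api_key}

# For demonstration only: set a dummy value so the sketch runs as-is.
os.environ.setdefault("EXAMPLE_API_KEY", "dummy-key-for-demo")

headers = build_auth_headers()
print(headers)
```

The resulting dictionary can be passed to `requests.get(url, headers=headers)`. Sending the request over `https://` rather than `http://` encrypts the header in transit.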

### Using LastFM
To get an API key for the LastFM API, open [LastFM](https://www.last.fm/api/account/create/) and fill in the form. Choose an application name and briefly describe the purpose of your registration. You will receive your key (API_KEY). Please save this key, as you need it to submit requests to the LastFM API.

In [None]:
API_KEY = '029d2e4b7ccf394cf0143e34ff62a6a0' # Fake API Key, replace with your key
USER_AGENT = 'navarrobullock'

The following function defines the base URL of the LastFM API, the request headers, and the parameters for the API request. In addition to the caller's payload, the parameter dictionary contains the `API_KEY` and the response format (JSON).

In [None]:
def lastfm_get(payload):
    """Send a GET request to the LastFM API and return the response."""
    # identify the client via the User-Agent header
    headers = {'user-agent': USER_AGENT}

    # base URL of the LastFM API
    url = 'https://ws.audioscrobbler.com/2.0/'

    # add API key and format to a copy, so the caller's dict is not modified
    params = {**payload, 'api_key': API_KEY, 'format': 'json'}

    response = requests.get(url, headers=headers, params=params)
    return response

The function `lastfm_get` is then called with a payload that specifies the API method you want to request. Some methods take further parameters, for example filter parameters. The [API documentation](https://www.last.fm/api) lists them for each method. This example uses [`chart.getTopArtists`](https://www.last.fm/api/show/chart.getTopArtists). The response is stored in the `r` object.

In [None]:
r = lastfm_get({
    'method': 'chart.gettopartists'
})

r.status_code

In [None]:
import json

# create a formatted string of the Python JSON object
text = json.dumps(r.json(), sort_keys=True, indent=4)
print(text)
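The printed JSON is nested: the list of artists sits under `artists` and then `artist`. The following sketch shows how such a structure can be flattened into a list of rows. The sample dictionary only imitates the assumed shape of a `chart.getTopArtists` response; the artist names and numbers are made up.

```python
# Abbreviated sample in the assumed shape of a chart.getTopArtists response;
# the artist names and numbers are made up for illustration.
sample = {
    "artists": {
        "artist": [
            {"name": "Artist A", "listeners": "1200", "playcount": "50000"},
            {"name": "Artist B", "listeners": "900", "playcount": "72000"},
        ],
        "@attr": {"page": "1", "perPage": "2"},
    }
}

# Walk down the hierarchy to the list of artist dictionaries.
artists = sample["artists"]["artist"]

# Keep only the fields of interest and convert the counts from str to int.
rows = [
    {"name": a["name"], "listeners": int(a["listeners"]), "playcount": int(a["playcount"])}
    for a in artists
]

# Sort by play count, highest first.
by_playcount = sorted(rows, key=lambda row: row["playcount"], reverse=True)
for row in by_playcount:
    print(row["name"], row["playcount"])
```

With a real response, `sample` would be replaced by `r.json()`.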

## Collecting and processing data using an API

The following script loads all Star Wars starships. The starting point is the URL https://swapi.dev/api/starships/. In a loop, each page is retrieved with `requests.get(..., timeout=30)`, the JSON response is parsed with `r.json()`, and the results are collected in `items`. The `extend()` method appends the individual elements of the `results` list to `items`. The `next` field of the API response is used to move on to the following page until no pages remain. Errors are handled explicitly: HTTP errors (e.g. 404/500) and general request exceptions are caught and printed.


In [None]:
BASE = "https://swapi.dev/api"

url = f"{BASE}/starships/"
items = []
while url:
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        j = r.json()
        items.extend(j.get("results", []))
        url = j.get("next")
    # bad status code (4xx/5xx)
    except requests.exceptions.HTTPError as e:
        print("HTTP-Error:", e)
        break  # stop here: url is unchanged, so continuing would loop forever
    # ConnectionError, Timeout, ProxyError
    except requests.exceptions.RequestException as e:
        print("General Request-Error:", e)
        break  # same reason as above
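The pagination pattern itself can be tried out without a network connection. In the sketch below, `FAKE_PAGES` and `fetch_page` are hypothetical stand-ins for the API and for `requests.get(url).json()`; `max_pages` is an extra safety limit that is not part of the script above.

```python
# Hypothetical in-memory "API": each URL maps to one page of results.
FAKE_PAGES = {
    "page1": {"results": ["X-wing", "Y-wing"], "next": "page2"},
    "page2": {"results": ["TIE Fighter"], "next": "page3"},
    "page3": {"results": ["Millennium Falcon"], "next": None},
}

def fetch_page(url):
    """Stand-in for requests.get(url).json(); raises KeyError for unknown URLs."""
    return FAKE_PAGES[url]

def fetch_all(start_url, max_pages=100):
    """Follow the 'next' links from start_url and collect all results."""
    items = []
    url = start_url
    pages_read = 0
    while url and pages_read < max_pages:
        try:
            page = fetch_page(url)
        except KeyError as e:
            print("Could not load page:", e)
            break  # leave the loop instead of retrying the same URL forever
        items.extend(page.get("results", []))
        url = page.get("next")  # None on the last page ends the loop
        pages_read += 1
    return items

print(fetch_all("page1"))
```

The loop ends in one of three ways: the last page reports no `next` link, an error occurs, or the page limit is reached.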



### Read output into a Pandas dataframe
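Load the collected items into a pandas DataFrame. Each starship is a dictionary, so `pd.DataFrame(items)` creates one row per ship and one column per field. The numeric fields arrive as strings and may contain non-numeric values such as `unknown`, so they are converted with `pd.to_numeric`; the `"coerce"` option turns values that cannot be parsed into `NaN`.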

In [None]:
import pandas as pd
df = pd.DataFrame(items)
# Convert string columns to numbers; unparseable values become NaN
df["cost_in_credits"] = pd.to_numeric(df["cost_in_credits"], errors="coerce")
df["hyperdrive_rating"] = pd.to_numeric(df["hyperdrive_rating"], errors="coerce")
df["cargo_capacity"] = pd.to_numeric(df["cargo_capacity"], errors="coerce")
print(df["cost_in_credits"].dtype)
print(df[["name", "cost_in_credits", "cargo_capacity"]])
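The effect of the `"coerce"` option is easiest to see on a tiny example. The values below are made up and only mimic the way the API delivers numbers as strings:

```python
import pandas as pd

# Made-up values mimicking API fields that arrive as strings.
raw = pd.DataFrame({
    "name": ["Ship A", "Ship B", "Ship C"],
    "cost_in_credits": ["150000", "unknown", "3500000"],
})

# errors="coerce" turns anything that cannot be parsed into NaN.
raw["cost_in_credits"] = pd.to_numeric(raw["cost_in_credits"], errors="coerce")

# The column is now numeric, with one missing value.
print(raw["cost_in_credits"].dtype)
print(raw["cost_in_credits"].isna().sum())

# Rows with missing values can then be dropped explicitly.
clean = raw.dropna(subset=["cost_in_credits"]).copy()
print(clean["name"].tolist())
```

Without `errors="coerce"`, `pd.to_numeric` raises an error on the first value it cannot parse.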

In [None]:
# Drop rows with missing values; .copy() avoids a SettingWithCopyWarning below
df = df.dropna(subset=["cost_in_credits", "cargo_capacity"]).copy()
df["cost_millions"] = df["cost_in_credits"] / 1e6
df["cargo_millions"] = df["cargo_capacity"] / 1e6

print(df[["name", "cost_millions", "cargo_millions"]])

In [None]:
import matplotlib.pyplot as plt

plt.scatter(df["cost_in_credits"], df["cargo_capacity"])
plt.xscale("log")
plt.yscale("log")
plt.xlabel("Cost (credits, log scale)")
plt.ylabel("Cargo capacity (log scale)")
plt.title("Star Wars Ships: Cost vs Cargo Capacity")
plt.show()


In [None]:
top = df.sort_values("cost_in_credits", ascending=False).head(10)
plt.barh(top["name"], top["cost_in_credits"])
plt.xlabel("Cost in credits")
plt.title("Top 10 Most Expensive Starships")
plt.gca().invert_yaxis()
plt.show()

In [None]:
plt.figure(figsize=(8, 4))
no_outliers = top[top["cost_in_credits"] < 1e9]
plt.barh(no_outliers["name"], no_outliers["cost_in_credits"])
plt.xlabel("Cost in credits")
plt.title("Starships (excluding outliers)")
plt.gca().invert_yaxis()
plt.show()

## Exercise
Use the Star Wars API (SWAPI) to fetch all planets (there are multiple pages) using the `requests` library and store the columns `name`, `diameter`, `population`, `surface_water` and `gravity` into a `pandas.DataFrame`.

Convert your data to numeric and handle *unknown* values.

Create a scatter plot: diameter vs. population

Visualize the top most populated planets (with and without outliers)