# Working with Data from the Web

## Grading instructions

1. Launch VS Code and open your working-folder.
2. Create a `Session_02` folder and, inside it, another folder called `data`.
3. Copy and paste this notebook, `02_Data_from_the_Web_lecture`, from the lectures repo into the working-folder.
4. Copy the JSON file `animals.json` and the zip file `arbeitsmarktstatistik_erwerbslosenquoten_geschlecht.zip` into the `working-folder/Session_02/data` directory.

#### There are two `Self-work exercises` in this notebook
1. During the course, do the self-work exercises
2. Once finished, copy and paste this notebook `02_Data_from_the_Web_lecture` into the `ESMT_2024_DataScraping_Students` folder on your computer
3. Commit and push your self-work in your branch before the deadline. **Push only the notebook, not the files!**

#### Number of points: 10 (worth 10% of the final grade)
- `Self-work exercises #1`: 4 points
- `Self-work exercises #2`: 6 points



#### Deadline: October 18th 08:59 am CET
#### Any missed deadline without justification to the Administration will result in 0 points for this homework.
#### If the GitHub branch is not correctly named using the indicated format **LASTNAME_firstname**, a 2-point penalty will be applied

## Course content

* Introduction to APIs
* Using the `requests` package to download files
* Loading files and tables from URLs (Wikipedia)
* Working with zip files in Python
* Introduction to JSON (`read_json`, `json_normalize`)

# Introduction to APIs

An API, short for Application Programming Interface, is essentially a piece of intermediary software (the interface) that facilitates communication between two other pieces of software (the applications).

This very broad term is frequently used for web-based systems, database systems, operating systems, or even computer hardware. 

In this chapter we will focus on web-based APIs.

### What is a Web API?

A Web API is typically a special website or URL that we use as a channel to get data from a company or a web-based program.

We can write a Python program to retrieve data from the API. Put very bluntly, an API is a website providing data that is easy for a machine (e.g. Python code) to understand, as opposed to a prettier, HTML-rendered user interface for humans.

**Intro to API (duration: 3'24):**
https://www.youtube.com/watch?v=s7wmiS2mSXY

### Examples of Web APIs
* Google Maps: get map coordinates for an address
* Spotify: read and modify a playlist
* GitHub: read statistics on your code repo
* WeatherAPI: get weather data for specific location
* Google Translate: translate texts directly from a Python script

<br>

# Python Requests Library

The Python requests library is a popular third-party library that simplifies the process of making HTTP requests and working with HTTP responses. 

It provides a high-level interface for sending HTTP requests to web servers and receiving their responses. 

This library is widely used for tasks like fetching data from APIs, sending data to servers, and interacting with web resources.

Installation: You can install the requests library using pip, a package installer for Python:

In [1]:
!pip install requests==2.32.3



Importing: After installation, you need to import the library in your Python code before you can use it:

In [2]:
import requests

HTTP Methods: The library supports various HTTP methods, such as GET, POST, PUT, and DELETE. POST, GET, PUT, and DELETE roughly correspond to the create, read, update, and delete (CRUD) operations, respectively. Choose the method appropriate for your request.

To make a GET request:

In [3]:
response = requests.get("https://randomuser.me/api/")

## API status code

In [4]:
print(response.status_code)

200


Status codes are returned with every request that is made to a web server. They indicate what happened with the request. Here are some codes that are relevant to GET requests:

* 200: Everything went okay, and the result has been returned (if any).
* 301: The server is redirecting you to a different endpoint. This can happen when a company switches domain names, or an endpoint name is changed.
* 400: The server thinks you made a bad request. This can happen when you don’t send along the right data, among other things.
* 401: The server thinks you’re not authenticated. Many APIs require login credentials, so this happens when you don’t send the right credentials to access an API.
* 403: The resource you’re trying to access is forbidden: you don’t have the right permissions to see it.
* 404: The resource you tried to access wasn’t found on the server.
* 503: The server is not ready to handle the request.

For more information about the various HTTP status codes, [click here](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes).
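
As a small illustration of the codes listed above, the sketch below uses Python's standard-library `http.HTTPStatus` enum to turn a numeric code into its official phrase. The helper names `describe_status` and `is_success` are hypothetical, chosen only for this example; with the requests library you would pass `response.status_code` to them.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable label for an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # The code is not part of the standard list known to Python
        return f"{code}: unknown status code"
    return f"{status.value} {status.phrase}"


def is_success(code: int) -> bool:
    """Codes in the 2xx range mean the request succeeded."""
    return 200 <= code < 300


for code in (200, 301, 404, 503):
    print(describe_status(code), "| success:", is_success(code))
```

A check such as `is_success(response.status_code)` before calling `response.json()` is one simple way to avoid parsing an error page by mistake.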

## Response Object


When you send a request using the requests library, you receive a response object that contains information about the server’s response:

* `response.status_code`: HTTP status code of the response (200 means the operation was successful)
* `response.content`: Raw content of the response
* `response.text`: Content of the response in text format
* `response.json()`: Parses the response content as JSON if applicable
* `response.headers`: Headers received from the server (additional information passed along with both the request and the response, for example the list of acceptable encodings; it is usually hidden from the end user)
* `response.url`: The URL that was accessed
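
The difference between the raw content, the text, and the parsed JSON can be illustrated without any network access. The sketch below uses a made-up payload and the standard library only; it mimics what the three attributes hold and is not how requests is implemented internally. It also assumes the payload is UTF-8 encoded.

```python
import json

# Assumption: a made-up payload standing in for the bytes a server might send
raw_content = b'{"results": [{"gender": "male", "name": {"first": "Joseph"}}]}'

# Comparable to response.content: undecoded bytes
print(type(raw_content))

# Comparable to response.text: the bytes decoded into a string
text = raw_content.decode("utf-8")
print(type(text))

# Comparable to response.json(): the text parsed into Python objects
data = json.loads(text)
print(type(data))
print(data["results"][0]["name"]["first"])
```

Once the data is parsed, you navigate it like any nested combination of dictionaries and lists.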


In [5]:
response.json()

{'results': [{'gender': 'male',
   'name': {'title': 'Mr', 'first': 'Joseph', 'last': 'Martinez'},
   'location': {'street': {'number': 5257, 'name': 'Nowlin Rd'},
    'city': 'Devonport',
    'state': 'Queensland',
    'country': 'Australia',
    'postcode': 2190,
    'coordinates': {'latitude': '-88.8811', 'longitude': '-142.1982'},
    'timezone': {'offset': '+7:00', 'description': 'Bangkok, Hanoi, Jakarta'}},
   'email': 'joseph.martinez@example.com',
   'login': {'uuid': '88c88565-4130-492a-80be-46b72f0fd075',
    'username': 'sadwolf358',
    'password': '7779311',
    'salt': 'sfpbKkNF',
    'md5': '324a77f6ee2797a12072293124a91292',
    'sha1': '8b0c5695b5b45463baaff16515afd751a07b756b',
    'sha256': '71c63f1143b066841984e57e588fc10579a9b3ea47d7447a514268bc6bcb0b60'},
   'dob': {'date': '1968-11-21T05:43:24.130Z', 'age': 55},
   'registered': {'date': '2017-04-12T00:04:40.489Z', 'age': 7},
   'phone': '01-9098-0435',
   'cell': '0436-040-595',
   'id': {'name': 'TFN', 'value':

## Query String Parameters

One common way to customize a GET request is to pass values through query string parameters in the URL. 

To do this using get(), you pass data to params.
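
To make the idea of a query string concrete, the sketch below builds one by hand with the standard library. The URL `https://api.example.com/weather` and the parameter names are hypothetical. When you pass a dictionary to `params`, requests does this encoding for you, so this is only an illustration of the resulting URL.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameters, used only for illustration
base_url = "https://api.example.com/weather"
params = {"city": "Berlin", "date": "2023-10-08", "units": "metric"}

# Each key=value pair is joined with "&" and appended after a "?"
full_url = f"{base_url}?{urlencode(params)}"
print(full_url)
```

After a real request, `response.url` lets you inspect the final URL, including the query string.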

### Weather API:
1. Sign up to RapidAPI: https://rapidapi.com/signup
2. Select the free plan: https://rapidapi.com/meteostat/api/meteostat/pricing 

It allows up to 500 requests per month, with a maximum of 3 requests per second.

3. Go to this page and click on `Get Daily Station Data` on the left pane: https://rapidapi.com/meteostat/api/meteostat

4. The `x-rapidapi-key` should appear in the `Code snippets` panel on the right: copy and paste it below

5. Complete the information below to get the weather of Berlin today

### Important note: do not push this notebook to GitHub while it contains your API key, which you don't want to reveal. If you need to push this content to GitHub, delete your API key first!
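
One common way to keep a key out of a notebook altogether is to read it from an environment variable. The sketch below assumes you have stored the key beforehand in a variable named `RAPIDAPI_KEY`; this name is hypothetical and not something RapidAPI requires.

```python
import os

# Assumption: the key was stored beforehand in an environment variable
# called RAPIDAPI_KEY (a hypothetical name chosen for this example)
private_api_key = os.environ.get("RAPIDAPI_KEY", "")

if not private_api_key:
    print("RAPIDAPI_KEY is not set; define it before calling the API.")
else:
    # Never print the key itself; show only its length as a sanity check
    print(f"API key loaded ({len(private_api_key)} characters).")
```

With this approach the notebook never contains the secret, so nothing needs to be deleted before a commit.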


In [7]:
# Use double quotes to assign your API key to the private_api_key variable as a string
private_api_key = "YOUR_RAPIDAPI_KEY"  # placeholder: paste your own key here and never commit it

In [8]:
# Define your parameters as a dictionary
params = {
    "lat": 52.5200,  # Berlin's latitude
    "lon": 13.404954,  # Berlin's longitude
    "start": "2023-10-08",  # Replace with today's date
    "end": "2023-10-08",  # Replace with today's date
}

response = requests.get("https://meteostat.p.rapidapi.com/point/daily",
                       params=params,
                       headers={
                           "X-RapidAPI-Host": "meteostat.p.rapidapi.com",
                           # Pass the variable itself, not the string "private_api_key"
                           "X-RapidAPI-Key": private_api_key,
                       })

# Do not push your API key to GitHub!

In [9]:
print(response.json())

{'message': 'You are not subscribed to this API.'}


<br>

# Data from Wikipedia

## Installation

First, install and import the package

*Note: all information about the packages and their versions can be found on [PyPI](https://pypi.org/), for example the [wikipedia package](https://pypi.org/project/wikipedia/)*

In [10]:
!pip install wikipedia==1.4.0

Collecting wikipedia==1.4.0
  Using cached wikipedia-1.4.0-py3-none-any.whl
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


In [11]:
# Import the package
import wikipedia

## First commands
* wikipedia.summary(<>): provides a summary of the page
* wikipedia.page(<>).content: provides the content
* wikipedia.page(<>).url: provides the URL of the page
* wikipedia.set_lang(<>): changes the request language

### Try the following commands:

In [12]:
wikipedia.summary("The Kingkiller Chronicle")

'The Kingkiller Chronicle is a planned fantasy trilogy by the American writer Patrick Rothfuss. The first two books, The Name of the Wind and The Wise Man\'s Fear, were released in 2007 and 2011. The books released in the series have sold over 10 million copies.\nThe series centers on a man named Kvothe, an infamous adventurer and musician telling his life story to a scribe. The book is told in a "story-within-a-story" format: a frame narrative relates the present day in which Kvothe runs an inn under an assumed name and is told in omniscient third person. The main plot, making up the majority of the books and concerning the actual details of Kvothe\'s life, is told in the first person. The series also contains metafictional stories within stories from varying perspectives that tie to the main plot in various ways.\n\n'

In [13]:
wikipedia.page("The Kingkiller Chronicle").content

'The Kingkiller Chronicle is a planned fantasy trilogy by the American writer Patrick Rothfuss. The first two books, The Name of the Wind and The Wise Man\'s Fear, were released in 2007 and 2011. The books released in the series have sold over 10 million copies.\nThe series centers on a man named Kvothe, an infamous adventurer and musician telling his life story to a scribe. The book is told in a "story-within-a-story" format: a frame narrative relates the present day in which Kvothe runs an inn under an assumed name and is told in omniscient third person. The main plot, making up the majority of the books and concerning the actual details of Kvothe\'s life, is told in the first person. The series also contains metafictional stories within stories from varying perspectives that tie to the main plot in various ways.\n\n\n== Synopsis ==\nThe Kingkiller Chronicle tells the life story of a man named Kvothe. In the present day, Kvothe is a rural innkeeper, living under a pseudonym. In the p

In [14]:
wikipedia.page("The Kingkiller Chronicle").url

'https://en.wikipedia.org/wiki/The_Kingkiller_Chronicle'

In [15]:
wikipedia.set_lang('de')
wikipedia.summary("The Kingkiller Chronicle")

'Die Königsmörder-Chronik (Originaltitel: The Kingkiller Chronicle) ist eine Fantasy-Romanreihe des US-amerikanischen Schriftstellers Patrick Rothfuss. Thematisiert wird der sagenumwobene Arkanist Kvothe, der seine eigene Lebensgeschichte in Form einer Autobiografie von einem Chronisten niederschreiben lässt. Die Geschichte ist folglich in eine Gegenwartshandlung und eine Vergangenheitshandlung unterteilt. Jeder Buchableger verkörpert einen Tag in der Gegenwart.\n\n'

## Extract tabular data from wikipedia into a Pandas dataframe

In [16]:
# First, import the pandas package
import pandas as pd

In [17]:
# You may need to pip install lxml package
!pip install lxml

Collecting lxml
  Downloading lxml-5.3.0-cp312-cp312-win_amd64.whl.metadata (3.9 kB)
Downloading lxml-5.3.0-cp312-cp312-win_amd64.whl (3.8 MB)
Installing collected packages: lxml
Successfully installed lxml-5.3.0


In [18]:
# Import the lxml package
import lxml

In [19]:
# Extract tabular data from wikipedia into a Pandas dataframe.
# We will use pandas.read_html

tables = pd.read_html("https://en.wikipedia.org/wiki/FIVB_Volleyball_Women%27s_World_Cup#Results_summary",
                     match='Champions')

In [20]:
# Show the tables variable
tables

[            Year     Host    Unnamed: 2      Champions     Runners-up  \
 0            NaN      NaN           NaN            NaN            NaN   
 1            NaN      NaN           NaN            NaN            NaN   
 2            NaN      NaN           NaN            NaN            NaN   
 3            NaN      NaN           NaN            NaN            NaN   
 4            NaN      NaN           NaN            NaN            NaN   
 5            NaN      NaN           NaN            NaN            NaN   
 6            NaN      NaN           NaN            NaN            NaN   
 7            NaN      NaN           NaN            NaN            NaN   
 8            NaN      NaN           NaN            NaN            NaN   
 9            NaN      NaN           NaN            NaN            NaN   
 10           NaN      NaN           NaN            NaN            NaN   
 11           NaN      NaN           NaN            NaN            NaN   
 12           NaN      NaN           N

In [21]:
# How big is tables?
len(tables)

11

In [22]:
# Let's get only the first result 
results_df = tables[0]

In [23]:
# Display the content of results_df
results_df

Unnamed: 0,Year,Host,Unnamed: 2,Champions,Runners-up,3rd place,4th place,Unnamed: 7,Teams
0,,,,,,,,,
1,,,,,,,,,
2,,,,,,,,,
3,,,,,,,,,
4,,,,,,,,,
5,,,,,,,,,
6,,,,,,,,,
7,,,,,,,,,
8,,,,,,,,,
9,,,,,,,,,


In [24]:
# Let's clean the data:
## Remove all rows that contain only null values

clean_df = results_df.dropna(how="all")
clean_df

Unnamed: 0,Year,Host,Unnamed: 2,Champions,Runners-up,3rd place,4th place,Unnamed: 7,Teams
14,1973 Details,Uruguay,Soviet Union,Japan,South Korea,Peru,10.0,,
15,1977 Details,Japan,Japan,Cuba,South Korea,China,8.0,,
16,1981 Details,Japan,China,Japan,Soviet Union,United States,8.0,,
17,1985 Details,Japan,China,Cuba,Soviet Union,Japan,8.0,,
18,1989 Details,Japan,Cuba,Soviet Union,China,Japan,8.0,,
19,1991 Details,Japan,Cuba,China,Soviet Union,United States,12.0,,
20,1995 Details,Japan,Cuba,Brazil,China,Croatia,12.0,,
21,1999 Details,Japan,Cuba,Russia,Brazil,South Korea,12.0,,
22,2003 Details,Japan,China,Brazil,United States,Italy,12.0,,
23,2007 Details,Japan,Italy,Brazil,United States,Cuba,12.0,,
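
The `how` argument decides how strict `dropna` is. The sketch below uses a tiny made-up DataFrame (not the Wikipedia table) to contrast `how="all"` with the default `how="any"`.

```python
import pandas as pd

# Made-up data: row 0 is completely empty, row 2 is only partly empty
toy = pd.DataFrame({
    "Year": [None, "1973 Details", "1977 Details"],
    "Host": [None, "Uruguay", None],
})

# how="all": drop a row only if every value in it is missing
only_empty_removed = toy.dropna(how="all")
print(only_empty_removed)

# how="any" (the default): drop a row as soon as one value is missing
any_missing_removed = toy.dropna(how="any")
print(any_missing_removed)
```

The same argument works for columns when combined with `axis="columns"`.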


#### Jump to the `Self-Work Exercises #1` Section

# Read zip files 

In [25]:
unemployment_rates = pd.read_csv("./data/arbeitsmarktstatistik_erwerbslosenquoten_geschlecht.zip", 
                        sep=";",
                        encoding='latin-1')

# Note: "Erwerbslosenquoten" means unemployment rates
# "Arbeitsmarktstatistik" means labour market statistics
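
`pd.read_csv` can read a zip archive directly when the archive contains a single data file. If you want to look inside an archive first, Python's standard-library `zipfile` module can help. The sketch below builds a tiny archive in memory so that it runs without any file on disk; the file name `rates.csv` and its contents are made up for illustration.

```python
import io
import zipfile

# Build a tiny zip archive in memory (made-up file name and contents)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    archive.writestr("rates.csv", "Datum;Quote\n01/03/2007;9,5\n")

# Re-open the archive for reading and inspect it
buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()  # names of the files stored in the archive
    with archive.open(names[0]) as f:  # read one member without extracting it
        first_line = f.readline().decode("latin-1").strip()

print(names)
print(first_line)
```

For a file on disk, you would pass its path to `zipfile.ZipFile` instead of the in-memory buffer.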

In [30]:
# Display the data
unemployment_rates.head()


Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,ILO-Arbeitsmarktstatistik
Datum,"Erwerbslosenquoten Männer insgesamt, in %","Erwerbslosenquoten Männer unter 25 Jahren, in %","Erwerbslosenquoten Männer ab 25 Jahren, in %","Erwerbslosenquoten Frauen insgesamt, in %","Erwerbslosenquoten Frauen unter 25 Jahren, in %","Erwerbslosenquoten Frauen ab 25 Jahren, in %"
01/03/2007,95,127,91,88,87,88
01/04/2007,83,136,77,90,106,87
01/05/2007,85,117,81,85,104,83
01/06/2007,80,125,74,87,129,82
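
In the output above, the real column names appear as the first data row, which suggests that the file starts with a title line. The sketch below shows, on a small made-up sample, how the `header` argument of `pd.read_csv` can skip such a line. It also uses `decimal=","`, assuming the numbers in the file are written with a decimal comma; this is an assumption that should be checked against the raw file.

```python
from io import StringIO

import pandas as pd

# Made-up sample imitating the assumed layout: a title line, then the header
sample = (
    ";ILO-Arbeitsmarktstatistik\n"
    "Datum;Erwerbslosenquoten Männer insgesamt, in %\n"
    "01/03/2007;9,5\n"
    "01/04/2007;8,3\n"
)

rates = pd.read_csv(
    StringIO(sample),
    sep=";",
    header=1,     # use the second line (index 1) as the column names
    decimal=",",  # read "9,5" as the number 9.5
)
print(rates)
print(rates.dtypes)
```

The same two arguments could be added to the `pd.read_csv` call on the zip file if the raw file confirms this layout.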


<br><br>

# Introduction to JSON

In [31]:
# Import the json package
import json

In [32]:
# Creating student records
student1 = {
    "name": "Alice",
    "age": 15,
    "grade": 10,
    "subjects": ["Math", "Science", "History"],
    "city": "Augsburg"
}

student2 = {
    "name": "Bob",
    "age": 16,
    "grade": 11,
    "subjects": ["English", "Physics", "Geography"],
    "city": "Berlin"
}

student3 = {
    "name": "Carol",
    "age": 14,
    "grade": 9,
    "subjects": ["Art", "Music", "PE"],
    "city": "Cottbus"
}

In [33]:
# Creating a list of student records
student_database = [student1, student2, student3]

In [34]:
# Converting the student database to JSON format
json_data = json.dumps(student_database, indent=4)

With indent=4: Each level of the JSON structure will be indented by 4 spaces, making it easier to read and visually understand the nested structure.

In [35]:
# Printing the JSON data
print(json_data)

[
    {
        "name": "Alice",
        "age": 15,
        "grade": 10,
        "subjects": [
            "Math",
            "Science",
            "History"
        ],
        "city": "Augsburg"
    },
    {
        "name": "Bob",
        "age": 16,
        "grade": 11,
        "subjects": [
            "English",
            "Physics",
            "Geography"
        ],
        "city": "Berlin"
    },
    {
        "name": "Carol",
        "age": 14,
        "grade": 9,
        "subjects": [
            "Art",
            "Music",
            "PE"
        ],
        "city": "Cottbus"
    }
]
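
`json.dumps` turns Python objects into a JSON string, and `json.loads` does the reverse. The short sketch below, using a made-up record, shows that a round trip gives back an equal object.

```python
import json

record = {"name": "Alice", "subjects": ["Math", "Science"]}

# dumps: Python object -> JSON string
as_text = json.dumps(record)
print(type(as_text))

# loads: JSON string -> Python object
restored = json.loads(as_text)
print(restored == record)
```

The functions without the trailing "s", `json.dump` and `json.load`, work the same way but write to and read from file objects.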


### Read a json file

First, copy and paste the `animals.json` file into the `data` folder of your working directory

### Load the file

In [36]:
## We first load the file
with open("./data/animals.json") as f:
    d = json.load(f)

print(d)

[{'name': 'Ace', 'animal': 'Alpaca', 'tricks': ['spit'], 'demographics': {'sex': 'male', 'age': 2}, 'owner': 'Alice', 'city': 'Augsburg'}, {'name': 'Biscuit', 'animal': 'Beagle', 'tricks': ['sit', 'hunt'], 'demographics': {'sex': 'female', 'age': 1}, 'owner': 'Bob', 'city': 'Berlin'}, {'name': 'Coco', 'animal': 'Chinchilla', 'tricks': ['explore', 'climb'], 'demographics': {'sex': 'male', 'age': 5}, 'owner': 'Carol', 'city': 'Cottbus'}]


### Normalise the loaded data

In [37]:
# To convert each element of the "demographics" section into its own column,
# we use pd.json_normalize
animals_v1 = pd.json_normalize(d)

In [38]:
# Display animals_v1
animals_v1

Unnamed: 0,name,animal,tricks,owner,city,demographics.sex,demographics.age
0,Ace,Alpaca,[spit],Alice,Augsburg,male,2
1,Biscuit,Beagle,"[sit, hunt]",Bob,Berlin,female,1
2,Coco,Chinchilla,"[explore, climb]",Carol,Cottbus,male,5
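
To get an intuition for what happened to the `demographics` section, the sketch below flattens a nested dictionary by hand. The `flatten` function is a hypothetical helper written for this illustration; it is not pandas' actual implementation, and the order of the resulting keys may differ from the column order pandas produces.

```python
def flatten(record: dict, sep: str = ".", prefix: str = "") -> dict:
    """Flatten nested dictionaries into a single level with joined keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the key path along
            flat.update(flatten(value, sep=sep, prefix=full_key))
        else:
            # Lists and scalar values are kept as they are
            flat[full_key] = value
    return flat


pet = {
    "name": "Ace",
    "tricks": ["spit"],
    "demographics": {"sex": "male", "age": 2},
}
print(flatten(pet))
```

Note that the list under `tricks` is left untouched, just as in the `animals_v1` output, where the `tricks` column still contains lists.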


### Read a json file as a dataframe directly

In [39]:
animals_v2 = pd.read_json("./data/animals.json")

In [40]:
# Display animals_v2
animals_v2

Unnamed: 0,name,animal,tricks,demographics,owner,city
0,Ace,Alpaca,[spit],"{'sex': 'male', 'age': 2}",Alice,Augsburg
1,Biscuit,Beagle,"[sit, hunt]","{'sex': 'female', 'age': 1}",Bob,Berlin
2,Coco,Chinchilla,"[explore, climb]","{'sex': 'male', 'age': 5}",Carol,Cottbus


### Self-work exercises #1: 4 points

#### What problems do you see with `clean_df`? Can you fix them to obtain a tidy DataFrame?

Write a series of code commands in order to clean the DataFrame:
- create a deep copy of the DataFrame `clean_df` and assign it to the variable `df`
- get the list of useful columns, without copy-pasting them (use code!)
- drop the columns that contain only missing values
- rename the columns so that each column corresponds to the correct header
- transform the `Year` column to remove `Details`, which we don't need

*Tip: this might be useful https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html*

In [43]:
    
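# Deep copy, so that changes to df do not affect clean_df
df = clean_df.copy(deep=True)
# Keep only the meaningful header names (skip the "Unnamed: ..." placeholders)
column_list = [c for c in df.columns if not c.startswith("Unnamed")]
# Drop the columns that contain only missing values
df = df.dropna(axis="columns", how="all")
# Assign the real header names to the remaining columns
df.columns = column_list
# Keep only the year: "1973 Details" -> "1973"
df["Year"] = df["Year"].str.split(" ").str[0]
print(df.head())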

    Year     Host     Champions    Runners-up     3rd place      4th place  \
14  1973  Uruguay  Soviet Union         Japan   South Korea           Peru   
15  1977    Japan         Japan          Cuba   South Korea          China   
16  1981    Japan         China         Japan  Soviet Union  United States   
17  1985    Japan         China          Cuba  Soviet Union          Japan   
18  1989    Japan          Cuba  Soviet Union         China          Japan   

    Teams  
14   10.0  
15    8.0  
16    8.0  
17    8.0  
18    8.0  


### Self-work exercises #2: 6 points

#### Use the python commands to answer the following questions

**1. What type is the resulting data (the loaded json: d) ?**

In [44]:
print(type(d))


<class 'list'>


**2. How many items does the resulting data have (the loaded json: d) ?**

In [45]:
print(len(d))

3


**3. What is the type of each item in the resulting data (the loaded json: d) ?**

In [46]:
print(type(d[1]))  




<class 'dict'>


**4. How many tricks does Coco know ? Use the code to output it**

In [47]:
#Coco the chinchilla 
coco_tricks = next(len(item["tricks"]) for item in d if item["name"] == "Coco")
print("Coco knows " + str(coco_tricks) + " amazing tricks!")


Coco knows 2 amazing tricks!


**5. How can you change the column names to be demographics_sex and demographics_age instead of demographics.sex and demographics.age?**

Save the result into the `animals_v3` variable

*Tip: research online the pandas.json_normalize method*

In [48]:
animals_v3 = pd.json_normalize(d, sep="_")
print(animals_v3)


      name      animal            tricks  owner      city demographics_sex  \
0      Ace      Alpaca            [spit]  Alice  Augsburg             male   
1  Biscuit      Beagle       [sit, hunt]    Bob    Berlin           female   
2     Coco  Chinchilla  [explore, climb]  Carol   Cottbus             male   

   demographics_age  
0                 2  
1                 1  
2                 5  


**6. Following the example of `student_database`, create the `teacher_database`**

Hints:
- the teacher is very wise, therefore they are at least 100 years old!
- Selma's city starts with S and is located in Germany
- The subject is the course you are taking

In [63]:
### Complete the following program:
from io import StringIO 
teacher = {
    "name": "Krisztián",
    "age": 25,
    "grade": 1.0,
    "subjects": "Data Scraping",  
    "city": "Berlin"
}

teacher_database = [teacher]
teacher_data = json.dumps(teacher_database, indent=4)

# Wrap the JSON string in a StringIO object to prevent the FutureWarning
teachers_df = pd.read_json(StringIO(teacher_data))

teachers_df

Unnamed: 0,name,age,grade,subjects,city
0,Krisztián,25,1,Data Scraping,Berlin
