# Semester 3 Coding Portfolio Topic 1 Formative Part 1/2:
# APIs and simple webscraping with requests and BeautifulSoup

This notebook covers the following topics:
 - Using APIs
 - Parsing JSON
 - Using the BeautifulSoup module

This notebook is expected to take around 5 hours to complete:
 - 2 hours for the formative part
 - 3 hours of self-study on the topics covered by this notebook

<b> This is a formative notebook</b><br>
Simply complete the given functions such that they pass the automated tests. This part is graded Pass/Fail; you must get 100% correct!
You can submit your notebook through Canvas as often as you like. Make sure to start doing so early to ensure that your code passes all tests!
You may ask for help from fellow students and TAs on this section, and solutions might be provided later on.

In [None]:
# TODO: Please enter your student number here
STUDENT_NUMBER = 15281914



This notebook provides a quick introduction to simple web scraping with Python, requests and BeautifulSoup.

We will start with a short introduction to using APIs, and then use requests and BeautifulSoup to fetch and parse a simple website.


# Interacting with APIs 

An API, or Application Programming Interface, allows different software applications to talk to each other, sharing data and functionalities easily. Developers use APIs to access features or data from other services, enabling more complex and feature-rich applications. Essentially, APIs serve as bridges between different software, making it possible for them to interact and share resources.



We're going to start with getting data from a simple API. It's easy!

## 1. Using a simple API:  How's the weather?

To fetch data from any API or website, we can use the requests package. It hides the complexity of making HTTP/1.1 requests behind a few simple functions, so you can send requests with methods such as GET, POST, and PUT in a single line of code.

In [1]:
!pip install requests



As an example, we will use OpenWeatherMap.

#### API Documentation: _Read The Fine Manual! (RTFM)_
Public APIs usually come with documentation that describes how to use the API and what data you can expect.

To find the OpenWeatherMap API, you can go to:
https://openweathermap.org/api


#### Getting the current weather
Here we use the current weather endpoint to get the current weather in Amsterdam.

In [2]:
import requests

api_key = "de26752686c975de6a1c38a998f50fec"
city_name = "Amsterdam"
base_url = "http://api.openweathermap.org/data/2.5/weather?"

# Complete URL for the API call
url = f"{base_url}q={city_name}&appid={api_key}"

response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    print("Here is the result from the API:")
    print(response.text)
    json_string = response.text
else:
    print("Error: Unable to get data from OpenWeatherMap API! :(")

Here is the result from the API:
{"coord":{"lon":4.8897,"lat":52.374},"weather":[{"id":804,"main":"Clouds","description":"overcast clouds","icon":"04d"}],"base":"stations","main":{"temp":280.67,"feels_like":276.68,"temp_min":280.29,"temp_max":281.48,"pressure":1014,"humidity":87,"sea_level":1014,"grnd_level":1014},"visibility":6000,"wind":{"speed":7.72,"deg":210},"clouds":{"all":100},"dt":1764252764,"sys":{"type":2,"id":2101578,"country":"NL","sunrise":1764228127,"sunset":1764257667},"timezone":3600,"id":2759794,"name":"Amsterdam","cod":200}
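The URL in the request above was assembled by hand with an f-string. That works for "Amsterdam", but a city name containing spaces or special characters needs to be URL-encoded first. The sketch below shows one way to build the query string with the standard library; the key `YOUR_API_KEY` and the city "New York" are placeholders chosen for illustration, not values from this notebook.

```python
from urllib.parse import urlencode

# Same endpoint as above; the key is a placeholder, not a real credential
base_url = "http://api.openweathermap.org/data/2.5/weather"
params = {"q": "New York", "appid": "YOUR_API_KEY"}

# urlencode() escapes each value and joins the pairs with "&"
url = f"{base_url}?{urlencode(params)}"
print(url)
```

`requests.get` also accepts a `params` argument (for example `requests.get(base_url, params=params)`) that performs the same encoding for you.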


#### Huh, what is this strange text?
As you can see, the result we get is in a particular text format. This format is called JSON (pronounced "Jason"), which is used by most APIs - both internal and public.

JSON (JavaScript Object Notation) is a data interchange format that is easy for humans to read and write and easy for machines to parse and generate. It is primarily used to transmit data between a server and a web application, serving as an alternative to XML, and is widely used for representing structured data and exchanging information in web development.


### Parsing JSON

Luckily, JSON is very easy to parse using Python. The built-in json library turns a JSON string into ordinary Python objects, such as dicts and lists.

In [3]:
import json

data = json.loads(json_string)

# Now data is a Python dictionary containing the data from the JSON string
main = data['main']
weather = data['weather']
print(f"{city_name:-^30}")
print(f"Temperature: {main['temp']}K")
print(f"Humidity: {main['humidity']}%")
print(f"Weather: {weather[0]['main']}")
print(f"Description: {weather[0]['description']}")

----------Amsterdam-----------
Temperature: 280.67K
Humidity: 87%
Weather: Clouds
Description: overcast clouds
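To make the mapping between JSON and Python types more explicit, here is a small sketch using a shortened copy of the response shown earlier. The `rain` key is only used to illustrate a key that is absent from this sample.

```python
import json

# A shortened copy of the response shown earlier in this notebook
json_string = (
    '{"weather":[{"id":804,"main":"Clouds","description":"overcast clouds"}],'
    '"main":{"temp":280.67,"humidity":87},"name":"Amsterdam"}'
)

data = json.loads(json_string)

# JSON objects become dicts, arrays become lists, numbers become int or float
print(type(data).__name__)
print(type(data["weather"]).__name__)
print(type(data["main"]["temp"]).__name__)

# .get() returns a default instead of raising KeyError when a key is absent
rain = data.get("rain", {})
print(rain)

# json.dumps() goes the other way: Python objects back to a JSON string
pretty = json.dumps(data["main"], indent=2)
print(pretty)
```

Note that `weather` is a list containing one dict, which is why the cell above indexes it with `[0]` before asking for `'main'` or `'description'`.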


#### Exercise 1: Your turn! Get the forecast!

Now your task is to get the "5 day / 3 hour forecast data" from the API, to figure out what the weather in Amsterdam will be in the coming days. Read the manual!

The goal is to print the date and temperature in the following format:
- On 2023-10-06 12:00:00 the temperature will be 15 C
- On 2023-10-06 15:00:00 the temperature will be  4 C

etc.

There are two extra challenges here.
First, the date and time are given as a Unix timestamp (the number of seconds since January 1, 1970, the Unix epoch), which you will need to convert to a readable date.

Second, you will need to convert the temperature from Kelvin to Celsius.

In [4]:
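# Some help: a function to convert a Unix timestamp to a date-time string (in UTC)
from datetime import datetime, timezone

def parse_timestamp(dt):
    # utcfromtimestamp() is deprecated since Python 3.12; use a timezone-aware call instead
    dt_object = datetime.fromtimestamp(dt, tz=timezone.utc)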
    formatted_date = dt_object.strftime('%Y-%m-%d %H:%M:%S')
    return formatted_date
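As a worked example of both conversions, the sketch below takes the `dt` and `temp` values from the current-weather response shown earlier and turns them into one sentence in the target format. It assumes the timestamp should be read as UTC and uses one decimal place for the temperature, which is a formatting choice rather than a requirement.

```python
from datetime import datetime, timezone

dt = 1764252764        # "dt" from the current-weather response above
temp_kelvin = 280.67   # "main" -> "temp" from the same response

# Interpret the timestamp as UTC and format it as a readable string
date_string = datetime.fromtimestamp(dt, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

# Kelvin - 273.15 = Celsius
temp_celsius = temp_kelvin - 273.15

sentence = f"On {date_string} the temperature will be {temp_celsius:.1f} C"
print(sentence)
```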

In [5]:
import requests

api_key = "de26752686c975de6a1c38a998f50fec"
city_name = "Amsterdam"
base_url = "http://api.openweathermap.org/data/2.5/forecast?"

# Complete URL for the API call
url = f"{base_url}q={city_name}&appid={api_key}"

response = requests.get(url)
    
# Check if the request was successful
if response.status_code == 200:
    j = response.json()
    # print(j)
    # As you can see the json contains a list of timestamps and temperatures
    # Loop over the list, and for each entry, parse the timestamp (using the method above)
    # and print the dates and the temperature (tip: Kelvin - 273.15 = Celsius)
    # Also, save any of the sentences to sample_text

    # Your solution here
    sample_text = None
    for forecast in j['list']:
        # Get timestamp and temperature
        dt = forecast['dt']
        temp_kelvin = forecast['main']['temp']
        
        # Parse timestamp to readable date
        date_string = parse_timestamp(dt)
        
        # Convert Kelvin to Celsius
        temp_celsius = temp_kelvin - 273.15
        
        # Build the sentence once, then print it
        sentence = f"On {date_string} the temperature will be {temp_celsius:.1f} C"
        print(sentence)
        
        # Save the first sentence to sample_text
        if sample_text is None:
            sample_text = sentence
else:
    print("Error: Unable to get data from OpenWeatherMap API! :(")
    print(response)

On 2025-11-27 15:00:00 the temperature will be 7.6 C
On 2025-11-27 18:00:00 the temperature will be 7.4 C
On 2025-11-27 21:00:00 the temperature will be 7.7 C
On 2025-11-28 00:00:00 the temperature will be 8.0 C
On 2025-11-28 03:00:00 the temperature will be 7.7 C
On 2025-11-28 06:00:00 the temperature will be 8.4 C
On 2025-11-28 09:00:00 the temperature will be 9.7 C
On 2025-11-28 12:00:00 the temperature will be 10.7 C
On 2025-11-28 15:00:00 the temperature will be 10.2 C
On 2025-11-28 18:00:00 the temperature will be 9.3 C
On 2025-11-28 21:00:00 the temperature will be 10.6 C
On 2025-11-29 00:00:00 the temperature will be 9.0 C
On 2025-11-29 03:00:00 the temperature will be 8.2 C
On 2025-11-29 06:00:00 the temperature will be 7.6 C
On 2025-11-29 09:00:00 the temperature will be 7.2 C
On 2025-11-29 12:00:00 the temperature will be 9.7 C
On 2025-11-29 15:00:00 the temperature will be 9.1 C
On 2025-11-29 18:00:00 the temperature will be 8.3 C
On 2025-11-29 21:00:00 the temperature will

Now you have a sense of how to get data from a simple API!

## 2. Simple webscraping

Let's start by using _requests_ on a normal website instead. It's quite similar! Here we use it to fetch the web page of the University of Amsterdam's Computational Social Science (CSS) programme.

In [7]:
import requests

url = "https://www.uva.nl/en/programmes/bachelors/computational-social-science/computational-social-science.html"

response = requests.get(url)

if response.status_code == 200:
    print("Here is the result:")
    print(f"{response.text[:300]} ...")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")


Here is the result:






<!doctype html>
<html class="no-js" lang="en">
<head>
    <meta charset="utf-8"/>

    <title>Bachelor's Computational Social Science - University of Amsterdam</title>
            <link rel="canonical" href="https://www.uva.nl/en/programmes/bachelors/computational-social-science/computational- ...


As you can see, the result is in HTML: the markup language that web pages are written in.

To get the data that we are interested in, we need to parse the HTML. This is core to all scraping.

This is where BeautifulSoup comes in!

BeautifulSoup is a Python library for parsing HTML.

Let's first install it and load it.


In [8]:
# Install the library if you do not already have it
!pip install beautifulsoup4



In [9]:
#Load the library
from bs4 import BeautifulSoup

### Parsing a simple example website with BeautifulSoup
As you may know, HTML is hierarchically structured. This structure is sometimes referred to as the HTML parse tree or the DOM tree. The DOM is a tree data structure that represents the hierarchical structure of an HTML document. Each node in the tree corresponds to an element (or "tag") in the HTML document, and the edges represent the nesting relationships between the elements. The root of the tree is typically the `<html>` tag. Its child nodes represent the head and body of the document, and those in turn have their own child nodes representing the elements nested within them.

For example, consider a simple HTML document:


In [10]:
simplehtml = '''<html>
    <head>
        <title>My Page</title>
    </head>
    <body>
        <h1 class='mainheader'>Welcome to My Page</h1>
        <p id='theparagraph'>This is a paragraph.</p>
        <p class='paraclass'>This is a second paragraph.</p>
        <div><p>This is a third paragraph, inside a div!</p></div>
    </body>
</html>'''
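To make the tree structure concrete, the sketch below prints each tag of a shortened version of this document, indented by its nesting depth. It uses Python's built-in `html.parser` module rather than BeautifulSoup, and the `TreePrinter` class is a hypothetical helper written only for this illustration. It assumes every tag is properly closed, as in this example.

```python
from html.parser import HTMLParser

# A shortened version of the document above
html_doc = """<html>
    <head><title>My Page</title></head>
    <body>
        <h1 class='mainheader'>Welcome to My Page</h1>
        <div><p>This is a third paragraph, inside a div!</p></div>
    </body>
</html>"""

class TreePrinter(HTMLParser):
    """Records each opening tag, indented by its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Record the tag at the current depth, then go one level deeper
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag brings us back up one level
        self.depth -= 1

printer = TreePrinter()
printer.feed(html_doc)
printer.close()
print("\n".join(printer.lines))
```

In the printed outline, `p` is indented under `div`, which is itself indented under `body`: this is the parent-child nesting that CSS selectors navigate.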

Let's try parsing elements of this page!

In [11]:
#This turns the website into a beautifulsoup object that we can then fetch elements from
soup = BeautifulSoup(simplehtml, 'html.parser')


There are many functions in BeautifulSoup, but we will focus on soup.select(), which uses a _CSS selector_ to select elements and data.

The selector is a string written in CSS selector syntax. The function returns a list of all matching elements, which is empty if nothing matches.

In [12]:
# This means "select all elements of type 'title'"
alltitles = soup.select('title')
# We then pick the first one; since we know there is only one
firsttitle = alltitles[0]
# And we can then select the text inside it, by getting the attribute text:
print(firsttitle.text)

My Page


In [13]:
# This means "select all elements of type 'p'"
paragraphs = soup.select('p')
# We then loop over the paragraphs and print each one
for paragraph in paragraphs:
    print(paragraph.text)

This is a paragraph.
This is a second paragraph.
This is a third paragraph, inside a div!


In [14]:
# This means "select all elements of type 'p' with class 'paraclass'"
# The dot signifies class names
# We can then select the first element, and the text content, in the same line
soup.select('p.paraclass')[0].text


'This is a second paragraph.'

In [15]:
# This means "select all elements of type 'p' with id 'theparagraph'"
# The # signifies id.
soup.select('p#theparagraph')[0].text


'This is a paragraph.'
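Two more selector patterns are often useful in practice. The sketch below, using a shortened copy of the example page, shows a descendant selector (a space between two selectors) and what happens when nothing matches. The class name `doesnotexist` is made up for the illustration.

```python
from bs4 import BeautifulSoup

# A shortened copy of the example page
html_doc = """<html><body>
    <h1 class='mainheader'>Welcome to My Page</h1>
    <p id='theparagraph'>This is a paragraph.</p>
    <div><p>This is a third paragraph, inside a div!</p></div>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# A space means "descendant": every <p> nested anywhere inside a <div>
nested = [p.text for p in soup.select("div p")]
print(nested)

# select() returns an empty list when nothing matches,
# so indexing it with [0] would raise an IndexError
missing = soup.select("p.doesnotexist")
print(len(missing))

# select_one() returns the first match, or None when there is none
header = soup.select_one("h1.mainheader")
print(header.text)
print(soup.select_one("p.doesnotexist"))
```

When scraping real pages, checking for an empty list or `None` before reading `.text` helps avoid crashes when a page's layout differs from what you expected.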

### Exercise 2: Parse a simple website

Your task is to parse the following simple website using BeautifulSoup and build a DataFrame that lists the products, with their name, description, and price in separate columns.


In [16]:
from IPython.display import display, HTML
website_html = '''<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Simple Website Example</title>
</head>
<body>
<h1>Welcome to Our Simple Website</h1>
<p>This is a demonstration of a simple HTML website designed for parsing practice.</p>
<h2>About Us</h2>
<p>We are a team dedicated to learning web scraping with BeautifulSoup.</p>
<h3>Contact Information</h3>
<p>Email us at: <a href="mailto:info@example.com">info@example.com</a></p>
<h2>Our Products</h2>
<table border="1">
    <tr>
        <th>Product Name</th>
        <th>Description</th>
        <th>Price</th>
    </tr>
    <tr>
        <td>Product 1</td>
        <td>An essential item for beginners.</td>
        <td>$19.99</td>
    </tr>
    <tr>
        <td>Product 2</td>
        <td>A must-have for advanced users.</td>
        <td>$29.99</td>
    </tr>
    <tr>
        <td>Product 3</td>
        <td>Now with bacon-flavor!</td>
        <td>$39.99</td>
    </tr>
</table>

</body>
</html>'''
print("This is how the website looks:")
display(HTML(website_html))

This is how the website looks:


| Product Name | Description | Price |
|---|---|---|
| Product 1 | An essential item for beginners. | $19.99 |
| Product 2 | A must-have for advanced users. | $29.99 |
| Product 3 | Now with bacon-flavor! | $39.99 |


In [17]:
# Parse the HTML to get the table and extract header row
soup = BeautifulSoup(website_html, 'html.parser')
table = soup.select('table')[0]
header_row = table.select('tr')[0]
header_columns = header_row.select('th')

# Print the header columns
header_columns

[<th>Product Name</th>, <th>Description</th>, <th>Price</th>]
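The header cells extracted above can also serve as column names, instead of typing them in by hand. The following is a minimal sketch of that idea on a two-row copy of the table. The `records` list of dicts is just one possible intermediate structure, not the only way to solve the exercise.

```python
from bs4 import BeautifulSoup

# A two-row copy of the product table
table_html = """<table>
    <tr><th>Product Name</th><th>Description</th><th>Price</th></tr>
    <tr><td>Product 1</td><td>An essential item for beginners.</td><td>$19.99</td></tr>
    <tr><td>Product 2</td><td>A must-have for advanced users.</td><td>$29.99</td></tr>
</table>"""

soup = BeautifulSoup(table_html, "html.parser")

# Column names come from the <th> cells of the header row
column_names = [th.text for th in soup.select("tr th")]

# Each remaining row becomes a dict keyed by those column names
records = []
for row in soup.select("tr")[1:]:
    cells = [td.text for td in row.select("td")]
    records.append(dict(zip(column_names, cells)))

print(records[0])
```

A list of dicts like this can be passed directly to `pd.DataFrame(records)`, which uses the dict keys as column names.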

In [18]:
# Import pandas
import pandas as pd 

# Parse the HTML
soup = BeautifulSoup(website_html, 'html.parser')

# Find the table containing products
product_table = soup.select('table')[0]

# Extract the rows in the table, skipping the header row
rows = product_table.select('tr')[1:]

# Extract the data for each row
products = []
for row in rows:
    # Select all tds in the row
    # and put the first one in variable called product_name,
    # second in description, and third in price
    
    # Your solution here
    tds = row.select('td')
    product_name = tds[0].text
    description = tds[1].text
    price = tds[2].text

    products.append([product_name, description, price])

# Convert the data into a DataFrame
df = pd.DataFrame(products, columns=['Product Name', 'Description', 'Price'])


# Dataframe that has the products
df

|   | Product Name | Description | Price |
|---|---|---|---|
| 0 | Product 1 | An essential item for beginners. | $19.99 |
| 1 | Product 2 | A must-have for advanced users. | $29.99 |
| 2 | Product 3 | Now with bacon-flavor! | $39.99 |
