> Part of a series on auto-updating websites using GitHub Actions and GitHub Pages

# Air Quality Updater: Complete dataset copier

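In this section, we are going to download AQI (air quality index) data from IQAir, the site behind this [ranking of AQI data for major cities](https://www.iqair.com/us/world-air-quality-ranking), and save it as a CSV file.

The page scraped in this notebook is the Chiang Mai city page: 'https://www.iqair.com/us/thailand/chiang-mai'.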

This approach is useful if you want to **directly copy a full dataset from the web** and use it to update a page or graphic. The alternative is to save historical data over time, which I'll cover in another video.


In [12]:
# Install necessary packages
# Note: Uncomment the following lines if running in an environment where these packages are not installed
# %pip install beautifulsoup4
# %pip install lxml

import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
import os


In [13]:
# Get the current date
current_date = datetime.now().strftime('%Y-%m-%d')
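
Because this series runs the code on GitHub Actions, keep in mind that `datetime.now()` returns the local time of whatever machine runs it, and GitHub-hosted runners typically use UTC. The sketch below stamps the date in Chiang Mai local time instead. It assumes a fixed UTC+7 offset (Thailand does not observe daylight saving time), and the helper name `local_date_string` is hypothetical, not part of the original notebook.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Chiang Mai local time is a fixed UTC+7 offset (no daylight saving).
BANGKOK_OFFSET = timezone(timedelta(hours=7))


def local_date_string(now_utc=None, tz=BANGKOK_OFFSET):
    """Return the YYYY-MM-DD date in the given timezone (hypothetical helper)."""
    if now_utc is None:
        now_utc = datetime.now(timezone.utc)
    return now_utc.astimezone(tz).strftime('%Y-%m-%d')


# Late evening in UTC is already the next calendar day in Chiang Mai.
sample = datetime(2024, 5, 30, 18, 30, tzinfo=timezone.utc)
print(local_date_string(sample))
```

If the runner's clock is already what you want, the original one-liner is fine as it is.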

In [14]:
# Fetch AQI data from the website
url = 'https://www.iqair.com/us/thailand/chiang-mai'

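try:
    # A timeout stops the request from hanging forever on a stalled connection
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Check for request errors
except requests.exceptions.RequestException as e:
    # Exit with a non-zero status so an automated run is marked as failed
    raise SystemExit(f"Error fetching data: {e}")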
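
The fetch step can also be wrapped in a small function so that the error handling lives in one place. This is a minimal sketch: the name `fetch_html` and the `User-Agent` value are assumptions for illustration (the original notebook sends the default `requests` headers), and whether IQAir treats a custom user agent differently is not something the notebook tells us.

```python
import requests


def fetch_html(url, timeout=30, user_agent="air-quality-updater/0.1"):
    """Return the page body as text, or None if the request fails (hypothetical helper)."""
    headers = {"User-Agent": user_agent}  # Assumed value, not from the original notebook
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error fetching data: {e}")
        return None
    return response.text


# A malformed URL fails before any network traffic, which makes the error path easy to check.
print(fetch_html("not-a-valid-url"))
```

Returning `None` leaves it to the caller to decide whether a failed fetch should stop the run.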

In [15]:
# Parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

In [16]:
# Extract tables
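# Wrap the HTML in StringIO: passing a literal HTML string to read_html is deprecated in recent pandas
from io import StringIO
tables = pd.read_html(StringIO(str(soup)))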


In [17]:
# Check if the desired table is in the response
if len(tables) > 3:
    df = tables[3]  # Assuming the required table is at index 3
else:
    # Exit with a non-zero status so an automated run is marked as failed
    raise SystemExit("Error: Expected table not found.")
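
Picking the table by position (index 3) breaks as soon as the page adds or removes a table. A sketch of one alternative is to select the table by the columns it should contain. The helper `find_table` is hypothetical, the column names are taken from the output shown later in this notebook, and the stand-in tables below hold made-up values rather than scraped data.

```python
import pandas as pd


def find_table(tables, required_columns):
    """Return the first DataFrame containing all required columns, else None."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    return None


# Stand-in tables; in the notebook these would come from pd.read_html.
tables = [
    pd.DataFrame({"Rank": [1], "City": ["Example"]}),
    pd.DataFrame({
        "Air pollution level": ["Moderate"],
        "Air quality index": ["66 US AQI"],
        "Main pollutant": ["PM2.5"],
    }),
]

df = find_table(tables, ["Air quality index", "Main pollutant"])
print(df)
```

If no table matches, `find_table` returns `None`, which can be handled the same way as the length check above.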

In [18]:
# Add the current date to the dataframe
df['date_pulled'] = current_date

In [19]:
# Clean the AQI column to retain only the number
if 'Air quality index' in df.columns:
    # Raw string avoids the invalid escape sequence warning for \d; expand=False returns a Series
    df['Air quality index'] = df['Air quality index'].str.extract(r'(\d+)', expand=False).astype(int)
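
One caveat with `.astype(int)`: if a cell contains no digits, `str.extract` yields a missing value and the conversion raises an error. The sketch below shows one way to tolerate that, using the pandas nullable `Int64` dtype. The helper name `clean_aqi` is hypothetical and the sample values are illustrative only, not real readings.

```python
import pandas as pd


def clean_aqi(series):
    """Extract the first run of digits from each cell as a nullable integer."""
    digits = series.astype(str).str.extract(r'(\d+)', expand=False)
    return pd.to_numeric(digits, errors='coerce').astype('Int64')


# Illustrative values only, not real readings.
raw = pd.Series(["66 US AQI", "152", "N/A"])
print(clean_aqi(raw))
```

Cells without a number come back as `<NA>` instead of stopping the run, so you can decide later whether to drop or flag them.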


In [20]:
# Reorder the columns to make 'date_pulled' the first column
first_column = df.pop('date_pulled')
df.insert(0, 'date_pulled', first_column)
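
The same reordering can be written by building the column list explicitly, which some readers find easier to follow than `pop` and `insert`. A small sketch, using a stand-in dataframe with the values from the output shown in this notebook:

```python
import pandas as pd

df = pd.DataFrame({
    "Air pollution level": ["Moderate"],
    "Air quality index": [66],
    "Main pollutant": ["PM2.5"],
    "date_pulled": ["2024-05-30"],
})

# Put 'date_pulled' first and keep the remaining columns in their existing order.
ordered = ["date_pulled"] + [col for col in df.columns if col != "date_pulled"]
df = df[ordered]
print(list(df.columns))
```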

In [21]:
# Display the dataframe with the new column order
print(df.head())


  date_pulled Air pollution level  Air quality index Main pollutant
0  2024-05-30            Moderate                 66          PM2.5


In [22]:
# Save the dataframe to a CSV file
output_filename = "air-quality.csv"
output_path = os.path.join(os.getcwd(), output_filename)
df.to_csv(output_path, index=False)
print(f"Data saved to {output_path}")

Data saved to /Users/visarutsankham/Documents/GitHub/Bad-Air_CNX/air-quality.csv
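
To confirm that the saved file is what a later step expects, you can read it straight back. The sketch below writes a stand-in dataframe (same columns and values as the output above) to a temporary directory, so it does not overwrite a real `air-quality.csv`, and then reloads it.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "date_pulled": ["2024-05-30"],
    "Air pollution level": ["Moderate"],
    "Air quality index": [66],
    "Main pollutant": ["PM2.5"],
})

# Write to a temporary directory so the sketch does not overwrite a real air-quality.csv.
with tempfile.TemporaryDirectory() as tmp_dir:
    output_path = os.path.join(tmp_dir, "air-quality.csv")
    df.to_csv(output_path, index=False)
    round_trip = pd.read_csv(output_path)

print(round_trip)
```

Note that `index=False` keeps the row index out of the file, so the reloaded columns match the original ones exactly.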
