# Extracting Data from Websites

## What is Web Scraping?

- Web scraping is an **automatic** way to retrieve data from a website and store it in a structured format.
- For example, our intern worked on a project that used national-level carbon emissions data.
    - **Issue**: the emissions database ([Emissions Database for Global Atmospheric Research](https://edgar.jrc.ec.europa.eu/country_profile/AFG)) does not provide a single data set covering all countries. Instead, users have to download the data set for each country one at a time.
    - **Solution**: our intern built a web scraper to automate the downloading process.
- There are four main steps:
    1. Inspect the website HTML
    2. Access the URL of the website using code
    3. Parse the HTML contents and extract useful information
    4. Format the extracted data and save it

In [1]:
import bs4
import urllib3
import certifi
import numpy as np
import pandas as pd

We will demonstrate these four steps by scraping the [Aviation Weather Center](https://www.aviationweather.gov/metar?gis=off).

Airports around the country publish hourly weather observations, called METARs. 
- **What we want:** the weather report that is formatted as plain text in a highly abbreviated syntax. 
- **Problem:** The web page places the report as a single line of text amidst other elements. 
- **Goal:** Extract just the weather observation from this page.

## Step 1: Inspect the website HTML


### What is HTML?

- HTML stands for Hyper Text Markup Language
- HTML describes the **structure** of a Web page
- HTML consists of a series of **elements**

Below is a visualization of an HTML page structure:

```
<html>
    <head>
        <title>Page title</title>
    </head>
    <body>
        <h1>This is a heading </h1>
        <p>This is a paragraph.</p>
        <p>This is another paragraph.</p>
        <a href="https://bfi.uchicago.edu/">This is a link</a>
        <table>
            <tr>
                <th>Column 1</th>
                <th>Column 2</th>
            </tr>
            <tr>
                <td>Value 1</td>
                <td>Value 2</td>
            </tr>
        </table> 
    </body>
</html> 
```
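
To see this nesting from Python, here is a minimal sketch that prints each element indented by its depth. It uses the standard library's `html.parser` module rather than the libraries used in the rest of this tutorial, and both the `OutlineBuilder` class and the shortened page are made up for illustration.

```python
from html.parser import HTMLParser


class OutlineBuilder(HTMLParser):
    """Record each opening tag, indented by how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # Assumes every opened tag is explicitly closed, as in the page below.
        self.depth -= 1


page = (
    "<html><head><title>Page title</title></head>"
    "<body><h1>This is a heading</h1><p>This is a paragraph.</p>"
    '<a href="https://bfi.uchicago.edu/">This is a link</a></body></html>'
)

builder = OutlineBuilder()
builder.feed(page)
print("\n".join(builder.lines))
```

The printed outline shows `head` and `body` nested inside `html`, and the heading, paragraph, and link nested inside `body`.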

It can be very useful to learn more about HTML. However, in most cases we do not need to understand every element in the HTML file. We can use the **Inspect** feature of Google Chrome (right-click an element on the page and choose Inspect) to locate the element that we want.

## Step 2: Access the URL of the website using code

We need an **HTTP client**, such as `urllib3`, to access the website in Python. We will create a PoolManager **object** to get the HTML of the webpage:

In [2]:
pm = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
myurl = "https://www.aviationweather.gov/metar/data?ids=KORD&format=raw&date=0&hours=0"
html = pm.urlopen(url=myurl, method="GET").data
# html
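
The URL above carries its parameters in the query string, and a request can fail without raising an error in Python. The sketch below shows one way to build the query string from a dictionary and to check the HTTP status code before using the response. `build_metar_url` and `fetch_html` are hypothetical helper names, and `fetch_html` assumes it is given an object with a `request` method, such as the `PoolManager` created above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.aviationweather.gov/metar/data"


def build_metar_url(airport_id):
    """Build the request URL for one airport from its query parameters."""
    params = {"ids": airport_id, "format": "raw", "date": 0, "hours": 0}
    return BASE_URL + "?" + urlencode(params)


def fetch_html(pool, url):
    """Return the response body as bytes, or raise if the status is not 200."""
    response = pool.request("GET", url)
    if response.status != 200:
        raise RuntimeError(f"Request to {url} failed with status {response.status}")
    return response.data


print(build_metar_url("KORD"))
```

With the pool from the cell above, the call would look like `html = fetch_html(pm, build_metar_url("KORD"))`.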

## Step 3: Parse the HTML contents and extract useful information

We will use the `BeautifulSoup` class from the `beautifulsoup4` package to parse the HTML file.

In [3]:
soup = bs4.BeautifulSoup(html, features='lxml')
# soup

The `BeautifulSoup` class has many attributes and methods that help us navigate the HTML.

If you want to get the title of the webpage, you can access the `title` attribute of the `BeautifulSoup` object:

In [4]:
print(soup.title)
print(type(soup.title))

<title>AWC - METeorological Aerodrome Reports (METARs)</title>
<class 'bs4.element.Tag'>


Note that the type of `soup.title` is `bs4.element.Tag`, not _string_. We can extract the text by accessing the `text` attribute of a `bs4.element.Tag` object:

In [5]:
print(soup.title.text)
print(type(soup.title.text))

AWC - METeorological Aerodrome Reports (METARs)
<class 'str'>


Recall that HTML usually has a hierarchical structure. To access the _parent_ of an element, we can use the `parent` attribute:

In [6]:
print(soup.title.parent)
print(type(soup.title.parent))

<head>
<!--[if lt IE 9]>
<meta http-equiv="X-UA-Compatible" content="IE=8" />
<![endif]-->
<title>AWC - METeorological Aerodrome Reports (METARs)</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="900" http-equiv="Refresh"/>
<meta content="en-us" name="DC.language" scheme="DCTERMS.RFC1766"/>
<meta content="Aviation Weather Center Homepage provides comprehensive user-friendly aviation weather Text products and graphics." name="description"/>
<meta content="aviation, weather, icing, turbulence, convection, pirep, metar, taf, airmet, sigmet, satellite, radar, surface, wind, temperature, aloft, airplane, NEXRAD, GOES, WSR-88D, precipitation, rain, snow, sleet, thunderstorm, en-route, prognosis, chart" name="keywords"/>
<meta content="AWC - Aviation Weather Center" name="DC.title"/>
<meta content="Aviation Weather Center Home Page ... METARs Page" name="DC.description"/>
<meta content="NOAA's National Weather Service - Aviation Weather Center Homepag

We can extract the first _paragraph element_, which is represented by `<p>... </p>`, by accessing the attribute `p`:

In [7]:
soup.p

<p clear="both">
<strong>Data at: 1452 UTC 19 Aug 2022</strong></p>
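
The attributes used so far can be tried without a network connection. The following is a small sketch on a made-up page (the HTML string is not taken from the Aviation Weather Center); it uses Python's built-in `html.parser` instead of `lxml` so that no extra parser needs to be installed.

```python
import bs4

# A tiny, made-up page so the example runs offline.
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <p class="intro">First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

soup = bs4.BeautifulSoup(sample_html, features="html.parser")

title_tag = soup.title                 # a Tag object
title_text = soup.title.text           # a plain string
parent_name = soup.title.parent.name   # name of the enclosing element
first_paragraph = soup.p.text          # soup.p returns only the first <p>

print(type(title_tag).__name__, title_text, parent_name, first_paragraph, sep=" | ")
```

Note that `soup.p` silently ignores the second paragraph, which is why a method that returns every match is needed when a page has several elements of the same kind.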

We can use the `find_all()` method to find all paragraph elements. Let's first look at its documentation:

In [8]:
help(soup.find_all)

Help on method find_all in module bs4.element:

find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs) method of bs4.BeautifulSoup instance
    Look in the children of this PageElement and find all
    PageElements that match the given criteria.
    
    All find_* methods take a common set of arguments. See the online
    documentation for detailed explanations.
    
    :param name: A filter on tag name.
    :param attrs: A dictionary of filters on attribute values.
    :param recursive: If this is True, find_all() will perform a
        recursive search of this PageElement's children. Otherwise,
        only the direct children will be considered.
    :param limit: Stop looking after finding this many results.
    :kwargs: A dictionary of filters on attribute values.
    :return: A ResultSet of PageElements.
    :rtype: bs4.element.ResultSet



In [9]:
soup.find_all('p')

[<p clear="both">
 <strong>Data at: 1452 UTC 19 Aug 2022</strong></p>,
 <p style="text-align: center; font-size: 10px; color: #1150a0">
     Page loaded: 
   <a href="https://www.time.gov">14:52 UTC</a>  |  
   07:52 AM  Pacific  |  
   08:52 AM  Mountain  |  
   09:52 AM  Central  |  
   10:52 AM  Eastern
   </p>]

We can find hyperlinks, which are represented by `<a>...</a>`, in the same way (only the first ten are shown here):

In [10]:
soup.find_all('a')[:10]

[<a href="https://www.noaa.gov" title="Visit the NOAA website"><img alt="NOAA Logo" border="0" src="/images/layout/noaa_logo.png"/></a>,
 <a href="https://www.weather.gov" title="Visit the NWS website"><img alt="NWS Logo" border="0" src="/images/layout/nws_logo.png"/></a>,
 <a href="/">AVIATION WEATHER CENTER</a>,
 <a href="https://www.noaa.gov">N O A A</a>,
 <a href="https://www.weather.gov">N A T I O N A L   W E A T H E R   S E R V I C E</a>,
 <a href="https://www.commerce.gov">
 <div class="awc_headerright" title="Visit Department of Commerce website">
 </div> <!-- /awc_headerright -->
 </a>,
 <a href="/">AWC Home</a>,
 <a href="https://www.ncep.noaa.gov/">NCEP Home</a>,
 <a href="/">Aviation (AWC)</a>,
 <a href="https://www.cpc.ncep.noaa.gov">Climate (CPC)</a>]
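
Often we want the link targets rather than the tags themselves. As a sketch, the snippet below reads the `href` attribute of each `<a>` tag and turns relative links such as `/` into absolute URLs with `urllib.parse.urljoin`. The three-link HTML string and the choice of `base_url` are assumptions made for this illustration.

```python
from urllib.parse import urljoin

import bs4

base_url = "https://www.aviationweather.gov/metar"
sample_html = """
<a href="https://www.noaa.gov">N O A A</a>
<a href="/">AWC Home</a>
<a name="top">An anchor without a link target</a>
"""

soup = bs4.BeautifulSoup(sample_html, features="html.parser")

links = []
for a_tag in soup.find_all("a"):
    href = a_tag.get("href")   # None when the tag has no href attribute
    if href is None:
        continue
    links.append(urljoin(base_url, href))

print(links)
```

Using `a_tag.get("href")` instead of `a_tag["href"]` avoids a `KeyError` for tags that have no `href` attribute.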

We can filter tags by specifying attributes of the tag:

In [11]:
soup.find_all("p", {"clear": "both"})

[<p clear="both">
 <strong>Data at: 1452 UTC 19 Aug 2022</strong></p>]
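
Attribute filters can be written in more than one way. The sketch below, on a made-up HTML string, compares the dictionary form used above with the keyword form, and also shows the `class_` and `limit` arguments.

```python
import bs4

sample_html = """
<p clear="both"><strong>Data at: 1452 UTC</strong></p>
<p style="text-align: center">Page loaded</p>
<p class="note" clear="both">Another paragraph</p>
"""

soup = bs4.BeautifulSoup(sample_html, features="html.parser")

by_dict = soup.find_all("p", {"clear": "both"})   # attributes as a dictionary
by_keyword = soup.find_all("p", clear="both")     # same filter as a keyword argument
by_class = soup.find_all("p", class_="note")      # class_ because class is a Python keyword
only_first = soup.find_all("p", limit=1)          # stop after the first match

print(len(by_dict), len(by_keyword), len(by_class), len(only_first))
```

The keyword form is convenient for simple attribute names; the dictionary form also works for names that are not valid Python identifiers.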

By examining the webpage, we already know that the weather report is stored in a `code` tag:

In [12]:
soup.find_all("code")

[<code>KORD 191351Z 21011KT 10SM FEW150 FEW250 24/15 A2997 RMK AO2 SLP144 T02440150</code>]

Note that the `find_all` method will output a _list of Tag objects_, even if there is only one tag. We need to access that Tag object, and then access the `text` attribute of the `Tag` object:

In [13]:
code_tags = soup.find_all("code")
weather_code = code_tags[0]
weather_code.text

'KORD 191351Z 21011KT 10SM FEW150 FEW250 24/15 A2997 RMK AO2 SLP144 T02440150'
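
Indexing with `[0]` raises an `IndexError` when the page contains no matching tag, for example when an airport ID is not recognized. One possible alternative is the `find()` method, which returns the first match or `None`. `first_code_text` is a hypothetical helper written for this sketch.

```python
import bs4


def first_code_text(html):
    """Return the text of the first <code> tag, or None if there is none."""
    soup = bs4.BeautifulSoup(html, features="html.parser")
    code_tag = soup.find("code")   # first match, or None when nothing matches
    if code_tag is None:
        return None
    return code_tag.text


print(first_code_text("<div><code>KORD 191351Z 21011KT</code></div>"))
print(first_code_text("<div>No data available</div>"))
```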

Now, let's put the code we wrote into a function:

In [14]:
def get_current_weather(airport_id):
    '''
    Get current weather at a specific airport.
    
    Inputs: 
        airport_id (str): airport id
    
    Outputs:
        data (str): weather info from the website
    '''
    
    # Step 2: Access URL of the website using code 
    myurl = "https://www.aviationweather.gov/metar/data?ids=" + \
        airport_id + "&format=raw&date=0&hours=0"
    html = pm.urlopen(url=myurl, method="GET").data
    
    # Step 3: Parse the HTML contents and extract useful information
    soup = bs4.BeautifulSoup(html, features='lxml')
    code_tags = soup.find_all("code")
    weather_code = code_tags[0]
    data = weather_code.text[5:]
    
    return data

In [15]:
get_current_weather("KORD")

'191351Z 21011KT 10SM FEW150 FEW250 24/15 A2997 RMK AO2 SLP144 T02440150'
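
The slice `[5:]` in the function relies on the airport ID being exactly four characters followed by a space. A sketch of a variant that does not depend on the ID length splits the report at the first space with `str.partition`. `split_metar` is a hypothetical helper name.

```python
def split_metar(report):
    """Split a raw METAR line into (station ID, rest of the observation)."""
    station, _, observation = report.strip().partition(" ")
    return station, observation


report = "KORD 191351Z 21011KT 10SM FEW150 FEW250 24/15 A2997 RMK AO2 SLP144 T02440150"
station, observation = split_metar(report)
print(station)
print(observation)
```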

## Step 4: Format the extracted data and save it

We can now get the weather info from multiple airports very easily:

In [16]:
pm = urllib3.PoolManager(cert_reqs='CERT_REQUIRED',
                         ca_certs=certifi.where())

airports = ["KORD", "KMDW", "KSFO", "KLAX", "KATL"]
weather = list(map(get_current_weather, airports))
data_dict = {
    "airports": airports,
    "weather": weather
} 
df = pd.DataFrame(data_dict)
df

| | airports | weather |
|---|---|---|
| 0 | KORD | 191351Z 21011KT 10SM FEW150 FEW250 24/15 A2997... |
| 1 | KMDW | 191353Z 23009KT 10SM FEW060 FEW250 26/16 A2998... |
| 2 | KSFO | 191356Z 27008KT 10SM FEW004 BKN008 14/12 A2990... |
| 3 | KLAX | 191353Z 24004KT 8SM FEW004 FEW150 18/16 A2984 ... |
| 4 | KATL | 191352Z 11006KT 2 1/2SM BR OVC005 23/21 A3004 ... |
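
Step 4 also includes saving the data. One common option is to write the `DataFrame` to a CSV file with `to_csv` and read it back with `read_csv`. In the sketch below, the two rows are shortened sample values, and the file name `metar.csv` and the temporary directory are arbitrary choices for illustration.

```python
import os
import tempfile

import pandas as pd

# Shortened sample values standing in for the scraped reports.
df = pd.DataFrame({
    "airports": ["KORD", "KMDW"],
    "weather": ["191351Z 21011KT 10SM FEW150", "191353Z 23009KT 10SM FEW060"],
})

# Write to a temporary directory so the example does not clutter the working folder.
out_path = os.path.join(tempfile.mkdtemp(), "metar.csv")
df.to_csv(out_path, index=False)   # index=False leaves out the 0, 1, ... row labels

df_loaded = pd.read_csv(out_path)
print(df_loaded)
```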


## Another Example: Climate Data from Wikipedia

Wikipedia contains climate data for most cities, formatted as a table. For example, see https://en.wikipedia.org/wiki/Chicago#Climate.

Our goal is to extract this table from Wikipedia.

### Step 1: Inspect the website HTML

Recall that a table in HTML is usually formatted as:

```
<table>
    <tr>
        <th>Column 1</th>
        <th>Column 2</th>
    </tr>
    <tr>
        <td>Value 1</td>
        <td>Value 2</td>
    </tr>
</table>         
```

### Step 2: Access the URL of the website using code

In [17]:
myurl = "https://en.wikipedia.org/wiki/chicago"
html = pm.urlopen(url=myurl, method="GET").data

### Step 3: Parse the HTML contents and extract useful information

Based on our inspection of the HTML, we can locate the table by finding the table title, which is a `th` tag with `colspan="14"`:

In [18]:
soup = bs4.BeautifulSoup(html, features='lxml')
tag_list = soup.find_all("th", colspan="14")
tag_list

[<th colspan="14">Climate data for Chicago (Midway Airport), 1991–2020 normals,<sup class="reference" id="cite_ref-Strange_field_expl_147-0"><a href="#cite_note-Strange_field_expl-147">[a]</a></sup> extremes 1928–present
 </th>,
 <th colspan="14">Climate data for Chicago (O'Hare Int'l Airport), 1991–2020 normals,<sup class="reference" id="cite_ref-Strange_field_expl_147-1"><a href="#cite_note-Strange_field_expl-147">[a]</a></sup> extremes 1871–present<sup class="reference" id="cite_ref-153"><a href="#cite_note-153">[b]</a></sup>
 </th>,
 <th colspan="14">Sunshine data for Chicago
 </th>,
 <th colspan="14" style="background:#f8f9fa;font-weight:normal;font-size:95%;">Source: Weather Atlas<sup class="reference" id="cite_ref-Weather_Atlas_156-0"><a href="#cite_note-Weather_Atlas-156">[154]</a></sup>
 </th>]

In [19]:
tag_list[0].text

'Climate data for Chicago (Midway Airport), 1991–2020 normals,[a] extremes 1928–present\n'

We will select the first tag whose first word is "Climate":

In [20]:
# Take the first tag whose text starts with "Climate"
tag = next(t for t in tag_list if t.text.startswith("Climate"))
tag.text

'Climate data for Chicago (Midway Airport), 1991–2020 normals,[a] extremes 1928–present\n'

Note that the `th` tag sits inside a row (`tr`), so going up two levels with `.parent.parent` takes us to the tag that wraps every row of the table:

```
<table>
    <tbody>
        <tr>
            <th colspan="14">
```

In [21]:
table_tag = tag.parent.parent
# table_tag
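
Counting `.parent` steps depends on exactly how the page nests its tags. As a sketch of an alternative, `find_parent("table")` walks upward until it reaches the nearest enclosing `<table>`, however many levels away it is. The small table below is made up for illustration.

```python
import bs4

sample_html = """
<table>
  <tbody>
    <tr><th colspan="14">Climate data for Example City</th></tr>
    <tr><th scope="row">Record high</th><td>67</td></tr>
  </tbody>
</table>
"""

soup = bs4.BeautifulSoup(sample_html, features="html.parser")
tag = soup.find("th", colspan="14")

two_levels_up = tag.parent.parent      # result depends on the nesting
table_tag = tag.find_parent("table")   # nearest enclosing <table>

print(two_levels_up.name, table_tag.name)
print(len(table_tag.find_all("td")))
```

In this made-up snippet the tag two levels up is `tbody`, while `find_parent` returns the `table` itself; both contain all the cells.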

Let's first extract all the row headers:

In [22]:
table_tag.find_all("th", scope="row")

[<th scope="row">Month
 </th>,
 <th scope="row" style="height: 16px;">Record high °F (°C)
 </th>,
 <th scope="row" style="height: 16px;">Mean maximum °F (°C)
 </th>,
 <th scope="row" style="height: 16px;">Average high °F (°C)
 </th>,
 <th scope="row" style="height: 16px;">Daily mean °F (°C)
 </th>,
 <th scope="row" style="height: 16px;">Average low °F (°C)
 </th>,
 <th scope="row" style="height: 16px;">Mean minimum °F (°C)
 </th>,
 <th scope="row" style="height: 16px;">Record low °F (°C)
 </th>,
 <th scope="row" style="height: 16px;">Average <a href="/wiki/Precipitation" title="Precipitation">precipitation</a> inches (mm)
 </th>,
 <th scope="row" style="height: 16px;">Average snowfall inches (cm)
 </th>,
 <th scope="row" style="height: 16px;">Average precipitation days <span class="nowrap" style="font-size:90%;">(≥ 0.01 in)</span>
 </th>,
 <th scope="row" style="height: 16px;">Average snowy days <span class="nowrap" style="font-size:90%;">(≥ 0.1 in)</span>
 </th>,
 <th scope="row" styl

Get the text of each tag and store it in a list:

In [23]:
row_names = [row.text for row in table_tag.find_all("th", scope="row")]
row_names

['Month\n',
 'Record high °F (°C)\n',
 'Mean maximum °F (°C)\n',
 'Average high °F (°C)\n',
 'Daily mean °F (°C)\n',
 'Average low °F (°C)\n',
 'Mean minimum °F (°C)\n',
 'Record low °F (°C)\n',
 'Average precipitation inches (mm)\n',
 'Average snowfall inches (cm)\n',
 'Average precipitation days (≥ 0.01 in)\n',
 'Average snowy days (≥ 0.1 in)\n',
 'Average ultraviolet index\n']

Each name ends with a newline character, and the first entry (`Month`) labels the header row rather than a data row, so we drop both:

In [24]:
row_names = [row.text[:-1] for row in table_tag.find_all("th", scope="row")][1:]
row_names

['Record high °F (°C)',
 'Mean maximum °F (°C)',
 'Average high °F (°C)',
 'Daily mean °F (°C)',
 'Average low °F (°C)',
 'Mean minimum °F (°C)',
 'Record low °F (°C)',
 'Average precipitation inches (mm)',
 'Average snowfall inches (cm)',
 'Average precipitation days (≥ 0.01 in)',
 'Average snowy days (≥ 0.1 in)',
 'Average ultraviolet index']
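
The slice `[:-1]` removes the last character whatever it is, so it would cut off a real letter if a header did not end with a newline. A slightly more defensive sketch uses `str.strip()`, which removes only surrounding whitespace. The third entry below (`'Year'` without a newline) is a hypothetical case added to show the difference.

```python
raw_names = ["Month\n", "Record high °F (°C)\n", "Year"]

sliced = [name[:-1] for name in raw_names]      # drops the last character unconditionally
stripped = [name.strip() for name in raw_names] # drops only leading/trailing whitespace

print(sliced)
print(stripped)
```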

Then, do the same thing for column headers:

In [25]:
table_tag.find_all("th", scope="col")

[<th scope="col">Jan
 </th>,
 <th scope="col">Feb
 </th>,
 <th scope="col">Mar
 </th>,
 <th scope="col">Apr
 </th>,
 <th scope="col">May
 </th>,
 <th scope="col">Jun
 </th>,
 <th scope="col">Jul
 </th>,
 <th scope="col">Aug
 </th>,
 <th scope="col">Sep
 </th>,
 <th scope="col">Oct
 </th>,
 <th scope="col">Nov
 </th>,
 <th scope="col">Dec
 </th>,
 <th scope="col" style="border-left-width:medium">Year
 </th>]

In [26]:
col_names = [col.text[:-1] for col in table_tag.find_all("th", scope="col")]
col_names

['Jan',
 'Feb',
 'Mar',
 'Apr',
 'May',
 'Jun',
 'Jul',
 'Aug',
 'Sep',
 'Oct',
 'Nov',
 'Dec',
 'Year']

Finally, let's extract values:

In [27]:
table_tag.find_all("td")[:10]

[<td style="background: #FF9B37; color:#000000;">67<br/>(19)
 </td>,
 <td style="background: #FF7800; color:#000000;">75<br/>(24)
 </td>,
 <td style="background: #FF4F00; color:#000000;">86<br/>(30)
 </td>,
 <td style="background: #FF3A00; color:#000000;">92<br/>(33)
 </td>,
 <td style="background: #FF1100; color:#FFFFFF;">102<br/>(39)
 </td>,
 <td style="background: #F80000; color:#FFFFFF;">107<br/>(42)
 </td>,
 <td style="background: #EA0000; color:#FFFFFF;">109<br/>(43)
 </td>,
 <td style="background: #FF0A00; color:#FFFFFF;">104<br/>(40)
 </td>,
 <td style="background: #FF1100; color:#FFFFFF;">102<br/>(39)
 </td>,
 <td style="background: #FF3300; color:#000000;">94<br/>(34)
 </td>]

In [28]:
value_tags = table_tag.find_all("td")[:len(row_names)*len(col_names)]
value_tags[:10]

[<td style="background: #FF9B37; color:#000000;">67<br/>(19)
 </td>,
 <td style="background: #FF7800; color:#000000;">75<br/>(24)
 </td>,
 <td style="background: #FF4F00; color:#000000;">86<br/>(30)
 </td>,
 <td style="background: #FF3A00; color:#000000;">92<br/>(33)
 </td>,
 <td style="background: #FF1100; color:#FFFFFF;">102<br/>(39)
 </td>,
 <td style="background: #F80000; color:#FFFFFF;">107<br/>(42)
 </td>,
 <td style="background: #EA0000; color:#FFFFFF;">109<br/>(43)
 </td>,
 <td style="background: #FF0A00; color:#FFFFFF;">104<br/>(40)
 </td>,
 <td style="background: #FF1100; color:#FFFFFF;">102<br/>(39)
 </td>,
 <td style="background: #FF3300; color:#000000;">94<br/>(34)
 </td>]

In [29]:
data = [value.text[:-1] for value in value_tags]
data[:5]

['67(19)', '75(24)', '86(30)', '92(33)', '102(39)']

Let's keep only the first number in each cell (for this table, the value in imperial units) and then convert these strings to numbers:

In [30]:
# for i, text in enumerate(data):
#     text = text.split("(")[0]
#     data[i] = float(text)
    
# ValueError: could not convert string to float: '−3'
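
The error above comes from the character in front of the 3: it is the Unicode minus sign (U+2212), which looks like the keyboard hyphen (U+002D) but is not accepted by `float()`. The sketch below uses the standard `unicodedata` module to make the difference visible.

```python
import unicodedata

text = "\u22123"   # the string '−3' as it appears in the table

print(unicodedata.name(text[0]))   # name of the first character
print(unicodedata.name("-"))       # name of the ordinary keyboard minus

try:
    float(text)
    conversion_failed = False
except ValueError:
    conversion_failed = True

# Replacing the Unicode minus with the ASCII hyphen makes the string parseable.
value = float(text.replace("\u2212", "-"))
print(conversion_failed, value)
```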

In [31]:
data = [value.text[:-1] for value in value_tags]
for i, text in enumerate(data):
    text = text.split("(")[0]
    text = text.replace("−", "-")
    data[i] = float(text)
    
print(data)

[67.0, 75.0, 86.0, 92.0, 102.0, 107.0, 109.0, 104.0, 102.0, 94.0, 81.0, 72.0, 109.0, 53.4, 57.9, 72.0, 81.5, 89.2, 93.9, 96.0, 94.2, 90.8, 82.8, 68.0, 57.5, 97.1, 32.8, 36.8, 47.9, 60.0, 71.5, 81.2, 85.2, 83.1, 76.5, 63.7, 49.6, 37.7, 60.5, 26.2, 29.9, 39.9, 50.9, 61.9, 71.9, 76.7, 75.0, 67.8, 55.3, 42.4, 31.5, 52.4, 19.5, 22.9, 32.0, 41.7, 52.4, 62.7, 68.1, 66.9, 59.2, 46.8, 35.2, 25.3, 44.4, -3.0, 3.4, 14.1, 28.2, 39.1, 49.3, 58.6, 57.6, 45.0, 31.8, 19.7, 5.3, -6.5, -25.0, -20.0, -7.0, 10.0, 28.0, 35.0, 46.0, 43.0, 29.0, 20.0, -3.0, -20.0, -25.0, 2.3, 2.12, 2.66, 4.15, 4.75, 4.53, 4.02, 4.1, 3.33, 3.86, 2.73, 2.33, 40.88, 12.5, 10.1, 5.7, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 1.5, 7.9, 38.8, 11.5, 9.4, 11.1, 12.0, 12.4, 11.1, 10.0, 9.3, 8.4, 10.8, 10.2, 10.8, 127.0, 8.9, 6.4, 3.9, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 1.6, 6.3, 28.2, 1.0, 2.0, 4.0, 6.0, 7.0, 9.0, 9.0, 8.0, 6.0, 4.0, 2.0, 1.0, 5.0]


### Step 4: Format the extracted data and save it

In [32]:
data = np.array(data).reshape(len(row_names), len(col_names))
df = pd.DataFrame(data, columns=col_names, index=row_names)
df

| | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Record high °F (°C) | 67.0 | 75.0 | 86.0 | 92.0 | 102.0 | 107.0 | 109.0 | 104.0 | 102.0 | 94.0 | 81.0 | 72.0 | 109.0 |
| Mean maximum °F (°C) | 53.4 | 57.9 | 72.0 | 81.5 | 89.2 | 93.9 | 96.0 | 94.2 | 90.8 | 82.8 | 68.0 | 57.5 | 97.1 |
| Average high °F (°C) | 32.8 | 36.8 | 47.9 | 60.0 | 71.5 | 81.2 | 85.2 | 83.1 | 76.5 | 63.7 | 49.6 | 37.7 | 60.5 |
| Daily mean °F (°C) | 26.2 | 29.9 | 39.9 | 50.9 | 61.9 | 71.9 | 76.7 | 75.0 | 67.8 | 55.3 | 42.4 | 31.5 | 52.4 |
| Average low °F (°C) | 19.5 | 22.9 | 32.0 | 41.7 | 52.4 | 62.7 | 68.1 | 66.9 | 59.2 | 46.8 | 35.2 | 25.3 | 44.4 |
| Mean minimum °F (°C) | -3.0 | 3.4 | 14.1 | 28.2 | 39.1 | 49.3 | 58.6 | 57.6 | 45.0 | 31.8 | 19.7 | 5.3 | -6.5 |
| Record low °F (°C) | -25.0 | -20.0 | -7.0 | 10.0 | 28.0 | 35.0 | 46.0 | 43.0 | 29.0 | 20.0 | -3.0 | -20.0 | -25.0 |
| Average precipitation inches (mm) | 2.3 | 2.12 | 2.66 | 4.15 | 4.75 | 4.53 | 4.02 | 4.1 | 3.33 | 3.86 | 2.73 | 2.33 | 40.88 |
| Average snowfall inches (cm) | 12.5 | 10.1 | 5.7 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 1.5 | 7.9 | 38.8 |
| Average precipitation days (≥ 0.01 in) | 11.5 | 9.4 | 11.1 | 12.0 | 12.4 | 11.1 | 10.0 | 9.3 | 8.4 | 10.8 | 10.2 | 10.8 | 127.0 |
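
The reshape step works because the values were collected row by row, and `reshape` fills the new array in the same row-by-row order. A small sketch with two rows and three columns (values taken from the first three months of the table above) makes this visible; the explicit length check is an extra safeguard added here, not part of the original code.

```python
import numpy as np
import pandas as pd

row_names = ["Record high", "Record low"]
col_names = ["Jan", "Feb", "Mar"]
flat_values = [67.0, 75.0, 86.0, -25.0, -20.0, -7.0]   # row 1 first, then row 2

# reshape would fail with a less readable message if the counts did not match.
expected = len(row_names) * len(col_names)
if len(flat_values) != expected:
    raise ValueError(f"expected {expected} values, got {len(flat_values)}")

grid = np.array(flat_values).reshape(len(row_names), len(col_names))
df = pd.DataFrame(grid, columns=col_names, index=row_names)
print(df)
```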


Let's write a function that takes a city name as input and outputs a climate table:

In [33]:
def get_climate_from_wiki(city):
    '''
    Get the climate table from Wikipedia, if available.
    
    Inputs: 
        city (str): city name, e.g. "San Francisco"
    
    Outputs:
        df (pandas.DataFrame): climate table, or None if no table is found
    '''
    
    # Step 2: Access URL of the website using code
    city = city.replace(" ", "_")
    myurl = f"https://en.wikipedia.org/wiki/{city}"
    html = pm.urlopen(url=myurl, method="GET").data
    
    
    # Step 3: Parse the HTML contents and extract useful information
    soup = bs4.BeautifulSoup(html, features='lxml')
    tag_list = soup.find_all("th", colspan="14")
    try:
        table_tag = tag_list[0].parent.parent
        row_names = [row.text[:-1] for row in table_tag.find_all("th", scope="row")][1:]
        col_names = [col.text[:-1] for col in table_tag.find_all("th", scope="col")]
        value_tags = table_tag.find_all("td")[:len(row_names)*len(col_names)]
        data = [value.text[:-1] for value in value_tags]
        for i, text in enumerate(data):
            text = text.split("(")[0]
            text = text.replace("−", "-")
            text = text.replace(",", "")
            data[i] = float(text)
            
        # Step 4: Format the extracted data and save it into a structured format
        data = np.array(data).reshape(len(row_names), len(col_names))
        df = pd.DataFrame(data, columns=col_names, index=row_names)
        
    except IndexError:
        return None

    return df
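
The cell-cleaning steps inside the loop can also be collected in one small helper, which makes them easier to test on their own. The sketch below is one possible version: `parse_cell` is a hypothetical name, and returning `NaN` for text that is not a number (the `"trace"` input is an invented example) is a design choice that differs from the function above, which would raise a `ValueError` instead.

```python
import math
import re

# An optional ASCII minus, digits, and an optional decimal part.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_cell(text):
    """Convert a cell such as '67(19)' or '−3(−19)' to a float, else NaN."""
    cleaned = text.split("(")[0]               # keep the part before the parenthesis
    cleaned = cleaned.replace("\u2212", "-")   # Unicode minus -> ASCII hyphen
    cleaned = cleaned.replace(",", "")         # drop thousands separators
    cleaned = cleaned.strip()
    if _NUMBER.fullmatch(cleaned):
        return float(cleaned)
    return math.nan


print(parse_cell("67(19)"), parse_cell("\u22123(\u221219)"), parse_cell("1,717"))
```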

In [34]:
get_climate_from_wiki("San Francisco")

| | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Record high °F (°C) | 79.0 | 81.0 | 87.0 | 94.0 | 97.0 | 103.0 | 98.0 | 98.0 | 106.0 | 102.0 | 86.0 | 76.0 | 106.0 |
| Mean maximum °F (°C) | 67.1 | 71.8 | 76.4 | 80.7 | 81.4 | 84.6 | 80.5 | 83.4 | 90.8 | 87.9 | 75.8 | 66.4 | 94.0 |
| Average high °F (°C) | 57.8 | 60.4 | 62.1 | 63.0 | 64.1 | 66.5 | 66.3 | 67.9 | 70.2 | 69.8 | 63.7 | 57.9 | 64.1 |
| Daily mean °F (°C) | 52.2 | 54.2 | 55.5 | 56.4 | 57.8 | 59.7 | 60.3 | 61.7 | 62.9 | 62.1 | 57.2 | 52.5 | 57.7 |
| Average low °F (°C) | 46.6 | 47.9 | 48.9 | 49.7 | 51.4 | 53.0 | 54.4 | 55.5 | 55.6 | 54.4 | 50.7 | 47.0 | 51.3 |
| Mean minimum °F (°C) | 40.5 | 42.0 | 43.7 | 45.0 | 48.0 | 50.1 | 51.6 | 52.9 | 52.0 | 49.9 | 44.9 | 40.7 | 38.8 |
| Record low °F (°C) | 29.0 | 31.0 | 33.0 | 40.0 | 42.0 | 46.0 | 47.0 | 46.0 | 47.0 | 43.0 | 38.0 | 27.0 | 27.0 |
| Average precipitation inches (mm) | 4.4 | 4.37 | 3.15 | 1.6 | 0.7 | 0.2 | 0.01 | 0.06 | 0.1 | 0.94 | 2.6 | 4.76 | 22.89 |
| Average precipitation days (≥ 0.01 in) | 11.2 | 10.8 | 10.8 | 6.8 | 4.0 | 1.6 | 0.7 | 1.1 | 1.2 | 3.5 | 7.9 | 11.6 | 71.2 |
| Average relative humidity (%) | 80.0 | 77.0 | 75.0 | 72.0 | 72.0 | 71.0 | 75.0 | 75.0 | 73.0 | 71.0 | 75.0 | 78.0 | 75.0 |


In [35]:
city_list = ["New York", "Boston", "San Francisco", "Paris", "London", "Tokyo", "Beijing", "New Delhi"]
df_list = list(map(get_climate_from_wiki, city_list))
df_dict = dict(zip(city_list, df_list))
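
Because `get_climate_from_wiki` returns `None` when it cannot find a table, some entries of `df_dict` may be `None`. The sketch below shows one way to keep successes and failures apart, with a short pause between requests to avoid sending them in rapid succession. `scrape_many` and `fake_scraper` are hypothetical names; the fake scraper only stands in for the real function so the example runs offline.

```python
import time


def scrape_many(names, scraper, pause_seconds=1.0):
    """Call scraper(name) for each name; return (results dict, failed names)."""
    results = {}
    failed = []
    for i, name in enumerate(names):
        if i > 0:
            time.sleep(pause_seconds)   # wait between consecutive requests
        table = scraper(name)
        if table is None:
            failed.append(name)
        else:
            results[name] = table
    return results, failed


def fake_scraper(city):
    """Stand-in for the real scraper: pretends one city has no table."""
    return None if city == "Nowhere" else {"city": city}


results, failed = scrape_many(["Paris", "Nowhere", "Tokyo"], fake_scraper, pause_seconds=0)
print(list(results), failed)
```

With the real function, the call would be `scrape_many(city_list, get_climate_from_wiki)`.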

In [36]:
df_dict["Beijing"]

| | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Record high °C (°F) | 14.3 | 25.6 | 29.5 | 33.5 | 41.1 | 40.6 | 41.9 | 38.3 | 35.0 | 31.0 | 23.3 | 19.5 | 41.9 |
| Average high °C (°F) | 2.1 | 5.8 | 12.6 | 20.7 | 26.9 | 30.5 | 31.5 | 30.5 | 26.2 | 19.4 | 10.3 | 3.8 | 18.4 |
| Daily mean °C (°F) | -2.9 | 0.4 | 7.0 | 14.9 | 21.0 | 25.0 | 26.9 | 25.8 | 20.8 | 13.8 | 5.1 | -0.9 | 13.1 |
| Average low °C (°F) | -7.1 | -4.3 | 1.6 | 8.9 | 14.9 | 19.8 | 22.7 | 21.7 | 16.0 | 8.8 | 0.6 | -4.9 | 8.2 |
| Record low °C (°F) | -22.8 | -27.4 | -15.0 | -3.2 | 2.5 | 9.8 | 15.3 | 11.4 | 3.7 | -3.5 | -12.3 | -18.3 | -27.4 |
| Average precipitation mm (inches) | 2.7 | 5.0 | 10.2 | 23.1 | 39.0 | 76.7 | 168.8 | 120.2 | 57.4 | 24.1 | 13.1 | 2.4 | 542.7 |
| Average precipitation days (≥ 0.1 mm) | 1.8 | 2.3 | 3.3 | 4.7 | 6.1 | 9.9 | 12.8 | 10.9 | 7.6 | 4.8 | 2.9 | 2.0 | 69.1 |
| Average relative humidity (%) | 44.0 | 43.0 | 41.0 | 43.0 | 49.0 | 59.0 | 70.0 | 72.0 | 65.0 | 58.0 | 54.0 | 47.0 | 54.0 |
| Mean monthly sunshine hours | 186.2 | 188.1 | 227.5 | 242.8 | 267.6 | 225.6 | 194.5 | 208.2 | 207.5 | 205.2 | 174.5 | 172.3 | 2500.0 |
| Percent possible sunshine | 65.0 | 65.0 | 63.0 | 64.0 | 64.0 | 59.0 | 47.0 | 52.0 | 63.0 | 64.0 | 62.0 | 62.0 | 60.0 |


In [37]:
df_dict["Paris"]

| | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Record high °C (°F) | 16.1 | 21.4 | 26.0 | 30.2 | 34.8 | 37.6 | 42.6 | 39.5 | 36.2 | 28.9 | 21.6 | 17.1 | 42.6 |
| Average high °C (°F) | 7.6 | 8.8 | 12.8 | 16.6 | 20.2 | 23.4 | 25.7 | 25.6 | 21.5 | 16.5 | 11.1 | 8.0 | 16.5 |
| Daily mean °C (°F) | 5.4 | 6.0 | 9.2 | 12.2 | 15.6 | 18.8 | 20.9 | 20.8 | 17.2 | 13.2 | 8.7 | 5.9 | 12.8 |
| Average low °C (°F) | 3.2 | 3.3 | 5.6 | 7.9 | 11.1 | 14.2 | 16.2 | 16.0 | 13.0 | 9.9 | 6.2 | 3.8 | 9.2 |
| Record low °C (°F) | -14.6 | -14.7 | -9.1 | -3.5 | -0.1 | 3.1 | 6.0 | 6.3 | 1.8 | -3.8 | -14.0 | -23.9 | -23.9 |
| Average precipitation mm (inches) | 47.6 | 41.8 | 45.2 | 45.8 | 69.0 | 51.3 | 59.4 | 58.0 | 44.7 | 55.2 | 54.3 | 62.0 | 634.3 |
| Average precipitation days (≥ 1.0 mm) | 9.9 | 9.1 | 9.5 | 8.6 | 9.2 | 8.3 | 7.4 | 8.1 | 7.5 | 9.5 | 10.4 | 11.4 | 108.9 |
| Average snowy days | 3.0 | 3.9 | 1.6 | 0.6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.7 | 2.1 | 11.9 |
| Average relative humidity (%) | 83.0 | 78.0 | 73.0 | 69.0 | 70.0 | 69.0 | 68.0 | 71.0 | 76.0 | 82.0 | 84.0 | 84.0 | 76.0 |
| Mean monthly sunshine hours | 59.0 | 83.7 | 134.9 | 177.3 | 201.0 | 203.5 | 222.4 | 215.3 | 174.7 | 118.6 | 69.8 | 56.9 | 1717.0 |


---

We chose these two websites as examples because they are easy to scrape: all the data we need is already present in the HTML returned by the server, so we can access it directly. Unfortunately, this is not true for many other websites, for example pages that load their data with JavaScript after the page opens. That's when we need a browser-automation tool such as **Selenium** for more complex web scraping tasks.

## Reference

- CMSC 12200: https://www.classes.cs.uchicago.edu/archive/2019/winter/12200-1/labs/lab1/index.html
- HTML: https://www.w3schools.com/html/html_attributes.asp
- urllib3: https://pypi.org/project/urllib3/
- Beautiful Soup: https://beautiful-soup-4.readthedocs.io/en/latest/