# Web scraping

Web scraping is a technique used to extract data from websites. It involves sending HTTP requests to websites, parsing the returned HTML code, and extracting the desired data. Web scraping is a powerful tool for data scientists as it allows them to collect large amounts of data from the web. This data can then be used to train machine learning models, analyse trends, and make informed business decisions.

---
## 1.&nbsp; Import libraries 💾

In [3]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

---
## 2.&nbsp; Beautiful Soup 🍲

[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python library that simplifies the process of web scraping. It provides a user-friendly interface for parsing HTML documents, enabling users to extract specific information from websites. Through Beautiful Soup, you can navigate the HTML tree structure, locate elements based on their tags, attributes, and content, and extract the desired data into a structured format.

To illustrate how to use Beautiful Soup, we'll use the simplified mock website below. This stripped-down version serves as a practical learning tool, as real websites often possess much larger and more complex HTML structures. By starting with this simplified model, you can gradually build your skills and expertise, ensuring a solid understanding of the core concepts before tackling more intricate web scraping tasks.

In [4]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1" meta="Eldest sister">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2" meta="Middle sister">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3" meta="Youngest sister">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""


Beautiful Soup's HTML parser takes the raw, unruly HTML code and transforms it into a neatly organised tree structure, making the information easily accessible and manageable.

In [5]:
soup = BeautifulSoup(html_doc, 'html.parser')

We can see the tree structure using Beautiful Soup's `.prettify()` method.

In [6]:
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>



---
## 3.&nbsp; Navigating HTML for beginners 🧭
There are many methods in Beautiful Soup to explore HTML data. By far the most popular and useful of these is `.find_all()`, so this is where we'll start our journey.

### 3.1.&nbsp; `.find_all()`
The `.find_all()` method in Beautiful Soup returns a list of all the elements that match the specified criteria, such as tag name, class name, or attribute values.

#### 3.1.1.&nbsp; Searching by tag

A tag's name is the first word inside its angle brackets. For example, the element below is an `a` tag.

`<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>`

The `.find_all()` method takes a string argument and returns a list of all matching HTML tags within the current document. If no matching tags exist, an empty list is returned.
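
As a minimal sketch of this behaviour, the snippet below runs `.find_all()` on a tiny stand-in document. The `snippet` string and the `mini_soup` name are made up for illustration and are not part of the notebook's `html_doc`. It also shows that a list of tag names matches any of the names in it.

```python
from bs4 import BeautifulSoup

# A tiny stand-in document (hypothetical, not the notebook's html_doc)
snippet = "<p>One</p><p>Two</p><b>Bold</b>"
mini_soup = BeautifulSoup(snippet, "html.parser")

paragraphs = mini_soup.find_all("p")     # every <p> tag
missing = mini_soup.find_all("table")    # there are no <table> tags here
both = mini_soup.find_all(["p", "b"])    # a list matches any of the listed names

print(len(paragraphs))  # 2
print(missing)          # []
print(len(both))        # 3
```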

In [7]:
soup.find_all("title")

[<title>The Dormouse's story</title>]

In [8]:
soup.find_all("p")

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

#### 3.1.2.&nbsp; Searching by attribute

Attributes are the other information in the angle brackets. For example, below, these brackets have a `class`, `href`, `id`, and `meta` attribute.

`<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>`

Attributes provide additional context and functionality to the elements. They can serve various purposes, including CSS selectors for styling, URLs for linking to external resources, metadata for storing relevant data, and a multitude of other information-bearing components. By leveraging these attributes, we can effectively target specific sections of the website.

##### 3.1.2.1.&nbsp; CSS selectors
CSS selectors are used to style specific sections of a website. This makes them very helpful for web scraping, as we can use them to target those same sections.

###### 3.1.2.1.1.&nbsp; Class
Class selectors are used to style **multiple** HTML elements that share a common characteristic or function.
> **Note:** the keyword argument is `class_`, with a trailing underscore, because `class` is a reserved keyword in Python.

In [9]:
soup.find_all(class_="sister")

[<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">Tillie</a>]

###### 3.1.2.1.2.&nbsp; ID
ID selectors are used to style **single** HTML elements.

In [10]:
soup.find_all(id="link1")

[<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>]

In [11]:
soup.find_all(id="link2")

[<a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a>]

##### 3.1.2.2.&nbsp; Other attributes
HTML elements can also include other attributes, which can be equally useful for identifying and targeting specific data points. To locate these attributes, search for them using the same method as you do for CSS selectors.

In [12]:
soup.find_all(meta="Youngest sister")

[<a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">Tillie</a>]
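
Keyword arguments only work when the attribute name is a valid Python argument name. For names such as `data-role` (which contains a hyphen) or `name` (which `.find_all()` already uses for the tag name), the `attrs` dictionary is the usual alternative. The snippet below is a small sketch on a made-up document; `snippet` and `mini_soup` are illustrative names.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with attribute names that cannot be keyword arguments
snippet = '<div data-role="card" name="first">A</div><div data-role="note">B</div>'
mini_soup = BeautifulSoup(snippet, "html.parser")

# data-role contains a hyphen, so it is passed through the attrs dictionary
cards = mini_soup.find_all(attrs={"data-role": "card"})

# name is already the tag-name parameter of find_all, so it also goes in attrs
named = mini_soup.find_all(attrs={"name": "first"})

print([tag.get_text() for tag in cards])  # ['A']
print([tag.get_text() for tag in named])  # ['A']
```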

#### 3.1.3.&nbsp; Searching by string
The text (string) is the part between the opening and closing tags; this is what's displayed on the webpage. For example, the element below has `Elsie` as its text.

`<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>`

Instead of searching for specific tags or attributes, you can also search for this text. To do this, you can use a string or a regular expression to specify the text you're looking for.

In [13]:
soup.find_all(string="Dormouse")

[]

The string "Dormouse" didn't return any results because Beautiful Soup only matches text whose entire content equals the string you pass. Text that merely contains the search string is not considered a match.

In [14]:
soup.find_all(string="The Dormouse's story")

["The Dormouse's story", "The Dormouse's story"]

To search for a substring, the easiest way is to pass a regular expression created with `re.compile()`.

In [15]:
import re
soup.find_all(string=re.compile("dormouse", re.IGNORECASE))

["The Dormouse's story", "The Dormouse's story"]

> **Note:** patterns created with `re.compile()` are case-sensitive by default, so `re.compile("dormouse")` on its own would not match "Dormouse". To perform case-insensitive matching, pass the `re.IGNORECASE` flag to `re.compile()`. Unlike a plain string, a compiled pattern matches any text that contains it, so a substring is enough.
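
The difference between case-sensitive, case-insensitive, and anchored patterns can be sketched on a small made-up document. The two sentences in `snippet` are illustrative only.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical two-paragraph document
snippet = "<p>The Dormouse's story</p><p>A dormouse sleeps</p>"
mini_soup = BeautifulSoup(snippet, "html.parser")

# Case-sensitive: only the capitalised "Dormouse" matches
case_sensitive = mini_soup.find_all(string=re.compile("Dormouse"))

# Case-insensitive: both paragraphs match
case_insensitive = mini_soup.find_all(string=re.compile("dormouse", re.IGNORECASE))

# Anchored with ^: only text that starts with "A " matches
anchored = mini_soup.find_all(string=re.compile(r"^A "))

print(case_sensitive)
print(case_insensitive)
print(anchored)
```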

### 3.2.&nbsp; Extracting text
There are a few ways to extract data from elements in Beautiful Soup; here we'll focus on two of them.

#### 3.2.1.&nbsp; `.get_text()`
The `.get_text()` method extracts all the human-readable text from a Beautiful Soup object, returning it as a string.

In [16]:
soup.find_all("title")

[<title>The Dormouse's story</title>]

In [17]:
soup.find_all("title")[0].get_text()

"The Dormouse's story"

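> Note the `[0]` in the cell above. `.find_all()` returns a list, so calling `.get_text()` directly on its result, as in `soup.find_all("title").get_text()`, raises an error. Read the error message and look at the output of `soup.find_all("title")`. Can you work out why we get an error?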

In [18]:
# @title Click `show code` to see the solution to the error

# It was a list, read the error messages and notice the square brackets in the original output
# Therefore, we need to select the first and only element of this list
soup.find_all("title")[0].get_text()

"The Dormouse's story"

We can also print out multiple items using our looping skills.

In [19]:
story = soup.find_all("p")
story

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [20]:
for p in story:
  print(p.get_text())

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...


#### 3.2.2.&nbsp; Extracting attributes
HTML elements often store additional information within their attributes. To extract an attribute's value with Beautiful Soup, index the element with the attribute name in square brackets, as you would a dictionary.

In [21]:
soup.find_all(id="link1")

[<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>]

In [22]:
soup.find_all(id="link1")[0]['href']

'http://example.com/elsie'

In [23]:
soup.find_all(id="link1")[0]['meta']

'Eldest sister'
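
Square brackets raise a `KeyError` when the attribute is missing. If an attribute may not be present on every element, `.get()` is a gentler alternative, and `.attrs` exposes all attributes as a dictionary. A minimal sketch on a single made-up link:

```python
from bs4 import BeautifulSoup

# A single link, similar in shape to the ones in the mock website
link = BeautifulSoup('<a href="http://example.com/elsie" id="link1">Elsie</a>', "html.parser").a

href = link["href"]                       # raises KeyError if href were missing
title = link.get("title")                 # None, because there is no title attribute
fallback = link.get("title", "no title")  # a default value can be supplied
all_attributes = link.attrs               # every attribute as a dictionary

print(href)
print(title)
print(fallback)
print(all_attributes)
```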

## Challenge 1 😀
Below is new HTML code. Use your scraping skills to answer the questions.

In [24]:
geography = """
<!DOCTYPE html>
<html>
<head> Geography</head>
<body>

<div class="city">
  <h2>London</h2>
  <p>London is the most popular tourist destination in the world.</p>
</div>

<div class="city">
  <h2>Paris</h2>
  <p>Paris was originally a Roman City called Lutetia.</p>
</div>

<div class="country">
  <h2>Spain</h2>
  <p>Spain produces 43,8% of all the world's Olive Oil.</p>
</div>

</body>
</html>
"""

In [25]:
# Create the "soup"
soup2 = BeautifulSoup(geography, 'html.parser')

In [26]:
# 1. All the "fun facts"
[ p.get_text() for p in soup2.select('p') ]

['London is the most popular tourist destination in the world.',
 'Paris was originally a Roman City called Lutetia.',
 "Spain produces 43,8% of all the world's Olive Oil."]

In [27]:
# 2. The names of all the places.
[ e.get_text() for e in soup2.select('h2') ]

['London', 'Paris', 'Spain']

In [28]:
# 3. All the content (name and fact) of all the cities (only cities, not countries!)
[ e.get_text() for e in soup2.select('.city') ]

['\nLondon\nLondon is the most popular tourist destination in the world.\n',
 '\nParis\nParis was originally a Roman City called Lutetia.\n']

In [29]:
{ e.select_one('h2').get_text(): e.select_one('p').get_text() for e in soup2.select('.city') }

{'London': 'London is the most popular tourist destination in the world.',
 'Paris': 'Paris was originally a Roman City called Lutetia.'}

In [30]:
{ e.h2.get_text(): e.p.get_text() for e in soup2.select('.city') }

{'London': 'London is the most popular tourist destination in the world.',
 'Paris': 'Paris was originally a Roman City called Lutetia.'}

In [31]:
# 4. The names (not facts!) of all the cities (not countries!)
[ e.get_text() for e in soup2.select('.city h2') ]

['London', 'Paris']

---
## 4.&nbsp; Navigating HTML with a few more advanced techniques 🗺️

### 4.1.&nbsp; `.find()`
`.find()` is similar to `.find_all()`, but it returns only the first element that matches the specified criteria (or `None` if nothing matches) rather than a list. This makes it useful when you expect exactly one match, or when you only need the first one.

In [32]:
soup.find('p')

<p class="title"><b>The Dormouse's story</b></p>
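
Because `.find()` returns `None` when there is no match, chaining `.get_text()` straight onto it can fail. The sketch below, on a made-up one-line document, shows one way to guard against that.

```python
from bs4 import BeautifulSoup

# Hypothetical document that contains a paragraph but no table
mini_soup = BeautifulSoup("<p>Only a paragraph</p>", "html.parser")

first_p = mini_soup.find("p")     # a Tag
table = mini_soup.find("table")   # None: there is no <table>

# Guard before calling methods on the result
table_text = table.get_text() if table is not None else None

print(first_p.get_text())  # Only a paragraph
print(table_text)          # None
```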

### 4.2.&nbsp; `.select()`
`.select()` is similar to `.find_all()`, but there are 2 main differences:
- the way we write our query in the brackets is slightly different
- `.select()` allows you to chain CSS selectors together to navigate through the HTML structure, enabling you to select elements based on their positions within nested elements or patterns. This makes it particularly useful for extracting data from complex HTML structures.

In contrast, `.find_all()` uses a simpler syntax based on tag names and attributes, making it more straightforward for basic element selection.

Here's how we query with `.find_all()`

In [33]:
soup.find_all('a', class_='sister')

[<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">Tillie</a>]

Here's the same query with `.select()`

In [34]:
soup.select('a.sister')

[<a class="sister" href="http://example.com/elsie" id="link1" meta="Eldest sister">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">Tillie</a>]

To demonstrate the power of `.select()` in navigating through nested elements, let's extract all the `<a>` tags with the id `'link2'` that are within `<p>` tags with the class `'story'`.

In [35]:
soup.select('p.story a#link2')

[<a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a>]
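
A few more selector patterns, sketched on a made-up document: `>` restricts the match to direct children, `.select_one()` returns only the first match, and square brackets select by attribute. The markup, the `data-kind` attribute, and the placeholder facts are all invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; data-kind is an invented attribute
snippet = """
<div class="city" data-kind="city"><h2>London</h2><p>Fact about London.</p></div>
<div class="country" data-kind="country"><h2>Spain</h2><p>Fact about Spain.</p></div>
"""
mini_soup = BeautifulSoup(snippet, "html.parser")

# h2 tags that are direct children of a div with class "city"
city_names = [h2.get_text() for h2 in mini_soup.select("div.city > h2")]

# Only the first matching element, instead of a list
first_fact = mini_soup.select_one("div.country p").get_text()

# Attribute selector: divs whose data-kind equals "country"
by_attribute = [h2.get_text() for h2 in mini_soup.select('div[data-kind="country"] h2')]

print(city_names)
print(first_fact)
print(by_attribute)
```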

### 4.3.&nbsp; Navigating to the Next or Previous Element
In some cases, you may need to access specific elements that are closely related to others, but their HTML structure doesn't provide unique identifiers. To overcome this challenge, you can utilise the `.find_next()` and `.find_previous()` methods to navigate through the HTML structure and reach the desired element.

In [36]:
last_link = soup.find(id='link3')
last_link

<a class="sister" href="http://example.com/tillie" id="link3" meta="Youngest sister">Tillie</a>

#### 4.3.1.&nbsp; `.find_next()`
`.find_next()` returns the first element that appears after the current one in the document.

In [37]:
last_link.find_next()

<p class="story">...</p>

#### 4.3.2.&nbsp; `.find_previous()`
`.find_previous()` returns the closest element that appears before the current one in the document.

In [38]:
last_link.find_previous()

<a class="sister" href="http://example.com/lacie" id="link2" meta="Middle sister">Lacie</a>
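
Both methods accept the same filters as `.find()`, so you can skip over elements you are not interested in. A minimal sketch on a made-up heading followed by two paragraphs (the markup is illustrative only):

```python
from bs4 import BeautifulSoup

# Hypothetical heading followed by an edit marker and two paragraphs
snippet = '<h2 id="india">India</h2><span>[edit]</span><p>First paragraph.</p><p>Second paragraph.</p>'
mini_soup = BeautifulSoup(snippet, "html.parser")
heading = mini_soup.find(id="india")

next_any = heading.find_next()         # the very next tag: the <span>
next_p = heading.find_next("p")        # the next <p>, skipping the <span>
all_p = heading.find_all_next("p")     # every later <p>
back = next_p.find_previous("h2")      # walk backwards to the heading

print(next_any.name)      # span
print(next_p.get_text())  # First paragraph.
print(len(all_p))         # 2
print(back.get("id"))     # india
```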

---
## 5.&nbsp; Showcasing these skills on a real website 💻
Let's see what information we can get from the Wikipedia page on web scraping.

### Loading the HTML

In [39]:
url = "https://en.wikipedia.org/wiki/Web_scraping"

response = requests.get(url)

soup_3 = BeautifulSoup(response.content, 'html.parser')

> While we haven't yet looked into the requests library, we'll postpone delving into it today to avoid overwhelming you with too much new information. Instead, we'll explore the requests library when we start gathering weather data later in the project.
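
For reference, a slightly more defensive way to load a page could look like the sketch below. The function name `fetch_soup`, its `session` parameter, the User-Agent text, and the 10-second timeout are illustrative assumptions rather than part of this notebook. `raise_for_status()` turns HTTP error codes into exceptions, so a failed download is not parsed as if it were the page.

```python
import requests
from bs4 import BeautifulSoup

def fetch_soup(url, session=None, timeout=10):
    """Download url and parse it, raising an exception on HTTP errors."""
    session = session or requests.Session()
    # Assumed User-Agent text: replace it with something that describes your project
    headers = {"User-Agent": "scraping-tutorial/0.1 (learning project)"}
    response = session.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # e.g. 404 or 403 become exceptions
    return BeautifulSoup(response.content, "html.parser")
```

Usage would then be `soup_3 = fetch_soup("https://en.wikipedia.org/wiki/Web_scraping")`.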

In [40]:
print(soup_3.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Web scraping - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-li

### Getting the title

In [41]:
soup_3.find("title").get_text()

'Web scraping - Wikipedia'

### Getting the first h1 tag

In [42]:
soup_3.find("h1").get_text()

'Web scraping'

### Getting all the h2 tags

In [43]:
h2_tags = soup_3.find_all("h2")
h2_tags

[<h2 class="vector-pinnable-header-label">Contents</h2>,
 <h2><span class="mw-headline" id="History">History</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Web_scraping&amp;action=edit&amp;section=1" title="Edit section: History"><span>edit</span></a><span class="mw-editsection-bracket">]</span></span></h2>,
 <h2><span class="mw-headline" id="Techniques">Techniques</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Web_scraping&amp;action=edit&amp;section=2" title="Edit section: Techniques"><span>edit</span></a><span class="mw-editsection-bracket">]</span></span></h2>,
 <h2><span class="mw-headline" id="Software">Software</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Web_scraping&amp;action=edit&amp;section=11" title="Edit section: Software"><span>edit</span></a><span class="mw-editsection-bracket">]</span></

As we have multiple tags in the list here, we need to use a loop to print them out.

In [44]:
for h2 in h2_tags:
  print(h2.get_text())

Contents
History[edit]
Techniques[edit]
Software[edit]
Legal issues[edit]
Methods to prevent web scraping[edit]
See also[edit]
References[edit]


### Selecting the `Legal Issues` text for only `India`
> **Pro tip:** If you're using Google Chrome, you can navigate to `View > Developer > Inspect elements` to access the built-in web development tools. Here, you can explore the HTML structure of the webpage directly within the browser using your mouse. This interactive approach is often more intuitive than examining the raw HTML code.

By investigating the HTML, we can see that the closest easy-to-access tag is the heading with the CSS `id` of `"India"`.

In [45]:
soup_3.find(id="India")

<span class="mw-headline" id="India">India</span>

We can then use `.find_next()` to select the text.

In [46]:
soup_3.find(id="India").find_next()

<span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Web_scraping&amp;action=edit&amp;section=16" title="Edit section: India"><span>edit</span></a><span class="mw-editsection-bracket">]</span></span>

Looks like the next tag was a `span` tag, so let's specify that we want the next `p` tag.

In [47]:
soup_3.find(id="India").find_next("p")

<p>Leaving a few cases dealing with IPR infringement, Indian courts have not expressly ruled on the legality of web scraping. However, since all common forms of electronic contracts are enforceable in India, violating the terms of use prohibiting data scraping will be a violation of the contract law. It will also violate the <a href="/wiki/Information_Technology_Act,_2000#:~:text=From_Wikipedia,_the_free_encyclopedia_The_Information_Technology,in_India_dealing_with_cybercrime_and_electronic_commerce." title="Information Technology Act, 2000">Information Technology Act, 2000</a>, which penalizes unauthorized access to a computer resource or extracting data from a computer resource.
</p>

Now we can simply extract the text, and we have what we need.

In [48]:
soup_3.find(id="India").find_next("p").get_text()

'Leaving a few cases dealing with IPR infringement, Indian courts have not expressly ruled on the legality of web scraping. However, since all common forms of electronic contracts are enforceable in India, violating the terms of use prohibiting data scraping will be a violation of the contract law. It will also violate the Information Technology Act, 2000, which penalizes unauthorized access to a computer resource or extracting data from a computer resource.\n'

## Challenge 2 😀

Utilise your web scraping skills to gather information about three German cities – Berlin, Hamburg, and Munich – from Wikipedia. You will start by extracting the population of each city and then expand the scope of your data gathering to include latitude and longitude, country, and possibly other relevant details.

1. Population Scraping

  1.1. Begin by scraping the population of each city from their respective Wikipedia pages:

 - Berlin: https://en.wikipedia.org/wiki/Berlin
 - Hamburg: https://en.wikipedia.org/wiki/Hamburg
 - Munich: https://en.wikipedia.org/wiki/Munich

  1.2. Once you have scraped the population of each city, reflect on the similarities and patterns in accessing the population data across the three pages. Also, analyse the URLs to identify any commonalities. Then write a single loop that retrieves the population for all three cities.

2. Data Organisation

  Utilise pandas DataFrame to effectively store the extracted population data. Ensure the data is clean and properly formatted. Remove any unnecessary characters or symbols and ensure the column data types are accurate.

3. Further Enhancement

  3.1. Expand the scope of your data gathering by extracting other relevant information for each city:

 - Latitude and longitude
 - Country of location

  3.2. Create a function from the loop and DataFrame to encapsulate the scraping process. This function can be used repeatedly to fetch updated data whenever necessary. It should return a clean, properly formatted DataFrame.

4. Global Data Scraping

  With your robust scraping skills now honed, venture beyond the confines of Germany and explore other cities around the world. While the extraction methodology for German cities may follow a consistent pattern, this may not be the case for cities from different countries. Can you make a function that returns a clean DataFrame of information for cities worldwide?
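
As a hint for the data-organisation step, scraped numbers usually arrive as strings with thousands separators, footnote markers, or stray whitespace. The sketch below shows one possible cleaning chain in pandas. The city names and population strings are made-up placeholders, not real scraped values.

```python
import pandas as pd

# Placeholder data: invented strings in the shape scraped values often have
raw = pd.DataFrame({
    "city": ["City A", "City B"],
    "population": ["1,234,567[1]", " 7,654,321 "],
})

clean = raw.assign(
    population=(
        raw["population"]
        .str.replace(r"\[\d+\]", "", regex=True)  # drop footnote markers like [1]
        .str.replace(",", "", regex=False)        # drop thousands separators
        .str.strip()                              # drop surrounding whitespace
        .astype(int)                              # convert text to integers
    )
)

print(clean)
print(clean.dtypes)
```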

In [49]:
# Store soups that we previously retrieved, so we do not cause unnecessary traffic and delays:
try:
  cached_soup
except NameError:
  cached_soup = {}

In [108]:
#! pip install lat_lon_parser



In [113]:
from datetime import datetime
import urllib.parse  # import the submodule explicitly; scrapeCity uses urllib.parse.quote
import json
import lat_lon_parser

url_pattern = 'https://en.wikipedia.org/wiki/{city}'
wikidata_pattern = 'https://www.wikidata.org/wiki/Special:EntityData/{entity}.json'

def parseNumber(number):
  if match := re.search(r'(\d+)', number.replace(',', '')):
    return int(match.group(1))
  return None

def parseDate(date):
  # 31 December 2018, see Zürich
  if match := re.search(r'\((\d+ \w+ \d{4})(\s.*)?\)', date):
    return datetime.strptime(match.group(1), '%d %B %Y').strftime('%Y-%m-%d')
  # 2018-12-31, see Hamburg
  if match := re.search(r'\((\d{4}-\d\d-\d\d)(\s.*)?\)', date):
    return match.group(1)
  # 2022, see most others
  if match := re.search(r'\((20\d\d)(\s.*)?\)', date):
    return match.group(1)
  return None

def getSoup(url):
  if soup := cached_soup.get(url):
    return soup
  response = requests.get(url)
  soup = BeautifulSoup(response.content, 'html.parser')
  cached_soup[url] = soup
  return soup

def scrapeCity(city):
  city_soup = getSoup(url_pattern.format(city=urllib.parse.quote(city)))

  for ib in city_soup.select('.infobox'): # .ib-settlement would be nicer but Sydney uses infobox only
    latitude = ib.select_one('.latitude').text
    longitude = ib.select_one('.longitude').text
    country = None
    date = None
    population = None

    for th in ib.select('.infobox-label'):
      if th.text == 'Country':
        country = th.parent.select_one('.infobox-data').text
      elif th.text.startswith('Population'):
        date = parseDate(th.text)
        infobox_data = th.parent.select_one('.infobox-data')
        population = parseNumber(infobox_data.text)
        if not date:
          date = parseDate(infobox_data.text)

    for th in ib.select('.infobox-header'):
      if population is None and th.text.startswith('Population'):
        date = parseDate(th.text)

        if data := th.parent.select_one('.infobox-data'):
          population = parseNumber(data.text)
          continue

        next_tr = th.parent.next_sibling
        if not re.match(r'\s*•\s+(City|Capital city|Total|Municipality)', next_tr.select_one('.infobox-label').text):
          continue

        population = parseNumber(next_tr.select_one('.infobox-data').text)
    break

  # WikiData
  # wikidata_entity = city_soup.select_one('#t-wikibase a')['href'].rsplit('/', 1)[-1]
  # wikidata_url = wikidata_pattern.format(entity=wikidata_entity)
  # print(wikidata_url)
  # wikidata_json = json.loads(requests.get(wikidata_pattern.format(entity=wikidata_entity)).content)
  # print(wikidata_json['entities'][wikidata_entity]['claims']['P2044'][0]['mainsnak']['datavalue']['value']['amount'])
  
  wikidata_soup = getSoup(city_soup.select_one('#t-wikibase a')['href'])
  try:
    base_elevation = wikidata_soup.select_one('#P2044 .wikibase-snakview-value').text
  except AttributeError:  # select_one returned None: no such statement on the page
    base_elevation = None

  try:
    peak_elevation = wikidata_soup.select_one('#P610 .wikibase-statementview-qualifiers .wikibase-snakview-value').text
  except AttributeError:  # select_one returned None: no such statement on the page
    peak_elevation = None
  
  return pd.DataFrame({
        'city': [city],
        'country': [country],
        'latitude': [lat_lon_parser.parse(latitude)],
        'longitude': [lat_lon_parser.parse(longitude)],
        'base_elevation': [base_elevation],
        'peak_elevation': [peak_elevation],
        'population': [population],
        'date': [date]})

def scrapeCities(cities=('Berlin', 'Hamburg', 'Munich')):
  df = None
  for city in cities:
    print(city)
    df = pd.concat([df, scrapeCity(city)], ignore_index=True)
  return df

scrapeCity('Berlin')

|   | city | country | latitude | longitude | base_elevation | peak_elevation | population | date |
|---|---|---|---|---|---|---|---|---|
| 0 | Berlin | Germany | 52.52 | 13.405 | 34±1 metre | 121.9 metre | 3850809 | 2021 |
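
`scrapeCity` builds each page URL by percent-encoding the city name before inserting it into `url_pattern`. The sketch below isolates that step; the helper name `city_url` is hypothetical and only used here for illustration.

```python
from urllib.parse import quote

url_pattern = "https://en.wikipedia.org/wiki/{city}"

def city_url(city):
    """Build the Wikipedia URL for a city, percent-encoding unsafe characters."""
    return url_pattern.format(city=quote(city))

print(city_url("Berlin"))          # https://en.wikipedia.org/wiki/Berlin
print(city_url("Zürich"))          # https://en.wikipedia.org/wiki/Z%C3%BCrich
print(city_url("Werder (Havel)"))  # https://en.wikipedia.org/wiki/Werder%20%28Havel%29
```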


In [114]:
df = scrapeCities([
    'Berlin',
    'Hamburg',
    'Munich',
    'Stuttgart',
    'Tübingen',
    'Potsdam',
    'Werder (Havel)',
    'Paris',
    'London',
    'Vienna',
    'Warsaw',
    'Prague',
    'Zürich',
    'New York City',
    'Tokyo',
    'Beijing',
    'Moscow',
    'Sydney'
    ])

df

Berlin
Hamburg
Munich
Stuttgart
Tübingen
Potsdam
Werder (Havel)
Paris
London
Vienna
Warsaw
Prague
Zürich
New York City
Tokyo
Beijing
Moscow
Sydney


|   | city | country | latitude | longitude | base_elevation | peak_elevation | population | date |
|---|---|---|---|---|---|---|---|---|
| 0 | Berlin | Germany | 52.52 | 13.405 | 34±1 metre | 121.9 metre | 3850809 | 2021 |
| 1 | Hamburg | Germany | 53.55 | 10.0 | 6±1 metre | 116.2 metre | 1945532 | 2022-12-31 |
| 2 | Munich | Germany | 48.1375 | 11.575 | 519±1 metre |  | 1512491 | 2022-12-31 |
| 3 | Stuttgart | Germany | 48.7775 | 9.18 | 245±1 metre |  | 626275 | 2021-12-31 |
| 4 | Tübingen | Germany | 48.52 | 9.055556 | 338 metre |  | 91877 | 2021-12-31 |
| 5 | Potsdam | Germany | 52.4 | 13.066667 | 35 metre |  | 183154 | 2021-12-31 |
| 6 | Werder (Havel) | Germany | 52.383333 | 12.933333 | 31±1 metre |  | 26767 | 2021-12-31 |
| 7 | Paris | France | 48.856667 | 2.352222 | 28±1 metre |  | 2102650 | 2023 |
| 8 | London | England | 51.507222 | -0.1275 | 15 metre | 245 metre | 8799800 | 2021 |
| 9 | Vienna | Austria | 48.208333 | 16.3725 | 151 metre |  | 2002821 |  |


In [122]:
import dotenv
import os
dotenv.load_dotenv()

schema = "wbscs_cities"
host = os.getenv('host')
user = os.getenv('user')
password = os.getenv('password')
port = os.getenv('port')

connection_string = f'mysql+pymysql://{user}:{password}@{host}:{port}/{schema}'

from sqlalchemy import create_engine, text, inspect
connection = create_engine(connection_string).connect()

In [181]:
def resetDatabase():
    connection.execute(text('drop table if exists weather;'))
    connection.execute(text('drop table if exists fact;'))
    connection.execute(text('drop table if exists city;'))
    connection.execute(text('drop table if exists measure;'))
    connection.execute(text('drop table if exists scrape;'))

def initDatabase():
    if len(inspect(connection).get_table_names()):
        return

    connection.execute(text(
        '''
        create table city (
            id INT AUTO_INCREMENT KEY,
            name TEXT NOT NULL,
            country TEXT,
            latitude DOUBLE,
            longitude DOUBLE,
            base_elevation TEXT,
            peak_elevation TEXT
        );
        '''))

    connection.execute(text(
        '''
        create table scrape (
            id INT AUTO_INCREMENT KEY,
            timestamp TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
        );
        '''))

    connection.execute(text(
        '''
        create table measure (
            id INT AUTO_INCREMENT KEY,
            name TEXT NOT NULL,
            type TEXT NOT NULL
        );
        '''))

    connection.execute(text(
        '''
        create table fact (
            scrape INT NOT NULL REFERENCES scrape(id) ON DELETE CASCADE ON UPDATE CASCADE,
            city INT NOT NULL REFERENCES city(id) ON DELETE CASCADE ON UPDATE CASCADE,
            measure INT NOT NULL REFERENCES measure(id) ON DELETE CASCADE ON UPDATE CASCADE,
            value TEXT NOT NULL,
            meta JSON NULL DEFAULT NULL
        );
        '''))

    connection.execute(text(
        '''
        create table weather (
            scrape INT NOT NULL REFERENCES scrape(id) ON DELETE CASCADE ON UPDATE CASCADE,
            city INT NOT NULL REFERENCES city(id) ON DELETE CASCADE ON UPDATE CASCADE,
            dt DATETIME NOT NULL,
            T_feelslike_celsius FLOAT NOT NULL,
            rain_3h_mm FLOAT NOT NULL,
            snow_3h_mm FLOAT NOT NULL
        );
        '''))

def newScrapeId():
  return connection.execute(text('insert into scrape values() returning id')).first().id

def newMeasureId(measure, type_):
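  # Bound parameters (:name, :type) let the driver handle quoting and prevent SQL injection
  return connection.execute(text('insert into measure(name, type) values(:name, :type) returning id'), {'name': measure, 'type': type_}).first().id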

resetDatabase()
initDatabase()
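
When a query includes Python values, bound parameters keep the quoting correct even if a value contains quote characters. The sketch below illustrates the idea with the standard library's `sqlite3` as a stand-in for the MySQL connection used in this notebook, so the placeholder syntax (`?`) and the `lastrowid` lookup are specific to that assumption. The function name `new_measure_id` is hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real MySQL schema
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table measure ("
    "id integer primary key autoincrement, name text not null, type text not null)"
)

def new_measure_id(conn, name, type_):
    """Insert a measure using bound parameters and return its new id."""
    cursor = conn.execute(
        "insert into measure(name, type) values (?, ?)", (name, type_)
    )
    return cursor.lastrowid

first = new_measure_id(conn, "population", "number")
tricky = new_measure_id(conn, 'say "hi"', "text")  # quotes in the value are safe

print(first, tricky)
print(conn.execute("select name from measure where id = ?", (tricky,)).fetchone())
```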

In [182]:
cities_df = df[['city','country','latitude','longitude','base_elevation','peak_elevation']].rename(columns={'city': 'name'})

In [183]:
cities_df.to_sql('city',
                  if_exists='append',
                  con=connection,
                  index=False)

18

In [184]:
cities_df = pd.read_sql('city', con=connection).set_index('id')

In [185]:
scrape_id = newScrapeId()
population_id = newMeasureId('population', 'number')

In [186]:
(
    df[['city','population','date']]
    .join(
        cities_df
        [['name']]
        .reset_index()
        .set_index('name'),
        on='city')
    .drop(columns=['city'])
    .rename(columns={'id': 'city', 'population': 'value'})
    .assign(
        scrape=scrape_id, 
        measure=population_id,
        meta=lambda x: x.apply(lambda r: f'{{"date": "{r.date}"}}' if r.date else None, axis=1) )
    .drop(columns=['date'])
).to_sql('fact', con=connection, if_exists='append', index=False)

18
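
The `meta` column holds a small JSON document with the date of each population figure. An alternative to formatting that string by hand is `json.dumps`, which also escapes any special characters in the value. A minimal sketch, with `date_meta` as a hypothetical helper name:

```python
import json

def date_meta(date):
    """Return a JSON string such as {"date": "..."} or None when no date is known."""
    return json.dumps({"date": date}) if date else None

print(date_meta("2022-12-31"))  # {"date": "2022-12-31"}
print(date_meta(None))          # None
```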

### Weather data

In [187]:
cities_df

| id | name | country | latitude | longitude | base_elevation | peak_elevation |
|---|---|---|---|---|---|---|
| 1 | Berlin | Germany | 52.52 | 13.405 | 34±1 metre | 121.9 metre |
| 2 | Hamburg | Germany | 53.55 | 10.0 | 6±1 metre | 116.2 metre |
| 3 | Munich | Germany | 48.1375 | 11.575 | 519±1 metre |  |
| 4 | Stuttgart | Germany | 48.7775 | 9.18 | 245±1 metre |  |
| 5 | Tübingen | Germany | 48.52 | 9.055556 | 338 metre |  |
| 6 | Potsdam | Germany | 52.4 | 13.066667 | 35 metre |  |
| 7 | Werder (Havel) | Germany | 52.383333 | 12.933333 | 31±1 metre |  |
| 8 | Paris | France | 48.856667 | 2.352222 | 28±1 metre |  |
| 9 | London | England | 51.507222 | -0.1275 | 15 metre | 245 metre |
| 10 | Vienna | Austria | 48.208333 | 16.3725 | 151 metre |  |


In [188]:
openweathermap_pattern = 'https://api.openweathermap.org/data/2.5/forecast?lat={lat}&lon={lon}&appid={openweathermap_key}&units=metric'
openweathermap_key = os.getenv('openweathermap_key')

In [189]:
try:
  cached_json
except NameError:
  cached_json = {}

In [190]:
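def getJson(url):
  # Return the cached payload if this URL was fetched before
  if cached := cached_json.get(url):
    return cached
  response = requests.get(url)
  data = response.json()  # named data so the json module is not shadowed
  cached_json[url] = data
  return data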

In [195]:
def getWeather(city=None, lat=None, lon=None):
  # When a city name is given, look up its coordinates and call again
  if city is not None:
    latlon = cities_df.set_index('name').loc[city, ['latitude', 'longitude']]
    return getWeather(lat=latlon['latitude'], lon=latlon['longitude'])

  url = openweathermap_pattern.format(lat=lat, lon=lon, openweathermap_key=openweathermap_key)
  forecast = getJson(url)
  df = pd.json_normalize(forecast['list'])
  # Rain and snow columns can be absent from the response, so add them if needed
  for optional in ['rain.3h', 'snow.3h']:
    if optional not in df:
      df[optional] = None
  df['dt'] = pd.to_datetime(df['dt'], unit='s')
  return df[['dt', 'main.feels_like', 'rain.3h', 'snow.3h']].fillna(0)

getWeather(city='Berlin')

|   | dt | main.feels_like | rain.3h | snow.3h |
|---|---|---|---|---|
| 0 | 2024-01-10 18:00:00 | -3.88 | 0.0 | 0.0 |
| 1 | 2024-01-10 21:00:00 | -3.77 | 0.0 | 0.0 |
| 2 | 2024-01-11 00:00:00 | -5.83 | 0.0 | 0.0 |
| 3 | 2024-01-11 03:00:00 | -7.14 | 0.0 | 0.0 |
| 4 | 2024-01-11 06:00:00 | -7.47 | 0.0 | 0.0 |
| 5 | 2024-01-11 09:00:00 | -6.97 | 0.0 | 0.0 |
| 6 | 2024-01-11 12:00:00 | -4.83 | 0.0 | 0.0 |
| 7 | 2024-01-11 15:00:00 | -3.95 | 0.0 | 0.0 |
| 8 | 2024-01-11 18:00:00 | -3.98 | 0.0 | 0.0 |
| 9 | 2024-01-11 21:00:00 | -3.65 | 0.0 | 0.0 |
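
The column names in this output come from `pd.json_normalize`, which flattens nested dictionaries into dotted column names. The sketch below reproduces that on a hand-written payload containing only the fields used here. The timestamps and `feels_like` values mirror the first two rows above; the `rain` entry is a made-up value added to show how missing keys become `NaN`.

```python
import pandas as pd

# Hand-written payload with only the fields this notebook uses;
# the rain value is invented for illustration
payload = {
    "list": [
        {"dt": 1704909600, "main": {"feels_like": -3.88}},
        {"dt": 1704920400, "main": {"feels_like": -3.77}, "rain": {"3h": 0.5}},
    ]
}

frame = pd.json_normalize(payload["list"])        # nested keys become dotted columns
frame["dt"] = pd.to_datetime(frame["dt"], unit="s")  # seconds since epoch to timestamps
frame["rain.3h"] = frame["rain.3h"].fillna(0)     # a missing rain entry becomes 0

print(frame)
```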


In [192]:
cities_df['name']

id
1             Berlin
2            Hamburg
3             Munich
4          Stuttgart
5           Tübingen
6            Potsdam
7     Werder (Havel)
8              Paris
9             London
10            Vienna
11            Warsaw
12            Prague
13            Zürich
14     New York City
15             Tokyo
16           Beijing
17            Moscow
18            Sydney
Name: name, dtype: object

In [197]:
def scrapeWeather():
  scrape_id = newScrapeId()
  for city_id, city in cities_df['name'].items():
    rows = (
      getWeather(city=city)
      .assign(city=city_id, scrape=scrape_id)
      .rename(columns={'main.feels_like':'T_feelslike_celsius', 'rain.3h': 'rain_3h_mm', 'snow.3h': 'snow_3h_mm'})
    ).to_sql('weather', con=connection, index=False, if_exists='append')
    print(city, rows)
        
scrapeWeather()

Berlin 40
Hamburg 40
Munich 40
Stuttgart 40
Tübingen 40
Potsdam 40
Werder (Havel) 40
Paris 40
London 40
Vienna 40
Warsaw 40
Prague 40
Zürich 40
New York City 40
Tokyo 40
Beijing 40
Moscow 40
Sydney 40
