# Challenge

Utilise your web scraping skills to gather information about three German cities – Berlin, Hamburg, and Munich – from Wikipedia. You will start by extracting the population of each city and then expand the scope of your data gathering to include latitude and longitude, country, and possibly other relevant details.

1. Population Scraping

  1.1. Begin by scraping the population of each city from their respective Wikipedia pages:

 - Berlin: https://en.wikipedia.org/wiki/Berlin
 - Hamburg: https://en.wikipedia.org/wiki/Hamburg
 - Munich: https://en.wikipedia.org/wiki/Munich

  1.2. Once you have scraped the population of each city, reflect on the similarities and patterns in how the population data is accessed across the three pages. Also analyse the URLs to identify what they have in common. Then write a single loop that retrieves the population for all three cities in one run.

2. Data Organisation

  Use a pandas DataFrame to store the extracted population data. Make sure the data is clean and consistently formatted: remove any unnecessary characters or symbols and check that each column has an appropriate data type.

3. Further Enhancement

  3.1. Expand the scope of your data gathering by extracting other relevant information for each city:

 - Latitude and longitude
 - Country of location

  3.2. Create a function from the loop and DataFrame to encapsulate the scraping process. This function can be used repeatedly to fetch updated data whenever necessary. It should return a clean, properly formatted DataFrame.

4. Global Data Scraping

  With your robust scraping skills now honed, venture beyond the confines of Germany and explore other cities around the world. While the extraction methodology for German cities may follow a consistent pattern, this may not be the case for cities from different countries. Can you make a function that returns a clean DataFrame of information for cities worldwide?
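Because pages for cities outside Germany may not follow the same layout, an extractor written for one page can fail on another. BeautifulSoup's `find` returns `None` when nothing matches, so a chained call then raises `AttributeError`; indexing an empty result list raises `IndexError`; and `int()` raises `ValueError` on unexpected text. The sketch below is one possible way to keep a single odd page from stopping the whole run. The helper name `safe_extract` and the stand-in extractor `first_number` are hypothetical and not part of the challenge.

```python
def safe_extract(extract, page, default=None):
    """Run one extractor and return `default` if the expected element is missing.

    `extract` is any callable that takes a parsed page; `default` is the
    placeholder stored when the page does not follow the expected layout.
    """
    try:
        return extract(page)
    except (AttributeError, IndexError, ValueError):
        return default


def first_number(text):
    # Stand-in extractor for illustration: int() raises ValueError on non-numeric text.
    return int(text.split()[0])


print(safe_extract(first_number, "3576873 people"))
print(safe_extract(first_number, "unknown"))
```

Missing values then show up as `None` in the collected lists (and as missing values in the DataFrame), which makes it easy to see which cities need a different extraction rule.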

## 1. Population Scraping

In [1]:
import pandas as pd
import requests
import re
from bs4 import BeautifulSoup

In [2]:
def get_pop_data(s):
    # get the infobox
    ibox = s.find_all('table', class_='infobox ib-settlement vcard')

    # get the population box
    popbox = ibox[0].find(string='Population').parent.find_next('td')

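    # strip thousands separators, trailing text, and footnote markers, then cast to int
    pop_text = popbox.get_text(strip=True)
    pop_text = pop_text.replace(',', '')  # '3,576,873' -> '3576873'
    pop_text = pop_text.split(' ')[0]     # drop anything after the first space
    pop_text = pop_text.split('[')[0]     # drop footnote markers such as '[1]'
    pop = int(pop_text)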

    return pop


def get_coordinates(s):
    lat_dms = s.find('span', class_='latitude').get_text(strip=True)
    lon_dms = s.find('span', class_='longitude').get_text(strip=True)
    lat = dms_to_decimal(lat_dms)
    lon = dms_to_decimal(lon_dms)
    return lat, lon


def dms_to_decimal(dms_str):
    """
    Convert degrees, minutes, and seconds coordinates in string format to decimal degrees.
    
    Parameters:
    - dms_str: string representing coordinates in the format 'dd°mm′ss″N'.

    Returns:
    - decimal_degrees: float representing decimal degrees.
    """
    dms_parts = dms_str.split('°')
    degrees = int(dms_parts[0])
    mm_ss_parts = dms_parts[1].split('′')
    minutes = int(mm_ss_parts[0])
    if '″' in mm_ss_parts[1]:
        mm_dir_part = mm_ss_parts[1].split('″')
        seconds = float(mm_dir_part[0])  # seconds may carry a decimal part
        direction = mm_dir_part[-1]
    else:
        seconds = 0
        direction = mm_ss_parts[-1]
    
    # Determine sign based on direction
    sign = 1 if direction in ['N', 'E'] else -1
    
    # Convert degrees, minutes, and seconds to decimal degrees
    decimal_degrees = sign * (degrees + (minutes / 60) + (seconds / 3600))
    
    return decimal_degrees


def get_country(s):
    # get the cell that follows the first 'Country' header on the page
    countrybox = s.find('th', string='Country').find_next('td')
    country = countrybox.get_text(strip=True)

    return country
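
The string splitting in `dms_to_decimal` can also be expressed as a single regular expression. The following is a minimal sketch of that alternative, not a replacement for the function above: the name `parse_dms` is hypothetical, and it assumes coordinates use the `°`, `′` and `″` symbols followed by one of `N`, `S`, `E`, `W`. Unlike the original, it rejects strings it does not recognise instead of silently treating an unknown direction as negative.

```python
import re

# degrees, optional minutes, optional (possibly fractional) seconds, direction
_DMS = re.compile(r"(\d+)°(?:(\d+)′)?(?:(\d+(?:\.\d+)?)″)?([NSEW])")


def parse_dms(dms_str):
    """Convert a string such as '52°31′12″N' to signed decimal degrees."""
    match = _DMS.fullmatch(dms_str.strip())
    if match is None:
        raise ValueError(f"unrecognised coordinate: {dms_str!r}")
    degrees, minutes, seconds, direction = match.groups()
    value = int(degrees) + int(minutes or 0) / 60 + float(seconds or 0) / 3600
    # South and west are negative by convention
    return -value if direction in "SW" else value


print(parse_dms("52°31′12″N"))
print(parse_dms("10°00′E"))
```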
    

In [3]:
url_base = 'https://en.wikipedia.org/wiki/'

names_de = ['Berlin', 'Hamburg', 'Munich',
            'Cologne', 'Frankfurt', 'Stuttgart',
            'Düsseldorf', 'Leipzig', 'Dortmund']

names_eu = ['Paris', 'Madrid', #'Rome',
            'Amsterdam', 'Warsaw', 'Vienna',
            'Zürich', 'Prague']

sites = []
cities = []
pops = []
lats, lons = [], []
countries = []

# build one URL per city from the shared base
urls = [url_base + name for name in names_de + names_eu]

for url in urls:
    # get site
    print(url)
    site = requests.get(url)
    sites.append(site)

    # get city name
    city = url.split('/')[-1]
    cities.append(city)

    # generate soup
    soup = BeautifulSoup(site.content, 'html.parser')

    # get pop
    pop = get_pop_data(soup)
    pops.append(pop)

    # get coordinates
    lat, lon = get_coordinates(soup)
    lats.append(lat)
    lons.append(lon)

    # get countries
    country = get_country(soup)
    countries.append(country)

#print(sites, cities, pops)

https://en.wikipedia.org/wiki/Berlin
https://en.wikipedia.org/wiki/Hamburg
https://en.wikipedia.org/wiki/Munich
https://en.wikipedia.org/wiki/Cologne
https://en.wikipedia.org/wiki/Frankfurt
https://en.wikipedia.org/wiki/Stuttgart
https://en.wikipedia.org/wiki/Düsseldorf
https://en.wikipedia.org/wiki/Leipzig
https://en.wikipedia.org/wiki/Dortmund
https://en.wikipedia.org/wiki/Paris
https://en.wikipedia.org/wiki/Madrid
https://en.wikipedia.org/wiki/Amsterdam
https://en.wikipedia.org/wiki/Warsaw
https://en.wikipedia.org/wiki/Vienna
https://en.wikipedia.org/wiki/Zürich
https://en.wikipedia.org/wiki/Prague
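
The loop above passes names such as `Düsseldorf` and `Zürich` straight into the URL and parses whatever comes back. As a hedged sketch, the request step could be made more explicit by percent-encoding the page title, setting a timeout, and raising on HTTP errors so that an error page is never parsed as if it were a city article. The names `build_url`, `fetch_html` and the `HEADERS` value are assumptions for illustration; the `get` parameter exists only so the function can be exercised without network access.

```python
from urllib.parse import quote

import requests

URL_BASE = "https://en.wikipedia.org/wiki/"
# Hypothetical identification string; replace with your own project and contact details.
HEADERS = {"User-Agent": "city-scraper-exercise/0.1 (contact: you@example.com)"}


def build_url(name):
    """Percent-encode the page title so non-ASCII names form a valid URL."""
    # Wikipedia page titles use underscores in place of spaces
    return URL_BASE + quote(name.replace(" ", "_"))


def fetch_html(name, get=requests.get, timeout=10):
    """Download one page and fail loudly on HTTP errors."""
    response = get(build_url(name), headers=HEADERS, timeout=timeout)
    response.raise_for_status()
    return response.text


print(build_url("Düsseldorf"))
```

`requests` usually copes with non-ASCII URLs on its own, as the output above suggests, so the explicit encoding mainly makes the behaviour visible and predictable.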


## 2. Data Organisation

In [4]:
df = pd.DataFrame({'city': cities,
                   'pop': pops,
                   'lat': lats,
                   'lon': lons,
                   'country': countries})
#df.info()
df

|   | city | pop | lat | lon | country |
|---|------|-----|-----|-----|---------|
| 0 | Berlin | 3576873 | 52.52 | 13.405 | Germany |
| 1 | Hamburg | 1945532 | 53.55 | 10.0 | Germany |
| 2 | Munich | 1512491 | 48.1375 | 11.575 | Germany |
| 3 | Cologne | 1073096 | 50.936389 | 6.952778 | Germany |
| 4 | Frankfurt | 773068 | 50.110556 | 8.682222 | Germany |
| 5 | Stuttgart | 626275 | 48.7775 | 9.18 | Germany |
| 6 | Düsseldorf | 619477 | 51.233333 | 6.783333 | Germany |
| 7 | Leipzig | 601866 | 51.34 | 12.375 | Germany |
| 8 | Dortmund | 586852 | 51.513889 | 7.465278 | Germany |
| 9 | Paris | 2102650 | 48.856667 | 2.352222 | France |
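
Here the values are already numeric because the helper functions clean them before they reach the DataFrame. If the raw strings were stored instead, the cleaning and type conversion asked for in task 2 could be done in pandas itself. The sketch below assumes raw text shaped like infobox cells; the footnote markers in the sample data and the function name `clean_city_frame` are made up for illustration.

```python
import pandas as pd

# Illustrative raw values; the footnote markers are invented for this example.
raw = pd.DataFrame({
    "city": ["Berlin", "Hamburg", "Munich"],
    "pop": ["3,576,873[1]", "1,945,532", "1,512,491 [2]"],
    "lat": ["52.52", "53.55", "48.1375"],
})


def clean_city_frame(frame):
    """Return a copy with numeric columns stripped of debris and cast to numeric dtypes."""
    out = frame.copy()
    out["pop"] = (
        out["pop"]
        .str.replace(r"\[[^\]]*\]", "", regex=True)  # drop footnote markers such as [1]
        .str.replace(r"\D", "", regex=True)          # drop separators and whitespace
        .astype("int64")
    )
    out["lat"] = out["lat"].astype("float64")
    return out


clean = clean_city_frame(raw)
print(clean.dtypes)
```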


In [5]:
df.to_csv(path_or_buf='./city_data.csv', index=False)
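
Task 3.2 asks for the loop and the DataFrame construction to be wrapped in a function, which the cells above leave at the top level. One possible shape is sketched below. The names `scrape_cities`, `fetch_page` and `extractors` are hypothetical; page retrieval is passed in as a callable so the function can be tried out without network access, and the demo uses plain dictionaries in place of parsed pages.

```python
import pandas as pd


def scrape_cities(names, fetch_page, extractors):
    """Build one DataFrame row per city.

    names:      iterable of city names
    fetch_page: callable mapping a city name to a parsed page
    extractors: dict of column name -> callable that takes the parsed page
    """
    rows = []
    for name in names:
        page = fetch_page(name)
        row = {"city": name}
        for column, extract in extractors.items():
            row[column] = extract(page)
        rows.append(row)
    return pd.DataFrame(rows)


# Demo with stand-in pages instead of real requests
fake_pages = {"Berlin": {"pop": 3576873}, "Hamburg": {"pop": 1945532}}
demo = scrape_cities(fake_pages, fake_pages.get, {"pop": lambda page: page["pop"]})
print(demo)
```

With the helpers defined earlier, `fetch_page` could be a small function that requests `url_base + name` and returns the `BeautifulSoup` object, and `extractors` could be `{'pop': get_pop_data, 'country': get_country}`. Calling the function again then fetches fresh data and returns a new DataFrame.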