Note: The purpose of this project is to learn how to work with APIs and JSON files.

You are working in a startup developing an e-scooter-sharing system. It aspires to operate in the most populous cities all around the world. In each city, your company will have hundreds of e-scooters parked in the streets and allow users to rent them by the minute.

The company wants to anticipate scooter movements as much as possible. Predictive modelling is certainly on the roadmap, but the first step is to collect more data, transform it and store it appropriately. This is where you come in: your task will be to collect data from external sources that can potentially help your company predict e-scooter movements. Since the data is needed every day, in real time, and must be accessible to everyone in the company, the challenge is going to be assembling and automating a data pipeline in the cloud.


We can divide this requirement into 5 parts:

1. Web scraping to collect demographic data;
2. Collect weather data using the OWM API;
3. Collect flights data using the AeroDataBox API;
4. Store the data in a Postgres database;
5. Create a pipeline and automate it. [Assignment]

In [41]:
import pandas as pd
import requests
from datetime import datetime, timedelta
from IPython.display import JSON
import os
from dotenv import load_dotenv
from bs4 import BeautifulSoup as bs
import unicodedata
import numpy as np

In [30]:
load_dotenv()

True

In [31]:
flight_api_key = os.getenv('flight_api_key')
OWM_key = os.getenv('OWM_key')

In [34]:
# Check that the key was loaded without echoing the secret itself
OWM_key is not None

True

In [36]:
api_input = pd.read_csv("data/api_inputs.csv")
api_input

Unnamed: 0,Name,WikiData_code,ISO_3166_code,airport_icao
0,Berlin,Q64,DE,EDDB
1,London,Q84,GB,EGLC
2,Madrid,Q2807,ES,LEMD
3,Paris,Q90,FR,LFPG


## Web Scraping: Collect demographical data

For the web scraping, a library called Beautiful Soup (BS) was used. But what is Beautiful Soup?

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Using this library, population data was collected for the cities where the company operates, in order to estimate how many people are potential customers. Wikipedia was used as the source, and the data were obtained for Berlin, London, Madrid and Paris.

The objective was to use Python to download the HTML document and then find and extract the required data. To use Beautiful Soup, it is first necessary to inspect the web page and identify the HTML elements and CSS selectors needed.

### Population data

In [42]:
cities = ['Berlin', 'London', 'Madrid', 'Paris']

def city_info(soup):
    """Collect infobox fields from a parsed Wikipedia city page."""
    ret_dict = {}
    ret_dict['city'] = soup.h1.get_text()

    # Mayor: the value cell is the next sibling of the matching label
    mayor_label = soup.select_one('.mergedrow:-soup-contains("Mayor")>.infobox-label')
    if mayor_label is not None:
        mayor_name_html = mayor_label.find_next_sibling()
        ret_dict['mayor'] = unicodedata.normalize('NFKD', mayor_name_html.get_text())

    # City area
    area_label = soup.select_one('.mergedrow:-soup-contains("City")>.infobox-label')
    if area_label is not None:
        area = area_label.find_next_sibling('td').get_text()
        ret_dict['city_size'] = unicodedata.normalize('NFKD', area)

    # Elevation
    elevation_cell = soup.select_one('.mergedtoprow:-soup-contains("Elevation")>.infobox-data')
    if elevation_cell is not None:
        ret_dict['elevation'] = unicodedata.normalize('NFKD', elevation_cell.get_text())

    # City population: first data cell after the "Population" header row
    population_row = soup.select_one('.mergedtoprow:-soup-contains("Population")')
    if population_row is not None:
        ret_dict['city_population'] = population_row.find_next('td').get_text()

    # Urban and metro population
    urban_label = soup.select_one('.infobox-label>[title^=Urban]')
    if urban_label is not None:
        ret_dict['urban_population'] = urban_label.find_next('td').get_text()

    metro_label = soup.select_one('.infobox-label>[title^=Metro]')
    if metro_label is not None:
        ret_dict['metro_population'] = metro_label.find_next('td').get_text()

    # Coordinates
    latitude = soup.select_one('.latitude')
    if latitude is not None:
        ret_dict['lat'] = latitude.get_text()

    longitude = soup.select_one('.longitude')
    if longitude is not None:
        ret_dict['long'] = longitude.get_text()

    return ret_dict

list_of_city_info = []
for city in cities:
    url = 'https://en.wikipedia.org/wiki/{}'.format(city)
    web = requests.get(url)
    # the parser name belongs to Beautiful Soup, not to requests.get
    soup = bs(web.content, 'html.parser')
    list_of_city_info.append(city_info(soup))
df_cities = pd.DataFrame(list_of_city_info)
# df_cities = df_cities.set_index('city')
df_cities

Unnamed: 0,city,mayor,city_size,elevation,city_population,urban_population,metro_population,lat,long
0,Berlin,Franziska Giffey (SPD),891.3 km2 (344.1 sq mi),34 m (112 ft),3677472,4473101,6144600,52°31′12″N,13°24′18″E
1,London,Greater London Authority• Mayor Sadiq Khan (L)...,Greater London (ceremonial county)City of London,36 ft (11 m),"8,799,800[1]",9787426,"14,257,962 (London metropolitan area)",51°30′26″N,0°7′39″W
2,Madrid,José Luis Martínez-Almeida (PP),,"650 m (2,130 ft)",3223334,"6,211,000[2]","6,791,667[1]",40°25′00″N,03°42′09″W
3,Paris,Anne Hidalgo (PS),,28–131 m (92–430 ft) (avg. 78 m or 256 ft),2165423,10858852,13024518,48°51′24″N,2°21′08″E
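
The scraped values above still contain footnote markers (`[1]`), thousands separators and coordinates in degrees/minutes/seconds. The following is a minimal sketch of how they could be cleaned before storage, using only the standard library. The helper names `parse_population` and `dms_to_decimal` are illustrative and not part of the original notebook.

```python
import re

def parse_population(text):
    """Return the first integer in a scraped population string, or None.

    Footnote markers such as "[1]" and trailing notes are ignored.
    """
    match = re.search(r'\d[\d,]*', text)
    if match is None:
        return None
    return int(match.group().replace(',', ''))

def dms_to_decimal(text):
    """Convert a coordinate such as 52°31′12″N to decimal degrees."""
    match = re.fullmatch(r'(\d+)°(\d+)′(?:(\d+)″)?([NSEW])', text.strip())
    if match is None:
        return None
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60 + int(seconds or 0) / 3600
    # South and West are negative by convention
    return -value if hemisphere in 'SW' else value

print(parse_population('8,799,800[1]'))                            # 8799800
print(parse_population('14,257,962 (London metropolitan area)'))   # 14257962
print(round(dms_to_decimal('0°7′39″W'), 4))                        # -0.1275
```

With pandas, these helpers could be applied column by column, for example `df_cities['lat'].map(dms_to_decimal)`.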


## Weather data using the OWM (OpenWeatherMap) API

To collect weather data, the OpenWeatherMap (OWM) web API was used. You only need to create an account to access some of the free services. You can subscribe and start using OpenWeatherMap as soon as you have an account on RapidAPI. The AeroDataBox API used later in this project also requires an API key, which you can get for free from RapidAPI by subscribing to the BASIC plan. That plan gives you 200 free requests, so be careful while testing your code.

Some of the services in the OWM API are free, for example the [3h weather forecast for the next 5 days](https://openweathermap.org/forecast5#5days). To make requests to the API, Python's requests library was used.

In [37]:
city_name = api_input["Name"].tolist()
city_name

['Berlin', 'London', 'Madrid', 'Paris']

In [38]:
country_name = api_input["ISO_3166_code"].tolist()
country_name

['DE', 'GB', 'ES', 'FR']

In [40]:
# use the 5 day / 3 hour forecast endpoint, as for the other cities
response_berlin = requests.get(f'http://api.openweathermap.org/data/2.5/forecast/?q={city_name[0]},{country_name[0]}&appid={OWM_key}&units=metric&lang=en')

response_berlin.json()

{'cod': 401,
 'message': 'Invalid API key. Please see https://openweathermap.org/faq#error401 for more info.'}

In [None]:
forecast_api = response_berlin.json()['list']
weather_info = []

# datetime, temperature, wind, prob_perc, rain_qty, snow = [], [], [], [], [], []
for forecast_3h in forecast_api: 
    weather_hour = {}
    # datetime utc
    weather_hour['datetime'] = forecast_3h['dt_txt']
    # temperature 
    weather_hour['temperature'] = forecast_3h['main']['temp']
    # wind
    weather_hour['wind'] = forecast_3h['wind']['speed']
    # probability of precipitation (0 when the key is missing)
    weather_hour['prob_perc'] = float(forecast_3h.get('pop', 0))
    # rain volume over the 3 h interval (0 when the key is missing)
    weather_hour['rain_qty'] = float(forecast_3h.get('rain', {}).get('3h', 0))
    # snow volume over the 3 h interval (0 when the key is missing)
    weather_hour['snow'] = float(forecast_3h.get('snow', {}).get('3h', 0))
    weather_hour['municipality_iso_country'] = city_name[0] + ',' + country_name[0]
    weather_info.append(weather_hour)    
    
weather_data_berlin = pd.DataFrame(weather_info)
weather_data_berlin.head()

In [None]:
response_london = requests.get(f'http://api.openweathermap.org/data/2.5/forecast/?q={city_name[1]},{country_name[1]}&appid={OWM_key}&units=metric&lang=en')

response_london

In [None]:
#For London
forecast_api = response_london.json()['list']
# look for the fields that could be relevant:
# better field descriptions https://www.weatherbit.io/api/weather-forecast-5-day

weather_info = []

# datetime, temperature, wind, prob_perc, rain_qty, snow = [], [], [], [], [], []
for forecast_3h in forecast_api: 
    weather_hour = {}
    # datetime utc
    weather_hour['datetime'] = forecast_3h['dt_txt']
    # temperature 
    weather_hour['temperature'] = forecast_3h['main']['temp']
    # wind
    weather_hour['wind'] = forecast_3h['wind']['speed']
    # probability of precipitation (0 when the key is missing)
    weather_hour['prob_perc'] = float(forecast_3h.get('pop', 0))
    # rain volume over the 3 h interval (0 when the key is missing)
    weather_hour['rain_qty'] = float(forecast_3h.get('rain', {}).get('3h', 0))
    # snow volume over the 3 h interval (0 when the key is missing)
    weather_hour['snow'] = float(forecast_3h.get('snow', {}).get('3h', 0))
    weather_hour['municipality_iso_country'] = city_name[1] + ',' + country_name[1]
    weather_info.append(weather_hour)    
    
weather_data_london = pd.DataFrame(weather_info)
weather_data_london.head()

In [None]:
response_madrid = requests.get(f'http://api.openweathermap.org/data/2.5/forecast/?q={city_name[2]},{country_name[2]}&appid={OWM_key}&units=metric&lang=en')

response_madrid.json()

In [None]:
#For Madrid
forecast_api = response_madrid.json()['list']
# look for the fields that could be relevant:
# better field descriptions https://www.weatherbit.io/api/weather-forecast-5-day

weather_info = []

# datetime, temperature, wind, prob_perc, rain_qty, snow = [], [], [], [], [], []
for forecast_3h in forecast_api: 
    weather_hour = {}
    # datetime utc
    weather_hour['datetime'] = forecast_3h['dt_txt']
    # temperature 
    weather_hour['temperature'] = forecast_3h['main']['temp']
    # wind
    weather_hour['wind'] = forecast_3h['wind']['speed']
    # probability of precipitation (0 when the key is missing)
    weather_hour['prob_perc'] = float(forecast_3h.get('pop', 0))
    # rain volume over the 3 h interval (0 when the key is missing)
    weather_hour['rain_qty'] = float(forecast_3h.get('rain', {}).get('3h', 0))
    # snow volume over the 3 h interval (0 when the key is missing)
    weather_hour['snow'] = float(forecast_3h.get('snow', {}).get('3h', 0))
    weather_hour['municipality_iso_country'] = city_name[2] + ',' + country_name[2]
    weather_info.append(weather_hour)    
    
weather_data_madrid = pd.DataFrame(weather_info)
weather_data_madrid.head()

In [None]:
response_paris = requests.get(f'http://api.openweathermap.org/data/2.5/forecast/?q={city_name[3]},{country_name[3]}&appid={OWM_key}&units=metric&lang=en')

response_paris.json()

In [None]:
#For Paris
forecast_api = response_paris.json()['list']
# look for the fields that could be relevant:
# better field descriptions https://www.weatherbit.io/api/weather-forecast-5-day

weather_info = []

# datetime, temperature, wind, prob_perc, rain_qty, snow = [], [], [], [], [], []
for forecast_3h in forecast_api: 
    weather_hour = {}
    # datetime utc
    weather_hour['datetime'] = forecast_3h['dt_txt']
    # temperature 
    weather_hour['temperature'] = forecast_3h['main']['temp']
    # wind
    weather_hour['wind'] = forecast_3h['wind']['speed']
    # probability of precipitation (0 when the key is missing)
    weather_hour['prob_perc'] = float(forecast_3h.get('pop', 0))
    # rain volume over the 3 h interval (0 when the key is missing)
    weather_hour['rain_qty'] = float(forecast_3h.get('rain', {}).get('3h', 0))
    # snow volume over the 3 h interval (0 when the key is missing)
    weather_hour['snow'] = float(forecast_3h.get('snow', {}).get('3h', 0))
    weather_hour['municipality_iso_country'] = city_name[3] + ',' + country_name[3]
    weather_info.append(weather_hour)    
    
weather_data_paris = pd.DataFrame(weather_info)
weather_data_paris.head()

The weather dataframes for the four cities (Berlin, London, Madrid and Paris) are joined into a single dataframe named "weather_data".

In [None]:
weather_data = pd.concat([weather_data_berlin, weather_data_london, weather_data_madrid, weather_data_paris], axis=0)
weather_data.reset_index(drop = True, inplace=True) 
weather_data
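
The four per-city cells above repeat the same parsing logic. As a sketch, that logic could live in one function that takes the decoded JSON plus the city and country code. The name `parse_forecast` and the sample dictionary are illustrative; the sample only mimics the fields the notebook reads and its values are made up.

```python
def parse_forecast(forecast_json, city, country):
    """Flatten a 5 day / 3 hour forecast response into a list of row dicts."""
    rows = []
    for entry in forecast_json.get('list', []):
        rows.append({
            'datetime': entry['dt_txt'],
            'temperature': entry['main']['temp'],
            'wind': entry['wind']['speed'],
            # 'pop', 'rain' and 'snow' may be absent; default to 0
            'prob_perc': float(entry.get('pop', 0)),
            'rain_qty': float(entry.get('rain', {}).get('3h', 0)),
            'snow': float(entry.get('snow', {}).get('3h', 0)),
            'municipality_iso_country': f'{city},{country}',
        })
    return rows

# Hand-written sample shaped like a forecast response (values are made up)
sample = {'list': [
    {'dt_txt': '2022-11-30 00:00:00', 'main': {'temp': 3.2},
     'wind': {'speed': 4.1}, 'pop': 0.2, 'rain': {'3h': 0.5}},
    {'dt_txt': '2022-11-30 03:00:00', 'main': {'temp': 2.8},
     'wind': {'speed': 3.7}},
]}
rows = parse_forecast(sample, 'Berlin', 'DE')
print(rows[1]['rain_qty'])  # 0.0
```

In the notebook, the rows returned for each city could be collected in one list and passed to `pd.DataFrame` once, which would replace both the four cells and the `pd.concat` step.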

## Collect flights data using the Aerodatabox API

The AeroDataBox API makes it possible to request data about arriving flights. This API is accessible through RapidAPI, with 200 API requests per month in the free subscription plan. The steps needed to get the data are the same as for the OWM API: request the data from the API, then clean the JSON response to keep only the relevant fields.

The AeroDataBox API only returns flight data for a window of at most 12 hours. The window used here starts at the current hour and ends 11 hours later.

The API takes the airport ICAO code as input. The codes for European airports can be found at this [link](https://airmundo.com/en/blog/airport-codes-european-airports/). In this project, only arrivals were used.

The pandas library is used to read the HTML page and obtain the IATA and ICAO codes for the European airports:

In [4]:
url = 'https://airmundo.com/en/blog/airport-codes-european-airports/'

header = {
  "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
  "X-Requested-With": "XMLHttpRequest"
}

r = requests.get(url, headers=header)

airports_codes = pd.read_html(r.text)

In [5]:
airports_codes_df = airports_codes[0]
airports_codes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264 entries, 0 to 263
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Airport            264 non-null    object
 1   Country            264 non-null    object
 2   IATA airport code  264 non-null    object
 3   ICAO airport code  264 non-null    object
dtypes: object(4)
memory usage: 8.4+ KB


In [6]:
airports_codes_df.head()

Unnamed: 0,Airport,Country,IATA airport code,ICAO airport code
0,Tirana Airport,Albania,TIA,LATI
1,Yerevan Zvartnots Airport,Armenia,EVN,UDYZ
2,Graz Airport,Austria,GRZ,LOWG
3,Innsbruck Airport,Austria,INN,LOWI
4,Klagenfurt Airport,Austria,KLU,LOWK


Selecting the airports from which we want to obtain data:

In [8]:
# the filter selects United Kingdom airports, so the variable is named for GB
airport_GB = airports_codes_df.loc[airports_codes_df['Country'] == "United Kingdom"]
airport_GB.head()

Unnamed: 0,Airport,Country,IATA airport code,ICAO airport code
239,Aberdeen Airport,United Kingdom,ABZ,EGPD
240,Belfast City Airport,United Kingdom,BHD,EGAC
241,Belfast International Airport,United Kingdom,BFS,EGAA
242,Birmingham Airport,United Kingdom,BHX,EGBB
243,Bristol Airport,United Kingdom,BRS,EGGD


The ICAO codes needed in this project are:

In [10]:
# strip stray whitespace from the codes read from the CSV
icao = api_input["airport_icao"].str.strip().tolist()
icao

['EDDB', 'EGLC', 'LEMD', 'LFPG']

In [20]:
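# Note: the names are swapped relative to their meaning. to_local_time holds the
# start of the window (now) and from_local_time holds the end (now + 11 h);
# the request URLs below pass them in start/end order.
to_local_time = datetime.now().strftime('%Y-%m-%dT%H:00')
from_local_time = (datetime.now() + timedelta(hours=11)).strftime('%Y-%m-%dT%H:00')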

querystring = {"withLeg":"true","withCancelled":"true","withCodeshared":"true","withCargo":"true","withPrivate":"false","withLocation":"false"}

headers = {
    'x-rapidapi-host': "aerodatabox.p.rapidapi.com",
    'x-rapidapi-key': flight_api_key
    }
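
The time window and the request URL can also be built by small helpers, which makes the start/end order explicit and easy to test with a fixed timestamp. This is only a sketch: `arrival_window` and `arrivals_url` are illustrative names, and the URL pattern is the one used in the cells of this notebook.

```python
from datetime import datetime, timedelta

def arrival_window(start, hours=11):
    """Return (window_from, window_to) strings for a window starting at `start`.

    Minutes are fixed at 00, matching the format used in this notebook.
    """
    fmt = '%Y-%m-%dT%H:00'
    return start.strftime(fmt), (start + timedelta(hours=hours)).strftime(fmt)

def arrivals_url(icao_code, start, hours=11):
    """Build the request URL for one airport; whitespace in the code is stripped."""
    window_from, window_to = arrival_window(start, hours)
    return ('https://aerodatabox.p.rapidapi.com/flights/airports/icao/'
            f'{icao_code.strip()}/{window_from}/{window_to}')

print(arrivals_url(' EGLC', datetime(2022, 11, 29, 21, 30)))
# https://aerodatabox.p.rapidapi.com/flights/airports/icao/EGLC/2022-11-29T21:00/2022-11-30T08:00
```

Passing `datetime.now()` as `start` gives the same window as the cell above.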

In [None]:
berlin_icoa = "EDDB"
url = f"https://aerodatabox.p.rapidapi.com/flights/airports/icao/{berlin_icoa}/{to_local_time}/{from_local_time}"

berlin_flights= requests.request("GET", url, headers=headers, params=querystring)
berlin_flights

In [17]:
arrivals_berlin = berlin_flights.json()['arrivals']
berlin_icoa = "EDDB"

def get_flight_info(flight_json):
    # terminal (not always present in the response)
    terminal = flight_json['arrival'].get('terminal')
    # aircraft model (not always present in the response)
    aircraft = flight_json.get('aircraft', {}).get('model')

    return {
        'dep_airport':flight_json['departure']['airport']['name'],
        'sched_arr_loc_time':flight_json['arrival']['scheduledTimeLocal'],
        'terminal':terminal,
        'status':flight_json['status'],
        'aircraft':aircraft,
        'icao_code': berlin_icoa 
    }

# [get_flight_info(flight) for flight in arrivals_berlin]
arrivals_berlin = pd.DataFrame([get_flight_info(flight) for flight in arrivals_berlin])
arrivals_berlin

Unnamed: 0,dep_airport,sched_arr_loc_time,terminal,status,aircraft,icao_code
0,Paris,2022-11-29 22:40+01:00,1,Unknown,Airbus A220-300,EDDB
1,Stuttgart,2022-11-29 22:00+01:00,1,Unknown,Airbus A320,EDDB
2,Cologne,2022-11-29 22:20+01:00,1,Unknown,Airbus A320,EDDB
3,Madrid,2022-11-29 22:45+01:00,1,Unknown,Airbus A320,EDDB
4,Munich,2022-11-29 22:35+01:00,1,Unknown,Airbus A320,EDDB
5,Frankfurt-am-Main,2022-11-29 22:25+01:00,1,Unknown,Airbus A320,EDDB
6,Zurich,2022-11-29 22:20+01:00,1,Unknown,Airbus A220-300,EDDB
7,Vienna,2022-11-29 22:15+01:00,1,Unknown,Airbus A320,EDDB
8,Brussels,2022-11-29 22:20+01:00,1,Unknown,Airbus A320,EDDB
9,Barcelona,2022-11-29 22:30+01:00,1,Unknown,Airbus A320,EDDB


Making the calls for the other cities

In [21]:
london_icoa = "EGLC"
url = f"https://aerodatabox.p.rapidapi.com/flights/airports/icao/{london_icoa}/{to_local_time}/{from_local_time}"

london_flights= requests.request("GET", url, headers=headers, params=querystring)
london_flights

<Response [200]>

In [22]:
arrivals_london = london_flights.json()['arrivals']
london_icoa = icao[1].strip()  # strip stray whitespace from the CSV value

def get_flight_info(flight_json):
    # terminal (not always present in the response)
    terminal = flight_json['arrival'].get('terminal')
    # aircraft model (not always present in the response)
    aircraft = flight_json.get('aircraft', {}).get('model')

    return {
        'dep_airport':flight_json['departure']['airport']['name'],
        'sched_arr_loc_time':flight_json['arrival']['scheduledTimeLocal'],
        'terminal':terminal,
        'status':flight_json['status'],
        'aircraft':aircraft,
        #'icao_code': london_icoa 
    }

# [get_flight_info(flight) for flight in arrivals_berlin]
arrivals_london = pd.DataFrame([get_flight_info(flight) for flight in arrivals_london])
arrivals_london

Unnamed: 0,dep_airport,sched_arr_loc_time,terminal,status,aircraft
0,Duesseldorf,2022-11-30 06:55+00:00,,Unknown,Embraer 190
1,Rotterdam,2022-11-30 07:00+00:00,,Unknown,Embraer 190
2,Berlin,2022-11-30 07:50+00:00,,Unknown,Embraer 190
3,Amsterdam,2022-11-30 07:15+00:00,,Unknown,Embraer 190
4,Luxembourg,2022-11-30 07:05+00:00,,Unknown,Bombardier Dash 8 Q400 / DHC-8-400
5,Frankfurt-am-Main,2022-11-30 07:35+00:00,,Unknown,Embraer 190
6,Zurich,2022-11-30 07:30+00:00,,Unknown,Embraer 190
7,Dublin,2022-11-30 08:35+00:00,,Unknown,Embraer 190
8,Edinburgh,2022-11-30 08:15+00:00,,Unknown,Embraer 190
9,Edinburgh,2022-11-30 08:35+00:00,,Unknown,Embraer 190


In [23]:
madrid_icoa = "LEMD"
url = f"https://aerodatabox.p.rapidapi.com/flights/airports/icao/{madrid_icoa}/{to_local_time}/{from_local_time}"

madrid_flights= requests.request("GET", url, headers=headers, params=querystring)
madrid_flights

<Response [200]>

In [24]:
arrivals_madrid = madrid_flights.json()['arrivals']
madrid_icoa = icao[2].strip()  # index 2 is Madrid (icao[1] is London)

def get_flight_info(flight_json):
    # terminal (not always present in the response)
    terminal = flight_json['arrival'].get('terminal')
    # aircraft model (not always present in the response)
    aircraft = flight_json.get('aircraft', {}).get('model')

    return {
        'dep_airport':flight_json['departure']['airport']['name'],
        'sched_arr_loc_time':flight_json['arrival']['scheduledTimeLocal'],
        'terminal':terminal,
        'status':flight_json['status'],
        'aircraft':aircraft,
        #'icao_code': madrid_icoa 
    }

# [get_flight_info(flight) for flight in arrivals_berlin]
arrivals_madrid = pd.DataFrame([get_flight_info(flight) for flight in arrivals_madrid])
arrivals_madrid

Unnamed: 0,dep_airport,sched_arr_loc_time,terminal,status,aircraft
0,Málaga,2022-11-29 22:55+01:00,4,Expected,Airbus A320 (sharklets)
1,Almería,2022-11-29 22:50+01:00,4,Expected,Bombardier CRJX
2,Tenerife Island,2022-11-29 22:55+01:00,4,Expected,Airbus A320
3,Palma De Mallorca,2022-11-29 22:35+01:00,2,Expected,Boeing 737-800 (winglets)
4,Culleredo,2022-11-29 22:20+01:00,2,Expected,Boeing 737-800 (winglets)
...,...,...,...,...,...
315,Nice,2022-11-30 08:35+01:00,4,Expected,Bombardier CRJX
316,Marseille,2022-11-30 08:40+01:00,4,Expected,Bombardier CRJX
317,Santa Cruz de la Sierra,2022-11-30 04:40+01:00,1,Expected,Boeing 787-8
318,Lima,2022-11-30 05:10+01:00,1,Expected,Boeing 787-9


In [25]:
paris_icoa = "LFPG"
url = f"https://aerodatabox.p.rapidapi.com/flights/airports/icao/{paris_icoa}/{to_local_time}/{from_local_time}"

paris_flights= requests.request("GET", url, headers=headers, params=querystring)
paris_flights

<Response [200]>

In [26]:
arrivals_paris = paris_flights.json()['arrivals']
paris_icoa = icao[3].strip()  # index 3 is Paris (icao[1] is London)

def get_flight_info(flight_json):
    # terminal (not always present in the response)
    terminal = flight_json['arrival'].get('terminal')
    # aircraft model (not always present in the response)
    aircraft = flight_json.get('aircraft', {}).get('model')

    return {
        'dep_airport':flight_json['departure']['airport']['name'],
        'sched_arr_loc_time':flight_json['arrival']['scheduledTimeLocal'],
        'terminal':terminal,
        'status':flight_json['status'],
        'aircraft':aircraft,
        #'icao_code': paris_icoa 
    }

# [get_flight_info(flight) for flight in arrivals_berlin]
arrivals_paris = pd.DataFrame([get_flight_info(flight) for flight in arrivals_paris])
arrivals_paris

Unnamed: 0,dep_airport,sched_arr_loc_time,terminal,status,aircraft
0,Munich,2022-11-29 22:20+01:00,2F,Unknown,Airbus A220-300
1,Barcelona,2022-11-29 22:20+01:00,2F,Unknown,Airbus A320
2,Madrid,2022-11-29 22:20+01:00,2F,Unknown,Airbus A319
3,Lisbon,2022-11-29 22:15+01:00,2F,Unknown,Airbus A319
4,Milan,2022-11-29 22:05+01:00,2F,Unknown,Airbus A320
...,...,...,...,...,...
133,Luxembourg,2022-11-30 08:35+01:00,2G,Unknown,Bombardier Dash 8 Q400 / DHC-8-400
134,Munich,2022-11-30 08:25+01:00,2B,Unknown,Airbus A319
135,Milan,2022-11-30 08:05+01:00,2B,Unknown,Airbus A319
136,Barcelona,2022-11-30 08:25+01:00,3,Unknown,Airbus A320


### Airports data

In [46]:
airports_cities = (
pd.read_csv('data/airports.csv')
    .query('type == "large_airport"')
    .filter(['name','latitude_deg','longitude_deg','iso_country','iso_region','municipality','gps_code','iata_code'])
    .rename(columns={'gps_code':'icao_code'})
    .assign(municipality_iso_country = lambda x: x['municipality'] + ',' + x['iso_country'])
)
airports_cities.head()

Unnamed: 0,name,latitude_deg,longitude_deg,iso_country,iso_region,municipality,icao_code,iata_code,municipality_iso_country
10890,Honiara International Airport,-9.428,160.054993,SB,SB-CT,Honiara,AGGH,HIR,"Honiara,SB"
12461,Port Moresby Jacksons International Airport,-9.44338,147.220001,PG,PG-NCD,Port Moresby,AYPY,POM,"Port Moresby,PG"
12981,Keflavik International Airport,63.985001,-22.6056,IS,IS-2,Reykjavík,BIKF,KEF,"Reykjavík,IS"
13028,Priština Adem Jashari International Airport,42.5728,21.035801,XK,XK-01,Prishtina,BKPR,PRN,"Prishtina,XK"
17254,Guodu Air Base,36.001741,117.63201,CN,CN-37,"Xintai, Tai'an",,,"Xintai, Tai'an,CN"


In [47]:
airports_cities.query('municipality == "Berlin"')

Unnamed: 0,name,latitude_deg,longitude_deg,iso_country,iso_region,municipality,icao_code,iata_code,municipality_iso_country
20244,Berlin Brandenburg Airport,52.351389,13.493889,DE,DE-BR,Berlin,EDDB,BER,"Berlin,DE"


In [48]:
arrivals_berlin.head()

Unnamed: 0,dep_airport,sched_arr_loc_time,terminal,status,aircraft,icao_code
0,Paris,2022-11-29 22:40+01:00,1,Unknown,Airbus A220-300,EDDB
1,Stuttgart,2022-11-29 22:00+01:00,1,Unknown,Airbus A320,EDDB
2,Cologne,2022-11-29 22:20+01:00,1,Unknown,Airbus A320,EDDB
3,Madrid,2022-11-29 22:45+01:00,1,Unknown,Airbus A320,EDDB
4,Munich,2022-11-29 22:35+01:00,1,Unknown,Airbus A320,EDDB


In [49]:
arrivals_london.head()

Unnamed: 0,dep_airport,sched_arr_loc_time,terminal,status,aircraft
0,Duesseldorf,2022-11-30 06:55+00:00,,Unknown,Embraer 190
1,Rotterdam,2022-11-30 07:00+00:00,,Unknown,Embraer 190
2,Berlin,2022-11-30 07:50+00:00,,Unknown,Embraer 190
3,Amsterdam,2022-11-30 07:15+00:00,,Unknown,Embraer 190
4,Luxembourg,2022-11-30 07:05+00:00,,Unknown,Bombardier Dash 8 Q400 / DHC-8-400


Concatenating arrivals data for all airports

In [50]:
arrivals_data = pd.concat([arrivals_berlin, arrivals_london, arrivals_madrid, arrivals_paris], axis=0)
arrivals_data.reset_index(drop = True, inplace=True)
arrivals_data

Unnamed: 0,dep_airport,sched_arr_loc_time,terminal,status,aircraft,icao_code
0,Paris,2022-11-29 22:40+01:00,1,Unknown,Airbus A220-300,EDDB
1,Stuttgart,2022-11-29 22:00+01:00,1,Unknown,Airbus A320,EDDB
2,Cologne,2022-11-29 22:20+01:00,1,Unknown,Airbus A320,EDDB
3,Madrid,2022-11-29 22:45+01:00,1,Unknown,Airbus A320,EDDB
4,Munich,2022-11-29 22:35+01:00,1,Unknown,Airbus A320,EDDB
...,...,...,...,...,...,...
503,Luxembourg,2022-11-30 08:35+01:00,2G,Unknown,Bombardier Dash 8 Q400 / DHC-8-400,
504,Munich,2022-11-30 08:25+01:00,2B,Unknown,Airbus A319,
505,Milan,2022-11-30 08:05+01:00,2B,Unknown,Airbus A319,
506,Barcelona,2022-11-30 08:25+01:00,3,Unknown,Airbus A320,


In [51]:
cities = airports_cities.filter(['municipality','iso_country','municipality_iso_country']).drop_duplicates()
cities.head()

Unnamed: 0,municipality,iso_country,municipality_iso_country
10890,Honiara,SB,"Honiara,SB"
12461,Port Moresby,PG,"Port Moresby,PG"
12981,Reykjavík,IS,"Reykjavík,IS"
13028,Prishtina,XK,"Prishtina,XK"
17254,"Xintai, Tai'an",CN,"Xintai, Tai'an,CN"


In [52]:
arrivals_berlin.head()

Unnamed: 0,dep_airport,sched_arr_loc_time,terminal,status,aircraft,icao_code
0,Paris,2022-11-29 22:40+01:00,1,Unknown,Airbus A220-300,EDDB
1,Stuttgart,2022-11-29 22:00+01:00,1,Unknown,Airbus A320,EDDB
2,Cologne,2022-11-29 22:20+01:00,1,Unknown,Airbus A320,EDDB
3,Madrid,2022-11-29 22:45+01:00,1,Unknown,Airbus A320,EDDB
4,Munich,2022-11-29 22:35+01:00,1,Unknown,Airbus A320,EDDB


In [53]:
# keep the full merged result; .head() is only used for display
arrivals_data = airports_cities.merge(arrivals_data, on='icao_code', how='inner').merge(weather_data, on='municipality_iso_country', how='inner')
arrivals_data.head()

NameError: name 'weather_data' is not defined
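
Independently of the error above, an inner merge on `icao_code` silently drops every arrival whose code is missing, which is the case for the London, Madrid and Paris rows in the concatenated table. A small sketch of how to check this with pandas' `indicator` option follows; the two miniature tables are made up for illustration.

```python
import pandas as pd

# Made-up miniature versions of the airports and arrivals tables
airports = pd.DataFrame({
    'icao_code': ['EDDB', 'EGLC'],
    'municipality_iso_country': ['Berlin,DE', 'London,GB'],
})
arrivals = pd.DataFrame({
    'dep_airport': ['Paris', 'Berlin', 'Munich'],
    'icao_code': ['EDDB', 'EGLC', None],  # last row has no airport code
})

# A left join with indicator=True keeps every arrival and flags unmatched keys
checked = arrivals.merge(airports, on='icao_code', how='left', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']
print(len(unmatched))  # 1
```

Rows flagged `left_only` are the ones an inner merge would discard, so inspecting them before merging shows how much data would be lost.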

In [54]:
df_cities['municipality_iso_country'] = [
    'Berlin,DE',
    'London,GB',
    'Madrid,ES',
    'Paris,FR'    
]

## Storing Data in a Postgres Database

In [None]:
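# Connection settings are read from the environment (loaded by load_dotenv() above).
# The variable names below are placeholders; use whatever your .env file defines.
db_host = os.getenv('db_host')
db_port = os.getenv('db_port')
db_user = os.getenv('db_user')
db_password = os.getenv('db_password')
db_schema = os.getenv('db_schema')  # used as the database name in the URL

con = f'postgresql://{db_user}:{db_password}@{db_host}:{db_port}/{db_schema}'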
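
If the user name or password contains characters such as `@` or `/`, the connection string above is parsed incorrectly. A common remedy is to percent-encode those parts. The sketch below uses only the standard library; `postgres_url` is an illustrative helper and all values are placeholders, not real credentials.

```python
from urllib.parse import quote_plus

def postgres_url(user, password, host, port, database):
    """Build a postgresql:// URL, percent-encoding the user and password."""
    return (f'postgresql://{quote_plus(user)}:{quote_plus(password)}'
            f'@{host}:{port}/{database}')

# All values below are placeholders, not real credentials
print(postgres_url('scooter', 'p@ss/word', 'localhost', 5432, 'gans'))
# postgresql://scooter:p%40ss%2Fword@localhost:5432/gans
```

The resulting string can be passed as `con` to `to_sql` in the same way as the one built above, assuming SQLAlchemy and a Postgres driver are installed.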

### Update the tables

ARRIVALS

In [None]:
arrivals_berlin\
    .fillna('unknown')\
    .assign(sched_arr_loc_time = lambda x: pd.to_datetime(x['sched_arr_loc_time']))\
    .to_sql('arrivals', if_exists='append', con=con, index=False)

AIRPORTS

In [None]:
airports_cities\
    .dropna()\
    .to_sql('airports', if_exists='append', con=con, index=False)

CITIES

In [None]:
df_cities\
    .dropna()\
    .rename(
        columns={
            'lat':'latitude',
            'long':'longitude'
            }
        )\
    .to_sql('cities', con=con, if_exists='append', index=False)

WEATHER

In [None]:
weather_data\
    .assign(datetime = lambda x: pd.to_datetime(x['datetime']))\
    .to_sql('weather', if_exists='append', con=con, index=False)