# Coronavirus propagation and flight routes

<img src="https://openflights.org/demo/openflights-routedb-2048.png" width="1000px">

Following up on [Mykola's great kernel](https://www.kaggle.com/grebublin/coronavirus-propagation-visualization-forecast), I have matched the coronavirus propagation with the information in the [OpenFlights dataset](https://openflights.org/data.html). This is the most complete open dataset of flight routes I have found, even though its last update dates back to 2014. In fact, the dataset webpage warns that "the third-party that OpenFlights uses for route data ceased providing updates in June 2014. The current data is of historical value only".

Therefore, we cannot expect a precise analysis here, since flight routes may have changed dramatically since then. I am not an expert in this field, so maybe some of you can shed light on this matter.

Moreover, this dataset does not provide information on the actual traffic of these routes. It describes the routes (airline X operating from airport A to airport B), but it accounts for neither the passenger flux of these routes nor any date or time information. Thus, most of the kernel prepares the OpenFlights dataset for the visualization.

## Proof of concept
___

Before writing a single line of code, I want to check whether there is some kind of correlation between the propagation of the virus (during 2019 and 2020) and the flight routes (up to 2014). To do so, I just use the [OpenFlights interface](https://openflights.org/) to search the routes for Wuhan Tianhe International Airport.

Comparing coronavirus propagation and the flights departing from Wuhan,  we can observe that the two figures roughly match one another:

- Most regions far from the epicenter are connected through a direct flight route from Wuhan.

- We do not see any direct connection from Wuhan to North America, Oceania or Africa. We guess that these places are reached through indirect connections.

<img src="https://drive.google.com/uc?id=1piaVoYDrLLle4ROvjELGNJ-l4XP9Qex3" width="1000px">
<img src="https://drive.google.com/uc?id=1ONHcu2f5pIa5A_SRcZojN8npi0vepT7D" width="1000px">

### Imports

In [None]:
import math
import numpy as np
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
from shapely.ops import nearest_points
import networkx as nx
from tqdm.notebook import tqdm

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
sns.set_style("whitegrid")


import folium
from folium import Choropleth, Circle, Marker
from folium.plugins import HeatMap, MarkerCluster

from itertools import tee

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

## Reading coronavirus data
___
Big thanks to [Brenda So](https://www.kaggle.com/brendaso) for her original coronavirus [dataset](https://www.kaggle.com/brendaso/2019-coronavirus-dataset-01212020-01262020) and to [Mykola](https://www.kaggle.com/grebublin) for his geocoded [dataset](https://www.kaggle.com/grebublin/coronavirus-latlon-dataset). 

I will use here the geocoded dataset.

In [None]:
# just downloading the dataset, which has all the necessary data already
df = pd.read_csv("../input/coronavirus-latlon-dataset/coronavirus_cleaned_21Jan2Feb.csv", index_col=0)
# WARNING: set the coordinates for the province of Hubei to its capital Wuhan, instead of the geographical center.
# It helps to match the routes more precisely, at least from the epicenter to the rest of the world.
df.loc[df["Province/State"]=="Hubei", ["lon", "lat"]] = [114.283333, 30.583332]
# Cast into GeoDataFrame
df = gpd.GeoDataFrame(df, geometry=[Point(xy) for xy in zip(df['lon'], df['lat'])])
# Set coordinate reference system
df.crs = {'init' :'epsg:4326'}  
df.head(3)

## Reading OpenFlights data
___

The OpenFlights dataset is already available on [Kaggle](https://www.kaggle.com/open-flights/airline-database), but I have created a new Kaggle dataset using updated data from the [original sources](https://github.com/jpatokal/openflights).

#### Routes from A to B

In [None]:
columns_and_dtypes = {
    'Airline': "category", # 2-letter (IATA) or 3-letter (ICAO) code of the airline.
    'Airline ID': "category", # Unique OpenFlights identifier for airline.
    'Source Airport': "category", # 3-letter (IATA) or 4-letter (ICAO) code of the source airport.
    'Source Airport ID': "category", # Unique OpenFlights identifier for source airport.
    'Destination Airport': "category", # 3-letter (IATA) or 4-letter (ICAO) code of the destination airport.
    'Destination Airport ID': "category", # Unique OpenFlights identifier for destination airport.
    'Codeshare': "category", # "Y" if this flight is a codeshare (that is, not operated by Airline, but another carrier), empty otherwise.
    'Stops': int, # Number of stops on this flight ("0" for direct)
    'Equipment': "category" # 3-letter codes for plane type(s) generally used on this flight, separated by spaces
}
routes = pd.read_csv('../input/openflights-20200201-dump/routes.dat', 
                     names=columns_and_dtypes.keys(),
                     dtype=columns_and_dtypes)
display(routes.head(3))

#### Airports

I use the airports dataset (`airports.dat`), which only contains terminals of type "airport". Train stations, ferry terminals and unknown terminals are therefore not considered.

In [None]:
columns_and_dtypes = {
    'Airport ID': "category", # Unique OpenFlights identifier for this airport.
    'Airport': "category", # Name of airport. May or may not contain the City name.
    'City': "category", # Main city served by airport. May be spelled differently from Name.
    'Country': "category", # or territory where airport is located. See Countries to cross-reference to ISO 3166-1 codes.
    'IATA': "category", # 3-letter IATA code. Null if not assigned/unknown.
    'ICAO': "category", # 4-letter ICAO code. Null if not assigned.
    'Latitude': float, # Decimal degrees, usually to six significant digits. Negative is South, positive is North.
    'Longitude': float, # Decimal degrees, usually to six significant digits. Negative is West, positive is East.
    'Altitude': float, # In feet.
    'Timezone': "category", # Hours offset from UTC. Fractional hours are expressed as decimals, eg. India is 5.5.
    'DST': "category", # Daylight savings time. One of E (Europe), A (US/Canada), S (South America), O (Australia), Z (New Zealand), N (None) or U (Unknown). See also: Help: Time
    'Tz database time zone': "category", # Timezone in "tz" (Olson) format, eg. "America/Los_Angeles".
    'Type': "category", # Type of the airport. Value "airport" for air terminals, "station" for train stations, "port" for ferry terminals and "unknown" if not known. In airports.csv, only type=airport is included.
    'Source': "category", # Source of this data.
}
airports = pd.read_csv('../input/openflights-20200201-dump/airports.dat', 
                     names=columns_and_dtypes.keys(),
                     dtype=columns_and_dtypes)
# Cast into GeoDataFrame
airports = gpd.GeoDataFrame(airports, geometry=[Point(xy) for xy in zip(airports['Longitude'], airports['Latitude'])])
# Set coordinate reference system
airports.crs = {'init' :'epsg:4326'}  
airports.head(3)

In [None]:
airports["Type"].value_counts()

In [None]:
# Number of airports by country. 
airports["Country"].value_counts().head(10)

In [None]:
# Number of airports by Chinese city
airports.loc[airports["Country"]=="China", "City"].value_counts().head(10)

In [None]:
# Add the prefix source to the column names
src_airports = airports[['Airport ID', 'City', 'Country', 'Latitude', 'Longitude', 'Altitude']]
src_airports.columns = 'Source ' + src_airports.columns
# Add the prefix destination to the column names
dst_airports = airports[['Airport ID', 'City', 'Country', 'Latitude', 'Longitude', 'Altitude']]
dst_airports.columns = 'Destination ' + dst_airports.columns
# Add source coordinates to routes dataset
flights = pd.merge(
    routes[['Airline', 'Source Airport ID', 'Destination Airport ID', 'Stops']], 
    src_airports, 
    on="Source Airport ID", 
    how="left",
)
# Add destination coordinates to routes dataset
flights = pd.merge(
    flights, 
    dst_airports, 
    on="Destination Airport ID", 
    how="left",
)
# There are some unknown destination or source airports... I will drop them
flights = flights[~flights.isna().any(axis=1)].reset_index(drop=True).copy()
flights["Route ID"] = flights.index.copy()

def haversine(lat1, lon1, lat2, lon2, earth_radius=6378*1e3):
    """
    Great-circle distance in metres between points given in decimal degrees.
    Computed from lon and lat (not projected coordinates) to deal with
    Mercator discontinuities (-180 -> 180, -90 -> 90).
    """
    lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])

    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2

    return earth_radius * 2 * np.arcsin(np.sqrt(a))

flights["Distance"] = haversine(flights["Source Latitude"], flights["Source Longitude"], flights["Destination Latitude"], flights["Destination Longitude"])

In [None]:
flights.sample(3)

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 3))
ax = sns.distplot(flights["Distance"]/1000, norm_hist=True, ax=ax)
ax.set_xlabel("Route length (km)", fontsize=16);
start, end = ax.get_xlim();
ax.xaxis.set_ticks(np.arange(0, end, 1500));

In [None]:
# Save complete routes dataset with coordinates, it may be useful for other kernels
flights.to_csv("openflights_routes.csv", index=False)

## Routes EDA
___
Analysis of routes departing from Wuhan.

In [None]:
wuhan = flights[flights["Source City"]=="Wuhan"].copy()
print(f"{len(wuhan)} routes departing from Wuhan.")
display(wuhan.head())
print("Top ten destination countries:")
display(wuhan["Destination Country"].value_counts()[:10])
print("Top ten destination cities:")
display(wuhan["Destination City"].value_counts()[:10])

The top ten is sorted by the number of airline + destination airport combinations for a particular place. For instance, there are 9 different combinations of airline + destination airport that depart from Wuhan and arrive in Shanghai.

This is neither the number of flights nor the number of passengers, though it may be related to both.
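
To make the counting concrete, here is a minimal sketch on a made-up table (the airline codes and rows are hypothetical, not taken from OpenFlights). `value_counts` counts rows, that is, airline + destination airport combinations, which is not the same as the number of distinct airports serving a city.

```python
import pandas as pd

# Hypothetical toy routes departing from a single source city.
toy_routes = pd.DataFrame({
    "Airline": ["AA1", "AA1", "BB2", "CC3"],
    "Destination Airport": ["PVG", "SHA", "PVG", "PEK"],
    "Destination City": ["Shanghai", "Shanghai", "Shanghai", "Beijing"],
})

# Rows per city = combinations of airline + destination airport.
combos_per_city = toy_routes["Destination City"].value_counts()

# Distinct destination airports per city, for comparison.
airports_per_city = (
    toy_routes.groupby("Destination City")["Destination Airport"].nunique()
)

print(combos_per_city)
print(airports_per_city)
```

In this toy table Shanghai gets three combinations but only two distinct airports, so the ranking can change depending on which of the two quantities is used.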

#### Number of stops on this flight ("0" for direct)

Let's analyze the "Stops" variable for the whole routes dataset.

In [None]:
display(flights["Stops"].value_counts())
display(flights[flights["Stops"]>0])

Unfortunately, the "stops" variable does not provide further information...

## Join coronavirus & airports datasets
___

It is hard to join both datasets, since the coordinates of the coronavirus dataset are very coarse-grained: for many cases, we only know the country or state. The coordinates of the airports, on the other hand, are really fine-grained.
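
With such coarse coordinates, about the best we can do is a nearest-neighbour match. As a minimal sketch of how that match could be computed directly on longitude/latitude with NumPy broadcasting (the function names and the approximate coordinates are my own assumptions for illustration, and this is not the code used in the rest of the kernel):

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, earth_radius_km=6371.0):
    """Great-circle distance in km; inputs in decimal degrees, broadcastable."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return earth_radius_km * 2 * np.arcsin(np.sqrt(a))

def nearest_airport(case_lat, case_lon, airport_lat, airport_lon):
    """Return (index of nearest airport, distance in km) for every case location."""
    case_lat = np.asarray(case_lat, dtype=float)[:, None]        # (n_cases, 1)
    case_lon = np.asarray(case_lon, dtype=float)[:, None]
    airport_lat = np.asarray(airport_lat, dtype=float)[None, :]  # (1, n_airports)
    airport_lon = np.asarray(airport_lon, dtype=float)[None, :]
    # Full (n_cases, n_airports) distance matrix in one vectorised call.
    dist = haversine_km(case_lat, case_lon, airport_lat, airport_lon)
    idx = dist.argmin(axis=1)
    return idx, dist[np.arange(dist.shape[0]), idx]

# Hypothetical, approximate coordinates for illustration only.
airport_names = ["Wuhan", "Madrid", "Sydney"]
airport_lat = [30.78, 40.47, -33.95]
airport_lon = [114.21, -3.56, 151.18]

# Two coarse "case" locations: Wuhan city centre and central Spain.
idx, dist_km = nearest_airport([30.58, 40.0], [114.28, -4.0],
                               airport_lat, airport_lon)
nearest_names = [airport_names[i] for i in idx]
print(nearest_names, dist_km.round(1))
```

The distance matrix grows as cases × airports, so this brute-force approach is only reasonable for a few hundred case locations against a few thousand airports.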

In [None]:
# Project airports dataset to world mercator. The bounds of this projection are from -180.0 -80.0 to 180.0 84.0 degrees.
# Therefore, we have to filter the airport data accordingly, by rejecting 4 airports (from Antarctica, New Zealand, Russia and Canada).
EPSG = "epsg:3832" # epsg:3395
columns = ['Airport ID', 'City', 'Country', 'geometry']
keep_latitude = (airports["Latitude"] > -80) & (airports["Latitude"] < 80)
# Display rejected airports.
print(f"Rejecting airports that cannot be reprojected into {EPSG}")
display(airports.loc[~keep_latitude, columns])
# Filter airports and project into mercator.
airports_mercator = airports.loc[keep_latitude, columns].to_crs({'init': EPSG}).copy()
# Reproject also corona dataset
corona_with_airports = df.to_crs({'init': EPSG}).copy()
# Keep only last update per province
corona_with_airports = corona_with_airports.sort_values("Last Update", ascending=False).drop_duplicates(["lat", "lon"], keep="first")
# Find the nearest airport for the coronavirus dataset (I know, I know, it is zero-optimized, TOO SLOW...)
pts = airports_mercator.geometry.unary_union
def near(point):
    """
    Credit from:
    https://gis.stackexchange.com/questions/222315/geopandas-find-nearest-point-in-other-dataframe
    """
    # find the nearest point and return the corresponding 'Airport ID'
    geom_1, geom_2 = nearest_points(point, pts)
    nearest = airports_mercator.geometry == geom_2
    nearest_feats = airports_mercator.loc[nearest, ['City', 'Airport ID']].iloc[0].tolist() + [geom_1.distance(geom_2)]
    return nearest_feats
(corona_with_airports['Airport City'], 
 corona_with_airports['Nearest Airport'], 
 corona_with_airports["Distance to Airport"]) = zip(*corona_with_airports["geometry"].apply(near))

#### Distance thresholding


I check the distances from the "Province" to the "Nearest airport"; maybe I need to filter out airports that are too far away from the coronavirus province. Please keep in mind that these distances are inexact, because they are measured in Mercator-projected coordinates (the `EPSG` set above), which distort distances away from the equator.
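
To get a feeling for how inexact these distances are, here is a rough sketch. It assumes a *spherical* Mercator projection, whose local scale factor is 1/cos(latitude); the projections used in this kernel are ellipsoidal, so the real numbers differ slightly. The function name and the approximate latitudes are my own choices for illustration.

```python
import math

def mercator_scale_factor(lat_deg):
    """Factor by which a spherical Mercator projection stretches local distances."""
    return 1.0 / math.cos(math.radians(lat_deg))

# Approximate latitudes, for illustration only.
for name, lat in [("Singapore", 1.35), ("Wuhan", 30.58), ("Yerbogachen", 61.28)]:
    factor = mercator_scale_factor(lat)
    print(f"{name:12s} lat={lat:6.2f}  projected distances inflated x{factor:.2f}")
```

Near the equator the projected distance is almost exact, while at 60 degrees of latitude it is roughly doubled. Using the `haversine` function defined earlier on the original longitude/latitude values would avoid this distortion.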


In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 3))
ax = sns.distplot(corona_with_airports["Distance to Airport"]/1000, norm_hist=True, ax=ax)
ax.set_xlabel("Distance to airport (km)", fontsize=16);

In [None]:
corona_with_airports[corona_with_airports["Distance to Airport"]>200*1e3]

250 km still makes sense to me... But honestly, I am not sure about those Russian cases being mapped to Yerbogachen. We know that the coordinates of "Province/State" are coarse-grained. In general, the only information that we really have is the country rather than the city, so the coordinates of the provinces, and therefore the matched airports, are not precise.

For instance, the coronavirus in Spain appeared at La Gomera (Canary Islands), even though our dataset locates it in Madrid.

In [None]:
airport_counts = corona_with_airports['Nearest Airport'].value_counts()
target_airports = airports_mercator[airports_mercator["Airport ID"].isin(airport_counts.index)]
target_airports = pd.merge(target_airports, 
                           airport_counts.to_frame("Freq").reset_index().rename(columns={"index":"Airport ID"}),
                           on="Airport ID", 
                           how="left")
target_airports = target_airports.sort_values("Freq", ascending=False).reset_index(drop=True).copy()

### Debug mapping of province into airport

In [None]:
# Simple debug plot
rename_countries ={
    "Mainland China": "China",
    "Hong Kong": "China",
    "Macau": "China",
    "United States": "United States of America",
    "US": "United States of America",
    "UK": "United Kingdom",
    "Singapore": "Malaysia",
    "Ivory Coast": "Côte d'Ivoire"
}
fig, ax = plt.subplots(1, 1, figsize=(20, 20))
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
world = world[(world.name != "Antarctica") & (world.name != "Fr. S. Antarctic Lands")].to_crs({'init': EPSG})
world["active"] = world["name"].isin(df["Country/Region"].apply(lambda x: rename_countries.get(x, x)).unique())
ax = world.plot(ax=ax, column="active", edgecolor='black')
ax = corona_with_airports.plot(ax=ax, marker='o', color='red', markersize=20, label="Provinces with coronavirus")
ax = target_airports.plot(ax=ax, marker='x', color='green', markersize=20, label="Nearest Airports")
ax.set_ylim(-1.e7, 1.05e7);
ax.set_xlim(-2.e7, 2.e7);
ax.legend(fontsize=16);

As you can see, the airports are close to the provinces.

## Create network of routes
___

Create a graph of routes in order to find the shortest path between Wuhan and the rest of the provinces.

In [None]:
airports[airports["City"]=="Wuhan"]

First, create the actual network of routes. Then, find a route to Wuhan from every other node:

1. Find the shortest route in number of stops.
2. Find the shortest route in distance length (km).
3. Choose one of the two using the following heuristic: if the relative difference in distance (km) between the fewest-stops route and the shortest-distance route is below a threshold (`HEURISTIC_ROUTE_TH`, 20% in the code below), select the fewest-stops route; otherwise select the shortest-distance route.
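
The three steps above can be sketched on a toy network. This is only an illustration: the node names `HUB`, `X` and `Y`, the distances and the helper names `path_km` and `choose_path` are hypothetical, and the real kernel works on the full OpenFlights graph.

```python
import networkx as nx

def path_km(graph, path):
    """Total length of a path, summing the "Distance" attribute of its edges."""
    return sum(graph[u][v]["Distance"] for u, v in zip(path, path[1:]))

def choose_path(graph, source, target, threshold=0.2):
    """Prefer the fewest-stops path unless it is much longer than the shortest one."""
    fewest_stops = nx.shortest_path(graph, source, target)
    shortest = nx.shortest_path(graph, source, target, weight="Distance")
    km_stops, km_short = path_km(graph, fewest_stops), path_km(graph, shortest)
    relative_diff = (km_stops - km_short) / km_short
    if relative_diff < threshold:
        return fewest_stops, km_stops
    return shortest, km_short

# Hypothetical toy network: all distances (km) are made up.
toy_graph = nx.Graph()
toy_graph.add_weighted_edges_from(
    [
        ("WUH", "HUB", 500),
        ("HUB", "X", 500),
        ("WUH", "X", 1100),  # direct, only 10% longer than going via HUB
        ("HUB", "Y", 400),
        ("WUH", "Y", 2000),  # direct, but more than twice as long as via HUB
    ],
    weight="Distance",
)

for target in ("X", "Y"):
    path, km = choose_path(toy_graph, "WUH", target)
    print(target, path, km)
```

For `X` the direct route wins because it is only 10% longer than the shortest one, while for `Y` the one-stop route via `HUB` wins. Note that this sketch does not handle the case where source and target are the same node (zero distance).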

In [None]:
def get_routes_and_distance_for_path(net, path):
    """Compute the distance in km for a particular path using flights dataframe."""
    routes = [net.get_edge_data(src, dst)["Route ID"] for src, dst in pairwise(path)]
    distance = flights.loc[flights["Route ID"].isin(routes), "Distance"].sum()/1e3
    return routes, distance     
    
MAX_ABSOLUTE_NUM_STOPS = 3
HEURISTIC_ROUTE_TH = 0.2

# Creates graph from edges -> flights dataset
G = nx.from_pandas_edgelist(flights, "Source Airport ID", "Destination Airport ID", edge_attr=["Route ID", "Distance"])
# Add nodes information -> airports dataset
airports_mercator["pos"] = airports_mercator["geometry"].apply(lambda x: list(list(x.coords)[0]))
airports_mercator["stops_to_wuhan"] = np.iinfo(np.int64).max
airports_mercator["related_to_wuhan"] = False
node_attr = airports_mercator.set_index('Airport ID').to_dict('index')
nx.set_node_attributes(G, node_attr)
# Compute number of stops to arrive at Wuhan
wuhan_airport_id = corona_with_airports.loc[corona_with_airports["Province/State"]=="Hubei", "Nearest Airport"].iloc[0]
nodes_related_to_wuhan_by_stops = {num_stops:[] for num_stops in range(0, MAX_ABSOLUTE_NUM_STOPS + 1)}
edges_related_to_wuhan_by_stops = {num_stops:[] for num_stops in range(0, MAX_ABSOLUTE_NUM_STOPS + 1)}
diff_rates = []
count_win_distance = 0
count_win_stops = 0
for node in tqdm(G.nodes()):
    try:
        # Compute two kinds of shortest paths and choose one of them.        
        shortest_stops_path = nx.shortest_path(G, source=wuhan_airport_id, target=node)
        # km from A to B using min number of stops path
        stops_routes, shortest_stops_length = get_routes_and_distance_for_path(G, shortest_stops_path) 
        shortest_distance_path = nx.shortest_path(G, source=wuhan_airport_id, target=node, weight="Distance")
        # km from A to B using min length path
        length_routes, shortest_distance_length = get_routes_and_distance_for_path(G, shortest_distance_path) 
        # Compute relative difference in distance between the two routes
        diff_rate = (shortest_stops_length - shortest_distance_length)/shortest_distance_length
        diff_rates.append(diff_rate)
        if  diff_rate < HEURISTIC_ROUTE_TH:
            path = shortest_stops_path
            distance = shortest_stops_length
            count_win_stops += 1
        else:
            path = shortest_distance_path
            distance = shortest_distance_length
            count_win_distance += 1
        num_stops = len(path) - 2 # Wuhan -> Paris = 0 Stops
        G.nodes[node]['stops_to_wuhan'] = num_stops
        G.nodes[node]['distance_to_wuhan'] = distance
        if num_stops <= MAX_ABSOLUTE_NUM_STOPS:
            if num_stops >= 0:
                # Avoid self loop: Wuhan -> Wuhan                
                nodes_related_to_wuhan_by_stops[num_stops] += path
                edges_related_to_wuhan_by_stops[num_stops] += pairwise(path)
            for n in path:
                G.nodes[n]['related_to_wuhan'] = True
    except nx.NetworkXNoPath:
        pass

In [None]:
# uniquify
nodes_related_to_wuhan_by_stops = {stops:set(nodes) for stops, nodes in nodes_related_to_wuhan_by_stops.items()}
# Create subgraphs
subgraphs_by_stops = {}
for num_stops in range(1, MAX_ABSOLUTE_NUM_STOPS + 1):
    subg = nx.Graph()
    subg.add_nodes_from(nodes_related_to_wuhan_by_stops[num_stops])
    subg.add_edges_from(edges_related_to_wuhan_by_stops[num_stops])
    subgraphs_by_stops[num_stops] = subg.copy()
# Create dictionary of subgraphs at a certain number of stops
print("Routes network:")
print(f"Graph with {G.number_of_nodes()} nodes and {G.number_of_edges()} edges")

Plot the winning criteria stats for debugging purposes.

In [None]:
diffs = pd.Series(diff_rates, dtype=float) # there's one nan
diffs = diffs[~diffs.isna()] * 100

fig, ax = plt.subplots(1, 3, figsize=(16, 2))
sns.barplot(x=["Less distance", "Fewer stops"], y=[count_win_distance, count_win_stops], ax=ax[0]);
ax[0].set_ylabel("Number of wins [#]", fontsize=14);
sns.distplot(diffs, norm_hist=True, ax=ax[1])
ax[1].set_xlabel("Distance rate\nbetween routes [%]", fontsize=14);
sns.distplot(diffs[diffs<200], norm_hist=True, ax=ax[2])
ax[2].set_xlabel("Distance rate\nbetween routes zoom < 200% [%]", fontsize=14);
start, end = ax[2].get_xlim();
ax[2].xaxis.set_ticks(np.arange(0, end, 50));

## Plot network of routes related to Wuhan
___

Plot network of routes related to Wuhan at different number of stops (node distances).

**Disclaimer**: routes from North America to Western Europe are not plotted properly because of the geospatial projection: they are drawn across the whole map, whereas I would rather have a Pac-Man effect (wrapping around the map edge).
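
One possible workaround, sketched here and not applied in this kernel, is to detect the routes that cross the edge of the map and skip them or draw them separately. The sketch assumes node positions are available as longitudes in degrees and that the map is centred on a known central meridian (as far as I know, EPSG:3832 is centred on 150°E, but please double-check). The helper names and the approximate longitudes are hypothetical.

```python
def wrap_longitude(lon, central_meridian=0.0):
    """Longitude relative to the map centre, wrapped into [-180, 180)."""
    return (lon - central_meridian + 180.0) % 360.0 - 180.0

def crosses_map_seam(lon1, lon2, central_meridian=0.0):
    """True if a straight line between the two points crosses more than half the map."""
    rel1 = wrap_longitude(lon1, central_meridian)
    rel2 = wrap_longitude(lon2, central_meridian)
    return abs(rel2 - rel1) > 180.0

# Approximate longitudes (degrees east), for illustration only.
city_lon = {"New York": -74.0, "London": 0.0, "Wuhan": 114.3, "Sydney": 151.2}
edges = [("New York", "London"), ("Wuhan", "Sydney")]

# Keep only the edges that can be drawn as a single straight line
# on a map centred on 150 degrees east.
plottable = [
    (a, b) for a, b in edges
    if not crosses_map_seam(city_lon[a], city_lon[b], central_meridian=150.0)
]
print(plottable)
```

On a map centred on 150°E the seam lies in the Atlantic, which is why the New York to London edge is flagged while Wuhan to Sydney is kept. The flagged edges could then be split into two segments, one ending at each map edge, to get the wrap-around effect.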

In [None]:
# Plot Wuhan network
node_to_pos = {node:attrs["pos"] for node, attrs in node_attr.items()}

for num_stops in range(0, MAX_ABSOLUTE_NUM_STOPS + 1):
    fig, ax = plt.subplots(1, 1, figsize=(20, 20))
    ax = world.plot(ax=ax, column="active", edgecolor='black')
    ax = corona_with_airports.plot(ax=ax, marker='o', color='red', markersize=26, label="Provinces with coronavirus")
    # Keep only edges connected into Wuhan    
    nx.draw(
        G,
        nodelist=nodes_related_to_wuhan_by_stops[num_stops], 
        edgelist=edges_related_to_wuhan_by_stops[num_stops], 
        pos=node_to_pos,
        ax=ax,
        node_size=2,
        arrowsize=2,
        edge_color="blue",
        style="dashed"
    )
    ax.set_ylim(-1.e7, 1.05e7);
    ax.set_xlim(-2.e7, 2.e7);
    ax.set_title(f"Wuhan connections at {num_stops} stop/s", fontsize=16)
    plt.show()

## TODO: Mykola's Visualization
___


## Conclusions
___

Recently, the Institute for Theoretical Biology at the Humboldt University of Berlin has published an incredible work on [Coronavirus Global Risk Assessment](http://rocs.hu-berlin.de/corona/). They estimate the likelihood of importing a case from an affected location to an airport or country distant from the outbreak location. I strongly recommend going through the description of the dataset and the statistical model. They provide wonderful visualizations and insights as well.

For this work, I think they have used a "closed" dataset collected from https://www.oag.com/. I wish we could all access these data :(



## TODO

This is an incomplete kernel. There is still a lot of work to be done, starting with combining Mykola's visualization with this flight data.