# Find the connection between demography and the spread of the virus

## Background

The 2019 Novel Coronavirus, or 2019-nCoV, is a new respiratory virus first identified in Wuhan, Hubei Province, China. A novel coronavirus (nCoV) is a new coronavirus that has not been previously identified. The 2019 novel coronavirus (2019-nCoV) is not the same as the coronaviruses that commonly circulate among humans and cause mild illness, like the common cold.

This virus probably originally emerged from an animal source but now seems to be spreading from person-to-person. It’s important to note that person-to-person spread can happen on a continuum. Some viruses are highly contagious (like measles), while other viruses are less so. At this time, it’s unclear how easily or sustainably this virus is spreading between people. 

Adding other country-level data, such as population and GDP, makes it possible to learn more about the spread of newly emerged coronaviruses.

**Reference:** https://www.cdc.gov/coronavirus/2019-ncov/faq.html

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import plotly.graph_objects as go
from fbprophet import Prophet
import pycountry
import plotly.express as px
from datetime import timedelta

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Data Import, Preprocessing and EDA

In [None]:
df = pd.read_csv('../input/novel-corona-virus-2019-dataset/covid_19_data.csv',parse_dates=['Last Update'])
df.rename(columns={'ObservationDate':'Date', 'Country/Region':'Country'}, inplace=True)

countries_of_the_world = pd.read_csv("../input/countries-demographic-and-economic-data/countries of the world.csv")

In [None]:
df.shape

In [None]:
countries_of_the_world.shape

In [None]:
countries_of_the_world["Country"] = countries_of_the_world["Country"].str.strip()

In [None]:
list(countries_of_the_world["Country"])[:10]

In [None]:
df_countries = df.groupby(["Country","Last Update"])[['Confirmed', 'Deaths', 'Recovered']].sum().reset_index()

In [None]:
df_countries

Select only the first day on which each country reported the virus

In [None]:
df_countries_start = df_countries.groupby("Country", as_index=False).nth(0)
df_countries_start.reset_index(drop=True, inplace=True)
df_countries_start

Calculate the number of cases x days after the first reported case for each country

In [None]:
days_later = 15  # x: number of days after a country's first observation
df_countries_start["xDays Later Confirmed"] = 0  # new column for the confirmed count x days later
countries_list = list(df_countries_start["Country"].drop_duplicates())  # unique country names

for item in countries_list:
    
    try:
        print(item)
        
        first_date = min(df_countries_start[df_countries_start["Country"] == item]["Last Update"])
        xdays_later = first_date + timedelta(days=days_later)
        announce_date = min(df_countries[(df_countries["Country"] == item) & (df_countries["Last Update"] > xdays_later)]["Last Update"])
        confirmed_sick = df_countries[(df_countries["Country"] == item) & (df_countries["Last Update"] == announce_date)]["Confirmed"]
        death_sick = df_countries[(df_countries["Country"] == item) & (df_countries["Last Update"] == announce_date)]["Deaths"]
        recovered_sick = df_countries[(df_countries["Country"] == item) & (df_countries["Last Update"] == announce_date)]["Recovered"]
        
        print(confirmed_sick.values)
        
        df_countries_start.loc[df_countries_start['Country'] == item, 'xDays Later Confirmed'] = confirmed_sick.values
        df_countries_start.loc[df_countries_start['Country'] == item, 'xDays Later Death'] = death_sick.values
        df_countries_start.loc[df_countries_start['Country'] == item, 'xDays Later Recovered'] = recovered_sick.values

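    except Exception:
        # Typically min() fails because the country has no observation
        # more than days_later days after its first one; skip that country.
        continue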
    


In [None]:
df_countries_start.head(20)
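
The loop above can also be written without iterating over countries. The following is a minimal sketch under stated assumptions: `cases_after_window` is a hypothetical helper name, and the toy frame is made-up data that only mimics the `Country` / `Last Update` / `Confirmed` columns used in this notebook.

```python
import pandas as pd


def cases_after_window(obs: pd.DataFrame, days: int) -> pd.DataFrame:
    """Return, per country, the first row reported more than `days` days
    after that country's first report. Countries without such a row are dropped."""
    obs = obs.sort_values(["Country", "Last Update"])
    # Date of the first report, repeated on every row of the same country
    first_seen = obs.groupby("Country")["Last Update"].transform("min")
    later = obs[obs["Last Update"] > first_seen + pd.Timedelta(days=days)]
    # Rows are sorted by date, so the first remaining row per country is the earliest
    return later.drop_duplicates("Country", keep="first").reset_index(drop=True)


# Hypothetical toy data, not taken from the real dataset
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B"],
    "Last Update": pd.to_datetime(
        ["2020-01-01", "2020-01-10", "2020-01-20", "2020-02-01", "2020-02-05"]),
    "Confirmed": [1, 5, 40, 2, 3],
})

result = cases_after_window(toy, days=15)
print(result)
```

Country `B` has no report more than 15 days after its first one, so it drops out of the result, which mirrors what the `try`/`except` does in the loop.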

# Add demographic data to virus penetration data for each country

In [None]:
df_countries_dem = df_countries_start.merge(countries_of_the_world, left_on='Country', right_on='Country')

In [None]:
df_countries_dem.shape
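
Because `merge` performs an inner join by default, any country whose name is spelled differently in the two tables silently disappears from `df_countries_dem`. A quick way to audit this, sketched here on hypothetical toy frames (the country names and values are invented), is a left join with `indicator=True`:

```python
import pandas as pd

# Hypothetical toy frames standing in for the case data and the demographic data
cases = pd.DataFrame({
    "Country": ["A", "B", "C Republic"],
    "Confirmed": [10, 20, 30],
})
demographics = pd.DataFrame({
    "Country": ["A", "B", "C Rep."],
    "Population": [1000, 2000, 3000],
})

# A left join keeps every case row; the _merge column records where each row matched
audit = cases.merge(demographics, on="Country", how="left", indicator=True)
unmatched = audit.loc[audit["_merge"] == "left_only", "Country"].tolist()
print(unmatched)
```

Names that show up in `unmatched` could then be harmonised before the real merge.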

In [None]:
df_countries_dem[df_countries_dem["xDays Later Confirmed"] > 0].head(15)

In [None]:
df_research = df_countries_dem[df_countries_dem["xDays Later Confirmed" ] > 0]

In [None]:
df_research.shape

In [None]:
df_research = df_research.drop(['Country', 'Last Update','Region'], axis=1)

In [None]:
df_research.dropna(inplace=True)

In [None]:
df_research.reset_index(drop=True, inplace=True)

In [None]:
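# Convert every text column that uses a decimal comma to float.
# Looping over all columns avoids a hard-coded column count that skips the last column.
for col in df_research.columns:
    if df_research[col].dtype == object:
        df_research[col] = df_research[col].str.replace(',', '.').astype(float)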
        

In [None]:
df_research.corr()

In [None]:
plt.figure(figsize=(13,13))

sns.heatmap(df_research.corr().round(1), vmax=1, square=True,annot=True,cmap='coolwarm')

plt.title('Correlation between different features')
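
A heatmap of this size is hard to read for a single target. As a complement, the correlations with one column can be ranked by absolute value. This is only a sketch: the column names echo those in the notebook, but the numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical toy values; only the column names follow the notebook
toy = pd.DataFrame({
    "xDays Later Confirmed": [1, 2, 3, 4, 5],
    "GDP": [2, 4, 6, 8, 10],
    "Climate": [1, 3, 2, 3, 1],
    "Deathrate": [5, 4, 3, 2, 2],
})

target = "xDays Later Confirmed"
# Pearson correlation of every feature with the target, strongest first
ranking = toy.corr()[target].drop(target).sort_values(key=abs, ascending=False)
print(ranking)
```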

# PCA visualization of countries

Scatter plot of a two-component PCA without virus spread data

In [None]:
pd.set_option('display.max_columns', None)

In [None]:
df_research = df_countries_dem[df_countries_dem["xDays Later Confirmed" ] > 0]

In [None]:
df_research = df_research.drop(['Country', 'Last Update','Region'], axis=1)
df_research = df_research.drop(['Confirmed', 'Deaths','Recovered'], axis=1)
df_research = df_research.drop(['xDays Later Confirmed', 'xDays Later Death','xDays Later Recovered'], axis=1)

In [None]:
df_research.dropna(inplace=True)

In [None]:
df_research.head(10)

In [None]:
df_research.reset_index(drop=True, inplace=True)

In [None]:
df_research.shape

In [None]:
# Convert every text column that uses a decimal comma to float
for col in df_research.columns:
    if df_research[col].dtype == object:
        df_research[col] = df_research[col].str.replace(',', '.').astype(float)
 

In [None]:
X = df_research.iloc[:,:]

In [None]:
X.shape

In [None]:
from sklearn.preprocessing import StandardScaler
standardized_data = StandardScaler().fit_transform(X)
print(standardized_data.shape)

In [None]:
covar_matrix = np.matmul(standardized_data.T , standardized_data)

In [None]:
covar_matrix.shape

In [None]:
from scipy.linalg import eigh 

In [None]:
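# Two largest eigenvalues of the 18x18 matrix; subset_by_index replaces the deprecated eigvals keyword
values, vectors = eigh(covar_matrix, subset_by_index=(16, 17))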

In [None]:
values = values.real
print(values)

In [None]:
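# Transpose so that each row is an eigenvector, then reverse the rows:
# eigh returns eigenvalues in ascending order, so the largest component comes last.
vectors = vectors.T[::-1]
vectors.shape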

In [None]:
new_coordinates = np.matmul(vectors, standardized_data.T)
print("Shape of the projected data: ", vectors.shape, "*", standardized_data.T.shape, " = ", new_coordinates.shape)

In [None]:
df_research.head()

In [None]:
df_research.reset_index(drop=True, inplace=True)

In [None]:
new_coordinates.shape

In [None]:
new_coordinates = new_coordinates.T  # one row per country, one column per component

df = pd.DataFrame(data=new_coordinates, columns=("1st_principal", "2nd_principal"))
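
As a sanity check, the manual eigen-decomposition can be compared with `sklearn.decomposition.PCA`. The sketch below runs on random stand-in data rather than the notebook's frame, and uses `numpy.linalg.eigh` in place of the SciPy call. Both routes should give the same coordinates up to the sign of each component.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical correlated features standing in for the demographic columns
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6)) @ rng.normal(size=(6, 6))

Z = StandardScaler().fit_transform(X)

# Manual route: eigenvectors of Z^T Z, reordered so the largest eigenvalue comes first
eigenvalues, eigenvectors = np.linalg.eigh(Z.T @ Z)  # ascending order
top2 = eigenvectors[:, ::-1][:, :2]
manual = Z @ top2

# Library route
library = PCA(n_components=2).fit_transform(Z)

# Principal components are only defined up to sign, so compare absolute values
same = np.allclose(np.abs(manual), np.abs(library), atol=1e-6)
print(same)
```

Note the explicit reordering: `eigh` sorts eigenvalues in ascending order, so the first principal component is the last column it returns.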

In [None]:
df_pca_ext = pd.concat([df, df_research], axis=1).reindex(df.index)

In [None]:
df_pca_ext.head()

In [None]:
sns.set(style="ticks")


sns.FacetGrid(df_pca_ext, height=10, hue="Climate").map(plt.scatter, '1st_principal', '2nd_principal').add_legend()
plt.title('PCA visualization of countries')
plt.show()

# Scatter plot of a two-component PCA including virus spread data

In [None]:
df_research = df_countries_dem[df_countries_dem["xDays Later Confirmed" ] > 0]

In [None]:
df_research = df_research.drop(['Country', 'Last Update','Region'], axis=1)

In [None]:
df_research.reset_index(drop=True, inplace=True)

In [None]:
# Convert every text column that uses a decimal comma to float
for col in df_research.columns:
    if df_research[col].dtype == object:
        df_research[col] = df_research[col].str.replace(',', '.').astype(float)


In [None]:
df_research.head(5)

In [None]:
df_research.dropna(inplace=True)

In [None]:
df_research.reset_index(drop=True, inplace=True)

In [None]:
X = df_research.iloc[:,:]

In [None]:
standardized_data = StandardScaler().fit_transform(X)
covar_matrix = np.matmul(standardized_data.T , standardized_data)
# Two largest eigenvalues of the 24x24 matrix; subset_by_index replaces the deprecated eigvals keyword
values, vectors = eigh(covar_matrix, subset_by_index=(22, 23))
# eigh sorts eigenvalues in ascending order, so reverse the rows to put the largest component first
vectors = vectors.T[::-1]
new_coordinates = np.matmul(vectors, standardized_data.T).T
df = pd.DataFrame(data=new_coordinates, columns=("1st_principal", "2nd_principal"))

In [None]:
df_pca_ext = pd.concat([df, df_research], axis=1).reindex(df.index)

In [None]:
df_pca_ext.head()

In [None]:
df_pca_ext['Rel Death'] = df_pca_ext['xDays Later Death']/df_pca_ext['Population']
df_pca_ext['Rel Migr'] = df_pca_ext['xDays Later Death']/df_pca_ext['Net migration']
df_pca_ext['Rel Migr range'] = (df_pca_ext['Net migration']/2).round(0)*2

In [None]:
sns.set(style="ticks")
sns.FacetGrid(df_pca_ext, height=8, hue="Rel Death").map(plt.scatter, '1st_principal', '2nd_principal')
plt.title('PCA visualization of countries')
plt.show()

In [None]:
sns.set(style="ticks")
sns.FacetGrid(df_pca_ext, height=8, hue="Rel Migr").map(plt.scatter, '1st_principal', '2nd_principal')
plt.title('PCA visualization of countries')
plt.show()

In [None]:
sns.set(style="ticks")
sns.FacetGrid(df_pca_ext, height=8, hue="Rel Migr range").map(plt.scatter, '1st_principal', '2nd_principal').add_legend()
plt.title('PCA visualization of countries')
plt.show()

# Based on the spread data reported 15 days after the virus first appeared, these countries tend to have a lower risk of death

The net migration rate appears to be inversely correlated with the risk of death.
Other demographic variables also show a noticeable association with the spread of the virus (GDP, infant mortality, climate, ...).
The observation window after the first reported case can be changed and the same kind of analysis repeated.
Repeating it over several windows yields time series, to which other kinds of prediction, e.g. Prophet, become applicable.
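
The inverse relation with net migration is read off the plots above. One way to put a number on it would be a rank correlation, which is less sensitive to outliers than Pearson's coefficient. The values below are invented purely to show the call; they are not results from this analysis.

```python
import pandas as pd

# Hypothetical toy values; the real columns would come from df_pca_ext
toy = pd.DataFrame({
    "Net migration": [-3.0, -1.0, 0.5, 2.0, 6.0],
    "Rel Death": [9e-7, 5e-7, 4e-7, 1e-7, 0.0],
})

# Spearman's rho is the Pearson correlation of the ranks
rho = toy.corr(method="spearman").loc["Net migration", "Rel Death"]
print(rho)
```

A value near -1 would support the inverse relation and a value near 0 would not. On the real data the coefficient should be computed on `df_pca_ext` after the numeric conversion.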

In [None]:
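net_migration = df_countries_dem["Net migration"].str.replace(',', '.').astype(float)  # stored as text with decimal commas
list(df_countries_dem[net_migration > 4]["Country"])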

For comparison, see the worldwide net migration statistics:

[Net Migration of Countries](https://en.wikipedia.org/wiki/Net_migration_rate)

![][worldmap]

[worldmap]: https://upload.wikimedia.org/wikipedia/commons/thumb/b/bf/Net_Migration_Rate.svg/400px-Net_Migration_Rate.svg.png