# Defeating underreporting: neural networks for estimating the true scale of Covid-19 in Colombia

Thomas Bayes is said to have held that statistics is not the study of outcomes, but of uncertainty. We are living in uncertain times, so it might be a good time to start listening. One of the most powerful tools we have in this fight is information. Every hour of every day, public and private entities collect data in the hope that, with enough of it, we may plot a course ahead. Although this approach is founded on sound principles, most treatments make a misleading assumption: that we have complete information.

Not only is there an incubation period before symptoms start to show, but those symptoms can also be misinterpreted. This, together with other factors such as healthcare access and intentional misreporting, may skew the data and prevent the thorough analysis that could otherwise be done.

In this kernel, I propose a method for estimating the uncertainty of statistical reports of Covid-19 cases, using Colombia between March and April 2020 as a case study. It begins with an exploration of how countries behave relative to each other, based on the historic day-by-day increase in cases and deaths. This exploration was done with different dimensionality-reduction techniques (t-SNE and UMAP).

In [None]:
# This Python 3 environment is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
from matplotlib.lines import Line2D # custom legend handles
import sklearn.cluster
import sklearn.manifold # t-SNE (sklearn.manifold.TSNE)
import sklearn.neural_network
import umap # UMAP embedding
import torch

%matplotlib inline

## Data ordering
In order to better understand the temporal evolution of the cases, we rearranged the data to construct an evolution vector for each country, whose components are the per-capita increases in cases and deaths on each date.

In [None]:
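# ECDC daily report: one row per country and date
eddc=pd.read_csv('/kaggle/input/uncover/UNCOVER/ECDC/current-data-on-the-geographic-distribution-of-covid-19-cases-worldwide.csv')
eddc=eddc.drop('daterep',axis='columns')
# Normalise the daily counts by the 2018 population to obtain per-capita values
eddc['casesNorm']=eddc['cases']/eddc['popdata2018']
eddc['deathsNorm']=eddc['deaths']/eddc['popdata2018']
eddc=eddc.dropna(axis=0)
# Index by country name (the column itself is kept for later use)
eddc=eddc.set_index(eddc['countriesandterritories'])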

We can visualize the data as is and look for any patterns. Regional patterns, for instance, may give us insight into collective behavior.

In [None]:
# Snapshot of a single day; despite its name, `april4th` selects 28 April
april4th=(eddc[(eddc['day']==28) & (eddc['month']==4)])
plt.figure(figsize=(14,7))

asia=april4th[april4th['continentexp']=='Asia']
america=april4th[april4th['continentexp']=='America']
europe=april4th[april4th['continentexp']=='Europe']
oceania=april4th[april4th['continentexp']=='Oceania']
africa=april4th[april4th['continentexp']=='Africa']

plt.subplot(1,2,1)

plt.plot(asia['casesNorm'],asia['deathsNorm'],'.',label='Asia',markersize=10,color='red')
plt.plot(america['casesNorm'],america['deathsNorm'],'.',label='America',markersize=10,color='blue')
plt.plot(europe['casesNorm'],europe['deathsNorm'],'.',label='Europe',markersize=10,color='green')
plt.plot(oceania['casesNorm'],oceania['deathsNorm'],'.',label='Oceania',markersize=10,color='black')
plt.plot(africa['casesNorm'],africa['deathsNorm'],'.',label='Africa',markersize=10,color='cyan')

#for index,row in april4th.iterrows():
#    plt.annotate(s=row['countriesandterritories'],xy=(row['cases'],row['deaths']))
plt.xlabel('Cases per capita')
plt.ylabel('Deaths per capita')
plt.legend()

plt.subplot(1,2,2)

plt.plot(asia['casesNorm'],asia['deathsNorm'],'.',label='Asia',markersize=10,color='red')
plt.plot(america['casesNorm'],america['deathsNorm'],'.',label='America',markersize=10,color='blue')
plt.plot(europe['casesNorm'],europe['deathsNorm'],'.',label='Europe',markersize=10,color='green')
plt.plot(oceania['casesNorm'],oceania['deathsNorm'],'.',label='Oceania',markersize=10,color='black')
plt.plot(africa['casesNorm'],africa['deathsNorm'],'.',label='Africa',markersize=10,color='cyan')
plt.title('Deaths and cases')
#for index,row in april4th.iterrows():
#    plt.annotate(s=row['countriesandterritories'],xy=(row['cases'],row['deaths']))
plt.xscale('log')
plt.yscale('log')
plt.xlabel('Cases per capita')
plt.ylabel('Deaths per capita')
plt.legend()
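
The spread seen in these two panels can also be summarised numerically. The following is a minimal sketch on made-up numbers (the `toy` frame and the helper `ratio_by_continent` are hypothetical and not part of the dataset): it computes deaths per case for each row and then the range of that ratio within each continent.

```python
import pandas as pd

# Hypothetical per-capita values for a single day; not real ECDC data
toy = pd.DataFrame({
    'continentexp': ['Europe', 'Europe', 'Africa', 'Africa', 'Asia'],
    'casesNorm':    [4e-5, 2e-5, 1e-6, 2e-6, 5e-6],
    'deathsNorm':   [4e-6, 1e-6, 1e-7, 1e-7, 1e-7],
})

def ratio_by_continent(df):
    """Return min, median and max of deaths per case for each continent."""
    # Skip rows without cases to avoid dividing by zero
    df = df[df['casesNorm'] > 0].copy()
    df['deathsPerCase'] = df['deathsNorm'] / df['casesNorm']
    return df.groupby('continentexp')['deathsPerCase'].agg(['min', 'median', 'max'])

summary = ratio_by_continent(toy)
print(summary)
```

If the `min` and `max` columns overlap heavily between continents, the ratio alone does not separate them, which is the situation described next.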

Besides a European cluster, nothing jumps out from a simple deaths-per-case analysis, not even a typical range. This points towards a heavy dependence on the particular conditions of each country. We might instead study the temporal evolution vectors and look for similarities.

In [None]:
tsne = sklearn.manifold.TSNE(perplexity=100)
predictors=['casesNorm','deathsNorm']
tsne.fit(april4th[predictors])
embedding = tsne.embedding_

In [None]:
plt.figure(figsize=(14,7))
plt.subplot(1,2,1)
i=0
for index,row in april4th.iterrows():
    continent=row['continentexp']
    if continent=='Asia':
        plt.scatter(embedding[i,0], embedding[i,1], s=3.0, color='red')
    if continent=='America':
        plt.scatter(embedding[i,0], embedding[i,1], s=3.0, color='blue')
    if continent=='Europe':
        plt.scatter(embedding[i,0], embedding[i,1], s=3.0, color='green')
    if continent=='Africa':
        plt.scatter(embedding[i,0], embedding[i,1], s=3.0, color='cyan')
    if continent=='Oceania':
        plt.scatter(embedding[i,0], embedding[i,1], s=3.0, color='black')
    i=i+1
legend_elements = [Line2D([0], [0], color='w', marker='o', markerfacecolor='red',label='Asia'),
                   Line2D([0], [0], color='w', marker='o', markerfacecolor='blue',label='America'),
                  Line2D([0], [0], color='w', marker='o', markerfacecolor='green',label='Europe'),
                  Line2D([0], [0], color='w', marker='o', markerfacecolor='black',label='Oceania'),
                  Line2D([0], [0], color='w', marker='o', markerfacecolor='cyan',label='Africa')]
plt.legend(handles=legend_elements)

plt.subplot(1,2,2)

i=0
for index,row in april4th.iterrows():
    #plt.annotate(s=row['countriesandterritories'],xy=(embedding[i,0],[i,1]))
    continent=row['continentexp']
    if continent=='Asia':
        plt.scatter(embedding[i,0], embedding[i,1], s=3.0, color='red')
    if continent=='America':
        plt.scatter(embedding[i,0], embedding[i,1], s=3.0, color='blue')
    if continent=='Europe':
        plt.scatter(embedding[i,0], embedding[i,1], s=3.0, color='green')
    if continent=='Africa':
        plt.scatter(embedding[i,0], embedding[i,1], s=3.0, color='cyan')
    if continent=='Oceania':
        plt.scatter(embedding[i,0], embedding[i,1], s=3.0, color='black')
    labels=['Colombia','Peru','Ecuador','Venezuela','Canada','Germany','Singapore','Brazil','China','France']
    for label in labels:
        if row['countriesandterritories']==label:
            plt.annotate(label,xy=(embedding[i,0],embedding[i,1]))
    i=i+1
legend_elements = [Line2D([0], [0], color='w', marker='o', markerfacecolor='red',label='Asia'),
                   Line2D([0], [0], color='w', marker='o', markerfacecolor='blue',label='America'),
                  Line2D([0], [0], color='w', marker='o', markerfacecolor='green',label='Europe'),
                  Line2D([0], [0], color='w', marker='o', markerfacecolor='black',label='Oceania'),
                  Line2D([0], [0], color='w', marker='o', markerfacecolor='cyan',label='Africa')]
plt.legend(handles=legend_elements)

This crescent-moon shape with a lasso at the end is quite suggestive: the countries at the lower end of the crescent are those with high reporting metrics, regardless of how successful their mitigation policies have been. Next, we can use dimensionality reduction to study similarities between the temporal evolutions of different countries.

First, we create the temporal evolution vectors for each country.

In [None]:
time_evol=pd.DataFrame(index=list(eddc.countriesandterritories.unique()))
countries=eddc.countriesandterritories.unique()
dates=eddc.drop_duplicates(['month','day'])
dates=np.vstack(((np.array(dates['day'])),(np.array(dates['month']))))
dates=dates.T
dates=dates[:45]
nan_col=np.empty(len(countries))
nan_col[:]=np.nan
time_evol['Continent']=None
# Per-country running totals over the selected dates
time_evol['Cases']=0.0
time_evol['Deaths']=0.0

for date in dates:
    day=date[0]
    month=date[1]
    # One pair of columns per date, e.g. 'casesNorm 28-4' and 'deathsNorm 28-4'
    labelCases='casesNorm '+str(day)+'-'+str(month)
    labelDeaths='deathsNorm '+str(day)+'-'+str(month)
    time_evol[labelCases]=nan_col
    time_evol[labelDeaths]=nan_col
    temp =eddc[(eddc['day']==day) & (eddc['month']==month)]

    for country,row in temp.iterrows():
        casesNorm=row['casesNorm']
        deathsNorm=row['deathsNorm']
        time_evol.loc[country,labelCases]=casesNorm
        time_evol.loc[country,labelDeaths]=deathsNorm
        time_evol.loc[country,'Continent']=row['continentexp']
        # Accumulate per country, so that totals never mix different countries
        time_evol.loc[country,'Cases']+=casesNorm
        time_evol.loc[country,'Deaths']+=deathsNorm
#time_evol=time_evol.drop(0,axis='columns')
time_evol=time_evol.dropna(axis=0)
time_evol
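
The same kind of evolution vector can be sketched in a more compact, vectorised form. The example below is a minimal sketch on a made-up long-format table (the `toy` frame and its values are hypothetical). It assumes one row per country and date, and uses `pivot_table` to place each date's `casesNorm` and `deathsNorm` side by side.

```python
import pandas as pd

# Hypothetical long-format data: one row per country and date
toy = pd.DataFrame({
    'countriesandterritories': ['A', 'A', 'B', 'B'],
    'day':   [28, 27, 28, 27],
    'month': [4, 4, 4, 4],
    'casesNorm':  [3e-6, 2e-6, 1e-6, 1e-6],
    'deathsNorm': [1e-7, 0.0, 0.0, 0.0],
})

# Label each date as 'day-month', matching the column suffix used above
toy['date'] = toy['day'].astype(str) + '-' + toy['month'].astype(str)

# One row per country, one (variable, date) column per observation
wide = toy.pivot_table(index='countriesandterritories',
                       columns='date',
                       values=['casesNorm', 'deathsNorm'])

# Flatten the two-level column index into names like 'casesNorm 28-4'
wide.columns = [f'{var} {date}' for var, date in wide.columns]

# Countries with a missing date would show NaN here and are dropped
wide = wide.dropna(axis=0)
print(wide)
```

Note that the columns come out grouped by variable rather than interleaved by date. For a fully connected first layer, the order only needs to be the same for every country.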

Then, we apply a t-SNE embedding to search for local similarities.

In [None]:
tsne = sklearn.manifold.TSNE(perplexity=100)
tsne.fit(time_evol.drop(['Continent','Cases','Deaths'],axis='columns'))
tsne_embedding = tsne.embedding_

plt.figure(figsize=(10,7))
i=0
for index,row in time_evol.iterrows():
    #plt.annotate(s=row['countriesandterritories'],xy=(tsne_embedding[i,0],[i,1]))
    continent=row['Continent']
    if continent=='Asia':
        plt.scatter(tsne_embedding[i,0], tsne_embedding[i,1], s=3.0, color='red')
    if continent=='America':
        plt.scatter(tsne_embedding[i,0], tsne_embedding[i,1], s=3.0, color='blue')
    if continent=='Europe':
        plt.scatter(tsne_embedding[i,0], tsne_embedding[i,1], s=3.0, color='green')
    if continent=='Africa':
        plt.scatter(tsne_embedding[i,0], tsne_embedding[i,1], s=3.0, color='cyan')
    if continent=='Oceania':
        plt.scatter(tsne_embedding[i,0], tsne_embedding[i,1], s=3.0, color='black')
    labels=['Colombia','Peru','Ecuador','Venezuela','Canada','Germany','Singapore','France','United_States_of_America']
    for label in labels:
        if index==label:
            plt.annotate(label,xy=(tsne_embedding[i,0],tsne_embedding[i,1]))
    i=i+1
legend_elements = [Line2D([0], [0], color='w', marker='o', markerfacecolor='red',label='Asia'),
                   Line2D([0], [0], color='w', marker='o', markerfacecolor='blue',label='America'),
                  Line2D([0], [0], color='w', marker='o', markerfacecolor='green',label='Europe'),
                  Line2D([0], [0], color='w', marker='o', markerfacecolor='black',label='Oceania'),
                  Line2D([0], [0], color='w', marker='o', markerfacecolor='cyan',label='Africa')]
plt.legend(handles=legend_elements)
plt.title('t-SNE clustering',size=20)
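
t-SNE's `perplexity` loosely sets how many neighbours each point attends to, and scikit-learn expects it to be smaller than the number of samples. Since `dropna` can shrink `time_evol`, a guard like the one below may be useful. This is a minimal sketch on random data; `safe_perplexity` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
import sklearn.manifold

def safe_perplexity(n_samples, requested=100):
    """Cap the requested perplexity so that it stays below n_samples."""
    return float(min(requested, max(1, n_samples - 1)))

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))   # 60 made-up "countries", 10 features each

perp = safe_perplexity(len(X))  # capped at 59.0, because only 60 rows exist
tsne = sklearn.manifold.TSNE(n_components=2, perplexity=perp, random_state=0)
embedding = tsne.fit_transform(X)
print(perp, embedding.shape)
```

With a few hundred countries the requested value of 100 passes through unchanged, so the guard only matters when many rows are dropped.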

Now we see a comet-like structure, with a good portion of African countries near the core and well-reported territories towards the tail. Maybe a more global approach can be used.

In [None]:
reducer = umap.UMAP(n_neighbors=5)
reducer.fit(time_evol.drop(['Continent','Deaths','Cases'],axis='columns'))
umap_embedding = reducer.transform(time_evol.drop(['Continent','Deaths','Cases'],axis='columns'))
plt.figure(figsize=(10,7))
i=0
for index,row in time_evol.iterrows():
    #plt.annotate(s=row['countriesandterritories'],xy=(umap_embedding[i,0],[i,1]))
    continent=row['Continent']
    if continent=='Asia':
        plt.scatter(umap_embedding[i,0], umap_embedding[i,1], s=3.0, color='red')
    if continent=='America':
        plt.scatter(umap_embedding[i,0], umap_embedding[i,1], s=3.0, color='blue')
    if continent=='Europe':
        plt.scatter(umap_embedding[i,0], umap_embedding[i,1], s=3.0, color='green')
    if continent=='Africa':
        plt.scatter(umap_embedding[i,0], umap_embedding[i,1], s=3.0, color='cyan')
    if continent=='Oceania':
        plt.scatter(umap_embedding[i,0], umap_embedding[i,1], s=3.0, color='black')
    labels=['Colombia','Peru','Ecuador','Venezuela','Canada','Germany','Singapore','United_States_of_America','Chile','China']
    for label in labels:
        if index==label:
            plt.annotate(label,xy=(umap_embedding[i,0],umap_embedding[i,1]))
    i=i+1
legend_elements = [Line2D([0], [0], color='w', marker='o', markerfacecolor='red',label='Asia'),
                   Line2D([0], [0], color='w', marker='o', markerfacecolor='blue',label='America'),
                  Line2D([0], [0], color='w', marker='o', markerfacecolor='green',label='Europe'),
                  Line2D([0], [0], color='w', marker='o', markerfacecolor='black',label='Oceania'),
                  Line2D([0], [0], color='w', marker='o', markerfacecolor='cyan',label='Africa')]
plt.legend(handles=legend_elements)
plt.title('UMAP clustering',size=20)

When global features are studied, our suspicions sharpen: the well-reported cases lie closer to the (mainly) European cluster and the underreported ones closer to the (mainly) African one. Colombia is closer to the latter, suggesting a lower level of reporting and a need to estimate the uncertainty of the reports. To estimate this uncertainty, I propose a neural network that predicts the number of cases per capita from the past history of cases and deaths in each country.

In [None]:
net = torch.nn.Sequential(
                torch.nn.Linear(90, 40),
                torch.nn.ReLU(),
                torch.nn.Linear(40, 20),
                torch.nn.ReLU(),
                torch.nn.Linear(20, 10),
                torch.nn.ReLU(),
                torch.nn.Linear(10, 5),
                torch.nn.ReLU(),
                torch.nn.Linear(5, 1)
)
# Since this is not a classification problem, we need a different criterion. MSELoss measures the distance between the
# prediction and the true value
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.02) #lr: learning rate

In this neural network, the mean squared error (MSE) is used to compare the network's prediction with the target number of cases per capita for each country (the `Cases` column). Since the MSE is in squared units of cases per capita, its square root is in units of cases per capita and can be interpreted as a standard deviation from the actual value. To train the network, we pass the temporal evolution of every country and measure the squared deviation between the predicted and true values via the loss criterion.
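
Written out, for predictions $\hat{y}_i$ and true values $y_i$ over $N$ samples (these symbols are introduced here only for illustration), the two quantities are:

```latex
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
```

Each term of the sum is a squared difference, so the MSE carries squared units and the RMSE brings it back to the units of $y_i$.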

In [None]:
epochs = 100

# Inputs: the 90 per-date features (45 dates x cases and deaths); target: the 'Cases' column
target_cases=time_evol['Cases']
only_cases=time_evol.drop(['Continent','Cases','Deaths'],axis='columns')
loss_values = np.zeros((len(only_cases),epochs))

it=0
only_cases_countries=[]
for index,row in only_cases.iterrows():
    inputs = torch.tensor(row.values.astype(np.float32))
    # The target has shape (1,), matching the single output of the network
    targets = torch.tensor([float(target_cases[index])], dtype=torch.float32)
    for epoch in range(epochs):

        optimizer.zero_grad()
        out = net(inputs)
        loss = criterion(out, targets)
        loss.backward()
        optimizer.step()

        loss_values[it,epoch] = loss.item()
    only_cases_countries.append(index)
    it+=1
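
A point worth checking in a loop like this is that the target tensor has the same shape as the network output; otherwise `MSELoss` broadcasts one against the other. The following is a minimal sketch on synthetic data, not the network above: the sizes, the learning rate and the names `X`, `y` and `toy_net` are assumptions chosen for illustration. It trains on all samples at once rather than one country at a time.

```python
import torch

torch.manual_seed(0)

# Synthetic data: 32 samples with 6 features; the target is a noisy linear combination
X = torch.randn(32, 6)
true_w = torch.tensor([[0.5], [-1.0], [0.3], [0.0], [0.8], [-0.2]])
y = X @ true_w + 0.01 * torch.randn(32, 1)   # shape (32, 1)

toy_net = torch.nn.Sequential(
    torch.nn.Linear(6, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 1),
)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(toy_net.parameters(), lr=0.05)

losses = []
for epoch in range(200):
    optimizer.zero_grad()
    out = toy_net(X)                 # shape (32, 1), same as y
    assert out.shape == y.shape      # guards against silent broadcasting
    loss = criterion(out, y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f'first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}')
```

Training on the whole batch makes the result independent of the order in which countries are visited, at the cost of no longer having one loss curve per country.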

In [None]:
plt.figure(figsize=(20,20))
plt.imshow(loss_values.T)
plt.colorbar()
plt.xlabel('Countries',size=20)
plt.ylabel('Epochs',size=20)
plt.title('Loss criterion for training country by country for 100 epochs',size=24)
plt.xticks(np.arange(len(only_cases_countries)),only_cases_countries,rotation=90,size=9)
plt.show()

Although the loss keeps improving the longer we train the network, 100 epochs is a reasonable computational budget for studying the general features of the deviation for every country. It only remains to study the deviation for the particular case of Colombia.

In [None]:
# Daily per-capita series for Colombia, reversed into chronological order and accumulated
casesCol =time_evol.loc['Colombia',[s for s in time_evol.keys() if 'cases' in s]]
casesCol= np.cumsum(casesCol.to_numpy(dtype=float)[::-1])
deathsCol =time_evol.loc['Colombia',[s for s in time_evol.keys() if 'deaths' in s]]
deathsCol= np.cumsum(deathsCol.to_numpy(dtype=float)[::-1])

strDates=[]
for date in dates:
    strDates.append(str(date[0])+'/'+str(date[1]))
strDates= strDates[::-1]

# RMSE for Colombia after the last epoch; loss_values is indexed as [country, epoch]
rmseCol=loss_values[only_cases_countries.index('Colombia'),-1]**0.5

plt.figure(figsize=(14,10))
# Band of +/- 12 times the RMSE around the cumulative cases
UncertCol=casesCol+12*rmseCol
DowncertCol=casesCol-12*rmseCol

dias=np.arange(len(strDates))
plt.plot(dias,casesCol,'.')
plt.fill_between(x=dias,y1=UncertCol,y2=DowncertCol,alpha=0.2)
plt.xticks(dias,strDates,rotation=45)
plt.xlabel('Date',size=20)
plt.ylabel('Covid cases per capita',size=20)
plt.title('Confidence interval for Covid cases in Colombia',size=24)
plt.show()
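
The band in this plot follows from two small steps: accumulating the daily values into a cumulative series, and adding and subtracting a multiple of the square root of the final loss. A minimal sketch with made-up numbers follows; the arrays, the loss value and the helper `uncertainty_band` are hypothetical, and `k` plays the role of the multiplier used in the cell above.

```python
import numpy as np

def uncertainty_band(daily, final_mse, k=1.0):
    """Return (cumulative, lower, upper) for a band of +/- k * sqrt(final_mse)."""
    cumulative = np.cumsum(np.asarray(daily, dtype=float))
    half_width = k * np.sqrt(final_mse)
    return cumulative, cumulative - half_width, cumulative + half_width

daily = [1.0, 2.0, 3.0]      # made-up daily increments, oldest first
final_mse = 0.25             # made-up loss after the last epoch
cumulative, lower, upper = uncertainty_band(daily, final_mse, k=2.0)

print(cumulative)    # cumulative sums: 1, 3, 6
print(lower, upper)  # each shifted by 2 * sqrt(0.25) = 1
```

Since the half-width is constant, the band is relatively wide early on, when the cumulative counts are still small.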

We see a high level of variability and a wide range of possible values for the evolution of the system. This is a rough estimate, based on the similarity of the data across countries and on minimizing the difference between the prediction and the real value.

## Limitations

It bears saying that the method presented here only applies to determining uncertainty based on previous data; it does not aim to predict the evolution of the disease or to prove the effectiveness of public policies. A high dispersion may also affect the stability of the network, as would happen in a country with a poor healthcare system that reports cases in batches, for example week by week.
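
One common way to soften batch reporting before feeding a series to a model is a rolling mean. This is not part of the method above; the sketch below is only an illustration on made-up data, and the 7-day window is an assumption tied to the weekly batches mentioned here.

```python
import pandas as pd

# Made-up daily counts for a country that reports 70 cases once a week
daily = pd.Series([0, 0, 0, 0, 0, 0, 70] * 3, dtype=float)

# Trailing 7-day mean: each value averages the current day and the 6 before it
smoothed = daily.rolling(window=7).mean()

# The raw series is spiky, the smoothed one is flat once the window is full
print(daily.std(), smoothed.dropna().std())
```

The first six values of `smoothed` are undefined because the window is not yet full, so a smoothed series is slightly shorter than the raw one.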

## Outlook and final remarks

This is a first approximation to the problem of determining uncertainty, and it is based on the data alone. Proposing a model for the evolution of the virus, via logistic regression for instance, may help improve the calculation of uncertainty. This project has been quite fun to develop and has shown me a new perspective on how we read and interpret data. I eagerly await comments and ideas on how I can improve my estimations.
