In [None]:
!pip install pycountry_convert

In [None]:
import pandas as pd  # Load libraries
import numpy as np
import pycountry_convert as pc
import plotly.express as px
import os
import plotly
import plotly.graph_objs as go
from IPython.display import Image

## Abstract

This paper presents analysis and conclusions for several questions related to the COVID-19 pandemic. It uses datasets published on Kaggle, where the user ‘SRK’ has been updating the sources daily. The pandemic is an ongoing global problem, so research on it has been made freely available to interested researchers. However, because the pandemic is not yet over, that research is still being refined and improved. This paper analyzes six data visualizations built from data that tracks the spread of the pandemic over time across different countries. By analyzing patterns in the spread, it aims to clarify the current situation regarding COVID-19. The paper examines how the developed world has gone through the most severe stages of the pandemic and is entering the late stage that China, given its plateauing number of new infections, has apparently already reached. The implication is that there are still billions of people in the developing world (e.g., India, Africa, South America) who are susceptible to the pandemic but lack the same means to handle the virus because of poor healthcare infrastructure. The conclusion is that what has happened over the past months is unfortunate, but the real next step is for the world to come together and help those who are truly vulnerable right now.

# Table of Contents
<a id="top"></a>
1. [Introduction](#introduction)<br>
2. [Background](#background)<br>
3. [Approach](#approach)<br>
    3.1 [First Visualization](#31)<br>
    3.2 [Second Visualization](#32)<br>
    3.3 [Third Visualization](#33)<br>
    3.4 [Fourth Visualization](#34)<br>
    3.5 [Fifth Visualization](#35)<br>
    3.6 [Sixth Visualization](#36)<br>
4. [Results](#results)<br>
    4.1 [Analysis of First Visualization](#41)<br>
    4.2 [Analysis of Second Visualization](#42)<br>
    4.3 [Analysis of Third Visualization](#43)<br>
    4.4 [Analysis of Fourth Visualization](#44)<br>
    4.5 [Analysis of Fifth Visualization](#45)<br>
    4.6 [Analysis of Sixth Visualization](#46)<br>
5. [Conclusion](#conclusion)<br>
6. [References](#code)<br>
7. [Code Appendix](#code)<br>
    7.1 [Code for First Visualization](#71)<br>
    7.2 [Code for Second Visualization](#72)<br>
    7.3 [Code for Third Visualization](#73)<br>
    7.4 [Code for Fourth Visualization](#74)<br>
    7.5 [Code for Fifth Visualization](#75)<br>
    7.6 [Code for Sixth Visualization](#76)<br>

# 1. Introduction
<a id="introduction"></a>
<a href="#top">Back to top</a>

Since the virus was first discovered in Wuhan, China, it has spread globally, reaching Europe and North America. There’s a great fear that once it reaches the least developed parts of the world (e.g., South America, Africa, Southeast Asia) the consequences will be even more devastating. In just a few months, over three million people have been infected, with tens of thousands of deaths related to COVID-19. The pandemic has prompted a global effort in which citizens of all nations work together to fight this virus.

In many of its symptoms, the illness resembles the seasonal flu that appears annually, although COVID-19 is caused by a coronavirus rather than an influenza virus. What sets COVID-19 apart is that it’s especially infectious and has devastating effects on those who suffer serious symptoms. As with the normal flu, patients may develop pneumonia; the difference is that the pneumonia experienced by COVID-19 patients is particularly deadly and has led to thousands of deaths thus far throughout much of the developed world. The most common victims of COVID-19 are elderly patients whose weaker immune systems are unable to fight off the symptoms. At the same time, many people show no symptoms and overcome the virus on their own by developing antibodies.

Given the infectious nature of COVID-19, many nations around the world have legally required citizens to remain in quarantine until the virus passes. Quarantine reduces the number of patients needing medical care at any one time, so hospitals are not flooded with patients and the healthcare systems of the respective countries are not overwhelmed. In this paper, the goal is to use data visualization techniques to investigate the dataset and derive insights that give a better understanding of the pandemic’s progress.

# 2. Background
<a id="background"></a>
<a href="#top">Back to top</a>

As mentioned previously, the COVID-19 pandemic is an ongoing problem for the world. Therefore, understanding how to overcome the virus is still a hot topic in the research world. Medical communities globally are lowering barriers to accessing research related to COVID-19. Furthermore, medical teams globally are working together to find solutions to the different problems caused by the pandemic.

To deal with this issue in a practical way, research related to past pandemics will be drawn on to help understand how epidemiological techniques can be applied. It’s worth noting that the COVID-19 pandemic is significant in terms of its effect on humanity and has been called a once-in-a-lifetime event (the analogy has been made to the Spanish Flu of 1918, in which millions died).

Another resource is the YouTube creator Grant Sanderson, who is known for his ‘3Blue1Brown’ channel, which uses visualization techniques to explain mathematical concepts. One of his recent popular videos examines the exponential growth seen in epidemics using COVID-19 data. Since Sanderson has already created some interesting methods for visualizing the current pandemic, time will be spent reproducing some of his work to try to analyze further questions.

Additionally, from a Data Visualization class, many different techniques have been learned that can help to visualize datasets such as the COVID-19 dataset. The one that will be used in the end is the sunburst visualization. This method breaks down the data into hierarchies which is useful for the geographic data.

# 3. Approach
<a id="approach"></a>
<a href="#top">Back to top</a>

In this section, the different data visualization techniques will be explained and briefly shown. Six different visualizations were created, so this section explains what each of these methods is and how it works. The different visualizations are used to derive new insights about the dataset from various angles. Understanding this section makes it possible to follow the Results section, where the different methods are applied and insights are derived from the data.

## 3.1 First Visualization
<a id="31"></a>
<a href="#top">Back to top</a>

The first data visualization is a geospatial graph that shows the number of confirmed cases daily for each country. The goal of this visualization is to gain an understanding of the spread of the pandemic over land. The virus was first discovered to have been in the city of Wuhan. This city is in Hubei province, a province within the People’s Republic of China. The origins of the virus are still uncertain, but this is the location where cases were first confirmed through testing that had been developed within the country. Trying to understand the spread of the virus globally in terms of confirmed cases is a logical way to understand the severity of the pandemic over time.

Below in Figure 1, it’s possible to see the bubble map that was created in Python using the visualization package Plot.ly. The data was first changed from a wide format to a long format so that the dates would move from being columns to being rows. Initially, the bubble map gave each country a unique color and marker. With over a hundred countries this became cumbersome, so a new column was created to indicate which continent each country is in. This makes it possible to see more clearly how the outbreak began in Asia and spread west towards Europe and eventually North America. As time has gone by, the pandemic has also begun to move southwards towards the developing nations in South America and Africa.

In [None]:
Image(filename='../input/data-viz-files/screenshot_visualization1.JPG')

*Figure 1. Bubble map of daily confirmed cases by country. The geospatial graph colors countries according to their respective continent. The visualization is an animation where the size of each bubble varies by the number of confirmed cases on a given day.*
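
The reshaping and continent-labeling steps described above can be sketched as follows. This is a minimal illustration rather than the notebook’s actual code: the tiny data frame, its made-up counts, and the `CONTINENT_BY_COUNTRY` lookup are assumptions that stand in for the Kaggle file and for the `pycountry_convert` lookups.

```python
import pandas as pd

# Hypothetical miniature of the wide time-series file: one row per country,
# one column per date. The counts are made up for illustration.
wide_df = pd.DataFrame({
    "Country/Region": ["China", "Italy", "US"],
    "1/22/20": [10, 0, 1],
    "1/23/20": [20, 0, 1],
})

# Wide -> long: every date column becomes rows, giving one row per country per day.
long_df = wide_df.melt(
    id_vars="Country/Region", var_name="Date", value_name="Confirmed"
)
long_df["Date"] = pd.to_datetime(long_df["Date"], format="%m/%d/%y")

# Assumed stand-in for the continent lookup: a small hand-written mapping.
CONTINENT_BY_COUNTRY = {"China": "Asia", "Italy": "Europe", "US": "North America"}
long_df["Continent"] = long_df["Country/Region"].map(CONTINENT_BY_COUNTRY)
```

In the long format each row is a single (country, date) observation, which is the shape an animated bubble map needs: one frame per date and one color per continent.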

## 3.2 Second Visualization
<a id="32"></a>
<a href="#top">Back to top</a>

Another way to analyze the growth of the spread of the pandemic is to plot the growth in number of cases over time in a line chart. It would also be helpful to compare side-by-side the growth in number of cases for different countries. Due to the large number of countries, a technique to manage this is to subset the countries so that only countries with a minimum number of cases are included. The value for this minimum number of cases is 50,000. As of May 12th, the dataset pulled from Kaggle includes fifteen countries that fit into this category. Another important factor for comparing these countries side-by-side on a line chart is to consider how the spread of the virus to each of the countries had different start dates. Furthermore, at times the number of cases could be quite low, for example less than ten for a long period of time before testing starts to count the cases that were previously unknown. Therefore, it also makes sense to include a lag of for example ten days after the first confirmed case.

Below in Figure 2 is a line chart created in Python using the visualization package Plot.ly. It shows the number of confirmed cases against the lag days for countries with at least 50,000 confirmed cases. The lag days represent the number of days that there have been confirmed cases in the country, minus ten (the number of days to lag the count). For example, China has had confirmed cases since the first day of the dataset, so it has had confirmed cases for a total of 111 days. Subtracting ten gives 101 lag days, the maximum number of days in the line chart below. Another country, Italy, has a maximum lag day of 92, so the line for that country isn’t as long. This number implies that Italy’s first confirmed case occurred 102 days before. The inspiration for this idea of creating lags comes from online sources; for example, Johns Hopkins University has not only a dashboard for the number of cases on a geospatial map, but also a page that shows numerous different Plot.ly visualizations (including one similar to Figure 2).

In [None]:
Image(filename='../input/data-viz-files/screenshot_visualization2.JPG')

*Figure 2. A line chart of the number of confirmed cases against lag days (a feature explained above). The countries with at least 50,000 confirmed cases have been subset from the original dataset to avoid flooding the visualization with too much information.*
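
The subsetting and lag-day logic behind this chart can be sketched as below. It is a hedged illustration, not the original code: the column names `Country`, `Date`, and `Confirmed`, the helper `add_lag_days`, and the choice to start the chart at lag day 1 are all assumptions.

```python
import pandas as pd

MIN_CASES = 50_000  # threshold from the text
LAG = 10            # days to skip after the first confirmed case

def add_lag_days(long_df, min_cases=MIN_CASES, lag=LAG):
    """Keep countries that reach min_cases and index their days by lag days."""
    df = long_df.sort_values(["Country", "Date"]).copy()
    # Keep only countries whose largest cumulative count meets the threshold.
    peak = df.groupby("Country")["Confirmed"].transform("max")
    df = df[peak >= min_cases]
    # Drop days before the first confirmed case, then number the rest 1, 2, 3, ...
    df = df[df["Confirmed"] > 0].copy()
    df["DaysWithCases"] = df.groupby("Country").cumcount() + 1
    # Lag days = days with confirmed cases minus the lag (assumed to start at 1).
    df["LagDays"] = df["DaysWithCases"] - lag
    return df[df["LagDays"] >= 1]

# Toy data with a small threshold and lag so the effect is easy to see.
demo = pd.DataFrame({
    "Country": ["A"] * 5 + ["B"] * 5,
    "Date": list(pd.date_range("2020-01-22", periods=5)) * 2,
    "Confirmed": [0, 1, 5, 50, 200, 0, 0, 1, 2, 3],
})
result = add_lag_days(demo, min_cases=100, lag=2)
```

With the document’s settings, a country with 111 days of confirmed cases ends at lag day 101, matching the China example above.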

The country with the largest number of cases, at over 1.3 million, is the United States of America. This large value makes it difficult to see the other countries with fewer cases in proper perspective. However, Plot.ly’s chart allows users to zoom in on any section of the chart, so this is no longer an issue. Figure 3 below highlights this feature: the chart is zoomed in on the other countries so that it’s possible to compare their daily growth in number of cases more easily.

In [None]:
Image(filename='../input/data-viz-files/screenshot_visualization2b.JPG')

*Figure 3. The line chart allows for zooming in on areas and so this image shows all the countries up close after the line representing the United States of America is left out of the image. This feature makes it possible to compare more realistically the other countries and their growth in number of cases.*

## 3.3 Third Visualization
<a id="33"></a>
<a href="#top">Back to top</a>

The third visualization is like the previous visualization. The only difference is that the number of confirmed cases has been transformed with the natural logarithm function. When analyzing data such as the COVID-19 confirmed cases, it’s helpful to transform the values so that they’re easier to visualize. Data such as this tends to follow an exponential growth rate, so taking the natural logarithm scales the data so that it appears more linear. Later, this relationship between exponential growth and the logarithm will be examined further.

Figure 4 below shows a graph like the previous two graphs; the only difference is that the confirmed cases variable has been transformed via the natural log transformation. The lag days concept is identical, so that variable remains the same in the plot below. Without further analysis, it’s interesting to see the data scaled in such a manner, but it’s not particularly intuitive to most people what this implies. Later visualizations show similar plots, accompanied by deeper analysis of the scaled data.


In [None]:
Image(filename='../input/data-viz-files/screenshot_visualization3.JPG')

*Figure 4. The graph shows the confirmed cases after being transformed by the natural logarithm against the lag days. The countries have been subset so that only those with at least 50,000 confirmed cases are shown.*

## 3.4 Fourth Visualization
<a id="34"></a>
<a href="#top">Back to top</a>

The fourth visualization is based on some concepts explained by Grant Sanderson from 3Blue1Brown. The following few visualizations will revolve around ideas that he elaborates on regarding exponential growth and the pandemic. In the third visualization, the number of confirmed cases was transformed using the natural logarithm. In this presentation, it will become more apparent the reason why this was done.

In Sanderson’s YouTube video, he explains how the number of COVID-19 cases increases with exponential growth. Under this kind of growth, each day’s number of cases is the previous day’s number multiplied by some value slightly over 1. In other words, for every person that is infected, they infect $1.x$ new people, where $x>0$. This is a natural phenomenon seen in different environments, so it makes sense to perform some modeling based on this characteristic. To be more specific, imagine that the number of cases is,

$$\exp(1.5x+3),$$

where $x$ is the day number. Then on day 1 the log number of cases is 4.5 (about 90 cases), on day 2 it is 6 (about 403 cases), on day 30 it is 48, etc. Transforming the number of cases with the natural logarithm gives $1.5x+3$, the formula for a straight line. Using simple linear regression, it’s possible to see how well the log-transformed data fits a straight line by analyzing the $R^2$ value. Admittedly, this is somewhat of an ad-hoc trick and there are more complex methods for determining what sort of exponential growth the data is going through.
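
The worked example above can be checked numerically. The sketch below uses NumPy’s `polyfit` as the regression routine, which is an assumption made for illustration; the original analysis may have used a different tool.

```python
import numpy as np

days = np.arange(1, 31)             # day 1 through day 30
cases = np.exp(1.5 * days + 3)      # the illustrative model from the text
log_cases = np.log(cases)           # equals 1.5 * days + 3

# Simple linear regression of log(cases) on the day number.
slope, intercept = np.polyfit(days, log_cases, deg=1)

# R^2 = 1 - SS_res / SS_tot for the fitted line.
fitted = slope * days + intercept
ss_res = np.sum((log_cases - fitted) ** 2)
ss_tot = np.sum((log_cases - log_cases.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
```

Because the model is exactly exponential, the fit recovers a slope of 1.5 and an intercept of 3 with an $R^2$ of essentially 1. Real case counts are noisier, so their $R^2$ values fall below 1.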

Below in Figure 5 is a grid of the fifteen countries with at least 50,000 confirmed cases as of May 11th, 2020. The confirmed cases have been log-transformed and the days have been changed to lag days. As in the previous visualization, a lag of 10 days after the first confirmed case is used for each country. The bottom of each plot shows the Adjusted $R^2$ value for the fitted line derived using simple linear regression. Values that are high $(\geq 0.9)$ indicate that the relationship between the log number of cases and lag days is highly linear. Examples of this can be seen in Russia, United Kingdom, United States of America, and Canada with Adjusted $R^2$ values of 0.9686, 0.9277, 0.9312, and 0.9446, respectively.

In [None]:
Image(filename='../input/data-viz-files/screenshot_visualization4.JPG')

*Figure 5. A 3x5 grid of the fifteen countries with at least 50,000 confirmed cases. The grid plots the log number of cases against the lag days. A linear regression line is fitted to each of the plots and their corresponding Adj. R-squared values are shown at the bottom of each plot.*

One way to see how this method can be effective is to look at a case where it seems not to work. China has an Adj. $R^2$ of 0.3776, an extremely low value indicating a non-linear pattern in the data. The right-hand side of Figure 6 below shows a subset of China’s data covering only the first 20 days since infections began. In this plot, the Adj. $R^2$ is 0.9459, so the data is much more linear than before. Over time, as the growth ceases to be exponential, the data begins to appear sigmoidal rather than exponential. This sigmoidal appearance is evident on the left-hand side of Figure 6, which shows China’s plot from Figure 5. This concept is elaborated on in the next visualization. The contrast shows how the spread of the pandemic can be understood in exponential terms for some time before it begins to plateau into the next stage, which is sigmoidal.

In [None]:
Image(filename='../input/data-viz-files/screenshot_visualization4b.JPG')

*Figure 6. On the left-hand side is the plot of China’s log confirmed cases with 10 lag days from Figure 5. On the right-hand side is a subset of China’s data showing only the first 20 days of confirmed cases. Applying simple linear regression to this subset shows a much more linear appearance in the data.*
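
The contrast in Figure 6 can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions: the logistic curve and its parameters (plateau of 80,000, steepness 0.25, midpoint on day 25) are hypothetical, and `adjusted_r2` is a helper written for this example rather than the function used for the figures.

```python
import numpy as np

def adjusted_r2(x, y):
    """Adjusted R^2 of a simple linear regression of y on x (one predictor)."""
    n = len(x)
    r2 = np.corrcoef(x, y)[0, 1] ** 2        # with one predictor, R^2 = r^2
    return 1 - (1 - r2) * (n - 1) / (n - 2)  # penalty for the single predictor

# Hypothetical logistic outbreak: early growth is close to exponential,
# then the curve flattens out near its plateau.
days = np.arange(0, 101)
cases = 80_000 / (1 + np.exp(-0.25 * (days - 25)))
log_cases = np.log(cases)

adj_r2_full = adjusted_r2(days, log_cases)              # whole series
adj_r2_early = adjusted_r2(days[:20], log_cases[:20])   # first 20 days only
```

On this synthetic series the early window is nearly a straight line on the log scale, while the full series bends into a plateau and fits a line much less well. That is the same qualitative pattern as the two China panels.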

## 3.5 Fifth Visualization
<a id="35"></a>
<a href="#top">Back to top</a>

The fifth visualization expands on the idea of exponential growth. In exponential growth, the count keeps growing each day by a factor of $1.x$, where $x>0$. However, after some period the value of $x$ should start to shrink until the factor becomes less than 1. When modeling pandemics, the following formula is used (in the Figure 8 plot it’s called the growth rate),
$$\text{Growth factor}=\frac{(\text{New cases on current day})}{(\text{New cases on previous day})}.$$

At the beginning of the pandemic, this growth factor starts at around 1 and rises by some decimal points; as the curve approaches the inflection point of the sigmoidal curve, the growth factor falls back to 1. After the inflection point, the growth factor drops below 1 as the growth slows. This sigmoidal curve has an “S” shape, and the inflection point is in the middle. Below in Figure 7 is an image of the sigmoidal curve that models the growth in number of cases.


In [None]:
Image(filename='../input/data-viz-files/screenshot_visualization5b.jpg')

*Figure 7. Image of sigmoidal curve that models the spread of a pandemic. The focus is on the growth rate which begins to decrease at the point of inflection. (Source: nctm.org)*
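
The link between the growth factor and the inflection point can be illustrated with a synthetic sigmoidal curve. The logistic form and the parameter values below are assumptions chosen for the sketch; the midpoint is placed between two days so that the peak day is unambiguous.

```python
import numpy as np

# Hypothetical logistic curve: final size, steepness, and midpoint are made up.
final_size, steepness, midpoint = 80_000, 0.25, 40.5
days = np.arange(0, 81)
total = final_size / (1 + np.exp(-steepness * (days - midpoint)))

new_cases = np.diff(total)                      # new cases on days 1..80
growth_factor = new_cases[1:] / new_cases[:-1]  # today / yesterday, days 2..80

# Day with the most new cases, i.e. the day on which the curve inflects.
peak_day = int(np.argmax(new_cases)) + 1
```

Before the peak day every growth factor is above 1, and afterwards every one is below 1. The cumulative total also crosses half of its final size on the peak day, which is why the inflection point sits in the middle of the “S”.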

The idea behind the fifth visualization is to plot the growth factor in order to monitor the growth rate for each country. By plotting the growth factor as a vertical line for each day, with the days on the $x$-axis, it’s possible to see the growth factor for each day in a country. The red lines indicate when the previous day had 0 new cases, which defaults the growth rate to infinity since it’s dividing by 0 in RStudio. Also, on certain days the growth factor is too large and the plots lose perspective, so the growth rate is capped at 5 for each of the plots.

Figure 8 below shows the fifth visualization, which plots the growth rate against the days of cases (without the lag). It is a grid of growth rate plots for all fifteen countries with at least 50,000 confirmed cases. For many of the countries, the growth in number of cases didn’t start until some time after China first experienced cases. Therefore, they have a growth rate of 0 for long periods of time. The plots show that countries experience a growth factor above 1 for some period. This means that for each consecutive day, the number of new cases is slightly greater than the previous day’s number of new cases. Relating this to Figure 7, this indicates that they’re still in the first half of the pandemic. Once the growth factor starts to level out at 1, it implies that they’ve passed the inflection point and the number of new cases is beginning to slow down.

In [None]:
Image(filename='../input/data-viz-files/screenshot_visualization5.JPG')

*Figure 8. A grid of the fifteen countries with over 50,000 confirmed cases and their growth rates. The red lines signify that the growth rate is infinite due to a division by 0 calculation in RStudio. The plot is also capped at 5 so that the perspective is not too heavily skewed by certain outlier dates.*
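
The growth factor calculation, including the division-by-zero flag and the cap at 5, can be sketched in Python as follows. The original plots were produced in R, so this is a re-expression under assumptions: the helper name `growth_factor`, the treatment of 0/0 as a growth factor of 0, and the assumption that cumulative counts never decrease are all choices made for this example.

```python
import numpy as np

def growth_factor(cumulative, cap=5.0):
    """Return (capped growth factors, flags for days that divided by zero)."""
    cumulative = np.asarray(cumulative, dtype=float)
    new_cases = np.diff(cumulative, prepend=0.0)   # new cases per day
    today, yesterday = new_cases[1:], new_cases[:-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = today / yesterday                  # x/0 -> inf, 0/0 -> nan
    undefined = np.isinf(ratio)                    # the red-line days
    # Assumption: 0/0 (no cases on either day) is shown as a growth factor of 0.
    ratio = np.nan_to_num(ratio, nan=0.0, posinf=cap)
    return np.minimum(ratio, cap), undefined

# Toy cumulative counts, made up for illustration.
factors, undefined = growth_factor([0, 0, 2, 4, 10, 50, 50, 55])
```

Returning the flags separately lets a plotting step color the infinite days red while still drawing them at the capped height.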

## 3.6 Sixth Visualization
<a id="36"></a>
<a href="#top">Back to top</a>

The sixth visualization is a sunburst that allows for a better understanding of the hierarchical structure of the dataset. The idea is to understand how the number of cases is distributed by geographic region. The first visualization performed a similar task, but it wasn’t simple to determine the relative number of cases by continent since the bubbles in the bubble map are spread far apart. The sunburst visualization gives a radial view of the number of confirmed cases, breaking down the data’s hierarchy from continent to country or region. This simpler view of the geographic spread allows for interpretations such as how the spread is largely confined to developed countries such as European countries, the United States of America, and some Asian countries (which may be developing but still have comparatively more developed infrastructure).

The sunburst visualization of the dataset can be seen below in Figure 9. As mentioned before, it’s apparent that the developed countries with high levels of movement of people across borders are the ones with the most infected. Poorer countries, where there is less movement across borders, have far fewer cases. Africa and South America are of concern because of their poorer healthcare infrastructure and their governments’ potential inability to handle the pandemic in the way that more developed countries have.


In [None]:
Image(filename='../input/data-viz-files/screenshot_visualization6.JPG')

*Figure 9. A sunburst visualization of the confirmed cases. The hierarchy of the sunburst is to go from Continent to Country / Region to Province / State.*

Looking closely at Figure 9, it’s apparent that there are many small slices of the visualization that are difficult to analyze. This visualization however has a feature that makes it possible to click on a level within the hierarchy, for example, North America, and it will allow the user to zoom in to see the different pieces more clearly. This can be seen below in Figure 10.

In [None]:
Image(filename='../input/data-viz-files/screenshot_visualization6b.JPG')

*Figure 10. The above shows the same visualization as in Figure 9, except the ‘North America’ level of the data’s hierarchy is selected. This allows the user to zoom in on successive levels of the dataset to see more clearly the pieces that may be too small to notice otherwise, because the size of each slice scales with its number of cases.*

The sunburst visualization was created in Python using Plot.ly. Furthermore, it used an alternate dataset from the series of datasets that were part of the database on Kaggle. This dataset had better details related to the variables of: Province/State, Country/Region, and Confirmed. This dataset is called ‘covid_19_data.csv,’ while previously a modified version of ‘time_series_covid_19_confirmed.csv’ was used. Furthermore, to create this visualization, some files were moved back and forth between Python and R so that data could be wrangled properly.
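
A minimal sketch of the hierarchy behind the sunburst is shown below. The rows and counts are made up, the `Continent` column is assumed to have been added beforehand, and `level_totals` and `make_sunburst` are hypothetical helpers; only the Plotly Express call `px.sunburst` with its `path` and `values` arguments is the library’s own interface.

```python
import pandas as pd

# Hypothetical rows shaped like 'covid_19_data.csv' plus a Continent column.
df = pd.DataFrame({
    "Continent": ["North America", "North America", "North America", "Asia"],
    "Country/Region": ["US", "US", "Canada", "China"],
    "Province/State": ["New York", "California", "Ontario", "Hubei"],
    "Confirmed": [300, 60, 20, 70],
})

PATH = ["Continent", "Country/Region", "Province/State"]

def level_totals(data, depth):
    """Sum confirmed cases at a given depth of the hierarchy (1 = continent)."""
    return data.groupby(PATH[:depth])["Confirmed"].sum()

def make_sunburst(data):
    """Build the sunburst; Plotly is imported here so the totals work without it."""
    import plotly.express as px
    return px.sunburst(data, path=PATH, values="Confirmed")
```

Each ring of the sunburst corresponds to one depth of `PATH`, and the angular size of a slice is its total from `level_totals`. Clicking a slice in the rendered figure re-roots the chart at that level, which is the zoom behavior shown in Figure 10.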

# 4. Results
<a id="results"></a>
<a href="#top">Back to top</a>

In this section, the different data visualization techniques will be analyzed more closely. The visualizations will be broken down to attempt to find interesting insights about the data based on the methods described above. This will often involve screenshots of the visualizations and drawing on them with Paint3D. These images will be accompanied by text to help explain the analysis.

## 4.1 Analysis of First Visualization
<a id="41"></a>
<a href="#top">Back to top</a>

The goal of the first visualization is to analyze the spread of the pandemic through time and space. In other words, the idea is to find out where and when the virus has been since it was first discovered in the city of Wuhan within Hubei, China. Looking at Figure 11 below, it’s apparent that up until 3/6/2020, the virus was mainly restricted to within the borders of China. Less noticeable are the countries of South Korea, Iran, and Italy, with between 4,000 and 7,000 cases each. It’s quite logical that South Korea would be so susceptible since it is a nearby neighbor. However, a possible reason why Iran and Italy also have large numbers of infected is cross-border travel between those countries and China. It’s possible that some infected people had traveled back and forth between Wuhan and Iran or Italy.

In [None]:
Image(filename='../input/data-viz-files/11.jpg')

*Figure 11. The screenshot shows the bubble map of confirmed cases on 3/6/2020. At this point, the spread of the virus is largely contained within China. There are other countries with notable amounts of cases such as South Korea, Iran, and Italy. These countries are highlighted using arrows and text in the image.*

Figure 12 below shows the bubble map on 3/23/2020, roughly 3 weeks after the previous screenshot. By this time, the spread has gotten more serious in Europe and the United States, with major countries like the United States, United Kingdom, France, Spain, and Germany experiencing tens of thousands of cases. The spread in the previously infected countries is also increasing. However, it’s interesting to see that Iran has increased much more rapidly, reaching over 23,000 infected, while South Korea has only reached about 9,000 infected. One logical reason for the difference in the increase of cases is that South Korea has a significantly more developed economy. Furthermore, besides the government’s ability to implement useful prevention measures, the culture is quick to adopt habits such as wearing masks and social distancing to help slow the spread of infections.

Although the number of cases is still comparatively low, Japan and Malaysia have broken the threshold of one thousand cases. Both countries are relatively close to China. Of special concern is Malaysia, which has a less sophisticated healthcare system and thus is likely less prepared to handle the effects of the pandemic spreading throughout its society. It’s logical to think that from Italy, the pandemic could’ve easily spread to other European countries due to the heavy traffic flow between the states within the European Union. The spike in number of cases for the United States is also worrisome due to the similar danger of heavy traffic flow between major economic hubs across large distances (e.g., infected people traveling from New York to California).


In [None]:
Image(filename='../input/data-viz-files/12.jpg')

*Figure 12. A screenshot of countries that are beginning to develop a larger number of confirmed cases. This screenshot is taken from the bubble map on 3/23/2020, roughly three weeks after the previous screenshot. It shows how other developed countries have come to have a more significant number of confirmed cases. The developed countries include: United States, United Kingdom, France, Spain, and Germany. Furthermore, two other countries near China, Japan and Malaysia, are starting to develop more cases.*

The screenshot below in Figure 13 shows the number of cases per country on 4/1/2020. By this time, just over a week after the previous screenshot, the cases have spread to most of the developed world. The major European countries are now near or over 100,000 confirmed cases. The United States is now the country with the most quickly increasing number of cases, at over 630,000. China peaked some time ago and has remained at around 80,000 cases.

Evident now are other countries that are beginning to develop many cases. The countries of Brazil, Turkey, India, and Russia now have tens of thousands of cases. These countries are poorer than the developed western countries of the United States, United Kingdom, France, Germany, etc. Therefore, there’s now a significant concern over whether they can handle the combination of a large population and a highly infectious virus. Australia is now in the thousands and is starting to look like it’ll grow quicker too. However, it’s an island country, so it’s difficult to know how the spread of the virus will take place there given its unique geographical characteristics. Another island nation with an increasing number of cases is the Philippines; it has a comparatively weaker health infrastructure, so the concern is even greater. A major worry is South Africa, with the number of confirmed cases in the 2,000 range. Egypt (not shown) has a similar number of confirmed cases; however, it’s quite close to Europe, which has had an exploding number of confirmed cases. South Africa is the greater worry because it is possibly a sign that the rest of the continent is also not safe, and there’s a great risk that many extremely impoverished communities will fall victim to the virus without the means to overcome the pandemic.

In [None]:
Image(filename='../input/data-viz-files/13.jpg')

*Figure 13. A screenshot of countries as the number of cases has not only spread, but is also growing to numbers in the hundreds of thousands. More distant parts of the world relative to China are now developing large numbers of cases. This screenshot represents the number of cases on 4/1/2020.*

The final screenshot from the bubble map can be seen below in Figure 14. It shows the latest update on the spread of the pandemic across countries, on 5/11/2020. As of this point in time, the number of confirmed cases in the developed western world (e.g., European countries and the United States) is skyrocketing. Cases in major European countries (e.g., U.K., France, Spain, etc.) are in the hundreds of thousands. The United States has over one million confirmed cases. Cases in some Asian countries are steadily increasing, and of particular note is India. China and India are both countries with equally enormous populations and developing infrastructure. However, China managed to stem the spread of the virus early on by imposing a lockdown on the epicenter. India, on the other hand, is in a position like the western world, where they knew of the pandemic but took time to act. With a population of over a billion citizens, there’s a real danger that the current count in the 70,000 range could grow to incredibly high levels, as such a dense population seems especially susceptible to the effects of a highly infectious virus.

Highlighted in Figure 14, however, are countries within the Caribbean, South America, and Africa. The number of cases in these areas is now growing into the thousands and tens of thousands. This is the position that western countries previously found themselves in, and given the fate of those areas it’s possible that these more remote parts of the world will face a similar future. The countries previously mentioned have had the means to obtain medical supplies on their own, but these countries are poor and will require the assistance of international organizations to help them in their battle against the pandemic. Furthermore, South America and Africa have large populations. Although the flow of people and goods may be slower due to weaker infrastructure, if the virus manages to reach every region, these areas lack the means to overcome it. It’s problematic if the developed world can overcome the pandemic, but the southern countries are left to be devastated by the virus for years to come.

In [None]:
Image(filename='../input/data-viz-files/14.JPG')

*Figure 14. The latest screenshot from the dataset on 5/11/2020. It shows that as the number of cases in the developed world skyrockets into the hundreds of thousands, the pandemic is reaching the undeveloped countries. Currently, countries in the Caribbean, South America, and Africa are now seeing cases in the thousands and tens of thousands. This pattern was seen before in the developed world and if it continues the consequences can be devastating.*

## 4.2 Analysis of Second Visualization
<a id="42"></a>
<a href="#top">Back to top</a>

The second visualization is the line chart that includes a lag of 10 days after the first infection for each of the countries. Furthermore, the countries have been subset to only include those with at least 50,000 confirmed cases as of 5/11/2020. The first part of the analysis is to look at China, the country where the virus was first detected. Below in Figure 15, the arrow shows a large spike in the number of cases. However, this spike is associated with the count beginning to include clinical diagnoses, where patients are confirmed through a doctor’s observation rather than through a laboratory test. Before this time, the number of confirmed cases was already jumping sharply to over 40,000 within 10 lag days. However, China is known to be the country that quickly implemented some of the strictest measures to stop the virus, including placing a lockdown on the city of Wuhan only days after the seriousness of the virus became known to authorities. Furthermore, medical personnel from across the country entered Wuhan to help manage the confirmed cases, and temporary hospitals were quickly set up to handle the situation. It could be said that the country had learned lessons from the previous SARS outbreak and so was better prepared.

Shortly after this timely reaction to the virus by the Chinese government, the number of cases could be seen to begin plateauing. By around 30 lag days (40 days since the first confirmed case), the total number of confirmed cases hits 80,000 and the growth in the number of cases rapidly slows down. It could then be said that the point at which the clinically diagnosed cases were included in the confirmed cases, at around 10 lag days, is possibly the inflection point for the spread of the virus through the country. By this point, roughly half of those who would eventually be infected had already been infected, and the other half would be infected over the following period (refer to Fig. 7). This pattern is highlighted using the gray S-shaped line that represents the sigmoid curve of a pandemic. The only concern now for China is to ensure that a second outbreak of the virus doesn’t place the country back in danger.
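The relationship between the inflection point and the plateau can be illustrated with a small sketch. It assumes that the S-shaped curve is a logistic function, which is one common choice of sigmoid; the parameter values (`plateau`, `rate`, `midpoint`) are hypothetical and were picked only so the numbers resemble the scale discussed above, not fitted to the dataset.

```python
import numpy as np

def logistic(t, plateau, rate, midpoint):
    """Logistic (sigmoid) curve giving cumulative cases at time t."""
    return plateau / (1.0 + np.exp(-rate * (t - midpoint)))

# Hypothetical parameters for illustration only (not fitted to real data)
plateau, rate, midpoint = 80_000, 0.25, 10

days = np.arange(0, 61)
cumulative = logistic(days, plateau, rate, midpoint)
daily_new = np.diff(cumulative)  # new cases between consecutive days

# At the inflection point the curve sits at half of its plateau
half_level = logistic(midpoint, plateau, rate, midpoint)

# Daily new cases are largest in the days around the inflection point
peak_day = int(np.argmax(daily_new))

print(f"level at inflection: {half_level:.0f} of {plateau}")
print(f"day with the most new cases: {peak_day}")
```

Under this assumption, the day with the largest number of new cases coincides with the midpoint of the curve, which is why the inflection point can be read as the half-way mark of an outbreak.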


In [None]:
Image(filename='../input/data-viz-files/15.JPG')

*Figure 15. In the above screenshot, the country of China is highlighted out of the fifteen countries. An arrow shows a spike in cases caused by the inclusion of clinically diagnosed cases. The gray S-shaped curve shows the sigmoid curve of a pandemic.*

Based on the lag days, it’s possible to compare countries that have a similar projection. Normally, there’d be a significant lag between countries that have a similar trajectory, since the virus first entered the countries on different days. By creating a lag of 10 days after the first confirmed case and counting from there, it’s possible to see the growth in the number of cases side by side in a logical manner. An example can be seen in Figure 16, where the following countries are shown: Spain, Italy, Brazil, United Kingdom, Germany, France, and Peru. China is included as a reference point to maintain perspective on the growth in the number of cases. It was shown previously that the S-shaped sigmoidal curve could be seen in China’s growth pattern. The same half-opaque gray line can be seen in Figure 16. It shows that the European countries are following a similar pattern, with both Brazil and Peru also possibly on the same path. Two parallel red lines show the channel where the countries enter this growth phase; eventually, they must exit the channel if they are to follow the S-shaped trend.
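The lag-day alignment described above can be sketched in pandas as follows. This is a minimal illustration rather than the code used to build the chart: the column names (`country`, `date`, `confirmed`) and the toy values are assumptions, and the offset is shortened from the paper's 10 days to 2 so that the small example is not emptied.

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions, not the dataset's own
df = pd.DataFrame({
    "country": ["A"] * 5 + ["B"] * 5,
    "date": list(pd.date_range("2020-01-22", periods=5))
            + list(pd.date_range("2020-02-15", periods=5)),
    "confirmed": [0, 1, 3, 8, 20, 0, 0, 2, 5, 9],
})

LAG_OFFSET = 2  # the paper uses 10 days; shortened for the toy data

def add_lag_days(frame, offset):
    """Count days since each country's first confirmed case, shifted by offset."""
    cases = frame[frame["confirmed"] > 0].copy()
    # Date of the first confirmed case, repeated on every row of that country
    first_case = cases.groupby("country")["date"].transform("min")
    cases["lag_day"] = (cases["date"] - first_case).dt.days - offset
    # Keep only the days at or after the offset
    return cases[cases["lag_day"] >= 0]

aligned = add_lag_days(df, LAG_OFFSET)
print(aligned[["country", "lag_day", "confirmed"]])
```

After this step both countries start at lag day 0 even though their first cases were weeks apart, so their curves can be plotted on a shared x-axis.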

The plot shows that Italy and Spain, the two European countries most affected by the pandemic, are at the upper end of the channel but are still following the curve to eventually exit the channel. At the bottom of the channel are Germany and France. It seems that they’ve had a comparatively slower spread of the virus and so they can begin plateauing at a lower level. It’s worth noting that this is still about twice the level at which China began to plateau (160,000 compared to 80,000). In the center of the parallel trend channel is the United Kingdom. Looking carefully, it seems to be trending in a slightly different S-shaped pattern than the previously mentioned European countries. The other countries seem to be heading towards a plateau, but the United Kingdom still seems to be moving upwards relative to the S-shaped sigmoid curve. It’s possible that it could begin to plateau later than the others. A reference for the United Kingdom is France, whose trend began at a similar point in time (on the lag day scale). However, France has a less steep curve, while the United Kingdom has a slightly steeper curve. This difference has a greater effect over a longer period, and so it’s worth noting that the United Kingdom is perhaps going to be a bit slower to enter the plateau phase.

To take a second look at Brazil, it’s interesting that the country seems to be following the same S-shaped trajectory as the major European countries, even though Brazil is a major country in South America. In terms of economic development, it’s behind the European powers in GDP per capita (\\$8,000 range versus \\$30-40,000 range, based on a Google search). It could also be said that the healthcare infrastructure there doesn’t compare to the modernized hospitals of the European countries. So, although Brazil seems to be on a similar trend, this is perhaps an underestimate of the true numbers that Brazil may face in the future. Peru, on the other hand, is a comparatively smaller country with less than a quarter of the population of Brazil. So although it currently remains in the same channel, the fact that there are fewer people who could become infected makes it plausible that Peru will eventually find its own path.

In [None]:
Image(filename='../input/data-viz-files/16.JPG')

*Figure 16. In the above screenshot, the countries with a similar S-shaped curve projection are highlighted. These countries include the European powers of Spain, Italy, United Kingdom, Germany, and France. The South American countries of Brazil and Peru also seem to potentially be on a similar projection. The country of China is used as a reference point for the scale of the number of confirmed cases. The gray line represents the S-shaped sigmoid curve and the parallel red lines represent the channel in which countries would trend while moving along the sigmoid curve.*

The next screenshot (Figure 17) shows the other countries not included in Figure 16 (apart from the United States). These countries follow a different trajectory than what was seen before with the European countries. These countries include: Turkey, Russia, Iran, Canada, Belgium, and India. Like before, China is included as a reference point for the scale of confirmed cases. For all these countries, a gray line shows the potential S-shaped trajectory that they may be on.

Both Turkey and Iran seem to be on a steady trend to reach a plateau at some level possibly not too far from where they are currently. In other words, they may have already reached their inflection points and are on the second half of their journey through the pandemic. It seems that their experience with the pandemic is being controlled and they may meet a similar end as France and Germany where they plateau at a similar level of around 160,000 cases.

An interesting exception is Russia, a neighbor of China to the north. The trajectory of Russia seems to be sudden and steep. For many days, the country had few cases, in contrast to what was seen in other European countries. Geographically, however, Russia is quite different from the other European countries. Moscow, in the western part of Russia where the outbreak is serious, is further east than countries such as Germany, France, and Italy. Perhaps this distance could be a reason why Russia was late to experience a growth in cases. Of concern, however, is the steep trajectory that the country is experiencing. If this growth continues without some sort of intervention, it could be dangerous for the country’s healthcare infrastructure, as there’d be concern over whether its hospitals would be flooded.

The country at the bottom is Canada, and it seems that its growth is slow but steady. Canada is comparatively quite remote, located across the Pacific in North America, and its large land area with many sparsely populated regions possibly makes for a slow-spreading pandemic. This is a good sign for its healthcare system, as there’s less of a chance of it being overwhelmed with cases. However, it’s important for the country to maintain protective measures so that the number of cases eventually plateaus without dragging on for too long. Another worry for Canada is the fact that it’s the neighbor of the United States to the north. The United States is currently the country with a skyrocketing number of confirmed cases, and although the border has been temporarily closed, if these measures are relaxed there’s the risk that Canada would soon become overwhelmed.

Belgium seems to be in a similar situation to Peru, where due to its smaller population the growth is limited by the number of people who can become infected. Looking at this country, it seems that it’s already on the path towards the end stage of the pandemic, similar to where China is currently.

India is an interesting case here, as it's currently near the bottom, but looking at its slope it seems to be increasing quickly with time. With such a large population and the pandemic not being confined to a single area, as it was in the case of Hubei province in China, there seems to be a danger that exponential growth in such a large country could continue for a longer period than in other places. A larger number of people who are vulnerable to infection implies that the growth could become much stronger in comparison to other countries.

In [None]:
Image(filename='../input/data-viz-files/17.JPG')

*Figure 17. The above screenshot shows the countries of Turkey, Russia, Iran, Canada, Belgium, and India (with China for reference). Each of their prospective S-shaped sigmoid curve trajectories is shown.*

The last screenshot for analysis from the line chart concerns the country with the most cases, the United States of America. Figure 18 below shows the comparison of confirmed cases against lag days for all fifteen countries with at least 50,000 confirmed cases. A straight gray line is drawn through the line that represents the United States. It’s interesting to note that while other countries tended to have some sort of S-shaped trajectory, the United States is the only one to have an obviously linear projection in terms of the number of cases. It’s uncertain how this can be explained, but the number of confirmed cases doesn’t represent the number of true cases, since there are likely many people who never get tested. Another factor is the healthcare system of the United States, which is different from that of other countries. There are many in the country who lack healthcare coverage and therefore can’t afford treatment. In other words, they’d rather deal with the symptoms themselves than seek help from a hospital. If the true number of cases were known, the true current trajectory would likely be steeper. Given a steeper true trajectory, it could be imagined that a more accurate S-shaped sigmoidal curve could be mapped on top of the line to project the path of the spread of the virus. The country of China is also highlighted to give reference to the scale of the number of confirmed cases.

On the other hand, another view could be that since the United States is one of the most developed countries in the world with the highest GDP, it could be reasoned that this is a good sign. For example, linear growth in the number of cases is much preferable to exponential growth. Perhaps social distancing efforts and lockdowns by government officials have made it possible for a country like the United States, with its large population, to handle the influx of pandemic patients in a manageable manner. This is perhaps related to the concept of “flattening the curve,” where citizens are encouraged to practice safety measures to avoid hospitals being flooded with patients. Therefore, the S-shaped sigmoidal curve for this country is different from others in that it could be less steep, but quite sharp at the plateau. It's also interesting to note that recently (i.e., in the past few days) the trend has started to turn towards a new path. It's possible that it's beginning a new linear trend with a less steep slope. This indicates that the country is possibly going to start plateauing soon.

In [None]:
Image(filename='../input/data-viz-files/18.JPG')

*Figure 18. The above screenshot shows the number of confirmed cases for the United States of America. A straight gray line is drawn through the line representing the United States to indicate that the growth has been linear. The country of China is also highlighted as a reference for the scale in the number of cases.*

## 4.3 Analysis of Third Visualization
<a id="43"></a>
<a href="#top">Back to top</a>

The third visualization is the same as the second visualization, except that the y-axis has been scaled using the natural log transformation. The first point to be made about this plot is that, in the bottom left corner, the lines for many countries are horizontal for a long period. This is highlighted in Figure 19 by a black box around those countries with a horizontal or near horizontal trend during this period. It means that many countries had relatively few confirmed cases for almost three weeks past the first lag day. This is an interesting insight in that it could imply that, due to slow or late testing, the virus was already spreading rapidly for three weeks, but among a population that was not yet aware of its existence within their community. The countries within this box include: Canada, France, Germany, Italy, Russia, Spain, India, Belgium, United Kingdom, and United States. It’s interesting that these are mostly European countries and their North American counterparts (i.e., Canada and the United States).

This pattern is possibly reflective of the response levels of western countries, which are currently suffering an explosion in cases that is more than double the level that China experienced at its peak. The data for the number of confirmed cases show that for over three weeks, these countries had data that only reflected a minor presence of the pandemic within their borders. It makes sense then why government officials were so slow to react and hesitant to implement strict measures such as lockdown orders. Notable exceptions are Italy and Spain. These are the two hardest hit European countries (during the beginning of the spread within Europe), and the chart shows that Italy was the first to exit this phase after about 10 lag days. About 5 lag days later, Spain would also exit this horizontal phase. Afterwards, their numbers would grow at an exponential rate as the pandemic spread rapidly within their populations.

The other countries that were within this horizontal channel would also eventually all exit after 25 lag days. Eventually, they too would all experience the sudden exponential growth in the number of confirmed cases as their healthcare systems began to feel the strain of the pandemic running through their respective populations. This data can then show the reason why western government officials were both slow to react and surprised by the inevitable skyrocketing number of confirmed cases. They had no evidence other than news reports from China that the virus would become present in their communities. This lack of apparent data to support any decisive action on the part of the government led to the false notion that the virus was a problem for China, and so they were content to act minimally. Once the confirmed cases of the virus had spread within their population for a period of around 2-3 weeks, the hospitals would quickly come to the realization that the virus had infected, and would continue to infect, large numbers of their population. At this point, the virus’ potential for exponential infection rates would become apparent. The virus would then continue to spread until the countries had over a hundred thousand infected and tens of thousands dead due to the decision not to take the measures required to halt its progress.

In [None]:
Image(filename='../input/data-viz-files/19.JPG')

*Figure 19. The above is a screenshot of the confirmed cases against lag days after logarithmic transformation. The horizontal box in black indicates that many European and western countries would remain in a phase where the number of confirmed cases was low and constant for a period. Only after they exited this channel would they experience the rapid exponential growth in the number of cases.*

The last screenshot can be seen below in Figure 20. It shows all the countries in the graph of the logarithm of confirmed cases against lag days. Drawn on top of this chart are lines to indicate where and when the lines for each country are entering a trend phase. A linear trend on a logarithmic chart implies an exponential trend on the untransformed chart. The exceptions to this behavior can be seen in China and Turkey. China has already entered the plateau stage and so its number of cases is roughly constant, leading to a horizontal line. This is where all the countries are trying to eventually end up as the pandemic reaches its end. Turkey is an interesting exception in that its trend is curved throughout, indicating that its growth has never reached exponential levels.
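The statement that a straight line on the logarithmic chart corresponds to exponential growth can be checked numerically. The sketch below uses synthetic data with a hypothetical starting count and daily growth rate; it takes the natural log of an exponential series and fits a straight line with `numpy.polyfit`, whose slope should recover the growth rate.

```python
import numpy as np

days = np.arange(0, 20)

# Hypothetical values for illustration: 100 initial cases, 20% daily growth rate
initial_cases, daily_rate = 100.0, 0.2
cases = initial_cases * np.exp(daily_rate * days)

# log(cases) = log(initial_cases) + daily_rate * days, a straight line in days
slope, intercept = np.polyfit(days, np.log(cases), 1)

# The slope of the log-scale line gives the doubling time of the raw series
doubling_time = np.log(2) / slope

print(f"slope: {slope:.3f}, intercept: {intercept:.3f}")
print(f"doubling time in days: {doubling_time:.2f}")
```

A steeper straight segment on the log chart therefore means a shorter doubling time, and a change from one straight segment to a flatter one means the exponential rate has dropped.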

The countries of Brazil, Peru, and Italy are interesting for the purpose of understanding how the exponential growth trend lasts through time. In all three cases, they seem to experience two different rates of exponential growth. This is indicated by two separate lines drawn through the lines representing the logarithm of confirmed cases. This may imply that these countries experienced one type of exponential growth, which was faster, during the beginning of the pandemic. As time went on, factors such as government action and fewer people left to become infected possibly caused the growth to shift from one exponential trend to another, slower exponential trend.

The other countries all show that they too entered a period of roughly exponential growth as indicated by the straight lines drawn through their representative lines. As a note, the countries of Germany, France, United Kingdom, and Belgium along with Canada, Russia, and India are drawn together with a single thicker line since they share some significant overlap in the logarithmic chart.

In [None]:
Image(filename='../input/data-viz-files/20.JPG')

*Figure 20. The above screenshot shows all the countries together and their graph of log confirmed cases against lag days. Straight lines have been drawn to signify what portions of their logarithmic trend appear linear. The implication is that a linear trend in a logarithmic chart equates with an exponential trend in the untransformed plot. The exceptions are China and Turkey. China has already reached the plateau stage and so their straight line is completely horizontal. Turkey’s trend has a constant curve to it, indicating that their trend isn’t exponential in growth. Other countries have straight lines broken into sections, indicating that possibly the amount of exponential growth decreases over time (e.g., Brazil, Peru, and Italy).*

## 4.4 Analysis of Fourth Visualization
<a id="44"></a>
<a href="#top">Back to top</a>

The fourth visualization (see Figure 21) will be used to further expand on ideas elaborated on in the second and third visualizations. Those visualizations made it possible to view the number of both confirmed cases and log confirmed cases against lag days. Figure 21 below shows the log cases against lag days for all fifteen countries. Furthermore, for each country a linear regression model is fit to the data and a fitted line is drawn. The logic is that if the growth is truly exponential, then the logarithm of the data should follow a linear equation. Adding the fitted line using simple linear regression shows how well the transformed data fits a straight line. Of note is the fact that since the ultimate trend of the data is likely to follow a sigmoid function with an S-shape, how closely the transformed data fits a straight line will change over time. Currently, as of 5/11/20, the data seems to be beginning to plateau, and so the period in which the data looks linear rather than sigmoidal is beginning to end for the fifteen countries below.

The grid of plots was created in RStudio using base R commands. The colored lines were added using Paint3D in Windows. The vertical red lines represent where the trend of the data crosses below the dotted line that represents the fit of the linear model. The red, yellow, and green highlights indicate the Adjusted $R^2$ of each of the plots. The color red indicates that the Adj. $R^2$ is at least 0.9. The color yellow indicates that the Adj. $R^2$ is larger than 0.75 and less than 0.9. The color green is for China whose Adj. $R^2$ is comparatively far lower than the other countries.
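The plots themselves were produced in R, but the idea of scoring a log-linear fit with the adjusted $R^2$ can be sketched in Python as well. The example below is an illustration under stated assumptions, not a reproduction of the figure: both series are synthetic, the parameter values are hypothetical, and the helper `adjusted_r2` is written here for the example using the standard formula $1 - (1 - R^2)\frac{n-1}{n-p-1}$ with $p = 1$ predictor.

```python
import numpy as np

def adjusted_r2(x, y):
    """Fit y = a + b*x by least squares and return the adjusted R^2."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (intercept + slope * x)
    ss_res = np.sum(residuals ** 2)          # unexplained variation
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    n, p = len(x), 1                         # p = number of predictors
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

days = np.arange(0, 60, dtype=float)

# Hypothetical series: one still growing exponentially, one that has plateaued
growing = 100.0 * np.exp(0.1 * days)
plateaued = 80_000.0 / (1.0 + np.exp(-0.4 * (days - 10.0)))

adj_growing = adjusted_r2(days, np.log(growing))
adj_plateaued = adjusted_r2(days, np.log(plateaued))

print(f"still exponential: {adj_growing:.4f}")
print(f"plateaued:         {adj_plateaued:.4f}")
```

The series that is still exponential scores close to 1, while the plateaued series scores far lower because most of its log curve is flat, which mirrors the contrast between the red and green highlights.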

Looking first at China (row 1, column 3), it’s apparent that the log trend is no longer linear, and so the country’s growth is no longer exponential. The Adj. $R^2$ value is 0.3776, indicating that the trend doesn’t fit a linear model. As seen in previous visualizations, the line at this point has become horizontal, which explains the poor fit.

The countries that are colored yellow include: Belgium, France, Germany, Iran, Italy, Spain, and Turkey. The pattern seems to be that they were previously in a period of linear growth on the log scale, but as the cases begin to plateau a linear model no longer properly fits the data. Therefore, they result in lower Adj. $R^2$ values. As the lines cross below the straight-line fit of the linear regression model, it seems logical that they will continue to trend below those fitted lines. Since the number of cases continues to increase at a much more gradual rate, there’s no chance for the trend to break upwards through the fitted line and potentially restart a linear trend (i.e., an exponential trend when unscaled). The only way for these countries to resume that sort of linear growth is if the pandemic begins a second wave as the countries relax their lockdown measures. If a second wave does occur, then the growth rates would move the lines upwards more rapidly, allowing them to start fitting an exponential growth curve again.

The countries in red include: Brazil, Canada, India, Peru, Russia, United Kingdom, and United States. These countries all have an Adj. $R^2$ of at least 0.9, which indicates that their trend fits a linear model quite well at the current stage. However, as indicated by the vertical red lines, they’re all breaking below the fitted line of the linear model. This indicates that they’re all possibly on track to plateau. They’ve passed their inflection points and are on the second half of their journey through the pandemic (referring to Fig. 7).

From what was mentioned, the implication is that although some countries seem to more clearly be on a path of entering the plateau stage, others still need to be careful, as their growth can still be subtly exponential, only at a different rate than what was experienced initially. With infection numbers in the hundreds of thousands, there are in fact still millions of individuals who have yet to contract the virus. Therefore, governments need to be particularly cautious about reopening their countries and ending lockdowns. Despite the desire to restart their economies, it’s apparent that the numbers are still growing exponentially. If the countries try to reopen in a manner that doesn’t continue to halt the progress of the virus, then there’s a chance that the growth will again spike up. What would happen is that they would enter a new linear trend on the logarithmic chart, likely in a steeper direction. On the chart with the linear model, they’d start to point upwards at the right tip and start to move up above the fitted line. It makes sense then for health officials to be wary of any sudden changes in direction of the line representing the number of confirmed cases. The surest sign of safety seems to be data which reflects what’s seen now in China, where the growth in cases is extremely low and remains low without any resurgences.

In [None]:
Image(filename='../input/data-viz-files/21.JPG')

*Figure 21. The above grid of plots shows the comparison of logarithmic confirmed cases against lag days for the fifteen countries with over 50,000 total confirmed cases. The vertical red lines indicate the point at which the log trend dips below the linear regression line. This possibly indicates that their growth is no longer exponential and is beginning to plateau. The numbers highlighted are the Adj. $R^2$ values. The red color indicates that the value is $\geq 0.9$. The yellow color indicates that it’s $< 0.9$ but $\geq 0.75$. The green color is for China with a significantly lower Adj. $R^2$ value of around 0.38.*

## 4.5 Analysis of Fifth Visualization
<a id="45"></a>
<a href="#top">Back to top</a>

The fifth visualization is somewhat more derived in comparison to the other examples in that it’s based on the growth factor, which comes from an analysis of the sigmoidal curve. The concept is that at the beginning of the pandemic, the growth rate will steadily increase. Over time, the growth rate will begin to approach 1 for a period, then start to become smaller. Furthermore, once the data begins to enter the plateau phase, small spikes in the growth rate would be normal, since on a nearly horizontal trend the daily counts are small and even minor changes produce large ratios. For example, if the previous two days had 10 and 5 new cases and then suddenly there are 30 cases, the growth rate would jump from 1/2 to 6. Therefore, it’s important to be careful with the types of interpretations that are made based on this visualization.
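The arithmetic in the example above can be reproduced with a short sketch. It assumes the growth factor is defined as the number of new cases on one day divided by the number of new cases on the previous day, which matches the 1/2 and 6 quoted in the text; the helper name `growth_factors` and the choice to mark days following a zero-case day as undefined are assumptions made for the illustration.

```python
import numpy as np

def growth_factors(daily_new):
    """Ratio of each day's new cases to the previous day's new cases."""
    daily_new = np.asarray(daily_new, dtype=float)
    previous, current = daily_new[:-1], daily_new[1:]
    # The ratio is undefined (NaN) when the previous day had no new cases
    return np.divide(
        current,
        previous,
        out=np.full_like(current, np.nan),
        where=previous > 0,
    )

# New cases on three consecutive days, as in the example in the text
daily_new = [10, 5, 30]
factors = growth_factors(daily_new)  # 5/10 = 0.5, then 30/5 = 6.0

print(factors)
```

If only cumulative counts are available, the daily new cases can be obtained first with `np.diff` on the cumulative series before calling the helper.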

The first chart to be analyzed is China. It’s a good model for other countries to look towards when wondering what type of trend they may be interested in seeing within their own data. The reason is that China has already experienced the spike in cases and begun a period of plateauing. Below in Figure 22 is the plot of growth rate against days of cases for China. The green lines were added in Paint3D. The plot shows many spikes above the horizontal line at 1, but it also shows many points below it. This implies that there are many days on which the number of new cases is less than the previous day’s number of new cases. The four green lines indicate that despite there being times when the growth rate spikes, it continues to trend lower. This pattern reflects that on certain days the number of new cases would be higher than before, but the increase isn’t sustained. For example, there isn’t a pattern of one day having 10 new cases, the next day having 20 new cases, and the next having 30 cases, etc. Furthermore, besides lacking that upward trend in growth rate, the healthy number of points below the line means that it’s typical for there to be fewer new cases than the day before. This is an important distinction, as it will be seen that this pattern isn’t yet common in other countries.

In [None]:
Image(filename='../input/data-viz-files/22.JPG')

*Figure 22. The above screenshot shows the growth rate (growth factor) against days of cases for China. The green lines indicate the trend of there being spikes in the growth rate followed by declines. There are also many days where the growth rate is below 1.*

The next screenshot (see Figure 23) shows countries where there seems to be a more significant pattern of growth rates appearing below 1. Looking at the example of China in the top left, of course the growth rates can't always remain below 1, otherwise the number of new cases would reach 0 quite quickly. Instead, the pattern shows that there are days with low numbers that may go up and down, but there's no evidence of a strong resurgence. The pattern is then saying that the number of new cases begins to decrease and plateaus. The green line highlights the plentiful number of points clearly below 1 (in the next figure it will be apparent that some countries have points that hover quite close to 1). The other European countries (i.e., France, Germany, and Spain) are beginning to show this trend. In the beginning, they had many growth rates above 1. However, over time they begin to see a decent number of days where the growth rate is below 1. This is probably seen the least in France.

Based on the sigmoid curve (recall Figure 7), after some time where the growth rate remains around 1, the inflection point has been passed and the slope of the line representing the number of new cases begins to decrease much more dramatically. In China, this event of passing the inflection point occurred quite quickly. For the other countries here, this process is much slower, and so it would make sense that after two or three weeks, if this pattern continues, they likely will also see the number of new cases reach a plateau. Coinciding with this is a healthy number of growth rates below 1. A worrying pattern can be seen in Spain with the recent increase in growth rates. However, given that the inflection point has passed, seeing some points above 1 is expected.

In [None]:
Image(filename='../input/data-viz-files/23.JPG')

*Figure 23. The screenshot above shows the countries of China, France, Germany, and Spain. The graphs represent the growth rates against days of cases. Green lines are drawn to help highlight the healthy pattern of growth rates below 1.*

The screenshot below (see Figure 24) includes the following countries: Belgium, Brazil, Canada, India, Iran, Italy, Peru, Russia, Turkey, United Kingdom, and United States. These countries largely lack the pattern that is more evident in the countries shown in Figure 23. The yellow lines highlight the patterns that show more of an opposite behavior in the growth rates. For the countries below, many points can still be seen above 1. In terms of the sigmoid curve, this can indicate that they're still hovering around the middle and have yet to clearly pass the inflection point.

Certain countries are showing a more worrying pattern where the growth rate is quite large. These countries include: India, Iran, and United States. Previously, Spain showed this pattern also, but it was balanced by a large number of points below 1 (particularly in the past few weeks). These countries have their larger number of spikes highlighted with a wide yellow line. The pattern is of concern in that it could indicate that they're further behind in terms of getting past the inflection point (representing the half-way point of the pandemic) of the sigmoid curve. If these spikes continue, it could indicate that they're still about to ramp up in the number of cases and so the slope of the line representing the number of cases can become quite steep.

Other countries have a pattern where they have many points close to 1, but the points exist both above and below it. This is different from the pattern seen in China, where there was a healthy number of points clearly below 1. These countries include: Italy, Peru, Russia, Turkey, and United Kingdom. Italy is unique in that it seems to have a balance of points both clearly above and below the line. However, the points below 1 don't form as obvious a majority as what's seen in the European countries of Figure 23.

A good sign for the countries here would first be for the countries with wide yellow lines to start trending near 1 for some time. Afterwards, once the countries hover around 1 for some period, it'd be good if there would start to be a pattern of points clearly below 1. This would indicate that they've not only passed the inflection point of their sigmoid curves, but that they've started to reach their plateaus and that any remaining movement in the growth rate only reflects the number of new cases bouncing between small values (below 50).
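The growth-rate reading described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for Figures 23 and 24. It assumes the growth rate is the ratio of one day's new cases to the previous day's new cases (the definition used in the 3Blue1Brown reference). The function names and the example counts are hypothetical.

```python
def growth_factors(cumulative):
    """Return day-over-day growth factors from cumulative confirmed counts."""
    # New cases per day are the differences of the cumulative series.
    new_cases = [b - a for a, b in zip(cumulative, cumulative[1:])]
    factors = []
    for prev, curr in zip(new_cases, new_cases[1:]):
        # The ratio is undefined when the previous day had no new cases.
        factors.append(curr / prev if prev > 0 else None)
    return factors


def share_below_one(factors):
    """Fraction of the defined growth factors that are strictly below 1."""
    defined = [f for f in factors if f is not None]
    return sum(f < 1 for f in defined) / len(defined) if defined else 0.0


# Hypothetical cumulative counts, for illustration only (not from the dataset).
example = [100, 150, 225, 300, 350, 380, 395]
factors = growth_factors(example)
below = share_below_one(factors)
```

A country whose share of points below 1 keeps rising over recent weeks would correspond to the green-line pattern of Figure 23, while repeated values well above 1 would correspond to the wide yellow lines of Figure 24.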

In [None]:
Image(filename='../input/data-viz-files/24.JPG')

*Figure 24. The above screenshot shows the growth rate against days of cases for the following countries: Belgium, Brazil, Canada, India, Iran, Italy, Peru, Russia, Turkey, United Kingdom, and United States. The yellow line highlights dangerous patterns that signify that countries are not quite free from danger in terms of their growth pattern. Some green is shown for Italy to help indicate that this country also has some healthy patterns.*

## 4.6 Analysis of Sixth Visualization
<a id="46"></a>
<a href="#top">Back to top</a>

The last visualization is the sunburst chart that displays the number of confirmed cases by continent, then country, then province or state. The following table shows the corresponding number of confirmed cases by continent along with the respective percentages.

| Europe | North America | Asia | South America | Africa | Australia | Total |
| --- | --- | --- | --- | --- | --- | --- |
| 1,612,876 (38.63%) | 1,482,631 (35.51%) | 684,483 (16.39%) | 320,686 (7.68%) | 65,980 (1.58%) | 8,493 (0.20%) | 4,175,149 | 

From the table, Europe has the largest number of cases followed closely by North America. The third most is Asia which is followed by South America. Africa and Australia have comparatively fewer cases as they’re both still under 100,000 as of 5/12/2020.
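As a quick check, the percentages in the table can be recomputed from the counts. The sketch below uses only the figures listed in the table. The variable names are illustrative.

```python
# Confirmed cases by continent, copied from the table above.
continent_cases = {
    "Europe": 1_612_876,
    "North America": 1_482_631,
    "Asia": 684_483,
    "South America": 320_686,
    "Africa": 65_980,
    "Australia": 8_493,
}

total_cases = sum(continent_cases.values())

# Share of the global total per continent, in percent.
continent_shares = {
    name: 100 * count / total_cases for name, count in continent_cases.items()
}

for name, share in continent_shares.items():
    print(f"{name}: {share:.2f}%")
```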

The first step then will be to analyze the continent with the most confirmed cases, Europe. Figure 25 below shows the breakdown of cases for the continent. The hierarchy is such that the center is the continent of Europe, the next level is the countries, and the following level is provinces or states. The countries of United Kingdom, France, Netherlands, and Denmark all show a province or state level because, according to the dataset, there are small areas that they are considered to have some sort of responsibility over. Examples of this include territories from colonial times that they still have domain over, or other places that they’re legally supposed to provide with defense.
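The way a sunburst sizes its rings can be illustrated with a small roll-up. Each province-level row contributes to its country's total, and each country contributes to its continent's total. The sketch below is only an illustration of that aggregation. The row values and the placeholder names ("Country A", "Territory B1", and so on) are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical rows for illustration only: (continent, country, province, cases).
# A province of None stands for the country's mainland row.
rows = [
    ("Europe", "Country A", None, 100),
    ("Europe", "Country B", None, 90),
    ("Europe", "Country B", "Territory B1", 5),
    ("Europe", "Country C", None, 80),
    ("Europe", "Country C", "Territory C1", 3),
]


def roll_up(rows):
    """Sum province-level cases into country totals and continent totals."""
    country_totals = defaultdict(int)
    continent_totals = defaultdict(int)
    for continent, country, _province, cases in rows:
        country_totals[(continent, country)] += cases
        continent_totals[continent] += cases
    return dict(continent_totals), dict(country_totals)


continent_totals, country_totals = roll_up(rows)
```

In the chart, the angle of each wedge is proportional to these totals, which is why a country with a small territory still appears mostly as a single wedge with a thin third-level segment.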

The sunburst chart shows that the order from most cases to least is as follows: Spain, United Kingdom, Russia, Italy, France, and Germany. These countries each have between 172,000 and 228,000 confirmed cases as of 5/11/2020. It’s worth noting that these are the largest countries within Europe and the major powers of the continent: they’re the top six in terms of both population and GDP (based on a Google search), while the other countries have less influence and smaller economies. It’s not surprising then that these countries have the most confirmed cases. A logical inference then is that the most densely populated areas with the most movement due to economic activities are the most susceptible. For example, Ukraine has a population not too distant from Spain’s, but its number of cases is considerably lower. Furthermore, the Netherlands, with a nominal GDP right behind Spain’s, also has a comparatively high number of confirmed cases. By number of confirmed cases, behind Russia is Belgium and then the Netherlands. The population of the Netherlands, however, is less than half that of Spain. It follows then that important factors for determining the spread of the virus include the combination of population and level of economic activity.

In [None]:
Image(filename='../input/data-viz-files/25.JPG')

*Figure 25. The above screenshot shows the sunburst visualization after selecting Europe. The plot shows the continent of Europe in the center and, starting with Spain, shows the countries from the most to the least number of confirmed cases. The third level of the hierarchy includes small territories or other locations that the countries have some responsibility over.*

The next screenshot shows the subset of North America after selecting it from the original sunburst visualization (Figure 9). The United States has a large majority of the confirmed cases for the continent. It’s followed by Canada and Mexico, the other major countries in North America. It’s interesting that Mexico which has over three times the population of Canada has fewer cases. However, from analyzing Europe it follows that the factors for having confirmed cases include the combination of population and economic activity. In terms of GDP per capita, Canada is far ahead of Mexico at nearly five times the amount.

Both the United States and Canada have third level values for the hierarchy that include provinces or states. In the United States, the state with the most cases is New York, followed by New Jersey, Illinois, Massachusetts, and California. It’s worth noting that New Jersey and Massachusetts are both close neighbors to New York. There are also many commuters who live in New Jersey but work in New York City. It’s interesting too that California, the most populous state, has fewer cases than the other four. A possible reason is that the president closed borders early in the pandemic to travelers who had been to China. It’s also been noted by researchers that most cases in New York come from infected people from Europe. Therefore, it follows that the spread of the pandemic comes from the east coast, whereas California is located far on the west coast. This is a possible reason why California, as the state with both the largest economy and population of the country, isn’t number one in terms of infections.

In [None]:
Image(filename='../input/data-viz-files/26.JPG')

*Figure 26. The above screenshot shows the sunburst visualization after selecting North America. The plot shows the continent of North America in the center and, starting with United States, shows the countries from the most to the least number of confirmed cases. The third level of the hierarchy includes states from the U.S. and provinces from Canada.*

On the other hand, the provinces with the most infected in Canada are Quebec followed by Ontario. In terms of GDP they aren’t nearly the largest, which is interesting. It’s possible that a different dynamic exists for Canada since it’s such a large area of land with a relatively sparse population. It’s possible that traffic around the country isn’t as intense due to the large area. Therefore, when the pandemic reached those two provinces it didn’t spread much further. As shown below in Figure 27, the most populous provinces of Quebec and Ontario are followed by Alberta and British Columbia, the provinces with the highest GDP.

In [None]:
Image(filename='../input/data-viz-files/27.JPG')

*Figure 27. The above screenshot shows the sunburst after selecting Canada. It shows the country of Canada in the center followed by its provinces of Quebec, Ontario, Alberta, B.C., etc. in terms of the number of confirmed cases.*

The next screenshot shows the continent of Asia. It’s interesting to see that China is no longer the country with the most cases. The order from most to least is as follows: Turkey, Iran, China, and India. The third level of the sunburst hierarchy shows the provinces of China. The sunburst visualization makes clear that Hubei (the province of Wuhan) is the province with the most cases. The most notable piece of this portion of the visualization is India, with a population similar to China’s (over a billion). Few countries have managed to stem the growth of the pandemic quickly; some exceptions are South Korea and Singapore. Therefore, if India follows a similar path to some of the major European countries and the United States, then it follows that there could be hundreds of thousands infected (if not more), with a similar fate to their western counterparts.

In [None]:
Image(filename='../input/data-viz-files/28.JPG')

*Figure 28. The above screenshot shows the sunburst visualization after selecting Asia. The plot shows the continent of Asia in the center and, starting with Turkey, shows the countries from the most to the least number of confirmed cases. The third level of the hierarchy includes provinces from China.*

The next screenshot shows South America, where Brazil has almost half the total number of confirmed cases. It’s followed by Peru, Chile, Ecuador, Colombia, and Argentina. Based on the idea of combining population and GDP, it makes sense that Brazil is the largest. It has both the largest population and highest GDP within the continent. Argentina and Colombia are the next two highest for those two factors, so it’s interesting to see them behind Peru, Ecuador, and Chile. Another factor is that all these countries are neighbors or near neighbors to Brazil. Brazil is situated such that it’s nearly the center of the continent with many neighboring countries. However, Peru, Ecuador, and Chile all have populations and GDPs that aren’t too far away from the leaders.

Another possibility is that since Brazil is one of the most developed countries in the region, it has the best access to medical staff and supplies, including testing kits. Therefore, it’s possible that the pandemic already exists in other regions of South America, but due to a lack of healthcare infrastructure it’s going around unaccounted for. This is an issue that exists for many Asian, South American, and African countries. It makes sense then to be skeptical about the reported numbers and to be unsure whether the confirmed cases are truly representative of the actual, unknown number of infections.

In [None]:
Image(filename='../input/data-viz-files/29.JPG')

*Figure 29. The above screenshot shows the sunburst visualization after selecting South America. The plot shows the continent of South America in the center and starting with Brazil it shows from the most to the least number of confirmed cases.*

The next screenshot shows Africa, a continent with a population close to China and India at 1.2 billion. Therefore, the potential for the pandemic to spread is quite dangerous. Thinking back to the concept of exponential growth, the factor that slowed the spread for many countries isn’t necessarily because of government action, but due to the way that environments experiencing exponential growth tend to use up all the fuel source until the growth is extinguished due to there being no more fuel. In this case, the fuel for the virus spreading is more uninfected people.
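The "fuel" argument above can be illustrated with a very simple simulation. The sketch below uses a discrete-time susceptible-infected (SI) model, which is an assumption made here for illustration and is not a model fitted anywhere in this paper. It ignores recovery, deaths, and interventions. The population size, starting infections, and contact rate are hypothetical values.

```python
def simulate_si(population, initial_infected, contact_rate, days):
    """Discrete-time SI model: return the list of new infections per day."""
    infected = float(initial_infected)
    susceptible = float(population - initial_infected)
    new_per_day = []
    for _ in range(days):
        # New infections scale with the infected count and with the share
        # of the population that is still susceptible (the remaining "fuel").
        new_cases = contact_rate * infected * susceptible / population
        new_cases = min(new_cases, susceptible)
        infected += new_cases
        susceptible -= new_cases
        new_per_day.append(new_cases)
    return new_per_day


# Hypothetical parameters, for illustration only.
daily_new = simulate_si(population=1_000_000, initial_infected=10,
                        contact_rate=0.3, days=120)
peak_day = daily_new.index(max(daily_new))
```

In this toy setting the daily new infections first grow almost exponentially, peak once a large share of the population has been infected, and then fall towards zero as the susceptible pool is used up, which is the sigmoid behavior discussed earlier.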

The country with the most confirmed cases is South Africa, followed by Egypt, Morocco, and Algeria. The top two countries have a similar number of cases at around 6,500. However, geographically speaking, Egypt is quite close to Europe, where the outbreak has quickly exploded. South Africa is at the southern tip of the continent, further away from the pandemic. However, they both have significant populations and GDPs and are both among the top 5 for both. Algeria and Morocco both have high GDPs for the continent while having populations that are slightly smaller than Egypt’s or South Africa’s. Therefore, it makes sense that there are also many confirmed cases there.

The concern for Africa, however, is related to the ability of each of the many countries to take care of themselves when the pandemic reaches their communities. In terms of GDP per capita, as a continent they’re behind Asia, South America, Europe, and North America (in order from lowest to highest). In other words, they’re the least prepared financially to help their population. It’s then up to international organizations and multilateral institutions to help their African counterparts in their time of need. Now, the pandemic is ravaging Europe and North America. Given that the pandemic is noticeably moving through time, African countries may be lucky in that the virus will possibly spread first through the developed world before reaching the continent. Once the developed countries are largely done handling the pandemic situation, they’ll be better suited to assist other countries in need, such as developing countries in Africa, South America, and Asia.

In [None]:
Image(filename='../input/data-viz-files/30.JPG')

*Figure 30. The above screenshot shows the sunburst visualization after selecting Africa. The plot shows the continent of Africa in the center and starting with South Africa it shows from the most to the least number of confirmed cases.*

# 5. Conclusion
<a id="conclusion"></a>
<a href="#top">Back to top</a>

The six visualizations put into perspective the spread of the COVID-19 virus since it was first detected in the city of Wuhan. The first visualization showed from a timeline perspective how the pandemic spread globally. It began in China and traveled throughout Asia and Europe. It was able to exist on continents across oceans and managed to infect much of the developed world with skyrocketing cases. An important theme from this visualization is that as of the date of data, 5/11/2020, there’s still a looming potential for the pandemic to spread into the developing world and continue to infect the billions who live there.

The second visualization showed how the trajectory of the pandemic in terms of number of cases tends to follow a sigmoidal curve. At the beginning of the spread of the virus, countries experience an exponential growth in the number of cases. Over time as the rate of growth begins to slow, the countries are tending towards a plateauing stage. However, of the twelve countries highlighted, only China has secured itself as being in this end stage of the pandemic. Other countries are either still in the exponential phase or are trying to enter the end stage as their number of new cases start to level off.

The third visualization made apparent why western countries were hesitant to act despite their developed healthcare infrastructures. The countries had few cases for an extended period before they suddenly exploded with an exponentially growing number of cases. This was highlighted using straight lines on the log plot of the confirmed number of cases. It was further shown that many of the major European countries experienced a similar growth pattern.

The fourth visualization expanded on the linear modeling of the logarithmic charts of the confirmed cases. It utilized linear regression to see how closely the countries’ growth fit to a straight line, indicating exponential growth. Many of the countries had this trait still and it shows that they’ve yet to reach the end stage of the pandemic where the number of cases plateaus as in China. Therefore, countries need to be cautious unless they want to risk a second wave of the pandemic where the number of cases spikes and begins again a trajectory of exponential growth.
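The idea behind the straight-line fit can be sketched as follows: if cases grow exponentially, the logarithm of the case count is linear in time, so an ordinary least-squares fit on the log values should have an R² close to 1. The Python sketch below is only an illustration of that idea and is not the RStudio code used for the fourth visualization. The function name and the synthetic series (20% daily growth) are hypothetical.

```python
import math


def fit_log_linear(days, cases):
    """Least-squares fit of log(cases) against days.

    Returns (slope, intercept, r_squared).
    """
    logs = [math.log(c) for c in cases]
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R squared compares residual variation to total variation in log(cases).
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(days, logs))
    ss_tot = sum((y - mean_y) ** 2 for y in logs)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Synthetic series growing 20% per day, for illustration only.
days = list(range(10))
cases = [100 * 1.2 ** d for d in days]
slope, intercept, r_squared = fit_log_linear(days, cases)

# Under exponential growth, the doubling time follows from the slope.
doubling_time = math.log(2) / slope
```

A country that has reached its plateau would show the opposite picture: the log curve flattens, the slope over recent days approaches zero, and a single straight line no longer fits the whole series well.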

The fifth visualization shows the growth factor for different countries. This visualization made it apparent that the countries for the most part are still in the stage of staying around a growth factor of 1. They’re trying to get past the inflection point until they experience a pattern where the growth factor is regularly below 1, as seen in China. There is a distinction, however, in the pattern that seems to confirm that some countries may be nearing the plateau while others are possibly lagging by a few days or weeks.

The sixth visualization makes it clear how the pandemic is as of now largely within the developed world (i.e., Europe and North America). However, from the previous visualizations it was apparent that despite the developed healthcare infrastructures of those modern countries, their populations were still highly susceptible to the spread of the virus. This visualization then makes more apparent what was seen in the first visualization which is that the virus can still have devastating effects on the world, but it’s a matter of time for the pandemic to spread and to do damage. Furthermore, the developing countries have poor healthcare infrastructure that makes them less able to fend off the pandemic in the same manner that the developed world has.

Despite the devastating effects of the pandemic on European countries and the United States, it’s apparent from the trends seen within this paper that the modern countries have gone through a severe part of the pandemic already. They seem to be on the way towards the end stage of the pandemic where the number of cases plateaus. At that point it’s up to the countries to be cautious to avoid a second outbreak as the world waits for a cure to the virus. The real problem then seems to be the fact that there are still billions of people in India, Africa, and South America who have yet to be infected. These are the vulnerable populations with poor citizens that won’t be able to get help even if they seek it. It should then be apparent that it’s up to the commitment of international organizations and multilateral institutions to develop mechanisms to help these poor countries as if the people were their own. Only when the virus is completely managed will the world be at peace. Otherwise, if the virus runs rampant throughout the developing world, then it’s only a matter of time until the virus returns to the countries who’ve fought hard to keep the pandemic from defeating them. If that happens, there would be no one else to help and the progress of the world could meet an unfortunate end due to the virus.

# 6. References
<a id="references"></a>
<a href="#top">Back to top</a>

1.	Google. (n.d.). Retrieved May 6, 2020, from https://www.google.com/
2.	Johns Hopkins University. (n.d.). Maps & Trends Follow global cases and trends. Updated daily. Retrieved March 20, 2020, from https://coronavirus.jhu.edu/data
3.	National Council of Teachers of Mathematics. (n.d.). Pandemics: How Are Viruses Spread? Retrieved March 20, 2020, from https://www.nctm.org/Classroom-Resources/Illuminations/Interactives/Pandemics-How-Are-Viruses-Spread/
4.	Plotly Python Graphing Library. (n.d.). Retrieved May 20, 2020, from https://plotly.com/python/
5.	Quaresima, V., Naldini, M. M., & Cirillo, D. M. (2020). The prospects for the SARS‐CoV‐2 pandemic in Africa. EMBO Molecular Medicine. https://doi.org/10.15252/emmm.202012488
6.	Rajkumar, S. R. (n.d.). Novel Corona Virus 2019 Dataset. Retrieved May 3, 2020, from https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset
7.	Sanderson, G. S. (2020, March 8). Exponential Growth and Epidemics. Retrieved March 20, 2020, from https://www.3blue1brown.com/videos-blog/exponential-growth-and-epidemics
8.	Stack Overflow - Where Developers Learn, Share, & Build Careers. (n.d.). Retrieved March 20, 2020, from https://stackoverflow.com
9.	Stasko, J. S. (n.d.). SunBurst Page. Retrieved May 6, 2020, from https://www.cc.gatech.edu/gvu/ii/sunburst/
10.	Wikipedia contributors. (n.d.). Wikipedia. Retrieved May 6, 2020, from https://www.wikipedia.org/


# 7. Code Appendix
<a id="code"></a>
<a href="#top">Back to top</a>

The format of the Code Appendix is such that the code for each of the visualizations is self-contained. In other words, the code in each section (sections 7.1, 7.2, etc.) should be able to produce the entire visualization from top to bottom. Of note, in section 7.2, the code will output two CSV files, 'python_melt.csv' and 'lag_data.csv'. These are used in RStudio for sections 7.4 and 7.5. Lastly, in the final section, 7.6, the method of producing the visualization is slightly more complicated. The Python code will output a file called 'sunburst.csv'. This is then loaded into R where it's edited and re-output as 'sunburst_edit.csv'. This file is then re-opened in Python for use in the final visualization.

## 7.1 Code for First Visualization
<a id="71"></a>
<a href="#top">Back to top</a>

In [None]:
import pandas as pd # Load libraries
import numpy as np
import pycountry_convert as pc
import plotly.express as px
import os
import plotly
import plotly.graph_objs as go

ts_confirmed = pd.read_csv('../input/novel-corona-virus-2019-dataset/time_series_covid_19_confirmed.csv') # Load data

# Remove certain entries for convenience
excluded_regions = ['Congo (Brazzaville)', 'Diamond Princess', 'MS Zaandam', 'Holy See',
                    'Western Sahara', 'Kosovo', 'Cruise Ship', 'Timor-Leste', 'West Bank and Gaza']
ts_confirmed = ts_confirmed[~ts_confirmed['Country/Region'].isin(excluded_regions)].copy()

# Change country name to full name for US (use .loc to avoid chained assignment)
ts_confirmed.loc[ts_confirmed['Country/Region'] == 'US', 'Country/Region'] = 'United States of America'

iso_list = [] # Create ISO column for location plotting
for i in range(0,len(ts_confirmed)): # Manual ISO labeling
    if ts_confirmed.iloc[i,1] == 'Korea, South':
        iso_list.append('KOR')
    elif ts_confirmed.iloc[i,1] == 'Taiwan*':
        iso_list.append('TWN')
    elif ts_confirmed.iloc[i,1] == 'Congo (Kinshasa)':
        iso_list.append('COD')
        ts_confirmed.iloc[i,1] = 'Congo'
#    elif ts_confirmed.iloc[i,1] == 'Congo (Brazzaville)':
#        iso_list.append('COD')
#        ts_confirmed.iloc[i,1] = 'Congo'
    elif ts_confirmed.iloc[i,1] == 'Cote d\'Ivoire':
        iso_list.append('CIV')
    elif ts_confirmed.iloc[i,1] == 'Gambia, The':
        iso_list.append('GMB')
    elif ts_confirmed.iloc[i,1] == 'Bahamas, The':
        iso_list.append('BHS')
    elif ts_confirmed.iloc[i,1] == 'West Bank and Gaza':
        iso_list.append('PSE') # ISO alpha-3 code, matching the other entries in this loop
    elif ts_confirmed.iloc[i,1] == 'Burma':
        iso_list.append('MMR')
    else:
        iso_list.append(pc.country_name_to_country_alpha3(ts_confirmed.iloc[i,1], cn_name_format="default"))

geospatial = ts_confirmed.copy() # Create new df to work with
geospatial['ISO'] = iso_list
geospatial.head()

# References: https://stackoverflow.com/questions/28654047/pandas-convert-some-columns-into-rows
# Change dataframe from wide format to long format
geospatial_melt = geospatial.melt(id_vars=geospatial.columns[[0,1,2,3,(len(geospatial.columns) - 1)]],
                 var_name='Date',
                 value_name='Confirmed Cases')

# Change some country names for continent labels
#geospatial_melt.loc[geospatial_melt['Country/Region'] == 'Congo (Brazzaville)','Country/Region'] = 'Congo'
geospatial_melt.loc[geospatial_melt['Country/Region'] == 'Congo (Kinshasa)','Country/Region'] = 'Congo'
geospatial_melt.loc[geospatial_melt['Country/Region'] == 'Cote d\'Ivoire','Country/Region'] = 'Ivory Coast'
geospatial_melt.loc[geospatial_melt['Country/Region'] == 'Korea, South','Country/Region'] = 'South Korea'
geospatial_melt.loc[geospatial_melt['Country/Region'] == 'Taiwan*','Country/Region'] = 'Taiwan'
geospatial_melt.loc[geospatial_melt['Country/Region'] == 'Kosovo','Country/Region'] = 'Serbia'
geospatial_melt.loc[geospatial_melt['Country/Region'] == 'Burma','Country/Region'] = 'Myanmar'
geospatial_melt.drop(geospatial_melt[geospatial_melt['Country/Region'] == 0].index, inplace=True)

# Reference: https://stackoverflow.com/questions/55910004/get-continent-name-from-country-using-pycountry
continents = { # Add continent column
    'NA': 'North America',
    'SA': 'South America', 
    'AS': 'Asia',
    'OC': 'Australia',
    'AF': 'Africa',
    'EU': 'Europe'
}
geospatial_melt['Continent'] = [continents[pc.country_alpha2_to_continent_code(pc.country_name_to_country_alpha2(country))] for country in geospatial_melt['Country/Region']]

# Combine province/state from countries into single rows with groupby
# Reference: https://stackoverflow.com/questions/33068007/pandas-keeping-dates-in-order-when-using-groupby-or-pivot-table
geo_groupby = geospatial_melt.groupby(['Date', 'Continent', 'Country/Region', 'ISO'],
                                      sort=False)['Confirmed Cases'].sum().reset_index()

fig = px.scatter_geo(geo_groupby, locations="ISO", color="Continent",
                     hover_name="Country/Region", size="Confirmed Cases",
                     animation_frame="Date",
                     height = 1000,
                     template='plotly_dark',
                     title='Timeseries of COVID-19 Pandemic by Country',
                     projection="natural earth")
fig.update(layout=dict(title=dict(x=0.5)))
fig.show()

# Reference: https://community.plot.ly/t/proper-way-to-save-a-plot-to-html/7063/8
cwd = os.getcwd()
#fig.write_html(cwd + '\\visualization1.html')

## 7.2 Code for Second Visualization
<a id="72"></a>
<a href="#top">Back to top</a>

In [None]:
import pandas as pd # Load libraries
import numpy as np
import pycountry_convert as pc
import plotly.express as px
import os
import plotly
import plotly.graph_objs as go

df_viz2 = pd.read_csv('../input/novel-corona-virus-2019-dataset/time_series_covid_19_confirmed.csv') # Load data

### The melt_groupby_lowerbound function will melt the original timeseries
### dataframe from wide format to long format. It will also do a groupby
### to keep only Country/Region and Date columns. It also adds the columns
### of 'Days' and 'Log Confirmed Cases'. There's also the option to subset
### countries with a lower bound number of cases on the latest date.
def melt_groupby_lowerbound(untouched_data=df_viz2, lower_bound_cases=5000):
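    # Drop Lat and Long columns (on a copy, so the caller's dataframe is left unchanged
    # and the function can safely be called more than once)
    untouched_data = untouched_data.drop(['Lat', 'Long'], axis=1)
    
    # Melt from wide to long format keeping only ['Province/State','Country/Region'] as id columns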
    subset_melt = untouched_data.melt(id_vars=untouched_data.columns[[0,1]],
                 var_name='Date',
                 value_name='Confirmed Cases')
    
    # Reference: https://stackoverflow.com/questions/40553002/pandas-group-by-two-columns-to-get-sum-of-another-column
    # Reference: https://stackoverflow.com/questions/10373660/converting-a-pandas-groupby-output-from-series-to-dataframe
    # Groupby ['Country/Region','Date'] and sum the rows to combine states and provinces
    df_groupby_country_date = subset_melt.groupby(['Country/Region','Date']).agg({'Confirmed Cases': 'sum'}).reset_index()

    # Reference: https://stackoverflow.com/questions/28161356/sort-pandas-dataframe-by-date
    # Reference: https://stackoverflow.com/questions/17141558/how-to-sort-a-dataframe-in-python-pandas-by-two-or-more-columns
    # Re-sort values after doing groupby
    df_groupby_country_date['Date'] = pd.to_datetime(df_groupby_country_date.Date, format='%m/%d/%y')
    df_groupby_country_date.sort_values(by=['Country/Region', 'Date'], ascending=[True, True], inplace=True)
    df_groupby_country_date.reset_index(inplace=True, drop=True)

    # Reference: https://stackoverflow.com/questions/59642338/creating-new-column-based-on-condition-on-other-column-in-pandas-dataframe
    # Add 'Days' column which repeats 1:n_dates per country by mapping dates to corresponding nth day
    unique_dates = df_groupby_country_date['Date'].unique()
    unique_dates_df = pd.DataFrame({'Dates': unique_dates})
    unique_dates_df['Days'] = [i for i in range(1, len(unique_dates_df) + 1)]
    df_groupby_country_date['Days'] = [unique_dates_df[x == unique_dates_df['Dates']]['Days'].values[0]
                                       for x in df_groupby_country_date['Date']]
    
    # Add log confirmed cases
    df_groupby_country_date['Log Confirmed Cases'] = np.log(df_groupby_country_date['Confirmed Cases'])
    
    # Subset countries with lower bound confirmed cases at current date
    # Reference: https://stackoverflow.com/questions/22591174/pandas-multiple-conditions-while-indexing-data-frame-unexpected-behavior
    latest_day = df_groupby_country_date['Days'].max() # The latest date has the largest day number
    lower_bound_countries = df_groupby_country_date[(df_groupby_country_date['Days'] == latest_day) &
                            (df_groupby_country_date['Confirmed Cases'] >= lower_bound_cases)]['Country/Region']
    # Reference: https://stackoverflow.com/questions/17071871/how-to-select-rows-from-a-dataframe-based-on-column-values
    lower_bound_subset = df_groupby_country_date.loc[df_groupby_country_date['Country/Region'].isin(lower_bound_countries)]
    lower_bound_subset = lower_bound_subset.reset_index(drop=True) # Returns a new dataframe instead of modifying a slice

    return lower_bound_subset

# Subset countries with at least 50,000 cases
df_line_chart = melt_groupby_lowerbound(untouched_data=df_viz2, lower_bound_cases=50000)

# Edit US to United States... (assign the result back rather than replacing in place on a column selection)
df_line_chart['Country/Region'] = df_line_chart['Country/Region'].replace('US', 'United States of America')

# df_line_chart.to_csv('python_melt.csv', index=False) # Save data to file for use in RStudio

### The k_lag_subset function is used to subset the data so that each country's
### series starts 'k' days after the date of its first confirmed case. The added
### 'Lag Days' column counts the days from that starting point. This helps to stagger
### the countries on a similar start date so that they can be compared.
def k_lag_subset(full_line_chart_df=df_line_chart, k_lags=0):
    df = full_line_chart_df.copy()
    subset_countries = df['Country/Region'].unique()
    total_days = np.max(df['Days'])
    lag_list = []
    for i in subset_countries:
        # Subset by i'th country
        country_subset = df[df['Country/Region'] == i]
        # Find row index of k'th day lag
        first_case_lag = np.where(country_subset['Confirmed Cases'] > 0)[0][0] + k_lags
        # Subset df from the k'th row onward (copy so a new column can be added safely)
        lag_subset = country_subset.iloc[first_case_lag:total_days].copy()
        # Number the remaining days 1..n (the dates are consecutive within a country)
        lag_subset['Lag Days'] = np.arange(1, len(lag_subset) + 1)
        # Append subset df
        lag_list.append(lag_subset)
    
    k_lag_df = pd.concat(lag_list) # Concatenate back to single df
    
    return k_lag_df

# Lag up to k=10 days
lag_df_k10 = k_lag_subset(full_line_chart_df=df_line_chart, k_lags=10)

# lag_df_k10.to_csv('lag_data.csv', index=False) # save to df to work with in R

fig = px.line(lag_df_k10, x='Lag Days', y='Confirmed Cases', color='Country/Region',
             hover_name='Country/Region', height=1000, title='Confirmed Cases vs. Lag Days')
fig.update(layout=dict(title=dict(x=0.5)))
fig.show()

# Reference: https://community.plot.ly/t/proper-way-to-save-a-plot-to-html/7063/8
cwd = os.getcwd()
# fig.write_html(cwd + '\\visualization2.html')

## 7.3 Code for Third Visualization
<a id="73"></a>
<a href="#top">Back to top</a>

In [None]:
import pandas as pd # Load libraries
import numpy as np
import pycountry_convert as pc
import plotly.express as px
import os
import plotly
import plotly.graph_objs as go

df_viz3 = pd.read_csv('../input/novel-corona-virus-2019-dataset/time_series_covid_19_confirmed.csv') # Load data

### The melt_groupby_lowerbound function will melt the original timeseries
### dataframe from wide format to long format. It will also do a groupby
### to keep only Country/Region and Date columns. It also adds the columns
### of 'Days' and 'Log Confirmed Cases'. There's also the option to subset
### countries with a lower bound number of cases on the latest date.
def melt_groupby_lowerbound(untouched_data=df_viz3, lower_bound_cases=5000):
    # Drop Lat and Long columns
    untouched_data.drop(['Lat', 'Long'], axis=1, inplace=True)
    
    # Melt from long to wide format keeping only ['Country/Region','Date']
    subset_melt = untouched_data.melt(id_vars=untouched_data.columns[[0,1]],
                 var_name='Date',
                 value_name='Confirmed Cases')
    
    # Reference: https://stackoverflow.com/questions/40553002/pandas-group-by-two-columns-to-get-sum-of-another-column
    # Reference: https://stackoverflow.com/questions/10373660/converting-a-pandas-groupby-output-from-series-to-dataframe
    # Groupby ['Country/Region','Date'] and sum the rows to combine states and provinces
    df_groupby_country_date = subset_melt.groupby(['Country/Region','Date']).agg({'Confirmed Cases': 'sum'}).reset_index()

    # Reference: https://stackoverflow.com/questions/28161356/sort-pandas-dataframe-by-date
    # Reference: https://stackoverflow.com/questions/17141558/how-to-sort-a-dataframe-in-python-pandas-by-two-or-more-columns
    # Re-sort values after doing groupby
    df_groupby_country_date['Date'] = pd.to_datetime(df_groupby_country_date.Date, format='%m/%d/%y')
    df_groupby_country_date.sort_values(by=['Country/Region', 'Date'], ascending=[True, True], inplace=True)
    df_groupby_country_date.reset_index(inplace=True, drop=True)

    # Reference: https://stackoverflow.com/questions/59642338/creating-new-column-based-on-condition-on-other-column-in-pandas-dataframe
    # Add 'Days' column which numbers the dates 1:n_dates within each country
    # (a dense rank maps each date to its corresponding nth day)
    df_groupby_country_date['Days'] = df_groupby_country_date['Date'].rank(method='dense').astype(int)
    
    # Add log confirmed cases
    df_groupby_country_date['Log Confirmed Cases'] = np.log(df_groupby_country_date['Confirmed Cases'])
    
    # Subset countries with lower bound confirmed cases at current date
    # Reference: https://stackoverflow.com/questions/22591174/pandas-multiple-conditions-while-indexing-data-frame-unexpected-behavior
    latest_day = df_groupby_country_date['Days'].max() # day number of the latest date
    lower_bound_countries = df_groupby_country_date[(df_groupby_country_date['Days'] == latest_day) &
                            (df_groupby_country_date['Confirmed Cases'] >= lower_bound_cases)]['Country/Region']
    # Reference: https://stackoverflow.com/questions/17071871/how-to-select-rows-from-a-dataframe-based-on-column-values
    lower_bound_subset = df_groupby_country_date.loc[df_groupby_country_date['Country/Region'].isin(lower_bound_countries)]
    lower_bound_subset = lower_bound_subset.reset_index(drop=True)

    return lower_bound_subset

# Subset countries with at least 50,000 cases
df_line_chart = melt_groupby_lowerbound(untouched_data=df_viz3, lower_bound_cases=50000)
# edit US to United States... (assign back instead of a chained in-place replace)
df_line_chart['Country/Region'] = df_line_chart['Country/Region'].replace('US', 'United States of America')

### The k_lag_subset function subsets the data so that each country's series
### starts k days after the date of its first confirmed infection. The days
### are then renumbered from 1 in the 'Lag Days' column. This staggers the
### countries onto a similar start date so that they can be compared.
def k_lag_subset(full_line_chart_df=df_line_chart, k_lags=0):
    df = full_line_chart_df.copy()
    subset_countries = df['Country/Region'].unique()
    total_days = np.max(df['Days'])
    lag_list = []
    for i in subset_countries:
        # Subset by i'th country
        country_subset = df[df['Country/Region'] == i]
        # Find row index of k'th day lag
        first_case_lag = np.where(country_subset['Confirmed Cases'] > 0)[0][0] + k_lags
        # Subset df by k'th row
        lag_subset = country_subset.iloc[first_case_lag:total_days].copy() # copy so the new column is not set on a slice
        lag_subset['Lag Days'] = [i for i in range(1, lag_subset['Days'].tail(1).values[0] -
                                                   lag_subset['Days'].head(1).values[0] + 2)]
        # Append subset df
        lag_list.append(lag_subset)
    
    k_lag_df = pd.concat(lag_list) # Concatenate back to single df
    
    return k_lag_df

# Lag up to k=10 days
lag_df_k10 = k_lag_subset(full_line_chart_df=df_line_chart, k_lags=10)

# Reference: https://github.com/plotly/plotly_express/issues/52
fig = px.line(lag_df_k10, x='Lag Days', y='Log Confirmed Cases', color='Country/Region',
             hover_name='Country/Region', height=1000, title='Log Confirmed Cases vs. Lag Days')
fig.update(layout=dict(title=dict(x=0.5)))
fig.show()

# Reference: https://community.plot.ly/t/proper-way-to-save-a-plot-to-html/7063/8
cwd = os.getcwd()
# fig.write_html(cwd + '\\visualization3.html')
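
The lag alignment performed by `k_lag_subset` can be illustrated on toy data without pandas. The sketch below is only an illustration under assumed inputs: the function name `align_by_first_case` and the sample counts are hypothetical and are not part of the notebook or the dataset.

```python
def align_by_first_case(series_by_country, k_lags=0):
    """Return, per country, (lag_day, cases) pairs starting k_lags days after the first case."""
    aligned = {}
    for country, cases in series_by_country.items():
        # Index of the first day with at least one confirmed case
        first_case = next((idx for idx, n in enumerate(cases) if n > 0), None)
        if first_case is None:
            continue  # no cases recorded, nothing to align
        start = first_case + k_lags
        # Renumber the remaining days from 1 so all countries share a common axis
        aligned[country] = list(enumerate(cases[start:], start=1))
    return aligned


# Hypothetical cumulative counts (not real data)
toy = {
    "A": [0, 0, 1, 3, 9, 20],
    "B": [2, 5, 11, 30, 64, 120],
}
result = align_by_first_case(toy, k_lags=1)
print(result["A"])  # [(1, 3), (2, 9), (3, 20)]
print(result["B"])  # [(1, 5), (2, 11), (3, 30), (4, 64), (5, 120)]
```

Country A records its first case two days later than country B, but after alignment both series begin at lag day 1, which is what allows the curves in the line chart to be compared directly.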

## 7.4 Code for Fourth Visualization
<a id="74"></a>
<a href="#top">Back to top</a>

```r
py_df <- read.csv('python_melt.csv') # Load data from python
lag_df <- read.csv('lag_data.csv')

### Fig. 5
### Plot log cases with linear regression fit
subset_countries <- unique(lag_df$Country.Region) # country names

### The plot_log_cases function will plot the log number of cases
### against the days of cases. It will fit a line using linear regression
### and the plot will include Adj. R-squared, slope, and intercept.
plot_log_cases <- function(log_df = outside_china_log,
                           location_name = 'Outside China') {
  # Fit model
  # Reference: https://www.r-bloggers.com/that-damn-r-squared/
  model_log_df <- lm(formula = log_confirmed~days, data = log_df)
  adj_r_squared <- summary(model_log_df)$adj.r.squared # adj. R-squared
  intercept <- summary(model_log_df)$coefficients[1]
  slope <- summary(model_log_df)$coefficients[2]
  date_count <- 1:nrow(log_df)
  
  # Plot data
  plot(date_count, log_df$log_confirmed, # plot log of cases
       type = 'l', ylab = 'log confirmed cases', xlab = 'Lag Days',
       main = paste(location_name, 'Log Cases vs. Lag Days'),
       sub = paste('Adj. R-squared:', round(adj_r_squared, 4)))
  abline(model_log_df, col = 'blue', lty = 2)
  return(c(slope, intercept))
}

### The log_plot_grid() function is a helper function to
### make the data loaded from python suitable for the
### variables in the plot_log_cases function. It also
### makes sure that the log cases with -Inf are changed
### to 0.
log_plot_grid <- function(df = lag_df, country_name) {
  # Reference: https://stackoverflow.com/questions/7531868/how-to-rename-a-single-column-in-a-data-frame
  names(df)[names(df) == 'Log.Confirmed.Cases'] <- 'log_confirmed'
  names(df)[names(df) == 'Lag.Days'] <- 'days'
  df$log_confirmed <- ifelse(df$log_confirmed < 0, 0, df$log_confirmed)
  plot_log_cases(log_df = df, location_name = country_name)
}

par(mfrow = c(3,5)) # plot grid of log cases
for (i in subset_countries) {
  ith_country <- lag_df[lag_df$Country.Region == i,]
  log_plot_grid(df = ith_country, country_name = i)  
}
dev.off()

### Fig. 6
par(mfrow = c(1,2))
china_lag <- lag_df[lag_df$Country.Region == 'China',]
log_plot_grid(df = china_lag, country_name = 'China')

# Use the code from plot_log_cases to plot a custom
# plot for China
china <- py_df[py_df$Country.Region == 'China',]
names(china)[names(china) == 'Log.Confirmed.Cases'] <- 'log_confirmed'
names(china)[names(china) == 'Days'] <- 'days'
location_name <- 'China'

log_df <- china[1:20,] # Subset first 20 days of cases
model_log_df <- lm(formula = log_confirmed~days, data = log_df)
adj_r_squared <- summary(model_log_df)$adj.r.squared # adj. R-squared
intercept <- summary(model_log_df)$coefficients[1]
slope <- summary(model_log_df)$coefficients[2]
date_count <- 1:nrow(log_df)

# Plot data
plot(date_count, log_df$log_confirmed, # plot log of cases
     type = 'l', ylab = 'log confirmed cases', xlab = 'Days of Cases',
     main = paste(location_name, 'Log Cases vs. Days of Cases'),
     sub = paste('Adj. R-squared:', round(adj_r_squared, 4)))
abline(model_log_df, col = 'blue', lty = 2)

```
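
For readers following along in Python, the log-linear fit used in the R code above can be sketched with the standard library alone. This is a minimal illustration, not the code that produced the figures: the function name `fit_log_linear` and the sample series are hypothetical, and the sketch assumes every case count is positive so that the logarithm is defined.

```python
import math


def fit_log_linear(days, cases):
    """Least-squares fit of log(cases) on days; returns slope, intercept, adjusted R^2."""
    y = [math.log(c) for c in cases]  # assumes all counts are positive
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(y) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (v - mean_y) for x, v in zip(days, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares give R^2
    ss_res = sum((v - (intercept + slope * x)) ** 2 for x, v in zip(days, y))
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    r2 = 1 - ss_res / ss_tot
    # Adjusted R^2 for a model with one predictor
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)
    return slope, intercept, adj_r2


# Hypothetical series that doubles every day (not real data)
days = [1, 2, 3, 4, 5, 6]
cases = [10, 20, 40, 80, 160, 320]
slope, intercept, adj_r2 = fit_log_linear(days, cases)
print(round(slope, 4), round(intercept, 4), round(adj_r2, 4))
```

A straight line on the log scale corresponds to exponential growth on the original scale, with the slope giving the daily growth rate of the logarithm. An adjusted R-squared close to 1 therefore suggests that the period being fitted is still close to exponential.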

## 7.5 Code for Fifth Visualization
<a id="75"></a>
<a href="#top">Back to top</a>

```r
py_df <- read.csv('python_melt.csv') # Load data from python
lag_df <- read.csv('lag_data.csv')

subset_countries <- unique(lag_df$Country.Region) # country names

### Fig. 8
### The delta_confirmed() function takes in the number of
### daily confirmed cases from a country and calculates the
### change in number of cases along with the growth rate.
### The change in number of cases is the difference in
### cases between day i and day i-1. The growth rate is the
### change in cases on day i divided by change in cases on
### day i-1. When the growth is NaN, it defaults to 0.
delta_confirmed <- function(df_cases = us_sum$confirmed) {
  N <- length(df_cases); change_matrix <- matrix(0, nrow = N)
  growth_matrix <- matrix(0, nrow = N) # Initialize variables
  
  for (i in 2:N) { # calculate change in cases
    change_matrix[i] <- df_cases[i] - df_cases[i-1]
  }
  
  for (j in 3:(N-1)) { # calculate growth rate of cases
    growth <- (change_matrix[j] / change_matrix[j-1])
    if (is.nan(growth)) {
      growth_matrix[j] <- 0
    } else {
      growth_matrix[j] <- growth
    }
  }
  
  change_df <- data.frame(days = 1:N, # save to df
                          cases = df_cases,
                          delta_cases = change_matrix,
                          growth_cases = growth_matrix)
  return(change_df)
}

country_list <- split(py_df, py_df$Country.Region) # split countries to list

# calculate the delta dataframes for each of the countries
delta_list <- lapply(country_list, function(x) {
  delta_confirmed(df_cases = x$Confirmed.Cases)
})

### The growth_plot() function is to plot the growth rate
### which is a function of the change in number of cases per
### day. By taking deltaN_d / deltaN_(d-1) the growth is
### calculated. This function will plot this growth rate
### against the number of days of cases. Additionally, it
### will draw a horizontal line at 1 to show when it may
### soon hit the inflection point and taper off.
growth_plot <- function(df_growth = outside_china_delta, location = 'Outside-China') {
  growth_capped <- ifelse(df_growth$growth_cases < 5,
                          ifelse(df_growth$growth_cases >= 0, df_growth$growth_cases, 0), 5)
  N <- nrow(df_growth)
  plot(1:N, growth_capped, pch = 19,
       xlab = 'Days of Cases', ylab = 'Growth Rate',
       main = paste(location, ': Growth Rate vs. Days of Cases'))#,
  # sub = '(Growth capped at 5 for perspective purposes.)')
  abline(h = 1); abline(h = 0)
  segments(x0 = 1:N, y0 = 0, x1 = 1:N, y1 = growth_capped)
}

par(mfrow = c(3,3)) # plot the grid of growth factors
lapply(seq_along(delta_list), function(x) {
  growth_plot(df_growth = delta_list[[x]], location = names(delta_list)[x]) # label with the list's own country name
  
  inf_growth <- which((delta_list[[x]]$growth_cases == Inf))
  sapply(inf_growth, function(y) {
    abline(v = y, col = 'red')
  })
})
dev.off()

# Below certain individual and group plots are created
# from the previous grid of plots.
### The group_plot function is used to plot an individual
### country from the previous grid of plots.
group_plot <- function(delta_df, country_name) {
  growth_plot(df_growth = delta_df, location = country_name)
  
  inf_growth <- which((delta_df$growth_cases == Inf))
  sapply(inf_growth, function(y) {
    abline(v = y, col = 'red')
  })
}
country_order <- unique(py_df$Country.Region) # order of countries

# China
china_num <- which(country_order == 'China')
group_plot(delta_df = delta_list[[china_num]], country_name = 'China')

# Positive trends
positive_trend <- c('China', 'France', 'Germany', 'Spain')
positive_index <- sapply(positive_trend, function(x) which(country_order == x))
par(mfrow = c(2,2))
for(i in 1:length(positive_index)) {
  group_plot(delta_df = delta_list[[positive_trend[i]]], country_name = positive_trend[i]) # look up by country name
}

# Negative trends
negative_trend <- country_order[-positive_index]
negative_index <- sapply(negative_trend, function(x) which(country_order == x))
par(mfrow = c(3,4))
for(i in 1:length(negative_index)) {
  group_plot(delta_df = delta_list[[as.character(negative_trend[i])]], country_name = negative_trend[i]) # look up by country name
}
                         
```
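
The growth rate computed by `delta_confirmed` can also be sketched in plain Python. The block below is an illustrative equivalent under stated assumptions, not a port of the R code: the name `growth_factors` and the sample counts are hypothetical, and unlike the R loop it also computes the factor for the final day.

```python
def growth_factors(cumulative):
    """Return daily changes and growth factors delta_d / delta_(d-1) for cumulative counts."""
    # Change in cases between day d and day d-1 (first day has no previous day)
    deltas = [0] + [b - a for a, b in zip(cumulative, cumulative[1:])]
    factors = [0.0] * len(cumulative)
    for d in range(2, len(cumulative)):
        prev = deltas[d - 1]
        if prev != 0:
            factors[d] = deltas[d] / prev
        elif deltas[d] == 0:
            factors[d] = 0.0  # 0/0 defaults to 0, as in the R code
        else:
            # x/0 is reported as signed infinity (drawn as a red line in the R plots)
            factors[d] = float("inf") if deltas[d] > 0 else float("-inf")
    return deltas, factors


# Hypothetical cumulative counts (not real data)
toy = [1, 2, 4, 8, 12, 14, 15]
deltas, factors = growth_factors(toy)
print(deltas)   # [0, 1, 2, 4, 4, 2, 1]
print(factors)  # [0.0, 0.0, 2.0, 2.0, 1.0, 0.5, 0.5]
```

In this toy series the factor stays above 1 while daily new cases are increasing, equals 1 when they level off, and drops below 1 once they start to fall. This is why the plots above draw a horizontal line at 1.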

## 7.6 Code for Sixth Visualization
<a id="76"></a>
<a href="#top">Back to top</a>

```r
### Edit data from Python for sunburst
sunburst <- read.csv('sunburst.csv') # load sunburst data
sunburst_sub <- sunburst[,colnames(sunburst)[-4]] # remove ISO column

# Convert Province.State from factor to character so values can be reassigned
# Reference: https://stackoverflow.com/questions/2851015/convert-data-frame-columns-from-factors-to-characters
sunburst_sub$Province.State <- as.character(sunburst_sub$Province.State)

# Set blank provinces to NA
sunburst_sub$Province.State[which(sunburst_sub$Province.State == '')] <- NA

for (i in 1:nrow(sunburst_sub)) { # set NA's to country names instead
  if (is.na(sunburst_sub[i,1])) {
    sunburst_sub[i,1] = as.character(sunburst_sub[i,2])
  }
}

# Reference: https://stackoverflow.com/questions/14262741/combining-duplicated-rows-in-r-and-adding-new-column-containing-ids-of-duplicate
sunburst_final <- aggregate(sunburst_sub[3], sunburst_sub[-3], sum)
# replace double names in province/country with NaN
for (i in 1:nrow(sunburst_final)) {
  if (sunburst_final[i,1] == sunburst_final[i,2]) {
    sunburst_final[i,1] <- NaN
  }
}
write.csv(sunburst_final, file = 'sunburst_edit.csv', row.names = FALSE)

```

In [None]:
import pandas as pd # Load libraries
import numpy as np
import pycountry_convert as pc
import plotly.express as px
import os
import plotly
import plotly.graph_objs as go

sunburst_data = pd.read_csv('../input/novel-corona-virus-2019-dataset/covid_19_data.csv') # Load data

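# Parse dates so the latest date is found chronologically rather than by string order
sunburst_data['ObservationDate'] = pd.to_datetime(sunburst_data['ObservationDate'])
latest_date = sunburst_data['ObservationDate'].max() # Find latest date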

# Subset 'Province/State', 'Country/Region','Confirmed' from the latest date
sunburst_sub = sunburst_data[sunburst_data['ObservationDate'] == latest_date][['Province/State', 'Country/Region','Confirmed']]

# Remove certain regions for convenience
excluded_regions = ['Congo (Brazzaville)', 'Diamond Princess', 'MS Zaandam', 'Holy See',
                    'Western Sahara', 'Kosovo', 'Cruise Ship', 'Timor-Leste', 'West Bank and Gaza']
sunburst_sub = sunburst_sub[~sunburst_sub['Country/Region'].isin(excluded_regions)].copy()

# Change country name to alternate name for US, UK, and China
# (assign back to the column instead of using chained indexing, which may not update the dataframe)
sunburst_sub['Country/Region'] = sunburst_sub['Country/Region'].replace(
    {'US': 'United States of America', 'UK': 'United Kingdom', 'Mainland China': 'China'})

iso_list = [] # Create ISO column for location plotting
for i in range(0,len(sunburst_sub)): # Manual ISO labeling
    if sunburst_sub.iloc[i,1] == 'Korea, South':
        iso_list.append('KOR')
    elif sunburst_sub.iloc[i,1] == 'Taiwan*':
        iso_list.append('TWN')
    elif sunburst_sub.iloc[i,1] == 'Congo (Kinshasa)':
        iso_list.append('COD')
        sunburst_sub.iloc[i,1] = 'Congo'
#    elif sunburst_sub.iloc[i,1] == 'Congo (Brazzaville)':
#        iso_list.append('COD')
#        sunburst_sub.iloc[i,1] = 'Congo'
    elif sunburst_sub.iloc[i,1] == 'Cote d\'Ivoire':
        iso_list.append('CIV')
    elif sunburst_sub.iloc[i,1] == 'Gambia, The':
        iso_list.append('GMB')
    elif sunburst_sub.iloc[i,1] == 'Bahamas, The':
        iso_list.append('BHS')
    elif sunburst_sub.iloc[i,1] == 'West Bank and Gaza':
        iso_list.append('PSE') # alpha-3 code ('PS' is the alpha-2 code)
    elif sunburst_sub.iloc[i,1] == 'Burma':
        iso_list.append('MMR')
    else:
        iso_list.append(pc.country_name_to_country_alpha3(sunburst_sub.iloc[i,1], cn_name_format="default"))
sunburst_sub['ISO'] = iso_list # Save ISO column

# Change some country names for continent labels
sunburst_sub.loc[sunburst_sub['Country/Region'] == 'Congo (Kinshasa)','Country/Region'] = 'Congo'
sunburst_sub.loc[sunburst_sub['Country/Region'] == 'Cote d\'Ivoire','Country/Region'] = 'Ivory Coast'
sunburst_sub.loc[sunburst_sub['Country/Region'] == 'Korea, South','Country/Region'] = 'South Korea'
sunburst_sub.loc[sunburst_sub['Country/Region'] == 'Taiwan*','Country/Region'] = 'Taiwan'
sunburst_sub.loc[sunburst_sub['Country/Region'] == 'Burma','Country/Region'] = 'Myanmar'
sunburst_sub.drop(sunburst_sub[sunburst_sub['Country/Region'] == 0].index, inplace=True)

# Reference: https://stackoverflow.com/questions/55910004/get-continent-name-from-country-using-pycountry
continents = { # Add continent to data frame
    'NA': 'North America',
    'SA': 'South America', 
    'AS': 'Asia',
    'OC': 'Australia',
    'AF': 'Africa',
    'EU': 'Europe'
}

sunburst_sub['Continent'] = [continents[pc.country_alpha2_to_continent_code(pc.country_name_to_country_alpha2(country))] for country in sunburst_sub['Country/Region']]

# sunburst_sub.to_csv('sunburst.csv', index=False) # Save data

sunburst_edit = pd.read_csv('../input/data-viz-files/sunburst_edit.csv') # Load edited data from R

# Reference: https://stackoverflow.com/questions/14162723/replacing-pandas-or-numpy-nan-with-a-none-to-use-with-mysqldb
sunburst_edit = sunburst_edit.where(pd.notnull(sunburst_edit), None) # change NaN to None

provinces = list(sunburst_edit['Province.State']) # make dataframe version of dictionary
countries = list(sunburst_edit['Country.Region'])
continents = list(sunburst_edit['Continent'])
confirmed_cases = list(sunburst_edit['Confirmed'])
sunburst_dict = pd.DataFrame(
    dict(Provinces=provinces, Countries=countries, Continents=continents, Cases=confirmed_cases)
)

# change countries with extra None Leaves to match Parent node
non_none_leaves = sunburst_dict[sunburst_dict['Provinces'].notnull()]['Countries'].unique()
for i in non_none_leaves:
    # Rows of this country whose province is missing (None)
    # Reference: https://stackoverflow.com/questions/45271309/check-for-none-in-pandas-dataframe
    none_mask = (sunburst_dict['Countries'] == i) & sunburst_dict['Provinces'].isnull()
    # if there's a None, change it to country
    if none_mask.any():
        # References: https://stackoverflow.com/questions/13842088/set-value-for-particular-cell-in-pandas-dataframe-using-index
        sunburst_dict.loc[none_mask, 'Provinces'] = i + ' (country)'
        
# Reference: https://plot.ly/python/sunburst-charts/#sunburst-of-a-rectangular-dataframe-with-continuous-color-argument-in-pxsunburst
fig = px.sunburst(sunburst_dict,
                  path=['Continents', 'Countries', 'Provinces'],
                  values='Cases',
                  color='Continents',
                  hover_data=['Countries'],
                  title="Sunburst Hierarchy of Confirmed Cases",
                  height=1000,
                  width=1000,
                  color_continuous_scale=px.colors.sequential.Viridis)
fig.update(layout=dict(title=dict(x=0.5)))
fig.update_layout(legend_title='<b> Total Cases </b>')
fig.show()

cwd = os.getcwd()
# fig.write_html(cwd + '\\visualization6.html')