Hi everyone! Today I'm going to look at COVID cases around the world and see how many cases and deaths each country has. However, I plan on taking a different approach than other notebooks. Instead of just plotting every country's cases and deaths, I want to find new angles that give us fresh insight into COVID around the world. Rather than looking at COVID in general, I will look at specific features of the data (such as the countries with the most COVID deaths) and draw conclusions that help us learn something new about the pandemic.

I hope you enjoy this notebook and feel free to ask any questions, thank you!

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
from plotly.offline import iplot, plot
import plotly.express as px

You may notice that when reading the CSV file, every column is for some reason shifted one to the left: the values of the Name column end up as the index, and the Name column itself holds the values from WHO Region. This seems to be a problem with the CSV file itself, but I was able to resolve it with the code cell below (using `index_col=False`), which I found on Stack Overflow:

In [None]:
df = pd.read_csv('/kaggle/input/who-covid19-data-tabe/WHO COVID-19 global table data August 11th 2021 at 10.41.34 AM.csv'
, sep = ',', index_col=False)
df.head()

Thank you to Oladeinbo Olutayo for this code. The reason we are renaming a bunch of the countries is so that Plotly can recognize them as their respective countries when graphed.

In [None]:
df = df.rename(columns={'Name':'Country', 'WHO Region':'Continent'})
df = df.drop(0)
df = df[df['Continent'] != 'Other']
df['Country']= df['Country'].replace('United States of America', 'United States')
df['Country']= df['Country'].replace('Russian Federation', 'Russia')
df['Country']= df['Country'].replace('The United Kingdom' , 'United Kingdom')
df['Country']= df['Country'].replace('Iran (Islamic Republic of)' , 'Iran')
df['Country']= df['Country'].replace('Czechia' , 'Czech Republic')
df['Country']= df['Country'].replace('occupied Palestinian territory, including east Jerusalem' , 'Palestine, State of')
df['Country']= df['Country'].replace('Republic of Moldova' , 'Moldova')
df['Country']= df['Country'].replace('Viet Nam' , 'Vietnam')
df['Country']= df['Country'].replace('Bolivia (Plurinational State of)' , 'Bolivia')
df['Country']= df['Country'].replace( 'Palestine','Palestine, State of')
df['Country']= df['Country'].replace('Venezuela (Bolivarian Republic of)' , 'Venezuela')
df['Country']= df['Country'].replace('Republic of Korea' , 'Korea, Republic of')
df['Country']= df['Country'].replace( 'Korea(Republic of','Korea, Republic of')
df['Country']= df['Country'].replace( 'Bosnia and Herzegovina','Bosnia And Herzegovina')
df['Country']= df['Country'].replace( ' Namibia ','Namibia')
df['Country']= df['Country'].replace( 'Syrian Arab Republic','Syria')
df['Country']= df['Country'].replace( 'United Republic of Tanzania','Tanzania')
df.head()

# My Goals

It seems like other users have already done a good job using graphs to analyze cumulative total cases and deaths for each country, so instead I'm going to focus on interesting features in the data and on the total cases and deaths per 100,000 population. I won't rely only on graphs; I will also look directly at curious rows in the data that would be hard to spot in a graph of all cases or deaths.

# Creating New Features
The first thing I want to do is create some more features. These features will consist of
1. Percentage of the country's population that has had COVID.
2. Percentage of the country's population that has died of COVID.
3. Total population.
   1. Population = ("Cases - cumulative total" * 100,000)/("Cases - cumulative total per 100000 population")
   2. Note that "Cases - cumulative total" and "Cases - cumulative total per 100000 population" are column names; the dash is not a minus sign.
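To make the formulas above concrete, here is a small sketch on made-up numbers. The two countries and their values are hypothetical (they are not from the WHO file), and the `replace(0, np.nan)` guard against dividing by zero is my own addition for illustration.

```python
import numpy as np
import pandas as pd

# Made-up numbers for two hypothetical countries, using the WHO column names.
toy = pd.DataFrame({
    'Country': ['A', 'B'],
    'Cases - cumulative total': [50_000, 1_200],
    'Cases - cumulative total per 100000 population': [2_500.0, 4_000.0],
})

per_100k = toy['Cases - cumulative total per 100000 population']

# Cases per 100,000 -> percent of population:
# divide by 100,000 and multiply by 100, i.e. divide by 1,000.
toy['Perc_COVID_Cases'] = per_100k / 1000

# Population = total cases * 100,000 / cases per 100,000.
# A rate of 0 would divide by zero, so it is turned into NaN first (assumption).
toy['Population'] = toy['Cases - cumulative total'] * 100_000 / per_100k.replace(0, np.nan)

print(toy[['Country', 'Perc_COVID_Cases', 'Population']])
```

As a sanity check, country A has 50,000 cases in a recovered population of 2,000,000, which is 2.5%, the same number the per-100,000 rate gives.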

In [None]:
df['Perc_COVID_Cases'] = np.round(df['Cases - cumulative total per 100000 population']/1000, 4)
df['Perc_COVID_Deaths'] = np.round(df['Deaths - cumulative total per 100000 population']/1000, 4)
df['Population'] = np.round((df['Cases - cumulative total'] * 100000)/(df['Cases - cumulative total per 100000 population']), 4)

In [None]:
df.head()

# **Percentage of Cases and Deaths in Countries**
The first thing I want to look at is the countries with the highest and lowest cases and deaths, and I will go on from there.

# Top Cases per 100,000 people


In [None]:
df.nlargest(6, 'Cases - cumulative total per 100000 population')

Some of you (I was at least, haha) may be wondering how the cumulative total per 100,000 can be higher than the cumulative total of cases. This happens because these countries have fewer than 100,000 people, so scaling up to a per-100,000 rate can give a number higher than their actual total cases and even their population.

I don't like making an analysis when lots of the countries don't even have a population of 100,000 because at that point we're just kind of predicting how many cases they would have if the country did have a population of 100,000.
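One way to act on this would be to drop the small countries before ranking. The sketch below assumes a `Population` column like the one created earlier; the helper name `top_per_100k`, the 100,000 cutoff as a parameter, and the toy rows are all hypothetical.

```python
import pandas as pd

def top_per_100k(frame, column, min_population=100_000, n=6):
    """Return the n largest rows of `column` among countries
    with at least `min_population` people."""
    big_enough = frame[frame['Population'] >= min_population]
    return big_enough.nlargest(n, column)

# Made-up rows, only to show the filter working.
toy = pd.DataFrame({
    'Country': ['Tiny', 'Mid', 'Large'],
    'Population': [30_000, 5_000_000, 80_000_000],
    'Cases - cumulative total per 100000 population': [20_000.0, 9_000.0, 4_000.0],
})

top = top_per_100k(toy, 'Cases - cumulative total per 100000 population', n=2)
print(top['Country'].tolist())
```

On the real data this would be called as `top_per_100k(df, 'Cases - cumulative total per 100000 population')`.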

I will say though that I am kind of surprised that countries like the US and India are not up there, considering they're all over the news, but they aren't even in the top 6 for cases per 100,000 population. Then again, America and India have some of the biggest populations, so even though they have some of the most cases, they probably won't have the most cases per some number of people.

# Top Percentage of Total Cases in Country

In [None]:
df_t = df[['Country', 'Perc_COVID_Cases', 'Population']].copy()
df_t['Pop_per'] = df_t['Population'].apply(lambda pop: 'pop < 100,000' if pop < 100000 else ('100,000 < pop < 1,000,000' if pop < 1000000 else ('1,000,000 < pop < 10,000,000' if pop < 10000000 else 'pop > 10,000,000')))
fig = px.bar(data_frame=df_t.nlargest(20, 'Perc_COVID_Cases'), x = 'Country', y = 'Perc_COVID_Cases', color = 'Pop_per', title='Percentage of Population that has gotten COVID per Country')
fig.update_layout(xaxis_categoryorder = 'total descending')
#fig.update_layout(height = 500)
iplot(fig)

Wow, so in some countries COVID has infected almost 20% of the entire population! However, many of these countries have a low population to begin with compared with other countries. Some countries to note, though, with a big population and lots of COVID cases are Czech Republic, Argentina, and Netherlands. While the US is constantly talked about in the news, it seems that other countries also have a large share of people infected with COVID but are not covered in the news as much (probably because the US and India have some of the highest cumulative case counts).
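The population buckets used to color the chart above come from a nested lambda. As an alternative sketch (not the code this notebook runs), the same buckets could be built with `pd.cut`. The bin edges and labels mirror the lambda, and the sample populations are made up. One difference to be aware of: a missing population stays missing here, while the lambda puts it in the `pop > 10,000,000` bucket.

```python
import numpy as np
import pandas as pd

# Same cut points and labels as the nested lambda.
bins = [0, 100_000, 1_000_000, 10_000_000, np.inf]
labels = ['pop < 100,000', '100,000 < pop < 1,000,000',
          '1,000,000 < pop < 10,000,000', 'pop > 10,000,000']

# Made-up populations, including a boundary value and a missing one.
population = pd.Series([30_000, 100_000, 5_000_000, 80_000_000, np.nan])

# right=False makes each bin include its left edge, matching the "<" tests
# in the lambda (exactly 100,000 falls in the second bucket).
pop_per = pd.cut(population, bins=bins, labels=labels, right=False)
print(pop_per)
```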

# Smallest Percentage of Total Cases in the Country

In [None]:
fig = px.bar(data_frame=df_t.nsmallest(20, 'Perc_COVID_Cases'), x = 'Country', y = 'Perc_COVID_Cases', color = 'Pop_per', title='Percentage of Population that has gotten COVID per Country')
fig.update_layout(xaxis_categoryorder = 'total ascending')

So most of these countries seem to have almost no COVID cases, but this seems to be because most of them are either corrupt or small islands that don't really bother to record COVID cases (they may be too poor to). China being near the bottom is obviously a lie as well, considering it's where the whole pandemic originated.

# Top Deaths per 100,000 people

In [None]:
df.nlargest(6, 'Deaths - cumulative total per 100000 population')

Oh man, especially in Peru, that is a lot of deaths per 100,000 population. I am going to ignore San Marino in this case because it has fewer than 100,000 people.

Peru significantly exceeds other countries in terms of cumulative deaths per 100,000 population and has more cumulative cases than any other country in the top 6 as well. In fact, according to this news article (https://www.theguardian.com/global-development/2021/aug/16/hidden-pandemic-peruvian-children-in-crisis-as-carers-die), tens of thousands of kids are now losing their parents to COVID-19, a very troubling sign.

The countries below Peru like Hungary also have more than 280 deaths per 100,000 people indicating that these countries are also suffering from COVID-19.

In [None]:
df['Pop_per'] = df['Population'].apply(lambda pop: 'pop < 100,000' if pop < 100000 else ('100,000 < pop < 1,000,000' if pop < 1000000 else ('1,000,000 < pop < 10,000,000' if pop < 10000000 else 'pop > 10,000,000')))
df_b = df[['Country', 'Perc_COVID_Deaths', 'Population', 'Pop_per']].copy()
fig = px.bar(data_frame=df_b.nlargest(20, 'Perc_COVID_Deaths'), x = 'Country', y = 'Perc_COVID_Deaths', color = 'Pop_per', title='Percentage of Population that has died to COVID per Country')
fig.update_layout(xaxis_categoryorder = 'total descending')

So in Peru about 0.6% of the entire population (196,950 in 32.51 million) has died of COVID. Other countries in the top 20 have lost around 0.2% - 0.3% of their population to COVID.

While these percentages may look small, 0.3% of 1,000,000 is 3,000 deaths, and most of these countries have more than 1,000,000 people.
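The arithmetic behind these numbers is simple enough to check by hand, but here it is as a tiny sketch. The two helper functions are hypothetical names of my own, and the Peru figures are the ones quoted above.

```python
def percent_of(part, whole):
    """Share of `whole` made up by `part`, as a percentage."""
    return part / whole * 100

def count_from_percent(percent, whole):
    """Number of people that `percent` percent of `whole` corresponds to."""
    return percent / 100 * whole

# Peru: 196,950 deaths in a population of about 32.51 million.
peru_pct = percent_of(196_950, 32_510_000)

# 0.3% of a country with 1,000,000 people.
deaths_at_0_3_pct = count_from_percent(0.3, 1_000_000)

print(round(peru_pct, 2), round(deaths_at_0_3_pct))
```

The Peru figure comes out at roughly 0.6%, and 0.3% of one million is roughly 3,000 people, matching the text.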

# Smallest Percentage of Total Deaths in the Country

In [None]:
px.bar(data_frame=df_b.nsmallest(20, 'Perc_COVID_Deaths'), x = 'Country', y = 'Perc_COVID_Deaths', color = 'Pop_per', title='Percentage of Population that has died to COVID per Country')

Just like the countries that had almost no COVID cases, the countries where almost none of the population has died of COVID are mainly corrupt countries that don't report their true number of COVID deaths.

What's interesting, though, is places like Greenland having 0 deaths. It seems that in 2020 Greenland did an extremely good job handling COVID, with fewer than 200 cases and 0 deaths total (https://tinyurl.com/282xb3ej). This did come at the cost, however, of completely denying travel into or out of the country and big lockdowns.

# **Who has recently been experiencing the most COVID Cases and Deaths?**

# Highest Amount of COVID Cases Recently

In [None]:
fig = px.bar(data_frame = df.nlargest(20, 'Cases - newly reported in last 24 hours'), x = 'Country', y = 'Cases - newly reported in last 24 hours', title = 'Highest Amount of COVID Cases in Last 24 Hours by Country', color = 'Pop_per')
fig.update_layout(xaxis_categoryorder = 'total descending')

In [None]:
df[df['Country'] == 'Japan']

Wow, in the last 24 hours there have been a lot of new cases in bigger countries! America reported over 150,000 COVID cases in just the last day, more than three times the next-highest count, in Iran.

What's interesting is that the number of COVID cases does not seem to depend much on how developed the country is, considering that multiple developed countries such as the UK, US, and Japan all have high numbers of COVID cases.

It's kind of sad that many of these countries, especially the US, were ranked among the best prepared for a global pandemic (https://www.ghsindex.org/) and yet they are doing some of the worst jobs of dealing with it. I guess it goes to show that you can have all the resources in the world, but if you don't use them, they're useless.

# Lowest Amount of Recent COVID Cases

In [None]:
df.nsmallest(10, 'Cases - newly reported in last 24 hours')

It seems that in many of these Countries COVID is still spreading strongly and yet in the last 24 hours there were somehow no new COVID Cases. I do not know the date that these new COVID Cases came from but I'm guessing they may just be errors as from what I have seen from Google, the COVID Cases have not even come close to a full stop in any of these countries. Perhaps they were just not recorded for the day.
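One rough way to check the "not recorded for the day" guess is to look for countries with zero new cases in the last 24 hours but a positive count over the last 7 days. This is only a sketch on made-up rows. The 7-day column name is the one used later in this notebook, and a match is a hint of a reporting gap, not proof.

```python
import pandas as pd

# Made-up rows using the WHO column names.
toy = pd.DataFrame({
    'Country': ['A', 'B', 'C'],
    'Cases - newly reported in last 24 hours': [0, 0, 120],
    'Cases - newly reported in last 7 days': [900, 0, 800],
})

no_cases_today = toy['Cases - newly reported in last 24 hours'] == 0
cases_this_week = toy['Cases - newly reported in last 7 days'] > 0

# Zero today but active this week: possibly a day with no report.
possible_reporting_gap = toy[no_cases_today & cases_this_week]
print(possible_reporting_gap['Country'].tolist())
```

Here only country A is flagged. Country B has no cases all week, so its zero looks genuine.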

# Highest Amount of COVID Deaths Recently

In [None]:
fig = px.bar(data_frame = df.nlargest(20, 'Deaths - newly reported in last 24 hours'), x = 'Country', y = 'Deaths - newly reported in last 24 hours', title = 'Highest Amount of COVID Deaths in Last 24 Hours by Country', color = 'Pop_per')
fig.update_layout(xaxis_categoryorder = 'total descending')

As expected, the countries that have the most COVID cases of the day also have the most deaths of the day.

One thing to note, though, is which countries have the most deaths. The US ranked first for the number of COVID cases in the day but has significantly fewer deaths than Indonesia, even though Indonesia had far fewer cases in the day. This is likely because America is a much more developed country and has a lot more resources to prevent deaths due to COVID, while Indonesia does not.

So while most countries that had the most COVID cases also have the most deaths, developed countries are much less likely to have this same trend even if they have a lot of cases because they have a lot more resources to make sure people won't die.
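To compare countries on this more directly, one option is a crude deaths-per-100-cases ratio over the last 7 days. The sketch below uses made-up numbers for two hypothetical countries, and the new column name `Deaths_per_100_cases_7d` is my own. Keep in mind that deaths lag behind cases, so this ratio is only a rough indicator.

```python
import numpy as np
import pandas as pd

# Made-up weekly numbers for two hypothetical countries.
toy = pd.DataFrame({
    'Country': ['X', 'Y'],
    'Cases - newly reported in last 7 days': [700_000, 200_000],
    'Deaths - newly reported in last 7 days': [3_500, 10_000],
})

# Avoid dividing by zero for countries with no reported cases this week.
cases_7d = toy['Cases - newly reported in last 7 days'].replace(0, np.nan)

toy['Deaths_per_100_cases_7d'] = (
    toy['Deaths - newly reported in last 7 days'] / cases_7d * 100
)
print(toy[['Country', 'Deaths_per_100_cases_7d']])
```

In this toy example country X has many more cases but a much lower ratio than country Y, which is the pattern described above for the US and Indonesia.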

# **Continents COVID Cases and Deaths**

The last thing I want to take a look at is how many cases and deaths there are per continent just to see which of them is doing the best and the worst.

In [None]:
# Select the numeric columns first so only they are summed per continent
df_con = df.groupby('Continent')[['Cases - cumulative total', 'Cases - newly reported in last 7 days', 'Deaths - cumulative total', 'Deaths - newly reported in last 7 days', 'Population']].sum()
df_con.head()

Looking at this dataframe, we can see that the Americas have the most cases and deaths cumulatively. This makes sense, as the US has by far the most cases and deaths in the world, and Brazil (also in the Americas) is second in deaths and third in cases.

One thing to note, though, is that South-East Asia actually has more deaths reported in the last 7 days than the Americas, which come second. This makes sense, as many of these countries have a big population but are not as developed as countries like the US, so they have fewer ways to help people and have most likely had many cases of their hospitals being overwhelmed.

# COVID Cases in the Continents

In [None]:
df_con['Case_per'] = df_con['Cases - cumulative total']/df_con['Population']*100
df_con['Death_per'] = df_con['Deaths - cumulative total']/df_con['Population']*100
df_con = df_con.reset_index()
df_con.head()

In [None]:
px.bar(data_frame = df_con, x = 'Continent', y = 'Case_per', title = 'Percent of Continent that has been infected with COVID')

Looking at this graph, we can see that almost 8% of the entire Americas has been infected by COVID, along with about 6% of the entirety of Europe.

Surprisingly, Africa has far fewer people infected with COVID relative to its population. Africa has a very large population, however, and most of its countries don't have testing as good as the US or most countries in Europe, so many cases are probably going unrecorded.

Countries in the Western Pacific are doing the best compared to other continents. This is most likely because the region is made up of islands, countries that did well at preparing for COVID (South Korea, Japan, Australia), and China, which I believe has lied about the number of COVID cases it has.

# COVID Deaths in the Continents

In [None]:
px.bar(data_frame = df_con, x = 'Continent', y = 'Death_per', title = 'Percent of Continent that has died to COVID')

So as expected, the share of COVID deaths roughly tracks the share of COVID cases in a continent. It's frightening to see that about 0.2% of the entire Americas has died of COVID, and other continents are experiencing a lot of their population dying of COVID as well.

# Total COVID Cases and Deaths in the World

In [None]:
df_con['Cases - cumulative total'].sum()/df_con['Population'].sum()*100

In [None]:
df_con['Deaths - cumulative total'].sum()/df_con['Population'].sum()*100

So it seems that out of the entire world, 2.6% of everyone has gotten COVID and 0.055% have died of it. May all who died rest in peace, and hopefully this pandemic will end soon.
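A quick note on why the world figures divide summed cases by summed population instead of averaging the per-continent percentages: averaging would give a small region the same weight as a huge one. The sketch below shows the difference on two made-up regions (the names and numbers are hypothetical).

```python
import pandas as pd

# Two made-up regions with very different populations.
toy = pd.DataFrame({
    'Continent': ['R1', 'R2'],
    'Cases - cumulative total': [8_000_000, 1_000_000],
    'Population': [100_000_000, 900_000_000],
})
toy['Case_per'] = toy['Cases - cumulative total'] / toy['Population'] * 100

# Pooled percentage: total cases over total population (what the cells above do).
pooled = toy['Cases - cumulative total'].sum() / toy['Population'].sum() * 100

# Unweighted mean of the regional percentages, which ignores population size.
unweighted = toy['Case_per'].mean()

print(round(pooled, 2), round(unweighted, 2))
```

The pooled figure is under 1% while the plain average is around 4%, because the small, hard-hit region dominates the average.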

Thank you for reading my notebook, have a great day!