# Final Project: Covid Case Analysis
`Jordan Renaud, Shane Hoock, Zach Philipp, Alex Garza`

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Functions

### Create a Bar plot.

With the default size, the bar plots created by this function fill roughly the width of the screen.

```
x     = Variable for x axis
y     = Variable for y axis
data  = The DataFrame to pull data from
angle = The angle the tick labels on the x axis should be rotated to. 
        Default: 90 Degrees (Vertical)
size  = A tuple (width, height) giving the figure size in inches
        Default: (25, 8)
```

In [None]:
def bar(x, y, data, angle=90, size=(25, 8)):
    # create bar plot
    plt.figure(figsize=size)
    ax = sns.barplot(x=x, y=y, data=data)
    plt.xticks(rotation=angle)
    plt.ticklabel_format(style='plain', axis='y')

### Plot a set of ratios for countries

Think of it as the fraction a/b, for example deaths/cases. Countries without a positive ratio are left out of the plot.

```
a    = Numerator
b    = Denominator
data = The DataFrame to pull data from
```
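
As a minimal sketch of the a/b idea, the snippet below computes a deaths-per-case ratio on a small made-up table (the country names and numbers are illustrative, not taken from the dataset). It also shows why rows with a zero denominator need care: pandas returns `NaN` or `inf` for them rather than raising an error.

```python
import numpy as np
import pandas as pd

# Hypothetical totals for three made-up countries
totals = pd.DataFrame({
    "location": ["A", "B", "C"],
    "new_cases": [200.0, 50.0, 0.0],
    "new_deaths": [4.0, 5.0, 0.0],
})

# a / b, here deaths / cases
totals["ratio"] = totals["new_deaths"] / totals["new_cases"]

# 0 / 0 gives NaN (and x / 0 gives inf), so keep only finite, positive ratios
valid = totals[np.isfinite(totals["ratio"]) & (totals["ratio"] > 0)]
print(valid[["location", "ratio"]])
```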


In [None]:
def plot_ratio(a, b, data):
    # sum only the numeric columns for each country
    sum_country = data.groupby('location').sum(numeric_only=True).reset_index()
    # calculate the ratio a/b per country (e.g. deaths per case)
    sum_country['ratio'] = sum_country[a] / sum_country[b]
    # keep positive, finite ratios (drops 0/0 and x/0) before sorting
    ratios = sum_country[(sum_country['ratio'] > 0) & np.isfinite(sum_country['ratio'])]
    # plot the ratio by country
    plt.figure(figsize=(30,8))
    ax = sns.barplot(x="location", y="ratio", data=ratios.sort_values(by="ratio"))
    plt.xticks(rotation=90)

### Plot Cases, Tests, and Deaths as a Line Graph for a given country

```
country = The country to pull data for
data    = The DataFrame to pull the data from
```
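
The function below reshapes the daily columns from wide to long format with `pd.melt` before plotting, so that each measure becomes one line on the chart. A small sketch of that reshape on made-up numbers:

```python
import pandas as pd

# Made-up daily figures for a single country
wide = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02"],
    "new_deaths": [1, 2],
    "new_cases": [10, 20],
    "new_tests": [100, 200],
})

# One row per (date, variable) pair; the old column names move into "variable"
long = pd.melt(wide, id_vars=["date"])
print(long)
```

The resulting `variable` column is what the `hue` argument groups by, and `value` holds the counts.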

In [None]:
def plot_cases(country, data):
    # pull country data from clean dataframe
    c = data[data['location'] == country]
    g = c.groupby(by='location').agg({'new_cases': 'sum', 
                                      'new_tests': 'sum', 
                                      'new_deaths': 'sum', 
                                      'population': 'max', 
                                      'extreme_poverty': 'max'})
    g.columns  = ['new_cases', 
                  'new_tests', 
                  'new_deaths', 
                  'population', 
                  'extreme_poverty']

    print("Cases:", g['new_cases'].values[0], "Tests:", g['new_tests'].values[0], "Deaths:", g['new_deaths'].values[0], "Population:", g['population'].values[0], "% Poverty:", g['extreme_poverty'].values[0])
    
    # plot the lines
    plt.figure(figsize=(30,8))
    ax = sns.lineplot(data=pd.melt(c[['date', 'new_deaths', 'new_cases', 'new_tests']], ['date']), x="date", y="value", hue="variable")
    ax.set_title(country + ": Cases, Deaths, and Tests")
    plt.legend(labels=['Deaths', 'Cases', 'Tests'])
    plt.ticklabel_format(style='plain', axis='y')
    
    # skip 30 ticks on the x axis at a time (too many tick labels)
    for ind, label in enumerate(ax.get_xticklabels()):
        if ind % 30 == 0:  
            label.set_visible(True)
        else:
            label.set_visible(False)
    plt.xticks(rotation=30)

# Import and Clean

We're only interested in these columns:
- date
- location
- continent
- new_cases
- new_deaths
- new_tests
- population
- extreme_poverty

We're also not interested in the "World" location, as it represents the total for the entire planet.
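
Other aggregate rows may exist besides "World". One quick check, sketched below on a made-up frame, is to list the locations whose `continent` is missing. This assumes (not verified here) that the dataset leaves `continent` empty for aggregate rows; the row named "Some Aggregate" is a placeholder.

```python
import pandas as pd

# Made-up sample; the assumption is that aggregate rows have no continent
sample = pd.DataFrame({
    "location": ["World", "France", "Some Aggregate", "Kenya"],
    "continent": [None, "Europe", None, "Africa"],
    "new_cases": [1000, 10, 5, 3],
})

# Locations with a missing continent are candidates for aggregate rows
aggregates = sample.loc[sample["continent"].isna(), "location"].unique().tolist()
print(aggregates)

# Keep only rows that belong to a continent
countries = sample.dropna(subset=["continent"])
```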

In [None]:
df = pd.read_csv("/kaggle/input/the-our-world-in-data-covid-vaccination-data/owid-covid-data_3.csv")

# keep only the columns of interest, then drop the "World" aggregate rows
dfclean = df[['date', 'location', 'continent', 'new_cases', 'new_deaths', 'new_tests', 'population', 'extreme_poverty']]
dfclean = dfclean[dfclean["location"] != "World"]

In [None]:
dfclean

# Analysis

We start by plotting a bar graph of the total reported cases for every country in the dataset.

We add up the daily new cases for each country, which gives a new DataFrame with one row per country
and each numeric column holding the sum of its daily values (a single total, not a running cumulative series).
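
To make the summation concrete, here is a small sketch on made-up data. It also illustrates a caveat: a per-country constant such as `population` appears on every daily row, so summing it multiplies it by the number of days. Choosing an aggregation per column with `agg` is one alternative.

```python
import pandas as pd

# Made-up daily rows for two countries
daily = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02", "2021-01-01", "2021-01-02"],
    "location": ["A", "A", "B", "B"],
    "new_cases": [1, 2, 10, 20],
    "population": [500, 500, 9000, 9000],
})

# Summing every numeric column: new_cases is a total, population is inflated
summed = daily.groupby("location").sum(numeric_only=True).reset_index()

# Choosing an aggregation per column keeps population at its real value
per_column = daily.groupby("location").agg(
    {"new_cases": "sum", "population": "max"}
).reset_index()
print(summed)
print(per_column)
```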

In [None]:
# Group by country, sum the values for each country, reset index so the country can be used in the chart below, then sort
sum_country = dfclean.groupby(['location']).sum(numeric_only=True).reset_index().sort_values(by="new_cases")

In [None]:
sum_country

In [None]:
bar('location', 'new_cases', sum_country[sum_country['new_cases'] > 0])

## Deaths/Case ratio

Let's look at the ratio of Deaths to Cases for each country

In [None]:
# plots a chart of the ratio of deaths to cases for each country
plot_ratio("new_deaths", "new_cases", dfclean)

## Statistical Outliers

A few of the countries on the plot have unusually high deaths-to-cases ratios.

- Vanuatu
- Yemen
- Mexico

We can investigate those countries further by plotting their daily reports of cases, tests, and deaths, and by doing additional online research.
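
The countries listed above were picked by eye from the chart. One common way to flag such values programmatically is the 1.5 × IQR rule, sketched below on made-up ratios. This is an illustration only, not the method used in this notebook.

```python
import pandas as pd

# Made-up deaths-per-case ratios for seven hypothetical countries
ratios = pd.Series(
    [0.01, 0.02, 0.02, 0.03, 0.03, 0.04, 0.25],
    index=["A", "B", "C", "D", "E", "F", "G"],
)

q1 = ratios.quantile(0.25)
q3 = ratios.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)

# Values above the upper fence are flagged as unusually high
high = ratios[ratios > upper_fence]
print(high)
```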

### United States

We'll use the United States as a rough baseline for what "normal" testing and reporting look like.

The United States seems to be testing well. The chart shows an obvious weekly structure in how tests are administered or reported over time.
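
One way to check for a weekly pattern, sketched here under the assumption that `date` can be parsed as a datetime, is to average the daily test counts by day of the week. The numbers below are made up.

```python
import pandas as pd

# Two made-up weeks with fewer tests reported on weekends
tests = pd.DataFrame({
    "date": pd.date_range("2021-01-04", periods=14, freq="D"),  # starts on a Monday
    "new_tests": [100, 110, 105, 108, 102, 40, 30] * 2,
})

# Mean tests per weekday; a strong dip on some days suggests a weekly reporting cycle
by_weekday = tests.groupby(tests["date"].dt.day_name())["new_tests"].mean()
print(by_weekday.sort_values())
```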

In [None]:
plot_cases("United States", dfclean)

### Vanuatu

The country has seen only 4 cases of covid and one death as a result of the virus. With so few cases, a single death is enough to produce a ratio as high as 1/4 = 25%.
- On November 11, 2020, the first case was reported: a man who had traveled to the islands from the United States, with layovers in Sydney and Auckland. He arrived on November 4 and was put in isolation. He was asymptomatic the entire time and tested positive on November 10.
- On March 6, 2021, Prime Minister Bob Loughman announced two new cases.
- On April 19, 2021, a positive case was confirmed in a deceased Filipino fisherman who was found on a beach in Efate.

In [None]:
plot_cases("Vanuatu", dfclean)

### Mexico

Mexico's deaths-per-case ratio is close to 20%, which is very high even though a good effort seems to be going into testing and case reporting. Testing and reporting also appear to follow a regular schedule. Like many other countries, Mexico saw its cases climb as 2020 went on. Toward the end of the year the testing rate more than doubled, which could very well have been the main cause of the steep increase in confirmed cases.
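
To separate "more testing" from "more infection", one option is to look at the share of tests that come back positive. The sketch below uses made-up numbers and treats `new_cases / new_tests` as a rough positive rate, which is an assumption: reported tests and cases need not cover the same people or days.

```python
import pandas as pd

# Made-up monthly totals: tests double, then cases keep rising
monthly = pd.DataFrame({
    "month": ["2020-10", "2020-11", "2020-12"],
    "new_tests": [1000, 2000, 2000],
    "new_cases": [200, 400, 600],
})

# Share of tests that were positive in each month
monthly["positive_rate"] = monthly["new_cases"] / monthly["new_tests"]
print(monthly)
```

A flat rate while tests double (the first two rows) is consistent with testing driving the increase in cases; a rising rate at constant testing (the last row) points toward more spread.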

In [None]:
plot_cases("Mexico", dfclean)

### Yemen

Yemen was recognized by the UK as the country most in need. The true number of cases in Yemen is unknown because covid testing supplies are scarce and medical and health services are not widely available.

In [None]:
plot_cases("Yemen", dfclean)

# Continental Analysis

Let's take a look at this dataset from a continental point of view. Below is a chart showing the total cases in each continent, excluding Antarctica.

In [None]:
# Total cases by continent
sum_continent = dfclean.groupby(['continent']).sum(numeric_only=True).reset_index().sort_values(by="new_cases")

In [None]:
sum_continent

In [None]:
# plot above data
bar('continent', 'new_cases', sum_continent, size=(6, 8), angle=0)

## Africa

Africa has a very low number of cases. This could be attributed to very strict lockdown policies and laws; however, poor testing rates may well undermine the integrity of the data.

In [None]:
africa = dfclean[dfclean['continent'] == "Africa"]
africa = africa.groupby('location').sum(numeric_only=True).reset_index().sort_values('new_cases')

In [None]:
africa

In [None]:
bar('location', 'new_cases', africa)

### South Africa

As of May 10, 2021, the population of South Africa had reached 59,945,412, with a population density of 49 people per km²
and an overall population that ranks 25th in the world.

From this information alone, one could predict that South Africa would be near the top in the number of confirmed cases
in all of Africa.

As you can see above, our chart has South Africa at number one for the most new cases in Africa,
with more than double the cases of Morocco, which has the second most. However, to our surprise,
Morocco has a population density of 83 people per km², which is roughly 1.7 times that of South Africa.

From this comparison, we could argue that population density seems to have
little to no effect on the number of confirmed covid cases in a country. The difference may also be caused by a
strain of covid in South Africa that is more contagious than the strain affecting Morocco.
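
Raw case totals favor countries with more people, so a per-capita view can help when weighing the density argument. The sketch below uses placeholder case totals and populations for two made-up countries (not the dataset's values) and computes cases per 100,000 people.

```python
import pandas as pd

# Placeholder figures for two made-up countries
totals = pd.DataFrame({
    "location": ["X", "Y"],
    "new_cases": [1_500_000, 500_000],
    "population": [60_000_000, 36_000_000],
})

# Cases per 100,000 people
totals["cases_per_100k"] = totals["new_cases"] / totals["population"] * 100_000
print(totals)
```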

When you look at the graph of South Africa, you will see two major rises in both the number of cases and the number of tests. We learned that when the virus first affected the country, the
government enforced a lockdown to stop the spread as quickly as possible. Once the numbers started to decrease, however, the policies were
not followed as strictly as they had been, and perhaps should have been, which is the major cause of the second spike in confirmed cases.

In [None]:
plot_cases('South Africa', dfclean)