# Project Description

<b>At the end of 2019, the world was faced with a new challenge: the COVID-19 pandemic, a still-ongoing global pandemic of coronavirus disease 2019. The novel virus was first identified in Wuhan, China, in December 2019 and has since spread worldwide, including to Germany.

At hand, we have a dataset containing information about the infections, deaths and vaccinations in Germany.

Let's analyse this data and see what we can learn from it.</b>

# Description of the data

We will be using 3 datasets in this study :  

- The first dataset, which we named **'`covid`'**, contains 1,176,455 rows of data with information on covid cases, deaths and recoveries in Germany. The data is separated into the following 8 columns :
    - **`state`** : The state to which the tested individuals belong;
    - **`county`** :  The county, within the state, to which the tested individuals belong;
    - **`age_group`** : The age group of the tested (xx-xx);
    - **`gender`** : The gender of the tested;
    - **`date`** : The date of the tests (YYYY-MM-DD);
    - **`cases`** : The number of new covid cases;
    - **`deaths`** : The number of deaths;
    - **`recovered`** : The number of recovered cases.
    
- The second dataset, which we named **'`vaccine`'**, contains 324 rows of data with information on the COVID vaccines in Germany. The data is separated into the following 9 columns :
    - **`date`** : The date of the administration of the vaccine (YYYY-MM-DD);
    - **`doses`** :  The number of doses administered;
    - **`doses_first`** : The number of first doses administered;
    - **`doses_second`** : The number of second doses administered;
    - **`pfizer_cumul`** : The cumulated number of the Pfizer vaccine doses administered;
    - **`moderna_cumul`** : The cumulated number of the Moderna vaccine doses administered;
    - **`astrazeneca_cumul`** : The cumulated number of the AstraZeneca vaccine doses administered;
    - **`persons_first_cumul`** : The cumulated number of patients who have received the first dose of the vaccine;
    - **`persons_full_cumul`** : The cumulated number of patients who are fully vaccinated, meaning they have received the second dose of the vaccine.

- The third dataset, which we named **'`demographics`'**, contains 192 rows of data with information on the demographic distribution in Germany. The data is separated into the following 4 columns :
    - **`state`** : The state to which the demographic belongs;
    - **`gender`** : The gender of the demographic group;
    - **`age_group`** : The age group of the demographic (xx-xx);
    - **`population`** : The number of individuals.
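
As a quick sanity check on the column descriptions, the sketch below compares a DataFrame against a list of expected column names. The column names are the ones reported by `DataFrame.info()` for the loaded files; the dictionary `EXPECTED_COLUMNS` and the helper `missing_columns` are hypothetical names introduced only for this illustration, and the tiny frame stands in for the real `demographics` table.

```python
import pandas as pd

# Expected columns per dataset, as reported by DataFrame.info() for the loaded files.
EXPECTED_COLUMNS = {
    "covid": ["state", "county", "age_group", "gender", "date",
              "cases", "deaths", "recovered"],
    "vaccine": ["date", "doses", "doses_first", "doses_second",
                "pfizer_cumul", "moderna_cumul", "astrazeneca_cumul",
                "persons_first_cumul", "persons_full_cumul"],
    "demographics": ["state", "gender", "age_group", "population"],
}


def missing_columns(df, name):
    """Return the expected columns that are absent from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS[name] if col not in df.columns]


# Tiny stand-in frame; in the notebook this would be the loaded `demographics` table.
demo = pd.DataFrame({"state": ["Berlin"], "gender": ["male"],
                     "age_group": ["05-14"], "population": [164059]})

demo_missing = missing_columns(demo, "demographics")
partial_missing = missing_columns(demo.drop(columns="population"), "demographics")
print(demo_missing, partial_missing)
```

Running such a check right after loading would flag a renamed or missing column before any analysis depends on it.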

# Table of contents  



[**1 : Data Preprocessing**](#1)  
&emsp;&emsp;[1.1 - Importing Libraries and Datasets](#11)     
&emsp;&emsp;[1.2 - Checking Anomalies in Data](#12)  
&emsp;&emsp;&emsp;&emsp;- [Missing Values](#121)  
&emsp;&emsp;&emsp;&emsp;- [Duplicates](#122)  
&emsp;&emsp;[1.3 - Converting Data](#13)  
&emsp;&emsp;[1.4 - Conclusion](#14)   
[**2 : Exploratory Data Analysis**](#2)     
&emsp;&emsp;[2.1 - Data Description](#21)   
&emsp;&emsp;[2.2 - Calculating Number of Active Cases](#22)    
&emsp;&emsp;[2.3 - Data Visualizations](#23)  
&emsp;&emsp;&emsp;&emsp;- [Daily Cases](#221)   
&emsp;&emsp;&emsp;&emsp;- [Daily Deaths](#222)  
&emsp;&emsp;&emsp;&emsp;- [Daily Recoveries](#223)  
[**4 : Conclusion**](#4)  



## 1. Data Preprocessing <a name="1"></a>

### 1.1. Importing Libraries and Datasets<a name="11"></a>

In [1]:
import requests
import sys
import warnings
import time

import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.figure_factory as ff
import math as mth

from scipy import stats
from io import BytesIO
from scipy.stats import ttest_ind
from IPython.display import display_html 
from pandas.plotting import register_matplotlib_converters
from plotly import graph_objects as go 
from scipy.cluster.hierarchy import dendrogram, linkage 
from scipy.cluster import hierarchy
from scipy.stats import rankdata

if not sys.warnoptions:
    warnings.simplefilter("ignore")

In [2]:
# Opening the data files, taking into consideration the separator
covid = pd.read_csv('/kaggle/input/covid19-tracking-germany/covid_de.csv')
vaccine = pd.read_csv('/kaggle/input/covid19-tracking-germany/covid_de_vaccines.csv')
demographics = pd.read_csv('/kaggle/input/covid19-tracking-germany/demographics_de.csv')

# Print information about the datasets to examine them
print('')
print('------------------------------------------ Informations About the Datasets ------------------------------------------')
print('')
display (covid.info())
print('')
display (vaccine.info())
print('')
display (demographics.info())
print('')

# Print a few lines of the datasets to examine data
print('----------------------------------------------- Samples of the datasets ---------------------------------------------')
display(covid.sample(10))
print('')
display(vaccine.sample(10))
print('')
display(demographics.sample(10))
print('')


------------------------------------------ Informations About the Datasets ------------------------------------------

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1176455 entries, 0 to 1176454
Data columns (total 8 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   state      1176455 non-null  object
 1   county     1176455 non-null  object
 2   age_group  1174033 non-null  object
 3   gender     1156365 non-null  object
 4   date       1176455 non-null  object
 5   cases      1176455 non-null  int64 
 6   deaths     1176455 non-null  int64 
 7   recovered  1176455 non-null  int64 
dtypes: int64(3), object(5)
memory usage: 71.8+ MB


None


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 324 entries, 0 to 323
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   date                 324 non-null    object
 1   doses                324 non-null    int64 
 2   doses_first          324 non-null    int64 
 3   doses_second         324 non-null    int64 
 4   pfizer_cumul         324 non-null    int64 
 5   moderna_cumul        324 non-null    int64 
 6   astrazeneca_cumul    324 non-null    int64 
 7   persons_first_cumul  324 non-null    int64 
 8   persons_full_cumul   324 non-null    int64 
dtypes: int64(8), object(1)
memory usage: 22.9+ KB


None


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 192 entries, 0 to 191
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   state       192 non-null    object
 1   gender      192 non-null    object
 2   age_group   192 non-null    object
 3   population  192 non-null    int64 
dtypes: int64(1), object(3)
memory usage: 6.1+ KB


None


----------------------------------------------- Samples of the datasets ---------------------------------------------


Unnamed: 0,state,county,age_group,gender,date,cases,deaths,recovered
993717,Rheinland-Pfalz,SK Trier,15-34,M,2021-04-22,1,0,1
198946,Bayern,LK Donau-Ries,60-79,M,2021-01-19,2,0,2
509134,Hessen,LK Darmstadt-Dieburg,15-34,F,2020-04-13,1,0,1
903028,Nordrhein-Westfalen,SK Remscheid,60-79,F,2021-01-16,3,0,3
974432,Rheinland-Pfalz,SK Frankenthal,35-59,M,2021-04-22,4,0,4
563657,Hessen,LK Waldeck-Frankenberg,35-59,F,2021-06-13,1,0,1
959419,Rheinland-Pfalz,LK Rhein-Pfalz-Kreis,00-04,M,2021-06-02,1,0,1
38054,Baden-Wuerttemberg,LK Heidenheim,00-04,F,2021-07-17,1,0,1
487945,Brandenburg,SK Frankfurt,35-59,F,2021-06-06,1,0,1
1025750,Sachsen,LK Leipzig,15-34,F,2021-01-14,4,0,4





Unnamed: 0,date,doses,doses_first,doses_second,pfizer_cumul,moderna_cumul,astrazeneca_cumul,persons_first_cumul,persons_full_cumul
76,2021-03-13,241271,190878,50393,7403134,352352,1758508,6578719,2935374
124,2021-04-30,805673,668953,136720,22138531,1770485,5875981,23226921,6564712
54,2021-02-19,159075,103152,55923,4670119,141187,168293,3221238,1758460
305,2021-10-28,246462,50488,81853,85957670,9782447,12704273,57685076,55408770
91,2021-03-28,196764,143426,53338,9800190,651879,2792552,9275578,3969153
289,2021-10-12,178794,49992,77401,83748225,9714297,12699750,57104794,54441416
70,2021-03-07,147883,114359,33524,6552062,275602,1050992,5300155,2578602
11,2021-01-07,54777,54757,20,520213,37,9,519045,1250
298,2021-10-21,205395,49646,80450,84993021,9752683,12701353,57453999,55024275
130,2021-05-06,992951,772910,220037,25080587,2155297,6571981,26406020,7445256





Unnamed: 0,state,gender,age_group,population
137,Saarland,female,80-99,45781
129,Rheinland-Pfalz,male,35-59,716037
60,Hamburg,female,00-04,48335
12,Bayern,female,00-04,306378
46,Brandenburg,male,60-79,299105
166,Sachsen-Anhalt,male,60-79,273670
171,Schleswig-Holstein,female,35-59,517809
31,Berlin,male,05-14,164059
57,Bremen,male,35-59,116674
145,Sachsen,female,05-14,172239





<b>Observations :</b>  

We have imported the three datasets we will use for the study and named them **covid**, **vaccine** and **demographics**. From a first look at the datasets and their summaries, we observe the following :  

For the **`covid`** dataset :  
- The dataset consists of 1,176,455 rows and 8 columns;  
- 3 Numerical columns : 'cases', 'deaths', and 'recovered';
- 1 Timestamp column : 'date';
- 4 Categorical columns : 'state', 'county', 'gender' and 'age_group'.

For the **`vaccine`** dataset :
- The dataset consists of 324 rows and 9 columns;
- 1 Timestamp column : 'date';
- 8 Numerical columns : the remaining columns.  

For the **`demographics`** dataset :  
- The dataset consists of 192 rows and 4 columns;  
- 1 Numerical column : 'population';
- 3 Categorical columns : 'state', 'gender' and 'age_group'.


The dataset summaries also show the data types, and we can already see that some are not correctly assigned: the 'date' columns are stored as generic `object` values rather than as dates. We will address the data types in more detail in the next steps.
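
Beyond the date columns, the categorical columns listed above could also be given a dedicated type. The sketch below is only an illustration on made-up data (the frame `sample` is not part of the study): it converts two text columns to the pandas `category` dtype and compares memory usage before and after.

```python
import pandas as pd

# Small stand-in for the covid table (values are illustrative only).
sample = pd.DataFrame({
    "state": ["Bayern", "Hessen", "Bayern", "Sachsen"] * 250,
    "gender": ["M", "F", "F", "M"] * 250,
    "cases": [1, 2, 3, 4] * 250,
})

before = sample.memory_usage(deep=True).sum()

# Low-cardinality text columns are stored more compactly as 'category'.
compact = sample.astype({"state": "category", "gender": "category"})
after = compact.memory_usage(deep=True).sum()

print(f"text columns: {before} bytes, category columns: {after} bytes")
```

Whether this is worthwhile depends on how the columns are used later; it mainly pays off on large tables with few distinct values per column.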

### 1.2. Checking Anomalies in Data<a name="12"></a>

#### 1.2.1. Missing Values<a name="121"></a>

From the dataset information above, we can already see that there are instances of missing data in **`covid`** in two columns : 'age_group' and 'gender'. But **exactly how much data is missing?** Let's calculate the exact number of missing values in this dataset and display a sample of the rows with missing values.

In [3]:
# Missing data count
missing_values = covid.isnull().sum()

# Display
print('----------------- Missing Values Per Column ------------------')
print('')
display(missing_values)
print('--------------------------------------------------------------')
print('')
print('{:.2%} of rows in the covid dataset are missing data.'.format(covid.isnull().sum().sum()/len(covid)))
print('')
print('--------------------------------------------------------------')

# Sample of rows with missing values from covid
display(covid[covid['gender'].isnull()].head(5))
display(covid[covid['age_group'].isnull()].head(5))

----------------- Missing Values Per Column ------------------



state            0
county           0
age_group     2422
gender       20090
date             0
cases            0
deaths           0
recovered        0
dtype: int64

--------------------------------------------------------------

1.91% of rows in the covid dataset are missing data.

--------------------------------------------------------------


Unnamed: 0,state,county,age_group,gender,date,cases,deaths,recovered
716,Baden-Wuerttemberg,LK Alb-Donau-Kreis,05-14,,2020-10-30,1,0,1
717,Baden-Wuerttemberg,LK Alb-Donau-Kreis,05-14,,2020-11-19,1,0,1
718,Baden-Wuerttemberg,LK Alb-Donau-Kreis,05-14,,2021-06-14,1,0,1
719,Baden-Wuerttemberg,LK Alb-Donau-Kreis,05-14,,2021-11-03,1,0,1
720,Baden-Wuerttemberg,LK Alb-Donau-Kreis,05-14,,2021-11-08,1,0,0


Unnamed: 0,state,county,age_group,gender,date,cases,deaths,recovered
3284,Baden-Wuerttemberg,LK Alb-Donau-Kreis,,F,2020-10-26,1,0,1
3285,Baden-Wuerttemberg,LK Alb-Donau-Kreis,,F,2020-12-24,1,0,1
3286,Baden-Wuerttemberg,LK Alb-Donau-Kreis,,M,2021-11-13,1,0,0
6508,Baden-Wuerttemberg,LK Biberach,,F,2021-11-15,1,0,0
10590,Baden-Wuerttemberg,LK Boeblingen,,F,2021-11-12,2,0,0


<b>Observations :</b>   

We notice above that the number of rows missing data in the **`age_group`** column is very small : **2,422 rows, which makes for 0.21% of the dataset rows**. This means that we can simply drop these rows, as their effect on the study results will be minimal.
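
Note that the percentage printed by the cell above divides the number of missing cells by the number of rows. This equals the share of affected rows only if no row is missing both fields. The sketch below, on a made-up frame named `toy`, shows the three quantities side by side: the share of missing values per column, the share of rows with at least one missing value, and the cell-based ratio.

```python
import numpy as np
import pandas as pd

# Illustrative frame: row 1 is missing both fields, rows 2 and 3 one field each.
toy = pd.DataFrame({
    "age_group": ["05-14", np.nan, "15-34", np.nan],
    "gender": ["F", np.nan, np.nan, "M"],
    "cases": [1, 2, 1, 3],
})

# Share of missing values in each column.
share_per_column = toy.isnull().mean()

# Share of rows with at least one missing value.
share_rows = toy.isnull().any(axis=1).mean()

# Missing cells divided by number of rows: counts row 1 twice.
cells_per_row = toy.isnull().sum().sum() / len(toy)

print(share_per_column)
print(f"rows affected: {share_rows:.2%}, missing cells per row: {cells_per_row:.2%}")
```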

On the other hand, the number is bigger for the **`gender`** column, which is missing in 20,090 rows. Before we decide how to treat these rows, let's take a look at some of them in context. We will display, as an example, the data for the state of 'Baden-Wuerttemberg', the county of 'LK Alb-Donau-Kreis', for the age groups '05-14' and '15-34', on the following dates : '2020-10-30' and '2020-11-19'.


In [4]:
# Creating data slice

Slice = covid.query('state == "Baden-Wuerttemberg" and county == "LK Alb-Donau-Kreis" and age_group == ["05-14","15-34"] and date == ["2020-10-30", "2020-11-19"]')
Slice.sort_values(by = 'date')

Unnamed: 0,state,county,age_group,gender,date,cases,deaths,recovered
264,Baden-Wuerttemberg,LK Alb-Donau-Kreis,05-14,F,2020-10-30,2,0,2
504,Baden-Wuerttemberg,LK Alb-Donau-Kreis,05-14,M,2020-10-30,1,0,1
716,Baden-Wuerttemberg,LK Alb-Donau-Kreis,05-14,,2020-10-30,1,0,1
817,Baden-Wuerttemberg,LK Alb-Donau-Kreis,15-34,F,2020-10-30,8,0,8
1225,Baden-Wuerttemberg,LK Alb-Donau-Kreis,15-34,M,2020-10-30,9,0,9
1544,Baden-Wuerttemberg,LK Alb-Donau-Kreis,15-34,,2020-10-30,2,0,2
276,Baden-Wuerttemberg,LK Alb-Donau-Kreis,05-14,F,2020-11-19,4,0,4
518,Baden-Wuerttemberg,LK Alb-Donau-Kreis,05-14,M,2020-11-19,2,0,2
717,Baden-Wuerttemberg,LK Alb-Donau-Kreis,05-14,,2020-11-19,1,0,1
835,Baden-Wuerttemberg,LK Alb-Donau-Kreis,15-34,F,2020-11-19,7,0,7


<b>Observations :</b>   

This allows us to see that in all these instances, rows for both genders are already present alongside the row with the missing value. The missing value may be due to human error, for example when the doctor testing the patient did not note the patient's gender. It may also reflect the personal choice of the tested person not to reveal their gender, or not to identify with either of the binary gender values. 

**Therefore, we will assign the value 'U' to this category to indicate that their gender is unknown.**

In [5]:
# Filling missing values in the gender column with the value 'U'
covid['gender'] = covid['gender'].fillna('U')

# Dropping the remaining missing values which are from the age group column
covid = covid.dropna()

# Infos on the dataset
covid.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1174033 entries, 0 to 1176450
Data columns (total 8 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   state      1174033 non-null  object
 1   county     1174033 non-null  object
 2   age_group  1174033 non-null  object
 3   gender     1174033 non-null  object
 4   date       1174033 non-null  object
 5   cases      1174033 non-null  int64 
 6   deaths     1174033 non-null  int64 
 7   recovered  1174033 non-null  int64 
dtypes: int64(3), object(5)
memory usage: 80.6+ MB


#### 1.2.2. Duplicates<a name="122"></a>

After having dropped the rows with missing values, we are now left with 1,174,033 rows of data in the **`covid`** dataset, which, along with the two other datasets, now needs to be checked for duplicates.

In [6]:
# Checking for duplicates
print('----------------------------------------------------------------------------------')
print('')
print('number of duplicated rows in "covid" :', covid.duplicated().sum())
print('')
print('number of duplicated rows in "vaccine" :', vaccine.duplicated().sum())
print('')
print('number of duplicated rows in "demographics" :', demographics.duplicated().sum())
print('')
print('----------------------------------------------------------------------------------')

----------------------------------------------------------------------------------

number of duplicated rows in "covid" : 0

number of duplicated rows in "vaccine" : 0

number of duplicated rows in "demographics" : 0

----------------------------------------------------------------------------------


<b>Observations :</b>  

**All three datasets have no duplicated rows**, which makes them ready for the next step.
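
`duplicated()` without arguments only flags rows that are identical in every column. A stricter check, sketched below on made-up data, looks for repeated combinations of the identifying columns while ignoring the counts. The choice of `key_columns` is an assumption about what identifies a row in the `covid` table, not something the dataset documentation states.

```python
import pandas as pd

# Illustrative rows: the first two share the same identifying columns
# but report different case counts.
toy = pd.DataFrame({
    "state": ["Bayern", "Bayern", "Hessen"],
    "county": ["LK Muenchen", "LK Muenchen", "LK Fulda"],
    "age_group": ["15-34", "15-34", "15-34"],
    "gender": ["F", "F", "M"],
    "date": ["2021-01-14", "2021-01-14", "2021-01-14"],
    "cases": [4, 1, 2],
})

# Assumed identifying columns (hypothetical choice for this sketch).
key_columns = ["state", "county", "age_group", "gender", "date"]

full_row_duplicates = int(toy.duplicated().sum())
key_duplicates = int(toy.duplicated(subset=key_columns).sum())

print("identical rows:", full_row_duplicates)
print("repeated key combinations:", key_duplicates)
```

If the second count were non-zero on the real data, the affected rows would need to be summed or inspected before grouping by date.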

### 1.3. Converting Data <a name="13"></a>

As mentioned previously, we noticed that there are columns with wrong data types. In this step, we will convert the date columns into the right data type.

In [7]:
# Converting the 'date' columns to datetime
covid['date'] = pd.to_datetime(covid['date'], format='%Y-%m-%d')
vaccine['date'] = pd.to_datetime(vaccine['date'], format='%Y-%m-%d')

# We print information about the datasets to make sure the data types are updated
print(' ')
print('----------------------------------- Information About the Covid Dataset -----------------------------------')
print(' ')
display(covid.info())
print('---------------------------------- Information About the Vaccine Dataset ----------------------------------')
print(' ')
display(vaccine.info())

 
----------------------------------- Information About the Covid Dataset -----------------------------------
 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1174033 entries, 0 to 1176450
Data columns (total 8 columns):
 #   Column     Non-Null Count    Dtype         
---  ------     --------------    -----         
 0   state      1174033 non-null  object        
 1   county     1174033 non-null  object        
 2   age_group  1174033 non-null  object        
 3   gender     1174033 non-null  object        
 4   date       1174033 non-null  datetime64[ns]
 5   cases      1174033 non-null  int64         
 6   deaths     1174033 non-null  int64         
 7   recovered  1174033 non-null  int64         
dtypes: datetime64[ns](1), int64(3), object(4)
memory usage: 80.6+ MB


None

---------------------------------- Information About the Vaccine Dataset ----------------------------------
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 324 entries, 0 to 323
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   date                 324 non-null    datetime64[ns]
 1   doses                324 non-null    int64         
 2   doses_first          324 non-null    int64         
 3   doses_second         324 non-null    int64         
 4   pfizer_cumul         324 non-null    int64         
 5   moderna_cumul        324 non-null    int64         
 6   astrazeneca_cumul    324 non-null    int64         
 7   persons_first_cumul  324 non-null    int64         
 8   persons_full_cumul   324 non-null    int64         
dtypes: datetime64[ns](1), int64(8)
memory usage: 22.9 KB


None

### 1.4. Conclusion <a name="14"></a>

So far we have treated the missing data: **we removed 2,422 rows, which was about 0.21% of the covid dataset**, and labelled missing genders as 'U'. We found no duplicates in any of the three datasets. 

**We have also converted the dates to the proper format, which makes our datasets ready for the next step.**
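
For the date conversion, a defensive variant is sketched below. It assumes the YYYY-MM-DD format given in the data description and uses `errors="coerce"` so that unparseable entries become `NaT` and can be counted instead of raising an error. The series `raw` is made up for the illustration.

```python
import pandas as pd

# Illustrative input with one malformed entry.
raw = pd.Series(["2021-03-13", "2021-04-30", "not a date"])

# Parse with an explicit format; invalid entries become NaT.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

n_invalid = int(parsed.isna().sum())
print(parsed)
print("entries that could not be parsed:", n_invalid)
```

On clean data the count is zero and the result matches a plain conversion.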

## 2. Exploratory Data Analysis<a name="2"></a>


To start the exploratory data analysis, we will retrieve some numerical information from the datasets.

### 2.1. Data Description<a name="21"></a>

Let's take a closer look at the data we already have, by making some simple calculations.

In [8]:
print('------------------------------------------- Dataset Descriptions -------------------------------------------')
display(vaccine.describe().round(2))
display(demographics.describe().round(2))

print('------------------------------------------------------------------------------------------------------------')
print('')
print('The data in "covid" covers the period between :', covid.date.min(), 'and', covid.date.max(),'.')
print('')
print('The data in "vaccine" covers the period between :', vaccine.date.min(), 'and', vaccine.date.max(),'.')
print('')
print('------------------------------------------------------------------------------------------------------------')
print('The total number of COVID-19 cases in Germany is :', covid.cases.sum(), 'cases.')
print('')
print('The total number of COVID-19 deaths in Germany is :', covid.deaths.sum(), 'deaths.')
print('')
print('The total number of COVID-19 recoveries in Germany is :', covid.recovered.sum(), 'cases.')
print('')
print('------------------------------------------------------------------------------------------------------------')
print('The first registered case of COVID-19 in Germany was on the :', covid.date.min(), '.')
print('')
print('The first registered administered vaccine against COVID-19 in Germany was on the :', vaccine.date.min(), '.')
print('')
print('------------------------------------------------------------------------------------------------------------')
# Latest date covered by the vaccine data and total population, reused in the messages below
last_date = vaccine.date.max().strftime('%d-%m-%Y')
total_population = demographics.population.sum()

print('On the', last_date + ',', 'a total of', round(vaccine.persons_full_cumul.max()/1000000 , 2), 'Million individuals are fully vaccinated against COVID-19.')
print('')
print('On the', last_date + ',', 'a total of', round(vaccine.persons_first_cumul.max()/1000000 , 2), 'Million individuals have received the first dose of the vaccine against COVID-19.')
print('')
print('------------------------------------------------------------------------------------------------------------')
print('On the', last_date + ',', 'a total of', round(vaccine.persons_full_cumul.max()/total_population*100 , 2), '% of the population is fully vaccinated against COVID-19.')
print('')
print('On the', last_date + ',', 'a total of', round(vaccine.persons_first_cumul.max()/total_population*100 , 2), '% of the population has received the first dose of the vaccine against COVID-19.')
print('')
print('------------------------------------------------------------------------------------------------------------')

------------------------------------------- Dataset Descriptions -------------------------------------------


Unnamed: 0,doses,doses_first,doses_second,pfizer_cumul,moderna_cumul,astrazeneca_cumul,persons_first_cumul,persons_full_cumul
count,324.0,324.0,324.0,324.0,324.0,324.0,324.0,324.0
mean,355530.12,179946.81,163137.15,42512506.47,4807416.56,7644238.46,32418664.37,24955631.49
std,311360.87,188807.42,181848.14,32476329.3,4024905.28,5194202.43,22254227.97,22084975.5
min,13883.0,4542.0,5.0,24349.0,6.0,0.0,24344.0,11.0
25%,134135.0,55351.75,53385.25,8129693.75,413673.0,1920935.0,7227658.5,3236740.25
50%,250176.0,106697.5,81275.0,40492460.0,4485247.0,9685893.5,38227997.0,18221867.5
75%,455790.25,229845.0,213118.0,76501673.25,9338189.5,12657717.75,54081315.75,50017006.25
max,1428602.0,1056367.0,879492.0,89225246.0,9897747.0,12712123.0,58302768.0,56213082.0


Unnamed: 0,population
count,192.0
mean,432391.73
std,557233.7
min,15906.0
25%,95457.0
50%,234596.0
75%,484169.0
max,3147565.0


------------------------------------------------------------------------------------------------------------

The data in "covid" covers the period between : 2020-01-02 00:00:00 and 2021-11-15 00:00:00 .

The data in "vaccine" covers the period between : 2020-12-27 00:00:00 and 2021-11-15 00:00:00 .

------------------------------------------------------------------------------------------------------------
The total number of COVID-19 cases in Germany is : 5073231 cases.

The total number of COVID-19 deaths in Germany is : 97971 deaths.

The total number of COVID-19 recoveries in Germany is : 4513132 cases.

------------------------------------------------------------------------------------------------------------
The first registered case of COVID-19 in Germany was on the : 2020-01-02 00:00:00 .

The first registered administered vaccine against COVID-19 in Germany was on the : 2020-12-27 00:00:00 .

-----------------------------------------------------------------------------------

<b>Observations :</b> 

From the description of the datasets above, we retain the following :

- According to the data, **the first COVID-19 case in Germany was registered on 02-01-2020**. **The first dose of vaccine in Germany was administered on 27-12-2020**, nearly a year later.  
- **The most commonly administered vaccine in Germany is the Pfizer vaccine, at a total of about 89.2 Million doses (79.8% of the doses from the three vaccines tracked here)**, followed by AstraZeneca at 12.7 Million doses (11.4%) and Moderna at 9.9 Million doses (8.9%);  
- There have been 5.07 Million confirmed cases of COVID-19 in Germany. **89% of these cases have recovered from the virus (4.51 Million recoveries)**;  
- The **Infection Fatality Rate** (IFR), computed here with the formula below as the share of confirmed cases with a fatal outcome, is 1.93% in Germany. This means that **about 2 out of every 100 confirmed COVID-19 cases have a fatal outcome.** Strictly speaking, a ratio over confirmed cases is a case fatality rate, since undetected infections are not counted. The result is similar to the global fatality rate, which is 2% according to a [study](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8451339/) hosted on PubMed Central.   

<img src="https://latex.codecogs.com/gif.latex?Infection&space;Fatality&space;Rate&space;(IFR)&space;=&space;\frac{Deaths}{Cases}" title="Infection Fatality Rate (IFR) = \frac{Deaths}{Cases}" />

- The **Mortality Rate** (MR) relates the number of COVID-19 deaths to the total population of Germany. 

<img src="https://latex.codecogs.com/gif.latex?Mortality&space;Rate&space;(MR)&space;=&space;\frac{Deaths}{Total&space;Population}" title="Mortality Rate (MR) = \frac{Deaths}{Total Population}" />

The Mortality Rate for COVID-19 in Germany is the total deaths (97,971 deaths) divided by the total population of Germany (83,019,213 population), which results in 0.118%. This translates to **118 deaths per 100,000 population**, or **1 death for every 847 people**. 
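
The two formulas can be worked through in a few lines. The sketch below plugs in the case, death and recovery totals printed by the description cell above and the population figure quoted in the text. With a different snapshot of the data, the results shift accordingly.

```python
# Totals printed by the description cell (data up to 2021-11-15).
cases = 5_073_231
deaths = 97_971
recovered = 4_513_132

# Total population of Germany as quoted in the text.
population = 83_019_213

# Fatality rate: deaths as a percentage of confirmed cases.
fatality_rate = deaths / cases * 100

# Share of confirmed cases that have recovered, in percent.
recovery_share = recovered / cases * 100

# Mortality rate: deaths per 100,000 population, and people per death.
deaths_per_100k = deaths / population * 100_000
people_per_death = population / deaths

print(f"fatality rate: {fatality_rate:.2f}%")
print(f"recovered: {recovery_share:.1f}%")
print(f"deaths per 100,000 population: {deaths_per_100k:.0f}")
print(f"one death per {people_per_death:.0f} people")
```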

This mortality rate is lower than in countries such as [Bulgaria](https://coronavirus.jhu.edu/data/mortality), with a MR of 357 deaths per 100,000 population, or the United States of America, with 230 deaths per 100,000 population. It is higher than in countries such as Vietnam and Egypt, with 23 and 19 deaths per 100,000 population respectively. (Note that all the countries in this example, including Germany, are among the top 20 countries with the most COVID-19 cases.)


### 2.2. Calculating Number of Active Cases<a name="22"></a>

The number of **active cases** can be defined as the number of confirmed cases minus the number of recovered cases and deaths. It is the number of cases still considered to be infectious.

In [9]:
# Calculating "active"
covid['active'] = covid['cases'] - covid['recovered'] - covid['deaths']

# Display
display(covid.sample(5))

# Calculating number of currently active cases, as of the latest date in the data
last_date = covid.date.max().strftime('%d-%m-%Y')
print('------------------------------------------------------------------------------------------------------------')
print('')
print('On the', last_date + ',', 'there are', covid.active.sum(), 'active COVID-19 cases.')
print('')
print('------------------------------------------------------------------------------------------------------------')

Unnamed: 0,state,county,age_group,gender,date,cases,deaths,recovered,active
444286,Berlin,SK Berlin Tempelhof-Schoeneberg,15-34,F,2021-09-28,9,0,9,0
1136381,Thueringen,LK Greiz,60-79,M,2021-03-02,4,0,4,0
966011,Rheinland-Pfalz,LK Trier-Saarburg,00-04,F,2020-12-13,1,0,1,0
264651,Bayern,LK Muenchen,00-04,F,2021-10-25,3,0,1,2
478221,Brandenburg,LK Spree-Neisse,35-59,M,2021-11-02,21,0,13,8


------------------------------------------------------------------------------------------------------------

On the 15-11-2021, there are 462128 active COVID-19 cases.

------------------------------------------------------------------------------------------------------------
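
The same definition can be broken down by reporting date to see where the open cases come from. The sketch below uses a made-up frame `toy` and assumes, as the sample rows above suggest, that each row's `recovered` and `deaths` values describe the current outcome of the cases reported on that row's date.

```python
import pandas as pd

# Illustrative rows; in the notebook this would be the `covid` table.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-11-01", "2021-11-01", "2021-11-02", "2021-11-03"]),
    "cases": [10, 5, 8, 6],
    "deaths": [0, 1, 0, 0],
    "recovered": [10, 4, 5, 0],
})

# Active = confirmed cases minus recoveries and deaths, per row.
toy["active"] = toy["cases"] - toy["recovered"] - toy["deaths"]

# Still-open cases per reporting date, and their running total.
daily_active = toy.groupby("date")["active"].sum()
running_active = daily_active.cumsum()

print(daily_active)
print("total active cases:", running_active.iloc[-1])
```

The last value of the running total equals the overall sum of the `active` column, which is the figure the cell above prints.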


### 2.3. Data Visualizations<a name="23"></a>

Having made the calculations above, we will now visualize some of the data from all three datasets.

#### 2.3.1 Daily Cases<a name="221"></a>

We group the **`covid`** dataset by date, to be able to see the total daily cases.

In [10]:
# Grouping the data by date
cases = covid.groupby('date', as_index = False).agg({'cases' : 'sum'})

# Plot bar chart
fig = px.bar(cases, x = 'date', y = 'cases',height = 400,
             color = 'cases', 
             color_continuous_scale = px.colors.diverging.RdBu,
             title = 'Daily Cases of COVID-19 in Germany')
fig.show()

<b>Observations :</b> 

The daily cases graph shows that the COVID-19 pandemic has had **4 peak infection periods, concentrated around April 2020, January 2021 and April 2021, with a fourth peak currently under way**. These peaks are separated by low infection periods. It took approximately 9 months between the first and second peak, approximately 4 months between the second and third, and 7 months between the third and the start of the fourth one.

**The current peak, also called an infection wave, is the most severe so far, with the highest number of cases registered in one day: 34,696 cases on 03-11-2021.**

**We can see a pattern in the timing of the waves so far: they occur around the Easter period and the Christmas / New Year period.** This could be explained by the rise in travel and family gatherings during these periods, which coincide with school breaks.
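
Peaks are easier to read off a smoothed series than off raw daily bars, since daily counts tend to dip on weekends. The sketch below, on made-up numbers, computes a 7-day rolling average and locates the day with the highest raw count. The column name `cases_7d_avg` is introduced only for this illustration.

```python
import pandas as pd

# Illustrative daily totals; in the notebook this would be the `cases` frame.
daily = pd.DataFrame({
    "date": pd.date_range("2021-10-25", periods=10, freq="D"),
    "cases": [100, 120, 90, 150, 200, 60, 40, 180, 220, 260],
})

# 7-day rolling average; the first six days have no full window and stay NaN.
daily["cases_7d_avg"] = daily["cases"].rolling(window=7).mean()

# Row with the highest number of cases in a single day.
peak_row = daily.loc[daily["cases"].idxmax()]

print(daily)
print("peak day:", peak_row["date"].date(), "with", peak_row["cases"], "cases")
```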

#### 2.3.2 Daily Deaths<a name="222"></a>

We group the **`covid`** dataset by date, to be able to see the total daily deaths.

In [11]:
# Grouping the data by date
deaths = covid.groupby('date', as_index = False).agg({'deaths' : 'sum'})

# Plot bar chart
fig = px.bar(deaths, x = 'date', y = 'deaths',height = 400,
             color = 'deaths', 
             color_continuous_scale = px.colors.diverging.RdBu,
             title = 'Daily Deaths of COVID-19 in Germany')
fig.show()

<b>Observations :</b> 

Like the cases graph, the daily deaths graph shows 4 peaks, one for each of the 4 waves of the pandemic. 3 of these peaks are low, but **the second wave had the highest death counts**. From 06-05-2020 until 26-09-2020, the daily death count was stable between 0 and 20. **After that, we notice an exponential growth in the number of deaths, which reached its peak on 30-12-2020 with 1282 deaths in one day.** 


**By 28-01-2021 the daily number of deaths had dropped below 500, and from 13-05-2021 onwards it dropped below 100, where it has stayed until today.**
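
Threshold dates like these can be read off the chart, or computed. The sketch below uses made-up daily death counts and two hypothetical helpers: `first_day_below` returns the first day after the peak on which the count falls under a threshold, and `stays_below_from` returns the first day from which it never rises above the threshold again. The two can differ when the series briefly rebounds.

```python
import pandas as pd

# Illustrative daily death counts (not the real series).
deaths = pd.Series(
    [20, 300, 900, 1282, 800, 450, 520, 480, 90],
    index=pd.date_range("2020-12-27", periods=9, freq="D"),
)

peak_date = deaths.idxmax()
after_peak = deaths[deaths.index > peak_date]


def first_day_below(series, threshold):
    """First date on which the value is below the threshold, or None."""
    below = series[series < threshold]
    return below.index[0] if len(below) else None


def stays_below_from(series, threshold):
    """First date after which the value never reaches the threshold again, or None."""
    at_or_above = series[series >= threshold]
    if at_or_above.empty:
        return series.index[0]
    later = series[series.index > at_or_above.index[-1]]
    return later.index[0] if len(later) else None


print("peak:", peak_date.date())
print("first day below 500:", first_day_below(after_peak, 500).date())
print("below 500 for good from:", stays_below_from(after_peak, 500).date())
```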





#### 2.3.3 Daily Recoveries<a name="223"></a>

In [12]:
# Grouping the data by date
recoveries = covid.groupby('date', as_index = False).agg({'recovered' : 'sum'})

# Plot bar chart
fig = px.bar(recoveries, x = 'date', y = 'recovered',height = 400,
             color = 'recovered', 
             color_continuous_scale = px.colors.diverging.RdBu,
             title = 'Daily Recoveries of COVID-19 in Germany')
fig.show()
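
The three visualization cells repeat the same grouping step for one metric at a time. As a possible simplification, the sketch below aggregates all three metrics in a single `groupby` and reshapes the result into long form. The frame `toy` is made up, and the names `metric` and `count` are chosen only for this illustration.

```python
import pandas as pd

# Illustrative rows; in the notebook this would be the `covid` table.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-14", "2021-01-14", "2021-01-16"]),
    "cases": [4, 1, 3],
    "deaths": [0, 1, 0],
    "recovered": [4, 0, 3],
})

# One pass over the data: daily totals for all three metrics.
daily = toy.groupby("date", as_index=False)[["cases", "deaths", "recovered"]].sum()

# Long form: one row per (date, metric), convenient for faceted charts.
long_form = daily.melt(id_vars="date", var_name="metric", value_name="count")

print(daily)
print(long_form)
```

A long-form table like this could then be passed to a single `px.bar` call with `facet_row='metric'` instead of building three separate figures.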

## 4. Conclusion :<a name="4"></a>
