## Data

This data was scraped from worldometers.info on 2022-05-14 by Joseph Assaker.

The dataset description states that 225 countries are represented; the files loaded below contain 226 distinct `country` values.

All countries have records from 2020-2-15 to 2022-05-14 (820 days per country), with two exceptions: China, whose records run from 2020-1-22 to 2022-05-14 (844 days), and Palau, whose records run from 2021-8-25 to 2022-05-14 (263 days).
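
The day counts above can be reproduced with simple date arithmetic. This is a minimal sketch that assumes both the first and the last date are included in each range; the dictionary labels are illustrative.

```python
import pandas as pd

# Last date in the dataset and the first date for each group (from the text above).
end = pd.Timestamp("2022-05-14")
starts = {
    "most countries": "2020-02-15",
    "China": "2020-01-22",
    "Palau": "2021-08-25",
}

# Inclusive day count: difference in days plus one for the start date itself.
days = {name: (end - pd.Timestamp(start)).days + 1 for name, start in starts.items()}
print(days)
```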

There are two files in the dataset:

1) Summary Data Columns Description:
* **country:** designates the Country in which the row's data was observed.
* **continent:** designates the Continent of the observed country.
* **total_confirmed:** designates the total number of confirmed cases in the observed country.
* **total_deaths:** designates the total number of confirmed deaths in the observed country.
* **total_recovered:** designates the total number of confirmed recoveries in the observed country.
* **active_cases:** designates the number of active cases in the observed country.
* **serious_or_critical:** designates the estimated number of cases in serious or critical conditions in the observed country.
* **total_cases_per_1m_population:** designates the number of total cases per 1 million population in the observed country.
* **total_deaths_per_1m_population:** designates the number of total deaths per 1 million population in the observed country.
* **total_tests:** designates the number of total tests done in the observed country.
* **total_tests_per_1m_population:** designates the number of total tests done per 1 million population in the observed country.
* **population:** designates the population count in the observed country.
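
The per-million columns appear to follow from the totals and the population. The sketch below assumes that `total_cases_per_1m_population` is `total_confirmed / population * 1,000,000`, rounded to the nearest integer; the source does not state the formula, and `per_million` is a hypothetical helper. The figures are the Afghanistan and Albania rows of the summary file.

```python
# Figures for two rows of the summary data (Afghanistan, Albania).
rows = {
    "Afghanistan": {"total_confirmed": 179267, "population": 40560636},
    "Albania": {"total_confirmed": 275574, "population": 2871945},
}


def per_million(count, population):
    """Hypothetical helper: scale a count to a rate per 1 million people."""
    return count / population * 1_000_000


# Assumption: the published column is this rate rounded to the nearest integer.
cases_per_1m = {
    name: round(per_million(r["total_confirmed"], r["population"]))
    for name, r in rows.items()
}
print(cases_per_1m)
```

Both results match the `total_cases_per_1m_population` values reported for these two countries (4420 and 95954).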
    
2) Daily Data Columns Description:
* **date:** designates the date of observation of the row's data in YYYY-MM-DD format.
* **country:** designates the Country in which the row's data was observed.
* **cumulative_total_cases:** designates the cumulative number of confirmed cases as of the row's date, for the row's country.
* **daily_new_cases:** designates the daily new number of confirmed cases on the row's date, for the row's country.
* **active_cases:** designates the number of active cases (i.e., confirmed cases that have neither recovered nor died) on the row's date, for the row's country.
* **cumulative_total_deaths:** designates the cumulative number of confirmed deaths as of the row's date, for the row's country.
* **daily_new_deaths:** designates the daily new number of confirmed deaths on the row's date, for the row's country.
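
The cumulative and daily columns are presumably related by a day-over-day difference within each country. The sketch below uses a small made-up frame (countries `A` and `B` and their values are illustrative, not taken from the dataset) and assumes that relationship; the source does not confirm how `daily_new_cases` was computed, so the real column may differ.

```python
import pandas as pd

# Toy data with the same column names as the daily file; values are invented.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"] * 2),
    "country": ["A"] * 3 + ["B"] * 3,
    "cumulative_total_cases": [0, 2, 5, 10, 10, 14],
})

# Sort so that each country's rows are in chronological order before differencing.
toy = toy.sort_values(["country", "date"])

# Day-over-day change per country; the first row of each country has no
# previous day and therefore becomes NaN.
toy["derived_daily_new_cases"] = (
    toy.groupby("country")["cumulative_total_cases"].diff()
)
print(toy)
```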

Source of the dataset: https://www.kaggle.com/datasets/josephassaker/covid19-global-dataset

## Import

### Import Data and Required libraries

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
#importing daily dataframe
daily_df = pd.read_csv('data/worldometer_coronavirus_daily_data.csv')
daily_df.head()

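| | date | country | cumulative_total_cases | daily_new_cases | active_cases | cumulative_total_deaths | daily_new_deaths |
|---|---|---|---|---|---|---|---|
| 0 | 2020-2-15 | Afghanistan | 0.0 | NaN | 0.0 | 0.0 | NaN |
| 1 | 2020-2-16 | Afghanistan | 0.0 | NaN | 0.0 | 0.0 | NaN |
| 2 | 2020-2-17 | Afghanistan | 0.0 | NaN | 0.0 | 0.0 | NaN |
| 3 | 2020-2-18 | Afghanistan | 0.0 | NaN | 0.0 | 0.0 | NaN |
| 4 | 2020-2-19 | Afghanistan | 0.0 | NaN | 0.0 | 0.0 | NaN |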


In [6]:
#importing summary Data
summary_df = pd.read_csv('data/worldometer_coronavirus_summary_data.csv')
summary_df.head()

| | country | continent | total_confirmed | total_deaths | total_recovered | active_cases | serious_or_critical | total_cases_per_1m_population | total_deaths_per_1m_population | total_tests | total_tests_per_1m_population | population |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | Asia | 179267 | 7690.0 | 162202.0 | 9375.0 | 1124.0 | 4420 | 190.0 | 951337.0 | 23455.0 | 40560636 |
| 1 | Albania | Europe | 275574 | 3497.0 | 271826.0 | 251.0 | 2.0 | 95954 | 1218.0 | 1817530.0 | 632857.0 | 2871945 |
| 2 | Algeria | Africa | 265816 | 6875.0 | 178371.0 | 80570.0 | 6.0 | 5865 | 152.0 | 230861.0 | 5093.0 | 45325517 |
| 3 | Andorra | Europe | 42156 | 153.0 | 41021.0 | 982.0 | 14.0 | 543983 | 1974.0 | 249838.0 | 3223924.0 | 77495 |
| 4 | Angola | Africa | 99194 | 1900.0 | 97149.0 | 145.0 | NaN | 2853 | 55.0 | 1499795.0 | 43136.0 | 34769277 |


## Data Preparation 

### Basic Analysis

In [8]:
#shape of both data
daily_df.shape

(184787, 7)

The daily dataset has 184,787 records and 7 columns.

In [9]:
summary_df.shape

(226, 12)

The summary dataset has 226 records and 12 columns.

In [10]:
#check data types
daily_df.dtypes

date                        object
country                     object
cumulative_total_cases     float64
daily_new_cases            float64
active_cases               float64
cumulative_total_deaths    float64
daily_new_deaths           float64
dtype: object

* The `date` column is stored as `object`; it should be converted to `datetime`.

In [11]:
#check if there is any null values
daily_df.isna().sum()

date                           0
country                        0
cumulative_total_cases         0
daily_new_cases            10458
active_cases               18040
cumulative_total_deaths     6560
daily_new_deaths           26937
dtype: int64

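Four of the seven columns contain null values, ranging from about 3.6% of rows (`cumulative_total_deaths`) to about 14.6% (`daily_new_deaths`). Relative to the 184,787 records overall, the missing share is small.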
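
To put the null counts in proportion, they can be expressed as a percentage of all rows. The sketch below uses only the counts and the row total reported above; the variable names are illustrative.

```python
import pandas as pd

# Null counts as reported by daily_df.isna().sum() above.
null_counts = pd.Series({
    "daily_new_cases": 10458,
    "active_cases": 18040,
    "cumulative_total_deaths": 6560,
    "daily_new_deaths": 26937,
})
total_rows = 184787  # from daily_df.shape

# Share of missing rows per column, in percent.
null_share = (null_counts / total_rows * 100).round(2)
print(null_share)
```

On the loaded frame itself, `daily_df.isna().mean() * 100` gives the same percentages in one step.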

#### Statistical Summary of Data

In [12]:
daily_df.describe()

| | cumulative_total_cases | daily_new_cases | active_cases | cumulative_total_deaths | daily_new_deaths |
|---|---|---|---|---|---|
| count | 184787.0 | 174329.0 | 166747.0 | 178227.0 | 157850.0 |
| mean | 725108.9 | 2987.633285 | 62392.83 | 13886.0 | 39.831834 |
| std | 3681471.0 | 17803.232663 | 395564.1 | 60495.21 | 181.10277 |
| min | 0.0 | -322.0 | -14321.0 | 0.0 | -39.0 |
| 25% | 1099.0 | 0.0 | 60.0 | 24.0 | 0.0 |
| 50% | 17756.0 | 58.0 | 1386.0 | 304.0 | 1.0 |
| 75% | 223808.5 | 728.0 | 14620.5 | 4111.0 | 12.0 |
| max | 84209470.0 | 909610.0 | 17935430.0 | 1026646.0 | 5093.0 |


#### Categorical Summary of Data

In [13]:
daily_df.describe(include = 'O')

| | date | country |
|---|---|---|
| count | 184787 | 184787 |
| unique | 844 | 226 |
| top | 2021-11-28 | China |
| freq | 226 | 844 |


* The data covers 226 distinct countries.
* China is the most frequent country, with 844 records, because its records start earlier (2020-1-22) than those of the other countries.
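
Counting rows per country shows which countries deviate from the typical record length. The sketch below runs on a small made-up frame (the row counts are illustrative only) and then checks, with the figures reported in this document, that 224 countries with 820 days plus China (844) and Palau (263) add up to the total row count.

```python
import pandas as pd

# Toy stand-in for daily_df["country"]; the counts here are invented.
toy = pd.DataFrame({
    "country": ["China"] * 4 + ["Palau"] + ["Albania"] * 3 + ["Algeria"] * 3
})

# Number of rows per country, and the most common row count.
rows_per_country = toy["country"].value_counts()
typical = rows_per_country.mode().iloc[0]

# Countries whose number of rows differs from the typical value.
outliers = rows_per_country[rows_per_country != typical]
print(outliers)

# Consistency check with the reported figures: 226 countries in total,
# of which 224 have 820 days, China has 844 and Palau has 263.
expected_rows = 224 * 820 + 844 + 263
print(expected_rows)
```

The second value equals the 184,787 records reported by `daily_df.shape`, which is consistent with 226 countries in the daily file.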

In [21]:
#convert the date column to datetime
daily_df['date'] = pd.to_datetime(daily_df['date'])
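
The conversion can be verified on a small sample before relying on it. This is a minimal sketch with made-up rows that use the same non-zero-padded date style as the file (for example `2020-2-15`); it assumes the default parser of `pd.to_datetime` accepts that style, as the cell above does.

```python
import pandas as pd

# Sample strings in the same style as the daily file's date column.
sample = pd.DataFrame({"date": ["2020-2-15", "2020-2-16", "2020-12-1"]})

# Convert from object (string) to datetime.
sample["date"] = pd.to_datetime(sample["date"])

# Confirm the dtype changed and that datetime accessors now work.
is_datetime = pd.api.types.is_datetime64_any_dtype(sample["date"])
months = sample["date"].dt.month.tolist()
print(sample.dtypes)
print(is_datetime, months)
```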