### I. Data load

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt 
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv("../input/novel-corona-virus-2019-dataset/covid_19_data.csv")
df.head()

**Get more information about the table:**

In [None]:
df.info()

In [None]:
df.columns

### II. Data Cleaning

**Let's check if there are missing values, outliers, or duplicates in our data**

In [None]:
df.isna().sum()

In [None]:
df["Province/State"].isna().sum() * 100 / len(df["Province/State"])
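
The same percentage can be computed for every column at once. The sketch below is a minimal illustration on a made-up toy frame (the `toy` data and its values are assumptions, not the real dataset). It relies on the fact that the mean of a boolean mask equals the share of `True` values.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Province/State": ["Hubei", np.nan, "Anhui", np.nan],
    "Confirmed": [10.0, 2.0, np.nan, 5.0],
    "Deaths": [1.0, 0.0, 0.0, 0.0],
})

# isna() gives a boolean frame; the mean of booleans is the share of True values.
missing_pct = toy.isna().mean() * 100
print(missing_pct.round(2))
```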

**More than 30 percent of the values in the Province/State column are missing, so we'll use a heat map to show where the gaps are:**

In [None]:
colours = ["#66FF00", "#ff2000"] # Green and red colors

sns.heatmap(df.isna(), cmap = colours, cbar = False)

So, what can we do?
Let's try to impute the missing values based on how often each value occurs.

In [None]:
df["Province/State"].unique()

In [None]:
df["Province/State"].value_counts()

In [None]:
# Share of each value among the non-missing rows, in percent
df["Province/State"].value_counts(normalize = True) * 100

In [None]:
# Assign the result back: inplace = True on a column selection may not update df
df["Province/State"] = df["Province/State"].fillna("Diamond Princess cruise ship")

In [None]:
df.isna().sum()
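
For comparison, a frequency-based (mode) imputation could look roughly like the sketch below. It runs on a made-up toy column, and the names `toy` and `most_frequent` are illustrative assumptions. Keep in mind that filling a large share of a column with a single value makes that value dominate the column.

```python
import numpy as np
import pandas as pd

# Toy column standing in for the real data (values are made up).
toy = pd.DataFrame({"Province/State": ["Hubei", "Hubei", np.nan, "Anhui", np.nan]})

# mode() returns the most frequent non-missing value(s); take the first one.
most_frequent = toy["Province/State"].mode().iloc[0]

# Assign the filled column back instead of modifying a column selection in place.
toy["Province/State"] = toy["Province/State"].fillna(most_frequent)
print(toy["Province/State"].value_counts())
```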

**Great! We got rid of the missing values. Next, we will convert the string data to lowercase:**

In [None]:
for col in df.columns:
    if df[col].dtype == "object":
        df[col] = df[col].str.lower()
df.head()

In [None]:
df.columns = df.columns.str.lower()

**Let's check our data for outliers:**

In [None]:
np.round(df.describe(), 2)

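**The "sno" column (originally "SNo") just duplicates the index of this table. Drop it:**

In [None]:
# The column names were lowercased above, so the column is now called "sno"
df.drop("sno", axis = 1, inplace = True)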
df.head()

In [None]:
figure, axes = plt.subplots(1, 3, figsize = (20, 8))
axes = axes.flatten()
k = 0

colors = ["green", "orange", "steelblue"]

# Draw one histogram per float column
for col in df.columns:
    if df[col].dtype == "float":
        # histplot replaces the deprecated distplot(..., kde = False)
        sns.histplot(df[col], color = colors[k], ax = axes[k])
        axes[k].grid(True)
        k += 1

**Based on the histograms and the descriptive table, there may be outliers in our data. We need more information, so let's plot the values over time:**

In [None]:
figure, axes = plt.subplots(1, 3, figsize = (20, 8))
axes = axes.flatten()
k = 0

colors = ["green", "orange", "steelblue"]

# Draw one line plot per float column against the observation date
for col in df.columns:
    if df[col].dtype == "float":
        # Recent seaborn versions require x and y to be passed as keywords
        sns.lineplot(x = df["observationdate"], y = df[col], color = colors[k], ax = axes[k])
        k += 1

**As the graphs show, the large values are explained by the rapid growth over time, so they are not outliers**
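
A numeric rule of thumb can complement the visual check. The sketch below uses the common 1.5 × IQR rule, which is a swapped-in technique and not something this notebook applies. The helper `iqr_outlier_mask` and the toy values are hypothetical. For counts that keep growing over time, such a rule tends to flag the most recent, perfectly legitimate values, which is why the time plots above are the more meaningful check here.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up values: one point is far away from the rest.
toy = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 100.0])
mask = iqr_outlier_mask(toy)
print(toy[mask])
```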

### III. EDA

**Let's start the exploratory data analysis. To look at the dynamics of the pandemic, we load an additional table with the death counts over time:**

In [None]:
df_time_ser = pd.read_csv("../input/novel-corona-virus-2019-dataset/time_series_covid_19_deaths.csv")
df_time_ser.head()

In [None]:
len(df_time_ser.columns[4:])

**The table covers 246 days of observations, starting from January 22**

In [None]:
# Percentage of missing values per column: divide by the number of rows, not the number of columns
df_time_ser.isna().sum() * 100 / len(df_time_ser)

**A large share of the values in the Province/State column is missing. We will drop this column and look at the dynamics by country / region**

In [None]:
df_time_ser.drop("Province/State", axis = 1, inplace = True)

In [None]:
df_time_ser.columns = df_time_ser.columns.str.lower()

for col in df_time_ser.columns:
    if df_time_ser[col].dtype == "object":
        df_time_ser[col] = df_time_ser[col].str.lower()

In [None]:
df_time_ser.head()
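
With the province column gone, several rows can share the same country. One way to get a single series per country is to sum those rows and transpose the result. The sketch below assumes the layout seen above (a country column, two coordinate columns, then one column per date). The toy values and the names `toy`, `date_cols` and `by_country` are made up for illustration.

```python
import pandas as pd

# Toy wide table in the same layout as the time-series file (values are made up).
toy = pd.DataFrame({
    "country/region": ["china", "china", "italy"],
    "lat": [30.9, 31.8, 41.9],
    "long": [112.3, 117.2, 12.6],
    "1/22/20": [17, 0, 0],
    "1/23/20": [17, 1, 0],
})

# Every column after the coordinates is assumed to be a date.
date_cols = list(toy.columns[3:])

# Sum the rows of each country, then transpose so that rows are dates.
by_country = toy.groupby("country/region")[date_cols].sum().T
by_country.index = pd.to_datetime(by_country.index, format="%m/%d/%y")
print(by_country)
```

After this step each column of `by_country` is one country's time series, which can be plotted directly with `by_country.plot()`.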

**Drop columns that have more than 50 values equal to 0:**

In [None]:
df_time_ser.describe()
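
Counting zeros per column and dropping the columns above a threshold could be sketched as follows. This is a minimal illustration on a made-up toy frame. `max_zeros` is a hypothetical parameter, set to 2 here only because the toy frame is tiny (the text above uses 50).

```python
import pandas as pd

# Toy frame standing in for the real table (values are made up).
toy = pd.DataFrame({
    "a": [0, 0, 0, 1],
    "b": [0, 1, 2, 3],
    "c": [5, 6, 7, 8],
})

max_zeros = 2  # hypothetical threshold for the toy data

# Count the zeros in each column and collect the columns above the threshold.
zero_counts = (toy == 0).sum()
cols_to_drop = zero_counts[zero_counts > max_zeros].index

trimmed = toy.drop(columns=cols_to_drop)
print(trimmed.columns.tolist())
```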