# **Visual analysis of COVID-19 data**

In this Kernel, we will look at the data about COVID-19 provided by the Mexican government.

I would like to thank the person under the nickname **naks** for providing the Kernel that will help prepare the data for analysis.

Link: https://www.kaggle.com/universalastro/covid-precondition

In [None]:
import pandas as pd 
import numpy as np
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('dark')
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
print('Setup complete')

**I am skipping the description of the data cleanup here; the details are in the Kernel linked above.**

In [None]:
df = pd.read_csv('../input/covid19-patient-precondition-dataset/covid.csv', index_col='id')

In [None]:
plt.subplots(figsize=(12, 10))
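# numeric_only=True skips the string date columns, which recent pandas versions refuse to correlate.
sns.heatmap(df.corr(numeric_only=True))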

In [None]:
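# ELD index: for every patient, count how many of the ten columns
# at positions 9..18 are flagged with the smallest code (assumed to be 1 = yes).
ELD = np.zeros_like(df['diabetes'].values, dtype='int32')

for col in df.columns[9:19]:
    uniques = np.sort(df[col].unique())
    # Keep the smallest code and zero out every other code before summing.
    ELD += df[col].replace(uniques[1:], 0).values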

df['ELD_indx'] = ELD
df = df.drop(df.columns[9:19], axis=1)

plt.subplots(figsize=(12, 10))
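# numeric_only=True skips the string date columns, which recent pandas versions refuse to correlate.
sns.heatmap(df.corr(numeric_only=True))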

In [None]:
no_icu_data_bool = df['icu'].isin([97, 98, 99])

# Copy the slices so that later column assignments do not act on a view of df.
icu_data = df[~no_icu_data_bool].copy()
no_icu_data = df[no_icu_data_bool].copy()
print("{} rows have ICU details ".format(icu_data.shape[0]))
# Share of ALL rows that have ICU details (the total is the denominator).
print("Only {}% of given data has ICU details ".format(round((icu_data.shape[0] / df.shape[0]) * 100)))

In [None]:
# Codes 97, 98 and 99 all mean that the value was not recorded.
yes_no = {1: 'Yes', 2: 'No', 97: 'Not Specified', 98: 'Not Specified', 99: 'Not Specified'}

# Assign the result back instead of calling replace(..., inplace=True) on a column,
# which is unreliable on a filtered DataFrame.
icu_data['sex'] = icu_data['sex'].replace({1: 'Female', 2: 'Male'})
icu_data['patient_type'] = icu_data['patient_type'].replace({1: 'Outpatient', 2: 'Inpatient'})
for col in ['intubed', 'pneumonia', 'pregnancy', 'contact_other_covid', 'icu']:
    icu_data[col] = icu_data[col].replace(yes_no)
icu_data['covid_res'] = icu_data['covid_res'].replace({1: 'Positive', 2: 'Negative', 3: 'Awaiting Results'})


In [None]:
icu_data.head()

In [None]:
from datetime import datetime
def convert_date(day, first_day="01-01-2020", sep='-'):
    d1 = first_day.replace('-', sep)
    fmt = f'%d{sep}%m{sep}%Y'
    d1 = datetime.strptime(d1, fmt)
    d2 = datetime.strptime(day, fmt)
    delta = d2 - d1
    return delta.days

In [None]:
# '9999-99-99' is the placeholder for a missing date; mark it with 0 and map it to NaN.
# np.nan is used because the np.NaN alias was removed in NumPy 2.0.
icu_data['date_died'] = icu_data['date_died'].replace('9999-99-99', 0)
icu_data['day_died'] = icu_data['date_died'].apply(lambda date: np.nan if date == 0 else convert_date(date))

icu_data['entry_date'] = icu_data['entry_date'].replace('9999-99-99', 0)
icu_data['entry_day'] = icu_data['entry_date'].apply(lambda date: np.nan if date == 0 else convert_date(date))

icu_data['date_symptoms'] = icu_data['date_symptoms'].replace('9999-99-99', 0)
icu_data['day_symptoms'] = icu_data['date_symptoms'].apply(lambda date: np.nan if date == 0 else convert_date(date))

In [None]:
icu_data['died'] = icu_data['date_died'].apply(lambda x: 'Non-died' if x == 0 else 'Died')

In [None]:
icu_data

# **Analysis**

This is where the fun begins. 

First, I want to see how ages are distributed in general, and separately among those who died and those who survived.

In [None]:
df = icu_data

In [None]:
fig, axarr = plt.subplots(1, 2, figsize=(12,6))
axarr[0].set_title('Age distribution')
# histplot replaces the deprecated distplot; stat='density' keeps the density scale.
f = sns.histplot(df['age'], color='g', bins=40, kde=True, stat='density', ax=axarr[0])
axarr[1].set_title('Age distribution for the two subpopulations')
# fill=True replaces the deprecated shade=True argument.
g = sns.kdeplot(df['age'].loc[df['died'] == 'Died'],
                fill=True, ax=axarr[1], label='Died')
g = sns.kdeplot(df['age'].loc[df['died'] == 'Non-died'],
                fill=True, ax=axarr[1], label='Not died')
axarr[1].set_xlabel('Age')
axarr[1].legend()

All three distributions look roughly bell-shaped, that is, approximately normal.
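A visual impression of normality can be backed up with two summary numbers. The sketch below is only an illustration on synthetic ages (the `ages` array is an assumption, not the notebook's data); in the notebook you would pass `df['age']` instead. Values of skewness and excess kurtosis close to 0 are consistent with a normal shape.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in for df['age']; replace with the real column in the notebook.
ages = rng.normal(loc=45, scale=15, size=100_000)

age_skew = stats.skew(ages)      # 0 for a perfectly symmetric distribution
age_kurt = stats.kurtosis(ages)  # excess kurtosis, 0 for a normal distribution
print(f'skewness: {age_skew:.3f}, excess kurtosis: {age_kurt:.3f}')
```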

After studying the distributions for those who died and those who survived, I suspect that age affects survival. This has been known for a long time, but we will check it with classical statistical methods.

We take the following statement as the null hypothesis (H0): age does not affect survival, that is, the mean age is the same among those who died and those who survived.

In [None]:
from scipy import stats

mask = df['died'] == 'Died'
died = df['age'][mask]

mask = df['died'] == 'Non-died'
nondied = df['age'][mask]

res = stats.ttest_ind(died, nondied, equal_var=False)
print('p-value:', res[1])

The p-value is so small that it underflows to zero in floating-point arithmetic, so we can safely reject the null hypothesis and say that age affects the survival rate of COVID-19 infection.
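With samples this large, a p-value of zero says little about how big the difference is, so it can help to report the t statistic and an effect size as well. The sketch below uses synthetic ages, not the real data; the names `died_demo`, `nondied_demo` and the chosen means are assumptions for illustration. Cohen's d is an addition of mine and is not part of the original analysis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic ages, NOT the real data: two large groups whose means differ by 10 years.
died_demo = rng.normal(loc=61, scale=15, size=100_000)
nondied_demo = rng.normal(loc=51, scale=15, size=100_000)

# Welch's t-test, the same call as in the notebook.
res_demo = stats.ttest_ind(died_demo, nondied_demo, equal_var=False)
print('t statistic:', res_demo.statistic)
print('p-value:', res_demo.pvalue)  # far too small to represent as a 64-bit float

# Cohen's d: difference of means in units of the pooled standard deviation.
pooled_sd = np.sqrt((died_demo.var(ddof=1) + nondied_demo.var(ddof=1)) / 2)
cohens_d = (died_demo.mean() - nondied_demo.mean()) / pooled_sd
print("Cohen's d:", round(cohens_d, 2))
```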

Empirically, it can be understood that the older a person is, the more likely they are to die.

Let's look at the statistics of the dead and survivors.

In [None]:
died.describe()

Average age of those who died: 61-62 years old.

In [None]:
nondied.describe()

Average age of a survivor: 50-52 years old.

# **Analysis of the ELD index.**

Recall that the ELD index was added to reduce the dimensionality of the source data. It counts how many of the ten condition columns are flagged for a person, and I use it as a measure of how prone that person is to lung diseases.

The index values range from 0 to 10, where 0 means no tendency to disease at all and 10 is the most dangerous situation for a person.
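To make the construction of the index concrete, here is a minimal sketch on a toy table. The three column names and the values are invented for illustration, and the counting is done with `DataFrame.eq` rather than with the `replace` loop used earlier; both give the same result as long as every column contains the code 1. A patient with no flagged condition gets an index of 0.

```python
import pandas as pd

# Toy data with hypothetical columns; 1 = yes, 2 = no, 98 = not specified.
toy = pd.DataFrame({
    'diabetes': [1, 2, 2, 98],
    'asthma':   [1, 1, 2, 2],
    'obesity':  [1, 2, 2, 1],
})

# Count, for every row, the columns whose value equals 1.
toy['ELD_indx'] = toy.eq(1).sum(axis=1)
print(toy['ELD_indx'].tolist())  # [3, 1, 0, 1]
```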

In [None]:
print('Number of people for each ELD index value:')
print(df.ELD_indx.value_counts())

fig, ax = plt.subplots(figsize=(12, 8))
ax = sns.barplot(x=df['ELD_indx'].value_counts().keys(),
            y=df['ELD_indx'].value_counts().values)
ax.set_xticklabels(ax.get_xticklabels(), rotation=35)
plt.title('Number of people for each ELD index value')
plt.xlabel('ELD index')
plt.grid()
plt.show()

We have quite a lot of healthy people, and very few people with an elevated ELD index.

I wonder what percentage of people at each ELD index value died.

In [None]:
unique_ELD = np.sort(df['ELD_indx'].unique())
all_ELD = []
died_ELD = []
percentage = []
for indx in unique_ELD:
    all_ELD.append(df['ELD_indx'][df['ELD_indx'] == indx].count())
    died_ELD.append(df['ELD_indx'][(df['ELD_indx'] == indx) & (df['died'] == 'Died')].count())
    percentage.append((died_ELD[-1] / all_ELD[-1]) * 100)

fig, ax = plt.subplots(figsize=(12, 8))
ax = sns.barplot(x=unique_ELD, y=percentage)
ax.set_xticklabels(ax.get_xticklabels(), rotation=35)
plt.title('The percentage of deaths when there is a tendency to lung diseases.')
plt.ylabel('Percentage')
plt.xlabel('ELD index')
plt.grid()
plt.show()

This bar chart shows the percentage of deaths within each group of people sharing the same ELD index. There is a fairly clear upward trend. The last bars break it, but I think this is because very few people in the sample have an ELD index above 8.
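The small-sample explanation is easy to check by printing the group size next to each percentage. The following sketch does that with a `groupby` on a toy frame; the rows are invented, only the column names `ELD_indx` and `died` match the notebook.

```python
import pandas as pd

# Toy stand-in for the notebook's df, with the same column names.
toy = pd.DataFrame({
    'ELD_indx': [0, 0, 0, 0, 1, 1, 1, 5],
    'died': ['Non-died', 'Non-died', 'Non-died', 'Died',
             'Died', 'Non-died', 'Died', 'Died'],
})

# Group size and number of deaths for every ELD index value.
summary = (
    toy.assign(is_dead=toy['died'].eq('Died'))
       .groupby('ELD_indx')['is_dead']
       .agg(n='count', deaths='sum')
)
summary['death_pct'] = summary['deaths'] / summary['n'] * 100
print(summary)
```

A percentage computed from a group with only a handful of people, like the single row with index 5 here, should be read with caution.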

# **Analysis of mortality by gender**

In this section we will try to find out how gender influences mortality and infection with the virus.

First, though, I will consider the influence of pregnancy on mortality.

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
p = sns.countplot(x='pregnancy', hue='died', data=df[df['sex'] == 'Female'],
                  ax=ax).set_title('The mortality of pregnant and non-pregnant women')

This count plot shows the number of deaths among pregnant and non-pregnant women.

It gives rather little information, since the number of pregnant women is quite small compared to non-pregnant women.

Let's look at the percentages.

In [None]:
all_preg_female = df['sex'][(df['sex'] == 'Female') & (df['pregnancy'] == 'Yes')]
died_preg_female = df['sex'][(df['sex'] == 'Female') & (df['pregnancy'] == 'Yes') & (df['died'] == 'Died')]

all_notpreg_female = df['sex'][(df['sex'] == 'Female') & (df['pregnancy'] == 'No')]
died_notpreg_female = df['sex'][(df['sex'] == 'Female') & (df['pregnancy'] == 'No') & (df['died'] == 'Died')]

percentage = round(died_preg_female.count() / all_preg_female.count() * 100, 3)
print(f'Percentage of pregnant women who died: {percentage} %')
percentage = round(died_notpreg_female.count() / all_notpreg_female.count() * 100, 3)
print(f'Percentage of non-pregnant women who died: {percentage} %')

As we can see, in this sample pregnant women die many times less often than non-pregnant women. There can be many reasons for this:
1. Pregnant women are often very young.
2. Pregnant women may be much more diligent about hygiene.
3. Pregnant women tend to be careful about their health, have few bad habits, and try to avoid public places during COVID-19.
4. Pregnancies are often planned, and conditions that could cause complications are treated beforehand.
5. Pregnant women are likely to receive closer care in hospitals than non-pregnant women.

Which of these reasons, or which combination of them, is the true one, I will not try to judge. This will be your food for thought :)
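The first reason on the list can be probed directly by comparing pregnant and non-pregnant women of similar age. The sketch below is one possible way to do it; the rows are invented and the 15 to 45 age band is my assumption, while the column names follow the notebook.

```python
import pandas as pd

# Toy stand-in for the notebook's df; values are invented for illustration.
toy = pd.DataFrame({
    'sex': ['Female'] * 8,
    'pregnancy': ['Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No'],
    'age': [25, 31, 28, 35, 40, 67, 72, 80],
    'died': ['Non-died', 'Non-died', 'Non-died', 'Died',
             'Non-died', 'Died', 'Died', 'Non-died'],
})

# Hypothetical age band in which pregnancy is plausible.
in_band = toy['age'].between(15, 45)
women = toy[(toy['sex'] == 'Female') & in_band]

# Percentage of deaths per pregnancy status, within the age band only.
death_pct_by_pregnancy = (
    women['died'].eq('Died').groupby(women['pregnancy']).mean() * 100
)
print(death_pct_by_pregnancy)
```

If the gap between the two groups shrinks noticeably inside the age band, age explains a good part of the difference.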

In [None]:
percentage_f = df['sex'][(df['sex'] == 'Female') & (df['died'] == 'Died')].count() / df['sex'][(df['sex'] == 'Female')].count()
percentage_f = round(percentage_f * 100, 3)

percentage_m = df['sex'][(df['sex'] == 'Male') & (df['died'] == 'Died')].count() / df['sex'][(df['sex'] == 'Male')].count()
percentage_m = round(percentage_m * 100, 3)

fig, ax = plt.subplots(figsize=(10, 8))
ax = sns.barplot(x=['Male', 'Female'], y=[percentage_m, percentage_f])
plt.title('Percentage of deaths among men and women.')
plt.ylabel('Percentage')
plt.xlabel('Gender')
plt.grid()
plt.show()

The bar chart above shows the percentage of deaths among men and women. Note that men die more often than women. Perhaps this is related to the ELD index; we will check this next.

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
p = sns.countplot(x ='ELD_indx', hue ='sex', data = df, 
                  ax=ax).set_title('Number of men and women with an ELD index')
plt.grid()

Indeed, men have lung problems much more often than women.
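Raw counts depend on how many men and women are in the sample, so shares within each sex are easier to compare. A minimal sketch with `pd.crosstab` follows; the toy rows are invented, the column names match the notebook.

```python
import pandas as pd

# Toy stand-in for the notebook's df.
toy = pd.DataFrame({
    'sex': ['Male', 'Male', 'Male', 'Male', 'Female', 'Female'],
    'ELD_indx': [0, 1, 1, 2, 0, 1],
})

# normalize='index' turns every row (one sex) into shares that sum to 1.
share = pd.crosstab(toy['sex'], toy['ELD_indx'], normalize='index') * 100
print(share.round(1))
```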

Let's look at the percentage of men and women with pneumonia.

In [None]:
percentage_f = df['sex'][(df['sex'] == 'Female') & (df['pneumonia'] == 'Yes')].count() / df['sex'][(df['sex'] == 'Female')].count()
percentage_f = round(percentage_f * 100, 3)

percentage_m = df['sex'][(df['sex'] == 'Male') & (df['pneumonia'] == 'Yes')].count() / df['sex'][(df['sex'] == 'Male')].count()
percentage_m = round(percentage_m * 100, 3)

fig, ax = plt.subplots(figsize=(10, 8))
ax = sns.barplot(x=['Male', 'Female'], y=[percentage_m, percentage_f])
plt.title('Percentage of men and women with pneumonia.')
plt.ylabel('Percentage')
plt.xlabel('Gender')
plt.grid()
plt.show()

# **Time series analysis**

Let's look at the dynamics of people who discovered the first symptoms, received treatment, and the dynamics of deaths.

In [None]:
def plot_day_counts(data, columns_names, color=None, show_friday=True, figsize=(12, 5)):
    fig, axarr = plt.subplots(figsize=figsize)
    for col in columns_names:
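        # Drop NaN explicitly instead of slicing off the last sorted value,
        # which would discard the latest real day when the column has no NaN.
        unique_days = np.sort(data[col].dropna().unique())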
        counts = []
        all_days = np.linspace(0.0, unique_days.max(), int(unique_days.max()) + 1)
        for day in all_days:
            counts.append(data[col][(data[col] == day) & (data['ELD_indx'] == 0)].count())
        label = f'{col} counts'
        if color:
            plt.plot(all_days, counts, label=label, color=color)
        else:
            plt.plot(all_days, counts, label=label)
    if show_friday:
        plt.vlines([i for i in range(2, int(unique_days.max()) + 1, 7)], 0, 
                   max(counts), linestyles='--', color='green', alpha=0.5, label='Fridays')
    plt.grid()
    plt.legend()
    plt.xlabel('Number of the day from the beginning of 2020.')
    plt.ylabel('Number of people.')

plot_day_counts(df, ['entry_day'])
plt.title('The number of patients admitted for treatment over time.')

In the first 3 months (until April 2020), the number of admissions stayed almost flat and the outbreak was still small. After that, you can see mass infection and a sharp change in the trend. The chart also shows clear weekly seasonality: far fewer people were admitted on weekends than on weekdays.

The green lines mark every Friday.
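The Friday markers start at day 2 and repeat every 7 days because day 0 is 1 January 2020. A quick check with the standard library confirms that these day numbers fall on Fridays; the variable names are mine.

```python
from datetime import date, timedelta

first_day = date(2020, 1, 1)  # day 0 in the notebook's numbering

# Day numbers used for the vertical lines: 2, 9, 16, 23, ...
friday_days = list(range(2, 30, 7))

# date.weekday() returns 0 for Monday and 4 for Friday.
weekday_numbers = [(first_day + timedelta(days=d)).weekday() for d in friday_days]
print(weekday_numbers)  # [4, 4, 4, 4]
```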

In [None]:
plot_day_counts(df, ['day_symptoms'], color='brown')
plt.title('The number of people who show symptoms over time.')

This graph shows the opposite picture to the previous one: people most often notice symptoms in the middle of the week. One possible reason is that on weekends people do not keep to self-isolation and go out into the street and public places, and then discover symptoms in the middle of the week. It may also be related to working on weekdays.
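The weekday pattern can also be counted instead of read off the plot. The sketch below converts day numbers back to dates and tallies them by weekday; the `days` values are invented and stand in for `df['day_symptoms']`.

```python
import pandas as pd

# Toy day numbers (days since 2020-01-01), standing in for df['day_symptoms'].
days = pd.Series([0, 1, 1, 2, 7, 8, 8, 8, None])

# Turn day numbers into calendar dates, ignoring missing values.
dates = pd.Timestamp('2020-01-01') + pd.to_timedelta(days.dropna(), unit='D')

# Number of records per weekday name.
weekday_counts = dates.dt.day_name().value_counts()
print(weekday_counts)
```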

In [None]:
plot_day_counts(df, ['day_died'], color='red', show_friday=False)
plt.title('The number of deaths over time.')

The death curve has a rising trend that is already starting to level off. There is no pronounced periodicity: people die on weekends as often as on weekdays, which is not surprising.
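A trend like this is easier to judge on a smoothed curve. One common option, not used in the original notebook, is a centered 7-day rolling mean, which averages out any day-of-week effect. The daily counts below are invented.

```python
import pandas as pd

# Toy daily death counts with a weekly wobble; invented numbers.
daily = pd.Series([3, 5, 4, 6, 5, 2, 1, 4, 6, 5, 7, 6, 3, 2])

# Each smoothed value is the mean of the day itself and the 3 days on either side.
smoothed = daily.rolling(window=7, center=True).mean()
print(smoothed.dropna().round(2).tolist())
```

The first and last three days have no full window, so they come out as NaN and are dropped before printing.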

# **END**

Thank you for your attention. This was one of my first analyses, so please give constructive criticism in the comments, suggest improvements, and ask questions; I will answer them in the next updates to this notebook.

Thank you! Good luck!