# **Preparing data for analysis.**

This notebook describes the data preparation stage that precedes further analysis. (Later I will post a Kernel with a full visual analysis.)

In [None]:
import pandas as pd 
import numpy as np
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('dark')
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
print('Setup complete')

In [None]:
df = pd.read_csv('..//input//covid19-patient-precondition-dataset//covid.csv', index_col='id')
df.head()

Let's look at the contents of the columns.

In [None]:
df.info()

We see 3 columns with the object type; they contain dates in the format DD-MM-YYYY. Most of the other columns are indicators, and one of them contains the age.

Let's look at the correlation matrix.

In [None]:
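# Restrict to numeric columns: recent pandas versions raise on the object-typed date columns.
df.select_dtypes('number').corr()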

In [None]:
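# Restrict to numeric columns: recent pandas versions raise on the object-typed date columns.
sns.heatmap(df.select_dtypes('number').corr())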
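
A heatmap is read by eye, so it can help to also list the column pairs ranked by absolute correlation. The following is only a sketch on a made-up frame: `toy` and its columns `a`, `b`, `c` and `label` are hypothetical and are not part of the dataset.

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy data; 'label' is non-numeric and is excluded from the matrix.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 11],
    'c': [5, 1, 4, 2, 3],
    'label': list('vwxyz'),
})

corr = toy.select_dtypes('number').corr()

# Visit every unordered pair of columns once and sort by absolute correlation.
pairs = sorted(
    ((left, right, corr.loc[left, right]) for left, right in combinations(corr.columns, 2)),
    key=lambda item: abs(item[2]),
    reverse=True,
)

for left, right, value in pairs:
    print(f'{left} ~ {right}: {value:.2f}')
```

The same loop could be run on the real correlation matrix to read off the strongest pairs directly.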

We see a fairly strong correlation between diabetes, smoking and lung disease. The link with smoking is fairly obvious, but the one with diabetes confused me a little, so I looked up additional information from different sources.

It turns out that lung tissue is regarded as another target organ for diabetes because of specific changes known as the "diabetic lung". These anomalies manifest as an accelerated decline of respiratory function, a decrease in respiratory lung volume, decreased lung diffusion capacity, and deterioration of the control of ventilation, bronchial tone and neuroadrenergic innervation of the bronchial tubes. Patients with type 2 diabetes and insufficient glycemic control had a lower forced expiratory volume in 1 second (FEV1) and a lower forced vital capacity (FVC).

To deal with the correlation between these diseases, I decided to introduce the ELD index (Exposure to Lung Diseases index). Since we have exactly 10 correlating columns, the index takes values from 0 to 10, where 10 is the worst case and 0 is a person with a minimal tendency to lung diseases.

In [None]:
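# Start from zero and add one term per condition column.
ELD = np.zeros_like(df['diabetes'].values, dtype='int32')

for col in df.columns[9:19]:
    # Keep only the smallest code in the column; every other code is replaced with 0.
    uniques = np.sort(df[col].unique())
    ELD += df[col].replace(uniques[1:].tolist(), 0).values
df['ELD_indx'] = ELD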
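
To make the idea behind the index concrete, here is a minimal sketch on a made-up frame. It assumes that the code 1 means "condition present" and that every other code (2, 98 and so on) means "absent or unknown". That coding, the `toy` frame and the `copd` and `asthma` column names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical coding: 1 = present, 2 = absent, 98 = unknown.
toy = pd.DataFrame({
    'diabetes': [1, 2, 98, 1],
    'copd':     [1, 2, 2, 2],
    'asthma':   [1, 1, 2, 98],
})

# Count, per row, how many condition columns are flagged as present.
toy['ELD_indx'] = toy[['diabetes', 'copd', 'asthma']].eq(1).sum(axis=1)

print(toy['ELD_indx'].tolist())  # [3, 1, 0, 1]
```

Under this assumed coding the index is simply a per-row count of flagged conditions, so a row with no flagged condition scores 0.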

Let's look at the new correlation matrix.

In [None]:
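# Drop the 10 condition columns, then keep numeric columns only before computing correlations.
sns.heatmap(df.drop(df.columns[9:19], axis=1).select_dtypes('number').corr())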

In this way, we got rid of most correlations, leaving only natural ones, such as female gender and pregnancy, icu and patient_type, etc.

In [None]:
df = df.drop(df.columns[9:19], axis=1)

Let's start working with time. In the next Kernel, I plan to analyze the time series, so we will convert each date to the format "day from the beginning of the year".

In [None]:
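from datetime import datetime

def convert_date(day, first_day="01-01-2020", sep='-'):
    """Return the number of days between first_day and day (both DD-MM-YYYY, split by sep)."""
    # first_day is written with '-', so switch it to the separator used by day.
    d1 = first_day.replace('-', sep)
    fmt = f'%d{sep}%m{sep}%Y'
    d1 = datetime.strptime(d1, fmt)
    d2 = datetime.strptime(day, fmt)
    return (d2 - d1).days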
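
A quick check of the helper on a few hand-picked dates. The function is copied from the cell above so that the snippet runs on its own; the sample dates are made up. Note that the result is zero-based: 1 January maps to 0, not 1.

```python
from datetime import datetime

def convert_date(day, first_day="01-01-2020", sep='-'):
    d1 = first_day.replace('-', sep)
    fmt = f'%d{sep}%m{sep}%Y'
    d1 = datetime.strptime(d1, fmt)
    d2 = datetime.strptime(day, fmt)
    delta = d2 - d1
    return delta.days

print(convert_date("01-01-2020"))           # 0
print(convert_date("15-03-2020"))           # 74 (2020 is a leap year)
print(convert_date("15/03/2020", sep='/'))  # 74
```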

In [None]:
# Replace the '9999-99-99' placeholder with 0, then convert real dates to day numbers.
date_to_day = {'date_died': 'day_died', 'entry_date': 'entry_day', 'date_symptoms': 'day_symptoms'}
for date_col, day_col in date_to_day.items():
    df[date_col] = df[date_col].replace('9999-99-99', 0)
    df[day_col] = df[date_col].apply(lambda date: np.nan if date == 0 else convert_date(date))
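
As a possible alternative to the row-by-row conversion, here is a vectorized sketch with `pd.to_datetime`. It is not the approach used in this notebook. It assumes the dates are DD-MM-YYYY strings and that '9999-99-99' marks a missing date; the `toy` frame and its values are made up.

```python
import pandas as pd

# Hypothetical sample; '9999-99-99' stands for "no date".
toy = pd.DataFrame({'date_died': ['9999-99-99', '04-05-2020', '01-01-2020']})

# errors='coerce' turns anything that does not match the format into NaT.
parsed = pd.to_datetime(toy['date_died'], format='%d-%m-%Y', errors='coerce')

# Days elapsed since 1 January 2020; missing dates become NaN.
toy['day_died'] = (parsed - pd.Timestamp('2020-01-01')).dt.days

print(toy['day_died'].tolist())  # [nan, 124.0, 0.0]
```

Unlike the cell above, this variant leaves the original date column untouched, which matters if later steps rely on the placeholder having been replaced with 0.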

I will also add a death indicator.

In [None]:
df['died'] = df['date_died'].apply(lambda x: 0 if x == 0 else 1)

In [None]:
df.head()

This is the end of my little introductory tour of data preparation. Write your suggestions and comments; I am always ready to listen to **constructive** criticism. **Thank you!**