### COVID-19 Dataset

Data for this project is sourced from https://www.kaggle.com/datasets/meirnizri/covid19-dataset. The description below is copied from this link.

> The dataset was provided by the Mexican government ([link](https://datos.gob.mx/busca/dataset/informacion-referente-a-casos-covid-19-en-mexico)). This dataset contains an enormous number of anonymized patient-related information including pre-conditions. The raw dataset consists of 21 unique features and 1,048,576 unique patients. In the Boolean features, 1 means "yes" and 2 means "no". values as 97 and 99 are missing data.
> 
> - sex: 1 for female and 2 for male.
> - age: of the patient.
> - classification: covid test findings. Values 1-3 mean that the patient was diagnosed with covid in different degrees. 4 or higher means that the patient is not a carrier of covid or that the test is inconclusive.
> - patient type: type of care the patient received in the unit. 1 for returned home and 2 for hospitalization.
> - pneumonia: whether the patient already have air sacs inflammation or not.
> - pregnancy: whether the patient is pregnant or not.
> - diabetes: whether the patient has diabetes or not.
> - copd: Indicates whether the patient has Chronic obstructive pulmonary disease or not.
> - asthma: whether the patient has asthma or not.
> - immsupr: whether the patient is immunosuppressed or not.
> - hypertension: whether the patient has hypertension or not.
> - cardiovascular: whether the patient has heart or blood vessels related disease.
> - renal chronic: whether the patient has chronic renal disease or not.
> - other disease: whether the patient has other disease or not.
> - obesity: whether the patient is obese or not.
> - tobacco: whether the patient is a tobacco user.
> - usmr: Indicates whether the patient treated medical units of the first, second or third level.
> - medical unit: type of institution of the National Health System that provided the care.
> - intubed: whether the patient was connected to the ventilator.
> - icu: Indicates whether the patient had been admitted to an Intensive Care Unit.
> - date died: If the patient died indicate the date of death, and 9999-99-99 otherwise.
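
The coding scheme described above (1 = yes, 2 = no, large sentinel values = missing) can be decoded into nullable booleans. The following is a minimal sketch on a made-up toy frame, not this notebook's own preprocessing. The names `toy`, `YES_NO`, and `decoded` are hypothetical, and treating every code other than 1 or 2 as missing is an assumption (the outputs further down show 98 in use alongside 97 and 99).

```python
import pandas as pd

# Toy data following the documented coding; the values are invented for illustration.
toy = pd.DataFrame({
    "DIABETES": [1, 2, 98, 2],
    "ICU": [97, 2, 1, 99],
})

# 1 -> True, 2 -> False; any code not in the mapping becomes missing (<NA>).
YES_NO = {1: True, 2: False}
decoded = toy.apply(lambda s: s.map(YES_NO).astype("boolean"))

print(decoded)
```

A mapping like this only makes sense for the yes/no columns; `SEX`, `AGE`, `MEDICAL_UNIT`, and the classification column use different codings.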

In [1]:
import numpy as np
import pandas as pd
import pickle

In [2]:
df = pd.read_csv('Covid_Data.csv')

In [3]:
df.head(10)

Unnamed: 0,USMER,MEDICAL_UNIT,SEX,PATIENT_TYPE,DATE_DIED,INTUBED,PNEUMONIA,AGE,PREGNANT,DIABETES,...,ASTHMA,INMSUPR,HIPERTENSION,OTHER_DISEASE,CARDIOVASCULAR,OBESITY,RENAL_CHRONIC,TOBACCO,CLASIFFICATION_FINAL,ICU
0,2,1,1,1,03/05/2020,97,1,65,2,2,...,2,2,1,2,2,2,2,2,3,97
1,2,1,2,1,03/06/2020,97,1,72,97,2,...,2,2,1,2,2,1,1,2,5,97
2,2,1,2,2,09/06/2020,1,2,55,97,1,...,2,2,2,2,2,2,2,2,3,2
3,2,1,1,1,12/06/2020,97,2,53,2,2,...,2,2,2,2,2,2,2,2,7,97
4,2,1,2,1,21/06/2020,97,2,68,97,1,...,2,2,1,2,2,2,2,2,3,97
5,2,1,1,2,9999-99-99,2,1,40,2,2,...,2,2,2,2,2,2,2,2,3,2
6,2,1,1,1,9999-99-99,97,2,64,2,2,...,2,2,2,2,2,2,2,2,3,97
7,2,1,1,1,9999-99-99,97,1,64,2,1,...,2,1,1,2,2,2,1,2,3,97
8,2,1,1,2,9999-99-99,2,2,37,2,1,...,2,2,1,2,2,1,2,2,3,2
9,2,1,1,2,9999-99-99,2,2,25,2,2,...,2,2,2,2,2,2,2,2,3,2


In [4]:
df.shape

(1048575, 21)

In [5]:
df.columns

Index(['USMER', 'MEDICAL_UNIT', 'SEX', 'PATIENT_TYPE', 'DATE_DIED', 'INTUBED',
       'PNEUMONIA', 'AGE', 'PREGNANT', 'DIABETES', 'COPD', 'ASTHMA', 'INMSUPR',
       'HIPERTENSION', 'OTHER_DISEASE', 'CARDIOVASCULAR', 'OBESITY',
       'RENAL_CHRONIC', 'TOBACCO', 'CLASIFFICATION_FINAL', 'ICU'],
      dtype='object')

In [6]:
df.rename(columns={'INMSUPR': 'IMMSUPR', 'HIPERTENSION': 'HYPERTENSION', 'CLASIFFICATION_FINAL': 'CLASSIFICATION_FINAL'}, inplace=True)

In [7]:
df.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 21 columns):
 #   Column                Non-Null Count    Dtype 
---  ------                --------------    ----- 
 0   USMER                 1048575 non-null  int64 
 1   MEDICAL_UNIT          1048575 non-null  int64 
 2   SEX                   1048575 non-null  int64 
 3   PATIENT_TYPE          1048575 non-null  int64 
 4   DATE_DIED             1048575 non-null  object
 5   INTUBED               1048575 non-null  int64 
 6   PNEUMONIA             1048575 non-null  int64 
 7   AGE                   1048575 non-null  int64 
 8   PREGNANT              1048575 non-null  int64 
 9   DIABETES              1048575 non-null  int64 
 10  COPD                  1048575 non-null  int64 
 11  ASTHMA                1048575 non-null  int64 
 12  IMMSUPR               1048575 non-null  int64 
 13  HYPERTENSION          1048575 non-null  int64 
 14  OTHER_DISEASE         1048575 non-null  int64 
 15

In [8]:
df.nunique(axis=0)

USMER                     2
MEDICAL_UNIT             13
SEX                       2
PATIENT_TYPE              2
DATE_DIED               401
INTUBED                   4
PNEUMONIA                 3
AGE                     121
PREGNANT                  4
DIABETES                  3
COPD                      3
ASTHMA                    3
IMMSUPR                   3
HYPERTENSION              3
OTHER_DISEASE             3
CARDIOVASCULAR            3
OBESITY                   3
RENAL_CHRONIC             3
TOBACCO                   3
CLASSIFICATION_FINAL      7
ICU                       4
dtype: int64

In [9]:
# Show the unique values of each column with 20 or fewer distinct values
for col in df.columns:
    if df[col].nunique() <= 20:
        print(col, ":\n  ", df[col].unique(), "\n")

USMER :
   [2 1] 

MEDICAL_UNIT :
   [ 1  2  3  4  5  6  7  8  9 10 11 12 13] 

SEX :
   [1 2] 

PATIENT_TYPE :
   [1 2] 

INTUBED :
   [97  1  2 99] 

PNEUMONIA :
   [ 1  2 99] 

PREGNANT :
   [ 2 97 98  1] 

DIABETES :
   [ 2  1 98] 

COPD :
   [ 2  1 98] 

ASTHMA :
   [ 2  1 98] 

IMMSUPR :
   [ 2  1 98] 

HYPERTENSION :
   [ 1  2 98] 

OTHER_DISEASE :
   [ 2  1 98] 

CARDIOVASCULAR :
   [ 2  1 98] 

OBESITY :
   [ 2  1 98] 

RENAL_CHRONIC :
   [ 2  1 98] 

TOBACCO :
   [ 2  1 98] 

CLASSIFICATION_FINAL :
   [3 5 7 6 1 2 4] 

ICU :
   [97  2  1 99] 
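
The unique values above show three sentinel codes (97, 98, 99) scattered across the flag columns. One way to quantify them per column is sketched below on a toy frame; `MISSING_CODES`, `toy`, and `summary` are hypothetical names, and the data is invented. Note that `AGE` would have to be excluded from such a check on the real data, since 97 to 99 are plausible ages.

```python
import pandas as pd

MISSING_CODES = [97, 98, 99]  # sentinel codes seen in the unique-value listing

# Toy data for illustration only.
toy = pd.DataFrame({
    "INTUBED": [97, 1, 2, 99],
    "PREGNANT": [2, 97, 98, 1],
    "SEX": [1, 2, 1, 2],
})

# Boolean mask of sentinel cells, then count and share per column.
is_missing = toy.isin(MISSING_CODES)
summary = pd.DataFrame({
    "n_missing": is_missing.sum(),
    "share": is_missing.mean(),
})

print(summary)
```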



In [10]:
# Show value counts for each column with 20 or fewer distinct values
for col in df.columns:
    if df[col].nunique() <= 20:
        print(col, ":\n  ", df[col].value_counts(sort=False), "\n")

USMER :
   USMER
2    662903
1    385672
Name: count, dtype: int64 

MEDICAL_UNIT :
   MEDICAL_UNIT
1        151
2        169
3      19175
4     314405
5       7244
6      40584
7        891
8      10399
9      38116
10      7873
11      5577
12    602995
13       996
Name: count, dtype: int64 

SEX :
   SEX
1    525064
2    523511
Name: count, dtype: int64 

PATIENT_TYPE :
   PATIENT_TYPE
1    848544
2    200031
Name: count, dtype: int64 

INTUBED :
   INTUBED
97    848544
1      33656
2     159050
99      7325
Name: count, dtype: int64 

PNEUMONIA :
   PNEUMONIA
1     140038
2     892534
99     16003
Name: count, dtype: int64 

PREGNANT :
   PREGNANT
2     513179
97    523511
98      3754
1       8131
Name: count, dtype: int64 

DIABETES :
   DIABETES
2     920248
1     124989
98      3338
Name: count, dtype: int64 

COPD :
   COPD
2     1030510
1       15062
98       3003
Name: count, dtype: int64 

ASTHMA :
   ASTHMA
2     1014024
1       31572
98       2979
Name: count, dtype: int

In [11]:
df['DATE_DIED'].value_counts()

DATE_DIED
9999-99-99    971633
06/07/2020      1000
07/07/2020       996
13/07/2020       990
16/06/2020       979
               ...  
24/11/2020         1
17/12/2020         1
08/12/2020         1
16/03/2021         1
22/04/2021         1
Name: count, Length: 401, dtype: int64

In [12]:
df['DIED'] = df['DATE_DIED'].ne("9999-99-99").astype(int)
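
If the date itself is needed later, `DATE_DIED` can also be parsed into real timestamps. This is a sketch under the assumption that all real dates are day-first (`DD/MM/YYYY`), which the values listed above such as `13/07/2020` suggest; `dates`, `parsed`, and `died` are hypothetical names on toy data.

```python
import pandas as pd

# Toy values in the same format as DATE_DIED.
dates = pd.Series(["03/05/2020", "21/06/2020", "9999-99-99"], name="DATE_DIED")

# The placeholder 9999-99-99 does not match the format, so errors="coerce"
# turns it into NaT instead of raising.
parsed = pd.to_datetime(dates, format="%d/%m/%Y", errors="coerce")

# Equivalent death flag derived from the parsed dates.
died = parsed.notna().astype(int)

print(parsed)
print(died)
```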

In [13]:
df.head()

Unnamed: 0,USMER,MEDICAL_UNIT,SEX,PATIENT_TYPE,DATE_DIED,INTUBED,PNEUMONIA,AGE,PREGNANT,DIABETES,...,IMMSUPR,HYPERTENSION,OTHER_DISEASE,CARDIOVASCULAR,OBESITY,RENAL_CHRONIC,TOBACCO,CLASSIFICATION_FINAL,ICU,DIED
0,2,1,1,1,03/05/2020,97,1,65,2,2,...,2,1,2,2,2,2,2,3,97,1
1,2,1,2,1,03/06/2020,97,1,72,97,2,...,2,1,2,2,1,1,2,5,97,1
2,2,1,2,2,09/06/2020,1,2,55,97,1,...,2,2,2,2,2,2,2,3,2,1
3,2,1,1,1,12/06/2020,97,2,53,2,2,...,2,2,2,2,2,2,2,7,97,1
4,2,1,2,1,21/06/2020,97,2,68,97,1,...,2,1,2,2,2,2,2,3,97,1


In [14]:
# df2 is the subset of df containing only those patients who returned a positive COVID test

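# .copy() makes df2 an independent DataFrame, so adding columns to it later
# does not trigger pandas' SettingWithCopyWarning
df2 = df[df['CLASSIFICATION_FINAL'] <= 3].copy()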

In [15]:
df2.head()

Unnamed: 0,USMER,MEDICAL_UNIT,SEX,PATIENT_TYPE,DATE_DIED,INTUBED,PNEUMONIA,AGE,PREGNANT,DIABETES,...,IMMSUPR,HYPERTENSION,OTHER_DISEASE,CARDIOVASCULAR,OBESITY,RENAL_CHRONIC,TOBACCO,CLASSIFICATION_FINAL,ICU,DIED
0,2,1,1,1,03/05/2020,97,1,65,2,2,...,2,1,2,2,2,2,2,3,97,1
2,2,1,2,2,09/06/2020,1,2,55,97,1,...,2,2,2,2,2,2,2,3,2,1
4,2,1,2,1,21/06/2020,97,2,68,97,1,...,2,1,2,2,2,2,2,3,97,1
5,2,1,1,2,9999-99-99,2,1,40,2,2,...,2,2,2,2,2,2,2,3,2,0
6,2,1,1,1,9999-99-99,97,2,64,2,2,...,2,2,2,2,2,2,2,3,97,0


In [16]:
df2.shape

(391979, 22)

In [17]:
# Show value counts for each column with 20 or fewer distinct values
for col in df2.columns:
    if df2[col].nunique() <= 20:
        print(col, ":\n  ", df2[col].value_counts(sort=False), "\n")

USMER :
   USMER
2    244853
1    147126
Name: count, dtype: int64 

MEDICAL_UNIT :
   MEDICAL_UNIT
1         40
2         12
3       8631
4     126649
5       2853
6      17688
7        414
8       4994
9      14585
10      3807
11      3735
12    208226
13       345
Name: count, dtype: int64 

SEX :
   SEX
1    182490
2    209489
Name: count, dtype: int64 

PATIENT_TYPE :
   PATIENT_TYPE
1    280687
2    111292
Name: count, dtype: int64 

INTUBED :
   INTUBED
97    280687
1      23670
2      86109
99      1513
Name: count, dtype: int64 

PNEUMONIA :
   PNEUMONIA
1      86041
2     305934
99         4
Name: count, dtype: int64 

PREGNANT :
   PREGNANT
2     178353
97    209489
1       2754
98      1383
Name: count, dtype: int64 

DIABETES :
   DIABETES
2     328425
1      62114
98      1440
Name: count, dtype: int64 

COPD :
   COPD
2     384535
1       6131
98      1313
Name: count, dtype: int64 

ASTHMA :
   ASTHMA
2     380258
1      10412
98      1309
Name: count, dtype: int64 

I

In [18]:
df2['ICU_'] = df2['ICU'].eq(1).astype(int)
df2['INTUBED_'] = df2['INTUBED'].eq(1).astype(int)


In [19]:
df2['SEVERE'] = df2['DIED'] + df2['ICU_'] + df2['INTUBED_']


In [20]:
df2['SEVERE'] = df2['SEVERE'].ne(0).astype(int)


In [21]:
df2.head(10)

Unnamed: 0,USMER,MEDICAL_UNIT,SEX,PATIENT_TYPE,DATE_DIED,INTUBED,PNEUMONIA,AGE,PREGNANT,DIABETES,...,CARDIOVASCULAR,OBESITY,RENAL_CHRONIC,TOBACCO,CLASSIFICATION_FINAL,ICU,DIED,ICU_,INTUBED_,SEVERE
0,2,1,1,1,03/05/2020,97,1,65,2,2,...,2,2,2,2,3,97,1,0,0,1
2,2,1,2,2,09/06/2020,1,2,55,97,1,...,2,2,2,2,3,2,1,0,1,1
4,2,1,2,1,21/06/2020,97,2,68,97,1,...,2,2,2,2,3,97,1,0,0,1
5,2,1,1,2,9999-99-99,2,1,40,2,2,...,2,2,2,2,3,2,0,0,0,0
6,2,1,1,1,9999-99-99,97,2,64,2,2,...,2,2,2,2,3,97,0,0,0,0
7,2,1,1,1,9999-99-99,97,1,64,2,1,...,2,2,1,2,3,97,0,0,0,0
8,2,1,1,2,9999-99-99,2,2,37,2,1,...,2,1,2,2,3,2,0,0,0,0
9,2,1,1,2,9999-99-99,2,2,25,2,2,...,2,2,2,2,3,2,0,0,0,0
10,2,1,1,1,9999-99-99,97,2,38,2,2,...,2,2,2,2,3,97,0,0,0,0
11,2,1,2,2,9999-99-99,2,2,24,97,2,...,2,2,2,2,3,2,0,0,0,0


In [22]:
df2.columns

Index(['USMER', 'MEDICAL_UNIT', 'SEX', 'PATIENT_TYPE', 'DATE_DIED', 'INTUBED',
       'PNEUMONIA', 'AGE', 'PREGNANT', 'DIABETES', 'COPD', 'ASTHMA', 'IMMSUPR',
       'HYPERTENSION', 'OTHER_DISEASE', 'CARDIOVASCULAR', 'OBESITY',
       'RENAL_CHRONIC', 'TOBACCO', 'CLASSIFICATION_FINAL', 'ICU', 'DIED',
       'ICU_', 'INTUBED_', 'SEVERE'],
      dtype='object')
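
The derived columns listed above (`ICU_`, `INTUBED_`, `SEVERE`) flag a positive patient as severe if they died, were admitted to an ICU, or were intubated. The same logic can be written in one expression on an explicit copy of the filtered frame, which avoids assigning into a slice of `df`. This is an illustrative sketch on invented toy data, with `toy` and `positive` as hypothetical names, not a replacement for the cells above.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "CLASSIFICATION_FINAL": [3, 5, 1, 2],
    "DATE_DIED": ["03/05/2020", "9999-99-99", "9999-99-99", "9999-99-99"],
    "ICU": [97, 1, 1, 2],
    "INTUBED": [97, 1, 2, 2],
})

# Keep positive cases (classification 1-3) as an independent DataFrame.
positive = toy[toy["CLASSIFICATION_FINAL"] <= 3].copy()

# Severe = died OR admitted to ICU OR intubated.
positive["SEVERE"] = (
    positive["DATE_DIED"].ne("9999-99-99")
    | positive["ICU"].eq(1)
    | positive["INTUBED"].eq(1)
).astype(int)

print(positive)
```

Combining the three conditions with `|` gives the 0/1 flag directly, without the intermediate sum that is then compared against zero.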

In [23]:
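for col in ['PNEUMONIA', 'PREGNANT', 'DIABETES', 'COPD', 'ASTHMA',
            'IMMSUPR', 'HYPERTENSION', 'OTHER_DISEASE', 'CARDIOVASCULAR',
            'OBESITY', 'RENAL_CHRONIC', 'TOBACCO']:
    # Collapse the missing-value codes (97, 98, 99) into a single code, 3.
    # Assign the result back instead of calling replace(..., inplace=True) on a
    # column selection, which may operate on a temporary copy.
    df2[col] = df2[col].replace([97, 98, 99], 3)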


In [24]:
model_df = df2[['SEX', 'PNEUMONIA', 'AGE', 'PREGNANT', 'DIABETES', 'COPD',
       'ASTHMA', 'IMMSUPR', 'HYPERTENSION', 'OTHER_DISEASE',
       'CARDIOVASCULAR', 'OBESITY', 'RENAL_CHRONIC', 'TOBACCO', 'SEVERE']]
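
The recoding of 97, 98, and 99 into a single code can also be done for all flag columns in one assignment rather than a loop. The sketch below uses invented toy data; `toy` and `flag_cols` are hypothetical names. `AGE` is deliberately left out of `flag_cols`, since 97 to 99 are valid ages.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "PREGNANT": [2, 97, 98, 1],
    "DIABETES": [2, 1, 98, 2],
    "AGE": [65, 97, 40, 99],
})

flag_cols = ["PREGNANT", "DIABETES"]

# Replace the sentinel codes in the selected columns and assign the result back.
toy[flag_cols] = toy[flag_cols].replace([97, 98, 99], 3)

print(toy)
```

One caveat visible in the value counts for `df2`: the number of `PREGNANT` values equal to 97 (209,489) matches the number of rows with `SEX` equal to 2, so for that column code 3 merges "not applicable" with "unknown".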

In [25]:
model_df

Unnamed: 0,SEX,PNEUMONIA,AGE,PREGNANT,DIABETES,COPD,ASTHMA,IMMSUPR,HYPERTENSION,OTHER_DISEASE,CARDIOVASCULAR,OBESITY,RENAL_CHRONIC,TOBACCO,SEVERE
0,1,1,65,2,2,2,2,2,1,2,2,2,2,2,1
2,2,2,55,3,1,2,2,2,2,2,2,2,2,2,1
4,2,2,68,3,1,2,2,2,1,2,2,2,2,2,1
5,1,1,40,2,2,2,2,2,2,2,2,2,2,2,0
6,1,2,64,2,2,2,2,2,2,2,2,2,2,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1047933,1,2,77,2,1,2,2,1,1,2,2,2,2,2,0
1047934,1,2,55,2,1,2,2,2,2,2,2,2,2,2,0
1047935,2,2,70,3,2,2,2,2,1,2,2,2,2,2,0
1047936,2,2,32,3,2,2,2,2,2,2,2,2,2,2,0


In [26]:
# Show value counts for each column with 20 or fewer distinct values
# For boolean values other than 'SEVERE', 1=yes, 2=no and 3=missing
for col in model_df.columns:
    if model_df[col].nunique() <= 20:
        print(col, ":\n  ", model_df[col].value_counts(sort=False), "\n")

SEX :
   SEX
1    182490
2    209489
Name: count, dtype: int64 

PNEUMONIA :
   PNEUMONIA
1     86041
2    305934
3         4
Name: count, dtype: int64 

PREGNANT :
   PREGNANT
2    178353
3    210872
1      2754
Name: count, dtype: int64 

DIABETES :
   DIABETES
2    328425
1     62114
3      1440
Name: count, dtype: int64 

COPD :
   COPD
2    384535
1      6131
3      1313
Name: count, dtype: int64 

ASTHMA :
   ASTHMA
2    380258
1     10412
3      1309
Name: count, dtype: int64 

IMMSUPR :
   IMMSUPR
2    385757
1      4773
3      1449
Name: count, dtype: int64 

HYPERTENSION :
   HYPERTENSION
1     76727
2    313864
3      1388
Name: count, dtype: int64 

OTHER_DISEASE :
   OTHER_DISEASE
2    379825
1     10018
3      2136
Name: count, dtype: int64 

CARDIOVASCULAR :
   CARDIOVASCULAR
2    382082
1      8506
3      1391
Name: count, dtype: int64 

OBESITY :
   OBESITY
2    317852
1     72774
3      1353
Name: count, dtype: int64 

RENAL_CHRONIC :
   RENAL_CHRONIC
2    382677
1   

In [27]:
model_df.to_csv('model_data.csv', index=False)
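
Writing with `index=False` drops the row labels, so the file reloads with a fresh 0-based index and no extra index column. A small round-trip sketch, using an in-memory buffer instead of a file and invented toy data (`model_toy`, `buf`, and `reloaded` are hypothetical names):

```python
import io

import pandas as pd

# Toy frame with non-contiguous row labels, as left behind by filtering.
model_toy = pd.DataFrame(
    {"SEX": [1, 2], "AGE": [65, 55], "SEVERE": [1, 0]},
    index=[0, 2],
)

# Write to an in-memory buffer without the index, then read it back.
buf = io.StringIO()
model_toy.to_csv(buf, index=False)
buf.seek(0)
reloaded = pd.read_csv(buf)

print(reloaded)
```

The original row labels from `df` are not recoverable from the saved file, which is fine as long as nothing downstream needs to join back to the raw data.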