# ETL for COVID-19 Data in Mexico

**Objective**: This notebook outlines the ETL (Extract, Transform, Load) process for COVID-19 data from Mexico. The goal is to extract the raw data published daily by Mexico's Ministry of Health (*Secretaría de Salud*), transform it into a clean, structured format, and load it into files that can be used in subsequent analysis and modeling tasks.

Given the continuous changes and updates in the published data, it is essential to work with a reliable data source that preserves historical records. Therefore, we will rely on a digital archive that captures past versions of the data, ensuring consistency and reproducibility in our analysis. The data used in this ETL process can be accessed through the digital archive at the following link: [COVID-19 Data Archive](https://web.archive.org/web/20220122063317/http://datosabiertos.salud.gob.mx/gobmx/salud/datos_abiertos/historicos/08/datos_abiertos_covid19_30.08.2020.zip).

**Process**:
1. **Extraction**: Data is extracted from the archived versions of daily reports published by Mexico's Ministry of Health. This step involves accessing the archive, downloading raw data files, and ensuring data completeness and integrity.

2. **Transformation**: The raw data undergoes a cleaning and transformation process, including handling missing values, correcting data types, standardizing date formats, and calculating derived metrics such as daily new cases and deaths.

3. **Loading**: The transformed data is loaded into a structured format, such as CSV files or databases, to be used in exploratory data analysis (EDA), modeling, and visualization.

**Importance**: The ETL process ensures that the data used for analysis is reliable and in a format conducive to accurate analysis. By leveraging archived data, we can maintain consistency in our analysis despite ongoing updates and changes to the original published data, thereby enhancing the validity of our results.

In [3]:
# Import all necessary libraries
import os
import pandas as pd
import numpy as np
from datetime import datetime
from itertools import product

## Delay in COVID-19 Reporting

In this first section, we perform the ETL steps on the database published on August 1, 2020. We then repeat the same steps on the databases published on August 8 and August 15, 2020, that is, one and two weeks later. Comparing the three snapshots shows how the counts for the same dates grow as late reports arrive, and how working with data published one or two weeks later can help mitigate reporting delay.
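Since the same reading and date-parsing steps are repeated for each snapshot, they could be wrapped in a small helper. The sketch below is only an illustration: `load_snapshot` and `DATE_COLUMNS` are hypothetical names that do not appear in the notebook, and the two rows are made-up stand-ins for a real snapshot file.

```python
import io
import pandas as pd

# Date columns parsed in the notebook cells below.
DATE_COLUMNS = ["FECHA_INGRESO", "FECHA_SINTOMAS", "FECHA_DEF"]

def load_snapshot(source, encoding="latin1"):
    """Read one daily snapshot and parse its date columns (hypothetical helper).

    Values that cannot be parsed, including the '9999-99-99' placeholder
    used when no death is recorded, become NaT.
    """
    df = pd.read_csv(source, encoding=encoding, low_memory=False)
    for col in DATE_COLUMNS:
        df[col] = pd.to_datetime(df[col], errors="coerce")
    return df

# Made-up rows standing in for a real snapshot file.
sample_bytes = (
    "ID_REGISTRO,FECHA_INGRESO,FECHA_SINTOMAS,FECHA_DEF,RESULTADO\n"
    "b2,2020-07-22,2020-07-15,2020-07-30,1\n"
    "a1,2020-07-20,2020-07-16,9999-99-99,1\n"
).encode("latin1")

sample = load_snapshot(io.BytesIO(sample_bytes))
print(sample.dtypes)
```

With such a helper, each snapshot would be loaded by passing its file path instead of the in-memory buffer.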

#### August 1, 2020 data

In [4]:
# Read the data
raw_data_2020_08_01 = pd.read_csv('/Users/ro/Downloads/200801COVID19MEXICO.csv', encoding='latin1', low_memory=False)

# Convert object columns to datetime format
raw_data_2020_08_01['FECHA_INGRESO'] = pd.to_datetime(raw_data_2020_08_01['FECHA_INGRESO'], errors='coerce') 
raw_data_2020_08_01['FECHA_SINTOMAS'] = pd.to_datetime(raw_data_2020_08_01['FECHA_SINTOMAS'], errors='coerce') 

# Replace '9999-99-99' with NaT (Not a Time)
raw_data_2020_08_01['FECHA_DEF'] = raw_data_2020_08_01['FECHA_DEF'].replace('9999-99-99', pd.NaT)

# Convert FECHA_DEF to datetime after replacement
raw_data_2020_08_01['FECHA_DEF'] = pd.to_datetime(raw_data_2020_08_01['FECHA_DEF'], errors='coerce')

In [5]:
raw_data_2020_08_01.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999697 entries, 0 to 999696
Data columns (total 35 columns):
 #   Column               Non-Null Count   Dtype         
---  ------               --------------   -----         
 0   FECHA_ACTUALIZACION  999697 non-null  object        
 1   ID_REGISTRO          999697 non-null  object        
 2   ORIGEN               999697 non-null  int64         
 3   SECTOR               999697 non-null  int64         
 4   ENTIDAD_UM           999697 non-null  int64         
 5   SEXO                 999697 non-null  int64         
 6   ENTIDAD_NAC          999697 non-null  int64         
 7   ENTIDAD_RES          999697 non-null  int64         
 8   MUNICIPIO_RES        999697 non-null  int64         
 9   TIPO_PACIENTE        999697 non-null  int64         
 10  FECHA_INGRESO        999697 non-null  datetime64[ns]
 11  FECHA_SINTOMAS       999697 non-null  datetime64[ns]
 12  FECHA_DEF            61264 non-null   datetime64[ns]
 13  INTUBADO      

In [6]:
raw_data_2020_08_01.head()

Unnamed: 0,FECHA_ACTUALIZACION,ID_REGISTRO,ORIGEN,SECTOR,ENTIDAD_UM,SEXO,ENTIDAD_NAC,ENTIDAD_RES,MUNICIPIO_RES,TIPO_PACIENTE,...,CARDIOVASCULAR,OBESIDAD,RENAL_CRONICA,TABAQUISMO,OTRO_CASO,RESULTADO,MIGRANTE,PAIS_NACIONALIDAD,PAIS_ORIGEN,UCI
0,2020-08-01,0e23f9,2,3,15,1,15,15,99,1,...,2,2,2,2,2,1,99,MÃ©xico,99,97
1,2020-08-01,14c60f,2,3,15,2,15,15,106,2,...,2,2,2,2,2,1,99,MÃ©xico,99,2
2,2020-08-01,1b640f,2,4,9,2,9,9,11,2,...,2,1,2,2,99,1,99,MÃ©xico,99,2
3,2020-08-01,0c8a89,2,4,15,2,15,15,109,2,...,2,2,2,2,99,1,99,MÃ©xico,99,2
4,2020-08-01,159028,2,4,7,1,9,7,97,1,...,2,2,2,2,99,1,99,MÃ©xico,99,97


In [7]:
raw_data_2020_08_01.columns

Index(['FECHA_ACTUALIZACION', 'ID_REGISTRO', 'ORIGEN', 'SECTOR', 'ENTIDAD_UM',
       'SEXO', 'ENTIDAD_NAC', 'ENTIDAD_RES', 'MUNICIPIO_RES', 'TIPO_PACIENTE',
       'FECHA_INGRESO', 'FECHA_SINTOMAS', 'FECHA_DEF', 'INTUBADO', 'NEUMONIA',
       'EDAD', 'NACIONALIDAD', 'EMBARAZO', 'HABLA_LENGUA_INDIG', 'DIABETES',
       'EPOC', 'ASMA', 'INMUSUPR', 'HIPERTENSION', 'OTRA_COM',
       'CARDIOVASCULAR', 'OBESIDAD', 'RENAL_CRONICA', 'TABAQUISMO',
       'OTRO_CASO', 'RESULTADO', 'MIGRANTE', 'PAIS_NACIONALIDAD',
       'PAIS_ORIGEN', 'UCI'],
      dtype='object')

In [8]:

# Filter the data to include only confirmed cases (RESULTADO = 1, represents positive cases)
confirmed_cases_2020_08_01 = raw_data_2020_08_01[raw_data_2020_08_01['RESULTADO'] == 1].copy()

# Calculate the number of days from symptom onset to death
# FECHA_DEF: Date of death
# FECHA_SINTOMAS: Date of symptom onset
confirmed_cases_2020_08_01['OnsetToDeath'] = (confirmed_cases_2020_08_01['FECHA_DEF'] - confirmed_cases_2020_08_01['FECHA_SINTOMAS']).dt.days

# Calculate the number of days from symptom onset to hospital admission
# FECHA_INGRESO: Date of hospital admission
confirmed_cases_2020_08_01['OnsetToHospital'] = (confirmed_cases_2020_08_01['FECHA_INGRESO'] - confirmed_cases_2020_08_01['FECHA_SINTOMAS']).dt.days

# Filter out cases where OnsetToDeath is negative, keeping only realistic values (non-negative values)
confirmed_cases_2020_08_01 = confirmed_cases_2020_08_01[(confirmed_cases_2020_08_01['OnsetToDeath'] >= 0) | confirmed_cases_2020_08_01['OnsetToDeath'].isna()].copy()

# Filter out cases where OnsetToHospital is negative, keeping only realistic values (non-negative values)
confirmed_cases_2020_08_01 = confirmed_cases_2020_08_01[(confirmed_cases_2020_08_01['OnsetToHospital'] >= 0) | confirmed_cases_2020_08_01['OnsetToHospital'].isna()].copy()


confirmed_cases_2020_08_01.head()

Unnamed: 0,FECHA_ACTUALIZACION,ID_REGISTRO,ORIGEN,SECTOR,ENTIDAD_UM,SEXO,ENTIDAD_NAC,ENTIDAD_RES,MUNICIPIO_RES,TIPO_PACIENTE,...,RENAL_CRONICA,TABAQUISMO,OTRO_CASO,RESULTADO,MIGRANTE,PAIS_NACIONALIDAD,PAIS_ORIGEN,UCI,OnsetToDeath,OnsetToHospital
0,2020-08-01,0e23f9,2,3,15,1,15,15,99,1,...,2,2,2,1,99,MÃ©xico,99,97,,4
1,2020-08-01,14c60f,2,3,15,2,15,15,106,2,...,2,2,2,1,99,MÃ©xico,99,2,,4
2,2020-08-01,1b640f,2,4,9,2,9,9,11,2,...,2,2,99,1,99,MÃ©xico,99,2,,7
3,2020-08-01,0c8a89,2,4,15,2,15,15,109,2,...,2,2,99,1,99,MÃ©xico,99,2,10.0,8
4,2020-08-01,159028,2,4,7,1,9,7,97,1,...,2,2,99,1,99,MÃ©xico,99,97,,4
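The interval filter above keeps a row when its interval is either non-negative or missing, so patients without a recorded death are not discarded. A minimal sketch of that rule on toy data follows; `add_interval_days` and `keep_plausible` are hypothetical helper names, and the three rows are invented for illustration.

```python
import pandas as pd

def add_interval_days(df, start_col, end_col, name):
    """Return a copy with the whole-day difference end_col - start_col."""
    out = df.copy()
    out[name] = (out[end_col] - out[start_col]).dt.days
    return out

def keep_plausible(df, interval_cols):
    """Keep rows whose intervals are all non-negative or missing (NaN)."""
    mask = pd.Series(True, index=df.index)
    for col in interval_cols:
        mask &= df[col].isna() | (df[col] >= 0)
    return df[mask].copy()

# Invented rows: the second one has admission before symptom onset.
toy = pd.DataFrame({
    "FECHA_SINTOMAS": pd.to_datetime(["2020-07-10", "2020-07-12", "2020-07-15"]),
    "FECHA_INGRESO": pd.to_datetime(["2020-07-14", "2020-07-11", "2020-07-18"]),
    "FECHA_DEF": pd.to_datetime(["2020-07-20", None, None]),
})

toy = add_interval_days(toy, "FECHA_SINTOMAS", "FECHA_DEF", "OnsetToDeath")
toy = add_interval_days(toy, "FECHA_SINTOMAS", "FECHA_INGRESO", "OnsetToHospital")
clean = keep_plausible(toy, ["OnsetToDeath", "OnsetToHospital"])
print(clean[["OnsetToDeath", "OnsetToHospital"]])
```

Only the row with a negative onset-to-admission interval is removed; the row with a missing death date is kept.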


In [9]:
# Group confirmed cases by the date of symptom onset (FECHA_SINTOMAS) and the state of residence (ENTIDAD_RES)
state_cases_2020_08_01 = confirmed_cases_2020_08_01.groupby(['FECHA_SINTOMAS', 'ENTIDAD_RES']).size().reset_index(name='confirmed_cases')

# Group confirmed deaths by the date of death (FECHA_DEF) and the state of residence (ENTIDAD_RES)
state_deaths_2020_08_01 = confirmed_cases_2020_08_01.groupby(['FECHA_DEF', 'ENTIDAD_RES']).size().reset_index(name='confirmed_deaths')

# Rename the date columns in both DataFrames to a common name 'FECHA'
# This is done to facilitate merging the cases and deaths on the same date column
state_cases_2020_08_01.rename(columns={'FECHA_SINTOMAS': 'FECHA'}, inplace=True)
state_deaths_2020_08_01.rename(columns={'FECHA_DEF': 'FECHA'}, inplace=True)

# Merge the confirmed cases and confirmed deaths DataFrames on the date ('FECHA') and state ('ENTIDAD_RES')
processed_data_2020_08_01 = pd.merge(state_cases_2020_08_01, state_deaths_2020_08_01, how='outer', on=['FECHA', 'ENTIDAD_RES']).fillna(0)

# Rename the merged columns for clarity and consistency: 'FECHA' to 'date' and 'ENTIDAD_RES' to 'state'
processed_data_2020_08_01.rename(columns={'FECHA': 'date', 'ENTIDAD_RES': 'state'}, inplace=True)

processed_data_2020_08_01.head()


Unnamed: 0,date,state,confirmed_cases,confirmed_deaths
0,2020-01-13,19,1.0,0.0
1,2020-01-29,25,1.0,0.0
2,2020-02-06,2,1.0,0.0
3,2020-02-19,15,1.0,0.0
4,2020-02-21,15,1.0,0.0
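Note that in the merged table the two counts on a given row refer to different events: cases are counted by symptom-onset date, while deaths are counted by date of death. The toy example below (invented rows, not the real data) sketches the same group-and-merge logic and shows that rows with a missing `FECHA_DEF` simply do not contribute to the death counts, because `groupby` drops missing keys by default.

```python
import pandas as pd

# Invented confirmed cases: two in state 9 (one of whom died), one in state 15.
toy = pd.DataFrame({
    "ENTIDAD_RES": [9, 9, 15],
    "FECHA_SINTOMAS": pd.to_datetime(["2020-07-01", "2020-07-01", "2020-07-02"]),
    "FECHA_DEF": pd.to_datetime(["2020-07-10", None, None]),
})

# Cases per (symptom-onset date, state).
cases = (
    toy.groupby(["FECHA_SINTOMAS", "ENTIDAD_RES"]).size()
    .reset_index(name="confirmed_cases")
    .rename(columns={"FECHA_SINTOMAS": "date", "ENTIDAD_RES": "state"})
)

# Deaths per (date of death, state); rows with NaT in FECHA_DEF are skipped.
deaths = (
    toy.groupby(["FECHA_DEF", "ENTIDAD_RES"]).size()
    .reset_index(name="confirmed_deaths")
    .rename(columns={"FECHA_DEF": "date", "ENTIDAD_RES": "state"})
)

# Outer merge keeps dates that have only cases or only deaths.
counts = (
    cases.merge(deaths, how="outer", on=["date", "state"])
    .fillna(0)
    .sort_values(["date", "state"])
    .reset_index(drop=True)
)
count_cols = ["confirmed_cases", "confirmed_deaths"]
counts[count_cols] = counts[count_cols].astype(int)
print(counts)
```

Casting back to integers after `fillna(0)` is optional; the notebook keeps the counts as floats.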


In [10]:
# Build a continuous daily date range from the earliest to the latest date in the data
all_dates_2020_08_01 = pd.date_range(start=processed_data_2020_08_01['date'].min(), 
                          end=processed_data_2020_08_01['date'].max(), 
                          freq='D')

# Create a sorted list of all unique states in the data
all_states_2020_08_01 = sorted(processed_data_2020_08_01['state'].unique())

print(all_dates_2020_08_01)
print(all_states_2020_08_01)

DatetimeIndex(['2020-01-13', '2020-01-14', '2020-01-15', '2020-01-16',
               '2020-01-17', '2020-01-18', '2020-01-19', '2020-01-20',
               '2020-01-21', '2020-01-22',
               ...
               '2020-07-23', '2020-07-24', '2020-07-25', '2020-07-26',
               '2020-07-27', '2020-07-28', '2020-07-29', '2020-07-30',
               '2020-07-31', '2020-08-01'],
              dtype='datetime64[ns]', length=202, freq='D')
[np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9), np.int64(10), np.int64(11), np.int64(12), np.int64(13), np.int64(14), np.int64(15), np.int64(16), np.int64(17), np.int64(18), np.int64(19), np.int64(20), np.int64(21), np.int64(22), np.int64(23), np.int64(24), np.int64(25), np.int64(26), np.int64(27), np.int64(28), np.int64(29), np.int64(30), np.int64(31), np.int64(32)]


In [11]:
# Create a complete template DataFrame with all possible combinations of dates and states
complete_template_2020_08_01 = pd.DataFrame(list(product(all_dates_2020_08_01, all_states_2020_08_01)), columns=['date', 'state'])

# Left-merge the processed counts onto the complete template so every (date, state) pair is present
final_data_2020_08_01 = pd.merge(complete_template_2020_08_01, processed_data_2020_08_01, on=['date', 'state'], how='left')

# Fill missing values in the 'confirmed_cases' column with 0
final_data_2020_08_01['confirmed_cases'] = final_data_2020_08_01['confirmed_cases'].fillna(0)

# Fill missing values in the 'confirmed_deaths' column with 0
final_data_2020_08_01['confirmed_deaths'] = final_data_2020_08_01['confirmed_deaths'].fillna(0)

final_data_2020_08_01.tail(32)

Unnamed: 0,date,state,confirmed_cases,confirmed_deaths
6432,2020-08-01,1,0.0,0.0
6433,2020-08-01,2,0.0,0.0
6434,2020-08-01,3,0.0,0.0
6435,2020-08-01,4,0.0,0.0
6436,2020-08-01,5,0.0,0.0
6437,2020-08-01,6,0.0,0.0
6438,2020-08-01,7,0.0,1.0
6439,2020-08-01,8,0.0,0.0
6440,2020-08-01,9,0.0,0.0
6441,2020-08-01,10,0.0,0.0
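The template-and-merge step fills in every (date, state) pair that has no recorded events. An equivalent way to build the same complete grid, shown here only as an alternative sketch on invented data, is to reindex on a `pd.MultiIndex.from_product`. As in the notebook, only states that appear somewhere in the data end up in the grid.

```python
import pandas as pd

# Invented sparse counts: no row for 2020-07-02, and each state is missing on some dates.
counts = pd.DataFrame({
    "date": pd.to_datetime(["2020-07-01", "2020-07-03"]),
    "state": [9, 15],
    "confirmed_cases": [2, 1],
    "confirmed_deaths": [0, 1],
})

all_dates = pd.date_range(counts["date"].min(), counts["date"].max(), freq="D")
all_states = sorted(counts["state"].unique())

# Every (date, state) combination, in date-major order.
full_index = pd.MultiIndex.from_product([all_dates, all_states], names=["date", "state"])

# Reindexing inserts the missing combinations and fills their counts with 0.
panel = (
    counts.set_index(["date", "state"])
    .reindex(full_index, fill_value=0)
    .reset_index()
)
print(panel)
```

This avoids the intermediate template DataFrame and the separate `fillna` calls, at the cost of requiring unique (date, state) pairs in the input.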


In [45]:
# Save as a feather file in the Data folder for later use
data_folder = '/Users/ro/Desktop/Undergrad_AM_Thesis/Data'

file_path = os.path.join(data_folder, 'final_data_2020_08_01_df.feather')
final_data_2020_08_01.to_feather(file_path)

#### August 8, 2020 data

In [13]:
# Read the data
raw_data_2020_08_08 = pd.read_csv('/Users/ro/Downloads/200808COVID19MEXICO.csv', encoding='latin1', low_memory=False)

# Convert object columns to datetime format
raw_data_2020_08_08['FECHA_INGRESO'] = pd.to_datetime(raw_data_2020_08_08['FECHA_INGRESO'], errors='coerce') 
raw_data_2020_08_08['FECHA_SINTOMAS'] = pd.to_datetime(raw_data_2020_08_08['FECHA_SINTOMAS'], errors='coerce') 

# Replace '9999-99-99' with NaT (Not a Time)
raw_data_2020_08_08['FECHA_DEF'] = raw_data_2020_08_08['FECHA_DEF'].replace('9999-99-99', pd.NaT)

# Convert FECHA_DEF to datetime after replacement
raw_data_2020_08_08['FECHA_DEF'] = pd.to_datetime(raw_data_2020_08_08['FECHA_DEF'], errors='coerce')

raw_data_2020_08_08.tail()

Unnamed: 0,FECHA_ACTUALIZACION,ID_REGISTRO,ORIGEN,SECTOR,ENTIDAD_UM,SEXO,ENTIDAD_NAC,ENTIDAD_RES,MUNICIPIO_RES,TIPO_PACIENTE,...,CARDIOVASCULAR,OBESIDAD,RENAL_CRONICA,TABAQUISMO,OTRO_CASO,RESULTADO,MIGRANTE,PAIS_NACIONALIDAD,PAIS_ORIGEN,UCI
1085892,2020-08-08,16f9f8,1,12,28,2,28,28,38,1,...,2,2,2,2,2,3,99,MÃ©xico,99,97
1085893,2020-08-08,156a86,2,12,9,2,9,9,11,2,...,98,98,98,98,1,3,99,MÃ©xico,99,2
1085894,2020-08-08,02a9f9,2,4,19,1,19,19,39,1,...,2,1,2,2,99,3,99,MÃ©xico,99,97
1085895,2020-08-08,110e67,2,12,9,1,9,9,7,1,...,2,2,2,2,2,3,99,MÃ©xico,99,97
1085896,2020-08-08,02668d,2,12,15,2,15,15,109,1,...,2,2,2,2,2,3,99,MÃ©xico,99,97


In [14]:
raw_data_2020_08_08.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1085897 entries, 0 to 1085896
Data columns (total 35 columns):
 #   Column               Non-Null Count    Dtype         
---  ------               --------------    -----         
 0   FECHA_ACTUALIZACION  1085897 non-null  object        
 1   ID_REGISTRO          1085897 non-null  object        
 2   ORIGEN               1085897 non-null  int64         
 3   SECTOR               1085897 non-null  int64         
 4   ENTIDAD_UM           1085897 non-null  int64         
 5   SEXO                 1085897 non-null  int64         
 6   ENTIDAD_NAC          1085897 non-null  int64         
 7   ENTIDAD_RES          1085897 non-null  int64         
 8   MUNICIPIO_RES        1085897 non-null  int64         
 9   TIPO_PACIENTE        1085897 non-null  int64         
 10  FECHA_INGRESO        1085897 non-null  datetime64[ns]
 11  FECHA_SINTOMAS       1085897 non-null  datetime64[ns]
 12  FECHA_DEF            66818 non-null    datetime64[ns]
 1

In [15]:

# Filter the data to include only confirmed cases (RESULTADO = 1, represents positive cases)
confirmed_cases_2020_08_08 = raw_data_2020_08_08[raw_data_2020_08_08['RESULTADO'] == 1].copy()

# Calculate the number of days from symptom onset to death
# FECHA_DEF: Date of death
# FECHA_SINTOMAS: Date of symptom onset
confirmed_cases_2020_08_08['OnsetToDeath'] = (confirmed_cases_2020_08_08['FECHA_DEF'] - confirmed_cases_2020_08_08['FECHA_SINTOMAS']).dt.days

# Calculate the number of days from symptom onset to hospital admission
# FECHA_INGRESO: Date of hospital admission
confirmed_cases_2020_08_08['OnsetToHospital'] = (confirmed_cases_2020_08_08['FECHA_INGRESO'] - confirmed_cases_2020_08_08['FECHA_SINTOMAS']).dt.days

# Filter out cases where OnsetToDeath is negative, keeping only realistic values (non-negative values)
confirmed_cases_2020_08_08 = confirmed_cases_2020_08_08[(confirmed_cases_2020_08_08['OnsetToDeath'] >= 0) | confirmed_cases_2020_08_08['OnsetToDeath'].isna()].copy()

# Filter out cases where OnsetToHospital is negative, keeping only realistic values (non-negative values)
confirmed_cases_2020_08_08 = confirmed_cases_2020_08_08[(confirmed_cases_2020_08_08['OnsetToHospital'] >= 0) | confirmed_cases_2020_08_08['OnsetToHospital'].isna()].copy()

confirmed_cases_2020_08_08.tail()

Unnamed: 0,FECHA_ACTUALIZACION,ID_REGISTRO,ORIGEN,SECTOR,ENTIDAD_UM,SEXO,ENTIDAD_NAC,ENTIDAD_RES,MUNICIPIO_RES,TIPO_PACIENTE,...,RENAL_CRONICA,TABAQUISMO,OTRO_CASO,RESULTADO,MIGRANTE,PAIS_NACIONALIDAD,PAIS_ORIGEN,UCI,OnsetToDeath,OnsetToHospital
826259,2020-08-08,0ff13d,1,12,19,2,19,19,25,1,...,2,2,2,1,99,MÃ©xico,99,97,,3
826260,2020-08-08,12d6fb,2,4,14,1,14,14,39,1,...,2,1,99,1,99,MÃ©xico,99,97,,1
826261,2020-08-08,122aa3,1,12,25,1,25,25,6,1,...,2,2,1,1,99,MÃ©xico,99,97,,6
826262,2020-08-08,1021a9,2,12,15,2,15,30,189,2,...,2,1,2,1,99,MÃ©xico,99,2,,3
826263,2020-08-08,19b480,2,12,14,2,20,14,67,1,...,2,2,1,1,99,MÃ©xico,99,97,,4


In [16]:
# Group confirmed cases by the date of symptom onset (FECHA_SINTOMAS) and the state of residence (ENTIDAD_RES)
state_cases_2020_08_08 = confirmed_cases_2020_08_08.groupby(['FECHA_SINTOMAS', 'ENTIDAD_RES']).size().reset_index(name='confirmed_cases')

# Group confirmed deaths by the date of death (FECHA_DEF) and the state of residence (ENTIDAD_RES)
state_deaths_2020_08_08 = confirmed_cases_2020_08_08.groupby(['FECHA_DEF', 'ENTIDAD_RES']).size().reset_index(name='confirmed_deaths')

# Rename the date columns in both DataFrames to a common name 'FECHA'
# This is done to facilitate merging the cases and deaths on the same date column
state_cases_2020_08_08.rename(columns={'FECHA_SINTOMAS': 'FECHA'}, inplace=True)
state_deaths_2020_08_08.rename(columns={'FECHA_DEF': 'FECHA'}, inplace=True)

# Merge the confirmed cases and confirmed deaths DataFrames on the date ('FECHA') and state ('ENTIDAD_RES')
processed_data_2020_08_08 = pd.merge(state_cases_2020_08_08, state_deaths_2020_08_08, how='outer', on=['FECHA', 'ENTIDAD_RES']).fillna(0)

# Rename the merged columns for clarity and consistency: 'FECHA' to 'date' and 'ENTIDAD_RES' to 'state'
processed_data_2020_08_08.rename(columns={'FECHA': 'date', 'ENTIDAD_RES': 'state'}, inplace=True)

processed_data_2020_08_08.tail()

Unnamed: 0,date,state,confirmed_cases,confirmed_deaths
4712,2020-08-07,28,0.0,1.0
4713,2020-08-07,30,0.0,5.0
4714,2020-08-07,31,0.0,6.0
4715,2020-08-07,32,0.0,1.0
4716,2020-08-08,27,0.0,1.0


In [17]:
# Build a continuous daily date range from the earliest to the latest date in the data
all_dates_2020_08_08 = pd.date_range(start=processed_data_2020_08_08['date'].min(), 
                          end=processed_data_2020_08_08['date'].max(), 
                          freq='D')

# Create a sorted list of all unique states in the data
all_states_2020_08_08 = sorted(processed_data_2020_08_08['state'].unique())

print(all_dates_2020_08_08)
print(all_states_2020_08_08)

DatetimeIndex(['2020-01-13', '2020-01-14', '2020-01-15', '2020-01-16',
               '2020-01-17', '2020-01-18', '2020-01-19', '2020-01-20',
               '2020-01-21', '2020-01-22',
               ...
               '2020-07-30', '2020-07-31', '2020-08-01', '2020-08-02',
               '2020-08-03', '2020-08-04', '2020-08-05', '2020-08-06',
               '2020-08-07', '2020-08-08'],
              dtype='datetime64[ns]', length=209, freq='D')
[np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9), np.int64(10), np.int64(11), np.int64(12), np.int64(13), np.int64(14), np.int64(15), np.int64(16), np.int64(17), np.int64(18), np.int64(19), np.int64(20), np.int64(21), np.int64(22), np.int64(23), np.int64(24), np.int64(25), np.int64(26), np.int64(27), np.int64(28), np.int64(29), np.int64(30), np.int64(31), np.int64(32)]


In [18]:
# Create a complete template DataFrame with all possible combinations of dates and states
complete_template_2020_08_08 = pd.DataFrame(list(product(all_dates_2020_08_08, all_states_2020_08_08)), columns=['date', 'state'])

# Left-merge the processed counts onto the complete template so every (date, state) pair is present
final_data_2020_08_08 = pd.merge(complete_template_2020_08_08, processed_data_2020_08_08, on=['date', 'state'], how='left')

# Fill missing values in the 'confirmed_cases' column with 0
final_data_2020_08_08['confirmed_cases'] = final_data_2020_08_08['confirmed_cases'].fillna(0)

# Fill missing values in the 'confirmed_deaths' column with 0
final_data_2020_08_08['confirmed_deaths'] = final_data_2020_08_08['confirmed_deaths'].fillna(0)

final_data_2020_08_08[final_data_2020_08_08['date'] == datetime(2020,8,1)]

Unnamed: 0,date,state,confirmed_cases,confirmed_deaths
6432,2020-08-01,1,42.0,2.0
6433,2020-08-01,2,37.0,12.0
6434,2020-08-01,3,111.0,3.0
6435,2020-08-01,4,30.0,11.0
6436,2020-08-01,5,194.0,28.0
6437,2020-08-01,6,57.0,9.0
6438,2020-08-01,7,27.0,6.0
6439,2020-08-01,8,12.0,3.0
6440,2020-08-01,9,425.0,31.0
6441,2020-08-01,10,52.0,4.0


In [44]:
# Save as a feather file in the Data folder for later use
data_folder = '/Users/ro/Desktop/Undergrad_AM_Thesis/Data'

file_path = os.path.join(data_folder, 'final_data_2020_08_08_df.feather')
final_data_2020_08_08.to_feather(file_path)

#### August 15, 2020 data

In [20]:
# Read the data
raw_data_2020_08_15 = pd.read_csv('/Users/ro/Downloads/200815COVID19MEXICO.csv', encoding='latin1', low_memory=False)

# Convert object columns to datetime format
raw_data_2020_08_15['FECHA_INGRESO'] = pd.to_datetime(raw_data_2020_08_15['FECHA_INGRESO'], errors='coerce') 
raw_data_2020_08_15['FECHA_SINTOMAS'] = pd.to_datetime(raw_data_2020_08_15['FECHA_SINTOMAS'], errors='coerce') 

# Replace '9999-99-99' with NaT (Not a Time)
raw_data_2020_08_15['FECHA_DEF'] = raw_data_2020_08_15['FECHA_DEF'].replace('9999-99-99', pd.NaT)

# Convert FECHA_DEF to datetime after replacement
raw_data_2020_08_15['FECHA_DEF'] = pd.to_datetime(raw_data_2020_08_15['FECHA_DEF'], errors='coerce')

raw_data_2020_08_15.tail()

Unnamed: 0,FECHA_ACTUALIZACION,ID_REGISTRO,ORIGEN,SECTOR,ENTIDAD_UM,SEXO,ENTIDAD_NAC,ENTIDAD_RES,MUNICIPIO_RES,TIPO_PACIENTE,...,CARDIOVASCULAR,OBESIDAD,RENAL_CRONICA,TABAQUISMO,OTRO_CASO,RESULTADO,MIGRANTE,PAIS_NACIONALIDAD,PAIS_ORIGEN,UCI
1171002,2020-08-15,1d7eaa,1,12,22,1,22,22,14,1,...,2,2,2,2,1,3,99,MÃ©xico,99,97
1171003,2020-08-15,0cc9a3,2,4,10,1,10,10,23,1,...,2,2,2,2,99,3,99,MÃ©xico,99,97
1171004,2020-08-15,0f436c,2,9,19,2,19,19,48,2,...,2,2,2,2,2,3,99,MÃ©xico,99,2
1171005,2020-08-15,146669,1,12,23,2,20,23,5,2,...,2,2,2,2,2,3,99,MÃ©xico,99,1
1171006,2020-08-15,1a44af,2,12,8,1,8,8,32,1,...,2,2,2,2,1,3,99,MÃ©xico,99,97


In [21]:
raw_data_2020_08_15.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1171007 entries, 0 to 1171006
Data columns (total 35 columns):
 #   Column               Non-Null Count    Dtype         
---  ------               --------------    -----         
 0   FECHA_ACTUALIZACION  1171007 non-null  object        
 1   ID_REGISTRO          1171007 non-null  object        
 2   ORIGEN               1171007 non-null  int64         
 3   SECTOR               1171007 non-null  int64         
 4   ENTIDAD_UM           1171007 non-null  int64         
 5   SEXO                 1171007 non-null  int64         
 6   ENTIDAD_NAC          1171007 non-null  int64         
 7   ENTIDAD_RES          1171007 non-null  int64         
 8   MUNICIPIO_RES        1171007 non-null  int64         
 9   TIPO_PACIENTE        1171007 non-null  int64         
 10  FECHA_INGRESO        1171007 non-null  datetime64[ns]
 11  FECHA_SINTOMAS       1171007 non-null  datetime64[ns]
 12  FECHA_DEF            72324 non-null    datetime64[ns]
 1

In [22]:

# Filter the data to include only confirmed cases (RESULTADO = 1, represents positive cases)
confirmed_cases_2020_08_15 = raw_data_2020_08_15[raw_data_2020_08_15['RESULTADO'] == 1].copy()

# Calculate the number of days from symptom onset to death
# FECHA_DEF: Date of death
# FECHA_SINTOMAS: Date of symptom onset
confirmed_cases_2020_08_15['OnsetToDeath'] = (confirmed_cases_2020_08_15['FECHA_DEF'] - confirmed_cases_2020_08_15['FECHA_SINTOMAS']).dt.days

# Calculate the number of days from symptom onset to hospital admission
# FECHA_INGRESO: Date of hospital admission
confirmed_cases_2020_08_15['OnsetToHospital'] = (confirmed_cases_2020_08_15['FECHA_INGRESO'] - confirmed_cases_2020_08_15['FECHA_SINTOMAS']).dt.days


# Filter out cases where OnsetToDeath is negative, keeping only realistic values (non-negative values)
confirmed_cases_2020_08_15 = confirmed_cases_2020_08_15[(confirmed_cases_2020_08_15['OnsetToDeath'] >= 0) | confirmed_cases_2020_08_15['OnsetToDeath'].isna()].copy()


# Filter out cases where OnsetToHospital is negative, keeping only realistic values (non-negative values)
confirmed_cases_2020_08_15 = confirmed_cases_2020_08_15[(confirmed_cases_2020_08_15['OnsetToHospital'] >= 0) | confirmed_cases_2020_08_15['OnsetToHospital'].isna()].copy()

confirmed_cases_2020_08_15.tail()

Unnamed: 0,FECHA_ACTUALIZACION,ID_REGISTRO,ORIGEN,SECTOR,ENTIDAD_UM,SEXO,ENTIDAD_NAC,ENTIDAD_RES,MUNICIPIO_RES,TIPO_PACIENTE,...,RENAL_CRONICA,TABAQUISMO,OTRO_CASO,RESULTADO,MIGRANTE,PAIS_NACIONALIDAD,PAIS_ORIGEN,UCI,OnsetToDeath,OnsetToHospital
867464,2020-08-15,131380,1,4,32,1,32,32,17,1,...,2,2,99,1,99,MÃ©xico,99,97,,2
867465,2020-08-15,1794fb,1,4,18,2,18,18,12,2,...,2,2,99,1,99,MÃ©xico,99,2,,1
867466,2020-08-15,12c504,1,12,1,1,1,1,1,1,...,2,2,1,1,99,MÃ©xico,99,97,,5
867467,2020-08-15,012828,2,4,13,2,9,13,48,1,...,2,2,99,1,99,MÃ©xico,99,97,,4
867468,2020-08-15,045d97,2,12,17,2,17,17,13,1,...,2,2,1,1,99,MÃ©xico,99,97,,2


In [23]:
# Group confirmed cases by the date of symptom onset (FECHA_SINTOMAS) and the state of residence (ENTIDAD_RES)
state_cases_2020_08_15 = confirmed_cases_2020_08_15.groupby(['FECHA_SINTOMAS', 'ENTIDAD_RES']).size().reset_index(name='confirmed_cases')

# Group confirmed deaths by the date of death (FECHA_DEF) and the state of residence (ENTIDAD_RES)
state_deaths_2020_08_15 = confirmed_cases_2020_08_15.groupby(['FECHA_DEF', 'ENTIDAD_RES']).size().reset_index(name='confirmed_deaths')

# Rename the date columns in both DataFrames to a common name 'FECHA'
# This is done to facilitate merging the cases and deaths on the same date column
state_cases_2020_08_15.rename(columns={'FECHA_SINTOMAS': 'FECHA'}, inplace=True)
state_deaths_2020_08_15.rename(columns={'FECHA_DEF': 'FECHA'}, inplace=True)

# Merge the confirmed cases and confirmed deaths DataFrames on the date ('FECHA') and state ('ENTIDAD_RES')
processed_data_2020_08_15 = pd.merge(state_cases_2020_08_15, state_deaths_2020_08_15, how='outer', on=['FECHA', 'ENTIDAD_RES']).fillna(0)

# Rename the merged columns for clarity and consistency: 'FECHA' to 'date' and 'ENTIDAD_RES' to 'state'
processed_data_2020_08_15.rename(columns={'FECHA': 'date', 'ENTIDAD_RES': 'state'}, inplace=True)

processed_data_2020_08_15

Unnamed: 0,date,state,confirmed_cases,confirmed_deaths
0,2020-01-13,19,1.0,0.0
1,2020-01-29,25,1.0,0.0
2,2020-02-06,2,1.0,0.0
3,2020-02-19,15,1.0,0.0
4,2020-02-21,15,1.0,0.0
...,...,...,...,...
4935,2020-08-14,28,0.0,2.0
4936,2020-08-14,30,0.0,5.0
4937,2020-08-14,31,2.0,5.0
4938,2020-08-14,32,0.0,1.0


In [24]:
# Build a continuous daily date range from the earliest to the latest date in the data
all_dates_2020_08_15 = pd.date_range(start=processed_data_2020_08_15['date'].min(), 
                          end=processed_data_2020_08_15['date'].max(), 
                          freq='D')

# Create a sorted list of all unique states in the data
all_states_2020_08_15 = sorted(processed_data_2020_08_15['state'].unique())

print(all_dates_2020_08_15)
print(all_states_2020_08_15)

DatetimeIndex(['2020-01-13', '2020-01-14', '2020-01-15', '2020-01-16',
               '2020-01-17', '2020-01-18', '2020-01-19', '2020-01-20',
               '2020-01-21', '2020-01-22',
               ...
               '2020-08-06', '2020-08-07', '2020-08-08', '2020-08-09',
               '2020-08-10', '2020-08-11', '2020-08-12', '2020-08-13',
               '2020-08-14', '2020-08-15'],
              dtype='datetime64[ns]', length=216, freq='D')
[np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9), np.int64(10), np.int64(11), np.int64(12), np.int64(13), np.int64(14), np.int64(15), np.int64(16), np.int64(17), np.int64(18), np.int64(19), np.int64(20), np.int64(21), np.int64(22), np.int64(23), np.int64(24), np.int64(25), np.int64(26), np.int64(27), np.int64(28), np.int64(29), np.int64(30), np.int64(31), np.int64(32)]


In [25]:
# Create a complete template DataFrame with all possible combinations of dates and states
complete_template_2020_08_15 = pd.DataFrame(list(product(all_dates_2020_08_15, all_states_2020_08_15)), columns=['date', 'state'])

# Left-merge the processed counts onto the complete template so every (date, state) pair is present
final_data_2020_08_15 = pd.merge(complete_template_2020_08_15, processed_data_2020_08_15, on=['date', 'state'], how='left')

# Fill missing values in the 'confirmed_cases' column with 0
final_data_2020_08_15['confirmed_cases'] = final_data_2020_08_15['confirmed_cases'].fillna(0)

# Fill missing values in the 'confirmed_deaths' column with 0
final_data_2020_08_15['confirmed_deaths'] = final_data_2020_08_15['confirmed_deaths'].fillna(0)

final_data_2020_08_15[final_data_2020_08_15['date'] == datetime(2020,8,1)]

Unnamed: 0,date,state,confirmed_cases,confirmed_deaths
6432,2020-08-01,1,57.0,3.0
6433,2020-08-01,2,102.0,13.0
6434,2020-08-01,3,132.0,5.0
6435,2020-08-01,4,63.0,11.0
6436,2020-08-01,5,392.0,30.0
6437,2020-08-01,6,72.0,9.0
6438,2020-08-01,7,32.0,7.0
6439,2020-08-01,8,34.0,4.0
6440,2020-08-01,9,926.0,32.0
6441,2020-08-01,10,82.0,4.0
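The three outputs for 2020-08-01 make the reporting delay concrete: for the ten states displayed, the August 1 snapshot shows no confirmed cases with that onset date, while the August 8 and August 15 snapshots show progressively more. The sketch below copies the `confirmed_cases` values from the tables shown above (states 1 to 10 only) and compares them. The column names are made up for this illustration, and the August 15 snapshot is itself not necessarily final.

```python
import pandas as pd

# confirmed_cases for symptom-onset date 2020-08-01, states 1 to 10,
# copied from the three outputs displayed in this section.
comparison = pd.DataFrame({
    "state": list(range(1, 11)),
    "cases_0801": [0] * 10,
    "cases_0808": [42, 37, 111, 30, 194, 57, 27, 12, 425, 52],
    "cases_0815": [57, 102, 132, 63, 392, 72, 32, 34, 926, 82],
})

# Share of the August 15 count that was already visible one week earlier.
comparison["share_seen_by_0808"] = (
    comparison["cases_0808"] / comparison["cases_0815"]
).round(2)

total_0808 = comparison["cases_0808"].sum()
total_0815 = comparison["cases_0815"].sum()
overall_share = total_0808 / total_0815

print(comparison)
print(f"Totals: {total_0808} (Aug 8) vs {total_0815} (Aug 15), share {overall_share:.2f}")
```

For these ten states, roughly half of the cases reported by August 15 were already present in the August 8 snapshot, which illustrates why later snapshots are preferable for recent dates.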


In [43]:
# Save as a feather file in the Data folder for later use
data_folder = '/Users/ro/Desktop/Undergrad_AM_Thesis/Data'

file_path = os.path.join(data_folder, 'final_data_2020_08_15_df.feather')
final_data_2020_08_15.to_feather(file_path)

## Consolidated COVID-19 database for EDA and modeling

In this section, we perform the ETL steps for the COVID-19 data up to the end of 2023. All the databases used can be found in Mexico's *Secretaría de Salud* [Open Data Historic Databases](https://www.gob.mx/salud/documentos/datos-abiertos-bases-historicas-direccion-general-de-epidemiologia).

Because the published data keeps changing, and to avoid issues related to reporting delays, we use the following database for each year's data:
* 2020 data corresponds to the public database published on 2021-10-31
* 2021 data corresponds to the public database published on 2022-08-09
* 2022 data corresponds to the public database published on 2023-04-25
* 2023 data corresponds to the public database published on 2024-04-30

In [27]:
DB_names = ['COVID19MEXICO-7.csv','230425COVID19MEXICO.csv','COVID19MEXICO2021.csv','COVID19MEXICO2020.csv']

# Read the data
raw_data_2020 = pd.read_csv('/Users/ro/Downloads/COVID19MEXICO2020.csv', encoding='latin1', low_memory=False)
raw_data_2021 = pd.read_csv('/Users/ro/Downloads/COVID19MEXICO2021.csv', encoding='latin1', low_memory=False)
raw_data_2022 = pd.read_csv('/Users/ro/Downloads/230425COVID19MEXICO.csv', encoding='latin1', low_memory=False)
raw_data_2023 = pd.read_csv('/Users/ro/Downloads/COVID19MEXICO-7.csv', encoding='latin1', low_memory=False)

In [28]:
# Convert object columns to datetime format
raw_data_2020['FECHA_INGRESO'] = pd.to_datetime(raw_data_2020['FECHA_INGRESO'], errors='coerce')
raw_data_2020['FECHA_SINTOMAS'] = pd.to_datetime(raw_data_2020['FECHA_SINTOMAS'], errors='coerce')

raw_data_2021['FECHA_INGRESO'] = pd.to_datetime(raw_data_2021['FECHA_INGRESO'], errors='coerce')
raw_data_2021['FECHA_SINTOMAS'] = pd.to_datetime(raw_data_2021['FECHA_SINTOMAS'], errors='coerce')

raw_data_2022['FECHA_INGRESO'] = pd.to_datetime(raw_data_2022['FECHA_INGRESO'], errors='coerce')
raw_data_2022['FECHA_SINTOMAS'] = pd.to_datetime(raw_data_2022['FECHA_SINTOMAS'], errors='coerce')

raw_data_2023['FECHA_INGRESO'] = pd.to_datetime(raw_data_2023['FECHA_INGRESO'], errors='coerce')
raw_data_2023['FECHA_SINTOMAS'] = pd.to_datetime(raw_data_2023['FECHA_SINTOMAS'], errors='coerce')

# Replace '9999-99-99' with NaT (Not a Time)
raw_data_2020['FECHA_DEF'] = raw_data_2020['FECHA_DEF'].replace('9999-99-99', pd.NaT)
raw_data_2020['FECHA_DEF'] = pd.to_datetime(raw_data_2020['FECHA_DEF'], errors='coerce')

raw_data_2021['FECHA_DEF'] = raw_data_2021['FECHA_DEF'].replace('9999-99-99', pd.NaT)
raw_data_2021['FECHA_DEF'] = pd.to_datetime(raw_data_2021['FECHA_DEF'], errors='coerce')

raw_data_2022['FECHA_DEF'] = raw_data_2022['FECHA_DEF'].replace('9999-99-99', pd.NaT)
raw_data_2022['FECHA_DEF'] = pd.to_datetime(raw_data_2022['FECHA_DEF'], errors='coerce')

raw_data_2023['FECHA_DEF'] = raw_data_2023['FECHA_DEF'].replace('9999-99-99', pd.NaT)
raw_data_2023['FECHA_DEF'] = pd.to_datetime(raw_data_2023['FECHA_DEF'], errors='coerce')
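The date conversions above are identical for every year, so they could be expressed once and applied to a dictionary of DataFrames keyed by year. This is a sketch under assumptions: `parse_date_columns` and `raw_by_year` are hypothetical names, and the small frames are invented stand-ins for the yearly files.

```python
import pandas as pd

DATE_COLUMNS = ["FECHA_INGRESO", "FECHA_SINTOMAS", "FECHA_DEF"]

def parse_date_columns(df, columns=DATE_COLUMNS):
    """Return a copy with the given columns converted to datetime.

    errors='coerce' turns unparseable values, such as the '9999-99-99'
    placeholder in FECHA_DEF, into NaT.
    """
    out = df.copy()
    for col in columns:
        out[col] = pd.to_datetime(out[col], errors="coerce")
    return out

# Invented stand-ins for the yearly raw DataFrames.
raw_by_year = {
    2020: pd.DataFrame({
        "FECHA_INGRESO": ["2020-12-01", "2020-12-05"],
        "FECHA_SINTOMAS": ["2020-11-28", "2020-12-02"],
        "FECHA_DEF": ["2020-12-10", "9999-99-99"],
    }),
    2021: pd.DataFrame({
        "FECHA_INGRESO": ["2021-01-04", "2021-01-09"],
        "FECHA_SINTOMAS": ["2021-01-01", "2021-01-07"],
        "FECHA_DEF": ["2021-01-15", "9999-99-99"],
    }),
}

parsed_by_year = {year: parse_date_columns(df) for year, df in raw_by_year.items()}

for year, df in parsed_by_year.items():
    print(year, df["FECHA_DEF"].isna().sum(), "rows without a recorded death")
```

Keeping the frames in a dictionary also makes it straightforward to apply the later filtering steps in a loop rather than once per year.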

In [29]:
# Filter only confirmed Covid-19 cases
confirmed_cases_2020 = raw_data_2020[raw_data_2020['CLASIFICACION_FINAL'].isin([1, 2, 3])].copy()
confirmed_cases_2021 = raw_data_2021[raw_data_2021['CLASIFICACION_FINAL'].isin([1, 2, 3])].copy()
confirmed_cases_2022 = raw_data_2022[raw_data_2022['CLASIFICACION_FINAL'].isin([1, 2, 3])].copy()
confirmed_cases_2023 = raw_data_2023[raw_data_2023['CLASIFICACION_FINAL'].isin([1, 2, 3])].copy()

In [30]:
# Calculate, for confirmed cases in 2020-2023, the intervals in days from symptom onset to death,
# from symptom onset to hospital admission, and from hospital admission to death.
# Rows with a negative interval are dropped as inconsistent; rows where an interval is missing (NaN) are kept.
# This follows the logic of https://datos.covid-19.conacyt.mx/
confirmed_cases_2020['OnsetToDeath'] = (confirmed_cases_2020['FECHA_DEF'] - confirmed_cases_2020['FECHA_SINTOMAS']).dt.days
confirmed_cases_2020['OnsetToHospital'] = (confirmed_cases_2020['FECHA_INGRESO'] - confirmed_cases_2020['FECHA_SINTOMAS']).dt.days
confirmed_cases_2020['HospitalToDeath'] = (confirmed_cases_2020['FECHA_DEF'] - confirmed_cases_2020['FECHA_INGRESO']).dt.days

confirmed_cases_2020 = confirmed_cases_2020[(confirmed_cases_2020['OnsetToDeath'] >= 0) | confirmed_cases_2020['OnsetToDeath'].isna()].copy()
confirmed_cases_2020 = confirmed_cases_2020[(confirmed_cases_2020['OnsetToHospital'] >= 0) | confirmed_cases_2020['OnsetToHospital'].isna()].copy()
confirmed_cases_2020 = confirmed_cases_2020[(confirmed_cases_2020['HospitalToDeath'] >= 0) | confirmed_cases_2020['HospitalToDeath'].isna()].copy()

confirmed_cases_2021['OnsetToDeath'] = (confirmed_cases_2021['FECHA_DEF'] - confirmed_cases_2021['FECHA_SINTOMAS']).dt.days
confirmed_cases_2021['OnsetToHospital'] = (confirmed_cases_2021['FECHA_INGRESO'] - confirmed_cases_2021['FECHA_SINTOMAS']).dt.days
confirmed_cases_2021['HospitalToDeath'] = (confirmed_cases_2021['FECHA_DEF'] - confirmed_cases_2021['FECHA_INGRESO']).dt.days

confirmed_cases_2021 = confirmed_cases_2021[(confirmed_cases_2021['OnsetToDeath'] >= 0) | confirmed_cases_2021['OnsetToDeath'].isna()].copy()
confirmed_cases_2021 = confirmed_cases_2021[(confirmed_cases_2021['OnsetToHospital'] >= 0) | confirmed_cases_2021['OnsetToHospital'].isna()].copy()
confirmed_cases_2021 = confirmed_cases_2021[(confirmed_cases_2021['HospitalToDeath'] >= 0) | confirmed_cases_2021['HospitalToDeath'].isna()].copy()


confirmed_cases_2022['OnsetToDeath'] = (confirmed_cases_2022['FECHA_DEF'] - confirmed_cases_2022['FECHA_SINTOMAS']).dt.days
confirmed_cases_2022['OnsetToHospital'] = (confirmed_cases_2022['FECHA_INGRESO'] - confirmed_cases_2022['FECHA_SINTOMAS']).dt.days
confirmed_cases_2022['HospitalToDeath'] = (confirmed_cases_2022['FECHA_DEF'] - confirmed_cases_2022['FECHA_INGRESO']).dt.days

confirmed_cases_2022 = confirmed_cases_2022[(confirmed_cases_2022['OnsetToDeath'] >= 0) | confirmed_cases_2022['OnsetToDeath'].isna()].copy()
confirmed_cases_2022 = confirmed_cases_2022[(confirmed_cases_2022['OnsetToHospital'] >= 0) | confirmed_cases_2022['OnsetToHospital'].isna()].copy()
confirmed_cases_2022 = confirmed_cases_2022[(confirmed_cases_2022['HospitalToDeath'] >= 0) | confirmed_cases_2022['HospitalToDeath'].isna()].copy()


confirmed_cases_2023['OnsetToDeath'] = (confirmed_cases_2023['FECHA_DEF'] - confirmed_cases_2023['FECHA_SINTOMAS']).dt.days
confirmed_cases_2023['OnsetToHospital'] = (confirmed_cases_2023['FECHA_INGRESO'] - confirmed_cases_2023['FECHA_SINTOMAS']).dt.days
confirmed_cases_2023['HospitalToDeath'] = (confirmed_cases_2023['FECHA_DEF'] - confirmed_cases_2023['FECHA_INGRESO']).dt.days

confirmed_cases_2023 = confirmed_cases_2023[(confirmed_cases_2023['OnsetToDeath'] >= 0) | confirmed_cases_2023['OnsetToDeath'].isna()].copy()
confirmed_cases_2023 = confirmed_cases_2023[(confirmed_cases_2023['OnsetToHospital'] >= 0) | confirmed_cases_2023['OnsetToHospital'].isna()].copy()
confirmed_cases_2023 = confirmed_cases_2023[(confirmed_cases_2023['HospitalToDeath'] >= 0) | confirmed_cases_2023['HospitalToDeath'].isna()].copy()

# Drop records whose symptom onset falls after the end of the corresponding year
confirmed_cases_2020 = confirmed_cases_2020[confirmed_cases_2020['FECHA_SINTOMAS'] <= pd.Timestamp(2020, 12, 31)].copy()
confirmed_cases_2021 = confirmed_cases_2021[confirmed_cases_2021['FECHA_SINTOMAS'] <= pd.Timestamp(2021, 12, 31)].copy()
confirmed_cases_2022 = confirmed_cases_2022[confirmed_cases_2022['FECHA_SINTOMAS'] <= pd.Timestamp(2022, 12, 31)].copy()
confirmed_cases_2023 = confirmed_cases_2023[confirmed_cases_2023['FECHA_SINTOMAS'] <= pd.Timestamp(2023, 12, 31)].copy()
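
The interval logic in this cell can be illustrated on a small example. The sketch below assumes a hypothetical helper `add_intervals` and made-up dates. It computes the three intervals and keeps a row only when each interval is non-negative or missing, which is the same rule the cell applies to every year.

```python
import pandas as pd

INTERVALS = ["OnsetToDeath", "OnsetToHospital", "HospitalToDeath"]

def add_intervals(cases):
    """Add the three day-count columns and drop rows with a negative interval."""
    out = cases.copy()
    out["OnsetToDeath"] = (out["FECHA_DEF"] - out["FECHA_SINTOMAS"]).dt.days
    out["OnsetToHospital"] = (out["FECHA_INGRESO"] - out["FECHA_SINTOMAS"]).dt.days
    out["HospitalToDeath"] = (out["FECHA_DEF"] - out["FECHA_INGRESO"]).dt.days
    for col in INTERVALS:
        # Keep non-negative intervals and rows where the interval is undefined (NaN)
        out = out[(out[col] >= 0) | out[col].isna()]
    return out

# Made-up records: row 1 has an admission date before symptom onset
toy_cases = pd.DataFrame({
    "FECHA_SINTOMAS": pd.to_datetime(["2020-03-01", "2020-03-05", "2020-03-01"]),
    "FECHA_INGRESO": pd.to_datetime(["2020-03-03", "2020-03-04", "2020-03-02"]),
    "FECHA_DEF": pd.to_datetime([None, None, "2020-03-10"]),
})
clean = add_intervals(toy_cases)
```

Row 1 is removed because its onset-to-hospital interval is -1 day, while row 0 survives even though it has no death date.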

In [42]:
# Consolidate Onset to Death data in a new DataFrame for further exploration
onset_to_death_2020 = confirmed_cases_2020[['FECHA_SINTOMAS', 'FECHA_INGRESO', 'FECHA_DEF', 'ENTIDAD_RES', 'SEXO', 'EDAD','OnsetToDeath']].reset_index(drop=True).copy()
onset_to_death_2020.rename(columns={'FECHA_SINTOMAS': 'onset_date', 'FECHA_INGRESO': 'hospital_date', 'FECHA_DEF': 'death_date', 'ENTIDAD_RES': 'state', 'SEXO': 'gender', 'EDAD': 'age'}, inplace=True)

onset_to_death_2021 = confirmed_cases_2021[['FECHA_SINTOMAS', 'FECHA_INGRESO', 'FECHA_DEF', 'ENTIDAD_RES', 'SEXO', 'EDAD','OnsetToDeath']].reset_index(drop=True).copy()
onset_to_death_2021.rename(columns={'FECHA_SINTOMAS': 'onset_date', 'FECHA_INGRESO': 'hospital_date', 'FECHA_DEF': 'death_date', 'ENTIDAD_RES': 'state', 'SEXO': 'gender', 'EDAD': 'age'}, inplace=True)

onset_to_death_2022 = confirmed_cases_2022[['FECHA_SINTOMAS', 'FECHA_INGRESO', 'FECHA_DEF', 'ENTIDAD_RES', 'SEXO', 'EDAD','OnsetToDeath']].reset_index(drop=True).copy()
onset_to_death_2022.rename(columns={'FECHA_SINTOMAS': 'onset_date', 'FECHA_INGRESO': 'hospital_date', 'FECHA_DEF': 'death_date', 'ENTIDAD_RES': 'state', 'SEXO': 'gender', 'EDAD': 'age'}, inplace=True)

onset_to_death_2023 = confirmed_cases_2023[['FECHA_SINTOMAS', 'FECHA_INGRESO', 'FECHA_DEF', 'ENTIDAD_RES', 'SEXO', 'EDAD','OnsetToDeath']].reset_index(drop=True).copy()
onset_to_death_2023.rename(columns={'FECHA_SINTOMAS': 'onset_date', 'FECHA_INGRESO': 'hospital_date', 'FECHA_DEF': 'death_date', 'ENTIDAD_RES': 'state', 'SEXO': 'gender', 'EDAD': 'age'}, inplace=True)

# Concatenate all the data into a single DataFrame
onset_to_death_df = pd.concat([onset_to_death_2020, onset_to_death_2021, onset_to_death_2022, onset_to_death_2023], ignore_index=True)
onset_to_death_df = onset_to_death_df[onset_to_death_df['death_date'].notna()].copy().reset_index(drop=True)

# Save as a feather file in the Data folder for later use
data_folder = '/Users/ro/Desktop/Undergrad_AM_Thesis/Data'

file_path = os.path.join(data_folder, 'onset_to_death_df.feather')
onset_to_death_df.to_feather(file_path)

In [41]:
# Consolidate Onset to Hospital data in a new DataFrame for further exploration
onset_to_hospital_2020 = confirmed_cases_2020[['FECHA_SINTOMAS', 'FECHA_INGRESO', 'ENTIDAD_RES', 'SEXO', 'EDAD', 'OnsetToHospital']].reset_index(drop=True).copy()
onset_to_hospital_2020.rename(columns={'FECHA_SINTOMAS': 'onset_date', 'FECHA_INGRESO': 'hospital_date', 'ENTIDAD_RES': 'state', 'SEXO': 'gender', 'EDAD': 'age'}, inplace=True)

onset_to_hospital_2021 = confirmed_cases_2021[['FECHA_SINTOMAS', 'FECHA_INGRESO', 'ENTIDAD_RES', 'SEXO', 'EDAD', 'OnsetToHospital']].reset_index(drop=True).copy()
onset_to_hospital_2021.rename(columns={'FECHA_SINTOMAS': 'onset_date', 'FECHA_INGRESO': 'hospital_date', 'ENTIDAD_RES': 'state', 'SEXO': 'gender', 'EDAD': 'age'}, inplace=True)

onset_to_hospital_2022 = confirmed_cases_2022[['FECHA_SINTOMAS', 'FECHA_INGRESO', 'ENTIDAD_RES', 'SEXO', 'EDAD', 'OnsetToHospital']].reset_index(drop=True).copy()
onset_to_hospital_2022.rename(columns={'FECHA_SINTOMAS': 'onset_date', 'FECHA_INGRESO': 'hospital_date', 'ENTIDAD_RES': 'state', 'SEXO': 'gender', 'EDAD': 'age'}, inplace=True)

onset_to_hospital_2023 = confirmed_cases_2023[['FECHA_SINTOMAS', 'FECHA_INGRESO', 'ENTIDAD_RES', 'SEXO', 'EDAD', 'OnsetToHospital']].reset_index(drop=True).copy()
onset_to_hospital_2023.rename(columns={'FECHA_SINTOMAS': 'onset_date', 'FECHA_INGRESO': 'hospital_date', 'ENTIDAD_RES': 'state', 'SEXO': 'gender', 'EDAD': 'age'}, inplace=True)

# Concatenate all the data into a single DataFrame
onset_to_hospital_df = pd.concat([onset_to_hospital_2020, onset_to_hospital_2021, onset_to_hospital_2022, onset_to_hospital_2023], ignore_index=True)

# Save as a feather file in the Data folder for later use
data_folder = '/Users/ro/Desktop/Undergrad_AM_Thesis/Data'
file_path = os.path.join(data_folder, 'onset_to_hospital_df.feather')
onset_to_hospital_df.to_feather(file_path)

In [33]:
# Group confirmed cases by the date of symptom onset (FECHA_SINTOMAS) and the state of residence (ENTIDAD_RES)
state_cases_2020 = confirmed_cases_2020.groupby(['FECHA_SINTOMAS', 'ENTIDAD_RES']).size().reset_index(name='confirmed_cases')
state_cases_2021 = confirmed_cases_2021.groupby(['FECHA_SINTOMAS', 'ENTIDAD_RES']).size().reset_index(name='confirmed_cases')
state_cases_2022 = confirmed_cases_2022.groupby(['FECHA_SINTOMAS', 'ENTIDAD_RES']).size().reset_index(name='confirmed_cases')
state_cases_2023 = confirmed_cases_2023.groupby(['FECHA_SINTOMAS', 'ENTIDAD_RES']).size().reset_index(name='confirmed_cases')


# Group confirmed deaths by the date of death (FECHA_DEF) and the state of residence (ENTIDAD_RES)
state_deaths_2020 = confirmed_cases_2020.groupby(['FECHA_DEF', 'ENTIDAD_RES']).size().reset_index(name='confirmed_deaths')
state_deaths_2021 = confirmed_cases_2021.groupby(['FECHA_DEF', 'ENTIDAD_RES']).size().reset_index(name='confirmed_deaths')
state_deaths_2022 = confirmed_cases_2022.groupby(['FECHA_DEF', 'ENTIDAD_RES']).size().reset_index(name='confirmed_deaths')
state_deaths_2023 = confirmed_cases_2023.groupby(['FECHA_DEF', 'ENTIDAD_RES']).size().reset_index(name='confirmed_deaths')


# Rename the date columns in both DataFrames to a common name 'date'
state_cases_2020.rename(columns={'FECHA_SINTOMAS': 'date'}, inplace=True)
state_deaths_2020.rename(columns={'FECHA_DEF': 'date'}, inplace=True)

state_cases_2021.rename(columns={'FECHA_SINTOMAS': 'date'}, inplace=True)
state_deaths_2021.rename(columns={'FECHA_DEF': 'date'}, inplace=True)

state_cases_2022.rename(columns={'FECHA_SINTOMAS': 'date'}, inplace=True)
state_deaths_2022.rename(columns={'FECHA_DEF': 'date'}, inplace=True)

state_cases_2023.rename(columns={'FECHA_SINTOMAS': 'date'}, inplace=True)
state_deaths_2023.rename(columns={'FECHA_DEF': 'date'}, inplace=True)

# Merge the confirmed cases and confirmed deaths DataFrames on the date ('date') and state ('ENTIDAD_RES')
processed_data_2020 = pd.merge(state_cases_2020, state_deaths_2020, how='outer', on=['date', 'ENTIDAD_RES']).fillna(0)
processed_data_2021 = pd.merge(state_cases_2021, state_deaths_2021, how='outer', on=['date', 'ENTIDAD_RES']).fillna(0)
processed_data_2022 = pd.merge(state_cases_2022, state_deaths_2022, how='outer', on=['date', 'ENTIDAD_RES']).fillna(0)
processed_data_2023 = pd.merge(state_cases_2023, state_deaths_2023, how='outer', on=['date', 'ENTIDAD_RES']).fillna(0)

# Rename columns for simplicity
processed_data_2020.rename(columns={'ENTIDAD_RES': 'state'}, inplace=True)
processed_data_2021.rename(columns={'ENTIDAD_RES': 'state'}, inplace=True)
processed_data_2022.rename(columns={'ENTIDAD_RES': 'state'}, inplace=True)
processed_data_2023.rename(columns={'ENTIDAD_RES': 'state'}, inplace=True)
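
To make the grouping and merging steps concrete, here is a minimal sketch on made-up records. The function name `daily_counts` is hypothetical. Cases are counted by onset date and deaths by death date, and the two counts are joined with an outer merge so that a date with deaths but no new cases (or the reverse) still gets a row.

```python
import pandas as pd

def daily_counts(cases):
    """Count cases by onset date and deaths by death date, per state."""
    state_cases = (
        cases.groupby(["FECHA_SINTOMAS", "ENTIDAD_RES"]).size()
        .reset_index(name="confirmed_cases")
        .rename(columns={"FECHA_SINTOMAS": "date"})
    )
    # groupby drops NaT keys by default, so rows without a death date
    # do not contribute to the death counts
    state_deaths = (
        cases.groupby(["FECHA_DEF", "ENTIDAD_RES"]).size()
        .reset_index(name="confirmed_deaths")
        .rename(columns={"FECHA_DEF": "date"})
    )
    merged = pd.merge(state_cases, state_deaths, how="outer",
                      on=["date", "ENTIDAD_RES"]).fillna(0)
    return merged.rename(columns={"ENTIDAD_RES": "state"})

# Made-up records: three cases, one of which has a death date
toy_cases = pd.DataFrame({
    "FECHA_SINTOMAS": pd.to_datetime(["2020-03-01", "2020-03-01", "2020-03-02"]),
    "FECHA_DEF": pd.to_datetime([None, "2020-03-10", None]),
    "ENTIDAD_RES": [9, 9, 15],
})
daily = daily_counts(toy_cases)
```

The result has one row per (date, state) pair that appears in either count, with zeros where one side has no match.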

In [34]:
# ------------------------------------------------
# Merge consecutive years: 2020 and 2021
merged_2020_2021 = pd.merge(processed_data_2020, processed_data_2021, on=['date', 'state'], how='outer', suffixes=('_2020', '_2021')).fillna(0)

# Sum confirmed cases and deaths for overlapping dates
merged_2020_2021['confirmed_cases'] = merged_2020_2021['confirmed_cases_2020'] + merged_2020_2021['confirmed_cases_2021']
merged_2020_2021['confirmed_deaths'] = merged_2020_2021['confirmed_deaths_2020'] + merged_2020_2021['confirmed_deaths_2021']

# Drop the temporary columns after summing
merged_2020_2021 = merged_2020_2021.drop(['confirmed_cases_2020', 'confirmed_cases_2021', 'confirmed_deaths_2020', 'confirmed_deaths_2021'], axis=1)


# ------------------------------------------------
# Merge consecutive years: 2022 and 2023
merged_2022_2023 = pd.merge(processed_data_2022, processed_data_2023, on=['date', 'state'], how='outer', suffixes=('_2022', '_2023')).fillna(0)

# Sum confirmed cases and deaths for overlapping dates
merged_2022_2023['confirmed_cases'] = merged_2022_2023['confirmed_cases_2022'] + merged_2022_2023['confirmed_cases_2023']
merged_2022_2023['confirmed_deaths'] = merged_2022_2023['confirmed_deaths_2022'] + merged_2022_2023['confirmed_deaths_2023']

# Drop the temporary columns after summing
merged_2022_2023 = merged_2022_2023.drop(['confirmed_cases_2022', 'confirmed_cases_2023', 'confirmed_deaths_2022', 'confirmed_deaths_2023'], axis=1)


# ------------------------------------------------
# Merge the combined 2020-2021 and 2022-2023 DataFrames
processed_data_final = pd.merge(merged_2020_2021, merged_2022_2023, on=['date', 'state'], how='outer', suffixes=('_2021', '_2022')).fillna(0)

# Sum confirmed cases and deaths for overlapping dates
processed_data_final['confirmed_cases'] = processed_data_final['confirmed_cases_2021'] + processed_data_final['confirmed_cases_2022']
processed_data_final['confirmed_deaths'] = processed_data_final['confirmed_deaths_2021'] + processed_data_final['confirmed_deaths_2022']

# Drop the temporary columns after summing
processed_data_final = processed_data_final.drop(['confirmed_cases_2021', 'confirmed_cases_2022', 'confirmed_deaths_2021', 'confirmed_deaths_2022'], axis=1)

# Display the final merged DataFrame
processed_data_final

Unnamed: 0,date,state,confirmed_cases,confirmed_deaths
0,2020-02-19,15,1.0,0.0
1,2020-02-22,9,1.0,0.0
2,2020-02-22,13,1.0,0.0
3,2020-02-23,9,1.0,0.0
4,2020-02-25,7,1.0,0.0
...,...,...,...,...
42263,2024-01-07,7,0.0,1.0
42264,2024-01-10,9,0.0,1.0
42265,2024-01-11,15,0.0,1.0
42266,2024-01-25,23,0.0,1.0
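
The pairwise merges above sum the counts wherever the same (date, state) pair appears in more than one yearly file. As an alternative sketch, not the notebook's method, the same totals can be obtained by stacking the yearly frames and grouping once. The function name `combine_years` and the toy values are assumptions.

```python
import pandas as pd

def combine_years(frames):
    """Stack yearly frames and sum counts for repeated (date, state) pairs."""
    stacked = pd.concat(frames, ignore_index=True)
    return (
        stacked.groupby(["date", "state"], as_index=False)
        [["confirmed_cases", "confirmed_deaths"]]
        .sum()
    )

# Made-up yearly frames that overlap on 2021-01-02 for state 9
y2020 = pd.DataFrame({
    "date": pd.to_datetime(["2020-12-30", "2021-01-02"]),
    "state": [9, 9],
    "confirmed_cases": [5, 0],
    "confirmed_deaths": [0, 2],
})
y2021 = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-02", "2021-01-03"]),
    "state": [9, 9],
    "confirmed_cases": [4, 1],
    "confirmed_deaths": [1, 0],
})
combined = combine_years([y2020, y2021])
```

This avoids the temporary suffixed columns and extends to any number of yearly files without further merge steps.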


In [35]:
# Create a complete template DataFrame with all possible combinations of dates and states
all_dates = pd.date_range(start=confirmed_cases_2020['FECHA_SINTOMAS'].min(), 
                          end=pd.Timestamp(2023,12,31), 
                          freq='D')

all_states = list(range(1, 33))

complete_template = pd.DataFrame(list(product(all_dates, all_states)), columns=['date', 'state'])

complete_template

Unnamed: 0,date,state
0,2020-02-19,1
1,2020-02-19,2
2,2020-02-19,3
3,2020-02-19,4
4,2020-02-19,5
...,...,...
45179,2023-12-31,28
45180,2023-12-31,29
45181,2023-12-31,30
45182,2023-12-31,31
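
The template is the Cartesian product of every date in the range and the 32 state codes. A minimal sketch of the same construction with `pd.MultiIndex.from_product` is shown below. The start date is hard-coded from the first date in the output above, which is an assumption made only to keep the example self-contained.

```python
import pandas as pd

# Start date copied from the displayed output; the notebook derives it from the data
all_dates = pd.date_range(start="2020-02-19", end="2023-12-31", freq="D")
all_states = range(1, 33)  # state codes 1..32

# One row per (date, state) combination
index = pd.MultiIndex.from_product([all_dates, all_states], names=["date", "state"])
template = index.to_frame(index=False)

n_days = len(all_dates)       # 1412 days in the range
n_rows = len(template)        # 1412 * 32 = 45184 rows
```

The row count is simply the number of days times the number of states, which gives a quick sanity check on the template before merging.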


In [36]:
# Merge the complete template with the processed_data_final DataFrame
covid_df = pd.merge(complete_template, processed_data_final, on=['date', 'state'], how='left').fillna(0)
covid_df

Unnamed: 0,date,state,confirmed_cases,confirmed_deaths
0,2020-02-19,1,0.0,0.0
1,2020-02-19,2,0.0,0.0
2,2020-02-19,3,0.0,0.0
3,2020-02-19,4,0.0,0.0
4,2020-02-19,5,0.0,0.0
...,...,...,...,...
45179,2023-12-31,28,0.0,0.0
45180,2023-12-31,29,1.0,0.0
45181,2023-12-31,30,0.0,0.0
45182,2023-12-31,31,0.0,0.0
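
Because this is a left merge onto the template, any row of `processed_data_final` whose date falls outside the template range is silently discarded; the earlier output shows deaths dated in January 2024, after the template's end date of 2023-12-31. The sketch below, with a hypothetical helper `rows_outside_template` and made-up data, shows one way to list such rows before deciding whether dropping them is acceptable.

```python
import pandas as pd

def rows_outside_template(template, data, keys=("date", "state")):
    """Return the rows of data whose key combination is absent from template."""
    keys = list(keys)
    check = data.merge(template[keys].drop_duplicates(), on=keys,
                       how="left", indicator=True)
    # 'left_only' marks rows of data with no matching template row
    return check[check["_merge"] == "left_only"].drop(columns="_merge")

# Made-up template covering two days and two states
toy_template = pd.MultiIndex.from_product(
    [pd.date_range("2023-12-30", "2023-12-31", freq="D"), [1, 2]],
    names=["date", "state"],
).to_frame(index=False)

# Made-up data with one death recorded after the template's last date
toy_data = pd.DataFrame({
    "date": pd.to_datetime(["2023-12-31", "2024-01-07"]),
    "state": [1, 1],
    "confirmed_cases": [1.0, 0.0],
    "confirmed_deaths": [0.0, 1.0],
})
dropped = rows_outside_template(toy_template, toy_data)
```

Summing `confirmed_deaths` over the returned rows gives the number of deaths that the left merge would leave out.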


In [40]:
# Save as a feather file in the Data folder for later use
data_folder = '/Users/ro/Desktop/Undergrad_AM_Thesis/Data'
file_path = os.path.join(data_folder, 'covid_df.feather')
covid_df.to_feather(file_path)