# Covid-19 Mexico: Analysis & Forecasting 

#### Disclaimer

I used two datasets to complete this work.

The first dataset was cleaned and formatted by me:
[covid19-mexico-clean-order-by-states](https://www.kaggle.com/andresjramos/covid19-mexico-clean-order-by-states)

The second is from the "Dirección General de Epidemiología": [Covid Mexico Raw Data](https://www.gob.mx/salud/documentos/datos-abiertos-152127)

I'm not the author of the second one. My main objective is to apply my knowledge as a Data Scientist to these data to generate useful results.

#### My first notebook

This is the first notebook I have made as a Data Scientist, and I would very much appreciate any feedback.




**Data dictionary for better understanding:**

----

| Nº | Column Name  | Column descriptor|
|----|---------------------|------------------------|
| 1  | FECHA_ACTUALIZACION | Update date.|
| 2  | ID_REGISTRO         | Case Number ID.|
| 3  | ORIGEN              | It specifies if patient is being monitored by USMER (Healthcare Monitoring Unit for Respiratory Diseases) or not. |
| 4  | SECTOR              | It identifies the type of National Health System institution that provided the care.|
| 5  | ENTIDAD_UM          | Identifies the entity where the medical unit that provided the care is located.|
| 6  | SEXO                | Sex.|
| 7  | ENTIDAD_NAC         | Patient's birth place.|
| 8  | ENTIDAD_RES         | Entity of residence of the patient.|
| 9  | MUNICIPIO_RES       | Patient's municipality of residence.|
| 10 | TIPO_PACIENTE       | It identifies the type of care the patient received in the unit. It is called outpatient if returned home or inpatient if admitted to hospital.|
| 11 | FECHA_INGRESO       | Patient's date of admission to the care unit.|
| 12 | FECHA_SINTOMAS      | Symptom date.|
| 13 | FECHA_DEF           | Date of death.|
| 14 | INTUBADO            | Tracheal intubation.|
| 15 | NEUMONIA            | Pneumonia.|
| 16 | EDAD                | Age.|
| 17 | NACIONALIDAD        | Mexican or foreign.|
| 18 | EMBARAZO            | Pregnancy.|
| 19 | HABLA_LENGUA_INDIG  | Speaks an indigenous language.|
| 20 | DIABETES            | Diabetes.|
| 21 | EPOC                | COPD (chronic obstructive pulmonary disease; EPOC in Spanish).|
| 22 | ASMA                | Asthma.|
| 23 | INMUSUPR            | Immunosuppression.|
| 24 | HIPERTENSION        | Hypertension.|
| 25 | OTRAS_COM           | Other diseases.|
| 26 | CARDIOVASCULAR      | Cardiovascular disease.|
| 27 | OBESIDAD            | Obesity.|
| 28 | RENAL_CRONICA       | Chronic kidney disease.|
| 29 | TABAQUISMO          | Smoking.|
| 30 | OTRO_CASO           | Identifies if the patient had contact with any other cases diagnosed with SARS CoV-2|
| 31 | RESULTADO           | COVID test result|
| 32 | MIGRANTE            | Migrant.|
| 33 | PAIS_NACIONALIDAD   | Nationality.|
| 34 | PAIS_ORIGEN         | Country of origin.|
| 35 | UCI                 | Identifies if the patient required admission to an Intensive Care Unit.|

## Import libraries

In [None]:
#!pip install plotly -U 

In [None]:
#EDA section
import pandas as pd
import numpy as np
import urllib.request
import zipfile
import os 
import shutil
from datetime import datetime, timedelta
pd.set_option('display.max_columns', None)
from sklearn.cluster import KMeans
from sklearn import preprocessing
from datetime import timedelta
import datetime as dt
import itertools

#Graphic libraries
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

#ML libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error,r2_score

import warnings
warnings.filterwarnings('ignore')

## Download Dataset

In [None]:
#Download the current version of the data
url = 'http://datosabiertos.salud.gob.mx/gobmx/salud/datos_abiertos/datos_abiertos_covid19.zip'
#Create a new folder to download the data
folder_path = 'DataCovidMx'
#To avoid duplicates when updating, remove any previous download first
if os.path.exists(folder_path):
    shutil.rmtree(folder_path)
os.makedirs(folder_path)
#Download the file from URL
urllib.request.urlretrieve(url,"DataCovidMx/DataCovidCSV.zip")
#getting the file name
file_name = zipfile.ZipFile("DataCovidMx/DataCovidCSV.zip").namelist()
#extracting file
with zipfile.ZipFile("DataCovidMx/DataCovidCSV.zip", 'r') as zip_ref:
    zip_ref.extractall("DataCovidMx/")


In [None]:
#Load Datasets
df = pd.read_csv('DataCovidMx/'+ file_name[0], encoding= 'unicode_escape',low_memory=False)
clean_mx = pd.read_csv('../input/covid19-mexico-clean-order-by-states/Covid_19_Mexico_Clean_Complete.csv')
df.head()

## Data Cleaning and Formatting

### Replace the place ID by Place Name

The columns "ENTIDAD_RES" and "MUNICIPIO_RES" refer to the state and municipality of residence, respectively. Both are coded as numeric IDs, which can be replaced by the corresponding names.

To do this, I will use the table of Mexican states by ID and the data dictionary spreadsheet.
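
As a minimal sketch of this replacement, the toy example below builds the two lookups and applies them. The catalogue contents (`STATE A`, `TOWN A1`, and so on) are made up for illustration and are not taken from the real catalogue file. It uses `Series.map`, which is an alternative to the `replace` call used in this notebook.

```python
import pandas as pd

# Toy catalogues: the IDs and names are illustrative assumptions.
states = pd.DataFrame({"ID": [1, 2], "State": ["STATE A", "STATE B"]})
municipalities = pd.DataFrame({
    "CLAVE_ENTIDAD": [1, 1, 2],
    "CLAVE_MUNICIPIO": [1, 2, 1],
    "MUNICIPIO": ["TOWN A1", "TOWN A2", "TOWN B1"],
})

# Municipality codes restart in every state, so the key combines both IDs.
municipalities["Full_ID"] = (
    municipalities["CLAVE_ENTIDAD"].astype(str)
    + "-"
    + municipalities["CLAVE_MUNICIPIO"].astype(str)
)
state_lookup = dict(zip(states["ID"], states["State"]))
municipality_lookup = dict(zip(municipalities["Full_ID"], municipalities["MUNICIPIO"]))

# Toy case records coded the same way as the raw data.
cases = pd.DataFrame({"ENTIDAD_RES": [1, 2, 1], "MUNICIPIO_RES": [2, 1, 1]})
full_id = cases["ENTIDAD_RES"].astype(str) + "-" + cases["MUNICIPIO_RES"].astype(str)

# Series.map returns NaN for keys missing from the lookup,
# which makes gaps in the catalogue easy to spot.
cases["Residence_Municipality"] = full_id.map(municipality_lookup)
cases["Residence_Entity"] = cases["ENTIDAD_RES"].map(state_lookup)
print(cases)
```

Unlike `replace`, `map` does not leave unmatched codes untouched, so whether it is preferable depends on how missing catalogue entries should be handled.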

In [None]:
state_table = pd.read_excel('../input/dictionary/Catalogo.xlsx', sheet_name='Catálogo de ENTIDADES')
state_table.drop(labels= 'ABREVIATURA', axis = 1, inplace = True)
state_table.rename(columns= {'CLAVE_ENTIDAD':'ID', 'ENTIDAD_FEDERATIVA':'State'},inplace= True)
state_table.head()

In [None]:
#Import Data Dictionary - Municipality ID by State ID
Id_state = pd.read_excel("../input/dictionary/Catalogo.xlsx", sheet_name='Catálogo MUNICIPIOS')

#Creating a new column that contains the Municipality ID by state ID
Id_state['Full_ID']= Id_state['CLAVE_ENTIDAD'].astype(str) + '-' + Id_state['CLAVE_MUNICIPIO'].astype(str)

#Dictionary creation to replace the ID of the "ENTIDAD_RES" and "MUNICIPIO_RES" by name
dict_municipality = pd.Series(Id_state.MUNICIPIO.values, index = Id_state.Full_ID).to_dict()
dict_state = pd.Series(state_table.State.values, index= state_table.ID).to_dict()

#To replace the IDs by names, the municipality column must combine the State ID and the Municipality ID,
#because municipality IDs are only unique within a state.
df['MUNICIPIO_RES'] = df['ENTIDAD_RES'].astype(str) + '-' + df['MUNICIPIO_RES'].astype(str)

#replace the code with the name
df['ENTIDAD_RES'] = df['ENTIDAD_RES'].replace(dict_state)
df['MUNICIPIO_RES'] = df['MUNICIPIO_RES'].replace(dict_municipality)

#replace 1 & 2 with Female & Male
df['SEXO'] = df['SEXO'].replace({1: 'Female', 2: 'Male'})

#Replace column names
actualNames = df.columns.to_list()
newNames = ['Update Date', 'ID', 'USMER','Health_institute', 'Location_institute','Sex','Birth_location','Residence_Entity', 'Residence_Municipality','Patien_type', 'Diagnosis_Date','Symptoms_Date','Death_Date','Tracheal_intubation','Pneumonia','Age','Nationality','Pregnancy','Indigenous_language','Diabetes','EPOC','Asthma','Immunosuppression','Hypertension','Other_comorbidity','Cardiovascular_disease','Obesity', 'Chronic_kidney_disease','smoking','Another_case','Result','Migrant','Nationality Country','Home_country','UCI']
replace_dict = dict(zip(actualNames,newNames))
df.rename(columns=replace_dict,inplace=True)

# Create a new column with the days between symptom onset and death (0 for patients who did not die)
Days_decease = []
df['Symptoms_Date'] = pd.to_datetime(df['Symptoms_Date'])

for income, death in zip(df['Symptoms_Date'], df['Death_Date']):
  if death != '9999-99-99':
    deceaseDay = pd.to_datetime(death)
    days_until_decease = (deceaseDay-income).days
  else:
    days_until_decease = 0
  
  Days_decease.append(days_until_decease)

df['Days_decease'] = Days_decease

#classify the death cases and non-death cases
df.loc[df.Death_Date != '9999-99-99', 'Death_Date'] = 1  #People who died
df.loc[df.Death_Date == '9999-99-99', 'Death_Date'] = 0  #People alive
df['Death_Date'] = df['Death_Date'].astype(int)
df.rename(columns = {'Death_Date':'Dead'}, inplace = True)


df.head()
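
The day count can also be computed without an explicit loop. The sketch below uses toy records and assumes, as in this notebook, that `'9999-99-99'` marks patients who did not die. It stores `NaN` rather than 0 for survivors, which is a different convention from the cell above.

```python
import pandas as pd

# Toy records; '9999-99-99' is the placeholder for patients who did not die.
toy = pd.DataFrame({
    "Symptoms_Date": ["2020-04-01", "2020-04-03", "2020-04-10"],
    "Death_Date": ["2020-04-11", "9999-99-99", "2020-04-15"],
})

symptoms = pd.to_datetime(toy["Symptoms_Date"])
# errors="coerce" turns the unparseable placeholder into NaT instead of raising.
death_date = pd.to_datetime(toy["Death_Date"], errors="coerce")

# Subtracting two datetime columns yields timedeltas; survivors stay NaN.
toy["Days_decease"] = (death_date - symptoms).dt.days
toy["Dead"] = death_date.notna().astype(int)
print(toy)
```

Keeping survivors as `NaN` means they are skipped by `mean()` and similar aggregations, whereas a 0 would be counted as a real value.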

## EDA (Exploratory Data Analysis)

### Overall analysis

In [None]:
#aggregate data by date
mexico = clean_mx.groupby('Date', as_index=False).sum()
mexico['Date'] = pd.to_datetime(mexico['Date'])

#plot data by confirmed and deaths
fig = px.bar(mexico, x='Date', y='Confirmed',
             hover_data=['Deaths'], color='Deaths',
             labels={}, height=400)
fig.show()

print('\n')
print("Recovered Cases: " ,clean_mx['Recovered'].sum())
print("Death Cases: ",clean_mx['Deaths'].sum())
print("Confirmed Cases: ",clean_mx['Confirmed'].sum())
print("Active Cases: ",clean_mx['Active'].sum())

### How many cases are there?


In [None]:
#plot data
fig = px.histogram(df, x = 'Result', color='Sex')
fig.show()

#Print results
print("\n")
print("Total of infected people(1):", df.loc[df.Result == 1, 'Result'].count())
print("Total of negative people(2):", df.loc[df.Result == 2, 'Result'].count())
print("Total of pending result(3):", df.loc[df.Result == 3, 'Result'].count())

### Who is most affected by the virus (female or male)?

In [None]:
#Positive cases
pc = df.loc[(df['Result'] == 1)]
#Percentage
fem_per = (pc.loc[(pc['Sex'] == 'Female','Sex')].count()  / pc['Sex'].count())*100
male_per = (pc.loc[(pc['Sex'] == 'Male','Sex')].count()  / pc['Sex'].count())*100
#print results
print("Percentage Female cases:", fem_per)
print("Percentage Male cases", male_per)

There is not much difference in the share of infections between men and women.

This matters because it suggests that, in this dataset, sex has little influence on the probability of infection.

### How many people have died?

In [None]:
label=['Death Cases','Surviving Cases']
values = [pc.loc[pc.Dead == 1 , 'Dead'].count(), pc.loc[pc.Dead == 0, 'Dead'].count()]

fig = go.Figure(data=[go.Pie(labels=label, values=values)])
fig.show()

In Mexico, SARS-CoV-2 has a mortality rate of 11% among confirmed cases, which corresponds to 62,594 deaths.
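
The arithmetic behind this rate can be sketched as follows. Only the two figures quoted above are used. Because the 11% is rounded, the implied number of confirmed cases is an approximation. The helper name `case_fatality_rate` is hypothetical.

```python
# Figures quoted in the text; the rate is rounded, so results are approximate.
deaths = 62_594
fatality_rate = 0.11


def case_fatality_rate(n_deaths: int, n_confirmed: int) -> float:
    """Percentage of confirmed cases that ended in death."""
    return 100 * n_deaths / n_confirmed


# rate = deaths / confirmed, so confirmed is roughly deaths / rate.
implied_confirmed = deaths / fatality_rate
print(f"Implied confirmed cases: about {implied_confirmed:,.0f}")
print(f"Rate check: {case_fatality_rate(deaths, round(implied_confirmed)):.1f}%")
```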

### How many inpatient and outpatient cases are there?

As a hospital control measure, the Mexican health system classifies positive cases as follows:

+ Inpatient: Severe cases requiring special medical treatment.

+ Outpatient: Cases whose symptoms are not serious and can be quarantined at home.



In [None]:
fig = px.histogram(pc, x = 'Patien_type', color = 'Patien_type',title='Patient type')
fig.show()

print('\n')
print('Total outpatient:',pc.loc[pc.Patien_type == 1, 'Patien_type'].count())
print('Total inpatient',pc.loc[pc.Patien_type == 2, 'Patien_type'].count())

### Is age related to death?




In [None]:
fig = px.box(pc, y= 'Age', color= 'Dead')
fig.show()

People who were infected and didn't die have an average age of 42, while people who died from SARS-CoV-2 have an average age of 63.


Even though age seems to be an important factor, it is not a determining factor in whether a person will die or not. 

Then, there is a new question to be solved: 

### What do the people who died from SARS-CoV-2 have in common?

To answer this question, I will use K-Means clustering.

In [None]:
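#Creating a new dataset with only death cases by COVID
death = df.loc[(df['Dead'] == 1) & (df['Result'] == 1)]
death.reset_index(inplace=True)
#Keep Pneumonia, Age, the comorbidity flags and Days_decease (selected by position after reset_index)
death = death.iloc[:,[15,16,20,21,22,23,24,26,27,28,29,36]].copy()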
print(death.shape)

There are several cases with the labels 97, 98 and 99.

|Label |Meaning|
|------|-------|
|1     | Yes   |
|2     | No    |
|97    |Not applicable|
|98    |Unknown|
|99    |Unspecified|


For this analysis these three labels have essentially the same meaning, so we can choose between two options:

1. Drop all the rows that contain these labels

2. Replace these labels with 0 (together with 2, so that 0 means "No")

I choose the second option because it is very sensitive data and I want to avoid losing as much data as possible.
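
To make the trade-off between the two options concrete, here is a small sketch on toy data. The column values are invented, and the coding (1 = yes, 2 = no, 97/98/99 = unknown) is assumed from the replacement step used in this notebook.

```python
import pandas as pd

# Toy comorbidity flags: 1 = yes, 2 = no, 97/98/99 = unknown (assumed coding).
toy = pd.DataFrame({"Diabetes": [1, 2, 98, 1], "Obesity": [2, 97, 1, 99]})
unknown = [97, 98, 99]

# Option 1: drop every row with an unknown label in any column.
dropped = toy[~toy.isin(unknown).any(axis=1)]

# Option 2: keep every row and map "no" and the unknown labels to 0.
recoded = toy.replace([2] + unknown, 0)

print(f"Option 1 keeps {len(dropped)} of {len(toy)} rows")
print(recoded)
```

In this toy case option 1 discards three of the four rows, which illustrates why option 2 retains more data. The cost is that an unknown status is treated as "No".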


In [None]:
#replace values 2, 97, 98 and 99 by 0
cols = death.columns.to_list()
#Age and Days_decease are counts, not coded labels, so they must keep their values
cols.remove('Age')
cols.remove('Days_decease')

for c in cols:
  death[c] = death[c].replace([2,97,98,99], 0)

In [None]:
X = death
X = preprocessing.StandardScaler().fit(X).transform(X)
X.shape

In [None]:
kclusters = 7
kmeans = KMeans(n_clusters = kclusters, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
kmeans.fit(X)

In [None]:
death.insert(0, 'Cluster', kmeans.labels_)

In [None]:
Clusters = pd.DataFrame(columns=[], index=[])
#Mean of every feature per cluster (row 1 of describe() is the mean)
for k in range(kclusters):
    Clusters['Group ' + str(k + 1)] = death.loc[death['Cluster'] == k].describe().iloc[1]

Clusters.drop(index='Cluster',inplace=True)
#Clusters.loc[['Age','Days_decease','Days_decease STD']] = Clusters.loc[['Age','Days_decease','Days_decease STD']].apply(round)
#Express the 0/1 flags as percentages
flags = ['Pneumonia','Diabetes','EPOC','Asthma','Immunosuppression','Hypertension','Cardiovascular_disease', 'Obesity','Chronic_kidney_disease','smoking']
Clusters.loc[flags] = Clusters.loc[flags] * 100
Clusters
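
The same kind of per-cluster profile can be sketched with a single `groupby`. The data and cluster labels below are toy values, not the output of the K-Means fit.

```python
import pandas as pd

# Toy data with hypothetical cluster labels.
toy = pd.DataFrame({
    "Cluster": [0, 0, 1, 1],
    "Age": [60, 70, 40, 50],
    "Diabetes": [1, 0, 0, 0],
})

# Mean of each feature per cluster, transposed so that
# rows are features and columns are clusters.
profile = toy.groupby("Cluster").mean().T
profile.columns = [f"Group {k + 1}" for k in profile.columns]

# Express the 0/1 flag as a percentage, leaving Age in years.
profile.loc[["Diabetes"]] = profile.loc[["Diabetes"]] * 100
print(profile)
```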

In [None]:
fig = make_subplots(
    rows=3, cols=3,
    subplot_titles=("Group 1", "Group 2", "Group 3", "Group 4", "Group 5", "Group 6", "Group 7"))

fig.add_trace(go.Bar(y=Clusters['Group 1']),1,1)
fig.add_trace(go.Bar(y=Clusters['Group 2']),1,2)
fig.add_trace(go.Bar(y=Clusters['Group 3']),1,3)
fig.add_trace(go.Bar(y=Clusters['Group 4']),2,1)
fig.add_trace(go.Bar(y=Clusters['Group 5']),2,2)
fig.add_trace(go.Bar(y=Clusters['Group 6']),2,3)
fig.add_trace(go.Bar(y=Clusters['Group 7']),3,1)

fig.show()

Answering the question in this section:

**"What do the people who died from SARS-CoV-2 have in common?"**

**Primarily**, pneumonia stands out. Recent studies have indicated that SARS-CoV-2 infection may be a trigger for severe pneumonia.

**Second**, diabetes and hypertension stand out. These two diseases are known to be able to cause chronic renal failure, which leads to the **third** point: only one group had chronic renal insufficiency, and this group stands out for having the shortest average time between the date of symptoms and the date of death (10 days, with a standard deviation of 8 days).

The **fourth** shared characteristic is obesity, which is present in at least 20% of the patients in each group.

**Finally**, age is an important factor: the average age of the patients who died ranges from 59 to 63 years, with a standard deviation of 14 years.

**At least in Mexico**, the patients most likely to die have the following characteristics:

- Age between 50 and 70 years old.
- Diabetes
- Hypertension 
- Obesity 
- Chronic kidney disease.


### Geographical analysis

In [None]:
state = clean_mx.groupby('State', as_index = False).sum()

fig = px.bar(state, y='Confirmed', x='State', text='Confirmed')
fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig.show()

print(state.sort_values(by='Confirmed', ascending=False, ignore_index=True).head())

## Forecasting ML


In [None]:
#Transforming the data into cumulative totals for forecasting
mexico['Deaths'] = mexico['Deaths'].cumsum()
mexico['Confirmed'] = mexico['Confirmed'].cumsum()
mexico['Recovered'] = mexico['Recovered'].cumsum()

#Generate a "Day" column with the day of the year (1 = 1 January), used as the time index
mexico['Day'] = mexico['Date'].dt.dayofyear

#Plot cumulative data
fig = go.Figure()

fig.add_trace(go.Scatter(x=mexico['Date'], y=mexico['Deaths'],
                    mode='lines+markers',
                    name='Deaths'))

fig.add_trace(go.Scatter(x=mexico['Date'], y=mexico['Confirmed'],
                    mode='lines+markers',
                    name='Confirmed'))

fig.add_trace(go.Scatter(x=mexico['Date'], y=mexico['Recovered'],
                    mode='lines+markers',
                    name='Recovered'))

fig.show()
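
Note that `dayofyear` restarts at 1 every January, so it only works as a time index while the data stays within one calendar year. A hedged alternative is to count the days elapsed since the first date in the series, sketched here on toy dates that cross a year boundary.

```python
import pandas as pd

# Toy dates spanning a year boundary (illustrative only).
dates = pd.Series(
    pd.to_datetime(["2020-12-30", "2020-12-31", "2021-01-01", "2021-01-02"])
)

# Resets to 1 on 1 January.
day_of_year = dates.dt.dayofyear

# Keeps increasing, whatever the calendar year.
days_elapsed = (dates - dates.min()).dt.days

print(day_of_year.tolist())
print(days_elapsed.tolist())
```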

### Linear Model

In [None]:
#defining dependent and independent variables
X = mexico[['Day']]
y = mexico[['Confirmed']]

#split data into test and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state = 1)

#training linear model
lm = LinearRegression()
lm.fit(X_train,y_train)

#predictions over the whole period (used for the plot below)
y_pred = lm.predict(X)

#r2 score on the test set
print(lm.score(X_test,y_test))
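
A random 80/20 split tests how well the model interpolates between known days, whereas a forecast has to extrapolate beyond them. A stricter check is to hold out the most recent days instead. The sketch below illustrates the idea on a synthetic quadratic curve (not real case counts) and uses `np.polyfit` in place of `LinearRegression` to stay self-contained.

```python
import numpy as np

# Synthetic cumulative curve with quadratic growth; not real case counts.
day = np.arange(1, 101)
confirmed = 5.0 * day**2

# Hold out the most recent 20% of days instead of a random 20%.
split = int(len(day) * 0.8)
day_train, day_test = day[:split], day[split:]
y_train, y_test = confirmed[:split], confirmed[split:]

# Fit a straight line on the past only, then score it on the unseen future.
slope, intercept = np.polyfit(day_train, y_train, deg=1)
pred_test = slope * day_test + intercept

# R^2 computed by hand on the held-out future days.
ss_res = np.sum((y_test - pred_test) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_future = 1 - ss_res / ss_tot
print(f"R^2 on the held-out future days: {r2_future:.2f}")
```

On a curve that keeps accelerating, a straight line fitted to the past falls further behind each day, so the score on future days can be far lower than a random split would suggest.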

In [None]:
#predicted data
Mexico_lm = mexico.iloc[:,[0,2]]
Mexico_lm['lm_pred'] = y_pred

#plot data comparison
fig = go.Figure()

fig.add_trace(go.Scatter(x=Mexico_lm['Date'], y=Mexico_lm['Confirmed'],
                    mode='lines+markers',
                    name='Confirmed'))

fig.add_trace(go.Scatter(x=Mexico_lm['Date'], y=Mexico_lm['lm_pred'],
                    mode='lines',
                    name='predicted'))

fig.show()

### Polynomial model

In [None]:
#defining dependent and independent variables
X = mexico[['Day']]
y = mexico[['Confirmed']]

#Transforming data
degree = 8
Poly_reg = PolynomialFeatures(degree)
Xpoly = Poly_reg.fit_transform(X)

#split data into test and train
X_train, X_test, y_train, y_test = train_test_split(Xpoly, y, test_size=0.20, random_state = 2)

#Training model
pm=LinearRegression()
pm.fit(X_train,y_train)

#predictions over the whole period
y_pred_poly = pm.predict(Xpoly)

#r2 score (computed on all rows, training rows included)
print(r2_score(y,y_pred_poly))

#RMSE (root of the mean squared error)
print(np.sqrt(mean_squared_error(y,y_pred_poly)))
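
For reference, the two values printed for the polynomial model follow the usual definitions below. Here $y_i$ are the observed cumulative confirmed cases, $\hat{y}_i$ the predictions, $\bar{y}$ the mean of the observations and $n$ the number of days. The second value is a root mean squared error, so it is expressed in the same units as the case counts.

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
$$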

In [None]:
Mexico_pm = mexico.iloc[:,[0,2]]
Mexico_pm['pm_pred'] = y_pred_poly

#plot data comparison
fig = go.Figure()

fig.add_trace(go.Scatter(x=Mexico_pm['Date'], y=Mexico_pm['Confirmed'],
                    mode='lines+markers',
                    name='Confirmed'))

fig.add_trace(go.Scatter(x=Mexico_pm['Date'], y=Mexico_pm['pm_pred'],
                    mode='lines',
                    name='predicted'))

fig.show()
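
A degree-8 polynomial of the raw day number produces very large feature values (250 to the 8th power is on the order of 10^19), which can make a least-squares fit numerically fragile. The sketch below compares the condition number of the polynomial feature matrix before and after rescaling the days to [-1, 1]. The day range is a hypothetical one chosen for illustration, and rescaling is a suggestion rather than something this notebook does.

```python
import numpy as np

degree = 8
day = np.arange(60, 251, dtype=float)  # hypothetical day-of-year range

# Raw powers: columns run from 1 up to day**8.
raw = np.vander(day, degree + 1, increasing=True)

# Rescale the days to [-1, 1] before expanding them into powers.
scaled_day = 2 * (day - day.min()) / (day.max() - day.min()) - 1
scaled = np.vander(scaled_day, degree + 1, increasing=True)

# A larger condition number means a more ill-conditioned problem.
cond_raw = np.linalg.cond(raw)
cond_scaled = np.linalg.cond(scaled)
print(f"raw: {cond_raw:.2e}  scaled: {cond_scaled:.2e}")
```

If rescaling were adopted, the same transformation would have to be applied to the future dates before predicting.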

## What is the scenario for the next 20 days?

In [None]:
#creating the df of dates to forecast (10 to 30 September 2020)
pred = pd.DataFrame(columns=[], index=[])
pred["Dates"] = pd.date_range(start="09-10-2020", end="9-30-2020")

In [None]:
#Prediction for Linear model
X = pred["Dates"].dt.dayofyear
X = X.values.reshape(-1,1)
prediction_linear = lm.predict(X)
pred['Prediction_linear'] = prediction_linear.round()

In [None]:
#Prediction for polynomial model
X = pred["Dates"].dt.dayofyear
X = X.values.reshape(-1,1)
X = Poly_reg.transform(X) #reuse the transformer already fitted on the training data
prediction_poly = pm.predict(X)
pred['Prediction_poly'] = prediction_poly.round()

In [None]:
#Plot model comparison

fig = go.Figure()

fig.add_trace(go.Scatter(x=pred['Dates'], y=pred['Prediction_linear'],
                    mode='lines+markers',
                    name='Linear model prediction'))

fig.add_trace(go.Scatter(x=pred['Dates'], y=pred['Prediction_poly'],
                    mode='lines+markers',
                    name='Polynomial model prediction'))

fig.add_trace(go.Scatter(x=Mexico_pm['Date'], y=Mexico_pm['Confirmed'],
                    mode='lines+markers',
                    name='Confirmed'))


fig.show()

print(pred)