# Global and Local Analysis on the Zika Virus Outbreak

## Motivation and Objectives
The 2016 Zika virus outbreak affected several countries, most of them in the Americas. Keeping records of outbreak-related cases is essential for monitoring the disease. Because each country reported cases using a different methodology, this project aims to gain insight into how the epidemic developed both locally and globally.



## Methodology

First, the data are ordered and cleaned. Pivot tables are then generated to analyse the data by country. These country pivot tables show how each country reported its cases and help identify equivalent data fields, so that the data can be visualized on a global scale.

In this notebook, visualizations are built with the Python library Plotly, which produces interactive plots. The Plotly graphs will be made available through an online report on the Datapane website.

In addition, once a final summarized dataset for the global cases is obtained, a dashboard will be produced with Tableau Public.

In [None]:
!pip install plotly

In [None]:
!pip install datapane

In [None]:
import pandas as pd
import numpy as np 
import os

#Visualization
import plotly
import seaborn as sns
import matplotlib.pyplot as plt 

import plotly.offline as py
import plotly.graph_objs as go
import plotly.express as px

#Report Publishing
import datapane as dp

In [None]:
#import plotly.io as pio
#pio.templates.default = "plotly"

Code below to read the csv file based on notebook [Exploring Zika Spread](https://www.kaggle.com/jungealexander/exploring-zika-spread):
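As a minimal sketch of the date clean-up performed in the next cell, the snippet below normalizes mixed delimiters before parsing. The sample dates are hypothetical and only illustrate the two delimiter styles.

```python
import pandas as pd

# Hypothetical sample mimicking the mixed delimiters found in report_date
raw_dates = pd.Series(["2016-03-19", "2016_03_26", "2016-04-02"])

# Replace underscores with hyphens, then parse with an explicit format
clean_dates = pd.to_datetime(
    raw_dates.str.replace("_", "-", regex=False),
    format="%Y-%m-%d",
)

print(clean_dates.dt.strftime("%Y-%m-%d").tolist())
```

Passing an explicit `format` makes the parse fail loudly if a date uses yet another layout, instead of guessing silently.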

In [None]:

df = pd.read_csv('../input/zika-virus-epidemic/cdc_zika.csv')

keep_rows = np.logical_and(pd.notnull(df['report_date']),
                           pd.notnull(df['value'])) 
df = df[keep_rows]
print('Removed {:d} out of {:d} rows with missing '
      'report_date or missing value.'.format(len(keep_rows) - sum(keep_rows),
                                             len(keep_rows)))

# clean report_date as some dates are delimited by underscores and some by hyphens,
# then convert to DatetimeIndex and sort by report_date
df['report_date'] = pd.to_datetime([d.replace('_', '-') for d in df['report_date']],
                               format='%Y-%m-%d')
df.sort_values(by='report_date', inplace=True, kind='mergesort')  # 'mergesort' is stable

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.dtypes

In [None]:
df.isna().sum()

The `value` column is stored as an object, but it should be numeric, since it represents the number of Zika cases or suspected cases.
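A small illustration of the conversion, assuming the column mixes numeric strings with a few unparseable entries (the sample values here are made up):

```python
import pandas as pd

# Hypothetical values: numeric strings mixed with entries that cannot be parsed
values = pd.Series(["12", "7", "5*", None], dtype="object")

# errors="coerce" turns anything unparseable into NaN instead of raising
numeric = pd.to_numeric(values, errors="coerce")

# Count how many entries could not be converted
print(numeric.isna().sum())
```

The coerced entries show up as `NaN`, which is why the missing-value count is checked again after the conversion.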

In [None]:
# Convert to numeric; entries that cannot be parsed become NaN
df['value'] = pd.to_numeric(df['value'], errors='coerce')

In [None]:
df.dtypes

In [None]:
df.isna().sum()

#### Deleting NaN values from **value** column

In [None]:
df = df.dropna(subset=['value'])
df.isna().sum()

In [None]:
df['unit'].unique()

In [None]:
len(df['location'].unique())

In [None]:
df['location'].unique()

In [None]:
df['location_type'].unique()

### Creating a column for country from the location column
Locations for some countries, such as Brazil and Colombia, are written as `CountryName-State`. We want to visualize country-wide data only, so a new column is created for the country name.
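One way to derive the country name is to split each location on the hyphen and keep the first piece. This is a sketch on hypothetical location strings, not the dataset itself:

```python
import pandas as pd

# Hypothetical locations following the CountryName-State pattern
locations = pd.Series([
    "Brazil-Acre",
    "Colombia-Antioquia-Medellin",
    "United_States-Florida",
    "Haiti",
])

# Keep only the text before the first hyphen
country = locations.str.split("-").str[0]

print(country.tolist())
```

Locations without a hyphen are returned unchanged, so country-level rows keep their name.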

In [None]:
# Sum only the numeric 'value' column for each location type
df_loct = df.groupby('location_type')['value'].sum().reset_index()

In [None]:
df_loct.head(10)

#### Visualizing all locations 

In [None]:
countries = [s[0] for s in df['location'].str.split('-')]
df_grouped = df.groupby(countries)
for name, group in df_grouped:
    print(name)
    print(group['location'].unique())

Note: please keep in mind that the analysis below relied on some of my own knowledge of Latin American countries (I am from Brazil) and on a few Google searches.
- **Argentina**: reported all locations as provinces.
- **Brazil**: reported information by state and by region (Norte - North, Sul - South, Nordeste - Northeast, Centro - Central-West, Sudeste - Southeast), so there are probably redundant rows.
- **Colombia**: reported all locations as provinces.
- **Dominican Republic**: reported some separate locations. A deeper analysis will be carried out in the Dominican Republic section.
- **Ecuador**: reported some separate locations. A deeper analysis is required.
- **El Salvador**: has some data points for the country and for separate locations. More details in the El Salvador section.
- **Guatemala**: same as the El Salvador case.
- **Haiti**: has data points reported for separate regions, provinces and the country. A deeper analysis is needed.
- **Mexico**: has data points reported for each state and the whole country. An analysis is necessary to check for redundant data.
- **Nicaragua**: data points reported by state, municipality and country. Need to check for redundancies.
- **Panama**: data points reported by regions, sub-regions and country. Need to check for redundancies.
- **Puerto Rico**: data reported by country.
- **United States**: data reported by state, county and territories (including Puerto Rico). Need to check for redundancies.

In [None]:
# Take the part of the location before the first hyphen as the country name
df['country'] = df['location'].astype(str).str.split('-').str[0]
df.country.unique()

Renaming some information to look better in the plots later on (this dataset will also be visualized using Tableau Public)

In [None]:
df['country'] = df['country'].replace(['Dominican_Republic', 'United_States', 'Puerto_Rico', 'United_States_Virgin_Islands'], 
                                      ['Dominican Republic', 'United States', 'Puerto Rico', 'United States'])

## Summarizing data on a pivot table

In [None]:
df_pv = df.pivot_table("value", ['report_date', 'country', 'location', 'location_type', 'data_field'], aggfunc="sum").reset_index()
df_pv.head()

In [None]:
df_pv['country'].unique()

## Inspecting data by Country

### Brazil

Norte (north), Nordeste (northeast), Sudeste (southeast), Sul (south) and Centro (central-west) are rows in the dataframe that sum up the cases in these regions of Brazil, so this is probably duplicated data.

#### Checking information on Brazil regions

In [None]:
df_br_region= df_pv[df_pv['country'].isin(['Norte', 'Centro', 'Nordeste', 'Sudeste', 'Sul'])].sort_values("report_date", ascending=True)
df_br_region = df_br_region.groupby('data_field')['value'].sum().reset_index()
df_br_region.head()

So, when the name of the location is the Brazilian region, the data is related to reported Zika cases only. 

In [None]:
df_br = df_pv[df_pv['country'].isin(['Brazil'])].sort_values("report_date", ascending=True)
df_br = df_br.groupby('data_field')['value'].sum().reset_index()
df_br.head()

The data assigned to location 'Brazil-State_name' is actually data on only microcephaly investigations. 

#### New Pivot Table 

In [None]:
df_pv2 = df.pivot_table("value", ['report_date', 'country', 'data_field'], aggfunc="sum").reset_index()
df_pv_br = df_pv2[df_pv2['country'].isin(['Brazil', 'Norte', 'Centro', 'Nordeste', 'Sudeste', 'Sul'])].sort_values("report_date", ascending=True)
df_pv_br['country'] = df_pv_br['country'].replace(['Norte', 'Centro', 'Nordeste', 'Sudeste', 'Sul'], 'Brazil')
#Summing up data by country again:
df_pv_br2 = df_pv_br.pivot_table('value', ['report_date', 'country', 'data_field'], aggfunc='sum').reset_index()
df_pv_br2.head()

In [None]:
df_pv_br2['country'].unique()

In [None]:
df_pv_br2.shape

In [None]:
df_pv_br2['data_field'].value_counts()

#### Summarizing data fields

- **Zika_reported** combines suspected and confirmed cases. Other countries separated the suspected from confirmed cases, therefore it will be a challenge to visualize global data by category due to this type of inconsistency. 
- **municipality_microcephaly_suspected** is the number of municipalities with suspected microcephaly cases. This information can be deleted from this analysis.
- **microcephaly_not**: reported cases of microcephaly and/or altered central nervous system suggestive of congenital infection in fetuses, abortions, stillbirths or recent live births that were discarded.
- **municipality_microcephaly** number of municipalities with confirmed microcephaly cases. This information can be deleted from this analysis.

In [None]:
df_pv_br_mic = df_pv_br2[df_pv_br2['data_field'].isin(['microcephaly_confirmed', 
                                                  'microcephaly_under_investigation',
                                                  'microcephaly_fatal_confirmed',
                                                  'microcephaly_fatal_under_investigation',
                                                  'microcephaly_fatal_not'])]

In [None]:
fig1 = px.line(df_pv_br_mic, x="report_date", y="value", color='data_field')
fig1.update_layout(
    title={
        'text': "Microcephaly in Brazil",
        'x':0.5,
        'xanchor': 'center'})
fig1.show()

In [None]:
df_pv_br_zika = df_pv_br2[df_pv_br2['data_field'] == 'zika_reported']
df_pv_br_zika.head()

In [None]:
fig2 = px.line(df_pv_br_zika, x="report_date", y="value")
fig2.update_layout(
    title={
        'text': "Zika Virus in Brazil - Confirmed + Suspected Cases",
        'x':0.5,
        'xanchor': 'center'})
fig2.show()

### Argentina

In [None]:
df_pv_arg = df_pv[df_pv['country']=='Argentina'].sort_values("report_date", ascending=True)
df_pv_arg.head()

In [None]:
df_pv_arg['location_type'].unique()

All data points are represented by Province so we can assume that the sum of the 'value' column will give the total country cases. 
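To illustrate that assumption, the sketch below sums province-level rows into a country-level figure per date and data field. The provinces and values are hypothetical; only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical province-level rows for two report dates
rows = pd.DataFrame({
    "report_date": pd.to_datetime(
        ["2016-03-19", "2016-03-19", "2016-03-19", "2016-03-26"]),
    "country": ["Argentina"] * 4,
    "location": ["Argentina-Salta", "Argentina-Chaco",
                 "Argentina-Salta", "Argentina-Salta"],
    "data_field": ["cumulative_confirmed_local_cases",
                   "cumulative_confirmed_local_cases",
                   "cumulative_probable_local_cases",
                   "cumulative_confirmed_local_cases"],
    "value": [3, 2, 1, 5],
})

# Dropping 'location' from the index sums all provinces per date and field
national = rows.pivot_table(
    values="value",
    index=["report_date", "country", "data_field"],
    aggfunc="sum",
).reset_index()

print(national)
```

Note that the sum runs across provinces within a date. Cumulative fields should not additionally be summed across dates, since each report already includes the earlier counts.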

In [None]:
df_pv_arg.shape

In [None]:
df_pv_arg['data_field'].value_counts()

#### New Pivot Table for Argentina

In [None]:
df_pv_arg2 = df.pivot_table("value", ['report_date', 'country', 'data_field'], aggfunc="sum").reset_index()
df_pv_arg2 = df_pv_arg2[df_pv_arg2['country'] == 'Argentina'].sort_values("report_date", ascending=True)
df_pv_arg2.head()

In [None]:
df_pv_arg2.groupby("data_field")['value'].describe().reset_index()

Plotting Zika Virus Cases in Argentina

In [None]:
df_pv_arg_cases = df_pv_arg2[df_pv_arg2['data_field'].isin(['cumulative_confirmed_imported_cases',    
'cumulative_confirmed_local_cases','cumulative_probable_imported_cases','cumulative_probable_local_cases'])]

In [None]:
fig3 = px.line(df_pv_arg_cases, x="report_date", y="value", color='data_field')
fig3.update_layout(
    title={
        'text': "Zika Virus Cases In Argentina",
        'x':0.5,
        'xanchor': 'center'})
fig3.show()

In [None]:
df_pv_arg_study = df_pv_arg2[df_pv_arg2['data_field']=='cumulative_cases_under_study']

In [None]:
fig4 = px.line(df_pv_arg_study, x="report_date", y="value")
fig4.update_layout(
    title={
        'text': "Zika Virus Cases In Argentina Under Study",
        'x':0.5,
        'xanchor': 'center'})
fig4.show()

### Colombia

In [None]:
df_pv_col = df_pv[df_pv['country']=='Colombia'].sort_values("report_date", ascending=True)

In [None]:
df_pv_col_gp = df_pv_col.groupby(['location_type', 'data_field'])['value'].sum().reset_index()
df_pv_col_gp.head(10)

In [None]:
df_pv_col.shape

In [None]:
df_pv_col['location_type'].unique()

All information for Colombia is presented by municipality, so we can assume that summing all values gives the country-wide number for each data field.

In [None]:
df_pv_col['data_field'].value_counts()

#### New Pivot table for Colombia

In [None]:
df_pv_col2 = df.pivot_table("value", ['report_date', 'country', 'data_field'], aggfunc="sum").reset_index()
df_pv_col2 = df_pv_col2[df_pv_col2['country'] == 'Colombia'].sort_values("report_date", ascending=True)
df_pv_col2.head()

#### Summarizing data fields

- **zika_confirmed_laboratory** lists cases confirmed in labs and **zika_confirmed_clinic** lists cases confirmed in clinics. These categories will be joined. 
- Similarly, **zika_suspected** and **zika_suspected_clinic** will be joined into one category. This will make it easier to visualize global data in a plot later on. 
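The merge of categories described above can be sketched as a relabel followed by a re-aggregation. The values below are hypothetical; the field names are the ones listed in the bullets.

```python
import pandas as pd

# Hypothetical Colombia rows, one per original data field
col = pd.DataFrame({
    "report_date": pd.to_datetime(["2016-01-09"] * 4),
    "country": ["Colombia"] * 4,
    "data_field": ["zika_confirmed_laboratory", "zika_confirmed_clinic",
                   "zika_suspected", "zika_suspected_clinic"],
    "value": [10, 4, 20, 6],
})

# Map the detailed fields onto two broader categories
mapping = {
    "zika_confirmed_laboratory": "zika_confirmed",
    "zika_confirmed_clinic": "zika_confirmed",
    "zika_suspected_clinic": "zika_suspected",
}
col["data_field"] = col["data_field"].replace(mapping)

# Re-aggregate so each merged category has a single row per date
merged = col.groupby(
    ["report_date", "country", "data_field"], as_index=False)["value"].sum()

print(merged)
```

Using a dictionary keeps each old name next to its new name, which is easier to audit than two parallel lists.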

In [None]:
df_pv_col2['data_field'] = df_pv_col2['data_field'].replace(['zika_confirmed_laboratory','zika_suspected',
                                                         'zika_suspected_clinic','zika_confirmed_clinic'], 
                                                        ['zika_confirmed','zika_suspected',
                                                         'zika_suspected','zika_confirmed'])

df_pv_col2 = df_pv_col2.groupby(['report_date', 'country', 'data_field']).sum().reset_index()
df_pv_col2.head()

In [None]:
fig5 = px.line(df_pv_col2, x="report_date", y="value", color='data_field')
fig5.update_layout(
    title={
        'text': "Zika Virus Cases In Colombia",
        'x':0.5,
        'xanchor': 'center'})
fig5.show()

### Dominican Republic

In [None]:
df_pv_dom = df_pv[df_pv['country']=='Dominican Republic'].sort_values("report_date", ascending=True)
df_pv_dom.head()

In [None]:
df_pv_dom.shape

In [None]:
df_pv_dom['location_type'].unique()

In [None]:
df_pv_dom['data_field'].value_counts()

We need to check whether there are redundancies in data reported as country, province and municipality
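A compact way to run this kind of check is to put the totals per location type side by side. The sketch below uses made-up numbers and assumes the same column names as the pivot table:

```python
import pandas as pd

# Hypothetical rows: the same data field reported at three levels
dom = pd.DataFrame({
    "location_type": ["country", "province", "province", "municipality"],
    "data_field": ["zika_suspected_cumulative"] * 4,
    "value": [100, 60, 35, 5],
})

# One row per data field, one column per location type
by_level = (
    dom.groupby(["location_type", "data_field"])["value"]
    .sum()
    .unstack("location_type")
)

print(by_level)
```

If the province total is close to the country total, the two levels most likely describe the same cases, and adding them together would double count.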

In [None]:
df_pv_dom_gp = df_pv_dom.groupby(['location_type', 'data_field'])['value'].sum().reset_index()
df_pv_dom_gp.head()

In [None]:
df_pv_dom_gp[df_pv_dom_gp['location_type'] == 'province']

In [None]:
df_pv_dom_gp[df_pv_dom_gp['location_type'] == 'country']

In [None]:
df_pv_dom_gp[df_pv_dom_gp['location_type'] == 'municipality']

**Country** reported the following cumulatives:
- Gbs reported cumulative = 990
- zika suspected cumulative = 30,137

**Province** reported the following cumulatives:
- zika suspected cumulative = 26,304 cases
- gbs confirmed cumulative = 33
- gbs reported cumulative = 990

**Municipality** reported the following cumulatives:
- zika suspected cumulative = 50


#### New Pivot Table for Dominican Republic

In [None]:
df_pv_dom2 = df.pivot_table("value", ['report_date', 'country', 'data_field'], aggfunc="sum").reset_index()
df_pv_dom2 = df_pv_dom2[df_pv_dom2['country'] == 'Dominican Republic'].sort_values("report_date", ascending=True)
df_pv_dom2.head()

 - *GBS* stands for Guillain-Barré Syndrome, one of the potential consequences of Zika virus infection. A separate subset will be created for this information.
 - *efe* stands for febrile rash illness, a symptom that can indicate Zika infection. It will be shown in a separate plot.
 - A subset for *microcephaly* will be created to follow the evolution of microcephaly cases in the Dominican Republic alongside Zika cases.
 - A subset for confirmed and suspected Zika cases will be created.

#### Zika Confirmed and Suspected Cases

In [None]:
df_pv_dom_zika = df_pv_dom2[df_pv_dom2['data_field'].isin([ 
                                                         'zika_suspected_pregnant_cumulative',
                                                         'zika_suspected_cumulative',
                                                         'zika_confirmed_pcr_cumulative',
                                                         'zika_confirmed_pregnant_cumulative',  # comma added: without it the last two strings are concatenated
                                                         'total_zika_new_confirmed_pcr'
                                                        ])]

In [None]:
df_pv_dom_zika.groupby("data_field")['value'].describe().reset_index()

In [None]:
fig6 = px.line(df_pv_dom_zika, x="report_date", y="value", color='data_field')
fig6.update_layout(
    title={
        'text': "Zika Virus Cases In Dominican Republic",
        'x':0.5,
        'xanchor': 'center'})
fig6.show()

#### GBS cases and investigations

In [None]:
df_pv_dom_gbs = df_pv_dom2[df_pv_dom2['data_field'].isin(['gbs_reported', 'gbs_confirmed_cumulative',
                                                        'gbs_reported_4weeks', 'gbs_reported_cumulative',
                                                        'gbs_zika_confirmed_pregnant','gbs_zika_confirmed' 
                                                        ])]

In [None]:
df_pv_dom_gbs.groupby("data_field")['value'].describe().reset_index()

In [None]:
fig7 = px.line(df_pv_dom_gbs, x="report_date", y="value", color='data_field')
fig7.update_layout(
    title={
        'text': "Guillain Barre Syndrome Cases In Dominican Republic",
        'x':0.5,
        'xanchor': 'center'})
fig7.show()

#### Microcephaly cases and investigations

In [None]:
df_pv_dom_mic = df_pv_dom2[df_pv_dom2['data_field'].isin(['microcephaly_suspected',
                                                        'microcephaly_suspected_4weeks',
                                                        'microcephaly_confirmed_cumulative',
                                                        'microcephaly_suspected_cumulative'])]

In [None]:
df_pv_dom_mic.groupby("data_field")['value'].describe().reset_index()

In [None]:
df_pv_dom_mic['value'].sum()

There are only null values for microcephaly cases. These will be discarded from the analysis.

#### Febrile Rash Illness

In [None]:
df_pv_dom_efe = df_pv_dom2[df_pv_dom2['data_field']== 'efe_reported']

In [None]:
df_pv_dom_efe.groupby("data_field")['value'].describe().reset_index()                      

There was only one reported case of EFE so there is no need to make a plot for this. 

### Ecuador

In [None]:
df_pv_ecu = df_pv[df_pv['country']=='Ecuador'].sort_values("report_date", ascending=True)
df_pv_ecu.head()

In [None]:
df_pv_ecu.shape

In [None]:
df_pv_ecu['location_type'].unique()

In [None]:
df_pv_ecu_gp = df_pv_ecu.groupby(['location_type', 'data_field'])['value'].sum().reset_index()
df_pv_ecu_gp.head(30)

In [None]:
len(df_pv_ecu['data_field'].unique())

The data was reported in Ecuador as follows:
- **Country** reported confirmed cases by age group, total cumulative confirmed, total pregnant cases confirmed, total suspected cases. 
- **County** reported not applicable confirmed cases (not native, not imported cases), confirmed native (autochthonous) cases, cases in pregnant women. 
- **Province** reported confirmed cases in pregnant women and suspected cumulative cases.
- The suspected cumulative cases reported by the provinces and by the country are the same (1898), so from now on we will consider only the data with `location_type` = country.

#### New Pivot Table for Ecuador

In [None]:
df_pv_country = df_pv[df_pv['location_type'] == 'country']
df_pv_ecu2 = df_pv_country[df_pv_country['country'] == 'Ecuador'].sort_values("report_date", ascending=True)
df_pv_ecu2.head()

In [None]:
df_pv_ecu2['data_field'].unique()

Only two variables will be considered: **total_zika_suspected_cumulative** and **total_zika_confirmed_cumulative**

In [None]:
df_pv_ecu_zika = df_pv_ecu2[df_pv_ecu2['data_field'].isin(['total_zika_suspected_cumulative',
                                                     'total_zika_confirmed_cumulative'])]

In [None]:
df_pv_ecu_zika.groupby("data_field")['value'].describe().reset_index()

In [None]:
fig8 = px.line(df_pv_ecu_zika, x="report_date", y="value", color='data_field')
fig8.update_layout(
    title={
        'text': "Zika Virus Cases Ecuador",
        'x':0.5,
        'xanchor': 'center'})
fig8.show()

### El Salvador

In [None]:
df_pv_els = df_pv[df_pv['country']=='El_Salvador'].sort_values("report_date", ascending=True)
df_pv_els.head()

In [None]:
df_pv_els.shape

In [None]:
df_pv_els['location_type'].unique()

In [None]:
df_pv_els['data_field'].value_counts()

Checking for redundancies:

In [None]:
df_pv_els_gp = df_pv_els.groupby(['location_type', 'data_field'])['value'].sum().reset_index()
df_pv_els_gp.head(20)

**To do:** sum up the cumulative values per age group.

In [None]:
len(df_pv_els['data_field'].unique())

From the above, we can see that:
- **Country** reports all confirmed cases, cumulative suspected cases by age group, cumulative suspected cases in pregnant women and weekly hospitalizations.
- **Department** reports the cumulative suspected total.
- The cumulative suspected total reported by the departments adds up to 117,362 cases, whereas the country reported 117,475.
- Total confirmed cases were 472.

*Given the total number of confirmed cases, we will use the number reported by the country as the suspected-case count.*

#### New Pivot Table for El Salvador

According to the data description for the El Salvador records on [github](https://github.com/cdcepi/zika/blob/master/El_Salvador/SV_Data_Guide.csv), *cumulative_suspected_total* represents the annual cumulative suspected cases of Zika and *cumulative_confirmed* represents the annual cumulative confirmed cases. Only these categories will be considered in the plot.

In [None]:
df_pv_els2 = df.pivot_table("value", ['report_date', 'country', 'data_field'], aggfunc="sum").reset_index()
df_pv_els2 = df_pv_els2[df_pv_els2['country'] == 'El_Salvador'].sort_values("report_date", ascending=True)
df_pv_els2.head()

In [None]:
df_pv_els_zika = df_pv_els2[df_pv_els2['data_field'].isin(['cumulative_confirmed',
                                                     'cumulative_suspected_total'])]

In [None]:
df_pv_els_zika.groupby("data_field")['value'].describe().reset_index()

In [None]:
fig9 = px.line(df_pv_els_zika, x="report_date", y="value", color='data_field')
fig9.update_layout(
    title={
        'text': "Zika Virus Cases El Salvador",
        'x':0.5,
        'xanchor': 'center'})
fig9.show()

### Guatemala

In [None]:
df_pv_gua = df_pv[df_pv['country']=='Guatemala'].sort_values("report_date", ascending=True)
df_pv_gua.head()

In [None]:
df_pv_gua.shape

In [None]:
df_pv_gua['data_field'].value_counts()

In [None]:
df_pv_gua['location_type'].unique()

In [None]:
df_pv_gua_gp = df_pv_gua.groupby(['location_type', 'data_field'])['value'].sum().reset_index()
df_pv_gua_gp.head(20)

**Country** reported:
- Total zika suspected = 296
- Total zika confirmed cumulative = 469
- Total zika suspected cumulative = 4056

**Municipality** reported:
- Total zika suspected = 284
- Total zika confirmed cumulative = 804
- Total zika suspected cumulative = 4056

Checking if we need to sum up the total zika confirmed cases as both the country and municipality reported these data fields:

In [None]:
df_pv_gua_date = df_pv_gua.groupby(['report_date','location_type', 'data_field']).sum().reset_index()
df_pv_gua_date.head()

In [None]:
df_pv_gua_date[df_pv_gua_date['data_field'] == 'total_zika_confirmed_cumulative']

In [None]:
date = df_pv_gua_date[df_pv_gua_date['data_field'] == 'total_zika_confirmed_cumulative']
date['value'].sum()

Some dates had reports from both country and municipality, with the same number of cases, but in other cases only one authority reported cases. 

If we sum up the whole *value* column, some values will be counted twice. For that reason, a new dataframe will be created, the location type `municipality` will be renamed to `country`, and duplicated values will be deleted.
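The de-duplication idea can be sketched on a few hypothetical rows. As an assumption that goes beyond the notebook's own cells, `data_field` is included in the duplicate key here, so that two different fields that happen to share a value on the same date are not collapsed into one.

```python
import pandas as pd

# Hypothetical rows: country and municipality report the same figure on one date
gua = pd.DataFrame({
    "report_date": pd.to_datetime(["2016-04-02", "2016-04-02", "2016-04-09"]),
    "location_type": ["country", "municipality", "municipality"],
    "data_field": ["total_zika_confirmed_cumulative"] * 3,
    "value": [100, 100, 120],
})

# Treat both reporting levels as country-level data
gua["location_type"] = gua["location_type"].replace("municipality", "country")

# Drop rows repeating the same date, field and value (data_field is an added key)
deduped = gua.drop_duplicates(
    subset=["report_date", "data_field", "value"], keep="last")

print(deduped)
```

After this step, each date contributes a single figure per data field, so summing the column no longer double counts the overlapping reports.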


#### Cleaning the Guatemala Dataframe

In [None]:
df_pv_gua2 = df_pv[df_pv['country']=='Guatemala'].sort_values("report_date", ascending=True)
df_pv_gua2= df_pv_gua2[df_pv_gua2['data_field'].isin(['total_zika_confirmed_cumulative', 
                                                              'total_zika_suspected_cumulative',
                                                              ])]
df_pv_gua2.head()

In [None]:
#Just checking the sum for total confirmed cases:
total_confirmed_cases = df_pv_gua2[df_pv_gua2['data_field'] == 'total_zika_confirmed_cumulative']
total_confirmed_cases['value'].sum()

#### Renaming location type

In [None]:
df_pv_gua2['location_type'] = df_pv_gua2['location_type'].replace('municipality', 'country')

Deleting duplicated data

In [None]:
df_pv_gua2.duplicated(['report_date', 'value'], keep='last').sum()

In [None]:
df_pv_gua2 = df_pv_gua2.drop_duplicates(subset=['report_date', 'value'], keep='last')

In [None]:
df_pv_gua2.duplicated(['report_date', 'value'], keep='last').sum()

#### New pivot table for Guatemala from the cleaned dataframe df_pv_gua2


In [None]:
df_pv_gua3 = df_pv_gua2.pivot_table("value", ['report_date', 'country', 'data_field'], aggfunc="sum").reset_index().sort_values("report_date", ascending=True)
df_pv_gua3.head()

In [None]:
df_pv_gua3['data_field'].unique()

In [None]:
fig10 = px.line(df_pv_gua3, x="report_date", y="value", color='data_field')
fig10.update_layout(
    title={
        'text': "Zika Virus Cases in Guatemala",
        'x':0.5,
        'xanchor': 'center'})
fig10.show()

### Haiti

In [None]:
df_pv_hai = df_pv[df_pv['country']=='Haiti'].sort_values("report_date", ascending=True)
df_pv_hai.head()

In [None]:
df_pv_hai.tail()

In [None]:
df_pv_hai.shape

In [None]:
df_pv_hai_gp = df_pv_hai.groupby(['location_type', 'data_field'])['value'].sum().reset_index()
df_pv_hai_gp.head()

In [None]:
df_pv_hai.groupby("data_field")['value'].describe().reset_index()

There is only one date representing the statistics of zika in Haiti, so we will make a barplot:

In [None]:
df_pv_hai2 = df.pivot_table("value", ['report_date', 'country', 'data_field'], aggfunc="sum").reset_index()
df_pv_hai2 = df_pv_hai2[df_pv_hai2['country'] == 'Haiti'].sort_values("report_date", ascending=True)
df_pv_hai2.head()

In [None]:
fig11 = px.bar(df_pv_hai2, x="data_field", y="value")
fig11.update_layout(
    title={
        'text': "Zika Virus Cases in Haiti",
        'x':0.5,
        'xanchor': 'center'})
fig11.show()

### Mexico

In [None]:
df_pv_mex = df_pv[df_pv['country']=='Mexico'].sort_values("report_date", ascending=True)
df_pv_mex.head()

In [None]:
df_pv_mex.shape

In [None]:
df_pv_mex['data_field'].value_counts()

In [None]:
df_pv_mex['location_type'].unique()

**All data was reported by state, so we will consider that summing up all values per data field will give us the country's numbers for the respective datafield.**

There is one weekly data category and two yearly cumulative categories (male and female). The male and female cumulatives are merged into one category:


In [None]:
df_pv_mex['data_field'] = df_pv_mex['data_field'].replace(['yearly_cumulative_female', 'yearly_cumulative_male'], 'yearly_cumulative')

In [None]:
df_pv_mex.groupby("data_field")['value'].describe().reset_index()

#### New pivot table for Mexico to summarize data by data field and report date only

In [None]:
df_pv_mex2 = df_pv_mex.pivot_table("value", ['report_date', 'country', 'data_field'], aggfunc="sum").reset_index()
df_pv_mex2 = df_pv_mex2[df_pv_mex2['country'] == 'Mexico'].sort_values("report_date", ascending=True)
df_pv_mex2.head()

#### Plotting the data for Mexico

In [None]:
fig12 = px.line(df_pv_mex2, x="report_date", y="value", color='data_field')
fig12.update_layout(
    title={
        'text': "Zika Virus Cases in Mexico",
        'x':0.5,
        'xanchor': 'center'})
fig12.show()

### Nicaragua

In [None]:
df_pv_nic = df_pv[df_pv['country']=='Nicaragua'].sort_values("report_date", ascending=True)
df_pv_nic.head()

In [None]:
df_pv_nic.shape

In [None]:
df_pv_nic['data_field'].value_counts()

In [None]:
df_pv_nic['location_type'].unique()

There are 4 types of location so we need to check whether there are redundant data.

In [None]:
df_pv_nic_gp = df_pv_nic.groupby(['location_type', 'data_field'])['value'].sum().reset_index()
df_pv_nic_gp.head(20)

**Country** 
- total zika confirmed = 145
- total zika cumulative = 3805
- total zika new suspected = 386

**City**
- total zika confirmed = 5
- total zika confirmed cumulative = 182

**District**
- total zika confirmed = 10

**Municipality**
- total zika confirmed = 30

There are data fields being reported by various authorities. We will consider from now on only the records of the data reported by the country. 
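A minimal sketch of restricting the aggregation to country-level rows, using hypothetical numbers. The point is that the filter has to be applied before the pivot, otherwise the lower-level rows are summed into the total as well.

```python
import pandas as pd

# Hypothetical rows: the same field reported at three levels on one date
nic = pd.DataFrame({
    "report_date": pd.to_datetime(["2016-03-01"] * 3),
    "country": ["Nicaragua"] * 3,
    "location_type": ["country", "city", "municipality"],
    "data_field": ["total_zika_confirmed_cumulative"] * 3,
    "value": [50, 20, 30],
})

# Keep only rows reported at the country level, then aggregate
country_only = nic[nic["location_type"] == "country"]
summary = country_only.pivot_table(
    values="value",
    index=["report_date", "country", "data_field"],
    aggfunc="sum",
).reset_index()

print(summary)
```

Without the filter, the same pivot would report 100 for this date instead of 50.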

#### New pivot table for Nicaragua considering only country records

In [None]:
df_pv_country = df_pv[df_pv['location_type'] == 'country']
# Build the pivot from the country-level rows only, as stated above
df_pv_nic2 = df_pv_country.pivot_table("value", ['report_date', 'country', 'data_field'], aggfunc="sum").reset_index()
df_pv_nic2 = df_pv_nic2[df_pv_nic2['country'] == 'Nicaragua'].sort_values("report_date", ascending=True)
df_pv_nic2.head()

In [None]:
df_pv_nic2.groupby("data_field")['value'].describe().reset_index()

In [None]:
df_pv_nic_zika = df_pv_nic2[df_pv_nic2['data_field'].isin(['total_zika_confirmed_cumulative', 
                                                'total_zika_new_suspected'])]


In [None]:
fig13 = px.line(df_pv_nic_zika, x="report_date", y="value", color='data_field')
fig13.update_layout(
    title={
        'text': "Zika Virus Cases in Nicaragua",
        'x':0.5,
        'xanchor': 'center'})
fig13.show()

### Panama

In [None]:
df_pv_pan = df_pv[df_pv['country']=='Panama'].sort_values("report_date", ascending=True)
df_pv_pan.head()

In [None]:
df_pv_pan.tail()

In [None]:
df_pv_pan.shape

In [None]:
df_pv_pan['data_field'].value_counts()

In [None]:
df_pv_pan['location_type'].unique()

There are 4 different authorities reporting cases so we need to check if there are any redundancies. 

In [None]:
df_pv_pan_gp = df_pv_pan.groupby(['location_type', 'data_field'])['value'].sum().reset_index()
df_pv_pan_gp.shape

In [None]:
df_pv_pan_gp.head(25)

- **Provinces** reported cases in pregnant women.
- **Districts** reported laboratory-confirmed cases.
- **Counties** reported laboratory-confirmed cases.
- **Country** reported confirmed cases by age group and gender.

Sum of cases by age group and gender

In [None]:
total_gender= df_pv_pan_gp.iloc[0:2].sum()
print(total_gender)

In [None]:
total_ages= df_pv_pan_gp.iloc[2:13].sum()
print(total_ages)

In [None]:
#Country weekly confirmed cases:

total_weekly= df_pv_pan_gp.iloc[13:17].sum()
print(total_weekly)

In [None]:
#Confirmed cases in labs according to counties (2015 and 2016)

total_labs= df_pv_pan_gp.iloc[17:19].sum()
print(total_labs)

From this, we can take 182 as the total number of confirmed cases, a figure the country reports through several different breakdowns. This is the number used to plot the graph.
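The positional `iloc` slices above depend on the row order of the grouped table. As a hedged alternative, the totals can be selected by field name instead. In the sketch below the values are made up and the `_age_` field names are hypothetical; only `Zika_confirmed_F` and `Zika_confirmed_M` appear in this notebook.

```python
import pandas as pd

# Hypothetical country-level rows; the age-group field names are assumptions
pan = pd.DataFrame({
    "location_type": ["country"] * 4,
    "data_field": ["Zika_confirmed_F", "Zika_confirmed_M",
                   "Zika_confirmed_age_0-9", "Zika_confirmed_age_10-19"],
    "value": [6, 4, 3, 7],
})

# Total by gender: select the two gender fields explicitly
gender_mask = pan["data_field"].isin(["Zika_confirmed_F", "Zika_confirmed_M"])
gender_total = pan.loc[gender_mask, "value"].sum()

# Total by age group: select fields by a name pattern (assumed naming)
age_mask = pan["data_field"].str.contains("_age_", regex=False)
age_total = pan.loc[age_mask, "value"].sum()

print(gender_total, age_total)
```

If both breakdowns describe the same cases, the two totals should match, which is the consistency check performed above.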

#### New pivot table to sum the data by data_field

In [None]:
df_pv_pan2 = df.pivot_table("value", ['report_date', 'country', 'data_field'], aggfunc="sum").reset_index()
df_pv_pan2 = df_pv_pan2[df_pv_pan2['country'] == 'Panama'].sort_values("report_date", ascending=True)
df_pv_pan2 = df_pv_pan2[df_pv_pan2['data_field'].isin(['Zika_confirmed_M','Zika_confirmed_F' ])]
df_pv_pan2.head()

In [None]:
df_pv_pan2.groupby("data_field")['value'].describe().reset_index()

Panama report contains only one date, so we will plot a barplot:

In [None]:
fig14 = px.bar(df_pv_pan2, x="data_field", y="value")
fig14.update_layout(
    title={
        'text': "Zika Virus Cases in Panama",
        'x':0.5,
        'xanchor': 'center'})
fig14.show()

### Puerto Rico

In [None]:
df_pv_pue = df_pv[df_pv['country']=='Puerto Rico'].sort_values("report_date", ascending=True)
df_pv_pue.head()

In [None]:
df_pv_pue.shape

In [None]:
df_pv_pue['data_field'].value_counts()

In [None]:
df_pv_pue['location_type'].unique()

There is only one location type, so we will consider that summing up the values for the data field will give the total number of cases per data field

In [None]:
df_pv_pue.groupby("data_field")['value'].describe().reset_index()

#### New pivot table for Puerto Rico 

In [None]:
df_pv_pue2 = df.pivot_table("value", ['report_date', 'country', 'data_field'], aggfunc="sum").reset_index()
df_pv_pue2 = df_pv_pue2[df_pv_pue2['country'] == 'Puerto Rico'].sort_values("report_date", ascending=True)
df_pv_pue2.head()

#### Plotting the graph for zika cases

In [None]:
df_pv_pue_zika = df_pv_pue2[df_pv_pue2['data_field'] == 'zika_confirmed_cumulative_2016']
                                                          

In [None]:
fig15 = px.line(df_pv_pue_zika, x="report_date", y="value", color='data_field')
fig15.update_layout(
    title={
        'text': "Zika Virus Cases in Puerto Rico",
        'x':0.5,
        'xanchor': 'center'})
fig15.show()

#### Plotting Guillain Barre Syndrome data for Puerto Rico

In [None]:
df_pv_pue_gbs = df_pv_pue2[df_pv_pue2['data_field'].isin(['GBS_reported_cumulative_2015-2016',
                                                          'GBS_reported_cumulative_2015-2016_zika',
                                                          ])]

In [None]:
fig16 = px.line(df_pv_pue_gbs, x="report_date", y="value", color='data_field')
fig16.update_layout(
    title={
        'text': "Total GBS Cases in Puerto Rico and cases related to Zika",
        'x':0.5,
        'xanchor': 'center'})
fig16.show()
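
Besides plotting both series, the share of GBS cases related to Zika can be computed by reshaping the two data fields into columns. This is a sketch on made-up values; only the two field names come from the cell above, and reading the `_zika` field as a subset of the total is an assumption:

```python
import pandas as pd

gbs_total = "GBS_reported_cumulative_2015-2016"
gbs_zika = "GBS_reported_cumulative_2015-2016_zika"

# Toy values (hypothetical), one row per date and data field.
toy = pd.DataFrame({
    "report_date": ["2016-04-01", "2016-04-01", "2016-04-08", "2016-04-08"],
    "data_field": [gbs_total, gbs_zika, gbs_total, gbs_zika],
    "value": [10, 4, 20, 5],
})

# One column per data field, one row per report date.
wide = toy.pivot(index="report_date", columns="data_field", values="value")

# Fraction of reported GBS cases that are related to Zika.
wide["zika_share"] = wide[gbs_zika] / wide[gbs_total]

print(wide["zika_share"])
```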

#### Congenital developmental defects 

In [None]:
# Both spellings are matched: 'cummulative' (sic) and 'cumulative'.
df_pv_pue_cong = df_pv_pue2[df_pv_pue2['data_field'].isin([
    'congenital_developmental_defects_reported_cummulative_2015-2016',
    'congenital_developmental_defects_reported_cumulative_2015-2016',
])]

In [None]:
df_pv_pue_cong

In [None]:
fig17 = px.line(df_pv_pue_cong, x="report_date", y="value", color='data_field')
fig17.update_layout(
    title={
        'text': "Cumulative Congenital developmental defects in Puerto Rico",
        'x':0.5,
        'xanchor': 'center'})
fig17.show()
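
Because the filter matches two spellings of the same field, the plot can show two separate lines for one quantity. One possible way to merge them, sketched on made-up values, is to rewrite the misspelled label before plotting:

```python
import pandas as pd

misspelled = "congenital_developmental_defects_reported_cummulative_2015-2016"
corrected = "congenital_developmental_defects_reported_cumulative_2015-2016"

# Toy rows (hypothetical values) that mix both spellings.
toy = pd.DataFrame({
    "report_date": ["2016-05-01", "2016-05-08", "2016-05-15"],
    "data_field": [misspelled, misspelled, corrected],
    "value": [1, 2, 3],
})

# Map the misspelled label onto the corrected one so only one series remains.
toy["data_field"] = toy["data_field"].replace(misspelled, corrected)

print(toy["data_field"].unique())
```

This only makes sense if both labels describe the same measurement, which should be checked against the data before merging.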

### United States

In [None]:
df_pv_usa = df_pv[df_pv['country']=='United States'].sort_values("report_date", ascending=True)
df_pv_usa.head()

In [None]:
df_pv_usa.shape

In [None]:
df_pv_usa['data_field'].value_counts()

In [None]:
df_pv_usa['location_type'].unique()

#### Checking whether there are redundant data

In [None]:
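# Select 'value' explicitly so that non-numeric columns (dates, locations) are not summed.
df_pv_usa_gp = df_pv_usa.groupby(['location_type', 'data_field'])['value'].sum().reset_index()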
df_pv_usa_gp.shape

In [None]:
df_pv_usa_gp.head(30)

#### Checking territories:

In [None]:
df_pv_usa_terr = df_pv_usa[df_pv_usa['location_type'] == 'territory']
df_pv_usa_terr.head()

In [None]:
df_pv_usa_terr['data_field'].unique()

In [None]:
len(df_pv_usa_terr['data_field'].unique())

In [None]:
df_pv_usa_terr['location'].unique()

In [None]:
df_pv_usa_pen = df_pv_usa_terr[df_pv_usa_terr['location'] == 'United_States-Pennsylvania††']
df_pv_usa_pen.head()

In [None]:
df_pv_usa_pen['data_field'].unique()

In [None]:
df_pv_usa_county = df_pv_usa[df_pv_usa['location_type'] == 'county']
df_pv_usa_county.head()

In [None]:
df_pv_usa_county['value'].sum()

In [None]:
df_pv_usa_county['location'].unique()

In [None]:
df_pv_usa_county['data_field'].unique()

In [None]:
df_pv_usa_state = df_pv_usa[df_pv_usa['location_type'] == 'state']
df_pv_usa_state.head()

In [None]:
df_pv_usa_state['location'].unique()

**State**:
- States reported Zika travel cases (zika_reported_travel), Zika local cases (zika_reported_local) and yearly_reported_travel_cases.

**County**:
- Counties reported yearly reported travel cases, cases confirmed by labs and zika_reported.
- This location type also contains data from the Virgin Islands.

**Territory**:
- The territory location type has the following data fields: confirmed cases by age, confirmed symptoms, confirmed cases by lab tests and by gender, local cases reported, zika_not, zika_pending, zika_no_specimen, zika_reported, zika_reported_local and zika_reported_travel.
- United_States-Pennsylvania†† reported only local and travel Zika cases.
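
The summary above was assembled by inspecting each location type separately. As a rough alternative, a presence matrix of data fields by location type can be built in one step. The rows below are toy data, with field names taken from the lists above and made-up values:

```python
import pandas as pd

# Toy rows (hypothetical values) covering the three location types.
toy = pd.DataFrame({
    "location_type": ["state", "state", "county", "territory", "territory"],
    "data_field": ["zika_reported_travel", "zika_reported_local",
                   "zika_reported", "zika_reported_travel", "zika_pending"],
    "value": [3, 1, 2, 4, 5],
})

# True where a data field occurs at least once for a location type.
presence = pd.crosstab(toy["data_field"], toy["location_type"]) > 0

# Data fields reported by both states and territories.
state_and_territory = presence[presence["state"] & presence["territory"]].index.tolist()

print(presence)
print(state_and_territory)
```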


#### Checking the common points:

In [None]:
df_pv_usa_state_gp = df_pv_usa_state.groupby('data_field')['value'].sum().reset_index()
df_pv_usa_state_gp.head()

In [None]:
df_pv_usa_state['report_date'].unique()

In [None]:
df_pv_usa_county_gp = df_pv_usa_county.groupby('data_field')['value'].sum().reset_index()
df_pv_usa_county_gp.head()

In [None]:
df_pv_usa_county['report_date'].unique()

In [None]:
df_pv_usa_terr_gp = df_pv_usa_terr.groupby('data_field')['value'].sum().reset_index()
df_pv_usa_terr_gp.head(30)

In [None]:
df_pv_usa_terr['report_date'].unique()

Data collection differs for each location type. According to the official CDC website, the US states had [5,168 cases of Zika in 2016](https://www.cdc.gov/zika/reporting/2016-case-counts.html).

The Kaggle dataset covers the US only from January to June 2016, so we should keep in mind that its confirmed case count must be lower than 5,168.

The official CDC GitHub repo's [data guide for the US](https://github.com/cdcepi/zika/blob/master/United_States/US_Data_Guide.csv) lists only the data fields related to the states, so we will use only the data reported by the US states as our subset for the US.
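
The plausibility check described above could be sketched as follows, on made-up numbers. It assumes that the state-level fields hold cumulative counts, so that the latest report date carries the running total. That is an assumption made for this illustration and should be verified against the data guide:

```python
import pandas as pd

CDC_2016_TOTAL = 5168  # full-year figure for the US states quoted above

# Toy rows (hypothetical values); only the column layout mirrors the dataset.
toy = pd.DataFrame({
    "report_date": pd.to_datetime(["2016-05-01", "2016-05-01",
                                   "2016-06-01", "2016-06-01", "2016-06-01"]),
    "location_type": ["state", "state", "state", "state", "county"],
    "data_field": ["zika_reported_travel", "zika_reported_local",
                   "zika_reported_travel", "zika_reported_local", "zika_reported"],
    "value": [400, 0, 700, 1, 50],
})

# Keep only state-level rows, as decided above.
states = toy[toy["location_type"] == "state"]

# Assuming cumulative counts, the latest report date holds the running total.
latest = states[states["report_date"] == states["report_date"].max()]
latest_total = int(latest["value"].sum())

# A mid-year total should not exceed the full-year CDC figure.
is_plausible = latest_total <= CDC_2016_TOTAL

print(latest_total, is_plausible)
```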

In [None]:
df_pv_usa.groupby("data_field")['value'].describe().reset_index()

#### New pivot table for the US (only the states)

In [None]:
df_pv_state = df_pv[df_pv['location_type'] == 'state']
df_pv_usa2 = df_pv_state.pivot_table("value", ['report_date', 'country', 'data_field'], aggfunc="sum").reset_index()
df_pv_usa2 = df_pv_usa2[df_pv_usa2['country'] == 'United States'].sort_values("report_date", ascending=True)
df_pv_usa2.head()

In [None]:
fig18 = px.line(df_pv_usa2, x="report_date", y="value", color='data_field')
fig18.update_layout(
    title={
        'text': "Zika Cases in the USA in 2016",
        'x':0.5,
        'xanchor': 'center'})
fig18.show()

#### USA Cases in Territories

In [None]:
df_pv_usa_terr.head()

In [None]:
df_pv_usa_terr2 = df_pv_usa_terr.sort_values("report_date", ascending=True)
# .copy() gives an independent frame, so editing 'location' later avoids a possible SettingWithCopyWarning.
df_pv_usa_terr2 = df_pv_usa_terr2[df_pv_usa_terr2['location'].isin([
    'United_States_Virgin_Islands',
    'United_States-Puerto_Rico',
    'United_States-American_Samoa',
    'United_States-US_Virgin_Islands',
])].copy()


In [None]:
df_pv_usa_terr2['location'] = df_pv_usa_terr2['location'].replace(
    'United_States-US_Virgin_Islands',
    'United_States_Virgin_Islands')

In [None]:
df_pv_usa_terr2['data_field'].unique()

In [None]:
df_pv_usa_terr_zika = df_pv_usa_terr2[df_pv_usa_terr2['data_field'].isin(['zika_reported_travel',
                                                                          'zika_reported_local'])]

In [None]:
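# line_dash separates travel and local cases, which would otherwise be drawn on one line per location.
fig19 = px.line(df_pv_usa_terr_zika, x="report_date", y="value", color='location', line_dash='data_field')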
fig19.update_layout(
    title={
        'text': "Zika Cases in the USA Territories in 2016",
        'x':0.5,
        'xanchor': 'center'})
fig19.show()

## Joining all Cleaned Country Dataframes

In [None]:
#List of datasets:
data_list = [df_pv_br_zika, df_pv_br_mic, df_pv_arg2, df_pv_col2, df_pv_dom_zika, 
             df_pv_dom_gbs, df_pv_ecu_zika, df_pv_els_zika, df_pv_gua3, df_pv_hai2,
             df_pv_mex2, df_pv_nic_zika, df_pv_pan2, df_pv_pue_zika, df_pv_pue_gbs, df_pv_usa2]

In [None]:
data = pd.concat(data_list)

In [None]:
type(data)

In [None]:
data.head()

In [None]:
data['country'].unique()

In [None]:
data.drop(['location', 'location_type'], axis=1, inplace=True)
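
The concatenate-then-drop steps above can be illustrated on two toy frames, one with location columns and one without. The frames and values are hypothetical. `ignore_index=True` and `errors="ignore"` are optional variations on the cells above, not what the notebook itself uses:

```python
import pandas as pd

# Toy frame that still carries location columns (hypothetical values).
with_location = pd.DataFrame({
    "report_date": ["2016-03-01"],
    "country": ["Country_A"],
    "location": ["Country_A-Region_1"],
    "location_type": ["state"],
    "data_field": ["cases"],
    "value": [5],
})

# Toy frame already aggregated to country level, without location columns.
country_level = pd.DataFrame({
    "report_date": ["2016-03-01"],
    "country": ["Country_B"],
    "data_field": ["cases"],
    "value": [9],
})

# ignore_index=True renumbers the rows instead of repeating each frame's index.
combined = pd.concat([with_location, country_level], ignore_index=True)

# errors="ignore" keeps the drop from failing when a column is absent.
combined = combined.drop(columns=["location", "location_type"], errors="ignore")

print(combined)
```

Frames that lack the location columns get missing values for them after the concatenation, so dropping those columns leaves a table where every row has the same four fields.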

In [None]:
data.head()

In [None]:
data['data_field'].unique()

In [None]:
list_zika = [df_pv_br_zika, df_pv_arg2, df_pv_col2, df_pv_dom_zika, 
             df_pv_ecu_zika, df_pv_els_zika, df_pv_gua3, df_pv_hai2,
             df_pv_mex2, df_pv_nic_zika, df_pv_pan2, df_pv_pue_zika, df_pv_usa2]

In [None]:
data_zika = pd.concat(list_zika)

In [None]:
data_zika.head()

In [None]:
data_zika.drop(['location', 'location_type'], axis=1, inplace=True)

In [None]:
data_zika['data_field'].unique()

In [None]:
data_zika['data_field'].value_counts()

In [None]:
# index=False leaves out the non-unique row index inherited from pd.concat.
data_zika.to_excel('data_zika.xlsx', index=False)