The data

This dataset shares publicly available data related to the ongoing Zika epidemic. It is being provided as a resource to the scientific community engaged in the public health response. The data provided here is not official and should be considered provisional and non-exhaustive. The data in reports may change over time, reflecting delays in reporting or changes in classifications. And while accurate representation of the reported data is the objective in the machine readable files shared here, that accuracy is not guaranteed. Before using any of these data, it is advisable to review the original reports and sources, which are provided whenever possible along with further information on the CDC Zika epidemic GitHub repo.

The dataset includes the following fields:

report_date - The report date is the date that the report was published. The date should be specified in standard ISO format (YYYY-MM-DD).
location - A location is specified for each observation following the specific names specified in the country place name database. This may be any place with a 'location_type' as listed below, e.g. city, state, country, etc. It should be specified at up to three hierarchical levels in the following format: [country]-[state/province]-[county/municipality/city], always beginning with the country name. If the data is for a particular city, e.g. Salvador, it should be specified: Brazil-Bahia-Salvador.
location_type - A location code is included indicating: city, district, municipality, county, state, province, or country. If there is need for an additional 'location_type', open an Issue to create a new 'location_type'.
data_field - The data field is a short description of what data is represented in the row and is related to a specific definition defined by the report from which it comes.
data_field_code - This code is defined in the country data guide. It consists of a two-letter country code (ISO 3166-1 alpha-2) followed by a 4-digit number corresponding to a specific report type and data type.
time_period - Optional. If the data pertains to a specific period of time, for example an epidemiological week, that number should be indicated here and the type of time period in the 'time_period_type', otherwise it should be NA.
time_period_type - Required only if 'time_period' is specified. Types will also be specified in the country data guide. Otherwise should be NA.
value - The observation indicated for the specific 'report_date', 'location', 'data_field' and when appropriate, 'time_period'.
unit - The unit of measurement for the 'data_field'. This should conform to the 'data_field' unit options as described in the country-specific data guide.
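
To make the 'location' and 'data_field_code' formats concrete, here is a minimal sketch that splits each into its parts. The helper names `split_location` and `split_field_code` are hypothetical and are not part of the dataset or the notebook. The sketch assumes at most three hyphen-separated levels and would mis-split a place name that itself contains a hyphen.

```python
def split_location(location):
    """Split '[country]-[state/province]-[county/municipality/city]' into levels.

    Missing lower levels are returned as None. Assumes place names do not
    contain hyphens themselves (an assumption, not a guarantee of the data).
    """
    parts = str(location).split('-', 2)      # at most three levels
    parts += [None] * (3 - len(parts))       # pad missing levels
    return dict(zip(('country', 'state_province', 'locality'), parts))


def split_field_code(code):
    """Split a code such as 'AR0001' into country code and 4-digit number."""
    return {'country_code': code[:2], 'report_number': code[2:]}


print(split_location('Brazil-Bahia-Salvador'))
print(split_location('Argentina-Buenos_Aires'))
print(split_field_code('AR0001'))
```

Returning plain dictionaries keeps the helpers easy to apply row by row, for example through `Series.apply`.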


In [18]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb
import os

In [2]:
# low_memory=False reads the file in one pass so column dtypes are inferred consistently
z = pd.read_csv('cdc_zika.csv', low_memory=False)

In [3]:
z.head()

Unnamed: 0,report_date,location,location_type,data_field,data_field_code,time_period,time_period_type,value,unit
0,2016-03-19,Argentina-Buenos_Aires,province,cumulative_confirmed_local_cases,AR0001,,,0,cases
1,2016-03-19,Argentina-Buenos_Aires,province,cumulative_probable_local_cases,AR0002,,,0,cases
2,2016-03-19,Argentina-Buenos_Aires,province,cumulative_confirmed_imported_cases,AR0003,,,2,cases
3,2016-03-19,Argentina-Buenos_Aires,province,cumulative_probable_imported_cases,AR0004,,,1,cases
4,2016-03-19,Argentina-Buenos_Aires,province,cumulative_cases_under_study,AR0005,,,127,cases


In [4]:
z.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 107619 entries, 0 to 107618
Data columns (total 9 columns):
report_date         107612 non-null object
location            107612 non-null object
location_type       107612 non-null object
data_field          107612 non-null object
data_field_code     107612 non-null object
time_period         0 non-null float64
time_period_type    0 non-null float64
value               107481 non-null object
unit                107612 non-null object
dtypes: float64(2), object(7)
memory usage: 7.4+ MB
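
The summary above shows 'value' and 'report_date' stored as object (string) columns. One possible way to get numeric and datetime dtypes is sketched below on a small stand-in frame rather than on `z` itself. The entry 'not_a_number' is a made-up placeholder for any non-numeric text and is not taken from the dataset.

```python
import pandas as pd

# Stand-in for a few rows of `z`; the third row is hypothetical.
sample = pd.DataFrame({
    'report_date': ['2016-03-19', '2016-03-19', None],
    'value': ['0', '127', 'not_a_number'],
})

# errors='coerce' turns unparseable entries into NaN / NaT instead of raising
sample['value_num'] = pd.to_numeric(sample['value'], errors='coerce')
sample['report_date'] = pd.to_datetime(sample['report_date'], errors='coerce')

print(sample.dtypes)
print("non-numeric values:", sample['value_num'].isna().sum())
```

Writing the converted numbers to a separate column ('value_num', a name chosen here for illustration) keeps the original strings available for inspecting whatever failed to parse.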


In [5]:
z.columns.tolist()

['report_date',
 'location',
 'location_type',
 'data_field',
 'data_field_code',
 'time_period',
 'time_period_type',
 'value',
 'unit']

In [6]:
# The country is the first hyphen-separated token of 'location'.
# astype(str) turns missing locations into the string 'nan'.
z['country'] = z['location'].astype(str).str.split('-').str[0]
z.country.unique()

array(['Argentina', 'Brazil', 'Norte', 'Nordeste', 'Sudeste', 'Sul',
       'Centro', 'Colombia', 'nan', 'Dominican_Republic', 'Ecuador',
       'El_Salvador', 'Guatemala', 'Haiti', 'Mexico', 'Nicaragua',
       'Panama', 'Puerto_Rico', 'United_States',
       'United_States_Virgin_Islands'], dtype=object)

In [7]:
z['country'].value_counts()

Colombia                        86889
Dominican_Republic               5716
Brazil                           4253
Mexico                           2880
United_States                    2453
Argentina                        2016
El_Salvador                      1000
Ecuador                           796
United_States_Virgin_Islands      509
Guatemala                         480
Puerto_Rico                       260
Panama                            148
Nicaragua                         125
Haiti                              52
nan                                 7
Norte                               7
Nordeste                            7
Sudeste                             7
Centro                              7
Sul                                 7
Name: country, dtype: int64
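
The counts above include labels that are not countries: 'Norte', 'Nordeste', 'Sudeste', 'Sul' and 'Centro' look like Brazilian regions recorded without the 'Brazil-' prefix, and 'nan' comes from missing locations. A minimal sketch for flagging such rows follows. The sample locations are illustrative, and 'Centro-Oeste' in particular is an assumption about what the original 'Centro' row contained.

```python
import numpy as np
import pandas as pd

# Illustrative locations; the real notebook would use z['location'].
locations = pd.Series([
    'Argentina-Buenos_Aires',
    'Brazil-Bahia-Salvador',
    'Norte',
    'Centro-Oeste',   # assumed full form of the 'Centro' label
    np.nan,
])

# First token, without casting to str so missing values stay missing
first_token = locations.str.split('-').str[0]

# Labels from the output above that are not country names
region_labels = {'Norte', 'Nordeste', 'Sudeste', 'Sul', 'Centro'}
suspect = first_token.isna() | first_token.isin(region_labels)

print(locations[suspect])
```

Whether these rows should be dropped or reassigned to Brazil depends on the original reports, so the sketch only flags them.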

In [8]:
# Keep only the United States rows
zus = z[z['country'] == "United_States"]

In [9]:
#Remove unwanted features
zus = zus.drop(['location_type', 'data_field_code', 'time_period_type', 'time_period', 'unit'], axis=1)

In [10]:
zus.head()

Unnamed: 0,report_date,location,data_field,value,country
104657,2016-02-24,United_States-Alabama,zika_reported_travel,1,United_States
104658,2016-02-24,United_States-Alabama,zika_reported_local,0,United_States
104659,2016-02-24,United_States-American_Samoa,zika_reported_local,4,United_States
104660,2016-02-24,United_States-American_Samoa,zika_reported_travel,0,United_States
104661,2016-02-24,United_States-Arkansas,zika_reported_travel,1,United_States


In [11]:
## Creating state/city column
# Keep the last hyphen-separated token of 'location' (the state, territory or county name)
zus['location'] = zus['location'].str.split('-').str[-1]
zus.rename(columns={'location':'state_city'}, inplace=True)

In [12]:
## Cleaning the state_city column: replace underscores with spaces
zus['state_city'] = zus['state_city'].str.replace('_', ' ')
zus.state_city.value_counts()

Florida                 39
Nebraska                34
Illinois                34
Massachusetts           34
Texas                   34
Oregon                  34
Louisiana               34
Virginia                34
Indiana                 34
Iowa                    34
Arkansas                34
Hawaii                  34
American Samoa          34
Maryland                34
US Virgin Islands       34
Minnesota               34
Ohio                    34
California              34
Delaware                34
New York                34
Washington              34
Tennessee               34
Georgia                 34
New Jersey              34
District of Columbia    34
Montana                 32
Puerto Rico             32
Colorado                32
Michigan                32
Missouri                32
                        ..
Leon County             15
Walton County           15
Hamilton County         15
Franklin County         15
IndianRiver County      15
Sumter County           15

In [13]:
zus.shape

(2453, 5)

In [14]:
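## Reshaping the data: one row per (country, state_city, report_date), one column per data_field
us = pd.pivot_table(zus,
                    index=['country', 'state_city', 'report_date'],
                    columns=['data_field'],
                    values=['value'],
                    aggfunc=sum)
# Drop the outer 'value' column level and turn the index back into columns
us = us['value'].reset_index()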

In [15]:
us.head()

data_field,country,state_city,report_date,yearly_reported_travel_cases,zika_reported_local,zika_reported_travel
0,United_States,Alabama,2016-02-24,,0,1
1,United_States,Alabama,2016-03-09,,0,1
2,United_States,Alabama,2016-03-16,,0,1
3,United_States,Alabama,2016-03-23,,0,2
4,United_States,Alabama,2016-04-06,,0,2
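
Because 'value' was loaded as an object column, summing it inside the pivot may concatenate strings rather than add numbers whenever a group holds more than one row. A hedged sketch of the same reshape with an explicit numeric conversion first is shown below on an abbreviated sample. The sample frame is a stand-in built for illustration and is not the notebook's `zus`.

```python
import pandas as pd

# Abbreviated stand-in for `zus`
sample = pd.DataFrame({
    'state_city': ['Alabama', 'Alabama', 'Alabama'],
    'report_date': ['2016-02-24', '2016-02-24', '2016-03-09'],
    'data_field': ['zika_reported_travel', 'zika_reported_local',
                   'zika_reported_travel'],
    'value': ['1', '0', '1'],
})

# Convert before aggregating so that 'sum' adds numbers
sample['value'] = pd.to_numeric(sample['value'], errors='coerce')

wide = pd.pivot_table(sample,
                      index=['state_city', 'report_date'],
                      columns='data_field',
                      values='value',
                      aggfunc='sum').reset_index()
print(wide)
```

Passing `values` as a single string instead of a list avoids the extra column level, so no `['value']` selection is needed afterwards. Combinations missing from the sample show up as NaN.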


In [16]:
## Making report_date the index
print(us.report_date.dtype)
print("++++++++++++++")
us.sort_values("report_date", inplace=True)
us.set_index("report_date", inplace=True)
# Index.to_datetime() was removed from pandas; pd.to_datetime() is the supported call
us.index = pd.to_datetime(us.index)

object
++++++++++++++


In [19]:
## Now creating a year, month and day column
us['year'] = us.index.year
us['month'] = us.index.month
us['day'] = us.index.day

In [20]:
data_categories = sorted(zus['data_field'].unique())
print(os.linesep.join(data_categories))

yearly_reported_travel_cases
zika_reported_local
zika_reported_travel
