In [1]:
import requests
import json
import datetime as dt
import time
import regex as re
import collections
import plotly.express as px
import plotly.graph_objects as go
from collections import Counter
from termcolor import colored
import statistics

import os
from os import path
from wordcloud import WordCloud

import scipy.stats as stats
from scipy.stats import variation
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
%autosave 120

Autosaving every 120 seconds


# Proposal Choice

## Disease Transmission Rate Regression: *How infrastructure of a country affects the spread of disease*

### Problem Statement
The goal of this project is to predict disease transmission rates from a country's current infrastructure. 'Infrastructure' covers any indicator or metric that may affect disease transmission, such as transportation, sanitation, healthcare, food/drug standards, government structure, telecommunication, and public health funding. Each category contains many sub-indicators; healthcare, for example, includes money spent on lab testing, hospital capacity, number of patients per doctor, and cost of medicine. These indicators will be used to predict the transmission rate of a disease and to map its probable spread over time.

### Methods and Models

Because we are studying how the spread of a disease responds over time to changing inputs, we will build a time series model. We will gather infrastructure data and transmission rate data for several diseases, perform EDA, and select features both intuitively and by plotting their relationships with transmission.

### Risks and Assumptions

The main risk is that infrastructure data is updated only once a year and likely shows a significant effect on society only from year to year. Infrastructure changes far more slowly than transmission rates do. As set up, we would have to regress all our indicators on the yearly transmission rate, which is not as fast or reactive a model as we would like. Transmission rates also vary from country to country: some are fairly flat for a given disease and some are exponential, and people are concerned with day-to-day and week-to-week changes. One way around this is to include a 'days since first case' variable alongside the transmission data. We can aim to collect 30-day, 60-day, 90-day, 6-month, and yearly benchmarks for the transmission rates and include that time feature as a critical part of our model.
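As a rough sketch of the 'days since first case' idea, the snippet below derives that variable from a dated series of cumulative counts and reads off 30, 60, and 90 day benchmarks. The numbers, the column names, and the `cases_at` helper are all hypothetical and chosen only for illustration; the rows are assumed to be sorted by date.

```python
import pandas as pd

# Hypothetical cumulative counts for one country (made-up numbers, sorted by date).
cases = pd.DataFrame({
    "date": pd.to_datetime(
        ["2015-01-01", "2015-01-15", "2015-01-31", "2015-03-02", "2015-04-01"]
    ),
    "cumulative_cases": [0, 3, 10, 40, 90],
})

# The first date on which at least one case was reported.
first_case = cases.loc[cases["cumulative_cases"] > 0, "date"].min()
cases["days_since_first_case"] = (cases["date"] - first_case).dt.days


def cases_at(day):
    """Latest cumulative count observed on or before `day` days after the first case."""
    days = cases["days_since_first_case"]
    seen = cases[(days >= 0) & (days <= day)]
    return int(seen["cumulative_cases"].iloc[-1])


benchmarks = {day: cases_at(day) for day in (30, 60, 90)}
print(benchmarks)
```

Sources that only report yearly totals cannot support benchmarks like these, so this would apply only to the dated datasets.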

Transmission rate also depends heavily on how the disease spreads (e.g., direct contact and airborne transmission produce very different rates). To account for this, we will include a dummy variable for the type of transmission.
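One possible way to build that dummy variable is `pd.get_dummies`. The sketch below is only illustrative: the `transmission_type` column and its labels are assumptions, not fields present in any of the datasets loaded in this notebook.

```python
import pandas as pd

# Hypothetical lookup table; the transmission labels are illustrative assumptions.
diseases = pd.DataFrame({
    "disease": ["cholera", "tuberculosis", "malaria"],
    "transmission_type": ["waterborne", "airborne", "vector"],
})

# One indicator column per transmission type. drop_first removes the first
# category (alphabetically "airborne") so the columns are not perfectly collinear.
dummies = pd.get_dummies(
    diseases["transmission_type"], prefix="transmission", drop_first=True, dtype=int
)
encoded = pd.concat([diseases[["disease"]], dummies], axis=1)
print(encoded)
```

With `drop_first=True`, the dropped category becomes the baseline against which the remaining indicators are interpreted.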

### Success Criteria

The success of our model will be based on how well our model predicts the transmission rate of a disease at a point in time given the: 

1) Country

2) Type of disease

3) Infrastructure metrics/inputs

4) Year


### Mapping

We will create two maps. The first plots the number of cases by country for a given disease and year, both of which can be adjusted via user inputs.

The second map will contain the same inputs except with infrastructure sliders so the user can see what the number of cases might look like given a change in spending in a particular metric.

# Data Sources

**Infrastructure Data Sources:**

* World Health Organization (WHO)
* World Bank Group
* Statista
* Organisation for Economic Co-operation and Development


|Disease Dataset| Source|
|---|---|
|[Zika](https://data.world/data-society/zika-virus-epidemic)| data.world|
|[Ebola](https://data.world/brianray/ebola-cases)| data.world|
|[Tuberculosis](http://apps.who.int/gho/data/node.main.1320?lang=en)| WHO|
|[Cholera](http://apps.who.int/gho/data/node.main.175?lang=en)| WHO|
|[Malaria](http://apps.who.int/gho/data/node.main.MALARIAINCIDENCE?lang=en)| WHO|
|[Meningitis](https://apps.who.int/gho/data/node.main.181?lang=en)| WHO|


|Infrastructure Dataset| Source|
|---|---|
|[Health Infrastructure Data](http://apps.who.int/gho/data/view.main.30000)| WHO |
|[Country Infrastructure Data](https://data.worldbank.org/topic/infrastructure)| World Bank |

# Preliminary EDA and Cleaning

## Plan

The goal of our preliminary EDA is to see what information is available to us, the breadth of our data including time frame and countries, the columns/metrics, etc.

More specifically, we want to see which countries have large variation in the number of cases, using the coefficient of variation as our metric. These countries will narrow the scope of regions we want to investigate. Exploring the whole world will most likely produce very general results (e.g., countries with more hospitals have fewer cases of Ebola), and countries whose case counts are relatively stable likely have stable infrastructure as well. The goal of this project will be to show that countries with a volatile number of cases per year and volatile infrastructure need to stick to plan 'X, Y, or Z' in order to stabilize and minimize the number of cases for a disease.

I expect this data to be very clean, as it comes from well-established agencies such as the WHO. Cleaning will likely consist of selecting the countries we want to study, renaming columns to cleaner names, and combining dataframes/columns.
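The coefficient of variation mentioned in the plan is the standard deviation divided by the mean. A minimal sketch of ranking countries by it is shown below, assuming a long-format table with one row per country and year; the frame and its numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical long-format counts (made-up numbers), one row per country and year.
df = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Year": [2014, 2015, 2016, 2014, 2015, 2016],
    "cases": [10.0, 10.0, 10.0, 0.0, 50.0, 100.0],
})

grouped = df.groupby("Country")["cases"]

# Population standard deviation (ddof=0) matches the default of scipy.stats.variation.
cv = (grouped.std(ddof=0) / grouped.mean()).sort_values(ascending=False)
print(cv)
```

Countries whose mean is zero produce an undefined ratio, so they would need to be filtered out or handled separately before ranking.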

In [3]:
cholera = pd.read_csv('../Data/Diseases/cholera.csv')

In [4]:
ebola = pd.read_csv('../Data/Diseases/ebola.csv')

In [5]:
malaria = pd.read_csv('../Data/Diseases/malaria.csv')

In [6]:
mngts = pd.read_csv('../Data/Diseases/meningitis.csv')

In [7]:
tb = pd.read_csv('../Data/Diseases/Tuberculosis.csv')

In [8]:
zika = pd.read_csv('../Data/Diseases/zika.csv')


Columns (7) have mixed types. Specify dtype option on import or set low_memory=False.



In [9]:
tet = pd.read_csv('../Data/Diseases/Ttetanus.csv')

In [10]:
rubella = pd.read_csv('../Data/Diseases/Rubella.csv')

In [11]:
pert = pd.read_csv('../Data/Diseases/Pertussis.csv')

In [12]:
mumps = pd.read_csv('../Data/Diseases/Mumps.csv')

In [13]:
measles = pd.read_csv('../Data/Diseases/Measles.csv')

## Cholera

In [14]:
cholera.head()

Unnamed: 0,Country,Year,Number of reported cases of cholera
0,Afghanistan,2016,677
1,Afghanistan,2015,58064
2,Afghanistan,2014,45481
3,Afghanistan,2013,3957
4,Afghanistan,2012,12


In [15]:
cholera.dtypes

Country                                object
Year                                    int64
Number of reported cases of cholera    object
dtype: object

In [16]:
#checking scope of data
#cholera['Country'].unique()

In [17]:
cholera['Year'].unique()

array([2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2005, 2003,
       2002, 2001, 2000, 1999, 1998, 1997, 1995, 1994, 1993, 1965, 1960,
       2006, 1992, 1990, 1989, 1988, 1987, 1984, 1983, 1980, 1979, 1978,
       1977, 1976, 1975, 1974, 1973, 1972, 1971, 2007, 1996, 1991, 2004,
       1985, 1982, 1981, 1970, 1969, 1968, 1967, 1966, 1964, 1963, 1962,
       1961, 1959, 1958, 1957, 1956, 1955, 1954, 1953, 1952, 1951, 1950,
       1986, 1949])

### Cholera Cleaning

In [18]:
cholera.rename(columns={"Number of reported cases of cholera": "cholera_cases"}, inplace = True)

In [19]:
cholera['cholera_cases'] = cholera['cholera_cases'].str.replace(" ", "")

In [20]:
cholera['cholera_cases'] = cholera['cholera_cases'].astype('float')

In [21]:
cholera.dtypes

Country           object
Year               int64
cholera_cases    float64
dtype: object

In [22]:
cholera = cholera.set_index('Country')

In [23]:
cholera.head()

Country,Year,cholera_cases
Afghanistan,2016,677.0
Afghanistan,2015,58064.0
Afghanistan,2014,45481.0
Afghanistan,2013,3957.0
Afghanistan,2012,12.0


In [24]:
cholera.dtypes

Year               int64
cholera_cases    float64
dtype: object

In [25]:
cholera.to_csv('../Data/Diseases/cleaned_disease/cholera_clean.csv', index = True)

**Notes**: This is a good dataset to use as we have a full list of countries along with data from 1949 to 2016.

* Values are total reported cases per year.

## Ebola

In [26]:
ebola

Unnamed: 0,Indicator,Country,Date,value
0,"Cumulative number of confirmed, probable and s...",Guinea,2015-03-10,3285.0
1,Cumulative number of confirmed Ebola cases,Guinea,2015-03-10,2871.0
2,Cumulative number of probable Ebola cases,Guinea,2015-03-10,392.0
3,Cumulative number of suspected Ebola cases,Guinea,2015-03-10,22.0
4,"Cumulative number of confirmed, probable and s...",Guinea,2015-03-10,2170.0
...,...,...,...,...
17580,"Cumulative number of confirmed, probable and s...",Spain,2016-03-23,0.0
17581,Cumulative number of confirmed Ebola deaths,United States of America,2016-03-23,1.0
17582,Cumulative number of probable Ebola deaths,United States of America,2016-03-23,0.0
17583,Cumulative number of suspected Ebola deaths,United States of America,2016-03-23,0.0


In [27]:
ebola.columns

Index(['Indicator', 'Country', 'Date', 'value'], dtype='object')

In [28]:
ebola['Country'].unique()

array(['Guinea', 'Liberia', 'Sierra Leone', 'United Kingdom', 'Mali',
       'Nigeria', 'Senegal', 'Spain', 'United States of America', 'Italy',
       'Liberia 2', 'Guinea 2'], dtype=object)

In [29]:
ebola['Indicator'].unique()

array(['Cumulative number of confirmed, probable and suspected Ebola cases',
       'Cumulative number of confirmed Ebola cases',
       'Cumulative number of probable Ebola cases',
       'Cumulative number of suspected Ebola cases',
       'Cumulative number of confirmed, probable and suspected Ebola deaths',
       'Cumulative number of confirmed Ebola deaths',
       'Cumulative number of probable Ebola deaths',
       'Cumulative number of suspected Ebola deaths',
       'Number of confirmed Ebola cases in the last 21 days',
       'Number of confirmed, probable and suspected Ebola cases in the last 21 days',
       'Number of probable Ebola cases in the last 21 days',
       'Number of confirmed Ebola cases in the last 7 days',
       'Number of probable Ebola cases in the last 7 days',
       'Number of suspected Ebola cases in the last 7 days',
       'Number of confirmed, probable and suspected Ebola cases in the last 7 days',
       'Proportion of confirmed Ebola cases that a

In [30]:
#checked unique dates, was day by day data from 2014 to 2015

#ebola['Date'].unique()

**Notes**: This dataset is not well suited to the question at hand because it covers a limited number of countries and a limited time range. The Ebola outbreak was highly concentrated in both location and timeframe.
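To quantify how narrow that scope is, one option is to keep a single indicator and summarize the reporting window per country. The sketch below uses a small made-up frame with the same column names as `ebola`; the values are not real counts, and taking the maximum of a cumulative series as the "latest" figure is an assumption that breaks if counts are revised downward.

```python
import pandas as pd

# Made-up rows that mimic the columns of the ebola frame.
sample = pd.DataFrame({
    "Indicator": ["Cumulative number of confirmed Ebola cases"] * 3
    + ["Cumulative number of confirmed Ebola deaths"],
    "Country": ["Guinea", "Guinea", "Mali", "Guinea"],
    "Date": ["2015-03-10", "2016-03-23", "2015-03-10", "2015-03-10"],
    "value": [100.0, 150.0, 5.0, 60.0],
})

# Keep one indicator so cases and deaths are not mixed together.
mask = sample["Indicator"] == "Cumulative number of confirmed Ebola cases"
confirmed = sample[mask].copy()
confirmed["Date"] = pd.to_datetime(confirmed["Date"])

# Reporting window and largest cumulative value per country.
scope = confirmed.groupby("Country").agg(
    first_report=("Date", "min"),
    last_report=("Date", "max"),
    latest_cumulative=("value", "max"),
)
print(scope)
```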

## Malaria

In [31]:
malaria.head()

Unnamed: 0.1,Unnamed: 0,Malaria incidence (per 1 000 population at risk),Malaria incidence (per 1 000 population at risk).1,Malaria incidence (per 1 000 population at risk).2,Malaria incidence (per 1 000 population at risk).3,Malaria incidence (per 1 000 population at risk).4,Malaria incidence (per 1 000 population at risk).5,Malaria incidence (per 1 000 population at risk).6,Malaria incidence (per 1 000 population at risk).7,Malaria incidence (per 1 000 population at risk).8,Malaria incidence (per 1 000 population at risk).9
0,Country,2017.0,2016.0,2015.0,2014.0,2013.0,2012.0,2011.0,2010.0,2005.0,2000.0
1,Afghanistan,23.01,23.0,14.22,11.26,8.75,11.76,19.86,15.92,28.91,92.64
2,Algeria,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.15,0.3
3,Angola,154.97,155.66,154.48,139.97,130.2,123.99,125.54,133.76,210.66,222.39
4,Argentina,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07,1.29,2.37


In [32]:
malaria.columns

Index(['Unnamed: 0', 'Malaria incidence (per 1 000 population at risk)',
       'Malaria incidence (per 1 000 population at risk).1',
       'Malaria incidence (per 1 000 population at risk).2',
       'Malaria incidence (per 1 000 population at risk).3',
       'Malaria incidence (per 1 000 population at risk).4',
       'Malaria incidence (per 1 000 population at risk).5',
       'Malaria incidence (per 1 000 population at risk).6',
       'Malaria incidence (per 1 000 population at risk).7',
       'Malaria incidence (per 1 000 population at risk).8',
       'Malaria incidence (per 1 000 population at risk).9'],
      dtype='object')

In [33]:
#checking scope of data
#malaria['Unnamed: 0'].unique()

In [34]:
malaria.describe()

Unnamed: 0,Malaria incidence (per 1 000 population at risk),Malaria incidence (per 1 000 population at risk).1,Malaria incidence (per 1 000 population at risk).2,Malaria incidence (per 1 000 population at risk).3,Malaria incidence (per 1 000 population at risk).4,Malaria incidence (per 1 000 population at risk).5,Malaria incidence (per 1 000 population at risk).6,Malaria incidence (per 1 000 population at risk).7,Malaria incidence (per 1 000 population at risk).8,Malaria incidence (per 1 000 population at risk).9
count,108.0,108.0,108.0,108.0,108.0,108.0,108.0,108.0,108.0,107.0
mean,100.063889,100.196019,99.479815,100.22963,104.934444,108.130463,110.655185,116.469167,138.670833,155.540841
std,225.550583,226.678626,224.333345,225.993756,229.224004,232.368575,235.231131,235.976373,247.171742,249.7646
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.07,0.14,0.1175,0.1825,0.1525,0.1325,0.27,0.3675,1.275,4.345
50%,5.7,7.17,6.06,7.13,8.005,8.765,9.975,11.865,20.27,39.23
75%,145.0025,142.2275,148.2,146.5425,135.8325,133.265,135.3025,174.265,240.465,287.595
max,2017.0,2016.0,2015.0,2014.0,2013.0,2012.0,2011.0,2010.0,2005.0,2000.0


### Malaria Cleaning

In [35]:
#dtypes as expected
malaria.dtypes

Unnamed: 0                                             object
Malaria incidence (per 1 000 population at risk)      float64
Malaria incidence (per 1 000 population at risk).1    float64
Malaria incidence (per 1 000 population at risk).2    float64
Malaria incidence (per 1 000 population at risk).3    float64
Malaria incidence (per 1 000 population at risk).4    float64
Malaria incidence (per 1 000 population at risk).5    float64
Malaria incidence (per 1 000 population at risk).6    float64
Malaria incidence (per 1 000 population at risk).7    float64
Malaria incidence (per 1 000 population at risk).8    float64
Malaria incidence (per 1 000 population at risk).9    float64
dtype: object

In [36]:
malaria.rename(columns={"Malaria incidence (per 1 000 population at risk)": "2017",
                       "Malaria incidence (per 1 000 population at risk).1": "2016",
                       "Malaria incidence (per 1 000 population at risk).2": "2015",
                       "Malaria incidence (per 1 000 population at risk).3": "2014",
                       "Malaria incidence (per 1 000 population at risk).4": "2013",
                       "Malaria incidence (per 1 000 population at risk).5": "2012",
                       "Malaria incidence (per 1 000 population at risk).6": "2011",
                       "Malaria incidence (per 1 000 population at risk).7": "2010",
                       "Malaria incidence (per 1 000 population at risk).8": "2005",
                       "Malaria incidence (per 1 000 population at risk).9": "2000",
                       'Unnamed: 0': "Country"}, inplace = True)

In [37]:
#dropping the first row, which only holds the year labels now used as column names
malaria.drop(index = 0, inplace = True)

In [38]:
malaria = malaria.set_index('Country')

In [39]:
malaria.head()

Country,2017,2016,2015,2014,2013,2012,2011,2010,2005,2000
Afghanistan,23.01,23.0,14.22,11.26,8.75,11.76,19.86,15.92,28.91,92.64
Algeria,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.15,0.3
Angola,154.97,155.66,154.48,139.97,130.2,123.99,125.54,133.76,210.66,222.39
Argentina,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07,1.29,2.37
Armenia,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05


In [40]:
malaria.to_csv('../Data/Diseases/cleaned_disease/malaria_clean.csv', index = True)

**Notes**: This can be a good dataset to use because it covers many countries over 2000 to 2017. However, only 2000, 2005, and 2010 onward are reported, so we might need to fill in some datapoints for the years in between, which we can do with some simple research. Additionally, malaria has been around for a long time, so there should be a lot of variation in how each country manages it.

* Incidence is per year per 1,000 population at risk.
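The cleaned malaria table is wide (one column per year), whereas the cholera and tuberculosis tables are long (one row per country and year). If a common layout is needed later, a reshape along the following lines could work. This is a sketch on two rows copied from the `malaria.head()` output above, restricted to three years; the `malaria_incidence` column name is an assumption.

```python
import pandas as pd

# Two rows taken from the malaria.head() output, limited to three year columns.
wide = pd.DataFrame(
    {"2017": [23.01, 154.97], "2016": [23.0, 155.66], "2015": [14.22, 154.48]},
    index=pd.Index(["Afghanistan", "Angola"], name="Country"),
)

# Wide to long: one row per (Country, Year) pair.
long = (
    wide.reset_index()
    .melt(id_vars="Country", var_name="Year", value_name="malaria_incidence")
    .astype({"Year": int})
    .sort_values(["Country", "Year"])
    .reset_index(drop=True)
)
print(long)
```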

## Meningitis

In [41]:
mngts.head()

Unnamed: 0.1,Unnamed: 0,Number of meningitis epidemic districts,Number of meningitis epidemic districts.1,Number of meningitis epidemic districts.2,Number of meningitis epidemic districts.3,Number of meningitis epidemic districts.4,Number of meningitis epidemic districts.5,Number of meningitis epidemic districts.6,Number of meningitis epidemic districts.7,Number of meningitis epidemic districts.8,Number of meningitis epidemic districts.9,Number of meningitis epidemic districts.10,Number of meningitis epidemic districts.11
0,Country,2014,2013,2012,2011.0,2010.0,2009.0,2008.0,2007.0,2006.0,2005.0,2004.0,2003.0
1,Benin,1,1,6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
2,Burkina Faso,0,1,13,2.0,12.0,3.0,20.0,43.0,33.0,1.0,5.0,12.0
3,Cameroon,0,0,2,10.0,3.0,5.0,,0.0,,,,
4,Central African Republic,0,0,2,2.0,1.0,1.0,1.0,1.0,0.0,,,


In [42]:
mngts.columns

Index(['Unnamed: 0', 'Number of meningitis epidemic districts',
       'Number of meningitis epidemic districts.1',
       'Number of meningitis epidemic districts.2',
       'Number of meningitis epidemic districts.3',
       'Number of meningitis epidemic districts.4',
       'Number of meningitis epidemic districts.5',
       'Number of meningitis epidemic districts.6',
       'Number of meningitis epidemic districts.7',
       'Number of meningitis epidemic districts.8',
       'Number of meningitis epidemic districts.9',
       'Number of meningitis epidemic districts.10',
       'Number of meningitis epidemic districts.11'],
      dtype='object')

In [43]:
mngts['Unnamed: 0'].unique()

array(['Country', 'Benin', 'Burkina Faso', 'Cameroon',
       'Central African Republic', 'Chad', "Côte d'Ivoire",
       'Democratic Republic of the Congo', 'Ethiopia', 'Gambia', 'Ghana',
       'Guinea', 'Mali', 'Mauritania', 'Niger', 'Nigeria', 'Senegal',
       'South Sudan', 'Sudan', 'Togo'], dtype=object)

In [44]:
mngts.describe()

Unnamed: 0,Number of meningitis epidemic districts.3,Number of meningitis epidemic districts.4,Number of meningitis epidemic districts.5,Number of meningitis epidemic districts.6,Number of meningitis epidemic districts.7,Number of meningitis epidemic districts.8,Number of meningitis epidemic districts.9,Number of meningitis epidemic districts.10,Number of meningitis epidemic districts.11
count,14.0,15.0,15.0,14.0,13.0,14.0,12.0,9.0,9.0
mean,146.857143,137.066667,148.4,150.357143,160.230769,148.5,169.5,224.444444,227.777778
std,536.56498,518.146762,516.650587,534.843558,555.026449,534.712252,578.049149,667.336143,665.739772
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
50%,1.0,3.0,1.0,1.5,1.0,1.5,0.5,1.0,4.0
75%,8.0,6.0,6.0,10.25,4.0,6.5,5.75,5.0,12.0
max,2011.0,2010.0,2009.0,2008.0,2007.0,2006.0,2005.0,2004.0,2003.0


**Notes**: This dataset isn't that great to use because it is limited to a small number of African countries. It can be used if we decide to focus mostly on African countries.

## Tuberculosis

In [45]:
tb.head()

Unnamed: 0,Country,Year,Number of incident tuberculosis cases,Incidence of tuberculosis (per 100 000 population per year),Number of incident tuberculosis cases in children aged 0 - 14,"Number of incident tuberculosis cases, (HIV-positive cases)",Incidence of tuberculosis (per 100 000 population) (HIV-positive cases)
0,Afghanistan,2018,70000 [45000-100000],189 [122-270],14000 [7400-21000],320 [120-640],0.87 [0.31-1.7]
1,Afghanistan,2017,69000 [44000-98000],189 [122-270],,300 [110-580],0.82 [0.3-1.6]
2,Afghanistan,2016,67000 [43000-95000],189 [122-270],,310 [120-600],0.88 [0.33-1.7]
3,Afghanistan,2015,65000 [42000-93000],189 [122-270],,290 [110-560],0.86 [0.33-1.6]
4,Afghanistan,2014,63000 [41000-90000],189 [122-270],,290 [110-560],0.88 [0.34-1.7]


In [46]:
tb.columns

Index(['Country', 'Year', 'Number of incident tuberculosis cases',
       'Incidence of tuberculosis (per 100 000 population per year)',
       'Number of incident tuberculosis cases in children aged 0 - 14',
       'Number of incident tuberculosis cases,  (HIV-positive cases)',
       'Incidence of tuberculosis (per 100 000 population) (HIV-positive cases)'],
      dtype='object')

In [47]:
#checking scope of data
#tb['Country'].unique()

In [48]:
tb['Year'].unique()

array([2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008,
       2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000])

### Tuberculosis Cleaning

In [49]:
#Only care about incidences
tb.drop(axis = 1, columns = ['Incidence of tuberculosis (per 100 000 population per year)',
       'Number of incident tuberculosis cases in children aged 0 - 14',
       'Number of incident tuberculosis cases,  (HIV-positive cases)',
       'Incidence of tuberculosis (per 100 000 population) (HIV-positive cases)'] , inplace = True)

In [50]:
tb.rename(columns={"Number of incident tuberculosis cases": "tuberculosis_incidence",
                       }, inplace = True)

In [51]:
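#keeping only the text before the first space (the point estimate);
#Country is excluded so that multi-word country names are not truncated
value_cols = ['Year', 'tuberculosis_incidence']
tb[value_cols] = tb[value_cols].astype(str).apply(lambda x: x.str.split().str[0])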

In [52]:
tb = tb.set_index('Country')

In [53]:
tb.dtypes

Year                      object
tuberculosis_incidence    object
dtype: object

In [54]:
tb['Year'] = tb['Year'].astype('int')

In [55]:
tb['tuberculosis_incidence'] = tb['tuberculosis_incidence'].astype('int')

In [56]:
tb.head()

Country,Year,tuberculosis_incidence
Afghanistan,2018,70000
Afghanistan,2017,69000
Afghanistan,2016,67000
Afghanistan,2015,65000
Afghanistan,2014,63000


In [57]:
tb.to_csv('../Data/Diseases/cleaned_disease/tb_clean.csv', index = True)

**Notes**: This is a good dataset to use because we have all the countries with data from 2000 to 2018.
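The raw tuberculosis values carry an uncertainty interval in brackets, which the cleaning above discards. If the bounds are wanted later, a regular expression with named groups is one way to keep them. The sketch below runs on strings copied from the `tb.head()` output; the pattern and the `estimate`, `low`, and `high` names are assumptions and may need adjusting for values formatted differently elsewhere in the file.

```python
import pandas as pd

# Strings copied from the tb.head() output above.
raw = pd.Series(["70000 [45000-100000]", "69000 [44000-98000]", "0.87 [0.31-1.7]"])

# Named groups become the column names of the extracted frame.
pattern = r"^\s*(?P<estimate>[\d.]+)\s*\[(?P<low>[\d.]+)-(?P<high>[\d.]+)\]"
parsed = raw.str.extract(pattern).astype(float)
print(parsed)
```

Rows that do not match the pattern (for example, a value with no bracketed interval) come back as NaN rather than raising an error.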

## Zika

In [58]:
zika.head()

Unnamed: 0,report_date,location,location_type,data_field,data_field_code,time_period,time_period_type,value,unit
0,2016-03-19,Argentina-Buenos_Aires,province,cumulative_confirmed_local_cases,AR0001,,,0,cases
1,2016-03-19,Argentina-Buenos_Aires,province,cumulative_probable_local_cases,AR0002,,,0,cases
2,2016-03-19,Argentina-Buenos_Aires,province,cumulative_confirmed_imported_cases,AR0003,,,2,cases
3,2016-03-19,Argentina-Buenos_Aires,province,cumulative_probable_imported_cases,AR0004,,,1,cases
4,2016-03-19,Argentina-Buenos_Aires,province,cumulative_cases_under_study,AR0005,,,127,cases


In [59]:
zika.columns

Index(['report_date', 'location', 'location_type', 'data_field',
       'data_field_code', 'time_period', 'time_period_type', 'value', 'unit'],
      dtype='object')

**Notes**: The Zika data covers a very limited number of locations and time frames.
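A quick way to check that scope is to split the country out of `location` and summarize the report dates. The sketch below assumes the country is always the text before the first hyphen, as in the rows shown above; the `Brazil-Acre` row and its date are hypothetical and included only so there is a second group.

```python
import pandas as pd

# First two rows follow zika.head(); the third row is a hypothetical example.
sample = pd.DataFrame({
    "report_date": ["2016-03-19", "2016-03-19", "2016-04-02"],
    "location": ["Argentina-Buenos_Aires", "Argentina-Buenos_Aires", "Brazil-Acre"],
})

# errors="coerce" turns unparseable dates into NaT instead of raising.
sample["report_date"] = pd.to_datetime(sample["report_date"], errors="coerce")
sample["country"] = sample["location"].str.split("-", n=1).str[0]

# Number of distinct locations and the reporting window per country.
scope = sample.groupby("country").agg(
    locations=("location", "nunique"),
    first_report=("report_date", "min"),
    last_report=("report_date", "max"),
)
print(scope)
```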

## Tetanus

In [60]:
tet.head()

Unnamed: 0,WHO_REGION,ISO_code,Cname,Disease,2018,2017,2016,2015,2014,2013,...,1989,1988,1987,1986,1985,1984,1983,1982,1981,1980
0,EMR,AFG,Afghanistan,ttetanus,53.0,,37.0,74.0,39.0,24.0,...,51.0,951.0,168.0,698.0,2829.0,355.0,912.0,1481.0,1208.0,1618.0
1,EUR,ALB,Albania,ttetanus,0.0,1.0,,,,0.0,...,3.0,5.0,6.0,5.0,4.0,2.0,4.0,1.0,3.0,5.0
2,AFR,DZA,Algeria,ttetanus,1.0,0.0,0.0,0.0,0.0,0.0,...,63.0,50.0,415.0,129.0,343.0,74.0,79.0,100.0,164.0,86.0
3,EUR,AND,Andorra,ttetanus,0.0,,,,,0.0,...,,,,,,,,,,
4,AFR,AGO,Angola,ttetanus,340.0,,,305.0,330.0,360.0,...,2701.0,1631.0,778.0,129.0,893.0,1320.0,1115.0,1398.0,1383.0,1185.0


In [61]:
tet.columns

Index(['WHO_REGION', 'ISO_code', 'Cname', 'Disease', '2018', '2017', '2016',
       '2015', '2014', '2013', '2012', '2011', '2010', '2009', '2008', '2007',
       '2006', '2005', '2004', '2003', '2002', '2001', '2000', '1999', '1998',
       '1997', '1996', '1995', '1994', '1993', '1992', '1991', '1990', '1989',
       '1988', '1987', '1986', '1985', '1984', '1983', '1982', '1981', '1980'],
      dtype='object')

In [62]:
#checking scope of data
#tet['Cname'].unique()

### Tetanus Cleaning

In [63]:
#checking dtypes, all years are in floats. 
#tet.dtypes

In [64]:
#dropping columns we don't want
tet.drop(columns = ['WHO_REGION', 'ISO_code', 'Disease'], inplace = True)

In [65]:
tet.rename(columns={"Cname": "Country"}, inplace = True)

In [66]:
tet = tet.set_index('Country')

In [67]:
tet.head()

Country,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,...,1989,1988,1987,1986,1985,1984,1983,1982,1981,1980
Afghanistan,53.0,,37.0,74.0,39.0,24.0,37.0,20.0,23.0,19.0,...,51.0,951.0,168.0,698.0,2829.0,355.0,912.0,1481.0,1208.0,1618.0
Albania,0.0,1.0,,,,0.0,0.0,0.0,1.0,0.0,...,3.0,5.0,6.0,5.0,4.0,2.0,4.0,1.0,3.0,5.0
Algeria,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,...,63.0,50.0,415.0,129.0,343.0,74.0,79.0,100.0,164.0,86.0
Andorra,0.0,,,,,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
Angola,340.0,,,305.0,330.0,360.0,543.0,953.0,490.0,675.0,...,2701.0,1631.0,778.0,129.0,893.0,1320.0,1115.0,1398.0,1383.0,1185.0


In [68]:
tet.to_csv('../Data/Diseases/cleaned_disease/tet_clean.csv', index = True)

**Notes**: Very good dataset to use because we have data from 1980 to 2018 with all countries available.
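The tetanus file shares its layout (`WHO_REGION`, `ISO_code`, `Cname`, `Disease`, then one column per year) with the rubella, pertussis, and mumps files, so the same three cleaning steps could be wrapped in one function. The helper name `clean_who_wide` is hypothetical, and the sample rows are copied from the `tet.head()` output, restricted to two years.

```python
import pandas as pd


def clean_who_wide(df):
    """Drop the metadata columns and index by country, without modifying df."""
    return (
        df.drop(columns=["WHO_REGION", "ISO_code", "Disease"])
        .rename(columns={"Cname": "Country"})
        .set_index("Country")
    )


# Rows taken from the tet.head() output above, limited to two year columns.
sample = pd.DataFrame({
    "WHO_REGION": ["EMR", "EUR"],
    "ISO_code": ["AFG", "ALB"],
    "Cname": ["Afghanistan", "Albania"],
    "Disease": ["ttetanus", "ttetanus"],
    "2018": [53.0, 0.0],
    "2017": [None, 1.0],
})

cleaned = clean_who_wide(sample)
print(cleaned)
```

Returning a new frame instead of using `inplace = True` keeps the raw data available if a cleaning step needs to be rerun.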

## Rubella

In [69]:
rubella.head()

Unnamed: 0,WHO_REGION,ISO_code,Cname,Disease,2018,2017,2016,2015,2014,2013,...,2007,2006,2005,2004,2003,2002,2001,2000,1999,1998
0,EMR,AFG,Afghanistan,Rubella,37.0,53.0,42.0,59.0,43.0,367.0,...,152.0,196.0,,,,,,,,
1,EUR,ALB,Albania,Rubella,0.0,0.0,2.0,,,0.0,...,0.0,0.0,0.0,0.0,9.0,12.0,10.0,1752.0,15.0,
2,AFR,DZA,Algeria,Rubella,624.0,110.0,13.0,3.0,3.0,414.0,...,,,,,,,,,,
3,EUR,AND,Andorra,Rubella,0.0,0.0,0.0,,,0.0,...,0.0,22.0,0.0,0.0,0.0,,0.0,,,
4,AFR,AGO,Angola,Rubella,31.0,20.0,12.0,230.0,112.0,36.0,...,25.0,14.0,10.0,43.0,22.0,0.0,,,,


In [70]:
rubella.columns

Index(['WHO_REGION', 'ISO_code', 'Cname', 'Disease', '2018', '2017', '2016',
       '2015', '2014', '2013', '2012', '2011', '2010', '2009', '2008', '2007',
       '2006', '2005', '2004', '2003', '2002', '2001', '2000', '1999', '1998'],
      dtype='object')

In [71]:
#checking scope of data
#rubella['Cname'].unique()

### Rubella Cleaning

In [72]:
#column values are expected datatype
rubella.dtypes

WHO_REGION     object
ISO_code       object
Cname          object
Disease        object
2018          float64
2017          float64
2016          float64
2015          float64
2014          float64
2013          float64
2012          float64
2011          float64
2010          float64
2009          float64
2008          float64
2007          float64
2006          float64
2005          float64
2004          float64
2003          float64
2002          float64
2001          float64
2000          float64
1999          float64
1998          float64
dtype: object

In [73]:
#dropping columns we don't want and making column names more interpretable
rubella.drop(columns = ['WHO_REGION', 'ISO_code', 'Disease'], inplace = True)
rubella.rename(columns={"Cname": "Country"}, inplace = True)

In [74]:
rubella = rubella.set_index('Country')

In [75]:
rubella.head()

Country,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,...,2007,2006,2005,2004,2003,2002,2001,2000,1999,1998
Afghanistan,37.0,53.0,42.0,59.0,43.0,367.0,,750.0,46.0,501.0,...,152.0,196.0,,,,,,,,
Albania,0.0,0.0,2.0,,,0.0,1.0,5.0,5.0,0.0,...,0.0,0.0,0.0,0.0,9.0,12.0,10.0,1752.0,15.0,
Algeria,624.0,110.0,13.0,3.0,3.0,414.0,420.0,170.0,212.0,23.0,...,,,,,,,,,,
Andorra,0.0,0.0,0.0,,,0.0,0.0,0.0,0.0,0.0,...,0.0,22.0,0.0,0.0,0.0,,0.0,,,
Angola,31.0,20.0,12.0,230.0,112.0,36.0,65.0,24.0,38.0,10.0,...,25.0,14.0,10.0,43.0,22.0,0.0,,,,


In [76]:
rubella.to_csv('../Data/Diseases/cleaned_disease/rubella_clean.csv', index = True)

**Notes**: Very good dataset because we have data from 1998 to 2018 for all countries.

## Pertussis

In [77]:
pert.head()

Unnamed: 0,WHO_REGION,ISO_code,Cname,Disease,2018,2017,2016,2015,2014,2013,...,1989,1988,1987,1986,1985,1984,1983,1982,1981,1980
0,EMR,AFG,Afghanistan,pertussis,488.0,1.0,0.0,432.0,0.0,371.0,...,1494.0,4587.0,6073.0,5872.0,8531.0,6175.0,10209.0,8528.0,15388.0,15748.0
1,EUR,ALB,Albania,pertussis,19.0,7.0,43.0,,,6.0,...,302.0,508.0,112.0,115.0,172.0,89.0,126.0,312.0,280.0,137.0
2,AFR,DZA,Algeria,pertussis,17.0,6.0,2.0,0.0,0.0,69.0,...,32.0,45.0,69.0,24.0,520.0,894.0,395.0,663.0,967.0,710.0
3,EUR,AND,Andorra,pertussis,2.0,0.0,3.0,16.0,1.0,6.0,...,,,,,,,,,,
4,AFR,AGO,Angola,pertussis,0.0,,0.0,0.0,0.0,0.0,...,21674.0,14343.0,10015.0,6953.0,15846.0,23993.0,28461.0,31429.0,31481.0,54126.0


In [78]:
pert.columns

Index(['WHO_REGION', 'ISO_code', 'Cname', 'Disease', '2018', '2017', '2016',
       '2015', '2014', '2013', '2012', '2011', '2010', '2009', '2008', '2007',
       '2006', '2005', '2004', '2003', '2002', '2001', '2000', '1999', '1998',
       '1997', '1996', '1995', '1994', '1993', '1992', '1991', '1990', '1989',
       '1988', '1987', '1986', '1985', '1984', '1983', '1982', '1981', '1980'],
      dtype='object')

In [79]:
#checking scope of data
#pert['Cname'].unique()

### Pertussis Cleaning

In [80]:
#column dtypes are as expected
#pert.dtypes

In [81]:
#dropping columns we don't want and making column names more interpretable
pert.drop(columns = ['WHO_REGION', 'ISO_code', 'Disease'], inplace = True)
pert.rename(columns={"Cname": "Country"}, inplace = True)

In [82]:
pert = pert.set_index('Country')

In [83]:
pert.head()

Country,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,...,1989,1988,1987,1986,1985,1984,1983,1982,1981,1980
Afghanistan,488.0,1.0,0.0,432.0,0.0,371.0,1497.0,0.0,0.0,0.0,...,1494.0,4587.0,6073.0,5872.0,8531.0,6175.0,10209.0,8528.0,15388.0,15748.0
Albania,19.0,7.0,43.0,,,6.0,16.0,4.0,0.0,10.0,...,302.0,508.0,112.0,115.0,172.0,89.0,126.0,312.0,280.0,137.0
Algeria,17.0,6.0,2.0,0.0,0.0,69.0,104.0,1.0,0.0,1.0,...,32.0,45.0,69.0,24.0,520.0,894.0,395.0,663.0,967.0,710.0
Andorra,2.0,0.0,3.0,16.0,1.0,6.0,3.0,4.0,0.0,0.0,...,,,,,,,,,,
Angola,0.0,,0.0,0.0,0.0,0.0,1259.0,1554.0,2539.0,1127.0,...,21674.0,14343.0,10015.0,6953.0,15846.0,23993.0,28461.0,31429.0,31481.0,54126.0


In [84]:
pert.to_csv('../Data/Diseases/cleaned_disease/pert_clean.csv', index = True)

**Notes**: Also a very good dataset because we have data from 1980 to 2018 for all countries.
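Because the `head()` outputs above show missing values for some countries and years, it may be worth measuring coverage before relying on the full range. The sketch below computes the share of non-missing entries per year and per country on three rows copied from the `pert.head()` output, restricted to three years; the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Values taken from the pert.head() output above, limited to three year columns.
wide = pd.DataFrame(
    {
        "2018": [488.0, 19.0, 0.0],
        "2017": [1.0, 7.0, np.nan],
        "1980": [15748.0, 137.0, 54126.0],
    },
    index=pd.Index(["Afghanistan", "Albania", "Angola"], name="Country"),
)

# Share of countries reporting in each year, and share of years reported per country.
coverage_by_year = wide.notna().mean()
coverage_by_country = wide.notna().mean(axis=1)
print(coverage_by_year)
print(coverage_by_country)
```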

## Mumps

In [85]:
mumps.head()

Unnamed: 0,WHO_REGION,ISO_code,Cname,Disease,2018,2017,2016,2015,2014,2013,...,2007,2006,2005,2004,2003,2002,2001,2000,1999,1998
0,EMR,AFG,Afghanistan,Mumps,,,29.0,,0.0,0.0,...,,,,,,,,,,
1,EUR,ALB,Albania,Mumps,13.0,6.0,17.0,,,20.0,...,824.0,236.0,1696.0,896.0,2236.0,3124.0,1414.0,1651.0,1006.0,
2,AFR,DZA,Algeria,Mumps,,0.0,0.0,67.0,,27.0,...,,,,,,,,,,
3,EUR,AND,Andorra,Mumps,31.0,5.0,5.0,2.0,0.0,2.0,...,4.0,3.0,1.0,2.0,1.0,,4.0,,,
4,AFR,AGO,Angola,Mumps,,,,,,,...,,,,23.0,,0.0,,,,


In [86]:
mumps.columns

Index(['WHO_REGION', 'ISO_code', 'Cname', 'Disease', '2018', '2017', '2016',
       '2015', '2014', '2013', '2012', '2011', '2010', '2009', '2008', '2007',
       '2006', '2005', '2004', '2003', '2002', '2001', '2000', '1999', '1998'],
      dtype='object')

In [87]:
#checking scope of data
#mumps['Cname'].unique()

### Mumps Cleaning

In [88]:
#column dtypes are as expected
mumps.dtypes

WHO_REGION     object
ISO_code       object
Cname          object
Disease        object
2018          float64
2017          float64
2016          float64
2015          float64
2014          float64
2013          float64
2012          float64
2011          float64
2010          float64
2009          float64
2008          float64
2007          float64
2006          float64
2005          float64
2004          float64
2003          float64
2002          float64
2001          float64
2000          float64
1999          float64
1998          float64
dtype: object

In [89]:
#dropping columns we don't want and making column names more interpretable
mumps.drop(columns = ['WHO_REGION', 'ISO_code', 'Disease'], inplace = True)
mumps.rename(columns={"Cname": "Country"}, inplace = True)

In [90]:
mumps = mumps.set_index('Country')

In [91]:
mumps.head()

Country,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,...,2007,2006,2005,2004,2003,2002,2001,2000,1999,1998
Afghanistan,,,29.0,,0.0,0.0,,,0.0,,...,,,,,,,,,,
Albania,13.0,6.0,17.0,,,20.0,18.0,39.0,21.0,22.0,...,824.0,236.0,1696.0,896.0,2236.0,3124.0,1414.0,1651.0,1006.0,
Algeria,,0.0,0.0,67.0,,27.0,0.0,0.0,0.0,,...,,,,,,,,,,
Andorra,31.0,5.0,5.0,2.0,0.0,2.0,1.0,0.0,0.0,0.0,...,4.0,3.0,1.0,2.0,1.0,,4.0,,,
Angola,,,,,,,,,0.0,0.0,...,,,,23.0,,0.0,,,,


In [92]:
mumps.to_csv('../Data/Diseases/cleaned_disease/mumps_clean.csv', index = True)

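**Notes**: Good dataset, data from 1998 to 2018 for all countries, though with many missing country-years (e.g. Afghanistan and Algeria have no values for 1998 to 2007 in the preview above).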

## Measles

In [93]:
measles.head()

Unnamed: 0,WHO_REGION,ISO_code,Cname,Disease,2018,2017,2016,2015,2014,2013,...,1989,1988,1987,1986,1985,1984,1983,1982,1981,1980
0,EMR,AFG,Afghanistan,measles,2012.0,1511.0,638.0,1154.0,492.0,430.0,...,1170.0,4561.0,10357.0,8107.0,14457.0,16199.0,18808.0,20320.0,31107.0,32455.0
1,EUR,ALB,Albania,measles,1469.0,12.0,17.0,,,0.0,...,136034.0,0.0,0.0,0.0,0.0,0.0,17.0,3.0,,
2,AFR,DZA,Algeria,measles,3356.0,112.0,41.0,63.0,0.0,25.0,...,4169.0,2634.0,2500.0,3975.0,20114.0,22553.0,22126.0,29584.0,20849.0,15527.0
3,EUR,AND,Andorra,measles,0.0,0.0,0.0,,,0.0,...,,,,,,,,,,
4,AFR,AGO,Angola,measles,57.0,29.0,53.0,119.0,11699.0,8523.0,...,19820.0,21009.0,13368.0,15580.0,22822.0,22685.0,22589.0,30067.0,19714.0,29656.0


In [94]:
measles.columns

Index(['WHO_REGION', 'ISO_code', 'Cname', 'Disease', '2018', '2017', '2016',
       '2015', '2014', '2013', '2012', '2011', '2010', '2009', '2008', '2007',
       '2006', '2005', '2004', '2003', '2002', '2001', '2000', '1999', '1998',
       '1997', '1996', '1995', '1994', '1993', '1992', '1991', '1990', '1989',
       '1988', '1987', '1986', '1985', '1984', '1983', '1982', '1981', '1980'],
      dtype='object')

In [95]:
#checking scope of data
#measles['Cname'].unique()

### Measles Cleaning

In [96]:
#column dtypes are as expected
measles.dtypes

WHO_REGION     object
ISO_code       object
Cname          object
Disease        object
2018          float64
2017          float64
2016          float64
2015          float64
2014          float64
2013          float64
2012          float64
2011          float64
2010          float64
2009          float64
2008          float64
2007          float64
2006          float64
2005          float64
2004          float64
2003          float64
2002          float64
2001          float64
2000          float64
1999          float64
1998          float64
1997          float64
1996          float64
1995          float64
1994          float64
1993          float64
1992          float64
1991          float64
1990          float64
1989          float64
1988          float64
1987          float64
1986          float64
1985          float64
1984          float64
1983          float64
1982          float64
1981          float64
1980          float64
dtype: object

In [97]:
#dropping columns we don't want and making column names more interpretable
measles.drop(columns = ['WHO_REGION', 'ISO_code', 'Disease'], inplace = True)
measles.rename(columns={"Cname": "Country"}, inplace = True)

In [98]:
measles = measles.set_index('Country')

In [99]:
measles.head()

Country,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,...,1989,1988,1987,1986,1985,1984,1983,1982,1981,1980
Afghanistan,2012.0,1511.0,638.0,1154.0,492.0,430.0,2787.0,3013.0,1989.0,2861.0,...,1170.0,4561.0,10357.0,8107.0,14457.0,16199.0,18808.0,20320.0,31107.0,32455.0
Albania,1469.0,12.0,17.0,,,0.0,9.0,28.0,10.0,0.0,...,136034.0,0.0,0.0,0.0,0.0,0.0,17.0,3.0,,
Algeria,3356.0,112.0,41.0,63.0,0.0,25.0,18.0,112.0,103.0,107.0,...,4169.0,2634.0,2500.0,3975.0,20114.0,22553.0,22126.0,29584.0,20849.0,15527.0
Andorra,0.0,0.0,0.0,,,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
Angola,57.0,29.0,53.0,119.0,11699.0,8523.0,4458.0,1449.0,1190.0,2807.0,...,19820.0,21009.0,13368.0,15580.0,22822.0,22685.0,22589.0,30067.0,19714.0,29656.0


In [100]:
measles.to_csv('../Data/Diseases/cleaned_disease/measles_clean.csv', index = True)

**Notes**: Good dataset, data for all countries from 1980 to 2018.

# Preliminary Disease EDA Findings

The problem we are trying to solve is limited by the time resolution of the data. Infrastructure only changes significantly year over year, and its effects take months to a year to show up in a population, so we are limited to year-by-year information. It was therefore critical that the datasets we use have accurate data over an extended period, going back to at least 2000, and cover a wide range of countries, preferably all of them.

The datasets that cover a time range beginning in 2000 or earlier AND cover a wide range of countries (100+) are:
* Cholera 
* Malaria
* Tuberculosis
* Tetanus
* Rubella
* Pertussis
* Mumps
* Measles
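
The two selection criteria above (data going back to at least 2000, and 100+ countries) can be checked programmatically. The following is a rough sketch, assuming a cleaned wide-format frame indexed by country with year strings as column names; `coverage_summary`, `meets_criteria` and the toy frame are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

def coverage_summary(wide):
    """Summarise the year span and country coverage of a country-by-year frame."""
    years = wide.columns.astype(int)           # assumes all columns are year labels
    has_data = wide.notna().any(axis=1)        # countries with at least one reported year
    return {
        "first_year": int(years.min()),
        "last_year": int(years.max()),
        "countries_with_data": int(has_data.sum()),
        "share_missing": float(wide.isna().to_numpy().mean()),
    }

def meets_criteria(summary, start_year=2000, min_countries=100):
    """True if the data starts by start_year and covers enough countries."""
    return (summary["first_year"] <= start_year
            and summary["countries_with_data"] >= min_countries)

# Toy frame: country B never reports, so it does not count towards coverage
toy = pd.DataFrame(
    {"2018": [1.0, np.nan, 0.0],
     "2000": [2.0, np.nan, 5.0],
     "1998": [np.nan, np.nan, 7.0]},
    index=pd.Index(["A", "B", "C"], name="Country"),
)

summary = coverage_summary(toy)
print(summary)
print(meets_criteria(summary))                   # toy data has far fewer than 100 countries
print(meets_criteria(summary, min_countries=2))
```

Counting only countries with at least one non-missing year is a design choice: a country that is listed but never reports would otherwise inflate the coverage figure.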


# Selecting Countries for Modelling

To study how infrastructure may affect transmission rates, we want to model countries that have a large Coefficient of Variation (COV) in the number of cases over the recorded time period for a given disease. Countries with a relatively stable number of cases per year likely did not undergo much change in their healthcare infrastructure. For example, many first-world countries have stable infrastructure and stable health care systems, so their data is unlikely to be very volatile, especially for the diseases being studied, which have been around for some time.
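
For reference, the COV of a country can be written out as follows. The notation is introduced here for illustration and is not from the original data source: $x_{c,t}$ is the number of cases reported by country $c$ in year $t$, $T_c$ is the set of years with a reported value, and $n_c$ is the number of such years. The $n_c - 1$ denominator matches the default of pandas `std`, which is what the code below relies on.

$$
\bar{x}_c = \frac{1}{n_c}\sum_{t \in T_c} x_{c,t}, \qquad
s_c = \sqrt{\frac{1}{n_c - 1}\sum_{t \in T_c}\left(x_{c,t} - \bar{x}_c\right)^2}, \qquad
\mathrm{COV}_c = \frac{s_c}{\bar{x}_c}
$$

As a small worked example, yearly counts of 10, 0 and 20 give $\bar{x}_c = 10$, $s_c = \sqrt{(0 + 100 + 100)/2} = 10$ and $\mathrm{COV}_c = 1$. A COV above 1 therefore means the year-to-year spread is larger than the average number of cases.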

We will group together similar countries to control for environmental factors as well as cultural beliefs regarding healthcare. We want to see what might account for large changes in cases per year, as well as what countries with few cases are doing to keep them under control.

## Finding Coefficient of Variation (COV) for Each Country

In [143]:
#function takes the dataframe, number of rows to view, and the name to give the csv
def var(df,rows,name):
    
    #gets COV (std/mean across the year columns) for dataframes where the index is country
    #an existing COV column is excluded so re-running the cell gives the same result
    years = df.drop(columns = ['COV'], errors = 'ignore')
    df['COV'] = (years.std(axis = 1))/(years.mean(axis = 1))
    
    #creates a dataframe ordered by COV from highest to lowest
    df_order = df.sort_values(by ='COV' , ascending=False)
    df_order.to_csv('../Data/Diseases/COV/{df}_COV.csv'.format(df = name), index = True)
    
    #prints the n highest COV countries 
    print('\n Highest COV\n')
    print(df_order['COV'].head(rows))
    
    #prints the n lowest COV countries
    print('\nLowest COV\n')
    print(df_order['COV'].tail(rows))
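
The `var` function above adds a `COV` column to its input and writes a CSV. For quick checks, a side-effect-free variant can be useful. The sketch below is one possible version, with `cov_by_country` and the toy frame as hypothetical names; it also shows where `NaN` values in the output can come from: a country with no reported values, or one whose mean is zero (0/0).

```python
import numpy as np
import pandas as pd

def cov_by_country(wide):
    """Return COV (sample std / mean) per row of a country-by-year frame."""
    # Ignore a COV column left behind by an earlier run
    years = wide.drop(columns="COV", errors="ignore")
    mean = years.mean(axis=1)   # NaN years are skipped by default
    std = years.std(axis=1)     # pandas default is ddof=1 (sample std)
    # Highest COV first; NaN results are placed last by sort_values
    return (std / mean).sort_values(ascending=False)

# Toy data (illustrative values only)
toy = pd.DataFrame(
    {"2018": [10.0, 5.0, 0.0, np.nan],
     "2017": [0.0, 5.0, 0.0, np.nan],
     "2016": [20.0, 5.0, 0.0, np.nan]},
    index=pd.Index(["Volatile", "Flat", "AllZero", "AllMissing"], name="Country"),
)

toy_cov = cov_by_country(toy)
print(toy_cov)
```

In this toy frame, `Volatile` has a mean of 10 and a sample standard deviation of 10, so its COV is 1, while `Flat` has a COV of 0. `AllZero` and `AllMissing` both come out as `NaN`, which is worth keeping in mind when reading the tail of the ordered lists.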

## Cholera

In [144]:
#finding Coefficient of Variation of # of cholera cases for each unique country 
cholera_cov = ((cholera.groupby('Country')['cholera_cases'].std())/(cholera.groupby('Country')['cholera_cases'].mean()))
cholera_cov = pd.DataFrame(data=cholera_cov)
cholera_cov.rename(columns={"cholera_cases": "Cholera COV"}, inplace = True)
cholera_order = cholera_cov.sort_values(by ='Cholera COV' , ascending=False)

In [145]:
cholera_order.to_csv('../Data/Diseases/COV/cholera_cov.csv', index = True)

In [103]:
#returning countries with the most COV
cholera_order.head(50)

Country,Cholera COV
Spain,3.513868
Nepal,3.298659
South Africa,2.954268
Zimbabwe,2.859931
Italy,2.833025
Cabo Verde,2.707833
Russian Federation,2.697688
Liberia,2.673549
China,2.563845
Gabon,2.550809


In [146]:
#returning countries with the least COV
cholera_order.tail(50)

Country,Cholera COV
Singapore,1.039722
Nicaragua,1.022871
Democratic Republic of the Congo,1.022443
Algeria,1.014352
Haiti,0.998567
Panama,0.979674
United Kingdom of Great Britain and Northern Ireland,0.959409
Romania,0.923461
Oman,0.916246
Cuba,0.88456


## Tuberculosis

In [147]:
tb_cov = ((tb.groupby('Country')['tuberculosis_incidence'].std())/(tb.groupby('Country')['tuberculosis_incidence'].mean()))
tb_cov = pd.DataFrame(data=tb_cov)
tb_cov.rename(columns={"tuberculosis_incidence": "Tuberculosis COV"}, inplace = True)
tb_order = tb_cov.sort_values(by ='Tuberculosis COV' , ascending=False)

In [148]:
tb_order.to_csv('../Data/Diseases/COV/tb_cov.csv', index = True)

In [149]:
#returning countries with the highest COV
tb_order.head(50)

In [150]:
#returning countries with the least COV
tb_order.tail(50)


## Malaria

In [151]:
var(malaria,50, 'malaria')


 Highest COV

Country
Turkmenistan                             3.305636
Armenia                                  3.261927
Kyrgyzstan                               3.251064
China                                    3.239704
Uzbekistan                               3.185205
Sri Lanka                                3.069808
Paraguay                                 2.899796
Algeria                                  2.812428
Iraq                                     2.704685
Ecuador                                  2.696586
El Salvador                              2.588815
Tajikistan                               2.402563
Azerbaijan                               2.266740
Bhutan                                   2.136711
Botswana                                 2.062346
Turkey                                   2.034178
Saudi Arabia                             2.024306
Georgia                                  1.879621
Iran (Islamic Republic of)               1.860759
Suriname                   

## Tetanus 

In [152]:
var(tet, 50, 'tet')


 Highest COV

Country
Cook Islands                                   4.843421
Tuvalu                                         4.580697
Marshall Islands (the)                         4.580697
Samoa                                          3.746119
San Marino                                     3.661218
Micronesia (Federated States of)               3.634935
Monaco                                         3.486132
Guyana                                         3.360069
Gambia                                         3.160467
Nauru                                          3.085659
Antigua and Barbuda                            3.070620
Djibouti                                       2.954132
Saint Vincent and the Grenadines               2.767050
Saint Kitts and Nevis                          2.651203
Suriname                                       2.602252
Lao People's Democratic Republic (the)         2.492680
Bahrain                                        2.393850
Democratic People's Repub

## Rubella

In [153]:
var(rubella, 50, 'rubella')


 Highest COV

Country
Albania                             4.199247
Greece                              4.157580
Suriname                            4.083194
Costa Rica                          3.980867
Bhutan                              3.957942
Cuba                                3.737685
Barbados                            3.578887
Niue                                3.533510
Samoa                               3.451656
Andorra                             3.413994
Guyana                              3.403458
Bulgaria                            3.382974
Latvia                              3.329485
Jamaica                             3.313333
Syrian Arab Republic (the)          3.261628
Canada                              3.245371
Bahamas (the)                       3.204119
Jordan                              3.176447
Honduras                            3.159338
Kyrgyzstan                          3.157190
Sweden                              3.140075
Netherlands (the)               

## Pertussis

In [154]:
var(pert, 50 ,'pert')


 Highest COV

Country
Antigua and Barbuda                 6.284813
Saint Lucia                         5.756746
Marshall Islands (the)              5.320070
Tonga                               5.183123
Nauru                               4.833133
Saint Vincent and the Grenadines    4.503061
Namibia                             4.304027
Seychelles                          4.178599
Cook Islands                        4.174728
Tuvalu                              3.825156
Dominica                            3.514560
Maldives                            3.429676
Jamaica                             3.425571
Guyana                              3.354611
Congo (the)                         3.294129
Tunisia                             3.273034
Sao Tome and Principe               3.241872
Vanuatu                             3.209972
Mauritius                           3.058900
Saudi Arabia                        3.022489
El Salvador                         3.006700
Iraq                            

## Mumps

In [155]:
var(mumps,80 ,'mumps')


 Highest COV

Country
Cook Islands                        3.951334
Saint Lucia                         3.767294
Micronesia (Federated States of)    3.593499
Latvia                              3.356046
Niue                                3.229989
                                      ...   
Belize                              1.519708
Turkey                              1.508161
Saudi Arabia                        1.504404
Zimbabwe                            1.501515
Ireland                             1.466160
Name: COV, Length: 80, dtype: float64

Lowest COV

Country
Bahrain                        0.943274
Uzbekistan                     0.937163
Qatar                          0.884452
Australia                      0.882269
Argentina                      0.861350
                                 ...   
Togo                                NaN
Tunisia                             NaN
Uganda                              NaN
United Republic of Tanzania         NaN
Viet Nam               

## Measles

In [156]:
var(measles, 50, 'measles')


 Highest COV

Country
Democratic People's Republic of Korea (the)    6.214219
Antigua and Barbuda                            5.683466
Palau                                          5.606546
Albania                                        5.470978
Saint Lucia                                    5.386755
Marshall Islands (the)                         5.209738
Grenada                                        5.170341
Uruguay                                        4.868620
Bahamas (the)                                  4.811027
Micronesia (Federated States of)               4.503446
Mauritius                                      4.475759
Saint Vincent and the Grenadines               4.411792
Hungary                                        4.244087
Niue                                           3.902340
Nauru                                          3.752777
Samoa                                          3.671533
Monaco                                         3.614141
Tuvalu                   

# Coefficient of Variation Findings