In [1]:
import statistics
import scipy.stats as stats
from scipy.stats import variation
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
%autosave 120

Autosaving every 120 seconds


In [3]:
pd.set_option("display.max_rows", 500)

## Table of Contents
- [Reading in Data](#read) 
- [Disease EDA & Cleaning](#eda)
    - [Cholera EDA & Cleaning](#cholera)
    - [Ebola EDA & Cleaning](#ebola)
    - [Malaria EDA & Cleaning](#malaria)
    - [Meningitis EDA & Cleaning](#meningitis)
    - [Tuberculosis EDA & Cleaning](#tuberculosis)
    - [Zika EDA & Cleaning](#zika)
    - [Tetanus EDA & Cleaning](#tetanus)
    - [Rubella EDA & Cleaning](#rubella)
    - [Pertussis EDA & Cleaning](#pert)
    - [Mumps EDA & Cleaning](#mumps)
    - [Measles EDA & Cleaning](#measles)
- [EDA Findings](#find)
- [Selecting Countries for Modeling](#select)
    - [Cholera COV](#chol_cov)
    - [Malaria COV](#malaria_cov)
    - [Meningitis COV](#men_cov)
    - [Tuberculosis COV](#tb_cov)
    - [Tetanus COV](#tet_cov)
    - [Rubella COV](#rub_cov)
    - [Pertussis COV](#pert_cov)
    - [Mumps COV](#mumps_cov)
    - [Measles COV](#measles_cov)
    
- [COV Findings and Country Selection](#select2)
    
  
    


   

# Reading in Data

In [4]:
cholera = pd.read_csv('../Data/Diseases/cholera.csv')
ebola = pd.read_csv('../Data/Diseases/ebola.csv')
malaria = pd.read_csv('../Data/Diseases/malaria.csv')
mngts = pd.read_csv('../Data/Diseases/meningitis.csv')
tb = pd.read_csv('../Data/Diseases/Tuberculosis.csv')
zika = pd.read_csv('../Data/Diseases/zika.csv')
tet = pd.read_csv('../Data/Diseases/Ttetanus.csv')
rubella = pd.read_csv('../Data/Diseases/Rubella.csv')
pert = pd.read_csv('../Data/Diseases/Pertussis.csv')
mumps = pd.read_csv('../Data/Diseases/Mumps.csv')
measles = pd.read_csv('../Data/Diseases/Measles.csv')

  interactivity=interactivity, compiler=compiler, result=result)


# Disease EDA and Cleaning<a id='eda'></a>

The goal of our preliminary EDA is to see what information is available to us, the breadth of our data including time frame and countries, the columns/metrics, etc.

Our goal for this preliminary EDA is to see which countries have large variation in the number of cases using Coeffecient of Variation as our metric. These countries will provide us with a smaller scope as to which regions we want to investigate. Exploring the world will most likely produce very general results (i.e., countries with more hospitals have less cases of Ebola) and studying countries that are relatively stable in terms of cases likely have stable infrastructure. The goal of this project will be to show that countries with volatile number of cases per year and varying infrastructure. need to stick to plan 'X Y or Z' in order to stabilize and minimize the number of cases for ad disease.

I expect that this data is very clean as it is coming from well established agencies such as WHO. Cleaning will possibly consists of just selecting the countries we want to study, renaming columns to cleaner names, and combining dataframes/columns.

## Cholera<a id='cholera'></a>

In [5]:
cholera.head()

Unnamed: 0,Country,Year,Number of reported cases of cholera
0,Afghanistan,2016,677
1,Afghanistan,2015,58064
2,Afghanistan,2014,45481
3,Afghanistan,2013,3957
4,Afghanistan,2012,12


In [6]:
cholera.dtypes

Country                                object
Year                                    int64
Number of reported cases of cholera    object
dtype: object

In [7]:
#checking scope of data
#cholera['Country'].unique()

In [8]:
cholera['Year'].unique()

array([2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2005, 2003,
       2002, 2001, 2000, 1999, 1998, 1997, 1995, 1994, 1993, 1965, 1960,
       2006, 1992, 1990, 1989, 1988, 1987, 1984, 1983, 1980, 1979, 1978,
       1977, 1976, 1975, 1974, 1973, 1972, 1971, 2007, 1996, 1991, 2004,
       1985, 1982, 1981, 1970, 1969, 1968, 1967, 1966, 1964, 1963, 1962,
       1961, 1959, 1958, 1957, 1956, 1955, 1954, 1953, 1952, 1951, 1950,
       1986, 1949])

### Cholera Cleaning

In [9]:
cholera.rename(columns={"Number of reported cases of cholera": "cholera_cases"}, inplace = True)

In [10]:
cholera['cholera_cases'] = cholera['cholera_cases'].str.replace(" ", "")

In [11]:
cholera['cholera_cases'] = cholera['cholera_cases'].astype('float')

In [12]:
cholera.dtypes

Country           object
Year               int64
cholera_cases    float64
dtype: object

In [13]:
cholera = cholera.set_index('Country')

In [14]:
cholera.head()

Unnamed: 0_level_0,Year,cholera_cases
Country,Unnamed: 1_level_1,Unnamed: 2_level_1
Afghanistan,2016,677.0
Afghanistan,2015,58064.0
Afghanistan,2014,45481.0
Afghanistan,2013,3957.0
Afghanistan,2012,12.0


In [15]:
cholera.dtypes

Year               int64
cholera_cases    float64
dtype: object

In [16]:
cholera.to_csv('../Data/Diseases/cleaned_disease/cholera_clean.csv', index = True)

**Notes**: This is a good dataset to use as we have a full list of countries along with data from 1949 to 2016.

* Incidence is totals per year.

## Ebola<a id='ebola'></a>

In [17]:
ebola.head()

Unnamed: 0,Indicator,Country,Date,value
0,"Cumulative number of confirmed, probable and s...",Guinea,2015-03-10,3285.0
1,Cumulative number of confirmed Ebola cases,Guinea,2015-03-10,2871.0
2,Cumulative number of probable Ebola cases,Guinea,2015-03-10,392.0
3,Cumulative number of suspected Ebola cases,Guinea,2015-03-10,22.0
4,"Cumulative number of confirmed, probable and s...",Guinea,2015-03-10,2170.0


In [18]:
ebola.columns

Index(['Indicator', 'Country', 'Date', 'value'], dtype='object')

In [19]:
ebola['Country'].unique()

array(['Guinea', 'Liberia', 'Sierra Leone', 'United Kingdom', 'Mali',
       'Nigeria', 'Senegal', 'Spain', 'United States of America', 'Italy',
       'Liberia 2', 'Guinea 2'], dtype=object)

In [20]:
ebola['Indicator'].unique()

array(['Cumulative number of confirmed, probable and suspected Ebola cases',
       'Cumulative number of confirmed Ebola cases',
       'Cumulative number of probable Ebola cases',
       'Cumulative number of suspected Ebola cases',
       'Cumulative number of confirmed, probable and suspected Ebola deaths',
       'Cumulative number of confirmed Ebola deaths',
       'Cumulative number of probable Ebola deaths',
       'Cumulative number of suspected Ebola deaths',
       'Number of confirmed Ebola cases in the last 21 days',
       'Number of confirmed, probable and suspected Ebola cases in the last 21 days',
       'Number of probable Ebola cases in the last 21 days',
       'Number of confirmed Ebola cases in the last 7 days',
       'Number of probable Ebola cases in the last 7 days',
       'Number of suspected Ebola cases in the last 7 days',
       'Number of confirmed, probable and suspected Ebola cases in the last 7 days',
       'Proportion of confirmed Ebola cases that a

In [21]:
#checked unqieu dates, was day by day data from 2014 to 2015

#ebola['Date'].unique()

**Notes**: This dataset isn't that great to use because we have a limited number of countries as well as a limited time range. This dataset isn't good for the question at hand. The reason for this is because Ebola was a very concentrated disease both in location and timeframe.

## Malaria<a id='malaria'></a>

In [22]:
malaria.head()

Unnamed: 0.1,Unnamed: 0,Malaria incidence (per 1 000 population at risk),Malaria incidence (per 1 000 population at risk).1,Malaria incidence (per 1 000 population at risk).2,Malaria incidence (per 1 000 population at risk).3,Malaria incidence (per 1 000 population at risk).4,Malaria incidence (per 1 000 population at risk).5,Malaria incidence (per 1 000 population at risk).6,Malaria incidence (per 1 000 population at risk).7,Malaria incidence (per 1 000 population at risk).8,Malaria incidence (per 1 000 population at risk).9
0,Country,2017.0,2016.0,2015.0,2014.0,2013.0,2012.0,2011.0,2010.0,2005.0,2000.0
1,Afghanistan,23.01,23.0,14.22,11.26,8.75,11.76,19.86,15.92,28.91,92.64
2,Algeria,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.15,0.3
3,Angola,154.97,155.66,154.48,139.97,130.2,123.99,125.54,133.76,210.66,222.39
4,Argentina,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07,1.29,2.37


In [23]:
malaria.columns

Index(['Unnamed: 0', 'Malaria incidence (per 1 000 population at risk)',
       'Malaria incidence (per 1 000 population at risk).1',
       'Malaria incidence (per 1 000 population at risk).2',
       'Malaria incidence (per 1 000 population at risk).3',
       'Malaria incidence (per 1 000 population at risk).4',
       'Malaria incidence (per 1 000 population at risk).5',
       'Malaria incidence (per 1 000 population at risk).6',
       'Malaria incidence (per 1 000 population at risk).7',
       'Malaria incidence (per 1 000 population at risk).8',
       'Malaria incidence (per 1 000 population at risk).9'],
      dtype='object')

In [24]:
#checking scope of data
#malaria['Unnamed: 0'].unique()

In [25]:
malaria.describe()

Unnamed: 0,Malaria incidence (per 1 000 population at risk),Malaria incidence (per 1 000 population at risk).1,Malaria incidence (per 1 000 population at risk).2,Malaria incidence (per 1 000 population at risk).3,Malaria incidence (per 1 000 population at risk).4,Malaria incidence (per 1 000 population at risk).5,Malaria incidence (per 1 000 population at risk).6,Malaria incidence (per 1 000 population at risk).7,Malaria incidence (per 1 000 population at risk).8,Malaria incidence (per 1 000 population at risk).9
count,108.0,108.0,108.0,108.0,108.0,108.0,108.0,108.0,108.0,107.0
mean,100.063889,100.196019,99.479815,100.22963,104.934444,108.130463,110.655185,116.469167,138.670833,155.540841
std,225.550583,226.678626,224.333345,225.993756,229.224004,232.368575,235.231131,235.976373,247.171742,249.7646
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.07,0.14,0.1175,0.1825,0.1525,0.1325,0.27,0.3675,1.275,4.345
50%,5.7,7.17,6.06,7.13,8.005,8.765,9.975,11.865,20.27,39.23
75%,145.0025,142.2275,148.2,146.5425,135.8325,133.265,135.3025,174.265,240.465,287.595
max,2017.0,2016.0,2015.0,2014.0,2013.0,2012.0,2011.0,2010.0,2005.0,2000.0


### Malaria Cleaning

In [26]:
#dtypes as expected
malaria.dtypes

Unnamed: 0                                             object
Malaria incidence (per 1 000 population at risk)      float64
Malaria incidence (per 1 000 population at risk).1    float64
Malaria incidence (per 1 000 population at risk).2    float64
Malaria incidence (per 1 000 population at risk).3    float64
Malaria incidence (per 1 000 population at risk).4    float64
Malaria incidence (per 1 000 population at risk).5    float64
Malaria incidence (per 1 000 population at risk).6    float64
Malaria incidence (per 1 000 population at risk).7    float64
Malaria incidence (per 1 000 population at risk).8    float64
Malaria incidence (per 1 000 population at risk).9    float64
dtype: object

In [27]:
malaria.rename(columns={"Malaria incidence (per 1 000 population at risk)": "2017",
                       "Malaria incidence (per 1 000 population at risk).1": "2016",
                       "Malaria incidence (per 1 000 population at risk).2": "2015",
                       "Malaria incidence (per 1 000 population at risk).3": "2014",
                       "Malaria incidence (per 1 000 population at risk).4": "2013",
                       "Malaria incidence (per 1 000 population at risk).5": "2012",
                       "Malaria incidence (per 1 000 population at risk).6": "2011",
                       "Malaria incidence (per 1 000 population at risk).7": "2010",
                       "Malaria incidence (per 1 000 population at risk).8": "2005",
                       "Malaria incidence (per 1 000 population at risk).9": "2000",
                       'Unnamed: 0': "Country"}, inplace = True)

In [28]:
malaria.drop(axis = 0, index = 0, inplace = True)

In [29]:
malaria = malaria.set_index('Country')

In [30]:
malaria.head()

Unnamed: 0_level_0,2017,2016,2015,2014,2013,2012,2011,2010,2005,2000
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Afghanistan,23.01,23.0,14.22,11.26,8.75,11.76,19.86,15.92,28.91,92.64
Algeria,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.15,0.3
Angola,154.97,155.66,154.48,139.97,130.2,123.99,125.54,133.76,210.66,222.39
Argentina,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07,1.29,2.37
Armenia,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05


In [31]:
malaria.to_csv('../Data/Diseases/cleaned_disease/malaria_clean.csv', index = True)

**Notes**: This can be a good dataset to use because we have lots of countries and a range from 2000 - 2017. However we might need to fill in some datapoints for years in the early 2000s but we can do that with some simple research. Additionally malaria has been around for some time meaning it is a disease that will have a lot of variation in how it is managed in each country.

* Incidence is per year per 1000 population.

## Meningitis <a id='meningitis'></a>

In [32]:
mngts.head()

Unnamed: 0.1,Unnamed: 0,Number of meningitis epidemic districts,Number of meningitis epidemic districts.1,Number of meningitis epidemic districts.2,Number of meningitis epidemic districts.3,Number of meningitis epidemic districts.4,Number of meningitis epidemic districts.5,Number of meningitis epidemic districts.6,Number of meningitis epidemic districts.7,Number of meningitis epidemic districts.8,Number of meningitis epidemic districts.9,Number of meningitis epidemic districts.10,Number of meningitis epidemic districts.11
0,Country,2014,2013,2012,2011.0,2010.0,2009.0,2008.0,2007.0,2006.0,2005.0,2004.0,2003.0
1,Benin,1,1,6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
2,Burkina Faso,0,1,13,2.0,12.0,3.0,20.0,43.0,33.0,1.0,5.0,12.0
3,Cameroon,0,0,2,10.0,3.0,5.0,,0.0,,,,
4,Central African Republic,0,0,2,2.0,1.0,1.0,1.0,1.0,0.0,,,


In [33]:
mngts.columns

Index(['Unnamed: 0', 'Number of meningitis epidemic districts',
       'Number of meningitis epidemic districts.1',
       'Number of meningitis epidemic districts.2',
       'Number of meningitis epidemic districts.3',
       'Number of meningitis epidemic districts.4',
       'Number of meningitis epidemic districts.5',
       'Number of meningitis epidemic districts.6',
       'Number of meningitis epidemic districts.7',
       'Number of meningitis epidemic districts.8',
       'Number of meningitis epidemic districts.9',
       'Number of meningitis epidemic districts.10',
       'Number of meningitis epidemic districts.11'],
      dtype='object')

In [34]:
mngts['Unnamed: 0'].unique()

array(['Country', 'Benin', 'Burkina Faso', 'Cameroon',
       'Central African Republic', 'Chad', "Côte d'Ivoire",
       'Democratic Republic of the Congo', 'Ethiopia', 'Gambia', 'Ghana',
       'Guinea', 'Mali', 'Mauritania', 'Niger', 'Nigeria', 'Senegal',
       'South Sudan', 'Sudan', 'Togo'], dtype=object)

In [35]:
mngts.describe()

Unnamed: 0,Number of meningitis epidemic districts.3,Number of meningitis epidemic districts.4,Number of meningitis epidemic districts.5,Number of meningitis epidemic districts.6,Number of meningitis epidemic districts.7,Number of meningitis epidemic districts.8,Number of meningitis epidemic districts.9,Number of meningitis epidemic districts.10,Number of meningitis epidemic districts.11
count,14.0,15.0,15.0,14.0,13.0,14.0,12.0,9.0,9.0
mean,146.857143,137.066667,148.4,150.357143,160.230769,148.5,169.5,224.444444,227.777778
std,536.56498,518.146762,516.650587,534.843558,555.026449,534.712252,578.049149,667.336143,665.739772
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
50%,1.0,3.0,1.0,1.5,1.0,1.5,0.5,1.0,4.0
75%,8.0,6.0,6.0,10.25,4.0,6.5,5.75,5.0,12.0
max,2011.0,2010.0,2009.0,2008.0,2007.0,2006.0,2005.0,2004.0,2003.0


**Notes**: This dataset isn't that great to use because the number of countries are very limited to African countries. This can be used if we decidet o focus mostly on African countries. 

## Tuberculosis<a id='tuberculosis'></a>

In [36]:
tb.head()

Unnamed: 0,Country,Year,Number of incident tuberculosis cases,Incidence of tuberculosis (per 100 000 population per year),Number of incident tuberculosis cases in children aged 0 - 14,"Number of incident tuberculosis cases, (HIV-positive cases)",Incidence of tuberculosis (per 100 000 population) (HIV-positive cases)
0,Afghanistan,2018,70000 [45000-100000],189 [122-270],14000 [7400-21000],320 [120-640],0.87 [0.31-1.7]
1,Afghanistan,2017,69000 [44000-98000],189 [122-270],,300 [110-580],0.82 [0.3-1.6]
2,Afghanistan,2016,67000 [43000-95000],189 [122-270],,310 [120-600],0.88 [0.33-1.7]
3,Afghanistan,2015,65000 [42000-93000],189 [122-270],,290 [110-560],0.86 [0.33-1.6]
4,Afghanistan,2014,63000 [41000-90000],189 [122-270],,290 [110-560],0.88 [0.34-1.7]


In [37]:
tb.columns

Index(['Country', 'Year', 'Number of incident tuberculosis cases',
       'Incidence of tuberculosis (per 100 000 population per year)',
       'Number of incident tuberculosis cases in children aged 0 - 14',
       'Number of incident tuberculosis cases,  (HIV-positive cases)',
       'Incidence of tuberculosis (per 100 000 population) (HIV-positive cases)'],
      dtype='object')

In [38]:
#checking scope of data
#tb['Country'].unique()

In [39]:
tb['Year'].unique()

array([2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008,
       2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000])

### Tuberculosis Cleaning

In [40]:
#Only care about incidences
tb.drop(axis = 1, columns = ['Incidence of tuberculosis (per 100 000 population per year)',
       'Number of incident tuberculosis cases in children aged 0 - 14',
       'Number of incident tuberculosis cases,  (HIV-positive cases)',
       'Incidence of tuberculosis (per 100 000 population) (HIV-positive cases)'] , inplace = True)

In [41]:
tb.rename(columns={"Number of incident tuberculosis cases": "tuberculosis_incidence",
                       }, inplace = True)

In [42]:
tb = tb.set_index('Country')

In [43]:
#removing all characters after the first space for incidence column
tb = tb.astype(str).apply(lambda x: x.str.split().str[0])

In [44]:
tb.dtypes

Year                      object
tuberculosis_incidence    object
dtype: object

In [45]:
tb['Year'] = tb['Year'].astype('int')

In [46]:
tb['tuberculosis_incidence'] = tb['tuberculosis_incidence'].astype('int')

In [47]:
tb.head()

Unnamed: 0_level_0,Year,tuberculosis_incidence
Country,Unnamed: 1_level_1,Unnamed: 2_level_1
Afghanistan,2018,70000
Afghanistan,2017,69000
Afghanistan,2016,67000
Afghanistan,2015,65000
Afghanistan,2014,63000


In [48]:
tb.to_csv('../Data/Diseases/cleaned_disease/tb_clean.csv', index = True)

**Notes**: This is a good dataset to use because we have all the countries with data from 2000 to 2018.

## Zika<a id='zika'></a>

In [49]:
zika.head()

Unnamed: 0,report_date,location,location_type,data_field,data_field_code,time_period,time_period_type,value,unit
0,2016-03-19,Argentina-Buenos_Aires,province,cumulative_confirmed_local_cases,AR0001,,,0,cases
1,2016-03-19,Argentina-Buenos_Aires,province,cumulative_probable_local_cases,AR0002,,,0,cases
2,2016-03-19,Argentina-Buenos_Aires,province,cumulative_confirmed_imported_cases,AR0003,,,2,cases
3,2016-03-19,Argentina-Buenos_Aires,province,cumulative_probable_imported_cases,AR0004,,,1,cases
4,2016-03-19,Argentina-Buenos_Aires,province,cumulative_cases_under_study,AR0005,,,127,cases


In [50]:
zika.columns

Index(['report_date', 'location', 'location_type', 'data_field',
       'data_field_code', 'time_period', 'time_period_type', 'value', 'unit'],
      dtype='object')

**Notes**: Zika was very limted # of locations and time frames.

## Tetanus<a id='tetanus'></a>

In [51]:
tet.head()

Unnamed: 0,WHO_REGION,ISO_code,Cname,Disease,2018,2017,2016,2015,2014,2013,...,1989,1988,1987,1986,1985,1984,1983,1982,1981,1980
0,EMR,AFG,Afghanistan,ttetanus,53.0,,37.0,74.0,39.0,24.0,...,51.0,951.0,168.0,698.0,2829.0,355.0,912.0,1481.0,1208.0,1618.0
1,EUR,ALB,Albania,ttetanus,0.0,1.0,,,,0.0,...,3.0,5.0,6.0,5.0,4.0,2.0,4.0,1.0,3.0,5.0
2,AFR,DZA,Algeria,ttetanus,1.0,0.0,0.0,0.0,0.0,0.0,...,63.0,50.0,415.0,129.0,343.0,74.0,79.0,100.0,164.0,86.0
3,EUR,AND,Andorra,ttetanus,0.0,,,,,0.0,...,,,,,,,,,,
4,AFR,AGO,Angola,ttetanus,340.0,,,305.0,330.0,360.0,...,2701.0,1631.0,778.0,129.0,893.0,1320.0,1115.0,1398.0,1383.0,1185.0


In [52]:
tet.columns

Index(['WHO_REGION', 'ISO_code', 'Cname', 'Disease', '2018', '2017', '2016',
       '2015', '2014', '2013', '2012', '2011', '2010', '2009', '2008', '2007',
       '2006', '2005', '2004', '2003', '2002', '2001', '2000', '1999', '1998',
       '1997', '1996', '1995', '1994', '1993', '1992', '1991', '1990', '1989',
       '1988', '1987', '1986', '1985', '1984', '1983', '1982', '1981', '1980'],
      dtype='object')

In [53]:
#checking scope of data
#tet['Cname'].unique()

### Tetanus Cleaning

In [54]:
#checking dtypes, all years are in floats. 
#tet.dtypes

In [55]:
tet.drop(axis = 0, columns = ['WHO_REGION', 'ISO_code', 'Disease'], inplace = True)

In [56]:
tet.rename(columns={"Cname": "Country"}, inplace = True)

In [57]:
tet = tet.set_index('Country')

In [58]:
tet.head()

Unnamed: 0_level_0,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,...,1989,1988,1987,1986,1985,1984,1983,1982,1981,1980
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Afghanistan,53.0,,37.0,74.0,39.0,24.0,37.0,20.0,23.0,19.0,...,51.0,951.0,168.0,698.0,2829.0,355.0,912.0,1481.0,1208.0,1618.0
Albania,0.0,1.0,,,,0.0,0.0,0.0,1.0,0.0,...,3.0,5.0,6.0,5.0,4.0,2.0,4.0,1.0,3.0,5.0
Algeria,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,...,63.0,50.0,415.0,129.0,343.0,74.0,79.0,100.0,164.0,86.0
Andorra,0.0,,,,,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
Angola,340.0,,,305.0,330.0,360.0,543.0,953.0,490.0,675.0,...,2701.0,1631.0,778.0,129.0,893.0,1320.0,1115.0,1398.0,1383.0,1185.0


In [59]:
tet.to_csv('../Data/Diseases/cleaned_disease/tet_clean.csv', index = True)

**Notes**: Very good dataset to use because we have data from 1980 to 2018 with all countries available.

## Rubella<a id='rubella'></a>

In [60]:
rubella.head()

Unnamed: 0,WHO_REGION,ISO_code,Cname,Disease,2018,2017,2016,2015,2014,2013,...,2007,2006,2005,2004,2003,2002,2001,2000,1999,1998
0,EMR,AFG,Afghanistan,Rubella,37.0,53.0,42.0,59.0,43.0,367.0,...,152.0,196.0,,,,,,,,
1,EUR,ALB,Albania,Rubella,0.0,0.0,2.0,,,0.0,...,0.0,0.0,0.0,0.0,9.0,12.0,10.0,1752.0,15.0,
2,AFR,DZA,Algeria,Rubella,624.0,110.0,13.0,3.0,3.0,414.0,...,,,,,,,,,,
3,EUR,AND,Andorra,Rubella,0.0,0.0,0.0,,,0.0,...,0.0,22.0,0.0,0.0,0.0,,0.0,,,
4,AFR,AGO,Angola,Rubella,31.0,20.0,12.0,230.0,112.0,36.0,...,25.0,14.0,10.0,43.0,22.0,0.0,,,,


In [61]:
rubella.columns

Index(['WHO_REGION', 'ISO_code', 'Cname', 'Disease', '2018', '2017', '2016',
       '2015', '2014', '2013', '2012', '2011', '2010', '2009', '2008', '2007',
       '2006', '2005', '2004', '2003', '2002', '2001', '2000', '1999', '1998'],
      dtype='object')

In [62]:
#checking scope of data
#rubella['Cname'].unique()

### Rubella Cleaning

In [63]:
#column values are expected datatype
rubella.dtypes

WHO_REGION     object
ISO_code       object
Cname          object
Disease        object
2018          float64
2017          float64
2016          float64
2015          float64
2014          float64
2013          float64
2012          float64
2011          float64
2010          float64
2009          float64
2008          float64
2007          float64
2006          float64
2005          float64
2004          float64
2003          float64
2002          float64
2001          float64
2000          float64
1999          float64
1998          float64
dtype: object

In [64]:
#dropping columns we don't want and making column names more interpretable
rubella.drop(axis = 0, columns = ['WHO_REGION', 'ISO_code', 'Disease'], inplace = True)
rubella.rename(columns={"Cname": "Country"}, inplace = True)

In [65]:
rubella = rubella.set_index('Country')

In [66]:
rubella.head()

Unnamed: 0_level_0,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,...,2007,2006,2005,2004,2003,2002,2001,2000,1999,1998
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Afghanistan,37.0,53.0,42.0,59.0,43.0,367.0,,750.0,46.0,501.0,...,152.0,196.0,,,,,,,,
Albania,0.0,0.0,2.0,,,0.0,1.0,5.0,5.0,0.0,...,0.0,0.0,0.0,0.0,9.0,12.0,10.0,1752.0,15.0,
Algeria,624.0,110.0,13.0,3.0,3.0,414.0,420.0,170.0,212.0,23.0,...,,,,,,,,,,
Andorra,0.0,0.0,0.0,,,0.0,0.0,0.0,0.0,0.0,...,0.0,22.0,0.0,0.0,0.0,,0.0,,,
Angola,31.0,20.0,12.0,230.0,112.0,36.0,65.0,24.0,38.0,10.0,...,25.0,14.0,10.0,43.0,22.0,0.0,,,,


In [67]:
rubella.to_csv('../Data/Diseases/cleaned_disease/rubella_clean.csv', index = True)

**Notes**: Very good dataset because we have data from 1998 to 2018 for all countries.

## Pertussis<a id='pert'></a>

In [68]:
pert

Unnamed: 0,WHO_REGION,ISO_code,Cname,Disease,2018,2017,2016,2015,2014,2013,...,1989,1988,1987,1986,1985,1984,1983,1982,1981,1980
0,EMR,AFG,Afghanistan,pertussis,488.0,1.0,0.0,432.0,0.0,371.0,...,1494.0,4587.0,6073.0,5872.0,8531.0,6175.0,10209.0,8528.0,15388.0,15748.0
1,EUR,ALB,Albania,pertussis,19.0,7.0,43.0,,,6.0,...,302.0,508.0,112.0,115.0,172.0,89.0,126.0,312.0,280.0,137.0
2,AFR,DZA,Algeria,pertussis,17.0,6.0,2.0,0.0,0.0,69.0,...,32.0,45.0,69.0,24.0,520.0,894.0,395.0,663.0,967.0,710.0
3,EUR,AND,Andorra,pertussis,2.0,0.0,3.0,16.0,1.0,6.0,...,,,,,,,,,,
4,AFR,AGO,Angola,pertussis,0.0,,0.0,0.0,0.0,0.0,...,21674.0,14343.0,10015.0,6953.0,15846.0,23993.0,28461.0,31429.0,31481.0,54126.0
5,AMR,ATG,Antigua and Barbuda,pertussis,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,AMR,ARG,Argentina,pertussis,900.0,864.0,1686.0,975.0,561.0,1112.0,...,2943.0,3685.0,1722.0,1952.0,4654.0,16288.0,6115.0,6383.0,21695.0,27223.0
7,EUR,ARM,Armenia,pertussis,160.0,77.0,15.0,27.0,85.0,30.0,...,,,,,,,,,,
8,WPR,AUS,Australia,pertussis,12555.0,12114.0,20037.0,22508.0,11842.0,12319.0,...,614.0,153.0,291.0,601.0,587.0,261.0,332.0,274.0,170.0,124.0
9,EUR,AUT,Austria,pertussis,2197.0,1411.0,1270.0,579.0,370.0,,...,190.0,113.0,260.0,177.0,301.0,176.0,181.0,433.0,264.0,186.0


In [69]:
pert.columns

Index(['WHO_REGION', 'ISO_code', 'Cname', 'Disease', '2018', '2017', '2016',
       '2015', '2014', '2013', '2012', '2011', '2010', '2009', '2008', '2007',
       '2006', '2005', '2004', '2003', '2002', '2001', '2000', '1999', '1998',
       '1997', '1996', '1995', '1994', '1993', '1992', '1991', '1990', '1989',
       '1988', '1987', '1986', '1985', '1984', '1983', '1982', '1981', '1980'],
      dtype='object')

In [70]:
#checking scope of data
#pert['Cname'].unique()

### Pertussis Cleaning

In [71]:
#column dtypes are as expected
#pert.dtypes

In [72]:
#dropping columns we don't want and making column names more interpretable
pert.drop(axis = 0, columns = ['WHO_REGION', 'ISO_code', 'Disease'], inplace = True)
pert.rename(columns={"Cname": "Country"}, inplace = True)

In [73]:
pert = pert.set_index('Country')

In [74]:
pert.to_csv('../Data/Diseases/cleaned_disease/pert_clean.csv', index = True)

In [75]:
pert.head(60)

Unnamed: 0_level_0,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,...,1989,1988,1987,1986,1985,1984,1983,1982,1981,1980
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Afghanistan,488.0,1.0,0.0,432.0,0.0,371.0,1497.0,0.0,0.0,0.0,...,1494.0,4587.0,6073.0,5872.0,8531.0,6175.0,10209.0,8528.0,15388.0,15748.0
Albania,19.0,7.0,43.0,,,6.0,16.0,4.0,0.0,10.0,...,302.0,508.0,112.0,115.0,172.0,89.0,126.0,312.0,280.0,137.0
Algeria,17.0,6.0,2.0,0.0,0.0,69.0,104.0,1.0,0.0,1.0,...,32.0,45.0,69.0,24.0,520.0,894.0,395.0,663.0,967.0,710.0
Andorra,2.0,0.0,3.0,16.0,1.0,6.0,3.0,4.0,0.0,0.0,...,,,,,,,,,,
Angola,0.0,,0.0,0.0,0.0,0.0,1259.0,1554.0,2539.0,1127.0,...,21674.0,14343.0,10015.0,6953.0,15846.0,23993.0,28461.0,31429.0,31481.0,54126.0
Antigua and Barbuda,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Argentina,900.0,864.0,1686.0,975.0,561.0,1112.0,1239.0,3185.0,804.0,1743.0,...,2943.0,3685.0,1722.0,1952.0,4654.0,16288.0,6115.0,6383.0,21695.0,27223.0
Armenia,160.0,77.0,15.0,27.0,85.0,30.0,8.0,1.0,4.0,11.0,...,,,,,,,,,,
Australia,12555.0,12114.0,20037.0,22508.0,11842.0,12319.0,23855.0,38040.0,34285.0,29545.0,...,614.0,153.0,291.0,601.0,587.0,261.0,332.0,274.0,170.0,124.0
Austria,2197.0,1411.0,1270.0,579.0,370.0,,571.0,309.0,414.0,183.0,...,190.0,113.0,260.0,177.0,301.0,176.0,181.0,433.0,264.0,186.0


**Notes**: Also a very good dataset because we have data from 1980 to 2018 for all countries.

## Mumps<a id='mumps'></a>

In [76]:
mumps.head()

Unnamed: 0,WHO_REGION,ISO_code,Cname,Disease,2018,2017,2016,2015,2014,2013,...,2007,2006,2005,2004,2003,2002,2001,2000,1999,1998
0,EMR,AFG,Afghanistan,Mumps,,,29.0,,0.0,0.0,...,,,,,,,,,,
1,EUR,ALB,Albania,Mumps,13.0,6.0,17.0,,,20.0,...,824.0,236.0,1696.0,896.0,2236.0,3124.0,1414.0,1651.0,1006.0,
2,AFR,DZA,Algeria,Mumps,,0.0,0.0,67.0,,27.0,...,,,,,,,,,,
3,EUR,AND,Andorra,Mumps,31.0,5.0,5.0,2.0,0.0,2.0,...,4.0,3.0,1.0,2.0,1.0,,4.0,,,
4,AFR,AGO,Angola,Mumps,,,,,,,...,,,,23.0,,0.0,,,,


In [77]:
mumps.columns

Index(['WHO_REGION', 'ISO_code', 'Cname', 'Disease', '2018', '2017', '2016',
       '2015', '2014', '2013', '2012', '2011', '2010', '2009', '2008', '2007',
       '2006', '2005', '2004', '2003', '2002', '2001', '2000', '1999', '1998'],
      dtype='object')

In [78]:
#checking scope of data
#mumps['Cname'].unique()

### Mumps Cleaning

In [79]:
#column dtypes are as expected
mumps.dtypes

WHO_REGION     object
ISO_code       object
Cname          object
Disease        object
2018          float64
2017          float64
2016          float64
2015          float64
2014          float64
2013          float64
2012          float64
2011          float64
2010          float64
2009          float64
2008          float64
2007          float64
2006          float64
2005          float64
2004          float64
2003          float64
2002          float64
2001          float64
2000          float64
1999          float64
1998          float64
dtype: object

In [80]:
#dropping columns we don't want and making column names more interpretable
mumps.drop(axis = 0, columns = ['WHO_REGION', 'ISO_code', 'Disease'], inplace = True)
mumps.rename(columns={"Cname": "Country"}, inplace = True)

In [81]:
mumps = mumps.set_index('Country')

In [82]:
mumps.head()

Unnamed: 0_level_0,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,...,2007,2006,2005,2004,2003,2002,2001,2000,1999,1998
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Afghanistan,,,29.0,,0.0,0.0,,,0.0,,...,,,,,,,,,,
Albania,13.0,6.0,17.0,,,20.0,18.0,39.0,21.0,22.0,...,824.0,236.0,1696.0,896.0,2236.0,3124.0,1414.0,1651.0,1006.0,
Algeria,,0.0,0.0,67.0,,27.0,0.0,0.0,0.0,,...,,,,,,,,,,
Andorra,31.0,5.0,5.0,2.0,0.0,2.0,1.0,0.0,0.0,0.0,...,4.0,3.0,1.0,2.0,1.0,,4.0,,,
Angola,,,,,,,,,0.0,0.0,...,,,,23.0,,0.0,,,,


In [83]:
mumps.to_csv('../Data/Diseases/cleaned_disease/mumps_clean.csv', index = True)

**Notes**: Good dataset, data from 1998 to 2018 for all countries

## Measles<a id='measles'></a>

In [84]:
measles.head()

Unnamed: 0,WHO_REGION,ISO_code,Cname,Disease,2018,2017,2016,2015,2014,2013,...,1989,1988,1987,1986,1985,1984,1983,1982,1981,1980
0,EMR,AFG,Afghanistan,measles,2012.0,1511.0,638.0,1154.0,492.0,430.0,...,1170.0,4561.0,10357.0,8107.0,14457.0,16199.0,18808.0,20320.0,31107.0,32455.0
1,EUR,ALB,Albania,measles,1469.0,12.0,17.0,,,0.0,...,136034.0,0.0,0.0,0.0,0.0,0.0,17.0,3.0,,
2,AFR,DZA,Algeria,measles,3356.0,112.0,41.0,63.0,0.0,25.0,...,4169.0,2634.0,2500.0,3975.0,20114.0,22553.0,22126.0,29584.0,20849.0,15527.0
3,EUR,AND,Andorra,measles,0.0,0.0,0.0,,,0.0,...,,,,,,,,,,
4,AFR,AGO,Angola,measles,57.0,29.0,53.0,119.0,11699.0,8523.0,...,19820.0,21009.0,13368.0,15580.0,22822.0,22685.0,22589.0,30067.0,19714.0,29656.0


In [85]:
measles.columns

Index(['WHO_REGION', 'ISO_code', 'Cname', 'Disease', '2018', '2017', '2016',
       '2015', '2014', '2013', '2012', '2011', '2010', '2009', '2008', '2007',
       '2006', '2005', '2004', '2003', '2002', '2001', '2000', '1999', '1998',
       '1997', '1996', '1995', '1994', '1993', '1992', '1991', '1990', '1989',
       '1988', '1987', '1986', '1985', '1984', '1983', '1982', '1981', '1980'],
      dtype='object')

In [86]:
#checking scope of data
#measles['Cname'].unique()

### Measles Cleaning

In [87]:
#column dtypes are as expected
measles.dtypes

WHO_REGION     object
ISO_code       object
Cname          object
Disease        object
2018          float64
2017          float64
2016          float64
2015          float64
2014          float64
2013          float64
2012          float64
2011          float64
2010          float64
2009          float64
2008          float64
2007          float64
2006          float64
2005          float64
2004          float64
2003          float64
2002          float64
2001          float64
2000          float64
1999          float64
1998          float64
1997          float64
1996          float64
1995          float64
1994          float64
1993          float64
1992          float64
1991          float64
1990          float64
1989          float64
1988          float64
1987          float64
1986          float64
1985          float64
1984          float64
1983          float64
1982          float64
1981          float64
1980          float64
dtype: object

In [88]:
#dropping columns we don't want and making column names more interpretable
measles.drop(axis = 0, columns = ['WHO_REGION', 'ISO_code', 'Disease'], inplace = True)
measles.rename(columns={"Cname": "Country"}, inplace = True)

In [89]:
measles = measles.set_index('Country')

In [90]:
measles.head()

Unnamed: 0_level_0,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,...,1989,1988,1987,1986,1985,1984,1983,1982,1981,1980
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Afghanistan,2012.0,1511.0,638.0,1154.0,492.0,430.0,2787.0,3013.0,1989.0,2861.0,...,1170.0,4561.0,10357.0,8107.0,14457.0,16199.0,18808.0,20320.0,31107.0,32455.0
Albania,1469.0,12.0,17.0,,,0.0,9.0,28.0,10.0,0.0,...,136034.0,0.0,0.0,0.0,0.0,0.0,17.0,3.0,,
Algeria,3356.0,112.0,41.0,63.0,0.0,25.0,18.0,112.0,103.0,107.0,...,4169.0,2634.0,2500.0,3975.0,20114.0,22553.0,22126.0,29584.0,20849.0,15527.0
Andorra,0.0,0.0,0.0,,,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
Angola,57.0,29.0,53.0,119.0,11699.0,8523.0,4458.0,1449.0,1190.0,2807.0,...,19820.0,21009.0,13368.0,15580.0,22822.0,22685.0,22589.0,30067.0,19714.0,29656.0


In [91]:
measles.to_csv('../Data/Diseases/cleaned_disease/measles_clean.csv', index = True)

**Notes**: Good dataset, data for all countries from 1980 to 2018.

# Preliminary Disease EDA Findings<a id='find'></a>

The problem we are trying to solve is limited by time data points. Infrastructure really only shows significant change year over year and the effects also take months to a year to show effect on a population. Therefore we are limited to year by year information. It was critical that the datasets we use have ACCURATE data for an extended period of time, at least from the 2000s and also data that covers a wide range of countries, preferably all of them.

The datasets that cover a time range beginningin at least 2000 AND cover a wide range of countries (100+) are:
* Cholera 
* Malaria
* Tuberculosis
* Tetanus
* Rubella
* Pertussis
* Mumps
* Measles


# Selecting Countries for Modelling

To select countries for modelling we want a country that has shown a significant amount of variation via the coeffecient of variation (COV), in regards to amount of cases over time for a disease. This means this country has not yet figured out how to stabilize/lower the transmission rates for this specific disease. Additionally we will be summing the total number of cases to see which countries are affected the most and to also void putting weights on COVs with very low number of cases.

We will select a country that has a high COV relative to the other countries while also having a significant amount of cases relative to other countries.

## Finding Coeffecient of Variation (COV) for Each Country<a id='select'></a>

In [92]:
#function takes the dataframe, rows to view, and name of csv you would like to give it
def var(df,rows,name):
    
    #input a dataframe and gets COV for dataframes where the index is country
    df['COV'] = (df.std(axis = 1))/(df.mean(axis=1))
    
    #Gets the sum of cases for the disease over the recorded time.
    df['SUM'] = df.sum(axis = 1)
    
    #creates a dataframe ordered by SUM from highest to lowest
    df_order = df.sort_values(by ='SUM' , ascending=False)
    
    #creating a dataframe from our series
    df_order = pd.DataFrame(data=df_order)
    df_order.to_csv('../Data/Diseases/COV/{df}_COV.csv'.format(df = name), index = True)
    
    #prints the n highest affected countries with their COV
    print('\n''Cases and COV Highest to Lowest' '\n')
    return(df_order[['COV','SUM']])
    

## Cholera COV<a id='cholera_cov'></a>

In [93]:
#finding Coeffecient of Variation of # of cholera cases for each unique country 
cholera_cov = ((cholera.groupby('Country')['cholera_cases'].std())/(cholera.groupby('Country')['cholera_cases'].mean()))
cholera_cov = pd.DataFrame(data=cholera_cov)
cholera_cov['SUM'] = cholera.groupby('Country')['cholera_cases'].sum()
cholera_cov.rename(columns={"cholera_cases": "Cholera COV"}, inplace = True)
cholera_order = cholera_cov.sort_values(by ='SUM' , ascending=False)

In [94]:
cholera_order.to_csv('../Data/Diseases/COV/cholera_cov.csv', index = True)

In [107]:
#returning countries with the most COV
cholera_order.head(50)

Unnamed: 0_level_0,Cholera COV,SUM
Country,Unnamed: 1_level_1,Unnamed: 2_level_1
India,1.544639,1363250.0
Haiti,0.998567,795794.0
Peru,1.742105,736195.0
Democratic Republic of the Congo,1.022443,521607.0
Indonesia,1.498265,394945.0
Mozambique,1.378034,327913.0
Somalia,1.343618,311203.0
Nigeria,1.841082,310217.0
Bangladesh,1.224637,294647.0
Afghanistan,1.451583,263843.0


**Notes**: Many of the top countries with a combination of total and having high COV are in South America and Africa. I consider a high COV anything that is close to 2 and up for Cholera.

## Tuberculosis COV<a id='tb_cov'></a>

In [96]:
tb_cov = ((tb.groupby('Country')['tuberculosis_incidence'].std())/(tb.groupby('Country')['tuberculosis_incidence'].mean()))
tb_cov = pd.DataFrame(data=tb_cov)
tb_cov['SUM'] = tb.groupby('Country')['tuberculosis_incidence'].sum()
tb_cov.rename(columns={"tuberculosis_incidence": "Tuberculosis COV"}, inplace = True)
tb_order = tb_cov.sort_values(by ='SUM' , ascending=False)

In [97]:
tb_order.to_csv('../Data/Diseases/COV/tb_cov.csv', index = True)

In [98]:
#returning countries with the highest COV
tb_order.head(50)

Unnamed: 0_level_0,Tuberculosis COV,SUM
Country,Unnamed: 1_level_1,Unnamed: 2_level_1
India,0.054138,57350000
China,0.152108,20846000
Indonesia,0.022286,15626000
Philippines,0.092746,9605000
Pakistan,0.112994,9143000
South Africa,0.181949,7807000
Nigeria,0.147794,6497000
Bangladesh,0.070401,6117000
Myanmar,0.123936,4587000
Ethiopia,0.155569,4441000


**Notes**: Much of the same, more Asian countries here however.

## Malaria COV<a id='malaria_cov'></a>

In [99]:
var(malaria,100, 'malaria')


Cases and COV Highest to Lowest



Unnamed: 0_level_0,COV,SUM
Country,Unnamed: 1_level_1,Unnamed: 2_level_1
Burkina Faso,0.14828,5114.52828
Central African Republic,0.092689,4244.162689
Equatorial Guinea,0.153084,4146.953084
Mali,0.063424,4111.113424
Sierra Leone,0.083266,4105.123266
Niger,0.103612,3875.643612
Guinea,0.095079,3809.965079
Togo,0.099061,3802.489061
Benin,0.099266,3676.439266
Mozambique,0.105011,3674.875011


**Notes**: African countries showing the highest variation and number of cases here.

## Tetanus COV <a id='tet_cov'></a>

In [100]:
var(tet, 100, 'tet')


Cases and COV Highest to Lowest



Unnamed: 0_level_0,COV,SUM
Country,Unnamed: 1_level_1,Unnamed: 2_level_1
India,0.958351,545566.958351
Egypt,1.046117,110594.046117
Indonesia,1.013514,84546.013514
Pakistan,0.894978,74334.894978
Bangladesh,0.948317,66918.948317
Philippines (the),0.517801,63618.517801
Nigeria,0.746807,60113.746807
Brazil,0.91711,40568.91711
Sudan (the),1.860666,31216.860666
Myanmar,1.144768,25892.144768


**Notes**: Africa, South American, and Asian regions again appearing the most in terms of countries that fit our criteria.

## Rubella COV<a id='rub_cov'></a>

In [108]:
var(rubella, 100, 'rubella')


Cases and COV Highest to Lowest



Unnamed: 0_level_0,COV,SUM
Country,Unnamed: 1_level_1,Unnamed: 2_level_1
Russian Federation (the),2.380165,5105426.0
China,1.998171,1095915.0
Ukraine,2.222747,985185.8
Poland,2.086726,655271.2
Romania,2.552624,472920.9
Kazakhstan,2.36492,267705.8
Mexico,2.700803,180611.4
Belarus,2.567967,169689.0
Bangladesh,2.018375,142037.1
Venezuela (Bolivarian Republic of),2.575166,139108.8


**Notes**: Starting to see a pattern here, a lot of Middle Eastern/African countries, as well as South American countries such as Brazil/Mexico showing up with lots of cases and high variations.

## Pertussis COV<a id='pert_cov'></a>

In [109]:
var(pert, 100 ,'pert')


Cases and COV Highest to Lowest



Unnamed: 0_level_0,COV,SUM
Country,Unnamed: 1_level_1,Unnamed: 2_level_1
India,3.128788,6489030.0
China,3.232473,5241451.0
Nigeria,2.901534,1917778.0
Viet Nam,3.262609,1104021.0
Pakistan,3.248588,985339.0
Kenya,2.495399,940788.1
Russian Federation (the),2.724512,872025.5
United States of America (the),3.12167,851964.1
Spain,2.971633,781680.5
Brazil,3.210497,716658.7


**Notes**: India and China are usually at the top of these lists.

## Mumps COV<a id='mumps_cov'></a>

In [110]:
var(mumps,100 ,'mumps')


Cases and COV Highest to Lowest



Unnamed: 0_level_0,COV,SUM
Country,Unnamed: 1_level_1,Unnamed: 2_level_1
China,1.94444,8543184.0
Japan,2.083308,3754727.0
Poland,2.125714,988687.5
Romania,2.341173,711833.7
Nepal,1.591254,710555.9
Iraq,2.263043,444323.8
Russian Federation (the),2.465719,352596.4
Thailand,2.259894,329516.9
Ukraine,2.177861,277631.5
Argentina,2.105247,273478.9


**Notes**: High COV for this disease, interesting to study.

## Measles COV<a id='measles_cov'></a>

In [104]:
var(measles, 100, 'measles')


Cases and COV Highest to Lowest



Unnamed: 0_level_0,COV,SUM
Country,Unnamed: 1_level_1,Unnamed: 2_level_1
China,1.537845,7278752.0
Nigeria,0.871232,2966637.0
India,0.836881,2877811.0
Kenya,1.805617,1665946.0
Spain,2.025422,1299670.0
Malawi,1.526739,1296324.0
Democratic Republic of the Congo (the),1.235746,1182917.0
Indonesia,1.353987,1178259.0
Niger (the),0.94166,1020342.0
United Kingdom of Great Britain and Northern Ireland (the),1.580107,978427.6


**Notes**: Latin American countries seem to have the highest variation in terms of region.

# Coeffecient of Variation Findings<a id='select2'></a>

Many of the same countries appear on the top of the list. Asian Countries such as India and China and African countries seem to have the most number of cases however Latin American countries such as Brazil, Mexico and others stand out to me as they have a high number of cases as well AND a very high variation relative to the other countries. This means that they have am unstable environment in terms of being able to treat the associated disease, making them a good case study. We need to further investigate the infrastructure of those countries to see if a country will make a good candidate

Candidates
* Latin American Countries (Brazi/Mexico)
* African Countries (Nigeria, Kenya)
* Asian Countries (China, India, Japan)