## Analysis on acute lower respiratory infections

Team: Runtime Terrors

Members:    
> Vinu Prasad Bhambore (vpb2)

> Srijith Srinath (ssrina2)

> Dhruman Jayesh Shah (djshah5)

##### Notebook Update 7 - 04/03/2020

Name of the dataset: PAHO Regional Mortality Dataset
    
Background: The PAHO Regional Mortality Database is an integrated database consisting of national datasets from Member States and is updated annually. The dataset includes data from 48 countries and territories.  The source of the national datasets varies by country. For some countries the national institution mandated to collect, integrate, and disseminate mortality data and information is the Ministry of Health, and for others it is the National Institute of Statistics.

#### Importing the necessary packages

In [1]:
import pandas as pd
import numpy as np
import warnings
from collections import defaultdict
import matplotlib.pyplot as plt 
import seaborn as sns 

In [2]:
mortality_df = pd.read_csv('Mortality_Data.csv')
mortality_df.head()

Unnamed: 0,CountryName,MortalityYear,Gender,AgeGroupCode,ICD10,Deaths
0,Brazil,2017,Male,21,I479,1
1,Brazil,2017,Male,21,C925,1
2,Brazil,2017,Male,21,I451,1
3,Brazil,2017,Male,21,D292,1
4,Brazil,2017,Male,21,L519,1


Filtering the records for bronchitis and tuberculosis deaths by ICD-10 code

In [3]:
bronchitis_df = mortality_df[mortality_df['ICD10'].str.match(r'(^J20[0-9]*)|(^J40[0-9]*)|(^J41[0-9]*)|(^J42[0-9]*)')].copy()
bronchitis_df['Class'] = 'Bronchitis'
bronchitis_df.head()

Unnamed: 0,CountryName,MortalityYear,Gender,AgeGroupCode,ICD10,Deaths,Class
563,Brazil,2017,Male,21,J410,1,Bronchitis
885,Brazil,2017,Male,17,J40,1,Bronchitis
2106,Brazil,2017,Male,16,J42,1,Bronchitis
3794,Brazil,2017,Male,22,J418,1,Bronchitis
4444,Brazil,2017,Male,20,J411,1,Bronchitis
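
The regular expression above keeps every code that begins with one of four three-character ICD-10 categories. As a rough sketch, the same selection can be expressed by slicing the first three characters and testing membership. The `toy_df` rows and the `BRONCHITIS_CATEGORIES` name below are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from Mortality_Data.csv.
toy_df = pd.DataFrame({
    "ICD10": ["J410", "J40", "I479", "J209", "A156", "J43"],
    "Deaths": [1, 2, 1, 3, 1, 1],
})

# Three-character ICD-10 categories treated as bronchitis in this notebook.
BRONCHITIS_CATEGORIES = ["J20", "J40", "J41", "J42"]

# Compare only the category part of each code (its first three characters).
is_bronchitis = toy_df["ICD10"].str[:3].isin(BRONCHITIS_CATEGORIES)
toy_bronchitis = toy_df[is_bronchitis].copy()
toy_bronchitis["Class"] = "Bronchitis"
print(toy_bronchitis)
```

Listing the categories explicitly keeps the set of included codes easy to audit when more conditions are added.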


In [4]:
tuberculosis_df = mortality_df[mortality_df['ICD10'].str.match(r'(^A15[0-9]*)|(^A17[0-9]*)|(^A18[0-9]*)|(^A19[0-9]*)')].copy()
tuberculosis_df['Class'] = 'Tuberculosis'
tuberculosis_df.head()

Unnamed: 0,CountryName,MortalityYear,Gender,AgeGroupCode,ICD10,Deaths,Class
515,Brazil,2017,Male,21,A178,1,Tuberculosis
598,Brazil,2017,Male,21,A182,1,Tuberculosis
684,Brazil,2017,Male,21,A180,1,Tuberculosis
858,Brazil,2017,Male,17,A156,1,Tuberculosis
874,Brazil,2017,Male,17,A198,1,Tuberculosis


Concatenating bronchitis_df and tuberculosis_df into one main dataframe

In [5]:
main_df = pd.concat([bronchitis_df,tuberculosis_df])
main_df.reset_index(drop=True, inplace=True)
main_df

Unnamed: 0,CountryName,MortalityYear,Gender,AgeGroupCode,ICD10,Deaths,Class
0,Brazil,2017,Male,21,J410,1,Bronchitis
1,Brazil,2017,Male,17,J40,1,Bronchitis
2,Brazil,2017,Male,16,J42,1,Bronchitis
3,Brazil,2017,Male,22,J418,1,Bronchitis
4,Brazil,2017,Male,20,J411,1,Bronchitis
...,...,...,...,...,...,...,...
49357,Ecuador,2017,FeMale,24,A150,1,Tuberculosis
49358,Ecuador,2017,FeMale,24,A181,1,Tuberculosis
49359,Ecuador,2017,FeMale,25,A180,1,Tuberculosis
49360,Ecuador,2017,FeMale,28,A185,1,Tuberculosis


Grouping the countries into zones

In [6]:
Caribbean = ["Cuba", "Puerto Rico", "St. Vincent and the Grenadines", "St. Lucia", "Jamaica", "Aruba", 
             "St. Kitts and Nevis", "Dominica", "Dominican Republic", "Barbados", "Antigua and Barbuda", 
             "Grenada", "Haiti", "Trinidad and Tobago", "Curacao", "Bahamas, The", "Virgin Islands (U.S.)", 
             "Cayman Islands", "Turks and Caicos Islands"]
CentralAmerica = ["Mexico", "Guatemala", "Panama", "Nicaragua", "El Salvador", "Costa Rica", "Belize", "Honduras"]
SouthAmerica = ["Brazil", "Colombia", "Argentina", "Peru", "Chile", "Paraguay", "Uruguay", "Venezuela, RB", 
                "Ecuador", "Suriname", "Bolivia", "Guyana"]
NorthernAmerica = ["United States", "Canada", "Bermuda"]

In [7]:
## standardize the country names so that the formatting is consistent throughout
country_name_fixes = {
    'United States of America': 'United States',
    'Saint Vincent and the Grenadines': 'St. Vincent and the Grenadines',
    'Venezuela': 'Venezuela, RB',
    'Saint Lucia': 'St. Lucia',
    'SaintKittsandNevis': 'St. Kitts and Nevis',
    'Virgin Islands (US)': 'Virgin Islands (U.S.)',
    'TurksandCaicosIslands': 'Turks and Caicos Islands',
    'CaymanIslands': 'Cayman Islands',
    'Brazil ': 'Brazil',
    'Bahamas': 'Bahamas, The',
    'Bolivia ': 'Bolivia',
}
main_df['CountryName'] = main_df['CountryName'].replace(country_name_fixes)


## remove countries that aren't available in World Bank Datasets
not_in_world_bank = ['Montserrat', 'Martinique', 'Guadeloupe', 'French Guiana']
# .copy() gives an independent frame, so the later .loc assignments do not act on a slice
main_df = main_df[~main_df['CountryName'].isin(not_in_world_bank)].copy()

In [8]:
list_of_countries = main_df['CountryName'].unique()
len(list_of_countries)

42

In [9]:
main_df.loc[main_df['CountryName'].isin(Caribbean), 'Zone'] = 'Caribbean'
main_df.loc[main_df['CountryName'].isin(CentralAmerica), 'Zone'] = 'CentralAmerica'
main_df.loc[main_df['CountryName'].isin(SouthAmerica), 'Zone'] = 'SouthAmerica'
main_df.loc[main_df['CountryName'].isin(NorthernAmerica), 'Zone'] = 'NorthernAmerica'
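
The four assignments above can also be written as a single lookup. The following is a minimal sketch, assuming a hypothetical `zones` dictionary with a shortened country list. It has the side benefit that any country missing from every zone shows up as `NaN`, which makes unmatched names easy to spot.

```python
import pandas as pd

# Hypothetical, shortened zone lists for illustration only.
zones = {
    "Caribbean": ["Cuba", "Jamaica"],
    "SouthAmerica": ["Brazil", "Peru"],
}

# Invert the mapping: one entry per country pointing to its zone.
country_to_zone = {
    country: zone for zone, countries in zones.items() for country in countries
}

toy_df = pd.DataFrame({"CountryName": ["Brazil", "Cuba", "Peru", "Atlantis"]})

# map() looks each country up in the dictionary; unknown names become NaN.
toy_df["Zone"] = toy_df["CountryName"].map(country_to_zone)

# Countries that were not assigned to any zone.
unassigned = toy_df.loc[toy_df["Zone"].isna(), "CountryName"].tolist()
print(toy_df)
print(unassigned)
```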

In [10]:
main_df

Unnamed: 0,CountryName,MortalityYear,Gender,AgeGroupCode,ICD10,Deaths,Class,Zone
0,Brazil,2017,Male,21,J410,1,Bronchitis,SouthAmerica
1,Brazil,2017,Male,17,J40,1,Bronchitis,SouthAmerica
2,Brazil,2017,Male,16,J42,1,Bronchitis,SouthAmerica
3,Brazil,2017,Male,22,J418,1,Bronchitis,SouthAmerica
4,Brazil,2017,Male,20,J411,1,Bronchitis,SouthAmerica
...,...,...,...,...,...,...,...,...
49357,Ecuador,2017,FeMale,24,A150,1,Tuberculosis,SouthAmerica
49358,Ecuador,2017,FeMale,24,A181,1,Tuberculosis,SouthAmerica
49359,Ecuador,2017,FeMale,25,A180,1,Tuberculosis,SouthAmerica
49360,Ecuador,2017,FeMale,28,A185,1,Tuberculosis,SouthAmerica


In [11]:
#main_df.to_csv("mortality_filtered.csv", index=False)

### Next, we read and clean the datasets that will act as predictors for our model

#### 1. Reading the GDP data from local directory 

In [12]:
gdp_df = pd.read_csv('gdp_per_capita.csv')

In [13]:
gdp_df.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
0,Aruba,ABW,GDP per capita (constant 2010 US$),NY.GDP.PCAP.KD,,,,,,,...,23512.6026,24233.00108,23781.2573,24635.76495,24697.49403,24452.60657,24277.40679,24485.08328,,
1,Afghanistan,AFG,GDP per capita (constant 2010 US$),NY.GDP.PCAP.KD,,,,,,,...,543.303042,528.736648,576.190126,587.56509,583.656193,574.184114,571.073775,571.542506,563.825663,
2,Angola,AGO,GDP per capita (constant 2010 US$),NY.GDP.PCAP.KD,,,,,,,...,3587.883798,3579.960081,3748.449445,3796.882622,3843.198241,3748.320623,3530.309423,3409.929285,3229.61974,
3,Albania,ALB,GDP per capita (constant 2010 US$),NY.GDP.PCAP.KD,,,,,,,...,4094.362119,4209.886951,4276.62018,4327.392449,4413.309627,4524.386108,4681.840039,4865.209546,5079.40112,
4,Andorra,AND,GDP per capita (constant 2010 US$),NY.GDP.PCAP.KD,,,,,,,...,39736.35406,38207.59591,38192.43989,39111.07953,40790.19801,41767.52651,42949.66624,43858.07751,44569.78301,


In [14]:
# Keep 1995-2018: drop the metadata columns, the years 1960-1994, and 2019
years_to_drop = [str(year) for year in range(1960, 1995)] + ['2019']
gdp_df = gdp_df.drop(['Country Code', 'Indicator Name', 'Indicator Code'] + years_to_drop, axis=1)

In [15]:
gdp_df.head()

Unnamed: 0,Country Name,1995,1996,1997,1998,1999,2000,2001,2002,2003,...,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018
0,Aruba,26705.181,26087.77581,27190.50118,27151.9241,26954.40451,28417.38421,26966.05479,25508.30252,25469.28741,...,24463.69225,23512.6026,24233.00108,23781.2573,24635.76495,24697.49403,24452.60657,24277.40679,24485.08328,
1,Afghanistan,,,,,,,,330.303553,343.08089,...,488.300251,543.303042,528.736648,576.190126,587.56509,583.656193,574.184114,571.073775,571.542506,563.825663
2,Angola,1922.416193,2113.750766,2195.718859,2225.684357,2201.52946,2195.630582,2213.681162,2433.804393,2423.293995,...,3549.577857,3587.883798,3579.960081,3748.449445,3796.882622,3843.198241,3748.320623,3530.309423,3409.929285,3229.61974
3,Albania,1703.286747,1869.871255,1676.131932,1835.651965,2085.432,2244.631092,2453.631476,2572.728837,2725.179233,...,3928.461732,4094.362119,4209.886951,4276.62018,4327.392449,4413.309627,4524.386108,4681.840039,4865.209546,5079.40112
4,Andorra,32917.64882,34175.26007,37293.28241,38595.72321,40035.48245,40801.54213,41420.84618,42396.3024,45519.49238,...,41979.36888,39736.35406,38207.59591,38192.43989,39111.07953,40790.19801,41767.52651,42949.66624,43858.07751,44569.78301


In [16]:
gdp_df_melted = gdp_df.melt(id_vars=["Country Name"], 
        var_name="Year", 
        value_name="GDP")
gdp_df_melted.head()

Unnamed: 0,Country Name,Year,GDP
0,Aruba,1995,26705.181
1,Afghanistan,1995,
2,Angola,1995,1922.416193
3,Albania,1995,1703.286747
4,Andorra,1995,32917.64882
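
The `melt` call above turns one column per year into one row per country and year. A small sketch of that reshaping follows. The `wide` frame, its country names, and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per year.
wide = pd.DataFrame({
    "Country Name": ["Aland", "Borduria"],
    "1995": [100.0, None],
    "1996": [110.0, 55.0],
})

# id_vars stay as they are; every other column becomes a (Year, GDP) pair.
long = wide.melt(id_vars=["Country Name"], var_name="Year", value_name="GDP")

# The Year values come from the column headers, so they start out as strings.
long["Year"] = long["Year"].astype(int)
print(long)
```

In this notebook the year labels become integers only once the filtered CSV is read back in, which appears to be what lets them match `MortalityYear` in the later merges.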


In [17]:
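# .copy() so that adding the Zone column below does not act on a slice of the original frame
gdp_df_melted = gdp_df_melted.loc[gdp_df_melted['Country Name'].isin(list_of_countries)].copy()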

In [18]:
gdp_df_melted['Zone'] = ""

In [19]:
gdp_df_melted.head()

Unnamed: 0,Country Name,Year,GDP,Zone
0,Aruba,1995,26705.181,
7,Argentina,1995,7666.530004,
10,Antigua and Barbuda,1995,11201.74013,
21,"Bahamas, The",1995,27018.34496,
24,Belize,1995,3375.374233,


In [20]:
gdp_df_melted.loc[gdp_df_melted['Country Name'].isin(Caribbean), 'Zone'] = 'Caribbean'
gdp_df_melted.loc[gdp_df_melted['Country Name'].isin(CentralAmerica), 'Zone'] = 'CentralAmerica'
gdp_df_melted.loc[gdp_df_melted['Country Name'].isin(SouthAmerica), 'Zone'] = 'SouthAmerica'
gdp_df_melted.loc[gdp_df_melted['Country Name'].isin(NorthernAmerica), 'Zone'] = 'NorthernAmerica'

In [21]:
gdp_df_melted.to_csv("gdp_filtered.csv", index=False)

#### 2. Reading the Health Expenditure data

In [22]:
health_df = pd.read_csv('Health_Expenditure.csv')

In [23]:
health_df.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
0,Aruba,ABW,Current health expenditure per capita (current...,SH.XPD.CHEX.PC.CD,,,,,,,...,,,,,,,,,,
1,Afghanistan,AFG,Current health expenditure per capita (current...,SH.XPD.CHEX.PC.CD,,,,,,,...,45.58775,51.553259,52.218506,55.96755,60.112761,60.088813,57.24876,,,
2,Angola,AGO,Current health expenditure per capita (current...,SH.XPD.CHEX.PC.CD,,,,,,,...,96.643701,122.117809,122.242944,143.703204,131.751875,108.68067,95.220799,,,
3,Albania,ALB,Current health expenditure per capita (current...,SH.XPD.CHEX.PC.CD,,,,,,,...,203.208588,246.80376,246.742546,277.668997,313.262897,264.434603,271.543043,,,
4,Andorra,AND,Current health expenditure per capita (current...,SH.XPD.CHEX.PC.CD,,,,,,,...,3754.731346,4013.911834,3857.161116,4107.733984,4346.258747,3698.117574,3834.730581,,,


In [24]:
# Keep 2000-2015: drop the metadata columns, the years 1960-1999, and 2016-2019
years_to_drop = ([str(year) for year in range(1960, 2000)]
                 + [str(year) for year in range(2016, 2020)])
health_df = health_df.drop(['Country Code', 'Indicator Name', 'Indicator Code'] + years_to_drop, axis=1)

In [25]:
health_df.head()

Unnamed: 0,Country Name,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
0,Aruba,,,,,,,,,,,,,,,,
1,Afghanistan,,,16.249542,17.490737,20.927087,24.446512,28.416662,31.840183,38.700492,42.30451,45.58775,51.553259,52.218506,55.96755,60.112761,60.088813
2,Angola,12.963033,28.854245,28.961365,34.718297,49.526718,53.930701,69.4247,92.452016,135.208309,119.808614,96.643701,122.117809,122.242944,143.703204,131.751875,108.68067
3,Albania,75.531472,81.946417,89.858329,113.583982,151.980517,165.865512,172.795596,216.413135,239.684351,206.94482,203.208588,246.80376,246.742546,277.668997,313.262897,264.434603
4,Andorra,2050.647513,2081.27533,2256.349073,2774.089627,3161.482406,3536.122706,3689.705702,4094.544269,4201.729595,3911.895963,3754.731346,4013.911834,3857.161116,4107.733984,4346.258747,3698.117574


In [26]:
health_df_melted = health_df.melt(id_vars=["Country Name"], 
        var_name="Year", 
        value_name="Health_Expenditure")
health_df_melted.head()

Unnamed: 0,Country Name,Year,Health_Expenditure
0,Aruba,2000,
1,Afghanistan,2000,
2,Angola,2000,12.963033
3,Albania,2000,75.531472
4,Andorra,2000,2050.647513


In [27]:
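# .copy() so that adding the Zone column below does not act on a slice of the original frame
health_df_melted = health_df_melted.loc[health_df_melted['Country Name'].isin(list_of_countries)].copy()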
health_df_melted.head()

Unnamed: 0,Country Name,Year,Health_Expenditure
0,Aruba,2000,
7,Argentina,2000,705.199321
10,Antigua and Barbuda,2000,383.915161
21,"Bahamas, The",2000,1084.29286
24,Belize,2000,132.615056


In [28]:
health_df_melted["Zone"] = ""

In [29]:
health_df_melted.loc[health_df_melted['Country Name'].isin(Caribbean), 'Zone'] = 'Caribbean'
health_df_melted.loc[health_df_melted['Country Name'].isin(CentralAmerica), 'Zone'] = 'CentralAmerica'
health_df_melted.loc[health_df_melted['Country Name'].isin(SouthAmerica), 'Zone'] = 'SouthAmerica'
health_df_melted.loc[health_df_melted['Country Name'].isin(NorthernAmerica), 'Zone'] = 'NorthernAmerica'

In [30]:
health_df_melted.head()

Unnamed: 0,Country Name,Year,Health_Expenditure,Zone
0,Aruba,2000,,Caribbean
7,Argentina,2000,705.199321,SouthAmerica
10,Antigua and Barbuda,2000,383.915161,Caribbean
21,"Bahamas, The",2000,1084.29286,Caribbean
24,Belize,2000,132.615056,CentralAmerica


In [31]:
health_df_melted.to_csv("health_filtered.csv", index=False)

#### 3. Reading the physicians per 1000 people data

In [32]:
physician_df = pd.read_csv('Physicians.csv')

In [33]:
physician_df.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
0,Aruba,ABW,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,,,,,,,...,,,,,,,,,,
1,Afghanistan,AFG,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,0.035,,,,,0.063,...,0.2396,0.2553,0.245,0.2894,0.3039,0.2907,0.284,,,
2,Angola,AGO,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,0.067,,,,,0.076,...,,,,,,,,0.2149,,
3,Albania,ALB,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,0.276,,,,,0.481,...,1.2379,1.2225,1.2658,1.2706,,,1.1998,,,
4,Andorra,AND,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,,,,,,,...,4.0,,,,,3.3333,,,,


In [34]:
# Keep 1995-2018: drop the metadata columns, the years 1960-1994, and 2019
years_to_drop = [str(year) for year in range(1960, 1995)] + ['2019']
physician_df = physician_df.drop(['Country Code', 'Indicator Name', 'Indicator Code'] + years_to_drop, axis=1)

In [35]:
physician_df.head()

Unnamed: 0,Country Name,1995,1996,1997,1998,1999,2000,2001,2002,2003,...,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018
0,Aruba,1.12,,,,,,,,,...,,,,,,,,,,
1,Afghanistan,,,0.11,,,,0.1957,,,...,0.2156,0.2396,0.2553,0.245,0.2894,0.3039,0.2907,0.284,,
2,Angola,,,0.0584,,,,,,,...,0.1311,,,,,,,,0.2149,
3,Albania,1.306,1.354,1.295,1.289,1.282,1.389,,1.305,,...,1.144,1.2379,1.2225,1.2658,1.2706,,,1.1998,,
4,Andorra,2.231,,2.435,2.47,2.594,2.549,2.594,,3.3333,...,3.112,4.0,,,,,3.3333,,,


In [36]:
physician_df_melted = physician_df.melt(id_vars=["Country Name"], 
        var_name="Year", 
        value_name="Number_of_Physicians_per1000_people")
physician_df_melted.head()

Unnamed: 0,Country Name,Year,Number_of_Physicians_per1000_people
0,Aruba,1995,1.12
1,Afghanistan,1995,
2,Angola,1995,
3,Albania,1995,1.306
4,Andorra,1995,2.231


In [37]:
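# .copy() so that adding the Zone column below does not act on a slice of the original frame
physician_df_melted = physician_df_melted.loc[physician_df_melted['Country Name'].isin(list_of_countries)].copy()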
physician_df_melted.head()

Unnamed: 0,Country Name,Year,Number_of_Physicians_per1000_people
0,Aruba,1995,1.12
7,Argentina,1995,2.68
10,Antigua and Barbuda,1995,0.76
21,"Bahamas, The",1995,1.49
24,Belize,1995,0.6


In [38]:
physician_df_melted["Zone"] = ""

In [39]:
physician_df_melted.loc[physician_df_melted['Country Name'].isin(Caribbean), 'Zone'] = 'Caribbean'
physician_df_melted.loc[physician_df_melted['Country Name'].isin(CentralAmerica), 'Zone'] = 'CentralAmerica'
physician_df_melted.loc[physician_df_melted['Country Name'].isin(SouthAmerica), 'Zone'] = 'SouthAmerica'
physician_df_melted.loc[physician_df_melted['Country Name'].isin(NorthernAmerica), 'Zone'] = 'NorthernAmerica'

In [40]:
physician_df_melted

Unnamed: 0,Country Name,Year,Number_of_Physicians_per1000_people,Zone
0,Aruba,1995,1.12,Caribbean
7,Argentina,1995,2.68,SouthAmerica
10,Antigua and Barbuda,1995,0.76,Caribbean
21,"Bahamas, The",1995,1.49,Caribbean
24,Belize,1995,0.60,CentralAmerica
...,...,...,...,...
6320,Uruguay,2018,,SouthAmerica
6321,United States,2018,,NorthernAmerica
6323,St. Vincent and the Grenadines,2018,,Caribbean
6324,"Venezuela, RB",2018,,SouthAmerica


In [41]:
physician_df_melted.to_csv("physician_filtered.csv", index=False)

#### 4. Reading the population data

In [42]:
population_df = pd.read_csv("population.csv")
population_df.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
0,Aruba,ABW,"Population, total",SP.POP.TOTL,54211.0,55438.0,56225.0,56695.0,57032.0,57360.0,...,101669.0,102046.0,102560.0,103159.0,103774.0,104341.0,104872.0,105366.0,105845.0,
1,Afghanistan,AFG,"Population, total",SP.POP.TOTL,8996973.0,9169410.0,9351441.0,9543205.0,9744781.0,9956320.0,...,29185507.0,30117413.0,31161376.0,32269589.0,33370794.0,34413603.0,35383128.0,36296400.0,37172386.0,
2,Angola,AGO,"Population, total",SP.POP.TOTL,5454933.0,5531472.0,5608539.0,5679458.0,5735044.0,5770570.0,...,23356246.0,24220661.0,25107931.0,26015780.0,26941779.0,27884381.0,28842484.0,29816748.0,30809762.0,
3,Albania,ALB,"Population, total",SP.POP.TOTL,1608800.0,1659800.0,1711319.0,1762621.0,1814135.0,1864791.0,...,2913021.0,2905195.0,2900401.0,2895092.0,2889104.0,2880703.0,2876101.0,2873457.0,2866376.0,
4,Andorra,AND,"Population, total",SP.POP.TOTL,13411.0,14375.0,15370.0,16412.0,17469.0,18549.0,...,84449.0,83747.0,82427.0,80774.0,79213.0,78011.0,77297.0,77001.0,77006.0,


In [43]:
# Keep 1995-2017: drop the metadata columns, the years 1960-1994, 2018, and 2019
years_to_drop = [str(year) for year in range(1960, 1995)] + ['2018', '2019']
population_df = population_df.drop(['Country Code', 'Indicator Name', 'Indicator Code'] + years_to_drop, axis=1)

In [44]:
population_df.head()

Unnamed: 0,Country Name,1995,1996,1997,1998,1999,2000,2001,2002,2003,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
0,Aruba,80324.0,83200.0,85451.0,87277.0,89005.0,90853.0,92898.0,94992.0,97017.0,...,101358.0,101455.0,101669.0,102046.0,102560.0,103159.0,103774.0,104341.0,104872.0,105366.0
1,Afghanistan,18110657.0,18853437.0,19357126.0,19737765.0,20170844.0,20779953.0,21606988.0,22600770.0,23680871.0,...,27722276.0,28394813.0,29185507.0,30117413.0,31161376.0,32269589.0,33370794.0,34413603.0,35383128.0,36296400.0
2,Angola,13945206.0,14400719.0,14871570.0,15359601.0,15866869.0,16395473.0,16945753.0,17519417.0,18121479.0,...,21695634.0,22514281.0,23356246.0,24220661.0,25107931.0,26015780.0,26941779.0,27884381.0,28842484.0,29816748.0
3,Albania,3187784.0,3168033.0,3148281.0,3128530.0,3108778.0,3089027.0,3060173.0,3051010.0,3039616.0,...,2947314.0,2927519.0,2913021.0,2905195.0,2900401.0,2895092.0,2889104.0,2880703.0,2876101.0,2873457.0
4,Andorra,63850.0,64360.0,64327.0,64142.0,64370.0,65390.0,67341.0,70049.0,73182.0,...,83862.0,84463.0,84449.0,83747.0,82427.0,80774.0,79213.0,78011.0,77297.0,77001.0


In [45]:
population_df_melted = population_df.melt(id_vars=["Country Name"], 
        var_name="Year", 
        value_name="Population")
population_df_melted.head()

Unnamed: 0,Country Name,Year,Population
0,Aruba,1995,80324.0
1,Afghanistan,1995,18110657.0
2,Angola,1995,13945206.0
3,Albania,1995,3187784.0
4,Andorra,1995,63850.0


In [46]:
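# .copy() so that adding the Zone column below does not act on a slice of the original frame
population_df_melted = population_df_melted.loc[population_df_melted['Country Name'].isin(list_of_countries)].copy()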
population_df_melted['Zone'] = ""
population_df_melted.head()

Unnamed: 0,Country Name,Year,Population,Zone
0,Aruba,1995,80324.0,
7,Argentina,1995,34828170.0,
10,Antigua and Barbuda,1995,68670.0,
21,"Bahamas, The",1995,280184.0,
24,Belize,1995,206963.0,


In [47]:
population_df_melted.loc[population_df_melted['Country Name'].isin(Caribbean), 'Zone'] = 'Caribbean'
population_df_melted.loc[population_df_melted['Country Name'].isin(CentralAmerica), 'Zone'] = 'CentralAmerica'
population_df_melted.loc[population_df_melted['Country Name'].isin(SouthAmerica), 'Zone'] = 'SouthAmerica'
population_df_melted.loc[population_df_melted['Country Name'].isin(NorthernAmerica), 'Zone'] = 'NorthernAmerica'

In [48]:
population_df_melted.to_csv("population_filtered.csv", index=False)

### Now we will combine all of the above indicators with our main dataframe, which holds the PAHO mortality data

In [49]:
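main_df.drop(['Gender', 'AgeGroupCode'], axis=1, inplace=True)
main_df["ICD10"] = main_df["ICD10"].str[:3]  # keep only the three-character ICD-10 category
main_df.to_csv("mortality_filtered.csv", index=False)  # saved so that it can be read back in below
main_df.head()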

Unnamed: 0,CountryName,MortalityYear,ICD10,Deaths,Class,Zone
0,Brazil,2017,J41,1,Bronchitis,SouthAmerica
1,Brazil,2017,J40,1,Bronchitis,SouthAmerica
2,Brazil,2017,J42,1,Bronchitis,SouthAmerica
3,Brazil,2017,J41,1,Bronchitis,SouthAmerica
4,Brazil,2017,J41,1,Bronchitis,SouthAmerica


In [50]:
main_df_clean = pd.DataFrame(main_df.groupby(['CountryName', 'MortalityYear', 'ICD10', 'Class', 'Zone'])['Deaths'].sum())
main_df_clean.reset_index(inplace=True)
main_df_clean.head()

Unnamed: 0,CountryName,MortalityYear,ICD10,Class,Zone,Deaths
0,Antigua and Barbuda,1997,J20,Bronchitis,Caribbean,1
1,Antigua and Barbuda,1998,J42,Bronchitis,Caribbean,1
2,Antigua and Barbuda,2005,J20,Bronchitis,Caribbean,1
3,Antigua and Barbuda,2012,J42,Bronchitis,Caribbean,1
4,Antigua and Barbuda,2016,J40,Bronchitis,Caribbean,1
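
After the gender and age-group columns are dropped, several rows share the same country, year, and ICD-10 category, and the `groupby` above adds their death counts together. A minimal sketch of that aggregation on invented rows is shown below. Passing `as_index=False` is an alternative to wrapping the result in `pd.DataFrame` and calling `reset_index`.

```python
import pandas as pd

# Hypothetical rows: the same (country, year, category) appears more than once.
toy_df = pd.DataFrame({
    "CountryName": ["Brazil", "Brazil", "Brazil", "Cuba"],
    "MortalityYear": [2017, 2017, 2017, 2017],
    "ICD10": ["J41", "J41", "J40", "J41"],
    "Deaths": [1, 2, 4, 3],
})

# as_index=False keeps the grouping keys as ordinary columns in the result.
totals = toy_df.groupby(
    ["CountryName", "MortalityYear", "ICD10"], as_index=False
)["Deaths"].sum()
print(totals)
```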


#### Merging population df with main df

In [51]:
pop_df = pd.read_csv("population_filtered.csv")

In [52]:
pop_df.head()

Unnamed: 0,Country Name,Year,Population,Zone
0,Aruba,1995,80324.0,Caribbean
1,Argentina,1995,34828170.0,SouthAmerica
2,Antigua and Barbuda,1995,68670.0,Caribbean
3,"Bahamas, The",1995,280184.0,Caribbean
4,Belize,1995,206963.0,CentralAmerica


In [53]:
final_df = pd.read_csv("mortality_filtered.csv")
final_df.head()

Unnamed: 0,CountryName,MortalityYear,ICD10,Deaths,Class,Zone
0,Brazil,2017,J41,1,Bronchitis,SouthAmerica
1,Brazil,2017,J40,1,Bronchitis,SouthAmerica
2,Brazil,2017,J42,1,Bronchitis,SouthAmerica
3,Brazil,2017,J41,1,Bronchitis,SouthAmerica
4,Brazil,2017,J41,1,Bronchitis,SouthAmerica


In [54]:
final_df_cleaned = pd.DataFrame(final_df.groupby(['CountryName', 'MortalityYear', 'ICD10', 'Class', 'Zone'])['Deaths'].sum())
final_df_cleaned.reset_index(inplace=True)
final_df_cleaned.head()

Unnamed: 0,CountryName,MortalityYear,ICD10,Class,Zone,Deaths
0,Antigua and Barbuda,1997,J20,Bronchitis,Caribbean,1
1,Antigua and Barbuda,1998,J42,Bronchitis,Caribbean,1
2,Antigua and Barbuda,2005,J20,Bronchitis,Caribbean,1
3,Antigua and Barbuda,2012,J42,Bronchitis,Caribbean,1
4,Antigua and Barbuda,2016,J40,Bronchitis,Caribbean,1


In [55]:
final_df_merge = final_df_cleaned.merge(pop_df, left_on=['CountryName','MortalityYear'], 
                                                right_on=['Country Name','Year'], how='left')
final_df_merge.drop(['Zone_y', 'Country Name', 'Year'], axis=1, inplace=True)
final_df_merge.columns

Index(['CountryName', 'MortalityYear', 'ICD10', 'Class', 'Zone_x', 'Deaths',
       'Population'],
      dtype='object')

In [56]:
final_df_merge.head()

Unnamed: 0,CountryName,MortalityYear,ICD10,Class,Zone_x,Deaths,Population
0,Antigua and Barbuda,1997,J20,Bronchitis,Caribbean,1,71704.0
1,Antigua and Barbuda,1998,J42,Bronchitis,Caribbean,1,73224.0
2,Antigua and Barbuda,2005,J20,Bronchitis,Caribbean,1,81465.0
3,Antigua and Barbuda,2012,J42,Bronchitis,Caribbean,1,90409.0
4,Antigua and Barbuda,2016,J40,Bronchitis,Caribbean,1,94527.0
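
A left merge keeps every mortality row and leaves the indicator empty where no matching country and year exist. One way to audit such gaps is sketched below. The frames and population figures are invented, and `validate` and `indicator` are optional `merge` parameters that the notebook itself does not use.

```python
import pandas as pd

# Hypothetical mortality totals and population figures.
deaths = pd.DataFrame({
    "CountryName": ["Cuba", "Cuba", "Peru"],
    "MortalityYear": [2000, 2001, 2000],
    "Deaths": [5, 7, 9],
})
population = pd.DataFrame({
    "Country Name": ["Cuba", "Peru"],
    "Year": [2000, 2000],
    "Population": [11_000_000, 26_000_000],
})

merged = deaths.merge(
    population,
    left_on=["CountryName", "MortalityYear"],
    right_on=["Country Name", "Year"],
    how="left",
    validate="many_to_one",  # raises if a (country, year) pair repeats on the right
    indicator=True,          # adds a _merge column describing each row's origin
)

# Rows that found no population value for their country and year.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["CountryName", "MortalityYear"]])
```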


#### Reading the filtered physician data to merge with our final dataframe

In [57]:
phy_df = pd.read_csv("physician_filtered.csv")
phy_df.head()

Unnamed: 0,Country Name,Year,Number_of_Physicians_per1000_people,Zone
0,Aruba,1995,1.12,Caribbean
1,Argentina,1995,2.68,SouthAmerica
2,Antigua and Barbuda,1995,0.76,Caribbean
3,"Bahamas, The",1995,1.49,Caribbean
4,Belize,1995,0.6,CentralAmerica


In [58]:
final_df_merge = final_df_cleaned.merge(phy_df, left_on=['CountryName','MortalityYear'], 
                                                right_on=['Country Name','Year'], how='left')
final_df_merge.drop(['Zone_x','Zone_y', 'Country Name', 'Year'], axis=1, inplace=True)
final_df_merge.head()

Unnamed: 0,CountryName,MortalityYear,ICD10,Class,Deaths,Number_of_Physicians_per1000_people
0,Antigua and Barbuda,1997,J20,Bronchitis,1,
1,Antigua and Barbuda,1998,J42,Bronchitis,1,
2,Antigua and Barbuda,2005,J20,Bronchitis,1,
3,Antigua and Barbuda,2012,J42,Bronchitis,1,
4,Antigua and Barbuda,2016,J40,Bronchitis,1,


#### Reading the filtered health expenditure data to merge with final dataframe

In [59]:
df_health = pd.read_csv("health_filtered.csv")
df_health.head()

Unnamed: 0,Country Name,Year,Health_Expenditure,Zone
0,Aruba,2000,,Caribbean
1,Argentina,2000,705.199321,SouthAmerica
2,Antigua and Barbuda,2000,383.915161,Caribbean
3,"Bahamas, The",2000,1084.29286,Caribbean
4,Belize,2000,132.615056,CentralAmerica


In [60]:
final_df_merge = final_df_cleaned.merge(df_health, left_on=['CountryName','MortalityYear'], 
                                                right_on=['Country Name','Year'], how='left')
final_df_merge.drop(['Zone_x','Zone_y', 'Country Name', 'Year'], axis=1, inplace=True)
final_df_merge.head()

Unnamed: 0,CountryName,MortalityYear,ICD10,Class,Deaths,Health_Expenditure
0,Antigua and Barbuda,1997,J20,Bronchitis,1,
1,Antigua and Barbuda,1998,J42,Bronchitis,1,
2,Antigua and Barbuda,2005,J20,Bronchitis,1,518.435457
3,Antigua and Barbuda,2012,J42,Bronchitis,1,644.418095
4,Antigua and Barbuda,2016,J40,Bronchitis,1,


#### Reading the filtered GDP dataset to merge with final dataframe

In [61]:
df_gdp = pd.read_csv("gdp_filtered.csv")
df_gdp.head()

Unnamed: 0,Country Name,Year,GDP,Zone
0,Aruba,1995,26705.181,Caribbean
1,Argentina,1995,7666.530004,SouthAmerica
2,Antigua and Barbuda,1995,11201.74013,Caribbean
3,"Bahamas, The",1995,27018.34496,Caribbean
4,Belize,1995,3375.374233,CentralAmerica


In [62]:
final_df_merge = final_df_cleaned.merge(df_gdp, left_on=['CountryName','MortalityYear'], 
                                                right_on=['Country Name','Year'], how='left')
final_df_merge.drop(['Zone_x','Zone_y', 'Country Name', 'Year'], axis=1, inplace=True)
final_df_merge.head()

Unnamed: 0,CountryName,MortalityYear,ICD10,Class,Deaths,GDP
0,Antigua and Barbuda,1997,J20,Bronchitis,1,12062.0076
1,Antigua and Barbuda,1998,J42,Bronchitis,1,12370.46666
2,Antigua and Barbuda,2005,J20,Bronchitis,1,14097.09667
3,Antigua and Barbuda,2012,J42,Bronchitis,1,12876.88929
4,Antigua and Barbuda,2016,J40,Bronchitis,1,13917.95112


In [63]:
final_df_merge.head()

Unnamed: 0,CountryName,MortalityYear,ICD10,Class,Deaths,GDP
0,Antigua and Barbuda,1997,J20,Bronchitis,1,12062.0076
1,Antigua and Barbuda,1998,J42,Bronchitis,1,12370.46666
2,Antigua and Barbuda,2005,J20,Bronchitis,1,14097.09667
3,Antigua and Barbuda,2012,J42,Bronchitis,1,12876.88929
4,Antigua and Barbuda,2016,J40,Bronchitis,1,13917.95112


In [64]:
# Join the indicators on (country, year) so that each value lands on the matching mortality row
indicators = pop_df.drop(columns='Zone')
for indicator_df in (phy_df, df_health, df_gdp):
    indicators = indicators.merge(indicator_df.drop(columns='Zone'), on=['Country Name', 'Year'], how='outer')
final_dataframe = final_df_cleaned.merge(indicators, left_on=['CountryName', 'MortalityYear'],
                                         right_on=['Country Name', 'Year'], how='left')

In [65]:
final_dataframe.drop(['Country Name','Year'], axis=1, inplace=True)

In [66]:
final_dataframe.drop(["Zone"], axis=1, inplace=True)

In [67]:
final_dataframe.head()

Unnamed: 0,CountryName,MortalityYear,ICD10,Class,Deaths,Population,Number_of_Physicians_per1000_people,Health_Expenditure,GDP
0,Antigua and Barbuda,1997,J20,Bronchitis,1,71704.0,,,12062.0076
1,Antigua and Barbuda,1998,J42,Bronchitis,1,73224.0,,,12370.46666
2,Antigua and Barbuda,2005,J20,Bronchitis,1,81465.0,,518.435457,14097.09667
3,Antigua and Barbuda,2012,J42,Bronchitis,1,90409.0,,644.418095,12876.88929
4,Antigua and Barbuda,2016,J40,Bronchitis,1,94527.0,,,13917.95112


In [68]:
final_dataframe.loc[final_dataframe['CountryName'].isin(Caribbean), 'Zone'] = 'Caribbean'
final_dataframe.loc[final_dataframe['CountryName'].isin(CentralAmerica), 'Zone'] = 'CentralAmerica'
final_dataframe.loc[final_dataframe['CountryName'].isin(SouthAmerica), 'Zone'] = 'SouthAmerica'
final_dataframe.loc[final_dataframe['CountryName'].isin(NorthernAmerica), 'Zone'] = 'NorthernAmerica'

In [69]:
final_dataframe.head()

Unnamed: 0,CountryName,MortalityYear,ICD10,Class,Deaths,Population,Number_of_Physicians_per1000_people,Health_Expenditure,GDP,Zone
0,Antigua and Barbuda,1997,J20,Bronchitis,1,71704.0,,,12062.0076,Caribbean
1,Antigua and Barbuda,1998,J42,Bronchitis,1,73224.0,,,12370.46666,Caribbean
2,Antigua and Barbuda,2005,J20,Bronchitis,1,81465.0,,518.435457,14097.09667,Caribbean
3,Antigua and Barbuda,2012,J42,Bronchitis,1,90409.0,,644.418095,12876.88929,Caribbean
4,Antigua and Barbuda,2016,J40,Bronchitis,1,94527.0,,,13917.95112,Caribbean


In [70]:
final_dataframe.to_csv("final_df_filtered.csv", index=False)
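
When tables are combined column-wise, it matters how rows are paired: `pd.concat(..., axis=1)` lines rows up by index label, while `merge` lines them up by the values in the key columns. The sketch below uses invented two-row frames to show the difference, which is worth checking before the final table is used for modelling.

```python
import pandas as pd

# Hypothetical frames whose rows are in a different country order.
deaths = pd.DataFrame({"CountryName": ["Peru", "Cuba"], "Deaths": [9, 5]})
population = pd.DataFrame({"CountryName": ["Cuba", "Peru"], "Population": [11, 26]})

# concat(axis=1) pairs row 0 with row 0 and row 1 with row 1, whatever the country.
by_position = pd.concat([deaths, population["Population"]], axis=1)

# merge pairs rows that share the same CountryName value.
by_key = deaths.merge(population, on="CountryName", how="left")

print(by_position)  # Peru ends up next to Cuba's population
print(by_key)       # Peru ends up next to its own population
```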