# **Analysis of Heart Disease Mortality Data**

#### **Samarth Tuli, Ndenko Fontem, Yi Zhu, Kimia Samieinejad**    


### **Importance of Heart Disease Mortality Research**    
Over the past two decades, people across the world have been experiencing physical diseases ranging from diabetes to arthritis, and mental health problems ranging from depression to suicide, which has made global health a significant priority for the research community. One of the most important diseases being studied is heart disease, especially coronary (ischemic) artery disease. According to the [World Health Organization](https://www.who.int/health-topics/cardiovascular-diseases#tab=tab_1), heart disease is the leading global cause of death, resulting in 17.9 million deaths annually. Four out of every five heart disease deaths are caused by heart attacks and strokes, which can be brought on by a range of mental and physical health factors, and one third of heart disease deaths occur prematurely in people under the age of 70, which means everyone is vulnerable to heart disease. Research efforts led by NIH and WHO have shown that a wide variety of behavioral, medical, and socioeconomic risk factors can be underlying causes of heart disease mortality, including tobacco and alcohol consumption, obesity, lack of physical activity, malnutrition, raised blood pressure, and restricted access to primary healthcare facilities. Other comorbidities (pre-existing conditions) such as diabetes, arthritis, chronic kidney disease, and anxiety problems can also contribute. Given the massive amount of data collected on these factors and how heart disease deaths can vary for each country's population, data scientists have a significant role in breaking this data down into insights that can guide the future path of heart disease mortality research. In our dataset, we define the heart disease mortality rate as the number of heart disease related deaths per 100,000 people. We cover it for all countries from 2012 to 2017 and focus on countries within North America and Europe.

### **What is coronary heart disease?**

According to the NIH's latest research, [coronary heart disease](https://www.nhlbi.nih.gov/health/coronary-heart-disease) (CHD) is a cardiovascular disease in which the arteries of the heart cannot deliver enough oxygen-rich blood to the heart. The primary cause of CHD is high cholesterol forming plaque along the lining of the arteries, which can constrict blood flow, cause blood vessels to stop functioning normally, and increase the chances of severe chest pain, heart attack, stroke, or cardiac arrest. Although the risk of coronary heart disease can be reduced through lifestyle changes, many people don't take immediate action. This has contributed to it becoming widespread: in the US alone, generalized heart disease causes 650,000 deaths per year, 11% of adults have been diagnosed with heart disease, and CHD specifically causes 366,000 deaths annually. This background shows why heart disease mortality data needs to be analyzed by data scientists to provide key insights that can mitigate risk for future heart patients.

### **Tutorial Purpose**

The objective of this tutorial is to evaluate the factors that may positively or negatively affect the heart disease mortality rates of populations across different countries, so that we can better understand which factors the research community should focus on most to reduce the overall risk of heart disease-related deaths in North American and European countries. Data science is the right tool for this because it allows us to deconstruct complex heart disease mortality data into specific insights and recommendations that heart disease research leaders and policymakers can use to take immediate action. We do this through a 5-stage pipeline: data collection, data processing, exploratory data analysis and visualization, analysis/hypothesis testing/use of ML models, and insights and policy decisions.

## **Data Collection and Processing**

We collected data from the Global Health Observatory database of the [World Health Organization](https://www.who.int/data/gho). This specialized agency, established by the United Nations, is responsible for harmonizing global health activities and helping governments around the world enhance their healthcare systems. The input features we obtained from WHO for each country and year were age-standardized suicide rate (per 100,000 people), mean BMI, raised blood pressure, mean HDL cholesterol, mean systolic blood pressure, health expenditure as % of GDP, and % of overweight adults. We also collected data from [Our World in Data](https://ourworldindata.org/), an online scientific publication that focuses on global challenges such as poverty, disease, and inequality. From it we obtained the heart disease mortality rate (deaths per 100,000 people), which is the target feature we are trying to predict, along with two more input features: deaths caused by type 1 and type 2 diabetes in each country. We collected data for all 183 countries in 2012-2017 for all 9 features (including heart disease mortality rate) from WHO and Our World in Data. However, some of the datasets do not contain data for all 183 countries or for every year in our designated range (2012-2017). To address this, we performed imputation via linear regression to predict and fill in the missing values for each input feature.
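
The exact imputation procedure is not shown in this section. As a minimal sketch, assuming a separate straight-line fit of value against year for each country, it might look like the following. The function name `impute_linear`, the column name `Value`, and the toy numbers are illustrative assumptions, not the tutorial's actual code or data.

```python
import numpy as np
import pandas as pd


def impute_linear(df, value_col, years=range(2012, 2018)):
    """Fill missing (Country, Year) values using a per-country linear fit on Year."""
    full_index = pd.Index(list(years), name="Year")
    filled = []
    for country, group in df.groupby("Country"):
        # Give every year in the range a row; absent years become NaN
        group = group.set_index("Year").reindex(full_index)
        group["Country"] = country
        known = group[value_col].dropna()
        missing = group[value_col].isna().to_numpy()
        # A line needs at least two observed points (assumption: otherwise leave NaN)
        if len(known) >= 2 and missing.any():
            slope, intercept = np.polyfit(
                known.index.to_numpy(dtype=float),
                known.to_numpy(dtype=float),
                deg=1,
            )
            gap_years = group.index[missing].to_numpy(dtype=float)
            group.loc[missing, value_col] = slope * gap_years + intercept
        filled.append(group.reset_index())
    return pd.concat(filled, ignore_index=True)[["Country", "Year", value_col]]


# Made-up values for illustration only: country "A" has no 2014 row
toy = pd.DataFrame({
    "Country": ["A"] * 5 + ["B"] * 6,
    "Year": [2012, 2013, 2015, 2016, 2017] + list(range(2012, 2018)),
    "Value": [10.0, 12.0, 16.0, 18.0, 20.0] + [5.0] * 6,
})
result = impute_linear(toy, "Value")
print(result[result["Country"] == "A"])
```

Fitting each country separately keeps one country's trend from influencing another's; whether a straight line is adequate for a given indicator is a modeling choice that should be checked against the data.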

In [56]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import re


# Import and process the age-standardized suicide rate for both sexes (suicides per 100,000 people)
# Keep 3 fields: country, year, and suicide rate
# Note: the renamed column below is labeled "per 10000", but the values are per 100,000 people
suicide_rates_all_countries = pd.read_csv('data/suicide rate per 100k 2012-2017.csv')
suicide_rates_all_countries = suicide_rates_all_countries[suicide_rates_all_countries["Dim1"] == "Both sexes"]
suicide_rates_all_countries = suicide_rates_all_countries[['Location', 'Period', 'FactValueNumeric']]
suicide_rates_all_countries = suicide_rates_all_countries.rename(columns = {"Location": "Country", "Period": "Year", "FactValueNumeric":"Suicides per 10000 people"})
suicide_rates_all_countries = suicide_rates_all_countries.sort_values(by = ['Country', 'Year'])
suicide_rates_all_countries = suicide_rates_all_countries.reset_index(drop = True)
suicide_rates_all_countries

|  | Country | Year | Suicides per 10000 people |
| --- | --- | --- | --- |
| 0 | Afghanistan | 2012 | 6.24 |
| 1 | Afghanistan | 2013 | 6.17 |
| 2 | Afghanistan | 2014 | 6.02 |
| 3 | Afghanistan | 2015 | 5.99 |
| 4 | Afghanistan | 2016 | 6.01 |
| ... | ... | ... | ... |
| 1093 | Zimbabwe | 2013 | 31.40 |
| 1094 | Zimbabwe | 2014 | 30.83 |
| 1095 | Zimbabwe | 2015 | 30.74 |
| 1096 | Zimbabwe | 2016 | 28.70 |
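
Before imputing, it can help to list which country-year pairs are actually absent from a loaded table. The following is a minimal sketch, assuming the `Country` and `Year` column names used above; the helper `missing_pairs` and the toy rows are hypothetical and only for illustration.

```python
import pandas as pd


def missing_pairs(df, countries, years=range(2012, 2018)):
    """Return the (Country, Year) pairs that are expected but absent from df."""
    expected = pd.MultiIndex.from_product(
        [list(countries), list(years)], names=["Country", "Year"]
    )
    present = pd.MultiIndex.from_frame(df[["Country", "Year"]])
    # Set difference: pairs in the expected grid that never appear in the data
    return expected.difference(present).to_frame(index=False)


# Made-up rows for illustration only: country "B" has no 2013 row
toy = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "Year": [2012, 2013, 2012],
})
gaps = missing_pairs(toy, countries=["A", "B"], years=[2012, 2013])
print(gaps)
```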


In [58]:
# Import and process mean BMI data
bmi_all_countries = pd.read_csv('data/BMI.csv')
# Keep 4 fields (country, year, sex, and mean BMI)
bmi_all_countries = bmi_all_countries[['Location', 'Period', 'Dim1', 'FactValueNumeric']]
bmi_all_countries = bmi_all_countries.rename(columns = {"Location": "Country", "Period": "Year", "Dim1": "Sex", "FactValueNumeric": "Mean BMI"})
# Keep only the rows that cover both sexes
bmi_all_countries = bmi_all_countries[bmi_all_countries["Sex"] == "Both sexes"]
bmi_all_countries = bmi_all_countries.sort_values(by = ['Country', 'Year'])
bmi_all_countries = bmi_all_countries.reset_index(drop = True)
bmi_all_countries

|  | Country | Year | Sex | Mean BMI |
| --- | --- | --- | --- | --- |
| 0 | Afghanistan | 2012 | Both sexes | 22.9 |
| 1 | Afghanistan | 2013 | Both sexes | 23.0 |
| 2 | Afghanistan | 2014 | Both sexes | 23.2 |
| 3 | Afghanistan | 2015 | Both sexes | 23.3 |
| 4 | Afghanistan | 2016 | Both sexes | 23.4 |
| ... | ... | ... | ... | ... |
| 970 | Zimbabwe | 2012 | Both sexes | 23.7 |
| 971 | Zimbabwe | 2013 | Both sexes | 23.7 |
| 972 | Zimbabwe | 2014 | Both sexes | 23.8 |
| 973 | Zimbabwe | 2015 | Both sexes | 23.8 |


In [59]:
# Import and process mean HDL cholesterol data
mean_HDL_cholesterol = pd.read_csv("data/cholestrol2012-2017.csv")
# Keep 4 fields (country, year, sex, and mean HDL cholesterol)
mean_HDL_cholesterol = mean_HDL_cholesterol[['Location', 'Period', 'Dim1', 'FactValueNumeric']]
mean_HDL_cholesterol = mean_HDL_cholesterol.rename(columns = {"Location": "Country", "Period": "Year", "Dim1": "Sex", "FactValueNumeric": "Cholesterol Level"})
# Keep only the rows that cover both sexes
mean_HDL_cholesterol = mean_HDL_cholesterol[mean_HDL_cholesterol["Sex"] == "Both sexes"]
mean_HDL_cholesterol = mean_HDL_cholesterol.sort_values(by = ['Country', 'Year'])
mean_HDL_cholesterol = mean_HDL_cholesterol.reset_index(drop = True)
mean_HDL_cholesterol
# No missing data in this dataset

|  | Country | Year | Sex | Cholesterol Level |
| --- | --- | --- | --- | --- |
| 0 | Afghanistan | 2012 | Both sexes | 1.1 |
| 1 | Afghanistan | 2013 | Both sexes | 1.1 |
| 2 | Afghanistan | 2014 | Both sexes | 1.1 |
| 3 | Afghanistan | 2015 | Both sexes | 1.1 |
| 4 | Afghanistan | 2016 | Both sexes | 1.1 |
| ... | ... | ... | ... | ... |
| 1141 | Zimbabwe | 2013 | Both sexes | 1.3 |
| 1142 | Zimbabwe | 2014 | Both sexes | 1.3 |
| 1143 | Zimbabwe | 2015 | Both sexes | 1.3 |
| 1144 | Zimbabwe | 2016 | Both sexes | 1.3 |
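
The WHO loading cells above repeat the same select, rename, filter, and sort steps. One possible way to factor them out is sketched below. `load_who_indicator` is a hypothetical helper, not part of the tutorial's code, and the CSV content is made up to mirror the column names (`Location`, `Period`, `Dim1`, `FactValueNumeric`) used above.

```python
import io

import pandas as pd


def load_who_indicator(source, value_name):
    """Load a CSV with WHO-style columns and return Country/Year/Sex/value for both sexes."""
    df = pd.read_csv(source)
    df = df[["Location", "Period", "Dim1", "FactValueNumeric"]]
    df = df.rename(columns={
        "Location": "Country",
        "Period": "Year",
        "Dim1": "Sex",
        "FactValueNumeric": value_name,
    })
    # Keep only the combined-sex rows, then order by country and year
    df = df[df["Sex"] == "Both sexes"]
    return df.sort_values(by=["Country", "Year"]).reset_index(drop=True)


# Made-up CSV content for illustration only
sample_csv = io.StringIO(
    "Location,Period,Dim1,FactValueNumeric\n"
    "B,2013,Both sexes,2.0\n"
    "A,2012,Male,1.5\n"
    "A,2012,Both sexes,1.0\n"
)
sample = load_who_indicator(sample_csv, "Mean BMI")
print(sample)
```

With a helper like this, each indicator would need only a file path and a column label, for example `load_who_indicator('data/BMI.csv', 'Mean BMI')`.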


In [60]:
# Import and process low physical activity data (DALYs attributed to low physical activity)
low_phys_activity = pd.read_csv("data/disease-burden-by-risk-factor.csv")
daly_column = 'DALYs (Disability-Adjusted Life Years) - Cause: All causes - Risk: Low physical activity - Sex: Both - Age: All Ages (Number)'
low_phys_activity = low_phys_activity[['Entity', 'Year', daly_column]]
# Restrict to the designated year range (2012-2017)
low_phys_activity = low_phys_activity.loc[low_phys_activity['Year'].isin([2012, 2013, 2014, 2015, 2016, 2017])]
low_phys_activity = low_phys_activity.rename(columns = {"Entity": "Country", daly_column: "Low Physical Activity DALYs"})
low_phys_activity = low_phys_activity.sort_values(by = ['Country', 'Year'])
# This dataset only covers both sexes, so add a matching Sex column
low_phys_activity.insert(2, 'Sex', 'Both sexes')
low_phys_activity = low_phys_activity.reset_index(drop = True)
low_phys_activity

|  | Country | Year | Sex | Low Physical Activity DALYs |
| --- | --- | --- | --- | --- |
| 0 | Afghanistan | 2012 | Both sexes | 87161.46 |
| 1 | Afghanistan | 2013 | Both sexes | 89240.38 |
| 2 | Afghanistan | 2014 | Both sexes | 91722.71 |
| 3 | Afghanistan | 2015 | Both sexes | 94787.74 |
| 4 | Afghanistan | 2016 | Both sexes | 97995.82 |
| ... | ... | ... | ... | ... |
| 1363 | Zimbabwe | 2013 | Both sexes | 6607.48 |
| 1364 | Zimbabwe | 2014 | Both sexes | 6911.81 |
| 1365 | Zimbabwe | 2015 | Both sexes | 7196.88 |
| 1366 | Zimbabwe | 2016 | Both sexes | 7424.67 |


In [61]:
# Import and process spirits consumption data (litres of pure alcohol per person aged 15+)
spirits_consumption = pd.read_csv("data/spirits-consumption-per-person.csv")
spirits_column = 'Indicator:Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol) - Beverage Types:Spirits'
spirits_consumption = spirits_consumption[['Entity', 'Year', spirits_column]]
# Restrict to the designated year range (2012-2017)
spirits_consumption = spirits_consumption.loc[spirits_consumption['Year'].isin([2012, 2013, 2014, 2015, 2016, 2017])]
spirits_consumption = spirits_consumption.rename(columns = {"Entity": "Country", spirits_column: "Spirits Consumption"})
spirits_consumption = spirits_consumption.sort_values(by = ['Country', 'Year'])
# This dataset only covers both sexes, so add a matching Sex column
spirits_consumption.insert(2, 'Sex', 'Both sexes')
spirits_consumption = spirits_consumption.reset_index(drop = True)
spirits_consumption



|  | Country | Year | Sex | Spirits Consumption |
| --- | --- | --- | --- | --- |
| 0 | Afghanistan | 2012 | Both sexes | 0.00 |
| 1 | Afghanistan | 2013 | Both sexes | 0.00 |
| 2 | Afghanistan | 2014 | Both sexes | 0.01 |
| 3 | Afghanistan | 2015 | Both sexes | 0.00 |
| 4 | Afghanistan | 2016 | Both sexes | 0.01 |
| ... | ... | ... | ... | ... |
| 1123 | Zimbabwe | 2013 | Both sexes | 0.47 |
| 1124 | Zimbabwe | 2014 | Both sexes | 0.36 |
| 1125 | Zimbabwe | 2015 | Both sexes | 0.36 |
| 1126 | Zimbabwe | 2016 | Both sexes | 0.35 |
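
Once each feature sits in its own Country/Year table, the tables can be combined into a single frame for analysis. The following is a minimal sketch under that assumption; `merge_features` is a hypothetical helper, and the two small frames contain made-up values rather than the real data.

```python
from functools import reduce

import pandas as pd


def merge_features(frames, keys=("Country", "Year")):
    """Outer-join feature tables on the key columns; absent values become NaN."""
    # Drop the constant 'Sex' column where present so the merge does not duplicate it
    cleaned = [frame.drop(columns="Sex", errors="ignore") for frame in frames]
    merged = reduce(
        lambda left, right: pd.merge(left, right, on=list(keys), how="outer"),
        cleaned,
    )
    return merged.sort_values(by=list(keys)).reset_index(drop=True)


# Made-up values for illustration only: country "B" has no BMI row
bmi = pd.DataFrame({
    "Country": ["A"], "Year": [2012], "Sex": ["Both sexes"], "Mean BMI": [22.9],
})
chol = pd.DataFrame({
    "Country": ["A", "B"], "Year": [2012, 2012],
    "Sex": ["Both sexes", "Both sexes"], "Cholesterol Level": [1.1, 1.3],
})
combined = merge_features([bmi, chol])
print(combined)
```

An outer join is used here so that gaps stay visible as NaN instead of silently dropping a country-year row, which keeps them available for the imputation step.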
