# **Analysis of Heart Disease Mortality Data**

#### **Samarth Tuli, Ndenko Fontem, Yi Zhu, Kimia Samieinejad**    


### **Importance of Heart Disease Mortality Research**    
People across the world have experienced a wide range of physical diseases, from diabetes to arthritis, and mental health problems, from depression to suicide, within the past two decades, which has made global health a significant priority for the research community. One of the most important disease types being studied is heart disease, especially coronary (ischemic) artery disease. According to the [World Health Organization](https://www.who.int/health-topics/cardiovascular-diseases#tab=tab_1), heart disease is the leading global cause of death, resulting in an estimated 17.9 million deaths each year. More than 4 out of every 5 heart disease deaths are caused by heart attacks and strokes, which can be triggered by a range of mental and physical health factors, and one third of these deaths occur prematurely in people under the age of 70, which means heart disease is not only a risk for the elderly. Research efforts led by NIH and WHO have shown that a wide variety of behavioral, medical, and socioeconomic risk factors can be underlying causes of heart disease mortality, including tobacco and alcohol consumption, obesity, lack of physical activity, malnutrition, raised blood pressure, and restricted access to primary healthcare facilities. Other comorbidities (pre-existing conditions) such as diabetes, arthritis, chronic kidney disease, and anxiety problems have also been identified as risk factors. Given the massive amount of data collected on these factors and how much heart disease deaths can vary between countries' populations, data scientists have a significant role in breaking this data down into insights that can guide the future path of heart disease mortality research. In our dataset, we define the heart disease mortality rate as the number of heart disease-related deaths per 100,000 people. We cover it for all countries from 2012-2017 and focus on countries within North America and Europe.

### **What is coronary heart disease?**

According to the NIH, [coronary heart disease](https://www.nhlbi.nih.gov/health/coronary-heart-disease) (CHD) is a type of heart disease in which the arteries of the heart cannot deliver enough oxygen-rich blood to the heart. The primary cause of CHD is cholesterol-containing plaque building up along the lining of the arteries, which can restrict blood flow, stop blood vessels from functioning normally, and increase the chances of severe chest pain, heart attack, stroke, or cardiac arrest. Although the risk of coronary heart disease can be reduced through lifestyle changes, many people don't take immediate action. This has resulted in it becoming widespread: in the US alone, generalized heart disease causes 650,000 deaths per year, 11% of adults have been diagnosed with heart disease, and CHD specifically causes 366,000 deaths annually. This background shows why heart disease mortality data needs to be analyzed by data scientists to provide insights that can help mitigate risk for future heart patients.

### **Tutorial Purpose**

The objective of this tutorial is to evaluate factors that may positively or negatively affect the heart disease mortality rates of populations across different countries, so that we can better understand which factors the research community should focus on most to reduce the overall risk of heart disease-related deaths in North American and European countries. Data science is the right tool for this because it allows us to break complex heart disease mortality data down into specific insights and recommendations that heart disease research leaders and policymakers can act on. We follow a 5-stage pipeline: data collection, data processing, exploratory data analysis and visualization, analysis (hypothesis testing and ML models), and insights and policy decisions.

## **Data Collection and Processing**

We collected data from the Global Health Observatory database of the [World Health Organization](https://www.who.int/data/gho) (WHO). This specialized agency of the United Nations is responsible for coordinating global health activities and helping governments around the world improve their healthcare systems. The input features we retrieved from WHO for each country and year were age-standardized suicide rate (per 100,000 people), mean BMI, raised blood pressure, mean HDL cholesterol, mean systolic blood pressure, health expenditure as % of GDP, and % of overweight adults. We also collected data from [Our World in Data](https://ourworldindata.org/), an online scientific publication that focuses on global challenges such as poverty, disease, and inequality. From it we retrieved the heart disease mortality rate (deaths per 100,000 people), which is the target feature we are trying to predict, along with two more input features: deaths caused by type 1 and type 2 diabetes in each country. We aimed to collect data for all 183 countries in 2012-2017 for all 9 features (including heart disease mortality rate) from WHO and Our World in Data. However, some of the datasets do not contain data for all 183 countries, or do not cover our full designated year range (2012-2017), for every feature. To address this, we performed imputation via linear regression to predict and fill in the missing values for each input feature.
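
The imputation step itself is not shown at this point in the notebook, so the following is only a minimal sketch of what per-country linear-regression imputation could look like. It assumes each feature is roughly linear in the year within a country and that a country has at least two observed values. The helper name `impute_by_year_trend` and the toy numbers are hypothetical and are not taken from the datasets.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model


def impute_by_year_trend(df, value_col, group_col="Country", year_col="Year"):
    """Fill gaps in value_col with a per-country linear fit of value on year.

    Hypothetical helper for illustration; not the notebook's own implementation.
    """
    out = df.copy()
    for _, idx in out.groupby(group_col).groups.items():
        group = out.loc[idx]
        known = group[value_col].notna()
        # A line needs at least two observed points; skip complete groups too.
        if known.sum() < 2 or known.all():
            continue
        model = linear_model.LinearRegression()
        model.fit(group.loc[known, [year_col]], group.loc[known, value_col])
        missing = group.loc[~known]
        out.loc[missing.index, value_col] = model.predict(missing[[year_col]])
    return out


# Toy data (made up): one country with a gap in 2014.
toy = pd.DataFrame({
    "Country": ["A"] * 4,
    "Year": [2012, 2013, 2014, 2015],
    "Mean BMI": [24.0, 24.5, np.nan, 25.5],
})
filled = impute_by_year_trend(toy, "Mean BMI")
print(filled)
```

Fitting one model per country keeps a country's trend from being pulled toward other countries' levels, at the cost of leaving gaps unfilled where a country has fewer than two observations.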

In [None]:
# packages for data collection and processing
import numpy as np
import pandas as pd

# packages for plotting graphs (feature visualization)
import matplotlib.pyplot as plt

# packages for ML and filling in missing values
from sklearn import linear_model

### **Add heart disease mortality (death rate) data**
The first step we will take is to import the heart disease death rate data (our target feature) and remove the columns we won't use. This feature is an age-standardized estimate for both sexes and has data from 1990-2019, but we will only use the rows from 2012-2017 in this table.

In [None]:
# Import and process age-standardized heart disease death rates data for both sexes (# of heart disease deaths per 100000 people)
heart_disease_death_rates = pd.read_csv('data/cardiovascular-disease-death-rates.csv')
heart_disease_death_rates = heart_disease_death_rates[['Entity', 'Year', 'Deaths - Cardiovascular diseases - Sex: Both - Age: Age-standardized (Rate)']]
heart_disease_death_rates = heart_disease_death_rates.rename(columns = {"Entity":"Country", 'Deaths - Cardiovascular diseases - Sex: Both - Age: Age-standardized (Rate)':'Heart Disease Mortality Rate (Deaths per 100K people)'})
heart_disease_death_rates = heart_disease_death_rates[(heart_disease_death_rates['Year']>=2012) & (heart_disease_death_rates['Year']<=2017)]
heart_disease_death_rates = heart_disease_death_rates.reset_index(drop = True)
# Ratio of missing to non-missing values, as a percentage (count() excludes NaN)
percentage_missing_values_heart_disease_mortality = ((float(heart_disease_death_rates['Heart Disease Mortality Rate (Deaths per 100K people)'].isna().sum()))/heart_disease_death_rates['Heart Disease Mortality Rate (Deaths per 100K people)'].count()) * 100.0
print("Percentage Missing Values: ", percentage_missing_values_heart_disease_mortality)
# Number of distinct countries in the table
print(len(heart_disease_death_rates['Country'].unique()))
heart_disease_death_rates
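
One detail of the missing-value calculation is worth a small illustration. `Series.count()` skips NaN, so dividing the number of missing values by `count()` gives the ratio of missing to non-missing values rather than the share of all rows that are missing. The sketch below contrasts the two on a made-up column; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# Toy column (made up): 1 missing value out of 4 rows.
values = pd.Series([310.2, np.nan, 295.7, 301.4])

missing = values.isna().sum()

# Missing relative to non-missing rows: count() ignores NaN, so this is 1 / 3.
ratio_to_present = missing / values.count() * 100.0

# Missing relative to all rows: the mean of a boolean mask is the share of True, so 1 / 4.
share_of_rows = values.isna().mean() * 100.0

print(ratio_to_present, share_of_rows)
```

The two numbers agree only when nothing is missing, and the gap between them grows as more values are missing.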

### **Add anxiety disorder prevalence data**
The second step we will take is to import the anxiety disorder prevalence data and remove the columns we aren't going to use. This feature comes from Our World in Data and is an age-standardized estimate of the % of each country's population with anxiety disorders. The anxiety prevalence table includes data from 2012-2017 for both sexes. There are no missing values in the anxiety data before joining, so no imputation is required at that stage. However, after merging the anxiety data with the heart disease mortality data frame, we can see that 6.5% of the anxiety values in the main data frame are missing. Therefore, we have to use linear regression to predict the remaining values (not done yet), which will introduce a little bias.

In [None]:
# Import and process % of population with anxiety disorders data for both sexes
# Extract 3 fields (country, year, and anxiety disorder prevalence)
anxiety_prevalence_all_countries = pd.read_csv('data/anxiety-disorders-prevalence.csv')
anxiety_prevalence_all_countries = anxiety_prevalence_all_countries[['Entity', 'Year', 'Anxiety disorders (share of population) - Sex: Both - Age: Age-standardized']]
anxiety_prevalence_all_countries = anxiety_prevalence_all_countries[(anxiety_prevalence_all_countries['Year']>=2012) & (anxiety_prevalence_all_countries['Year']<=2017)]
anxiety_prevalence_all_countries = anxiety_prevalence_all_countries.rename(columns = {'Entity':'Country', 'Anxiety disorders (share of population) - Sex: Both - Age: Age-standardized':"% of population with anxiety disorders"})
anxiety_prevalence_all_countries = anxiety_prevalence_all_countries.sort_values(by = ['Country', 'Year'])
anxiety_prevalence_all_countries = anxiety_prevalence_all_countries.reset_index(drop = True)
# Compute % of missing values in anxiety disorder data
percentage_missing_values_anxiety_before = ((float(anxiety_prevalence_all_countries['% of population with anxiety disorders'].isna().sum()))/anxiety_prevalence_all_countries['% of population with anxiety disorders'].count()) * 100.0
print(percentage_missing_values_anxiety_before)
# Attach the anxiety column to the main data frame; column assignment aligns on the row index, so both tables must share the same row order
heart_disease_death_rates['% of population with anxiety disorders'] = anxiety_prevalence_all_countries['% of population with anxiety disorders']
# Compute % of missing values in anxiety data after join
percentage_missing_values_anxiety_after = ((float(heart_disease_death_rates['% of population with anxiety disorders'].isna().sum()))/heart_disease_death_rates['% of population with anxiety disorders'].count()) * 100.0
print(percentage_missing_values_anxiety_after)
# Output heart disease death rates df with anxiety column
heart_disease_death_rates
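
Because assigning one data frame's column to another aligns rows by index position rather than by country and year, the two tables only line up if they contain exactly the same country-year rows in the same order. As a hedged alternative, the sketch below joins on the `Country` and `Year` keys with `DataFrame.merge`. The toy frames and their values are made up, and this is not necessarily how the rest of the notebook proceeds.

```python
import pandas as pd

col = "% of population with anxiety disorders"

# Toy tables (made up): the feature table covers a different set of countries.
main = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2012, 2013, 2012, 2013],
    "Heart Disease Mortality Rate (Deaths per 100K people)": [300.0, 298.0, 150.0, 149.0],
})
extra = pd.DataFrame({
    "Country": ["B", "B", "C", "C"],
    "Year": [2012, 2013, 2012, 2013],
    col: [3.1, 3.2, 4.0, 4.1],
})

# Positional assignment: row 0 (country A) silently receives country B's value.
by_position = main.copy()
by_position[col] = extra[col]

# Key-based left join: every row of main is kept, unmatched rows become NaN.
by_key = main.merge(extra, on=["Country", "Year"], how="left")

print(by_position)
print(by_key)
```

With the left join, country A's rows are correctly marked as missing and country B's rows receive their own values, so any later imputation works from correctly aligned data.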

### **Add BMI data**

The third step we will take is to import the mean BMI data and remove the columns we aren't going to use. This feature data comes from the World Health Organization. The mean BMI table includes data from 2012-2017 and reports BMI separately by sex. We will only consider the mean BMI for both sexes combined (an average over the two sexes). There are no missing values in the BMI data before joining, so no imputation is required at that stage. However, after merging the BMI data with the heart disease mortality data frame, we can see that 43.25% of the BMI values in the main data frame are missing. Therefore, we have to use linear regression to predict the remaining values (not done yet), which will introduce some bias.

In [None]:
# Import and process mean BMI data for both sexes
bmi_all_countries = pd.read_csv('data/BMI.csv')
bmi_all_countries = bmi_all_countries[['Location', 'Period', 'Dim1','FactValueNumeric']]
bmi_all_countries = bmi_all_countries.rename(columns = {"Location": "Country", "Period": "Year", "Dim1":"Sex", "FactValueNumeric": "Mean BMI"})
bmi_all_countries = bmi_all_countries[bmi_all_countries["Sex"] == "Both sexes"]
# Sort data frame by country and year and reset index
bmi_all_countries = bmi_all_countries.sort_values(by=['Country', 'Year'])
bmi_all_countries = bmi_all_countries.reset_index(drop = True)
bmi_all_countries = bmi_all_countries[['Country', 'Year', 'Mean BMI']]
# Compute % of missing values in BMI data frame before join
percentage_missing_values_bmi_before = ((float(bmi_all_countries['Mean BMI'].isna().sum()))/bmi_all_countries['Mean BMI'].count()) * 100.0
print("Percentage of Missing Values for Mean BMI Before Join: ", percentage_missing_values_bmi_before)
# Merge BMI data with heart disease mortality data frame and compute % of missing values after join
heart_disease_death_rates['Mean BMI'] = bmi_all_countries['Mean BMI']
percentage_missing_values_bmi_after = ((float(heart_disease_death_rates['Mean BMI'].isna().sum()))/heart_disease_death_rates['Mean BMI'].count()) * 100.0
print("Percentage of Missing Values in Mean BMI After Join: ", percentage_missing_values_bmi_after)
heart_disease_death_rates
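
A large share of missing values after a join can have several causes. One possibility (an assumption, not something verified against these files) is that the two sources spell some country names differently. The sketch below shows a quick diagnostic using set differences on made-up name lists, followed by a hypothetical `name_map` to harmonize them. The specific names are illustrative only.

```python
import pandas as pd

# Made-up country lists standing in for the 'Country' columns of two sources.
main_countries = pd.Series(["Canada", "France", "United States"])
who_countries = pd.Series(["Canada", "France", "United States of America"])

# Names that appear in only one of the two tables.
only_in_main = sorted(set(main_countries) - set(who_countries))
only_in_who = sorted(set(who_countries) - set(main_countries))
print(only_in_main, only_in_who)

# Hypothetical mapping from one source's spelling to the other's.
name_map = {"United States of America": "United States"}
who_harmonized = who_countries.replace(name_map)
print(sorted(set(who_harmonized) - set(main_countries)))
```

Running such a check before joining makes it easier to tell values that are truly unavailable from values lost to a naming mismatch.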

### **Add mean HDL cholesterol data**

The fourth step we will take is to import the mean HDL cholesterol data and remove the columns we aren't going to use. This feature data comes from the World Health Organization. The mean HDL cholesterol table includes data from 2012-2017 and reports HDL cholesterol separately by sex. We will only consider the mean HDL cholesterol for both sexes combined (an average over the two sexes). There are no missing values in the mean HDL cholesterol data before joining, so no imputation is required at that stage. However, after merging the cholesterol data with the heart disease mortality data frame, we can see that _% of the cholesterol values in the main data frame are missing. Therefore, we have to use linear regression to predict the remaining values (not done yet), which will introduce some bias.

In [None]:
# Import mean HDL cholesterol data of both sexes
mean_HDL_cholesterol = pd.read_csv("data/cholestrol2012-2017.csv")
mean_HDL_cholesterol = mean_HDL_cholesterol[['Location', 'Period', 'Dim1','FactValueNumeric']]
mean_HDL_cholesterol = mean_HDL_cholesterol.rename(columns = {"Location": "Country", "Period": "Year", "Dim1":"Sex", "FactValueNumeric": "Cholesterol Level"})
mean_HDL_cholesterol = mean_HDL_cholesterol[mean_HDL_cholesterol["Sex"] == "Both sexes"]
# Sort data frame by country and year and reset index
mean_HDL_cholesterol = mean_HDL_cholesterol.sort_values(by=['Country', 'Year'])
mean_HDL_cholesterol = mean_HDL_cholesterol.reset_index(drop = True)
mean_HDL_cholesterol  = mean_HDL_cholesterol[['Country', 'Year', 'Cholesterol Level']]
# Compute % of missing values in cholesterol data frame before join
percentage_missing_values_cholesterol_before = ((float(mean_HDL_cholesterol['Cholesterol Level'].isna().sum()))/mean_HDL_cholesterol['Cholesterol Level'].count()) * 100.0
print("Percentage of Missing Values for Mean HDL Cholesterol Before Join: ", percentage_missing_values_cholesterol_before)
# Merge cholesterol data with heart disease mortality data frame and compute % of missing values after join
heart_disease_death_rates['Mean HDL Cholesterol'] = mean_HDL_cholesterol['Cholesterol Level']
percentage_missing_values_cholesterol_after = ((float(heart_disease_death_rates['Mean HDL Cholesterol'].isna().sum()))/heart_disease_death_rates['Mean HDL Cholesterol'].count()) * 100.0
print("Percentage of Missing Values in Cholesterol Level After Join: ", percentage_missing_values_cholesterol_after)
heart_disease_death_rates
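
The BMI and HDL cholesterol cells apply the same sequence of steps to a WHO table: select columns, rename them, keep the "Both sexes" rows, then sort. As a sketch under that assumption, the steps could be wrapped in one function. The name `tidy_who_indicator` is hypothetical and the toy input below is made up.

```python
import pandas as pd


def tidy_who_indicator(raw, value_name, sex="Both sexes"):
    """Reduce a WHO-style table to Country, Year, and one value column.

    Hypothetical helper; assumes the columns Location, Period, Dim1,
    and FactValueNumeric used in the cells above.
    """
    df = raw[["Location", "Period", "Dim1", "FactValueNumeric"]].rename(
        columns={
            "Location": "Country",
            "Period": "Year",
            "Dim1": "Sex",
            "FactValueNumeric": value_name,
        }
    )
    # Keep only the combined estimate, then order rows by country and year.
    df = df[df["Sex"] == sex]
    df = df.sort_values(by=["Country", "Year"]).reset_index(drop=True)
    return df[["Country", "Year", value_name]]


# Toy input (made up) with one row that should be filtered out.
raw = pd.DataFrame({
    "Location": ["B", "A", "A"],
    "Period": [2012, 2013, 2013],
    "Dim1": ["Both sexes", "Both sexes", "Male"],
    "FactValueNumeric": [1.2, 1.3, 1.1],
})
tidy = tidy_who_indicator(raw, "Mean HDL Cholesterol")
print(tidy)
```

A shared helper keeps the filtering and sorting identical across features, which matters when later steps rely on rows being in a consistent order.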

In [None]:
# Import and process low physical activity burden data (DALYs, all causes, both sexes, all ages)
low_phys_activity = pd.read_csv("data/disease-burden-by-risk-factor.csv")
low_phys_activity = low_phys_activity[['Entity', 'Year', 'DALYs (Disability-Adjusted Life Years) - Cause: All causes - Risk: Low physical activity - Sex: Both - Age: All Ages (Number)']]
# Keep only the years 2012-2017
low_phys_activity = low_phys_activity.loc[low_phys_activity['Year'].isin([2012,2013,2014,2015,2016,2017])]
low_phys_activity = low_phys_activity.rename(columns = {"Entity": "Country",  "DALYs (Disability-Adjusted Life Years) - Cause: All causes - Risk: Low physical activity - Sex: Both - Age: All Ages (Number)": "Low Physical Activity DALYs"})
# Sort data frame by country and year and reset index
low_phys_activity = low_phys_activity.sort_values(by=['Country', 'Year'])
low_phys_activity = low_phys_activity.reset_index(drop = True)
low_phys_activity = low_phys_activity[['Country', 'Year', 'Low Physical Activity DALYs']]
low_phys_activity

In [None]:
# Import and process recorded per capita (15+) spirits consumption data (litres of pure alcohol)
spirits_consumption = pd.read_csv("data/spirits-consumption-per-person.csv")
spirits_consumption = spirits_consumption[['Entity', 'Year', 'Indicator:Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol) - Beverage Types:Spirits']]
# Keep only the years 2012-2017
spirits_consumption = spirits_consumption.loc[spirits_consumption['Year'].isin([2012,2013,2014,2015,2016,2017])]
spirits_consumption = spirits_consumption.rename(columns = {"Entity": "Country",  "Indicator:Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol) - Beverage Types:Spirits": "Spirits Consumption"})
# Sort data frame by country and year and reset index
spirits_consumption = spirits_consumption.sort_values(by=['Country', 'Year'])
spirits_consumption = spirits_consumption.reset_index(drop = True)
spirits_consumption = spirits_consumption[['Country', 'Year', 'Spirits Consumption']]
spirits_consumption

In [None]:
# Import and process the age-standardized share of deaths attributed to smoking (percent, both sexes)
share_deaths_smoking = pd.read_csv("data/share-deaths-smoking.csv")
share_deaths_smoking = share_deaths_smoking[['Entity', 'Year', 'Deaths - Cause: All causes - Risk: Smoking - OWID - Sex: Both - Age: Age-standardized (Percent)']]
# Keep only the years 2012-2017
share_deaths_smoking = share_deaths_smoking.loc[share_deaths_smoking['Year'].isin([2012,2013,2014,2015,2016,2017])]
share_deaths_smoking = share_deaths_smoking.rename(columns = {"Entity": "Country",  "Deaths - Cause: All causes - Risk: Smoking - OWID - Sex: Both - Age: Age-standardized (Percent)": "smoking death percentage"})
# Sort data frame by country and year and reset index
share_deaths_smoking = share_deaths_smoking.sort_values(by=['Country', 'Year'])
share_deaths_smoking = share_deaths_smoking.reset_index(drop = True)
share_deaths_smoking = share_deaths_smoking[['Country', 'Year', 'smoking death percentage']]
share_deaths_smoking
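
The three cells above follow one pattern for Our World in Data tables: select `Entity`, `Year`, and a long source column, filter to 2012-2017, rename, and sort. A minimal sketch of that pattern as a reusable function follows. The name `tidy_owid_indicator`, its parameters, and the toy input are hypothetical.

```python
import pandas as pd


def tidy_owid_indicator(raw, source_col, value_name, first_year=2012, last_year=2017):
    """Reduce an Our World in Data-style table to Country, Year, and one value column.

    Hypothetical helper; assumes the Entity and Year columns used in the cells above.
    """
    df = raw[["Entity", "Year", source_col]]
    # between() is inclusive on both ends, matching the 2012-2017 range.
    df = df[df["Year"].between(first_year, last_year)]
    df = df.rename(columns={"Entity": "Country", source_col: value_name})
    return df.sort_values(by=["Country", "Year"]).reset_index(drop=True)


# Toy input (made up): one row falls outside the year range.
raw = pd.DataFrame({
    "Entity": ["B", "A", "A"],
    "Year": [2015, 2011, 2012],
    "Long source column name": [7.0, 5.0, 6.0],
})
tidy = tidy_owid_indicator(raw, "Long source column name", "Spirits Consumption")
print(tidy)
```

Passing the source column and the short name as arguments means each additional feature needs only one call, for example one per CSV file loaded with `pd.read_csv`.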