# Cause Of Death Project

A straightforward way to assess the health of a population is to focus on mortality, or on measures such as child mortality and life expectancy that are derived from mortality estimates. A focus on mortality alone, however, ignores that diseases do not only kill people; they also cause suffering for those who live with them. Assessing health outcomes by both mortality and morbidity (the prevalence of disease) gives a more complete picture, and that is the topic of this entry.

The sum of mortality and morbidity is referred to as the ‘burden of disease’ and can be measured with a metric called ‘Disability Adjusted Life Years’ (DALYs). DALYs measure lost health in a standardized way that allows direct comparison of disease burdens across diseases, countries, populations, and time. Conceptually, one DALY is one year of healthy life lost to premature death, disease, or disability.

The first ‘Global Burden of Disease’ (GBD) study was GBD 1990, and the DALY metric featured prominently in the World Bank’s 1993 World Development Report. Today burden-of-disease estimates are published both by researchers at the Institute for Health Metrics and Evaluation (IHME) and by the ‘Disease Burden Unit’ at the World Health Organization (WHO), which was created in 1998. The IHME continues the work started in the early 1990s and publishes the Global Burden of Disease study.
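As a rough illustration of how mortality and morbidity combine into one number, the sketch below adds years of life lost (YLL) and years lived with disability (YLD) to get DALYs. The function names, the disability weight, and every number are hypothetical and only show the arithmetic; they are not GBD estimates, and the simple incidence-based YLD formula used here ignores refinements such as age weighting and discounting.

```python
def years_of_life_lost(deaths, remaining_life_expectancy):
    """Mortality component: deaths times the years each person would still have lived."""
    return deaths * remaining_life_expectancy


def years_lived_with_disability(cases, disability_weight, avg_duration_years):
    """Morbidity component: cases scaled by severity (0 = full health, 1 = death) and duration."""
    return cases * disability_weight * avg_duration_years


def dalys(yll, yld):
    """Burden of disease as the sum of the mortality and morbidity components."""
    return yll + yld


# Hypothetical inputs, chosen only to make the arithmetic easy to follow
yll = years_of_life_lost(deaths=100, remaining_life_expectancy=30.0)
yld = years_lived_with_disability(cases=1000, disability_weight=0.25, avg_duration_years=4.0)
total = dalys(yll, yld)
print(f"YLL={yll}, YLD={yld}, DALYs={total}")
```

The dataset used in this notebook contains death counts only, so it covers the mortality side of this sum and not the morbidity side.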
Content
This dataset contains historical death counts by cause, for all ages, for countries and territories around the world. Its cause columns are: Meningitis, Alzheimer's Disease and Other Dementias, Parkinson's Disease, Nutritional Deficiencies, Malaria, Drowning, Interpersonal Violence, Maternal Disorders, HIV/AIDS, Drug Use Disorders, Tuberculosis, Cardiovascular Diseases, Lower Respiratory Infections, Neonatal Disorders, Alcohol Use Disorders, Self-harm, Exposure to Forces of Nature, Diarrheal Diseases, Environmental Heat and Cold Exposure, Neoplasms, Conflict and Terrorism, Diabetes Mellitus, Chronic Kidney Disease, Poisonings, Protein-Energy Malnutrition, Road Injuries, Chronic Respiratory Diseases, Cirrhosis and Other Chronic Liver Diseases, Digestive Diseases, Fire, Heat, and Hot Substances, Acute Hepatitis.


Dataset Glossary (Column-wise)
•	01. Country/Territory - Name of the country/territory
•	02. Code - Country/territory code
•	03. Year - Year the deaths were recorded
•	04. Meningitis - No. of people who died from Meningitis
•	05. Alzheimer's Disease and Other Dementias - No. of people who died from Alzheimer's Disease and Other Dementias
•	06. Parkinson's Disease - No. of people who died from Parkinson's Disease
•	07. Nutritional Deficiencies - No. of people who died from Nutritional Deficiencies
•	08. Malaria - No. of people who died from Malaria
•	09. Drowning - No. of people who died from Drowning
•	10. Interpersonal Violence - No. of people who died from Interpersonal Violence
•	11. Maternal Disorders - No. of people who died from Maternal Disorders
•	12. HIV/AIDS - No. of people who died from HIV/AIDS
•	13. Drug Use Disorders - No. of people who died from Drug Use Disorders
•	14. Tuberculosis - No. of people who died from Tuberculosis
•	15. Cardiovascular Diseases - No. of people who died from Cardiovascular Diseases
•	16. Lower Respiratory Infections - No. of people who died from Lower Respiratory Infections
•	17. Neonatal Disorders - No. of people who died from Neonatal Disorders
•	18. Alcohol Use Disorders - No. of people who died from Alcohol Use Disorders
•	19. Self-harm - No. of people who died from Self-harm
•	20. Exposure to Forces of Nature - No. of people who died from Exposure to Forces of Nature
•	21. Diarrheal Diseases - No. of people who died from Diarrheal Diseases
•	22. Environmental Heat and Cold Exposure - No. of people who died from Environmental Heat and Cold Exposure
•	23. Neoplasms - No. of people who died from Neoplasms
•	24. Conflict and Terrorism - No. of people who died from Conflict and Terrorism
•	25. Diabetes Mellitus - No. of people who died from Diabetes Mellitus
•	26. Chronic Kidney Disease - No. of people who died from Chronic Kidney Disease
•	27. Poisonings - No. of people who died from Poisonings
•	28. Protein-Energy Malnutrition - No. of people who died from Protein-Energy Malnutrition
•	29. Road Injuries - No. of people who died from Road Injuries
•	30. Chronic Respiratory Diseases - No. of people who died from Chronic Respiratory Diseases
•	31. Cirrhosis and Other Chronic Liver Diseases - No. of people who died from Cirrhosis and Other Chronic Liver Diseases
•	32. Digestive Diseases - No. of people who died from Digestive Diseases
•	33. Fire, Heat, and Hot Substances - No. of people who died from Fire, Heat, or Hot Substances
•	34. Acute Hepatitis - No. of people who died from Acute Hepatitis
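The glossary can double as a lightweight schema check before any analysis. The sketch below assumes the three identifier columns named in the glossary and treats every other column as a death count; `split_columns` is a hypothetical helper, and the tiny frame with made-up values stands in for the real CSV.

```python
import pandas as pd

# Identifier columns as named in the glossary; everything else is assumed to be a cause
ID_COLUMNS = ['Country/Territory', 'Code', 'Year']


def split_columns(df):
    """Return (identifier columns, cause columns), failing early if an identifier is missing."""
    missing = [c for c in ID_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing identifier columns: {missing}")
    causes = [c for c in df.columns if c not in ID_COLUMNS]
    return ID_COLUMNS, causes


# Made-up rows, for illustration only
sample = pd.DataFrame({
    'Country/Territory': ['Aland', 'Aland'],
    'Code': ['ALD', 'ALD'],
    'Year': [1990, 1991],
    'Meningitis': [12, 10],
    'Malaria': [0, 3],
})

ids, causes = split_columns(sample)
# Death counts should never be negative
non_negative = bool((sample[causes] >= 0).all().all())
print(causes, non_negative)
```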

Steps to Follow
https://www.kaggle.com/code/spscientist/a-simple-tutorial-on-exploratory-data-analysis 
https://en.wikipedia.org/wiki/Exploratory_data_analysis#:~:text=In%20statistics%2C%20exploratory%20data%20analysis,and%20other%20data%20visualization%20methods. 


# Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from statsmodels.stats.outliers_influence import variance_inflation_factor
import warnings
warnings.filterwarnings('ignore')



In [None]:
data = pd.read_csv('COD.csv')
data

In [None]:
data.head()

In [None]:
data.tail()

In [None]:
data.columns

In [None]:
data.dtypes

In [None]:
data.shape

In [None]:
data.isnull().sum()

In [None]:
desc = data.describe().T
desc['range']=desc['max']-desc['min']
desc

In [None]:
data.describe(include='object').T

In [None]:
cont_data = data.select_dtypes(include=['int64','float64'])

cat_data= data.select_dtypes(include=['object'])

cont_columns = cont_data.columns

cat_columns = cat_data.columns

In [None]:
cont_columns

In [None]:
cat_columns

In [None]:
for i in cat_columns:
    # Series.mode works on string columns and returns every value tied for most frequent
    print('For',i,', most frequent value is: ',data[i].mode().tolist(),'\n')

In [None]:
for i in cat_columns:
    print('For column',i,'unique values are: ',data[i].unique())
    print('For column',i,'count of unique values are: ',data[i].nunique(),'\n\n')
    

In [None]:
for i in cat_columns:
    print('For column --',i,'-- value counts are: \n',data[i].value_counts(),'\n\n')


In [None]:
data.columns

In [None]:
diseases = data.columns[3:]

In [None]:
total_deaths=[]
for i in diseases:
    total_deaths.append(data[i].sum())
total_deaths

In [None]:
len(total_deaths)

In [None]:
diseases

In [None]:
df_diseasesum = pd.DataFrame({'Diseases':diseases,'total_deaths':total_deaths})


In [None]:
df_diseasesum['total_deaths'] = df_diseasesum['total_deaths'].astype('float64')


In [None]:
df_diseasesum.sort_values(by='total_deaths',ascending=False)


In [None]:
plt.figure(figsize=(20,15))
# Share of all recorded deaths attributed to each cause (a pie chart has no x ticks to rotate)
plt.pie(df_diseasesum['total_deaths'], labels=df_diseasesum['Diseases'], autopct='%1.0f%%')
plt.show()
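With 31 causes, many pie slices round to 0% and their labels overlap. One possible remedy, sketched below, is to keep the largest causes and fold the rest into a single "Other" slice before plotting. `top_n_with_other` is a hypothetical helper and the totals are made up; on the real data the input would be a Series built from `df_diseasesum`.

```python
import pandas as pd


def top_n_with_other(totals, n=3):
    """Keep the n largest entries and sum everything else into an 'Other' entry."""
    ordered = totals.sort_values(ascending=False)
    top = ordered.iloc[:n]
    if len(ordered) > n:
        rest = ordered.iloc[n:].sum()
        top = pd.concat([top, pd.Series({'Other': rest})])
    return top


# Made-up totals, for illustration only
totals = pd.Series({'A': 50, 'B': 30, 'C': 10, 'D': 6, 'E': 4})
grouped = top_n_with_other(totals, n=2)
shares = grouped / grouped.sum() * 100  # percentage share of each slice
print(shares)
```

The grouped Series can be passed to `plt.pie` in place of the full column, with `grouped.index` as the labels.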

# Univariate Analysis

In [None]:
for i in cat_columns:
    f= plt.figure(figsize=(12,5))
    ax = sns.countplot(x=data[i],data=data)
    plt.xticks(rotation=90)

In [None]:
plt.figure(figsize=(12,5))
plt.plot(df_diseasesum['total_deaths'])
plt.ylabel('total_deaths')
# One tick per cause; the original np.arange(3225,32,1) produced an empty tick array
plt.xticks(ticks=np.arange(len(diseases)), labels=diseases, rotation=90)
plt.show()

In [None]:
for i in cont_columns:
    plt.figure(figsize=(30,15))
    sns.scatterplot(x = 'Country/Territory', y = i, data = data)  # `df` was never defined
    plt.xticks(rotation=90)
    plt.show()

In [None]:
# Using Histogram 

for i in cont_columns:
    plt.figure(figsize=(10,6),facecolor='yellow')
    plt.hist(data[i])
    plt.xlabel(i)
    plt.show()

In [None]:
# catplot is figure-level and creates its own figure, so size it with height/aspect
ax = sns.catplot(x="Country/Territory",y='Meningitis', kind="box", data=data, height=6, aspect=4)
plt.xticks(rotation=90)
plt.show()

In [None]:
data.columns

In [None]:
# Distribution of continuous variables
for i in cont_columns:
    plt.figure(figsize=(12,7),facecolor='yellow')
    sns.histplot(data[i], kde=True)  # distplot is deprecated in recent seaborn releases
    plt.show()

# Checking Outliers

In [None]:
# Using Box Plot
for i in cont_columns:
    plt.figure(figsize=(12,7),facecolor='yellow')
    sns.boxplot(x=data[i])
    plt.show()

In [None]:
data.columns

In [None]:
for i in data['Country/Territory'].unique():  # one summary per country, not one per row
    print('For country:',i)
    df_c = data[data['Country/Territory']==i]
    print(df_c.describe().T)
    print('\n\n\n')

In [None]:
# Using Violin Plot 
for i in diseases:
    plt.figure(figsize=(30,30),facecolor='yellow')
    sns.violinplot(x="Country/Territory",y=i,data=data)
    plt.xticks(rotation=90)
    plt.show()

In [None]:
#  Using Strip Plot
for i in diseases:
    plt.figure(figsize=(30,30),facecolor='orange')
    sns.stripplot(x = 'Country/Territory', y =i, data = data)
    plt.xticks(rotation=90)
    plt.show()

# Numerical Data Analysis

In [None]:
for i in diseases:
    for j in ['Country/Territory', 'Year']:
        plt.figure(figsize=(30,30),facecolor='yellow')
        sns.scatterplot(x = data.index, y = i, data = data,hue=j)
        plt.xticks(rotation=90)
        plt.show()

# Relationship Analysis

In [None]:
for i in diseases:
    for j in ['Year']:
        # lmplot is figure-level and opens its own figure; size it with height/aspect
        sns.lmplot(x=i,y = j, data = data, height=6, aspect=1.5)
        plt.xticks(rotation=90)
        plt.show()

In [None]:
for i in diseases:
    for j in ['Year']:
        # lmplot is figure-level and opens its own figure; size it with height/aspect
        sns.lmplot(x=i,y = j, data = data,hue='Country/Territory', height=6, aspect=1.5)
        plt.xticks(rotation=90)
        plt.show()


In [None]:
sns.pairplot(data,hue='Country/Territory')

In [None]:
for i in diseases:
    print(data.groupby(['Country/Territory'])[i].count())

In [None]:
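# Mean deaths per country for every cause (the original loop kept only the last cause)
grouped_mean = data.groupby(['Country/Territory'])[diseases].mean()

In [None]:
# Countries ranked by mean deaths from the last cause in `diseases`
grouped_mean[diseases[-1]].sort_values(ascending=False)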

In [None]:
pd.set_option('display.max_rows',None)
grouped_sum = data.groupby(['Country/Territory'])[diseases].sum()

In [None]:
grouped_sum.max().sort_values(ascending=False)

In [None]:
grouped_sum.min().sort_values(ascending=False)

In [None]:
grouped_sum_data = pd.DataFrame(grouped_sum)


In [None]:
grouped_sum_data

In [None]:
grouped_sum_data['Meningitis'].sum()


In [None]:
grouped_f = data.groupby(['Country/Territory'])[diseases].agg(["mean"])


In [None]:
pd.set_option('display.max_rows',None)
grouped_f.T

In [None]:
grouped_max = data.groupby(['Country/Territory'])[diseases].agg(["max",'min'])
pd.set_option('display.max_rows',None)
grouped_max.describe().T.sort_values(by='max',ascending=False)

In [None]:
grouped_max.describe().T.sort_values(by='min',ascending=False)

In [None]:
grouped_years = data.groupby(['Year'])[diseases].agg(["max",'min'])
pd.set_option('display.max_rows',None)
grouped_years.describe().T.sort_values(by='max',ascending=False)

In [None]:
grouped_years.describe().T.sort_values(by='min',ascending=False)

In [None]:
grouped_sum_data

In [None]:
data.columns

In [None]:
#Correlation

In [None]:
# Correlation between features and target

In [None]:
# Correlation of each cause column with Year (numeric cause columns only)

data[diseases].corrwith(data['Year']).plot(kind='bar',grid=True,figsize=(10,7),title='correlation between features and target')
plt.show()

In [None]:
# Absolute pairwise correlations between the numeric columns

df_corr = data.corr(numeric_only=True).abs()
plt.figure(figsize=(20,15))
sns.heatmap(df_corr,annot=True,annot_kws={'size':10})
plt.show()

In [None]:
# Checking Skewness

In [None]:
data.skew(numeric_only=True).sort_values(ascending=False)
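Count data of this kind tends to be strongly right-skewed, because a few very large countries dominate each column. A common response is a `log1p` transform before plotting or correlating. The sketch below shows the effect on a synthetic, log-normally distributed column; the random data, seed, and distribution parameters are assumptions for illustration and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "death counts"; parameters are arbitrary
rng = np.random.default_rng(42)
counts = pd.Series(rng.lognormal(mean=5.0, sigma=1.0, size=1000))

skew_raw = counts.skew()
# log1p(x) = log(1 + x), which stays defined for zero counts
skew_log = np.log1p(counts).skew()

print(f"raw skew: {skew_raw:.2f}, log1p skew: {skew_log:.2f}")
```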

In [None]:
data['Country/Territory'].nunique()

In [None]:
data['Total number of deaths'] = data[diseases].sum(axis=1)


In [None]:
# Countries having maximum deaths

deathsbycountry = data.sort_values(by='Total number of deaths',ascending=False)[['Total number of deaths','Country/Territory']]


In [None]:
pd.set_option('display.max_rows',None)
deathsbycountry

In [None]:
deathsbycountry[0:10]

In [None]:
# Country dfs
data_India = data[data['Country/Territory']=='India'].sort_values(by='Total number of deaths',ascending=False)


In [None]:
#  Using Scatterplot

for i in diseases:
    plt.figure(figsize=(9,7),facecolor='yellow')
    sns.scatterplot(x = 'Year', y = i, data = data_India)
    plt.xticks(rotation=90)
    plt.show()

In [None]:
# Using Bar Plot
for i in diseases:
    plt.figure(figsize=(7,5),facecolor='red')
    sns.barplot(y = i,x='Year', data = data_India)  # data_china is not defined until a later cell
    plt.xticks(rotation=90)
    plt.show()

In [None]:
# Russia

data_russia = data[data['Country/Territory']=='Russia'].sort_values(by='Total number of deaths',ascending=False)

# Scatterplot

print('\n\n Scatterplot: ')

for i in diseases:
    plt.figure(figsize=(7,5),facecolor='red')
    sns.scatterplot(x = 'Year', y = i, data = data_russia)
    plt.xticks(rotation=90)
    plt.show()
     
print('\n\nBarplot')
    
# Barplot

for i in diseases:
    plt.figure(figsize=(7,5),facecolor='red')
    sns.barplot(y = i,x='Year', data = data_russia)
    plt.xticks(rotation=90)
    plt.show()

In [None]:
# USA

data_USA = data[data['Country/Territory']=='United States'].sort_values(by='Total number of deaths',ascending=False)

# Scatterplot

for i in diseases:
    plt.figure(figsize=(7,5),facecolor='red')
    sns.scatterplot(x = 'Year', y = i, data = data_USA)
    plt.xticks(rotation=90)
    plt.show()
    
print('\n\nBarplot')
    
# Barplot

for i in diseases:
    plt.figure(figsize=(7,5),facecolor='red')
    sns.barplot(y = i,x='Year', data = data_USA)
    plt.xticks(rotation=90)
    plt.show()

In [None]:
# China

data_china = data[data['Country/Territory']=='China'].sort_values(by='Total number of deaths',ascending=False)


In [None]:
data_china


In [None]:
# Using Scatterplot

for i in diseases:
    plt.figure(figsize=(7,5),facecolor='red')
    sns.scatterplot(x = 'Year', y = i, data = data_china)
    plt.xticks(rotation=90)
    plt.show()

In [None]:
#  Using Barplot

for i in diseases:
    plt.figure(figsize=(7,5),facecolor='red')
    sns.barplot(y = i,x='Year', data = data_china)
    plt.xticks(rotation=90)
    plt.show()

In [None]:
# Indonesia

data_indonesia = data[data['Country/Territory']=='Indonesia'].sort_values(by='Total number of deaths',ascending=False)

# Using Scatterplot

print('\n\n Scatterplot: ')

for i in diseases:
    plt.figure(figsize=(7,5),facecolor='red')
    sns.scatterplot(x = 'Year', y = i, data = data_indonesia)
    plt.xticks(rotation=90)
    plt.show()
     
print('\n\nBarplot')
    
#  Using Barplot

for i in diseases:
    plt.figure(figsize=(7,5),facecolor='red')
    sns.barplot(y = i,x='Year', data = data_indonesia)
    plt.xticks(rotation=90)
    plt.show()

In [None]:
cause_of_deaths = ['Meningitis',
       'Alzheimer\'s Disease and Other Dementias', 'Parkinson\'s Disease',
       'Nutritional Deficiencies', 'Malaria', 'Drowning',
       'Interpersonal Violence', 'Maternal Disorders', 'HIV/AIDS',
       'Drug Use Disorders', 'Tuberculosis', 'Cardiovascular Diseases',
       'Lower Respiratory Infections', 'Neonatal Disorders',
       'Alcohol Use Disorders', 'Self-harm', 'Exposure to Forces of Nature',
       'Diarrheal Diseases', 'Environmental Heat and Cold Exposure',
       'Neoplasms', 'Conflict and Terrorism', 'Diabetes Mellitus',
       'Chronic Kidney Disease', 'Poisonings', 'Protein-Energy Malnutrition',
       'Road Injuries', 'Chronic Respiratory Diseases',
       'Cirrhosis and Other Chronic Liver Diseases', 'Digestive Diseases',
       'Fire, Heat, and Hot Substances', 'Acute Hepatitis']

In [None]:
data['Total deaths'] = data[cause_of_deaths].sum(axis=1)

In [None]:
data['Total deaths']

In [None]:
# Top 10 Total_no_of_Deaths

top10_Total_no_of_Deaths = data.sort_values(by='Total deaths',ascending=False)[:10][['Total deaths','Country/Territory']]

top10_Total_no_of_Deaths

In [None]:
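# Sum only the cause columns and the total, so that 'Year', 'Code', and the duplicate
# 'Total number of deaths' column do not leak into the per-country rankings below
cuz_of_death = data.groupby('Country/Territory')[cause_of_deaths + ['Total deaths']].sum()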

In [None]:
cuz_of_death.sort_values(by='Total deaths',ascending=False)

In [None]:
#Top reasons of death countrywise:

In [None]:
china_10 = cuz_of_death.sort_values(by='Total deaths',ascending =False)[:1]


In [None]:
china_10.iloc[0].sort_values(ascending=False)[1:11]


In [None]:
plt.figure(figsize=(8,4),dpi=200)
china_10.iloc[0].sort_values(ascending=False)[1:11].plot(kind='barh')
plt.xlabel("Total no.of Deaths")
plt.ylabel("Causes of Deaths")
plt.title("Top 10 Causes of Deaths in China")
plt.show();

In [None]:
India_10 = cuz_of_death.sort_values(by='Total deaths',ascending =False)[1:2]
India_10.iloc[0].sort_values(ascending=False)[1:11]

In [None]:
plt.figure(figsize=(8,4),dpi=200)
India_10.iloc[0].sort_values(ascending=False)[1:11].plot(kind='barh')
plt.xlabel("Total no.of Deaths")
plt.ylabel("Causes of Deaths")
plt.title("Top 10 Causes of Deaths in India")
plt.show();

In [None]:
usa_10 = cuz_of_death.sort_values(by='Total deaths',ascending =False)[2:3]
usa_10.iloc[0].sort_values(ascending=False)[1:11]

In [None]:
plt.figure(figsize=(8,4),dpi=200)
usa_10.iloc[0].sort_values(ascending=False)[1:11].plot(kind='barh')
plt.xlabel("Total no.of Deaths")
plt.ylabel("Causes of Deaths")
plt.title("Top 10 Causes of Deaths in United States")
plt.show();
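The three country cells above select rows by position (`[:1]`, `[1:2]`, `[2:3]`) and therefore assume which country ends up in each slot after sorting. A name-based lookup avoids that assumption. The sketch below uses a hypothetical helper, `top_causes`, and a toy frame with made-up country names and numbers shaped like `cuz_of_death`.

```python
import pandas as pd


def top_causes(country_totals, country, n=10, total_col='Total deaths'):
    """Return the n largest causes for one country, ignoring the total column."""
    row = country_totals.loc[country].drop(labels=[total_col], errors='ignore')
    return row.sort_values(ascending=False).head(n)


# Toy stand-in for cuz_of_death: one row per country, one column per cause
toy = pd.DataFrame(
    {
        'Malaria': [5, 100],
        'Neoplasms': [80, 40],
        'Drowning': [10, 20],
        'Total deaths': [95, 160],
    },
    index=pd.Index(['Aland', 'Borduria'], name='Country/Territory'),
)

top2 = top_causes(toy, 'Aland', n=2)
print(top2)
```

On the real data, a call such as `top_causes(cuz_of_death, 'India')` would replace the positional slices, and the result can be plotted with `.plot(kind='barh')` as in the cells above.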

In [None]:
# PCA analysis:

from sklearn.decomposition import PCA
# Using PCA, i.e. Principal Component Analysis, a dimensionality reduction technique:

pca = PCA()
pca.fit_transform(grouped_sum_data)

In [None]:
# Plotting cumulative explained variance to choose the number of components:

plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Principal Components')
plt.ylabel('Variance Covered')
plt.title('PCA')
plt.show()
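PCA is sensitive to the scale of its inputs, so when raw sums are used, the causes with the largest counts can dominate the leading components. A possible variant, sketched below, standardizes each column first with scikit-learn's `StandardScaler`. The random matrix is a stand-in for `grouped_sum_data`, and the 95% threshold is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 50 "countries" by 6 "causes" on very different scales
rng = np.random.default_rng(0)
X = rng.lognormal(mean=5.0, sigma=2.0, size=(50, 6))

# Give every column zero mean and unit variance before PCA
X_scaled = StandardScaler().fit_transform(X)

pca = PCA()
pca.fit(X_scaled)

# Cumulative share of variance explained by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 95%
n_components_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print(cumulative.round(3), n_components_95)
```

Whether standardizing is appropriate depends on the question: it gives every cause equal weight, which may or may not be what the analysis intends.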