## Final project (Data Processing)
### William Bærenholdt, May 2025, University of Amsterdam

#### Packages
First, the necessary packages are loaded. Besides the usual packages that have been used during the SP1, SP2 and DP courses, two other packages are imported, namely scipy and bokeh. Both are used during the project to derive results, either visually (bokeh) or statistically (scipy). $\texttt{output\_notebook}$ ensures that Bokeh plots are displayed in the notebook.

In [431]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from bokeh.io import output_notebook
from bokeh.models import HoverTool  # needed for the hover tooltips in the plots
from bokeh.plotting import figure, show
from bokeh.layouts import gridplot, row
from bokeh.palettes import Blues, Reds
from scipy import stats

output_notebook()

#### Data loading and handling/preparing

We operate under the scientific question: "Is there a gender that we expect to live longer than the other in Denmark?" Although I do not believe in a binary gender perception, it is assumed for the current project that there are two genders, female and male, and it is thus investigated whether women or men have a longer life expectancy (or the same). Originally, the project was intended as a study of the Netherlands, but since I am not Dutch, Danish data was more readily available, and the origin of the data is not expected to have a significant impact on the approach. The following datasets are loaded:

In [432]:
deaths = pd.read_csv('deaths_denmark.csv', sep=',')
exposure = pd.read_csv('exposure_denmark.csv', sep=',')

The first dataset contains the number of deaths in Denmark per age, year, and gender from 1837 to 2023. We display it to check that the data is loaded correctly and to gain an immediate understanding of the dataset.

In [433]:
display(deaths)

Unnamed: 0,Year,Age,Female,Male,Total
0,1837,0,3315.00,4376.00,7691.00
1,1837,1,865.94,963.70,1829.64
2,1837,2,582.06,657.30,1239.36
3,1837,3,366.21,387.87,754.08
4,1837,4,242.79,236.13,478.92
...,...,...,...,...,...
20752,2023,106,7.00,0.00,7.00
20753,2023,107,9.00,1.00,10.00
20754,2023,108,6.00,0.00,6.00
20755,2023,109,0.00,0.00,0.00


The dataset appears as expected and is presented according to the explanation above.

The next dataset represents the number of persons exposed to death. For example, if there are three individuals who are 20 years old and one of them dies exactly half a year into the year, the number would be 2.5 (as 2 individuals have been alive for the entire year, while one individual has been alive for half of the year). We display this dataset as well to examine it and gain an immediate understanding.
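
The exposure example above can be reproduced with a few lines of arithmetic. This is only a sketch with the toy numbers from the text; the variable names are made up for illustration and do not appear elsewhere in the notebook.

```python
# Fraction of the year each of the three 20-year-olds was alive (toy numbers from the text)
years_lived = [1.0, 1.0, 0.5]  # two survive the full year, one dies after half a year

# Exposure is the total number of person-years lived at that age
exposure_age_20 = sum(years_lived)
deaths_age_20 = 1

# Dividing deaths by exposure gives the rate of deaths per person-year
toy_rate = deaths_age_20 / exposure_age_20

print(exposure_age_20, toy_rate)
```

With these numbers, the exposure is 2.5 person-years and one death corresponds to a rate of 0.4 deaths per person-year.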

In [434]:
display(exposure)

Unnamed: 0,Year,Age,Female,Male,Total
0,1837,0,17789.32,18477.69,36267.01
1,1837,1,15431.99,15730.77,31162.75
2,1837,2,14136.94,14373.96,28510.90
3,1837,3,13247.14,13446.53,26693.67
4,1837,4,12985.10,13186.90,26172.00
...,...,...,...,...,...
20752,2023,106,12.00,2.00,14.00
20753,2023,107,8.00,0.00,8.00
20754,2023,108,4.00,0.00,4.00
20755,2023,109,0.00,0.00,0.00


The dataset appears as expected and is presented according to the explanation above.

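We combine the two datasets, as it will later be necessary to use $\frac{\text{deaths}}{\text{exposure}}$. We merge by year and age since these are the common key columns in both datasets. Because the remaining columns have identical names in both datasets, pandas appends the suffixes _x and _y to them, so we rename all columns to something more descriptive.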

In [435]:
data = pd.merge(deaths, exposure, on=['Year', 'Age'])

# Since the column names from the former columns are identical we need to give new column names to make sure we understand the information.
data = data.rename(columns = {'Year': 'year',
                              'Age': 'age',
                              'Female_x': 'deaths_female',
                              'Male_x': 'deaths_male',
                              'Total_x': 'deaths_total',
                              'Female_y': 'exposure_female',
                              'Male_y': 'exposure_male',
                              'Total_y': 'exposure_total'})

display(data)

Unnamed: 0,year,age,deaths_female,deaths_male,deaths_total,exposure_female,exposure_male,exposure_total
0,1837,0,3315.00,4376.00,7691.00,17789.32,18477.69,36267.01
1,1837,1,865.94,963.70,1829.64,15431.99,15730.77,31162.75
2,1837,2,582.06,657.30,1239.36,14136.94,14373.96,28510.90
3,1837,3,366.21,387.87,754.08,13247.14,13446.53,26693.67
4,1837,4,242.79,236.13,478.92,12985.10,13186.90,26172.00
...,...,...,...,...,...,...,...,...
20752,2023,106,7.00,0.00,7.00,12.00,2.00,14.00
20753,2023,107,9.00,1.00,10.00,8.00,0.00,8.00
20754,2023,108,6.00,0.00,6.00,4.00,0.00,4.00
20755,2023,109,0.00,0.00,0.00,0.00,0.00,0.00


The merge seems to have worked fine, and the new titles are representative of what the columns contain.
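
As a side note on the `_x` and `_y` names that were renamed above: they come from the default `suffixes` argument of `pd.merge`. The sketch below uses two tiny made-up frames (not the Danish data) to show the default behaviour and, as one possible alternative, how descriptive suffixes could be passed directly.

```python
import pandas as pd

# Two hypothetical frames that share the key columns and one value column name
toy_deaths = pd.DataFrame({"Year": [2023, 2023], "Age": [0, 1], "Female": [10.0, 2.0]})
toy_exposure = pd.DataFrame({"Year": [2023, 2023], "Age": [0, 1], "Female": [1000.0, 990.0]})

# Default behaviour: overlapping non-key columns get the suffixes _x and _y
default_merge = pd.merge(toy_deaths, toy_exposure, on=["Year", "Age"])

# Alternative: choose the suffixes at merge time
named_merge = pd.merge(toy_deaths, toy_exposure, on=["Year", "Age"],
                       suffixes=("_deaths", "_exposure"))

print(list(default_merge.columns))
print(list(named_merge.columns))
```

Renaming afterwards, as done in the notebook, gives full control over the final names; passing `suffixes` is simply a shorter route to a similar result.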

We now calculate the mortality rate (MR), which is the number of deaths divided by the exposure. This number provides an understanding of how frequently deaths occur within an age group and will later be necessary for calculating life expectancies. Since the study only focuses on men and women, we only calculate the mortality rate for these two genders, omitting the "total" column for now.

In [436]:
data['mr_female'] = data['deaths_female'] / data['exposure_female']
data['mr_male'] = data['deaths_male'] / data['exposure_male']

display(data)

Unnamed: 0,year,age,deaths_female,deaths_male,deaths_total,exposure_female,exposure_male,exposure_total,mr_female,mr_male
0,1837,0,3315.00,4376.00,7691.00,17789.32,18477.69,36267.01,0.186348,0.236826
1,1837,1,865.94,963.70,1829.64,15431.99,15730.77,31162.75,0.056113,0.061262
2,1837,2,582.06,657.30,1239.36,14136.94,14373.96,28510.90,0.041173,0.045729
3,1837,3,366.21,387.87,754.08,13247.14,13446.53,26693.67,0.027644,0.028845
4,1837,4,242.79,236.13,478.92,12985.10,13186.90,26172.00,0.018698,0.017906
...,...,...,...,...,...,...,...,...,...,...
20752,2023,106,7.00,0.00,7.00,12.00,2.00,14.00,0.583333,0.000000
20753,2023,107,9.00,1.00,10.00,8.00,0.00,8.00,1.125000,inf
20754,2023,108,6.00,0.00,6.00,4.00,0.00,4.00,1.500000,
20755,2023,109,0.00,0.00,0.00,0.00,0.00,0.00,,


#### Failure data

Note: The higher the mortality rate, the more likely a person is to die. We therefore expect mortality rates to increase with age and to be elevated at age 0 (due to newborn deaths).

We observe some issues in the mortality rate columns, where values such as 0, inf, and NaN occur. This only happens at high ages, where the numbers of deaths and exposure are very low and we risk dividing by 0 (or even 0 by 0). This is not defined mathematically, and pandas reports the result as inf or NaN. Therefore, we need a solution for the high ages so that we can use the dataset.
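
The three problematic values can be reproduced on a small made-up frame (hypothetical numbers, chosen only to trigger each case):

```python
import numpy as np
import pandas as pd

# Toy rows: no deaths but some exposure, a death without exposure, and nothing at all
toy = pd.DataFrame({"deaths": [0.0, 1.0, 0.0],
                    "exposure": [2.0, 0.0, 0.0]})

# pandas does not raise on division by zero; it returns inf or NaN instead
toy["mr"] = toy["deaths"] / toy["exposure"]

print(toy)
```

The first row gives a rate of 0, the second gives inf (a positive number divided by 0), and the third gives NaN (0 divided by 0).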

We thus set a deterministic MR for all individuals aged 100 or older, as it is for these ages that "strange" mortality rates occur. For each year and gender, we set the MR to the pooled rate (total deaths divided by total exposure) among all individuals aged 90 and above. Note: There are many different ways to determine MR for high ages, and this is likely not the most accurate, but it works fine for the purpose of the project.

In [437]:
year_min = data['year'].min()
year_max = data['year'].max()

for year in range(year_min, year_max + 1):
    # Replace the mortality rate for ages 100 and above with the pooled mortality rate
    # (total deaths / total exposure) of everyone aged 90 and above in the same year.
    filtered_data = data[(data['year'] == year) & (data['age'] >= 90)].copy()

    mr_female = filtered_data['deaths_female'].sum() / filtered_data['exposure_female'].sum()
    mr_male = filtered_data['deaths_male'].sum() / filtered_data['exposure_male'].sum()

    data.loc[(data['year'] == year) & (data['age'] >= 100), 'mr_female'] = mr_female
    data.loc[(data['year'] == year) & (data['age'] >= 100), 'mr_male'] = mr_male

display(data)

Unnamed: 0,year,age,deaths_female,deaths_male,deaths_total,exposure_female,exposure_male,exposure_total,mr_female,mr_male
0,1837,0,3315.00,4376.00,7691.00,17789.32,18477.69,36267.01,0.186348,0.236826
1,1837,1,865.94,963.70,1829.64,15431.99,15730.77,31162.75,0.056113,0.061262
2,1837,2,582.06,657.30,1239.36,14136.94,14373.96,28510.90,0.041173,0.045729
3,1837,3,366.21,387.87,754.08,13247.14,13446.53,26693.67,0.027644,0.028845
4,1837,4,242.79,236.13,478.92,12985.10,13186.90,26172.00,0.018698,0.017906
...,...,...,...,...,...,...,...,...,...,...
20752,2023,106,7.00,0.00,7.00,12.00,2.00,14.00,0.224912,0.275157
20753,2023,107,9.00,1.00,10.00,8.00,0.00,8.00,0.224912,0.275157
20754,2023,108,6.00,0.00,6.00,4.00,0.00,4.00,0.224912,0.275157
20755,2023,109,0.00,0.00,0.00,0.00,0.00,0.00,0.224912,0.275157
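
The replacement rate above is computed as total deaths divided by total exposure, rather than as the mean of the per-age rates. The sketch below, with made-up numbers, illustrates why this choice is convenient: the pooled rate stays finite even when a single age has zero exposure.

```python
import numpy as np

# Hypothetical deaths and exposure for three high ages; the last age has no exposure
toy_deaths = np.array([50.0, 5.0, 1.0])
toy_exposure = np.array([200.0, 10.0, 0.0])

# Pooled rate: one division on the totals
pooled_rate = toy_deaths.sum() / toy_exposure.sum()

# Mean of per-age rates: a single inf makes the whole mean inf
with np.errstate(divide="ignore", invalid="ignore"):
    per_age_rates = toy_deaths / toy_exposure
mean_of_rates = per_age_rates.mean()

print(pooled_rate, mean_of_rates)
```

The pooled rate also weights each age by its exposure, so ages with very few people have little influence on the result.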


#### Expected remaining lifetime
The data now looks good again and is ready to be used. From now on, MR is denoted as $\mu(x,t)$, where $x$ represents age and $t$ represents year. This means that the mortality rate for a 109-year-old in the year 2023 is 0.275157, i.e., $\mu(109,2023)=0.275157$. Roughly, this can be interpreted as: "We expect 27.5 % of all 109-year-olds to die within a year".
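
The reading above treats the rate as if it were a probability, which is only a rough approximation. Under the assumption (not stated in the notebook) that the rate is constant during the year, the one-year death probability would be $1 - e^{-\mu}$, which is somewhat lower than the rate itself. A minimal sketch:

```python
import math

mu = 0.275157  # mortality rate for age 109 in 2023, taken from the table above

# Assumption: constant rate over the year, so the survival probability is exp(-mu)
prob_survive_year = math.exp(-mu)
prob_death_within_year = 1 - prob_survive_year

print(round(prob_death_within_year, 4))
```

This gives a probability of about 24 %, compared with the rough reading of 27.5 %. The difference is small for low rates and grows as the rate increases.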

To calculate the expected remaining lifetime $T_x^t$ of an $x$-year-old in year $t$, we use the formula from "Introduction to Mathematics in Life Insurance" (University of Copenhagen):

$$
T_x^t=\sum_{i=x}^{x_{\max }} e^{-\sum_{j=x}^{i} \mu(j, t)}
$$

where $x_{\max}$ is the maximum age a person can have (which in this case is 110 years). This means that if, for example, someone is 20 years old in 2023 and wants to find their expected remaining lifetime, it is calculated as:

$$
T_{20}^{2023} = \sum_{i=20}^{110} e^{-\sum_{j=20}^{i} \mu(j, 2023)}=e^{-\left( \mu(20,2023) \right)}+e^{-\left( \mu(20,2023) + \mu(21,2023)\right)}+...+ e^{-\left( \mu(20,2023) +...+ \mu(110,2023)\right)}
$$
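
To make the formula concrete, here is a minimal sketch with three made-up mortality rates (hypothetical values, not taken from the Danish data). The inner sum of the formula is a cumulative sum, and the outer sum adds up the exponentials of its negated entries.

```python
import numpy as np

# Hypothetical rates mu(x, t), mu(x+1, t), mu(x+2, t) for a person with three ages left
toy_mu = np.array([0.1, 0.2, 0.3])

# Inner sums: mu(x), mu(x) + mu(x+1), mu(x) + mu(x+1) + mu(x+2)
cumulative = np.cumsum(toy_mu)

# Each term exp(-cumulative sum) is one summand of the formula
terms = np.exp(-cumulative)

# Outer sum: expected remaining lifetime in years
toy_remaining = terms.sum()

print(terms, toy_remaining)
```

With these toy rates the result is roughly 2.19 years, i.e. $e^{-0.1} + e^{-0.3} + e^{-0.6}$.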

It is crucial for the rest of the project that we have a function that can calculate the expected remaining lifetimes. Therefore, this function is now defined:

In [438]:
def expected_death_age(age, year, gender):
    """
    Calculates the estimated number of years that a person of the input age will live according to the
    mortality rates from the input year. At the end, the input age is added, so the result is the age at
    which we expect a person of the input gender to die.
    """
    data_ = data[data['year'] == year]
    mortality_rate_gender = "mr_" + gender

    # Build the cumulative sums of mortality rates: each element is the former element
    # plus the mortality rate of a person who is one year older
    cumulative_rates = []
    mortality_rate_sum = 0
    for age_variable in range(age, 110+1):
        mortality_rate = float(data_[data_['age'] == age_variable][mortality_rate_gender].iloc[0])
        mortality_rate_sum = mortality_rate_sum + mortality_rate
        cumulative_rates.append(mortality_rate_sum)

    # Sum exp(-cumulative rate) over all ages to get the expected remaining lifetime
    remaining_years = np.exp(-np.array(cumulative_rates)).sum()

    return remaining_years + age

It is noted that the input age is added to the expected remaining lifetime by the end of the function. This means that the function calculates how old a person (gender, year, age) is expected to become (rather than how many years the person has left).

The function takes gender, age, and year into account. Therefore, with the function, we can calculate all expected ages at death for both genders in all years and will now do so in various ways.

#### Analysis (table)

To get a brief overview, a table is now created to gain an immediate understanding of the function and the results. The following function takes two lists (ages and years) and a gender as input. For the input years, we can thus see the expected age at death for females and males at different ages.

In [439]:
def table_expected_death(ages, years, gender):
    """
    This function creates a table that shows the expected age at death of an X-year-old in year Y,
    where X and Y run over the input lists of ages and years respectively.
    """
    
    table = {
        "Age": [],
        "Year": [],
        "Expected_death_age": []
    }

    # For all combinations of ages and years we calculate the expected age at death.
    for year in years:
        for age in ages:
            expected_deaths = expected_death_age(age, year, gender)
            table["Age"].append(age)
            table["Year"].append(year)
            table["Expected_death_age"].append(round(expected_deaths, 2))

    # Using pandas to get the table in the right format
    df = pd.DataFrame(table)
    df_pivot = pd.pivot_table(df, columns='Year', index='Age', values='Expected_death_age')

    return df_pivot

We select various ages and years and set up the table for women and men side by side.

In [440]:
# Manually chosen ages and years, which are passed to the function
ages_ = [0, 20, 40, 60, 80]
years_ = [2003, 2013, 2023]

table_female = table_expected_death(ages_, years_, "female")
table_male = table_expected_death(ages_, years_, "male")

# Combine the table for display
final_table = pd.concat([table_female, table_male], axis=1, keys=['(Female): Expected age at death', '(Male): Expected age at death'])

display(final_table)

Expected age at death by age (rows), gender and year (columns):

| Age | Female, 2003 | Female, 2013 | Female, 2023 | Male, 2003 | Male, 2013 | Male, 2023 |
|---|---|---|---|---|---|---|
| 0 | 78.75 | 81.38 | 82.77 | 74.18 | 77.23 | 78.98 |
| 20 | 79.36 | 81.81 | 83.16 | 74.89 | 77.69 | 79.4 |
| 40 | 79.8 | 82.11 | 83.44 | 75.78 | 78.38 | 79.91 |
| 60 | 81.83 | 83.76 | 84.68 | 78.49 | 80.69 | 81.59 |
| 80 | 88.07 | 88.84 | 89.15 | 86.24 | 87.21 | 87.52 |


There are two immediate trends that we observe in the table. Firstly, it appears that Danish people are living longer, as individuals in all five age groups have a higher expected age at death each time the year increases.

Thus, an additional scientific question arises: Are Danes in general living longer and longer?

Regarding our initial question about whether one gender lives longer than the other, there is an indication that yes, women live longer than men. This is observed when cross-investigating the table (checking expected age at death for the same ages and years).

However, it is necessary to point out that the above table only considers 3 years and 5 different ages. This is not enough to answer the two scientific questions, so we need to examine all ages and all years. Fortunately, we can do this with plots. The Python package Bokeh is particularly helpful here, since its visually clear plots and interactive features (such as zooming and hovering) make it easier to read off results.

When we fix the age and examine for all years, we can investigate the difference between women's and men's life expectancy and thus answer the initial question.

When we fix the year and examine for all ages, we can investigate the trend in both men's and women's expected age at death, and thus examine whether Danish people are generally living longer.

#### Analysis (plots)

First, we examine the difference between women and men. For this purpose, a function is used to create a plot showing the expected age at death for a specific age for both men and women.

In [441]:
def expected_death_per_year_fixed_age(start_year, end_year, age):
    """
    This function takes age as a fixed input and returns a plot which shows the expected age at death for each year
    between the input start and end year, for both the female (red) and male (blue) gender.
    """
    x_values = []
    y_values_female = []
    y_values_male = []

    for year in range(start_year, end_year+1):
        x_values.append(year)
        y_values_female.append(expected_death_age(age, year, "female"))
        y_values_male.append(expected_death_age(age, year, "male"))

    # Create plot
    p = figure(title=f"Expected age of death of {age}-year-olds (per year)", x_axis_label="Year", y_axis_label="Expected age of death", width=400, height=400)
    p.line(x_values, y_values_female, line_width=3, color="red", legend_label="Female")
    p.line(x_values, y_values_male, line_width=3, color="blue", legend_label="Male")
    p.legend.location = "top_left"

    # Adds hover effect
    hover = HoverTool(tooltips=[("Year", "@x"), ("Expected age of death", "@y")])
    p.add_tools(hover)

    return p

We examine three different ages that are 40 years apart (0, 40, and 80) over the past 50 years.

In [442]:
p1 = expected_death_per_year_fixed_age(1973, 2023, 0)
p2 = expected_death_per_year_fixed_age(1973, 2023, 40)
p3 = expected_death_per_year_fixed_age(1973, 2023, 80)

show(row(p1, p2, p3))

The same trend observed in the table persists: there is a difference in the expected age at death between women and men. We hypothesize that women live longer than men, as a newborn, an adult, and an elderly woman are each expected to die later than a man of the same age. However, the difference seems to diminish as we approach later years. In other words, the difference in age at death between women and men may not be as large as it used to be. We temporarily pause this thought and proceed with the other scientific question.

The graph also indicates that the expected age at death is increasing over the years, and that there was a particularly steep rise in the mid-1990s for both men and women. Another way to investigate this trend is by fixing the year and examining different ages. This is done by the function below, which takes a list of years as input and computes how many years people aged 0 to 80 have left to live in the respective year.

In [445]:
def plot_expected_age_of_death(years, gender):
    """
    For each input year, this function plots the expected number of years left for the ages 0 to 80
    in that year. Note that this differs from the other plots, which showed the age at death
    for different ages (and not the remaining years). The result is a plot which uses different
    colour shades and line styles to distinguish between years.
    """

    p = figure(title=f"Expected remaining lifetime per year ({gender})",
               x_axis_label="Age", y_axis_label="Remaining years",
               width=400, height=400)

    # Define a list of line styles where each plot will have a different style
    line_styles = ['solid', 'dashed', 'dotted', 'dotdash', 'dashdot']

    for index, year in enumerate(years):
        x_values = []
        y_values = []
        
        for age in range(0, 81):
            x_values.append(age)
            y_values.append(expected_death_age(age, year, gender) - age) # Subtract age to get remaining lifetime instead of expected age at death.

        # Use different shades of colors for the lines for each plot
        if gender == 'female':
            line_color = Reds[5][index % 5]

        elif gender == 'male':
            line_color = Blues[5][index % 5]

        p.line(x_values, y_values, line_width=3, line_color=line_color, line_dash=line_styles[index % len(line_styles)], legend_label=f"{year}")

    return p

It is noted that age is subtracted inside the inner loop of the above code. Thus, it is no longer the expected age at death that we obtain from our $\texttt{expected\_death\_age}$ function, but instead the expected remaining lifetime. We examine 10-year intervals over the past 50 years for both women and men.

In [446]:
years_ = [2023, 2013, 2003, 1993, 1983]

p4_female = plot_expected_age_of_death(years_, "female")
p4_male = plot_expected_age_of_death(years_, "male")

show(row(p4_female, p4_male))

Naturally, the expected remaining lifetime decreases with age (we expect a 0-year-old to live more years than an 80-year-old). The Bokeh package now becomes very useful as we can clearly see, with stylistic differences and the zoom function, that the later the year, the higher the expected remaining lifetime for all ages. (The graph for 2023 is above the graph for 2013, which is above the graph for 2003...)

Everything thus suggests that Danes are living longer and longer, which answers the second scientific question. However, this raises a question about the function used to calculate the age at death. In that function, we added the mortality rate of a 20-year-old in 2023 to that of a 21-year-old in 2023, and so on. However, a person who is 20 years old in 2023 will be 21 years old in 2024 and, by the arguments above, will then likely face a lower mortality rate than a 21-year-old did in 2023 (because mortality has been falling over time). This is a question for further investigation; it is noted here and left aside for the rest of the project.
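
The point above can be illustrated with a small sketch. All numbers are made up, including the assumed yearly improvement factor; real future rates are unknown and would have to be projected. The sketch compares summing rates from a single year (as the notebook's function does) with following one person along the diagonal, where each later age uses the rate of a later year.

```python
import numpy as np

# Hypothetical rates for ages 0..4 in the starting year
base_rates = np.array([0.10, 0.15, 0.20, 0.30, 0.50])
improvement = 0.98  # assumed factor by which every rate falls per calendar year


def toy_mu(age, years_ahead):
    """Hypothetical mortality rate at a given age, a number of years after the start."""
    return base_rates[age] * improvement ** years_ahead


ages = range(len(base_rates))

# Same-year approach: all rates are taken from the starting year
same_year_rates = np.array([toy_mu(a, 0) for a in ages])

# Diagonal approach: a newborn reaches age a exactly a years after the start
diagonal_rates = np.array([toy_mu(a, a) for a in ages])

lifetime_same_year = np.exp(-np.cumsum(same_year_rates)).sum()
lifetime_diagonal = np.exp(-np.cumsum(diagonal_rates)).sum()

print(lifetime_same_year, lifetime_diagonal)
```

Under the assumption of falling rates, the diagonal value is the larger of the two, which is consistent with the concern raised above that the same-year calculation may understate how long people will actually live.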

#### Analysis (statistics)

So far, our investigations have largely consisted of visual observations and subsequent assumptions. Now, we wish to test more rigorously, for which we use statistical tests. We proceed with the initial scientific question, but rephrase it in light of the visual observations: Is there statistical evidence that women live longer than men in Denmark? For this, we use the Python module $\texttt{scipy.stats}$, which can, among other things, be used to investigate whether there is a statistical difference between two datasets.

We use the following function, which at a significance level of 0.05 decides whether there is a difference in life expectancy between women and men in the given year/given time period.

In [447]:
def p_value_under_signifance_level_life_expectancy(year, end_year = 0, significance_level = 0.05):
    """
    This function takes at least one year as an input and calculates whether there is statistical evidence 
    for a difference in life expectancy between genders in the specific year, at a significance level (which 
    by default is 0.05). If there is a second input year, it calculates whether there is a statistical 
    difference during the entire time period.
    """

    list_male = []
    list_female = []

    # We only consider the input year if no time period is given
    if end_year == 0:
        end_year = year

    for age in range(0, 110+1):
        for year_ in range(year, end_year+1):
            list_male.append(expected_death_age(age, year_, 'male'))
            list_female.append(expected_death_age(age, year_, 'female'))

    t_statistic, p_value = stats.ttest_ind(list_male, list_female)

    return p_value < significance_level
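
The function above uses `stats.ttest_ind`, which treats the male and female values as two independent samples. Since the values are computed for the same ages and years, one might also consider a paired test (`stats.ttest_rel`). This is a suggestion for comparison, not what the notebook does. The sketch below uses made-up numbers to show how the two tests can differ when the values vary a lot across ages but the gap between the groups is steady.

```python
import numpy as np
from scipy import stats

# Hypothetical expected ages at death across 50 ages for one group
toy_male = np.linspace(79.0, 110.0, 50)

# Second group: about 2 years higher at every age, with a small deterministic wiggle
toy_female = toy_male + 2.0 + 0.2 * np.sin(np.arange(toy_male.size))

# Independent-samples test: the large spread across ages masks the 2-year gap
_, p_independent = stats.ttest_ind(toy_male, toy_female)

# Paired test: looks only at the per-age differences
_, p_paired = stats.ttest_rel(toy_male, toy_female)

print(p_independent, p_paired)
```

In this toy setting the paired test yields a much smaller p-value than the independent test, because it removes the variation that comes from age itself.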

The visual results from earlier sections indicated that it was particularly in recent years that the difference between the life expectancy of women and men had decreased. Therefore, we only investigate whether there is statistical evidence for a difference in the most recent years. Of course, this is also because we are only interested in such investigations regarding the present.

We check for the last year, 2023.

In [448]:
p_value_under_signifance_level_life_expectancy(2023)

True

Statistical tests should always be taken with a pinch of salt, but at a 5% significance level the test indicates a difference in life expectancy between men and women in 2023. Combined with the table and plots above, which show higher values for women, this suggests that women have a longer life expectancy than men.

We examine whether this has changed over the past years and therefore consider each year from 2014 to 2022 (which, together with 2023, covers the last 10 years).

In [449]:
start_year = 2014
end_year = 2022

check = []
for year in range(start_year, end_year+1):
    check.append(p_value_under_signifance_level_life_expectancy(year))

check

[True, True, True, True, True, True, True, True, True]

The statistical tests are consistent and thus confirm the result from 2023. In each of the previous nine years (2014 to 2022), the test indicates a difference in life expectancy between women and men.

Finally, we test whether there is a difference when the last 10 years (2014 to 2023) are treated as one combined period.

In [450]:
p_value_under_signifance_level_life_expectancy(2014, 2023)

True

A consistent result has been observed.

## Conclusion
The project focused on life expectancies among men and women. The primary scientific question was whether one gender is expected to live longer than the other. To answer this question, it was necessary to find a method to calculate life expectancies, which is why the utilized data was carefully reviewed. This led to a formula that could be converted into a function to calculate life expectancies. The function played a crucial role and was used countless times throughout the project.

It quickly became apparent that if there was a difference in life expectancy, it was women who lived longer than men. This prompted a subsequent scientific question: Do women live longer than men? This question was investigated statistically using the Python package scipy, and the answer was consistent across all tested years: Women have a longer life expectancy than men.

During the project, a trend was observed indicating that the expected age at death was increasing each year. Therefore, the scientific question "Are we expected to live longer and longer?" was examined, with the Bokeh Python package proving particularly useful due to its interactive design, making it easier to visually explore and thus answer the question. The answer was again yes, we expect people to get older and older.

Finally, it is worth noting that here in the conclusion (and likely also throughout the project), "we" is used to refer to us in the study. All data is based on Danish life expectancy data, and thus the results may not necessarily apply to the entire world population but only to the (lovely) Danish people.