First, we check what kind of data we have and define the research questions we want to answer.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import numpy as np

data = pd.read_csv('all_data.csv')
print(data.head())
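Beyond `head()`, a few quick checks help show what kind of data we have: column types, missing values, and the number of rows per country. The following is a minimal sketch on a small invented frame called `demo`, which only mirrors the column names of `all_data.csv`. The values, including the missing entry, are made up.

```python
import pandas as pd

# Invented values for illustration; only the column names follow the notebook.
demo = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2000, 2001, 2000, 2001],
    "Life expectancy at birth (years)": [70.0, 70.5, 46.0, None],
    "GDP": [1.0e11, 1.1e11, 6.0e9, 6.5e9],
})

# Column types: Year should be an integer, the measurements floating point.
print(demo.dtypes)

# Number of missing values per column.
missing = demo.isna().sum()
print(missing)

# Number of rows per country, to spot gaps in the yearly coverage.
rows_per_country = demo["Country"].value_counts()
print(rows_per_country)
```

Running the same three checks on `data` instead of `demo` should show whether the real data set needs any cleaning before plotting.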

**So, a few research questions to look into:**
1. How much does life expectancy differ by country?
2. Is the time-development of life expectancy similar between different countries?
3. Is the time-development of the GDP similar between different countries?
4. Is there a clear correlation between GDP and life expectancy?

In [None]:
# Possible values of "Year" to find the most recent data
print(data["Year"].unique())

In [None]:
# Let's choose the data from 2015
data_2015 = data.loc[data["Year"] == 2015]
# Check that the data looks like expected
print(data_2015.head())

# Then plotting the life expectancy for each country
ax = data_2015.plot.bar(x='Country', y='Life expectancy at birth (years)', title='Life expectancy at birth separated by country in 2015')
ax.grid(axis='y')
ax.set_axisbelow(True)
plt.show()


This answers the first question. In most of the countries included in the data, the life expectancy is around 75-82 years; however, in Zimbabwe, it is around 60.

In [None]:
# Now for the second question. We plot the time development by the country to see if they change similarly
# Using only the needed columns
plot_df = data[['Year', 'Country', 'Life expectancy at birth (years)']]

# Sort by country and year so that each line is drawn in chronological order
plot_df = plot_df.sort_values(['Country', 'Year'])

# Grouping by country
groups = plot_df.groupby('Country')

plt.figure(figsize=(10, 6))
for name, group in groups:
    # Convert to NumPy arrays and draw one line per country
    plt.plot(group['Year'].to_numpy(), group['Life expectancy at birth (years)'].to_numpy(), marker='o', label=name)

plt.xlabel('Year')
plt.ylabel('Life expectancy at birth (years)')
plt.title('Time development of the life expectancy at birth (years) for each country')
plt.legend(title='Country')
plt.grid(True)
plt.show()

This answers the second question. For most of the countries included, life expectancy has been increasing slowly but steadily. Zimbabwe is the stark exception: its life expectancy declined at first and reached a minimum of below 45 years in 2004, about 30 years less than in the other countries. After that, it increased rapidly, but it is still well below the others at the end of the data set.
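The year and value of each country's minimum can also be read off programmatically rather than from the plot. Below is a minimal sketch using `idxmin`. The `demo` frame holds made-up values (not taken from `all_data.csv`) and only reuses the column names from above.

```python
import pandas as pd

# Made-up illustrative values; only the column names match the notebook's data.
demo = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Year": [2000, 2001, 2002, 2000, 2001, 2002],
    "Life expectancy at birth (years)": [70.0, 70.5, 71.0, 46.0, 44.5, 45.5],
})

col = "Life expectancy at birth (years)"

# idxmin returns, per country, the row label of the smallest value;
# .loc then pulls those rows so that year and value appear together.
min_rows = demo.loc[demo.groupby("Country")[col].idxmin(), ["Country", "Year", col]]
print(min_rows.reset_index(drop=True))
```

Replacing `demo` with `data` should give the corresponding table for the real data set.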

In [None]:
# Now, for the time-development of the GDP:
# Using only the needed columns
# Copy so that adding a column later does not modify or warn about `data`
plot_df = data[['Year', 'Country', 'GDP']].copy()

# Sort by country and year so that each line is drawn in chronological order
plot_df = plot_df.sort_values(['Country', 'Year'])

# Grouping by country
groups = plot_df.groupby('Country')

plt.figure(figsize=(10, 6))
for name, group in groups:
    # Convert to NumPy arrays and draw one line per country
    plt.plot(group['Year'].to_numpy(), group['GDP'].to_numpy(), marker='o', label=name)

plt.xlabel('Year')
plt.ylabel('GDP')
plt.title('Time development of the GDP for each country')
plt.legend(title='Country')
plt.grid(True)
plt.show()

# Compute GDP percentage increase relative to the first year per country
plot_df['GDP_pct_increase'] = plot_df.groupby('Country')['GDP'].transform(
    lambda x: (x / x.iloc[0] - 1) * 100
)

# Plot line chart for percentage increase
plt.figure(figsize=(10, 6))
groups = plot_df.groupby('Country')

for name, group in groups:
    plt.plot(
        group['Year'].to_numpy(),
        group['GDP_pct_increase'].to_numpy(),
        marker='o',
        label=name
    )

plt.xlabel('Year')
plt.ylabel('GDP increase (%) compared to first year')
plt.title('GDP percentage growth relative to first year (per country)')
plt.legend(title='Country')
plt.grid(True)
plt.show()

There are clear differences between the countries here. In absolute terms, Zimbabwe's GDP looks flat throughout the whole period in the data set, and Chile and Mexico also show only a little growth. Germany's GDP grew slightly more, but the USA and China are in a category of their own. The financial crisis of 2008-2009 shows up in the USA's GDP and, less pronounced, in Germany's and Mexico's. For the other countries, there is no clear sign of it in the absolute values.

The percentage increase relative to the first year of the data set makes it clear that Zimbabwe's GDP did change too, just not on the same absolute scale as the others. Percentage-wise, growth has been fastest for China, whose GDP in 2015 was more than eight times that of 2000. In Chile, GDP growth also exceeded 200%, while in all of the other countries it stayed below that. Chile, Germany, and Mexico show a decrease in 2009, and Zimbabwe already in 2008. For China, no decrease is observed. For the USA, the decrease seen in the absolute values is hard to make out on the percentage plot, although it must be present there as well, because the percentage is only a per-country rescaling of the same values.
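Years with falling GDP can also be identified directly instead of by eye. The sketch below uses `pct_change` within each country to compute the year-over-year change and lists the rows where it is negative. It also recomputes the growth relative to the first year, as in the cell above, to show that a drop in absolute GDP appears in that curve too. The `demo` values are invented for illustration.

```python
import pandas as pd

# Invented values for illustration; column names follow the notebook.
demo = pd.DataFrame({
    "Country": ["A"] * 4 + ["B"] * 4,
    "Year": [2007, 2008, 2009, 2010] * 2,
    "GDP": [100.0, 110.0, 105.0, 120.0, 10.0, 12.0, 15.0, 18.0],
}).sort_values(["Country", "Year"])

# Year-over-year change in percent, computed separately for each country.
demo["GDP_yoy_pct"] = demo.groupby("Country")["GDP"].pct_change() * 100

# Growth relative to each country's first year.
demo["GDP_pct_increase"] = demo.groupby("Country")["GDP"].transform(
    lambda x: (x / x.iloc[0] - 1) * 100
)

# Rows where GDP fell compared with the previous year.
declines = demo.loc[demo["GDP_yoy_pct"] < 0, ["Country", "Year", "GDP_yoy_pct"]]
print(declines)
```

In this invented example, only country A in 2009 is listed, and its `GDP_pct_increase` value also falls from 2008 to 2009.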

In [None]:
# Now for the correlation of GDP and Life expectancy
# Scatterplot colored by a Country
sns.scatterplot(
    data=data,
    x='GDP',
    y='Life expectancy at birth (years)',
    hue='Country',          # Color the points by country
    palette='viridis'
)
plt.xlabel('GDP')
plt.ylabel('Life expectancy at birth (years)')
plt.title('GDP vs Life expectancy (colored by Country)')
plt.grid(True)
plt.show()

From the scatterplot, we can see that with very low GDP, the life expectancy can be anywhere between 45 and 80 years. At higher GDP, there seem to be two different lines of correlation, with life expectancy increasing at different rates with respect to GDP. Zimbabwe's points form a nearly vertical line, because its GDP stays close to zero on this scale, so little can be read from the plot for it. For the other countries, we'll plot linear regressions to check how well they fit the data.
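When countries follow separate lines like this, a correlation computed over all points at once can differ a lot from the correlation within each country. The sketch below compares the two with `scipy.stats.pearsonr`. The `demo` frame contains invented values, chosen so that each country has an upward trend but on a very different GDP scale. It is not a statement about the real data.

```python
import pandas as pd
from scipy import stats

# Invented values: two countries with upward trends on very different GDP scales.
demo = pd.DataFrame({
    "Country": ["A"] * 4 + ["B"] * 4,
    "GDP": [1.0, 2.0, 3.0, 4.0, 100.0, 200.0, 300.0, 400.0],
    "Life expectancy at birth (years)": [75.0, 76.0, 77.5, 78.0, 72.0, 73.0, 74.5, 75.0],
})
col = "Life expectancy at birth (years)"

# Correlation over all points, ignoring which country they belong to.
pooled_r = stats.pearsonr(demo["GDP"], demo[col])[0]

# Correlation within each country separately.
per_country_r = {
    name: stats.pearsonr(group["GDP"], group[col])[0]
    for name, group in demo.groupby("Country")
}

print(f"pooled r = {pooled_r:.3f}")
for name, r in per_country_r.items():
    print(f"{name}: r = {r:.3f}")
```

In this invented example, the pooled coefficient is negative although both countries have a strongly positive one, which is why fitting each country separately is the safer choice.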

In [None]:
# Restrict to life expectancy above 70 years to leave out Zimbabwe
restricted_data = data[data['Life expectancy at birth (years)'] > 70]

plt.figure(figsize=(10, 6))
palette = sns.color_palette("tab10", n_colors=restricted_data['Country'].nunique())

# Loop through each country
for (country, subset), color in zip(restricted_data.groupby('Country'), palette):
    x = subset['GDP'].to_numpy()
    y = subset['Life expectancy at birth (years)'].to_numpy()
    
    # Scatter points
    plt.scatter(x, y, color=color, alpha=0.6, label=country)
    
    # Compute regression
    slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
    
    # Regression line
    reg_x = np.linspace(x.min(), x.max(), 100)
    reg_y = intercept + slope * reg_x
    plt.plot(reg_x, reg_y, color=color, linewidth=2)
    
    # Annotate equation and R²
    xpos, ypos = x.mean(), y.mean()
    plt.text(
        xpos,
        ypos,
        f"{country}\ny={intercept:.1f}+{slope:.4e}x\nR²={r_value**2:.3f}",
        color=color,
        fontsize=9,
        ha='center'
    )

plt.xlabel('GDP')
plt.ylabel('Life expectancy at birth (years)')
plt.title('Life expectancy vs GDP by Country (with regression lines)')
plt.legend(title='Country', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()

As we can see, the dependency is fairly linear for most of them. China is the least linear (R²=0.825), though a line is still a fairly decent description.
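The fit parameters are easier to compare in a table than as annotations on the plot. The following sketch collects the `scipy.stats.linregress` results per country into a DataFrame. The `demo` values are invented, and the names `rows` and `summary` are only used here for illustration.

```python
import pandas as pd
from scipy import stats

# Invented values for illustration; column names follow the notebook.
demo = pd.DataFrame({
    "Country": ["A"] * 4 + ["B"] * 4,
    "GDP": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
    "Life expectancy at birth (years)": [72.0, 74.1, 75.9, 78.0, 75.0, 75.5, 77.0, 77.2],
})
col = "Life expectancy at birth (years)"

rows = []
for country, subset in demo.groupby("Country"):
    # Ordinary least-squares fit of life expectancy against GDP for one country.
    fit = stats.linregress(subset["GDP"].to_numpy(), subset[col].to_numpy())
    rows.append({
        "Country": country,
        "slope": fit.slope,
        "intercept": fit.intercept,
        "R2": fit.rvalue ** 2,
        "p_value": fit.pvalue,
    })

# One row per country, so the fits can be sorted or compared side by side.
summary = pd.DataFrame(rows).set_index("Country")
print(summary)
```

Using `restricted_data` in place of `demo` should reproduce the slopes and R² values shown in the plot annotations.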

# **Conclusions** #
In the countries included in this data set, life expectancy in 2015 is close to 80 years for most of them. The exception is Zimbabwe, which has improved its life expectancy to around 60 years but is still far behind the others. Life expectancy has had a positive trend overall, apart from Zimbabwe's decline to a minimum around 2004.

Similarly, GDP has been growing steadily for most countries, though the rate of growth depends strongly on the country. For most countries, temporary stagnation or a decrease in GDP is also visible in some years.

As for the correlation between GDP and life expectancy, it seems to be almost linear for each of the countries fitted, with a country-dependent slope and intercept.