# Exploratory Data Analysis on Life Expectancy in Japan

## Introduction

Japan is known for having one of the highest life expectancies in the world. However, not all of its 47 prefectures show the same longevity rates. This project applies data science techniques to investigate which social, economic, and health-related variables contribute most to these differences.

We used Python (Pandas, Seaborn, Matplotlib) to clean, explore, visualize, and analyze the dataset, with the goal of generating insights and recommendations for public health improvement.


## Data Loading and Overview

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Dataset loading
df = pd.read_csv("Japan_life_expectancy.csv")
df.head()

## Data Cleaning and Preparation

In [None]:
df.info()
df.describe()
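Before renaming or dropping columns, it can help to check for missing values and duplicated rows. The sketch below shows one way to do that; the `toy` frame and its values are hypothetical stand-ins, not rows from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset; values are illustrative only.
toy = pd.DataFrame({
    "Prefecture": ["A", "B", "C", "C"],
    "Life_expectancy": [84.1, np.nan, 85.0, 85.0],
    "Health_exp": [3.9, 4.1, 4.0, 4.0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Number of rows that exactly repeat an earlier row
duplicate_rows = int(toy.duplicated().sum())

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
```

The same two calls can be applied to `df` directly once the CSV has been loaded.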

In [None]:
# Rename columns for clearer, more consistent names

df.rename(columns={
    'Pshic_hosp': 'Psychiatric_hosp',
    'Beds_psic': 'Psychiatric_beds',
    'Income_per capita': 'Income_per_capita',
    'Avg_hours': 'Avg_hours_worked',
    'Hospitals': 'Priv_hospitals',
    'Educ_exp': 'Education_exp'
}, inplace=True)

In [None]:
# Drop the Prefecture label column before computing numeric statistics
df_numerico = df.drop(columns=['Prefecture'])

## Exploratory Data Analysis (EDA)

In [None]:
# Statistical summary

media = df_numerico.mean()
mediana = df_numerico.median()
moda = df_numerico.mode().iloc[0]
minimo = df_numerico.min()
maximo = df_numerico.max()
desv_std = df_numerico.std()
varianza = df_numerico.var()
rango = maximo - minimo

In [None]:
summary = pd.DataFrame({
    'Mean': media,
    'Median': mediana,
    'Mode': moda,
    'Maximum': maximo,
    'Minimum': minimo,
    'Standard Deviation': desv_std,
    'Variance': varianza,
    'Range': rango
})


# adding borders and styling

summary.style.set_table_styles(
    [{'selector': 'th, td', 'props': [('border', '1px solid blue')]}]
)
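Most of the summary table above can also be built in a single call with `DataFrame.agg`. The sketch below is an alternative, not the approach used in this notebook; the `toy` frame is hypothetical, and the mode is left out because `mode()` can return several values per column.

```python
import pandas as pd

# Hypothetical numeric data standing in for df_numerico
toy = pd.DataFrame({
    "Life_expectancy": [84.0, 85.0, 86.0],
    "Health_exp": [3.0, 4.0, 8.0],
})

# One row per variable, one column per statistic
summary_alt = toy.agg(["mean", "median", "min", "max", "std", "var"]).T

# The range is derived from the max and min columns
summary_alt["range"] = summary_alt["max"] - summary_alt["min"]

print(summary_alt)
```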


## Descriptive Analysis

In [None]:
sns.histplot(df['Welfare_exp'], kde=True, color='skyblue')
plt.title('Distribution of public spending on social welfare')
plt.ylabel('Frequency')
plt.xlabel('Percentage of public expenditure on welfare (%)')
plt.show()


In [None]:
sns.histplot(df['Public_Hosp'], kde=True, color='skyblue')
plt.title('Distribution of Public Hospitals in Japan')
plt.ylabel('Frequency')
plt.xlabel('Public hospitals')
plt.show()

In [None]:
sns.histplot(df['Life_expectancy'], kde=True, color='skyblue')
plt.title('Distribution of Average Life Expectancy')
plt.ylabel('Frequency')
plt.xlabel('Years')
plt.show()

In [None]:
sns.scatterplot(x='Avg_hours_worked', y='Life_expectancy', data=df, color='teal', s=60)
plt.title('Relationship between Hours Worked and Life Expectancy')
plt.xlabel('Average hours worked per week')
plt.ylabel('Life expectancy (years)')
plt.grid(True)
plt.tight_layout()
plt.show()

## Correlation Analysis

In [None]:
sns.scatterplot(x='Health_exp', y='Life_expectancy', data=df, color='seagreen', s=60)
plt.title('Relationship between Health Spending and Life Expectancy')
plt.xlabel('Public health expenditure (%)')
plt.ylabel('Life expectancy (years)')
plt.grid(True)
plt.tight_layout()
plt.show()

In [None]:
sns.scatterplot(x='Income_per_capita', y='Life_expectancy', data=df, color='royalblue', s=60)
plt.title('Relationship between Per Capita Income and Life Expectancy')
plt.xlabel('Income per capita (yen)')
plt.ylabel('Life expectancy (years)')
plt.grid(True)
plt.tight_layout()
plt.show()

In [None]:
sns.boxplot(x=df['Welfare_exp'], color='lightblue')
plt.title('Boxplot of social welfare spending')
plt.xlabel('Welfare_exp')
plt.show()

In [None]:
sns.histplot(df['Education_exp'], kde=True, bins=10, color='lightcoral')
plt.title('Distribution of public spending on education')
plt.xlabel('Public expenditure on education')
plt.ylabel('Frequency')
plt.grid(False)
plt.tight_layout()
plt.show()

In [None]:
sns.histplot(df['Salary'], kde=True, bins=10, color='steelblue')
plt.title('Average salary distribution')
plt.xlabel('Salary')
plt.ylabel('Frequency')
plt.grid(True)
plt.tight_layout()
plt.show()

## Predictive Analysis

In [None]:
# Calculate the correlation matrix of all numerical variables.
correlaciones = df_numerico.corr()

correlacion_con_vida = correlaciones['Life_expectancy'].sort_values(ascending=False)
print(correlacion_con_vida)


In [None]:
plt.figure(figsize=(12, 10))
sns.heatmap(df_numerico.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix between Variables')
plt.tight_layout()
plt.show()
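The matrix above uses Pearson correlation, which measures linear association and can be pulled around by a single extreme value. With only 47 prefectures, a rank-based (Spearman) correlation is a reasonable robustness check. The sketch below computes Spearman as the Pearson correlation of the ranks; the data are hypothetical and chosen to include one outlier.

```python
import pandas as pd

# Hypothetical data: y rises with x, but the last point is an outlier
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [1, 2, 3, 4, 100],
})

# Pearson: sensitive to the size of the outlier
pearson = toy["x"].corr(toy["y"])

# Spearman: Pearson correlation applied to the ranks of each variable
spearman = toy["x"].rank().corr(toy["y"].rank())

print("Pearson:", round(pearson, 3))
print("Spearman:", round(spearman, 3))
```

On the real data, `df_numerico.corr(method="spearman")` should give the rank-based version of the full matrix.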

In [None]:
# Calculation of current averages
vida_actual = df['Life_expectancy'].mean()
health_exp_actual = df['Health_exp'].mean()
income_actual = df['Income_per_capita'].mean()
educ_exp_actual = df['Education_exp'].mean()
avg_hours_actual = df['Avg_hours_worked'].mean()

# Progressive scenario for 2030
health_exp_2030 = health_exp_actual * 1.10      # +10%
income_2030 = income_actual * 1.15              # +15%
educ_exp_2030 = educ_exp_actual * 1.08          # +8%
avg_hours_2030 = avg_hours_actual * 0.98        # -2%


cor_health = df['Life_expectancy'].corr(df['Health_exp'])
cor_income = df['Life_expectancy'].corr(df['Income_per_capita'])
cor_educ = df['Life_expectancy'].corr(df['Education_exp'])
cor_hours = df['Life_expectancy'].corr(df['Avg_hours_worked'])

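# Estimated change in Life_expectancy
# Caution: correlation coefficients are unitless, so multiplying them by raw
# differences makes the result depend on each variable's units (e.g., yen).
# Treat this figure as a rough illustration; the next section uses relative changes.
cambio = (
    (health_exp_2030 - health_exp_actual) * cor_health +
    (income_2030 - income_actual) * cor_income +
    (educ_exp_2030 - educ_exp_actual) * cor_educ +
    (avg_hours_2030 - avg_hours_actual) * cor_hours
)

# Projection to 2030
vida_2030 = vida_actual + cambio

# Results
print("Current life expectancy:", round(vida_actual, 2))
print("Projected life expectancy for 2030:", round(vida_2030, 2))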

## Prescriptive Analysis and Recommendations

In [None]:
# Relative changes (proportional)
rel_change_health = 0.10  # +10%
rel_change_income = 0.15  # +15%
rel_change_educ = 0.08    # +8%
rel_change_hours = -0.02  # -2%

cambio_normalizado = (
    rel_change_health * cor_health +
    rel_change_income * cor_income +
    rel_change_educ * cor_educ +
    rel_change_hours * cor_hours
)

# Scale the combined change to years of life expectancy.
# Assumption: a combined correlation-weighted change of 1.0 equals 2 additional years of life.

impacto_esperado = cambio_normalizado * 2

vida_2030 = vida_actual + impacto_esperado
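The factor of 2 above is an assumption rather than something estimated from the data. One alternative, shown here only as a sketch and not as the method used in this notebook, is to convert a correlation into a simple-regression slope with `slope = r * std(y) / std(x)`, which expresses the change in years per unit of the predictor. The data below are synthetic, and a single-variable slope still ignores confounding between predictors.

```python
import numpy as np

# Synthetic data for illustration only (47 points, like the 47 prefectures)
rng = np.random.default_rng(0)
x = rng.normal(4.0, 0.5, size=47)                     # hypothetical predictor
y = 80.0 + 1.2 * x + rng.normal(0.0, 0.3, size=47)    # hypothetical outcome

# Pearson correlation between predictor and outcome
r = np.corrcoef(x, y)[0, 1]

# Simple-regression slope recovered from the correlation
slope_from_r = r * y.std(ddof=1) / x.std(ddof=1)

# Same slope from an ordinary least-squares line fit
slope_ols = np.polyfit(x, y, 1)[0]

# Projected change in the outcome for a +10% change in the predictor's mean
delta_x = 0.10 * x.mean()
delta_y = slope_from_r * delta_x

print("Slope from r:", slope_from_r)
print("Slope from OLS:", slope_ols)
print("Projected change in y:", delta_y)
```

Because the slope carries the units of the outcome per unit of the predictor, the result does not change when the predictor is rescaled, for example from yen to thousands of yen.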


In [None]:
etiquetas = ['2020', 'Projection (2030)']
valores = [vida_actual, vida_2030]

plt.figure(figsize=(6, 4))
plt.bar(etiquetas, valores, color=['steelblue', 'mediumseagreen'])
plt.ylabel('Life expectancy (years)')
plt.title('Projected Life Expectancy in Japan')
plt.ylim(min(valores) - 1, max(valores) + 1)
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

In [None]:
#  Get the top 5 prefectures with the highest life expectancy
top_5 = df.sort_values(by='Life_expectancy', ascending=False).head(5)

#  Calculate the national average of health expenditure
avg_health_exp = df['Health_exp'].mean()

#  Prepare data for the bar chart
names = list(top_5['Prefecture']) + ['National Average']
values = list(top_5['Health_exp']) + [avg_health_exp]

#  Create the bar chart
plt.figure(figsize=(10, 5))
bars = plt.bar(names, values, color=['seagreen'] * 5 + ['gray'])

#  Chart aesthetics
plt.title('Health Expenditure in the 5 Prefectures with the Highest Life Expectancy')
plt.ylabel('Health Expenditure (%)')
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.5)

#  Add value labels on top of each bar
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, height + 0.1, round(height, 2), ha='center', va='bottom')

plt.tight_layout()
plt.show()


## Conclusion

The analysis revealed clear and measurable relationships between life expectancy and several key factors across Japan’s 47 prefectures:

Education: Prefectures with higher percentages of university or technical education (e.g., Nara, Kyoto) showed a strong positive correlation with life expectancy.

Income: A moderate positive relationship was found between income per capita and life expectancy, indicating that residents of wealthier prefectures tend to live longer.

Public Health Expenditure: Higher investment in health services was associated with increased longevity. For example, Nara, with the highest life expectancy, also had above-average spending on health (4.06%).

Work-Life Balance: Although weaker, there was a negative correlation between average working hours and life expectancy, suggesting that reducing overwork could contribute to longer and healthier lives.

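A hypothetical scenario simulating modest improvements (+10% in health spending, +15% in income per capita, +8% in education spending, -2% in working hours) projected an increase in national life expectancy from 84.53 to approximately 84.8 years by 2030. The projection weights each change by its correlation with life expectancy and applies an assumed scaling factor, so it is illustrative rather than a forecast.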

## References

- [Japan Life Expectancy Dataset - Kaggle](https://www.kaggle.com/datasets/gianinamariapetrascu/japan-life-expectancy)  
- DataCamp Resources  
- JMP Statistics Knowledge Portal  
