# How does climate change feel around the globe?
# The final project from Spiced Academy
# Notebook for analysis

Check cleaning.ipynb for data cleaning.

Check vizzes.ipynb for generating visualizations.

## Contents

[The questions](#questions)

[Importing libraries and packages](#import)

[Correlation between population exposure and time](#corr_pop_time)

[Exposure change for the whole world](#exp_world)

[Temperature vs exposure worldwide](#temp_exp)

[The effects in rich and poor countries (GDP per capita)](#poor_rich)

## The questions <a id='questions'></a>

In this project, I attempt to answer these pressing questions about heat periods, which are becoming ever more frequent in virtually every part of the world:
1. What percentage of people has direct experience with extreme heat?
2. Does this number change over time?
3. How much is it related to the global temperature anomaly?
4. Is there a clear link between wealth and heat exposure of populations?

## Importing libraries and packages <a id='import'></a>

In [1]:
import pandas as pd   # df workflow

# for clustering
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

from scipy import stats   # for A/B test

## Correlation between population exposure and time <a id='corr_pop_time'></a>

We will calculate the correlation between population exposure and time for each country.

Read the cleaned data frame from a file.


In [2]:
df_analyze_corr = pd.read_csv('../exported_dfs/exposures_summed.csv')

Keep only the rows where 'measure' describes population exposure, i.e. drop the rows whose measure contains 'TEMP' or 'UTCI_POP_IND'.


In [3]:
# rows whose measure contains 'TEMP' or 'UTCI_POP_IND' (the pattern is a regular expression)
mask_other = df_analyze_corr['measure'].str.contains('TEMP|UTCI_POP_IND')
df_analyze_corr = df_analyze_corr.loc[~mask_other].reset_index(drop=True)

We need a sum of exposures for all durations, therefore group by reference area, measure and time period and calculate the sum.

In [5]:
df_analyze_corr = df_analyze_corr.groupby(['ref_area', 'measure', 'time_period'], as_index=False)['exposure'].sum()

**The actual correlations.**

Below, select the measure in which we are interested.

In [6]:
df_analyze_corr = df_analyze_corr[df_analyze_corr['measure']=='HD_TN_POP_IND']   # HERE SELECT THE MEASURE
df_analyze_corr = df_analyze_corr.drop('measure', axis='columns')   # do not need the column any longer

Loop over the countries and calculate the correlation coefficient between exposure and time for each one.


In [7]:
countries = list(df_analyze_corr['ref_area'].unique())   # list of countries present
corr_coeffs = []   # empty list for correlation coefficients

for country in countries:
    df_country = df_analyze_corr[df_analyze_corr['ref_area']==country]   # filter for the country
    df_corr = df_country.corr(numeric_only=True)   # calculate correlation coefficients for numeric columns
    coeff = df_corr.loc['time_period', 'exposure']   # correlation between year and exposure
    corr_coeffs.append(coeff)   # store the coefficient inside of the list
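The same per-country coefficient can also be computed without an explicit loop. The following is a minimal sketch on made-up toy data (the values and the names `toy` and `toy_corr` are illustrative assumptions, not part of the project files); it groups by country and correlates the year with the exposure inside each group.

```python
import pandas as pd

# Toy data: made-up numbers, only the column names mirror the notebook
toy = pd.DataFrame({
    'ref_area': ['AAA'] * 4 + ['BBB'] * 4,
    'time_period': [2000, 2001, 2002, 2003] * 2,
    'exposure': [1.0, 2.0, 3.0, 4.0, 8.0, 6.0, 4.0, 2.0],
})

# Pearson correlation between year and exposure, one value per country
toy_corr = (
    toy.groupby('ref_area')[['time_period', 'exposure']]
       .apply(lambda g: g['time_period'].corr(g['exposure']))
       .rename('corr_coeff')
       .reset_index()
)
print(toy_corr)
```

In the toy data, exposure rises linearly for 'AAA' and falls linearly for 'BBB', so the coefficients come out as (approximately) 1 and -1.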

We can store the result in a data frame.


In [8]:
df_corr_results = pd.DataFrame(columns=['countries','corr_coeff'])
df_corr_results['countries'] = countries
df_corr_results['corr_coeff'] = corr_coeffs

Save the data frame into a file.

In [9]:
df_corr_results.to_csv('../exported_dfs/correlations.csv', index=False)

## Exposure change for the whole world <a id='exp_world'></a>

Bonus (not included in the graduation presentation): To describe the worldwide trend in population exposure, we need to calculate the population-weighted average over all countries for every year.

Read the cleaned data frames from files.

In [14]:
df_exp_summed = pd.read_csv('../exported_dfs/exposures_summed.csv')
df_pop = pd.read_csv('../exported_dfs/populations_clean.csv')

Select the measure we are interested in.

In [15]:
df_exp_measure = df_exp_summed[df_exp_summed['measure']=='HD_TN_POP_IND'] # HERE SELECT THE MEASURE

Merge data frames with exposure and population size.


In [16]:
df_exp_pop = pd.merge(left=df_exp_measure, right=df_pop, how='left', on=['ref_area', 'time_period'])
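A left merge keeps every exposure row, so any country-year pair without a population figure ends up with a missing value. One way to see which rows are affected is the `indicator` option of `pd.merge`. This is a small sketch on made-up data (the frames `left`, `right` and their values are assumptions for illustration only).

```python
import pandas as pd

# Made-up frames that mimic the exposure and population tables
left = pd.DataFrame({
    'ref_area': ['AAA', 'BBB'],
    'time_period': [2020, 2020],
    'exposure': [5.0, 7.0],
})
right = pd.DataFrame({
    'ref_area': ['AAA'],
    'time_period': [2020],
    'population': [100.0],
})

# indicator=True adds a '_merge' column telling where each row came from
merged = pd.merge(left, right, how='left',
                  on=['ref_area', 'time_period'], indicator=True)

# rows that found no partner in the population table
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['ref_area', 'time_period']])
```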

For the weighted average below, we need to replace missing population values with zeros, so that these rows carry zero weight instead of turning the sums into NaN.


In [17]:
df_exp_pop['population'] = df_exp_pop['population'].fillna(0)

Loop over the years and calculate the population-weighted average exposure for each one.

In [18]:
# new df for years and weighted averages:

years = list(df_exp_pop['time_period'].unique())   # list of years present
exp_world = []   # empty list for worldwide exposures

for year in years:
    df_calc = df_exp_pop[df_exp_pop['time_period']==year]   # filter for the year
    avg_world = sum(df_calc['exposure'] * df_calc['population'])/ sum(df_calc['population'])   # weighted average
    exp_world.append(avg_world)   # store the average inside of the list
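The weighted average used in the loop is the sum of exposure times population divided by the total population. As a hedged sketch, the same quantity can be obtained with `np.average` inside a `groupby`; the toy numbers and the helper name `weighted_mean` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Made-up data: two "countries" per year
toy = pd.DataFrame({
    'time_period': [2020, 2020, 2021, 2021],
    'exposure': [10.0, 20.0, 30.0, 50.0],
    'population': [1.0, 3.0, 1.0, 1.0],
})

def weighted_mean(group):
    # hypothetical helper: sum(exposure * population) / sum(population)
    return np.average(group['exposure'], weights=group['population'])

toy_world = (
    toy.groupby('time_period')[['exposure', 'population']]
       .apply(weighted_mean)
       .rename('exposure')
       .reset_index()
)
print(toy_world)
```

For 2020 this gives (10·1 + 20·3) / 4 = 17.5, and for 2021 (30 + 50) / 2 = 40.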

We can store the result in a data frame.


In [19]:
df_world_exp = pd.DataFrame(columns=['year','exposure'])
df_world_exp['year'] = years
df_world_exp['exposure'] = exp_world

Save the data frame into a file.

In [20]:
df_world_exp.to_csv('../exported_dfs/world_exp.csv', index=False)

## Temperature vs exposure worldwide <a id='temp_exp'></a>

We will merge the data on worldwide population exposure and temperature anomalies to check for correlation and to visualize any trends.

Read the cleaned data frames from files.

In [21]:
df_temp_anomaly = pd.read_csv('../exported_dfs/temp_anomaly_clean.csv')
df_world_exp = pd.read_csv('../exported_dfs/world_exp.csv')

Merge temperature anomaly and worldwide population exposure data frames.


In [22]:
df_temp_anomaly_world_exp = pd.merge(df_temp_anomaly, df_world_exp)

Calculate the correlation coefficients.

In [23]:
df_temp_anomaly_world_exp.corr()

| | year | avg_anomaly | exposure |
|---|---|---|---|
| year | 1.0 | 0.929837 | 0.915518 |
| avg_anomaly | 0.929837 | 1.0 | 0.930867 |
| exposure | 0.915518 | 0.930867 | 1.0 |
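`DataFrame.corr` returns only the coefficients. If a p-value is also of interest, `scipy.stats.pearsonr` returns both for a single pair of columns. The sketch below uses made-up numbers (the frame `toy` is an assumption, not the project data) and checks that both routes give the same coefficient.

```python
import pandas as pd
from scipy import stats

# Made-up values standing in for temperature anomaly and exposure
toy = pd.DataFrame({
    'avg_anomaly': [0.1, 0.3, 0.2, 0.5, 0.6],
    'exposure': [1.0, 2.5, 2.0, 4.0, 4.5],
})

# coefficient from pandas (Pearson by default)
r_pandas = toy['avg_anomaly'].corr(toy['exposure'])

# coefficient and two-sided p-value from scipy
r_scipy, p_value = stats.pearsonr(toy['avg_anomaly'], toy['exposure'])

print(r_pandas, r_scipy, p_value)
```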


## The effects in rich and poor countries (GDP per capita) <a id='poor_rich'></a>

We will investigate whether population exposure is generally greater in poor countries. We will first calculate the correlation between population exposure and GDP per capita and then use k-means to cluster the countries on these two variables. For two clusters, we will also perform an A/B test to compare the group characteristics.

**Correlation between population exposure and GDP per capita**

Read the cleaned data frames from files.

In [24]:
df_exp_summed = pd.read_csv('../exported_dfs/exposures_summed.csv')
df_gdp = pd.read_csv('../exported_dfs/gdp_clean.csv')

First, select the measure we are interested in.

In [25]:
df_exp_summed = df_exp_summed[df_exp_summed['measure']=='HD_TN_POP_IND'] # HERE SELECT THE MEASURE
df_exp_summed = df_exp_summed.drop('measure', axis='columns') # do not need the column any longer

Merge the data frames for population exposures and GDP per capita. Both data frames contain a 'country' column, but some names differ slightly between them, so drop these columns first and merge on the reference area code and the time period.


In [26]:
df_exp_summed.drop('country', axis='columns', inplace=True)
df_gdp.drop('country', axis='columns', inplace=True)
df_exp_gdp = pd.merge(df_exp_summed, df_gdp, how='inner', on=['ref_area', 'time_period'])

The correlation between numerical variables.


In [27]:
df_exp_gdp.corr(numeric_only=True)

| | time_period | exposure | gdp |
|---|---|---|---|
| time_period | 1.0 | -0.004992 | 0.013863 |
| exposure | -0.004992 | 1.0 | -0.317923 |
| gdp | 0.013863 | -0.317923 | 1.0 |


**k-means**

The algorithm gives more insightful results when omitting small countries with extremely high GDP per capita and zero population exposure. Drop these countries.


In [28]:
df_exp_gdp = df_exp_gdp[~df_exp_gdp['ref_area'].isin(['LUX', 'BMU', 'LIE', 'MCO'])]

Group by countries, get average population exposure and GDP per capita to cluster.

Recall that GDP data is only available for the period 2017 - 2021. This is a reasonable time interval to explore, since averaging suppresses fluctuations while all the data is recent and describes the current situation.


In [29]:
df_exp_gdp_avg = df_exp_gdp.groupby('ref_area')[['exposure', 'gdp']].mean()

Create a data frame with numerical columns only.


In [30]:
df_num = df_exp_gdp_avg[['exposure', 'gdp']]

Standardize the data so that every column has mean 0 and standard deviation 1.

In [31]:
scaler = StandardScaler()
scaler.fit(df_num)
df_num_scaled = scaler.transform(df_num) # array of standardized data
df_standardized = pd.DataFrame(df_num_scaled, columns=df_num.columns) # data frame from the array
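Standardizing matters here because GDP per capita and exposure live on very different scales, and k-means works with Euclidean distances. As a minimal sketch on made-up numbers (the frame `toy` is an assumption), `StandardScaler` amounts to subtracting the column mean and dividing by the population standard deviation (`ddof=0`):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values on very different scales
toy = pd.DataFrame({
    'exposure': [0.0, 10.0, 20.0],
    'gdp': [1000.0, 3000.0, 8000.0],
})

scaled = StandardScaler().fit_transform(toy)      # NumPy array
manual = (toy - toy.mean()) / toy.std(ddof=0)     # same formula by hand

print(np.allclose(scaled, manual.to_numpy()))
```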

Optional: Elbow diagram to choose an appropriate number of clusters.


In [None]:
# import numpy as np                  # needed for the tick positions
# import matplotlib.pyplot as plt     # needed for the plot

# K = range(2, 10)   # try the fitting with 2 to 9 clusters
# inertia = []
# for k in K:
#     kmeans = KMeans(n_clusters=k,
#                     n_init=10)
#     kmeans.fit(df_standardized)
#     inertia.append(kmeans.inertia_)

# plt.figure(figsize=(16,8))
# plt.plot(K, inertia, 'bx-')
# plt.xlabel('k')
# plt.ylabel('inertia')
# plt.xticks(np.arange(min(K), max(K) + 1, 1.0))   # one tick per tested k
# plt.title('Elbow Diagram')

k-Means (**choose number of clusters here**)

In [32]:
kmeans = KMeans(n_clusters=2, n_init=10) # CHOOSE NO OF CLUSTERS; n_init set explicitly to the value used by default here
kmeans.fit(df_standardized)



Defining the clusters (array of labels).


In [33]:
clusters = kmeans.predict(df_standardized)
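Note that the label numbers (0, 1, ...) produced by k-means are arbitrary and can swap between runs unless the random seed is fixed. The sketch below uses made-up points and an assumed `random_state=0` (not used in the notebook) to show that `predict` on the training data returns one label per row, with nearby points sharing a label.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up points: two well-separated groups of three
X = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
])

# random_state is an assumption added here to make the run repeatable
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.predict(X)

# which group is called 0 and which 1 carries no meaning
print(labels)
```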

Numbers of members in each cluster.


In [40]:
pd.Series(clusters).value_counts()

0    98
1    95
Name: count, dtype: int64

Adding clusters to the original data frame.


In [41]:
df_clustered = df_exp_gdp_avg.copy()   # copy of the original data frame
df_clustered['cluster'] = clusters   # new column with cluster labels
df_clustered['cluster'] = df_clustered['cluster'].astype(str)   # labels as string, useful for visualizations
df_clustered = df_clustered.reset_index()

Save the data frame into a file.

In [42]:
df_clustered.to_csv('../exported_dfs/clusters.csv', index=False)

**For 2 clusters: A/B test**

Making separate data frames for each cluster to compare.

In [43]:
df_cluster_0 = df_clustered[df_clustered['cluster']=='0']
df_cluster_1 = df_clustered[df_clustered['cluster']=='1']

Two-tailed two-sample t-test (the A/B test) for each variable; the printed values are the test statistic and the p-value.


In [44]:
test_statistic, pvalue = stats.ttest_ind(df_cluster_0['gdp'], df_cluster_1['gdp'])
print('GDP:', test_statistic, pvalue)

test_statistic, pvalue = stats.ttest_ind(df_cluster_0['exposure'], df_cluster_1['exposure'])
print('Exposure:', test_statistic, pvalue)

GDP: -7.324105263740695 6.555060849682889e-12
Exposure: 20.36103305776211 9.637934922370356e-50
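By default, `stats.ttest_ind` assumes that both groups have equal variances. If the two clusters differ clearly in spread, Welch's variant (`equal_var=False`) is a common alternative. The following sketch runs both on synthetic samples (the arrays `group_a` and `group_b` and the seed are assumptions for illustration, not the cluster data).

```python
import numpy as np
from scipy import stats

# Synthetic samples with different sizes and spreads
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=50)
group_b = rng.normal(loc=1.0, scale=3.0, size=60)

# Student's t-test (equal variances assumed, the default)
student = stats.ttest_ind(group_a, group_b)

# Welch's t-test (no equal-variance assumption)
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print('Student:', student.statistic, student.pvalue)
print('Welch:  ', welch.statistic, welch.pvalue)
```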
