

# Exploring the Association between Culture and Development

Hofstede defined national culture as the collective programming of people that affects how they think, behave, and act as a group. A nation's development includes its health status, economic output, energy consumption, and many other aspects. It is natural to think that culture and development are related, since how we think and act may affect how we produce and consume. Vice versa, how we produce and consume may affect how we think and act.

This exploratory study uses world development indicators data from the World Bank and national cultural dimensions data from Hofstede to find out whether an association exists and, if so, what its nature is.

## Part 1 - Compute Statistics

### Step 1 - Import Python libraries

In [1]:
import pandas as pd
import plotly.express as px
import statsmodels.api as sm
from scipy.stats import pearsonr

### Step 2 - Load and Merge Datasets

In [2]:
# The national culture dimensions dataset was downloaded from Hofstede, preprocessed, and uploaded to GitHub.

CULTURE_DATA_URL = "https://raw.githubusercontent.com/wcj365/public_data/master/national_culture_dimensions.csv"

df_cult = pd.read_csv(CULTURE_DATA_URL)

df_cult.head()

Unnamed: 0,Country Code,Country Name,PDI,IDV,MAS,UAI,LTO,IVR
0,ARG,Argentina,49,46,56,86,20,62
1,AUS,Australia,38,90,61,51,21,71
2,AUT,Austria,11,55,79,70,60,63
3,BEL,Belgium,65,75,54,94,82,57
4,BGD,Bangladesh,80,20,55,60,47,20


In [5]:
df_cult.columns

Index(['Country Code', 'Country Name', 'PDI', 'IDV', 'MAS', 'UAI', 'LTO',
       'IVR'],
      dtype='object')

In [3]:
# The world development indicators dataset was downloaded from the World Bank website.
# The dataset is over 150 MB in size and exceeds GitHub's file size limit,
# so it was uploaded to Google Drive instead.

WDI_DATA_URL = "https://drive.google.com/file/d/15uXGxk5aedtuss3yd3OsY2WU6UcJ1Vta/view?usp=sharing"

# WDI_DATA_URL is kept for reference only; the data is read from a local copy.
df_wdi = pd.read_csv("../data/WDIData.csv")

print(df_wdi.shape)

df_wdi.head()

(383572, 66)


Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.ZS,,,,,,,...,16.433999,16.789043,17.196986,17.597176,18.034249,18.345878,18.695306,19.149942,19.501837,
1,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.RU.ZS,,,,,,,...,6.196543,6.397917,6.580066,6.786218,6.941323,7.096843,7.254828,7.460783,7.599289,
2,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.UR.ZS,,,,,,,...,37.434876,37.660864,37.857526,38.204173,38.303515,38.421813,38.482409,38.692053,38.793983,
3,Africa Eastern and Southern,AFE,Access to electricity (% of population),EG.ELC.ACCS.ZS,,,,,,,...,31.682318,31.610692,31.82495,33.744405,38.733352,40.092163,42.880977,44.073912,45.609604,
4,Africa Eastern and Southern,AFE,"Access to electricity, rural (% of rural popul...",EG.ELC.ACCS.RU.ZS,,,,,,,...,19.30785,18.535523,17.485006,16.329765,24.372504,25.153292,27.227391,29.383,30.163364,


In [7]:
# Merge the two datasets 

df_wdi.drop(columns=["Country Name", "Indicator Name"], inplace=True)
df_cult.drop(columns=["Country Name"], inplace=True)

df_merge = pd.merge(df_wdi, df_cult, how="inner", on="Country Code")

print(df_merge.shape)
df_merge.head()

(86520, 70)


Unnamed: 0,Country Code,Indicator Code,1960,1961,1962,1963,1964,1965,1966,1967,...,2018,2019,2020,2021,PDI,IDV,MAS,UAI,LTO,IVR
0,ARG,EG.CFT.ACCS.ZS,,,,,,,,,...,99.8,99.9,99.9,,49,46,56,86,20,62
1,ARG,EG.CFT.ACCS.RU.ZS,,,,,,,,,...,96.6,97.0,97.2,,49,46,56,86,20,62
2,ARG,EG.CFT.ACCS.UR.ZS,,,,,,,,,...,99.9,99.9,99.9,,49,46,56,86,20,62
3,ARG,EG.ELC.ACCS.ZS,,,,,,,,,...,99.989578,100.0,100.0,,49,46,56,86,20,62
4,ARG,EG.ELC.ACCS.RU.ZS,,,,,,,,,...,99.871811,100.0,100.0,,49,46,56,86,20,62


In [8]:
df_merge["Country Code"].nunique()

60

In [9]:
df_merge["Indicator Code"].nunique()

1442
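An inner merge silently drops every row whose country code appears in only one dataset. That includes WDI regional aggregates such as `AFE`, and any culture-data country whose code does not match. The sketch below shows one way to list what was dropped, using the `indicator=True` option of `pd.merge`. The two small frames and the code `XYZ` are made-up stand-ins, not the real data.

```python
import pandas as pd

# Toy stand-ins for the two datasets (values are illustrative only)
wdi = pd.DataFrame({
    "Country Code": ["ARG", "ARG", "AFE", "AFE"],
    "Indicator Code": ["EG.ELC.ACCS.ZS", "SP.DYN.IMRT.IN"] * 2,
    "2020": [100.0, 8.0, 45.6, 40.0],
})
cult = pd.DataFrame({
    "Country Code": ["ARG", "XYZ"],   # "XYZ" is a made-up code with no WDI rows
    "IDV": [46, 50],
})

# Compare the distinct country codes on each side with an outer merge
codes = pd.merge(
    wdi[["Country Code"]].drop_duplicates(),
    cult[["Country Code"]],
    how="outer",
    on="Country Code",
    indicator=True,   # adds a "_merge" column: left_only / right_only / both
)

wdi_only = sorted(codes.loc[codes["_merge"] == "left_only", "Country Code"])
cult_only = sorted(codes.loc[codes["_merge"] == "right_only", "Country Code"])
matched = sorted(codes.loc[codes["_merge"] == "both", "Country Code"])

print("Only in WDI:", wdi_only)
print("Only in culture data:", cult_only)
print("Matched:", matched)
```

Running the same check on the real frames before dropping columns would show whether any culture-data countries are missing from the 60 that survive the merge.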

### Step 3 - Compute and Save the Correlation Statistics

For each year and each indicator, compute the following statistics against the IDV culture dimension:

- Pearson's Correlation Coefficient and its P-value
- OLS coefficient and its P-value
- Coefficient of Determination (R Squared)
- Adjusted Coefficient of Determination (R Squared Adjusted)

These statistics are saved to a file in CSV format for future exploration.

#### 3.1 Compute Statistics

In [27]:
indicator_list = df_merge["Indicator Code"].unique()
year_list = [str(year) for year in range(1960, 2022)]

result_list = []


for year in year_list:

    for indicator in indicator_list:

        # Keep only countries that have both a value for this year and an IDV score
        _df = df_merge[df_merge["Indicator Code"] == indicator][[year, "IDV"]].dropna()

        if _df.shape[0] <= 2:  # Must have at least three data points
            row = [year, indicator, _df.shape[0], None, None, None, None, None, None]
        else:
            X = _df["IDV"]
            Y = _df[year]
            pearson = pearsonr(X, Y)
            X = sm.add_constant(X)  # add the intercept term
            model = sm.OLS(Y, X).fit()
            row = [year,
                   indicator,
                   _df.shape[0],
                   round(pearson[0], 2),
                   round(pearson[1], 2),
                   round(model.params["IDV"], 2),   # slope on IDV, selected by label
                   round(model.pvalues["IDV"], 2),
                   round(model.rsquared, 2),
                   round(model.rsquared_adj, 2)
            ]

        result_list.append(row)

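R squared is computed as one minus the residual sum of squares divided by the centered total sum of squares. If an indicator takes the same value in every country for a given year, that total sum of squares is zero and the ratio is undefined, which is a likely source of runtime warnings during the loop. The sketch below shows one possible guard. The helper `has_enough_variation` is a hypothetical name, not part of the original notebook, and the toy frames are illustrative only.

```python
import pandas as pd

def has_enough_variation(df, x_col, y_col, min_points=3):
    """Return True if df has enough rows and both columns actually vary."""
    if len(df) < min_points:
        return False
    return df[x_col].nunique() > 1 and df[y_col].nunique() > 1

# Toy examples (values are illustrative only)
varying = pd.DataFrame({"IDV": [20, 46, 90], "2020": [1.0, 2.5, 4.0]})
constant = pd.DataFrame({"IDV": [20, 46, 90], "2020": [100.0, 100.0, 100.0]})
too_few = pd.DataFrame({"IDV": [20, 46], "2020": [1.0, 2.0]})

print(has_enough_variation(varying, "IDV", "2020"))
print(has_enough_variation(constant, "IDV", "2020"))
print(has_enough_variation(too_few, "IDV", "2020"))
```

Using a check like this in place of the row-count test would record such cases as missing instead of fitting a model to them. That changes which rows carry statistics, so it is a design choice and not a drop-in fix.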


#### 3.2 Save the Statistics to a Dataframe

In [28]:
column_list = [
    "Year", 
    "Indicator Code", 
    "Countries", 
    "Pearson R",
    "Pearson P-value",
    "Coefficient",
    "P_value", 
    "R_Squared", 
    "R_Squared_Adj"
]

df_results = pd.DataFrame(result_list, columns=column_list)

print(df_results.shape)
df_results.head()

(89404, 9)


Unnamed: 0,Year,Indicator Code,Countries,Pearson R,Pearson P-value,Coefficient,P_value,R_Squared,R_Squared_Adj
0,1960,EG.CFT.ACCS.ZS,0,,,,,,
1,1960,EG.CFT.ACCS.RU.ZS,0,,,,,,
2,1960,EG.CFT.ACCS.UR.ZS,0,,,,,,
3,1960,EG.ELC.ACCS.ZS,0,,,,,,
4,1960,EG.ELC.ACCS.RU.ZS,0,,,,,,


In [29]:
df_results.sample(10)

Unnamed: 0,Year,Indicator Code,Countries,Pearson R,Pearson P-value,Coefficient,P_value,R_Squared,R_Squared_Adj
19118,1973,SL.EMP.TOTL.SP.MA.NE.ZS,3,-0.9,0.29,-0.2,0.29,0.8,0.61
63847,2004,NE.EXP.GNFS.KD,57,0.44,0.0,4673216000.0,0.0,0.19,0.18
8476,1965,SH.MED.SAOP.P5,0,,,,,,
43811,1990,NY.GDP.FCST.KD,39,0.4,0.01,26945280000.0,0.01,0.16,0.14
57691,2000,FX.OWN.TOTL.PL.ZS,0,,,,,,
65722,2005,DC.DAC.NZLL.CD,20,-0.19,0.42,-49022.56,0.42,0.04,-0.02
58058,2000,SL.EMP.1524.SP.MA.NE.ZS,40,0.36,0.02,0.23,0.02,0.13,0.11
83514,2017,SP.POP.TECH.RD.P6,34,0.59,0.0,22.41,0.0,0.34,0.32
28718,1979,SP.POP.TECH.RD.P6,0,,,,,,
73669,2011,TM.TAX.TCOM.BR.ZS,56,-0.44,0.0,-0.58,0.0,0.19,0.18


In [30]:
df_results.sort_values("Pearson R")

Unnamed: 0,Year,Indicator Code,Countries,Pearson R,Pearson P-value,Coefficient,P_value,R_Squared,R_Squared_Adj
63604,2004,SL.SRV.0714.MA.ZS,3,-1.0,0.05,-1.36,0.05,0.99,0.99
15592,1970,FR.INR.RINR,4,-1.0,0.00,-0.12,0.00,0.99,0.99
77886,2014,per_lm_alllm.adq_pop_tot,3,-1.0,0.05,-1.23,0.05,0.99,0.99
18476,1972,FR.INR.RINR,4,-1.0,0.00,-0.12,0.00,0.99,0.99
77852,2013,SP.DYN.WFRT,3,-1.0,0.04,-0.05,0.04,1.00,0.99
...,...,...,...,...,...,...,...,...,...
89397,2021,SG.VAW.GOES.ZS,0,,,,,,
89398,2021,SG.VAW.NEGL.ZS,0,,,,,,
89399,2021,SG.VAW.REFU.ZS,0,,,,,,
89400,2021,SP.M15.2024.FE.ZS,0,,,,,,


#### 3.3 Add Topic to the Dataframe

In [35]:
df_series = pd.read_csv("../data/WDISeries.csv")

print(df_series.shape)

df_series.sample(5)

(1442, 21)


Unnamed: 0,Series Code,Topic,Indicator Name,Short definition,Long definition,Unit of measure,Periodicity,Base Period,Other notes,Aggregation method,...,Notes from original source,General comments,Source,Statistical concept and methodology,Development relevance,Related source links,Other web links,Related indicators,License Type,Unnamed: 20
22,AG.LND.TRAC.ZS,Environment: Agricultural production,"Agricultural machinery, tractors per 100 sq. k...",,Agricultural machinery refers to the number of...,,Annual,,,Weighted average,...,,,"Food and Agriculture Organization, electronic ...",A tractor provides the power and traction to m...,Agricultural land covers more than one-third o...,,,,CC BY-4.0,
1266,SP.POP.0004.MA.5Y,Health: Population: Structure,"Population ages 00-04, male (% of male populat...",,Male population between the ages 0 to 4 as a p...,,Annual,,,,...,,,United Nations Population Division. World Popu...,,,,,,CC BY-4.0,
1130,SL.EMP.TOTL.SP.MA.ZS,Social Protection & Labor: Economic activity,"Employment to population ratio, 15+, male (%) ...",,Employment to population ratio is the proporti...,,Annual,,,Weighted average,...,"Given the exceptional situation, including the...",National estimates are also available in the W...,International Labour Organization. “ILO Modell...,The employment-to-population ratio indicates h...,Four targets were added to the UN Millennium D...,,,,CC BY-4.0,
1183,SL.TLF.ACTI.MA.ZS,Social Protection & Labor: Labor force structure,"Labor force participation rate, male (% of mal...",,Labor force participation rate is the proporti...,,Annual,,,Weighted average,...,"Given the exceptional situation, including the...",National estimates are also available in the W...,International Labour Organization. “ILO modell...,The labor force is the supply of labor availab...,Estimates of women in the labor force and empl...,,,,CC BY-4.0,
1021,SH.STA.OWGH.ME.ZS,Health: Nutrition,"Prevalence of overweight (modeled estimate, % ...",,Prevalence of overweight children is the perce...,,Annual,,,Weighted average,...,,Once considered only a high-income economy pro...,"UNICEF, WHO, World Bank: Joint child malnutrit...",,,,,,CC BY-4.0,


In [36]:
df_series.rename(columns={"Series Code":"Indicator Code"}, inplace=True)

df_series.head(1).T

Unnamed: 0,0
Indicator Code,AG.AGR.TRAC.NO
Topic,Environment: Agricultural production
Indicator Name,"Agricultural machinery, tractors"
Short definition,
Long definition,Agricultural machinery refers to the number of...
Unit of measure,
Periodicity,Annual
Base Period,
Other notes,
Aggregation method,Sum


In [38]:
df_merge = pd.merge(df_results, df_series[["Indicator Code", "Indicator Name", "Topic"]], how="left", on="Indicator Code")

print(df_merge.shape)

df_merge.sample(5)

(89404, 11)


Unnamed: 0,Year,Indicator Code,Countries,Pearson R,Pearson P-value,Coefficient,P_value,R_Squared,R_Squared_Adj,Indicator Name,Topic
31382,1981,SH.STA.OWGH.ZS,0,,,,,,,"Prevalence of overweight, weight for height (%...",Health: Nutrition
24289,1976,SE.SEC.PRIV.ZS,0,,,,,,,"School enrollment, secondary, private (% of to...",Education: Participation
29632,1980,SP.DYN.IMRT.IN,55,-0.57,0.0,-0.78,0.0,0.33,0.32,"Mortality rate, infant (per 1,000 live births)",Health: Mortality
64511,2004,BX.PEF.TOTL.CD.WD,56,0.23,0.09,242685700.0,0.09,0.05,0.04,"Portfolio equity, net inflows (BoP, current US$)",Economic Policy & Debt: Balance of payments: C...
18489,1972,GC.REV.XGRT.GD.ZS,26,0.58,0.0,0.16,0.0,0.34,0.31,"Revenue, excluding grants (% of GDP)",Public Sector: Government finance: Revenue


#### 3.4 Save the Dataframe to a File

In [32]:
df_merge.sort_values("Pearson R").to_csv("../data/results.csv", index=False)