# World Bank Ed Stats Model Building

This data has been collected and encoded by the World Bank as indicators for educational performance and attainment, as well as expenditure on education, since 1970. The data spans the countries of the world and aggregates some regions and socio-economic distinctions. The dataset is sparse, however, with a majority of null values. As I prepare the dataset for modeling, I have three objectives:

### Arrange the Data 
### Context and Visualizations
### Null Handling and Feature Selection

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import difflib
from collections import Counter

%matplotlib inline

# Open and read the CSV file into a DataFrame
data = pd.read_csv('EdStatsData.csv')

# Save a dictionary matching the indicator code to its indicator name.
# Zipping the two columns row by row keeps each code paired with its own name,
# even if two different codes happen to share a name.
indicator_dict = dict(zip(data['Indicator Code'], data['Indicator Name']))

## Arrange the Data
First, we will remove some of the unnecessary columns, like country codes, indicator names, and future years. There are far too many indicators to check the missing values for each one, so we will isolate the indicator group from the prefix of each indicator code. Then we will separate the data frame into countries, regions, and socio-economic levels. Our immediate interest is the set of countries.
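As a minimal sketch of these two steps on made-up data: the group is taken as the text before the first dot of the code, and aggregates are filtered out by name. The toy rows, `toy_regions`, and `toy_income_levels` are hypothetical stand-ins, not the full lists used in this notebook.

```python
import pandas as pd

# Toy stand-in for the EdStats layout; the rows and values are made up
toy = pd.DataFrame({
    'Country Name': ['Arab World', 'Low income', 'Afghanistan', 'Albania'],
    'Indicator Code': ['UIS.NERA.2', 'SE.PRM.TENR', 'SE.PRM.TENR', 'BAR.NOED.1519.ZS'],
    '1970': [None, 50.0, None, 21.7],
})

# The indicator group is the text before the first dot in the code
toy['Indicator Group'] = toy['Indicator Code'].str.split('.').str[0]

# Hypothetical, abbreviated lists of aggregate names
toy_regions = ['Arab World']
toy_income_levels = ['Low income']

# Keep only rows whose name is neither a region nor an income level
is_aggregate = toy['Country Name'].isin(toy_regions + toy_income_levels)
toy_countries = toy[~is_aggregate]
print(toy_countries[['Country Name', 'Indicator Group']])
```

Building a single boolean mask keeps the country filter in one place, which may be easier to audit than two successive filters.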

In [2]:
# Remove columns we do not need: country codes, indicator names, years that have yet to happen,
# and the unnamed trailing column
data.drop(['Country Code', 'Indicator Name', '2020', '2025', '2030', '2035', '2040', '2045', '2050', '2055',
          '2060', '2065', '2070', '2075', '2080', '2085', '2090', '2095', '2100', 'Unnamed: 69'], axis=1, inplace=True)
data.head()

,Country Name,Indicator Code,1970,1971,1972,1973,1974,1975,1976,1977,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
0,Arab World,UIS.NERA.2,,,,,,,,,...,,,,,,,,,,
1,Arab World,UIS.NERA.2.F,,,,,,,,,...,,,,,,,,,,
2,Arab World,UIS.NERA.2.GPI,,,,,,,,,...,,,,,,,,,,
3,Arab World,UIS.NERA.2.M,,,,,,,,,...,,,,,,,,,,
4,Arab World,SE.PRM.TENR,54.822121,54.894138,56.209438,57.267109,57.991138,59.36554,60.999962,61.92268,...,84.011871,84.195961,85.211998,85.24514,86.101669,85.51194,85.320152,,,


In [3]:
# Create a column to separate out the group from the indicator
data['Indicator Group'] = [s.split('.')[0] for s in data['Indicator Code']] 

In [4]:
# Separate data frame by country, region, or socio-economic distinction
regions = ['Arab World', 'East Asia & Pacific', 'East Asia & Pacific (excluding high income)', 'Euro area', 'Europe & Central Asia', 
 'Europe & Central Asia (excluding high income)', 'European Union', 'Latin America & Caribbean', 'Latin America & Caribbean (excluding high income)', 
  'Middle East & North Africa', 'Middle East & North Africa (excluding high income)', 'Middle income', 'North America', 'South Asia', 
           'Sub-Saharan Africa', 'Sub-Saharan Africa (excluding high income)', 'OECD members', 'World']

income_levels = ['Heavily indebted poor countries (HIPC)', 'High income', 'Least developed countries: UN classification', 'Low & middle income', 
 'Low income', 'Lower middle income', 'Upper middle income']

reg_df = data[data['Country Name'].isin(regions)]

inc_df = data[data['Country Name'].isin(income_levels)]

cntry_df = data[~data['Country Name'].isin(regions)]
cntry_df = cntry_df[~cntry_df['Country Name'].isin(income_levels)]

# Remove initial data frame from working memory  
del data

cntry_df.head()

,Country Name,Indicator Code,1970,1971,1972,1973,1974,1975,1976,1977,...,2009,2010,2011,2012,2013,2014,2015,2016,2017,Indicator Group
91625,Afghanistan,UIS.NERA.2,,,,,7.05911,,,,...,,,,,47.43679,50.627232,,,,UIS
91626,Afghanistan,UIS.NERA.2.F,,,,,2.53138,,,,...,,,,,34.073261,37.641541,,,,UIS
91627,Afghanistan,UIS.NERA.2.GPI,,,,,0.22154,,,,...,,,,,0.56706,0.59837,,,,UIS
91628,Afghanistan,UIS.NERA.2.M,,,,,11.42652,,,,...,,,,,60.087059,62.906952,,,,UIS
91629,Afghanistan,SE.PRM.TENR,,,,,,,,,...,,,,,,,,,,SE


### Reindexing

Now that we have the data frame with just the countries of the world for years that have actually occurred, it is time to get the indicators set as the columns, grouped by their indicator group. 
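The reshaping can be sketched on a toy frame with two countries, two indicators, and two years. The values are made up; only the sequence of `set_index`, `transpose`, and `stack` mirrors what this notebook does.

```python
import pandas as pd

# Toy long-format frame: one row per (country, indicator), one column per year
toy = pd.DataFrame({
    'Country Name': ['Afghanistan', 'Afghanistan', 'Albania', 'Albania'],
    'Indicator Group': ['SE', 'UIS', 'SE', 'UIS'],
    'Indicator Code': ['SE.PRM.TENR', 'UIS.NERA.2', 'SE.PRM.TENR', 'UIS.NERA.2'],
    '1970': [1.0, 2.0, 3.0, 4.0],
    '1971': [5.0, 6.0, 7.0, 8.0],
})

wide = toy.set_index(['Country Name', 'Indicator Group', 'Indicator Code'])
# After transposing, rows are years and columns are (country, group, code)
wide = wide.transpose()
# Moving the country level into the rows leaves (group, code) as the columns
tidy = wide.stack('Country Name')
print(tidy)
```

Each row of `tidy` is now one (year, country) observation, and all indicators of a group can be selected at once with `tidy['SE']`.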

In [5]:
# Set the desired columns as indices
cntry_df.set_index(['Country Name', 'Indicator Group', 'Indicator Code'], inplace=True)
# Transpose to get the row index to be the year
df_t = cntry_df.transpose()
# Stack the Country Name column as a secondary index for the rows
df_t = df_t.stack('Country Name')
df_t.head()

,Indicator Group,BAR,BAR,BAR,BAR,BAR,BAR,BAR,BAR,BAR,BAR,...,UIS,UIS,UIS,UIS,UIS,UIS,UIS,UIS,XGDP,XGDP
,Indicator Code,BAR.NOED.1519.FE.ZS,BAR.NOED.1519.ZS,BAR.NOED.15UP.FE.ZS,BAR.NOED.15UP.ZS,BAR.NOED.2024.FE.ZS,BAR.NOED.2024.ZS,BAR.NOED.2529.FE.ZS,BAR.NOED.2529.ZS,BAR.NOED.25UP.FE.ZS,BAR.NOED.25UP.ZS,...,UIS.XUNIT.US.4.FSGOV,UIS.XUNIT.US.56.FSGOV,UIS.XUNIT.USCONST.1.FSGOV,UIS.XUNIT.USCONST.2.FSGOV,UIS.XUNIT.USCONST.23.FSGOV,UIS.XUNIT.USCONST.3.FSGOV,UIS.XUNIT.USCONST.4.FSGOV,UIS.XUNIT.USCONST.56.FSGOV,XGDP.23.FSGOV.FDINSTADM.FFD,XGDP.56.FSGOV.FDINSTADM.FFD
,Country Name,,,,,,,,,,,,,,,,,,,,,
1970,Afghanistan,91.44,77.08,97.21,88.81,94.8,78.4,98.6,91.09,99.25,94.22,...,,,,,,,,,,
1970,Albania,26.56,21.7,41.88,37.92,28.2,28.38,31.77,28.91,48.28,43.8,...,,,,,,,,,,
1970,Algeria,69.7,52.9,87.39,73.64,69.7,52.9,91.5,77.3,95.9,84.4,...,,,,,,,,,,
1970,American Samoa,,,,,,,,,,,...,,,,,,,,,,
1970,Andorra,,,,,,,,,,,...,,,,,,,,,,


## How much sparsity is there?

Below I will look into what percentage of each indicator group contains null values to determine next steps in finding feasible indicators with which to work.
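One compact way to get the same per-group figure is to average a null mask over the columns of each group. The sketch below uses a made-up frame with two groups; the column names are placeholders.

```python
import numpy as np
import pandas as pd

# Toy frame with (group, code) columns; values are made up
cols = pd.MultiIndex.from_tuples(
    [('SE', 'SE.A'), ('SE', 'SE.B'), ('UIS', 'UIS.A')],
    names=['Indicator Group', 'Indicator Code'],
)
toy = pd.DataFrame(
    [[1.0, np.nan, np.nan],
     [2.0, np.nan, np.nan],
     [np.nan, 3.0, np.nan],
     [4.0, 5.0, 6.0]],
    columns=cols,
)

# Fraction of nulls per column, then averaged within each group.
# Every column has the same number of rows, so this equals the
# group's overall null fraction.
null_share = toy.isna().mean().groupby(level='Indicator Group').mean()
print(null_share.sort_values(ascending=False))
```

The result is a Series indexed by group, so it can be sorted or thresholded directly instead of being read off printed lines.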

In [6]:
# Create a list of indicator groups and iterate through that list to determine the percent of null values for that group
cols = df_t.columns
lst = [e[0] for e in cols]
col_lst = list(set(lst))

# Create a dictionary with the group as the key and the percent sparsity as the value in a list for the key
group_dict = {}

for group in col_lst:
    na_pct = np.sum(df_t[group].isna().sum()) / df_t[group].size
    group_dict[group] = []
    group_dict[group].append(na_pct)
    print('{} Null Percentage: {:2f}'.format(group, na_pct))

NY Null Percentage: 0.375971
BAR Null Percentage: 0.872629
PRJ Null Percentage: 0.967372
OECD Null Percentage: 0.961929
LO Null Percentage: 0.994228
XGDP Null Percentage: 0.882260
UIS Null Percentage: 0.869046
SE Null Percentage: 0.680125
HH Null Percentage: 0.988921
SH Null Percentage: 0.405946
SP Null Percentage: 0.466671
IT Null Percentage: 0.645111
SL Null Percentage: 0.801463
SABER Null Percentage: 0.998478


## Context 
That was a lot of missing data. I suspected there was a reason for so many null values, so I investigated the programs that collected the data. Several of them collected data that either did not interest me or covered a scope that did not match the scope of my project.

### Next Steps
The research I did is not exhaustive, but it provides a good first look and helps me drop features that do not interest me. I want to work with data that has 10 or more years of history and that could describe most countries (not just one or two regions or the set of OECD nations). This exploration also showed where I could combine features or remove redundant ones. I will now work through the next steps indicated in the last column of the above table.
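These two criteria could also be checked programmatically. The sketch below is only an illustration on made-up data: the indicator names are placeholders, and the "more than half of countries" threshold is an assumption standing in for "most countries".

```python
import numpy as np
import pandas as pd

# Toy frame: rows are (year, country), columns are indicator codes
years = [str(y) for y in range(2000, 2012)]
countries = ['A', 'B', 'C', 'D']
index = pd.MultiIndex.from_product([years, countries], names=['Year', 'Country Name'])
toy = pd.DataFrame({'IND.WIDE': 1.0, 'IND.NARROW': np.nan}, index=index)
# IND.NARROW is reported by a single country for three years only
toy.loc[[('2000', 'A'), ('2001', 'A'), ('2002', 'A')], 'IND.NARROW'] = 1.0

has_value = toy.notna()
# Number of years in which at least one country reports the indicator
years_covered = has_value.groupby(level='Year').any().sum()
# Share of countries that report the indicator at least once
country_share = has_value.groupby(level='Country Name').any().mean()

# Hypothetical thresholds: 10 or more years and more than half of countries
keep = (years_covered >= 10) & (country_share > 0.5)
print(keep[keep].index.tolist())
```

A check like this does not replace reading the documentation for each program, but it can flag indicators that clearly fall short on coverage.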

In [7]:
# Drop columns not of interest
df_t.drop(['IT', 'SABER', 'PRJ', 'XGDP', 'OECD', 'SL', 'SH'], axis=1, inplace=True)

In [8]:
# Drop Learning Outcomes (LO) columns that are not the EGRA literacy indicators
df_t.drop([e for e in df_t['LO'] if not e.startswith('LO.EGRA')], axis=1, level=1, inplace=True)

In [9]:
# Create a data frame where null values are indicated by 0 and all other values by 1
# (note: the range test below also marks any value outside (-100, 100) as 0)
lo_sparsity = df_t['LO']
lo_sparsity = lo_sparsity.applymap(lambda x: 1 if -100<x<100 else 0)
lo_sparsity = lo_sparsity.T
lo_sparsity.head()

,1970,1970,1970,1970,1970,1970,1970,1970,1970,1970,...,2016,2016,2016,2017,2017,2017,2017,2017,2017,2017
Country Name,Afghanistan,Albania,Algeria,American Samoa,Andorra,Angola,Antigua and Barbuda,Argentina,Armenia,Aruba,...,"Yemen, Rep.",Zambia,Zimbabwe,Afghanistan,Bangladesh,Fiji,Liberia,Sierra Leone,Tajikistan,Ukraine
Indicator Code,,,,,,,,,,,,,,,,,,,,,
LO.EGRA.CLPM.AFA.2GRD,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
LO.EGRA.CLPM.AFA.3GRD,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
LO.EGRA.CLPM.AMH.2GRD,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
LO.EGRA.CLPM.AMH.3GRD,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
LO.EGRA.CLPM.BMN.2GRD,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [10]:
# Remove columns where there are no values and put the year in the index
lo_sparsity = lo_sparsity[['2008', '2009', '2010', '2011', '2012', '2013', '2014']]
lo_sparsity = lo_sparsity.stack(0)
lo_sparsity = lo_sparsity.unstack(0).stack()

In [11]:
# Drop countries with no values and create a dictionary to show which countries had how many values during this period
sp_dict = {}
for column in lo_sparsity.columns:
    if lo_sparsity[column].sum() == 0:
        lo_sparsity.drop(column, axis=1, inplace=True)
    else:
        sp_dict[column] = lo_sparsity[column].sum()
sp_dict

{'Egypt, Arab Rep.': 14,
 'Ethiopia': 96,
 'Ghana': 84,
 'Guyana': 30,
 'Indonesia': 7,
 'Jordan': 28,
 'Malawi': 54,
 'Mali': 36,
 'Nicaragua': 26,
 'Philippines': 33,
 'Rwanda': 29,
 'Senegal': 6,
 'Tanzania': 14,
 'West Bank and Gaza': 8,
 'Zambia': 51}

In [12]:
del lo_sparsity

Across more than 400 indicators and 7 years, only 15 countries have any values, ranging from 6 to 96 values each. These indicators do not meet the criteria I set for this analysis, so I will drop them from the data frame.

In [13]:
# Drop all LO columns as they are no longer of interest
df_t.drop('LO', axis=1, inplace=True)

In [14]:
# Drop the Africa Dataset, since we are interested in data for global comparison
df_t.drop([e for e in df_t['UIS'] if e.startswith('UIS.AFR')], axis=1, level=1, inplace=True)

In [15]:
# Create dictionaries of indicators to names and names to indicators for groups that may have common indicators
usb_dict = {k:v for k, v in indicator_dict.items() if k.startswith('UIS') or k.startswith('BAR') or k.startswith('SE')}
usb_dict = {k:v for k, v in usb_dict.items() if not k.startswith('UIS.AFR')}
reversed_usb = {v:k for k,v in usb_dict.items()}

In [16]:
# Create a dictionary of indicator codes of potentially similar indicators based on close matches of indicator names
u = [v for k, v in usb_dict.items() if k.startswith('UIS')]
s = [v for k, v in usb_dict.items() if k.startswith('SE')]
b = [v for k, v in usb_dict.items() if k.startswith('BAR')]
usb_list = [u, s, b]
match_dict = {}
for i in range(len(usb_list) - 1):
    for name in usb_list[i]:
        matches = difflib.get_close_matches(name, usb_list[i+1], cutoff=0.8)
        if len(usb_list) - i > 2:
            matches.extend(difflib.get_close_matches(name, usb_list[i+2], cutoff=0.8))
        for match in matches:
            if reversed_usb[match] not in match_dict.keys():
                match_dict[reversed_usb[match]] = reversed_usb[name]
len(match_dict)

159

In [17]:
# Create a data frame to compare the indicators that are potential matches by their names, the total countries polled over
# which years, and the mean value of the measurements taken by each

def matches_to_df(df, matches, indicator_dict):
    '''A function designed to take a data frame, a dictionary of close matches for column names, and a dictionary matching
    codes to long form names in order to output a data frame that compares the values of the matching columns.'''
    df_list = []
    
    for k, v in matches.items():
        code_a, name_a, code_b, name_b = k, indicator_dict[k], v, indicator_dict[v]
        # Look up each column under its own group prefix instead of assuming SE and UIS
        group_a, group_b = k.split('.')[0], v.split('.')[0]

        # Count, per year, the countries with a positive value for the first indicator
        index_1 = df.index[df[group_a, k] > 0].tolist()
        index_1 = [i[0] for i in index_1]
        index_1 = Counter(index_1)

        # First and last reporting year, mean countries per year, and mean value
        fyear_a, lyear_a = min(index_1.keys()), max(index_1.keys())
        cntry_a = np.mean(list(index_1.values()))
        mean_a = np.nanmean(df[group_a, k])

        # Repeat for the second indicator
        index_2 = df.index[df[group_b, v] > 0].tolist()
        index_2 = [i[0] for i in index_2]
        index_2 = Counter(index_2)

        fyear_b, lyear_b = min(index_2.keys()), max(index_2.keys())
        cntry_b = np.mean(list(index_2.values()))
        mean_b = np.nanmean(df[group_b, v])

        df_list.append([code_a, name_a, fyear_a, lyear_a, cntry_a, mean_a, code_b, name_b, fyear_b, lyear_b, cntry_b, mean_b])
        
    new_df = pd.DataFrame(df_list, columns=['Code1', 'Name1', 'Start_Year1', 'End_Year1', 'Countries1', 'Mean1', 
                                   'Code2', 'Name2', 'Start_Year2', 'End_Year2', 'Countries2', 'Mean2'])
    return new_df

matches_df = matches_to_df(df_t, match_dict, indicator_dict)
matches_df.head()

,Code1,Name1,Start_Year1,End_Year1,Countries1,Mean1,Code2,Name2,Start_Year2,End_Year2,Countries2,Mean2
0,SE.PRM.TENR,"Adjusted net enrolment rate, primary, both sex...",1999,2016,120.277778,89.974554,UIS.NERA.2,"Adjusted net enrolment rate, lower secondary, ...",1970,2015,52.282609,65.687601
1,SE.SEC.ENRR.LO,"Gross enrolment ratio, lower secondary, both s...",1981,2016,84.472222,85.556647,UIS.NERA.2,"Adjusted net enrolment rate, lower secondary, ...",1970,2015,52.282609,65.687601
2,SE.SEC.NENR,"Net enrolment rate, secondary, both sexes (%)",1999,2016,90.388889,69.635851,UIS.NERA.2,"Adjusted net enrolment rate, lower secondary, ...",1970,2015,52.282609,65.687601
3,SE.PRM.TENR.FE,"Adjusted net enrolment rate, primary, female (%)",1999,2016,105.833333,88.311146,UIS.NERA.2.F,"Adjusted net enrolment rate, lower secondary, ...",1970,2015,49.130435,66.528
4,SE.PRM.TENR.MA,"Adjusted net enrolment rate, primary, male (%)",1999,2016,105.833333,89.501506,UIS.NERA.2.F,"Adjusted net enrolment rate, lower secondary, ...",1970,2015,49.130435,66.528


### Knowing when to change approaches

After much trial and error with the above approach to algorithmically find similar indicators to combine, I have realized that whatever cutoff percentage I set, I will be left with a mix of good matches and bad matches. I cannot rely on Python to decide which indicators could be combined. Instead, I will review the documentation in the EdStatsSeries.csv file to find indicators of interest and formulate a fitting research question to investigate.
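A small illustration of the problem, using three indicator names that appear in the comparison table above. The cutoffs are arbitrary values chosen for the demonstration.

```python
import difflib

# Indicator names taken from the comparison table in this notebook
name = 'Adjusted net enrolment rate, primary, female (%)'
candidates = [
    'Adjusted net enrolment rate, primary, male (%)',
    'Net enrolment rate, secondary, both sexes (%)',
]

# Try a loose, a moderate, and a strict cutoff
for cutoff in (0.6, 0.8, 0.95):
    matches = difflib.get_close_matches(name, candidates, cutoff=cutoff)
    print(cutoff, matches)

# Similarity score between the female and male indicator names
score = difflib.SequenceMatcher(None, name, candidates[0]).ratio()
print(round(score, 3))
```

The female and male series differ by only two characters, so they stay "close" even at a strict cutoff, although they measure different populations and should not be merged. String similarity alone cannot make that distinction.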

### Question 1: Does increased expenditure on secondary education lead to increased graduation from secondary school?

**Indicators:** Expenditure on education as a percent of GDP, expenditure on secondary as a percentage of education expenditure, graduation rate (dropout rate in secondary and survival rate primary to secondary)

### Question 2: Does increased expenditure on secondary education lead to increases in GDP growth?

**Indicators:** Expenditure on education as % of GDP, expenditure on secondary, tertiary enrollment, expenditure on secondary as a percentage of education expenditure, secondary graduation rate, GDP


In [18]:
# Create a copy of the data frame with only the indicators of interest based on research questions
features = df_t[['UIS', 'SE', 'NY']].copy()
features = features.droplevel(level=0, axis=1)
features = features.filter(items=['UIS.XPUBP.2', 'UIS.XPUBP.3', 'SE.XPD.SECO.ZS', 'SE.XPD.TOTL.GD.ZS',
      'UIS.DR.2.GPV.T', 'UIS.SR.2.GPV.GLAST.CP.T', 'SE.TOT.ENRR', 'UIS.GER.1T6.GPI', 'NY.GDP.MKTP.PP.KD', 'NY.GDP.MKTP.KD'])
features.describe(include=[np.number])

Indicator Code,UIS.XPUBP.2,UIS.XPUBP.3,SE.XPD.SECO.ZS,SE.XPD.TOTL.GD.ZS,UIS.DR.2.GPV.T,UIS.SR.2.GPV.GLAST.CP.T,SE.TOT.ENRR,UIS.GER.1T6.GPI,NY.GDP.MKTP.PP.KD,NY.GDP.MKTP.KD
count,893.0,912.0,2787.0,3538.0,3018.0,3062.0,4544.0,3524.0,4949.0,7765.0
mean,19.455717,17.37262,34.440755,1058.906,22.565076,77.759177,66.578365,0.946994,395989900000.0,254896000000.0
std,5.665325,6.879592,10.459614,62722.94,18.303711,18.369106,20.547472,0.171068,1377841000000.0,1026936000000.0
min,4.80754,0.72513,0.0,0.0,0.00732,2.04531,4.45282,0.1598,21333030.0,21441970.0
25%,15.66346,12.251185,27.234015,3.118185,7.03562,66.341652,54.923434,0.892458,9412085000.0,3445324000.0
50%,18.93944,18.225176,34.873241,4.29407,18.82256,81.505028,70.051586,0.995405,39187240000.0,15129450000.0
75%,22.589491,22.33394,41.390339,5.454852,33.886797,93.319462,79.084873,1.046823,245910200000.0,117824900000.0
max,42.08086,48.063122,79.395638,3730834.0,97.954689,100.0,119.382507,1.72888,19852010000000.0,16887540000000.0


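The indicators in this data frame are far from equally dense: non-null counts range from 893 to 7,765. SE.XPD.TOTL.GD.ZS also has a maximum of 3,730,834, which is implausible for a percentage of GDP and worth checking before modeling. I will start by trying to impute values from similar columns. SE.XPD.SECO.ZS and the two UIS.XPUBP columns all measure the percentage of education spending dedicated to secondary education.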

In [19]:
# Create a dataframe to determine where there are values and where there are nulls in these similar indicators
mask = features.filter(['UIS.XPUBP.2', 'UIS.XPUBP.3', 'SE.XPD.SECO.ZS']).applymap(lambda x: 1 if 0<x<100 else 0)
mask['values'] = mask.sum(axis=1)
mask.head()

,Indicator Code,UIS.XPUBP.2,UIS.XPUBP.3,SE.XPD.SECO.ZS,values
,Country Name,,,,
1970,Afghanistan,0,0,0,0
1970,Albania,0,0,0,0
1970,Algeria,0,0,0,0
1970,American Samoa,0,0,0,0
1970,Andorra,0,0,0,0


Now that we have a data frame showing where there are values and where there are nulls, let's find where the SE column has nulls but the other columns have values. We can use the indices to impute values in the features data frame.
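The same idea can be sketched without an explicit mask: compute a fallback from the related columns and fill only the gaps in the primary column. The values below are made up. Whether a mean is the right way to combine the fallback columns depends on how the indicators are defined, so treat this as a sketch of the mechanics only.

```python
import numpy as np
import pandas as pd

# Toy frame reusing the column names from this notebook; values are made up
toy = pd.DataFrame({
    'SE.XPD.SECO.ZS': [30.0, np.nan, np.nan, np.nan],
    'UIS.XPUBP.2': [18.0, 20.0, np.nan, np.nan],
    'UIS.XPUBP.3': [12.0, 16.0, 14.0, np.nan],
})

# Row-wise mean of the fallback columns; NaNs are skipped, and a row
# with no fallback values stays NaN
fallback = toy[['UIS.XPUBP.2', 'UIS.XPUBP.3']].mean(axis=1)

# Fill only where the primary column is missing; existing values are kept
toy['SE.XPD.SECO.ZS'] = toy['SE.XPD.SECO.ZS'].fillna(fallback)
print(toy['SE.XPD.SECO.ZS'].tolist())
```

Because `fillna` aligns on the index, no list of row labels has to be built by hand.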

In [20]:
# Create a list of indices where the predominant indicator is missing values but its counterparts have values
temp = mask[mask['values']>0].copy()
temp.drop(temp[temp['SE.XPD.SECO.ZS']==1].index, inplace=True)
indexes = temp.index.tolist()

In [21]:
# Using the indices we found, impute the row-wise mean of the available UIS values into the SE column.
# Assigning through .loc avoids chained indexing, which may fail to modify features in place.
uis_mean = features.loc[indexes, ['UIS.XPUBP.2', 'UIS.XPUBP.3']].mean(axis=1)
features.loc[indexes, 'SE.XPD.SECO.ZS'] = uis_mean

In [22]:
# Delete the temporary data frames we used and drop columns we no longer need
del mask, temp
features.drop(['UIS.XPUBP.2', 'UIS.XPUBP.3'], axis=1, inplace=True)
features.describe()

Indicator Code,SE.XPD.SECO.ZS,SE.XPD.TOTL.GD.ZS,UIS.DR.2.GPV.T,UIS.SR.2.GPV.GLAST.CP.T,SE.TOT.ENRR,UIS.GER.1T6.GPI,NY.GDP.MKTP.PP.KD,NY.GDP.MKTP.KD
count,2826.0,3538.0,3018.0,3062.0,4544.0,3524.0,4949.0,7765.0
mean,34.237027,1058.906,22.565076,77.759177,66.578365,0.946994,395989900000.0,254896000000.0
std,10.550818,62722.94,18.303711,18.369106,20.547472,0.171068,1377841000000.0,1026936000000.0
min,0.0,0.0,0.00732,2.04531,4.45282,0.1598,21333030.0,21441970.0
25%,26.845243,3.118185,7.03562,66.341652,54.923434,0.892458,9412085000.0,3445324000.0
50%,34.695194,4.29407,18.82256,81.505028,70.051586,0.995405,39187240000.0,15129450000.0
75%,41.273087,5.454852,33.886797,93.319462,79.084873,1.046823,245910200000.0,117824900000.0
max,79.395638,3730834.0,97.954689,100.0,119.382507,1.72888,19852010000000.0,16887540000000.0


Now that we have imputed values from similar columns, we can work to interpolate values to flesh out our features a bit more. This assumes an essentially linear relationship between points in a given feature where there are missing values between those points. 
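One detail worth knowing about `interpolate`: with its default forward direction it also fills nulls after the last observation by carrying that value forward, which goes beyond interpolating between two points. The toy series below is made up, and `limit_area='inside'` is shown as an option, not as what this notebook uses.

```python
import numpy as np
import pandas as pd

# Toy yearly series with one interior gap and one trailing gap
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan],
              index=['2000', '2001', '2002', '2003', '2004'])

# Default direction is forward: the interior gap is interpolated and
# the trailing gap is filled by carrying the last observation forward
default_fill = s.interpolate(method='linear', limit=12)

# limit_area='inside' restricts filling to gaps bounded by observations
inside_only = s.interpolate(method='linear', limit=12, limit_area='inside')

print(default_fill.tolist())
print(inside_only.tolist())
```

If only true between-point interpolation is wanted, restricting the fill to inside gaps keeps trailing years null, and the later `dropna` step then removes those rows.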

In [23]:
# Rearrange the data frame so columns are country, indicator; rows are year
features = features.unstack(0).transpose().unstack(0).copy()
# Linearly interpolate null values down each column, filling at most 12 consecutive nulls
features.interpolate(method='linear', axis=0, limit=12, inplace=True)
# Nest country back under year in the row indices
features = features.stack('Country Name')
# Drop rows with nulls
features.dropna(how='any', axis=0, inplace=True)

features.describe()

Indicator Code,NY.GDP.MKTP.KD,NY.GDP.MKTP.PP.KD,SE.TOT.ENRR,SE.XPD.SECO.ZS,SE.XPD.TOTL.GD.ZS,UIS.DR.2.GPV.T,UIS.GER.1T6.GPI,UIS.SR.2.GPV.GLAST.CP.T
count,2725.0,2725.0,2725.0,2725.0,2725.0,2725.0,2725.0,2725.0
mean,258967500000.0,362613800000.0,71.545079,33.07972,2742.72,17.326305,0.978838,82.688752
std,696519800000.0,959400000000.0,18.650922,10.677381,101054.9,15.656115,0.134733,15.669643
min,29483210.0,29333430.0,13.97193,0.0,0.0,0.01391,0.42301,3.85428
25%,5527789000.0,12344850000.0,61.504879,25.414009,3.19778,4.361155,0.93085,74.439796
50%,20541670000.0,41217900000.0,73.412727,33.647171,4.18628,13.617765,1.01257,86.382233
75%,169793600000.0,260454800000.0,84.971573,40.363781,5.35554,25.5602,1.0594,95.68254
max,6682403000000.0,13957940000000.0,119.382507,71.518257,3730834.0,96.145721,1.72888,100.0
