# World Bank Ed Stats Model Building

The World Bank has collected and encoded this data since 1970 as indicators of educational performance, attainment, and expenditure. The data covers the countries of the world and also includes aggregates by region and socio-economic group. The dataset is sparse: the majority of its values are null. As I prepare the dataset for modeling, I have three objectives:

### Arrange the Data 
### Context and Visualizations
### Null Handling and Feature Selection

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Open and read the CSV file into a DataFrame
data = pd.read_csv('EdStatsData.csv')

# Save a dictionary matching each indicator code to its indicator name.
# Zipping the two columns row by row keeps every code paired with its own name,
# even if two codes happen to share a name.
ind_codes = data['Indicator Code'].unique().tolist()
ind_names = data['Indicator Name'].unique().tolist()
indicator_dict = dict(zip(data['Indicator Code'], data['Indicator Name']))
#data.dropna(axis=1, thresh=75000, inplace=True)

## Arrange the Data
First, we will remove the columns we do not need: country codes, indicator names, and future years. There are far too many indicators to check the missing values of each one, so we will isolate the indicator group, which is the prefix of the indicator code. Then we will separate the data frame into countries, regions, and socio-economic levels. Our immediate interest is the set of countries.

In [2]:
# Remove country codes, indicator names, years that have yet to happen, and the unnamed trailing column
data.drop(['Country Code', 'Indicator Name', '2020', '2025', '2030', '2035', '2040', '2045', '2050', '2055',
          '2060', '2065', '2070', '2075', '2080', '2085', '2090', '2095', '2100', 'Unnamed: 69'], axis=1, inplace=True)
data.head()

Unnamed: 0,Country Name,Indicator Code,1970,1971,1972,1973,1974,1975,1976,1977,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
0,Arab World,UIS.NERA.2,,,,,,,,,...,,,,,,,,,,
1,Arab World,UIS.NERA.2.F,,,,,,,,,...,,,,,,,,,,
2,Arab World,UIS.NERA.2.GPI,,,,,,,,,...,,,,,,,,,,
3,Arab World,UIS.NERA.2.M,,,,,,,,,...,,,,,,,,,,
4,Arab World,SE.PRM.TENR,54.822121,54.894138,56.209438,57.267109,57.991138,59.36554,60.999962,61.92268,...,84.011871,84.195961,85.211998,85.24514,86.101669,85.51194,85.320152,,,


In [3]:
# Create a column to separate out the group from the indicator
data['Indicator Group'] = [s.split('.')[0] for s in data['Indicator Code']] 
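
The indicator group is simply the part of the code before the first dot. As a minimal sketch, the same split can be written with pandas string methods instead of a list comprehension. The three codes below are copied from the tables in this notebook; the variable names are only for illustration.

```python
import pandas as pd

# Example codes taken from the tables in this notebook
codes = pd.Series(['UIS.NERA.2', 'SE.PRM.TENR', 'BAR.NOED.1519.FE.ZS'])

# Split once on the first dot and keep the piece before it
groups = codes.str.split('.', n=1).str[0]
print(groups.tolist())
```

Either form should give the same column; the vectorized version just avoids a Python-level loop over the full frame.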

In [4]:
# Separate data frame by country, region, or socio-economic distinction
regions = ['Arab World', 'East Asia & Pacific', 'East Asia & Pacific (excluding high income)',
           'Euro area', 'Europe & Central Asia', 'Europe & Central Asia (excluding high income)',
           'European Union', 'Latin America & Caribbean',
           'Latin America & Caribbean (excluding high income)', 'Middle East & North Africa',
           'Middle East & North Africa (excluding high income)', 'North America', 'South Asia',
           'Sub-Saharan Africa', 'Sub-Saharan Africa (excluding high income)', 'OECD members', 'World']

income_levels = ['Heavily indebted poor countries (HIPC)', 'High income',
                 'Least developed countries: UN classification', 'Low & middle income',
                 'Low income', 'Lower middle income', 'Middle income', 'Upper middle income']

reg_df = data[data['Country Name'].isin(regions)]
inc_df = data[data['Country Name'].isin(income_levels)]

# Every row that is neither a region nor an income level belongs to a country
cntry_df = data[~data['Country Name'].isin(regions + income_levels)].copy()

# Remove initial data frame from working memory  
del data

cntry_df.head()

Unnamed: 0,Country Name,Indicator Code,1970,1971,1972,1973,1974,1975,1976,1977,...,2009,2010,2011,2012,2013,2014,2015,2016,2017,Indicator Group
91625,Afghanistan,UIS.NERA.2,,,,,7.05911,,,,...,,,,,47.43679,50.627232,,,,UIS
91626,Afghanistan,UIS.NERA.2.F,,,,,2.53138,,,,...,,,,,34.073261,37.641541,,,,UIS
91627,Afghanistan,UIS.NERA.2.GPI,,,,,0.22154,,,,...,,,,,0.56706,0.59837,,,,UIS
91628,Afghanistan,UIS.NERA.2.M,,,,,11.42652,,,,...,,,,,60.087059,62.906952,,,,UIS
91629,Afghanistan,SE.PRM.TENR,,,,,,,,,...,,,,,,,,,,SE
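
The `regions` and `income_levels` lists are typed by hand, so a misspelled name would silently leave an aggregate in the country frame. A quick sanity check, sketched here on toy data, is to look for listed names that never occur in the `Country Name` column. The shortened lists and the `names` series are hypothetical stand-ins, not the real data.

```python
import pandas as pd

# Toy stand-in for the 'Country Name' column; the real data has many more names
names = pd.Series(['Arab World', 'World', 'High income', 'Afghanistan', 'Albania'])

# Shortened, hypothetical versions of the two aggregate lists
toy_regions = ['Arab World', 'World', 'Euro area']
toy_income_levels = ['High income', 'Low income']

aggregates = set(toy_regions) | set(toy_income_levels)

# Names listed as aggregates that never occur in the data (possible typos)
unmatched = sorted(aggregates - set(names))

# Everything that is not an aggregate is treated as a country
countries = names[~names.isin(aggregates)].tolist()

print(unmatched)
print(countries)
```

On the real frame, an empty `unmatched` list would suggest that every listed aggregate was found and removed.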


### Reindexing

Now that the data frame holds only countries, and only years that have already occurred, the next step is to turn the indicators into columns, grouped by their indicator group. The rows should be indexed by country, then year.
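
To make the target layout concrete, here is a small sketch of the same kind of reshape on a toy table. It uses `melt` followed by `unstack`, which is one possible route and not necessarily the one taken in the cells below. The country names, indicator codes, and values are made up, and the indicator group level is left out for brevity.

```python
import numpy as np
import pandas as pd

# Toy wide table in the same layout as the source file (values are made up)
wide = pd.DataFrame({
    'Country Name': ['A', 'A', 'B', 'B'],
    'Indicator Code': ['SE.X', 'SP.Y', 'SE.X', 'SP.Y'],
    '1970': [1.0, 2.0, 3.0, np.nan],
    '1971': [1.5, np.nan, 3.5, 4.5],
})

# Long format: one row per (country, indicator, year)
long = wide.melt(id_vars=['Country Name', 'Indicator Code'],
                 var_name='Year', value_name='Value')

# One row per (country, year), one column per indicator
tidy = (long.set_index(['Country Name', 'Year', 'Indicator Code'])['Value']
            .unstack('Indicator Code'))
print(tidy)
```

The result has a `(Country Name, Year)` row index and one column per indicator code, which is the shape the modeling steps need.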

In [5]:
# Set the desired columns as indices
cntry_df.set_index(['Country Name', 'Indicator Group', 'Indicator Code'], inplace=True)

In [6]:
# Transpose to get the row index to be the year
df_t = cntry_df.transpose()

In [7]:
# Stack the Country Name column as a secondary index for the rows
df_t = df_t.stack('Country Name')
df_t.head()

Unnamed: 0_level_0,Indicator Group,BAR,BAR,BAR,BAR,BAR,BAR,BAR,BAR,BAR,BAR,...,UIS,UIS,UIS,UIS,UIS,UIS,UIS,UIS,XGDP,XGDP
Unnamed: 0_level_1,Indicator Code,BAR.NOED.1519.FE.ZS,BAR.NOED.1519.ZS,BAR.NOED.15UP.FE.ZS,BAR.NOED.15UP.ZS,BAR.NOED.2024.FE.ZS,BAR.NOED.2024.ZS,BAR.NOED.2529.FE.ZS,BAR.NOED.2529.ZS,BAR.NOED.25UP.FE.ZS,BAR.NOED.25UP.ZS,...,UIS.XUNIT.US.4.FSGOV,UIS.XUNIT.US.56.FSGOV,UIS.XUNIT.USCONST.1.FSGOV,UIS.XUNIT.USCONST.2.FSGOV,UIS.XUNIT.USCONST.23.FSGOV,UIS.XUNIT.USCONST.3.FSGOV,UIS.XUNIT.USCONST.4.FSGOV,UIS.XUNIT.USCONST.56.FSGOV,XGDP.23.FSGOV.FDINSTADM.FFD,XGDP.56.FSGOV.FDINSTADM.FFD
Unnamed: 0_level_2,Country Name,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2
1970,Afghanistan,91.44,77.08,97.21,88.81,94.8,78.4,98.6,91.09,99.25,94.22,...,,,,,,,,,,
1970,Albania,26.56,21.7,41.88,37.92,28.2,28.38,31.77,28.91,48.28,43.8,...,,,,,,,,,,
1970,Algeria,69.7,52.9,87.39,73.64,69.7,52.9,91.5,77.3,95.9,84.4,...,,,,,,,,,,
1970,American Samoa,,,,,,,,,,,...,,,,,,,,,,
1970,Andorra,,,,,,,,,,,...,,,,,,,,,,


In [8]:
# Unstack the year index and stack it back, nested within the country name
df_t = df_t.unstack(0).stack()
df_t.head()

Unnamed: 0_level_0,Indicator Group,BAR,BAR,BAR,BAR,BAR,BAR,BAR,BAR,BAR,BAR,...,UIS,UIS,UIS,UIS,UIS,UIS,UIS,UIS,XGDP,XGDP
Unnamed: 0_level_1,Indicator Code,BAR.NOED.1519.FE.ZS,BAR.NOED.1519.ZS,BAR.NOED.15UP.FE.ZS,BAR.NOED.15UP.ZS,BAR.NOED.2024.FE.ZS,BAR.NOED.2024.ZS,BAR.NOED.2529.FE.ZS,BAR.NOED.2529.ZS,BAR.NOED.25UP.FE.ZS,BAR.NOED.25UP.ZS,...,UIS.XUNIT.US.4.FSGOV,UIS.XUNIT.US.56.FSGOV,UIS.XUNIT.USCONST.1.FSGOV,UIS.XUNIT.USCONST.2.FSGOV,UIS.XUNIT.USCONST.23.FSGOV,UIS.XUNIT.USCONST.3.FSGOV,UIS.XUNIT.USCONST.4.FSGOV,UIS.XUNIT.USCONST.56.FSGOV,XGDP.23.FSGOV.FDINSTADM.FFD,XGDP.56.FSGOV.FDINSTADM.FFD
Country Name,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2
Afghanistan,1970,91.44,77.08,97.21,88.81,94.8,78.4,98.6,91.09,99.25,94.22,...,,,,,,,,,,
Afghanistan,1971,,,,,,,,,,,...,,,,,,,,,,
Afghanistan,1972,,,,,,,,,,,...,,,,,,,,,,
Afghanistan,1973,,,,,,,,,,,...,,,,,,,,,,
Afghanistan,1974,,,,,,,,,,,...,,,,,,,,,,


In [9]:
# List the indicator groups and compute the share of null values within each group
col_lst = sorted(set(df_t.columns.get_level_values(0)))

for group in col_lst:
    na_pct = df_t[group].isna().to_numpy().mean()
    print('{} Null Percentage: {:.2%}'.format(group, na_pct))

BAR Null Percentage: 87.26%
HH Null Percentage: 98.89%
IT Null Percentage: 64.51%
LO Null Percentage: 99.42%
NY Null Percentage: 37.60%
OECD Null Percentage: 96.19%
PRJ Null Percentage: 96.74%
SABER Null Percentage: 99.85%
SE Null Percentage: 68.01%
SH Null Percentage: 40.59%
SL Null Percentage: 80.15%
SP Null Percentage: 46.67%
UIS Null Percentage: 86.90%
XGDP Null Percentage: 88.23%
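
The group-level figures above hide how missingness changes over time. A possible next step, sketched here on a toy frame shaped like `df_t`, is to compute the null share per year and indicator group. The toy index labels and values are made up, and the `Year` level name is an assumption, since the year level in `df_t` is unnamed.

```python
import numpy as np
import pandas as pd

# Toy frame shaped like df_t: (country, year) rows, (group, code) columns
rows = pd.MultiIndex.from_product([['A', 'B'], ['1970', '1971']],
                                  names=['Country Name', 'Year'])
cols = pd.MultiIndex.from_tuples([('SE', 'SE.X'), ('SE', 'SE.Z'), ('SP', 'SP.Y')],
                                 names=['Indicator Group', 'Indicator Code'])
toy = pd.DataFrame([[1.0, np.nan, 2.0],
                    [np.nan, np.nan, 2.5],
                    [3.0, 1.0, np.nan],
                    [np.nan, np.nan, 4.5]], index=rows, columns=cols)

# Share of countries with a null value, per year and indicator
null_by_year = toy.isna().groupby(level='Year').mean()

# Average those shares over the indicators within each group
null_by_group = null_by_year.T.groupby(level='Indicator Group').mean().T
print(null_by_group)
```

A year-by-group matrix like `null_by_group` could then be passed to `sns.heatmap` to get a missingness heatmap.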


In [None]:
# TODO: predict sparsity in the most recent three to five years
# TODO: research the programs
# TODO: heatmap of year and indicators by region (white = value present, black = missing)
# TODO: time-series heatmap
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA 

In [None]:
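# PCA cannot be fit on missing values, so nulls must be handled first.
# Mean imputation is only a placeholder assumption here, not a final choice.
X = df_t.dropna(axis=1, how='all')
X = StandardScaler().fit_transform(X.fillna(X.mean()))

sklearn_pca = PCA(n_components=10)
Y_sklearn = sklearn_pca.fit_transform(X)

print(
    'The fraction of total variance in the dataset explained by each',
    'component from Sklearn PCA.\n',
    sklearn_pca.explained_variance_ratio_)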
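
Since PCA needs complete input and most indicators are mostly null, some columns will probably have to be dropped before any imputation makes sense. The sketch below shows one simple way to filter columns by null fraction. The toy frame and the `max_null_frac` cutoff are hypothetical, and the right threshold for the real data still has to be worked out.

```python
import numpy as np
import pandas as pd

# Toy indicator table with made-up values; column 'c' is almost entirely null
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [1.0, np.nan, 3.0, 4.0],
    'c': [np.nan, np.nan, np.nan, 4.0],
})

max_null_frac = 0.5  # hypothetical cutoff, to be tuned against the real data

# Fraction of nulls per column, then keep only columns at or below the cutoff
null_frac = toy.isna().mean()
kept = toy.loc[:, null_frac <= max_null_frac]
print(kept.columns.tolist())
```

The same boolean mask works on a frame with MultiIndex columns such as `df_t`, because `isna().mean()` returns one value per column.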