**The World Bank : Education Data**

In this notebook, we analyze international education data collected by the World Bank. The dataset combines key education statistics from a variety of sources to give a global view of literacy, spending, and access. We will answer questions such as:
* Of total government spending, what percentage is spent on education?
* How do Indonesia and the other countries in Southeast Asia compare?
* What do other interesting indicators in this dataset, such as GDP per capita, tell us?


First, import the BigQuery package and create a client object.

In [2]:
from google.cloud import bigquery

In [3]:
client = bigquery.Client()

Using Kaggle's public dataset BigQuery integration.


Connect to the *world_bank_intl_education* dataset.

In [4]:
dataset_ref = client.dataset('world_bank_intl_education', project='bigquery-public-data')

In [5]:
dataset = client.get_dataset(dataset_ref)

Every dataset is just a collection of tables. Let's see which tables this dataset contains.

In [6]:
tables = list(client.list_tables(dataset))
for table in tables:
    print(table.table_id)

country_series_definitions
country_summary
international_education
series_summary


Explore the contents of each table in more detail to understand them and to find the table that contains the data we need.

In [7]:
table_ref = dataset_ref.table('country_series_definitions')
table = client.get_table(table_ref)
client.list_rows(table, max_results=5).to_dataframe()



Unnamed: 0,country_code,series_code,description
0,ALB,SP.POP.TOTL,"Data sources : Institute of Statistics, Eurostat"
1,AUS,SP.POP.TOTL,Data sources : Australian Bureau of Statistics
2,AUS,SP.POP.GROW,Data sources: Australian Bureau of Statistics
3,AZE,SP.POP.TOTL,"Data sources : Eurostat, State Statistical Com..."
4,AZE,SP.POP.GROW,"Data sources: Eurostat, State Statistical Comm..."


In [8]:
table_ref = dataset_ref.table('country_summary')
table = client.get_table(table_ref)
client.list_rows(table, max_results=5).to_dataframe()



Unnamed: 0,country_code,short_name,table_name,long_name,two_alpha_code,currency_unit,special_notes,region,income_group,wb_two_code,...,government_accounting_concept,imf_data_dissemination_standard,latest_population_census,latest_household_survey,source_of_most_recent_income_and_expenditure_data,vital_registration_complete,latest_agricultural_census,latest_industrial_data,latest_trade_data,latest_water_withdrawal_data
0,ARB,Arab World,Arab World,Arab World,1A,,Arab World aggregate. Arab World is composed o...,,,1A,...,,,,,,,,,,
1,EAP,East Asia & Pacific (developing only),East Asia & Pacific,East Asia & Pacific (developing only),4E,,East Asia and Pacific regional aggregate (does...,,,4E,...,,,,,,,,,,
2,EAS,East Asia & Pacific (all income levels),East Asia & Pacific (all income levels),East Asia & Pacific (all income levels),Z4,,East Asia and Pacific regional aggregate (incl...,,,Z4,...,,,,,,,,,,
3,ECA,Europe & Central Asia (developing only),Europe & Central Asia,Europe & Central Asia (developing only),7E,,Europe and Central Asia regional aggregate (do...,,,7E,...,,,,,,,,,,
4,ECS,Europe & Central Asia (all income levels),Europe & Central Asia (all income levels),Europe & Central Asia (all income levels),Z7,,Europe and Central Asia regional aggregate (in...,,,Z7,...,,,,,,,,,,


In [23]:
table_ref = dataset_ref.table('international_education')
table = client.get_table(table_ref)
client.list_rows(table, max_results=10).to_dataframe()



Unnamed: 0,country_name,country_code,indicator_name,indicator_code,value,year
0,Chad,TCD,"Enrolment in lower secondary education, both s...",UIS.E.2,135079.0,2002
1,Chad,TCD,"Enrolment in lower secondary education, both s...",UIS.E.2,92857.0,1999
2,Chad,TCD,"Enrolment in lower secondary education, both s...",UIS.E.2,181979.0,2005
3,Chad,TCD,"Enrolment in upper secondary education, both s...",UIS.E.3,101428.0,2008
4,Chad,TCD,"Repeaters in primary education, all grades, bo...",UIS.R.1,49578.0,1971
5,Chad,TCD,"Repeaters in primary education, all grades, bo...",UIS.R.1,79342.0,1977
6,Chad,TCD,"Repeaters in primary education, all grades, bo...",UIS.R.1,74299.0,1976
7,Chad,TCD,"Repeaters in primary education, all grades, bo...",UIS.R.1,386092.0,2010
8,Chad,TCD,"Repeaters in primary education, all grades, bo...",UIS.R.1,135635.0,1988
9,Chad,TCD,"Teachers in lower secondary education, both se...",UIS.T.2,7747.0,2012


In [22]:
table_ref = dataset_ref.table('series_summary')
table = client.get_table(table_ref)
client.list_rows(table, max_results=5).to_dataframe()



Unnamed: 0,series_code,topic,indicator_name,short_definition,long_definition,unit_of_measure,periodicity,base_period,other_notes,aggregation_method,limitations_and_exceptions,notes_from_original_source,general_comments,source,statistical_concept_and_methodology,development_relevance,related_source_links,other_web_links,related_indicators,license_type
0,BAR.NOED.1519.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...,Percentage of female population age 15-19 with...,Percentage of female population age 15-19 with...,,,,,,,,,Robert J. Barro and Jong-Wha Lee: http://www.b...,,,,,,
1,BAR.NOED.1519.ZS,Attainment,Barro-Lee: Percentage of population age 15-19 ...,Percentage of population age 15-19 with no edu...,Percentage of population age 15-19 with no edu...,,,,,,,,,Robert J. Barro and Jong-Wha Lee: http://www.b...,,,,,,
2,BAR.NOED.15UP.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...,Percentage of female population age 15+ with n...,Percentage of female population age 15+ with n...,,,,,,,,,Robert J. Barro and Jong-Wha Lee: http://www.b...,,,,,,
3,BAR.NOED.15UP.ZS,Attainment,Barro-Lee: Percentage of population age 15+ wi...,Percentage of population age 15+ with no educa...,Percentage of population age 15+ with no educa...,,,,,,,,,Robert J. Barro and Jong-Wha Lee: http://www.b...,,,,,,
4,BAR.NOED.2024.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...,Percentage of female population age 20-24 with...,Percentage of female population age 20-24 with...,,,,,,,,,Robert J. Barro and Jong-Wha Lee: http://www.b...,,,,,,
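
The four previews above repeat the same three calls, so they could be wrapped in a small helper. The sketch below is one possible way to do that. `preview_table` is a hypothetical name, not part of the BigQuery client library, and it only uses the calls that already appear in the cells above.

```python
def preview_table(client, dataset_ref, table_name, max_results=5):
    """Return the first rows of a table as a DataFrame.

    Hypothetical helper: `client` and `dataset_ref` are assumed to be
    the objects created earlier in this notebook.
    """
    table_ref = dataset_ref.table(table_name)   # build a reference to the table
    table = client.get_table(table_ref)         # fetch the table metadata
    # Download only the first `max_results` rows
    return client.list_rows(table, max_results=max_results).to_dataframe()
```

With this in place, a preview would be a single call such as `preview_table(client, dataset_ref, 'country_summary')`.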


The table that suits our analysis is *international_education*. Let's look at its structure to understand it better.

In [10]:
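# Fetch the table explicitly, since `table` was last assigned to 'series_summary' above
client.get_table(dataset_ref.table('international_education')).schema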

[SchemaField('country_name', 'STRING', 'NULLABLE', '', (), None),
 SchemaField('country_code', 'STRING', 'NULLABLE', '', (), None),
 SchemaField('indicator_name', 'STRING', 'NULLABLE', '', (), None),
 SchemaField('indicator_code', 'STRING', 'NULLABLE', '', (), None),
 SchemaField('value', 'FLOAT', 'NULLABLE', '', (), None),
 SchemaField('year', 'INTEGER', 'NULLABLE', '', (), None)]

Let's check which years the data covers, from the first recorded year to the last.

In [11]:
query = """
        SELECT DISTINCT year
        FROM `bigquery-public-data.world_bank_intl_education.international_education`
        ORDER BY year DESC
        """
query_job = client.query(query)
years_df = query_job.to_dataframe()  # one row per distinct year
print(years_df)


    year
0   2100
1   2095
2   2090
3   2085
4   2080
..   ...
60  1974
61  1973
62  1972
63  1971
64  1970

[65 rows x 1 columns]


The query results show years beyond the current year (up to 2100), so we will exclude those rows from our analysis.
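
One way to apply that rule is to cap the year at the current calendar year. The sketch below shows the idea in plain Python on the years printed above. `observed_years` is a hypothetical helper. In the SQL queries, the equivalent is an upper bound on `year` in the `WHERE` clause.

```python
from datetime import date

def observed_years(years, current_year=None):
    """Keep only years that are not in the future, sorted ascending."""
    if current_year is None:
        current_year = date.today().year
    return sorted(y for y in years if y <= current_year)

# Years visible in the truncated query output above
sample = [2100, 2095, 2090, 2085, 2080, 1974, 1973, 1972, 1971, 1970]
print(observed_years(sample, current_year=2020))
# prints [1970, 1971, 1972, 1973, 1974]
```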

As we saw, the *international_education* table has a column called indicator_code. To look at government spending on education we need the indicator code *SE.XPD.TOTL.GB.ZS*, which corresponds to "Expenditure on education as % of total government spending".

**1. Indonesia VS World**

Let's begin by comparing the share of government spending that Indonesia devotes to education with the average across all other countries in the world, using data from 2010 through 2020.

In [47]:
query = """
        SELECT
            (SELECT ROUND(AVG(value)) 
            FROM `bigquery-public-data.world_bank_intl_education.international_education`
            WHERE NOT country_name  = 'Indonesia'
            AND indicator_code = 'SE.XPD.TOTL.GB.ZS' AND year >= 2010 AND year < 2021)
            AS world_education_spending_pct,
            (SELECT ROUND(AVG(value))
            FROM `bigquery-public-data.world_bank_intl_education.international_education`
            WHERE country_name = 'Indonesia'
            AND indicator_code = 'SE.XPD.TOTL.GB.ZS' AND year >= 2010 AND year < 2021)
            AS Indonesia_education_spending_pct
        """
ina_edu_debt = client.query(query).result().to_dataframe()
ina_edu_debt.head()


Unnamed: 0,world_education_spending_pct,Indonesia_education_spending_pct
0,15.0,18.0


Interesting: the query result shows that Indonesia spends about 18% of total government spending on education, a little above the world average of 15%.

**2. ASEAN Countries**

How about the ASEAN countries: what fraction of their government spending goes to education?

In [48]:
query = """
        SELECT country_name, ROUND(AVG(value)) AS edu_spending_pct
        FROM `bigquery-public-data.world_bank_intl_education.international_education`
        WHERE country_name IN ('Indonesia', 'Malaysia', 'Singapore', 'Thailand', 'Philippines',
        'Cambodia', 'Myanmar', 'Brunei Darussalam', 'Vietnam', 'Lao PDR')
        AND indicator_code = 'SE.XPD.TOTL.GB.ZS' AND year >= 2010 AND year <2021
        GROUP BY country_name
        ORDER BY edu_spending_pct DESC
        """
asean_edu_debt = client.query(query).result().to_dataframe()
asean_edu_debt.head(10)


Unnamed: 0,country_name,edu_spending_pct
0,Malaysia,20.0
1,Thailand,20.0
2,Singapore,20.0
3,Indonesia,18.0
4,Vietnam,18.0
5,Lao PDR,10.0
6,Brunei Darussalam,9.0
7,Cambodia,8.0


Malaysia, Thailand, and Singapore each devote a slightly larger share of government spending to education (20%) than Indonesia (18%). The Philippines and Myanmar do not appear in the result because they have no records for this indicator in the selected years.
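
A quick way to confirm which countries dropped out is to compare the names passed to the `IN (...)` filter with the names that came back. The sketch below hard-codes both lists from the query and its output above, and is only an illustration.

```python
# Names passed to the IN (...) filter in the query above
requested = ['Indonesia', 'Malaysia', 'Singapore', 'Thailand', 'Philippines',
             'Cambodia', 'Myanmar', 'Brunei Darussalam', 'Vietnam', 'Lao PDR']

# Names present in the query output above
returned = ['Malaysia', 'Thailand', 'Singapore', 'Indonesia', 'Vietnam',
            'Lao PDR', 'Brunei Darussalam', 'Cambodia']

# Countries that were requested but have no matching rows
missing = sorted(set(requested) - set(returned))
print(missing)
# prints ['Myanmar', 'Philippines']
```

In the notebook itself, `returned` could be taken from `asean_edu_debt['country_name']` instead of being typed by hand.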

**3. World**

Which countries spend the largest fraction of their total spending on education?

In [49]:
query = """
        SELECT country_name, ROUND(AVG(value)) AS edu_spending_pct
        FROM `bigquery-public-data.world_bank_intl_education.international_education`
        WHERE indicator_code = 'SE.XPD.TOTL.GB.ZS' AND year >=2010 AND year < 2021
        GROUP BY country_name
        ORDER BY edu_spending_pct DESC
        LIMIT 10
        """
world_edu_debt = client.query(query).result().to_dataframe()
world_edu_debt.head(10)


Unnamed: 0,country_name,edu_spending_pct
0,"Congo, Rep.",29.0
1,Ethiopia,28.0
2,Namibia,26.0
3,Ghana,26.0
4,Zimbabwe,24.0
5,Tunisia,23.0
6,Costa Rica,23.0
7,Senegal,23.0
8,Nicaragua,23.0
9,Benin,23.0


Eight of the top ten are African countries, so African governments stand out for devoting a large share of their total spending to education.
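
To check how strongly Africa is represented in this top ten, the countries can be tallied by continent. The table queried here has no continent column, so the sketch below relies on a hand-written set of African country names. That set is an assumption added for illustration, not data from the dataset.

```python
# Top ten from the query output above
top_ten = ['Congo, Rep.', 'Ethiopia', 'Namibia', 'Ghana', 'Zimbabwe',
           'Tunisia', 'Costa Rica', 'Senegal', 'Nicaragua', 'Benin']

# Assumption: hand-written lookup, not a column in international_education
african = {'Congo, Rep.', 'Ethiopia', 'Namibia', 'Ghana', 'Zimbabwe',
           'Tunisia', 'Senegal', 'Benin'}

n_african = sum(name in african for name in top_ten)
others = [name for name in top_ten if name not in african]

print(f"{n_african} of {len(top_ten)} are African")
print("Others:", others)
```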

**4. Identify Interesting Codes to Explore**

There are thousands of indicator codes in this dataset, so reviewing them all would be time-consuming. Let's start with the indicator codes that have the most rows for Indonesia.

In [37]:
query = """
        SELECT indicator_code, indicator_name, COUNT(1) num_rows, country_name
        FROM `bigquery-public-data.world_bank_intl_education.international_education`
        WHERE country_name = 'Indonesia' AND year >= 2010 AND year < 2021
        GROUP BY indicator_code, indicator_name, country_name
        ORDER BY num_rows DESC
        """
ina_indicator = client.query(query).result().to_dataframe()
ina_indicator.head(10)


Unnamed: 0,indicator_code,indicator_name,num_rows,country_name
0,SE.COM.DURS,Duration of compulsory education (years),7,Indonesia
1,SE.PRM.AGES,Official entrance age to primary education (ye...,7,Indonesia
2,SE.SEC.AGES,Official entrance age to lower secondary educa...,7,Indonesia
3,SE.SEC.DURS,Theoretical duration of secondary education (y...,7,Indonesia
4,SP.POP.TOTL,"Population, total",7,Indonesia
5,UIS.THAGE.0,Official entrance age to pre-primary education...,7,Indonesia
6,UIS.THDUR.0,Theoretical duration of pre-primary education ...,7,Indonesia
7,UIS.SAP.1.G1,Population of the official entrance age to pri...,7,Indonesia
8,IT.NET.USER.P2,Internet users (per 100 people),7,Indonesia
9,NY.GDP.MKTP.CD,GDP at market prices (current US$),7,Indonesia


In [38]:
query = """
        SELECT indicator_code, indicator_name, COUNT(1) num_rows
        FROM `bigquery-public-data.world_bank_intl_education.international_education`
        WHERE year >= 2010 AND year < 2021
        GROUP BY indicator_code, indicator_name
        HAVING num_rows > 150
        ORDER BY num_rows DESC
        """
world_indicator = client.query(query).result().to_dataframe()
world_indicator.head(10)


Unnamed: 0,indicator_code,indicator_name,num_rows
0,SP.POP.TOTL,"Population, total",1654
1,SP.POP.GROW,Population growth (annual %),1653
2,IT.NET.USER.P2,Internet users (per 100 people),1578
3,NY.GDP.PCAP.CD,GDP per capita (current US$),1553
4,NY.GDP.MKTP.CD,GDP at market prices (current US$),1553
5,NY.GDP.MKTP.KD,GDP at market prices (constant 2005 US$),1531
6,NY.GDP.PCAP.KD,GDP per capita (constant 2005 US$),1531
7,SP.POP.1564.TO.ZS,"Population, ages 15-64 (% of total)",1523
8,SP.POP.0014.TO,"Population, ages 0-14, total",1513
9,SP.POP.0014.FE.IN,"Population, ages 0-14, female",1513


After this exploration, the GDP indicator codes look interesting. Let's dig in.

**5. Rich Countries**



Having identified some interesting indicator codes, let's look at GDP per capita to see which countries are richer than others. We will use indicator_code = *'NY.GDP.PCAP.CD'* and focus on 2016, because the data for later years is incomplete.

In [17]:
query = """
         SELECT indicator_code, indicator_name, country_name, year, value
         FROM `bigquery-public-data.world_bank_intl_education.international_education`
         WHERE indicator_code = 'NY.GDP.PCAP.CD' AND year = 2016
         ORDER BY value DESC
         """
gdp = client.query(query).result().to_dataframe()
gdp.head(20)


Unnamed: 0,indicator_code,indicator_name,country_name,year,value
0,NY.GDP.PCAP.CD,GDP per capita (current US$),Luxembourg,2016,100573.139978
1,NY.GDP.PCAP.CD,GDP per capita (current US$),Switzerland,2016,79890.524005
2,NY.GDP.PCAP.CD,GDP per capita (current US$),"Macao SAR, China",2016,73186.960143
3,NY.GDP.PCAP.CD,GDP per capita (current US$),Norway,2016,70911.757159
4,NY.GDP.PCAP.CD,GDP per capita (current US$),Ireland,2016,63861.921982
5,NY.GDP.PCAP.CD,GDP per capita (current US$),Iceland,2016,59976.942565
6,NY.GDP.PCAP.CD,GDP per capita (current US$),Qatar,2016,59324.338773
7,NY.GDP.PCAP.CD,GDP per capita (current US$),United States,2016,57638.159088
8,NY.GDP.PCAP.CD,GDP per capita (current US$),North America,2016,56081.944482
9,NY.GDP.PCAP.CD,GDP per capita (current US$),Denmark,2016,53549.700671


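The result shows that many of the richest countries are in Europe. Luxembourg tops the list with a GDP per capita of about 100,573 USD. Note that North America also appears in the list, although it is a regional aggregate rather than a country.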
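
Regional aggregates can be dropped before ranking. The sketch below does this in plain Python on the ten rows shown above. The `AGGREGATES` set is an assumption: it is typed by hand from aggregate names that appear in this notebook's outputs and is not a complete list, and the names used in *international_education* may not match those in *country_summary* exactly.

```python
# (country_name, GDP per capita in current US$) for 2016, copied from the output above
rows = [
    ('Luxembourg', 100573.139978),
    ('Switzerland', 79890.524005),
    ('Macao SAR, China', 73186.960143),
    ('Norway', 70911.757159),
    ('Ireland', 63861.921982),
    ('Iceland', 59976.942565),
    ('Qatar', 59324.338773),
    ('United States', 57638.159088),
    ('North America', 56081.944482),
    ('Denmark', 53549.700671),
]

# Assumption: hand-maintained, incomplete set of aggregate names
AGGREGATES = {
    'North America',
    'Arab World',
    'East Asia & Pacific (developing only)',
    'East Asia & Pacific (all income levels)',
    'Europe & Central Asia (developing only)',
    'Europe & Central Asia (all income levels)',
}

# Keep only rows whose name is not a known aggregate (order is preserved)
countries_only = [(name, value) for name, value in rows if name not in AGGREGATES]

for rank, (name, value) in enumerate(countries_only, start=1):
    print(rank, name, round(value))
```

In the *country_summary* preview earlier, the aggregate rows have an empty `region` column, which suggests a join on `country_code` might do the same filtering in SQL. That is based on five rows only, so it would need checking against the full table.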

**6. Indonesia's GDP Growth**

How about Indonesia's GDP per capita: does it grow every year, or does it also decline?

In [18]:
query = """
        SELECT indicator_code, Indicator_name, country_name, year, value
        FROM `bigquery-public-data.world_bank_intl_education.international_education`
        WHERE indicator_code = 'NY.GDP.PCAP.CD' AND country_name = 'Indonesia'
        AND year >= 2010 AND year <= 2016
        ORDER BY year DESC
        """
pop = client.query(query).result().to_dataframe()
pop.head(20)


Unnamed: 0,indicator_code,Indicator_name,country_name,year,value
0,NY.GDP.PCAP.CD,GDP per capita (current US$),Indonesia,2016,3570.294888
1,NY.GDP.PCAP.CD,GDP per capita (current US$),Indonesia,2015,3336.106686
2,NY.GDP.PCAP.CD,GDP per capita (current US$),Indonesia,2014,3491.595887
3,NY.GDP.PCAP.CD,GDP per capita (current US$),Indonesia,2013,3620.663981
4,NY.GDP.PCAP.CD,GDP per capita (current US$),Indonesia,2012,3687.953996
5,NY.GDP.PCAP.CD,GDP per capita (current US$),Indonesia,2011,3634.276805
6,NY.GDP.PCAP.CD,GDP per capita (current US$),Indonesia,2010,3113.480635


Indonesia's GDP per capita has not grown every year: it rose from 2010 to 2012, fell from 2013 to 2015, and recovered in 2016. The largest increase, about 17%, occurred from 2010 to 2011.
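
The year-over-year changes can be computed directly from the seven values in the output above. The sketch below is a plain-Python illustration with the values typed in by hand.

```python
# Indonesia, GDP per capita (current US$), copied from the output above
gdp_per_capita = {
    2010: 3113.480635,
    2011: 3634.276805,
    2012: 3687.953996,
    2013: 3620.663981,
    2014: 3491.595887,
    2015: 3336.106686,
    2016: 3570.294888,
}

years = sorted(gdp_per_capita)

# Percentage change relative to the previous year
growth = {}
for prev, curr in zip(years, years[1:]):
    growth[curr] = (gdp_per_capita[curr] / gdp_per_capita[prev] - 1) * 100

for year, pct in growth.items():
    print(year, f"{pct:+.1f}%")

# Year with the largest increase over the previous year
best_year = max(growth, key=growth.get)
print("Largest increase ended in", best_year)
```

With the DataFrame from the cell above, `pop.sort_values('year')['value'].pct_change()` should give the same changes, expressed as fractions rather than percentages.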

**7. ASEAN Countries GDP**


We will group the ASEAN countries by GDP per capita, using the income ranges defined by the World Bank. The World Bank defines these ranges on GNI per capita, so applying them to GDP per capita is an approximation.

In [19]:
query = """
        SELECT country_name, value,
            CASE
                WHEN value >= 12535 THEN 'High Income'
                WHEN value >= 4046 AND value < 12535 THEN 'Upper Middle Income'
                WHEN value >= 1036 AND value < 4046 THEN 'Lower Middle Income'
                ELSE 'Low Income'
            END AS Group_Based_on_GDP_per_Capita
        FROM `bigquery-public-data.world_bank_intl_education.international_education`
        WHERE country_name IN ('Indonesia', 'Malaysia', 'Thailand', 'Singapore', 'Brunei Darussalam',
        'Myanmar', 'Philippines', 'Lao PDR', 'Vietnam', 'Cambodia')
        AND indicator_code = 'NY.GDP.PCAP.CD' AND year = 2016
        ORDER BY value DESC
        """
gdp_asean = client.query(query).result().to_dataframe()
gdp_asean.head(10)


Unnamed: 0,country_name,value,Group_Based_on_GDP_per_Capita
0,Singapore,52962.491569,High Income
1,Brunei Darussalam,26939.417509,High Income
2,Malaysia,9508.23775,Upper Middle Income
3,Thailand,5910.620932,Upper Middle Income
4,Indonesia,3570.294888,Lower Middle Income
5,Philippines,2951.071929,Lower Middle Income
6,Lao PDR,2353.136848,Lower Middle Income
7,Vietnam,2214.387662,Lower Middle Income
8,Cambodia,1269.907238,Lower Middle Income
9,Myanmar,1195.515372,Lower Middle Income
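
The `CASE` expression in the query can be mirrored as a small Python function, which makes the thresholds easy to test. `income_group` is a hypothetical name. One difference is handled explicitly: in SQL a `NULL` value fails every `WHEN` comparison and falls through to `ELSE`, so it would be labelled 'Low Income', whereas this sketch returns `None` for a missing value.

```python
def income_group(value):
    """Classify a GDP per capita value using the thresholds from the query above."""
    if value is None:
        return None  # unlike the SQL CASE, do not label missing values
    if value >= 12535:
        return 'High Income'
    if value >= 4046:
        return 'Upper Middle Income'
    if value >= 1036:
        return 'Lower Middle Income'
    return 'Low Income'

# Values copied from the output above
for name, value in [('Singapore', 52962.491569),
                    ('Malaysia', 9508.23775),
                    ('Indonesia', 3570.294888)]:
    print(name, income_group(value))
```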


**8. Which one is more?**

Are there more high-income countries or more low-income countries? The query below answers the question.

In [20]:
query = """
        WITH gdp_group AS
        (SELECT 
            CASE
                WHEN value >= 12535 THEN 'High Income Countries'
                WHEN value >= 4046 AND value < 12535 THEN 'Upper Middle Income Countries'
                WHEN value >= 1036 AND value < 4046 THEN 'Lower Middle Income Countries'
                ELSE 'Low Income Countries'
            END AS group_countries_based_on_gdp
          FROM `bigquery-public-data.world_bank_intl_education.international_education`
          WHERE indicator_code = 'NY.GDP.PCAP.CD' AND year = 2016
        )
        SELECT count(1) AS num_countries, group_countries_based_on_gdp
        FROM gdp_group
        GROUP BY group_countries_based_on_gdp
        ORDER BY num_countries DESC
        """
world_gdp = client.query(query).result().to_dataframe()
world_gdp.head()


Unnamed: 0,num_countries,group_countries_based_on_gdp
0,62,High Income Countries
1,59,Upper Middle Income Countries
2,57,Lower Middle Income Countries
3,33,Low Income Countries
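
To compare the groups more directly, the counts can be turned into shares of the total. The sketch below uses the four counts from the output above. Keep in mind that the table also contains regional aggregates, so the counts are not strictly per country.

```python
# Counts copied from the query output above
counts = {
    'High Income Countries': 62,
    'Upper Middle Income Countries': 59,
    'Lower Middle Income Countries': 57,
    'Low Income Countries': 33,
}

total = sum(counts.values())

# Share of each group as a percentage of all rows
shares = {group: 100 * n / total for group, n in counts.items()}

for group, pct in shares.items():
    print(f"{group}: {pct:.1f}%")
```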
