In this notebook, I clean and explore the `edstats_series` file, which I obtained from [The World Bank](https://datacatalog.worldbank.org/dataset/education-statistics). The download included no key explaining what the column labels mean, so I will review the columns and state my assumptions along the way. The goal of this project is to develop my data cleaning skills and, possibly, to produce some data visualizations.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
edstats_series = pd.read_csv("/Users/scottmaccarone/Desktop/Coding/My_Fun_Projects/Data Cleaning_Education Stats_World Bank/Edstats_csv 2/EdStatsSeries.csv")

In [3]:
edstats_series.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3665 entries, 0 to 3664
Data columns (total 21 columns):
Series Code                            3665 non-null object
Topic                                  3665 non-null object
Indicator Name                         3665 non-null object
Short definition                       2156 non-null object
Long definition                        3665 non-null object
Unit of measure                        0 non-null float64
Periodicity                            99 non-null object
Base Period                            314 non-null object
Other notes                            552 non-null object
Aggregation method                     47 non-null object
Limitations and exceptions             14 non-null object
Notes from original source             0 non-null float64
General comments                       14 non-null object
Source                                 3665 non-null object
Statistical concept and methodology    23 non-null object
Developme

In [4]:
pd.options.display.max_columns = 25

In [5]:
edstats_series.head(10)

Unnamed: 0,Series Code,Topic,Indicator Name,Short definition,Long definition,Unit of measure,Periodicity,Base Period,Other notes,Aggregation method,Limitations and exceptions,Notes from original source,General comments,Source,Statistical concept and methodology,Development relevance,Related source links,Other web links,Related indicators,License Type,Unnamed: 20
0,BAR.NOED.1519.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...,Percentage of female population age 15-19 with...,Percentage of female population age 15-19 with...,,,,,,,,,Robert J. Barro and Jong-Wha Lee: http://www.b...,,,,,,,
1,BAR.NOED.1519.ZS,Attainment,Barro-Lee: Percentage of population age 15-19 ...,Percentage of population age 15-19 with no edu...,Percentage of population age 15-19 with no edu...,,,,,,,,,Robert J. Barro and Jong-Wha Lee: http://www.b...,,,,,,,
2,BAR.NOED.15UP.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...,Percentage of female population age 15+ with n...,Percentage of female population age 15+ with n...,,,,,,,,,Robert J. Barro and Jong-Wha Lee: http://www.b...,,,,,,,
3,BAR.NOED.15UP.ZS,Attainment,Barro-Lee: Percentage of population age 15+ wi...,Percentage of population age 15+ with no educa...,Percentage of population age 15+ with no educa...,,,,,,,,,Robert J. Barro and Jong-Wha Lee: http://www.b...,,,,,,,
4,BAR.NOED.2024.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...,Percentage of female population age 20-24 with...,Percentage of female population age 20-24 with...,,,,,,,,,Robert J. Barro and Jong-Wha Lee: http://www.b...,,,,,,,
5,BAR.NOED.2024.ZS,Attainment,Barro-Lee: Percentage of population age 20-24 ...,Percentage of population age 20-24 with no edu...,Percentage of population age 20-24 with no edu...,,,,,,,,,Robert J. Barro and Jong-Wha Lee: http://www.b...,,,,,,,
6,BAR.NOED.2529.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...,Percentage of female population age 25-29 with...,Percentage of female population age 25-29 with...,,,,,,,,,Robert J. Barro and Jong-Wha Lee: http://www.b...,,,,,,,
7,BAR.NOED.2529.ZS,Attainment,Barro-Lee: Percentage of population age 25-29 ...,Percentage of population age 25-29 with no edu...,Percentage of population age 25-29 with no edu...,,,,,,,,,Robert J. Barro and Jong-Wha Lee: http://www.b...,,,,,,,
8,BAR.NOED.25UP.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...,Percentage of female population age 25+ with n...,Percentage of female population age 25+ with n...,,,,,,,,,Robert J. Barro and Jong-Wha Lee: http://www.b...,,,,,,,
9,BAR.NOED.25UP.ZS,Attainment,Barro-Lee: Percentage of population age 25+ wi...,Percentage of population age 25+ with no educa...,Percentage of population age 25+ with no educa...,,,,,,,,,Robert J. Barro and Jong-Wha Lee: http://www.b...,,,,,,,


In [6]:
edstats_series.isnull().sum()

Series Code                               0
Topic                                     0
Indicator Name                            0
Short definition                       1509
Long definition                           0
Unit of measure                        3665
Periodicity                            3566
Base Period                            3351
Other notes                            3113
Aggregation method                     3618
Limitations and exceptions             3651
Notes from original source             3665
General comments                       3651
Source                                    0
Statistical concept and methodology    3642
Development relevance                  3662
Related source links                   3450
Other web links                        3665
Related indicators                     3665
License Type                           3665
Unnamed: 20                            3665
dtype: int64

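The only usable columns seem to be `Series Code`, `Topic`, `Indicator Name`, `Long definition`, and `Source`: these are the five columns with no missing values. `Short definition` is missing in 1,509 of the 3,665 rows (about 41%), and every other column is mostly or entirely empty. I will select only those five columns below and assign the result back to `edstats_series`: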

In [7]:
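# Keep only the five fully populated columns; .copy() gives an independent frame rather than a slice of the original.
edstats_series = edstats_series[['Series Code', 'Topic', 'Indicator Name', 'Long definition', 'Source']].copy()
edstats_series[:10]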

Unnamed: 0,Series Code,Topic,Indicator Name,Long definition,Source
0,BAR.NOED.1519.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...,Percentage of female population age 15-19 with...,Robert J. Barro and Jong-Wha Lee: http://www.b...
1,BAR.NOED.1519.ZS,Attainment,Barro-Lee: Percentage of population age 15-19 ...,Percentage of population age 15-19 with no edu...,Robert J. Barro and Jong-Wha Lee: http://www.b...
2,BAR.NOED.15UP.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...,Percentage of female population age 15+ with n...,Robert J. Barro and Jong-Wha Lee: http://www.b...
3,BAR.NOED.15UP.ZS,Attainment,Barro-Lee: Percentage of population age 15+ wi...,Percentage of population age 15+ with no educa...,Robert J. Barro and Jong-Wha Lee: http://www.b...
4,BAR.NOED.2024.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...,Percentage of female population age 20-24 with...,Robert J. Barro and Jong-Wha Lee: http://www.b...
5,BAR.NOED.2024.ZS,Attainment,Barro-Lee: Percentage of population age 20-24 ...,Percentage of population age 20-24 with no edu...,Robert J. Barro and Jong-Wha Lee: http://www.b...
6,BAR.NOED.2529.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...,Percentage of female population age 25-29 with...,Robert J. Barro and Jong-Wha Lee: http://www.b...
7,BAR.NOED.2529.ZS,Attainment,Barro-Lee: Percentage of population age 25-29 ...,Percentage of population age 25-29 with no edu...,Robert J. Barro and Jong-Wha Lee: http://www.b...
8,BAR.NOED.25UP.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...,Percentage of female population age 25+ with n...,Robert J. Barro and Jong-Wha Lee: http://www.b...
9,BAR.NOED.25UP.ZS,Attainment,Barro-Lee: Percentage of population age 25+ wi...,Percentage of population age 25+ with no educa...,Robert J. Barro and Jong-Wha Lee: http://www.b...
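
The five columns above were picked by reading the `isnull().sum()` output by eye. As a rough sketch, the same selection could be derived from the share of missing values in each column. The helper name `columns_within_missing_share`, its `max_missing_share` parameter, and the toy frame are all hypothetical and only stand in for the real data:

```python
import pandas as pd

def columns_within_missing_share(df, max_missing_share=0.0):
    """Return names of columns whose share of missing values is at most max_missing_share."""
    missing_share = df.isnull().mean()  # fraction of null entries per column
    return missing_share[missing_share <= max_missing_share].index.tolist()

# Toy frame standing in for edstats_series (values are made up for illustration).
toy = pd.DataFrame({
    "Series Code": ["A.1", "A.2", "A.3"],
    "Short definition": ["x", None, "z"],
    "Unit of measure": [None, None, None],
})

# With the default threshold, only columns with no missing values are kept.
complete_cols = columns_within_missing_share(toy)
print(complete_cols)
```

Raising the threshold (for example `max_missing_share=0.5`) would also keep partially filled columns such as `Short definition`, if that trade-off seemed worthwhile.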


In [9]:
edstats_series['Topic'].unique()

array(['Attainment', 'Education Equality',
       'Infrastructure: Communications', 'Learning Outcomes',
       'Economic Policy & Debt: National accounts: US$ at current prices: Aggregate indicators',
       'Economic Policy & Debt: National accounts: US$ at constant 2010 prices: Aggregate indicators',
       'Economic Policy & Debt: Purchasing power parity',
       'Economic Policy & Debt: National accounts: Atlas GNI & GNI per capita',
       'Teachers', 'Education Management Information Systems (SABER)',
       'Early Child Development (SABER)',
       'Engaging the Private Sector (SABER)',
       'School Health and School Feeding (SABER)',
       'School Autonomy and Accountability (SABER)',
       'School Finance (SABER)', 'Student Assessment (SABER)',
       'Teachers (SABER)', 'Tertiary Education (SABER)',
       'Workforce Development (SABER)', 'Literacy', 'Background',
       'Primary', 'Secondary', 'Tertiary', 'Early Childhood Education',
       'Pre-Primary', 'Expenditures'

In [12]:
edstats_series['Indicator Name'].value_counts()

Age population, age 17, total, UNESCO                                                                                                                  1
Projection: Percentage of the population age 15+ by highest level of educational attainment. Lower Secondary. Male                                     1
SABER: (Student Assessment) Policy Goal 3 Lever 3: Assessment Quality                                                                                  1
Survival rate to the last grade of lower secondary general education, female (%)                                                                       1
DHS: Proportion of out-of-school. Primary. Female                                                                                                      1
Capital expenditure as % of total expenditure in public institutions (%)                                                                               1
SABER: (School Autonomy and Accountability) Policy Goal 2: Level of autonomy in pe

For consistency, I want all column names to be lowercase. I also want to strip any leading or trailing whitespace and replace the spaces between words with underscores:

In [13]:
edstats_series.columns = (edstats_series.columns
                          .str.lower()
                          .str.strip()
                          .str.replace(' ', '_'))

In [14]:
edstats_series[:10]

Unnamed: 0,series_code,topic,indicator_name,long_definition,source
0,BAR.NOED.1519.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...,Percentage of female population age 15-19 with...,Robert J. Barro and Jong-Wha Lee: http://www.b...
1,BAR.NOED.1519.ZS,Attainment,Barro-Lee: Percentage of population age 15-19 ...,Percentage of population age 15-19 with no edu...,Robert J. Barro and Jong-Wha Lee: http://www.b...
2,BAR.NOED.15UP.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...,Percentage of female population age 15+ with n...,Robert J. Barro and Jong-Wha Lee: http://www.b...
3,BAR.NOED.15UP.ZS,Attainment,Barro-Lee: Percentage of population age 15+ wi...,Percentage of population age 15+ with no educa...,Robert J. Barro and Jong-Wha Lee: http://www.b...
4,BAR.NOED.2024.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...,Percentage of female population age 20-24 with...,Robert J. Barro and Jong-Wha Lee: http://www.b...
5,BAR.NOED.2024.ZS,Attainment,Barro-Lee: Percentage of population age 20-24 ...,Percentage of population age 20-24 with no edu...,Robert J. Barro and Jong-Wha Lee: http://www.b...
6,BAR.NOED.2529.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...,Percentage of female population age 25-29 with...,Robert J. Barro and Jong-Wha Lee: http://www.b...
7,BAR.NOED.2529.ZS,Attainment,Barro-Lee: Percentage of population age 25-29 ...,Percentage of population age 25-29 with no edu...,Robert J. Barro and Jong-Wha Lee: http://www.b...
8,BAR.NOED.25UP.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...,Percentage of female population age 25+ with n...,Robert J. Barro and Jong-Wha Lee: http://www.b...
9,BAR.NOED.25UP.ZS,Attainment,Barro-Lee: Percentage of population age 25+ wi...,Percentage of population age 25+ with no educa...,Robert J. Barro and Jong-Wha Lee: http://www.b...
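
The renaming step can also be wrapped in a small reusable function. This is only a sketch: the name `normalize_columns` is hypothetical, and unlike the cell above it collapses any run of whitespace (not just a single space) into one underscore, which is an assumption about what would be wanted for messier headers:

```python
import pandas as pd

def normalize_columns(df):
    """Return a copy of df with stripped, lowercase, underscore-separated column names."""
    out = df.copy()
    out.columns = (out.columns
                   .str.strip()                            # drop leading/trailing whitespace
                   .str.lower()                            # lowercase everything
                   .str.replace(r"\s+", "_", regex=True))  # whitespace runs -> one underscore
    return out

# Toy frame with deliberately messy headers (made up for illustration).
toy = pd.DataFrame([["a", "b", "c"]],
                   columns=["Series Code", " Long  definition ", "Source"])

clean = normalize_columns(toy)
print(clean.columns.tolist())
```

Returning a copy leaves the input frame unchanged, so the original headers stay available for comparison.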


In [15]:
edstats_series['source'].value_counts()

UNESCO Institute for Statistics                                                                                                                                                                                                                                                                                                                                                                                                                                                             1269
Early Grade Reading Assessment (EGRA): https://www.eddataglobal.org/reading/                                                                                                                                                                                                                                                                                                                                                                                                                 403
Robert J. Barro and Jong-Wha Lee: http

The `edstats_series` dataset seems to be a catalog of indicators and their data sources, organized by educational topic (such as "Education Equality", "Teachers", "School Finance", and "Literacy"). It does NOT contain actual measurements for these topics, only descriptions of the indicators and the sources where the data can be found. Therefore, I will pursue these sources to find datasets that I can analyze and draw conclusions from.
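
To decide which sources to look at first, one option is to count how many indicators and how many distinct sources each topic has. The sketch below assumes the renamed `topic` and `source` columns; the function name `summarize_sources` and the toy rows are hypothetical placeholders, not real entries from the file:

```python
import pandas as pd

def summarize_sources(df, topic_col="topic", source_col="source"):
    """Count indicators and distinct sources per topic, largest topics first."""
    summary = (df.groupby(topic_col)[source_col]
                 .agg(["count", "nunique"])
                 .rename(columns={"count": "n_indicators", "nunique": "n_sources"}))
    return summary.sort_values("n_indicators", ascending=False)

# Toy rows standing in for edstats_series (made up for illustration).
toy = pd.DataFrame({
    "topic": ["Attainment", "Attainment", "Literacy"],
    "source": ["Source A", "Source A", "Source B"],
})

summary = summarize_sources(toy)
print(summary)
```

On the real data, the call would be `summarize_sources(edstats_series)`.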