# Motivation and problem statement: 
I plan to do a human-centered data science project to understand how early childhood education (ECE) statistics, including pre-school and kindergarten education statistics, differ around the world.

While education has long been an important cultural component in many countries, equipping younger generations with the knowledge and skills to learn and adapt in the world, access to and progression through education are unequal. According to [The United Nations Children's Fund (UNICEF)](https://www.unicef.org/education/early-childhood-education), quality ECE matters to every child because it 1) sets a strong foundation for learning, 2) helps make education systems more effective and efficient, and 3) serves as an effective strategy to promote economic growth. However, global ECE inequality persists; nearly 50% of all pre-primary-age children are not enrolled in any form of ECE program. Therefore, it is crucial to critically examine differences in resources and to consider factors that cause, or are correlated with, differences in ECE programs globally.

By working on this data science project, I hope to learn to think critically and reflect ethically on issues and challenges in global education. I also hope to use the skills I have learned in this course to interpret the findings of my data analysis and to highlight interesting, concerning, or confusing patterns in the data.

# Data selected for analysis: 
I plan to use the dataset called "Global Education Statistics", which contains more than 4,000 indicators comparing education statistics (EdStats), including literacy, population, and expenditures, among 216 countries. I chose this dataset because 1) it is provided by the [World Bank Group](https://www.worldbank.org/en/home), an authoritative international financial institution; 2) its coverage can be traced back to the 1970s, and the quarterly update frequency has been consistent; 3) the 4,000+ indicators hold data from international and regional learning assessments, equity data from household surveys, data from educational attainment projections, and much more, covering the full educational cycle from pre-elementary to vocational education.
- Published data: https://datacatalog.worldbank.org/dataset/education-statistics
- API: https://datahelpdesk.worldbank.org/knowledgebase/topics/125589
- Databank: https://databank.worldbank.org/reports.aspx?source=education-statistics-~-all-indicators
- License: https://datacatalog.worldbank.org/public-licenses#cc-by 
- Terms of use: This dataset is classified as Public under the Access to Information Classification Policy. Users inside and outside the World Bank organization can access this dataset.

This dataset is suitable for my research goal because it is big data: it is both long and wide. I will be able to focus not only on ECE indicators but also on other relevant indicators, such as ECE teacher statistics and ECE student populations in individual countries; as such, my analysis will not be limited to a single factor. By understanding global ECE gaps, I aim to highlight some human-centered educational challenges and raise public awareness of relevant topics.

Some ethical considerations include: 1) before data analysis, sampling bias and other types of bias may exist in the original dataset and could skew the results; 2) during data analysis, my data wrangling choices matter (e.g., which columns to keep and which rows to delete), so I need to clearly explain every decision I make; 3) after data analysis, because I am not familiar with the history and culture of most of the countries, I may misunderstand or misinterpret certain patterns, trends, or correlations in the data.

# Unknowns and dependencies: 
Technical unknowns include: 1) the dataset may be updated during the time of my analysis and before my final submission, which could result in inconsistencies between my presentation and the new dataset; 2) though unlikely, the link to access the dataset may become invalid or inaccessible at any time. 

I also have to be mindful of time constraints. With such a large dataset, I need to carefully plan the scope of my analysis and focus on specific research directions and questions, so that the presentation of findings and results will be neither vague nor too lengthy.

# Research Questions:
- Basic: How do early childhood education programs differ among countries around the world? 
- Reach: Which available indicators (i.e., statistical indicators in the same dataset about educational facts and resources) are most strongly correlated with indicators of early childhood education?

# Related Work:
Four summary reports from the U.S. Department of Education's National Center for Education Statistics, based on statistics collected and analyzed in the U.S. and other nations and published in [2016](https://nces.ed.gov/fastfacts/display.asp?id=4), [2018](https://nces.ed.gov/fastfacts/display.asp?id=56), [2019](https://nces.ed.gov/fastfacts/display.asp?id=90), and [2020](https://nces.ed.gov/fastfacts/display.asp?id=516), showed that 1) the percentage of children who were enrolled in and attended full-day preschool/kindergarten ECE programs increased from 47%/60% in 2000 to 54%/81% in 2018, respectively; 2) more than 50% of surveyed children aged between 4 and 5 had center-based care as their primary ECE program, followed by home-based care and no ECE care; 3) children who received center-based care performed better in academic skills and learning behaviors at the beginning of kindergarten; 4) more families, especially White and Asian/Pacific Islander parents, started to engage in informal home literacy experiences, such as reading, telling stories, teaching letters/words/numbers, or visiting a library, to promote their children's literacy development; and 5) average mathematics and reading scores of first-time kindergartners were higher for White and Asian children, and this gap persisted among fifth-graders.

Though prior research has focused extensively on the impact of ECE on children's learning capacity and on the composition (mostly racial/ethnic and socio-economic) of children in ECE programs, few studies have examined and compared national patterns of ECE on a macro-level scale and from a human-centered data science perspective, incorporating the economic, political, and cultural characteristics of different countries to understand not only the statistical but also the ethical and humanitarian differences and similarities in ECE data around the globe.

# Methodology
### Data wrangling
1) For the discovery phase, I will first skim through the first 100 rows of the original data and its supplementary data to understand the basic context. 
2) Next, I plan to restructure the data, using only a subset of it for the analysis. I will select all indicators related to ECE for the 216 countries, covering the years of the 21st century.
3) I will then clean the data by deleting empty values, removing outliers, and standardizing inputs.
4) (Optional, for reach question) I will also enrich the dataset by joining it with a supplementary EdStats dataset from the World Bank Group that contains information about each country.

Via data wrangling, I will be able to familiarize myself with the patterns in the original dataset, exclude or resolve any errors in the data, and enrich the data with values from other datasets.
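The restructuring and cleaning steps above could be sketched roughly as follows with pandas. This is only a minimal illustration: the tiny in-memory table, the column names, the country names, and the `tidy_ece` helper are all assumptions standing in for the real EdStats file, and the keyword filter on "pre-primary" is just one possible way to pick ECE indicators. Outlier removal and standardization are not shown.

```python
import pandas as pd

# Tiny stand-in for a wide layout (one row per country-indicator pair,
# one column per year). Names and values are hypothetical, not real EdStats data.
wide = pd.DataFrame({
    "Country Name": ["Country A", "Country A", "Country B", "Country B"],
    "Country Code": ["AAA", "AAA", "BBB", "BBB"],
    "Indicator Name": [
        "Gross enrolment ratio, pre-primary, both sexes (%)",
        "Population, total",
        "Gross enrolment ratio, pre-primary, both sexes (%)",
        "Population, total",
    ],
    "1999": [40.0, 1000.0, 20.0, 5000.0],
    "2000": [42.0, 1010.0, None, 5100.0],
    "2001": [45.0, 1020.0, 25.0, 5200.0],
})


def tidy_ece(wide_df, keyword="pre-primary", first_year=2000):
    """Keep ECE indicators, reshape to long format, and drop missing values."""
    id_cols = ["Country Name", "Country Code", "Indicator Name"]
    # Keep only indicators whose name mentions the ECE keyword.
    is_ece = wide_df["Indicator Name"].str.contains(keyword, case=False, regex=False)
    ece = wide_df[is_ece]
    # Reshape so that each row is one (country, indicator, year) observation.
    long_df = ece.melt(id_vars=id_cols, var_name="Year", value_name="Value")
    long_df["Year"] = long_df["Year"].astype(int)
    # Restrict to the chosen period and drop empty values.
    long_df = long_df[long_df["Year"] >= first_year].dropna(subset=["Value"])
    return long_df.sort_values(["Country Code", "Year"]).reset_index(drop=True)


ece_long = tidy_ece(wide)
print(ece_long)
```

Working in long format keeps later steps (ranking, plotting, joining with country metadata) simple, since each observation occupies exactly one row.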

### Data analysis
The ECE-related indicators that I will use in this project include, but are not limited to, 1) enrollment ratio (by sex and by type of institution and program), 2) total enrollment rate, 3) population of the official age for pre-primary education, 4) ratio of qualified or trained teachers in pre-primary education, 5) government expenditure on pre-primary education as % of GDP, and 6) expenditure on pre-primary education as % of total government expenditure.
- I will generate two series of graphs. The first series will be Plotly bar charts showing the 30 countries with the highest values of each ECE indicator. The second series will be geographic map plots using Plotly (or, preferably, interactive maps with tooltips showing additional information, using Folium, another Python package).
- Based on these visualizations, I will also select five representative countries and perform time-series analysis to understand how some of the ECE indicators changed from 2000 to 2020.
- (Optional, for reach question) I will perform multiple linear regression on the five selected countries to identify which indicators of the countries' population, finance, trade, and general education expenditure predict the ECE indicators.
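As a rough sketch of the ranking and time-series steps, assuming the data is already in a long format with `Country Code`, `Year`, and `Value` columns (these column names, the helper functions, and the sample values are hypothetical), the tables behind the charts might be prepared like this. The Plotly call is kept in a separate helper that imports the package only when a figure is requested.

```python
import pandas as pd

# Illustrative long-format data: hypothetical countries and values,
# not real EdStats figures.
ece_long = pd.DataFrame({
    "Country Code": ["AAA", "AAA", "BBB", "BBB", "CCC", "CCC"],
    "Year": [2000, 2018, 2000, 2018, 2000, 2018],
    "Value": [40.0, 55.0, 70.0, 72.0, 10.0, 30.0],
})


def top_countries(df, year, n=30):
    """Return the n countries with the highest indicator value in a given year."""
    in_year = df[df["Year"] == year]
    return in_year.nlargest(n, "Value")[["Country Code", "Value"]].reset_index(drop=True)


def trend_table(df, countries):
    """Pivot selected countries into a Year x Country table for time-series plots."""
    subset = df[df["Country Code"].isin(countries)]
    return subset.pivot(index="Year", columns="Country Code", values="Value")


def plot_top_countries(ranked, title):
    """Draw the ranking as a bar chart; requires the plotly package."""
    import plotly.express as px  # imported here so the ranking works without Plotly
    return px.bar(ranked, x="Country Code", y="Value", title=title)


ranked = top_countries(ece_long, year=2018, n=2)
trends = trend_table(ece_long, ["AAA", "CCC"])
print(ranked)
print(trends)
```

Separating the table preparation from the plotting makes it easier to document each wrangling decision and to reuse the same ranked table for both bar charts and maps.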

Via data analysis, I will be able to talk about my findings and showcase specific patterns in the prepared dataset effectively and concisely. The combination of bar charts, maps, and time-series will communicate the data via the dimensions of geometry, space, and time. 
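For the optional reach question, the multiple linear regression step could look something like the sketch below, which fits ordinary least squares with NumPy. The predictor and target values are synthetic (generated from known coefficients so the fit can be checked), and the `fit_ols` helper is a hypothetical name; real predictors would come from the enriched dataset.

```python
import numpy as np

# Synthetic data: two made-up predictors and a target built as
# y = 2 + 0.5 * x1 + 3 * x2, so the expected coefficients are known.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = 2.0 + 0.5 * x1 + 3.0 * x2


def fit_ols(predictors, target):
    """Fit target ~ intercept + predictors by ordinary least squares."""
    # Prepend a column of ones so the first coefficient is the intercept.
    design = np.column_stack([np.ones(len(target)), predictors])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


coef = fit_ols(np.column_stack([x1, x2]), y)
print(coef)  # intercept, then one coefficient per predictor
```

With only five countries, the number of predictors in any single model would need to stay small relative to the number of country-year observations for the coefficients to be interpretable.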

### Data presentation
I will present the data visualizations in PNG format, along with supplementary written explanations that support the images with more detail. I will also use tables and bullet points throughout the report to summarize the source and analysis of the data.