# **Case Study - Education and Gender Data**



UNESCO outlines the levels of school education as part of its framework for inclusive and equitable education. It begins with early childhood education, which aims to foster foundational skills and holistic development for children aged 0-8 years. **Primary education** follows, typically for children aged 6-12, providing essential literacy and numeracy skills while promoting lifelong learning. **Lower secondary education** builds on this foundation, offering broader knowledge and skills. **Upper secondary education** prepares students for either higher education or vocational training and marks the transition to adulthood. Finally, **tertiary education** encompasses post-secondary learning, including universities, colleges, and vocational institutions, offering advanced academic or professional qualifications to prepare individuals for specialized careers.

Today, we will examine completion rates across education levels for different countries. We will also look at extreme poverty by region and at how income level affects gender parity in education. The education data come from UNESCO and the income data from the World Bank.

# Step 1: A few keywords to understand

**Gross Enrolment Ratio**: The Gross Enrolment Ratio (GER) is the total enrolment in a specific education level (such as primary, secondary, or tertiary), regardless of age, expressed as a percentage of the population in the official age group for that level of education.
A GER above 100% indicates that students older or younger than the typical age group are enrolled.
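
The definition above can be sketched as a small calculation. The function name and the numbers below are hypothetical and only illustrate how a GER above 100% can arise; they are not taken from the dataset.

```python
def gross_enrolment_ratio(enrolled: float, official_age_population: float) -> float:
    """Return the GER in percent: total enrolment relative to the official-age population."""
    if official_age_population <= 0:
        raise ValueError("official_age_population must be positive")
    return 100 * enrolled / official_age_population


# Hypothetical example: 1,150 pupils of any age are enrolled in primary school,
# while 1,000 children are of official primary-school age.
ger_example = gross_enrolment_ratio(1150, 1000)
print(f"GER = {ger_example:.1f}%")  # GER = 115.0%
```

The extra 15 percentage points come from enrolled pupils who are older or younger than the official age group.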

**Extreme Poverty**: Extreme poverty is defined as living on less than $2.15 a day (as per the World Bank's international poverty line), making it difficult to meet basic needs such as food, shelter, and healthcare. It reflects severe deprivation of economic and social resources, impacting individuals' ability to lead a healthy, productive life.

**Gender Parity:** Gender parity refers to equal representation and opportunities for all genders, ensuring no gender is disadvantaged in any context. Gender parity is measured using the Gender Parity Index (GPI), which is the ratio of the female to the male value of an indicator such as a participation or completion rate. A GPI of 1 indicates gender parity, values above 1 favor females, and values below 1 favor males.
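
As a minimal sketch of this definition, the GPI can be computed and interpreted as follows. The function names, the example rates, and the optional `tolerance` argument are assumptions made for illustration. With the default tolerance of 0, the interpretation matches the rule stated above exactly.

```python
def gender_parity_index(female_rate: float, male_rate: float) -> float:
    """Return the ratio of the female rate to the male rate."""
    if male_rate <= 0:
        raise ValueError("male_rate must be positive")
    return female_rate / male_rate


def interpret_gpi(gpi: float, tolerance: float = 0.0) -> str:
    """Describe which group a GPI value favors.

    `tolerance` is a hypothetical knob: values within 1 +/- tolerance count as parity.
    """
    if abs(gpi - 1.0) <= tolerance:
        return "parity"
    return "favors females" if gpi > 1.0 else "favors males"


# Hypothetical completion rates: 45% for females, 50% for males.
gpi_example = gender_parity_index(female_rate=45.0, male_rate=50.0)
print(round(gpi_example, 2), interpret_gpi(gpi_example))
```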

# Step 2: Data Understanding

The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information. This step is often mixed with the next step, Data Preparation.

**Load Data**

The first dataset we will use is in gender_data_summary.csv. Make sure the file is in the current folder.
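
If you are unsure which folder the notebook is running in, a quick check like the one below can help. This is a small sketch using only the standard library; the helper name `file_in_current_folder` is made up for this example.

```python
from pathlib import Path


def file_in_current_folder(filename: str) -> bool:
    """Return True if `filename` exists in the current working directory."""
    return (Path.cwd() / filename).is_file()


print("Working directory:", Path.cwd())
print("Found gender_data_summary.csv:", file_in_current_folder("gender_data_summary.csv"))
```

If the second line prints `False`, move the .csv file into the printed working directory before loading it.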

Open the .csv file and look through its columns. It contains country names and their corresponding regions. It includes completion rates (%) for the primary, lower secondary, upper secondary and tertiary levels for males, females and both sexes. For the economic data, the extreme poverty headcount ratio and the income group are provided. Once you are familiar with the data, you can load the Python libraries and start the data analysis.

In [None]:
# #Uncomment the lines below with CTRL+/ and run the cell.
# import pandas as pd
# import matplotlib.pyplot as plt
# import warnings
# warnings.filterwarnings('ignore')
# %matplotlib inline

In [None]:
# df_gendata = pd.read_csv('gender_data_summary.csv')
# df_gendata.head()

**Check Data Quality**

Next, check the data quality. The most common check is for missing (null) values.

In [None]:
# # Task 1: Check out Basic Dataframe info
# df_gendata.info()

In [None]:
# #Checking if there are any NA values
# df_gendata.isnull().sum()

In [None]:
# #Print the row at position 37 (counting from 0), which contains NA values
# print(df_gendata.iloc[[37]])

In [None]:
# # Task 2: Clean up Data, Replace NA with 0.
# df_gendata.replace('NA', pd.NA, inplace=True)
# df_gendata.fillna(0.0, inplace=True)
# print(df_gendata.iloc[[37]])
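
Replacing missing values with 0 keeps the later cells simple, but pandas then treats each 0 as a real measurement. The sketch below uses a made-up four-row table (not values from gender_data_summary.csv) to show how this affects an average.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value (hypothetical numbers).
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "rate": [80.0, 60.0, np.nan, 40.0],
})

# By default, mean() skips missing values: (80 + 60 + 40) / 3
mean_skipping_missing = toy["rate"].mean()

# After filling with 0, the missing value counts as 0: (80 + 60 + 0 + 40) / 4
mean_after_zero_fill = toy["rate"].fillna(0.0).mean()

print(mean_skipping_missing)  # 60.0
print(mean_after_zero_fill)   # 45.0
```

Keep this in mind when reading the regional averages later in the notebook: countries without data pull the average towards 0.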

In [None]:
# #Task 3: Check out statistics of Numeric Columns
# df_gendata.describe()

# Step 3: Plotting the education data

In [None]:
# # Task 4: Plot bar chart for Primary Completion rate for male, female and total for 1 country
# country = 'Afghanistan'
# country_data = df_gendata[df_gendata['Country'] == country]
# primary_male = country_data['Completion rate, primary, male (%)'].values[0]
# primary_female = country_data['Completion rate, primary, female (%)'].values[0]
# primary_total = country_data['Completion rate, primary, both (%)'].values[0]
# #Change the country name to see other country statistics

# labels = ['Male', 'Female', 'Total']
# primary_values = [primary_male, primary_female, primary_total]

# #Plotting "Completion rate for Primary" for male, female and total
# fig, ax = plt.subplots(figsize=(8, 8))
# ax.bar(labels, primary_values)
# plt.title(f'Completion rate, primary for {country}')
# for i, v in enumerate(primary_values):
#     plt.text(i, v + 1, str(round(v, 2)), ha='center', va='bottom')

# plt.show()


**Source:** UNESCO Institute for Statistics (UIS), 2024.  
[Retrieved from UIS SDG 4 Database](https://sdg4-data.uis.unesco.org/)


In [None]:
# #Task 5: Merge the above bar chart with Lower Secondary Completion rate
# #for male, female and total for 1 country
# lower_sec_male = country_data['Completion rate, lower secondary, male (%)'].values[0]
# lower_sec_female = country_data['Completion rate, lower secondary, female (%)'].values[0]
# lower_sec_total = country_data['Completion rate, lower secondary, both (%)'].values[0]

# # Define lower secondary values for the bar chart
# lower_sec_values = [lower_sec_male, lower_sec_female, lower_sec_total]  # Lower secondary data

# # Set the width of the bars
# bar_width = 0.35

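# # Create the bar chart
# fig, ax = plt.subplots(figsize=(8, 8))
# positions = list(range(len(labels)))  # numeric x positions, one per label
# primary_bars = ax.bar([p - bar_width / 2 for p in positions], primary_values, bar_width, label='Primary')  # Primary bars (shifted left)
# lower_sec_bars = ax.bar([p + bar_width / 2 for p in positions], lower_sec_values, bar_width, label='Lower Secondary')  # Lower secondary bars (shifted right)
# ax.set_xticks(positions)
# ax.set_xticklabels(labels)  # center each label under its pair of bars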

# # Add labels and title
# plt.title(f'Completion Rates for {country}')
# plt.xlabel('Gender')
# plt.ylabel('Completion Rate (%)')

# # Add data labels to the bars
# for bar in primary_bars + lower_sec_bars:
#     height = bar.get_height()
#     x = bar.get_x() + bar.get_width() / 2
#     ax.text(x, height + 1, str(round(height, 2)), ha='center', va='bottom')

# # Add legend
# plt.legend()

# # Display the chart
# plt.show()


**Source:** UNESCO Institute for Statistics (UIS), 2024. Retrieved from https://sdg4-data.uis.unesco.org/

In [None]:
# #Task 6: Find the average tertiary Enrolment ratio for the regions and plot as pie chart (both sexes, female).
# # Group data by region and calculate the average tertiary enrolment ratio
# region_tertiary_avg = df_gendata.groupby('Region')['Gross Enrolment ratio for tertiary, both (%)'].mean()
# region_tertiary_female_avg = df_gendata.groupby('Region')['Gross Enrolment ratio for tertiary, female (%)'].mean()

# # Create the first pie chart for region_tertiary_avg
# plt.figure(figsize=(4, 4))  # Adjust figure size as needed
# plt.pie(region_tertiary_avg, labels=region_tertiary_avg.index, autopct='%1.1f%%', startangle=90)
# plt.title('Average Tertiary Enrollment Ratio by Region (both sexes)')

# # Create the second pie chart for region_tertiary_female_avg
# plt.figure(figsize=(4, 4))  # Adjust figure size as needed
# plt.pie(region_tertiary_female_avg, labels=region_tertiary_female_avg.index, autopct='%1.1f%%', startangle=90)
# plt.title('Average Tertiary Enrollment Ratio by Region (Female)')

# plt.show()

# Step 4: Plotting the Extreme Poverty Headcount Ratio

In [None]:
# # Task 7: Plot pie chart for Extreme Poverty Headcount Ratio vs region
# # Calculating average for Poverty headcount data by region
# region_pov_headcount = df_gendata.groupby('Region')['Poverty Headcount Ratio (%)'].mean()

# # Create the first pie chart for region_pov_headcount
# plt.figure(figsize=(6, 6))  # Adjust figure size as needed
# plt.pie(region_pov_headcount, labels=region_pov_headcount.index, autopct='%1.1f%%', startangle=90)
# plt.title('Extreme Poverty Headcount Ratio by Region (All)')

# plt.show()

**Source:** World Bank - Poverty headcount ratio at $2.15 a day (2017 PPP) (% of population), 2024. Retrieved from https://data.worldbank.org/indicator/SI.POV.DDAY?view=chart

# Step 5: Plotting Gender Parity vs Income level

For tasks 8 and 9, we will use the .csv file, "gender_parity_income_region.csv".

In [None]:
# # Task 8: Plot gender parity by income group at the upper secondary level.

# # Load the CSV file into a DataFrame
# df_income_region = pd.read_csv("gender_parity_income_region.csv")

# # Display the first few rows of the DataFrame
# print(df_income_region.head())

In [None]:
# # Replace NA values with 0.00
# df_income_region["value"] = df_income_region["value"].fillna(0.00)

# # Print the last five rows of the new df_income_region DataFrame.
# print(df_income_region.tail())

Looking at the last five rows of the new dataframe above, we see that there are rows in the dataframe that are tagged with 'region' in the "region_or_income" column, even though the id is an income level group, and not a geographical region. Let's change that in our new DataFrame and put "income group" in that column, instead of "region".

In [None]:
# # Create a list of the names of country income level groups.
# income_levels = ["Low income", "Lower middle income", "Upper middle income", "High income"]

# # Replace values in 'region_or_income' for rows where 'id' is in the income_levels list
# df_income_region.loc[df_income_region["id"].isin(income_levels), "region_or_income"] = "income group"

# # Explanation:
# # df["id"].isin(income_levels): Checks if the value in the "id" column is in the income_levels list.
# # .loc[...]: Selects the rows where the condition is True.
# # ["region_or_income"] = "income group": Updates the "region_or_income" column for those rows to "income group".

# # Print a few rows to check the changes
# print(df_income_region.tail())
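
To see the masking step in isolation, here is a minimal sketch on a three-row toy table. The row "Region X" is a hypothetical placeholder. The column names mirror those used in the cell above.

```python
import pandas as pd

# Toy table: two income groups mislabelled as 'region', plus one real region.
toy = pd.DataFrame({
    "id": ["Low income", "High income", "Region X"],
    "region_or_income": ["region", "region", "region"],
})
income_levels = ["Low income", "Lower middle income", "Upper middle income", "High income"]

# Boolean mask with one True/False entry per row.
is_income_group = toy["id"].isin(income_levels)

# Update the label only in the rows where the mask is True.
toy.loc[is_income_group, "region_or_income"] = "income group"
print(toy)
```

Only the first two rows change, and the "Region X" row keeps its 'region' label.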

Now we will look at the most recent year available for each income group and read off its gender parity value for upper secondary education. This lets us compare the income groups on their latest available data. Because missing values were replaced with zero, we remove the zero-valued rows first, so that a year without data is never selected as the most recent one.

In [None]:
# # Filter rows where the 'region_or_income' column says 'income group'
# df_income_only = df_income_region[df_income_region["region_or_income"] == "income group"]

# # Step 1: Remove rows where 'value' is 0.00 **before selecting the latest year**
# df_income_only = df_income_only[df_income_only["value"] != 0.00]

# # Step 2: Select the most recent year for each income group
# parity_by_income_group = df_income_only.loc[df_income_only.groupby("id")["year"].idxmax()]

# # Print the result
# print(parity_by_income_group)
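
The "filter, then pick the latest year per group" pattern can be checked on a small example. The values below are invented. The sketch only assumes the same three column names ("id", "year", "value") used in the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "id":    ["Low income", "Low income", "High income", "High income"],
    "year":  [2019, 2020, 2020, 2021],
    "value": [0.80, 0.00, 1.02, 1.03],
})

# Drop the zero-valued rows first, otherwise 2020 would be picked for "Low income".
nonzero = toy[toy["value"] != 0.00]

# idxmax() returns, for each group, the index label of the row with the largest year.
latest_index = nonzero.groupby("id")["year"].idxmax()

# .loc[] with those labels pulls out one full row per group.
latest_rows = nonzero.loc[latest_index]
print(latest_rows)
```

For "Low income" the selected row is the one from 2019, because the 2020 row had no data.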

Is parity_by_income_group a new pandas DataFrame? Yes. Because we used .loc[] to select a set of rows from df_income_only, the result keeps the structure of a DataFrame. You can verify this by running the following line, which shows the type of the object.

In [None]:
# # If you want to check this, you can uncomment and run the next line of code.
# type(parity_by_income_group)

In [None]:
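# # Plot the results.
# import matplotlib.pyplot as plt
# # piplite is only available in JupyterLite (browser-based) notebooks.
# # In a regular Python environment, remove the next two lines and install seaborn with pip instead.
# import piplite
# await piplite.install('seaborn')
# import seaborn as sns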

# # Set seaborn style for better visuals
# sns.set_style("whitegrid")

# # Define custom order for the x-axis (income groups)
# income_order = ["Low income", "Lower middle income", "Upper middle income", "High income"]

# # Create the bar plot with custom order for income groups
# plt.figure(figsize=(8, 5))  # Set figure size
# ax = sns.barplot(
#     x="id", y="value", data=parity_by_income_group, palette="viridis", width=0.6,
#     order=income_order  # Apply custom order
# )

# # Add value labels on top of each bar
# for p in ax.patches:
#     ax.annotate(f"{p.get_height():.2f}",  # Format to 2 decimal places
#                 (p.get_x() + p.get_width() / 2, p.get_height()),
#                 ha="center", va="bottom", fontsize=11, fontweight="bold", color="black")

# # Add a horizontal line at 1 to indicate gender parity
# plt.axhline(y=1, color="red", linestyle="--", linewidth=1.5, label="Gender Parity (1.0)")

# # Customize labels and title
# plt.xlabel("Income Group", fontsize=12)
# plt.ylabel("Gender Parity Index", fontsize=12)
# plt.title("Gender Parity by Income Group at Upper Secondary Level", fontsize=14)

# # Rotate x-axis labels for better readability
# plt.xticks(rotation=30)

# # Add legend
# plt.legend()

# # Show the plot
# plt.show()

**Source:** UNESCO World Inequality Database on Education 2024.  
Retrieved from https://www.education-progress.org/en/articles/equity

Note that these values are for entire income level groups and that individual countries' values will differ. You have the latest available values by country in the first DataFrame we made in an earlier step, from the 'gender_data_summary.csv' file.
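
As a rough sketch of a country-level comparison, a female-to-male ratio can be derived from two completion-rate columns. The rows below are invented, and the upper secondary column names are an assumption that follows the naming pattern of the primary and lower secondary columns used earlier. Check them against your DataFrame before reusing this code. The result is a plain ratio and may not match a published index.

```python
import numpy as np
import pandas as pd

# Assumed column names (pattern copied from the earlier completion-rate columns).
female_col = "Completion rate, upper secondary, female (%)"
male_col = "Completion rate, upper secondary, male (%)"

# Hypothetical rows; country C has no data, stored as 0 after the earlier clean-up.
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    female_col: [45.0, 66.0, 0.0],
    male_col: [50.0, 60.0, 0.0],
})

# Treat 0 as "no data" so that 0 / 0 does not produce a misleading value.
female = toy[female_col].replace(0.0, np.nan)
male = toy[male_col].replace(0.0, np.nan)

toy["female_to_male_ratio"] = female / male
print(toy[["Country", "female_to_male_ratio"]])
```

Country C ends up with a missing ratio rather than a number, which makes the absence of data visible.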

In [None]:
# # Task 9: Plot gender parity by geographic region at the upper secondary level.
# # Filter rows where 'region_or_income' is 'region'
# df_region_only = df_income_region[df_income_region["region_or_income"] == "region"]

# # Step 1: Remove rows where 'value' is 0.00 **before selecting the latest year**
# df_region_only = df_region_only[df_region_only["value"] != 0.00]

# # Step 2: Select the most recent year for each region
# parity_by_geo_region = df_region_only.loc[df_region_only.groupby("id")["year"].idxmax()]

# # Print the result
# print(parity_by_geo_region)

In [None]:
# # Plot the results

# # Create the bar plot
# plt.figure(figsize=(10, 6))  # Set figure size in inches
# ax = sns.barplot(
#     x="id", y="value", data=parity_by_geo_region, palette="viridis", width=0.6
# )

# # Add value labels on top of each bar
# for p in ax.patches:
#     ax.annotate(f"{p.get_height():.2f}",  # Format to 2 decimal places
#                 (p.get_x() + p.get_width() / 2, p.get_height()),
#                 ha="center", va="bottom", fontsize=11, fontweight="bold", color="black")

# # Add a horizontal line at 1 to indicate gender parity
# plt.axhline(y=1, color="red", linestyle="--", linewidth=1.5, label="Gender Parity (1.0)")

# # Customize labels and title
# plt.xlabel("Geographic Region", fontsize=12)
# plt.ylabel("Gender Parity Index", fontsize=12)
# plt.title("Gender Parity by Geographic Region, Upper Secondary", fontsize=14)

# # Rotate x-axis labels for better readability
# plt.xticks(rotation=45)

# # Add legend
# plt.legend()

# # Show the plot
# plt.show()

**Source:** UNESCO World Inequality Database on Education 2024.
Retrieved from https://www.education-progress.org/en/articles/equity

To explore further on your own, you can load the file "sdg_sept2024_4.5.1_upper_secondary.csv", which contains a (modelled) gender parity index for upper secondary education by country. Source: UNESCO GPIA 2024, available at https://sdg4-data.uis.unesco.org/.

In [None]:
# Follow the steps above starting with loading data.
# Write your code here.

