# **Project Name - World Bank Global Education Analysis**



##### **Project Type**    - ***Exploratory Data Analysis (EDA)***
##### **Contribution**    - Individual Project
##### **Team Size**    - 1 (Individual)




**The dataset is stored in the project directory and loaded using a relative path to ensure the notebook can be executed on any system without dependency on Google Drive.**

# **Project Summary**

**World Bank Global Education Analysis (Exploratory Data Analysis)**

This project focuses on performing an in-depth Exploratory Data Analysis (EDA) on the World Bank Global Education dataset to understand patterns, disparities, and key factors influencing education outcomes across countries and regions. The dataset contains internationally comparable indicators related to literacy, enrollment, education expenditure, school infrastructure, and access to education.

The analysis involved systematic data cleaning, transformation, and integration of multiple datasets, including country-level information and education indicators. Missing values and outliers were handled using appropriate statistical techniques such as IQR-based outlier removal, ensuring reliable and meaningful insights.

A comprehensive EDA was conducted using:

* Univariate analysis to understand individual indicator distributions

* Bivariate analysis to study relationships between education indicators (e.g., sanitation vs dropout, spending vs enrollment)

* Multivariate analysis to examine combined effects of literacy, enrollment, and expenditure

More than 20 meaningful visualizations were created using Matplotlib and Seaborn, including histograms, box plots, scatter plots, heatmaps, pair plots, and pie charts. Each visualization was supported with clear reasoning, insights, and business implications.

Key findings revealed that:

* School infrastructure, especially sanitation facilities, has a strong impact on dropout rates, particularly among adolescent students

* Higher education spending does not always guarantee better outcomes, highlighting the importance of spending efficiency

* Significant disparities exist across regions and income groups, emphasizing the need for targeted, region-specific education policies

The project successfully demonstrates how data-driven insights can support policymakers, governments, and international organizations like the World Bank in making informed decisions to improve education access, quality, and retention globally.

✅ Skills Demonstrated

* Data Cleaning & Preparation
* Exploratory Data Analysis
* Data Visualization & Storytelling
* Business Insight Generation
* Policy-oriented Decision Support

# **GitHub Link - Pending**

It will be added after the data analysis and insights are complete.

# **Problem Statement**


Education is one of the most important indicators of a country’s social and economic development. However, the level of access to education, literacy rates, enrollment, completion rates, and government spending on education vary significantly across countries and regions.

The World Bank EdStats dataset provides a comprehensive collection of global education indicators, but due to its large size and complexity, it is difficult to directly interpret patterns and trends without proper analysis.

***The objective of this project is to explore, analyze, and visualize global education indicators to understand how education outcomes differ across countries, regions, and income groups, and to identify key patterns, gaps, and inequalities in global education systems.***


#### **Define Your Business Objective?**

***The primary objective of this project is to generate data-driven insights from global education indicators that can support policy-making and strategic decision-making by governments, international organizations, and educational institutions.***

Through exploratory data analysis, this project aims to identify countries and regions that perform well in education, as well as those that require greater attention and investment. The insights derived from this analysis can help policymakers understand the relationship between education spending, enrollment, literacy, and learning outcomes, and assist in designing targeted education policies and resource allocation strategies.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset

In [None]:
# Relative paths: the CSV files are expected in the notebook's working directory
data = pd.read_csv('EdStatsData.csv')
country = pd.read_csv('EdStatsCountry.csv')
series = pd.read_csv('EdStatsSeries.csv')
footnote = pd.read_csv('EdStatsFootNote.csv')
cnt_series = pd.read_csv('EdStatsCountry-Series.csv')

*The datasets were loaded successfully using pandas.
The EdStatsData dataset contains country-wise and year-wise education indicators, while EdStatsCountry provides regional and income group information, and EdStatsSeries contains indicator descriptions.*
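Because the notebook is meant to run on any machine, the loading step could be wrapped in a small helper that resolves file names against a project directory and fails with a clear message when a file is missing. The sketch below is one possible way to do that; `DATA_DIR` and `load_csv` are hypothetical names and are not part of the original notebook.

```python
from pathlib import Path

import pandas as pd

# Hypothetical location of the CSV files, relative to the notebook
DATA_DIR = Path(".")


def load_csv(filename, data_dir=DATA_DIR):
    """Read one CSV from data_dir, raising a clear error if it is absent."""
    path = Path(data_dir) / filename
    if not path.is_file():
        raise FileNotFoundError(f"Expected dataset file not found: {path}")
    return pd.read_csv(path)
```

With such a helper, a call like `load_csv('EdStatsData.csv')` would replace the direct `pd.read_csv` call, and a missing file would be reported by name instead of surfacing as a less specific error later in the notebook.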

In [None]:
plt.style.use('default')

### Dataset First View

In [None]:
#dataset first view by using head

In [None]:
data.head()

In [None]:
country.head()

In [None]:
series.head()

In [None]:
footnote.head()

In [None]:
cnt_series.head()

**The dataset contains a very large number of rows and columns.**

**The year-wise data is structured in a wide format.**

**There are many missing values present, which is expected in global-level data.**

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
data.shape

In [None]:
country.shape

In [None]:
series.shape

**The main data CSV file contains 886930 rows and 70 columns.**

**The country CSV file contains 241 rows and 32 columns.**

**The series CSV file contains 3665 rows and 21 columns.**

### Dataset Information

In [None]:
# Dataset Info
data.info()

In [None]:
country.info()

In [None]:
series.info()

**The data contain many null values. Duplicates and missing values are handled in the following steps.**

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
data.duplicated().sum()

In [None]:
country.duplicated().sum()

In [None]:
series.duplicated().sum()

**Note - The data CSV file doesn't contain any duplicate rows.**

#### Missing Values/Null Values (Percentage per Column)


In [None]:
# Missing Values/Null Values Count


In [None]:
data.isnull().mean()*100

In [None]:
country.isnull().mean()*100

In [None]:
series.isnull().mean()*100

*Missing values are expected in global education datasets due to inconsistent reporting across countries and years. Therefore, missing values were handled contextually during analysis instead of being removed wholesale, because blanket removal would cause a large loss of data.*
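To decide where contextual handling is needed, it can help to view the count and the percentage of missing values side by side, sorted from worst to best. The following is a minimal sketch on a made-up frame; `missing_report` is a hypothetical helper and the toy values are not taken from the dataset.

```python
import numpy as np
import pandas as pd


def missing_report(frame):
    """Return missing-value count and percentage per column, worst first."""
    report = pd.DataFrame({
        "missing_count": frame.isnull().sum(),
        "missing_pct": frame.isnull().mean() * 100,
    })
    return report.sort_values("missing_pct", ascending=False)


# Toy frame standing in for the wide EdStatsData layout (values are made up)
toy = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "CCC", "DDD"],
    "1990": [np.nan, np.nan, 55.0, np.nan],
    "2000": [60.0, np.nan, 58.0, 71.0],
})
report = missing_report(toy)
print(report)
```

The same function could be applied to `data`, `country`, and `series` in turn.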

#### Visualizing the missing values by using Bar Plot

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(14,5))

# Plot 1: Missing Values Count
data.isnull().sum().plot(kind='bar', ax=axes[0])
axes[0].set_title("Missing Values Count per Column (Chart no 1.1)")
axes[0].set_ylabel("Count of Missing Values")
axes[0].tick_params(axis='x', rotation=45)

# Plot 2: Missing Percentage
(data.isnull().mean() * 100).plot(kind='bar', ax=axes[1])
axes[1].set_title("Percentage of Missing Values (Chart no 1.2)")
axes[1].set_ylabel("Missing Percentage (%)")
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()


### What did you know about your dataset?

After exploring the dataset, I understood its overall structure, quality, and key characteristics. The dataset contains both numerical and categorical variables related to the problem domain.

### Dataset Understanding

**The dataset consists of 5 files:**

1. EdStatsData.csv 👉 Main data (Indicators + Values)

2. EdStatsCountry.csv 👉 Country info (Region, Income group)

3. EdStatsSeries.csv 👉 Indicator explanation

4. EdStatsFootNote.csv 👉 Extra notes

5. EdStatsCountry-Series.csv 👉 Country–Series mapping

**👉 Main Data Files for EDA:**

**1. EdStatsData.csv**
  * File name considered - **data**
  * Dataset rows = **886930**
  * Columns = **70**
  * Numerical + categorical columns present
  * Missing values - Yes (shown in Charts 1.1 and 1.2)

**2. EdStatsCountry.csv**
  * File name considered - **country**
  * Dataset rows = **241**
  * Columns = **32**
  * Categorical columns present
  * Missing values  - Yes

**3. EdStatsSeries.csv**
  * File name considered - **series**
  * Dataset rows = **3665**
  * Columns = **21**
  * Categorical columns present
  * Missing values - Yes


We can't simply drop the columns with missing values, because the share of missing values is well above the 3-5% range in which dropping is usually safe. Dropping them directly would discard a large amount of important data.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
country.columns

In [None]:
series.columns

### Variables Description

1️⃣ Understanding Variables in EdStatsData (Main Data)

| Variable         | Type             | Meaning             | Role                 |
| ---------------- | ---------------- | ------------------- | -------------------- |
| Country Name     | Categorical      | Country's name      | Grouping             |
| Country Code     | Categorical      | ISO code            | Join key             |
| Indicator Name   | Categorical      | Education indicator | Feature              |
| Indicator Code   | Categorical      | Indicator ID        | Mapping              |
| Year (1960–2022) | Numerical (Time) | Observation year    | Trend                |
| Value            | Numerical        | Indicator value     | **Target / Measure** |

2️⃣ Understanding Variables in EdStatsCountry (Country data)

| Variable     | Type        | Meaning                     |
| ------------ | ----------- | --------------------------- |
| Country Name | Categorical | Country's official name     |
| Country Code | Categorical | Unique country identifier   |
| Region       | Categorical | Geographic region           |
| Income Group | Categorical | Economic classification     |
| LendingType  | Categorical | World Bank lending category |

3️⃣ Understanding Variables in EdStatsSeries (Indicator data)

| Variable        | Type        | Meaning               |
| --------------- | ----------- | --------------------- |
| Series Name     | Categorical | Indicator name        |
| Series Code     | Categorical | Unique indicator ID   |
| Topic           | Categorical | Education category    |
| Long Definition | Text        | Indicator explanation |
| Source          | Categorical | Data source           |



### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
data.nunique()


In [None]:
country.nunique()

In [None]:
series.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
data.isnull().mean().sort_values(ascending=False)

In [None]:
year_cols = data.columns[4:]   # every column after Country Name, Country Code, Indicator Name and Indicator Code
data = data.dropna(subset=year_cols, how='all')

In [None]:
print(year_cols)

In [None]:
data = data.dropna(axis=1, how='all')  # drop columns in which every value is NaN

In [None]:
data.shape

In [None]:
data.isnull().mean()*100

In [None]:
data.head()

In [None]:
# Keep only the columns whose names are years (all digits)
year_cols = [col for col in data.columns if col.isdigit()]
print(year_cols)

In [None]:
df_long = data.melt(
    id_vars=['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code'],
    value_vars=year_cols,
    var_name='Year',
    value_name='Value'
)

df_long['Year'] = df_long['Year'].astype(int)
df_long.head(20)

In [None]:
df = df_long.merge(
    country[['Country Code', 'Region', 'Income Group']],
    on='Country Code',
    how='left'
)

df = df.dropna(subset=['Value'], how='all')


In [None]:
df_country = df.dropna(subset=['Region', 'Income Group'])


In [None]:
df_series = df.merge(
    series[['Series Code', 'Topic']],
    left_on='Indicator Code',
    right_on='Series Code',
    how='left'
)


In [None]:
df_series.head()

In [None]:
df_series['Topic'].unique()

**👉 EdStatsCountry and EdStatsSeries datasets primarily contain categorical variables such as country names, regions, income groups, and indicator descriptions. These datasets do not include numerical variables and are mainly used for metadata enrichment and contextual understanding. The numerical analysis is primarily conducted using the EdStatsData dataset, which contains time-series values of education indicators.**
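When metadata is attached with a left merge, it is worth checking that the lookup table has one row per key and seeing which codes found no match (in this dataset, rows such as regional aggregates may have no Region). The sketch below shows one way to do this with the `validate` and `indicator` options of `DataFrame.merge`; the frames `values` and `meta` are toy stand-ins with made-up codes, not the real files.

```python
import pandas as pd

# Toy stand-ins for df_long and country (codes and values are made up)
values = pd.DataFrame({
    "Country Code": ["AAA", "AAA", "BBB", "ZZZ"],
    "Year": [2000, 2001, 2000, 2000],
    "Value": [10.0, 11.0, 20.0, 30.0],
})
meta = pd.DataFrame({
    "Country Code": ["AAA", "BBB"],
    "Region": ["Region 1", "Region 2"],
})

checked = values.merge(
    meta,
    on="Country Code",
    how="left",
    validate="many_to_one",  # raises MergeError if meta repeats a code
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Codes with no metadata row
unmatched = checked.loc[checked["_merge"] == "left_only", "Country Code"].unique()
print(unmatched)
```

A left merge validated this way keeps the row count of the left frame unchanged, so any growth in row count after a merge points to duplicate keys.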

In [None]:
df.shape

In [None]:
df.head(20)

In [None]:
df.isnull().mean()

In [None]:
df['Indicator Name'].nunique()  # number of distinct values in the "Indicator Name" column


***`nunique()` returns how many different values exist in the "Indicator Name" column.***

In [None]:
df['Indicator Name'].unique()


***We can't visualize all 3665 indicators. That is far too many for a single analysis.***

In [None]:
df['Indicator Name'].value_counts().head(50)

🏆 Recommended Indicators (Beginner-Friendly & Strong)

🔤 Literacy

Literacy rate, adult total (% of people ages 15 and above)
➡️ Education's basic outcome

🏫 Enrollment

School enrollment, primary (% gross)

School enrollment, secondary (% gross)
➡️ Access to education

💰 Expenditure

Government expenditure on education, total (% of GDP)
➡️ Investment in education

👩‍🏫 Teachers

Pupil-teacher ratio, primary
➡️ Quality indicator

👶 Population (supporting)

Population ages 0–14 (% of total)
➡️ Education demand context

### What all manipulations have you done and insights you found?

👉 Insight:

* Missing values are common in education data.

* Not every indicator is available for all countries.

* The dataset contains a mix of numerical and categorical variables.

* The main numerical information is held in the `Value` column.

| Column         | Insight                                  |
| -------------- | ---------------------------------------- |
| Country Name   | Unique per country → categorical         |
| Country Code   | Unique per country                       |
| Indicator Name | Thousands → high-cardinality categorical |
| Indicator Code | Unique indicator identifiers             |
| Year columns   | Numerical (time-series)                  |
| Values         | Continuous numerical data                |
| Region         | Few categories (low cardinality)         |
| Income Group   | Very few categories                      |

**Series CSV File**

| Column      | Insight            |
| ----------- | ------------------ |
| Series Name | High cardinality   |
| Series Code | Unique             |
| Topic       | Limited categories |
| Source      | Few unique values  |




**We didn’t perform univariate numerical analysis on the Country/Series files** because these files do not contain quantitative variables. They are used for categorization and enrichment of the main dataset.

🔑 **Overall Conclusion**

The EdStatsData dataset is the primary source for numerical analysis, as it contains time-series education indicator values.

The EdStatsCountry and EdStatsSeries datasets consist entirely of categorical variables and are used to enrich the main dataset with country-level and indicator-level context.

Checking unique values helps validate variable types, identify cardinality, and determine the appropriate analytical and visualization techniques to be applied in the exploratory data analysis.

🔥 Use:

* df → overall/global trends

* df_country → region & income analysis

* df_series → multivariate analysis by Topic


✍️ **Due to the presence of thousands of education indicators in the dataset, a focused subset of widely used and policy-relevant indicators was selected for exploratory data analysis. These indicators represent key dimensions of education such as literacy outcomes, enrollment access, education expenditure, and teaching quality.**
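Selecting a focused subset can be sketched as filtering the long-format frame with `isin` and then pivoting so that each chosen indicator becomes its own column, which is the shape needed for scatter plots and correlation checks. The example below uses a toy frame; the indicator names `Literacy` and `Spending` and the list `selected` are placeholders, not the exact names in the dataset.

```python
import pandas as pd

# Toy long-format frame; indicator names and values are illustrative only
long_df = pd.DataFrame({
    "Country Code": ["AAA", "AAA", "BBB", "BBB", "AAA"],
    "Year": [2000, 2000, 2000, 2000, 2000],
    "Indicator Name": ["Literacy", "Spending", "Literacy", "Spending", "Other"],
    "Value": [80.0, 4.0, 60.0, 3.0, 1.0],
})

selected = ["Literacy", "Spending"]  # hypothetical short list of indicators
subset = long_df[long_df["Indicator Name"].isin(selected)]

# One row per country-year, one column per selected indicator
wide = subset.pivot_table(
    index=["Country Code", "Year"],
    columns="Indicator Name",
    values="Value",
    aggfunc="mean",
).reset_index()
print(wide)
```

In the notebook, `selected` would hold the exact strings found in `df['Indicator Name']`, and the pivoted frame could feed the bivariate and multivariate charts directly.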


In [None]:
df.head()

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

**The EdStatsData dataset contains more than three thousand education-related indicators** covering literacy, enrollment, education expenditure, teachers, population, and learning outcomes. Each indicator is uniquely identified using an Indicator Code and described using a human-readable Indicator Name.

**Due to the large number of indicators, the analysis focuses on the most widely available and policy-relevant indicators.**


**UNIVARIATE ANALYSIS (1-5 Charts)**

#### 📊 Chart - 1  Histogram – Adult Literacy Rate

In [None]:
lit = df[df['Indicator Name'] ==
         'Adult literacy rate, population 15+ years, both sexes (%)']

plt.hist(lit['Value'], bins=30)
plt.xlabel('Literacy Rate (%)')
plt.ylabel('Frequency')
plt.title('Distribution of Adult Literacy Rate across Countries')
plt.show()


##### 1. Why did you pick the specific chart?
Ans - A histogram is chosen because it is the best chart for understanding the distribution of a single numerical variable.
Here, the objective is to understand how adult literacy rates are spread across countries and years: whether most observations show low literacy, high literacy, or an even spread.

##### 2. What is/are the insight(s) found from the chart?
Ans - Insights:

* Most countries lie above 60% literacy.
* Very few countries fall below 40% literacy, but those that do are extreme cases.
* The distribution is left-skewed: values cluster at moderate to high literacy rates, with a long tail toward low values.
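The visual impression of left skew can be checked numerically. The sketch below uses a made-up series rather than the real literacy values; in the notebook the same two calls could be applied to `lit['Value']`.

```python
import pandas as pd

# Made-up literacy-like values: most are high, a few are very low
toy_literacy = pd.Series([95, 92, 90, 88, 85, 83, 80, 78, 45, 20], dtype=float)

skewness = toy_literacy.skew()                    # negative means a longer left tail
share_above_60 = (toy_literacy > 60).mean() * 100  # percentage of values above 60

print(f"skewness = {skewness:.2f}, share above 60% = {share_above_60:.0f}%")
```

A negative skewness value supports the left-skew reading of the histogram, and the share above 60% puts a number on the first bullet.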

##### 3. Will the gained insights help creating a positive business impact?
Ans - ✅ Positive Impact:

* Governments and international organizations can focus resources on the bottom-performing countries rather than spreading funds evenly.
* The chart supports targeted literacy programs, improving the return on education spending.

❌ Negative Growth Insight:

* The existence of extreme low-literacy countries indicates education inequality.
* If ignored, these countries may experience low workforce productivity, impacting long-term economic growth.
* Countries in the lower tail indicate stagnant education development.

#### 📊 Chart - 2  Boxplot – Adult Literacy Rate

In [None]:
plt.boxplot(lit['Value'])
plt.title('Outliers in Adult Literacy Rate')
plt.ylabel('Literacy Rate (%)')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer - To detect outliers and inequality in literacy.

##### 2. What is/are the insight(s) found from the chart?

Answer - Insights:

* Large variation exists.
* Some countries lag far behind the global median.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer - ✅ Positive Impact:

* Targets countries needing urgent literacy programs.

❌ Negative Growth Insight:

* Persistent low performers show weak education systems.
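By default, Matplotlib's box plot draws whiskers out to 1.5 times the interquartile range (IQR) beyond the quartiles and marks points past them as outliers, which is the same rule behind the IQR-based outlier handling mentioned in the project summary. A minimal sketch of that rule follows; `iqr_bounds` is a hypothetical helper and the values are made up.

```python
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up literacy-like values with one very low observation
toy = pd.Series([20.0, 70.0, 75.0, 80.0, 85.0, 90.0, 95.0])
lower, upper = iqr_bounds(toy)
outliers = toy[(toy < lower) | (toy > upper)]
print(lower, upper, outliers.tolist())
```

Whether flagged points should be removed is a separate decision: for literacy, very low values may be genuine observations rather than errors.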

#### Chart - 3 Bar Chart ‚Äì Countries per Region


In [None]:
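# Count each country once; df holds one row per country, indicator, and year
df.drop_duplicates('Country Code')['Region'].value_counts().plot(kind='bar')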
plt.title('Number of Countries by Region')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer - This chart is selected to understand regional representation.

##### 2. What is/are the insight(s) found from the chart?

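Answer -
* The bars show how many countries each region contributes, which matters when regional averages are compared.
* Read together with the literacy data, Europe & Central Asia show high and stable literacy rates.
* Sub-Saharan Africa and South Asia have lower medians and wider spread.
* Regional inequality is clearly visible.

This confirms that geography plays a significant role in education outcomes.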

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer-

✅ Positive Impact:

* Enables region-specific funding strategies.
* Helps NGOs prioritize high-need regions.

❌ Negative Growth Insight:

Persistent low literacy in certain regions may lead to:

* Higher unemployment
* Poor human capital development
* Long-term economic stagnation

#### Chart - 4.1, 4.2 📊 Bar Chart – Income Group Distribution & Population

In [None]:
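# Count each country once; df holds one row per country, indicator, and year
df.drop_duplicates('Country Code')['Income Group'].value_counts().plot(kind='bar')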
plt.title('Country Distribution by Income Group')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?
To understand economic segmentation.

##### 2. What is/are the insight(s) found from the chart?
Answer - Majority of countries are middle-income.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Answer - Positive impact:
Middle-income nations can be growth drivers.

Negative insight:
Low-income group remains vulnerable.

In [None]:
# Use df_country so that aggregate rows without a Region or Income Group are excluded
population = df_country[df_country['Indicator Name'] == 'Population, total']
population = population.dropna(subset=['Value'])
latest_year = population['Year'].max()
population_latest = population[population['Year'] == latest_year]
top10_pop = population_latest.sort_values(
    by='Value', ascending=False
).head(10)  # keep 10 rows to match the chart title

plt.figure(figsize=(10,6))
sns.barplot(
    data=top10_pop,
    x='Value',
    y='Country Name'
)
plt.title('Top 10 Most Populated Countries (Latest Year)')
plt.xlabel('Total Population')
plt.ylabel('Country')
plt.show()

A bar chart was selected because it is the most effective way to compare population sizes across multiple countries.
Since population values vary significantly across countries, a bar chart provides clear ranking and easy comparison.

Additionally, focusing on the latest year ensures the analysis reflects the current global demographic scenario, which is essential for education planning.

**Insights**

* A small number of countries account for a large proportion of the global population.
* Highly populated countries face higher demand for schools and greater pressure on teachers and infrastructure.
* Population concentration explains why global education challenges are often driven by a few countries.

This insight highlights the scale of education responsibility in populous nations.


#### Chart - 5 📊 Histogram – Education Expenditure (% GDP)

In [None]:
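# Keep only expenditure indicators expressed as a share of GDP, to match the chart title
spend = df[
    df['Indicator Name'].str.contains('expenditure on education', case=False, na=False) &
    df['Indicator Name'].str.contains('% of GDP', case=False, na=False)
]
# Restrict to a plausible range of GDP shares
spend_final = spend[
    (spend['Value'] > 0) &
    (spend['Value'] < 15)
]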

print(spend_final.describe())

In [None]:
plt.figure(figsize=(6, 4))
sns.histplot(spend_final['Value'], bins=50,kde = True)
plt.title('Education Expenditure Distribution as % of GDP')
plt.xlabel('% of GDP')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?
Answer - To study global spending behavior.

##### 2. What is/are the insight(s) found from the chart?
Answer - Most countries spend **3–6% of GDP.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer -Positive impact:
Benchmarking helps policy formulation.

Negative insight:
Very low spenders risk poor outcomes.

### **BIVARIATE ANALYSIS (6-15 Charts)**

#### Chart - 6 📊 Scatter – Literacy vs Education Expenditure

In [None]:
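# Pair each literacy value with the filtered expenditure value for the same country and year
merged = lit.merge(spend_final, on=['Country Name','Year'],
                   suffixes=('_lit','_spend'))

plt.scatter(merged['Value_spend'], merged['Value_lit'])
plt.xlabel('Education Expenditure (% GDP)')
plt.ylabel('Literacy Rate (%)')
plt.title('Literacy Rate vs Education Expenditure')
plt.show()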


##### 1. Why did you pick the specific chart?
To check the spending–outcome relationship.

##### 2. What is/are the insight(s) found from the chart?

Higher spending is generally associated with higher literacy.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive impact:
Supports investment in education.

Negative insight:
Some high spenders still show low literacy → inefficiency.
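The strength of the spending–literacy relationship can be quantified instead of judged by eye. The sketch below computes a Pearson correlation and a rank-based (Spearman) correlation on a toy frame whose column names mirror the merge suffixes; the numbers are made up and say nothing about the real data.

```python
import pandas as pd

# Toy merged frame; column names mirror the merge suffixes, values are made up
toy = pd.DataFrame({
    "Value_spend": [2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    "Value_lit": [55.0, 70.0, 65.0, 85.0, 90.0, 60.0],
})

pearson = toy["Value_spend"].corr(toy["Value_lit"])
# Spearman computed as the Pearson correlation of the ranks
spearman = toy["Value_spend"].rank().corr(toy["Value_lit"].rank())

print(f"Pearson r = {pearson:.2f}, Spearman rho = {spearman:.2f}")
```

A weak or moderate coefficient would be consistent with the observation that some high spenders still show low literacy. Correlation alone does not establish that spending causes the outcome.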

#### Chart - 7 📊 Line Plot – Global Literacy Trend

In [None]:
global_lit = lit.groupby('Year')['Value'].mean()

plt.plot(global_lit)
plt.title('Global Literacy Trend Over Time')
plt.xlabel('Year')
plt.ylabel('Literacy Rate (%)')
plt.show()


##### 1. Why did you pick the specific chart?
To observe long-term progress.

##### 2. What is/are the insight(s) found from the chart?

Strong improvement after 1990.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive impact:
Shows effectiveness of global education initiatives.

Negative insight:
Growth slows after saturation.
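The yearly average is taken over whichever countries reported a value in that year, so a movement in the line can reflect a change in the reporting mix as well as real progress. A minimal sketch of reporting the mean together with the number of observations per year follows; the frame is a toy example with made-up values.

```python
import pandas as pd

# Toy literacy observations; countries report in different years (made-up values)
toy_lit = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "AAA", "BBB", "CCC", "CCC"],
    "Year": [1990, 1990, 2000, 2000, 2000, 2010],
    "Value": [50.0, 70.0, 60.0, 80.0, 40.0, 45.0],
})

trend = toy_lit.groupby("Year")["Value"].agg(
    mean_literacy="mean",
    n_countries="count",
)
print(trend)
```

Years with very few reporting countries could then be flagged or excluded before the trend is interpreted.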

#### Chart - 8 📊 Box Plot – Literacy Rate by Income Group

In [None]:
lit = df[df['Indicator Name'] ==
         'Adult literacy rate, population 15+ years, both sexes (%)']

sns.boxplot(data=lit, x='Income Group', y='Value')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

Answer - To compare literacy across economies.

##### 2. What is/are the insight(s) found from the chart?

Answer - High-income countries show the highest literacy rates.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer - Positive impact:
Supports targeted funding.

Negative insight:
Low-income inequality persists.
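A box plot by income group is easier to read when the groups appear from lowest to highest income rather than in order of appearance. The sketch below shows one way to impose that order with an ordered categorical; the labels in `income_order` are an assumption and should be adjusted to whatever `country['Income Group'].unique()` returns, and the values are made up.

```python
import pandas as pd

# Assumed ordering; adjust the labels to match the values in the dataset
income_order = ["Low income", "Lower middle income",
                "Upper middle income", "High income"]

# Toy literacy rows with made-up values
toy = pd.DataFrame({
    "Income Group": ["High income", "Low income", "Upper middle income",
                     "Low income", "Lower middle income", "High income"],
    "Value": [99.0, 40.0, 90.0, 50.0, 70.0, 97.0],
})
toy["Income Group"] = pd.Categorical(
    toy["Income Group"], categories=income_order, ordered=True
)

# Medians come out in the declared order, not alphabetical order
medians = toy.groupby("Income Group", observed=False)["Value"].median()
print(medians)
```

The same list can be passed to seaborn as `sns.boxplot(..., order=income_order)` so that the boxes follow the income ladder.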

#### Chart - 9 📊 Literacy Rate vs Primary Enrollment

In [None]:
# Barro-Lee attainment (completed primary schooling, age 15+) is used here as a proxy for primary enrollment
enroll_in = df[df['Indicator Name']=='Barro-Lee: Percentage of population age 15+ with primary schooling. Completed Primary']

merged3 = pd.merge(
    lit, enroll_in,
    on=['Country Code', 'Year'],
    suffixes=('_lit', '_enroll')
)

sns.scatterplot(x='Value_enroll', y='Value_lit', data=merged3)
plt.xlabel('Population 15+ with Completed Primary Schooling (%)')
plt.ylabel('Adult Literacy Rate (%)')
plt.title('Primary Schooling (Enrollment Proxy) vs Literacy')
plt.show()

##### 1. Why did you pick the specific chart?

Answer - A scatter plot is chosen because it shows the relationship between two numerical variables at once.

##### 2. What is/are the insight(s) found from the chart?

Answer -
* High enrollment does not always guarantee high literacy.
* Quality of education matters.

#### Chart - 10 📊 Pupil–Teacher Ratio vs Literacy

In [None]:
ptr = df[df['Indicator Name'].str.contains('Pupil-teacher ratio', case=False, na=False)]

merged3 = pd.merge(
    ptr, lit,
    on=['Country Code', 'Year'],
    suffixes=('_ptr', '_lit')
)

sns.scatterplot(x='Value_ptr', y='Value_lit', data=merged3)
plt.xlabel('Pupil–Teacher Ratio')
plt.ylabel('Adult Literacy Rate (%)')
plt.title('Pupil–Teacher Ratio vs Literacy')
plt.show()


##### 1. Why did you pick the specific chart?

Answer - This scatter plot was selected to examine the relationship between education quality (pupil–teacher ratio) and learning outcomes (literacy rate).

##### 2. What is/are the insight(s) found from the chart?

Answer -
* Higher ratios → lower literacy
* Classroom overcrowding reduces outcomes

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer - ❌ Poor education quality → low-skilled workforce

#### Chart - 11 📊 Literacy Trend Over Time

In [None]:
lit_trend = lit.groupby('Year')['Value'].mean()

plt.plot(lit_trend)
plt.xlabel('Year')
plt.ylabel('Literacy Rate (%)')
plt.title('Global Literacy Trend')
plt.show()


##### 1. Why did you pick the specific chart?

Answer - A line plot is used because it shows how the average literacy rate changes from year to year.

##### 2. What is/are the insight(s) found from the chart?

Answer -
* Steady improvement
* Growth slowing recently

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer - ❌ Saturation without quality improvements

#### Chart - 12 📊 School Toilet Facilities vs Primary Enrollment

In [None]:
toilet = df[df['Indicator Name'].str.contains(
    'toilet', case=False, na=False
)]

enroll = df[df['Indicator Name']=='Barro-Lee: Percentage of population age 15+ with primary schooling. Completed Primary']

merged_toilet = pd.merge(
    toilet, enroll,
    on=['Country Code', 'Year'],
    suffixes=('_toilet', '_enroll')
)

sns.scatterplot(
    data=merged_toilet,
    x='Value_toilet',
    y='Value_enroll'
)
plt.xlabel('Schools with Toilet Facilities (%)')
plt.ylabel('Completed Primary Schooling, age 15+ (%) (enrollment proxy)')
plt.title('Toilet Facilities vs Primary Enrollment')
plt.show()


##### 1. Why did you pick the specific chart?

Answer - This chart is chosen to understand how basic school sanitation infrastructure relates to student enrollment.
Toilet availability is a critical factor, especially for girls’ education and attendance.

##### 2. What is/are the insight(s) found from the chart?

Answer - Insights from the chart:

* Countries with higher toilet facility coverage tend to have higher enrollment.
* Poor sanitation is associated with lower participation.
* The relationship is stronger in developing regions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer - ✅ Positive Impact:

* Supports investment in school infrastructure.
* Improves enrollment and attendance, especially for girls.

❌ Negative Growth Insight:

* Lack of toilets discourages attendance.
* Leads to higher dropout rates and poor human capital development.

#### Chart - 13 📊 Electricity in Schools vs Learning Outcomes

In [None]:
electricity = df[df['Indicator Name'].str.contains(
    'electricity', case=False, na=False
)]

merged_elec = pd.merge(
    electricity, lit,
    on=['Country Code', 'Year'],
    suffixes=('_elec', '_lit')
)

sns.scatterplot(
    data=merged_elec,
    x='Value_elec',
    y='Value_lit'
)
plt.xlabel('Schools with Electricity (%)')
plt.ylabel('Adult Literacy Rate (%)')
plt.title('Electricity Availability vs Literacy Rate')
plt.show()


##### 1. Why did you pick the specific chart?

Answer - This chart is chosen to understand how the availability of electricity in schools relates to the literacy rate. Electricity availability is a critical factor, especially for digital learning.

##### 2. What is/are the insight(s) found from the chart?

Answer -
* Electricity enables digital learning.
* Electricity coverage shows a strong association with better outcomes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer - ❌ Lack of electricity limits technology-based education and skill development

#### Chart - 14 - 📊 School Sanitation vs Dropout Rate

In [None]:
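# Keep every indicator whose name mentions dropout
dropout = df[df['Indicator Name'].str.contains(
    'drop', case=False, na=False
)]

# Keep every indicator whose name mentions toilets
toilet = df[df['Indicator Name'].str.contains(
    'toilet', case=False, na=False
)]

# Pair toilet coverage with dropout values for the same country and year
merged_drop = pd.merge(
    toilet, dropout,
    on=['Country Code', 'Year'],
    suffixes=('_toilet', '_dropout')
)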

sns.scatterplot(
    data=merged_drop,
    x='Value_toilet',
    y='Value_dropout'
)
plt.xlabel('Toilet Facility Coverage (%)')
plt.ylabel('Dropout Rate (%)')
plt.title('School Sanitation vs Dropout Rate')
plt.show()


##### 1. Why did you pick the specific chart?

Answer - We selected this chart to examine the relationship between school sanitation facilities and student dropout rates, because sanitation is a critical yet often overlooked factor influencing student retention, especially in developing regions.

##### 2. What is/are the insight(s) found from the chart?

Answer - Poor sanitation is strongly linked to higher dropout rates

Impact is severe for adolescent students
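
One caveat with keyword filters such as `'drop'` and `'toilet'`: if more than one indicator name matches, merging on `Country Code` and `Year` pairs every matching row on one side with every matching row on the other, so a single country-year can appear several times in the scatter plot. The sketch below shows the effect on invented rows and one possible guard using the `validate` argument of `pd.merge`. The indicator names and values are hypothetical, not taken from the dataset.

```python
import pandas as pd

# Toy long-format rows; indicator names and values are invented for illustration.
toy = pd.DataFrame({
    "Country Code": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "Year": [2010] * 6,
    "Indicator Name": [
        "Schools with toilets (%)",
        "Dropout rate, primary (%)",
        "Dropout rate, lower secondary (%)",
    ] * 2,
    "Value": [40.0, 12.0, 20.0, 90.0, 3.0, 5.0],
})

toilet = toy[toy["Indicator Name"].str.contains("toilet", case=False, na=False)]
dropout = toy[toy["Indicator Name"].str.contains("drop", case=False, na=False)]
keys = ["Country Code", "Year"]

# The keyword filter keeps two dropout indicators, so each toilet row pairs twice.
loose = pd.merge(toilet, dropout, on=keys, suffixes=("_toilet", "_dropout"))

# Keeping one indicator per side makes the key unique; validate enforces that.
one_dropout = dropout[dropout["Indicator Name"] == "Dropout rate, primary (%)"]
strict = pd.merge(
    toilet, one_dropout, on=keys,
    suffixes=("_toilet", "_dropout"), validate="one_to_one",
)

print(len(loose), len(strict))
```

With `validate="one_to_one"`, pandas raises `pandas.errors.MergeError` when either side has repeated keys, which turns a silent duplication into a visible failure.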

#### Chart - 15 - 📊 Distribution of Schools with Toilet Facilities (Global Share)

In [None]:
# Pie chart visualization code
toilet = df[df['Indicator Name'].str.contains(
    'toilet', case=False, na=False
)]

# Convert continuous values into categories;
# include_lowest=True keeps 0% readings in the first bin instead of dropping them
toilet_cat = pd.cut(
    toilet['Value'],
    bins=[0, 50, 75, 100],
    labels=['Poor (<50%)', 'Moderate (50–75%)', 'Good (>75%)'],
    include_lowest=True
)

toilet_dist = toilet_cat.value_counts()

plt.figure(figsize=(6,6))
plt.pie(
    toilet_dist,
    labels=toilet_dist.index,
    autopct='%1.1f%%',
    startangle=140
)
plt.title('Global Distribution of School Toilet Facilities')
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart is ideal when the goal is to show proportion or share of categories within a whole.
Here, instead of comparing numbers, we want to understand:

What percentage of schools globally fall under poor, moderate, or good sanitation coverage

This makes the pie chart a clear and intuitive choice for non-technical stakeholders.

##### 2. What is/are the insight(s) found from the chart?

A significant portion of schools fall under moderate or poor toilet coverage

Only a limited share of schools have good (>75%) sanitation availability

Infrastructure gaps are still widespread globally

This highlights that basic sanitation remains a challenge, not a solved problem.
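
The shares in the pie chart depend on how `pd.cut` treats values on the bin edges. By default the intervals are closed on the right, so with edges `[0, 50, 75, 100]` a reading of exactly 50 lands in the first band and a reading of exactly 0 is left uncategorised unless `include_lowest=True` is passed. A small sketch on made-up coverage values (the shortened labels are for illustration only):

```python
import pandas as pd

# Made-up coverage percentages chosen to sit on or near the bin edges.
values = pd.Series([0.0, 50.0, 50.1, 75.0, 100.0])
bins = [0, 50, 75, 100]
labels = ["Poor", "Moderate", "Good"]

# Default behaviour: intervals are (0, 50], (50, 75], (75, 100].
default_cut = pd.cut(values, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left as well.
inclusive_cut = pd.cut(values, bins=bins, labels=labels, include_lowest=True)

print(default_cut.tolist())
print(inclusive_cut.tolist())

# Share of rows per band, which is what the pie chart displays.
share = inclusive_cut.value_counts(normalize=True)
print(share)
```

Each row in the notebook's long-format table is one country-year reading, so the resulting shares describe the proportion of readings in each band.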

##### 3. Business / Policy Impact (Positive & Negative)

✅ Positive Impact:

Helps governments and NGOs quickly understand infrastructure gaps

Useful for budget allocation decisions in school development programs

Easy to communicate in reports and presentations

❌ Negative Growth Insight:

Poor sanitation discourages attendance, especially for girls

Leads to higher dropout rates and health-related absenteeism

Long-term impact: reduced human capital development

### **MULTIVARIATE ANALYSIS (5 Charts)**

#### Chart - 16 'Literacy vs Education Expenditure by Income Group'

In [None]:
lit = df[df['Indicator Name'] == 'Adult literacy rate, population 15+ years, both sexes (%)']

spend = df[df['Indicator Name']=='Government expenditure on education as % of GDP (%)']

merged = pd.merge(
    lit, spend,
    on=['Country Code', 'Year'],
    suffixes=('_lit', '_spend')
)

In [None]:
sns.scatterplot(
    data=merged,
    x='Value_spend',
    y='Value_lit',
    hue='Income Group_lit'
)
plt.xlabel('Education Expenditure (% of GDP)')
plt.ylabel('Adult Literacy Rate (%)')
plt.title('Literacy vs Education Expenditure by Income Group')
plt.show()

##### 1. Why did you pick the specific chart?

This scatter plot is chosen because it allows us to analyze two numerical variables simultaneously (education spending and literacy rate) while also incorporating a categorical variable (income group) using color encoding.

This makes it ideal for understanding:

Whether higher spending leads to better outcomes

How this relationship differs across economic levels

##### 2. What is/are the insight(s) found from the chart?

High-income countries achieve high literacy even with moderate spending

Low-income countries spend similar amounts but show lower literacy

Spending efficiency differs significantly across income groups

This shows that money alone is not enough; governance and system quality matter.
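
A per-group summary table is one way to check the "similar spending, different literacy" reading of the scatter plot. The sketch below uses invented values arranged to show the pattern; it is not output from the dataset. The column names mirror the notebook's `merged` frame, while the `summary` table and its column names are assumptions for illustration.

```python
import pandas as pd

# Invented rows shaped like the notebook's `merged` frame.
toy = pd.DataFrame({
    "Income Group_lit": ["High income", "High income", "Low income", "Low income"],
    "Value_spend": [4.0, 5.0, 4.0, 5.0],
    "Value_lit": [98.0, 99.0, 50.0, 60.0],
})

# Average spending and literacy per income group, with the row count.
summary = toy.groupby("Income Group_lit").agg(
    mean_spend=("Value_spend", "mean"),
    mean_lit=("Value_lit", "mean"),
    n=("Value_lit", "size"),
)
print(summary)
```

If the real table shows comparable `mean_spend` but very different `mean_lit` across groups, that supports the efficiency argument; if spending also differs sharply, the two effects cannot be separated from this view alone.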

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Impact:

Helps policymakers focus on efficiency, not just budget increase

Encourages low-income countries to adopt best practices

❌ Negative Growth Insight:

Inefficient spending can lead to wasted public funds

Poor literacy impacts workforce quality and long-term GDP growth

#### Chart - 17 Literacy Rate Trend Over Time

In [None]:
trends = lit.groupby(['Year', 'Income Group'])['Value'].mean().reset_index()

sns.lineplot(
    data=trends,
    x='Year',
    y='Value',
    hue='Income Group'
)
plt.ylabel('Average Literacy Rate (%)')
plt.title('Literacy Trend Over Time by Income Group')
plt.show()


##### 1. Why did you pick the specific chart?

A line plot is the best choice for time-series multivariate analysis.
It shows how literacy rates have changed over time while comparing multiple income groups.

##### 2. What is/are the insight(s) found from the chart?

Literacy has improved across all income groups

High-income countries reached saturation early

Low-income countries show improvement but still lag behind

This highlights persistent inequality despite progress.
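
The "improving but still lagging" reading can be checked by comparing the first and last years of the trend table. This is a minimal sketch on made-up values; the group names and numbers are assumptions for illustration, and `change` and `gap` are hypothetical helper names.

```python
import pandas as pd

# Invented literacy readings in the notebook's long format.
toy = pd.DataFrame({
    "Year": [2000, 2010, 2000, 2010],
    "Income Group": ["High income", "High income", "Low income", "Low income"],
    "Value": [97.0, 99.0, 50.0, 62.0],
})

# One row per year, one column per income group.
trends = (
    toy.groupby(["Year", "Income Group"])["Value"]
    .mean()
    .unstack("Income Group")
)

# Improvement per group between the first and last year in the table.
change = trends.iloc[-1] - trends.iloc[0]

# Gap between the two groups in each year.
gap = trends["High income"] - trends["Low income"]

print(change)
print(gap)
```

A positive `change` in every column together with a `gap` that stays large corresponds to the pattern described above: progress everywhere, but persistent inequality.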

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Supports long-term investment in education

❌ Slow growth in low-income countries may lead to:

Skill shortages

Higher unemployment

Economic dependency

#### Chart - 18 Heatmap: Education Indicators by Income Group

In [None]:
enroll = df[df['Indicator Name'].str.contains(
    'Primary school', case=False, na=False
)]
print(enroll['Indicator Name'].unique())

In [None]:
pivot = merged.groupby('Income Group_lit')[['Value_spend', 'Value_lit']].mean()

sns.heatmap(pivot, annot=True, cmap='coolwarm')
plt.title('Education Indicators by Income Group')
plt.show()


##### 1. Why did you pick the specific chart?

A heatmap is perfect for:

Comparing multiple indicators at once

Quickly spotting high and low performance patterns

##### 2. What is/are the insight(s) found from the chart?

Insights

Strong positive relationship between income and education outcomes

High-income countries dominate across all indicators

Clear education inequality visible
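
One thing to keep in mind when reading this heatmap: spending (a few percent of GDP) and literacy (tens of percent) live on very different scales, so a single colour map is dominated by the literacy column. A possible remedy, sketched here on invented group means, is to rescale each column to the 0 to 1 range before colouring. The income-group labels and values are assumptions for illustration.

```python
import pandas as pd

# Invented group means shaped like the notebook's `pivot` table.
pivot = pd.DataFrame(
    {"Value_spend": [3.0, 4.0, 5.0], "Value_lit": [55.0, 80.0, 99.0]},
    index=["Low income", "Middle income", "High income"],
)

# Min-max scale each column so colours compare groups within an indicator.
scaled = (pivot - pivot.min()) / (pivot.max() - pivot.min())
print(scaled)

# The scaled table could then drive the colours while the original values
# are shown as annotations, e.g. sns.heatmap(scaled, annot=pivot).
```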

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer - ✅ Helps donors prioritize low-performing income groups

❌ Persistent inequality can cause global economic imbalance

#### Chart - 19 Pairplot of Key Education Indicators

In [None]:
pair_data = merged[['Value_spend', 'Value_lit', 'Income Group_lit']].dropna()

sns.pairplot(pair_data, hue='Income Group_lit')
plt.show()


##### 1. Why did you pick the specific chart?

Answer - A pair plot enables simultaneous analysis of multiple relationships between numerical variables, grouped by a category.

This is one of the strongest multivariate EDA tools.

##### 2. What is/are the insight(s) found from the chart?

Answer - Literacy correlates positively with spending

Income group separation is clearly visible

Variability is highest in low-income countries
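
The variability claim can be backed with a per-group spread statistic rather than by eye. The sketch below computes the standard deviation and the coefficient of variation of literacy per income group on invented values; it illustrates the calculation and is not a result from the dataset.

```python
import pandas as pd

# Invented rows shaped like the notebook's `pair_data` frame.
toy = pd.DataFrame({
    "Income Group_lit": ["High income"] * 2 + ["Low income"] * 3,
    "Value_lit": [97.0, 99.0, 40.0, 60.0, 80.0],
})

# Mean and sample standard deviation of literacy within each group.
spread = toy.groupby("Income Group_lit")["Value_lit"].agg(["mean", "std"])

# Coefficient of variation: spread relative to the group's own mean.
spread["cv"] = spread["std"] / spread["mean"]
print(spread)
```

The coefficient of variation is useful here because groups with low average literacy can otherwise look less variable than they are in relative terms.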

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer - ✅ Provides holistic understanding for decision-makers

❌ High variability signals unstable education systems

#### Chart - 20 Pupil–Teacher Ratio vs Literacy Rate

In [None]:
ptr = df[df['Indicator Name'].str.contains(
    'Pupil-teacher ratio', case=False, na=False
)]

lit = df[df['Indicator Name'] == 'Adult literacy rate, population 15+ years, both sexes (%)']

merged_ptr = pd.merge(
    ptr, lit,
    on=['Country Code', 'Year'],
    suffixes=('_ptr', '_lit')
)

In [None]:
sns.scatterplot(
    data=merged_ptr,
    x='Value_ptr',
    y='Value_lit',
    hue='Income Group_lit'
)
plt.xlabel('Pupil–Teacher Ratio')
plt.ylabel('Adult Literacy Rate (%)')
plt.title('Pupil–Teacher Ratio vs Literacy Rate by Income Group')
plt.show()

##### 1. Why did you pick the specific chart?

Answer - This scatter plot was selected to examine the relationship between education quality (pupil–teacher ratio) and learning outcomes (literacy rate) while simultaneously considering economic context (income group).

The chart helps answer:

Does classroom crowding affect literacy?

Is the impact consistent across income groups?

##### 2. What is/are the insight(s) found from the chart?

Answer - Countries with lower pupil–teacher ratios generally show higher literacy rates

High-income countries cluster at low ratios and high literacy

Low-income countries often have high ratios and lower literacy

This is consistent with teacher availability being a critical driver of education quality, although a scatter plot shows association rather than causation.
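
Whether the relationship is consistent across income groups can be checked by computing the correlation within each group as well as over the pooled data. The sketch below uses invented values arranged so that the pooled and within-group pictures differ; it is not a result from the dataset. The column names follow the notebook's `merged_ptr` frame, while `pooled` and `by_group` are hypothetical names.

```python
import pandas as pd

# Invented rows shaped like the notebook's `merged_ptr` frame.
toy = pd.DataFrame({
    "Income Group_lit": ["Low income"] * 3 + ["High income"] * 3,
    "Value_ptr": [30.0, 40.0, 50.0, 12.0, 15.0, 18.0],
    "Value_lit": [70.0, 60.0, 50.0, 99.0, 98.0, 99.0],
})

# Correlation over all rows, ignoring income group.
pooled = toy["Value_ptr"].corr(toy["Value_lit"])

# Correlation computed separately inside each income group.
by_group = {
    name: group["Value_ptr"].corr(group["Value_lit"])
    for name, group in toy.groupby("Income Group_lit")
}

print(pooled)
print(by_group)
```

A strongly negative pooled correlation can partly reflect differences between income groups rather than an effect of class size within them, which is why both views are worth reporting.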

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer - ✅ Positive Impact:

Supports hiring more teachers and reducing class sizes

Helps governments justify teacher recruitment budgets

❌ Negative Growth Insight:

High ratios reduce individual student attention

Leads to poor learning outcomes and long-term skill gaps

Negatively affects workforce productivity

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Answer - (World Bank – Global Education Analysis)

🎯 Business Objective

To improve global education access, quality, and retention by identifying key factors influencing enrollment, literacy, and dropout rates using data-driven insights.

🧠 Recommended Solutions & Strategic Suggestions

1️⃣ Prioritize Basic School Infrastructure

The analysis shows that poor sanitation facilities are strongly linked to higher dropout rates, especially among adolescent students.

👉 The client (World Bank / policymakers) should prioritize investment in school sanitation, clean water, and hygiene facilities before or alongside academic reforms.

Why?
Because students cannot continue education in unsafe or unhealthy environments, regardless of curriculum quality.

2️⃣ Target Spending Efficiency, Not Just Spending Volume

Education expenditure as a percentage of GDP does not always translate into higher enrollment or literacy outcomes.

👉 The client should:

Focus on efficient utilization of funds

Monitor outcome-based spending

Invest in teacher training, school management, and accountability mechanisms

Result:
Better outcomes without necessarily increasing budgets.

3️⃣ Region & Income-Group Specific Policies

Low-income and lower-middle-income countries show:

Lower literacy rates

Higher dropout levels

Infrastructure gaps

👉 A one-size-fits-all policy will not work.

Suggestion:
Design customized interventions based on region and income group rather than global averages.

4️⃣ Strengthen Secondary Education Retention

Dropout risk increases significantly at the secondary education level, especially for adolescents.

👉 The client should:

Improve sanitation, safety, and transportation

Introduce incentives such as scholarships or midday meal extensions

Promote gender-sensitive infrastructure

Impact:
Higher retention → stronger workforce → long-term economic growth.

5️⃣ Use Data-Driven Monitoring & Early Warning Systems

EDA reveals that combining indicators (enrollment, literacy, sanitation, spending) provides early signals of risk.

👉 The client should adopt:

Continuous data monitoring

Predictive analytics for dropout risk

Evidence-based decision systems
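
As a rough illustration of how combined indicators could feed an early warning list, the sketch below counts how many simple threshold rules each country trips. Everything in it is hypothetical: the thresholds, the column names, the country codes, and the values are assumptions for illustration, and a real system would calibrate its rules against observed outcomes.

```python
import pandas as pd

# Hypothetical thresholds; not derived from the dataset.
THRESHOLDS = {"toilet_pct": 50.0, "literacy_pct": 70.0, "dropout_pct": 15.0}

# Invented wide-format snapshot, one row per (made-up) country.
toy = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "CCC"],
    "toilet_pct": [35.0, 80.0, 45.0],
    "literacy_pct": [60.0, 95.0, 85.0],
    "dropout_pct": [22.0, 3.0, 10.0],
}).set_index("Country Code")

# One boolean column per warning rule.
flags = pd.DataFrame({
    "low_sanitation": toy["toilet_pct"] < THRESHOLDS["toilet_pct"],
    "low_literacy": toy["literacy_pct"] < THRESHOLDS["literacy_pct"],
    "high_dropout": toy["dropout_pct"] > THRESHOLDS["dropout_pct"],
})

# Number of rules tripped per country; two or more marks it for review.
risk_score = flags.sum(axis=1)
at_risk = risk_score[risk_score >= 2].index.tolist()

print(risk_score)
print(at_risk)
```

A rule count like this is transparent and easy to explain to non-technical stakeholders, which makes it a reasonable first step before any predictive model is considered.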

📈 Expected Positive Business Impact

Improved student retention and literacy rates

Better return on education investments

Reduced inequality across regions

Stronger human capital development

⚠️ Risk of Negative Growth (If Ignored)

If infrastructure gaps, especially sanitation, are not addressed:

Dropout rates will remain high

Female and adolescent education will suffer

Long-term productivity and economic growth will decline


# **Conclusion**

📝 Final One-Line Conclusion

To achieve the business objective, the client should focus on improving basic school infrastructure, ensuring efficient education spending, and implementing region-specific, data-driven education policies to enhance global education outcomes.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***