## SDG 4: Quality Education

### Problem Statement:
Despite significant progress in global education, challenges such as inequality, a lack of resources, and inadequate teaching methods persist. Many children and adults still lack access to quality education, particularly in marginalised communities. These challenges are exacerbated by insufficient data and analysis to inform policy and practice.

This project addresses these issues by analysing educational data to identify critical areas for intervention and propose actionable solutions to improve the quality of education.


### Objective of the Project:
The primary objective is to analyse educational data, identify the key factors affecting educational quality, and propose data-driven solutions that improve educational outcomes.
The analysis is framed by the following SDG 4 targets:
- By 2030, ensure all girls and boys complete free, equitable, and quality primary and secondary education.
- Provide access to quality early childhood development, care, and pre-primary education for all children.
- Guarantee equal access to affordable and quality technical, vocational, and tertiary education for all.
- Increase the number of youth and adults with relevant skills for employment and entrepreneurship.
- Eliminate gender disparities and ensure equal access to all education levels for vulnerable groups.
- Achieve literacy and numeracy for all youth and a substantial proportion of adults.
- Equip all learners with knowledge and skills for sustainable development, human rights, and global citizenship.


### Features:
Key features of the dataset will include:
- Demographics: Age, Gender, and Socio-economic status of students.
- Performance Metrics: Test scores, graduation rates, literacy rates.
- Resource Allocation: Availability of teachers, infrastructure.
- Attendance and Enrollment: Enrollment rates, attendance records.
- Contextual Factors: Economic indicators, geographical data.
- Educational Levels: Data on pre-primary, primary, and secondary education.
- Technological Integration: Access to and use of educational technology and digital learning resources.
- Special Needs Education: Provisions for students with disabilities and special educational needs.


### Importing Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, confusion_matrix, classification_report
import warnings
warnings.filterwarnings('ignore')

### Exploratory Data Analysis (EDA)

In [2]:
# Load the combined dataset
data = pd.read_csv('combined_data.csv')

In [3]:
data.head()

Unnamed: 0,Entity,Code,Year,"Completion rate, by sex, location, wealth quintile and education level (%) - SE_TOT_CPLR - Primary - All areas - Total (national average) or no breakdown - Both sexes","Literacy rate, youth male (% of males ages 15-24)","Literacy rate, youth female (% of females ages 15-24)",Extent to which global citizenship education and education for sustainable development are mainstreamed in teacher education - SE_GCEDESD_TED,"Participation rate in organized learning (one year before the official primary entry age), by sex (%) - SE_PRE_PARTN - Both sexes","Total official flows for scholarships, by recipient countries (millions of constant 2021 United States dollars) - DC_TOF_SCHIPSL","Participation rate in formal and non-formal education and training, by sex (%) - SE_ADT_EDUCTRN - 15 to 64 years old - Both sexes","Proportion of teachers with the minimum required qualifications, by education level and sex (%) - SE_TRA_GRDL - Pre-primary - Both sexes","Adjusted gender parity index for completion rate, by location, wealth quintile and education level - SE_AGP_CPRA - Primary - All areas - Total (national average) or no breakdown","Proportion of children aged 36-59 months who are developmentally on track in at least three of the following domains: literacy-numeracy, physical development, social-emotional development, and learning (% of children aged 36-59 months) - SE_DEV_ONTRK - 36 to 59 months old - Both sexes","Proportion of youth and adults with information and communications technology (ICT) skills, by sex and type of skill (%) - SE_ADT_ACTS - 25 to 74 years old - All areas - Both sexes - Creating electronic presentations with presentation software","Proportion of schools with access to electricity, by education level (%) - SE_ACS_ELECT - Primary","Proportion of schools with access to electricity, by education level (%) - SE_ACS_ELECT - Upper secondary","Proportion of schools with access to electricity, by education level (%) - SE_ACS_ELECT - Lower 
secondary",Proportion of children and young people achieving a minimum proficiency level in reading and mathematics (%) - SE_TOT_PRFL - Grades 2/3 - Both sexes - Skill: Minimum proficiency in reading
0,Afghanistan,AFG,2000,20.0,,,,,,,,,,,,,,
1,Afghanistan,AFG,2001,21.0,,,,,,,,,,,,,,
2,Afghanistan,AFG,2002,22.0,,,,,,,,,,,,,,
3,Afghanistan,AFG,2003,24.0,,,,,,,,,,,,,,
4,Afghanistan,AFG,2004,25.0,,,,,,,,,,,,,,


In [4]:
data.tail()

Unnamed: 0,Entity,Code,Year,"Completion rate, by sex, location, wealth quintile and education level (%) - SE_TOT_CPLR - Primary - All areas - Total (national average) or no breakdown - Both sexes","Literacy rate, youth male (% of males ages 15-24)","Literacy rate, youth female (% of females ages 15-24)",Extent to which global citizenship education and education for sustainable development are mainstreamed in teacher education - SE_GCEDESD_TED,"Participation rate in organized learning (one year before the official primary entry age), by sex (%) - SE_PRE_PARTN - Both sexes","Total official flows for scholarships, by recipient countries (millions of constant 2021 United States dollars) - DC_TOF_SCHIPSL","Participation rate in formal and non-formal education and training, by sex (%) - SE_ADT_EDUCTRN - 15 to 64 years old - Both sexes","Proportion of teachers with the minimum required qualifications, by education level and sex (%) - SE_TRA_GRDL - Pre-primary - Both sexes","Adjusted gender parity index for completion rate, by location, wealth quintile and education level - SE_AGP_CPRA - Primary - All areas - Total (national average) or no breakdown","Proportion of children aged 36-59 months who are developmentally on track in at least three of the following domains: literacy-numeracy, physical development, social-emotional development, and learning (% of children aged 36-59 months) - SE_DEV_ONTRK - 36 to 59 months old - Both sexes","Proportion of youth and adults with information and communications technology (ICT) skills, by sex and type of skill (%) - SE_ADT_ACTS - 25 to 74 years old - All areas - Both sexes - Creating electronic presentations with presentation software","Proportion of schools with access to electricity, by education level (%) - SE_ACS_ELECT - Primary","Proportion of schools with access to electricity, by education level (%) - SE_ACS_ELECT - Upper secondary","Proportion of schools with access to electricity, by education level (%) - SE_ACS_ELECT - Lower 
secondary",Proportion of children and young people achieving a minimum proficiency level in reading and mathematics (%) - SE_TOT_PRFL - Grades 2/3 - Both sexes - Skill: Minimum proficiency in reading
17867,Togo,TGO,2014,,,,,,,,,,,,,,,20.1
17868,Togo,TGO,2019,,,,,,,,,,,,,,,24.5
17869,Uruguay,URY,2006,,,,,,,,,,,,,,,75.35
17870,Uruguay,URY,2013,,,,,,,,,,,,,,,71.5
17871,Uruguay,URY,2019,,,,,,,,,,,,,,,64.4


In [5]:
data.shape

(17872, 18)

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17872 entries, 0 to 17871
Data columns (total 18 columns):
 #   Column                                                                                                                                                                                                                                                                                         Non-Null Count  Dtype  
---  ------                                                                                                                                                                                                                                                                                         --------------  -----  
 0   Entity                                                                                                                                                                                                                                                        

In [7]:
data.isnull().sum()

Entity                                                                                                                                                                                                                                                                                               0
Code                                                                                                                                                                                                                                                                                              1300
Year                                                                                                                                                                                                                                                                                                 0
Completion rate, by sex, location, wealth quintile and education level (%) - SE_TOT_CPLR - Primary - All areas - To

### Handling Missing Values

In [8]:
# For numeric columns
numeric_cols = data.select_dtypes(include='number').columns
data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())

In [9]:
data.isnull().sum()

Entity                                                                                                                                                                                                                                                                                              0
Code                                                                                                                                                                                                                                                                                             1300
Year                                                                                                                                                                                                                                                                                                0
Completion rate, by sex, location, wealth quintile and education level (%) - SE_TOT_CPLR - Primary - All areas - Total
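
The `Code` column is non-numeric, so the mean fill applied to the numeric columns leaves its 1300 missing values in place. Before deciding how to treat them, it may help to see which entities lack a code. The sketch below runs on a small made-up frame standing in for `data`; the rows and the `entities_missing_code` helper are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd


def entities_missing_code(df, entity_col="Entity", code_col="Code"):
    """Return the entities whose code is missing, with their row counts."""
    missing = df[df[code_col].isna()]
    return missing[entity_col].value_counts()


# Toy stand-in for `data` (values are invented for illustration)
toy = pd.DataFrame({
    "Entity": ["Afghanistan", "Region A", "Region A", "Togo"],
    "Code": ["AFG", np.nan, np.nan, "TGO"],
    "Year": [2000, 2000, 2001, 2014],
})

missing_counts = entities_missing_code(toy)
print(missing_counts)
```

On the real frame the same call would be `entities_missing_code(data)`. Whether those rows should be dropped, kept, or given a placeholder code depends on what the entities turn out to be.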

In [10]:
data.describe()

Unnamed: 0,Year,"Completion rate, by sex, location, wealth quintile and education level (%) - SE_TOT_CPLR - Primary - All areas - Total (national average) or no breakdown - Both sexes","Literacy rate, youth male (% of males ages 15-24)","Literacy rate, youth female (% of females ages 15-24)",Extent to which global citizenship education and education for sustainable development are mainstreamed in teacher education - SE_GCEDESD_TED,"Participation rate in organized learning (one year before the official primary entry age), by sex (%) - SE_PRE_PARTN - Both sexes","Total official flows for scholarships, by recipient countries (millions of constant 2021 United States dollars) - DC_TOF_SCHIPSL","Participation rate in formal and non-formal education and training, by sex (%) - SE_ADT_EDUCTRN - 15 to 64 years old - Both sexes","Proportion of teachers with the minimum required qualifications, by education level and sex (%) - SE_TRA_GRDL - Pre-primary - Both sexes","Adjusted gender parity index for completion rate, by location, wealth quintile and education level - SE_AGP_CPRA - Primary - All areas - Total (national average) or no breakdown","Proportion of children aged 36-59 months who are developmentally on track in at least three of the following domains: literacy-numeracy, physical development, social-emotional development, and learning (% of children aged 36-59 months) - SE_DEV_ONTRK - 36 to 59 months old - Both sexes","Proportion of youth and adults with information and communications technology (ICT) skills, by sex and type of skill (%) - SE_ADT_ACTS - 25 to 74 years old - All areas - Both sexes - Creating electronic presentations with presentation software","Proportion of schools with access to electricity, by education level (%) - SE_ACS_ELECT - Primary","Proportion of schools with access to electricity, by education level (%) - SE_ACS_ELECT - Upper secondary","Proportion of schools with access to electricity, by education level (%) - SE_ACS_ELECT - Lower 
secondary",Proportion of children and young people achieving a minimum proficiency level in reading and mathematics (%) - SE_TOT_PRFL - Grades 2/3 - Both sexes - Skill: Minimum proficiency in reading
count,17872.0,17872.0,17872.0,17872.0,17872.0,17872.0,17872.0,17872.0,17872.0,17872.0,17872.0,17872.0,17872.0,17872.0,17872.0,17872.0
mean,2010.862522,80.764773,89.443907,84.346178,0.850492,74.721981,22793900.0,17.569495,73.648149,0.99371,75.378193,24.117072,83.58174,93.873867,91.549821,60.695135
std,7.145871,10.630273,4.053232,6.234342,0.007682,10.394298,44453660.0,2.28155,7.214363,0.050431,0.931025,1.654179,6.970758,3.256571,4.240152,2.252424
min,1970.0,6.0,15.88,6.66406,0.2,0.0,0.0,0.5,0.0,0.26,36.2,0.14,2.87,12.96,8.92,9.6
25%,2006.0,80.764773,89.443907,84.346178,0.850492,74.721981,22793900.0,17.569495,73.648149,0.99371,75.378193,24.117072,83.58174,93.873867,91.549821,60.695135
50%,2012.0,80.764773,89.443907,84.346178,0.850492,74.721981,22793900.0,17.569495,73.648149,0.99371,75.378193,24.117072,83.58174,93.873867,91.549821,60.695135
75%,2017.0,80.764773,89.443907,84.346178,0.850492,74.721981,22793900.0,17.569495,73.648149,0.99371,75.378193,24.117072,83.58174,93.873867,91.549821,60.695135
max,2022.0,100.0,100.0,100.0,1.0,100.0,1849060000.0,99.9,100.0,1.42,97.2,60.0,100.0,100.0,100.0,98.57
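
In the summary above, the 25%, 50% and 75% quantiles of every indicator equal the column mean, which suggests that most values in each column were imputed rather than observed. One way to check this, sketched here on a toy frame, is to measure the share of missing values per column before filling and to compare the quartiles afterwards. The column names and values are invented, and `missing_share` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def missing_share(df):
    """Fraction of missing values per numeric column, largest first."""
    numeric = df.select_dtypes(include="number")
    return numeric.isna().mean().sort_values(ascending=False)


# Toy frame: one indicator with 3 of 5 values missing
toy = pd.DataFrame({
    "Year": [2000, 2001, 2002, 2003, 2004],
    "sparse_indicator": [20.0, np.nan, np.nan, np.nan, 40.0],
})

share = missing_share(toy)
print(share)

# Mean imputation on a mostly empty column pulls the quartiles onto the mean
filled = toy["sparse_indicator"].fillna(toy["sparse_indicator"].mean())
print(filled.describe())
```

If a column turns out to be mostly missing, the filled values will dominate any plot or model built on it, so it may be worth reporting the missing share alongside the imputed statistics.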


### Data Visualization

### Box and Whisker Plot

In [11]:
# Box plots for numerical columns
for col in numeric_cols:
    plt.figure(figsize=(8, 6))
    # Plot from `data`; no `train` frame has been created at this point
    sns.boxplot(x=data[col])
    plt.title(f'Box plot of {col}')
    plt.xlabel(col)
    plt.show()

### Univariate Analysis

In [None]:
numerical_cols = data.select_dtypes(include=[np.number]).columns
categorical_cols = data.select_dtypes(include=[object]).columns

In [None]:
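# Encode categorical variables
label_encoders = {}
for column in categorical_cols:
    le = LabelEncoder()
    # `Code` still contains missing values and LabelEncoder needs uniform strings,
    # so fill them with a placeholder first (the 'Unknown' label is an assumption)
    data[column] = le.fit_transform(data[column].fillna('Unknown'))
    label_encoders[column] = le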

In [None]:
# Histograms for numerical columns
for col in numerical_cols:
    plt.figure(figsize=(8, 6))
    sns.histplot(data[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.show()

### Bivariate Analysis

In [None]:
# Scatter plots for numerical columns
for i in range(len(numerical_cols)):
    for j in range(i + 1, len(numerical_cols)):
        plt.figure(figsize=(8, 6))
        sns.scatterplot(x=data[numerical_cols[i]], y=data[numerical_cols[j]])
        plt.title(f'Scatter plot between {numerical_cols[i]} and {numerical_cols[j]}')
        plt.xlabel(numerical_cols[i])
        plt.ylabel(numerical_cols[j])
        plt.show()

### Multivariate Analysis

In [None]:
# Pairplot for numerical columns to see pairwise relationships
sns.pairplot(data[numerical_cols])
plt.show()

### Heatmap

In [None]:
# Heatmap for correlation matrix of numerical columns
plt.figure(figsize=(12, 10))
corr_matrix = data[numerical_cols].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Heatmap of Correlation Matrix')
plt.show()
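
With indicator names this long, tick labels on the heatmap and pair plot become hard to read. One option, sketched below, is to pull the SDG series code (for example `SE_TOT_CPLR`) out of each name and use it as a short label. The regular expression and the `short_label` helper are assumptions for illustration; names without a series code fall back to a truncated version of the original.

```python
import re

# Series codes in the column names look like SE_TOT_CPLR or DC_TOF_SCHIPSL
SERIES_CODE = re.compile(r"[A-Z]{2}_[A-Z0-9_]+")


def short_label(name, max_len=30):
    """Return a compact label for a long indicator column name."""
    parts = [p.strip() for p in name.split(" - ")]
    for i, part in enumerate(parts):
        if SERIES_CODE.fullmatch(part):
            # Keep the first qualifier after the code so that series
            # split by education level (e.g. Primary) stay distinct
            qualifier = parts[i + 1] if i + 1 < len(parts) else ""
            return f"{part} ({qualifier})" if qualifier else part
    # No series code found: fall back to the start of the name
    return name[:max_len]


examples = [
    "Proportion of schools with access to electricity, by education level (%) - SE_ACS_ELECT - Primary",
    "Extent to which global citizenship education and education for sustainable development are mainstreamed in teacher education - SE_GCEDESD_TED",
    "Literacy rate, youth male (% of males ages 15-24)",
]
labels = [short_label(name) for name in examples]
print(labels)
```

The helper could be applied to the correlation matrix before plotting, for instance with `corr_matrix.rename(index=short_label, columns=short_label)`, leaving the original column names in `data` untouched.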

### Model Training

### Split the data into training and testing sets

In [None]:
# Target: share of pupils reaching minimum reading proficiency (Grades 2/3)
target_col = 'Proportion of children and young people achieving a minimum proficiency level in reading and mathematics (%) - SE_TOT_PRFL - Grades 2/3 - Both sexes - Skill: Minimum proficiency in reading'
X = data.drop(columns=target_col)
y = data[target_col]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
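
`StandardScaler` is imported at the top of the notebook but not yet used. If the features are to be standardised, a common precaution is to fit the scaler on the training split only and reuse it on the test split, so that test-set statistics do not leak into training. A minimal sketch on synthetic data follows; the arrays are random placeholders, not the SDG indicators.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for X and y (assumed shapes, random values)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_demo = rng.uniform(0.0, 100.0, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit on the training split only, then apply the same transform to the test split
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

Tree-based models such as random forests do not strictly need scaled inputs, so this step matters mainly if a scale-sensitive model is tried later.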