# Exploratory Data Analysis (EDA)
## Parental Education and Math Abilities

This notebook explores the relationship between parental socio-educational background
(parental education and classroom characteristics) and children's math-related cognitive
abilities using correlation analysis and visualization.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("../main_dataset.csv")
df.head()

## Variable Selection

To explore the relationship between socio-educational background and math abilities,
we selected a subset of variables from the dataset.

The analysis focuses on two main groups of variables:

**Socio-educational background variables**
- Parental education level (mother and father)
- School type
- Regular classroom status

**Math-related cognitive variables**
- CMAT Basic Calculation Quotient
- KeyMath sub-scores (Numeration, Measurement, Problem Solving)
- WJ-III Math Fluency score

These variables were chosen to reflect both environmental background factors
and core mathematical performance measures.

In [None]:
# --- Select background (socio-educational) variables ---
background_vars = [
    "mother_highest_grade",
    "father_highest_grade",
    "school_type",
    "regular_classroom"
]

# --- Select math-related cognitive variables ---
math_vars = [
    "CMAT_BasicCalc_Comp_Quotient",
    "KeyMath_Numeration_ScS",
    "KeyMath_Measurement_ScS",
    "KeyMath_ProblemSolving_ScS",
    "WJ-III_MathFluency_StS"
]


## Data Preparation for EDA
In this step, we create a reduced dataset containing only the selected background and math-related variables.  
We then inspect its structure and summary statistics to assess data completeness and distributions.

In [None]:
# Keep only relevant columns
eda_df = df[background_vars + math_vars].copy()

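# Display structure (dtypes, non-null counts) and summary statistics
eda_df.info()  # info() prints directly and returns None, so it is not wrapped in print()
print(eda_df.describe())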
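
Because data completeness matters for the correlations computed later, a per-column missing-value summary can complement `info()`. The following is a minimal sketch on a toy frame; `toy_df` and its values are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for eda_df; all values here are invented for illustration only.
toy_df = pd.DataFrame({
    "mother_highest_grade": [12, 16, np.nan, 14],
    "father_highest_grade": [12, np.nan, np.nan, 18],
    "KeyMath_Numeration_ScS": [9, 11, 8, 12],
})

# Count and percentage of missing values per column, most incomplete first.
missing = pd.DataFrame({
    "n_missing": toy_df.isna().sum(),
    "pct_missing": (toy_df.isna().mean() * 100).round(1),
}).sort_values("n_missing", ascending=False)

print(missing)
```

The same two expressions could be applied to `eda_df` directly.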


## Preprocessing for Correlation Analysis

Before computing correlations, several background variables require preprocessing.
Parental education variables are ordinal, and classroom status is categorical.
These columns are therefore converted to numeric format, with values that cannot be
parsed treated as missing. School type is stored as text, so it is excluded from the
correlation matrix and handled separately in the analysis.
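
As a small illustration of what numeric conversion with `errors="coerce"` does, the sketch below uses a hypothetical raw column. The string values, including `"unknown"`, are assumptions; the actual coding in the dataset may differ.

```python
import pandas as pd

# Hypothetical raw values; the real coding of this column may differ.
raw = pd.Series(["12", "16", "unknown", None, "14"], name="mother_highest_grade")

# Unparseable entries become NaN instead of raising an error.
converted = pd.to_numeric(raw, errors="coerce")

# Entries present before conversion but NaN afterwards were lost to coercion.
newly_missing = int((raw.notna() & converted.isna()).sum())

print(converted.tolist())
print("values lost to coercion:", newly_missing)
```

Checking this count after conversion helps confirm that coercion has not silently discarded a meaningful category.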

In [None]:
# -------------------------
# Correlations (numeric only)
# -------------------------

# Convert parental education and classroom status columns to numeric;
# values that cannot be parsed become NaN (errors="coerce")
for col in ["mother_highest_grade", "father_highest_grade", "regular_classroom"]:
    eda_df[col] = pd.to_numeric(eda_df[col], errors="coerce")

# Numeric columns for correlation (exclude school_type because it's categorical text)
numeric_cols = [c for c in eda_df.columns if c != "school_type"]

# Correlation matrix
corr = eda_df[numeric_cols].corr(method="spearman")  # rank-based, so suitable for ordinal scales
print("\nSpearman correlation matrix (numeric vars):")
print(corr.round(2))
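
Since `school_type` is left out of the correlation matrix, one possible way to look at it separately is to compare a math score across its categories. This is only a sketch: the category labels `"public"` and `"private"` and all scores below are invented, and a grouped median is one option among several.

```python
import pandas as pd

# Invented example data; category labels and scores are assumptions.
toy_df = pd.DataFrame({
    "school_type": ["public", "private", "public", "private", "public"],
    "KeyMath_Numeration_ScS": [8, 12, 10, 11, 9],
})

# Group size and median score for each school type.
group_summary = (
    toy_df.groupby("school_type")["KeyMath_Numeration_ScS"]
    .agg(["count", "median"])
)

print(group_summary)
```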

## Correlation Analysis

To examine associations between socio-educational background variables
and math-related cognitive measures, Spearman correlation coefficients were computed.

Spearman correlation was selected because several background variables are ordinal
and do not necessarily meet normality assumptions. The correlation matrix below
provides an overview of the strength and direction of relationships between
background characteristics and math performance measures.
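
To make the choice of Spearman concrete: the coefficient is the Pearson correlation computed on ranks, so it depends only on the ordering of values. The sketch below checks this on a toy frame; the column names `education` and `score` and their values are invented for illustration.

```python
import pandas as pd

# Invented data: an ordinal-like variable (with one tie) and a test score.
toy = pd.DataFrame({
    "education": [10, 12, 12, 16, 18],
    "score": [85, 90, 88, 110, 100],
})

# Spearman correlation as computed by pandas.
spearman = toy.corr(method="spearman").loc["education", "score"]

# Pearson correlation on average ranks gives the same value.
pearson_on_ranks = toy.rank().corr(method="pearson").loc["education", "score"]

print(round(spearman, 3), round(pearson_on_ranks, 3))
```

Because only ranks enter the calculation, unequal spacing between education levels does not affect the result.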

In [None]:
plt.figure(figsize=(12, 9))

sns.heatmap(
    corr,
    annot=True,
    fmt=".2f",
    cmap="RdBu_r",
    vmin=-1,
    vmax=1,
    linewidths=0.5,
    linecolor="white",
    annot_kws={"size": 10},
    cbar_kws={"label": "Spearman correlation"}
)

plt.title(
    "Spearman Correlations between Socio-Educational Background and Math Abilities",
    fontsize=15,
    pad=15
)

plt.xticks(rotation=45, ha="right")
plt.yticks(rotation=0)

plt.tight_layout()
plt.show()
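
The full matrix also contains background-to-background and math-to-math pairs. To focus on the background-to-math associations, the corresponding block can be sliced out with `.loc`. The sketch below uses randomly generated data and a shortened variable list, so its values carry no meaning; it only shows the indexing pattern.

```python
import numpy as np
import pandas as pd

# Random toy data with a subset of the notebook's column names (illustration only).
rng = np.random.default_rng(0)
toy_df = pd.DataFrame(
    rng.normal(size=(50, 4)),
    columns=[
        "mother_highest_grade",
        "father_highest_grade",
        "KeyMath_Numeration_ScS",
        "KeyMath_Measurement_ScS",
    ],
)
toy_corr = toy_df.corr(method="spearman")

background = ["mother_highest_grade", "father_highest_grade"]
math_scores = ["KeyMath_Numeration_ScS", "KeyMath_Measurement_ScS"]

# Rows: background variables; columns: math measures.
cross_block = toy_corr.loc[background, math_scores]

print(cross_block.round(2))
```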

## Preliminary Insights

The exploratory analysis reveals several notable patterns.
Parental education levels, particularly father’s education, show moderate positive
associations with multiple math achievement measures.
In contrast, regular classroom status exhibits weak or negligible correlations
with math performance.

Strong correlations are observed among the different math-related measures,
indicating consistency across assessment tools.
Overall, these findings suggest that socio-educational background may be related
to mathematical abilities, supporting further investigation using multivariate
statistical models.