<a href="https://colab.research.google.com/github/rmfcardeira/EIACD_Assignement_02/blob/main/EIACD_Assignement_02_shared_version.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
# Introduction
---

This notebook was developed by the following students for the course **CC1023 – Elementos de Inteligência Artificial e Ciência de Dados 2024/2025** at the Faculty of Sciences of the University of Porto (FCUP):

**Matilde Amorim – 202208540**

**Rita Saraiva – 202207331**

**Rodrigo Cardeira – 202206533**

This notebook explores a dataset related to student performance in secondary education.

The goal is to analyze the factors that influence students' academic success, so that an intervention system can be built to flag individual students who need extra attention and support.

**Dataset:**

The dataset used in this analysis contains information about student demographics, social and school-related factors, and academic performance. We will use various data analysis and machine learning techniques to gain insights from this data.

**Key Questions:**

*   What are the key factors that influence student performance (passing/failing)?
*   Can we build a model to predict student success based on their characteristics?
*   What insights can we gain from this analysis to potentially improve student outcomes?

**Analysis Steps:**

1.  Data Exploration (EDA): Understanding the data structure and content, and identifying patterns.
2.  Data Preprocessing: Data cleaning, data transformation, and feature engineering.
3.  Data Modeling (Supervised Learning): Choosing and training suitable machine learning models.
4.  Performance Evaluation: Assessing and optimizing model performance.
5.  Interpretation of Results: Drawing conclusions and suggesting potential interventions.





---

# 0. Setup and Library Imports

---
Initial setup and configuration, plus imports of the necessary libraries.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Configure visualizations
%matplotlib inline
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("Setup configured and libraries imported successfully.")

Setup configured and libraries imported successfully.




---


# 1. Data Exploration (EDA)


---



In this phase, we load the data and perform an initial analysis to understand its structure and content, and to identify potential issues or characteristics relevant to preprocessing and modeling.

### 1.1 Load the Dataset

First, we load the `student-data.csv` file into a pandas DataFrame and display the first few rows to get an initial look at the data structure and content.

In [None]:
# Load the dataset
file_path = 'C:/Users/Ustudent-data.csv'
df = pd.read_csv(file_path)
print("Dataset loaded.")

# Display the first 5 rows
print("\nFirst 5 rows of the dataset:")
display(df.head())

# Display the last 5 rows
print("\nLast 5 rows of the dataset:")
display(df.tail())


### 1.2 Initial Data Inspection

Let's get some basic information about the dataset:
*   Number of records (students) and features (columns).
*   Column names.
*   Data types of each column.
*   Check for missing values.
*   Check for duplicate rows.

In [None]:
# Get the shape of the dataset (rows, columns)
print(f"Dataset Shape: {df.shape[0]} records and {df.shape[1]} features.\n")

# Get column names
print("Column Names:")
print(list(df.columns))
print("-" * 50)

# Get data types and non-null counts
print("\nData Types and Non-Null Counts:")
df.info()
print("-" * 50)

# Check for missing values in each column
print("\nMissing Values per Column:")
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0])
if missing_values.sum() == 0:
    print("No missing values found.")
else:
    print(f"Total missing values: {missing_values.sum()}")
print("-" * 50)

# Check for duplicate rows
duplicate_rows = df.duplicated().sum()
print(f"\nNumber of Duplicate Rows: {duplicate_rows}")
if duplicate_rows > 0:
    print("Consider removing duplicate rows in the preprocessing step.")
print("-" * 50)

**Observations from Initial Inspection:**

*   The dataset contains 395 student records and 31 features.
*   The features cover a range of demographic, social, school-related, and behavioral attributes.
*   The target variable appears to be `passed` (yes/no).
*   Most features are categorical (`object` type) or discrete numerical (`int64`). Many binary categorical features (like `schoolsup`, `famsup`, `paid`, etc.) are stored as strings.
*   **Crucially, there are no missing values** in the dataset according to `df.info()` and `df.isnull().sum()`.
*   **No duplicate rows** were found.

#### 1.2.1 Descriptive Statistics

In [None]:
if not df.empty:
    print("\nDescriptive statistics for numerical features:")
    print(df.describe(include=[np.number])) # For numerical columns

    print("\nDescriptive statistics for categorical (object) features:")
    print(df.describe(include=['object'])) # For object columns

**Observations from Descriptive Statistics:**

**Numerical Features:**

age: Ranges from 15 to 22, with a mean of about 16.7. Most students are between 16 and 18. The max age of 22 might be an outlier or represent students repeating years.

Medu (Mother's education) & Fedu (Father's education): Range from 0 (none) to 4 (higher education).

traveltime: 1 (<15 min) to 4 (>1 hour). Most students live relatively close (1 or 2).

studytime: 1 (<2 hours) to 4 (>10 hours). Most study 2-5 hours (category 2).

failures: Number of past class failures. The attribute description gives the coding "n if 1<=n<3, else 4", but the data only contains values from 0 to 3. Most students have 0 failures.

famrel (Quality of family relationships): 1 (very bad) to 5 (excellent). Generally good.

freetime, goout, Dalc (Workday alcohol), Walc (Weekend alcohol): Graded 1 (very low) to 5 (very high).

health: 1 (very bad) to 5 (very good).

absences: Ranges from 0 to 75. The mean is around 5.7, but the standard deviation is high (8), and the max of 75 suggests potential outliers or data entry issues. We should investigate this further.

**Categorical Features:**

school: Two schools, 'GP' (Gabriel Pereira) is more frequent (349 students) than 'MS' (Mousinho da Silveira, 46 students). This is an imbalance.

sex: More 'F' (female, 208) than 'M' (male, 187). Fairly balanced.

address: Mostly 'U' (urban, 307) vs 'R' (rural, 88).

famsize: Mostly 'GT3' (greater than 3, 281) vs 'LE3' (less or equal to 3, 114).

Pstatus (Parent's cohabitation status): Mostly 'T' (together, 354) vs 'A' (apart, 41).

Mjob, Fjob: 'other' is the most common category. 'teacher' and 'services' are also prominent. 'at_home' and 'health' are less common.

reason: 'course' preference is the most common reason for choosing the school.

guardian: 'mother' is the most common guardian.

Binary categorical features (yes/no): schoolsup, famsup, paid, activities, nursery, higher, internet, romantic.

higher (wants to take higher education): Most students ('yes', 375) want to.

internet (internet access at home): Most students ('yes', 329) have it.

passed (Target Variable): More students passed ('yes', 265) than failed ('no', 130). This shows a moderate class imbalance.

### 1.3 Feature Analysis

Now, let's analyze the features in more detail, separating them into numerical and categorical types.

#### 1.3.1 Numerical Features

We'll look at the statistical summary and distributions of numerical features.

In [None]:
# Select numerical features (based on df.info())
numerical_cols = df.select_dtypes(include=np.number).columns.tolist()
print("Numerical Features:")
print(numerical_cols)
print("-" * 50)

# Get descriptive statistics for numerical features
print("\nDescriptive Statistics for Numerical Features:")
display(df[numerical_cols].describe())
print("-" * 50)

# Visualize distributions of numerical features
print("\nDistributions of Numerical Features:")
df[numerical_cols].hist(figsize=(15, 12), bins=15, layout=(-1, 4))
plt.suptitle('Histograms of Numerical Features', y=1.02)
plt.tight_layout()
plt.show()

# Visualize potential outliers using boxplots
print("\nBoxplots of Numerical Features (to identify potential outliers):")
plt.figure(figsize=(18, 10))
sns.boxplot(data=df[numerical_cols], orient='h')
plt.title('Boxplots of Numerical Features')
plt.show()

**Observations for Numerical Features:**

*   **Age:** Ranges from 15 to 22. Most students are between 15 and 18. There are fewer older students (19+), which might be considered outliers or a specific subgroup.
*   **Medu, Fedu (Mother's/Father's Education):** Ordinal scale (0-4). Most parents have some level of education (values > 0). Value 0 might indicate no education or missing information (though no NaNs were reported).
*   **traveltime, studytime:** Ordinal scales (1-4). Most students have relatively short travel times and low-to-moderate study times (categories 1 and 2, i.e. under 2 hours or 2-5 hours).
*   **failures:** Number of past class failures (0-3). Most students have 0 failures. Value 3 likely represents '3 or more' failures. This feature is highly skewed.
*   **famrel, freetime, goout, Dalc, Walc, health:** Ordinal scales (1-5). Distributions vary. `Dalc` (weekday alcohol) and `Walc` (weekend alcohol) are skewed towards lower consumption.
*   **absences:** Number of school absences. Highly skewed to the right, ranging from 0 to 75. The value 75 seems like a significant outlier. Many students have 0 absences.
*   **Outliers:** `absences` clearly shows potential outliers. `age` has a few values (20, 21, 22) that are less common. `failures` has a group at 3, which might represent an aggregation.
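
One common way to make the "potential outlier" judgment above concrete is the 1.5×IQR rule, the same fences that boxplots use by default for their whiskers. The sketch below is illustrative only: `iqr_outlier_mask` is a hypothetical helper, and the values in `toy_absences` are made up rather than drawn from the dataset.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)


# Toy values for illustration only (NOT taken from the dataset)
toy_absences = pd.Series([0, 0, 2, 4, 4, 6, 8, 10, 14, 75], name="absences")

mask = iqr_outlier_mask(toy_absences)
print(toy_absences[mask].tolist())  # values flagged by the rule
```

On the real data the same helper could be applied as `iqr_outlier_mask(df['absences'])`, assuming `df` is loaded as in section 1.1. Flagged values are candidates for review, not automatically errors.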

#### 1.3.2 Categorical Features

Let's examine the unique values and distributions for categorical features (including binary 'yes'/'no' features).

In [None]:
# Select categorical features (object type)
categorical_cols = df.select_dtypes(include='object').columns.tolist()
print("Categorical Features:")
print(categorical_cols)
print("-" * 50)

# Analyze value counts for each categorical feature
print("\nValue Counts for Categorical Features:")
for col in categorical_cols:
    print(f"\n--- {col} ---")
    print(df[col].value_counts())
    # Optional: Add percentage
    # print(df[col].value_counts(normalize=True) * 100)

# Visualize distributions of categorical features
print("\nDistributions of Categorical Features:")
num_plots = len(categorical_cols)
num_cols_grid = 4
num_rows_grid = (num_plots + num_cols_grid - 1) // num_cols_grid # Calculate rows needed

fig, axes = plt.subplots(num_rows_grid, num_cols_grid, figsize=(16, num_rows_grid * 4))
axes = axes.flatten() # Flatten to easily iterate

for i, col in enumerate(categorical_cols):
    sns.countplot(data=df, y=col, ax=axes[i], order=df[col].value_counts().index, hue=col, palette='viridis', legend=False)
    axes[i].set_title(f'Distribution of {col}')
    axes[i].set_xlabel('Count')
    axes[i].set_ylabel('') # Remove y-label for clarity with y-ticks

# Hide any unused subplots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()

**Observations for Categorical Features:**

*   **school:** Two schools involved: 'GP' (Gabriel Pereira) and 'MS' (Mousinho da Silveira). 'GP' has significantly more students in this dataset.
*   **sex:** Slightly more female ('F') students than male ('M').
*   **address:** Most students live in Urban ('U') areas compared to Rural ('R').
*   **famsize:** Family size is predominantly 'GT3' (Greater than 3) compared to 'LE3' (Less than or equal to 3).
*   **Pstatus:** Parents' cohabitation status is mostly 'T' (Together) compared to 'A' (Apart).
*   **Mjob, Fjob:** Diverse range of jobs. 'other' is the most common category for both parents. 'teacher' and 'services' are also frequent. 'at_home' is more common for mothers.
*   **reason:** Most common reasons for choosing the school are 'course' preference, followed by 'home' proximity and 'reputation'.
*   **guardian:** Most students have 'mother' as their guardian, followed by 'father', then 'other'.
*   **schoolsup, famsup, paid, activities, nursery, higher, internet, romantic:** These are binary features ('yes'/'no').
    *   `schoolsup`: Most students do *not* have extra educational support.
    *   `famsup`: About half the students have family educational support.
    *   `paid`: Most students do *not* take extra paid classes.
    *   `activities`: About half the students participate in extra-curricular activities.
    *   `nursery`: Most students attended nursery school.
    *   `higher`: The vast majority of students want to pursue higher education. **This is highly imbalanced towards 'yes'.**
    *   `internet`: Most students have internet access at home.
    *   `romantic`: Most students are *not* in a romantic relationship.
*   **passed (Target Variable):** More students passed ('yes') than failed ('no'). There is an imbalance, but it might not be severe enough to require complex handling initially. We need to quantify this.

In [None]:
# Quantify the target variable distribution
print("\nTarget Variable Distribution ('passed'):")
passed_counts = df['passed'].value_counts()
passed_perc = df['passed'].value_counts(normalize=True) * 100

print(passed_counts)
print("\nPercentages:")
print(passed_perc)

# Visualize target variable distribution
plt.figure(figsize=(6, 4))
# Fix the bar order to match passed_counts so the text labels added below line up with their bars
sns.countplot(data=df, x='passed', hue='passed', order=passed_counts.index, palette='Paired', legend=False)
plt.title('Distribution of Target Variable (passed)')
plt.xlabel('Passed Final Exam')
plt.ylabel('Number of Students')

# Add percentages to the plot (using iloc for position-based access)
total = len(df)
for i, count in enumerate(passed_counts):
    plt.text(i, count + 5, f'{passed_perc.iloc[i]:.1f}% ({count})', ha='center')
plt.ylim(0, max(passed_counts) * 1.15)  # Adjust y-limit for text
plt.show()

**Target Variable Observation:**

*   Approximately 67.1% of students passed ('yes'), while 32.9% failed ('no'). This confirms an imbalance where the 'passed' class is roughly twice the size of the 'failed' class.
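
To see why this imbalance matters for evaluation, it helps to work out what a trivial model would score. The sketch below uses only the counts reported above (265 'yes', 130 'no') and considers a hypothetical baseline that always predicts 'yes'; it is plain arithmetic, not part of the modeling pipeline.

```python
# Counts reported above for the target variable
n_pass, n_fail = 265, 130
total = n_pass + n_fail

# Hypothetical baseline: always predict 'yes' (the majority class)
baseline_accuracy = n_pass / total

# The baseline never predicts 'no', so it identifies none of the failing students
recall_fail = 0 / n_fail

imbalance_ratio = n_pass / n_fail

print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Recall on 'no':    {recall_fail:.3f}")
print(f"Class ratio yes:no = {imbalance_ratio:.2f}:1")
```

Any model worth keeping should beat roughly 67% accuracy and, more importantly for an intervention system, achieve non-trivial recall on the 'no' class.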

### 1.4 Preliminary Relationship Analysis (Feature vs. Target)

Let's briefly explore how some features relate to the target variable `passed`.

#### 1.4.1 Numerical Features vs. `passed`

In [None]:
print("Numerical Features vs. Passed Status:")

# Select a few key numerical features to compare against 'passed'
num_cols_to_compare = ['age', 'failures', 'absences', 'studytime', 'goout', 'Dalc', 'Walc']

fig, axes = plt.subplots(len(num_cols_to_compare), 1, figsize=(8, len(num_cols_to_compare) * 4))
if len(num_cols_to_compare) == 1: # Handle case of single plot
    axes = [axes]

for i, col in enumerate(num_cols_to_compare):
    # Assign 'passed' to 'hue' and set legend=False
    sns.boxplot(data=df, x='passed', y=col, ax=axes[i], hue='passed', palette='Paired', legend=False)
    axes[i].set_title(f'{col} vs. Passed Status')
    axes[i].set_xlabel('Passed Final Exam')

plt.tight_layout()
plt.show()

**Observations (Numerical vs. Passed):**

*   **failures:** Students who failed ('no') tend to have significantly more past failures than those who passed ('yes'). This looks like a strong predictor.
*   **absences:** Students who failed seem to have slightly more absences on average, but the distributions have large overlaps and many outliers.
*   **studytime:** Students who passed seem to report slightly higher study times, but the difference is not dramatic based on the boxplot medians/quartiles.
*   **goout:** Students who failed tend to report going out more frequently.
*   **Dalc, Walc:** Alcohol consumption appears slightly higher for students who failed, particularly weekend consumption (`Walc`), but median values are often the same.
*   **age:** Older students seem slightly more likely to fail, but the distributions overlap considerably.

#### 1.4.2 Categorical Features vs. `passed`

In [None]:
print("\nCategorical Features vs. Passed Status:")

# Select a few key categorical features to compare against 'passed'
cat_cols_to_compare = ['sex', 'Medu', 'Fedu', 'schoolsup', 'higher', 'romantic', 'internet']

# Use countplot with 'hue' for visualization
fig, axes = plt.subplots(len(cat_cols_to_compare), 1, figsize=(10, len(cat_cols_to_compare) * 4.5))
if len(cat_cols_to_compare) == 1: # Handle case of single plot
    axes = [axes]

for i, col in enumerate(cat_cols_to_compare):
    order = sorted(df[col].unique()) if col in ['Medu', 'Fedu'] else None # Order education levels
    sns.countplot(data=df, x=col, hue='passed', ax=axes[i], palette='Paired', order=order)
    axes[i].set_title(f'{col} vs. Passed Status')
    axes[i].set_xlabel(col)
    axes[i].legend(title='Passed')
    # Add percentages within each category (optional, can make plots busy)
    # for container in axes[i].containers:
    #    axes[i].bar_label(container, fmt='%.0f')

plt.tight_layout()
plt.show()

# Crosstab for 'higher' might be more revealing due to its imbalance
print("\nCrosstab: higher vs passed")
display(pd.crosstab(df['higher'], df['passed'], normalize='index') * 100) # Show percentage within 'higher' category

**Observations (Categorical vs. Passed):**

*   **sex:** Pass rates seem roughly similar between males and females, perhaps slightly higher for females.
*   **Medu, Fedu:** Higher parental education levels appear correlated with a higher pass rate. Students whose parents have higher education (e.g., levels 3 and 4) are more likely to pass.
*   **schoolsup:** Students *with* school support ('yes') have a noticeably lower pass rate than those without. This is counter-intuitive and needs investigation. Perhaps students receiving support are already those struggling academically?
*   **higher:** Students who want to go on to higher education ('yes') appear to pass at a higher rate than those who do not, and a large proportion of students *not* wanting to go on to higher education ('no') failed. This seems like a strong indicator, although the 'no' group is small (20 of 395 students).
*   **romantic:** Students in a romantic relationship ('yes') appear to have a lower pass rate.
*   **internet:** Having internet access seems slightly associated with a higher pass rate.
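
Count plots make it easy to misjudge pass rates when group sizes differ a lot, so a small table of pass rate and group size per category can back up the observations above. The following is a minimal sketch: `pass_rate_by` is a hypothetical helper, and the `toy` frame contains made-up rows used only to demonstrate the output shape.

```python
import pandas as pd


def pass_rate_by(data: pd.DataFrame, feature: str, target: str = "passed") -> pd.DataFrame:
    """Summarize group size and pass rate (in %) for each level of `feature`."""
    passed = data[target].eq("yes")  # boolean Series: True where the student passed
    summary = passed.groupby(data[feature]).agg(n="count", pass_rate="mean")
    summary["pass_rate"] = (summary["pass_rate"] * 100).round(1)
    return summary


# Made-up rows for illustration only (NOT taken from the dataset)
toy = pd.DataFrame({
    "higher": ["yes"] * 6 + ["no"] * 2,
    "passed": ["yes", "yes", "yes", "yes", "no", "no", "no", "yes"],
})

summary = pass_rate_by(toy, "higher")
print(summary)
```

With the real data, `pass_rate_by(df, 'schoolsup')` or `pass_rate_by(df, 'higher')` would show both the rate and how many students each rate is based on, which matters for small groups.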

### 1.5 Summary of Data Exploration Findings & Potential Issues

1.  **Dataset Size:** 395 records, 31 features. Manageable size.
2.  **Target Variable:** `passed` (Categorical: 'yes'/'no').
3.  **Class Imbalance:** The target variable is imbalanced (approx. 67% 'yes', 33% 'no'). This should be considered during modeling and evaluation (e.g., using appropriate metrics like F1-score, Precision, Recall, AUC, and potentially resampling techniques).
4.  **Feature Types:** Mix of numerical (mostly discrete/ordinal) and categorical (many binary 'yes'/'no'). Categorical features will need encoding (e.g., One-Hot, Ordinal) for most ML models. Binary 'yes'/'no' features can be easily mapped to 1/0.
5.  **Missing Values:** None found.
6.  **Duplicates:** None found.
7.  **Outliers:** `absences` has significant potential outliers. `age` has a few older students. Decisions on handling these (e.g., clipping, removal, transformation) should be made during preprocessing.
8.  **Potential Predictors:** Features like `failures`, `higher`, `Medu`, `Fedu`, `schoolsup`, `goout`, `studytime`, and `romantic` showed noticeable associations with the `passed` status in the preliminary analysis and warrant further investigation.
9.  **Highly Skewed Features:** `failures` and `absences` are highly skewed. Transformations (e.g., log transform for `absences` if appropriate) might be considered.

10. **Underrepresented Categories:**
    *   `school`: 'MS' has significantly fewer students than 'GP'.
    *   Some job categories (`Mjob`, `Fjob`) and `guardian`='other' have low frequencies.
    *   This could impact model performance or generalization, especially if these rare categories are important predictors.
11. **Potential Irrelevant Features (Hypothesis - to be confirmed with further analysis/modeling):**
    *   It's hard to definitively say at this stage. Domain knowledge suggests most of these features could be relevant to student performance. Feature selection techniques will be important later.
    *   For example, `nursery` (attended nursery school) might have less impact on secondary school performance compared to more recent factors.
12. **Data Scale:**
    *   Numerical features are on different scales (e.g., `age` vs `absences`).
    *   Scaling/Normalization will be necessary for algorithms sensitive to feature magnitudes (e.g., KNN, SVM, Neural Networks).
13. **Feature Redundancy:** Although not explicitly checked with a correlation matrix for numerical features (omitted here for brevity, but recommended), potential redundancy might exist (e.g., `Medu` and `Fedu`, `Dalc` and `Walc`). `Dalc` and `Walc` could potentially be combined into a total alcohol consumption feature.
14. **Counter-intuitive Relationships:** The relationship between `schoolsup` and `passed` (students *with* support having lower pass rates) is unexpected and suggests that this feature might be capturing students who are already at risk.
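
The redundancy check mentioned in the list above could look roughly like the sketch below. Because these features are ordinal, it computes a rank (Spearman) correlation as the Pearson correlation of ranks, which needs only pandas. The `toy` frame holds made-up values for illustration, and `total_alc` is a hypothetical name for the combined alcohol feature.

```python
import pandas as pd

# Made-up ordinal values for illustration only (NOT taken from the dataset)
toy = pd.DataFrame({
    "Medu": [4, 3, 2, 1, 4, 2],
    "Fedu": [4, 3, 1, 1, 3, 2],
    "Dalc": [1, 1, 2, 3, 1, 5],
    "Walc": [1, 2, 3, 4, 1, 5],
})

# Spearman correlation = Pearson correlation computed on the ranks
corr = toy.rank().corr()
print(corr.round(2))

# One possible way to merge the two alcohol features into a single one
toy["total_alc"] = toy["Dalc"] + toy["Walc"]
print(toy["total_alc"].tolist())
```

On the real data the same idea would apply to `df[numerical_cols]`; pairs with a high absolute correlation are candidates for combining or dropping, subject to what the models actually need.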

**Next Steps (Data Cleaning and Preprocessing):**

*   Encode categorical features (binary and multi-class).
*   Address outliers (especially in `absences`).
*   Consider feature scaling for numerical features if using distance-based algorithms (like KNN) or models sensitive to scale (like some Neural Networks, SVMs).
*   Handle class imbalance if initial model performance is poor for the minority class ('no').
*   Potentially perform feature engineering (e.g., combining alcohol consumption) or feature selection.
*   Further investigate correlations between features.
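
As a preview of the outlier and skewness step, the sketch below shows one candidate treatment for a right-skewed count such as `absences`: a `log1p` transform, which is defined at 0 and compresses large values. This is an assumption about one reasonable option, not the treatment this notebook commits to, and `toy_absences` contains made-up values.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only (NOT taken from the dataset)
toy_absences = pd.Series([0, 0, 2, 4, 4, 6, 8, 10, 14, 75], name="absences")

# log1p(x) = log(1 + x): keeps 0 at 0 and shrinks the long right tail
log_absences = np.log1p(toy_absences)

print(f"Skewness before: {toy_absences.skew():.2f}")
print(f"Skewness after:  {log_absences.skew():.2f}")

# The transform is invertible, so original values can be recovered if needed
restored = np.expm1(log_absences)
```

Whether to transform, clip, or leave the feature untouched depends on the model: tree-based models are largely insensitive to monotonic transforms, while distance-based models are not.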



---


# 2. Data Cleaning and Preprocessing


---



Based on the EDA, we will now clean and prepare the data for machine learning modeling. We will work on a copy.

### 2.1 Create a Copy

In [None]:
# Work on a copy to preserve the original data 'df'
df_processed = df.copy()

print("Working on a copy of the original DataFrame.")
print(f"Initial shape for processing: {df_processed.shape}")

### 2.2 Feature Cleaning and Preprocessing

In this section, we will clean and transform the features identified during EDA to make them suitable for machine learning models. This includes encoding categorical variables, handling outliers and skewness in numerical variables, and potentially engineering new features.

#### 2.2.1 Handling Categorical Features

Most machine learning algorithms require numerical input. Therefore, we need to convert categorical features into a numerical format. We'll handle binary features by mapping them to 0/1 and multi-category nominal features using One-Hot Encoding.