#### Preparing Data for Students Dropout Analysis

05_ml_students_dropout

### Summary
This notebook prepares the data for the "Predict Students' Dropout and Academic Success" project. It reads the cleaned data, engineers a new feature, removes the 'Enrolled' class and encodes the remaining target classes as numbers, standardises selected columns, transforms highly skewed quantitative columns, and saves the prepared data for further analysis.

### Steps

1. Read the Cleaned Data
2. Feature Engineering 
3. Convert Target Variable to Numeric
4. Apply data normalisation techniques to quantitative columns
5. Save Prepared Data

In [157]:
# Import required packages

import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder, StandardScaler

#### Reading the Cleaned Data

In [158]:

# Load the dataset into a Pandas dataframe
df_data = pd.read_csv('data/cleaned_data.csv')

### Calculating the Target Data Distribution

In [159]:
# Function to check the distribution of the target variable
def data_distribution(df, target): 
    """
    This function calculates the percentage distribution of each category in the target variable.
    
    Parameters:
    df (pd.DataFrame): The DataFrame containing the data.
    target (str): The name of the target column.

    Returns:
    tuple: A tuple containing the percentage of 'Graduate', 'Dropout', and 'Enrolled' in the target variable.
    """
    
    # Calculate the percentage of students who graduated
    graduated = round(len(df[df[target] == "Graduate"]) / len(df) * 100, 2)
    
    # Calculate the percentage of students who dropped out
    dropped = round(len(df[df[target] == "Dropout"]) / len(df) * 100, 2)
    
    # Calculate the percentage of students who are still enrolled
    enrolled = round(len(df[df[target] == "Enrolled"]) / len(df) * 100, 2)
    
    return graduated, dropped, enrolled

"""
This step calls the data_distribution function, passing the DataFrame and the name of the target column.
It calculates and returns the percentage distribution of 'Graduate', 'Dropout', and 'Enrolled' categories
in the target variable. This is useful to understand the class imbalance in the dataset, which can affect
the performance of machine learning models.
"""

# Example usage of the function
data_distribution(df_data, "Target")


(49.93, 32.12, 17.95)
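
As an alternative to filtering once per category, the same percentages can be obtained with pandas' `value_counts(normalize=True)`, which does not require hard-coding the class names. The following is a minimal sketch on a small made-up frame; `toy` and its values are illustrative assumptions, not the project data.

```python
import pandas as pd

# Hypothetical toy data standing in for df_data; values are illustrative only
toy = pd.DataFrame({"Target": ["Graduate", "Graduate", "Dropout", "Enrolled"]})

# Share of each class as a percentage, rounded to 2 decimals
distribution = (toy["Target"].value_counts(normalize=True) * 100).round(2)

print(distribution.to_dict())
```

Because the result is indexed by class label, any new category in the target column would show up automatically.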

### Feature Engineering

In [160]:
# Create New Features
df_data['age_admission_ratio'] = df_data['Age_at_enrollment'] / df_data['Admission_grade']
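
A ratio feature like this can silently produce infinite values if the denominator is ever zero, since pandas does not raise on division by zero. The sketch below shows one possible guard on made-up data; the `toy` frame and the deliberately bad `0.0` grade are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the 0.0 grade is a deliberately invalid value
toy = pd.DataFrame({
    "Age_at_enrollment": [20, 19, 45],
    "Admission_grade": [127.3, 142.5, 0.0],
})

ratio = toy["Age_at_enrollment"] / toy["Admission_grade"]

# Division by zero yields inf rather than an exception; mark it as missing instead
toy["age_admission_ratio"] = ratio.replace([np.inf, -np.inf], np.nan)

print(toy)
```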

#### Converting Target Variable to Numeric

In [161]:
"""
This step drops columns that are not needed, removes the 'Enrolled' class, and converts the
remaining categorical target values ('Graduate', 'Dropout') into numerical values, which are
required for correlation analysis and machine learning models.
"""

# Drop the 'Nationality' and 'International' columns if they are present
df_data = df_data.drop(columns=['Nationality', 'International'], errors='ignore')

# Remove 'Enrolled' rows from the target; copy so the assignment below does not
# trigger a SettingWithCopyWarning
df_data = df_data[df_data['Target'] != 'Enrolled'].copy()

# Encode the 'Target' column into numerical values
# (LabelEncoder assigns codes in sorted order: 'Dropout' -> 0, 'Graduate' -> 1)
label_encoder = LabelEncoder()
df_data['Target_encoded'] = label_encoder.fit_transform(df_data['Target'])

# Drop the original 'Target' column
df_data = df_data.drop(columns=['Target'])

display(df_data.head())

Unnamed: 0,Marital_Status,Application_mode,Application_order,Course,Daytime_evening_attendance,Previous_qualification,Previous_qualification_(grade),Mothers_qualification,Fathers_qualification,Mothers_occupation,...,Curricular_units_2nd_sem_enrolled,Curricular_units_2nd_sem_evaluations,Curricular_units_2nd_sem_approved,Curricular_units_2nd_sem_grade,Curricular_units_2nd_sem_without_evaluations,Unemployment_rate,Inflation_rate,GDP,age_admission_ratio,Target_encoded
0,1,17,5,171,1,1,122.0,19,12,5,...,2.0,0.0,0.0,6.875,0.0,10.8,1.4,1.74,0.157109,0
1,1,15,1,9254,1,1,160.0,1,3,3,...,6.0,6.0,6.0,13.666667,0.0,13.9,-0.3,0.79,0.133333,1
2,1,1,5,9070,1,1,122.0,37,37,9,...,6.0,0.0,0.0,6.875,0.0,10.8,1.4,1.74,0.152244,0
3,1,17,2,9773,1,1,122.0,38,37,5,...,6.0,10.0,5.0,12.4,0.0,9.4,-0.8,-3.12,0.167224,1
4,2,39,1,8014,0,1,102.5,37,38,9,...,6.0,6.0,6.0,13.0,0.0,13.9,-0.3,0.79,0.240283,1
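
To confirm which integer stands for which class, the fitted encoder's `classes_` attribute can be inspected. This is a minimal sketch on illustrative labels rather than the notebook's `df_data`; the names `labels`, `encoder`, and `mapping` are assumptions made for the example.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels only; the notebook itself encodes df_data['Target']
labels = ["Graduate", "Dropout", "Graduate", "Dropout"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so the position of each class is its integer code
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}

print(mapping)
```

Keeping this mapping alongside the saved data makes it easier to interpret model predictions later.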


#### Normalising/Standardising Data

In [162]:
# Updated list of quantitative columns based on renamed columns
quantitative_cols = ['Curricular_units_1st_sem_credited', 'Curricular_units_1st_sem_enrolled', 'Curricular_units_1st_sem_evaluations',
                     'Curricular_units_1st_sem_approved', 'Curricular_units_1st_sem_grade', 'Curricular_units_1st_sem_without_evaluations',
                     'Curricular_units_2nd_sem_credited', 'Curricular_units_2nd_sem_enrolled', 'Curricular_units_2nd_sem_evaluations',
                     'Curricular_units_2nd_sem_approved', 'Curricular_units_2nd_sem_grade', 'Curricular_units_2nd_sem_without_evaluations',
                     'Age_at_enrollment', 'Inflation_rate', 'GDP', 'Unemployment_rate']


In [163]:
# Standardise selected columns to zero mean and unit variance
scale_cols = ['Admission_grade', 'Age_at_enrollment', 'age_admission_ratio']
scaler = StandardScaler()
df_data[scale_cols] = scaler.fit_transform(df_data[scale_cols])
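
For reference, `StandardScaler` computes a z-score per column using the population standard deviation (`ddof=0`). The sketch below checks this against a manual calculation on made-up grades; `toy` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical admission grades, not taken from the project data
toy = pd.DataFrame({"Admission_grade": [100.0, 120.0, 140.0, 160.0]})

scaled = StandardScaler().fit_transform(toy[["Admission_grade"]])

# Manual z-score with the population standard deviation (ddof=0)
col = toy["Admission_grade"]
manual = (col - col.mean()) / col.std(ddof=0)

print(np.allclose(scaled.ravel(), manual.to_numpy()))
```

Note that standardised columns are centred on zero, so roughly half of their values become negative.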

#### Normalising Quantitative Columns with High Skewness

In [164]:
def Normalise_and_summarize(df, quantitative_cols):
    """
    Transform columns with high skewness and generate numerical summaries for quantitative variables.

    Parameters:
    df (pd.DataFrame): The input DataFrame.
    quantitative_cols (list): List of quantitative column names.

    Returns:
    tuple: (df_normalised, stacked_summary, summary_df_normalised), i.e. the transformed DataFrame,
    the original and normalised summaries stacked vertically with a separator, and the
    normalised summary on its own.
    """
    stats = ['mean', 'std', 'min', '25%', '50%', '75%', 'max']

    # Calculate summary statistics (use the df argument, not the global df_data)
    summary_df = df[quantitative_cols].describe().loc[stats]
    summary_df.loc['skew'] = df[quantitative_cols].skew()

    # Identify columns with high skewness (|skew| > 1)
    high_skew_cols = summary_df.columns[summary_df.loc['skew'].abs() > 1]

    # Skip columns containing negative values (e.g. already-standardised columns),
    # since np.sqrt would turn those values into NaN
    high_skew_cols = [col for col in high_skew_cols if (df[col] >= 0).all()]

    # Duplicate the dataset
    df_normalised = df.copy()

    # Reduce skewness using a square-root transformation
    df_normalised[high_skew_cols] = np.sqrt(df_normalised[high_skew_cols])

    # Alternatively, a log transformation can be used (np.log1p also handles zeros)
    # df_normalised[high_skew_cols] = np.log1p(df_normalised[high_skew_cols])

    # Numerical summaries for Normalised quantitative variables
    summary_df_normalised = df_normalised[quantitative_cols].describe().loc[stats]
    summary_df_normalised.loc['skew'] = df_normalised[quantitative_cols].skew()

    # Stack the two dataframes vertically with a separator
    separator = pd.DataFrame([['----'] * len(quantitative_cols)], columns=quantitative_cols)
    stacked_summary = pd.concat([summary_df, separator, summary_df_normalised], keys=['Original Summary', '----', 'Normalised Summary'])

    return df_normalised, stacked_summary, summary_df_normalised

# Normalise and summarize the data
df_normalised, stacked_summary, summary_df_normalised = Normalise_and_summarize(df_data, quantitative_cols)

# Display the Normalised data
display(df_normalised)

# Number of rows in the normalised data
len(df_normalised)


Unnamed: 0,Marital_Status,Application_mode,Application_order,Course,Daytime_evening_attendance,Previous_qualification,Previous_qualification_(grade),Mothers_qualification,Fathers_qualification,Mothers_occupation,...,Curricular_units_2nd_sem_enrolled,Curricular_units_2nd_sem_evaluations,Curricular_units_2nd_sem_approved,Curricular_units_2nd_sem_grade,Curricular_units_2nd_sem_without_evaluations,Unemployment_rate,Inflation_rate,GDP,age_admission_ratio,Target_encoded
0,1,17,5,171,1,1,122.0,19,12,5,...,2.0,0.0,0.0,6.875000,0.0,10.8,1.4,1.74,-0.452407,0
1,1,15,1,9254,1,1,160.0,1,3,3,...,6.0,6.0,6.0,13.666667,0.0,13.9,-0.3,0.79,-0.915796,1
2,1,1,5,9070,1,1,122.0,37,37,9,...,6.0,0.0,0.0,6.875000,0.0,10.8,1.4,1.74,-0.547237,0
3,1,17,2,9773,1,1,122.0,38,37,5,...,6.0,10.0,5.0,12.400000,0.0,9.4,-0.8,-3.12,-0.255269,1
4,2,39,1,8014,0,1,102.5,37,38,9,...,6.0,6.0,6.0,13.000000,0.0,13.9,-0.3,0.79,1.168633,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4419,1,1,6,9773,1,1,125.0,1,1,5,...,6.0,8.0,5.0,12.666667,0.0,15.5,2.8,-4.06,-0.484105,1
4420,1,1,2,9773,1,1,120.0,1,1,9,...,6.0,6.0,2.0,11.000000,0.0,11.1,0.6,2.02,-0.566397,0
4421,1,1,1,9500,1,1,154.0,37,37,9,...,8.0,9.0,1.0,13.500000,0.0,13.9,-0.3,0.79,0.396566,0
4422,1,1,1,9147,1,1,162.5,37,37,7,...,5.0,6.0,5.0,12.000000,0.0,9.4,-0.8,-3.12,-0.980001,1


3630
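
To illustrate how a square-root or log transform affects skewness, and why these transforms need non-negative input, here is a minimal sketch on a made-up right-skewed series. The values in `s` are assumptions chosen for illustration and are not project data.

```python
import numpy as np
import pandas as pd

# Illustrative right-skewed, non-negative values
s = pd.Series([0, 0, 1, 1, 1, 2, 2, 3, 5, 20], dtype=float)

skew_original = s.skew()
skew_sqrt = np.sqrt(s).skew()
skew_log1p = np.log1p(s).skew()  # log1p handles zeros, unlike np.log

print(round(skew_original, 2), round(skew_sqrt, 2), round(skew_log1p, 2))

# A standardised column contains negative values; their square root is NaN
standardised = (s - s.mean()) / s.std(ddof=0)
with np.errstate(invalid="ignore"):
    n_nan = int(np.sqrt(standardised).isna().sum())

print(n_nan)
```

Both transforms pull in the long right tail. The second part shows that applying `np.sqrt` to a column that has already been standardised introduces missing values, so it is worth checking the transformed data with `isna().sum()`.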

#### Saving Prepared Data

In [165]:
# Save the prepared data
df_normalised.to_csv('data/prep_data.csv', index=False)
print("Prepared data for ML has been saved as 'data/prep_data.csv'.")

Prepared data for ML has been saved as 'data/prep_data.csv'.
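
A quick round-trip check can confirm that the file was written as expected. The sketch below writes a small made-up frame to a temporary directory and reads it back; the `toy` frame and the temporary path are assumptions, and the notebook's own file lives at `data/prep_data.csv`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for df_normalised
toy = pd.DataFrame({"age_admission_ratio": [0.157, 0.133], "Target_encoded": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "prep_data.csv")
    # index=False keeps the row index out of the file
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.shape == toy.shape, list(reloaded.columns) == list(toy.columns))
```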
