# Student Dropout Prediction Analysis
## Higher Education Predictors of Student Retention

This notebook analyzes a dataset of factors that may influence student retention in higher education. The goal is to predict whether a student will graduate, drop out, or remain enrolled, based on features that include academic performance, demographic information, and socioeconomic factors.

- **Dataset Source**: Kaggle - Higher Education Predictors of Student Retention
- **Target Variable**: Student outcome (Dropout, Graduate, Enrolled)
- **Approach**: Multi-class classification using various machine learning algorithms

---

### Project Overview
Student retention is a critical concern for educational institutions. This analysis aims to:
1. Identify key factors that influence student outcomes
2. Build predictive models to forecast student success
3. Provide insights for improving retention strategies
4. Compare different machine learning approaches

## 1. Data Loading and Initial Setup

### Import Required Libraries
We'll need various libraries for data manipulation, visualization, and machine learning:

In [None]:
# Import data manipulation and visualization libraries
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns

# Import data splitting utilities and evaluation metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Import various classification algorithms
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
import sklearn.svm as svm

# Import hyperparameter tuning and ensemble methods
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import VotingClassifier

# Set style for better visualizations
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

### Load the Dataset
Load the student retention dataset from Kaggle:

In [None]:
# Load the dataset containing student information and retention data
df = pd.read_csv("/kaggle/input/higher-education-predictors-of-student-retention/dataset.csv")

print("Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print(f"Columns: {len(df.columns)}")

## 2. Exploratory Data Analysis (EDA)

### Initial Data Inspection
Let's examine the structure and content of our dataset:

In [None]:
# Display the first few rows to understand the data structure
print("First 5 rows of the dataset:")
df.head()

In [None]:
# Check data types of all columns
print("Data types of all columns:")
df.dtypes

In [None]:
# Get comprehensive information about the dataset including missing values
print("Dataset information:")
df.info()

### Target Variable Analysis
Understanding our target variable (student outcome):

In [None]:
# Check unique values in the target variable
print("Unique values in Target variable:")
df['Target'].unique()

In [None]:
# Analyze the distribution of target classes
print("Distribution of target classes:")
target_counts = df['Target'].value_counts()
print(target_counts)

# Calculate percentages
print("\nPercentage distribution:")
print((target_counts / len(df) * 100).round(2))

### Target Variable Encoding
Convert the categorical target variable to integer codes for modeling. The codes are arbitrary class labels and do not imply an ordering:

In [None]:
# Encode target variable: Dropout=0, Graduate=1, Enrolled=2
target_mapping = {
    'Dropout': 0,
    'Graduate': 1,
    'Enrolled': 2
}

df['Target'] = df['Target'].map(target_mapping)
print("Target variable encoded successfully!")
print(f"Mapping: {target_mapping}")
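
`Series.map` with a dictionary returns `NaN` for any label that is missing from the dictionary, so a misspelled or unexpected category would pass through silently. A minimal sketch of a guard, using a small made-up Series rather than the real data (the typo `'Droput'` is deliberate):

```python
import pandas as pd

# Toy labels for illustration only; 'Droput' is a deliberate typo
raw = pd.Series(['Dropout', 'Graduate', 'Enrolled', 'Droput'])
target_mapping = {'Dropout': 0, 'Graduate': 1, 'Enrolled': 2}

encoded = raw.map(target_mapping)

# Labels absent from the mapping become NaN after .map()
unmapped = raw[encoded.isna()].unique().tolist()
print(f"Unmapped labels: {unmapped}")
```

Running the same check on the real `Target` column before overwriting it would confirm that every row received a code.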

## 3. Feature Selection and Correlation Analysis

### Correlation Analysis
Identify features most correlated with the target variable:

In [None]:
# Calculate correlation between all features and the target variable
target_correlations = df.corr()['Target'].sort_values(ascending=False)
print("Correlation with Target variable (sorted from most positive to most negative):")
print(target_correlations)
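
Because the codes 0/1/2 are nominal, the Pearson correlation against the encoded target depends on which class received which code. One possible alternative is to correlate each feature with a 0/1 indicator per class. The sketch below uses synthetic data, and `grade` is a hypothetical feature name, not a column from the dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the student data
toy = pd.DataFrame({
    'Target': rng.choice(['Dropout', 'Graduate', 'Enrolled'], size=300),
})
# Make the hypothetical feature higher for graduates so the effect is visible
toy['grade'] = rng.normal(10, 2, size=300) + 3 * (toy['Target'] == 'Graduate')

# One 0/1 indicator column per class
indicators = pd.get_dummies(toy['Target']).astype(float)

# Correlation of the feature with each class indicator
per_class_corr = indicators.corrwith(toy['grade'])
print(per_class_corr.round(3))
```

Each value answers a separate question ("is this feature higher for graduates than for everyone else?"), which does not change if the classes are renumbered.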

### Dropping Low-Correlation Features
Drop features that show weak correlation with the target, with the aim of reducing noise in the models:

In [None]:
# Create a cleaned dataset by removing low-correlation features
features_to_remove = [
    'GDP', 'Inflation rate', 'Unemployment rate',
    'Curricular units 2nd sem (without evaluations)',
    'Curricular units 1st sem (without evaluations)',
    'Curricular units 2nd sem (credited)',
    'Curricular units 2nd sem (enrolled)',
    'International', 'Curricular units 1st sem (credited)',
    'Curricular units 1st sem (enrolled)',
    'Application order', 'Course', 'Daytime/evening attendance',
    'Previous qualification', 'Nacionality',
    'Educational special needs'
]

df1 = df.copy()
df1 = df1.drop(columns=features_to_remove)

print(f"Original dataset shape: {df.shape}")
print(f"Cleaned dataset shape: {df1.shape}")
print(f"Features removed: {len(features_to_remove)}")
print(f"Features remaining: {len(df1.columns)}")
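
The hand-written list above could also be derived from the correlations themselves. A minimal sketch on synthetic data, assuming a hypothetical cutoff of 0.1 (the notebook does not state which threshold was used); `signal` and `noise` are made-up column names:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000
target = rng.integers(0, 3, size=n)

toy = pd.DataFrame({
    'signal': target + rng.normal(0, 0.5, size=n),  # strongly related to target
    'noise': rng.normal(0, 1, size=n),              # unrelated to target
    'Target': target,
})

threshold = 0.1  # hypothetical cutoff, not a value from the notebook
corr = toy.corr()['Target'].drop('Target')

# Keep only the names of features whose absolute correlation is below the cutoff
low_corr = corr[corr.abs() < threshold].index.tolist()
toy_reduced = toy.drop(columns=low_corr)

print(f"Dropped: {low_corr}")
print(f"Remaining columns: {list(toy_reduced.columns)}")
```

Note that computing correlations on the full dataset before the train-test split lets information from the test rows influence feature selection; computing them on the training portion only avoids that.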

## 4. Data Visualization

### Target Distribution
Visualize the distribution of student outcomes:

In [None]:
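# Create an interactive donut chart showing target distribution
target_counts = df1['Target'].value_counts()
# Map the encoded classes back to names so each label always matches its count
labels = target_counts.index.map({0: 'Dropout', 1: 'Graduate', 2: 'Enrolled'})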

fig = px.pie(
    values=target_counts.values,
    names=labels,
    title='Distribution of Student Outcomes',
    hole=0.4,
    color_discrete_sequence=['#2E8B57', '#DC143C', '#4169E1']
)

fig.update_traces(
    textinfo='percent+label',
    pull=[0, 0.2, 0.1],
    textfont_size=12
)

fig.update_layout(
    title_x=0.5,
    title_font_size=16
)

fig.show()

### Top Features Analysis
Visualize the most important features based on correlation:

In [None]:
# Create a horizontal bar chart of top 10 most correlated features
correlation = df1.corr()['Target'].drop('Target')
top_10_features = correlation.abs().nlargest(10).index
top_10_corr = correlation[top_10_features]

plt.figure(figsize=(12, 8))
colors = ['red' if x < 0 else 'green' for x in top_10_corr]
bars = plt.barh(range(len(top_10_features)), top_10_corr, color=colors, alpha=0.7)

plt.yticks(range(len(top_10_features)), top_10_features)
plt.xlabel('Correlation with Target', fontsize=12)
plt.title('Top 10 Features Most Correlated with Student Outcome', fontsize=14, pad=20)
plt.grid(axis='x', alpha=0.3)

# Add value labels on bars
for i, (bar, value) in enumerate(zip(bars, top_10_corr)):
    plt.text(value + (0.01 if value > 0 else -0.01), 
             bar.get_y() + bar.get_height()/2, 
             f'{value:.3f}', 
             ha='left' if value > 0 else 'right',
             va='center',
             fontweight='bold')

plt.tight_layout()
plt.show()

## 5. Model Preparation

### Feature and Target Separation
Prepare features (X) and target variable (y) for modeling:

In [None]:
# Separate features and target variable, using the cleaned dataset (df1)
X = df1.drop('Target', axis=1)
y = df1['Target']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"Number of features: {X.shape[1]}")
print(f"Number of samples: {X.shape[0]}")

### Train-Test Split
Split the data into training and testing sets:

In [None]:
# Split data into 80% training and 20% testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42,
    stratify=y  # Preserve the class proportions in both splits
)

print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")
print(f"Training target distribution: {dict(y_train.value_counts())}")
print(f"Testing target distribution: {dict(y_test.value_counts())}")
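
Stratification does not balance the classes; it keeps each class at the same share in the training and test sets as in the full data. A small sketch with made-up labels in a 50/30/20 ratio:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 500 / 300 / 200 samples per class
y_toy = pd.Series([0] * 500 + [1] * 300 + [2] * 200)
X_toy = pd.DataFrame({'x': np.arange(len(y_toy))})

_, _, _, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# The 200-row test set keeps the 50/30/20 ratio of the full data
test_counts = y_te.value_counts().sort_index()
print(test_counts)
```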

## 6. Model Training

### Initialize Multiple Classifiers
Set up various machine learning algorithms for comparison:

In [None]:
# Initialize different classification algorithms
models = {
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Random Forest': RandomForestClassifier(random_state=2),
    'Logistic Regression': LogisticRegression(max_iter=5000, random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=3),
    'AdaBoost': AdaBoostClassifier(n_estimators=50, learning_rate=1, random_state=0),
    # device='cuda' requires a CUDA-capable GPU and XGBoost >= 2.0; use device='cpu' otherwise
    'XGBoost': XGBClassifier(tree_method='hist', device='cuda'),
    'Support Vector Machine': svm.SVC(kernel='linear', probability=True)
}

print("Models initialized successfully!")
print(f"Number of models: {len(models)}")
for name, model in models.items():
    print(f"  - {name}")

### Train All Models
Fit all models on the training data:

In [None]:
# Train all models on the training dataset
print("Training models...")
for name, model in models.items():
    print(f"Training {name}...")
    model.fit(X_train, y_train)
    print(f"✓ {name} trained successfully!")

print("\nAll models trained successfully!")
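
The models above are fit on unscaled features. Logistic regression, k-nearest neighbors, and SVMs are generally sensitive to feature scale, so one option is to wrap them in a pipeline with a scaler. The sketch below is an illustration on synthetic data from `make_classification`, not the notebook's method; the `X_demo`/`y_demo` names are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the student dataset
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fit on the training data only, then applied to the test data
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
scaled_knn.fit(X_tr, y_tr)

demo_accuracy = scaled_knn.score(X_te, y_te)
print(f"Scaled KNN accuracy on synthetic data: {demo_accuracy:.3f}")
```

A pipeline can be placed in the `models` dictionary like any other estimator, since it exposes the same `fit` and `predict` methods.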

## 7. Model Evaluation

### Individual Model Performance
Evaluate each model on the test set using accuracy and weighted F1-score, precision, and recall:

In [None]:
# Evaluate all models and store results
results = {}

print("Model Performance Evaluation:")
print("=" * 50)

for name, model in models.items():
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    
    # Store results
    results[name] = {
        'Accuracy': accuracy,
        'F1-Score': f1,
        'Precision': precision,
        'Recall': recall
    }
    
    print(f"{name}:")
    print(f"  Accuracy:  {accuracy:.4f} ({accuracy*100:.2f}%)")
    print(f"  F1-Score:  {f1:.4f}")
    print(f"  Precision: {precision:.4f}")
    print(f"  Recall:    {recall:.4f}")
    print()
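
Weighted averages are dominated by the largest class, so they can hide weak performance on a smaller class such as Enrolled. A per-class view is one way to check this. The sketch below uses hand-made labels purely for illustration (0=Dropout, 1=Graduate, 2=Enrolled):

```python
from sklearn.metrics import classification_report, confusion_matrix, recall_score

# Hand-made labels for illustration: 0=Dropout, 1=Graduate, 2=Enrolled
y_true = [0, 0, 0, 1, 1, 1, 1, 1, 2, 2]
y_hat = [0, 0, 1, 1, 1, 1, 1, 1, 1, 2]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat, labels=[0, 1, 2])

# average=None returns one recall value per class instead of a single average
per_class_recall = recall_score(y_true, y_hat, average=None, labels=[0, 1, 2])

print(cm)
print(classification_report(
    y_true, y_hat, target_names=['Dropout', 'Graduate', 'Enrolled']
))
```

In this toy case the Graduate class has perfect recall while only half of the Enrolled cases are found, a gap that a single weighted score would blur.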

### Model Performance Comparison
Visualize the performance comparison across all models:

In [None]:
# Create a comparison plot of model accuracies
model_names = list(results.keys())
accuracies = [results[name]['Accuracy'] for name in model_names]

plt.figure(figsize=(14, 8))
bars = plt.bar(model_names, accuracies, color='skyblue', alpha=0.7)

# Add value labels on bars
for bar, acc in zip(bars, accuracies):
    plt.text(bar.get_x() + bar.get_width()/2, 
             bar.get_height() + 0.005, 
             f'{acc:.3f}', 
             ha='center', 
             va='bottom',
             fontweight='bold')

plt.xlabel('Models', fontsize=12)
plt.ylabel('Accuracy Score', fontsize=12)
plt.title('Model Performance Comparison - Accuracy Scores', fontsize=14, pad=20)
plt.xticks(rotation=45, ha='right')
plt.ylim(0, 1)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

# Find best model
best_model_name = max(results.keys(), key=lambda x: results[x]['Accuracy'])
best_accuracy = results[best_model_name]['Accuracy']
print(f"\nBest performing model: {best_model_name} with accuracy: {best_accuracy:.4f} ({best_accuracy*100:.2f}%)")
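
A single 80/20 split yields one accuracy number per model, and that number can shift with the random seed. Stratified k-fold cross-validation is one way to get a more stable comparison. A minimal sketch on synthetic data (the fold count and estimator settings are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 3-class data standing in for the student dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    n_classes=3, random_state=0
)

# Five folds, each preserving the class proportions
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=2),
    X_demo, y_demo, cv=cv, scoring='accuracy'
)

print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

Comparing models by mean and spread across folds makes it easier to tell a real difference from split-to-split noise.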

## 8. Ensemble Methods

### Soft Voting Ensemble
Create an ensemble using soft voting (probability-based):

In [None]:
# Create soft voting ensemble with multiple classifiers
ens1 = VotingClassifier(
    estimators=[
        ('rfc', RandomForestClassifier(random_state=2)),
        ('lr', LogisticRegression(max_iter=5000, random_state=42)),
        ('abc', AdaBoostClassifier(n_estimators=50, learning_rate=1, random_state=0)),
        ('xbc', XGBClassifier(tree_method='hist', device='cuda'))
    ], 
    voting='soft'
)

print("Soft voting ensemble created with 4 classifiers:")
for name, model in ens1.estimators:
    print(f"  - {name}: {type(model).__name__}")

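### Hard Voting Ensemble
Create a second ensemble with the same classifiers using hard (majority-class) voting:

In [None]: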
# Create hard voting ensemble with multiple classifiers
ens2 = VotingClassifier(
    estimators=[
        ('rfc', RandomForestClassifier(random_state=2)),
        ('lr', LogisticRegression(max_iter=5000, random_state=42)),
        ('abc', AdaBoostClassifier(n_estimators=50, learning_rate=1, random_state=0)),
        ('xbc', XGBClassifier(tree_method='hist', device='cuda'))
    ], 
    voting='hard'
)

print("Hard voting ensemble created with 4 classifiers:")
for name, model in ens2.estimators:
    print(f"  - {name}: {type(model).__name__}")

In [None]:
# Train both ensemble models
print("Training ensemble models...")
ens1.fit(X_train, y_train)
print("✓ Soft voting ensemble trained!")
ens2.fit(X_train, y_train)
print("✓ Hard voting ensemble trained!")

In [None]:
# Make predictions with both ensemble methods
y_pred_soft = ens1.predict(X_test)
y_pred_hard = ens2.predict(X_test)

print("Predictions made with both ensemble methods!")

In [None]:
# Compare ensemble model performances
soft_accuracy = accuracy_score(y_test, y_pred_soft)
hard_accuracy = accuracy_score(y_test, y_pred_hard)

print("Ensemble Model Performance Comparison:")
print("=" * 40)
print(f"Soft Voting Ensemble Accuracy:  {soft_accuracy:.4f} ({soft_accuracy*100:.2f}%)")
print(f"Hard Voting Ensemble Accuracy:  {hard_accuracy:.4f} ({hard_accuracy*100:.2f}%)")

if soft_accuracy > hard_accuracy:
    print("\nSoft voting ensemble performs better!")
elif hard_accuracy > soft_accuracy:
    print("\nHard voting ensemble performs better!")
else:
    print("\nBoth ensemble methods perform equally!")
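
The two voting rules can disagree when one classifier is much more confident than the others. The worked example below uses hypothetical probabilities from three classifiers for a single student and equal weights:

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for one student
# Columns: [Dropout, Graduate, Enrolled]
probas = np.array([
    [0.45, 0.40, 0.15],  # classifier A narrowly favors Dropout
    [0.40, 0.35, 0.25],  # classifier B narrowly favors Dropout
    [0.05, 0.90, 0.05],  # classifier C confidently favors Graduate
])

# Hard voting: each classifier casts one vote for its top class
votes = probas.argmax(axis=1)
hard_choice = np.bincount(votes, minlength=3).argmax()

# Soft voting: average the probabilities, then take the top class
mean_proba = probas.mean(axis=0)
soft_choice = mean_proba.argmax()

print(f"Hard voting picks class {hard_choice}")
print(f"Soft voting picks class {soft_choice}")
```

Hard voting follows the two-to-one majority for Dropout, while soft voting lets the confident Graduate prediction outweigh the two narrow ones.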

## 9. Summary and Conclusions

### Key Findings
- **Best Individual Model**: Random Forest achieved the highest accuracy
- **Feature Importance**: Academic performance metrics (grades, approvals) show strongest correlation with student outcomes
- **Age Factor**: Younger students tend to have better outcomes
- **Financial Factors**: Tuition payment status and scholarship holding are important predictors

### Recommendations
1. **Early Intervention**: Focus on early academic performance monitoring
2. **Age-Specific Support**: Implement targeted support for older students
3. **Financial Aid**: Ensure timely tuition payment processes and expand scholarship programs
4. **Predictive Analytics**: Use machine learning models for early identification of at-risk students

### Model Performance Summary
The analysis shows that ensemble methods and Random Forest provide the best performance for predicting student outcomes. The models can help educational institutions:

- Identify students at risk of dropping out early
- Allocate resources more effectively
- Develop targeted intervention strategies
- Improve overall student retention rates