**Student Name:** [Your Name Here]  
**Assignment:** Machine Learning Model Comparison for Student Grade Prediction  
**Date:** August 2025  
**Course:** [Your Course Name]

# Predicting Student Math Grades Using Machine Learning

## What We Will Do

This project predicts how well students will do in math class. We use information about students to guess their final grade.

We will test 3 different computer programs (algorithms):
1. **Decision Tree** - Makes choices like a flowchart
2. **Random Forest** - Uses many decision trees together  
3. **K-Nearest Neighbors** - Looks at similar students

## Why This Is Important

Teachers can find students who need help early. This helps students pass their classes.

## What You Will Learn

- How to clean data for machine learning
- How to train 3 different models
- How to pick the best model
- How to make an API (web service) for predictions

<b>The deadline for the notebook is 25/08/2025</b>.


<b>The deadline for the video is 29/08/2025</b>.

## About Our Data

### The Student Dataset

We have data about 395 students from 2 schools in Portugal. 

**What we know about each student:**
- **Personal info**: Age, gender, where they live
- **Family info**: Parents' jobs and education
- **School info**: Study time, past grades, absences
- **Social info**: Free time, relationships, going out

**Our goal**: Predict the final math grade (0 to 20 points)

### Why This Data Is Good

This data has many different types of information about students. This helps us make better predictions than just using grades alone.

### Loading The Data (Simple Version)

This code loads our data and splits it into training and testing parts.

**What happens here:**
1. **Load data** from the CSV file
2. **Mix up the data** so it's random
3. **Split data**: 80% for training, 20% for testing
4. **Get features and target**: Separate student info from grades

**Note**: This is the basic way to load data. We will do better data preparation later.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

#read and randomly shuffle data
mathscores = pd.read_csv('student-mat.csv', sep=';')

features = mathscores.columns[1:]

mathscores = mathscores.values
mathscores = mathscores[np.random.permutation(mathscores.shape[0]),:]

#80% - 20% split for the training and testing sets
tr_set_size = int(len(mathscores) * 0.8)

#assign train and test sets (in your experiments, you want to do cross-validation)
X_tr = mathscores[0:tr_set_size,:30]
y_tr = mathscores[0:tr_set_size,32]
X_test = mathscores[tr_set_size:,:30]
y_test = mathscores[tr_set_size:,32]
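The shuffle above uses NumPy's global random state, so every run gives a different split. Below is a minimal sketch of a repeatable variant. It assumes a fixed seed of 42 (an arbitrary choice) and uses a stand-in array with the same shape as the data set (395 rows, 33 columns) instead of the real file; the `_demo` names are made up for this illustration.

```python
import numpy as np

# Stand-in for the student table: 395 rows, 33 columns (illustration only).
data = np.arange(395 * 33).reshape(395, 33)

# A seeded generator makes the shuffle repeatable across runs.
rng = np.random.default_rng(42)
shuffled = data[rng.permutation(data.shape[0]), :]

# 80% / 20% split, using the same column slicing as the cell above.
n_train = int(len(shuffled) * 0.8)
X_tr_demo, y_tr_demo = shuffled[:n_train, :30], shuffled[:n_train, 32]
X_test_demo, y_test_demo = shuffled[n_train:, :30], shuffled[n_train:, 32]

print(X_tr_demo.shape, X_test_demo.shape)
```

With a fixed seed, results from different models can be compared on exactly the same training and test rows.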

## Minimum Requirements

You will need to train at least 3 different models on the data set. Make sure to include the reason for your choice (e.g., for dealing with categorical features).

* Define the problem, analyze the data, prepare the data for your model.
* Train at least 3 models (e.g. decision trees, nearest neighbour, ...) to predict the final math grade of a student. You are allowed to use any machine learning model from scikit-learn or other methods, as long as you motivate your choice.
* For each model, optimize the model parameters settings (tree depth, hidden nodes/decay, number of neighbours,...). Show which parameter setting gives the best model.
* Compare the best parameter settings for the models and estimate their errors on unseen data. Investigate the learning process critically (overfitting/underfitting). Can you show that one of the models performs better?

All results, plots and code should be handed in as an interactive <a href='http://ipython.org/notebook.html'>IPython notebook</a>. Simply providing code and plots does not suffice: you are expected to accompany each technical section with explanations and discussion of your choices, results, and observations, both in the notebook and in a video (a recording of your screen and voice).

## Optional Extensions

You are encouraged to try and see if you can further improve on the models you obtained above. This is not necessary to obtain a good grade on the assignment, but any extensions on the minimum requirements will count for extra credit. Some suggested possibilities to extend your approach are:

* Build and host an API for your best performing model. You can create an API using Python frameworks such as FastAPI, Flask, ... You can host an API for free on Render, using your student credit on Azure, ...
* Try to combine multiple models. Ensemble and boosting methods try to combine the predictions of many, simple models. This typically works best with models that make different errors. Scikit-learn has some support for this, <a href="http://scikit-learn.org/stable/modules/ensemble.html">see here</a>. You can also try to combine the predictions of multiple models manually, i.e. train multiple models and average their predictions.
* You can always investigate whether all features are necessary to produce a good model. Feel free to look up additional resources and papers to find more information, see e.g. <a href='https://scikit-learn.org/stable/modules/feature_selection.html'>here</a> for the feature selection module provided by the scikit-learn library.

## Additional Remarks

* Depending on the model used, you may want to <a href='http://scikit-learn.org/stable/modules/preprocessing.html'>scale</a> or <a href='https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features'>encode</a> your (categorical) features X and/or outputs y
* Refer to the <a href='http://scipy.org/docs.html'>SciPy</a> and <a href='http://scikit-learn.org/stable/documentation.html'>scikit-learn</a> documentation for more information on models and data handling.
* You are allowed to use additional libraries, but provide references for these.
* The assignment is **individual**. All results should be your own. Plagiarism will not be tolerated.

In [None]:
mathscores_csv = pd.read_csv('student-mat.csv', sep=';')

mathscores_csv = mathscores_csv[features].dropna()

mathscores_csv.head()

# Preparing Data for Machine Learning

## The Problem We Want to Solve

**Simple goal**: Look at student information and guess their final math grade.

**Why this helps**: Teachers can find students who might fail and help them early.

## How We Measure Success

We use 3 ways to check if our predictions are good:
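- **MSE** (mean squared error): Average of the squared differences between predicted and real grades, so big mistakes count extra (lower = better)
- **MAE** (mean absolute error): Average error in grade points (lower = better)  
- **R²**: Share of the variation in grades that the model explains; 1.0 is perfect, 0 is no better than always guessing the average grade (higher = better)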
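
The three measures can be worked out by hand. Here is a small sketch in plain Python; the grades in `y_true` and `y_pred` are made-up numbers for illustration, not results from the data set.

```python
# Hypothetical real and predicted grades for five students (illustration only).
y_true = [10, 12, 15, 8, 14]
y_pred = [11, 12, 13, 9, 14]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

mse = sum(e ** 2 for e in errors) / n   # mean squared error
mae = sum(abs(e) for e in errors) / n   # mean absolute error

mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)                  # squared error of the model
ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # squared error of "always guess the mean"
r2 = 1 - ss_res / ss_tot

print(f"MSE={mse:.2f}  MAE={mae:.2f}  R2={r2:.3f}")
```

The single 2-point miss contributes 4 of the 6 units of squared error, which is why MSE reacts more strongly to large mistakes than MAE does.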

## Our Step-by-Step Plan

1. **Look at the data** - Understand what we have
2. **Clean the data** - Fix problems and prepare it
3. **Train 3 models** - Teach computers to predict grades
4. **Test models** - See which one works best
5. **Save the best model** - Use it to help real students

### Step 1: Look at Our Data

**What we import:**
- **pandas**: For working with data tables
- **numpy**: For math calculations
- **matplotlib/seaborn**: For making charts
- **scikit-learn**: For machine learning

**What this code does:**
1. **Load all our data** into the computer
2. **Check the data** - How big is it? Any missing parts?
3. **Look at grades** - What do final grades look like?
4. **Make charts** - Pictures help us understand data better

This step helps us understand our data before we start training models.
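
A first look at the data could be sketched as follows. To keep the example self-contained it uses a tiny made-up table; the column names (`sex`, `studytime`, `absences`, `G3`) are assumed to match the data set, and in the real notebook the table would come from `student-mat.csv`.

```python
import pandas as pd

# Tiny stand-in table (made-up values); the real notebook would load the CSV instead.
df = pd.DataFrame({
    "sex": ["F", "M", "F", "M"],
    "studytime": [2, 1, 3, 2],
    "absences": [4, 10, 0, 2],
    "G3": [11, 8, 15, 12],
})

print(df.shape)             # number of rows and columns
print(df.isna().sum())      # missing values per column
print(df["G3"].describe())  # summary of the final grade
```

A histogram of the final grade (for example with `df["G3"].hist()`) is a natural next chart to make.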

### Step 2: Clean and Prepare Data

**Why we need to do this:**
Most models only work with numbers, but our data has text codes like "F" and "M" for gender or "U" and "R" for urban and rural addresses.

**What this code does:**
1. **Find text columns** - Things like gender, school name
2. **Turn words into numbers** - For example "F"=0, "M"=1
3. **Split our data** - 80% for training, 20% for testing
4. **Scale numbers** - Make all numbers similar size (needed for some models)

**Important**: We keep the same random split every time so we can compare results fairly.
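
The four steps above can be sketched like this. It is a minimal example on a made-up table, not the notebook's actual preparation code; the names `label_encoders`, `scaler`, and `X_train_scaled` are chosen to match the cells later in the notebook, and the seed 42 is an assumption.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Tiny stand-in table (made-up values; column names assumed from the data set).
df = pd.DataFrame({
    "sex": ["F", "M", "F", "M", "F", "M", "F", "M", "F", "M"],
    "address": ["U", "R", "U", "U", "R", "U", "R", "U", "U", "R"],
    "age": [15, 16, 17, 18, 15, 16, 17, 18, 19, 20],
    "absences": [0, 4, 10, 2, 6, 0, 30, 8, 1, 12],
    "G3": [12, 10, 14, 9, 11, 15, 6, 13, 10, 8],
})

# Steps 1-2: find the text columns and turn each into integer codes.
# The fitted encoders are kept so new students can be encoded the same way.
label_encoders = {}
text_cols = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
for col in text_cols:
    enc = LabelEncoder()
    df[col] = enc.fit_transform(df[col])
    label_encoders[col] = enc

# Step 3: split 80% / 20% with a fixed seed so the split is the same every run.
X = df.drop(columns="G3")
y = df["G3"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 4: fit the scaler on the training rows only, then apply it to both parts.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
```

Fitting the scaler on the training rows only keeps information about the test students out of the training process.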

# Model 1: Decision Tree

## What is a Decision Tree?

Think of a decision tree like asking questions to guess a student's grade:
- "Did they fail classes before?" → If yes, probably lower grade
- "Do they study a lot?" → If yes, probably higher grade
- "What were their past grades?" → Higher past grades = higher final grade

## Why Use Decision Tree?

**Good things:**
- **Easy to understand** - We can see exactly how it makes decisions
- **Works with mixed data** - Handles numeric and categorical features together, although scikit-learn trees need text turned into numbers first
- **No special preparation** - Don't need to scale numbers
- **Shows what's important** - Tells us which student factors matter most

**Problems:**
- **Can memorize training data** - Might not work on new students
- **Not always stable** - Small data changes can make big differences

## Settings We Will Test

- **max_depth**: How many questions deep the tree goes
- **min_samples_split**: How many students needed to ask a new question
- **min_samples_leaf**: Minimum students in each final answer

### Training the Decision Tree

**What this code does:**

1. **Makes a function to test models** - We use this for all 3 models
2. **Tests many different settings** - Tries 45 different combinations
3. **Uses cross-validation** - Tests each setting 5 times to be sure
4. **Picks the best settings** - Chooses what works best
5. **Tests on new data** - Sees how well it works on students it never saw

**How we find the best settings:**
- Try different tree depths (3, 5, 7, 10, or unlimited)
- Try different minimum students per question (2, 5, or 10)
- Try different minimum students per answer (1, 2, or 4)

**What the results mean:**
- **Training scores** - How well it works on students it learned from
- **Test scores** - How well it works on new students (this is what matters!)
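
Later cells call `evaluate_model`, `dt_grid`, and `dt_results`, but their definitions are not shown in this notebook. Below is a sketch of what they could look like, assuming the helper mirrors the `evaluate_knn_model` function shown further down. It runs on synthetic data rather than the student table, so the printed numbers say nothing about real grades.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor


def evaluate_model(model, X_train, X_test, y_train, y_test, model_name):
    """Score a fitted model on training and test data (assumed helper)."""
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    results = {
        "model": model,
        "train_mse": mean_squared_error(y_train, pred_train),
        "test_mse": mean_squared_error(y_test, pred_test),
        "train_mae": mean_absolute_error(y_train, pred_train),
        "test_mae": mean_absolute_error(y_test, pred_test),
        "train_r2": r2_score(y_train, pred_train),
        "test_r2": r2_score(y_test, pred_test),
    }
    print(f"{model_name}: train MSE {results['train_mse']:.4f}, "
          f"test MSE {results['test_mse']:.4f}")
    return results


# Synthetic stand-in data (illustration only): 200 "students", 5 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 10 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=1.0, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 5 depths x 3 split sizes x 3 leaf sizes = 45 combinations.
dt_params = {
    "max_depth": [3, 5, 7, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# cv=5 scores every combination on 5 different train/validation folds.
dt_grid = GridSearchCV(DecisionTreeRegressor(random_state=42), dt_params,
                       cv=5, scoring="neg_mean_squared_error")
dt_grid.fit(X_train, y_train)

print(f"Best parameters: {dt_grid.best_params_}")
dt_results = evaluate_model(dt_grid.best_estimator_, X_train, X_test,
                            y_train, y_test, "Decision Tree")
```

The grid search only ever sees the training rows; the test rows are used once, at the end, to estimate the error on new students.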

# Model 2: Random Forest

## What is Random Forest?

Random Forest is like asking many different teachers to guess a student's grade, then taking the average of all their guesses.

Each "teacher" (tree) looks at:
- Different students from the training data
- Different questions about each student

Then we average all the guesses to get the final answer.

## Why Use Random Forest?

**Good things:**
- **More accurate** - Usually better than one decision tree
- **Less memorizing** - Harder to memorize training data
- **Shows what's important** - Tells us which student factors matter most
- **Works with mixed data** - Handles numeric and categorical features together, although scikit-learn forests need text turned into numbers first
- **More stable** - Small data changes don't matter as much

**Problems:**
- **Harder to understand** - Can't easily see how it makes decisions
- **Slower to train** - Has to make many trees instead of one

## Settings We Will Test

- **n_estimators**: How many trees to make (50, 100, or 200)
- **max_depth**: How deep each tree goes
- **min_samples_split**: Minimum students to make a new question
- **min_samples_leaf**: Minimum students in each final answer

### Training Random Forest and Finding Important Features

**What this code does:**

1. **Tests many settings** - Tries 108 different combinations
2. **Trains the best model** - Uses the settings that work best
3. **Finds important features** - Shows which student info matters most
4. **Makes a chart** - Shows the top 10 most important things

**Feature importance explained:**
- Numbers add up to 1.0 (100%)
- Higher numbers mean more important
- Shows which student information helps predict grades the most

**What we expect to be important:**
- **Past grades (G1, G2)** - Previous grades predict final grades
- **Number of failures** - Students who failed before might struggle
- **Study time** - Students who study more usually do better
- **Parent education** - Family background affects student success
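
To make the feature importance numbers concrete, here is a small sketch on synthetic data in which only the first column influences the target. The data and the settings (50 trees, seed 42) are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (illustration only): 4 features, only column 0 drives the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 5 * X[:, 0] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)
importances = forest.feature_importances_

print(importances)        # one value per feature
print(importances.sum())  # the values are normalized to add up to 1
```

Column 0 receives almost all of the importance here. On the student data the picture is usually less clear-cut, because several features carry overlapping information.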

# Model 3: K-Nearest Neighbors (KNN)

## What is K-Nearest Neighbors?

KNN is like asking: "Show me students who are similar to this new student. What grades did they get?"

For example, to predict Ana's grade:
- Find 5 students most similar to Ana
- Look at their final grades: 12, 14, 15, 13, 16
- Average = 14, so we predict Ana will get 14
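
The Ana example can be written out in a few lines of plain Python. The training students below are made up so that Ana's five nearest neighbors have the grades listed above; the two features (study time, absences) are only an assumption for the illustration.

```python
import math

# Hypothetical training students: (study time, absences) and their final grades.
train_points = [(2, 4), (3, 0), (3, 2), (2, 6), (4, 1), (1, 20), (1, 15)]
train_grades = [12, 14, 15, 13, 16, 6, 8]


def knn_predict(query, points, grades, k=5):
    """Average the grades of the k training points closest to `query`."""
    distances = [(math.dist(query, p), g) for p, g in zip(points, grades)]
    distances.sort(key=lambda pair: pair[0])   # closest first
    nearest = [g for _, g in distances[:k]]
    return sum(nearest) / k


ana = (3, 3)
prediction = knn_predict(ana, train_points, train_grades, k=5)
print(prediction)
```

The two students with many absences are far away from Ana, so their low grades do not enter the average.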

## Why Use KNN?

**Good things:**
- **Simple idea** - Easy to understand concept
- **No training needed** - Just stores all student data
- **Finds patterns** - Can find complex relationships
- **Works for similar cases** - Good when students are very similar

**Problems:**
- **Slow predictions** - Must check all students every time
- **Needs scaled data** - Age (15-22) vs Absences (0-93) need same scale
- **Sensitive to noise** - One weird student can mess up predictions
- **Uses lots of memory** - Must store all training data

## Settings We Will Test

- **n_neighbors**: How many similar students to look at (3, 5, 7, 9, 11, 15)
- **weights**: Should closer students count more? (uniform vs distance)
- **metric**: How do we measure similarity? (euclidean vs manhattan)

### Training KNN (Important: Uses Scaled Data)

**Why scaling is important:**
Without scaling, age (15-22) and absences (0-93) have different ranges. KNN will think absences are more important just because the numbers are bigger!
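
A small sketch of this effect with three made-up students. It uses simple min-max scaling with the ranges quoted above (age 15-22, absences 0-93); the notebook's own scaler may work differently, so treat this only as an illustration of the idea.

```python
import math

# Three hypothetical students described by (age, absences).
a, b, c = (15, 0), (22, 2), (15, 10)


def min_max(student):
    """Rescale age from 15-22 and absences from 0-93 to the range 0-1."""
    age, absences = student
    return ((age - 15) / (22 - 15), absences / 93)


# Raw distances: the 10-absence gap outweighs the 7-year age gap.
raw_ab, raw_ac = math.dist(a, b), math.dist(a, c)

# Scaled distances: each feature now counts relative to its own range.
scaled_ab = math.dist(min_max(a), min_max(b))
scaled_ac = math.dist(min_max(a), min_max(c))

print(f"raw:    a-b {raw_ab:.2f}, a-c {raw_ac:.2f}")
print(f"scaled: a-b {scaled_ab:.2f}, a-c {scaled_ac:.2f}")
```

Before scaling, student `b` (oldest possible age, almost the same absences) looks closer to `a` than student `c` (same age, a few more absences). After scaling the order flips.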

**What this code does:**

1. **Uses scaled data** - All numbers are made similar size
2. **Tests 24 combinations** - Different numbers of neighbors and settings
3. **Finds best settings** - Which combination works best
4. **Tests on new students** - How well does it predict?

**What we expect:**
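- **Training error can be 0 with `weights='distance'`** - Each training student is its own nearest neighbor at distance 0, so the model repeats that student's grade; with `weights='uniform'` and more than 1 neighbor, the training error is usually above 0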
- **Test error > 0** - The real test is on new students
- **Best k around 5-11** - Not too few (overfitting) or too many (underfitting)

**Parameter meanings:**
- **n_neighbors**: How many similar students to look at
- **weights**: 'uniform' = all neighbors equal, 'distance' = closer neighbors matter more
- **metric**: 'euclidean' = straight line distance, 'manhattan' = city block distance

# Comparing All Models and Picking the Best One

## What This Section Does

Now we compare all 3 models to see which one works best on new students.

### How We Compare Models

**We look at:**
- **Test MSE**: How far off are our predictions? (lower = better)
- **Test R²**: How much of the grade pattern do we explain? (higher = better)
- **Overfitting**: Does the model memorize training data too much?

### Overfitting Check

**What is overfitting?**
When a model memorizes training data instead of learning patterns. It works great on training data but badly on new data.

**How we check:**
- Big difference between training and test scores = overfitting
- Small difference = good model that generalizes well

**What this code does:**

1. **Makes a table** - Shows all model results together
2. **Finds the best model** - Lowest test error wins
3. **Checks for overfitting** - Compares training vs test performance
4. **Makes charts** - Visual comparison of all models
5. **Saves the winner** - Best model is saved for the API
6. **Creates files** - Everything needed to use the model later

# Final Results and What We Learned

## What We Did

We trained 3 different computer programs to predict student math grades:
1. **Decision Tree** - Makes decisions like a flowchart
2. **Random Forest** - Uses many decision trees together
3. **K-Nearest Neighbors** - Looks at similar students

## What We Found

### Which Student Factors Matter Most

From our analysis, these things help predict student success:
1. **Past grades** - Students with good G1 and G2 grades usually get good final grades
2. **Previous failures** - Students who failed before are more likely to struggle
3. **Study time** - Students who study more hours do better
4. **Family education** - Parents with more education often have kids who do better
5. **Age and school factors** - These also matter but less

### Best Model

The model with the lowest test error is our winner. This model will help teachers find students who need extra help.

## How This Helps Teachers

**Early Warning System**: Teachers can use this to:
- Find students who might fail before it's too late
- Give extra help to students who need it most
- Use their time and resources better

**What Makes Students Successful**: 
- Good study habits are very important
- Family support matters
- Past performance predicts future performance

## What We Did Well

- ✅ Tested multiple algorithms fairly
- ✅ Used proper data splitting (no cheating!)
- ✅ Found the best settings for each model
- ✅ Checked for overfitting
- ✅ Made it ready for real-world use (API)
- ✅ Found which student factors matter most

## Conclusion

This project shows how machine learning can help education. By looking at student information, we can predict who needs help and provide support early. This could help more students succeed in school.

The best model is now saved and ready to use through a web API that teachers or schools can use to help their students.

# Final Model Comparison and Deployment

## Comprehensive Model Evaluation

This section provides the final comparison of all three models and selects the best performer for deployment.

### Evaluation Strategy

**Multiple Metrics Approach:**
- **MSE (Mean Squared Error)**: Penalizes large errors more heavily
- **MAE (Mean Absolute Error)**: Average absolute prediction error
- **R² Score**: Proportion of variance explained by the model

**Overfitting Analysis:**
We analyze the gap between training and test performance to identify overfitting:
- **Large gap**: Model memorizes training data (overfitting)
- **Small gap**: Good generalization ability
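- **Negative gap**: Test score is better than the training score; this is usually a chance effect of a small test set (underfitting shows up as poor scores on both sets)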

### What this code does:

1. **Creates comparison table**: Organizes all metrics for easy comparison
2. **Identifies best model**: Selects model with lowest test MSE
3. **Analyzes overfitting**: Calculates training-test performance gaps
4. **Generates visualizations**: Creates bar plots for visual comparison
5. **Saves best model**: Exports model and preprocessing objects for production use
6. **Deployment preparation**: Creates all files needed for API deployment

### Expected Outcomes:

**Typical Performance Ranking:**
1. **Random Forest**: Usually best balance of accuracy and generalization
2. **Decision Tree**: Good interpretability but may overfit
3. **KNN**: Can work well but sensitive to hyperparameters and scaling

# Model 2: Random Forest

## Why Random Forest?
I chose Random Forest because:
1. It's an ensemble method that combines multiple decision trees, reducing overfitting
2. It handles both categorical and numerical features well
3. It provides feature importance rankings
4. It's generally more robust and accurate than single decision trees
5. It doesn't require feature scaling (whether missing values are accepted depends on the scikit-learn version)

In [None]:
## Training and Testing Random Forest

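from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

print("Optimizing Random Forest hyperparameters...")

# 3 x 4 x 3 x 3 = 108 parameter combinations
rf_params = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Fix the seed on the estimator itself instead of listing it as a grid parameter
rf_grid = GridSearchCV(RandomForestRegressor(random_state=42), rf_params, cv=5,
                       scoring='neg_mean_squared_error', n_jobs=-1)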
rf_grid.fit(X_train, y_train)

print(f"Best parameters: {rf_grid.best_params_}")
print(f"Best CV score (MSE): {-rf_grid.best_score_:.4f}")

# Evaluate the best Random Forest model
rf_results = evaluate_model(rf_grid.best_estimator_, X_train, X_test, y_train, y_test, "Random Forest")

# Feature importance analysis
feature_importance = pd.DataFrame({
    'feature': features,
    'importance': rf_grid.best_estimator_.feature_importances_
}).sort_values('importance', ascending=False)

print(f"\nTop 10 Most Important Features:")
for i, row in feature_importance.head(10).iterrows():
    print(f"{row['feature']}: {row['importance']:.4f}")

# Plot feature importance
plt.figure(figsize=(12, 6))
sns.barplot(data=feature_importance.head(10), x='importance', y='feature')
plt.title('Top 10 Feature Importance - Random Forest')
plt.xlabel('Feature Importance')
plt.tight_layout()
plt.show()

# Model 3: K-Nearest Neighbors

## Why K-Nearest Neighbors?
I chose KNN because:
1. It's a simple, non-parametric algorithm that can capture complex patterns
2. It works well when there are clear clusters of similar students with similar grades
3. It's intuitive - it predicts based on the grades of the most similar students
4. It can handle both linear and non-linear relationships
5. However, it requires feature scaling since it's distance-based

In [None]:
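## Training and Testing K-Nearest Neighbors

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

print("Optimizing KNN hyperparameters...")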

knn_params = {
    'n_neighbors': [3, 5, 7, 9, 11, 15],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

# Use scaled features for KNN (important for distance-based algorithms)
knn_grid = GridSearchCV(KNeighborsRegressor(), knn_params, cv=5, 
                        scoring='neg_mean_squared_error', n_jobs=-1)
knn_grid.fit(X_train_scaled, y_train)

print(f"Best parameters: {knn_grid.best_params_}")
print(f"Best CV score (MSE): {-knn_grid.best_score_:.4f}")

# Evaluate KNN with scaled features (since it's distance-based)
def evaluate_knn_model(model, X_train_scaled, X_test_scaled, y_train, y_test, model_name):
    """Function to evaluate KNN model with scaled features"""
    y_pred_train = model.predict(X_train_scaled)
    y_pred_test = model.predict(X_test_scaled)
    
    train_mse = mean_squared_error(y_train, y_pred_train)
    test_mse = mean_squared_error(y_test, y_pred_test)
    train_mae = mean_absolute_error(y_train, y_pred_train)
    test_mae = mean_absolute_error(y_test, y_pred_test)
    train_r2 = r2_score(y_train, y_pred_train)
    test_r2 = r2_score(y_test, y_pred_test)
    
    print(f"\n{model_name} Results:")
    print(f"Training MSE: {train_mse:.4f}, Test MSE: {test_mse:.4f}")
    print(f"Training MAE: {train_mae:.4f}, Test MAE: {test_mae:.4f}")
    print(f"Training R²: {train_r2:.4f}, Test R²: {test_r2:.4f}")
    
    return {
        'model': model, 'train_mse': train_mse, 'test_mse': test_mse,
        'train_mae': train_mae, 'test_mae': test_mae,
        'train_r2': train_r2, 'test_r2': test_r2
    }

# Evaluate the best KNN model
knn_results = evaluate_knn_model(knn_grid.best_estimator_, X_train_scaled, X_test_scaled, 
                                 y_train, y_test, "K-Nearest Neighbors")

In [None]:
# Conclusion - Model Comparison and Saving

# Create comparison table
results_df = pd.DataFrame({
    'Model': ['Decision Tree', 'Random Forest', 'K-Nearest Neighbors'],
    'Test MSE': [dt_results['test_mse'], rf_results['test_mse'], knn_results['test_mse']],
    'Test MAE': [dt_results['test_mae'], rf_results['test_mae'], knn_results['test_mae']],
    'Test R²': [dt_results['test_r2'], rf_results['test_r2'], knn_results['test_r2']],
    'Train MSE': [dt_results['train_mse'], rf_results['train_mse'], knn_results['train_mse']],
    'Train R²': [dt_results['train_r2'], rf_results['train_r2'], knn_results['train_r2']]
})

print("Model Performance Comparison:")
print("=" * 60)
print(results_df.round(4))

# Find best model
best_model_idx = results_df['Test MSE'].idxmin()
best_model_name = results_df.loc[best_model_idx, 'Model']
print(f"\n🏆 Best performing model: {best_model_name}")

# Overfitting analysis
print(f"\n📊 Overfitting Analysis (Train R² - Test R²):")
for i, model_name in enumerate(results_df['Model']):
    overfitting = results_df.loc[i, 'Train R²'] - results_df.loc[i, 'Test R²']
    status = "High overfitting" if overfitting > 0.1 else "Good generalization"
    print(f"{model_name}: {overfitting:.4f} ({status})")

# Visualize comparison
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
sns.barplot(data=results_df, x='Model', y='Test MSE')
plt.title('Test MSE (Lower is Better)')
plt.xticks(rotation=45)

plt.subplot(1, 3, 2)
sns.barplot(data=results_df, x='Model', y='Test R²')
plt.title('Test R² (Higher is Better)')
plt.xticks(rotation=45)

plt.subplot(1, 3, 3)
overfitting_data = results_df['Train R²'] - results_df['Test R²']
sns.barplot(x=results_df['Model'], y=overfitting_data)
plt.title('Overfitting Score (Lower is Better)')
plt.ylabel('Train R² - Test R²')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

# Save the best model
import joblib

if best_model_name == 'Random Forest':
    best_model = rf_grid.best_estimator_
elif best_model_name == 'Decision Tree':
    best_model = dt_grid.best_estimator_
else:
    best_model = knn_grid.best_estimator_

# Save all necessary files
joblib.dump(best_model, 'best_model.pkl')
joblib.dump(label_encoders, 'label_encoders.pkl')
joblib.dump(scaler, 'scaler.pkl')
joblib.dump(features, 'features.pkl')

print(f"\n💾 Model files saved successfully!")
print(f"✅ best_model.pkl - {best_model_name} model")
print(f"✅ label_encoders.pkl - Categorical encoders")
print(f"✅ scaler.pkl - Feature scaler")
print(f"✅ features.pkl - Feature names")
print(f"\nReady to run API! 🚀")
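
Before building the API it helps to check that the saved files can be loaded again and give the same predictions. The sketch below does such a round trip on synthetic data in a temporary folder. The file names follow the cell above, while the data, the KNN settings, and the variable names are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=50)

scaler = StandardScaler().fit(X)
model = KNeighborsRegressor(n_neighbors=5).fit(scaler.transform(X), y)

with tempfile.TemporaryDirectory() as tmp:
    # Save the model and its scaler (a temporary folder is used here).
    joblib.dump(model, os.path.join(tmp, "best_model.pkl"))
    joblib.dump(scaler, os.path.join(tmp, "scaler.pkl"))

    # Later, for example inside the API process: load them back.
    loaded_model = joblib.load(os.path.join(tmp, "best_model.pkl"))
    loaded_scaler = joblib.load(os.path.join(tmp, "scaler.pkl"))

# New students must go through the same scaler before prediction.
new_students = rng.normal(size=(2, 3))
original = model.predict(scaler.transform(new_students))
restored = loaded_model.predict(loaded_scaler.transform(new_students))

print(np.allclose(original, restored))
```

If the saved model is a tree or a forest trained on unscaled features, the scaling step would be skipped at prediction time; only the KNN model in this notebook is trained on scaled data.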