# Student Grade Classifier: Fundamentals of AI Project

Welcome! This notebook walks step by step through building a Student Grade Classifier that predicts whether a student passes or fails, covering each stage of a typical machine learning pipeline. Each section pairs a short explanation with code, so it is suitable for beginners. We use Python, pandas, and scikit-learn.

**Pipeline Steps:**
1. Data Collection/Loading
2. Preprocessing
3. Model Development
4. Evaluation
5. Hyperparameter Tuning
6. Deployment

Let's get started!

## 1. Data Collection/Loading

In this section, we'll load the student dataset and explore its contents. We'll use the pandas library to read the CSV file and inspect the data.

In [1]:
# Import necessary libraries
import pandas as pd

# Load the dataset; the file is semicolon-delimited, so pass sep=';'
file_path = 'student-mat.csv'
data = pd.read_csv(file_path, sep=';')

# Display the first 5 rows of the dataset
data.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10
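
Why the delimiter matters can be seen on a tiny in-memory sample. The sketch below uses made-up rows (not taken from `student-mat.csv`) and compares reading the same text with and without `sep=';'`.

```python
import io

import pandas as pd

# Made-up sample in the same semicolon-delimited layout (not real data).
raw = "school;sex;age;G3\nGP;F;18;6\nGP;M;17;10\n"

# With the default comma separator, each line is read as a single column.
wrong = pd.read_csv(io.StringIO(raw))

# With sep=';' the four columns are parsed correctly.
right = pd.read_csv(io.StringIO(raw), sep=';')

print('Default separator shape:', wrong.shape)
print("sep=';' shape:", right.shape)
print(right.dtypes)
```

A single-column shape right after loading is usually the first sign that the delimiter is wrong.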


## 2. Preprocessing

In this step, we will prepare the data for modeling. This includes:
- Checking for and handling missing values
- Encoding categorical variables (converting text columns to numbers)
- Scaling numerical features (if needed)
- Splitting the data into training and testing sets

These steps ensure the data is clean and suitable for building a machine learning model.

In [2]:
# Check for missing values in the dataset
missing_values = data.isnull().sum()
print('Missing values in each column:')
print(missing_values)

# This dataset has no missing values, so dropna() below leaves it unchanged.
# If a dataset does have gaps, consider filling them (e.g., mean/mode) instead of dropping rows.
data = data.dropna()
print('\nAfter handling missing values, shape of data:', data.shape)

Missing values in each column:
school        0
sex           0
age           0
address       0
famsize       0
Pstatus       0
Medu          0
Fedu          0
Mjob          0
Fjob          0
reason        0
guardian      0
traveltime    0
studytime     0
failures      0
schoolsup     0
famsup        0
paid          0
activities    0
nursery       0
higher        0
internet      0
romantic      0
famrel        0
freetime      0
goout         0
Dalc          0
Walc          0
health        0
absences      0
G1            0
G2            0
G3            0
dtype: int64

After handling missing values, shape of data: (395, 33)
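
The dataset above has no gaps, so `dropna()` changes nothing. For a dataset that does, one possible alternative to dropping rows is simple imputation. The sketch below uses a made-up frame and fills numeric columns with the median and text columns with the most frequent value; the column names and values are illustrative assumptions only.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps (the real dataset above has none).
toy = pd.DataFrame({
    "age": [15, np.nan, 17, 16],
    "sex": ["F", "M", None, "F"],
})

filled = toy.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        # Numeric column: fill with the median of the observed values
        filled[col] = filled[col].fillna(filled[col].median())
    else:
        # Text column: fill with the most frequent observed value
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
print('Remaining missing values:', int(filled.isnull().sum().sum()))
```

In a full pipeline, these fill values would normally be computed from the training split only, for the same reason the scaler is fitted on the training split.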


### Encoding Categorical Variables

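Many columns in the dataset are categorical (text). Most machine learning models, including those in scikit-learn, require numerical input, so we convert these columns to numbers using one-hot encoding. Passing `drop_first=True` drops one indicator column per categorical variable, because its value is implied by the remaining ones.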

In [3]:
# Encode categorical variables using one-hot encoding
data_encoded = pd.get_dummies(data, drop_first=True)

print('Shape after encoding categorical variables:', data_encoded.shape)
data_encoded.head()

Shape after encoding categorical variables: (395, 42)


Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,Dalc,...,guardian_mother,guardian_other,schoolsup_yes,famsup_yes,paid_yes,activities_yes,nursery_yes,higher_yes,internet_yes,romantic_yes
0,18,4,4,2,2,0,4,3,4,1,...,True,False,True,False,False,False,True,True,False,False
1,17,1,1,1,2,0,5,3,3,1,...,False,False,False,True,False,False,False,True,True,False
2,15,1,1,1,2,3,4,3,2,2,...,True,False,True,False,True,False,True,True,True,False
3,15,4,2,1,3,0,3,2,2,1,...,True,False,False,True,True,True,True,True,True,True
4,16,3,3,1,2,0,4,3,2,1,...,False,False,False,True,True,False,True,True,False,False
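
To see what `drop_first=True` does, here is a small sketch on a made-up three-row frame. The column names mirror the dataset, but the rows are invented for illustration.

```python
import pandas as pd

# Made-up rows; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "sex": ["F", "M", "F"],
    "Mjob": ["teacher", "other", "health"],
    "age": [18, 17, 15],
})

# One indicator column per category level
full = pd.get_dummies(toy)

# drop_first=True removes the first level of each categorical column
reduced = pd.get_dummies(toy, drop_first=True)

print('All levels:  ', sorted(full.columns))
print('drop_first:  ', sorted(reduced.columns))
```

Numeric columns such as `age` pass through unchanged; only text columns are expanded.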


### Splitting Data and Scaling Features

Now, we will:
- Split the data into features (X) and target (y).
- Split the dataset into training and testing sets.
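- Standardize the features (zero mean, unit variance) so that features measured on large scales do not dominate the model. The scaler is fitted on the training set only and then applied to the test set.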

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Define the binary target: 1 (pass) if the final grade G3 >= 10, otherwise 0 (fail)
data_encoded['pass'] = (data_encoded['G3'] >= 10).astype(int)
X = data_encoded.drop(['G3', 'pass'], axis=1)
y = data_encoded['pass']

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print('Training set shape:', X_train_scaled.shape)
print('Test set shape:', X_test_scaled.shape)

Training set shape: (316, 41)
Test set shape: (79, 41)
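
The effect of fitting the scaler on the training split only can be checked numerically. The sketch below uses randomly generated data (an assumption for illustration, not the student dataset): the scaled training features have mean 0 and standard deviation 1 by construction, while the test features are only approximately centered because they reuse the training statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=10.0, scale=3.0, size=(100, 2))
y_demo = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Learn mean and standard deviation from the training split only
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s = scaler_demo.transform(X_tr)
X_te_s = scaler_demo.transform(X_te)

print('Train mean/std:', X_tr_s.mean(axis=0).round(3), X_tr_s.std(axis=0).round(3))
print('Test mean/std: ', X_te_s.mean(axis=0).round(3), X_te_s.std(axis=0).round(3))
```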


## 3. Model Development

In this step, we will build and train a machine learning model to classify whether a student passes or fails based on their features. We will use Logistic Regression, a simple and effective algorithm for binary classification tasks.

In [5]:
from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model
model = LogisticRegression(max_iter=1000, random_state=42)

# Train the model on the training data
model.fit(X_train_scaled, y_train)

print('Model training complete.')

Model training complete.
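
Besides hard class labels, a fitted Logistic Regression model also exposes class probabilities through `predict_proba`. The sketch below, on synthetic data from `make_classification` (an assumption for illustration), shows that for a binary problem `predict` corresponds to thresholding the probability of class 1 at 0.5.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Column 0 holds P(class 0), column 1 holds P(class 1)
proba = clf.predict_proba(X_demo)

# Thresholding P(class 1) at 0.5 reproduces the labels returned by predict()
pred_from_proba = (proba[:, 1] > 0.5).astype(int)

print('First three probability rows:')
print(proba[:3].round(3))
print('Matches predict():', np.array_equal(pred_from_proba, clf.predict(X_demo)))
```

Working with probabilities makes it possible to choose a different threshold if, for example, missing a failing student is considered more costly than a false alarm.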


## 4. Evaluation

In this step, we will evaluate the performance of our trained model using the test data. We will use metrics such as accuracy, precision, recall, F1-score, and the confusion matrix to assess how well the model predicts student outcomes.

In [6]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

# Make predictions on the test set
y_pred = model.predict(X_test_scaled)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(f'Accuracy: {accuracy:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F1-score: {f1:.2f}')
print('\nConfusion Matrix:')
print(cm)

# Detailed classification report
print('\nClassification Report:')
print(classification_report(y_test, y_pred))

Accuracy: 0.85
Precision: 0.94
Recall: 0.83
F1-score: 0.88

Confusion Matrix:
[[23  3]
 [ 9 44]]

Classification Report:
              precision    recall  f1-score   support

           0       0.72      0.88      0.79        26
           1       0.94      0.83      0.88        53

    accuracy                           0.85        79
   macro avg       0.83      0.86      0.84        79
weighted avg       0.86      0.85      0.85        79
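
The reported metrics can be reproduced by hand from the confusion matrix above. scikit-learn lays the matrix out with true classes as rows and predicted classes as columns, so for labels 0 and 1 the entries are `[[TN, FP], [FN, TP]]`. The short sketch below redoes the arithmetic for the pass class.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class
cm = np.array([[23, 3],
               [9, 44]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # (44 + 23) / 79
precision = tp / (tp + fp)           # 44 / 47
recall = tp / (tp + fn)              # 44 / 53
f1 = 2 * precision * recall / (precision + recall)

print(f'Accuracy: {accuracy:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F1-score: {f1:.2f}')
```

The 9 false negatives are students who passed but were predicted to fail, which is why recall for the pass class is lower than precision.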



## 5. Hyperparameter Tuning

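In this step, we will tune our model by searching for the best hyperparameters using Grid Search. Grid Search trains and scores the model with 5-fold cross-validation for every combination of values in the grid and keeps the combination with the best average score (here, F1), which can improve performance over the default settings.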

In [7]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid for Logistic Regression
grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['liblinear', 'lbfgs'],
    'penalty': ['l2']
}

# Initialize GridSearchCV
grid_search = GridSearchCV(LogisticRegression(max_iter=1000, random_state=42), grid, cv=5, scoring='f1', n_jobs=-1)

grid_search.fit(X_train_scaled, y_train)

print('Best parameters:', grid_search.best_params_)

# GridSearchCV has already refit the best configuration on the full training set (refit=True by default)
best_model = grid_search.best_estimator_

# Evaluate the tuned model on the test set
y_pred_best = best_model.predict(X_test_scaled)

from sklearn.metrics import classification_report
print('\nClassification Report (Tuned Model):')
print(classification_report(y_test, y_pred_best))

Best parameters: {'C': 10, 'penalty': 'l2', 'solver': 'liblinear'}

Classification Report (Tuned Model):
              precision    recall  f1-score   support

           0       0.75      0.92      0.83        26
           1       0.96      0.85      0.90        53

    accuracy                           0.87        79
   macro avg       0.85      0.89      0.86        79
weighted avg       0.89      0.87      0.88        79
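
One possible refinement, not used in the cell above, is to put the scaler and the classifier into a scikit-learn `Pipeline` and tune the pipeline as a whole, so that the scaler is re-fitted inside each cross-validation fold. The sketch below runs on synthetic data and tunes only `C`; the step names `scale` and `clf` are arbitrary choices for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scaling and classification as a single estimator
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {"clf__C": [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_tr, y_tr)  # raw features go in; scaling happens inside each fold

test_f1 = search.score(X_te, y_te)
print('Best parameters:', search.best_params_)
print(f'Test F1: {test_f1:.2f}')
```

A fitted pipeline can also be saved as a single object, which removes the need to keep the model and scaler files in sync.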



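## 6. Deployment

In this step, we save the trained model, the scaler, and the feature names with joblib so they can be reloaded later to make predictions.

In [8]: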
import joblib
# Save the trained model and scaler for CLI deployment
joblib.dump(best_model, 'student_grade_model.pkl')
joblib.dump(scaler, 'scaler.pkl')
print('Model and scaler saved successfully.')

Model and scaler saved successfully.


In [9]:
# Save the feature names (in training column order) used for model input
joblib.dump(X.columns.tolist(), 'feature_names.pkl')
print('Feature names saved successfully.')

Feature names saved successfully.
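
A deployment script would reload these three files and apply them in the same order as during training: align the columns, scale, then predict. The sketch below is one way that could look. To stay self-contained it first creates small stand-in artifacts from made-up data in a temporary directory; `predict_pass` is a hypothetical helper, not part of the notebook.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Stand-in artifacts built from made-up data, mimicking the three saved files
train = pd.DataFrame({
    "G1": [5, 7, 15, 6, 12, 9, 14, 8],
    "G2": [6, 8, 14, 10, 13, 8, 15, 7],
    "sex_M": [0, 0, 1, 1, 0, 1, 0, 1],
})
labels = [0, 0, 1, 1, 1, 0, 1, 0]
demo_scaler = StandardScaler().fit(train)
demo_model = LogisticRegression(max_iter=1000).fit(demo_scaler.transform(train), labels)

workdir = tempfile.mkdtemp()
joblib.dump(demo_model, os.path.join(workdir, "student_grade_model.pkl"))
joblib.dump(demo_scaler, os.path.join(workdir, "scaler.pkl"))
joblib.dump(train.columns.tolist(), os.path.join(workdir, "feature_names.pkl"))


def predict_pass(record, artifact_dir):
    """Hypothetical helper: load the saved artifacts and classify one student."""
    model = joblib.load(os.path.join(artifact_dir, "student_grade_model.pkl"))
    scaler = joblib.load(os.path.join(artifact_dir, "scaler.pkl"))
    feature_names = joblib.load(os.path.join(artifact_dir, "feature_names.pkl"))

    # Align to the training column order; columns absent from the record become 0
    row = pd.DataFrame([record]).reindex(columns=feature_names, fill_value=0)

    # Apply the same scaling as in training, then predict (1 = pass, 0 = fail)
    return int(model.predict(scaler.transform(row))[0])


result = predict_pass({"G1": 14, "G2": 15}, workdir)
print('Predicted class:', result)
```

Saving the feature names matters because the model only sees positions, not column labels: inputs must arrive in the same order as the training columns.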


## Feature Selection: Reducing the Number of Features

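To simplify the model and reduce the number of inputs needed at prediction time, we rank the features by the absolute value of their coefficients in the tuned Logistic Regression model. The magnitudes are comparable because the features were standardized before training. We then retrain the model using only the top-ranked features.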

In [10]:
import numpy as np
import pandas as pd

# Get feature importance from the trained Logistic Regression model
feature_importance = pd.Series(np.abs(best_model.coef_[0]), index=X.columns)
feature_importance = feature_importance.sort_values(ascending=False)

# Display the top 10 most important features
print('Top 10 most important features:')
print(feature_importance.head(10))

# Select the top N features (N = 15 here)
N = 15
top_features = feature_importance.head(N).index.tolist()
print(f'\nSelected features for new model: {top_features}')

# Prepare new feature set
X_reduced = X[top_features]

# Split the data again
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_reduced, y, test_size=0.2, random_state=42, stratify=y)

# Scale features
scaler_r = StandardScaler()
X_train_r_scaled = scaler_r.fit_transform(X_train_r)
X_test_r_scaled = scaler_r.transform(X_test_r)

# Train new model
model_r = LogisticRegression(max_iter=1000, random_state=42)
model_r.fit(X_train_r_scaled, y_train_r)

# Evaluate new model
y_pred_r = model_r.predict(X_test_r_scaled)
print(classification_report(y_test_r, y_pred_r))

# Save the reduced-feature model, scaler, and feature names for deployment
# Note: this overwrites the files saved in the earlier cells (joblib is already imported)
joblib.dump(model_r, 'student_grade_model.pkl')
joblib.dump(scaler_r, 'scaler.pkl')
joblib.dump(top_features, 'feature_names.pkl')
print('Reduced-feature model, scaler, and feature names saved for deployment (15 features).')

Top 10 most important features:
G2             9.192319
G1             2.701841
school_MS      1.872498
age            1.741739
Fjob_other     1.365867
Walc           1.283136
Mjob_other     1.139375
famsize_LE3    0.895714
famrel         0.882027
sex_M          0.872307
dtype: float64

Selected features for new model: ['G2', 'G1', 'school_MS', 'age', 'Fjob_other', 'Walc', 'Mjob_other', 'famsize_LE3', 'famrel', 'sex_M', 'guardian_mother', 'freetime', 'failures', 'studytime', 'Dalc']


              precision    recall  f1-score   support

           0       0.75      0.92      0.83        26
           1       0.96      0.85      0.90        53

    accuracy                           0.87        79
   macro avg       0.85      0.89      0.86        79
weighted avg       0.89      0.87      0.88        79

Reduced-feature model, scaler, and feature names saved for deployment (15 features).
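
Ranking by absolute coefficient can be sanity-checked on data where the answer is known. In the sketch below, the data is synthetic and constructed so that only the column named `signal` carries information about the target; the column names are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: only "signal" influences the target
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})
target = (demo["signal"] + 0.3 * rng.normal(size=300) > 0).astype(int)

# Standardize first so coefficient magnitudes are on a comparable scale
scaled = StandardScaler().fit_transform(demo)
clf = LogisticRegression(max_iter=1000).fit(scaled, target)

# Rank features by the absolute value of their coefficients
ranking = pd.Series(np.abs(clf.coef_[0]), index=demo.columns).sort_values(ascending=False)
print(ranking)
```

Coefficient magnitude is a rough guide rather than a guarantee: strongly correlated features (such as G1 and G2 in the student data) can share or trade off weight between them.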
