# Class 10 Stream Counselling System using Machine Learning

### Project Domain
Artificial Intelligence – Recommendation System

### Introduction
Choosing the right academic stream after Class 10 is a critical decision that can significantly influence a student’s future career path. Many students face confusion while selecting between Science, Commerce, and Arts due to limited guidance, peer pressure, or lack of data-driven insights.

This project presents an AI-powered Stream Counselling System that uses Machine Learning techniques to analyze a student’s academic performance and recommend a suitable Class 11 stream. The system aims to provide objective, data-driven guidance to support students in making informed academic decisions.


## Problem Statement

Students completing Class 10 often struggle to decide which academic stream to pursue in Class 11. This decision is usually influenced by subjective factors such as parental expectations, peer choices, or incomplete understanding of personal academic strengths.

The absence of structured and data-driven counselling can lead to students choosing unsuitable streams, resulting in academic stress, poor performance, or loss of interest in studies.

### Objective
The objective of this project is to develop a Machine Learning–based recommendation system that analyzes a student’s Class 10 academic performance and suggests a suitable academic stream (Science, Commerce, or Arts). The system aims to assist students by providing personalized, data-backed recommendations rather than replacing professional counselling.


## Dataset Description

This project uses two publicly available educational datasets to build a robust and generalized stream counselling model.

### 1. UCI Student Performance Dataset
The UCI Student Performance dataset contains academic and demographic information on secondary school students. The file used here, `student-mat.csv`, covers the mathematics course and records three period grades (`G1`, `G2`, `G3`, each on a 0–20 scale) along with study-related attributes such as study time, past failures, and absences. This dataset helps in understanding patterns related to academic performance.

### 2. Kaggle Students Performance Dataset
The Kaggle Students Performance dataset includes students’ scores in subjects such as Mathematics, Reading, and Writing. It provides a clear numerical representation of student performance, which is useful for feature engineering and model training.

### Purpose of Using Multiple Datasets
Using multiple datasets helps improve the diversity of academic profiles and reduces overfitting to a single data source. The datasets are later combined and transformed to extract meaningful features relevant to stream selection.

The following code loads both datasets and performs an initial inspection to understand their structure and contents.


In [None]:
import pandas as pd

# Load UCI dataset
df_uci = pd.read_csv("student-mat.csv", sep=";")

# Load Kaggle dataset
df_kaggle = pd.read_csv("StudentsPerformance.csv")

# Show basic info
print("UCI Dataset Shape:", df_uci.shape)
print("Kaggle Dataset Shape:", df_kaggle.shape)

# Preview both datasets
print("\nUCI Dataset (student-mat.csv)")
display(df_uci.head())

print("\nKaggle Dataset (StudentsPerformance.csv)")
display(df_kaggle.head())


UCI Dataset Shape: (395, 33)
Kaggle Dataset Shape: (1000, 8)

UCI Dataset (student-mat.csv)


Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10



Kaggle Dataset (StudentsPerformance.csv)


Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75


### Selection of Relevant Features

Both datasets contain multiple attributes, not all of which are directly relevant for stream counselling. To maintain focus on academic performance, only subject-related and study-related features were selected.

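From the UCI Student Performance dataset, grade-related and academic behavior attributes such as internal assessments, study time, number of past failures, and absences were selected. Of these, only the grades and the failure count feed into the engineered features later in the notebook; study time and absences are carried along but not used.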

From the Kaggle Students Performance dataset, core subject scores were selected to represent academic proficiency in Mathematics and language-related subjects.

This step helps in reducing noise, simplifying the dataset, and ensuring that only meaningful attributes are used for further analysis and feature engineering.


In [None]:
# ---- STEP 4A: Select useful columns from UCI dataset ----
uci_cols = ['G1', 'G2', 'G3', 'studytime', 'failures', 'absences']
df_uci_clean = df_uci[uci_cols].copy()

print("Cleaned UCI Dataset")
display(df_uci_clean.head())


# ---- STEP 4B: Select useful columns from Kaggle dataset ----
kaggle_cols = ['math score', 'reading score', 'writing score']
df_kaggle_clean = df_kaggle[kaggle_cols].copy()

print("Cleaned Kaggle Dataset")
display(df_kaggle_clean.head())


Cleaned UCI Dataset


Unnamed: 0,G1,G2,G3,studytime,failures,absences
0,5,6,6,2,0,6
1,5,5,6,2,0,4
2,7,8,10,2,3,10
3,15,14,15,3,0,2
4,6,10,10,2,0,4


Cleaned Kaggle Dataset


Unnamed: 0,math score,reading score,writing score
0,72,72,74
1,69,90,88
2,90,95,93
3,47,57,44
4,76,78,75


## Data Cleaning & Preprocessing

After selecting the relevant columns from both datasets, basic preprocessing steps were applied to prepare the data for feature engineering. The datasets were checked for consistency in structure and cleaned to ensure numerical stability.

Since the datasets primarily contained numerical academic attributes, no extensive missing value imputation was required. The focus of preprocessing was on transforming raw academic scores into meaningful indicators that could support counselling decisions.
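
The claim that no imputation is needed can be checked directly. The snippet below is a minimal sketch of such a check, run on a small illustrative frame built from the first three preview rows of the selected UCI columns rather than on the full dataset; the variable names (`sample`, `missing_per_column`, `grades_in_range`) are made up for this example.

```python
import pandas as pd

# Illustrative sample: the first three preview rows of the selected UCI columns.
sample = pd.DataFrame({
    "G1": [5, 5, 7],
    "G2": [6, 5, 8],
    "G3": [6, 6, 10],
    "failures": [0, 0, 3],
})

# Count missing values per column; all zeros means no imputation is needed.
missing_per_column = sample.isna().sum()

# UCI grades are reported on a 0-20 scale, so values outside it would be suspect.
grades_in_range = bool(
    sample[["G1", "G2", "G3"]].apply(lambda col: col.between(0, 20)).all().all()
)

print(missing_per_column.to_dict())
print("grades within 0-20:", grades_in_range)
```

The same two checks could be applied to `df_uci_clean` and `df_kaggle_clean` (with a 0-100 range for the Kaggle scores) before moving on to feature engineering.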


## Feature Engineering

Feature engineering plays a crucial role in converting raw academic data into meaningful indicators for decision-making.

From the UCI dataset, the following features were engineered:
- **Average Score**: Represents overall academic performance across internal assessments.
- **Academic Strength**: A normalized measure of average academic performance.
- **Consistency Score**: Captures the impact of past academic failures on consistency.

From the Kaggle dataset, the following features were engineered:
- **Math Inclination**: Indicates numerical aptitude based on mathematics scores.
- **Language Inclination**: Represents language proficiency using reading and writing scores.

These engineered features provide a balanced representation of a student’s academic abilities and are used as inputs to the Machine Learning model.


In [None]:
# ---- STEP 5A: Counselling features from UCI dataset ----
df_uci_feat = df_uci_clean.copy()

df_uci_feat['avg_score'] = (df_uci_feat['G1'] + df_uci_feat['G2'] + df_uci_feat['G3']) / 3
df_uci_feat['academic_strength'] = df_uci_feat['avg_score'] / 20   # normalize
df_uci_feat['consistency'] = 1 / (1 + df_uci_feat['failures'])

display(df_uci_feat.head())


# ---- STEP 5B: Counselling features from Kaggle dataset ----
df_kaggle_feat = df_kaggle_clean.copy()

df_kaggle_feat['math_inclination'] = df_kaggle_feat['math score'] / 100
df_kaggle_feat['language_inclination'] = (
    df_kaggle_feat['reading score'] + df_kaggle_feat['writing score']
) / 200

display(df_kaggle_feat.head())


Unnamed: 0,G1,G2,G3,studytime,failures,absences,avg_score,academic_strength,consistency
0,5,6,6,2,0,6,5.666667,0.283333,1.0
1,5,5,6,2,0,4,5.333333,0.266667,1.0
2,7,8,10,2,3,10,8.333333,0.416667,0.25
3,15,14,15,3,0,2,14.666667,0.733333,1.0
4,6,10,10,2,0,4,8.666667,0.433333,1.0


Unnamed: 0,math score,reading score,writing score,math_inclination,language_inclination
0,72,72,74,0.72,0.73
1,69,90,88,0.69,0.89
2,90,95,93,0.9,0.94
3,47,57,44,0.47,0.505
4,76,78,75,0.76,0.765


### Integration of Engineered Features

After engineering relevant counselling features from both datasets, the next step was to integrate them into a single unified dataset suitable for model training.

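Since the two datasets originate from different sources and contain different numbers of records (395 and 1,000 rows), they were first aligned by trimming both to the length of the shorter one (395 rows). Rows are then paired purely by position. Because the two sources describe different students, each combined row is a synthetic profile rather than a record of one real student.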

The selected counselling features were then combined horizontally to form the final counselling dataset containing:
- Academic Strength
- Consistency Score
- Math Inclination
- Language Inclination

Each row of this integrated dataset is treated as one student's academic profile and serves as the input for the Machine Learning model.
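
The index reset matters because `pd.concat(..., axis=1)` aligns rows by index label, not by position. The sketch below illustrates this with two tiny made-up frames (the index labels 10 to 12 and the variable names are hypothetical, chosen only to make the misalignment visible).

```python
import pandas as pd

# Hypothetical frames: 'left' carries a non-default index, 'right' is longer.
left = pd.DataFrame({"academic_strength": [0.28, 0.27, 0.42]}, index=[10, 11, 12])
right = pd.DataFrame({"math_inclination": [0.72, 0.69, 0.90, 0.47]})

# Without trimming and resetting, concat takes the union of index labels,
# so no row has values from both frames.
misaligned = pd.concat([left, right], axis=1)

# Trim to the shorter length and reset both indexes to 0..n-1 before combining.
n = min(len(left), len(right))
aligned = pd.concat(
    [left.iloc[:n].reset_index(drop=True), right.iloc[:n].reset_index(drop=True)],
    axis=1,
)

print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
```

In this notebook both source frames already carry a default index, so the reset is a safeguard rather than a fix, but it keeps the combination step correct if rows are ever filtered beforehand.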


In [None]:
# ---- STEP 6A: Select only engineered features ----
uci_final = df_uci_feat[['academic_strength', 'consistency']].copy()

kaggle_final = df_kaggle_feat[['math_inclination', 'language_inclination']].copy()


# ---- STEP 6B: Make sizes equal (important) ----
min_len = min(len(uci_final), len(kaggle_final))

uci_final = uci_final.iloc[:min_len].reset_index(drop=True)
kaggle_final = kaggle_final.iloc[:min_len].reset_index(drop=True)


# ---- STEP 6C: Combine horizontally ----
counselling_df = pd.concat([uci_final, kaggle_final], axis=1)

display(counselling_df.head())
print("Final counselling dataset shape:", counselling_df.shape)


Unnamed: 0,academic_strength,consistency,math_inclination,language_inclination
0,0.283333,1.0,0.72,0.73
1,0.266667,1.0,0.69,0.89
2,0.416667,0.25,0.9,0.94
3,0.733333,1.0,0.47,0.505
4,0.433333,1.0,0.76,0.765


Final counselling dataset shape: (395, 4)


## Model Selection & Training

Before training a Machine Learning model, it is necessary to define the target variable that the model will learn to predict.

In this project, the target variable represents the recommended academic stream for a student. Since the datasets used do not explicitly contain stream labels, a rule-based logic was designed to assign stream recommendations based on academic performance and aptitude-related features.

The following logic was used:
- **Science**: Assigned when academic strength is at least 0.6 and math inclination is at least 0.55, or when math inclination is at least 0.75 and consistency is at least 0.7.
- **Commerce**: Assigned to the remaining students whose academic strength is at least 0.45 or whose math inclination is at least 0.5.
- **Arts**: Assigned to everyone else, that is, students with academic strength below 0.45 and math inclination below 0.5. Language inclination is not used by the rule.

This approach creates a supervised learning problem in which the Machine Learning model learns to reproduce these rule-derived labels from the engineered features.


In [None]:
def recommend_stream_v2(row):
    # Science: (strong academics AND adequate math) OR (strong math AND good consistency)
    if (row['academic_strength'] >= 0.6 and row['math_inclination'] >= 0.55) or \
       (row['math_inclination'] >= 0.75 and row['consistency'] >= 0.7):
        return 'Science'

    # Commerce: moderate academics OR moderate math ability
    elif row['academic_strength'] >= 0.45 or row['math_inclination'] >= 0.5:
        return 'Commerce'

    # Arts: everything else (academic_strength < 0.45 and math_inclination < 0.5)
    else:
        return 'Arts'


counselling_df['recommended_stream'] = counselling_df.apply(recommend_stream_v2, axis=1)


### Analysis of Target Variable Distribution

After assigning stream labels using the rule-based logic, it is important to analyze the distribution of the target variable. This shows whether the dataset is balanced across stream categories and flags the risk of the model becoming biased towards a majority class. Here the labels turn out to be heavily imbalanced: Arts accounts for only 14 of the 395 rows (about 3.5%).

The following code displays the number of instances assigned to each academic stream.


In [None]:
counselling_df['recommended_stream'].value_counts()


recommended_stream,count
Commerce,208
Science,173
Arts,14
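
One likely reason for the small Arts count is the structure of the rule itself: the `else` branch is only reached when both Commerce conditions fail. The sketch below re-implements the thresholds on plain numbers and checks, over a coarse grid of feature values, that "Arts" is equivalent to "academic strength below 0.45 and math inclination below 0.5". The function names `recommend_stream` and `is_arts` are illustrative and not part of the notebook.

```python
from itertools import product

def recommend_stream(strength, consistency, math_incl):
    """Same thresholds as recommend_stream_v2, applied to plain numbers."""
    if (strength >= 0.6 and math_incl >= 0.55) or (math_incl >= 0.75 and consistency >= 0.7):
        return "Science"
    elif strength >= 0.45 or math_incl >= 0.5:
        return "Commerce"
    return "Arts"

def is_arts(strength, math_incl):
    """Explicit form of the else-branch: both values below the Commerce cut-offs."""
    return strength < 0.45 and math_incl < 0.5

# Compare the two formulations on a grid of feature values (step 0.05).
grid = [i / 20 for i in range(21)]
mismatches = [
    (s, c, m)
    for s, c, m in product(grid, repeat=3)
    if (recommend_stream(s, c, m) == "Arts") != is_arts(s, m)
]
print("mismatches:", len(mismatches))  # prints "mismatches: 0"
```

Since neither consistency nor language inclination appears in the Arts condition, a student with strong reading and writing scores is still labelled Commerce or Science whenever the math score is 50 or higher.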


### Counselling Explanation Generation

In addition to predicting the recommended academic stream, the system also generates a brief counselling explanation to justify the recommendation. This improves interpretability and makes the system more user-friendly.

Based on the predicted stream, a corresponding explanation is assigned:
- **Science**: Indicates strong academic performance and mathematical aptitude.
- **Commerce**: Reflects balanced academics with numerical and analytical skills.
- **Arts**: Suggests inclination towards language-based or creative subjects.

Providing explanations alongside predictions helps students better understand the reasoning behind the recommendation.


In [None]:
# ---- STEP 8: Counselling explanation ----

def counselling_reason(row):
    if row['recommended_stream'] == 'Science':
        return 'Strong academic performance with high mathematical ability'

    elif row['recommended_stream'] == 'Commerce':
        return 'Moderate academics with good numerical and analytical skills'

    else:
        return 'Better inclination towards language-based or creative subjects'


counselling_df['counselling_reason'] = counselling_df.apply(counselling_reason, axis=1)

display(counselling_df.head())


Unnamed: 0,academic_strength,consistency,math_inclination,language_inclination,recommended_stream,counselling_reason
0,0.283333,1.0,0.72,0.73,Commerce,Moderate academics with good numerical and ana...
1,0.266667,1.0,0.69,0.89,Commerce,Moderate academics with good numerical and ana...
2,0.416667,0.25,0.9,0.94,Commerce,Moderate academics with good numerical and ana...
3,0.733333,1.0,0.47,0.505,Commerce,Moderate academics with good numerical and ana...
4,0.433333,1.0,0.76,0.765,Science,Strong academic performance with high mathemat...


### Final Dataset Preparation for Model Training

After completing feature engineering, target variable creation, and counselling explanation generation, the final counselling dataset was saved for use in model training.

Saving the processed dataset ensures reproducibility and allows the Machine Learning model to be trained independently of the data preparation steps. This also enables easier integration with other components of the system if required.


In [None]:
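# ---- STEP 9: Save final counselling dataset ----
# Save only the four features and the label; the explanation text is not a model input.
final_cols = ['academic_strength', 'consistency', 'math_inclination',
              'language_inclination', 'recommended_stream']
counselling_df[final_cols].to_csv("class10_stream_counselling_final.csv", index=False)

print("Dataset saved successfully!")
print("Final shape:", counselling_df[final_cols].shape)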


Dataset saved successfully!
Final shape: (395, 5)


### Exporting the Final Dataset

The processed counselling dataset was saved and downloaded from Google Colab for use in model training and backend deployment. This step enables smooth transition from experimentation to implementation.


In [None]:
from google.colab import files
files.download("class10_stream_counselling_final.csv")



### Loading the Final Counselling Dataset

The prepared counselling dataset was loaded for Machine Learning model training. This dataset contains engineered academic features along with the corresponding stream labels.


In [None]:
import pandas as pd

# Load the final counselling dataset
df = pd.read_csv("class10_stream_counselling_final.csv")

# Show dataset
display(df.head())
print(df.shape)


Unnamed: 0,academic_strength,consistency,math_inclination,language_inclination,recommended_stream
0,0.283333,1.0,0.72,0.73,Commerce
1,0.266667,1.0,0.69,0.89,Commerce
2,0.416667,0.25,0.9,0.94,Commerce
3,0.733333,1.0,0.47,0.505,Commerce
4,0.433333,1.0,0.76,0.765,Science


(395, 5)


### Feature and Target Separation

The dataset was divided into input features (`X`) and the target variable (`y`). The features represent academic indicators, while the target corresponds to the recommended academic stream.


In [None]:
# STEP ML-2: Prepare X (features) and y (target)

X = df[['academic_strength',
        'consistency',
        'math_inclination',
        'language_inclination']]

y = df['recommended_stream']

# Check
print("Features shape:", X.shape)
print("Target shape:", y.shape)

print("\nTarget class distribution:")
print(y.value_counts())


Features shape: (395, 4)
Target shape: (395,)

Target class distribution:
recommended_stream
Commerce    208
Science     173
Arts         14
Name: count, dtype: int64


### Label Encoding and Train–Test Split

The target stream labels were encoded into numerical form using Label Encoding. The dataset was then split into training and testing sets to evaluate model performance on unseen data.


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# STEP ML-3A: Encode target labels
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

print("Encoded classes:")
for i, cls in enumerate(label_encoder.classes_):
    print(i, "->", cls)

# STEP ML-3B: Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded
)

print("\nTraining set shape:", X_train.shape, y_train.shape)
print("Test set shape:", X_test.shape, y_test.shape)


Encoded classes:
0 -> Arts
1 -> Commerce
2 -> Science

Training set shape: (316, 4) (316,)
Test set shape: (79, 4) (79,)


### Model Training

A Random Forest Classifier was used to train the stream counselling model. Random Forest was selected due to its ability to handle non-linear relationships, reduce overfitting, and perform well on structured tabular data.

The model was trained using the prepared training dataset to learn patterns between academic features and recommended streams.


In [None]:
from sklearn.ensemble import RandomForestClassifier

# STEP ML-4: Train the model
model = RandomForestClassifier(
    n_estimators=100,
    random_state=42
)

model.fit(X_train, y_train)

print("✅ Model trained successfully!")


✅ Model trained successfully!


## Model Evaluation

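The trained model was evaluated on the held-out test set using standard classification metrics. Accuracy measures overall performance, while the confusion matrix and classification report give class-wise detail. Because the labels were generated by the rule-based logic, these metrics measure how closely the model reproduces those rules, not how well its recommendations suit real students. The Arts class has only 3 test samples, so its per-class scores are unstable.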


In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# STEP ML-5A: Predictions on test data
y_pred = model.predict(X_test)

# STEP ML-5B: Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# STEP ML-5C: Confusion Matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# STEP ML-5D: Classification Report
print("\nClassification Report:")
print(classification_report(
    y_test, y_pred,
    target_names=label_encoder.classes_
))


Model Accuracy: 0.9873417721518988

Confusion Matrix:
[[ 2  1  0]
 [ 0 42  0]
 [ 0  0 34]]

Classification Report:
              precision    recall  f1-score   support

        Arts       1.00      0.67      0.80         3
    Commerce       0.98      1.00      0.99        42
     Science       1.00      1.00      1.00        34

    accuracy                           0.99        79
   macro avg       0.99      0.89      0.93        79
weighted avg       0.99      0.99      0.99        79
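
With so few Arts samples in a single split, a stratified k-fold evaluation may give a steadier picture than one train/test split. The sketch below shows the mechanics on synthetic data: feature values are drawn uniformly at random and labelled with the same thresholds as the rule above, so the scores it prints say nothing about the notebook's dataset. The names `label_rule`, `X_demo`, and `y_demo` are invented for this example, and macro-averaged F1 is one possible metric choice, not the one used in the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def label_rule(strength, consistency, math_incl):
    """Rule-based labelling with the notebook's thresholds."""
    if (strength >= 0.6 and math_incl >= 0.55) or (math_incl >= 0.75 and consistency >= 0.7):
        return "Science"
    if strength >= 0.45 or math_incl >= 0.5:
        return "Commerce"
    return "Arts"

# Synthetic stand-in for the counselling features (NOT the real dataset).
# Column order: academic_strength, consistency, math_inclination, language_inclination
rng = np.random.default_rng(42)
X_demo = rng.uniform(0.0, 1.0, size=(400, 4))
y_demo = np.array([label_rule(s, c, m) for s, c, m, _ in X_demo])

# Stratified folds keep the class proportions roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Macro F1 weights each class equally, so a rare class is not drowned out.
scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="f1_macro")
print("macro-F1 per fold:", np.round(scores, 3))
print("mean:", round(float(scores.mean()), 3), "std:", round(float(scores.std()), 3))
```

To apply this to the notebook's data, `X_demo` and `y_demo` would be replaced by `X` and `y_encoded`; with only 14 Arts rows, each of five folds would hold two or three of them.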



In [None]:
import joblib

# STEP ML-6: Save model and label encoder
joblib.dump(model, "stream_counselling_model.pkl")
joblib.dump(label_encoder, "stream_label_encoder.pkl")

print("✅ Model and Label Encoder saved successfully!")


✅ Model and Label Encoder saved successfully!


### Environment Setup for Deployment

The following libraries were installed to support backend development and model deployment using Flask.


In [None]:
pip install flask flask-cors joblib


Collecting flask-cors
  Downloading flask_cors-6.0.2-py3-none-any.whl.metadata (5.3 kB)
Downloading flask_cors-6.0.2-py3-none-any.whl (13 kB)
Installing collected packages: flask-cors
Successfully installed flask-cors-6.0.2


## System Integration (Backend + Frontend)

The trained Machine Learning model was integrated into a Flask-based backend to enable real-time predictions. The backend loads the saved model and label encoder and exposes a REST API endpoint (`/predict`) that accepts student academic details as input.

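The backend computes the same four features, on the same 0–1 scale, from the submitted marks. The inputs differ from training, however: academic strength is the average of the math, English, and science marks (each out of 100) rather than of the UCI period grades, and language inclination uses the English mark alone rather than the reading and writing scores. A simple counselling explanation is also generated along with the predicted stream.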

The frontend, built using HTML, CSS, and JavaScript, communicates with this backend using HTTP requests, completing the end-to-end AI-powered counselling system.


In [None]:
%%writefile backend.py
from flask import Flask, request, jsonify
from flask_cors import CORS
import joblib

app = Flask(__name__)
CORS(app)  # Allow frontend to talk to backend

# Load ML assets
model = joblib.load("stream_counselling_model.pkl")
encoder = joblib.load("stream_label_encoder.pkl")

# Feature engineering: same four features and 0-1 scale as training,
# computed here from marks out of 100 and a count of past failures
def calculate_features(math, english, science, failures):
    avg_score = (math + english + science) / 3
    academic_strength = avg_score / 100
    math_inclination = math / 100
    language_inclination = english / 100
    consistency = 1 / (1 + failures)

    return [
        academic_strength,
        consistency,
        math_inclination,
        language_inclination
    ]

# Counselling explanation
def counselling_reason(stream):
    if stream == "Science":
        return "Strong academic performance and mathematical ability suggest Science stream."
    elif stream == "Commerce":
        return "Balanced academic performance suggests Commerce stream."
    else:
        return "Language skills and overall profile suggest Arts or Humanities."

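@app.route("/predict", methods=["POST"])
def predict():
    # silent=True returns None instead of raising on a missing or malformed JSON body
    data = request.get_json(silent=True)
    if not isinstance(data, dict):
        data = {}

    # Reject requests that lack any of the four expected fields
    required = ["math", "english", "science", "failures"]
    missing = [key for key in required if key not in data]
    if missing:
        return jsonify({"error": "Missing fields: " + ", ".join(missing)}), 400

    features = calculate_features(
        data["math"],
        data["english"],
        data["science"],
        data["failures"]
    )

    # The model returns an encoded class; decode it back to the stream name
    prediction = model.predict([features])[0]
    stream = encoder.inverse_transform([prediction])[0]

    return jsonify({
        "recommended_stream": stream,
        "reason": counselling_reason(stream)
    })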

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000, debug=False)



Writing backend.py


### Running the Backend Server

The Flask backend was executed to activate the API endpoint for stream prediction. Once running, the backend listens for requests from the frontend and returns stream recommendations in real time.


In [None]:
!python backend.py

 * Serving Flask app 'backend'
 * Debug mode: off
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:5000
 * Running on http://172.28.0.12:5000
Press CTRL+C to quit


### Backend Validation and Transition to Full-Stack Integration

The JSON response returned by the `/predict` API endpoint (see the `curl` request later in this section), containing both the **recommended stream** and its **justification**, indicates that the backend works end to end for a valid request.

The successful output confirms that:
- The trained Machine Learning model was loaded correctly from the serialized `.pkl` files
- Feature engineering at inference time produces inputs in the format the model expects
- The prediction logic executes without errors
- The Flask API correctly processes requests and returns structured JSON responses

With this basic check passing, the following backend components were downloaded for local deployment:
- `backend.py` – Flask application handling API requests
- `stream_counselling_model.pkl` – Trained Machine Learning model
- `stream_label_encoder.pkl` – Label encoder for decoding predictions

These backend files are now integrated with the frontend files (HTML, CSS, JavaScript) to form a **complete full-stack web application**.  
The frontend communicates with the backend through HTTP POST requests, enabling real-time stream recommendations based on user input.

This separation of frontend and backend ensures modularity, scalability, and ease of deployment, while maintaining a clear distinction between user interface logic and machine learning inference.


In [None]:
!nohup python backend.py > backend.log 2>&1 &


In [None]:
!ps aux | grep backend.py


root       12508 48.0  0.7 322104 100532 ?       Rl   14:00   0:01 python3 backend.py
root       12523  0.0  0.0   7372  3576 ?        S    14:00   0:00 /bin/bash -c ps aux | grep backend.py
root       12525  0.0  0.0   6612  2440 ?        S    14:00   0:00 grep backend.py


In [None]:
!curl -X POST http://127.0.0.1:5000/predict \
-H "Content-Type: application/json" \
-d '{"math":75,"english":80,"science":70,"failures":1}'


{"reason":"Strong academic performance and mathematical ability suggest Science stream.","recommended_stream":"Science"}
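
The response can be sanity-checked by tracing the request payload by hand. The sketch below mirrors the backend's feature computation and then applies the labelling thresholds that the model was trained to imitate; it does not load the model, so it only shows what the rule would say for this input. The function `rule_label` is a made-up name for this illustration.

```python
def calculate_features(math, english, science, failures):
    """Mirror of the backend helper: marks out of 100 -> four 0-1 features."""
    avg_score = (math + english + science) / 3
    return {
        "academic_strength": avg_score / 100,
        "consistency": 1 / (1 + failures),
        "math_inclination": math / 100,
        "language_inclination": english / 100,
    }

def rule_label(f):
    """Labelling thresholds from the training data preparation step."""
    if (f["academic_strength"] >= 0.6 and f["math_inclination"] >= 0.55) or \
       (f["math_inclination"] >= 0.75 and f["consistency"] >= 0.7):
        return "Science"
    if f["academic_strength"] >= 0.45 or f["math_inclination"] >= 0.5:
        return "Commerce"
    return "Arts"

# Same payload as the curl request above.
features = calculate_features(math=75, english=80, science=70, failures=1)
print(features)
# {'academic_strength': 0.75, 'consistency': 0.5, 'math_inclination': 0.75, 'language_inclination': 0.8}
print(rule_label(features))  # Science
```

Academic strength of 0.75 and math inclination of 0.75 satisfy the first Science condition, which agrees with the stream returned by the API for this request.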


## Ethical Considerations & Limitations

- **Bias in Datasets**  
  The datasets used in this project (UCI and Kaggle) may contain inherent academic and socio-economic biases. These biases can influence the model’s predictions and may not fully represent students from diverse backgrounds.

- **Model Limitations**  
  The machine learning model is trained only on academic performance indicators such as marks, consistency, and subject inclination. It does not consider factors like personal interest, emotional intelligence, creativity, or external support systems.

- **Not a Replacement for Human Counselling**  
  This system is intended to support decision-making and should not be considered a substitute for professional career counselling. Final academic decisions should involve teachers, parents, and trained counsellors.

- **Responsible AI Usage**  
  The predictions come from a statistical model trained on rule-derived labels, so they can be wrong and should be interpreted responsibly. Users should be told that the recommendations are guidance, not absolute outcomes.


## Conclusion & Future Scope

- **Conclusion**  
  The project successfully demonstrates an end-to-end AI-based counselling system that recommends a suitable Class 11 stream for students based on their Class 10 academic performance.

- **Project Achievements**  
  - Data integration from multiple educational datasets  
  - Feature engineering aligned with counselling logic  
  - Training and evaluation of a machine learning classification model  
  - Backend implementation using Flask  
  - Frontend-backend integration using REST APIs  

- **Practical Impact**  
  This system can assist students and educators by providing quick, data-driven insights during a critical academic decision-making stage.

- **Future Scope**  
  - Integration of aptitude and interest-based assessments  
  - Use of advanced models and explainable AI techniques  
  - Development of real-time dashboards for institutions  
  - Cloud deployment for scalability and wider accessibility  

### System Architecture Overview
The system follows a client–server architecture where the frontend collects student inputs, sends them to a Flask-based backend via REST API, and the backend processes the data using a trained machine learning model to return stream recommendations.
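
The client side of this exchange can be sketched in Python with only the standard library. The example below builds the same POST request that the `curl` test in this notebook sends, without actually sending it; `build_predict_request` is a hypothetical helper name, and the URL is the local address from that test.

```python
import json
import urllib.request

# Local address used by this notebook's curl test.
API_URL = "http://127.0.0.1:5000/predict"

def build_predict_request(math, english, science, failures, url=API_URL):
    """Build (but do not send) a JSON POST request for the /predict endpoint."""
    payload = {
        "math": math,
        "english": english,
        "science": science,
        "failures": failures,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_predict_request(75, 80, 70, 1)
print(req.get_method(), req.full_url)

# With backend.py running, the request could be sent like this:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

The browser frontend would do the equivalent with its own HTTP client; the essential parts are the POST method, the JSON content type, and the four field names the backend expects.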
