##  Regression model for predicting the performance of students on a standardized test (column 'JAMB_Score'). ##

Students Performance in 2024 JAMB

About Dataset
This dataset was generated using statistics from the 2024 Joint Admissions and Matriculation Board (JAMB) examination to predict student performance. JAMB is a standardized test for university admissions in Nigeria. The dataset aims to identify factors affecting student performance and to support the development of targeted interventions to improve outcomes.

### Import Libraries  ###

In [3]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeRegressor

### Load the dataset  ###

In [4]:
student_performance = pd.read_csv("C:/Users/PC-023/Downloads/archive (5)/jamb_exam_results.csv")
df = student_performance.copy()

In [5]:
df.head()

Unnamed: 0,JAMB_Score,Study_Hours_Per_Week,Attendance_Rate,Teacher_Quality,Distance_To_School,School_Type,School_Location,Extra_Tutorials,Access_To_Learning_Materials,Parent_Involvement,IT_Knowledge,Student_ID,Age,Gender,Socioeconomic_Status,Parent_Education_Level,Assignments_Completed
0,192,22,78,4,12.4,Public,Urban,Yes,Yes,High,Medium,1,17,Male,Low,Tertiary,2
1,207,14,88,4,2.7,Public,Rural,No,Yes,High,High,2,15,Male,High,,1
2,182,29,87,2,9.6,Public,Rural,Yes,Yes,High,Medium,3,20,Female,High,Tertiary,2
3,210,29,99,2,2.6,Public,Urban,No,Yes,Medium,High,4,22,Female,Medium,Tertiary,1
4,199,12,98,3,8.8,Public,Urban,No,Yes,Medium,Medium,5,22,Female,Medium,Tertiary,1


### Preparing the dataset ###

In [6]:
# First, make the column names lowercase and replace spaces with underscores:
df.columns = df.columns.str.lower().str.replace(' ', '_')

In [7]:
# View the dataset
df.head()

Unnamed: 0,jamb_score,study_hours_per_week,attendance_rate,teacher_quality,distance_to_school,school_type,school_location,extra_tutorials,access_to_learning_materials,parent_involvement,it_knowledge,student_id,age,gender,socioeconomic_status,parent_education_level,assignments_completed
0,192,22,78,4,12.4,Public,Urban,Yes,Yes,High,Medium,1,17,Male,Low,Tertiary,2
1,207,14,88,4,2.7,Public,Rural,No,Yes,High,High,2,15,Male,High,,1
2,182,29,87,2,9.6,Public,Rural,Yes,Yes,High,Medium,3,20,Female,High,Tertiary,2
3,210,29,99,2,2.6,Public,Urban,No,Yes,Medium,High,4,22,Female,Medium,Tertiary,1
4,199,12,98,3,8.8,Public,Urban,No,Yes,Medium,Medium,5,22,Female,Medium,Tertiary,1


In [8]:
df.shape

(5000, 17)

#### Preparation ####

Remove the student_id column.
Fill missing values with zeros.
Do train/validation/test split with 60%/20%/20% distribution.
Use the train_test_split function and set the random_state parameter to 1.
Use DictVectorizer(sparse=True) to turn the dataframes into matrices.

#### Question 1 ####


Let's train a decision tree regressor to predict the jamb_score variable.

Train a model with max_depth=1.

Which feature is used for splitting the data?

study_hours_per_week
attendance_rate
teacher_quality
distance_to_school 

In [9]:
# Drop the 'student_id' column (column names were lowercased above)
df = df.drop(columns=['student_id'])


In [10]:
df.head()

Unnamed: 0,jamb_score,study_hours_per_week,attendance_rate,teacher_quality,distance_to_school,school_type,school_location,extra_tutorials,access_to_learning_materials,parent_involvement,it_knowledge,age,gender,socioeconomic_status,parent_education_level,assignments_completed
0,192,22,78,4,12.4,Public,Urban,Yes,Yes,High,Medium,17,Male,Low,Tertiary,2
1,207,14,88,4,2.7,Public,Rural,No,Yes,High,High,15,Male,High,,1
2,182,29,87,2,9.6,Public,Rural,Yes,Yes,High,Medium,20,Female,High,Tertiary,2
3,210,29,99,2,2.6,Public,Urban,No,Yes,Medium,High,22,Female,Medium,Tertiary,1
4,199,12,98,3,8.8,Public,Urban,No,Yes,Medium,Medium,22,Female,Medium,Tertiary,1


In [11]:
df.shape

(5000, 16)

In [12]:
# Fill missing values with zeros

df = df.fillna(0)

In [13]:
# Define features and target variable
X = df.drop(columns=['jamb_score'])
y = df['jamb_score']

In [14]:
X.head()

Unnamed: 0,study_hours_per_week,attendance_rate,teacher_quality,distance_to_school,school_type,school_location,extra_tutorials,access_to_learning_materials,parent_involvement,it_knowledge,age,gender,socioeconomic_status,parent_education_level,assignments_completed
0,22,78,4,12.4,Public,Urban,Yes,Yes,High,Medium,17,Male,Low,Tertiary,2
1,14,88,4,2.7,Public,Rural,No,Yes,High,High,15,Male,High,0,1
2,29,87,2,9.6,Public,Rural,Yes,Yes,High,Medium,20,Female,High,Tertiary,2
3,29,99,2,2.6,Public,Urban,No,Yes,Medium,High,22,Female,Medium,Tertiary,1
4,12,98,3,8.8,Public,Urban,No,Yes,Medium,Medium,22,Female,Medium,Tertiary,1


In [15]:
X.shape

(5000, 15)

In [16]:
y.head()

0    192
1    207
2    182
3    210
4    199
Name: jamb_score, dtype: int64

In [17]:
# Split the data into train, validation, and test sets (60%/20%/20%)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=1)
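The two-step split can be sanity-checked on placeholder data. The sketch below is only an illustration: `X_demo`, `y_demo` and the other names are hypothetical stand-ins with the same number of rows as the dataset, and it confirms that taking 40% first and then halving it yields a 60%/20%/20% partition with no shared rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the dataset (5000)
X_demo = np.arange(5000).reshape(-1, 1)
y_demo = np.arange(5000)

# Step 1: hold out 40% of the rows; step 2: split that 40% in half
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X_demo, y_demo, test_size=0.4, random_state=1)
X_va, X_te, y_va, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=1)

sizes = (len(X_tr), len(X_va), len(X_te))
print(sizes)

# Every row should land in exactly one of the three parts
all_rows = set(y_tr) | set(y_va) | set(y_te)
```

Because `test_size=0.5` is applied to the 40% remainder, each of the validation and test sets ends up with 20% of the original rows.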


In [18]:
# Apply DictVectorizer to convert features into a suitable format for the model
dv = DictVectorizer(sparse=True)
X_train = dv.fit_transform(X_train.to_dict(orient='records'))
X_val = dv.transform(X_val.to_dict(orient='records'))
X_test = dv.transform(X_test.to_dict(orient='records'))
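To see what DictVectorizer does with mixed columns, here is a small sketch on two made-up records (the values and the names `records`, `dv_demo`, `dense` are illustrative assumptions). String values become one-hot columns named `feature=value`, while numeric values are passed through. Note that a missing category filled with the number 0, as done by `fillna(0)` above, becomes its own numeric column rather than a one-hot category.

```python
from sklearn.feature_extraction import DictVectorizer

# Two made-up records; the second has a missing category already filled with 0
records = [
    {"study_hours_per_week": 22, "school_type": "Public", "parent_education_level": "Tertiary"},
    {"study_hours_per_week": 14, "school_type": "Private", "parent_education_level": 0},
]

dv_demo = DictVectorizer(sparse=True)
X_demo = dv_demo.fit_transform(records)

names = dv_demo.feature_names_   # sorted feature names
dense = X_demo.toarray()         # dense copy, only for inspection

for name, column in zip(names, dense.T):
    print(f"{name:35s} {column}")
```

The same naming scheme is why `dv.feature_names_` can later be indexed with the feature ids stored in a fitted tree.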

In [19]:
# Train a Decision Tree Regressor with max_depth=1
dt = DecisionTreeRegressor(max_depth=1, random_state=1)
dt.fit(X_train, y_train)


In [20]:
# Identify which feature is used for splitting
split_feature = dv.feature_names_[dt.tree_.feature[0]]
print("Feature used for splitting:", split_feature)

Feature used for splitting: study_hours_per_week
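Besides reading `tree_.feature[0]`, the whole tree can be printed with `export_text`. The following is a sketch on toy data (the arrays, the `noise` column and the 150/200 scores are invented for illustration), showing how the root feature and threshold of a depth-1 tree can be inspected.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Toy data: the score depends only on study hours; "noise" is unrelated
feature_names = ["noise", "study_hours_per_week"]
hours = np.arange(0, 40)
X_toy = np.column_stack([np.tile([0, 1], 20), hours])
y_toy = np.where(hours >= 20, 200.0, 150.0)

stump = DecisionTreeRegressor(max_depth=1, random_state=1).fit(X_toy, y_toy)

# Node 0 is the root; its feature index points into feature_names
root_feature = feature_names[stump.tree_.feature[0]]
root_threshold = stump.tree_.threshold[0]

print(root_feature, root_threshold)
print(export_text(stump, feature_names=feature_names))
```

With the real model, the same call would take `dt` and `list(dv.feature_names_)` in place of the toy objects.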


#### Question 2 ####


Train a random forest regressor with these parameters:

n_estimators=10
random_state=1
n_jobs=-1 (optional - to make training faster)
What's the RMSE of this model on the validation data?

22.13
42.13
62.13
82.12

### Random Forest Regressor ### 

In [21]:
# Import library
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Train a Random Forest Regressor with specified parameters

rf = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)


In [22]:
# Make predictions on the validation set
y_val_pred = rf.predict(X_val)

y_val_pred

array([203.3, 165.6, 175.3, 209.1, 195.5, 188.3, 218.3, 221. , 145.6,
       153.2, 189.6, 149.3, 243.6, 183. , 218.8, 136.3, 176.5, 138.4,
       203.9, 220. , 182.3, 159.8, 205.7, 183.6, 205.3, 219.3, 242.2,
       167.4, 167.9, 155.7, 207.5, 183. , 186.6, 150.8, 130.2, 143. ,
       176.6, 229.9, 160.1, 172.1, 156.2, 132. , 165.8, 137.7, 171. ,
       198.8, 185. , 209. , 154.4, 148.2, 149.9, 198.3, 177.1, 141.9,
       151.1, 210.8, 151.5, 193.7, 128.3, 147.9, 164.7, 199. , 178.8,
       171.7, 135.7, 196.7, 133.6, 175.8, 211.3, 171.5, 192.1, 150.2,
       157.2, 143.2, 152. , 143. , 181.9, 197.8, 182.9, 185.4, 168.9,
       222.7, 132.8, 156.7, 199. , 189.4, 167.6, 168.6, 185.9, 190.2,
       161.3, 156.7, 158.7, 189.3, 213.4, 142.1, 202.8, 261.2, 172.3,
       133.4, 194.1, 229.9, 175.3, 152.8, 140.1, 216.3, 159.5, 140.8,
       237.7, 166.6, 123.2, 177.5, 150.7, 189.2, 138.5, 194.9, 164.1,
       222.9, 243.9, 162.8, 149. , 145.3, 204.4, 203.5, 197.4, 137.8,
       182.9, 146.3,

In [23]:
# Calculate RMSE (Root Mean Squared Error) on the validation data
rmse_val = np.sqrt(mean_squared_error(y_val, y_val_pred))
print("RMSE on validation data:", rmse_val)

RMSE on validation data: 43.157758977963624
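For reference, RMSE is the square root of the mean squared difference between actual and predicted values. A short worked example with made-up numbers (not taken from the dataset) shows that the manual formula and the `mean_squared_error` route used above agree:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted scores; the errors are -6, 8, 0 and 0
y_true = np.array([200.0, 180.0, 210.0, 190.0])
y_hat = np.array([206.0, 172.0, 210.0, 190.0])

# Manual formula: squared errors 36 + 64 + 0 + 0 = 100, mean 25, root 5
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same computation via scikit-learn
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```

Because the errors are squared before averaging, a few large misses weigh more heavily than many small ones.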


#### Question 3 ####

Now let's experiment with the n_estimators parameter.

Try different values of this parameter from 10 to 200 with step 10.
Set random_state to 1.
Evaluate the model on the validation dataset.
After which value of n_estimators does RMSE stop improving? Consider 3 decimal places for calculating the answer.

10
25
80
200

In [24]:
# Range of n_estimators to try
n_estimators_range = range(10, 201, 10)
rmse_values = []

for n in n_estimators_range:
    # Train a Random Forest Regressor with the current n_estimators
    rf = RandomForestRegressor(n_estimators=n, random_state=1, n_jobs=-1)
    rf.fit(X_train, y_train)
    
    # Predict on the validation set and calculate RMSE
    y_val_pred = rf.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
    rmse_values.append((n, rmse))

# Display RMSE values for each n_estimators
for n, rmse in rmse_values:
    print(f"n_estimators: {n}, RMSE: {rmse:.3f}")


n_estimators: 10, RMSE: 43.158
n_estimators: 20, RMSE: 41.790
n_estimators: 30, RMSE: 41.556
n_estimators: 40, RMSE: 41.076
n_estimators: 50, RMSE: 40.957
n_estimators: 60, RMSE: 40.774
n_estimators: 70, RMSE: 40.588
n_estimators: 80, RMSE: 40.503
n_estimators: 90, RMSE: 40.435
n_estimators: 100, RMSE: 40.365
n_estimators: 110, RMSE: 40.348
n_estimators: 120, RMSE: 40.302
n_estimators: 130, RMSE: 40.286
n_estimators: 140, RMSE: 40.263
n_estimators: 150, RMSE: 40.254
n_estimators: 160, RMSE: 40.200
n_estimators: 170, RMSE: 40.187
n_estimators: 180, RMSE: 40.136
n_estimators: 190, RMSE: 40.152
n_estimators: 200, RMSE: 40.138


* From the results, the RMSE drops quickly up to n_estimators = 80 (from 43.158 to 40.503) and only marginally afterwards. Strictly, at three decimal places it keeps decreasing until n_estimators = 180 (40.136) and first rises at 190 (40.152), but the total gain after 80 is only about 0.37.

Answer: Among the listed options, the RMSE stops improving significantly after 80 estimators.
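The printed RMSE values can also be examined programmatically. The sketch below copies the values from the output above into a dictionary (the variable names are hypothetical) and reports where the minimum lies, where the RMSE first fails to decrease, and how much is gained after 80 estimators.

```python
# RMSE values copied from the printed output above
rmse_by_n = {
    10: 43.158, 20: 41.790, 30: 41.556, 40: 41.076, 50: 40.957,
    60: 40.774, 70: 40.588, 80: 40.503, 90: 40.435, 100: 40.365,
    110: 40.348, 120: 40.302, 130: 40.286, 140: 40.263, 150: 40.254,
    160: 40.200, 170: 40.187, 180: 40.136, 190: 40.152, 200: 40.138,
}

# n_estimators with the lowest RMSE
best_n = min(rmse_by_n, key=rmse_by_n.get)

# First n_estimators after which the next value is not lower
first_rise = next(
    n for n in sorted(rmse_by_n)
    if n + 10 in rmse_by_n and rmse_by_n[n + 10] >= rmse_by_n[n]
)

# How much RMSE is still gained between 80 estimators and the minimum
gain_after_80 = rmse_by_n[80] - rmse_by_n[best_n]

print(best_n, first_rise, round(gain_after_80, 3))
```

Whether a gain of this size counts as an improvement depends on the tolerance chosen, which is why the answer hinges on how "stops improving" is interpreted.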

#### Question 4 ####

Let's select the best max_depth:

Try different values of max_depth: [10, 15, 20, 25]
For each of these values,
try different values of n_estimators from 10 to 200 (with step 10)
calculate the mean RMSE
Fix the random seed: random_state=1
What's the best max_depth, using the mean RMSE?

10
15
20
25

In [25]:
# Define the parameters to try
max_depth_values = [10, 15, 20, 25]
n_estimators_range = range(10, 201, 10)
mean_rmse_per_depth = {}

# Loop over each max_depth value
for max_depth in max_depth_values:
    rmse_values = []
    
    # For each max_depth, try different values of n_estimators
    for n_estimators in n_estimators_range:
        # Train the Random Forest model
        rf = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, random_state=1, n_jobs=-1)
        rf.fit(X_train, y_train)
        
        # Predict on validation set and calculate RMSE
        y_val_pred = rf.predict(X_val)
        rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
        rmse_values.append(rmse)
    
    # Calculate the mean RMSE for this max_depth
    mean_rmse_per_depth[max_depth] = np.mean(rmse_values)
    print(f"Max Depth: {max_depth}, Mean RMSE: {mean_rmse_per_depth[max_depth]:.3f}")

# Find the best max_depth based on the lowest mean RMSE
best_max_depth = min(mean_rmse_per_depth, key=mean_rmse_per_depth.get)
print(f"Best max_depth: {best_max_depth} with Mean RMSE: {mean_rmse_per_depth[best_max_depth]:.3f}")

Max Depth: 10, Mean RMSE: 40.138
Max Depth: 15, Mean RMSE: 40.644
Max Depth: 20, Mean RMSE: 40.610
Max Depth: 25, Mean RMSE: 40.688
Best max_depth: 10 with Mean RMSE: 40.138


#### Question 5 ####

We can extract feature importance information from tree-based models.

At each step, the decision tree learning algorithm finds the best split. For each split we can calculate the "gain": the reduction in impurity from before the split to after it. This gain is useful for understanding which features are important in tree-based models.

In Scikit-Learn, tree-based models expose this information in the feature_importances_ attribute.
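To make the idea of gain concrete, here is a small worked sketch for a regression split, using variance as the impurity measure. The helper `variance_reduction` and the toy scores are assumptions for illustration; Scikit-Learn aggregates and normalizes such impurity decreases internally, so this is not its exact implementation.

```python
import numpy as np

def variance_reduction(y, mask):
    """Impurity decrease of a split, using variance (MSE) as the impurity."""
    y = np.asarray(y, dtype=float)
    left, right = y[mask], y[~mask]
    # Child impurities are weighted by the share of samples in each child
    weighted_child = (len(left) * left.var() + len(right) * right.var()) / len(y)
    return y.var() - weighted_child

# Toy data: two low scores with few study hours, two high scores with many
scores = np.array([150.0, 150.0, 200.0, 200.0])
hours = np.array([5, 10, 25, 30])

good_gain = variance_reduction(scores, hours < 20)  # separates the two groups
poor_gain = variance_reduction(scores, hours < 7)   # leaves one mixed group

print(good_gain, poor_gain)
```

The split at 20 hours removes all variance (gain 625), while the split at 7 hours removes only a third of it, so a tree learner would prefer the former.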

For this homework question, we'll find the most important feature:

Train the model with these parameters:
n_estimators=10,
max_depth=20,
random_state=1,
n_jobs=-1 (optional)
Get the feature importance information from this model
What's the most important feature (among these 4)?

study_hours_per_week
attendance_rate
distance_to_school
teacher_quality

In [29]:
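# Define the features and target (only the four candidate features are used here,
# with a fresh 80/20 split below, not the vectorized matrices from earlier questions)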
features = ['study_hours_per_week', 'attendance_rate', 'teacher_quality', 'distance_to_school']
target = 'jamb_score'

# Split the data into training and testing sets
X = df[features]
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Train the RandomForestRegressor model with the specified parameters
model = RandomForestRegressor(n_estimators=10, max_depth=20, random_state=1, n_jobs=-1)
model.fit(X_train, y_train)

# Get feature importances
feature_importances = model.feature_importances_

# Create a dataframe to display feature importances
importance_df = pd.DataFrame({
    'Feature': features,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

# Display the feature importances
print(importance_df)

                Feature  Importance
3    distance_to_school    0.329162
0  study_hours_per_week    0.325711
1       attendance_rate    0.248699
2       teacher_quality    0.096428


* The most important feature, according to the random forest model, is distance_to_school with an importance score of 0.329, narrowly ahead of study_hours_per_week (0.326).
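When a model is instead trained on a DictVectorizer matrix, the importances come back as a bare array and need to be paired with `feature_names_`. The following is a minimal sketch on toy records; all data and the names `dv_demo`, `rf_demo` and `importances` are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction import DictVectorizer

# Toy records: the target depends only on study hours, not on school type
records = [
    {"study_hours_per_week": h, "school_type": "Public" if h % 2 else "Private"}
    for h in range(40)
]
target = [150 + 2 * r["study_hours_per_week"] for r in records]

dv_demo = DictVectorizer(sparse=True)
X_demo = dv_demo.fit_transform(records)

rf_demo = RandomForestRegressor(n_estimators=10, max_depth=20, random_state=1)
rf_demo.fit(X_demo, target)

# Pair each importance with its vectorized feature name and sort
importances = pd.Series(
    rf_demo.feature_importances_, index=dv_demo.feature_names_
).sort_values(ascending=False)

print(importances)
```

One-hot columns such as `school_type=Public` each get their own importance, so a categorical variable's total contribution is spread over several rows of this series.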

#### Question 6 ####

Now let's train an XGBoost model! For this question, we'll tune the eta parameter:

Install XGBoost
Create DMatrix for train and validation
Create a watchlist
Train a model with these parameters for 100 rounds:
xgb_params = {
    'eta': 0.3, 
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'reg:squarederror',
    'nthread': 8,
    
    'seed': 1,
    'verbosity': 1,
}
Now change eta from 0.3 to 0.1.

Which eta leads to the best RMSE score on the validation dataset?

0.3
0.1
Both give equal value

In [34]:
import xgboost as xgb

# Define the features and target
features = ['study_hours_per_week', 'attendance_rate', 'teacher_quality', 'distance_to_school']
target = 'jamb_score'


In [35]:
# Split the data into training and validation sets
X = df[features]
y = df[target]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=1)


In [36]:
# Convert the datasets into DMatrix (XGBoost's internal data structure)
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

In [37]:
# Create a watchlist to monitor training and validation error
watchlist = [(dtrain, 'train'), (dval, 'eval')]

# Define the initial parameters with eta = 0.3
xgb_params = {
    'eta': 0.3, 
    'max_depth': 6,
    'min_child_weight': 1,
    'objective': 'reg:squarederror',
    'nthread': 8,
    'seed': 1,
    'verbosity': 1,
}

In [38]:
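# Train the model with eta=0.3 for up to 100 rounds; early stopping halts training
# after 10 rounds without improvement on the last watchlist entry ('eval')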
model_0_3 = xgb.train(xgb_params, dtrain, num_boost_round=100, evals=watchlist, early_stopping_rounds=10, verbose_eval=False)

# Get predictions and calculate RMSE for eta=0.3
y_pred_0_3 = model_0_3.predict(dval)
rmse_0_3 = np.sqrt(mean_squared_error(y_val, y_pred_0_3))

In [39]:
# Now change eta to 0.1
xgb_params['eta'] = 0.1

# Train the model with eta=0.1 (same settings: up to 100 rounds, early stopping after 10)
model_0_1 = xgb.train(xgb_params, dtrain, num_boost_round=100, evals=watchlist, early_stopping_rounds=10, verbose_eval=False)

# Get predictions and calculate RMSE for eta=0.1
y_pred_0_1 = model_0_1.predict(dval)
rmse_0_1 = np.sqrt(mean_squared_error(y_val, y_pred_0_1))

# Display the RMSE for both models
print(f"RMSE for eta=0.3: {rmse_0_3}")
print(f"RMSE for eta=0.1: {rmse_0_1}")

RMSE for eta=0.3: 41.7023388950155
RMSE for eta=0.1: 41.04660679062348


In [40]:
# Compare and conclude which eta gives the best performance
if rmse_0_3 < rmse_0_1:
    print("Eta = 0.3 gives the best RMSE.")
elif rmse_0_1 < rmse_0_3:
    print("Eta = 0.1 gives the best RMSE.")
else:
    print("Both eta values give equal RMSE.")

Eta = 0.1 gives the best RMSE.
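To give some intuition for what eta does, the sketch below implements a toy boosting loop for squared error in which each new tree is fitted to the current residuals and its prediction is scaled by eta before being added. This is an assumption-laden illustration built on scikit-learn trees and synthetic data; it is not how XGBoost is implemented (no regularization, no second-order terms), and the function `boost` is hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, eta, n_rounds=20, max_depth=2):
    """Toy boosting for squared error; returns the training RMSE after each round."""
    pred = np.full(len(y), y.mean())  # start from the mean prediction
    history = []
    for _ in range(n_rounds):
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=1)
        tree.fit(X, y - pred)                 # fit the current residuals
        pred = pred + eta * tree.predict(X)   # shrink the tree's contribution by eta
        history.append(float(np.sqrt(np.mean((y - pred) ** 2))))
    return history

# Synthetic data: score grows with study hours, plus noise
rng = np.random.default_rng(1)
X_syn = rng.uniform(0, 40, size=(200, 1))
y_syn = 150 + 2.0 * X_syn[:, 0] + rng.normal(0, 10, size=200)

hist_03 = boost(X_syn, y_syn, eta=0.3)
hist_01 = boost(X_syn, y_syn, eta=0.1)

print(f"eta=0.3: round 1 {hist_03[0]:.3f}, round 20 {hist_03[-1]:.3f}")
print(f"eta=0.1: round 1 {hist_01[0]:.3f}, round 20 {hist_01[-1]:.3f}")
```

A larger eta reduces the training error faster per round, but that says nothing about validation error; smaller steps often generalize better, which is consistent with eta = 0.1 scoring lower on the validation set above.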
