# Predict Test Scores of students


This notebook walks through a workflow that uses several Python-based machine learning models to predict the test scores of students.

We will take the following approach:

1. Problem definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Model Evaluation

# 1. Problem Definition

Given a set of parameters describing a student, can we predict that student's test score?

# 2. Data

A dataset by Kwadwo Ofosu for predicting the posttest scores of students. It has 11 columns: 10 input features plus the posttest label (all listed in section 4).

Source: https://www.kaggle.com/kwadwoofosu/predict-test-scores-of-students

# 3. Evaluation

We will build a regression model and evaluate it using the Root Mean Square Error (RMSE), the R2 score and the Mean Absolute Error (MAE).
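
As a quick illustration of how these three metrics are defined, here is a minimal NumPy sketch. The helper name `regression_metrics` and the toy scores are made up for this example and are not part of the dataset or the modelling code.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R2 for two equal-length sequences (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    rmse = np.sqrt(np.mean(errors ** 2))             # penalises large errors more
    mae = np.mean(np.abs(errors))                    # average absolute miss
    ss_res = np.sum(errors ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                       # share of variance explained
    return {"rmse": rmse, "mae": mae, "r2": r2}

# Made-up posttest scores and predictions
toy_true = [60, 70, 80, 90]
toy_pred = [62, 68, 83, 89]
toy_metrics = regression_metrics(toy_true, toy_pred)
print(toy_metrics)
```

RMSE and MAE are in the same units as the test score, so lower is better, while an R2 closer to 1 is better.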

# 4. Features

The dataset contains information about a test taken by students. It includes features such as school setting, school type, gender and pretest scores, among others.

## Features / inputs

    1. school - Name of the school the student is enrolled in.
    2. school_setting - The location of the school
    3. school_type - The type of school. Either public or non-public
    4. classroom - The type of classroom
    5. teaching_method - The teaching method: either experimental or standard
    6. n_student - Number of students in the class
    7. student_id - A unique ID for each student
    8. gender - The gender of the students: male or female
    9. lunch - Whether a student qualifies for free/subsidized lunch or not
    10. pretest - The pretest score of the students out of 100

## Label / Output

    11. posttest - The posttest scores of the students out of 100

## Standard imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ML Self-Projects/Predict Test Scores of students/Data/test_scores.csv')
df = pd.read_csv('/kaggle/input/predict-test-scores-of-students/test_scores.csv')
df.head()

## Data Exploration (Exploratory Data Analysis, EDA)

In [None]:
df

In [None]:
df.info()

In [None]:
df['school'].unique()

In [None]:
plt.figure(figsize=(20,10))
plt.title('Value counts of Schools')
sns.countplot(data=df, x='school');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of Post Test Scores vs Pretest Scores')
sns.scatterplot(data=df, x='posttest', y='pretest');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of Schools vs Pre Test Scores')
sns.boxplot(data=df, x='pretest', y='school');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of Schools vs Post Test Scores')
sns.boxplot(data=df, x='posttest', y='school');

In [None]:
print(f'Mean of Pre test scores: {df["pretest"].mean()}')
print(f'Mean of Post test scores: {df["posttest"].mean()}')

From the box plots above and the two mean scores, we can see that students usually perform better on the posttest than on the pretest.
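
Comparing two means hides how individual students moved. A per-student gain can be sketched as follows; the helper `score_gain_summary` and the toy frame are assumptions for illustration only, and the same function could be called on `df`.

```python
import pandas as pd

def score_gain_summary(frame, pre_col="pretest", post_col="posttest"):
    """Summarise the per-student change from pretest to posttest (illustrative helper)."""
    gain = frame[post_col] - frame[pre_col]
    return {
        "mean_gain": float(gain.mean()),               # average improvement in points
        "share_improved": float((gain > 0).mean()),    # fraction of students who improved
    }

# Made-up scores for four students
toy_df = pd.DataFrame({"pretest": [50, 60, 70, 80],
                       "posttest": [62, 65, 69, 90]})
toy_gain = score_gain_summary(toy_df)
print(toy_gain)
```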

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of Post Test Scores vs Pretest Scores vs School Setting')
sns.scatterplot(data=df, x='posttest', y='pretest', hue='school_setting', s=70, alpha=0.7);

From the plot we can see that the Urban and Suburban schools are more tightly clustered than the Rural schools in terms of both scores.
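
One way to put a number on "tightly clustered" is the standard deviation of each score per group. The sketch below uses a made-up frame and a hypothetical helper `spread_by_group`; it is not a result from the real data.

```python
import pandas as pd

def spread_by_group(frame, group_col, score_cols=("pretest", "posttest")):
    """Sample standard deviation of each score column within each group."""
    return frame.groupby(group_col)[list(score_cols)].std()

# Made-up scores: the "Urban" rows sit close together, the "Rural" rows are spread out
toy_df = pd.DataFrame({
    "school_setting": ["Urban", "Urban", "Urban", "Rural", "Rural", "Rural"],
    "pretest": [50, 52, 54, 40, 55, 70],
    "posttest": [60, 62, 64, 50, 66, 82],
})
toy_spread = spread_by_group(toy_df, "school_setting")
print(toy_spread)
```

A smaller standard deviation means the scores for that group sit closer together.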

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of Post Test Scores vs Pretest Scores vs School Type')
sns.scatterplot(data=df, x='posttest', y='pretest', hue='school_type', s=70, alpha=0.7);

In [None]:
plt.figure(figsize=(20,10))
plt.title('Classroom Value Count')
plt.xticks(rotation=90)
sns.countplot(data=df, x='classroom');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of Post Test Scores vs Pretest Scores vs teaching Method')
sns.scatterplot(data=df, x='posttest', y='pretest', hue='teaching_method', s=70, alpha=0.7);

Interestingly, the plot shows that students taught with the standard method usually score higher on the pretest than students taught with the experimental method.

In [None]:
plt.figure(figsize=(20,10))
plt.title('Number of student Value Count per class')
plt.xticks(rotation=90)
sns.countplot(data=df, x='n_student');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of Post Test Scores vs Pretest Scores vs Gender')
sns.scatterplot(data=df, x='posttest', y='pretest', hue='gender', s=70, alpha=0.7);

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of Post Test Scores vs Pretest Scores vs Qualifies for lunch')
sns.scatterplot(data=df, x='posttest', y='pretest', hue='lunch', s=70, alpha=0.7);

From the plot we can see that students who do not qualify for free/subsidized lunch usually do better on both the pretest and the posttest.
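
A visual impression like this can be checked with group means. The following is a sketch on made-up numbers; the helper `mean_scores_by_group` and the category labels in the toy frame are assumptions, so check `df['lunch'].unique()` for the real labels before adapting it.

```python
import pandas as pd

def mean_scores_by_group(frame, group_col, score_cols=("pretest", "posttest")):
    """Average each score column within every level of group_col."""
    return frame.groupby(group_col)[list(score_cols)].mean()

# Made-up rows with two hypothetical lunch categories
toy_df = pd.DataFrame({
    "lunch": ["Does not qualify", "Does not qualify", "Qualifies", "Qualifies"],
    "pretest": [70, 80, 40, 50],
    "posttest": [82, 90, 52, 60],
})
toy_means = mean_scores_by_group(toy_df, "lunch")

# Gap between the two groups, one value per score column
toy_gap = toy_means.loc["Does not qualify"] - toy_means.loc["Qualifies"]
print(toy_means)
print(toy_gap)
```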

## Data cleaning

In [None]:
df.info()

As student_id is unique to each student, it carries no information a model can generalise from, so we will drop it.
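
Before dropping a column on these grounds, it can help to confirm that its values really are all distinct. A minimal sketch, using a made-up frame and a hypothetical helper `identifier_like_columns`:

```python
import pandas as pd

def identifier_like_columns(frame):
    """Return the columns whose values are all distinct (one value per row)."""
    return [col for col in frame.columns if frame[col].is_unique]

# Made-up rows: only student_id has a different value on every row
toy_df = pd.DataFrame({
    "student_id": ["A1", "B2", "C3", "D4"],
    "gender": ["Male", "Female", "Female", "Male"],
    "pretest": [50, 60, 60, 80],
})
id_cols = identifier_like_columns(toy_df)
toy_clean = toy_df.drop(columns=id_cols)
print(id_cols)
print(toy_clean.columns.tolist())
```

On real data, a numeric column can also be unique by coincidence, so the output is a list of candidates to review rather than columns to drop blindly.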

In [None]:
df_backup = df.copy()

In [None]:
df = df.drop('student_id', axis=1)

In [None]:
df.info()

### Getting Dummy Variables

In [None]:
df = pd.get_dummies(df, drop_first=True)

In [None]:
df.head()
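
To see what `drop_first=True` does in the encoding above, here is a small sketch on a made-up two-column frame. The category labels are illustrative.

```python
import pandas as pd

# Made-up frame with one categorical and one numeric column
toy_df = pd.DataFrame({
    "school_type": ["Public", "Non-public", "Public"],
    "pretest": [50, 60, 70],
})

# Categorical columns are expanded into indicator columns; with drop_first=True
# the first category (in sorted order) is dropped and becomes the implicit baseline.
toy_encoded = pd.get_dummies(toy_df, drop_first=True)
print(toy_encoded.columns.tolist())
```

Dropping one level per categorical column avoids perfectly collinear indicator columns, which matters most for the linear models used later.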

# 5. Modelling

In [None]:
X = df.drop('posttest', axis=1)
y = df['posttest']
len(X)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
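
The scaling cell above fits the scaler on the training split only and then reuses those statistics for the test split. A rough NumPy equivalent on made-up numbers (the array names are illustrative) looks like this:

```python
import numpy as np

# Made-up train/test arrays with two features
train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
test = np.array([[3.0, 20.0]])

# Statistics come from the training data only
# (StandardScaler uses the population standard deviation, i.e. ddof=0)
train_mean = train.mean(axis=0)
train_std = train.std(axis=0)

train_scaled = (train - train_mean) / train_std
test_scaled = (test - train_mean) / train_std   # test reuses the train statistics
print(test_scaled)
```

Reusing the training statistics keeps information about the test set out of the preprocessing step.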

## Importing Models

In [None]:
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR, LinearSVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import AdaBoostRegressor

In [None]:
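def fit_and_score(models, X_train, X_test, y_train, y_test):
    """Fit each model and return its test-set score as a sorted DataFrame."""
    np.random.seed(42)
    
    model_scores = {}
    
    for name, model in models.items():
        model.fit(X_train, y_train)
        # For regressors, .score() returns the R2 score, not a classification accuracy.
        # The 'Accuracy' label is kept below so that later cells keep working.
        model_scores[name] = model.score(X_test, y_test)

    model_scores = pd.DataFrame(model_scores, index=['Accuracy'])
    model_scores = model_scores.transpose().sort_values('Accuracy')

    return model_scores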

## Baseline models and scores

In [None]:
models = {'Ridge' : Ridge(),
         'Lasso': Lasso(),
         'ElasticNet': ElasticNet(),
         'KNeighborsRegressor': KNeighborsRegressor(),
         'SVR': SVR(),
         'DecisionTreeRegressor': DecisionTreeRegressor(),
         'RandomForestRegressor':RandomForestRegressor(),
         'GradientBoostingRegressor': GradientBoostingRegressor(),
         'AdaBoostRegressor': AdaBoostRegressor()}

In [None]:
baseline_model_scores_df = fit_and_score(models, X_train, X_test, y_train, y_test)

In [None]:
baseline_model_scores_df.sort_values('Accuracy')

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(data=baseline_model_scores_df.T)
plt.title('Baseline Model Accuracy Score')
plt.xticks(rotation=90);

Based on the baseline scores, we will tune the hyperparameters of the following models (R2 score on the test set):

    1. KNeighborsRegressor: 0.942739
    2. RandomForestRegressor: 0.945066
    3. GradientBoostingRegressor: 0.948113
    4. Ridge: 0.957755

## Hyperparameter Tuning via Grid Search CV

In [None]:
from sklearn.model_selection import GridSearchCV
from warnings import filterwarnings

In [None]:
filterwarnings('ignore')

In [None]:
def gridsearch_cv_scores(models, params, X_train, X_test, y_train, y_test):
    np.random.seed(42)
    
    model_gs_scores = {}
    model_gs_best_param = {}
    
    for name, model in models.items():
        gs_model = GridSearchCV(model,
                                param_grid=params[name],
                                scoring='neg_mean_squared_error',
                                n_jobs=-1,
                                cv=5,
                                verbose=2)
        
        gs_model.fit(X_train,y_train)

        model_gs_scores[name] = gs_model.score(X_test,y_test)
        model_gs_best_param[name] = gs_model.best_params_

    model_gs_scores = pd.DataFrame(model_gs_scores, index=['neg_mean_squared_error'])
    model_gs_scores = model_gs_scores.transpose().sort_values('neg_mean_squared_error')
        
    return model_gs_scores, model_gs_best_param
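
The function above relies on the fact that, when `scoring` is set, `GridSearchCV.score` uses that scorer rather than the estimator's default R2. A minimal sketch on synthetic data, where the variable names and the small `alpha` grid are made up for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data, unrelated to the test-score dataset
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

gs_toy = GridSearchCV(Ridge(),
                      param_grid={"alpha": [0.1, 1.0, 10.0]},
                      scoring="neg_mean_squared_error",
                      cv=5)
gs_toy.fit(Xtr, ytr)

# .score() applies the 'neg_mean_squared_error' scorer to the refitted best model
test_score = gs_toy.score(Xte, yte)
test_mse = mean_squared_error(yte, gs_toy.predict(Xte))
print(gs_toy.best_params_, test_score, test_mse)
```

Because the scorer is negated, values closer to zero are better, and the sign has to be flipped to read the result as an MSE.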

### Grid Search CV model 1

In [None]:
models = {'Ridge' : Ridge(),
         'KNeighborsRegressor': KNeighborsRegressor(),
         'RandomForestRegressor':RandomForestRegressor(),
         'GradientBoostingRegressor': GradientBoostingRegressor()}
         
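# Parameter names below follow scikit-learn >= 1.2: Ridge's 'normalize' option was removed
# (the features are already standardized), and 'squared_error' / 'absolute_error'
# replace the older 'mse' / 'ls' and 'mae' / 'lad' names.
params = {'Ridge' : {'alpha' : np.linspace(0,1,20)},
          'KNeighborsRegressor': {'n_neighbors':[1,2,5,10,20]},
          'RandomForestRegressor': {'n_estimators' : [50,100,200],
                    'criterion' : ['squared_error','absolute_error'],
                    'oob_score' : [True,False]},
          'GradientBoostingRegressor': {'criterion': ['squared_error', 'friedman_mse'],
                                        'loss': ['squared_error','absolute_error','huber','quantile']}
          }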

In [None]:
model_gs_scores_1, model_gs_best_param_1 = gridsearch_cv_scores(models, params, X_train, X_test, y_train, y_test)

In [None]:
model_gs_scores_1

In [None]:
model_gs_best_param_1

### Grid Search CV model 2

In [None]:
models = {'Ridge' : Ridge(),
         'KNeighborsRegressor': KNeighborsRegressor(),
         'RandomForestRegressor':RandomForestRegressor(),
         'GradientBoostingRegressor': GradientBoostingRegressor()}
         
# Parameter names below follow scikit-learn >= 1.2: Ridge's 'normalize' option was removed,
# and 'squared_error' replaces the older 'mse' and 'ls' names.
params = {'Ridge' : {'alpha' : np.linspace(0.5,1,20)},
          'KNeighborsRegressor': {'n_neighbors':[4,5,6,7]},
          'RandomForestRegressor': {'n_estimators' : [150,200,300],
                    'criterion' : ['squared_error'],
                    'oob_score' : [False]},
          'GradientBoostingRegressor': {'criterion': ['squared_error'],
                                        'loss': ['squared_error'],
                                        'n_estimators' : [150,200,300]}
          }

In [None]:
model_gs_scores_2, model_gs_best_param_2 = gridsearch_cv_scores(models, params, X_train, X_test, y_train, y_test)

In [None]:
model_gs_scores_2

In [None]:
model_gs_best_param_2

From the grid search with neg_mean_squared_error scoring, we can see that the Ridge model performs best, with a mean squared error of 8.194713 on the test set (the scorer reports this value with a negative sign).
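
To relate this figure to the RMSE used in the final evaluation, the sign is flipped and the square root taken. The sketch below assumes the 8.194713 quoted above is the magnitude of the Ridge neg_mean_squared_error score; the variable names are illustrative.

```python
import math

neg_mse_score = -8.194713      # scorer output: the MSE with its sign flipped
mse = -neg_mse_score           # back to a plain mean squared error
rmse = math.sqrt(mse)          # same units as the test score

print(round(rmse, 4))  # 2.8626
```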

# 6. Model Evaluation

In [None]:
# normalize=False was the default; the parameter was removed in scikit-learn 1.2
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
y_preds = model.predict(X_test)

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
r2 = r2_score(y_test,y_preds)
mae = mean_absolute_error(y_test, y_preds)
mse = mean_squared_error(y_test, y_preds)
rmse = np.sqrt(mse)

In [None]:
print(f'R2 Score: {r2}')
print(f'Mean Absolute Error: {mae}')
print(f'Mean Square Error: {mse}')
print(f'Root Mean Square Error: {rmse}')

Using a Ridge model, we obtained a Root Mean Square Error of 2.8626409815834273, an R2 score of 0.957754777599356 and a Mean Absolute Error of 2.2527079714433427 on the test set.