### Author: `Winston Menzies`
> #### Imperial College Business School
> #### Professional Certificate in Machine Learning and Artificial Intelligence
#### Date: `29 April 2024`
#### Student ID: `484 (Class of 2023)`
#### Required activity 25.3: `Portfolio project on optimising a model for real-life data`
#### Usage: `Capstone Project - Predicting Ofsted School Grades`

## Activity directions

**1. Code Presentation:** Other people should be able to implement your code easily. Make sure it is well commented and clear. It is recommended that you use a Jupyter Notebook to present your method and results.

### Codebase Development
My preferred approach was a modular, structured and reusable codebase built with object-oriented programming (OOP), using classes, objects and inheritance, which leads to cleaner code and better maintainability. I therefore developed a library of Python files. Use the link below to view the codebase on GitHub.

> https://github.com/wrm65/Capstone-Project-2024/tree/main/src

While this notebook builds and runs the models developed to meet the requirements for this project, it does not contain any development code. Instead, it contains a series of function calls that run the models and produce the performance metrics, graphs and summary reports.

Full explanations and expectations are provided prior to running each function. 

### <span style='background:green'><font color = white>&nbsp;Start School Rating Process&nbsp;</font></span>

In [None]:

# import libraries needed to run the models
import sys
sys.path.append("./src")
from constants import *
from school_ofsted_rating import SchoolRating

# create instance of school rating class

school_rating = SchoolRating()


### Exploratory data analysis (EDA)

Let's gain a deeper understanding of the OIS dataset, so we can make informed decisions about how to proceed with our analysis.

1. **OIS dataset dimension**<br>
> total rows and columns
2. **Dataset definition**<br>
> column names and datatype
3. **View _head_ of dataset**<br>
> first 5 rows
4. **Check for missing values**<br>
> show column _NULL_ counts
5. **Classification**<br>
> list the classifications to _predict_

In [None]:

# call function to report on dataset vitals

school_rating.report_dataset_vitals()


### - show additional data analysis

1. **School Gender Type**<br>
> percentage of _boys_, _girls_, and _mixed_ schools
2. **School Religious Ethos**<br>
> percentage of _Church of England_, _Roman Catholic_, _Other religion_ and _non-faith_ schools
3. **School Ofsted Rating**<br>
> percentage of _Outstanding_, _Good_, _Requires improvement_, and _Inadequate_ ratings
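The percentage breakdowns listed above can be sketched in a few lines of plain Python. This is an illustration only, not the `show_summary()` implementation: the function name `percentage_breakdown` and the sample labels are hypothetical.

```python
from collections import Counter


def percentage_breakdown(values):
    """Return the share of each distinct label as a percentage (1 decimal place)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Made-up sample of ratings, used purely for illustration
sample = ["Good", "Good", "Outstanding", "Good", "Inadequate"]
breakdown = percentage_breakdown(sample)
print(breakdown)
```

The same helper would apply to any categorical column, such as gender type or religious ethos.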

In [None]:

# call function to show summary report

school_rating.show_summary()


### Data Preprocessing

1. **Remove _irrelevant_ columns**
2. **Encode Categorical Variables**<br>
> convert ratings to numerical values<br>
> `Outstanding = 1`<br>
> `Good = 2`<br>
> `Requires improvement = 3`<br>
> `Inadequate = 4`<br>
3. **List the models being used**
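The rating encoding in step 2 can be sketched as a simple lookup. The mapping values come from the list above; the names `RATING_CODES` and `encode_ratings` are assumptions made for this example and are not taken from the project codebase.

```python
# Mapping taken from the preprocessing list above
RATING_CODES = {
    "Outstanding": 1,
    "Good": 2,
    "Requires improvement": 3,
    "Inadequate": 4,
}


def encode_ratings(ratings):
    """Return the numeric code for each rating label.

    An unknown label raises KeyError, so a misspelt rating is caught
    early instead of silently becoming a missing value.
    """
    return [RATING_CODES[rating] for rating in ratings]


encoded = encode_ratings(["Good", "Outstanding", "Inadequate"])
print(encoded)
```

Because the codes follow the order of the grades, the encoded target is ordinal: a larger number means a worse rating.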

In [None]:

# call function to drop unwanted columns
# columns being dropped will be shown
# resulting dataset definition will also be shown

school_rating.drop_unwanted_columns()


In [None]:

# call function to encode categorical variables 

school_rating.encode_rating_catogory()


## <span style='background:black'><font color = white>&nbsp;Model Development and Tuning&nbsp;</font></span>

## Data Preparation

### Training, and Testing

Data splitting divides a dataset into two main subsets: 
- the training set, used to train the model;
- and the testing set, used for checking the model’s performance on new data.
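A minimal sketch of such a split, written in plain Python so that it runs without any libraries. The helper name `split_rows` and the fixed seed are assumptions; the `0.35` test fraction matches the value passed to `split_data` below.

```python
import random


def split_rows(rows, test_fraction, seed=42):
    """Shuffle the rows and split them into (train, test) lists."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_rows(range(100), 0.35)
print(len(train), len(test))
```

Shuffling before splitting matters when the source file is ordered, for example by region or by rating, because an unshuffled split would give the model a test set that does not resemble the training set.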

In [None]:

# call function to split the dataset as specified
# training set will be 65% of dataset
# testing set will be 35% of dataset

school_rating.split_data(0.35)


## Build and Run Models
The building and running of a model is encapsulated in the function `build_model(model_type)`, where the argument `model_type` specifies the **_model_** to build and run.

The steps involved in the building and running process are:
- pre-build process: _define_ the hyperparameter to tune
- train model: _fit_ the model using the training data
- predict: _predict_ ratings for the test data
- evaluate model: _calculate_ a set of performance metrics (`accuracy`, `mean squared error`, `recall` and `f1 score`)
- print metrics: _print_ performance metrics for comparison
- post-build process: _provide_ additional information such as _importance of features_, _classification report_, _decision tree visualisation_

#### Evaluate Model
- **accuracy:**<br>
$$ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$$
- **mean squared error:**<br>
$$ \text{MSE} = \frac{1}{n}\sum \limits _{i=1} ^{n} \left(y_i - \hat{y}_i\right)^2$$
- **recall score**<br>
$$ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives}+\text{False Negatives}}$$
- **precision score:**<br>
$$ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives}+\text{False Positives}}$$
- **f1 score:**<br>
$$ F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
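The formulas above can be checked by hand with a small plain-Python sketch. The label vectors are invented for illustration, and the choice of a macro average (the unweighted mean of the per-class scores) is an assumption; the notebook does not state which averaging strategy the codebase uses for its four-class problem.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented encoded ratings (1 = Outstanding ... 4 = Inadequate)
y_true = [1, 2, 2, 3, 4, 2]
y_pred = [1, 2, 3, 3, 2, 2]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

labels = sorted(set(y_true))
macro_recall = sum(per_class_scores(y_true, y_pred, c)[1] for c in labels) / len(labels)

print(accuracy, mse, macro_recall)
```

Mean squared error is only meaningful here because the ratings are encoded in grade order: predicting `2` for a true `4` is penalised more heavily than predicting `3`.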

#### Importance of features
- Feature importance gives us better **interpretability** of data.
- Features provide insights into the underlying relationships and processes within the data.
- Understanding which features are important can help in interpreting the results of the analysis and drawing meaningful conclusions.

#### Classification Report
- The classification report provides a detailed breakdown of how well the model performs on each class (`Outstanding` `Good` `Requires improvement` `Inadequate`), and how it balances the trade-off between precision and recall. It also provides the number of instances (support) for each class, which can indicate the class imbalance or the size of the dataset.

#### Decision Tree Visualisation
- A decision tree visualisation is used to illustrate how underlying data predicts a chosen target and highlights key insights about the decision tree.

In [None]:

# call function to list the models being used

school_rating.show_rating_models()


## <font color = blue>Random Forest model</font>

The following steps are taken during the build and run process:

- **Hyperparameter tuning:** `n_estimators` - number of trees in the forest
- **Method:** iteratively tune the `n_estimators` parameter, increasing it in steps of `50`, to find the best-performing `n_estimators` setting
- **Metrics:** `accuracy score` `recall score` `f1 score` `mean squared error`
- **Additional Info:** _Importance of Features_ and _Classification Report_
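The stepwise tuning method described above follows a pattern that can be sketched independently of any particular model. In this sketch `tune_by_steps` is a hypothetical helper, the search range of 50 to 300 is an assumption, and `toy_score` is a stand-in for "fit the model with this setting and return its test accuracy"; it is not a real model score.

```python
def tune_by_steps(score_fn, start, stop, step):
    """Try each value from start to stop (inclusive) and keep the best scorer."""
    best_value, best_score = None, float("-inf")
    for value in range(start, stop + 1, step):
        score = score_fn(value)
        if score > best_score:
            best_value, best_score = value, score
    return best_value, best_score


def toy_score(n_estimators):
    """Stand-in scorer that peaks at 150; a real one would train and evaluate a model."""
    return 0.80 + 0.05 * (1 - abs(n_estimators - 150) / 150)


best_n, best_score = tune_by_steps(toy_score, start=50, stop=300, step=50)
print(best_n, round(best_score, 4))
```

The same loop would serve the other iteratively tuned models by changing the step size and the parameter passed to the scorer.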

In [None]:

# call function to build and run the Random Forest model

# school_rating.build_model(MODEL_RANDOM_FOREST['TYPE'])


## <font color = blue>Gradient Boosting model</font>

The following steps are taken during the build and run process:

- **Hyperparameter tuning:** `n_estimators` - number of boosting stages to perform
- **Method:** iteratively tune the `n_estimators` parameter, increasing it in steps of `10`, to find the best-performing `n_estimators` setting
- **Metrics:** `accuracy score` `recall score` `f1 score` `mean squared error`
- **Additional Info:** _Importance of Features_

In [None]:

# call function to build and run the Gradient Boosting model

# school_rating.build_model(MODEL_GRADIENT_BOOSTING['TYPE'])


## <font color = blue>Logistic Regression model</font>

The following steps are taken during the build and run process:

- **Hyperparameter tuning:** `max_iter` - maximum number of iterations taken for the solvers to converge
- **Method:** single step process
- **Metrics:** `accuracy score` `recall score` `f1 score` `mean squared error`
- **Additional Info:** _Importance of Features_

In [None]:

# call function to build and run the Logistic Regression model

# school_rating.build_model(MODEL_LOGISTIC_REGRESSION['TYPE'])


## <font color = blue>Naive Bayes model</font>

The following steps are taken during the build and run process:

- **Hyperparameter tuning:** `default settings`
- **Method:** single step process
- **Metrics:** `accuracy score` `recall score` `f1 score` `mean squared error`
- **Additional Info:** none provided

In [None]:

# call function to build and run the Naive Bayes model

# school_rating.build_model(MODEL_NAIVE_BAYES['TYPE'])


## <font color = blue>Support Vector model</font>

The following steps are taken during the build and run process:

- **Hyperparameter tuning:** `decision_function_shape` - one-vs-one (`ovo`) used as a multi-class strategy to train models
- **Method:** single step process
- **Metrics:** `accuracy score` `recall score` `f1 score` `mean squared error`
- **Additional Info:** none provided

In [None]:

# call function to build and run the Support Vector model

# school_rating.build_model(MODEL_SUPPORT_VECTOR['TYPE'])


## <font color = blue>K Nearest Neighbors (KNN) model</font>

The following steps are taken during the build and run process:

- **Hyperparameter tuning:** `n_neighbors` - number of neighbors to use by default for kneighbors queries
- **Method:** iteratively tune the `n_neighbors` parameter, increasing it in steps of `10`, to find the best-performing `n_neighbors` setting
- **Metrics:** `accuracy score` `recall score` `f1 score` `mean squared error`
- **Additional Info:** none provided

In [None]:

# call function to build and run the KNN model

# school_rating.build_model(MODEL_KNN['TYPE'])


## <font color = blue>Multilayer Perceptron model</font>

The following steps are taken during the build and run process:

- **Hyperparameter tuning:** `hidden_layer_sizes` - _ith_ element represents the number of neurons in the _ith_ hidden layer
- **Method:** iteratively tune the `hidden_layer_sizes` parameter, increasing it in steps of `2`, to find the best-performing `hidden_layer_sizes` setting
- **Metrics:** `accuracy score` `recall score` `f1 score` `mean squared error`
- **Additional Info:** _Classification Report_

In [None]:

# call function to build and run the Multilayer Perceptron model

# school_rating.build_model(MODEL_MULTILAYER_PERCEPTRON['TYPE'])


## <font color = blue>Decision Tree model</font>

The following steps are taken during the build and run process:

- **Hyperparameter tuning:** `max_depth` - maximum depth of the tree; `max_leaf_nodes` - maximum number of leaf nodes the tree may grow
- **Method:** iteratively tune the `max_leaf_nodes` parameter, increasing it in steps of `5`, to find the best-performing `max_leaf_nodes` setting
- **Metrics:** `accuracy score` `recall score` `f1 score` `mean squared error`
- **Additional Info:** _Importance of Features_ and _Decision Tree Visualisation_

In [None]:

# call function to build and run the Decision Tree model

# school_rating.build_model(MODEL_DECISION_TREE['TYPE'])


## Model Evaluation

In [None]:

# call function to produce the list of important features for the models

# school_rating.show_important_features()


## <font color = blue>Confusion Matrix</font>

A confusion matrix is a performance measurement for a classification model that shows the different combinations of predicted and actual values. It is useful for calculating `Recall`, `Precision`, `Specificity` and `Accuracy` scores.

- **True Positive (TP):** the model predicts `positive` and the actual class is `positive`
- **True Negative (TN):** the model predicts `negative` and the actual class is `negative`
- **False Positive (FP):** the model predicts `positive` but the actual class is `negative`
- **False Negative (FN):** the model predicts `negative` but the actual class is `positive`
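With four rating classes, these four counts are read per class by treating one rating as `positive` and the other three as `negative`. The plain-Python sketch below shows that reading on invented labels; the function names are hypothetical and this is not the `show_confusion_matrix()` implementation.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Build a matrix where rows are actual classes and columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


def one_vs_rest_counts(matrix, k):
    """Return (TP, TN, FP, FN) for the class at position k, all other classes being negative."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                  # actual k, predicted something else
    fp = sum(row[k] for row in matrix) - tp   # predicted k, actually something else
    tn = total - tp - fn - fp
    return tp, tn, fp, fn


labels = [1, 2, 3, 4]          # 1 = Outstanding ... 4 = Inadequate
y_true = [1, 2, 2, 3, 4, 2]    # invented actual ratings
y_pred = [1, 2, 3, 3, 2, 2]    # invented predicted ratings

matrix = confusion_matrix(y_true, y_pred, labels)
counts_good = one_vs_rest_counts(matrix, labels.index(2))
print(matrix)
print(counts_good)
```

Correct predictions sit on the diagonal, so a model biased towards the majority classes shows up as large off-diagonal counts in the `Good` and `Outstanding` columns.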

In [None]:

# call function to produce the confusion matrix for the models

# school_rating.show_confusion_matrix()


In [None]:

# call function to produce the list of performance metrics for the models

# school_rating.evaluate_model()


### Results Analysis

## <span style='background:black'><font color = white>&nbsp;Data Imbalance&nbsp;</font></span>

The performance metrics report and confusion matrices show that there is a bias in the grading predictions.

To overcome the bias of the majority classes (Good and Outstanding) and balance the class distribution in the dataset, the **over-sampling** technique **SMOTE** will be used. Over-sampling involves creating synthetic instances for the minority classes (Inadequate and Requires Improvement) to match the number of instances in the majority classes.

#### - Address the problem of imbalanced datasets
> SMOTE - Synthetic Minority Oversampling Technique
> - SMOTE works by synthesizing new instances for the minority class by interpolating between existing minority class instances
> - For each minority class instance, SMOTE selects one or more of its nearest neighbors from the same class and creates synthetic instances along the line segments connecting the instance to its neighbours.
> - SMOTE helps to increase the representation of the minority class in the dataset without simply duplicating existing instances, thus reducing the risk of overfitting.
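The interpolation step described above can be illustrated with a short plain-Python sketch. This is a simplified, SMOTE-style illustration under stated assumptions and not the library algorithm or the `run_data_balance()` implementation: the function name `smote_like_samples`, the value `k=2`, the seed and the sample points are all hypothetical.

```python
import math
import random


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a minority instance, then find its k nearest minority neighbours
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Place the new point at a random position on the segment base -> neighbour
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Invented two-feature minority-class points
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote_like_samples(minority, n_new=5)
print(new_points)
```

Because every synthetic point lies on a line segment between two real minority instances, the new points stay inside the region the minority class already occupies. Over-sampling of this kind is normally applied to the training set only, so that the test set still reflects the real class distribution.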

In [None]:

# call function to balance the data

school_rating.run_data_balance()


## Rebuild and rerun the models

In [None]:

# call function to build and run the Random Forest model

school_rating.build_model(MODEL_RANDOM_FOREST['TYPE'])


In [None]:

# call function to build and run the Gradient Boosting model

school_rating.build_model(MODEL_GRADIENT_BOOSTING['TYPE'])


In [None]:

# call function to build and run the Logistic Regression model

school_rating.build_model(MODEL_LOGISTIC_REGRESSION['TYPE'])


In [None]:

# call function to build and run the Naive Bayes model

school_rating.build_model(MODEL_NAIVE_BAYES['TYPE'])


In [None]:

# call function to build and run the Support Vector model

school_rating.build_model(MODEL_SUPPORT_VECTOR['TYPE'])


In [None]:

# call function to build and run the KNN model

school_rating.build_model(MODEL_KNN['TYPE'])


In [None]:

# call function to build and run the Multilayer Perceptron model

school_rating.build_model(MODEL_MULTILAYER_PERCEPTRON['TYPE'])


In [None]:

# call function to build and run the Decision Tree model

school_rating.build_model(MODEL_DECISION_TREE['TYPE'])


## Model Evaluation

In [None]:

# call function to produce the list of important features for the models

school_rating.show_important_features()


In [None]:

# call function to produce the confusion matrix for the models

school_rating.show_confusion_matrix()


In [None]:

# call function to produce a comparison table of the performance metrics of the models

school_rating.evaluate_model()


In [None]:

# call function to produce the leaderboard for the models

school_rating.model_leaderboard()


## Model Recommendation

- Based on the performance metrics, the **Multilayer Perceptron Classifier** model achieved the highest accuracy of `86.05%`, closely followed by the **Gradient Boosting Classifier** with `86.04%` for predicting Ofsted school grading.
- However, it's essential to consider other factors such as interpretability, computational complexity, and ethical considerations when selecting the best model.
- By considering insights from the four evaluation reports collectively, the model which is best suited for predicting Ofsted school grading is the **Decision Tree Classifier**, with an accuracy of `85.99%`.

- Using the **Decision Tree Classifier** model to predict Ofsted school grading offers several advantages that make it preferable in certain scenarios:
> - **Interpretability**: Decision trees are inherently interpretable models, meaning that the decision-making process is transparent and easy to understand. This is especially important in educational settings where stakeholders such as teachers, administrators, and policymakers need to comprehend the factors driving school grading decisions.
> - **Feature Importance**: Decision trees provide insight into the relative importance of different features in predicting school grading. By examining the decision rules and splits in the tree, stakeholders can identify which features have the greatest influence on the classification outcome. This information can inform targeted interventions and improvement strategies.
> - **Natural Representation of Decision-Making**: Decision trees mimic human decision-making processes, making them intuitive and easy to relate to for stakeholders. This natural representation can facilitate discussions and collaboration between educators, policymakers, and other stakeholders involved in education quality and improvement efforts.



### <span style='background:green'><font color = white>&nbsp;End School Rating Process&nbsp;</font></span>


## Thank You _!_

<img src="./images/email_logo-01_190x65.png" alt="Grammology" style="width: 110px;"/>