This is my first attempt to clean data, engineer features, and train some machine learning models in Python. Any feedback will be appreciated.

*  [1. Introduction](#introduction)
> * 1.1. Acknowledgements

*  [2. Import libraries](#import-libraries)

* [3. Data overview](#data-overview)
> * [3.1. Assessments info](#ass)
     * 3.1.1. Missing values and duplicate rows
     * 3.1.2. Data types
     * 3.1.3. Inconsistent weights
         * 3.1.3.1. Fix inconsistent weights
     * 3.1.4. Check if Assessments Info is in the Results table
> * [3.2. Assessments results](#results)
    * 3.2.1. Missing values and duplicate rows
    * 3.2.2. Data types
    * 3.2.3. Non-submissions
> * [3.3. Courses info](#courses)
    * 3.3.1. Missing values and duplicate rows
    * 3.3.2. Data types
> * [3.4. Student registration](#reg)
    * 3.4.1. Missing values and duplicate rows
    * 3.4.2. Data types
    * 3.4.3. Check if in results table
> * [3.5. VLE resources](#vle)
    * 3.5.1. Missing values and duplicate rows
    * 3.5.2. Data types
> * [3.6. VLE Interactions](#vle-int)
    * 3.6.1. Missing values and duplicate rows
    * 3.6.2. Data types
> * [3.7. Student information](#info)
    * 3.7.1. Missing values and duplicate rows
    * 3.7.2. Data types

* [4. Frame the problem](#frame)

* [5. Merge tables and Feature engineering](#merge)
> * 5.1. VLE + VLE Interactions
    * 5.1.1. Pre-processing
> * 5.2. Registration info + Courses + Info
> * 5.3. Assessment info + Assessment Results
    * 5.3.1. Feature engineering
        * 5.3.1.1. Weighted score
        * 5.3.1.2. Late Submission
        * 5.3.1.3. Fail rate
    * 5.3.2. Merged all result tables
> * 5.4. Merge all tables

* [6. Split the dataset](#split)

* [7. Final cleaning](#final-cleaning)

* [8. Univariate analysis: numerical data](#num)

* [9. Univariate analysis: categorical data](#cat)

* [10. Bivariate analysis: final scores vs other variables](#scores-vs-variables)

* [11. Regression](#regression)
   > * 11.1. Model preparation
   > * 11.2. Models
   > * 11.3. Best Regression Model - evaluation

* [12. Classification](#classification)
   > * 12.1. Model preparation
   > * 12.2. Models
   > * 12.3. Best Classification Model - evaluation

* [13. Discussion](#discussion)

<a id="introduction"></a>
# 1. Introduction
***

The dataset for this machine learning project was provided by the learning analytics research group at the Knowledge Media Institute, The Open University. It is publicly available and consists of tables with information on student demographics, the modules undertaken, the time of year the modules start (module presentations), student academic success in terms of grades for assignments and exams, and students’ interactions with the university’s Virtual Learning Environment (VLE).

The task at hand is to predict which students are likely to fail or withdraw and which are likely to pass their modules.

The dataset is rather messy, with many values missing and some inconsistencies between tables. All cleaning steps are detailed in the first part of the notebook. Various inconsistencies are reported and dealt with or suggestions are made as to how to deal with them in future work.

Some feature engineering is done with suggestions for more features that could be of help in this project.

Finally, several classification and regression models are used to predict student academic success.

## 1.1. Acknowledgements

Two notebooks were very helpful in starting this analysis:
* [Data Cleaning-Feature Generation-EDA-Segmentation by Anil](https://www.kaggle.com/anlgrbz/data-cleaning-feature-generation-eda-segmentation)
* [Student Performance Prediction: Complete analysis by Victor Régis](https://www.kaggle.com/devassaxd/student-performance-prediction-complete-analysis)

<a id="import-libraries"></a>
# 2. Import libraries
***

In [None]:
# Data manipulation
import numpy as np
import pandas as pd

# Visualisations
# (a) ggplot-like graphs for EDA
from plotnine import *
import plotnine
plotnine.options.figure_size = (5.2,3.2)
# (b) for plotting other plots
import seaborn as sns
sns.set()
import matplotlib.pyplot as plt
%matplotlib inline

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# setting random seed for notebook reproducibility
import random

seed = 123
random.seed(seed)
np.random.seed(seed)

<a id="data-overview"></a>
# 3. Data overview
***

This chapter presents a quick overview of the data before the training/test split.

In [None]:
# load datasets
ass = pd.read_csv('/kaggle/input/open-university-learning-analytics-dataset/assessments.csv')
courses = pd.read_csv('/kaggle/input/open-university-learning-analytics-dataset/courses.csv')
results = pd.read_csv('/kaggle/input/open-university-learning-analytics-dataset/studentAssessment.csv')
info = pd.read_csv('/kaggle/input/open-university-learning-analytics-dataset/studentInfo.csv')
reg = pd.read_csv('/kaggle/input/open-university-learning-analytics-dataset/studentRegistration.csv')
vle = pd.read_csv('/kaggle/input/open-university-learning-analytics-dataset/studentVle.csv')
materials = pd.read_csv('/kaggle/input/open-university-learning-analytics-dataset/vle.csv')

<a id="ass"></a>
## 3.1. Assessments info

This file contains information about assessments in module-presentations. Usually, every
presentation has a number of assessments followed by the final exam. CSV contains columns:
1. **code_module** – identification code of the module, to which the assessment belongs.
2. **code_presentation** - identification code of the presentation, to which the assessment
belongs.
3. **id_assessment** – identification number of the assessment.
4. **assessment_type** – type of assessment. Three types of assessments exist: Tutor
Marked Assessment (TMA), Computer Marked Assessment (CMA) and Final Exam
(Exam).
5. **date** – information about the final submission date of the assessment calculated as
the number of days since the start of the module-presentation. The starting date of
the presentation has number 0 (zero).
6. **weight** - weight of the assessment in %. Typically, Exams are treated separately and
have the weight 100%; the sum of all other assessments is 100%.
If the information about the final exam date is missing, it is at the end of the last presentation
week.

In [None]:
ass.head()

### 3.1.1. Missing values and duplicate rows

In [None]:
# Percentage of missing values
ass.isnull().sum() * 100 / len(ass)

In [None]:
ass.nunique()

In [None]:
ass[ass.duplicated()]
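The same three checks (missing values, unique counts, duplicate rows) are repeated for every table in this chapter. A minimal sketch of how they could be bundled is shown below. `quick_overview` is a hypothetical helper name and the toy table is made up for illustration; neither is part of the dataset.

```python
import pandas as pd

def quick_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise missing values, cardinality and dtype per column (hypothetical helper)."""
    return pd.DataFrame({
        'missing_pct': df.isnull().mean() * 100,  # share of nulls per column, in %
        'n_unique': df.nunique(),                 # distinct non-null values per column
        'dtype': df.dtypes.astype(str),
    })

# Toy data standing in for one of the tables (illustrative values only)
toy = pd.DataFrame({
    'id_assessment': [1, 2, 2, 4],
    'date': [19.0, 54.0, 54.0, None],
})

overview = quick_overview(toy)
n_duplicates = int(toy.duplicated().sum())  # fully identical rows

print(overview)
print('Duplicate rows:', n_duplicates)
```

Calling `quick_overview(ass)` would give the same numbers as the separate cells above, in one table.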

### 3.1.2. Data types

In [None]:
ass.info()

Assessment IDs are stored as integers. IDs are identifiers rather than quantities, so they should be treated as categorical. The code below corrects this.

In [None]:
ass['id_assessment'] = ass['id_assessment'].astype(object)

### 3.1.3. Inconsistent weights

The project brief states that exams typically have a weight of 100 and that the weights of all other assessments sum to 100. This would mean that a module with only an exam would have a total weight of 100, and a module with one exam and some assessments would have a total weight of 200. Let’s check whether this is so in the table provided.

In [None]:
# Group by module presentation and sum the weights of assessments
ass\
.groupby(['code_module','code_presentation'])\
.agg(total_weight = ('weight',sum))

Here we can see that most module presentations have a total weight of 200, apart from module CCC, which has 300, and module GGG, which has 100. Let's have a closer look.

In [None]:
# See what are the weights of exams in module presentations
ass[ass['assessment_type'] == 'Exam']\
.groupby(['code_module','code_presentation', 'assessment_type'])\
.agg(total_weight = ('weight',sum))

All modules show weight of 100 for exams apart from module CCC (for both presentations). Let's count the exams in each module presentation.

In [None]:
# Count how many exams there are in every module presentation
ass[ass['assessment_type'] == 'Exam'][['code_module', 'code_presentation', 'id_assessment']]\
.groupby(['code_module', 'code_presentation'])\
.count()

Module CCC has two exams, which explains the high total assessment weight for this module. Now let's have a look at all the assignments that are not exams and see if everything is as it should be.

In [None]:
# Sum the weights of all course work assignments per module presentation
ass[ass['assessment_type'] != 'Exam']\
.groupby(['code_module', 'code_presentation'])\
.agg(total_weight = ('weight',sum))

Here we see that the assignments in module GGG carry no weight. Is it because there are no assignments for this module?

In [None]:
ass[ass['code_module'] == 'GGG']\
.groupby(['code_module','code_presentation', 'assessment_type'])\
.agg(weight_by_type = ('weight', sum))

Are there any other CMA and TMA assignments with a weight of 0?

In [None]:
ass[(ass['assessment_type'] == 'CMA') & (ass['weight'] == 0) & (ass['code_module'] != 'GGG')]['weight'].count()

In [None]:
ass[(ass['assessment_type'] == 'TMA') & (ass['weight'] == 0) & (ass['code_module'] != 'GGG')]['weight'].count()

In [None]:
ass[(ass['assessment_type'] == 'TMA') & (ass['weight'] == 0)]

In [None]:
ass[ass['code_module'] == 'BBB']\
.groupby(['code_module','code_presentation', 'assessment_type'])\
.agg(weight_by_type = ('weight',sum))

#### 3.1.3.1. Fix inconsistent weights

What is the usual weight of assignments?

In [None]:
column = ass[(ass['assessment_type'] == 'CMA') & (ass['code_module'] != 'GGG')]['weight']

unique, counts = np.unique(column, return_counts = True)

dict(zip(unique, counts))

In [None]:
column = ass[(ass['assessment_type'] == 'TMA') & (ass['code_module'] != 'GGG')]['weight']

unique, counts = np.unique(column, return_counts = True)

dict(zip(unique, counts))

In [None]:
# How many total assignments in GGG module are there?
ass[ass['code_module'] == 'GGG'][['code_module', 'code_presentation', 'assessment_type', 'id_assessment']]\
.groupby(['code_module','code_presentation', 'assessment_type'])\
.count()

Since CMAs often have a weight of 0, we will, for simplicity, assign the full coursework weight of 100 to the TMAs (split equally between them) and leave the CMAs at 0.

In [None]:
# Assign new weights to module GGG assessments
ass.loc[(ass.code_module=='GGG') & (ass.assessment_type=='TMA'),'weight'] = (100/3)
ass.loc[(ass.code_module=='GGG') & (ass.assessment_type=='CMA'),'weight'] = (0)
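The cell above hardcodes `100/3`, which relies on each GGG presentation having exactly three TMAs. As a hedged alternative, the number of TMAs can be counted per presentation instead. The sketch below uses a made-up toy table (`toy_ass`) with the same column names, not the real data.

```python
import pandas as pd

# Toy stand-in for the Assessments table (values are illustrative only)
toy_ass = pd.DataFrame({
    'code_module': ['GGG'] * 5,
    'code_presentation': ['2013J'] * 5,
    'assessment_type': ['TMA', 'TMA', 'TMA', 'CMA', 'Exam'],
    'weight': [0.0, 0.0, 0.0, 0.0, 100.0],
})

is_tma = (toy_ass['code_module'] == 'GGG') & (toy_ass['assessment_type'] == 'TMA')

# Number of TMAs in each module presentation, aligned to the TMA rows
n_tma = (
    toy_ass[is_tma]
    .groupby(['code_module', 'code_presentation'])['weight']
    .transform('count')
)

# Spread a total weight of 100 evenly across the TMAs of each presentation
toy_ass.loc[is_tma, 'weight'] = 100 / n_tma

tma_total = toy_ass.loc[is_tma, 'weight'].sum()
print(toy_ass)
```

This way the weights still sum to 100 if a presentation has a different number of TMAs.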

In [None]:
# Check that TMA now sums to 100
ass[ass['code_module'] == 'GGG']\
.groupby(['code_module','code_presentation', 'assessment_type'])\
.agg(weight_by_type = ('weight', sum))

In [None]:
# check that all assessments now sum to 200
ass[ass['code_module'] == 'GGG']\
.groupby(['code_module','code_presentation'])\
.agg(total_weight = ('weight', sum))
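Instead of reading the grouped tables by eye, the rule from the brief (coursework weights sum to 100) can be checked directly. This is a small sketch on made-up data; the tolerance of 0.01 is an assumption to allow for floating-point weights such as `100/3`.

```python
import pandas as pd

# Toy stand-in for the Assessments table (illustrative values only)
toy_ass = pd.DataFrame({
    'code_module': ['AAA', 'AAA', 'AAA', 'GGG', 'GGG'],
    'code_presentation': ['2013J'] * 5,
    'assessment_type': ['TMA', 'TMA', 'Exam', 'TMA', 'Exam'],
    'weight': [40.0, 60.0, 100.0, 0.0, 100.0],
})

# Sum the coursework (non-exam) weights per module presentation
coursework = toy_ass[toy_ass['assessment_type'] != 'Exam']
coursework_weight = coursework.groupby(['code_module', 'code_presentation'])['weight'].sum()

# Keep only the presentations that deviate from the expected total of 100
inconsistent = coursework_weight[~coursework_weight.between(99.99, 100.01)]
print(inconsistent)
```

An empty result means that every module presentation follows the rule.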

### 3.1.4. Check if Assessments Info is in the Results table

In [None]:
def compareCols(df1, df2):
    '''
    Check what columns are shared between two dataframes
    and count values of df1 present and absent in df2 (in the shared
    columns)
    '''

    # Show shared columns between dataframes
    # (a) Make lists of columns for each data frame
    df1Columns = df1.columns.values.tolist()
    df2Columns = df2.columns.values.tolist()

    # (b) Find column names that are the same
    diffDict = set(df1Columns) & set(df2Columns)
    
    print('Shared columns : ', diffDict, '\n')

    # (c) Make a list from the set of shared columns
    diffList = list(diffDict)
    # (d) For every shared column, count
    # how many df1 values are present
    # in (True) or absent from (False) df2
    for col in diffList:
        x = df1[col].isin(df2[col]).value_counts()
        print('Check if values are present in both dataframes:')
        print(x, '\n')

compareCols(ass, results)

In [None]:
def findDiffValues(df1, df2, col):
    '''
    Find all df1.col values not present in df2.col
    '''
    # Pull out all unique values of col
    df1_IDs = df1[col].unique()
    df2_IDs = df2[col].unique()

    # Compare the two lists
    # (a) Find what values are different
    diff = set(df1_IDs).difference(set(df2_IDs))
    # (b) Count how many are different
    numberDiff = len(diff)

    print("Values from df1 not in df2: " + str(diff))
    print("Number of missing values: " + str(numberDiff))

findDiffValues(ass, results, 'id_assessment')

In [None]:
def printDiffValues(df1, df2, col):
    '''
    Show all df1.col values not present in df2.col
    '''
    # Pull out all unique values of col
    df1_IDs = df1[col].unique()
    df2_IDs = df2[col].unique()

    # Compare the two lists
    # (a) Find what values are different
    diff = set(df1_IDs).difference(set(df2_IDs))
    
    # Show information for all df1.col values not present in df2.col
    # (a) Make a list of missing values
    missingList = list(diff)
    # (b) Find the rows of df1 with these values
    missingDf = df1[df1[col].isin(missingList)]

    return missingDf

printDiffValues(ass, results, 'id_assessment')
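The set-based helpers above can also be written as an anti-join using the `indicator` option of `DataFrame.merge`. The sketch below is an alternative and not the method used in this notebook. `anti_join` is a hypothetical name and the toy tables are made up.

```python
import pandas as pd

def anti_join(df1, df2, col):
    """Return rows of df1 whose `col` value never appears in df2 (hypothetical helper)."""
    keys = df2[[col]].drop_duplicates()
    # indicator=True adds a '_merge' column saying where each row was found
    merged = df1.merge(keys, on=col, how='left', indicator=True)
    return merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

# Toy data (illustrative values only)
toy_ass = pd.DataFrame({
    'id_assessment': [1, 2, 3],
    'assessment_type': ['TMA', 'CMA', 'Exam'],
})
toy_results = pd.DataFrame({'id_assessment': [1, 1, 2], 'score': [70, 55, 90]})

missing = anti_join(toy_ass, toy_results, 'id_assessment')
print(missing)
```

Note that the key column needs the same dtype in both tables for the merge to match values.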

All assignments missing from the Results (and consequently Merged) table are exams with 100% module weight. Are there any other 100% weighted assignments in the Assessment table apart from these?

In [None]:
# Make a list of missing IDs
missingList = [30723, 1763, 34885, 15014, 37444, 14990, 30713, 37424, 15025, 34898, 37434, 40087, 34872, 40088, 15002, 1757, 30718, 34911]

# Get all rows with weight 100 from Assessments table
weight100 = ass[ass['weight'] == 100]
# Get all unique assessment IDs
weight100List = weight100['id_assessment'].unique()

# Compare this list with the list of all assessment IDs missing from results table
compare = set(weight100List).difference(set(missingList))
numberCompare = len(compare)

print("100 weighted assessments in the Results table (that are not missing exams): " + str(compare))
print("Number of 100 weighted assessments (that are not missing exams) in the Results table: " + str(numberCompare))
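The list of missing IDs above is typed in by hand. A hedged sketch of how the same comparison could be derived from the tables themselves is given below, on made-up data with the same column names.

```python
import pandas as pd

# Toy data (illustrative values only)
toy_ass = pd.DataFrame({
    'id_assessment': [10, 11, 12, 13],
    'assessment_type': ['TMA', 'Exam', 'Exam', 'Exam'],
    'weight': [100.0, 100.0, 100.0, 100.0],
})
toy_results = pd.DataFrame({'id_assessment': [10, 10, 13], 'score': [70, 55, 61]})

# IDs listed in the Assessments table but absent from the Results table
missing_ids = set(toy_ass['id_assessment'].tolist()) - set(toy_results['id_assessment'].tolist())

# IDs of all assessments with a weight of 100
weight100_ids = set(toy_ass.loc[toy_ass['weight'] == 100, 'id_assessment'].tolist())

# Weight-100 assessments that do have recorded results
present_weight100 = sorted(weight100_ids - missing_ids)
print(present_weight100)
```

Deriving the list avoids copy-paste mistakes if the data changes.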

In [None]:
# Show information for weight 100 assessments in the results table
# (a) Make a list of IDs to look for
matchList = [24290, 25354, 24299, 25361, 25368, 25340]
# (b) Find these IDs in the Assessments table
matchDf = ass[ass['id_assessment'].isin(matchList)]

matchDf

Due to the above we can't say that all final exams are missing from the results table, just some exams.

<a id="results"></a>
## 3.2. Assessments results

This file contains the results of students’ assessments. If the student does not submit the
assessment, no result is recorded. The final exam submissions are missing if the results of the
assessments are not stored in the system. This file contains the following columns:
1. **id_assessment** – the identification number of the assessment.
2. **id_student** – a unique identification number for the student.
3. **date_submitted** – the date of student submission, measured as the number of days
since the start of the module presentation.
4. **is_banked** – a status flag indicating that the assessment result has been transferred
from a previous presentation.
5. **score** – the student’s score in this assessment. The range is from 0 to 100. A score
lower than 40 is interpreted as Fail.

In [None]:
results.head()

### 3.2.1. Missing values and duplicate rows

In [None]:
# Percentage of missing values
results.isnull().sum() * 100 / len(results)

In [None]:
results.nunique()

We can see that the number of assessments in the results table does not match the number of assessments in the Assessments table.

In [None]:
results[results.duplicated()]

### 3.2.2. Data types

In [None]:
results.info()

In [None]:
results['id_assessment'] = results['id_assessment'].astype(object)
results['id_student'] = results['id_student'].astype(object)

### 3.2.3. Non-submissions

We know that if the student does not submit the assessment, no result is recorded. Therefore, all null scores can be interpreted as non-submissions, which means we can fill them with zeros.

It is, however, a little strange that there are recorded submission days for assessments with null scores. One would expect a null value for the submission date for an assessment that has not been submitted. Ideally, this should be clarified with data providers.

In [None]:
# Have a look at NaN values
results[results['score'].isnull()]

In [None]:
# Replace null scores with 0s (non-submissions)
results['score'] = results['score'].fillna(0)
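Filling with zeros makes a non-submission indistinguishable from a submitted assessment that scored 0. One possible safeguard, sketched below on made-up data, is to record a flag before filling. The column name `score_was_missing` is an assumption introduced for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Results table (illustrative values only)
toy_results = pd.DataFrame({
    'id_assessment': [1, 1, 2],
    'id_student': [100, 101, 100],
    'date_submitted': [18, 20, 50],
    'score': [78.0, np.nan, 64.0],
})

# Remember which rows had no score before overwriting them
toy_results['score_was_missing'] = toy_results['score'].isnull()

# Fill the score column only, leaving all other columns untouched
toy_results['score'] = toy_results['score'].fillna(0)

print(toy_results)
```

The flag could later be used as a feature or to undo the imputation.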

<a id="courses"></a>
## 3.3. Courses info

File contains the list of all available modules and their details. The columns are:
1. **code_module** – code name of the module, which serves as the identifier.
2. **code_presentation** – code name of the presentation. It consists of the year and “B” for
the presentation starting in February and “J” for the presentation starting in October.
3. **module_presentation_length** - length of the module-presentation in days.

In [None]:
courses.head()

### 3.3.1. Missing values and duplicate rows

In [None]:
# Percentage of missing values
courses.isnull().sum() * 100 / len(courses)

In [None]:
courses.nunique()

In [None]:
courses[courses.duplicated()]

### 3.3.2. Data types

In [None]:
courses.info()

<a id="reg"></a>
## 3.4. Student registration

This file contains information about the time when the student registered for the module
presentation. For students who unregistered the date of unregistration is also recorded. File
contains five columns:
1. **code_module** – an identification code for a module.
2. **code_presentation** - the identification code of the presentation.
3. **id_student** – a unique identification number for the student.
4. **date_registration** – the date of student’s registration on the module presentation, this
is the number of days measured relative to the start of the module-presentation (e.g.
the negative value -30 means that the student registered to module presentation 30
days before it started).
5. **date_unregistration** – date of student un-registration from the module presentation,
this is the number of days measured relative to the start of the module-presentation.
Students, who completed the course have this field empty. Students who unregistered
have Withdrawal as the value of the final_result column in the studentInfo.csv file.

In [None]:
reg.head()

### 3.4.1. Missing values and duplicate rows

In [None]:
# Percentage of missing values
reg.isnull().sum() * 100 / len(reg)

In [None]:
reg.nunique()

In [None]:
reg[reg.duplicated()]

### 3.4.2. Data types

In [None]:
reg.info()

In [None]:
reg['id_student'] = reg['id_student'].astype(object)

### 3.4.3. Check if in Results table

Check if all student IDs recorded in the Registration table are recorded in the Results table.

In [None]:
compareCols(reg, results)

There are 5847 students missing from the Results table. Are there any students from the Student Information table missing from the Results table?

In [None]:
compareCols(info, results)

Yes, there are also 5847 students recorded in the Students Information table missing from the Assessment Results table. Are these the same students?

In [None]:
# Pull out all unique student IDs
df1_IDs = reg['id_student'].unique()
df2_IDs = info['id_student'].unique()

# Compare the two lists
# (a) Find what student IDs are different
diff = set(df1_IDs).difference(set(df2_IDs))
# (b) Count how many are different
numberDiff = len(diff)

numberDiff

In [None]:
compareCols(reg, info)

Yes, these are the same students. Let's have a closer look.

In [None]:
info_not_in_results = printDiffValues(info, results, 'id_student')
info_not_in_results.head(10)

In [None]:
# What are their final results?
column = info_not_in_results['final_result']

unique, counts = np.unique(column, return_counts = True)

dict(zip(unique, counts))

The brief stated that assessments missing from the Results table are missing because the student did not submit them. However, here we have 2 students with no submissions recorded who have passed their modules. This may be due to one of two reasons:
* The recorded pass is a clerical error.
* The brief is wrong.

In [None]:
reg_not_in_results = printDiffValues(reg, results, 'id_student')
reg_not_in_results.head(10)

In [None]:
# How many of them have an unregistration date?
reg_not_in_results['date_unregistration'].notnull().sum()

Here, again, we see an inconsistency as all withdrawn students should have their date_unregistration field filled in. According to the Student Information table 4648 students have withdrawn, however, according to the Registration table 4594 students have unregistered. This leaves 54 withdrawn students without an unregistration date.

Let's check unregistration dates for 2 students with passes that have no recorded assessment results. If we find unregistration dates for these students we'll know it's a clerical error.

In [None]:
# Show rows with passes
info_not_in_results[info_not_in_results['final_result'] == 'Pass']

In [None]:
# Find their date unregistration
reg_not_in_results[reg_not_in_results['id_student'] == 1336190]

In [None]:
reg_not_in_results[reg_not_in_results['id_student'] == 1777834]

There are no unregistration dates for these 2 students, however, we know there are 54 withdrawn students that have no unregistration dates, so it's unclear how much we can trust this data.
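The 54 withdrawn students without an unregistration date were found by subtracting two counts. A sketch of how such rows could be listed directly is shown below, using made-up toy tables with the same column names as the Student Information and Registration tables.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the two tables (illustrative values only)
toy_info = pd.DataFrame({
    'code_module': ['AAA', 'AAA', 'BBB'],
    'code_presentation': ['2013J', '2013J', '2014B'],
    'id_student': [1, 2, 3],
    'final_result': ['Withdrawn', 'Withdrawn', 'Pass'],
})
toy_reg = pd.DataFrame({
    'code_module': ['AAA', 'AAA', 'BBB'],
    'code_presentation': ['2013J', '2013J', '2014B'],
    'id_student': [1, 2, 3],
    'date_unregistration': [12.0, np.nan, np.nan],
})

# Join on the full key so that each registration is matched to its own result
keys = ['code_module', 'code_presentation', 'id_student']
joined = toy_info.merge(toy_reg, on=keys, how='left')

# Withdrawn according to one table, but no unregistration date in the other
withdrawn_no_date = joined[
    (joined['final_result'] == 'Withdrawn') & joined['date_unregistration'].isnull()
]
print(withdrawn_no_date)
```

Listing the rows, rather than only counting them, makes it possible to inspect which modules and presentations are affected.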

<a id="vle"></a>
## 3.5. VLE resources

The csv file contains information about the available materials in the VLE. Typically, these are
html pages, pdf files, etc. Students have access to these materials online and their interactions
with the materials are recorded. The table comprises the following columns:
1. **id_site** – an identification number of the material.
2. **code_module** – an identification code for module.
3. **code_presentation** - the identification code of presentation.
4. **activity_type** – the role associated with the module material.
5. **week_from** – the week from which the material is planned to be used.
6. **week_to** – week until which the material is planned to be used.

In [None]:
materials.head()

### 3.5.1. Missing values and duplicate rows

In [None]:
# Percentage of missing values
materials.isnull().sum() * 100 / len(materials)

In [None]:
materials.nunique()

In [None]:
materials[materials.duplicated()]

### 3.5.2. Data types

In [None]:
materials.info()

In [None]:
materials['id_site'] = materials['id_site'].astype(object)

<a id="vle-int"></a>
## 3.6. VLE Interactions

The studentVle.csv file contains information about each student’s interactions with the
materials in the VLE. This file contains the following columns:
1. **code_module** – an identification code for a module.
2. **code_presentation** - the identification code of the module presentation.
3. **id_student** – a unique identification number for the student.
4. **id_site** - an identification number for the VLE material.
5. **date** – the date of student’s interaction with the material measured as the number of
days since the start of the module-presentation.
6. **sum_click** – the number of times a student interacts with the material in that day.

In [None]:
vle.head()

### 3.6.1. Missing values and duplicate rows

In [None]:
# Percentage of missing values
vle.isnull().sum() * 100 / len(vle)

In [None]:
vle.nunique()

In [None]:
vle[vle.duplicated()].head()

Duplication is entirely acceptable here as the system most likely records the clicks at different points on the same day, leading to duplicates.
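If the duplicates are indeed separate recordings from the same day, dropping them would lose clicks, whereas summing them keeps the total. The sketch below illustrates this on made-up data; it is an assumption about how the rows should be read, not something confirmed by the data providers.

```python
import pandas as pd

# Toy stand-in for the VLE Interactions table (illustrative values only)
toy_vle = pd.DataFrame({
    'code_module': ['AAA'] * 4,
    'code_presentation': ['2013J'] * 4,
    'id_student': [1, 1, 1, 2],
    'id_site': [500, 500, 500, 500],
    'date': [-10, -10, -9, -10],
    'sum_click': [4, 4, 1, 2],
})

keys = ['code_module', 'code_presentation', 'id_student', 'id_site', 'date']

# Summing keeps every recorded click, including those in duplicated rows
daily_clicks = toy_vle.groupby(keys, as_index=False)['sum_click'].sum()

# Dropping duplicates would discard the second (1, 500, -10, 4) row
clicks_after_drop = int(toy_vle.drop_duplicates()['sum_click'].sum())

print(daily_clicks)
print('Total clicks kept by summing:', int(daily_clicks['sum_click'].sum()))
print('Total clicks after drop_duplicates:', clicks_after_drop)
```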

### 3.6.2. Data types

In [None]:
vle.info()

In [None]:
vle['id_student'] = vle['id_student'].astype(object)
vle['id_site'] = vle['id_site'].astype(object)

<a id="info"></a>
## 3.7. Student information

This file contains demographic information about the students together with their results. File
contains the following columns:
1. **code_module** – an identification code for a module on which the student is registered.
2. **code_presentation** - the identification code of the presentation during which the
student is registered on the module.
3. **id_student** – a unique identification number for the student.
4. **gender** – the student’s gender.
5. **region** – identifies the geographic region, where the student lived while taking the
module-presentation.
6. **highest_education** – highest student education level on entry to the module
presentation.
7. **imd_band** – specifies the Index of Multiple Deprivation band of the place where the
student lived during the module-presentation.
8. **age_band** – band of the student’s age.
9. **num_of_prev_attempts** – the number of times the student has attempted this module.
10. **studied_credits** – the total number of credits for the modules the student is currently
studying.
11. **disability** – indicates whether the student has declared a disability.
12. **final_result** – student’s final result in the module-presentation.

In [None]:
info.head()

### 3.7.1. Missing values and duplicate rows

In [None]:
info.isnull().sum() * 100 / len(info)

In [None]:
info.nunique()

In [None]:
info[info.duplicated()]

### 3.7.2. Data types

In [None]:
info.info()

In [None]:
info['id_student'] = info['id_student'].astype(object)

<a id="frame"></a>
# 4. Frame the problem
***

We can think of this work as a regression *and* a classification problem designed to predict student academic failure and student withdrawal from module presentations.

Considering the incompleteness of data, the above is tricky.

The scores in the Assessment Results table are not complete: most modules are missing their final exam results for all students. This means that using the scores as a response variable for regression can lead to less robust results, because a student can pass their assignments but fail the final exam, resulting in an overall fail for the module.

Another point is that the score largely determines the final result (in the Student Information table), so predicting the likelihood of someone failing while knowing that they got less than 40% as their final mark is not a prediction at all. It would also be interesting to see whether it is possible to identify students at risk of withdrawing or failing without knowing anything about their actual academic performance.

All of these points considered, this is the plan:

1. **Classification problem**: merge all tables apart from Assessment Results and use the final result column from Student Information table as target.
2. **Regression problem**: merge all tables, deleting the final result column from Student Information and using scores as target.

We can then see which method gives the best predictions.
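For the classification problem, the final result has to be turned into a target. A minimal sketch of one possible binary encoding is shown below. The label values and the choice to group 'Distinction' with 'Pass' are assumptions for illustration, and `at_risk` is a hypothetical column name.

```python
import pandas as pd

# Toy stand-in for the final_result column (label values are assumed)
toy_info = pd.DataFrame({'final_result': ['Pass', 'Fail', 'Withdrawn', 'Distinction']})

# Students who fail or withdraw form the positive ("at risk") class
at_risk_labels = ['Fail', 'Withdrawn']
toy_info['at_risk'] = toy_info['final_result'].isin(at_risk_labels).astype(int)

print(toy_info)
```

Keeping all categories instead would turn this into a multi-class problem.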

<a id="merge"></a>
# 5. Merge tables and engineer features
***

## 5.1. VLE + VLE Interactions

In [None]:
compareCols(materials, vle)

There are 96 entries in id_site in Materials table that are not in the VLE table.

In [None]:
findDiffValues(materials, vle, 'id_site')

In [None]:
printDiffValues(materials, vle, 'id_site')

This probably means that these resources were not used by any student or that their activity was not recorded. As such, we can merge the two tables with an inner join, since resources with no activity for any student provide no information. The week_from and week_to columns can be dropped as they are over 82% empty. The date column is dropped too, as it won't provide any extra information after grouping by module presentation per student.

In [None]:
# Merge with an inner join
VLEmaterials = pd.merge(vle, materials, on=['code_module', 'code_presentation', 'id_site'], how='inner')
# Drop columns
VLEmaterials.drop(columns=['week_from', 'week_to', 'date'], inplace=True)

VLEmaterials.head()
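To illustrate why the inner join is safe here, the sketch below uses made-up tables in which one resource has no recorded activity. The inner join keeps every interaction row and only drops the unused resource.

```python
import pandas as pd

# Toy stand-ins for the Materials and VLE Interactions tables (illustrative values only)
toy_materials = pd.DataFrame({
    'id_site': [500, 501, 502],
    'activity_type': ['resource', 'forumng', 'url'],
})
toy_vle = pd.DataFrame({
    'id_student': [1, 1, 2],
    'id_site': [500, 501, 500],
    'sum_click': [3, 1, 2],
})

inner = toy_vle.merge(toy_materials, on='id_site', how='inner')

# Resources that never appear in the interactions vanish from the merged table
unused_sites = sorted(set(toy_materials['id_site'].tolist()) - set(inner['id_site'].tolist()))

print(inner)
print('Unused resources:', unused_sites)
```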

### 5.1.1. Pre-processing

Get total clicks per student per module presentation.

In [None]:
VLEmaterials\
.groupby(['code_module', 'code_presentation', 'id_student'])\
.agg(total_click = ("sum_click",sum))

In [None]:
total_click_per_student = VLEmaterials\
.groupby(['code_module', 'code_presentation', 'id_student'])\
.agg(total_click = ("sum_click",sum))\
.reset_index()

total_click_per_student.head(7)

## 5.2. Registration info + Courses + Info

Date_registration may turn out to be a predictor of future failure or withdrawal. Early registration may indicate keen interest and future success or, conversely, it may mean that students lose interest in the module by the time it starts and are more likely to withdraw.

In [None]:
# Check that all module presentations in
# Registration table are present in Courses table
compareCols(reg, courses)

All module presentations from Registration table are present in the Courses table.

Course length may well be a good predictor of withdrawal simply due to the fact that longer courses will have more time for students to decide to drop out.

In [None]:
# Have a look at all unique module lengths
courses['module_presentation_length'].unique()

The lengths of modules are not drastically different, but it might make an impact.

In [None]:
# Merge with an inner join
regCourses = pd.merge(reg, courses, on=['code_module', 'code_presentation'], how='inner')

regCourses.head()

In [None]:
# Merge with an inner join
regCoursesInfo = pd.merge(regCourses, info, on=['code_module', 'code_presentation', 'id_student'], how='inner')

regCoursesInfo.head()

## 5.3. Assessment info + Assessment Results

The Assessment information table will provide just that: information on the weights of assessment scores.

In [None]:
# merge with an inner join
assResults = pd.merge(ass, results, on=['id_assessment'], how='inner')
# Rearrange column names
assResults = assResults[['id_student', 'code_module', 'code_presentation', 'id_assessment', 'assessment_type', 'date', 'date_submitted', 'weight', 'is_banked', 'score']]

assResults.head()

In [None]:
assResults.isnull().sum()

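Note that there are null values. They most likely come from the date column of the Assessments table, where some final exam dates are not recorded.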

### 5.3.1. Feature engineering

#### 5.3.1.1. Weighted score

**How it will be calculated:**

To calculate the total weight of each module, we need to remember that most final exams are missing from the Results table.

1. Multiply the weight of each assignment by its score.
2. Sum weight\*score per student per module presentation.
3. Calculate the total recorded weight of each module presentation (recorded total is key here, as most modules are missing their final exam).
4. Calculate the weighted score: divide the summed weight\*score by the total recorded weight of the module presentation.
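The four steps can be traced on a small worked example. The numbers below are made up: two students, two assignments with weights 40 and 60, and a total recorded weight of 100.

```python
import pandas as pd

# Toy merged assessments/results table (illustrative values only)
toy = pd.DataFrame({
    'id_student': [1, 1, 2, 2],
    'code_module': ['AAA'] * 4,
    'code_presentation': ['2013J'] * 4,
    'weight': [40.0, 60.0, 40.0, 60.0],
    'score': [80.0, 50.0, 100.0, 0.0],
})
total_recorded_weight = 100.0  # assumed: coursework only, exam not recorded

# Step 1: weight * score per assessment
toy['weight*score'] = toy['weight'] * toy['score']

# Steps 2-4: sum per student per module presentation, then divide by the recorded weight
weighted = (
    toy.groupby(['id_student', 'code_module', 'code_presentation'])['weight*score']
    .sum()
    .div(total_recorded_weight)
    .reset_index(name='weighted_score')
)
print(weighted)
```

Student 1 gets (40\*80 + 60\*50) / 100 = 62 and student 2 gets (40\*100 + 60\*0) / 100 = 40, so the result stays on the 0 to 100 scale.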

In [None]:
# Make a copy of dataset (plain assignment would only create an alias)
scores = assResults.copy()

# Count how many exams there are in Results for every module presentation
scores[scores['assessment_type'] == 'Exam'][['code_module', 'code_presentation', 'id_assessment']]\
.groupby(['code_module', 'code_presentation'])\
.nunique()

* **CCC module** only has results for 1 exam when the module should have 2 exams in total.
* **DDD module** has results for the final exam (DDD module should have one exam in total).

In [None]:
### Make helper columns ###
# (a) Add column multiplying weight and score
scores['weight*score'] = scores['weight']*scores['score']
# (b) Aggregate recorded weight*score per student
# per module presentation
sum_scores = scores\
.groupby(['id_student', 'code_module', 'code_presentation'])\
.agg(weightByScore = ('weight*score', sum))\
.reset_index()
# (c) Calculate total recorded weight of module
# (c.i) Get total weight of modules
total_weight = ass\
.groupby(['code_module', 'code_presentation'])\
.agg(total_weight = ('weight', sum))\
.reset_index()
# (c.ii) Subtract 100 to account for missing exams
total_weight['total_weight'] = total_weight['total_weight']-100
# (c.iii) Set the total weight of module DDD to 200 (its exam results are recorded)
total_weight.loc[(total_weight.code_module == 'DDD'), 'total_weight'] = 200

### Calculate weighted score ###
# (a) Merge sum_scores and total_weight tables
score_weights = pd.merge(sum_scores, total_weight, on=['code_module', 'code_presentation'], how='inner')
# (b) Calculate weighted score
score_weights['weighted_score'] = score_weights['weightByScore'] / score_weights['total_weight']
# (c) Drop helper columns
score_weights.drop(columns=['weightByScore', 'total_weight'], inplace=True)

In [None]:
score_weights.head()

One thing to note is that the is_banked column is dropped along with date_submitted and assessment_type. We can add these as features after building a basic model to see if they improve it.

#### 5.3.1.2. Late submission

Calculate the rate of late submission for the assignments that the student did submit.

**How it will be calculated**

1. Calculate the difference between the deadline and the actual submission date.
2. Make a new column - if the difference between dates is more than 0, the submission was late.
3. Aggregate by student ID, module, and module presentation.
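
The steps above can be sketched on a toy table first. The rows below are made up; only the column names `date` (deadline) and `date_submitted` follow the notebook. The sketch uses the mean of a boolean column as a shortcut for "late submissions divided by all submissions".

```python
import pandas as pd

# Toy data: days are counted from the start of the module (values are made up)
toy = pd.DataFrame({
    'id_student': [1, 1, 1, 2, 2],
    'date': [19, 54, 117, 19, 54],
    'date_submitted': [18, 60, 117, 25, 70],
})

# Steps 1 and 2: a positive difference means the work came in after the deadline
toy['late_submission'] = (toy['date_submitted'] - toy['date']) > 0

# Step 3: the mean of a boolean column is the share of True values
late_rate = toy.groupby('id_student')['late_submission'].mean()
print(late_rate)
```

Student 1 is late once out of three submissions and student 2 is late both times.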

In [None]:
# Calculate the difference between the submission dates
lateSubmission = assResults.assign(submission_days=assResults['date_submitted']-assResults['date'])
# Make a column indicating if the submission was late or not 
lateSubmission = lateSubmission.assign(late_submission=lateSubmission['submission_days'] > 0)

lateSubmission.head()

Null scores will not be marked as a fail. This is acceptable as most submissions are not fails, so it makes sense to treat them as passes by default.

Can exams be late submissions?

In [None]:
lateSubmission[(lateSubmission['assessment_type'] == 'Exam') & (lateSubmission['late_submission'] == True)]

Yes, exams can be submitted late.

In [None]:
# Aggregate per student per module presentation
total_late_per_student = lateSubmission\
.groupby(['id_student', 'code_module', 'code_presentation'])\
.agg(total_late_submission = ('late_submission', 'sum'))\
.reset_index()

total_late_per_student.head()

In [None]:
# Make a df with total number of all assessments per student per module presentation
total_count_assessments = lateSubmission[['id_student', 'code_module', 'code_presentation', 'id_assessment']]\
.groupby(['id_student', 'code_module', 'code_presentation'])\
.size()\
.reset_index(name='total_assessments')

total_count_assessments.head()

In [None]:
# Merge df with total late assessements and total count assessments
late_rate_per_student = pd.merge(total_late_per_student, total_count_assessments, on=['id_student', 'code_module', 'code_presentation'], how='inner')
# Make a new column with late submission rate
late_rate_per_student['late_rate'] = late_rate_per_student['total_late_submission'] / late_rate_per_student['total_assessments']
# Drop helper columns
late_rate_per_student.drop(columns=['total_late_submission', 'total_assessments'], inplace=True)

late_rate_per_student

#### 5.3.1.3. Fail rate

Do the same as above to calculate the fail rate.

In [None]:
# Mark failed assignments (score below 40)
passRate = assResults
passRate = passRate.assign(fail=passRate['score'] < 40)

passRate.head()

In [None]:
# Aggregate per student per module presentation
total_fails_per_student = passRate\
.groupby(['id_student', 'code_module', 'code_presentation'])\
.agg(total_fails = ("fail", "sum"))\
.reset_index()

total_fails_per_student.head()

In [None]:
# Merge df with total fails and total count assessments
fail_rate_per_student = pd.merge(total_fails_per_student, total_count_assessments, on=['id_student', 'code_module', 'code_presentation'], how='inner')
# Make a new column with fail rate
fail_rate_per_student['fail_rate'] = fail_rate_per_student['total_fails'] / fail_rate_per_student['total_assessments']
# Drop helper columns
fail_rate_per_student.drop(columns=['total_fails', 'total_assessments'], inplace=True)

fail_rate_per_student

### 5.3.2. Merge all result tables

In [None]:
assessments = pd.merge(score_weights, late_rate_per_student, on=['id_student', 'code_module', 'code_presentation'], how='inner')
assessments = pd.merge(assessments, fail_rate_per_student, on=['id_student', 'code_module', 'code_presentation'], how='inner')

assessments.head()

## 5.4 Merge all tables

 The dataframes created previously:
 
 1. VLE + VLE materials = total_click_per_student
 2. Registration Info + Courses + Student Info = regCoursesInfo
 3. Assessments + Results = assessments

In [None]:
merged = pd.merge(regCoursesInfo, total_click_per_student, on=['id_student', 'code_module', 'code_presentation'], how='left')

merged.head()

In [None]:
merged = pd.merge(merged, assessments, on=['id_student', 'code_module', 'code_presentation'], how='left')

merged.head()

<a id="split"></a>
# 6. Split the dataset
***

It's important to split the dataset before doing serious exploratory analysis as we do not want to peek at the testing data. Any pre-processing and further feature engineering will also be done to the test set with the same parameters as are set for the training set. We'll stratify by code module to make sure that each module is represented in the same proportion in both the test and the training sets.
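
To illustrate what "the same parameters" means in practice, the sketch below learns a fill value from a toy training column and reuses it for the test column. The data and the name `fill_value` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy train/test columns with missing values (made-up numbers)
toy_train = pd.Series([-60.0, -57.0, None, -20.0, -57.0])
toy_test = pd.Series([None, -30.0, -10.0])

# Learn the parameter on the training data only
fill_value = toy_train.median()

# Apply that same parameter to both sets
toy_train_filled = toy_train.fillna(fill_value)
toy_test_filled = toy_test.fillna(fill_value)

print(fill_value)  # -57.0
```

The toy test column's own median would be -20.0, so computing the statistic separately on the test set would give a different fill value.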

In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(merged, test_size=0.2, random_state=42, stratify=merged['code_module'])

<a id="final-cleaning"></a>
# 7. Final cleaning
***

## Missing values

In [None]:
train.isnull().sum()

There are a few missing values we need to impute.

### IMD band

In [None]:
train\
[train['imd_band'].isnull()]\
.head()

How to fill out missing values here - fill out according to the most frequent band for that region:
1. Find which regions have null imd_band values
2. Find what band is the most frequent one for that region
3. Replace null values with most frequent values for that region.
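
As an alternative way to picture these three steps, here is a minimal sketch on a toy table that builds one region-to-band lookup and fills the nulls from it. The rows and the name `most_frequent` are assumptions made for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'region': ['Wales', 'Wales', 'Wales', 'Scotland', 'Scotland', 'Scotland'],
    'imd_band': ['0-10%', '0-10%', None, '50-60%', None, '50-60%'],
})

# Steps 1 and 2: most frequent band per region (mode() ignores nulls)
most_frequent = toy.groupby('region')['imd_band'].agg(lambda s: s.mode().iloc[0])

# Step 3: fill each null with the value looked up for that row's region
toy['imd_band'] = toy['imd_band'].fillna(toy['region'].map(most_frequent))
print(toy)
```

A lookup like `most_frequent` could be computed once on the training set and then applied to the test set with the same `map` call.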

In [None]:
# Find what is the most frequent band in each region
regions_list = list(train\
                    [train['imd_band'].isnull()]['region']\
                    .unique())

for i in regions_list:
    result = train[train['region'] == i].imd_band.mode()
    print(f'{i} IMD band : \n', result)

In [None]:
# Replace all null values with respective most frequent imd_bands
regions_list = list(train\
                    [train['imd_band'].isnull()]['region']\
                    .unique())

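for i in regions_list:
    # mode() returns a Series, so take its first entry to get a single value
    region_mode = train[train['region'] == i].imd_band.mode()[0]
    train['imd_band'] = np.where( ( (train['imd_band'].isnull()) & (train['region'] == i) ),
                                           region_mode,
                                           train['imd_band']
                                    )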

### Date registration

In [None]:
# Make a new dataframe just with rows that have null values for the registration date
reg_date_nulls_in_reg = train\
[train['date_registration'].isnull()]

In [None]:
# What are their final results?
column = reg_date_nulls_in_reg['final_result']

unique, counts = np.unique(column, return_counts = True)

dict(zip(unique, counts))

For the withdrawn students, let's not put the registration date after the unregistration date. To fill these, let's add the median registration date to the unregistration date; because the median is negative, this places the registration before the unregistration.
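
A tiny worked example of this arithmetic, using the -57 quoted in the code comments as the assumed median and a hypothetical unregistration day:

```python
# Days are counted relative to the module start
median_registration = -57   # assumed median registration date
date_unregistration = -10   # hypothetical student who withdrew before the start

# Adding a negative median moves the estimate to before the unregistration date
estimated_registration = date_unregistration + median_registration
print(estimated_registration)  # -67
```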

In [None]:
# Get median registration date
train.date_registration.median()

In [None]:
# Replace NaN values with date_unreg minus the median (note, the median is negative)
train['date_registration'] = np.where( (train['date_registration'].isnull()),
                                           train['date_unregistration'] + train.date_registration.median(),
                                           train['date_registration']
                                    )
# Replace remaining NaNs with -57
train['date_registration'] = np.where( (train['date_registration'].isnull()),
                                           train.date_registration.median(),
                                           train['date_registration']
                                    )

### total_click

Those students who have null values for total_click are the students who did not have any records in the VLE Interactions table, meaning they did not interact with VLE. Therefore, we can replace them with 0s.

In [None]:
train['total_click'] = train['total_click'].fillna(0)

### weighted_score

The students who have null values for weighted_score have not submitted any assignments. We can replace the nan values with 0s.

In [None]:
train['weighted_score'] = train['weighted_score'].fillna(0)

### late_rate

Students who have nan values for their late submission rate have not submitted any of their assignments. We can replace the nan values with 1.00 (100% late rate).

In [None]:
train['late_rate'] = train['late_rate'].fillna(1.0)

### fail_rate

Students who have nan values for their fail rate have not submitted any assignments. Their fail rate is therefore 100% (1.0).

In [None]:
train['fail_rate'] = train['fail_rate'].fillna(1.0)

## Drop columns

In [None]:
# Make a copy of training and test datasets for classification
train_class = train.copy()
test_class = test.copy()

### For Regression

Date unregistration has been dropped as it carries the same information as Withdrawal in the final results column of the Student Information table. I'm assuming a prediction of who is likely to withdraw would only be useful for future withdrawals; predicting that a student has withdrawn when we already know they have is useless (date_unregistration is essentially another target).

For the regression problem, due to the continuous nature of the target variable, we can't distinguish between fails and withdrawals, so all withdrawals will be treated as fails (score < 40%).

In [None]:
# Drop unneeded columns
train.drop(columns=['id_student'], inplace=True)
train.drop(columns=['final_result'], inplace=True)
train.drop(columns=['date_unregistration'], inplace=True)

train.head()

### For Classification

In [None]:
# Drop unneeded columns
train_class.drop(columns=['id_student'], inplace=True)
train_class.drop(columns=['date_unregistration'], inplace=True)
# Drop columns on assessments
train_class.drop(columns=['weighted_score'], inplace=True)
train_class.drop(columns=['late_rate'], inplace=True)
train_class.drop(columns=['fail_rate'], inplace=True)


train_class.head()

## Cleaning test sets

### For Regression

In [None]:
'''IMD BAND'''
# Replace all null values with respective most frequent imd_bands
regions_list = list(test\
                    [test['imd_band'].isnull()]['region']\
                    .unique())

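for i in regions_list:
    # mode() returns a Series, so take its first entry to get a single value
    region_mode = test[test['region'] == i].imd_band.mode()[0]
    test['imd_band'] = np.where( ( (test['imd_band'].isnull()) & (test['region'] == i) ),
                                           region_mode,
                                           test['imd_band']
                                    )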

'''DATE REGISTRATION'''
# Get registration date median
reg_date_median = test.date_registration.median()


# Replace NaN values with date_unreg minus 57 days
test['date_registration'] = np.where( (test['date_registration'].isnull()),
                                           test['date_unregistration'] + reg_date_median,
                                           test['date_registration']
                                    )
# Replace remaining NaNs with -57
test['date_registration'] = np.where( (test['date_registration'].isnull()),
                                           reg_date_median,
                                           test['date_registration']
                                    )

'''Rest of null values'''
test['total_click'] = test['total_click'].fillna(0)
test['weighted_score'] = test['weighted_score'].fillna(0)
test['late_rate'] = test['late_rate'].fillna(1.0)
test['fail_rate'] = test['fail_rate'].fillna(1.0)

'''Drop unneeded columns'''
# Drop ID, final result, and date unregistration columns
test.drop(columns=['id_student'], inplace=True)
test.drop(columns=['final_result'], inplace=True)
test.drop(columns=['date_unregistration'], inplace=True)

### For Classification

In [None]:
'''IMD BAND'''
# Replace all null values with respective most frequent imd_bands
regions_list = list(test_class\
                    [test_class['imd_band'].isnull()]['region']\
                    .unique())

for i in regions_list:
    # mode() returns a Series, so take its first entry to get a single value
    region_mode = test_class[test_class['region'] == i].imd_band.mode()[0]
    test_class['imd_band'] = np.where( ( (test_class['imd_band'].isnull()) & (test_class['region'] == i) ),
                                           region_mode,
                                           test_class['imd_band']
                                    )

'''DATE REGISTRATION'''
# Get registration date median
reg_date_median = test_class.date_registration.median()


# Replace NaN values with date_unreg minus 57 days
test_class['date_registration'] = np.where( (test_class['date_registration'].isnull()),
                                           test_class['date_unregistration'] + reg_date_median,
                                           test_class['date_registration']
                                    )
# Replace remaining NaNs with -57
test_class['date_registration'] = np.where( (test_class['date_registration'].isnull()),
                                           reg_date_median,
                                           test_class['date_registration']
                                    )

'''Rest of null values'''
test_class['total_click'] = test_class['total_click'].fillna(0)

'''Drop unneeded columns'''
# Drop ID and date unregistration columns
test_class.drop(columns=['id_student'], inplace=True)
test_class.drop(columns=['date_unregistration'], inplace=True)
# Drop columns on assessments
test_class.drop(columns=['weighted_score'], inplace=True)
test_class.drop(columns=['late_rate'], inplace=True)
test_class.drop(columns=['fail_rate'], inplace=True)

<a id="num"></a>
# 8. Univariate analysis: numerical data
***

In [None]:
train.describe().transpose()

## Distribution plots

Code for the below cell is adapted from a [Kaggle notebook](https://www.kaggle.com/teertha/us-health-insurance-eda) by Anirban Datta.

In [None]:
# Create statistics summaries with skew, mean, and median
# Produce a dataframe with just numerical columns
df_num = train.select_dtypes(include=np.number)

for col in df_num.columns:

    skew = df_num[col].skew()
    mean = df_num[col].mean()
    median = df_num[col].median()
    
    print(f'\tSummary for {col.upper()}')
    print(f'Skewness of {col}\t: {skew}')
    print(f'Mean {col} :\t {mean}')
    print(f'Median {col} :\t {median} \n')

In [None]:
train.hist(bins=50, figsize=(20,15))
plt.show()

There are a lot of skewed variables in this dataset. This is something to keep in mind when using linear models as these assume normal distributions.

## Target variable - weighted score

In [None]:
(
    ggplot(train)
    + aes(x=0, y='weighted_score')
    + geom_boxplot(outlier_color='crimson')
    + ggtitle("Distribution of weighted scores")
    + coord_flip()
)

We can see the target variable has two peaks and is not normally distributed, but it doesn't have any outliers. We may wish to transform the target at some point to improve our models. This notebook shows very basic analysis though, so we will not be doing this, but it's something to keep in mind when using certain models (like the linear regression that assumes the distributions are normal).

## Correlation matrix

In [None]:
# Let's make a correlation heatmap
plt.figure(figsize=(6,4))
sns.heatmap(df_num.corr(), annot=True, cmap="coolwarm");

This shows us that there is little collinearity among the variables. Let's have a closer look at linear correlations between features and the target.

In [None]:
# Use the numerical columns only, as correlations are not defined for text columns
df_num\
.drop(columns=['weighted_score'])\
.corrwith(df_num['weighted_score']).plot.bar(
        figsize = (6, 4), title = "Correlation with Target", fontsize = 12,
        rot = 90, grid = True);

Let's look at the numbers for this: 

In [None]:
df_num.corrwith(df_num['weighted_score']).sort_values(ascending=False)

Immediately we can see weighted_score is most strongly positively correlated with total_click: the more students engaged with the VLE, the better results they got. There is also a weak negative correlation with the number of previous attempts. late_rate and fail_rate are also negatively correlated with weighted score, albeit weakly.

There's no correlation with module_presentation_length, date_registration, or studied_credits.
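
For reference, the correlation reported here is Pearson's r (the default in pandas). A minimal sketch of how it is calculated, using made-up click counts and scores:

```python
import numpy as np

# Made-up values for five students
clicks = np.array([100.0, 200.0, 300.0, 400.0, 500.0])
score = np.array([35.0, 50.0, 45.0, 70.0, 80.0])

# Pearson's r: covariance divided by the product of the standard deviations
r = np.cov(clicks, score)[0, 1] / (clicks.std(ddof=1) * score.std(ddof=1))
print(round(r, 2))  # 0.94
```

A value close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates no linear relationship (a non-linear one may still exist).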

<a id="cat"></a>
# 9. Univariate analysis: categorical data
***

In [None]:
# Produce a dataframe with just categorical columns
df_cat = train.select_dtypes(exclude=np.number)

df_cat.head()

In [None]:
# Set the plot number for the first subplot function
plot_number = 1

# Set sizes for all plots
plt.figure(figsize=(15, 15)) # create a figure object
plt.subplots_adjust(hspace = 0.5) # set the vertical spacing between subplots

for col in df_cat[['code_module', 'code_presentation', 'gender', 'region']]:
    
    # Call countplot on each column
    plt.subplot(4, 2, plot_number)
    sns.countplot(
        y=col,
        data=df_cat,
        order=df_cat[col].value_counts().index
    )
    plt.title(f'{col.capitalize()} Countplot')
    plt.xlabel('')
    plt.ylabel('')

    plot_number = plot_number + 1 # set a new plot number for the next subplot function
    
    # Add relative frequency labels:
    n_points = df_cat.shape[0]
    col_counts = df_cat[col].value_counts()
    locs, labels = plt.yticks()   # get the current tick locations and labels

    # loop through each pair of locations and labels
    for loc, label in zip(locs, labels):

        # get the text property for the label to get the correct count
        count = col_counts[label.get_text()]
        pct_string = '{:0.1f}%'.format(100*count/n_points)

        # print the annotation at the top of the bar
        plt.text(x=count, y=loc, s=pct_string, ha='left', va='center', color='k')
    
plt.tight_layout()

In [None]:
# Set the plot number for the first subplot function
plot_number = 1

# Set sizes for all plots
plt.figure(figsize=(15, 15)) # create a figure object
plt.subplots_adjust(hspace = 0.5) # set the vertical spacing between subplots

for col in df_cat[['highest_education', 'imd_band', 'age_band', 'disability']]:
    
    # Call countplot on each column
    plt.subplot(4, 2, plot_number)
    sns.countplot(
        y=col,
        data=df_cat,
        order=df_cat[col].value_counts().index
    )
    plt.title(f'{col.capitalize()} Countplot')
    plt.xlabel('')
    plt.ylabel('')

    plot_number = plot_number + 1 # set a new plot number for the next subplot function
    
    # Add relative frequency labels:
    n_points = df_cat.shape[0]
    col_counts = df_cat[col].value_counts()
    locs, labels = plt.yticks()   # get the current tick locations and labels

    # loop through each pair of locations and labels
    for loc, label in zip(locs, labels):

        # get the text property for the label to get the correct count
        count = col_counts[label.get_text()]
        pct_string = '{:0.1f}%'.format(100*count/n_points)

        # print the annotation at the top of the bar
        plt.text(x=count, y=loc, s=pct_string, ha='left', va='center', color='k')
    
plt.tight_layout()

* Very few students with no formal education (1%)
* Very few students with post-grad qualifications.

These two categories should be merged with 'Lower Than A Level' and 'HE Qualification', respectively, as with so little data these two categories are not likely to bring much insight.

## Change education categories

In [None]:
# Rename 'no formal quals' into 'lower than a level'
train['highest_education'] = np.where( (train['highest_education'] == 'No Formal quals'),
                                           'Lower Than A Level',
                                           train['highest_education']
                                    )

# Rename post-grads
train['highest_education'] = np.where( (train['highest_education'] == 'Post Graduate Qualification'),
                                           'HE Qualification',
                                           train['highest_education']
                                    )


# Do the same for the test set
test['highest_education'] = np.where( (test['highest_education'] == 'No Formal quals'),
                                           'Lower Than A Level',
                                           test['highest_education']
                                    )

test['highest_education'] = np.where( (test['highest_education'] == 'Post Graduate Qualification'),
                                           'HE Qualification',
                                           test['highest_education']
                                    )
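
The same renaming could also be written as a single mapping passed to `Series.replace`. This is only a sketch on a toy column; 'A Level or Equivalent' is an assumed label added for illustration, and `merge_map` is a hypothetical name.

```python
import pandas as pd

# Toy column using the category labels from the cell above
education = pd.Series([
    'No Formal quals', 'A Level or Equivalent',
    'Post Graduate Qualification', 'HE Qualification',
])

# One mapping from rare categories to the category they are merged into
merge_map = {
    'No Formal quals': 'Lower Than A Level',
    'Post Graduate Qualification': 'HE Qualification',
}

merged_education = education.replace(merge_map)
print(merged_education.tolist())
```

Keeping the mapping in one dictionary makes it easy to apply exactly the same merge to the training and test sets.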

## Change age categories

Same can be done for the age bands, merging 35-55 and 55+ groups into one. First, let's have a closer look at this variable.

In [None]:
# Have a closer look at the category
(
    ggplot(train)
    + aes(x='age_band', fill='age_band')
    + geom_bar()
    + geom_text(
     aes(label='stat(prop)*100', group=1),
     stat='count',
     nudge_y=0.125,
     va='bottom',
     format_string='{:.1f}%'
 )
    + theme(axis_text_x=element_text(rotation=45, hjust=1))
)

Now let's merge the least frequent categories.

In [None]:
# Replace 55+ and 35-55 groups with 35+
train['age_band'] = np.where( (train['age_band'] == '55<='),
                                           '35+',
                                           train['age_band']
                                    )

train['age_band'] = np.where( (train['age_band'] == '35-55'),
                                           '35+',
                                           train['age_band']
                                    )

# Do the same for the test set
test['age_band'] = np.where( (test['age_band'] == '55<='),
                                           '35+',
                                           test['age_band']
                                    )

test['age_band'] = np.where( (test['age_band'] == '35-55'),
                                           '35+',
                                           test['age_band']
                                    )

Let's see what our count plot for the age variable looks like after merging the categories.

In [None]:
# See the changes
(
    ggplot(train)
    + aes(x='age_band', fill='age_band')
    + geom_bar()
    + geom_text(
     aes(label='stat(prop)*100', group=1),
     stat='count',
     nudge_y=0.125,
     va='bottom',
     format_string='{:.1f}%'
 )
)

<a id="scores-vs-variables"></a>
# 10. Bivariate analysis: final scores vs other variables
***

## Numerical

In [None]:
sns.pairplot(train)

The results are vague. There don't seem to be any strong relationships between the variables and the target except, perhaps, the number of total clicks.

## Categorical

Let's make a helper column to indicate if the student failed or not so we can compare categorical variables for failed and passing students.

In [None]:
train = train.assign(fail_final=train['weighted_score'] < 40)

train.head()

### code_module

In [None]:
(
    ggplot(train)
    + aes(x='code_module', fill='fail_final')
    + geom_bar(position='fill')
    + ggtitle("Count frequency of different modules by pass rate")
)

In [None]:
(
    ggplot(train)
    + aes('code_module', 'weighted_score')
    + geom_boxplot(outlier_color='crimson')
    + ggtitle("Distribution of scores per module")
    + coord_flip()
)

It seems that some modules have higher fail rates than others. For example, for module CCC the pass rate is just a little over 50%. The boxplot also reveals some outliers.

### code_presentation

In [None]:
(
    ggplot(train)
    + aes(x='code_presentation', fill='fail_final')
    + geom_bar(position='fill')
)

In [None]:
(
    ggplot(train)
    + aes('code_presentation', 'weighted_score')
    + geom_boxplot(outlier_color='crimson')
    + ggtitle("Distribution of scores per presentation")
    + coord_flip()
)

Module presentation (semester) seems to have no effect on the pass/fail rate.

### gender

In [None]:
(
    ggplot(train)
    + aes(x='gender', fill='fail_final')
    + geom_bar(position='fill')
)

In [None]:
(
    ggplot(train)
    + aes('gender', 'weighted_score')
    + geom_boxplot(outlier_color='crimson')
    + ggtitle("Distribution of scores per sex")
    + coord_flip()
)

Again, there's not much difference between men and women passing or failing modules.

### region

In [None]:
(
    ggplot(train)
    + aes(x='region', fill='fail_final')
    + geom_bar(position='fill')
    + theme(axis_text_x=element_text(rotation=45, hjust=1))
)

In [None]:
(
    ggplot(train)
    + aes('region', 'weighted_score')
    + geom_boxplot(outlier_color='crimson')
    + ggtitle("Distribution of scores per region")
    + coord_flip()
)

There is very little difference in pass rates between regions.

### highest_education

In [None]:
(
    ggplot(train)
    + aes(x='highest_education', fill='fail_final')
    + geom_bar(position='fill')
    + theme(axis_text_x=element_text(rotation=45, hjust=1))
)

Students whose previous education is lower than A level seem to fail more; however, the difference is so slight it may not be statistically significant.

### imd_band

In [None]:
(
    ggplot(train)
    + aes(x='imd_band', fill='fail_final')
    + geom_bar(position='fill')
    + theme(axis_text_x=element_text(rotation=45, hjust=1))
)

There is very little difference in pass rates amongst students in different deprivation bands, but there does seem to be a trend - the more deprived the area is, the higher the fail rate. 0-10% IMD means the student lives in an area that falls amongst the top 0-10% most deprived small areas (the lower the percentage, the more deprived the area).

### age_band

In [None]:
(
    ggplot(train)
    + aes(x='age_band', fill='fail_final')
    + geom_bar(position='fill')
)

Older people seem to do better academically; however, the difference is fairly small.

### disability

In [None]:
(
    ggplot(train)
    + aes(x='disability', fill='fail_final')
    + geom_bar(position='fill')
)

As expected, students with disabilities do worse academically, as they are likely to face more challenges due to ill health.

<a id="regression"></a>
# 11. Regression
***

## 11.1 Model preparation

In [None]:
# Separate features from target

'''Training set'''
# Drop target and helper columns
X_train = train.drop(columns=['fail_final', 'weighted_score'])
# Make an array with target
Y_train = train['weighted_score'].copy()

'''Test set'''
# Drop target column
X_test = test.drop(columns=['weighted_score'])
# Make an array with target
Y_test = test['weighted_score'].copy()

X_train.head()

In [None]:
'''Make a copy for the subsequent last evaluation'''
X_train_eval = X_train.copy()
Y_train_eval = Y_train.copy()
X_test_eval = X_test.copy()
Y_test_eval = Y_test.copy()

In [None]:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MinMaxScaler, RobustScaler
from sklearn.compose import make_column_transformer

# Set encoding and scaling instructions
column_transform = make_column_transformer(
    (OneHotEncoder(), ['code_module', 'code_presentation', 'gender', 'region', 'age_band', 'disability']),
    (OrdinalEncoder(), ['highest_education', 'imd_band']),
    (RobustScaler(), ['date_registration', 'module_presentation_length',
                       'num_of_prev_attempts', 'studied_credits', 'total_click', 'late_rate',
                       'fail_rate'])
)

# Apply column transformer to features
X_encoded = column_transform.fit_transform(X_train)

RobustScaler is used to make the models more robust to outliers. More specifically, RobustScaler centres each feature on its median and scales it according to the interquartile range.
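
To make the scaling concrete, here is a minimal NumPy sketch of the calculation RobustScaler performs with its default settings (subtract the median, divide by the range between the 25th and 75th percentiles). The feature values are made up.

```python
import numpy as np

# Toy feature with one extreme value (made-up numbers)
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

median = np.median(x)                 # 3.0
q1, q3 = np.percentile(x, [25, 75])   # 2.0 and 4.0
iqr = q3 - q1                         # 2.0

x_scaled = (x - median) / iqr
print(x_scaled)  # values: -1, -0.5, 0, 0.5, 48.5
```

The outlier is still large after scaling, but it does not squash the other values together the way scaling by the minimum and maximum would.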

In [None]:
# Have a look at what the scaled and encoded data looks like
pd.DataFrame(X_encoded).head()

## 11.2. Models:

### Linear Regression

Linear regression is most likely going to be a bad model as the data breaks several of the assumptions of this model.

**The assumptions are as follows**:
* Linear relationship between the target and features.
    * The pair plots show this isn't the case.
* Multivariate normality - all variables must be normal.
    * The histograms of the numerical variables show that their distributions aren't normal.
* Little to no multicollinearity - all variables must be independent from each other.
    * The correlation matrix shows that this isn't so.
* No auto-correlation - the residual for one observation must be independent of the residuals for the others.
* Homoscedasticity - the variance of the residuals must be constant along the regression line.

Linear regression can show us an example of a bad model, showing how other models can vastly improve the predictions.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Setting up the pipeline
lm = LinearRegression()

lm_pipeline = make_pipeline(column_transform, lm)

# Fit the training data
lm_pipeline.fit(X_train, Y_train)
# Predict the training data
lm_train_predictions = lm_pipeline.predict(X_train)

In [None]:
# Now let's evaluate the model
import sklearn.metrics as metrics

def regression_eval(X, y, predictions):
    MSE = metrics.mean_squared_error(y, predictions)
    RMSE = np.sqrt(MSE)
    R2 = metrics.r2_score(y, predictions)
    adj_R2 = 1 - ( (1-R2)*(len(y)-1)/(len(y)-X.shape[1]-1) )

    print("-----------------------")
    print('RMSE is {}'.format(RMSE))
    print('Adjusted R2 score is {}\n'.format(adj_R2))

### For training set ###
print("Model performance for training set:")
regression_eval(X_train, Y_train, lm_train_predictions)

The above evaluation uses the training portion of our randomised train/test split and calculates RMSE and adjusted R2. Adjusted R2 shows the model explains 35% of the total variance in the sample. An RMSE of 23.8 shows us the typical error - predictions are off by about 23.8 points. This is a large error considering the pass mark is only 40%.
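
To show where these two numbers come from, here is a small worked example with made-up targets and predictions. `n_features` is a hypothetical number of predictors chosen for the illustration.

```python
import numpy as np

# Made-up targets and predictions for five students
y_true = np.array([40.0, 55.0, 70.0, 85.0, 100.0])
y_pred = np.array([50.0, 50.0, 75.0, 80.0, 90.0])
n_features = 2   # hypothetical number of predictors

# RMSE: square root of the mean squared error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R2: share of the variance in y explained by the predictions
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Adjusted R2 penalises R2 for the number of predictors used
n = len(y_true)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

print(round(rmse, 2), round(r2, 3), round(adj_r2, 3))  # 7.42 0.878 0.756
```

Adjusted R2 is always at or below R2, and the gap grows as more predictors are added relative to the number of observations.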

In [None]:
# Perform cross-validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(lm, column_transform.fit_transform(X_train), Y_train, cv=10, scoring='neg_mean_squared_error')
lm_rmse_scores = np.sqrt(-scores)

def display_scores(scores):
    print('Scores\t:', scores)
    print('Mean\t:', scores.mean())
    print('SD\t:', scores.std())
    
display_scores(lm_rmse_scores)

Cross-validation here uses k-fold cross-validation, which is different from validating on a single randomised split. The training set is divided into k folds (10 in this case); the model is trained on all folds but one and validated on the remaining fold, and this is repeated until every fold has served as the validation set once. Cross-validation also shows us the model performs abysmally, just as expected.
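
The mechanics of k-fold cross-validation can be sketched by hand with NumPy. This is an illustration on synthetic data with a plain least-squares fit, not a reproduction of what `cross_val_score` does internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y depends linearly on x plus noise (illustrative only)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

k = 5
folds = np.array_split(np.arange(len(y)), k)   # k disjoint blocks of row indices

rmse_scores = []
for i in range(k):
    val_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])

    # Fit ordinary least squares (with intercept) on the training folds only
    A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
    coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)

    # Score on the held-out fold
    A_val = np.column_stack([np.ones(len(val_idx)), X[val_idx]])
    errors = y[val_idx] - A_val @ coef
    rmse_scores.append(np.sqrt(np.mean(errors ** 2)))

rmse_scores = np.array(rmse_scores)
print('Mean:', rmse_scores.mean(), 'SD:', rmse_scores.std())
```

Any pre-processing that learns from the data (such as scaling) is ideally fitted inside each training fold as well; in scikit-learn this can be achieved by passing a whole pipeline, rather than a pre-transformed matrix, to `cross_val_score`.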

### LASSO Regression

LASSO (least absolute shrinkage and selection operator) regression is a modification of linear regression. In very simple terms, this algorithm adds a penalty on the size of the coefficients, which can shrink the coefficients of less useful features all the way to zero and so drop those features from the model.

The assumptions of this model are the same as for the linear model, except normality is not assumed.
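
To give an intuition for how features get dropped: in the special case of uncorrelated, standardised features, the LASSO solution amounts to "soft-thresholding" the ordinary least-squares coefficients. The sketch below shows only that simplified case with hypothetical coefficients; it is not the full solver.

```python
import numpy as np

def soft_threshold(coef, alpha):
    """Shrink each coefficient towards zero by alpha; anything smaller than alpha becomes 0."""
    return np.sign(coef) * np.maximum(np.abs(coef) - alpha, 0.0)

# Hypothetical least-squares coefficients for four features
ols_coefs = np.array([2.5, -0.3, 0.05, -1.2])

lasso_like = soft_threshold(ols_coefs, alpha=0.5)
print(lasso_like)  # the two small coefficients become zero
```

With a very small penalty, such as the `alpha` of 0.0005 used here, hardly any coefficient is pushed to zero, so the result stays close to plain linear regression.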

In [None]:
from sklearn.linear_model import Lasso

lasso = Lasso(alpha =0.0005, random_state=1)

lasso_pipeline = make_pipeline(column_transform, lasso)

lasso_pipeline.fit(X_train, Y_train)

X_lasso_predictions = lasso_pipeline.predict(X_train)

### Evaluating the model ###
print("Model performance for training set:")
regression_eval(X_train, Y_train, X_lasso_predictions)

In [None]:
# Perform cross-validation
scores = cross_val_score(lasso, column_transform.fit_transform(X_train), Y_train, cv=10, scoring='neg_mean_squared_error')

lasso_rmse_scores = np.sqrt(-scores)
    
display_scores(lasso_rmse_scores)

This model was not in any way an improvement.

### Support Vector Regression

Linear models are doing badly, as expected. We will next use non-linear models and see if our predictions improve.

Support Vector Regression seeks not to minimise the squared error as in linear regression, but to keep the coefficients small while tolerating prediction errors that fall within a set margin (epsilon).
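
The way SVR treats errors can be illustrated with the epsilon-insensitive loss. This is a sketch of the loss function only, with made-up numbers; 0.1 is used because it is scikit-learn's default `epsilon` for `SVR`.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Errors inside the epsilon tube cost nothing; larger errors cost their excess."""
    return np.maximum(np.abs(y_true - y_pred) - epsilon, 0.0)

y_true = np.array([50.0, 50.0, 50.0])
y_pred = np.array([50.05, 49.0, 53.0])

loss = epsilon_insensitive_loss(y_true, y_pred)
print(loss)  # approximately 0, 0.9 and 2.9
```

Since the target here ranges from 0 to 100, a tube of 0.1 is very narrow, which is one of the settings that could be tuned.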

In [None]:
from sklearn.svm import SVR

SVR = SVR(kernel='rbf')

SVR_pipeline = make_pipeline(column_transform, SVR)

SVR_pipeline.fit(X_train, Y_train)

train_SVR_predictions = SVR_pipeline.predict(X_train)

### Evaluating the model ###
print("Model performance for training set:")
regression_eval(X_train, Y_train, train_SVR_predictions)

In [None]:
scores = cross_val_score(SVR, column_transform.fit_transform(X_train), Y_train, cv=4, scoring='neg_mean_squared_error')

SVR_rmse_scores = np.sqrt(-scores)
    
display_scores(SVR_rmse_scores)

The model is an improvement on the linear models. Adjusted R2 is a bit higher (39%) and the standard deviation of the RMSE scores across the cross-validation folds is lower (from SD = 0.31 for LASSO to SD = 0.09). It is still not a great predictor for the dataset.

### Decision Tree

This model uses a very different approach: it builds what is essentially a flow chart of yes/no questions about the feature values, with a prediction at the end of each branch. It is a simple algorithm that many other models use as a building block (e.g. Random Forest).
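
Each question in that flow chart is a threshold on one feature. A minimal sketch of how a regression tree could pick a single threshold is shown below: it tries every midpoint between sorted values and keeps the one that leaves the smallest squared error around each side's mean. `best_split` is a hypothetical helper and the numbers are made up; a real tree repeats this search over all features and then recurses into each side.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, sse) for the split on x that minimises the
    summed squared error around the mean of each side."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold fits between two equal values
        left, right = y_sorted[:i], y_sorted[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_threshold = (x_sorted[i] + x_sorted[i - 1]) / 2
            best_sse = sse
    return best_threshold, best_sse

clicks = np.array([10, 20, 30, 400, 500, 600])            # made-up feature values
score = np.array([40.0, 42.0, 41.0, 75.0, 78.0, 80.0])    # made-up targets

threshold, sse = best_split(clicks, score)
print("Split at clicks <=", threshold, "with SSE", round(sse, 2))
```

Here the chosen threshold separates the low-click rows from the high-click rows, and each side would then predict its own mean score.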

In [None]:
from sklearn.tree import DecisionTreeRegressor

Dtree = DecisionTreeRegressor(min_samples_leaf=15, min_samples_split=10, max_features=13)

Dtree_pipeline = make_pipeline(column_transform, Dtree)

Dtree_pipeline.fit(X_train, Y_train)

train_Dtree_predictions = Dtree_pipeline.predict(X_train)

### Evaluating the model ###
print("Model performance for training set:")
regression_eval(X_train, Y_train, train_Dtree_predictions)

In [None]:
scores = cross_val_score(Dtree, column_transform.fit_transform(X_train), Y_train, cv=5, scoring='neg_mean_squared_error')

Dtree_rmse_scores = np.sqrt(-scores)
    
display_scores(Dtree_rmse_scores)

The results are much better than the linear models or SVR. Adjusted R2 is 66% and cross-validation shows an RMSE of 19. Interestingly, even though the model explains significantly more variance, RMSE does not improve drastically.
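
The modest change in RMSE is what the arithmetic predicts. For plain R2 measured on the same data, RMSE = sd(y) * sqrt(1 - R2), so RMSE shrinks with the square root of the unexplained variance. The sketch below plugs in the R2 values quoted in the text; they are adjusted R2 values, so treat the result as an approximation.

```python
import math

def rmse_ratio(r2_old, r2_new):
    """RMSE_new / RMSE_old on the same data, using RMSE = sd(y) * sqrt(1 - R2)."""
    return math.sqrt(1 - r2_new) / math.sqrt(1 - r2_old)

# R2 values quoted in the text for SVR (0.39) and the Decision Tree (0.66)
ratio = rmse_ratio(0.39, 0.66)
print(f"Expected RMSE ratio: {ratio:.3f}")
```

A ratio of roughly 0.75 means that raising R2 from 0.39 to 0.66 is expected to cut RMSE by only about a quarter.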

### GradientBoost

This is a predictive model using an ensemble of weak predictive models (decision trees). We expect it will perform better than a simple Decision Tree.

The model is trained with Huber loss, which makes it more robust to outliers.
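
To see why Huber loss is more robust, compare it with squared error on a few residuals. This is a minimal sketch with a fixed, hypothetical threshold `delta=1.0`; scikit-learn instead derives the threshold from a quantile of the residuals, controlled by the `alpha` parameter.

```python
import numpy as np

def huber_loss(residual, delta=1.0):
    """Quadratic for |residual| <= delta, linear beyond it."""
    abs_r = np.abs(residual)
    quadratic = 0.5 * residual ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return np.where(abs_r <= delta, quadratic, linear)

residuals = np.array([0.5, 1.0, 10.0])   # the last one plays the outlier
huber = huber_loss(residuals)
squared = 0.5 * residuals ** 2

print("Huber  :", huber)
print("Squared:", squared)
```

The two losses agree for small residuals, but the outlier contributes far less under Huber loss, so it pulls the fit around less.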

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

GBoost = GradientBoostingRegressor(n_estimators=400, learning_rate=0.05,
                                   max_depth=4, max_features=13,
                                   min_samples_leaf=15, min_samples_split=10, 
                                   loss='huber', random_state =5)

GBoost_pipeline = make_pipeline(column_transform, GBoost)

GBoost_pipeline.fit(X_train, Y_train)

train_GBoost_predictions = GBoost_pipeline.predict(X_train)

### Evaluating the model ###
print("Model performance for training set:")
regression_eval(X_train, Y_train, train_GBoost_predictions)

In [None]:
scores = cross_val_score(GBoost, column_transform.fit_transform(X_train), Y_train, cv=4, scoring='neg_mean_squared_error')

GBoost_rmse_scores = np.sqrt(-scores)
    
display_scores(GBoost_rmse_scores)

This may be a small improvement. Further tests are needed to determine whether the results of this model are significantly different from those of a simple Decision Tree. RMSE is 18.4 based on cross-validation, which is lower than we've seen before. The adjusted R2 score is one point lower (65%).

### K Nearest Neighbours Regression

KNN regression looks at the training points closest to the point being predicted and predicts the average of their target values.
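
The idea fits in a few lines of NumPy. This is a minimal sketch using Euclidean distance and an unweighted mean, with made-up data; `knn_predict` is a hypothetical helper, not the `KNeighborsRegressor` implementation.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=2):
    """Predict each query point as the mean target of its k nearest training points."""
    # Pairwise Euclidean distances, shape (n_query, n_train)
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

X_demo = np.array([[0.0], [1.0], [2.5], [10.0]])
y_demo = np.array([10.0, 20.0, 30.0, 90.0])

new_point = knn_predict(X_demo, y_demo, np.array([[1.4]]))
on_training = knn_predict(X_demo, y_demo, X_demo)
print("Prediction for x = 1.4     :", new_point)
print("Predictions on training set:", on_training)
```

When predicting on the training set, every point is its own nearest neighbour, so half of each prediction with `k=2` comes from the true answer. That is one reason training error for a small `k` can look much better than validation error.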

In [None]:
from sklearn.neighbors import KNeighborsRegressor

KNReg = KNeighborsRegressor(n_neighbors=2)

KNReg_pipeline = make_pipeline(column_transform, KNReg)

KNReg_pipeline.fit(X_train, Y_train)

train_KNReg_predictions = KNReg_pipeline.predict(X_train)

### Evaluating the model ###
print("Model performance for training set:")
regression_eval(X_train, Y_train, train_KNReg_predictions)

RMSE is the lowest yet at 16.9. Let's see how the model fares with cross-validation.

In [None]:
scores = cross_val_score(KNReg, column_transform.fit_transform(X_train), Y_train, cv=5, scoring='neg_mean_squared_error')

KNReg_rmse_scores = np.sqrt(-scores)
    
display_scores(KNReg_rmse_scores)

Mean RMSE is 29.1, which is the worst performance among all non-linear models. Validation error is expected to be higher than training error, but this difference is quite stark. This means the model is overfitting badly.

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

RForest = RandomForestRegressor(min_samples_leaf=15, min_samples_split=10,
                                max_features=13, n_estimators=20)

RForest_pipeline = make_pipeline(column_transform, RForest)

RForest_pipeline.fit(X_train, Y_train)

train_RForest_predictions = RForest_pipeline.predict(X_train)

### Evaluating the model ###
print("Model performance for training set:")
regression_eval(X_train, Y_train, train_RForest_predictions)

In [None]:
scores = cross_val_score(RForest, column_transform.fit_transform(X_train), Y_train, cv=5, scoring='neg_mean_squared_error')

RForest_rmse_scores = np.sqrt(-scores)
    
display_scores(RForest_rmse_scores)

The model looks great. Training error (RMSE = 17.0) is not much lower than validation error (mean RMSE = 18.6, SD = 0.2). Let's compare all non-linear models below and choose the best-performing one.

## 11.3 Best Regression Model - evaluation

Let's choose the best model. To do this we need to look at how each model performed on the training set and in cross-validation.


**On the training set:**

|Models| RMSE score|
| ----------- | ----------- |
|**SVR**|23.1|
|**DT**|17.3|
|**GB**|17.7|
|**KNN**|16.9|
|**RF**|17.0|

**Cross-validation:**

|Models| Mean RMSE|SD|
| ----------- | ----------- |-------|
|**SVR**|23.4|0.1|
|**DT**|20.4|0.5|
|**GB**|18.4|0.2|
|**KNN**|29.1|0.3|
|**RF**|18.6|0.2|


GradientBoost and Random Forest are our best models. Cross-validation shows GBoost to be the most accurate model, even though its training-set error is higher than that of the RF. KNN had the lowest RMSE on the training set, but its performance degraded sharply in cross-validation, meaning this model is overfitting.
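
The comparison can also be made mechanically by computing the gap between training and cross-validation RMSE for each model. The sketch below simply copies the numbers from the two tables above into dictionaries; a large positive gap is a sign of overfitting.

```python
# RMSE values copied from the two tables above
train_rmse = {"SVR": 23.1, "DT": 17.3, "GB": 17.7, "KNN": 16.9, "RF": 17.0}
cv_rmse = {"SVR": 23.4, "DT": 20.4, "GB": 18.4, "KNN": 29.1, "RF": 18.6}

# Gap between cross-validation and training error, rounded to one decimal
gaps = {name: round(cv_rmse[name] - train_rmse[name], 1) for name in train_rmse}

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name:>3}: train {train_rmse[name]:.1f}  cv {cv_rmse[name]:.1f}  gap {gap:+.1f}")

best_cv = min(cv_rmse, key=cv_rmse.get)
print("Lowest cross-validation RMSE:", best_cv)
```

KNN stands out with by far the largest gap, while GB combines the lowest cross-validation error with a small gap.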

In [None]:
# Set encoding and scaling instructions
column_transform = make_column_transformer(
    (OneHotEncoder(), ['code_module', 'code_presentation', 'gender', 'region', 'age_band', 'disability']),
    (OrdinalEncoder(), ['highest_education', 'imd_band']),
    remainder='passthrough')
    
GBoost = GradientBoostingRegressor(n_estimators=400, learning_rate=0.05,
                                   max_depth=4, max_features=13,
                                   min_samples_leaf=15, min_samples_split=10, 
                                   loss='huber', random_state =5)

GBoost_pipeline = make_pipeline(column_transform, GBoost)

# Fit the training data
GBoost_pipeline.fit(X_train_eval, Y_train_eval)

# Predict the test data (the fitted pipeline transforms X_test_eval without refitting)
test_GBoost_predictions = GBoost_pipeline.predict(X_test_eval)

regression_eval(X_test_eval, Y_test_eval, test_GBoost_predictions)

This is our final regression model with RMSE = 18.04 and adjusted R2 = 63%.

<a id="classification"></a>
# 12. Classification
***

Next, we approach the task as a classification problem. The same steps apply as for the regression: the data goes through the last cleaning steps, categorical values are encoded, and the models are fitted and evaluated.

## 12.1 Model preparation

In [None]:
train = train_class.copy()
test = test_class.copy()

train.head()

### Last cleaning

In [None]:
# Distinction as a Pass
train['final_result'] = np.where( (train['final_result'] == 'Distinction'),
                                           'Pass',
                                           train['final_result']
                                    )
# Withdrawn as a Fail (to make the target binary)
train['final_result'] = np.where( (train['final_result'] == 'Withdrawn'),
                                           'Fail',
                                           train['final_result']
                                    )
# Same for test set
test['final_result'] = np.where( (test['final_result'] == 'Distinction'),
                                           'Pass',
                                           test['final_result']
                                    )
test['final_result'] = np.where( (test['final_result'] == 'Withdrawn'),
                                           'Fail',
                                           test['final_result']
                                    )

In [None]:
# Rename 'no formal quals' into 'lower than a level'
train['highest_education'] = np.where( (train['highest_education'] == 'No Formal quals'),
                                           'Lower Than A Level',
                                           train['highest_education']
                                    )

# Rename post-grads
train['highest_education'] = np.where( (train['highest_education'] == 'Post Graduate Qualification'),
                                           'HE Qualification',
                                           train['highest_education']
                                    )


# Do the same for the test set
test['highest_education'] = np.where( (test['highest_education'] == 'No Formal quals'),
                                           'Lower Than A Level',
                                           test['highest_education']
                                    )

test['highest_education'] = np.where( (test['highest_education'] == 'Post Graduate Qualification'),
                                           'HE Qualification',
                                           test['highest_education']
                                    )
### Age bands ###
train['age_band'] = np.where( (train['age_band'] == '55<='),
                                           '35-55',
                                           train['age_band']
                                    )

train['age_band'] = np.where( (train['age_band'] == '35-55'),
                                           '35+',
                                           train['age_band']
                                    )

# Do the same for the test set
test['age_band'] = np.where( (test['age_band'] == '55<='),
                                           '35-55',
                                           test['age_band']
                                    )

test['age_band'] = np.where( (test['age_band'] == '35-55'),
                                           '35+',
                                           test['age_band']
                                    )

In [None]:
# Separate features from target

'''Training set'''
# Drop target column
X_train = train.drop(columns=['final_result'])
# Make an array with target
Y_train = train['final_result'].copy()

'''Test set'''
# Drop target column
X_test = test.drop(columns=['final_result'])
# Make an array with target
Y_test = test['final_result'].copy()

X_train.head()

In [None]:
X_train.shape

### Encoding for trees

Tree models do not require feature scaling, so this step only encodes the categorical columns and passes the rest through unchanged.

In [None]:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler, MinMaxScaler, RobustScaler
from sklearn.compose import make_column_transformer

# Set encoding and scaling instructions
column_transform = make_column_transformer(
    (OneHotEncoder(), ['code_module', 'code_presentation', 'gender', 'region', 'disability']),
    (OrdinalEncoder(), ['highest_education', 'imd_band', 'age_band']),
    remainder='passthrough'
)

# Apply column transformer to features
X_encoded = column_transform.fit_transform(X_train)

In [None]:
pd.DataFrame(X_encoded).head()

## 12.2. Models:

### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

Dtree = DecisionTreeClassifier(min_samples_leaf=15, min_samples_split=10, max_features=8)

Dtree_pipeline = make_pipeline(column_transform, Dtree)

# Cross-validate
def display_accuracy_scores(pipeline, X, Y):
    scores = cross_val_score(pipeline, X, Y, cv=5, scoring='accuracy')
    print('Scores\t:', scores)
    print('Mean\t:', scores.mean())
    print('SD\t:', scores.std())

### Cross-validate ###
# Train set
print('Evaluation of the training set')
display_accuracy_scores(Dtree_pipeline, X_train, Y_train)

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

RForest = RandomForestClassifier(min_samples_leaf=15, min_samples_split=10,
                                max_features=8, n_estimators=20)

RForest_pipeline = make_pipeline(column_transform, RForest)

### Cross-validate ###
# Train set
print('Evaluation of the training set')
display_accuracy_scores(RForest_pipeline, X_train, Y_train)

### Support Vector Machine

The last two models (Support Vector Machine and SGD) require scaled features, so the encoding step now also scales the numeric columns.
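
The next cell uses `RobustScaler` for the numeric columns. As a rough sketch of what that scaler computes for one column, the function below centres on the median and divides by the interquartile range, so a single extreme value does not distort the scale of the other rows. The click counts are made up, and `robust_scale` is a hypothetical stand-in rather than the library code.

```python
import numpy as np

def robust_scale(values):
    """Centre on the median and divide by the interquartile range (IQR)."""
    median = np.median(values)
    q1, q3 = np.percentile(values, [25, 75])
    return (values - median) / (q3 - q1)

clicks = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # made-up values with one outlier
scaled = robust_scale(clicks)
print(scaled)
```

The four ordinary values stay spread between -1 and 0.5, while the outlier remains visibly extreme instead of squashing everything else towards zero.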

In [None]:
# Set encoding and scaling instructions
column_transform = make_column_transformer(
    (OneHotEncoder(), ['code_module', 'code_presentation', 'gender', 'region', 'disability']),
    (OrdinalEncoder(), ['highest_education', 'imd_band', 'age_band']),
    (RobustScaler(), ['date_registration', 'module_presentation_length',
                       'num_of_prev_attempts', 'studied_credits', 'total_click'])
)

In [None]:
from sklearn.svm import SVC

SVClass = SVC(gamma='auto')

SVClass_pipeline = make_pipeline(column_transform, SVClass)

# Train set
print('Evaluation of the training set')
display_accuracy_scores(SVClass_pipeline, X_train, Y_train)

### SGD

In [None]:
from sklearn.linear_model import SGDClassifier

SDG = SGDClassifier(max_iter=1000, tol=1e-3)

SDG_pipeline = make_pipeline(column_transform, SDG)
    
# Train set
print('Evaluation of the training set')
display_accuracy_scores(SDG_pipeline, X_train, Y_train)

## 12.3 Best Classification Model - evaluation 

The Support Vector Machine classifier performed the best. Although its accuracy (0.78, SD = 0.002) is similar to that of the RF model (0.79, SD = 0.004), the SVC model shows slightly less variance between the cross-validation scores, as shown by its lower standard deviation.

In [None]:
# Test set evaluation for SVC
print('Evaluation of the test set')

# Fit the training data
SVClass_pipeline.fit(X_train, Y_train)
# Predict the test data (the fitted pipeline transforms X_test without refitting)
SVClass_predictions_test = SVClass_pipeline.predict(X_test)

print('Accuracy score:', metrics.accuracy_score(Y_test, SVClass_predictions_test))

The accuracy score is almost the same for both the training and the test sets.

<a id="discussion"></a>
# 13. Discussion
***

To quickly summarise, the best model for the regression task was Gradient Boost (RMSE = 18.4, SD = 0.2 with 4-fold cross-validation), which gave us RMSE = 18.1 and adjusted R2 = 0.63 when evaluated on the test set. This is without any fine-tuning of the hyperparameters. The best model for the classification task was a Support Vector Machine classifier (78% accuracy on the test set). Again, this is without any hyperparameter tuning.

Next steps to take would be to find which features are most important and which can be dropped. Hyperparameter tuning can be used to find the best set of parameters for the models. Various dimensionality reduction tools can be used to improve the performance of the models. Another point to make is that accuracy score isn't the best way to evaluate classification models, especially when the target is imbalanced. We have an imbalanced target for this classification problem, so dealing with this imbalance and using a different evaluation metric would be advantageous.
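
A small example shows why accuracy can mislead on an imbalanced target. The labels below are made up (80% of one class), and the "model" always predicts the majority class. Balanced accuracy, the mean of the per-class recall, is one alternative metric; scikit-learn provides it as `balanced_accuracy_score`, alongside others such as `f1_score`.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        predictions_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
        recalls.append(predictions_for_label.count(label) / len(predictions_for_label))
    return sum(recalls) / len(recalls)

# Made-up labels: 8 'Pass' rows and 2 'Fail' rows
y_true = ["Pass"] * 8 + ["Fail"] * 2
y_majority = ["Pass"] * 10   # always predicts the majority class

acc = accuracy(y_true, y_majority)
bal_acc = balanced_accuracy(y_true, y_majority)
print("Accuracy         :", acc)
print("Balanced accuracy:", bal_acc)
```

The majority-class predictor scores 0.8 on accuracy despite never identifying a single 'Fail', while its balanced accuracy of 0.5 is no better than guessing.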

It's possible to engineer some more features too. For example, we know VLE interactions are important for student success, but maybe the type of resource the student interacts with would be a better signal than total clicks across all resources?