Version 1 of data cleaning and feature engineering.

This version is to ensure that I have a working version of the data which I can use to build and assess classification models.

The general goal is to create a dataset which contains: 

* student biographical information (student table)
* student's course registration information (course and registration tables)
* student's assessment information (assessments table)
  * this needs to be feature engineered to create a single row per student per course
  * as each module has a different number of assessments, this dataset will engineer features which are available for all modules (e.g. average score, proportion of assessments submitted, mean distance from due date, etc.)
* student's VLE information (VLE table)
  * this needs to be feature engineered to create a single row per student per course
  * as each module has a different number of VLE interactions, this dataset will engineer features which are available for all modules (e.g. average number of clicks, proportion of clicks on each resource type, etc.)
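
As a rough sketch of the VLE feature engineering described above, the click records can be collapsed to one row per student per course. The toy frames below are made-up stand-ins for `studentVle` and `vle`, and the feature names (`total_clicks`, `mean_clicks`, `active_days`, `prop_*`) are hypothetical choices rather than part of the dataset.

```python
import pandas as pd

# Toy stand-ins for the studentVle and vle tables (values are made up).
student_vle = pd.DataFrame({
    'code_module': ['AAA'] * 5,
    'code_presentation': ['2013J'] * 5,
    'id_student': [1, 1, 1, 2, 2],
    'id_site': [10, 11, 10, 10, 11],
    'date': [-3, 5, 20, 1, 2],
    'sum_click': [4, 6, 10, 3, 1],
})
vle = pd.DataFrame({
    'id_site': [10, 11],
    'code_module': ['AAA', 'AAA'],
    'code_presentation': ['2013J', '2013J'],
    'activity_type': ['forumng', 'resource'],
})

keys = ['code_module', 'code_presentation', 'id_student']

# Attach the resource type to every click record.
clicks = student_vle.merge(
    vle[['id_site', 'code_module', 'code_presentation', 'activity_type']],
    on=['id_site', 'code_module', 'code_presentation'],
    how='left',
    validate='many_to_one',
)

# Totals and averages: one row per student per course.
vle_features = clicks.groupby(keys).agg(
    total_clicks=('sum_click', 'sum'),
    mean_clicks=('sum_click', 'mean'),
    active_days=('date', 'nunique'),
).reset_index()

# Proportion of each student's clicks by resource type (one column per type).
by_type = clicks.pivot_table(
    index=keys, columns='activity_type', values='sum_click',
    aggfunc='sum', fill_value=0,
)
by_type = by_type.div(by_type.sum(axis=1), axis=0).add_prefix('prop_').reset_index()

vle_features = vle_features.merge(by_type, on=keys, validate='one_to_one')
print(vle_features)
```

Because every module ends up with the same `prop_*` columns only if it uses the same resource types, types missing from a module would need filling with 0 after the final merge.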

Finally, the dataset needs to be sliced in relation to the prediction point:
* students who have already withdrawn at the prediction point need to be excluded, as their outcome is known and would only confuse the model
* more critically, a mechanism is needed to remove assessments, VLE interactions, etc. which have not yet happened at the prediction point
* this will be the enhanced version - or a future version - of the dataset, which will:
  * allow prediction of the outcome at any point in the course
  * allow comparison between time periods
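
A minimal sketch of the slicing mechanism, assuming each event table carries a `date` column measured in days from the start of the course. The helper name `slice_to_prediction_point` and the toy data are hypothetical.

```python
import pandas as pd

def slice_to_prediction_point(events, prediction_point, date_col='date'):
    """Keep only rows dated on or before the prediction point (days from course start)."""
    return events[events[date_col] <= prediction_point].copy()

# Toy click records (values are made up).
toy_vle = pd.DataFrame({
    'id_student': [1, 1, 2, 2],
    'date': [-5, 30, 10, 120],
    'sum_click': [2, 7, 1, 9],
})

early = slice_to_prediction_point(toy_vle, prediction_point=50)

# Comparing time periods: number of rows available at several prediction points.
rows_available = {p: len(slice_to_prediction_point(toy_vle, p)) for p in (0, 50, 150)}
print(rows_available)
```

The same helper could be applied to the assessments and VLE tables before any features are aggregated, so that every feature only uses information available at the prediction point.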


Future versions/questions/scenarios to explore:

* predict at any point in the course, with information available up to that point
* comparisons between modules - that is, different models per module
* do different models perform better at different points in the course?
* is biographical information useful?
* are there inherent student characteristics which are predictive of outcome?
* are there clusters (unsupervised learning) of students which are predictive of outcome?
* can a model be built which incorporates previous predictions into the model?

BUT - for this assignment, I need to first have something which works and fulfills the assignment specification.



## Data preparation

### Libraries and zip file

In [6]:
import pandas as pd
import numpy as np
import zipfile
import matplotlib.pyplot as plt


In [18]:
    
# open the zip file containing the csv files
ou_zip = zipfile.ZipFile('../data/anonymisedData.zip') 

# read each csv into its own DataFrame
registrations = pd.read_csv(ou_zip.open('studentRegistration.csv'))
courses = pd.read_csv(ou_zip.open('courses.csv'))
students = pd.read_csv(ou_zip.open('studentInfo.csv'))
student_vle = pd.read_csv(ou_zip.open('studentVle.csv'))
vle = pd.read_csv(ou_zip.open('vle.csv'))
student_assessments = pd.read_csv(ou_zip.open('studentAssessment.csv'))
assessments = pd.read_csv(ou_zip.open('assessments.csv'))

### student information

* base table to which all other tables will be joined



TODO remove rows (students) with missing data in imd_band


In [19]:
students.count()


code_module             32593
code_presentation       32593
id_student              32593
gender                  32593
region                  32593
highest_education       32593
imd_band                31482
age_band                32593
num_of_prev_attempts    32593
studied_credits         32593
disability              32593
final_result            32593
dtype: int64

### course information
* add course information from course table - merge on code_module and code_presentation
* TODO create new features for 'intake', 'year' and 'subject'


In [32]:
# merge students and courses
final = pd.merge(students, courses, on=['code_module', 'code_presentation'], validate='many_to_one')

final.count()

code_module                   32593
code_presentation             32593
id_student                    32593
gender                        32593
region                        32593
highest_education             32593
imd_band                      31482
age_band                      32593
num_of_prev_attempts          32593
studied_credits               32593
disability                    32593
final_result                  32593
module_presentation_length    32593
dtype: int64

### registrations

* merge registration table with student table


In [33]:
# merge registrations
final = pd.merge(final, registrations, on=['code_module', 'code_presentation', 'id_student'], how = 'left', validate='1:1')
final.count()

code_module                   32593
code_presentation             32593
id_student                    32593
gender                        32593
region                        32593
highest_education             32593
imd_band                      31482
age_band                      32593
num_of_prev_attempts          32593
studied_credits               32593
disability                    32593
final_result                  32593
module_presentation_length    32593
date_registration             32548
date_unregistration           10072
dtype: int64

In [34]:
# drop missing value rows (date_registration, imd_band)
final.dropna(subset=['date_registration', 'imd_band'], inplace=True)
final.count()

code_module                   31437
code_presentation             31437
id_student                    31437
gender                        31437
region                        31437
highest_education             31437
imd_band                      31437
age_band                      31437
num_of_prev_attempts          31437
studied_credits               31437
disability                    31437
final_result                  31437
module_presentation_length    31437
date_registration             31437
date_unregistration            9798
dtype: int64

In [35]:
final['final_result'].unique()

array(['Pass', 'Withdrawn', 'Fail', 'Distinction'], dtype=object)

In [64]:
# prediction point = days from start of course
prediction_point = 200

# prediction point must be less than course length, integer, and greater than 0
if not isinstance(prediction_point, int) or prediction_point <= 0 or prediction_point >= max(final['module_presentation_length']):
    print("Error: Invalid prediction point. \n\nPlease provide an integer value greater than 0 and less than the maximum course length. \n\nThis is the number of days from the start of the course for which you want to predict the outcome.")
else:
    # withdrawn or failed before prediction point - remove
    withdrawn_fail_condition = (final['final_result'].isin(['Withdrawn', 'Fail'])) & (final['date_unregistration'] <= prediction_point)
    final.loc[withdrawn_fail_condition, 'status'] = 'remove_outcome_known'
    # if unregister after prediction point - keep
    unregister_after_condition = final['date_unregistration'] > prediction_point
    final.loc[unregister_after_condition, 'status'] = 'keep'
    # if no unregistration date - keep
    no_unregistration_condition = final['date_unregistration'].isna()
    final.loc[no_unregistration_condition, 'status'] = 'keep'
    # default case
    final.loc[~(withdrawn_fail_condition | unregister_after_condition | no_unregistration_condition), 'status'] = 'query'
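
The labelling above could also be wrapped in a reusable function. The sketch below uses `np.select`, which evaluates the conditions in order and falls back to a default. The function name `label_status` and the toy rows are hypothetical; the column names follow the merged table.

```python
import numpy as np
import pandas as pd

def label_status(df, prediction_point):
    """Return a 'status' Series using the same rules as the cell above."""
    unreg = df['date_unregistration']
    conditions = [
        # outcome already known at the prediction point
        df['final_result'].isin(['Withdrawn', 'Fail']) & (unreg <= prediction_point),
        # unregistered later, or never unregistered
        (unreg > prediction_point) | unreg.isna(),
    ]
    labels = np.select(conditions, ['remove_outcome_known', 'keep'], default='query')
    return pd.Series(labels, index=df.index, name='status')

# Toy rows (values are made up) covering each branch.
toy = pd.DataFrame({
    'final_result': ['Withdrawn', 'Withdrawn', 'Pass', 'Pass'],
    'date_unregistration': [12.0, 230.0, np.nan, 50.0],
})
toy['status'] = label_status(toy, prediction_point=200)
print(toy)
```

The last toy row (a pass with an early unregistration date) is the kind of inconsistent record that would land in `query`.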






In [65]:
# rows which matched none of the conditions above - expected to be empty
query_rows = final[final['status'] == 'query']
print(query_rows)


Empty DataFrame
Columns: [code_module, code_presentation, id_student, gender, region, highest_education, imd_band, age_band, num_of_prev_attempts, studied_credits, disability, final_result, module_presentation_length, date_registration, date_unregistration, status]
Index: []


Following the above steps, the missing values in date_unregistration are populated with the course length (module_presentation_length), for now.



In [66]:
# replace missing date_unreg with module_presentation_length
final['date_unregistration'] = final['date_unregistration'].fillna(final['module_presentation_length'])
final.count()

code_module                   31437
code_presentation             31437
id_student                    31437
gender                        31437
region                        31437
highest_education             31437
imd_band                      31437
age_band                      31437
num_of_prev_attempts          31437
studied_credits               31437
disability                    31437
final_result                  31437
module_presentation_length    31437
date_registration             31437
date_unregistration           31437
status                        31437
dtype: int64

Now the modelling table can be created by filtering on status, which takes the prediction point into account if there is one.

A default value is needed - maybe the maximum course length?
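
One possible answer to the default value question, sketched under the assumption that "no prediction point" should mean "end of the longest course". The helper `resolve_prediction_point` is hypothetical, and unlike the earlier check it allows a value equal to the maximum course length so that the default itself is valid.

```python
import pandas as pd

def resolve_prediction_point(prediction_point, course_lengths):
    """Return a usable prediction point, defaulting to the longest course length."""
    longest = int(course_lengths.max())
    if prediction_point is None:
        return longest
    if not isinstance(prediction_point, int) or not 0 < prediction_point <= longest:
        raise ValueError(
            f"prediction_point must be an integer between 1 and {longest} days"
        )
    return prediction_point

# Toy course lengths (values are made up).
lengths = pd.Series([268, 269, 240])

default_point = resolve_prediction_point(None, lengths)
chosen_point = resolve_prediction_point(200, lengths)
print(default_point, chosen_point)
```

Raising an error instead of printing a message means a bad value stops the pipeline before any filtering happens.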

In [70]:
model_final = final[final['status'] != 'remove_outcome_known']
model_final.count()
#model_final.head(20)

code_module                   22149
code_presentation             22149
id_student                    22149
gender                        22149
region                        22149
highest_education             22149
imd_band                      22149
age_band                      22149
num_of_prev_attempts          22149
studied_credits               22149
disability                    22149
final_result                  22149
module_presentation_length    22149
date_registration             22149
date_unregistration           22149
status                        22149
dtype: int64

   


1. merge student_assessments with the assessments table - merge on id_assessment, which brings in code_module and code_presentation
   * remove missing scores? - investigate or impute? they could be useful
   * remove columns not used in iteration 1 (weight, is_banked)
   * reduce to only those relevant to the prediction point - i.e. those due on or before the prediction point
     * assessments = assessments_combined[assessments_combined['date'] <= prediction_point]
     * don't use date_submitted as this is student specific; use the course assessment dates
   * get number of assessments per course (expected number)
   * get number of assessments per student row (actual number)
   * get sum of scores per student row (sum of actual scores)
   * feature = actual number / expected number
   * feature = sum of actual scores / actual number * 100
   * consider:
     * get sum of submission dates per student row (actual submission dates)
     * get sum of due dates for the same assessments (expected submission dates)
     * feature = expected submission dates - actual submission dates
2. drop unused columns (type, dates, etc.)
3. merge with final student table - left join (only active students)
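
The plan above can be sketched roughly as follows. The toy frames stand in for `assessments` and `studentAssessment`, the feature names (`n_expected`, `n_submitted`, `prop_submitted`, `mean_score`, `mean_days_early`) are hypothetical, and the score feature is taken as a plain mean, which is an assumption about what the final scaling should be.

```python
import pandas as pd

prediction_point = 100  # hypothetical value, in days from course start

# Toy stand-ins for the assessments and studentAssessment tables (values are made up).
assessments = pd.DataFrame({
    'code_module': ['AAA'] * 3,
    'code_presentation': ['2013J'] * 3,
    'id_assessment': [1, 2, 3],
    'date': [20.0, 60.0, 150.0],
})
student_assessments = pd.DataFrame({
    'id_assessment': [1, 2, 3, 1],
    'id_student': [11, 11, 11, 22],
    'date_submitted': [18, 63, 149, 20],
    'score': [80.0, 60.0, 90.0, 50.0],
})

course_keys = ['code_module', 'code_presentation']

# Keep assessments due on or before the prediction point (course dates, not date_submitted).
due = assessments[assessments['date'] <= prediction_point]

# Expected number of assessments per course at this point.
expected = due.groupby(course_keys).size().rename('n_expected').reset_index()

# Student submissions for those assessments only.
combined = student_assessments.merge(due, on='id_assessment', how='inner', validate='many_to_one')
combined['days_early'] = combined['date'] - combined['date_submitted']

# One row per student per course.
features = combined.groupby(course_keys + ['id_student']).agg(
    n_submitted=('id_assessment', 'count'),
    mean_score=('score', 'mean'),
    mean_days_early=('days_early', 'mean'),
).reset_index()

features = features.merge(expected, on=course_keys, how='left', validate='many_to_one')
features['prop_submitted'] = features['n_submitted'] / features['n_expected']
print(features)
```

Students with no submissions before the prediction point do not appear in `features`, so after the left join onto the student table their rows would hold NaN and need filling (for example `prop_submitted` = 0).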

In [5]:
# step 1 merge 'student' with 'course'
final_student = pd.merge(students, courses, on=['code_module', 'code_presentation'], validate='many_to_one')
#final_student.head()

# step 2 merge registrations with final_student
final_student = pd.merge(final_student, registrations, on=['code_module','code_presentation','id_student'], validate="1:1")

final_student.head()

| | code_module | code_presentation | id_student | gender | region | highest_education | imd_band | age_band | num_of_prev_attempts | studied_credits | disability | final_result | module_presentation_length | date_registration | date_unregistration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAA | 2013J | 11391 | M | East Anglian Region | HE Qualification | 90-100% | 55<= | 0 | 240 | N | Pass | 268 | -159.0 | NaN |
| 1 | AAA | 2013J | 28400 | F | Scotland | HE Qualification | 20-30% | 35-55 | 0 | 60 | N | Pass | 268 | -53.0 | NaN |
| 2 | AAA | 2013J | 30268 | F | North Western Region | A Level or Equivalent | 30-40% | 35-55 | 0 | 60 | Y | Withdrawn | 268 | -92.0 | 12.0 |
| 3 | AAA | 2013J | 31604 | F | South East Region | A Level or Equivalent | 50-60% | 35-55 | 0 | 60 | N | Pass | 268 | -52.0 | NaN |
| 4 | AAA | 2013J | 32885 | F | West Midlands Region | Lower Than A Level | 50-60% | 0-35 | 0 | 60 | N | Pass | 268 | -176.0 | NaN |


In [None]:
## to be made into a function?? with a prediction point splitter??

# step 1 merge 'student' with 'course'
final_student = pd.merge(students, courses, on=['code_module', 'code_presentation'], validate='many_to_one')
#final_student.head()

# step 2 merge registrations with final_student
final_student = pd.merge(final_student, registrations, on=['code_module','code_presentation','id_student'], validate="1:1")

# step 3 drop imd_band, date_reg nulls and replace NaN for unreg date with the length of the course
final_student.dropna(subset=['date_registration', 'imd_band'], inplace=True)
final_student['date_unregistration'] = final_student['date_unregistration'].fillna(final_student['module_presentation_length'])

# step 4 drop rows where the student withdrew before they started - may want to add cool off buffer 7-14 days? (as earlier)
# keep only students whose unregistration date is on or after their registration date
final_student = final_student[final_student['date_unregistration'] >= final_student['date_registration']]

# step 5 split code_presentation into year and month and add subject from code_module
final_student['year'] = final_student['code_presentation'].str[:4].astype(int)
final_student['month'] = final_student['code_presentation'].str[-1].map({'J': 'Oct', 'B': 'Feb'})

# Module subject mapping
code_module_mapping = {
    'AAA': 'SocSci',
    'BBB': 'SocSci',
    'GGG': 'SocSci',
    'CCC': 'Stem',
    'DDD': 'Stem',
    'EEE': 'Stem',
    'FFF': 'Stem'
}
final_student['subject'] = final_student['code_module'].map(code_module_mapping)

#final_student.head()

# step 6 prep assessments - fill missing dates with length of course - 3 days (last week of course)

# merge 'assessments' and 'courses' on 'code_module' and 'code_presentation'
merged_assess = assessments.merge(courses[['code_module', 'code_presentation', 'module_presentation_length']], on=['code_module', 'code_presentation'], how='left')

# value to fill in the missing 'date' values
value_to_fill = merged_assess['module_presentation_length'] - 3

# missing values in 'date' column 
merged_assess['date'] = merged_assess['date'].fillna(value_to_fill)

# update 'assessments' with filled 'date' column
assessments['date'] = merged_assess['date']


# step 7 merge assessments with final_student

## remove cols (is_banked) and rows where score is null
