# Data Modelling (Pseudo Plan)


1. Data Splitting
* Split into training and test
    * Train:
        * More detailed EDA (maybe) - on Train only, so no bias is introduced from Test
        * Training models
        * Hyperparameter tuning with cross-validation

2. Data Scaling
* In order to take out effect of different scales (e.g. number of clicks v dates)
* Fit on Train and then apply the same parameters to Test (to avoid data leakage)

3. Unsupervised Clustering / EDA (optional - to consider)
* Explore student profiles through unsupervised clustering
* K-means clustering, hierarchical clustering

4. Supervised Prediction Model:
* Predict student outcome (`final_result`) - a classification problem, so use classification models
    * E.g. Random Forest, Decision Trees, Support Vector Machines, Naive Bayes
* Prepare feature matrix:
    * Consider feature reduction, selection
    * Encode categorical variables - one-hot encoding, label encoding

5. Feature Selection / Dimensionality Reduction (optional)
* Consider reducing/simplifying features by selection or dimensionality reduction
* Correlation analysis, mutual information, feature importance (trees and forests?) 
* Consider PCA, LDA

6. Model Training and Evaluation: 
* Train model
* Evaluate model - metrics like accuracy, precision, recall, F1, AUC-ROC
* Compare and select

7. Variable Effects questions:
* Significance of different variables or categories of variables (bio, demo, study, behaviour) on prediction

8. Performance over time
* Accuracy (of model) over time - compare performance at different prediction points
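
Step 6 of the plan lists accuracy, precision, recall, F1 and AUC-ROC. A minimal sketch of computing them for a four-class target follows. The synthetic data, the `LogisticRegression` stand-in model, the macro averaging and the one-vs-rest AUC are all assumptions made for illustration, not choices taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic four-class data standing in for the real features (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_classes=4,
    n_clusters_per_class=1, random_state=567,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=567
)

# Stand-in classifier (assumption); any model with predict_proba would do
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_proba = clf.predict_proba(X_te)

# Macro averaging weights each class equally, whatever its size
metrics = {
    "accuracy": accuracy_score(y_te, y_pred),
    "precision_macro": precision_score(y_te, y_pred, average="macro", zero_division=0),
    "recall_macro": recall_score(y_te, y_pred, average="macro", zero_division=0),
    "f1_macro": f1_score(y_te, y_pred, average="macro", zero_division=0),
    "roc_auc_ovr": roc_auc_score(y_te, y_proba, multi_class="ovr"),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Macro averaging may suit an imbalanced target such as `final_result`, because the smaller classes count as much as the larger ones. Weighted averaging is an equally valid alternative.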

## Split dataset

I decided to split the dataset into two subsets, `train` and `test`, on an 80/20 split (`test_size=0.2`) with stratification, so that the class proportions of `final_result` remain the same within each subset.

* `train` - model training, hyperparameter tuning with k-fold cross validation
* `test` - final model evaluation, remains unseen.
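
The division of labour between `train` and `test` described above can be sketched as follows. This is an illustrative sketch only: the synthetic data, the `RandomForestClassifier`, the small parameter grid, the five folds and the `f1_macro` scoring are assumptions, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic four-class data standing in for the real features (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_classes=4,
    n_clusters_per_class=1, random_state=567,
)

# Hold out a stratified test set that is not touched during tuning
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=567
)

# Tune hyperparameters with stratified k-fold cross-validation on train only
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=567)
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}  # hypothetical grid
search = GridSearchCV(
    RandomForestClassifier(random_state=567), param_grid, cv=cv, scoring="f1_macro"
)
search.fit(X_tr, y_tr)

# Score the refitted best model once on the held-out test set
test_score = search.score(X_te, y_te)
print(search.best_params_, round(test_score, 3))
```

Because every fold is drawn from `train`, the `test` score stays an estimate on unseen data.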

### Data and Libraries

In [1]:
# load libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split



In [2]:
# load preprocessed data from csv file
model = pd.read_csv('../data/final_model_ALL_20230525.csv')

In [3]:

# drop 'id_student' column
model = model.drop('id_student', axis=1)



In [4]:
# drop 'status' - should have been dropped in function (now fixed)
#model = model.drop('status', axis=1)

#change year to str (object) - not fixed in function
model['year'] = model['year'].astype(str)


model.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31437 entries, 0 to 31436
Data columns (total 25 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   code_module              31437 non-null  object 
 1   code_presentation        31437 non-null  object 
 2   gender                   31437 non-null  object 
 3   region                   31437 non-null  object 
 4   highest_education        31437 non-null  object 
 5   imd_band                 31437 non-null  object 
 6   age_band                 31437 non-null  object 
 7   num_of_prev_attempts     31437 non-null  int64  
 8   studied_credits          31437 non-null  int64  
 9   disability               31437 non-null  object 
 10  course_length            31437 non-null  int64  
 11  date_registration        31437 non-null  float64
 12  date_unregistration      31437 non-null  float64
 13  prop_submissions         31437 non-null  float64
 14  avg_score             

In [5]:


# drop target from X, save target to y
X = model.drop('final_result', axis=1)  
y = model['final_result']  

# split data into train and test sets with stratification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=567)


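### Check proportions - stratification

The class proportions in `train` and `test` match the original data to within rounding (see the output below), so the stratification works.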

In [6]:
# proportions of target variable in original data
original_proportions = model['final_result'].value_counts(normalize=True)

# proportions of target variable in train and test sets
train_proportions = y_train.value_counts(normalize=True)
test_proportions = y_test.value_counts(normalize=True)

# results
print("Original Proportions:")
print(original_proportions)

print("\nTrain Set Proportions:")
print(train_proportions)

print("\nTest Set Proportions:")
print(test_proportions)


Original Proportions:
final_result
Pass           0.376276
Withdrawn      0.314311
Fail           0.219550
Distinction    0.089862
Name: proportion, dtype: float64

Train Set Proportions:
final_result
Pass           0.376277
Withdrawn      0.314327
Fail           0.219532
Distinction    0.089864
Name: proportion, dtype: float64

Test Set Proportions:
final_result
Pass           0.376272
Withdrawn      0.314249
Fail           0.219625
Distinction    0.089854
Name: proportion, dtype: float64


In [7]:
# Check for missing values in X_train (count per column)
missing_values = X_train.isnull().sum()

# Filter rows with at least one missing value
rows_with_missing = X_train[X_train.isnull().any(axis=1)]

# Display the rows with missing values (empty if there are none)
rows_with_missing



Empty DataFrame (no rows in `X_train` contain missing values). Columns: code_module, code_presentation, gender, region, highest_education, imd_band, age_band, num_of_prev_attempts, studied_credits, disability, ..., avg_score, submission_distance, stu_activity_count, stu_activity_type_count, stu_total_clicks, stu_days_active, mod_pres_vle_type_count, year, month, subject


## Variable preparation
### Scaling and One-Hot Encoding / Ordinal

Because the variables are in different units and scales - e.g. average score (0-100) v number of clicks (in the thousands) - the dataset needs to be scaled/normalised.  The scaler is fitted on the `train` dataset only, and the same transformation (i.e. the same parameters) is applied to the `test` set.  This way there is no 'data leakage' - no information from `test` is used to fit the transformation.

Scaling only applies to numeric variables - that is, variables that can be, for example, mean-centred.  Below I standardise them with `StandardScaler`, which centres each variable on its mean and scales it to unit variance.
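
As a small worked example of fitting on `train` and reusing the parameters on `test`, consider the sketch below. The two toy columns (a score and a click count) and their values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: column 0 is a score (0-100), column 1 a click count (assumption)
train_demo = np.array([[10.0, 1000.0], [50.0, 3000.0], [90.0, 5000.0]])
test_demo = np.array([[50.0, 3000.0], [100.0, 7000.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_demo)  # learns mean and std from train
test_scaled = scaler_demo.transform(test_demo)        # reuses the train mean and std

print(scaler_demo.mean_)          # per-column means learned from train
print(train_scaled.mean(axis=0))  # approximately zero for each column
print(test_scaled)                # test is not re-centred on its own mean
```

The first test row equals the train means, so it maps to zero in both columns. The second row lies above the train range and maps to positive values in both columns, even though the raw units differ by a factor of about a hundred.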

Categorical variables need to be excluded from this process, but they also require transformation.

* One-hot encoding - converts categorical variables into binary vectors.  That is, it creates a new binary column for each category.  For example, `code_module` will have 'AAA' (Y/N), 'BBB' (Y/N) etc. for each row, where only one of the 7 columns will be a Y.  One-hot encoding is used where there is no inherent ordinal relationship between the categories.  For example, module 'AAA' cannot be ranked above or below module 'CCC'.

* Ordinal encoding - is used for categorical variables which have an inherent ordinal structure, i.e. a meaningful order.  For example, 'age_band' can be ordered from lower ages to higher ages, as can 'highest_education' etc. 
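
The difference between the two encodings can be seen on a toy frame. The four rows below are made up; the column names and category labels follow the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Four made-up rows using the dataset's column names and labels
demo = pd.DataFrame({
    "code_module": ["AAA", "CCC", "BBB", "AAA"],
    "age_band": ["35-55", "0-35", "55<=", "0-35"],
})

# One-hot: one binary column per module, exactly one set per row
one_hot = pd.get_dummies(demo["code_module"], prefix="code_module")

# Ordinal: a single integer-valued column that respects the stated order
age_order = ["0-35", "35-55", "55<="]
encoder = OrdinalEncoder(categories=[age_order])
age_codes = encoder.fit_transform(demo[["age_band"]]).ravel()

print(one_hot)
print(age_codes)
```

One-hot encoding adds a column per category, while ordinal encoding keeps a single column whose values carry the ranking.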

In [8]:
numeric_columns = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
non_numeric_columns = X_train.select_dtypes(exclude=['int64', 'float64']).columns.tolist()

print("Numeric Columns:")
print(numeric_columns)
print("\n")
print("Non-Numeric Columns:")
print(non_numeric_columns)


Numeric Columns:
['num_of_prev_attempts', 'studied_credits', 'course_length', 'date_registration', 'date_unregistration', 'prop_submissions', 'avg_score', 'submission_distance', 'stu_activity_count', 'stu_activity_type_count', 'stu_total_clicks', 'stu_days_active', 'mod_pres_vle_type_count']


Non-Numeric Columns:
['code_module', 'code_presentation', 'gender', 'region', 'highest_education', 'imd_band', 'age_band', 'disability', 'year', 'month', 'subject']


In [9]:
#model['highest_education'].unique()
#model['imd_band'].unique()
#model['age_band'].unique()

In [10]:
#print(X_train.isnull().sum())
#print(X_test.isnull().sum())

#print(X_train.info())
#print(X_test.info())


In [11]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, OrdinalEncoder

# nominal and ordinal categorical columns
nominal_cols = ['code_module', 'code_presentation', 'gender', 'region', 'disability', 'month', 'subject', 'year']
ordinal_cols = ['highest_education', 'imd_band', 'age_band']

# ordinal encoding for ordinal variables
ordinal_mapping = {
    'highest_education': {'No Formal quals': 0, 'Lower Than A Level': 1, 'A Level or Equivalent': 2, 'HE Qualification': 3, 'Post Graduate Qualification': 4},
    'imd_band': {'0-10%': 0, '10-20': 1, '20-30%': 2, '30-40%': 3, '40-50%': 4, '50-60%': 5, '60-70%': 6, '70-80%': 7, '80-90%': 8, '90-100%': 9},  
    'age_band': {'0-35': 0, '35-55': 1, '55<=': 2} 
}


In [12]:

# One-Hot Encoding
X_train_nominal_encoded = pd.get_dummies(X_train[nominal_cols])
X_test_nominal_encoded = pd.get_dummies(X_test[nominal_cols])

#print("One-Hot Encoding:")
#print(X_train_nominal_encoded.info())
#print(X_test_nominal_encoded.info())
#print("X_train_nominal_encoded shape:", X_train_nominal_encoded.shape)
#print("X_test_nominal_encoded shape:", X_test_nominal_encoded.shape)

#print("\n") 
#print(X_train_nominal_encoded.isnull().sum())
#print(X_test_nominal_encoded.isnull().sum())
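
Calling `pd.get_dummies` separately on `train` and `test` only yields matching columns if every category appears in both sets. One possible safeguard, sketched here on made-up data rather than the notebook's frames, is to reindex the test dummies to the train columns.

```python
import pandas as pd

# Made-up data: 'London' appears only in test, 'Scotland' and 'Ireland' only in train
train_demo = pd.DataFrame({"region": ["Wales", "Scotland", "Ireland"]})
test_demo = pd.DataFrame({"region": ["Wales", "London"]})

train_dummies = pd.get_dummies(train_demo).astype(int)
test_dummies = pd.get_dummies(test_demo).astype(int)

# Keep exactly the train columns: missing ones are filled with 0,
# and categories unseen in train are dropped
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(test_aligned)
```

With this alignment a test row whose category was never seen in training becomes all zeros, so it does not change the shape of the feature matrix.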

In [13]:
# Ordinal Encoding
ordinal_encoder = OrdinalEncoder(categories=[list(ordinal_mapping[col].keys()) for col in ordinal_cols])
X_train_ordinal_encoded = pd.DataFrame(ordinal_encoder.fit_transform(X_train[ordinal_cols]), columns=ordinal_cols)
X_test_ordinal_encoded = pd.DataFrame(ordinal_encoder.transform(X_test[ordinal_cols]), columns=ordinal_cols)

#print("Ordinal Encoding:")
#print(X_train_ordinal_encoded.info())
#print(X_test_ordinal_encoded.info())
#print("X_train_ordinal_encoded shape:", X_train_ordinal_encoded.shape)
#print("X_test_ordinal_encoded shape:", X_test_ordinal_encoded.shape)

#print(X_train_ordinal_encoded.isnull().sum())
#print(X_test_ordinal_encoded.isnull().sum())


In [14]:

# Reset the indices of ordinal and nominal encoded dataframes
X_train_ordinal_encoded.reset_index(drop=True, inplace=True)
X_train_nominal_encoded.reset_index(drop=True, inplace=True)
X_test_ordinal_encoded.reset_index(drop=True, inplace=True)
X_test_nominal_encoded.reset_index(drop=True, inplace=True)

# Merge ordinal and nominal encoded dataframes using row indices
X_train_merged = pd.concat([X_train_ordinal_encoded, X_train_nominal_encoded], axis=1)
X_test_merged = pd.concat([X_test_ordinal_encoded, X_test_nominal_encoded], axis=1)

# Verify the results of merging ordinal and nominal dataframes
print("Shape of X_train_merged:", X_train_merged.shape)
print("Shape of X_test_merged:", X_test_merged.shape)




Shape of X_train_merged: (25149, 37)
Shape of X_test_merged: (6288, 37)


In [15]:
# standard Scaling
X_train_numeric = X_train[numeric_columns]
X_test_numeric = X_test[numeric_columns]

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_numeric)
X_test_scaled = scaler.transform(X_test_numeric)

# reset the indices 
X_train_scaled_reset = pd.DataFrame(X_train_scaled, columns=numeric_columns).reset_index(drop=True)
X_test_scaled_reset = pd.DataFrame(X_test_scaled, columns=numeric_columns).reset_index(drop=True)

# concatenate merged ordinal and nominal dataframes with scaled dataframes
X_train_transformed = pd.concat([X_train_merged, X_train_scaled_reset], axis=1)
X_test_transformed = pd.concat([X_test_merged, X_test_scaled_reset], axis=1)

# verify the shapes after merging all dataframes
print("Shape of X_train_transformed:", X_train_transformed.shape)
print("Shape of X_test_transformed:", X_test_transformed.shape)


Shape of X_train_transformed: (25149, 50)
Shape of X_test_transformed: (6288, 50)
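
The same three transformations could also be bundled into a single object that is fitted once on `train`. The sketch below uses scikit-learn's `ColumnTransformer` on a made-up three-column frame. It is an alternative arrangement under those assumptions, not the method used in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Made-up frames with one nominal, one ordinal and one numeric column
train_demo = pd.DataFrame({
    "code_module": ["AAA", "BBB", "AAA", "CCC"],
    "age_band": ["0-35", "35-55", "55<=", "0-35"],
    "avg_score": [40.0, 60.0, 80.0, 100.0],
})
test_demo = pd.DataFrame({
    "code_module": ["BBB", "DDD"],  # 'DDD' is unseen in train
    "age_band": ["35-55", "0-35"],
    "avg_score": [70.0, 50.0],
})

preprocessor = ColumnTransformer(
    [
        ("nominal", OneHotEncoder(handle_unknown="ignore"), ["code_module"]),
        ("ordinal", OrdinalEncoder(categories=[["0-35", "35-55", "55<="]]), ["age_band"]),
        ("numeric", StandardScaler(), ["avg_score"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

train_out = preprocessor.fit_transform(train_demo)  # fit on train only
test_out = preprocessor.transform(test_demo)        # reuse the fitted parameters

print(train_out.shape, test_out.shape)
```

Fitting once on `train` keeps the encoders and the scaler consistent between the two sets, and `handle_unknown="ignore"` encodes an unseen category as all zeros instead of raising an error.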


In [16]:
X_train_transformed.to_csv('../data/X_train_transformed.csv', index=False)
X_test_transformed.to_csv('../data/X_test_transformed.csv', index=False)
y_train.to_csv('../data/y_train.csv', index=False)
y_test.to_csv('../data/y_test.csv', index=False)
