## Data Modelling  

This section prepares the data for modelling and steps through the process of building a model.

### Data Splitting

I have had several thoughts about how the data could and should be split:

* train/validate/test split - where the model is trained on `train`, validated and tuned on `validate`, and tested/confirmed on `test`
* train/test split - where the model is trained, validated and tuned on `train` using cross-validation, before the final model is tested/confirmed on `test`
* train/test split on `year` in the dataset - where the model is trained on data from 2013 and tested on 2014 data.  This would mimic the real-world scenario, where an HE institution would use data from previous years to build a model which is applied to the current year's students.  The model would be retrained on an annual basis with new data.  
  * However, the dataset had some `module_presentations` which would not feature in the training data
  * The distribution of the data also differs considerably between years - I focused on `subject` as a proxy for student type and behaviour, as well as curriculum differences between modules, and it varied between years.  
* reworking the `final_result` target variable into two categories - `intervene` and `no_intervene` - and using a binary classification model to predict whether a student would need intervention or not.  This could be a more useful model for the HE institution, as it would be able to identify students who need intervention and target resources at them.  However, this would require a different approach to the modelling, as the target variable would be binary rather than multi-class.
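
The binary reworking in the last bullet could be sketched roughly as follows.  The text does not say which outcomes would count as needing intervention, so the mapping below (Fail and Withdrawn as `intervene`) is an assumption for illustration only, and `to_binary_target` is a hypothetical helper name.

```python
import pandas as pd

# Assumed mapping: which outcomes count as "intervene" is not stated in the notebook
INTERVENTION_MAP = {
    "Pass": "no_intervene",
    "Distinction": "no_intervene",
    "Fail": "intervene",
    "Withdrawn": "intervene",
}


def to_binary_target(final_result: pd.Series) -> pd.Series:
    """Collapse the four final_result classes into intervene / no_intervene."""
    binary = final_result.map(INTERVENTION_MAP)
    # fail loudly on unexpected labels rather than silently producing NaN
    if binary.isnull().any():
        unknown = final_result[binary.isnull()].unique().tolist()
        raise ValueError(f"Unmapped final_result values: {unknown}")
    return binary


# toy example, not real data
example = pd.Series(["Pass", "Withdrawn", "Fail", "Distinction"], name="final_result")
binary_example = to_binary_target(example)
print(binary_example.tolist())
```

Stratifying on the recoded target would then work in the same way as for the four-class target.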

So, I decided to split the whole dataset into `train` and `test` sets, using the `train_test_split` function from `sklearn.model_selection`.  I used a `test_size` of 0.25, which is 25% of the data.  I also set the `random_state` to 567, so that the split would be reproducible.
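
As a rough sketch of how the cross-validation part of this approach might look once the split is made: the example below uses synthetic data from `make_classification` and a placeholder `LogisticRegression` model.  Both are assumptions for illustration, not the data or model used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# synthetic stand-in for the real features and four-class target
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=4, random_state=567
)

# hold out the test set first; it plays no part in tuning
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=567
)

# stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=567)
clf = LogisticRegression(max_iter=1000)  # placeholder model (assumption)
cv_scores = cross_val_score(clf, X_tr, y_tr, cv=cv, scoring="accuracy")

print("Fold accuracies:", cv_scores.round(3))
print("Mean CV accuracy:", cv_scores.mean().round(3))
```

Only `X_tr` and `y_tr` are passed to `cross_val_score`, so the held-out rows stay untouched until the final check.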

In [1]:
# load libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split



In [2]:
# load preprocessed data from csv file
data = pd.read_csv('../../data/final_model_ALL_20230526.csv')

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31437 entries, 0 to 31436
Data columns (total 28 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   code_module                                 31437 non-null  object 
 1   code_presentation                           31437 non-null  object 
 2   id_student                                  31437 non-null  int64  
 3   gender                                      31437 non-null  object 
 4   region                                      31437 non-null  object 
 5   highest_education                           31437 non-null  object 
 6   imd_band                                    31437 non-null  object 
 7   age_band                                    31437 non-null  object 
 8   num_of_prev_attempts                        31437 non-null  int64  
 9   studied_credits                             31437 non-null  int64  
 10  disability

Drop columns as discussed in previous notebooks.  

I still think it would be interesting to build different models with some of these dropped features, although they are different research questions which need to be properly considered. 

In [4]:
model = data.copy()
# columns to drop
columns_to_drop = ['code_module','code_presentation', 'id_student', 'gender', 'region', 'highest_education', 'imd_band',
                   'age_band', 'disability', 'course_length', 'unregistration_before_registration',
                   'unregistration_before_registration_14_days', 'mod_pres_vle_type_count', 'year', 'month','date_registration', 'date_unregistration',]

# drop columns
model = model.drop(columns=columns_to_drop)


Save into standard `X` and `y` variables.

Use stratification to ensure that the proportions of `final_result` are the same in both the `train` and `test` sets.  This is important as the `final_result` is the target variable and we want to ensure that the model is trained on a representative sample of the data.  I used the `stratify` parameter in the `train_test_split` function to do this.

In [5]:

# drop target from X, save target to y
X = model.drop('final_result', axis=1)  
y = model['final_result']  

# split data into train and test sets with stratification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=567)


Below I have checked to see that the proportions are the same in both the `train` and `test` sets.

In [6]:
# proportions of target variable in original data
original_proportions = model['final_result'].value_counts(normalize=True)

# proportions of target variable in train and test sets
train_proportions = y_train.value_counts(normalize=True)
test_proportions = y_test.value_counts(normalize=True)

# results
print("Original Proportions:")
print(original_proportions)

print("\nTrain Set Proportions:")
print(train_proportions)

print("\nTest Set Proportions:")
print(test_proportions)


Original Proportions:
Pass           0.376276
Withdrawn      0.314311
Fail           0.219550
Distinction    0.089862
Name: final_result, dtype: float64

Train Set Proportions:
Pass           0.376257
Withdrawn      0.314332
Fail           0.219536
Distinction    0.089876
Name: final_result, dtype: float64

Test Set Proportions:
Pass           0.376336
Withdrawn      0.314249
Fail           0.219593
Distinction    0.089822
Name: final_result, dtype: float64


Another standard check is to ensure that there are no missing values in the datasets.

In [7]:
# missing values in X_train, X_test, y_train, y_test
missing_values_X_train = X_train.isnull().sum()
missing_values_X_test = X_test.isnull().sum()
missing_values_y_train = y_train.isnull().sum()
missing_values_y_test = y_test.isnull().sum()


# rows with missing values
rows_with_missing_X_train = X_train[X_train.isnull().any(axis=1)]
rows_with_missing_X_test = X_test[X_test.isnull().any(axis=1)]
rows_with_missing_y_train = y_train[y_train.isnull()]
rows_with_missing_y_test = y_test[y_test.isnull()]



# results
print("Missing values in X_train:", len(rows_with_missing_X_train))
print("Missing values in X_test:", len(rows_with_missing_X_test))
print("Missing values in y_train:", len(rows_with_missing_y_train))
print("Missing values in y_test:", len(rows_with_missing_y_test))



Missing values in X_train: 0
Missing values in X_test: 0
Missing values in y_train: 0
Missing values in y_test: 0


### Feature Preparation

In [8]:
from sklearn.preprocessing import StandardScaler, OrdinalEncoder

In [9]:
numeric_columns = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
non_numeric_columns = X_train.select_dtypes(exclude=['int64', 'float64']).columns.tolist()

print("Numeric Columns:")
print(numeric_columns)
print("\n")
print("Non-Numeric Columns:")
print(non_numeric_columns)


Numeric Columns:
['num_of_prev_attempts', 'studied_credits', 'prop_submissions', 'avg_score', 'submission_distance', 'stu_activity_count', 'stu_activity_type_count', 'stu_total_clicks', 'stu_days_active']


Non-Numeric Columns:
['subject']


#### One-Hot / Ordinal Encoding Categorical Variables

Categorical values need to be converted into numerical values for the model.  There are two main approaches:

* One-hot encoding - creates a new binary column for each category.  For example, `subject` is encoded as two features, `subject_socsci` and `subject_stem`, each of which is either 0 or 1 for each row.
* Ordinal encoding - maps each category to an integer, which suits categories with a natural order.

Originally, I needed to consider both, but with the current dataset only one-hot encoding is required.  [model_01_plan](../V1/model_01_plan%20%2B%20split%20%2B%20scale.ipynb) has the initial exploration of ordinal encoding.



In [10]:
nominal_cols = ['subject']

# One-Hot Encoding
X_train_nominal_encoded = pd.get_dummies(X_train[nominal_cols])
X_test_nominal_encoded = pd.get_dummies(X_test[nominal_cols])

# reset indices
X_train_nominal_encoded.reset_index(drop=True, inplace=True)
X_test_nominal_encoded.reset_index(drop=True, inplace=True)

print("Shape of X_train_nominal_encoded:", X_train_nominal_encoded.shape)
print("Shape of X_test_nominal_encoded:", X_test_nominal_encoded.shape)


Shape of X_train_nominal_encoded: (23577, 2)
Shape of X_test_nominal_encoded: (7860, 2)
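
Calling `pd.get_dummies` separately on `train` and `test` only gives matching columns when both sets contain the same categories, which is the case here (two columns each).  If a category were missing from one set, the columns could be aligned along the lines of the sketch below.  The toy frames are made up for illustration; they are not the real data.

```python
import pandas as pd

# toy frames, not the real data: the test set is missing one category
train_demo = pd.DataFrame({"subject": ["stem", "socsci", "stem"]})
test_demo = pd.DataFrame({"subject": ["stem", "stem"]})

train_encoded = pd.get_dummies(train_demo[["subject"]])
test_encoded = pd.get_dummies(test_demo[["subject"]])

# force the test columns to match the training columns:
# categories missing from test become all-zero columns,
# categories not seen in training are dropped
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(train_encoded.columns))
print(list(test_aligned.columns))
```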



#### Scaling Numerical Variables

Because the variables are in different units and scales, e.g. average score (0-100) versus total clicks (000s), the dataset needs to be scaled/normalised.  

The scaler is fitted on the `train` dataset and the same transformation (i.e. the same parameters) is applied to the `test` set.  This way there is no 'data leakage': no information from `test` is used to set the scaling parameters.

Scaling only applies to 'numeric' variables, that is, variables which can be, for example, mean-centred.  `StandardScaler`, which I apply below, centres each variable on its mean and scales it to unit variance.
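
To make the 'same parameters' point concrete, here is a small sketch with made-up numbers (not real data).  It computes the mean and standard deviation from the toy training column only, applies them to a toy test value, and compares the result with what `StandardScaler` produces.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy values, not real data: a single column of scores
train_col = np.array([[40.0], [60.0], [80.0]])
test_col = np.array([[100.0]])

# parameters are learned from the training data only
train_mean = train_col.mean(axis=0)
train_std = train_col.std(axis=0)  # population standard deviation (ddof=0)

# the same parameters are then applied to the test data
manual_test_scaled = (test_col - train_mean) / train_std

# StandardScaler does the same: fit on train, transform test
demo_scaler = StandardScaler().fit(train_col)
sklearn_test_scaled = demo_scaler.transform(test_col)

print(manual_test_scaled, sklearn_test_scaled)
```

The test value is scaled relative to the training mean and spread, so it can legitimately fall outside the range seen in the scaled training data.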



In [11]:
# standard Scaling
X_train_numeric = X_train[numeric_columns]
X_test_numeric = X_test[numeric_columns]

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_numeric)
X_test_scaled = scaler.transform(X_test_numeric)

# reset indices 
X_train_scaled_reset = pd.DataFrame(X_train_scaled, columns=numeric_columns).reset_index(drop=True)
X_test_scaled_reset = pd.DataFrame(X_test_scaled, columns=numeric_columns).reset_index(drop=True)

# concatenate encoded nominal dataframes with scaled dataframes
X_train_transformed = pd.concat([X_train_nominal_encoded, X_train_scaled_reset], axis=1)
X_test_transformed = pd.concat([X_test_nominal_encoded, X_test_scaled_reset], axis=1)

# check shapes of the combined dataframes
print("Shape of X_train_transformed:", X_train_transformed.shape)
print("Shape of X_test_transformed:", X_test_transformed.shape)


Shape of X_train_transformed: (23577, 11)
Shape of X_test_transformed: (7860, 11)


In [12]:
X_train_transformed.to_csv('../../data/X_train_transformed.csv', index=False)
X_test_transformed.to_csv('../../data/X_test_transformed.csv', index=False)
y_train.to_csv('../../data/y_train.csv', index=False)
y_test.to_csv('../../data/y_test.csv', index=False)


#### Draft function for handling unseen data - first transformation

In [None]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

def preprocess_data(data):
    # columns to be used for each type of variable
    numeric_columns = ['num_of_prev_attempts', 'studied_credits', 'prop_submissions', 'avg_score', 'submission_distance', 'stu_activity_count', 'stu_activity_type_count', 'stu_total_clicks', 'stu_days_active']
    nominal_columns = ['subject']

    # check required columns exist
    missing_numeric_cols = [col for col in numeric_columns if col not in data.columns]
    missing_nominal_cols = [col for col in nominal_columns if col not in data.columns]

    assert not missing_numeric_cols, f"Missing numeric columns: {', '.join(missing_numeric_cols)}"
    assert not missing_nominal_cols, f"Missing nominal columns: {', '.join(missing_nominal_cols)}"

    # drop unneeded columns
    unneeded_cols = [col for col in data.columns if col not in numeric_columns + nominal_columns]
    data = data.drop(unneeded_cols, axis=1)

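    # preprocessing for each type of variable
    numeric_transformer = StandardScaler()
    # `sparse_output` replaces the older `sparse` argument (scikit-learn 1.2 onwards)
    nominal_transformer = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

    # ColumnTransformer for appropriate transformations
    preprocessor = ColumnTransformer(
        transformers=[
            ('numeric', numeric_transformer, numeric_columns),
            ('nominal', nominal_transformer, nominal_columns)
        ])

    # fit and transform the data
    # NOTE: this refits the scaler and encoder on whatever data is passed in; for genuinely
    # unseen data, the preprocessor fitted on `train` should be reused with `transform` only
    transformed_data = preprocessor.fit_transform(data)

    return transformed_data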


# Example usage
data = pd.read_csv('data.csv')  # Load your dataset here
preprocessed_data = preprocess_data(data)
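
One possible next step for this draft is to separate fitting from transforming, so that unseen data is transformed with parameters learned from `train` rather than refitted.  The sketch below is only an illustration: `build_preprocessor` is a hypothetical helper, the toy frames use just two of the real column names, and `sparse_threshold=0` is used to get a dense array from the `ColumnTransformer`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_preprocessor(numeric_columns, nominal_columns):
    """Return an unfitted ColumnTransformer (hypothetical helper)."""
    return ColumnTransformer(
        transformers=[
            ("numeric", StandardScaler(), numeric_columns),
            ("nominal", OneHotEncoder(handle_unknown="ignore"), nominal_columns),
        ],
        sparse_threshold=0,  # always return a dense array
    )


# toy frames, not real data; only two of the real columns are used
train_df = pd.DataFrame(
    {"avg_score": [40.0, 60.0, 80.0], "subject": ["stem", "socsci", "stem"]}
)
unseen_df = pd.DataFrame({"avg_score": [60.0], "subject": ["other"]})

# fit once on the training data ...
fitted = build_preprocessor(["avg_score"], ["subject"]).fit(train_df)

# ... then only transform unseen data, reusing the training parameters
unseen_transformed = fitted.transform(unseen_df)
print(unseen_transformed)
```

Because the encoder uses `handle_unknown="ignore"`, a category that was not present in training (here the made-up value `other`) is encoded as all zeros instead of raising an error.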
