## M4 | Data Preparation

This notebook prepares the data for our different prediction tasks.

**Research question**: Can we predict students' reflection responses to the question "How do you feel about your learning progress?" from their session interactions (response time, response correctness) and from the characteristics of the session (number of questions, feedback mode, time of day, etc.)?

#### Useful imports and setup

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pathlib import Path

# Self defined modules
from modules import data

%load_ext autoreload
%autoreload 2

DATA_DIR = '../data'

### Pre-process raw data
*Note: skip to later sections if datasets have already been computed*

We build a dataframe from the raw data with columns `[participant_id, answer_time, mode, feedback_mode, force_reflection, timer, is_solo, video, image, correctness, nth_answer, response]`. This dataframe can be used both to aggregate participant answers and for time series analysis.

In [2]:
path_to_processed_data = '{}/processed/time-series-processed.csv.gz'.format(DATA_DIR)
path = Path(path_to_processed_data)

# Compute the processed dataset if it does not exist
if not path.is_file():
    raw_data_dir = '{}/raw'.format(DATA_DIR)
    data.process_time_series_data(raw_data_dir, path)

# Load the processed data
df = data.load_dataframe(path_to_processed_data)
df.head()

TypeError: object of type 'float' has no len()

### Missing data
We impute missing data with a strategy that depends on the type of feature. For categorical features, we replace missing values with the most frequent class. For numerical features, we replace them with the mean of the defined values.

In [None]:
# Are there any nan values in the data?
nan_columns = df.columns[df.isna().any()].to_list()
nan_columns

In [None]:
# Impute missing values
from sklearn.impute import SimpleImputer
imp_frequent = SimpleImputer(strategy='most_frequent')
imp_mean = SimpleImputer(strategy='mean')
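# Numerical features: replace NaN with the column mean
for col in ['answer_time', 'timer', 'correctness']:
    # fit_transform expects a 2D input and returns a 2D array, hence [[col]] and ravel()
    df[col] = imp_mean.fit_transform(df[[col]]).ravel()
# Categorical features: replace NaN with the most frequent class
for col in ['mode', 'feedback_mode', 'force_reflection', 'is_solo']:
    df[col] = imp_frequent.fit_transform(df[[col]]).ravel()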
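
As a small illustration of the two strategies described above, the sketch below applies them to a toy frame. The frame and its values are made up for the example, and plain pandas is used here instead of `SimpleImputer` to keep the example short.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the real dataset)
toy = pd.DataFrame({
    'answer_time': [2.0, np.nan, 4.0],
    'mode': ['quiz', None, 'quiz'],
})

# Numerical feature: replace NaN with the mean of the defined values
toy['answer_time'] = toy['answer_time'].fillna(toy['answer_time'].mean())

# Categorical feature: replace missing entries with the most frequent class
toy['mode'] = toy['mode'].fillna(toy['mode'].mode().iloc[0])

print(toy)
```

The missing `answer_time` becomes the mean of the two defined values, and the missing `mode` takes the only class present.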

### Normalizing data
To put all features on the same scale, we rescale the numerical features that are not already in [0, 1] to that range using min-max scaling.

In [None]:
# Scale numerical features not in [0,1] to [0,1]
from sklearn.preprocessing import MinMaxScaler
min_max = MinMaxScaler()
for col in ['answer_time', 'timer']:
    df[col] = min_max.fit_transform(df[col].values.reshape(-1, 1))
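
For reference, min-max scaling maps each value with x' = (x - min) / (max - min). The sketch below applies the formula by hand to a toy array; the values are hypothetical, and it assumes the column is not constant (max > min).

```python
import numpy as np

# Toy answer times in seconds (hypothetical values)
x = np.array([1.0, 3.0, 5.0, 9.0])

# Min-max scaling: x' = (x - min) / (max - min)
x_scaled = (x - x.min()) / (x.max() - x.min())

print(x_scaled)
```

The smallest value maps to 0, the largest to 1, and the others fall proportionally in between.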

### Encode categorical features

In [None]:
# Encode categorical columns
# Each category is mapped to an integer code, stored as a float
categorical_cols = ['feedback_mode', 'mode', 'force_reflection', 'is_solo', 'video', 'image']
for col in categorical_cols:
    df[col] = df[col].astype('category').cat.codes.astype('float')
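
To make the encoding concrete, the sketch below runs the same `astype('category').cat.codes` conversion on a toy column. The category names are hypothetical. Note that pandas encodes a missing entry as `-1`, so if `video` or `image` (which are not imputed above) contain missing values, they would end up as `-1.0`, the same value used for padding later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) with one missing entry
s = pd.Series(['delayed', 'immediate', np.nan, 'delayed'])

cat = s.astype('category')
codes = cat.cat.codes.astype('float')

# Categories are sorted, so 'delayed' -> 0.0 and 'immediate' -> 1.0;
# the missing entry is encoded as -1.0
print(dict(enumerate(cat.cat.categories)))
print(codes.tolist())
```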

#### Save the final dataset

In [None]:
display(df.head())
df.to_csv(f'{DATA_DIR}/processed/final.csv.gz', index=False, compression='gzip')

## Adapt for TensorFlow
Finally, we adjust the dataset so that it can be passed to TensorFlow as input tensors.

In [None]:
# Load the dataset
df = pd.read_csv(f'{DATA_DIR}/processed/final.csv.gz')

To perform predictions on our time series data, we decide on a fixed number of time steps to consider.

In [None]:
# We consider a fixed number of time steps
N_STEPS = 10 # Note: can be tuned
# Keep only the first answers of each participant
df = df[df.nth_answer < N_STEPS]
# Extract labels (first recorded response of each participant)
labels = df.groupby('participant_id').response.first()
# Drop unused columns (reassign instead of inplace, since df is a filtered slice)
df = df.drop(columns=['nth_answer', 'response'])
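
The truncation and label extraction can be checked on a toy frame. The sketch below assumes that `nth_answer` starts at 0 and that `response` is the same on every row of a participant; both are assumptions about the processed data, and all values are made up.

```python
import pandas as pd

N_STEPS = 2  # small value for illustration only

# Toy data (hypothetical): participant 'a' has 3 answers, 'b' has 1
toy = pd.DataFrame({
    'participant_id': ['a', 'a', 'a', 'b'],
    'nth_answer': [0, 1, 2, 0],
    'correctness': [1.0, 0.0, 1.0, 1.0],
    'response': [4, 4, 4, 2],
})

# Keep only the first N_STEPS answers of each participant
toy = toy[toy.nth_answer < N_STEPS]

# One label per participant
toy_labels = toy.groupby('participant_id').response.first()

print(toy_labels.to_dict())
```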

In [None]:
# Feature columns = all columns except participant_id
feature_cols = ['answer_time', 'mode', 'feedback_mode', 'force_reflection', 'timer', 'is_solo', 'video', 'image', 'correctness']
N_FEATURES = len(feature_cols)

Since not all participants have enough answers, we need to pad our data:

In [None]:
PAD_VALUE = -1.0

def pad(values, n_steps=N_STEPS, pad_val=PAD_VALUE):
    return np.pad(values, [(0, n_steps-values.shape[0]), (0, 0)], mode='constant', constant_values=pad_val)

df = df.groupby('participant_id').apply(lambda r: np.stack(pad(r[feature_cols].values), axis=0)).explode()
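
To show what the padding does, the sketch below applies the same `pad` function to a single toy participant with fewer answers than `N_STEPS`. The array values and the smaller `N_STEPS` are chosen only for illustration.

```python
import numpy as np

N_STEPS = 4
PAD_VALUE = -1.0

def pad(values, n_steps=N_STEPS, pad_val=PAD_VALUE):
    # Append rows of pad_val until the array has n_steps rows
    return np.pad(values, [(0, n_steps - values.shape[0]), (0, 0)],
                  mode='constant', constant_values=pad_val)

# Toy participant with 2 answers and 3 features (hypothetical values)
answers = np.array([[0.1, 1.0, 0.0],
                    [0.4, 0.0, 1.0]])

padded = pad(answers)
print(padded.shape)  # (4, 3)
print(padded)
```

The two real answers are kept at the top and the remaining rows are filled with `PAD_VALUE`.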

In [None]:
# Collect the exploded rows into a single 2D array
X = np.array(df.to_list())
# Reshape as tensor
X = X.reshape(-1, N_STEPS, N_FEATURES)
X.shape
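
A downstream model may need to tell real answers from padded ones. One possible way, sketched below on a toy tensor with hypothetical values, is to treat a time step as padding when all of its features equal `PAD_VALUE`. This assumes no real time step consists entirely of `-1.0` values.

```python
import numpy as np

PAD_VALUE = -1.0

# Toy tensor (hypothetical): 2 participants, 3 time steps, 2 features
X_toy = np.array([
    [[0.2, 1.0], [0.5, 0.0], [PAD_VALUE, PAD_VALUE]],
    [[0.9, 1.0], [PAD_VALUE, PAD_VALUE], [PAD_VALUE, PAD_VALUE]],
])

# A time step is real if at least one feature differs from PAD_VALUE
is_real = (X_toy != PAD_VALUE).any(axis=-1)

# Number of real answers per participant
lengths = is_real.sum(axis=1)
print(lengths.tolist())
```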

In [None]:
np.save(f'{DATA_DIR}/processed/{N_STEPS}-steps.npy', X)
labels.to_csv(f'{DATA_DIR}/processed/participant-labels.csv.gz', compression='gzip')