## Data preparation
This notebook walks through our data processing pipeline, which transforms the raw datasets into the final processed datasets used as input to our machine learning models.

#### Useful imports

In [2]:
import pandas as pd
import numpy as np

from modules import preparation

%load_ext autoreload
%autoreload 2

## Process data

#### Keep relevant data
- We first keep only sessions that are not "onboarding" sessions and that have at least one participant who answered the relevant feeling-of-learning question.
- We also save the answering participants along with their responses.
- We keep all relevant answers.
- We keep all relevant questions.

In [5]:
preparation.keep_relevant()

In [6]:
preparation.filter_answers_questions()
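
The session filter described above can be sketched with pandas as follows. This is only an illustration on toy data: the table layout and every column name (`is_onboarding`, `question_type`, `response`, and so on) are assumptions, since the internals of `preparation.keep_relevant()` are not shown in this notebook.

```python
import pandas as pd

# Hypothetical toy tables; the real column names are not shown in this notebook.
sessions = pd.DataFrame({
    "session_id": [1, 2, 3],
    "is_onboarding": [False, True, False],
})
answers = pd.DataFrame({
    "session_id": [1, 1, 2, 3],
    "participant_id": ["a", "b", "c", "d"],
    "question_type": ["feeling_of_learning", "quiz", "feeling_of_learning", "quiz"],
    "response": [4, 1, 5, 0],
})

# Answers to the feeling-of-learning question.
fol = answers[answers["question_type"] == "feeling_of_learning"]

# Keep non-onboarding sessions with at least one such answer.
keep = sessions[~sessions["is_onboarding"] & sessions["session_id"].isin(fol["session_id"])]
kept_ids = keep["session_id"].tolist()

# Save the answering participants of the kept sessions with their response.
responders = fol[fol["session_id"].isin(kept_ids)][["session_id", "participant_id", "response"]]
print(kept_ids)
print(responders)
```

In this toy example, session 2 is dropped because it is an onboarding session and session 3 is dropped because nobody answered the feeling-of-learning question.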

#### Augment session information
Add the title length, topic, language, and translation to the session information.

In [7]:
# First translate titles
preparation.translate_titles()

In [None]:
# Add title length, topic, language, translation
preparation.augment_session_info()
# Clear output of cell after running

#### Augment answers with added features
We now create the (almost) final data, which consists of all considered participants' answers with added features from the sessions and questions.

In [9]:
preparation.augment_answers()
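
One plausible way to attach session and question features to each answer is a pair of left joins. The sketch below assumes hypothetical key columns `session_id` and `question_id` and made-up feature values; it is not the actual implementation of `preparation.augment_answers()`.

```python
import pandas as pd

# Hypothetical toy tables; keys and feature names are assumptions.
answers = pd.DataFrame({
    "participant_id": ["a", "b"],
    "session_id": [1, 1],
    "question_id": [10, 11],
    "correctness": [1.0, 0.0],
})
sessions = pd.DataFrame({"session_id": [1], "title_length": [12], "language": ["en"]})
questions = pd.DataFrame({"question_id": [10, 11], "timer": [30.0, None]})

# Left joins keep every answer, even when a feature is missing on the other side.
augmented = (
    answers
    .merge(sessions, on="session_id", how="left")
    .merge(questions, on="question_id", how="left")
)
print(augmented)
```

Left joins preserve one row per answer, so features that are absent on the session or question side show up as missing values to be handled later.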

## Finalize data
We now finalize the dataset(s) so that they can be used directly by our models.

#### Missing data
We impute missing data with a strategy that depends on the nature of the data. For categorical features, we replace missing values with the most frequent class. For numerical features, we replace them with the mean of the defined values.

In [10]:
# We first find out if there are any nan values in our data
nan_columns = preparation.nan_columns()
nan_columns

['mode',
 'feedback_mode',
 'force_reflection',
 'timer',
 'is_solo',
 'correctness']

Of these, some columns are categorical and others are numerical:

In [11]:
cat_columns = ['mode', 'feedback_mode', 'force_reflection', 'is_solo']
num_columns = ['timer', 'correctness']

In [15]:
preparation.impute(cat_columns, num_columns)
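
The imputation strategy described above (most frequent class for categorical columns, mean for numerical columns) can be sketched in plain pandas. The toy values below are made up, and the code is only an assumption about what `preparation.impute` does internally.

```python
import numpy as np
import pandas as pd

# Toy data with one categorical and one numerical column containing gaps.
df = pd.DataFrame({
    "mode": ["quiz", None, "quiz", "exam"],
    "timer": [10.0, np.nan, 30.0, 20.0],
})
cat_columns = ["mode"]
num_columns = ["timer"]

imputed = df.copy()
for col in cat_columns:
    # mode() ignores missing values; take the first mode if there are ties.
    imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])
for col in num_columns:
    # mean() skips NaN by default, so this is the mean of the defined values.
    imputed[col] = imputed[col].fillna(imputed[col].mean())
print(imputed)
```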

#### Encode categorical features

In [16]:
preparation.encode_categorical()
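
The notebook does not say which encoding `preparation.encode_categorical()` applies. As one common option, the sketch below uses one-hot encoding via `pd.get_dummies`; the column names and values are illustrative assumptions.

```python
import pandas as pd

# Toy data; "mode" stands in for any categorical feature.
df = pd.DataFrame({"mode": ["quiz", "exam", "quiz"], "timer": [10.0, 20.0, 30.0]})

# One indicator column per category; non-categorical columns are left untouched.
encoded = pd.get_dummies(df, columns=["mode"], dtype=int)
print(encoded)
```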

#### Normalizing data
So that all features are on the same scale and no feature dominates the model weights simply because of its range, we normalize our data.

In [17]:
preparation.normalize()
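
The exact scaling used by `preparation.normalize()` is not shown here. As a minimal sketch of one possibility, min-max scaling maps each numerical column to the range [0, 1]; standardization to zero mean and unit variance would be an equally plausible choice.

```python
import pandas as pd

# Toy numerical features on different scales.
df = pd.DataFrame({"timer": [10.0, 20.0, 30.0], "title_length": [5.0, 15.0, 10.0]})

# Column-wise min-max scaling. A constant column would divide by zero
# and would need separate handling.
normalized = (df - df.min()) / (df.max() - df.min())
print(normalized)
```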

## Generate aggregated datasets
We can now generate the datasets for our classification models, which take participants' aggregated data as input.

In [18]:
preparation.aggregate_participant_data()
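
Aggregating per participant can be sketched as a pandas group-by. The chosen statistics (means and an answer count) and the column names are assumptions for illustration; the actual aggregates computed by `preparation.aggregate_participant_data()` are not shown in this notebook.

```python
import pandas as pd

# Toy answer-level data; "label" stands in for the participant's target class.
answers = pd.DataFrame({
    "participant_id": ["a", "a", "b"],
    "correctness": [1.0, 0.0, 1.0],
    "timer": [10.0, 30.0, 20.0],
    "label": [1, 1, 0],
})

# One row per participant, summarizing all of their answers.
aggregated = answers.groupby("participant_id").agg(
    correctness_mean=("correctness", "mean"),
    timer_mean=("timer", "mean"),
    n_answers=("correctness", "size"),
    label=("label", "first"),
).reset_index()
print(aggregated)
```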

We also generate balanced datasets such that all labels are equally represented.

In [20]:
preparation.make_balanced_datasets()
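
One simple way to obtain equally represented labels is to undersample every class down to the size of the smallest one. The sketch below assumes this strategy and a hypothetical `label` column; `preparation.make_balanced_datasets()` may balance the data differently (for example by oversampling).

```python
import pandas as pd

# Toy imbalanced data: four samples of class 0, two of class 1.
data = pd.DataFrame({"feature": range(6), "label": [0, 0, 0, 0, 1, 1]})

# Size of the rarest class.
n_min = int(data["label"].value_counts().min())

# Draw n_min rows from each class without replacement.
balanced = (
    data.groupby("label")
    .sample(n=n_min, random_state=0)
    .reset_index(drop=True)
)
print(balanced["label"].value_counts())
```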

## Generate time series datasets
Finally, we adjust the dataset so that it can be passed as input to TensorFlow neural network architectures.

In [26]:
preparation.prepare_time_series_data(5)
preparation.prepare_time_series_data(10)
preparation.prepare_time_series_data(15)
preparation.prepare_time_series_data(20)
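
The meaning of the integer argument is not documented in this notebook. Assuming it is the number of time steps kept per participant, the sketch below shows one way to turn variable-length answer sequences into a fixed-shape array of the form `(samples, time steps, features)` that recurrent layers typically expect. The helper `to_fixed_length` is hypothetical and not part of the `preparation` module.

```python
import numpy as np

def to_fixed_length(sequences, length, pad_value=0.0):
    """Truncate or zero-pad each (steps, features) array to `length` steps."""
    n_features = sequences[0].shape[1]
    out = np.full((len(sequences), length, n_features), pad_value, dtype=np.float32)
    for i, seq in enumerate(sequences):
        steps = min(len(seq), length)
        # Keep the first `steps` rows; any remaining rows stay as padding.
        out[i, :steps] = seq[:steps]
    return out

# Two toy participants with 3 and 7 answers, 2 features each.
seqs = [np.ones((3, 2)), np.ones((7, 2))]
batch = to_fixed_length(seqs, length=5)
print(batch.shape)
```

Shorter sequences are padded at the end and longer ones are truncated, so every participant contributes an array of the same shape.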