# Assignment 3
In this assignment you will work with an accelerometer dataset recorded while participants in a study were performing some task, for example, jogging. Our task is to predict the activity of a participant given the accelerometer data. This is known as the activity recognition task.

The data is gathered from a laboratory experiment which is described in more detail [here](http://www.cis.fordham.edu/wisdm/dataset.php) and published in the paper "Activity Recognition using Cell Phone Accelerometers" (Kwapisz et al. 2010).

Your task in this assignment is to train a neural network that predicts the activity being performed. You can use whatever network architecture you want. To pass this assignment you will need to achieve an accuracy higher than 80%; for a higher score you will need to achieve an accuracy higher than 90%.

In order to make the assignment easier we provide you with the data preprocessing code so you only need to implement the model. As a starting point, try these layers when testing architectures.
- `layer_dense`
- `layer_conv_1d`
- `layer_gru`

In order to achieve accuracy higher than 90% you might need to preprocess the data differently. We leave that up to you. 

First we load the required libraries. In the file `assignment-3-helpers.R` you will find the function `create_sequences_x_y <- function(data, sequence_length, target_shift, step_shift)`, which we use to generate the sequences. You will also find the function `load_activity_dataset <- function()`, which loads the dataset.

In [None]:
library(keras)
library(ggplot2)
source("assignment-3-helpers.R")

set_thread_count(2L)
# This specifies the number of threads TensorFlow will use. You can change this number,
# but the optimal number of threads depends on the model.
# In my experiments 2 threads performed quite well for simple RNNs.
# Using more than 2 threads increased communication overhead between CPUs and decreased the training speed.
# To change this value, set the value you want and then
# click "Kernel" -> "Shutdown" and then "Kernel" -> "Restart".
# If you do not shut down the kernel, the change will not take effect.

## 1. Prepare the data
In this assignment we will do the usual preprocessing steps: read the data, split it and scale it. You will not have to do much in this part, but we recommend reading it through as you might want to adjust it later.

## 1.1 Read data
In the cell below we load the dataset by calling the function `load_activity_dataset()`. This will download the data into the folder `data/WISDM_ar_v1.1` and return the contents of the file `data/WISDM_ar_v1.1/WISDM_ar_v1.1_raw_cleaned.txt`. We also recommend reading the `data/WISDM_ar_v1.1/readme.txt` file supplied by the researchers with the dataset.

We save the data as `data`.

In [None]:
data <- load_activity_dataset()
dim(data)

In [None]:
head(data)

The dataset contains 6 columns. The following description is adapted from the file `data/WISDM_ar_v1.1/WISDM_ar_v1.1_raw_about.txt`.
- UserId: nominal, 1..36. A unique identifier per study participant.
- Activity: nominal, {Walking, Jogging, Sitting, Standing, Upstairs, Downstairs }. The task which the participant is performing during the measurement. We want to predict this value.
- Timestamp: numeric, generally the phone's uptime in nanoseconds.
- x-acceleration: numeric, floating-point values between -20 .. 20. The acceleration in the x direction as measured by the Android phone's accelerometer. A value of 10 = 1g = 9.81 m/s^2, and 0 = no acceleration. The acceleration recorded includes gravitational acceleration toward the center of the Earth, so that when the phone is at rest on a flat surface the vertical axis will register ±10.
- y-accel: numeric, see x-acceleration
- z-accel: numeric, see x-acceleration

A datapoint is collected every 50 ms, that is, at 20 Hz. We want to create sequences which last roughly 5 seconds, so at this sample rate we need a sequence length of 100.
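
The arithmetic behind the sequence length can be written out as a short sketch. The variable names are chosen here for illustration only; they are not used elsewhere in the notebook.

```r
# Sampling parameters taken from the text above.
sample_interval_ms <- 50   # one datapoint every 50 ms
window_seconds <- 5        # desired duration of one sequence

# Samples per second, then samples per window.
sample_rate_hz <- 1000 / sample_interval_ms
sequence_length <- window_seconds * sample_rate_hz

sample_rate_hz    # 20
sequence_length   # 100
```

If you later try a different window duration, the same calculation gives the matching `sequence_length`.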

We will not use the "Timestamp" column, so let us drop it.

In [None]:
data <- data[, -(3)]

## 1.2 Some data exploration
The bar chart below shows that the activity distribution is not balanced. Walking and jogging are by far the most common activities. We will need to take this into account when splitting the dataset into train/val/test, as we want each dataset to have the same distribution of activities.
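
If you prefer numbers to a plot, the class shares can also be tabulated. The sketch below uses a small made-up vector in place of the real `Activity` column, so the values are illustrative only; on the real data the same two calls could presumably be applied to `data$Activity`.

```r
# Toy stand-in for the Activity column (made-up values, not the real data).
activity <- c("Walking", "Walking", "Walking", "Jogging", "Jogging", "Sitting")

# Count each class, then convert the counts to fractions of the total.
activity_counts <- table(activity)
activity_share <- prop.table(activity_counts)

round(activity_share, 2)
```

Comparing these fractions between splits is a quick way to see whether the class distribution has been preserved.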

In [None]:
library(ggplot2)
ggplot(data, aes(x=Activity)) +
  geom_bar()

Let us also plot the number of examples we have for each user.

In [None]:
library(ggplot2)
ggplot(data, aes(x=UserId)) +
  geom_bar()

## 1.3 Dataset splitting
We want to create a neural network model which can correctly recognise the activity of a new user. We will therefore evaluate the model on users we do not train on, which means we need to assign whole users to our training, validation and test sets.

We will do this manually and check the class distribution of each dataset.
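
The idea of splitting by user can be sketched on a toy data frame. Everything in this block is made up for illustration (the data, the user ids assigned to each split and the variable names); it only shows how to check that no user ends up in two splits and how to compare class shares.

```r
# Toy data (made-up): six users with four rows each.
toy <- data.frame(
  UserId = rep(1:6, each = 4),
  Activity = rep(c("Walking", "Jogging", "Walking", "Sitting"), times = 6)
)

# Hypothetical assignment of users to the validation and test splits.
val_users <- c(4)
test_users <- c(5, 6)

val_rows <- toy$UserId %in% val_users
test_rows <- toy$UserId %in% test_users
train_rows <- !(val_rows | test_rows)

# No user may appear in more than one split.
train_users <- unique(toy$UserId[train_rows])
n_shared_users <- length(intersect(train_users, c(val_users, test_users)))
n_shared_users   # 0

# Fraction of rows in the training split and class shares per split.
mean(train_rows)
prop.table(table(toy$Activity[train_rows]))
prop.table(table(toy$Activity[val_rows]))
```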

In [None]:
# We select users from 24 up to and including 30 to be in our validation data.
val_idx <- data$UserId %in% 24:30
val_data <- data[val_idx, ]
# The fraction of the total data
nrow(val_data)/nrow(data)
ggplot(val_data, aes(x=Activity)) +
  geom_bar()

The distribution is not exactly the same as we saw over the whole data. We seem to be lacking some "Sitting" examples.

Now for the test dataset.

In [None]:
# We select users from 31 up to and including 36 to be in our test data.
test_idx <- data$UserId %in% 31:36
test_data <- data[test_idx, ]
nrow(test_data)/nrow(data)
ggplot(test_data, aes(x=Activity)) +
  geom_bar()

Our test dataset seems to have a distribution similar to that of the whole dataset, and in the cell below we will see that our training set is very similar to our test set. We thus expect our model to generalise well from the training data to the test data, and we expect it to perform better on the test set than on the validation set. In other words, our validation dataset will be harder than the test dataset. Keep this in mind when training and evaluating your model.

Our training dataset will be the remaining rows:

In [None]:
# We take the complement of the rows in the validation and test sets.
train_idx <- !(val_idx | test_idx)
train_data <- data[train_idx, ]
nrow(train_data)/nrow(data)
ggplot(train_data, aes(x=Activity)) +
  geom_bar()

Now we drop the `UserId` column as we will not use it to make predictions.

In [None]:
train_data <- train_data[, -(1)]
val_data <- val_data[, -(1)]
test_data <- test_data[, -(1)]
dim(train_data)
dim(val_data)
dim(test_data)

## 1.4 Creating sequences
Now we want to create the sequences which we will use to train and evaluate our model. Our sequences will be 5 seconds long, that is, a sequence length of 100. We use the label of the last element in the sequence as the target (`target_shift = -1`). Since our dataset is quite large we get plenty of examples, so we shift 50 steps for each sequence to keep the number of sequences in an acceptable range (this sequence length and shift were set after some experimentation). Feel free to adjust these values if you like.
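
To make the effect of `sequence_length` and `step_shift` concrete, here is a minimal sketch of overlapping windows on a toy label vector. The function `window_starts` is a hypothetical illustration written for this example; it is not the implementation of `create_sequences_x_y`, whose internals may differ.

```r
# Hypothetical helper: start index of every full window of `sequence_length`
# rows when the window is moved `step_shift` rows at a time.
window_starts <- function(n_rows, sequence_length, step_shift) {
  if (n_rows < sequence_length) return(integer(0))
  seq(from = 1, to = n_rows - sequence_length + 1, by = step_shift)
}

# Toy series of 300 rows with one label per row (made-up data).
labels <- rep(c("Walking", "Jogging", "Sitting"), each = 100)
starts <- window_starts(length(labels), sequence_length = 100, step_shift = 50)
ends <- starts + 100 - 1

starts         # 1 51 101 151 201
labels[ends]   # label of the last row in each window
```

With `n` rows, window length `L` and shift `s`, this scheme yields `floor((n - L) / s) + 1` windows, so halving the shift roughly doubles the number of sequences.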

In [None]:
train_seq <- create_sequences_x_y(data = train_data, sequence_length = 100, target_shift = -1, step_shift = 50)
dim(train_seq$x)
dim(train_seq$y)
val_seq <- create_sequences_x_y(data = val_data, sequence_length = 100, target_shift = -1, step_shift = 50)
dim(val_seq$x)
dim(val_seq$y)
test_seq <- create_sequences_x_y(data = test_data, sequence_length = 100, target_shift = -1, step_shift = 50)
dim(test_seq$x)
dim(test_seq$y)

A minor note: when creating the sequences we created some sequences which contain data from two different UserIds, which would never happen in real life. We do not care much about this defect in our data processing, since the number of sequences which span two different UserIds is **very** small (at most 36*2 out of 14132+3832+3995).

Now we drop the columns which we do not need. We drop the label from the input sequences and drop the accelerometer data from the targets.

In [None]:
# Drop the label column from the input features, x.
train_seq$x <- train_seq$x[ , , -1]
# Keep the labels for the y data.
train_seq$y <- train_seq$y[ , 1]
head(train_seq$x)
head(train_seq$y)

We do the same for the validation and test data.

In [None]:
val_seq$x <- val_seq$x[ , , -1]
val_seq$y <- val_seq$y[ , 1]
test_seq$x <- test_seq$x[ , , -1]
test_seq$y <- test_seq$y[ , 1]

## 1.5 Scaling
To help our network train faster, we need to **scale** our data. In this assignment we will scale our data using the Min/Max approach, the same approach we used in the last assignment.

We will scale each feature so that its largest value in our training data becomes `1` and its smallest value becomes `0`. To achieve this we do:

$$
x' = \frac{x - \min(\boldsymbol{x})}{\max(\boldsymbol{x}) - \min(\boldsymbol{x})}
$$

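Here $x$ is a single value of a feature and $x'$ is its scaled value. $\min(\boldsymbol{x})$ is the smallest value of that feature in the training set and $\max(\boldsymbol{x})$ is the largest. The validation and test sets are scaled with the same training minimum and maximum, so their scaled values can fall slightly outside the range from `0` to `1`.
For the interested we recommend the [Wikipedia article](https://en.wikipedia.org/wiki/Feature_scaling).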
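
As a small worked example of the formula, the sketch below scales made-up values for a single feature. The helper `min_max_scale` is hypothetical and only exists for this illustration; the numbers are not taken from the dataset.

```r
# Made-up values for a single feature.
train_values <- c(-4, 0, 6, 16)
val_values <- c(-6, 1, 18)

# The minimum and maximum are always taken from the training values.
min_train <- min(train_values)
max_train <- max(train_values)

# Hypothetical helper that applies the formula with the training min and max.
min_max_scale <- function(x) (x - min_train) / (max_train - min_train)

min_max_scale(train_values)  # 0.0 0.2 0.5 1.0
min_max_scale(val_values)    # -0.10 0.25 1.10
```

Note that the validation values `-6` and `18` lie outside the training range and are therefore mapped below `0` and above `1`.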

In [None]:
# We initialise our arrays in the shape we want them to be.
x_train_scaled <- array(0, dim = dim(train_seq$x))
x_val_scaled <- array(0, dim = dim(val_seq$x))
x_test_scaled <- array(0, dim = dim(test_seq$x))
dim(x_train_scaled)
dim(x_val_scaled)
dim(x_test_scaled)


for (j in 1:dim(train_seq$x)[3]) {
    # For each feature we compute the min and max of the training data
    min_train <- min(as.numeric(train_seq$x[,,j]))
    max_train <- max(as.numeric(train_seq$x[,,j]))
    
    # For each dataset and feature we scale the values according to the max/min
    x_train_scaled[,,j] <- (as.numeric(train_seq$x[,,j]) - min_train) / (max_train - min_train)
    x_val_scaled[,,j] <- (as.numeric(val_seq$x[,,j]) - min_train) / (max_train - min_train)
    x_test_scaled[,,j] <- (as.numeric(test_seq$x[,,j]) - min_train) / (max_train - min_train)
}
min(x_train_scaled)
max(x_train_scaled)

We also need to map our labels to a one-hot encoded representation. As a first step we map the text to a numerical value, and then we use the `to_categorical` function from Keras. The `category_to_label` function below maps the text to a numerical value which the `to_categorical` function can translate.
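
To see what a one-hot encoding looks like, here is a minimal base R sketch on three made-up labels. It is an illustration only and does not replace `to_categorical`; the class order is assumed to match the zero-based mapping used by `category_to_label` in this section.

```r
# Class names in alphabetical order, so that the zero-based index of each
# class matches the mapping used in this section (an assumption of this sketch).
classes <- c("Downstairs", "Jogging", "Sitting", "Standing", "Upstairs", "Walking")

# Made-up labels for three sequences.
labels <- c("Walking", "Jogging", "Downstairs")

# Zero-based class index, computed for all labels at once with match().
label_index <- match(labels, classes) - 1

# One row per label, one column per class, with a single 1 in each row.
one_hot <- diag(length(classes))[label_index + 1, , drop = FALSE]
colnames(one_hot) <- classes

label_index  # 5 1 0
one_hot
```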

In [None]:
category_to_label <- function(category) {
    if (category == "Downstairs")
        0
    else if (category == "Jogging")
        1
    else if (category == "Sitting")
        2
    else if (category == "Standing")
        3
    else if (category == "Upstairs")
        4
    else if (category == "Walking")
        5
    else
        -1
}

In [None]:
y_train <- to_categorical(lapply(train_seq$y, FUN = category_to_label))
y_val <- to_categorical(lapply(val_seq$y, FUN = category_to_label))
y_test <- to_categorical(lapply(test_seq$y, FUN = category_to_label))

## 2. The model
In this section you will implement the model. You need to construct a model which assigns one of 6 classes to each sequence.

To achieve a score of `1` for the assignment you only need to implement the baseline model and achieve accuracy higher than 80% on the **test set**. To achieve a score of `2` for the assignment you will need to achieve accuracy higher than 90% on the **test set**. Keep in mind that the test set is easier than the validation set.

We leave the model definition completely up to you but suggest that you start by trying some of the layers below.
- `layer_dense`
- `layer_conv_1d`
- `layer_gru`

When you have finished tuning your model, evaluate it and report the loss over the test set. Methodologically, you should evaluate on the test set only once (or at least not very often). Try to keep to that convention.
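
Since the pass criteria are stated in terms of accuracy, it can help to see how accuracy follows from predicted class probabilities. The sketch below uses made-up probabilities and one-hot targets for three classes; with a trained Keras model the probabilities would presumably come from `predict(model, x_test_scaled)`, and accuracy is also reported by `evaluate` if `metrics = "accuracy"` was passed when compiling the model.

```r
# Made-up predicted class probabilities for four sequences and three classes.
probs <- rbind(
  c(0.7, 0.2, 0.1),
  c(0.1, 0.6, 0.3),
  c(0.2, 0.5, 0.3),
  c(0.1, 0.1, 0.8)
)
# Matching one-hot targets; the third prediction is wrong.
targets <- rbind(
  c(1, 0, 0),
  c(0, 1, 0),
  c(0, 0, 1),
  c(0, 0, 1)
)

# Column index of the largest value in each row.
predicted_class <- max.col(probs, ties.method = "first")
true_class <- max.col(targets, ties.method = "first")

# Fraction of sequences whose predicted class matches the true class.
accuracy <- mean(predicted_class == true_class)
accuracy  # 0.75
```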

## Evaluate the model
Evaluate your model and report the loss over the test dataset.

In [None]:
model %>% evaluate(x_test_scaled, y_test)