# Activity recognition with accelerometer data using a convolutional neural network

(Assignment developed by Dr. Bruns)

In this notebook we build an activity recognizer using a 1D-convolutional network.  There are seven different activities, so this is a multi-class classifier.

A question that comes up in activity recognition is:
- Is it possible to build a generic activity recognizer that does not need to be specially tuned on data from the person who will use it?

In this notebook we'll do experiments to see if a custom activity recognizer (trained for its user) would be much more accurate than a generic one.

v1.4

## Instructions:
- There are 7 clearly-identified problems below.
- Work on your own.  Do not look on the web for ideas on activity recognition.
- In all problems except problem 1, you are free to create multiple markdown and code cells, to create plots, to define new functions, etc.
- Provide commentary on the results of every problem.
- Feel free to use the hyperparameter tuning functions that were defined in a previous homework.
- Be sure to clearly report on accuracy values when you are asked to.

In [None]:
import numpy as np
from scipy.signal import find_peaks, periodogram
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from tensorflow.keras import models, layers
from tensorflow.keras import backend as K
from tensorflow.keras.callbacks import EarlyStopping
from IPython.display import display, HTML

In [None]:
display(HTML("<style>.container { width:100% !important; }</style>"))

Names associated with the activity numbers in the data.

In [None]:
activity_names = ['desk', 'mixed', 'standing', 'walking', 'stairs', 'talking/walking', 'talking/standing']

Functions to help with reading and preprocessing the data.

In [None]:
def read_user_data(user_id):
    """ Return a data frame containing the data for the given user.
    Args:
        user_id: an integer user id from 1 to 15.
    """

    infile = f'https://raw.githubusercontent.com/grbruns/cst495/master/activity/{user_id}.csv'
    df = pd.read_csv(infile, index_col=0, header=None)

    df.columns = ['x', 'y', 'z', 'activity']
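    # drop rows labeled 0, then shift the remaining labels so they start at 0
    # (.copy() avoids modifying a filtered view of the original frame)
    df = df[df['activity'] != 0].copy()
    df['activity'] = df['activity'] - 1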

    return df

In [None]:
def create_segments(X, window_size, shift):
    """ Return a NumPy array of fixed-length segments cut from X.

    X is a NumPy array with shape (number of time steps, number of variables)
    window_size is the number of time steps in each segment
    shift is the number of time steps to shift the window between segments

    The output is a NumPy array with shape
    (number of segments, window_size, number of variables)
    """

    # compute number of segments in X
    # X.shape[0]/shift gives total number of window positions
    num_segments = np.floor(X.shape[0]/shift) - np.ceil(window_size/shift)
    num_segments = int(num_segments)

    # create the segments
    segments = np.zeros((num_segments, window_size, X.shape[1]))
    for i in np.arange(num_segments):
        segments[i, :, :] = X[(i*shift):(i*shift + window_size), :]

    return segments

In [None]:
def clean_and_label(segments):
    """ From the given segments, create a new array of the clean segments.
    Return the clean segments, with activity values removed,
    and an activity label for each.
    """

    # compute number of single class ("clean") segments
    n = segments.shape[0]
    num_clean = 0
    for i in range(n):
        segment_classes = segments[i,:,3]
        if segment_classes.min() == segment_classes.max():
            num_clean += 1

    print('fraction of segments with a single class: {:.3f}'.format(num_clean/n))

    # create clean segments, and create training labels
    segs = np.zeros((num_clean, segments.shape[1], segments.shape[2]-1))
    y = np.full(num_clean, 0)
    idx = 0
    for i in range(n):
        segment_classes = segments[i,:,3]
        if segment_classes.min() == segment_classes.max():
            segs[idx,:,:] = segments[i,:,:3]
            y[idx] = segment_classes[0]
            idx += 1

    return segs, y

A function to help with diagnosis of neural net training.

In [None]:
def plot_metric(history, metric='loss'):
    """ Plot training and validation values for a metric.

    The validation values are the 'val_' entries of the Keras history.
    """

    val_metric = 'val_'+metric
    plt.plot(history.history[metric])
    plt.plot(history.history[val_metric])
    plt.title('model '+metric)
    plt.ylabel(metric)
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'])
    plt.show()

### Read the raw data

The data set comes from the UCI repository:

[Activity Recognition from Single Chest-Mounted Accelerometer Data Set](https://archive.ics.uci.edu/ml/datasets/Activity+Recognition+from+Single+Chest-Mounted+Accelerometer)

The data was apparently collected at the University of Barcelona in about 2009.

The data set contains data for 15 people.  During data collection, each person wore a chest-mounted accelerometer, and x/y/z data was collected at 52 Hz (52 samples/second).

In [None]:
# Load data for a single user.  Users are numbered 1-15.

df = read_user_data(4)

### Initial exploration

The three columns of the data contain samples in the x, y, and z dimensions.  There are about 120K samples for each of x, y, and z, so, based on the sample rate, about 2300 seconds (38 minutes) of data.

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df['activity'].value_counts().plot.bar()
plt.title('Counts of activities');

### Data segmentation and cleaning

As a first step we need to break the data up into segments and assign a label to each segment.  

The process of labeling is a little tricky, because not every segment has a single label associated with it.

Remember that the data was sampled at 52 Hz.  In other words, a segment of length 52 would contain 1 second of recorded activity.
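
The relationship between samples, seconds, and window overlap can be checked with a little arithmetic. The sketch below is illustrative only: the helper `describe_windowing` and the values passed to it are assumptions made for this example, not a recommended answer to Problem 1.

```python
SAMPLE_RATE_HZ = 52  # sampling rate stated for this data set

def describe_windowing(n_samples, window_size, shift):
    """Return (window duration in seconds, overlap fraction, number of full windows)."""
    duration_s = window_size / SAMPLE_RATE_HZ
    # Consecutive windows share (window_size - shift) samples when shift < window_size.
    overlap = max(0.0, 1.0 - shift / window_size)
    # Count every window that fits entirely inside the recording.
    if n_samples < window_size:
        n_windows = 0
    else:
        n_windows = (n_samples - window_size) // shift + 1
    return duration_s, overlap, n_windows

# Hypothetical values chosen only to make the arithmetic easy to follow.
duration_s, overlap, n_windows = describe_windowing(n_samples=1040, window_size=104, shift=52)
print(duration_s, overlap, n_windows)
```

Note that `create_segments` computes its segment count with a different formula, so the number of segments it returns may come out slightly lower than the count above.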

---
#### Problem 1: set the data segmentation parameters

In this problem, set the window_size and shift values.

---

In [None]:
window_size = None      # select a segment size
shift = None            # select a shift amount

In [None]:
df = read_user_data(4)
segments = create_segments(df.values, window_size, shift)
X, y = clean_and_label(segments)

In [None]:
print(X.shape)
print(y.shape)

### Prepare data for machine learning

Perform any further preprocessing, then do a train/test split.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

Plot a random clean segment.

In [None]:
plt.figure(figsize=(6,3))
i = np.random.choice(X_train.shape[0])
plt.plot(X_train[i])
plt.title(activity_names[y_train[i]]);

---
#### Problem 2: Perform any additional preprocessing of the data that you want.
---

This problem is optional; you may choose to not perform any additional preprocessing.

Remember to treat training and test data correctly.  For example, if you do any scaling or normalization, train the scaler on the training data and then apply it on the test data.  As another example, if you do any balancing of the data, do the balancing only on the training data.
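
As one possible illustration of the scaling rule above, the sketch below standardizes each accelerometer channel using statistics computed from the training segments only. The arrays are synthetic stand-ins for `X_train` and `X_test`, and per-channel standardization is just one assumed choice of preprocessing.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins with shape (segments, window_size, 3 channels).
X_train_demo = rng.normal(loc=2000.0, scale=50.0, size=(80, 104, 3))
X_test_demo = rng.normal(loc=2000.0, scale=50.0, size=(20, 104, 3))

# Per-channel statistics come from the training segments only.
channel_mean = X_train_demo.mean(axis=(0, 1))
channel_std = X_train_demo.std(axis=(0, 1))

# The same training statistics are then applied to both sets.
X_train_scaled = (X_train_demo - channel_mean) / channel_std
X_test_scaled = (X_test_demo - channel_mean) / channel_std

print(X_train_scaled.shape, X_test_scaled.shape)
```

Because the test set never contributes to `channel_mean` or `channel_std`, its scaled mean will generally be close to, but not exactly, zero.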

In [None]:
# YOUR CODE AND MARKDOWN CELLS HERE

### Machine learning

The goal is to predict the activity performed during a segment.  This is a multi-class classification problem.

In [None]:
# to get baseline accuracy, always predict the most common activity
# value_counts() sorts by frequency, so the first entry (by position) is the most common
counts = df['activity'].value_counts(normalize=True)
print('baseline accuracy: {:0.4f}'.format(counts.iloc[0]))

The following check helps confirm that a standard data set is being used.

In [None]:
print(X_train.shape, y_train.shape)
print(X_train.sum(), X_test.sum())

---
#### Problem 3:  Create a 1D convolutional net to predict the activity from a segment.  
---

The accelerometer data has three channels (for the x, y, and z axes of movement).  These will be the input channels of the model.

Be sure to report on the test accuracy of your model.
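
To make the role of the three input channels concrete, the NumPy sketch below shows what a single 1D convolutional layer computes on one segment. It is not a model for this problem: the function `conv1d_valid`, the random weights, and the sizes used are assumptions for illustration, and it follows the "valid" (no padding, stride 1) convention.

```python
import numpy as np

def conv1d_valid(segment, kernels, bias):
    """Apply a 'valid' 1D convolution (cross-correlation) along the time axis.

    segment: (window_size, in_channels)
    kernels: (kernel_size, in_channels, filters)
    bias:    (filters,)
    Returns an array of shape (window_size - kernel_size + 1, filters).
    """
    kernel_size, _, filters = kernels.shape
    out_len = segment.shape[0] - kernel_size + 1
    out = np.zeros((out_len, filters))
    for t in range(out_len):
        patch = segment[t:t + kernel_size, :]  # (kernel_size, in_channels)
        # Each filter sums over both the time steps and the channels of the patch.
        out[t] = np.tensordot(patch, kernels, axes=([0, 1], [0, 1])) + bias
    return out

rng = np.random.default_rng(0)
segment = rng.normal(size=(104, 3))   # one synthetic segment with x, y, z channels
kernels = rng.normal(size=(5, 3, 8))  # 8 filters, each spanning 5 time steps
features = conv1d_valid(segment, kernels, np.zeros(8))
print(features.shape)
```

Each filter looks at all three axes at once, so the layer can pick up patterns that involve movement in several directions together.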

In [None]:
# delete any old models
K.clear_session()

In [None]:
# YOUR CODE AND MARKDOWN CELLS HERE

---
#### Problem 4: Compute the accuracy of your model on data from users 1, 2, and 3.
---

Does the model you trained on data from user 4 make good predictions on data from other people?  

Compute the accuracy of your model on 1.csv, on 2.csv, and on 3.csv.

Do not train the model in this step; just compute the accuracy of the model that you trained in the previous problem.

Make sure you clearly print three accuracy values: the accuracy on 1.csv, on 2.csv, and on 3.csv.
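
Accuracy can be computed directly from class probabilities, as sketched below. The probabilities and labels here are made up for illustration; in the notebook they would come from applying the already-trained model to the segments of each user (for example via `model.predict`), and `accuracy_from_probs` is a hypothetical helper name.

```python
import numpy as np

def accuracy_from_probs(probs, y_true):
    """Fraction of segments whose highest-probability class equals the label."""
    y_pred = np.argmax(probs, axis=1)
    return float(np.mean(y_pred == y_true))

# Hypothetical softmax outputs and labels, keyed by user id.
demo_outputs = {
    1: (np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.6, 0.3, 0.1]]),
        np.array([0, 1, 2, 1])),
    2: (np.array([[0.2, 0.5, 0.3],
                  [0.9, 0.05, 0.05]]),
        np.array([1, 1])),
}

demo_accuracy = {}
for user_id, (probs, y_true) in demo_outputs.items():
    demo_accuracy[user_id] = accuracy_from_probs(probs, y_true)
    print('user {}: accuracy {:.4f}'.format(user_id, demo_accuracy[user_id]))
```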

In [None]:
# YOUR CODE AND MARKDOWN CELLS HERE

---
#### Problem 5: Train and then compute the accuracy of your model on data from users 1, 2, and 3.
---

Does the model you tuned for user 4 make good predictions for user 1, when trained on data from user 1?

Split the user 1 data into training and test sets, do the same preprocessing you did for data from user 4, train the model, and then compute test accuracy.

Repeat the process for users 2 and 3.

Remember, you will not modify your model of problem 3, but you will train it and compute accuracy for each of users 1-3.

Hint: re-read these instructions carefully!  You will need to train and test separately for each of users 1, 2, and 3.

In [None]:
# YOUR CODE AND MARKDOWN CELLS HERE

---
#### Problem 6: Create a new model, train it on users 1-11, and test it on users 12-15.
---

Can you make a generic model that will work well for everybody?

Create a new model, then fit it using data from users 1-11 as your training data.  You will have to combine the data from these people into a single training data set.

Then compute the accuracy of your model, individually, on users 12-15.

Combining data from users 1-11 into a single dataset is easy to do with Pandas dataframes using the concat() function.  For example, `df = pd.concat([df1, df2])`.
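
The `concat()` call can be tried on small synthetic frames first. In the sketch below, `make_fake_user` is a hypothetical helper that mimics the columns produced by `read_user_data`; the data is random and only the shapes matter.

```python
import numpy as np
import pandas as pd

def make_fake_user(n_rows, seed):
    """Build a synthetic frame with the same columns that read_user_data produces."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        'x': rng.normal(size=n_rows),
        'y': rng.normal(size=n_rows),
        'z': rng.normal(size=n_rows),
        'activity': rng.integers(0, 7, size=n_rows),
    })

frames = [make_fake_user(100, seed=user_id) for user_id in (1, 2, 3)]
# ignore_index=True renumbers the rows instead of repeating each frame's index.
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)
```

One thing to keep in mind: if the combined frame is segmented as a whole, a window can straddle the boundary between two users, and `clean_and_label` only discards it when the activity labels differ. Segmenting each user separately and then joining the resulting arrays (for example with `np.concatenate`) is one way to avoid that.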

In [None]:
# YOUR CODE AND MARKDOWN CELLS HERE

---
#### Problem 7: Write your final conclusions.  Write clearly, and write in paragraphs.
---

(Replace this with your text.)