# **Activity Data Integration**
This notebook shows how you might train your own model on activity data. In this example, we took data from [Viktor Malyi's four-part article](https://towardsdatascience.com/run-or-walk-detecting-user-activity-with-machine-learning-and-core-ml-part-1-9658c0dcdd90) and formatted it so that the Turi Create activity classifier can accept it.

In [None]:
# Import necessary dependencies (make sure you have installed them first)
from datetime import datetime

import pandas as pd
import turicreate as tc
from skafossdk import *

ska = Skafos()

## **Read the Data**
For simplicity, we loaded the data into an S3 bucket, but the original source is [Viktor Malyi's Kaggle submission](https://www.kaggle.com/vmalyi/run-or-walk).

In [None]:
s3_url = "https://s3.amazonaws.com/skafos.example.data/ActivityClassifier/running_walking.csv"
data = pd.read_csv(s3_url)

In [None]:
# Inspect the data
data.head(5)

## **Data Cleaning**
We do some basic data cleaning to get the data into the format that the Turi Create activity classification model expects.

- The major requirements for the Turi Create function are a *session_id* and an *activity label*.
- A session can be thought of as an experiment in which a single user collects data on one or more activities.

Because we have timestamps and not session ids, we derive a session column from the timestamps.

In [None]:
# Map the activities to names
activity_map = {1: 'running', 0: 'walking'}
data['activity'] = data['activity'].apply(lambda x: activity_map[x])

# Clean up the date time field
data['time'] = data['time'].astype(str).apply(lambda x: ":".join(x.split(":")[0:3]))
data['date_time'] = data['date'] + " " + data['time']
data['date_time'] = data['date_time'].apply(lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S"))
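To make the date-time cleanup concrete, here is a minimal sketch of what the cell above does to a single record. The sample `raw_date` and `raw_time` values are hypothetical; the only assumption taken from the cell is that the raw time string may carry extra colon-separated fields after the seconds, which are discarded.

```python
from datetime import datetime

# Hypothetical raw values, chosen only to illustrate the transformation
raw_date = "2017-06-30"
raw_time = "13:51:15:847724020"

# Keep only the hours, minutes, and seconds fields
clean_time = ":".join(raw_time.split(":")[0:3])

# Combine date and time, then parse into a datetime object
stamp = datetime.strptime(raw_date + " " + clean_time, "%Y-%m-%d %H:%M:%S")
print(clean_time, stamp)
```

Dropping the sub-second field means several rows can share the same timestamp, which is fine here because sessions are separated by gaps of several seconds.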

In [None]:
# Inspect changes
data.head(5)

#### The function below:
- takes as input a dataframe, a datetime column name, and an activity column name
- returns the same dataframe with a 'session_id' column.

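The function takes each row and assigns it a session based on how soon that record was timestamped after the previous record (controlling for activity type). A row joins the previous row's session when it has the same activity and was timestamped less than `threshold` seconds (10 by default) after it; otherwise it starts a new session.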
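The rule described above can also be illustrated on toy data with vectorized pandas operations. This is a sketch of the same idea, not the function used in this notebook; the `toy` dataframe and its values are made up for the example.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the real dataset)
toy = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2017-06-30 13:51:15", "2017-06-30 13:51:16", "2017-06-30 13:51:40",
        "2017-06-30 13:51:41", "2017-06-30 13:51:42",
    ]),
    "activity": ["running", "running", "running", "walking", "walking"],
})

threshold = 10  # seconds allowed between consecutive records in one session

# Sort so that consecutive rows are comparable
toy = toy.sort_values(["activity", "date_time"]).reset_index(drop=True)

# Seconds elapsed since the previous row (NaN for the first row)
gap = toy["date_time"].diff().dt.total_seconds()

# A row starts a new session if it is the first row, follows a long gap,
# or has a different activity than the previous row
new_session = (
    gap.isna()
    | (gap >= threshold)
    | (toy["activity"] != toy["activity"].shift())
)

# A running count of session starts gives each row its session id
toy["session_id"] = new_session.cumsum()
print(toy)
```

Here the third running record follows a 24-second gap, so it opens a second session, and the switch to walking opens a third.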

In [None]:
def generate_session_ids(df, time_col, activity_col, threshold=10):
    
    # Sort the dataframe by activity and time, add an index column
    temp_df = df.sort_values(by=[activity_col, time_col]).reset_index(drop=False)
    
    # Create a list of index, time, activity objects
    recs = list(temp_df.apply(lambda x: {'index': x['index'], time_col:  x[time_col], activity_col: x[activity_col]}, axis=1))
    sessions = []
    session_id = 0
    # Loop over the records, assigning the same "session id" to consecutive records
    # that share an activity and fall within the time threshold
    for i in range(len(recs)):
        # The first record has no predecessor, so it always starts a new session
        same_session = (
            i > 0
            and (recs[i][time_col] - recs[i-1][time_col]).total_seconds() < threshold
            and recs[i][activity_col] == recs[i-1][activity_col]
        )
        if not same_session:
            session_id += 1  # start a new session
        recs[i]['session_id'] = session_id
        sessions.append(recs[i])
    
    # Convert back to df, merge with original df 
    session_df = pd.DataFrame.from_records(sessions)
    merged_df = pd.merge(temp_df, session_df, on = ['index', time_col, activity_col], how = 'left')
    
    # Clean up the dataframe
    del merged_df['index']
    
    return merged_df
    

- Here we generate the session ids and assign the result back to the variable **`data`**.
- Later, we convert the sampled data to an **`SFrame`**, a Turi Create data type similar to a pandas dataframe.

In [None]:
data = generate_session_ids(data, 'date_time', 'activity')
data = data[['session_id', 'activity', 'acceleration_x', 'acceleration_y', 'acceleration_z', 'gyro_x', 'gyro_y', 'gyro_z']]
print(f"The data has dimensions {data.shape} (rows x columns)", flush=True)

In [None]:
# Check the distribution of the sessions by activity type - mostly running
data.groupby(['activity']).agg({'session_id' : pd.Series.nunique})/data['session_id'].nunique()

In [None]:
# Sample sessions from the dataframe to reduce the training data size
unique_sessions = len(data['session_id'].unique())
n_session_samples = int(unique_sessions * 0.5)
print(f"There are {unique_sessions} sessions", flush=True)
print(f"Sampling {n_session_samples} sessions due to memory constraints", flush=True)
session_sample = pd.Series(data['session_id'].unique()).sample(n_session_samples)

# Create a dataframe of sampled data
sample_data = data[data['session_id'].isin(session_sample)]

In [None]:
# Check the distribution of the sampled sessions by activity type - still mostly running
sample_data.groupby(['activity']).agg({'session_id' : pd.Series.nunique})/sample_data['session_id'].nunique()

In [None]:
# Convert to SFrame because that's what Turi Create needs
sample_data_sframe = tc.SFrame(sample_data)

## **Train the Model**

The following is the same code as in the example. We have replaced the session_id argument and target argument with the appropriate column names in our new dataframe.

Steps:
- Split into training and testing
- Create the model (model build)
- Evaluate the model on the testing dataset
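The reason for splitting by session rather than by row is that rows from one session are highly correlated, so a session spread across both sets would leak information into the test set. The sketch below shows the idea in plain pandas. It is only an illustration: `split_by_session` is a hypothetical helper, not Turi Create's implementation, and the toy data is made up.

```python
import numpy as np
import pandas as pd


def split_by_session(df, session_col, fraction=0.8, seed=0):
    """Hypothetical helper: send each whole session to train or test."""
    rng = np.random.default_rng(seed)
    session_ids = df[session_col].unique()
    # Draw one random number per session, not per row
    in_train = rng.random(len(session_ids)) < fraction
    train_ids = session_ids[in_train]
    train = df[df[session_col].isin(train_ids)]
    test = df[~df[session_col].isin(train_ids)]
    return train, test


# Toy data: 10 sessions with 3 rows each
toy = pd.DataFrame({
    "session_id": np.repeat(np.arange(1, 11), 3),
    "acceleration_x": np.linspace(0.0, 1.0, 30),
})

train, test = split_by_session(toy, "session_id")
print(len(train), len(test))
```

Because the random draw is per session, the realized row split only approximates `fraction`, especially when sessions differ in length.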

In [None]:
train_data, test_data = tc.activity_classifier.util.random_split_by_session(
    dataset=sample_data_sframe,
    session_id='session_id',
    fraction=0.8
)


In [None]:
model = tc.activity_classifier.create(
    dataset=train_data,
    session_id='session_id',
    target='activity',
    # We want 1 prediction per second: at a 50 Hz sampling rate, that is 50 samples per window
    prediction_window=50
)

## **Model Evaluation**
Now we evaluate our model against some testing data. First, we grab some slices of data where walking and running took place and see if the model correctly classifies those activities. Then we evaluate the model against the entire testing set, computing an accuracy benchmark.
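The slice lengths used in this section follow from the sampling rate. As a small worked example, assuming the 50 Hz rate stated in the training cell, the helpers below (hypothetical names, for illustration only) convert seconds to rows and rows to full prediction windows.

```python
SAMPLE_RATE_HZ = 50  # assumed sampling rate, taken from the training cell's comment


def rows_for(seconds, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of rows that cover the given number of seconds."""
    return int(seconds * sample_rate_hz)


def full_windows_for(n_rows, prediction_window=50):
    """Number of complete prediction windows contained in n_rows."""
    return n_rows // prediction_window


print(rows_for(3), full_windows_for(rows_for(3)))
print(rows_for(10), full_windows_for(rows_for(10)))
```

So 3 seconds of walking corresponds to 150 rows and 10 seconds of running to 500 rows, which contain 3 and 10 full one-second windows respectively.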

In [None]:
# Find 3 seconds of walking data constrained to a single experiment where walking took place
exp_id_with_walking = test_data[test_data['activity'] == 'walking']['session_id'].value_counts()[0]['value']
walking_3_seconds = test_data[(test_data['activity'] == 'walking') & (test_data['session_id'] == exp_id_with_walking)][:150]

# Find 10 seconds of running data constrained to a single experiment where running took place
exp_id_with_running = test_data[test_data['activity'] == 'running']['session_id'].value_counts()[0]['value']
running_10_seconds = test_data[(test_data['activity'] == 'running') & (test_data['session_id'] == exp_id_with_running)][:500]

In [None]:
# Check if the model properly classifies each second as walking or something else
model.predict(walking_3_seconds, output_frequency='per_window')

In [None]:
# Check if the model properly classifies each second as running or something else
model.predict(running_10_seconds, output_frequency='per_window')

In [None]:
# Calculate accuracy of the model against the entire hold out testing set
accuracy = tc.evaluation.accuracy(test_data['activity'], model.predict(test_data))
print(f'The activity classifier predicted {accuracy:.1%} of the testing observations correctly!', flush=True)
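For reference, accuracy is simply the fraction of predictions that match the true labels. A minimal sketch with made-up labels (not results from this model):

```python
# Hypothetical labels, chosen only to illustrate the calculation
actual = ["running", "running", "walking", "running", "walking"]
predicted = ["running", "walking", "walking", "running", "walking"]

# Count the positions where the prediction equals the true label
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(f"{accuracy:.1%}")
```

Since the sessions in this dataset are mostly running, a single accuracy number can look good even if walking is classified poorly, so it is worth reading it alongside the per-window predictions above.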