# **Activity Data Integration**
This notebook shows how you might train your own model on activity data. In this example, we took data from [Viktor Malyi's four-part article](https://towardsdatascience.com/run-or-walk-detecting-user-activity-with-machine-learning-and-core-ml-part-1-9658c0dcdd90) and formatted it so that the Turi Create activity classifier can accept it.

In [1]:
import pandas as pd
from datetime import datetime
import turicreate as tc
import io
import requests
from skafossdk import *

ska = Skafos()

2018-11-21 18:22:38,979 - skafossdk.data_engine - INFO - Connecting to DataEngine
2018-11-21 18:22:39,275 - skafossdk.data_engine - INFO - DataEngine Connection Opened


## **Read the Data**
For simplicity, we loaded the data into an S3 bucket, but the original source is [Viktor Malyi's Kaggle submission](https://www.kaggle.com/vmalyi/run-or-walk).

In [2]:
req = requests.get("https://s3.amazonaws.com/skafos.example.data/running_walking.csv")
s = req.content
dat = pd.read_csv(io.StringIO(s.decode('utf-8')))

## **Data Cleaning**
We do some basic data cleaning to get the data into a format that the Turi Create activity classifier accepts.

- The major requirements for the Turi Create function are a session_id and activity label.
- A session can be thought of as an experiment where the data is being collected on just one activity type. 

Because we have timestamps rather than session ids, we derive a session column from the timestamps.

In [3]:
# not necessary but for ease of interpretation, map the activities to names
activity_map = {1 : 'running', 0: 'walking'}

# clean up the date time field
dat['time'] = dat['time'].astype(str).apply(lambda x: ":".join(x.split(":")[0:3]))
dat['date_time'] = dat['date'] + " " + dat['time']
dat['date_time'] = dat['date_time'].apply(lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S"))

# map the activities to names
dat['activity'] = dat['activity'].apply(lambda x: activity_map[x])
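
To make the time cleanup concrete, here is a minimal sketch on a single value. It assumes raw values shaped like `2017-6-30` for the date and `13:51:15:847724020` for the time (hours, minutes, seconds, then a sub-second field); both sample values are hypothetical and only illustrate the assumed layout.

```python
from datetime import datetime

# Hypothetical raw values; the sub-second field after the third colon is an assumed layout
raw_date = "2017-6-30"
raw_time = "13:51:15:847724020"

# Keep only hours, minutes and seconds, as the cleaning step does
clean_time = ":".join(raw_time.split(":")[0:3])

# Combine with the date and parse into a datetime object
parsed = datetime.strptime(raw_date + " " + clean_time, "%Y-%m-%d %H:%M:%S")
print(parsed)
```

Under this assumption the sub-second field is dropped, so several readings can share the same timestamp. That is fine for the session logic, which only compares gaps measured in seconds.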

#### The function below:
- takes as input a dataframe,
- a time column name (of type **`datetime`** or, in the pandas world, **`pandas._libs.tslibs.timestamps.Timestamp`**),
- an activity column name,
- and an optional `threshold` in seconds (default 10),
- and returns the dataframe, sorted by activity and time, with an added 'session_id' column.

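The function assigns each row to a session based on how soon it was timestamped after the previous record (controlling for activity type): a row that shares the previous row's activity and follows it by less than `threshold` seconds stays in the same session; otherwise a new session starts.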

In [4]:
def generate_session_ids(df, time_col, activity_col, threshold = 10):
    
    # sort the dataframe by activity and time, add an index column
    temp_df = df.sort_values(by = [activity_col, time_col]).reset_index(drop = False)
    
    # create a list of index, time, activity objects
    recs = list(temp_df.apply(lambda x: {'index' : x['index'], time_col :  x[time_col], activity_col : x[activity_col]}, axis = 1))
    sessions = []
    session_id = 0
    # loop over the records; a record continues the current session only if it
    # shares the previous record's activity and follows it within the time threshold
    for i in range(len(recs)):
        same_session = (
            i > 0  # the first record always starts a session (recs[-1] would wrap to the last record)
            and recs[i][activity_col] == recs[i-1][activity_col]
            and (recs[i][time_col] - recs[i-1][time_col]).total_seconds() < threshold
        )
        if not same_session:
            session_id += 1  # start a new session
        recs[i]['session_id'] = session_id
        sessions.append(recs[i])
    
    # convert back to df, merge with original df 
    session_df = pd.DataFrame.from_records(sessions)
    merged_df = pd.merge(temp_df, session_df, on = ['index', time_col, activity_col], how = 'left')
    
    # clean up the dataframe
    del merged_df['index']
    
    return merged_df
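
The same session logic can also be sketched without an explicit loop. The version below is an illustrative alternative, not the function used in this notebook: `generate_session_ids_vectorized` and the toy dataframe are hypothetical names and data, and the sketch assumes the time column already holds pandas timestamps.

```python
import pandas as pd

def generate_session_ids_vectorized(df, time_col, activity_col, threshold=10):
    # sort by activity and time so consecutive rows are comparable
    out = df.sort_values(by=[activity_col, time_col]).reset_index(drop=True)
    # seconds elapsed since the previous row (NaN for the first row)
    gap = out[time_col].diff().dt.total_seconds()
    # a new session starts on the first row, on an activity change, or after a long gap
    new_session = (
        gap.isna()
        | (gap >= threshold)
        | (out[activity_col] != out[activity_col].shift())
    )
    # a running count of session starts gives ids 1, 2, 3, ...
    out['session_id'] = new_session.cumsum()
    return out

# hypothetical toy data: two running bursts 44 seconds apart, then one walking burst
toy = pd.DataFrame({
    'date_time': pd.to_datetime([
        '2017-06-30 13:51:15', '2017-06-30 13:51:16', '2017-06-30 13:52:00',
        '2017-06-30 14:00:00', '2017-06-30 14:00:05',
    ]),
    'activity': ['running', 'running', 'running', 'walking', 'walking'],
})
result = generate_session_ids_vectorized(toy, 'date_time', 'activity')
print(result)
```

With the default threshold of 10 seconds, the first two rows share a session, the third row starts a new one because of the 44-second gap, and the walking rows form a third session.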
    

- Here we generate the session ids and assign the result back to the variable **`dat`**.
- Finally, we convert to an **`SFrame`**, a Turi Create data type similar to a pandas dataframe.

In [5]:
dat = generate_session_ids(dat, 'date_time', 'activity')
dat = dat[['session_id', 'activity', 'acceleration_x', 'acceleration_y', 'acceleration_z', 'gyro_x', 'gyro_y', 'gyro_z']]
print(f"The data has dimensions {dat.shape}")

The data has dimensions (88588, 8)


In [None]:
# Check the distribution of the sessions across activity type
print("The distribution of sessions across activity types is as follows ... \n ")
dat.groupby(['activity']).agg({'session_id' : pd.Series.nunique})/dat['session_id'].nunique()

In [None]:
# sample sessions from the dataframe
unique_sessions = len(dat['session_id'].unique())
n_session_samples = int(unique_sessions * 0.5)
print(f"There are {unique_sessions} sessions")
print(f"Sampling {n_session_samples} sessions due to memory constraints")
session_sample = pd.Series(dat['session_id'].unique()).sample(n_session_samples)

# keep only the rows that belong to the sampled sessions
sample_dat = dat[dat['session_id'].isin(session_sample)]

In [None]:
print("The distribution of sampled sessions across activity types is as follows ... \n ")
sample_dat.groupby(['activity']).agg({'session_id' : pd.Series.nunique})/sample_dat['session_id'].nunique()
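
If a plain random sample of sessions shifts the activity distribution noticeably, one option is to sample the same fraction of sessions within each activity. The sketch below is an assumption-laden illustration rather than part of the original workflow: `sample_sessions_by_activity` and the toy data are hypothetical, and `GroupBy.sample` requires pandas 1.1 or later.

```python
import pandas as pd

def sample_sessions_by_activity(df, session_col, activity_col, frac=0.5, seed=0):
    # one row per (session, activity) pair
    sessions = df[[session_col, activity_col]].drop_duplicates()
    # draw the same fraction of sessions within each activity
    picked = sessions.groupby(activity_col)[session_col].sample(frac=frac, random_state=seed)
    # keep every row that belongs to a picked session
    return df[df[session_col].isin(picked)]

# hypothetical toy data: four running sessions and two walking sessions
toy = pd.DataFrame({
    'session_id': [1, 1, 2, 3, 3, 4, 5, 6, 6],
    'activity': ['running'] * 6 + ['walking'] * 3,
})
sampled = sample_sessions_by_activity(toy, 'session_id', 'activity', frac=0.5)
print(sampled.groupby('activity')['session_id'].nunique())
```

Sampling at the session level rather than the row level keeps each session intact, which matters because the classifier treats a session as one continuous recording.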

In [6]:
# convert to SFrame because that's what Turi Create needs
dat = tc.SFrame(dat)

## **Train the Model**

The following is the same code as in the example. We have replaced the `session_id` and `target` arguments with the corresponding column names in our new dataframe.

Steps:
- Split into training and testing
- Create the model (model build)
- Evaluate the model on the testing dataset

In [7]:
train, test = tc.activity_classifier.util.random_split_by_session(dat, session_id='session_id',
                                                                  fraction=0.8)
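
To see why the split is done by session rather than by row, here is a minimal pandas sketch of the idea. It is not Turi Create's implementation: `split_by_session` and the toy data are hypothetical, and this version sends an exact share of sessions to training, which may differ from how the library assigns them.

```python
import random
import pandas as pd

def split_by_session(df, session_col, fraction=0.8, seed=0):
    # shuffle the unique session ids reproducibly
    session_ids = sorted(df[session_col].unique())
    random.Random(seed).shuffle(session_ids)
    # the first `fraction` of sessions go to training, the rest to testing
    n_train = int(len(session_ids) * fraction)
    train_ids = set(session_ids[:n_train])
    in_train = df[session_col].isin(train_ids)
    return df[in_train], df[~in_train]

# hypothetical toy data: ten sessions with two rows each
toy = pd.DataFrame({'session_id': [s for s in range(10) for _ in range(2)]})
toy_train, toy_test = split_by_session(toy, 'session_id', fraction=0.8)
print(toy_train['session_id'].nunique(), toy_test['session_id'].nunique())
```

Because all rows of a session land on the same side, no session contributes to both training and evaluation, which avoids leaking near-identical consecutive readings into the test set.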


In [None]:
model = tc.activity_classifier.create(train, session_id='session_id', target='activity',
                                      prediction_window=20)


In [None]:
metrics = model.evaluate(test)

In [None]:
metrics