# **Activity Data Integration**
This notebook shows how you might train your own model on activity data. In this example, we took data from [Viktor Malyi's four-part article](https://towardsdatascience.com/run-or-walk-detecting-user-activity-with-machine-learning-and-core-ml-part-1-9658c0dcdd90) and formatted it so that the Turi Create activity classifier can accept it.

In [None]:
#%%capture
!pip install turicreate --upgrade
!pip install s3fs --upgrade
import turicreate as tc
from s3fs.core import S3FileSystem

In [None]:
import pandas as pd
from datetime import datetime
from skafossdk import *

## **Read the Data**
For simplicity, we loaded the data into an S3 bucket, but the original source is [Viktor Malyi's Kaggle submission](https://www.kaggle.com/vmalyi/run-or-walk).

In [None]:
s3 = S3FileSystem(anon=True)
file = s3.open("s3://skafos.example.data/running_walking.csv", "rb")
dat = pd.read_csv(file)

## **Data Cleaning**
We do some basic data cleaning to get the data into a format that the Turi Create function accepts.

- The major requirements for the Turi Create function are a session_id and activity label.
- A session can be thought of as an experiment where the data is being collected on just one activity type. 
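
The one-activity-per-session requirement can be checked directly in pandas. The following is a minimal sketch, not part of the original notebook: the helper name `sessions_are_single_activity` and the toy frames are hypothetical, and the column names `session_id` and `activity` are assumed.

```python
import pandas as pd

def sessions_are_single_activity(df, session_col="session_id", activity_col="activity"):
    """Return True if every session contains exactly one distinct activity label."""
    # count the distinct activity labels inside each session
    labels_per_session = df.groupby(session_col)[activity_col].nunique()
    return bool(labels_per_session.eq(1).all())

# toy data (made up for illustration): session 1 in `bad` mixes two activities
good = pd.DataFrame({"session_id": [1, 1, 2, 2],
                     "activity": ["walking", "walking", "running", "running"]})
bad = pd.DataFrame({"session_id": [1, 1, 2, 2],
                    "activity": ["walking", "running", "running", "running"]})

good_ok = sessions_are_single_activity(good)
bad_ok = sessions_are_single_activity(bad)
```

A check like this could be run on the pandas dataframe before it is handed to the classifier.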

Because we have timestamps and not session ids, we derive a session column from the timestamps.

In [None]:
# not necessary but for ease of interpretation, map the activities to names
activity_map = {1 : 'running', 0: 'walking'}

# clean up the date time field
dat['time'] = dat['time'].astype(str).apply(lambda x: ":".join(x.split(":")[0:3]))
dat['date_time'] = dat['date'] + " " + dat['time']
dat['date_time'] = dat['date_time'].apply(lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S"))

# map the activities to names
dat['activity'] = dat['activity'].apply(lambda x: activity_map[x])
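
To see what the date-time cleanup does, here is a small sketch on made-up sample values. It assumes the `time` field carries a fourth, colon-separated sub-second part that has to be dropped; the sample rows are illustrative and not taken from the dataset. It uses vectorized pandas string methods and `pd.to_datetime` rather than row-wise `apply`, which is one possible alternative and not the notebook's own approach.

```python
import pandas as pd

# made-up sample rows; the format of the `time` field is an assumption
sample = pd.DataFrame({
    "date": ["2017-06-30", "2017-06-30"],
    "time": ["13:51:15:847724020", "13:51:16:246945023"],
})

# keep only the first three colon-separated fields (hours, minutes, seconds)
hms = sample["time"].astype(str).str.split(":").str[:3].str.join(":")

# combine date and time, then parse the result in one vectorized call
sample["date_time"] = pd.to_datetime(sample["date"] + " " + hms,
                                     format="%Y-%m-%d %H:%M:%S")
```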

#### The function below:
- takes as input a dataframe, a time column name (of type **`datetime`** or, in the pandas world, **`pandas._libs.tslibs.timestamps.Timestamp`**), an activity column name, and an optional threshold in seconds (default 10)
- returns the same dataframe with a 'session_id' column.

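The function takes each row and assigns it a session based on how soon that record was timestamped after the previous record (controlling for activity type). A row joins the previous row's session when it has the same activity and was timestamped less than `threshold` seconds later; otherwise it starts a new session.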
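
To make the rule concrete, here is a minimal vectorized sketch on a toy frame. The helper name `sketch_session_ids` and the sample values are hypothetical, and this is an alternative formulation under the same rule, not the notebook's own implementation: a new session starts at the first record of each activity or whenever the gap to the previous record of that activity reaches the threshold.

```python
import pandas as pd

def sketch_session_ids(df, time_col, activity_col, threshold=10):
    """Assign session ids from time gaps (in seconds) within each activity."""
    out = df.sort_values([activity_col, time_col]).reset_index(drop=True)
    # seconds since the previous record of the same activity (NaN for the first one)
    gap = out.groupby(activity_col)[time_col].diff().dt.total_seconds()
    # a session starts at the first record of an activity or after a long gap
    new_session = gap.isna() | (gap >= threshold)
    # running count of session starts gives ids 1, 2, 3, ...
    out["session_id"] = new_session.cumsum()
    return out

# toy data, made up for illustration
toy = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2017-06-30 00:00:00", "2017-06-30 00:00:05", "2017-06-30 00:00:30",
        "2017-06-30 00:00:31", "2017-06-30 00:00:33",
    ]),
    "activity": ["walking", "walking", "walking", "running", "running"],
})

toy_sessions = sketch_session_ids(toy, "date_time", "activity")
```

On this toy frame the two running records share one session, the first two walking records share another, and the third walking record starts a new session because it arrives 25 seconds after the previous one.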

In [None]:
def generate_session_ids(df, time_col, activity_col, threshold = 10):
    """Add a 'session_id' column; threshold is the maximum gap in seconds within a session."""
    
    # sort the dataframe by activity and time, add an index column
    temp_df = df.sort_values(by = [activity_col, time_col]).reset_index(drop = False)
    
    # create a list of index, time, activity objects
    recs = list(temp_df.apply(lambda x: {'index' : x['index'], time_col :  x[time_col], activity_col : x[activity_col]}, axis = 1))
    sessions = []
    session_id = 0
    # loop over the time, activity objects, assign "session ids" to those records that are within the time threshold
    for i in range(len(recs)):
        # the first record has no predecessor (recs[i-1] would wrap around to the last record)
        same_session = (
            i > 0
            and (recs[i][time_col] - recs[i-1][time_col]).total_seconds() < threshold
            and recs[i][activity_col] == recs[i-1][activity_col]
        )
        if not same_session:
            session_id += 1 # up the session id
        recs[i]['session_id'] = session_id
        sessions.append(recs[i])
    
    # convert back to df, merge with original df 
    session_df = pd.DataFrame.from_records(sessions)
    merged_df = pd.merge(temp_df, session_df, on = ['index', time_col, activity_col], how = 'left')
    
    # clean up the dataframe
    del merged_df['index']
    
    return merged_df
    

- Here we generate the session ids and assign the result back to the variable **`dat`**.
- Finally, we convert it to an **`SFrame`**, a Turi Create data type similar to a pandas dataframe.

In [None]:
dat = generate_session_ids(dat, 'date_time', 'activity')
dat = dat[['session_id', 'activity', 'acceleration_x', 'acceleration_y', 'acceleration_z', 'gyro_x', 'gyro_y', 'gyro_z']]
dat = tc.SFrame(dat)

## **Train the Model**

The following is the same code as in the example. We have replaced the `session_id` and `target` arguments with the appropriate column names in our new SFrame.