In [1]:
import pandas as pd
from datetime import datetime
import turicreate as tc
from s3fs.core import S3FileSystem

## **Read the Data**
For simplicity, we loaded the data into an S3 bucket.

In [2]:
s3 = S3FileSystem(anon=True)
file = s3.open("s3://skafos.example.data/running_walking.csv", "rb")
dat = pd.read_csv(file)

## **Data Cleaning**
We do some basic data cleaning to get the data into a format that the Turi Create function accepts.

- The major requirements for the Turi Create function are a session_id and activity label.
- A session can be thought of as an experiment where the data is being collected on just one activity type. 

Because we have timestamps and not session ids, we derive a session column from the timestamps.

In [3]:
# not necessary but for ease of interpretation, map the activities to names
activity_map = {1 : 'running', 0: 'walking'}

# clean up the date time field
dat['time'] = dat['time'].astype(str).apply(lambda x: ":".join(x.split(":")[0:3]))
dat['date_time'] = dat['date'] + " " + dat['time']
dat['date_time'] = pd.to_datetime(dat['date_time'], format="%Y-%m-%d %H:%M:%S")

# map the activities to names
dat['activity'] = dat['activity'].apply(lambda x: activity_map[x])
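
To make the date-time clean-up above concrete, here is a minimal sketch of the same trimming and parsing applied to a single record. The helper name `parse_timestamp` and the sample values are assumptions for illustration; the raw format of the `time` column is not shown in this notebook, so the sketch assumes a time string with extra colon-separated fields after the seconds.

```python
from datetime import datetime

def parse_timestamp(date_str, time_str):
    # Keep only the first three colon-separated fields (hours, minutes, seconds);
    # anything after the seconds is dropped, as in the lambda above.
    trimmed = ":".join(time_str.split(":")[0:3])
    return datetime.strptime(date_str + " " + trimmed, "%Y-%m-%d %H:%M:%S")

# Hypothetical sample values, not taken from the dataset.
parsed = parse_timestamp("2017-06-30", "13:51:15:847724020")
already_clean = parse_timestamp("2017-06-30", "13:51:15")
```

Both calls yield the same `datetime`, so the trimming step is harmless for time strings that already stop at the seconds.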

#### The function below takes as input:
- a dataframe
- a time column name (of type **`datetime`**, or in the pandas world **`pandas._libs.tslibs.timestamps.Timestamp`**)
- an activity column name

It returns the dataframe, sorted by activity and time, with a 'session_id' column added.

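The function takes each row and assigns it a session based on how soon that record was timestamped after the previous record (controlling for activity type). A record joins the current session if it was timestamped less than `threshold` seconds (10 by default) after the previous record of the same activity type; otherwise it starts a new session.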

In [4]:
def generate_session_ids(df, time_col, activity_col, threshold = 10):
    
    # sort the dataframe by activity and time, add an index column
    temp_df = df.sort_values(by = [activity_col, time_col]).reset_index(drop = False)
    
    # create a list of index, time, activity objects
    recs = list(temp_df.apply(lambda x: {'index' : x['index'], time_col :  x[time_col], activity_col : x[activity_col]}, axis = 1))
    sessions = []; session_id = 0
    # loop over the time, activity objects, assign "session ids" to those records that are within the time threshold
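    for i in range(len(recs)):
        # the first record has no predecessor (recs[i-1] would wrap around to the last record), so it always starts a new session
        if i > 0 and (recs[i][time_col] - recs[i-1][time_col]).total_seconds() < threshold and recs[i][activity_col] == recs[i-1][activity_col]: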
            recs[i]['session_id'] = session_id
            sessions.append(recs[i])
        else:
            session_id +=1 # up the session id
            recs[i]['session_id'] = session_id
            sessions.append(recs[i])
    
    # convert back to df, merge with original df 
    session_df = pd.DataFrame.from_records(sessions)
    merged_df = pd.merge(temp_df, session_df, on = ['index', time_col, activity_col], how = 'left')
    
    # clean up the dataframe
    del merged_df['index']
    
    return merged_df
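
The row-by-row loop above can be slow on large dataframes. As a rough sketch of an alternative, the same rule (a new session starts when the activity changes or the gap reaches the threshold) can be expressed with pandas column operations. The function name `session_ids_vectorized` and the sample data are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def session_ids_vectorized(df, time_col, activity_col, threshold=10):
    # Sort so that rows of one activity are contiguous and ordered in time.
    out = df.sort_values(by=[activity_col, time_col]).reset_index(drop=True)
    # Seconds elapsed since the previous row (NaN for the first row).
    gap = out[time_col].diff().dt.total_seconds()
    # True on the first row and wherever the activity differs from the previous row.
    changed = out[activity_col] != out[activity_col].shift()
    # A new session starts on an activity change or when the gap reaches the threshold.
    new_session = changed | (gap >= threshold)
    # Running count of session starts gives ids 1, 2, 3, ...
    out['session_id'] = new_session.cumsum()
    return out

# Hypothetical sample data, not taken from the dataset.
example = pd.DataFrame({
    'date_time': pd.to_datetime([
        '2017-06-30 13:51:15', '2017-06-30 13:51:16', '2017-06-30 13:52:00',
        '2017-06-30 14:00:00', '2017-06-30 14:00:05',
    ]),
    'activity': ['walking', 'walking', 'walking', 'running', 'running'],
})
labelled = session_ids_vectorized(example, 'date_time', 'activity')
```

In this sample the two running rows share a session, the first two walking rows share another, and the third walking row (44 seconds later) starts a new one.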
    

- Here we generate the session ids and assign the result back to the variable **`dat`**.
- Finally, we convert it to an **`SFrame`**, a Turi Create data type similar to a pandas dataframe.

In [5]:
dat = generate_session_ids(dat, 'date_time', 'activity')
dat = dat[['session_id', 'activity', 'acceleration_x', 'acceleration_y', 'acceleration_z', 'gyro_x', 'gyro_y', 'gyro_z']]
dat = tc.SFrame(dat)

## **Train the Model**

The following is the same code as in the example. We have replaced the session_id argument and target argument with the appropriate column names in our new dataframe.

In [6]:
train, test = tc.activity_classifier.util.random_split_by_session(dat, session_id='session_id', fraction=0.8)
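
The point of splitting by session is that every row of a given session lands on the same side of the split, so the test set contains only sessions the model has not seen. The sketch below illustrates that idea in plain Python. It is not how Turi Create implements `random_split_by_session`; the helper `split_by_session`, its `seed` parameter, and the sample rows are assumptions for illustration.

```python
import random

def split_by_session(rows, session_key='session_id', fraction=0.8, seed=0):
    # Collect the distinct session ids in first-seen order.
    session_ids = list(dict.fromkeys(row[session_key] for row in rows))
    # Shuffle the ids (not the rows) with a fixed seed for reproducibility.
    rng = random.Random(seed)
    rng.shuffle(session_ids)
    # Assign the first `fraction` of the shuffled sessions to the training side.
    n_train = int(round(fraction * len(session_ids)))
    train_ids = set(session_ids[:n_train])
    train = [row for row in rows if row[session_key] in train_ids]
    test = [row for row in rows if row[session_key] not in train_ids]
    return train, test

# Hypothetical data: 10 sessions with 3 rows each.
rows = [{'session_id': s, 'acceleration_x': 0.1 * i}
        for s in range(1, 11) for i in range(3)]
train_rows, test_rows = split_by_session(rows)
```

Because whole sessions are assigned to one side, no session id appears in both `train_rows` and `test_rows`.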


In [7]:
model = tc.activity_classifier.create(train, session_id='session_id', target='activity', prediction_window=20)


Using GPU to create model (CUDA)
+---------------------+---------------------+---------------------+---------------------+---------------------+---------------------+
| Iteration           | Train Accuracy      | Train Loss          | Validation Accuracy | Validation Loss     | Elapsed Time        |
+---------------------+---------------------+---------------------+---------------------+---------------------+---------------------+
| 1                   | 0.457               | 0.855               | 0.939               | 0.239               | 0.1                 | 
| 2                   | 0.709               | 0.374               | 0.846               | 0.255               | 0.2                 | 
| 3                   | 0.767               | 0.320               | 1.000               | 0.185               | 0.3                 | 
| 4                   | 0.851               | 0.253               | 1.000               | 0.098               | 0.4                 | 
| 5                   | 0

In [8]:
metrics = model.evaluate(test)

In [9]:
metrics

{'accuracy': 1.0,
 'auc': 0.9999999999999984,
 'precision': 1.0,
 'recall': 1.0,
 'f1_score': 1.0,
 'log_loss': 0.006585248164550634,
 'confusion_matrix': Columns:
 	target_label	str
 	predicted_label	str
 	count	int
 
 Rows: 2
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |   running    |     running     |  4237 |
 |   walking    |     walking     |  9285 |
 +--------------+-----------------+-------+
 [2 rows x 3 columns],
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+--------------------+-----+------+------+
 | threshold |        fpr         | tpr |  p   |  n   |
 +-----------+--------------------+-----+------+------+
 |    0.0    |        1.0         | 1.0 | 9285 | 4237 |
 |   1e-05   |        1.0         | 1.0 | 9285 | 4237 |
 |   2e-05   |        1.0         | 1.0 | 9285 | 4237 |
 |   3e-05   |        1.0         