## EY Datawave Challenge Code

**Simple rule**: 
- all "df_xx" types are pd.DataFrame
- "xx_data" are usually NumPy arrays

# What is in this Version:

Earlier versions made a prediction for every trajectory separately. That approach can be misleading, because the goal is to predict **each person's position between 15:00 and 16:00, not in any other time period.** It may therefore be better to group all trajectories of the same person into one row.

Features used in this version:

1. Total time elapsed
2. Distance from the park center (last point)
3. Inside the park center (entry point of the last trajectory)
4. Inside the park center (exit point over the earlier trajectories)
5. Average velocity
6. Average bearing (maybe the deviation angle from the straight line between the starting point and the park center?)
7. Velocity of the last trajectory
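
To make the grouping idea concrete, here is a minimal sketch on a toy table. The column names mirror the dataset, but the `toy` frame and its values are invented for illustration, and the named-aggregation syntax assumes a reasonably recent pandas.

```python
import pandas as pd

# Toy trajectories: two people, identified by 'hash' (values are made up).
toy = pd.DataFrame({
    'hash': ['a', 'a', 'a', 'b', 'b'],
    'time_entry_seconds': [100.0, 400.0, 900.0, 50.0, 700.0],
    'time_exit_seconds': [300.0, 800.0, 1000.0, 600.0, 750.0],
})

# One row per person: first entry time, last exit time, and the elapsed time.
per_person = toy.groupby('hash').agg(
    time_entry=('time_entry_seconds', 'first'),
    time_exit=('time_exit_seconds', 'last'),
)
per_person['total_time'] = per_person['time_exit'] - per_person['time_entry']
print(per_person)
```

Each person ends up as a single row, which is the shape the rest of the notebook works toward.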


In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.utils import to_categorical
from sklearn import preprocessing
from tensorflow.keras import backend as K
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score
import math

pd.set_option('display.max_columns', None)

# Useful tip taken from here

https://machinelearningmastery.com/display-deep-learning-model-training-history-in-keras/

In [2]:
# # fix random seed for reproducibility
# seed = 7
# np.random.seed(seed)

# Create a Callback

In [3]:
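class MyCallback(keras.callbacks.Callback):
    """Stop training once the training accuracy reaches 98%."""
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        # The metric key is 'acc' in older Keras versions and 'accuracy' in newer ones.
        acc = logs.get('acc', logs.get('accuracy'))
        if acc is not None and acc >= 0.98:
            print("Reached 98% acc so cancelling training!")
            self.model.stop_training = True

# Note: despite the variable name, the threshold is 98%, not 90%.
reach_90acc = MyCallback()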

# Metric for F1

https://medium.com/@thongonary/how-to-compute-f1-score-for-each-epoch-in-keras-a1acd17715a2

In [4]:
class F1(keras.callbacks.Callback):
    def __init__(self, val_data):
        super().__init__()
        self.validation_data = val_data
        
    def on_train_begin(self, logs=None):
        self.val_f1s = []
        self.val_recalls = []
        self.val_precisions = []

    def on_epoch_end(self, epoch, logs=None):
        # Round the sigmoid outputs to hard 0/1 predictions and flatten them to 1-D.
        val_predict = np.asarray(self.model.predict(self.validation_data[0])).round().ravel()
        val_targ = self.validation_data[1]
        _val_f1 = f1_score(val_targ, val_predict)
        _val_recall = recall_score(val_targ, val_predict)
        _val_precision = precision_score(val_targ, val_predict)
        self.val_f1s.append(_val_f1)
        self.val_recalls.append(_val_recall)
        self.val_precisions.append(_val_precision)
        print("- val_f1: %f - val_precision: %f - val_recall: %f" % (_val_f1, _val_precision, _val_recall))
        return
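
For reference, the quantities the callback reports can be reproduced by hand. The sketch below uses plain NumPy instead of scikit-learn; `binary_f1` is a hypothetical helper written for this illustration, and the sample labels and probabilities are made up.

```python
import numpy as np

def binary_f1(y_true, y_prob):
    """Precision, recall and F1 for binary labels, after rounding probabilities."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_prob).ravel().round()
    tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

precision, recall, f1 = binary_f1([1, 0, 1, 1], [0.9, 0.2, 0.4, 0.7])
print(precision, recall, f1)
```

Here one positive sample is missed (probability 0.4 rounds to 0), so precision stays at 1 while recall drops to 2/3.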

# Bearing Calculation Function

https://gist.github.com/jeromer/2005586

In [5]:
def calculate_initial_compass_bearing(pointA, pointB):
    """
    Calculates the bearing between two points.
    The formula used is the following:
        θ = atan2(sin(Δlong).cos(lat2),
                  cos(lat1).sin(lat2) − sin(lat1).cos(lat2).cos(Δlong))
    :Parameters:
      - `pointA`: The tuple representing the latitude/longitude for the
        first point. Latitude and longitude must be in decimal degrees
      - `pointB`: The tuple representing the latitude/longitude for the
        second point. Latitude and longitude must be in decimal degrees
    :Returns:
      The bearing in degrees
    :Returns Type:
      float
    """
    if (type(pointA) != tuple) or (type(pointB) != tuple):
        raise TypeError("Only tuples are supported as arguments")

    lat1 = math.radians(pointA[0])
    lat2 = math.radians(pointB[0])

    diffLong = math.radians(pointB[1] - pointA[1])

    x = math.sin(diffLong) * math.cos(lat2)
    y = math.cos(lat1) * math.sin(lat2) - (math.sin(lat1)
            * math.cos(lat2) * math.cos(diffLong))

    initial_bearing = math.atan2(x, y)

    # Now we have the initial bearing but math.atan2 return values
    # from -180° to + 180° which is not what we want for a compass bearing
    # The solution is to normalize the initial bearing as shown below
    initial_bearing = math.degrees(initial_bearing)
    compass_bearing = (initial_bearing + 360) % 360

    return compass_bearing
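
The function above expects latitude/longitude in decimal degrees, while the notebook later passes the `x_entry`/`y_entry` columns to it. Those values are in the millions, so they look like projected (planar) coordinates; if that assumption holds, a planar bearing may be a closer fit. The sketch below is only an illustration of that alternative: `planar_bearing` is a hypothetical helper and is not used anywhere in this notebook.

```python
import math

def planar_bearing(point_a, point_b):
    """Bearing in degrees from point_a to point_b for planar (x, y) points.

    0 means the +y direction and angles grow clockwise, like a compass.
    """
    dx = point_b[0] - point_a[0]
    dy = point_b[1] - point_a[1]
    # atan2(dx, dy) measures the angle from the +y axis, clockwise.
    return (math.degrees(math.atan2(dx, dy)) + 360) % 360

print(planar_bearing((0.0, 0.0), (1.0, 0.0)))
```

Moving along +x gives a bearing of about 90 degrees, and moving along -x gives about 270 degrees.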

# Read the Data

The training DataFrame holds the training data together with the label columns.

In [6]:
#read training data
raw_train = pd.read_csv("/Users/Godwithus/Desktop/EY/data_train.csv", low_memory=False) #nrows = integer
raw_train = raw_train.loc[:,'hash':'y_exit']
raw_train.fillna('', inplace=True)

#read test data
raw_test = pd.read_csv("/Users/Godwithus/Desktop/EY/data_test.csv", low_memory=False)
raw_test = raw_test.loc[:,'hash':'y_exit']
raw_test.fillna('', inplace=True)

# Before Grouping, perform some common tasks

In [7]:
#time to seconds
df_train = raw_train
df_train['time_entry_seconds'] = pd.to_timedelta(df_train['time_entry']).dt.total_seconds()
df_train['time_exit_seconds']=pd.to_timedelta(df_train['time_exit']).dt.total_seconds()

df_test = raw_test
df_test['time_entry_seconds'] = pd.to_timedelta(df_test['time_entry']).dt.total_seconds()
df_test['time_exit_seconds']=pd.to_timedelta(df_test['time_exit']).dt.total_seconds()
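
The conversion above turns a clock time such as `15:08:27` into seconds since midnight. A minimal pure-Python sketch of the same arithmetic, where `hms_to_seconds` is a hypothetical helper used only for illustration:

```python
def hms_to_seconds(hms):
    """Convert an 'HH:MM:SS' string to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hms.split(':'))
    return hours * 3600 + minutes * 60 + seconds

print(hms_to_seconds('15:08:27'))  # 54507
```

On this scale the prediction window 15:00 to 16:00 corresponds to 54000 to 57600 seconds.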

## A quick look at the data

In [8]:
#debugging

# print (df_train.info())
# print (df_test.info())

# df_train.head()

## Group the Dataset

In particular, for each person, store the last trajectory's data in a separate DataFrame for later use.

In [9]:
#for train data

last_traj_train = df_train.groupby('hash').last()

count_one_traj = df_train.groupby('hash').count()

count_one_traj = count_one_traj[count_one_traj['trajectory_id']==1]
one_traj_train = last_traj_train.loc[count_one_traj.index]

df_train = df_train.merge(last_traj_train, how='left', indicator=True)
df_train = df_train[(df_train['_merge'] == 'left_only')]



last_traj_train = last_traj_train.merge(one_traj_train, how='left', indicator=True).set_index(last_traj_train.index)
last_traj_train = last_traj_train[(last_traj_train['_merge'] == 'left_only')]


# Summary

- last_traj_train: the last trajectory of each hash (except hashes with only 1 trajectory)
- last_traj_test: same as above, for the test set
- df_train: all trajectories except the last one of each hash (hashes with only 1 trajectory are not included)
- df_test: same, for the test set
- one_traj_train: the trajectory of each hash that has only 1 trajectory
- one_traj_test: same, for the test set
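
The three-way split can also be expressed with boolean masks. This is a sketch of an alternative formulation on invented toy data, not the merge-based code used in this notebook; it assumes the rows of each hash are already in chronological order.

```python
import pandas as pd

# Toy data: 'a' has 3 trajectories, 'b' has 1, 'c' has 2 (values are made up).
toy = pd.DataFrame({
    'hash': ['a', 'a', 'a', 'b', 'c', 'c'],
    'trajectory_id': ['a_0', 'a_1', 'a_2', 'b_0', 'c_0', 'c_1'],
})

# Number of trajectories of the hash each row belongs to.
counts = toy.groupby('hash')['trajectory_id'].transform('count')
# True on the last row of each hash.
is_last = ~toy.duplicated('hash', keep='last')

one_traj = toy[counts == 1]               # hashes with a single trajectory
last_traj = toy[is_last & (counts > 1)]   # last trajectory of the other hashes
earlier = toy[~is_last]                   # everything before the last trajectory
print(len(one_traj), len(last_traj), len(earlier))
```

Every row lands in exactly one of the three frames, which matches the roles described in the summary.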

In [10]:
print("last_traj_train length: ", len(last_traj_train))
print("df_train length: ", len(df_train))
print("one_traj_train length: ", len(one_traj_train))

last_traj_train length:  132753
df_train length:  680199
one_traj_train length:  1310


# Prepare the Training Data - 1: With Multiple Trajectories

Choose the features:

Convert the time values to floats (total seconds).

Finally, store train_data as a NumPy array and normalize it.

Features:
1. Total time elapsed (non-last-traj)
2. within the park center (last-traj entry point)
3. distance from park center (last traj entry point)
4. Total distance traveled (non-last-traj)
5. Average Velocity (non-last-traj)
6. Average Bearing (non-last-traj)
7. distance from park center boundaries (last traj entry point)
8. time stayed (last-traj)
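
Feature 7, the distance from the center boundaries, is the distance from a point to an axis-aligned rectangle. A compact NumPy sketch of that computation follows; the bounds are the thresholds used throughout this notebook, while `distance_to_center_box` is a hypothetical helper name introduced for illustration.

```python
import numpy as np

# Center rectangle bounds, taken from the thresholds used in this notebook.
X_MIN, X_MAX = 3750901.5068, 3770901.5068
Y_MIN, Y_MAX = -19268905.6133, -19208905.6133

def distance_to_center_box(x, y):
    """Euclidean distance from (x, y) to the rectangle; 0 for points inside it."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Per-axis overshoot beyond the nearest edge, clipped at 0 inside the range.
    dx = np.maximum(np.maximum(X_MIN - x, x - X_MAX), 0.0)
    dy = np.maximum(np.maximum(Y_MIN - y, y - Y_MAX), 0.0)
    return np.hypot(dx, dy)

print(distance_to_center_box([3760000.0, X_MAX + 3.0], [-19240000.0, Y_MAX + 4.0]))
```

Points beside an edge get the straight distance to that edge, and points beyond a corner get the distance to the corner, so the nine cases collapse into one expression.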


In [11]:
# 0. Prepare required stats in each trajectory (seconds)

aggregation = {
    'time_entry_seconds': 'first',
    'time_exit_seconds': 'last',
    'x_entry': 'first',
    'y_entry': 'first',
    'x_exit': 'last',
    'y_exit': 'last'
}

df_train_traj = df_train.groupby('hash').agg(aggregation)

df_train_traj.columns = ['time_entry','time_exit','x_entry','y_entry','x_exit','y_exit']

In [12]:
# 1. total time elapsed (seconds)

df_train_traj['total_time'] = df_train_traj['time_exit'] - df_train_traj['time_entry']

# 2. prepare whether entry point of last trajectory is in cityhall

x_in_city = (last_traj_train['x_entry'] >=3750901.5068) & (last_traj_train['x_entry']<=3770901.5068)
y_in_city = (last_traj_train['y_entry'] >= -19268905.6133) & (last_traj_train['y_entry'] <= -19208905.6133)

last_traj_train['entry_inside'] = 1*(x_in_city & y_in_city)

# 3.0 time stayed in last trajectory

last_traj_train['total_time']=last_traj_train['time_exit_seconds']-last_traj_train['time_entry_seconds']

# 3. the distance from the entry point of last trajectory from the city hall's mid point

last_traj_train['distance_from_center'] = ((3760901.5068 - last_traj_train['x_entry']).pow(2) + \
                        (-19238905.6133 - last_traj_train['y_entry']).pow(2)).pow(1/2)

# 4. total distance traveled

df_train_traj['total_travel'] = ((df_train_traj['x_exit'] - df_train_traj['x_entry']).pow(2) + \
                                 (df_train_traj['y_exit'] - df_train_traj['y_entry']).pow(2)).pow(1/2)

# distance from city hall boundaries

last_traj_train.loc[(last_traj_train['x_entry'] >=3750901.5068) & (last_traj_train['x_entry']<=3770901.5068) & (last_traj_train['y_entry'] >= -19268905.6133) & (last_traj_train['y_entry'] <= -19208905.6133), 'distance_2'] = 0
last_traj_train.loc[(last_traj_train['x_entry'] <3750901.5068) & (last_traj_train['y_entry'] >= -19268905.6133) & (last_traj_train['y_entry'] <= -19208905.6133), 'distance_2'] = 3750901.5068 - last_traj_train['x_entry']
last_traj_train.loc[(last_traj_train['x_entry']>3770901.5068) & (last_traj_train['y_entry'] >= -19268905.6133) & (last_traj_train['y_entry'] <= -19208905.6133), 'distance_2'] = last_traj_train['x_entry'] - 3770901.5068
last_traj_train.loc[(last_traj_train['x_entry'] >=3750901.5068) & (last_traj_train['x_entry']<=3770901.5068) & (last_traj_train['y_entry'] < -19268905.6133), 'distance_2'] = -19268905.6133 - last_traj_train['y_entry']
last_traj_train.loc[(last_traj_train['x_entry'] >=3750901.5068) & (last_traj_train['x_entry']<=3770901.5068) & (last_traj_train['y_entry'] > -19208905.6133), 'distance_2'] = last_traj_train['y_entry'] + 19208905.6133
last_traj_train.loc[(last_traj_train['x_entry']>3770901.5068) & (last_traj_train['y_entry'] > -19208905.6133), 'distance_2'] = ((3770901.5068 - last_traj_train['x_entry']).pow(2) + (-19208905.6133 - last_traj_train['y_entry']).pow(2)).pow(1/2)
last_traj_train.loc[(last_traj_train['x_entry'] <3750901.5068) & (last_traj_train['y_entry'] > -19208905.6133), 'distance_2'] = ((3750901.5068 - last_traj_train['x_entry']).pow(2) + (-19208905.6133 - last_traj_train['y_entry']).pow(2)).pow(1/2)
last_traj_train.loc[(last_traj_train['x_entry']>3770901.5068) & (last_traj_train['y_entry'] < -19268905.6133), 'distance_2'] = ((3770901.5068 - last_traj_train['x_entry']).pow(2) + (-19268905.6133 - last_traj_train['y_entry']).pow(2)).pow(1/2)
last_traj_train.loc[(last_traj_train['x_entry'] <3750901.5068) & (last_traj_train['y_entry'] < -19268905.6133), 'distance_2'] = ((3750901.5068 - last_traj_train['x_entry']).pow(2) + (-19268905.6133 - last_traj_train['y_entry']).pow(2)).pow(1/2)


# 5. Avg. Velocity

df_train_traj['Avg_velocity'] = df_train_traj['total_travel'] / df_train_traj['total_time']

# 6. Avg. Bearing

a = []
for i in range(len(df_train_traj['x_entry'].values)):
    a.append(calculate_initial_compass_bearing((df_train_traj['x_entry'].values[i], df_train_traj['y_entry'].values[i]) , \
                                 (df_train_traj['x_exit'].values[i],  df_train_traj['y_exit'].values[i])))

bearing = np.array(a)


df_bearing = pd.DataFrame(bearing, columns = ['bearing'])

df_bearing.index = df_train_traj.index

df_train_traj = df_train_traj.merge(df_bearing, left_index=True, right_index=True)

df_train_traj.head()


| hash | time_entry | time_exit | x_entry | y_entry | x_exit | y_exit | total_time | total_travel | Avg_velocity | bearing |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0000a8602cf2def930488dee7cdad104_1 | 25471.0 | 52689.0 | 3751014.0 | -19093980.0 | 3744909.0 | -19285580.0 | 27218.0 | 191701.58092 | 7.043191 | 80.402028 |
| 0000cf177130469eeac79f67b6bcf3df_9 | 52163.0 | 53414.0 | 3749450.0 | -19265060.0 | 3749042.0 | -19266320.0 | 1251.0 | 1333.516162 | 1.06596 | 12.281473 |
| 0001f97b99a80f18f62e2d44e54ef33d_3 | 41826.0 | 42728.0 | 3771461.0 | -19104130.0 | 3757004.0 | -19296980.0 | 902.0 | 193384.920615 | 214.395699 | 74.334681 |
| 0002124248b0ca510dea42824723ccac_31 | 35768.0 | 53941.0 | 3765544.0 | -19172270.0 | 3768391.0 | -19202110.0 | 18173.0 | 29978.550018 | 1.64962 | 179.197578 |
| 000219c2a6380c307e8bffd85b5e404b_23 | 4.0 | 41934.0 | 3760336.0 | -19228180.0 | 3763808.0 | -19269950.0 | 41930.0 | 41914.389647 | 0.999628 | 192.881739 |


In [13]:
# 7. merge with last traj

last_data_train = last_traj_train.loc[:,['entry_inside', 'distance_from_center', 'total_time','distance_2']]

df_train_traj = df_train_traj.merge(last_data_train, on='hash', how = 'outer')

df_train_traj.fillna(0, inplace=True) #fill nan

df_train_traj.describe()

|  | time_entry | time_exit | x_entry | y_entry | x_exit | y_exit | total_time_x | total_travel | Avg_velocity | bearing | entry_inside | distance_from_center | total_time_y | distance_2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 132753.0 | 132753.0 | 132753.0 | 132753.0 | 132753.0 | 132753.0 | 132753.0 | 132753.0 | 132753.0 | 132753.0 | 132753.0 | 132753.0 | 132753.0 | 132753.0 |
| mean | 25671.715231 | 47152.948596 | 3760196.0 | -19218660.0 | 3760506.0 | -19221830.0 | 21481.233366 | 46518.375814 | 6.572706 | 174.281534 | 0.288438 | 62183.017208 | 415.021935 | 37461.906914 |
| std | 14277.562355 | 8422.139774 | 9682.846 | 87083.25 | 9100.093 | 75669.43 | 14764.266453 | 58107.275576 | 22.367276 | 109.366522 | 0.453038 | 47815.543665 | 701.342283 | 42188.151837 |
| min | 0.0 | 2.0 | 3741031.0 | -19382910.0 | 3741031.0 | -19376750.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 104.246457 | 0.0 | 0.0 |
| 25% | 15854.0 | 44951.0 | 3753232.0 | -19287730.0 | 3755239.0 | -19275400.0 | 9187.0 | 2037.273253 | 0.130824 | 79.913785 | 0.0 | 20893.499773 | 0.0 | 0.0 |
| 50% | 26892.0 | 50432.0 | 3759926.0 | -19228190.0 | 3760218.0 | -19230090.0 | 20191.0 | 21806.366827 | 1.214724 | 177.474637 | 0.0 | 50628.212851 | 0.0 | 19758.639223 |
| 75% | 35527.0 | 52740.0 | 3768090.0 | -19147020.0 | 3767704.0 | -19171740.0 | 30496.0 | 72184.101843 | 4.438598 | 265.866438 | 1.0 | 96127.656481 | 594.0 | 65610.421929 |
| max | 53998.0 | 53999.0 | 3777055.0 | -19042660.0 | 3776987.0 | -19046660.0 | 53984.0 | 326050.480531 | 1639.609127 | 359.998695 | 1.0 | 192032.134746 | 19140.0 | 161960.898405 |
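
The `total_time_x` and `total_time_y` columns in this summary come from the merge: both frames carry a `total_time` column, so pandas appends its default `_x`/`_y` suffixes. A toy sketch with invented values, using `hash` as an ordinary column for simplicity:

```python
import pandas as pd

# Aggregated earlier trajectories (left) and last-trajectory features (right).
earlier = pd.DataFrame({'hash': ['a', 'b'], 'total_time': [300.0, 120.0]})
last = pd.DataFrame({'hash': ['a'], 'total_time': [60.0], 'entry_inside': [1]})

# Outer merge keeps every hash; the clashing 'total_time' columns get suffixes.
merged = earlier.merge(last, on='hash', how='outer')
merged = merged.fillna(0)  # hashes missing on one side get 0 instead of NaN
print(merged.columns.tolist())
```

In this notebook, `total_time_x` is therefore the elapsed time over the earlier trajectories and `total_time_y` is the time spent in the last trajectory.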


# Normalize & Make it NumPy

In [14]:
#make a numpy array

train_data=df_train_traj.loc[:,['total_time_y','entry_inside',
                                'total_time_x','distance_from_center', 'total_travel','Avg_velocity',
                                'bearing','distance_2']].values 
                                                #'total_time_x','distance_from_center', 'total_travel'

min_max_scaler = preprocessing.MinMaxScaler()
normalized_col = min_max_scaler.fit_transform(train_data[:,[0,2,3,4,5,6,7]])

train_data = np.concatenate((train_data[:,[1]],normalized_col), axis = 1)

df_train_data = pd.DataFrame(train_data)

df_train_data.columns = ['entry_inside','total_time_y',
                        'total_time_x','distance_from_center', 'total_travel',
                         'Avg_velocity','bearing','distance_2']
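
With its default settings, `MinMaxScaler` rescales each column to the range [0, 1] as `(x - min) / (max - min)`. The sketch below reproduces that arithmetic in NumPy and reuses the training statistics for other data, which is a common practice rather than something this notebook does; `fit_min_max` and `apply_min_max` are hypothetical helpers and the arrays are made up.

```python
import numpy as np

def fit_min_max(train):
    """Return the per-column minimum and range computed from the training array."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero for constant columns
    return col_min, col_range

def apply_min_max(data, col_min, col_range):
    """Scale data with statistics that were computed elsewhere (e.g. on train)."""
    return (data - col_min) / col_range

train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
test = np.array([[2.5, 40.0]])

col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
test_scaled = apply_min_max(test, col_min, col_range)
print(test_scaled)
```

Values outside the training range can fall outside [0, 1] (here 40.0 maps to 1.5), which is expected when the statistics come from the training set only.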

# prepare Train Labels

Prepare the labels for training:

A person counts as inside the center only when both x_exit and y_exit of the last trajectory fall within the center's range. Evaluate each comparison and store the result as 0 or 1 in the train_label NumPy array.
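
The labeling rule can be written as one small function. This is a sketch: the bounds are the thresholds used in this notebook, while `inside_center` and the sample points are introduced here for illustration only.

```python
import numpy as np

# Center rectangle bounds, taken from the thresholds used in this notebook.
X_MIN, X_MAX = 3750901.5068, 3770901.5068
Y_MIN, Y_MAX = -19268905.6133, -19208905.6133

def inside_center(x, y):
    """Return 1 where (x, y) lies inside the center rectangle, else 0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    inside = (x >= X_MIN) & (x <= X_MAX) & (y >= Y_MIN) & (y <= Y_MAX)
    return inside.astype(int)

labels = inside_center([3760000.0, 3740000.0], [-19240000.0, -19240000.0])
print(labels)
```

The first point lies inside the rectangle and the second is too far to the left, so the labels are 1 and 0.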

In [15]:
#prepare training label

target_x = (last_traj_train['x_exit']>=3750901.5068) & (last_traj_train['x_exit']<=3770901.5068)
target_y = (last_traj_train['y_exit']>=-19268905.6133) & (last_traj_train['y_exit']<=-19208905.6133)

train_label = 1*(target_x & target_y)
df_train_data['train_label'] = train_label.values


train_label = train_label.values

# train_label = to_categorical(train_label)
df_train_data

|  | entry_inside | total_time_y | total_time_x | distance_from_center | total_travel | Avg_velocity | bearing | distance_2 | train_label |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.050261 | 0.504186 | 0.238078 | 0.587951 | 4.295653e-03 | 0.223340 | 0.087889 | 0 |
| 1 | 0.0 | 0.091745 | 0.023174 | 0.153702 | 0.004090 | 6.501307e-04 | 0.034115 | 0.011198 | 0 |
| 2 | 0.0 | 0.141902 | 0.016709 | 0.713536 | 0.593113 | 1.307602e-01 | 0.206486 | 0.660867 | 0 |
| 3 | 0.0 | 0.000000 | 0.336637 | 0.319040 | 0.091945 | 1.006106e-03 | 0.497773 | 0.191035 | 0 |
| 4 | 0.0 | 0.000000 | 0.776712 | 0.092470 | 0.128552 | 6.096744e-04 | 0.535785 | 0.020133 | 0 |
| 5 | 0.0 | 0.021996 | 0.024526 | 0.354405 | 0.060514 | 9.088875e-03 | 0.357494 | 0.227082 | 0 |
| 6 | 0.0 | 0.000000 | 0.366108 | 0.332647 | 0.016583 | 1.668544e-04 | 0.741575 | 0.201252 | 0 |
| 7 | 1.0 | 0.000000 | 0.507854 | 0.017468 | 0.105559 | 7.656598e-04 | 0.139462 | 0.000000 | 1 |
| 8 | 0.0 | 0.036886 | 0.086896 | 0.468260 | 0.119135 | 5.050297e-03 | 0.993451 | 0.368086 | 0 |
| 9 | 0.0 | 0.012226 | 0.138949 | 0.196482 | 0.101626 | 2.694195e-03 | 0.598447 | 0.038955 | 0 |


# Custom F1 loss function

In [16]:
def f1_loss(y_true, y_pred):
    
    tp = K.sum(K.cast(y_true*y_pred, 'float'), axis=0)
    tn = K.sum(K.cast((1-y_true)*(1-y_pred), 'float'), axis=0)
    fp = K.sum(K.cast((1-y_true)*y_pred, 'float'), axis=0)
    fn = K.sum(K.cast(y_true*(1-y_pred), 'float'), axis=0)

    p = tp / (tp + fp + K.epsilon())
    r = tp / (tp + fn + K.epsilon())

    f1 = 2*p*r / (p+r+K.epsilon())
    f1 = tf.where(tf.math.is_nan(f1), tf.zeros_like(f1), f1)  # tf.is_nan is not available in TensorFlow 2.x
    return 1 - K.mean(f1)
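
The loss above is a "soft" F1: it multiplies labels with predicted probabilities instead of counting rounded predictions, which keeps it differentiable. The NumPy sketch below mirrors that arithmetic for a single output column so the numbers can be checked by hand; `soft_f1_loss` and the `eps` default are assumptions made for this illustration, not part of Keras.

```python
import numpy as np

def soft_f1_loss(y_true, y_prob, eps=1e-7):
    """1 - soft F1, computed from labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    tp = np.sum(y_true * y_prob)          # soft true positives
    fp = np.sum((1 - y_true) * y_prob)    # soft false positives
    fn = np.sum(y_true * (1 - y_prob))    # soft false negatives
    p = tp / (tp + fp + eps)
    r = tp / (tp + fn + eps)
    f1 = 2 * p * r / (p + r + eps)
    return 1 - f1

print(soft_f1_loss([1, 0, 1], [1.0, 0.0, 1.0]))  # close to 0 for perfect predictions
print(soft_f1_loss([1, 0, 1], [0.0, 1.0, 0.0]))  # 1.0 for completely wrong predictions
```

The small `eps` terms keep the divisions defined when there are no positive labels or no positive predictions.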

In [17]:
print(len(train_data), len(train_label))

132753 132753


# Keras NN model -1 : Multi Trajectories

A single sigmoid output with binary_crossentropy loss. *The loss, optimizer, and layers can still be improved.*

In [18]:
# train_data = train_data.reshape(134063, 8,1)
#define model
model_multi = keras.Sequential([
    keras.layers.Flatten(),
    keras.layers.Dense(100,activation='relu'),
    keras.layers.Dense(100, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

#compile the model
model_multi.compile(optimizer='Adam',
              loss='binary_crossentropy',
              metrics=['accuracy']) 

#fit the model
# f1 = F1((test_data_eval, test_label))
history = model_multi.fit(train_data, train_label, epochs=10, \
                     callbacks=[reach_90acc]) #, validation_data=(test_data_eval, test_label)

Epoch 1/10 through Epoch 10/10 (per-epoch loss and accuracy were not preserved in this output)


# Prepare the Training Data - 2: With only 1 Trajectory

Choose the features:

Convert the time values to floats (total seconds).

Finally, store one_train_data as a NumPy array and normalize it.

Features:
1. Total time elapsed
2. within the park center
3. distance from park center
4. distance from park boundaries

In [19]:
# 1. total time elapsed (seconds)

one_traj_train['total_time'] = one_traj_train['time_exit_seconds'] - one_traj_train['time_entry_seconds']

# 2. prepare whether entry point is in cityhall

x_in_city = (one_traj_train['x_entry'] >=3750901.5068) & (one_traj_train['x_entry']<=3770901.5068)
y_in_city = (one_traj_train['y_entry'] >= -19268905.6133) & (one_traj_train['y_entry'] <= -19208905.6133)

one_traj_train['entry_inside'] = 1*(x_in_city & y_in_city)


# 3. the distance from the entry point of last trajectory from the city hall's mid point

one_traj_train['distance_from_center'] = ((3760901.5068 - one_traj_train['x_entry']).pow(2) + \
                        (-19238905.6133 - one_traj_train['y_entry']).pow(2)).pow(1/2)


# 4. distance from city hall boundaries

one_traj_train.loc[(one_traj_train['x_entry'] >=3750901.5068) & (one_traj_train['x_entry']<=3770901.5068) & (one_traj_train['y_entry'] >= -19268905.6133) & (one_traj_train['y_entry'] <= -19208905.6133), 'distance_2'] = 0
one_traj_train.loc[(one_traj_train['x_entry'] <3750901.5068) & (one_traj_train['y_entry'] >= -19268905.6133) & (one_traj_train['y_entry'] <= -19208905.6133), 'distance_2'] = 3750901.5068 - one_traj_train['x_entry']
one_traj_train.loc[(one_traj_train['x_entry']>3770901.5068) & (one_traj_train['y_entry'] >= -19268905.6133) & (one_traj_train['y_entry'] <= -19208905.6133), 'distance_2'] = one_traj_train['x_entry'] - 3770901.5068
one_traj_train.loc[(one_traj_train['x_entry'] >=3750901.5068) & (one_traj_train['x_entry']<=3770901.5068) & (one_traj_train['y_entry'] < -19268905.6133), 'distance_2'] = -19268905.6133 - one_traj_train['y_entry']
one_traj_train.loc[(one_traj_train['x_entry'] >=3750901.5068) & (one_traj_train['x_entry']<=3770901.5068) & (one_traj_train['y_entry'] > -19208905.6133), 'distance_2'] = one_traj_train['y_entry'] + 19208905.6133
one_traj_train.loc[(one_traj_train['x_entry']>3770901.5068) & (one_traj_train['y_entry'] > -19208905.6133), 'distance_2'] = ((3770901.5068 - one_traj_train['x_entry']).pow(2) + (-19208905.6133 - one_traj_train['y_entry']).pow(2)).pow(1/2)
one_traj_train.loc[(one_traj_train['x_entry'] <3750901.5068) & (one_traj_train['y_entry'] > -19208905.6133), 'distance_2'] = ((3750901.5068 - one_traj_train['x_entry']).pow(2) + (-19208905.6133 - one_traj_train['y_entry']).pow(2)).pow(1/2)
one_traj_train.loc[(one_traj_train['x_entry']>3770901.5068) & (one_traj_train['y_entry'] < -19268905.6133), 'distance_2'] = ((3770901.5068 - one_traj_train['x_entry']).pow(2) + (-19268905.6133 - one_traj_train['y_entry']).pow(2)).pow(1/2)
one_traj_train.loc[(one_traj_train['x_entry'] <3750901.5068) & (one_traj_train['y_entry'] < -19268905.6133), 'distance_2'] = ((3750901.5068 - one_traj_train['x_entry']).pow(2) + (-19268905.6133 - one_traj_train['y_entry']).pow(2)).pow(1/2)

one_traj_train.fillna(0, inplace=True) #fill nan

one_traj_train.head()

| hash | trajectory_id | time_entry | time_exit | vmax | vmin | vmean | x_entry | y_entry | x_exit | y_exit | time_entry_seconds | time_exit_seconds | total_time | entry_inside | distance_from_center | distance_2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 003701810d0b3e69732ec05654769d16_9 | traj_003701810d0b3e69732ec05654769d16_9_3 | 15:08:27 | 15:13:36 | NaN | NaN | NaN | 3759756.0 | -19119220.0 | 3758231.0 | -19141590.0 | 54507.0 | 54816.0 | 309.0 | 0 | 119689.372246 | 0.0 |
| 00713e9907826da5e8b6ddefcba51770_3 | traj_00713e9907826da5e8b6ddefcba51770_3_0 | 15:24:03 | 15:24:26 | NaN | NaN | NaN | 3742580.0 | -19287320.0 | 3742601.0 | -19294080.0 | 55443.0 | 55466.0 | 23.0 | 0 | 51763.129529 | 20205.37347 |
| 0082a8207f0a1240210aa02f14921394_5 | traj_0082a8207f0a1240210aa02f14921394_5_0 | 14:59:04 | 15:09:00 | NaN | NaN | NaN | 3758777.0 | -19123660.0 | 3760882.0 | -19138780.0 | 53944.0 | 54540.0 | 596.0 | 0 | 115263.171889 | 0.0 |
| 00c04d3c0b35f93429fe0fc32eff21f8_29 | traj_00c04d3c0b35f93429fe0fc32eff21f8_29_0 | 14:59:15 | 15:03:13 | NaN | NaN | NaN | 3743534.0 | -19286860.0 | 3744788.0 | -19292940.0 | 53955.0 | 54193.0 | 238.0 | 0 | 51001.305232 | 19406.048727 |
| 00dd3e54b391a940c425e38d2236d404_11 | traj_00dd3e54b391a940c425e38d2236d404_11_0 | 15:07:24 | 15:13:46 | 67.1175 | 67.1175 | 67.1175 | 3745866.0 | -19208580.0 | 3744994.0 | -19286400.0 | 54444.0 | 54826.0 | 382.0 | 0 | 33845.185181 | 5045.454186 |


# Normalize, NumPy-ize

In [20]:
#make a numpy array

one_train_data=one_traj_train.loc[:,['entry_inside','distance_from_center','distance_2','total_time']].values 

min_max_scaler = preprocessing.MinMaxScaler()
normalized_col = min_max_scaler.fit_transform(one_train_data[:,[1,2,3]])

one_train_data = np.concatenate((one_train_data[:,[0]],normalized_col), axis = 1)

df_one_train_data = pd.DataFrame(one_train_data)

df_one_train_data.columns = ['entry_inside','distance_from_center','distance_2','total_time']

# prepare Train Labels - 2


In [21]:
#prepare training label

target_x = (one_traj_train['x_exit']>=3750901.5068) & (one_traj_train['x_exit']<=3770901.5068)
target_y = (one_traj_train['y_exit']>=-19268905.6133) & (one_traj_train['y_exit']<=-19208905.6133)

one_train_label = 1*(target_x & target_y)
df_one_train_data['train_label'] = one_train_label.values


one_train_label = one_train_label.values

# train_label = to_categorical(train_label)
df_one_train_data

|  | entry_inside | distance_from_center | distance_2 | total_time | train_label |
| --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.622794 | 0.000000 | 0.050999 | 0 |
| 1 | 0.0 | 0.267576 | 0.155150 | 0.003796 | 0 |
| 2 | 0.0 | 0.599648 | 0.000000 | 0.098366 | 0 |
| 3 | 0.0 | 0.263592 | 0.149012 | 0.039280 | 0 |
| 4 | 0.0 | 0.173874 | 0.038742 | 0.063047 | 0 |
| 5 | 0.0 | 0.253754 | 0.134378 | 0.000000 | 0 |
| 6 | 0.0 | 0.254921 | 0.145084 | 0.035319 | 0 |
| 7 | 0.0 | 0.277329 | 0.169134 | 0.016339 | 0 |
| 8 | 0.0 | 0.272181 | 0.161445 | 0.000000 | 0 |
| 9 | 0.0 | 0.174255 | 0.040162 | 0.351708 | 1 |


# Keras NN model -2 : Single Trajectory

A single sigmoid output with binary_crossentropy loss. *The loss, optimizer, and layers can still be improved.*

In [22]:
#define model
model_single = keras.Sequential([
    keras.layers.Flatten(),
    keras.layers.Dense(100, activation='relu'),
    keras.layers.Dense(100, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

#compile the model
model_single.compile(optimizer='Adam',
              loss='binary_crossentropy',
              metrics=['accuracy']) 

#fit the model
# f1 = F1((test_data_eval, test_label))
history = model_single.fit(one_train_data, one_train_label, epochs=100, \
                     callbacks=[reach_90acc]) #, validation_data=(test_data_eval, test_label)

Epoch 1/100 through Epoch 100/100 (per-epoch loss and accuracy were not preserved in this output)


# Prepare the Test Data

Choose the features:

*The velocity for the prediction data looks weird.*

Convert the time values to floats (total seconds).

Finally, store test_data_pred and test_data_eval as NumPy arrays, and normalize them.

## Group the Dataset

In particular, for each person, store the last trajectory's data in a separate DataFrame for later use.

In [23]:
#for test data

last_traj_test = df_test.groupby('hash').last()

count_one_traj = df_test.groupby('hash').count()

count_one_traj = count_one_traj[count_one_traj['trajectory_id']==1]
one_traj_test = last_traj_test.loc[count_one_traj.index]

df_test = df_test.merge(last_traj_test, how='left', indicator=True)
df_test = df_test[(df_test['_merge'] == 'left_only')]

last_traj_test = last_traj_test.merge(one_traj_test, how='left', indicator=True).set_index(last_traj_test.index)
last_traj_test = last_traj_test[(last_traj_test['_merge'] == 'left_only')]


# Test Data -1: Multi Trajectories

In [24]:
# 0. Prepare required stats in each trajectory (seconds)

aggregation = {
    'time_entry_seconds': 'first',
    'time_exit_seconds': 'last',
    'x_entry': 'first',
    'y_entry': 'first',
    'x_exit': 'last',
    'y_exit': 'last'
}

df_test_traj = df_test.groupby('hash').agg(aggregation)

df_test_traj.columns = ['time_entry','time_exit','x_entry','y_entry','x_exit','y_exit']

df_test_traj.head()

| hash | time_entry | time_exit | x_entry | y_entry | x_exit | y_exit |
| --- | --- | --- | --- | --- | --- | --- |
| 00032f51796fd5437b238e3a9823d13d_31 | 42197.0 | 49393.0 | 3773413.0 | -19098280.0 | 3773131.0 | -19144650.0 |
| 000479418b5561ab694a2870cc04fd43_25 | 29303.0 | 44004.0 | 3771380.0 | -19332740.0 | 3769983.0 | -19342650.0 |
| 000506a39775e5bca661ac80e3f466eb_29 | 31505.0 | 40142.0 | 3760880.0 | -19100420.0 | 3755349.0 | -19161350.0 |
| 0005401ceddaf27a9b7f0d42ef1fbe95_1 | 33003.0 | 34659.0 | 3751328.0 | -19162360.0 | 3751349.0 | -19162840.0 |
| 00063a4f6c12e1e4de7d876580620667_3 | 31718.0 | 52502.0 | 3747364.0 | -19278460.0 | 3766296.0 | -19170290.0 |


In [25]:
# 1. total time elapsed (seconds)

df_test_traj['total_time'] = df_test_traj['time_exit'] - df_test_traj['time_entry']

# 2. prepare whether entry point of last trajectory is in cityhall

x_in_city = (last_traj_test['x_entry'] >=3750901.5068) & (last_traj_test['x_entry']<=3770901.5068)
y_in_city = (last_traj_test['y_entry'] >= -19268905.6133) & (last_traj_test['y_entry'] <= -19208905.6133)

last_traj_test['entry_inside'] = 1*(x_in_city & y_in_city)

# 3. the distance from the entry point of last trajectory from the city hall's mid point

last_traj_test['distance_from_center'] = ((3760901.5068 - last_traj_test['x_entry']).pow(2) + \
                        (-19238905.6133 - last_traj_test['y_entry']).pow(2)).pow(1/2)

# 3.0 time stayed in last trajectory

last_traj_test['total_time']=last_traj_test['time_exit_seconds']-last_traj_test['time_entry_seconds']

# 4. total distance traveled

df_test_traj['total_travel'] = ((df_test_traj['x_exit'] - df_test_traj['x_entry']).pow(2) + \
                                 (df_test_traj['y_exit'] - df_test_traj['y_entry']).pow(2)).pow(1/2)

# distance from city hall boundaries

last_traj_test.loc[(last_traj_test['x_entry'] >=3750901.5068) & (last_traj_test['x_entry']<=3770901.5068) & (last_traj_test['y_entry'] >= -19268905.6133) & (last_traj_test['y_entry'] <= -19208905.6133), 'distance_2'] = 0
last_traj_test.loc[(last_traj_test['x_entry'] <3750901.5068) & (last_traj_test['y_entry'] >= -19268905.6133) & (last_traj_test['y_entry'] <= -19208905.6133), 'distance_2'] = 3750901.5068 - last_traj_test['x_entry']
last_traj_test.loc[(last_traj_test['x_entry']>3770901.5068) & (last_traj_test['y_entry'] >= -19268905.6133) & (last_traj_test['y_entry'] <= -19208905.6133), 'distance_2'] = last_traj_test['x_entry'] - 3770901.5068
last_traj_test.loc[(last_traj_test['x_entry'] >=3750901.5068) & (last_traj_test['x_entry']<=3770901.5068) & (last_traj_test['y_entry'] < -19268905.6133), 'distance_2'] = -19268905.6133 - last_traj_test['y_entry']
last_traj_test.loc[(last_traj_test['x_entry'] >=3750901.5068) & (last_traj_test['x_entry']<=3770901.5068) & (last_traj_test['y_entry'] > -19208905.6133), 'distance_2'] = last_traj_test['y_entry'] + 19208905.6133
last_traj_test.loc[(last_traj_test['x_entry']>3770901.5068) & (last_traj_test['y_entry'] > -19208905.6133), 'distance_2'] = ((3770901.5068 - last_traj_test['x_entry']).pow(2) + (-19208905.6133 - last_traj_test['y_entry']).pow(2)).pow(1/2)
last_traj_test.loc[(last_traj_test['x_entry'] <3750901.5068) & (last_traj_test['y_entry'] > -19208905.6133), 'distance_2'] = ((3750901.5068 - last_traj_test['x_entry']).pow(2) + (-19208905.6133 - last_traj_test['y_entry']).pow(2)).pow(1/2)
last_traj_test.loc[(last_traj_test['x_entry']>3770901.5068) & (last_traj_test['y_entry'] < -19268905.6133), 'distance_2'] = ((3770901.5068 - last_traj_test['x_entry']).pow(2) + (-19268905.6133 - last_traj_test['y_entry']).pow(2)).pow(1/2)
last_traj_test.loc[(last_traj_test['x_entry'] <3750901.5068) & (last_traj_test['y_entry'] < -19268905.6133), 'distance_2'] = ((3750901.5068 - last_traj_test['x_entry']).pow(2) + (-19268905.6133 - last_traj_test['y_entry']).pow(2)).pow(1/2)


# 5. Avg. Velocity

df_test_traj['Avg_velocity'] = df_test_traj['total_travel'] / df_test_traj['total_time']


# 6. Avg. Bearing

a = []
for i in range(len(df_test_traj['x_entry'].values)):
    a.append(calculate_initial_compass_bearing((df_test_traj['x_entry'].values[i], df_test_traj['y_entry'].values[i]) , \
                                 (df_test_traj['x_exit'].values[i],  df_test_traj['y_exit'].values[i])))

bearing = np.array(a)


df_bearing = pd.DataFrame(bearing, columns = ['bearing'])

df_bearing.index = df_test_traj.index

df_test_traj = df_test_traj.merge(df_bearing, left_index=True, right_index=True)

df_test_traj.head()

| hash | time_entry | time_exit | x_entry | y_entry | x_exit | y_exit | total_time | total_travel | Avg_velocity | bearing |
|---|---|---|---|---|---|---|---|---|---|---|
| 00032f51796fd5437b238e3a9823d13d_31 | 42197.0 | 49393.0 | 3773413.0 | -19098280.0 | 3773131.0 | -19144650.0 | 7196.0 | 46372.36472 | 6.444186 | 61.231058 |
| 000479418b5561ab694a2870cc04fd43_25 | 29303.0 | 44004.0 | 3771380.0 | -19332740.0 | 3769983.0 | -19342650.0 | 14701.0 | 10012.111189 | 0.68105 | 6.510434 |
| 000506a39775e5bca661ac80e3f466eb_29 | 31505.0 | 40142.0 | 3760880.0 | -19100420.0 | 3755349.0 | -19161350.0 | 8637.0 | 61184.374019 | 7.083984 | 94.14922 |
| 0005401ceddaf27a9b7f0d42ef1fbe95_1 | 33003.0 | 34659.0 | 3751328.0 | -19162360.0 | 3751349.0 | -19162840.0 | 1656.0 | 484.498442 | 0.292572 | 134.438556 |
| 00063a4f6c12e1e4de7d876580620667_3 | 31718.0 | 52502.0 | 3747364.0 | -19278460.0 | 3766296.0 | -19170290.0 | 20784.0 | 109821.492536 | 5.283944 | 2.462068 |
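
The average velocity in step 5 divides total_travel by total_time. If a person's total_time is 0, pandas returns inf (or NaN for 0/0), and fillna(0) only removes the NaN. Whether such rows exist in this data is not shown here, so the following is only a sketch on made-up values of how non-finite velocities could be zeroed before scaling:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_test_traj; the values are invented for illustration.
df = pd.DataFrame({
    'total_travel': [100.0, 0.0, 50.0],
    'total_time':   [20.0,  0.0,  0.0],
})

# Plain division gives NaN for 0/0 and inf for 50/0.
raw = df['total_travel'] / df['total_time']

# Map inf to NaN, then fill with 0 so later min-max scaling stays finite.
df['Avg_velocity'] = raw.replace([np.inf, -np.inf], np.nan).fillna(0)

print(df)
```

Treating a zero-duration trajectory as velocity 0 is a modelling assumption, not something the notebook states.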


In [26]:
# 7. merge with last traj

last_data_test = last_traj_test.loc[:,['entry_inside', 'distance_from_center', 'total_time', 'distance_2']]

df_test_traj = df_test_traj.merge(last_data_test, on='hash', how = 'outer')

df_test_traj.fillna(0, inplace=True)
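
Both frames carry a total_time column, so the merge renames them with pandas' default suffixes: total_time_x (per-person total) and total_time_y (last trajectory). A small sketch on invented data shows this, along with how the outer join plus fillna(0) behaves for hashes present in only one frame. It joins on the index directly, assuming hash is the index of both frames:

```python
import pandas as pd

# Toy frames indexed by 'hash'; all values are invented for illustration.
per_person = pd.DataFrame(
    {'total_time': [7196.0, 14701.0], 'total_travel': [46372.4, 10012.1]},
    index=pd.Index(['a', 'b'], name='hash'),
)
last_traj = pd.DataFrame(
    {'entry_inside': [1, 0], 'total_time': [300.0, 120.0]},
    index=pd.Index(['b', 'c'], name='hash'),
)

# Outer join keeps hashes found in either frame; the overlapping column
# gets the default suffixes '_x' (left) and '_y' (right).
merged = per_person.merge(last_traj, left_index=True, right_index=True, how='outer')

# Rows missing from one side are NaN after the join; replace them with 0.
merged = merged.fillna(0)

print(merged.columns.tolist())
print(merged)
```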

# Normalize & Make it NumPy

In [27]:
#make Numpy

test_data=df_test_traj.loc[:,['total_time_y','entry_inside',
                              'total_time_x','distance_from_center', 'total_travel','Avg_velocity',
                              'bearing','distance_2']].values


normalized_col = min_max_scaler.fit_transform(test_data[:,[0,2,3,4,5,6,7]])

test_data = np.concatenate((test_data[:,[1]],normalized_col), axis = 1)

df_test_data = pd.DataFrame(test_data)

df_test_data.columns = ['entry_inside','total_time_y',
                        'total_time_x','distance_from_center', 'total_travel',
                         'Avg_velocity','bearing','distance_2']
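
Note that fit_transform re-fits the scaler on the test features, so the test data is scaled by its own min and max rather than by the training data's. If the intent is to give the model inputs on the same scale it was trained on, the usual pattern is to fit on the training data and only transform the test data (with scikit-learn: fit on train, then transform on test). A NumPy-only sketch of that split, with hypothetical helper names and made-up numbers:

```python
import numpy as np

def fit_min_max(train):
    """Learn per-column minimum and range from the training array."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid dividing by zero on constant columns
    return col_min, col_range

def apply_min_max(data, col_min, col_range):
    """Scale data with statistics learned elsewhere (e.g. on the training set)."""
    return (data - col_min) / col_range

# Made-up arrays standing in for the train and test feature matrices.
train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
test = np.array([[2.5, 40.0]])

col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
test_scaled = apply_min_max(test, col_min, col_range)

print(test_scaled)
```

With this split, test values outside the training range fall outside [0, 1], which is expected.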

# Prepare the Test Data - 2: With only 1 Trajectory

Choose the features.

Change the time values into floats (total seconds).

Finally, store the test data as a NumPy array and normalize it.

Features:
1. Total time elapsed
2. Whether the entry point is within the park center
3. Distance from park center
4. Distance from park boundaries

In [28]:
# 1. total time elapsed (seconds)

one_traj_test['total_time'] = one_traj_test['time_exit_seconds'] - one_traj_test['time_entry_seconds']

# 2. prepare whether entry point is in cityhall

x_in_city = (one_traj_test['x_entry'] >=3750901.5068) & (one_traj_test['x_entry']<=3770901.5068)
y_in_city = (one_traj_test['y_entry'] >= -19268905.6133) & (one_traj_test['y_entry'] <= -19208905.6133)

one_traj_test['entry_inside'] = 1*(x_in_city & y_in_city)


# 3. the distance from the entry point of last trajectory from the city hall's mid point

one_traj_test['distance_from_center'] = ((3760901.5068 - one_traj_test['x_entry']).pow(2) + \
                        (-19238905.6133 - one_traj_test['y_entry']).pow(2)).pow(1/2)


# 4. distance from city hall boundaries

one_traj_test.loc[(one_traj_test['x_entry'] >=3750901.5068) & (one_traj_test['x_entry']<=3770901.5068) & (one_traj_test['y_entry'] >= -19268905.6133) & (one_traj_test['y_entry'] <= -19208905.6133), 'distance_2'] = 0
one_traj_test.loc[(one_traj_test['x_entry'] <3750901.5068) & (one_traj_test['y_entry'] >= -19268905.6133) & (one_traj_test['y_entry'] <= -19208905.6133), 'distance_2'] = 3750901.5068 - one_traj_test['x_entry']
one_traj_test.loc[(one_traj_test['x_entry']>3770901.5068) & (one_traj_test['y_entry'] >= -19268905.6133) & (one_traj_test['y_entry'] <= -19208905.6133), 'distance_2'] = one_traj_test['x_entry'] - 3770901.5068
one_traj_test.loc[(one_traj_test['x_entry'] >=3750901.5068) & (one_traj_test['x_entry']<=3770901.5068) & (one_traj_test['y_entry'] < -19268905.6133), 'distance_2'] = -19268905.6133 - one_traj_test['y_entry']
one_traj_test.loc[(one_traj_test['x_entry'] >=3750901.5068) & (one_traj_test['x_entry']<=3770901.5068) & (one_traj_test['y_entry'] > -19208905.6133), 'distance_2'] = one_traj_test['y_entry'] + 19208905.6133
one_traj_test.loc[(one_traj_test['x_entry']>3770901.5068) & (one_traj_test['y_entry'] > -19208905.6133), 'distance_2'] = ((3770901.5068 - one_traj_test['x_entry']).pow(2) + (-19208905.6133 - one_traj_test['y_entry']).pow(2)).pow(1/2)
one_traj_test.loc[(one_traj_test['x_entry'] <3750901.5068) & (one_traj_test['y_entry'] > -19208905.6133), 'distance_2'] = ((3750901.5068 - one_traj_test['x_entry']).pow(2) + (-19208905.6133 - one_traj_test['y_entry']).pow(2)).pow(1/2)
one_traj_test.loc[(one_traj_test['x_entry']>3770901.5068) & (one_traj_test['y_entry'] < -19268905.6133), 'distance_2'] = ((3770901.5068 - one_traj_test['x_entry']).pow(2) + (-19268905.6133 - one_traj_test['y_entry']).pow(2)).pow(1/2)
one_traj_test.loc[(one_traj_test['x_entry'] <3750901.5068) & (one_traj_test['y_entry'] < -19268905.6133), 'distance_2'] = ((3750901.5068 - one_traj_test['x_entry']).pow(2) + (-19268905.6133 - one_traj_test['y_entry']).pow(2)).pow(1/2)

one_traj_test.fillna(0, inplace=True) #fill nan
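
The case-by-case assignments in step 4 compute the distance from a point to an axis-aligned rectangle. The same quantity can be sketched in one vectorized expression by clamping the horizontal and vertical overshoot at zero. The function name below is hypothetical; the bounds are the ones used in the cell above:

```python
import numpy as np

# Rectangle bounds copied from the cell above.
X_MIN, X_MAX = 3750901.5068, 3770901.5068
Y_MIN, Y_MAX = -19268905.6133, -19208905.6133

def distance_to_rectangle(x, y):
    """Distance from (x, y) to the rectangle; 0 for points inside it."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # How far the point lies outside the rectangle along each axis (0 if within).
    dx = np.maximum(np.maximum(X_MIN - x, x - X_MAX), 0.0)
    dy = np.maximum(np.maximum(Y_MIN - y, y - Y_MAX), 0.0)
    # Beside an edge only one of dx, dy is non-zero; past a corner both are.
    return np.hypot(dx, dy)

# Inside, 100 units left of the rectangle, and past the top-right corner.
xs = np.array([3760901.5068, X_MIN - 100.0, X_MAX + 3.0])
ys = np.array([-19238905.6133, -19238905.6133, Y_MAX + 4.0])
distances = distance_to_rectangle(xs, ys)
print(distances)
```

Under these assumptions it could be applied as distance_to_rectangle(one_traj_test['x_entry'], one_traj_test['y_entry']), covering every case in a single pass with no rows left as NaN.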


# Normalize, NumPy-ize

In [29]:
#make a numpy array

one_test_data=one_traj_test.loc[:,['entry_inside','distance_from_center','distance_2','total_time']].values 

normalized_col = min_max_scaler.fit_transform(one_test_data[:,[1,2,3]])

one_test_data = np.concatenate((one_test_data[:,[0]],normalized_col), axis = 1)

df_one_test_data = pd.DataFrame(one_test_data)

df_one_test_data.columns = ['entry_inside','distance_from_center','distance_2','total_time']

# Evaluation of the Models

Print each model's summary and its accuracy on the training data.

In [30]:
#evaluate the accuracy of the model
model_multi.summary()

model_single.summary()

train_loss, train_acc = model_multi.evaluate(train_data, train_label)

print('Train accuracy of Multi Traj Model:', train_acc)

one_train_loss, one_train_acc = model_single.evaluate(one_train_data, one_train_label)

print('Train accuracy of Single Traj Model:', one_train_acc)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 8)                 0         
_________________________________________________________________
dense (Dense)                (None, 100)               900       
_________________________________________________________________
dense_1 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 101       
Total params: 11,101
Trainable params: 11,101
Non-trainable params: 0
_________________________________________________________________
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 4)                 0         
_________________________________________________________________
dense_

# Plot the model's Learning Curve

https://machinelearningmastery.com/display-deep-learning-model-training-history-in-keras/

In [31]:
# # summarize history for acc
# plt.plot(history.history['acc'])
# plt.plot(history.history['val_acc'])
# plt.title('model accuracy')
# plt.ylabel('accuracy')
# plt.xlabel('epoch')
# plt.legend(['train', 'test'], loc='upper left')
# plt.show()

# # summarize history for loss
# plt.plot(history.history['loss'])
# plt.plot(history.history['val_loss'])
# plt.title('model loss')
# plt.ylabel('loss')
# plt.xlabel('epoch')
# plt.legend(['train', 'test'], loc='upper left')
# plt.show()

# # summarize history for f1, recall, precision
# plt.plot(f1.val_f1s)
# plt.plot(f1.val_recalls)
# plt.plot(f1.val_precisions)
# plt.title('f1')
# plt.ylabel('f1')
# plt.xlabel('epoch')
# plt.legend(['f1', 'recall','precision'], loc='upper left')
# plt.show()

# Predict the Data

Predict on the test data: if the predicted probability p(xi) is at least 0.5, save it as 1, otherwise 0. predictions is the NumPy array storing the result. Build a pd.DataFrame from the 'trajectory_id' column and predictions ('target') so that the output DataFrame is in ['id', 'target'] format. This is done separately for the multi-trajectory and single-trajectory models, and the two outputs are then combined.

In [45]:
###################prediction for Multi Traj#######################
predictions = model_multi.predict(test_data)

predictions = (predictions >= 0.5) *1


id_multi = pd.DataFrame(last_traj_test['trajectory_id'])

target_multi = pd.DataFrame(predictions)
# target.columns = ['zeros','target']
# target = target['target']
target_multi.columns = ['target']

output_multi = pd.concat([id_multi.reset_index(drop=True),target_multi.reset_index(drop=True)], axis=1)
output_multi.columns = ['id', 'target']


###################prediction for Single Traj#######################
predictions = model_single.predict(one_test_data)

predictions = (predictions >= 0.5) *1


id_single = pd.DataFrame(one_traj_test['trajectory_id'])

target_single = pd.DataFrame(predictions)
# target.columns = ['zeros','target']
# target = target['target']
target_single.columns = ['target']

output_single = pd.concat([id_single.reset_index(drop=True),target_single.reset_index(drop=True)], axis=1)
output_single.columns = ['id', 'target']


####################append single and multi########################

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
output = pd.concat([output_multi, output_single]).reset_index(drop = True)

output.to_csv("/Users/Godwithus/Desktop/EY/Hased.csv", index=False)

output
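
The thresholding and assembly steps above can be checked on toy data. This is a sketch with made-up probabilities and ids, assuming the model returns an array of shape (n, 1) as a single-unit output layer would:

```python
import numpy as np
import pandas as pd

# Made-up sigmoid outputs with shape (n, 1) and matching made-up ids.
probs = np.array([[0.91], [0.50], [0.12]])
ids = pd.Series(['traj_a', 'traj_b', 'traj_c'], name='trajectory_id')

# Threshold at 0.5 (inclusive) and flatten to a 1-D array of 0/1 labels.
labels = (probs >= 0.5).astype(int).ravel()

# Pair ids and labels by position; both must have the same length and order.
toy_output = pd.DataFrame({'id': ids.values, 'target': labels})

print(toy_output)
```

Because ids and labels are paired by position, the id column has to come from rows in the same order as the feature matrix passed to predict.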

In [33]:
#debugging
print("test_data", test_data.shape)
print("one_test_data", one_test_data.shape)
print("predictions", predictions.shape)
print("id_multi", id_multi.shape)
print("id_single", id_single.shape)
print("target_multi", target_multi.shape)
print("target_single", target_single.shape)
print("output", output.shape)

print(output)

# fraction of trajectories predicted as 1
print(output['target'].mean())