In [1]:
import numpy as np
import pandas as pd
import json
import urllib.request
import datetime

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, Dropout, LSTM, SimpleRNN
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

Using TensorFlow backend.


In [2]:
with urllib.request.urlopen('https://challenges.unify.id/v1/mle/user_4a438fdede4e11e9b986acde48001122.json') as url:
    data = json.loads(url.read().decode())
data.keys()

dict_keys(['user_data', 'user_label', 'next'])

Here, I follow each response's 'next' link through the URLs to gather all users and their data

In [3]:
user_ls = []
user_data = []
user_next = 'user_4a438fdede4e11e9b986acde48001122.json'

while True:
    
    if user_next is None:
        break
    
    with urllib.request.urlopen('https://challenges.unify.id/v1/mle/'+user_next) as url:
        data = json.loads(url.read().decode())
    
    user_ls.append(data['user_label'])
    user_data.append(data['user_data'])
    user_next = data['next']
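
The crawl above is a linked-list traversal: each response names the next file until 'next' is null. A minimal offline sketch of the same pattern is shown here, assuming a hypothetical in-memory `FAKE_PAGES` dict and a `fetch` helper that stand in for the HTTP request (neither is part of the challenge API):

```python
# Hypothetical stand-in for the API: each "page" holds one user and the name of the next page.
FAKE_PAGES = {
    'user_a.json': {'user_label': 'a', 'user_data': [['attempt-1']], 'next': 'user_b.json'},
    'user_b.json': {'user_label': 'b', 'user_data': [['attempt-2']], 'next': 'user_c.json'},
    'user_c.json': {'user_label': 'c', 'user_data': [], 'next': None},
}

def fetch(name):
    """Assumed helper: returns the decoded JSON for one page (no network access here)."""
    return FAKE_PAGES[name]

def crawl(first_page, fetch_page):
    """Follow the 'next' field from page to page until it is None."""
    labels, data = [], []
    next_page = first_page
    while next_page is not None:
        page = fetch_page(next_page)
        labels.append(page['user_label'])
        data.append(page['user_data'])
        next_page = page['next']
    return labels, data

labels, data = crawl('user_a.json', fetch)
print(labels)
```

Passing the fetch function in as an argument keeps the traversal testable without touching the network.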

This is a function that verifies if an entered string is valid or not

In [4]:
def string_verifier(char_ls):
    string = ''
    for char in char_ls:
        if char['character'] == '[backspace]':
            string = string[:-1]
        else:
            string += char['character']
    return string == 'Be Authentic. Be Yourself. Be Typing.'
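
To illustrate how backspaces are handled, here is a small self-contained sketch of the same replay logic. The names `replay_keystrokes` and `to_keystrokes` and the sample attempts are made up for illustration; only the 'character' key and the '[backspace]' token come from the data.

```python
TARGET = 'Be Authentic. Be Yourself. Be Typing.'

def replay_keystrokes(char_ls):
    """Rebuild the typed text, treating '[backspace]' as deleting the last character."""
    typed = []
    for char in char_ls:
        if char['character'] == '[backspace]':
            if typed:
                typed.pop()
        else:
            typed.append(char['character'])
    return ''.join(typed)

def to_keystrokes(keys):
    """Wrap plain key names in the dict shape used by the challenge data."""
    return [{'character': key} for key in keys]

# A clean attempt, an attempt with a corrected typo, and an attempt with a wrong capital
clean = to_keystrokes(list(TARGET))
corrected = to_keystrokes(list('Be Authh') + ['[backspace]'] + list(TARGET[len('Be Auth'):]))
wrong_case = to_keystrokes(list(TARGET.replace('Be Yourself', 'be Yourself')))

for name, attempt in [('clean', clean), ('corrected', corrected), ('wrong_case', wrong_case)]:
    print(name, replay_keystrokes(attempt) == TARGET)
```

The corrected attempt still counts as valid because only the final text is compared, while the lowercase 'b' variant is rejected.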
    

Using the function above, I scan through all the data and use the string verifier to determine how many of each user's typing attempts are valid. I then store those counts in a dictionary and remove all unverified entries from the data

In [5]:
valid_entries = {}
user_data_cleaned = []
for i in range(len(user_data)):
    user_data_cleaned_entry = []
    for j in range(len(user_data[i])):
        if string_verifier(user_data[i][j]):
            if i in valid_entries.keys():
                valid_entries[i] = valid_entries[i] + 1
            else:
                valid_entries[i] = 1
            user_data_cleaned_entry.append(user_data[i][j])
    user_data_cleaned.append(user_data_cleaned_entry)
user_data = user_data_cleaned

Below, I identify the unqualified users: those with fewer than 300 valid entries

In [6]:
unqualified_users = [user_ls[key] for key in valid_entries.keys() if valid_entries[key]<300]
unqualified_users

['4a4587d0de4e11e9857aacde48001122',
 '4a47beecde4e11e9aea4acde48001122',
 '4a47f568de4e11e98afcacde48001122',
 '4a48cf62de4e11e9a654acde48001122',
 '4a4cab0ade4e11e99c31acde48001122']

I also remove those users so that they are no longer included in the data

In [7]:
user_ls = [user_ls[key] for key in valid_entries.keys() if valid_entries[key]>=300]
user_data = [user_data[key] for key in valid_entries.keys() if valid_entries[key]>=300]
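
The counting and filtering steps can be sketched on toy data as follows. The user names, the validity flags and the lowered threshold `MIN_VALID` are assumptions chosen to keep the example small. One detail worth noting: in the cells above, a user with no valid attempts never receives a key in `valid_entries`, so such a user would be dropped without appearing in the unqualified list. Counting by label for every user, as below, makes that case explicit.

```python
MIN_VALID = 2  # the notebook uses 300; a small value keeps the toy example readable

# Hypothetical data: user label -> whether each attempt passed the string check
attempt_validity = {
    'user_a': [True, True, False, True],
    'user_b': [True, False, False],
    'user_c': [],
}

# Count valid attempts for every user, including users with none
valid_counts = {user: sum(flags) for user, flags in attempt_validity.items()}

qualified = [user for user, count in valid_counts.items() if count >= MIN_VALID]
unqualified = [user for user, count in valid_counts.items() if count < MIN_VALID]

print(valid_counts)
print(qualified, unqualified)
```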

Here, I aggregate my data into a dataframe. I also create a column of time differences: the microseconds in the timedelta between consecutive keystrokes. I feel this representation is better because there is no need to store the entire datetime, and it should be more efficient to train on.

In [8]:
user_col = []
user_char_col = []
user_time_col = []
user_time_diffs_col = []

for i in range(len(user_ls)):
    
    user = user_ls[i]
    data = user_data[i]
    
    for j in range(len(data)):
        user_col.append(user)
        user_char_col.append([keystroke['character'] for keystroke in data[j]])
        times = pd.to_datetime([keystroke['typed_at'] for keystroke in data[j]])
        user_time_col.append(times)
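        # Gap before each keystroke; note that .microseconds is only the sub-second component of the timedelta, so a gap of one second or more would wrap around
        user_time_diffs_col.append([0]+[(times[k+1]-times[k]).microseconds for k in range(len(times)-1)])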

data = {'User': user_col, 'Characters': user_char_col, 'Times': user_time_col, 'Time_Differences': user_time_diffs_col} 

df = pd.DataFrame(data) 
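
As a side note on the time differences: the `.microseconds` attribute of a timedelta holds only the sub-second part (0 to 999999), so a pause longer than one second would lose its whole seconds. The sketch below uses only the standard library and made-up timestamps to show one way of getting the full gap in microseconds. The helper name `inter_key_gaps_us` is hypothetical, and the notebook itself does not do this.

```python
from datetime import datetime, timedelta

def inter_key_gaps_us(timestamps):
    """Return the gap before each keystroke in whole microseconds (0 for the first key)."""
    times = [datetime.fromisoformat(stamp) for stamp in timestamps]
    gaps = [0]
    for earlier, later in zip(times, times[1:]):
        delta = later - earlier
        # Integer division by one microsecond keeps days and seconds, unlike delta.microseconds
        gaps.append(delta // timedelta(microseconds=1))
    return gaps

# Hypothetical timestamps; the third keystroke follows a pause longer than one second
stamps = [
    '2019-06-30 15:05:44.743893',
    '2019-06-30 15:05:44.757838',
    '2019-06-30 15:05:46.000000',
]
print(inter_key_gaps_us(stamps))
```

With these inputs the last gap is 1242162 microseconds, whereas `.microseconds` alone would report 242162.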

In [9]:
df.head()

Unnamed: 0,User,Characters,Times,Time_Differences
0,4a438fdede4e11e9b986acde48001122,"[B, e, , A, u, t, h, e, n, t, i, c, ., , b, ...","DatetimeIndex(['2019-06-30 15:05:44.743893', '...","[0, 13945, 1499, 1758, 15259, 9911, 5716, 1029..."
1,4a438fdede4e11e9b986acde48001122,"[B, e, , A, u, t, h, e, n, t, i, c, ., , B, ...","DatetimeIndex(['2018-10-07 15:05:44.873524', '...","[0, 6543, 16507, 4790, 12739, 1711, 2104, 2453..."
2,4a438fdede4e11e9b986acde48001122,"[B, e, , A, u, t, h, h, [backspace], e, n, t,...","DatetimeIndex(['2018-12-09 15:05:44.858437', '...","[0, 12474, 16723, 2809, 14726, 9322, 2371, 717..."
3,4a438fdede4e11e9b986acde48001122,"[B, e, , A, u, t, h, e, n, t, i, c, ., , b, ...","DatetimeIndex(['2018-10-29 15:05:44.726315', '...","[0, 20640, 5449, 9855, 3401, 9706, 7410, 15984..."
4,4a438fdede4e11e9b986acde48001122,"[B, e, , A, u, t, h, h, [backspace], e, n, t,...","DatetimeIndex(['2019-02-10 15:05:44.852374', '...","[0, 8864, 4650, 9603, 9576, 8698, 3582, 2831, ..."


The baseline is calculated by the guessing approach of selecting the user with the highest number of entries every time. This would yield the percentage below:

In [10]:
df.User.value_counts().iloc[0]/df.shape[0]*100

4.571594539645658
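
The same majority-class baseline can be written without pandas. This is a sketch on toy labels, not the notebook's data, and `majority_baseline` is a name introduced here for illustration.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy (in percent) of always predicting the most frequent label."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels) * 100

# Hypothetical labels: 5 of the 10 rows belong to user 'a'
toy_labels = ['a'] * 5 + ['b'] * 3 + ['c'] * 2
print(majority_baseline(toy_labels))
```

With 35 users of roughly equal size, this figure stays close to 1/35, which is why the baseline above is under 5%.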

I also assume that the maximum number of keystrokes a user will enter is 50, which is 8 more than the actual maximum number of keystrokes a user entered in the training data.

In [11]:
# max_len = len(max(df.Time_Differences, key=lambda x:len(x)))
max_len = 50

Here, I one-hot encode the users to create targets, and also use the number of users as the number of units in each LSTM layer

In [12]:
targets = pd.get_dummies(df.User)
lstm_dim = targets.shape[1]
lstm_dim

35
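
For readers unfamiliar with one-hot targets, here is a minimal pure-Python sketch of the encoding on toy labels. The helper `one_hot` is hypothetical. It sorts the class labels, which mirrors how `pd.get_dummies` orders its columns, so column positions follow sorted label order rather than the order in which users were collected.

```python
def one_hot(labels):
    """Return (sorted class list, one-hot rows) for a list of labels."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    rows = [[1 if index[label] == i else 0 for i in range(len(classes))] for label in labels]
    return classes, rows

# Hypothetical labels, deliberately not in sorted order
classes, rows = one_hot(['c', 'a', 'b', 'a'])
print(classes)
for row in rows:
    print(row)
```

The number of columns equals the number of distinct users, which is the value stored in `lstm_dim` above.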

The code below formats the features by standardizing the time differences between keystrokes and converting the characters into their ASCII values. I also combine those two values in pairs as the tensor input for the RNN, and concatenate them for the random forest classifier.

In [14]:
features = []
rf_features = []

for i in range(df.shape[0]):
    
    time_diff = df.Time_Differences[i]
    chars = df.Characters[i]
    feature_entry = []
    rf_feature_entry = []
    
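    # Per-attempt statistics are computed once instead of on every iteration
    mean_diff = np.mean(time_diff)
    std_diff = np.std(time_diff)
    
    for j in range(len(time_diff)):
        norm_time_diff = (time_diff[j]-mean_diff)/std_diff
        # '[backspace]' is a multi-character token, so map it to the ASCII backspace code
        if chars[j] == '[backspace]':
            char_num = ord('\b')
        else:
            char_num = ord(chars[j])
    
        feature_entry.append([norm_time_diff, char_num])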
        
    feature_entry = feature_entry + [[0,0]]*(max_len-len(feature_entry))
    rf_feature_entry = [entry[0] for entry in feature_entry] + [entry[1] for entry in feature_entry]
    
    features.append(feature_entry)
    rf_features.append(rf_feature_entry)

features = np.array(features)
rf_features = np.array(rf_features)
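
The layout of one flat random forest row (padded standardized gaps first, then padded character codes) can be checked on a toy attempt. This is a standard-library sketch under stated assumptions: `build_flat_features` and the shortened `MAX_LEN` are illustrative names and values, and the zero-variance guard is an addition that the notebook does not have.

```python
import statistics

MAX_LEN = 6  # the notebook uses 50; shortened for the toy example

def build_flat_features(time_diffs, chars, max_len=MAX_LEN):
    """Z-score the gaps, encode keys as code points, zero-pad both, then concatenate."""
    mean = statistics.mean(time_diffs)
    std = statistics.pstdev(time_diffs)  # population std, matching np.std's default
    # Guard (not in the notebook): avoid dividing by zero when all gaps are equal
    scaled = [(t - mean) / std if std else 0.0 for t in time_diffs]
    codes = [ord('\b') if c == '[backspace]' else ord(c) for c in chars]
    pad = [0] * (max_len - len(scaled))
    return scaled + pad + codes + pad

# Hypothetical three-keystroke attempt
row = build_flat_features([0, 100, 200], ['B', '[backspace]', 'e'])
print(len(row))
print(row[MAX_LEN:])
```

The row always has length 2 * MAX_LEN, so attempts with different numbers of keystrokes end up with the same number of columns.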

In [15]:
len(features) == sum([1 if len(time_diff)==max_len else 0 for time_diff in features])

True

In [16]:
train_features, test_features, train_targets, test_targets = train_test_split(rf_features, targets, test_size = 0.20, random_state = 22)

Here, I tune my RandomForestClassifier on n_estimators. Given more time, I would also have tuned max_depth. Based on the results, 5 estimators gives the highest accuracy in this sweep (about 12.7%, compared to the baseline of about 4.6%).

In [40]:
accuracies = {}
for n in range(5,25):
    rf = RandomForestClassifier(n_estimators = n, n_jobs = -1, random_state = 22)

    rf.fit(train_features, train_targets)
    predictions = rf.predict(test_features)

    correct_preds = 0
    for i in range(len(predictions)):
        if (test_targets.iloc[i]==predictions[i]).all():
            correct_preds+=1

    accuracies['n_estimators:{}'.format(n)]=correct_preds/len(predictions)*100
    
accuracies

{'n_estimators:5': 12.663374963694451,
 'n_estimators:6': 9.700842288701713,
 'n_estimators:7': 12.111530641882078,
 'n_estimators:8': 9.236131280859714,
 'n_estimators:9': 11.327330816148708,
 'n_estimators:10': 9.090909090909092,
 'n_estimators:11': 10.107464420563463,
 'n_estimators:12': 8.655242521057218,
 'n_estimators:13': 9.555620098751088,
 'n_estimators:14': 8.480975893116469,
 'n_estimators:15': 9.061864652918967,
 'n_estimators:16': 8.248620389195468,
 'n_estimators:17': 9.032820214928842,
 'n_estimators:18': 8.277664827185593,
 'n_estimators:19': 8.887598024978216,
 'n_estimators:20': 8.161487075225095,
 'n_estimators:21': 8.742375835027593,
 'n_estimators:22': 8.103398199244845,
 'n_estimators:23': 8.451931455126344,
 'n_estimators:24': 7.841998257333721}
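
A possible explanation for accuracy falling as trees are added, offered as an assumption rather than something verified here: with one-hot targets the forest is fit as 35 separate yes/no outputs, and a column is predicted as 1 only when "yes" outweighs "no" for that column. When the votes are spread over several users, no column may pass that bar and the whole row becomes zeros, which counts as a miss under exact-match scoring. The sketch below uses made-up vote shares to contrast this per-column decision with a single argmax. The function names are hypothetical and no scikit-learn call is involved.

```python
def threshold_row(vote_shares, threshold=0.5):
    """Per-column decision: predict 1 only where the vote share exceeds the threshold."""
    return [1 if share > threshold else 0 for share in vote_shares]

def argmax_label(vote_shares):
    """Single-label decision: pick the column with the largest vote share."""
    return max(range(len(vote_shares)), key=lambda i: vote_shares[i])

# Hypothetical vote shares for one attempt across four users (not taken from the notebook)
votes = [0.10, 0.40, 0.30, 0.20]
print(threshold_row(votes))
print(argmax_label(votes))
```

Under this reading, fitting on a single label column (for example `df.User`) instead of the one-hot frame would always return exactly one user per attempt, which may be worth trying.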

In [85]:
rf = RandomForestClassifier(n_estimators = 5, n_jobs = -1, random_state = 22)
rf.fit(train_features, train_targets)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=5, n_jobs=-1,
                       oob_score=False, random_state=22, verbose=0,
                       warm_start=False)

I attempted to run a recurrent neural network with LSTMs; however, I was unable to make a network big enough, with enough training, given my limited resources.

In [None]:
# model = Sequential()

# model.add(LSTM((lstm_dim),batch_input_shape=(None,max_len,2),return_sequences=True))
# model.add(LSTM((lstm_dim),return_sequences=True))
# model.add(LSTM((lstm_dim),return_sequences=False))
# # model.add(SimpleRNN((lstm_dim),batch_input_shape=(None,max_len,2),return_sequences=True))
# # model.add(SimpleRNN((lstm_dim),batch_input_shape=(None,max_len,2),return_sequences=True))
# # model.add(SimpleRNN((lstm_dim),batch_input_shape=(None,max_len,2),return_sequences=False))
# model.add(Dense(lstm_dim, activation='softmax'))

# # model.layers[0].set_weights([embedding_matrix])
# # model.layers[0].trainable = False #since GloVe embeddings are pretrained

# model.compile(optimizer='Adam',loss='categorical_crossentropy',metrics=['acc'])
# history = model.fit(features,targets,epochs=100, validation_split=0.2, callbacks=[EarlyStopping(patience=5)]) 

Prototype for running entries. Note: the input format is expected to match the sample data given in the email.

In [89]:
filepath = input("Please enter the filepath of your entry")
with open(filepath) as json_file:
    data = json.load(json_file)
attempts = data['attempts']

entry_count = 0
max_len = 50

for attempt in attempts:
    entry_count+=1
    if not string_verifier(attempt):
        print('Entry {}: Invalid Input'.format(entry_count))
    else:
        times = pd.to_datetime([entry['typed_at'] for entry in attempt])
        
        time_diffs = np.array([0]+[(times[i+1]-times[i]).microseconds for i in range(len(times)-1)])
        time_diffs_normalized = (time_diffs-time_diffs.mean())/time_diffs.std()
        char_nums = [ord('\b') if entry['character'] == '[backspace]' else ord(entry['character']) for entry in attempt]
        time_diffs_padded = list(time_diffs_normalized)+[0]*(max_len-len(time_diffs_normalized))
        char_nums_padded = list(char_nums)+[0]*(max_len-len(char_nums))
        
        feature = np.array([time_diffs_padded + char_nums_padded])
        prediction = rf.predict(feature)
        if 1 not in prediction:
            print('Entry {}: Unable to distinguish user'.format(entry_count))
        else:
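            # Map the predicted column back to its label through targets.columns: get_dummies sorts its columns, so user_ls order is not guaranteed to match, and summing indices would be wrong if more than one column were 1
            user = targets.columns[int(np.argmax(prediction[0]))]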
            print('Entry {}: User {}'.format(entry_count,user))

Please enter the filepath of your entry./Desktop/sample_test.json
Entry 1: Unable to distinguish user
Entry 2: Unable to distinguish user
Entry 3: Unable to distinguish user
Entry 4: Unable to distinguish user
Entry 5: Unable to distinguish user
Entry 6: Unable to distinguish user
Entry 7: Unable to distinguish user
Entry 8: Unable to distinguish user
Entry 9: Unable to distinguish user
Entry 10: Invalid Input
Entry 11: Unable to distinguish user
Entry 12: Unable to distinguish user
Entry 13: Unable to distinguish user
Entry 14: Unable to distinguish user
Entry 15: Invalid Input
Entry 16: User 4a4a518cde4e11e99ae3acde48001122
Entry 17: Unable to distinguish user
Entry 18: Invalid Input
Entry 19: Unable to distinguish user
Entry 20: Unable to distinguish user
Entry 21: Unable to distinguish user
Entry 22: Unable to distinguish user
Entry 23: User 4a438fdede4e11e9b986acde48001122
Entry 24: Unable to distinguish user
Entry 25: Invalid Input
Entry 26: Unable to distinguish user
Entry 27: U
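
Turning a one-hot prediction row back into a user label has three cases: exactly one column set, none set, or several set. A small sketch that handles each case explicitly is given here. `decode_one_hot` and the toy labels are hypothetical, and returning None for ambiguous rows is one possible design choice, not the notebook's behavior.

```python
def decode_one_hot(row, class_labels):
    """Return the label for a one-hot row, or None if it has zero or several 1s."""
    hot = [i for i, value in enumerate(row) if value == 1]
    if len(hot) != 1:
        return None
    return class_labels[hot[0]]

# class_labels must be in the same order as the one-hot columns
class_labels = ['user_a', 'user_b', 'user_c']

print(decode_one_hot([0, 1, 0], class_labels))  # exactly one column set
print(decode_one_hot([0, 0, 0], class_labels))  # no column set
print(decode_one_hot([1, 0, 1], class_labels))  # two columns set: ambiguous
```

Summing the indices of the set columns would map the last row to index 2, a user that was only one of two candidates, so treating it as ambiguous is the safer option.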

### Additional Questions

1. If you had one additional day, what would you change or improve to your submission?

If I had an additional day, I would spend that time trying to run a larger recurrent neural network and see if I could get that to work. In the context of this problem, a recurrent neural network would make a lot of sense since the keystrokes and their times occur in a sequence, and recurrent networks are great at handling sequential data.

2. How would you modify your solution if the number of users was 1,000 times larger?

The biggest way I'd need to modify my solution would be in terms of hardware, since the hardware I'm currently using (my laptop) would not have enough computational power to train on that many users. This would especially be the case if I were trying to use neural networks, as those would take a lot longer to run with more data, especially since an increase in users would make the tensors larger.

3. What insights and takeaways do you have on the distribution of user performance?

From this challenge, I learned how difficult it can be to identify users based on the small time intervals between their keystrokes, and that there is definitely a trade-off between model accuracy and real-time performance of the model. I also learned that this is much more of a challenge than I thought, and I will definitely explore this problem in the future.

4. What aspect(s) would you change about this challenge?

One aspect of this challenge I would change would be to have more time, since this challenge could definitely benefit from more time to train and experiment with models.

5. What aspect(s) did you enjoy about this challenge?

I enjoyed that this challenge had a meaningful real-world application and how relevant it is to the kind of work the company does. Doing this challenge really gave me a feel for the kind of problems UnifyId tries to solve.

