# Checking the triplet loss model on another dataset

- Dataset: Musk dataset https://archive.ics.uci.edu/ml/datasets/Musk+%28Version+2%29

**So far, the triplet loss model shows no improvement over the raw features on this dataset.**

## Check the Musk dataset

In [1]:
import pandas as pd

In [2]:
musk_df = pd.read_csv("../data/clean2.data", header=None)

In [3]:
musk_df.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,159,160,161,162,163,164,165,166,167,168
0,MUSK-211,211_1+1,46,-108,-60,-69,-117,49,38,-161,...,-308,52,-7,39,126,156,-50,-112,96,1.0
1,MUSK-211,211_1+10,41,-188,-145,22,-117,-6,57,-171,...,-59,-2,52,103,136,169,-61,-136,79,1.0


Attribute information:

```
   Attribute:           Description:
   molecule_name:       Symbolic name of each molecule.  Musks have names such
                        as MUSK-188.  Non-musks have names such as
                        NON-MUSK-jp13.
   conformation_name:   Symbolic name of each conformation.  These
                        have the format MOL_ISO+CONF, where MOL is the
                        molecule number, ISO is the stereoisomer
                        number (usually 1), and CONF is the
                        conformation number. 
   f1 through f162:     These are "distance features" along rays (see
                        paper cited above).  The distances are
                        measured in hundredths of Angstroms.  The
                        distances may be negative or positive, since
                        they are actually measured relative to an
                        origin placed along each ray.  The origin was
                        defined by a "consensus musk" surface that is
                        no longer used.  Hence, any experiments with
                        the data should treat these feature values as
                        lying on an arbitrary continuous scale.  In
                        particular, the algorithm should not make any
                        use of the zero point or the sign of each
                        feature value. 
   f163:                This is the distance of the oxygen atom in the
                        molecule to a designated point in 3-space.
                        This is also called OXY-DIS.
   f164:                OXY-X: X-displacement from the designated
                        point.
   f165:                OXY-Y: Y-displacement from the designated
                        point.
   f166:                OXY-Z: Z-displacement from the designated
                        point. 
   class:               0 => non-musk, 1 => musk
```
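
The note on f1 through f162 says that an algorithm should not depend on the zero point of a feature. One common way to reduce that dependence is per-column standardization. The sketch below uses made-up numbers, and the helper `standardize_columns` is hypothetical; it is not part of the pipeline used in this notebook.

```python
import numpy as np

def standardize_columns(X):
    """Z-score each column so a constant offset per column has no effect."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (X - mean) / std

# Toy feature matrix (made-up values, not taken from the Musk data).
toy = np.array([[46.0, -108.0], [41.0, -188.0], [37.0, -121.0]])
shifted = toy + np.array([100.0, -50.0])  # move the origin of each column

# The standardized values do not change when the origin moves.
same_after_shift = np.allclose(standardize_columns(toy), standardize_columns(shifted))
```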

In [4]:
musk_df[0].unique()

array(['MUSK-211', 'MUSK-212', 'MUSK-213', 'MUSK-214', 'MUSK-215',
       'MUSK-217', 'MUSK-219', 'MUSK-224', 'MUSK-228', 'MUSK-238',
       'MUSK-240', 'MUSK-256', 'MUSK-273', 'MUSK-284', 'MUSK-287',
       'MUSK-294', 'MUSK-300', 'MUSK-306', 'MUSK-314', 'MUSK-321',
       'MUSK-322', 'MUSK-323', 'MUSK-330', 'MUSK-331', 'MUSK-333',
       'MUSK-344', 'MUSK-f152', 'MUSK-f158', 'MUSK-j33', 'MUSK-j51',
       'MUSK-jf15', 'MUSK-jf17', 'MUSK-jf46', 'MUSK-jf47', 'MUSK-jf58',
       'MUSK-jf59', 'MUSK-jf66', 'MUSK-jf67', 'MUSK-jf78', 'NON-MUSK-192',
       'NON-MUSK-197', 'NON-MUSK-199', 'NON-MUSK-200', 'NON-MUSK-207',
       'NON-MUSK-208', 'NON-MUSK-210', 'NON-MUSK-216', 'NON-MUSK-220',
       'NON-MUSK-226', 'NON-MUSK-232', 'NON-MUSK-233', 'NON-MUSK-244',
       'NON-MUSK-249', 'NON-MUSK-251', 'NON-MUSK-252', 'NON-MUSK-253',
       'NON-MUSK-270', 'NON-MUSK-271', 'NON-MUSK-286', 'NON-MUSK-288',
       'NON-MUSK-289', 'NON-MUSK-290', 'NON-MUSK-295', 'NON-MUSK-296',
       'NON-MUSK-297', 

In [5]:
musk_df[1].unique()

array(['211_1+1', '211_1+10', '211_1+11', ..., 'jp13_2+7', 'jp13_2+8',
       'jp13_2+9'], dtype=object)

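Let's use columns 2 through 167 (the 166 features f1 to f166) as features and column 168 (class) as the label. Columns 0 and 1 hold the molecule and conformation names and serve only as identifiers.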
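
The integer column positions can be mapped onto the attribute names from the description. A minimal sketch, where `column_names`, `feature_names` and `label_name` are names introduced only for illustration:

```python
# Column layout of clean2.data according to the attribute description:
# 2 identifier columns, 166 features (f1..f166), 1 class column.
column_names = (
    ["molecule_name", "conformation_name"]
    + ["f{}".format(i) for i in range(1, 167)]
    + ["class"]
)

feature_names = column_names[2:168]  # same positions as iloc[:, 2:168]
label_name = column_names[168]       # same position as iloc[:, 168]
```

With pandas, such a list could be passed as `names=column_names` to `pd.read_csv`, so that later slices can refer to names instead of positions.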

### Training with GBDT to check the problem difficulty

In [6]:
import lightgbm as lgb
import numpy as np

In [7]:
musk_df = musk_df.sample(frac=1, random_state=23)

In [8]:
musk_df = musk_df.reset_index(drop=True)

In [9]:
musk_df.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,159,160,161,162,163,164,165,166,167,168
0,NON-MUSK-j147,j147_4+189,37,-121,-105,128,-117,106,210,-18,...,-69,-193,-110,-118,68,234,-47,-178,145,0.0
1,NON-MUSK-f146,f146_1+381,270,-194,-145,-62,-117,52,56,-171,...,-301,57,-135,-19,-13,226,-70,-39,-211,0.0


In [10]:
len(musk_df)

6598

In [11]:
train_X = musk_df.iloc[0:4000, 2:168]
train_Y = musk_df.iloc[0:4000, 168]

test_X = musk_df.iloc[4000:, 2:168]
test_Y = musk_df.iloc[4000:, 168]

In [12]:
lgb_train = lgb.Dataset(train_X, train_Y)

In [13]:
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary_logloss',
    'num_leaves': 50,
    'learning_rate': 0.05,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'min_child_weight': 2,
    'gamma': 0.2,  # XGBoost-style name; LightGBM's closest documented parameter is min_gain_to_split
    'verbose': 0
}

In [14]:
%%time

gbm = lgb.train(params,
                lgb_train,
                num_boost_round=40,
                valid_sets=lgb_train)

[1]	training's binary_logloss: 0.405072
[2]	training's binary_logloss: 0.376796
[3]	training's binary_logloss: 0.352722
[4]	training's binary_logloss: 0.330895
[5]	training's binary_logloss: 0.31256
[6]	training's binary_logloss: 0.295727
[7]	training's binary_logloss: 0.281168
[8]	training's binary_logloss: 0.267566
[9]	training's binary_logloss: 0.255118
[10]	training's binary_logloss: 0.243148
[11]	training's binary_logloss: 0.232214
[12]	training's binary_logloss: 0.222262
[13]	training's binary_logloss: 0.212953
[14]	training's binary_logloss: 0.204266
[15]	training's binary_logloss: 0.195906
[16]	training's binary_logloss: 0.188334
[17]	training's binary_logloss: 0.180681
[18]	training's binary_logloss: 0.173663
[19]	training's binary_logloss: 0.167115
[20]	training's binary_logloss: 0.160801
[21]	training's binary_logloss: 0.154902
[22]	training's binary_logloss: 0.149273
[23]	training's binary_logloss: 0.143978
[24]	training's binary_logloss: 0.139
[25]	training's binary_loglos

In [15]:
predict_prob = gbm.predict(test_X)

In [16]:
predict_label = [ 1 if elem >= 0.5 else 0 for elem in predict_prob]

In [17]:
acc = sum( np.array(predict_label) == np.array(test_Y) ) / len(predict_label)

In [18]:
acc

0.9634334103156275

In [19]:
sum(predict_label)

299

In [20]:
sum(test_Y)

376.0

GBDT reaches about 96% accuracy on the held-out rows, so this classification problem is solvable from these features (at least for GBDT).
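
To put the accuracy in context, it helps to compare it with a majority-class baseline. The sketch below uses only the numbers printed above (6598 rows in total, 4000 of them used for training, 376 musk rows in the test part); the variable names are illustrative.

```python
n_test = 6598 - 4000           # rows 4000 onward of the shuffled frame
n_positive = 376               # sum(test_Y)
accuracy = 0.9634334103156275  # value of `acc` above

# Always predicting "non-musk" would be right for every negative row.
baseline_accuracy = (n_test - n_positive) / n_test
n_errors = n_test - round(accuracy * n_test)

print("baseline: {:.3f}, GBDT: {:.3f}, errors: {}".format(
    baseline_accuracy, accuracy, n_errors))
```

A constant prediction would already score about 85.5%, so the GBDT result should be read against that floor rather than against 50%.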

## Try the triplet loss model

In [21]:
import random

In [22]:
train_normalized_feature_dict = { 
    k:v/np.linalg.norm(v) 
    for k,v 
    in zip(train_X.index, [elem[1].values for elem in train_X.iterrows()]) # elem[1].values holds the feature values of each row
}

In [23]:
%%time

sim_dict = {
    idx:{ idx_:np.sum(train_normalized_feature_dict[idx]*train_normalized_feature_dict[idx_])
            for idx_
            in train_X.index } 
    for idx 
    in train_X.index
}

CPU times: user 2min 12s, sys: 890 ms, total: 2min 13s
Wall time: 2min 13s
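
The nested dictionary comprehension is slow because it evaluates 4000 × 4000 dot products one at a time in Python. The same cosine similarities can be obtained with a single matrix product. A minimal sketch on toy data, assuming the features fit in memory as a dense array; `cosine_similarity_matrix` is a hypothetical helper, not something the notebook defines.

```python
import numpy as np

def cosine_similarity_matrix(features):
    """Return the matrix S with S[i, j] = cos(features[i], features[j])."""
    features = np.asarray(features, dtype=float)
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    return unit @ unit.T

# Toy vectors (made up): rows 0 and 1 are close, row 2 is opposite to row 0.
toy = np.array([[3.0, 4.0], [4.0, 3.0], [-3.0, -4.0]])
sim = cosine_similarity_matrix(toy)
```

Row `i` of `sim` then plays the role of `sim_dict[i]`, provided the array positions are mapped back to the DataFrame index.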


In [24]:
sorted( sim_dict[0].values() )[-5:]

[0.9182183781695297,
 0.9190310498816693,
 0.9240562721459058,
 0.9272741728267794,
 1.0]

In [25]:
def sort_similarity_by_value(sim_dict, app_id):
    '''
    input:
        sim_dict: similarity dictionary, {id: {other_id: similarity}}
        app_id: id of the target (anchor) row
    return:
        [(id1, sim1), (id2, sim2), ...] sorted by ascending similarity
    '''
    return [(parsed, sim_dict[app_id][parsed]) for parsed in sorted(sim_dict[app_id], key=sim_dict[app_id].get)]
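
As a quick check of the ordering, here is the same logic applied to a small hand-made dictionary (the ids and similarity values are invented for illustration). The list is ascending, so the anchor itself, with similarity 1.0, comes last.

```python
def sort_similarity_by_value(sim_dict, app_id):
    # Same logic as the cell above: sort the ids by their similarity to app_id.
    row = sim_dict[app_id]
    return [(other, row[other]) for other in sorted(row, key=row.get)]

toy_sim_dict = {"a": {"a": 1.0, "b": 0.2, "c": 0.9}}
ranked = sort_similarity_by_value(toy_sim_dict, "a")

least_similar = ranked[0]
most_similar_excluding_self = ranked[-2]
```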

In [26]:
def make_uncited_grants_for_app_id(sim_dict, app_id, sidx, eidx, num, shuffle=True):
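    '''
    input:
        sim_dict: similarity dictionary, {id: {other_id: similarity}}
        app_id: id of the target (anchor) row
        sidx: start index to slice the sorted (id, sim) list
        eidx: end index to slice the sorted (id, sim) list
        num: number of ids returned in each list
        shuffle: if True, shuffle the candidate windows before picking
    return:
        (unsimilar_pairs, similar_pairs): two lists of `num` ids each, taken from
        the least-similar window [sidx:eidx] and the most-similar window [-eidx:-sidx].
        Candidates are filtered by absolute label: the unsimilar id must have class 0
        and the similar id class 1, regardless of the label of app_id.
    '''
    assert sidx != 0, "sidx must not be 0: the slice [-eidx:-0] would be empty"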
    
    sorted_grants_list = sort_similarity_by_value(sim_dict, app_id)
    unsimilar_list = sorted_grants_list[sidx:eidx]
    similar_list = sorted_grants_list[-eidx:-sidx]
    if shuffle:
        random.shuffle(unsimilar_list)
        random.shuffle(similar_list)
    
    unsimilar_pairs = []
    similar_pairs = []
    
    idx = 0
    while len(unsimilar_pairs) != num:
        id_unsimilar_pair = unsimilar_list[idx][0]
        id_similar_pair = similar_list[idx][0]
        if train_Y[id_unsimilar_pair] == 0 and train_Y[id_similar_pair] == 1:
            unsimilar_pairs.append(id_unsimilar_pair)
            similar_pairs.append(id_similar_pair)
        idx += 1
        
    return unsimilar_pairs, similar_pairs
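
The two slices pick mirror-image windows from the ascending list, and the assertion on `sidx` exists because a stop index of `-0` is just `0`, which yields an empty slice. A small sketch with a stand-in list of ten ranks:

```python
ranks = list(range(10))  # stand-in for the ascending (id, sim) list; 9 is the anchor itself

sidx, eidx = 1, 4
unsimilar_window = ranks[sidx:eidx]   # least similar end, skipping position 0
similar_window = ranks[-eidx:-sidx]   # most similar end, skipping the anchor

# With sidx == 0 the stop index becomes -0 == 0, so nothing is selected.
empty_window = ranks[-eidx:-0]
```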

To get different unsimilar and similar samples on each call, change the random seed as below.

In [27]:
random.seed(0)
make_uncited_grants_for_app_id(sim_dict, 0, 1, 500, 4)

([1247, 665, 3234, 325], [1445, 2516, 3042, 3800])

In [28]:
random.seed(1)
make_uncited_grants_for_app_id(sim_dict, 0, 1, 500, 4)

([2401, 302, 3655, 1082], [2682, 484, 165, 1445])

In [29]:
def create_triplet_pairs(sidx, eidx):
    all_elems = []
    
    for idx in train_X.index:
        unsimilar_pairs, similar_pairs = make_uncited_grants_for_app_id(sim_dict, idx, sidx, eidx, 1)
        
        for similar_idx, unsimilar_idx in zip(similar_pairs, unsimilar_pairs):
            all_elems.append([idx, similar_idx, unsimilar_idx])
    
    result_df = pd.DataFrame(all_elems)
    result_df.columns = ['idx', 'similar_idx', 'unsimilar_idx']
    
    return result_df

In [30]:
%%time

random.seed(0)
test = create_triplet_pairs(1, 500)

CPU times: user 20.7 s, sys: 50 ms, total: 20.7 s
Wall time: 20.7 s


In [31]:
test.head(2)

Unnamed: 0,idx,similar_idx,unsimilar_idx
0,0,1445,1247
1,1,2166,175


In [32]:
len(test)

4000

## Train model

In [33]:
import os
import tensorflow as tf

tf.enable_eager_execution()
tfe = tf.contrib.eager

In [34]:
class Model(object):
    def __init__(self, input_shape, output_shape):
        self.input_shape = input_shape
        self.output_shape = output_shape
        self.W = tfe.Variable( tf.random_normal( [self.input_shape, self.output_shape] ), name='weight' )
        self.B = tfe.Variable( tf.random_normal( [self.output_shape] ), name='bias' ) 
        self.variables = [ self.W, self.B ]
    
    def frwrd_pass(self,X_train):
        out = tf.matmul( X_train, self.W ) + self.B
        
        return out

We also tried more complex models, such as the two-layer variant below, but they did not show any improvement.


```
class Model(object):
    def __init__(self, input_shape, output_shape1, output_shape2):
        self.input_shape = input_shape
        self.output_shape1 = output_shape1
        self.output_shape2 = output_shape2
        self.W1 = tfe.Variable( tf.random_normal( [self.input_shape, self.output_shape1] ), name='weight' )
        self.B1 = tfe.Variable( tf.random_normal( [self.output_shape1] ), name='bias' ) 
        self.W2 = tfe.Variable( tf.random_normal( [self.output_shape1, self.output_shape2] ), name='weight' )
        self.B2 = tfe.Variable( tf.random_normal( [self.output_shape2] ), name='bias' ) 
        self.variables = [ self.W1, self.B1, self.W2, self.B2 ]
    
    def frwrd_pass(self,X_train):
        out = tf.matmul( X_train, self.W1 ) + self.B1
        out = tf.nn.relu(out)
        out = tf.matmul( out, self.W2 ) + self.B2
        
        return out
```

In [42]:
def tripletloss(anchor_out, positive_out, negative_out, margin=0.2):
    norm_a_out = tf.nn.l2_normalize(anchor_out, axis=1)
    norm_p_out = tf.nn.l2_normalize(positive_out, axis=1)
    norm_n_out = tf.nn.l2_normalize(negative_out, axis=1)
    
    d_pos = tf.losses.cosine_distance(norm_a_out, norm_p_out, axis=1, reduction=tf.losses.Reduction.NONE)
    d_neg = tf.losses.cosine_distance(norm_a_out, norm_n_out, axis=1, reduction=tf.losses.Reduction.NONE)
    
    loss = tf.maximum(0.0, margin + d_pos - d_neg)
    
    return tf.reduce_mean(loss)
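
For reference, the same computation can be written in plain NumPy: the cosine distance is 1 minus the dot product of the unit vectors, and the loss is the hinge max(0, margin + d_pos - d_neg) averaged over the batch. This is a sketch for checking values by hand, not a replacement for the TensorFlow version; `triplet_loss_np` is a hypothetical name and the vectors are made up.

```python
import numpy as np

def triplet_loss_np(anchor, positive, negative, margin=0.2):
    def unit(x):
        x = np.asarray(x, dtype=float)
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    a, p, n = unit(anchor), unit(positive), unit(negative)
    d_pos = 1.0 - np.sum(a * p, axis=1)  # cosine distance anchor-positive
    d_neg = 1.0 - np.sum(a * n, axis=1)  # cosine distance anchor-negative
    return np.mean(np.maximum(0.0, margin + d_pos - d_neg))

# Two toy triplets: the first is already well separated, the second is inverted.
anchor   = [[1.0, 0.0], [1.0, 0.0]]
positive = [[2.0, 0.0], [0.0, 1.0]]
negative = [[0.0, 3.0], [1.0, 0.0]]
loss = triplet_loss_np(anchor, positive, negative)
```

The first triplet contributes 0 (the margin is already satisfied) and the second contributes 1.2, so only triplets that violate the margin drive the gradient.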

In [43]:
def create_training_input_np(sidx, eidx):
    anchor_list = []
    positive_list = []
    negative_list = []
    
    triplet_pairs = create_triplet_pairs(sidx, eidx)
    
    for row in triplet_pairs.itertuples():
        anchor_list.append(train_normalized_feature_dict[row.idx])
        positive_list.append(train_normalized_feature_dict[row.similar_idx])
        negative_list.append(train_normalized_feature_dict[row.unsimilar_idx])
    
    return np.array([np.array(anchor_list), np.array(positive_list), np.array(negative_list)])

In [44]:
def train_with_changing_negative_pair(sidx, eidx, batch_size, epochs):
    optimizer = tf.train.AdamOptimizer(learning_rate=0.00001)
    
    seed = 0
    for i in range(epochs):
        seed += 1
        random.seed(seed)
        
        input_data_np = create_training_input_np(sidx, eidx)
        data_num = int(input_data_np.shape[1])
        rand_idx = np.random.permutation(data_num)
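        # NOTE: this shuffled copy is never used below; input_data is built
        # from the unshuffled input_data_np, so the shuffle has no effect.
        index_data_np = np.array([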
            input_data_np[0][rand_idx], 
            input_data_np[1][rand_idx], 
            input_data_np[2][rand_idx]])

        input_data = tf.convert_to_tensor(input_data_np, dtype=tf.float32)
        anchor_data, positive_data, negative_data = input_data

        for iter_id in range(data_num // batch_size):        
            with tf.GradientTape() as tape:
                anchor_out = model.frwrd_pass(anchor_data[iter_id*batch_size : (iter_id+1)*batch_size])
                positive_out = model.frwrd_pass(positive_data[iter_id*batch_size : (iter_id+1)*batch_size])
                negative_out = model.frwrd_pass(negative_data[iter_id*batch_size : (iter_id+1)*batch_size])
                curr_loss = tripletloss(anchor_out, positive_out, negative_out)
            grads = tape.gradient( curr_loss, model.variables )
            optimizer.apply_gradients(zip(grads, model.variables), global_step=tf.train.get_or_create_global_step())

        if i % 10 == 0:
            print( "Loss at step {:d}: {:.5f}".format(i, curr_loss) )
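
When triplets are shuffled between epochs, the same permutation has to be applied to anchors, positives and negatives so that each row still forms a valid triplet. A minimal sketch with toy arrays; `shuffled_batches` is a hypothetical helper and the seed is arbitrary.

```python
import numpy as np

def shuffled_batches(anchors, positives, negatives, batch_size, rng):
    """Yield aligned mini-batches after one shared shuffle."""
    order = rng.permutation(len(anchors))
    anchors, positives, negatives = anchors[order], positives[order], negatives[order]
    # Drop an incomplete final batch, like data_num // batch_size does.
    for start in range(0, len(anchors) - batch_size + 1, batch_size):
        stop = start + batch_size
        yield anchors[start:stop], positives[start:stop], negatives[start:stop]

# Toy data: row i of each array belongs to triplet i.
a = np.arange(6)
p = a + 100
n = a + 200
batches = list(shuffled_batches(a, p, n, batch_size=2, rng=np.random.default_rng(0)))
```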

In [45]:
start_end_index_pairs = (
    (1000, 2000),
    (2000, 3000)
)
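
Each pair is passed as `(sidx, eidx)` to the slicing in `make_uncited_grants_for_app_id`. Assuming the sorted list has 4000 entries (one per training row), the sketch below shows which ascending-similarity ranks each window covers. For `(2000, 3000)` the window labelled similar covers lower ranks, that is, lower raw cosine similarity, than the window labelled unsimilar.

```python
n_candidates = 4000  # assumed length of the sorted list (one entry per training row)
start_end_index_pairs = ((1000, 2000), (2000, 3000))

windows = {}
for sidx, eidx in start_end_index_pairs:
    ranks = range(n_candidates)  # rank 0 = least similar, rank 3999 = anchor itself
    unsimilar = ranks[sidx:eidx]
    similar = ranks[-eidx:-sidx]
    # Store the first and last rank of each window.
    windows[(sidx, eidx)] = ((unsimilar[0], unsimilar[-1]), (similar[0], similar[-1]))
    print((sidx, eidx), "unsimilar ranks:", windows[(sidx, eidx)][0],
          "similar ranks:", windows[(sidx, eidx)][1])
```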

In [48]:
model = Model(input_shape=166, output_shape=50)

In [49]:
%%time

for sidx, eidx in start_end_index_pairs:
    print("   start index: {}, end index: {}".format(sidx,eidx))
    train_with_changing_negative_pair(sidx, eidx, batch_size=20, epochs=51)

   start index: 1000, end index: 2000
Loss at step 0: 0.10612
Loss at step 10: 0.10253
Loss at step 20: 0.07504
Loss at step 30: 0.06947
Loss at step 40: 0.08785
Loss at step 50: 0.07011
   start index: 2000, end index: 3000
Loss at step 0: 0.34330
Loss at step 10: 0.29408
Loss at step 20: 0.32570
Loss at step 30: 0.28743
Loss at step 40: 0.24244
Loss at step 50: 0.21860
CPU times: user 53min 38s, sys: 2.34 s, total: 53min 41s
Wall time: 53min 47s


In [50]:
os.makedirs('../trained_model/tripletloss_musk', exist_ok=True)
saver = tfe.Saver(model.variables)
saver.save("../trained_model/tripletloss_musk/ckpt")

'../trained_model/tripletloss_musk/ckpt'

In [51]:
test_normalized_feature_dict = { 
    k:v/np.linalg.norm(v) 
    for k,v 
    in zip(test_X.index, [elem[1].values for elem in test_X.iterrows()]) # elem[1].values holds the feature values of each row
}

In [52]:
sorted_keys = sorted(test_normalized_feature_dict.keys())

test_feature_tensors = tf.convert_to_tensor(
    np.array([ test_normalized_feature_dict[k] for k in sorted_keys ]),
    dtype=tf.float32)

In [53]:
test_extracted_features = model.frwrd_pass(test_feature_tensors).numpy()

In [54]:
test_extracted_features.shape

(2598, 50)

In [55]:
test_extracted_features_df = pd.DataFrame({ 
    'ids':sorted_keys, 'extracted_feature':[ v/np.linalg.norm(v) for v in test_extracted_features ]
})

In [56]:
test_extracted_features_df.head(2)

Unnamed: 0,extracted_feature,ids
0,"[-0.066523544, 0.20527224, 0.12628277, -0.1699...",4000
1,"[-0.13204022, 0.1919553, 0.23854902, 0.0546946...",4001


In [57]:
test_Y[4000]

1.0

In [58]:
test_Y[4001]

0.0

In [59]:
%%time

test_sim_dict = {
    idx:{ idx_:np.sum(test_normalized_feature_dict[idx]*test_normalized_feature_dict[idx_])
            for idx_
            in test_X.index } 
    for idx 
    in test_X.index
}

CPU times: user 51.8 s, sys: 250 ms, total: 52.1 s
Wall time: 52.2 s


In [76]:
result = []

for idx in test_Y.index:
    anchor_label = test_Y[idx]
    topk_similar = sort_similarity_by_value(test_sim_dict, idx)[-100:-1]
    
    for target_idx, _ in topk_similar:
        if test_Y[target_idx] == anchor_label:
            result.append(1)
        else:
            result.append(0)

In [77]:
print(len(result))
print(sum(result))

257202
205781
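
The neighbour check above can also be written as a reusable function that works directly on a feature matrix. This is a sketch under the assumption that the features fit in memory as a dense array; `same_label_rate` is a hypothetical helper, and it excludes the anchor explicitly instead of dropping the last element of the sorted list.

```python
import numpy as np

def same_label_rate(features, labels, k):
    """Share of the k nearest neighbours (by cosine) with the anchor's label."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)           # never count the anchor itself
    topk = np.argsort(sims, axis=1)[:, -k:]   # indices of the k most similar rows
    hits = labels[topk] == labels[:, None]
    return hits.mean()

# Toy data (made up): two tight clusters with different labels.
toy_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = [1, 1, 0, 0]
rate_k1 = same_label_rate(toy_features, toy_labels, k=1)
rate_k3 = same_label_rate(toy_features, toy_labels, k=3)
```

Calling it with `k=99` would correspond to the `[-100:-1]` slice used in the loop.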


In [65]:
sort_similarity_by_value(test_sim_dict, 4000)[-6:]

[(5642, 0.9795089363407963),
 (5606, 0.9795310255088824),
 (4625, 0.9814915960083728),
 (6060, 0.9826777896705929),
 (5919, 0.9842811996045822),
 (4000, 1.0)]

In [66]:
print(test_Y[5919])
print(test_Y[6060])
print(test_Y[4625])
print(test_Y[5606])
print(test_Y[5642])

1.0
1.0
1.0
1.0
1.0


In [62]:
%%time

test_sim_dict_predict = {
    idx:{ idx_:np.sum(v*v_)
            for idx_, v_
            in zip(test_extracted_features_df['ids'], test_extracted_features_df['extracted_feature']) } 
    for idx, v
    in zip(test_extracted_features_df['ids'], test_extracted_features_df['extracted_feature'])
}

CPU times: user 41.5 s, sys: 250 ms, total: 41.8 s
Wall time: 41.8 s


In [78]:
result = []

for idx in test_Y.index:
    anchor_label = test_Y[idx]
    topk_similar = sort_similarity_by_value(test_sim_dict_predict, idx)[-100:-1]
    
    for target_idx, _ in topk_similar:
        if test_Y[target_idx] == anchor_label:
            result.append(1)
        else:
            result.append(0)

In [79]:
print(len(result))
print(sum(result))

257202
202296
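
Dividing each count by the 257202 neighbour checks (2598 test rows × 99 neighbours) gives the share of neighbours that carry the same label as the anchor. The arithmetic below uses only the numbers printed above; the variable names are illustrative.

```python
n_checks = 2598 * 99          # test rows x neighbours per row ([-100:-1] keeps 99)
same_label_raw = 205781       # count obtained with test_sim_dict (raw normalized features)
same_label_embedded = 202296  # count obtained with test_sim_dict_predict (model output)

rate_raw = same_label_raw / n_checks
rate_embedded = same_label_embedded / n_checks
print("raw: {:.4f}, triplet loss: {:.4f}".format(rate_raw, rate_embedded))
```

The learned features score about 78.7% against about 80.0% for the raw features, which is consistent with the statement at the top that the model shows no improvement.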


In [67]:
sort_similarity_by_value(test_sim_dict_predict, 4000)[-6:]

[(5642, 0.9848649),
 (6060, 0.98657715),
 (4625, 0.9879164),
 (5217, 0.9881265),
 (5919, 0.9921755),
 (4000, 1.0)]

In [68]:
print(test_Y[5919])
print(test_Y[5217])
print(test_Y[4625])
print(test_Y[6060])
print(test_Y[5642])

1.0
1.0
1.0
1.0
1.0


# ===== Trial and error =====