This notebook makes two points.
## 1. Don't hesitate to use loops.
People say "avoid loops!"
But I think loops are a reasonable choice for this competition,
because:
* Inference has to run in small batches through the Time-series API.
* Loops add very little overhead per batch.
* Loops are more flexible.
* Loops are not even that slow: 3 features are extracted from the 100M-row train data within 10 minutes, as you can see below.
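
As a rough illustration of the loop-based approach, here is a minimal sketch that builds two running per-user features in a single pass. The row layout `(user_id, answered_correctly)` and the function name `running_user_features` are assumptions made for this example, not the notebook's actual feature code.

```python
from collections import defaultdict

def running_user_features(rows):
    """For each (user_id, answered_correctly) row, in time order, return
    (number of earlier answers, number of earlier correct answers)."""
    seen = defaultdict(int)     # answers observed so far per user
    correct = defaultdict(int)  # correct answers observed so far per user
    features = []
    for user_id, answered_correctly in rows:
        # Read the state first: it only reflects earlier rows.
        features.append((seen[user_id], correct[user_id]))
        # Then update the state with the current row.
        seen[user_id] += 1
        correct[user_id] += answered_correctly
    return features

rows = [(1, 1), (2, 0), (1, 0), (1, 1), (2, 1)]
feats = running_user_features(rows)
print(feats)
```

The same two dictionaries could be kept alive between Time-series API batches, so that inference follows the same code path as training.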

## 2. Future information should not be used.
The Time-series API doesn't let us use information from the future.
So we should not use it in training either; user statistics computed from future rows in particular make things much worse.
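
To make the leak concrete, the toy comparison below contrasts a user's mean accuracy over the whole history with the mean over earlier answers only. The helper names and the four-answer history are made up for illustration.

```python
def full_history_mean(answers):
    """Leaky: every row sees the user's final accuracy, including its own label."""
    mean = sum(answers) / len(answers)
    return [mean] * len(answers)

def past_only_mean(answers):
    """Leak-free: each row sees only the answers that came before it."""
    out, total = [], 0
    for n_seen, a in enumerate(answers):
        out.append(total / n_seen if n_seen else None)  # None: no history yet
        total += a
    return out

answers = [1, 0, 0, 0]
leaky = full_history_mean(answers)
safe = past_only_mean(answers)
print(leaky)
print(safe)
```

The leaky version is not reproducible at inference time, because the later answers that feed the mean have not happened yet when the first rows are predicted.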

In [1]:
import pickle
import pandas as pd
import numpy as np
import gc
from sklearn.metrics import roc_auc_score
from collections import defaultdict, deque
from tqdm.notebook import tqdm
import lightgbm as lgb

## setting
CV files are generated by [this notebook](https://www.kaggle.com/its7171/cv-strategy)

In [2]:
train_pickle = '../input/riiid-cross-validation-files/cv1_train.pickle'
valid_pickle = '../input/riiid-cross-validation-files/cv1_valid.pickle'
question_file = '../input/features/question3.csv'
debug = False
validaten_flg = False

## feature engineering

In [3]:
import pickle
feld_needed = ['user_id','content_id','answered_correctly','prior_question_elapsed_time', 'prior_question_had_explanation']

# Load the pickled CV splits; the with-blocks close the files afterwards.
with open(train_pickle, "rb") as f:
    train = pickle.load(f)
train = train[feld_needed]


with open(valid_pickle, "rb") as f:
    valid = pickle.load(f)
valid = valid[feld_needed]

In [4]:
train = train.loc[train.answered_correctly != -1].reset_index(drop=True)
valid = valid.loc[valid.answered_correctly != -1].reset_index(drop=True)
_=gc.collect()

In [5]:
prior_question_elapsed_time_mean = train.prior_question_elapsed_time.dropna().values.mean()

In [6]:
train.content_id += 1
valid.content_id += 1

In [7]:
train = train[-40000000:]
_=gc.collect()

In [8]:
train_user_prev_q_a = pd.read_csv('../input/user-prev-for-saint/train_user_prev_perform.csv')
train_user_prev_q_a = train_user_prev_q_a[-40000000:]
train_user_prev_q_a.user_prev_answer_lag += 1
_=gc.collect()

In [9]:
train = pd.concat([train,train_user_prev_q_a['user_prev_answer_lag']], axis = 1)
del(train_user_prev_q_a)
_=gc.collect()

In [10]:
train.max()

user_id                           2147482216
content_id                             13523
answered_correctly                         1
prior_question_elapsed_time           300000
prior_question_had_explanation          True
user_prev_answer_lag                       3
dtype: object

In [11]:
valid_user_prev_q_a = pd.read_csv('../input/user-prev-for-saint/valid_user_prev_perform.csv')
valid_user_prev_q_a.user_prev_answer_lag += 1

In [12]:
valid = pd.concat([valid,valid_user_prev_q_a['user_prev_answer_lag']], axis = 1)
del(valid_user_prev_q_a)

In [13]:
train_time_diff = pd.read_csv('../input/time-diff-lgbm/train_time_diff.csv')
train_time_diff = train_time_diff[-40000000:]
_=gc.collect()

In [14]:
train = pd.concat([train,train_time_diff], axis = 1)
# Cap time_diff at 1e6 with a single .loc assignment (avoids chained indexing).
train.loc[train.time_diff >= 1e6, 'time_diff'] = 1e6
del(train_time_diff)
_=gc.collect()


In [15]:
valid_time_diff = pd.read_csv('../input/time-diff-lgbm/valid_time_diff.csv')

In [16]:
valid = pd.concat([valid,valid_time_diff], axis = 1)
# Cap time_diff at 1e6 with a single .loc assignment (avoids chained indexing).
valid.loc[valid.time_diff >= 1e6, 'time_diff'] = 1e6
del(valid_time_diff)


In [17]:
question = pd.read_csv('../input/features/question3.csv')
question.tags.fillna('-1-1', inplace = True)
question['tag_num'] = pd.factorize(question.tags)[0]

question.content_id += 1
question.tag_num += 1

train = train.merge(question[['content_id','part','tag_num']], on = 'content_id', how = 'left')
valid = valid.merge(question[['content_id','part','tag_num']], on = 'content_id', how = 'left')

In [18]:
FEATS = ['prior_question_elapsed_time','time_diff']

In [19]:
train.prior_question_elapsed_time.fillna(prior_question_elapsed_time_mean, inplace = True)
valid.prior_question_elapsed_time.fillna(prior_question_elapsed_time_mean, inplace = True)

train.prior_question_had_explanation = train.prior_question_had_explanation*1 + 1
valid.prior_question_had_explanation = valid.prior_question_had_explanation*1 + 1

train.prior_question_had_explanation.fillna(3, inplace = True)
valid.prior_question_had_explanation.fillna(3, inplace = True)
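
The mapping applied to `prior_question_had_explanation` above can be spelled out for single values as follows. This is an illustrative sketch; `encode_explanation` is a hypothetical helper, and it assumes that code 0 is left free for sequence padding, as with the other shifted ids.

```python
import math

def encode_explanation(value):
    """Map False -> 1, True -> 2, missing (None/NaN) -> 3; 0 stays free for padding."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return 3
    return int(bool(value)) + 1

codes = [encode_explanation(v) for v in [False, True, None, float("nan")]]
print(codes)
```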

In [20]:
train.prior_question_elapsed_time = train.prior_question_elapsed_time/300000
train.time_diff = train.time_diff/1e6

valid.prior_question_elapsed_time = valid.prior_question_elapsed_time/300000 
valid.time_diff = valid.time_diff/1e6

In [21]:
train_user_count = train.user_id.value_counts()
train_del_user = train_user_count[train_user_count<30]
train = train[~train.user_id.isin(train_del_user.index)]

valid_user_count = valid.user_id.value_counts()
valid_del_user = valid_user_count[valid_user_count<30]
valid = valid[~valid.user_id.isin(valid_del_user.index)]

In [22]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
max_len = 60

In [23]:
train_group = train.groupby('user_id')
del(train)
_ = gc.collect()


In [24]:
train_y = [frame['answered_correctly'].to_numpy()[:, None].tolist()[i:i+max_len] for _, frame in train_group
           for i in range(0, len(frame['answered_correctly'].to_numpy()[:, None]), max_len) ]

train_y = np.reshape(pad_sequences(train_y, padding="pre"),(-1,max_len,1))


f = open("train_y.pkl","wb")
pickle.dump(train_y,f)
f.close()

del(train_y)
_ = gc.collect()
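
The chunk-and-pad step above can be sketched without Keras. The helper `chunk_and_pad` below is hypothetical and only mirrors the idea: split one user's history into pieces of at most `max_len` items and left-pad the short pieces with 0.

```python
def chunk_and_pad(seq, max_len, pad_value=0):
    """Split seq into consecutive chunks of max_len and left-pad the short ones."""
    chunks = []
    for start in range(0, len(seq), max_len):
        chunk = list(seq[start:start + max_len])
        padding = [pad_value] * (max_len - len(chunk))
        chunks.append(padding + chunk)  # "pre" padding: zeros go in front
    return chunks

padded = chunk_and_pad([5, 8, 2, 9, 4, 7, 3], max_len=3)
print(padded)
```

Left padding keeps the real interactions at the end of each row, and it is why the ids were shifted by one earlier: 0 has to stay free to mark padding.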

In [25]:
train_current_question = [frame['content_id'].to_numpy()[:, None].tolist()[i:i+max_len] for _, frame in train_group
                         for i in range(0, len(frame['content_id'].to_numpy()[:,None]), max_len )]

train_current_question = np.reshape(pad_sequences(train_current_question, padding="pre"),(-1,max_len))


f = open("train_current_question.pkl","wb")
pickle.dump(train_current_question,f)
f.close()

del(train_current_question)
_ = gc.collect()

In [26]:
train_current_tag = [frame['tag_num'].to_numpy()[:, None].tolist()[i:i+max_len] for _, frame in train_group
                         for i in range(0, len(frame['tag_num'].to_numpy()[:,None]), max_len )] 

train_current_tag = np.reshape(pad_sequences(train_current_tag, padding="pre"),(-1,max_len))

f = open("train_current_tag.pkl","wb")
pickle.dump(train_current_tag,f)
f.close()

del(train_current_tag)
_ = gc.collect()

In [27]:
train_current_part = [frame['part'].to_numpy()[:, None].tolist()[i:i+max_len] for _, frame in train_group
                         for i in range(0, len(frame['part'].to_numpy()[:,None]), max_len )] 
train_current_part = np.reshape(pad_sequences(train_current_part, padding="pre"),(-1,max_len))


f = open("train_current_part.pkl","wb")
pickle.dump(train_current_part,f)
f.close()

del(train_current_part)
_ = gc.collect()

In [28]:
train_past_answer = [frame['user_prev_answer_lag'].to_numpy()[:, None].tolist()[i:i+max_len] for _, frame in train_group
                         for i in range(0, len(frame['user_prev_answer_lag'].to_numpy()[:,None]), max_len )]

train_past_answer = np.reshape(pad_sequences(train_past_answer, padding="pre"),(-1,max_len))


f = open("train_past_answer.pkl","wb")
pickle.dump(train_past_answer,f)
f.close()

del(train_past_answer)
_ = gc.collect()

In [29]:
train_prior_exp = [frame['prior_question_had_explanation'].to_numpy()[:, None].tolist()[i:i+max_len] for _, frame in train_group
                         for i in range(0, len(frame['prior_question_had_explanation'].to_numpy()[:,None]), max_len )]

train_prior_exp = np.reshape(pad_sequences(train_prior_exp, padding="pre"),(-1,max_len))

f = open("train_prior_exp.pkl","wb")
pickle.dump(train_prior_exp,f)
f.close()

del(train_prior_exp)
_ = gc.collect()

In [30]:
train_other_feats = [frame[FEATS].to_numpy()[:, None].tolist()[i:i+max_len] for _, frame in train_group
                         for i in range(0, len(frame[FEATS].to_numpy()[:,None]), max_len )]

train_other_feats = np.reshape(pad_sequences(train_other_feats, padding="pre", dtype = 'float32'),(-1,max_len, len(FEATS)))

f = open("train_other_feats.pkl","wb")
pickle.dump(train_other_feats,f)
f.close()

del(train_other_feats)
_ = gc.collect()

In [31]:
valid_group = valid.groupby('user_id')

valid_y = [frame['answered_correctly'].to_numpy()[:, None].tolist()[i:i+max_len] for _, frame in valid_group
           for i in range(0, len(frame['answered_correctly'].to_numpy()[:, None]), max_len) ]
valid_current_question = [frame['content_id'].to_numpy()[:, None].tolist()[i:i+max_len] for _, frame in valid_group
                         for i in range(0, len(frame['content_id'].to_numpy()[:,None]), max_len )]
valid_current_tag = [frame['tag_num'].to_numpy()[:, None].tolist()[i:i+max_len] for _, frame in valid_group
                         for i in range(0, len(frame['tag_num'].to_numpy()[:,None]), max_len )] 
valid_current_part = [frame['part'].to_numpy()[:, None].tolist()[i:i+max_len] for _, frame in valid_group
                         for i in range(0, len(frame['part'].to_numpy()[:,None]), max_len )] 

valid_past_answer = [frame['user_prev_answer_lag'].to_numpy()[:, None].tolist()[i:i+max_len] for _, frame in valid_group
                         for i in range(0, len(frame['user_prev_answer_lag'].to_numpy()[:,None]), max_len )]
valid_other_feats = [frame[FEATS].to_numpy()[:, None].tolist()[i:i+max_len] for _, frame in valid_group
                         for i in range(0, len(frame[FEATS].to_numpy()[:,None]), max_len )]
valid_prior_exp = [frame['prior_question_had_explanation'].to_numpy()[:, None].tolist()[i:i+max_len] for _, frame in valid_group
                         for i in range(0, len(frame['prior_question_had_explanation'].to_numpy()[:,None]), max_len )]


valid_y = np.reshape(pad_sequences(valid_y, padding="pre"),(-1,max_len,1))
valid_current_question = np.reshape(pad_sequences(valid_current_question, padding="pre"),(-1,max_len))
valid_current_tag = np.reshape(pad_sequences(valid_current_tag, padding="pre"),(-1,max_len))
valid_current_part = np.reshape(pad_sequences(valid_current_part, padding="pre"),(-1,max_len))
valid_past_answer = np.reshape(pad_sequences(valid_past_answer, padding="pre"),(-1,max_len))
valid_other_feats = np.reshape(pad_sequences(valid_other_feats, padding="pre", dtype = 'float32'),(-1,max_len, len(FEATS)))
valid_prior_exp = np.reshape(pad_sequences(valid_prior_exp, padding="pre"),(-1,max_len))


f = open("valid_prior_exp.pkl","wb")
pickle.dump(valid_prior_exp,f)
f.close()
del(valid_prior_exp)
_ = gc.collect()

f = open("valid_y.pkl","wb")
pickle.dump(valid_y,f)
f.close()
del(valid_y)
_ = gc.collect()

f = open("valid_current_question.pkl","wb")
pickle.dump(valid_current_question,f)
f.close()
del(valid_current_question)
_ = gc.collect()

f = open("valid_current_tag.pkl","wb")
pickle.dump(valid_current_tag,f)
f.close()
del(valid_current_tag)
_ = gc.collect()

f = open("valid_current_part.pkl","wb")
pickle.dump(valid_current_part,f)
f.close()
del(valid_current_part)
_ = gc.collect()

f = open("valid_past_answer.pkl","wb")
pickle.dump(valid_past_answer,f)
f.close()
del(valid_past_answer)
_ = gc.collect()

f = open("valid_other_feats.pkl","wb")
pickle.dump(valid_other_feats,f)
f.close()
del(valid_other_feats)
_ = gc.collect()

In [32]:
import tensorflow as tf
from tensorflow.keras import optimizers
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Activation, Dense, Dropout, LSTM, Masking, Embedding, Concatenate, Input, Reshape,Flatten, AveragePooling1D
from tensorflow.keras.layers import concatenate
from tensorflow.keras.regularizers import l1, l2, l1_l2
from tensorflow.keras.metrics import AUC
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Lambda
#from tensorflow.keras.layers import merge
from tensorflow.keras.layers import multiply, Reshape
import pandas as pd
import numpy as np
import gc
from sklearn.metrics import roc_auc_score
from collections import defaultdict
from tqdm import tqdm
from tqdm import trange
from tensorflow.keras.utils import Sequence

## modeling
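
The sinusoidal positional encoding defined in the next cell can be checked on a tiny case with NumPy alone. The sketch below re-implements the same formula under the hypothetical name `positional_encoding_np`, without the batch axis and the TensorFlow cast.

```python
import numpy as np

def positional_encoding_np(position, d_model):
    """Return a (position, d_model) array: sin on even columns, cos on odd ones."""
    pos = np.arange(position)[:, np.newaxis]
    i = np.arange(d_model)[np.newaxis, :]
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    angles = pos * angle_rates
    angles[:, 0::2] = np.sin(angles[:, 0::2])
    angles[:, 1::2] = np.cos(angles[:, 1::2])
    return angles

pe = positional_encoding_np(position=4, d_model=6)
print(pe.shape)
print(pe[0])
```

At position 0 every sine column is 0 and every cosine column is 1, which makes a quick sanity check for the implementation.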

In [33]:
def get_angles(pos, i, d_model):
    angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model))
    return pos * angle_rates


def positional_encoding(position, d_model):
    angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                          np.arange(d_model)[np.newaxis, :],
                          d_model)

    # apply sin to even indices in the array; 2i
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

    # apply cos to odd indices in the array; 2i+1
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

    pos_encoding = angle_rads[np.newaxis, ...]

    return tf.cast(pos_encoding, dtype=tf.float32)




def create_padding_mask(seq):
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)

    # add extra dimensions to add the padding
    # to the attention logits.
    return seq[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)




def create_look_ahead_mask(size):
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask  # (seq_len, seq_len)




def scaled_dot_product_attention(q, k, v, mask):
    """Calculate the attention weights.
    q, k, v must have matching leading dimensions.
    k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
    The mask has different shapes depending on its type(padding or look ahead) 
    but it must be broadcastable for addition.

    Args:
      q: query shape == (..., seq_len_q, depth)
      k: key shape == (..., seq_len_k, depth)
      v: value shape == (..., seq_len_v, depth_v)
      mask: Float tensor with shape broadcastable 
            to (..., seq_len_q, seq_len_k). Defaults to None.

    Returns:
      output, attention_weights
    """

    matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)

    # scale matmul_qk
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    # add the mask to the scaled tensor.
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)  

    # softmax is normalized on the last axis (seq_len_k) so that the scores
    # add up to 1.
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)

    output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)

    return output, attention_weights




class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model

        assert d_model % self.num_heads == 0

        self.depth = d_model // self.num_heads

        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)

        self.dense = tf.keras.layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        """Split the last dimension into (num_heads, depth).
        Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)
        """
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]

        q = self.wq(q)  # (batch_size, seq_len, d_model)
        k = self.wk(k)  # (batch_size, seq_len, d_model)
        v = self.wv(v)  # (batch_size, seq_len, d_model)

        q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
        v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)

        # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)
        # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)
        scaled_attention, attention_weights = scaled_dot_product_attention(
            q, k, v, mask)

        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)

        concat_attention = tf.reshape(scaled_attention, 
                                       (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)

        output = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)

        return output, attention_weights





def point_wise_feed_forward_network(d_model, dff):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)
        tf.keras.layers.Dense(d_model)  # (batch_size, seq_len, d_model)
    ])


    
    
    
class EncoderLayer2(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.2):
        super().__init__()

        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = point_wise_feed_forward_network(d_model, dff)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):

        attn_output, _ = self.mha(x, x, x, mask)  # (batch_size, input_seq_len, d_model)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)  # (batch_size, input_seq_len, d_model)

        ffn_output = self.ffn(out1)  # (batch_size, input_seq_len, d_model)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)  # (batch_size, input_seq_len, d_model)

        return out2



class Encoder2(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff,
                   maximum_position_encoding, rate=0.2):
        super().__init__()

        self.d_model = d_model
        self.num_layers = num_layers

        #self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding, 
                                                self.d_model)


        self.enc_layers = [EncoderLayer2(d_model, num_heads, dff, rate) 
                           for _ in range(num_layers)]

        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):

        seq_len = tf.shape(x)[1]

        # adding embedding and position encoding.
        #x = self.embedding(x)  # (batch_size, input_seq_len, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x, training=training)
    
        for i in range(self.num_layers):
            x = self.enc_layers[i](x, training, mask)

        return x  # (batch_size, input_seq_len, d_model)

    
class DecoderLayer2(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.2):
        super().__init__()

        self.mha1 = MultiHeadAttention(d_model, num_heads)
        self.mha2 = MultiHeadAttention(d_model, num_heads)

        self.ffn = point_wise_feed_forward_network(d_model, dff)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)
        self.dropout3 = tf.keras.layers.Dropout(rate)


    def call(self, x, enc_output, training, 
                look_ahead_mask, padding_mask):
        # enc_output.shape == (batch_size, input_seq_len, d_model)

        attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask)  # (batch_size, target_seq_len, d_model)
        attn1 = self.dropout1(attn1, training=training)
        out1 = self.layernorm1(attn1 + x)

        attn2, attn_weights_block2 = self.mha2(
                enc_output, enc_output, out1, padding_mask)  # (batch_size, target_seq_len, d_model)
        attn2 = self.dropout2(attn2, training=training)
        out2 = self.layernorm2(attn2 + out1)  # (batch_size, target_seq_len, d_model)

        ffn_output = self.ffn(out2)  # (batch_size, target_seq_len, d_model)
        ffn_output = self.dropout3(ffn_output, training=training)
        out3 = self.layernorm3(ffn_output + out2)  # (batch_size, target_seq_len, d_model)

        return out3, attn_weights_block1, attn_weights_block2

    
    
class Decoder2(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff,
                    maximum_position_encoding, rate=0.2):
        super().__init__()

        self.d_model = d_model
        self.num_layers = num_layers

        #self.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)

        self.dec_layers = [DecoderLayer2(d_model, num_heads, dff, rate) 
                       for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x, enc_output, training, 
               look_ahead_mask, padding_mask):

        seq_len = tf.shape(x)[1]
        attention_weights = {}

        #x = self.embedding(x)  # (batch_size, target_seq_len, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x, training=training)

        for i in range(self.num_layers):
            x, block1, block2 = self.dec_layers[i](x, enc_output, training, look_ahead_mask, padding_mask)

            # record the attention weights of every decoder layer, not only the last
            attention_weights['decoder_layer{}_block1'.format(i+1)] = block1
            attention_weights['decoder_layer{}_block2'.format(i+1)] = block2

        # x.shape == (batch_size, target_seq_len, d_model)
        return x, attention_weights

class Transformer2(tf.keras.Model):
    def __init__(self, num_layers, d_model, num_heads, dff, padding_length, rate=0.2):
        super().__init__()

        self.encoder = Encoder2(num_layers, d_model, num_heads, dff, padding_length)
        
        self.decoder = Decoder2(num_layers, d_model, num_heads, dff, padding_length)

        self.second_final_layer = tf.keras.layers.Dense(dff)
        self.final_layer = Dense(1,activation = 'sigmoid')
    
    def call(self, inp1, inp2, training, en_combined_mask, de_look_ahead_mask, de_padding_mask):

        enc_output = self.encoder(inp1, training, en_combined_mask)  # (batch_size, inp_seq_len, d_model)
        dec_output, attention_weights = self.decoder(
                inp2, enc_output, training, de_look_ahead_mask, de_padding_mask)
            
        second_final_output = self.second_final_layer(dec_output)  # (batch_size, tar_seq_len, question_answer_pair_size)
        final_output = self.final_layer(second_final_output)
        return final_output
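
How the look-ahead mask keeps attention from reaching future positions can be seen in a small NumPy sketch. It mirrors `create_look_ahead_mask` and the `mask * -1e9` trick from the cell above; the function names and the uniform scores are chosen purely for illustration.

```python
import numpy as np

def look_ahead_mask_np(size):
    """1 marks a position that must NOT be attended to (strictly upper triangle)."""
    return 1 - np.tril(np.ones((size, size)))

def masked_softmax(logits, mask):
    """Softmax over the last axis after pushing masked logits to about -1e9."""
    logits = logits + mask * -1e9
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

mask = look_ahead_mask_np(3)
weights = masked_softmax(np.zeros((3, 3)), mask)
print(mask)
print(weights.round(3))
```

Each row of `weights` spreads its attention over the current and earlier positions only, which is the model-side counterpart of not using future information in the features.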



In [34]:

num_other_feats = len(FEATS)
num_layers = 2
d_model = 256
num_heads = 4
dff = 128

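# Embedding sizes must exceed the largest index (0 is reserved for padding).
n_question = 13524  # content_id reaches 13523 after the +1 shift (see train.max())
n_tag = 1520        # must be larger than the largest tag_num
n_part = 7          # must be larger than the largest part value
n_answer = 4        # user_prev_answer_lag reaches 3 after the +1 shift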

pe_input = 100
epoch = 10
max_len = 100


def build(num_layers, d_model, num_heads, dff, n_question, n_tag, n_part, n_answer, pe_input, num_other_feats, max_len):

    en_input1 = Input(batch_shape = (None, None), name = 'current_question')
    en_input1_embed = Embedding(n_question, d_model)(en_input1)
    en_input2 = Input(batch_shape = (None, None), name = 'current_tag')
    en_input2_embed = Embedding(n_tag, d_model)(en_input2)
    en_input3 = Input(batch_shape = (None, None), name = 'current_part')
    en_input3_embed = Embedding(n_part, d_model)(en_input3)
    en_input = tf.math.add_n([en_input1_embed, en_input2_embed, en_input3_embed])

    en_look_ahead_mask = create_look_ahead_mask(tf.shape(en_input1)[1])
    en_padding_mask = create_padding_mask(en_input1)
    en_combined_mask = tf.maximum(en_look_ahead_mask, en_padding_mask)
    
    
    
    #en_input1_embed = K.sum(en_input1_embed, axis = -2)
    de_input4 = Input(batch_shape = (None, None), name = 'past_answer')
    de_input4_embed = Embedding(n_answer, d_model)(de_input4)

    de_input5 = Input(batch_shape = (None, None, num_other_feats), name = 'other_feature')
    de_input5_masked = (Masking(mask_value= 0, input_shape = (None, None, num_other_feats)))(de_input5)
    de_input5_embed = Dense(d_model, input_shape = (None, None, num_other_feats), activation = 'sigmoid')(de_input5_masked)    
    de_input = tf.math.add_n([de_input4_embed, de_input5_embed])
    
    de_look_ahead_mask = create_look_ahead_mask(tf.shape(de_input4)[1])
    de_padding_mask = create_padding_mask(de_input4)

    
    
    transformer = Transformer2(num_layers, d_model, num_heads, dff, pe_input)
    
    final_output = transformer(en_input, de_input, True, en_combined_mask, de_look_ahead_mask, de_padding_mask)
    
    
    model = Model(inputs=[en_input1, en_input2, en_input3, de_input4, de_input5], outputs=final_output)
    model.compile( optimizer = 'adam',
                    loss = 'binary_crossentropy',
                    metrics=['accuracy',AUC()])
    
    return model

#my_model = build(num_layers, d_model, num_heads, dff, n_question, n_tag, n_part, n_answer, pe_input, num_other_feats, max_len)
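
Since the ids were shifted by one so that 0 can serve as padding, each `Embedding` needs at least `max_id + 1` rows. The helpers below, `required_vocab_size` and `check_vocab`, are a hypothetical sanity check and not part of the original notebook.

```python
def required_vocab_size(ids):
    """Smallest Embedding input_dim that can index every id in ids (0 = padding)."""
    return max(ids) + 1

def check_vocab(name, ids, vocab_size):
    """Raise if any id would fall outside an embedding table of vocab_size rows."""
    needed = required_vocab_size(ids)
    if needed > vocab_size:
        raise ValueError(f"{name}: need input_dim >= {needed}, got {vocab_size}")
    return needed

# Toy ids after the +1 shift: raw values 0..4 became 1..5, and 0 is padding.
shifted_ids = [0, 1, 2, 3, 4, 5]
print(check_vocab("toy_feature", shifted_ids, vocab_size=6))
```

Applied to the `train.max()` output earlier, where `content_id` reaches 13523 and `user_prev_answer_lag` reaches 3, a check like this shows the smallest valid `input_dim` for the corresponding embeddings.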



In [35]:
#my_model.fit([train_current_question, train_current_tag, train_current_part, train_past_answer, train_other_feats] ,train_y, 
#             validation_data=([valid_current_question, valid_current_tag, valid_current_part, valid_past_answer, valid_other_feats] ,valid_y), batch_size = 200,  epochs = 15, verbose = 1)

In [36]:
# my_model only exists once the build and fit cells above are uncommented.
#my_model.save_weights('SAINT_model_feature_extraction.h5')

Have fun with loops! :)