# We make up a toy problem to gain intuition about attention heads

Let's use the following list of words (and a special start token):
<Start>, the, man, chicken, ordered, woman, beef. That is 7 tokens, so V=7, and the token ids are 1 to 7.

This is our corpus:  
<Start> man ordered the chicken

<Start> woman ordered the beef

The corpus will be translated to 2 sequences of token ids as input.
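
As a minimal sketch of that translation, each word can be looked up by its 1-based position in the vocabulary list above. The names `vocab`, `word_to_id`, and `encode` are made up for illustration and do not appear in the notebook; only the vocabulary order and the two sentences come from the text.

```python
# Hypothetical helper names; only the vocabulary order and the corpus come from the text above.
vocab = ["<Start>", "the", "man", "chicken", "ordered", "woman", "beef"]
word_to_id = {word: i + 1 for i, word in enumerate(vocab)}  # ids run from 1 to V

corpus = [
    "<Start> man ordered the chicken",
    "<Start> woman ordered the beef",
]

def encode(sentence):
    """Map a whitespace-separated sentence to a list of 1-based token ids."""
    return [word_to_id[word] for word in sentence.split()]

token_ids = [encode(sentence) for sentence in corpus]
print(token_ids)
```

The two lists this produces are the same id sequences that the notebook hard-codes as `sequence2use` further down.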

Exercise Instructions:
Let's examine the Attention Wt matrix to see how dependencies might be encoded  (look for '<<<-----' markers)

your item 1:
Run the notebook and take a screenshot of the output predictions and attention weights for 1 case that makes good predictions. Write 1 or 2 sentences on how the attention weights are coding the dependency.

your item 2:
Try different values of the H parameter (20 works well I think, but not always).

Try 10, 20, 40. Try at least a few runs for each value.
Write 1 or 2 sentences summarizing whether H larger or smaller than 20 is noticeably better or worse. Look at the predictions for token 'the' to see if it predicts 'chicken' or 'beef' appropriately, and look at the loss value from training. Note that the loss is taken over all predictions, not just 'chicken' or 'beef'.

In [1]:

#Let's first make a VxV matrix that represents 'association' between row and col
#eg <start>    predicts 'man' with value 1,
#      'the'   predicts 'beef' or 'chicken' with value 0.5,0.5

import pandas as pd
import numpy as np
np.set_printoptions(precision=4)

import tensorflow as tf

#tf.keras.utils.set_random_seed(777)

#set the embedding/attention head size parameters
E = 7  #size of embedding layer (note: E is not used below; the Embedding layers use V as their output size)
H = 20  #size of attention head    #<<<<<<<<<---- try 10,20,40 ---
                                     # How does the H parameter affect the Attn Wts?  hint: look at model summary output

#Note B will be the batch size, it will be set to number of sequences we create



In [2]:
# Set up sequence input
if 1:
    colnames  =  ["<ST>", "the", "man","chkn","ordrd","woman","beef"]
    V    = len(colnames)
    #------ make a sequence of length T --------------
    sequence2use = np.asarray([[1,3,5,2,4],[1,6,5,2,7]])  #start is first token, with array index 0, 
    B,T = sequence2use.shape
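    #targets: at each position the target is the next token; the last position wraps around to the first token (<ST>)
    sequence2pred       = np.zeros((B,T),dtype=int)
    for bi in range(B):
      sequence2pred[bi,0:-1] = sequence2use[bi,1:]
      sequence2pred[bi,-1]   = sequence2use[bi,0]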

#set up row labels (position + token name) for each sequence, for nice dataframe printouts later
rownamesxb = []
for bi in range(B):
    rownames=list()
    for i in range(T):
        rownames.append(str(i)+' '+colnames[sequence2use[bi,i]-1])
    rownamesxb.append(rownames)


In [3]:
print(sequence2use)
print(sequence2pred)
rownamesxb

[[1 3 5 2 4]
 [1 6 5 2 7]]
[[3 5 2 4 1]
 [6 5 2 7 1]]


[['0 <ST>', '1 man', '2 ordrd', '3 the', '4 chkn'],
 ['0 <ST>', '1 woman', '2 ordrd', '3 the', '4 beef']]
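
The second array printed above is the first one shifted left by one position, with the first token wrapped around to the end, so the target at each position is simply the next token. As a sketch assuming only NumPy, the same targets can be produced with `np.roll` instead of the explicit loop in the cell above:

```python
import numpy as np

# Same 1-based token ids as sequence2use in the notebook.
sequence2use = np.asarray([[1, 3, 5, 2, 4], [1, 6, 5, 2, 7]])

# Shift every row one step to the left; the first token wraps around to the last slot.
targets = np.roll(sequence2use, shift=-1, axis=1)
print(targets)
```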

In [4]:
#Now set up training data; the batch size B is the number of sequences (2 here)
#set up token sequences of id numbers  
Xtrain  = np.zeros((B,T,1)) #sequence2use.copy()
for bi in range(B):
    for ti in range(T):
       Xtrain[bi,ti]=sequence2use[bi,ti]-1  #ids are 1..V but Embedding indices start at 0, so subtract 1


#make Postrain with the same batch size as X,Y
Postrain=np.zeros((B,T,1))
for bi in range(B):
  Postrain[bi,:,0] = np.arange(T)  #set position to integers 0...T-1
#Postrain  = tf.expand_dims(Postrain,axis=2)     #add batch dim to input
print(Postrain.shape)

Ytrain=sequence2pred.copy()
for bi in range(B):
   for ti in range(T):
      Ytrain[bi,ti]=Ytrain[bi,ti]-1  #index starts at 0 so subtract1
Ytrain = tf.expand_dims(Ytrain,axis=2)
print(Ytrain.shape)


(2, 5, 1)
(2, 5, 1)




In [5]:
#Now set up model related values
scale_value    = np.divide(1,np.sqrt(H)) #use H b/c it's dimension of Qmat, Kmat
scale_constant =tf.constant(scale_value,dtype=tf.float32)  #this will scale the QK scores before softmax

#make TxT matrix of the scale value to apply element-wise (not used below; the model uses scale_constant)
initializer = tf.keras.initializers.Constant(scale_value)
scale_matrix = initializer(shape=(T,T))

#Make a lower-triangular (causal) mask so that position t only attends to positions <= t
Msk=np.ones((T,T))
Msk = tf.convert_to_tensor(np.tril(Msk))
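
To see what the scale value and the lower-triangular mask do together, here is a NumPy-only sketch of scaled, masked softmax attention weights. It is an illustration under assumptions, not the Keras code path: `Q` and `K` are random stand-ins for the learned projections, and `masked_softmax` is a made-up helper name.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Row-wise softmax that gives zero weight wherever mask == 0."""
    masked = np.where(mask > 0, scores, -np.inf)           # blocked positions get -inf
    masked = masked - masked.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    weights = np.exp(masked)                               # exp(-inf) is exactly 0
    return weights / weights.sum(axis=-1, keepdims=True)

T, H = 5, 20
rng = np.random.default_rng(0)
Q = rng.normal(size=(T, H))   # stand-in for the Qmat output (assumption: random values)
K = rng.normal(size=(T, H))   # stand-in for the Kmat output (assumption: random values)

scores = Q @ K.T / np.sqrt(H)       # TxT scores, scaled by 1/sqrt(H)
mask = np.tril(np.ones((T, T)))     # same lower-triangular pattern as Msk
weights = masked_softmax(scores, mask)
print(weights.round(3))
```

Every row sums to 1, everything above the diagonal is 0, and the first row puts all its weight on position 0 because that is the only position it is allowed to see.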


In [6]:
#show the Xtrain input
print(Xtrain)


[[[0.]
  [2.]
  [4.]
  [1.]
  [3.]]

 [[0.]
  [5.]
  [4.]
  [1.]
  [6.]]]


In [7]:
#Now build model to learn transformation for Q,K,V matrices

Xseqids_in       = tf.keras.layers.Input(shape=(T),name='Xseqids')        #this takes in the T token ids
Xembed         = tf.keras.layers.Embedding(V,V,input_length=T,name='XEmbed')(Xseqids_in) #1st V is number of possible ids, 2nd V is embedding dimension (columns), T is rows, so the output is TxV

#Xsequence     = tf.keras.layers.Input(shape=(T,V)) #the batch size is left unspecified
Pos_Input     = tf.keras.layers.Input(shape=(T),name='PosInput') #just the position integers t=0...T-1
Pos_Embed     = tf.keras.layers.Embedding(T,V, input_length=T,name='PosEmbed')(Pos_Input) #each of the T inputs is a position index between 0 and T-1
Xinputs       = tf.keras.layers.Add(name='XwithPos')([Xembed, Pos_Embed])

#now feed to Q,K,V transformations
Qmat       = tf.keras.layers.Dense(H,activation='linear',use_bias=False,name='Qmat')(Xinputs) #Dense acts on the last axis, so each of the T rows is mapped from V to H (weights are VxH)
Kmat       = tf.keras.layers.Dense(H,activation='linear',use_bias=False,name='Kmat')(Xinputs) 
Vmat       = tf.keras.layers.Dense(V,activation='linear',use_bias=False,name='Vmat')(Xinputs) #Vmat output is V columns to use it as output

#now take the Q-to-K dot product, scale it, take softmax, apply to V
QK          = tf.keras.layers.Dot(axes=(2))([Qmat,Kmat])  #dot over the H axis gives TxT scores; it will treat each Batch item separately
QKscaled    = tf.keras.layers.Lambda(lambda x: x * scale_constant)(QK)    #for each x in QK mult by scale
Attn_Wts     = tf.keras.layers.Softmax(axis=2,name='AttnWts')(QKscaled, mask=Msk)        #apply mask, apply softmax across row elements

Vout       = tf.keras.layers.Dot(axes=(2,1),name='Voutput')([Attn_Wts,Vmat])     #turn this off for our tests, and ..

#add this Add layer, with Xinputs skipped ahead to this input
VoutandXinput = tf.keras.layers.Add(name='skipadd')([Vout,Xinputs])  #<<< this is a skip connection with plus (aka residual skip connection)

Pout  = tf.keras.layers.Activation(activation='sigmoid')(VoutandXinput)

my_attn_model   = tf.keras.Model(inputs = [Xseqids_in,Pos_Input], outputs=Pout)


In [8]:
#configure loss and optimizer  
#Note: Sparse loss means Keras expects the target to be an integer indicating which output should be 'on'
#    For the matrix output, Keras treats each row as a separate prediction
#my_attn_model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.01), 
my_attn_model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), 
                      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)) #from_logits=True would mean the model output is linear, -inf to +inf

my_attn_model.summary()


Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 Xseqids (InputLayer)           [(None, 5)]          0           []                               
                                                                                                  
 PosInput (InputLayer)          [(None, 5)]          0           []                               
                                                                                                  
 XEmbed (Embedding)             (None, 5, 7)         49          ['Xseqids[0][0]']                
                                                                                                  
 PosEmbed (Embedding)           (None, 5, 7)         35          ['PosInput[0][0]']               
                                                                                              

In [9]:
#Note, if the loss is still going down, you can just keep running this cell to 
#  continue training

red_lrate = tf.keras.callbacks.ReduceLROnPlateau(monitor='loss', factor=0.2,min_lr=0.0001, patience=10,verbose=1)

my_attn_model.fit( [Xtrain,Postrain],Ytrain,batch_size=2,shuffle=True,epochs=10,verbose=2)
my_attn_model.fit( [Xtrain,Postrain],Ytrain,batch_size=2,shuffle=True,epochs=5000,verbose=0,callbacks=[red_lrate])
my_attn_model.fit( [Xtrain,Postrain],Ytrain,batch_size=2,shuffle=True,epochs=10,verbose=2)



Epoch 1/10
1/1 - 0s - loss: 1.9300 - 391ms/epoch - 391ms/step
Epoch 2/10
1/1 - 0s - loss: 1.9267 - 1ms/epoch - 1ms/step
Epoch 3/10
1/1 - 0s - loss: 1.9234 - 1ms/epoch - 1ms/step
Epoch 4/10
1/1 - 0s - loss: 1.9202 - 946us/epoch - 946us/step
Epoch 5/10
1/1 - 0s - loss: 1.9169 - 1ms/epoch - 1ms/step
Epoch 6/10
1/1 - 0s - loss: 1.9137 - 1ms/epoch - 1ms/step
Epoch 7/10
1/1 - 0s - loss: 1.9104 - 947us/epoch - 947us/step
Epoch 8/10
1/1 - 0s - loss: 1.9072 - 940us/epoch - 940us/step
Epoch 9/10
1/1 - 0s - loss: 1.9040 - 929us/epoch - 929us/step
Epoch 10/10
1/1 - 0s - loss: 1.9008 - 983us/epoch - 983us/step
Epoch 1/10
1/1 - 0s - loss: 0.1658 - 3ms/epoch - 3ms/step
Epoch 2/10
1/1 - 0s - loss: 0.1657 - 1ms/epoch - 1ms/step
Epoch 3/10
1/1 - 0s - loss: 0.1657 - 982us/epoch - 982us/step
Epoch 4/10
1/1 - 0s - loss: 0.1657 - 1ms/epoch - 1ms/step
Epoch 5/10
1/1 - 0s - loss: 0.1657 - 1ms/epoch - 1ms/step
Epoch 6/10
1/1 - 0s - loss: 0.1657 - 987us/epoch - 987us/step
Epoch 7/10
1/1 - 0s - loss: 0.1657 - 97

<keras.callbacks.History at 0x155154100070>
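
Two reference points help when reading these loss values. This is a back-of-the-envelope sketch under stated assumptions: the loss behaves like a cross-entropy over a normalized distribution on the V tokens, averaged over the T positions, and every position except the first can be predicted perfectly from context. The sigmoid outputs of this model are not themselves normalized, so treat the numbers as a rough guide only.

```python
import math

V = 7  # vocabulary size
T = 5  # positions per sequence

# Guessing uniformly over all V tokens costs ln(V) at every position.
uniform_loss = math.log(V)

# Position 0 is always <ST>, followed by 'man' in one sequence and 'woman' in the other.
# With nothing earlier to attend to, the best guess is 50/50, which costs ln(2) at that
# position. If the other four positions are predicted perfectly, the average is ln(2)/T.
loss_floor = math.log(2) / T

print(round(uniform_loss, 4), round(loss_floor, 4))
```

The first training epochs above start just under the uniform-guess value, and the final loss of about 0.166 sits a little above this estimated floor.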

In [10]:
Vout_pred =my_attn_model.predict([Xtrain,Postrain])

import pandas as pd
for bi in range(B):
  Outputdf = pd.DataFrame(data=Vout_pred[bi], index=rownamesxb[bi], columns=colnames)
  print('\n This is the output predictions use Head size',H)  #<<<<------------check these out, are they good?
  print(Outputdf.round(3).head(T))



 This is the output predictions use Head size 20
          <ST>    the    man   chkn  ordrd  woman   beef
0 <ST>   0.596  0.037  0.988  0.202  0.193  0.988  0.077
1 man    0.062  0.579  0.575  0.243  0.996  0.184  0.802
2 ordrd  0.092  0.991  0.122  0.612  0.319  0.050  0.403
3 the    0.225  0.705  0.016  0.996  0.008  0.018  0.837
4 chkn   0.994  0.103  0.365  0.679  0.104  0.634  0.333

 This is the output predictions use Head size 20
          <ST>    the    man   chkn  ordrd  woman   beef
0 <ST>   0.596  0.037  0.988  0.202  0.193  0.988  0.077
1 woman  0.037  0.578  0.189  0.061  0.999  0.051  0.940
2 ordrd  0.129  0.994  0.049  0.626  0.153  0.034  0.511
3 the    0.028  0.686  0.163  0.727  0.786  0.034  0.992
4 beef   0.994  0.107  0.369  0.679  0.099  0.633  0.340
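
To check the 'the' rows without reading the tables by eye, the sketch below takes the two rows labelled `3 the` (values copied from the printout above, for this one run with H=20) and picks the column with the highest score. The dictionary keys are just descriptive labels chosen for illustration.

```python
import numpy as np

colnames = ["<ST>", "the", "man", "chkn", "ordrd", "woman", "beef"]

# Rows labelled '3 the', copied from the two output tables above.
the_rows = {
    "man ordered the": [0.225, 0.705, 0.016, 0.996, 0.008, 0.018, 0.837],
    "woman ordered the": [0.028, 0.686, 0.163, 0.727, 0.786, 0.034, 0.992],
}

# The predicted next token is the column with the largest output value.
predicted = {context: colnames[int(np.argmax(row))] for context, row in the_rows.items()}
print(predicted)
```

Note that the rows do not sum to 1, since each column goes through its own sigmoid, so only the ranking within a row is being compared here.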


## Now we want to examine the Attention Wt matrix to see how those matrices affect the 'the' predictions for each input sequence



In [11]:
#get layer outputs (like we did with MNIST), to use below
layer_outputs     = [layer.output for layer in my_attn_model.layers[:]]
my_model_actvtns  = tf.keras.models.Model(inputs=my_attn_model.input, outputs=layer_outputs)
my_actvtns_output = my_model_actvtns.predict([Xtrain,Postrain]) 
#my_actvtns_output = my_model_actvtns.predict(Xtrain) 

print('There are ',str(len(my_actvtns_output))+ ' layers with output activations')


There are  14 layers with output activations


In [12]:
#Print layer names and count layers
layercnt=0
for layer in my_attn_model.layers[:]:
    print('layer num',layercnt,layer.name)
    layercnt+=1


layer num 0 Xseqids
layer num 1 PosInput
layer num 2 XEmbed
layer num 3 PosEmbed
layer num 4 XwithPos
layer num 5 Qmat
layer num 6 Kmat
layer num 7 dot
layer num 8 lambda
layer num 9 AttnWts
layer num 10 Vmat
layer num 11 Voutput
layer num 12 skipadd
layer num 13 activation


In [13]:
#now get Attn Wts   

#  <<<<<<< ------------ Can you interpret how Attn Wts are picking out predictions?  
layernum2get=9  #index of the 'AttnWts' layer in the layer list printed above
AttnW_output = my_actvtns_output[layernum2get]

for bi in range(B):

  AttnW_outputdf = pd.DataFrame(data=AttnW_output[bi], index=rownamesxb[bi], columns=rownamesxb[bi])
  print('for i-th input:',bi,'  These are the output at layer ',my_attn_model.layers[layernum2get].name)
  print(AttnW_outputdf.round(3).head(T))
  print('The head size H was: ',H)

for i-th input: 0   These are the output at layer  AttnWts
         0 <ST>  1 man  2 ordrd  3 the  4 chkn
0 <ST>    1.000  0.000    0.000  0.000   0.000
1 man     0.093  0.907    0.000  0.000   0.000
2 ordrd   0.064  0.561    0.375  0.000   0.000
3 the     0.003  0.213    0.759  0.025   0.000
4 chkn    0.258  0.003    0.160  0.523   0.056
The head size H was:  20
for i-th input: 1   These are the output at layer  AttnWts
         0 <ST>  1 woman  2 ordrd  3 the  4 beef
0 <ST>    1.000    0.000    0.000  0.000   0.000
1 woman   0.079    0.921    0.000  0.000   0.000
2 ordrd   0.091    0.374    0.535  0.000   0.000
3 the     0.001    0.769    0.223  0.007   0.000
4 beef    0.252    0.001    0.162  0.526   0.059
The head size H was:  20
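
One way to start reading these tables is to take the `3 the` row of each sequence and rank the positions by attention weight. The sketch below does that with the values copied from the printout above. It reflects this single run with H=20, and other runs may distribute the weights differently.

```python
import numpy as np

# (column labels, '3 the' row) copied from the two attention tables above.
attn_the = {
    "man ordered the chicken": (["0 <ST>", "1 man", "2 ordrd", "3 the", "4 chkn"],
                                [0.003, 0.213, 0.759, 0.025, 0.000]),
    "woman ordered the beef": (["0 <ST>", "1 woman", "2 ordrd", "3 the", "4 beef"],
                               [0.001, 0.769, 0.223, 0.007, 0.000]),
}

top = {}
for name, (labels, weights) in attn_the.items():
    order = np.argsort(weights)[::-1]          # positions sorted by weight, largest first
    top[name] = labels[order[0]]
    print(name, "->", [(labels[i], weights[i]) for i in order[:2]])
```

In both sequences nearly all of the weight for 'the' falls on positions 1 and 2, but the split between the subject and 'ordrd' differs. The weight on the final position is 0 because of the mask.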
