# Siamese Networks

Siamese networks can find similarities, for example between a new question and previously asked ones, which helps to answer the new question if an old one is similar.

Below is an example architecture. Even though these are two networks, only one has to be trained, since both share the same parameters. The only difference is the input (e.g. different word sequences). The two output vectors are compared using cosine similarity, so the result satisfies -1 <= y_hat <= 1.

![](img/siamese.png)

[Source](https://www.coursera.org/learn/sequence-models-in-nlp/supplement/oUdcN/architecture)
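The weight sharing can be illustrated without any deep learning framework. The sketch below is a toy stand-in, not the architecture from the figure: `encode` and the random matrix `W` are hypothetical placeholders for the shared sub-network, and the two inputs are arbitrary vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shared parameters: ONE weight matrix used by both branches
W = rng.normal(size=(5, 3))

def encode(x):
    """Toy encoder standing in for the shared sub-network."""
    return np.tanh(x @ W)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

x1 = rng.normal(size=5)  # first input
x2 = rng.normal(size=5)  # second input

# Both inputs go through the same function with the same parameters
y_hat = cosine_similarity(encode(x1), encode(x2))
print(y_hat)  # a value between -1 and 1
```

Because both branches call the same `encode`, updating `W` changes both outputs at once, which is why only one set of parameters needs training.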

In [1]:
#import numpy as np
import trax
from trax import layers as tl
import trax.fastmath.numpy as np
import numpy

In [2]:
numpy.random.seed(10)
%config Completer.use_jedi = False

In [3]:
def L2_normalize(x):
    return x / np.sqrt(np.sum(x * x, axis=-1, keepdims=True))

In [4]:
tensor = numpy.random.random((2,5))
tensor

array([[0.77132064, 0.02075195, 0.63364823, 0.74880388, 0.49850701],
       [0.22479665, 0.19806286, 0.76053071, 0.16911084, 0.08833981]])

In [5]:
norm_tensor = L2_normalize(tensor)
norm_tensor



DeviceArray([[0.57393795, 0.01544148, 0.4714962 , 0.55718327, 0.37093794],
             [0.26781026, 0.23596111, 0.9060541 , 0.20146926, 0.10524315]],            dtype=float32)
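A quick sanity check for the normalization: after L2-normalizing, every row should have length 1. The sketch below uses plain NumPy and a small hand-picked tensor (an assumption for illustration, not the random tensor from the cell above) so the result is easy to verify by hand.

```python
import numpy as np

def L2_normalize(x):
    """Scale each row of x to unit Euclidean length."""
    return x / np.sqrt(np.sum(x * x, axis=-1, keepdims=True))

# Hand-picked example: the first row is a 3-4-5 triangle
tensor = np.array([[3.0, 4.0],
                   [1.0, 0.0]])

norm_tensor = L2_normalize(tensor)
print(norm_tensor)  # first row becomes [0.6, 0.8], second row stays [1, 0]

row_norms = np.linalg.norm(norm_tensor, axis=-1)
print(np.allclose(row_norms, 1.0))  # True
```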

In [6]:
vocab_size = 500
model_dimension = 128

# A simple LSTM
LSTM = tl.Serial(
        tl.Embedding(vocab_size=vocab_size, d_feature=model_dimension),
        tl.LSTM(model_dimension),
        tl.Mean(axis=1),
        tl.Fn('Normalize', lambda x: L2_normalize(x))  # uses the helper defined in In [3]
    )

# Turns into a Siamese network via 'Parallel'
Siamese = tl.Parallel(LSTM, LSTM)

In [7]:
Siamese

Parallel_in2_out2[
  Serial[
    Embedding_500_128
    LSTM_128
    Mean
    Normalize
  ]
  Serial[
    Embedding_500_128
    LSTM_128
    Mean
    Normalize
  ]
]

## It's all lost - How to calculate simple loss

To calculate the loss, we need to compare sequences. The original question is called the `Anchor`, a similar one the `Positive`, and a completely unrelated one the `Negative`.


*Do you like this course?* - Anchor

*Are you happy with this course?* - Positive

*Do you speak German?* - Negative

![image.png](img/sim.png)
[Source](https://www.wikiwand.com/en/Cosine_similarity)

- similarity between Anchor A and Positive P: `s(A,P) ~ 1`
- similarity between Anchor A and Negative N: `s(A,N) ~ -1`

-> Try to minimize the difference = `s(A,N) - s(A,P)`

![](img/siamese_loss_1.png)
[Source](https://www.coursera.org/learn/sequence-models-in-nlp/supplement/Dts95/cost-function)
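As a rough illustration of the difference `s(A,N) - s(A,P)`, the sketch below uses three hand-made 2-D vectors. They are hypothetical stand-ins for the encoded Anchor, Positive and Negative, not real model outputs.

```python
import numpy as np

def cosine_similarity(v1, v2):
    """Cosine of the angle between two vectors."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Hypothetical embeddings, chosen by hand for illustration
anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])    # points almost the same way as the anchor
negative = np.array([-1.0, 0.2])   # points almost the opposite way

s_ap = cosine_similarity(anchor, positive)
s_an = cosine_similarity(anchor, negative)

diff = s_an - s_ap
print(s_ap, s_an, diff)  # diff is close to -2, the smallest possible value
```

Since both similarities lie in [-1, 1], the difference can never drop below -2.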




## We're not lost. Use triplets
Computing the loss as shown above can push it far below zero (down to -2). Applying a ReLU to the difference (loss on the y-axis, difference on the x-axis) does the trick 😉, because we want Loss >= 0.
![](img/triplets.png)
[Source](https://www.coursera.org/learn/sequence-models-in-nlp/supplement/Xm3vv/triplets)

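A plain ReLU has its kink at zero. The margin alpha shifts the kink to the left or right and thereby controls the loss: with a positive alpha, the loss only reaches zero once `s(A,P)` exceeds `s(A,N)` by at least alpha.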

![image.png](img/triplet_summary.png)
[Source](https://www.coursera.org/learn/sequence-models-in-nlp/supplement/Xm3vv/triplets)
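The effect of the margin can be tried out directly on similarity scores. This is a minimal sketch: the function name `triplet_loss` and the three score pairs are made up for illustration, and `alpha = 0.25` is simply the value used later in this notebook.

```python
import numpy as np

def triplet_loss(s_an, s_ap, alpha=0.25):
    """max(s(A,N) - s(A,P) + alpha, 0) for given similarity scores."""
    return np.maximum(s_an - s_ap + alpha, 0.0)

# Easy triplet: the negative is far away, so the loss is clipped to 0
easy = triplet_loss(s_an=-0.9, s_ap=0.9)

# Positive is closer than the negative, but by less than alpha: small loss
inside_margin = triplet_loss(s_an=0.8, s_ap=0.9)

# Negative is closer than the positive: large loss
hard = triplet_loss(s_an=0.9, s_ap=0.5)

print(easy, inside_margin, hard)  # roughly 0.0, 0.15 and 0.65
```

Without alpha the second triplet would already count as solved. With the margin it still contributes to the loss.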

## Costs, costs, costs

A batch of training data may look like this:

|q1|q2|
|---|---|
|How much is the fish?|What does the fish cost?|
|How old are you?|What is your age?|
|...|...|

Within a row, the two questions have a similar meaning. Within a column, each question must have a unique meaning, so that every other row in the batch can serve as a negative example.

<br>
<br>

Below we see encoded question matrices.
![](img/hard_negative_mining.png)
[Source](https://www.coursera.org/learn/sequence-models-in-nlp/supplement/g0yAF/computing-the-cost-i)

<br>
<br>

We build a similarity matrix, which can then be used to calculate the costs.
![](img/sim_matrix.png)
[Source](https://www.coursera.org/learn/sequence-models-in-nlp/supplement/g0yAF/computing-the-cost-i)

- mean negative, aka mean_neg: the mean of the off-diagonal (orange) values in a row
- closest negative, aka closest_neg: the greatest off-diagonal (orange) value in a row
- mean_neg accelerates training and reduces noise
- closest_neg yields higher penalties

<br>
<br>

This was our previous Loss function.

$\mathcal{L} = \max{(\mathrm{s}(A,N) -\mathrm{s}(A,P) +\alpha, 0)}$

We replace s(A,N) with mean_neg (cost_1) and closest_neg (cost_2).

$\mathcal{L_\mathrm{1}} = \max{(mean\_neg -\mathrm{s}(A,P)  +\alpha, 0)}$

$\mathcal{L_\mathrm{2}} = \max{(closest\_neg -\mathrm{s}(A,P)  +\alpha, 0)}$

And finally we calculate the full loss.

$\mathcal{L_\mathrm{Full}} = \mathcal{L_\mathrm{1}} + \mathcal{L_\mathrm{2}}$


## Enough theory about costs! Python please :)

### 1. Similarity Score

In [8]:
import numpy as np


def cosine_similarity(v1, v2):
    """Calculates similarity between vectors"""
    return np.dot(v1, v2) / (np.sqrt(np.dot(v1, v1)) * np.sqrt(np.dot(v2, v2)))

v1 = np.array([1, 2, 3.3, 4], dtype=float)
v2 = np.array([1.5, 2, 3, 4], dtype=float)

print(cosine_similarity(v1, v2))


v2 = np.array([1.5, 2, 3, -4], dtype=float)

print(cosine_similarity(v1, v2))

v2 = -np.array([1.5, 2, 3, 4], dtype=float)

print(cosine_similarity(v1, v2))

0.9946662395953002
-0.01900636126615228
-0.9946662395953002


### 2. Similarity Score matrix

In [9]:
v1_1 = np.array([1, 2, 3])
v1_2 = np.array([9, 8, 7])
v1_3 = np.array([-1, -4, -2])
v1_4 = np.array([1, -7, 2])
v1 = np.vstack([v1_1, v1_2, v1_3, v1_4])

print(v1.shape)
v1

(4, 3)


array([[ 1,  2,  3],
       [ 9,  8,  7],
       [-1, -4, -2],
       [ 1, -7,  2]])

In [10]:
v2_1 = v1_1 + np.random.normal(0.01, 1, 3)
v2_2 = v1_2 + np.random.normal(0.01, 1, 3)
v2_3 = v1_3 + np.random.normal(0.01, 1, 3)
v2_4 = v1_4 + np.random.normal(0.01, 1, 3)
v2 = np.vstack([v2_1, v2_2, v2_3, v2_4])

print(v2.shape)
v2

(4, 3)


array([[ 1.27551159,  2.11854853,  3.01429143],
       [ 8.83539979,  8.44302619,  8.21303737],
       [-1.95506567, -2.96172592, -1.76136987],
       [ 1.45513761, -8.12660221,  2.14513688]])

In [11]:
assert len(v1) == len(v2), "batch sizes must match"
assert v1.shape == v2.shape, "shapes must match"

One could now loop over each vector combination (4x4) and calculate the matrix.

In [12]:
len_v = len(v1)

for i in range(len_v):
    print("",f"Row {i+1}",  sep="\n")
    for j in range(len_v):
        print(cosine_similarity(v1[i], v2[j]))


Row 1
0.9977567116697676
0.9141005340085158
-0.8879260314833884
-0.26201839722275444

Row 2
0.9120366010156774
0.9974121429851325
-0.9716636847049086
-0.3105877856710746

Row 3
-0.8831049175844488
-0.8748968670442764
0.9542332192488748
0.6846115343014618

Row 4
-0.26267254906979726
-0.31274986428523927
0.523953745127065
0.9991906807604154


However, a more elegant way is to L2-normalize the rows first and then take a single matrix product: the cosine similarity of two unit-length vectors is just their dot product. Math ^^...

In [13]:
similarity_matrix = np.dot(L2_normalize(v1), L2_normalize(v2).T)
similarity_matrix

array([[ 0.99775671,  0.91410053, -0.88792603, -0.2620184 ],
       [ 0.9120366 ,  0.99741214, -0.97166368, -0.31058779],
       [-0.88310492, -0.87489687,  0.95423322,  0.68461153],
       [-0.26267255, -0.31274986,  0.52395375,  0.99919068]])
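The reason both approaches agree is that cos(a, b) = a·b / (|a| |b|) = (a/|a|) · (b/|b|). The following sketch checks this numerically on two small hand-made matrices (the values of `a` and `b` are arbitrary assumptions). The helper functions repeat the ones defined earlier so the block runs on its own.

```python
import numpy as np

def cosine_similarity(v1, v2):
    """Cosine similarity of two 1-D vectors."""
    return np.dot(v1, v2) / (np.sqrt(np.dot(v1, v1)) * np.sqrt(np.dot(v2, v2)))

def L2_normalize(x):
    """Scale each row of x to unit Euclidean length."""
    return x / np.sqrt(np.sum(x * x, axis=-1, keepdims=True))

a = np.array([[1.0, 2.0, 3.0],
              [9.0, 8.0, 7.0]])
b = np.array([[1.0, 2.0, 2.0],
              [-1.0, 0.0, 4.0]])

# Pairwise loop: entry (i, j) compares row i of a with row j of b
looped = np.array([[cosine_similarity(x, y) for y in b] for x in a])

# One matrix product of the normalized rows
vectorized = L2_normalize(a) @ L2_normalize(b).T

print(np.allclose(looped, vectorized))  # True
```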

Note how nicely the high similarities show up on the diagonal 🙂.

In [15]:
np.diag(np.diag(similarity_matrix))

array([[0.99775671, 0.        , 0.        , 0.        ],
       [0.        , 0.99741214, 0.        , 0.        ],
       [0.        , 0.        , 0.95423322, 0.        ],
       [0.        , 0.        , 0.        , 0.99919068]])

### 4. mean_neg & closest_neg

In [25]:
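# Zero out the diagonal (the positives), then average the remaining entries per row
negatives_only = similarity_matrix - np.diag(np.diag(similarity_matrix))
mean_neg = np.sum(negatives_only, axis=-1, keepdims=True) / (similarity_matrix.shape[0] - 1)
mean_neg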

array([[-0.07861463],
       [-0.12340496],
       [-0.35779675],
       [-0.01715622]])

In [26]:
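# Mask the diagonal (the positives) so that only negatives compete for the row maximum.
# This also works when a negative happens to score higher than the positive.
is_positive = np.eye(similarity_matrix.shape[0], dtype=bool)
closest_neg = np.where(is_positive, -np.inf, similarity_matrix).max(axis=-1, keepdims=True)
closest_neg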

array([[0.91410053],
       [0.9120366 ],
       [0.68461153],
       [0.52395375]])

### 5. Costs (finally)
All this reshaping is a bit confusing, though...

In [29]:
alpha = 0.25

positives = np.diag(similarity_matrix).reshape(-1, 1)  # s(A,P) as a column vector

L_1 = np.maximum(mean_neg - positives + alpha, 0)
L_2 = np.maximum(closest_neg - positives + alpha, 0)
L_full = L_1 + L_2

cost = np.sum(L_full)
cost

0.33096828036929316
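For reference, the individual steps can be collected into one function. This is only a sketch in plain NumPy, and the name `triplet_cost` is made up here. It masks the diagonal to obtain the negatives, and `v2` is copied from the printed (rounded) output above so the block runs on its own.

```python
import numpy as np

def triplet_cost(v1, v2, alpha=0.25):
    """Sum of L_1 (mean negative) and L_2 (closest negative) over a batch."""
    # Cosine similarity matrix via L2-normalized rows
    v1n = v1 / np.linalg.norm(v1, axis=-1, keepdims=True)
    v2n = v2 / np.linalg.norm(v2, axis=-1, keepdims=True)
    sim = v1n @ v2n.T

    batch_size = sim.shape[0]
    is_positive = np.eye(batch_size, dtype=bool)
    positives = np.diag(sim).reshape(-1, 1)  # s(A,P) per row

    # Negatives are the off-diagonal entries of each row
    mean_neg = np.where(is_positive, 0.0, sim).sum(axis=-1, keepdims=True) / (batch_size - 1)
    closest_neg = np.where(is_positive, -np.inf, sim).max(axis=-1, keepdims=True)

    l_1 = np.maximum(mean_neg - positives + alpha, 0.0)
    l_2 = np.maximum(closest_neg - positives + alpha, 0.0)
    return float(np.sum(l_1 + l_2))

v1 = np.array([[1, 2, 3], [9, 8, 7], [-1, -4, -2], [1, -7, 2]], dtype=float)
v2 = np.array([[1.27551159, 2.11854853, 3.01429143],
               [8.83539979, 8.44302619, 8.21303737],
               [-1.95506567, -2.96172592, -1.76136987],
               [1.45513761, -8.12660221, 2.14513688]])

cost = triplet_cost(v1, v2)
print(cost)  # approximately 0.331, in line with the cell above
```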

## One shot learning

- in classification, you train on k classes and have to retrain for every new class
- in one shot learning, you just compare one example with another via a Siamese network
- only the similarity score is needed
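A minimal sketch of this idea, assuming the embeddings have already been produced by a trained Siamese encoder. The reference vectors, the labels, the helper `match` and the threshold of 0.8 are all hypothetical and chosen by hand for illustration.

```python
import numpy as np

def cosine_similarity(v1, v2):
    """Cosine of the angle between two vectors."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Hypothetical reference embeddings, one per known class
references = {
    "class_a": np.array([0.9, 0.1, 0.0]),
    "class_b": np.array([0.0, 1.0, 0.2]),
}

def match(query, references, threshold=0.8):
    """Return the most similar label, or None if nothing is similar enough."""
    best_label, best_score = None, -1.0
    for label, ref in references.items():
        score = cosine_similarity(query, ref)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None

first = match(np.array([1.0, 0.0, 0.1]), references)
print(first)  # class_a

# A new class needs no retraining, only one new reference embedding
references["class_c"] = np.array([0.0, 0.1, 1.0])
second = match(np.array([0.1, 0.0, 0.9]), references)
print(second)  # class_c
```

A query that is not similar enough to any reference returns `None` instead of being forced into one of the known classes.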