# Similar Images @ Large Scale, Explained

### Abstract

#### Challenge
Online spam accounts often share the same avatar. Simple hashing methods that test for exact matches, such as md5sum, are easily evaded.

Some similarity algorithms compare the activations a network produces for each image, which comes down to comparing vectors. Such a solution requires all of those vectors to be loaded into memory, which becomes an engineering headache once the project grows in size.

#### Solution
Here we provide a neural network model that hashes an image into a string, which makes storage and engineering easy and lets the approach scale to really large datasets. Finding similar images then comes down to plain SQL operations on strings in a database.
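As a minimal sketch of what such a SQL lookup could look like, the example below uses Python's built-in `sqlite3` module. The table name `avatars`, its columns, and the sample hash strings are all hypothetical and only serve the illustration; the production schema is not described here.

```python
import sqlite3

# Hypothetical schema: one row per account, holding the hash string of its avatar.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE avatars (account_id INTEGER PRIMARY KEY, hash_value TEXT)"
)
conn.executemany(
    "INSERT INTO avatars (account_id, hash_value) VALUES (?, ?)",
    [
        (1, "0f3c9a7b21de"),
        (2, "0f3c9a7b21de"),  # same hash string as account 1
        (3, "a41b77c0e952"),
        (4, "0f3c9a7b21de"),  # same hash string as account 1
    ],
)
# An index on the hash column keeps grouping and lookups cheap as the table grows.
conn.execute("CREATE INDEX idx_avatars_hash ON avatars (hash_value)")


def hashes_shared_by_accounts(connection, min_accounts=2):
    """Return (hash_value, account_count) for hashes used by several accounts."""
    rows = connection.execute(
        "SELECT hash_value, COUNT(*) FROM avatars "
        "GROUP BY hash_value HAVING COUNT(*) >= ? "
        "ORDER BY COUNT(*) DESC",
        (min_accounts,),
    )
    return rows.fetchall()


suspicious = hashes_shared_by_accounts(conn)
print(suspicious)
```

Because the comparison is plain string equality, nothing has to be held in application memory; the database does the grouping.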

![production structure](https://github.com/raynardj/silse/blob/master/img/production.png?raw=true)

The main point of this technique is to overcome the **variations** that malicious users create on purpose, using an unlabeled dataset.

We can choose which variation modes to train with, so the similarity detection can tolerate color, shift, shear, and rescale changes while still pinpointing nothing but the similar images.

### Data input
$\large X_{\alpha}$ is the original input and $\large X_{\beta}$ is a slightly transformed copy of it. The variations are generated manually from the original input.

The following list of variations is not meant to be definitive or exhaustive:

* Rotation (between -10 and 10 degrees)
* Shift in height
* Shift in width
* Rescaling of each element (in every channel) by up to 10% in either direction

Both images are rank-3 tensors (height, width, channel):

$\large X_{\alpha}, X_{\beta} \in \mathbb{R}^{108 \times 108 \times 3}$
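The pair generation can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the function name `make_variation`, the maximum shift of 5 pixels, and the wrap-around behaviour of `np.roll` are choices made for brevity, and rotation is left out because it would need an extra image library.

```python
import numpy as np


def make_variation(image, rng, max_shift=5, max_rescale=0.10):
    """Return a slightly transformed copy of an (H, W, C) float image in [0, 1]."""
    # Shift in height and width by a random number of pixels.
    # np.roll wraps pixels around the border, which is an assumption for brevity.
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    shifted = np.roll(image, shift=(int(dy), int(dx)), axis=(0, 1))
    # Rescale every element in every channel by a factor in [0.9, 1.1].
    factors = rng.uniform(1.0 - max_rescale, 1.0 + max_rescale, size=image.shape)
    return np.clip(shifted * factors, 0.0, 1.0)


rng = np.random.default_rng(0)
x_alpha = rng.uniform(0.0, 1.0, size=(108, 108, 3))  # stand-in for a real avatar
x_beta = make_variation(x_alpha, rng)                # its slightly transformed twin
print(x_alpha.shape, x_beta.shape)
```

Pairs produced this way need no human labels: the generator knows by construction that $\large X_{\beta}$ came from $\large X_{\alpha}$.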

### Hashing the image
We train a convolutional neural network model $\large f(X)$ and apply it to both images:

$\large W_{\alpha}=f(X_{\alpha})$

$\large W_{\beta}=f(X_{\beta})$

This gives us $\large W_{\alpha}$ and $\large W_{\beta}$.

Each $\large W = (w_{1}, w_{2}, \ldots, w_{48})$ with $\large w_{i} \in (0,1)$ is a vector of length 48. During inference it is transformed into a hexadecimal string of length 12, like "8d04a2e4068", which I call the "twin value".
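The conversion itself is not spelled out here, so the following is a sketch under assumptions: each $\large w_{i}$ is thresholded at 0.5 to give one bit, and the 48 bits are packed most-significant-bit first into 12 hexadecimal characters (4 bits per character). The function name `to_twin_value` and the threshold are hypothetical.

```python
def to_twin_value(w, threshold=0.5):
    """Binarize 48 values in (0, 1) and pack them into a 12-character hex string."""
    if len(w) != 48:
        raise ValueError("expected a vector of length 48")
    # Assumption: a value above the threshold becomes bit 1, otherwise bit 0.
    bits = "".join("1" if value > threshold else "0" for value in w)
    # 48 bits -> integer -> zero-padded hexadecimal, 4 bits per character.
    return format(int(bits, 2), "012x")


# Bit pattern 1000 repeated 12 times, so every hex character is 8.
w_example = [0.9, 0.1, 0.2, 0.1] * 12
twin = to_twin_value(w_example)
print(twin)  # 888888888888
```

Zero-padding matters: without it, vectors whose leading bits are all 0 would yield shorter strings and break fixed-width comparisons in the database.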

### Loss function
If $\large X_{\alpha}$ and $\large X_{\beta}$ are similar images, $\large W_{\alpha}$ and $\large W_{\beta}$ should have look-alike distributions. Otherwise, the distributions should be as different as possible.

![train structure](https://github.com/raynardj/silse/blob/master/img/training.png?raw=true)

Here we define a loss function manually:

* $\large L_{mae}(W_{\alpha},W_{\beta})$ is the mean absolute error between $\large W_{\alpha}$ and $\large W_{\beta}$.

* $\large L_{sim}=s(1-\log(L_{mae}(W_{\alpha},W_{\beta})))+(s-1)(-L_{mae}(W_{\alpha},W_{\beta}))$

$s$ is the input indicating whether $\large X_{\alpha}$ and $\large X_{\beta}$ look alike:

* $s=0$ means $\large X_{\alpha}$ and $\large X_{\beta}$ look alike, so the loss reduces to $\large L_{mae}$ and pulls the two outputs together;
* $s=1$ means $\large X_{\alpha}$ and $\large X_{\beta}$ do not look alike, so the loss reduces to $\large 1-\log(L_{mae})$ and pushes the two outputs apart.

The rest is good old Adam optimization, with the labels purposely set to all zeros.
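To make the behaviour of $\large L_{sim}$ concrete, here is a plain-Python sketch of the two formulas above. The small `eps` inside the logarithm is an added assumption to keep the value finite when the two outputs coincide; it is not part of the formula as stated.

```python
import math


def l_mae(w_alpha, w_beta):
    """Mean absolute error between two equally long vectors."""
    return sum(abs(a - b) for a, b in zip(w_alpha, w_beta)) / len(w_alpha)


def l_sim(w_alpha, w_beta, s, eps=1e-7):
    """L_sim = s * (1 - log(L_mae)) + (s - 1) * (-L_mae); s=0 alike, s=1 not alike."""
    mae = l_mae(w_alpha, w_beta)
    # eps is an assumption added here so that log(0) never occurs.
    return s * (1.0 - math.log(mae + eps)) + (s - 1.0) * (-mae)


w_a = [0.2] * 48
w_near = [0.25] * 48  # output close to w_a
w_far = [0.7] * 48    # output far from w_a

# Similar pair (s=0): the loss is the MAE itself, so closer outputs score lower.
print(l_sim(w_a, w_near, s=0), l_sim(w_a, w_far, s=0))
# Dissimilar pair (s=1): the loss is 1 - log(MAE), so farther outputs score lower.
print(l_sim(w_a, w_near, s=1), l_sim(w_a, w_far, s=1))
```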

### The structure of  $\large f(X)$
$\large f(X)$ takes 108×108 inputs with RGB color channels.

The preprocessing function normalizes each picture so that every element lies in $(-1,1)$, i.e. $\large X_{\alpha ijc}\in (-1,1)$ and $\large X_{\beta ijc}\in (-1,1)$.

For down-sampling, we set the convolution stride to (2,2) on conv2d_3, conv2d_6, and conv2d_9, so as to lose as little information as possible.

Pooling layers remain to be tried for comparison.
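The effect of the three strided layers on the spatial size can be checked with a few lines of arithmetic. The sketch assumes "same" padding, where the output size is the input size divided by the stride and rounded up; that assumption is consistent with the shapes in the layer summary of this section.

```python
import math


def same_padding_output(size, stride):
    """Spatial output size of a convolution with 'same' padding (assumed here)."""
    return math.ceil(size / stride)


size = 108
sizes = [size]
for layer_name in ("conv2d_3", "conv2d_6", "conv2d_9"):
    size = same_padding_output(size, stride=2)
    sizes.append(size)

print(sizes)  # [108, 54, 27, 14]

# The last feature map has 128 channels, so flattening it gives:
flattened = size * size * 128
print(flattened)  # 25088
```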

![model structure](https://github.com/raynardj/silse/blob/master/img/structure.png?raw=true)

The outcome was surprisingly better than other hashing techniques, so the following structure is not even the result of systematic hyperparameter tuning or fine-tuning.

I suspect that upsampling (deconvolution) first and then downsampling would make the model more capable of dealing with height/width/resize variations.

| Layer (type) | Output Shape | Param # |
| --- | --- | --- |
| universal_input (InputLayer) | (None, 108, 108, 3) | 0 |
| color_preprocessing (Lambda) | (None, 108, 108, 3) | 0 |
| conv2d_1 (Conv2D) | (None, 108, 108, 64) | 1792 |
| conv2d_2 (Conv2D) | (None, 108, 108, 64) | 36928 |
| conv2d_3 (Conv2D) | (None, 54, 54, 64) | 36928 |
| batch_normalization_1 (BatchNormalization) | (None, 54, 54, 64) | 256 |
| conv2d_4 (Conv2D) | (None, 54, 54, 128) | 73856 |
| conv2d_5 (Conv2D) | (None, 54, 54, 128) | 147584 |
| conv2d_6 (Conv2D) | (None, 27, 27, 128) | 147584 |
| batch_normalization_2 (BatchNormalization) | (None, 27, 27, 128) | 512 |
| conv2d_7 (Conv2D) | (None, 27, 27, 128) | 147584 |
| conv2d_8 (Conv2D) | (None, 27, 27, 128) | 147584 |
| conv2d_9 (Conv2D) | (None, 14, 14, 128) | 147584 |
| batch_normalization_3 (BatchNormalization) | (None, 14, 14, 128) | 512 |
| flatten_layer (Flatten) | (None, 25088) | 0 |
| fc2_160 (Dense) | (None, 48) | 1204272 |

Total params: 2,092,976

Trainable params: 2,092,336

Non-trainable params: 640
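
The parameter counts in the summary can be reproduced by hand. The sketch below assumes 3×3 kernels with a bias term for every convolution, and the usual four parameters per channel for batch normalization (two trainable, two non-trainable moving statistics). The kernel size is not stated anywhere in this document; it is inferred because it reproduces the listed counts.

```python
def conv2d_params(in_channels, out_channels, kernel=3):
    """Weights plus biases of a Conv2D layer (3x3 kernel is an assumption)."""
    return kernel * kernel * in_channels * out_channels + out_channels


def dense_params(inputs, units):
    """Weights plus biases of a Dense layer."""
    return inputs * units + units


# (input channels, output channels) for conv2d_1 ... conv2d_9.
conv_channels = [
    (3, 64), (64, 64), (64, 64),
    (64, 128), (128, 128), (128, 128),
    (128, 128), (128, 128), (128, 128),
]
bn_channels = [64, 128, 128]  # batch_normalization_1 ... batch_normalization_3

conv_total = sum(conv2d_params(i, o) for i, o in conv_channels)
bn_total = sum(4 * c for c in bn_channels)       # gamma, beta, moving mean, moving variance
non_trainable = sum(2 * c for c in bn_channels)  # moving mean and moving variance only
dense_total = dense_params(14 * 14 * 128, 48)

total = conv_total + bn_total + dense_total
print(total, total - non_trainable, non_trainable)  # 2092976 2092336 640
```

More than half of the parameters sit in the final dense layer, which maps the 25088 flattened features to the 48 outputs.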


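### Abstract

#### Challenge

Online spam accounts usually use similar avatars (mass-produced by machines). Simple hashing methods such as md5sum, as well as other fuzzier hashing schemes, are easily evaded by spammers.

Some similarity algorithms compare the feature vectors that a model outputs for each image. Such methods generally need all of the vectors loaded into memory, which becomes an engineering headache once the project grows.

#### Solution

Here we provide a neural network model that hashes an image into a fixed-format string. Storing and engineering the similarity features therefore becomes simple and flexible, and the approach extends easily to much larger data, because plain SQL operations are all we need to process the strings in a database.

The highlight of this technique is that it overcomes the **image variations** deliberately created by malicious users, using only an unlabeled dataset.

We can switch between variation modes, so similar-image detection can cope with changes in color, shift, shear, and brightness while still picking out only the similar images.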

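### Data input

$\large X_{\alpha}$ is the original image input and $\large X_{\beta}$ is a slightly varied image. The variations are created manually from the original image.

The following variations can be applied; the list is not meant to cover every possibility.

* Rotation
* Vertical shift
* Horizontal shift
* Magnitude of the values in each color channel

$\large X_{\alpha}, X_{\beta} \in \mathbb{R}^{108 \times 108 \times 3}$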

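### Hashing the image
We need to train a convolutional neural network model $\large f(X)$

$\large W_{\alpha}=f(X_{\alpha})$

$\large W_{\beta}=f(X_{\beta})$

This gives us $\large W_{\alpha}$ and $\large W_{\beta}$.

Each $\large W = (w_{1}, w_{2}, \ldots, w_{48})$ with $\large w_{i} \in (0,1)$ is a vector of length 48. In production, this vector is converted into a hexadecimal string of length 12, such as "8d04a2e4068", which we call the twin value.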

### Loss function
If $\large X_{\alpha}$ and $\large X_{\beta}$ are similar images, $\large W_{\alpha}$ and $\large W_{\beta}$ should have similar distributions; otherwise the distributions should differ as much as possible.

Here we design a loss function:

* $\large L_{mae}(W_{\alpha},W_{\beta})$ is the mean absolute error between $\large W_{\alpha}$ and $\large W_{\beta}$.

* $\large L_{sim}=s(1-\log(L_{mae}(W_{\alpha},W_{\beta})))+(s-1)(-L_{mae}(W_{\alpha},W_{\beta}))$

$s$ is the input indicating whether $\large X_{\alpha}$ and $\large X_{\beta}$ are similar:

* $s=0$ means $\large X_{\alpha}$ and $\large X_{\beta}$ look alike;
* $s=1$ means $\large X_{\alpha}$ and $\large X_{\beta}$ do not look alike;

After that, we optimize with the Adam optimizer, with the y labels set to all zeros.

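### The structure of $\large f(X)$
$\large f(X)$ uses an input size of 108×108 with red, green, and blue color channels.

Preprocessing normalizes every image so that $\large X_{\alpha ijc}\in (-1,1)$ and $\large X_{\beta ijc}\in (-1,1)$.

For down-sampling, we set the convolution stride of conv2d_3, conv2d_6, and conv2d_9 to (2,2), which loses as little information as possible.

Pooling can be tested further to compare the results.

Because the results were unexpectedly good and quickly outperformed other hashing techniques, the model structure shown in the layer summary did not go through any hyperparameter search or tuning.

I think upsampling (deconvolution) first and then downsampling would help the model absorb variations such as shift, size, and rotation; this can be tested further.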


