# Learning Deep Structured Semantic Models for Web Search using Clickthrough Data

    Po-Sen Huang
    Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, Larry Heck
    CIKM’13, Oct. 27 2013

https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2013_DSSM_fullversion.pdf

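## Summary
A DNN maps high-dimensional sparse text features into a low-dimensional dense semantic space.

- The input layer x is a bag-of-words term vector (one-hot encoding of words).
- Word hashing reduces the dimensionality of the bag-of-words term vectors.
- Activation function: tanh.
- Cosine similarity gives the relevance score R(Q,D) between a document D and a query Q.
- A softmax over the relevance scores gives the posterior probability of a document given the query, which training maximizes for clicked documents.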

## INTRODUCTION
Latent semantic models such as `LSA` (latent semantic analysis) address the language discrepancy between Web documents and search queries by grouping different terms that occur in a similar context into the same semantic cluster. Thus, a query and a document, represented as two vectors in the lower-dimensional semantic space, can still have a high similarity score even if they do not share any term.

`PLSA`/`LDA`: These models are often trained in an unsupervised manner using an objective function that is only loosely coupled with the evaluation metric for the retrieval task. Thus the performance of these models on Web search tasks is not as good as originally expected.

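Both `BLTM` (Bi-Lingual Topic Model) and `DPM` (Discriminative Projection Model), which are trained on clickthrough data, significantly outperform the unsupervised latent semantic models, including LSA and PLSA, in the document ranking task.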

`deep auto-encoders`: Salakhutdinov and Hinton demonstrated that hierarchical semantic structure embedded in the query and the document can be extracted via deep learning.

## RELATED WORK
### Latent Semantic Models and the Use of Clickthrough Data
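- With SVD, a document D can be mapped to a low-dimensional vector.
- The same is done for the query Q.
- Similarity: relevance is then measured by comparing the two vectors, i.e. the distance between them.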
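
A minimal worked form of the steps above, assuming the usual LSA notation. The symbols $C$, $A$, $\hat{Q}$ and $\hat{D}$ are not defined in these notes and are introduced here only for illustration: a document-term matrix $C$ is factorized by SVD, the top-k singular vectors form a projection matrix $A$, and the term vectors of the query and the document are projected and compared (by cosine similarity in this sketch):

$\hat{D}=A^TD,\quad \hat{Q}=A^TQ$

$sim_A(Q,D)=\frac{\hat{Q}^T\hat{D}}{\Vert \hat{Q} \Vert\Vert \hat{D} \Vert}$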

### Deep Learning
Salakhutdinov and Hinton proposed a `semantic hashing` (SH) method which uses bottleneck features learned from the deep auto-encoder for information retrieval.

1. A stack of generative models (i.e., restricted Boltzmann machines) is learned to map, layer by layer, a term vector representation of a document to a low-dimensional semantic concept vector.
1. The model parameters are fine-tuned so as to minimize the cross entropy error between the original term vector of the document and the reconstructed term vector.

The intermediate layer activations are used as features (i.e., bottleneck) for document ranking.

SH suffers from two problems:

1. The model parameters are optimized for the reconstruction of the document term vectors rather than for differentiating the relevant documents from the irrelevant ones for a given query.
1. In order to make the computational cost manageable, the term vectors of documents consist of only the 2000 most frequent words.

## DEEP STRUCTURED SEMANTIC MODELS FOR WEB SEARCH
### DNN for Computing Semantic Features
![1](http://ou8qjsj0m.bkt.clouddn.com//17-8-6/33010786.jpg)

- $x$: the input term vector
- $y$: the output vector
- $l_i, i=1,...,N-1$: the intermediate hidden layers
- $W_i$: the i-th weight matrix
- $b_i$: the i-th bias term

$l_1=W_1x$

$l_i=f(W_il_{i-1}+b_i),i=2,...,N-1$

$y=f(W_Nl_{N-1}+b_N)$

We use tanh as the activation function at the output layer and the hidden layers $l_i,i=2,...,N-1$:

$f(x)=\frac{1-e^{-2x}}{1+e^{-2x}}$
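
The layer equations above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the helper names `matvec` and `forward`, the list-of-lists weight layout, and the toy weights are all assumptions made for the example. The first layer is linear ($l_1=W_1x$) and every later layer, including the output, applies tanh.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def forward(x, weights, biases):
    """Map a term vector x to the output vector y.

    weights = [W_1, ..., W_N]; biases = [b_2, ..., b_N] (assumed layout,
    since l_1 = W_1 x has no bias term in the notes).
    """
    layer = matvec(weights[0], x)              # l_1 = W_1 x
    for W, b in zip(weights[1:], biases):      # l_i = f(W_i l_{i-1} + b_i), last one is y
        pre = matvec(W, layer)
        layer = [math.tanh(p + b_j) for p, b_j in zip(pre, b)]
    return layer

# Toy example with made-up weights (hypothetical values, for illustration only).
x = [1.0, 0.0, 2.0]
W1 = [[0.5, 0.0, 0.25], [0.0, 1.0, -0.5]]      # l_1 = [1.0, -1.0]
W2 = [[1.0, 1.0]]
b2 = [0.0]
print(forward(x, [W1, W2], [b2]))              # [0.0]
```

`math.tanh` computes the same function as the explicit formula $\frac{1-e^{-2x}}{1+e^{-2x}}$.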

The semantic relevance score between a query Q and a document D is then measured as:

$R(Q,D)=\text{cosine}(y_Q,y_D)=\frac{y_Q^T y_D}{\Vert y_Q \Vert\Vert y_D \Vert}$
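
As a small sketch of the relevance score, the function below computes the cosine of two output vectors given as Python lists. The name `cosine` and the list representation are assumptions for illustration, and the sketch assumes neither vector is all zeros.

```python
import math

def cosine(y_q, y_d):
    """Cosine similarity R(Q, D) between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(y_q, y_d))
    norm_q = math.sqrt(sum(a * a for a in y_q))
    norm_d = math.sqrt(sum(b * b for b in y_d))
    return dot / (norm_q * norm_d)

print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal vectors)
```

Because the score depends only on the angle between the vectors, scaling either vector leaves R(Q,D) unchanged.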

### WordHashing
Given a word (e.g. good), we first add word starting and ending marks to the word (e.g. #good#). Then, we break the word into letter n-grams (e.g. letter trigrams: #go, goo, ood, od#). Finally, the word is represented using a vector of letter n-grams.

Compared with the original size of the one-hot vector, word hashing allows us to represent a query or a document using a vector with much lower dimensionality.
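
The procedure described above can be sketched as follows. The function names `letter_ngrams` and `hash_text`, the lowercasing, and the whitespace tokenization are assumptions made for this example rather than details taken from the paper.

```python
from collections import Counter

def letter_ngrams(word, n=3):
    """Return the letter n-grams of a word wrapped in '#' boundary marks."""
    marked = f"#{word}#"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

def hash_text(text, n=3):
    """Count letter n-grams over all whitespace-separated words in a text.

    The resulting counts play the role of the word-hashed term vector:
    one dimension per distinct letter n-gram instead of one per word.
    """
    counts = Counter()
    for word in text.lower().split():
        counts.update(letter_ngrams(word, n))
    return counts

print(letter_ngrams("good"))  # ['#go', 'goo', 'ood', 'od#']
```

The dimensionality drops because the number of distinct letter trigrams is far smaller than the number of distinct words, and unseen words still map onto known n-grams.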

### Learning the DSSM
We compute the posterior probability of a document given a query from the semantic relevance score between them through a softmax function:

$P(D|Q)=\frac{\exp(\gamma R(Q,D))}{\sum_{D' \in \mathbf{D}} \exp(\gamma R(Q,D'))}$

- $\gamma$ is a smoothing factor in the softmax function
- $\mathbf{D}$ denotes the set of candidate documents to be ranked
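
A minimal sketch of this softmax, assuming the relevance scores of all candidate documents for one query are already available as a list. The function name `posterior` and the default `gamma=10.0` are illustrative assumptions, not values from the notes. Subtracting the maximum score before exponentiating is a standard numerical-stability step that does not change the result.

```python
import math

def posterior(scores, gamma=10.0):
    """Turn relevance scores R(Q, D') into posteriors P(D'|Q) via softmax.

    scores: one relevance score per candidate document for a single query.
    gamma:  smoothing factor (the default here is an arbitrary choice).
    """
    top = max(scores)
    exps = [math.exp(gamma * (s - top)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = posterior([0.9, 0.2, 0.1])
print(probs)  # the first candidate receives most of the probability mass
```

A larger `gamma` sharpens the distribution toward the highest-scoring document, while a smaller one flattens it.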

In training, the model parameters are estimated to maximize the likelihood of the clicked documents given the queries across the training set. Equivalently, we need to minimize the following loss function:

$L(\Lambda)=-\log \prod_{(Q,D^+)} P(D^+ | Q)$

- $D^+$ is the clicked document
- $\Lambda$ denotes the parameter set of the neural networks $\{W_i, b_i\}$
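
Since the log of a product is a sum of logs, the loss can be evaluated as $-\sum_{(Q,D^+)} \log P(D^+|Q)$. The sketch below computes it from precomputed relevance scores. The name `dssm_loss`, the `(scores, clicked_index)` batch layout, and the default `gamma` are assumptions for illustration, and no gradient computation is shown.

```python
import math

def dssm_loss(batch, gamma=10.0):
    """Negative log-likelihood of the clicked documents.

    batch: list of (scores, clicked_index) pairs, one per query, where
           scores holds R(Q, D') for every candidate document and
           clicked_index points at the clicked document D+.
    """
    loss = 0.0
    for scores, clicked in batch:
        top = max(scores)
        # log of the softmax denominator, computed stably
        log_z = gamma * top + math.log(
            sum(math.exp(gamma * (s - top)) for s in scores)
        )
        # log P(D+|Q) = gamma * R(Q, D+) - log_z
        loss -= gamma * scores[clicked] - log_z
    return loss

# Two equally scored candidates: P(D+|Q) = 0.5, so the loss is log 2.
print(dssm_loss([([0.5, 0.5], 0)]))
```

The loss is zero only when each clicked document receives all of the probability mass, and it grows as unclicked candidates score closer to or above the clicked one.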