# A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval

    Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, Grégoire Mesnil
    CIKM’14, November 03 2014

https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2014_cdssm_final.pdf

## Summary
CLSM builds on DSSM by adding a convolutional layer and a max-pooling layer.

## EXTRACTING CONTEXTUAL FEATURES FOR IR USING CLSM
### The CLSM Architecture
1. a word-n-gram layer obtained by running a contextual sliding window over the input word sequence (i.e., a query or a document),
1. a letter-trigram layer that transforms each word-trigram into a letter-trigram representation vector,
1. a convolutional layer that extracts contextual features for each word with its neighboring words defined by a window, e.g., a word-n-gram,
1. a max-pooling layer that discovers and combines salient word-n-gram features to form a fixed-length sentence-level feature vector,
1. a semantic layer that extracts a high-level semantic feature vector for the input word sequence.

![1](http://ou8qjsj0m.bkt.clouddn.com//17-8-6/1347804.jpg)

### Letter-trigram based Word-n-gram Representation
![2](http://ou8qjsj0m.bkt.clouddn.com//17-8-6/39329482.jpg)

In Figure 1, the letter-trigram matrix $W_f$ denotes the transformation from a word to its letter-trigram count vector, which requires no learning.

For the $t$-th word-n-gram at the word-n-gram layer, we have:

$l_t=[f_{t-d}^T,...,f_t^T,...,f_{t+d}^T]^T, t=1,...,T\ (1)$

- $f_t$ is the letter-trigram representation of the $t$-th word
- $n=2d+1$ is the size of the contextual window
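
Equation (1) can be illustrated with a minimal NumPy sketch. The helper names, the `#` word-boundary marks, the toy trigram vocabulary, and the zero-vector padding at the sequence edges are assumptions made for illustration and are not taken from these notes.

```python
import numpy as np

def letter_trigrams(word):
    """Split a word into letter trigrams after adding '#' boundary marks (assumed)."""
    padded = f"#{word}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def word_vector(word, vocab):
    """Letter-trigram count vector f_t for one word; vocab maps trigram -> index."""
    f = np.zeros(len(vocab))
    for tri in letter_trigrams(word):
        if tri in vocab:          # trigrams outside the vocabulary are ignored
            f[vocab[tri]] += 1.0
    return f

def word_ngram_vectors(words, vocab, d=1):
    """Build l_t = [f_{t-d}; ...; f_t; ...; f_{t+d}] for every position t.

    Assumption: positions outside the sequence are filled with zero vectors.
    """
    zero = np.zeros(len(vocab))
    f = [zero] * d + [word_vector(w, vocab) for w in words] + [zero] * d
    n = 2 * d + 1                 # size of the contextual window
    return np.stack([np.concatenate(f[t:t + n]) for t in range(len(words))])
```

With a vocabulary of $F$ letter trigrams, each row $l_t$ has $n \cdot F$ entries, and no parameter in this step is learned.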

### Modeling Word-n-gram-Level Contextual Features at the Convolutional Layer
The convolution operation can be viewed as sliding window based feature extraction.

$h_t=\tanh(W_c \cdot l_t),\quad t=1,\dots,T$

- $W_c$ is the feature transformation matrix, also known as the convolution matrix, which is shared among all word-n-grams.
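
Because $W_c$ is shared across positions, the whole layer reduces to one matrix product. The sketch below assumes the word-n-gram vectors $l_t$ are stacked as rows of a matrix `L`; the function name and the array layout are illustrative choices.

```python
import numpy as np

def convolution_layer(L, W_c):
    """Compute h_t = tanh(W_c @ l_t) for every position t at once.

    L   : (T, n*F) matrix whose rows are the word-n-gram vectors l_t
    W_c : (K, n*F) convolution matrix shared by all positions
    Returns H of shape (T, K); row t is the local feature vector h_t.
    """
    return np.tanh(L @ W_c.T)
```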

![3](http://ou8qjsj0m.bkt.clouddn.com//17-8-6/874497.jpg)

### Modeling Sentence-Level Semantic Features Using Max Pooling
`Max pooling`: These local features need to be aggregated to obtain a sentence-level feature vector with a fixed size independent of the length of the input word sequence. Since many words do not have significant influence on the semantics of the sentence, we want to suppress the non-significant local features and retain in the global feature vector only the salient features that are useful for IR.

$v(i)=\max_{t=1,\dots,T}\{h_t(i)\},\quad i=1,\dots,K$

- $v(i)$ is the $i$-th element of the max pooling layer $v$,
- $h_t(i)$ is the $i$-th element of the $t$-th local feature vector,
- $K$ is the dimensionality of the max pooling layer, which is the same as the dimensionality of the local contextual feature vectors $\{h_t\}$.
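
A minimal sketch of the max-pooling step, assuming the local feature vectors $h_t$ are stacked as rows of a matrix `H`. Returning the winning positions as well is an illustrative addition that makes it possible to trace which word-n-gram produced each pooled value.

```python
import numpy as np

def max_pooling(H):
    """Element-wise max over positions: v(i) = max_t h_t(i).

    H : (T, K) matrix whose rows are the local feature vectors h_t
    Returns (v, winners): v has shape (K,), and winners[i] is the position t
    whose local feature supplied v(i).
    """
    v = H.max(axis=0)
    winners = H.argmax(axis=0)
    return v, winners
```

The output length is $K$ regardless of $T$, which is what makes the sentence-level vector fixed-size.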

![4](http://ou8qjsj0m.bkt.clouddn.com//17-8-6/7996358.jpg)

For each sentence, we examine the five most active neurons at the max-pooling layer, measured by $v(i)$, and highlight in bold the words that win at these five neurons in the max operation (i.e., the words whose local features give these five highest neuron activation values). These examples show that the important concepts, as represented by these key words, make the most significant contribution to the overall semantic meaning of the sentence.

### Latent Semantic Vector Representations
After the sentence-level feature vector is produced by the max-pooling operation, one more non-linear transformation layer is applied to extract the high-level semantic representation, denoted by $y$.

$y=\tanh(W_s \cdot v)$

- $v$ is the global feature vector after max pooling,
- $W_s$ is the semantic projection matrix,
- $y$ is the vector representation of the input query (or document) in the latent semantic space, with a dimensionality of $L$.
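
Putting the three learned-or-pooled steps together, a rough end-to-end sketch of the encoder looks as follows. The function name `encode` and the row-stacked input layout are assumptions, and bias terms are omitted because the formulas in these notes do not include them.

```python
import numpy as np

def encode(L, W_c, W_s):
    """Map a sequence of word-n-gram vectors to its semantic vector y.

    L   : (T, n*F) rows are the word-n-gram vectors l_t
    W_c : (K, n*F) convolution matrix
    W_s : (L_dim, K) semantic projection matrix
    """
    H = np.tanh(L @ W_c.T)    # convolutional layer, shape (T, K)
    v = H.max(axis=0)         # max pooling, shape (K,)
    return np.tanh(W_s @ v)   # semantic layer, shape (L_dim,)
```

Inputs of different lengths $T$ yield vectors of the same dimensionality, so queries and documents can be compared directly.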

### Using the CLSM for IR
Given a query and a set of documents to be ranked, we first compute the semantic vector representations for the query and all the documents using the CLSM as described above.

We compute the relevance score between the query and each document by measuring the cosine similarity between their semantic vectors.

In Web search, given the query, the documents are ranked by their semantic relevance scores.
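
The ranking step can be sketched as below. The function names are illustrative, and the code assumes that no semantic vector is all zeros, since the cosine would then be undefined.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_documents(y_q, doc_vectors):
    """Score each document against the query and sort by descending relevance.

    Returns (order, scores): order lists document indices from most to least
    relevant, and scores[i] is the cosine similarity for document i.
    """
    scores = [cosine(y_q, y_d) for y_d in doc_vectors]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores
```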

## Learning the CLSM for IR
We first convert the semantic relevance score between a query and a positive document to the posterior probability of that document given the query through softmax:

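$P(D^+|Q)=\frac{\exp(\gamma R(Q,D^+))}{\sum_{D' \in \mathbf{D}} \exp(\gamma R(Q,D'))}$

- $R(Q,D)$ is the semantic relevance score (cosine similarity) between query $Q$ and document $D$,
- $\gamma$ is a smoothing factor in the softmax function,
- $\mathbf{D}$ is the set of candidate documents to be ranked, which includes $D^+$.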
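
A small sketch of the softmax posterior over one query's candidates. The default value of `gamma` is a placeholder rather than a value taken from these notes, and subtracting the maximum before exponentiating is a standard numerical-stability step that does not change the result.

```python
import numpy as np

def posterior(scores, gamma=10.0):
    """Softmax over scaled relevance scores R(Q, D') for one query.

    scores : relevance scores for every candidate document of the query
    gamma  : smoothing factor (the default here is an arbitrary placeholder)
    Returns the probabilities P(D'|Q), which sum to 1.
    """
    z = gamma * np.asarray(scores, dtype=float)
    z = z - z.max()           # stability shift; cancels in the ratio
    p = np.exp(z)
    return p / p.sum()
```

A larger `gamma` sharpens the distribution, which matters because cosine scores are confined to $[-1, 1]$.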

In training, the model parameters $\Lambda$ are learned to maximize the likelihood of the clicked documents given the queries across the training set. That is, we minimize the following loss function:

$L(\Lambda)=-\log \prod_{(Q,D^+)} P(D^+|Q)$
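
Since the log of a product is a sum of logs, the loss equals $-\sum_{(Q,D^+)} \log P(D^+|Q)$, which is how it would typically be computed. The sketch below assumes a hypothetical layout in which each row holds the scores for one query with the clicked document in column 0; the `gamma` default is again a placeholder.

```python
import numpy as np

def clsm_loss(score_rows, gamma=10.0):
    """Negative log-likelihood of the clicked documents.

    score_rows : (N, C) array; row i holds R(Q_i, D') for the C candidates of
                 query i, with the clicked document D+ in column 0 (assumed layout)
    """
    z = gamma * np.asarray(score_rows, dtype=float)
    m = z.max(axis=1, keepdims=True)
    # log of the softmax denominator, computed with the log-sum-exp trick
    log_norm = m[:, 0] + np.log(np.exp(z - m).sum(axis=1))
    log_p_pos = z[:, 0] - log_norm    # log P(D+|Q) for each query
    return float(-log_p_pos.sum())    # -log of the product over all pairs
```

Working in the log domain avoids underflow when the product runs over many training pairs.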