# Information Retrieval

## Outline:  
1. Ranking 
1. Siamese Networks
1. Triplet Loss
1. KNN
1. LSH

## Readings
1. MAP https://towardsdatascience.com/breaking-down-mean-average-precision-map-ae462f623a52
1. Hierarchical Navigable Small Worlds https://arxiv.org/abs/1603.09320

## 1 Ranking 

Given a query, return a list of documents from a database, sorted in descending order of relevance. 

Pointwise approach: *relevance = h(query, document)*, $relevance \in \mathbb{R}$  
Pairwise approach: *relevance = h(query, document)*, $relevance \in \{0,1\}$  

<img src=images/if.png style="height:300px">
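
In the pointwise setting, ranking reduces to scoring every document against the query and sorting by that score. The sketch below illustrates this under a strong simplification: `word_overlap` is a hypothetical toy relevance function standing in for a learned *h(query, document)*, and `rank_documents` is an illustrative name, not part of any library.

```python
def word_overlap(query, document):
    """Toy relevance score (an assumption for this sketch):
    the fraction of query words that also occur in the document."""
    query_words = set(query.lower().split())
    document_words = set(document.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & document_words) / len(query_words)


def rank_documents(query, documents, score=word_overlap):
    """Return the documents sorted by descending relevance to the query."""
    return sorted(documents, key=lambda doc: score(query, doc), reverse=True)


documents = [
    "cats and dogs",
    "deep learning for ranking",
    "learning to rank documents",
]
ranked = rank_documents("learning to rank", documents)
print(ranked[0])  # learning to rank documents
```

Any model that outputs a real-valued score can be passed in place of `word_overlap`.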


Evaluation:
For a given query, predict a ranked list of documents of fixed size and measure how well it is ordered with respect to the true relevance.  

**Mean Average Precision**

Supports only binary relevance.  

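$$ AP_n = \frac 1 {GTP} \sum_{i=1}^n \frac {TP_i} i \cdot rel_i $$
$$ MAP_n = \frac 1 Q \sum_{q=1}^Q AP_n(q) $$

where  
$GTP$ - total number of ground-truth positives for the query  
$TP_i$ - number of true positives up to the i-th position  
$rel_i$ - 1 if the document at the i-th position is relevant, 0 otherwise  
$AP_n(q)$ - $AP_n$ computed for the q-th query  
$Q$ - number of queries   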
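
A small worked example may help here. The sketch below computes AP and MAP from binary relevance flags, counting precision only at the positions that hold a relevant document (the usual definition of average precision). The function names are illustrative, and by default $GTP$ is taken to be the number of relevant items in the supplied list, which is an assumption of this sketch.

```python
def average_precision(relevance, n=None, gtp=None):
    """AP@n for one query.

    relevance: 0/1 flags of the ranked results, best-ranked first.
    gtp: total number of relevant documents for the query; defaults to the
         number of relevant items in `relevance` (assumption of this sketch).
    """
    ranked = relevance if n is None else relevance[:n]
    if gtp is None:
        gtp = sum(relevance)
    if gtp == 0:
        return 0.0

    true_positives = 0
    total = 0.0
    for i, rel in enumerate(ranked, start=1):
        if rel:
            true_positives += 1
            total += true_positives / i  # precision at position i
    return total / gtp


def mean_average_precision(relevance_per_query, n=None):
    """MAP@n: the mean of AP@n over all queries."""
    scores = [average_precision(rel, n) for rel in relevance_per_query]
    return sum(scores) / len(scores)


# Relevant documents at positions 1 and 3: AP = (1/1 + 2/3) / 2
print(round(average_precision([1, 0, 1, 0]), 4))  # 0.8333
```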

**Normalized Discounted Cumulative Gain**

Supports multilevel relevance.  

$$ DCG_n = \sum_{i=1}^n \frac {rel_i} {\log_2 (i+1)} $$
$$ NDCG_n = \frac {DCG_n} {IDCG_n}$$

where  
$DCG$ - discounted cumulative gain  
$IDCG$ - ideal discounted cumulative gain (the $DCG_n$ of the best possible ordering)  
$rel_i$ - relevance score of the document at the i-th position in the predicted ranked list  
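
The two formulas can be checked with a few lines of Python. This is a minimal sketch with illustrative function names; it assumes the ideal ordering is obtained by sorting the same relevance scores in descending order, i.e. the predicted list already contains every relevant document.

```python
import math


def dcg(relevances, n=None):
    """Discounted cumulative gain of a ranked list of relevance scores."""
    ranked = relevances if n is None else relevances[:n]
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(ranked, start=1))


def ndcg(relevances, n=None):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal_dcg = dcg(sorted(relevances, reverse=True), n)
    if ideal_dcg == 0:
        return 0.0
    return dcg(relevances, n) / ideal_dcg


print(ndcg([3, 2, 1]))  # 1.0, the list is already perfectly ordered
print(ndcg([1, 2, 3]) < 1.0)  # True, the best document is ranked last
```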

## 2 Siamese Networks

<img src=images/siam1.jpeg style="height:400px">

In [2]:
# Sample from the Quora duplicate question detection dataset
import pandas as pd

df = pd.read_csv('data.csv')
df.head()

| Unnamed: 0 | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 199530 | 301049 | 301050 | As a Canadian student, is it wiser to complete... | How much will it cost to Indian student to stu... | 0 |
| 1 | 387099 | 29541 | 519407 | What is your favorite Indian sweet dish? | What's your favorite Indian dish? Why? | 0 |
| 2 | 337316 | 464776 | 464777 | Is there proof of Jon being Rhaegar and Lyanna... | Where does GRRM imply that Jon Snow is Rhaegar... | 0 |
| 3 | 164415 | 255489 | 255490 | Knowing how Prithviraj's last 3 films were flo... | Which is the Best Comedy scene in Malayalam ci... | 0 |
| 4 | 382707 | 514592 | 514593 | What causes damage to the somatosensory cortex... | What causes damage to the somatosensory cortex... | 1 |


In [5]:
from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json"
weight_file = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"

# Compute two different representations for each token.
# Each representation is a learned weighted combination of the
# 3 layers in ELMo (i.e., the char CNN and the outputs of the two BiLSTM layers).
elmo = Elmo(options_file, weight_file, num_output_representations=2, dropout=0)

02/28/2019 12:48:59 - INFO - allennlp.modules.elmo -   Initializing ELMo


In [102]:
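from sklearn.model_selection import train_test_split
import torch as tt

df_train, df_val = train_test_split(df, test_size=0.2, random_state=42, shuffle=True)

def to_char_ids(questions):
    # batch_to_ids expects a list of tokenized sentences, so split on whitespace first
    return batch_to_ids([str(q).split() for q in questions])

xq1_train = to_char_ids(df_train.question1.values)
xq2_train = to_char_ids(df_train.question2.values)
y_train = tt.from_numpy(df_train.is_duplicate.values).float()

xq1_val = to_char_ids(df_val.question1.values)
xq2_val = to_char_ids(df_val.question2.values)
y_val = tt.from_numpy(df_val.is_duplicate.values).float()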

In [103]:
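from torch.utils.data import DataLoader, TensorDataset

batch_size = 32
# reshuffle the training pairs every epoch; keep the validation order fixed
train_loader = DataLoader(TensorDataset(xq1_train, xq2_train, y_train), batch_size=batch_size, shuffle=True)
val_loader = DataLoader(TensorDataset(xq1_val, xq2_val, y_val), batch_size=batch_size)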


In [104]:
from tqdm.notebook import tqdm

def _train_epoch(model, iterator, optimizer, curr_epoch):
    """Run one training epoch and return the mean of the batch losses."""
    model.train()

    running_loss = 0

    n_batches = len(iterator)
    iterator = tqdm(iterator, total=n_batches, desc='epoch %d' % (curr_epoch), leave=True)

    for i, batch in enumerate(iterator):
        optimizer.zero_grad()

        loss = model(batch)
        loss.backward()
        optimizer.step()

        curr_loss = loss.item()

        # incremental mean: after batch i, running_loss is the average of losses 0..i
        loss_smoothing = i / (i+1)
        running_loss = loss_smoothing * running_loss + (1 - loss_smoothing) * curr_loss

        iterator.set_postfix(loss='%.5f' % running_loss)

    return running_loss

def _test_epoch(model, iterator):
    model.eval()
    epoch_loss = 0

    n_batches = len(iterator)
    with tt.no_grad():
        for batch in iterator:
            loss = model(batch)
            epoch_loss += loss.item()

    return epoch_loss / n_batches


def nn_train(model, train_iterator, valid_iterator, optimizer, n_epochs=100,
          scheduler=None, early_stopping=0):
    """Train the model and return the loss history as a DataFrame.

    Training stops early after `early_stopping` consecutive epochs without
    improvement of the validation loss (0 disables early stopping).
    Note: `scheduler` is accepted but not used.
    """

    best_loss = float('inf')
    es_epochs = 0
    records = []

    for epoch in range(n_epochs):
        train_loss = _train_epoch(model, train_iterator, optimizer, epoch)
        valid_loss = _test_epoch(model, valid_iterator)
        print('validation loss %.5f' % valid_loss)

        # DataFrame.append was removed in pandas 2.0, so collect plain dicts instead
        records.append({'epoch': epoch, 'train_loss': train_loss, 'valid_loss': valid_loss})

        if early_stopping > 0:
            if valid_loss > best_loss:
                es_epochs += 1
            else:
                es_epochs = 0

            if es_epochs >= early_stopping:
                best_epoch = min(records, key=lambda r: r['valid_loss'])
                print('Early stopping! best epoch: %d val %.5f' % (best_epoch['epoch'], best_epoch['valid_loss']))
                break

            best_loss = min(best_loss, valid_loss)

    return pd.DataFrame(records)

In [None]:
from torch import nn, optim

class MyModel(nn.Module):
    
    def __init__(self, elmo, criterion):
        super(MyModel, self).__init__()
        self.elmo = elmo
        self.criterion = criterion
        
        self.fc = nn.Linear(1024*2, 128)
        
        self.out = nn.Linear(128*3, 1)
        
    def branch(self, x):
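        # list of two ELMo representations, each of shape (batch, tokens, 1024)
        x = self.elmo(x)['elmo_representations']
        x = tt.cat(x, dim=-1)  # (batch, tokens, 2048)
        x = x.mean(dim=1)      # average over tokens (padding positions included)
        x = self.fc(x)         # (batch, 128)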
        return x
        
    def forward(self, batch):
        
        q1, q2, y = batch
        
        q1 = self.branch(q1)
        q2 = self.branch(q2)
        
        # symmetric functions of (q1, q2), so the prediction does not depend on the question order
        x = tt.cat([tt.abs(q1-q2), q1*q2, q1+q2], dim=-1)
        
        x = self.out(x).squeeze(1)
        loss = self.criterion(x,y)
        
        return loss



model = MyModel(elmo, nn.BCEWithLogitsLoss())

optimizer = optim.Adam(model.parameters())

nn_train(model, train_loader, val_loader, optimizer, n_epochs=2)

## 3 Triplet Loss

For an anchor sample, the distance to samples of the same class should be smaller than the distance to samples of other classes.

Euclidean distance: 
<img src=images/triplet.png height=200/>

Cosine distance: 
<img src=images/triplet2.png height=400/>

In [106]:
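from torch import nn
import torch.nn.functional as F

def triplet_loss(anchor_embed, pos_embed, neg_embed, margin=0.5):
    # Hinge on cosine similarity: the positive should be more similar to the anchor
    # than the negative by at least `margin` (0.5 is an illustrative default, not a tuned value).
    pos_sim = F.cosine_similarity(anchor_embed, pos_embed)
    neg_sim = F.cosine_similarity(anchor_embed, neg_embed)
    return F.relu(neg_sim - pos_sim + margin).mean()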
    
    
class Tripletnet(nn.Module):
    def __init__(self):
        super(Tripletnet, self).__init__()
        ...
        
    def branch(self, x):
        ...

    def forward(self, anchor, pos, neg):
        
        anchor = self.branch(anchor)
        pos = self.branch(pos)
        neg = self.branch(neg)
        
        return triplet_loss(anchor, pos, neg)

### Hard Negative Mining


Sometimes, if you use random samples as negative examples, the task becomes too easy for your model.  
Consider using as negative examples the samples on which your model made mistakes in the previous epoch.  
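
One common variant of this idea is to pick, for each anchor, the negatives that the current model already places closest to it, since those are the ones it is most likely to confuse with positives. A minimal sketch, assuming embeddings are non-zero lists of floats and cosine similarity is the score; `mine_hard_negatives` and `top_k` are hypothetical names used only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_hard_negatives(anchor, negatives, top_k=1):
    """Return the indices of the `top_k` negatives most similar to the anchor."""
    order = sorted(
        range(len(negatives)),
        key=lambda j: cosine_similarity(anchor, negatives[j]),
        reverse=True,
    )
    return order[:top_k]


anchor = [1.0, 0.0]
negatives = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
print(mine_hard_negatives(anchor, negatives))  # [1], the negative nearly parallel to the anchor
```

The selected indices can then be used to build the triplets for the next epoch instead of sampling negatives at random.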

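### Another loss

A generalization of the triplet loss to many negatives: every other sample in the batch acts as a negative, which gives a softmax cross-entropy over the batch.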

$$ Loss(D) = - \frac 1 B \sum_{i=1}^B \log \frac {\exp (D_{ii})} {\sum_{j=1}^B \exp (D_{ij})} $$

where  
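$D$ - similarity matrix between samples in a batch: $D_{ij}$ is the similarity between the i-th anchor and the j-th candidate, so the diagonal $D_{ii}$ holds the positive pairs  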
$B$ - batch size  
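
To make the formula concrete, here is a minimal pure-Python sketch that evaluates it for a given similarity matrix. It assumes $D$ is a square list of lists whose diagonal holds the positive pairs; `batch_softmax_loss` is an illustrative name. In practice the same quantity is usually computed with a framework's cross-entropy function on the rows of $D$.

```python
import math


def batch_softmax_loss(D):
    """Mean over rows of -log softmax(D[i])[i] for a square similarity matrix D."""
    B = len(D)
    total = 0.0
    for i in range(B):
        # log-sum-exp with the row maximum subtracted for numerical stability
        row_max = max(D[i])
        log_denominator = row_max + math.log(sum(math.exp(d - row_max) for d in D[i]))
        total += log_denominator - D[i][i]
    return total / B


# Uninformative similarities: every candidate is equally likely, loss = log(B)
print(batch_softmax_loss([[0.0, 0.0], [0.0, 0.0]]))
# Positives clearly separated from negatives: loss close to zero
print(batch_softmax_loss([[10.0, 0.0], [0.0, 10.0]]))
```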

## 4 K-Nearest Neighbors (KNN)

Training complexity: O(1) - just store the whole training set  
Inference complexity: O(n) per query - each test sample has to be compared with all n training samples  

<img src=images/knn.png style="height:300px"/>
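
The complexity figures above follow directly from a brute-force implementation. A minimal sketch, assuming Euclidean distance and majority voting; `knn_predict` is an illustrative name.

```python
import heapq
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # "Training" is just keeping train_points; inference scans all n of them.
    distances = (
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest = heapq.nsmallest(k, distances, key=lambda pair: pair[0])
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
labels = ["a", "a", "b", "b"]
print(knn_predict(points, labels, (0.2, 0.1), k=3))  # a
```

Approximate methods such as LSH trade a small loss in accuracy for avoiding this full scan.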

## 5 Locality-Sensitive Hashing (LSH)

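A good implementation of approximate nearest neighbor search can be found at `https://github.com/spotify/annoy` (Annoy builds a forest of random projection trees, an idea closely related to LSH).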

Definition:  
An LSH family $F$ is a family of hash functions that map a metric space $M$ to a set of buckets $S$.  
$$ h: M \rightarrow S $$

Let  
$p,q \in M$ - points in the space  
$d$ - the metric on $M$  
$R > 0$ - a distance threshold  
$c$ - a scalar, $c > 1$  
$p_1, p_2$ - probabilities, $p_1 > p_2$  

Then, for $h$ drawn at random from $F$:  

* if $d(p,q) \leq R$ then $P[h(p) = h(q)] \geq p_1$  
* if $d(p,q) \geq cR$ then $P[h(p) = h(q)] \leq p_2$  

A family $F$ that satisfies both conditions is called $(R, cR, p_1, p_2)$-sensitive.


Assumption: uniform distribution
    
<img src=images/lsh1.jpeg height=400/>
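
As one concrete instance of such a family, random hyperplane hashing (often called SimHash) targets the angular distance: each hash function is the sign of a projection onto a random direction, so two points agree on one bit with probability $1 - \theta / \pi$, where $\theta$ is the angle between them. A minimal sketch in pure Python; `make_hyperplane_hash`, `n_bits` and `seed` are hypothetical names chosen for this example.

```python
import random


def make_hyperplane_hash(dim, n_bits, seed=0):
    """Build a hash that maps a point to a tuple of n_bits sign bits."""
    rng = random.Random(seed)
    # one random Gaussian direction (hyperplane normal) per output bit
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    def hash_point(point):
        return tuple(
            int(sum(w * x for w, x in zip(plane, point)) >= 0)
            for plane in planes
        )

    return hash_point


hash_point = make_hyperplane_hash(dim=3, n_bits=8, seed=42)
p = [1.0, 2.0, 3.0]
# The hash depends only on direction, so rescaling a point keeps its bucket.
print(hash_point(p) == hash_point([2.0, 4.0, 6.0]))  # True
```

Points whose bit tuples are equal land in the same bucket, and only those candidates need to be compared exactly at query time.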

Amplification:  

1. AND construction

Define a new family of hash functions $G = \{g\}$, where each $g$ consists of $k$ hash functions chosen at random from $F$: $g = (h_1, \dots, h_k)$.

$g(p) = g(q)$ iff $h_i(p) = h_i(q)$ **for all** $i$  

Then the family $G$ is $(R, cR, p_1^k, p_2^k)$-sensitive.

2. OR construction

Define a new family of hash functions $G = \{g\}$, where each $g$ consists of $k$ hash functions chosen at random from $F$: $g = (h_1, \dots, h_k)$.

$g(p) = g(q)$ iff $h_i(p) = h_i(q)$ **for at least one** $i$  

Then the family $G$ is $(R, cR, 1 - (1 - p_1)^k, 1 - (1 - p_2)^k)$-sensitive.
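
The effect of the two constructions on the collision probabilities is easy to check numerically. A minimal sketch with illustrative function names and made-up starting values $p_1 = 0.8$, $p_2 = 0.4$:

```python
def and_construction(p1, p2, k):
    """Collision probabilities when all k hash functions must agree."""
    return p1 ** k, p2 ** k


def or_construction(p1, p2, k):
    """Collision probabilities when at least one of k hash functions must agree."""
    return 1 - (1 - p1) ** k, 1 - (1 - p2) ** k


p1, p2 = 0.8, 0.4  # illustrative values, not taken from any dataset
print(and_construction(p1, p2, k=4))  # both probabilities shrink, p2 much faster
print(or_construction(p1, p2, k=4))   # both probabilities grow, p1 approaches 1
```

AND lowers both probabilities and so reduces false positives, while OR raises both and so reduces false negatives; the two are typically combined to widen the gap between $p_1$ and $p_2$.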

LSH maps:  
<img src=images/lsh2.png style="height:400px">