<a href="https://colab.research.google.com/github/sagar9926/Question-Duplicates-Using-Siamese-Network-/blob/main/Question_Duplicates_Siamese_Neural_Network.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Question duplicates

In this assignment I have explored Siamese networks applied to natural language processing. 

## Outline

- [Overview](#0)
- [Part 1: Importing the Data](#1)
    - [1.1 Loading in the data](#1.1)
    - [1.2 Converting a question to a tensor](#1.2)
    - [1.3 Understanding the iterator](#1.3)
        - [Exercise 01](#ex01)
- [Part 2: Defining the Siamese model](#2)
    - [2.1 Understanding Siamese Network](#2.1)
        - [Exercise 02](#ex02)
    - [2.2 Hard Negative Mining](#2.2)
        - [Exercise 03](#ex03)
- [Part 3: Training](#3)
    - [3.1 Training the model](#3.1)
        - [Exercise 04](#ex04)
- [Part 4: Evaluation](#4)
    - [4.1 Evaluating your siamese network](#4.1)
    - [4.2 Classify](#4.2)
        - [Exercise 05](#ex05)
- [Part 5: Testing with your own questions](#5)
    - [Exercise 06](#ex06)
- [On Siamese networks](#6)

<a name='0'></a>
### Overview
In this assignment, concretely you will: 

- Learn about Siamese networks
- Understand how the triplet loss works
- Understand how to evaluate accuracy
- Use cosine similarity between the model's outputted vectors
- Use the data generator to get batches of questions
- Predict using your own model

By now, you are familiar with trax and know how to make use of classes to define your model. We will start this homework by asking you to preprocess the data the same way you did in the previous assignments. After processing the data you will build a classifier that will allow you to identify whether two questions are the same or not.
<img src = "https://github.com/amanjeetsahu/Natural-Language-Processing-Specialization/raw/d562105e68a0b85012ad3ebbb29b2af6344ad4e5/Natural%20Language%20Processing%20with%20Sequence%20Models/Week%204/meme.png" style="width:550px;height:300px;"/>


You will process the data first and then pad it the same way you did in the previous assignment. Your model will take in the two question embeddings, run them through an LSTM, and then compare the outputs of the two subnetworks using cosine similarity. Before taking a deep dive into the model, start by importing the data set.
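
To make the comparison step concrete, here is a minimal sketch of cosine similarity in plain Python (no PyTorch needed). The three vectors are hypothetical stand-ins for the outputs of the two subnetworks, not values produced by this notebook's model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "question vectors", for illustration only.
q_a = [1.0, 2.0, 3.0]
q_b = [2.0, 4.0, 6.0]      # same direction as q_a
q_c = [-1.0, -2.0, -3.0]   # opposite direction to q_a

sim_same = cosine_similarity(q_a, q_b)
sim_opposite = cosine_similarity(q_a, q_c)
print(sim_same, sim_opposite)
```

Vectors pointing the same way score close to 1 and opposite vectors close to -1. When both vectors are already L2-normalized, the cosine similarity reduces to a plain dot product.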


<a name='1'></a>
# Part 1: Importing the Data
<a name='1.1'></a>
### 1.1 Loading in the data

You will be using the Quora question answer dataset to build a model that can identify similar questions. This is a useful task because you don't want several versions of the same question posted. When teaching, I often end up answering similar questions on Piazza or on other community forums. This data set has been labeled for you. Run the cell below to import some of the packages you will be using.

## Importing Libraries

In [1]:
import random
import torch  # torchtext is not used below, and torchtext.legacy was removed in later torchtext releases
import pandas as pd
import numpy as np
import spacy
import nltk
import random as rnd
import torch.optim as optim


nltk.download('punkt')

SEED = 43
torch.manual_seed(SEED)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
print("Device : ",device)

Device :  cuda


# Importing Data

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
path = "/content/drive/MyDrive/Study Decks/Natural Language Processing/questions.csv"
df = pd.read_csv(path)
N=len(df)
print('Number of question pairs: ', N)
df.head()

Number of question pairs:  404351


Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [5]:
N_train = 300000
N_test  = 10*1024
data_train = df[:N_train]
data_test  = df[N_train:N_train+N_test]
print("Train set:", len(data_train), "Test set:", len(data_test))

td_index = (data_train['is_duplicate'] == 1).to_numpy()
td_index = [i for i, x in enumerate(td_index) if x] 
print('number of duplicate questions: ', len(td_index))
print('indexes of first ten duplicate questions:', td_index[:10])

Q1_train_words = np.array(data_train['question1'][td_index])
Q2_train_words = np.array(data_train['question2'][td_index])
y_train = np.array([1]*len(Q1_train_words))

Q1_test_words = np.array(data_test['question1'])
Q2_test_words = np.array(data_test['question2'])
y_test  = np.array(data_test['is_duplicate'])

print('TRAINING QUESTIONS:\n')
print('Question 1: ', Q1_train_words[0])
print('Question 2: ', Q2_train_words[0], '\n')
print('Question 1: ', Q1_train_words[5])
print('Question 2: ', Q2_train_words[5], '\n')

print('TESTING QUESTIONS:\n')
print('Question 1: ', Q1_test_words[0])
print('Question 2: ', Q2_test_words[0], '\n')
print('is_duplicate =', y_test[0], '\n')

Train set: 300000 Test set: 10240
number of duplicate questions:  111486
indexes of first ten duplicate questions: [5, 7, 11, 12, 13, 15, 16, 18, 20, 29]
TRAINING QUESTIONS:

Question 1:  Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?
Question 2:  I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me? 

Question 1:  What would a Trump presidency mean for current international master’s students on an F1 visa?
Question 2:  How will a Trump presidency affect the students presently in US or planning to study in US? 

TESTING QUESTIONS:

Question 1:  How do I prepare for interviews for cse?
Question 2:  What is the best way to prepare for cse? 

is_duplicate = 0 



In [6]:
#create arrays
Q1_train = np.empty_like(Q1_train_words)
Q2_train = np.empty_like(Q2_train_words)

Q1_test = np.empty_like(Q1_test_words)
Q2_test = np.empty_like(Q2_test_words)

# Building the vocabulary with the train set         (this might take a minute)
from collections import defaultdict

vocab = defaultdict(lambda: 0)
vocab['<PAD>'] = 1

for idx in range(len(Q1_train_words)):
    Q1_train[idx] = nltk.word_tokenize(Q1_train_words[idx])
    Q2_train[idx] = nltk.word_tokenize(Q2_train_words[idx])
    q = Q1_train[idx] + Q2_train[idx]
    for word in q:
        if word not in vocab:
            vocab[word] = len(vocab) + 1
print('The length of the vocabulary is: ', len(vocab))

The length of the vocabulary is:  36342


In [7]:
for idx in range(len(Q1_test_words)): 
    Q1_test[idx] = nltk.word_tokenize(Q1_test_words[idx])
    Q2_test[idx] = nltk.word_tokenize(Q2_test_words[idx])

In [8]:
# Converting questions to array of integers
for i in range(len(Q1_train)):
    Q1_train[i] = [vocab[word] for word in Q1_train[i]]
    Q2_train[i] = [vocab[word] for word in Q2_train[i]]

        
for i in range(len(Q1_test)):
    Q1_test[i] = [vocab[word] for word in Q1_test[i]]
    Q2_test[i] = [vocab[word] for word in Q2_test[i]]

print('first question in the train set:\n')
print(Q1_train_words[0], '\n') 
print('encoded version:')
print(Q1_train[0],'\n')

print('first question in the test set:\n')
print(Q1_test_words[0], '\n')
print('encoded version:')
print(Q1_test[0]) 

first question in the train set:

Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me? 

encoded version:
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21] 

first question in the test set:

How do I prepare for interviews for cse? 

encoded version:
[32, 38, 4, 107, 65, 1015, 65, 11522, 21]
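
One detail of the encoding step is worth a small sketch. `vocab` is a `defaultdict`, so `vocab[word]` returns 0 for an unseen word but also stores that word as a new key. This is probably why the embedding layer further down has 41789 rows although the vocabulary built from the training set printed 36342 entries. The toy vocabulary below is hypothetical and only illustrates the behavior.

```python
from collections import defaultdict

vocab = defaultdict(lambda: 0)
vocab['<PAD>'] = 1
vocab['How'] = 2

size_before = len(vocab)

_ = vocab['unseen-word']             # indexing a missing key stores it with value 0
size_after_index = len(vocab)

_ = vocab.get('another-unseen', 0)   # .get() returns the default without storing it
size_after_get = len(vocab)

print(size_before, size_after_index, size_after_get)
```

If the vocabulary size should stay fixed after training, looking words up with `.get(word, 0)` is one way to avoid the silent growth.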


In [9]:
# Splitting the data
cut_off = int(len(Q1_train)*.8)
train_Q1, train_Q2 = Q1_train[:cut_off], Q2_train[:cut_off]
val_Q1, val_Q2 = Q1_train[cut_off: ], Q2_train[cut_off:]
print('Number of duplicate questions: ', len(Q1_train))
print("The length of the training set is:  ", len(train_Q1))
print("The length of the validation set is: ", len(val_Q1))

Number of duplicate questions:  111486
The length of the training set is:   89188
The length of the validation set is:  22298


In [10]:
# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: data_generator
def data_generator(Q1, Q2, batch_size, pad=1, shuffle=True):
    """Generator function that yields batches of data

    Args:
        Q1 (list): List of encoded questions (lists of token ids).
        Q2 (list): List of encoded questions (lists of token ids).
        batch_size (int): Number of elements per batch.
        pad (int, optional): Pad token id from the vocab. Defaults to 1.
        shuffle (bool, optional): Whether the question order should be randomized. Defaults to True.
    Yields:
        tuple: Of the form (input1, input2) with types (torch.Tensor, torch.Tensor)
        NOTE: input1: inputs to your model [q1a, q2a, q3a, ...] i.e. (q1a,q1b) are duplicates
              input2: targets to your model [q1b, q2b,q3b, ...] i.e. (q1a,qib) with i!=1 are not duplicates
    """

    input1 = []
    input2 = []
    idx = 0
    len_q = len(Q1)
    question_indexes = [*range(len_q)]
    
    if shuffle:
        rnd.shuffle(question_indexes)
    
    ### START CODE HERE (Replace instances of 'None' with your code) ###
    while True:
        if idx >= len_q:
            # if idx is greater than or equal to len_q, set idx accordingly 
            # (Hint: look at the instructions above)
            idx = 0
            # shuffle to get random batches if shuffle is set to True
            if shuffle:
                rnd.shuffle(question_indexes)
        
        # get questions at the `question_indexes[idx]` position in Q1 and Q2
        q1 = Q1[question_indexes[idx]]
        q2 = Q2[question_indexes[idx]]
        
        # increment idx by 1
        idx += 1
        # append q1
        input1.append(q1)
        # append q2
        input2.append(q2)
        if len(input1) == batch_size:
            # determine max_len as the longest question in input1 & input 2
            # Hint: use the `max` function. 
            # take max of input1 & input2 and then max out of the two of them.
            max_len = max([len(x) for x in input1] + [len(x) for x in input2])
            # pad to power-of-2 (Hint: look at the instructions above)
            max_len = 2**int(np.ceil(np.log2(max_len)))
            b1 = []
            b2 = []
            for q1, q2 in zip(input1, input2):
                # add [pad] to q1 until it reaches max_len
                q1 = q1 + [pad]*(max_len - len(q1))
                # add [pad] to q2 until it reaches max_len
                q2 = q2 + [pad]*(max_len - len(q2)) 
                # append q1
                b1.append(torch.tensor(q1))
                # append q2
                b2.append(torch.tensor(q2))
            # use b1 and b2
            
            yield torch.stack([*b1]),torch.stack([*b2])
    ### END CODE HERE ###
            # reset the batches
            input1, input2 = [], []  # reset the batches
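
The padding rule inside `data_generator` can be sketched on its own in plain Python: find the longest question in the batch, round that length up to the next power of two, and fill every question with the pad id. The helper name `pad_batch` and the two encoded questions are hypothetical.

```python
import math

def pad_batch(questions, pad=1):
    """Pad every encoded question to the next power of two >= the longest one."""
    longest = max(len(q) for q in questions)
    max_len = 2 ** int(math.ceil(math.log2(longest)))
    return [q + [pad] * (max_len - len(q)) for q in questions]

# Hypothetical encoded questions of lengths 3 and 9.
batch = [[32, 4, 33], [30, 156, 78, 317, 307, 617, 11, 2121, 21]]
padded = pad_batch(batch)
print([len(q) for q in padded])
```

A longest length of 9 is rounded up to 16, so both rows end up with 16 entries.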

In [11]:
batch_size = 4
res1, res2 = next(data_generator(train_Q1, train_Q2, batch_size))
print("First questions  : ",'\n', res1, '\n')
print("Second questions : ",'\n', res2)

First questions  :  
 tensor([[   32,     4,    33,   331,    43,   230,  3717,    21,     1,     1,
             1,     1,     1,     1,     1,     1],
        [   30,   156,    78,   317,   307,   617,    11,  2121,    21,     1,
             1,     1,     1,     1,     1,     1],
        [  244,   156,    78, 20106,  1759,   127,    56,   792,    21,     1,
             1,     1,     1,     1,     1,     1],
        [  219,   138,   473,   165,  2618,   267,  1596,  6917,    21,     1,
             1,     1,     1,     1,     1,     1]]) 

Second questions :  
 tensor([[  30,   33,    4,   38,   39,  331,   43,  230, 1212,   21,    1,    1,
            1,    1,    1,    1],
        [  30,  156,   78,  317,  307, 2121,   11,  617,   21,    1,    1,    1,
            1,    1,    1,    1],
        [ 244,  156,   78,  134, 2131, 1759,  131,   56,  792,   21,    1,    1,
            1,    1,    1,    1],
        [ 219,  138,  473,  165, 2618,  267, 1438, 1596, 6917,   21,    1,    1,
   

# Model

## Siamese Neural Network

A Siamese neural network is a class of neural network architectures that contain two or more identical subnetworks. Identical here means they have the same configuration with the same parameters and weights, and parameter updates are mirrored across the subnetworks. It is used to measure the similarity of two inputs by comparing their feature vectors.

For more details check this blog : https://innovationincubator.com/siamaese-neural-network-with-paytorch-code-example/
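
The weight-sharing idea can be sketched without PyTorch. In the toy example below, a single encoder object (embed, mean-pool, L2-normalize) processes both inputs, which is what "identical subnetworks" amounts to in practice. `TinyEncoder`, `siamese_similarity` and the embedding table are hypothetical and much simpler than the LSTM model defined in this notebook.

```python
import math

class TinyEncoder:
    """Toy stand-in for one Siamese branch: embed, mean-pool, L2-normalize."""

    def __init__(self, embeddings):
        self.embeddings = embeddings  # one shared table: token id -> vector

    def __call__(self, token_ids):
        vectors = [self.embeddings[t] for t in token_ids]
        dim = len(vectors[0])
        mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
        norm = math.sqrt(sum(x * x for x in mean))
        return [x / norm for x in mean]

def siamese_similarity(encoder, q1, q2):
    # The SAME encoder (same weights) processes both inputs.
    v1, v2 = encoder(q1), encoder(q2)
    return sum(a * b for a, b in zip(v1, v2))

# Hypothetical embedding table, for illustration only.
table = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}
encoder = TinyEncoder(table)

sim_identical = siamese_similarity(encoder, [1, 2], [1, 2])
sim_ab = siamese_similarity(encoder, [1, 3], [2, 3])
sim_ba = siamese_similarity(encoder, [2, 3], [1, 3])
print(sim_identical, sim_ab, sim_ba)
```

Because both branches share one set of weights, identical inputs always score 1, and swapping the two inputs does not change the score.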

In [12]:
import torch.nn as nn
import torch.nn.functional as F
# https://towardsdatascience.com/lstms-in-pytorch-528b0440244#:~:text=The%20input%20to%20the%20LSTM,batch_size%2C%20sequence_length%2C%20hidden_size)%20.

# https://stackoverflow.com/questions/48302810/whats-the-difference-between-hidden-and-output-in-pytorch-lstm#:~:text=In%20Pytorch%2C%20the%20output%20parameter,LSTM%20stack%20in%20every%20layer.

class SiameseNetwork(nn.Module):
    # Define all the layers used in model
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers):
        
        super().__init__()          
        
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        # LSTM layer
        self.encoder = nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers=n_layers, 
                           batch_first=True)
                           
        # try using nn.GRU or nn.RNN here and compare their performances
        # try bidirectional and compare their performances
        
        # Dense layer
        self.fc = nn.Linear(hidden_dim, hidden_dim)

    def forward_once(self, text):
        
        # text = [batch size, sent_length]
        embedded = self.embedding(text)
        # embedded = [batch size, sent_len, emb dim]
        
        packed_output, (hidden, cell) = self.encoder(embedded)
        
        # packed_output = [batch size, sent_len, hid dim] (not actually packed; batch_first=True)
        # hidden = [num layers * num directions, batch size, hid dim]
        # cell = [num layers * num directions, batch size, hid dim]

        packed_output = self.fc(packed_output)
        packed_output = torch.mean(packed_output,axis = 1 )
        packed_output = packed_output / torch.norm(packed_output,dim = 1,keepdim= True)
        #packed_output = (packed_output - packed_output.mean(axis = 1,keepdim = True))/packed_output.std(axis = 1,keepdim = True,unbiased=False)
        return packed_output
        
    def forward(self, input1, input2):
        output1 = self.forward_once(input1)
        output2 = self.forward_once(input2)
        return output1, output2

In [13]:
# Define hyperparameters
size_of_vocab = len(vocab)
embedding_dim = 128
num_hidden_nodes = 100
num_output_nodes = 3
num_layers = 1
#dropout = 0.2

# Instantiate the model
model = SiameseNetwork(size_of_vocab, embedding_dim, num_hidden_nodes, num_output_nodes, num_layers)
model = model.to(device)

In [14]:
[p.shape for p in model.parameters()]

[torch.Size([41789, 128]),
 torch.Size([400, 128]),
 torch.Size([400, 100]),
 torch.Size([400]),
 torch.Size([400]),
 torch.Size([100, 100]),
 torch.Size([100])]

In [15]:
print(model)

# No. of trainable parameters
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
    
print(f'The model has {count_parameters(model):,} trainable parameters')

SiameseNetwork(
  (embedding): Embedding(41789, 128)
  (encoder): LSTM(128, 100, batch_first=True)
  (fc): Linear(in_features=100, out_features=100, bias=True)
)
The model has 5,451,092 trainable parameters
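
The parameter count can be checked by hand from the shapes listed above. The sketch below only redoes that arithmetic, with the sizes copied from the printed model summary. The factor 4 comes from the four LSTM gates.

```python
vocab_size, embedding_dim, hidden_dim = 41789, 128, 100

embedding = vocab_size * embedding_dim          # Embedding(41789, 128)
lstm_input = 4 * hidden_dim * embedding_dim     # input-to-hidden weights, shape [400, 128]
lstm_hidden = 4 * hidden_dim * hidden_dim       # hidden-to-hidden weights, shape [400, 100]
lstm_biases = 2 * 4 * hidden_dim                # two bias vectors of size 400
linear = hidden_dim * hidden_dim + hidden_dim   # Linear(100, 100) weight and bias

total = embedding + lstm_input + lstm_hidden + lstm_biases + linear
print(f"{total:,}")
```

The embedding table accounts for almost all of the parameters, so the vocabulary size is the main driver of model size here.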


In [16]:
batch_size = 256
train_generator = data_generator(train_Q1, train_Q2, batch_size, vocab['<PAD>'])
val_generator = data_generator(val_Q1, val_Q2, batch_size, vocab['<PAD>'])
print('train_Q1.shape ', train_Q1.shape)
print('val_Q1.shape   ', val_Q1.shape)

train_Q1.shape  (89188,)
val_Q1.shape    (22298,)


# Triplet Loss

In [17]:
# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: TripletLossFn
def TripletLossFn(v1, v2, margin=torch.tensor([0.5])):
    """Custom Loss function.

    Args:
        v1 (torch.Tensor): Tensor with dimension (batch_size, model_dimension) associated to Q1.
        v2 (torch.Tensor): Tensor with dimension (batch_size, model_dimension) associated to Q2.
        margin (torch.Tensor, optional): Desired margin. Defaults to 0.5.

    Returns:
        torch.Tensor: Triplet Loss (scalar).
    """
    # pairwise dot products of the two batches; this equals the cosine
    # similarity because the model L2-normalizes its outputs
    scores = torch.mm(v1, v2.T)
    batch_size = len(scores)
    # the diagonal holds the scores of the true duplicate pairs
    positive = torch.diag(scores)

    # subtract 2.0 on the diagonal so a positive is never picked as the closest negative
    negative_without_positive = scores - 2.0 * torch.eye(batch_size, device=scores.device)
    # row-wise max: the hardest (most similar) non-duplicate for each question
    closest_negative = negative_without_positive.max(axis=1)
    # zero out the diagonal, then average the remaining entries of each row
    negative_zero_on_duplicate = scores * (1.0 - torch.eye(batch_size, device=scores.device))
    mean_negative = torch.sum(negative_zero_on_duplicate, axis=1) / (batch_size - 1)

    # keep the constants on the same device as the scores (works on CPU and GPU)
    zero = torch.tensor([0.0], device=scores.device)
    margin = margin.to(scores.device)
    # loss 1: max(0, margin - positive + closest_negative)
    triplet_loss1 = torch.maximum(zero, margin - positive + closest_negative.values)
    # loss 2: max(0, margin - positive + mean_negative)
    triplet_loss2 = torch.maximum(zero, margin - positive + mean_negative)
    # add the two losses together and take the mean over the batch
    triplet_loss = torch.mean(triplet_loss1 + triplet_loss2)

    return triplet_loss

In [18]:
v1 = torch.tensor(np.array([[0.26726124, 0.53452248, 0.80178373],[0.5178918 , 0.57543534, 0.63297887],[0.8617891 , 0.43543534, 0.21297887]])).cuda()
v2 = torch.tensor(np.array([[ 0.26726124,  0.53452248,  0.80178373],[-0.5178918 , -0.57543534, -0.63297887],[0.8617891 , 0.43543534, 0.21297887]])).cuda()
print("Triplet Loss:", TripletLossFn(v2,v1))

Triplet Loss: tensor(0.8774, device='cuda:0', dtype=torch.float64)
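
To see where this number comes from, the same computation can be written out in plain Python. The sketch below is a simplified mirror of `TripletLossFn`, not a replacement for it. It assumes that dropping the diagonal entry gives the same closest negative as subtracting 2.0 from it, which holds when the scores stay within roughly [-1, 1].

```python
def triplet_loss(v1, v2, margin=0.5):
    """Plain-Python mirror of TripletLossFn for small batches given as lists."""
    n = len(v1)
    # scores[i][j] is the dot product of question i in v1 with question j in v2
    scores = [[sum(a * b for a, b in zip(v1[i], v2[j])) for j in range(n)]
              for i in range(n)]
    total = 0.0
    for i in range(n):
        positive = scores[i][i]                                 # true duplicate pair
        negatives = [scores[i][j] for j in range(n) if j != i]  # all other pairs
        closest_negative = max(negatives)                       # hardest negative
        mean_negative = sum(negatives) / (n - 1)
        total += max(0.0, margin - positive + closest_negative)
        total += max(0.0, margin - positive + mean_negative)
    return total / n

a = [0.26726124, 0.53452248, 0.80178373]
b = [0.5178918, 0.57543534, 0.63297887]
c = [0.8617891, 0.43543534, 0.21297887]
neg_b = [-x for x in b]

# Same argument order as the cell above: the batch with the negated row comes first.
loss = triplet_loss([a, neg_b, c], [a, b, c])
print(round(loss, 4))
```

The result should come out close to the value printed by the PyTorch version. The second row dominates the loss, because its "duplicate" pair points in opposite directions and so scores -1.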


# Training Loop

In [19]:
def train(train_generator,model,optimizer,criterion,epochs = 100):
    best_loss = float('inf')
    PATH = '/content/best-model.pt'
    # number of batches that make up one pass over the training set
    steps_per_epoch = len(train_Q1) // batch_size
    for epoch in range(epochs):
        for i, batch in enumerate(train_generator,0):
            batch_Q1, batch_Q2  = batch
            batch_Q1, batch_Q2  = batch_Q1.to(device), batch_Q2.to(device)
            optimizer.zero_grad()
            output1,output2 = model(batch_Q1, batch_Q2)
            loss = criterion(output1,output2)

            loss.backward()
            optimizer.step()

            # checkpoint whenever a batch reaches a new lowest training loss;
            # .item() stores a float instead of keeping the graph tensor alive
            if loss.item() < best_loss:
                best_loss = loss.item()
                print("Save model... best loss = ",best_loss)
                torch.save(model.state_dict(), PATH)
            # the generator never stops on its own, so end each epoch manually
            if (i+1) % steps_per_epoch == 0 :
                print("Epoch number {}\n Current loss {}\n".format(epoch,loss.item()))
                break

    return model

In [20]:
# Train the model
optimizer = optim.Adam(model.parameters(), lr=0.01)

criterion = TripletLossFn
model = train(train_generator,model,optimizer,criterion)


Save model... best loss =  0.9958070516586304
Save model... best loss =  0.9918943047523499
Save model... best loss =  0.9697502255439758
Save model... best loss =  0.8653439283370972
Save model... best loss =  0.8480695486068726
Save model... best loss =  0.8233498930931091
Save model... best loss =  0.7832368016242981
Save model... best loss =  0.7467554211616516
Save model... best loss =  0.7076646089553833
Save model... best loss =  0.6916126012802124
Save model... best loss =  0.6892021894454956
Save model... best loss =  0.6608090400695801
Save model... best loss =  0.6399219632148743
Save model... best loss =  0.6276798844337463
Save model... best loss =  0.6135058403015137
Save model... best loss =  0.6064757704734802
Save model... best loss =  0.6018864512443542
Save model... best loss =  0.5986359119415283
Save model... best loss =  0.583355724811554
Save model... best loss =  0.5597968101501465
Save model... best loss =  0.55203777551651
Save model... best loss =  0.54868537

In [21]:
PATH = '/content/best-model.pt'
model.load_state_dict(torch.load(PATH))

<All keys matched successfully>

In [22]:

def classify(test_Q1, test_Q2, y, threshold, model, vocab, data_generator=data_generator, batch_size=64):
    """Function to test the accuracy of the model.

    Args:
        test_Q1 (numpy.ndarray): Array of Q1 questions.
        test_Q2 (numpy.ndarray): Array of Q2 questions.
        y (numpy.ndarray): Array of actual target.
        threshold (float): Desired threshold.
        model (torch.nn.Module): The Siamese model.
        vocab (collections.defaultdict): The vocabulary used.
        data_generator (function): Data generator function. Defaults to data_generator.
        batch_size (int, optional): Size of the batches. Defaults to 64.

    Returns:
        float: Accuracy of the model.
    """
    accuracy = 0
    ### START CODE HERE (Replace instances of 'None' with your code) ###
    for i in range(0, len(test_Q1), batch_size):
        # Call the data generator (built in Ex 01) with shuffle=False using next()
        # use batch-size chunks of questions as Q1 & Q2 arguments of the data generator. e.g x[i:i + batch_size]
        # Hint: use `vocab['<PAD>']` for the `pad` argument of the data generator
        q1, q2 = next(data_generator(test_Q1[i:i + batch_size], test_Q2[i:i + batch_size], batch_size, vocab['<PAD>'], shuffle=False))
        # use batch-size chunks of actual output targets (same syntax as example above)
        y_test = y[i:i + batch_size]
        # Call the model
        v1, v2 = model(q1, q2)
        v1 = v1.detach().numpy()
        v2 = v2.detach().numpy()
        
        for j in range(batch_size):
            # take dot product to compute cos similarity of each pair of entries, v1[j], v2[j]
            # don't forget to transpose the second argument
            d = np.dot(v1[j], v2[j].T)
            # is d greater than the threshold?
            res = d > threshold
            # increment accuracy if y_test is equal to `res`
            accuracy += (y_test[j] == res)
    # compute accuracy using accuracy and total length of test questions
    accuracy = accuracy / len(test_Q1)
    ### END CODE HERE ###
    
    return accuracy

In [23]:
# this takes around 1 minute
accuracy = classify(Q1_test,Q2_test, y_test, 0.7, model.to('cpu'), vocab, batch_size = 512) 
print("Accuracy", accuracy)

Accuracy 0.75107421875
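
The accuracy above depends on the chosen threshold of 0.7. The sketch below shows the thresholding rule on its own in plain Python. The similarity scores and labels are made up for illustration and are not outputs of the trained model.

```python
def accuracy_at_threshold(similarities, labels, threshold):
    """Fraction of pairs where (similarity > threshold) matches the label."""
    correct = sum((s > threshold) == bool(y) for s, y in zip(similarities, labels))
    return correct / len(labels)

# Hypothetical similarity scores and is_duplicate labels, for illustration only.
sims = [0.72, -0.09, 0.73, 0.65, 0.80]
labels = [1, 0, 1, 1, 0]

acc_07 = accuracy_at_threshold(sims, labels, 0.7)
acc_06 = accuracy_at_threshold(sims, labels, 0.6)
print(acc_07, acc_06)
```

On this toy data, lowering the threshold from 0.7 to 0.6 recovers the duplicate pair that scored 0.65. One way to choose the threshold is to sweep it over a validation set in the same manner.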


In [24]:
#Accuracy 0.6232421875

In [25]:
# UNQ_C6 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: predict
def predict(question1, question2, threshold, model, vocab, data_generator=data_generator, verbose=True):
    """Function for predicting if two questions are duplicates.

    Args:
        question1 (str): First question.
        question2 (str): Second question.
        threshold (float): Desired threshold.
        model (torch.nn.Module): The Siamese model.
        vocab (collections.defaultdict): The vocabulary used.
        data_generator (function): Data generator function. Defaults to data_generator.
        verbose (bool, optional): If the results should be printed out. Defaults to True.

    Returns:
        bool: True if the questions are duplicates, False otherwise.
    """
    ### START CODE HERE (Replace instances of 'None' with your code) ###
    # use `nltk` word tokenize function to tokenize
    q1 = nltk.word_tokenize(question1)  # tokenize
    q2 = nltk.word_tokenize(question2)  # tokenize
    Q1, Q2 = [], []
    for word in q1:  # encode q1
        # increment by checking the 'word' index in `vocab`
        Q1 += [vocab[word]]
    for word in q2:  # encode q2
        # increment by checking the 'word' index in `vocab`
        Q2 += [vocab[word]]
        
    # Call the data generator (built in Ex 01) using next()
    # pass [Q1] & [Q2] as Q1 & Q2 arguments of the data generator. Set batch size as 1
    # Hint: use `vocab['<PAD>']` for the `pad` argument of the data generator
    Q1, Q2 = next(data_generator([Q1], [Q2], 1, vocab['<PAD>']))
    # Call the model
    v1, v2 = model(Q1, Q2)
    # take dot product to compute cos similarity of each pair of entries, v1, v2
    # don't forget to transpose the second argument
    d = np.dot(v1[0].detach().numpy(), v2[0].detach().numpy().T)
    # is d greater than the threshold?
    res = d > threshold
    
    ### END CODE HERE ###
    
    if(verbose):
        print("Q1  = ", Q1, "\nQ2  = ", Q2)
        print("d   = ", d)
        print("res = ", res)

    return res

In [26]:
# Feel free to try with your own questions
question1 = "When will I see you?"
question2 = "When can I see you again?"
# True means the questions are duplicates, False otherwise
predict(question1 , question2, 0.7, model, vocab, verbose = True)

Q1  =  tensor([[585,  76,   4,  46,  53,  21,   1,   1]]) 
Q2  =  tensor([[ 585,   33,    4,   46,   53, 7287,   21,    1]])
d   =  0.71987504
res =  True


True

In [27]:
# Feel free to try with your own questions
question1 = "Do they enjoy eating the dessert?"
question2 = "Do they like hiking in the desert?"
# True means the questions are duplicates, False otherwise
predict(question1 , question2, 0.7, model, vocab, verbose=True)

Q1  =  tensor([[  443,  1145,  3158,  1169,    78, 29071,    21,     1]]) 
Q2  =  tensor([[  443,  1145,    60, 15323,    28,    78,  7438,    21]])
d   =  -0.09125066
res =  False


False

In [28]:
# Feel free to try with your own questions
question1 = "What is your name?"
question2 = "Can you tell me your name?"
# True means the questions are duplicates, False otherwise
predict(question1 , question2, 0.7, model, vocab, verbose=True)

Q1  =  tensor([[  30,  156,   56, 1377,   21,    1,    1,    1]]) 
Q2  =  tensor([[ 219,   53, 1593,   20,   56, 1377,   21,    1]])
d   =  0.7301646
res =  True


True

In [40]:
# Feel free to try with your own questions
for index in range(20):
  question1 = Q1_test_words[index]
  question2 = Q2_test_words[index]

  print("Question 1 :",question1)
  print("Question 2 :",question2)
  print("")
  # True means the questions are duplicates, False otherwise
  print("Are these questions Similar : ",predict(question1 , question2, 0.7, model, vocab, verbose=False))
  print("")

Question 1 : How do I prepare for interviews for cse?
Question 2 : What is the best way to prepare for cse?

Are these questions Similar :  False

Question 1 : What is the best bicycle to buy under 10k?
Question 2 : Which is the best bike in in dia to buy in INR 10k?

Are these questions Similar :  True

Question 1 : How do I become Mutual funds distributer for all company mutual funds?
Question 2 : How do I become mutual funds distributor for all company mutual funds?

Are these questions Similar :  True

Question 1 : Will this relationship work?
Question 2 : Relationship: Will this relationship work?

Are these questions Similar :  True

Question 1 : How does Brexit affect India?
Question 2 : Will the GBP/AUD be affected by Brexit?

Are these questions Similar :  False

Question 1 : Is Intel HD graphics card 5500 greater than Geforce 820M 2GB NVIDIA graphics card?
Question 2 : What is the difference between the Nvidia GeForce 820M and the GeForce GT 820M graphics card?

Are these que