Next, we venture into Siamese Networks, specialized architectures designed for comparing and understanding relationships in textual data.

**A Siamese network is a type of neural network architecture designed for tasks involving similarity measurement or finding relationships between input samples. It's named after Siamese twins because the network structure involves two identical subnetworks (or twins) that share the same architecture and weights.**

**The main purpose of a Siamese network** is to learn embeddings (representations) of the input data such that similar inputs are mapped close together in the embedding space while dissimilar inputs end up farther apart. It is commonly used in tasks such as:

- Signature Verification: Determining if two signatures belong to the same person.
- Face Recognition: Comparing faces to verify identity.
- Similarity Matching: Comparing texts, images, or other data to find similarities.

The network takes pairs of inputs and learns to output a similarity score or a distance metric that quantifies how similar the inputs are. During training, the network's parameters (weights and biases) are adjusted to minimize the distance between similar pairs and maximize the distance between dissimilar pairs.
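As a rough illustration of what "distance between embeddings" means, the sketch below compares made-up 3-dimensional vectors with two common measures, Euclidean distance and cosine similarity. The vectors and function names are hypothetical and are not produced by the model in this notebook.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, invented purely for illustration
emb_a = [1.0, 0.0, 2.0]
emb_b = [1.0, 0.5, 2.0]    # close to emb_a
emb_c = [-2.0, 3.0, 0.0]   # far from emb_a

print(euclidean_distance(emb_a, emb_b))  # small distance: similar pair
print(euclidean_distance(emb_a, emb_c))  # larger distance: dissimilar pair
print(cosine_similarity(emb_a, emb_b))   # close to 1
print(cosine_similarity(emb_a, emb_c))   # negative
```

A trained Siamese network aims to produce embeddings for which the similar pair behaves like `emb_a` and `emb_b`, and the dissimilar pair like `emb_a` and `emb_c`.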

Siamese networks are often trained with distance-based loss functions such as contrastive loss or triplet loss. Contrastive loss works on pairs: it pulls the embeddings of similar pairs together and pushes dissimilar pairs apart until they are at least a margin away from each other. Triplet loss works with three examples at a time: an anchor, a positive example (similar to the anchor), and a negative example (dissimilar to the anchor), and it requires the anchor to be closer to the positive than to the negative by a margin.
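To make the two losses concrete, here is a minimal sketch of one common formulation of each, written for a single pair or triplet whose distances are already computed. The function names and the default `margin=1.0` are assumptions chosen for illustration. Note that the model trained later in this notebook uses binary cross-entropy instead of either loss.

```python
def contrastive_loss(distance, is_similar, margin=1.0):
    """One common form of contrastive loss for a single pair.

    Similar pairs are penalised by their squared distance; dissimilar
    pairs are penalised only while they are closer than `margin`.
    """
    if is_similar:
        return distance ** 2
    return max(0.0, margin - distance) ** 2

def triplet_loss(dist_anchor_positive, dist_anchor_negative, margin=1.0):
    """Triplet loss for one (anchor, positive, negative) triplet.

    Zero once the negative is at least `margin` farther from the
    anchor than the positive is.
    """
    return max(0.0, dist_anchor_positive - dist_anchor_negative + margin)

print(contrastive_loss(0.5, is_similar=True))    # 0.25
print(contrastive_loss(0.5, is_similar=False))   # 0.25
print(contrastive_loss(2.0, is_similar=False))   # 0.0 (already beyond the margin)
print(triplet_loss(0.2, 1.5))                    # 0.0 (negative is far enough away)
print(triplet_loss(1.0, 0.5))                    # 1.5 (negative is closer than positive)
```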

=> Overall, Siamese networks are valuable in learning representations that capture the essence of similarity between pairs of data points, enabling various applications in tasks that involve measuring similarity or dissimilarity.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import jaccard_score

In [None]:
df = pd.read_csv('/kaggle/input/quora-question-pairs/train.csv.zip')
df.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [None]:
df.isnull().sum()
df.dropna(inplace=True)
df.shape

(404287, 6)

In [None]:
# Remove the punctuation from the questions and apply some filters to the data
import string
from multiprocessing import Pool, cpu_count

def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

def convert_to_lower(text):
    return text.lower()

def remove_contractions(text):
    # Define a list of common contractions and their expanded forms
    contractions = {
        "don't": "do not",
        "won't": "will not",
        "can't": "cannot",
        # Add more contractions and their expansions as needed
    }
    for contraction, expansion in contractions.items():
        text = text.replace(contraction, expansion)
    return text

def replace_currency_symbols(text):
    # Define a dictionary mapping currency symbols to currency names
    currency_symbols = {
        "$": "USD",
        "€": "EUR",
        "£": "GBP",

    }
    for symbol, currency_name in currency_symbols.items():
        text = text.replace(symbol, currency_name)
    return text

def remove_hyperlinks(text):
    # Remove URLs and hyperlinks using regular expression
    text = re.sub(r'http\S+', '', text)
    return text

def remove_html_tags(text):
    # Remove HTML tags using regular expression
    text = re.sub(r'<.*?>', '', text)
    return text

def clean_text(text):
    # Order matters: URLs, HTML tags, contractions and the "$" symbol all
    # depend on punctuation, so punctuation is stripped last.
    text = convert_to_lower(text)
    text = remove_hyperlinks(text)
    text = remove_html_tags(text)
    text = remove_contractions(text)
    text = replace_currency_symbols(text)
    return remove_punctuation(text)

def process_column(column):
    # Apply the whole cleaning chain in a single parallel pass over the column
    with Pool(cpu_count()) as pool:
        return pool.map(clean_text, column)
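Step order matters in a cleaning pipeline like this one. The self-contained sketch below uses simplified stand-ins for the helpers (the function names and the sample string are made up) to compare stripping punctuation first against stripping it last.

```python
import re
import string

def strip_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

def strip_html_tags(text):
    return re.sub(r'<.*?>', '', text)

def expand_contractions(text):
    return text.replace("don't", "do not")

sample = "<b>don't</b> panic"

# Punctuation removed first: the tag brackets and the apostrophe are gone,
# so the later steps have nothing left to match.
punctuation_first = expand_contractions(strip_html_tags(strip_punctuation(sample)))

# Punctuation removed last: tags and contractions are handled while intact.
punctuation_last = strip_punctuation(expand_contractions(strip_html_tags(sample)))

print(punctuation_first)  # bdontb panic
print(punctuation_last)   # do not panic
```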

In [None]:
columns_to_drop = ['id', 'qid1', 'qid2']
df = df.drop(columns=columns_to_drop)
df['question1'] = process_column(df['question1'])
df['question2'] = process_column(df['question2'])
df.head()

Unnamed: 0,question1,question2,is_duplicate
0,what is the step by step guide to invest in sh...,what is the step by step guide to invest in sh...,0
1,what is the story of kohinoor kohinoor diamond,what would happen if the indian government sto...,0
2,how can i increase the speed of my internet co...,how can internet speed be increased by hacking...,0
3,why am i mentally very lonely how can i solve it,find the remainder when math2324math is divide...,0
4,which one dissolve in water quikly sugar salt ...,which fish would survive in salt water,0


In [None]:
import tensorflow as tf
from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Lambda, Dense
from keras import backend as K
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# 'question1' and 'question2' are the input questions;
# 'is_duplicate' is the binary label marking whether the pair is a duplicate.
questions1 = df['question1']
questions2 = df['question2']
labels = df['is_duplicate']

# Guard against missing values (rows with NaNs were already dropped earlier)
questions1 = questions1.fillna('')
questions2 = questions2.fillna('')

# Tokenize the questions
max_sequence_length = 80
embedding_dim = 300

# Fit one tokenizer on both question columns so they share a vocabulary
tokenizer = Tokenizer()
tokenizer.fit_on_texts(pd.concat([questions1, questions2]).astype(str))

# Word indices start at 1; index 0 is reserved for padding,
# so the embedding needs one more row than there are words.
vocabulary_size = len(tokenizer.word_index) + 1

sequences1 = tokenizer.texts_to_sequences(questions1)
sequences2 = tokenizer.texts_to_sequences(questions2)
padded_sequences1 = pad_sequences(sequences1, maxlen=max_sequence_length)
padded_sequences2 = pad_sequences(sequences2, maxlen=max_sequence_length)

input_layer1 = Input(shape=(max_sequence_length,))
input_layer2 = Input(shape=(max_sequence_length,))

embedding_layer = Embedding(input_dim=vocabulary_size, output_dim=embedding_dim)

lstm_layer = LSTM(units=50)

x1 = embedding_layer(input_layer1)
x1 = lstm_layer(x1)

x2 = embedding_layer(input_layer2)
x2 = lstm_layer(x2)

# Element-wise |x1 - x2|: one value per LSTM unit, so the output has 50 features
distance_layer = Lambda(lambda x: tf.abs(x[0] - x[1]))([x1, x2])

output_layer = Dense(units=1, activation='sigmoid')(distance_layer)

siamese_model = Model(inputs=[input_layer1, input_layer2], outputs=output_layer)

siamese_model.compile(optimizer=Adam(0.0001), loss='binary_crossentropy', metrics=['accuracy'])
callbacks = [
    EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True),
    ModelCheckpoint(filepath='siamese_model_weights.h5', save_best_only=True)
]
siamese_model.fit([padded_sequences1, padded_sequences2], labels, epochs=5, batch_size=32, validation_split=0.2, callbacks=callbacks)

Epoch 1/5

Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x7af9c2d79ed0>
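To see what the final two layers of the model above compute, the sketch below reproduces the same arithmetic in plain Python: the element-wise absolute difference of the two encodings, followed by a single sigmoid unit. The encodings, weights, and bias are made-up numbers for illustration, not values learned by `siamese_model`.

```python
import math

def duplicate_probability(h1, h2, weights, bias):
    """sigmoid(w . |h1 - h2| + b) for two encoded questions."""
    abs_diff = [abs(a - b) for a, b in zip(h1, h2)]
    logit = sum(w * d for w, d in zip(weights, abs_diff)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical 2-dimensional encodings and parameters
weights = [-1.0, -1.0]   # negative weights: larger differences lower the score
bias = 0.0

same = duplicate_probability([0.3, 0.7], [0.3, 0.7], weights, bias)
different = duplicate_probability([0.0, 0.0], [1.0, 1.0], weights, bias)

print(same)       # 0.5: identical encodings leave only the bias term
print(different)  # below 0.5, since the logit is -2.0
```

Because the two encodings come from the same embedding and LSTM layers, swapping the order of the questions leaves the absolute difference, and therefore the prediction, unchanged.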