In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/mini-wsdm-siamese-network/tensorflow2/mini-version/1/submission.csv
/kaggle/input/mini-wsdm-siamese-network/tensorflow2/mini-version/1/best_siamese_model.keras
/kaggle/input/wsdm-cup-multilingual-chatbot-arena/sample_submission.csv
/kaggle/input/wsdm-cup-multilingual-chatbot-arena/train.parquet
/kaggle/input/wsdm-cup-multilingual-chatbot-arena/test.parquet
/kaggle/input/xlm_roberta/keras/xlm_roberta_base_multi/3/config.json
/kaggle/input/xlm_roberta/keras/xlm_roberta_base_multi/3/tokenizer.json
/kaggle/input/xlm_roberta/keras/xlm_roberta_base_multi/3/metadata.json
/kaggle/input/xlm_roberta/keras/xlm_roberta_base_multi/3/model.weights.h5
/kaggle/input/xlm_roberta/keras/xlm_roberta_base_multi/3/assets/tokenizer/vocabulary.spm


# WSDM Cup - Multilingual Chatbot Arena: Human Preference Prediction

### Approach for Training a Siamese Neural Network on Multilingual Text Data

In this notebook, we train a **Siamese neural network** to predict which of two responses to a given **prompt**, **Response A** or **Response B**, is the better one. The target label is either `model_a` or `model_b`. The approach used in the code is explained step by step below:

---

### 1. **Data Preparation**
- **Dataset**: The dataset provided contains multiple columns, including `prompt`, `response_a`, `response_b`, and `winner`. 
  - `prompt`: A question or context for which we have two potential responses (Response A and Response B).
  - `response_a` and `response_b`: The responses from two different models to the given prompt.
  - `winner`: A label that tells us which response was better, marked as `model_a` or `model_b`.
  
- **Data Preprocessing**:
  - **Concatenation of Prompts and Responses**: For both `response_a` and `response_b`, we concatenate the prompt with each response to prepare the inputs for the Siamese network.
  - **Tokenization**: We use the SentencePiece vocabulary that ships with `xlm-roberta-base`, a multilingual model, to tokenize the text data. This tokenizer handles multiple languages and converts text into integer token IDs that the model can consume.
  - **Max Length**: The maximum input sequence length is fixed at 256 tokens. Longer sequences are truncated and shorter ones are padded at the end, and the same value must be used for training and inference.
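
The truncation and padding step can be illustrated without the tokenizer. The sketch below is a minimal stand-in, assuming the token IDs are already available as Python lists. `pad_post` is a hypothetical helper written for this illustration, and the padding ID of 0 is an assumption that mirrors the default padding value of Keras `pad_sequences`.

```python
import numpy as np

def pad_post(sequences, max_length, pad_id=0):
    """Truncate each sequence to max_length and pad at the end with pad_id."""
    out = np.full((len(sequences), max_length), pad_id, dtype=np.int64)
    for row, seq in enumerate(sequences):
        trimmed = seq[:max_length]          # keep the first max_length tokens
        out[row, :len(trimmed)] = trimmed   # remaining positions stay pad_id
    return out

# One short and one long sequence, padded/truncated to length 4.
batch = pad_post([[5, 6, 7], [8, 9, 10, 11, 12, 13]], max_length=4)
print(batch)
```

Because the prompt comes first in each concatenated input, truncating at the end means a long prompt can push part or all of the response out of the window.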

---

### 2. **Model Architecture**
We use a **Siamese network** architecture, which is typically used to compare two inputs and learn their similarity. The network consists of the following components:

- **Shared Embedding Layer**: The model uses a shared embedding layer to map both inputs (the prompt concatenated with `response_a`, and the prompt concatenated with `response_b`) into the same feature space. This ensures that both inputs are treated equivalently.

- **Bidirectional LSTMs**: We use two stacked layers of **Bidirectional LSTMs (Long Short-Term Memory)**, with 64 and 32 units per direction, to capture dependencies in the sequence. These layers are also shared between the two branches. A bidirectional LSTM processes the input text in both directions (forward and backward) and generates richer representations of the inputs.

- **Dense Layers**: After the LSTMs, the two branch outputs are concatenated and passed through two dense layers (32 and 16 units) with ReLU activations, each followed by dropout at a rate of 0.5.

- **Final Logistic Layer**: The final layer is a single unit with a **sigmoid activation function**. Because the labels are encoded as `model_a` = 0 and `model_b` = 1, its output is the predicted probability that `response_b` is better.
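
To make the weight sharing concrete, here is a minimal NumPy sketch of a Siamese forward pass. It is not the notebook's model: mean pooling with a `tanh` projection stands in for the Bidirectional LSTM stack, the weights are random, and all names and sizes (`encode`, `siamese_forward`, `vocab_size`, and so on) are assumptions chosen for illustration. The point is only that both branches call the same encoder with the same parameters before their outputs are concatenated.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_dim = 50, 8, 4

# Shared parameters: both branches use the same embedding table and projection.
embedding = rng.normal(size=(vocab_size, embed_dim))
w_enc = rng.normal(size=(embed_dim, hidden_dim))
# Head parameters applied to the concatenated branch outputs.
w_out = rng.normal(size=(2 * hidden_dim,))

def encode(token_ids):
    """Shared encoder: embed, mean-pool over tokens, then project with tanh."""
    pooled = embedding[token_ids].mean(axis=1)
    return np.tanh(pooled @ w_enc)

def siamese_forward(ids_a, ids_b):
    """Return one sigmoid score per pair from the two shared branches."""
    combined = np.concatenate([encode(ids_a), encode(ids_b)], axis=1)
    return 1.0 / (1.0 + np.exp(-(combined @ w_out)))

# Three hypothetical pairs of token-ID sequences, six tokens each.
ids_a = rng.integers(1, vocab_size, size=(3, 6))
ids_b = rng.integers(1, vocab_size, size=(3, 6))
scores = siamese_forward(ids_a, ids_b)
print(scores.shape)
```

Note that concatenation is order-sensitive: swapping the two inputs generally changes the score, which is what lets the head learn "A versus B" rather than a symmetric similarity.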

---

### 3. **Model Training Strategy**
- **Multi-GPU Training**: We utilize **TensorFlow's `MirroredStrategy`** to train the model on multiple GPUs, which helps speed up the training process by distributing the workload across available GPUs. This is particularly useful for training large models or handling large datasets.
  
- **Optimization**: The model is optimized using the **Adam optimizer** with binary cross-entropy loss, as the task is a binary classification problem (predicting whether `response_a` or `response_b` is better).

- **Epochs**: The model is trained for 10 epochs with a batch size of 64, holding out 20% of the training data for validation. In each epoch the data is processed in batches and the model's weights are updated to minimize the loss function.
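
As a small worked example of the loss, the sketch below computes mean binary cross-entropy in plain Python. The function name and the clipping constant `eps` are assumptions made for this illustration; framework implementations differ in such details.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over paired labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Label 1 means model_b won, label 0 means model_a won.
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

A confident correct prediction contributes a small loss, while an equally confident wrong one is penalized heavily, which is why the loss is a more sensitive signal than accuracy.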

---

### 4. **Callbacks**
To improve training efficiency and ensure that the best model is saved, we use the following callbacks:

- **ModelCheckpoint**: This callback monitors the validation loss during training and saves the model whenever it improves (i.e., when the validation loss decreases). This ensures that the best version of the model is stored, even if later epochs result in worse performance.
  
- **EarlyStopping (optional)**: Early stopping halts training when the model's performance on the validation set stops improving, which limits overfitting and saves computational resources. In this implementation, an `EarlyStopping` callback (monitoring validation loss with a patience of 3) is defined but not passed to `model.fit`, so only `ModelCheckpoint` is active and training always runs for the full number of epochs.
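
The decision logic behind both callbacks can be sketched in plain Python. This is a simplified simulation, not the Keras implementation: `simulate_callbacks` is a hypothetical helper, and the validation losses are made-up numbers used only to trace the behavior.

```python
def simulate_callbacks(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    stop_epoch = None
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Checkpoint-style step: this is where the model would be saved.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # Early-stopping-style step: halt after `patience` stale epochs.
                stop_epoch = epoch
                break
    return best_epoch, stop_epoch

# Made-up validation losses: the best value is at epoch 1 (0-indexed).
best, stopped = simulate_callbacks([0.70, 0.66, 0.68, 0.69, 0.71, 0.65])
print(best, stopped)
```

In this trace, stopping after three stale epochs skips a later epoch that would have improved further. That trade-off is one reason to rely on checkpointing alone when the epoch budget is small.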

---

### 5. **Prediction**
Once the model is trained, we use it to predict the better response for the **test dataset**. We process the test data in the same way as the training data (tokenizing the inputs and padding them) and use the trained model to classify whether `response_a` or `response_b` is better. The predictions are then saved in the required format for submission.
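
The conversion from sigmoid outputs to submission labels can be sketched as follows. The probabilities are hypothetical values, shaped `(n_samples, 1)` as a single-unit sigmoid head returns them. Flattening before thresholding is a choice made in this sketch so that each row maps to exactly one label.

```python
import numpy as np

# Hypothetical model outputs with shape (n_samples, 1).
probabilities = np.array([[0.12], [0.87], [0.50], [0.51]])

# Flatten to 1-D, then threshold: values above 0.5 are labeled 1 (model_b).
predicted = (probabilities.ravel() > 0.5).astype(int)
winners = np.where(predicted == 0, "model_a", "model_b")
print(winners.tolist())
```

With a strict `>` comparison, a probability of exactly 0.5 falls to `model_a`.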

---

### Summary of the Workflow:
1. **Data Loading & Preprocessing**: Tokenize and prepare the text data (concatenate prompt and responses, handle multi-language input).
2. **Model Building**: Construct a Siamese neural network using LSTMs to compare the two responses.
3. **Training Strategy**: Use multiple GPUs to train the model efficiently and save the best model using `ModelCheckpoint`.
4. **Prediction**: Use the trained model to predict the better response for the test set.

This approach pairs a compact shared-weight architecture with checkpointing on validation loss, so the model used for inference is the one that performed best on held-out data rather than simply the one from the last epoch.


In [2]:
# import numpy as np
# import pandas as pd
# from sentencepiece import SentencePieceProcessor
# import tensorflow as tf
# from tensorflow.keras.layers import Dense, Input, Bidirectional, LSTM, Dropout
# from tensorflow.keras.models import Model
# from sklearn.model_selection import train_test_split
# from sklearn.metrics import classification_report

# # Enable multi-GPU training
# strategy = tf.distribute.MirroredStrategy()

# # Load and preprocess dataset
# data = pd.read_parquet('/kaggle/input/wsdm-cup-multilingual-chatbot-arena/train.parquet')  # Adjust path

# # Assuming the dataset has columns: 'prompt', 'response_a', 'response_b', and 'winner'
# data['winner'] = data['winner'].map({'model_a': 0, 'model_b': 1})

# # Concatenate prompt and responses for tokenization
# data['input_a'] = data['prompt'] + " " + data['response_a']
# data['input_b'] = data['prompt'] + " " + data['response_b']

# # Load local SentencePiece tokenizer
# tokenizer_path = "/kaggle/input/xlm_roberta/keras/xlm_roberta_base_multi/3/assets/tokenizer/vocabulary.spm"
# sp = SentencePieceProcessor()
# sp.load(tokenizer_path)

# # Define maximum sequence length
# max_length = 256  # Adjust as needed

# # Tokenize and pad sequences using SentencePiece
# def tokenize_and_pad_sentencepiece(texts, max_length):
#     # Tokenize and truncate sequences
#     tokenized_texts = [sp.encode(text, out_type=int)[:max_length] for text in texts]
#     # Pad sequences to ensure consistent length
#     padded_texts = tf.keras.preprocessing.sequence.pad_sequences(
#         tokenized_texts, maxlen=max_length, padding='post', truncating='post'
#     )
#     return np.array(padded_texts)

# # Tokenize input data
# tokenized_a = tokenize_and_pad_sentencepiece(data['input_a'].tolist(), max_length)
# tokenized_b = tokenize_and_pad_sentencepiece(data['input_b'].tolist(), max_length)

# # Prepare labels
# labels = data['winner'].values

# # Train-test split
# X_a_train, X_a_test, X_b_train, X_b_test, y_train, y_test = train_test_split(
#     tokenized_a, tokenized_b, labels, test_size=0.2, random_state=42
# )

# # Define the Siamese network
# embedding_dim = 768  # Adjust based on the embedding size of the model

# def create_siamese_network():
#     # Input layers
#     input_a = Input(shape=(max_length,))
#     input_b = Input(shape=(max_length,))
    
#     # Shared embedding layer
#     embedding = tf.keras.layers.Embedding(input_dim=sp.vocab_size(), output_dim=embedding_dim, input_length=max_length)
    
#     # Shared LSTM layers (deeper network)
#     shared_lstm1 = Bidirectional(LSTM(64, return_sequences=True))
#     shared_lstm2 = Bidirectional(LSTM(32, return_sequences=False))
    
#     # Process inputs through shared layers
#     x_a = embedding(input_a)
#     x_a = shared_lstm1(x_a)
#     x_a = shared_lstm2(x_a)
    
#     x_b = embedding(input_b)
#     x_b = shared_lstm1(x_b)
#     x_b = shared_lstm2(x_b)
    
#     # Combine outputs
#     combined = tf.keras.layers.concatenate([x_a, x_b])
#     combined = Dense(32, activation='relu')(combined)
#     combined = Dropout(0.5)(combined)
#     combined = Dense(16, activation='relu')(combined)
#     combined = Dropout(0.5)(combined)
    
#     # Logistic output
#     output = Dense(1, activation='sigmoid')(combined)
    
#     # Define the model
#     model = Model(inputs=[input_a, input_b], outputs=output)
#     return model

# # Compile and train the model within the distribution strategy
# with strategy.scope():
#     model = create_siamese_network()
#     model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# # Callbacks for early stopping and model checkpointing
# early_stopping = tf.keras.callbacks.EarlyStopping(
#     monitor='val_loss', patience=3, restore_best_weights=True
# )

# checkpoint = tf.keras.callbacks.ModelCheckpoint(
#     'best_siamese_model.keras',
#     monitor='val_loss',
#     save_best_only=True,
#     mode='min',
#     save_weights_only=False
# )

# # Train the model
# history = model.fit(
#     [X_a_train, X_b_train],
#     y_train,
#     epochs=10,  # Adjust as needed
#     batch_size=64,  # Distributed across GPUs
#     validation_data=([X_a_test, X_b_test], y_test),
#     callbacks=[checkpoint]
# )

# # Evaluate the model
# loss, accuracy = model.evaluate([X_a_test, X_b_test], y_test)
# print(f"Test Loss: {loss:.4f}, Test Accuracy: {accuracy:.4f}")

# # Predict and evaluate
# predictions = (model.predict([X_a_test, X_b_test]) > 0.5).astype(int)
# print(classification_report(y_test, predictions, target_names=['model_a', 'model_b']))


In [3]:
# import matplotlib.pyplot as plt

# # Plot training accuracy
# plt.plot(history.history['accuracy'], label='Train Accuracy')
# plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
# plt.xlabel('Epochs')
# plt.ylabel('Accuracy')
# plt.legend()
# plt.show()


In [4]:
import numpy as np
import pandas as pd
from sentencepiece import SentencePieceProcessor
import tensorflow as tf
from tensorflow.keras.models import load_model

# Load the test data
test_data = pd.read_parquet('/kaggle/input/wsdm-cup-multilingual-chatbot-arena/test.parquet')  # Adjust path

model = load_model('/kaggle/input/mini-wsdm-siamese-network/tensorflow2/mini-version/1/best_siamese_model.keras')
# Concatenate prompt and responses for tokenization
test_data['input_a'] = test_data['prompt'] + " " + test_data['response_a']
test_data['input_b'] = test_data['prompt'] + " " + test_data['response_b']

# Load local SentencePiece tokenizer
tokenizer_path = "/kaggle/input/xlm_roberta/keras/xlm_roberta_base_multi/3/assets/tokenizer/vocabulary.spm"
sp = SentencePieceProcessor()
sp.load(tokenizer_path)

# Define maximum sequence length
max_length = 256  # Ensure it matches the length used during training

# Tokenize and pad sequences using SentencePiece
def tokenize_and_pad_sentencepiece(texts, max_length):
    # Tokenize and truncate sequences
    tokenized_texts = [sp.encode(text, out_type=int)[:max_length] for text in texts]
    # Pad sequences to ensure consistent length
    padded_texts = tf.keras.preprocessing.sequence.pad_sequences(
        tokenized_texts, maxlen=max_length, padding='post', truncating='post'
    )
    return np.array(padded_texts)

# Tokenize test data
tokenized_test_a = tokenize_and_pad_sentencepiece(test_data['input_a'].tolist(), max_length)
tokenized_test_b = tokenize_and_pad_sentencepiece(test_data['input_b'].tolist(), max_length)

# Predict winners
predictions = model.predict([tokenized_test_a, tokenized_test_b], batch_size=32)
predicted_labels = (predictions > 0.5).astype(int).ravel()  # Threshold at 0.5 and flatten (n, 1) to (n,)

# Map predictions back to model names
test_data['winner'] = np.where(predicted_labels == 0, 'model_a', 'model_b')

# Prepare submission file
submission = test_data[['id', 'winner']]
submission.to_csv('submission.csv', index=False)

print("Submission file created: 'submission.csv'")


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 3s/step
Submission file created: 'submission.csv'
