<a href="https://colab.research.google.com/github/subhashpolisetti/AutoGluon_ML_End-to-End_Implementations_Part-2/blob/main/5_Text_Similarity_Matching_with_AutoGluon.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Text Similarity Matching with AutoGluon**

This notebook demonstrates how to use **AutoGluon** for **text similarity tasks**. It provides an end-to-end pipeline for loading data, training models, evaluating performance, and generating predictions. The task focuses on identifying semantic similarity between pairs of text.

---

## **Key Objectives**
1. Train a binary text-matching model on the **SNLI (Stanford Natural Language Inference)** dataset, and configure a regression predictor for a custom similarity-score dataset.
2. Evaluate the trained model with **AUC**; **RMSE** is the metric configured for the regression predictor.
3. Generate similarity predictions and probabilities for new text pairs.
4. Extract embeddings for deeper insights into text representations.

---

## **Steps Covered**

### **1. Installation and Setup**
- Install necessary libraries, including `autogluon` and compatible PyTorch versions.

### **2. Dataset Preparation**
- **SNLI Dataset**:
  - A standard dataset of sentence pairs (`premise`, `hypothesis`) with a binary `label`; here `1` marks pairs that share the same meaning and `0` marks pairs that do not.
  - Load training and testing datasets using `load_pd`.
- **Custom Dataset**:
  - Example data includes text pairs and similarity scores for regression tasks.
  - Preview the data structure for better understanding.
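
The custom dataset file itself is not included with the notebook, so the following is a minimal sketch of how a file with the same three columns (`text1`, `text2`, `similarity`) could be built and checked. The rows are copied from the preview printed later in this notebook; the helper names and the assumption that scores lie in the range 0 to 1 are illustrative, not something the notebook states.

```python
import csv
import io

# Rows copied from the data preview printed later in this notebook.
ROWS = [
    ("The cat sat on the mat", "The feline rested on the carpet", 0.90),
    ("The car is red", "The bike is blue", 0.10),
    ("The quick brown fox", "The fast orange fox", 0.80),
    ("The sun is shining", "It is a sunny day", 0.85),
    ("A dog is barking", "The puppy is barking loudly", 0.95),
]


def build_similarity_csv(rows):
    """Serialize (text1, text2, similarity) rows to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["text1", "text2", "similarity"])
    writer.writerows(rows)
    return buffer.getvalue()


def validate_similarity_csv(csv_text):
    """Parse CSV text and check each score; the [0, 1] range is an assumption."""
    records = list(csv.DictReader(io.StringIO(csv_text)))
    for record in records:
        score = float(record["similarity"])
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"similarity out of range: {score}")
    return records


csv_text = build_similarity_csv(ROWS)
records = validate_similarity_csv(csv_text)
print(len(records), "rows, columns:", list(records[0].keys()))
```

Writing `csv_text` to a file would give something `pd.read_csv` can load in the same way as the notebook's own data file.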

### **3. Model Configuration**
- **AutoGluon MultiModalPredictor**:
  - Configure the predictor for two types of tasks:
    1. **Text Similarity (Binary Matching)**:
       - Use the SNLI dataset to classify whether two texts have similar meanings.
    2. **Similarity Scoring (Regression)**:
       - Use a custom dataset to predict the degree of similarity between two texts.

### **4. Model Training**
- Train the model on the **SNLI training dataset** for binary text similarity classification.
- Use a time limit to constrain training duration (adjustable based on resources).

### **5. Model Evaluation**
- Evaluate the model's performance on the SNLI test dataset.
- Metrics:
  - **AUC (Area Under the Curve)** for binary classification.
  - **RMSE (Root Mean Squared Error)** for regression tasks.
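
To make the two metrics concrete, here is a small pure-Python sketch of how each can be computed by hand. It is not AutoGluon's implementation; the function names and the toy labels and scores are made up for illustration.

```python
import math


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def rmse(targets, predictions):
    """Root of the mean squared difference between targets and predictions."""
    if len(targets) != len(predictions) or not targets:
        raise ValueError("targets and predictions must be non-empty and equally long")
    squared_errors = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy values, not taken from the notebook's results.
toy_auc = roc_auc([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6])
toy_rmse = rmse([0.9, 0.1], [0.8, 0.3])
print(f"AUC = {toy_auc:.2f}, RMSE = {toy_rmse:.3f}")
```

AUC depends only on how the scores rank matching pairs against non-matching ones, while RMSE penalizes the size of each error in the predicted score, which is why the former suits binary matching and the latter suits similarity regression.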

### **6. Predictions and Probabilities**
- Generate similarity predictions for new text pairs.
- Output probabilities indicating the likelihood of semantic similarity.
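
The relationship between class probabilities and a predicted label can be sketched as follows. The probability values are the ones printed by `predict_proba` later in this notebook; the helper names and the 0.5 default threshold are assumptions for illustration, not a description of how AutoGluon decides internally.

```python
def most_likely_label(class_probabilities):
    """Return the label whose probability is highest."""
    return max(class_probabilities, key=class_probabilities.get)


def is_match(class_probabilities, match_label=1, threshold=0.5):
    """Treat the pair as a match when the match-label probability reaches the threshold."""
    return class_probabilities[match_label] >= threshold


# Probabilities for labels 0 and 1, copied from the notebook's predict_proba output.
row = {0: 0.207242, 1: 0.792758}

label = most_likely_label(row)
print("label:", label)
print("match at 0.5:", is_match(row))
print("match at 0.8:", is_match(row, threshold=0.8))
```

Raising the threshold trades recall for precision: the same pair counts as a match at 0.5 but not at 0.8.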

### **7. Embedding Extraction**
- Extract embeddings for text pairs to visualize and analyze semantic representations.
- Embeddings can be used for downstream tasks or deeper model insights.
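
One common way to compare two extracted embeddings is cosine similarity. The sketch below is a plain-Python version with short made-up vectors; the notebook's real embeddings have 384 dimensions, and the function name here is illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors standing in for real embeddings.
same_direction = cosine_similarity([1.0, 0.0, 1.0], [2.0, 0.0, 2.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

With real embeddings, each row returned by `extract_embedding` could be passed to this function as a list of floats.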

---

## **Key Features**
- Demonstrates binary text similarity classification and regression for similarity scoring.
- Uses AutoGluon’s multimodal capabilities for handling text data efficiently.
- Provides a clear pipeline for text-based machine learning tasks, including model evaluation, prediction, and embedding extraction.

---

## **Example Applications**
- **Semantic Search**: Find documents or passages similar to a query.
- **Duplicate Detection**: Identify duplicate or near-duplicate content in large text corpora.
- **Text Summarization**: Match summaries with their original texts for validation.

---

This notebook serves as a practical guide for implementing text similarity tasks using **AutoGluon**, with easy-to-follow steps for training, evaluation, and prediction.


In [None]:
!pip install autogluon

Collecting autogluon
  Downloading autogluon-1.1.1-py3-none-any.whl.metadata (11 kB)
Collecting autogluon.core==1.1.1 (from autogluon.core[all]==1.1.1->autogluon)
  Downloading autogluon.core-1.1.1-py3-none-any.whl.metadata (11 kB)
Collecting autogluon.features==1.1.1 (from autogluon)
  Downloading autogluon.features-1.1.1-py3-none-any.whl.metadata (11 kB)
Collecting autogluon.tabular==1.1.1 (from autogluon.tabular[all]==1.1.1->autogluon)
  Downloading autogluon.tabular-1.1.1-py3-none-any.whl.metadata (13 kB)
Collecting autogluon.multimodal==1.1.1 (from autogluon)
  Downloading autogluon.multimodal-1.1.1-py3-none-any.whl.metadata (12 kB)
Collecting autogluon.timeseries==1.1.1 (from autogluon.timeseries[all]==1.1.1->autogluon)
  Downloading autogluon.timeseries-1.1.1-py3-none-any.whl.metadata (12 kB)
Collecting scipy<1.13,>=1.5.4 (from autogluon.core==1.1.1->autogluon.core[all]==1.1.1->autogluon)
  Downloading scipy-1.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metad

In [None]:
!pip install torch==2.0.0+cu118 torchvision==0.15.1+cu118 torchaudio==2.0.0+cu118 --index-url https://download.pytorch.org/whl/cu118

Looking in indexes: https://download.pytorch.org/whl/cu118
Collecting torch==2.0.0+cu118
  Downloading https://download.pytorch.org/whl/cu118/torch-2.0.0%2Bcu118-cp310-cp310-linux_x86_64.whl (2267.3 MB)
Collecting torchvision==0.15.1+cu118
  Downloading https://download.pytorch.org/whl/cu118/torchvision-0.15.1%2Bcu118-cp310-cp310-linux_x86_64.whl (6.1 MB)
Collecting torchaudio==2.0.0+cu118
  Downloading https://download.pytorch.org/whl/cu118/torchaudio-2.0.0%2Bcu118-cp310-cp310-linux_x86_64.whl (4.4 MB)
Collecting triton==2.0.0 (from torch==2.0.0+cu118)
  Downloading https://download.pytorch.org/whl/triton-2.0.0-1-cp310-cp31

In [None]:
from autogluon.core.utils.loaders import load_pd
import pandas as pd

snli_train = load_pd.load('https://automl-mm-bench.s3.amazonaws.com/snli/snli_train.csv', delimiter="|")
snli_test = load_pd.load('https://automl-mm-bench.s3.amazonaws.com/snli/snli_test.csv', delimiter="|")
snli_train.head()

Loaded data from: https://automl-mm-bench.s3.amazonaws.com/snli/snli_train.csv | Columns = 3 / 3 | Rows = 366603 -> 366603
Loaded data from: https://automl-mm-bench.s3.amazonaws.com/snli/snli_test.csv | Columns = 3 / 3 | Rows = 6605 -> 6605


Unnamed: 0,premise,hypothesis,label
0,A person on a horse jumps over a broken down a...,"A person is at a diner , ordering an omelette .",0
1,A person on a horse jumps over a broken down a...,"A person is outdoors , on a horse .",1
2,Children smiling and waving at camera,There are children present,1
3,Children smiling and waving at camera,The kids are frowning,0
4,A boy is jumping on skateboard in the middle o...,The boy skates down the sidewalk .,0


In [None]:
data = pd.read_csv('/content/text_similarity_data.csv')

# Display the first few rows of your data
print(data.head())

                    text1                            text2  similarity
0  The cat sat on the mat  The feline rested on the carpet        0.90
1          The car is red                 The bike is blue        0.10
2     The quick brown fox              The fast orange fox        0.80
3      The sun is shining                It is a sunny day        0.85
4        A dog is barking      The puppy is barking loudly        0.95


In [None]:
from autogluon.multimodal import MultiModalPredictor

# Regression predictor for the custom dataset; it is only configured here, not fitted.
predictor = MultiModalPredictor(
    problem_type='regression',  # Predict a continuous similarity score
    label='similarity',  # Target column
    eval_metric='rmse'  # Root Mean Squared Error
)

In [None]:
from autogluon.multimodal import MultiModalPredictor

# Initialize the model
predictor = MultiModalPredictor(
    problem_type="text_similarity",
    query="premise",  # column name of the first sentence
    response="hypothesis",  # column name of the second sentence
    label="label",  # label column name
    match_label=1,  # label value meaning query and response have the same meaning
    eval_metric='auc',  # evaluation metric
)

# Fit the model
predictor.fit(
    train_data=snli_train,
    time_limit=180,
)

No path specified. Models will be saved in: "AutogluonModels/ag-20240923_015729"
AutoGluon Version:  1.1.1
Python Version:     3.10.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP PREEMPT_DYNAMIC Thu Jun 27 21:05:47 UTC 2024
CPU Count:          12
Pytorch Version:    2.0.0+cu118
CUDA Version:       11.8
Memory Avail:       74.85 GB / 83.48 GB (89.7%)
Disk Space Avail:   159.69 GB / 235.68 GB (67.8%)
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [0, 1]
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])

AutoMM starts to create your model. ✨✨✨

To track the learning progress, you can open a terminal and launch Tensorboard:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /content/Au

GPU Count: 1
GPU Count to be Used: 1
GPU 0 Name: NVIDIA A100-SXM4-40GB
GPU 0 Memory: 1.88GB/40.0GB (Used/Total)

INFO: Using 16bit Automatic Mixed Precision (AMP)
INFO: GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO: HPU available: False, using: 0 HPUs
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: 
  | Name              | Type                         | Params | Mode 
---------------------------------------------------------------------------
0 | query_model       | HFAutoModelForTextPrediction | 33.4 M | train
1 | response_model    | HFAutoModelForTextPrediction | 33.4 M | train
2 | validation_metric | BinaryAUROC                  | 0      | train
3 | loss_func         | ContrastiveLoss              | 0      | train
4 | miner_func        | PairMarginMiner              | 0      | train
---------------------------------------------------------------------------
33.4 M    Trainable params
0         Non-trainable params
33.4 M    Total pa

INFO: Time limit reached. Elapsed time is 0:03:00. Signaling Trainer to stop.


INFO: Epoch 0, global step 122: 'val_roc_auc' reached 0.89070 (best 0.89070), saving model to '/content/AutogluonModels/ag-20240923_015729/epoch=0-step=122.ckpt' as top 3
Start to fuse 1 checkpoints via the greedy soup algorithm.

AutoMM has created your model. 🎉🎉🎉

To load the model, use the code below:
    ```python
    from autogluon.multimodal import MultiModalPredictor
    predictor = MultiModalPredictor.load("/content/AutogluonModels/ag-20240923_015729")
    ```

If you are not satisfied with the model, try to increase the training time, 
adjust the hyperparameters (https://auto.gluon.ai/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub (https://github.com/autogluon/autogluon/issues).




<autogluon.multimodal.predictor.MultiModalPredictor at 0x7e93c83056f0>

In [None]:
score = predictor.evaluate(snli_test)
print("evaluation score: ", score)


evaluation score:  {'roc_auc': 0.9054827935898537}


In [None]:
pred_data = pd.DataFrame.from_dict({"premise":["The teacher gave his speech to an empty room."],
                                    "hypothesis":["There was almost nobody when the professor was talking."]})

predictions = predictor.predict(pred_data)
print('Predicted label:', predictions[0])

Predicted label: 1


In [None]:
probabilities = predictor.predict_proba(pred_data)
print(probabilities)


          0         1
0  0.207242  0.792758


In [None]:
embeddings_1 = predictor.extract_embedding({"premise":["The teacher gave his speech to an empty room."]})
print(embeddings_1.shape)
embeddings_2 = predictor.extract_embedding({"hypothesis":["There was almost nobody when the professor was talking."]})
print(embeddings_2.shape)

(1, 384)
(1, 384)
