# AutoGluon Text Samples (Sentiment & Similarity)

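Two small runs on synthetic data that smoke-test AutoGluon's `MultiModalPredictor` on text classification and text matching. The datasets have only eight rows each, so the reported metrics show that the pipeline runs end to end, not how good the models are.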

In [1]:
from autogluon.multimodal import MultiModalPredictor
import pandas as pd
from pathlib import Path

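time_limit = 30  # training budget per predictor, in seconds
try:
    HERE = Path(__file__).resolve().parent
except NameError:
    # __file__ is undefined inside a notebook; fall back to the working directory
    HERE = Path('.')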


## Sentiment Classification (binary)

In [2]:
sentences = [
    ("I loved the new phone, it is fantastic!", "positive"),
    ("This laptop is awful and slow", "negative"),
    ("Great battery life and awesome screen", "positive"),
    ("Terrible customer service experience", "negative"),
    ("The movie was delightful and fun", "positive"),
    ("Food was cold and tasted bad", "negative"),
    ("Outstanding camera quality", "positive"),
    ("Horrible sound quality", "negative"),
]
sent_df = pd.DataFrame(sentences, columns=["text", "label"])
train_df = sent_df.sample(frac=0.75, random_state=42)
test_df = sent_df.drop(train_df.index)
train_df, test_df

(                                      text     label
 1            This laptop is awful and slow  negative
 5             Food was cold and tasted bad  negative
 0  I loved the new phone, it is fantastic!  positive
 7                   Horrible sound quality  negative
 2    Great battery life and awesome screen  positive
 4         The movie was delightful and fun  positive,
                                    text     label
 3  Terrible customer service experience  negative
 6            Outstanding camera quality  positive)
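The split above uses plain random sampling, which here happens to leave one positive and one negative row in the test set. With a different seed, both test rows could share a label, and ROC AUC would then have nothing to rank. A minimal, dependency-free sketch of a stratified split follows; the helper name `stratified_split` and the toy rows are illustrative assumptions, not part of the original notebook.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_index=1, train_frac=0.75, seed=42):
    """Split rows into (train, test), sampling each label group separately."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_index]].append(row)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label, key=str):
        group = list(by_label[label])
        rng.shuffle(group)
        cut = round(len(group) * train_frac)
        if len(group) > 1:
            # Keep at least one row of this label on each side of the split.
            cut = max(1, min(cut, len(group) - 1))
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Hypothetical toy rows in the same (text, label) shape as `sentences`.
toy = [
    ("good", "positive"), ("great", "positive"),
    ("fine", "positive"), ("nice", "positive"),
    ("bad", "negative"), ("poor", "negative"),
    ("awful", "negative"), ("slow", "negative"),
]
train_rows, test_rows = stratified_split(toy)
print(len(train_rows), len(test_rows))
```

With pandas, `sent_df.groupby("label").sample(frac=0.75, random_state=42)` should give a comparable per-label sample without a custom helper.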

In [3]:
sent_predictor = MultiModalPredictor(
    label="label",
    problem_type="classification",
    path=str(HERE / "ag_text_models" / "sentiment"),
)
# NOTE: test_df doubles as the validation set, so the score below is not a true held-out estimate.
sent_predictor.fit(train_df, tuning_data=test_df, time_limit=time_limit)
sent_metrics = sent_predictor.evaluate(test_df)
sent_preds = sent_predictor.predict(test_df)
sent_metrics

AutoGluon Version:  1.1.1
Python Version:     3.11.14
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 25.1.0: Mon Oct 20 19:32:41 PDT 2025; root:xnu-12377.41.6~2/RELEASE_ARM64_T6000
CPU Count:          8
Pytorch Version:    2.3.1
CUDA Version:       CUDA is not available
Memory Avail:       6.96 GB / 16.00 GB (43.5%)
Disk Space Avail:   264.13 GB / 460.43 GB (57.4%)


AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).


	2 unique label values:  ['negative', 'positive']


	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])



AutoMM starts to create your model. ✨✨✨

To track the learning progress, you can open a terminal and launch Tensorboard:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /Users/varad/Projects/CMPE-255-Sec-47-Data-Mining/Assignment 6/AutoGluon/extra-credit/text/ag_text_models/sentiment
    ```



Seed set to 0




GPU Count: 0
GPU Count to be Used: 0



GPU available: True (mps), used: True


TPU available: False, using: 0 TPU cores


HPU available: False, using: 0 HPUs



  | Name              | Type                         | Params | Mode 
---------------------------------------------------------------------------
0 | model             | HFAutoModelForTextPrediction | 108 M  | train
1 | validation_metric | BinaryAUROC                  | 0      | train
2 | loss_func         | CrossEntropyLoss             | 0      | train
---------------------------------------------------------------------------
108 M     Trainable params
0         Non-trainable params
108 M     Total params
435.573   Total estimated model params size (MB)


/Users/varad/Projects/CMPE-255-Sec-47-Data-Mining/.venv/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py:298: The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=10). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.


Epoch 0, global step 1: 'val_roc_auc' reached 1.00000 (best 1.00000), saving model to '/Users/varad/Projects/CMPE-255-Sec-47-Data-Mining/Assignment 6/AutoGluon/extra-credit/text/ag_text_models/sentiment/epoch=0-step=1.ckpt' as top 3


AutoMM has created your model. 🎉🎉🎉

To load the model, use the code below:
    ```python
    from autogluon.multimodal import MultiModalPredictor
    predictor = MultiModalPredictor.load("/Users/varad/Projects/CMPE-255-Sec-47-Data-Mining/Assignment 6/AutoGluon/extra-credit/text/ag_text_models/sentiment")
    ```

If you are not satisfied with the model, try to increase the training time, 
adjust the hyperparameters (https://auto.gluon.ai/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub (https://github.com/autogluon/autogluon/issues).




{'roc_auc': 1.0}
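A `roc_auc` of 1.0 carries little information here: the test set holds one positive and one negative row, so the metric can only take the values 0, 0.5, or 1. The sketch below computes ROC AUC as the fraction of (positive, negative) pairs that are ranked correctly, with ties counted as half. The function `roc_auc` and the example scores are illustrative assumptions, not AutoGluon's implementation.

```python
def roc_auc(labels, scores, positive=1):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == positive]
    neg = [s for y, s in zip(labels, scores) if y != positive]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive scored above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Every outcome reachable with a single positive and a single negative row.
two_row_outcomes = sorted(
    {roc_auc([1, 0], [p, n]) for p in (0.2, 0.5, 0.8) for n in (0.2, 0.5, 0.8)}
)
print(two_row_outcomes)  # [0.0, 0.5, 1.0]
```

A larger held-out set gives the metric more pairs to rank and therefore a finer-grained score.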

## Sentence Similarity (binary matching)

In [4]:
pairs = [
    ("The cat sat on the mat", "A cat is sitting on a mat", 1),
    ("The sky is blue today", "It might rain tomorrow", 0),
    ("He is reading a book", "A man reads a novel", 1),
    ("She loves pizza", "He dislikes sushi", 0),
    ("Dogs are great pets", "Cats are wonderful companions", 0),
    ("Python is a programming language", "Snakes are reptiles", 0),
    ("New York is a big city", "NYC has many skyscrapers", 1),
    ("The concert was amazing", "The show was terrible", 0),
]
pair_df = pd.DataFrame(pairs, columns=["query", "response", "label"])
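match_predictor = MultiModalPredictor(
    label="label",
    query="query",
    response="response",
    match_label=1,  # the label value that marks a matching pair, not the column name
    path=str(HERE / "ag_text_models" / "matching"),
)
match_predictor.fit(pair_df, time_limit=time_limit)
# NOTE: evaluated on the same pairs used for training, so this score is optimistic.
match_metrics = match_predictor.evaluate(pair_df)
match_metrics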

AutoGluon Version:  1.1.1
Python Version:     3.11.14
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 25.1.0: Mon Oct 20 19:32:41 PDT 2025; root:xnu-12377.41.6~2/RELEASE_ARM64_T6000
CPU Count:          8
Pytorch Version:    2.3.1
CUDA Version:       CUDA is not available
Memory Avail:       4.83 GB / 16.00 GB (30.2%)
Disk Space Avail:   263.78 GB / 460.43 GB (57.3%)


AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).


	2 unique label values:  [1, 0]


	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])



AutoMM starts to create your model. ✨✨✨

To track the learning progress, you can open a terminal and launch Tensorboard:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /Users/varad/Projects/CMPE-255-Sec-47-Data-Mining/Assignment 6/AutoGluon/extra-credit/text/ag_text_models/matching
    ```



Seed set to 0




GPU Count: 0
GPU Count to be Used: 0



GPU available: True (mps), used: True


TPU available: False, using: 0 TPU cores


HPU available: False, using: 0 HPUs



  | Name              | Type                         | Params | Mode 
---------------------------------------------------------------------------
0 | model             | HFAutoModelForTextPrediction | 108 M  | train
1 | validation_metric | BinaryAUROC                  | 0      | train
2 | loss_func         | CrossEntropyLoss             | 0      | train
---------------------------------------------------------------------------
108 M     Trainable params
0         Non-trainable params
108 M     Total params
435.573   Total estimated model params size (MB)


/Users/varad/Projects/CMPE-255-Sec-47-Data-Mining/.venv/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py:298: The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=10). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.


Epoch 0, global step 1: 'val_roc_auc' reached 0.00000 (best 0.00000), saving model to '/Users/varad/Projects/CMPE-255-Sec-47-Data-Mining/Assignment 6/AutoGluon/extra-credit/text/ag_text_models/matching/epoch=0-step=1.ckpt' as top 3


Epoch 1, global step 2: 'val_roc_auc' reached 0.00000 (best 0.00000), saving model to '/Users/varad/Projects/CMPE-255-Sec-47-Data-Mining/Assignment 6/AutoGluon/extra-credit/text/ag_text_models/matching/epoch=1-step=2.ckpt' as top 3


Epoch 2, global step 3: 'val_roc_auc' reached 1.00000 (best 1.00000), saving model to '/Users/varad/Projects/CMPE-255-Sec-47-Data-Mining/Assignment 6/AutoGluon/extra-credit/text/ag_text_models/matching/epoch=2-step=3.ckpt' as top 3


Start to fuse 3 checkpoints via the greedy soup algorithm.


AutoMM has created your model. 🎉🎉🎉

To load the model, use the code below:
    ```python
    from autogluon.multimodal import MultiModalPredictor
    predictor = MultiModalPredictor.load("/Users/varad/Projects/CMPE-255-Sec-47-Data-Mining/Assignment 6/AutoGluon/extra-credit/text/ag_text_models/matching")
    ```

If you are not satisfied with the model, try to increase the training time, 
adjust the hyperparameters (https://auto.gluon.ai/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub (https://github.com/autogluon/autogluon/issues).




{'roc_auc': 1.0}