<a href="https://colab.research.google.com/github/subhashpolisetti/AutoGluon_ML_End-to-End_Implementations_Part-2/blob/main/2_AutoGluon_Sentiment_Analysis_and_Sentence_Similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Sentiment Analysis and Sentence Similarity with AutoGluon**

This notebook demonstrates how to use **AutoGluon** for **sentiment analysis** and **sentence similarity prediction**. It walks through loading data, training, evaluating, and visualizing text models end to end.

---

## **1. Installation and Setup**
- Install the required libraries: `autogluon.multimodal`, PyTorch, and their dependencies.
- Check which PyTorch build is installed so that training and evaluation run against compatible package versions.

---

## **2. Sentiment Analysis**
### **Objective**:
Classify each sentence as expressing **positive** or **negative** sentiment.

### **Dataset**:
- **Source**: Stanford Sentiment Treebank (SST), loaded from AutoGluon's hosted GLUE files.
- **Description**: Contains sentences labeled `1` (positive) or `0` (negative) for binary classification.

### **Steps**:
1. **Load and Inspect Data**: Explore the training and testing datasets.
2. **Model Training**:
   - Use `MultiModalPredictor` for binary classification.
   - Train on a 1,000-row random subsample of the training set to keep the demo fast.
3. **Evaluation**:
   - Measure accuracy and F1-score on the test dataset.
4. **Prediction**:
   - Predict sentiments of new sentences.
   - Visualize the class probabilities of predictions.
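
To make the two evaluation metrics concrete, here is a minimal pure-Python sketch of accuracy and F1 for binary labels. The helper names and the toy labels are made up for illustration; in the notebook itself these numbers come from `predictor.evaluate`, not from this code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels, not taken from SST.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"accuracy={acc:.3f}, f1={f1:.3f}")
```

F1 is worth reporting alongside accuracy because it stays informative when one class is much rarer than the other.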

---

## **3. Sentence Similarity Prediction**
### **Objective**:
Predict a similarity score between two sentences on a scale from **0** (completely dissimilar) to **5** (semantically equivalent).

### **Dataset**:
- **Source**: Semantic Textual Similarity (STS) benchmark, loaded from AutoGluon's hosted GLUE files.
- **Description**: Includes pairs of sentences with human-annotated similarity scores.

### **Steps**:
1. **Load Data**: Import the sentence pairs for training and testing.
2. **Model Training**:
   - Use `MultiModalPredictor` for regression tasks.
   - Train on the STS training split; the demo caps training at 60 seconds via `time_limit`.
3. **Evaluation**:
   - Compute metrics such as RMSE, Pearson Correlation, and Spearman Rank Correlation.
4. **Prediction**:
   - Predict similarity scores for new sentence pairs.
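
The three regression metrics measure different things: RMSE penalizes absolute error, Pearson correlation measures linear agreement, and Spearman correlation measures agreement in rank order only. The sketch below implements them in plain Python on made-up scores; the function names and the data are illustrative assumptions, and the notebook obtains the real values from `predictor_sts.evaluate`.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical gold and predicted similarity scores on the 0-5 scale.
gold = [0.0, 1.0, 2.5, 4.0, 5.0]
pred = [0.5, 0.8, 3.0, 3.5, 4.9]
print(f"RMSE={rmse(gold, pred):.2f}")
print(f"Pearson={pearson(gold, pred):.4f}")
print(f"Spearman={spearman(gold, pred):.4f}")
```

In this toy data the predictions preserve the gold ordering exactly, so the Spearman value is at its maximum even though the individual scores are off.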

---

## **4. Visualization and Embedding Analysis**
- **Embedding Extraction**: Extract text embeddings for the test dataset using the trained model.
- **Visualization**:
  - Apply t-SNE for dimensionality reduction.
  - Visualize embeddings in a 2D space to understand the separation between labels or sentence similarities.

---

## **5. Saving and Loading Models**
- Save the trained models using AutoGluon’s `save()` method.
- Reload models for reuse and verification of predictions.

---

## **Key Features**:
1. End-to-end workflow for text-based machine learning using AutoGluon.
2. Includes both classification and regression examples.
3. Provides visualization of embeddings for deeper insights.
4. Simple and reproducible steps for building NLP models.

This notebook is ideal for learning or demonstrating sentiment analysis and sentence similarity prediction using a powerful and user-friendly framework like AutoGluon.


In [None]:
# Install the AutoGluon multimodal library for text analysis and other dependencies.
!pip install autogluon.multimodal

Collecting autogluon.multimodal
  Downloading autogluon.multimodal-1.1.1-py3-none-any.whl.metadata (12 kB)
Collecting scipy<1.13,>=1.5.4 (from autogluon.multimodal)
  Downloading scipy-1.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Collecting Pillow<11,>=10.0.1 (from autogluon.multimodal)
  Downloading pillow-10.4.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (9.2 kB)
Collecting boto3<2,>=1.10 (from autogluon.multimodal)
  Downloading boto3-1.35.20-py3-none-any.whl.metadata (6.6 kB)
Collecting torch<2.4,>=2.2 (from autogluon.multimodal)
  Downloading torch-2.3.1-cp310-cp310-manylinux1_x86_64.whl.metadata (26 kB)
Collecting lightning<2.4,>=2.2 (from autogluon.multimodal)
  Downloading lightning-2.3.3-py3-none-any.whl.metadata (35 kB)
Collecting transformers<4.41.0,>=4.38.0 (from transformers[sentencepiece]<4.41.0,>=4.38.0->autoglu

In [None]:
%matplotlib inline

import numpy as np
import warnings
import matplotlib.pyplot as plt

warnings.filterwarnings('ignore')
np.random.seed(123)

In [None]:
from autogluon.core.utils.loaders import load_pd
train_data = load_pd.load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sst/train.parquet')
test_data = load_pd.load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sst/dev.parquet')
subsample_size = 1000  # subsample data for faster demo, try setting this to larger values
train_data = train_data.sample(n=subsample_size, random_state=0)
train_data.head(10)

|       | sentence | label |
|-------|----------|-------|
| 43787 | very pleasing at its best moments | 1 |
| 16159 | , american chai is enough to make you put away... | 0 |
| 59015 | too much like an infomercial for ram dass 's l... | 0 |
| 5108 | a stirring visual sequence | 1 |
| 67052 | cool visual backmasking | 1 |
| 35938 | hard ground | 0 |
| 49879 | the striking , quietly vulnerable personality ... | 1 |
| 51591 | pan nalin 's exposition is beautiful and myste... | 1 |
| 56780 | wonderfully loopy | 1 |
| 28518 | most beautiful , evocative | 1 |


In [None]:
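# Pin a CUDA 11.7 build of PyTorch. Note that autogluon.multimodal 1.1.1 (installed above) declares torch>=2.2,<2.4, so this pin falls outside its supported range.
!pip install torch==2.0.0+cu117 torchaudio==2.0.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117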

Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu117
Collecting torch==2.0.0+cu117
  Downloading https://download.pytorch.org/whl/cu117/torch-2.0.0%2Bcu117-cp310-cp310-linux_x86_64.whl (1843.9 MB)
Collecting torchaudio==2.0.1+cu117
  Using cached https://download.pytorch.org/whl/cu117/torchaudio-2.0.1%2Bcu117-cp310-cp310-linux_x86_64.whl (4.4 MB)
Collecting triton==2.0.0 (from torch==2.0.0+cu117)
  Using cached https://download.pytorch.org/whl/triton-2.0.0-1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (63.3 MB)
Collecting lit (from triton==2.0.0->torch==2.0.0+cu117)
  Downloading lit-18.1.8-py3-none-any.whl.metadata (2.5 kB)
Downloading lit-18.1.8-py3-none-any.whl (96 kB)
Installing collected pa

In [None]:
!pip uninstall torch torchvision torchaudio -y

Found existing installation: torchvision 0.18.1
Uninstalling torchvision-0.18.1:
  Successfully uninstalled torchvision-0.18.1

In [None]:
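# torchaudio 2.0.1 is built against torch 2.0.0, so pip rejects pairing it with torch 2.0.1 (see the error below); torchaudio 2.0.2 is the release that matches torch 2.0.1.
!pip install torch==2.0.1+cu117 torchvision==0.15.2+cu117 torchaudio==2.0.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117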

Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu117
Collecting torch==2.0.1+cu117
  Using cached https://download.pytorch.org/whl/cu117/torch-2.0.1%2Bcu117-cp310-cp310-linux_x86_64.whl (1843.9 MB)
Collecting torchvision==0.15.2+cu117
  Downloading https://download.pytorch.org/whl/cu117/torchvision-0.15.2%2Bcu117-cp310-cp310-linux_x86_64.whl (6.1 MB)
Collecting torchaudio==2.0.1+cu117
  Using cached https://download.pytorch.org/whl/cu117/torchaudio-2.0.1%2Bcu117-cp310-cp310-linux_x86_64.whl (4.4 MB)
INFO: pip is looking at multiple versions of torchaudio to determine which version is compatible with other requirements. This could take a while.
ERROR: Cannot install torch==2.0.1+cu117, torchaudio==2.0.1+cu117 and torchvision==0.15.2+cu117 because these package versions have conflicting dependencies.

The conflict is

In [None]:
import torch
print(torch.__version__)

2.0.0+cu117
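
The cells that follow use `predictor` and `model_path`, but the cell that creates them does not appear in this export. The sketch below shows one way they could be defined, mirroring the constructor and `fit` call used for the STS model later in the notebook. The helper name, the `eval_metric="acc"` choice, and the 180-second `time_limit` are assumptions, not values recovered from the original notebook.

```python
import uuid


def train_sentiment_predictor(train_data, time_limit=180):
    """Fit a binary sentiment classifier on a DataFrame with 'sentence' and 'label' columns.

    Hypothetical helper: time_limit and eval_metric are assumed values.
    """
    # Imported inside the function so the helper can be defined even before
    # AutoGluon has finished installing.
    from autogluon.multimodal import MultiModalPredictor

    # A unique directory per run, following the naming pattern used later for re-saving.
    model_path = f"./tmp/{uuid.uuid4().hex}-automm_sst"
    predictor = MultiModalPredictor(label="label", eval_metric="acc", path=model_path)
    predictor.fit(train_data, time_limit=time_limit)
    return predictor, model_path


# In the notebook, the later cells would rely on a call such as:
# predictor, model_path = train_sentiment_predictor(train_data)
```

A longer `time_limit` or a larger subsample would normally improve the scores, at the cost of a slower demo.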


In [None]:
test_score = predictor.evaluate(test_data)
print(test_score)

In [None]:
test_score = predictor.evaluate(test_data, metrics=['acc', 'f1'])
print(test_score)

In [None]:
sentence1 = "it's a charming and often affecting journey."
sentence2 = "It's slow, very, very, very slow."
predictions = predictor.predict({'sentence': [sentence1, sentence2]})
print('"Sentence":', sentence1, '"Predicted Sentiment":', predictions[0])
print('"Sentence":', sentence2, '"Predicted Sentiment":', predictions[1])

In [None]:
probs = predictor.predict_proba({'sentence': [sentence1, sentence2]})
print('"Sentence":', sentence1, '"Predicted Class-Probabilities":', probs[0])
print('"Sentence":', sentence2, '"Predicted Class-Probabilities":', probs[1])

In [None]:
test_predictions = predictor.predict(test_data)
test_predictions.head()

In [None]:
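from autogluon.multimodal import MultiModalPredictor  # needed for MultiModalPredictor.load
loaded_predictor = MultiModalPredictor.load(model_path)  # model_path is the predictor's save directory
loaded_predictor.predict_proba({'sentence': [sentence1, sentence2]})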

In [None]:
import uuid

# Save the reloaded predictor to a fresh directory, load it again, and re-check the probabilities.
new_model_path = f"./tmp/{uuid.uuid4().hex}-automm_sst"
loaded_predictor.save(new_model_path)
loaded_predictor2 = MultiModalPredictor.load(new_model_path)
loaded_predictor2.predict_proba({'sentence': [sentence1, sentence2]})

In [None]:
embeddings = predictor.extract_embedding(test_data)
print(embeddings.shape)

In [None]:
from sklearn.manifold import TSNE

# Project the embeddings to 2D; the fixed seed keeps the layout repeatable.
X_embedded = TSNE(n_components=2, random_state=123).fit_transform(embeddings)
labels = test_data['label'].to_numpy()
for val, color in [(0, 'red'), (1, 'blue')]:
    mask = labels == val  # boolean mask selecting the rows with this label
    plt.scatter(X_embedded[mask, 0], X_embedded[mask, 1], c=color, label=f'label={val}')
plt.legend(loc='best')

In [None]:
sts_train_data = load_pd.load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sts/train.parquet')[['sentence1', 'sentence2', 'score']]
sts_test_data = load_pd.load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sts/dev.parquet')[['sentence1', 'sentence2', 'score']]
sts_train_data.head(10)

In [None]:
print('Min score=', sts_train_data['score'].min(), ', Max score=', sts_train_data['score'].max())

In [None]:
sts_model_path = f"./tmp/{uuid.uuid4().hex}-automm_sts"
predictor_sts = MultiModalPredictor(label='score', path=sts_model_path)
predictor_sts.fit(sts_train_data, time_limit=60)

In [None]:
test_score = predictor_sts.evaluate(sts_test_data, metrics=['rmse', 'pearsonr', 'spearmanr'])
print('RMSE = {:.2f}'.format(test_score['rmse']))
print('PEARSONR = {:.4f}'.format(test_score['pearsonr']))
print('SPEARMANR = {:.4f}'.format(test_score['spearmanr']))

In [None]:
sentences = ['The child is riding a horse.',
             'The young boy is riding a horse.',
             'The young man is riding a horse.',
             'The young man is riding a bicycle.']

score1 = predictor_sts.predict({'sentence1': [sentences[0]],
                                'sentence2': [sentences[1]]}, as_pandas=False)

score2 = predictor_sts.predict({'sentence1': [sentences[0]],
                                'sentence2': [sentences[2]]}, as_pandas=False)

score3 = predictor_sts.predict({'sentence1': [sentences[0]],
                                'sentence2': [sentences[3]]}, as_pandas=False)
print(score1, score2, score3)
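
The three separate calls above could also be expressed as a single batched input, since the predictor accepts lists of sentences per column (as in the sentiment example earlier). The sketch below only builds that input; `build_pair_batch` is a hypothetical helper, and the commented-out `predict` call assumes `predictor_sts` has been trained as shown above.

```python
def build_pair_batch(anchor, candidates):
    """Pair one anchor sentence with each candidate, using the STS column names."""
    return {
        "sentence1": [anchor] * len(candidates),
        "sentence2": list(candidates),
    }


sentences = ['The child is riding a horse.',
             'The young boy is riding a horse.',
             'The young man is riding a horse.',
             'The young man is riding a bicycle.']

# Compare the first sentence against each of the other three.
batch = build_pair_batch(sentences[0], sentences[1:])
print(batch)

# With a trained predictor, one call would then return one score per pair:
# scores = predictor_sts.predict(batch, as_pandas=False)
```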