# Sentiment Analysis with Large Language Model embeddings

## Our Dataset

This dataset contains tweets about U.S. airlines, each labelled with the sentiment it expresses.

The dataset used in this study is Twitter US Airline Sentiment, available on [Kaggle](https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment/data).

- Dataset: https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment/data

## Variable Table

| Original Dataset             | Data Type     | Description                                                                                 |
|------------------------------|---------------|---------------------------------------------------------------------------------------------|
| tweet_id                     | ID            | A unique identifier for each tweet.                                                         | 
| airline_sentiment            | Categorical   | The sentiment expressed in the tweet (positive, neutral, negative).                         | 
| airline_sentiment_confidence | Numerical     | Confidence score in the sentiment label (0 to 1).                                           | 
| negativereason               | Categorical   | Reason for negative sentiment (e.g., "Late Flight", "Customer Service Issue").              | 
| negativereason_confidence    | Numerical     | Confidence score in the negative reason label (0 to 1).                                     | 
| airline                      | Categorical   | The airline mentioned in the tweet (e.g., United, Delta, etc.).                             | 
| airline_sentiment_gold       | Categorical   | Sentiment label by trusted annotator (gold standard).                                       | 
| name                         | Text          | Name of the user who posted the tweet.                                                      | 
| negativereason_gold          | Categorical   | Negative reason label by trusted annotator (gold standard).                                 | 
| retweet_count                | Numerical     | Number of times the tweet was retweeted.                                                    | 
| text                         | Text          | The full content of the tweet.                                                              | 
| tweet_coord                  | Geospatial    | Latitude and longitude coordinates where the tweet was posted, if available.                | 
| tweet_created                | Datetime      | Timestamp when the tweet was created.                                                       | 
| tweet_location               | Text          | Location specified in the user's profile.                                                   | 
| user_timezone                | Categorical   | Time zone specified in the user's profile.                                                  | 

<br/>

## Data Used for Modeling

| Feature                      | Data Type   | Description  |
|-----------------------------|-------------|--------------|
| **Target Variable: `encoded_sentiment`** | Categorical | This is an engineered variable derived from `airline_sentiment` for multi-class sentiment classification. It encodes sentiment as: 0 = Negative, 1 = Neutral, 2 = Positive. |
| **Feature: `text`**         | Text        | Contains consumer tweets about U.S. airlines. This field undergoes preprocessing, including removal of URLs and mentions (`@`) and stopword removal. |


<br/>

# 1. Load Data

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv("../../data/tweets.csv")

In [None]:
# Show basic info
df.info()

In [None]:
# Show the first few rows
df.head(3)

# 2. Data Preprocessing

## 2.1 Handle Duplicates

In [None]:
# Check for duplicate rows
duplicate_rows = df[df.duplicated()]
print(f"Number of duplicate rows: {len(duplicate_rows)}")

# Drop duplicate rows
df.drop_duplicates(inplace=True)

In [None]:
# Confirm the shape after removal
print(f"Shape after dropping duplicates: {df.shape}")

## 2.2 Handle Missing Values

In [None]:
# Count missing values for each variable in the dataset
print("\nMissing values count for each variable:")
print("-------------------------------------------")
print(df.isnull().sum())

print("""\n\n**Note**: We do not remove rows with missing values here, because
the two columns we use, 'text' and 'airline_sentiment',
have no missing values.""")

## 2.3 Text Processing

In [None]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import nltk

nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

### 2.3.1 Feature Engineering

For feature engineering in sentiment analysis, we will perform the following steps:

- `Stopwords Removal`: Eliminating common words (e.g., "the", "is", "and") that don't contribute meaningful information.
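As an illustration of the step listed above, here is a minimal sketch of stopword removal. It assumes plain whitespace tokenization and a tiny hand-written stopword set (`ILLUSTRATIVE_STOPWORDS` and `remove_stopwords` are hypothetical names introduced only for this example); in practice the set would come from `stopwords.words('english')`.

```python
# Small illustrative stopword set; an assumption made for this sketch only
ILLUSTRATIVE_STOPWORDS = {"the", "is", "and", "a", "to"}

def remove_stopwords(text, stop_words=ILLUSTRATIVE_STOPWORDS):
    """Drop every whitespace-separated token that appears in stop_words."""
    tokens = text.split()
    kept = [token for token in tokens if token.lower() not in stop_words]
    return " ".join(kept)

example = remove_stopwords("the flight is late and the crew is rude")
print(example)
```

The comparison is done on the lowercased token, so the helper behaves the same whether or not the text has already been lowercased.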

In [None]:
import re

# Initialize the English stopword set (no stemmer is created here,
# and stop_words is not used by clean_text below)
stop_words = set(stopwords.words('english'))

# Step 1: Lowercase and clean the text
def clean_text(text):
    text = text.lower()                                 # Lowercase
    text = re.sub(r"http\S+|www\S+|https\S+", '', text) # Remove URLs
    text = re.sub(r'@\w+', '', text)                    # Remove mentions
    # NOTE: Hashtags are kept, because many of them carry sentiment
    # (for example '#thankyou' or '#happycustomer').
    # text = re.sub(r'#\w+', '', text)
    # NOTE: Numbers and punctuation are kept; removing them is not needed for LLM embeddings.
    # text = re.sub(r'[^a-z\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Copy the 'text' and 'airline_sentiment' columns into df2
df2 = df[['text', 'airline_sentiment']].copy()

# Add the text after rule-based removal of URLs and Twitter usernames
df2['clean_text'] = df2['text'].apply(clean_text)
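As a quick sanity check of the cleaning rules, the sketch below repeats the active steps of `clean_text` and applies them to a made-up tweet (the tweet text and the URL are invented for illustration and do not come from the dataset).

```python
import re

def clean_text(text):
    text = text.lower()                                  # Lowercase
    text = re.sub(r"http\S+|www\S+|https\S+", '', text)  # Remove URLs
    text = re.sub(r'@\w+', '', text)                     # Remove mentions
    text = re.sub(r'\s+', ' ', text).strip()             # Collapse whitespace
    return text

# Made-up example tweet
sample_tweet = "@VirginAmerica Thanks for the upgrade! http://t.co/abc123 #happycustomer"
cleaned = clean_text(sample_tweet)
print(cleaned)  # thanks for the upgrade! #happycustomer
```

The mention and the URL disappear, while the hashtag and the punctuation survive, which matches the notes in the function above.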



In [None]:
df2.shape

In [None]:
# Check for missing values per column
print("\nMissing values per column:")
print(df2.isnull().sum())

### 2.3.2 Target Engineering

We will convert the `airline_sentiment` column into numerical values so that it can serve as the target variable in our model, where `negative` = 0, `neutral` = 1, and `positive` = 2.

In [None]:
print("The target variable contains the unique values:", df2['airline_sentiment'].unique())
print("They are mapped as: negative -> 0, neutral -> 1, positive -> 2")

# Encode the sentiment column
df2['encoded_sentiment'] = df2['airline_sentiment'].map({"negative": 0, "neutral": 1, "positive": 2})


## 3.0 Data Preparation for Modeling

In [None]:
df2

In [None]:
# Split into features (X) and target labels (y)
X = df2['clean_text']
y = df2['encoded_sentiment']

### 3.1 Train test split with stratified sampling

In [None]:
from sklearn.model_selection import train_test_split

# Split before applying SMOTE, so that oversampling only ever sees training data
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                        stratify=y,
                                        test_size=0.2,
                                        random_state=42)

In [None]:
pd.concat([X_train, y_train], axis=1).to_csv("../../data/train.csv")
pd.concat([X_test, y_test], axis=1).to_csv("../../data/test.csv")

In [None]:
# Verify that class distribution is preserved after the train-test split (i.e., stratified correctly)

print("Train class distribution:")
print(y_train.value_counts(normalize=True))

print("\nTest class distribution:")
print(y_test.value_counts(normalize=True))

print("""\n**Observation**: The class proportions appear to be preserved across the training and test sets, 
indicating a successful stratified split.""")
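To make the idea of a stratified split concrete, here is a minimal pure-Python sketch on made-up labels. It is not scikit-learn's implementation; `stratified_split` and `proportions` are hypothetical helpers written only for this illustration. Each class is split separately, so the class proportions carry over to both subsets.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Split row indices so that each class keeps the same test fraction."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

def proportions(labels, indices):
    """Return the share of each class among the given row indices."""
    counts = Counter(labels[i] for i in indices)
    return {label: counts[label] / len(indices) for label in sorted(counts)}

# Made-up imbalanced labels: 60% class 0, 20% class 1, 20% class 2
toy_labels = [0] * 60 + [1] * 20 + [2] * 20
train_idx, test_idx = stratified_split(toy_labels)
train_props = proportions(toy_labels, train_idx)
test_props = proportions(toy_labels, test_idx)
print(train_props, test_props)
```

With class sizes that divide evenly, as here, both subsets reproduce the original proportions exactly; with real data the match is only approximate because of rounding.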

### 3.2 Text Representation with LLM embedding

In [None]:
import os
import pandas as pd
from utils.prepare_llm_embedding import generate_embeddings_from_series

EMBEDDING_TRAIN = "../../data/llm_embedding_train.csv"

# Generate the training embeddings only if they are not cached on disk yet
if not os.path.exists(EMBEDDING_TRAIN):
    processed_text_series = pd.Series(X_train.to_list(),
                                      index=X_train.index.to_list())
    llm_embedding = generate_embeddings_from_series(processed_text_series,
                            additional_data={"encoded_sentiment": y_train.to_list()},
                            output_csv_path=EMBEDDING_TRAIN,
                            max_workers=20)
    print(llm_embedding)

In [None]:
import os
import pandas as pd
from utils.prepare_llm_embedding import generate_embeddings_from_series

EMBEDDING_TEST = "../../data/llm_embedding_test.csv"

# Generate the test embeddings only if they are not cached on disk yet
if not os.path.exists(EMBEDDING_TEST):
    processed_text_series = pd.Series(X_test.to_list(),
                                      index=X_test.index.to_list())
    llm_embedding = generate_embeddings_from_series(processed_text_series,
                            additional_data={"encoded_sentiment": y_test.to_list()},
                            output_csv_path=EMBEDDING_TEST,
                            max_workers=20)
    print(llm_embedding)
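The two cells above follow a compute-once pattern: the embeddings are generated only when the CSV file does not exist yet. A minimal, generic sketch of that pattern is shown below. `load_or_compute` and `fake_expensive_embedding` are hypothetical names used only for illustration; they are not part of `utils.prepare_llm_embedding`, whose internals are not shown in this notebook.

```python
import csv
import os
import tempfile

def load_or_compute(path, compute_rows):
    """Return rows cached at `path`; compute and save them only on a cache miss."""
    if os.path.exists(path):
        with open(path, newline="") as handle:
            return list(csv.reader(handle))
    rows = compute_rows()
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return rows

calls = []  # Records every time the expensive step actually runs

def fake_expensive_embedding():
    # Hypothetical stand-in for a slow embedding request
    calls.append("computed")
    return [["0", "[0.1, 0.2]"], ["1", "[0.3, 0.4]"]]

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "embeddings.csv")
    first = load_or_compute(cache_path, fake_expensive_embedding)   # cache miss
    second = load_or_compute(cache_path, fake_expensive_embedding)  # cache hit

print(len(calls))  # 1
```

One consequence of this design is that a stale file is reused silently, so the cached CSV has to be deleted by hand whenever the preprocessing or the split changes.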

In [None]:
import numpy
import json

train_vectorized = pd.read_csv("../../data/llm_embedding_train.csv")
test_vectorized = pd.read_csv("../../data/llm_embedding_test.csv")

X_train_vectorized = train_vectorized["embedding_json"].apply(json.loads) # parse each JSON string into a list of floats (one embedding vector per row)
X_train_vectorized = numpy.vstack(X_train_vectorized) # stack the vectors into a 2-D array: one row per tweet, one column per embedding dimension
y_train = train_vectorized["encoded_sentiment"]

X_test_vectorized = test_vectorized["embedding_json"].apply(json.loads)
X_test_vectorized = numpy.vstack(X_test_vectorized)
y_test = test_vectorized["encoded_sentiment"]
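The conversion above can be sketched on a toy example. The two short vectors below are made up for illustration (real embeddings are much longer), and the length check is an extra safeguard that is not part of the original cell.

```python
import json
import numpy as np

# Made-up 3-dimensional embeddings stored as JSON strings
embedding_json_column = ["[0.1, 0.2, 0.3]", "[0.4, 0.5, 0.6]"]

# Parse each JSON string into a Python list of floats
vectors = [json.loads(cell) for cell in embedding_json_column]

# Every vector must have the same length, otherwise vstack raises an error
lengths = {len(vector) for vector in vectors}
assert len(lengths) == 1, f"inconsistent embedding sizes: {lengths}"

# Stack into a matrix: one row per text, one column per embedding dimension
feature_matrix = np.vstack(vectors)
print(feature_matrix.shape)  # (2, 3)
```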

### 3.3 Handling class imbalance with SMOTE

In [None]:
from imblearn.over_sampling import SMOTE

# Oversample the minority classes in the training set with SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_vectorized, y_train)

In [None]:
# Check class distribution before and after applying SMOTE to confirm successful balancing

print("Class distribution in training set (before SMOTE):")
print(y_train.value_counts())

print("\nClass distribution in training set (after SMOTE):")
print(y_train_resampled.value_counts())

print("""\n**Observation**: The class distribution in the training set has been balanced after applying SMOTE, 
confirming that oversampling was successful.""")
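For intuition, SMOTE creates each synthetic sample by interpolating between a minority-class sample and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step on made-up 2-D points; it is not imblearn's implementation, and `interpolate` is a hypothetical helper (the neighbour search is skipped and the neighbour is simply given).

```python
import random

def interpolate(sample, neighbor, rng):
    """Return a random point on the line segment between sample and neighbor."""
    lam = rng.random()  # Random factor in [0, 1)
    return [s + lam * (n - s) for s, n in zip(sample, neighbor)]

rng = random.Random(42)

# Made-up minority-class points in a 2-D feature space
minority_sample = [1.0, 2.0]
minority_neighbor = [3.0, 6.0]

synthetic = interpolate(minority_sample, minority_neighbor, rng)
print(synthetic)
```

Because the synthetic points are built from training samples, SMOTE is applied here to the training set only, while the test set is left untouched.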


## 4.0 Modeling

(i) Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Initialize the Logistic Regression model
model = LogisticRegression(max_iter=1000, random_state=42)

# Train the model
model.fit(X_train_resampled, y_train_resampled)

# Predictions on training and test sets
y_train_pred = model.predict(X_train_resampled)
y_test_pred = model.predict(X_test_vectorized)

# Accuracy scores
train_accuracy = accuracy_score(y_train_resampled, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

# Output
print(f"Train Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy:  {test_accuracy:.4f}\n")

print("Classification Report (Test):")
print(classification_report(y_test, y_test_pred))

print("Confusion Matrix (Test):")
print(confusion_matrix(y_test, y_test_pred))
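When reading the confusion matrix printed above, note that scikit-learn places the true labels on the rows and the predicted labels on the columns. The sketch below recomputes such a matrix in plain Python on made-up labels, to show how accuracy and per-class recall follow from it; `build_confusion_matrix` is a hypothetical helper written only for this illustration.

```python
def build_confusion_matrix(y_true, y_pred, n_classes=3):
    """Count predictions: rows are true labels, columns are predicted labels."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix

# Made-up labels: 0 = Negative, 1 = Neutral, 2 = Positive
y_true_toy = [0, 0, 0, 1, 1, 2]
y_pred_toy = [0, 0, 1, 1, 2, 2]

matrix = build_confusion_matrix(y_true_toy, y_pred_toy)

# Accuracy is the share of predictions on the diagonal
accuracy = sum(matrix[i][i] for i in range(3)) / len(y_true_toy)

# Recall of a class is its diagonal entry divided by its row total
recall = [matrix[i][i] / sum(matrix[i]) for i in range(3)]

print(matrix)  # [[2, 1, 0], [0, 1, 1], [0, 0, 1]]
```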

(ii) Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Initialize the Decision Tree
dt_model = DecisionTreeClassifier() # max_depth=768, 
                                  # min_samples_split=10, 
                                  # criterion='entropy', 
                                  # min_samples_leaf=100)

# Train the model
dt_model.fit(X_train_resampled, y_train_resampled)

# Predictions on training and test sets
y_train_pred = dt_model.predict(X_train_resampled)
y_test_pred = dt_model.predict(X_test_vectorized)

# Accuracy scores
train_accuracy = accuracy_score(y_train_resampled, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

# Output
print(f"Train Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy:  {test_accuracy:.4f}\n")

print("Classification Report (Test):")
print(classification_report(y_test, y_test_pred))

print("Confusion Matrix (Test):")
print(confusion_matrix(y_test, y_test_pred))

(iii) Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Initialize the Gradient Boosting model
gb_model = GradientBoostingClassifier(max_depth=768, 
                                      min_samples_split=10, 
                                      min_samples_leaf=100)

# Train the model
gb_model.fit(X_train_resampled, y_train_resampled)

# Predictions on training and test sets
y_train_pred = gb_model.predict(X_train_resampled)
y_test_pred = gb_model.predict(X_test_vectorized)

# Accuracy scores
train_accuracy = accuracy_score(y_train_resampled, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

# Output
print(f"Train Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy:  {test_accuracy:.4f}\n")

print("Classification Report (Test):")
print(classification_report(y_test, y_test_pred))

print("Confusion Matrix (Test):")
print(confusion_matrix(y_test, y_test_pred))


## 5.0 Feature Importance

In [None]:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Plot decision tree
plt.figure(figsize=(200, 200))
plot_tree(dt_model, filled=True, max_depth=5, 
          feature_names=None, 
          class_names=["Negative", "Neutral", "Positive"])
plt.show()

- Reference: https://medium.com/nlplanet/two-minutes-nlp-explain-predictions-with-shap-values-2a0e34219177

In [None]:
import shap

# Create a SHAP explainer
# X_train_resampled is already a dense NumPy array, so no .toarray() call is needed
explainer = shap.Explainer(dt_model,
                           X_train_resampled,
                           feature_names=None)

# Compute SHAP values for the test set
shap_values = explainer(X_test_vectorized)

print(shap_values.values.shape)


In [None]:
shap.initjs()

ind = 3
print(f"The sentiment of the {ind}-th row of text item is",
      y_test.replace({0: 'Negative', 1: 'Neutral', 2: 'Positive'}).iloc[ind])

# print(X_test_vectorized[ind])
shap.plots.waterfall(shap_values[ind,:,1])

In [None]:
shap.initjs()

ind = 0
print(f"The sentiment of the {ind}-th row of text item is", 
      y_test.replace({0: 'Negative', 1: 'Neutral', 2: 'Positive'}).iloc[ind])

# print(X_test_vectorized[ind])
shap.plots.waterfall(shap_values[ind,:,1])

In [None]:
# Let's do the same with a negative review.

ind = 122
print(f"The sentiment of the {ind}-th row of text item is", 
      y_test.replace({0: 'Negative', 1: 'Neutral', 2: 'Positive'}).iloc[ind])

shap.initjs()
# print(X_test_vectorized[ind])
shap.plots.waterfall(shap_values[ind,:,1])