2.0 Sentiment Analysis and Topic Modelling Results

This notebook loads the processed comment data, visualizes the sentiment scores, and runs LDA topic modelling to cluster keywords into themes.

1. Load Data and Setup

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Import the Topic Modelling module
from src.topic_modeler import get_lda_topics

# Load the processed data
PROCESSED_FILE_NAME = "tarte_processed_sentiment.csv"
df = pd.read_csv(f'data/processed/{PROCESSED_FILE_NAME}')

sns.set_theme(style="whitegrid")
print(f"Data loaded successfully. Total comments analyzed: {len(df)}")


2. Comparative Sentiment Visualization

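We look at VADER and TextBlob side by side. VADER compound scores are bucketed into Negative, Neutral, and Positive using the conventional ±0.05 cut-offs, while TextBlob polarity is plotted as a continuous distribution.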

In [None]:
# Define VADER categories for visualization
def categorize_vader(score):
    if score >= 0.05:
        return 'Positive'
    elif score <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'

df['vader_sentiment'] = df['vader_compound'].apply(categorize_vader)

# Create a figure with two subplots
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Subplot 1: VADER Sentiment Distribution (Bar Chart)
sns.countplot(x='vader_sentiment', data=df, order=['Negative', 'Neutral', 'Positive'], palette='viridis', ax=axes[0])
axes[0].set_title('VADER Sentiment Distribution (Categorized)')
axes[0].set_xlabel('Sentiment')
axes[0].set_ylabel('Count')

# Subplot 2: TextBlob Polarity Distribution (Histogram)
sns.histplot(df['textblob_polarity'], bins=30, kde=True, color='indianred', ax=axes[1])
axes[1].set_title('TextBlob Polarity Distribution (Continuous)')
axes[1].set_xlabel('Polarity Score (-1.0 to +1.0)')
axes[1].set_ylabel('Count')

plt.tight_layout()
plt.show()
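
The two plots above show each scorer separately. A direct comparison could be sketched as below. This is only an illustration: it assumes the same ±0.05 cut-offs are applied to TextBlob polarity (a choice the notebook itself does not make), and the `toy` frame holds made-up scores standing in for `df`.

```python
import pandas as pd

def categorize(score, threshold=0.05):
    """Bucket a score in [-1, 1] into Negative / Neutral / Positive."""
    if score >= threshold:
        return 'Positive'
    if score <= -threshold:
        return 'Negative'
    return 'Neutral'

# Made-up scores standing in for df[['vader_compound', 'textblob_polarity']]
toy = pd.DataFrame({
    'vader_compound':    [0.80, -0.60, 0.00, 0.30, -0.20],
    'textblob_polarity': [0.50, -0.40, 0.10, 0.00, -0.30],
})

vader_label = toy['vader_compound'].apply(categorize)
textblob_label = toy['textblob_polarity'].apply(categorize)

# Cross-tabulate the two labelings; off-diagonal cells are disagreements
agreement_table = pd.crosstab(vader_label, textblob_label,
                              rownames=['VADER'], colnames=['TextBlob'])

# Share of comments where both scorers land in the same bucket
agreement_rate = (vader_label == textblob_label).mean()

print(agreement_table)
print(f"Agreement rate: {agreement_rate:.0%}")
```

On the real data, replacing `toy` with `df` gives a quick check of how often the two lexicon-based scorers assign the same label.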

3. Sentiment vs. Engagement

We explore whether comments with stronger sentiment, positive or negative, attract more engagement, measured as comment likes.

In [None]:
# Create a metric for absolute emotional strength
df['vader_strength'] = np.abs(df['vader_compound'])

plt.figure(figsize=(10, 6))
sns.scatterplot(x='vader_strength', y='comment_likes', hue='vader_sentiment', data=df, palette='viridis', alpha=0.6)
plt.title('Comment Engagement vs. Emotional Strength (VADER)')
plt.xlabel('Absolute VADER Compound Score (Emotional Strength)')
plt.ylabel('Comment Likes')
plt.show()
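
A scatter plot can be hard to read when like counts are heavily skewed, so a single summary number may help. The sketch below computes Spearman's rank correlation between emotional strength and likes. The `toy` frame contains made-up values, not results from the dataset, and the choice of Spearman over Pearson is an assumption made here for illustration.

```python
import pandas as pd

# Made-up values standing in for df[['vader_compound', 'comment_likes']]
toy = pd.DataFrame({
    'vader_compound': [0.9, -0.8, 0.1, 0.0, -0.4, 0.6],
    'comment_likes':  [120, 95, 3, 10, 1, 40],
})
toy['vader_strength'] = toy['vader_compound'].abs()

# Spearman's rho is the Pearson correlation of the ranks, which makes it
# less sensitive to a few comments with very large like counts.
spearman_rho = toy['vader_strength'].rank().corr(toy['comment_likes'].rank())

print(f"Spearman rho (strength vs. likes): {spearman_rho:.2f}")
```

A value near 0 would suggest no monotonic relationship, while values closer to 1 would suggest that stronger sentiment tends to go with more likes.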


4. Topic Modelling and Keyword Clustering

We run the Topic Modelling function to identify key themes in the public discussion.

In [None]:
# --- PARAMETERS ---
N_TOPICS = 5
N_TOP_WORDS = 8

# Run the LDA model
topic_results = get_lda_topics(df, 
                               text_column='clean_comment', 
                               n_topics=N_TOPICS, 
                               n_top_words=N_TOP_WORDS)

print("\n--- Extracted Topics (LDA) ---")
print(topic_results)

# Interpretation:
# LDA returns unlabeled groups of co-occurring words; theme names are assigned
# by inspecting the top words of each topic. Likely themes here include
# 'Brand Ethics', 'Influencer Loyalty/Support', 'Apology Critique', etc.
# This fulfills the 'keyword clustering' requirement.
