# Exercise: -- Sentiment Guesser --

## Understanding Top-down & Bottom-up Approaches

Learning objectives
- Understand the difference between top-down (rule-based) and bottom-up (pattern-based) approaches in sentiment analysis.
- Implement a simple sentiment detection model using both approaches.
- Reflect on the advantages and limitations of each method.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


data = {"review": [
    "I absolutely love this product! It works great and is amazing.",
    "This is the worst purchase I have ever made. Totally terrible!",
    "It's okay, but I expected better. Not great but not bad either.",
    "Fantastic quality! Exceeded my expectations.",
    "I hate this so much. Worst thing ever!",
    "Pretty decent, could be improved but overall not bad.",
    "Superb experience! Highly recommend.",
    "Disappointed. Not what I expected at all.",
    "Well, it works... I guess.",  
    "I was excited to try this, but it completely failed my expectations.",  
    "It's a product."  
]}

df = pd.DataFrame(data)
df

# --- Step 1: Manual Sentiment Classification ---

With your team, please discuss whether you should enter _Positive_, _Negative_, or _Neutral_ for each review.
Please also rate your certainty for the label from 1 (very uncertain) to 5 (very certain).

In [None]:
df["ManualSentiment"] = ["" for _ in range(len(df))]  # Placeholder for student input
df["Certainty"] = [0 for _ in range(len(df))]  # Placeholder for certainty rating

print("\n### Manual Sentiment Classification ###")
print("Please manually classify the following reviews as Positive, Negative, or Neutral.")
print("Also rate your certainty from 1 (very uncertain) to 5 (very certain).\n")

valid_sentiments = {"Positive", "Negative", "Neutral"}

for i, review in enumerate(df["review"]):
    while True:
        sentiment = input(f"Review: {review}\nYour Sentiment (Positive/Negative/Neutral): ").strip().capitalize()
        if sentiment in valid_sentiments:
            df.at[i, "ManualSentiment"] = sentiment
            break
        print("Invalid input. Please enter Positive, Negative, or Neutral.")
    
    while True:
        try:
            certainty = int(input("Certainty (1-5): "))
            if 1 <= certainty <= 5:
                df.at[i, "Certainty"] = certainty
                break
            else:
                print("Please enter a number between 1 and 5.")
        except ValueError:
            print("Invalid input. Please enter a number between 1 and 5.")


In [None]:
df

# --- Step 2: Top-Down Approach ---
Use a predefined lexicon of positive and negative words. Can you think of additional words that should be added to the list? How does adding more words change your results?

In [None]:
positive_words = {"love", "great", "excellent", "amazing", "happy", "fantastic", "superb", "recommend"}
negative_words = {"worst", "terrible", "awful", "hate", "bad", "disappointed"}

def top_down_sentiment(text):
    words = text.lower().split()
    pos_count = sum(1 for word in words if word in positive_words)
    neg_count = sum(1 for word in words if word in negative_words)
    return "Positive" if pos_count > neg_count else "Negative" if neg_count > pos_count else "Neutral"

df["TopDownSentiment"] = df["review"].apply(top_down_sentiment)
df
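
One thing to watch for when extending the lexicon: `text.lower().split()` splits on whitespace only, so a token such as `terrible!` or `disappointed.` keeps its punctuation and never matches a lexicon entry. The sketch below is one possible variant, not part of the original exercise; `tokenize`, `top_down_sentiment_tokenized`, `demo_positive`, and `demo_negative` are hypothetical names introduced here so that the cell above stays untouched.

```python
import re

# Copies of the lexicons above under hypothetical names, so this sketch is self-contained.
demo_positive = {"love", "great", "excellent", "amazing", "happy", "fantastic", "superb", "recommend"}
demo_negative = {"worst", "terrible", "awful", "hate", "bad", "disappointed"}

def tokenize(text):
    # Keep runs of letters and apostrophes only, so "terrible!" becomes "terrible".
    return re.findall(r"[a-z']+", text.lower())

def top_down_sentiment_tokenized(text):
    words = tokenize(text)
    pos_count = sum(1 for word in words if word in demo_positive)
    neg_count = sum(1 for word in words if word in demo_negative)
    if pos_count > neg_count:
        return "Positive"
    if neg_count > pos_count:
        return "Negative"
    return "Neutral"

print(top_down_sentiment_tokenized("Disappointed. Not what I expected at all."))
```

Comparing this variant with `top_down_sentiment` on the same reviews shows how much of a lexicon method's behavior depends on preprocessing rather than on the word list itself.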

# --- Step 3: Bottom-Up Approach ---
Use word frequency patterns to infer sentiment. Start by displaying the most common words.

In [None]:
# Split reviews into words, convert to lowercase, and count word frequencies
all_words = " ".join(df["review"]).lower().split()
word_counts = Counter(all_words)


print("Most common words in the dataset (top 20):")
for word, count in word_counts.most_common(20):
    print(f"{word}: {count}")
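
The raw frequency list tends to be dominated by function words such as "i", "this", and "it", which carry little sentiment. A minimal sketch of one way to filter them out follows. It assumes a small hand-written stop-word list (`STOP_WORDS`, hypothetical and deliberately incomplete) and a hypothetical helper `content_word_counts`; neither is part of the original exercise.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; extend it as needed.
STOP_WORDS = {"i", "it", "it's", "this", "is", "a", "the", "and", "but",
              "my", "to", "so", "at", "was", "be", "have"}

def content_word_counts(reviews, stop_words=STOP_WORDS):
    counts = Counter()
    for review in reviews:
        for token in review.lower().split():
            # Remove punctuation attached to the start or end of the token.
            word = token.strip(".,!?")
            if word and word not in stop_words:
                counts[word] += 1
    return counts

sample_reviews = [
    "I hate this so much. Worst thing ever!",
    "This is the worst purchase I have ever made. Totally terrible!",
]
print(content_word_counts(sample_reviews).most_common(5))
```

In the notebook, `content_word_counts(df["review"])` would give the filtered counts for the full sample. Note that "not" is intentionally left out of the stop-word list, since negation matters for sentiment.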


Based on the output above, please enter the most common positive and negative words in the sample.

In [None]:
# Input common positive words
print("\nPlease input common positive words based on the frequency analysis.")
positive_input = input("Enter positive words (comma-separated): ")
# Strip whitespace around each entry so that "love, great" yields {"love", "great"}
common_positive = {word.strip() for word in positive_input.lower().split(",") if word.strip()}

# Input common negative words
print("\nPlease input common negative words based on the frequency analysis (e.g., worst, bad, terrible).")
negative_input = input("Enter negative words (comma-separated): ")
# Same cleanup as above; empty entries (e.g., from a trailing comma) are dropped
common_negative = {word.strip() for word in negative_input.lower().split(",") if word.strip()}

In [None]:

def bottom_up_sentiment(text):
    words = text.lower().split()
    pos_count = sum(1 for word in words if word in common_positive)
    neg_count = sum(1 for word in words if word in common_negative)
    return "Positive" if pos_count > neg_count else "Negative" if neg_count > pos_count else "Neutral"

df["BottomUpSentiment"] = df["review"].apply(bottom_up_sentiment)

In [None]:
df

# --- Step 4: VADER Sentiment Analysis ---

In the final step, we will use VADER, a ready-made lexicon- and rule-based tool for sentiment classification. Since VADER relies on a predefined lexicon with sentiment scores assigned in advance, you could argue that it follows a top-down approach: it applies predefined knowledge (the dictionary) to analyze new text. However, it also exhibits bottom-up characteristics because it computes the overall sentiment by aggregating individual word scores and adjusting them with heuristic rules (e.g., negation, intensifiers, punctuation).

So, it’s a bit of both:
- Top-down because it starts with a pre-built sentiment dictionary.
- Bottom-up because it builds sentiment from individual words and adjusts based on context.
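
To make the "bit of both" idea concrete, here is a toy scorer that combines a fixed dictionary (top-down) with word-by-word aggregation and simple context rules (bottom-up). It is only a sketch of the general idea: the lexicon, the intensifier weights, the negation factor, and the name `toy_compound` are all invented for illustration and are not VADER's actual values or code.

```python
import math

# Toy values invented for illustration; these are NOT VADER's lexicon or constants.
TOY_LEXICON = {"great": 3.0, "good": 2.0, "bad": -2.5, "terrible": -3.0}
TOY_NEGATIONS = {"not", "never", "no"}
TOY_INTENSIFIERS = {"very": 1.5, "really": 1.3}

def toy_compound(text, alpha=15.0):
    tokens = [token.strip(".,!?") for token in text.lower().split()]
    total = 0.0
    for i, token in enumerate(tokens):
        if token not in TOY_LEXICON:
            continue
        score = TOY_LEXICON[token]          # top-down: look up the predefined score
        previous = tokens[max(0, i - 2):i]  # context: up to two preceding tokens
        for prev in previous:
            if prev in TOY_INTENSIFIERS:
                score *= TOY_INTENSIFIERS[prev]
        if any(prev in TOY_NEGATIONS for prev in previous):
            score *= -0.5                   # flip the sign and dampen the strength
        total += score                      # bottom-up: aggregate word by word
    # Squash the sum into the open interval (-1, 1).
    return total / math.sqrt(total * total + alpha)

for phrase in ["great", "very great", "not great", "It's a product."]:
    print(phrase, round(toy_compound(phrase), 3))
```

The real analyzer is considerably more elaborate, but the division of labor is similar: the dictionary supplies prior knowledge, and the final score emerges from the words actually present in the text.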

In [None]:
analyzer = SentimentIntensityAnalyzer()

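def vader_sentiment(text):
    # "compound" is VADER's normalized overall score, ranging from -1 to 1
    compound = analyzer.polarity_scores(text)["compound"]
    # Cut-offs of +/-0.05 (inclusive) follow the convention suggested in the VADER documentation
    return "Positive" if compound >= 0.05 else "Negative" if compound <= -0.05 else "Neutral"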

df["VADER_Sentiment"] = df["review"].apply(vader_sentiment)

In [None]:
df

# --- Step 5: Agreement calculation ---

Which method yields the highest agreement with the human annotations?

In [None]:
def agreement_score(manual, predicted):
    return 1 if manual == predicted else 0

# apply the agreement score calculation
df["agreement_topdown"] = df.apply(lambda row: agreement_score(row["ManualSentiment"], row["TopDownSentiment"]), axis=1)
df["agreement_bottomup"] = df.apply(lambda row: agreement_score(row["ManualSentiment"], row["BottomUpSentiment"]), axis=1)
df["agreement_vader"] = df.apply(lambda row: agreement_score(row["ManualSentiment"], row["VADER_Sentiment"]), axis=1)

# calculate non-weighted agreement: the share of reviews (0 to 1) where the method matches the manual label
def non_weighted_agreement(agreement_col):
    return df[agreement_col].mean()

agreement_summary = {
    "top-down agreement": non_weighted_agreement("agreement_topdown"),
    "bottom-up agreement": non_weighted_agreement("agreement_bottomup"),
    "vader agreement": non_weighted_agreement("agreement_vader")
}
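
The scores above treat every review equally, although Step 1 also collected a certainty rating. One possible extension, sketched here as an assumption rather than as part of the original exercise, is to weight each review's agreement by that rating so that confident labels count more. The helper name `weighted_agreement` is hypothetical.

```python
def weighted_agreement(agreements, certainties):
    # agreements: 0/1 values per review; certainties: ratings used as weights.
    pairs = list(zip(agreements, certainties))
    total_weight = sum(weight for _, weight in pairs)
    if total_weight == 0:
        # No ratings entered yet (all placeholders are 0): the score is undefined.
        return float("nan")
    return sum(agree * weight for agree, weight in pairs) / total_weight

# Made-up demo values: the single disagreement has low certainty,
# so the weighted score is higher than the plain mean of 2/3.
demo_agreement = [1, 0, 1]
demo_certainty = [5, 1, 4]
print(weighted_agreement(demo_agreement, demo_certainty))
```

In the notebook this could be called as `weighted_agreement(df["agreement_vader"], df["Certainty"])`. A large gap between the weighted and non-weighted scores suggests that a method mostly disagrees on the reviews that annotators themselves found ambiguous.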

# --- Step 6: Visualization of agreement scores ---

In [None]:
print("\n### Final Sentiment Analysis Results ###\n", df[["review", "ManualSentiment", "Certainty", "TopDownSentiment", "BottomUpSentiment", "VADER_Sentiment"]])
print("\n### Non-Weighted Agreement Scores ###\n", agreement_summary)

# plot agreement scores
plt.figure(figsize=(8, 5))
# Convert the dict views to lists so the call also works on older matplotlib versions
plt.bar(list(agreement_summary.keys()), list(agreement_summary.values()), color=['blue', 'green', 'red'])
plt.ylim(0, 1)
plt.xlabel("Sentiment analysis method")
plt.ylabel("Agreement score")
plt.title("Agreement between manual annotations and automated methods")
plt.show()