## <center> Project - Sentiment Analysis on Suka Dessert</center>

#### Group Sixth Sense
#### 1) Muhammad 'Umar bin Zolkifle, SW01082397
#### 2) Izzat Hatta bin Azizi, SW01082390
#### 3) Muhammad Hakimi bin Azizi, SW01082355
#### 4) Amirul Farhan bin Kamaruzaman, SW01082374
#### 5) Maizatul Aufa binti Zamidi, SW01082394
#### 6) Najah Zdafirah binti Mohamad Zakir, IS01082508

In [4]:
# import libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

In [5]:
# Load the processed reviews dataset
file_path = "processed-reviewsv2.csv"
df = pd.read_csv(file_path)

In [6]:
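# Join tokens from 'processed_review' column into a single string per row.
# ast.literal_eval parses the stringified token list without executing arbitrary code (unlike eval).
import ast
df['clean_text'] = df['processed_review'].apply(lambda x: " ".join(ast.literal_eval(x)) if isinstance(x, str) else "")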

In [7]:
# Apply keyword-based sentiment labeling on the cleaned text
positive_keywords = ['good', 'delicious', 'nice', 'great', 'tasty', 'love', 'friendly', 'awesome', 'perfect', 'best']
negative_keywords = ['bad', 'worst', 'slow', 'expensive', 'disappointed', 'not', 'poor', 'rude', 'overpriced', 'cold']

In [8]:
def label_sentiment(text):
    text = str(text).lower()
    if any(word in text for word in positive_keywords):
        return 'positive'
    elif any(word in text for word in negative_keywords):
        return 'negative'
    else:
        return 'neutral'

df['Sentiment'] = df['clean_text'].apply(label_sentiment)
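
One caveat with the `word in text` check above: it matches substrings, so `'not'` also fires on words such as "nothing" or "notice". Below is a minimal sketch of a whole-word variant, assuming `clean_text` holds space-separated tokens. The names `POSITIVE_WORDS`, `NEGATIVE_WORDS`, and `label_sentiment_tokens` are hypothetical and not part of the original notebook. Switching to this variant would change the labels and therefore the scores reported further down.

```python
# Same keyword lists as above, stored as sets for whole-word lookup (hypothetical names)
POSITIVE_WORDS = {'good', 'delicious', 'nice', 'great', 'tasty', 'love', 'friendly', 'awesome', 'perfect', 'best'}
NEGATIVE_WORDS = {'bad', 'worst', 'slow', 'expensive', 'disappointed', 'not', 'poor', 'rude', 'overpriced', 'cold'}

def label_sentiment_tokens(text):
    """Label a review by whole-word keyword matches; positive is checked first, as in the original."""
    tokens = set(str(text).lower().split())
    if tokens & POSITIVE_WORDS:
        return 'positive'
    elif tokens & NEGATIVE_WORDS:
        return 'negative'
    return 'neutral'

# 'nothing' contains the substring 'not' but is not the word 'not'
print(label_sentiment_tokens("nothing special about the cake"))
print(label_sentiment_tokens("service slow and staff rude"))
```

With substring matching, the first example would be labeled negative; with whole-word matching it falls through to neutral.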

In [9]:
# TF-IDF vectorization
tfidf = TfidfVectorizer(max_features=1000)
X = tfidf.fit_transform(df['clean_text'])
y = df['Sentiment']

In [10]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
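
Two refinements may be worth considering here. The cells above fit TF-IDF on the whole dataset before splitting, so the vocabulary and IDF weights also see the test reviews, and the split is not stratified. The sketch below splits the raw text first, passes `stratify` so both splits keep the same class proportions, and fits TF-IDF inside a pipeline on the training texts only. The `texts` and `labels` lists are made-up toy data, not the real reviews.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for df['clean_text'] and df['Sentiment'] (illustrative only)
texts = ["cake delicious", "staff friendly", "love dessert", "great taste",
         "service slow", "staff rude", "price expensive", "drink cold",
         "come with family", "order two cake", "shop near mall", "visit last week"]
labels = ["positive"] * 4 + ["negative"] * 4 + ["neutral"] * 4

# Split the raw text first; stratify keeps class proportions equal in both splits
texts_train, texts_test, y_train_s, y_test_s = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

# The pipeline fits TF-IDF on the training texts only, then trains the classifier
model = make_pipeline(TfidfVectorizer(max_features=1000), LinearSVC())
model.fit(texts_train, y_train_s)
predictions = model.predict(texts_test)

print(sorted(y_test_s))   # one review per class in the test split
print(list(predictions))
```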

In [11]:
# Define Evaluation Function
def evaluate_model(model, model_name):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    report = classification_report(y_test, y_pred, output_dict=True, zero_division=0)
    df_report = pd.DataFrame(report).transpose()
    print(f"\n--- {model_name} Evaluation Report ---")
    print(df_report[["precision", "recall", "f1-score"]])
    return df_report

In [12]:
# Train and Evaluate Models
logistic_report = evaluate_model(LogisticRegression(), "Logistic Regression")
naive_bayes_report = evaluate_model(MultinomialNB(), "Naive Bayes")
svm_report = evaluate_model(LinearSVC(), "SVM (Linear)")


--- Logistic Regression Evaluation Report ---
              precision    recall  f1-score
negative       0.000000  0.000000  0.000000
neutral        0.772727  0.680000  0.723404
positive       0.660000  0.916667  0.767442
accuracy       0.694444  0.694444  0.694444
macro avg      0.477576  0.532222  0.496949
weighted avg   0.598308  0.694444  0.634903

--- Naive Bayes Evaluation Report ---
              precision    recall  f1-score
negative       0.000000  0.000000  0.000000
neutral        1.000000  0.560000  0.717949
positive       0.620690  1.000000  0.765957
accuracy       0.694444  0.694444  0.694444
macro avg      0.540230  0.520000  0.494635
weighted avg   0.657567  0.694444  0.632266

--- SVM (Linear) Evaluation Report ---
              precision    recall  f1-score
negative       0.666667  0.181818  0.285714
neutral        0.703704  0.760000  0.730769
positive       0.738095  0.861111  0.794872
accuracy       0.722222  0.722222  0.722222
macro avg      0.702822  0.600976  0.6



## Explanation

Methods Used:
Logistic Regression, Naive Bayes, and a linear SVM, each trained on TF-IDF features to classify reviews as positive, neutral, or negative.

Why These Methods:

Logistic Regression:
A simple, widely used baseline for classification problems. It handles both binary and multi-class tasks (here: positive, neutral, and negative) and tends to give good results when the classes are close to linearly separable.

Naive Bayes:
Fast to train and well suited to text, which makes it a common choice for sentiment analysis. It assumes that words occur independently of one another given the class; despite that simplification, it often gives surprisingly good results on tasks like this.

SVM (Linear):
An SVM finds the boundary that separates the categories with the widest margin. The linear version (`LinearSVC`) trains quickly and is robust on high-dimensional data such as TF-IDF vectors, where every word is a feature.

Evaluating all three on the same train-test split gives a direct performance comparison for choosing the best model for real-world use.
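
A single 80/20 split on a small dataset can make the comparison noisy. As a possible complement (not something the notebook does), the three models could also be compared with stratified cross-validation. The sketch below uses made-up toy texts and only two folds to stay small; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for the review texts and their sentiment labels (illustrative only)
toy_texts = ["cake delicious", "staff friendly", "love dessert", "great taste",
             "service slow", "staff rude", "price expensive", "drink cold",
             "come with family", "order two cake", "shop near mall", "visit last week"]
toy_labels = ["positive"] * 4 + ["negative"] * 4 + ["neutral"] * 4

candidates = {
    "Logistic Regression": LogisticRegression(),
    "Naive Bayes": MultinomialNB(),
    "SVM (Linear)": LinearSVC(),
}

cv_scores = {}
for name, classifier in candidates.items():
    # TF-IDF sits inside the pipeline, so it is refit on each training fold
    pipeline = make_pipeline(TfidfVectorizer(), classifier)
    # For classifiers, an integer cv uses stratified folds
    cv_scores[name] = cross_val_score(pipeline, toy_texts, toy_labels, cv=2, scoring="f1_macro")
    print(f"{name}: mean macro F1 = {cv_scores[name].mean():.3f}")
```

Macro F1 is used as the score so that the minority class counts as much as the majority class.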

#### Model Effectiveness
Evaluation metrics used: accuracy, plus precision, recall, and F1-score per class.

| Model               | Accuracy  | Best F1-Score Class | Main Observation                                      |
| ------------------- | --------- | ------------------- | ----------------------------------------------------- |
| Logistic Regression | 69.4%     | Positive (0.77)     | No **negative** review identified correctly (F1 0.00) |
| Naive Bayes         | 69.4%     | Positive (0.77)     | No **negative** review identified correctly (F1 0.00) |
| SVM (LinearSVC)     | **72.2%** | Positive (0.79)     | Best balanced, but negative recall is only 0.18       |
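
As a worked check on how the figures in the reports relate to each other: F1 is the harmonic mean of precision and recall, and the macro average is the unweighted mean of the per-class scores. A class with an F1 of 0.00 therefore pulls the macro average down no matter how few reviews it contains. The sketch below reproduces two numbers from the Logistic Regression report; the helper name `f1_from` is illustrative.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall of the neutral class, copied from the Logistic Regression report
neutral_f1 = f1_from(0.772727, 0.680000)
print(round(neutral_f1, 6))

# Per-class F1-scores copied from the same report
f1_scores = {"negative": 0.000000, "neutral": 0.723404, "positive": 0.767442}

# Macro average: plain mean over classes, ignoring class sizes
macro_f1 = sum(f1_scores.values()) / len(f1_scores)
print(round(macro_f1, 6))
```

The two printed values correspond to the neutral `f1-score` (0.723404) and the `macro avg` F1 (0.496949) in the report.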


#### Observation:

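SVM outperformed the others, with the highest accuracy (72.2%) and the best F1-scores on both the neutral (0.73) and negative (0.29) classes. Its recall on the negative class is still low (0.18), so most negative reviews are missed.

Naive Bayes and Logistic Regression performed well on the positive and neutral classes but did not identify a single negative review correctly (precision, recall, and F1 of 0.00), likely due to class imbalance or a lack of strong negative signals in the dataset. The labeling rule also checks positive keywords first, so a review containing both positive and negative keywords is labeled positive, which shrinks the negative class further.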
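
One common way to counter class imbalance, offered here as a suggestion rather than something the notebook does, is to weight classes inversely to their frequency. In scikit-learn, `LogisticRegression` and `LinearSVC` accept `class_weight='balanced'`, which applies the formula `n_samples / (n_classes * class_count)`. The sketch below computes those weights by hand for made-up label counts, so the numbers are illustrative and not taken from the real dataset.

```python
from collections import Counter

# Hypothetical training labels (illustrative counts, not the real dataset)
y_train_example = ["positive"] * 50 + ["neutral"] * 35 + ["negative"] * 15

counts = Counter(y_train_example)
n_samples = len(y_train_example)
n_classes = len(counts)

# Same formula that class_weight='balanced' applies: rarer classes get larger weights
class_weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}

for label, weight in sorted(class_weights.items()):
    print(f"{label}: {weight:.3f}")
```

The resulting dictionary could be passed as `class_weight=class_weights`, or the two models could be created with `class_weight='balanced'` directly. `MultinomialNB` has no `class_weight` parameter, so it would need a different approach such as resampling.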



#### Visualize Results & Extract Insights

##### What patterns did you discover?
Positive was the most common class, which aligns with the strong performance of all models on it (F1 of 0.77 to 0.79).

Neutral reviews were also common and were handled reasonably well by all models (F1 of 0.72 to 0.73).

Negative reviews were sparse and harder to detect: Logistic Regression and Naive Bayes scored an F1 of 0.00 on them, and SVM reached only 0.29.

##### What insights can be derived from your analysis?
Suka Dessert receives mostly positive sentiment, which suggests that customers are generally satisfied.

The small number of negative reviews points to occasional issues (e.g., service or pricing) that could be analyzed further.

Tracking sentiment over time could help monitor business performance.