## BUSINESS UNDERSTANDING

## Business Overview

Social media platforms such as Twitter allow customers to freely share their opinions and experiences about products and brands, including those of companies such as Apple and Google. These brands are the subject of large numbers of tweets every day that reflect public sentiment, ranging from positive feedback to complaints. Analyzing this information manually is time-consuming, expensive, and unrealistic given the volume of unstructured text.

This project applies Natural Language Processing (NLP) and machine learning techniques to automatically analyze Twitter data and classify the sentiment expressed toward Apple and Google products. By automating sentiment analysis, organizations can gain timely insights into customer opinions, monitor brand perception, and respond more effectively to emerging issues and trends.

## Business Problem

Tweets about Apple and Google products contain valuable insights into public perception and customer satisfaction, but their unstructured nature and high volume make manual analysis impractical. The business problem addressed in this project is the need to automatically classify these tweets by the sentiment users express. By developing a machine learning-based natural language processing model that categorizes tweets as positive, negative, or neutral, organizations can more effectively monitor brand reputation, identify emerging issues, and support data-driven decision-making in marketing, customer service, and product development.

## DATA UNDERSTANDING

The dataset used in this project contains Twitter posts related to Apple and Google products that were collected and labeled through a crowdsourcing process. Each record represents a short, user-generated tweet expressing an opinion or reaction toward a brand or product. The dataset includes the original tweet text, the product or brand the emotion is directed at, and a sentiment label. The target variable categorizes tweets as positive, negative, neutral (no emotion), or unclear ("I can't tell").

Preliminary exploration shows that the tweet text is essentially complete (one missing value in 9,093 records), while the column naming the targeted product or brand is missing in 5,802 records and is therefore excluded from modeling. The dataset is also imbalanced: neutral tweets (5,389) far outnumber positive (2,978) and negative (570) ones. A clear understanding of the dataset's structure, quality, and class distribution is crucial for effective preprocessing, model development, and the selection of appropriate evaluation metrics.

## DATA ANALYSIS

In [27]:
import pandas as pd
import numpy as np
import sklearn

df = pd.read_csv("../data/judge-1377884607_tweet_product_company.csv", encoding="latin-1")
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [28]:
df.columns

Index(['tweet_text', 'emotion_in_tweet_is_directed_at',
       'is_there_an_emotion_directed_at_a_brand_or_product'],
      dtype='object')

In [29]:
df.shape

(9093, 3)

In [30]:
#checking for missing values
df.isnull().sum()

tweet_text                                               1
emotion_in_tweet_is_directed_at                       5802
is_there_an_emotion_directed_at_a_brand_or_product       0
dtype: int64

In [31]:
df["is_there_an_emotion_directed_at_a_brand_or_product"].value_counts()

No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

For sentiment classification, the target column is "is_there_an_emotion_directed_at_a_brand_or_product".
1. For binary classification, we keep only the positive and negative tweets and ignore the "I can't tell" and neutral ones.
2. We then bring the neutral tweets back and discuss the change in performance.
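
The binary setup in step 1 could be sketched as follows. The small DataFrame is a made-up stand-in for the real data, and the names `toy`, `df_binary`, and `label` are hypothetical; only the label strings come from the dataset.

```python
import pandas as pd

target = "is_there_an_emotion_directed_at_a_brand_or_product"

# Toy stand-in for the real DataFrame (illustrative rows only)
toy = pd.DataFrame({
    "tweet_text": [
        "love my ipad",
        "iphone battery died",
        "at the google booth",
        "hmm",
    ],
    target: [
        "Positive emotion",
        "Negative emotion",
        "No emotion toward brand or product",
        "I can't tell",
    ],
})

# Keep only rows labelled positive or negative
binary_labels = ["Positive emotion", "Negative emotion"]
df_binary = toy[toy[target].isin(binary_labels)].copy()

# Encode the target as 1 (positive) / 0 (negative)
df_binary["label"] = (df_binary[target] == "Positive emotion").astype(int)
print(df_binary[["tweet_text", "label"]])
```

Applying the same filter to `df` before vectorizing would give the binary dataset that the plan describes.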

In [32]:
# remove empty tweets as the model cannot learn from an empty tweet
df = df.dropna(subset=['tweet_text'])
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9092 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9092 non-null   object
dtypes: object(3)
memory usage: 284.1+ KB


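Next, we keep only the columns needed for modeling. The cells below still retain all four sentiment labels, so the results reported later are for the four-class problem.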

## NLP Processing

In [33]:
#creating a new dataset
# keeping relevant columns
df_clean = df[
    ['tweet_text', 'is_there_an_emotion_directed_at_a_brand_or_product']
].copy()


In [34]:
#rename columns for clarity
df_clean = df_clean.rename(
    columns={
        'tweet_text': 'tweet',
        'is_there_an_emotion_directed_at_a_brand_or_product': 'sentiment'
    }
)

df_clean.head()



Unnamed: 0,tweet,sentiment
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Positive emotion


At this point, the raw tweets still contain URLs, mentions (@user), hashtags, punctuation, and numbers. Most of this is noise for sentiment classification, so we clean it:

## Text Cleaning

In [35]:
import re

def clean_tweet(text):
    text = text.lower()                         # normalize case
    text = re.sub(r"http\S+", "", text)         # remove URLs
    text = re.sub(r"@\w+", "", text)            # remove mentions
    text = re.sub(r"#", "", text)               # remove hashtag symbol
    text = re.sub(r"[^a-z\s]", "", text)        # remove punctuation/numbers
    return text


In [36]:
#applying cleaning
df_clean['clean_tweet'] = df_clean['tweet'].apply(clean_tweet)

df_clean[['tweet', 'clean_tweet']].head()

# Note: the cleaning regex also strips meaningful numbers (e.g. "3G", "iPad 2")
# Both the raw and the cleaned text are kept for comparison


Unnamed: 0,tweet,clean_tweet
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,i have a g iphone after hrs tweeting at rise...
1,@jessedee Know about @fludapp ? Awesome iPad/i...,know about awesome ipadiphone app that youl...
2,@swonderlin Can not wait for #iPad 2 also. The...,can not wait for ipad also they should sale ...
3,@sxsw I hope this year's festival isn't as cra...,i hope this years festival isnt as crashy as ...
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,great stuff on fri sxsw marissa mayer google ...
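
The output above shows that product identifiers such as "3G" lose their digits. One possible variant, sketched here under the assumption that digits are worth keeping, also drops the `{link}` placeholder and collapses repeated whitespace. The function name `clean_tweet_keep_digits` is hypothetical and is not part of the notebook.

```python
import re

def clean_tweet_keep_digits(text):
    """Variant of clean_tweet that keeps digits (assumption: they carry meaning)."""
    text = text.lower()                        # normalize case
    text = re.sub(r"http\S+", "", text)        # remove URLs
    text = re.sub(r"@\w+", "", text)           # remove mentions
    text = re.sub(r"\{link\}", "", text)       # remove the {link} placeholder
    text = re.sub(r"[^a-z0-9\s]", "", text)    # drop punctuation, keep letters and digits
    return " ".join(text.split())              # collapse runs of whitespace

print(clean_tweet_keep_digits(".@wesley83 I have a 3G iPhone. After 3 hrs"))
print(clean_tweet_keep_digits("HTML 5 and you. At the google booth. #sxsw {link}"))
```

Whether keeping digits helps the classifier is an empirical question; the sketch only shows how the cleaning step would change.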


## EDA

## Frequent Words (Overall)

In [37]:
# checking for the most frequently used words overall
from collections import Counter

all_words = " ".join(df_clean['clean_tweet']).split()
Counter(all_words).most_common(20)


[('sxsw', 9535),
 ('the', 4373),
 ('link', 4284),
 ('to', 3589),
 ('at', 3069),
 ('rt', 2959),
 ('ipad', 2875),
 ('for', 2545),
 ('google', 2337),
 ('a', 2281),
 ('apple', 2146),
 ('in', 1931),
 ('of', 1714),
 ('is', 1711),
 ('and', 1618),
 ('iphone', 1517),
 ('store', 1467),
 ('on', 1330),
 ('new', 1089),
 ('i', 1071)]
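
The list above is dominated by stop words and by event tokens such as "sxsw", "link", and "rt". A minimal sketch of filtering them before counting is shown below. The stop-word set is a deliberately small, hypothetical one, and `clean_tweets` is a toy stand-in for `df_clean['clean_tweet']`.

```python
from collections import Counter

# Hypothetical, deliberately small stop-word list; a fuller list would be used in practice
stop_words = {"the", "to", "at", "for", "a", "in", "of", "is", "and", "on", "i"}
# Tokens that appear in almost every tweet in this dataset
domain_noise = {"sxsw", "link", "rt"}
excluded = stop_words | domain_noise

# Toy stand-in for df_clean['clean_tweet']
clean_tweets = [
    "rt the new ipad is great at sxsw link",
    "google maps is great for sxsw",
    "the iphone battery died at sxsw link",
]

# Count only the tokens that are not excluded
tokens = " ".join(clean_tweets).split()
content_words = [w for w in tokens if w not in excluded]
top_words = Counter(content_words).most_common(3)
print(top_words)
```

The TF-IDF step later in the notebook already removes English stop words, so this filter mainly makes the exploratory counts easier to read.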

## Frequent Words by Sentiment

In [38]:
#checking for frequent words by sentiment
for s in df_clean['sentiment'].unique():
    words = " ".join(
        df_clean[df_clean['sentiment'] == s]['clean_tweet']
    ).split()
    print(f"\n{s} sentiment:")
    print(Counter(words).most_common(10))



Negative emotion sentiment:
[('sxsw', 580), ('the', 301), ('to', 256), ('ipad', 194), ('is', 160), ('iphone', 157), ('a', 151), ('google', 140), ('at', 137), ('rt', 137)]

Positive emotion sentiment:
[('sxsw', 3112), ('the', 1583), ('link', 1209), ('ipad', 1196), ('to', 1160), ('at', 1007), ('rt', 935), ('for', 908), ('apple', 838), ('a', 781)]

No emotion toward brand or product sentiment:
[('sxsw', 5685), ('link', 2926), ('the', 2419), ('to', 2125), ('at', 1874), ('rt', 1853), ('google', 1517), ('for', 1476), ('ipad', 1438), ('a', 1305)]

I can't tell sentiment:
[('sxsw', 158), ('the', 70), ('at', 51), ('to', 48), ('link', 48), ('ipad', 47), ('a', 44), ('google', 44), ('for', 41), ('is', 39)]


## Noise and Quality Check

In [39]:
#noise check and text quality
df_clean[['tweet', 'clean_tweet']].sample(5)


Unnamed: 0,tweet,clean_tweet
3359,A really great comparison review of the iPad 2...,a really great comparison review of the ipad ...
2484,HTML 5 and you. At the google booth. #sxsw {l...,html and you at the google booth sxsw link
8023,Final session today. Left Brain Search = Googl...,final session today left brain search google ...
4718,Thinking now that I shoulda gone to SXSW this ...,thinking now that i shoulda gone to sxsw this ...
2425,Sosososo cuteeeeee {link} #thingsthatdontgotog...,sosososo cuteeeeee link thingsthatdontgotogeth...


## MODELING

## Feature Engineering (TF-IDF Vectorization)

Since machine learning models require numerical input, cleaned tweets are converted into numerical features using Term Frequency–Inverse Document Frequency (TF-IDF).
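
To make the weighting concrete, here is a small hand-computed sketch on a toy corpus. It uses the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization. The corpus and the helper name `tfidf_vector` are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (illustrative only)
docs = ["ipad is great", "ipad battery died", "google maps is great"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
df_counts = Counter(term for doc in tokenized for term in set(doc))

def tfidf_vector(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    tf = Counter(tokens)
    weights = {
        t: tf[t] * (math.log((1 + n_docs) / (1 + df_counts[t])) + 1)
        for t in tf
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf_vector(tokenized[1])
for term, weight in vec.items():
    print(f"{term}: {weight:.3f}")
```

In the second document, "battery" and "died" occur in only one document each, so they receive a higher weight than "ipad", which occurs in two.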

In [40]:
from sklearn.feature_extraction.text import TfidfVectorizer

X = df_clean['clean_tweet']
y = df_clean['sentiment']

tfidf = TfidfVectorizer(
    stop_words='english',
    max_features=5000,
    ngram_range=(1, 2)
)

X_tfidf = tfidf.fit_transform(X)


## Train-Test Split

In [41]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
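
Note that the vectorizer above is fit on all tweets before the split, so vocabulary and IDF statistics from the test tweets influence the features. One common way to avoid this, sketched here on made-up data with hypothetical names, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so that TF-IDF is fit on the training portion only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy stand-in data (illustrative only)
texts = [
    "love my new ipad", "great iphone app", "awesome google maps", "ipad is amazing",
    "iphone battery died", "google search is broken", "hate the ipad keyboard",
    "terrible iphone app crash",
]
labels = ["pos"] * 4 + ["neg"] * 4

# Split the raw text first, then let the pipeline fit TF-IDF on the training part only
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)
preds = pipe.predict(X_te)
print(list(zip(X_te, preds)))
```

With data this small the predictions are not meaningful; the point is only the order of operations.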


## Baseline Model: Logistic Regression

In [42]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)


LogisticRegression(max_iter=1000)

## Model Evaluation

In [43]:
from sklearn.metrics import classification_report, confusion_matrix

y_pred = log_reg.predict(X_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))


                                    precision    recall  f1-score   support

                      I can't tell       0.00      0.00      0.00        31
                  Negative emotion       0.64      0.06      0.11       114
No emotion toward brand or product       0.69      0.89      0.77      1078
                  Positive emotion       0.65      0.45      0.53       596

                          accuracy                           0.68      1819
                         macro avg       0.49      0.35      0.35      1819
                      weighted avg       0.66      0.68      0.64      1819

[[  0   0  25   6]
 [  0   7  85  22]
 [  0   3 958 117]
 [  0   1 327 268]]
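
The report and the matrix are two views of the same numbers. In scikit-learn's convention rows are true classes and columns are predicted classes, so per-class recall and precision can be recomputed directly from the matrix above, as in this short sketch.

```python
labels = [
    "I can't tell",
    "Negative emotion",
    "No emotion toward brand or product",
    "Positive emotion",
]

# Confusion matrix copied from the output above (rows = true, columns = predicted)
cm = [
    [0, 0, 25, 6],
    [0, 7, 85, 22],
    [0, 3, 958, 117],
    [0, 1, 327, 268],
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

recall, precision = {}, {}
for i, name in enumerate(labels):
    row_sum = sum(cm[i])                    # all tweets whose true class is `name`
    col_sum = sum(row[i] for row in cm)     # all tweets predicted as `name`
    recall[name] = cm[i][i] / row_sum
    precision[name] = cm[i][i] / col_sum if col_sum else 0.0

print(f"accuracy: {accuracy:.2f}")
for name in labels:
    print(f"{name}: precision={precision[name]:.2f} recall={recall[name]:.2f}")
```

The second row shows why negative recall is so low: 85 of the 114 negative test tweets are predicted as neutral.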




In [44]:
#Addressing class imbalance
log_reg_balanced = LogisticRegression(
    max_iter=1000,
    class_weight='balanced'
)

log_reg_balanced.fit(X_train, y_train)
y_pred_balanced = log_reg_balanced.predict(X_test)

print(classification_report(y_test, y_pred_balanced))



                                    precision    recall  f1-score   support

                      I can't tell       0.00      0.00      0.00        31
                  Negative emotion       0.30      0.55      0.39       114
No emotion toward brand or product       0.76      0.65      0.70      1078
                  Positive emotion       0.56      0.59      0.58       596

                          accuracy                           0.61      1819
                         macro avg       0.40      0.45      0.42      1819
                      weighted avg       0.65      0.61      0.63      1819
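
For reference, scikit-learn documents `class_weight='balanced'` as weighting each class by `n_samples / (n_classes * count)`. The sketch below applies that formula to the label counts from the full dataset shown earlier, purely as an illustration; the model itself derives the weights from the training split, so the exact values differ slightly.

```python
# Label counts taken from the value_counts() output earlier in the notebook
counts = {
    "No emotion toward brand or product": 5389,
    "Positive emotion": 2978,
    "Negative emotion": 570,
    "I can't tell": 156,
}

n_samples = sum(counts.values())
n_classes = len(counts)

# Balanced weight: rarer classes receive proportionally larger weights
weights = {name: n_samples / (n_classes * c) for name, c in counts.items()}

for name, w in weights.items():
    print(f"{name}: {w:.2f}")
```

A misclassified negative tweet is therefore penalized several times more heavily than a misclassified neutral one, which is consistent with the jump in negative recall and the drop in overall accuracy.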



## Model Comparison 

In [45]:
# To strengthen the modeling section, an additional algorithm such as Multinomial Naive Bayes can be evaluated.
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb.fit(X_train, y_train)

y_pred_nb = nb.predict(X_test)

print(classification_report(y_test, y_pred_nb))


                                    precision    recall  f1-score   support

                      I can't tell       0.00      0.00      0.00        31
                  Negative emotion       0.55      0.05      0.10       114
No emotion toward brand or product       0.68      0.88      0.77      1078
                  Positive emotion       0.64      0.44      0.52       596

                          accuracy                           0.67      1819
                         macro avg       0.47      0.34      0.35      1819
                      weighted avg       0.65      0.67      0.63      1819





## Multiclass Sentiment Classification

In [49]:
df_multi = df[
    ['tweet_text', 'is_there_an_emotion_directed_at_a_brand_or_product']
].copy()

df_multi = df_multi.rename(columns={
    'tweet_text': 'tweet',
    'is_there_an_emotion_directed_at_a_brand_or_product': 'sentiment'
})

df_multi['clean_tweet'] = df_multi['tweet'].apply(clean_tweet)


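## Binary vs Multiclass Performance Discussion

- A binary (positive vs. negative) model would be expected to score higher because its decision boundary is simpler; no binary run is reported in the cells above.
- The multiclass setting introduces ambiguity, especially between neutral and weakly expressed sentiment: the baseline labels 327 of 596 positive test tweets and 85 of 114 negative ones as neutral.
- Neutral tweets are the main source of class imbalance (5,389 of 9,093 labels).
- Model selection depends on business objectives:
  - Binary: clearer action signals
  - Multiclass: richer sentiment insight

## Modeling Summary

- TF-IDF unigram and bigram features give a workable representation of the tweets, although they capture word usage rather than deeper semantics.
- Logistic Regression provides an interpretable baseline (accuracy 0.68, macro F1 0.35); Multinomial Naive Bayes performs almost identically (accuracy 0.67, macro F1 0.35).
- Class weighting raises negative-class recall from 0.06 to 0.55 and macro F1 from 0.35 to 0.42, at the cost of overall accuracy (0.68 to 0.61).
- No model predicts the "I can't tell" class correctly (F1 of 0.00 in every run).
- Including neutral sentiment lowers performance but makes the task more realistic.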