# **Twitter Sentiment Analysis** 

### Understanding Brand Perception Through Natural Language Processing

## Business Problem and Context

In today's digital marketplace, social media has become the primary channel where customers express their opinions about products and brands. For companies like Apple and Google, understanding public sentiment on Twitter can provide invaluable insights into:

- **Product reception**: How are new product launches being received?
- **Brand health**: What's the overall sentiment toward our brand vs competitors?
- **Crisis detection**: Can we identify emerging negative sentiment before it becomes a PR crisis?
- **Customer service prioritization**: Which complaints need immediate attention?

**The Challenge**: Manually analyzing thousands of tweets daily is impossible. Customer service teams are overwhelmed, and by the time negative sentiment is identified through traditional methods, brand damage may already be done.

**Our Solution**: Build an automated sentiment classification system that can process Twitter data in real time, categorizing tweets as positive, negative, neutral, or uncertain. This system will help stakeholders:

1. Monitor brand health continuously
2. Identify trending issues early
3. Route negative sentiment to customer service teams
4. Measure campaign effectiveness

**Success Metrics**: We aim to build a model that classifies tweet sentiment with high precision and recall, particularly for the negative class, where misclassification is most costly from a business perspective.

## Table of Contents

1. [Business Understanding](#business-problem-and-context)
2. [Data Understanding](#data-understanding)
3. [Data Preparation](#data-preparation)
4. [Exploratory Data Analysis](#exploratory-data-analysis)
5. [Text Preprocessing & Feature Engineering](#text-preprocessing-and-feature-engineering)
6. [Modeling](#modeling)
7. [Model Evaluation & Interpretation](#model-evaluation-and-interpretation)
8. [Conclusions & Recommendations](#conclusions-and-recommendations)

## Data Understanding

Our dataset consists of tweets collected during the South by Southwest (SXSW) conference, where Apple and Google products were prominently discussed. Each tweet has been manually labeled by human judges for sentiment toward specific brands or products.

This is a real-world dataset with all its messiness: typos, slang, emojis, hashtags, and the informal language typical of social media. Understanding this data is our first critical step.
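
As a rough way to quantify this messiness, the sketch below measures how often a few common Twitter artifacts appear. The sample tweets and the regular expressions are illustrative assumptions, not taken from the dataset or from the cleaning rules used later in this notebook.

```python
import pandas as pd

# Hypothetical tweets, written for illustration only (not from the dataset)
sample = pd.Series([
    "@user1 my #iPhone battery died again at #SXSW",
    "Loving the new iPad!!! http://example.com",
    "google maps is sooo good",
])

# Simple, assumed patterns for common Twitter artifacts
patterns = {
    "mention": r"@\w+",
    "hashtag": r"#\w+",
    "url": r"https?://\S+",
}

# Share of tweets that contain each artifact at least once
noise_share = {
    name: float(sample.str.contains(pat, regex=True).mean())
    for name, pat in patterns.items()
}
print(noise_share)
```

Running the same check on `df['tweet_text']` would give a first sense of how much cleaning the text is likely to need.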


### Importing Libraries

In [3]:
# Core data manipulation and analysis
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Text processing and NLP
import re
import string
from collections import Counter

# NLTK for advanced NLP tasks (required by rubric)
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.util import ngrams

# Download required NLTK data
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('punkt_tab', quiet=True)

# Scikit-learn for machine learning
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    classification_report, confusion_matrix, 
    accuracy_score, precision_recall_fscore_support
)

# For model persistence
import joblib

# Set random seed for reproducibility
np.random.seed(42)

### Loading and Initial Exploration

In [5]:
df = pd.read_csv("../data/judge-1377884607_tweet_product_company.csv", encoding="latin1")


# basic information
print("Dataset Shape:", df.shape)
print("\n" + " "*5)
print("Column Names:")
print(" "*3)
for col in df.columns:
    print(f"  - {col}")

print("\n" + " "*60)
print("First Few Rows:")
print(" "*6)
df.head()

Dataset Shape: (9093, 3)

     
Column Names:
   
  - tweet_text
  - emotion_in_tweet_is_directed_at
  - is_there_an_emotion_directed_at_a_brand_or_product

                                                            
First Few Rows:
      


| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


### Rename columns for clarity and ease of use

In [7]:
df.columns = ['tweet', 'product', 'sentiment']

print("Renamed columns:")
df.head()


Renamed columns:


| | tweet | product | sentiment |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


### Dataset Overview

We start by checking the structure and completeness of our data.

In [None]:
print("Dataset Information:")
print(" "*6)

df.info()


The DataFrame has 9,093 entries, and two of its three columns (`tweet` and `product`) contain null values.

In [None]:
print("Missing Values:")

df.isnull().sum()

In [None]:
df.duplicated().sum()

There are 22 duplicated rows in our dataset.
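
Before dropping duplicates, it can help to look at them. The sketch below uses a small hypothetical frame (not the real data) to show what `duplicated()` counts: by default only the extra copies are flagged, and a tweet repeated with a different label is not a full-row duplicate.

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "tweet": ["RT great demo", "RT great demo", "battery died", "RT great demo"],
    "sentiment": ["Positive emotion", "Positive emotion",
                  "Negative emotion", "Negative emotion"],
})

# keep=False flags every member of a duplicate group, which is handy for inspection
dupes = toy[toy.duplicated(keep=False)]

# The default keep='first' counts only the extra copies, like df.duplicated().sum()
n_extra = int(toy.duplicated().sum())

# Same text with a different label is only caught when comparing on the text column
n_same_text = int(toy.duplicated(subset=["tweet"]).sum())

print(dupes)
print(n_extra, n_same_text)
```

If `subset=['tweet']` finds more duplicates than the full-row check on the real data, some tweets carry conflicting labels and may deserve a closer look.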


### Class Imbalance Analysis

In [None]:
df['sentiment'].value_counts(normalize = True) * 100

The dataset shows class imbalance: `No emotion toward brand or product` accounts for approximately 59% of all observations and `Positive emotion` for about 32%, while `Negative emotion` (≈6%) and `I can't tell` (≈2%) are underrepresented. This may bias the model toward the majority class.
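
One common way to counter this kind of bias is class weighting. The sketch below is only an illustration on toy labels that roughly follow the proportions above; it is not necessarily the approach taken later in this notebook. It uses scikit-learn's `compute_class_weight` to show how much more weight the rare classes would receive.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels that roughly follow the reported proportions (illustrative only)
y = np.array(
    ["No emotion"] * 59 + ["Positive"] * 32 + ["Negative"] * 6 + ["I can't tell"] * 2
)
classes = np.unique(y)

# 'balanced' weight for a class = n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weights = dict(zip(classes, weights))

for label, weight in class_weights.items():
    print(f"{label}: {weight:.2f}")
```

Estimators such as `LogisticRegression` and `LinearSVC` accept `class_weight='balanced'` directly, which applies the same formula internally.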


## Data Preparation

Data cleaning is crucial for NLP. Poor quality data leads to poor model performance, regardless of model sophistication.


### Handling Missing Values

In [None]:
df.head()

In [None]:
# Removing the one row with missing tweet text (can't analyze what doesn't exist)
df = df.dropna(subset=['tweet'])

# For missing products, we'll fill with 'Unknown' rather than dropping
# These tweets still have sentiment and can be valuable for analysis
df['product'] = df['product'].fillna('Unknown')

print(f"Dataset shape after handling missing values: {df.shape}")
print("\nRemaining missing values:")
print(df.isnull().sum())
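
The drop-then-fill strategy above can be checked on a small hypothetical frame. This is a sketch with made-up rows; the explicit `.copy()` is an added precaution so that the later column assignment works on an independent frame.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "tweet": ["love my iPad", np.nan, "maps crashed again"],
    "product": ["iPad", np.nan, np.nan],
    "sentiment": ["Positive emotion",
                  "No emotion toward brand or product",
                  "Negative emotion"],
})

# Drop rows without text; .copy() makes the result independent of the original frame
cleaned = toy.dropna(subset=["tweet"]).copy()

# Keep rows with an unlabeled product by using a placeholder value
cleaned["product"] = cleaned["product"].fillna("Unknown")

missing_after = cleaned.isnull().sum()
print(cleaned)
print(missing_after)
```

Only the row without text is lost; the tweet with a missing product stays in the data with its sentiment label intact.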

### Removing Duplicates

Duplicate tweets can skew our model by giving certain patterns excessive weight. We'll remove them to ensure each unique opinion is counted once.

In [None]:
df = df.drop_duplicates()
print(f"Dataset shape after removing duplicates: {df.shape}")

### Understanding Sentiment Distribution

In [None]:
# Examine sentiment distribution
sentiment_counts = df['sentiment'].value_counts()
sentiment_percentages = df['sentiment'].value_counts(normalize=True) * 100

sentiment_summary = pd.DataFrame({
    'Count': sentiment_counts,
    'Percentage': sentiment_percentages.round(2)
})

print("Sentiment Distribution:")
print(sentiment_summary)

In [None]:
# Visualizing the sentiment distribution

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# One color per class, in value_counts() order (most to least frequent);
# shared by both charts so each class keeps the same color
colors = ['gray', 'green', 'red', 'orange']

# Count plot
sentiment_counts.plot(kind='bar', ax=ax1, color=colors)
ax1.set_title('Sentiment Distribution (Counts)', fontsize=14, fontweight='bold')
ax1.set_xlabel('Sentiment', fontsize=12)
ax1.set_ylabel('Number of Tweets', fontsize=12)
ax1.tick_params(axis='x', rotation=45)

# Pie chart
ax2.pie(
    sentiment_counts,
    labels=sentiment_counts.index,
    autopct='%1.1f%%',
    startangle=90,
    colors=colors
)
ax2.set_title('Sentiment Distribution (Proportions)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

**Critical Business Insight:**

The data reveals a **significant class imbalance**:
- ~59% of tweets have **no emotion** toward products
- ~33% express **positive emotion**
- Only ~6% express **negative emotion**
- ~2% are **uncertain**

This imbalance is plausible: many brand mentions on social media carry no clear sentiment, and attendees at an event like SXSW may be more inclined to share positive reactions than negative ones.

**Modeling Implications:**
1. We cannot rely solely on accuracy as our metric
2. We need to carefully consider precision and recall, especially for negative sentiment
3. The small "I can't tell" category may not be worth modeling separately
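
To make the first point concrete, here is a minimal sketch on toy labels that roughly follow the proportions above (the labels are illustrative, not model output). A baseline that always predicts the majority class reaches a deceptively high accuracy while finding no negative tweets at all.

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy ground truth roughly following the reported proportions (illustrative only)
y_true = (
    ["No emotion"] * 59 + ["Positive"] * 33 + ["Negative"] * 6 + ["I can't tell"] * 2
)

# A trivial baseline that always predicts the majority class
y_pred = ["No emotion"] * len(y_true)

accuracy = accuracy_score(y_true, y_pred)

# Restricting labels to one class yields the recall for that class alone
negative_recall = recall_score(
    y_true, y_pred, labels=["Negative"], average="macro", zero_division=0
)

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall (Negative): {negative_recall:.2f}")
```

Any real model should therefore be judged against this baseline on per-class precision and recall, not on accuracy alone.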




### Refining Our Target Variable

We create a simplified 3-class problem by combining `No emotion toward brand or product` and `I can't tell` into a single "Neutral" category. This makes business sense: from a customer service perspective, uncertain sentiment is functionally similar to neutral.

In [None]:
# Creating a simplified sentiment mapping
sentiment_mapping = {
    'Positive emotion': 'Positive',
    'Negative emotion': 'Negative',
    'No emotion toward brand or product': 'Neutral',
    "I can't tell": 'Neutral'
}

df['sentiment_clean'] = df['sentiment'].map(sentiment_mapping)

# Verifying the mapping
print("Simplified Sentiment Distribution:")
print(" "*1)
print(df['sentiment_clean'].value_counts())
print("\nPercentages:")
print(df['sentiment_clean'].value_counts(normalize=True).mul(100).round(2))

In [None]:
df.head()
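
One caveat with a dictionary mapping: `Series.map` returns `NaN` for any value that is not a key, so a label with a stray space or different casing would silently become missing. The sketch below shows a quick check on a hypothetical series; the mislabeled value with a trailing space is invented for illustration.

```python
import pandas as pd

sentiment_mapping = {
    "Positive emotion": "Positive",
    "Negative emotion": "Negative",
    "No emotion toward brand or product": "Neutral",
    "I can't tell": "Neutral",
}

# Hypothetical labels; the third one has a trailing space on purpose
toy = pd.Series([
    "Positive emotion",
    "I can't tell",
    "Negative emotion ",
    "No emotion toward brand or product",
])

mapped = toy.map(sentiment_mapping)

# Original labels that did not match any key in the mapping
unmapped = toy[mapped.isna()].unique().tolist()
print("Unmapped labels:", unmapped)
```

On the real data, an empty list here (or `df['sentiment_clean'].isna().sum() == 0`) confirms that every original label was covered by the mapping.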

## Exploratory Data Analysis

Next, we look for patterns in our text data. What makes a tweet positive vs negative? Are there specific words or phrases that signal sentiment?