## BUSINESS UNDERSTANDING

## Business Overview

Social media platforms such as Twitter allow customers to freely share their opinions and experiences about products and brands. Companies such as Apple and Google are the subject of large numbers of tweets every day, and these tweets reflect public sentiment, including both positive feedback and negative complaints. Analyzing this information manually is time-consuming, expensive, and unrealistic given the large volume of unstructured text.

This project applies Natural Language Processing (NLP) and machine learning techniques to automatically analyze Twitter data and classify the sentiment expressed toward Apple and Google products. By automating sentiment analysis, organizations can gain timely insights into customer opinions, monitor brand perception, and respond more effectively to emerging issues and trends.

## Business Problem

Tweets about Apple and Google products contain valuable insights into public perception and customer satisfaction, but their unstructured nature and high volume make manual analysis impractical. The business problem addressed in this project is the need to automatically classify tweets related to Apple and Google products by the sentiment they express. A machine-learning-based NLP model that categorizes tweets as positive, negative, or neutral lets organizations monitor brand reputation more effectively, identify emerging issues, and support data-driven decisions in marketing, customer service, and product development.

## DATA UNDERSTANDING

The dataset used in this project contains Twitter posts related to Apple and Google products that were collected and labeled through a crowdsourcing process. Each record represents a short, user-generated tweet expressing an opinion or reaction toward a brand or product. The dataset includes the original tweet text, the product or brand the emotion is directed at, and a label indicating whether an emotion is directed toward a brand or product. This label is the target variable; it categorizes tweets as positive, negative, neutral (no emotion), or unclear.

Preliminary exploration shows that the tweet text is almost complete, with a single missing value, while the column naming the targeted product is missing for most rows and is therefore excluded from modeling. The dataset also exhibits class imbalance, with neutral or non-emotional tweets appearing more frequently than positive or negative ones. A clear understanding of the dataset’s structure, quality, and class distribution is crucial for effective preprocessing, model development, and the selection of appropriate evaluation metrics.

## DATA ANALYSIS

In [10]:
import pandas as pd
import numpy as np
import sklearn

df = pd.read_csv("../data/judge-1377884607_tweet_product_company.csv", encoding="latin-1")
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [11]:
df.columns

Index(['tweet_text', 'emotion_in_tweet_is_directed_at',
       'is_there_an_emotion_directed_at_a_brand_or_product'],
      dtype='object')

In [15]:
df.shape

(9093, 3)

In [None]:
# check for missing values in each column
df.isnull().sum()

tweet_text                                               1
emotion_in_tweet_is_directed_at                       5802
is_there_an_emotion_directed_at_a_brand_or_product       0
dtype: int64
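
To put these counts in proportion, the sketch below converts them into per-column missing fractions. It re-enters the numbers printed above by hand so that it runs on its own; the names `missing`, `missing_frac`, and `THRESHOLD`, and the 50% cutoff itself, are illustrative assumptions rather than part of the notebook. On the real DataFrame, `df.isnull().mean()` should give the same fractions directly.

```python
import pandas as pd

# Missing-value counts copied from the isnull().sum() output above.
missing = pd.Series(
    {
        "tweet_text": 1,
        "emotion_in_tweet_is_directed_at": 5802,
        "is_there_an_emotion_directed_at_a_brand_or_product": 0,
    }
)
n_rows = 9093  # row count from df.shape above

# Fraction of rows with a missing value, per column.
missing_frac = missing / n_rows

# Hypothetical cutoff: flag columns missing in more than half of the rows.
THRESHOLD = 0.5
too_sparse = missing_frac[missing_frac > THRESHOLD].index.tolist()

print(missing_frac.round(3))
print(too_sparse)
```

Under this assumed cutoff only `emotion_in_tweet_is_directed_at` is flagged, since it is missing in roughly 64% of rows.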

In [16]:
df["is_there_an_emotion_directed_at_a_brand_or_product"].value_counts()

No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64
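
The counts above can be expressed as class shares to make the imbalance explicit. This is a minimal sketch that re-enters the printed counts by hand so it is self-contained; `counts`, `shares`, and `pos_to_neg` are illustrative names. In the notebook, calling `value_counts(normalize=True)` on the same column should yield the same proportions.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series(
    {
        "No emotion toward brand or product": 5389,
        "Positive emotion": 2978,
        "Negative emotion": 570,
        "I can't tell": 156,
    }
)

# Share of each class as a fraction of all labeled tweets.
shares = counts / counts.sum()
print(shares.round(3))

# Ratio of positive to negative tweets, relevant to the binary task.
pos_to_neg = counts["Positive emotion"] / counts["Negative emotion"]
print(f"positive : negative = {pos_to_neg:.1f} : 1")
```

Neutral tweets make up about 59% of the data, and positive tweets outnumber negative ones by roughly five to one, so accuracy alone is likely to be a misleading metric for either task.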

For sentiment classification, the target column is "is_there_an_emotion_directed_at_a_brand_or_product". The plan is:
1. For binary classification, keep only the positive and negative tweets, ignoring the neutral and "I can't tell" classes.
2. Bring the neutral class back and discuss how performance changes.

In [22]:
# remove empty tweets as the model cannot learn from an empty tweet
df = df.dropna(subset=['tweet_text'])
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9092 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9092 non-null   object
dtypes: object(3)
memory usage: 284.1+ KB


For the binary task, we keep only the positive and negative tweets.
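
One way this filtering step could look is sketched below. It runs on a small made-up DataFrame (`toy`) that reuses the column name and label strings shown above; the example tweets, the `label` column, and the 1/0 encoding are assumptions for illustration, not the notebook's own code.

```python
import pandas as pd

# Toy stand-in for df, using the column name and label strings shown above.
target = "is_there_an_emotion_directed_at_a_brand_or_product"
toy = pd.DataFrame(
    {
        "tweet_text": ["love my ipad", "battery died again", "at sxsw now", "hmm"],
        target: [
            "Positive emotion",
            "Negative emotion",
            "No emotion toward brand or product",
            "I can't tell",
        ],
    }
)

# Keep only rows labeled positive or negative.
binary_labels = ["Positive emotion", "Negative emotion"]
toy_binary = toy[toy[target].isin(binary_labels)].copy()

# Hypothetical numeric encoding: 1 = positive, 0 = negative.
toy_binary["label"] = (toy_binary[target] == "Positive emotion").astype(int)

print(toy_binary[["tweet_text", "label"]])
```

Applied to the full dataset, the same filter would keep roughly 3,500 tweets, given the 2,978 positive and 570 negative labels counted earlier.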

## NLP Processing

In [24]:
# create a new DataFrame, keeping only the relevant columns
df_clean = df[
    ['tweet_text', 'is_there_an_emotion_directed_at_a_brand_or_product']
].copy()


In [25]:
# rename columns for clarity
df_clean = df_clean.rename(
    columns={
        'tweet_text': 'tweet',
        'is_there_an_emotion_directed_at_a_brand_or_product': 'sentiment'
    }
)

df_clean.head()



Unnamed: 0,tweet,sentiment
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Positive emotion


At this point, the raw tweets still contain URLs, mentions (@user), hashtag symbols, punctuation, and numbers, which are treated here as noise that carries little meaning for sentiment. We clean them as follows:

In [26]:
import re

def clean_tweet(text):
    text = text.lower()                         # normalize case
    text = re.sub(r"http\S+", "", text)         # remove URLs
    text = re.sub(r"@\w+", "", text)            # remove mentions
    text = re.sub(r"#", "", text)               # remove hashtag symbol
    text = re.sub(r"[^a-z\s]", "", text)        # remove punctuation/numbers
    return text


In [None]:
# apply the cleaning function, keeping both the raw and the cleaned text
df_clean['clean_tweet'] = df_clean['tweet'].apply(clean_tweet)

df_clean[['tweet', 'clean_tweet']].head()

# open question: this also strips numbers that carry meaning (e.g. the "3" in "3G"); is that desirable?


Unnamed: 0,tweet,clean_tweet
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,i have a g iphone after hrs tweeting at rise...
1,@jessedee Know about @fludapp ? Awesome iPad/i...,know about awesome ipadiphone app that youl...
2,@swonderlin Can not wait for #iPad 2 also. The...,can not wait for ipad also they should sale ...
3,@sxsw I hope this year's festival isn't as cra...,i hope this years festival isnt as crashy as ...
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,great stuff on fri sxsw marissa mayer google ...
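
The output above shows "3G" reduced to "g" and "iPad 2" reduced to "ipad". If such numbers turn out to matter, one option is a variant of the cleaning function that keeps digits. The sketch below is an assumption, not the notebook's method: `clean_tweet_keep_digits` is a hypothetical name, the sample tweet is made up, and the whitespace-collapsing step is an extra not present in `clean_tweet`.

```python
import re


def clean_tweet_keep_digits(text):
    """Hypothetical variant of clean_tweet that retains digits."""
    text = text.lower()                        # normalize case
    text = re.sub(r"http\S+", "", text)        # remove URLs
    text = re.sub(r"@\w+", "", text)           # remove mentions
    text = re.sub(r"#", "", text)              # drop the hashtag symbol, keep the word
    text = re.sub(r"[^a-z0-9\s]", "", text)    # remove punctuation, keep digits
    text = re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace
    return text


# Made-up sample tweet for illustration.
sample = ".@user I have a 3G iPhone! Can't wait for #iPad 2 http://example.com"
print(clean_tweet_keep_digits(sample))
# prints: i have a 3g iphone cant wait for ipad 2
```

Whether keeping digits helps is an empirical question; comparing model performance with both versions of the cleaned text would be one way to settle it.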
