# SENTIMENT ANALYSIS OF TWEETS ABOUT BRANDS AND PRODUCTS

>Sentiment analysis of tweets about Google and Apple and their products.

<div style="display: flex; justify-content: center; gap: 20px;">
  <img src="google.jpg" alt="Google Logo" width="200">
  <img src="apple.png" alt="Apple Logo" width="200">
  <img src="android.jpg" alt="Android Logo" width="300">
    
</div>


# Project Summary

>This project builds a text classification model that labels tweets about Google, Apple, Android and their products as positive, negative or neutral. The results would help these companies understand how customers feel about their products, services and brand, so that they can improve products and services, monitor brand reputation and detect potential PR issues early.

# 1. Business Understanding

>Google is a global technology company founded in 1998 by Larry Page and Sergey Brin, best known for its search engine that organizes and provides access to the world’s information. Headquartered in Mountain View, California, Google is now a subsidiary of Alphabet Inc. Its products and services include Google Search, Gmail, YouTube, Android, Google Maps and Google Cloud.

>Apple Inc. is an American technology company founded in 1976 by Steve Jobs, Steve Wozniak, and Ronald Wayne, headquartered in Cupertino, California. It is renowned for designing and manufacturing innovative consumer electronics, software, and online services. Apple’s flagship products include the iPhone, iPad, Mac, Apple Watch and AirPods, along with software like iOS, macOS and iCloud services.

>Android is an open-source mobile operating system developed by Google, designed primarily for touchscreen devices such as smartphones and tablets. Based on a modified version of the Linux kernel, Android offers a customizable and flexible platform that supports millions of apps through the Google Play Store.

##  Business Problem


>In today’s competitive digital marketplace, companies like Google and Apple rely heavily on public perception to maintain brand loyalty and market growth. With millions of users expressing their opinions daily on social media platforms like Twitter, it is increasingly difficult for companies to manually analyze and understand the sentiment behind this volume of data. Sentiment analysis determines the nature of the sentiment expressed on these platforms. By automating sentiment detection, companies can gain real-time insights into customer opinions, improve their products, and make data-driven marketing and business decisions.



## Objectives


>- To build a text classification model that classifies tweets into categories such as positive, negative or no emotion.
>- To preprocess the tweet data by cleaning, tokenizing and transforming text into a machine-readable format.
>- To evaluate model performance using appropriate metrics such as accuracy and precision to ensure reliability.


## Metrics of Success

>- The model's performance will be evaluated using accuracy as the primary metric with a target accuracy of about 80%.
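
As a rough illustration of the metrics named above, the sketch below computes accuracy and per-class precision by hand on a tiny, made-up set of labels. The label lists and the helper names `accuracy` and `precision` are hypothetical and exist only for this example; `sklearn.metrics.accuracy_score` and `precision_score` report the same quantities.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision(y_true, y_pred, label):
    """Of the items predicted as `label`, the share that truly are `label`."""
    predicted = [t for t, p in zip(y_true, y_pred) if p == label]
    if not predicted:
        return 0.0
    return sum(t == label for t in predicted) / len(predicted)


# Made-up labels, for illustration only (not taken from the dataset)
y_true = ["positive", "negative", "neutral", "neutral", "positive"]
y_pred = ["positive", "neutral", "neutral", "neutral", "negative"]

acc = accuracy(y_true, y_pred)
prec_neutral = precision(y_true, y_pred, "neutral")

print(f"accuracy = {acc:.2f}, meets 0.80 target: {acc >= 0.80}")
print(f"precision (neutral) = {prec_neutral:.2f}")
```

Reporting precision per class alongside accuracy shows which sentiment labels the model tends to over-predict.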

# 2. Data Understanding

> The Brands and Product Emotions dataset comes from data.world, a cloud-native data catalog and metadata platform. Human raters labeled the sentiment in over 9,000 tweets as positive, negative, or neither. The dataset contains 9,093 rows and 3 columns.

## Features

>- `tweet_text` - The actual text from tweets
>- `emotion_in_tweet_is_directed_at` - The brand or product the tweet is about
>- `is_there_an_emotion_directed_at_a_brand_or_product` - Nature of sentiment


## Data limitations

>- Temporal relevance - The sentiments captured may be tied to specific events or time periods, reducing the dataset’s relevance for future analysis.
>- Incomplete data - Although the dataset contains around 9,000 rows, only about 3,300 entries (3,291) in the brand/product column are filled in. This significantly reduces the usable brand-level data and may limit the model’s ability to learn diverse patterns.

## 2.1 Loading the Dataset

In [44]:
# importing necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk import FreqDist
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
tweet_df = pd.read_csv('judge-1377884607_tweet_product_company.csv', encoding='latin1')
tweet_df.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


In [3]:
# shape of dataset
print("Rows: ", tweet_df.shape[0])
print("Columns: ", tweet_df.shape[1])


Rows:  9093
Columns:  3


## 2.2 Information about the dataset

In [4]:
# dataset information
tweet_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


## 2.3 Checking for missing values

In [5]:
# checking for null values
print("The dataset has", tweet_df.isna().sum().sum(), "missing values")

The dataset has 5803 missing values


## 2.4 Checking for duplicates

In [6]:
print("The dataset has", tweet_df['tweet_text'].duplicated().sum(), "duplicates")

The dataset has 27 duplicates


# 3. Data Preparation

## 3.1 Handling missing values and duplicates

In [7]:
# checking distribution of null values
tweet_df.isna().sum()

tweet_text                                               1
emotion_in_tweet_is_directed_at                       5802
is_there_an_emotion_directed_at_a_brand_or_product       0
dtype: int64

Since only one row has a missing value in the 'tweet_text' column, we drop that row.

In [8]:
# dropping null values in 'tweet_text' column
tweet_df.dropna(subset=['tweet_text'], inplace=True)

In [9]:
tweet_df.isna().sum()

tweet_text                                               0
emotion_in_tweet_is_directed_at                       5801
is_there_an_emotion_directed_at_a_brand_or_product       0
dtype: int64

In [10]:
# checking all the categories in 'emotion_in_tweet_is_directed_at' column 
tweet_df['emotion_in_tweet_is_directed_at'].unique()

array(['iPhone', 'iPad or iPhone App', 'iPad', 'Google', nan, 'Android',
       'Apple', 'Android App', 'Other Google product or service',
       'Other Apple product or service'], dtype=object)

Since 5801 rows are missing a value in the 'emotion_in_tweet_is_directed_at' column, dropping them would remove nearly two-thirds of the dataset and leave far fewer examples for the model to learn from. Instead, we replace the NaN values with an "Unknown" category.

In [18]:
# filling in the null values with an 'Unknown' category
# (assigning the result back avoids the chained inplace call that pandas warns will stop working in 3.0)
tweet_df['emotion_in_tweet_is_directed_at'] = tweet_df['emotion_in_tweet_is_directed_at'].fillna('Unknown')


In [19]:
tweet_df['emotion_in_tweet_is_directed_at'].unique()

array(['Apple product or service', 'Google product or service', 'Unknown'],
      dtype=object)

In [12]:
tweet_df.isna().sum()

tweet_text                                            0
emotion_in_tweet_is_directed_at                       0
is_there_an_emotion_directed_at_a_brand_or_product    0
dtype: int64

In [13]:
# dropping the duplicate tweet texts
tweet_df = tweet_df.drop_duplicates(subset=['tweet_text'], keep='first')

In [14]:
print("The dataset has", tweet_df['tweet_text'].duplicated().sum(), "duplicates")

The dataset has 0 duplicates
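
To make the duplicate count easier to interpret, here is a small sketch on a hypothetical toy frame (`toy` and its contents are invented for illustration). With the default `keep='first'`, `duplicated()` flags only the repeats after the first occurrence, so the number it reports equals the number of rows that `drop_duplicates(keep='first')` removes.

```python
import pandas as pd

# Hypothetical toy data: one tweet text appears three times
toy = pd.DataFrame(
    {"tweet_text": ["great phone", "great phone", "bad app", "great phone"]}
)

# Counts repeats only; the first "great phone" is not flagged
n_dupes = int(toy["tweet_text"].duplicated().sum())

# Keeps the first occurrence of each text and drops the later copies
deduped = toy.drop_duplicates(subset=["tweet_text"], keep="first")

print(n_dupes, len(toy) - len(deduped))
```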


## 3.2 EDA

## 3.3 Text Preprocessing

In [16]:
# reducing the categories into 3 
new_categories = {
    'iPad or iPhone App': 'Apple product or service',
    'iPad': 'Apple product or service',
    'iPhone': 'Apple product or service',
    'Apple': 'Apple product or service',
    'Other Apple product or service': 'Apple product or service',

    'Android': 'Google product or service',
    'Android App': 'Google product or service',
    'Google': 'Google product or service',
    'Other Google product or service': 'Google product or service'
    }

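# .replace leaves values that are not keys in the mapping (such as 'Unknown') untouched,
# whereas .map would turn them into NaN
tweet_df['emotion_in_tweet_is_directed_at'] = tweet_df['emotion_in_tweet_is_directed_at'].replace(new_categories)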

In [21]:
tweet_df['emotion_in_tweet_is_directed_at'].unique()

array(['Apple product or service', 'Google product or service', 'Unknown'],
      dtype=object)
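
The order of the fill and the remapping matters when a dictionary is passed to `Series.map`. The sketch below uses a hypothetical three-value series and a shortened mapping to show the difference: `map` returns NaN for any value that is not a key in the dictionary, while `replace` leaves such values as they are.

```python
import pandas as pd

# Shortened, illustrative mapping; 'Unknown' is deliberately not a key
mapping = {
    "iPad": "Apple product or service",
    "Android": "Google product or service",
}
s = pd.Series(["iPad", "Android", "Unknown"])

mapped = s.map(mapping)        # 'Unknown' becomes NaN
replaced = s.replace(mapping)  # 'Unknown' is kept

print(mapped.tolist())
print(replaced.tolist())
```

If `map` is preferred, filling the missing values after the remapping rather than before gives the same end result.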

In [23]:
tweet_df['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

is_there_an_emotion_directed_at_a_brand_or_product
No emotion toward brand or product    5372
Positive emotion                      2968
Negative emotion                       569
I can't tell                           156
Name: count, dtype: int64

In [24]:
# reducing the sentiment categories into 3
emotion = {
    'Positive emotion': 'Positive emotion',
    'Negative emotion': 'Negative emotion',
    'No emotion toward brand or product': 'Neutral or unclear sentiment',
    "I can't tell": 'Neutral or unclear sentiment'

    }

tweet_df['is_there_an_emotion_directed_at_a_brand_or_product'] = tweet_df['is_there_an_emotion_directed_at_a_brand_or_product'].map(emotion)
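
The counts printed above imply a noticeably imbalanced target once the four labels are merged into three. The sketch below redoes that arithmetic from the `value_counts()` output; the variable names are illustrative. On data with this distribution, a model that always predicts the largest class would already score roughly 61% accuracy, which is a useful reference point for the 80% target.

```python
# Counts copied from the value_counts() output, with the two neutral-like labels merged
counts = {
    "Neutral or unclear sentiment": 5372 + 156,
    "Positive emotion": 2968,
    "Negative emotion": 569,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
majority_label = max(counts, key=counts.get)

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print("majority class:", majority_label)
```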

Text preprocessing steps:

1. Lowercasing
2. Remove punctuation
3. Tokenization
4. Remove stopwords
5. Lemmatization

In [35]:
# Creating an instance of the RegexpTokenizer with the variable name `tokenizer`
tokenizer = RegexpTokenizer(r"(?u)\w{3,}")

# creating an instance of WordNetLemmatizer
lemmatizer =  WordNetLemmatizer()

# creating a list of stopwords in the english language
stopwords_list = stopwords.words('english')

In [36]:
# creating a function to perform the preprocessing
def preprocess(text, tokenizer, lemmatizer, stopwords_list):
    # lowercase the text
    text = text.lower()
    # remove punctuation, symbols and tokenize the text
    text = tokenizer.tokenize(text)
    # remove stopwords
    text = [word for word in text if word not in stopwords_list]
    # lemmatize the text
    text = [lemmatizer.lemmatize(word) for word in text]
    # return the preprocessed text as single string
    return " ".join(text)
    

In [41]:
# testing our function on a small sample
sample = 'Spin Play iPad launch Party. Hanging with @mention and @mention #sxsw (@mention Cedar Street Courtyard) {link}'
preprocess(sample, tokenizer, lemmatizer, stopwords_list)

'spin play ipad launch party hanging mention mention sxsw mention cedar street courtyard link'
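
The tokenizer pattern `\w{3,}` does more than strip punctuation: it also discards any token shorter than three characters. The sketch below applies the same pattern with the standard library's `re.findall`, on the assumption that `RegexpTokenizer` in its default (non-gap) mode returns the pattern's matches in the same way. The sample is the visible start of the first tweet in the dataset.

```python
import re

# Same pattern as the RegexpTokenizer: runs of 3 or more word characters
TOKEN_PATTERN = r"(?u)\w{3,}"

text = ".@wesley83 I have a 3G iPhone.".lower()
tokens = re.findall(TOKEN_PATTERN, text)

# 'i', 'a' and '3g' are shorter than three characters, so they are dropped
print(tokens)
```

Because `\w` also matches underscores and digits, handles such as `rise_austin` or `wesley83` survive as single tokens.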

In [42]:
# preprocessing our tweet text and creating a new column for the cleaned text
tweet_df['cleaned_text'] = tweet_df['tweet_text'].apply(lambda x: preprocess(x, tokenizer, lemmatizer, stopwords_list))

In [43]:
tweet_df.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product | cleaned_text |
|---|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | Apple product or service | Negative emotion | wesley83 iphone hr tweeting rise_austin dead n... |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | Apple product or service | Positive emotion | jessedee know fludapp awesome ipad iphone app ... |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | Apple product or service | Positive emotion | swonderlin wait ipad also sale sxsw |
| 3 | @sxsw I hope this year's festival isn't as cra... | Apple product or service | Negative emotion | sxsw hope year festival crashy year iphone app... |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google product or service | Positive emotion | sxtxstate great stuff fri sxsw marissa mayer g... |


## Frequency Distributions

Now that we've done some basic cleaning and tokenization, let's create a frequency distribution to see how many times each word is used.
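
As a minimal sketch of what a frequency distribution computes, the example below counts tokens with `collections.Counter` from the standard library. NLTK's `FreqDist` is a `Counter` subclass, so `most_common` behaves the same way. The three short "cleaned" strings are invented for illustration.

```python
from collections import Counter

# Hypothetical cleaned tweets (already lowercased and stripped of stopwords)
cleaned = ["ipad launch party", "google launch", "ipad app"]

# Join all rows into one string, then split on whitespace to get a flat token list
all_tokens = " ".join(cleaned).split()

freq = Counter(all_tokens)
top_two = freq.most_common(2)
print(top_two)
```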

In [52]:
# combining all the words into one big list
all_words = " ".join(tweet_df['cleaned_text']).split()

In [48]:
# counting how often each word appears
all_words_freqdist = FreqDist(all_words)
all_words_freqdist.most_common(20)

[('sxsw', 9600),
 ('mention', 7107),
 ('link', 4305),
 ('google', 2653),
 ('ipad', 2513),
 ('apple', 2333),
 ('quot', 1696),
 ('iphone', 1585),
 ('store', 1523),
 ('new', 1084),
 ('austin', 971),
 ('amp', 827),
 ('app', 825),
 ('launch', 683),
 ('circle', 683),
 ('social', 660),
 ('pop', 599),
 ('android', 596),
 ('today', 577),
 ('network', 467)]

Some tokens, such as 'sxsw', 'mention', 'link', 'rt', 'amp' and 'quot', do not add meaning to our text, so we add them to our stopwords list.

In [50]:
# add custom meaningless tokens
custom_stopwords = {'sxsw', 'mention', 'link', 'rt', 'amp', 'quot'}

# merge them
stopwords_list.extend(custom_stopwords)
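
One possible variation, sketched under stated assumptions: holding the stopwords in a set instead of a list. A set cannot accumulate duplicate entries if the cell is run more than once, and membership checks do not slow down as the collection grows. `base_stopwords` below is a three-word stand-in for `stopwords.words('english')`, and the token string is adapted from the sample tweet used earlier.

```python
# Stand-in for stopwords.words('english'); the real list is much longer
base_stopwords = ["the", "and", "with"]
custom = {"sxsw", "mention", "link", "rt", "amp", "quot"}

# Union of both collections; rerunning this line never adds duplicates
stopword_set = set(base_stopwords) | custom

tokens = "spin play ipad launch party hanging with mention and mention sxsw link".split()
kept = [t for t in tokens if t not in stopword_set]
print(kept)
```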

In [51]:
# we rerun the function
tweet_df['cleaned_text'] = tweet_df['tweet_text'].apply(lambda x: preprocess(x, tokenizer, lemmatizer, stopwords_list))

In [53]:
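# rebuild the word list from the re-cleaned text, then recount
all_words = " ".join(tweet_df['cleaned_text']).split()
all_words_freqdist = FreqDist(all_words)
all_words_freqdist.most_common(20)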

[('google', 2653),
 ('ipad', 2513),
 ('apple', 2333),
 ('iphone', 1585),
 ('store', 1523),
 ('new', 1084),
 ('austin', 971),
 ('app', 825),
 ('launch', 683),
 ('circle', 683),
 ('social', 660),
 ('pop', 599),
 ('android', 596),
 ('today', 577),
 ('network', 467),
 ('ipad2', 464),
 ('get', 456),
 ('line', 448),
 ('via', 436),
 ('party', 401)]