# Sentiment Classification of Apple and Google Tweets Using NLP
By DS-PT II Group 6

## Introduction
The purpose of this project is to build a model that can rate the sentiment of a tweet based on its content. The data comes from https://data.world/crowdflower/brands-and-product-emotions. It contains more than 9,000 tweets about Apple and Google products, each rated by humans as positive, negative, or neutral.

This notebook contains



## Exploratory Data Analysis

### Importing libraries

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize




from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.pipeline import Pipeline

import warnings
warnings.filterwarnings("ignore")



### Load the dataset

In [2]:
df = pd.read_csv('tweet_data.csv', encoding= 'latin-1')
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [3]:
df.shape

(9093, 3)

From the above, we can see that the data is organized into 9093 rows and 3 columns: 
* tweet_text
* emotion_in_tweet_is_directed_at
* is_there_an_emotion_directed_at_a_brand_or_product

These column names are quite lengthy, so let's simplify them.

In [4]:
# Change column names to be more user-friendly
df = df.rename(columns = {'tweet_text': 'Tweet', 
                         'emotion_in_tweet_is_directed_at': 'Device', 
                         'is_there_an_emotion_directed_at_a_brand_or_product': 'Emotion'})

# Confirm the changes
df.head()
                        

Unnamed: 0,Tweet,Device,Emotion
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [6]:
# Check for missing values 
print(df.isnull().sum())
# Check for duplicates
print(df.duplicated().sum())


Tweet         1
Device     5802
Emotion       0
dtype: int64
22


From the above, we see that there is 1 missing value in the 'Tweet' column, 5802 missing values in the 'Device' column and none in the 'Emotion' column. There are 22 duplicates.

We will drop the duplicate rows and fill the missing values in the 'Device' column with 'unknown'. For the 1 missing value in the 'Tweet' column, we will delete the entire row.

In [None]:
# Drop duplicates
df = df.drop_duplicates()
# Fill missing values in 'Device' column with 'unknown'
df['Device'] = df['Device'].fillna('unknown')
# Drop rows with missing values in 'Tweet' column
df = df.dropna(subset=['Tweet'])


In [8]:
# Confirm the changes
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9070 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Tweet    9070 non-null   object
 1   Device   9070 non-null   object
 2   Emotion  9070 non-null   object
dtypes: object(3)
memory usage: 283.4+ KB


Let's take a look at the value counts within the emotion column to understand the relationship between positive, negative and neutral tweets.

In [9]:
df['Emotion'].value_counts()

Emotion
No emotion toward brand or product    5375
Positive emotion                      2970
Negative emotion                       569
I can't tell                           156
Name: count, dtype: int64

As the cell above shows, there are 4 sentiment labels in the 'Emotion' column: 'No emotion toward brand or product', 'Positive emotion', 'Negative emotion' and 'I can't tell'.

These labels are lengthy, so we will rename them to something more user-friendly.

The first and last categories are also quite similar, so we will merge them into a single 'Neutral' category.


In [10]:
# Merge 'No emotion toward brand or product' and 'I can't tell' into 'Neutral'
def clean_emotions(df, column):
    """Map the original emotion labels to Positive, Negative, or Neutral."""
    emotion_map = {
        "No emotion toward brand or product": "Neutral",
        "I can't tell": "Neutral",  # ambiguous tweets are treated as Neutral
        "Positive emotion": "Positive",
        "Negative emotion": "Negative",
    }
    # Write back to the same column that was read, rather than a hardcoded name
    df[column] = df[column].map(emotion_map)
    return df

df = clean_emotions(df, 'Emotion') #Set df to clean emotions function
df['Emotion'].value_counts() #Checking value counts to see if they were changed

Emotion
Neutral     5531
Positive    2970
Negative     569
Name: count, dtype: int64
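
To make the relationship between the three classes easier to read, the counts above can be turned into percentages. The following is a minimal sketch that rebuilds the counts by hand from the printed output; the variable names `emotion_counts`, `emotion_share` and `imbalance_ratio` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
emotion_counts = pd.Series(
    {"Neutral": 5531, "Positive": 2970, "Negative": 569},
    name="count",
)

# Share of each class as a percentage of all tweets
emotion_share = (emotion_counts / emotion_counts.sum() * 100).round(1)
print(emotion_share)

# Ratio of the largest class to the smallest class
imbalance_ratio = emotion_counts.max() / emotion_counts.min()
print(f"Neutral-to-Negative ratio: {imbalance_ratio:.1f}")
```

On the actual DataFrame, `df['Emotion'].value_counts(normalize=True)` should give the same proportions directly. Negative tweets make up only about 6% of the data, which is worth keeping in mind when splitting the data and evaluating models.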

Let's take a look at the value counts within the 'Device' column to understand how the tweets are distributed across products.

In [None]:
# Count tweets per device category
df['Device'].value_counts()

Device
unknown                            5788
iPad                                945
Apple                               659
iPad or iPhone App                  469
Google                              428
iPhone                              296
Other Google product or service     293
Android App                          80
Android                              77
Other Apple product or service       35
Name: count, dtype: int64

The distribution across products is quite skewed: about 64% of the tweets (5788 of 9070) don’t reference a specific product. To address this, we will introduce a new column called “Brand”, which indicates whether the tweet relates to Apple or Google, using the information already available in the “Device” column. Since the dataset focuses on these two companies, having this brand-level detail may prove useful later, so it makes sense to set it up now.

The process will be as follows: first, we’ll review all entries in the “Device” column. Next, we’ll write a function that goes through this column and assigns the appropriate brand to the new column. If the product is not specified, the function will then check the tweet text itself for product-related keywords. If no keywords are found, the brand will remain “Unknown.” If terms for both Apple and Google are detected, the entry will be labeled “Both.” The goal is to create more balanced classes for this new feature.

In [13]:
device_mapping = {
    # Apple products and services
    "iPad": "Apple",
    "Apple": "Apple",
    "iPad or iPhone App": "Apple",
    "iPhone": "Apple",
    "Other Apple product or service": "Apple",

    # Google products and services
    "Google": "Google",
    "Other Google product or service": "Google",
    "Android": "Google",
    "Android App": "Google",

    # Placeholder used earlier to fill missing devices (lowercase 'unknown')
    "unknown": "Unknown",
}

# Map first from the Device column
df["Brand"] = df["Device"].map(device_mapping).fillna("Unknown")

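# Handle "Both" case if tweet mentions both Apple and Google/Android.
# Note: "ip" is a loose substring match meant to catch iPad/iPhone/iPod,
# but it also matches unrelated words such as "tip" or "ship".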
df["Brand"] = df.apply(
    lambda row: "Both" if (
        ("apple" in str(row["Tweet"]).lower() or "ip" in str(row["Tweet"]).lower()) and
        ("google" in str(row["Tweet"]).lower() or "android" in str(row["Tweet"]).lower())
    ) else row["Brand"],
    axis=1
)

# Check final distribution
brand_distribution = df["Brand"].value_counts()
brand_distribution

Brand
Unknown    5575
Apple      2359
Google      836
Both        300
Name: count, dtype: int64
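
The cell above maps the “Device” column and flags “Both”, but it does not search tweets with an unknown brand for single-brand keywords. The keyword fallback described earlier could be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's method: the keyword lists `APPLE_KEYWORDS` and `GOOGLE_KEYWORDS`, the helper `infer_brand`, and the toy `demo` DataFrame are all hypothetical.

```python
import pandas as pd

# Hypothetical keyword lists; illustrative assumptions, not taken from the notebook
APPLE_KEYWORDS = ("apple", "ipad", "iphone", "ipod", "itunes")
GOOGLE_KEYWORDS = ("google", "android")

def infer_brand(tweet, current_brand="Unknown"):
    """Return a brand, falling back to a keyword search when the brand is unknown."""
    if current_brand != "Unknown":
        return current_brand  # keep the brand derived from the Device column
    text = str(tweet).lower()
    has_apple = any(word in text for word in APPLE_KEYWORDS)
    has_google = any(word in text for word in GOOGLE_KEYWORDS)
    if has_apple and has_google:
        return "Both"
    if has_apple:
        return "Apple"
    if has_google:
        return "Google"
    return "Unknown"

# Toy data for demonstration only
demo = pd.DataFrame({
    "Tweet": [
        "Love my new iPad",
        "Android beats iPhone",
        "Great talk at #SXSW",
        "Google maps rocks",
    ],
    "Brand": ["Unknown", "Unknown", "Unknown", "Google"],
})

# Apply the fallback row by row, passing the brand already assigned
demo["Brand"] = [infer_brand(t, b) for t, b in zip(demo["Tweet"], demo["Brand"])]
print(demo)
```

Using whole product names instead of the short substring "ip" reduces accidental matches, although plain substring checks can still misfire (for example, "apple" inside "pineapple"). Applying something like this to the real data would change the brand counts shown above.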