### **A Comparative Analysis of Public Sentiment on Twitter towards Apple and Google Products using Natural Language Processing.**  


**Group Name: Group 5**

**Members**
1. *Rose Miriti*
2. *Isaac Wadhare*
3. *Lydia Chumba*
4. *Erick Mauti*
5. *Marilyn Akinyi*
6. *Rodgers Otieno*
7. *Samwel Ongechi*

**Technical Mentor: George Kamundia**

**Phase: Phase 4 Project**


# Sentiment Analysis of Tweets on Apple and Google Products

##  Summary

This project focuses on analyzing public sentiment expressed on Twitter regarding Apple and Google products, using a labeled dataset of over 9,000 tweets categorized as positive, negative, or neutral. The goal is to build a proof-of-concept NLP model capable of classifying tweets according to sentiment, providing actionable insights that can guide business strategy, marketing, and product development for the two companies.

The workflow begins with **business and data understanding**, where the problem is defined, the dataset is explored, and the distribution of sentiment classes is analyzed. The **data preparation** stage includes text cleaning, tokenization, stopword removal, lemmatization, and stemming. The text is then transformed into numerical representations using TF-IDF vectors or word embeddings, creating features suitable for machine learning models.
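
The cleaning steps listed above can be sketched in plain Python. This is a minimal illustration, not the project's actual pipeline: the function name `clean_tweet` and the tiny `STOPWORDS` set are hypothetical, and lemmatization and stemming are left out because they depend on NLTK resources.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real run would use a fuller
# list such as the ones shipped with NLTK or scikit-learn.
STOPWORDS = {"a", "an", "the", "is", "to", "for", "of", "and", "i", "my", "at"}


def clean_tweet(text: str) -> list[str]:
    """Lowercase a tweet, strip Twitter-specific noise, and return its tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)          # drop @mentions
    text = text.replace("#", " ")              # keep hashtag words, drop '#'
    tokens = re.findall(r"[a-z0-9']+", text)   # simple regex tokenizer
    return [tok for tok in tokens if tok not in STOPWORDS]


print(clean_tweet("@sxsw The #iPad 2 is AWESOME! http://example.com"))
```

The resulting token lists can be joined back into strings and passed to a vectorizer such as `TfidfVectorizer`.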

For **modeling**, baseline models such as Logistic Regression and Naive Bayes will be implemented first to establish performance benchmarks. We will then apply ensemble methods such as Random Forest and Gradient Boosting (XGBoost) and, for more advanced modeling, a **neural network**.

A **validation strategy** using stratified train-test splits and K-Fold cross-validation will ensure the models generalize well to unseen data. **Evaluation metrics** will include accuracy, precision, recall, F1-score, and confusion matrices to assess multiclass classification performance. The project will produce insights into public sentiment trends, which will directly inform recommendations answering key objectives regarding customer perception, sentiment drivers, and business strategies.
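
As a rough sketch of this validation strategy, the snippet below runs stratified K-Fold cross-validation on a TF-IDF plus Logistic Regression pipeline and reports macro F1 per fold. The tweets and labels are made-up toy data, and the choice of three folds and macro F1 is an assumption for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Toy, made-up tweets: three per class, for illustration only.
texts = [
    "love my new iphone", "great ipad app", "awesome google maps update",
    "iphone battery is terrible", "hate this android bug", "ipad design is awful",
    "google launches event at sxsw", "apple opens popup store", "new android talk today",
]
labels = ["Positive"] * 3 + ["Negative"] * 3 + ["Neutral"] * 3

# Keeping the vectorizer inside the pipeline means TF-IDF is fit on the
# training folds only, so nothing leaks from the held-out fold.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Stratification keeps the class proportions the same in every fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(model, texts, labels, cv=cv, scoring="f1_macro")
print(f"Macro F1 per fold: {scores}")
```

Macro F1 weights every class equally, which matters when one class is much smaller than the others.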


## Business Problem

Apple and Google face continuous public scrutiny on social media regarding product launches and services. Understanding real-time customer sentiment is critical to improving products, marketing strategies, and customer satisfaction.

**Business problem:**  
*"Can we automatically classify the sentiment of tweets about Apple and Google products to support actionable business insights?"*


## Project Objectives

1. **Determine the overall public sentiment** towards Apple and Google products on Twitter.  
2. **Identify tweet characteristics and themes** that contribute to positive, negative, or neutral sentiment.  
3. **Provide actionable insights** from sentiment trends to inform business decisions, marketing strategies, and product improvements.  

## 1.0 Importing the necessary libraries for the analysis

In [1]:
# Import essential libraries
import re
import string
import time
import warnings
from collections import Counter

import contractions
import emoji
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from wordcloud import WordCloud

warnings.filterwarnings('ignore')

# Import NLTK-specific modules
from nltk import CFG, ChartParser, FreqDist, bigrams, ngrams, pos_tag, trigrams
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.corpus import gutenberg, stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import regexp_tokenize, sent_tokenize, word_tokenize

# Import scikit-learn, imbalanced-learn, and XGBoost for machine learning
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold, RandomizedSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score, precision_score, recall_score
from sklearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
import xgboost as xgb
from xgboost import XGBClassifier

# Set up plotting
plt.style.use('default')
sns.set_palette("husl")

print("All libraries imported successfully!")
print("Environment setup complete!")

All libraries imported successfully!
Environment setup complete!


## 1.1 Loading the dataset
- Load the data into a pandas DataFrame and view the first 5 records.

In [2]:
# A forward slash avoids the invalid '\j' escape sequence and works on every OS
df = pd.read_csv('Data/judge-1377884607_tweet_product_company.csv', encoding='latin-1')
df.head()

|   | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


## 1.2 Data Inspection
- Data structure


In [3]:
print("Data Structure")
print(f"Shape: {df.shape} \n")
df.info()  # prints its summary directly and returns None, so it is not wrapped in print()
print()
print("OBSERVATION:")
print(f"There are {df.shape[1]} features.")
print(f"And {df.shape[0]} records in our dataset.")
print("All features are stored as object (string) dtype.")
print("There are missing values in two columns: 'tweet_text' and 'emotion_in_tweet_is_directed_at'.")

Data Structure
Shape: (9093, 3) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB

OBSERVATION:
There are 3 features.
And 9093 records in our dataset.
All features are stored as object (string) dtype.
There are missing values in two columns: 'tweet_text' and 'emotion_in_tweet_is_directed_at'.


In [4]:
# Percentage of missing values per column
print((df.isnull().sum()/len(df))*100)

tweet_text                                             0.010997
emotion_in_tweet_is_directed_at                       63.807324
is_there_an_emotion_directed_at_a_brand_or_product     0.000000
dtype: float64


### Observation
- **Missing Values**:
  - `tweet_text`: 0.01% missing (a single record).
  - `emotion_in_tweet_is_directed_at`: 63.8% missing.
  - The target column, `is_there_an_emotion_directed_at_a_brand_or_product`, has **no missing values**.
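
One possible way to handle these gaps is sketched below on a made-up toy frame that reuses the dataset's column names. Dropping the tweet with no text follows directly from the task. Filling the missing brand with the placeholder `"Unknown"` is an assumption, not a step taken from this notebook.

```python
import pandas as pd

# Toy frame mirroring the dataset's column names; the rows are made up.
toy = pd.DataFrame({
    "tweet_text": ["Love my iPad", None, "Google maps is down"],
    "emotion_in_tweet_is_directed_at": ["iPad", None, None],
    "is_there_an_emotion_directed_at_a_brand_or_product": [
        "Positive emotion",
        "No emotion toward brand or product",
        "Negative emotion",
    ],
})

toy = (
    # A tweet with no text cannot be classified, so drop it.
    toy.dropna(subset=["tweet_text"])
    # Assumption: keep rows with no recorded brand and flag them explicitly,
    # because dropping them would discard most of the data.
    .fillna({"emotion_in_tweet_is_directed_at": "Unknown"})
)

print(toy.isnull().sum())
```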

In [5]:
# Checking for duplicates
print(f"There are {df.duplicated().sum()} duplicate rows in our dataset. \nWe need to drop them so they do not skew the results.")

There are 22 duplicate rows in our dataset. 
We need to drop them so they do not skew the results.
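
Dropping the duplicates could look like the sketch below, shown on a made-up toy frame. The short `label` column name is hypothetical and stands in for the real target column.

```python
import pandas as pd

# Toy frame with one exact duplicate row; the data is made up.
toy = pd.DataFrame({
    "tweet_text": ["Love my iPad", "Love my iPad", "Google maps is down"],
    "label": ["Positive emotion", "Positive emotion", "Negative emotion"],
})

# duplicated() marks rows that are identical to an earlier row.
print(f"Duplicates before: {toy.duplicated().sum()}")

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(f"Shape after dropping duplicates: {deduped.shape}")
```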


- Check the unique values of the sentiment feature and their frequencies

In [6]:
print(f"{df['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()} \n")

print("Observation:")
print("We need to merge the neutral sentiment labels and remove the word 'emotion' from the positive and negative labels.")
print("Neutral tweets make up more than 50% of the dataset.")


is_there_an_emotion_directed_at_a_brand_or_product
No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: count, dtype: int64 

Observation:
We need to merge the neutral sentiment labels and remove the word 'emotion' from the positive and negative labels.
Neutral tweets make up more than 50% of the dataset.
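
The relabeling described in the observation can be sketched with a plain mapping applied to the class counts reported above. Treating "I can't tell" as neutral is an assumption, and `LABEL_MAP` is a hypothetical name.

```python
from collections import Counter

# Assumed mapping: both "No emotion toward brand or product" and "I can't tell"
# become Neutral. Whether "I can't tell" belongs there is a judgment call.
LABEL_MAP = {
    "No emotion toward brand or product": "Neutral",
    "I can't tell": "Neutral",
    "Positive emotion": "Positive",
    "Negative emotion": "Negative",
}

# Class counts copied from the value_counts() output above.
raw_counts = {
    "No emotion toward brand or product": 5389,
    "Positive emotion": 2978,
    "Negative emotion": 570,
    "I can't tell": 156,
}

# Sum the counts of every original label that maps to the same new label.
merged = Counter()
for label, count in raw_counts.items():
    merged[LABEL_MAP[label]] += count

# Print each merged class with its count and share of the dataset.
total = sum(merged.values())
for label, count in merged.most_common():
    print(f"{label:<9}{count:>6}  {count / total:.1%}")
```

On the DataFrame itself, the same mapping could be applied with `df['is_there_an_emotion_directed_at_a_brand_or_product'].map(LABEL_MAP)`.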
