# TWITTER SENTIMENT ANALYSIS USING NLP

## **1. BUSINESS UNDERSTANDING**

### 1.1 BUSINESS OVERVIEW

Social media has become a major channel through which customers share their experiences, opinions, and emotions about products and brands. Platforms such as Twitter allow users to express their views openly and instantly, creating large volumes of unstructured text data. These public opinions can significantly influence purchasing decisions, brand perception, and customer loyalty.

For technology companies like Apple and Google, monitoring and understanding customer sentiment is essential. Customer feedback shared on social media provides valuable insights into product performance, customer satisfaction, and emerging issues. When analyzed effectively, this information can support better decision-making, product improvement, and brand management.


### 1.2 BUSINESS PROBLEM

Apple and Google face the challenge of managing customer opinions expressed on social media. Negative sentiment about products or services can quickly harm brand perception, reduce customer loyalty, and impact sales. Without an automated solution, emerging product issues or trending complaints may go unnoticed, leading to reputational risks and missed market opportunities.

This project aims to address this problem by using Natural Language Processing (NLP) and machine learning techniques to automatically classify and analyze sentiments expressed in tweets related to Apple and Google products.

### 1.3 BUSINESS OBJECTIVES

#### 1.3.1 Main objective

To build a sentiment analysis system that uses NLP techniques to analyze and classify Twitter comments about Apple and Google products.

#### 1.3.2 Specific objectives

* To develop a machine learning model to classify tweets as positive or negative.
* To clean and preprocess raw Twitter text data for analysis.
* To explore and visualize sentiment distribution between Apple and Google products.
* To evaluate model performance using appropriate metrics such as accuracy and precision to ensure reliability.

#### 1.3.3 Key questions

* How can a machine learning model be developed to classify tweets as positive or negative?
* What methods can be used to clean and preprocess raw Twitter text data to prepare it for analysis?
* How can the sentiment distribution between Apple and Google products be explored and visualized effectively?
* Which evaluation metrics such as accuracy and precision can be used to assess the model's performance and ensure its reliability?

### 1.4 SUCCESS CRITERIA

The project will be considered successful if:

* A reliable sentiment classification model is developed with acceptable performance metrics.

* Clear visualizations are produced to show sentiment trends and comparisons between brands.

* The analysis provides actionable insights that help understand customer opinions and brand perception.

## **2. DATA UNDERSTANDING**

The dataset for this project comes from [CrowdFlower via data.world](https://data.world/crowdflower/brands-and-product-emotions) and consists of tweets related to Apple and Google products. The dataset contains a total of 9,093 records and 3 features. Most tweets are text-based and include user mentions, hashtags and product names.

Key Features in the Dataset:

`tweet_text` - The actual content of the tweet as written by the user. This serves as the main input for Natural Language Processing (NLP) to determine the expressed sentiment.

`emotion_in_tweet_is_directed_at` - Specifies the brand, company or product that the emotion is directed at (e.g., Apple, Google, iPhone, Android). This helps in comparing sentiment between brands.

`is_there_an_emotion_directed_at_a_brand_or_product` - The target variable indicating whether a tweet expresses a positive emotion, a negative emotion, or no emotion toward a brand or product. A small share of tweets carry the label "I can't tell".

### 2.1 Load data

Import necessary libraries and load the dataset.

In [9]:
# Data loading and manipulation
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Sklearn Libraries
from sklearn.pipeline import Pipeline 
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# Sklearn Model_selection
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

# Sklearn Metrics
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import classification_report, accuracy_score, f1_score
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import roc_curve, roc_auc_score

# TensorFlow
import tensorflow as tf

# NLP libraries
import re
import nltk
from nltk.corpus import stopwords
from nltk import FreqDist
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer, word_tokenize

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [10]:
# Load dataset
df = pd.read_csv("judge-1377884607_tweet_product_company.csv", encoding='latin1')

# Display the first few rows of the dataset
df.head()

|   | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


In [11]:
# Summary information about dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


In [12]:
#descriptive statistics
df.describe()

|   | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| count | 9092 | 3291 | 9093 |
| unique | 9065 | 9 | 4 |
| top | RT @mention Marissa Mayer: Google Will Connect... | iPad | No emotion toward brand or product |
| freq | 5 | 946 | 5389 |


In [13]:
# Dataset size and structure
df.shape

(9093, 3)

In [14]:
# Total number of missing values across all columns
df.isna().sum().sum()

5803

In [15]:
#checking for duplicates
df.duplicated().sum()

22

### 2.2 Target variable distribution

In [16]:
# Sentiment count
df['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

In [17]:
# Shorten the target column name to 'sentiment'
df.rename(
    columns={'is_there_an_emotion_directed_at_a_brand_or_product': 'sentiment'},
    inplace=True,
)
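
The label counts above are uneven, which matters for the choice of evaluation metrics later on. The sketch below re-enters those counts by hand (so it runs without the CSV file) and converts them to percentages. The names `sentiment_counts`, `sentiment_share`, and `pos_neg_ratio` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Label counts copied from the value_counts() output above
sentiment_counts = pd.Series({
    "No emotion toward brand or product": 5389,
    "Positive emotion": 2978,
    "Negative emotion": 570,
    "I can't tell": 156,
})

# Share of each label as a percentage of all rows
sentiment_share = (sentiment_counts / sentiment_counts.sum() * 100).round(1)
print(sentiment_share)

# Ratio of positive to negative tweets, the two classes named in the objectives
pos_neg_ratio = sentiment_counts["Positive emotion"] / sentiment_counts["Negative emotion"]
print(f"Positive-to-negative ratio: {pos_neg_ratio:.1f}")
```

There are roughly five positive tweets for every negative one, so accuracy alone could look good for a model that rarely predicts the negative class. On the loaded dataframe, `df['sentiment'].value_counts(normalize=True)` should give the same shares directly.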

## **3. DATA PREPARATION**

This section cleans the dataset (missing values and duplicates) and preprocesses the tweet text for modeling.

### 3.1 Data Cleaning

#### Handling missing values

In [None]:
#Checking missing values
df.isnull().sum()

In [None]:
# Handle missing values in tweet_text
df = df.dropna(subset = ['tweet_text'])

# Confirm that there are no missing values in tweet_text
df['tweet_text'].isnull().sum()

In [None]:
# Check value counts in emotion_in_tweet_is_directed_at
print(df['emotion_in_tweet_is_directed_at'].value_counts())

# Check missing values
print(f"Missing values in this column: {df['emotion_in_tweet_is_directed_at'].isnull().sum()}")

In [None]:
# Fill missing values 
df['emotion_in_tweet_is_directed_at'] = df['emotion_in_tweet_is_directed_at'].fillna('Unknown') 

# Check missing values
df['emotion_in_tweet_is_directed_at'].isnull().sum()

#### Handling Duplicates

In [None]:
# Check for duplicates
df.duplicated().sum()

In [None]:
# Remove duplicate rows
df = df.drop_duplicates()
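
Note that `drop_duplicates()` with no arguments removes only rows that match in every column. The earlier `describe()` output lists 9,092 non-null tweets but 9,065 unique texts, while `duplicated()` flagged 22 rows, which suggests that a handful of tweets repeat with a different product or label. The sketch below shows the difference on a small made-up dataframe; `toy` and its contents are hypothetical, and whether such rows should also be dropped is a modeling decision the notebook does not make.

```python
import pandas as pd

# Made-up rows: the first two match in every column, the third repeats the text
# with a different label
toy = pd.DataFrame({
    "tweet_text": ["great ipad", "great ipad", "great ipad", "maps app crashed"],
    "sentiment": [
        "Positive emotion",
        "Positive emotion",
        "No emotion toward brand or product",
        "Negative emotion",
    ],
})

# Rows identical in every column (what drop_duplicates() removes by default)
full_row_dups = int(toy.duplicated().sum())

# Rows whose text already appeared earlier, regardless of the other columns
text_dups = int(toy.duplicated(subset=["tweet_text"]).sum())

deduped_rows = toy.drop_duplicates()

print(f"Full-row duplicates: {full_row_dups}")
print(f"Repeated tweet texts: {text_dups}")
print(f"Rows left after drop_duplicates(): {len(deduped_rows)}")
```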

### 3.2 Text Cleaning

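This step removes elements that carry little sentiment information: URLs, user mentions, hashtag symbols, punctuation, and digits. It also collapses repeated whitespace and lowercases the text.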

In [None]:
def clean_text(text):
    """Strip URLs, mentions, and non-letter characters, then lowercase the tweet."""
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove mentions
    text = re.sub(r'@\w+', '', text)
    # Remove hashtag symbols (the tag word itself is kept)
    text = re.sub(r'#', '', text)
    # Remove punctuation and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)

    return text.lower().strip()

In [None]:
# Apply cleaning function
df['clean_text'] = df['tweet_text'].apply(clean_text)

# Check results
df[['tweet_text', 'clean_text']].head()
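
To make the effect of each step concrete, the sketch below applies the same cleaning steps to a single made-up tweet. The function is repeated so the example runs on its own, and `sample_tweet` is hypothetical rather than taken from the dataset.

```python
import re

def clean_text(text):
    # Same steps as the notebook's clean_text, repeated so the example is standalone
    text = re.sub(r'http\S+|www\S+', '', text)   # URLs
    text = re.sub(r'@\w+', '', text)             # mentions
    text = re.sub(r'#', '', text)                # hashtag symbols
    text = re.sub(r'[^a-zA-Z\s]', '', text)      # punctuation and numbers
    text = re.sub(r'\s+', ' ', text)             # extra whitespace
    return text.lower().strip()

# Hypothetical tweet, not taken from the dataset
sample_tweet = "@user Loving the new #iPad 2!! Details: http://example.com/x"
cleaned = clean_text(sample_tweet)
print(cleaned)  # loving the new ipad details
```

One side effect worth knowing: because digits are removed, version numbers disappear, so "iPad 2" and "iPad" both end up as "ipad".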

### 3.3 Lemmatization

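Lemmatization reduces each word to its base form (for example, "phones" becomes "phone"), so that inflected variants share a single feature and the model generalizes better.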
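
A minimal sketch of this step, using the `WordNetLemmatizer` and `RegexpTokenizer` classes already imported in section 2.1. The helper name `lemmatize_text`, the token pattern, and the reliance on the lemmatizer's default noun part of speech are assumptions for illustration, not the notebook's confirmed implementation.

```python
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

# Keep runs of lowercase letters; clean_text output is already lowercased
tokenizer = RegexpTokenizer(r"[a-z]+")

def lemmatize_text(text, lemmatizer):
    """Tokenize cleaned text and lemmatize each token (default part of speech: noun)."""
    tokens = tokenizer.tokenize(text)
    return " ".join(lemmatizer.lemmatize(token) for token in tokens)

# WordNetLemmatizer needs the WordNet corpus, which is downloaded separately
try:
    print(lemmatize_text("the phones were crashing", WordNetLemmatizer()))
except LookupError:
    print("WordNet corpus not found; run nltk.download('wordnet') first")
```

In the notebook, this could be applied with something like `df['clean_text'].apply(lambda t: lemmatize_text(t, lemmatizer))`, creating one lemmatizer instance up front and reusing it for every row. With the default noun setting, verb forms such as "crashing" are left unchanged unless a part-of-speech tag is passed to `lemmatize`.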