# Analyzing Sentiment Through Emojis
In digital communication, emojis serve as expressive symbols that often convey feelings more vividly than words. From chat platforms to social media, these visual cues enhance how people share emotions online.

Sentiment analysis—also called opinion mining—aims to interpret a user's emotional tone based on their language. Traditionally, such analyses focus solely on textual data. However, incorporating emojis introduces an additional dimension of meaning, enabling more nuanced interpretations of user sentiment.

The underlying data used in sentiment detection typically maps emotional weight, or polarity, to individual words. By evaluating these polarities, we can determine the overall emotional tone embedded within a given message.
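To make the idea of word-level polarity concrete, here is a minimal sketch. The lexicon `POLARITY`, its values, and the helper `message_polarity` are made up for illustration and are not part of this project's data.

```python
# Hypothetical mini-lexicon: word -> polarity in [-1, 1] (values are invented).
POLARITY = {"love": 1.0, "great": 0.8, "bad": -0.7, "hate": -1.0}

def message_polarity(message: str) -> float:
    """Average the polarity of known words; unknown words are ignored."""
    scores = [POLARITY[w] for w in message.lower().split() if w in POLARITY]
    return sum(scores) / len(scores) if scores else 0.0

print(message_polarity("I love this great phone"))  # positive score
print(message_polarity("I hate bad service"))       # negative score
```

A score above zero suggests a positive tone and a score below zero a negative one; real lexicons are far larger and usually handle negation and intensity as well.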

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path  # used to build the dataset paths below

# Data Preparation

### Emoji Data Preparation

Emojis are pictographic characters in Unicode that serve as a concise way to convey ideas and emotions. Unlike the handful of traditional emoticons with obvious sentimental meanings, the emoji lexicon consists of hundreds of symbols with varying connotations.


In [None]:
# load only the columns needed for sentiment analysis
emoji_columns = ["Emoji", "Negative", "Neutral", "Positive"]
csv_path      = Path("dataset") / "Emoji_Sentiment_Data.csv"

raw_emoji_df = pd.read_csv(csv_path, usecols=emoji_columns)

raw_emoji_df

Unnamed: 0,Emoji,Negative,Neutral,Positive
0,😂,3614,4163,6845
1,❤,355,1334,6361
2,♥,252,1942,4950
3,😍,329,1390,4640
4,😭,2412,1218,1896
...,...,...,...,...
964,➛,0,1,0
965,♝,0,1,0
966,❋,0,1,0
967,✆,0,1,0


In [None]:
raw_emoji_df.Emoji.values

array(['😂', '❤', '♥', '😍', '😭', '😘', '😊', '👌', '💕', '👏', '😁', '☺', '♡',
       '👍', '😩', '🙏', '✌', '😏', '😉', '🙌', '🙈', '💪', '😄', '😒', '💃', '💖',
       '😃', '😔', '😱', '🎉', '😜', '☯', '🌸', '💜', '💙', '✨', '😳', '💗', '★',
       '█', '☀', '😡', '😎', '😢', '💋', '😋', '🙊', '😴', '🎶', '💞', '😌', '🔥',
       '💯', '🔫', '💛', '💁', '💚', '♫', '😞', '😆', '😝', '😪', '�', '😫', '😅',
       '👊', '💀', '😀', '😚', '😻', '©', '👀', '💘', '🐓', '☕', '👋', '✋', '🎊',
       '🍕', '❄', '😥', '😕', '💥', '💔', '😤', '😈', '►', '✈', '🔝', '😰', '⚽',
       '😑', '👑', '😹', '👉', '🍃', '🎁', '😠', '🐧', '☆', '🍀', '🎈', '🎅', '😓',
       '😣', '😐', '✊', '😨', '😖', '💤', '💓', '👎', '💦', '✔', '😷', '⚡', '🙋',
       '🎄', '💩', '🎵', '➡', '😛', '😬', '👯', '💎', '🌿', '🎂', '🌟', '🔮', '❗',
       '👫', '🏆', '✖', '☝', '😙', '⛄', '👅', '♪', '🍂', '💏', '🔪', '🌴', '👈',
       '🌹', '🙆', '➜', '👻', '💰', '🍻', '🙅', '🌞', '🍁', '⭐', '▪', '🎀', '━',
       '☷', '🐷', '🙉', '🌺', '💅', '🐶', '🌚', '👽', '🎤', '👭', '🎧', '👆', '🍸',
       '🍷', '®', '🍉', '😇', '☑', '🏃', '😿', '│', '💣', '🍺', '▶', '😲

### Convert Sentiment to Binary Scale and Standardize Values Between 0 and 1


In [None]:
# derive a binary sentiment flag (0 = negative, 1 = positive)
# ties between Positive and Negative are broken by the parity of the Neutral count
binary_sentiment = (
    (raw_emoji_df["Positive"] > raw_emoji_df["Negative"])
    | (
        (raw_emoji_df["Positive"] == raw_emoji_df["Negative"])
        & (raw_emoji_df["Neutral"] % 2 == 1)
      )
).astype(int)

# build the new DataFrame in one go
new_df_emoji = pd.DataFrame({
    "sentiment": binary_sentiment,
    "emoji":     raw_emoji_df["Emoji"]
})

new_df_emoji

Unnamed: 0,sentiment,emoji
0,1,😂
1,1,❤
2,1,♥
3,1,😍
4,0,😭
...,...,...
964,1,➛
965,1,♝
966,1,❋
967,1,✆
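The labeling rule can be checked by hand on a few rows. The sketch below restates the rule from the cell above as a plain function; the name `binary_label` is introduced here for illustration, and the counts are copied from the emoji table shown earlier.

```python
def binary_label(negative: int, neutral: int, positive: int) -> int:
    """Return 1 if positive votes win, 0 if negative votes win.

    Ties are broken by the parity of the neutral count, mirroring the
    rule in the cell above (an odd neutral count yields 1).
    """
    if positive != negative:
        return int(positive > negative)
    return int(neutral % 2 == 1)

# (negative, neutral, positive) counts copied from the emoji table above
rows = {"😂": (3614, 4163, 6845), "😭": (2412, 1218, 1896), "➛": (0, 1, 0)}
for symbol, counts in rows.items():
    print(symbol, binary_label(*counts))
```

Under this rule, rarely used symbols such as ➛ are labeled positive purely because of the tie-break, which may partly explain the high share of positive emojis.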


# Preparing the Tweet Dataset for Analysis
The project includes a built-in dataset containing 10,000 tweets to get started quickly. However, for more extensive experimentation or model training, you may opt to use a much larger dataset of 1.6 million tweets, available publicly.

In [None]:
# load the preprocessed tweet dataset and drop the unnamed index column
csv_path     = Path("dataset") / "processed_tweet_dataset.csv"
raw_posts_df = pd.read_csv(csv_path)
posts_df     = raw_posts_df.drop(columns=[raw_posts_df.columns[0]]).copy()

posts_df

Unnamed: 0,sentiment,post
0,0,"- Awww, that's a bummer. You shoulda got David..."
1,0,Picked Mich St to win it all from the get go. ...
2,0,throat is closing up and i had some string che...
3,0,"If he doesn't get better in a few days, he cou..."
4,0,I'm sure everyone has ruined my gift to you Wh...
...,...,...
9995,1,- i know now what is that haha X)
9996,1,- had a great time with some of the best peopl...
9997,1,"Tyreseee, when you're heading to The Netherlan..."
9998,1,"don't know what you could possibly mean, dear ..."


# Sentiment Classification with Naive Bayes
Naive Bayes offers a straightforward yet effective approach to building classification models. It works by assigning a category or label to a given data instance—expressed as a vector of features—by estimating the likelihood of each possible class. The class with the highest probability is then chosen from a predefined set of categories.
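As a rough illustration of how the class with the highest probability is chosen, the following sketch implements a tiny multinomial Naive Bayes with add-one smoothing in plain Python. The functions `train_nb` and `predict_nb` and the toy sentences are invented for this example; the notebook itself relies on scikit-learn.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def predict_nb(doc, word_counts, class_counts, vocab):
    """Return the class with the highest log posterior."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        n_words = sum(word_counts[label].values())
        for word in doc.lower().split():
            if word not in vocab:
                continue  # skip words never seen in training
            # add-one (Laplace) smoothing avoids zero probabilities
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data (made up); 1 = positive, 0 = negative
docs = ["great fun day", "love this great song", "awful sad day", "hate this awful traffic"]
labels = [1, 1, 0, 0]
model = train_nb(docs, labels)
print(predict_nb("great song", *model))
print(predict_nb("awful traffic", *model))
```

Working in log space keeps the product of many small probabilities from underflowing, which is why the scores are summed rather than multiplied.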

In [None]:
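import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

# download the NLTK stopword list if it is not already available locally
nltk.download("stopwords", quiet=True)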

# Understanding TF-IDF: Term Frequency–Inverse Document Frequency
TF-IDF is a numerical metric used to evaluate the significance of a word within a specific document, relative to a larger set of documents or a corpus. It balances how often a word appears in one document against how common it is across all documents, helping highlight terms that are distinctively meaningful rather than just frequently used.
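A small worked example may help. The sketch below computes TF-IDF weights by hand, using raw term counts and a smoothed inverse document frequency, ln((1 + N) / (1 + df)) + 1. This is one common variant chosen for illustration; library implementations typically also normalize each document vector. The function `tfidf` and the three documents are made up.

```python
import math
from collections import Counter

def tfidf(corpus):
    """Return one {term: weight} dict per document (no normalization)."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in tokenized
    ]

corpus = ["good movie", "good song", "terrible movie plot"]  # invented documents
weights = tfidf(corpus)
print(weights[2])
```

In the last document, "terrible" and "plot" appear nowhere else and therefore receive a higher weight than "movie", which also occurs in the first document.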

In [None]:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# prepare English stopwords (recent scikit-learn versions expect a list here, not a set)
english_stopwords = list(stopwords.words("english"))

# initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(
    use_idf=True,
    lowercase=True,
    strip_accents="ascii",
    stop_words=english_stopwords
)

In [None]:
# print each emoji with its binary sentiment and count totals
total_emojis   = len(new_df_emoji)
positive_emojis = new_df_emoji["sentiment"].sum()

for emoji_char, sentiment_value in new_df_emoji[["emoji", "sentiment"]].itertuples(index=False):
    print(f"{emoji_char} = {sentiment_value}")

print(f"Processed {total_emojis} emojis, {positive_emojis} positives")


😂 = 1
❤ = 1
♥ = 1
😍 = 1
😭 = 0
😘 = 1
😊 = 1
👌 = 1
💕 = 1
👏 = 1
😁 = 1
☺ = 1
♡ = 1
👍 = 1
😩 = 0
🙏 = 1
✌ = 1
😏 = 1
😉 = 1
🙌 = 1
🙈 = 1
💪 = 1
😄 = 1
😒 = 0
💃 = 1
💖 = 1
😃 = 1
😔 = 0
😱 = 1
🎉 = 1
😜 = 1
☯ = 1
🌸 = 1
💜 = 1
💙 = 1
✨ = 1
😳 = 1
💗 = 1
★ = 1
█ = 0
☀ = 1
😡 = 0
😎 = 1
😢 = 1
💋 = 1
😋 = 1
🙊 = 1
😴 = 0
🎶 = 1
💞 = 1
😌 = 1
🔥 = 1
💯 = 1
🔫 = 0
💛 = 1
💁 = 1
💚 = 1
♫ = 1
😞 = 0
😆 = 1
😝 = 1
😪 = 0
� = 1
😫 = 0
😅 = 1
👊 = 1
💀 = 0
😀 = 1
😚 = 1
😻 = 1
© = 1
👀 = 1
💘 = 1
🐓 = 1
☕ = 1
👋 = 1
✋ = 1
🎊 = 1
🍕 = 1
❄ = 1
😥 = 1
😕 = 0
💥 = 1
💔 = 0
😤 = 0
😈 = 1
► = 1
✈ = 1
🔝 = 1
😰 = 0
⚽ = 1
😑 = 0
👑 = 1
😹 = 1
👉 = 1
🍃 = 1
🎁 = 1
😠 = 0
🐧 = 1
☆ = 1
🍀 = 1
🎈 = 1
🎅 = 1
😓 = 0
😣 = 0
😐 = 0
✊ = 1
😨 = 0
😖 = 0
💤 = 1
💓 = 1
👎 = 0
💦 = 1
✔ = 1
😷 = 0
⚡ = 1
🙋 = 1
🎄 = 1
💩 = 0
🎵 = 1
➡ = 1
😛 = 1
😬 = 1
👯 = 1
💎 = 1
🌿 = 1
🎂 = 1
🌟 = 1
🔮 = 1
❗ = 1
👫 = 1
🏆 = 1
✖ = 1
☝ = 1
😙 = 1
⛄ = 1
👅 = 1
♪ = 1
🍂 = 1
💏 = 1
🔪 = 1
🌴 = 1
👈 = 1
🌹 = 1
🙆 = 1
➜ = 1
👻 = 1
💰 = 1
🍻 = 1
🙅 = 0
🌞 = 1
🍁 = 1
⭐ = 1
▪ = 1
🎀 = 1
━ = 1
☷ = 1
🐷 = 1
🙉 = 1
🌺 = 1
💅 = 1
🐶 = 1
🌚 = 1
👽 = 1
🎤 = 1
👭 = 1
🎧 = 

In [None]:
# summarize total vs. positive counts
total_emojis    = len(new_df_emoji)
positive_count  = new_df_emoji["sentiment"].sum()

print(
    f"Total positive emojis: {positive_count} out of {total_emojis} "
    f"({positive_count/total_emojis:.0%})"
)

Total positive emojis: 795 out of 969 (82%)


In [None]:
# prepare features and labels for modeling
labeled_posts_df = posts_df.copy()

# 0 = negative, 1 = positive
labels = labeled_posts_df["sentiment"].values

# transform text into TF–IDF features
features = tfidf_vectorizer.fit_transform(labeled_posts_df["post"].values)

# display shapes
print(f"Labels shape: {labels.shape}")
print(f"Features shape: {features.shape}")
print(f"{features.shape[0]} samples × {features.shape[1]} features")


Labels shape: (10000,)
Features shape: (10000, 13339)
10000 samples × 13339 features


# Model Training
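When trained on the 10,000-sample dataset, the model reaches a ROC AUC of approximately 0.80 on the held-out test set, a solid baseline for sentiment classification. Note that ROC AUC measures how well the model ranks positive posts above negative ones; it is not the same as accuracy.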

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

# split into train/test sets (25% test, fixed seed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(
    features,
    labels,
    test_size=0.25,
    random_state=42
)

# initialize and train a multinomial Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# get predicted probabilities for the positive class
y_probs = classifier.predict_proba(X_test)[:, 1]

# compute ROC AUC
roc_auc = roc_auc_score(y_test, y_probs)
print(f"ROC AUC Score: {roc_auc:.4f}")

ROC AUC Score: 0.7994
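The ROC AUC reported above can be read as the probability that a randomly chosen positive post receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; the function `roc_auc` and the sample labels and scores are invented for illustration, and this pairwise version is only practical for small inputs.

```python
def roc_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted scores
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, y_score))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly here, giving 0.75. A value of 0.5 corresponds to random ranking and 1.0 to a perfect one.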

### Processing the inputs - Extraction of emoji and texts

In [None]:
import emoji
import re

def split_text_and_emojis(raw_text):
    """
    Remove mentions, URLs, hashtags, and ampersands from `raw_text`,
    then return a tuple (clean_text, extracted_emojis).
    """
    # tokens not starting with unwanted prefixes
    tokens = raw_text.split()
    filtered_tokens = [
        tok for tok in tokens
        if not tok.startswith(("@", "http://", "https://", "#", "&"))
    ]

    # find all emoji characters in the original text
    # (emoji.is_emoji is available in emoji >= 2.0, where emoji.UNICODE_EMOJI no longer exists)
    emoji_chars = [ch for ch in raw_text if emoji.is_emoji(ch)]

    # remove any token that contains one of the extracted emojis
    text_tokens = [
        tok for tok in filtered_tokens
        if not any(e in tok for e in emoji_chars)
    ]

    clean_text = " ".join(text_tokens)
    emojis = "".join(emoji_chars)
    return clean_text, emojis

# Example usage
raw_text = "#samplesenti @emojitweets i ❤❤❤ sentiment &quot; analysis &quot; http://senti.com/pic_01.jpg"
filtered_text, extracted_emojis = split_text_and_emojis(raw_text)

print("All Char:", filtered_text)
print("All Emojis:", extracted_emojis)


All Char: i sentiment analysis
All Emojis: ❤❤❤


### Get the sentiments of the processed posts

In [None]:
def predict_sentiment(text_input, vectorizer=tfidf_vectorizer, model=classifier):
    """
    Given a single string `text_input`, return its predicted sentiment label (0 or 1).
    """
    # transform the input text into TF–IDF features
    features = vectorizer.transform([text_input])
    # predict and return the label as an integer
    return int(model.predict(features)[0])

# example usage
print(predict_sentiment("i sentiment analysis"))

0


In [None]:
def fetch_emoji_sentiments(emoji_chars, sentiment_df=new_df_emoji):
    """
    Map each emoji in `emoji_chars` to its binary sentiment (0/1)
    using the provided `sentiment_df`.
    """
    # build a lookup dict once
    emoji_to_sentiment = dict(zip(sentiment_df["emoji"], sentiment_df["sentiment"]))
    # return sentiment for each character, or None if not found
    return [emoji_to_sentiment.get(ch) for ch in emoji_chars]

sentiment_values = fetch_emoji_sentiments("❤❤❤")
print("Sentiment value of each emoji:", sentiment_values)


Sentiment value of each emoji: [1, 1, 1]


### Building the sentiment analysis

In [None]:
def determine_overall_sentiment(text_input):
    """
    Extract text and emojis from the input, compute their individual sentiments,
    and return an overall label ("Positive" or "Negative").
    """
    # split into clean text and emojis
    clean_text, emojis = split_text_and_emojis(text_input)
    print(f'\tExtracted text: "{clean_text}", emojis: "{emojis}"')

    # predict text sentiment (0 or 1)
    text_label = predict_sentiment(clean_text)
    print(f'\tText sentiment: {text_label}')

    # get each emoji's sentiment (list of 0/1 or None)
    emoji_labels = fetch_emoji_sentiments(emojis)
    # keep only emojis found in the lookup table, so unknown ones do not count as negative
    known_labels = [label for label in emoji_labels if label is not None]
    emoji_sum = sum(known_labels)
    # average over known emojis (0 if none)
    emoji_avg = emoji_sum / len(known_labels) if known_labels else 0
    print(f'\tEmoji average sentiment: {emoji_avg:.2f}')

    # compute combined average: the text label and each known emoji get one vote
    combined_avg = (text_label + emoji_sum) / (1 + len(known_labels))
    print(f'\tOverall average sentiment: {combined_avg:.2f}')

    # decide final label
    return "Positive" if combined_avg >= 0.5 else "Negative"

print(determine_overall_sentiment("i ❤❤❤ sentiment analysis"))

	Extracted text: "i sentiment analysis", emojis: "❤❤❤"
	Text sentiment: 0
	Emoji average sentiment: 1.00
	Overall average sentiment: 0.75
Positive


### Conclusion

In this approach, we treat text and emojis as two separate sources of sentiment information. Each emoji receives a fixed binary polarity derived from how often it appears in positive versus negative contexts, while the tweet text is evaluated independently by our classifier. To determine a tweet's overall sentiment, we average the text-based prediction together with the polarity of each emoji it contains, so every emoji carries the same weight as the text label, and a combined score of 0.5 or higher is labeled positive. This lets emoji polarity correct a weak text prediction, as in the example above, where three positive emojis outweigh a negative text label.
