<a href="https://colab.research.google.com/github/udithac/llm/blob/main/Twitter_Emotion_Analysis_Indian_Election_2019.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:

# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES
# TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

import os
import sys
from tempfile import NamedTemporaryFile
from urllib.request import urlopen
from urllib.parse import unquote, urlparse
from urllib.error import HTTPError
from zipfile import ZipFile
import tarfile
import shutil

CHUNK_SIZE = 40960
DATA_SOURCE_MAPPING = 'tweetdatasets:https%3A%2F%2Fstorage.googleapis.com%2Fkaggle-data-sets%2F5228925%2F8715438%2Fbundle%2Farchive.zip%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com%252F20240626%252Fauto%252Fstorage%252Fgoog4_request%26X-Goog-Date%3D20240626T140431Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3D2008061a6749084b68ba80ff109ca1a7e59562bd387c167f75ab2ad7bad88ace785e52bbfafc6028ab34b1b8fa0d1b324a71fce53dd4b65a3976d4b7f872f0e388c337d820f634d141e41618c7153290afd0ee663aa57aecf3dacd0630fa0147f4c88397a83cb79697e09bbab7c843d8034541848ca9af7d4846dd6171bac968ce16e76f5a1fc8b7f26c22d3faaae08656b18009ff3c3390e6dd232e5225a3324e8407c046fd89be5ed5b5e72b5d290e6b3fb8b3c127f788be15fbf0aeed956ab72f6dcbccccc873131c2c41c68fe76eae743d83641e45a80115d94ea2a99e5096f1f7dcba96aec4cb456e93075fe858df67986563096a768578d25775bfeec3'

KAGGLE_INPUT_PATH='/kaggle/input'
KAGGLE_WORKING_PATH='/kaggle/working'
KAGGLE_SYMLINK='kaggle'

!umount /kaggle/input/ 2> /dev/null
shutil.rmtree('/kaggle/input', ignore_errors=True)
os.makedirs(KAGGLE_INPUT_PATH, 0o777, exist_ok=True)
os.makedirs(KAGGLE_WORKING_PATH, 0o777, exist_ok=True)

try:
  os.symlink(KAGGLE_INPUT_PATH, os.path.join("..", 'input'), target_is_directory=True)
except FileExistsError:
  pass
try:
  os.symlink(KAGGLE_WORKING_PATH, os.path.join("..", 'working'), target_is_directory=True)
except FileExistsError:
  pass

for data_source_mapping in DATA_SOURCE_MAPPING.split(','):
    directory, download_url_encoded = data_source_mapping.split(':')
    download_url = unquote(download_url_encoded)
    filename = urlparse(download_url).path
    destination_path = os.path.join(KAGGLE_INPUT_PATH, directory)
    try:
        with urlopen(download_url) as fileres, NamedTemporaryFile() as tfile:
            total_length = fileres.headers['content-length']
            print(f'Downloading {directory}, {total_length} bytes compressed')
            dl = 0
            data = fileres.read(CHUNK_SIZE)
            while len(data) > 0:
                dl += len(data)
                tfile.write(data)
                done = int(50 * dl / int(total_length))
                sys.stdout.write(f"\r[{'=' * done}{' ' * (50-done)}] {dl} bytes downloaded")
                sys.stdout.flush()
                data = fileres.read(CHUNK_SIZE)
            if filename.endswith('.zip'):
              with ZipFile(tfile) as zfile:
                zfile.extractall(destination_path)
            else:
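              # Flush so the archive is fully on disk before reopening it by name;
              # use a distinct name so the tarfile module is not shadowed.
              tfile.flush()
              with tarfile.open(tfile.name) as tar_archive:
                tar_archive.extractall(destination_path)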
            print(f'\nDownloaded and uncompressed: {directory}')
    except HTTPError as e:
        print(f'Failed to load (likely expired) {download_url} to path {destination_path}')
        continue
    except OSError as e:
        print(f'Failed to load {download_url} to path {destination_path}')
        continue

print('Data source import complete.')


# Emotion Analysis of Tweets by Political Candidates in the 2019 Indian Election.

### To constitute India's 17th Lok Sabha, general elections were held in April–May 2019. The results were announced on 23 May 2019. The main contenders were two alliances: the incumbent National Democratic Alliance, led by the BJP (candidate: Mr. Narendra Modi), and the opposition United Progressive Alliance, led by the Indian National Congress (candidate: Mr. Rahul Gandhi).
### Source: https://en.wikipedia.org/wiki/Results_of_the_2019_Indian_general_election

# Project Summary: Emotion Analysis of Tweets by Political Candidates in the 2019 Indian Election

## Objective
#### To analyze and visualize the trends in emotions expressed in tweets by the main political candidates, Rahul Gandhi and Narendra Modi, during the 2019 Indian election.

## Key Steps

### Data Preparation
#### - Gathered tweet data for the main political candidates, Rahul Gandhi and Narendra Modi.
#### - Each tweet was annotated with an emotion (sentiment) label such as positive, negative, or neutral.

### Emotion Trend Analysis
#### - Grouped the tweet data by candidate, date, and sentiment.
#### - Counted the occurrences of each sentiment for each candidate on each date.
#### - Plotted line graphs to show how the number of tweets expressing different emotions changed over time for each candidate.
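
The trend steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's own implementation: it assumes the combined data frame has `Candidate`, `Date`, and `Sentiment` columns and that `Date` holds strings `pd.to_datetime` can parse. The helper names `daily_sentiment_counts` and `plot_sentiment_trends` and the toy rows are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt


def daily_sentiment_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Count tweets per (Candidate, Day), with one column per sentiment."""
    out = df.copy()
    # Assumption: 'Date' is parseable; unparseable values become NaT and are dropped.
    out["Day"] = pd.to_datetime(out["Date"], errors="coerce").dt.normalize()
    out = out.dropna(subset=["Day"])
    return (
        out.groupby(["Candidate", "Day", "Sentiment"])
        .size()
        .unstack(fill_value=0)  # move Sentiment into columns
    )


def plot_sentiment_trends(counts: pd.DataFrame) -> None:
    """Draw one line plot per candidate, one line per sentiment."""
    candidates = counts.index.get_level_values("Candidate").unique()
    fig, axes = plt.subplots(len(candidates), 1, sharex=True, squeeze=False,
                             figsize=(10, 4 * len(candidates)))
    for ax, candidate in zip(axes[:, 0], candidates):
        counts.loc[candidate].plot(ax=ax, marker="o")
        ax.set_title(f"Daily tweet counts by sentiment: {candidate}")
        ax.set_ylabel("Tweets")
        ax.grid(True)
    plt.tight_layout()
    plt.show()


# Hypothetical toy rows standing in for the real tweet data
toy = pd.DataFrame({
    "Candidate": ["Modi", "Modi", "Modi", "Rahul", "Rahul"],
    "Date": ["2019-04-01 10:00", "2019-04-01 12:30", "2019-04-02 09:00",
             "2019-04-01 08:00", "2019-04-02 18:45"],
    "Sentiment": ["Positive", "Negative", "Positive", "Negative", "Negative"],
})
toy_counts = daily_sentiment_counts(toy)
print(toy_counts)
```

In the notebook, calling `plot_sentiment_trends(daily_sentiment_counts(df))` on the combined data frame would draw the per-candidate line plots, provided the sentiment column has already been renamed to `Sentiment`.
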

### Tweet Count Comparison
#### - Grouped the tweet data by candidate.
#### - Counted the total number of tweets for each candidate.
#### - Created a horizontal bar plot to display the tweet counts for each candidate, with counts labeled at the end of each bar for clarity.

## Visualizations

### Emotion Trends by Candidate
#### - Line plots showing the trends in different emotions (positive, negative, neutral) for each candidate (Rahul Gandhi and Narendra Modi) over time.
#### - Enhanced readability with clear labels, legends, and grid lines.

### Tweet Counts by Candidate
#### - Horizontal bar plot comparing the total tweet counts for each candidate.
#### - Custom colors and labels to highlight the data distribution.

## Key Customizations
#### - Improved plot readability with larger fonts and clear labeling.
#### - Added count labels to bar plots for easy comparison.
#### - Enhanced plot aesthetics by removing unnecessary spines and using a tight layout to avoid overlaps.



In [None]:
# import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#open data sets
df_modi = pd.read_csv('/kaggle/input/tweetdatasets/ModiRelatedTweetsWithSentiment.csv')
df_rahul = pd.read_csv('/kaggle/input/tweetdatasets/RahulRelatedTweetsWithSentiment.csv')

In [None]:
# Inspect the data frame
print("*** Modi Related Tweet Data Frame")
df_modi.info()  # info() prints its report directly and returns None
print(df_modi.describe())
print(df_modi.dtypes)
print(df_modi.head())
print(df_modi.tail())
print("*** Missing Values")
print(df_modi.isnull().sum())

In [None]:
# There are some missing values; let's drop those records
df_modi.dropna(inplace=True)

In [None]:
df_modi.isnull().sum()

In [None]:
# No missing values remain
df_modi.info()

In [None]:
print("*** Rahul Related Tweets")
df_rahul.info()  # info() prints its report directly and returns None
print(df_rahul.describe())
print(df_rahul.dtypes)
print(df_rahul.head())
print(df_rahul.tail())
print("*** Missing Values")
print(df_rahul.isnull().sum())

In [None]:
# There are no missing values for Rahul

In [None]:
# Let's combine two data frames into one, then we can easily compare the results
df_modi['Candidate']='Modi'
df_rahul['Candidate']='Rahul'
df = pd.concat([ df_modi[['Candidate','Date','Tweet','Emotion']],df_rahul[['Candidate','Date','Tweet','Emotion']] ])
df

## Exploratory Data Analysis - EDA

In [None]:
df.rename(columns={'Emotion': 'Sentiment'}, inplace=True)

# Replace 'pos' and 'neg' with 'Positive' and 'Negative'
df['Sentiment'] = df['Sentiment'].replace({'pos': 'Positive', 'neg': 'Negative'})

In [None]:
import matplotlib.pyplot as plt

# Group by 'Candidate' and count the occurrences
candidate_counts = df.groupby('Candidate').size()

# Create the bar plot with custom colors
colors = ['orange', 'green']  # Assign specific colors to each bar
ax = candidate_counts.plot(kind='barh', color=colors, figsize=(10, 6))

# Add count labels just past the end of each bar
for i in ax.patches:
    ax.text(i.get_width() + 0.3, i.get_y() + i.get_height() / 2,
            str(int(i.get_width())),
            ha='left', va='center', fontsize=12, color='black')

# Customize the plot appearance
plt.gca().spines[['top', 'right']].set_visible(False)
plt.xlabel('Count', fontsize=14)
plt.ylabel('Candidate', fontsize=14)
plt.title('Tweet Counts by Candidate', fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Show the plot
plt.tight_layout()
plt.show()


In [None]:
# Group by 'Candidate' and count the occurrences
candidate_counts = df.groupby('Candidate').size()

# Define colors for the pie chart
colors = ['orange', 'green']

# Create the pie chart
plt.figure(figsize=(8, 6))
candidate_counts.plot(kind='pie', colors=colors, autopct='%1.1f%%', startangle=90, counterclock=False)

# Customize the plot appearance
plt.gca().spines[['top', 'right', 'left', 'bottom']].set_visible(False)
plt.ylabel('')  # Remove the default ylabel
plt.title('Tweet Distribution by Candidate')

# Show the plot
plt.show()

In [None]:
# Group emotions by candidate and sentiment
emotion_counts = df.groupby(['Candidate', 'Sentiment']).size().unstack(fill_value=0)

#Calculate total sentiment count per candidate (row-wise sum)
total_count_per_candidate = emotion_counts.sum(axis=1)

# Calculate percentages for each sentiment category
percentages = emotion_counts.div(total_count_per_candidate, axis=0) * 100

# Create the bar chart
percentages.plot(kind='bar', stacked=False, color=['#FC5D68', '#09A563'])  # Set your desired colors

# Customize plot
plt.title('Sentiment Distribution (%) by Candidate')
plt.xlabel('Candidate')
plt.ylabel('Percentage (%)')
plt.xticks(rotation=0)  # Keep x-axis labels horizontal for readability
plt.legend(title='Sentiment')

# Add percentage labels on top of bars
for bar_container in plt.gca().containers:
    for bar in bar_container:
        yval = bar.get_height()
        plt.text(bar.get_x() + bar.get_width() / 2, yval + 1, f'{yval:.1f}%', ha='center', va='bottom')

plt.show()

In [None]:
# Let's group the tweets by candidate and sentiment
result=df.groupby(['Candidate', 'Sentiment']).size().unstack(fill_value=0)
print("*** Distribution of Sentiment by Candidates")
print(result)
print("*** Totals:")
print(result.sum(axis=1))

### The data frame is imbalanced: 'Modi' has significantly more entries than 'Rahul'.
### This is a natural imbalance that reflects the real-world scenario, since Modi was the incumbent Prime Minister at the time.
### We use under-sampling: rows are randomly removed from the majority class ('Modi') until it matches the size of the minority class.

In [None]:
# Identify Minority and Majority Classes
candidate_counts = df['Candidate'].value_counts()
majority_class = candidate_counts.idxmax()
minority_class_count = candidate_counts.min()
print(f' Majority Class: {majority_class}')
print(f' Minority Class count: {minority_class_count}')

In [None]:
# Under-Sampling majority class
# Randomly sample entries from the majority class to match the minority class size
majority_class_df = df[df['Candidate'] == majority_class]
undersampled_majority_class = majority_class_df.sample(minority_class_count, random_state=42)  # Set random_state for reproducibility

# Combine undersampled majority with minority class for balanced data
balanced_df = pd.concat([undersampled_majority_class, df[df['Candidate'] != majority_class]])
balanced_df

In [None]:
# Let's group the balanced tweets by candidate and sentiment
df_b=balanced_df.groupby(['Candidate', 'Sentiment']).size().unstack(fill_value=0)
print("*** Balanced Data: Distribution of Sentiment by Candidates")
print(df_b)
print("*** Totals:")
print(df_b.sum(axis=1))

In [None]:
#balanced_df=df
balanced_df

In [None]:
# Group emotions by candidate and sentiment
b_emotion_counts = balanced_df.groupby(['Candidate', 'Sentiment']).size().unstack(fill_value=0)

#Calculate total sentiment count per candidate (row-wise sum)
b_total_count_per_candidate = b_emotion_counts.sum(axis=1)

# Calculate percentages for each sentiment category
b_percentages = b_emotion_counts.div(b_total_count_per_candidate, axis=0) * 100

# Create the bar chart
b_percentages.plot(kind='bar', stacked=False, color=['#FC5D68', '#09A563'])  # Set your desired colors

# Customize plot
plt.title('Balanced Data: Sentiment Distribution (%) by Candidate')
plt.xlabel('Candidate')
plt.ylabel('Percentage (%)')
plt.xticks(rotation=0)  # Keep x-axis labels horizontal for readability
plt.legend(title='Sentiment')

# Add percentage labels on top of bars
for bar_container in plt.gca().containers:
    for bar in bar_container:
        yval = bar.get_height()
        plt.text(bar.get_x() + bar.get_width() / 2, yval + 1, f'{yval:.1f}%', ha='center', va='bottom')

plt.show()

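### Under-sampling has balanced the tweet counts per candidate. Because the majority class is sampled at random, its sentiment proportions should stay close to those of the full data, so the real-world scenario is preserved.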
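
One way to check that under-sampling leaves the sentiment mix largely intact is to compare the per-candidate sentiment percentages before and after sampling. The sketch below does this on hypothetical toy data. The helper name `sentiment_shares` and the toy counts are assumptions for illustration only, not values from the real data set.

```python
import pandas as pd


def sentiment_shares(frame: pd.DataFrame) -> pd.DataFrame:
    """Percentage of each sentiment within each candidate (rows sum to 100)."""
    return pd.crosstab(frame["Candidate"], frame["Sentiment"], normalize="index") * 100


# Hypothetical imbalanced toy data: 1000 'Modi' rows vs. 200 'Rahul' rows
toy_full = pd.DataFrame({
    "Candidate": ["Modi"] * 1000 + ["Rahul"] * 200,
    "Sentiment": ["Positive"] * 600 + ["Negative"] * 400
                 + ["Positive"] * 80 + ["Negative"] * 120,
})

# Same under-sampling recipe as the notebook: sample the majority class
# down to the minority class size with a fixed random_state.
class_sizes = toy_full["Candidate"].value_counts()
majority = class_sizes.idxmax()
sampled_majority = toy_full[toy_full["Candidate"] == majority].sample(
    class_sizes.min(), random_state=42
)
toy_balanced = pd.concat([sampled_majority, toy_full[toy_full["Candidate"] != majority]])

# Absolute change in percentage points caused by under-sampling
share_shift = (sentiment_shares(toy_balanced) - sentiment_shares(toy_full)).abs()
print(share_shift.round(1))
```

The minority class is untouched, so its shift is zero. The majority class shifts only by sampling noise. Running the same comparison on `df` and `balanced_df` would show how large that shift is for the real tweets.
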

# Emotional Analysis
## Sentiment Analysis of Comments with pysentimiento

### This script demonstrates how to analyze the sentiment of comments using a pre-trained transformer model from the pysentimiento library. This allows us to understand the emotional tone behind the comments, which can be valuable for various tasks like customer feedback analysis or social media monitoring.

In [None]:
!pip install pysentimiento

In [None]:
from pysentimiento import create_analyzer
emotion_analyzer = create_analyzer(task="emotion", lang="en")

In [None]:
df

## Integrate with the Emotion Model and Update Emotions

In [None]:
output=emotion_analyzer.predict(balanced_df.Tweet.iloc[0])

In [None]:
from collections import defaultdict

# Collect the predicted probability of each emotion, one list per emotion label
sentiments = defaultdict(list)
for tweet in balanced_df.Tweet:
  probas = emotion_analyzer.predict(tweet).probas
  for key, value in probas.items():
    sentiments[key].append(value)

In [None]:
# Add one column per emotion holding its predicted probability for each tweet
for key, value in sentiments.items():
    balanced_df[key] = value
balanced_df

In [None]:
#print(balanced_df[balanced_df['others'] > 0.5])
#n=1000
#for index, row in balanced_df.iterrows():
#      if row['others']>.5:
#            tweet = row['Tweet']  # Access the tweet text using the column name
#            print(tweet)
#            n+=1
#            if n==1000:
#                exit

In [None]:
balanced_df_data= balanced_df.drop(columns=['Date','Tweet','Sentiment'])
balanced_df_data

In [None]:
# Filter out unwanted columns
# Drop the 'others' probability, which covers tweets unrelated to the selected emotions.
# errors='ignore' skips any listed column that is not present in the frame.
balanced_df_data_filter = balanced_df_data.drop(columns=['others', 'Other'], errors='ignore')

# Group data by candidate and calculate mean emotion scores
df_grouped = balanced_df_data_filter.groupby('Candidate').mean()

# Create horizontal bar chart (df_grouped is already indexed by 'Candidate')
ax = df_grouped.plot(kind='barh', figsize=(10, 6), width=0.9,color=['#FBB008', '#0000ff','#FB4108','#045029','#a0522d','#800080'])  # Adjust figure size as needed



plt.title('Sentiment Analysis by Candidate (Mean Scores)')
plt.xlabel('Average Score')
plt.ylabel('Candidate')
plt.legend(title='Sentiment')
plt.tight_layout()
plt.show()


## Conclusion: Sentiment Distribution - Modi vs. Rahul
## Our analysis of sentiment distribution in comments related to Modi and Rahul reveals distinct patterns:

### Modi-related Comments: The sentiments of joy, sadness, and fear are predominantly associated with comments about Modi.
### Rahul-related Comments: Comments about Rahul show higher associations with anger, surprise, and disgust.
### Significant Ratios: The ratios of disgust, anger, and joy exhibit notable differences, underscoring the varying emotional responses elicited by each figure.
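
The ratios mentioned above can be computed by dividing one candidate's mean emotion scores by the other's. The sketch below shows the calculation on hypothetical placeholder numbers. The values in `toy_emotion_means` are not the notebook's results. In the notebook the input would be `df_grouped`.

```python
import pandas as pd

# Hypothetical mean emotion scores per candidate (placeholders, not real results)
toy_emotion_means = pd.DataFrame(
    {
        "joy": [0.10, 0.05],
        "surprise": [0.02, 0.01],
        "sadness": [0.05, 0.04],
        "anger": [0.08, 0.12],
        "disgust": [0.15, 0.20],
        "fear": [0.02, 0.01],
    },
    index=pd.Index(["Modi", "Rahul"], name="Candidate"),
)

# Ratio > 1: the emotion is stronger in Rahul-related tweets;
# ratio < 1: it is stronger in Modi-related tweets.
emotion_ratio = (
    toy_emotion_means.loc["Rahul"] / toy_emotion_means.loc["Modi"]
).sort_values(ascending=False)
print(emotion_ratio.round(2))
```

Sorting the ratios makes it easy to see which emotions differ most between the two candidates.
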



## Emotion Classification
#### Emotions can often be classified into broader categories based on their characteristics and typical expressions. The six emotions scored by the model can be grouped as follows:

### Positive Emotions:
#### - Joy: A feeling of happiness or pleasure.
#### - Surprise: Feeling startled or amazed.

### Negative Emotions:
#### - Sadness: Feeling unhappy, sorrowful, or disappointed.
#### - Anger: Feeling irritated, frustrated, or hostile.
#### - Disgust: Feeling strong aversion or revulsion towards something.
#### - Fear: Feeling afraid, anxious, or scared.

##### These emotions can be grouped into positive (joy, surprise) and negative (sadness, anger, disgust, fear) categories based on their affective valence (positive or negative) and typical psychological and physiological responses.
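
A minimal sketch of this grouping, assuming the positive and negative scores are obtained by summing the mean scores of the emotions in each category. The notebook does not show how its summary values were derived, so the summing rule, the helper name `valence_summary`, and the toy numbers are all assumptions for illustration.

```python
import pandas as pd

# Grouping taken from the classification above
POSITIVE_EMOTIONS = ["joy", "surprise"]
NEGATIVE_EMOTIONS = ["sadness", "anger", "disgust", "fear"]


def valence_summary(mean_scores: pd.DataFrame) -> pd.DataFrame:
    """Collapse per-emotion mean scores into Positive and Negative totals.

    Assumption: categories are combined by summation.
    """
    return pd.DataFrame({
        "Positive": mean_scores[POSITIVE_EMOTIONS].sum(axis=1),
        "Negative": mean_scores[NEGATIVE_EMOTIONS].sum(axis=1),
    })


# Hypothetical per-candidate means (placeholders, not the notebook's results)
toy_means = pd.DataFrame(
    {
        "joy": [0.10, 0.05],
        "surprise": [0.02, 0.01],
        "sadness": [0.05, 0.04],
        "anger": [0.08, 0.12],
        "disgust": [0.15, 0.20],
        "fear": [0.02, 0.01],
    },
    index=pd.Index(["Modi", "Rahul"], name="Candidate"),
)
toy_summary = valence_summary(toy_means)
print(toy_summary)
```

Applied to the notebook's `df_grouped`, `valence_summary(df_grouped)` would return one row per candidate with `Positive` and `Negative` columns.
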

In [None]:
# Mean scores per candidate, as reported in the summary below
data = {
    'Candidate': ['Modi', 'Rahul'],
    'Positive': [0.117563, 0.060727],
    'Negative': [0.294752, 0.394429],
}

# Convert data to DataFrame
summary_df = pd.DataFrame(data)
summary_df.set_index('Candidate', inplace=True)

# Define colors for positive and negative sentiments
colors = ['#276205', '#FA2F04']  # Green for Positive, red-orange for Negative

# Plot a stacked horizontal bar chart
ax = summary_df[['Positive', 'Negative']].plot(kind='barh', width=0.4, stacked=True, color=colors, figsize=(10, 4))

# Customizing the chart
plt.title('Sentiment Distribution: Modi vs. Rahul')
plt.ylabel('Candidate')
plt.xlabel('Mean Score')

# Label each stacked bar, just past its end, with the mean of its Positive and Negative scores
for i, (pos, neg) in enumerate(zip(summary_df['Positive'], summary_df['Negative'])):
    mean_score = (pos + neg) / 2  # Calculate mean score
    ax.text(pos + neg + 0.02, i, f'{mean_score:.2f}', ha='center', va='center', color='black')

# Adding a legend
plt.legend(['Positive', 'Negative'], title='Sentiment', loc='lower right')

# Show the plot
plt.show()
print(summary_df)

## Summary of Sentiment Analysis

### Data:
- **Modi**: Positive = 0.117563, Negative = 0.294752
- **Rahul**: Positive = 0.060727, Negative = 0.394429

### Key Observations:

#### Modi-related tweets have a higher mean positive score than Rahul-related tweets (0.118 vs. 0.061).
#### Rahul-related tweets have a higher mean negative score (0.394 vs. 0.295).
##### Remember that context and time frame play a crucial role in interpreting these sentiment scores.

## Note: Actual Result

### Source: https://en.wikipedia.org/wiki/Results_of_the_2019_Indian_general_election


### Compiled by Uditha C WICK.