
# **Introduction and Setup**

## **Briefly explain the purpose of the notebook.**

This notebook performs exploratory data analysis (EDA) on the tweet dataset used in a sentiment analysis project. Sentiment analysis determines the emotional tone of a piece of text, typically labeled positive, negative, or neutral. EDA reveals the dataset's characteristics, the distribution of sentiment labels, and the most frequent words in each class, all of which inform the design of the sentiment model.

In this project, EDA contributes in the following ways:


*   Understanding Data Distribution
*   Identifying Data Quality Issues
*   Preprocessing Strategy
*   Top Words Analysis


## **Import necessary libraries.**

In [85]:
%matplotlib inline
import string
from collections import defaultdict

import nltk
import pandas as pd
import plotly.express as px
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from plotly import graph_objs as go


# Download NLTK resources if not already downloaded
nltk.download('punkt')
nltk.download("stopwords")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## **Load the dataset.**

In [86]:
# Load the dataset from a CSV file
csv_file_path = '/data/train.csv'
df = pd.read_csv(csv_file_path)

# Drop the 'selected_text' column as it's not needed for this analysis
df = df.drop(columns=['selected_text'])

# Convert the 'text' column to string type
df["text"] = df["text"].astype(str)
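
Casting `text` to `str` right after loading can hide missing values: in pandas versions that store text as `object` (as the `df.info()` output in this notebook suggests), a missing entry may become the literal string `"nan"` and then count as non-null. The sketch below, on a made-up toy frame (`toy` and its values are assumptions, not rows from `train.csv`), checks for missing text before casting and drops those rows first.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv; the values are made up for illustration
toy = pd.DataFrame({
    "text": ["good morning", np.nan, "so tired"],
    "sentiment": ["positive", "neutral", "negative"],
})

# Count missing tweets BEFORE any cast, while they are still detectable
missing_before = int(toy["text"].isna().sum())

# Drop rows without text, then cast the remaining values to str
clean = toy.dropna(subset=["text"]).copy()
clean["text"] = clean["text"].astype(str)

# For comparison: cast first, then look for the literal string "nan"
# (whether this finds anything depends on the pandas version in use)
cast_first = toy["text"].astype(str)
literal_nan = int((cast_first == "nan").sum())

print("missing before cast:", missing_before)
print("rows kept after dropna:", len(clean))
print("literal 'nan' strings after casting first:", literal_nan)
```

Whether dropping or keeping such rows is appropriate depends on the downstream model; the point is only to make the check before the cast.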

# **Data Exploration**
Display basic information about the dataset using df.shape, df.info(), and df.head().

In [87]:
# Display the shape of the DataFrame as (rows, columns)
print(df.shape)


(27664, 3)


In [88]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27664 entries, 0 to 27663
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   textID     27664 non-null  object
 1   text       27664 non-null  object
 2   sentiment  27664 non-null  object
dtypes: object(3)
memory usage: 648.5+ KB


# **Sentiment Distribution Visualization:**
*   **Group and count sentiment values.**

In [89]:
# Count the tweets in each sentiment class, largest class first
temp = df.groupby('sentiment').count()['text'].reset_index().sort_values(by='text', ascending=False)

# Apply background gradient styling; leaving the styled object as the
# last expression in the cell makes the notebook render it
styled_temp = temp.style.background_gradient(cmap='Purples')
styled_temp
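
Raw counts are easier to judge for class imbalance when shown next to proportions. As a rough sketch on a made-up label column (`toy` is an assumption, not the notebook's `df`), `value_counts` gives the same per-class counts as the `groupby` above, and `normalize=True` gives each class's share.

```python
import pandas as pd

# Made-up labels for illustration only
toy = pd.DataFrame({
    "sentiment": ["neutral", "positive", "neutral", "negative", "neutral", "positive"]
})

# Per-class counts, sorted from most to least frequent
counts = toy["sentiment"].value_counts()

# Per-class share of the whole dataset
shares = toy["sentiment"].value_counts(normalize=True).round(3)

# Side-by-side summary table
summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```

On the real data the same two calls would be applied to `df["sentiment"]`.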


*   **Visualize sentiment distribution using a Funnel-Chart.**



In [90]:
# Create a Funnel-Chart using Plotly to visualize sentiment distribution
fig = go.Figure(go.Funnelarea(
    text=temp.sentiment,
    values=temp.text,
    title={"position": "top center", "text": "Funnel-Chart of Sentiment Distribution"}
))
# Show the Funnel-Chart using the specified renderer
fig.show(renderer="colab")

In [91]:
# Display the first few rows of the DataFrame
df.head()

|   | textID | text | sentiment |
|---|--------|------|-----------|
| 0 | cb774db0d1 | I`d have responded, if I were going | neutral |
| 1 | 549e992a42 | Sooo SAD I will miss you here in San Diego!!! | negative |
| 2 | 088c60f138 | my boss is bullying me... | negative |
| 3 | 9642c003ef | what interview! leave me alone | negative |
| 4 | 358bd9e861 | Sons of ****, why couldn`t they put them on t... | negative |


# **Text Preprocessing:**

*   **Define the preprocess_tweet function.**


In [92]:
# Preprocessing function to clean and tokenize tweets
def preprocess_tweet(tweet):
    tweet = tweet.lower()  # Convert text to lowercase
    tweet = tweet.translate(str.maketrans("", "", string.punctuation))  # Remove punctuation
    tokens = nltk.word_tokenize(tweet)  # Tokenize the text
    stemmer = PorterStemmer()  # Create a stemmer instance
    stopwords_set = set(stopwords.words("english"))  # Get a set of English stopwords
    tokens = [stemmer.stem(token) for token in tokens if token not in stopwords_set]  # Apply stemming and remove stopwords
    return tokens
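
The top-word tables later in this notebook contain tokens such as `im`, `dont`, and `cant`. One likely cause is the order of operations in `preprocess_tweet`: punctuation is stripped before the stopword check, so a contraction like `don't` becomes `dont` and no longer matches a stopword entry that keeps its apostrophe. The sketch below illustrates the effect and one possible fix. It uses a small hypothetical stopword set (`STOPWORDS`) and plain whitespace splitting in place of NLTK's list and tokenizer, so it runs without downloads; it is not the notebook's actual pipeline.

```python
import string

# Hypothetical mini stopword list standing in for NLTK's English list
STOPWORDS = {"i", "don't", "can't", "i'm", "the", "to"}

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punct(text):
    return text.translate(PUNCT_TABLE)

# Normalize the stopwords the same way the tweets are normalized
NORMALIZED_STOPWORDS = {strip_punct(word) for word in STOPWORDS}

def tokens_naive(tweet):
    # Same order as preprocess_tweet: lowercase, strip punctuation, then filter
    words = strip_punct(tweet.lower()).split()
    return [w for w in words if w not in STOPWORDS]

def tokens_normalized(tweet):
    # Identical, but filters against the punctuation-free stopword set
    words = strip_punct(tweet.lower()).split()
    return [w for w in words if w not in NORMALIZED_STOPWORDS]

tweet = "I don't want to go, I'm tired"
naive = tokens_naive(tweet)
normalized = tokens_normalized(tweet)
print(naive)       # ['dont', 'want', 'go', 'im', 'tired']
print(normalized)  # ['want', 'go', 'tired']
```

Building the stemmer and the stopword set once, outside the function, would also avoid recreating them for every tweet.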

*   **Define the calculate_word_counts function.**
*   **Tokenize and preprocess text data using NLTK.**

In [93]:
# Function to calculate word counts for a given list of tweets
def calculate_word_counts(tweets):
    word_count = defaultdict(int)
    for tweet in tweets:
        tokens = preprocess_tweet(tweet)  # Preprocess the tweet
        for token in tokens:
            word_count[token] += 1  # Count the occurrences of each token
    return word_count
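
As an alternative to `defaultdict(int)`, the standard library's `collections.Counter` does the same tallying and adds `most_common` for ranking. The sketch below is only illustrative: `simple_tokens` is a hypothetical stand-in for `preprocess_tweet` (lowercasing and whitespace splitting, no stemming or stopword removal), and the sample tweets are made up.

```python
from collections import Counter

def simple_tokens(tweet):
    # Stand-in for preprocess_tweet: lowercase and split on whitespace only
    return tweet.lower().split()

def count_words(tweets, tokenize=simple_tokens):
    # Tally every token produced by the tokenizer across all tweets
    counts = Counter()
    for tweet in tweets:
        counts.update(tokenize(tweet))
    return counts

sample = ["miss you", "I miss home", "home sweet home"]
counts = count_words(sample)
top_two = counts.most_common(2)
print(top_two)  # [('home', 3), ('miss', 2)]
```

Passing `preprocess_tweet` as the `tokenize` argument would reuse the notebook's own cleaning steps.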

# **Word Count Analysis:**

*   Calculate word counts for each sentiment class.
*   Convert word counts to DataFrames.
*   Display top words and their counts for each sentiment class.

In [94]:
# Calculate word counts for tweets with different sentiments
word_count_positive = calculate_word_counts(df[df['sentiment'] == 'positive']['text'])
word_count_negative = calculate_word_counts(df[df['sentiment'] == 'negative']['text'])
word_count_neutral = calculate_word_counts(df[df['sentiment'] == 'neutral']['text'])


In [95]:
# Convert word counts to DataFrames for better visualization
word_count_positive_df = pd.DataFrame(word_count_positive.items(), columns=['Word', 'Count'])
word_count_negative_df = pd.DataFrame(word_count_negative.items(), columns=['Word', 'Count'])
word_count_neutral_df = pd.DataFrame(word_count_neutral.items(), columns=['Word', 'Count'])


In [96]:
# Preview the first five rows of each table (insertion order, not yet sorted by count)
print("Word Counts - Positive Sentiment:")
print(word_count_positive_df.head())

print("\nWord Counts - Negative Sentiment:")
print(word_count_negative_df.head())

print("\nWord Counts - Neutral Sentiment:")
print(word_count_neutral_df.head())

Word Counts - Positive Sentiment:
    Word  Count
0    2am      2
1   feed      9
2   babi     65
3    fun    344
4  smile     45

Word Counts - Negative Sentiment:
    Word  Count
0   sooo     38
1    sad    398
2   miss    660
3    san      9
4  diego      6

Word Counts - Neutral Sentiment:
                       Word  Count
0                        id     72
1                   respond     11
2                        go   1055
3  httpwwwdothebouncycomsmf      1
4                 shameless      1


# **Top Words Visualization:**

*   **Display top words for positive sentiment with a styled DataFrame.**
*   **Visualize top positive words using a treemap.**
*   **Repeat the above steps for negative and neutral sentiments.**

In [97]:
# Sort and select the top 20 words for each sentiment class
top_positive_words = word_count_positive_df.sort_values(by='Count', ascending=False).head(20)
top_negative_words = word_count_negative_df.sort_values(by='Count', ascending=False).head(20)
top_neutral_words = word_count_neutral_df.sort_values(by='Count', ascending=False).head(20)

In [98]:
# Display the top words for each sentiment class - Positive
print("Top 20 Words - Positive Sentiment:")
print()
# Apply background gradient styling and display the DataFrame
styled_positive_words = top_positive_words.style.background_gradient(cmap='Blues')
styled_positive_words


Top 20 Words - Positive Sentiment:



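|     | Word | Count |
|-----|------|-------|
| 74  | day | 1326 |
| 16  | love | 1139 |
| 103 | good | 1056 |
| 66  | happi | 852 |
| 112 | thank | 814 |
| 38  | im | 742 |
| 143 | mother | 668 |
| 49  | go | 573 |
| 55  | hope | 519 |
| 426 | great | 481 |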


In [99]:
# Create a treemap using Plotly Express to visualize top positive words
fig = px.treemap(top_positive_words, path=['Word'], values='Count', title='Tree Of Unique Positive Words')
fig.show()

In [100]:
# Display the top words for each sentiment class - Negative
print("\nTop 20 Words - Negative Sentiment:")
print()
# Apply background gradient styling and display the DataFrame
styled_negative_words = top_negative_words.style.background_gradient(cmap='Purples')
styled_negative_words




Top 20 Words - Negative Sentiment:



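|     | Word | Count |
|-----|------|-------|
| 54  | im | 1227 |
| 22  | go | 735 |
| 2   | miss | 660 |
| 155 | get | 613 |
| 66  | work | 493 |
| 102 | like | 492 |
| 106 | dont | 469 |
| 199 | feel | 467 |
| 299 | cant | 463 |
| 39  | day | 404 |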


In [101]:
# Create a treemap using Plotly Express to visualize top negative words
fig = px.treemap(top_negative_words, path=['Word'], values='Count', title='Tree Of Unique Negative Words')
fig.show()


In [102]:
# Display the top words for each sentiment class - Neutral
print("\nTop 20 Words - Neutral Sentiment:")
print()

# Apply background gradient styling and display the DataFrame
styled_neutral_words = top_neutral_words.style.background_gradient(cmap='Greens')
styled_neutral_words


Top 20 Words - Neutral Sentiment:



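|     | Word | Count |
|-----|------|-------|
| 19  | im | 1058 |
| 2   | go | 1055 |
| 23  | get | 819 |
| 164 | work | 647 |
| 320 | day | 638 |
| 42  | got | 539 |
| 285 | dont | 491 |
| 112 | like | 482 |
| 224 | time | 468 |
| 233 | lol | 455 |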


In [103]:
# Create a treemap using Plotly Express to visualize top neutral words
fig = px.treemap(top_neutral_words, path=['Word'], values='Count', title='Tree Of Unique Neutral Words')
fig.show()
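
Several of the most frequent words (`im`, `go`, `day`) appear in all three tables, so raw frequency alone says little about what distinguishes a class. As a rough sketch, the code below takes the ten rows per class shown in the outputs above (not the full top 20) and separates words shared by every class from words that appear in only one class. The names `top_words`, `shared`, and `exclusive` are introduced here for illustration.

```python
from collections import Counter

# Counts copied from the displayed rows of the three top-word tables above
top_words = {
    "positive": {"day": 1326, "love": 1139, "good": 1056, "happi": 852, "thank": 814,
                 "im": 742, "mother": 668, "go": 573, "hope": 519, "great": 481},
    "negative": {"im": 1227, "go": 735, "miss": 660, "get": 613, "work": 493,
                 "like": 492, "dont": 469, "feel": 467, "cant": 463, "day": 404},
    "neutral": {"im": 1058, "go": 1055, "get": 819, "work": 647, "day": 638,
                "got": 539, "dont": 491, "like": 482, "time": 468, "lol": 455},
}

# Words present in every class's list
shared = sorted(set.intersection(*(set(words) for words in top_words.values())))

# Number of class lists each word appears in
class_hits = Counter(word for words in top_words.values() for word in words)

# Words that appear in exactly one class's list, in their original rank order
exclusive = {
    sentiment: [word for word in words if class_hits[word] == 1]
    for sentiment, words in top_words.items()
}

print(shared)     # ['day', 'go', 'im']
print(exclusive)
```

With the full word-count DataFrames, the same idea could be applied before plotting so that each treemap emphasizes class-specific vocabulary.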