***
# EDA for tweet_emotions.csv dataset
https://www.kaggle.com/datasets/pashupatigupta/emotion-detection-from-text
***

***
## 1 Load the dataset
<span style="color:red">!!!Make sure only one data source is active: comment out the one you are not using (Google Drive or local)!!!</span>

In [1]:
import pandas as pd
import numpy as np
import random

# read from the google drive
# url = 'https://drive.google.com/file/d/1xv5hff4DntWz6gn4KMXkBQ6t5_Vs16MQ/view?usp=sharing'
# url = 'https://drive.google.com/uc?id=' + url.split('/')[-2]

# read from local
url = '../datasets/tweet_emotions.csv'

df = pd.read_csv(url)

print(df.head())

     tweet_id   sentiment                                            content
0  1956967341       empty  @tiffanylue i know  i was listenin to bad habi...
1  1956967666     sadness  Layin n bed with a headache  ughhhh...waitin o...
2  1956967696     sadness                Funeral ceremony...gloomy friday...
3  1956967789  enthusiasm               wants to hang out with friends SOON!
4  1956968416     neutral  @dannycastillo We want to trade with someone w...


***
## 2 Basic EDA

### Datatypes for each of the columns in our dataframe

In [2]:
print(df.dtypes)

tweet_id      int64
sentiment    object
content      object
dtype: object


### Unique types and count of sentiments in the dataset

In [3]:
# get the value counts of each unique value
counts = df['sentiment'].value_counts()

# convert the counts to percentages
percentages = counts / counts.sum() * 100

cp_df = pd.DataFrame({'count': counts, 'percentage': percentages})

print(cp_df)

            count  percentage
neutral      8638     21.5950
worry        8459     21.1475
happiness    5209     13.0225
sadness      5165     12.9125
love         3842      9.6050
surprise     2187      5.4675
fun          1776      4.4400
relief       1526      3.8150
hate         1323      3.3075
empty         827      2.0675
enthusiasm    759      1.8975
boredom       179      0.4475
anger         110      0.2750


### Max and min lengths of the tweets

In [4]:
lengths = df['content'].str.len()

# display the maximum and minimum lengths
print('Maximum length:', lengths.max())
print('Minimum length:', lengths.min())

Maximum length: 167
Minimum length: 1
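
A minimum length of 1 suggests that some Tweets carry almost no text, so it may be worth looking at the rows at both extremes. The sketch below shows one way to do that with a boolean mask. It runs on a small made-up frame (`sample_df` and its rows are illustrative assumptions, not rows confirmed to be in the dataset); in the notebook the same mask would be applied to `df` and `lengths`.

```python
import pandas as pd

# Made-up stand-in for the notebook's df; the rows are illustrative only
sample_df = pd.DataFrame({
    'content': ['0', 'i am so bored.', 'HELLOO',
                'Funeral ceremony...gloomy friday...'],
})

sample_lengths = sample_df['content'].str.len()

# select the rows whose length equals the minimum / maximum
shortest = sample_df.loc[sample_lengths == sample_lengths.min(), 'content']
longest = sample_df.loc[sample_lengths == sample_lengths.max(), 'content']

print('Shortest:', shortest.tolist())
print('Longest:', longest.tolist())
```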


***
## 3 Check each sentiment to see whether its Tweets are plausibly labeled

### A function to explore the sentiments

In [5]:
# pass the emotion and the number of random tweets to display
def explore(emotion=None, num_tweets=5):
    # all_sentiments is defined in the "All the sentiments" cell; run that cell before calling explore()
    if emotion is None:
        return 'no emotion provided'
    elif emotion not in all_sentiments:
        return 'not one of the sentiments in the dataframe'
    else:
        content = df.loc[df['sentiment'] == emotion, 'content']

        # never ask for more tweets than the category holds (random.sample would raise ValueError)
        num_tweets = min(num_tweets, len(content))
        random_nums = random.sample(range(len(content)), num_tweets)

        print(f"Number of tweets for '{emotion}': {len(content)}")
        print(f"Percentage of the dataset: {cp_df.loc[emotion].percentage}")
        print(f"Random indices:' {random_nums}", end='\n\n')
        print(f"Random {num_tweets} Tweets for emotion '{emotion}':")

        # number the sampled tweets starting from 1
        for counter, tweet in enumerate(content.iloc[random_nums], start=1):
            print(f"{counter}: {tweet}", end='\n')
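
Because `explore` draws a fresh random sample on every call, the Tweets it prints change from run to run. If reproducible samples are wanted, one option is to draw the indices from a seeded generator. This is a minimal sketch, not part of the notebook: `sample_indices` and its `seed` parameter are hypothetical names introduced here for illustration.

```python
import random

def sample_indices(population_size, num_items, seed):
    """Return num_items distinct indices, reproducible for a given seed."""
    rng = random.Random(seed)  # private generator, leaves the global state alone
    # cap the request so a small category cannot raise ValueError
    return rng.sample(range(population_size), min(num_items, population_size))

first = sample_indices(110, 5, seed=42)
second = sample_indices(110, 5, seed=42)

print(first == second)  # the same seed gives the same indices
```

The returned list could be passed to `content.iloc[...]` in place of `random_nums`.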

### All the sentiments

In [6]:
all_sentiments = sorted(df['sentiment'].unique())
print(all_sentiments)

['anger', 'boredom', 'empty', 'enthusiasm', 'fun', 'happiness', 'hate', 'love', 'neutral', 'relief', 'sadness', 'surprise', 'worry']


In [7]:
# this cell prints all of the sentiments all at once,
# but keeping it commented out for now

# unique_sentiments = df['sentiment'].unique()

# for emotion in unique_sentiments:
#     explore(emotion)
#     print('**********************************************************\n**********************************************************\n')

### Let's explore 'anger'

In [8]:
explore('anger')

Number of tweets for 'anger': 110
Percentage of the dataset: 0.27499999999999997
Random indices:' [9, 98, 39, 37, 65]

Random 5 Tweets for emotion 'anger':
1: Did a historical Jesus ever exist? Im finding it hard to prove, its all hearsay accounts ... it bugs me ...
2: @Buddy021193 i hear you.. it pisses me off haha
3: @msaysrawr *points at the gear question I just posted* I cant get the rest of my Dreadweave set
4: Very bad things.......I need to stop thinking!
5: @JennyTaylor94  yer it is...poor little cock  but she well doesnt deserve the stick off everyone! cowell once again going against producer


#### Initial thoughts on 'anger':

With only 110 Tweets (0.2750% of the dataset), 'anger' is the smallest category. The largest category, 'neutral', has 8638 Tweets, so 'anger' is severely underrepresented and will cause a class imbalance issue.<br>
<br>
Some of these Tweets seem to be mislabeled (e.g. "omg! goooood ass nappy nap  jusss woke up bout 2 clean up a lil then get ready", "OIL IS CHANGED!  And I am filthy.    But it's an accomplished filthy.", "Just sittin here waitin for my coffee to be full grown on farm town before going to bed"). But most seem to be labeled correctly (e.g. "What did I do to you!  sheesh", "The &quot;Catch Me If You Can&quot; DVD that I rented from Blockbuster.com yesterday was cracked. Figured it out about 35 minutes into the movie.", "@drakesizzle  If you don't want to come then don't come. JEEEEEZ.").
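
To put a number on the imbalance, the counts from the table in Section 2 can be turned into an imbalance ratio and a set of inverse-frequency class weights. This is only a sketch of one common heuristic (weight = total / (number of classes × class count)); the notebook does not commit to any particular remedy, and the names `counts_by_sentiment` and `class_weights` are introduced here for illustration.

```python
# Counts copied from the value_counts() table in Section 2
counts_by_sentiment = {
    'neutral': 8638, 'worry': 8459, 'happiness': 5209, 'sadness': 5165,
    'love': 3842, 'surprise': 2187, 'fun': 1776, 'relief': 1526,
    'hate': 1323, 'empty': 827, 'enthusiasm': 759, 'boredom': 179,
    'anger': 110,
}

total = sum(counts_by_sentiment.values())
num_classes = len(counts_by_sentiment)

# ratio between the largest and the smallest category
imbalance_ratio = max(counts_by_sentiment.values()) / min(counts_by_sentiment.values())

# inverse-frequency weights: rare categories get larger weights
class_weights = {
    sentiment: total / (num_classes * count)
    for sentiment, count in counts_by_sentiment.items()
}

print(f"Largest/smallest ratio: {imbalance_ratio:.1f}")
for sentiment, weight in sorted(class_weights.items(), key=lambda kv: kv[1]):
    print(f"{sentiment:<11}{weight:.3f}")
```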

### Let's explore 'boredom'

In [9]:
explore('boredom')

Number of tweets for 'boredom': 179
Percentage of the dataset: 0.4475
Random indices:' [20, 90, 119, 124, 130]

Random 5 Tweets for emotion 'boredom':
1: fo shizzle. . . i'm bored and wanna go do something.  wish i went to pisay today. oh wellz. wonder who were there.
2: Bad week for connectivity...Arlington Panera wifi sucks. Maybe head to Legal Seafoods at airport. Dang...missing Metaverse U stream.
3: @jambomb oh that looks boring  and even more boring you have an exam on a saturday
4: longest flight EVER. not particularly unpleasant or uncomfortable, just really really long
5: Cleaning the House! Im so boring..


#### Initial thoughts on 'boredom':

This is the second-smallest sentiment category, with only 179 Tweets representing 0.4475% of the dataset. Like 'anger', it will contribute to the class imbalance.<br>
<br>
The interpretation of 'bored' seems to vary from the traditional sense (e.g. "I am sooooooo bored in textiles !", "is bored. my BFF doesn't want to hang out") to "this person must be bored because they have nothing better to do than write this Tweet" (e.g. "aw now where's that little asian girl who runs round pooping her pants in public? i miss laughing at her.", "my neighbours are far too loud in thier back garden, all I can hear is this loud woman that won't stop laughing").

### Let's explore 'empty'

In [10]:
explore('empty')

Number of tweets for 'empty': 827
Percentage of the dataset: 2.0675
Random indices:' [294, 737, 215, 389, 822]

Random 5 Tweets for emotion 'empty':
1: nothing much on tv, seen most of the good stuff...think i'll go to bed soon, but it's too hot to sleep
2: Watching jackass the movie  http://twitpic.com/4wlgi
3: It's a beautiful nice day and I'm stuck inside!
4: i am so bored.
5: Here we go again, back to work. Happy Mothers Day to all  Peace


#### Initial thoughts on 'empty':

The 'empty' sentiment does not seem to follow any consistent pattern. At times, two or more sentiments are expressed (e.g. "back from grimsby  it sucks bein back but was amazin wknd anyway!!"). Other times the Tweets are just statements of fact or questions (e.g. "On the way to santa monica", "@spook68 morning.any plans for today?"). Still other times they are just single words ("HELLOO"). Some Tweets seem like they should be labeled with one of the other emotions (e.g. "yay, joss is coming over on saturday" should probably be labeled 'happiness' or 'enthusiasm'); perhaps because two different labels could apply in that last example, the labeler chose to leave it empty.<br>
<br>
There are 827 'empty' Tweets, representing 2.0675% of the dataset. We should consider removing them, since they appear to be the "unknown" type of Tweets.
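
If the 'empty' rows are dropped, a boolean mask is enough. The sketch below runs on a small stand-in frame (`sample_df` is an assumption for illustration; in the notebook the same filter would be applied to `df`), and it keeps the result in a new variable so the original data stays available for comparison.

```python
import pandas as pd

# Small stand-in for the notebook's df, using Tweets quoted above
sample_df = pd.DataFrame({
    'sentiment': ['empty', 'sadness', 'empty', 'enthusiasm'],
    'content': [
        'HELLOO',
        'Funeral ceremony...gloomy friday...',
        'On the way to santa monica',
        'wants to hang out with friends SOON!',
    ],
})

# keep every row whose label is not 'empty', then renumber the index
without_empty = sample_df[sample_df['sentiment'] != 'empty'].reset_index(drop=True)

print(f"{len(sample_df)} rows before, {len(without_empty)} rows after")
```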

### Let's explore 'enthusiasm'

In [11]:
explore('enthusiasm')

Number of tweets for 'enthusiasm': 759
Percentage of the dataset: 1.8975
Random indices:' [481, 17, 368, 661, 65]

Random 5 Tweets for emotion 'enthusiasm':
1: @mariannemarlow It is a drink but they have a trainer brand too.. http://www.office.co.uk/brand/babycham/8 &lt;&lt;have a look
2: Dying to get my hands on the Diagnosis Murder DVD boxset but those pesky kids at Amazon still won't deliver to Zimbabwe
3: feel like going home and sleep till the next day!
4: @anna_007 Don't worry, they'll get bored of it! Just hang in there and don't give in!
5: i feel like taking a day off but cannot afford it  looking forward to the dfb cup final tmrw night though. go werder!!!


#### Initial thoughts on 'enthusiasm':

This set of Tweets also suffers from some mislabeling issues (e.g. "I'm bored, extremely bored. in the car. waiting for my dad. and dinner. chinese. yummm.", "wishes I could be the one going to our conference in the Bahamas next week"). Some Tweets seem to be mislabeled as 'enthusiasm' due to certain keywords (e.g. "I made my parents add u guys on the family myspace...they were impressed by the song" -- 'impressed') or perhaps the number of exclamation points (e.g. "im so new!! and i need ur help").

### Let's explore 'fun'

In [None]:
explore('fun')

#### Initial thoughts on 'fun':

Have not gotten to this.

### Let's explore 'happiness'

In [None]:
explore('happiness')

#### Initial thoughts on 'happiness':

Have not gotten to this.

### Let's explore 'hate'

In [None]:
explore('hate')

#### Initial thoughts on 'hate':

There are 1323 Tweets labeled 'hate'. They seem to strongly resemble 'anger' and could possibly be relabeled as such. Doing so would increase the number of 'anger' Tweets and help with the imbalance issue.
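
Should the relabeling go ahead, it could be done with a simple value replacement on the label column. The sketch below uses a toy Series (`sample_sentiment` is an illustrative assumption standing in for `df['sentiment']`) and writes the result to a new variable rather than overwriting the original labels.

```python
import pandas as pd

# Toy stand-in for df['sentiment']
sample_sentiment = pd.Series(['hate', 'anger', 'neutral', 'hate', 'worry'])

# map 'hate' onto 'anger'; every other label is left unchanged
merged_sentiment = sample_sentiment.replace({'hate': 'anger'})

print(merged_sentiment.value_counts())
```

With the counts from the table in Section 2 (110 'anger' and 1323 'hate'), the merged category would hold 1433 Tweets, roughly 3.58% of the 40000 rows.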

### Let's explore 'love'

In [None]:
explore('love')

#### Initial thoughts on 'love':

Have not gotten to this.

### Let's explore 'neutral'

In [None]:
explore('neutral')

#### Initial thoughts on 'neutral':

Have not gotten to this.

### Let's explore 'relief'

In [None]:
explore('relief')

#### Initial thoughts on 'relief':

Have not gotten to this.

### Let's explore 'sadness'

In [None]:
explore('sadness')

#### Initial thoughts on 'sadness':

Have not gotten to this.

### Let's explore 'surprise'

In [None]:
explore('surprise')

#### Initial thoughts on 'surprise':

Have not gotten to this.

### Let's explore 'worry'

In [None]:
explore('worry')

#### Initial thoughts on 'worry':

Have not gotten to this.

### General thoughts about the contents of these tweets:

There are @mentions, URLs, and character entities ("\&nbsp;", "\&quot;", "\&amp;") that we may want to remove as part of data cleaning, since they are superfluous in terms of indicating emotion. There are also hashtags that we may or may not remove: some of them add nothing to the sentiment, but others do, so we may consider keeping them.
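
A minimal sketch of how the URLs and character entities could be handled with the standard library follows. It decodes the entities rather than deleting them, which is a design choice made here and not something the notebook prescribes. The helper name `clean_tweet` and the URL pattern are assumptions for illustration, and the pattern only catches links that start with `http://` or `https://`.

```python
import html
import re

# assumption: a URL is anything that starts with http:// or https://
URL_PATTERN = re.compile(r'https?://\S+')

def clean_tweet(text):
    """Decode character entities, drop URLs, and trim surrounding whitespace."""
    text = html.unescape(text)        # e.g. &quot; becomes "
    text = URL_PATTERN.sub('', text)  # remove links
    return text.strip()

print(clean_tweet('Watching jackass the movie  http://twitpic.com/4wlgi'))
print(clean_tweet('The &quot;Catch Me If You Can&quot; DVD'))
```

In the notebook, the function could be applied with `df['content'].apply(clean_tweet)`.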

***
## 4 Clean the data

### Clean the data by removing any @mentions from the content

In [12]:
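# remove @mentions from the 'content' column
# (regex=True is required: pandas 2.0+ treats the pattern as a literal string by default)
df['content'] = df['content'].str.replace(r'@\w+', '', regex=True)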

# display the resulting DataFrame
print(df.head())

     tweet_id   sentiment                                            content
0  1956967341       empty   i know  i was listenin to bad habit earlier a...
1  1956967666     sadness  Layin n bed with a headache  ughhhh...waitin o...
2  1956967696     sadness                Funeral ceremony...gloomy friday...
3  1956967789  enthusiasm               wants to hang out with friends SOON!
4  1956968416     neutral   We want to trade with someone who has Houston...




### Clean the data by removing any whitespace from the front and end of each tweet

In [13]:
df['content'] = df['content'].str.strip()

# display the resulting DataFrame
print(df.head())

     tweet_id   sentiment                                            content
0  1956967341       empty  i know  i was listenin to bad habit earlier an...
1  1956967666     sadness  Layin n bed with a headache  ughhhh...waitin o...
2  1956967696     sadness                Funeral ceremony...gloomy friday...
3  1956967789  enthusiasm               wants to hang out with friends SOON!
4  1956968416     neutral  We want to trade with someone who has Houston ...
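
The output above still shows double spaces inside some Tweets (e.g. "i know  i was"), since `str.strip()` only touches the two ends of a string. One possible follow-up, sketched here on a toy Series (`sample_content` is an illustrative assumption standing in for `df['content']`), is to collapse every run of whitespace into a single space.

```python
import pandas as pd

# Toy stand-in for df['content']
sample_content = pd.Series([
    ' i know  i was listenin to bad habit',
    'Layin n bed with a headache  ughhhh',
])

# collapse runs of whitespace into one space, then trim both ends
collapsed = sample_content.str.replace(r'\s+', ' ', regex=True).str.strip()

print(collapsed.tolist())
```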
