***
# EDA for tweet_emotions.csv dataset
https://www.kaggle.com/datasets/pashupatigupta/emotion-detection-from-text
***

***
## 1 Load the dataset
 <span style="color:red">!!!Make sure only the source you want to read from (Google Drive or local) is left uncommented!!!</span>

In [1]:
import pandas as pd
import numpy as np
import random

# read from the google drive
# url = 'https://drive.google.com/file/d/1xv5hff4DntWz6gn4KMXkBQ6t5_Vs16MQ/view?usp=sharing'
# url = 'https://drive.google.com/uc?id=' + url.split('/')[-2]

# read from local
url = '../datasets/tweet_emotions.csv'

df = pd.read_csv(url)

print(df.head())

     tweet_id   sentiment                                            content
0  1956967341       empty  @tiffanylue i know  i was listenin to bad habi...
1  1956967666     sadness  Layin n bed with a headache  ughhhh...waitin o...
2  1956967696     sadness                Funeral ceremony...gloomy friday...
3  1956967789  enthusiasm               wants to hang out with friends SOON!
4  1956968416     neutral  @dannycastillo We want to trade with someone w...


***
## 2 Basic EDA

### Datatypes for each of the columns in our dataframe

In [2]:
print(df.dtypes)

tweet_id      int64
sentiment    object
content      object
dtype: object


### Unique types and count of sentiments in the dataset

In [25]:
# get the value counts of each unique value
counts = df['sentiment'].value_counts()

# convert the counts to percentages
percentages = counts / counts.sum() * 100

cp_df = pd.DataFrame({'count': counts, 'percentage': percentages})

print(cp_df)

            count  percentage
neutral      8638     21.5950
worry        8459     21.1475
happiness    5209     13.0225
sadness      5165     12.9125
love         3842      9.6050
surprise     2187      5.4675
fun          1776      4.4400
relief       1526      3.8150
hate         1323      3.3075
empty         827      2.0675
enthusiasm    759      1.8975
boredom       179      0.4475
anger         110      0.2750


### Max and min lengths of the tweets

In [4]:
lengths = df['content'].str.len()

# display the maximum and minimum lengths
print('Maximum length:', lengths.max())
print('Minimum length:', lengths.min())

Maximum length: 167
Minimum length: 1
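
A minimum length of 1 suggests that some tweets carry almost no text. A minimal sketch for listing the shortest entries is shown below. It uses plain Python on a small made-up list so that it runs on its own; the helper name `shortest_texts` and the threshold of 3 characters are assumptions for illustration, not part of the original notebook.

```python
def shortest_texts(texts, max_len=3):
    """Return (length, text) pairs for texts of at most max_len characters, shortest first."""
    short = [(len(t), t) for t in texts if len(t) <= max_len]
    return sorted(short)


# made-up sample data standing in for df['content']
sample = ["0", "wants to hang out with friends SOON!", "me", "Mouth hurts"]
print(shortest_texts(sample))
```

On the real dataframe, a pandas equivalent would be `df[df['content'].str.len() <= 3]`.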


***
## 3 Sample the tweets for each sentiment to check whether the labels are plausible

### A function to explore the sentiments

In [88]:
def explore(emotion=None):
    # all_sentiments is defined in the "All the sentiments" cell; run that cell first
    if emotion is None:
        return 'no emotion provided'
    elif emotion not in all_sentiments:
        return 'not one of the sentiments in the dataframe'
    else:
        content = df.loc[df['sentiment'] == emotion, 'content']

        # sample at most 5 positions so a very small category cannot raise ValueError
        random_nums = random.sample(range(len(content)), min(5, len(content)))
        print(f"Number of tweets for '{emotion}': {len(content)}")
        print(f"Percentage of the dataset: {cp_df.loc[emotion].percentage}")
        print(f"Random indices: {random_nums}", end='\n\n')
        print(f"Random {len(random_nums)} Tweets for emotion '{emotion}':")

        # number the sampled tweets starting at 1
        for counter, tweet in enumerate(content.iloc[random_nums], start=1):
            print(f"{counter}: {tweet}")
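
Because `explore` draws a new random sample on every call, the tweets printed for a sentiment change between runs. If repeatable samples are wanted, one option is a seeded generator. The sketch below is standalone and only picks positions; the helper name `sample_positions` and the seed value are assumptions for illustration.

```python
import random


def sample_positions(n_items, k=5, seed=42):
    """Pick up to k distinct positions from range(n_items), reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    return rng.sample(range(n_items), min(k, n_items))


# the same seed yields the same positions on every call
first = sample_positions(110)
second = sample_positions(110)
print(first == second)
```

The returned positions could be passed to `content.iloc[...]` in the same way as `random_nums`.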

### All the sentiments

In [54]:
all_sentiments = sorted(df['sentiment'].unique())
print(all_sentiments)

['anger', 'boredom', 'empty', 'enthusiasm', 'fun', 'happiness', 'hate', 'love', 'neutral', 'relief', 'sadness', 'surprise', 'worry']


In [73]:
# this cell prints all of the sentiments all at once,
# but keeping it commented out for now

# unique_sentiments = df['sentiment'].unique()

# for emotion in unique_sentiments:
#     explore(emotion)
#     print('**********************************************************\n**********************************************************\n')

Number of tweets for 'empty': 827
Percentage of the dataset: 2.0675

Random indices:' [609, 74, 260, 211, 822]
Random 5 elements of 'content' column for emotion 'empty'
1: cheese and onion! or as my father says 'cheese and minging'
2: no wine here - I don;t drink  - but I have have plenty of forbidden cholocate
3: At the pub with the dog but seems to have misplaced friend with drinks
4: loads of insects are attacking me.. time to go inside
5: Here we go again, back to work. Happy Mothers Day to all  Peace
**********************************************************
**********************************************************

Number of tweets for 'sadness': 5165
Percentage of the dataset: 12.9125

Random indices:' [4153, 3718, 2283, 5094, 4605]
Random 5 elements of 'content' column for emotion 'sadness'
1: AWWWW  we're gonna miss you!
2: the justice left when DJ Talent was voted off (N)
3: Is at the botanical gardens and its beautiful but she forgot an extra memory card so no pictures toda

### Let's explore 'anger'

In [84]:
explore('anger')

Number of tweets for 'anger': 110
Percentage of the dataset: 0.27499999999999997
Random indices:' [63, 39, 77, 35, 6]

Random 5 elements of 'content' column for emotion 'anger':
1: And you could have it all, my empire of dirt! I'm in a&amp;e with dad I'm freezing fully shivering Every1 else is warm  no fones allowed ffs!
2: *points at the gear question I just posted* I cant get the rest of my Dreadweave set
3: my sisters fucking pc, just blued screened me
4: MOtherfuck QW
5: yes, boo for soar throats and earaches!


#### Initial thoughts on 'anger':

With only 110 Tweets (0.2750% of the dataset), 'anger' is the smallest category. This will lead to a class-imbalance issue.<br>
<br>
Some of these Tweets seem to be mislabeled (e.g. "omg! goooood ass nappy nap  jusss woke up bout 2 clean up a lil then get ready", "OIL IS CHANGED!  And I am filthy.    But it's an accomplished filthy.", "Just sittin here waitin for my coffee to be full grown on farm town before going to bed"). But most seem to be labeled correctly (e.g. "What did I do to you!  sheesh", "The &quot;Catch Me If You Can&quot; DVD that I rented from Blockbuster.com yesterday was cracked. Figured it out about 35 minutes into the movie.", "@drakesizzle  If you don't want to come then don't come. JEEEEEZ.").
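
To put the imbalance in numbers, the sketch below uses the counts from the table in section 2 to compute the ratio between the largest and smallest categories, plus a set of inverse-frequency class weights. The weighting formula `n_samples / (n_classes * count)` is a common heuristic and is an assumption here, not something this notebook has settled on.

```python
# counts copied from the value_counts() table in section 2
counts = {
    'neutral': 8638, 'worry': 8459, 'happiness': 5209, 'sadness': 5165,
    'love': 3842, 'surprise': 2187, 'fun': 1776, 'relief': 1526,
    'hate': 1323, 'empty': 827, 'enthusiasm': 759, 'boredom': 179, 'anger': 110,
}

n_samples = sum(counts.values())   # 40000 tweets in total
n_classes = len(counts)            # 13 sentiments

# how many times larger the biggest category is than the smallest
imbalance_ratio = max(counts.values()) / min(counts.values())

# inverse-frequency weights: rare classes get larger weights
weights = {label: n_samples / (n_classes * c) for label, c in counts.items()}

print(f"imbalance ratio: {imbalance_ratio:.1f}")
print(f"weight for 'anger': {weights['anger']:.2f}")
print(f"weight for 'neutral': {weights['neutral']:.2f}")
```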

### Let's explore 'boredom'

In [85]:
explore('boredom')

Number of tweets for 'boredom': 179
Percentage of the dataset: 0.4475
Random indices:' [119, 152, 174, 80, 125]

Random 5 elements of 'content' column for emotion 'boredom':
1: oh that looks boring  and even more boring you have an exam on a saturday
2: I'm SUPER tired and probably could sleep ALL day BUT I work 12:30 to 9:30 today in Tool Rental... Oh the Joy!!
3: lol takes some getting use to  Just replying to your email. Power keeps cutting out here :S
4: sitting at the chevy dealership in utah waiting for the van to be fixed
5: waiting to go to the movies later for my 6th month. booored.


#### Initial thoughts on 'boredom':

This is the second-smallest sentiment category, with only 179 Tweets representing 0.4475% of the dataset. As with 'anger', this will be an imbalance issue.<br>
<br>
The interpretation of 'bored' seems to vary from the traditional sense (e.g. "I am sooooooo bored in textiles !", "is bored. my BFF doesn't want to hang out") to "this person must be bored because they have nothing better to do than write this Tweet" (e.g. "aw now where's that little asian girl who runs round pooping her pants in public? i miss laughing at her.", "my neighbours are far too loud in thier back garden, all I can hear is this loud woman that won't stop laughing").

### Let's explore 'empty'

In [86]:
explore('empty')

Number of tweets for 'empty': 827
Percentage of the dataset: 2.0675
Random indices:' [433, 637, 440, 184, 157]

Random 5 elements of 'content' column for emotion 'empty':
1: don't i know it! i live in the middle of nowhere, my house is spider central
2: 
3: Boo Im lonely and bored
4: Urgh.... feeling like crap today. Bad headache, tired, blood sugars too high.
5: Wanting to leave work early today but stuff keeps accumulating.  this Friday is so a Monday in disguise. lol


#### Initial thoughts on 'empty'

The 'empty' sentiment looks like a catch-all. Some Tweets express two or more sentiments (e.g. "back from grimsby  it sucks bein back but was amazin wknd anyway!!"). Others are plain statements of fact or questions (e.g. "On the way to santa monica", "@spook68 morning.any plans for today?"), and some are just single words ("HELLOO"). There are also Tweets that seem to belong to one of the other emotions (e.g. "yay, joss is coming over on saturday" should probably be labeled 'happiness' or 'enthusiasm'); perhaps the labeler left it empty because more than one label could apply.<br>
<br>
There are 827 'empty' Tweets, representing 2.0675% of the dataset. We should consider removing them from the dataset, since they seem to act as an "UNK" (unknown) category.

### Let's explore 'enthusiasm'

In [94]:
explore('enthusiasm')

Number of tweets for 'enthusiasm': 759
Percentage of the dataset: 1.8975
Random indices:' [686, 70, 503, 492, 477]

Random 5 Tweets for emotion 'enthusiasm':
1: lol.. ass monkey
2: nah, he won't  but I will sit here and enjoy the view :-P
3: making more muffinsss, wheat jerm AANNNDD psyillium husk
4: Adventures with jamie and bethhh
5: i don't have any excuse other than night shifts! we got our orphan lambs from a local farmer so we cheated


#### Initial thoughts on 'enthusiasm':

asdf

### Let's explore 'fun'

In [43]:
explore('fun')

Number of tweets for 'fun': 1776
Percentage of the dataset: 4.44

Random indices:' [111, 584, 1127, 663, 1355]
Random 5 elements of 'content' column for emotion 'fun'
0: i got coupons to Popeye's chicken but I'll probably end up getting a burrito at freshii - this salad joint. healthy
1: haha who thought that?
2: just got back from six flags  wicked fun. even tho i almost died!
3: haha neither am I. It doesn't matter though you guys do what you want
4: We should have a twitter reunion it would be awesome to meet you all lol, iwonder howd iget that to pull off


#### Initial thoughts on 'fun':

asdf

### Let's explore 'happiness'

In [44]:
explore('happiness')

Number of tweets for 'happiness': 5209
Percentage of the dataset: 13.0225

Random indices:' [1169, 831, 4980, 776, 1475]
Random 5 elements of 'content' column for emotion 'happiness'
0: me too
1: Still working in the database and trying to decide what I want to eat.
2: I'm really happy for u n leigh  thnx for sharing this happiness with us, this means the world to us all!  love ya! Marta
3: I am so glad it's Friday. I just got off work and I'm so tired.
4: off to the land of pillows and blankets... mm, and the fan up on high... and did I mention the blankets? my favorite time ever.


#### Initial thoughts on 'happiness':

asdf

### Let's explore 'hate'

In [45]:
explore('hate')

Number of tweets for 'hate': 1323
Percentage of the dataset: 3.3075

Random indices:' [979, 412, 23, 1142, 320]
Random 5 elements of 'content' column for emotion 'hate'
0: Ouch. I used to hate it when I did that  (And then there are the irate callers who were trying to record, getting p***ed, etc)
1: I'm drawning in emails
2: Hahahaha! It's not horrible, if others were singing with I'm sure it could work. I wish I could afford my own drum set
3: Yeah, yeah. Less #degenerate than current occupants of U.S. House of Reprehensibles. You can bet on that. Like MineThatBird.
4: im tweeting... this is so hard... i dont get it...


#### Initial thoughts on 'hate':

There are 1323 Tweets labeled 'hate' (3.3075% of the dataset). They seem to strongly resemble 'anger' and could possibly be relabeled as such, which would increase the number of 'anger' Tweets and help with the imbalance issue.
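
As a rough check of what this relabeling would do, the sketch below merges the 'anger' and 'hate' counts from the table in section 2. It is arithmetic only; on the dataframe itself the relabeling could be done with something like `df['sentiment'].replace({'hate': 'anger'})`, which is an assumption about how the step might be implemented rather than code from this notebook.

```python
# counts taken from the value_counts() table in section 2
total_tweets = 40000
anger_count = 110
hate_count = 1323

# relabeling every 'hate' tweet as 'anger' simply adds the two counts
merged_count = anger_count + hate_count
merged_percentage = merged_count / total_tweets * 100

print(f"'anger' after merging: {merged_count} tweets ({merged_percentage:.4f}%)")
```

Even after merging, the combined category would remain far smaller than 'neutral' (8638) or 'worry' (8459).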

### Let's explore 'love'

In [46]:
explore('love')

Number of tweets for 'love': 3842
Percentage of the dataset: 9.605

Random indices:' [2472, 987, 1688, 2637, 3643]
Random 5 elements of 'content' column for emotion 'love'
0: Chicken beer and good company makes a good night...
1: MY AUNTIE FROM QUEENSLAND IS DOWN TO STAY THE NIGHT! YAYA.
2: good luck! It's not too bad, and if it is, it's curved grading so u might end up surprised
3: no... But I met a new graphic design friend, so that was dooope!!!
4: Happy Mother's Day to all the Mom's out there, but especially to my wonderfuly Mommy


#### Initial thoughts on 'love':

asdf

### Let's explore 'neutral'

In [47]:
explore('neutral')

Number of tweets for 'neutral': 8638
Percentage of the dataset: 21.595

Random indices:' [5405, 5936, 8141, 8413, 7624]
Random 5 elements of 'content' column for emotion 'neutral'
0: I think I'll target it's original release date, which is July 29th.  Enough time to raise faux or ironic interest.
1: Will be your DJ for a little while! Tune in if you want  www.soompiradio.com
2: tweeten maar
3: Here's a link to Gregg's performace, its amazing  http://bit.ly/GsWrk
4: Have a good flight!


#### Initial thoughts on 'neutral':

asdf

### Let's explore 'relief'

In [48]:
explore('relief')

Number of tweets for 'relief': 1526
Percentage of the dataset: 3.8150000000000004

Random indices:' [1319, 421, 950, 237, 174]
Random 5 elements of 'content' column for emotion 'relief'
0: success!
1: thanks.  I'm self so I don't see the &quot;my Account&quot; area.  I'll have to dig deeper it seems
2: Relaxing after a busy week and a tedious Saturday . . .
3: bored out of my mind. I guess im paying the price for having so much fun yesterday.
4: Fighting a migraine   Medication is almost working.


#### Initial thoughts on 'relief':

asdf

### Let's explore 'sadness'

In [49]:
explore('sadness')

Number of tweets for 'sadness': 5165
Percentage of the dataset: 12.9125

Random indices:' [1250, 2304, 787, 2867, 3967]
Random 5 elements of 'content' column for emotion 'sadness'
0: I need to do some more post but I don't have time on this tour  ........... Apologies to all my supporters.
1: me and Arlando are totally done. I havn'e talked to him in a LONG time!Now I just have to find a 'worth while' man...
2: last thursday or yesterday that sucks i missed it  was it at lunch time ox
3: p.s. sorry about ur uncle
4: i give uppp a hour of tryin to tlk to   i love him but my minutesss lol


#### Initial thoughts on 'sadness':

asdf

### Let's explore 'surprise'

In [50]:
explore('surprise')

Number of tweets for 'surprise': 2187
Percentage of the dataset: 5.4675

Random indices:' [146, 419, 1848, 1430, 1455]
Random 5 elements of 'content' column for emotion 'surprise'
0: YoYo door nazis refused me entry on account of no ID  gutted! Heard it was a good night tho.. Next time I'll come prepared!
1: why did I ever delete my twitter?
2: justin timberlake + snl = awesome ... dude should just become a regular
3: Kyle is Cody's wee bro!
4: http://www.myspace.com/dica_grl Just got a crush on this song! Disco's Out! Murder's In! ruleaz?, zic!  www.myspace.com/discosoutmurdersin


#### Initial thoughts on 'surprise':

asdf

### Let's explore 'worry'

In [51]:
explore('worry')

Number of tweets for 'worry': 8459
Percentage of the dataset: 21.1475

Random indices:' [4193, 6135, 1746, 6764, 293]
Random 5 elements of 'content' column for emotion 'worry'
0: At tweetup loc, but don't recognize anyone  am outside in turquoise shirt. Please see me! #g4c09
1: i dont have a bank. i cash my shit at tom thumb. i've had four bank accounts--and they've all gone negative
2: Mouth hurts
3: oh dear, thats not good - I hope you get through it all with a smile
4: &quot;my name is Tony!!!!!! ...not hey!!!!&quot; -  poor tony


#### Initial thoughts on 'worry':

asdf

### General thoughts about the contents of these tweets:

The Tweets contain @mentions, URLs, and HTML character entities ("\&nbsp;", "\&quot;", "\&amp;") that we may want to remove or decode during data cleaning, since they carry little information about emotion.
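
One possible way to handle the URLs and character entities is sketched below, using only the standard library. The helper name `decode_and_strip_urls` and the URL pattern are assumptions for illustration; the pattern is deliberately simple and will not catch every kind of link.

```python
import html
import re

# matches links starting with http://, https://, or www.
URL_PATTERN = re.compile(r'(?:https?://|www\.)\S+')


def decode_and_strip_urls(text):
    """Decode HTML character entities, drop URLs, and tidy whitespace."""
    text = html.unescape(text)          # turns entities such as &quot; and &amp; into characters
    text = URL_PATTERN.sub('', text)    # remove the matched links
    return ' '.join(text.split())       # collapse runs of whitespace, including non-breaking spaces


print(decode_and_strip_urls("Here's a link to Gregg's performace, its amazing  http://bit.ly/GsWrk"))
print(decode_and_strip_urls("&quot;my name is Tony!!!!!! ...not hey!!!!&quot; -  poor tony"))
```

With pandas this could be applied as `df['content'] = df['content'].map(decode_and_strip_urls)`.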

***
## 4 Clean the data

### Clean the data by removing any @mentions from the content

In [20]:
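# remove @mentions from the 'content' column
# (regex=True is needed: pandas 2.0+ treats the pattern as a literal string by default)
df['content'] = df['content'].str.replace(r'@\w+', '', regex=True)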

# display the resulting DataFrame
print(df.head())

     tweet_id   sentiment                                            content
0  1956967341       empty   i know  i was listenin to bad habit earlier a...
1  1956967666     sadness  Layin n bed with a headache  ughhhh...waitin o...
2  1956967696     sadness                Funeral ceremony...gloomy friday...
3  1956967789  enthusiasm               wants to hang out with friends SOON!
4  1956968416     neutral   We want to trade with someone who has Houston...



### Clean the data by stripping leading and trailing whitespace from each tweet

In [21]:
df['content'] = df['content'].str.strip()

# display the resulting DataFrame
print(df.head())

     tweet_id   sentiment                                            content
0  1956967341       empty  i know  i was listenin to bad habit earlier an...
1  1956967666     sadness  Layin n bed with a headache  ughhhh...waitin o...
2  1956967696     sadness                Funeral ceremony...gloomy friday...
3  1956967789  enthusiasm               wants to hang out with friends SOON!
4  1956968416     neutral  We want to trade with someone who has Houston ...
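
Removing @mentions and stripping whitespace can leave some tweets with no text at all, for example a tweet that consisted only of a mention. A minimal standalone sketch of checking for this is shown below; the sample strings and the helper name `empty_after_cleaning` are made up for illustration.

```python
import re


def empty_after_cleaning(texts):
    """Return the positions of texts that are empty once @mentions and outer whitespace are removed."""
    cleaned = [re.sub(r'@\w+', '', t).strip() for t in texts]
    return [i for i, t in enumerate(cleaned) if t == '']


# made-up examples: the second one contains nothing but a mention
sample = ["@someone morning. any plans for today?", " @someone ", "Mouth hurts"]
print(empty_after_cleaning(sample))
```

On the cleaned dataframe, `(df['content'] == '').sum()` would count such rows.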


In [None]:
type(df.head())