***
# EDA for tweet_emotions.csv dataset
https://www.kaggle.com/datasets/pashupatigupta/emotion-detection-from-text
***

***
## 1 Load the dataset
 <span style="color:red">!!!Make sure only one source (Google Drive or local) is active below; comment out the other!!!</span>

In [1]:
import pandas as pd
import numpy as np
import random

# read from the google drive
# url = 'https://drive.google.com/file/d/1xv5hff4DntWz6gn4KMXkBQ6t5_Vs16MQ/view?usp=sharing'
# url = 'https://drive.google.com/uc?id=' + url.split('/')[-2]

# read from local
url = '../datasets/tweet_emotions.csv'

df = pd.read_csv(url)

print(df.head())

     tweet_id   sentiment                                            content
0  1956967341       empty  @tiffanylue i know  i was listenin to bad habi...
1  1956967666     sadness  Layin n bed with a headache  ughhhh...waitin o...
2  1956967696     sadness                Funeral ceremony...gloomy friday...
3  1956967789  enthusiasm               wants to hang out with friends SOON!
4  1956968416     neutral  @dannycastillo We want to trade with someone w...


***
## 2 Basic EDA

### Datatypes for each of the columns in our dataframe

In [2]:
print(df.dtypes)

tweet_id      int64
sentiment    object
content      object
dtype: object


### Unique types and count of sentiments in the dataset

In [3]:
# get the value counts of each unique value
counts = df['sentiment'].value_counts()

# convert the counts to percentages
percentages = counts / counts.sum() * 100

cp_df = pd.DataFrame({'count': counts, 'percentage': percentages})

print(cp_df)

            count  percentage
neutral      8638     21.5950
worry        8459     21.1475
happiness    5209     13.0225
sadness      5165     12.9125
love         3842      9.6050
surprise     2187      5.4675
fun          1776      4.4400
relief       1526      3.8150
hate         1323      3.3075
empty         827      2.0675
enthusiasm    759      1.8975
boredom       179      0.4475
anger         110      0.2750
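
The same summary can also be built with `value_counts(normalize=True)`, which returns proportions directly. A minimal sketch, assuming a small made-up frame (`toy` is hypothetical and only stands in for `df`):

```python
import pandas as pd

# Hypothetical toy data standing in for df['sentiment']
toy = pd.DataFrame({'sentiment': ['worry', 'neutral', 'worry', 'anger']})

counts = toy['sentiment'].value_counts()
# normalize=True returns fractions that sum to 1; scale them to percentages
percentages = toy['sentiment'].value_counts(normalize=True) * 100

# Both Series are indexed by sentiment, so they align row by row
summary = pd.DataFrame({'count': counts, 'percentage': percentages.round(2)})
print(summary)
```

Rounding the percentages at this stage also avoids floating-point noise when the values are printed later.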


### Max and min lengths of the tweets

In [4]:
lengths = df['content'].str.len()

# display the maximum and minimum lengths
print('Maximum length:', lengths.max())
print('Minimum length:', lengths.min())

Maximum length: 167
Minimum length: 1
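
A minimum length of 1 suggests that some tweets carry almost no text, so it may be worth looking at them directly. A minimal sketch, assuming a made-up frame (`toy`) and an arbitrary threshold of 2 characters, neither of which comes from the dataset:

```python
import pandas as pd

# Hypothetical toy data standing in for df['content']
toy = pd.DataFrame({'content': ['0', 'Finally sleep time', 'HELLOO', 'ok']})

lengths = toy['content'].str.len()

# Summary statistics give more context than the max and min alone
print(lengths.describe())

# Show the tweets at or below the assumed length threshold
short = toy.loc[lengths <= 2, 'content']
print(short.tolist())
```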


***
## 3 Inspect each sentiment to see whether the labels look legitimate

### A function to explore the sentiments

In [5]:
def explore(emotion=None):
    if emotion is None:
        return 'no emotion provided'
    elif emotion not in all_sentiments:
        return 'not one of the sentiments in the dataframe'
    else:
        content = df.loc[df['sentiment'] == emotion, 'content']

        random_nums = random.sample(range(len(content)), 5)
        print(f"Number of tweets for '{emotion}': {len(content)}")
        print(f"Percentage of the dataset: {cp_df.loc[emotion].percentage}")
        print(f"Random indices:' {random_nums}", end='\n\n')
        print(f"Random {len(random_nums)} Tweets for emotion '{emotion}':")

        counter = 1
        for tweet in content.iloc[random_nums]:
            print(f"{counter}: {tweet}", end='\n')
            counter = counter + 1
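
`random.sample` raises a `ValueError` when a sentiment has fewer than 5 tweets, and it returns a different sample on every run. One possible alternative is `Series.sample` with a fixed `random_state`. This is only a sketch: the helper name `sample_tweets`, its parameters, and the `toy` frame are hypothetical.

```python
import pandas as pd

def sample_tweets(frame, emotion, n=5, seed=0):
    """Return up to n tweets for one sentiment, reproducibly."""
    content = frame.loc[frame['sentiment'] == emotion, 'content']
    # Cap n so that small categories do not raise a ValueError
    n = min(n, len(content))
    return content.sample(n=n, random_state=seed)

# Hypothetical toy data standing in for df
toy = pd.DataFrame({
    'sentiment': ['anger', 'anger', 'fun'],
    'content': [
        'People at work are stressing me out.',
        "Packing  I don't like it..",
        'bowling with cousins  awesome',
    ],
})

picked = sample_tweets(toy, 'anger')
for i, tweet in enumerate(picked, start=1):
    print(f"{i}: {tweet}")
```

With a fixed seed, rerunning the cell shows the same tweets, which makes the written notes easier to check against the output.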

### All the sentiments

In [6]:
all_sentiments = sorted(df['sentiment'].unique())
print(all_sentiments)

['anger', 'boredom', 'empty', 'enthusiasm', 'fun', 'happiness', 'hate', 'love', 'neutral', 'relief', 'sadness', 'surprise', 'worry']


In [7]:
# this cell prints all of the sentiments all at once,
# but keeping it commented out for now

# unique_sentiments = df['sentiment'].unique()

# for emotion in unique_sentiments:
#     explore(emotion)
#     print('**********************************************************\n**********************************************************\n')

### Let's explore 'anger'

In [8]:
explore('anger')

Number of tweets for 'anger': 110
Percentage of the dataset: 0.27499999999999997
Random indices:' [65, 103, 54, 69, 2]

Random 5 Tweets for emotion 'anger':
1: @JennyTaylor94  yer it is...poor little cock  but she well doesnt deserve the stick off everyone! cowell once again going against producer
2: @brainofdane DUDE.  You're a hax0r!!!1!  You should put Final Cut Pro on there and tell me how stable it is
3: People at work are stressing me out.
4: @natss91 kill me as soon as you get here ,ok? my sister is having a sleepover tonight  and her obnoxious friends are driving me insane
5: Packing  I don't like it..


#### Initial thoughts on 'anger':

The number of 'anger' tweets, 110, is the lowest count in the dataset and represents just 0.2750% of the dataset. This will lead to an imbalance issue.<br>
<br>
Some of these Tweets seem to be mislabeled (e.g. "omg! goooood ass nappy nap  jusss woke up bout 2 clean up a lil then get ready", "OIL IS CHANGED!  And I am filthy.    But it's an accomplished filthy.", "Just sittin here waitin for my coffee to be full grown on farm town before going to bed"). But most seem to be labeled correctly (e.g. "What did I do to you!  sheesh", "The &quot;Catch Me If You Can&quot; DVD that I rented from Blockbuster.com yesterday was cracked. Figured it out about 35 minutes into the movie.", "@drakesizzle  If you don't want to come then don't come. JEEEEEZ.").

### Let's explore 'boredom'

In [9]:
explore('boredom')

Number of tweets for 'boredom': 179
Percentage of the dataset: 0.4475
Random indices:' [114, 95, 147, 1, 122]

Random 5 Tweets for emotion 'boredom':
1: J Ross you can't leave the killers still singing and run the titles - you should have been edited out for more music - happy - not
2: ya know why today sucks? its been raining, we have no $, &amp; no possibility of a magic friday.  so whats goin down tonight?
3: Stuck on NJ Transit for the past twenty minutes. Great way to start the week
4: Waiting in line @ tryst
5: Just got home from the BEA &amp; it was kinda boring (2 me) this year  but hung out with some GREAT authors &amp; co-workers!


#### Initial thoughts on 'boredom':

This is the second lowest sentiment category with only 179 Tweets representing 0.4475% of the dataset. Like anger, this will be an imbalance issue.<br>
<br>
The interpretation of 'bored' seems to vary from the traditional sense (e.g. "I am sooooooo bored in textiles !", "is bored. my BFF doesn't want to hang out") to "this person must be bored because they have nothing better to do than write this Tweet" (e.g. "aw now where's that little asian girl who runs round pooping her pants in public? i miss laughing at her.", "my neighbours are far too loud in thier back garden, all I can hear is this loud woman that won't stop laughing").

### Let's explore 'empty'

In [10]:
explore('empty')

Number of tweets for 'empty': 827
Percentage of the dataset: 2.0675
Random indices:' [801, 705, 576, 259, 231]

Random 5 Tweets for emotion 'empty':
1: @rssanborn games? Just wanted to clarify
2: @AnointedPromise Yes, have to be right for church
3: Making Banana Bread
4: Woke up... cleaned... Aunt Emmas... Walmart.. Commissary... Now its time for a nap!!!.. then off to work
5: this is absurd ! I feel like a dipping in the pool real quick . its too bad i dont have a poool


#### Initial thoughts on 'empty':

The 'empty' sentiment seems very random. At times, two or more sentiments are expressed (e.g. "back from grimsby  it sucks bein back but was amazin wknd anyway!!"). Other times the Tweets are just statements of fact or questions (e.g. "On the way to santa monica", "@spook68 morning.any plans for today?"). Still other times they are just single words ("HELLOO"). Some Tweets seem like they should carry one of the other labels (e.g. "yay, joss is coming over on saturday" should probably be labeled 'happiness' or 'enthusiasm'). Perhaps the labeler left that last example empty because two different labels could apply.<br>
<br>
There are 827 of the 'empty' Tweets representing 2.0675% of the dataset. We should consider removing these from the dataset since these seem to be the "UNK" type of Tweets.
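
If the 'empty' rows are dropped, a boolean mask is enough. A minimal sketch on a made-up frame (`toy` is hypothetical; on the real data the same mask would be applied to `df`):

```python
import pandas as pd

# Hypothetical toy data standing in for df
toy = pd.DataFrame({
    'sentiment': ['empty', 'sadness', 'empty', 'neutral'],
    'content': [
        'HELLOO',
        'Funeral ceremony...gloomy friday...',
        'Making Banana Bread',
        'cof Cof Cof!',
    ],
})

# Keep every row whose label is not 'empty'; reset the index so it stays contiguous
filtered = toy[toy['sentiment'] != 'empty'].reset_index(drop=True)
print(filtered)
```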

### Let's explore 'enthusiasm'

In [11]:
explore('enthusiasm')

Number of tweets for 'enthusiasm': 759
Percentage of the dataset: 1.8975
Random indices:' [510, 502, 718, 475, 87]

Random 5 Tweets for emotion 'enthusiasm':
1: @youngscraphics - I produce/direct/film/edit... I write... I coordinate events... I manage Don Fetti... there ain't much I don't do!
2: Boredddddd Follower @meryreino Shes AMAZING!!  *Broken*
3: did some more work on Dig Dug. can get to level 16 without dying now  Mega Man tomorrow after work. Goal: 2 levels in 5 minutes
4: ....dont act like your not impressed
5: @xac Reminders are good! Speaking of which, we haven't had a Posture Check in awhile.....



### Let's explore 'fun'

In [12]:
explore('fun')

Number of tweets for 'fun': 1776
Percentage of the dataset: 4.44
Random indices:' [182, 1261, 429, 1175, 146]

Random 5 Tweets for emotion 'fun':
1: 4 more days until my birthday!!! I don't want to get older
2: @barrysma NEW motorcycle and you POPPED a cable already? wow-you ride HARD!
3: @MariaV_ST vaca!! buuu sigo en el work
4: bowling with cousins  awesome
5: Its 4.30am, sleep timeee. I wanted to watch Gossip Girl but i'm way too tired  Goodnight!



### Let's explore 'happiness'

In [13]:
explore('happiness')

Number of tweets for 'happiness': 5209
Percentage of the dataset: 13.0225
Random indices:' [4534, 3046, 1136, 825, 27]

Random 5 Tweets for emotion 'happiness':
1: @XKookie03 whyyyy hellloooo! Thx 4 checkin up on me  how r things? http://myloc.me/G4p
2: loven the rs ftw pvp is bac
3: @kyleandjackieo what about Your Body by Tom Novy or Voodoo Child by Rogue Traders. They are from 04/05. Good memories from these songs
4: @ToxicSociopath awww. well before we know it youll be back visiting XD we will hang out constantly and have another heartbreaking goodbye
5: @wedplanworkshop . Flights already booked, plus its GGD2 1st birthday. Can't miss that ! especially as we missed GGD1



### Let's explore 'hate'

In [14]:
explore('hate')

Number of tweets for 'hate': 1323
Percentage of the dataset: 3.3075
Random indices:' [299, 852, 657, 494, 129]

Random 5 Tweets for emotion 'hate':
1: TGIF I don't like 12 hour workdays  I need to stand up, run around 4 a while.... too much sitting!!! Plus, I have honest ade tea 2day! YAY
2: boredddddddd, work tomorrow..  and sunday. i hate clarks
3: @kmrasmussen nah. How it sucks to wear a suit and how the temp goes up 10 degrees when someone sits next to you
4: DIDO &quot;US 2 Little Gods&quot; http://ow.ly/9UIn &quot;Just this moment/ Let it all stop here/ I've had my fill&quot;...words that make you panic...
5: @GGSerena boo you didnt answer my text


#### Initial thoughts on 'hate':

There are 1323 Tweets labeled 'hate'. They strongly resemble 'anger' and could be relabeled as such, which would increase the number of 'anger' Tweets and help with the imbalance issue.
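
Relabeling can be done with `Series.replace` and a mapping. A minimal sketch on made-up labels (`toy` is hypothetical, and merging the two classes is only one possible choice):

```python
import pandas as pd

# Hypothetical toy labels standing in for df['sentiment']
toy = pd.DataFrame({'sentiment': ['hate', 'anger', 'love', 'hate']})

# Map 'hate' onto 'anger'; all other labels are left untouched
merged = toy['sentiment'].replace({'hate': 'anger'})
print(merged.value_counts())
```

Using the counts printed by `explore` (110 for 'anger' and 1323 for 'hate'), the merged class would hold 1433 Tweets, about 3.58% of the 40,000 rows.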

### Let's explore 'love'

In [15]:
explore('love')

Number of tweets for 'love': 3842
Percentage of the dataset: 9.605
Random indices:' [2804, 1056, 296, 89, 2314]

Random 5 Tweets for emotion 'love':
1: I love my life  Ni night twitter!&lt;3
2: Done with HW...gonna read a bit then pass out. Got a cool week to look forward too in between all the mayhem
3: eating some breakfast at Panera Bread. boring cloudy weather, lil drizzle
4: @melissaohh omg  when do they finish??
5: @oreoking awe thanks



### Let's explore 'neutral'

In [16]:
explore('neutral')

Number of tweets for 'neutral': 8638
Percentage of the dataset: 21.595
Random indices:' [2738, 4214, 6369, 2699, 1746]

Random 5 Tweets for emotion 'neutral':
1: gymnastics time.  My last night for teaching Friday evening classes.   New summer schedule starts next week.
2: @WKJThD  Thanks for Following
3: @dhgarske ha. nothing any man does is right on mothers day except for taking kids off mum's hands for whole day
4: cof Cof Cof!
5: No... have to go on cruches next 2 weeks



### Let's explore 'relief'

In [17]:
explore('relief')

Number of tweets for 'relief': 1526
Percentage of the dataset: 3.8150000000000004
Random indices:' [1051, 460, 485, 539, 216]

Random 5 Tweets for emotion 'relief':
1: @StorySeeker lol...but they aren't here! I'll tell them to do that Monday. lol
2: i finished new moon  in 1 day all up. maybe less, im quite proud, now who wants to lend me eclipse haha
3: Finally sleep time
4: Morning world! back to the office after longgggggggg weekend
5: Andrew's flight back to CO should be landing soon



### Let's explore 'sadness'

In [18]:
explore('sadness')

Number of tweets for 'sadness': 5165
Percentage of the dataset: 12.9125
Random indices:' [5141, 3288, 1590, 1485, 4864]

Random 5 Tweets for emotion 'sadness':
1: my back and legs kill from yesterday and we have a big old leak in the kitchen, looks like staying in pjs all day infront of the tv
2: @siirensiiren meagan rochelle &quot;the one u need&quot; i would say &quot;cater 2 u&quot; by he didnt produce that.
3: @MouseGoesSqueak ahhhh.same here with Geometry, like i said b4, if i didn't have it, i would be graduated!! so i feel ur pain hun!
4: @xxxmaggie oh that sucks  I'm sorry.
5: @sharkattack44 i wish there was an &quot;i like&quot; option (like fb) for things like this



### Let's explore 'surprise'

In [19]:
explore('surprise')

Number of tweets for 'surprise': 2187
Percentage of the dataset: 5.4675
Random indices:' [1480, 1909, 892, 984, 1328]

Random 5 Tweets for emotion 'surprise':
1: Somehow my alarm became an hour fast and I came to realize it as I was leaving the house.. It feels good having an early start
2: Going to bed, stores are closed
3: @ClaudeKelly What day is it? What's #FF? I'm worst than you
4: @missuzliipzlive ilooked in my phone book and ur name was the first to show and i was like i got ti-ti number but it was just ur email
5: @RobHolladay I added it, are you still awake?



### Let's explore 'worry'

In [20]:
explore('worry')

Number of tweets for 'worry': 8459
Percentage of the dataset: 21.1475
Random indices:' [555, 6361, 4951, 3091, 5957]

Random 5 Tweets for emotion 'worry':
1: I want to ride my bicycle today, but it's too cold and cloudy today
2: @jenniclarephoto Sometimes (although I usually go willingly  ) Don't know about the Churnet Valley event though.
3: @Impala_Guy Me neither  But it's getting better
4: Life Just Isn't Fair &gt; And I Feel
5: @selenagomez AWWWE! I live in Van, would've been so great to see you  but have a great flight!



### General thoughts about the contents of these tweets:

There are @mentions, URLs, and character entities ("\&nbsp;", "\&quot;", "\&amp;") that we may want to remove as part of data cleaning, since they carry little or no emotional signal.
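
A possible cleaning helper for these three kinds of noise is sketched here. It is an assumption, not the notebook's method: the name `clean_tweet` and the two regular expressions are hypothetical, and entities are decoded with `html.unescape` rather than removed, so that "\&amp;" becomes "&" instead of disappearing.

```python
import html
import re

# Assumed patterns: URLs start with http:// or https://, mentions are @ plus word characters
URL_RE = re.compile(r'https?://\S+')
MENTION_RE = re.compile(r'@\w+')

def clean_tweet(text):
    """Decode HTML entities, strip URLs and mentions, and tidy whitespace."""
    text = html.unescape(text)      # decode character entities
    text = URL_RE.sub('', text)     # drop URLs
    text = MENTION_RE.sub('', text) # drop @mentions
    # collapse runs of whitespace left behind by the removals
    return re.sub(r'\s+', ' ', text).strip()

print(clean_tweet('@XKookie03 Thx 4 checkin up on me  how r things? http://myloc.me/G4p'))
```

On a DataFrame, the helper could be applied with `df['content'].apply(clean_tweet)`.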

***
## 4 Clean the data

### Clean the data by removing any @mentions from the content

In [21]:
# remove @mentions from the 'content' column
df['content'] = df['content'].str.replace(r'@\w+', '', regex=True)

# display the resulting DataFrame
print(df.head())

     tweet_id   sentiment                                            content
0  1956967341       empty   i know  i was listenin to bad habit earlier a...
1  1956967666     sadness  Layin n bed with a headache  ughhhh...waitin o...
2  1956967696     sadness                Funeral ceremony...gloomy friday...
3  1956967789  enthusiasm               wants to hang out with friends SOON!
4  1956968416     neutral   We want to trade with someone who has Houston...




### Clean the data by removing any whitespace from the front and end of each tweet

In [22]:
df['content'] = df['content'].str.strip()

# display the resulting DataFrame
print(df.head())

     tweet_id   sentiment                                            content
0  1956967341       empty  i know  i was listenin to bad habit earlier an...
1  1956967666     sadness  Layin n bed with a headache  ughhhh...waitin o...
2  1956967696     sadness                Funeral ceremony...gloomy friday...
3  1956967789  enthusiasm               wants to hang out with friends SOON!
4  1956968416     neutral  We want to trade with someone who has Houston ...


In [23]:
type(df.head())

pandas.core.frame.DataFrame