***
# EDA for tweet_emotions.csv dataset
***
https://www.kaggle.com/datasets/pashupatigupta/emotion-detection-from-text

***
## Load the dataset
 <span style="color:red">!!!Make sure only the source you want to read from (Google Drive or local) is left uncommented!!!</span>

In [1]:
import pandas as pd
import numpy as np
import random

# read from the google drive
# url = 'https://drive.google.com/file/d/1xv5hff4DntWz6gn4KMXkBQ6t5_Vs16MQ/view?usp=sharing'
# url = 'https://drive.google.com/uc?id=' + url.split('/')[-2]

# read from local
url = '../datasets/tweet_emotions.csv'

df = pd.read_csv(url)

print(df.head())

     tweet_id   sentiment                                            content
0  1956967341       empty  @tiffanylue i know  i was listenin to bad habi...
1  1956967666     sadness  Layin n bed with a headache  ughhhh...waitin o...
2  1956967696     sadness                Funeral ceremony...gloomy friday...
3  1956967789  enthusiasm               wants to hang out with friends SOON!
4  1956968416     neutral  @dannycastillo We want to trade with someone w...


## These are the datatypes for each of the columns in our dataframe

In [2]:
print(df.dtypes)

tweet_id      int64
sentiment    object
content      object
dtype: object


## These are the unique sentiment labels in the dataset and their counts

In [3]:
# get the value counts of each unique value
counts = df['sentiment'].value_counts()

# convert the counts to percentages
percentages = counts / counts.sum() * 100

print(pd.DataFrame({'count': counts, 'percentage': percentages}))

            count  percentage
neutral      8638     21.5950
worry        8459     21.1475
happiness    5209     13.0225
sadness      5165     12.9125
love         3842      9.6050
surprise     2187      5.4675
fun          1776      4.4400
relief       1526      3.8150
hate         1323      3.3075
empty         827      2.0675
enthusiasm    759      1.8975
boredom       179      0.4475
anger         110      0.2750


## What are the max and min lengths of the tweets that we're dealing with?

In [4]:
lengths = df['content'].str.len()

# display the maximum and minimum lengths
print('Maximum length:', lengths.max())
print('Minimum length:', lengths.min())

Maximum length: 167
Minimum length: 1
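
The max and min only show the extremes. As a rough sketch of how to look a bit further, the cell below summarizes the whole length distribution and pulls out the shortest tweets so they can be read directly. The helper names (`length_summary`, `shortest_tweets`) and the tiny `sample` frame are made up for illustration; on the real data you would pass `df` instead.

```python
import pandas as pd


def length_summary(frame, text_col='content'):
    """Return count/mean/std/min/quartiles/max of the character lengths."""
    lengths = frame[text_col].str.len()
    return lengths.describe()


def shortest_tweets(frame, n=3, text_col='content'):
    """Return the n shortest texts, shortest first."""
    lengths = frame[text_col].str.len()
    return frame.loc[lengths.nsmallest(n).index, text_col].tolist()


# Hypothetical stand-in for df, built from tweets quoted in this notebook
sample = pd.DataFrame({'content': [
    'HELLOO',
    'On the way to santa monica',
    'why im not sleeping  !!',
    'I am sooooooo bored in textiles !',
]})

print(length_summary(sample))
print(shortest_tweets(sample, n=2))
```

Reading the shortest rows is a quick way to decide whether one-character tweets are usable or should be dropped during cleaning.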


***
***
## Check each sentiment to see what kind of content it contains and whether the labels look legitimate

In [5]:
print(sorted(df['sentiment'].unique()))

['anger', 'boredom', 'empty', 'enthusiasm', 'fun', 'happiness', 'hate', 'love', 'neutral', 'relief', 'sadness', 'surprise', 'worry']


In [None]:
# this cell prints samples for all of the sentiments at once,
# but keeping it commented out for now

# unique_sentiments = df['sentiment'].unique()

# for emotion in unique_sentiments:
#     content = df.loc[df['sentiment'] == emotion, 'content']

#     random_nums = random.sample(range(len(content)), 5)
#     print(f"Number of tweets for '{emotion}': {len(content)}\nRandom indices: {random_nums}", end='\n\n')
#     print(f"Random 5 elements of 'content' column for emotion '{emotion}'")

#     counter = 0
#     for tweet in content.iloc[random_nums]:
#         print(f"{counter}: {tweet}")
#         counter = counter + 1

#     print('\n')
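
The per-sentiment cells below all repeat the same sampling steps. A possible way to avoid the copy and paste is a small helper built on `Series.sample`, which draws without replacement and accepts a seed so the same tweets come back on every run. This is only a sketch: `sample_tweets` and the `toy` frame are hypothetical names, not part of the notebook.

```python
import pandas as pd


def sample_tweets(frame, emotion, n=5, seed=None):
    """Return up to n randomly chosen tweets labeled with the given emotion."""
    content = frame.loc[frame['sentiment'] == emotion, 'content']
    # Cap n so that small classes do not raise an error
    n = min(n, len(content))
    return content.sample(n=n, random_state=seed)


# Hypothetical stand-in for df, using tweets quoted in this notebook
toy = pd.DataFrame({
    'sentiment': ['anger', 'anger', 'boredom', 'empty'],
    'content': [
        'What did I do to you!  sheesh',
        'Very bad things.......I need to stop thinking!',
        'I am sooooooo bored in textiles !',
        'HELLOO',
    ],
})

for counter, tweet in enumerate(sample_tweets(toy, 'anger', n=5, seed=0)):
    print(f"{counter}: {tweet}")
```

Passing a fixed `seed` makes the written observations easier to check later, because the sampled tweets no longer change between runs.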

***
### Let's explore 'anger'

In [12]:
emotion = 'anger'

content = df.loc[df['sentiment'] == emotion, 'content']

random_nums = random.sample(range(len(content)), 5)
print(f"Number of tweets for '{emotion}': {len(content)}\nRandom indices: {random_nums}", end='\n\n')
print(f"Random 5 elements of 'content' column for emotion '{emotion}'")

counter = 0
for tweet in content.iloc[random_nums]:
    print(f"{counter}: {tweet}", end='\n')
    counter = counter + 1

Number of tweets for 'anger': 110
Random indices: [59, 30, 37, 13, 78]

Random 5 elements of 'content' column for emotion 'anger'
0: sam and sean are teasing me saying they are gonna get wings without me
1: @aranarose Murphy's Law?  Sorry that your computer is not cooperating when you have lots of work. My kids are .. http://tinyurl.com/km235x
2: Very bad things.......I need to stop thinking!
3: @drakesizzle  If you don't want to come then don't come. JEEEEEZ.
4: @PENLDN just got in, gonna go upto bed in a sec, not drunk! I'm disgusted with myself  haha


#### Initial thoughts on 'anger':

The 'anger' class has only 110 tweets, the lowest count of any sentiment and just 0.2750% of the dataset. This will lead to a class imbalance issue when training a classifier.<br>
<br>
Some of these Tweets seem to be mislabeled (e.g. "omg! goooood ass nappy nap  jusss woke up bout 2 clean up a lil then get ready", "OIL IS CHANGED!  And I am filthy.    But it's an accomplished filthy.", "Just sittin here waitin for my coffee to be full grown on farm town before going to bed"). But most seem to be labeled correctly (e.g. "What did I do to you!  sheesh", "The &quot;Catch Me If You Can&quot; DVD that I rented from Blockbuster.com yesterday was cracked. Figured it out about 35 minutes into the movie.", "@drakesizzle  If you don't want to come then don't come. JEEEEEZ.").<br>
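
One common way to compensate for this kind of imbalance is to weight each class inversely to its frequency. The sketch below uses the counts printed earlier and the heuristic `n_samples / (n_classes * class_count)`, which is the same formula scikit-learn uses for `class_weight='balanced'`. Whether weighting (rather than resampling or merging labels) is the right choice here is an assumption; the notebook has not decided that yet. The function name `inverse_frequency_weights` is made up for illustration.

```python
import pandas as pd

# Counts copied from the value_counts() output earlier in this notebook
counts = pd.Series({
    'neutral': 8638, 'worry': 8459, 'happiness': 5209, 'sadness': 5165,
    'love': 3842, 'surprise': 2187, 'fun': 1776, 'relief': 1526,
    'hate': 1323, 'empty': 827, 'enthusiasm': 759, 'boredom': 179,
    'anger': 110,
})


def inverse_frequency_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    return class_counts.sum() / (len(class_counts) * class_counts)


weights = inverse_frequency_weights(counts)
print(weights.round(2))
```

The rarest classes get the largest weights, so errors on 'anger' and 'boredom' would count for more during training than errors on 'neutral'.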
***

***
### Let's explore 'boredom'

In [16]:
emotion = 'boredom'

content = df.loc[df['sentiment'] == emotion, 'content']

random_nums = random.sample(range(len(content)), 5)
print(f"Number of tweets for '{emotion}': {len(content)}\nRandom indices: {random_nums}", end='\n\n')
print(f"Random 5 elements of 'content' column for emotion '{emotion}'")

counter = 0
for tweet in content.iloc[random_nums]:
    print(f"{counter}: {tweet}", end='\n')
    counter = counter + 1

Number of tweets for 'boredom': 179
Random indices: [86, 32, 22, 47, 36]

Random 5 elements of 'content' column for emotion 'boredom'
0: @random_nexus he has to have a new suitcase, but he is just so bloody indecicive, everytime he wants to buy something it takes HOURS
1: nomatter how much i sleep am still tired
2: why im not sleeping  !!
3: I want to go to my home!!! D: I don't like stay in the work all alone
4: I am sooooooo bored in textiles !


#### Initial thoughts on 'boredom':

This is the second-smallest sentiment category, with only 179 Tweets (0.4475% of the dataset). Like 'anger', it will contribute to the class imbalance issue.<br>
<br>
The interpretation of 'boredom' seems to range from the traditional sense (e.g. "I am sooooooo bored in textiles !", "is bored. my BFF doesn't want to hang out") to "this person must be bored because they have nothing better to do than write this Tweet" (e.g. "aw now where's that little asian girl who runs round pooping her pants in public? i miss laughing at her.", "my neighbours are far too loud in thier back garden, all I can hear is this loud woman that won't stop laughing").

### Let's explore 'empty'

In [None]:
emotion = 'empty'

content = df.loc[df['sentiment'] == emotion, 'content']
print(f"Random 5 elements of 'content' column for emotion '{emotion}':")

random_nums = random.sample(range(len(content)), 5)  # sample without replacement so no tweet repeats

counter = 0
for tweet in content.iloc[random_nums]:
    print(f"{counter}: {tweet}", end='\n')
    counter = counter + 1

#### Initial thoughts on 'empty'

The 'empty' label seems to be applied inconsistently. At times, a Tweet expresses two or more sentiments ("back from grimsby  it sucks bein back but was amazin wknd anyway!!"). Other times, Tweets are just statements of fact or questions ("On the way to santa monica", "@spook68 morning.any plans for today?"). Still other times, they are just single words ("HELLOO"). Some Tweets seem like they should carry an emotion label ("yay, joss is coming over on saturday" should probably be labeled 'happiness' or 'enthusiasm'); perhaps because more than one label could apply, the labeler chose to leave them as 'empty'.

### Let's explore 'enthusiasm'

In [None]:
emotion = 'enthusiasm'

content = df.loc[df['sentiment'] == emotion, 'content']
print(f"Random 5 elements of 'content' column for emotion '{emotion}':")

random_nums = random.sample(range(len(content)), 5)  # sample without replacement so no tweet repeats

counter = 0
for tweet in content.iloc[random_nums]:
    print(f"{counter}: {tweet}", end='\n')
    counter = counter + 1

#### Initial thoughts on 'enthusiasm':

asdf

### Let's explore 'fun'

In [None]:
emotion = 'fun'

content = df.loc[df['sentiment'] == emotion, 'content']
print(f"Random 5 elements of 'content' column for emotion '{emotion}':")

random_nums = random.sample(range(len(content)), 5)  # sample without replacement so no tweet repeats

counter = 0
for tweet in content.iloc[random_nums]:
    print(f"{counter}: {tweet}", end='\n')
    counter = counter + 1

#### Initial thoughts on 'fun':

asdf

### Let's explore 'happiness'

In [None]:
emotion = 'happiness'

content = df.loc[df['sentiment'] == emotion, 'content']
print(f"Random 5 elements of 'content' column for emotion '{emotion}':")

random_nums = random.sample(range(len(content)), 5)  # sample without replacement so no tweet repeats

counter = 0
for tweet in content.iloc[random_nums]:
    print(f"{counter}: {tweet}", end='\n')
    counter = counter + 1

#### Initial thoughts on 'happiness':

asdf

### Let's explore 'hate'

In [None]:
emotion = 'hate'

content = df.loc[df['sentiment'] == emotion, 'content']
print(f"Random 5 elements of 'content' column for emotion '{emotion}':")

random_nums = random.sample(range(len(content)), 5)  # sample without replacement so no tweet repeats

counter = 0
for tweet in content.iloc[random_nums]:
    print(f"{counter}: {tweet}", end='\n')
    counter = counter + 1

#### Initial thoughts on 'hate':

asdf

### Let's explore 'love'

In [None]:
emotion = 'love'

content = df.loc[df['sentiment'] == emotion, 'content']
print(f"Random 5 elements of 'content' column for emotion '{emotion}':")

random_nums = random.sample(range(len(content)), 5)  # sample without replacement so no tweet repeats

counter = 0
for tweet in content.iloc[random_nums]:
    print(f"{counter}: {tweet}", end='\n')
    counter = counter + 1

#### Initial thoughts on 'love':

asdf

### Let's explore 'neutral'

In [None]:
emotion = 'neutral'

content = df.loc[df['sentiment'] == emotion, 'content']
print(f"Random 5 elements of 'content' column for emotion '{emotion}':")

random_nums = random.sample(range(len(content)), 5)  # sample without replacement so no tweet repeats

counter = 0
for tweet in content.iloc[random_nums]:
    print(f"{counter}: {tweet}", end='\n')
    counter = counter + 1

#### Initial thoughts on 'neutral':

asdf

### Let's explore 'relief'

In [None]:
emotion = 'relief'

content = df.loc[df['sentiment'] == emotion, 'content']
print(f"Random 5 elements of 'content' column for emotion '{emotion}':")

random_nums = random.sample(range(len(content)), 5)  # sample without replacement so no tweet repeats

counter = 0
for tweet in content.iloc[random_nums]:
    print(f"{counter}: {tweet}", end='\n')
    counter = counter + 1

#### Initial thoughts on 'relief':

asdf

### Let's explore 'sadness'

In [None]:
emotion = 'sadness'

content = df.loc[df['sentiment'] == emotion, 'content']
print(f"Random 5 elements of 'content' column for emotion '{emotion}':")

random_nums = random.sample(range(len(content)), 5)  # sample without replacement so no tweet repeats

counter = 0
for tweet in content.iloc[random_nums]:
    print(f"{counter}: {tweet}", end='\n')
    counter = counter + 1

#### Initial thoughts on 'sadness':

asdf

### Let's explore 'surprise'

In [None]:
emotion = 'surprise'

content = df.loc[df['sentiment'] == emotion, 'content']
print(f"Random 5 elements of 'content' column for emotion '{emotion}':")

random_nums = random.sample(range(len(content)), 5)  # sample without replacement so no tweet repeats

counter = 0
for tweet in content.iloc[random_nums]:
    print(f"{counter}: {tweet}", end='\n')
    counter = counter + 1

#### Initial thoughts on 'surprise':

asdf

### Let's explore 'worry'

In [None]:
emotion = 'worry'

content = df.loc[df['sentiment'] == emotion, 'content']
print(f"Random 5 elements of 'content' column for emotion '{emotion}':")

random_nums = random.sample(range(len(content)), 5)  # sample without replacement so no tweet repeats

counter = 0
for tweet in content.iloc[random_nums]:
    print(f"{counter}: {tweet}", end='\n')
    counter = counter + 1

#### Initial thoughts on 'worry':

asdf

### General thoughts about the contents of these tweets:

There are @mentions, URLs, and character entities (&nbsp;, &quot;, &amp;) that we may want to remove as part of data cleaning, since they carry little information about emotion.
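
As a rough sketch of what that cleaning could look like in one pass, the function below decodes character entities with the standard library's `html.unescape`, strips URLs and @mentions with regular expressions, and collapses leftover whitespace. The name `clean_tweet` and the exact patterns are assumptions for illustration. Decoding entities instead of deleting them is also a choice made here (it keeps the quoted text readable), not something the notebook has settled on.

```python
import html
import re

import pandas as pd

URL_RE = re.compile(r'https?://\S+')   # assumes links start with http:// or https://
MENTION_RE = re.compile(r'@\w+')       # same pattern the next cell uses


def clean_tweet(text):
    """Decode entities, drop URLs and @mentions, and tidy whitespace."""
    text = html.unescape(text)          # &quot; -> ", &amp; -> &, &nbsp; -> non-breaking space
    text = URL_RE.sub('', text)
    text = MENTION_RE.sub('', text)
    return re.sub(r'\s+', ' ', text).strip()


# Hypothetical stand-in for df['content'], adapted from tweets in this notebook
examples = pd.Series([
    "@aranarose Murphy's Law?  My kids are .. http://tinyurl.com/km235x",
    "The &quot;Catch Me If You Can&quot; DVD was cracked.",
])

print(examples.apply(clean_tweet).tolist())
```

On the real data this would be applied as `df['content'].apply(clean_tweet)`, ideally into a new column so the raw text stays available for comparison.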

### Clean the data by removing any @mentions from the content

In [None]:
# remove @mentions from the 'content' column
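# regex=True is required: since pandas 2.0, str.replace treats the pattern as a literal string by default
df['content'] = df['content'].str.replace(r'@\w+', '', regex=True)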

# display the resulting DataFrame
print(df.head())

### Clean the data by removing leading and trailing whitespace from each tweet

In [None]:
df['content'] = df['content'].str.strip()

# display the resulting DataFrame
print(df.head())

In [None]:
# print the first five cleaned tweets in full (df.head() truncates them)
for tweet in df.head()['content']:
    print(tweet, end='\n')

In [None]:
type(df.head())