***
# EDA for the CARER emotion dataset
https://github.com/dair-ai/emotion_dataset
***

***
## 1 Load the CARER dataset
 <span style="color:red">!!!Make sure to comment out the source you are not using (Google Drive or local)!!!</span>

In [1]:
import pandas as pd
import numpy as np
import random

# read from the google drive
# url = 'https://drive.google.com/file/d/1xv5hff4DntWz6gn4KMXkBQ6t5_Vs16MQ/view?usp=sharing'
# source = 'https://drive.google.com/uc?id=' + url.split('/')[-2]

# read from local
source = '../datasets/carer_data.pkl'

# Load the pickled DataFrame
df = pd.read_pickle(source)

print(df.head())

                                                     text emotions
27383   i feel awful about it too because it s my job ...  sadness
110083                              im alone i feel awful  sadness
140764  ive probably mentioned this before but i reall...      joy
100071           i was feeling a little low few days back  sadness
2837    i beleive that i am much more sensitive to oth...     love


***
## 2 Basic EDA

### Datatypes for each of the columns in our dataframe

In [2]:
print(df.dtypes)

text        object
emotions    object
dtype: object


### Rename the columns to match the other data to make it easier to reuse code

In [3]:
# Rename columns using rename() method
df.rename(columns={'text': 'content', 'emotions': 'sentiment'}, inplace=True)
print(df.dtypes)

content      object
sentiment    object
dtype: object


### Unique types and count of sentiments in the dataset

In [4]:
# get the value counts of each unique value
counts = df['sentiment'].value_counts()

# convert the counts to percentages
percentages = counts / counts.sum() * 100

cp_df = pd.DataFrame({'count': counts, 'percentage': percentages})

print(cp_df)

           count  percentage
joy       141067   33.844519
sadness   121187   29.074948
anger      57317   13.751383
fear       47712   11.446970
love       34554    8.290128
surprise   14972    3.592053


#### Initial thoughts on this dataset

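This dataset is imbalanced: the largest class ('joy', 33.84%) is roughly nine times the size of the smallest ('surprise', 3.59%). Even so, it is less imbalanced than the Kaggle dataset that we looked at initially.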
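
To put a number on the imbalance, here is a minimal sketch that works directly from the counts printed above. The inverse-frequency weighting shown is one common heuristic (the same formula scikit-learn uses for `class_weight='balanced'`), not something this notebook has settled on; the variable names `imbalance_ratio` and `class_weights` are illustrative.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series(
    {"joy": 141067, "sadness": 121187, "anger": 57317,
     "fear": 47712, "love": 34554, "surprise": 14972},
    name="count",
)

# Ratio between the largest and the smallest class
imbalance_ratio = counts.max() / counts.min()

# Inverse-frequency weights: n_samples / (n_classes * class_count)
# Rare classes get weights above 1, frequent classes below 1
class_weights = counts.sum() / (len(counts) * counts)

print(f"Imbalance ratio (largest / smallest): {imbalance_ratio:.2f}")
print(class_weights.round(3))
```

Weights like these could later be passed to a loss function or classifier if resampling turns out not to be enough.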

### Max and min lengths of the text

In [5]:
lengths = df['content'].str.len()

# display the maximum and minimum lengths
print('Maximum length:', lengths.max())
print('Minimum length:', lengths.min())

Maximum length: 830
Minimum length: 2
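
A minimum length of 2 suggests that a few rows carry almost no text. The sketch below shows one way to look at the length distribution and pull out the shortest rows for inspection. It runs on a tiny stand-in frame (`demo`, with one made-up two-character row) so that it is self-contained; in the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Tiny stand-in frame; "hm" is a made-up row, the others appear in the output above
demo = pd.DataFrame({
    "content": [
        "i feel awful",
        "hm",
        "im alone i feel awful",
        "i was feeling a little low few days back",
    ],
    "sentiment": ["sadness", "joy", "sadness", "sadness"],
})

lengths = demo["content"].str.len()

# Summary statistics (mean, quartiles, min, max) of the text lengths
print(lengths.describe())

# Rows with the shortest texts, shortest first
shortest = demo.loc[lengths.nsmallest(2).index]
print(shortest)
```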


***
## 3 Check the content of each sentiment to see whether the labels are plausible

### All the sentiments

In [6]:
all_sentiments = sorted(df['sentiment'].unique())
print(all_sentiments)

['anger', 'fear', 'joy', 'love', 'sadness', 'surprise']


### A function to explore the sentiments

In [7]:
# pass the emotion and the number of random texts to show
def explore(emotion=None, num_tweets=5):
    if emotion is None:
        return 'no emotion provided'
    if emotion not in all_sentiments:
        return 'not one of the sentiments in the dataframe'

    content = df.loc[df['sentiment'] == emotion, 'content']

    # cap the sample size: random.sample raises ValueError if it exceeds the population
    num_tweets = min(num_tweets, len(content))
    random_nums = random.sample(range(len(content)), num_tweets)

    print(f"Number of tweets for '{emotion}': {len(content)}")
    print(f"Percentage of the dataset: {cp_df.loc[emotion, 'percentage']}")
    print(f"Random indices: {random_nums}", end='\n\n')
    print(f"Random {num_tweets} Tweets for emotion '{emotion}':")

    # enumerate replaces the manual counter
    for counter, tweet in enumerate(content.iloc[random_nums], start=1):
        print(f"{counter}: {tweet}")
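
Because `explore` draws new random indices on every call, the samples discussed in the notes below cannot be reproduced exactly. A minimal sketch of a reproducible variant follows; the helper name `sample_texts` and its `seed` parameter are hypothetical, and `demo` is a small stand-in for `df` built from texts that appear in this notebook's output.

```python
import pandas as pd

def sample_texts(frame, emotion, n=5, seed=42):
    """Return up to n random texts for one label, reproducibly."""
    subset = frame.loc[frame["sentiment"] == emotion, "content"]
    # Cap n so sampling never asks for more rows than exist
    return subset.sample(n=min(n, len(subset)), random_state=seed)

# Stand-in data; in the notebook, pass the real `df` instead
demo = pd.DataFrame({
    "content": [
        "im alone i feel awful",
        "i was feeling a little low few days back",
        "i feel a bit more assured it s going in the right direction",
        "i hate this feeling not feeling like yourself",
    ],
    "sentiment": ["sadness", "sadness", "joy", "sadness"],
})

picked = sample_texts(demo, "sadness", n=2)
for i, text in enumerate(picked, start=1):
    print(f"{i}: {text}")
```

Fixing the seed means that notes written about a particular sample stay valid when the notebook is re-run.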

### Uncomment to see all sentiments at once

In [None]:
# this cell prints all of the sentiments all at once,
# but keeping it commented out for now

# for emotion in all_sentiments:
#     explore(emotion)
#     print('**********************************************************\n**********************************************************\n')

### Let's explore 'anger'

In [8]:
explore('anger')

Number of tweets for 'anger': 57317
Percentage of the dataset: 13.751382527728529
Random indices: [38055, 44937, 29126, 13834, 6668]

Random 5 Tweets for emotion 'anger':
1: i hear and see james clapper talk about the nsa leak i feel disgusted
2: i feel stressed and sad and a whole range of emotions that both my project due to the funding issues and my status in my site my own sense of security and the uncertainty of if i ll be able to stay there for the rest of my service are so up in the air right now
3: i have three little girls and want so desperately for them not to do the kind of stupid things i did or feel the selfish feelings i felt
4: i hate this feeling not feeling like yourself and dissatisfied in who you are and way too far away from the ones who can find you again
5: i feel like i am being selfish and that has never been a trait that i carry


#### Initial thoughts on 'anger':

The 'anger' class has 57,317 tweets, 13.75% of the dataset, which makes it the third-largest of the six classes. It is underrepresented relative to 'joy' and 'sadness', but far less so than 'surprise' (3.59%).<br>
<br>
Some of these Tweets seem to be mislabeled (e.g. "omg! goooood ass nappy nap  jusss woke up bout 2 clean up a lil then get ready", "OIL IS CHANGED!  And I am filthy.    But it's an accomplished filthy.", "Just sittin here waitin for my coffee to be full grown on farm town before going to bed"). But most seem to be labeled correctly (e.g. "What did I do to you!  sheesh", "The &quot;Catch Me If You Can&quot; DVD that I rented from Blockbuster.com yesterday was cracked. Figured it out about 35 minutes into the movie.", "@drakesizzle  If you don't want to come then don't come. JEEEEEZ.").

### Let's explore 'fear'

In [9]:
explore('fear')

Number of tweets for 'fear': 47712
Percentage of the dataset: 11.446969715145308
Random indices: [37445, 6189, 33005, 1591, 18605]

Random 5 Tweets for emotion 'fear':
1: i wake on a monday feeling frantic world spinning before my feet hit the ground
2: im so torn between my wants and my needs that its making me feel a little anxious
3: i feel a bit apprehensive the captain has hours and hours of sailing experience and a few years back he sailed to hawaii by himself
4: i love monos mom and some others in the family but as a whole that family is sick and irritating and i just feel tortured when i am with them
5: i just got bad news at work or had a fight with a friend and am already feeling vulnerable theres no way im going to read reviews


#### Initial thoughts on 'boredom':

This is the second lowest sentiment category with only 179 Tweets representing 0.4475% of the dataset. Like anger, this will be an imbalance issue.<br>
<br>
The interpretation of 'bored' seems to vary from the traditional sense (e.g. "I am sooooooo bored in textiles !", "is bored. my BFF doesn't want to hang out") to "this person must be bored because they have nothing better to do than write this Tweet" (e.g. "aw now where's that little asian girl who runs round pooping her pants in public? i miss laughing at her.", "my neighbours are far too loud in thier back garden, all I can hear is this loud woman that won't stop laughing").

### Let's explore 'joy'

In [10]:
explore('joy')

Number of tweets for 'joy': 141067
Percentage of the dataset: 33.84451871240784
Random indices: [3106, 54160, 99782, 112861, 77798]

Random 5 Tweets for emotion 'joy':
1: i finally stop attempting to save face and tell the staff that i have to go to the hotel doctor i have already been once for the vomiting and did not feel the need to develop any sort of friendly relationship with the doctor
2: i mentioned this on my fb profile but i ve gotten to the point where if i see a pilot with a well worn leather pilot s bag that s seen some serious milage all scratched up and almost falling apart i feel reassured
3: i feel a bit more assured it s going in the right direction
4: i makes me feel more appreciative for the people around me
5: i have been feeling jeweltones and rich metallics


#### Initial thoughts on 'empty'

It seems that the 'empty' sentiment is very random. At times, there are two or more sentiments expressed (e.g. "back from grimsby  it sucks bein back but was amazin wknd anyway!!"). Other times they are just statements of fact or questions (e.g. "On the way to santa monica", "@spook68 morning.any plans for today?"). Still other times they are just words ("HELLOO"). Sometimes, there are Tweets that seem like they should be labeled with one of the other emotions (e.g. "yay, joss is coming over on saturday" should probably be labeled 'happiness' or 'enthusiasm') and perhaps because there are possibly two different labels in that last example, the labeler chose to leave it empty.<br>
<br>
There are 827 of the 'empty' Tweets representing 2.0675% of the dataset. We should consider removing these from the dataset since these seem to be the "unknown" type of Tweets.

### Let's explore 'love'

In [11]:
explore('love')

Number of tweets for 'love': 34554
Percentage of the dataset: 8.29012809224369
Random indices: [27129, 10567, 11524, 2481, 26676]

Random 5 Tweets for emotion 'love':
1: i must be going to sleep feeling longing for something or nothing at all for i feel even number when i wake up
2: i feel like again daniel franco didnt get the fullest shake he could have gotten though he was extremely nobel and gracious about it
3: i feel we are blessed to live in such a beautiful area and i wouldnt choose anywhere else to live in the world
4: i wouldnt fall that deep for you jaslyn but those words u said really make me feel loved
5: i am feeling generous and happy today


#### Initial thoughts on 'enthusiasm':

This set of Tweets also suffers from some mislabeling issues (e.g. "I'm bored, extremely bored. in the car. waiting for my dad. and dinner. chinese. yummm.", "wishes I could be the one going to our conference in the Bahamas next week"). Some Tweets seem to be mislabeled as 'enthusiasm' due to certain keywords (e.g. "I made my parents add u guys on the family myspace...they were impressed by the song" -- 'impressed') or perhaps the number of exclamation points (e.g. "im so new!! and i need ur help").

### Let's explore 'sadness'

In [None]:
explore('sadness')

#### Initial thoughts on 'fun':

This emotion is a bit hard to nail down in terms of what the labelers were going for. And, as with other emotions, it seems that there are a lot of mislabeled Tweets.

### Let's explore 'surprise'

In [None]:
explore('surprise')

#### Initial thoughts on 'happiness':

Have not gotten to this.

### General thoughts about the contents of these tweets:

There are @mentions, URLs, and character entities ("\&nbsp;", "\&quot;", "\&amp;") that we may want to remove as part of data cleaning since they are superfluous in terms of indicating emotion. There are also hashtags that we may or may not consider removing. Some of these hashtags don't actually add to the sentiment, but some do so we may consider keeping them.<br>
<br>
Additionally, some emotions have proportionally few records compared to others ('surprise' with 14,972 records versus 'joy' with 141,067). We will have to deal with this somehow, perhaps by using stratified sampling techniques (https://stackoverflow.com/questions/70849127/training-validation-and-test-sets-for-imbalanced-datasets-in-machine-learning) rather than a purely random train/validation/test split, which may leave the smallest classes underrepresented in some of the splits.
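
As a rough illustration of stratified splitting, the sketch below draws the same fraction from every class using pandas' `groupby(...).sample(...)`, so each split keeps the original label proportions. This is a pandas-only stand-in for the approach in the linked discussion, not the method this project has chosen; the frame `demo` is synthetic and the 60/20/20 ratio is an assumption.

```python
import pandas as pd

# Synthetic frame with a 6:3:1 label imbalance (made-up data)
demo = pd.DataFrame({
    "content": [f"text {i}" for i in range(100)],
    "sentiment": ["joy"] * 60 + ["sadness"] * 30 + ["surprise"] * 10,
})

# Draw 20% of every class for the test set
test = demo.groupby("sentiment").sample(frac=0.2, random_state=0)
rest = demo.drop(test.index)

# Draw 25% of the remainder per class for validation (20% of the original)
val = rest.groupby("sentiment").sample(frac=0.25, random_state=0)
train = rest.drop(val.index)

# Each split should show the same label proportions as the full frame
for name, part in [("train", train), ("val", val), ("test", test)]:
    shares = part["sentiment"].value_counts(normalize=True).round(2).to_dict()
    print(name, len(part), shares)
```

Sampling per class guarantees that even the rarest label appears in every split, which a purely random split does not.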

***
## 4 Clean the data

### Clean the data by removing any @mentions from the content

In [None]:
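# remove @mentions from the 'content' column
# (regex=True is required: pandas >= 2.0 treats the pattern as a literal string by default)
df['content'] = df['content'].str.replace(r'@\w+', '', regex=True)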

# display the resulting DataFrame
print(df.head())

### Clean the data by removing any whitespace from the front and end of each tweet

In [None]:
df['content'] = df['content'].str.strip()

# display the resulting DataFrame
print(df.head())
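
The general thoughts in section 3 also mention URLs and character entities as cleaning candidates. A minimal sketch of how those steps could be chained with the @mention removal is shown below. The URL pattern is a simple assumption rather than a complete URL matcher, the first two sample strings are taken from the notes above, and the third is made up for illustration.

```python
import html
import pandas as pd

demo = pd.Series([
    "@drakesizzle  If you don't want to come then don't come. JEEEEEZ.",
    "The &quot;Catch Me If You Can&quot; DVD that I rented from Blockbuster.com yesterday was cracked.",
    "check this out http://example.com/page &amp; tell me",  # made-up example
], name="content")

cleaned = (
    demo
    .map(html.unescape)                            # decode entities: &quot; -> ", &amp; -> &
    .str.replace(r"https?://\S+", "", regex=True)  # drop http/https URLs
    .str.replace(r"@\w+", "", regex=True)          # drop @mentions
    .str.replace(r"\s+", " ", regex=True)          # collapse runs of whitespace
    .str.strip()                                   # trim leading/trailing whitespace
)

for text in cleaned:
    print(text)
```

Decoding entities before stripping URLs and mentions keeps the patterns simple, since they then only ever see plain characters.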