***
# EDA for the CARER emotion dataset
https://github.com/dair-ai/emotion_dataset<br>
https://huggingface.co/datasets/dair-ai/emotion
***

***
## 1 Load the CARER dataset
 <span style="color:red">!!!Make sure to comment out the source you are not using (Google Drive or local)!!!</span>

In [1]:
import pandas as pd
import numpy as np
import random

# read from the google drive
# url = 'https://drive.google.com/file/d/1e-w7djzvn1LUFutmDgnOh16mYjiVX34s/view?usp=sharing'
# source = 'https://drive.google.com/uc?id=' + url.split('/')[-2]

# read from local
source = '../datasets/carer_data.pkl'

# Load the pickled DataFrame
df = pd.read_pickle(source)

print(df.head())

                                                     text emotions
27383   i feel awful about it too because it s my job ...  sadness
110083                              im alone i feel awful  sadness
140764  ive probably mentioned this before but i reall...      joy
100071           i was feeling a little low few days back  sadness
2837    i beleive that i am much more sensitive to oth...     love
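
The commented-out Google Drive lines above turn a share link into a direct-download URL. Below is a minimal sketch of that conversion as a helper. `drive_direct_url` is a hypothetical name that is not part of this notebook, and the sketch assumes the share link has the `.../file/d/<id>/view` shape. Whether `pd.read_pickle` can read the resulting URL depends on how Google Drive serves the file, so treat that part as untested.

```python
def drive_direct_url(share_url: str) -> str:
    """Convert a Google Drive share link into a direct-download URL.

    Assumes the link looks like .../file/d/<file_id>/view?...;
    other link shapes are not handled.
    """
    # The file id is the second-to-last path segment of the share link
    file_id = share_url.split('/')[-2]
    return 'https://drive.google.com/uc?id=' + file_id


share_url = 'https://drive.google.com/file/d/1e-w7djzvn1LUFutmDgnOh16mYjiVX34s/view?usp=sharing'
print(drive_direct_url(share_url))
```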


***
## 2 Basic EDA

### Datatypes for each of the columns in our dataframe

In [2]:
print(df.dtypes)

text        object
emotions    object
dtype: object


### Rename the columns to match the other data to make it easier to reuse code

In [3]:
# Rename columns using rename() method
df.rename(columns={'text': 'content', 'emotions': 'sentiment'}, inplace=True)
print(df.dtypes)

content      object
sentiment    object
dtype: object


### Unique types and count of sentiments in the dataset

In [4]:
# get the value counts of each unique value
counts = df['sentiment'].value_counts()

# convert the counts to percentages
percentages = counts / counts.sum() * 100

cp_df = pd.DataFrame({'count': counts, 'percentage': percentages})

print(cp_df)

           count  percentage
joy       141067   33.844519
sadness   121187   29.074948
anger      57317   13.751383
fear       47712   11.446970
love       34554    8.290128
surprise   14972    3.592053
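
As an alternative to dividing by the total by hand, `value_counts(normalize=True)` returns the fractions directly. The sketch below uses a small toy Series (an assumption, not the real data) to show that both columns can be built this way.

```python
import pandas as pd

# Toy labels standing in for df['sentiment']; not the real data.
toy = pd.Series(['joy', 'joy', 'sadness', 'anger'], name='sentiment')

counts = toy.value_counts()
# normalize=True returns fractions that sum to 1, so multiply by 100
percentages = toy.value_counts(normalize=True) * 100

# Both Series share the label index, so the columns line up by label
toy_cp_df = pd.DataFrame({'count': counts, 'percentage': percentages})
print(toy_cp_df)
```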


### Max and min lengths of the text

In [5]:
lengths = df['content'].str.len()

# display the maximum and minimum lengths
print('Maximum length:', lengths.max())
print('Minimum length:', lengths.min())

Maximum length: 830
Minimum length: 2
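
The maximum and minimum only describe the extremes. A hedged sketch, using toy texts rather than the real data, of how the full length distribution, word counts, and the shortest texts themselves could be inspected:

```python
import pandas as pd

# Toy texts standing in for df['content']; not the real data.
toy = pd.Series(['i feel awful', 'im alone i feel awful', 'hm'], name='content')

char_lengths = toy.str.len()
# Split on whitespace and count the pieces to get words per text
word_counts = toy.str.split().str.len()

# Summary statistics give more context than max and min alone
print(char_lengths.describe())

# Show the shortest texts to judge whether they carry any emotion signal
shortest = toy.loc[char_lengths.nsmallest(2).index]
print(shortest)
```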


#### Initial thoughts on this dataset

This dataset is imbalanced, with 'joy' at about 34% of the rows and 'surprise' at about 3.6%, but less so than the Kaggle dataset that we had looked at initially. Additionally, unlike the Tweet dataset, it has texts of up to 830 characters.
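
One way to put a number on the imbalance is the ratio between the largest and smallest class. The sketch below copies the counts from the `value_counts()` output earlier in this notebook; the ratio is just one possible summary, chosen here for illustration.

```python
import pandas as pd

# Counts copied from the value_counts() output in section 2
counts = pd.Series({
    'joy': 141067, 'sadness': 121187, 'anger': 57317,
    'fear': 47712, 'love': 34554, 'surprise': 14972,
})

# Ratio of the largest class to the smallest one
imbalance_ratio = counts.max() / counts.min()
print(f"Largest-to-smallest class ratio: {imbalance_ratio:.2f}")
```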

***
## 3 Check the content of each sentiment to see whether the labels are legit

### All the sentiments

In [6]:
all_sentiments = sorted(df['sentiment'].unique())
print(all_sentiments)

['anger', 'fear', 'joy', 'love', 'sadness', 'surprise']


### A function to explore the sentiments

In [7]:
# pass the emotion and the number of random texts to show
def explore(emotion=None, num_tweets=5):
    if emotion is None:
        return 'no emotion provided'
    elif emotion not in all_sentiments:
        return 'not one of the sentiments in the dataframe'
    else:
        content = df.loc[df['sentiment'] == emotion, 'content']

        # never ask for more samples than there are texts
        num_tweets = min(num_tweets, len(content))
        random_nums = random.sample(range(len(content)), num_tweets)

        print(f"Number of tweets for '{emotion}': {len(content)}")
        print(f"Percentage of the dataset: {cp_df.loc[emotion].percentage}")
        print(f"Random indices: {random_nums}", end='\n\n')
        print(f"Random {num_tweets} Tweets for emotion '{emotion}':")

        # number the sampled texts starting at 1
        for counter, tweet in enumerate(content.iloc[random_nums], start=1):
            print(f"{counter}: {tweet}")
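
A possible variation, sketched here on a toy frame rather than the real data: pandas' own `Series.sample` can draw the random texts, and its `random_state` argument makes a batch repeatable. `sample_texts` and `toy_df` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Toy frame standing in for the renamed df; not the real data.
toy_df = pd.DataFrame({
    'content': ['i feel irritated', 'i feel scared', 'i feel despised', 'i feel happy'],
    'sentiment': ['anger', 'fear', 'anger', 'joy'],
})


def sample_texts(frame, emotion, n=5, seed=None):
    """Return up to n random texts for one emotion (hypothetical helper)."""
    content = frame.loc[frame['sentiment'] == emotion, 'content']
    # Cap n so a small cohort does not raise ValueError
    n = min(n, len(content))
    # random_state makes the draw repeatable when a seed is given
    return content.sample(n=n, random_state=seed)


print(sample_texts(toy_df, 'anger', n=5, seed=0))
```

Passing the same `seed` returns the same batch, which helps when the written notes quote specific examples.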

### Uncomment to see all sentiments at once

In [None]:
# this cell prints all of the sentiments all at once,
# but keeping it commented out for now

# for emotion in all_sentiments:
#     explore(emotion)
#     print('**********************************************************\n**********************************************************\n')

### Let's explore 'anger'

In [15]:
explore('anger')

Number of tweets for 'anger': 57317
Percentage of the dataset: 13.751382527728529
Random indices: [22018, 21561, 1230, 44874, 48454]

Random 5 Tweets for emotion 'anger':
1: i know he isn t doing these things on purpose i still feel irritated
2: id break down my question like this if the bunch of us feels dissatisfied with the current state of uu theology would we read a new crop of publications specialized blogs online magazines new periodicals books
3: i am more aware of my self in the dynamic of the relationship and speak up if ever i feel disrespected or offended
4: i finished with that article not feeling particularly enraged just annoyed i saw there was a second featured article beneath it called a style font weight bold href http health
5: i feel despised and i dont deserve that


#### Initial thoughts on 'anger':

The number of 'anger' texts, 57317, is the third highest in this dataset and represents 13.75% of it.<br>
<br>
Most of these texts feel very spot-on with the label 'anger' (e.g. "i left feeling really quite angry and frustrated", "i feel disgusted and frustrated", "i am now well aware of his intention and feeling insulted").<br>
<br>
There are some that may not fit 'anger' exactly (e.g. "i want to be honest and blunt and tell it like it is and not worry that i might hurt someones feelings or make them make mad me", "i feel gleefully rebellious", "i find the easiest way to calm down if im feeling agitated is by satisfying each of the five senses") but can reasonably be assessed as such.<br>
<br>
In general, though, after looking through several random batches of these, amounting to several hundred short texts, it seems that the 'anger' label is fitting.

### Let's explore 'fear'

In [20]:
explore('fear')

Number of tweets for 'fear': 47712
Percentage of the dataset: 11.446969715145308
Random indices: [46142, 11482, 2207, 35121, 17262]

Random 5 Tweets for emotion 'fear':
1: i do not feel threatened at all by spain
2: i feel reluctant to go back to the vivos maybe i still prefer heel or midfoot strike over the forefoot one
3: i feel scared when you don t call
4: i feel pressured by christian culture
5: i now feel alarmed and uneasy


#### Initial thoughts on 'fear':

Like 'anger', these texts seem to be in line with their labeled emotion ('fear').

### Let's explore 'joy'

In [25]:
explore('joy')

Number of tweets for 'joy': 141067
Percentage of the dataset: 33.84451871240784
Random indices: [50833, 63199, 6679, 130321, 85511]

Random 5 Tweets for emotion 'joy':
1: i feel as if it would be more useful for your computer at home that you use for social network
2: i feel assured and do not have to worry about missing out any important work anymore
3: i hadn t planned my route i was slightly dehydrated and did not feel particularly energetic
4: i feel venus is gifting us is the perfect opportunity to reconcile the masculine and feminine within and also integrate our divine selves with our physical selves
5: i feel artistic at times i always lose steam on projects quickly


#### Initial thoughts on 'joy'

The 'joy' label seems to be a mixed bag. There are some that seem to fit the label (e.g. "i feel good about the way im going to look that day", "i never feel so bouncy and happy like this since i was kid") while others definitely do not fit (e.g. "i didnt really feel satisfied with the solution", "i feel no less pain feel no less resolved in my head about why").
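
Both ill-fitting examples quoted above contain a negation ("didnt really feel satisfied", "feel no less pain"). The sketch below flags texts with a negation cue so that their share could be measured per label. The cue list is an assumption made for this illustration, and a match is only a hint that a text deserves a closer look, not proof of a wrong label.

```python
import pandas as pd

# Toy texts taken from the examples quoted above; not a sample of the real data.
toy = pd.Series([
    'i feel good about the way im going to look that day',
    'i didnt really feel satisfied with the solution',
    'i feel no less pain feel no less resolved in my head about why',
])

# Assumed negation cues, written without apostrophes to match the pre-processed text
NEGATION_PATTERN = r'\b(?:not|no|never|didnt|dont|cant|wasnt|isnt)\b'

has_negation = toy.str.contains(NEGATION_PATTERN, regex=True)
print(has_negation)
print(f"Share with a negation cue: {has_negation.mean():.0%}")
```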

### Let's explore 'love'

In [29]:
explore('love')

Number of tweets for 'love': 34554
Percentage of the dataset: 8.29012809224369
Random indices: [25723, 11862, 21503, 28868, 17146]

Random 5 Tweets for emotion 'love':
1: i feel quite passionate about the subject of religious tolerance
2: i feel every ounce of his love in his adoring mouth every bit of more in his strong arms that hold me securely to him
3: i feel it when i enjoy the humidity of a summer day or talk about loving to read genealogies or want to really see the cows at the fair rather than just walk by them to say i did
4: i will return shortly to the lyrics but it must be said that as a whole these songs musically capture a feeling of longing and loneliness for which i suspect there is not as direct a comparison in most rock music
5: i dont know about my siblings since for the past two years they arent around everytime i go back but i feel very sympathetic for him


#### Initial thoughts on 'love':

As with 'joy', the labeling is a mixed bag.

### Let's explore 'sadness'

In [40]:
explore('sadness')

Number of tweets for 'sadness': 121187
Percentage of the dataset: 29.07494799776396
Random indices: [88468, 76346, 70461, 113469, 87187]

Random 5 Tweets for emotion 'sadness':
1: i feel terrible about myself
2: i feel it in my bones broke some records here en route to a multi stay run at put one out every year but they just released a greatest hits set direct hits with only two new tracks including the shot in the night with nothing seasonal on it
3: i were left alone for large chunks of time in the delivery room feeling helpless clueless and upset
4: i have thought id feel so heartbroken again
5: i always check my posing list that i accumulate beforehand i really have to know what kind of poses photos i m going to produce and take it from there otherwise i feel so blank during a shoot


#### Initial thoughts on 'sadness':

More of these read as 'sadness' than the 'love' and 'joy' texts read as their labels; some still seem mislabeled, though proportionally fewer. This is the second-largest cohort in the dataset (29.07%).

### Let's explore 'surprise'

In [46]:
explore('surprise', 15)

Number of tweets for 'surprise': 14972
Percentage of the dataset: 3.5920529547106708
Random indices: [5586, 148, 3583, 10757, 3001, 1792, 6451, 2819, 5562, 1267, 3772, 4799, 2690, 9363, 10623]

Random 15 Tweets for emotion 'surprise':
1: i feel so weird right now so far away from everyone just in my own fucking world doing whatever i please
2: i haven t had short hair in a long time and am feeling curious
3: i too feel surprised to be headed down this path
4: i really liked the book though it had a lot of good things to say and i thought the story was one in which many people could find enjoyment once they get past feeling shocked about some of the issues that come up
5: im feeling amazed by the blessings in this life of mine
6: i feel that i should give thanks for my amazing family
7: i had a feeling i was going to be less than impressed once j davey hit the stage because it was going to sound so craptacular
8: i like it when i came out of the cinema feeling impressed
9: i continued t

#### Initial thoughts on 'surprise':

The same comments apply as for 'sadness': most of these texts fit the label, while a smaller share seem mislabeled.

### General thoughts about the contents of these tweets:

This data appears to have been pre-processed: no apostrophes (either replaced with whitespace, as in "isn t", or dropped, as in "dont"), no numerical digits, all lowercase, and punctuation removed.<br>
<br>
In general this data seems to be more or less on target with the labels. There are some, of course, that may seem out of place for the label in question, but generally speaking the labels within each cohort seem to be on track.
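
The pre-processing observations above come from reading samples. A minimal sketch of how they could be checked across a whole column, shown on toy texts (the second one is made up to trigger every check):

```python
import pandas as pd

# Toy texts standing in for df['content']; not the real data.
toy = pd.Series(['i feel awful about it too because it s my job', 'I have 2 cats!'])

checks = pd.DataFrame({
    'has_digit': toy.str.contains(r'\d', regex=True),
    'has_uppercase': toy.str.contains(r'[A-Z]', regex=True),
    # anything other than word characters and whitespace counts as punctuation here
    'has_punctuation': toy.str.contains(r'[^\w\s]', regex=True),
})

# Number of texts that violate each assumption
print(checks.sum())
```

Run against the real `content` column, all three sums would be zero if the observations hold for every row.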

***
## 4 Clean the data

### Clean the data by removing any @mentions from the content

In [47]:
# remove @mentions from the 'content' column
# regex=True is required for the pattern to be treated as a regular expression
df['content'] = df['content'].str.replace(r'@\w+', '', regex=True)

# display the resulting DataFrame
print(df.head())

                                                  content sentiment
27383   i feel awful about it too because it s my job ...   sadness
110083                              im alone i feel awful   sadness
140764  ive probably mentioned this before but i reall...       joy
100071           i was feeling a little low few days back   sadness
2837    i beleive that i am much more sensitive to oth...      love
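
A small sketch of the mention removal on toy texts with made-up handles, since the head of the real data contains no mentions to demonstrate the effect. It also counts how many texts matched, which shows whether the step changes anything at all.

```python
import pandas as pd

# Toy texts with made-up handles; not the real data.
toy = pd.Series(['@someone i feel awful', 'thanks @a_friend for today', 'no mention here'])

# regex=True makes pandas treat the pattern as a regular expression;
# in pandas 2.0 and later the default is a literal string match
cleaned = toy.str.replace(r'@\w+', '', regex=True)

# How many texts actually contained a mention
n_with_mentions = toy.str.contains(r'@\w+', regex=True).sum()
print(cleaned.tolist())
print(n_with_mentions)
```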



### Clean the data by removing any whitespace from the front and end of each tweet

In [48]:
df['content'] = df['content'].str.strip()

# display the resulting DataFrame
print(df.head())

                                                  content sentiment
27383   i feel awful about it too because it s my job ...   sadness
110083                              im alone i feel awful   sadness
140764  ive probably mentioned this before but i reall...       joy
100071           i was feeling a little low few days back   sadness
2837    i beleive that i am much more sensitive to oth...      love
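
`str.strip()` only trims the two ends, so removing a mention from the middle of a text can leave a double space behind. A hedged sketch, on toy texts, of collapsing internal whitespace and spotting texts that end up empty:

```python
import pandas as pd

# Toy texts; not the real data.
toy = pd.Series(['  thanks  for today ', 'i feel awful', '   '])

# Collapse runs of whitespace into one space, then trim both ends
cleaned = toy.str.replace(r'\s+', ' ', regex=True).str.strip()

# Texts that are empty after cleaning carry no signal and may be worth dropping
is_empty = cleaned.eq('')
print(cleaned.tolist())
print(f"Empty after cleaning: {is_empty.sum()}")
```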
