# RNN/GRU Model: Text Classification & Sentiment Analysis

This project implements a neural network that classifies emotions in text. The dataset used for training and evaluation is a 20,000-line CSV file; each line contains a text string and one of six labels: 'anger', 'sadness', 'joy', 'love', 'fear' or 'surprise'.

A custom model was built from the ground up with Keras. The data is currently split into training and validation sets through `validation_split`; a separate test set is still a #TODO. Softmax activation and categorical cross-entropy loss were chosen to match the six-class output.

**The project includes comprehensive prints of relevant metrics and visualizations of training curves. #TODO viz**

---

- *Data Preprocessing*: Removing URL fragments, filler words, the ten most frequent words, and words that occur only once.
- *Model Architecture*:
   - *Activation Functions*: Softmax.
   - *Loss Function*: Categorical Crossentropy.
   - *Regularization Techniques*: Dropout in GRU-layer.
- *Evaluation*: n/a #TODO

- **Reflection and Future Improvements**:
   - The project reflects on the achieved performance and discusses potential enhancements.
   - Suggestions for further improvement are outlined, showcasing a critical evaluation of the project.

---

## Dataset Overview

The initial EDA provides insights into the dataset structure:

```plaintext
   text                                                    label
   count   20000                                           20000
   unique  19948                                           6
   top     i tend to stop breathing when i m feeling s...  joy
   freq    2                                               6761
```

The dataset comprises 20,000 entries, with 19,948 unique texts and six unique emotion labels. The most frequent emotion is 'joy' with 6,761 occurrences.

In [1]:
import pandas as pd
df = pd.read_csv('Data/emotions.csv')
df.describe()

Unnamed: 0,text,label
count,20000,20000
unique,19948,6
top,i tend to stop breathing when i m feeling stre...,joy
freq,2,6761


In [2]:
df.columns

Index(['text', 'label'], dtype='object')

### Class Distribution Analysis


In [3]:
df['label'].value_counts()

label
joy         6761
sadness     5797
anger       2709
fear        2373
love        1641
surprise     719
Name: count, dtype: int64

The classes are imbalanced: 'joy' and 'sadness' together account for roughly 63% of the rows, while 'surprise' makes up under 4%. Rebalancing or class weighting may be worth considering so that the model performs well across all six categories. #TODO
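
One common option for the imbalance noted above is class weighting. The sketch below is not part of the original notebook; it applies the widely used "balanced" heuristic (total samples divided by the number of classes times the class count) to the counts printed by `value_counts()`. The variable names are illustrative assumptions.

```python
# Class counts copied from the value_counts() output above.
class_counts = {
    "joy": 6761,
    "sadness": 5797,
    "anger": 2709,
    "fear": 2373,
    "love": 1641,
    "surprise": 719,
}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "Balanced" heuristic: weight = n_samples / (n_classes * class_count),
# so rarer classes receive proportionally larger weights.
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in class_weights.items():
    print(f"{label:<9} {weight:.2f}")
```

To use weights like these during training, they would presumably need to be re-keyed by integer class index before being passed to the `class_weight` argument of `model.fit`.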

In [4]:
df.sample(10)

Unnamed: 0,text,label
3224,i feel like all she wants is his parents fortu...,sadness
4386,i just think about all the day i chatted with ...,love
16451,i miss everybody i am still feeling relieved b...,joy
14320,i wasnt exactly sure how i was going to feel a...,joy
13768,i feel now i am not giving all of me to christ...,love
9067,i feel like im not as stubborn,anger
7932,im left feeling convinced this is another rela...,joy
10233,i feel fucked,anger
5230,im feeling like a tortured teen i decided to p...,fear
2344,i did not enjoy the feeling of the naughty kid...,love


## Data Cleaning

The first cleaning pass was done by running `df.sample()` a few times and picking out irrelevant words by eye.

The following URL-related elements and common words are identified for removal:

```
url_elements = ['http', 'href', 'www', 'src']
common_words = ['i', 'id', 'im', 'ive', 've', 'is', 'to', 'am', 'feel', 'feeling', 'your']
```

The `remove_words()` function is then applied to create a new 'text_clean' column.


In [5]:
def get_freq(column, search_term, return_rows=False, verbose=False):
    n, rows = 0, []
    for row, text in column.items():
        if search_term in str(text):
            n += 1
            rows.append(row)

    if verbose:
        print(f"The word '{search_term}' occurs {n} times in the column '{column.name}'.")
        if n > 0:
            print(f"It occurs in the following rows: {rows}")

    if return_rows:
        return n, rows
    else:
        return n

search_term = "href"
freq, rows = get_freq(df['text'], search_term, return_rows=True, verbose=True)

The word 'href' occurs 207 times in the column 'text'.
It occurs in the following rows: [3, 19, 88, 91, 200, 267, 351, 425, 560, 623, 731, 824, 959, 1103, 1314, 1372, 1402, 1543, 1657, 1700, 1737, 1803, 1873, 2024, 2026, 2213, 2225, 2253, 2623, 2969, 3016, 3478, 3525, 3568, 3631, 3823, 3832, 3882, 4071, 4079, 4138, 4142, 4144, 4335, 4365, 4405, 4667, 4690, 4722, 5009, 5011, 5117, 5438, 5596, 5634, 5684, 5757, 5964, 6023, 6050, 6448, 6513, 6761, 6765, 6821, 6913, 7003, 7216, 7320, 7325, 7401, 7597, 7876, 7877, 8112, 8341, 8652, 8675, 8872, 8880, 8956, 9084, 9161, 9320, 9359, 9363, 9454, 9480, 9487, 9488, 9638, 9724, 10124, 10228, 10327, 10403, 10456, 10506, 10564, 10625, 10718, 10959, 11003, 11024, 11082, 11304, 11306, 11360, 11368, 11409, 11460, 11587, 11600, 11603, 11686, 11734, 11792, 11978, 12102, 12235, 12298, 12352, 12530, 12545, 12617, 12633, 12693, 12976, 13006, 13019, 13154, 13252, 13274, 13295, 13442, 13466, 13581, 13594, 13609, 13755, 13757, 13791, 13793, 13832, 13918, 13922,

In [6]:
url_elements = ['http', 'href', 'www', 'src']
common_words = ['i', 'id', 'im', 'ive', 've', 'is', 'to', 'am', 'feel', 'feeling', 'your']

def remove_words(text_string, stop_words):
    words = text_string.split()
    filtered = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered)

df['text_clean'] = df['text'].apply(lambda x: remove_words(x, common_words + url_elements))

In [7]:
df.iloc[3]

text          i feel petty a href http clairee
label                                    anger
text_clean                     petty a clairee
Name: 3, dtype: object

Me too, I also feel petty a href http clairee.
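
Note that `get_freq()` uses a substring test (`search_term in str(text)`), so it reports the number of rows that contain the term anywhere, including inside longer words, and counts each row once. If whole-word occurrence counts are wanted instead, a regex with word boundaries is one way to get them. The sketch below is an assumption about how that could look; `count_whole_word` and the sample rows are hypothetical.

```python
import re

def count_whole_word(texts, term):
    """Return (total occurrences, indices of matching rows) for `term` as a whole word."""
    pattern = re.compile(rf"\b{re.escape(term)}\b")
    total, rows = 0, []
    for idx, text in enumerate(texts):
        hits = len(pattern.findall(str(text)))
        if hits:
            total += hits
            rows.append(idx)
    return total, rows

# Hypothetical sample rows, not taken from the dataset.
sample_texts = [
    "a href link",
    "hrefs are not the word href href",
    "nothing to see here",
]

total, rows = count_whole_word(sample_texts, "href")
print(total, rows)
```

Here "hrefs" is ignored because the word boundary fails before the trailing "s", while the second row still contributes two whole-word hits.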

In [8]:
df['label'].unique()

array(['anger', 'sadness', 'joy', 'love', 'fear', 'surprise'],
      dtype=object)

## Word Frequency Analysis
An analysis of word frequency identifies 8,668 words that occur only once. These words, together with the ten most frequent ones, are then removed from `text_clean`.

In [9]:
label_list = df['label'].unique().tolist()
for i in label_list:
    freq = get_freq(df['text'], i)
    print(f"{i} occurs {freq} times.")

anger occurs 155 times.
sadness occurs 35 times.
joy occurs 296 times.
love occurs 929 times.
fear occurs 152 times.
surprise occurs 102 times.


In [100]:
# Hol up, it's mini game time

pd.set_option('display.max_colwidth', None) 

def guess_the_label(df=df):
    sample = df.sample()
    print("What's that sentiment?")
    print((sample['text'].iloc[0]))
    a = input('Enter your guess: ')
    if str(a) == str(sample['label'].iloc[0]):
        print("Correct!")
    else:
        print(f"Too bad! It was {sample['label'].iloc[0]}")

guess_the_label()

What's that sentiment?
i am also posting this because i am trying to work on the writing i want my students to feel passionate about
Correct!


In [11]:
corpus_texts = df['text_clean'].tolist()

In [12]:
from collections import Counter

def get_words(corpus_texts=corpus_texts, n=10, least=False):
    all_text = ' '.join(corpus_texts)
    words = all_text.split()
    word_counts = Counter(words)

    if n is None:
        return word_counts
        
    if least:
        return word_counts.most_common()[-n-1:-1]
    
    return word_counts.most_common(n)

get_words()

[('and', 11996),
 ('the', 10462),
 ('a', 7748),
 ('that', 6314),
 ('of', 6182),
 ('my', 5326),
 ('in', 4239),
 ('it', 3922),
 ('like', 3616),
 ('so', 3127)]

In [13]:
new_stop_words = [w for w, c in get_words()]
new_stop_words

['and', 'the', 'a', 'that', 'of', 'my', 'in', 'it', 'like', 'so']

In [14]:
least_common = get_words(n=20, least=True)

In [15]:
least_common

[('murmur', 1),
 ('chastised', 1),
 ('rubble', 1),
 ('everyfuckingthing', 1),
 ('passive', 1),
 ('dissected', 1),
 ('biomedical', 1),
 ('mombasa', 1),
 ('digg', 1),
 ('unconcern', 1),
 ('hovel', 1),
 ('riots', 1),
 ('pipsqueak', 1),
 ('condemn', 1),
 ('dipped', 1),
 ('everchanging', 1),
 ('sailormoon', 1),
 ('gateway', 1),
 ('lathi', 1),
 ('sonam', 1)]

These all seem pretty irrelevant. But we can probably remove even more.
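
One detail about the `least=True` branch of `get_words()`: the slice `[-n-1:-1]` stops one element short of the end, so the very last entry of the ranking is left out. The toy example below (made-up words, not the dataset) compares it with two slices that do include the final entry; the reversed form is the idiom given in the Python `Counter` documentation.

```python
from collections import Counter

counts = Counter("a a a b b c d".split())
ranked = counts.most_common()  # sorted from most to least frequent
n = 2

skips_last = ranked[-n - 1:-1]        # slice used in get_words(): omits the final entry
last_n = ranked[-n:]                  # the n least common entries, in ranked order
last_n_reversed = ranked[:-n - 1:-1]  # same entries, least common first

print(skips_last)
print(last_n)
print(last_n_reversed)
```

For a list of singletons the practical difference is small, since all of the tail entries have a count of 1 anyway.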

In [16]:
Counter(get_words(corpus_texts, n=None).values()).most_common(10)

[(1, 8668),
 (2, 2457),
 (3, 1202),
 (4, 766),
 (5, 500),
 (6, 376),
 (7, 289),
 (8, 218),
 (9, 179),
 (10, 142)]

In [17]:
stop_words = [word for word, count in  get_words(df['text_clean'], n=None).items() if count == 1]

In [18]:
len(stop_words)

8668

In [19]:
df['text_clean'] = df['text_clean'].apply(lambda x:remove_words(x, stop_words+new_stop_words))
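
Because `remove_words()` checks `word.lower() not in stop_words` against a list with several thousand entries, every lookup scans the list. Converting the stop words to a set first makes each lookup constant time on average. The following is a minimal sketch under that assumption; `remove_words_fast` and the tiny stop list are hypothetical.

```python
def remove_words_fast(text_string, stop_words):
    """Drop every whitespace-separated token whose lowercase form is in `stop_words`."""
    return " ".join(w for w in text_string.split() if w.lower() not in stop_words)

# Hypothetical miniature stop list; in the notebook this would be
# something like set(stop_words + new_stop_words).
stop_set = frozenset(["and", "the", "a", "murmur"])

cleaned = remove_words_fast("The murmur and a hostile online community", stop_set)
print(cleaned)  # hostile online community
```

The result is the same as with a list; only the lookup cost changes.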

In [20]:
df.sample(10)

Unnamed: 0,text,label,text_clean
17810,i feel like i shouldnt bother people with these petty stupid little pathetic thoughts i feel like no one really would care to know what really goes on inside my head,anger,shouldnt bother people with these petty stupid little pathetic thoughts no one really would care know what really goes on inside head
7973,i felt like facebook was a catalyst for me to feel that way about myself and i started to see it as a bit of a hostile online community,anger,felt facebook was for me way about myself started see as bit hostile online community
1518,i feel sure the nervousness and fear will always lurk in my mind but i feel at ease in my heart hopeful about theo ad and eli being happy healthy and safe and living to be old people with fulfilled lives,joy,sure nervousness fear will always mind but at ease heart hopeful about ad being happy healthy safe living be old people with fulfilled lives
7809,i feel edmontonians are superior to the residents of any other major city but if you ask me what keeps me living here despite my obvious hatred for the climate of the year then my response is family and friends,joy,are superior residents any other major city but if you ask me what keeps me living here despite obvious hatred for climate year then response family friends
15872,i feel liked these days by both tom and myself,love,liked these days by both tom myself
2397,i ended the podcast feeling not depressed exactly but like i still didn t have a concrete answer for how to strike that balance that self help authors love to talk about,sadness,ended podcast not depressed exactly but still didn t have answer for how strike balance self help authors love talk about
18804,i am living with my dad and his wife in his new home and i feel very unwelcome here,sadness,living with dad his wife his new home very unwelcome here
3194,i want to come out about it but i feel so reluctant for some reason,fear,want come out about but reluctant for some reason
15574,im feeling rather inspired yet low i will enjoy my writing and even though i may be writing about morbid things i will find a way to make it interesting to read,joy,rather inspired yet low will enjoy writing even though may be writing about morbid things will find way make interesting read
15104,i would be feeling guilty of writing craps on my blog nothing useful nor beneficial to others,sadness,would be guilty writing on blog nothing useful nor beneficial others


In [21]:
df['label'].unique()

array(['anger', 'sadness', 'joy', 'love', 'fear', 'surprise'],
      dtype=object)

In [22]:
df['text_clean'].idxmax()

14941

In [23]:
max(df['text_clean'], key=len)

'more important than going fun ipad strategy games original boots from ugg wear ugg boots this winter low cost ugg boots uggs need get washed inside they are also you won t burdened with them speed up finances with payday loans payday loans monthly solution for you'

whAT, uh anyway

## Data Preparation for the Model

The next step involves preparing the data for the model. This includes label encoding and one-hot encoding for the target variable.

### One-Hot Encoding Labels

In [24]:
df['label'].unique()

array(['anger', 'sadness', 'joy', 'love', 'fear', 'surprise'],
      dtype=object)

In [91]:
from keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

integer_labels = label_encoder.fit_transform(df['label'])
one_hot_labels = to_categorical(integer_labels)

assert len(one_hot_labels) == 20000
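
To make the two encoding steps concrete, here is a dependency-free sketch of what they amount to: `LabelEncoder` assigns integer ids to the sorted unique labels, and one-hot encoding turns each id into a vector with a single 1. This is an illustration only, not the library implementation; `to_categorical` returns a float array rather than nested lists.

```python
labels = ["anger", "sadness", "joy", "love", "fear", "surprise", "joy"]

# Integer ids follow the alphabetical order of the unique labels.
classes = sorted(set(labels))
class_to_index = {name: i for i, name in enumerate(classes)}
integer_labels = [class_to_index[name] for name in labels]

# One-hot: a row of zeros with a single 1 at the class index.
one_hot = [
    [1 if i == idx else 0 for i in range(len(classes))]
    for idx in integer_labels
]

print(class_to_index)
print(integer_labels)
print(one_hot[0])
```

With the six labels in this dataset, the alphabetical ordering gives anger=0, fear=1, joy=2, love=3, sadness=4 and surprise=5, which is also the column order of the softmax output.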

### Tokenization

The text data is tokenized using the Keras Tokenizer:

In [26]:
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['text_clean'])

len(tokenizer.word_index) + 1

8404
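
The `+ 1` above accounts for index 0, which the tokenizer never assigns to a word and which is later used for padding. The sketch below imitates the core idea of the word index (most frequent word gets index 1) in plain Python. It is a simplification: the real `Tokenizer` also lowercases and strips punctuation, and the sample texts are made up.

```python
from collections import Counter

# Hypothetical mini corpus for illustration.
texts = [
    "pissed off over an old friend",
    "old friend made a huge difference",
    "old habits",
]

word_counts = Counter(word for text in texts for word in text.split())

# Most frequent word gets index 1; index 0 is left free for padding.
ranked = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)
word_index = {word: i for i, (word, _) in enumerate(ranked, start=1)}

vocab_size = len(word_index) + 1  # +1 for the reserved padding index 0

def texts_to_sequences(batch):
    # Words missing from the vocabulary are skipped (no OOV token configured).
    return [[word_index[w] for w in text.split() if w in word_index] for text in batch]

print(word_index["old"], word_index["friend"], vocab_size)
print(texts_to_sequences(["old friend unknown"]))
```

Unknown words simply disappear from the sequence here, which mirrors the default behaviour when no out-of-vocabulary token is set.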

In [27]:
tokenizer.document_count

20000

In [28]:
tokenizer.word_counts

OrderedDict([('pissed', 77),
             ('off', 277),
             ('over', 426),
             ('an', 700),
             ('old', 148),
             ('friend', 147),
             ('some', 750),
             ('friends', 253),
             ('found', 123),
             ('has', 500),
             ('made', 368),
             ('huge', 36),
             ('difference', 31),
             ('especially', 148),
             ('on', 1925),
             ('finger', 15),
             ('with', 2530),
             ('ring', 8),
             ('skin', 59),
             ('feels', 416),
             ('much', 618),
             ('less', 308),
             ('irritated', 79),
             ('also', 423),
             ('unfortunate', 55),
             ('nearly', 27),
             ('all', 1456),
             ('readers', 39),
             ('going', 606),
             ('meet', 53),
             ('man', 107),
             ('will', 887),
             ('be', 2193),
             ('african', 7),
             ('americans'

## Model Architecture

In [32]:
input_dim = len(tokenizer.word_index) + 1
input_dim

8404

In [33]:
input_length = max(len(sequence) for sequence in tokenizer.texts_to_sequences(df['text_clean']))
input_length

51

In [34]:
from keras.models import Sequential
from keras.layers import Embedding, GRU, Dense
from keras.optimizers import Adam

model = Sequential()
model.add(Embedding(input_dim=input_dim, output_dim=128, input_length=input_length))
model.add(GRU(64, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(6, activation='softmax'))
model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 51, 128)           1075712   
                                                                 
 gru_1 (GRU)                 (None, 64)                37248     
                                                                 
 dense_1 (Dense)             (None, 6)                 390       
                                                                 
Total params: 1113350 (4.25 MB)
Trainable params: 1113350 (4.25 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


The model consists of three layers:

- Embedding Layer: Maps the input sequences to dense vectors of fixed size (128).
- GRU (Gated Recurrent Unit) Layer: A recurrent layer with 64 units, using input dropout and recurrent dropout of 0.2 each for regularization.
- Dense Layer: The output layer with 6 units (for each emotion category) and softmax activation.

The model is compiled using the Adam optimizer and categorical cross-entropy loss, with accuracy as the evaluation metric.
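
The parameter counts in the summary can be reproduced by hand. The arithmetic below is a sketch that assumes the Keras default `reset_after=True` for the GRU, which gives each of the three weight sets (update gate, reset gate, candidate state) two bias vectors.

```python
vocab_size = 8404     # input_dim reported above
embedding_dim = 128
gru_units = 64
n_classes = 6

# One 128-dimensional vector per vocabulary index.
embedding_params = vocab_size * embedding_dim

# Three weight sets, each with an input kernel, a recurrent kernel
# and two bias vectors (assumes reset_after=True).
gru_params = 3 * (gru_units * (embedding_dim + gru_units) + 2 * gru_units)

# Weight matrix plus one bias per output class.
dense_params = gru_units * n_classes + n_classes

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
```

The embedding layer accounts for about 97% of the parameters, so vocabulary size is the main lever for shrinking the model.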

## Training the Model

To train the text classification model, the preprocessed data is converted into sequences and padded/truncated to a fixed length. The training process is then executed using the Keras `fit` method.

The model undergoes training for 10 epochs, with a batch size of 32. The training progress is displayed in terms of loss and accuracy.

In [92]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = tokenizer.texts_to_sequences(df['text_clean'])

X_train = pad_sequences(sequences, maxlen=input_length, padding='post', truncating='post')

In [94]:
X_train.shape

(20000, 51)
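
For reference, `padding='post'` and `truncating='post'` both operate on the end of each sequence. A plain-Python sketch of that behaviour, with a hypothetical helper name `pad_post`, could look like this:

```python
def pad_post(sequences, maxlen, value=0):
    """Cut each sequence to `maxlen` items from the end, then right-pad with `value`."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                            # truncating='post'
        padded.append(seq + [value] * (maxlen - len(seq)))  # padding='post'
    return padded

print(pad_post([[5, 2], [7, 1, 9, 4, 3]], maxlen=4))
```

Since `input_length` is the longest sequence in the dataset, nothing is truncated during training; truncation only matters for new inputs longer than 51 tokens.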

In [97]:
epochs = 10
batch_size = 32

model.fit(X_train, one_hot_labels, epochs=epochs, batch_size=batch_size, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x141d15fd0>

During training, Keras reports loss and accuracy for both the training data and the 20% validation split after every epoch, which shows how the model converges.

In [98]:
import numpy as np

def predict_label(model, tokenizer, input_text, input_length):
    sequence = tokenizer.texts_to_sequences([input_text])
    padded_sequence = pad_sequences(sequence, maxlen=input_length, padding='post', truncating='post')
    predicted_probabilities = model.predict(padded_sequence)
    predicted_label = label_encoder.inverse_transform([np.argmax(predicted_probabilities)])

    return predicted_label[0]

a = ["I", "You", "He", "She", "They"]

for i in a:
    predicted_label = predict_label(model, tokenizer, i.lower() + " very happy", input_length)
    print(f"Predicted Label: {predicted_label}")


Predicted Label: joy
Predicted Label: joy
Predicted Label: joy
Predicted Label: joy
Predicted Label: joy
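
The spot check above only probes the largest class. Given the class imbalance, per-class precision and recall would say more about how the model handles rare labels such as 'surprise'. The sketch below shows one way to compute them from paired label lists; the helper name and the example labels are hypothetical and are not model output.

```python
from collections import Counter

def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall)} computed from paired label lists."""
    tp, pred_count, true_count = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        true_count[t] += 1
        pred_count[p] += 1
        if t == p:
            tp[t] += 1
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        precision = tp[label] / pred_count[label] if pred_count[label] else 0.0
        recall = tp[label] / true_count[label] if true_count[label] else 0.0
        report[label] = (precision, recall)
    return report

# Hypothetical labels for illustration only, not model output.
y_true = ["joy", "joy", "sadness", "surprise", "anger", "surprise"]
y_pred = ["joy", "joy", "joy", "joy", "anger", "surprise"]

report = per_class_report(y_true, y_pred)
for label, (precision, recall) in report.items():
    print(f"{label:<9} precision={precision:.2f} recall={recall:.2f}")
```

In this toy case overall accuracy is 4 out of 6, yet 'sadness' is never predicted, which is exactly the kind of failure a single accuracy number hides.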


#TODO viz metrics and split test 
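
For the test-split TODO, one possible approach is a stratified split that keeps the label proportions the same in every subset. The sketch below is an assumption about how that might be done with the standard library only; `stratified_split`, the fractions and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_frac=0.1, test_frac=0.1, seed=42):
    """Return (train, val, test) index lists that preserve label proportions."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_frac)
        n_test = round(len(indices) * test_frac)
        val.extend(indices[:n_val])
        test.extend(indices[n_val:n_val + n_test])
        train.extend(indices[n_val + n_test:])
    return train, val, test

# Hypothetical toy labels; in the notebook this would be df['label'].
labels = ["joy"] * 60 + ["sadness"] * 30 + ["surprise"] * 10
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

This differs from `validation_split=0.2` in `model.fit`, which takes the last 20% of the rows as provided, without stratifying by label. The returned indices could then be used to slice the padded sequences and the one-hot labels.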

## Rows of Interest

| Row   | Text | Label |
| ----- | ---- | ----- |
| 5578  | i now don t want to feel slutty | love |
| 12380 | i gotta feeling da bul taewuhbeoryeo burn it up i gotta feeling niga ulbujitneun nal ah neoneun wiheomhae gal ttaekkaji gatsseo get away woooo becuz i m cuz i m dangerous b | anger |
| 16503 | i do not feel insecure or unsafe | fear |
| 18330 | i more important than going fun ipad strategy games original boots from ugg wear ugg boots this winter low cost ugg boots uggs need to get washed inside they are also lightweight so you won t feel burdened with them speed up finances with payday loans payday loans the monthly solution for you | sadness |