# Emotion detection in text

## 1. Exploring the dataset

In [2]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv("emotion.csv")

In [4]:
df.head()

Unnamed: 0,Emotion,Text
0,neutral,Why ?
1,joy,Sage Act upgrade on my to do list for tommorow.
2,sadness,ON THE WAY TO MY HOMEGIRL BABY FUNERAL!!! MAN ...
3,joy,Such an eye ! The true hazel eye-and so brill...
4,joy,@Iluvmiasantos ugh babe.. hugggzzz for u .! b...


In [5]:
df['Emotion'].value_counts()

joy         11045
sadness      6722
fear         5410
anger        4297
surprise     4062
neutral      2254
disgust       856
shame         146
Name: Emotion, dtype: int64
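The counts above are far from balanced, which matters when reading any accuracy figure later on. As a rough sketch, the snippet below copies the counts from the output and computes each class's share, along with the accuracy that a hypothetical model would get by always predicting the most frequent label. The variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {
    "joy": 11045,
    "sadness": 6722,
    "fear": 5410,
    "anger": 4297,
    "surprise": 4062,
    "neutral": 2254,
    "disgust": 856,
    "shame": 146,
}

total = sum(counts.values())

# Accuracy of a trivial model that always predicts the most frequent class.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

for label, n in counts.items():
    print(f"{label:9s} {n / total:6.1%}")
print(f"majority-class baseline ({majority_label}): {baseline_accuracy:.3f}")
```

Any trained model should be compared against this baseline rather than against zero.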

## 2. Pre-processing 

In [53]:
# Justin
'''neutral
fear
angry
sad
happy
surprise
disgust'''

'neutral\nfear\nangry\nsad\nhappy\nsurprise\ndisgust'

In [6]:
!pip install neattext

Collecting neattext
  Downloading neattext-0.1.3-py3-none-any.whl (114 kB)
Installing collected packages: neattext
Successfully installed neattext-0.1.3


In [7]:
import neattext.functions as nfx

Before cleaning the text, we call dir() on the module to list the functions and constants that neattext provides for data cleaning.

In [8]:
dir(nfx)

['BTC_ADDRESS_REGEX',
 'CURRENCY_REGEX',
 'CURRENCY_SYMB_REGEX',
 'Counter',
 'DATE_REGEX',
 'EMAIL_REGEX',
 'EMOJI_REGEX',
 'HASTAG_REGEX',
 'MASTERCard_REGEX',
 'MD5_SHA_REGEX',
 'MOST_COMMON_PUNCT_REGEX',
 'NUMBERS_REGEX',
 'PHONE_REGEX',
 'PoBOX_REGEX',
 'SPECIAL_CHARACTERS_REGEX',
 'STOPWORDS',
 'STOPWORDS_de',
 'STOPWORDS_en',
 'STOPWORDS_es',
 'STOPWORDS_fr',
 'STOPWORDS_ru',
 'STOPWORDS_yo',
 'STREET_ADDRESS_REGEX',
 'TextFrame',
 'URL_PATTERN',
 'USER_HANDLES_REGEX',
 'VISACard_REGEX',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__generate_text',
 '__loader__',
 '__name__',
 '__numbers_dict',
 '__package__',
 '__spec__',
 '_lex_richness_herdan',
 '_lex_richness_maas_ttr',
 'clean_text',
 'defaultdict',
 'digit2words',
 'extract_btc_address',
 'extract_currencies',
 'extract_currency_symbols',
 'extract_dates',
 'extract_emails',
 'extract_emojis',
 'extract_hashtags',
 'extract_html_tags',
 'extract_mastercard_addr',
 'extract_md5sha',
 'extract_numbers',
 'extr

In [9]:
# user handles
df['Clean_Text'] = df['Text'].apply(nfx.remove_userhandles)

We then use apply() again to run remove_stopwords on the Clean_Text column, so that this column holds the text with both user handles and stop words removed.

In [34]:
# stopwords
df['Clean_Text'] = df['Clean_Text'].apply(nfx.remove_stopwords)
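To illustrate what these two cleaning steps do to a single string, here is a minimal sketch in plain Python. It is an approximation, not neattext's implementation: the `@\w+` pattern and the tiny stop-word set are assumptions made for the example, and neattext ships its own regex and a much larger stop-word list.

```python
import re

# Hypothetical, deliberately tiny stop-word set (neattext's list is much larger).
STOPWORDS = {"on", "my", "to", "do", "for", "the", "a", "an"}

# Assumed handle pattern for illustration; not neattext's own regex.
HANDLE_RE = re.compile(r"@\w+")


def remove_userhandles(text):
    """Delete @handle tokens from the text."""
    return HANDLE_RE.sub("", text)


def remove_stopwords(text):
    """Drop whitespace-separated tokens that appear in STOPWORDS."""
    return " ".join(tok for tok in text.split() if tok.lower() not in STOPWORDS)


raw = "@someone Sage Act upgrade on my to do list for tomorrow."
cleaned = remove_stopwords(remove_userhandles(raw))
print(cleaned)
```

Note that punctuation stays attached to its token here, which matches the cleaned rows shown in the notebook output (for example `tommorow.` keeps its full stop).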

In [11]:
df

Unnamed: 0,Emotion,Text,Clean_Text
0,neutral,Why ?,?
1,joy,Sage Act upgrade on my to do list for tommorow.,Sage Act upgrade list tommorow.
2,sadness,ON THE WAY TO MY HOMEGIRL BABY FUNERAL!!! MAN ...,WAY HOMEGIRL BABY FUNERAL!!! MAN HATE FUNERALS...
3,joy,Such an eye ! The true hazel eye-and so brill...,eye ! true hazel eye-and brilliant ! Regular f...
4,joy,@Iluvmiasantos ugh babe.. hugggzzz for u .! b...,ugh babe.. hugggzzz u .! babe naamazed nga ako...
...,...,...,...
34787,surprise,@MichelGW have you gift! Hope you like it! It'...,gift! Hope like it! hand wear ! It'll warm! Lol
34788,joy,The world didnt give it to me..so the world MO...,world didnt me..so world DEFINITELY cnt away!!!
34789,anger,A man robbed me today .,man robbed today .
34790,fear,"Youu call it JEALOUSY, I call it of #Losing YO...","Youu JEALOUSY, #Losing YOU..."


## 3. Importing ML packages

In [35]:
# Estimators
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Transformers
from sklearn.feature_extraction.text import CountVectorizer

# Data splitting and evaluation metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

## 4. Model features and labels

**Features** are the input variables extracted from the dataset. The model receives them during training and learns from them. Our features are the cleaned texts in the Clean_Text column.

**Labels** are the target variable, that is, the value the model learns to predict. Our label is the Emotion column.

In [29]:
Xfeatures = df['Clean_Text']
ylabels = df['Emotion']

## 5. Data splitting and pipeline

In [40]:
# split data
x_train,x_test,y_train,y_test = train_test_split(Xfeatures,ylabels,test_size=0.3,random_state=42)
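The split above is purely random. With classes as uneven as these, one option worth considering is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. The sketch below uses a toy label list and `test_size=0.5` so the numbers come out evenly; it is not the notebook's data or its chosen split.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-ins for Xfeatures / ylabels, with an intentionally rare class.
texts = [f"text {i}" for i in range(20)]
labels = ["joy"] * 10 + ["sadness"] * 6 + ["shame"] * 4

x_train, x_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.5,
    random_state=42,
    stratify=labels,  # preserve class proportions in both splits
)

print("train:", Counter(y_train))
print("test: ", Counter(y_test))
```

Without `stratify`, a rare class can end up under-represented in either part purely by chance.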

The pipeline has two stages. The first stage is CountVectorizer, which converts the raw text into a matrix of token counts that the model can take as numeric input.
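To make the first stage concrete, here is a small sketch of CountVectorizer on two made-up sentences. Each row of the resulting matrix is one document and each column is one vocabulary word. The sentences are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["happy happy day", "sad day"]

cv = CountVectorizer()
X = cv.fit_transform(docs)  # sparse matrix: documents x vocabulary

# Vocabulary words ordered by their column index.
vocab = sorted(cv.vocabulary_, key=cv.vocabulary_.get)

print(vocab)
print(X.toarray())
```

Word order is discarded in this representation; only how often each word occurs is kept.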

The second stage trains a LogisticRegression classifier on those counts. During training, the model learns a weight for each token and each emotion, and it uses these weights to predict the emotion of new text.

In [41]:
from sklearn.pipeline import Pipeline

In [42]:
# LogisticRegression pipeline
pipe_lr = Pipeline(steps=[('cv',CountVectorizer()),('lr',LogisticRegression())])

## 6. Model fitting

After defining the two stages, we fit the pipeline to the training set, that is, to x_train and y_train.

In [44]:
# train and fit data
pipe_lr.fit(x_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Pipeline(steps=[('cv', CountVectorizer()), ('lr', LogisticRegression())])
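The message above says the solver stopped at its iteration limit before converging, and it suggests raising `max_iter`. A minimal sketch of that change follows, using a toy training set in place of x_train and y_train. The value 1000 is an arbitrary choice for illustration and is not tuned for the notebook's data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in for x_train / y_train; not the notebook's data.
texts = [
    "i am so happy today",
    "what a joyful day",
    "i feel sad and lonely",
    "this is so sad",
]
labels = ["joy", "joy", "sadness", "sadness"]

# Same two stages as pipe_lr, with a higher iteration cap for the solver.
pipe = Pipeline(
    steps=[
        ("cv", CountVectorizer()),
        ("lr", LogisticRegression(max_iter=1000)),
    ]
)
pipe.fit(texts, labels)

pred = pipe.predict(["such a happy day"])
print(pred)
```

Whether a higher limit changes the test accuracy on the real dataset would need to be checked by rerunning the notebook.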

In [45]:
pipe_lr

Pipeline(steps=[('cv', CountVectorizer()), ('lr', LogisticRegression())])

In [49]:
# check accuracy
pipe_lr.score(x_test,y_test)

0.6200421536692853
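A single accuracy number does not show which emotions are being confused with one another. The notebook imports `confusion_matrix` but does not call it, so here is a hedged sketch of how it could be used, on a handful of made-up true and predicted labels rather than on the real y_test.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration; in the notebook these would be
# y_test and pipe_lr.predict(x_test).
y_true = ["joy", "joy", "sadness", "fear", "joy", "sadness"]
y_pred = ["joy", "joy", "joy", "fear", "joy", "sadness"]

labels = ["fear", "joy", "sadness"]  # fixes the row/column order

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, labels=labels)

print(f"accuracy: {acc:.3f}")
print(cm)  # rows are true labels, columns are predicted labels
```

In this toy case the only error is one `sadness` example predicted as `joy`, which appears as the off-diagonal entry in the last row.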

In [47]:
# make a prediction
sample1 = "This chocolate was very sweet it made me happy"

In [48]:
# actual prediction
pipe_lr.predict([sample1])

array(['joy'], dtype=object)

In [56]:
text_1 = "this is all wrong I shouldn't be up here I should be back in school on the other side of the ocean yet you all come to us young people for Hope how dare you you have stolen my dreams and my childhood with your empty words and yet I'm one of the lucky ones people are suffering people are dying entire ecosystems are collapsing we are in the beginning of a mass extinction and all you can talk about is the money and fairy tales of Eternal economic growth how dare you"


In [57]:
pipe_lr.predict([text_1])

array(['fear'], dtype=object)

In [58]:
text_2 = "thank you thank you thank you thank you to the academy for this all 6,000 members thank you to the other nominees all these performances were impeccable in my opinion I didn't see a false note anywhere I want to thank valet or director Jennifer garnered with daily there's a few things about three things to my account that I need each day one of them is something to look up to another is something to look forward to in another is someone to chase now first off I want to thank God because that's who I look up to he's great my life with opportunities that I know are not of my hand or any other human hand he is showing me that it's a scientific fact that gratitude reciprocates"


In [59]:
pipe_lr.predict([text_2])

array(['surprise'], dtype=object)

In [61]:
text_3 = "I used to think the whole purpose of life was pursuing happiness everyone said the path to happiness was success so I searched for that ideal job that perfect boyfriend that beautiful apartment but instead of ever feeling fulfilled I felt anxious and the drift and I wasn't alone my friends they struggled with this too"


In [62]:
pipe_lr.predict([text_3])

array(['joy'], dtype=object)
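`predict` returns only the top class, which can hide how close the alternatives were, especially for long passages with mixed sentiment like the ones above. One way to look further is `predict_proba`, which returns one probability per class in the order of `classes_`. The sketch below uses a toy pipeline and toy data, so its numbers say nothing about pipe_lr itself.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data for illustration only.
texts = [
    "i am so happy today",
    "what a joyful happy day",
    "i feel sad and lonely",
    "this is so sad",
    "i am scared of the dark",
    "that noise scared me",
]
labels = ["joy", "joy", "sadness", "sadness", "fear", "fear"]

pipe = Pipeline(
    steps=[
        ("cv", CountVectorizer()),
        ("lr", LogisticRegression(max_iter=1000)),
    ]
)
pipe.fit(texts, labels)

# One probability per class, aligned with pipe.classes_.
probs = pipe.predict_proba(["i feel happy but a little scared"])[0]

for label, p in sorted(zip(pipe.classes_, probs), key=lambda pair: -pair[1]):
    print(f"{label:8s} {p:.3f}")
```

Applied to the real pipeline, the equivalent call would be `pipe_lr.predict_proba([text_3])`, read against `pipe_lr.classes_`.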