<a href="https://colab.research.google.com/github/themadan/p7.Emotion-detection/blob/master/Untitled0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# References

* [15 basic step for NLP](https://github.com/themadan/12.NLP-ear-and-tongue-sound-and-text-/blob/master/15_natural_language_processing.ipynb)

* [Microsoft developer](https://devblogs.microsoft.com/cse/2015/11/29/emotion-detection-and-recognition-from-text-using-deep-learning/)
* [Medium](https://medium.com/the-research-nest/applied-machine-learning-part-3-3fd405842a18) <br> [Work](https://github.com/aditya-xq/Text-Emotion-Detection-Using-NLP)
* [Compete web application](https://github.com/maelfabien/Multimodal-Emotion-Recognition)

* [git](https://github.com/Harsh24893/EmotionRecognition)
* [Notebook](https://github.com/abishekarun/Text-Emotion-Classification/blob/master/emotion_classification.ipynb)



# Import Packages

In [0]:
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from scipy.stats import itemfreq
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer,HashingVectorizer
from sklearn.pipeline import Pipeline
import nltk
from nltk.corpus import stopwords
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix




In [297]:
from google.colab import drive
drive.mount('/drive')

Drive already mounted at /drive; to attempt to forcibly remount, call drive.mount("/drive", force_remount=True).


In [0]:
columns=['emotion','content']
data = pd.read_csv('/drive/My Drive/Fusemachines Nepal/NLP/ISEAR.csv',names=columns)

In [299]:
data.head()

Unnamed: 0,emotion,content
0,joy,On days when I feel close to my partner and ot...
1,fear,Every time I imagine that someone I love or I ...
2,anger,When I had been obviously unjustly treated and...
3,sadness,When I think about the short time that we live...
4,disgust,At a gathering I found myself involuntarily si...


In [300]:
data.describe()

Unnamed: 0,emotion,content
count,7446,7446
unique,7,7379
top,joy,When my grandfather died.
freq,1082,8


* We have 7 emotion category.
* We have Total of 7446 data.
* Class joy has the highest number of data.

In [301]:
data.shape

(7446, 2)

In [302]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7446 entries, 0 to 7445
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   emotion  7446 non-null   object
 1   content  7446 non-null   object
dtypes: object(2)
memory usage: 174.5+ KB


## Number of data for each emotion.

In [303]:
data.emotion.value_counts()

joy        1082
sadness    1074
anger      1069
fear       1063
disgust    1059
shame      1059
guilt      1040
Name: emotion, dtype: int64

* We have 7 emotion types
* The data seems to be quite balanced

# **Clean Text**

In [304]:
data['content'][0]
print(data['content'][0])
print(data['content'].str.len())

On days when I feel close to my partner and other friends.   
When I feel at peace with myself and also experience a close  
contact with people whom I regard greatly.
0       167
1        92
2        88
3       139
4       144
       ... 
7441    181
7442     72
7443    875
7444    358
7445     90
Name: content, Length: 7446, dtype: int64


## **Remove all the new line characters**

In [305]:
data['content'] = data['content'].str.replace('\n', '')
print(data['content'][0])
print(data['content'].str.len())

On days when I feel close to my partner and other friends.   When I feel at peace with myself and also experience a close  contact with people whom I regard greatly.
0       165
1        91
2        87
3       137
4       142
       ... 
7441    179
7442     71
7443    862
7444    353
7445     89
Name: content, Length: 7446, dtype: int64


## **Replace full stop with blank**

In [306]:
data['content'] = data['content'].str.replace('.', '')
print(data['content'][0])
print(data['content'].str.len())

On days when I feel close to my partner and other friends   When I feel at peace with myself and also experience a close  contact with people whom I regard greatly
0       163
1        90
2        86
3       136
4       141
       ... 
7441    177
7442     70
7443    853
7444    349
7445     88
Name: content, Length: 7446, dtype: int64


## **Remove irrelevant characters other than alphanumeric and space**

In [307]:
data['content']=data['content'].str.replace('[^A-Za-z0-9\s]+', '')
print(data['content'][0])
print(data['content'].str.len())

On days when I feel close to my partner and other friends   When I feel at peace with myself and also experience a close  contact with people whom I regard greatly
0       163
1        89
2        86
3       136
4       141
       ... 
7441    177
7442     70
7443    852
7444    348
7445     88
Name: content, Length: 7446, dtype: int64


## **Remove links from the text**

In [308]:
data['content']=data['content'].str.replace('http\S+|www.\S+', '', case=False)
print(data['content'][0])
print(data['content'].str.len())

On days when I feel close to my partner and other friends   When I feel at peace with myself and also experience a close  contact with people whom I regard greatly
0       163
1        89
2        86
3       136
4       141
       ... 
7441    177
7442     70
7443    852
7444    348
7445     88
Name: content, Length: 7446, dtype: int64


## **Convert everything to lowercase**

In [309]:
data['content']=data['content'].str.lower()
print(data['content'][0])
print(data['content'].str.len())

on days when i feel close to my partner and other friends   when i feel at peace with myself and also experience a close  contact with people whom i regard greatly
0       163
1        89
2        86
3       136
4       141
       ... 
7441    177
7442     70
7443    852
7444    348
7445     88
Name: content, Length: 7446, dtype: int64


## **Removing Punctuation, Symbols**

In [310]:
data['content'] = data['content'].str.replace('[^\w\s]',' ')
print(data['content'][0])
print(data['content'].str.len())

on days when i feel close to my partner and other friends   when i feel at peace with myself and also experience a close  contact with people whom i regard greatly
0       163
1        89
2        86
3       136
4       141
       ... 
7441    177
7442     70
7443    852
7444    348
7445     88
Name: content, Length: 7446, dtype: int64


## **Removing Stop Words using NLTK**

In [311]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
data['content'] = data['content'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))

print(data['content'][0])
print(data['content'].str.len())

days feel close partner friends feel peace also experience close contact people regard greatly
0        94
1        72
2        50
3        62
4       102
       ... 
7441    110
7442     29
7443    474
7444    226
7445     48
Name: content, Length: 7446, dtype: int64


## **Lemmatisation**

In [312]:
from textblob import Word

data['content'] = data['content'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
print(data['content'][0])
print(data['content'].str.len())

day feel close partner friend feel peace also experience close contact people regard greatly
0        92
1        72
2        50
3        61
4       101
       ... 
7441    109
7442     29
7443    468
7444    225
7445     47
Name: content, Length: 7446, dtype: int64


## **Correcting Letter Repetitions**

In [313]:
import re
def de_repeat(text):
    pattern = re.compile(r"(.)\1{2,}")
    return pattern.sub(r"\1\1", text)

data['content'] = data['content'].apply(lambda x: " ".join(de_repeat(x) for x in x.split()))

print(data['content'][0])
print(data['content'].str.len())

day feel close partner friend feel peace also experience close contact people regard greatly
0        92
1        72
2        50
3        61
4       101
       ... 
7441    109
7442     29
7443    468
7444    225
7445     47
Name: content, Length: 7446, dtype: int64


# **Assign Target Variable**

In [314]:
target=data.emotion
data = data.drop(['emotion'],axis=1)
print(target)
print(data)

0           joy
1          fear
2         anger
3       sadness
4       disgust
         ...   
7441      anger
7442    sadness
7443    disgust
7444      shame
7445      guilt
Name: emotion, Length: 7446, dtype: object
                                                content
0     day feel close partner friend feel peace also ...
1     every time imagine someone love could contact ...
2     obviously unjustly treated possibility elucida...
3     think short time live relate period life think...
4     gathering found involuntarily sitting next two...
...                                                 ...
7441  last week planned play tennis booked tennis co...
7442                      ill stay hospital period time
7443  day back waiting bus bus stop getting bus prep...
7444  day back tutorial class teacher randomly assig...
7445    quarrelled sister deliberately messed belonging

[7446 rows x 1 columns]


# **LabelEncoder for target**

In [316]:
le=LabelEncoder()
target=le.fit_transform(target)
print(target)

[4 2 0 ... 1 6 3]


# **Split Data into train & test**

In [323]:
X_train, X_test, y_train, y_test = train_test_split(data,target,stratify=target,test_size=0.4, random_state=42)
print(X_train)
print(y_train)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)



                                                content
504                went home train sicilia molested man
2282  met girl 22 year old married liked asked date ...
3476          sad first boyfriend finished relationship
6478                 heard friend started drinking beer
2250  possibility getting better professional life v...
...                                                 ...
4074  aunt phoned ask refused invitation dinner home...
6622  wet head bed one day sister discovered reporte...
2758  feeling unable preserve one idea ambition inno...
1465  angry several driver showed aggressive dangero...
2824  possibility act certain activity better done r...

[4467 rows x 1 columns]
[1 1 5 ... 5 0 3]
(4467, 1)
(4467,)
(2979, 1)
(2979,)


# **Tokenization**

## **1.Term Frequency - Inverse Document Frequency (TF-IDF)**

In [329]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train.content)
X_test_counts =count_vect.transform(X_test.content)
print('Shape of Term Frequency Matrix: ',X_train_counts.shape)

Shape of Term Frequency Matrix:  (4467, 6372)


In [331]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
lsvm = SGDClassifier(alpha=0.001, random_state=5, max_iter=15, tol=None)
lsvm.fit(X_train_counts,y_train)
y_pred = lsvm.predict(X_test_counts)
print('lsvm using count vectors accuracy %s' % accuracy_score(y_pred, y_val))

ValueError: ignored

In [237]:
# last 50 features
print(vect.get_feature_names()[-50:])

['wurm', 'xmas', 'xrays', 'yard', 'yastrebetz', 'yavanna', 'ye', 'year', 'yearold', 'yearrs', 'yearscourse', 'yeaterday', 'yell', 'yelled', 'yelling', 'yellow', 'yes', 'yesterday', 'yet', 'yield', 'yielding', 'york', 'young', 'younger', 'youngest', 'youngish', 'youngster', 'youngstters', 'yournals', 'youth', 'yr', 'yugoslavia', 'yukky', 'zalu', 'zambezi', 'zambia', 'zcbc', 'zealand', 'zealander', 'zeeland', 'zemba', 'zero', 'zesco', 'zhu', 'zigzagging', 'zip', 'zipper', 'zomba', 'zombie', 'zone']


In [238]:
# show vectorizer options
vect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [174]:
# use default options for CountVectorizer
vect = CountVectorizer()

# create document-term matrices
train_dtm = vect.fit_transform(X_train)
test_dtm = vect.transform(X_test)

# use Naive Bayes to predict the star rating
nb = MultinomialNB()
nb.fit(train_dtm, y_train)
y_pred_class = nb.predict(test_dtm)

# calculate accuracy
print (accuracy_score(y_test, y_pred_class))

0.5516778523489932


In [176]:
# CountVectorizer
vect = CountVectorizer()
pd.DataFrame(vect.fit_transform(X_train).toarray(), columns=vect.get_feature_names())

Unnamed: 0,00,10,100,1011,102,10t,10th,10year,10yrs,11,110,110kmh,1130,11months,11th,12,120,1200,1230,1283,12th,12yearold,13,13th,14,1400,15,150,1500,1516,16,16yearold,17,18,180,18th,19,1960,1966,1968,...,yearscourse,yeaterday,yell,yelled,yelling,yellow,yes,yesterday,yet,yield,yielding,york,young,younger,youngest,youngish,youngster,youngstters,yournals,youth,yr,yugoslavia,yukky,zalu,zambezi,zambia,zcbc,zealand,zealander,zeeland,zemba,zero,zesco,zhu,zigzagging,zip,zipper,zomba,zombie,zone
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6696,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6697,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6698,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6699,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [178]:
# CountVectorizer
vect = CountVectorizer()
pd.DataFrame(vect.fit_transform(X_train).toarray(), columns=vect.get_feature_names())

Unnamed: 0,00,10,100,1011,102,10t,10th,10year,10yrs,11,110,110kmh,1130,11months,11th,12,120,1200,1230,1283,12th,12yearold,13,13th,14,1400,15,150,1500,1516,16,16yearold,17,18,180,18th,19,1960,1966,1968,...,yearscourse,yeaterday,yell,yelled,yelling,yellow,yes,yesterday,yet,yield,yielding,york,young,younger,youngest,youngish,youngster,youngstters,yournals,youth,yr,yugoslavia,yukky,zalu,zambezi,zambia,zcbc,zealand,zealander,zeeland,zemba,zero,zesco,zhu,zigzagging,zip,zipper,zomba,zombie,zone
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6696,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6697,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6698,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6699,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


# Count Vectors

In [0]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(analyzer='word')
count_vect.fit(data['content'])
X_train_count =  count_vect.transform(X_train)
X_val_count =  count_vect.transform(X_val)

In [240]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
lsvm = SGDClassifier(alpha=0.001, random_state=5, max_iter=15, tol=None)
lsvm.fit(X_train_count, y_train)
y_pred = lsvm.predict(X_val_count)
print('lsvm using count vectors accuracy %s' % accuracy_score(y_pred, y_val))

lsvm using count vectors accuracy 0.5919463087248322
