
# References

* [15 basic steps for NLP](https://github.com/themadan/12.NLP-ear-and-tongue-sound-and-text-/blob/master/15_natural_language_processing.ipynb)
* [Microsoft developer blog](https://devblogs.microsoft.com/cse/2015/11/29/emotion-detection-and-recognition-from-text-using-deep-learning/)
* [Medium](https://medium.com/the-research-nest/applied-machine-learning-part-3-3fd405842a18) ([Work](https://github.com/aditya-xq/Text-Emotion-Detection-Using-NLP))
* [Complete web application](https://github.com/maelfabien/Multimodal-Emotion-Recognition)
* [Git repository](https://github.com/Harsh24893/EmotionRecognition)
* [Notebook](https://github.com/abishekarun/Text-Emotion-Classification/blob/master/emotion_classification.ipynb)



# Import Packages

In [0]:
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from scipy.stats import itemfreq
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer,HashingVectorizer
from sklearn.pipeline import Pipeline




In [15]:
from google.colab import drive
drive.mount('/drive')

Drive already mounted at /drive; to attempt to forcibly remount, call drive.mount("/drive", force_remount=True).


In [0]:
columns=['sentiment','content']
data = pd.read_csv('/drive/My Drive/Fusemachines Nepal/NLP/ISEAR.csv',names=columns)

In [18]:
data.head()

| | sentiment | content |
|---|---|---|
| 0 | joy | On days when I feel close to my partner and ot... |
| 1 | fear | Every time I imagine that someone I love or I ... |
| 2 | anger | When I had been obviously unjustly treated and... |
| 3 | sadness | When I think about the short time that we live... |
| 4 | disgust | At a gathering I found myself involuntarily si... |


In [19]:
data.shape

(7446, 2)

In [20]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7446 entries, 0 to 7445
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   sentiment  7446 non-null   object
 1   content    7446 non-null   object
dtypes: object(2)
memory usage: 174.5+ KB


In [22]:
data.sentiment.value_counts()

joy        1082
sadness    1074
anger      1069
fear       1063
disgust    1059
shame      1059
guilt      1040
Name: sentiment, dtype: int64

# Clean Text

Remove irrelevant characters other than alphanumeric and space

In [0]:
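# regex=True is required in pandas >= 2.0, where str.replace defaults to literal matching
data['content']=data['content'].str.replace(r'[^A-Za-z0-9\s]+', '', regex=True)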

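Remove links from the text. Because punctuation was already stripped in the previous step, a URL now appears as a single token such as `httpexamplecom`; the pattern below removes any token beginning with `http` or `www`.

In [0]:
data['content']=data['content'].str.replace(r'http\S+|www.\S+', '', case=False, regex=True)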

Convert everything to lowercase

In [0]:
data['content']=data['content'].str.lower()
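
The three cleaning steps can be tried on a couple of made-up sentences to see what each one does. This is a minimal sketch, not the notebook's exact code: the sample sentences are invented, the link removal is run first (so that `://` and `.` are still present to anchor the match), and the final whitespace collapse is an extra step that the notebook does not perform.

```python
import pandas as pd

# Toy stand-in for data['content']; these sentences are made up for illustration.
sample = pd.Series([
    "I felt GREAT!!! See http://example.com/page for details.",
    "Visit www.example.org, it's scary...",
])

# Assumption: remove links first, while the URL punctuation is still intact.
no_links = sample.str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)
# Then drop everything except letters, digits and whitespace.
alnum_only = no_links.str.replace(r'[^A-Za-z0-9\s]+', '', regex=True)
# Lowercase, then collapse leftover runs of whitespace (extra step, not in the notebook).
cleaned = alnum_only.str.lower().str.split().str.join(' ')

print(cleaned.tolist())
```

Note that apostrophes are removed as well, so "it's" becomes "its".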

Assign Target Variable

In [0]:
target=data.sentiment
data = data.drop(['sentiment'],axis=1)

In [27]:
data

| | content |
|---|---|
| 0 | on days when i feel close to my partner and ot... |
| 1 | every time i imagine that someone i love or i ... |
| 2 | when i had been obviously unjustly treated and... |
| 3 | when i think about the short time that we live... |
| 4 | at a gathering i found myself involuntarily si... |
| ... | ... |
| 7441 | last week i had planned to play tennis and had... |
| 7442 | when i was ill and had to stay at the hospital... |
| 7443 | a few days back i was waiting for the bus at t... |
| 7444 | a few days back i had a tutorial class and the... |


In [28]:
target

0           joy
1          fear
2         anger
3       sadness
4       disgust
         ...   
7441      anger
7442    sadness
7443    disgust
7444      shame
7445      guilt
Name: sentiment, Length: 7446, dtype: object

Encode the target labels as integers with LabelEncoder.

In [0]:
le=LabelEncoder()
target=le.fit_transform(target)
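
It helps to know which integer stands for which emotion when reading later outputs. The sketch below fits a separate encoder (`le_demo`, a name introduced here only for illustration) on the seven labels seen in the data and prints the mapping; `LabelEncoder` sorts its classes, so the codes follow alphabetical order.

```python
from sklearn.preprocessing import LabelEncoder

# The seven emotion labels that appear in the value_counts() output above.
labels = ['joy', 'fear', 'anger', 'sadness', 'disgust', 'shame', 'guilt']

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

# classes_ is sorted, so each integer code is the label's alphabetical rank.
mapping = {label: code for code, label in enumerate(le_demo.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the integer codes.
decoded = list(le_demo.inverse_transform(encoded))
print(decoded)
```

In the notebook itself, the same lookup is available through `le.classes_` and `le.inverse_transform`.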

Split Data into train & test

In [0]:
X_train, X_test, y_train, y_test = train_test_split(data,target,stratify=target,test_size=0.4, random_state=42)

Check that the split preserves the class proportions in both sets

In [33]:
# itemfreq is deprecated in SciPy; np.unique yields the same (value, count) table
np.column_stack(np.unique(y_train, return_counts=True))


array([[  0, 641],
       [  1, 635],
       [  2, 638],
       [  3, 624],
       [  4, 649],
       [  5, 644],
       [  6, 636]])

In [34]:
# itemfreq is deprecated in SciPy; np.unique yields the same (value, count) table
np.column_stack(np.unique(y_test, return_counts=True))


array([[  0, 428],
       [  1, 424],
       [  2, 425],
       [  3, 416],
       [  4, 433],
       [  5, 430],
       [  6, 423]])
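
Raw counts are harder to compare than proportions because the two sets differ in size. The following sketch uses synthetic, deliberately imbalanced labels (not the ISEAR targets) and a hypothetical helper `class_shares` to show that `stratify` keeps the per-class share the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, deliberately imbalanced labels; these are not the ISEAR targets.
y = np.array([0] * 60 + [1] * 30 + [2] * 10)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.4, random_state=42)

def class_shares(labels):
    """Return the fraction of samples per class, ordered by class id."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

print('train shares:', class_shares(y_tr))
print('test shares: ', class_shares(y_te))
```

Calling `class_shares(y_train)` and `class_shares(y_test)` on the notebook's arrays would give the equivalent check for the real data.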

# Tokenization

Once text is split into tokens, it can be turned into numeric features in a variety of ways, such as bag of words, TF-IDF, GloVe, word2vec, and fastText. Let's see how some of these can be applied and how they affect accuracy.

Bag of Words

In [37]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train.content)
X_test_counts =count_vect.transform(X_test.content)
print('Shape of Term Frequency Matrix: ',X_train_counts.shape)

Shape of Term Frequency Matrix:  (4467, 7065)
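
To make the bag-of-words idea concrete, here is a small sketch on three invented sentences. It shows the vocabulary that `CountVectorizer` learns, the resulting count matrix, and the TF-IDF reweighting applied by `TfidfTransformer`. The documents are made up; only the two scikit-learn classes come from the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Invented mini-corpus for illustration.
docs = ["i feel happy", "i feel sad", "sad sad day"]

vect = CountVectorizer()
counts = vect.fit_transform(docs)

# vocabulary_ maps each token to its column index; sort tokens by that index.
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
print(vocab)
print(counts.toarray())

# TF-IDF down-weights tokens that occur in many documents and, by default,
# scales every row to unit L2 norm.
tfidf = TfidfTransformer().fit_transform(counts)
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
print(tfidf.toarray().round(2))
```

The single-letter token "i" is missing from the vocabulary because the default `token_pattern` only keeps tokens of two or more characters.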


Naive Bayes Model

In [39]:
from sklearn.naive_bayes import MultinomialNB


# Machine Learning
# Training Naive Bayes (NB) classifier on training data.
clf = MultinomialNB().fit(X_train_counts,y_train)
predicted = clf.predict(X_test_counts)
nb_clf_accuracy = np.mean(predicted == y_test) * 100
print(nb_clf_accuracy)

54.98489425981873


The same thing can be done using a Pipeline

Let's take a look at how it can be done.
First, let's define a function for printing accuracy.

In [0]:
def print_acc(model):
    predicted = model.predict(X_test.content)
    accuracy = np.mean(predicted == y_test) * 100
    print(accuracy)

In [42]:


nb_clf = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())])
nb_clf = nb_clf.fit(X_train.content,y_train)
print_acc(nb_clf)

54.98489425981873
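
The identical accuracy is expected, because the pipeline chains the same vectorizer and classifier. A minimal sketch on invented toy data shows the equivalence directly and how a fitted step can be inspected through `named_steps`. The documents and labels here are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy data: class 0 = happy, class 1 = scared.
train_docs = ["so happy today", "happy and glad", "so scared now", "scared and afraid"]
train_y = [0, 0, 1, 1]
test_docs = ["glad today", "afraid now"]

# Manual version: vectorize, then fit and predict.
vect = CountVectorizer()
clf = MultinomialNB().fit(vect.fit_transform(train_docs), train_y)
manual_pred = clf.predict(vect.transform(test_docs))

# Pipeline version: the same two steps behind a single fit/predict interface.
pipe = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())])
pipe.fit(train_docs, train_y)
pipe_pred = pipe.predict(test_docs)

print(np.array_equal(manual_pred, pipe_pred))
# Fitted steps stay accessible by name, e.g. the learned vocabulary size.
vocab_size = len(pipe.named_steps['vect'].vocabulary_)
print(vocab_size)
```

The practical benefit is that raw text goes in at both `fit` and `predict`, so the test set can never be vectorized with a different vocabulary by accident.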


TF IDF transformer

In [43]:
nb_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])
nb_clf = nb_clf.fit(X_train.content,y_train)
print_acc(nb_clf)

55.72339711312521


Hash Vectorizer


Note: Multinomial Naive Bayes requires non-negative input. Therefore, `alternate_sign` must be set to `False` in `HashingVectorizer` for it to work with the Naive Bayes algorithm.

In [44]:
nb_clf = Pipeline([('vect', HashingVectorizer(n_features=2500,alternate_sign=False)), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])
nb_clf = nb_clf.fit(X_train.content,y_train)
print_acc(nb_clf)

51.42665323934206
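
The effect of `alternate_sign` can be checked on a few invented sentences. With the default setting, the hashing trick assigns a sign to each token, so some features come out negative and `MultinomialNB` rejects them. The sentences and labels below are made up for this sketch.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented sentences and labels for illustration.
docs = [
    "when i passed the final exam i felt pure joy",
    "the dark empty street at night filled me with fear",
    "my friend broke a promise and i was angry for days",
    "losing my old dog made the whole family very sad",
]
y = [0, 1, 2, 3]

signed = HashingVectorizer(n_features=2500).transform(docs)
unsigned = HashingVectorizer(n_features=2500, alternate_sign=False).transform(docs)

print('signed min:  ', signed.min())
print('unsigned min:', unsigned.min())

# MultinomialNB models counts, so it raises ValueError on negative features.
try:
    MultinomialNB().fit(signed, y)
    nb_error = None
except ValueError as err:
    nb_error = err
print('error on signed features:', nb_error)

# The unsigned features are accepted without complaint.
nb_ok = MultinomialNB().fit(unsigned, y)
```

Hashing to only 2500 columns also means that different words can collide in the same column, which may contribute to the lower accuracy seen above compared with the 7065-column count matrix.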


In [46]:
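from sklearn.metrics import confusion_matrix

# `predicted` was last assigned in the first bag-of-words Naive Bayes cell
# (print_acc keeps its predictions local), so this matrix describes that
# 54.98% model, not the hashing pipeline above.
confusion_matrix(y_test,predicted)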

array([[200,  31,  32,  77,  20,  26,  42],
       [ 63, 206,  35,  39,  17,  26,  38],
       [ 23,  16, 287,  28,  32,  22,  17],
       [ 60,  13,  30, 220,  14,  37,  42],
       [ 36,   6,  13,  37, 284,  40,  17],
       [ 36,   8,  21,  46,  30, 267,  22],
       [ 69,  30,  25,  76,  27,  22, 174]])
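
The matrix is easier to read once it is reduced to per-class figures. The sketch below copies the matrix from the output above and derives recall, precision and overall accuracy from it. The class names assume the alphabetical ordering that `LabelEncoder` applies to the seven ISEAR labels.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
cm = np.array([
    [200,  31,  32,  77,  20,  26,  42],
    [ 63, 206,  35,  39,  17,  26,  38],
    [ 23,  16, 287,  28,  32,  22,  17],
    [ 60,  13,  30, 220,  14,  37,  42],
    [ 36,   6,  13,  37, 284,  40,  17],
    [ 36,   8,  21,  46,  30, 267,  22],
    [ 69,  30,  25,  76,  27,  22, 174],
])
# Assumption: LabelEncoder sorted the labels, so codes 0..6 map to these names.
class_names = ['anger', 'disgust', 'fear', 'guilt', 'joy', 'sadness', 'shame']

recall = cm.diagonal() / cm.sum(axis=1)     # correct / all true members of the class
precision = cm.diagonal() / cm.sum(axis=0)  # correct / all predictions of the class
accuracy = cm.trace() / cm.sum() * 100

for name, r, p in zip(class_names, recall, precision):
    print(f'{name:8s} recall={r:.2f} precision={p:.2f}')
print(f'accuracy={accuracy:.2f}')
```

Under that label ordering, fear has the highest recall and shame the lowest, and the overall accuracy reproduces the 54.98% reported for the bag-of-words model.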

Remove Stop Words

In [51]:
import nltk

from nltk.corpus import stopwords

nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [52]:
# Recent scikit-learn versions require stop_words to be a list (or 'english'), not a set
stop_words = stopwords.words('english')
nb_clf = Pipeline([('vect', CountVectorizer(stop_words=stop_words)), ('clf', MultinomialNB())])
nb_clf = nb_clf.fit(X_train.content,y_train)
print_acc(nb_clf)

55.085599194360526


In [53]:
nb_clf = Pipeline([('vect', CountVectorizer(stop_words=stop_words)), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])
nb_clf = nb_clf.fit(X_train.content,y_train)
print_acc(nb_clf)

55.052030882846594
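
What stop-word removal does to the feature space can be seen on two invented sentences. This sketch uses a tiny hand-picked stop list (`demo_stop_words`, an assumption for illustration) rather than NLTK's English list, so it runs without downloading any corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented sentences for illustration.
docs = ["i was so afraid of the dark", "the news made me very happy"]
# Tiny hand-picked stop list; the notebook uses NLTK's English list instead.
demo_stop_words = ['was', 'so', 'of', 'the', 'me', 'very']

full_vocab = sorted(CountVectorizer().fit(docs).vocabulary_)
kept_vocab = sorted(CountVectorizer(stop_words=demo_stop_words).fit(docs).vocabulary_)

print(full_vocab)
print(kept_vocab)
```

Only the content-bearing words remain as features. In the results above, applying NLTK's stop list changes accuracy by well under one percentage point.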


Lemmatization

In [55]:
nltk.download('wordnet')

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    # lemmatize() defaults to pos='n', so only noun forms are normalized here
    return ' '.join(lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text))

# pandas flags the frames returned by train_test_split as derived from `data`;
# take explicit copies so the assignment does not raise SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()
X_train['content'] = X_train['content'].apply(lemmatize_text)
X_test['content'] = X_test['content'].apply(lemmatize_text)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [56]:
nb_clf = Pipeline([('vect', CountVectorizer(stop_words=stop_words)), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])
nb_clf = nb_clf.fit(X_train.content,y_train)
print_acc(nb_clf)

55.11916750587446
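
The accuracies reported in this notebook can be collected in one place for comparison. The sketch below only restates the numbers printed above; the short configuration names are labels introduced here for readability.

```python
import pandas as pd

# Test accuracies (percent) copied from the outputs above; all use MultinomialNB.
results = pd.Series({
    'counts': 54.98489425981873,
    'counts + tf-idf': 55.72339711312521,
    'hashing(2500) + tf-idf': 51.42665323934206,
    'counts + stop words': 55.085599194360526,
    'counts + stop words + tf-idf': 55.052030882846594,
    'lemmatized + stop words + tf-idf': 55.11916750587446,
}, name='accuracy_pct')

ranked = results.sort_values(ascending=False).round(2)
print(ranked)

# Gap between the best and the worst configuration, in percentage points.
spread = results.max() - results.min()
print(f'spread: {spread:.2f}')
```

Apart from the hashing variant, all configurations land within one percentage point of each other, which suggests that these preprocessing choices matter little for this Naive Bayes model on this split.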
