In this notebook, we will see how to train an emotion detection model using machine learning. We will use the scikit-learn, pandas, and NumPy libraries.

In [8]:
#import dependencies
import numpy as np
import pandas as pd

## Load the dataset

In [9]:
#read dataset
data = pd.read_csv("./data/text_emotion.csv")

In [10]:
data.shape

(40000, 4)

In [11]:
data.head()

,tweet_id,sentiment,author,content
0,1956967341,empty,xoshayzers,@tiffanylue i know i was listenin to bad habi...
1,1956967666,sadness,wannamama,Layin n bed with a headache ughhhh...waitin o...
2,1956967696,sadness,coolfunky,Funeral ceremony...gloomy friday...
3,1956967789,enthusiasm,czareaquino,wants to hang out with friends SOON!
4,1956968416,neutral,xkilljoyx,@dannycastillo We want to trade with someone w...


In [12]:
#total classes
data.sentiment.unique()

array(['empty', 'sadness', 'enthusiasm', 'neutral', 'worry', 'surprise',
       'love', 'fun', 'hate', 'happiness', 'boredom', 'relief', 'anger'], dtype=object)
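With 13 classes, it can help to check how many rows each class has before training. The following is a minimal sketch on a toy DataFrame (the `toy` frame and its counts are made up for illustration; only the label names come from the dataset above). The same `value_counts()` call should work on `data.sentiment`.

```python
import pandas as pd

# Toy stand-in for the `sentiment` column; labels are from the notebook,
# the rows themselves are invented for illustration.
toy = pd.DataFrame(
    {"sentiment": ["worry", "neutral", "worry", "love", "neutral", "worry"]}
)

# Count rows per class, most frequent class first.
class_counts = toy["sentiment"].value_counts()
print(class_counts)
```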

## Preprocess the dataset

In [13]:
#drop unnecessary columns
data = data.drop(data.columns[[0,2]], axis=1)

In [14]:
data.head()

,sentiment,content
0,empty,@tiffanylue i know i was listenin to bad habi...
1,sadness,Layin n bed with a headache ughhhh...waitin o...
2,sadness,Funeral ceremony...gloomy friday...
3,enthusiasm,wants to hang out with friends SOON!
4,neutral,@dannycastillo We want to trade with someone w...


In [15]:
#as_matrix() was removed in pandas 1.0; to_numpy() returns the same array
dataset = data.to_numpy()

In [16]:
dataset.shape

(40000, 2)

In [17]:
features = dataset[:,1]

In [18]:
features[123]

'@poinktoinkdoink He died.  Wait, what about Magic Jack? I just read it.'

In [19]:
target = dataset[:,0]

In [20]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
target_processed = le.fit_transform(target)

In [21]:
le.classes_

array(['anger', 'boredom', 'empty', 'enthusiasm', 'fun', 'happiness',
       'hate', 'love', 'neutral', 'relief', 'sadness', 'surprise', 'worry'], dtype=object)
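`LabelEncoder` sorts the class names alphabetically and assigns each one an integer, which is why `le.classes_` above is in a different order than `data.sentiment.unique()`. A small sketch of the round trip, using a hypothetical `toy_le` encoder and a handful of labels:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["sadness", "anger", "worry", "anger"]

toy_le = LabelEncoder()
# Classes are sorted alphabetically: anger=0, sadness=1, worry=2.
encoded = toy_le.fit_transform(labels)
# inverse_transform maps the integers back to the original strings.
decoded = toy_le.inverse_transform(encoded)

print(encoded)        # [1 0 2 0]
print(list(decoded))
```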

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X_processed = tfidf.fit_transform(features)

In [23]:
X_processed

<40000x48212 sparse matrix of type '<class 'numpy.float64'>'
	with 475946 stored elements in Compressed Sparse Row format>
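The matrix above has one row per tweet and one column per distinct token. To see what `TfidfVectorizer` does on a smaller scale, here is a sketch on a three-sentence toy corpus (the corpus and the `toy_tfidf` name are made up for illustration). With default settings, each row is L2-normalized.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["so happy today", "so sad today", "happy happy friday"]

toy_tfidf = TfidfVectorizer()
# Learn the vocabulary and compute TF-IDF weights in one step.
X_toy = toy_tfidf.fit_transform(corpus)

# Each distinct token becomes one column of the sparse matrix.
vocab = sorted(toy_tfidf.vocabulary_)
print(vocab)
print(X_toy.shape)  # (3, 5): 3 documents, 5 distinct tokens
```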

In [24]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_processed, target_processed, test_size=0.5, random_state=42)

In [None]:
y_train

array([ 3,  5, 10, ...,  4,  6,  7], dtype=int64)
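An optional variation, not used in this notebook: if the classes are imbalanced, passing `stratify` to `train_test_split` keeps the class proportions the same in both halves. A minimal sketch on toy arrays (`X_toy` and `y_toy` are invented for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 toy samples: 8 of class 0 and 2 of class 1.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

# stratify=y_toy keeps the 8:2 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=42, stratify=y_toy
)

print(np.bincount(y_tr), np.bincount(y_te))  # [4 1] [4 1]
```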

## Train a model

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)

In [None]:
#evaluate model
rf.score(X_test, y_test)
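`score` returns overall accuracy, which can hide weak performance on rare classes. One way to get a per-class view is `classification_report`. The sketch below uses hand-written toy predictions (`y_true` and `y_pred` are invented); in the notebook, the inputs would presumably be `y_test` and `rf.predict(X_test)`, with `le.classes_` as the target names.

```python
from sklearn.metrics import accuracy_score, classification_report

# Toy ground truth and predictions for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

# Overall accuracy: 5 of 6 predictions are correct.
acc = accuracy_score(y_true, y_pred)

# Precision, recall and F1 for each class.
report = classification_report(
    y_true, y_pred, target_names=["anger", "sadness", "worry"], zero_division=0
)
print(acc)
print(report)
```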

In [None]:
#predict
test_ex = "It is so irritating"
#transform expects an iterable of documents, so wrap the string in a list
test_ex_processed = tfidf.transform([test_ex])
rf.predict(test_ex_processed)
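The prediction is an integer class index, so it needs to be mapped back through the label encoder to get an emotion name. Below is a self-contained sketch of that flow on a tiny invented training set (the `toy_*` names and the four example sentences are hypothetical, and a model trained on so little data is only illustrative).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

texts = [
    "this is so irritating",
    "i love this song",
    "so irritating and annoying",
    "love you all",
]
labels = ["anger", "love", "anger", "love"]

# Encode labels and vectorize text, as in the notebook.
toy_le = LabelEncoder()
y = toy_le.fit_transform(labels)
toy_tfidf = TfidfVectorizer()
X = toy_tfidf.fit_transform(texts)

toy_rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# transform takes a list of documents, even for a single sentence.
sample = toy_tfidf.transform(["It is so irritating"])
pred_ids = toy_rf.predict(sample)

# Map the integer class index back to its emotion label.
pred_labels = toy_le.inverse_transform(pred_ids)
print(pred_labels)
```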

Exercise: Use an SVM to train a model and note the difference in performance.

In [None]:
from sklearn.svm import SVC
svm = SVC(kernel="rbf", C=10)
svm.fit(X_train, y_train)
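To compare the two models, each can be scored on the same held-out split. The sketch below does this on synthetic data from `make_classification` (an assumption made so the example runs without the tweet dataset; the scores it prints say nothing about performance on the emotion data). In the notebook, the equivalent would presumably be `svm.score(X_test, y_test)` next to `rf.score(X_test, y_test)`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic 3-class problem standing in for the real features.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=42)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "svm_rbf": SVC(kernel="rbf", C=10),
}

# Fit each model on the training half and score it on the test half.
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
print(scores)
```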