## Emotion analysis and classification of short comments using machine learning techniques
+ Code developed by: Douglas Maia dos Santos
+ GitHub access: https://github.com/m-dougl/emotion-analysis

##### Importing the libraries the code needs
###### Note: the code depends on functions stored in emotion_analysis.py
+ Scikit-Learn (sklearn): provides the classification models, as well as the methods and metrics used to evaluate them
+ Natural Language Toolkit (NLTK): performs text pre-processing steps such as stopword removal and tokenization
+ Pandas: organizes and manipulates the dataset as DataFrame objects, which makes changes easy to inspect
+ Matplotlib and Seaborn: libraries for graphical visualization of the data
+ XGBoost and CatBoost: provide the XGBClassifier and CatBoostClassifier models
+ emotion_analysis: Python file that gathers the helper functions used to classify the comments

In [None]:
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.tokenize import RegexpTokenizer
from collections import Counter
from unicodedata import normalize
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

import pandas as pd
import nltk
import numpy as np
import string
import emotion_analysis
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')
nltk.download('rslp')
nltk.download('stopwords')
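# The 'seaborn' style was renamed to 'seaborn-v0_8' in Matplotlib 3.6
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')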

##### Dataset import and emotion analysis
+ The dataset is read with pandas through the "open_dataset" function
+ A pie chart shows how the emotions are distributed across the comments
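
As a rough illustration of this step, the sketch below shows one way a helper like "open_dataset" could dispatch to a pandas reader, and how `Counter` yields the emotion distribution that the pie chart displays. The body of `open_dataset` here is an assumption (the real one lives in emotion_analysis.py), and the three sample rows are invented for the example.

```python
import io
from collections import Counter

import pandas as pd


def open_dataset(path, file_type):
    """Hypothetical reader: dispatch to a pandas loader by file type."""
    readers = {"xlsx": pd.read_excel, "csv": pd.read_csv}
    if file_type not in readers:
        raise ValueError(f"Unsupported file type: {file_type}")
    return readers[file_type](path)


# Invented sample data standing in for dataset.xlsx
sample = io.StringIO(
    "Comentarios,Emoção\n"
    "Adorei o filme,ALEGRIA\n"
    "Que final triste,TRISTEZA\n"
    "Amei cada cena,Alegria\n"
)
df_demo = open_dataset(sample, "csv")
df_demo["Emoção"] = df_demo["Emoção"].str.lower()  # normalize label casing

# Number of comments per emotion; these counts feed the pie chart
emotion_counts = Counter(df_demo["Emoção"])
print(emotion_counts)
```

Lowercasing the labels first matters: without it, "ALEGRIA" and "Alegria" would be counted as two different emotions.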

In [None]:
df = emotion_analysis.open_dataset('dataset.xlsx', 'xlsx')
df.Emoção = df.Emoção.str.lower()

In [None]:
emotions = Counter(df.Emoção)
plt.figure(figsize=(9, 4))
plt.title(f'$Distribution $ $of$ $emotions$ $in$ {len(list(df.Emoção))} $comments$')
plt.pie(x=emotions.values(), labels=emotions.keys(),
        shadow=True, autopct='%1.1f%%')

##### Data pre-processing stage
+ 1 - Removal of special characters
+ 2 - Removal of accents contained in words
+ 3 - Text tokenization
+ 4 - Removal of stopwords contained in comments
+ 5 - Text untokenization
+ 6 - Reduction of the word to its root (Stemming)
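
To make steps 1 to 5 concrete, here is a minimal sketch that uses only the standard library. The function bodies are assumptions about what the helpers in emotion_analysis.py might do, not their actual code, and the stopword set is a tiny invented stand-in for the NLTK Portuguese list. Stemming (step 6) is left out because it depends on the RSLP data that the notebook downloads through NLTK.

```python
import re
import string
from unicodedata import normalize

# Tiny illustrative stopword set (assumption); the notebook downloads NLTK's full list
STOPWORDS = {"o", "a", "e", "de", "que", "um", "uma"}


def remove_characters(text):
    """Step 1: lowercase the text and drop punctuation and digits."""
    table = str.maketrans("", "", string.punctuation + string.digits)
    return text.lower().translate(table)


def remove_accents(text):
    """Step 2: decompose accented letters (NFKD) and drop the combining marks."""
    return normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


def tokenize(text):
    """Step 3: split the text into word tokens."""
    return re.findall(r"\w+", text)


def remove_stopwords(tokens):
    """Step 4: keep only tokens that are not stopwords."""
    return [token for token in tokens if token not in STOPWORDS]


def untokenize(tokens):
    """Step 5: join the tokens back into a single string."""
    return " ".join(tokens)


def preprocess(text):
    """Apply steps 1 to 5 in the order listed above."""
    tokens = tokenize(remove_accents(remove_characters(text)))
    return untokenize(remove_stopwords(tokens))


cleaned = preprocess("Que filme incrível! Adorei o final, é emocionante.")
print(cleaned)  # filme incrivel adorei final emocionante
```

Because accents are removed before the stopword filter runs, the stopword list has to be accent-free as well; otherwise words such as "é" (which becomes "e") would slip through.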

In [None]:
df.Comentarios = df.Comentarios.apply(emotion_analysis.remove_characters)
df.Comentarios = df.Comentarios.apply(emotion_analysis.remove_accents)
df.Comentarios = df.Comentarios.apply(emotion_analysis.tokenize)
df.Comentarios = df.Comentarios.apply(emotion_analysis.remove_stopwords)
df.Comentarios = df.Comentarios.apply(emotion_analysis.untokenize)
df.Comentarios = df.Comentarios.apply(emotion_analysis.stemming)

##### Transformation of comments into a numerical matrix using the TfidfVectorizer or CountVectorizer methods
See the emotion_analysis file for more information about the parameters of the "vectorizer" function
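
For readers without the helper file at hand, the following sketch shows what such a function could look like. It is an assumption, not the code from emotion_analysis.py: the name `vectorizer` and its `method` argument mirror the call in the notebook, while the accepted value 'count' and the default scikit-learn settings are guesses. The three documents are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


def vectorizer(texts, method="tfidf"):
    """Hypothetical helper: turn strings into a sparse document-term matrix."""
    if method == "tfidf":
        model = TfidfVectorizer()
    elif method == "count":
        model = CountVectorizer()
    else:
        raise ValueError(f"Unknown method: {method}")
    return model.fit_transform(texts)


docs = ["filme otimo adorei", "filme triste chorei", "final surpreendente"]
X_tfidf = vectorizer(docs, "tfidf")  # term counts reweighted by rarity across documents
X_count = vectorizer(docs, "count")  # raw term counts
print(X_tfidf.shape, X_count.shape)
```

Both matrices have one row per comment and one column per distinct term. Note that fitting the vectorizer on all comments before the train/test split lets IDF statistics from the test comments influence the features; fitting on the training portion only is the stricter alternative.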

In [None]:
X, y = df.Comentarios, df.Emoção
X = emotion_analysis.vectorizer(X, 'tfidf')

##### Splitting the data into training and test sets
The variable X stores the numerical matrix and the variable y stores the emotion labels of the dataset. Here, 80% of the data is used for training and 20% for testing the models; these proportions can be changed freely.
+ With the "simple_train" function, the models are trained without GridSearchCV parameter optimization
+ With the "cv_train" function, the models are trained with cross-validation
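
The difference between the two training modes can be sketched directly with scikit-learn. Everything below is illustrative: the toy corpus is invented, the `alpha` grid and the 3-fold setting are arbitrary choices, and the actual "dataset_split", "simple_train" and "cv_train" functions may be implemented differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB

# Invented toy corpus: 10 copies of one comment per emotion
texts = (["filme otimo adorei"] * 10
         + ["filme triste chorei"] * 10
         + ["final surpreendente inesperado"] * 10)
labels = ["alegria"] * 10 + ["tristeza"] * 10 + ["surpresa"] * 10

X_demo = TfidfVectorizer().fit_transform(texts)

# 80% training / 20% test; stratify keeps the class proportions in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, labels, train_size=0.8, stratify=labels, random_state=42
)

# "Simple" training: fit once with the default hyperparameters
simple_nb = MultinomialNB().fit(X_tr, y_tr)

# Cross-validated training: score each alpha with 3-fold CV,
# then refit the best setting on the whole training set
search = GridSearchCV(
    MultinomialNB(),
    param_grid={"alpha": [0.1, 0.5, 1.0]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)
cv_nb = search.best_estimator_

print(search.best_params_, X_tr.shape[0], X_te.shape[0])
```

In both cases the test set is touched only afterwards, through `predict`, so it stays an independent check of the chosen model.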

In [None]:
train_size = .8
X_train, X_test, y_train, y_test = emotion_analysis.dataset_split(X, y, train_size)

In [None]:
labels =  ['ALEGRIA', 'TRISTEZA', 'SURPRESA']
emotion_analysis.emotion_plot(y_train, y_test, emotions)

In [None]:
n_fold = 10
NB  = emotion_analysis.cv_train(classifier_name='NB',  X=X_train, y= y_train, n_fold=n_fold)
SVM = emotion_analysis.cv_train(classifier_name='SVM', X=X_train, y= y_train, n_fold=n_fold)
KNN = emotion_analysis.cv_train(classifier_name='KNN', X=X_train, y= y_train, n_fold=n_fold)

pred_nb  = NB.predict(X_test)   # Naive Bayes prediction
pred_svm = SVM.predict(X_test)  # SVM prediction
pred_knn = KNN.predict(X_test)  # KNN prediction

In [None]:
ADA = emotion_analysis.cv_train(classifier_name='ADA', X=X_train, y= y_train, n_fold=n_fold)
XGB = emotion_analysis.cv_train(classifier_name='XGB', X=X_train, y= y_train, n_fold=n_fold)
CAT = emotion_analysis.cv_train(classifier_name='CAT', X=X_train, y= y_train, n_fold=n_fold)

pred_ada = ADA.predict(X_test)  # ADA prediction
pred_xgb = XGB.predict(X_test)  # XGB prediction
pred_cat = CAT.predict(X_test)  # CAT prediction

##### Evaluating the models
+ Confusion Matrix
+ Accuracy
+ Precision
+ Recall 
+ F1-Score
+ AUC Score

+ Confusion Matrix
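
As a small worked example of how to read such a matrix, the sketch below calls scikit-learn's `confusion_matrix` on six invented predictions. Rows are the true emotions and columns the predicted ones; passing `labels` explicitly fixes the order, here alphabetical, which is also the order scikit-learn uses by default. How "confusion_matrix_plot" builds its figures is not shown in this notebook, so this is only an approximation of its input.

```python
from sklearn.metrics import confusion_matrix

# Invented labels for illustration
y_true_demo = ["alegria", "tristeza", "surpresa", "alegria", "tristeza", "surpresa"]
y_pred_demo = ["alegria", "tristeza", "alegria", "alegria", "surpresa", "surpresa"]

class_order = ["alegria", "surpresa", "tristeza"]  # index 0, 1, 2
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=class_order)
print(cm)
# [[2 0 0]
#  [1 1 0]
#  [0 1 1]]
```

The diagonal holds the correct predictions (4 of 6 here). The off-diagonal entry in row 1, column 0 is a "surpresa" comment that was classified as "alegria".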

In [None]:
list_predict = [pred_nb, pred_svm, pred_knn, pred_ada,pred_xgb, pred_cat]
models_names = ['NB', 'SVM', 'KNN', 'ADA','XGB', 'CAT']
'''Note:
        0 = ALEGRIA
        1 = SURPRESA
        2 = TRISTEZA
'''
emotion_analysis.confusion_matrix_plot(list_predict = list_predict,
                                  models_names = models_names, 
                                  y_true       = y_test)

+ Accuracy, Precision, Recall and F1-Score
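
With three emotions, precision, recall and F1 have to be averaged over the classes. The sketch below uses macro averaging (the unweighted mean of the per-class scores) on six invented predictions. Which averaging "metrics_evaluation" actually applies is not stated in the notebook, so `average='macro'` is an assumption.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Invented labels for illustration
y_true_demo = ["alegria", "tristeza", "surpresa", "alegria", "tristeza", "surpresa"]
y_pred_demo = ["alegria", "tristeza", "alegria", "alegria", "surpresa", "surpresa"]

acc = accuracy_score(y_true_demo, y_pred_demo)                     # 4 correct out of 6
prec = precision_score(y_true_demo, y_pred_demo, average="macro")  # mean of 2/3, 1/2, 1
rec = recall_score(y_true_demo, y_pred_demo, average="macro")      # mean of 1, 1/2, 1/2
f1 = f1_score(y_true_demo, y_pred_demo, average="macro")           # mean of 4/5, 1/2, 2/3

print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

Macro averaging gives every emotion the same weight regardless of how many comments it has; `average='weighted'` would instead weight each class by its number of true samples.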

In [None]:
emotion_analysis.metrics_evaluation(models_names=models_names,
                   list_predict=list_predict,
                   y_true = y_test)