**IST664 - NLP Lab Week 9**

In the previous lab, we practiced different word embedding techniques for converting text into numeric vectors. In this lab, we will learn to use a convolutional neural network (CNN) to recognize emotions in tweets. The algorithm uses multiple channels: the raw tweet text; the hashtags, emojis, and emoticons; and features derived from emotion lexicons. It is an example of how deep learning techniques can be combined with the language processing foundations covered earlier in the course. For this lab, you will need the lexicon resources, which are available in this week's folder on the course site.

Please note that the multi-channel CNN algorithm below was developed by Islam, Mercer, & Xiao (2019). If you use it in your own data analysis work, please cite the reference below to give the researchers credit. Thank you.

Islam, J., Mercer, R. E., & Xiao, L. (2019, June). Multi-channel convolutional neural network for Twitter emotion and sentiment recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 1355-1365).

In [1]:
!pip install emoji
!pip install vaderSentiment

Collecting emoji
  Downloading emoji-2.9.0-py2.py3-none-any.whl (397 kB)
Installing collected packages: emoji
Successfully installed emoji-2.9.0
Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [2]:
# ********* Import Packages ********* #
import re, csv, emoji, operator
import pandas as pd
from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
import numpy as np
from numpy import asarray, zeros, array
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import KFold
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
#from emoji.unicode_codes import UNICODE_EMOJI
from keras.preprocessing.text import Tokenizer
#from keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.layers import Input, Dense, Dropout, Embedding, GlobalMaxPooling1D
from keras.layers import Conv1D
#from keras.layers.merge import concatenate
from tensorflow.keras.layers import concatenate
from keras.callbacks import Callback

In [3]:
# ********* Load Data ********* #
# Important: Change load_data() function according to the dataset that you have. The following
# function processes the Twitter Emotion Corpus (TEC)
# Link of paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.383.3384&rep=rep1&type=pdf
# Link of dataset: http://saifmohammad.com/WebPages/SentimentEmotionLabeledData.html

basefile = "Jan9-2012-tweets-clean.txt"

# This downloads the zip file
!wget http://saifmohammad.com/WebDocs/Jan9-2012-tweets-clean.txt.zip
!unzip Jan9-2012-tweets-clean.txt.zip

# List of tweets
texts = []

# List of labels
labels = []

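def load_data():
    # Each line is expected to look like "<tweet id>: <tweet text> :: <label>"
    with open(basefile, 'r', encoding="utf8") as f:
        for line in f:
            splitted = line.strip().split()
            # The last token is the emotion label
            labels.append(splitted[-1])
            # Skip the leading tweet ID and drop the trailing ':: <label>' pair
            texts.append(' '.join(splitted[1:-2]))
    print('Loaded %s  data' % len(labels))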

print("Loading data...")
load_data()
print(Counter(labels))

# Example
print(texts[55])
print(labels[55])

--2024-01-11 16:50:54--  http://saifmohammad.com/WebDocs/Jan9-2012-tweets-clean.txt.zip
Resolving saifmohammad.com (saifmohammad.com)... 192.185.17.122
Connecting to saifmohammad.com (saifmohammad.com)|192.185.17.122|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1128895 (1.1M) [application/zip]
Saving to: ‘Jan9-2012-tweets-clean.txt.zip’


2024-01-11 16:50:54 (12.3 MB/s) - ‘Jan9-2012-tweets-clean.txt.zip’ saved [1128895/1128895]

Archive:  Jan9-2012-tweets-clean.txt.zip
  inflating: Jan9-2012-tweets-clean.txt  
   creating: __MACOSX/
  inflating: __MACOSX/._Jan9-2012-tweets-clean.txt  
Loading data...
Loaded 21051  data
Counter({'joy': 8240, 'surprise': 3849, 'sadness': 3830, 'fear': 2816, 'anger': 1555, 'disgust': 761})
literally haven't seen the sun in a week and it's finally coming out!
joy
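
To make the slicing in `load_data()` concrete, here is a minimal sketch that parses a single line. The sample line is made up for illustration and only assumes the `<id>: <text> :: <label>` layout that the slicing implies; `parse_tec_line` is a hypothetical helper, not part of the lab code.

```python
def parse_tec_line(line):
    """Split one TEC-style line into (text, label); hypothetical helper."""
    parts = line.strip().split()
    label = parts[-1]              # last token is the emotion label
    text = ' '.join(parts[1:-2])   # skip the ID, drop '::' and the label
    return text, label

# Made-up sample line in the assumed "<id>: <text> :: <label>" layout
sample = "123456:\tso happy the sun is out!\t:: joy"
text, label = parse_tec_line(sample)
print(text)   # so happy the sun is out!
print(label)  # joy
```

If your own dataset uses a different layout (for example, a CSV with a label column), only this parsing step should need to change.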


In [4]:
# download lexicons
!wget https://raw.githubusercontent.com/jumayel06/Tension-Analysis/master/Notebooks/Emotion%20Recognition%20Notebook/lexicons/Ratings_Warriner_et_al.csv
!wget https://raw.githubusercontent.com/jumayel06/Tension-Analysis/master/Notebooks/Emotion%20Recognition%20Notebook/lexicons/NRC-emotion-lexicon-wordlevel-v0.92.txt
!wget https://raw.githubusercontent.com/jumayel06/Tension-Analysis/master/Notebooks/Emotion%20Recognition%20Notebook/lexicons/nrc_affect_intensity.txt
!wget https://raw.githubusercontent.com/jumayel06/Tension-Analysis/master/Notebooks/Emotion%20Recognition%20Notebook/lexicons/NRC-Hashtag-Emotion-Lexicon-v0.2.txt
!wget https://raw.githubusercontent.com/jumayel06/Tension-Analysis/master/Notebooks/Emotion%20Recognition%20Notebook/lexicons/BingLiu.txt
!wget https://raw.githubusercontent.com/jumayel06/Tension-Analysis/master/Notebooks/Emotion%20Recognition%20Notebook/lexicons/mpqa.txt
!wget https://raw.githubusercontent.com/jumayel06/Tension-Analysis/master/Notebooks/Emotion%20Recognition%20Notebook/lexicons/AFINN-en-165.txt
!wget https://raw.githubusercontent.com/jumayel06/Tension-Analysis/master/Notebooks/Emotion%20Recognition%20Notebook/lexicons/stopwords.txt
!wget https://raw.githubusercontent.com/jumayel06/Tension-Analysis/master/Notebooks/Emotion%20Recognition%20Notebook/lexicons/slangs.txt
!wget https://raw.githubusercontent.com/jumayel06/Tension-Analysis/master/Notebooks/Emotion%20Recognition%20Notebook/lexicons/negated_words.txt
!wget https://raw.githubusercontent.com/jumayel06/Tension-Analysis/master/Notebooks/Emotion%20Recognition%20Notebook/lexicons/emoticons.txt

--2024-01-11 16:50:55--  https://raw.githubusercontent.com/jumayel06/Tension-Analysis/master/Notebooks/Emotion%20Recognition%20Notebook/lexicons/Ratings_Warriner_et_al.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3736801 (3.6M) [text/plain]
Saving to: ‘Ratings_Warriner_et_al.csv’


2024-01-11 16:50:55 (40.8 MB/s) - ‘Ratings_Warriner_et_al.csv’ saved [3736801/3736801]

--2024-01-11 16:50:55--  https://raw.githubusercontent.com/jumayel06/Tension-Analysis/master/Notebooks/Emotion%20Recognition%20Notebook/lexicons/NRC-emotion-lexicon-wordlevel-v0.92.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133

In [5]:
# ********* Load Lexicons ********* #
bingliu_mpqa = {}
nrc_emotion = {}
nrc_affect_intensity = {}
nrc_hashtag_emotion = {}
afinn = {}
ratings = {}
stopwords = []
slangs = {}
negated = {}
emoticons = []

# Vader
analyzer = SentimentIntensityAnalyzer()

def load_lexicons():
    # Ratings by Warriner et al. (2013)
    with open('Ratings_Warriner_et_al.csv', 'r') as f:
        reader = csv.reader(f)
        rows = list(reader)
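    # Print only a summary: printing every row floods the notebook output
    print('Read %d rows from the Warriner ratings file' % len(rows))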
    for i in range(1, len(rows)):
        # Normalize values
        valence = (float(rows[i][2]) - 1.0)/(9.0-1.0)
        arousal = (float(rows[i][5]) - 1.0)/(9.0-1.0)
        dominance = (float(rows[i][8]) - 1.0)/(9.0-1.0)
        ratings[rows[i][1]] = {"Valence": valence, "Arousal": arousal, "Dominance": dominance}


    # NRC Emotion Lexicon (2014)
    with open('NRC-emotion-lexicon-wordlevel-v0.92.txt', 'r') as f:
        f.readline()
        for line in f:
            splitted = line.strip().split('\t')
            if splitted[0] not in nrc_emotion:
                nrc_emotion[splitted[0]] = {'anger': float(splitted[1]),
                                                    'disgust': float(splitted[3]),
                                                    'fear': float(splitted[4]),
                                                    'joy': float(splitted[5]),
                                                    'sadness': float(splitted[8]),
                                                    'surprise': float(splitted[9])}

    # NRC Affect Intensity (2018)
    with open('nrc_affect_intensity.txt', 'r') as f:
        f.readline()
        for line in f:
            splitted = line.strip().split('\t')
            if splitted[0] not in nrc_affect_intensity:
                nrc_affect_intensity[splitted[0]] = {'anger': float(splitted[1]),
                                                    'disgust': float(splitted[3]),
                                                    'fear': float(splitted[4]),
                                                    'joy': float(splitted[5]),
                                                    'sadness': float(splitted[8]),
                                                    'surprise': float(splitted[9])}

    # NRC Hashtag Emotion Lexicon (2015)
    with open('NRC-Hashtag-Emotion-Lexicon-v0.2.txt', 'r') as f:
        f.readline()
        for line in f:
            splitted = line.strip().split('\t')
            splitted[0] = splitted[0].replace('#','')
            if splitted[0] not in nrc_hashtag_emotion:
                nrc_hashtag_emotion[splitted[0]] = {'anger': float(splitted[1]),
                                                    'disgust': float(splitted[3]),
                                                    'fear': float(splitted[4]),
                                                    'joy': float(splitted[5]),
                                                    'sadness': float(splitted[8]),
                                                    'surprise': float(splitted[9])}


    # BingLiu (2004) and MPQA (2005)
    with open('BingLiu.txt', 'r') as f:
        for line in f:
            splitted = line.strip().split('\t')
            if splitted[0] not in bingliu_mpqa:
                bingliu_mpqa[splitted[0]] = splitted[1]
    with open('mpqa.txt', 'r') as f:
        for line in f:
            splitted = line.strip().split('\t')
            if splitted[0] not in bingliu_mpqa:
                bingliu_mpqa[splitted[0]] = splitted[1]


    with open('AFINN-en-165.txt', 'r') as f:
        for line in f:
            splitted = line.strip().split('\t')
            if splitted[0] not in afinn:
                score = float(splitted[1])
                normalized_score = (score - (-5)) / (5-(-5))
                afinn[splitted[0]] = normalized_score


    with open('stopwords.txt', 'r') as f:
        for line in f:
            stopwords.append(line.strip())

    with open('slangs.txt', 'r') as f:
        for line in f:
            splitted = line.strip().split(',', 1)
            slangs[splitted[0]] = splitted[1]

    with open('negated_words.txt', 'r') as f:
        for line in f:
            splitted = line.strip().split(',', 1)
            negated[splitted[0]] = splitted[1]

    with open('emoticons.txt', 'r') as f:
        for line in f:
            emoticons.append(line.strip())
load_lexicons()
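
The two rescaling formulas used in `load_lexicons()` (for the Warriner ratings and for AFINN) are both instances of min-max normalization. A small sketch, where `min_max` is a hypothetical helper name and the input values are illustrative:

```python
def min_max(value, low, high):
    """Rescale value from the range [low, high] to [0, 1]; hypothetical helper."""
    return (value - low) / (high - low)

# Warriner et al. ratings are on a 1-9 scale
print(min_max(5.0, 1.0, 9.0))    # 0.5
# AFINN scores are on a -5..5 scale
print(min_max(-5.0, -5.0, 5.0))  # 0.0
print(min_max(3.0, -5.0, 5.0))   # 0.8
```

Putting every lexicon on the same 0 to 1 scale keeps one lexicon from dominating the feature vector simply because its raw scores are larger.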




In [6]:
# Now grab the GloVe embeddings we will need. The download takes about a minute,
# and filling the data structure takes about another minute.
# Note that the zip file with the embeddings is hosted on Dropbox. If this does
# not work, the file can be downloaded from the GloVe website and uploaded to the
# file store for this notebook.

#!wget https://www.dropbox.com/s/ewfdwppopt3pild/glove.twitter.27B.100d.txt.zip?dl=1
!wget  https://www.dropbox.com/s/tq8t97grqd7vjxc/glove.twitter.27B.100d.txt.zip?dl=0
!unzip glove.twitter.27B.100d.txt.zip?dl=0
print("Loading word embeddings...")
embeddings_index = dict() # Initialize an empty dictionary
embedding_dir = 'glove.twitter.27B.100d.txt'

# Each line holds a word followed by the components of its vector
with open(embedding_dir, encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))

--2024-01-11 16:50:59--  https://www.dropbox.com/s/tq8t97grqd7vjxc/glove.twitter.27B.100d.txt.zip?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.80.18, 2620:100:601f:18::a27d:912
Connecting to www.dropbox.com (www.dropbox.com)|162.125.80.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/raw/tq8t97grqd7vjxc/glove.twitter.27B.100d.txt.zip [following]
--2024-01-11 16:51:00--  https://www.dropbox.com/s/raw/tq8t97grqd7vjxc/glove.twitter.27B.100d.txt.zip
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://ucdaa58deb2ffe6b7e879e9d7657.dl.dropboxusercontent.com/cd/0/inline/CLIk2TKN1giqVGeaDGm5-z1Kfs1rrc1ww3V4D5Qhpf2XFdx6OOsnyAoaYedzkve5GzZ7GWsjdNcGrc0HDFM1qUWnuC-J5rAnInyw7vkdyX8Sw9z1K-rnD1a28ADi6IScUirai-SdFyKtlADWj3ck-QFt/file# [following]
--2024-01-11 16:51:01--  https://ucdaa58deb2ffe6b7e879e9d7657.dl.dropboxusercontent.com/cd/0/inline/CLIk2TKN1giqVGeaDGm5-z1Kfs1rrc1ww3V4D5Qhp

In [7]:
# ********* Hyper-parameters configurations ********* #

# Fix your seed
seed = 66
np.random.seed(seed)

# List of emotions you are going to use, in alphabetical order
# (the same order LabelBinarizer uses for its classes)
emotion_categories = ['anger', 'disgust', 'fear', 'joy', 'sadness', 'surprise']
num_categories = len(emotion_categories)

# Word and Hash-emo embedding dimension
dimension = 100

# Lexical feature dimension
feature_dimension = 29

filters = [128, 128, 128, 128]
# prevent overfitting
dropout_rates = [0.5, 0.5, 0.5, 0.5]
# imitate n-gram
kernel_sizes = [1, 2, 3, 1]
hidden = [200, 100, 10]

epochs = 5
batch_size = 64

embedding_dir = 'glove.twitter.27B.100d.txt'

In [8]:
# ********* Helper Functions ********* #
def char_is_emoji(character):
    #return character in emoji.UNICODE_EMOJI
    return emoji.is_emoji(character[0])  #check out the API for emoji here: https://carpedm20.github.io/emoji/docs/api.html#

def text_has_emoji(text):
    for character in text:
        if char_is_emoji(character):
            return True
    return False

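def clean_tweets(texts):
    cleaned_tweets = []
    hash_emos = []

    # Create the tokenizer once instead of once per tweet
    tokenizer = TweetTokenizer(strip_handles=False, reduce_len=True)
    # Only needed if the (commented-out) lemmatize step below is re-enabled
    lemmatizer = WordNetLemmatizer()

    for text in texts:
        hash_emo = []
        # Collapse runs of '!' or '?' into marker tokens
        # (raw strings avoid invalid-escape warnings for '\?')
        text = re.sub(r'(!){2,}', ' <!repeat> ', text)
        text = re.sub(r'(\?){2,}', ' <?repeat> ', text)

        # Tokenize using tweet tokenizer
        tokens = tokenizer.tokenize(text.lower())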


        # Emojis and emoticons
        if text_has_emoji(text):
            temp = []
            for word in tokens:
                if char_is_emoji(word):
                    hash_emo.append(emoji.demojize(word))
                elif word in emoticons:
                    hash_emo.append(word)
                else:
                    temp.append(word)
            tokens = temp

        # Hashtags
        temp = []
        for word in tokens:
            if '#' in word:
                word = word.replace('#','')
                hash_emo.append(word)
            else:
                temp.append(word)
        tokens = temp

        # Replace slangs and negated words
        temp = []
        for word in tokens:
            if word in slangs:
                temp += slangs[word].split()
            elif word in negated:
                temp += negated[word].split()
            else:
                temp.append(word)
        tokens = temp

        # Replace user names
        tokens = ['<user>'  if '@' in word else word for word in tokens]

        #Replace numbers
        tokens = ['<number>' if word.isdigit() else word for word in tokens]

        # Remove urls
        tokens = ['' if 'http' in word else word for word in tokens]

        # Lemmatize
        #tokens = [lemmatizer.lemmatize(word) for word in tokens]

        # Remove stop words
        tokens = [word for word in tokens if word not in stopwords]

        # Remove tokens having length 1
        tokens = [word for word in tokens if word != '' and len(word) > 1]

        cleaned_tweets.append(tokens)
        hash_emos.append(hash_emo)

    return cleaned_tweets, hash_emos
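
As a quick check of the first cleaning step, the sketch below applies the same two substitutions to made-up strings. It splits on whitespace for display only, not with `TweetTokenizer`, and `mark_repeats` is a hypothetical helper name.

```python
import re

def mark_repeats(text):
    """Replace runs of 2+ '!' or '?' with marker tokens, as clean_tweets() does."""
    text = re.sub(r'(!){2,}', ' <!repeat> ', text)
    text = re.sub(r'(\?){2,}', ' <?repeat> ', text)
    return text

print(mark_repeats("really??? wow!!").split())
# ['really', '<?repeat>', 'wow', '<!repeat>']
print(mark_repeats("one mark only!").split())
# ['one', 'mark', 'only!']
```

A single `!` or `?` is left untouched; only runs of two or more are turned into a marker.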

In [9]:
# This function returns an n-dimensional feature vector (n = feature_dimension) for each tweet
def feature_generation(texts, hashtags):
    feature_vectors = []

    for i in range(len(texts)):
        feats = [0] * feature_dimension
        for word in texts[i]:
            # Warriner et al.
            if word in ratings:
                feats[0] += ratings[word]['Valence']
                feats[1] += ratings[word]['Arousal']
                feats[2] += ratings[word]['Dominance']

            # Vader Sentiment
            polarity_scores = analyzer.polarity_scores(word)
            feats[3] += polarity_scores['pos']
            feats[4] += polarity_scores['neg']
            feats[5] += polarity_scores['neu']

            # NRC Emotion
            if word in nrc_emotion:
                feats[6] += nrc_emotion[word]['anger']
                feats[7] += nrc_emotion[word]['disgust']
                feats[8] += nrc_emotion[word]['fear']
                feats[9] += nrc_emotion[word]['joy']
                feats[10] += nrc_emotion[word]['sadness']
                feats[11] += nrc_emotion[word]['surprise']

            # NRC Affect Intensity
            if word in nrc_affect_intensity:
                feats[12] += nrc_affect_intensity[word]['anger']
                feats[13] += nrc_affect_intensity[word]['disgust']
                feats[14] += nrc_affect_intensity[word]['fear']
                feats[15] += nrc_affect_intensity[word]['joy']
                feats[16] += nrc_affect_intensity[word]['sadness']
                feats[17] += nrc_affect_intensity[word]['surprise']

            # AFINN
            if word in afinn:
                feats[18] += float(afinn[word])

            # BingLiu and MPQA
            if word in bingliu_mpqa:
                if bingliu_mpqa[word] == 'positive':
                    feats[19] += 1
                else:
                    feats[20] += 1


        count = len(texts[i])
        if count == 0:
            count = 1
        newArray = np.array(feats)/count
        feats = list(newArray)

        # Presence of consecutive exclamation marks or question marks.
        # clean_tweets() lowercases the text, so the markers arrive as lowercase tokens.
        for word in texts[i]:
            if word == '<!repeat>':
                feats[21] = 1
            elif word == '<?repeat>':
                feats[22] = 1

        for word in hashtags[i]:
            #NRC Hashtag Emotion
            if word in nrc_hashtag_emotion:
                feats[23] += nrc_hashtag_emotion[word]['anger']
                feats[24] += nrc_hashtag_emotion[word]['disgust']
                feats[25] += nrc_hashtag_emotion[word]['fear']
                feats[26] += nrc_hashtag_emotion[word]['joy']
                feats[27] += nrc_hashtag_emotion[word]['sadness']
                feats[28] += nrc_hashtag_emotion[word]['surprise']

        feature_vectors.append(feats)
    return np.array(feature_vectors)
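
The core idea of `feature_generation()` is to sum lexicon scores over the tokens of a tweet and then divide by the number of tokens. A minimal sketch with a made-up two-word lexicon (`toy_lexicon` and `average_scores` are hypothetical names used only for this illustration):

```python
# Hypothetical two-word lexicon with scores in [0, 1]
toy_lexicon = {"happy": {"joy": 1.0, "sadness": 0.0},
               "gloomy": {"joy": 0.0, "sadness": 1.0}}

def average_scores(tokens, lexicon, keys=("joy", "sadness")):
    """Sum each score over tokens found in the lexicon, then divide by tweet length."""
    totals = [0.0] * len(keys)
    for token in tokens:
        if token in lexicon:
            for j, key in enumerate(keys):
                totals[j] += lexicon[token][key]
    count = len(tokens) or 1   # avoid dividing by zero for empty tweets
    return [t / count for t in totals]

print(average_scores(["happy", "happy", "day", "gloomy"], toy_lexicon))  # [0.5, 0.25]
print(average_scores([], toy_lexicon))                                   # [0.0, 0.0]
```

Dividing by the tweet length, rather than by the number of matched words, means a tweet with many neutral words gets a weaker emotion signal.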

In [10]:
def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines) #https://stackoverflow.com/questions/51956000/what-does-keras-tokenizer-method-exactly-do
    return tokenizer

In [11]:
def max_length(lines):
    return max([len(s) for s in lines])

In [12]:
def encode_text(tokenizer, lines, length):
    encoded = tokenizer.texts_to_sequences(lines) #https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer
    padded = pad_sequences(encoded, maxlen=length, padding='post')
    return padded

In [13]:
tokenizertest = Tokenizer()
tokenizertest.fit_on_texts(["this is a great day"])
sequences = tokenizertest.texts_to_sequences(["this is a day", "this is a good day"])
print(sequences)

[[1, 2, 3, 5], [1, 2, 3, 5]]
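
Both sentences encode to the same sequence because "good" was not seen when the tokenizer was fitted, and a `Tokenizer` without an out-of-vocabulary token skips unknown words. The plain-Python sketch below mimics that behaviour and the post-padding done in `encode_text()`. `encode` and `pad_post` are hypothetical stand-ins, not Keras functions, and `pad_post` assumes the sequence is no longer than the target length.

```python
def encode(words, word_index):
    """Map known words to their indices; unknown words are skipped."""
    return [word_index[w] for w in words if w in word_index]

def pad_post(seq, length):
    """Right-pad with 0, the index reserved for padding (assumes len(seq) <= length)."""
    return seq + [0] * (length - len(seq))

# Indices assigned in order of appearance for this one-sentence toy corpus
word_index = {w: i + 1 for i, w in enumerate("this is a great day".split())}

seq = encode("this is a good day".split(), word_index)
print(seq)               # [1, 2, 3, 5]
print(pad_post(seq, 6))  # [1, 2, 3, 5, 0, 0]
```

Index 0 is never assigned to a word, which is why the vocabulary size used for the embedding layer is `len(word_index) + 1`.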


In [14]:
print("Cleaning Data...")
cleaned_tweets, hash_emos = clean_tweets(texts)
print("Cleaning Completed!")

Cleaning Data...
Cleaning Completed!


In [15]:
print("Generating Features...")
features = feature_generation(cleaned_tweets, hash_emos)
print("Feature Generation Completed!")

Generating Features...
Feature Generation Completed!


In [16]:
print("Encoding Data...")
# For Tweet Matrix
tokenizer_tweets = create_tokenizer(cleaned_tweets)
max_tweet_length = max_length(cleaned_tweets)
vocab_size = len(tokenizer_tweets.word_index) + 1
print('Vocabulary size: %d' % vocab_size)
X = encode_text(tokenizer_tweets, cleaned_tweets, max_tweet_length)

Encoding Data...
Vocabulary size: 24671


In [17]:
# For Hash-Emo Matrix
tokenizer_hash_emo = create_tokenizer(hash_emos)
max_hash_emo_length = max_length(hash_emos)
vocab_size_hash_emo = len(tokenizer_hash_emo.word_index) + 1
print('Vocabulary size (Hash-Emos): %d' % vocab_size_hash_emo)
encoded_hash_emo = encode_text(tokenizer_hash_emo, hash_emos, max_hash_emo_length)

Vocabulary size (Hash-Emos): 3533


In [18]:
# Labels
lb = LabelBinarizer() #https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html
lb.fit(labels)
Y = lb.transform(labels)
print("Encoding Completed!")


Encoding Completed!
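
`LabelBinarizer` turns each label into a one-hot row whose columns follow the sorted class names (available as `lb.classes_`). The plain-Python sketch below illustrates the same idea and how a prediction is decoded back to a name; the labels and scores are made up for illustration.

```python
emotions = ['joy', 'anger', 'joy', 'fear']   # made-up labels
classes = sorted(set(emotions))              # ['anger', 'fear', 'joy']

def one_hot(label):
    """Return a row with a 1 in the column of the given class."""
    return [1 if c == label else 0 for c in classes]

Y_toy = [one_hot(e) for e in emotions]
print(Y_toy[0])   # [0, 0, 1]

# Decoding: the position of the largest score picks the class name
scores = [0.1, 0.7, 0.2]                     # made-up softmax output
predicted = classes[scores.index(max(scores))]
print(predicted)  # fear
```

The same lookup applies to the real model: an argmax index `i` over the softmax output corresponds to `lb.classes_[i]`.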


In [19]:
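# Load embedding
# Note: this repeats the loading step from the GloVe download cell above; it can be
# skipped if embeddings_index is already in memory.
print("Loading word embeddings...")
embeddings_index = dict()
with open(embedding_dir, encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs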
print('Loaded %s word vectors.' % len(embeddings_index))


Loading word embeddings...
Loaded 1193514 word vectors.


In [20]:
# Generate embedding matrices
print("Generating embedding matrices...")
tweet_matrix = zeros((vocab_size, dimension))
for word, i in tokenizer_tweets.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        tweet_matrix[i] = embedding_vector
    else:
        # Words missing from GloVe get a random vector
        tweet_matrix[i] = np.random.uniform(low=-1, high=1, size=(dimension,))
tweet_matrix[1]

Generating embedding matrices...


array([ 0.63006002,  0.65177   ,  0.25545001,  0.018593  ,  0.043094  ,
        0.047194  ,  0.23218   ,  0.11613   ,  0.17371   ,  0.40487   ,
        0.022524  , -0.076731  , -2.29110003,  0.094127  ,  0.43292999,
        0.041801  ,  0.063175  , -0.64486003, -0.43656999,  0.024114  ,
       -0.082989  ,  0.21686   , -0.13462   , -0.22336   ,  0.39436001,
       -2.1724    , -0.39544001,  0.16536   ,  0.39438   , -0.35181999,
       -0.14996   ,  0.10502   , -0.45936999,  0.27728999,  0.89240003,
       -0.042313  , -0.009345  ,  0.55017   ,  0.095521  ,  0.070504  ,
       -1.17809999,  0.013723  ,  0.17742001,  0.74141997,  0.17715999,
        0.038468  , -0.31683999,  0.08941   ,  0.20557   , -0.34327999,
       -0.64302999, -0.87800002, -0.16293   , -0.055925  ,  0.33897999,
        0.60663998, -0.27739999,  0.33625999,  0.21603   , -0.11051   ,
        0.0058673 , -0.64757001, -0.068222  , -0.77414   ,  0.13911   ,
       -0.15851   , -0.61884999, -0.10192   , -0.47      ,  0.19

In [21]:
hash_emo_matrix = zeros((vocab_size_hash_emo, dimension))

for word, i in tokenizer_hash_emo.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        hash_emo_matrix[i] = embedding_vector
    else:
        # Hashtags, emojis, and emoticons missing from GloVe get a random vector
        hash_emo_matrix[i] = np.random.uniform(low=-1, high=1, size=(dimension,))
hash_emo_matrix[1]

array([-4.60200012e-01, -1.83950007e-01,  1.37390003e-01,  2.15160009e-02,
       -1.22000001e-01,  2.93669999e-01,  3.08589995e-01,  3.32630007e-03,
       -8.08589995e-01,  4.45800006e-01, -1.38600007e-01, -8.68040025e-01,
       -4.18139982e+00, -1.53659999e-01, -4.63270009e-01,  4.71199989e-01,
       -2.27579996e-01, -2.09279999e-01,  3.24140012e-01,  3.86680007e-01,
        2.10630000e-01, -2.42760003e-01, -1.28959998e-01,  3.17400008e-01,
       -1.60490006e-01,  6.44840002e-01, -4.66430008e-01,  9.82540026e-02,
        5.34170009e-02,  1.77709997e-01,  1.75960004e-01, -1.85690001e-01,
       -7.76479989e-02,  9.86459970e-01,  8.71779956e-03, -1.73179999e-01,
        6.41369969e-02,  1.77169994e-01,  1.39709994e-01,  1.64010003e-01,
       -1.04130006e+00, -2.06540003e-01, -4.04619984e-02,  1.92770008e-02,
        2.62919992e-01, -3.42739999e-01, -3.44020009e-01,  5.51840007e-01,
       -1.03840005e+00,  1.40499997e+00, -1.71939999e-01,  2.04270005e-01,
       -2.39350006e-01,  

In [22]:
# ********* Model ********* #


class TestCallback(Callback):
    def __init__(self, test_data):
        self.test_data = test_data
        self.accs = []

    def on_epoch_end(self, epoch, logs=None):
        x, y = self.test_data
        loss, acc = self.model.evaluate(x, y, verbose=0)
        self.accs.append(acc)
        print('\nTesting loss: {}, acc: {}\n'.format(loss, acc))

def model(max_tweet_length, max_hash_emo_length, vocab_size, vocab_size_hash_emo, tweet_matrix, hash_emo_matrix, dimension, feature_dimension, num_categories, train_embedding):

    # Channel 1 - tweet text
    inputs1 = Input(shape=(max_tweet_length,)) #Input() is used to instantiate a Keras tensor. https://keras.io/api/layers/core_layers/input/
    embedding1 = Embedding(vocab_size, dimension, weights=[tweet_matrix], trainable=train_embedding)(inputs1) # Embedding() is used as the first layer https://keras.io/api/layers/core_layers/embedding/#embedding-class

    # unigram
    conv1 = Conv1D(filters=filters[0], kernel_size=kernel_sizes[0], activation='relu')(embedding1)
    drop1 = Dropout(dropout_rates[0])(conv1) #The Dropout layer randomly sets input units to 0 with a frequency of rate at each step during training time, which helps prevent overfitting.
    pool1 = GlobalMaxPooling1D()(drop1)

    # bigram
    conv2 = Conv1D(filters=filters[1], kernel_size=kernel_sizes[1], activation='relu')(embedding1)
    drop2 = Dropout(dropout_rates[1])(conv2)
    pool2 = GlobalMaxPooling1D()(drop2)

    #trigram
    conv3 = Conv1D(filters=filters[2], kernel_size=kernel_sizes[2], activation='relu')(embedding1)
    drop3 = Dropout(dropout_rates[2])(conv3)
    pool3 = GlobalMaxPooling1D()(drop3)

    # Channel 2 - hashtag, emoji, and emoticon
    inputs2 = Input(shape=(max_hash_emo_length,))
    embedding2 = Embedding(vocab_size_hash_emo, dimension, weights=[hash_emo_matrix], trainable=train_embedding)(inputs2)
    conv4 = Conv1D(filters=filters[3], kernel_size=kernel_sizes[3], activation='relu')(embedding2)
    drop4 = Dropout(dropout_rates[3])(conv4)
    pool4 = GlobalMaxPooling1D()(drop4)

    # Lexical features
    features = Input(shape=(feature_dimension,))

    #It takes as input a list of tensors and returns a single tensor that is the concatenation of all inputs. https://keras.io/api/layers/merging_layers/concatenate/
    merged = concatenate([pool1, pool2, pool3, pool4, features])

    dense1 = Dense(hidden[0], activation='relu')(merged) #https://keras.io/api/layers/core_layers/dense/
    dense2 = Dense(hidden[1], activation='relu')(dense1)
    dense3 = Dense(hidden[2], activation='relu')(dense2)
    outputs = Dense(num_categories, activation='softmax')(dense3)

    model = Model(inputs=[inputs1, inputs2, features], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    #print(model.summary())
    #plot_model(model, show_shapes=True, to_file='multichannel.png')
    model.summary()
    return model
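
To see how the channels fit together, the sketch below works through the shape arithmetic with the hyper-parameters defined earlier. It assumes the Keras default `padding='valid'` for `Conv1D`; the sequence length is an illustrative value and `conv_positions` is a hypothetical helper.

```python
filters = [128, 128, 128, 128]
kernel_sizes = [1, 2, 3, 1]
feature_dimension = 29
seq_len = 29   # illustrative padded tweet length

def conv_positions(seq_len, kernel_size):
    """Number of windows a 'valid' 1-D convolution slides over."""
    return seq_len - kernel_size + 1

for k in kernel_sizes[:3]:
    print(k, conv_positions(seq_len, k))   # prints: 1 29, then 2 28, then 3 27

# Global max pooling keeps one value per filter, regardless of window count,
# so the merged vector is the four pooled outputs plus the lexical features.
merged_width = sum(filters) + feature_dimension
print(merged_width)  # 541
```

Because pooling removes the position axis, the unigram, bigram, and trigram branches can be concatenated even though they produce different numbers of windows.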

In [23]:
kf = KFold(n_splits=10, shuffle=True, random_state=seed)
kf.get_n_splits(len(labels))

accuracies = []
counter = 1
for train, test in kf.split(labels):
    print('Fold#', counter)
    counter += 1
    model_GloVe = model(max_tweet_length,
                       max_hash_emo_length,
                       vocab_size,
                       vocab_size_hash_emo,
                       tweet_matrix,
                       hash_emo_matrix,
                       dimension,
                       feature_dimension,
                       num_categories,
                       True)
    testObj = TestCallback(([X[test], encoded_hash_emo[test], features[test]], Y[test])) ### print performance for each epoch

    #earlystop = EarlyStopping(monitor='val_acc', min_delta=0.0001, patience=3, verbose=1, mode='auto')
    model_GloVe.fit([X[train], encoded_hash_emo[train], features[train]],
                    array(Y[train]),
                    epochs=epochs,
                    batch_size=batch_size,
                    callbacks=[testObj],
                    verbose = 1)
    scores = model_GloVe.evaluate([X[test], encoded_hash_emo[test], features[test]], Y[test], verbose=0)
    print("%s: %.2f%%" % (model_GloVe.metrics_names[1], scores[1]*100)) ### print last epoch's performance
    index, value = max(enumerate(testObj.accs), key=operator.itemgetter(1))
    accuracies.append(value)
    break  # only the first fold is run; remove this line to run all 10 folds

print(accuracies)
print(np.mean(accuracies))

Fold# 1
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_1 (InputLayer)        [(None, 29)]                 0         []                            
                                                                                                  
 input_2 (InputLayer)        [(None, 13)]                 0         []                            
                                                                                                  
 embedding (Embedding)       (None, 29, 100)              2467100   ['input_1[0][0]']             
                                                                                                  
 embedding_1 (Embedding)     (None, 13, 100)              353300    ['input_2[0][0]']             
                                                                                      

In [24]:
###### to use this model to identify emotions ######
predictiontexts = []
!wget https://www.dropbox.com/s/jcq50uf9r4xqbzu/TestTweet.txt

with open('TestTweet.txt', 'r', encoding="utf8") as f:
    for line in f:
        splitted = line.strip().split()
        predictiontexts.append(' '.join(splitted[1:len(splitted)-2]))

cleaned_test_tweets, hash_test_emos = clean_tweets(predictiontexts)

test_features = feature_generation(cleaned_test_tweets, hash_test_emos)
test_X = encode_text(tokenizer_tweets, cleaned_test_tweets, max_tweet_length)
test_encoded_hash_emo = encode_text(tokenizer_hash_emo, hash_test_emos, max_hash_emo_length)
results = model_GloVe.predict([test_X, test_encoded_hash_emo, test_features])
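predicted_label = [np.argmax(r) for r in results]
print(predicted_label)
# Map class indices back to emotion names (LabelBinarizer stores its classes in sorted order)
print([lb.classes_[i] for i in predicted_label])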


--2024-01-11 16:55:40--  https://www.dropbox.com/s/jcq50uf9r4xqbzu/TestTweet.txt
Resolving www.dropbox.com (www.dropbox.com)... 162.125.80.18, 2620:100:601f:18::a27d:912
Connecting to www.dropbox.com (www.dropbox.com)|162.125.80.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/raw/jcq50uf9r4xqbzu/TestTweet.txt [following]
--2024-01-11 16:55:41--  https://www.dropbox.com/s/raw/jcq50uf9r4xqbzu/TestTweet.txt
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uca1f5c4c526c022870aa0b7ef9f.dl.dropboxusercontent.com/cd/0/inline/CLK_hwSWumgwvUCulQVhAROv2RHTlhcKOKci00KzL7E0dZNT-fZU8jv0zkR9WMB0A5rWtN8f3SOxm3euRKK-_tD7yBaZGKOze7zHiEAUy4f1mMNm6thWXVhYf7HKE-U5W-uMVd95TpbCUMC0noKi8Hue/file# [following]
--2024-01-11 16:55:41--  https://uca1f5c4c526c022870aa0b7ef9f.dl.dropboxusercontent.com/cd/0/inline/CLK_hwSWumgwvUCulQVhAROv2RHTlhcKOKci00KzL7E0dZNT-fZU8jv0zkR9WMB0A5rWtN8f3SOxm3euRKK-_tD7yBaZGKOze