# Emojify

This project implements "Emojify" with PyTorch.

- Input: Sentences 
- Output: Emoji (cast as numerical labels)🤔

For example:
Food is life 🍴



## Acknowledgement

Some ideas and the structure of the neural network come from [Coursera Deep Learning Specialization](https://www.coursera.org/specializations/deep-learning).

The dataset can be downloaded from [here](https://drive.google.com/drive/folders/1vXgzjhALvH981cNYZwlQ1wZJ_NE_Xd44?usp=sharing).

## 1. Import Packages

In [1]:
import numpy as np
import pandas as pd
import os
import os.path as osp

# import pytorch packages
import torch
from torch.autograd import Variable
import torch.nn as nn
from torch.utils import data

from tqdm.auto import tqdm
import emoji

## 2. Prepare the dataset

Preview the dataset. The original datasets are in CSV format. The first column, '0', holds the sentences and the second column, '1', holds the labels. Each ground-truth label can be cast to an emoji.

In [2]:
data_root = 'data'
train_name = 'train_emoji.csv'
test_name = 'tess.csv'

# preview the dataset
dataset_preview = pd.read_csv(osp.join(data_root, train_name), header = None)
dataset_preview.head()

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | never talk to me again | 3 |  |  |
| 1 | I am proud of your achievements | 2 |  |  |
| 2 | It is the worst day in my life | 3 |  |  |
| 3 | Miss you so much | 0 |  | [0] |
| 4 | food is life | 4 |  |  |


Cast the labels to real emojis.🤔 

In [3]:
def to_emoji(emoji_dict, label):
    """ Cast a numerical label to the emoji
    """
    
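    # use_aliases is the pre-2.0 API of the emoji package; emoji >= 2.0 expects language='alias' instead
    emoji_new = emoji.emojize(emoji_dict[label],use_aliases=True)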
    
    return emoji_new

emoji_dictionary = {0: "\u2764\uFE0F",    # :heart: prints a black instead of red heart depending on the font
                    1: ":baseball:",
                    2: ":smile:",
                    3: ":disappointed:",
                    4: ":fork_and_knife:"}


print('Check the labels and their corresponding emojis:\n')
for label in emoji_dictionary.keys():

    print('Label:{}, and its corresponding emoji:{}'.format(label, to_emoji(emoji_dictionary, label)))
    
print('\nPreview dataset:')
for i in range(5):
    print(dataset_preview[0][i], to_emoji(emoji_dictionary, dataset_preview[1][i]))

Check the labels and their corresponding emojis:

Label:0, and its corresponding emoji:❤️
Label:1, and its corresponding emoji:⚾
Label:2, and its corresponding emoji:😄
Label:3, and its corresponding emoji:😞
Label:4, and its corresponding emoji:🍴

Preview dataset:
never talk to me again 😞
I am proud of your achievements 😄
It is the worst day in my life 😞
Miss you so much ❤️
food is life 🍴
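
The preview above relies on two things: the sentence sits in column 0 and the integer label in column 1, and each label maps to one emoji. A minimal sketch of that mapping, using only the standard library, is shown below. The literal emoji characters replace `emoji.emojize` (an assumption made so the sketch does not depend on the `emoji` package), and `load_examples` is a hypothetical helper, not part of the notebook.

```python
import csv
import io

# Sample rows copied from the preview above; the trailing empty columns
# mirror the extra columns present in the CSV file.
SAMPLE_CSV = """never talk to me again,3,,
I am proud of your achievements,2,,
It is the worst day in my life,3,,
Miss you so much,0,,[0]
food is life,4,,
"""

# Literal emoji characters stand in for emoji.emojize (assumption).
LABEL_TO_EMOJI = {0: "❤️", 1: "⚾", 2: "😄", 3: "😞", 4: "🍴"}


def load_examples(text):
    """Return (sentence, label) pairs from CSV text, ignoring extra columns."""
    examples = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        sentence, label = row[0].strip(), int(row[1])
        examples.append((sentence, label))
    return examples


examples = load_examples(SAMPLE_CSV)
for sentence, label in examples:
    print(sentence, LABEL_TO_EMOJI[label])
```

Only the first two columns are read, so stray values in the later columns (such as the `[0]` in the fourth row) do not affect the result.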


Build the dataset class.

In [4]:
class emoji_dataset(data.Dataset):
    
    def __init__(self, data_root, data_name):
        """Read the csv file; column 0 holds the sentences and column 1 the labels.
        """
        
        self.dataset = pd.read_csv(osp.join(data_root, data_name), header = None)
        self.length = len(self.dataset)
        self.data = self.dataset[0]
        self.labels = self.dataset[1]
        
    def __len__(self):
        return self.length
    
    def get_data(self,index):
        """return the sentence at the given index.
        """
        return self.data[index]
    
    def get_label(self,index):
        """return the label at the given index.
        """
        return self.labels[index]
    
    def __getitem__(self, index):
        
        X = self.data[index]
        y = self.labels[index]
        
        return X, y
    


In [5]:
train_dataset = emoji_dataset(data_root, train_name)
test_dataset = emoji_dataset(data_root, test_name)

print(train_dataset[10][0], to_emoji(emoji_dictionary, train_dataset[10][1]))

print('Length of training examples:{} \nlength of test examples:{}'.format(len(train_dataset), len(test_dataset)))

she did not answer my text  😞
Length of training examples:132 
length of test examples:56


## 2. Create Dataloader

In [6]:
batch_size = 32

train_loader = torch.utils.data.DataLoader(dataset = train_dataset,
                                          shuffle = True,
                                           batch_size = batch_size,
                                          )

test_loader = torch.utils.data.DataLoader(dataset = test_dataset,
                                          shuffle = False,
                                           batch_size = 8,
                                          )
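
With 132 training and 56 test examples, the batch sizes above determine how many batches each loader yields per epoch. The sketch below works this out in plain Python, assuming the loader keeps the last partial batch (the `DataLoader` default, `drop_last=False`). `batch_sizes` is a hypothetical helper used only for illustration.

```python
def batch_sizes(num_examples, batch_size):
    """Sizes of the batches yielded per epoch when the last partial batch is kept."""
    full, rest = divmod(num_examples, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


train_batches = batch_sizes(132, 32)
test_batches = batch_sizes(56, 8)

print(len(train_batches), train_batches)  # 5 [32, 32, 32, 32, 4]
print(len(test_batches), test_batches)    # 7 [8, 8, 8, 8, 8, 8, 8]
```

Five training batches per epoch is consistent with the `iters` counter in the training log, which advances by 5 each epoch.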

## 3. Preprocess the dataset

Several helper functions are needed to preprocess the dataset.

### 3.1 Read the GloVe
Read the global vectors for word representation file. Get the word embeddings and word index.

In [7]:


name = '/home/sh2439/pytorch_tutorials/Sequence Model/Week 2/Word Vector Representation/glove.6B.50d.txt'

# Read the GloVe text file and return the words.
def read_glove(name):
    """Given the path/name of the glove file, return the words(set), word2vec_map(a python dict),
        word2index(a python dict), index2word(a python dict).
    """
    # Create a set for the words and a dictionary mapping each word to its embedding vector
    words = set()
    word2vec_map = {}
    
    # the context manager closes the file once all lines are read
    with open(name, 'r', encoding = 'utf-8') as file:
        for line in file:
            # the first token is the word, the remaining tokens are its vector
            values = line.split()
            word = values[0]
            words.add(word)
            
            word2vec_map[word] = np.array(values[1:], dtype = np.float64)
        
    # indices start at 1 so that index 0 stays free for padding
    # note: a set has no fixed order, so the indices can differ between runs
    i = 1
    word2index = {}
    index2word = {}
    for word in words:
        word2index[word] = i
        index2word[i] = word
        i = i+1
        
    return words, word2vec_map, word2index, index2word

In [8]:
words, word2vec_map, word2index, index2word = read_glove(name)


In [9]:
word = 'hello'
index = 121098
print('Index of word',word, '=', word2index[word])
print('Word of index',index, '=', index2word[index])
print('Embedding vector of word', word, '=', word2vec_map[word])

Index of word hello = 60101
Word of index 121098 = h2co3
Embedding vector of word hello = [-0.38497   0.80092   0.064106 -0.28355  -0.026759 -0.34532  -0.64253
 -0.11729  -0.33257   0.55243  -0.087813  0.9035    0.47102   0.56657
  0.6985   -0.35229  -0.86542   0.90573   0.03576  -0.071705 -0.12327
  0.54923   0.47005   0.35572   1.2611   -0.67581  -0.94983   0.68666
  0.3871   -1.3492    0.63512   0.46416  -0.48814   0.83827  -0.9246
 -0.33722   0.53741  -1.0616   -0.081403 -0.67111   0.30923  -0.3923
 -0.55002  -0.68827   0.58049  -0.11626   0.013139 -0.57654   0.048833
  0.67204 ]
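
Each line of a GloVe text file holds a word followed by its vector components, separated by spaces. The sketch below parses a few made-up lines in that format. It assigns indices in sorted order, which is a swapped-in choice rather than the notebook's approach: iterating a Python `set` of strings can give a different order on each interpreter run, so sorting keeps the indices reproducible. `parse_glove_lines` and the three-dimensional sample vectors are hypothetical.

```python
def parse_glove_lines(lines):
    """Build word2vec_map, word2index and index2word from GloVe-formatted lines."""
    word2vec_map = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip empty lines
        word2vec_map[parts[0]] = [float(v) for v in parts[1:]]

    # Sorting fixes the word order, so the indices are identical on every run.
    # Index 0 is left unused so that it can serve as the padding index.
    word2index = {w: i for i, w in enumerate(sorted(word2vec_map), start=1)}
    index2word = {i: w for w, i in word2index.items()}
    return word2vec_map, word2index, index2word


# Made-up 3-dimensional vectors, for illustration only.
sample = ["hello 0.1 0.2 0.3", "food -0.5 0.0 0.25", "life 1.0 -1.0 0.5"]
vectors, w2i, i2w = parse_glove_lines(sample)

print(w2i)               # {'food': 1, 'hello': 2, 'life': 3}
print(i2w[2])            # hello
print(vectors["food"])   # [-0.5, 0.0, 0.25]
```

Reproducible indices matter when a saved embedding layer is reloaded in a new session, because the stored rows only line up with `word2index` if the word order is the same.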


### 3.2 Convert the sentences input to indices

Define a helper function to convert sentence inputs to numerical inputs.  
If a sentence is shorter than the maximum length, pad it with 0; if it is longer, truncate it.

In [10]:
def to_index(sentences, word2index, max_length):
    """ Given the word2index dict, maximum length, and inputs, return the numerical inputs
    """
    num = len(sentences)
    out = torch.zeros(num, max_length).long()
    
    for idx, sen in enumerate(sentences):
        
        sen = sen.lower().split()
        
        j = 0
        
        for word in sen:
            word_idx = word2index[word]
            out[idx, j] = word_idx
            j += 1
            
            if j >= max_length:
                break
            
    return out

In [11]:
out = to_index(train_dataset.data, word2index, max_length = 5)
print('The input tensor of 1st sentence: ', out[0])
print('The input tensor of 4th sentence: ', out[3])

The input tensor of 1st sentence:  tensor([228682, 340541,  34521,  58080,  11905])
The input tensor of 4th sentence:  tensor([143666, 194081, 268164, 196552,      0])
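
The conversion above looks every word up directly, so a word that is missing from the GloVe vocabulary would raise a `KeyError`. The sketch below shows the same pad-and-truncate logic on plain Python lists with a fallback index for unknown words. Mapping unknown words to 0 is an assumption, and `to_index_lists`, `unk_index` and the toy vocabulary are hypothetical names introduced for illustration.

```python
def to_index_lists(sentences, word2index, max_length, unk_index=0):
    """Convert sentences to fixed-length index lists, padded with 0."""
    rows = []
    for sentence in sentences:
        # lower-case, split on whitespace, and truncate to max_length tokens
        tokens = sentence.lower().split()[:max_length]
        # unknown words fall back to unk_index instead of raising KeyError
        ids = [word2index.get(tok, unk_index) for tok in tokens]
        # pad short sentences with 0 up to max_length
        ids += [0] * (max_length - len(ids))
        rows.append(ids)
    return rows


toy_word2index = {"food": 1, "is": 2, "life": 3, "miss": 4, "you": 5}
rows = to_index_lists(
    ["Food is life", "Miss you so much today okay"],
    toy_word2index,
    max_length=5,
)
print(rows)  # [[1, 2, 3, 0, 0], [4, 5, 0, 0, 0]]
```

The resulting nested list has one row per sentence and could be passed to `torch.tensor(rows)` to obtain a tensor with the same layout as `out`.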


### 3.3 Create embedding weights

In [12]:
# one row per word plus row 0, which stays all zeros and serves as the padding vector
emb_weights = torch.zeros(len(word2vec_map)+1, 50)
for word, idx in word2index.items():
    emb_weights[idx,:] = torch.tensor(word2vec_map[word])

print('Size of the weights:', emb_weights.size())

Size of the weights: torch.Size([400001, 50])
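
The weight matrix has one more row than there are words (400001 rows for 400000 words) because row 0 is reserved for padding. A minimal sketch with plain lists, assuming a made-up two-word vocabulary with two-dimensional vectors, shows how a padded index sequence turns into vectors. `build_embedding_rows` is a hypothetical helper.

```python
def build_embedding_rows(word2vec_map, word2index, dim):
    """Return a list of rows where row i is the vector of the word with index i."""
    # row 0 is never assigned, so it stays a zero vector (padding)
    rows = [[0.0] * dim for _ in range(len(word2index) + 1)]
    for word, idx in word2index.items():
        rows[idx] = list(word2vec_map[word])
    return rows


toy_vectors = {"food": [0.5, -0.5], "life": [1.0, 0.25]}
toy_index = {"food": 1, "life": 2}

rows = build_embedding_rows(toy_vectors, toy_index, dim=2)
embedded = [rows[i] for i in [1, 2, 0]]  # "food life <pad>"

print(len(rows))  # 3
print(embedded)   # [[0.5, -0.5], [1.0, 0.25], [0.0, 0.0]]
```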


## 4. Build the model class

In [13]:
class Emoji_Net(nn.Module):
    """ The emoji net uses embedding layer, lstm layer and fully-connected layer.
    """
    def __init__(self,layer_num,input_dim, hidden_dim, output_dim, weights):
        super(Emoji_Net, self).__init__()
        self.input_dim = input_dim
        self.layer_num = layer_num
        self.output_dim = output_dim
        self.hidden_dim = hidden_dim
        
        # the embedding layer
        weights = weights.to(device)
        self.embedding = nn.Embedding.from_pretrained(weights)
        
        # the lstm layer
        self.lstm = nn.LSTM(input_size = self.input_dim, hidden_size = self.hidden_dim, 
                            num_layers = self.layer_num, batch_first = True, dropout = 0.8,bidirectional = True)
        self.dropout = nn.Dropout(0.6)
        
        # the output layer
        self.fc = nn.Linear(hidden_dim*2, output_dim)
        
    def forward(self, x):
        
        
        # h0
        h0 = Variable(torch.zeros(2*self.layer_num, x.size(0), self.hidden_dim)).to(device)
        
        # c0
        c0 = Variable(torch.zeros(2*self.layer_num, x.size(0), self.hidden_dim)).to(device)
        
        # embedding
        x = self.embedding(x)
        # lstm
        x, (hn ,cn) = self.lstm(x, (h0, c0))
        x = self.dropout(x)
        
        # output layer
        x = self.fc(x[:, -1, :])
        
        return x
        

Instantiate the model class.

In [14]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')


layer_num = 2
input_dim = 50
hidden_dim = 128
output_dim = 5


emoji_net = Emoji_Net(layer_num, input_dim,hidden_dim, output_dim, emb_weights)

emoji_net.to(device)

Emoji_Net(
  (embedding): Embedding(400001, 50)
  (lstm): LSTM(50, 128, num_layers=2, batch_first=True, dropout=0.8, bidirectional=True)
  (dropout): Dropout(p=0.6)
  (fc): Linear(in_features=256, out_features=5, bias=True)
)
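
The `hidden_dim*2` input size of the output layer and the `2*layer_num` leading dimension of `h0` and `c0` both follow from the LSTM being bidirectional. The sketch below computes the tensor shapes documented for `nn.LSTM` with `batch_first=True`, without running PyTorch. `lstm_shapes` is a hypothetical helper, and the batch size of 32 and sequence length of 15 are taken from the training setup.

```python
def lstm_shapes(batch, seq_len, hidden_dim, num_layers, bidirectional=True):
    """Shapes of an LSTM's output and final states when batch_first=True."""
    directions = 2 if bidirectional else 1
    return {
        # per-step outputs concatenate the forward and backward directions
        "output": (batch, seq_len, directions * hidden_dim),
        # final states keep the layer/direction axis first, even with batch_first=True
        "h_n": (directions * num_layers, batch, hidden_dim),
        "c_n": (directions * num_layers, batch, hidden_dim),
    }


shapes = lstm_shapes(batch=32, seq_len=15, hidden_dim=128, num_layers=2)
print(shapes["output"])  # (32, 15, 256)
print(shapes["h_n"])     # (4, 32, 128)
```

Taking `x[:, -1, :]` from the output therefore gives a `(32, 256)` tensor, which matches `in_features=256` of the linear layer.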

## 4. Loss and Optimizer

In [15]:
criterion = nn.CrossEntropyLoss()

learning_rate = 0.001
optimizer = torch.optim.Adam(emoji_net.parameters(), lr = learning_rate)
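
`nn.CrossEntropyLoss` takes the raw scores (logits) of the network together with integer class labels and, by default, averages the per-example losses over the batch. A minimal pure-Python sketch of that computation is given below to show what the loss values mean. `cross_entropy` is a hypothetical re-implementation for illustration, not the PyTorch code.

```python
import math


def cross_entropy(logits_batch, labels):
    """Mean of -log softmax(logits)[label] over the batch."""
    total = 0.0
    for logits, label in zip(logits_batch, labels):
        # log-sum-exp with the maximum subtracted for numerical stability
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_sum_exp - logits[label]
    return total / len(labels)


# Equal scores for all 5 classes: the loss is ln(5).
uniform = cross_entropy([[0.0] * 5], [3])
print(round(uniform, 4))  # 1.6094

# A confident, correct prediction gives a loss close to 0.
confident = cross_entropy([[10.0, 0.0, 0.0, 0.0, 0.0]], [0])
print(confident < 0.01)  # True
```

A loss near 1.6 therefore corresponds to a network that spreads its probability roughly evenly over the five emojis.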

## 5. Train the model

Define a function that saves the model whenever it reaches a new best accuracy.

In [16]:
def save_best(is_best, best_accuracy, model, epoch, path):
    """Save the epoch, model weights and accuracy to path/best_model.pth if is_best is True.
    """
    filename = osp.join(path, 'best_model.pth')
    
    if is_best:
        # create the target folder on first use
        os.makedirs(path, exist_ok = True)
        torch.save({'epoch':epoch,
                   'model_state_dict':model.state_dict(),
                    'best_accuracy':best_accuracy
                   }, filename)
        
        print(best_accuracy)

Start training...

In [17]:
num_epochs = 200

is_best = False
best_accuracy = 0

for epoch in tqdm(range(int(num_epochs))):
    emoji_net.train()
    for batch_idx, (inputs, labels) in enumerate(train_loader):
        
        
        # clear grads
        optimizer.zero_grad()
        
        inputs = to_index(inputs,word2index,max_length = 15)
        inputs = Variable(inputs).to(device)
        
        labels = Variable(labels).to(device)
        
        # forward pass
        outputs = emoji_net(inputs)
        
        # get loss
        loss = criterion(outputs, labels)
        # backward
        loss.backward()
        
        optimizer.step()
        
        if batch_idx % 5 == 0:
            # global iteration count = batches per epoch * epoch + batch index
            print('epoch: {}, iters: {}, loss: {}'.format(epoch, 
            batch_idx + epoch*np.ceil(len(train_dataset)/batch_size), loss.item()))
       

    correct = 0
    total = 0
    
    with torch.no_grad():
        emoji_net.eval()
        for batch_idx, (inputs, labels) in enumerate(test_loader):

            inputs = to_index(inputs,word2index, max_length = 15)
            inputs = Variable(inputs).to(device)
            labels = Variable(labels)

            outputs = emoji_net(inputs)
            _,preds = torch.max(outputs.data, dim=1)


            total += labels.size(0)
            correct += float((preds.cpu() == labels).sum())

        accuracy = 100* correct/total
        print( 'Epoch: {}, Test Accuracy:{}'.format(epoch, accuracy))
        
        if accuracy > best_accuracy:
            is_best = True
            best_accuracy = accuracy
            save_best(is_best, best_accuracy, emoji_net, epoch, 'models/')
        else:
            is_best = False
       


epoch: 0, iters: 0.0, loss: 1.5978573560714722
Epoch: 0, Test Accuracy:26.785714285714285
26.785714285714285
epoch: 1, iters: 5.0, loss: 1.593022346496582
Epoch: 1, Test Accuracy:26.785714285714285
epoch: 2, iters: 10.0, loss: 1.5806373357772827
Epoch: 2, Test Accuracy:26.785714285714285
epoch: 3, iters: 15.0, loss: 1.5557117462158203
Epoch: 3, Test Accuracy:26.785714285714285
epoch: 4, iters: 20.0, loss: 1.6087793111801147
Epoch: 4, Test Accuracy:26.785714285714285
epoch: 5, iters: 25.0, loss: 1.5102040767669678
Epoch: 5, Test Accuracy:26.785714285714285
epoch: 6, iters: 30.0, loss: 1.4841002225875854
Epoch: 6, Test Accuracy:26.785714285714285
epoch: 7, iters: 35.0, loss: 1.619018793106079
Epoch: 7, Test Accuracy:33.92857142857143
33.92857142857143
epoch: 8, iters: 40.0, loss: 1.5192112922668457
Epoch: 8, Test Accuracy:33.92857142857143
epoch: 9, iters: 45.0, loss: 1.5225038528442383
Epoch: 9, Test Accuracy:33.92857142857143
epoch: 10, iters: 50.0, loss: 1.5658466815948486
Epoch: 10, 

Epoch: 89, Test Accuracy:60.714285714285715
epoch: 90, iters: 450.0, loss: 0.39030885696411133
Epoch: 90, Test Accuracy:60.714285714285715
epoch: 91, iters: 455.0, loss: 0.5294881463050842
Epoch: 91, Test Accuracy:66.07142857142857
epoch: 92, iters: 460.0, loss: 0.5831853747367859
Epoch: 92, Test Accuracy:64.28571428571429
epoch: 93, iters: 465.0, loss: 0.5130822658538818
Epoch: 93, Test Accuracy:62.5
epoch: 94, iters: 470.0, loss: 0.21867796778678894
Epoch: 94, Test Accuracy:58.92857142857143
epoch: 95, iters: 475.0, loss: 0.30327343940734863
Epoch: 95, Test Accuracy:57.142857142857146
epoch: 96, iters: 480.0, loss: 0.30814459919929504
Epoch: 96, Test Accuracy:60.714285714285715
epoch: 97, iters: 485.0, loss: 0.42968326807022095
Epoch: 97, Test Accuracy:62.5
epoch: 98, iters: 490.0, loss: 0.3651081323623657
Epoch: 98, Test Accuracy:64.28571428571429
epoch: 99, iters: 495.0, loss: 0.3736293315887451
Epoch: 99, Test Accuracy:64.28571428571429
epoch: 100, iters: 500.0, loss: 0.4266050457

Epoch: 179, Test Accuracy:91.07142857142857
epoch: 180, iters: 900.0, loss: 0.020338818430900574
Epoch: 180, Test Accuracy:91.07142857142857
epoch: 181, iters: 905.0, loss: 0.01514829695224762
Epoch: 181, Test Accuracy:89.28571428571429
epoch: 182, iters: 910.0, loss: 0.009589284658432007
Epoch: 182, Test Accuracy:85.71428571428571
epoch: 183, iters: 915.0, loss: 0.00956319272518158
Epoch: 183, Test Accuracy:83.92857142857143
epoch: 184, iters: 920.0, loss: 0.1524752974510193
Epoch: 184, Test Accuracy:83.92857142857143
epoch: 185, iters: 925.0, loss: 0.014115557074546814
Epoch: 185, Test Accuracy:82.14285714285714
epoch: 186, iters: 930.0, loss: 0.07756875455379486
Epoch: 186, Test Accuracy:82.14285714285714
epoch: 187, iters: 935.0, loss: 0.06741046160459518
Epoch: 187, Test Accuracy:69.64285714285714
epoch: 188, iters: 940.0, loss: 1.0113507509231567
Epoch: 188, Test Accuracy:66.07142857142857
epoch: 189, iters: 945.0, loss: 0.933855414390564
Epoch: 189, Test Accuracy:64.285714285714
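
The training loop saves a checkpoint only when the accuracy is strictly higher than the best value seen so far, so repeated or lower accuracies leave the saved file untouched. The sketch below replays that rule on the accuracies of epochs 0 to 9 from the log, rounded to two decimals. `best_epochs` is a hypothetical helper that records when a save would happen instead of writing a file.

```python
def best_epochs(accuracies):
    """Return the (epoch, accuracy) pairs at which a checkpoint would be saved."""
    saved = []
    best = 0
    for epoch, acc in enumerate(accuracies):
        if acc > best:  # strict comparison: ties do not overwrite the checkpoint
            best = acc
            saved.append((epoch, acc))
    return saved


# Accuracies of epochs 0-9, rounded from the log above.
log = [26.79] * 7 + [33.93] * 3
print(best_epochs(log))  # [(0, 26.79), (7, 33.93)]
```

This matches the log, where the bare accuracy value printed by `save_best` appears only after epochs 0 and 7 within the first ten epochs.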

## 6. Test the result

### 6.1 Load the best model

In [18]:


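# map_location lets a checkpoint saved on the GPU load on a CPU-only machine as well
saved = torch.load('models/best_model.pth', map_location = device)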

best_model = Emoji_Net(layer_num, input_dim,hidden_dim, output_dim, emb_weights)

best_model.to(device)
best_model.load_state_dict(saved['model_state_dict'])


### 6.2 Test the accuracy

In [19]:
total = 0
correct = 0

with torch.no_grad():
    best_model.eval()
    for batch_idx, (inputs, labels) in enumerate(test_loader):

        inputs = to_index(inputs,word2index, max_length = 15)
        inputs = Variable(inputs).to(device)
        labels = Variable(labels)

        outputs = best_model(inputs)
        _,preds = torch.max(outputs.data, dim=1)


        total += labels.size(0)
        correct += float((preds.cpu() == labels).sum())

    accuracy = 100* correct/total
    print( 'Test Accuracy:{}'.format(accuracy))


Test Accuracy:91.07142857142857


In [20]:
print(saved['best_accuracy'])

91.07142857142857
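
Because the test set has 56 sentences, every accuracy value corresponds to a whole number of correct predictions. As a small worked check, assuming the accuracy formula `100 * correct / total` used in the evaluation loop, the best accuracy corresponds to 51 correct sentences and the starting accuracy to 15. `accuracy_percent` is a hypothetical helper.

```python
def accuracy_percent(correct, total):
    """Accuracy in percent, as computed in the evaluation loop."""
    return 100 * correct / total


best = accuracy_percent(51, 56)   # best model: 51 of 56 correct
start = accuracy_percent(15, 56)  # first epochs: 15 of 56 correct

print(round(best, 2))   # 91.07
print(round(start, 2))  # 26.79
```

One additional correct sentence changes the accuracy by about 1.79 percentage points, which explains the step size between the values in the log.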
