# Hong Kongese Language Identifier
This notebook contains modifications to make it run with the Hong Kongese language identification dataset. The only difference is that we do not load the English vectors because they will be useless on Hong Kongese.
This notebook uses a dataset with 500 each of Hong Kongese and Standard Chinese articles.

# 5 - Multi-class Sentiment Analysis

In all of the previous notebooks we have performed sentiment analysis on a dataset with only two classes, positive or negative. When we have only two classes, our output can be a single scalar, bounded between 0 and 1, that indicates which class an example belongs to. When we have more than 2 classes, our output must be a $C$-dimensional vector, where $C$ is the number of classes.

In this notebook, we'll be performing classification on a dataset with 6 classes. Note that this dataset isn't actually a sentiment analysis dataset, it's a dataset of questions and the task is to classify what category the question belongs to. However, everything covered in this notebook applies to any dataset with examples that contain an input sequence belonging to one of $C$ classes.

Below, we set up the fields and load the dataset.

The first difference is that we do not need to set the `dtype` in the `LABEL` field. When doing a multi-class problem, PyTorch expects the labels to be numericalized `LongTensor`s.

The second difference is that we use `TREC` instead of `IMDB` to load the `TREC` dataset. The `fine_grained` argument allows us to use the fine-grained labels (of which there are 50 classes) or not (in which case there'll be 6 classes). You can change this as you please.

In [1]:
import torch
from torchtext import data
from torchtext import datasets
import random

SEED = 1234

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

DATASET="500"

Custom tokenizer to simply split at character level.

In [2]:
def tokenizer(text): # create a tokenizer function
    return list(map(str, text.replace(" ", "")))
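
As a quick illustration of what this tokenizer produces, the sketch below reimplements it under the hypothetical name `char_tokenizer` (the sample strings are made up for the example). Note that only ASCII spaces are removed; other whitespace, such as the non-breaking space `\xa0`, is kept as a token, which is consistent with the `'\xa0'` entries in the training example shown in this notebook.

```python
def char_tokenizer(text):
    # Same behaviour as the notebook's `tokenizer`: drop ASCII spaces,
    # then split what is left into single characters.
    return list(text.replace(" ", ""))


mixed = char_tokenizer("X-Men 電影")
nbsp = char_tokenizer("a\xa0b")  # a non-breaking space is NOT removed

print(mixed)
print(nbsp)
```

Because each character is its own token, the same function works for Chinese text (which has no spaces between words) and for English.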

Load dataset from data/language directory.

In [3]:
TEXT = data.Field(tokenize=tokenizer)
LABEL = data.LabelField()

fields = {'language': ('label', LABEL), 'text': ('text', TEXT)}
train_data, valid_data, test_data = data.TabularDataset.splits(
                                        path = 'data/language/' + DATASET,
                                        train = 'train.json',
                                        validation = 'valid.json',
                                        test = 'test.json',
                                        format = 'json',
                                        fields = fields
)

Let's look at one of the examples in the training set.

In [4]:
vars(train_data[-1])

{'label': 'en',
 'text': ['(',
  'T',
  'h',
  'i',
  's',
  'a',
  'r',
  't',
  'i',
  'c',
  'l',
  'e',
  'c',
  'o',
  'n',
  't',
  'a',
  'i',
  'n',
  's',
  's',
  'p',
  'o',
  'i',
  'l',
  'e',
  'r',
  's',
  't',
  'o',
  't',
  'h',
  'e',
  'm',
  'o',
  'v',
  'i',
  'e',
  'L',
  'o',
  'g',
  'a',
  'n',
  '.',
  ')',
  '\xa0',
  '\xa0',
  'L',
  'o',
  'g',
  'a',
  'n',
  ',',
  't',
  'h',
  'e',
  'm',
  'o',
  's',
  't',
  'r',
  'e',
  'c',
  'e',
  'n',
  't',
  'X',
  '-',
  'M',
  'e',
  'n',
  'l',
  'i',
  'v',
  'e',
  'a',
  'c',
  't',
  'i',
  'o',
  'n',
  'f',
  'i',
  'l',
  'm',
  's',
  'e',
  'r',
  'i',
  'e',
  's',
  '–',
  'a',
  'n',
  'd',
  's',
  'u',
  'p',
  'p',
  'o',
  's',
  'e',
  'd',
  'l',
  'y',
  't',
  'h',
  'e',
  'l',
  'a',
  's',
  't',
  't',
  'i',
  'm',
  'e',
  'H',
  'u',
  'g',
  'h',
  'J',
  'a',
  'c',
  'k',
  'm',
  'a',
  'n',
  'w',
  'o',
  'u',
  'l',
  'd',
  'b',
  'e',
  's',
  't',
  'a',
  'r',
  'r

Next, we'll build the vocabulary. As this dataset is small (only ~3800 training examples), it also has a very small vocabulary (~7500 unique tokens). This means we do not need to set a `max_size` on the vocabulary as before.

In [5]:
TEXT.build_vocab(train_data)
LABEL.build_vocab(train_data)
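
To make the idea of a vocabulary concrete, here is a simplified sketch of how tokens can be mapped to integer indices. The function `build_vocab` and the toy sentences are hypothetical; this is not torchtext's implementation, which supports further options such as `max_size` and `min_freq`.

```python
from collections import Counter


def build_vocab(token_lists, specials=("<unk>", "<pad>")):
    # Count how often each token occurs across all examples.
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Most frequent first; ties are broken by code point so the order is deterministic.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + ordered                     # index -> token
    stoi = {tok: idx for idx, tok in enumerate(itos)}   # token -> index
    return itos, stoi


itos, stoi = build_vocab([list("唔係"), list("不是"), list("係唔係")])
print(itos)
print(stoi["唔"])
```

The special tokens come first so that unknown and padding tokens always have fixed, known indices.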

Next, we can check the labels.

The 6 labels (for the non-fine-grained case) correspond to the 6 types of questions in the dataset:
- `HUM` for questions about humans
- `ENTY` for questions about entities
- `DESC` for questions asking you for a description 
- `NUM` for questions where the answer is numerical
- `LOC` for questions where the answer is a location
- `ABBR` for questions asking about abbreviations

In [6]:
print(LABEL.vocab.stoi)

defaultdict(<function _default_unk_index at 0x116f12268>, {'hky': 0, 'zh': 1, 'en': 2})


As always, we set up the iterators.

In [7]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size=BATCH_SIZE, 
    device=device,
    sort_key=lambda x: len(x.text), # the BucketIterator needs to be told what function it should use to group the data.
    sort_within_batch=False)
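
The reason for giving the iterator a `sort_key` is that batching examples of similar length reduces the amount of padding. The toy sketch below only illustrates that effect; `padding_needed` and the article lengths are made up for the example, and it is not how `BucketIterator` is implemented internally.

```python
def padding_needed(lengths, batch_size):
    """Total pad tokens when each batch is padded to its longest example."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total


# Hypothetical article lengths (in characters).
lengths = [5, 120, 7, 110, 6, 115]

unsorted_padding = padding_needed(lengths, batch_size=3)
bucketed_padding = padding_needed(sorted(lengths), batch_size=3)

print(unsorted_padding, bucketed_padding)
```

Less padding means fewer wasted computations per batch, which matters here because whole articles can vary widely in length.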

We'll be using the CNN model from the previous notebook; however, any of the models covered in these tutorials will work on this dataset. The only difference is that the `output_dim` will now be $C$ instead of $1$.

In [8]:
import torch.nn as nn
import torch.nn.functional as F

class CNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, n_filters, filter_sizes, output_dim, dropout):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.convs = nn.ModuleList([
                                    nn.Conv2d(in_channels = 1, out_channels = n_filters, 
                                              kernel_size = (fs, embedding_dim)) 
                                    for fs in filter_sizes
                                    ])
        self.fc = nn.Linear(len(filter_sizes) * n_filters, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text):
        
        #text = [sent len, batch size]
        
        text = text.permute(1, 0)
                
        #text = [batch size, sent len]
        
        embedded = self.embedding(text)
                
        #embedded = [batch size, sent len, emb dim]
        
        embedded = embedded.unsqueeze(1)
        
        #embedded = [batch size, 1, sent len, emb dim]
        
        conved = [F.relu(conv(embedded)).squeeze(3) for conv in self.convs]
            
        #conved_n = [batch size, n_filters, sent len - filter_sizes[n] + 1]
        
        pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]
        
        #pooled_n = [batch size, n_filters]
        
        cat = self.dropout(torch.cat(pooled, dim=1))

        #cat = [batch size, n_filters * len(filter_sizes)]
            
        return self.fc(cat)
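
The tensor shapes in the comments above can be checked with a little arithmetic. The helper below is a hypothetical, pure-Python sketch (no PyTorch involved) that assumes no padding and a stride of 1, as in the model's `nn.Conv2d` layers. A filter of height `fs` sliding over `sent len` positions yields `sent len - fs + 1` outputs, so the input must be at least as long as the largest filter.

```python
def cnn_shapes(batch_size, sent_len, n_filters, filter_sizes):
    # A filter cannot be taller than the sequence it slides over.
    if sent_len < max(filter_sizes):
        raise ValueError("sequence is shorter than the largest filter")
    return {
        # one feature map per filter size, after squeeze(3)
        "conved": [(batch_size, n_filters, sent_len - fs + 1) for fs in filter_sizes],
        # max-pooling over the whole sequence removes the length dimension
        "pooled": [(batch_size, n_filters) for _ in filter_sizes],
        # concatenating along dim=1 stacks the filters of every size
        "cat": (batch_size, n_filters * len(filter_sizes)),
    }


shapes = cnn_shapes(batch_size=64, sent_len=10, n_filters=100, filter_sizes=[2, 3, 4])
print(shapes["conved"])
print(shapes["cat"])
```

This minimum-length requirement is why the prediction helper later in the notebook pads short inputs up to `min_len=4`.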

We define our model, making sure to set `OUTPUT_DIM` to $C$. We can get $C$ easily by using the size of the `LABEL` vocab, much like we used the length of the `TEXT` vocab to get the size of the vocabulary of the input.

In [9]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
N_FILTERS = 100
FILTER_SIZES = [2,3,4]
OUTPUT_DIM = len(LABEL.vocab)
DROPOUT = 0.5

model = CNN(INPUT_DIM, EMBEDDING_DIM, N_FILTERS, FILTER_SIZES, OUTPUT_DIM, DROPOUT)

Another difference to the previous notebooks is our loss function (aka criterion). Before, we used `BCEWithLogitsLoss`; now we use `CrossEntropyLoss`. Without going into too much detail, `CrossEntropyLoss` performs a *softmax* function over our model outputs and the loss is given by the *cross entropy* between that and the label.

Generally:
- `CrossEntropyLoss` is used when our examples exclusively belong to one of $C$ classes
- `BCEWithLogitsLoss` is used when our examples exclusively belong to only 2 classes (0 and 1) and is also used in the case where our examples belong to between 0 and $C$ classes (aka multilabel classification).
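
For a single example with logits $z$ and true class $y$, the cross-entropy loss is $-\log\left(e^{z_y} / \sum_j e^{z_j}\right)$. The sketch below computes this in plain Python to show the mechanics; the function names are hypothetical, and the three logits are copied from one of the prediction outputs later in this notebook (a Standard Chinese sentence, class index 1).

```python
import math


def softmax(logits):
    # Subtract the maximum first for numerical stability.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    # Negative log-probability assigned to the true class.
    return -math.log(softmax(logits)[target])


logits = [-3.0592, 6.1842, -3.9069]
probs = softmax(logits)
loss_correct = cross_entropy(logits, target=1)  # label matches the largest logit
loss_wrong = cross_entropy(logits, target=0)    # label disagrees with the model

print(probs)
print(loss_correct, loss_wrong)
```

The loss is close to zero when the largest logit belongs to the true class and grows quickly when the model is confidently wrong.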

In [10]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())

criterion = nn.CrossEntropyLoss()

model = model.to(device)
criterion = criterion.to(device)

Before, we had a function that calculated accuracy in the binary label case, where we said if the value was over 0.5 then we would assume it is positive. In the case where we have more than 2 classes, our model outputs a $C$-dimensional vector, where the value of each element is the belief that the example belongs to that class.

For example, in our labels we have: 'HUM' = 0, 'ENTY' = 1, 'DESC' = 2, 'NUM' = 3, 'LOC' = 4 and 'ABBR' = 5. If the output of our model was something like: **[5.1, 0.3, 0.1, 2.1, 0.2, 0.6]** this means that the model strongly believes the example belongs to class 0, a question about a human, and slightly believes the example belongs to class 3, a numerical question.

We calculate the accuracy by performing an `argmax` to get the index of the maximum value in the prediction for each element in the batch, and then counting how many times this equals the actual label. We then average this across the batch.

In [11]:
def categorical_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    max_preds = preds.argmax(dim=1) # index of the highest-scoring class for each example
    correct = max_preds.eq(y)
    # divide as floats, staying on the same device as the predictions
    return correct.sum().float() / y.shape[0]
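
The same calculation can be followed by hand on a tiny batch. This is a pure-Python sketch under a hypothetical name; the first three rows of scores are taken from prediction outputs later in this notebook, while the fourth row and all four labels are made up for illustration.

```python
def categorical_accuracy_py(preds, labels):
    # argmax per row: the index of the largest score
    predicted = [max(range(len(row)), key=row.__getitem__) for row in preds]
    correct = sum(int(p == y) for p, y in zip(predicted, labels))
    return correct / len(labels)


preds = [
    [5.2779, -1.1500, -3.9987],   # argmax 0
    [-3.0592, 6.1842, -3.9069],   # argmax 1
    [-0.5634, -0.9845, 2.0486],   # argmax 2
    [0.2, 0.1, 1.5],              # made-up row, argmax 2
]
labels = [0, 1, 2, 0]             # the last example is misclassified

accuracy = categorical_accuracy_py(preds, labels)
print(accuracy)
```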

The training loop is similar to before, without the need to `squeeze` the model predictions as `CrossEntropyLoss` expects the input to be **[batch size, n classes]** and the label to be **[batch size]**.

The label needs to be a `LongTensor`, which it is by default as we did not set the `dtype` to a `FloatTensor` as before.

In [12]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
        
        predictions = model(batch.text)
        
        loss = criterion(predictions, batch.label)
        
        acc = categorical_accuracy(predictions, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

The evaluation loop is, again, similar to before.

In [13]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.text)
            
            loss = criterion(predictions, batch.label)
            
            acc = categorical_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

Next, we train our model.

In [14]:
%%time

N_EPOCHS = 30

lowest_valid_loss = float('inf')
for epoch in range(N_EPOCHS):

    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    # save the model with the lowest validation loss for use later
    saved = False
    if valid_loss < lowest_valid_loss:
        lowest_valid_loss = valid_loss
        with open("./models/language-identifier-" + DATASET + "-best.pt", 'wb') as fb:
            saved = True
            torch.save(model, fb)
    
    print(f'| Epoch: {epoch+1:02} | Train Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}% | Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}% | Saved: {saved}')
    
with open("./models/language-identifier-" + DATASET + "-final.pt", 'wb') as ff:
    torch.save(model, ff)    

  "type " + obj.__name__ + ". It won't be checked "


| Epoch: 01 | Train Loss: 0.849 | Train Acc: 60.14% | Val. Loss: 0.269 | Val. Acc: 97.27% | Saved: True
| Epoch: 02 | Train Loss: 0.385 | Train Acc: 84.93% | Val. Loss: 0.136 | Val. Acc: 94.92% | Saved: True
| Epoch: 03 | Train Loss: 0.239 | Train Acc: 90.88% | Val. Loss: 0.085 | Val. Acc: 99.22% | Saved: True
| Epoch: 04 | Train Loss: 0.155 | Train Acc: 94.22% | Val. Loss: 0.074 | Val. Acc: 98.83% | Saved: True
| Epoch: 05 | Train Loss: 0.125 | Train Acc: 94.99% | Val. Loss: 0.066 | Val. Acc: 98.83% | Saved: True
| Epoch: 06 | Train Loss: 0.099 | Train Acc: 97.02% | Val. Loss: 0.063 | Val. Acc: 98.83% | Saved: True
| Epoch: 07 | Train Loss: 0.079 | Train Acc: 96.93% | Val. Loss: 0.062 | Val. Acc: 98.83% | Saved: True
| Epoch: 08 | Train Loss: 0.093 | Train Acc: 96.42% | Val. Loss: 0.062 | Val. Acc: 98.83% | Saved: False
| Epoch: 09 | Train Loss: 0.083 | Train Acc: 97.15% | Val. Loss: 0.065 | Val. Acc: 98.44% | Saved: False
| Epoch: 10 | Train Loss: 0.075 | Train Acc: 97.06% | Val. Los

For the non-fine-grained case, we should get an accuracy of around 90%. For the fine-grained case, we should get around 70%.

In [15]:
test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}% |')

| Test Loss: 0.219 | Test Acc: 95.37% |


Deep learning models tend to overfit the training data if they are trained for too many epochs. We'll compare the final model with the best model we've found, i.e. the one saved at the lowest validation loss.
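
The bookkeeping that decides which epoch counts as "best" can be sketched on its own. The function `track_best` is hypothetical, and the losses are the rounded validation losses of the first nine epochs from the training log above (the log is truncated after that).

```python
def track_best(valid_losses):
    best = float("inf")
    best_epoch = None
    saved_flags = []
    for epoch, loss in enumerate(valid_losses, start=1):
        improved = loss < best  # strict: an equal loss does not trigger a save
        if improved:
            best, best_epoch = loss, epoch
        saved_flags.append(improved)
    return best_epoch, saved_flags


valid_losses = [0.269, 0.136, 0.085, 0.074, 0.066, 0.063, 0.062, 0.062, 0.065]
best_epoch, saved_flags = track_best(valid_losses)

print(best_epoch)
print(saved_flags)
```

On these numbers the flags match the `Saved` column of the log: a checkpoint is written only when the validation loss strictly improves.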

In [16]:
with open("./models/language-identifier-" + DATASET + "-best.pt", 'rb') as fbl:
    best_model = torch.load(fbl)

In [17]:
best_model_test_loss, best_model_test_acc = evaluate(best_model, test_iterator, criterion)

print(f'| Test Loss: {best_model_test_loss:.3f} | Test Acc: {best_model_test_acc*100:.2f}% |')

| Test Loss: 0.223 | Test Acc: 94.59% |


Choose the model with the lowest loss.

In [18]:
if test_loss > best_model_test_loss:
    print("Will use best_model.")
    selected_model = best_model
else:
    print("Will use final model.")
    selected_model = model

Will use final model.


Similar to how we made a function to predict sentiment for any given sentence, we can now make a function that will predict the class of the input given.

The only difference here is that instead of using a sigmoid function to squash the output between 0 and 1, we use the `argmax` to get the highest predicted class index. We then use this index with the label vocab to get the human-readable label.

In [19]:
def predict_sentiment(sentence, trained_model, min_len=4):
    trained_model.eval() # make sure dropout is disabled at inference time
    tokenized = tokenizer(sentence)
    # pad short inputs so they are at least as long as the largest filter
    if len(tokenized) < min_len:
        tokenized += ['<pad>'] * (min_len - len(tokenized))
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1) # [sent len] -> [sent len, batch size = 1]
    preds = trained_model(tensor)
    print(preds)
    max_preds = preds.argmax(dim=1)
    return max_preds.item()
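
The padding and index lookup steps of this function can be sketched in isolation. The `encode` helper and the toy vocabulary are hypothetical; the sketch assumes, as with the real `TEXT.vocab.stoi`, that characters missing from the vocabulary map to the `<unk>` index, and that `min_len=4` matches the largest entry in `FILTER_SIZES`.

```python
def encode(tokens, stoi, min_len=4, pad_token="<pad>", unk_index=0):
    tokens = list(tokens)
    # Pad short inputs so the largest convolutional filter still fits.
    if len(tokens) < min_len:
        tokens += [pad_token] * (min_len - len(tokens))
    # Unseen characters fall back to the <unk> index.
    return [stoi.get(tok, unk_index) for tok in tokens]


toy_stoi = {"<unk>": 0, "<pad>": 1, "唔": 2, "係": 3}

encoded = encode(["唔", "係"], toy_stoi)
unknown = encode(list("唔係好嘢呀"), toy_stoi)

print(encoded)
print(unknown)
```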

Now, let's try it out on a few different questions...

In [20]:
pred_class = predict_sentiment("特朗普上周四（7日）曾表示，在3月1日達成貿易協議的最後期限前，他不會與中國國家主席習近平會晤。", selected_model)
print(f'Predicted class is: {pred_class} = {LABEL.vocab.itos[pred_class]}')

tensor([[-3.0592,  6.1842, -3.9069]], grad_fn=<AddmmBackward>)
Predicted class is: 1 = zh


In [21]:
pred_class = predict_sentiment("喺未有互聯網之前，你老母叫你做人唔好太高眼角，正正常常嘅男人嫁出去就算。", selected_model)
print(f'Predicted class is: {pred_class} = {LABEL.vocab.itos[pred_class]}')

tensor([[ 5.2779, -1.1500, -3.9987]], grad_fn=<AddmmBackward>)
Predicted class is: 0 = hky


In [22]:
pred_class = predict_sentiment("I need to get some food.", selected_model)
print(f'Predicted class is: {pred_class} = {LABEL.vocab.itos[pred_class]}')

tensor([[-0.5634, -0.9845,  2.0486]], grad_fn=<AddmmBackward>)
Predicted class is: 2 = en


In [23]:
def range_predictions(prelist, trained_model, min_len=4):
    predict_hky = {}
    predict_zh = {}
    predict_en = {}
    for token in prelist:
        tokenized = tokenizer(token)
        tokenized_len = len(tokenized)
        if tokenized_len < min_len:
            tokenized += ['<pad>'] * (min_len - tokenized_len)
        indexed = [TEXT.vocab.stoi[t] for t in tokenized]
        tensor = torch.LongTensor(indexed).to(device)
        tensor = tensor.unsqueeze(1)
        preds = trained_model(tensor)
        max_preds = preds.argmax(dim=1)
        if LABEL.vocab.itos[max_preds.item()] == 'hky':
            predict_hky[token] = preds.data[0][max_preds.item()].item()
        elif LABEL.vocab.itos[max_preds.item()] == 'zh':
            predict_zh[token] = preds.data[0][max_preds.item()].item()
        else:
            predict_en[token] = preds.data[0][max_preds.item()].item()
    return predict_hky, predict_zh, predict_en

In [24]:
predict_hky, predict_zh, predict_en = range_predictions(TEXT.vocab.itos, selected_model, min_len=4)

In [25]:
sorted_by_value = sorted(predict_hky.items(), key=lambda kv: kv[1])
sorted_by_value.reverse()
for i in range(5):
    if i < len(sorted_by_value):
        print(sorted_by_value[i])
    else:
        break

('嘅', 2.3741226196289062)
('唔', 2.373934030532837)
('咗', 2.059317111968994)
('？', 1.960614562034607)
('係', 1.9603140354156494)


In [26]:
sorted_by_value = sorted(predict_zh.items(), key=lambda kv: kv[1])
sorted_by_value.reverse()
for i in range(5):
    if i < len(sorted_by_value):
        print(sorted_by_value[i])
    else:
        break

('的', 2.023282766342163)
('他', 1.1343632936477661)
('坡', 1.0784488916397095)
('被', 1.0438194274902344)
('恭', 1.0360703468322754)


In [27]:
sorted_by_value = sorted(predict_en.items(), key=lambda kv: kv[1])
sorted_by_value.reverse()
for i in range(5):
    if i < len(sorted_by_value):
        print(sorted_by_value[i])
    else:
        break

('<pad>', 1.1476426124572754)
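
The three cells above repeat the same sort-and-slice pattern. One way to express it once is a small helper; `top_k` is a hypothetical name, and the scores below are a rounded subset of the `hky` results printed earlier.

```python
def top_k(scores, k=5):
    # Highest score first; slicing handles dictionaries with fewer than k items.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


scores = {"係": 1.9603, "嘅": 2.3741, "咗": 2.0593, "唔": 2.3739}
top_two = top_k(scores, k=2)

print(top_two)
```

This behaves like sorting ascending and then calling `reverse()`, except for the order of exact ties.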


Check which kinds of articles are misclassified.

In [28]:
with torch.no_grad():
    for batch in test_iterator:
        predictions = selected_model(batch.text)
        max_preds = predictions.argmax(dim=1)
        wrong_preds = (max_preds.eq(batch.label) == 0).nonzero()
        for wrong_pred in wrong_preds:
            wrong_idx = wrong_pred.item()
            incorrect_prediction = max_preds[wrong_idx].item()
            correct_label = batch.label[wrong_idx].item()
            print("Predicted \"" + LABEL.vocab.itos[incorrect_prediction] + "\" but should be \"" + LABEL.vocab.itos[correct_label] + "\".")
            text_i = batch.text[:,wrong_idx].tolist()
            text_striped_idx = [x for x in text_i if x != TEXT.vocab.stoi['<pad>']]
            full_text = list(map(lambda x: TEXT.vocab.itos[x], text_striped_idx))
            print("Article: ")
            print("".join(full_text))
            print()

Predicted "hky" but should be "zh".
Article: 
已宣布參選下屆特首的前政務司司長林鄭月娥今早於商台節目《在晴朗的一天出發》中，被問到就基本法廿三條立法和政改重啟的立場，林鄭指就廿三條立法是憲制責任，而大家越來越覺得國家安全重要，恐怖活動又無日無之於其他地方發生，認為要考慮環境因素和條件是否可以讓下一屆政府推行這「兩件重要的事」。林鄭指，任何施政也要看環境因素，不能勉強，要權衡輕重，「喺咩時候就係做咩政策」。她舉例指，今屆政府是直到2013年底才啟動政改，而非一上任就做政改工作，是因為這屆政府首先答應推行扶貧安老助養、改善民生的政策，讓社會氣氛好一點才做政改。她指，以後的政府也要用這種思維，「喺咩時候可以做咩嘢」。而現時廿三條立法和政改重啟兩個議題也富爭議性，壁壘分明，所以下屆特首要「諗幾時做」，但「兩個（議題）都要做」。面對現屆立法會　推行有困難廣告她指，廿三條立法是憲制責任，現在大家越來越覺得國家安全很重要，「呢啲恐怖活動真係無日無之喺其他地方發生緊」，而政改是回應很多市民追求民主發展和一人一票選特首，「所以你問我，就係要睇吓個環境因素、條件，是否可以讓我們喺下一屆推行呢兩件重要的事」。被問到她對大勢評估，下一屆政府著手推行這兩個議題是否比較困難，林鄭直言是很困難，因為政改和廿三條也要由現屆立法會處理，所以現時立法會的組成能否給予政府信心，能獲得足夠支持推行該政策，是一個無可迴避的問題，她指這也是要考慮的因素。廣告手到拿來膽更大？　林鄭：選舉令人謙卑但被問到大方向是否「休養生息」，林鄭指「休養生息」有很多解讀，就如「咩都唔做」，但她認為「香港點可咩都唔做」，因為在經濟面對很多競爭，所以經濟上就不可休養生息。聽眾陳先生問林鄭，若「手到拿來」會否「膽更大」？林鄭回應指，「有一個說法，選舉係令人謙卑，咁我而家都進入呢個狀態」，又指若她有做得不好地方，歡迎市民提出，她也會自我反省，「如果我喺為官呢30多年係有啲地方做得唔好，咁我會趁這機會認識同埋可以改善」。至於會否追究現任特首梁振英UGL事件，林鄭則指會按法治和現行制度處理。對於若由她帶領的政府下，社會撕裂會細一點，林鄭指她有這期望，「如果唔係我都唔會拋個身出嚟，去做一件相當艱鉅的事」，又指特首不只是一份工，要全情投入、步步為營。

Predicted "hky" but sh