### Problem Description:
As a media monitoring company, we try to make sense of the media that is created every day. One area in particular is written articles, where we want to automatically determine the type of the article. This knowledge helps us to filter articles and present to our customers only what they really care about.
You are given a corpus of text articles, included in this repo.

Your job is to create a service that **classifies** the *articles* into related groups by detecting patterns inside the dataset. When your service is given a new article, your model(s) should return the *type(s)* that this article belongs to.

### Plan of attack:
* Split the lines of the text file into articles and their labels.
* Preprocess the articles: remove stop words, punctuation, etc.
* As mentioned in the problem description, the labels attached to the articles are not reliable, so we need a technique to derive true article labels. For this challenge I use LDA, a well-known topic-modelling method, to cluster the articles into topics based on their bag of words.
* Add the true labels/topics to the dataset.
* Build a transformer-based classifier to detect long-range dependencies/patterns within articles. RNN models such as LSTM and bi-LSTM struggled with this when I tried them for this challenge, so I decided to use BERT.
* You can also try the model as a service [here](http://drstrange.cse.unsw.edu.au:5002).

* Results and observations
  * The transformer model achieved 83% accuracy in classifying articles after being trained on only 50k of the 260k available samples. I used this reduced amount of training data just to speed up training. To use this model in production, however, it should be trained on the whole dataset. We could also generate more training data with text augmentation techniques (e.g. pivot-language paraphrasing, synonym replacement).
  * The main issue with transformer models is that they are very slow at inference time because of their huge number of parameters. One way to speed up inference is [distillation](https://medium.com/pytorch/bert-distillation-with-catalyst-c6f30c985854): train a model with fewer parameters that achieves almost the same performance (with a slight decrease) as the larger model. Another option is to take the pooled sentence representation (the [CLS] special token) from the second- or third-to-last layer of BERT and pass it to a simpler model (e.g. Random Forest). You can find an example of such a solution [here](http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/). This approach works well for classification problems such as the one in this challenge.
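
The distillation idea in the last bullet can be made concrete with a small, library-free sketch of the loss a smaller "student" model is usually trained on: the divergence between temperature-softened teacher and student predictions. The function names, the temperature of 2.0 and the toy logits are assumptions for illustration and are not part of this notebook's pipeline.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) computed on temperature-softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return kl * temperature ** 2  # T^2 keeps gradient magnitudes comparable across temperatures

teacher = [4.0, 1.0, 0.5]  # hypothetical logits from the large model
student = [2.5, 1.2, 0.3]  # hypothetical logits from the small model
print(distillation_loss(student, teacher))
print(distillation_loss(teacher, teacher))  # identical predictions give zero loss
```

In practice this term is typically combined with the ordinary cross-entropy loss on the true labels.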

### Install packages

In [2]:
!pip install gensim
!pip install wordcloud
!pip install transformers
!pip install torch
!pip install pandas
!pip install matplotlib
!pip install scikit-learn
!pip install numpy
!pip install tqdm



### Import libraries

In [69]:
import os
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim import corpora, models
from gensim.test.utils import datapath
# import nltk
# from nltk.stem import WordNetLemmatizer, SnowballStemmer
# nltk.download('wordnet')
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import pickle
import re
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader, SequentialSampler, RandomSampler
from transformers import BertModel, BertTokenizer, BertConfig, BertForSequenceClassification, get_linear_schedule_with_warmup
from sklearn.model_selection import train_test_split
from tqdm import tqdm, trange
from sklearn.metrics import classification_report, confusion_matrix, multilabel_confusion_matrix, f1_score, accuracy_score
import time
import datetime
import pandas as pd
import numpy as np

# os.environ["CUDA_VISIBLE_DEVICES"]="2"
PATH = "."

### Prepare articles to train our transformer model

We have two options:


1.   Preprocess raw articles and assign appropriate topic labels to each
2.   Download *ready-to-use* articles from [here](https://www.dropbox.com/s/qbkm39innbud1c4/articles.csv?dl=1)

#### 1. Preprocess raw articles

Read the [text file](https://bitbucket.org/isentia/coding-challenge-ml/src/master/) that contains the articles and their labels. We only use the train.txt file for now.

In [95]:
filename = "train.txt"
lines = []
with open(os.path.join(PATH, filename), "r") as file:
  for line in file.readlines():
    lines.append(line)

Extract article texts and corresponding labels

In [None]:
articles = []
labels = []
num_topics = 0 # number of unique labels assigned to articles (can be found in the text file)
for line in lines:
  articles.append(line.split("__label__")[0])
  labels.append(line.split("__label__")[1].replace("\n", ""))
num_topics = len(set(labels))
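
The cell above assumes that every line contains the `__label__` marker; a line without it would raise an `IndexError`. A minimal defensive sketch is shown below. The helper name and the sample lines are made up for illustration, and it assumes the label is the last thing on each line.

```python
def split_article_and_label(line, marker="__label__"):
    """Return (article_text, label), or None when the marker is missing."""
    text, found, label = line.rpartition(marker)
    if not found:
        return None  # malformed line: no label marker present
    return text.strip(), label.strip()

sample_lines = [  # made-up lines in the layout assumed for train.txt
    "Global beauty giant opens a new store __label__Retail\n",
    "a line without any label\n",
]
parsed = [p for p in map(split_article_and_label, sample_lines) if p is not None]
print(parsed)
```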

Preprocess articles: 
*   remove stop words
*   filter out words with 3 or fewer characters

In [None]:
# we skip the stemming step for now
# stemmer = SnowballStemmer("english")
# def lemmatize_stemming(article):
#     return stemmer.stem(WordNetLemmatizer().lemmatize(article, pos='v'))

def preprocess(article):
    result = []
    article = re.sub(r'[,\.!?]', '', article)
    for word in gensim.utils.simple_preprocess(article):
        if word not in gensim.parsing.preprocessing.STOPWORDS and len(word) > 3: # remove stop words and words with 3 or fewer characters
            result.append(word)
    return result

# preprocess articles
processed_articles = []
for article in articles:
  processed_articles.append(preprocess(article))

Create a dictionary of the words in the corpus; we will use it later when categorising articles into topics (topic modelling). Inspired by [this](https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24), we filter out words that appear in:
* fewer than 15 articles (absolute number), or
* more than 50% of the articles (a fraction of the total corpus size, not an absolute number).
* After the above two steps, keep only the 100k most frequent words.

In [None]:
dictionary = gensim.corpora.Dictionary(processed_articles)
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
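
To make the three filtering rules concrete, here is a plain-Python sketch of document-frequency filtering. It mimics the idea behind `filter_extremes` rather than gensim's exact implementation; the function name and the toy documents are assumptions for illustration.

```python
from collections import Counter

def filter_vocabulary(tokenized_docs, no_below=15, no_above=0.5, keep_n=100000):
    """Keep tokens whose document frequency lies within the given bounds."""
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(tokenized_docs)
    kept = [(tok, df) for tok, df in doc_freq.items() if no_below <= df <= max_docs]
    kept.sort(key=lambda item: (-item[1], item[0]))  # most frequent first
    return [tok for tok, _ in kept[:keep_n]]

toy_docs = [["bank", "rate", "loan"], ["bank", "rate"], ["bank", "film"], ["bank", "music"]]
filtered = filter_vocabulary(toy_docs, no_below=2, no_above=0.5)
print(filtered)
```

In the toy corpus, "bank" is dropped because it appears in every document, and "loan", "film" and "music" are dropped because they appear in only one.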

Convert the corpus to bag-of-words (BoW) format: each article becomes a list of (word index, count) pairs, where the index comes from the dictionary.

In [None]:
bow_articles = [dictionary.doc2bow(bow_article) for bow_article in processed_articles]

Compute TF-IDF for each BoW formatted article.

In [None]:
tfidf_model = models.TfidfModel(bow_articles)
tfidf_converted_articles = tfidf_model[bow_articles]
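
As a rough illustration of what the TF-IDF step does to the bag-of-words counts, the sketch below weights each term by its raw count times log2(N/df) and then L2-normalises the document vector. To my understanding this mirrors gensim's default settings, but treat that as an assumption; the function name and toy documents are made up.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return one {token: weight} dict per document, L2-normalised."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)  # bag-of-words: raw term frequencies
        weights = {t: tf * math.log2(n_docs / doc_freq[t]) for t, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # drop zero-weight terms (words that occur in every document)
        vectors.append({t: w / norm for t, w in weights.items() if norm > 0 and w > 0})
    return vectors

toy_docs = [["bank", "rate", "rate"], ["bank", "film"], ["bank", "music"]]
vectors = tfidf_vectors(toy_docs)
for vec in vectors:
    print(vec)
```

A word such as "bank" that occurs in every document receives zero weight, which is why TF-IDF input tends to push LDA towards more distinctive words.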

##### Use LDA to form topics and categorise articles based on their bag-of-words

As mentioned in the problem description, the initial labels assigned to articles are not reliable, so we use an LDA model to assign each article to an appropriate topic and treat that topic as its true label.

We have two options: 
-   Create LDA model
-   Download *ready-to-use* model from [here](https://www.dropbox.com/s/t3td12qefngddo0/isentia-lda-model.zip?dl=1)

###### Create an LDA model using pre-processed articles. 

Using the TF-IDF-weighted articles, we train an LDA model; each resulting topic is represented by a set of weighted words.

In [None]:
lda_model_tfidf = gensim.models.LdaMulticore(tfidf_converted_articles, num_topics=num_topics, id2word=dictionary, passes=2, workers=4)
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

Save the LDA model and BoW dictionary

In [None]:
# save lda model
# lda_model_tfidf.save(os.path.join(PATH, "lda-tfidf-model"))

# save bow dictionary
dictionary.save(os.path.join(PATH, "bow-dictionary"))

###### Download already created LDA model

In [2]:
# download and load the pretrained lda model
!wget -O isentia_lda_model.zip "https://www.dropbox.com/s/t3td12qefngddo0/isentia-lda-model.zip?dl=1"
!unzip isentia_lda_model.zip
# build the path directly; gensim's datapath() points into gensim's own test-data folder
fname = os.path.join(PATH, "lda-tfidf-model")
lda_model_tfidf = models.LdaModel.load(fname, mmap='r')

# download and load the bow dictionary
!wget -O bow-dictionary "https://www.dropbox.com/s/9a5t87a6mgw7zno/bow-dictionary?dl=1"
fname = os.path.join(PATH, "bow-dictionary")
dictionary = gensim.corpora.Dictionary.load(fname, mmap='r')


##### Assign topics to articles using LDA model

In [None]:
article_lda_topics = []
for article in tfidf_converted_articles:
  topic_num, similarity_score = sorted(lda_model_tfidf[article], key=lambda item: item[1], reverse=True)[0]
  article_lda_topics.append(topic_num)
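
The loop above keeps only the single best topic per article. Since the problem statement asks for the *type(s)* of an article, one possible variation is to keep every topic whose score passes a threshold. The sketch below assumes the `(topic_id, probability)` pair format that the LDA model returns; the threshold of 0.2 and the example distribution are invented for illustration.

```python
def dominant_topics(topic_distribution, min_score=0.2):
    """Return (topic_id, score) pairs with score >= min_score, best first.

    Falls back to the single best topic when nothing passes the threshold.
    """
    ranked = sorted(topic_distribution, key=lambda item: item[1], reverse=True)
    confident = [(t, s) for t, s in ranked if s >= min_score]
    return confident or ranked[:1]

# hypothetical topic distribution for one article: (topic_id, probability)
example = [(3, 0.08), (7, 0.55), (17, 0.31), (9, 0.06)]
print(dominant_topics(example))
print(dominant_topics(example, min_score=0.9))
```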

##### Create a dataframe containing the pre-processed articles, their initial (unreliable) labels and the LDA-assigned topics

In [None]:
# convert lists to dataframe
df = pd.DataFrame([" ".join(article) for article in processed_articles], columns=["article"])
df["initial_label"] = labels
df["lda_topic"] = article_lda_topics

# remove rows with missing article values (if any)
df.dropna(axis=0, inplace=True)

# drop very short articles with less than 50 words
short_articles = df.loc[df.article.str.split().str.len() < 50]
df.drop(short_articles.index, axis=0, inplace=True)

df.head()

#### 2. Download already pre-processed articles in CSV format

In [70]:
# download processed articles
!wget -O articles.csv "https://www.dropbox.com/s/qbkm39innbud1c4/articles.csv?dl=1"

# load articles
df = pd.read_csv(os.path.join(".", "articles.csv"))

# remove rows with missing article values (if any)
df.dropna(axis=0, inplace=True)

# drop very short articles with less than 50 words
short_articles = df.loc[df.article.str.split().str.len() < 50]
df.drop(short_articles.index, axis=0, inplace=True)

df.head()

| Unnamed: 0 | article | initial_label | lda_topic |
|---|---|---|---|
| 0 | investment accelerate divergent efforts global... | Energy&Resources&Utilities | 7 |
| 1 | interpol hunting fugitive queensland nickel di... | Legal&Defence | 3 |
| 2 | things victoria driving grunt work today host ... | Entertainment | 9 |
| 3 | thanks small mercies shorten believes economic... | Food&Beverage | 2 |
| 4 | heart australia future kids members guests iso... | Information,Technology&Telecommunications | 7 |


#### Prepare articles to train our transformer:


1. Tokenization and padding
2. Train/validation/test split



In [74]:
articles = df.article.values
article_labels = np.array(list(df.lda_topic.values))
num_labels = df.lda_topic.nunique()

Tokenize articles using *BertTokenizer*.

BERT accepts sequences of at most 512 tokens, but around *19k* articles are longer than that (*more than 512 words*). To work around this limit, long texts would need to be split into smaller chunks. For now, we truncate or pad every article to 256 tokens to speed up training.

In [101]:
# load tokenizer
model_option = "bert-base-uncased" # we use base model
tokenizer = BertTokenizer.from_pretrained(model_option)

# set the maximum length for sequences, I set the length to 256 to speed up training
MAX_LEN = 256 # maximum length of text that BERT accepts is 512, however we need to fix this limitation by splitting long texts into smaller chunks
# MAX_LEN = df.article.str.split().str.len().mean()
# long_articles = df.loc[df.article.str.split().str.len() > 512]

# use encode plus to perform tokenization, padding and masking
encodings = tokenizer.batch_encode_plus(articles[:50000], max_length=MAX_LEN, padding="max_length", truncation=True)

input_ids = encodings["input_ids"]
token_type_ids = encodings["token_type_ids"]
attention_masks = encodings["attention_mask"]
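
The chunking of long articles mentioned above is not implemented in this notebook. One way it could look is a sliding window over the token ids, sketched below under a few assumptions: the window overlap (`stride`) is an arbitrary choice, the ids 101 and 102 are the `[CLS]` and `[SEP]` ids of `bert-base-uncased`, and the input ids are made up.

```python
def chunk_token_ids(token_ids, max_len=256, stride=128, cls_id=101, sep_id=102):
    """Split token ids (without special tokens) into overlapping windows."""
    body_len = max_len - 2  # leave room for [CLS] and [SEP]
    if not 0 < stride <= body_len:
        raise ValueError("stride must be between 1 and max_len - 2")
    chunks = []
    start = 0
    while True:
        window = token_ids[start:start + body_len]
        chunks.append([cls_id] + window + [sep_id])
        if start + body_len >= len(token_ids):
            break  # the current window already reaches the end of the article
        start += stride
    return chunks

fake_ids = list(range(1000, 1600))  # 600 made-up token ids standing in for one long article
chunks = chunk_token_ids(fake_ids, max_len=256, stride=128)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk would then be padded and classified separately, and the per-chunk predictions combined, for example by averaging the logits; that aggregation step is also an assumption, not something this notebook does.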

##### Split articles into train, validation and test

In [103]:
# split samples into train, validation, test
train_inputs, test_inputs, train_labels, test_labels, train_attention_masks, test_attention_masks = train_test_split(input_ids, article_labels[:50000], attention_masks, test_size=0.1, random_state=42, shuffle=True)
train_inputs, validation_inputs, train_labels, validation_labels, train_attention_masks, validation_attention_masks = train_test_split(train_inputs, train_labels, train_attention_masks, test_size=0.1, random_state=42, shuffle=True)
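
The two successive 10% hold-outs above determine how many samples end up in each set. The short calculation below reproduces those sizes, assuming the hold-out size is rounded up; the helper name is made up. The results agree with the 633 batches per epoch and the 5,000 test samples reported in the training and test output.

```python
import math

def split_sizes(n_samples, test_pct=10, val_pct=10):
    """Sizes produced by two successive hold-out splits given in percent."""
    n_test = math.ceil(n_samples * test_pct / 100)
    n_remaining = n_samples - n_test
    n_val = math.ceil(n_remaining * val_pct / 100)
    return n_remaining - n_val, n_val, n_test

n_train, n_val, n_test = split_sizes(50000)
print(n_train, n_val, n_test)   # 40500 4500 5000
print(math.ceil(n_train / 64))  # 633 batches per epoch at batch size 64
```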

Convert all lists to torch tensors

In [104]:
# Convert all to tensors
train_inputs = torch.tensor(train_inputs)
train_labels = torch.tensor(train_labels, dtype=torch.long)
train_attention_masks = torch.tensor(train_attention_masks)
# train_token_types = torch.tensor(train_token_types)

validation_inputs = torch.tensor(validation_inputs)
validation_labels = torch.tensor(validation_labels, dtype=torch.long)
validation_attention_masks = torch.tensor(validation_attention_masks)
# validation_token_types = torch.tensor(validation_token_types)

test_inputs = torch.tensor(test_inputs)
test_labels = torch.tensor(test_labels, dtype=torch.long)
test_attention_masks = torch.tensor(test_attention_masks)
# test_token_types = torch.tensor(test_token_types)

Create DataLoaders to iterate over the tensors in mini-batches, so that only one batch at a time has to be moved to the GPU during training/validation.

In [105]:
# Create Dataloader to save memory
batch_size = 64
train_data = TensorDataset(train_inputs, train_labels, train_attention_masks)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

validation_data = TensorDataset(validation_inputs, validation_labels, validation_attention_masks)
validation_sampler = RandomSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)

test_data = TensorDataset(test_inputs, test_labels, test_attention_masks)
test_sampler = SequentialSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=batch_size)

### Build our transformer model

Create a **BertArticleClassifier** class that wraps a pre-trained BertModel and adds a dropout layer and a linear classifier on top of the pooled output.

In [107]:
class BertArticleClassifier(nn.Module):
  def __init__(self, config, num_labels, model_option):
    super(BertArticleClassifier, self).__init__()
    self.num_labels = num_labels
    self.bert = BertModel.from_pretrained(model_option)
    self.dropout = nn.Dropout(0.2)
    self.classifier = nn.Linear(in_features=config.hidden_size, out_features=num_labels)
    nn.init.xavier_uniform_(self.classifier.weight)
  
  def forward(self, input_ids, attention_mask, token_type_ids=None, labels=None):
    # index the output instead of unpacking it, so both tuple and ModelOutput returns work
    outputs = self.bert(input_ids, attention_mask=attention_mask)
    pooled_output = outputs[1]  # pooled [CLS] representation
    output = self.dropout(pooled_output)
    logits = self.classifier(output)
    return (logits, pooled_output)

Create our article classifier transformer model and move it to the GPU

In [108]:
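model_option = "bert-base-uncased"
# load the configuration that matches the pretrained checkpoint (provides hidden_size)
model = BertArticleClassifier(BertConfig.from_pretrained(model_option), num_labels, model_option)
# "cuda:1" selects the second GPU; use "cuda" on a single-GPU machine
device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")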

# load the model into gpu (if any)
model.to(device)

BertArticleClassifier(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_aff

Set fine-tuning parameters

In [109]:
# set fine-tuning parameters
epochs = 5
param_optimizer = list(model.named_parameters())
no_decay = ["bias", "LayerNorm.weight"] # parameters that should not be weight-decayed

optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
         'weight_decay': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
         'weight_decay': 0.0},
    ]

# we use adamw optimizer with learning rate 3e-5 for now
optimizer = optim.AdamW(optimizer_grouped_parameters, lr=3e-5)

# linear learning-rate decay schedule (num_warmup_steps=0 means there is no warmup phase)
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)
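
To see what this schedule does to the learning rate, the sketch below recomputes it by hand: a linear ramp during warmup (if any) followed by a linear decay to zero. The function name is made up; the 3,165 total steps correspond to the 633 batches per epoch shown in the training output times 5 epochs, and the 10% warmup in the last line is purely hypothetical.

```python
def linear_schedule_lr(step, base_lr=3e-5, num_warmup_steps=0, num_training_steps=3165):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / max(1, num_training_steps - num_warmup_steps)

for step in (0, 1000, 3165):
    print(step, linear_schedule_lr(step))

# with a hypothetical 10% warmup the rate would start at zero instead
print(linear_schedule_lr(0, num_warmup_steps=316))
```

With `num_warmup_steps=0`, as in the cell above, training starts at the full learning rate of 3e-5 and decays to zero by the last step.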

Train and Validate our model

In [110]:
def train_model(model, optimizer, scheduler, num_epochs):
  loss_values = []
  valid_loss_values = []
  loss_func = nn.CrossEntropyLoss()
  for _ in trange(num_epochs, desc="Epochs==>"):
    # start time for each training epoch.
    t0 = time.time()
    
    # reset the losses for this epoch.
    train_loss = 0.0
    valid_loss = 0.0
    
    # put the model into training mode.
    model.train()
    for step, batch in enumerate(train_dataloader):
      
      # print the progress for every 100 batches
      if step % 100 ==0 and not step ==0:
        elapsed_rounded = int(round((time.time() - t0)))
        elapsed = str(datetime.timedelta(seconds=elapsed_rounded))
        print("")
        print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))
      
      # load the batch into gpu (otherwise into cpu)
      batch = [t.to(device) for t in batch]
      
      # unpack the batch
      b_input_ids, b_labels, b_attention_masks = batch
      
      # clear any gradients accumulated from previous turn
      optimizer.zero_grad()

      logits, _ = model(b_input_ids, attention_mask = b_attention_masks)
      # logits, = model(b_input_ids, attention_mask = b_attention_masks)
      loss = loss_func(logits.view(-1, num_labels), b_labels.view(-1))
      train_loss += loss.item()
      
      # calculate the gradients
      loss.backward()
      
      # just to make sure we won't have gradient exploding issue, we clip it to 1
      torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

      # update weights according to the gradients
      optimizer.step()

      # update learning rate
      scheduler.step()
    
    avg_loss = train_loss/len(train_dataloader)
    loss_values.append(avg_loss) # keep the average loss for this round
    print("")
    print("  Average training loss: {0:.2f}".format(avg_loss))
    elapsed_rounded = int(round((time.time() - t0)))
    elapsed = str(datetime.timedelta(seconds=elapsed_rounded))
    print("  Training epoch took: {:}".format(elapsed))

    # validate our model
    pred_labels = []
    true_labels = []
    t0 = time.time()

    # put the model in evaluation mode
    model.eval()
    for step, batch in enumerate(validation_dataloader):
      batch = [t.to(device) for t in batch]
      b_input_ids, b_labels, b_attention_masks = batch
      with torch.no_grad():
        logits, _ = model(b_input_ids, attention_mask = b_attention_masks)
        # logits,  = model(b_input_ids, attention_mask = b_attention_masks)

      loss = loss_func(logits.view(-1, num_labels), b_labels.view(-1))
      valid_loss += loss.item()
      
      # move the logits into cpu as we don't need to keep them in gpu anymore
      logits = logits.detach().cpu().numpy()
      b_labels = b_labels.to("cpu").numpy()
      pred_labels.extend(np.argmax(logits, axis=1).flatten()) # keep predicted labels
      true_labels.extend(b_labels.flatten()) # keep true labels

    avg_loss = valid_loss/len(validation_dataloader)
    valid_loss_values.append(avg_loss) # keep the average loss for this round
    print("")
    print("  Average validation loss: {0:.2f}".format(avg_loss))
    print("  Validation accuracy: {0:.2f}".format(accuracy_score(true_labels, pred_labels)))

    elapsed_rounded = int(round((time.time() - t0)))
    elapsed = str(datetime.timedelta(seconds=elapsed_rounded))
    print("  Validation epoch took: {:}".format(elapsed))
  return loss_values, valid_loss_values


loss_values, valid_loss_values = train_model(model, optimizer, scheduler, epochs)
print("")
print("Training complete!")








Epochs==>:   0%|          | 0/5 [00:00<?, ?it/s]


  Batch   100  of    633.    Elapsed: 0:01:35.

  Batch   200  of    633.    Elapsed: 0:03:13.

  Batch   300  of    633.    Elapsed: 0:05:01.

  Batch   400  of    633.    Elapsed: 0:06:55.

  Batch   500  of    633.    Elapsed: 0:08:51.

  Batch   600  of    633.    Elapsed: 0:10:50.

  Average training loss: 0.89
  Training epoch took: 0:11:31








Epochs==>:  20%|██        | 1/5 [11:58<47:55, 718.96s/it]


  Average validation loss: 0.64
  Validation accuracy: 0.80
  Validation epoch took: 0:00:28

  Batch   100  of    633.    Elapsed: 0:02:01.

  Batch   200  of    633.    Elapsed: 0:04:02.

  Batch   300  of    633.    Elapsed: 0:06:03.

  Batch   400  of    633.    Elapsed: 0:08:06.

  Batch   500  of    633.    Elapsed: 0:10:09.

  Batch   600  of    633.    Elapsed: 0:12:10.

  Average training loss: 0.53
  Training epoch took: 0:12:51








Epochs==>:  40%|████      | 2/5 [25:17<37:08, 742.96s/it]


  Average validation loss: 0.59
  Validation accuracy: 0.80
  Validation epoch took: 0:00:28

  Batch   100  of    633.    Elapsed: 0:02:03.

  Batch   200  of    633.    Elapsed: 0:04:06.

  Batch   300  of    633.    Elapsed: 0:06:08.

  Batch   400  of    633.    Elapsed: 0:08:11.

  Batch   500  of    633.    Elapsed: 0:10:12.

  Batch   600  of    633.    Elapsed: 0:12:15.

  Average training loss: 0.36
  Training epoch took: 0:12:54








Epochs==>:  60%|██████    | 3/5 [38:39<25:21, 760.61s/it]


  Average validation loss: 0.58
  Validation accuracy: 0.82
  Validation epoch took: 0:00:28

  Batch   100  of    633.    Elapsed: 0:02:02.

  Batch   200  of    633.    Elapsed: 0:04:06.

  Batch   300  of    633.    Elapsed: 0:06:09.

  Batch   400  of    633.    Elapsed: 0:08:09.

  Batch   500  of    633.    Elapsed: 0:10:12.

  Batch   600  of    633.    Elapsed: 0:12:13.

  Average training loss: 0.24
  Training epoch took: 0:12:54








Epochs==>:  80%|████████  | 4/5 [52:00<12:52, 772.75s/it]


  Average validation loss: 0.60
  Validation accuracy: 0.82
  Validation epoch took: 0:00:27

  Batch   100  of    633.    Elapsed: 0:02:02.

  Batch   200  of    633.    Elapsed: 0:04:04.

  Batch   300  of    633.    Elapsed: 0:06:06.

  Batch   400  of    633.    Elapsed: 0:08:07.

  Batch   500  of    633.    Elapsed: 0:10:09.

  Batch   600  of    633.    Elapsed: 0:12:10.

  Average training loss: 0.15
  Training epoch took: 0:12:50








Epochs==>: 100%|██████████| 5/5 [1:05:18<00:00, 780.35s/it]


  Average validation loss: 0.64
  Validation accuracy: 0.82
  Validation epoch took: 0:00:28

Training complete!


Plot Training/Validation loss
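
The plotting cell itself is not included here, so the following is only a sketch of how the two loss curves could be drawn with matplotlib. The function names are made up, and the numbers are the per-epoch averages printed in the training log above.

```python
def best_epoch(valid_losses):
    """1-based index of the epoch with the lowest validation loss."""
    return min(range(len(valid_losses)), key=valid_losses.__getitem__) + 1

def plot_losses(train_losses, valid_losses):
    """Draw per-epoch training and validation loss on one figure."""
    import matplotlib.pyplot as plt  # imported lazily; only needed when plotting
    epochs = list(range(1, len(train_losses) + 1))
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(epochs, train_losses, marker="o", label="Training")
    ax.plot(epochs, valid_losses, marker="o", label="Validation")
    ax.set_xlabel("Epoch")
    ax.set_ylabel("Average loss")
    ax.set_title("Training vs. validation loss")
    ax.set_xticks(epochs)
    ax.legend()
    return fig

# average losses per epoch as printed in the training log
train_log = [0.89, 0.53, 0.36, 0.24, 0.15]
valid_log = [0.64, 0.59, 0.58, 0.60, 0.64]
print("Lowest validation loss at epoch", best_epoch(valid_log))
# plot_losses(loss_values, valid_loss_values)  # run in the notebook to draw the figure
```

The training loss keeps falling while the validation loss bottoms out at epoch 3, which suggests the model starts to overfit after that point.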

Test our model using the test dataset

In [112]:
print('{:,} Test samples...'.format(test_inputs.shape[0]))

# put the model in evaluation mode
model.eval()
test_pred_labels = []
test_true_labels = []

for batch in test_dataloader:
  # load the batch into gpu
  batch = tuple(t.to(device) for t in batch)
  
  # unpack the inputs from our test dataloader
  b_input_ids, b_labels, b_attention_masks = batch

  with torch.no_grad():
      logits, _ = model(b_input_ids, attention_mask = b_attention_masks)

  # move logits and labels to CPU
  logits = logits.detach().cpu().numpy()
  b_labels = b_labels.to("cpu").numpy()  
  
  test_pred_labels.extend(np.argmax(logits, axis=1).flatten())
  test_true_labels.extend(b_labels.flatten())

print("Test accuracy: {0:.2f}".format(accuracy_score(test_true_labels, test_pred_labels)))

5,000 Test samples...
Test accuracy: 0.83


The model reaches **83%** test accuracy, even though it was trained on only a 50k-sample subset.
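
Accuracy alone can hide weak classes when topics are imbalanced. As a sketch of a per-class breakdown, the pure-Python helper below computes precision, recall and F1 for each label; the helper name and the toy labels are made up. On the real predictions, `classification_report(test_true_labels, test_pred_labels)` from scikit-learn (already imported above) gives a comparable summary.

```python
from collections import Counter

def per_class_f1(true_labels, pred_labels):
    """Return {label: (precision, recall, f1)} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_labels, pred_labels):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but it was wrong
            fn[t] += 1  # true label t was missed
    scores = {}
    for label in sorted(set(true_labels) | set(pred_labels)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores

y_true = [0, 0, 0, 0, 1, 1, 2, 2]  # made-up labels
y_pred = [0, 0, 0, 0, 0, 1, 0, 2]
scores = per_class_f1(y_true, y_pred)
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)
print(scores)
print("macro F1: {:.2f}".format(macro_f1))
```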

### Try our model with new unseen articles (inference time): BERT Classifier vs LDA

In [92]:
# take unseen article from test.txt file
unseen_article = 'SEPHORA OPENS AT HIGHPOINT SHOPPING CENTRE ON 2 NOVEMBER Global beauty giant SEPHORA has today announced its next store location, continuing its extensive retail expansion in Australia, opening at Highpoint Shopping Centre on November 2nd 2017. This will be SEPHORA’s 13th Australian store and third in Victoria, (Melbourne Central and Chadstone). SEPHORA has been in high demand with Victorian beauty lovers since the opening of its first Victorian store, Melbourne Central in 2015. SEPHORA enthusiasts will be able to shop over 100 cosmetic brands in the new Highpoint store including exclusive lines from Marc Jacobs Beauty, Givenchy, Tarte, Anastasia Beverly Hills and the new Fenty Beauty by Rihanna As SEPHORA’s retail expansion continues, the product offering in-store and online continues to grow, with the recent launch of its new Wellness Category. Aimed to promote beauty from the inside out, the all-new Wellness Category features leading health and wellness brands including; KORA Organics, The Beauty Chef & WellCo, which will also play a part of the Highpoint store offering. SEPHORA Highpoint Shopping Centre, 120-200 Rosamund Road, Maribyrnong 3032'

Ask the LDA model

In [93]:
# download and load the pretrained lda model
!wget -O isentia_lda_model.zip "https://www.dropbox.com/s/t3td12qefngddo0/isentia-lda-model.zip?dl=1"
!unzip isentia_lda_model.zip
# build the path directly; gensim's datapath() points into gensim's own test-data folder
fname = os.path.join(PATH, "lda-tfidf-model")
lda_model_tfidf = models.LdaModel.load(fname, mmap='r')

# download and load the bow dictionary
!wget -O bow-dictionary "https://www.dropbox.com/s/9a5t87a6mgw7zno/bow-dictionary?dl=1"
fname = os.path.join(PATH, "bow-dictionary")
dictionary = gensim.corpora.Dictionary.load(fname, mmap='r')

def preprocess(article):
    result = []
    article = re.sub(r'[,\.!?]', '', article)
    for word in gensim.utils.simple_preprocess(article):
        if word not in gensim.parsing.preprocessing.STOPWORDS and len(word) > 3: # remove stop words and words with 3 or fewer characters
            result.append(word)
    return result

article_bow_vector = dictionary.doc2bow(preprocess(unseen_article))
for index, score in sorted(lda_model_tfidf[article_bow_vector], key=lambda item: item[1], reverse=True)[0:1]:
    print("Topic {}".format(index))

Topic 17


Ask our BERT classifier

In [113]:
# convert articles to sequences
MAX_LEN = 512

# load bert tokenizer, in case we skipped the "building bert model" section
model_option = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_option)

# tokenize, pad and mask the article (padding=True pads to the longest sequence in the batch)
encodings = tokenizer.batch_encode_plus([unseen_article], max_length=MAX_LEN, padding=True, truncation=True)
input_ids = encodings["input_ids"]
attention_masks = encodings["attention_mask"]

# convert to tensors
input_ids = torch.tensor(input_ids)
attention_masks = torch.tensor(attention_masks)

# put the model in evaluation mode
model.eval()

# move ids and masks into gpu
input_ids = input_ids.to(device)
attention_masks = attention_masks.to(device)

# since it's evaluation, we don't need to calculate gradients
with torch.no_grad():
    logits, _ = model(input_ids, attention_mask = attention_masks)

# move logit back to CPU
logits = logits.detach().cpu().numpy()

print("Topic {}".format(np.argmax(logits, axis=1).flatten()[0]))


Topic 17
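
The cell above reports only the single most likely topic. Because the task asks for the *type(s)* of an article, it can be useful to turn the logits into probabilities and look at the top few topics. The sketch below does this in plain Python; the helper name and the logits are invented for illustration and are not output of the trained model.

```python
import math

def top_k_topics(logits, k=3):
    """Return the k most likely (topic_id, probability) pairs from raw logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(enumerate(probs), key=lambda item: item[1], reverse=True)
    return ranked[:k]

fake_logits = [0.1, 2.3, -1.0, 1.9]  # made-up scores for four topics
top2 = top_k_topics(fake_logits, k=2)
for topic_id, prob in top2:
    print("Topic {}: {:.2f}".format(topic_id, prob))
```

With the real model, the input would be one row of the `logits` array computed in the cell above.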
