In [1]:
import os
import pickle
import yaml
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
import pandas as pd

import torch
import torch.nn as nn

In [2]:
device = torch.device('cuda:0')

In [3]:
def load_pkl(filepath):
    with Path(filepath).resolve().open('rb') as f:
        obj = pickle.load(f)
    return obj

## Data loading and exploration

The first step to do is to load the data.

In [4]:
#download the data
!wget -nc https://navee-ai-models.s3.eu-west-1.amazonaws.com/multimodal_learning/2022_04_19_chanel_80cls/chanel_80cls_dataset.tar
!tar -xf chanel_80cls_dataset.tar --one-top-level
dataset_path = "chanel_80cls_dataset/dataset.pkl"

File ‘chanel_80cls_dataset.tar’ already there; not retrieving.



A sample data, called listing, is composed of all the information we gathered for a single post of a product. See the DataFrame below. For this notebook, we are particularly interested in the following fields:

- `split`: This dataset is already split to train and test sets in a way to respect its unbalanced nature. So this field determins wether the data belongs to the train or test set.

- `cls`: The label.

- `id`: The if for the listings.

- `name`: The name or title of the product set by its seller. This is the first text field we're interested in for the NLP task.

-  `description`: Additionally to the `name`, some sellers go on and describe the product more in details. Thus this field also intrests us for the NLP task.

In [23]:
#load the data
dataset = load_pkl(dataset_path)
df_listings = pd.DataFrame.from_dict(dataset['listings'], orient='index')

df_listings.head()

Unnamed: 0,listing_id,local_id,imname
0,9795,0,i000000_l009795_o000_jewellery---bracelets---c...
1,9795,1,i000001_l009795_o001_jewellery---bracelets---c...
2,9795,2,i000002_l009795_o002_jewellery---bracelets---c...
3,9795,3,i000003_l009795_o003_jewellery---bracelets---c...
4,6649,0,i000004_l006649_o000_jewellery---bracelets---c...


In [6]:
#Let's consider the only field we need
df_text = df_listings[['split','cls','id','name','description']].copy()

df_text.head()

Unnamed: 0,split,cls,id,name,description
9795,train,jewellery---bracelets---coco_crush,9795,silver-metal-matelasse-chanel-bracelet-15891167,Chanel Mattalasse Chain Cuff Bracelet \r\n\r\n...
6649,train,jewellery---bracelets---coco_crush,6649,silver-white-gold-coco-crush-chanel-bracelet-1...,CHANEL COCO CRUSH line bracelet \n A collectio...
9762,train,jewellery---bracelets---coco_crush,9762,white-gold-matelasse-chanel-bracelet-9613741,Chanel quilted bracelet in white gold and diam...
9775,train,jewellery---bracelets---coco_crush,9775,gold-yellow-gold-matelasse-chanel-bracelet-106...,"DIAMOND BRACELET, 'MATELASSÉ, CHANEL \n \n Qui..."
6659,train,jewellery---bracelets---coco_crush,6659,gold-yellow-gold-coco-crush-chanel-bracelet-15...,Chanel Coco Crush bracelet in size XS. Yellow ...


In [7]:
#Make dict to make numerical labels easier
labels = df_text['cls'].unique()
labels_to_num = dict(zip(labels, range(len(labels))))
num_to_labels = dict(zip(range(len(labels)), labels))

#Apply on df
df_text.loc[:,'cls'] = df_text['cls'].apply(lambda x: labels_to_num[x])

Feel free to further develop this part to explore and understand better the data.

## Preprocessing

Processing a text requires converting it to some numerical vector of features that we can feed to the model. To do so, the text is first tokenized. Tokenization is a way of separating a piece of text into smaller units called tokens. Tokens can be either words, characters, or subwords. The simplest thing would be to seperate words on spaces: "I work at Navee." becomes ["I", "work", "at", "Navee", "."]. The preprocessing can continue to omit punctuations, lowercase, etc.

In [8]:
#First let's replace the "-" in the names to spaces so the tokenizer doesn't take them into account
df_text.loc[:,'name'] = [name.replace("-", " ") for name in df_text['name']]

BERT uses WordPiece Model to perform tokeniztion. This approach adresses the problem of Out Of Vocavulary (OOV) words i.e. rare words that aren't found in the constructed vocubulary and thus that doen't have representations (feature vector). To do so, the words are divided into a limited set of common subword units (“wordpieces”) for both input and output. You might also notice some special tokens such as `[CLS]`: first token of a sequence (input to the transformer).

In [9]:
pip install transformers

Note: you may need to restart the kernel to use updated packages.


In [10]:
from transformers import AutoTokenizer

#Load the tokenizer used for bert0base-case model - You can choose other models
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

In [11]:
#Let's test the tokenizer on an example
text_example = df_text.iloc[1]['name']
print("Example text: ", text_example)

#Let's apply the tokenizer - Note that the parameteres are easily understood from the documentation
input_ids, attention_mask = tokenizer(text_example, #text to tokenize
          padding='max_length',  #The output of the tokenizer is a vector of fixed length equal to max_length. Sentences with less tokens are padded with 0
          max_length=64,  
          truncation=True, #Sentences with more tokens are truncated
            return_tensors='pt', #The returned tensors are pytorch tensors
         return_token_type_ids=False, 
         verbose=False).values()

#Each token has an id that represents it
#The input_ids is tensor containing the ids of the tokens for the given example
print("Input Ids: ", input_ids)

#The attention mask tells whether the entry is an actual token "1" or a padding "0"
print("Attention mak: ", attention_mask)

#Let's look at the sub-words i.e. the tokens
tokens_example = tokenizer.convert_ids_to_tokens(input_ids[0])
print("Tokens example: ", tokens_example)

Example text:  silver white gold coco crush chanel bracelet 15993451
Input Ids:  tensor([[  101,  3165,  2317,  2751, 25033, 10188,  9212,  2884, 19688, 18914,
          2683, 22022, 22203,   102,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0]])
Attention mak:  tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
Tokens example:  ['[CLS]', 'silver', 'white', 'gold', 'coco', 'crush', 'chan', '##el', 'bracelet', '159', '##9', '##34', '##51', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PA

## Dataset
Now that we have the tokenizer ready, let's build the dataset.

In [12]:
#This is one of many ways to build the dataset. Feel free to change it as it might suit you.

class Dataset(torch.utils.data.Dataset):
    
    def __init__(self, df_text, tokenizer, split ='train', with_description = False):
        """
        - df_text: dataframe containing the texts and labels
        - tokenizer: the tokenizer to apply on text strings
        - split: 'train' or 'test'
        - with description: if True adds description to text
        """

        self.df_text = df_text.loc[df_text['split'] == split] #select only data of split
        
        self.tokenizer = tokenizer

        self.with_description = with_description
        
    def __len__(self):
        return len(self.df_text)
    
    def __getitem__(self, ind):

      text = self.df_text.iloc[ind]['name']

      if self.with_description:
        text += " " + self.df_text.iloc[ind]['name'] #add description if True
        
      input_ids, attention_mask = self.tokenizer(text).values()
   
      return {'input_ids': input_ids.squeeze(), 'attention_mask': attention_mask,  'label': self.df_text.iloc[ind]['cls']}

## BERT
Transformers are networks based on attention mechanisms. In a nutshell, attention makes the transformer learn relations between words (tokens). Thus, the transformer will learn how much information a word *A* holds about another word *B* in a sentence.Bsic transformers include an encoder and a decoder. And the encoder is fed the entire sentence of words at once which is in contrast to RNN.

BERT’s key technical innovation is applying the bidirectional training of Transformer. That is, the model receives a sentence with missing words. It tries to infer these missing words from their contexct. 

In [13]:
from transformers import BertModel

class BerClassifier(nn.Module):

  def __init__(self, nb_labels, pretrained_model = 'bert-base-cased'):
    super(BerClassifier, self).__init__()

    #Load Bert Model
    self.bert = BertModel.from_pretrained(pretrained_model) 

    #Add MLP for classification
    self.linear = nn.Linear(768, nb_labels)

  def forward(self, input_ids, attention_mask):

    #Extract the pooled output
    _, pooled_output = self.bert(input_ids = input_ids, attention_mask = attention_mask, return_dict = False)

    scores = self.linear(pooled_output)
    y = nn.functional.softmax(scores, dim=1)

    return y

## Finetuning

In [21]:
import time
try:
    from google.colab import drive
    drive.mount('/gdrive', force_remount = True) #Link our drive
    MODELPATH = '/gdrive/My Drive/Projects/BERT/bertmodel_weights.pt'
except ImportError:
    COLAB = True
    MODELPATH = './bertmodel_weights.pt'
    print('Not running on colab, setting local path')

Not running on colab, setting local path


In [17]:
def train_step(model, train_dataloader, criterion, optimizer):
  train_loss = 0
  train_acc = 0

  for i, data in enumerate(train_dataloader):

    model.zero_grad()

    #Load tensors to GPU
    input_ids = data['input_ids'].to(device)
    attention_mask = data['attention_mask'].to(device)
    label_true = data['label'].to(device)

    #Forward pass
    label_pred = model(input_ids, attention_mask)

    #Compute batch loss
    batch_loss = criterion(label_pred, label_true)

    #Backpropagation
    batch_loss.backward()
    optimizer.step()

    #COmpute metrics
    train_loss += batch_loss.item()
    train_acc += (label_pred.argmax(axis=1) == label_true).sum()

  train_loss = train_loss/len(train_dataloader)
  train_acc = train_acc/len(train_dataloader.dataset)
  
  print(f"loss: {train_loss :.3f} - acc: {train_acc*100 :.2f}")


In [18]:
#Set hyperparameters
BATCH_SIZE = 32
LEARNING_RATE = 1e-6
EPOCHS = 30

#Tokenizer
tokenizer_ = lambda text: tokenizer(text, 
          padding='max_length',  
          max_length=64,  
          truncation=True, 
          return_tensors='pt', 
          return_token_type_ids=False,
          verbose=False)

#Create datasets
train_dataset = Dataset(df_text, tokenizer_, 'train', with_description = True)
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size = BATCH_SIZE)

#Intantiate model
bertmodel = BerClassifier(len(labels), pretrained_model = 'bert-base-uncased')
bertmodel = bertmodel.to(device)

#Load model state from previous execution
# bertmodel.load_state_dict(torch.load('/gdrive/My Drive/Projects/BERT/bertmodel_weights.pt'))

criterion = nn.CrossEntropyLoss().cuda()

optimizer = torch.optim.Adam(bertmodel.parameters(), lr = LEARNING_RATE)

Downloading:   0%|          | 0.00/420M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Now that we have everytithing preped, we can easily finetune BERT.

Since training takes some time and Colab might stop the execution at any moment, we can save the weights of the model to our drive at every interval of time during the training.

In [22]:
#Start counting
start = time.time()

for epoch in range(EPOCHS):

  #Perform a training step
  print(f"Epoch: {epoch}")
  
  train_step(bertmodel, train_dataloader, criterion, optimizer)

  #Compute elapsed time and save model state if 1 minute passed
  elapsed = time.time() - start

  #Check wether the interval has passed. Here I have chosen the duration to be 1 minute.
  if elapsed/60 > 1:
    #Restart the timer
    start  = time.time()  
    #Save the model weights in Drive
    torch.save(bertmodel.state_dict(), MODELPATH) 

Epoch: 0
loss: 4.332 - acc: 13.68
Epoch: 1
loss: 4.323 - acc: 16.71
Epoch: 2
loss: 4.306 - acc: 17.51
Epoch: 3
loss: 4.291 - acc: 20.97
Epoch: 4
loss: 4.280 - acc: 22.06
Epoch: 5
loss: 4.272 - acc: 25.15
Epoch: 6
loss: 4.264 - acc: 25.53
Epoch: 7
loss: 4.255 - acc: 31.61
Epoch: 8
loss: 4.248 - acc: 34.40
Epoch: 9
loss: 4.238 - acc: 34.75
Epoch: 10
loss: 4.229 - acc: 36.56
Epoch: 11
loss: 4.219 - acc: 41.29
Epoch: 12
loss: 4.208 - acc: 43.75
Epoch: 13
loss: 4.198 - acc: 45.39
Epoch: 14
loss: 4.185 - acc: 45.81
Epoch: 15
loss: 4.171 - acc: 50.32
Epoch: 16
loss: 4.159 - acc: 51.18
Epoch: 17
loss: 4.144 - acc: 51.34
Epoch: 18
loss: 4.130 - acc: 50.38
Epoch: 19
loss: 4.120 - acc: 53.81
Epoch: 20
loss: 4.108 - acc: 53.46
Epoch: 21
loss: 4.099 - acc: 54.78
Epoch: 22
loss: 4.085 - acc: 51.95
Epoch: 23
loss: 4.071 - acc: 53.88
Epoch: 24
loss: 4.060 - acc: 57.10
Epoch: 25
loss: 4.052 - acc: 59.22
Epoch: 26
loss: 4.042 - acc: 58.85
Epoch: 27
loss: 4.035 - acc: 61.07
Epoch: 28
loss: 4.027 - acc: 6