## General

This Colab notebook generates semantic embeddings for the product titles using Google's Language-agnostic BERT Sentence Embedding ([LaBSE](https://tfhub.dev/google/LaBSE/1)).

LaBSE is a multilingual sentence encoder: it takes sentences as input and returns a 768-dimensional embedding for each one.

In [None]:
# Confirm GPU is running
!nvidia-smi

Fri Apr 23 12:40:59 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   43C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import nltk
from nltk.corpus import stopwords

In [None]:
!pip install bert-for-tf2
import bert

Collecting bert-for-tf2
  Downloading https://files.pythonhosted.org/packages/a5/a1/acb891630749c56901e770a34d6bac8a509a367dd74a05daf7306952e910/bert-for-tf2-0.14.9.tar.gz (41kB)
Collecting py-params>=0.9.6
  Downloading https://files.pythonhosted.org/packages/aa/e0/4f663d8abf83c8084b75b995bd2ab3a9512ebc5b97206fde38cef906ab07/py-params-0.10.2.tar.gz
Collecting params-flow>=0.8.0
  Downloading https://files.pythonhosted.org/packages/a9/95/ff49f5ebd501f142a6f0aaf42bcfd1c192dc54909d1d9eb84ab031d46056/params-flow-0.8.2.tar.gz
Building wheels for collected packages: bert-for-tf2, py-params, params-flow
  Building wheel for bert-for-tf2 (setup.py) ... 

In [None]:
# Load train data
train = pd.DataFrame(np.load('/content/drive/MyDrive/General Assembly - Data Science Immersive/shopee-product-matching/datasets/train.npy', allow_pickle=True),
                     columns=['posting_id', 'image', 'image_phash', 'title', 'label_group', 'matches', 'image_duplicates'])

## Generate Tokens

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# Potential stop words:
shopee_words = [# Sales words:
                'free', 'gift', 'give', 'get', 'ready', 'stock', 'stocks', 'stok',
                'ori', 'original', 'official', 'new', 'latest',
                'import', 'low', 'price', 'cheap', 'vip', 'discount', 'warranty',
                'promo', 'promotion', 'buy', 'buyer', 'shop', 'shopper', 'shopping',
                'bigsale', 'sale', 'sell', 'seller', 'resell', 'reseller',
                'all', 'any', 'full', 'include', 'includes', 'inclusive', 'tax',
    
                # Units
                'pieces', 'piece', 'pcs', 'pc', 'box', 'boxes', 'pack', 'packs', 'packet', 'packets', 'paket', 'package',
                'set', 'sets', 'size', 'roll', 'rolls', 'sachet', 'sachets',
                
                # Dimensions
                'ml', 'l', 'litre', 'liter', 'g', 'gr', 'gram', 'kg', 'kilo', 'kilogram',
                'mm', 'cm', 'm', 'meter', 'metre', 'yard', 'inch', 'x',
    
                # Miscellaneous alphabets
                'c', 'xe', 'f', 'b', 'v', 'xa',
                
                # Location words:
                'shopee', 'indonesia', 'indonesian', 'indo', 'id', 'jakarta', 'local', 'lokal',
    
                # English descriptors:
                'fashion', 'colour', 'color', 'design',
                'plus', 'pro', 'mini', 'premium', 'super', 'extra', 'big', 'small',
                
                # Indonesian descriptors:
                'bpom', 'muat', 'cod', 'murah', 'isi', 'warna', 'pajak', 'garansi', 'beli', 'gratis',
                'terbaru', 'harga', 'resmi',
                
]

# Add NLTK English and Indonesian stop words
stop_words = stopwords.words('english') + \
             stopwords.words('indonesian') + \
             shopee_words

# Remove duplicates; a set also keeps the membership test in process_tokens fast
stop_words = set(stop_words)

In [None]:
# Create function for generating tokens from titles
def process_tokens(title, stop_words, tokenizer):
    words = tokenizer.tokenize(title.lower())
    return ' '.join([word for word in words if word not in stop_words])
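
As a quick check of `process_tokens`, the sketch below runs it on a made-up title. `SimpleRegexTokenizer` is a hypothetical stand-in for `nltk.tokenize.RegexpTokenizer('[a-zA-Z0-9]+')` so that the example runs without NLTK, and the title and stop-word set are invented for illustration.

```python
import re


class SimpleRegexTokenizer:
    """Hypothetical stand-in for NLTK's RegexpTokenizer (an assumption, not the notebook's tokenizer)."""

    def __init__(self, pattern):
        self._pattern = re.compile(pattern)

    def tokenize(self, text):
        # Return every non-overlapping match of the pattern, in order.
        return self._pattern.findall(text)


def process_tokens(title, stop_words, tokenizer):
    # Same logic as the notebook cell: lowercase, tokenize, drop stop words, re-join.
    words = tokenizer.tokenize(title.lower())
    return ' '.join([word for word in words if word not in stop_words])


demo_tokenizer = SimpleRegexTokenizer('[a-zA-Z0-9]+')
demo_stop_words = {'ready', 'stock', 'promo'}          # invented subset for illustration
demo_title = 'READY STOCK!! Paper Bag Victoria Secret (PROMO)'  # invented title

cleaned = process_tokens(demo_title, demo_stop_words, demo_tokenizer)
unfiltered = process_tokens(demo_title, [], demo_tokenizer)
print(cleaned)
print(unfiltered)
```

Passing an empty list as `stop_words` keeps every token, so punctuation and casing are the only things removed.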

In [None]:
# Recreate token sets 2 (stop words removed) and 3 (stop words kept) from notebook 04_text_embeddings
tokenizer_1 = nltk.tokenize.RegexpTokenizer('[a-zA-Z0-9]+')
tokens_2 = train['title'].map(lambda x: process_tokens(x, stop_words, tokenizer_1)).to_numpy()
tokens_3 = train['title'].map(lambda x: process_tokens(x, [], tokenizer_1)).to_numpy()

## Load LaBSE Model

This section follows the usage example published with the LaBSE model on TensorFlow Hub to generate sentence embeddings from input strings.

In [None]:
def get_model(model_url, max_seq_length):
    labse_layer = hub.KerasLayer(model_url, trainable=True)

    # Define input.
    input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                             name="input_word_ids")
    input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                         name="input_mask")
    segment_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                          name="segment_ids")

    # LaBSE layer.
    pooled_output,  _ = labse_layer([input_word_ids, input_mask, segment_ids])

    # The embedding is l2 normalized.
    pooled_output = tf.keras.layers.Lambda(
          lambda x: tf.nn.l2_normalize(x, axis=1))(pooled_output)

    # Define model.
    return tf.keras.Model(
            inputs=[input_word_ids, input_mask, segment_ids],
            outputs=pooled_output), labse_layer
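
Because the model L2-normalizes its output, every embedding has unit length, so the dot product of two embeddings equals their cosine similarity. The sketch below checks this on toy 3-dimensional vectors standing in for the 768-dimensional embeddings; the helper names and the vectors are illustrative assumptions and do not come from the notebook.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vector, vector))
    return [v / norm for v in vector]


def cosine_similarity(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors (made up for illustration).
u = [3.0, 4.0, 0.0]
v = [1.0, 2.0, 2.0]

u_hat, v_hat = l2_normalize(u), l2_normalize(v)
unit_norm = math.sqrt(dot(u_hat, u_hat))
similarity_from_dot = dot(u_hat, v_hat)
similarity_direct = cosine_similarity(u, v)

print(unit_norm, similarity_from_dot, similarity_direct)
```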

In [None]:
# Set max sequence length
max_seq_length = 64

In [None]:
labse_model, labse_layer = get_model(
    model_url="https://tfhub.dev/google/LaBSE/1", max_seq_length=max_seq_length)

In [None]:
# Build the BERT tokenizer from the vocabulary file bundled with the LaBSE layer
vocab_file = labse_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = labse_layer.resolved_object.do_lower_case.numpy()
tokenizer = bert.bert_tokenization.FullTokenizer(vocab_file, do_lower_case)

In [None]:
# Check some tokenized titles
tokens_2[:10]

array(['paper bag victoria secret',
       'double tape 3m vhb 12 4 5 double foam tape',
       'maling tts canned pork luncheon meat 397',
       'daster batik lengan pendek motif acak campur leher kancing dpt001 00 batik karakter alhadi',
       'nescafe xc3 x89clair latte 220ml',
       'celana wanita bb 45 84 harem wanita', 'jubah anak 1 12 thn',
       'kulot plisket salur candy plisket wish kulot kulot pelangi hieka kulot',
       'logu tempelan kulkas magnet angka tempelan angka magnet',
       'sepatu pantofel kulit keren kerja kantor laki pria cowok dinas formal pesta kickers'],
      dtype=object)

In [None]:
# Check how our tokenized titles get re-tokenized by BERT
for input_string in tokens_2[:10]:
  print(tokenizer.tokenize(input_string))

['paper', 'bag', 'victoria', 'secret']
['double', 'tape', '3m', 'v', '##hb', '12', '4', '5', 'double', 'foam', 'tape']
['maling', 'tts', 'canne', '##d', 'pork', 'lunch', '##eon', 'meat', '397']
['das', '##ter', 'batik', 'lengan', 'pendek', 'motif', 'acak', 'campur', 'leher', 'kan', '##cing', 'dpt', '##001', '00', 'batik', 'karakter', 'al', '##hadi']
['nes', '##cafe', 'x', '##c', '##3', 'x', '##89', '##clair', 'latte', '220', '##ml']
['celana', 'wanita', 'bb', '45', '84', 'harem', 'wanita']
['jubah', 'anak', '1', '12', 'thn']
['kulo', '##t', 'plis', '##ket', 'salu', '##r', 'candy', 'plis', '##ket', 'wish', 'kulo', '##t', 'kulo', '##t', 'pelan', '##gi', 'hie', '##ka', 'kulo', '##t']
['logu', 'tempel', '##an', 'kulkas', 'magnet', 'angka', 'tempel', '##an', 'angka', 'magnet']
['sepatu', 'panto', '##fel', 'kulit', 'keren', 'kerja', 'kantor', 'laki', 'pria', 'cowok', 'dinas', 'formal', 'pesta', 'kick', '##ers']
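
The `##` prefix in the output above marks a word piece that continues the previous piece. BERT-style tokenizers typically split each word with a greedy longest-match-first search over the vocabulary. The sketch below illustrates that idea only: `toy_vocab` is a tiny made-up vocabulary, not the LaBSE vocabulary, and `wordpiece_split` is a hypothetical helper, not the `bert-for-tf2` implementation.

```python
def wordpiece_split(word, vocab, unk_token='[UNK]'):
    """Greedy longest-match-first split of a single word into word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate substring until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces


# Made-up vocabulary for illustration only.
toy_vocab = {'kulo', '##t', 'plis', '##ket', 'candy'}

for word in ['kulot', 'plisket', 'candy', 'hieka']:
    print(word, '->', wordpiece_split(word, toy_vocab))
```

Rare or brand-specific words therefore break into several short pieces, which is why some titles use many more tokens than they have words.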


In [None]:
def create_input(input_strings, tokenizer, max_seq_length):
    
    input_ids_all, input_mask_all, segment_ids_all = [], [], []
    for input_string in input_strings:
        
        # Tokenize input.
        input_tokens = ["[CLS]"] + tokenizer.tokenize(input_string) + ["[SEP]"]
        input_ids = tokenizer.convert_tokens_to_ids(input_tokens)
        sequence_length = min(len(input_ids), max_seq_length)

        # Padding or truncation.
        if len(input_ids) >= max_seq_length:
            input_ids = input_ids[:max_seq_length]
        else:
            input_ids = input_ids + [0] * (max_seq_length - len(input_ids))

        input_mask = [1] * sequence_length + [0] * (max_seq_length - sequence_length)

        input_ids_all.append(input_ids)
        input_mask_all.append(input_mask)
        segment_ids_all.append([0] * max_seq_length)

    return np.array(input_ids_all), np.array(input_mask_all), np.array(segment_ids_all)
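
To make the padding and truncation step concrete, the sketch below applies the same logic to two toy ID sequences with a small `max_seq_length`. The helper `pad_or_truncate` and the ID values are hypothetical and chosen only for illustration; they are not real LaBSE vocabulary IDs.

```python
def pad_or_truncate(input_ids, max_seq_length, pad_id=0):
    """Return (ids, mask), each with exactly max_seq_length entries."""
    sequence_length = min(len(input_ids), max_seq_length)
    # Cut off anything past max_seq_length, then pad the remainder with pad_id.
    ids = input_ids[:max_seq_length] + [pad_id] * (max_seq_length - sequence_length)
    # The mask is 1 for real tokens and 0 for padding.
    mask = [1] * sequence_length + [0] * (max_seq_length - sequence_length)
    return ids, mask


# Toy sequences (made-up IDs; 101 and 102 play the roles of [CLS] and [SEP]).
short_ids = [101, 7, 8, 102]
long_ids = [101, 1, 2, 3, 4, 5, 6, 102]

short_padded, short_mask = pad_or_truncate(short_ids, max_seq_length=6)
long_truncated, long_mask = pad_or_truncate(long_ids, max_seq_length=6)

print(short_padded, short_mask)
print(long_truncated, long_mask)
```

As in `create_input`, a sequence longer than `max_seq_length` loses its trailing `[SEP]` ID when it is truncated.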

In [None]:
def encode(input_text):
    input_ids, input_mask, segment_ids = create_input(input_text, tokenizer, max_seq_length)
    return labse_model([input_ids, input_mask, segment_ids])

## Generate Embeddings

In [None]:
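# The dataset is too large to embed in one pass, so run the model in chunks.
# np.arange over a float yields float chunk indices (0.0, 1.0, ...), hence the int() casts in the loops below.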
chunk_size = 3000
chunks = np.arange(np.ceil(len(train) / chunk_size))
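
For reference, the chunk boundaries can also be computed with integer arithmetic only. The sketch below is an alternative formulation, not the notebook's code: `chunk_bounds` is a hypothetical helper, and the row count of 34250 is taken from the embedding shape reported in this notebook.

```python
def chunk_bounds(n_rows, chunk_size):
    """Yield (start, end) index pairs that cover range(n_rows) in order."""
    for start in range(0, n_rows, chunk_size):
        # Clamp the end index so the final chunk stops at the last row.
        yield start, min(start + chunk_size, n_rows)


bounds = list(chunk_bounds(n_rows=34250, chunk_size=3000))

print(len(bounds))   # number of chunks
print(bounds[0])     # first chunk
print(bounds[-1])    # last, shorter chunk
```

Slicing with an end index past the array length is harmless in NumPy, so the clamp here is only for readability.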

In [None]:
# Generate text embeddings from LaBSE model in chunks for tokens set 2
# Initialize embeddings list
embeddings = []

# Iterate through chunks
for i in chunks:
    # Start and end index
    start = int(i * chunk_size)
    end = int((i + 1) * chunk_size)

    # Get tokens
    tokens = tokens_2[start:end]

    # Generate embeddings
    text_embeddings = encode(tokens)

    # Append to embeddings list
    embeddings.append(text_embeddings)

    # Print status
    print(f'Chunk {i} completed')

text_labse_embeddings_2 = np.concatenate(embeddings)

# Delete temporary variables to free memory
del embeddings
del tokens
del text_embeddings

Chunk 0.0 completed
Chunk 1.0 completed
Chunk 2.0 completed
Chunk 3.0 completed
Chunk 4.0 completed
Chunk 5.0 completed
Chunk 6.0 completed
Chunk 7.0 completed
Chunk 8.0 completed
Chunk 9.0 completed
Chunk 10.0 completed
Chunk 11.0 completed


In [None]:
text_labse_embeddings_2.shape

(34250, 768)

In [None]:
# Save embeddings as npy file
np.save('/content/drive/MyDrive/General Assembly - Data Science Immersive/shopee-product-matching/datasets/text_labse_embeddings_2.npy', text_labse_embeddings_2)

In [None]:
# Generate text embeddings from LaBSE model in chunks for tokens set 3
# Initialize embeddings list
embeddings = []

# Iterate through chunks
for i in chunks:
    # Start and end index
    start = int(i * chunk_size)
    end = int((i + 1) * chunk_size)

    # Get tokens
    tokens = tokens_3[start:end]

    # Generate embeddings
    text_embeddings = encode(tokens)

    # Append to embeddings list
    embeddings.append(text_embeddings)

    # Print status
    print(f'Chunk {i} completed')

text_labse_embeddings_3 = np.concatenate(embeddings)

# Delete temporary variables to free memory
del embeddings
del tokens
del text_embeddings

Chunk 0.0 completed
Chunk 1.0 completed
Chunk 2.0 completed
Chunk 3.0 completed
Chunk 4.0 completed
Chunk 5.0 completed
Chunk 6.0 completed
Chunk 7.0 completed
Chunk 8.0 completed
Chunk 9.0 completed
Chunk 10.0 completed
Chunk 11.0 completed


In [None]:
text_labse_embeddings_3.shape

(34250, 768)

In [None]:
# Save embeddings as npy file
np.save('/content/drive/MyDrive/General Assembly - Data Science Immersive/shopee-product-matching/datasets/text_labse_embeddings_3.npy', text_labse_embeddings_3)