<img src="https://i.imgur.com/TodFykz.png">
<center><h1>-Model Training & Submission-</h1></center>

# 1. Introduction
🟢 **Goal:** Build a model that can identify which images contain the same product(s).

🟠 **To consider**:
* This competition is a little different: it calls for **Unsupervised** ML techniques rather than Supervised ones.
* The goal is to group similar products together. Although we have a "target variable" (named `label_group`) in the `train` dataset, the `test` dataset can contain many other groups that were never seen during training. Hence, we can't use `label_group` as our target (`y`) feature.

<div class="alert alert-block alert-success">
<b>Inspiration:</b> HUGE thanks to Chris Deotte for creating a <a href="https://www.kaggle.com/cdeotte/part-2-rapids-tfidfvectorizer-cv-0-700">trendsetter notebook with a baseline</a> so we can all get started, and to zzy990106 for his <a href="https://www.kaggle.com/zzy990106/b0-bert-cv0-9">PyTorch version</a> of Chris's work.
<p>This notebook aims to explain the code and the process in more depth, and to try to improve the baseline score as we go along. 😊</p>
</div>

### 📚 Libraries + W&B

> You can find my W&B Dashboard on this competition [here](https://wandb.ai/andrada/shopee-kaggle?workspace=user-andrada).

In [None]:
# Libraries CPU
import wandb     ### comment when Internet OFF
import cv2
import os
import gc
import random
import tqdm
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from mpl_toolkits import mplot3d
import sys
sys.path = ['../input/efficientnet-pytorch/EfficientNet-PyTorch/EfficientNet-PyTorch-master'
           ] + sys.path

# Libraries GPU
import cudf
import cupy
import cuml
from cuml.feature_extraction.text import TfidfVectorizer
from cuml.neighbors import NearestNeighbors

# Pytorch & Deep Learning
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
from albumentations import Compose, Resize, Normalize, HorizontalFlip, VerticalFlip,\
                            Rotate, CenterCrop


from efficientnet_pytorch import EfficientNet
from transformers import AutoTokenizer
from torchvision.models import resnet34, resnet50

# Environment check
os.environ["WANDB_SILENT"] = "true"      ### comment when Internet OFF

# Secrets 🤫
### comment when Internet OFF
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("wandb")

# Color scheme
my_colors = ["#EDAC54", "#F4C5B7", "#DD7555", "#B95F18", "#475A20"]

# Device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Device available now:', device)

# Base paths
train_base = "../input/shopee-product-matching/train_images/"
test_base = "../input/shopee-product-matching/test_images/"

In [None]:
def set_seed(seed = 1234):
    '''Sets the seed of the entire notebook.
    This is for REPRODUCIBILITY.'''
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    # When running on the cuDNN backend, two further options must be set
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Set a fixed value for the hash seed
    os.environ['PYTHONHASHSEED'] = str(seed)
    
set_seed()

In [None]:
! wandb login $secret_value_0     ### comment when Internet OFF

# 2. Load the Data

Let's read the data, always taking into account the state of the notebook (whether it is being **submitted** or **committed**).
* For `submission`, we'll read in the `test.csv` data
* For `committing`, we'll read in `train.csv`, so we can compute a CV score as well

In [None]:
# ---- Set COMPUTE_CV value ----
COMPUTE_CV = True

# Switch to False if test.csv has more than 3 rows
### check out Chris's notebook for more info on this
test = pd.read_csv('../input/shopee-product-matching/test.csv')

if len(test)>3: 
    COMPUTE_CV = False

In [None]:
if COMPUTE_CV == True:
    # === CPU data ===
    # Read in data
    data = pd.read_csv("../input/shopee-product-matching/train.csv")    
    # Set a "filepath" column
    data["filepath"] = train_base + data["image"]
    # Map on for each product all `posting_id` that are labeled as the same
    ### this way we create a "target" column (ONLY FOR TRAIN)
    group_dicts = data.groupby('label_group')["posting_id"].unique().to_dict()
    data['target'] = data["label_group"].map(group_dicts)
    
    # === GPU data ===
    data_gpu = cudf.read_csv("../input/shopee-product-matching/train.csv")    
    data_gpu["filepath"] = train_base + data_gpu["image"]

else:
    # === CPU data ===
    data = pd.read_csv("../input/shopee-product-matching/test.csv")
    data["filepath"] = test_base + data["image"]
    # No Target Here
    
    # === GPU data ===
    data_gpu = cudf.read_csv("../input/shopee-product-matching/test.csv")    
    data_gpu["filepath"] = test_base + data_gpu["image"]

> When this notebook is committed, the `data` variable will have 34,000 rows. However, when we submit it, `data` will hold the 70,000 hidden rows of `test.csv`. This means that **the number of observations pushed through the pipeline will double**. To avoid any *memory errors*, you should also experiment by pushing ~70,000 rows through it, to **make sure your code isn't crashing** somewhere along the way.

In [None]:
# # === OPTIONAL ===
# # Increase 2.05 times the amount of data
# data = pd.concat([data, data, data.loc[:2000]], axis=0)
# data_gpu = cudf.concat([data_gpu, data_gpu, data_gpu.loc[:2000]], axis=0)

In [None]:
# Let's look at it
data.head(2)

In [None]:
# Save data to W&B Artifacts
### comment when Internet OFF
run = wandb.init(project='shopee-kaggle', name='original_data')
artifact = wandb.Artifact(name='original', 
                          type='dataset')

artifact.add_file("../input/shopee-preprocessed-data/train.parquet")
artifact.add_file("../input/shopee-preprocessed-data/test.parquet")

wandb.log_artifact(artifact)
wandb.finish()

# 3. Competition Metric

Let's now understand the competition metric. I usually like to have this down, as it is a very important part of the prediction process.

*📌 Again, the methodology is highly inspired by [[PART 2] - RAPIDS TfidfVectorizer - [CV 0.700]](https://www.kaggle.com/cdeotte/part-2-rapids-tfidfvectorizer-cv-0-700) 📌*

<img src="https://i.imgur.com/h3oWxLT.png" width=800>
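To make the metric concrete, here is a small worked example on made-up posting ids (the ids and the helper name `row_f1` are illustrative assumptions, not taken from the competition data). For each row, F1 is `2 * |target ∩ preds| / (|target| + |preds|)`, and the score is the mean over all rows.

```python
def row_f1(target, preds):
    """F1 between the true group and the predicted group of one posting."""
    target, preds = set(target), set(preds)
    return 2 * len(target & preds) / (len(target) + len(preds))

# Hypothetical postings: A, B and C are the same product, D is on its own.
targets = {"A": ["A", "B", "C"], "B": ["A", "B", "C"],
           "C": ["A", "B", "C"], "D": ["D"]}
# Hypothetical predictions: C is missed for A and B, and C is wrongly paired with D.
preds = {"A": ["A", "B"], "B": ["A", "B"], "C": ["C", "D"], "D": ["D"]}

scores = {pid: row_f1(targets[pid], preds[pid]) for pid in targets}
mean_f1 = sum(scores.values()) / len(scores)

print(scores)
print("Mean F1:", round(mean_f1, 3))
```

Rows A and B each find 2 of 3 true matches with no false positives (F1 = 0.8), row C finds only itself plus one wrong match (F1 = 0.4), and row D is perfect (F1 = 1.0), so the mean is 0.75.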

In [None]:
def F1_score(target_column, pred_column):
    '''Returns the F1_score for each row in the data.
    Remember: The final score is the mean F1 score.
    target_column: the name of the column that contains the target
    pred_column: the name of the column that contains the prediction
    '''
    
    def get_f1(row):
        # Find the common values in target and prediction arrays.
        intersection = len( np.intersect1d(row[target_column], row[pred_column]) )
        # Computes the score by following the formula
        f1_score = 2 * intersection / (len(row[target_column]) + len(row[pred_column]))
        
        return f1_score
    
    return get_f1

So, without any model at all (simply grouping postings that share the same `image_phash`), we get a **CV score** of **0.553**.

In [None]:
run = wandb.init(project='shopee-kaggle', name='metric_baseline')

data_baseline = data.copy()

# Create artificial prediction column
### based on image_phash - all images with the same image_phash are the same
group_baseline = data_baseline.groupby("image_phash")["posting_id"].unique().to_dict()
data_baseline['preds'] = data_baseline["image_phash"].map(group_baseline)

# Get F1 score for each row
data_baseline['F1'] = data_baseline.apply(F1_score(target_column="target", pred_column="preds"), axis=1)
print('CV score for baseline = {:.3f}'.format(data_baseline["F1"].mean()))
wandb.log({"Baseline CV Score" : data_baseline["F1"].mean()})

wandb.finish()

# 4. PyTorch Dataset

We'll create a Dataset class called `ShopeeDataset` that will:
1. Receive the metadata
2. Read in the `image` and `title`
3. Perform image augmentation and tokenization
4. Return the necessary information to feed into the model afterwards


### The BERT Tokenizer ([data from Abhishek Thakur](https://www.kaggle.com/abhishek/bert-base-uncased/code?datasetId=431504&sortBy=voteCount)):
* Pretrained tokenizer that splits sentences into tokens (source from `transformers` library - [click here for more info](https://huggingface.co/transformers/preprocessing.html))
* The output is as follows:
    * `input_ids`: indices corresponding to each token in the sentence
    * `attention_mask`: indicates to the model which tokens should be attended to, and which should not ([documentation on attention_mask here](https://huggingface.co/transformers/glossary.html#attention-mask))
<img src="https://i.imgur.com/3uY3YFi.png" width=500>
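
The two outputs can be illustrated without loading BERT at all. The sketch below is a toy whitespace tokenizer, not BERT's WordPiece: `toy_vocab` and `toy_encode` are hypothetical names and the word ids are made up. It only mimics how padding and truncation to a fixed `max_length` produce `input_ids` and `attention_mask`.

```python
# Special-token ids follow the bert-base-uncased convention:
# [PAD]=0, [UNK]=100, [CLS]=101, [SEP]=102. Word ids are invented.
PAD, UNK, CLS, SEP = 0, 100, 101, 102
toy_vocab = {"red": 1001, "cotton": 1002, "shirt": 1003}

def toy_encode(text, max_length=8):
    """Return (input_ids, attention_mask), both of length max_length."""
    tokens = [toy_vocab.get(word, UNK) for word in text.lower().split()]
    tokens = tokens[: max_length - 2]          # truncation (room for CLS/SEP)
    input_ids = [CLS] + tokens + [SEP]
    attention_mask = [1] * len(input_ids)      # 1 = real token
    n_pad = max_length - len(input_ids)
    # Padding positions get id 0 and mask 0, so the model ignores them
    return input_ids + [PAD] * n_pad, attention_mask + [0] * n_pad

ids, mask = toy_encode("Red cotton shirt XL")
print(ids)
print(mask)
```

The unknown word "xl" maps to the `[UNK]` id, and the two trailing padding slots are the only positions where the mask is 0.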

In [None]:
class ShopeeDataset(Dataset):
    
    def __init__(self, csv, train):
        self.csv = csv.reset_index()
        self.train = train
        
        # Instantiate one of the tokenizer classes of the library from BERT
        self.tokenizer = AutoTokenizer.from_pretrained('../input/bert-base-uncased')
        # Image Augmentation
        self.transform = Compose([VerticalFlip(p=0.5),
                                  HorizontalFlip(p=0.5),
                                  Resize(256, 256),
                                  Normalize(),
                                 ])
        
    def __len__(self):
        return len(self.csv)
    
    
    def __getitem__(self, index):
        '''Read in image & title as PyTorch Dataset.
        Return the transformed image and text ids and mask.'''
            
        # Read in image and text data
        image = cv2.imread(self.csv["filepath"][index])
        text = self.csv["title"][index]
        
        # Transform image & transpose channels [color, height, width]
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        image_transf = self.transform(image=image)["image"].astype(np.float32)
        image_transf = torch.tensor(image_transf.transpose(2, 0, 1))
        
        # Tokenize the text using BERT
        text_token = self.tokenizer(text, padding="max_length",
                                    truncation=True, max_length=16,
                                    return_tensors="pt")
        input_ids = text_token["input_ids"][0]
        attention_mask = text_token["attention_mask"][0]
        
        # Return dataset info
        ### if "test", we won't have label_group available
        if self.train == True:
            label_group = torch.tensor(self.csv["label_group"][index])
            return image_transf, input_ids, attention_mask, label_group
        
        else:
            return image_transf, input_ids, attention_mask

Now we can create the `dataset` and the `dataloader`. Remember, if:
* **COMPUTE_CV == True**: `dataset_data` variable will contain `train.csv` data
* **COMPUTE_CV == False**: `dataset_data` variable will contain `test.csv` data

In [None]:
# Compute dataloader for test data
dataset_data = ShopeeDataset(csv=data, train=False)
data_loader = DataLoader(dataset_data, batch_size=16,
                         num_workers=4)

print("Dataset Len: {:,}".format(len(dataset_data)), "\n" +
      "Image Shape [0]: {}".format(dataset_data[0][0].shape), "\n" +
      "input_ids [0]: {}".format(dataset_data[0][1]), "\n" +
      "attention_mask [0]: {}".format(dataset_data[0][2]))

# 5. Grouping using Image Embeddings

Now we can safely extract the embeddings from our images using EffNet. You can find more on PyTorch EfficientNet [here](https://github.com/lukemelas/EfficientNet-PyTorch).

The Embeddings are actually the abstract representation of the images:
* `input`: an image of [3, 256, 256] (3 channels, of size 256x256)
* `output`: an array of 1000 items which is the abstract representation of the input structure (see image below)
<img src="https://i.imgur.com/PjLEVaE.png" width=550>

## I. Retrieving the embeddings

> **📌 Note**: Because we do not have Internet access for this notebook, we need to import the EffNet model from a dataset. Nikita Kozodoi has kindly already created this for us [here](https://www.kaggle.com/kozodoi/efficientnet-pytorch). 
<img src="https://miro.medium.com/max/910/1*CjpipU_oChc899f_Esjpyg.png" width=400>

In [None]:
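# Build EfficientNet-B2, load the pretrained weights and put the model on the GPU
model_effnet = EfficientNet.from_name("efficientnet-b2").cuda()
model_effnet.load_state_dict(torch.load("../input/efficientnet-pytorch/efficientnet-b2-27687264.pth"))
# Inference mode: use stored batch-norm statistics and disable dropout
_ = model_effnet.eval()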

# model_resnet = resnet50(pretrained = False).cuda()
# model_resnet.load_state_dict(torch.load('../input/pretrained-pytorch-models/resnet50-19c8e357.pth'))

> **📌 Note**: The cell below takes ~ 6 mins to run. Hence, I have saved the `image_embeddings` numpy array [here](https://www.kaggle.com/andradaolteanu/shopee-preprocessed-data).

> What we are doing is appending to EACH batch of images (`[16, 1000]`) the `ids` extracted from BERT (`[16, 16]`) and the `masks` (`[16, 16]`) => `[16, 1032]`
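
The shape arithmetic can be checked with a minimal sketch on dummy values. Plain lists stand in for the arrays here; `np.hstack` does the same row-wise concatenation in the notebook. All values below are placeholders, not real embeddings.

```python
batch_size = 16

# Placeholder batch: 1000 image outputs, 16 token ids and 16 mask values per row
img_embeddings = [[0.0] * 1000 for _ in range(batch_size)]
ids = [[0] * 16 for _ in range(batch_size)]
mask = [[1] * 16 for _ in range(batch_size)]

# Row-wise concatenation, equivalent to np.hstack((img_embeddings, ids, mask))
combined = [img + i + m for img, i, m in zip(img_embeddings, ids, mask)]

shape = (len(combined), len(combined[0]))
print(shape)
```

Each row ends up with 1000 + 16 + 16 = 1032 values, which is the width of the final embedding matrix.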

In [None]:
# Extract embeddings of the image (the EfficientNet-B2 representation)
embeddings = []

# We aren't training, only extracting the representation
with torch.no_grad():
    for image, ids, mask in tqdm.tqdm(data_loader):
        # Don't forget to move the image to the GPU as well
        image = image.cuda()
        ids = ids.detach().numpy()
        mask = mask.detach().numpy()
        
        img_embeddings = model_effnet(image)
        img_embeddings = img_embeddings.detach().cpu().numpy()
        # Add information from ids and mask as well
        img_embeddings = np.hstack((img_embeddings, ids, mask))
        embeddings.append(img_embeddings)
        

# Concatenate all embeddings
all_image_embeddings = np.concatenate(embeddings)
print("image_embeddings shape: {:,}/{:,}".format(all_image_embeddings.shape[0], all_image_embeddings.shape[1]))

# Save it to a binary file in NumPy .npy format.
# np.save("image_embeddings", all_image_embeddings)

In [None]:
# Read in image_embeddings
# all_image_embeddings = np.load("../input/shopee-preprocessed-data/image_embeddings.npy")

# Save image_embeddings to W&B
### comment when Internet OFF
run = wandb.init(project='shopee-kaggle', name='image_embeddings')
artifact = wandb.Artifact(name='image_embeddings', 
                          type='dataset')

artifact.add_file("../input/shopee-preprocessed-data/image_embeddings.npy")

wandb.log_artifact(artifact)
wandb.log({"Length of Image embeddings" : all_image_embeddings.shape[1],
           "Width of Image embeddings" : all_image_embeddings.shape[0]})
wandb.finish()

In [None]:
# Clean memory
del model_effnet
_ = gc.collect()

## II. Creating the predictions

The competition says that "group sizes are capped at 50, so there is no benefit to predict more than 50 matches." Hence, we'll create clusters of a maximum size of 50.
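
The idea can be sketched in plain Python on a toy set of 2-D points: for each item, keep at most `k` nearest neighbours, then drop those that are further away than a distance threshold. The points, `k=3`, `threshold=1.5` and the helper name `neighbours_within` are made-up assumptions for illustration; the notebook itself uses cuML's `NearestNeighbors` on the full embeddings.

```python
import math

def neighbours_within(points, k, threshold):
    """For each point, return the indices of its k nearest points
    (itself included) that are closer than `threshold`."""
    result = []
    for p in points:
        # Brute-force Euclidean distances to every point, nearest first
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points))
        result.append([j for d, j in dists[:k] if d < threshold])
    return result

# Three points close together and one far away
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
groups = neighbours_within(points, k=3, threshold=1.5)

for i, g in enumerate(groups):
    print(i, "->", g)
```

The cap `k` bounds the group size, while the threshold lets an isolated item (the last point here) end up matched only with itself.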

In [None]:
run = wandb.init(project='shopee-kaggle', name='image_predictions')    ### comment when Internet OFF

In [None]:
# Create the model instance
if len(data) > 3:
    knn_model = NearestNeighbors(n_neighbors=50)
    wandb.log({"n_neighbors" : 50})     ### comment when Internet OFF
else:
    knn_model = NearestNeighbors(n_neighbors=2)
    wandb.log({"n_neighbors" : 2})      ### comment when Internet OFF
    
# Fit the model (KNN only indexes the embeddings)
knn_model.fit(all_image_embeddings)

In [None]:
# Creating the splits, to prevent memory errors
### more info on this in Chris's notebook
predictions = []
CHUNK = 1024 * 4  ### 4096

SPLITS = len(all_image_embeddings) // CHUNK
if len(all_image_embeddings) % CHUNK != 0: SPLITS += 1
print("Total Splits:", SPLITS)


# Making the prediction
print("Finding Similar Images ...")

for no in range(SPLITS):
    
    a = no * CHUNK
    b = (no+1) * CHUNK
    b = min(b, len(all_image_embeddings))
    print("CHUNK:", a, "-", b)
    
    distances, indices = knn_model.kneighbors(all_image_embeddings[a:b,])
    
    for k in range(b-a):
        index = np.where(distances[k, ] < 6.0)[0]
        split = indices[k, index]
        pred = data.iloc[split]["posting_id"].values
        
        predictions.append(pred)

        
# Clean environment
del knn_model, distances, indices
_ = gc.collect()

In [None]:
# Add predictions to dataframe
data['img_pred'] = predictions
data.head(3)

In [None]:
### comment when Internet OFF
wandb.finish()

### Bonus: 3D Plotting on Image Embeddings Clusters

> We'll use PCA to reduce the data from 1,032 features (1,000 image outputs + 32 token values) to only 3.

In [None]:
# Create dataframe
img_embeddings_df = pd.DataFrame(all_image_embeddings)

# Separating out the features
X = img_embeddings_df.values
# Standardizing the features
X = StandardScaler().fit_transform(X)

# Separating out the target
y = data["label_group"]


# PCA
pca = PCA(n_components=3)
principalComponents = pca.fit_transform(X)
# pca.explained_variance_ratio_.sum()

principalDf = pd.DataFrame(data = principalComponents,
                           columns = ['pc_1', 'pc_2', 'pc_3'])
finalDf = pd.concat([principalDf, y], axis = 1)

In [None]:
# Plot
fig = plt.figure(figsize=(20, 15))
ax = plt.axes(projection='3d')

ax.scatter3D(finalDf['pc_1'], finalDf['pc_2'], finalDf['pc_3'], c=finalDf['label_group'], cmap='BrBG')
ax.set_title('Image Embeddings: 3D Cluster', size=20);

In [None]:
del img_embeddings_df, X, pca, principalDf, finalDf, all_image_embeddings
_ = gc.collect()

# 6. Grouping using Text Embeddings

As we also have the `title` of each posting available, it would be a shame not to use it for predicting as well. In this part we'll use a TF-IDF vectorizer to extract text embeddings from the titles.

## I. Retrieving the embeddings

> A `TfIdf` Process looks like the example below:
<img src="https://i.imgur.com/W2tVXDY.png" width=700>
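
A minimal stdlib sketch of the same idea on three made-up titles is shown below. It assumes binary term presence, the smoothed idf used by scikit-learn's defaults (`ln((1 + n) / (1 + df)) + 1`) and L2-normalized rows; the titles and helper names (`embed`, `cosine`) are illustrative only, and the real vectorizer also handles stop words and tokenization differently.

```python
import math

titles = ["red cotton shirt", "red cotton shirt xl", "blue denim jeans"]
docs = [set(t.split()) for t in titles]      # binary: a word is present or not
vocab = sorted(set().union(*docs))
n = len(docs)

# Smoothed inverse document frequency: rarer words get a larger weight
idf = {w: math.log((1 + n) / (1 + sum(w in d for d in docs))) + 1
       for w in vocab}

def embed(doc):
    """TF-IDF vector of one document, scaled to unit length."""
    vec = [idf[w] if w in doc else 0.0 for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def cosine(u, v):
    """For unit-length vectors the dot product is the cosine similarity."""
    return sum(a * b for a, b in zip(u, v))

emb = [embed(d) for d in docs]
sim_01 = cosine(emb[0], emb[1])   # titles sharing three words
sim_02 = cosine(emb[0], emb[2])   # titles sharing no words

print(round(sim_01, 3), round(sim_02, 3))
```

Because the rows are unit length, a plain matrix product between embeddings yields cosine similarities directly, which is why a single `matmul` is enough in the prediction step. In this toy case the first two titles clear a 0.7 threshold and the unrelated pair scores 0.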

In [None]:
# Extract the Tf-Idf Matrix
# TODO: Extract more features & add preprocessing from notebook I
tf_idf = TfidfVectorizer(stop_words='english', binary=True, max_features=25000)
text_embeddings = tf_idf.fit_transform(data_gpu["title"]).toarray()

print("Text Embeddings Matrix format: {:,}/{:,}".format(text_embeddings.shape[0], text_embeddings.shape[1]))

In [None]:
# Save text_embeddings to W&B
### comment when Internet OFF
run = wandb.init(project='shopee-kaggle', name='text_embeddings')
artifact = wandb.Artifact(name='text_embeddings', 
                          type='dataset')

artifact.add_file("../input/shopee-preprocessed-data/text_embeddings.npy")

wandb.log_artifact(artifact)
wandb.log({"Length of Text embeddings" : text_embeddings.shape[1],
           "Width of Text embeddings" : text_embeddings.shape[0]})
wandb.finish()

## II. Creating the predictions

In [None]:
def find_matches_cupy(X, posting_ids, threshold):
    '''For each column of X, join the posting ids of all columns
    whose Euclidean distance to it is below `threshold`.
    X: embeddings laid out as [features, observations]
    posting_ids: numpy array of ids, one per observation'''
    # TODO: to be developed
    # https://www.kaggle.com/c/shopee-product-matching/discussion/230486
    X = cupy.array(X)
    N = X.shape[1]
    matches = []

    for i in tqdm.tqdm(range(N)):
        v = X[:, i].reshape(-1, 1)
        thresholded_bool = cupy.linalg.norm(v - X, axis=0) < threshold
        thresholded_ix = cupy.argwhere(thresholded_bool).squeeze(-1)
        thresholded_ix = thresholded_ix.get()
        match = " ".join(posting_ids[thresholded_ix])
        matches.append(match)

    return matches

In [None]:
# Creating the splits, to prevent memory errors
### more info on this in Chris's notebook
predictions = []
CHUNK = 1024 * 4  ### 4096

SPLITS = len(text_embeddings) // CHUNK
if len(text_embeddings) % CHUNK != 0: SPLITS += 1
print("Total Splits:", SPLITS)


# Making the prediction
print("Finding Similar Titles ...")

for no in range(SPLITS):
    
    a = no * CHUNK
    b = (no+1) * CHUNK
    b = min(b, len(text_embeddings))
    print("CHUNK:", a, "-", b)
    
    # Cosine similarity: TF-IDF rows are L2-normalized by default,
    # so the dot product of two rows is their cosine similarity
    cts = cupy.matmul(text_embeddings, text_embeddings[a:b].T).T
    
    for k in range(b-a):
        index = cupy.where(cts[k,] > 0.7)[0]
        index = cupy.asnumpy(index)
        pred = data.iloc[index]["posting_id"].values
        
        predictions.append(pred)

        
# Clean environment
del tf_idf, text_embeddings
_ = gc.collect()

In [None]:
# Add predictions to dataframe
data['title_pred'] = predictions
data.head(3)

### Bonus: 3D Plotting on Text Embeddings Clusters

> We'll use PCA to reduce the text embeddings (up to 25,000 TF-IDF features) to only 3.

In [None]:
from cuml.experimental.preprocessing import StandardScaler as StandardScaler_gpu
from cuml.decomposition import PCA as PCA_gpu

In [None]:
# # Create dataframe
# text_embeddings_df = cudf.DataFrame(text_embeddings)

# # Separating out the features
# X = text_embeddings_df.values
# # Standardizing the features
# X = StandardScaler_gpu().fit_transform(X)

# # Separating out the target
# y = data["label_group"]


# # PCA
# pca = PCA_gpu(n_components=3)
# principalComponents = pca.fit_transform(X)

# principalDf = cudf.DataFrame(data = principalComponents,
#                              columns = ['pc_1', 'pc_2', 'pc_3'])
# finalDf = cudf.concat([principalDf, y], axis = 1)

# 7. Final predictions

Now that we have predictions linked to both image and title embeddings, we can combine them and create the final predictions that we'll also submit to the leaderboard.
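
As a minimal sketch of what "combining" means here, the example below takes the union of three prediction lists for a single posting and joins it into the space-separated string that the submission expects. The posting ids and the helper name `combine` are made up for illustration.

```python
def combine(*pred_lists):
    """Union of several prediction lists, sorted for a stable output."""
    return sorted(set().union(*pred_lists))

# Hypothetical predictions for one posting
img_pred = ["train_1", "train_2"]       # from image embeddings
title_pred = ["train_1", "train_3"]     # from title embeddings
duplic_pred = ["train_1"]               # from identical image_phash

all_preds = combine(img_pred, title_pred, duplic_pred)
matches = " ".join(all_preds)

print(all_preds)
print(matches)
```

Taking the union favours recall: a posting is kept as a match if any of the three signals flags it, and duplicates across signals are counted once.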

In [None]:
# All images that have the same phash are identical, so we'll add these too
duplicate_dict = data.groupby('image_phash').posting_id.agg('unique').to_dict()
data['duplic_pred'] = data["image_phash"].map(duplicate_dict)

In [None]:
def combine_predictions(row, cv=True):
    '''Combine all predictions together.'''
    
    # Concatenate all predictions
    all_preds = np.concatenate([row["img_pred"],row["title_pred"], row["duplic_pred"]])
    all_preds = np.unique(all_preds)
    
    # Return combined unique preds
    if cv == True:
        return all_preds
    else:
        return ' '.join(all_preds)

> **CV Score: 0.67** with a submission score in Leaderboard of **0.66**.
<img src="https://i.imgur.com/QLtVqqq.png" width=600>

In [None]:
if COMPUTE_CV == True:
    
    data["all_preds"] = data.apply(lambda x: combine_predictions(x, cv=True), axis=1)
    data["f1"] = data.apply(F1_score(target_column="target", pred_column="all_preds"), axis=1)
    print("CV Score: {:.3}".format(data["f1"].mean()))
    

data["matches"] = data.apply(lambda x: combine_predictions(x, cv=False), axis=1)

In [None]:
# Plot F1 Score on product
### the "f1" column only exists when COMPUTE_CV is True
if COMPUTE_CV:
    plt.figure(figsize = (20, 6))

    plot = sns.kdeplot(x = data["f1"])
    plt.title("F1 score Distribution", fontsize=20)
    plt.xlabel("F1", fontsize=15)
    plt.ylabel("");

In [None]:
# --- Make a custom plot to save into W&B ---
### comment when Internet OFF
run = wandb.init(project='shopee-kaggle', name='f1_final_scores')

# Prepare data
custom_data = [[s] for s in data["f1"]]

# Create Table & .log() the plot
table = wandb.Table(data=custom_data, columns=["f1"])
wandb.log({'f1_hist': wandb.plot.histogram(table, "f1",
                                           title="F1 score Distribution")})

wandb.finish()

> This is how the distribution shows in the W&B dashboard:
<img src="https://i.imgur.com/FnL9Br0.png" width=500>

## 📩 Submission

> **📌 Note**: Don't forget to disable the Internet access before submitting.

<div class="alert alert-block alert-warning">
<b>Note:</b> This notebook uses the Internet to connect to the W&B Dashboard. To submit it, you'll have to turn the Internet off and comment out the lines of code that save information to the W&B project.
</div>

In [None]:
data[['posting_id','matches']].to_csv('submission.csv',index=False)
print("Submission Ready :)")

<img src="https://i.imgur.com/cUQXtS7.png">

# Specs on how I trained ⌨️🎨
### (on my local machine)
* Z8 G4 Workstation 🖥
* 2 CPUs & 96GB Memory 💾
* NVIDIA Quadro RTX 8000 🎮
* RAPIDS version 0.17 🏃🏾‍♀️