# Workflow

- Image
    - Extract **image embeddings of each item** with EffNetB0
    - Compare them using RAPIDS cuML KNN
- Text
    - RAPIDS cuML TfidfVectorizer to extract **text embeddings of each item's title**
    - Compare the embeddings using RAPIDS cuML KNN or cosine similarity
    
In this notebook, we use embeddings to build a baseline model. To **further improve your score**, see [Eff-B4 + TFIDF w/ CV for threshold_searching](https://www.kaggle.com/chienhsianghung/eff-b4-tfidf-w-cv-for-threshold-searching).

# Setting

## Colab

In [None]:
COLAB = False
default_dir = None

if COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    default_dir = '/content'

    # Install RAPIDS
    !git clone https://github.com/rapidsai/rapidsai-csp-utils.git
    !bash rapidsai-csp-utils/colab/rapids-colab.sh

    import sys, os

    dist_package_index = sys.path.index('/usr/local/lib/python3.6/dist-packages')
    # ValueError: '/usr/local/lib/python3.6/dist-packages' is not in list
    sys.path = sys.path[:dist_package_index] + ['/usr/local/lib/python3.6/site-packages'] + sys.path[dist_package_index:]
    sys.path
    exec(open('rapidsai-csp-utils/colab/update_modules.py').read(), globals())

    # install miniconda
    !wget -c https://repo.continuum.io/miniconda/Miniconda3-4.5.4-Linux-x86_64.sh
    !chmod +x Miniconda3-4.5.4-Linux-x86_64.sh
    !bash ./Miniconda3-4.5.4-Linux-x86_64.sh -b -f -p /usr/local

    # install RAPIDS packages
    !conda install -q -y --prefix /usr/local -c conda-forge \
    -c rapidsai-nightly/label/cuda10.0 -c nvidia/label/cuda10.0 \
    cudf cuml

    # set environment vars
    import shutil
    sys.path.append('/usr/local/lib/python3.6/site-packages/')
    os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
    os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'

    # copy .so files to current working dir
    # FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/libcudf.so'
    for fn in ['libcudf.so', 'librmm.so']:
        shutil.copy('/usr/local/lib/'+fn, os.getcwd())

    # ModuleNotFoundError: No module named 'conda'
    !conda install -c nvidia nvstrings==0.1.0
    !conda install -c rapidsai -c numba -c conda-forge -c defaults cudf=0.4.0

    # install miniconda
    !wget -c https://repo.continuum.io/miniconda/Miniconda3-4.5.4-Linux-x86_64.sh
    !chmod +x Miniconda3-4.5.4-Linux-x86_64.sh
    !bash ./Miniconda3-4.5.4-Linux-x86_64.sh -b -f -p /usr/local

    !conda install -c pytorch faiss-gpu cuda92

    # install RAPIDS packages
    !conda install -q -y --prefix /usr/local -c conda-forge \
    -c rapidsai-nightly/label/cuda10.0 -c nvidia/label/cuda10.0 \
    cudf cuml
    # FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/libcudf.so'
    # https://github.com/rapidsai/cuml/issues/25
    # Running conda install -c pytorch faiss-gpu cuda92 prior to the cuML install resolves the issue.

    # set environment vars
    import sys, os, shutil
    sys.path.append('/usr/local/lib/python3.6/site-packages/')
    os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
    os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'

    # copy .so files to current working dir
    # FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/libcudf.so'
    for fn in ['libcudf.so', 'librmm.so']:
        shutil.copy('/usr/local/lib/'+fn, os.getcwd())

else: default_dir = '../input/shopee-product-matching'

I've downloaded the data to MyDrive. If you would rather use the Kaggle API, see [here](https://www.kaggle.com/questions-and-answers/135301). I think it is a safer way to avoid missing files when unzipping the data onto the VM.

In [None]:
if COLAB:
    !cp /content/drive/MyDrive/ML/Shopee\ -\ Price\ Match\ Guarantee/shopee-product-matching.zip /content
    !unzip \*.zip  && rm *.zip

    # To check whether any file is missing from train_images
    # https://askubuntu.com/questions/370697/how-to-count-number-of-files-in-a-directory-but-not-recursively
    %cd train_images/
    !ls -F |grep -v / | wc -l
    %cd ..

## Libraries

In [None]:
import tensorflow as tf
print('tf version:', tf.__version__)

import pandas as pd
import numpy as np
import cv2, matplotlib.pyplot as plt

from os.path import join

from tensorflow.keras.applications import EfficientNetB0, EfficientNetB3, EfficientNetB5, EfficientNetB6
import gc

## RAM Restriction
Restrict TensorFlow to 1 GB of GPU RAM so that enough GPU RAM is left for RAPIDS cuML KNN and the final result can be submitted. See ["Submission CSV Not Found" - struggling to submit](https://www.kaggle.com/c/shopee-product-matching/discussion/229672).

In [None]:
LIMIT = 1
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024*LIMIT)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), 'Physical GPUs', len(logical_gpus), 'Logical GPUs')
    except RuntimeError as e:
        print(e)
    print('Restrict TensorFlow to max %iGB GPU RAM'%LIMIT)
    print('so RAPIDS can use %iGB GPU RAM'%(16-LIMIT))
else:
    print('No accelerator detected')

# Training Data Load In

We have the image and title (X), and the column `label_group` (Y), which gives the ground truth of which items are similar.

## COMPUTE_CV

*This committed notebook computes a CV score, but when we submit it, it does not. At commit time `test.csv` has no more than 3 rows, so `COMPUTE_CV` stays `True`. When we submit the notebook to Kaggle, the 70,000-row `test.csv` is loaded instead, its length exceeds 3, and the if-statement below sets `COMPUTE_CV = False`, so the notebook computes matches in the test dataset.*

In [None]:
COMPUTE_CV = True

test = pd.read_csv(join(default_dir, 'test.csv'))
if len(test)>3: COMPUTE_CV = False
else: print('this commit notebook will compute CV score, but the submission notebook will not')

In [None]:
train = pd.read_csv(join(default_dir, 'train.csv'))
tmp = train.groupby('label_group').posting_id.agg('unique').to_dict()
train['target'] = train.label_group.map(tmp)
print('train shape is', train.shape)
train.head()

## Images Check Randomly

In [None]:
BASE = join(default_dir, 'train_images/')

def displayDF(train, random=False, COLS=6, ROWS=4, path=BASE):
    for k in range(ROWS):
        plt.figure(figsize=(20,5))
        for j in range(COLS):
            if random: row = np.random.randint(0,len(train))
            else: row = COLS*k + j
            name = train.iloc[row,1]
            title = train.iloc[row,3]
            title_with_return = ""
            for i,ch in enumerate(title):
                title_with_return += ch
                if (i!=0)&(i%20==0): title_with_return += '\n'
            img = cv2.imread(path+name)
            
            # color fixing
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            
            plt.subplot(1,COLS,j+1)
            plt.title(title_with_return)
            plt.axis('off')
            plt.imshow(img)
        plt.show()
        
displayDF(train,random=True)

## Duplicated Items

In [None]:
groups = train.label_group.value_counts()
plt.figure(figsize=(20, 5))
plt.plot(np.arange(len(groups)), groups.values)
plt.ylabel('Duplicate Count', size=14)
plt.xlabel('Index of Unique Item', size=14)
plt.title('Duplicate Count vs. Unique Item Count', size=16)
plt.show()

plt.figure(figsize=(20,5))
plt.bar(groups.index.values[:50].astype('str'),groups.values[:50])
plt.xticks(rotation = 45)
plt.ylabel('Duplicate Count',size=14)
plt.xlabel('Label Group',size=14)
plt.title('Top 50 Duplicated Items',size=16)
plt.show()

In [None]:
for k in range(2):
    print('#'*40)
    print('### TOP %i DUPLICATED ITEM:'%(k+1),groups.index[k])
    print('#'*40)
    top = train.loc[train.label_group==groups.index[k]]
    displayDF(top, random=False, ROWS=2, COLS=4)

# Baseline CV Score

A baseline is to predict all items with the same `image_phash` as duplicates.

In [None]:
tmp = train.groupby('image_phash').posting_id.agg('unique').to_dict()
train['oof'] = train.image_phash.map(tmp)
train.head()

In [None]:
def getMetric(col):
    def f1score(row):
        n = len( np.intersect1d(row.target, row[col]) )
        return 2*n / (len(row.target) + len(row[col]))
    return f1score
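
To make the metric concrete, here is a small worked example of the same row-wise F1 on made-up posting IDs. It is only a sketch: `row_f1` is an illustrative name, the IDs are hypothetical, and it uses plain Python sets instead of `np.intersect1d` to count the overlap.

```python
def row_f1(target, predicted):
    """Row-wise F1: 2 * |overlap| / (|target| + |predicted|)."""
    n = len(set(target) & set(predicted))  # items present in both lists
    return 2 * n / (len(target) + len(predicted))

# Hypothetical posting IDs, for illustration only.
target = ['a', 'b', 'c']          # ground-truth group of one item
predicted = ['a', 'b', 'd', 'e']  # what a model returned for that item

score = row_f1(target, predicted)
print(score)  # 2 * 2 / (3 + 4)
```

A perfect prediction gives 1.0, and every extra or missing ID lowers the score, which is why the distance and similarity thresholds used later matter.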

In [None]:
train['f1'] = train.apply(getMetric('oof'), axis=1)
print('CV score for baseline =', train.f1.mean())

del train

## Compute RAPIDS Model CV and Infer Submission

Note how the variable `COMPUTE_CV` is only `True` when we **commit** this notebook. Right now you are reading a **commit** notebook, so we see test replaced with train and computed CV score. When we **submit** this notebook, the variable `COMPUTE_CV` will be `False` and the **submit** notebook will **not** compute CV. Instead it will load the real test dataset with 70,000 rows and find duplicates in the real test dataset.

In [None]:
import cudf

if COMPUTE_CV:
    test = pd.read_csv(join(default_dir, 'train.csv'))
    test_gf = cudf.DataFrame(test)
    print('Using train as test to compute CV (since commit notebook). Shape is', test_gf.shape)
else:
    test = pd.read_csv(join(default_dir, 'test.csv'))
    test_gf = cudf.DataFrame(test)
    print('Test shape is', test_gf.shape)
test_gf.head()

# Training

## Use Image Embeddings

To prevent memory errors, we will compute image embeddings in chunks. And we will find similar images with RAPIDS cuML KNN in chunks.
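
The chunking used throughout this notebook boils down to walking over `(start, stop)` index pairs. A minimal sketch of that pattern, with `chunk_bounds` as an illustrative helper name that does not appear in the cells below:

```python
def chunk_bounds(n_rows, chunk_size):
    """Yield (start, stop) pairs that cover range(n_rows) in order."""
    for start in range(0, n_rows, chunk_size):
        # The last chunk is clipped so it never runs past the data.
        yield start, min(start + chunk_size, n_rows)

bounds = list(chunk_bounds(10, 4))
print(bounds)  # [(0, 4), (4, 8), (8, 10)]
```

The cells below compute the same bounds by hand with `a = j * CHUNK` and `b = min((j+1) * CHUNK, len(test))`.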

In [None]:
class DataGenerator(tf.keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, df, img_size=256, batch_size=32, path=''): 
        self.df = df
        self.img_size = img_size
        self.batch_size = batch_size
        self.path = path
        self.indexes = np.arange( len(self.df) )
        
    def __len__(self):
        'Denotes the number of batches per epoch'
        ct = len(self.df) // self.batch_size
        ct += int(( (len(self.df)) % self.batch_size)!=0)
        return ct

    def __getitem__(self, index):
        'Generate one batch of data'
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        X = self.__data_generation(indexes)
        return X
            
    def __data_generation(self, indexes):
        'Generates data containing batch_size samples' 
        X = np.zeros((len(indexes),self.img_size,self.img_size,3),dtype='float32')
        df = self.df.iloc[indexes]
        for i,(index,row) in enumerate(df.iterrows()):
            img = cv2.imread(self.path+row.image)
            X[i,] = cv2.resize(img,(self.img_size,self.img_size)) #/128.0 - 1.0
        return X

In [None]:
BASE = join(default_dir, 'test_images/')
if COMPUTE_CV: BASE = join(default_dir, 'train_images/')

### Keras implementation of EfficientNet

Training EfficientNet on ImageNet takes a tremendous amount of resources and relies on several techniques that are not part of the model architecture itself. Hence the Keras implementation loads, by default, pre-trained weights obtained via training with **AutoAugment**.

For B0 to B7 base models, the input shapes are different. [Here](https://keras.io/examples/vision/image_classification_efficientnet_fine_tuning/) is a list of input shape expected for each model.
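
For quick reference, the resolutions listed on that page can be kept in a small lookup table. The names `EFFNET_INPUT_SIZE` and `native_size` are illustrative and not part of Keras.

```python
# Native input resolution (pixels per side) for each base model,
# as listed in the Keras EfficientNet fine-tuning guide.
EFFNET_INPUT_SIZE = {
    'EfficientNetB0': 224,
    'EfficientNetB1': 240,
    'EfficientNetB2': 260,
    'EfficientNetB3': 300,
    'EfficientNetB4': 380,
    'EfficientNetB5': 456,
    'EfficientNetB6': 528,
    'EfficientNetB7': 600,
}

def native_size(model_name):
    """Return the native square input size for a base model name."""
    return EFFNET_INPUT_SIZE[model_name]

print(native_size('EfficientNetB0'))
```

Note that this notebook feeds 256-pixel images (512 for B6) rather than the native sizes; the models below are built with `include_top=False` and `input_shape=None`, so they accept other resolutions.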


#### ResourceExhaustedError

ResourceExhaustedError:  OOM when allocating tensor with shape[8,192,256,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc<br>
	 [[node efficientnetb6/block2a_expand_bn/FusedBatchNormV3 (defined at <ipython-input-16-5b6f43f107b0>:42) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.<br>
 [Op:__inference_predict_function_16157]<br>

Function call stack:<br>
predict_function

* [OOM when allocating tensor with shape #16768](https://github.com/tensorflow/tensorflow/issues/16768)
* [How to add report_tensor_allocations_upon_oom to RunOptions in Keras](https://stackoverflow.com/questions/49665757/how-to-add-report-tensor-allocations-upon-oom-to-runoptions-in-keras)

In [None]:
MODEL = EfficientNetB0
SAVE_IMGEMBEDDING = False

if not COMPUTE_CV or SAVE_IMGEMBEDDING:
    
    if MODEL == EfficientNetB0:
        WGT = '../input/effnetb0/efficientnetb0_notop.h5'
        model = EfficientNetB0(weights=WGT, include_top=False, pooling='avg', input_shape=None)
    elif MODEL == EfficientNetB3:
        WGT = '../input/tfkerasefficientnetimagenetnotop/efficientnetb3_notop.h5'
        model = EfficientNetB3(weights=WGT, include_top=False, pooling='avg', input_shape=None)
    elif MODEL == EfficientNetB5:
        WGT = '../input/tfkerasefficientnetimagenetnotop/efficientnetb5_notop.h5'
        model = EfficientNetB5(weights=WGT, include_top=False, pooling='avg', input_shape=None)
    elif MODEL == EfficientNetB6:
        WGT = '../input/tfkerasefficientnetimagenetnotop/efficientnetb6_notop.h5'
        model = EfficientNetB6(weights=WGT, include_top=False, pooling='avg', input_shape=None)

    embeds = []
    CHUNK = 1024 * 4

    print('Computing image embeddings...')
    CTS = len(test) // CHUNK
    if len(test) % CHUNK != 0: CTS += 1
    for i, j in enumerate(range(CTS)):

        a = j * CHUNK
        b = (j+1) * CHUNK
        b = min(b, len(test))
        print('chunk', a, 'to', b)
        
        if MODEL == EfficientNetB6:
            test_gen = DataGenerator(test.iloc[a:b], img_size=512, batch_size=6, path=BASE)
        else:
            test_gen = DataGenerator(test.iloc[a:b], batch_size=32, path=BASE)
            
        image_embeddings = model.predict(test_gen, verbose=1, use_multiprocessing=True, workers=4)
        embeds.append(image_embeddings)

        #if i>=1: break

    del model
    _ = gc.collect()
    image_embeddings = np.concatenate(embeds)

    # Saving a NumPy Array to CSV File
    if SAVE_IMGEMBEDDING: np.savetxt('image_embeddings_EfficientNetB6.csv', image_embeddings, delimiter=',')

else:
    print('Loading image embeddings...')
    if MODEL == EfficientNetB0:
        image_embeddings = np.loadtxt('../input/shopee-price-match-guarantee-embeddings/image_embeddings.csv',
                                 delimiter=',')
    else: raise ValueError('Please select the correspondent model and embeddings in "../input/shopee-price-match-guarantee-embeddings".')

print('image embeddings shape',image_embeddings.shape)

Please Note! As stated in competition's evaluation page:<br>
*Group sizes were capped at 50, so there is no benefit to predict more than 50 matches.*

[I predicted 100 matches](https://www.kaggle.com/muhammad4hmed/you-need-more-tensors-in-neighbourhood) in the hope of finding more of the actual neighbors, and it did improve the LB score very slightly (around the third decimal place).
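
The matching rule in the next cells is "take the K nearest neighbors, then keep only those closer than a distance threshold". Below is a CPU-only sketch of that rule on a toy array, using brute-force NumPy distances as a stand-in for RAPIDS cuML KNN. The function name `neighbours_within` and the toy values are assumptions made for illustration.

```python
import numpy as np

def neighbours_within(embeddings, k, max_dist):
    """For each row, return indices of its k nearest rows closer than max_dist."""
    # Pairwise Euclidean distances: fine for a toy array, too large for real data.
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    order = np.argsort(dist, axis=1)[:, :k]            # k nearest, self included
    nearest = np.take_along_axis(dist, order, axis=1)  # their distances
    return [order[i, nearest[i] < max_dist] for i in range(len(embeddings))]

# Two tight pairs of points that are far apart from each other.
toy = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.2]])
matches = neighbours_within(toy, k=3, max_dist=1.0)
print([sorted(m.tolist()) for m in matches])
```

Each row always matches itself at distance 0, so the threshold only decides which of the other neighbors are kept.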

In [None]:
from cuml.neighbors import NearestNeighbors

KNN = 100
if len(test) == 3: KNN = 2
model = NearestNeighbors(n_neighbors=KNN)
model.fit(image_embeddings)

In [None]:
preds = []
CHUNK = 1024*4

print('Finding similar images...')
CTS = len(image_embeddings) // CHUNK
if len(image_embeddings) % CHUNK != 0: CTS += 1
for j in range(CTS):
    
    a = j * CHUNK
    b = (j+1) * CHUNK
    b = min(b, len(image_embeddings))
    print('chunk', a, 'to', b)
    distances, indices = model.kneighbors(image_embeddings[a:b, ])
    
    for k in range(b-a):
        IDX = np.where(distances[k, ] < 6.0)[0]
        IDS = indices[k, IDX]
        o = test.iloc[IDS].posting_id.values
        preds.append(o)
        
del model, distances, indices, image_embeddings # embeds
_ = gc.collect()

test['preds2'] = preds
test.head()

## Use Text Embeddings

In [None]:
from cuml.feature_extraction.text import TfidfVectorizer

print('Computing text embeddings...')
model = TfidfVectorizer(stop_words=None, binary=True, max_features=25_000)
text_embeddings = model.fit_transform(test_gf.title).toarray()

# Saving a NumPy Array to CSV File
# np.savetxt('text_embeddings.csv', text_embeddings, delimiter=',')
# Warning: It's not np type

print('text embeddings shape',text_embeddings.shape)

In [None]:
# To prevent memory errors, we will find similar titles in chunks.
# To facilitate this, we will use cosine similarity between text embeddings instead of KNN.
COSINE_SIMILARITY = True
Text_KNN_Follow_Up = False

if not COSINE_SIMILARITY:
    KNN = 100
    if len(test) == 3: KNN = 2
    model = NearestNeighbors(n_neighbors = KNN)
    model.fit(text_embeddings)

Here is an amazing post-searching method.
- [@kirderf](https://www.kaggle.com/kirderf)
 - [Features and post processing that might help](https://www.kaggle.com/c/shopee-product-matching/discussion/233626)
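
The next cell applies a threshold-relaxation idea: start with a similarity cut-off of 0.7 and, while an item matches nothing but itself, lower the cut-off in steps of 0.02 without reaching 0.5. A minimal NumPy sketch follows, assuming unit-length rows so that a matrix product gives cosine similarity. `relaxed_matches`, `THRESHOLDS`, and the toy vectors are illustrative names and values, not taken from the linked post.

```python
import numpy as np

# Cut-offs tried in order: 0.70, 0.68, ..., 0.52.
THRESHOLDS = [round(0.70 - 0.02 * i, 2) for i in range(10)]

def relaxed_matches(sims, thresholds=THRESHOLDS):
    """Indices above the first cut-off, relaxing it while only one match exists."""
    idx = np.where(sims > thresholds[0])[0]
    for thr in thresholds[1:]:
        if len(idx) > 1:  # found something besides the item itself
            break
        idx = np.where(sims > thr)[0]
    return idx

raw = np.array([[2.0, 0.0], [4.0, 3.0], [0.0, 5.0]])
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)  # L2-normalise rows
sims = unit @ unit.T                                      # cosine similarities
results = [relaxed_matches(row).tolist() for row in sims]
print(results)
```

In this toy data the last row has no neighbor above 0.7, so its cut-off is lowered until the second row (similarity 0.6) is picked up.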

In [None]:
import cupy

preds = []
CHUNK = 1024*4

print('Finding similar titles...')

CTS = len(test) // CHUNK
if len(test) % CHUNK != 0: CTS += 1

for j in range(CTS):
    
    a = j * CHUNK
    b = (j+1) * CHUNK
    b = min(b, len(test))
    print('chunk', a, 'to', b)
    
    if COSINE_SIMILARITY:
        # COSINE SIMILARITY DISTANCE
        cts = cupy.matmul(text_embeddings, text_embeddings[a:b].T).T

        for k in range(b-a):
            IDX = cupy.where(cts[k, ] > 0.7)[0]
            o = test.iloc[cupy.asnumpy(IDX)].posting_id.values
            for ii in np.arange(0.7, 0.5, -0.02):
                if ii > 0.5 and o.shape[0] <= 1:
                    IDX = cupy.where(cts[k, ] > ii)[0]
                    o = test.iloc[cupy.asnumpy(IDX)].posting_id.values
            preds.append(o)
    
    else:
        # KNN
        distances, indices = model.kneighbors(text_embeddings[a:b,])
        
        for k in range(b-a):
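            # Keep the neighbours whose distance is below the threshold,
            # then look up their row indices in the test frame.
            IDX = cupy.where(distances[k, ] < 6.0)[0]
            IDS = indices[k, IDX]
            o = test.iloc[cupy.asnumpy(IDS)].posting_id.values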
            preds.append(o)
            
            # IDX = np.where(distances[k, ] < 6.0)[0]
            # IDS = indices[k, IDX]
            # o = test.iloc[IDS].posting_id.values
            # preds.append(o)
            
            # TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.
            # https://stackoverflow.com/questions/65008297/attempting-numpy-conversion-when-not-needed-in-cupy
            
if not Text_KNN_Follow_Up: del model, text_embeddings
else: del model
_ = gc.collect()

test['preds'] = preds
test.head()

## Text KNN Follow Up

A thought from [here](https://www.kaggle.com/muhammad4hmed/b3-tfidf-knn-boom-p/comments).

In [None]:
if Text_KNN_Follow_Up:
    KNN = 100
    if len(test) == 3: KNN = 2
    model = NearestNeighbors(n_neighbors = KNN)
    model.fit(text_embeddings)

    preds = []
    CHUNK = 1024*4

    print('Finding similar titles using KNN...')

    CTS = len(test) // CHUNK
    if len(test) % CHUNK != 0: CTS += 1

    for j in range(CTS):

        a = j * CHUNK
        b = (j+1) * CHUNK
        b = min(b, len(test))
        print('chunk', a, 'to', b)

        distances, indices = model.kneighbors(text_embeddings[a:b,])

        for k in range(b-a):
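            # Keep the neighbours whose distance is below the threshold,
            # then look up their row indices in the test frame.
            IDX = cupy.where(distances[k, ] < 2.0)[0]
            IDS = indices[k, IDX]
            o = test.iloc[cupy.asnumpy(IDS)].posting_id.values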
            preds.append(o)

            # IDX = np.where(distances[k, ] < 6.0)[0]
            # IDS = indices[k, IDX]
            # o = test.iloc[IDS].posting_id.values
            # preds.append(o)

            # TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.
            # https://stackoverflow.com/questions/65008297/attempting-numpy-conversion-when-not-needed-in-cupy

    del distances, indices, model, text_embeddings
    _ = gc.collect()

    test['preds_txt2'] = preds
    test.head()

## Use Phash Feature

We will predict all items with the same phash as duplicates:<br>
CV Score = 0.7191099726819047 <br>
CV Score = 0.7190606455051343 (w/o Phash)

In [None]:
tmp = test.groupby('image_phash').posting_id.agg('unique').to_dict()
test['preds3'] = test.image_phash.map(tmp)
test.head()

# Compute CV Score

In [None]:
def combine_for_sub(row):
    x = np.concatenate([row.preds, row.preds2, row.preds3])
    return ' '.join( np.unique(x) )

def combine_for_cv(row):
    x = np.concatenate([row.preds, row.preds2, row.preds3])
    return np.unique(x)

def combine_for_cv_txt2(row):
    x = np.concatenate([row.preds, row.preds2, row.preds3, row.preds_txt2])
    return np.unique(x)

In [None]:
if COMPUTE_CV:        
    tmp = test.groupby('label_group').posting_id.agg('unique').to_dict()
    test['target'] = test.label_group.map(tmp)
    
    if Text_KNN_Follow_Up: 
        test['oof'] = test.apply(combine_for_cv_txt2, axis=1)
    else: 
        test['oof'] = test.apply(combine_for_cv, axis=1)
        
    test['f1'] = test.apply(getMetric('oof'),axis=1)
    print('CV Score =', test.f1.mean())
    print(f'COSINE_SIMILARITY = {COSINE_SIMILARITY}, Text_KNN_Follow_Up = {Text_KNN_Follow_Up}')

test['matches'] = test.apply(combine_for_sub,axis=1)

## Baseline CV Score After Embedding

CV Score = 0.6393196101848189 (preds2)<br>
CV Score = 0.6137154152579091 (preds)<br>
CV Score = 0.5530933399167943 (preds3)

# Write Submission CSV

In this notebook, the submission file below looks odd because it contains train information. But when we submit this notebook, the `test.csv` dataframe will be longer than 3 rows and the variable `COMPUTE_CV` will subsequently be set to `False`. Then our submission notebook will compute the correct matches using the real test dataset and our submission csv for LB will be ok.

In [None]:
test[['posting_id','matches']].to_csv('submission.csv',index=False)
sub = pd.read_csv('submission.csv')
sub.head()

# Appendix

## Ignore Ground Truth (RAPIDS Only)

We will now ignore the ground truth and try to find similar items in train data using only the title's text. First we will extract text embeddings using **RAPIDS cuML's TfidfVectorizer**. This will turn every title into a one-hot-encoding of the words present. We will then compare one-hot-encodings with **RAPIDS cuML KNN** to find titles that are similar.

In [None]:
import cuml, cupy
print('RAPIDS', cuml.__version__)

In [None]:
train = pd.read_csv(join(default_dir, 'train.csv'))

# LOAD TRAIN ONTO THE GPU WITH CUDF
train_gf = cudf.read_csv(join(default_dir, 'train.csv'))
print('train_gf shape is', train_gf.shape, '\ntrain shape is', train.shape)
train_gf.head()

## Similar Titles (TfidfVectorizer)

**TfidfVectorizer** returns a cupy sparse matrix. Afterward we convert to a cupy dense matrix and feed that into **RAPIDS cuML KNN**.
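
To see what such an embedding looks like, here is a pure-Python sketch of binary TF-IDF on three made-up titles. It assumes the scikit-learn conventions (smoothed IDF and L2 row normalisation), which cuML is assumed to mirror here, and it uses a simplified whitespace tokenizer. `binary_tfidf` is an illustrative name.

```python
import math

def binary_tfidf(titles):
    """Return (vocab, rows): one L2-normalised, IDF-weighted row per title."""
    docs = [set(t.lower().split()) for t in titles]  # binary: presence only
    vocab = sorted(set().union(*docs))
    n = len(docs)
    # Smoothed inverse document frequency: rarer words get larger weights.
    idf = {w: math.log((1 + n) / (1 + sum(w in d for d in docs))) + 1
           for w in vocab}
    rows = []
    for d in docs:
        row = [idf[w] if w in d else 0.0 for w in vocab]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm for v in row])
    return vocab, rows

vocab, rows = binary_tfidf(['red shoes', 'red bag', 'blue bag'])
print(vocab)
print([round(v, 3) for v in rows[0]])
```

Under these assumptions each row has unit length, so the dot product of two rows is their cosine similarity, and a rare word such as "shoes" weighs more than a common one such as "red".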

In [None]:
model = TfidfVectorizer(stop_words='english', binary=True)
text_embeddings = model.fit_transform(train_gf.title).toarray()
print('text embeddings shape is',text_embeddings.shape)

## Similar Titles (KNN)

In [None]:
KNN = 50
model = NearestNeighbors(n_neighbors=KNN)
model.fit(text_embeddings)
distances, indices = model.kneighbors(text_embeddings)

In [None]:
for k in range(5):
    plt.figure(figsize=(20,3))
    plt.plot(np.arange(50),cupy.asnumpy(distances[k,]),'o-')
    plt.title('Text Distance From Train Row %i to Other Train Rows'%k,size=16)
    plt.ylabel('Distance to Train Row %i'%k,size=14)
    plt.xlabel('Index Sorted by Distance to Train Row %i'%k,size=14)
    plt.show()
    
    print( train_gf.loc[cupy.asnumpy(indices[k,:10]),['title','label_group']] )

## Similar Images (EfficientNetB0)

Again, we will now ignore the ground truth and try to find similar items in train data using only the item's image. First we will extract image embeddings using **EffNetB0**. We will then compare image embeddings with **RAPIDS cuML KNN** to find images that are similar.

In [None]:
# model = EfficientNetB0(weights='../input/effnetb0/efficientnetb0_notop.h5', include_top=False, pooling='avg', input_shape=None)

# train_gen = DataGenerator(train, batch_size=32, path=BASE)
# image_embeddings = model.predict(train_gen, verbose=1)

image_embeddings = np.loadtxt('../input/shopee-price-match-guarantee-embeddings/image_embeddings.csv',
                             delimiter=',')
print('image embeddings shape is',image_embeddings.shape)

## Similar Images (KNN)

In [None]:
KNN = 50
model = NearestNeighbors(n_neighbors=KNN)
model.fit(image_embeddings)
distances, indices = model.kneighbors(image_embeddings)

In [None]:
for k in range(180,190):
    plt.figure(figsize=(20,3))
    plt.plot(np.arange(50),cupy.asnumpy(distances[k,]),'o-')
    plt.title('Image Distance From Train Row %i to Other Train Rows'%k,size=16)
    plt.ylabel('Distance to Train Row %i'%k,size=14)
    plt.xlabel('Index Sorted by Distance to Train Row %i'%k,size=14)
    plt.show()
    
    cluster = train.loc[cupy.asnumpy(indices[k,:8])] 
    displayDF(cluster, random=False, ROWS=2, COLS=4)

# Next: Hyperparameters
What is the importance of hyperparameter tuning?<br>
Hyperparameters are crucial as they control the overall behaviour of a machine learning model. The ultimate goal is to find a combination of hyperparameters that gives better results. To further improve your score, see [Eff-B4 + TFIDF w/ CV for threshold_searching](https://www.kaggle.com/chienhsianghung/eff-b4-tfidf-w-cv-for-threshold-searching).

Further discussion: [The Best Hyperparameters](https://www.kaggle.com/c/tabular-playground-series-apr-2021/discussion/231152)

# References

* [RAPIDS cuML TfidfVectorizer and KNN](https://www.kaggle.com/cdeotte/rapids-cuml-tfidfvectorizer-and-knn)
* [[PART 2] - RAPIDS TfidfVectorizer - [CV 0.700]](https://www.kaggle.com/cdeotte/part-2-rapids-tfidfvectorizer-cv-0-700/data#Use-Image-Embeddings)
* [Multiple ways to download output file generated in KAGGLE Kernel !](https://www.kaggle.com/getting-started/168312)
* [How to Save a NumPy Array to File for Machine Learning](https://machinelearningmastery.com/how-to-save-a-numpy-array-to-file-for-machine-learning/)