<a href="https://colab.research.google.com/github/rahiakela/kaggle-competition-projects/blob/master/shopee-product-matching-competition/1_shopee_eda_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Shopee: EDA+RAPIDS preprocessing

<center><h1>Introduction 📝</h1></center>

> 🎯Goal: To build a model that predicts which items are the same products
> 
> As a shopaholic🛍️, I admit that getting the best deal on a product is a very rewarding experience. Scanning multiple shopping websites for the perfect deal and keeping an eye on upcoming sales is one manual way to go about it.
> 
> Retail companies often offer recommendations that promote their products in such a way that customers are swayed to pick a similar product that is priced lower. Product matching 📋📋 is one of these strategies, wherein a company offers products at prices that are competitive with the same product sold by another retailer.
> 
> These matches can be performed automatically with the help of machine learning, and that is the goal of this competition. We have been provided with data from **Shopee**, which is the leading e-commerce platform in Southeast Asia and Taiwan.

<div>
    <img src="https://i.imgur.com/mqPVRT5.png">
</div>



## <center><h1>Diving into the Data 🤿 </h1></center>

> **train/test.csv** - Each row contains the data for a single posting. 
> 
> ℹ️Multiple postings might have the exact same image ID, but with different titles or vice versa.
> 
> - posting_id : the ID code for the posting
> - image : the image id/md5sum
> - image_phash : a perceptual hash of the image
> - title : the product description for the posting
> - label_group : ID code for all postings that map to the same product. Not provided for the test set
> - matches : **space-delimited** list of all posting IDs that match a particular posting.
> 
> 📌Posts always self-match. 
> 
> 📌**Group sizes were capped at 50**, so we need not predict more than 50 matches for a posting.
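
As a minimal sketch of how the `matches` string relates to `label_group`, the snippet below builds a space-delimited match list on a toy frame and parses it back. The frame `toy_df` and its values are made up for illustration; only the column names come from the description above.

```python
import pandas as pd

# Toy frame that mimics the train.csv columns described above (values are made up).
toy_df = pd.DataFrame({
    "posting_id": ["train_1", "train_2", "train_3", "train_4"],
    "label_group": [10, 10, 20, 10],
})

# Collect all posting IDs that share a label_group, then map the list back to each row.
group_to_ids = toy_df.groupby("label_group")["posting_id"].agg(list)
toy_df["matches"] = toy_df["label_group"].map(group_to_ids).apply(" ".join)

# Parsing goes the other way: split on whitespace to recover the list of IDs.
parsed = toy_df["matches"].str.split()

print(toy_df[["posting_id", "matches"]])
```

Because every posting belongs to its own group, each row's `matches` automatically contains its own `posting_id`, which is the self-match property noted above.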

## <h1><center>Evaluation metric: <b>F1-score 🧪</b> </center></h1>

> The evaluation metric for this competition is F1-Score or F-Score.
> 
> <img src="https://www.gstatic.com/education/formulas2/355397047/en/f1_score.svg">
> 
>  F1 is the harmonic mean of precision and recall, so it is high only when both are high.
>  <img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/d37e557b5bfc8de22afa8aad1c187a357ac81bdb">
>  <img src="https://miro.medium.com/max/560/1*AEV3TE67ahMn3NVpU0ov4g.png" height=10>
>  
>  where:
>  - TP = True Positive
>  - FP = False Positive
>  - TN = True Negative
>  - FN = False Negative
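
To make the formula concrete, here is a hedged sketch that scores one posting by comparing its true and predicted match lists as sets. The helper `row_f1` is a hypothetical name, not part of any library, and the assumption that the score is computed per posting and then averaged should be checked against the competition's evaluation page. Note that TN does not appear in the F1 formula.

```python
def row_f1(true_matches: str, pred_matches: str) -> float:
    """F1 for one posting, given space-delimited ID strings (hypothetical helper)."""
    true_ids = set(true_matches.split())
    pred_ids = set(pred_matches.split())

    tp = len(true_ids & pred_ids)   # IDs predicted and actually matching
    fp = len(pred_ids - true_ids)   # IDs predicted but not matching
    fn = len(true_ids - pred_ids)   # matching IDs that were missed

    if tp == 0:
        return 0.0                  # avoids division by zero when nothing overlaps

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Two of three predictions are right and one true match is missed.
score = row_f1("a b c", "a b d")
print(round(score, 4))
```

Averaging `row_f1` over all rows of a frame would give a single validation score under the per-posting assumption stated above.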

## <center><h1>Setup 📚</h1></center>

In [None]:
%%shell

pip install colorama

In [None]:
# Colab magic command that selects TensorFlow 2.x (kept on its own line so the comment is not parsed as an argument)
%tensorflow_version 2.x

import os
import numpy as np 
import pandas as pd 
import cv2
import matplotlib.pyplot as plt
# import cuml, cudf, cupy
import nltk
import tensorflow as tf

from nltk.corpus import stopwords
# from cuml.feature_extraction.text import CountVectorizer
# from cuml.neighbors import NearestNeighbors
from colorama import Fore, Back, Style
from wordcloud import WordCloud, STOPWORDS
from tensorflow.keras.applications import ResNet101

y_ = Fore.YELLOW
r_ = Fore.RED
g_ = Fore.GREEN
b_ = Fore.BLUE
m_ = Fore.MAGENTA

Let's download the dataset from [Kaggle](https://www.kaggle.com/c/shopee-product-matching/data).

In [2]:
from google.colab import files
files.upload() # upload kaggle.json file

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"<redacted>","key":"<redacted>"}'}

In [None]:
%%shell

# store the Kaggle API token where the CLI expects it and restrict its permissions
mkdir -p ~/.kaggle
mv kaggle.json ~/.kaggle/
ls ~/.kaggle
chmod 600 ~/.kaggle/kaggle.json

# download dataset from kaggle
kaggle competitions download -c shopee-product-matching

unzip -qq train.csv.zip
mkdir shopee-product-matching
mv *.jpg *.csv shopee-product-matching

In [10]:
train_df = pd.read_csv("shopee-product-matching/train.csv")
test_df = pd.read_csv("shopee-product-matching/test.csv")

train_df.head()

| | posting_id | image | image_phash | title | label_group |
|---|---|---|---|---|---|
| 0 | train_129225211 | 0000a68812bc7e98c42888dfb1c07da0.jpg | 94974f937d4c2433 | Paper Bag Victoria Secret | 249114794 |
| 1 | train_3386243561 | 00039780dfc94d01db8676fe789ecd05.jpg | af3f9460c2838f0f | Double Tape 3M VHB 12 mm x 4,5 m ORIGINAL / DO... | 2937985045 |
| 2 | train_2288590299 | 000a190fdd715a2a36faed16e2c65df7.jpg | b94cb00ed3e50f78 | Maling TTS Canned Pork Luncheon Meat 397 gr | 2395904891 |
| 3 | train_2406599165 | 00117e4fc239b1b641ff08340b429633.jpg | 8514fc58eafea283 | Daster Batik Lengan pendek - Motif Acak / Camp... | 4093212188 |
| 4 | train_3369186413 | 00136d1cf4edede0203f32f05f660588.jpg | a6f319f924ad708c | Nescafe \xc3\x89clair Latte 220ml | 3648931069 |


In [8]:
test_df.head()

| | posting_id | image | image_phash | title |
|---|---|---|---|---|
| 0 | test_2255846744 | 0006c8e5462ae52167402bac1c2e916e.jpg | ecc292392dc7687a | Edufuntoys - CHARACTER PHONE ada lampu dan mus... |
| 1 | test_3588702337 | 0007585c4d0f932859339129f709bfdc.jpg | e9968f60d2699e2c | (Beli 1 Free Spatula) Masker Komedo \| Blackhea... |
| 2 | test_4015706929 | 0008377d3662e83ef44e1881af38b879.jpg | ba81c17e3581cabe | READY Lemonilo Mie instant sehat kuah dan goreng |


## <center><h1>Getting image paths from the directory 🛣️</h1></center>