# [Shopee NDSC 2019](https://careers.shopee.sg/ndsc/) Submission
### by [@siowyisheng](https://github.com/siowyisheng) and [@kronosfere](https://github.com/kronosfere)

***
### Background

Shopee is an online marketplace where hundreds of thousands of new products are uploaded every day. Putting products into correct subcategories is a important task to help shoppers find the products they want.

### Task

Create machine learning models for mobile phone items 📱, beauty items 💄 and fashion items 👗, which output the right subcategory, given the title of the product and the image url.

### Our Plan

Neither of us were formally trained in or worked in data science, and neither of us had done anything with image classification before, so that seemed challenging. On the other hand, the text titles seemed to hold plenty of information by themselves. 

We decided to try to create two models, one using the titles and one using the images. Each model would output their predicted category together with their confidence (in probability). We would then do a weighted average of the results, and output the category with the final highest confidence.
***

# The Text Model 
#### (by [@siowyisheng](https://github.com/siowyisheng))

First, let's do the standard pandas 🐼🐼 and numpy 🔢🍕 imports with the standard aliases.

In [3]:
import pandas as pd
import numpy as np

Next, load the data! 📃➡️💻

In [4]:
df = pd.read_csv('train.csv')

We check out how much data we're dealing with. 📃📃📃 666615 is the number of rows, 4 is the number of columns.

In [14]:
df.shape

(666615, 4)

We check out what it looks like. 🌸

In [5]:
df.head()

Unnamed: 0,itemid,title,Category,image_path
0,307504,nyx sex bomb pallete natural palette,0,beauty_image/6b2e9cbb279ac95703348368aa65da09.jpg
1,461203,etude house precious mineral any cushion pearl...,1,beauty_image/20450222d857c9571ba8fa23bdedc8c9.jpg
2,3592295,milani rose powder blush,2,beauty_image/6a5962bed605a3dd6604ca3a4278a4f9.jpg
3,4460167,etude house baby sweet sugar powder,3,beauty_image/56987ae186e8a8e71fcc5a261ca485da.jpg
4,5853995,bedak revlon color stay aqua mineral make up,3,beauty_image/9c6968066ebab57588c2f757a240d8b9.jpg


In [11]:
df.tail()

Unnamed: 0,itemid,title,Category,image_path
666610,1510771637,beli 2 gratis 1 xiaomi mi mix black 6 64 rom g...,34,mobile_image/70e0d8ddd69692b0f134498efbddf4e1.jpg
666611,1515822742,android i phone x real 4g 16gb free wireless c...,35,mobile_image/d58393fe029ba62160d2a5d1fa6638a1.jpg
666612,1516747666,xiaomi mia1 ram 4gb 64gb black,34,mobile_image/bfacb3c9af2f6a597008e57fb2d34609.jpg
666613,1517270941,khusus hari ini samsung j2 prime,32,mobile_image/42d74ab8212a24720d42e84c649ab488.jpg
666614,1518889125,oppo a83 2 gb new garansi resmi 1 tahun,41,mobile_image/0b10f33a67ccb4ee3e1240d44c2ee0ef.jpg


We notice that the `image_path` contains the main category of the item, which wasn't mentioned in the briefing 👓. Woe to those who fail to separate the items by this main category 💣.

We can separate them into different [dataframes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) 📗📙📘. An alternative would be to create a new `main_category` column in the same dataframe.

In [16]:
df_mobile = df.loc[df['image_path'].str.startswith('mobile_image')]
df_beauty = df.loc[df['image_path'].str.startswith('beauty_image')]
df_fashion = df.loc[df['image_path'].str.startswith('fashion_image')]

So then I read through
https://dataweave.com/blog/implementing-a-machine-learning-based-ecommerce-product-classification-system-f846d894148b
which does exactly what we want, but only gives leads instead of a step by step.

So then I read through
https://machinelearningmastery.com/gentle-introduction-bag-words-model/
and tried to follow it.

## This is the size of the vocabulary. (Step 2)

In [73]:
len(words_mobile)

26023

## Managing vocabulary (still based on the article)

```
a bag-of-bigrams representation is much more powerful than bag-of-words, and in many cases proves very hard to beat.

— Page 75, Neural Network Methods in Natural Language Processing, 2017.
```

So let's build bigrams!

## Hashing trick
https://machinelearningmastery.com/gentle-introduction-bag-words-model/

also describes reducing the size of the vector using 'feature hashing'(aka the 'hashing trick')

https://en.wikipedia.org/wiki/Feature_hashing

In [None]:
import time  # to time how long training takes
from sklearn.feature_extraction.text import CountVectorizer  # the way to turn text into features
from sklearn import svm  # the classifier we're using
from joblib import dump, load  # used to save/load the model to/from disk

## CountVectorizer

Chanced across this

https://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage

while reading about the hashing trick documentation in sklearn. Seems like what we need. Let's try.

In [6]:
vectorizer = CountVectorizer(ngram_range=(1, 2))
corpus = df_mobile.title
X = vectorizer.fit_transform(corpus)

In [8]:
y = df_mobile.Category
clf = svm.SVC(gamma='scale')
clf.fit(X, y)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

# ALRIGHT FOLKS WE HAVE A MODEL

In [11]:
df_mobile.head()

Unnamed: 0,itemid,title,Category,image_path
506285,2346660,apple iphone 4s back glass spare part original...,31,mobile_image/a9c8f0fdd6587deed197634066cf7eee.jpg
506286,2816338,iphone 4s 64gb white,31,mobile_image/3b9a11608551b11b9330268e0d055e01.jpg
506287,2847602,samsung sm b310e piton dual sim,32,mobile_image/1d719e936841a83c165da620f927de68.jpg
506288,3116949,samsung caramel gt e1272 dual sim 32 mb putih,32,mobile_image/1d35a74d90df6cf4a02e6a5df9e9ff29.jpg
506289,3794648,garskin sony experia z z1 z2 ultra,33,mobile_image/5556577b09539a9c0db0d00e0f171e2d.jpg


In [16]:
test_X = vectorizer.transform(['iphone 4s 64gb white'])

In [20]:
clf.predict(test_X)

array([31])

In [22]:
clf.predict_proba(test_X)

AttributeError: predict_proba is not available when  probability=False

## Need to set probability to True

In [23]:
from sklearn.model_selection import train_test_split

In [28]:
mobile_train, mobile_test = train_test_split(df_mobile)

In [30]:
mobile_train.shape

(120247, 4)

In [31]:
mobile_test.shape

(40083, 4)

In [51]:
vectorizer_mobile = CountVectorizer(ngram_range=(1, 2))
corpus = mobile_train.title
X = vectorizer_mobile.fit_transform(corpus)

In [33]:
y = mobile_train.Category
clf_mobile = svm.SVC(gamma='scale', probability=True)
clf_mobile.fit(X, y)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [37]:
dump(clf_mobile, 'clf_mobile.joblib')

['clf_mobile.joblib']

In [40]:
dump(clf, 'clf_mobile_no_proba.joblib')

['clf_mobile_no_proba.joblib']

In [41]:
from sklearn.metrics import accuracy_score

In [57]:
X = vectorizer_mobile.transform(mobile_test.title)

In [59]:
predictions = clf_mobile.predict(X)

In [60]:
accuracy_score(mobile_test.Category, predictions)

0.8160317341516353

In [42]:
clf_mobile.predict(mobile_test.title)

ValueError: could not convert string to float: 'samsung j2 pro 2018'

In [106]:
df['title_new'] = df.title + 'testtest'

# FOR TESTING ACCURACY ONLY

In [61]:
df = pd.read_csv('train.csv')
df_mobile = df.tail(160330)
df_beauty = df.head(286583)
df_fashion = df.iloc[286583:666615 - 160330]
mobile_train, mobile_test = train_test_split(df_mobile)
vectorizer_mobile = CountVectorizer(ngram_range=(1, 2))
mobile_corpus = mobile_train.title
X = vectorizer_mobile.fit_transform(mobile_corpus)
y = mobile_train.Category
clf_mobile = svm.SVC(gamma='scale', probability=True)
start = time.time()
clf_mobile.fit(X, y)
end = time.time()
print(end - start)
dump(clf_mobile, 'clf_mobile.joblib')

24131.771150112152


['clf_mobile.joblib']

In [None]:
df = pd.read_csv('train.csv')
df_mobile = df.tail(160330)
df_beauty = df.head(286583)
df_fashion = df.iloc[286583:666615 - 160330]
beauty_train, beauty_test = train_test_split(df_beauty)
vectorizer_beauty = CountVectorizer(ngram_range=(1, 2))
beauty_corpus = beauty_train.title
X = vectorizer.fit_transform(beauty_corpus)
y = beauty_train.Category
clf_beauty = svm.SVC(gamma='scale', probability=True)
start = time.time()
clf_beauty.fit(X, y)
end = time.time()
print(end - start)
dump(clf_beauty, 'clf_beauty.joblib')

In [None]:
df = pd.read_csv('train.csv')
df_mobile = df.tail(160330)
df_beauty = df.head(286583)
df_fashion = df.iloc[286583:666615 - 160330]
fashion_train, fashion_test = train_test_split(df_fashion)
vectorizer_fashion = CountVectorizer(ngram_range=(1, 2))
fashion_corpus = fashion_train.title
X = vectorizer.fit_transform(fashion_corpus)
y = fashion_train.Category
clf_fashion = svm.SVC(gamma='scale', probability=True)
start = time.time()
clf_fashion.fit(X, y)
end = time.time()
print(end - start)
dump(clf_fashion, 'clf_fashion.joblib')

# FOR ACTUAL FINAL TRAINING

In [133]:
df = pd.read_csv('train.csv')
df_mobile = df.tail(160330)
df_beauty = df.head(286583)
df_fashion = df.iloc[286583:666615 - 160330]
vectorizer_mobile = CountVectorizer(ngram_range=(1, 2))
mobile_corpus = df_mobile.title
X = vectorizer_mobile.fit_transform(mobile_corpus)
y = df_mobile.Category
clf_mobile = svm.SVC(gamma='scale', probability=True)
start = time.time()
clf_mobile.fit(X, y)
end = time.time()
print(end - start)
dump(clf_mobile, 'clf_mobile_final.joblib')

46662.84632396698


['clf_mobile_final.joblib']

In [137]:
df = pd.read_csv('train.csv')
df_mobile = df.tail(160330)
df_beauty = df.head(286583)
df_fashion = df.iloc[286583:666615 - 160330]
vectorizer_beauty = CountVectorizer(ngram_range=(1, 2))
beauty_corpus = df_beauty.title
X = vectorizer_beauty.fit_transform(beauty_corpus)
y = df_beauty.Category
clf_beauty = svm.SVC(gamma='scale', probability=True)
start = time.time()
clf_beauty.fit(X, y)
end = time.time()
print(end - start)
dump(clf_beauty, 'clf_beauty_final.joblib')

117469.02375912666


['clf_beauty_final.joblib']

In [None]:
df = pd.read_csv('train.csv')
df_mobile = df.tail(160330)
df_beauty = df.head(286583)
df_fashion = df.iloc[286583:666615 - 160330]
vectorizer_fashion = CountVectorizer(ngram_range=(1, 2))
fashion_corpus = df_fashion.title
X = vectorizer_fashion.fit_transform(fashion_corpus)
y = df_fashion.Category
clf_fashion = svm.SVC(gamma='scale', probability=True)
start = time.time()
clf_fashion.fit(X, y)
end = time.time()
print(end - start)
dump(clf_fashion, 'clf_fashion_final.joblib')

# GENERATING FINAL PREDICTIONS

In [None]:
clf_mobile = load('clf_mobile.joblib')
clf_beauty = load('clf_beauty.joblib')
clf_fashion = load('clf_fashion.joblib')

In [3]:
df_test = pd.read_csv('test.csv')
df_test.image_path.str.startswith('beauty_image').value_counts()

False    95857
True     76545
Name: image_path, dtype: int64

In [4]:
df_test.image_path.str.startswith('mobile_image').value_counts()

False    131985
True      40417
Name: image_path, dtype: int64

In [10]:
df_test_mobile = df_test.tail(40417)
df_test_beauty = df_test.head(76545)
df_test_fashion = df_test.iloc[76545:172402 - 40417]

In [11]:
vectorizer1 = CountVectorizer(ngram_range=(1, 2))
vectorizer_mobile = vectorizer1.fit(df_mobile.title)

In [12]:
vectorizer2 = CountVectorizer(ngram_range=(1, 2))
vectorizer_beauty = vectorizer2.fit(df_beauty.title)

In [13]:
vectorizer3 = CountVectorizer(ngram_range=(1, 2))
vectorizer_fashion = vectorizer3.fit(df_fashion.title)

In [14]:
X_mobile = vectorizer_mobile.transform(df_test_mobile.title)
X_beauty = vectorizer_beauty.transform(df_test_beauty.title)
X_fashion = vectorizer_fashion.transform(df_test_fashion.title)

In [20]:
X_mobile

<40417x177261 sparse matrix of type '<class 'numpy.int64'>'
	with 583652 stored elements in Compressed Sparse Row format>

In [21]:
X_beauty

<76545x245259 sparse matrix of type '<class 'numpy.int64'>'
	with 1043676 stored elements in Compressed Sparse Row format>

In [22]:
X_fashion

<55440x277523 sparse matrix of type '<class 'numpy.int64'>'
	with 1102176 stored elements in Compressed Sparse Row format>

In [23]:
clf_mobile.shape_fit_

(120247, 150390)

In [46]:
clf_beauty.shape_fit_

(286583, 245259)

In [25]:
clf_fashion.shape_fit_

(164776, 231924)

In [None]:
predictions_mobile_title = clf_mobile.predict_proba(X_mobile)

In [None]:
predictions_beauty_title = clf_beauty.predict_proba(X_beauty)

In [None]:
predictions_fashion_title = clf_fashion.predict_proba(X_fashion)

In [18]:
predictions_mobile_image = np.load('mobile_image_pred.npy')
predictions_beauty_image = np.load('beauty_image_pred.npy')
predictions_fashion_image = np.load('fashion_image_pred.npy')

In [19]:
combined_predictions_mobile = predictions_mobile_title * 0.7 + predictions_mobile_image * 0.3
combined_predictions_beauty = predictions_beauty_title * 0.7 + predictions_beauty_image * 0.3
combined_predictions_fashion = predictions_fashion_title * 0.7 + predictions_fashion_image * 0.3

In [None]:
final_predictions_proba = combined_predictions_mobile + combined_predictions_beauty + combined_predictions_fashion

In [None]:
final_predictions = [ls.index(max(ls)) + 1 for ls in predictions]

## final submission has two columns, `itemid` and `category`

In [None]:
df_test = pd.read_csv('test')
df_final = df_test.drop(['title', 'image_path'], axis=1)
df_final['Category'] = np.Series(final_predictions)

In [45]:
df_final.to_csv('submission.csv', index=False)