Combining the predictions from [LinearSVC](https://www.kaggle.com/szelee/aoeul-solution-step-1-linearsvc) and the [Deep Learning Model](https://www.kaggle.com/szelee/aoeul-solution-step-2-deep-learning-model) gives a decent score boost over either model alone:

- (0.45500 + 0.46533) -> 0.46814 (Public LB)
- (0.45353 + 0.46393) -> 0.46673 (Private LB)

My actual final submissions scored a few points higher thanks to further hyperparameter tuning of the LinearSVC and DL models, plus some extra feature engineering on the test data: matching unknown words to the closest word in the train data and, where a test title exactly matches a train title, using that train title's label as the prediction.
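
The two test-time tricks mentioned above are not shown in this notebook. As a rough illustration only, here is a minimal sketch of how they might look, assuming string similarity from the standard library's `difflib` as the notion of "closest word". The toy titles, labels, the `cutoff` value, and the helper names `closest_train_word`, `normalise_title`, and `exact_title_label` are all hypothetical and are not taken from the actual solution.

```python
import difflib

# Hypothetical toy data; the real train titles and labels are not part of this notebook.
train_titles = {"maybelline matte lipstick red": 3, "nivea soft cream": 1}
train_vocab = sorted({word for title in train_titles for word in title.split()})


def closest_train_word(word, vocab, cutoff=0.8):
    """Return the most similar train word, or the word unchanged if none is close enough."""
    matches = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def normalise_title(title, vocab):
    """Replace words unseen in train with their closest train word."""
    known = set(vocab)
    return " ".join(
        word if word in known else closest_train_word(word, vocab)
        for word in title.split()
    )


def exact_title_label(title, title_to_label):
    """Return the train label if the title appears verbatim in train, else None."""
    return title_to_label.get(title)


fixed = normalise_title("maybeline matte lipstik red", train_vocab)
label = exact_title_label(fixed, train_titles)
```

In this toy case the misspelled words are mapped back to train vocabulary, after which the title matches a train title exactly and its label can be reused. Whether the original solution applied the two steps in this order is an assumption.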

In [1]:
from pathlib import Path
import json
import re
import sys
import warnings
import pickle
import pandas as pd
import numpy as np
from tqdm import tqdm_notebook

In [2]:
dl_df = pd.read_csv('Deep_Learning_predictions.csv')
svc_df = pd.read_csv('LinearSVC_predictions.csv')

In [3]:
# Deep learning model output: two space-separated predictions per test id
dl_df.head(10)

Unnamed: 0,id,tagging
0,370855998_Benefits,6 4
1,370855998_Brand,208 255
2,370855998_Colour_group,22 9
3,370855998_Product_texture,8 6
4,370855998_Skin_type,7 6
5,637234604_Benefits,6 3
6,637234604_Brand,282 360
7,637234604_Colour_group,24 9
8,637234604_Product_texture,8 6
9,637234604_Skin_type,1 0


In [4]:
# LinearSVC model output: one prediction per test id
svc_df.head(10)

Unnamed: 0,id,tagging
0,370855998_Benefits,6
1,637234604_Benefits,6
2,690282890_Benefits,1
3,930913462_Benefits,6
4,1039280071_Benefits,6
5,1327710392_Benefits,6
6,1328802799_Benefits,1
7,1330468145_Benefits,3
8,1677309730_Benefits,1
9,1683142205_Benefits,6


In [5]:
# sanity check: both files should have the same number of rows
len(dl_df), len(svc_df)

(977987, 977987)
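
Equal row counts do not by themselves guarantee that both files cover the same ids. A slightly stricter check, sketched here on toy frames with made-up values (the names `dl_toy` and `svc_toy` are placeholders for `dl_df` and `svc_df`), compares the id sets and lets pandas enforce a one-to-one join through the `validate` argument of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for dl_df and svc_df (hypothetical values, same column names).
dl_toy = pd.DataFrame({"id": ["1_Brand", "1_Benefits"], "tagging": ["208 255", "6 4"]})
svc_toy = pd.DataFrame({"id": ["1_Benefits", "1_Brand"], "tagging": [6, 208]})

# Same length is not enough: also check that both files contain exactly the same ids.
same_ids = set(dl_toy["id"]) == set(svc_toy["id"])
no_dupes = dl_toy["id"].is_unique and svc_toy["id"].is_unique

# validate="one_to_one" makes pandas raise MergeError if either side repeats an id.
merged = pd.merge(dl_toy, svc_toy, on="id", how="inner", validate="one_to_one")
```

With these checks in place, a row count for the merged frame that equals the input row counts confirms that no id was dropped or duplicated by the join.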

In [6]:
# merge both dataframes on id
new_df = pd.merge(dl_df, svc_df, on='id')
new_df.head()

Unnamed: 0,id,tagging_x,tagging_y
0,370855998_Benefits,6 4,6
1,370855998_Brand,208 255,208
2,370855998_Colour_group,22 9,22
3,370855998_Product_texture,8 6,8
4,370855998_Skin_type,7 6,1


In [7]:
# Iterate through the merged results.
# If DL's first prediction equals the LinearSVC prediction, keep DL's original pair.
# Otherwise, LinearSVC's prediction becomes the first and DL's first becomes the second.

new_tag = []
i = 0  # number of rows where DL's first prediction matches LinearSVC
for _, row in tqdm_notebook(new_df.iterrows()):
    x1, x2 = row.tagging_x.split(' ')
    x1 = int(x1)
    x2 = int(x2)
    y = row.tagging_y

    if x1 == y:
        new_tag.append(row.tagging_x)
        i = i + 1
    else:
        new_tag.append(str(y) + ' ' + str(x1))

In [8]:
# percentage of rows where DL's first prediction matches LinearSVC
i/len(new_df)*100 

85.23323929663687
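
The row-by-row loop over close to a million rows can be slow. The same rule (keep DL's pair when its first prediction agrees with LinearSVC, otherwise put LinearSVC first and DL's first prediction second) can also be expressed with vectorised pandas operations. This is a sketch on a toy frame with made-up values, not the code used for the submission; `toy`, `dl_first`, `agree`, and `fallback` are illustrative names.

```python
import pandas as pd

# Toy frame with the same columns as the merged new_df (values are made up).
toy = pd.DataFrame({
    "id": ["1_Benefits", "1_Skin_type"],
    "tagging_x": ["6 4", "7 6"],  # DL: two predictions, first one preferred
    "tagging_y": [6, 1],          # LinearSVC: one prediction
})

# DL's first prediction as an integer, compared against LinearSVC.
dl_first = toy["tagging_x"].str.split(" ").str[0].astype(int)
agree = dl_first == toy["tagging_y"]

# Where the models disagree: LinearSVC first, DL's first prediction second.
fallback = toy["tagging_y"].astype(str) + " " + dl_first.astype(str)

# Series.where keeps the original value where the condition is True.
toy["tagging"] = toy["tagging_x"].where(agree, fallback)

# Share of rows where both models agree on the top prediction, in percent.
match_pct = agree.mean() * 100
```

On the toy data the first row keeps `6 4` and the second becomes `1 7`, which mirrors what the loop produces for the `370855998_Benefits` and `370855998_Skin_type` rows shown earlier.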

In [9]:
# save to file
new_df['tagging'] = np.array(new_tag)
new_df = new_df[['id', 'tagging']]
new_df.to_csv('Ensembled_SVC_DL_predictions.csv', index=False)