After some exploratory analysis of the train, test and validation data, I found some problems with the label distribution that may play an important role in our model building.

# UPDATE: More interesting findings from probing the leaderboard score.
The previous version (the code below) found that labels are distributed differently in the train and validation datasets: label frequencies differ drastically in the train data, but every label has the same frequency in the validation data. It was therefore recommended to combine train and validation, on the assumption that the test dataset has a distribution similar to the train dataset, i.e. since label 42 is the most frequent in the train set, the most common label in the test set should also be 42.
However, if you submit two files, one with every prediction set to label 42 and one with every prediction set to label 1, you get similar scores: 0.99322 and 0.99348 respectively. Reading 1 - score as the share of test images that carry the submitted label, a simple calculation, (1-0.99322) * 128 = 0.86784 and (1-0.99348) * 128 = 0.83456, shows that each of the two labels covers close to 1/128 of the test set. So the test set seems to have a label distribution similar to the validation data, where every label has the same frequency.
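
The arithmetic above can be reproduced in a few lines. This is only a sketch: it assumes the leaderboard score is an error rate, so that 1 - score is the share of test images whose true label equals the constant label that was submitted. The variable names are illustrative.

```python
# Leaderboard scores quoted above for the two constant submissions.
scores = {"all label 42": 0.99322, "all label 1": 0.99348}
n_labels = 128  # number of distinct labels in the data

# Assumption: the score is an error rate, so 1 - score is the share of
# test images whose true label equals the submitted constant label.
uniform_share = 1 / n_labels
ratios = {}
for name, score in scores.items():
    share = 1 - score
    # A ratio of 1.0 would mean the label has exactly the uniform share.
    ratios[name] = share * n_labels
    print(f"{name}: share={share:.5f}, uniform={uniform_share:.5f}, ratio={ratios[name]:.5f}")
```

Both ratios come out slightly below 1, which is what a roughly uniform test set would give; a label that dominated the test set would give a ratio far above 1.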
All in all, if you want a validation set, a better choice might be to simulate the distribution of the test set, i.e. use the same number of images for each label.
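
One possible way to build such a balanced validation set is sketched below. It is not the method used in the rest of this kernel: the helper `balanced_sample`, its parameters and the toy frame are hypothetical, and only the column name `label_id` is taken from the data.

```python
import pandas as pd

def balanced_sample(df, label_col="label_id", per_label=50, seed=0):
    """Return at most `per_label` randomly chosen rows for every label."""
    shuffled = df.sample(frac=1, random_state=seed)      # shuffle all rows
    return shuffled.groupby(label_col).head(per_label)   # first k rows of each label

# Hypothetical toy frame: label 1 is ten times as frequent as label 2.
toy = pd.DataFrame({
    "url": [f"u{i}" for i in range(110)],
    "label_id": [1] * 100 + [2] * 10,
})
balanced = balanced_sample(toy, per_label=10)
print(balanced["label_id"].value_counts())
```

When applying this to a frame built with `pd.concat`, resetting the index first with `reset_index(drop=True)` keeps the index labels unique, so the sampled rows can be removed from the training part with `drop(balanced.index)`.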
The rest of this kernel shows how I found the distribution of labels and a simple EDA of the data.

In [None]:
raw_data_path = "../input"
import time
script_start_time = time.time()

import pandas as pd
import numpy as np
import gc
import json

pd.set_option('display.max_rows', 600)
pd.set_option('display.max_columns', 50)
import warnings
warnings.filterwarnings('ignore')

## 1. Load data (convert json to csv)

In [None]:
print('%0.2f min: Start loading data'%((time.time() - script_start_time)/60))

train={}
test={}
validation={}
with open('%s/train.json'%(raw_data_path)) as json_data:
    train= json.load(json_data)
with open('%s/test.json'%(raw_data_path)) as json_data:
    test= json.load(json_data)
with open('%s/validation.json'%(raw_data_path)) as json_data:
    validation = json.load(json_data)

print('Train No. of images: %d'%(len(train['images'])))
print('Test No. of images: %d'%(len(test['images'])))
print('Validation No. of images: %d'%(len(validation['images'])))

# JSON TO PANDAS DATAFRAME
# train data
train_img_url=train['images']
train_img_url=pd.DataFrame(train_img_url)
train_img_url['url'] = train_img_url['url'].apply(lambda r: r[0])
train_ann=train['annotations']
train_ann=pd.DataFrame(train_ann)
train_img_url.head()
train=pd.merge(train_img_url, train_ann, on='image_id', how='inner')

# test data
test=pd.DataFrame(test['images'])
test['url'] = test['url'].apply(lambda r: r[0])


# Validation Data
val_img_url=validation['images']
val_img_url=pd.DataFrame(val_img_url)
val_img_url['url'] = val_img_url['url'].apply(lambda r: r[0])
val_ann=validation['annotations']
val_ann=pd.DataFrame(val_ann)
validation=pd.merge(val_img_url, val_ann, on='image_id', how='inner')

print('%0.2f min: Finish loading data'%((time.time() - script_start_time)/60))

## 2. Check data (size, NA, duplicates...)

The checks below are a useful first step of data analysis. They give us a lot of information:
- There are 128 labels.
- There are no duplicated rows within each dataset.
- However, there are 7 duplicated URLs across the datasets (which will be investigated later).

In [None]:
# Findings: There are duplicated url
datas = {'train': train, 'test': test, 'validation': validation}
total_url = []
dataset_url = {}
for data_name, data in datas.items():
    print('%s shape: %s'%(data_name, str(data.shape)))
    print('Unique:')
    print(data.nunique()) # Unique values
    print('NA:')
    print(data.isnull().sum()) # No missing values
    print(data.describe())
    total_url = total_url + data['url'].tolist()
    dataset_url[data_name] = data['url'].tolist()
    print('-'*50)

print('Total images: %d'%(len(total_url)))
print('Total unique images: %d'%(len(set(total_url))))
print('Duplicated url: %d'%(len(total_url) - len(set(total_url))))

In [None]:
# #Save as csv -----------------------------------------------------------------
# for data_name, data in datas.items():
#     data.to_csv('%s/%s.csv'%(processed_data_dir, data_name), index = False)
# print('%0.2f min: Finish saving raw data as csv'%((time.time() - script_start_time)/60))

## 3. Visualization for further exploration

In [None]:
# 3. Exploratory Data Analysis =================================================
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import plotly.plotly as py
import cufflinks as cf
cf.set_config_file(offline=True, world_readable=True, theme='ggplot')
plt.rcParams["figure.figsize"] = 12,8
sns.set(rc={'figure.figsize':(12,8)})
plt.style.use('fivethirtyeight')

pd.set_option('display.max_rows', 600)
pd.set_option('display.max_columns', 50)
import warnings
warnings.filterwarnings('ignore')

### 3.1 Let's find the duplicated URLs first

In [None]:
# 3.1 Try to find the duplicated url -----------------------------------------------
# Findings: Labels are different
from itertools import product
combinations = list(product(*[datas.keys(), datas.keys()]))
for comb in combinations:
    print('%s inter %s: %d | %d'%(comb[0], comb[1], len(set(dataset_url[comb[0]])), len(set(dataset_url[comb[0]]).intersection(set(dataset_url[comb[1]])))))

As shown by the intersections, the duplicated URLs are shared between validation and train. However, despite having the same URL, they do not always have the same label.

In [None]:
# Confirm the duplicated url
duplicated = train[['url']].merge(validation[['url']], how = 'inner')
duplicated = duplicated.merge(validation[['url', 'label_id']], on = 'url',how = 'left').rename(columns = {'label_id': 'label_id_val'})
duplicated = duplicated.merge(train[['url', 'label_id']], on = 'url',how = 'left').rename(columns = {'label_id': 'label_id_train'})
print(duplicated)
duplicated_url = duplicated['url']

Let's display the images for the duplicated URLs.

In [None]:
# Display images with duplicated url
from IPython.display import HTML, display

def display_image(urls):
    # Render every URL in the Series as a small thumbnail in one HTML block
    img_style = "width: 180px; margin: 0px; float: left; border: 1px solid black;"
    # Iterating over a Series yields its values (Series.iteritems was removed in pandas 2.0)
    images_list = ''.join([f"<img style='{img_style}' src='{u}' />" for u in urls])
    display(HTML(images_list))
display_image(duplicated_url)

Well, I guess the labels represent a table or a laptop.

### 3.2 Frequency of labels

This section shows how differently the labels are distributed in train and validation:
- Every label appears 50 times in validation.
- Most labels appear 1000-2000 times in train. Some appear about 4000 times.

In [None]:
sns.distplot(train['label_id'])
sns.distplot(validation['label_id'])

In [None]:
# rename_axis + reset_index(name=...) gives the same two columns on old and new pandas
# (pandas 2.0 changed the column names produced by value_counts().reset_index())
train_Label_count = train['label_id'].value_counts().rename_axis('label_id').reset_index(name='label_id_count_train')
validation_Label_count = validation['label_id'].value_counts().rename_axis('label_id').reset_index(name='label_id_count_val')
label_count = train_Label_count.merge(validation_Label_count, on = 'label_id', how = 'right').fillna(0)
label_count['label_id_freq_train'] = label_count['label_id_count_train'] / train.shape[0]
label_count['label_id_freq_val'] = label_count['label_id_count_val'] / validation.shape[0]
print(label_count.describe())

In [None]:
sns.distplot(label_count['label_id_freq_train'])

In [None]:
sns.distplot(label_count['label_id_freq_val'])

### 3.3 Distribution of labels
The plots of label_id against row index show that labels are grouped together in train, while they are randomly ordered in validation.

In [None]:
plt.plot(train['label_id'], '.')

In [None]:
plt.plot(validation['label_id'], '.')

### 3.4 Images of the most and least frequent labels

Label 20 is the most frequent. I guess it is 'bottle'.

In [None]:
print(label_count.sort_values(['label_id_count_train']).tail())
url_label20 = train[train['label_id']==20]['url'][:10]
display_image(url_label20)

Label 83 is the least frequent. I guess it is 'table'.

In [None]:
print(label_count.sort_values(['label_id_count_train']).head())
url_label83 = train[train['label_id']==83]['url'][:10]
display_image(url_label83)

In [None]:
print('%0.2f min: Finish running script'%((time.time() - script_start_time)/60))

# Note for model building: 
- The provided validation set is not an ideal validation set.
- We need to combine the data and build our own validation set.
- There are 7 duplicated URLs in train and validation, some of which have different labels.

## 4. Construct Validation data after combining

Do not forget about the duplicated URLs and labels when combining train and validation.

In [None]:
all = pd.concat([train, validation])
print(all[all[['url', 'label_id']].duplicated()])

Double-check the duplicates

In [None]:
# subset expects column labels, not a DataFrame
all = all.drop_duplicates(subset=['url', 'label_id'])
print(all[all[['url', 'label_id']].duplicated()])
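
Dropping duplicates on `url` and `label_id` together only removes rows that agree on both columns, so a URL that appears with two different labels stays in the data twice. A minimal sketch of how such conflicts could be found and set aside is shown below. The toy frame and the names `combined`, `conflicting` and `cleaned` are hypothetical, and discarding the conflicting URLs is an assumption, not something this kernel does.

```python
import pandas as pd

# Hypothetical toy data: "b" has two conflicting labels, "c" is an exact duplicate.
combined = pd.DataFrame({
    "url": ["a", "b", "b", "c", "c"],
    "label_id": [1, 2, 3, 4, 4],
})

# Number of distinct labels attached to each URL, aligned to the rows.
labels_per_url = combined.groupby("url")["label_id"].transform("nunique")

# Rows whose URL carries more than one label.
conflicting = combined[labels_per_url > 1]
# Keep only unambiguous URLs, then drop exact (url, label_id) repeats.
cleaned = combined[labels_per_url == 1].drop_duplicates(subset=["url", "label_id"])
print(conflicting)
print(cleaned)
```

Whether to drop the conflicting URLs or to pick one of the labels by looking at the image is a judgment call; with only 7 duplicated URLs either choice should have little effect.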

For a more representative validation data, let's split the combined data based on the label distribution with stratification.

In [None]:
X = all[['url']]
y = all[['label_id']]
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify = y, test_size = 0.3)

Plot the distributions to check our work

In [None]:
sns.distplot(y_train['label_id'])
sns.distplot(y_val['label_id'])
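
Besides the plot, the split can be checked numerically by comparing the share of each label in the two parts. The helper `max_share_gap` below is a hypothetical sketch, shown on toy labels rather than on the real split.

```python
import pandas as pd

def max_share_gap(labels_a, labels_b):
    """Largest absolute difference in per-label share between two label series."""
    share_a = labels_a.value_counts(normalize=True)
    share_b = labels_b.value_counts(normalize=True)
    # fill_value=0 covers labels that are missing from one of the two series
    return share_a.sub(share_b, fill_value=0).abs().max()

# Hypothetical toy labels: both splits are 75% label 1 and 25% label 2.
toy_train = pd.Series([1] * 6 + [2] * 2)
toy_val = pd.Series([1] * 3 + [2] * 1)
gap = max_share_gap(toy_train, toy_val)
print(gap)
```

On the real split this would be called as `max_share_gap(y_train['label_id'], y_val['label_id'])`; a value close to 0 suggests the stratification worked.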

Save the newly generated train and validation data

In [1]:
# train = pd.concat([X_train, y_train], axis = 1)
# validation = pd.concat([X_val, y_val], axis = 1)
# datas = {'train': train, 'validation': validation}
# for data_name, data in datas.items():
#     data[['url', 'label_id']].to_csv('%s/%s.csv'%(processed_data_dir, data_name), index = False)