# Overview
Input: training data with image URLs and labels

Goal: predict the labels of each test-set image

# Key steps of my strategy
1.	Rank the Google labels of the training images by frequency. Filter out labels that are not indicative or that correlate only weakly with wish.com labels.
2.	For each Google label, extract all training images containing that label and their wish.com labels. Aggregate those wish.com labels and rank them by frequency, then filter out labels whose frequency falls below a threshold. My hypothesis is that the frequent wish.com labels are correlated with the Google label, so Google labels can serve as an indicator of wish.com labels.
3.	Detect the Google labels of each test image, and assign the corresponding wish.com labels to it.
4.	Cluster the training images, and extract the frequent labels of each cluster. My hypothesis is that images in the same cluster share some labels.
5.	Use Google labels to detect which cluster each test image belongs to. Add the common labels of that cluster to the test image.
6.	Remove duplicate wish.com labels from each test image.

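The matching idea in steps 2 and 3 can be sketched on toy data. This is only an illustration: the images, the label IDs, the helper names `build_matching_table` and `predict`, and the `min_share` threshold are all hypothetical and are not taken from the notebook below.

```python
from collections import Counter

# Hypothetical toy training data: (Google labels, wish.com label IDs) per image
train_images = [
    ({"dress", "pattern"}, ["62", "17"]),
    ({"dress", "sleeve"}, ["62", "66"]),
    ({"shoe", "leather"}, ["105", "19"]),
    ({"dress", "pattern"}, ["62", "17"]),
]

def build_matching_table(images, min_share=0.3):
    """Map each Google label to the wish.com labels that co-occur with it often enough."""
    table = {}
    google_labels = set().union(*(g for g, _ in images))
    for g_label in google_labels:
        # Count wish.com labels over all images that carry this Google label
        co_occurring = Counter(
            w for g, wish in images if g_label in g for w in wish
        )
        total = sum(co_occurring.values())
        # Keep only labels whose share of the co-occurrences reaches the threshold
        table[g_label] = sorted(
            w for w, c in co_occurring.items() if c / total >= min_share
        )
    return table

def predict(google_labels, table):
    """Concatenate the matched wish.com labels of every detected Google label."""
    labels = []
    for g_label in sorted(google_labels):
        labels.extend(table.get(g_label, []))
    return labels

matching_table = build_matching_table(train_images)
# Duplicates are expected here; they are removed in the last step
prediction = predict({"dress", "pattern"}, matching_table)
print(matching_table["dress"], prediction)
```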

# Full strategy with details:
1.	Use the Google Vision API to detect the labels of 10,000 images in the training set. Aggregate all labels and rank them by frequency. Filter out the 10 most frequent labels, as they are not indicative in this setting. Also filter out labels whose frequency is below a threshold. I call the table generated in this step the “ranking table”.

2.	For each label in the “ranking table”, identify the training images containing that label and collect their wish.com labels. Aggregate those wish.com labels and rank them by frequency, then filter out wish.com labels whose frequency is below a threshold. My hypothesis is that the frequent wish.com labels are correlated with the Google label, so Google labels can serve as an indicator of wish.com labels. This step produces a table of the matching relationships, which I call the “matching table”.

3.	Use the Google Vision API to detect the labels of each image in the test set. For each Google label of the image, append the matching wish.com labels from the matching table. This step produces a table named “prediction table”. Its “predicted labels” column is very long and contains many duplicate labels, which I remove later.

4.	Use k-means clustering to assign the 10,000 training-set images to 60 clusters. For each cluster, select the training-set images belonging to it and aggregate their wish.com labels. Rank those labels by frequency and select the 5 most frequent as the representative labels of that cluster. This step produces a table of the representative labels of each cluster, which I call the “clusters table”.

5.	For each test-set image, identify which cluster it belongs to. If the predicted labels of an image contain three or more representative labels of a cluster, I treat the image as belonging to that cluster and assign the cluster's other representative labels to that image.

6.	Remove duplicate labels from each test-set image, as I found that duplicate labels affect the F-score of the prediction.

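The cluster rule in step 5 can be sketched as follows, using the "three or more of five representative labels" rule described above. The clusters table, the label IDs, and the helper name `add_cluster_labels` are hypothetical placeholders, not the notebook's actual data or code.

```python
# Hypothetical "clusters table": cluster ID -> five representative wish.com labels
clusters_table = {
    0: ["62", "17", "66", "19", "105"],
    1: ["20", "44", "53", "78", "131"],
}

def add_cluster_labels(predicted, clusters, min_hits=3):
    """Append a cluster's remaining representative labels when enough are already predicted."""
    result = list(predicted)
    predicted_set = set(predicted)
    for representative in clusters.values():
        # Count representative labels that already appear in the prediction
        hits = sum(1 for label in representative if label in predicted_set)
        if hits >= min_hits:
            # An image may match several clusters, so keep looping after a match
            result.extend(l for l in representative if l not in predicted_set)
    return result

extended = add_cluster_labels(["62", "17", "66", "20"], clusters_table)
print(extended)
```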
# Flow chart
![flow chart](https://pbs.twimg.com/media/DdCgJNZWkAA8eIw.jpg)

# About Google Vision API

The Google Vision API analyzes an image and returns its labels. See my GitHub repo for an example: https://github.com/DisenWang/Google_vision_api_example

Analyzing all the images with the Vision API took a long time, so to save you time I have uploaded the results as a dataset.

In [None]:
# Pandas for managing datasets
import pandas as pd
import numpy as np

# Matplotlib and Seaborn for plotting and styling
import matplotlib.pyplot as plt
import seaborn as sns

# Plotly for interactive charts
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
from plotly import tools

import datetime
import re
from collections import Counter
from PIL import Image

# Step 1. Rank Google labels
Use the Google Vision API to detect the labels of 10,000 images in the training set. Aggregate all labels and rank them by frequency. Filter out the 10 most frequent labels, as they are not indicative in this setting.

In [None]:
import os
print(os.listdir("../input"))
df = pd.read_csv('../input/train-labels/train_labels.csv', index_col=0)

In [None]:
# remove [ and ] from label lists
df['labels'] = df['labels'].str[1:]
df['labels'] = df['labels'].str[:-1]
df['labels'] = df['labels'] + ','

Here I count the labels and extract the 200 most frequent ones.

In [None]:
df_count = Counter(" ".join(df["labels"]).split(',')).most_common(200)
all_label_count = pd.DataFrame(df_count)
all_label_count.columns = ['label','count']
all_label_count['percentage'] = all_label_count['count']/len(df.index)

In [None]:
print (all_label_count.head())

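As an alternative to slicing off the brackets by hand, the ranking step can be sketched with `ast.literal_eval`. This assumes each cell is a stringified Python list, which is a guess about the CSV format. The rows, the helper name `rank_labels`, and the `drop_top` / `min_count` values are hypothetical.

```python
import ast
from collections import Counter

# Hypothetical rows, assuming each cell is a stringified Python list
raw_rows = [
    "['clothing', 'dress', 'pattern']",
    "['clothing', 'dress']",
    "['clothing', 'shoe']",
    "['clothing', 'dress', 'sleeve']",
]

def rank_labels(rows, drop_top=1, min_count=2):
    """Rank labels by frequency, drop the most common ones, and apply a count threshold."""
    counts = Counter(label for row in rows for label in ast.literal_eval(row))
    ranked = counts.most_common()
    # Skip the `drop_top` most frequent labels, then keep those seen often enough
    return [(label, n) for label, n in ranked[drop_top:] if n >= min_count]

ranking_table = rank_labels(raw_rows)
print(ranking_table)
```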
Here I read in the training set, test set and validation set

In [None]:

import time
script_start_time = time.time()

import pandas as pd
import numpy as np
import json
import gc

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns


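# plotly.plotly moved to the separate chart_studio package in plotly 4;
# it is not used below, so the import is dropped here.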
import cufflinks as cf
cf.set_config_file(offline=True, world_readable=True, theme='ggplot')
plt.rcParams["figure.figsize"] = 12,8
sns.set(rc={'figure.figsize':(20,12)})
plt.style.use('fivethirtyeight')

pd.set_option('display.max_rows', 600)
pd.set_option('display.max_columns', 50)
import warnings
warnings.filterwarnings('ignore')

# Data path
data_path = '.'

# 1. Load data =================================================================
print('%0.2f min: Start loading data'%((time.time() - script_start_time)/60))
train={}
test={}
validation={}
with open('../input/imaterialist-challenge-fashion-2018/train.json',encoding='utf-8') as json_data:
    train= json.load(json_data)
with open('../input/imaterialist-challenge-fashion-2018/test.json',encoding='utf-8') as json_data:
    test= json.load(json_data)
with open('../input/imaterialist-challenge-fashion-2018/validation.json',encoding='utf-8') as json_data:
    validation = json.load(json_data)

print('Train No. of images: %d'%(len(train['images'])))
print('Test No. of images: %d'%(len(test['images'])))
print('Validation No. of images: %d'%(len(validation['images'])))

# JSON TO PANDAS DATAFRAME
# train data
train_img_url=train['images']
train_img_url=pd.DataFrame(train_img_url)
train_ann=train['annotations']
train_ann=pd.DataFrame(train_ann)
train=pd.merge(train_img_url, train_ann, on='imageId', how='inner')

# test data
test=pd.DataFrame(test['images'])

# Validation Data
val_img_url=validation['images']
val_img_url=pd.DataFrame(val_img_url)
val_ann=validation['annotations']
val_ann=pd.DataFrame(val_ann)
validation=pd.merge(val_img_url, val_ann, on='imageId', how='inner')

del (train_img_url, train_ann, val_img_url, val_ann)
gc.collect()

print('%0.2f min: Finish loading data'%((time.time() - script_start_time)/60))
print('='*50)




The index of the training data starts at 0, so I add 1 to every index value for the merge operations later.

In [None]:
train.index += 1

# Step 2. Find Correlated Labels
Here I define a function named "match_labels" that finds the wish.com labels matching each Google label.

In [None]:
def match_labels(google_label):
    """Return the wish.com label IDs that make up at least 5% of the
    label occurrences on training images tagged with `google_label`."""
    # regex=False treats the Google label as a literal substring, not a pattern
    matched_images = df[df['labels'].str.contains(google_label, regex=False)]
    merged = pd.merge(matched_images, train, left_index=True, right_index=True)

    # Flatten the per-image labelId lists into one list of label IDs
    wish_labels = [str(label_id) for ids in merged['labelId'] for label_id in ids]

    # Count each label ID directly and keep the ones above the 5% threshold
    counts = Counter(wish_labels)
    threshold = 0.05 * len(wish_labels)
    return [label_id for label_id, cnt in counts.items() if cnt >= threshold]

I drop the 10 most frequent Google labels because they are too common to be indicative in this case.

In [None]:
# Copy the slice so that adding columns later does not modify a view of all_label_count
all_label_count2 = all_label_count.iloc[10:].copy()

In [None]:
print (all_label_count2.head())

In [None]:
all_label_count2['matched']=''

Here, for each Google label, I find the wish.com labels that correlate with it.

In [None]:
for index, row in all_label_count2.iterrows():
    # DataFrame.set_value was removed in pandas 1.0; .at is the replacement
    all_label_count2.at[index, 'matched'] = match_labels(row['label'])

In [None]:
print (all_label_count2.head())

In [None]:
# remove [ and ] from label lists
all_label_count2['matched'] = all_label_count2['matched'].astype(str)
all_label_count2['matched'] = all_label_count2['matched'].str[1:]
all_label_count2['matched'] = all_label_count2['matched'].str[:-1]
all_label_count2['matched'] = all_label_count2['matched'] + ','

In [None]:
print (all_label_count2.head())

In [None]:
all_label_count2['matched'] = all_label_count2['matched'].astype(str)
all_label_count2['matched'] = all_label_count2['matched'].map(lambda x: ''.join([i for i in x if i.isdigit() or i.isspace()]))

In [None]:
all_label_count2['matched'] = all_label_count2['matched'] + ' '

In [None]:
print (all_label_count2['matched'].head())

# Step 3. Detect test images
The "labeled_test" dataset contains the labels detected by the Google API for each image in the test set.

In [None]:
test = pd.read_csv('../input/labeled-test/labeled_test.csv', index_col=0,encoding = "ISO-8859-1")

In [None]:
test['labels'] = test['labels'].str[1:]
test['labels'] = test['labels'].str[:-1]

In [None]:
print (test.head())

In [None]:
test['prediction'] = ''

This step takes some time because of the nested loop. For each Google label of each image, I append the corresponding wish.com labels extracted above.

In [None]:
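for index, trow in test.iterrows():
    prediction = trow['prediction']
    for _, arow in all_label_count2.iterrows():
        if arow['label'] in trow['labels']:
            prediction += arow['matched']
    # Write back explicitly: rows yielded by iterrows() may be copies
    test.at[index, 'prediction'] = prediction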

# Step 4. Clustering

Here I use k-means to cluster the training images by their wish.com labels. K-means takes the number of clusters as an input parameter. I used the Elbow Method (https://en.wikipedia.org/wiki/Elbow_method_(clustering)) and set the number of clusters to 60.

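The Elbow Method can be illustrated on a small scale: run k-means for several values of k, record the within-cluster sum of squares (inertia), and look for the k after which the improvement flattens out. The sketch below is a hand-rolled 1-D k-means on made-up points with a deterministic initialisation. It is not the scikit-learn pipeline used in this notebook, and the helper name `kmeans_inertia` is hypothetical.

```python
def kmeans_inertia(points, k, n_iter=20):
    """Run a basic 1-D k-means and return the within-cluster sum of squared distances."""
    ordered = sorted(points)
    n = len(ordered)
    # Deterministic initialisation: spread the starting centres over the sorted data
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in ordered:
            nearest = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            groups[nearest].append(p)
        # Move each centre to the mean of its group; keep it if the group is empty
        centers = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
    return sum(min((p - c) ** 2 for c in centers) for p in ordered)

# Made-up data with three obvious groups
points = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1, 8.9, 9.0, 9.1]
inertias = {k: kmeans_inertia(points, k) for k in range(1, 5)}
# Improvement gained by going from k-1 to k clusters
drops = {k: inertias[k - 1] - inertias[k] for k in range(2, 5)}

for k in sorted(inertias):
    print(k, round(inertias[k], 3))
# The drop becomes negligible after k = 3, which is the "elbow" for this data
```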
In [None]:
df = train.head(10000).drop(columns=['url'])
# print (df['labelId'])
df['labelId'] = df['labelId'].astype(str)

# Note that the result of this block takes a while to show
from sklearn.feature_extraction.text import TfidfVectorizer

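# Define vectorizer parameters.
# Note: the default token_pattern only keeps tokens of two or more characters,
# so single-digit label IDs are ignored unless token_pattern=r'\d+' is passed.
tfidf_vectorizer = TfidfVectorizer(max_features=200000,
                                   stop_words='english',
                                   use_idf=True)

%time tfidf_matrix = tfidf_vectorizer.fit_transform(df['labelId'])  # fit the vectorizer to the label-ID strings


print(tfidf_matrix.shape)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = tfidf_vectorizer.get_feature_names_out()
len(terms)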

In [None]:
from sklearn.cluster import KMeans

num_clusters = 60

km = KMeans(n_clusters=num_clusters)

%time km.fit(tfidf_matrix)

clusters = km.labels_.tolist()



In [None]:
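# train.index was shifted by 1 earlier; reset it to 0-based so rows line up with the cluster list
df = train.head(10000).drop(columns=['url']).reset_index(drop=True)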

df_cluster = pd.DataFrame(clusters)
df_cluster.columns = ["cluster"]
print (df_cluster.head())

In [None]:
df = pd.merge(df, df_cluster, left_index = True, right_index = True)

df.index += 1

train_labels = pd.read_csv('../input/train-labels/train_labels.csv', index_col=0)

print (train_labels.head())

In [None]:
df = pd.merge(df, train_labels, left_index = True, right_index = True)

In [None]:
df['labels'] = df['labels'].str[1:]
df['labels'] = df['labels'].str[:-1]
df['labels'] = df['labels'] + ','
print (df.head())

In [None]:
# Concatenate the Google label strings of every image in each cluster
x = df.groupby('cluster')['labels'].apply(lambda s: s.sum())

# groupby(...).apply returns a Series, so convert it to a DataFrame before naming columns
x = x.to_frame()

x.columns = ["labels"]
x['frequent_labels']=""
x['labels'] = x['labels'].astype(str).replace("''", "")
print (x.head())

In [None]:
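from collections import Counter
for index, row in x.iterrows():
    # Five most frequent Google labels in this cluster
    df_count = Counter(row['labels'].split(',')).most_common(5)
    # Write back with .at: rows yielded by iterrows() may be copies
    x.at[index, 'frequent_labels'] = [label for label, _ in df_count]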

In [None]:
y = df.groupby('cluster')['labelId'].apply(lambda x: x.sum())

In [None]:
y = y.to_frame()
y.columns = ["wish_labels"]
y['frequent_wish_labels']=""

print (y.head())

In [None]:
for index, row in y.iterrows():
    # Ten most frequent wish.com labels in this cluster
    df_count = Counter(row['wish_labels']).most_common(10)
    # Write back with .at: rows yielded by iterrows() may be copies
    y.at[index, 'frequent_wish_labels'] = [label for label, _ in df_count]

In [None]:
cluster_train = pd.concat([x,y],axis=1)
cluster_train= cluster_train[['frequent_labels','frequent_wish_labels']]

In [None]:
# Copy so that cluster_train keeps the label lists while cluster_add holds the string form
cluster_add = cluster_train.copy()

In [None]:
cluster_add['frequent_wish_labels'] = cluster_add['frequent_wish_labels'].astype(str)
cluster_add['frequent_wish_labels'] = cluster_add['frequent_wish_labels'].map(lambda x: ''.join([i for i in x if i.isdigit() or i.isspace()]))

In [None]:
test['cluster'] = ""

# Step 5. Add cluster labels
Here I add the cluster labels to the previous predictions. To decide which cluster an image belongs to, I use the frequent labels of each cluster: if an image's predicted labels contain at least 5 of a cluster's frequent labels, I treat the image as belonging to that cluster and add the cluster's labels to the image. Note that an image is not restricted to a single cluster, so it may belong to several.

In [None]:
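for index, row in test.iterrows():
    predicted = set(str(row['prediction']).split())
    combined = str(row['prediction'])
    for cdex, crow in cluster_add.iterrows():
        # frequent_wish_labels is a space-separated string of label IDs at this point
        cluster_labels = crow['frequent_wish_labels'].split()
        # Count the hits per cluster, matching whole label IDs rather than substrings
        hits = sum(1 for n in cluster_labels if n in predicted)
        if hits >= 5:
            combined += " " + crow['frequent_wish_labels']
    # Always keep the matched predictions, even when no cluster qualifies
    # (DataFrame.set_value was removed in pandas 1.0; .at is the replacement)
    test.at[index, 'cluster'] = combined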

# Step 6. Remove duplicate labels
The previous steps combine labels from matching and clustering, so some labels are duplicated. Kaggle's evaluation appears to penalize duplicate labels, so I remove them and make sure each label appears only once per image.

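A `set` removes duplicates but does not preserve the order of the labels. If order matters, one possible alternative is `dict.fromkeys`, sketched below on a made-up label string. The helper name `dedupe_labels` is hypothetical, and this is not the approach used in the cells that follow.

```python
def dedupe_labels(label_string):
    """Remove repeated labels from a space-separated string, keeping first-seen order."""
    # dict keys are unique and keep insertion order, so duplicates collapse in place
    return " ".join(dict.fromkeys(label_string.split()))

deduped = dedupe_labels("62 17 62 66 17 ")
print(deduped)
```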
In [None]:
# DataFrame.set_value was removed in pandas 1.0; map each label string to a set instead
test['cluster'] = test['cluster'].map(lambda labels: set(labels.split()))

In [None]:
test['cluster'] = test['cluster'].astype(str)
test['cluster'] = test['cluster'].str[1:]
test['cluster'] = test['cluster'].str[:-1]
test['cluster'] = test['cluster'] + ','
test['cluster'] = test['cluster'].map(lambda x: ''.join([i for i in x if i.isdigit() or i.isspace()]))

In [None]:
print (test['cluster'].head())