### Good Classes Finder
Agenda
1. Find good classes, i.e. classes with no error photos
2. Sample a few good classes and migrate their data to a separate folder

Issues:
1. [07/01/2019] Some images can't be opened, e.g. /home/r8user2/Documents/HY/dress_data/datasets/imgtrain/民族风连衣裙/Co6q4VrxcFSAaNmBAAJGVl-KaNU099.jpg
> Cause: Some images are truncated; see https://stackoverflow.com/questions/12984426/python-pil-ioerror-image-file-truncated-with-big-images

2. [07/01/2019] Some images have only 2 dimensions when converted to a NumPy array, e.g. /home/r8user2/Documents/HY/dress_data/datasets/imgtrain/民族风连衣裙/Co6q4VryEu6AEjzWAAFI4Sjx41c689.jpg
> Cause: 
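
One possible cause, stated here as an assumption because the entry above leaves it blank: Pillow produces a 2-D array for single-channel images (for example grayscale, mode `L`), while RGB images produce three dimensions. The sketch below uses small in-memory images rather than the dataset files, and shows that `convert('RGB')` brings a grayscale image to the 3-D shape.

```python
import numpy as np
from PIL import Image

# In-memory stand-ins for dataset files (width=4, height=3); not real data.
gray_img = Image.new('L', (4, 3))     # single-channel grayscale
rgb_img = Image.new('RGB', (4, 3))    # three-channel colour

gray_shape = np.asarray(gray_img).shape
rgb_shape = np.asarray(rgb_img).shape
# Converting to RGB adds the channel axis
converted_shape = np.asarray(gray_img.convert('RGB')).shape

print(gray_shape, rgb_shape, converted_shape)
```

Checking `img.mode` on the example file would confirm or rule out this explanation.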

References:
1. Multiprocessing for nested loops: https://stackoverflow.com/questions/40450661/parallelize-these-nested-for-loops-in-python
2. Multiprocessing `Pool.map` with multiple arguments: https://stackoverflow.com/questions/5442910/python-multiprocessing-pool-map-for-multiple-arguments

In [1]:
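from collections import defaultdict
import os

import matplotlib.pyplot as plt
import numpy as np  # used by jpg_query below
from PIL import Image
from PIL import ImageFile

%matplotlib inline
# Let Pillow load truncated image files instead of raising an error
ImageFile.LOAD_TRUNCATED_IMAGES = True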

In [2]:
SOURCE_PATH = '/home/r8user2/Documents/HY/dress_data/datasets/'
PARTITION_MAP = {1: 'imgtrain', 2: 'imgval', 3: 'imgtest'}

In [3]:
CLASS_MAP = {}
for id_, class_type in enumerate(os.listdir(SOURCE_PATH + 'imgtrain')):
    CLASS_MAP[id_] = class_type

#### Distribution of Partition of All Classes

In [4]:
dist_dict = defaultdict(dict)
for partition_idx, folder in PARTITION_MAP.items():
    for class_idx, class_ in CLASS_MAP.items():
        target_path = SOURCE_PATH + folder +'/' + class_
        cnt = len([item for item in os.listdir(target_path) if 'jpg' in item])
        dist_dict[partition_idx][class_idx] = cnt
        print('[%s/ %s]: %d' % (folder, class_, cnt))

[imgtrain/ 蝴蝶结长袖连衣裙]: 2978
[imgtrain/ 包臀长袖连衣裙]: 2400
[imgtrain/ 假两件连衣裙]: 2403
[imgtrain/ 修身包臀连衣裙]: 2400
[imgtrain/ 女童背心裙]: 2398
[imgtrain/ 刺绣连衣裙]: 2395
[imgtrain/ 修身打底连衣裙]: 2399
[imgtrain/ 礼服长裙]: 2400
[imgtrain/ 包臀鱼尾裙]: 2402
[imgtrain/ 提花连衣裙]: 2359
[imgtrain/ 民族风旗袍]: 2408
[imgtrain/ 字母连衣裙]: 2401
[imgtrain/ 大码宽松连衣裙]: 2403
[imgtrain/ 桑蚕丝裙子]: 2400
[imgtrain/ 高腰连衣裙]: 2401
[imgtrain/ 圆点连衣裙]: 2400
[imgtrain/ 公主蓬蓬裙]: 2400
[imgtrain/ 假两件裙子]: 2399
[imgtrain/ 蕾丝连衣裙]: 2400
[imgtrain/ 勾花连衣裙]: 2400
[imgtrain/ 大码短袖连衣裙]: 2399
[imgtrain/ 松紧腰连衣裙]: 2396
[imgtrain/ 半身裙套装]: 2400
[imgtrain/ 连衣裙两件套]: 2401
[imgtrain/ 复古旗袍]: 2392
[imgtrain/ 网纱长裙]: 2402
[imgtrain/ 潮流连衣裙]: 2400
[imgtrain/ 包臀针织连衣裙]: 2397
[imgtrain/ 连帽连衣裙]: 2400
[imgtrain/ 品牌连衣裙]: 2400
[imgtrain/ 无袖长裙]: 2400
[imgtrain/ 长袖针织衫]: 2404
[imgtrain/ 牛仔连衣裙]: 2402
[imgtrain/ 两件套装裙]: 2401
[imgtrain/ 修身打底裙]: 2392
[imgtrain/ 女童礼服]: 2398
[imgtrain/ 女童长袖连衣裙]: 2400
[imgtrain/ 套头连衣裙]: 2406
[imgtrain/ 简约连衣裙]: 2398
[imgtrain/ 蕾丝打底连衣裙]: 2401
[imgtrain/ 立领连衣裙]: 1046

#### Distribution of Errors for each Class

In [5]:
# Given a class index, partition index and file index, optionally display the
# image and return the shape of its array.
# [partition_idx] train: 1, val: 2, test: 3
def jpg_query(class_idx, partition_idx, file_idx, vis=True, print_path=False):
    # Set up target path
    class_type = CLASS_MAP[class_idx]
    partition_type = PARTITION_MAP[partition_idx]
    sub_target_path = SOURCE_PATH + partition_type + '/' + class_type
    target_filename = os.listdir(sub_target_path)[file_idx]
    target_path = sub_target_path + '/' + target_filename
    if print_path:
        print('Path: %s' % target_path)

    img = Image.open(target_path)
    # np.uint8: unsigned 8-bit integers, 2**8 = 256 possible values
    shape = np.asarray(img, dtype=np.uint8).shape
    if vis:
        # Display image
        plt.imshow(img)
        print(shape)
    return shape
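
`jpg_query` indexes into `os.listdir`, whose order is arbitrary, and the listing can include non-JPEG entries even though the counts elsewhere only consider names containing `jpg`. A minimal sketch of a helper that returns a sorted, suffix-filtered listing so that a `file_idx` always refers to the same JPEG; `list_jpgs` is a hypothetical name, not part of the notebook, and the demo runs on a throwaway directory.

```python
import os
import tempfile

def list_jpgs(directory):
    """Return the .jpg file names in `directory`, sorted for a stable order."""
    return sorted(
        name for name in os.listdir(directory)
        if name.lower().endswith('.jpg')
    )

# Demonstrate on a temporary directory with made-up file names
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ['b.jpg', 'a.JPG', 'jpg_notes.txt', 'c.png']:
        open(os.path.join(tmp_dir, name), 'w').close()
    jpg_names = list_jpgs(tmp_dir)

print(jpg_names)
```

Note that `'jpg' in name` would also match `jpg_notes.txt`, which is why the sketch tests the suffix instead.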

In [6]:
def is_error(class_idx, partition_idx, file_idx):
    dim = jpg_query(class_idx, partition_idx, file_idx, vis = False, print_path = False)
    if len(dim) == 3:
        is_error_ = 0
    else:
        is_error_ = 1
    return (partition_idx, class_idx, file_idx), is_error_

In [7]:
# Create input list for multiprocess
input_list = []
for partition_idx, partition_type in PARTITION_MAP.items():
    for class_idx, class_type in CLASS_MAP.items():
        jpg_num = dist_dict[partition_idx][class_idx]
        for i in range(jpg_num):
            tmp_tuple = (class_idx, partition_idx, i)
            input_list.append(tmp_tuple)
input_list = tuple(input_list)

In [8]:
# Multi-processing
import time
import multiprocess as mp

start_time = time.time()

# The context manager shuts the pool down on exit, so no explicit close()/join() is needed
with mp.Pool(processes=60) as pool:
    results = pool.starmap(is_error, input_list)

print("---Multiprocess Complete: %d mins ---" % ((time.time() - start_time) / 60))

---Multiprocess Complete: 6 mins ---
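
`starmap` unpacks each tuple of the input list into the positional arguments of the worker function and returns the results in input order. The sketch below shows this with toy inputs. As an assumption made for portability, it uses the standard library's `multiprocessing.pool.ThreadPool`, which exposes the same `starmap` interface as `Pool` but runs threads instead of processes; `fake_is_error` is a hypothetical stand-in for `is_error`.

```python
from multiprocessing.pool import ThreadPool

def fake_is_error(class_idx, partition_idx, file_idx):
    # Toy rule (not the notebook's logic): every third file counts as an error
    is_error_ = 1 if file_idx % 3 == 0 else 0
    return (partition_idx, class_idx, file_idx), is_error_

# Each tuple is (class_idx, partition_idx, file_idx), as in input_list
toy_inputs = [(0, 1, 0), (0, 1, 1), (5, 2, 3)]

with ThreadPool(processes=2) as pool:
    toy_results = pool.starmap(fake_is_error, toy_inputs)

print(toy_results)
```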


In [33]:
import pandas as pd
error_df = pd.DataFrame(results, columns=['index', 'error'])
error_df[['partition_idx', 'class_idx', 'file_idx']] = error_df['index'].apply(pd.Series)

In [45]:
error_df.groupby('class_idx')['error'].agg('sum').sort_values().head(10)

class_idx
56     0
173    0
44     0
47     0
171    0
49     0
51     0
52     0
55     0
139    0
Name: error, dtype: int64

#### Select A Few Classes with No Errors

In [46]:
GOOD_CLASS_MAP = {}
GOOD_CLASS_MAP[56] = CLASS_MAP[56]
GOOD_CLASS_MAP[173] = CLASS_MAP[173]
GOOD_CLASS_MAP[44] = CLASS_MAP[44]
GOOD_CLASS_MAP[47] = CLASS_MAP[47]
GOOD_CLASS_MAP[171] = CLASS_MAP[171]

In [47]:
GOOD_CLASS_MAP

{56: '印花裙子', 173: '大码裙子', 44: '短款旗袍', 47: '绷带连衣裙', 171: '荷叶袖连衣裙'}
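
The five indices above were copied by hand from the `groupby` output. As a sketch, the same kind of selection could be derived from `results` directly. The toy `results` and class map below are made-up values in the same shape as the notebook's, `pick_good_classes` is a hypothetical helper, and it picks the lowest class indices first, which is a choice of this sketch rather than the notebook's ordering.

```python
from collections import defaultdict

def pick_good_classes(results, class_map, limit):
    """Return up to `limit` classes whose files produced no errors."""
    errors_per_class = defaultdict(int)
    for (partition_idx, class_idx, file_idx), is_error_ in results:
        errors_per_class[class_idx] += is_error_
    good = sorted(idx for idx, total in errors_per_class.items() if total == 0)
    return {idx: class_map[idx] for idx in good[:limit]}

# Made-up data: class 1 has one error, classes 0 and 2 have none
toy_class_map = {0: 'class_a', 1: 'class_b', 2: 'class_c'}
toy_results = [
    ((1, 0, 0), 0), ((1, 0, 1), 0),
    ((1, 1, 0), 1), ((2, 1, 0), 0),
    ((1, 2, 0), 0),
]
good_class_map = pick_good_classes(toy_results, toy_class_map, limit=2)
print(good_class_map)
```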

##### Selected Classes for Sample Run
1. '印花裙子'
2. '大码裙子'
3. '短款旗袍'
4. '绷带连衣裙'
5. '荷叶袖连衣裙'

#### Distribution of Partition for Selected Classes

In [48]:
# Find the distribution of selected classes
dist_dict = defaultdict(dict)
for partition_idx, folder in PARTITION_MAP.items():
    for class_idx, class_ in GOOD_CLASS_MAP.items():
        target_path = SOURCE_PATH + folder +'/' + class_
        cnt = len([item for item in os.listdir(target_path) if 'jpg' in item])
        dist_dict[partition_idx][class_idx] = cnt
        print('[%s/ %s]: %d' % (folder, class_, cnt))

[imgtrain/ 印花裙子]: 2375
[imgtrain/ 大码裙子]: 2388
[imgtrain/ 短款旗袍]: 2396
[imgtrain/ 绷带连衣裙]: 2398
[imgtrain/ 荷叶袖连衣裙]: 2400
[imgval/ 印花裙子]: 298
[imgval/ 大码裙子]: 299
[imgval/ 短款旗袍]: 300
[imgval/ 绷带连衣裙]: 301
[imgval/ 荷叶袖连衣裙]: 300
[imgtest/ 印花裙子]: 296
[imgtest/ 大码裙子]: 298
[imgtest/ 短款旗袍]: 299
[imgtest/ 绷带连衣裙]: 299
[imgtest/ 荷叶袖连衣裙]: 300


#### Migrate Data of Selected Classes

In [49]:
os.getcwd()

'/home/r8user2/Documents/HY/dress_data/alex_workplace/dressdata_project'

In [59]:
path = 'git_workplace/selected_gd_data10'
os.mkdir(path)

In [60]:
for partition_idx, partition_type in PARTITION_MAP.items():
    os.mkdir(path + '/' + partition_type)
    for class_idx, class_type in GOOD_CLASS_MAP.items():
        os.mkdir(path + '/' + partition_type + '/' + class_type)
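
`os.mkdir` raises `FileExistsError` when a directory already exists, so the two cells above fail if they are re-run. A sketch using `os.makedirs(..., exist_ok=True)`, which also creates intermediate directories in one call; the class names are placeholders and the tree is built under a temporary directory rather than the project path.

```python
import os
import tempfile

partitions = ['imgtrain', 'imgval', 'imgtest']
class_names = ['class_a', 'class_b']  # placeholder class names

with tempfile.TemporaryDirectory() as root:
    for _ in range(2):  # the second pass simulates re-running the cell
        for partition in partitions:
            for class_name in class_names:
                target = os.path.join(root, partition, class_name)
                os.makedirs(target, exist_ok=True)
    partition_dirs = sorted(os.listdir(root))
    created = sorted(os.listdir(os.path.join(root, 'imgtrain')))

print(partition_dirs, created)
```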

In [54]:
import shutil
def copytree(src, dst, symlinks=False, ignore=None):
    for item in os.listdir(src):
        s = os.path.join(src, item)
        d = os.path.join(dst, item)
        if os.path.isdir(s):
            shutil.copytree(s, d, symlinks, ignore)
        else:
            shutil.copy2(s, d)
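
The helper above copies into a destination directory that already exists, which plain `shutil.copytree` refused to do before Python 3.8. Assuming Python 3.8 or later, the `dirs_exist_ok=True` argument covers the same case; a sketch on temporary directories with made-up file names:

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, 'src')
    dst = os.path.join(root, 'dst')
    os.makedirs(os.path.join(src, 'nested'))
    os.makedirs(dst)  # the destination already exists, as in the notebook
    for rel in ['a.jpg', os.path.join('nested', 'b.jpg')]:
        with open(os.path.join(src, rel), 'w') as fh:
            fh.write('x')

    # Requires Python 3.8+; copies files and subdirectories into dst
    shutil.copytree(src, dst, dirs_exist_ok=True)

    copied = sorted(
        os.path.relpath(os.path.join(dirpath, name), dst)
        for dirpath, _, names in os.walk(dst)
        for name in names
    )

print(copied)
```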

In [62]:
for partition_idx, partition_type in PARTITION_MAP.items():
    for class_idx, class_type in GOOD_CLASS_MAP.items():
        from_path = SOURCE_PATH + partition_type + '/' + class_type
        to_path = path + '/' + partition_type + '/' + class_type
        copytree(from_path, to_path)
        print(to_path)

git_workplace/data/imgtrain/印花裙子
git_workplace/data/imgtrain/大码裙子
git_workplace/data/imgtrain/短款旗袍
git_workplace/data/imgtrain/绷带连衣裙
git_workplace/data/imgtrain/荷叶袖连衣裙
git_workplace/data/imgval/印花裙子
git_workplace/data/imgval/大码裙子
git_workplace/data/imgval/短款旗袍
git_workplace/data/imgval/绷带连衣裙
git_workplace/data/imgval/荷叶袖连衣裙
git_workplace/data/imgtest/印花裙子
git_workplace/data/imgtest/大码裙子
git_workplace/data/imgtest/短款旗袍
git_workplace/data/imgtest/绷带连衣裙
git_workplace/data/imgtest/荷叶袖连衣裙


In [63]:
os.getcwd()

'/home/r8user2/Documents/HY/dress_data/alex_workplace/dressdata_project'