# Creating the dataset
Goal: extract photos of men's faces from the IMDB-WIKI dataset and build the image set to be labeled via crowdsourcing.

Notes:

   - The WIKI data is reportedly often corrupted, so it is not used.
   - The "crop" (pre-cropped) version of the IMDB-WIKI dataset is used.

In [1]:
from pathlib import Path

In [2]:
import numpy as np
import pandas as pd

## Load the imdb-wiki metadata (MATLAB format) and extract the needed fields

In [3]:
import scipy.io

In [4]:
imdb_mat = scipy.io.loadmat(Path("../imdb-wiki/imdb_crop/imdb.mat"))

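The loaded structure is awkward to work with, so rebuild it.

The fields needed are ID, gender, name, and file path. The data carries no IDs, so they are assigned here.
face_score also looks useful, so keep it as well.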
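
The cells that follow read fields by position (2, 3, 4, 6). A minimal sketch of giving those positions names, assuming the field names from the IMDB-WIKI description (`full_path`, `gender`, `name`, `face_score`); `FIELD_INDEX`, `field`, `unwrap`, and the toy `record` are hypothetical stand-ins written with plain lists, not objects returned by scipy:

```python
# Hypothetical name-to-position map for the fields this notebook reads.
FIELD_INDEX = {"full_path": 2, "gender": 3, "name": 4, "face_score": 6}


def field(record, name):
    """Return the row stored at the named field of the struct record."""
    return record[FIELD_INDEX[name]][0]


def unwrap(cells):
    """Turn a sequence of one-element containers into a flat list."""
    return [cell[0] for cell in cells]


# Toy stand-in shaped like imdb_mat["imdb"][0][0]: each field holds one row.
record = [
    None,
    None,
    [[["01/a.jpg"], ["01/b.jpg"]]],        # full_path (made-up paths)
    [[1.0, 0.0]],                          # gender
    [[["Fred Astaire"], ["Jane Levy"]]],   # name
    None,
    [[1.46, 3.85]],                        # face_score
]

paths = unwrap(field(record, "full_path"))
names = unwrap(field(record, "name"))
genders = list(field(record, "gender"))
scores = list(field(record, "face_score"))
print(paths, names, genders, scores)
```

Keeping the positions in one place makes it easier to check them against the dataset description if the layout ever differs.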

In [5]:
paths = []

In [6]:
for path in imdb_mat["imdb"][0][0][2][0]:
    paths.append(path[0])

In [7]:
paths[0:3], len(paths)

(['01/nm0000001_rm124825600_1899-5-10_1968.jpg',
  '01/nm0000001_rm3343756032_1899-5-10_1970.jpg',
  '01/nm0000001_rm577153792_1899-5-10_1968.jpg'],
 460723)

In [8]:
FEMALE = 0
MALE = 1

In [9]:
genders = imdb_mat["imdb"][0][0][3][0]

In [10]:
np.nansum(genders) , len(genders)

(263214.0, 460723)
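
Since `MALE = 1`, the sum above is the number of rows labeled male; `np.nansum` skips NaN entries, which the dataset description uses for unknown gender. A small sketch of a three-way count on a toy array (the values are made up):

```python
import numpy as np

FEMALE = 0
MALE = 1

# Toy stand-in for the gender vector; NaN marks an unknown label.
genders = np.array([1.0, 0.0, np.nan, 1.0, 1.0, np.nan])

n_male = int(np.sum(genders == MALE))
n_female = int(np.sum(genders == FEMALE))
n_unknown = int(np.isnan(genders).sum())

print(n_male, n_female, n_unknown)  # 3 1 2
```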

In [32]:
imdb_mat["imdb"][0][0][4][0]

array([array(['Fred Astaire'], dtype='<U12'),
       array(['Fred Astaire'], dtype='<U12'),
       array(['Fred Astaire'], dtype='<U12'), ...,
       array(['Jane Levy'], dtype='<U9'),
       array(['Jane Levy'], dtype='<U9'),
       array(['Jane Levy'], dtype='<U9')], dtype=object)

In [31]:
names = []

In [33]:
for d in imdb_mat["imdb"][0][0][4][0]:
    names.append(d[0])

In [11]:
scores = imdb_mat["imdb"][0][0][6][0]

In [12]:
ids = range(len(paths))

In [35]:
df = pd.DataFrame([ids, names, paths, genders, scores])

In [36]:
df_imdb = pd.DataFrame(df.T.values, columns=["id", "names", "path","gender", "face_score"])
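
Building the frame row-wise and then transposing, as in the two cells above, leaves every column with `object` dtype. A sketch of an alternative that passes a dict of columns so numeric columns stay numeric; the toy lists stand in for the real `ids`, `names`, `paths`, `genders`, and `scores`:

```python
import pandas as pd

columns = ["id", "names", "path", "gender", "face_score"]
ids = [0, 1, 2]
names = ["Fred Astaire", "Fred Astaire", "Jane Levy"]
paths = ["01/a.jpg", "01/b.jpg", "08/c.jpg"]  # made-up paths
genders = [1.0, 1.0, 0.0]
scores = [1.459693, 2.543198, float("-inf")]

# Row-wise construction plus transpose: values pass through an object array.
df_rows = pd.DataFrame([ids, names, paths, genders, scores])
df_object = pd.DataFrame(df_rows.T.values, columns=columns)

# Column-wise construction: each column keeps its own dtype.
df_typed = pd.DataFrame(dict(zip(columns, [ids, names, paths, genders, scores])))

print(df_object.dtypes)
print(df_typed.dtypes)
```

Numeric dtypes matter later for operations such as `np.isfinite` or `nlargest`, which do not accept object columns directly.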

In [37]:
df_imdb

Unnamed: 0,id,names,path,gender,face_score
0,0,Fred Astaire,01/nm0000001_rm124825600_1899-5-10_1968.jpg,1.0,1.459693
1,1,Fred Astaire,01/nm0000001_rm3343756032_1899-5-10_1970.jpg,1.0,2.543198
2,2,Fred Astaire,01/nm0000001_rm577153792_1899-5-10_1968.jpg,1.0,3.455579
3,3,Fred Astaire,01/nm0000001_rm946909184_1899-5-10_1968.jpg,1.0,1.872117
4,4,Fred Astaire,01/nm0000001_rm980463616_1899-5-10_1968.jpg,1.0,1.158766
...,...,...,...,...,...
460718,460718,Jane Levy,08/nm3994408_rm761245696_1989-12-29_2011.jpg,0.0,3.845884
460719,460719,Jane Levy,08/nm3994408_rm784182528_1989-12-29_2011.jpg,0.0,-inf
460720,460720,Jane Levy,08/nm3994408_rm926592512_1989-12-29_2011.jpg,0.0,-inf
460721,460721,Jane Levy,08/nm3994408_rm943369728_1989-12-29_2011.jpg,0.0,4.450725


In [38]:
df_imdb.to_csv("imdb_path_gender_score.csv")
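
By default `to_csv` writes the row index as a first column with an empty header, which comes back as `Unnamed: 0` when the file is read again. A small sketch using in-memory buffers and toy rows; passing `index=False` is one way to avoid the extra column:

```python
import io

import pandas as pd

df = pd.DataFrame({"id": [0, 1], "names": ["Fred Astaire", "Jane Levy"]})

# Default: the index is written as a column with an empty header.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = list(pd.read_csv(without_index).columns)

print(cols_default)   # ['Unnamed: 0', 'id', 'names']
print(cols_no_index)  # ['id', 'names']
```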

## Keep only photos of men

In [39]:
df_male = df_imdb[df_imdb["gender"]==MALE]
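
Because a comparison with NaN evaluates to False, the equality mask above also drops rows whose gender is unknown, not only rows labeled female. A toy illustration with made-up rows:

```python
import numpy as np
import pandas as pd

MALE = 1

df = pd.DataFrame(
    {"names": ["A", "B", "C", "D"], "gender": [1.0, 0.0, np.nan, 1.0]}
)

# NaN == 1 is False, so the unknown-gender row "C" is excluded as well.
mask = df["gender"] == MALE
df_male = df[mask]

print(df_male["names"].tolist())  # ['A', 'D']
```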

In [40]:
df_male

Unnamed: 0,id,names,path,gender,face_score
0,0,Fred Astaire,01/nm0000001_rm124825600_1899-5-10_1968.jpg,1.0,1.459693
1,1,Fred Astaire,01/nm0000001_rm3343756032_1899-5-10_1970.jpg,1.0,2.543198
2,2,Fred Astaire,01/nm0000001_rm577153792_1899-5-10_1968.jpg,1.0,3.455579
3,3,Fred Astaire,01/nm0000001_rm946909184_1899-5-10_1968.jpg,1.0,1.872117
4,4,Fred Astaire,01/nm0000001_rm980463616_1899-5-10_1968.jpg,1.0,1.158766
...,...,...,...,...,...
460619,460619,Ben Rappaport,19/nm2999419_rm570262784_1986-3-23_2011.jpg,1.0,5.086889
460620,460620,Ben Rappaport,19/nm2999419_rm573158144_1986-3-23_2011.jpg,1.0,5.03476
460621,460621,Ben Rappaport,19/nm2999419_rm971736576_1986-3-23_2010.jpg,1.0,4.539457
460622,460622,Ben Rappaport,19/nm2999419_rm988513792_1986-3-23_2010.jpg,1.0,3.965925


## Sort by face_score
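> face_score: detector score (the higher the better). Inf implies that no face was found in the image and the face_location then just returns the entire image

In this notebook the non-finite scores appear as -inf rather than Inf, so a descending sort places those rows last.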
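
Rows without a usable detection can be dropped before sorting instead of being left at the end of the ranking. A sketch assuming that any non-finite `face_score` (`inf`, `-inf`, NaN) means no usable face; the rows are made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"id": [0, 1, 2, 3], "face_score": [1.46, -np.inf, 3.85, np.nan]}
)

# Cast first so the check also works when the column has object dtype.
scores = df["face_score"].astype(float)
df_detected = df[np.isfinite(scores)]

df_sorted = df_detected.sort_values("face_score", ascending=False)
print(df_sorted["id"].tolist())  # [2, 0]
```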

In [41]:
df_male_sorted = df_male.sort_values("face_score", ascending=False)

In [42]:
df_male_sorted

Unnamed: 0,id,names,path,gender,face_score
243534,243534,Joe Thomas,04/nm3022504_rm2354875136_1983-10-28_2008.jpg,1.0,7.342362
274825,274825,Bobby Cannavale,72/nm0134072_rm1979095808_1970-5-3_2007.jpg,1.0,7.169729
405301,405301,Kevin G. Schmidt,56/nm0773056_rm211794688_1988-8-16_2010.jpg,1.0,7.0074
325918,325918,Jordan Gavaris,98/nm2849998_rm211794688_1989-9-25_2010.jpg,1.0,7.0074
99127,99127,James Duval,66/nm0001166_rm4107245568_1972-9-10_2004.jpg,1.0,7.00737
...,...,...,...,...,...
278871,278871,Larry Drake,52/nm0236952_rm805541888_1950-2-21_2001.jpg,1.0,-inf
278938,278938,Tom Dreesen,78/nm0237378_rm3405551616_1939-9-11_2000.jpg,1.0,-inf
278941,278941,Mort Drescher,12/nm0237512_rm324705024_1929-10-29_2005.jpg,1.0,-inf
278942,278942,Mort Drescher,12/nm0237512_rm341482240_1929-10-29_2005.jpg,1.0,-inf


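Even by a generous estimate, at most 2000 images can be labeled via crowdsourcing. The next cell keeps the top 3000 rows by face_score as candidates; the deduplication and cleaning steps below reduce this set further.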

In [55]:
df_2000 = df_male_sorted[0:3000]
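
Taking the first N rows of a frame sorted in descending order can also be written with `DataFrame.nlargest`. A toy sketch; `N_CANDIDATES` is a hypothetical name and the rows are made up:

```python
import pandas as pd

N_CANDIDATES = 3

df = pd.DataFrame(
    {"id": [0, 1, 2, 3, 4], "face_score": [1.5, 7.3, 5.6, 2.5, 7.0]}
)

# Sort-then-slice, as in the notebook.
by_sort = df.sort_values("face_score", ascending=False)[0:N_CANDIDATES]

# Same selection without sorting the whole frame first.
by_nlargest = df.nlargest(N_CANDIDATES, "face_score")

print(by_nlargest["id"].tolist())  # [1, 4, 2]
```

`nlargest` needs a numeric column, so an object-dtype `face_score` would have to be converted (for example with `astype(float)`) before using it.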

In [56]:
df_2000

Unnamed: 0,id,names,path,gender,face_score
243534,243534,Joe Thomas,04/nm3022504_rm2354875136_1983-10-28_2008.jpg,1.0,7.342362
274825,274825,Bobby Cannavale,72/nm0134072_rm1979095808_1970-5-3_2007.jpg,1.0,7.169729
405301,405301,Kevin G. Schmidt,56/nm0773056_rm211794688_1988-8-16_2010.jpg,1.0,7.0074
325918,325918,Jordan Gavaris,98/nm2849998_rm211794688_1989-9-25_2010.jpg,1.0,7.0074
99127,99127,James Duval,66/nm0001166_rm4107245568_1972-9-10_2004.jpg,1.0,7.00737
...,...,...,...,...,...
67436,67436,Liam Neeson,53/nm0000553_rm1231534592_1952-6-7_2011.jpg,1.0,5.629399
320671,320671,Clive Standen,40/nm1641140_rm830259968_1981-7-22_2013.jpg,1.0,5.629373
124470,124470,Chris Sarandon,97/nm0001697_rm1037272320_1942-7-24_1988.jpg,1.0,5.629364
129727,129727,John Turturro,06/nm0001806_rm1701550848_1957-2-28_2005.jpg,1.0,5.629328


In [57]:
df_2000.to_csv("imdb-2000males-sorted-score.csv")

## Removing duplicate people

In [62]:
df_2000_uniqe = df_2000.drop_duplicates(subset="names")
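
`drop_duplicates(subset="names")` keeps the first row per name, which is the highest-scoring one here because the frame is already sorted. The name is not the only possible key: the file names appear to embed a person id (`nm...`) and an image id (`rm...`), and some rows in the tables above share one `rm` id under different names. A sketch that extracts both ids, assuming this naming pattern holds for every path; the `person_id` and `image_id` column names are hypothetical:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "names": ["Kevin G. Schmidt", "Jordan Gavaris", "James Duval", "James Duval"],
        "path": [
            "56/nm0773056_rm211794688_1988-8-16_2010.jpg",
            "98/nm2849998_rm211794688_1989-9-25_2010.jpg",
            "66/nm0001166_rm4107245568_1972-9-10_2004.jpg",
            "66/nm0001166_rm1111111111_1972-9-10_2004.jpg",  # made-up second image
        ],
    }
)

# Named groups become the column names of the extracted frame.
ids = df["path"].str.extract(r"(?P<person_id>nm\d+)_(?P<image_id>rm\d+)")
df = pd.concat([df, ids], axis=1)

# One row per person, then one row per underlying image.
unique_people = df.drop_duplicates(subset="person_id")
unique_images = unique_people.drop_duplicates(subset="image_id")

print(unique_images["names"].tolist())  # ['Kevin G. Schmidt', 'James Duval']
```

Deduplicating on the image id would catch some of the multi-person photos before the manual cleaning step, though whether that is desirable depends on the labeling task.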

In [63]:
df_2000_uniqe

Unnamed: 0,id,names,path,gender,face_score
243534,243534,Joe Thomas,04/nm3022504_rm2354875136_1983-10-28_2008.jpg,1.0,7.342362
274825,274825,Bobby Cannavale,72/nm0134072_rm1979095808_1970-5-3_2007.jpg,1.0,7.169729
405301,405301,Kevin G. Schmidt,56/nm0773056_rm211794688_1988-8-16_2010.jpg,1.0,7.0074
325918,325918,Jordan Gavaris,98/nm2849998_rm211794688_1989-9-25_2010.jpg,1.0,7.0074
99127,99127,James Duval,66/nm0001166_rm4107245568_1972-9-10_2004.jpg,1.0,7.00737
...,...,...,...,...,...
239135,239135,Jamie Waylett,88/nm0915488_rm726253312_1989-7-21_2007.jpg,1.0,5.629968
198287,198287,Matthew Lewis,35/nm0507535_rm726253312_1989-6-27_2007.jpg,1.0,5.629968
274627,274627,Martin Campbell,09/nm0132709_rm2023984896_1943-10-24_2010.jpg,1.0,5.629927
320671,320671,Clive Standen,40/nm1641140_rm830259968_1981-7-22_2013.jpg,1.0,5.629373


In [65]:
df_2000_uniqe.to_csv("imdb-2000males-sorted-score-uniqe.csv")

## Picking out the images

In [24]:
import shutil

In [168]:
!mkdir ..\images

In [66]:
imdbPath = Path("../imdb-wiki/imdb_crop/")
for ID, path in zip(df_2000_uniqe["id"], df_2000_uniqe["path"]):
    fromPath = imdbPath / Path(path)
    shutil.copy(fromPath, Path(f"../images/{ID}.jpg"))
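
The copy loop depends on `../images` already existing, and the `mkdir` cell above uses a Windows-style path. A portable sketch using only `pathlib` and `shutil`; `copy_images` is a hypothetical helper, and skipping missing source files (rather than raising) is an assumption about the desired behavior:

```python
import shutil
import tempfile
from pathlib import Path


def copy_images(rows, src_root, dst_root):
    """Copy each (image_id, relative_path) pair to dst_root/<image_id>.jpg.

    Returns the ids whose source file was not found.
    """
    dst_root = Path(dst_root)
    dst_root.mkdir(parents=True, exist_ok=True)  # replaces the shell mkdir
    missing = []
    for image_id, rel_path in rows:
        src = Path(src_root) / rel_path
        if not src.is_file():
            missing.append(image_id)
            continue
        shutil.copy(src, dst_root / f"{image_id}.jpg")
    return missing


# Demo in a temporary directory: one existing file and one missing file.
with tempfile.TemporaryDirectory() as tmp:
    src_root = Path(tmp) / "imdb_crop"
    (src_root / "01").mkdir(parents=True)
    (src_root / "01" / "a.jpg").write_bytes(b"fake image bytes")

    missing = copy_images(
        [(0, "01/a.jpg"), (1, "01/b.jpg")], src_root, Path(tmp) / "images"
    )
    copied = sorted(p.name for p in (Path(tmp) / "images").iterdir())

print(copied, missing)  # ['0.jpg'] [1]
```

In the notebook this could be called as `copy_images(zip(df_2000_uniqe["id"], df_2000_uniqe["path"]), imdbPath, "../images")`.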

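## Cleaning
Of the 1394 images, those matching any of the following were excluded:
 - Multiple people in the frame
 - The subject is female (labeling error)
 - Blurry or unclear
 - Appears to be the same person as another image
 - Not a color photo

This left 1076 images, saved to images/cleaned_20220519.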
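
If the cleaned folder holds only the images that survived, the remaining ids can be recovered from the file names and used to filter the metadata. A sketch, assuming the files are still named `<id>.jpg` as written by the copy step; the temporary directory and toy frame stand in for `images/cleaned_20220519` and the deduplicated frame:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy metadata; in the notebook this would be the deduplicated frame.
df = pd.DataFrame({"id": [10, 11, 12], "names": ["A", "B", "C"]})

with tempfile.TemporaryDirectory() as tmp:
    cleaned_dir = Path(tmp)
    # Simulate a folder where image 11 was removed during cleaning.
    for kept in (10, 12):
        (cleaned_dir / f"{kept}.jpg").touch()

    # File stems are the ids assigned earlier.
    kept_ids = {int(p.stem) for p in cleaned_dir.glob("*.jpg")}

df_cleaned = df[df["id"].astype(int).isin(kept_ids)]
print(df_cleaned["names"].tolist())  # ['A', 'C']
```

Saving `df_cleaned` alongside the images would keep the names, paths, and scores of the final set together with the files.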