# Exploring duplicates manually

This script is for cases where you know there are near-duplicates that can be identified by filename (e.g. our wedding photos, where every file was distributed in three versions: high-res, watermarked, and black-and-white).

I used it only to check whether there were any files that I did NOT have in high-res.

In [1]:
import pandas as pd

In [36]:
import pathlib
import glob
from pprint import pprint
import re

In [87]:
fpath_basedir = pathlib.Path('./2018/2018-07-21 - Wedding')
fpattern = 'Muil [0-9]*.[jJ][pP][gG]'
re_canonical = re.compile('Muil [0-9]{4}_[0-9]+')
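
To make the two patterns concrete, here is a minimal sketch that runs them against a few filenames. The sample names are hypothetical, modelled on the ones in this folder, and `fnmatch.fnmatchcase` is used only as an approximation of how `rglob` matches a filename (the case sensitivity of `pathlib` globbing can depend on the platform).

```python
import fnmatch
import re

# Same patterns as in the cell above
fpattern = 'Muil [0-9]*.[jJ][pP][gG]'
re_canonical = re.compile('Muil [0-9]{4}_[0-9]+')

# Hypothetical filenames, modelled on the ones in this folder
samples = [
    'Muil 0718_001 - High Res Files.jpg',
    'Muil 0718_379-2.JPG',
    'Muil notes.jpg',
    'Muil 0718_001.png',
]

def matches_glob(fname):
    """Approximate the rglob filename match with a case-sensitive fnmatch."""
    return fnmatch.fnmatchcase(fname, fpattern)

def canonical(fname):
    """Return the 'Muil NNNN_NNN' prefix, or None if the name does not start with one."""
    m = re_canonical.match(fname)
    return m.group() if m else None

for s in samples:
    print(f'{s!r}: glob={matches_glob(s)}, canonical={canonical(s)!r}')
```

The glob decides which files are picked up at all; the regex then strips suffixes such as `-2` or ` - High Res Files` so that all versions of one photo share a key.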

In [87]:
fpaths_match = list(fpath_basedir.rglob(fpattern))
print(f'got {len(fpaths_match):n} matching filenames')
df = pd.DataFrame({'fpath': fpaths_match})

df['fname'] = df.fpath.apply(lambda f: f.name)
df['fstem'] = df.fpath.apply(lambda f: f.stem)

# Stand-in for a failed match, so that .group() still works on unmatched names
class Dummy():
    def group(self):
        return '<unmatched>'
dummy = Dummy()

df['canonical_stem'] = df.fname.apply(lambda f: (re_canonical.match(f) or dummy).group())
df

got 1415 matching filenames


,fpath,fname,fstem,canonical_stem
0,2018/2018-07-21 - Wedding/Muil 0718_001 - High...,Muil 0718_001 - High Res Files.jpg,Muil 0718_001 - High Res Files,Muil 0718_001
1,2018/2018-07-21 - Wedding/Muil 0718_002 - High...,Muil 0718_002 - High Res Files.jpg,Muil 0718_002 - High Res Files,Muil 0718_002
2,2018/2018-07-21 - Wedding/Muil 0718_003 - High...,Muil 0718_003 - High Res Files.jpg,Muil 0718_003 - High Res Files,Muil 0718_003
3,2018/2018-07-21 - Wedding/Muil 0718_004 - High...,Muil 0718_004 - High Res Files.jpg,Muil 0718_004 - High Res Files,Muil 0718_004
4,2018/2018-07-21 - Wedding/Muil 0718_005 - High...,Muil 0718_005 - High Res Files.jpg,Muil 0718_005 - High Res Files,Muil 0718_005
...,...,...,...,...
1410,2018/2018-07-21 - Wedding/Watermarked for Face...,Muil 0718_379-2.jpg,Muil 0718_379-2,Muil 0718_379
1411,2018/2018-07-21 - Wedding/Watermarked for Face...,Muil 0718_124.jpg,Muil 0718_124,Muil 0718_124
1412,2018/2018-07-21 - Wedding/Watermarked for Face...,Muil 0718_285-2.jpg,Muil 0718_285-2,Muil 0718_285
1413,2018/2018-07-21 - Wedding/Watermarked for Face...,Muil 0718_042-2.jpg,Muil 0718_042-2,Muil 0718_042
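
The `Dummy` class in the cell above exists only so that `.group()` can be called on a failed match. A small helper function is one possible alternative; this is a sketch, and the names `canonical_stem` and `default` are made up here, not part of the original notebook.

```python
import re

# Same regex as in the notebook
RE_CANONICAL = re.compile('Muil [0-9]{4}_[0-9]+')

def canonical_stem(fname, default='<unmatched>'):
    """Return the canonical 'Muil NNNN_NNN' prefix of fname, or default if there is none."""
    m = RE_CANONICAL.match(fname)
    return m.group() if m else default

print(canonical_stem('Muil 0718_355-2 - High Res Files.jpg'))
print(canonical_stem('IMG_0001.jpg'))
```

In the notebook this could be used as `df['canonical_stem'] = df.fname.apply(canonical_stem)`.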


In [88]:
szs = df.groupby('canonical_stem').size().rename('num_fpaths')
szs.value_counts()

3    469
2      2
4      1
Name: num_fpaths, dtype: int64
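
To read this output: the left column is the number of files sharing a canonical stem, the right column is how many stems have that many files. The arithmetic below only makes that reading concrete, using the numbers printed above; group sizes always sum to the row count, so it is not an independent check.

```python
# Group-size distribution copied from the value_counts() output above:
# {files per canonical stem: number of stems with that many files}
group_size_counts = {3: 469, 2: 2, 4: 1}

total_stems = sum(group_size_counts.values())
total_files = sum(size * n for size, n in group_size_counts.items())

print(f'{total_stems} distinct stems, {total_files} files in total')
```

So 469 of the 472 stems have the expected three versions, and only three stems need a closer look.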

In [90]:
dfa = pd.merge(df, szs.reset_index(), on='canonical_stem')
dfa[dfa.num_fpaths!=3].set_index(['canonical_stem', 'fstem']).sort_index()

canonical_stem,fstem,fpath,fname,num_fpaths
Muil 0718_085,Muil 0718_085,2018/2018-07-21 - Wedding/Watermarked for Face...,Muil 0718_085.jpg,2
Muil 0718_085,Muil 0718_085-2,2018/2018-07-21 - Wedding/Watermarked for Face...,Muil 0718_085-2.jpg,2
Muil 0718_338,Muil 0718_338,2018/2018-07-21 - Wedding/Watermarked for Face...,Muil 0718_338.jpg,2
Muil 0718_338,Muil 0718_338-2,2018/2018-07-21 - Wedding/Watermarked for Face...,Muil 0718_338-2.jpg,2
Muil 0718_355,Muil 0718_355,2018/2018-07-21 - Wedding/Watermarked for Face...,Muil 0718_355.jpg,4
Muil 0718_355,Muil 0718_355 - High Res Files,2018/2018-07-21 - Wedding/Muil 0718_355 - High...,Muil 0718_355 - High Res Files.jpg,4
Muil 0718_355,Muil 0718_355-2,2018/2018-07-21 - Wedding/Watermarked for Face...,Muil 0718_355-2.jpg,4
Muil 0718_355,Muil 0718_355-2 - High Res Files,2018/2018-07-21 - Wedding/Muil 0718_355-2 - Hi...,Muil 0718_355-2 - High Res Files.jpg,4


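Huh! So two photos (`085` and `338`) do indeed not exist in high-res format: each has only its two copies under `Watermarked for Face...`. Photo `355` has four files because its `-2` variant also has a high-res copy. Very glad I checked this.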
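
The check can also be made directly, rather than by looking for groups of unusual size. The sketch below uses toy data and assumes that high-res copies are exactly the files with `High Res Files` in the name, which holds for the rows shown above but is not verified for the whole folder. The column `is_high_res` is introduced here for illustration.

```python
import pandas as pd

# Toy data, made up for illustration; the real notebook builds df from disk
df = pd.DataFrame({
    'fstem': [
        'Muil 0718_001 - High Res Files', 'Muil 0718_001', 'Muil 0718_001-2',
        'Muil 0718_085', 'Muil 0718_085-2',
    ],
})
df['canonical_stem'] = df['fstem'].str.extract(r'^(Muil [0-9]{4}_[0-9]+)', expand=False)

# Assumption: a file is high-res if and only if its name carries this marker
df['is_high_res'] = df['fstem'].str.contains('High Res Files', regex=False)

# For each photo, is there at least one high-res file?
has_high_res = df.groupby('canonical_stem')['is_high_res'].any()
missing_high_res = sorted(has_high_res[~has_high_res].index)
print(missing_high_res)
```

On the toy data this lists only `Muil 0718_085`. Unlike the group-size check, it would also catch a photo that has three files but none of them high-res.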