## Find Duplicate JPEG files from a Directory
The code below examines the photos (JPEGs) in a given directory
and finds duplicates. The duplicates are moved to a dups directory, where they can be
examined to confirm that they really are duplicates.

Each collection of duplicate photos is sorted so that names that start with
letters come before names that begin with numbers. The file(s) sent to the dups directory
are the ones after the first of the sorted files.
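
The ordering rule above can be sketched with a sort key. This is only an illustration: `letters_first_key` and the sample file names are hypothetical, and sorting case-insensitively within each group is an assumption the text does not state.

```python
def letters_first_key(name):
    """Sort key: names starting with a letter come before all other names."""
    starts_with_letter = name[:1].isalpha()
    # False sorts before True, so use 0 for "letter" and 1 for everything else.
    return (0 if starts_with_letter else 1, name.lower())


# Hypothetical file names that were found to be duplicates of each other.
names = ["2019_beach.jpg", "beach.jpg", "IMG_0042.jpg", "0001.jpg"]
ordered = sorted(names, key=letters_first_key)

keep = ordered[0]    # the file that stays in the source directory
dups = ordered[1:]   # the files that would be sent to the dups directory
print(keep, dups)
```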

The approach to find duplicates is the following:

- Create a dictionary for images:
  - Determine the size of the image in pixels: (X, Y).
  - Create a hash function for an image:
    - Average the RGB values for each pixel.
    - Construct a small grid of points.
    - Average the averaged pixel values over the grid.
  - A given file is appended to a list of filenames based on
    the key: (X, Y, image_hash_value).
    That is, the dictionary has the form: (X, Y, image_hash_value) -> list_of_files
- Once all of the files have been placed into the dictionary,
  find the keys whose values are lists of length > 1.
  Move all of those files, except the first, to the dups directory.
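
As a rough illustration of the dictionary described above, the sketch below groups in-memory arrays rather than files. The names `grid_hash` and `group_by_key`, the 4 x 4 grid, and the rounding to an integer are assumptions made for the example, not the notebook's actual hash.

```python
from collections import defaultdict

import numpy as np


def grid_hash(pixels, grid=4):
    """Hypothetical hash: mean of the per-pixel RGB averages on a small grid."""
    gray = pixels.mean(axis=2)  # average the R, G, B values of each pixel
    rows = np.linspace(0, gray.shape[0] - 1, grid).astype(int)
    cols = np.linspace(0, gray.shape[1] - 1, grid).astype(int)
    samples = gray[np.ix_(rows, cols)]  # values at the grid points
    return int(round(float(samples.mean())))


def group_by_key(images):
    """Map (X, Y, image_hash_value) -> list of names for a dict name -> array."""
    groups = defaultdict(list)
    for name, pixels in images.items():
        height, width = pixels.shape[:2]
        groups[(width, height, grid_hash(pixels))].append(name)
    return dict(groups)


# Synthetic stand-ins for decoded photos: two identical arrays and one black one.
rng = np.random.default_rng(0)
photo = rng.integers(0, 256, size=(6, 8, 3), dtype=np.uint8)
images = {
    "a.jpg": photo,
    "b.jpg": photo.copy(),
    "c.jpg": np.zeros((6, 8, 3), dtype=np.uint8),
}

groups = group_by_key(images)
dups = {key: files for key, files in groups.items() if len(files) > 1}
print(dups)
```

Only the key shared by `a.jpg` and `b.jpg` ends up in `dups`; the black image has its own key.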

  


### Setup

In [None]:
import os
import glob
import filecmp
import shutil
from datetime import datetime, timedelta
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt

In [None]:
## Source is directory where the pics are
## Duplicate directory is where the duplicates will be moved.
SOURCE_DIR = <Set-this>
DUP_DIR    = <Set-this>

### Functions

### Color Signature
Problem: Detect duplicate images, where the images can differ in resolution and X/Y aspect ratio.

Proposed Solution:
- Extract the 3-D array of RGB color values.
- Pick certain directions (2-D normal vectors) and compute the average of the R, G, B array values
  in that direction.
- Use this set of tuples as a hash for a dictionary.
  Append to a list all picture file names that have the same hash.
- Find all key/value pairs where the value is a list of length > 1.
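
To make the directional averaging concrete, here is a minimal sketch on a tiny synthetic array. It assumes that "in that direction" means the pixels on the positive side of a line through the image center whose normal is (cos angle, sin angle). The helper name `half_plane_means` is hypothetical, and the sketch leaves out any size adjustment or bucketing of the averages.

```python
import numpy as np


def half_plane_means(A, angle):
    """Mean of each color channel over the half of the image the normal points to."""
    M, N, _ = A.shape
    x = np.arange(M).reshape(M, 1) - M / 2.0  # row offsets from the center
    y = np.arange(N).reshape(1, N) - N / 2.0  # column offsets from the center
    mask = np.cos(angle) * x + np.sin(angle) * y > 0  # (M, N) boolean half-plane
    return A[mask].mean(axis=0)  # one mean per channel


# 4 x 4 test image: the bottom two rows are red, the right two columns are blue.
A = np.zeros((4, 4, 3))
A[2:, :, 0] = 200.0
A[:, 2:, 2] = 100.0

# With angle 0 the normal is (1, 0), so only the last row has a positive offset.
means0 = half_plane_means(A, 0.0)
print(means0)
```

Repeating this for several angles and collecting the (R, G, B) tuples gives the kind of signature that can serve as a dictionary key.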

In [None]:
## Color
def array_sig_col(A, normals):
  """
  Return a color signature for the image array A of shape (M, N, D).

  For each 2-D normal vector, average each color channel over the pixels
  on the positive side of the line through the image center, then bucket
  the averages. Assumes D >= 3 (e.g. RGB).
  """
  M, N, D = A.shape

  ## Pixel coordinates relative to the image center.
  x_off = np.arange(M).reshape(M, 1) - M / 2.0
  y_off = np.arange(N).reshape(1, N) - N / 2.0

  ## Size-dependent offset applied to the channel averages
  ## (image area relative to a 1066 x 1600 reference).
  adj_no = np.sqrt((M * N) / (1066 * 1600))
  adj = 0.0
  if adj_no > 0.85:
    adj = -0.825
  elif adj_no < 0.85:
    adj = 0.30

  sigs = []
  for normal in normals:
    ## Half-plane mask; it does not depend on the channel.
    mask = normal[0] * x_off + normal[1] * y_off > 0
    s = []
    for k in range(D):
      avg = np.round(np.average(A[:, :, k][mask]) + adj, 1)
      s.append(int(avg / 17.5))
    sigs.append((s[0], s[1], s[2]))

  return tuple(sigs)


In [None]:
## Color
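def pic_sig_col(jpg_file, normals, source=SOURCE_DIR):
  """
  Return the color signature of the JPEG file jpg_file in directory source.
  normals is a list of 2-D vectors passed through to array_sig_col.
  """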
  os.chdir(source)
  
  ## Extract image from jpg file.
  img = Image.open(jpg_file)
  pary = np.asarray(img)

  return(array_sig_col(pary, normals))

In [None]:
angles = np.array([0.1, 0.55, 1.0, 1.4, 1.9, 2.5, 4.25, 4.8, 5.3, 5.8])
xs = np.cos(angles)
ys = np.sin(angles)
plt.scatter(xs, ys)

In [None]:
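def find_dup_jpegs_col(jpeg_dir, angles, verbose=True):
  """
  Group the JPEG files in jpeg_dir by color signature.
  Returns a dictionary mapping each signature to the list of file names
  that share it; angles gives the directions (in radians) of the normals.
  """
  normals = [(np.cos(angle), np.sin(angle)) for angle in angles]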
  
  ## Go to the picture directory.
  os.chdir(jpeg_dir)

  ## Get all of the JPEG files.
  print("Gathering the JPEG files...")
  jpgFiles = glob.glob('*.jpg')

  ## Number of files to process
  n = len(jpgFiles)
  
  ## Process the files and fill the hash, HSH.
  print(f"Processing the {n} files and filling the hash...")
  HSH = {}
  progress = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
  if verbose:
    print(f"time = {datetime.now()}")
  for i in range(n):
    x = int(100 * i / n)
    if x in progress:
      progress.pop(0)
      if verbose:
        print(f"{x}% of files have been processed (time = {datetime.now()})...")
      else:
        print(f"{x}% of files have been processed...")
    hsh = pic_sig_col(jpgFiles[i], normals, source=jpeg_dir)
    if hsh in HSH:
      HSH[hsh].append(jpgFiles[i])
    else:
      HSH[hsh] = [jpgFiles[i]]
  print("Finished processing files.")
  return(HSH)

In [None]:
angles = np.array([0.1, 0.55, 1.0, 1.4, 1.9, 2.5, 4.25, 4.8, 5.3, 5.8])
t1 = datetime.now()
print(f"t1 = {t1}")
HSH = find_dup_jpegs_col(SOURCE_DIR, angles)
t2 = datetime.now()
print(f"t2 = {t2}")
print(f"Time to process = {t2 - t1}")

In [None]:
## Rank the signatures by group size, largest groups first.
## Note: this does not reorder the file names inside each list; that reordering
## (more intuitive names first, later files treated as duplicates) is a separate step.
lst = [(key, len(value)) for key, value in HSH.items()]
lst.sort(reverse=True, key=lambda x: x[1])

## Potential duplicates are entries in the dictionary whose list has length > 1.
pot_dups = [(key, value) for key, value in HSH.items() if len(value) > 1]

In [None]:
len(pot_dups)

In [None]:
pot_dups
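
The cells above stop at listing the potential duplicates. One possible way to carry out the move described at the top of the notebook is sketched below. `move_duplicates` is a hypothetical helper, the letters-first ordering is an assumption based on the introduction, and the demo runs on throwaway directories with placeholder files rather than on real photos.

```python
import os
import shutil
import tempfile


def move_duplicates(groups, source_dir, dup_dir):
    """Move every file except the first of each group from source_dir to dup_dir."""
    os.makedirs(dup_dir, exist_ok=True)
    moved = []
    for files in groups:
        # Names starting with a letter sort first; the first file is kept in place.
        ordered = sorted(files, key=lambda n: (not n[:1].isalpha(), n.lower()))
        for name in ordered[1:]:
            shutil.move(os.path.join(source_dir, name), os.path.join(dup_dir, name))
            moved.append(name)
    return moved


# Demo on temporary directories with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "pics")
    dst = os.path.join(tmp, "dups")
    os.makedirs(src)
    for name in ["beach.jpg", "001.jpg", "solo.jpg"]:
        with open(os.path.join(src, name), "wb") as fh:
            fh.write(b"placeholder")

    moved = move_duplicates([["001.jpg", "beach.jpg"]], src, dst)
    remaining = sorted(os.listdir(src))
    in_dups = sorted(os.listdir(dst))

print(moved, remaining, in_dups)
```

With the notebook's own variables, the call would look like `move_duplicates([files for _, files in pot_dups], SOURCE_DIR, DUP_DIR)`. It is worth inspecting `pot_dups` first, since files with equal signatures are only potential duplicates.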