# **Searching for duplicate files**

## **Author: [Dr. Rahul Remanan](https://www.linkedin.com/in/rahulremanan/)**
## **CEO, [Moad Computer](http://www.moad.computer/)**

An example implementation of duplicate file detection using Python. This could be used as the backbone for a de-duplicated file system.

# Import libraries

In [None]:
import os, hashlib
from glob import glob
from tqdm.notebook import tqdm

# Compute file hashes
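Each file is read in chunks of a specified size, and the chunks are fed to a hash function from the [hashlib Python library](https://docs.python.org/3/library/hashlib.html), such as SHA-256 or BLAKE2b. Reading in chunks keeps memory use bounded for large files and does not change the resulting digest.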

In [None]:
class FileHash():
  def __init__(self, 
               chunk_size:int=4096, 
               crypto:str='blake2b')->None:
    self.chunk_size = chunk_size
    self.crypto = crypto
  def file_hash(self, fname:str)->str:
    _hash_fn = getattr(hashlib, self.crypto)()
    with open(fname, 'rb') as f:
      for _chunk in iter(lambda: f.read(self.chunk_size), b''):
        _hash_fn.update(_chunk)
    return _hash_fn.hexdigest()
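
The chunked read above should give the same digest regardless of the chunk size chosen. The following is a minimal sketch of that property, assuming a standalone helper named `chunked_digest` (hypothetical, not part of the notebook) and a temporary file filled with random bytes purely for illustration.

```python
import hashlib
import os
import tempfile

def chunked_digest(path, chunk_size=4096, crypto='blake2b'):
    """Hash a file by reading it chunk_size bytes at a time (hypothetical helper)."""
    hash_fn = getattr(hashlib, crypto)()
    with open(path, 'rb') as f:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: f.read(chunk_size), b''):
            hash_fn.update(chunk)
    return hash_fn.hexdigest()

# Illustrative data only: 10,000 random bytes written to a temporary file
payload = os.urandom(10_000)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    tmp_path = tmp.name

try:
    small = chunked_digest(tmp_path, chunk_size=64)
    large = chunked_digest(tmp_path, chunk_size=4096)
    # Hash the same bytes in a single call for comparison
    one_shot = hashlib.blake2b(payload).hexdigest()
finally:
    os.remove(tmp_path)

print(small == large == one_shot)
```

The chunk size is therefore a tuning parameter for memory use and I/O behavior rather than something that affects which files are reported as duplicates.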

# Detect file duplicates
Builds a dictionary that maps each cryptographic hash to the list of files sharing that hash.

In [None]:
class FileDedup(FileHash):
  def __init__(self,
               crypto:str='blake2b', 
               chunk_size:int=2048):
    # Forward the settings to FileHash rather than overwriting its defaults afterwards
    super().__init__(chunk_size=chunk_size, crypto=crypto)
  def __call__(self,
               file_list:list)->dict:
    file_compare = {}
    for f in tqdm(file_list):
      # Hash each file once and append it to the list stored under that hash
      file_compare.setdefault(self.file_hash(f), []).append(f)
    return file_compare
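
The hash-to-file-list mapping can also be sketched without the class, using `collections.defaultdict`. This is an illustrative variant under stated assumptions, not the notebook's implementation: the function name `group_by_hash` and the file names are hypothetical, and each small test file is read in one call for brevity instead of in chunks.

```python
import hashlib
import os
import tempfile
from collections import defaultdict

def group_by_hash(paths):
    """Map each BLAKE2b digest to the list of paths with that content (hypothetical helper)."""
    groups = defaultdict(list)
    for path in paths:
        with open(path, 'rb') as f:
            # Whole-file read is a simplification that only suits small files
            digest = hashlib.blake2b(f.read()).hexdigest()
        groups[digest].append(path)
    return dict(groups)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Two files share identical content, the third differs
    contents = {'a.bin': b'same', 'b.bin': b'same', 'c.bin': b'different'}
    paths = []
    for name, data in contents.items():
        path = os.path.join(tmp_dir, name)
        with open(path, 'wb') as f:
            f.write(data)
        paths.append(path)
    groups = group_by_hash(paths)
    group_sizes = sorted(len(files) for files in groups.values())

print(group_sizes)  # [1, 2]
```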

In [None]:
dedup_dict = FileDedup(crypto='blake2b', chunk_size=4096)(
               glob(
                    '../input/uw-madison-gi-tract-image-segmentation/**/*.png', 
                    recursive=True
                   )
                 )

# Testing for duplicates in original dataset
Duplicates are found by iterating over the keys of the file comparison dictionary and reporting every value whose list holds more than one file.

In [None]:
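def find_duplicates(dedup_dict):
  num_dup = 0  
  # Iterating over the dictionary directly lets tqdm show the total number of hashes
  for k in tqdm(dedup_dict):
    if len(dedup_dict[k])>1:
      print('\n', dedup_dict[k], '\n ')
      num_dup += 1
  # Each counted hash corresponds to one group of identical files
  print(f'Number of hashes shared by more than one file: {num_dup}')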
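
If the duplicate groups are needed for further processing rather than only printed, the same filter can return a dictionary. The sketch below assumes a hypothetical helper `duplicate_groups` and a hand-written example mapping whose hash keys and file names are made up for illustration.

```python
def duplicate_groups(hash_to_files):
    """Keep only the hashes that are shared by more than one file (hypothetical helper)."""
    return {h: files for h, files in hash_to_files.items() if len(files) > 1}

# Made-up mapping in the same shape as the file comparison dictionary
example = {
    'hash_a': ['x.png', 'copy/x.png'],
    'hash_b': ['y.png'],
    'hash_c': ['z.png', 'z1.png', 'z2.png'],
}

dups = duplicate_groups(example)
# Files that could be removed while keeping one copy per group
redundant = sum(len(files) - 1 for files in dups.values())

print(len(dups), redundant)  # 2 3
```

Counting groups and counting redundant files answer different questions: the first says how many distinct contents are duplicated, the second how many files a de-duplicated store would save.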

In [None]:
find_duplicates(dedup_dict)

# Create some duplicate files

In [None]:
!cp -r ../input/uw-madison-gi-tract-image-segmentation/train/case101/ ./
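
Outside a notebook shell, a directory can be duplicated from Python with `shutil.copytree`. The following is a rough sketch that works on temporary directories only; the directory and file names are placeholders and do not refer to the dataset paths used in this notebook.

```python
import filecmp
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as work_dir:
    # Build a small placeholder source tree with one file in it
    src = os.path.join(work_dir, 'case_src')
    os.makedirs(os.path.join(src, 'scans'))
    src_file = os.path.join(src, 'scans', 'slice_0001.png')
    with open(src_file, 'wb') as f:
        f.write(b'placeholder bytes')

    # Recursively copy the tree; the destination must not exist yet
    dst = os.path.join(work_dir, 'case_copy')
    shutil.copytree(src, dst)

    # Compare the actual bytes of the original and the copy
    dst_file = os.path.join(dst, 'scans', 'slice_0001.png')
    identical = filecmp.cmp(src_file, dst_file, shallow=False)

print(identical)
```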

# Testing duplicate detection on the synthetic file list

In [None]:
file_list = glob(
              '../input/uw-madison-gi-tract-image-segmentation/**/*.png', 
              recursive=True
              )
print(len(file_list))
file_list.extend(glob('./case101/**/*.png', recursive=True))
print(len(file_list))
dedup_dict = FileDedup(crypto='blake2b', chunk_size=4096)(file_list)

In [None]:
find_duplicates(dedup_dict)