rhinot/gmailDedupe

Gmail Deduplication Tool

License: CC BY-NC 4.0 | Python 3.8+ | Code style: simple

Simple IMAP-based tool to find duplicate emails between two Gmail accounts and label them.

Note: Works with Advanced Protection accounts! Uses app-specific passwords and IMAP.

Features

✅ Works with Advanced Protection accounts
✅ Fast - only fetches email headers (Message-IDs)
✅ Safe - dry-run mode by default
✅ Simple - single Python file, no external dependencies
✅ Direct labeling via IMAP

How It Works

  1. Connects to Account A via IMAP
  2. Fetches all Message-IDs (unique email identifiers) - headers only, not full emails
  3. Connects to Account B via IMAP
  4. Searches for emails with matching Message-IDs
  5. Applies "duplicate" label to matches in Account B
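
Steps 2 and 4 both depend on reading the Message-ID out of a raw header block. A minimal sketch of that step using the standard library's email package; `extract_message_id` is a hypothetical helper name and may not match what dedupe.py actually does:

```python
import email


def extract_message_id(raw_headers: bytes):
    """Return the Message-ID from a raw header block, or None if absent."""
    msg = email.message_from_bytes(raw_headers)
    value = msg.get("Message-ID")  # header name lookup is case-insensitive
    return value.strip() if value else None


# Illustrative header block, shaped like an RFC822.HEADER fetch result
sample = (
    b"From: noreply@example.com\r\n"
    b"Message-ID: <abc123@mail.example.com>\r\n"
    b"Subject: Your receipt from Example Store\r\n"
    b"\r\n"
)
print(extract_message_id(sample))
```

Emails without a Message-ID header yield None and cannot be matched by this approach.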

Setup

1. Generate App Passwords

For both Gmail accounts:

  1. Go to https://myaccount.google.com/security
  2. Enable 2-Step Verification (if not already enabled)
  3. Scroll to "App passwords" section
  4. Generate password for "Mail" app
  5. Copy the 16-character password, leaving out the spaces Google shows between groups

Note: App passwords work even with Advanced Protection enabled!

2. Configure Credentials

cd gmailDedupe
cp .env.example .env

Edit .env and add your credentials:

ACCOUNT_A_EMAIL=accounta@gmail.com
ACCOUNT_A_PASSWORD=abcdefghijklmnop

ACCOUNT_B_EMAIL=accountb@gmail.com
ACCOUNT_B_PASSWORD=qrstuvwxyzabcdef

Important: Never commit .env to git (it's in .gitignore)
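
Since the tool has no external dependencies, the .env file has to be read with the standard library alone. One way that could look, as a sketch; `parse_env` and `load_env` are hypothetical names, and the real loader in dedupe.py may differ:

```python
def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env(path: str = ".env") -> dict:
    """Read and parse a .env file from disk."""
    with open(path, encoding="utf-8") as fh:
        return parse_env(fh.read())


example = (
    "# credentials for the source account\n"
    "ACCOUNT_A_EMAIL=accounta@gmail.com\n"
    "ACCOUNT_A_PASSWORD=abcd efgh ijkl mnop\n"
)
config = parse_env(example)
# Tolerate an app password that was pasted with its display spaces
password = config["ACCOUNT_A_PASSWORD"].replace(" ", "")
```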

3. Enable IMAP (if needed)

In Gmail settings for both accounts:

  • Settings → Forwarding and POP/IMAP
  • Enable IMAP access

Usage

Dry Run (Test Mode)

First, run in dry-run mode to see what would be labeled:

python dedupe.py

Output:

============================================================
Gmail Deduplication via IMAP
============================================================

Mode: 🔍 DRY RUN
Max emails: All
Label: 'duplicate'

Step 1: Connect to Account A
  Connecting to accounta@gmail.com...
  ✅ Connected successfully

Step 2: Fetch Message-IDs from Account A
  Found 2,847 emails
  Fetching Message-IDs (headers only)...
  ✅ Collected 2,847 unique Message-IDs

Step 3: Connect to Account B
  Connecting to accountb@gmail.com...
  ✅ Connected successfully

Step 4: Find and label duplicates in Account B
  Searching for 2,847 Message-IDs in Account B...
  ✅ Search complete: 234 duplicates found
  📄 Report saved to: duplicates_report.txt

============================================================
Summary
============================================================
Account A emails: 2,847
Duplicates found in Account B: 234

⚠️  DRY RUN MODE - No labels were applied

📄 Detailed report saved to: duplicates_report.txt

To apply labels:
  1. Review duplicates_report.txt (if generated)
  2. Edit dedupe.py
  3. Change: DRY_RUN = False
  4. Run: python dedupe.py

Dry Run Report File

The dry run generates duplicates_report.txt with details of each duplicate:

Duplicate #1
  Message-ID: <abc123@mail.example.com>
  Subject: Your receipt from Example Store
  From: noreply@example.com
  Date: Mon, 15 Jan 2024 10:23:45 -0800
  UID: 12345

Duplicate #2
  Message-ID: <xyz789@newsletter.example.com>
  Subject: Weekly Newsletter - Jan 2024
  From: newsletter@example.com
  Date: Tue, 16 Jan 2024 08:00:00 -0800
  UID: 12346
...
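
When the report is long, reviewing it by sender can be faster than reading it top to bottom. A small sketch that parses the layout shown above into dictionaries; `parse_report` is a hypothetical helper, and it assumes the file follows exactly this "Duplicate #N" plus indented "Field: value" shape:

```python
from collections import Counter


def parse_report(text: str) -> list:
    """Turn the report text into a list of {field: value} dicts."""
    entries, current = [], None
    for line in text.splitlines():
        if line.startswith("Duplicate #"):
            current = {}
            entries.append(current)
        elif current is not None and ":" in line:
            # Split on the first colon only, so dates keep their time part
            key, _, value = line.strip().partition(":")
            current[key] = value.strip()
    return entries


report_text = """Duplicate #1
  Message-ID: <abc123@mail.example.com>
  Subject: Your receipt from Example Store
  From: noreply@example.com
  Date: Mon, 15 Jan 2024 10:23:45 -0800
  UID: 12345

Duplicate #2
  Message-ID: <xyz789@newsletter.example.com>
  Subject: Weekly Newsletter - Jan 2024
  From: newsletter@example.com
  Date: Tue, 16 Jan 2024 08:00:00 -0800
  UID: 12346
"""
entries = parse_report(report_text)
by_sender = Counter(entry["From"] for entry in entries)
```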

Live Run (Apply Labels)

After verifying the dry run results:

  1. Edit dedupe.py
  2. Change DRY_RUN = True to DRY_RUN = False
  3. Run again:
python dedupe.py

Configuration Options

Edit these variables at the top of dedupe.py:

DRY_RUN = True          # Set to False to apply labels
MAX_EMAILS = None       # Limit emails processed (None = all)
DUPLICATE_LABEL = 'duplicate'  # Label name to apply

Advanced Usage

Process Only Recent Emails

Modify the get_message_ids() function to add a date filter:

# In get_message_ids() function, change search criteria:
_, message_numbers = imap.search(None, 'SINCE', '01-Jan-2024')
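
IMAP expects the SINCE date as DD-Mon-YYYY with English month abbreviations, so building it by hand avoids locale surprises from strftime. A sketch of a date helper; `imap_date` and `since_days_ago` are hypothetical names, not part of dedupe.py:

```python
import datetime

_MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def imap_date(d: datetime.date) -> str:
    """Format a date the way IMAP SEARCH date criteria expect it."""
    return f"{d.day:02d}-{_MONTHS[d.month - 1]}-{d.year}"


def since_days_ago(days: int, today=None) -> str:
    """IMAP date string for `days` days before today."""
    today = today or datetime.date.today()
    return imap_date(today - datetime.timedelta(days=days))


# Possible use inside get_message_ids():
# _, message_numbers = imap.search(None, 'SINCE', since_days_ago(90))
```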

Limit Processing

Set MAX_EMAILS to process only a subset:

MAX_EMAILS = 1000  # Process first 1000 emails
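
How the limit is applied is not spelled out here; a plausible sketch is to slice the list of message numbers returned by the IMAP search. `limit_message_numbers` is a hypothetical helper, and the input shape assumes imaplib's usual search result of a single space-separated bytes string:

```python
def limit_message_numbers(search_data, max_emails=None):
    """Split an imaplib search result and keep at most max_emails entries."""
    numbers = search_data[0].split()
    if max_emails is None:
        return numbers
    return numbers[:max_emails]


# Shaped like the second element returned by imap.search(None, 'ALL')
sample_result = [b"1 2 3 4 5"]
subset = limit_message_numbers(sample_result, max_emails=2)
```

IMAP sequence numbers ascend in mailbox order, so a slice from the front keeps the lowest-numbered messages; slicing from the end (`numbers[-max_emails:]`) would keep the highest-numbered ones instead.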

Different Label Name

DUPLICATE_LABEL = 'my-custom-label'

Troubleshooting

"Authentication failed"

  • Make sure you're using app-specific passwords, not your regular Gmail password
  • Verify 2-Step Verification is enabled
  • Check that IMAP is enabled in Gmail settings
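
To see the server's actual rejection message rather than a bare traceback, the login call can be wrapped as below. This is a sketch only; `try_login` is a hypothetical helper, and the host and port in the usage comment are Gmail's standard IMAP-over-SSL endpoint:

```python
import imaplib


def try_login(conn, address: str, password: str):
    """Return (True, '') on success or (False, reason) on an IMAP error."""
    try:
        # Strip spaces in case the app password was pasted as displayed
        conn.login(address, password.replace(" ", ""))
        return True, ""
    except imaplib.IMAP4.error as exc:
        return False, str(exc)


# Usage (requires network access and valid credentials):
# conn = imaplib.IMAP4_SSL("imap.gmail.com", 993)
# ok, reason = try_login(conn, "accounta@gmail.com", "abcd efgh ijkl mnop")
# if not ok:
#     print("Login failed:", reason)
```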

"Connection timeout"

  • Check your internet connection
  • Try again (Gmail IMAP can be temporarily unavailable)
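
"Try again" can be automated with a small retry loop and exponential backoff. The sketch below is not part of dedupe.py; `with_retries` and its parameters are hypothetical, and it retries only on OSError, which covers socket timeouts and connection resets:

```python
import time


def with_retries(action, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call action(); on OSError wait and retry, doubling the delay each time."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))


# Usage (requires network access):
# import imaplib
# conn = with_retries(lambda: imaplib.IMAP4_SSL("imap.gmail.com", 993))
```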

"No duplicates found" (but you expect some)

  • Message-IDs must match exactly
  • If emails were forwarded, they may have new Message-IDs
  • Check that you're searching the right accounts
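
Because matching is exact, stray whitespace or a folded header line is enough to hide a real duplicate. A sketch of a normalizer that could be applied to both sides before comparing; `normalize_message_id` is a hypothetical helper, and it assumes a valid Message-ID never contains whitespace of its own:

```python
def normalize_message_id(value: str) -> str:
    """Remove all whitespace, including folded-header line breaks."""
    return "".join(value.split())


clean = "<abc123@mail.example.com>"
folded = "\r\n <abc123@mail.example.com>\r\n"

# The raw strings differ, but they match once normalized
raw_match = clean == folded
normalized_match = normalize_message_id(clean) == normalize_message_id(folded)
```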

Slow performance

  • Processing is linear (one Message-ID at a time for Account B)
  • For 10k+ emails, expect 30-60 minutes
  • Consider using MAX_EMAILS to process in batches
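
An alternative to one search per Message-ID, which is not what the tool does as described here, is to fetch Account B's Message-IDs in bulk as well and match locally with a set. A sketch of only the matching part; `find_duplicates` and the UID-to-Message-ID mapping are hypothetical:

```python
def find_duplicates(ids_a, ids_b_by_uid: dict) -> list:
    """Return the Account B UIDs whose Message-ID also appears in Account A."""
    wanted = set(ids_a)  # set membership is O(1) on average
    return sorted(uid for uid, mid in ids_b_by_uid.items() if mid in wanted)


# Illustrative data only
account_a_ids = ["<abc123@mail.example.com>", "<only-in-a@example.com>"]
account_b = {
    12345: "<abc123@mail.example.com>",
    12346: "<xyz789@newsletter.example.com>",
}
matches = find_duplicates(account_a_ids, account_b)
```

The trade-off is one extra header fetch over all of Account B in exchange for no per-message searches, which tends to pay off as Account A grows.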

Technical Details

  • Uses Python's built-in imaplib and email libraries
  • Fetches only headers (RFC822.HEADER), not full email bodies
  • Searches with the IMAP SEARCH command, using the HEADER Message-ID criterion
  • Labels via Gmail's IMAP extension (X-GM-LABELS)
  • No external dependencies required
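
The search and label steps map onto two imaplib calls. The sketch below shows one way to issue them with UID commands; `find_uids` and `add_label` are hypothetical names, the quoting is an assumption about what Gmail accepts, and a mailbox must already be selected on the connection (for example with `imap.select(...)`) before either call:

```python
def find_uids(imap, message_id: str) -> list:
    """UID SEARCH for messages whose Message-ID header contains message_id."""
    typ, data = imap.uid(
        "SEARCH", None, "HEADER", "Message-ID", f'"{message_id}"'
    )
    if typ != "OK" or not data or not data[0]:
        return []
    return data[0].split()


def add_label(imap, uid, label: str = "duplicate", dry_run: bool = True) -> bool:
    """Add a Gmail label via X-GM-LABELS; do nothing in dry-run mode."""
    if dry_run:
        return False
    typ, _ = imap.uid("STORE", uid, "+X-GM-LABELS", f'("{label}")')
    return typ == "OK"
```

Using `+X-GM-LABELS` adds the label without touching labels the message already has.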

Limitations

  • Only compares by Message-ID (standard email unique identifier)
  • No fuzzy matching or content comparison
  • Linear search (not optimized for 100k+ emails)
  • Account B must support IMAP labeling (Gmail-specific)

Security

  • Credentials stored locally in .env (not committed to git)
  • App passwords are separate from your main password and can be revoked individually
  • IMAP uses SSL/TLS encryption
  • The tool only reads from Account A and never modifies it
  • Only adds labels to Account B (no deletion)

License

This project is licensed under CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0).

You are free to:

  • Share and adapt the code for personal or non-commercial use

You must:

  • Give appropriate credit

You may NOT:

  • Use this code for commercial purposes

See LICENSE file for full terms.
