# 📬 Gmail Email Extraction Toolkit (Colab Version)

This notebook lets you extract and clean email addresses from your **Gmail Takeout (.mbox)** data — right inside Google Colab.

✅ Works on large files (even 10GB+)

✅ Automatically removes duplicates and system-generated emails

✅ Exports a clean list + CSV domain summary

## Step 1 — Mount Google Drive
This allows Colab to access your Gmail Takeout files stored in Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')
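
Once Drive is mounted, it can help to confirm where the Takeout file actually landed before setting any paths. The following is a minimal sketch, not part of the toolkit: `find_mbox_files` is a hypothetical helper that walks a folder and lists every `.mbox` file with its size, largest first.

```python
from pathlib import Path


def find_mbox_files(root):
    """Return (path, size_in_bytes) for every .mbox file under root, largest first."""
    root = Path(root)
    if not root.is_dir():
        raise FileNotFoundError(f"{root} is not a directory; is Drive mounted?")
    files = [(p, p.stat().st_size) for p in root.rglob("*.mbox") if p.is_file()]
    return sorted(files, key=lambda item: item[1], reverse=True)


# Example use in Colab after mounting (assumes the default mount point):
# for path, size in find_mbox_files("/content/drive/MyDrive"):
#     print(f"{size / 1e9:6.2f} GB  {path}")
```

Walking a large Drive can take a while, so pointing the helper at the folder that holds the Takeout export is usually faster than scanning all of `MyDrive`.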

## Step 2 — Install Dependencies
This installs the required Python libraries. Run it once per runtime; a new runtime needs it again.

In [None]:
!pip install tqdm pandas openpyxl

## Step 3 — Clone Toolkit Repository
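Clone the toolkit from GitHub, replacing `yourusername` in the URL with the account that hosts your copy. Alternatively, skip this cell and upload the toolkit files to Colab manually.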

In [None]:
!git clone https://github.com/yourusername/gmail-email-extraction-toolkit.git

## Step 4 — Set File Paths
Change the path below to match your `.mbox` file location in Drive.

In [None]:
INPUT_PATH = '/content/drive/MyDrive/your_folder/All mail Including Spam and Trash.mbox'
OUTPUT_PATH = '/content/drive/MyDrive/output/emails_extracted.txt'
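
A wrong path is the most common reason the later steps fail, so it may be worth checking both paths up front. The sketch below is an optional addition, not part of the toolkit: `prepare_paths` is a hypothetical helper that verifies the input file exists and creates the output folder if it is missing. It assumes the extraction script does not create that folder itself, which the notebook does not state.

```python
from pathlib import Path


def prepare_paths(input_path, output_path):
    """Check that the input file exists and create the output folder if needed."""
    src = Path(input_path)
    dst = Path(output_path)
    if not src.is_file():
        raise FileNotFoundError(f"MBOX file not found: {src}")
    # Create e.g. /content/drive/MyDrive/output/ so the later write cannot fail on it.
    dst.parent.mkdir(parents=True, exist_ok=True)
    return src, dst


# Example use with the variables defined in the cell above:
# prepare_paths(INPUT_PATH, OUTPUT_PATH)
```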

## Step 5 — Run Email Extraction
This processes your MBOX file and stores the results in both a text file and a SQLite file.

In [None]:
!python gmail-email-extraction-toolkit/scripts/extract_emails.py --input "$INPUT_PATH" --output "$OUTPUT_PATH"
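
The source of `extract_emails.py` is not shown in this notebook, so the following is only an illustrative sketch of how such an extraction could work, not the toolkit's actual code. It reads the file one line at a time, so memory use depends on the number of unique addresses rather than on the file size. It lowercases each match, skips duplicates, and drops addresses whose local part starts with a common system prefix. The regex, the `SYSTEM_PREFIXES` list, and all function names are assumptions made for this example.

```python
import re

# Simplified address pattern (an assumption; real-world address syntax is broader).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

# Hypothetical list of local-part prefixes treated as system-generated.
SYSTEM_PREFIXES = (
    "no-reply", "noreply", "donotreply", "do-not-reply",
    "mailer-daemon", "postmaster", "bounce",
)


def is_system_address(address):
    """Return True if the part before '@' starts with a known system prefix."""
    local_part = address.split("@", 1)[0]
    return local_part.startswith(SYSTEM_PREFIXES)


def extract_emails(lines):
    """Yield each unique, lowercased, non-system address in first-seen order."""
    seen = set()
    for line in lines:
        for match in EMAIL_RE.findall(line):
            address = match.lower()
            if address in seen or is_system_address(address):
                continue
            seen.add(address)
            yield address


def extract_from_file(input_path, output_path):
    """Stream input_path line by line and write one address per line; return the count."""
    count = 0
    with open(input_path, "r", encoding="utf-8", errors="replace") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        for address in extract_emails(src):
            dst.write(address + "\n")
            count += 1
    return count
```

Because this sketch scans every line, it also picks up strings that merely look like addresses, such as `Message-ID` values. Restricting the scan to address headers (`From`, `To`, `Cc`) would be one way to tighten it.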

## Step 6 — Convert to CSV
This converts the extracted `.txt` file into a `.csv` file with domain breakdown.

In [None]:
!python gmail-email-extraction-toolkit/scripts/make_csv.py
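
For readers who want to see what a domain breakdown involves, here is a minimal sketch of the conversion using only the standard library. It is not the toolkit's `make_csv.py`, whose source and arguments are not shown here. The function name `write_domain_csv` and the column names `email` and `domain` are assumptions for illustration.

```python
import csv


def write_domain_csv(txt_path, csv_path):
    """Write one row per address (email, domain), sorted by domain and then email."""
    with open(txt_path, encoding="utf-8") as src:
        # Skip blank lines and anything without an '@'; the set removes duplicates.
        addresses = {line.strip().lower() for line in src if "@" in line}
    rows = sorted(
        ((address, address.rsplit("@", 1)[1]) for address in addresses),
        key=lambda row: (row[1], row[0]),
    )
    with open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["email", "domain"])
        writer.writerows(rows)
    return len(rows)
```

Sorting by domain first groups all addresses from the same organization together, which makes the file easy to filter in a spreadsheet.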

## ✅ Done!
You’ll now find two files in your Drive:
- `emails_extracted.txt` — full clean email list
- `emails_extracted.csv` — same list, with domain separation for sorting

---
### 🔒 Privacy Note
All processing happens inside your own Google Colab session (a Google-hosted runtime, not your computer) and your Google Drive. No data leaves your Google account. Use this only for mailboxes you own or have permission to analyze.