# PST File Inspector for Databricks

This notebook allows you to inspect PST files to see:
1. Message counts and statistics
2. Attachment information (names, sizes)
3. Email metadata (subject, sender, dates)

**Key Features:**
- 🔍 Preview PST files without extracting data
- 📊 Get detailed statistics about messages and attachments
- ⚡ Lightweight inspection: reads message and attachment metadata only, never attachment content
- 📋 Generate reports of PST file contents


In [None]:
# Install the pypff bindings for PST parsing (published on PyPI as libpff-python)
%pip install libpff-python --quiet


In [None]:
import os
from datetime import datetime
import pypff


In [None]:
# Configuration
PST_FILE_PATH = "/Volumes/catalog/schema/pst_files/sample.pst"  # Update with your PST file path
MAX_MESSAGES_TO_SHOW = 20  # Maximum number of messages with attachments to display


## 1. PST File Inspection Function


In [None]:
def inspect_pst_file(pst_file_path, max_messages_to_show=20, show_all_messages=False):
    """
    Inspect a PST file and display comprehensive information.
    Does NOT extract anything - just reads and displays metadata.
    
    Args:
        pst_file_path: Path to PST file to inspect
        max_messages_to_show: Maximum number of messages with attachments to display
        show_all_messages: If True, shows all messages (not just those with attachments)
        
    Returns:
        Dictionary with inspection results and statistics
    """
    print("=" * 80)
    print("INSPECTING PST FILE")
    print("=" * 80)
    print(f"File: {pst_file_path}")
    
    try:
        # Check if file exists
        if not os.path.exists(pst_file_path):
            print(f"\n❌ File not found: {pst_file_path}")
            return None
        
        file_size = os.path.getsize(pst_file_path)
        print(f"Size: {file_size / (1024**2):.2f} MB")
        print("=" * 80)
        
        # Open PST file
        pst = pypff.file()
        pst.open(pst_file_path)
        root = pst.get_root_folder()
        
        if not root:
            print("\n❌ No root folder found in PST file")
            pst.close()
            return None
        
        # Statistics
        total_messages = 0
        messages_with_attachments = 0
        total_attachments = 0
        total_folders = 0
        message_details = []
        attachment_details = []
        
        # Recursive function to traverse folders
        def inspect_folder(folder, folder_path=""):
            nonlocal total_messages, messages_with_attachments, total_attachments, total_folders
            
            folder_name = folder.name if folder.name else "Unknown"
            current_path = f"{folder_path}/{folder_name}" if folder_path else folder_name
            total_folders += 1
            
            # Process messages in current folder
            if folder.number_of_sub_messages > 0:
                for message in folder.sub_messages:
                    total_messages += 1
                    
                    # Get message details
                    subject = message.subject if message.subject else "(No Subject)"
                    sender = message.sender_name if message.sender_name else "(Unknown Sender)"
                    # Not every pypff build exposes sender_email_address, so read it defensively
                    sender_email = getattr(message, "sender_email_address", None) or ""
                    
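                    # pypff already returns delivery_time as a datetime (or None),
                    # so no conversion with datetime.fromtimestamp() is needed
                    delivery_time = None
                    try:
                        delivery_time = message.delivery_time
                    except Exception:
                        # Skip unreadable timestamps instead of aborting the scan
                        pass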
                    
                    has_attachments = message.number_of_attachments > 0
                    
                    # Get attachment details if present
                    attachments = []
                    if has_attachments:
                        messages_with_attachments += 1
                        
                        for idx, attachment in enumerate(message.attachments):
                            # Read name and size defensively: some pypff builds may not expose them
                            att_name = getattr(attachment, "name", None) or f"attachment_{idx}"
                            att_size = getattr(attachment, "size", None) or 0
                            attachments.append({
                                "name": att_name,
                                "size": att_size,
                                "size_kb": att_size / 1024
                            })
                            total_attachments += 1
                        
                        attachment_details.append({
                            "folder": current_path,
                            "subject": subject,
                            "sender": sender,
                            "sender_email": sender_email,
                            "delivery_time": delivery_time,
                            "attachment_count": len(attachments),
                            "attachments": attachments
                        })
                    
                    # Store message details if showing all messages or if it has attachments
                    if show_all_messages or has_attachments:
                        message_details.append({
                            "folder": current_path,
                            "subject": subject,
                            "sender": sender,
                            "sender_email": sender_email,
                            "delivery_time": delivery_time,
                            "has_attachments": has_attachments,
                            "attachment_count": len(attachments) if has_attachments else 0,
                            "attachments": attachments if has_attachments else []
                        })
            
            # Recursively process subfolders
            if folder.number_of_sub_folders > 0:
                for sub_folder in folder.sub_folders:
                    inspect_folder(sub_folder, current_path)
        
        # Inspect all folders; close the file even if traversal raises
        print("\nScanning folders and messages...")
        try:
            inspect_folder(root)
        finally:
            pst.close()
        
        # Display results
        print("\n" + "=" * 80)
        print("FILE STATISTICS")
        print("=" * 80)
        print(f"📁 Total folders: {total_folders:,}")
        print(f"📧 Total messages: {total_messages:,}")
        print(f"📎 Messages with attachments: {messages_with_attachments:,} ({messages_with_attachments/max(total_messages, 1)*100:.1f}%)")
        print(f"📂 Total attachments: {total_attachments:,}")
        if messages_with_attachments > 0:
            print(f"📊 Average attachments per message (with attachments): {total_attachments/messages_with_attachments:.1f}")
        
        # Display message details
        if message_details:
            display_list = attachment_details if not show_all_messages else message_details
            list_title = "MESSAGES WITH ATTACHMENTS" if not show_all_messages else "ALL MESSAGES"
            
            print("\n" + "=" * 80)
            print(f"{list_title} (showing first {min(max_messages_to_show, len(display_list))})")
            print("=" * 80)
            
            for idx, msg in enumerate(display_list[:max_messages_to_show], 1):
                print(f"\n📧 Message {idx}:")
                print(f"   Folder: {msg['folder']}")
                print(f"   Subject: {msg['subject'][:70]}")
                print(f"   Sender: {msg['sender']}")
                if msg.get('sender_email'):
                    print(f"   Email: {msg['sender_email']}")
                if msg.get('delivery_time'):
                    print(f"   Date: {msg['delivery_time']}")
                
                if msg.get('has_attachments') or msg.get('attachment_count', 0) > 0:
                    print(f"   Attachments ({msg['attachment_count']}):")
                    for att in msg['attachments']:
                        print(f"      📎 {att['name']} ({att['size_kb']:.2f} KB)")
                else:
                    print("   Attachments: None")
            
            if len(display_list) > max_messages_to_show:
                print(f"\n   ... and {len(display_list) - max_messages_to_show} more messages")
        
        if not attachment_details and not show_all_messages:
            print("\n⚠️  No attachments found in this PST file")
        
        print("\n" + "=" * 80)
        print("✅ Inspection complete!")
        print("=" * 80)
        
        return {
            "file_path": pst_file_path,
            "file_size_mb": file_size / (1024**2),
            "total_folders": total_folders,
            "total_messages": total_messages,
            "messages_with_attachments": messages_with_attachments,
            "total_attachments": total_attachments,
            "message_details": message_details,
            "attachment_details": attachment_details
        }
        
    except Exception as e:
        print(f"\n❌ Error inspecting PST file: {str(e)}")
        import traceback
        traceback.print_exc()
        return None


## 2. Inspect a Single PST File


In [None]:
# Inspect the configured PST file
# This shows messages with attachments by default

results = inspect_pst_file(
    pst_file_path=PST_FILE_PATH,
    max_messages_to_show=MAX_MESSAGES_TO_SHOW,
    show_all_messages=False  # Set to True to see ALL messages, not just those with attachments
)

# Results are also returned as a dictionary for further processing
if results:
    print(f"\n📊 Summary:")
    print(f"   Total messages: {results['total_messages']:,}")
    print(f"   Messages with attachments: {results['messages_with_attachments']:,}")
    print(f"   Total attachments: {results['total_attachments']:,}")


## 3. Inspect Multiple PST Files


In [None]:
def find_pst_files(root_path):
    """
    Recursively search for PST files in the given path.
    
    Args:
        root_path: Root directory to search
        
    Returns:
        List of tuples: (file_path, file_size_bytes)
    """
    pst_files = []
    
    print(f"Searching for PST files in: {root_path}")
    
    for root, dirs, files in os.walk(root_path):
        for file in files:
            if file.lower().endswith('.pst'):
                file_path = os.path.join(root, file)
                try:
                    file_size = os.path.getsize(file_path)
                    pst_files.append((file_path, file_size))
                    print(f"  Found: {file_path} ({file_size / (1024**2):.2f} MB)")
                except Exception as e:
                    print(f"  Error accessing {file_path}: {str(e)}")
    
    # Sort by path so that a max_files cutoff selects the same files on every run
    pst_files.sort()
    print(f"\nTotal PST files found: {len(pst_files)}")
    return pst_files


def inspect_multiple_pst_files(pst_directory, max_files=10):
    """
    Inspect multiple PST files and generate a summary report.
    
    Args:
        pst_directory: Directory containing PST files
        max_files: Maximum number of files to inspect
        
    Returns:
        List of inspection results
    """
    print("=" * 80)
    print("INSPECTING MULTIPLE PST FILES")
    print("=" * 80)
    
    # Find PST files
    pst_files = find_pst_files(pst_directory)
    
    if not pst_files:
        print("\n❌ No PST files found")
        return []
    
    # Limit to max_files
    files_to_inspect = pst_files[:max_files]
    if len(pst_files) > max_files:
        print(f"\n⚠️  Limiting inspection to first {max_files} files (found {len(pst_files)} total)")
    
    # Inspect each file
    all_results = []
    for idx, (file_path, file_size) in enumerate(files_to_inspect, 1):
        print(f"\n{'='*80}")
        print(f"[{idx}/{len(files_to_inspect)}] Inspecting: {file_path}")
        print(f"{'='*80}")
        
        result = inspect_pst_file(file_path, max_messages_to_show=5, show_all_messages=False)
        if result:
            all_results.append(result)
    
    # Generate summary report
    if all_results:
        print("\n" + "=" * 80)
        print("OVERALL SUMMARY REPORT")
        print("=" * 80)
        
        total_files = len(all_results)
        total_messages = sum(r['total_messages'] for r in all_results)
        total_with_attachments = sum(r['messages_with_attachments'] for r in all_results)
        total_attachments = sum(r['total_attachments'] for r in all_results)
        total_size_mb = sum(r['file_size_mb'] for r in all_results)
        
        print(f"📁 Files inspected: {total_files}")
        print(f"💾 Total size: {total_size_mb:.2f} MB")
        print(f"📧 Total messages: {total_messages:,}")
        print(f"📎 Messages with attachments: {total_with_attachments:,} ({total_with_attachments/max(total_messages, 1)*100:.1f}%)")
        print(f"📂 Total attachments: {total_attachments:,}")
        
        print(f"\n{'='*80}")
        print("FILE BREAKDOWN")
        print(f"{'='*80}")
        for r in all_results:
            print(f"\n📄 {os.path.basename(r['file_path'])}")
            print(f"   Size: {r['file_size_mb']:.2f} MB")
            print(f"   Messages: {r['total_messages']:,} | With attachments: {r['messages_with_attachments']:,} | Attachments: {r['total_attachments']:,}")
        
        print("\n" + "=" * 80)
    
    return all_results


In [None]:
# Example: Inspect multiple PST files in a directory
# Uncomment and update the path to inspect multiple files

# PST_DIRECTORY = "/Volumes/catalog/schema/pst_files"
# 
# batch_results = inspect_multiple_pst_files(
#     pst_directory=PST_DIRECTORY,
#     max_files=10  # Limit inspection to first 10 files
# )


## 4. Export Inspection Results to DataFrame (Optional)


In [None]:
# Convert inspection results to a Spark DataFrame for analysis
# Uncomment to use

# if results:
#     # Create summary data
#     summary_data = [{
#         "file_path": results['file_path'],
#         "file_name": os.path.basename(results['file_path']),
#         "file_size_mb": results['file_size_mb'],
#         "total_folders": results['total_folders'],
#         "total_messages": results['total_messages'],
#         "messages_with_attachments": results['messages_with_attachments'],
#         "total_attachments": results['total_attachments'],
#         "inspection_timestamp": datetime.now()
#     }]
#     
#     df_summary = spark.createDataFrame(summary_data)
#     display(df_summary)
#     
#     # Create detailed attachment data
#     if results['attachment_details']:
#         attachment_rows = []
#         for msg in results['attachment_details']:
#             for att in msg['attachments']:
#                 attachment_rows.append({
#                     "file_path": results['file_path'],
#                     "folder": msg['folder'],
#                     "subject": msg['subject'],
#                     "sender": msg['sender'],
#                     "delivery_time": msg['delivery_time'],
#                     "attachment_name": att['name'],
#                     "attachment_size_kb": att['size_kb']
#                 })
#         
#         df_attachments = spark.createDataFrame(attachment_rows)
#         display(df_attachments)
#         
#         print(f"\n✅ Created DataFrames:")
#         print(f"   - Summary: {len(summary_data)} row(s)")
#         print(f"   - Attachments: {len(attachment_rows)} row(s)")


## 5. Usage Tips


### How to Use This Notebook

**1. Inspect a Single PST File:**
```python
PST_FILE_PATH = "/path/to/file.pst"
results = inspect_pst_file(PST_FILE_PATH, max_messages_to_show=20)
```

**2. Show All Messages (not just those with attachments):**
```python
results = inspect_pst_file(PST_FILE_PATH, show_all_messages=True)
```

**3. Inspect Multiple Files:**
```python
batch_results = inspect_multiple_pst_files("/path/to/pst/directory", max_files=10)
```

**4. Export to DataFrame for Analysis:**
- Uncomment the code cell in Section 4 to convert results to Spark DataFrames
- Use for further analysis, filtering, or saving to Delta tables
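
As a rough sketch of the flattening step, the helper below turns the `attachment_details` list returned by `inspect_pst_file` into one flat row per attachment, which is a shape `spark.createDataFrame` can accept. The name `flatten_attachment_rows` is hypothetical, and the sample dictionary is made up for illustration; it only mirrors the keys the function returns.

```python
def flatten_attachment_rows(results):
    """Return one flat dict per attachment from an inspect_pst_file() result."""
    rows = []
    for msg in results.get("attachment_details", []):
        for att in msg["attachments"]:
            rows.append({
                "file_path": results["file_path"],
                "folder": msg["folder"],
                "subject": msg["subject"],
                "sender": msg["sender"],
                "attachment_name": att["name"],
                "attachment_size_kb": att["size_kb"],
            })
    return rows


# Made-up sample using the same keys that inspect_pst_file() returns
sample_results = {
    "file_path": "/path/to/file.pst",
    "attachment_details": [{
        "folder": "Inbox",
        "subject": "Quarterly report",
        "sender": "Alice",
        "attachments": [
            {"name": "report.pdf", "size": 2048, "size_kb": 2.0},
            {"name": "data.csv", "size": 512, "size_kb": 0.5},
        ],
    }],
}

rows = flatten_attachment_rows(sample_results)
print(f"{len(rows)} attachment row(s)")
# In a Databricks notebook: df_attachments = spark.createDataFrame(rows)
```

Keeping the flattening in plain Python means it can be checked without a Spark session; only the final `createDataFrame` call depends on the notebook environment.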

### Common Use Cases

- **Pre-Extraction Validation**: Check if PST files contain attachments before running extraction
- **Inventory Management**: Generate reports of PST file contents
- **Testing**: Verify specific files have expected content
- **Troubleshooting**: Identify which files have issues or no attachments
- **Documentation**: Create an index of all PST files and their contents
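
For the pre-extraction check, one possible approach is to partition the list returned by `inspect_multiple_pst_files` by its `total_attachments` count. The helper name `split_by_attachments` and the sample data below are hypothetical, used only to illustrate the idea.

```python
def split_by_attachments(batch_results):
    """Partition inspection results into files with and without attachments."""
    with_attachments = []
    without_attachments = []
    for result in batch_results:
        if result["total_attachments"] > 0:
            with_attachments.append(result["file_path"])
        else:
            without_attachments.append(result["file_path"])
    return with_attachments, without_attachments


# Made-up sample with the keys that each inspection result contains
sample_batch = [
    {"file_path": "/path/to/a.pst", "total_attachments": 12},
    {"file_path": "/path/to/b.pst", "total_attachments": 0},
]

to_extract, to_skip = split_by_attachments(sample_batch)
print("Extract:", to_extract)
print("Skip:", to_skip)
```

Files that failed inspection are not in the list at all, because `inspect_multiple_pst_files` only keeps successful results, so they need to be tracked separately.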

### Tips

- 💡 Start with `max_messages_to_show=5` for quick previews; it limits only the printed list, and the whole file is still scanned
- 💡 Use `show_all_messages=True` to see messages without attachments
- 💡 Inspection reads metadata only and doesn't modify the PST file or extract attachment content; run time grows with the number of messages
- 💡 Results are returned as dictionaries for programmatic access
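
To illustrate programmatic access, here is a minimal sketch that ranks attachments by size using the `attachment_details` key of a result dictionary. The function `largest_attachments` and the sample data are hypothetical and not part of the notebook.

```python
def largest_attachments(results, top_n=3):
    """Return (size_kb, attachment_name, subject) tuples, largest first."""
    items = [
        (att["size_kb"], att["name"], msg["subject"])
        for msg in results["attachment_details"]
        for att in msg["attachments"]
    ]
    # Sort on size only, so ties keep their original order
    return sorted(items, key=lambda item: item[0], reverse=True)[:top_n]


# Made-up sample using the same keys that inspect_pst_file() returns
sample = {
    "attachment_details": [
        {"subject": "Budget", "attachments": [
            {"name": "budget.xlsx", "size_kb": 340.0},
            {"name": "notes.txt", "size_kb": 1.5},
        ]},
        {"subject": "Photos", "attachments": [
            {"name": "team.jpg", "size_kb": 2048.0},
        ]},
    ]
}

top = largest_attachments(sample, top_n=2)
for size_kb, name, subject in top:
    print(f"{name} ({size_kb:.1f} KB) in '{subject}'")
```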
