# Shopify Subdomain Takeover Scanner - Dataset-Based

**Simple workflow:**
1. Load CSV dataset
2. Filter by HTTP status (403, 404, 409, 500, 503)
3. Sort by Est Monthly Page Views (highest traffic first)
4. Export top N targets
5. Deep scan with validation tools
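
Steps 2 to 4 can be sketched on a toy table. The column names mirror those used later in this notebook (`Subdomain`, `HTTP_Status`, `Est Monthly Page Views`); the sample rows and the `select_targets` helper are invented for illustration and are not part of the project.

```python
import pandas as pd

STATUS_CODES = [403, 404, 409, 500, 503]

def select_targets(df, status_codes, limit):
    """Keep rows whose status is in status_codes, highest page views first."""
    matches = df[df["HTTP_Status"].isin(status_codes)]
    return matches.nlargest(limit, "Est Monthly Page Views")

# Invented sample rows (not taken from the real dataset).
sample = pd.DataFrame({
    "Subdomain": ["shop.a.example", "store.b.example", "buy.c.example", "www.d.example"],
    "HTTP_Status": [404, 200, 403, 503],
    "Est Monthly Page Views": [1200, 90000, 45000, 300],
})

top = select_targets(sample, STATUS_CODES, limit=2)
print(top["Subdomain"].tolist())
```

The row with status 200 is dropped even though it has the most traffic, because the status filter runs before the sort.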

**Total time:** 15-25 minutes for 100 domains (6-10 minutes of setup plus a 10-15 minute scan)

**Dataset:** `/kaggle/input/all-leads-merged/results.csv`

## Cell 1: Clone Project from GitHub

In [None]:
%%bash
WORKDIR="/kaggle/working"
PROJECT_DIR="$WORKDIR/subdomain-playground"

echo "=========================================="
echo "Cloning Project from GitHub"
echo "=========================================="

mkdir -p "$WORKDIR"
cd "$WORKDIR"

if [ -d "$PROJECT_DIR" ]; then
    echo "Removing existing copy..."
    rm -rf "$PROJECT_DIR"
fi

git clone --depth 1 https://github.com/sayihhamza/subdomain-playground.git subdomain-playground

if [ ! -d "$PROJECT_DIR" ]; then
    echo "✗ Clone failed!"
    exit 1
fi

cd "$PROJECT_DIR"

echo ""
echo "✓ Project cloned successfully!"
echo ""
echo "Key files:"
ls -lh scan.py hybrid_scan.py 2>/dev/null || echo "Warning: scan.py or hybrid_scan.py not found"

## Cell 2: Install Python Dependencies

In [None]:
%%bash
cd /kaggle/working/subdomain-playground

echo "Installing Python requirements..."
python3 -m pip install --quiet -r requirements.txt

echo ""
echo "✓ Dependencies installed!"

## Cell 3: Install Go 1.24

In [None]:
%%bash
cd /tmp

if command -v sudo >/dev/null 2>&1; then
    SUDO="sudo"
else
    SUDO=""
fi

echo "=========================================="
echo "Installing Go 1.24.1"
echo "=========================================="

$SUDO rm -rf /usr/local/go 2>/dev/null || true

echo "Downloading Go 1.24.1..."
wget -q https://go.dev/dl/go1.24.1.linux-amd64.tar.gz -O /tmp/go.tar.gz

$SUDO tar -C /usr/local -xzf /tmp/go.tar.gz
rm -f /tmp/go.tar.gz

echo ""
echo "✓ Go 1.24.1 installed!"
/usr/local/go/bin/go version

## Cell 4: Build Security Tools

**Building 5 tools (5-8 minutes):**
- subfinder: Passive enumeration
- findomain: Additional coverage (+5-10%)
- dnsx: DNS + CNAME validation
- httpx: HTTP validation
- subzy: Takeover detection
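
After the build, the split between required and optional tools can be checked from Python as well. This is a minimal sketch: the `missing_tools` helper is hypothetical, and the `bin` path is the one this notebook uses elsewhere.

```python
from pathlib import Path

REQUIRED = ["subfinder", "dnsx", "httpx", "subzy"]
OPTIONAL = ["findomain"]

def missing_tools(bin_dir, names):
    """Return the names that have no regular file under bin_dir."""
    bin_dir = Path(bin_dir)
    return [name for name in names if not (bin_dir / name).is_file()]

bin_dir = "/kaggle/working/subdomain-playground/bin"
print("missing required:", missing_tools(bin_dir, REQUIRED))
print("missing optional:", missing_tools(bin_dir, OPTIONAL))
```

An empty "missing required" list means the scan can proceed; a missing optional tool only reduces coverage.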

In [None]:
%%bash
export PATH=/usr/local/go/bin:$PATH
cd /kaggle/working/subdomain-playground

echo "=========================================="
echo "Building Security Tools (5-8 minutes)"
echo "=========================================="
echo ""

mkdir -p bin

build_tool() {
    local name=$1
    local repo=$2
    local index=$3
    local max_attempts=3
    
    echo "[$index/5] Building $name..."
    
    for attempt in $(seq 1 $max_attempts); do
        if [ $attempt -gt 1 ]; then
            echo "  Retry $attempt/$max_attempts..."
            sleep 2
        fi
        
        if GOBIN=$(pwd)/bin /usr/local/go/bin/go install -v ${repo}@latest 2>&1; then
            if [ -f "bin/$name" ]; then
                echo "  ✓ $name built successfully"
                return 0
            fi
        fi
    done
    
    echo "  ✗ Failed to build $name"
    return 1
}

build_tool "subfinder" "github.com/projectdiscovery/subfinder/v2/cmd/subfinder" "1"
echo ""

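# Findomain is written in Rust, so `go install` is expected to fail here; the tool is treated as optional
build_tool "findomain" "github.com/Findomain/Findomain" "2" || echo "  ⚠️ Findomain failed (optional)"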
echo ""

build_tool "dnsx" "github.com/projectdiscovery/dnsx/cmd/dnsx" "3"
echo ""

build_tool "httpx" "github.com/projectdiscovery/httpx/cmd/httpx" "4"
echo ""

build_tool "subzy" "github.com/PentestPad/subzy" "5"
echo ""

echo "=========================================="
echo "Verification"
echo "=========================================="

REQUIRED="subfinder dnsx httpx subzy"
TOOLS_OK=true

for tool in $REQUIRED; do
    if [ -f "bin/$tool" ]; then
        echo "  ✓ bin/$tool"
    else
        echo "  ✗ bin/$tool MISSING"
        TOOLS_OK=false
    fi
done

[ -f "bin/findomain" ] && echo "  ✓ bin/findomain (bonus!)" || echo "  ⚠️ bin/findomain (optional)"

if [ "$TOOLS_OK" = false ]; then
    echo ""
    echo "⚠️ Required tools failed - re-run this cell"
    exit 1
fi

echo ""
echo "✓ All required tools built!"

## Cell 5: Set Environment Variables

In [None]:
import os

os.environ['SUBFINDER_PATH'] = '/kaggle/working/subdomain-playground/bin/subfinder'
os.environ['FINDOMAIN_PATH'] = '/kaggle/working/subdomain-playground/bin/findomain'
os.environ['DNSX_PATH'] = '/kaggle/working/subdomain-playground/bin/dnsx'
os.environ['HTTPX_PATH'] = '/kaggle/working/subdomain-playground/bin/httpx'
os.environ['SUBZY_PATH'] = '/kaggle/working/subdomain-playground/bin/subzy'

os.chdir('/kaggle/working/subdomain-playground')

print("✓ Environment configured")
print(f"✓ Working directory: {os.getcwd()}")

## Cell 6: Load and Filter Dataset

**This cell:**
1. Loads the CSV dataset
2. Filters for Shopify stores
3. Filters by HTTP status (403, 404, 409, 500, 503)
4. Sorts by Est Monthly Page Views (descending)
5. Shows you the top targets before scanning

**You can adjust:**
- `LIMIT`: How many domains to scan (default: 100)
- `STATUS_CODES`: Which HTTP codes to filter (default: [403, 404, 409, 500, 503])
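
The sort in step 4 only works once the page-view column is numeric. A minimal sketch of that conversion, assuming the column may hold formatted strings such as `"1,200"`; the `to_page_views` helper name and the sample values are made up for illustration:

```python
import pandas as pd

def to_page_views(series):
    """Strip everything except digits and dots, then coerce to numbers (unparseable -> 0)."""
    cleaned = series.astype(str).str.replace(r"[^\d.]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce").fillna(0)

# Made-up values covering the common cases: separators, plain digits, junk, missing.
raw = pd.Series(["1,200", "45000", "N/A", None, "3.5"])
print(to_page_views(raw).tolist())
```

Note the limitation of this approach: a suffixed value such as `"1.2K"` is reduced to `1.2`, so abbreviated counts would need separate handling.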

In [None]:
import pandas as pd
import os

# ========================================
# CONFIGURATION - ADJUST THESE
# ========================================
LIMIT = 100  # How many top domains to scan
STATUS_CODES = [403, 404, 409, 500, 503]  # HTTP status codes to target
SORT_BY = 'Est Monthly Page Views'  # Column to sort by

print("=" * 80)
print("LOADING DATASET")
print("=" * 80)
print("")

# Load CSV
csv_path = '/kaggle/input/all-leads-merged/results.csv'
print(f"Loading: {csv_path}")
df = pd.read_csv(csv_path, low_memory=False)
print(f"✓ Loaded {len(df):,} total rows")
print("")

# Filter for Shopify
print("Filtering for Shopify stores...")
df_shopify = df[df['Is_Shopify'].astype(str).str.strip().str.lower() == 'yes']
print(f"✓ Shopify stores: {len(df_shopify):,}")

# Filter out *.myshopify.com (regex=False so the dot is matched literally)
df_custom = df_shopify[~df_shopify['Subdomain'].astype(str).str.contains('myshopify.com', regex=False, na=False)]
print(f"✓ Custom domains (not *.myshopify.com): {len(df_custom):,}")
print("")

# Filter by HTTP status (coerce to numeric so string values like "404" also match)
print(f"Filtering by HTTP status: {STATUS_CODES}")
status_numeric = pd.to_numeric(df_custom['HTTP_Status'], errors='coerce')
df_filtered = df_custom[status_numeric.isin(STATUS_CODES)]
print(f"✓ Matches: {len(df_filtered):,}")
print("")

# Status breakdown
print("Status code breakdown:")
filtered_status = status_numeric.loc[df_filtered.index]
for status in STATUS_CODES:
    count = int((filtered_status == status).sum())
    print(f"  {status}: {count:,}")
print("")

# Sort by page views
print(f"Sorting by: {SORT_BY} (highest first)")

if SORT_BY in df_filtered.columns:
    # Work on a copy so the new column is not written to a slice of df_custom
    df_filtered = df_filtered.copy()

    # Convert to numeric (strip thousands separators and other non-numeric characters)
    df_filtered['PageViews_Numeric'] = pd.to_numeric(
        df_filtered[SORT_BY].astype(str).str.replace(r'[^\d.]', '', regex=True),
        errors='coerce'
    ).fillna(0)

    # Sort and take top N
    df_sorted = df_filtered.nlargest(LIMIT, 'PageViews_Numeric')
    print(f"✓ Sorted {len(df_filtered):,} domains")
else:
    print(f"⚠️ Column '{SORT_BY}' not found - using unsorted data")
    df_sorted = df_filtered.head(LIMIT)

print("")
print("=" * 80)
print(f"TOP {LIMIT} TARGETS (Highest Traffic)")
print("=" * 80)
print("")

# Display top 20
display_cols = ['Subdomain', 'HTTP_Status', SORT_BY, 'CNAME_Record']
display_cols = [col for col in display_cols if col in df_sorted.columns]

print("Top 20:")
print(df_sorted[display_cols].head(20).to_string(index=False))
print("")
print(f"... and {len(df_sorted) - 20} more" if len(df_sorted) > 20 else "")
print("")

# Export to file
targets_file = '/kaggle/working/scan_targets.txt'
df_sorted['Subdomain'].to_csv(targets_file, index=False, header=False)
print(f"✓ Exported {len(df_sorted)} targets to: {targets_file}")
print("")

# Also save CSV for reference
csv_file = '/kaggle/working/filtered_targets.csv'
df_sorted[display_cols].to_csv(csv_file, index=False)
print(f"✓ Saved filtered data to: {csv_file}")
print("")
print("=" * 80)
print("Next: Run Cell 7 to scan these targets")
print("=" * 80)

## Cell 7: Deep Scanner Validation (10-15 minutes)

**Scans the filtered targets with:**
- DNS validation (CNAME chains)
- HTTP validation (body analysis)
- Takeover detection (subzy)
- CNAME blacklist filtering (46 patterns)
- Cloudflare verification detection
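
The CNAME blacklist step can be pictured as a glob match against each CNAME target. The sketch below is an assumption about how such a filter might look, not the scanner's actual code: the two patterns and the `is_blacklisted` helper are hypothetical, and the real 46-pattern list is not reproduced here.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns for illustration only; not the scanner's real list.
EXAMPLE_BLACKLIST = ["*.cloudfront.net", "*.example-cdn.test"]

def is_blacklisted(cname, patterns=EXAMPLE_BLACKLIST):
    """True if the CNAME target matches any glob pattern (lowercased, trailing dot ignored)."""
    target = cname.strip().rstrip(".").lower()
    return any(fnmatchcase(target, pattern) for pattern in patterns)

print(is_blacklisted("d111111abcdef8.cloudfront.net."))
print(is_blacklisted("shops.myshopify.com"))
```

Normalizing case and the trailing dot before matching keeps fully qualified answers from DNS tools from slipping past the patterns.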

In [None]:
import subprocess
import sys
import time

print("=" * 80)
print("DEEP SCANNER VALIDATION")
print("=" * 80)
print("")
print("Scanning: /kaggle/working/scan_targets.txt")
print("Mode: quick (skips enumeration - already subdomains)")
print("Workers: 2 (optimized for Kaggle)")
print("")
print("Features:")
print("  ✓ Automatic subdomain detection")
print("  ✓ CNAME blacklist (46 patterns)")
print("  ✓ Cloudflare verification detection")
print("  ✓ Custom DNS resolvers")
print("  ✓ Tools: subfinder + findomain + dnsx + httpx + subzy")
print("")

scan_start = time.time()

scan_cmd = [
    sys.executable, '-u', 'scan.py',
    '-l', '/kaggle/working/scan_targets.txt',
    '--mode', 'quick',
    '--provider', 'Shopify',
    '--require-cname',
    '--filter-status', '3*,4*,5*',
    '--workers', '2',
    '--json',
    '--output', '/kaggle/working/takeover_results.json'
]

scan_proc = subprocess.Popen(
    scan_cmd,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    universal_newlines=True,
    bufsize=1
)

for line in iter(scan_proc.stdout.readline, ''):
    if line:
        print(line, end='', flush=True)

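returncode = scan_proc.wait()
scan_time = time.time() - scan_start

print("")
print("=" * 80)
# Report failure instead of always claiming success
if returncode == 0:
    print(f"✓ Scan Complete in {scan_time/60:.1f} minutes")
else:
    print(f"✗ Scan exited with code {returncode} after {scan_time/60:.1f} minutes")
print("=" * 80)
print("")
print("✓ Results: /kaggle/working/takeover_results.json")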
print("")
print("Next: Run Cell 8 to view results")

## Cell 8: View Results

In [None]:
import json
import pandas as pd
from pathlib import Path

results_file = Path('/kaggle/working/takeover_results.json')

if not results_file.exists():
    print("❌ Results file not found!")
    print("Make sure Cell 7 completed successfully")
else:
    with open(results_file) as f:
        results = json.load(f)
    
    print("=" * 80)
    print("SCAN RESULTS")
    print("=" * 80)
    print("")
    print(f"Total analyzed: {len(results)}")
    
    definite = [r for r in results if 'DEFINITE' in r.get('evidence', '')]
    high_prob = [r for r in results if 'HIGH PROBABILITY' in r.get('evidence', '')]
    false_pos = [r for r in results if 'FALSE POSITIVE' in r.get('evidence', '')]
    
    print("")
    print("Breakdown:")
    print(f"  🔴 DEFINITE TAKEOVER: {len(definite)}")
    print(f"  ⚠️  HIGH PROBABILITY: {len(high_prob)}")
    print(f"  ❌ FALSE POSITIVE: {len(false_pos)}")
    
    print("")
    print("=" * 80)
    print("TOP 10 VULNERABILITIES")
    print("=" * 80)
    
    if definite:
        df = pd.DataFrame(definite)
        df_sorted = df.sort_values('confidence', ascending=False) if 'confidence' in df.columns else df
        
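        # Number rows by rank; the DataFrame index is no longer sequential after sorting
        for rank, (_, row) in enumerate(df_sorted.head(10).iterrows(), start=1):
            print("")
            print(f"{rank}. {row['subdomain']}")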
            print(f"   Status: {row.get('status', 'N/A')}")
            print(f"   CNAME: {row.get('cname', 'N/A')}")
            print(f"   Evidence: {row.get('evidence', 'N/A')}")
            print(f"   Confidence: {row.get('confidence', 0)}")
    else:
        print("")
        print("No definite takeovers found.")
        print("Check HIGH PROBABILITY results.")
    
    print("")
    print("=" * 80)
    print("Next: Run Cell 9 to export CSV")

## Cell 9: Export Results to CSV

In [None]:
import json
import pandas as pd
from pathlib import Path

results_file = Path('/kaggle/working/takeover_results.json')

if results_file.exists():
    with open(results_file) as f:
        results = json.load(f)
    
    df = pd.DataFrame(results)
    
    columns = ['subdomain', 'status', 'cname', 'evidence', 'confidence', 'message']
    df_export = df[[col for col in columns if col in df.columns]]
    
    if 'confidence' in df_export.columns:
        df_export = df_export.sort_values('confidence', ascending=False)
    
    output_csv = Path('/kaggle/working/shopify_takeovers.csv')
    df_export.to_csv(output_csv, index=False)
    
    print("=" * 80)
    print("CSV EXPORT")
    print("=" * 80)
    print("")
    print(f"✓ Exported {len(df_export)} results to: {output_csv}")
    print("")
    print("Preview (top 20):")
    print(df_export.head(20).to_string(index=False))
    
    if len(df_export) > 20:
        print("")
        print(f"... and {len(df_export) - 20} more rows")
    
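    # Guard against a missing 'evidence' column before filtering high-risk rows
    if 'evidence' in df_export.columns:
        df_high = df_export[df_export['evidence'].str.contains('DEFINITE|HIGH', na=False)]
    else:
        df_high = df_export.iloc[0:0]
    if len(df_high) > 0:
        high_risk_csv = Path('/kaggle/working/shopify_high_risk.csv')
        df_high.to_csv(high_risk_csv, index=False)
        print("")
        print(f"✓ High-risk only ({len(df_high)} results): {high_risk_csv}")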
    
    print("")
    print("=" * 80)
    print("✅ COMPLETE!")
    print("=" * 80)
    print("")
    print("Download files from Kaggle Output section")
else:
    print("❌ Results file not found!")

## Cell 10: Performance Summary

In [None]:
print("=" * 80)
print("PERFORMANCE SUMMARY")
print("=" * 80)
print("")
print("Workflow:")
print("  1. Load 1.7M row CSV dataset")
print("  2. Filter by Shopify + HTTP status")
print("  3. Sort by Est Monthly Page Views")
print("  4. Scan top N targets (default: 100)")
print("")
print("Time:")
print("  Setup (Cells 1-5): 6-10 minutes")
print("  Filter (Cell 6): 2-5 seconds")
print("  Scan (Cell 7): 10-15 minutes")
print("  Total: 15-25 minutes")
print("")
print("Tools:")
print("  ✓ subfinder (passive enumeration)")
print("  ✓ findomain (bonus coverage)")
print("  ✓ dnsx (DNS + CNAME)")
print("  ✓ httpx (HTTP validation)")
print("  ✓ subzy (takeover detection)")
print("")
print("Features:")
print("  ✓ Dataset-based (no Google Sheets)")
print("  ✓ Filter by HTTP status")
print("  ✓ Sort by page views")
print("  ✓ CNAME blacklist (46 patterns)")
print("  ✓ Cloudflare verification detection")
print("  ✓ Custom DNS resolvers")
print("")
print("=" * 80)