# Shopify Subdomain Takeover Scanner

**Total time:** ~6 hours for 10,000 domains in quick mode (full mode, used in Cell 8, takes 10-15 hours)

**What this does:**
- Scans domains for Shopify CNAME records
- Detects HTTP 403/404 status (potential takeover indicators)
- Real-time progress display
- Exports results to CSV
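
The two detection bullets above can be read as a single predicate. The following is a minimal sketch, assuming a host counts as a candidate when its CNAME mentions "shopify" and its HTTP status is 403 or 404. The name `is_takeover_candidate` is hypothetical and is not taken from `scan.py`, whose actual scoring may differ.

```python
from typing import Optional

# Status codes this notebook treats as potential takeover indicators.
SUSPICIOUS_STATUSES = {403, 404}


def is_takeover_candidate(cname: Optional[str], http_status: Optional[int]) -> bool:
    """Hypothetical helper: flag a host whose CNAME mentions Shopify and
    whose HTTP status is 403 or 404. Not the logic used by scan.py."""
    if not cname or http_status is None:
        return False
    points_to_shopify = "shopify" in cname.lower()
    return points_to_shopify and http_status in SUSPICIOUS_STATUSES


print(is_takeover_candidate("shops.myshopify.com", 404))
print(is_takeover_candidate("shops.myshopify.com", 200))
print(is_takeover_candidate("example.netlify.app", 404))
```

A match only marks a host for review; it still needs manual verification before anyone acts on it.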

**Instructions:** Run cells 1-12 in order

## Cell 1: Clone Project from GitHub

Clones the project repository into the Kaggle working directory, replacing any existing copy.

In [None]:
%%bash
set -euo pipefail

WORKDIR=${KAGGLE_WORKING_DIR:-/kaggle/working}
if [ ! -d "$WORKDIR" ]; then
    WORKDIR="$(pwd)"
fi
PROJECT_DIR="$WORKDIR/subdomain-playground"

echo "=========================================="
echo "Cloning Project from GitHub"
echo "=========================================="

echo "Working directory: $WORKDIR"
mkdir -p "$WORKDIR"
cd "$WORKDIR"

if [ -d "$PROJECT_DIR" ]; then
    echo "Removing existing copy at $PROJECT_DIR"
    rm -rf "$PROJECT_DIR"
fi

git clone --depth 1 https://github.com/sayihhamza/subdomain-playground.git "$PROJECT_DIR"

cd "$PROJECT_DIR"

echo ""
echo "✓ Project cloned successfully!"
echo ""
echo "Project structure:"
ls -lh | head -20


## Cell 2: Configure Environment for Kaggle

In [None]:
%%bash
set -euo pipefail

WORKDIR=${KAGGLE_WORKING_DIR:-/kaggle/working}
if [ ! -d "$WORKDIR" ]; then
    WORKDIR="$(pwd)"
fi
PROJECT_DIR="$WORKDIR/subdomain-playground"

cd "$PROJECT_DIR"

echo "=========================================="
echo "Verifying Project Files & Python Deps"
echo "=========================================="

echo ""
echo "Essential files:"
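# With `set -e`, a failing `ls` would abort the cell silently, so report it instead
ls -lh scan.py requirements.txt || echo "✗ scan.py or requirements.txt is missing"

echo ""
echo "Directories:"
ls -d */ 2>/dev/null || echo "(none)"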

echo ""
echo "CSV files:"
if [ -d "data/domain_sources/myleadfox" ]; then
    # find exits 0 even when no CSV matches, so pipefail does not abort the cell
    CSV_COUNT=$(find data/domain_sources/myleadfox -maxdepth 1 -name '*.csv' | wc -l)
    echo "✓ Found $CSV_COUNT CSV files in data/domain_sources/myleadfox/"
    ls -lh data/domain_sources/myleadfox/*.csv 2>/dev/null | head -5 || true
else
    echo "✗ CSV directory not found"
fi

echo ""
echo "Installing Python requirements (quiet)..."
python3 -m pip install --quiet -r requirements.txt

echo ""
echo "✓ Project structure verified!"


## Cell 3: Install Go 1.24

Kaggle has Go 1.18, but we need Go 1.24+ to compile the latest security tools.
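
The version requirement can also be checked from Python before building. The following is a minimal sketch that parses the text printed by `go version`; the helper names `parse_go_version` and `meets_minimum` are hypothetical and not part of the project.

```python
import re
from typing import Optional, Tuple


def parse_go_version(output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from `go version` output,
    e.g. 'go version go1.24.1 linux/amd64'. Returns None if no version is found."""
    match = re.search(r"\bgo(\d+)\.(\d+)", output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def meets_minimum(output: str, minimum: Tuple[int, int] = (1, 24)) -> bool:
    """True when the parsed version is at least `minimum` (1.24 by default)."""
    version = parse_go_version(output)
    return version is not None and version >= minimum


print(meets_minimum("go version go1.24.1 linux/amd64"))
print(meets_minimum("go version go1.18 linux/amd64"))
```

In a notebook, the input string could come from `subprocess.run(["go", "version"], capture_output=True, text=True).stdout`.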

In [None]:
%%bash
set -euo pipefail

WORKDIR=${KAGGLE_WORKING_DIR:-/kaggle/working}
if [ ! -d "$WORKDIR" ]; then
    WORKDIR="$(pwd)"
fi
cd "$WORKDIR"

if command -v sudo >/dev/null 2>&1; then
    SUDO="sudo"
else
    SUDO=""
fi

echo "=========================================="
echo "Installing Go 1.24.1"
echo "=========================================="

echo "Current Go version:"
go version 2>/dev/null || echo "Go not found"

echo ""
echo "Installing Go 1.24.1..."

# Remove old Go installations
$SUDO rm -rf /usr/lib/go* 2>/dev/null || true
$SUDO rm -rf /usr/local/go 2>/dev/null || true

# Download Go 1.24.1 - with retry
echo "Downloading Go 1.24.1 for Linux AMD64..."
for i in {1..3}; do
    wget -q https://go.dev/dl/go1.24.1.linux-amd64.tar.gz -O /tmp/go.tar.gz && break || sleep 5
done

# Verify download (wget -O leaves an empty file behind on failure, so test for non-empty)
if [ ! -s /tmp/go.tar.gz ]; then
    echo "✗ Failed to download Go"
    exit 1
fi

# Install Go
echo "Installing to /usr/local/go..."
$SUDO tar -C /usr/local -xzf /tmp/go.tar.gz

# Cleanup
rm -f /tmp/go.tar.gz

# Verify installation
if [ ! -f /usr/local/go/bin/go ]; then
    echo "✗ Go installation failed"
    exit 1
fi

echo ""
echo "✓ Go 1.24.1 installed successfully!"
echo ""
echo "New Go version:"
/usr/local/go/bin/go version


## Cell 4: Build Security Tools from Source

Compiles subfinder, httpx, dnsx, and subzy from source. This takes 3-4 minutes.

These tools are required for:
- **subfinder**: Passive subdomain enumeration
- **httpx**: HTTP probing and status checking
- **dnsx**: DNS resolution and CNAME chain tracking
- **subzy**: Subdomain takeover detection

In [None]:
%%bash
export PATH=/usr/local/go/bin:$PATH
cd /kaggle/working/subdomain-playground

echo "=========================================="
echo "Building Security Tools"
echo "=========================================="
echo "This takes 3-4 minutes..."
echo "⏳ Retrying on network errors (Kaggle proxy issues)"
echo ""

# Verify Go is available
if ! command -v /usr/local/go/bin/go &> /dev/null; then
    echo "✗ Go not found! Re-run Cell 3"
    exit 1
fi

# Create bin directory
mkdir -p bin

# Function to build with retries
build_tool() {
    local name=$1
    local repo=$2
    local max_attempts=3
    
    echo "[$3/4] Building $name..."
    
    for attempt in $(seq 1 $max_attempts); do
        if [ $attempt -gt 1 ]; then
            echo "  Retry $attempt/$max_attempts..."
            sleep 2
        fi
        
        if GOBIN=$(pwd)/bin /usr/local/go/bin/go install -v ${repo}@latest 2>&1; then
            if [ -f "bin/$name" ]; then
                echo "  ✓ $name built successfully"
                return 0
            fi
        fi
    done
    
    echo "  ✗ Failed to build $name after $max_attempts attempts"
    return 1
}

# Build each tool with retries
build_tool "subfinder" "github.com/projectdiscovery/subfinder/v2/cmd/subfinder" "1"
SUBFINDER_OK=$?

echo ""
build_tool "httpx" "github.com/projectdiscovery/httpx/cmd/httpx" "2"
HTTPX_OK=$?

echo ""
build_tool "dnsx" "github.com/projectdiscovery/dnsx/cmd/dnsx" "3"
DNSX_OK=$?

echo ""
build_tool "subzy" "github.com/PentestPad/subzy" "4"
SUBZY_OK=$?

echo ""
echo "=========================================="
echo "Verification"
echo "=========================================="

# Check results
TOOLS_OK=true
for tool in subfinder httpx dnsx subzy; do
    if [ -f "bin/$tool" ]; then
        echo "✓ bin/$tool exists"
    else
        echo "✗ bin/$tool not found!"
        TOOLS_OK=false
    fi
done

if [ "$TOOLS_OK" = false ]; then
    echo ""
    echo "=========================================="
    echo "⚠️  Some tools failed to build due to network errors"
    echo "=========================================="
    echo ""
    echo "This is a Kaggle network issue (proxy.golang.org timeouts)."
    echo ""
    echo "Solutions:"
    echo "  1. Wait 1-2 minutes and re-run this cell"
    echo "  2. If it keeps failing, restart the Kaggle session"
    echo "  3. Try running at a different time (less network congestion)"
    exit 1
fi

echo ""
echo "Tool versions:"
./bin/subfinder -version 2>&1 | head -1 || echo "subfinder: installed"
./bin/httpx -version 2>&1 | head -1 || echo "httpx: installed"
./bin/dnsx -version 2>&1 | head -1 || echo "dnsx: installed"
./bin/subzy --help 2>&1 | head -1 || echo "subzy: installed"

echo ""
echo "Binary details:"
file bin/subfinder | cut -d: -f2
file bin/httpx | cut -d: -f2
file bin/dnsx | cut -d: -f2
file bin/subzy | cut -d: -f2

echo ""
echo "Tool sizes:"
ls -lh bin/ | grep -E "(subfinder|httpx|dnsx|subzy)" | awk '{print $9 ": " $5}'

echo ""
echo "✓ All tools built successfully!"

## Cell 5: Verify Environment

Verifies that the `.env` file exists with the correct tool paths (it is now included in the repo) and creates it if it is missing.
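
For reference, the `KEY=VALUE` lines in `.env` can be loaded into the Python environment with a few lines of code. This is a minimal sketch, assuming a plain `KEY=VALUE` format without variable expansion; `load_env_file` is a hypothetical helper, and whether `scan.py` reads `.env` on its own is not shown in this notebook.

```python
import os
from pathlib import Path


def load_env_file(path: str) -> dict:
    """Hypothetical helper: parse KEY=VALUE lines and export them to os.environ.
    Returns the parsed pairs; returns an empty dict if the file does not exist."""
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        loaded[key.strip()] = value.strip().strip('"').strip("'")
    os.environ.update(loaded)
    return loaded


paths = load_env_file("/kaggle/working/subdomain-playground/.env")
print(sorted(paths))
```

Later cells set the same four variables by hand, so this helper is optional.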

In [None]:
%%bash
cd /kaggle/working/subdomain-playground

echo "Verifying .env file..."
echo ""

if [ -f ".env" ]; then
    echo "✓ .env file exists"
    echo ""
    echo "Contents:"
    cat .env
else
    echo "✗ .env file not found - creating it now..."
    cat > .env << 'EOF'
SUBFINDER_PATH=/kaggle/working/subdomain-playground/bin/subfinder
DNSX_PATH=/kaggle/working/subdomain-playground/bin/dnsx
HTTPX_PATH=/kaggle/working/subdomain-playground/bin/httpx
SUBZY_PATH=/kaggle/working/subdomain-playground/bin/subzy
EOF
    echo "✓ .env file created"
fi

echo ""
echo "Verifying tool paths:"
for tool in subfinder dnsx httpx subzy; do
    if [ -f "bin/$tool" ]; then
        echo "✓ bin/$tool exists"
    else
        echo "✗ bin/$tool NOT FOUND"
    fi
done

## Cell 6: Extract Domains from CSV Files

Extracts unique domains from CSV files in `data/domain_sources/myleadfox/`
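
The same extraction can be sketched in Python. As in the shell pipeline in this cell, the domain is assumed to sit in the first CSV column and is checked against the same regular expression. `extract_domains` is a hypothetical name, and this version skips the header row of every file.

```python
import csv
import re
from pathlib import Path
from typing import Iterable, List

# Same pattern as the grep filter used in this cell.
DOMAIN_RE = re.compile(r"^[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")


def extract_domains(csv_paths: Iterable[Path]) -> List[str]:
    """Collect unique, sorted domains from the first column of each CSV."""
    domains = set()
    for path in csv_paths:
        with open(path, newline="") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip this file's header row
            for row in reader:
                if row and DOMAIN_RE.match(row[0].strip()):
                    domains.add(row[0].strip())
    return sorted(domains)


csv_dir = Path("data/domain_sources/myleadfox")
domains = extract_domains(sorted(csv_dir.glob("*.csv")))
print(f"{len(domains)} unique domains")
```

The `csv` module handles quoted fields, so no separate quote-stripping step is needed. The sort order may differ slightly from `sort -u`, which depends on the shell locale.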

In [None]:
%%bash
set -euo pipefail

WORKDIR=${KAGGLE_WORKING_DIR:-/kaggle/working}
if [ ! -d "$WORKDIR" ]; then
    WORKDIR="$(pwd)"
fi
PROJECT_DIR="$WORKDIR/subdomain-playground"
cd "$PROJECT_DIR"

echo "=========================================="
echo "Extracting Domains from CSV Files"
echo "=========================================="

if [ -d "data/domain_sources/myleadfox" ]; then
    # find exits 0 even when no CSV matches, so pipefail does not abort the cell
    CSV_COUNT=$(find data/domain_sources/myleadfox -maxdepth 1 -name '*.csv' | wc -l)
    echo "Found $CSV_COUNT CSV files"
    echo ""

    # Extract unique domains from all CSV files
    echo "Extracting domains..."
    # tail -q -n +2 drops the header row of every file, not just the first one
    tail -q -n +2 data/domain_sources/myleadfox/*.csv |
      cut -d',' -f1 |
      sed 's/"//g' |
      grep -E '^[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$' |
      sort -u > data/all_sources.txt

    DOMAIN_COUNT=$(wc -l < data/all_sources.txt | tr -d ' ')
    echo "✓ Extracted $DOMAIN_COUNT unique domains"
    echo ""
    echo "Saved to: data/all_sources.txt"
    echo ""
    echo "First 10 domains:"
    head -10 data/all_sources.txt
else
    echo "✗ CSV directory not found: data/domain_sources/myleadfox/"
    echo "Please add your CSV files to this directory"
    exit 1
fi


## Cell 7: Quick Test (5 domains)

**⚠️ IMPORTANT: Watch for real-time output!**

You should see:
- Domains streaming with DNS/HTTP info
- Live progress updates
- CNAME chains and provider detection

If Cell 7 works correctly, you can proceed to Cell 8 for the full scan.

In [None]:
import subprocess
import sys
import os

# Change to project directory
os.chdir('/kaggle/working/subdomain-playground')

# Add bin to PATH
os.environ['PATH'] = f"/kaggle/working/subdomain-playground/bin:{os.environ['PATH']}"

# Set tool paths
os.environ['SUBFINDER_PATH'] = '/kaggle/working/subdomain-playground/bin/subfinder'
os.environ['DNSX_PATH'] = '/kaggle/working/subdomain-playground/bin/dnsx'
os.environ['HTTPX_PATH'] = '/kaggle/working/subdomain-playground/bin/httpx'
os.environ['SUBZY_PATH'] = '/kaggle/working/subdomain-playground/bin/subzy'

print("=" * 80)
print("Quick Test - 5 Domains")
print("=" * 80)
print()

# Verify tools
print("Verifying tools are in PATH:")
for tool in ['subfinder', 'dnsx', 'httpx', 'subzy']:
    tool_path = f"/kaggle/working/subdomain-playground/bin/{tool}"
    if os.path.exists(tool_path):
        print(f"  ✓ {tool} found")
    else:
        print(f"  ✗ {tool} NOT FOUND")
print()

# Test dnsx directly
print("Testing dnsx directly on google.com:")
dnsx_test = subprocess.run(
    ['./bin/dnsx', '-a', '-cname', '-resp', '-json', '-silent'],
    input='google.com\n',
    capture_output=True,
    text=True,
    cwd='/kaggle/working/subdomain-playground'
)
print(dnsx_test.stdout[:200] if dnsx_test.stdout else "No output")
print()

# Create test file from lines 10-14 of the domain list
print("Creating test file with 5 diverse domains...")
with open('data/all_sources.txt', 'r') as f:
    all_domains = f.readlines()
    # Lines 10-14 (1-indexed); fall back to the first 5 if the list is shorter
    test_domains = all_domains[9:14] or all_domains[:5]

with open('data/test_5.txt', 'w') as f:
    f.writelines(test_domains)

print("Testing with:")
with open('data/test_5.txt', 'r') as f:
    print(f.read())
print()

print("=" * 80)
print("STARTING TEST SCAN (NO FILTERS)")
print("=" * 80)
print()
print("NOTE: Running without filters to see all discovered subdomains...")
print()

# Run scan with real-time output
process = subprocess.Popen(
    [sys.executable, '-u', 'scan.py', 
     '-l', 'data/test_5.txt',
     '--workers', '2',
     '--mode', 'quick'],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    universal_newlines=True,
    bufsize=1
)

# Stream output line by line in real-time
for line in process.stdout:
    print(line, end='', flush=True)

process.wait()

print()
print("=" * 80)
print("✓ Test Complete")
print("=" * 80)
print()
print("Did you see ANY subdomains above? If yes, the scanner works!")
print("If still 0 subdomains, then enumeration itself is failing.")

## Cell 8: FULL SCAN - ALL DOMAINS (Full Mode with Shopify CNAME Filter)

⚠️ **WARNING: Full mode takes 10-15 hours for ~10,000 domains!**

**Modes available:**
- **quick**: Passive enumeration only (5-6 hours) - subfinder only
- **full**: Active + passive enumeration (10-15 hours) - subfinder + DNS bruteforce + alterx

**What you'll see:**
- Real-time progress streaming
- Live DNS/HTTP information  
- Progress updates every 10 domains
- ETA (estimated time to completion)

**This cell uses FULL MODE with:**
- `--mode full`: Active + passive enumeration for maximum subdomain discovery
- `--require-cname`: Only show domains WITH CNAME records
- `--provider Shopify`: Filter applied; the CNAME must contain "shopify"

Kaggle sessions time out after 12 hours, so a full-mode scan may not finish. Use quick mode if needed.
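
One way to stay inside the session limit is to split the domain list into batches. This is a rough sketch that assumes scan time scales linearly with the number of domains, which the notebook's estimates imply but do not guarantee. Both helper names and the 0.8 safety margin are hypothetical.

```python
from typing import List


def max_domains_per_session(hours_per_10k: float,
                            session_hours: float = 12.0,
                            safety_margin: float = 0.8) -> int:
    """Estimate how many domains fit in one session, assuming linear scaling.
    `safety_margin` reserves part of the session for setup and export."""
    seconds_per_domain = hours_per_10k * 3600 / 10_000
    budget_seconds = session_hours * 3600 * safety_margin
    return int(budget_seconds // seconds_per_domain)


def split_into_batches(domains: List[str], batch_size: int) -> List[List[str]]:
    """Split the domain list into consecutive chunks of at most `batch_size`."""
    return [domains[i:i + batch_size] for i in range(0, len(domains), batch_size)]


# Worst-case full-mode figure from this notebook: 15 hours per 10,000 domains.
batch_size = max_domains_per_session(hours_per_10k=15)
print(f"Domains per session: {batch_size}")
```

With the worst-case full-mode estimate, this comes to roughly 6,400 domains per session. Each batch could be written to its own file and passed to `scan.py` with `-l`; whether results from separate runs are merged or overwritten is not covered here and should be checked first.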

In [None]:
import subprocess
import sys
import os

# Change to project directory
os.chdir('/kaggle/working/subdomain-playground')

# Add bin to PATH
os.environ['PATH'] = f"/kaggle/working/subdomain-playground/bin:{os.environ['PATH']}"

# Set tool paths
os.environ['SUBFINDER_PATH'] = '/kaggle/working/subdomain-playground/bin/subfinder'
os.environ['DNSX_PATH'] = '/kaggle/working/subdomain-playground/bin/dnsx'
os.environ['HTTPX_PATH'] = '/kaggle/working/subdomain-playground/bin/httpx'
os.environ['SUBZY_PATH'] = '/kaggle/working/subdomain-playground/bin/subzy'

print("=" * 80)
print("STARTING FULL SCAN - FULL MODE")
print("=" * 80)
print()

# Count domains
with open('data/all_sources.txt', 'r') as f:
    domain_count = len(f.readlines())

print(f"Total domains: {domain_count}")
print("Workers: 4")
print("Mode: FULL (active + passive enumeration)")
print("Filter: Require CNAME + Shopify domains")
print()
print("Estimated time: 10-15 hours")
print("⚠️  WARNING: May exceed Kaggle 12-hour limit!")
print()
print("=" * 80)
print()

# Run scan with real-time output
process = subprocess.Popen(
    [sys.executable, '-u', 'scan.py', 
     '-l', 'data/all_sources.txt',
     '--mode', 'full',
     '--require-cname',
     '--provider', 'Shopify',
     '--workers', '4'],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    universal_newlines=True,
    bufsize=1
)

# Stream output line by line in real-time
for line in process.stdout:
    print(line, end='', flush=True)

process.wait()
print()
print("Scan completed with return code:", process.returncode)

## Cell 9: View Results Summary

Displays scan results with risk level breakdown and top findings.
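
The cell below lists risk levels in alphabetical order. If a severity ordering is preferred, the breakdown could be computed as in this sketch, which uses the four levels named in the Cell 10 column descriptions. `risk_breakdown` is a hypothetical helper and the sample records are illustrative, not scan output.

```python
from collections import Counter

# Severity order based on the levels listed in Cell 10's column descriptions.
SEVERITY_ORDER = ["critical", "high", "medium", "low"]


def risk_breakdown(results):
    """Return (risk_level, count) pairs, most severe first; unrecognized levels last."""
    counts = Counter(r.get("risk_level", "unknown") for r in results)
    known = [(level, counts[level]) for level in SEVERITY_ORDER if level in counts]
    other = sorted((lvl, n) for lvl, n in counts.items() if lvl not in SEVERITY_ORDER)
    return known + other


# Illustrative records only; real entries come from the scan's JSON output.
sample = [{"risk_level": "low"}, {"risk_level": "critical"}, {"risk_level": "low"}, {}]
for level, count in risk_breakdown(sample):
    print(f"  {level.upper()}: {count}")
```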

In [None]:
import json
from pathlib import Path

PROJECT_DIR = Path("/kaggle/working/subdomain-playground")
if not PROJECT_DIR.exists():
    PROJECT_DIR = Path.cwd()

results_file = PROJECT_DIR / "data/scans/shopify_takeover_candidates.json"

if results_file.exists():
    with results_file.open("r") as f:
        results = json.load(f)

    print("=" * 80)
    print("SHOPIFY TAKEOVER SCAN RESULTS")
    print("=" * 80)
    print("")
    print(f"Total candidates found: {len(results)}")

    # Count by risk level
    risk_counts = {}
    for r in results:
        risk = r.get("risk_level", "unknown")
        risk_counts[risk] = risk_counts.get(risk, 0) + 1

    print("")
    print("Breakdown by risk level:")
    for risk, count in sorted(risk_counts.items()):
        print(f"  {risk.upper()}: {count}")

    print("")
    print("=" * 80)
    print("TOP 10 FINDINGS (by confidence score)")
    print("=" * 80)

    sorted_results = sorted(results, key=lambda x: x.get("confidence_score", 0), reverse=True)

    for i, r in enumerate(sorted_results[:10], 1):
        print("")
        print(f"{i}. {r['subdomain']}")
        print(f"   CNAME: {r.get('cname', 'N/A')}")
        print(f"   HTTP Status: {r.get('http_status', 'N/A')}")
        print(f"   Risk Level: {r.get('risk_level', 'N/A')}")
        print(f"   Confidence Score: {r.get('confidence_score', 0)}")
        if r.get("cname_chain"):
            print(f"   CNAME Chain: {' → '.join(r['cname_chain'][:3])}")
else:
    print(f"✗ Results file not found: {results_file}")
    print("")
    print("Make sure Cell 8 completed successfully.")


## Cell 10: Export to CSV

Exports results to `shopify_results.csv` for easy analysis.

In [None]:
import json
import pandas as pd
from pathlib import Path

PROJECT_DIR = Path("/kaggle/working/subdomain-playground")
if not PROJECT_DIR.exists():
    PROJECT_DIR = Path.cwd()

results_file = PROJECT_DIR / "data/scans/shopify_takeover_candidates.json"
if not results_file.exists():
    raise SystemExit(f"✗ Results file not found: {results_file}. Run the scan first.")

with results_file.open("r") as f:
    results = json.load(f)

if not results:
    raise SystemExit("✗ No results to export. Make sure the scan produced findings.")

df = pd.DataFrame(results)

# Select key columns
columns = [
    "subdomain", "cname", "http_status", "risk_level", "confidence_score",
    "cname_chain_count", "final_cname_target", "a_records", "provider"
]
df_export = df[[col for col in columns if col in df.columns]]
# Sort only if the column survived the selection above
if "confidence_score" in df_export.columns:
    df_export = df_export.sort_values("confidence_score", ascending=False)

# Save to CSV
output_csv = PROJECT_DIR / "shopify_results.csv"
df_export.to_csv(output_csv, index=False)

print(f"✓ Exported {len(df_export)} results to {output_csv}")
print("\nPreview (top 10):")
display(df_export.head(10))

print("\nColumn descriptions:")
print("  - subdomain: Domain scanned")
print("  - cname: CNAME record pointing to Shopify")
print("  - http_status: HTTP response code (403/404 = potential takeover)")
print("  - risk_level: low, medium, high, or critical")
print("  - confidence_score: 0-100 (higher = more confident)")


## Cell 11: Filter High-Risk Only

Creates a separate CSV with only critical and high-risk findings.

In [None]:
import pandas as pd
from pathlib import Path

PROJECT_DIR = Path("/kaggle/working/subdomain-playground")
if not PROJECT_DIR.exists():
    PROJECT_DIR = Path.cwd()

results_csv = PROJECT_DIR / "shopify_results.csv"
if not results_csv.exists():
    raise SystemExit(f"✗ Results CSV not found: {results_csv}. Run the export cell first.")

df = pd.read_csv(results_csv)
df_high = df[df["risk_level"].isin(["critical", "high"])]

print(f"High-risk findings: {len(df_high)} out of {len(df)} total")
print("")

if len(df_high) > 0:
    high_risk_csv = PROJECT_DIR / "shopify_high_risk.csv"
    df_high.to_csv(high_risk_csv, index=False)
    print(f"✓ Saved to {high_risk_csv}")
    print("\nHigh-risk results:")
    display(df_high)
    
    print("\n⚠️ PRIORITY ACTIONS:")
    print("  1. Verify these findings manually")
    print("  2. Check if you own these domains")
    print("  3. Claim Shopify stores if authorized")
    print("  4. Report findings to domain owners")
else:
    print("✓ No high-risk findings detected.")
    print("\nThis is good news! Either:")
    print("  - No critical vulnerabilities found")
    print("  - All findings are low/medium risk")


## Cell 12: Download Results

Provides download links for all result files.

In [None]:
from IPython.display import FileLink, display
from pathlib import Path

PROJECT_DIR = Path("/kaggle/working/subdomain-playground")
if not PROJECT_DIR.exists():
    PROJECT_DIR = Path.cwd()

print("Download your results:")
print("=" * 80)
print("")

files = [
    (PROJECT_DIR / "shopify_results.csv", "All Shopify takeover candidates (CSV)"),
    (PROJECT_DIR / "shopify_high_risk.csv", "High-risk findings only (CSV)"),
    (PROJECT_DIR / "data/scans/shopify_takeover_candidates.json", "Full results with metadata (JSON)")
]

for file_path, description in files:
    if file_path.exists():
        file_size = file_path.stat().st_size
        size_kb = file_size / 1024
        print(f"✓ {description}")
        print(f"  Size: {size_kb:.1f} KB")
        display(FileLink(str(file_path)))
        print("")
    else:
        print(f"- {description} (not found)")
        print("")

print("=" * 80)
print("\n✅ SCAN COMPLETE!")
print("\nNext steps:")
print("  1. Download the CSV files above")
print("  2. Review high-risk findings first")
print("  3. Manually verify critical findings")
print("  4. Take appropriate action on confirmed vulnerabilities")
print("\n⚠️ Legal reminder: Only act on domains you own or have authorization to test.")
