# Shopify Subdomain Takeover Scanner

**Estimated time:** 5-6 hours for ~10,000 domains

**What this does:**
- Scans domains for Shopify CNAME records
- Checks the HTTP status of each match (a 403 or 404 indicates a potential takeover that still needs manual confirmation)
- Exports results to CSV for review

**Important:** This notebook deletes old files from the project directory (Cell 1) and builds the scanning tools from source (Cells 2-3).
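The detection rule described in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the logic inside `scan.py`: the function name `is_takeover_candidate` and the `myshopify.com` suffix check are assumptions made for the example, while the 403/404 statuses come from the description above.

```python
from typing import Optional

# Assumption: Shopify-hosted custom domains CNAME to a host under myshopify.com.
SHOPIFY_SUFFIX = "myshopify.com"
SUSPICIOUS_STATUSES = {403, 404}  # the statuses named in the list above


def is_takeover_candidate(cname: Optional[str], http_status: Optional[int]) -> bool:
    """Return True when a domain points at Shopify and answers 403 or 404."""
    if not cname or http_status is None:
        return False
    # DNS answers often carry a trailing dot and mixed case; normalize both.
    target = cname.rstrip(".").lower()
    points_to_shopify = (
        target == SHOPIFY_SUFFIX or target.endswith("." + SHOPIFY_SUFFIX)
    )
    return points_to_shopify and http_status in SUSPICIOUS_STATUSES


print(is_takeover_candidate("shops.myshopify.com.", 404))  # True
print(is_takeover_candidate("shops.myshopify.com", 200))   # False
print(is_takeover_candidate("evilmyshopify.com", 404))     # False
```

Matching on `"." + SHOPIFY_SUFFIX` rather than the bare suffix avoids false hits on look-alike hosts such as `evilmyshopify.com`. A positive result is only a lead and should be confirmed by hand.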

## Cell 1: Clean Up Old Files

In [None]:
%%bash
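# Abort if the project directory is missing, so the rm commands below never run elsewhere
cd /kaggle/working/subdomain-playground || exit 1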

echo "Removing old documentation files..."
rm -f *.md
rm -f test_*.sh test_*.py
rm -f setup.sh setup_domains.sh run_scan.sh download_wordlist.sh merge_all_domains.sh

echo "✓ Cleanup complete!"
echo ""
echo "Remaining files:"
ls -lh *.py *.txt *.sh 2>/dev/null

## Cell 2: Install Go 1.22

In [None]:
%%bash
set -e

echo "=========================================="
echo "Installing Go 1.22"
echo "=========================================="

cd /kaggle/working

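# Remove old Go, including any earlier /usr/local/go tree (the archive must not be extracted over an existing one)
sudo rm -rf /usr/lib/go* /usr/local/go 2>/dev/null || true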

# Install Go 1.22
echo "Downloading Go 1.22.3..."
wget -q https://go.dev/dl/go1.22.3.linux-amd64.tar.gz -O /tmp/go.tar.gz

echo "Installing Go..."
sudo tar -C /usr/local -xzf /tmp/go.tar.gz

export PATH=/usr/local/go/bin:$PATH

echo ""
echo "Go version:"
/usr/local/go/bin/go version

echo ""
echo "✓ Go 1.22 installed successfully!"

## Cell 3: Build Security Tools

In [None]:
%%bash
export PATH=/usr/local/go/bin:$PATH
cd /kaggle/working/subdomain-playground

echo "=========================================="
echo "Building Security Tools"
echo "=========================================="
echo ""

# Create bin directory
mkdir -p bin

# Build httpx
echo "[1/3] Building httpx (30 seconds)..."
GOBIN=$(pwd)/bin /usr/local/go/bin/go install github.com/projectdiscovery/httpx/cmd/httpx@latest

# Build dnsx
echo "[2/3] Building dnsx (30 seconds)..."
GOBIN=$(pwd)/bin /usr/local/go/bin/go install github.com/projectdiscovery/dnsx/cmd/dnsx@latest

# Build subzy
echo "[3/3] Building subzy (30 seconds)..."
GOBIN=$(pwd)/bin /usr/local/go/bin/go install github.com/LukaSikic/subzy@latest

echo ""
echo "=========================================="
echo "Verification"
echo "=========================================="
echo ""

# Verify tools
echo "Tool versions:"
./bin/httpx -version
./bin/dnsx -version

echo ""
echo "Binary details:"
file bin/httpx
file bin/dnsx
file bin/subzy

echo ""
echo "Tool sizes:"
ls -lh bin/ | grep -E "(httpx|dnsx|subzy)"

echo ""
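# Report success only if every binary actually exists
for tool in httpx dnsx subzy; do
    [ -x "bin/$tool" ] || { echo "✗ bin/$tool is missing; check the build output above"; exit 1; }
done
echo "✓ All tools built successfully!"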

## Cell 4: Configure Environment for Kaggle

In [None]:
%%bash
cd /kaggle/working/subdomain-playground

echo "Configuring environment..."

cat > .env << 'EOF'
DNSX_PATH=/kaggle/working/subdomain-playground/bin/dnsx
HTTPX_PATH=/kaggle/working/subdomain-playground/bin/httpx
SUBZY_PATH=/kaggle/working/subdomain-playground/bin/subzy
EOF

echo "✓ Environment configured"
echo ""
cat .env
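How `scan.py` consumes this `.env` file is not shown in the notebook. As a rough sketch of one way to read it and check the paths before scanning, the helpers below (`load_env` and `missing_tools`, both hypothetical names) parse plain `KEY=VALUE` lines and report entries that do not point at an existing file.

```python
from pathlib import Path
from typing import Dict, List


def load_env(path: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_tools(env: Dict[str, str]) -> List[str]:
    """Return the keys whose value is not the path of an existing file."""
    return [key for key, tool_path in env.items() if not Path(tool_path).is_file()]


# Only runs when a .env file is present in the current directory.
if Path(".env").is_file():
    env = load_env(".env")
    print("Missing tools:", missing_tools(env) or "none")
```

Running this from the project directory after Cell 4 should print `Missing tools: none` when all three binaries were built.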

## Cell 5: Extract Domains from CSV Files

In [None]:
%%bash
cd /kaggle/working/subdomain-playground

echo "=========================================="
echo "Extracting Domains from CSV"
echo "=========================================="
echo ""

# Check if CSV files exist
if [ -d "data/domain_sources/myleadfox" ]; then
    CSV_COUNT=$(ls data/domain_sources/myleadfox/*.csv 2>/dev/null | wc -l)
    echo "Found $CSV_COUNT CSV files"

    # Extract domains (skip the header row of every file, not just the first)
    tail -q -n +2 data/domain_sources/myleadfox/*.csv | \
      cut -d',' -f1 | \
      sed 's/"//g' | \
      grep -E '^[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$' | \
      sort -u > data/all_sources.txt

    DOMAIN_COUNT=$(wc -l < data/all_sources.txt | tr -d ' ')
    echo "✓ Extracted $DOMAIN_COUNT unique domains"
    echo ""
    echo "First 10 domains:"
    head -10 data/all_sources.txt
else
    echo "✗ CSV directory not found!"
    echo "Expected: data/domain_sources/myleadfox/"
fi
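For comparison, the same extraction can be sketched in Python. This is an illustrative equivalent under stated assumptions, not part of the project: the function name `extract_domains` is hypothetical, the domain is assumed to sit in the first column, and every file is assumed to start with a header row. The regular expression is the one used by `grep` above.

```python
import csv
import glob
import re
from typing import Iterable, List

# Same pattern as the grep filter in the cell above.
DOMAIN_RE = re.compile(r"^[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")


def extract_domains(csv_paths: Iterable[str]) -> List[str]:
    """Collect unique, valid-looking domains from the first column of each CSV."""
    domains = set()
    for path in csv_paths:
        with open(path, newline="", encoding="utf-8", errors="replace") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip this file's header row
            for row in reader:
                if not row:
                    continue
                candidate = row[0].strip()
                if DOMAIN_RE.match(candidate):
                    domains.add(candidate)
    return sorted(domains)


paths = sorted(glob.glob("data/domain_sources/myleadfox/*.csv"))
print(f"{len(extract_domains(paths))} unique domains")
```

Unlike `cut -d','`, the `csv` module handles quoted fields that contain commas. The sort order may differ from `sort -u`, which depends on the shell locale.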

## Cell 6: Quick Test (5 domains)

**Important:** Watch for real-time output showing DNS/HTTP info!

In [None]:
%%bash
cd /kaggle/working/subdomain-playground

echo "=========================================="
echo "Quick Test with 5 Domains"
echo "=========================================="
echo ""

# Create test file
head -5 data/all_sources.txt > data/test_5.txt

echo "Testing with:"
cat data/test_5.txt
echo ""

# Run quick test
python scan.py -l data/test_5.txt --shopify-takeover-only --workers 2 --mode quick

echo ""
echo "=========================================="
echo "✓ Test Complete"
echo "=========================================="
echo ""
echo "If you see real-time output above with DNS/HTTP info, the scanner is working!"
echo "You can now run the FULL SCAN in the next cell."

## Cell 7: FULL SCAN - ALL DOMAINS

⚠️ **WARNING:** This will take 5-6 hours for ~10,000 domains!

Kaggle sessions time out after 12 hours, so the scan should finish within a single session.

**You will see real-time progress streaming as domains are scanned.**
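To adapt the estimate to a different list size or worker count, a back-of-the-envelope calculation helps. The sketch below assumes perfectly parallel workers and an average of about 8 seconds per domain; that figure is back-calculated from the 5-6 hour estimate with 4 workers, not measured, and `estimate_hours` is a hypothetical helper.

```python
def estimate_hours(domains: int, seconds_per_domain: float, workers: int) -> float:
    """Estimate wall-clock hours, assuming the work splits evenly across workers."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return domains * seconds_per_domain / workers / 3600


# Assumed average of 8 s per domain with the 4 workers used in the full scan.
print(round(estimate_hours(10_000, 8.0, 4), 2))  # 5.56
```

Real throughput depends on DNS and HTTP timeouts, so treat the result as a rough planning figure against the 12-hour session limit.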

In [None]:
%%bash
cd /kaggle/working/subdomain-playground

echo "=========================================="
echo "STARTING FULL SCAN"
echo "=========================================="
echo ""
echo "Domains to scan: $(wc -l < data/all_sources.txt)"
echo "Estimated time: 5-6 hours"
echo "Workers: 4"
echo ""
echo "=========================================="
echo ""

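# Run full scan; stop before the success banner if the scanner exits with an error
python scan.py -l data/all_sources.txt \
    --shopify-takeover-only \
    --workers 4 \
    --mode quick || { echo "✗ Scan exited with an error; see the output above."; exit 1; }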

echo ""
echo "=========================================="
echo "✓ SCAN COMPLETE!"
echo "=========================================="
echo ""
echo "Results saved to: data/scans/shopify_takeover_candidates.json"

## Cell 8: View Results Summary

In [None]:
import json
import os

os.chdir('/kaggle/working/subdomain-playground')

results_file = 'data/scans/shopify_takeover_candidates.json'

if os.path.exists(results_file):
    with open(results_file, 'r') as f:
        results = json.load(f)

    print("=" * 80)
    print("SHOPIFY TAKEOVER SCAN RESULTS")
    print("=" * 80)
    print(f"\nTotal candidates found: {len(results)}")

    # Count by risk level
    risk_counts = {}
    for r in results:
        risk = r.get('risk_level', 'unknown')
        risk_counts[risk] = risk_counts.get(risk, 0) + 1

    print("\nBreakdown by risk level:")
    for risk, count in sorted(risk_counts.items()):
        print(f"  {risk.upper()}: {count}")

    print("\n" + "=" * 80)
    print("TOP 10 FINDINGS:")
    print("=" * 80)

    sorted_results = sorted(results, key=lambda x: x.get('confidence_score', 0), reverse=True)

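    for i, r in enumerate(sorted_results[:10], 1):
        # Use .get() so a record with a missing field does not abort the summary
        print(f"\n{i}. {r.get('subdomain', 'n/a')}")
        print(f"   CNAME: {r.get('cname', 'n/a')}")
        print(f"   HTTP Status: {r.get('http_status', 'n/a')}")
        print(f"   Risk: {r.get('risk_level', 'unknown')} | Confidence: {r.get('confidence_score', 'n/a')}")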
else:
    print("✗ Results file not found!")
    print(f"Expected: {results_file}")

## Cell 9: Export to CSV

In [None]:
import json
import pandas as pd
import os

os.chdir('/kaggle/working/subdomain-playground')

with open('data/scans/shopify_takeover_candidates.json', 'r') as f:
    results = json.load(f)

df = pd.DataFrame(results)

# Select key columns
columns = [
    'subdomain', 'cname', 'http_status', 'risk_level', 'confidence_score',
    'cname_chain_count', 'final_cname_target', 'a_records', 'provider'
]
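df_export = df[[col for col in columns if col in df.columns]]
# Sort only when the score column exists (it is absent when the results list is empty)
if 'confidence_score' in df_export.columns:
    df_export = df_export.sort_values('confidence_score', ascending=False)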

df_export.to_csv('shopify_results.csv', index=False)

print(f"✓ Exported {len(df_export)} results to shopify_results.csv")
print("\nPreview (top 10):")
display(df_export.head(10))

## Cell 10: Filter High-Risk Only

In [None]:
import pandas as pd
import os

os.chdir('/kaggle/working/subdomain-playground')

df = pd.read_csv('shopify_results.csv')
df_high = df[df['risk_level'].isin(['critical', 'high'])]

print(f"High-risk findings: {len(df_high)} out of {len(df)} total")

if len(df_high) > 0:
    df_high.to_csv('shopify_high_risk.csv', index=False)
    print("✓ Saved to shopify_high_risk.csv")
    print("\nHigh-risk results:")
    display(df_high)
else:
    print("No high-risk findings.")

## Cell 11: Download Results

In [None]:
from IPython.display import FileLink, display
import os

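# FileLink emits a relative link, so paths are given relative to the notebook's
# directory (assumed here to be /kaggle/working) rather than the project folder.
os.chdir('/kaggle/working')
base = 'subdomain-playground'

print("Download your results:")
print("=" * 80)
print("")

files = [
    'shopify_results.csv',
    'shopify_high_risk.csv',
    'data/scans/shopify_takeover_candidates.json'
]

for name in files:
    path = os.path.join(base, name)
    if os.path.exists(path):
        print(f"✓ {path}")
        display(FileLink(path))
    else:
        print(f"- {path} (not found)")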

## OPTIONAL: Manual Verification

Use this cell to manually verify specific findings. Replace the value of `DOMAIN` with a domain from the results.

In [None]:
%%bash
DOMAIN="shop.example.com"  # placeholder: use a subdomain from the results, not its myshopify.com CNAME target

echo "Manual verification for: $DOMAIN"
echo "=========================================="
echo ""

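echo "1. DNS CNAME check:"
dig "$DOMAIN" CNAME +short
echo ""

echo "2. DNS A record check:"
dig "$DOMAIN" A +short
echo ""

echo "3. HTTP status check (headers only, following redirects):"
# -sS hides the progress meter but still prints errors
curl -sS -I -L --max-time 15 "https://$DOMAIN" 2>&1 | head -20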
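Because `curl -I -L` follows redirects, its output can contain several header blocks, and the status that matters is the last one. The sketch below shows one way to pull the status codes out of such output; `status_chain` is a hypothetical helper and the sample headers are made up for illustration.

```python
import re
from typing import List

# Matches status lines such as "HTTP/1.1 301 Moved Permanently" or "HTTP/2 404".
STATUS_LINE = re.compile(r"^HTTP/\d(?:\.\d)?\s+(\d{3})\b")


def status_chain(header_text: str) -> List[int]:
    """Return every HTTP status code in the order the responses arrived."""
    codes = []
    for line in header_text.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            codes.append(int(match.group(1)))
    return codes


# Made-up sample: one redirect followed by the final response.
sample = (
    "HTTP/1.1 301 Moved Permanently\r\n"
    "Location: https://shop.example.com/\r\n"
    "\r\n"
    "HTTP/2 404 \r\n"
    "content-type: text/html\r\n"
)
print(status_chain(sample))  # [301, 404]
```

The final element of the list is the status of the page actually served, which is the value to compare against the 403/404 rule.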