# Red Project - Vast.ai Provisioning Script Diagnostics

This notebook helps diagnose issues with the provisioning scripts and supervisor configuration in the Red Project Vast.ai environment. It provides tools to:

1. Check if the provisioning scripts have been executed
2. Validate supervisor installation and configuration
3. Check service statuses (SSH, VNC, Jupyter)
4. View logs from key components
5. Fix common issues with the Vast.ai integration

Last updated: April 29, 2025

## Environment Overview

The Red Project is a containerized ML environment designed for Vast.ai GPU instances. It consists of:

- **VisoMaster**: Core ML application
- **VNC Server**: For graphical interface access
- **JupyterLab**: For interactive development
- **SSH**: For secure connections

All these services should be managed by `supervisor`.

In [None]:
# Check OS information and environment context
import os
import subprocess
import sys
import platform

print(f"\nPython Version: {sys.version}")
print(f"Platform: {platform.platform()}")
print(f"Current working directory: {os.getcwd()}")
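print("\nEnvironment Variables:\n")
for var in ['JUPYTER_BASE_URL', 'PROVISIONING_SCRIPT', 'VNC_RESOLUTION', 'VNC_PW']:
    value = os.environ.get(var, 'Not set')
    # Avoid echoing the VNC password into saved notebook output
    if var == 'VNC_PW' and var in os.environ:
        value = 'Set (hidden)'
    print(f"  {var}: {value}")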

## 1. Check Provisioning Scripts

Let's check whether the provisioning scripts exist in any of the expected locations, and report each one's modification time and size.

In [None]:
# Check for provisioning script locations
script_locations = [
    '/vast_ai_provisioning_script.sh',
    '/src/provisioning_script.sh',
    '/VisoMaster/src/provisioning_script.sh',
    '/root/provisioning_script.sh'
]

print("Checking for provisioning scripts...\n")

for location in script_locations:
    if os.path.exists(location):
        print(f"✅ Found script at: {location}")
        # Get file modification time
        mtime = os.path.getmtime(location)
        print(f"   Last modified: {subprocess.check_output(['date', '-d', f'@{mtime}'], text=True).strip()}")
        print(f"   Size: {os.path.getsize(location)} bytes\n")
    else:
        print(f"❌ No script at: {location}\n")

In [None]:
# Check for evidence that the provisioning script has been executed
log_locations = [
    '/logs/onstart_script.log',
    '/logs/provisioning_script.log',
    '/logs/startup.log'
]

print("Checking for provisioning execution logs...\n")

for log in log_locations:
    if os.path.exists(log):
        print(f"✅ Found log at: {log}")
        # Get file modification time
        mtime = os.path.getmtime(log)
        print(f"   Last modified: {subprocess.check_output(['date', '-d', f'@{mtime}'], text=True).strip()}")
        print(f"   Size: {os.path.getsize(log)} bytes")
        # Show last few lines of the log
        try:
            last_lines = subprocess.check_output(["tail", "-n", "5", log], text=True)
            print(f"   Last few lines:\n{last_lines}\n")
        except Exception as e:
            print(f"   Could not read log: {e}\n")
    else:
        print(f"❌ No log at: {log}\n")
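
The two cells above shell out to `date -d @<mtime>` to format timestamps, which relies on GNU `date` being available. As a minimal sketch, the same size and modification-time report can be produced with the standard library alone. `describe_file` is a hypothetical helper name, not part of the project.

```python
import os
from datetime import datetime

def describe_file(path):
    """Return (size_in_bytes, last_modified_string) for an existing path."""
    stat = os.stat(path)
    # Format the modification time in local time without calling `date`
    modified = datetime.fromtimestamp(stat.st_mtime).strftime("%Y-%m-%d %H:%M:%S")
    return stat.st_size, modified

# Example: describe the current working directory, which always exists
size, modified = describe_file(os.getcwd())
print(f"Size: {size} bytes, last modified: {modified}")
```

The helper could stand in for the `subprocess.check_output(['date', ...])` calls wherever a path is already known to exist.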

## 2. Check Supervisor Status

Let's verify if supervisor is installed and running correctly.

In [None]:
# Check supervisor installation and configuration
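def run_cmd(cmd):
    """Run a shell command line and return its combined stdout/stderr as text."""
    try:
        # shell=True is required because several cells pass pipelines
        # (e.g. "ps aux | grep sshd"); cmd.split() would hand "|" and "grep"
        # to the first program as literal arguments.
        return subprocess.check_output(cmd, shell=True, stderr=subprocess.STDOUT, text=True)
    except subprocess.CalledProcessError as e:
        # Note: grep exits with code 1 when nothing matches, so an empty
        # "ERROR (code 1)" after a pipeline usually means "no match".
        return f"ERROR (code {e.returncode}): {e.output}"
    except Exception as e:
        return f"ERROR: {str(e)}"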

# Check if supervisor is installed
print("Checking supervisor installation:\n")
print(run_cmd("which supervisord"))
print(run_cmd("supervisord --version"))

# Check supervisor configuration
print("\nChecking supervisor configuration:\n")
config_paths = [
    '/etc/supervisor/conf.d/supervisord.conf',
    '/etc/supervisord.conf'
]

for path in config_paths:
    if os.path.exists(path):
        print(f"✅ Found config at: {path}")
        try:
            with open(path, 'r') as f:
                config_content = f.read()
            print(f"\nConfiguration content preview (first 400 chars):\n")
            print(config_content[:400] + "...")
        except Exception as e:
            print(f"❌ Could not read config: {e}")
    else:
        print(f"❌ No config at: {path}")
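
The preview above shows only the first 400 characters of the configuration. Because supervisor configuration files use an INI-style layout, a rough sketch with `configparser` can list the `[program:x]` sections instead. The function name and the sample configuration below are illustrative assumptions, not the project's actual file.

```python
import configparser

def list_supervisor_programs(config_text):
    """Return the names of all [program:x] sections in supervisor config text."""
    # interpolation=None leaves supervisor's own %(...)s expansions untouched;
    # strict=False tolerates duplicate sections or keys instead of raising.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(config_text)
    prefix = "program:"
    return [s[len(prefix):] for s in parser.sections() if s.startswith(prefix)]

# Illustrative sample only; the real file may define different programs
sample = """
[supervisord]
nodaemon=true
logfile=/logs/supervisord.log

[program:sshd]
command=/usr/sbin/sshd -D

[program:jupyter]
command=jupyter lab --port=8888
"""
print(list_supervisor_programs(sample))
```

To inspect the real file, pass the `config_content` string read in the cell above. The returned names are the ones `supervisorctl` accepts for `status` and `restart`.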

In [None]:
# Check supervisor process status
print("Checking if supervisor is running:\n")
print(run_cmd("ps aux | grep supervisord"))

# Check supervisor managed services
print("\nChecking supervisor managed services:\n")
print(run_cmd("supervisorctl status"))
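
The raw `supervisorctl status` text can be hard to scan when many programs are defined. As a sketch, the function below turns that text into a name-to-state mapping so that non-running programs can be picked out. The sample output is illustrative and `parse_supervisor_status` is a hypothetical helper name.

```python
# Process states documented by supervisor
SUPERVISOR_STATES = {
    "STOPPED", "STARTING", "RUNNING", "BACKOFF",
    "STOPPING", "EXITED", "FATAL", "UNKNOWN",
}

def parse_supervisor_status(output):
    """Map each program name to its state from `supervisorctl status` text."""
    states = {}
    for line in output.splitlines():
        parts = line.split()
        # Expect "<name> <STATE> ..."; skip blank lines and error messages
        if len(parts) >= 2 and parts[1] in SUPERVISOR_STATES:
            states[parts[0]] = parts[1]
    return states

# Illustrative sample, not captured from a real instance
sample = """\
sshd                             RUNNING   pid 42, uptime 0:01:23
jupyter                          RUNNING   pid 43, uptime 0:01:22
vnc                              FATAL     Exited too quickly (process log may have details)
"""
states = parse_supervisor_status(sample)
not_running = [name for name, state in states.items() if state != "RUNNING"]
print(not_running)
```

On a live instance the input would be the string returned by `run_cmd("supervisorctl status")`. Lines that do not match the expected shape, such as a socket connection error, are ignored, so an empty result means no program state could be read.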

## 3. Check Critical Services

Let's check if the critical services are running.

In [None]:
# Check critical services status
print("Checking SSH service:\n")
print(run_cmd("ps aux | grep sshd"))
print("\nChecking SSH port:\n")
print(run_cmd("netstat -tuln | grep 22"))

print("\nChecking VNC service:\n")
print(run_cmd("ps aux | grep vnc"))
print("\nChecking VNC ports:\n")
print(run_cmd("netstat -tuln | grep 590"))
print(run_cmd("netstat -tuln | grep 6080"))

print("\nChecking Jupyter service:\n")
print(run_cmd("ps aux | grep jupyter"))
print("\nChecking Jupyter port:\n")
print(run_cmd("netstat -tuln | grep 8888"))
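
Filtering `netstat` output with `grep 22` also matches unrelated ports such as 2222, and `netstat` may not be installed in a minimal image. As an alternative sketch, a port can be probed by attempting a TCP connection to it. `tcp_port_open` is a hypothetical helper, and the port list simply repeats the ports checked above.

```python
import socket

def tcp_port_open(port, host="127.0.0.1", timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts
        return False

for name, port in [("SSH", 22), ("VNC", 5901), ("WebSocket proxy", 6080), ("Jupyter", 8888)]:
    status = "open" if tcp_port_open(port) else "closed"
    print(f"{name:<16} port {port:<5} {status}")
```

This probes the loopback interface only, so a service bound exclusively to another interface would be reported as closed.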

## 4. View Important Logs

Let's examine the key logs to understand what's happening.

In [None]:
# Check logs directory contents
print("Checking /logs directory contents:\n")
try:
    logs = os.listdir("/logs")
    if logs:
        for log in logs:
            log_path = os.path.join("/logs", log)
            size = os.path.getsize(log_path)
            mtime = os.path.getmtime(log_path)
            print(f"{log} - Size: {size} bytes, Last modified: {subprocess.check_output(['date', '-d', f'@{mtime}'], text=True).strip()}")
    else:
        print("No log files found in /logs directory")
except Exception as e:
    print(f"Error accessing logs directory: {e}")

In [None]:
# View supervisor log
supervisor_log = '/logs/supervisord.log'
print(f"Viewing supervisor log: {supervisor_log}\n")

if os.path.exists(supervisor_log):
    try:
        last_lines = subprocess.check_output(["tail", "-n", "50", supervisor_log], text=True)
        print(last_lines)
    except Exception as e:
        print(f"Error reading log: {e}")
else:
    print(f"Log file does not exist: {supervisor_log}")
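
The log cells above depend on the external `tail` command. A minimal pure-Python sketch of the same idea keeps only the last `n` lines in a bounded `deque`. `tail_lines` is a hypothetical helper name.

```python
import os
from collections import deque

def tail_lines(path, n=50):
    """Return the last n lines of a text file as a list of strings."""
    # errors="replace" avoids a crash on stray non-UTF-8 bytes in logs
    with open(path, "r", errors="replace") as f:
        # A deque with maxlen discards older lines as newer ones arrive
        return list(deque(f, maxlen=n))

log_path = "/logs/supervisord.log"
if os.path.exists(log_path):
    print("".join(tail_lines(log_path, 50)), end="")
else:
    print(f"Log file does not exist: {log_path}")
```

The file is still read from the start, so for very large logs the external `tail` remains the faster option.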

## 5. Fix Common Issues

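Each cell below defines a fix as a function and leaves the call commented out. Uncomment the call only when the diagnostics above point to that specific problem.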

In [None]:
# Fix: Supervisor not running
def start_supervisor():
    print("Attempting to start supervisor...\n")
    result = run_cmd("supervisord -c /etc/supervisor/conf.d/supervisord.conf")
    print(result)
    print("\nVerifying supervisor status after start attempt:\n")
    print(run_cmd("ps aux | grep supervisord"))
    print(run_cmd("supervisorctl status"))

# Only run this when needed
# start_supervisor()

In [None]:
# Fix: Missing logs directory
def ensure_logs_directory():
    print("Ensuring /logs directory exists with proper permissions...")
    try:
        os.makedirs('/logs', exist_ok=True)
        # Make sure everyone can write to it; os.chmod raises on failure,
        # so the success message below is only printed when it worked
        os.chmod('/logs', 0o777)
        print("✅ Logs directory created/fixed successfully")
    except Exception as e:
        print(f"❌ Error creating logs directory: {e}")

# Only run this when needed
# ensure_logs_directory()

In [None]:
# Fix: Supervisor configuration issues
def copy_supervisor_config():
    print("Copying supervisor configuration to the correct location...")
    try:
        # Source locations to check
        source_paths = [
            '/src/supervisord.conf',
            '/VisoMaster/src/supervisord.conf'
        ]
        
        source_path = None
        for path in source_paths:
            if os.path.exists(path):
                source_path = path
                break
        
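        if source_path:
            import shutil  # local import keeps this cell self-contained
            # Create directory if it doesn't exist
            os.makedirs('/etc/supervisor/conf.d', exist_ok=True)
            # Copy the config file; shutil.copy raises on failure, so the
            # success message is only printed when the copy actually worked
            shutil.copy(source_path, '/etc/supervisor/conf.d/supervisord.conf')
            print(f"✅ Copied supervisor config from {source_path} to /etc/supervisor/conf.d/supervisord.conf")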
        else:
            print("❌ Could not find supervisor configuration file in known locations")
    except Exception as e:
        print(f"❌ Error copying supervisor config: {e}")

# Only run this when needed
# copy_supervisor_config()

## 6. Restart Services

The functions below restart individual services, or all services at once, through `supervisorctl`.

In [None]:
# Restart individual services
def restart_service(service_name):
    print(f"Restarting {service_name} service...")
    result = run_cmd(f"supervisorctl restart {service_name}")
    print(result)
    print(f"\nChecking {service_name} status after restart:")
    print(run_cmd(f"supervisorctl status {service_name}"))

# Available services: sshd, jupyter, vnc
# restart_service('sshd')
# restart_service('jupyter')
# restart_service('vnc')

In [None]:
# Restart all services
def restart_all_services():
    print("Restarting all supervised services...")
    result = run_cmd("supervisorctl restart all")
    print(result)
    print("\nChecking all services status after restart:")
    print(run_cmd("supervisorctl status"))

# Only run this when needed
# restart_all_services()

## 7. Additional System Diagnostics

Check system resources and overall health.

In [None]:
# System resources check
print("Checking system resources:\n")
print("CPU and Memory Usage:")
print(run_cmd("top -bn1 | head -15"))

print("\nDisk Space:")
print(run_cmd("df -h"))

print("\nGPU Status:")
print(run_cmd("nvidia-smi"))
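
The `df -h` output above has to be read by eye. As a small sketch, free space can also be checked programmatically with `shutil.disk_usage`. The 5 GiB threshold and the helper name are arbitrary assumptions for illustration, not project requirements.

```python
import shutil

def free_space_gib(path="/"):
    """Return free space on the filesystem containing `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30

MIN_FREE_GIB = 5  # hypothetical threshold; adjust to the workload

free = free_space_gib("/")
flag = "OK" if free >= MIN_FREE_GIB else "LOW"
print(f"/ free space: {free:.1f} GiB [{flag}]")
```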

In [None]:
# Check network connections
print("Active network connections:\n")
print(run_cmd("netstat -tuln"))

## 8. Manual Service Control

If supervisor is not working, you can manually start services.

In [None]:
# Manually start SSH server
def manual_start_sshd():
    print("Manually starting SSH server...")
    # Make sure the directory exists
    os.system("mkdir -p /var/run/sshd")
    result = run_cmd("/usr/sbin/sshd")
    print(result)
    print("\nVerifying SSH service status:")
    print(run_cmd("ps aux | grep sshd"))
    print(run_cmd("netstat -tuln | grep 22"))

# Only run this when needed
# manual_start_sshd()

In [None]:
# Manually start VNC server
def manual_start_vnc():
    print("Manually starting VNC server...")
    
    # Kill existing VNC processes if any
    os.system("pkill -f vnc 2>/dev/null || true")
    
    # Initialize Xauthority
    os.system("touch /root/.Xauthority")
    os.system("xauth generate :1 . trusted")
    
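    # Create VNC password if it doesn't exist.
    # Assumption: the VNC_PW environment variable holds the intended password;
    # the original hardcoded value is kept only as a fallback.
    if not os.path.exists("/root/.vnc/passwd"):
        os.makedirs("/root/.vnc", exist_ok=True)
        vnc_pw = os.environ.get("VNC_PW", "vncpasswd123")
        # vncpasswd -f reads the password on stdin and writes the obfuscated form to stdout
        obfuscated = subprocess.run(["vncpasswd", "-f"], input=(vnc_pw + "\n").encode(),
                                    stdout=subprocess.PIPE, check=True).stdout
        with open("/root/.vnc/passwd", "wb") as f:
            f.write(obfuscated)
        os.chmod("/root/.vnc/passwd", 0o600)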
    
    # Start VNC server
    os.system("vncserver :1 -depth 24 -geometry 1280x800 -localhost no")
    
    # Start WebSockets proxy
    os.system("websockify 0.0.0.0:6080 localhost:5901 &")
    
    print("\nVerifying VNC service status:")
    print(run_cmd("ps aux | grep vnc"))
    print(run_cmd("netstat -tuln | grep 590"))
    print(run_cmd("netstat -tuln | grep 6080"))

# Only run this when needed
# manual_start_vnc()

## 9. Overall Provisioning Status Report

Generate a summary report of the environment status.

In [None]:
# Generate overall status report
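def check_service_running(service_name, process_pattern):
    """Return True if any process command line contains process_pattern.

    service_name is a descriptive label only; it is not used for matching.
    """
    try:
        output = subprocess.check_output(["ps", "aux"], text=True)
        # ps is called directly (no grep pipeline), so there is no grep
        # process to exclude; skip the header row and scan the rest
        return any(process_pattern in line for line in output.splitlines()[1:])
    except Exception:
        return False

def check_port_listening(port):
    """Return True if netstat lists a listening socket on the given port."""
    try:
        output = subprocess.check_output(["netstat", "-tuln"], text=True)
        return f":{port} " in output
    except Exception:
        return False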

def check_directory_exists(directory):
    return os.path.exists(directory) and os.path.isdir(directory)

def check_file_exists(file_path):
    return os.path.exists(file_path) and os.path.isfile(file_path)

print("=== Red Project Vast.ai Environment Status Report ===\n")

# Check key components
components = [
    ("Supervisor Running", check_service_running("supervisord", "supervisord"), "Critical"),
    ("SSH Server Running", check_service_running("sshd", "sshd"), "Critical"),
    ("SSH Port (22) Open", check_port_listening(22), "Critical"),
    ("VNC Server Running", check_service_running("vnc", "Xvnc"), "Important"),
    ("VNC Port (5901) Open", check_port_listening(5901), "Important"),
    ("WebSocket Proxy Port (6080) Open", check_port_listening(6080), "Important"),
    ("Jupyter Running", check_service_running("jupyter", "jupyter"), "Important"),
    ("Jupyter Port (8888) Open", check_port_listening(8888), "Important"),
    ("Logs Directory Exists", check_directory_exists("/logs"), "Critical"),
    ("VisoMaster Installed", check_directory_exists("/VisoMaster"), "Critical"),
    ("Supervisor Config Exists", check_file_exists("/etc/supervisor/conf.d/supervisord.conf"), "Critical")
]

# Print status table
print(f"{'Component':<40} {'Status':<10} {'Priority':<10}")
print("-" * 60)

overall_status = True
for component, status, priority in components:
    status_str = "✅ OK" if status else "❌ FAIL"
    if not status and priority == "Critical":
        overall_status = False
    print(f"{component:<40} {status_str:<10} {priority:<10}")

print("\nOverall Status: " + ("✅ Environment appears to be functioning properly" if overall_status else "❌ Critical components are missing or not running"))

if not overall_status:
    print("\nRecommended actions:")
    print("1. Run the 'ensure_logs_directory()' function to create the logs directory")
    print("2. Run the 'copy_supervisor_config()' function to fix supervisor configuration")
    print("3. Run the 'start_supervisor()' function to start supervisor")
    print("4. If supervisor still fails, use the manual service start functions")