# Selective Notebook Execution Based on Requirements Changes

This notebook demonstrates how to modify the smart workflow so that it runs only the notebooks in directories whose `requirements.txt` file has changed.

## Current Challenge

The current smart workflow treats any `requirements.txt` change as a reason to run full CI on every notebook. With directory-specific requirements files under `notebooks/`, a change should trigger only the notebooks that share its directory:

```
notebooks/
├── data_analysis/
│   ├── requirements.txt  ← Change here should only affect data_analysis notebooks
│   ├── analysis1.ipynb
│   └── analysis2.ipynb
├── visualization/
│   ├── requirements.txt  ← Change here should only affect visualization notebooks
│   ├── plot1.ipynb
│   └── plot2.ipynb
└── modeling/
    ├── requirements.txt  ← Change here should only affect modeling notebooks
    ├── model1.ipynb
    └── model2.ipynb
```

## Solution: Enhanced File Detection Logic

We need to modify the `detect-changes` job to:
1. Detect which specific directories have requirements changes
2. Pass the affected directories to the CI pipeline
3. Run notebooks selectively based on the changed directories

In [None]:
# Enhanced file detection logic for the smart workflow.
# Raw string so that backslash sequences such as \n reach bash unchanged.
detect_changes_script = r'''
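# Get the list of changed files from the last commit.
# HEAD~1 must exist locally, so the checkout needs at least two commits of history
# (for example, fetch-depth: 2 with actions/checkout).
echo "Getting changed files from last commit..."
CHANGED_FILES=$(git diff --name-only HEAD~1 HEAD)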

echo "Changed files:"
echo "$CHANGED_FILES"

# Initialize flags and arrays
NOTEBOOKS_CHANGED=false
DOCS_CONFIG_CHANGED=false
REQUIREMENTS_CHANGED=false
DEPLOY_NEEDED=false
AFFECTED_DIRECTORIES=()
CHANGED_NOTEBOOKS=()

# Check each changed file
while IFS= read -r file; do
  echo "Checking file: $file"
  
  case "$file" in
    notebooks/*.ipynb)
      echo "  → Notebook file changed: $file"
      NOTEBOOKS_CHANGED=true
      DEPLOY_NEEDED=true
      CHANGED_NOTEBOOKS+=("$file")
      
      # Extract directory from notebook path
      dir=$(dirname "$file")
      if [[ ! " ${AFFECTED_DIRECTORIES[@]} " =~ " $dir " ]]; then
        AFFECTED_DIRECTORIES+=("$dir")
      fi
      ;;
    notebooks/*/requirements.txt)
      echo "  → Requirements file changed: $file"
      REQUIREMENTS_CHANGED=true
      DEPLOY_NEEDED=true
      
      # Extract directory from requirements path
      dir=$(dirname "$file")
      echo "  → Affected directory: $dir"
      if [[ ! " ${AFFECTED_DIRECTORIES[@]} " =~ " $dir " ]]; then
        AFFECTED_DIRECTORIES+=("$dir")
      fi
      ;;
    requirements.txt|pyproject.toml|setup.py)
      echo "  → Root requirements file changed: $file"
      REQUIREMENTS_CHANGED=true
      DEPLOY_NEEDED=true
      # Root requirements affect all notebooks
      AFFECTED_DIRECTORIES=("notebooks")
      ;;
    _config.yml|_toc.yml)
      echo "  → Documentation config changed"
      DOCS_CONFIG_CHANGED=true
      DEPLOY_NEEDED=true
      ;;
    *.md|*.rst)
      echo "  → Documentation file changed"
      DOCS_CONFIG_CHANGED=true
      DEPLOY_NEEDED=true
      ;;
    *)
      echo "  → Other file type"
      ;;
  esac
done <<< "$CHANGED_FILES"

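# If the whole notebooks/ tree is affected, its subdirectories are redundant
for dir in "${AFFECTED_DIRECTORIES[@]}"; do
  if [ "$dir" = "notebooks" ]; then
    AFFECTED_DIRECTORIES=("notebooks")
    break
  fi
done

# Convert arrays to compact, single-line JSON for output.
# -c keeps each value on one line, as the name=value form of GITHUB_OUTPUT requires;
# the select() drops the empty string that printf emits for an empty array.
affected_dirs_json=$(printf '%s\n' "${AFFECTED_DIRECTORIES[@]}" | jq -R . | jq -sc 'map(select(length > 0))')
changed_notebooks_json=$(printf '%s\n' "${CHANGED_NOTEBOOKS[@]}" | jq -R . | jq -sc 'map(select(length > 0))')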

echo "Affected directories: $affected_dirs_json"
echo "Changed notebooks: $changed_notebooks_json"

# Determine workflow path
if [ "$NOTEBOOKS_CHANGED" = "true" ] || [ "$REQUIREMENTS_CHANGED" = "true" ]; then
  echo "📔 Notebooks or requirements changed - Selective CI needed"
  echo "notebooks-changed=true" >> "$GITHUB_OUTPUT"
  echo "docs-only=false" >> "$GITHUB_OUTPUT"
  echo "affected-directories=$affected_dirs_json" >> "$GITHUB_OUTPUT"
  echo "changed-notebooks=$changed_notebooks_json" >> "$GITHUB_OUTPUT"
elif [ "$DOCS_CONFIG_CHANGED" = "true" ]; then
  echo "📚 Only documentation/config files changed - Docs rebuild only"
  echo "notebooks-changed=false" >> "$GITHUB_OUTPUT"
  echo "docs-only=true" >> "$GITHUB_OUTPUT"
else
  echo "📝 No significant changes detected - Skip CI"
  echo "notebooks-changed=false" >> "$GITHUB_OUTPUT"
  echo "docs-only=false" >> "$GITHUB_OUTPUT"
  DEPLOY_NEEDED=false
fi

echo "deploy-needed=$DEPLOY_NEEDED" >> "$GITHUB_OUTPUT"
'''

print("Enhanced detection script created!")
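
The classification rules in the script can be checked locally before they go into a workflow. The following is a minimal Python sketch that mirrors the notebook and requirements branches of the `case` statement. The function name `affected_directories` is hypothetical, and the sketch assumes that a root-level dependency change should override everything else regardless of the order in which files are listed.

```python
from fnmatch import fnmatchcase
from posixpath import dirname

# Root-level files whose change affects every notebook (same list as the case statement).
ROOT_DEPENDENCY_FILES = {"requirements.txt", "pyproject.toml", "setup.py"}


def affected_directories(changed_files):
    """Return the sorted notebook directories that need CI for the given paths."""
    affected = set()
    for path in changed_files:
        if path in ROOT_DEPENDENCY_FILES:
            # Assumption: a root-level change wins, whatever else is in the commit.
            return ["notebooks"]
        # Like a bash case pattern, fnmatch lets * match across "/" separators.
        if fnmatchcase(path, "notebooks/*.ipynb") or fnmatchcase(
            path, "notebooks/*/requirements.txt"
        ):
            affected.add(dirname(path))
    return sorted(affected)


print(affected_directories(["notebooks/data_analysis/requirements.txt"]))
print(affected_directories(["requirements.txt", "notebooks/modeling/model1.ipynb"]))
```

Sorting the result keeps the output stable between runs, which makes the matrix easier to compare across commits.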

## Modified Workflow Structure

The enhanced workflow exposes these outputs from the `detect-changes` job. GitHub Actions outputs are always strings, so the flags hold `'true'` or `'false'` and the lists hold JSON-encoded arrays:

- `notebooks-changed`: `'true'` if notebooks or requirements changed
- `docs-only`: `'true'` for documentation-only changes
- `affected-directories`: JSON array of directories that need CI
- `changed-notebooks`: JSON array of specific notebooks that changed
- `deploy-needed`: `'true'` if deployment is required
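
Each output is written as a single `name=value` line to the file named by `GITHUB_OUTPUT`, so JSON arrays have to be serialized without line breaks. The sketch below shows one way to produce those lines from Python. The helper name `write_outputs` is hypothetical, and a temporary file stands in for the real output file.

```python
import json
import os
import tempfile


def write_outputs(outputs, output_path):
    """Append name=value lines, encoding flags and lists as single-line strings."""
    with open(output_path, "a", encoding="utf-8") as handle:
        for name, value in outputs.items():
            if isinstance(value, bool):
                text = "true" if value else "false"
            elif isinstance(value, (list, dict)):
                # Compact separators guarantee the JSON stays on one line.
                text = json.dumps(value, separators=(",", ":"))
            else:
                text = str(value)
            handle.write(f"{name}={text}\n")


# Demo against a temporary file standing in for $GITHUB_OUTPUT.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    demo_path = tmp.name

demo_outputs = {
    "notebooks-changed": True,
    "docs-only": False,
    "affected-directories": ["notebooks/data_analysis"],
    "deploy-needed": True,
}
write_outputs(demo_outputs, demo_path)

with open(demo_path, encoding="utf-8") as handle:
    written = handle.read()
os.remove(demo_path)
print(written)
```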

In [None]:
# Example of how the CI pipeline job would be modified
selective_ci_job = '''
# Selective notebook CI with directory-specific execution
selective-notebook-ci:
  needs: detect-changes
  if: needs.detect-changes.outputs.notebooks-changed == 'true'
  strategy:
    matrix:
      directory: ${{ fromJson(needs.detect-changes.outputs.affected-directories) }}
  uses: spacetelescope/notebook-ci-actions/.github/workflows/ci_pipeline.yml@main
  with:
    python-version: "3.11"
    execution-mode: "full"
    notebook-path: ${{ matrix.directory }}  # New parameter for directory-specific execution
    build-html: false
    security-scan: true
  secrets:
    CASJOBS_USERID: ${{ secrets.CASJOBS_USERID }}
    CASJOBS_PW: ${{ secrets.CASJOBS_PW }}
'''

print("Selective CI job structure defined!")
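
Because the matrix starts one job per entry, a list that contains both `notebooks` and one of its subdirectories would run the same notebooks twice. A minimal sketch of collapsing such overlaps before the list is emitted follows. The helper name `collapse_nested` is hypothetical and is not part of the workflow or of `notebook-ci-actions`.

```python
from pathlib import PurePosixPath


def collapse_nested(directories):
    """Drop duplicates and any directory whose ancestor is also listed."""
    unique = list(dict.fromkeys(directories))  # de-duplicate, keep input order
    listed = {PurePosixPath(d) for d in unique}
    kept = []
    for directory in unique:
        ancestors = PurePosixPath(directory).parents
        if not any(parent in listed for parent in ancestors):
            kept.append(directory)
    return kept


overlapping = ["notebooks/data_analysis", "notebooks", "notebooks/modeling"]
print(collapse_nested(overlapping))
```

With the overlapping input above, only the `notebooks` entry survives, so the matrix runs a single job covering every directory.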

## Benefits of Selective Execution

1. **Faster CI**: Only affected notebooks run, reducing execution time
2. **Resource Efficiency**: Lower compute usage and cost
3. **Parallel Execution**: Different directories can run in parallel
4. **Targeted Testing**: Changes are tested in isolation
5. **Better Debugging**: Easier to identify which directory caused issues

## Example Scenarios

### Scenario 1: Requirements Change in Single Directory
```
Changed files:
- notebooks/data_analysis/requirements.txt

Result:
- affected-directories: ["notebooks/data_analysis"]
- Only notebooks in data_analysis/ directory are executed
- Execution time: ~5-10 minutes instead of 20-30 minutes
```

### Scenario 2: Notebook Changes in Multiple Directories
```
Changed files:
- notebooks/data_analysis/analysis1.ipynb
- notebooks/visualization/plot1.ipynb

Result:
- affected-directories: ["notebooks/data_analysis", "notebooks/visualization"]
- Both directories run in parallel
- Faster than sequential execution
```

### Scenario 3: Root Requirements Change
```
Changed files:
- requirements.txt (root level)

Result:
- affected-directories: ["notebooks"]
- All notebooks run (safety first for global changes)
- Same behavior as current workflow
```

## Implementation Requirements

To implement this selective execution, we need:

1. **Enhanced CI Pipeline**: Modify `ci_pipeline.yml` to accept a `notebook-path` parameter
2. **Directory Structure**: Ensure each notebook subdirectory has its own `requirements.txt`
3. **Matrix Strategy**: Use GitHub Actions matrix to run multiple directories in parallel
4. **Error Handling**: Graceful handling when directories don't exist
5. **Documentation**: Clear guidelines for organizing notebook directories
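
For the error-handling item, one case worth covering is a commit that deletes a directory: its `requirements.txt` still appears in the diff, but there is nothing left to run. The sketch below filters the affected directories down to those that exist and contain at least one notebook. The function name `runnable_directories` and the skip-instead-of-fail policy are assumptions, not behavior of the existing pipeline.

```python
import tempfile
from pathlib import Path


def runnable_directories(directories, repo_root="."):
    """Split directories into those with notebooks to run and those to skip."""
    runnable, skipped = [], []
    for directory in directories:
        path = Path(repo_root) / directory
        # Ignore Jupyter's autosave copies when looking for notebooks.
        has_notebooks = path.is_dir() and any(
            ".ipynb_checkpoints" not in nb.parts for nb in path.rglob("*.ipynb")
        )
        (runnable if has_notebooks else skipped).append(directory)
    return runnable, skipped


# Demo: one directory exists with a notebook, the other was removed.
with tempfile.TemporaryDirectory() as root:
    nb_dir = Path(root) / "notebooks" / "data_analysis"
    nb_dir.mkdir(parents=True)
    (nb_dir / "analysis1.ipynb").write_text("{}", encoding="utf-8")
    runnable, skipped = runnable_directories(
        ["notebooks/data_analysis", "notebooks/removed"], repo_root=root
    )

print("run:", runnable)
print("skip:", skipped)
```

Returning the skipped entries separately lets the workflow log them rather than dropping them silently.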

In [None]:
# Example directory structure validation
# The shebang must be the first line of the script, so it follows the opening quotes directly
validation_script = '''#!/bin/bash
# Validate notebook directory structure for selective execution

echo "Validating notebook directory structure..."

# Check if notebooks directory exists
if [ ! -d "notebooks" ]; then
  echo "❌ No notebooks directory found"
  exit 1
fi

# Find all subdirectories in notebooks/
subdirs=$(find notebooks -mindepth 1 -maxdepth 1 -type d)

if [ -z "$subdirs" ]; then
  echo "ℹ️  No subdirectories found in notebooks/"
  echo "ℹ️  Selective execution will treat entire notebooks/ as one unit"
else
  echo "📁 Found notebook subdirectories:"
  # Read one path per line so directory names containing spaces stay intact
  while IFS= read -r dir; do
    echo "  - $dir"
    
    # Check for requirements.txt in each subdirectory
    if [ -f "$dir/requirements.txt" ]; then
      echo "    ✅ Has requirements.txt"
    else
      echo "    ⚠️  No requirements.txt (will use root requirements)"
    fi
    
    # Count notebooks in directory, ignoring Jupyter checkpoint copies
    notebook_count=$(find "$dir" -name "*.ipynb" ! -path "*/.ipynb_checkpoints/*" | wc -l)
    echo "    📓 Contains $notebook_count notebook(s)"
  done <<< "$subdirs"
fi

echo "✅ Validation complete"
'''

print("Directory structure validation script created!")

## Next Steps

To implement selective notebook execution:

1. **Modify the smart workflow** with the enhanced detection logic
2. **Update the CI pipeline** to accept directory-specific parameters
3. **Test with different scenarios** to ensure reliability
4. **Document the new structure** for repository maintainers
5. **Gradually roll out** to repositories with appropriate directory structures

This approach shortens CI runs for changes confined to a single directory while keeping the full-run fallback of the current workflow for root-level dependency changes.