# Step 2: Image Generation for VisCYPNet

_In this notebook, we read the scaffold‑split CSVs for each CYP450 isoform and render every molecule as a 224×224 PNG in two styles: **clean** and **sketch**. RDKit’s MolDraw2DCairo does the rendering, and PIL applies the sketch effect (a Gaussian blur followed by a contrast boost)._


## No Augmentation

In [7]:
# Cell 1: Imports & Configuration

import io
import os
import warnings

import pandas as pd
from PIL import Image, ImageFilter, ImageEnhance
from rdkit import Chem, RDLogger
from rdkit.Chem.Draw import rdMolDraw2D

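# Suppress Python warnings and RDKit log messages for cleaner notebook output.
# Note: this also hides RDKit's SMILES parse errors; invalid SMILES are still
# reported through the ValueError raised in draw_molecule.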
warnings.filterwarnings('ignore')
RDLogger.DisableLog('rdApp.*')


In [8]:
# Cell 2: Molecule Drawing Function

def draw_molecule(smiles: str, size: int = 224, bond_width: int = 2) -> Image.Image:
    """
    Convert a SMILES string to a PIL Image of size (size x size) using RDKit MolDraw2DCairo.
    
    Parameters:
    - smiles: the SMILES string of the molecule
    - size: the pixel width and height of the output image
    - bond_width: line thickness for bonds
    
    Returns:
    A PIL Image object.
    
    Raises:
    ValueError if the SMILES string cannot be parsed.
    """
    # Convert SMILES to RDKit Mol object
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # If SMILES parsing fails, inform the caller
        raise ValueError(f"Invalid SMILES: {smiles}")
    
    # Initialize a Cairo drawer of the given size
    drawer = rdMolDraw2D.MolDraw2DCairo(size, size)
    opts = drawer.drawOptions()
    opts.bondLineWidth = bond_width  # set bond line thickness
    
    # Draw the molecule and finalize the image
    drawer.DrawMolecule(mol)
    drawer.FinishDrawing()
    
    # Get raw PNG bytes and open with PIL
    png_bytes = drawer.GetDrawingText()
    return Image.open(io.BytesIO(png_bytes))


In [9]:
# Cell 3: Image Generation Function

def generate_images(isoform: str, split: str, style: str):
    """
    Read the processed CSV for one isoform+split, render each molecule to PNG, 
    and save in the appropriate folder and style.
    
    Parameters:
    - isoform: e.g. '3A4' (must match the prefix of the CSV file name)
    - split: one of 'train', 'val', 'test'
    - style: 'clean' or 'sketch'
    """
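    # Reject unknown styles early; otherwise a typo would silently produce
    # clean images in a misnamed folder
    if style not in ('clean', 'sketch'):
        raise ValueError(f"Unknown style: {style!r}")
    
    # Construct input CSV path from the {isoform}_{split}.csv naming convention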
    csv_path = f"../data/processed/{isoform}_{split}.csv"
    # Construct output folder path
    out_dir  = f"../images/{isoform}/{split}/{style}"
    os.makedirs(out_dir, exist_ok=True)
    
    # Load the CSV (expects columns: Drug_ID, Drug, Y)
    df = pd.read_csv(csv_path)
    
    # Iterate row‑by‑row
    for _, row in df.iterrows():
        drug_id = int(row['Drug_ID']) # unique identifier
        smiles  = row['Drug']         # SMILES string
        label   = int(row['Y'])       # binary label 0/1
        
        try:
            # Render base image
            img = draw_molecule(smiles, size=224, bond_width=2)
        except ValueError as e:
            # Skip invalid SMILES
            print(f"[WARN] {e}")
            continue
        
        # If sketch style, apply blur + contrast boost
        if style == 'sketch':
            img = img.filter(ImageFilter.GaussianBlur(radius=1.5))
            img = ImageEnhance.Contrast(img).enhance(1.8)
        
        # Save the final PNG with the pattern {Drug_ID}_{label}.png
        out_path = os.path.join(out_dir, f"{drug_id}_{label}.png")
        img.save(out_path)
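
Because the label is encoded in the file name, downstream code can recover it without re-reading the CSV. The following is a minimal sketch, assuming the `{Drug_ID}_{label}.png` pattern used above. The helper name `parse_image_name` is hypothetical and not part of the original notebook.

```python
from pathlib import Path
from typing import Tuple


def parse_image_name(path: str) -> Tuple[int, int]:
    """Recover (drug_id, label) from a file named {Drug_ID}_{label}.png."""
    # Drop the directory and the extension: ".../123_1.png" -> "123_1"
    stem = Path(path).stem
    # Split on the last underscore so the label is always the final field
    drug_id_str, label_str = stem.rsplit("_", 1)
    label = int(label_str)
    if label not in (0, 1):
        # The notebook writes binary labels only; anything else is suspect
        raise ValueError(f"Unexpected label in file name: {path}")
    return int(drug_id_str), label
```

For example, `parse_image_name("../images/3A4/train/clean/123_1.png")` returns `(123, 1)`.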


In [5]:
# Cell 4: Loop Over Splits & Styles

# Specify the isoform you want to process
isoform = '3A4'

# All dataset splits
splits = ['train', 'val', 'test']
# Both rendering styles
styles = ['clean', 'sketch']

# Nested loops to run generation for every combination
for split in splits:
    for style in styles:
        print(f"Rendering {style} images for {isoform} [{split}]...")
        generate_images(isoform, split, style)
        print(f"→ Completed {isoform} [{split}] in {style} style.\n")


Rendering clean images for 3A4 [train]...
→ Completed 3A4 [train] in clean style.

Rendering sketch images for 3A4 [train]...
→ Completed 3A4 [train] in sketch style.

Rendering clean images for 3A4 [val]...
→ Completed 3A4 [val] in clean style.

Rendering sketch images for 3A4 [val]...
→ Completed 3A4 [val] in sketch style.

Rendering clean images for 3A4 [test]...
→ Completed 3A4 [test] in clean style.

Rendering sketch images for 3A4 [test]...
→ Completed 3A4 [test] in sketch style.
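
The log above reports completion but not whether any rows were skipped. One way to check is to compare the file names expected from the CSV with what is on disk. This is a rough sketch using only the standard library; the function name `missing_images` is hypothetical, and it assumes the `Drug_ID` and `Y` columns and the `{Drug_ID}_{label}.png` naming described in Cell 3.

```python
import csv
import os
from typing import Set


def missing_images(csv_path: str, image_dir: str) -> Set[str]:
    """Return expected PNG names (derived from the CSV) absent from image_dir."""
    with open(csv_path, newline="") as fh:
        expected = {
            # int(float(...)) tolerates values stored as "12" or "12.0"
            f"{int(float(row['Drug_ID']))}_{int(float(row['Y']))}.png"
            for row in csv.DictReader(fh)
        }
    # A missing folder counts as "no images rendered"
    present = set(os.listdir(image_dir)) if os.path.isdir(image_dir) else set()
    return expected - present
```

Calling `missing_images("../data/processed/3A4_train.csv", "../images/3A4/train/clean")` would list any molecules that were not rendered; an empty set means every row has an image.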



In [4]:
# Cell 4: Loop Over Splits & Styles

# Specify the isoform you want to process
isoform = '1A2'

# All dataset splits
splits = ['train', 'val', 'test']
# Both rendering styles
styles = ['clean', 'sketch']

# Nested loops to run generation for every combination
for split in splits:
    for style in styles:
        print(f"Rendering {style} images for {isoform} [{split}]...")
        generate_images(isoform, split, style)
        print(f"→ Completed {isoform} [{split}] in {style} style.\n")


Rendering clean images for 1A2 [train]...
→ Completed 1A2 [train] in clean style.

Rendering sketch images for 1A2 [train]...
→ Completed 1A2 [train] in sketch style.

Rendering clean images for 1A2 [val]...
→ Completed 1A2 [val] in clean style.

Rendering sketch images for 1A2 [val]...
→ Completed 1A2 [val] in sketch style.

Rendering clean images for 1A2 [test]...
→ Completed 1A2 [test] in clean style.

Rendering sketch images for 1A2 [test]...
→ Completed 1A2 [test] in sketch style.



In [4]:
# Cell 4: Loop Over Splits & Styles

# Specify the isoform you want to process
isoform = '2C9'

# All dataset splits
splits = ['train', 'val', 'test']
# Both rendering styles
styles = ['clean', 'sketch']

# Nested loops to run generation for every combination
for split in splits:
    for style in styles:
        print(f"Rendering {style} images for {isoform} [{split}]...")
        generate_images(isoform, split, style)
        print(f"→ Completed {isoform} [{split}] in {style} style.\n")


Rendering clean images for 2C9 [train]...
→ Completed 2C9 [train] in clean style.

Rendering sketch images for 2C9 [train]...
→ Completed 2C9 [train] in sketch style.

Rendering clean images for 2C9 [val]...
→ Completed 2C9 [val] in clean style.

Rendering sketch images for 2C9 [val]...
→ Completed 2C9 [val] in sketch style.

Rendering clean images for 2C9 [test]...
→ Completed 2C9 [test] in clean style.

Rendering sketch images for 2C9 [test]...
→ Completed 2C9 [test] in sketch style.



In [5]:
# Cell 4: Loop Over Splits & Styles

# Specify the isoform you want to process
isoform = '2C19'

# All dataset splits
splits = ['train', 'val', 'test']
# Both rendering styles
styles = ['clean', 'sketch']

# Nested loops to run generation for every combination
for split in splits:
    for style in styles:
        print(f"Rendering {style} images for {isoform} [{split}]...")
        generate_images(isoform, split, style)
        print(f"→ Completed {isoform} [{split}] in {style} style.\n")


Rendering clean images for 2C19 [train]...
→ Completed 2C19 [train] in clean style.

Rendering sketch images for 2C19 [train]...
→ Completed 2C19 [train] in sketch style.

Rendering clean images for 2C19 [val]...
→ Completed 2C19 [val] in clean style.

Rendering sketch images for 2C19 [val]...
→ Completed 2C19 [val] in sketch style.

Rendering clean images for 2C19 [test]...
→ Completed 2C19 [test] in clean style.

Rendering sketch images for 2C19 [test]...
→ Completed 2C19 [test] in sketch style.



In [6]:
# Cell 4: Loop Over Splits & Styles

# Specify the isoform you want to process
isoform = '2D6'

# All dataset splits
splits = ['train', 'val', 'test']
# Both rendering styles
styles = ['clean', 'sketch']

# Nested loops to run generation for every combination
for split in splits:
    for style in styles:
        print(f"Rendering {style} images for {isoform} [{split}]...")
        generate_images(isoform, split, style)
        print(f"→ Completed {isoform} [{split}] in {style} style.\n")


Rendering clean images for 2D6 [train]...
→ Completed 2D6 [train] in clean style.

Rendering sketch images for 2D6 [train]...
→ Completed 2D6 [train] in sketch style.

Rendering clean images for 2D6 [val]...
→ Completed 2D6 [val] in clean style.

Rendering sketch images for 2D6 [val]...
→ Completed 2D6 [val] in sketch style.

Rendering clean images for 2D6 [test]...
→ Completed 2D6 [test] in clean style.

Rendering sketch images for 2D6 [test]...
→ Completed 2D6 [test] in sketch style.
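
The five cells above differ only in the `isoform` value, so the same work could be driven from a single loop. The sketch below is one possible arrangement, not the notebook's own code: `plan_jobs` and `run_all` are hypothetical names, and the renderer is passed in as a callable so the block stands on its own.

```python
from itertools import product
from typing import Callable, List, Tuple

ISOFORMS = ['3A4', '1A2', '2C9', '2C19', '2D6']
SPLITS = ['train', 'val', 'test']
STYLES = ['clean', 'sketch']


def plan_jobs() -> List[Tuple[str, str, str]]:
    """List every (isoform, split, style) combination in run order."""
    return list(product(ISOFORMS, SPLITS, STYLES))


def run_all(render: Callable[[str, str, str], None]) -> int:
    """Call render(isoform, split, style) for each job; return the job count."""
    jobs = plan_jobs()
    for isoform, split, style in jobs:
        print(f"Rendering {style} images for {isoform} [{split}]...")
        render(isoform, split, style)
    return len(jobs)
```

In this notebook the call would be `run_all(generate_images)`, which covers 5 isoforms × 3 splits × 2 styles = 30 jobs.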



2C9_train_downsampled.csv

In [11]:
# Cell 4: Loop Over Splits & Styles

# Specify the isoform you want to process
isoform = '2C9'

# All dataset splits
splits = ['train', 'val', 'test']
# Both rendering styles
styles = ['clean', 'sketch']

# Nested loops to run generation for every combination
for split in splits:
    for style in styles:
        print(f"Rendering {style} images for {isoform} [{split}]...")
        generate_images(isoform, split, style)
        print(f"→ Completed {isoform} [{split}] in {style} style.\n")


Rendering clean images for 2C9 [train]...
→ Completed 2C9 [train] in clean style.

Rendering sketch images for 2C9 [train]...
→ Completed 2C9 [train] in sketch style.

Rendering clean images for 2C9 [val]...
→ Completed 2C9 [val] in clean style.

Rendering sketch images for 2C9 [val]...
→ Completed 2C9 [val] in sketch style.

Rendering clean images for 2C9 [test]...
→ Completed 2C9 [test] in clean style.

Rendering sketch images for 2C9 [test]...
→ Completed 2C9 [test] in sketch style.

