<a href="https://www.kaggle.com/code/stpeteishii/cpu-avatarartist1-2d-domain-transfer?scriptVersionId=288388234" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# **CPU AvatarArtist1: 2D Domain Transfer**

---

## **What is AvatarArtist?**
AvatarArtist is a cutting-edge AI system that creates high-quality 3D avatars from text descriptions or 2D images. It can transform a simple photo or text prompt into a fully-realized 3D character in various artistic styles.

https://kumapowerliu.github.io/AvatarArtist/

---

## **About This Notebook**

**This notebook demonstrates a simplified, CPU-compatible version of the first step (Step 1: 2D Domain Transfer) from the complete pipeline described in the paper.**

### Limitations

- ⚠️ **It is NOT possible to execute the full process and achieve the intended performance on CPU alone**
- This notebook is intended for educational purposes and for understanding the algorithm
- Practical results require running the complete pipeline on a GPU

### Complete Pipeline

The full AvatarArtist pipeline consists of 5 steps:

| Step | Description | 
|------|-------------|
| **Step 1** | 2D Domain Transfer | 
| **Step 2** | Next3D & 4D-GAN Fine-tuning | 
| **Step 3** | Triplane Decomposition | 
| **Step 4** | Diffusion Transformer Training |
| **Step 5** | Avatar Generation (Inference) | 

### Reference

📄 **Original Paper**: [AvatarArtist Project Page](https://kumapowerliu.github.io/AvatarArtist/)

---

## 📊 Pipeline Overview

```
Step 1: 2D Domain Transfer (← You are here)
   ↓
Step 2: Next3D & 4D-GAN Fine-tuning
   ↓
Step 3: Triplane Decomposition
   ↓
Step 4: Diffusion Transformer Training
   ↓
Step 5: Avatar Generation (Inference)
```

---



## ⚙️ System Requirements

### This Notebook (Step 1 - CPU) 
| Component | Minimum | Recommended |
|-----------|---------|-------------|
| CPU | Multi-core processor | Intel i7/AMD Ryzen 7+ |
| RAM | 8GB | 16GB+ |
| Storage | 5GB | 10GB+ |
| Time per image | 5-10 minutes | 3-5 minutes |

### Full Pipeline (GPU) 
| Component | Minimum | Recommended |
|-----------|---------|-------------|
| GPU | NVIDIA RTX 3060 (12GB) | NVIDIA RTX 3090/4090 (24GB+) |
| RAM | 16GB | 32GB+ |
| Storage | 50GB | 100GB+ |
| Time per avatar | 2-4 hours | 1-2 hours |

---

## **Step 1 Pipeline Explanation: 2D Domain Transfer (Style Transfer)**

## Overview
This is the **foundation pipeline** that transforms real-life photographs into artistic avatar styles (Pixar, anime, LEGO, etc.). It uses **Stable Diffusion + ControlNet** to maintain facial structure while completely changing the artistic style. The output becomes the training data for Step 2.

---

## Pipeline Purpose

```
Input: Real photograph (512×512)
   ↓
[ControlNet] Extract pose/structure
   ↓
[Stable Diffusion] Apply artistic style
   ↓
Output: Stylized avatar (512×512)
```

**Key Goal**: Generate hundreds of style-consistent avatar images that will be used to train the 3D-aware GAN (Next3D) in Step 2.

---

## Core Technologies

### **1. Stable Diffusion**

**What it is:**
- State-of-the-art text-to-image diffusion model
- Can generate high-quality images from text prompts
- Version used: v1.5 (runwayml) or v2.1 (stabilityai)

**How it works:**
```
Text Prompt → CLIP Encoder → Latent Space → U-Net Denoising → VAE Decoder → Image
```

**Role in this pipeline:**
- Generates images in target artistic style
- Maintains semantic meaning (face, features)
- Produces high-quality, consistent outputs

---

### **2. ControlNet**

**What it is:**
- Neural network that adds spatial control to Stable Diffusion
- Preserves structural information (pose, edges) during generation
- Ensures face position/proportions stay consistent

**Available Control Types:**

| Control Type | What It Preserves | Speed | Best For |
|--------------|-------------------|-------|----------|
| **OpenPose** | Body/face keypoints | Slower | Full body portraits |
| **Canny** | Edge outlines | Faster | Face-only images |

**Why we need it:**
```
Without ControlNet:
  Input: Photo of person
  Prompt: "Pixar style face"
  Result: Random Pixar character (wrong pose, different person)

With ControlNet:
  Input: Photo of person + Pose skeleton
  Prompt: "Pixar style face"
  Result: Same person, same pose, Pixar style ✓
```

**Example Process:**

```
# 1. Extract control signal
Input Image → OpenPose Detector → Skeleton/Keypoints
                                  (18 body points + 70 face points)

# 2. Guide generation
Skeleton + Text Prompt → ControlNet + SD → Stylized Image
                                           (same pose, new style)
```

---

### **3. SDEdit (Stochastic Differential Editing)**

**Concept:**
- Add controlled noise to input image
- Denoise with style prompt
- Result: Modified image that maintains structure

**Noise Strength Parameter:**

| Strength | Effect | Use Case |
|----------|--------|----------|
| 0.0 | No change | Minimal style transfer |
| 0.3 | Subtle style | Keep more original features |
| 0.5 | Balanced | Standard transformation |
| 0.7 | Heavy style | Maximum artistic change |
| 1.0 | Complete regeneration | Full style replacement |

**Process:**
```
Original Image (t=0)
   ↓ Add noise (strength=0.5)
Noisy Image (t≈500 on a 1000-step noise schedule)
   ↓ Denoise with style prompt (from t=500 back to t=0)
Styled Image (t=0)
```
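
To make the link between strength and denoising work concrete, the sketch below shows how an img2img-style (SDEdit) sampler commonly maps `strength` to the number of steps it actually runs. The helper name `sdedit_schedule` and the 1000-step training schedule are assumptions for illustration; the arithmetic follows the usual img2img convention and is not taken from this notebook's code.

```python
def sdedit_schedule(num_inference_steps: int, strength: float,
                    num_train_timesteps: int = 1000) -> dict:
    """Return how many denoising steps run and the approximate start timestep.

    Hypothetical helper: only the last `strength` fraction of the sampling
    schedule is executed, starting from a partially noised input image.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0.0, 1.0]")
    # Number of sampler steps that are actually executed.
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    # Approximate noise level the input is pushed to before denoising.
    start_timestep = int(num_train_timesteps * strength)
    return {"steps_to_run": steps_to_run, "start_timestep": start_timestep}


plan = sdedit_schedule(num_inference_steps=50, strength=0.5)
print(plan)
```

With 50 sampler steps and a strength of 0.5, only half of the schedule is executed, which is why lower strengths are also faster.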

---

## Architecture Components

### **Class: AvatarArtist2D**

#### **Initialization Parameters:**

```python
AvatarArtist2D(
    model_id="runwayml/stable-diffusion-v1-5",
    controlnet_model="lllyasviel/sd-controlnet-openpose",
    device="cpu",  # or "cuda"
    dtype=torch.float32,  # CPU requires float32
    use_canny=False,  # True for faster edge-based control
    hf_token=None  # Hugging Face authentication token
)
```

**Parameter Guide:**

| Parameter | Options | Purpose | Notes |
|-----------|---------|---------|-------|
| `model_id` | SD v1.5, v2.1 | Base diffusion model | v1.5 = no token, v2.1 = token needed |
| `controlnet_model` | OpenPose, Canny | Control type | Auto-selected based on model_id |
| `device` | "cpu", "cuda" | Computing device | GPU is roughly 50-100× faster |
| `dtype` | float32, float16 | Precision | CPU requires float32 |
| `use_canny` | True/False | Edge vs Pose | True = faster, False = more accurate |
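
The auto-selection mentioned for `controlnet_model` can be sketched as a small lookup. This is a minimal sketch that assumes the repository IDs used elsewhere in this notebook; `select_controlnet` is a hypothetical helper name, and raising an error for an unknown base model is a choice made here for illustration, not necessarily what the class does.

```python
def select_controlnet(model_id: str, use_canny: bool) -> str:
    """Pick a ControlNet checkpoint that matches the base model family."""
    if "v1-5" in model_id or "v1-4" in model_id:
        # Stable Diffusion 1.x checkpoints
        return ("lllyasviel/sd-controlnet-canny" if use_canny
                else "lllyasviel/sd-controlnet-openpose")
    if "stable-diffusion-2" in model_id:
        # Stable Diffusion 2.x checkpoints
        return ("thibaud/controlnet-sd21-canny-diffusers" if use_canny
                else "thibaud/controlnet-sd21-openpose-diffusers")
    raise ValueError(f"No known ControlNet for base model: {model_id}")


print(select_controlnet("runwayml/stable-diffusion-v1-5", use_canny=True))
```

Keeping the mapping in one function makes it harder to pair a 1.x base model with a 2.x ControlNet by accident.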

---

### **Model Loading Process**

**Step-by-Step:**

1. **Load ControlNet**
   ```python
   controlnet = ControlNetModel.from_pretrained(
       "lllyasviel/sd-controlnet-openpose"
   )
   ```
   - Size: ~1.5 GB
   - Contains: Encoder that processes control images

2. **Load Stable Diffusion Pipeline**
   ```python
   pipe = StableDiffusionControlNetPipeline.from_pretrained(
       "runwayml/stable-diffusion-v1-5",
       controlnet=controlnet
   )
   ```
   - Size: ~4-5 GB total
   - Contains: Text encoder, U-Net, VAE decoder

3. **Configure Scheduler**
   ```python
   pipe.scheduler = DDIMScheduler.from_config(...)
   ```
   - DDIM: Fast, deterministic sampling
   - Alternative: UniPC (fewer steps needed)

4. **CPU Optimizations**
   ```python
   pipe.enable_attention_slicing(slice_size=1)
   pipe.enable_vae_slicing()
   ```
   - Reduces memory usage by ~50%
   - Essential for CPU/low-VRAM GPUs
   - Slight slowdown (~10%) but prevents OOM

---

### **Control Signal Extraction**

#### **OpenPose Detection:**

```python
processor = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
control_image = processor(input_image)
```

**Output:**
- Skeleton with 18 body keypoints
- 70 facial landmarks (eyes, nose, mouth, jaw), if face detection is enabled on the detector (it is off by default in `controlnet_aux`)
- Colored visualization (different colors = different body parts)

**Keypoint Categories:**
```
Body: Nose, neck, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles (18 points)
Face: Eyes, eyebrows, nose, mouth, jaw contour (70 points)
```

#### **Canny Edge Detection:**

```python
processor = CannyDetector()
control_image = processor(input_image)
```

**Output:**
- Binary edge map (black/white)
- Detects gradients > threshold
- Faster than OpenPose (~5× speed)

**Comparison:**

| Aspect | OpenPose | Canny |
|--------|----------|-------|
| **Speed** | Slow (2-5 sec/image) | Fast (0.1-0.5 sec) |
| **Accuracy** | High (semantic) | Medium (edges only) |
| **Face Detail** | Excellent (70 points) | Good (outline) |
| **CPU Usage** | Higher | Lower |
| **Recommended** | GPU, high quality | CPU, speed priority |
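
The "detects gradients > threshold" idea behind the edge map can be illustrated without any imaging library. The sketch below is a simplified gradient-magnitude threshold, not full Canny (which also applies smoothing, non-maximum suppression, and hysteresis); `edge_map` and its default threshold are assumptions for illustration.

```python
def edge_map(gray, threshold=100):
    """Binary edge map from central intensity differences.

    `gray` is a list of rows of 0-255 values. Border pixels are left at 0.
    """
    h, w = len(gray), len(gray[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gray[y][x + 1] - gray[y][x - 1]  # horizontal gradient
            gy = gray[y + 1][x] - gray[y - 1][x]  # vertical gradient
            magnitude = (gx * gx + gy * gy) ** 0.5
            edges[y][x] = 255 if magnitude > threshold else 0
    return edges


# 5x6 test image: dark left half, bright right half
image = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
edges = edge_map(image)
for row in edges:
    print(row)
```

Only the pixels next to the dark-to-bright boundary are marked, which is the kind of outline the Canny ControlNet receives as its control image.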

---

## Processing Workflow

### **Single Image Process:**

```python
process_single_image(
    image_path="input.jpg",
    output_path="output.png",
    style_prompt="Pixar animation style",
    noise_strength=0.5,
    controlnet_strength=1.0,
    guidance_scale=7.5,
    num_steps=50,
    seed=42
)
```

**Step-by-Step Execution:**

1. **Load and Resize**
   ```python
   image = load_image("input.jpg")
   image = image.resize((512, 512))  # SD v1.5 optimal size
   ```

2. **Extract Control Signal**
   ```python
   control_image = processor(image)
   # Returns skeleton/edges matching input pose
   ```

3. **Style Transfer**
   ```python
   output = pipe(
       prompt=style_prompt,
       image=control_image,  # Structural guidance
       num_inference_steps=50,  # Denoising iterations
       guidance_scale=7.5,  # Prompt adherence
       controlnet_conditioning_scale=1.0,  # Control strength
       generator=torch.Generator().manual_seed(42)
   )
   ```

4. **Save Result**
   ```python
   output.images[0].save("output.png")
   ```

---

## Key Parameters Explained

### **1. Style Prompt**

**Format:**
```python
"[subject] in [style], [quality modifiers]"
```

**Examples:**

```python
STYLE_PROMPTS = {
    "pixar": "a 3D render of a face in Pixar animation style, "
             "high quality, detailed, professional lighting",
    
    "anime": "anime style portrait, cel shaded, vibrant colors, "
             "expressive eyes, detailed",
    
    "lego": "LEGO minifigure face, plastic texture, "
            "simplified features, toy style",
    
    "cartoon": "cartoon style portrait, bold lines, "
               "vibrant colors, simplified features",
    
    "oil_painting": "oil painting portrait, classical style, "
                    "rich colors, brushstrokes visible"
}
```

**Prompt Engineering Tips:**

| Component | Purpose | Example |
|-----------|---------|---------|
| **Subject** | What to generate | "face", "portrait", "person" |
| **Style** | Artistic direction | "Pixar", "anime", "watercolor" |
| **Quality** | Improve output | "high quality", "detailed", "professional" |
| **Lighting** | Visual enhancement | "soft lighting", "dramatic shadows" |
| **Technical** | Format/medium | "3D render", "digital art", "oil painting" |

**Negative Prompts** (what to avoid):
```python
negative_prompt = "blurry, low quality, distorted, ugly, bad anatomy"
```
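
The "[subject] in [style], [quality modifiers]" format can be captured in a small helper so that every style uses the same structure. This is a minimal sketch; `build_prompt` is a hypothetical function that is not part of the notebook.

```python
def build_prompt(subject: str, style: str, quality=()) -> str:
    """Compose a prompt as '[subject] in [style], [quality modifiers]'."""
    base = f"{subject} in {style}"
    # Drop empty modifiers and surrounding whitespace.
    modifiers = [q.strip() for q in quality if q.strip()]
    return ", ".join([base, *modifiers])


prompt = build_prompt(
    "a 3D render of a face",
    "Pixar animation style",
    ["high quality", "detailed", "professional lighting"],
)
print(prompt)
```

Called this way, the helper reproduces the Pixar entry of `STYLE_PROMPTS`.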

---

### **2. Noise Strength**

**Controls how much to modify the image:**

```python
noise_strength = 0.5  # Range: 0.0 - 1.0
```

| Value | Effect | Output Characteristics | Use Case |
|-------|--------|------------------------|----------|
| **0.1-0.2** | Minimal | Slight color/texture change | Subtle enhancement |
| **0.3-0.4** | Light | Recognizable person + style hints | Balanced realism |
| **0.5-0.6** | Medium | Clear style, some original features | **Recommended** |
| **0.7-0.8** | Heavy | Full style, minimal original | Artistic freedom |
| **0.9-1.0** | Complete | Entirely new generation | Max creativity |

**Visual Comparison:**
```
Noise 0.2: Photo → Photo with slight cartoon coloring
Noise 0.5: Photo → Clear Pixar character (same person)
Noise 0.8: Photo → Full Pixar character (generic features)
```

---

### **3. ControlNet Strength**

**Controls how strictly to follow the input structure:**

```python
controlnet_conditioning_scale = 1.0  # Range: 0.0 - 2.0
```

| Value | Effect | Structural Fidelity | Use Case |
|-------|--------|---------------------|----------|
| **0.3-0.5** | Loose | Approximate pose | Creative variation |
| **0.7-0.9** | Moderate | Close match | Balanced |
| **1.0** | Standard | Exact match | **Recommended** |
| **1.2-1.5** | Strong | Very precise | Technical accuracy |
| **1.8-2.0** | Maximum | Rigid adherence | Exact reproduction |

**Trade-off:**
- **Higher** = More faithful to original pose/structure
- **Lower** = More creative freedom in composition

---

### **4. Guidance Scale**

**Controls prompt following strength:**

```python
guidance_scale = 7.5  # Range: 1.0 - 20.0
```

| Value | Effect | Output Quality | Use Case |
|-------|--------|----------------|----------|
| **1-3** | Weak | Ignores prompt, generic | Not recommended |
| **5-7** | Moderate | Balanced style + creativity | Artistic freedom |
| **7.5** | Standard | Clear style adherence | **Recommended** |
| **10-12** | Strong | Strict prompt following | Precise results |
| **15-20** | Extreme | Over-saturated, artifacts | Experimental |

**CFG (Classifier-Free Guidance) Formula:**
```
output = unconditional_output + guidance_scale × (conditional - unconditional)
```
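
A tiny numeric example may help show what the CFG formula does. The sketch below applies it to plain Python lists standing in for noise predictions; `apply_cfg` and the sample values are assumptions for illustration, since in the real pipeline these are tensors produced by the U-Net.

```python
def apply_cfg(unconditional, conditional, guidance_scale):
    """Classifier-free guidance on two equally sized predictions."""
    if len(unconditional) != len(conditional):
        raise ValueError("predictions must have the same length")
    # Move from the unconditional prediction toward (and past) the conditional one.
    return [u + guidance_scale * (c - u)
            for u, c in zip(unconditional, conditional)]


uncond = [0.0, 1.0, -2.0]
cond = [1.0, 1.0, 0.0]
guided = apply_cfg(uncond, cond, guidance_scale=7.5)
print(guided)
```

A scale of 1.0 returns the conditional prediction unchanged, and larger values exaggerate the difference between the two predictions, which is why very high scales tend to produce over-saturated results.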

---

### **5. Inference Steps**

**Number of denoising iterations:**

```python
num_inference_steps = 50  # Range: 10 - 150
```

| Steps | Quality | Time (GPU) | Time (CPU) | Use Case |
|-------|---------|------------|------------|----------|
| **10-15** | Low | 2-3 sec | 1-2 min | Preview |
| **20-30** | Medium | 4-6 sec | 3-5 min | Fast production |
| **50** | High | 8-10 sec | 8-12 min | **Recommended** |
| **75-100** | Very High | 15-20 sec | 15-20 min | Maximum quality |

**Diminishing Returns:**
- Steps 10→25: Major quality improvement
- Steps 25→50: Noticeable improvement
- Steps 50→100: Marginal improvement
- Steps >100: Minimal difference

---

## Style Presets Explained

### **Pixar Style**

```python
prompt = ("a 3D render of a face in Pixar animation style, "
          "high quality, detailed, professional lighting")

noise_strength = 0.5       # suggested range: 0.4-0.5
controlnet_strength = 1.0  # suggested range: 0.8-1.0
guidance_scale = 7.5
```

**Characteristics:**
- Smooth, rounded features
- Large, expressive eyes
- Soft lighting and shadows
- Cartoon proportions with realism
- Clean, polished appearance

**Best For:** General-purpose avatars, friendly characters

---

### **Anime Style**

```python
prompt = ("anime style portrait, cel shaded, vibrant colors, "
          "expressive eyes, detailed")

noise_strength = 0.5       # suggested range: 0.5-0.6
controlnet_strength = 0.8  # suggested range: 0.7-0.9
guidance_scale = 8.0
```

**Characteristics:**
- Large eyes with highlights
- Simplified nose (small triangle)
- Cel shading (flat colors with sharp shadows)
- Colorful hair and features
- Stylized proportions

**Best For:** Manga-style characters, gaming avatars

---

### **LEGO Style**

```python
prompt = ("LEGO minifigure face, plastic texture, "
          "simplified features, toy style")

noise_strength = 0.6       # suggested range: 0.6-0.7
controlnet_strength = 1.0
guidance_scale = 9.0
```

**Characteristics:**
- Cylindrical head shape
- Minimal facial features (dots, curves)
- Solid colors
- Plastic/toy appearance
- High contrast

**Best For:** Gaming, collectibles, fun avatars

---

### **Oil Painting Style**

```python
prompt = ("oil painting portrait, classical style, "
          "rich colors, brushstrokes visible")

noise_strength = 0.5       # suggested range: 0.4-0.5
controlnet_strength = 0.9
guidance_scale = 7.0
```

**Characteristics:**
- Visible brushstrokes
- Textured appearance
- Rich, deep colors
- Artistic imperfections
- Classical lighting

**Best For:** Artistic profiles, formal portraits
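
The four presets can be gathered into one structure so that a batch script can look up a style by name. This is a minimal sketch using the values listed in this section; `StylePreset`, `PRESETS`, and `midpoint` are hypothetical names, and using the midpoint of a range as the default is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StylePreset:
    prompt: str
    noise_strength: Tuple[float, float]       # (min, max) suggested range
    controlnet_strength: Tuple[float, float]  # (min, max) suggested range
    guidance_scale: float

    def midpoint(self, field_name: str) -> float:
        """Return the middle of a (min, max) range, rounded to 2 decimals."""
        low, high = getattr(self, field_name)
        return round((low + high) / 2, 2)


PRESETS = {
    "pixar": StylePreset(
        "a 3D render of a face in Pixar animation style, "
        "high quality, detailed, professional lighting",
        (0.4, 0.5), (0.8, 1.0), 7.5),
    "anime": StylePreset(
        "anime style portrait, cel shaded, vibrant colors, "
        "expressive eyes, detailed",
        (0.5, 0.6), (0.7, 0.9), 8.0),
    "lego": StylePreset(
        "LEGO minifigure face, plastic texture, "
        "simplified features, toy style",
        (0.6, 0.7), (1.0, 1.0), 9.0),
    "oil_painting": StylePreset(
        "oil painting portrait, classical style, "
        "rich colors, brushstrokes visible",
        (0.4, 0.5), (0.9, 0.9), 7.0),
}

preset = PRESETS["pixar"]
print(preset.midpoint("noise_strength"), preset.guidance_scale)
```

Storing ranges rather than single values keeps the tuning advice next to the prompt it belongs to.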

---

## Batch Processing

### **Process Multiple Images:**

```python
process_batch(
    input_dir="./input_images",
    output_dir="./output_styled",
    style_prompt=STYLE_PROMPTS["pixar"],
    noise_strength=0.5,
    controlnet_strength=1.0,
    guidance_scale=7.5,
    num_steps=50,
    seed=42
)
```

**Features:**
- Processes all .jpg, .jpeg, .png files in input folder
- Creates output folder automatically
- Progress tracking for each image
- Error handling (continues even if one image fails)
- Consistent seed = reproducible results

**Output Naming:**
```
input_images/
  ├─ person1.jpg
  ├─ person2.jpg
  └─ person3.png

output_styled/
  ├─ styled_person1.jpg
  ├─ styled_person2.jpg
  └─ styled_person3.png
```
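
The naming scheme and extension filter described above can be sketched with `pathlib`. Both helpers are hypothetical and only mirror the layout shown; the actual `process_batch` implementation may differ.

```python
from pathlib import Path


def is_supported(path: str, extensions=(".jpg", ".jpeg", ".png")) -> bool:
    """True if the file has one of the accepted image extensions."""
    return Path(path).suffix.lower() in extensions


def styled_output_path(input_path: str, output_dir: str,
                       prefix: str = "styled_") -> Path:
    """Map an input file to '<output_dir>/styled_<original name>'."""
    return Path(output_dir) / f"{prefix}{Path(input_path).name}"


for name in ["person1.jpg", "person3.png", "notes.txt"]:
    if is_supported(name):
        print(name, "->", styled_output_path(f"input_images/{name}", "output_styled"))
```

Comparing the suffix in lower case means files such as `PHOTO.JPG` are picked up as well.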

---

## Output Quality Assessment

### **Good Results:**

✅ **Structural Preservation:**
- Face position matches original
- Facial proportions correct
- Expression maintained
- Pose identical

✅ **Style Consistency:**
- Clear artistic style visible
- Consistent across batch
- No realistic photo elements
- Professional appearance

✅ **Quality:**
- No artifacts or distortions
- Sharp details
- Proper lighting
- Natural colors within style

---

### **Poor Results:**

❌ **Common Issues:**

**1. Wrong Face Shape**
- **Cause**: ControlNet strength too low
- **Solution**: Increase to 0.9-1.0

**2. Not Enough Style**
- **Cause**: Noise strength too low
- **Solution**: Increase to 0.5-0.6

**3. Over-Styled (Unrecognizable)**
- **Cause**: Noise strength too high
- **Solution**: Decrease to 0.3-0.4

**4. Ignores Prompt**
- **Cause**: Guidance scale too low
- **Solution**: Increase to 8.0-10.0

**5. Artifacts/Distortions**
- **Cause**: Guidance scale too high
- **Solution**: Decrease to 6.0-7.0

---

## CPU vs GPU Performance

### **Speed Comparison (per 512×512 image, 50 steps):**

| Device | Control Extraction | Style Transfer | Total Time |
|--------|-------------------|----------------|------------|
| **CPU (8 cores)** | 2-5 sec | 8-15 min | **8-15 min** |
| **GPU (GTX 1080)** | 0.5 sec | 8-12 sec | **10-15 sec** |
| **GPU (RTX 3090)** | 0.3 sec | 4-6 sec | **5-8 sec** |
| **GPU (A100)** | 0.2 sec | 2-3 sec | **2-4 sec** |

### **Batch Processing Time:**

**100 images:**
- CPU: 13-25 hours
- GPU (GTX 1080): 15-25 minutes
- GPU (RTX 3090): 8-15 minutes

**Recommendation:** Use GPU for production batches, CPU only for testing/small batches
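
The batch estimates above follow directly from the per-image times. A minimal sketch of that arithmetic, assuming images are processed one after another; `batch_time_hours` is a hypothetical helper:

```python
def batch_time_hours(num_images: int, seconds_min: float, seconds_max: float):
    """Return (min_hours, max_hours) for sequential processing."""
    return (num_images * seconds_min / 3600,
            num_images * seconds_max / 3600)


# CPU: 8-15 minutes per image, 100 images
cpu_low, cpu_high = batch_time_hours(100, 8 * 60, 15 * 60)
print(f"CPU: {cpu_low:.1f}-{cpu_high:.1f} hours")

# GTX 1080: 10-15 seconds per image, 100 images
gpu_low, gpu_high = batch_time_hours(100, 10, 15)
print(f"GPU: {gpu_low * 60:.0f}-{gpu_high * 60:.0f} minutes")
```

The same function can be used to budget the several hundred images that Step 2 expects.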

---

## Memory Requirements

### **Model Loading:**

| Component | Size | Purpose |
|-----------|------|---------|
| Stable Diffusion v1.5 | 4 GB | Base model |
| ControlNet OpenPose | 1.5 GB | Pose control |
| ControlNet Canny | 1.5 GB | Edge control |
| Working Memory | 2-4 GB | Inference |
| **Total (OpenPose)** | **7-9 GB** | |
| **Total (Canny)** | **7-9 GB** | |

### **Hardware Recommendations:**

| Device | RAM | VRAM | Batch Size | Steps |
|--------|-----|------|------------|-------|
| **CPU** | 16 GB | N/A | 1 | 30-50 |
| **GPU (8GB)** | 8 GB | 8 GB | 1 | 50 |
| **GPU (12GB)** | 8 GB | 12 GB | 1-2 | 50-75 |
| **GPU (24GB)** | 16 GB | 24 GB | 2-4 | 75-100 |

---

## Common Issues & Solutions

### **Problem 1: Hugging Face Authentication Error**

**Symptoms:**
```
❌ Authentication Error: This model requires a Hugging Face token
```

**Solutions:**

**Method 1: CLI Login**
```bash
pip install huggingface_hub
huggingface-cli login
# Enter token when prompted
```

**Method 2: Environment Variable**
```bash
export HF_TOKEN="hf_your_token_here"
```

**Method 3: Code Parameter**
```python
artist = AvatarArtist2D(hf_token="hf_your_token_here")
```

**Method 4: Use Token-Free Model**
```python
MODEL_ID = "runwayml/stable-diffusion-v1-5"  # No token needed
```
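
Methods 2 and 3 can coexist if the code checks an explicit argument first and falls back to the environment variable. A minimal sketch of that precedence, assuming the `HF_TOKEN` variable name used above; `resolve_hf_token` is a hypothetical helper:

```python
import os
from typing import Optional


def resolve_hf_token(explicit_token: Optional[str] = None) -> Optional[str]:
    """Prefer an explicit token, then the HF_TOKEN environment variable."""
    return explicit_token or os.environ.get("HF_TOKEN")


token = resolve_hf_token()
print("Token found" if token else "No token set; use a token-free model")
```

Returning `None` when nothing is set lets the caller decide whether to continue with a model that needs no authentication.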

---

### **Problem 2: Out of Memory (OOM)**

**Symptoms:**
```
RuntimeError: CUDA out of memory
```

**Solutions:**

```python
# 1. Enable memory optimizations
pipe.enable_attention_slicing(slice_size=1)
pipe.enable_vae_slicing()

# 2. Reduce inference steps
NUM_STEPS = 30  # vs 50

# 3. Use Canny instead of OpenPose
USE_CANNY = True  # Faster, less memory

# 4. Process one image at a time
# Don't batch multiple images simultaneously

# 5. Switch to CPU (slow but works)
device = "cpu"
```

---

### **Problem 3: ControlNet Not Loading**

**Symptoms:**
```
Error: Failed to load lllyasviel/sd-controlnet-openpose
```

**Solutions:**

**Automatic Fallback:**
```python
# The class falls back to Canny automatically if OpenPose fails.
# `load` is shorthand here for the ControlNet loading call.
try:
    controlnet = load("openpose")
except Exception:
    controlnet = load("canny")  # Fallback
    use_canny = True

**Manual Selection:**
```python
# Force Canny from start
artist = AvatarArtist2D(use_canny=True)
```

---

### **Problem 4: Style Not Applied**

**Symptoms:**
- Output looks like original photo
- No artistic style visible

**Solutions:**

```python
# Increase noise strength
noise_strength = 0.6  # vs 0.3-0.4

# Increase guidance scale
guidance_scale = 9.0  # vs 7.5

# Reduce ControlNet strength (allow more freedom)
controlnet_strength = 0.7  # vs 1.0

# Improve prompt
prompt = "highly detailed Pixar 3D animated character, " \
         "professional render, studio lighting, " \
         "smooth surfaces, expressive features"
```

---

### **Problem 5: Face Position Wrong**

**Symptoms:**
- Face moved to different position
- Proportions changed
- Pose doesn't match

**Solutions:**

```python
# Increase ControlNet strength
controlnet_strength = 1.2  # vs 0.8-1.0

# Use OpenPose instead of Canny
use_canny = False  # More precise control

# Check control image visually
control_image.save("debug_control.png")
# Should show skeleton/edges matching input
```

---

### **Problem 6: Slow CPU Performance**

**Expected:** CPU is 50-100× slower than GPU

**Optimizations:**

```python
# 1. Use Canny (5× faster than OpenPose)
use_canny = True

# 2. Reduce steps (proportional speedup)
num_steps = 20  # vs 50 (2.5× faster)

# 3. Enable all optimizations
pipe.enable_attention_slicing(slice_size=1)
pipe.enable_vae_slicing()

# 4. Process smaller batches overnight
# 5. Consider cloud GPU (Google Colab, AWS, etc.)
```

---

## Integration with Next Steps

### **Output Requirements for Step 2:**

**Quantity:**
- Minimum: 100 images
- Recommended: 500-1000 images
- Optimal: 2000+ images

**Quality:**
- Clear artistic style
- Consistent across batch
- Proper face structure
- No major artifacts

**Diversity:**
- Different people
- Various expressions
- Multiple angles
- Age/gender variety
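
The quantity thresholds can be turned into a quick readiness check before moving on. This is a minimal sketch; `dataset_readiness` is a hypothetical helper, and treating counts between 1000 and 2000 as "recommended" is an assumption, since the list above does not cover that range.

```python
def dataset_readiness(num_images: int) -> str:
    """Classify a stylized dataset by size, using the thresholds above."""
    if num_images < 100:
        return "insufficient"
    if num_images < 500:
        return "minimum"
    if num_images < 2000:
        return "recommended"  # assumption: also covers 1000-1999
    return "optimal"


for count in (6, 100, 750, 2500):
    print(count, dataset_readiness(count))
```

The six images copied by this notebook fall well below the minimum, which is consistent with its educational scope.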

### **Validation Before Step 2:**

```python
# Check output quality
for img_path in output_images:
    img = Image.open(img_path)
    
    # Size check
    assert img.size == (512, 512), "Wrong size"
    
    # Visual inspection
    # - Clear style?
    # - Face recognizable?
    # - No artifacts?
```

---

## Summary

This pipeline:

✅ **Converts** real photos → artistic avatars  
✅ **Preserves** facial structure and pose  
✅ **Applies** consistent artistic style  
✅ **Supports** multiple styles (Pixar, anime, LEGO, etc.)  
✅ **Works** on CPU and GPU (GPU strongly recommended)  
✅ **Processes** single images or batches  
✅ **Handles** errors gracefully with fallbacks

**Key Parameters:**
- **Noise Strength**: 0.4-0.6 (style intensity)
- **ControlNet Strength**: 0.8-1.0 (structure preservation)
- **Guidance Scale**: 7.5 (prompt adherence)
- **Steps**: 30-50 (quality vs speed)

**Performance:**
- **GPU**: 5-10 seconds per image
- **CPU**: 8-15 minutes per image

**Output:** High-quality stylized avatars ready for 3D avatar training in Steps 2-5.

---

In [None]:
!pip install torch torchvision
!pip install diffusers transformers accelerate
!pip install controlnet-aux opencv-python pillow
!pip install mediapipe==0.10.13

In [None]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
hf_token0 = user_secrets.get_secret("secret_hf_token")

In [None]:
import shutil
import os
import random

# Collect every file path in the face dataset
paths = []
for dirname, _, filenames in os.walk('/kaggle/input/pins-face-recognition/105_classes_pins_dataset'):
    for filename in filenames:
        paths.append(os.path.join(dirname, filename))
print(paths[0:6])
random.shuffle(paths)

# Copy six randomly chosen images into the working input folder
os.makedirs("input_images", exist_ok=True)
for path in paths[0:6]:
    shutil.copy(path, "input_images")

# List the copied files
for dirname, _, filenames in os.walk('./input_images'):
    for filename in filenames:
        print(filename)

In [None]:
"""
AvatarArtist: 2D Domain Transfer Script (CPU Optimized)
Converts real-life images into specific styles using 
Stable Diffusion + ControlNet + SDEdit.
"""

import os
import torch
import numpy as np
from PIL import Image
from pathlib import Path
from typing import Optional, List, Tuple
import cv2
from diffusers import (
    StableDiffusionControlNetPipeline,
    ControlNetModel,
    DDIMScheduler,
    UniPCMultistepScheduler
)
from diffusers.utils import load_image
from controlnet_aux import OpenposeDetector, CannyDetector

# MediaPipe is optional
try:
    import mediapipe as mp
    MEDIAPIPE_AVAILABLE = True
except ImportError:
    MEDIAPIPE_AVAILABLE = False
    print("Warning: MediaPipe not available. Using ControlNet only for pose detection.")

class AvatarArtist2D:
    """Main class for 2D domain transfer (CPU Optimized)."""
    
    def __init__(
        self,
        model_id: str = "runwayml/stable-diffusion-v1-5",
        controlnet_model: str = "lllyasviel/sd-controlnet-openpose",
        device: str = "cpu",
        dtype: torch.dtype = torch.float32,  # CPU requires float32
        use_canny: bool = False,
        hf_token: Optional[str] = None
    ):
        """
        Args:
            model_id: Path or ID for the Stable Diffusion model.
            controlnet_model: Path or ID for the ControlNet model.
            device: Computing device to use.
            dtype: Data type (float32 for CPU).
            use_canny: Use Canny edge detection (simpler and lightweight).
            hf_token: Hugging Face token (optional).
        """
        self.device = device
        self.dtype = dtype
        self.use_canny = use_canny
        self.hf_token = hf_token or os.environ.get("HF_TOKEN")
        
        print(f"Using Model: {model_id}")
        print(f"Device: {device.upper()}")
        print(f"Data Type: {dtype}")
        print("Loading models...")
        
        # Select ControlNet based on the base model family
        if "v1-5" in model_id or "v1-4" in model_id:
            if use_canny:
                controlnet_model = "lllyasviel/sd-controlnet-canny"
            else:
                controlnet_model = "lllyasviel/sd-controlnet-openpose"
        elif "stable-diffusion-2" in model_id:
            if use_canny:
                controlnet_model = "thibaud/controlnet-sd21-canny-diffusers"
            else:
                controlnet_model = "thibaud/controlnet-sd21-openpose-diffusers"
        
        print(f"ControlNet: {controlnet_model}")
        
        # Load ControlNet
        try:
            self.controlnet = ControlNetModel.from_pretrained(
                controlnet_model,
                torch_dtype=dtype,
                token=self.hf_token
            )
            print("✓ ControlNet loaded successfully")
        except Exception as e:
            print(f"⚠ Error: Failed to load {controlnet_model}")
            print(f"  Details: {e}")
            print("Attempting fallback to Canny model...")
            try:
                fallback_model = "lllyasviel/sd-controlnet-canny"
                self.controlnet = ControlNetModel.from_pretrained(
                    fallback_model,
                    torch_dtype=dtype,
                    token=self.hf_token
                )
                self.use_canny = True
                print("✓ Using Canny ControlNet as fallback.")
            except Exception as e2:
                raise RuntimeError(f"Failed to load any ControlNet model: {e2}") from e2
        
        # Stable Diffusion pipeline setup
        try:
            self.pipe = StableDiffusionControlNetPipeline.from_pretrained(
                model_id,
                controlnet=self.controlnet,
                torch_dtype=dtype,
                safety_checker=None,
                token=self.hf_token
            )
            print("✓ Stable Diffusion loaded successfully")
        except Exception as e:
            error_msg = str(e)
            if "gated" in error_msg.lower() or "token" in error_msg.lower():
                raise Exception(
                    f"\n{'='*60}\n"
                    f"🔐 Authentication Error: This model requires a Hugging Face token.\n"
                    f"\nInstructions:"
                    f"\n1. Get a token at: https://huggingface.co/settings/tokens"
                    f"\n2. Set it using one of these methods:"
                    f"\n   a) export HF_TOKEN='your_token'"
                    f"\n   b) huggingface-cli login"
                    f"\n   c) artist = AvatarArtist2D(hf_token='your_token')"
                    f"\n\nAlternatively, use a model that doesn't require a token:"
                    f"\n   model_id='runwayml/stable-diffusion-v1-5'"
                    f"\n{'='*60}\n"
                )
            raise
        
        # Scheduler configuration (SDEdit compatible)
        self.pipe.scheduler = DDIMScheduler.from_config(
            self.pipe.scheduler.config
        )
        
        self.pipe = self.pipe.to(device)
        
        # CPU Optimization: Enable attention slicing to reduce memory usage
        self.pipe.enable_attention_slicing(slice_size=1)
        
        # CPU Optimization: Enable VAE slicing
        if hasattr(self.pipe, 'enable_vae_slicing'):
            self.pipe.enable_vae_slicing()
            print("✓ VAE slicing enabled for memory efficiency")
        
        print("✓ Memory optimization enabled for CPU")
        
        # Control image processor
        if self.use_canny:
            print("Initializing Canny detector...")
            self.processor = CannyDetector()
        else:
            print("Loading Openpose processor...")
            try:
                self.processor = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
            except Exception as e:
                print(f"Warning: OpenPose load failed: {e}")
                print("Falling back to Canny...")
                self.processor = CannyDetector()
                self.use_canny = True
        
        # MediaPipe Face Detection (Optional: for more detailed control)
        self.face_mesh = None
        if MEDIAPIPE_AVAILABLE:
            try:
                mp_face_mesh = mp.solutions.face_mesh
                self.face_mesh = mp_face_mesh.FaceMesh(
                    static_image_mode=True,
                    max_num_faces=1,
                    min_detection_confidence=0.5
                )
                print("MediaPipe face detection enabled.")
            except Exception as e:
                print(f"MediaPipe initialization failed: {e}")
                self.face_mesh = None
        
        print("Initialization complete!")
    
    def extract_pose_landmarks(self, image: Image.Image) -> Image.Image:
        """Extract control image from input (OpenPose or Canny)."""
        control_image = self.processor(image)
        return control_image
    
    def extract_face_landmarks(self, image: Image.Image) -> Optional[np.ndarray]:
        """Extract face landmarks using MediaPipe."""
        if self.face_mesh is None:
            return None
            
        try:
            # MediaPipe expects RGB input, so no BGR conversion is applied
            image_np = np.array(image.convert("RGB"))
            results = self.face_mesh.process(image_np)
            
            if results.multi_face_landmarks:
                landmarks = results.multi_face_landmarks[0]
                h, w = image_np.shape[:2]
                points = np.array([
                    [lm.x * w, lm.y * h] 
                    for lm in landmarks.landmark
                ])
                return points
        except Exception as e:
            print(f"Warning: Face landmark extraction failed: {e}")
        return None
    
    def apply_sdedit(
        self,
        image: Image.Image,
        prompt: str,
        control_image: Image.Image,
        noise_strength: float = 0.5,
        controlnet_conditioning_scale: float = 1.0,
        guidance_scale: float = 7.5,
        num_inference_steps: int = 50,
        seed: Optional[int] = None
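    ) -> Image.Image:
        """Generate a stylized image guided by the ControlNet control image.

        Note: `image` and `noise_strength` are accepted but not used below.
        `StableDiffusionControlNetPipeline` denoises from pure noise, so
        SDEdit-style partial noising would need an img2img ControlNet pipeline.
        """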
        if seed is not None:
            generator = torch.Generator(device=self.device).manual_seed(seed)
        else:
            generator = None
        
        output = self.pipe(
            prompt=prompt,
            image=control_image,
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            controlnet_conditioning_scale=controlnet_conditioning_scale,
            generator=generator,
        )
        return output.images[0]
    
    def process_single_image(
        self,
        image_path: str,
        output_path: str,
        style_prompt: str,
        noise_strength: float = 0.5,
        controlnet_strength: float = 1.0,
        guidance_scale: float = 7.5,
        num_steps: int = 50,
        seed: Optional[int] = None
    ) -> bool:
        """Process a single image."""
        try:
            image = load_image(image_path)
            image = image.resize((512, 512))
            
            print("  Extracting control image...")
            control_image = self.extract_pose_landmarks(image)
            
            print("  Transforming style (this may take a while on CPU)...")
            output_image = self.apply_sdedit(
                image=image,
                prompt=style_prompt,
                control_image=control_image,
                noise_strength=noise_strength,
                controlnet_conditioning_scale=controlnet_strength,
                guidance_scale=guidance_scale,
                num_inference_steps=num_steps,
                seed=seed
            )
            
            output_image.save(output_path)
            print(f"  Saved to: {output_path}")
            return True
        except Exception as e:
            print(f"  Error: {str(e)}")
            return False
    
    def process_batch(
        self,
        input_dir: str,
        output_dir: str,
        style_prompt: str,
        noise_strength: float = 0.5,
        controlnet_strength: float = 1.0,
        guidance_scale: float = 7.5,
        num_steps: int = 50,
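        extensions: Optional[List[str]] = None,
        seed: Optional[int] = None
    ):
        """Process all images in a folder."""
        # Avoid a mutable default argument; fall back to common image types.
        if extensions is None:
            extensions = [".jpg", ".jpeg", ".png"]
        os.makedirs(output_dir, exist_ok=True)
        input_path = Path(input_dir)
        # Match extensions case-insensitively in a single pass. Globbing both
        # "*.jpg" and "*.JPG" can list the same file twice on case-insensitive
        # filesystems and misses mixed-case names such as "photo.Jpg".
        wanted = {ext.lower() for ext in extensions}
        image_files = sorted(
            p for p in input_path.glob("*")
            if p.is_file() and p.suffix.lower() in wanted
        )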
        
        print(f"\nProcessing {len(image_files)} images")
        print(f"Style: {style_prompt}")
        print(f"Noise Strength: {noise_strength}")
        print(f"ControlNet Strength: {controlnet_strength}")
        print("⚠ Note: CPU processing is slower than GPU. Each image may take several minutes.\n")
        
        success_count = 0
        for i, img_path in enumerate(image_files, 1):
            print(f"[{i}/{len(image_files)}] Processing: {img_path.name}")
            output_path = os.path.join(output_dir, f"styled_{img_path.name}")
            
            if self.process_single_image(
                str(img_path), output_path, style_prompt,
                noise_strength, controlnet_strength,
                guidance_scale, num_steps, seed
            ):
                success_count += 1
        
        print(f"\nFinished: Transformed {success_count}/{len(image_files)} images.")

def main():
    """Main execution entry point."""
    INPUT_DIR = "./input_images"
    OUTPUT_DIR = "./output_styled"
    
    # Model selection
    MODEL_ID = "runwayml/stable-diffusion-v1-5"  # Token-free model
    
    # Alternative models:
    # MODEL_ID = "CompVis/stable-diffusion-v1-4"
    # MODEL_ID = "stabilityai/stable-diffusion-2-1"  # May require token
    
    # Hugging Face Token (if required)
    HF_TOKEN = hf_token0  # Use your token if needed
    
    STYLE_PROMPTS = {
        "pixar": "a 3D render of a face in Pixar animation style, high quality, detailed, professional lighting",
        "anime": "anime style portrait, cel shaded, vibrant colors, expressive eyes, detailed",
        "lego": "LEGO minifigure face, plastic texture, simplified features, toy style",
        "oil_painting": "oil painting portrait, classical style, rich colors, brushstrokes visible",
        "cartoon": "cartoon style portrait, bold lines, vibrant colors, simplified features"
    }
    
    STYLE = "pixar"
    NOISE_STRENGTH = 0.4
    CONTROLNET_STRENGTH = 0.8
    GUIDANCE_SCALE = 7.5
    NUM_STEPS = 30  # Reduced for faster CPU processing (was 50)
    SEED = 42
    USE_CANNY = True  # Canny is faster than OpenPose on CPU
    
    try:
        artist = AvatarArtist2D(
            model_id=MODEL_ID,
            device="cpu",  # Force CPU
            dtype=torch.float32,  # CPU requires float32
            use_canny=USE_CANNY,
            hf_token=HF_TOKEN
        )
    except Exception as e:
        print(f"\n❌ Initialization Error: {e}")
        print("\n💡 Troubleshooting:")
        print("  1. Login to Hugging Face: huggingface-cli login")
        print("  2. Or set environment variable: export HF_TOKEN='your_token'")
        print("  3. Or use a token-free model: MODEL_ID='runwayml/stable-diffusion-v1-5'")
        return
    
    artist.process_batch(
        input_dir=INPUT_DIR,
        output_dir=OUTPUT_DIR,
        style_prompt=STYLE_PROMPTS[STYLE],
        noise_strength=NOISE_STRENGTH,
        controlnet_strength=CONTROLNET_STRENGTH,
        guidance_scale=GUIDANCE_SCALE,
        num_steps=NUM_STEPS,
        seed=SEED
    )

if __name__ == "__main__":
    main()
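
The `NOISE_STRENGTH` setting above reflects the SDEdit idea: noise the input only partway, then denoise from that point. The sketch below shows the step arithmetic commonly used by image-to-image diffusion pipelines to turn a strength value into a number of denoising steps. `sdedit_step_window` is a hypothetical helper written for illustration, not part of this notebook or of any library, and the rounding rule is an assumption modeled on typical implementations.

```python
from typing import Tuple


def sdedit_step_window(num_inference_steps: int, noise_strength: float) -> Tuple[int, int]:
    """Return (start_index, steps_to_run) for a partial denoising schedule.

    Hypothetical helper for illustration only: a strength of 0.0 runs no
    steps (the input is kept), while 1.0 runs the full schedule.
    """
    if not 0.0 <= noise_strength <= 1.0:
        raise ValueError("noise_strength must be in [0, 1]")
    # Number of denoising steps actually executed, capped at the schedule length.
    steps_to_run = min(int(num_inference_steps * noise_strength), num_inference_steps)
    # Index into the scheduler's timestep list where denoising would begin.
    start_index = max(num_inference_steps - steps_to_run, 0)
    return start_index, steps_to_run


# With the settings used in main() (NUM_STEPS = 30, NOISE_STRENGTH = 0.4):
start, run = sdedit_step_window(30, 0.4)
print(start, run)  # 18 12
```

Under these assumptions, a strength of 0.4 with 30 steps would skip the first 18 steps and run only the last 12, which is why lower strengths stay closer to the input image and also finish faster on CPU.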

In [None]:
# Display results
import os

import matplotlib.pyplot as plt
from PIL import Image

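def show_image(image_dir):
    """Display up to six images from `image_dir` in a 2x3 grid."""
    if not os.path.isdir(image_dir):
        print(f"Directory not found: {image_dir}")
        return
    image_paths = [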
        os.path.join(image_dir, f)
        for f in sorted(os.listdir(image_dir))
        if f.lower().endswith((".png", ".jpg", ".jpeg"))
    ][:6]  
    
    if not image_paths:
        print(f"No images found in {image_dir}")
        return
    
    fig, axes = plt.subplots(2, 3, figsize=(12, 8))
    axes = axes.flatten()
    for ax, img_path in zip(axes, image_paths):
        img = Image.open(img_path)
        ax.imshow(img)
        ax.axis("off")
        ax.set_title(os.path.basename(img_path), fontsize=9)
    for ax in axes[len(image_paths):]:
        ax.axis("off")
    plt.tight_layout()
    plt.show()

print("\n=== Input Images ===")
show_image('input_images')

print("\n=== Styled Output Images ===")
if os.path.exists('./output_styled'):
    show_image('./output_styled')
else:
    print("Output directory not found. Run the main() function first.")
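
Because `process_batch` writes each result as `styled_<original file name>`, inputs and outputs can be matched by name for a side-by-side check. The following is a small sketch under that naming assumption. `pair_inputs_with_outputs` and `IMAGE_SUFFIXES` are hypothetical names introduced here, not part of the notebook.

```python
from pathlib import Path
from typing import List, Tuple

# Assumed set of image types, matching the extensions used in this notebook.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def pair_inputs_with_outputs(input_dir: str, output_dir: str) -> List[Tuple[Path, Path]]:
    """Return (input, styled output) path pairs for inputs that have a result."""
    in_dir, out_dir = Path(input_dir), Path(output_dir)
    # Missing directories simply yield no pairs instead of raising.
    if not in_dir.is_dir() or not out_dir.is_dir():
        return []
    pairs = []
    for src in sorted(in_dir.iterdir()):
        if src.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # Outputs are expected to follow the "styled_<name>" convention.
        styled = out_dir / f"styled_{src.name}"
        if styled.is_file():
            pairs.append((src, styled))
    return pairs


for src, styled in pair_inputs_with_outputs("input_images", "output_styled"):
    print(f"{src.name} -> {styled.name}")
```

Inputs with no matching output are skipped, so the list also serves as a quick check of which images failed during the batch run.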