Open
Description
This feature aims to support memory-efficient training by enabling gradient checkpointing on specific layers of Lux models. Users will be able to mark individual layers that should trade extra compute for lower peak memory, i.e. discard their intermediate activations during the forward pass and recompute them during the backward pass. A sketch of one possible API follows the success criteria below.
Success when:
Gradient checkpointing can be toggled per layer.
A measurable reduction in peak memory usage is demonstrated via benchmarks on larger models.
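
Since this feature does not exist yet, here is a minimal sketch of what a per-layer opt-in could look like. The `Checkpointed` wrapper is a hypothetical name, not part of Lux's current API; the sketch assumes Lux v1's `AbstractLuxWrapperLayer` abstraction and relies on `Zygote.checkpointed`, which recomputes a call's forward pass during the backward pass instead of storing its intermediates.

```julia
using Lux, Random, Zygote

# Hypothetical wrapper layer: checkpoints whatever layer it wraps.
# `AbstractLuxWrapperLayer{:layer}` forwards parameter/state setup to `c.layer`.
struct Checkpointed{L <: Lux.AbstractLuxLayer} <: Lux.AbstractLuxWrapperLayer{:layer}
    layer::L
end

function (c::Checkpointed)(x, ps, st)
    # Zygote.checkpointed does not store intermediates of `c.layer(x, ps, st)`
    # during the forward pass; it reruns the call when the backward pass needs them.
    return Zygote.checkpointed(c.layer, x, ps, st)
end

# Usage: checkpoint only the large middle block of a model.
model = Chain(
    Dense(128 => 512, relu),
    Checkpointed(Chain(Dense(512 => 512, relu), Dense(512 => 512, relu))),
    Dense(512 => 10),
)

rng = Random.default_rng()
ps, st = Lux.setup(rng, model)
x = randn(Float32, 128, 16)

# Gradients flow through the checkpointed block as usual.
loss(p) = sum(abs2, first(model(x, p, st)))
grads = Zygote.gradient(loss, ps)
```

Wrapping only the largest blocks keeps the recompute overhead bounded while dropping the bulk of the stored activations. Whether the integration point should be a wrapper layer like this or a keyword on existing layers is an open design question.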
Metadata
Assignees
Labels
No labels