Cell 1: Installation of Project Dependencies
This cell installs the necessary Python libraries required for the project.

diffusers: A core Hugging Face library providing pre-trained diffusion models and pipelines for various modalities, including text-to-video.

transformers: Supplies the tokenizer and text encoder that the diffusion pipeline loads from the Hugging Face Hub to turn the prompt into embeddings.

accelerate: An auxiliary library from Hugging Face that optimizes PyTorch code for execution on various hardware configurations (e.g., GPUs, TPUs), ensuring efficient use of available resources.

"imageio[ffmpeg]": A library for reading and writing a wide range of image and video data. The [ffmpeg] extra also installs imageio-ffmpeg, which bundles an FFmpeg executable; imageio uses it to encode the generated frames into a video file (e.g., MP4).

av: A Pythonic binding for FFmpeg libraries, providing an alternative and sometimes more direct interface for video processing tasks.



In [None]:
# Cell 1: Install project dependencies via pip
%pip install diffusers transformers accelerate "imageio[ffmpeg]" av
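
After installation it can be useful to confirm which versions were actually resolved. The following is a minimal sketch using only the standard library; the helper name installed_versions is hypothetical, and the distribution names are assumed to match the pip package names used above.

```python
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names assumed from the pip command in this cell
for name, version in installed_versions(
    ["diffusers", "transformers", "accelerate", "imageio", "av"]
).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

A missing entry usually means the install cell needs to be rerun or the runtime restarted.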

Cell 2: Authentication with Hugging Face Hub
This cell authenticates with the Hugging Face Hub, which some model repositories require before their pre-trained weights can be downloaded. The token is read from Colab's secret manager (userdata) rather than pasted into the notebook, which keeps the credential out of the saved notebook and its outputs.

In [None]:
# Cell 2: Authenticate with Hugging Face Hub
from huggingface_hub import login
from google.colab import userdata

# Retrieve and use token stored in Colab's secret environment
hf_token = userdata.get("HF_TOKEN")
login(hf_token.strip())
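
Outside Colab, google.colab.userdata is not available. One possible fallback, sketched below, reads the token from an environment variable instead. The helper name resolve_hf_token is hypothetical, and storing the token in an HF_TOKEN environment variable is an assumption about the setup, not something this notebook does.

```python
import os


def resolve_hf_token(env=None, var_name="HF_TOKEN"):
    """Return a stripped token from an environment mapping, or raise if it is missing or blank."""
    env = os.environ if env is None else env
    token = (env.get(var_name) or "").strip()
    if not token:
        raise RuntimeError(f"{var_name} is not set or is empty")
    return token
```

The returned string could then be passed to login() in place of the Colab secret.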

Cell 3: Import Required Python Libraries
This cell imports the primary modules for the script's execution.

DiffusionPipeline: A high-level abstraction from the diffusers library that encapsulates the entire generative process (from text to video).

torch: The core PyTorch library, upon which the diffusion model is built. It is essential for tensor operations and GPU acceleration.

numpy: A fundamental library for numerical computing in Python, used here for manipulating the video frames as numerical arrays.

imageio: The library used for the final step of encoding the sequence of generated image frames into a standard video file format.

IPython.display.HTML: A utility for rendering HTML content directly within the notebook, used here to display the final generated video.

In [None]:
# Cell 3: Import required Python libraries for inference and video processing
from diffusers import DiffusionPipeline
import torch
import numpy as np
import imageio
from IPython.display import HTML

Cell 4: Initialize Text-to-Video Generation Pipeline
This section focuses on loading the pre-trained text-to-video model and configuring it for inference. Here, we instantiate the DiffusionPipeline using a specific pre-trained model.

"cerspense/zeroscope_v2_576w": This identifier specifies the model to be downloaded from the Hugging Face Hub. It's a version of the ZeroScope model optimized for generating videos with a width of 576 pixels.

torch_dtype=torch.float32: This parameter sets the model's weights to use 32-bit floating-point precision. While float16 offers faster inference and lower memory usage, float32 provides higher numerical precision and stability. The choice represents a trade-off between performance and precision.

The model is then moved to the GPU ("cuda") to leverage hardware acceleration, a critical step for handling the computational demands of diffusion models.

In [None]:
# Cell 4: Initialize text-to-video generation pipeline
pipeline = DiffusionPipeline.from_pretrained(
    "cerspense/zeroscope_v2_576w",
    torch_dtype=torch.float32  # Set model weights to 32-bit float precision
)

# Move model to GPU to enable accelerated generation
pipeline.to("cuda")
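
To make the float32 versus float16 trade-off concrete, the sketch below estimates the memory taken by the weights alone. The parameter count is a made-up illustrative figure, not the real size of this model, and activations and intermediate latents add further memory on top.

```python
def weight_memory_gib(num_params: int, bytes_per_param: int) -> float:
    """Approximate memory needed to hold the weights, in GiB."""
    return num_params * bytes_per_param / 2**30


# Hypothetical parameter count, for illustration only
HYPOTHETICAL_PARAMS = 1_500_000_000

print(f"float32: {weight_memory_gib(HYPOTHETICAL_PARAMS, 4):.2f} GiB")
print(f"float16: {weight_memory_gib(HYPOTHETICAL_PARAMS, 2):.2f} GiB")
```

Whatever the real count is, halving the bytes per parameter halves the weight footprint, which is why float16 is often chosen on memory-constrained GPUs.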

Cell 5: User Input for Generative Prompt
This cell captures the user's creative intent as a textual prompt, the primary input that conditions the diffusion model's generative process. The model synthesizes a video sequence that it interprets as matching the semantic content of the prompt.

In [None]:
# Cell 5: Prompt user for video generation input
prompt = input("Enter your prompt: ")  # Accepts descriptive text to guide video synthesis

Cell 6: Set Video Generation Parameters
This cell defines key parameters that control the characteristics of the output video and the generation process itself.

fps: Frames Per Second. Set to 8, a common rate for AI-generated video that produces reasonably smooth motion without excessive computational cost.

total_frames: The total number of frames desired for the final video. The calculation (30 seconds * 8 fps) yields 240 frames.

chunk_size: A critical parameter for memory management. Generating all 240 frames at once would likely cause an out-of-memory (OOM) error on most GPUs. By setting a smaller chunk size (e.g., 10 frames), we process the video in manageable batches.

output_dir: A dedicated directory to store the intermediate video chunks before they are concatenated.

In [None]:
# Cell 6: Set video generation parameters
fps = 8
total_frames = 240  # 30 seconds * 8 fps
chunk_size = 10     # safe batch size for memory
output_dir = "/content/video_chunks"

import os
os.makedirs(output_dir, exist_ok=True)
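
The chunking arithmetic can be checked in isolation. The sketch below uses a hypothetical helper, plan_chunks, which also returns a final partial chunk when total_frames is not a multiple of chunk_size; the notebook itself uses floor division and would drop such a remainder.

```python
def plan_chunks(total_frames: int, chunk_size: int) -> list[int]:
    """Split total_frames into chunk sizes, keeping any final partial chunk."""
    if total_frames < 0 or chunk_size <= 0:
        raise ValueError("total_frames must be >= 0 and chunk_size must be > 0")
    full_chunks, remainder = divmod(total_frames, chunk_size)
    sizes = [chunk_size] * full_chunks
    if remainder:
        sizes.append(remainder)
    return sizes


duration_seconds = 30
fps = 8
sizes = plan_chunks(duration_seconds * fps, 10)
print(len(sizes), "chunks,", sum(sizes), "frames in total")
```

With the values used in this notebook the division is exact, so no partial chunk is produced.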

Cell 7: Generate and Save Video in Chunks
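This is the main execution loop for video synthesis. It generates total_frames // chunk_size chunks one at a time to conserve memory, so any remainder frames are dropped when total_frames is not a multiple of chunk_size. Each chunk comes from a separate pipeline call with the same prompt, so consecutive chunks are not guaranteed to continue one another visually. A try-except block ensures that an error in one chunk does not terminate the entire process.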

The pipeline is executed with the following key parameters:

num_inference_steps: The number of denoising steps. More steps can improve quality but increase computation time. 40 is a balanced value.

guidance_scale (CFG): Controls how strictly the model adheres to the prompt. A higher value (e.g., 7.5) enforces stronger prompt alignment.
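
Classifier-free guidance is commonly described as pushing the unconditional noise prediction toward the prompt-conditioned one by a factor of guidance_scale. The sketch below illustrates that combination on plain Python lists; the function name apply_guidance is hypothetical, and the real pipeline performs the equivalent step internally on tensors.

```python
def apply_guidance(uncond, cond, guidance_scale):
    """Combine unconditional and conditional predictions element by element."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


# Toy predictions: the second element agrees in both, so guidance leaves it unchanged
uncond = [0.0, 1.0]
cond = [1.0, 1.0]
print(apply_guidance(uncond, cond, 7.5))
```

A scale of 1.0 returns the conditional prediction unchanged, and larger values amplify the difference between the two predictions.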

The code assumes the pipeline returns frames as floating-point arrays with values in [0, 1], which are scaled to the 8-bit integer range [0, 255]. A validation block handles shape inconsistencies: grayscale frames are expanded to three channels and extra channels (such as alpha) are dropped, so every frame is 3-channel RGB before being saved. Finally, imageio writes the processed frames of the current chunk to an MP4 file using the libx264 (H.264) encoder.

In [None]:
# Cell 7: Generate and save video in chunks (final version with shape fix)
import numpy as np
import imageio

chunk_count = total_frames // chunk_size

for i in range(chunk_count):
    print(f"\nGenerating chunk {i+1}/{chunk_count}...")

    try:
        # Run pipeline — returns shape: (1, 10, 320, 320, 3)
        video_batch = pipeline(
            prompt,
            num_frames=chunk_size,
            num_inference_steps=40,
            guidance_scale=7.5,
            height=320,
            width=320
        )[0][0]  # Remove outer batch dim → shape: (10, 320, 320, 3)

        # Convert batch to list of frames
        video_frames = [frame.copy() for frame in video_batch]

        # Sanity check
        if len(video_frames) != chunk_size:
            print(f"⚠️ Warning: Expected {chunk_size} frames, got {len(video_frames)}")
            continue

        frames_uint8 = []

        for idx, frame in enumerate(video_frames):
            try:
                # Clip to [0, 1] first so out-of-range values do not wrap around in uint8
                frame_uint8 = (np.clip(frame, 0.0, 1.0) * 255).astype(np.uint8)

                # Fix grayscale or alpha issues
                if frame_uint8.ndim == 2:
                    frame_uint8 = np.stack([frame_uint8] * 3, axis=-1)
                elif frame_uint8.shape[-1] > 3:
                    frame_uint8 = frame_uint8[..., :3]
                elif frame_uint8.shape[-1] < 3:
                    frame_uint8 = np.repeat(frame_uint8, 3, axis=-1)

                # Final shape check
                if frame_uint8.ndim != 3 or frame_uint8.shape[-1] != 3:
                    raise ValueError(f"Frame {idx} has shape {frame_uint8.shape}")

                frames_uint8.append(frame_uint8)
            except Exception as e:
                print(f"⚠️ Skipping frame {idx}: {e}")

        # Write this chunk to mp4
        chunk_path = f"{output_dir}/chunk_{i+1}.mp4"
        with imageio.get_writer(chunk_path, fps=fps, codec="libx264") as writer:
            for frame in frames_uint8:
                writer.append_data(frame)

        print(f"✅ Saved chunk {i+1} with {len(frames_uint8)} frames")

    except Exception as e:
        print(f"❌ Skipping chunk {i+1} due to error: {e}")
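
The per-frame conversion in the loop above can also be factored into a small function that is easy to test without a GPU. This is a sketch under the same assumption as the loop (float frames in [0, 1]); the name to_rgb_uint8 is hypothetical, and unlike the loop it rejects 2-channel frames outright.

```python
import numpy as np


def to_rgb_uint8(frame: np.ndarray) -> np.ndarray:
    """Convert one float frame in [0, 1] to an (H, W, 3) uint8 array."""
    # Clip before scaling so values slightly outside [0, 1] do not wrap around
    scaled = (np.clip(frame, 0.0, 1.0) * 255).astype(np.uint8)
    if scaled.ndim == 2:  # grayscale (H, W)
        return np.stack([scaled] * 3, axis=-1)
    if scaled.ndim == 3 and scaled.shape[-1] == 1:  # single channel (H, W, 1)
        return np.repeat(scaled, 3, axis=-1)
    if scaled.ndim == 3 and scaled.shape[-1] >= 3:  # RGB, or RGBA with alpha dropped
        return scaled[..., :3]
    raise ValueError(f"Unsupported frame shape: {frame.shape}")


# Example: a random RGBA frame becomes a 3-channel uint8 frame
rgba = np.random.rand(4, 6, 4)
print(to_rgb_uint8(rgba).shape, to_rgb_uint8(rgba).dtype)
```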


Cell 8: Create a Concatenation File List for FFmpeg
With all the video chunks generated, the next step is to merge them. FFmpeg's concat demuxer requires a manifest file that lists the input files in the order they should be joined, one per line in the form file '<path>'. This cell programmatically generates that chunks_list.txt file.



In [None]:
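# Cell 8: Create a list of chunk files for ffmpeg
# Only list chunks that were actually written; Cell 7 skips chunks that fail
with open("chunks_list.txt", "w") as f:
    for i in range(chunk_count):
        chunk_path = f"{output_dir}/chunk_{i+1}.mp4"
        if os.path.exists(chunk_path):
            f.write(f"file '{chunk_path}'\n")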
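
If the manifest were built from a directory listing instead of range(chunk_count), plain alphabetical sorting would place chunk_10.mp4 before chunk_2.mp4. The sketch below shows one way to sort by the numeric index instead; the helper name is hypothetical and it assumes the chunk_<n>.mp4 naming used in Cell 7.

```python
import re


def sort_chunks_numerically(names):
    """Sort chunk file names by the integer in chunk_<n>.mp4."""

    def chunk_index(name):
        match = re.search(r"chunk_(\d+)\.mp4$", name)
        if match is None:
            raise ValueError(f"Unexpected chunk file name: {name}")
        return int(match.group(1))

    return sorted(names, key=chunk_index)


print(sort_chunks_numerically(["chunk_10.mp4", "chunk_2.mp4", "chunk_1.mp4"]))
```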


Cell 9: Merge Video Chunks with FFmpeg
This cell executes a shell command to call the FFmpeg utility.

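-f concat: Selects FFmpeg's concat demuxer, which reads the list file and plays the listed files back to back.

-safe 0: Disables the demuxer's safe-path check. By default it rejects paths it considers unsafe, including the absolute paths written to our list file.

-c copy: This is a crucial optimization. It tells FFmpeg to perform a stream copy of the video data without re-encoding, which is fast and preserves the original quality of the chunks. Stream copy works here because every chunk was encoded with the same codec, resolution, and frame rate.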

/content/final_output.mp4: The path for the final, merged video.

In [None]:
# Cell 9: Merge all chunks into one video using ffmpeg
!ffmpeg -f concat -safe 0 -i chunks_list.txt -c copy /content/final_output.mp4
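
The same merge can be driven from Python, which makes failures raise an exception instead of passing silently. The sketch below is one possible wrapper; the function names are hypothetical, and the -y flag (overwrite the output without prompting) is an addition that the shell command above does not use.

```python
import subprocess


def build_concat_command(list_file, output_path):
    """Build the ffmpeg argument list for a stream-copy concat."""
    return [
        "ffmpeg",
        "-y",              # overwrite the output if it already exists (not in the original command)
        "-f", "concat",    # use the concat demuxer
        "-safe", "0",      # accept the absolute paths in the list file
        "-i", list_file,
        "-c", "copy",      # copy streams without re-encoding
        output_path,
    ]


def merge_chunks(list_file, output_path):
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_concat_command(list_file, output_path), check=True)
```

In the notebook this would be called as merge_chunks("chunks_list.txt", "/content/final_output.mp4").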


Cell 10: Note on Redundant Code Block
The code in this cell appears to be from a previous workflow version: it processes the variable video_frames, which now holds only the last processed chunk, and its output filename refers to a different model (Wan2.1-T2V-14B). The primary workflow already saves chunks in Cell 7 and merges them in Cell 9. This block is not part of the main chunking-and-merging pipeline but could be used for quickly previewing a single chunk without the full merge process.



In [None]:
# Cell 10: Convert the last chunk's float frames to 8-bit format and encode to MP4 (preview only)
video_path = "/content/Wan2.1-T2V-14B_output1.mp4"

# video_frames is a list of (H, W, 3) float frames in [0.0, 1.0] from the last chunk;
# stack them into (num_frames, H, W, 3) and scale to [0, 255] as uint8 for encoding
frames_np    = np.array(video_frames)
frames_uint8 = (np.clip(frames_np, 0.0, 1.0) * 255).astype(np.uint8)

# Encode frames into a video using imageio with H.264 codec
with imageio.get_writer(video_path, fps=8, codec="libx264") as writer:
    for frame in frames_uint8:
        writer.append_data(frame)

Cell 11: Display Rendered Video Inline
This cell uses IPython's HTML rendering capabilities to embed an HTML5 <video> tag directly into the notebook's output. This allows for immediate playback and review of the final generated video without needing to download it first. The src path points to the final, merged video.

In [None]:
# Cell 11: Display rendered video inline using HTML5 video tag
final_video_path = "/content/final_output.mp4"  # merged output from Cell 9
HTML(f"""
<video width="256" height="256" controls autoplay loop>
  <source src="{final_video_path}" type="video/mp4">
  Your browser does not support the video tag.
</video>
""")
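
In some hosted notebook front ends, including Colab, the browser may not be able to load a file from a path such as /content directly. A common workaround is to embed the video bytes in the page as a base64 data URI. The sketch below shows the idea; the function name video_data_uri_tag is hypothetical.

```python
from base64 import b64encode


def video_data_uri_tag(video_bytes: bytes, width: int = 320) -> str:
    """Return an HTML5 video tag with the MP4 bytes embedded as a base64 data URI."""
    encoded = b64encode(video_bytes).decode("ascii")
    return (
        f'<video width="{width}" controls>'
        f'<source src="data:video/mp4;base64,{encoded}" type="video/mp4">'
        "</video>"
    )
```

In the notebook, the file would be read with open("/content/final_output.mp4", "rb") and the resulting tag passed to HTML(). Embedding stores the whole video in the cell output, so this suits short clips better than long ones.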

Cell 12: Provide Download Link for Final Video
This cell leverages a Google Colab-specific module (google.colab.files) to create and trigger a browser-based download of the final video file. This provides a convenient method for saving the artifact locally.

In [None]:
# Cell 12: Provide download link for final video
from google.colab import files
files.download("/content/final_output.mp4")
