# Visual Sampling: Audio To Generative Art

Alex Lee

## Concept
This project expands the concept of sampling, a musical practice in which musicians mix and match "samples" of pre-existing recordings to create something distinct. It extends that practice across mediums: audio is reinterpreted by Stable Diffusion as generative art, then translated back into audio.
Machine learning systems treat audio and images as interchangeable data: arrays that can be reshaped and reinterpreted. This system exploits that property to build a translation chain in which unexpected meaning emerges through the gaps in conversion.



## Process:
Audio → Spectrogram (visual representation)
<br> Spectrogram → Abstract art (via Stable Diffusion)
<br> Abstract art → Audio (via data conversion methods)

<br> The same source material produces radically different results depending on conversion method. This variability is the point—it reveals how meaning is constructed through our methods of reading data.

The following is the source code for this project. You **DO NOT need to know any programming to use this system**. Please follow the instructions carefully.

## Getting Started

Connect to a T4 GPU runtime (Runtime > Change runtime type), then select Runtime > Run all to start using this system.

In [None]:
# --- Install dependencies ---
!pip install torch torchaudio torchvision diffusers transformers accelerate safetensors pillow matplotlib --quiet

In [None]:
# --- Imports ---
import io
import os
import torch
import torchaudio
import torchaudio.transforms as T
import torchvision.transforms as vtrans
import matplotlib.pyplot as plt
from datetime import datetime
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline
import numpy as np

## Audio Upload

Upload an audio file. Any audio format that torchaudio can read is fine; non-WAV files are converted to WAV in the following cell.

In [None]:
# --- Upload any audio file ---
from google.colab import files
uploaded = files.upload()
audio_path = list(uploaded.keys())[0]

In [None]:
# convert non-wav files into wav if necessary
# Check if the uploaded file is already a WAV
if not audio_path.lower().endswith('.wav'):
    print(f"Converting '{audio_path}' to WAV format...")
    # Load the audio file
    waveform, sample_rate = torchaudio.load(audio_path)

    # Define a new WAV file path
    # Using os.path.splitext to get base name and then append .wav
    base_name = os.path.splitext(audio_path)[0]
    new_audio_path = f"{base_name}.wav"

    # Save as WAV
    torchaudio.save(new_audio_path, waveform, sample_rate)

    # Update audio_path to point to the new WAV file
    audio_path = new_audio_path
    print(f"Conversion complete. New audio path: '{audio_path}'")
else:
    print(f"File '{audio_path}' is already a WAV file. No conversion needed.")

## Audio -> Spectrogram

The following code extracts a spectrogram using standard signal processing (torchaudio).

### What's a Spectrogram?
A spectrogram is a time-frequency representation of audio:

 - X-axis: time
 - Y-axis: frequency
 - Intensity: amplitude (rendered as brightness/color)

This creates a visual "map" of sound. When fed to Stable Diffusion—an AI trained on images, not audio—the model interprets these patterns as visual information and transforms them according to its training, with no awareness of the sonic origin.

In [None]:
# --- Define helper functions ---
def get_spectrogram_image(audio_path):
    waveform, sample_rate = torchaudio.load(audio_path)
    spec = T.Spectrogram(n_fft=512)(waveform)
    if spec.shape[0] > 1:
        spec = spec.mean(dim=0, keepdim=True)
    spec_db = 10 * torch.log10(spec + 1e-9)
    spec_db = (spec_db - spec_db.min()) / (spec_db.max() - spec_db.min())
    spec_np = (spec_db.squeeze(0).numpy() * 255).astype(np.uint8)
    rgb_img = Image.fromarray(spec_np).convert("RGB")
    return rgb_img.resize((400, 300))

# --- Generate spectrogram image ---
spec_image = get_spectrogram_image(audio_path)

print("Normalized Spectrogram")
spec_image


In [None]:
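# --- Setup device and model ---
device = "cuda" if torch.cuda.is_available() else "cpu"
# Half precision saves GPU memory; fall back to float32 when running on CPU
dtype = torch.float16 if device == "cuda" else torch.float32
model_id = "runwayml/stable-diffusion-v1-5"

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(model_id, torch_dtype=dtype)
pipe = pipe.to(device)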

## Spectrogram -> Image

Stable Diffusion recreates the spectrogram as abstract art. I deliberately prompted for abstract art to maximize room for interpretation. I also wanted to avoid associations with real-life objects or people.

### Avenues for Experimentation:

 - Changing Stable Diffusion prompts (you can change the prompt in double quotation marks below)

 - Hyperparameter Tuning
    - Guidance Scale: Controls how strongly the prompt steers the outcome. Use a value of 1 or higher; the default is 7.5.
    - Strength: Between 0 and 1. Controls how much the original image (i.e. the spectrogram) is allowed to change; higher values change it more.
    - Negative Prompt: Things you don't want in the outcome.

Examples: Change the prompt to "acrylic art". Lower the strength to a value below 0.5.
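
As a rough sketch of what these two settings do, assuming the usual img2img behavior (strength decides how many of the scheduler's denoising steps actually run) and the standard classifier-free guidance formula: the function names below are hypothetical, the 50-step count is an assumption, and the two-element "predictions" are made-up numbers.

```python
import numpy as np

def denoising_steps_run(num_inference_steps, strength):
    """Number of scheduler steps that actually run in img2img for a given strength."""
    return min(int(num_inference_steps * strength), num_inference_steps)

def guided_prediction(uncond, cond, guidance_scale):
    """Classifier-free guidance: move from the unconditional prediction toward the prompted one."""
    return uncond + guidance_scale * (cond - uncond)

# Lower strength -> fewer denoising steps -> output stays closer to the spectrogram
for strength in (0.25, 0.5, 0.9, 1.0):
    print(strength, denoising_steps_run(50, strength))

uncond = np.array([0.0, 1.0])   # made-up noise prediction without the prompt
cond = np.array([1.0, 1.0])     # made-up noise prediction with the prompt
for scale in (1.0, 2.5, 7.5):
    print(scale, guided_prediction(uncond, cond, scale))
```

At a guidance scale of 1 the result equals the prompted prediction, which is why values below 1 are not useful; larger values exaggerate the difference the prompt makes.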

In [None]:
# --- Prompt and generation ---
prompt = (
   "caffeinated thoughts of a tired college student pulling their 2nd all nighter flying saucers and jellyfish living forever spicy pumpkin coffee abstract image no concrete objects"
)

result = pipe(
    prompt=prompt,
    negative_prompt="letters, people, recognizable objects",  # things to steer away from, listed without "no"
    image=spec_image,
    strength=0.9,
    guidance_scale=2.5,
).images

# --- Save and show result ---
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"generative_art_from_{os.path.splitext(os.path.basename(audio_path))[0]}_at_{timestamp}.png"
result[0].save(filename)

print(f"✅ Saved as {filename}")
display(result[0])



## Image -> Audio

This is where multiple interpretive possibilities open up. The abstract art from the previous step is a color image with three color channels (red, green, and blue). An audio waveform, however, needs only one-dimensional amplitude information.

This surplus of visual information creates a translation problem that's simultaneously technical and artistic: which information gets used, and how?

The simplest approach averages all three color channels into a single amplitude stream, but a wide range of other methods is available. One such approach is based on scipy's chirp function.
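
The averaging approach, along with two alternative ways of collapsing the channels, can be sketched as follows. This is an illustration on a small random array standing in for an image, not the notebook's conversion code; the luminance weights are the standard Rec. 601 values, and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((4, 6, 3))   # stand-in for an RGB image: [height, width, channels]

mean_gray = image.mean(axis=2)                        # equal-weight average of R, G, B
luma_gray = image @ np.array([0.299, 0.587, 0.114])   # perceptual (Rec. 601) weighting
red_only = image[:, :, 0]                             # keep one channel, discard the rest

for name, arr in [("mean", mean_gray), ("luma", luma_gray), ("red", red_only)]:
    print(name, arr.shape, round(float(arr.mean()), 3))
```

All three results have the same shape but different values, so each would sound different once converted to audio.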

Each channel can be assigned a distinct sonic parameter (red channel → pitch, green → amplitude, blue → duration, etc.) and combined into composite audio.

These choices might appear purely technical, but they carry interpretive weight: why should red correspond to pitch rather than green? What does that mapping mean?
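
One possible version of such a mapping is sketched below, assuming each image column becomes a single sine tone and the columns play left to right. This is not the chirp-based method used in this notebook; the function name `columns_to_tones` and every numeric range in it are arbitrary choices made for illustration.

```python
import numpy as np

def columns_to_tones(image, sample_rate=24000, base_hz=200.0, span_hz=800.0,
                     min_dur=0.05, max_dur=0.25):
    """Turn each image column into one sine tone and play the columns in sequence.

    image: float array in [0, 1] with shape [height, width, 3].
    """
    tones = []
    for x in range(image.shape[1]):
        r, g, b = image[:, x, :].mean(axis=0)        # average color of this column
        freq = base_hz + span_hz * r                 # red -> pitch
        amp = 0.2 + 0.8 * g                          # green -> amplitude
        dur = min_dur + (max_dur - min_dur) * b      # blue -> duration
        n = int(round(sample_rate * dur))            # number of samples in this tone
        t = np.arange(n) / sample_rate
        tones.append(amp * np.sin(2 * np.pi * freq * t))
    return np.concatenate(tones)

# Tiny made-up image: one red column, one green column, one blue column
image = np.zeros((2, 3, 3))
image[:, 0, 0] = 1.0
image[:, 1, 1] = 1.0
image[:, 2, 2] = 1.0

audio = columns_to_tones(image)
print(audio.shape, float(np.abs(audio).max()))
```

Swapping which channel drives which parameter changes the result completely, which is exactly the interpretive choice at stake here.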

In [None]:
#convert image back into sound
img = result[0].convert("RGB")
print(type(img))
to_tensor = vtrans.ToTensor()
rgb = to_tensor(img)  # [3, H, W]

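# Treat one color channel as a dB-scaled spectrogram and invert it to audio
def invert_channel(channel):
    # Map pixel values in [0, 1] to [-80, 0] dB, then to linear magnitude
    spec = torch.pow(10.0, (channel * 80 - 80) / 20.0)
    # Image height is read as the number of frequency bins (n_fft // 2 + 1)
    freq_bins = channel.shape[0]
    n_fft = (freq_bins - 1) * 2
    hop_length = n_fft // 2
    # Griffin-Lim iteratively estimates the phase that the image does not contain
    griffin_lim = T.GriffinLim(n_fft=n_fft, hop_length=hop_length)
    reconstructed_waveform = griffin_lim(spec)
    return reconstructed_waveform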

waveforms = [invert_channel(c) for c in rgb]

waveforms[0]
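
The arithmetic inside `invert_channel` can be checked by hand with a small sketch. The helper names below are hypothetical, the 296 x 400 image size is only an assumed example, and the duration estimate assumes centered STFT frames, so treat the numbers as illustrative.

```python
def pixel_to_magnitude(pixel, db_range=80.0):
    """Map brightness in [0, 1] to decibels in [-db_range, 0], then to linear magnitude."""
    db = pixel * db_range - db_range
    return 10.0 ** (db / 20.0)

def griffin_lim_geometry(height, width, sample_rate=24000):
    """Derive STFT settings and output duration from an image read as a spectrogram."""
    n_fft = (height - 1) * 2                 # height = n_fft // 2 + 1 frequency bins
    hop_length = n_fft // 2
    num_samples = hop_length * (width - 1)   # assumes centered frames
    return n_fft, hop_length, num_samples / sample_rate

# Black pixels become very quiet, white pixels become full magnitude
for p in (0.0, 0.5, 1.0):
    print(p, pixel_to_magnitude(p))

# 296 x 400 is an assumed image size, used only for illustration
print(griffin_lim_geometry(height=296, width=400))
```

Under these assumptions the image width sets the length of the audio and the image height sets its frequency resolution, so resizing the image before inversion changes both.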


In [None]:
#method 1: outcome from the red channel only
from IPython.display import Audio

# waveforms[0] is the audio reconstructed from the red channel
red = waveforms[0]

# Normalize to [-1, 1], skipping silent output to avoid dividing by zero
max_val = red.abs().amax()
if max_val > 0:
    red = red / max_val

# torchaudio.save expects [channels, samples]
torchaudio.save("red_channel.wav", red.unsqueeze(0), 24000)

display(Audio("red_channel.wav"))


In [None]:
#Take an average of R, G, and B channels
# Stack all waveforms into a single tensor: [C, T]
stacked = torch.stack(waveforms, dim=0)

# Average across channels
mixed = stacked.mean(dim=0)

# Normalize safely
max_val = mixed.abs().amax()
if max_val > 0:
    mixed = mixed / max_val

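import math
import torchaudio.functional as F

#make audio file longer
def stretch_audio(original_audio, stretch_rate, n_fft=1024, hop_length=256):
    # phase_vocoder operates on a complex spectrogram, so take an STFT first
    window = torch.hann_window(n_fft)
    spec = torch.stft(original_audio, n_fft=n_fft, hop_length=hop_length,
                      window=window, return_complex=True)
    # Expected phase advance per hop for each frequency bin, shape [freq, 1]
    phase_advance = torch.linspace(0, math.pi * hop_length, spec.shape[-2])[..., None]
    # rate < 1 slows the audio down, so the result is stretch_rate times longer
    stretched_spec = F.phase_vocoder(spec, rate=1 / stretch_rate, phase_advance=phase_advance)
    # Back to a waveform, with a leading channel dimension for torchaudio.save
    return torch.istft(stretched_spec, n_fft=n_fft, hop_length=hop_length, window=window).unsqueeze(0)

stretched = stretch_audio(mixed, 3)

torchaudio.save("rgb_avg.wav", stretched.to(torch.float32), 24000)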

# Play output audio
Audio("rgb_avg.wav")

In [None]:
#method 2: sum one chirp tone per image column (red → pitch, green → amplitude)
from scipy.signal import chirp

img_np = np.array(result[0].resize((400, 300))) / 255.0
height, width, _ = img_np.shape
duration = 30  # seconds
sample_rate = 24000
samples = np.linspace(0, duration, int(sample_rate*duration))

audio = np.zeros_like(samples)
for x in range(width):
    col = img_np[:, x, :].mean(axis=0)
    freq = 800 * col[0]   # red → pitch
    amp = 0.2 + 0.8 * col[1]    # green → amplitude
    tone = amp * chirp(samples, f0=freq, f1=freq * 1.5, t1=duration, method='linear')
    audio += tone / width

# Convert NumPy array to PyTorch tensor and add a channel dimension
audio_tensor = torch.from_numpy(audio).unsqueeze(0).to(torch.float32)

torchaudio.save("chirp_file.wav", audio_tensor, sample_rate)

#Play audio
Audio("chirp_file.wav")

## Conceptual Framework

This project operates at the intersection of several questions:

 - On sampling: If hip-hop sampling recontextualizes music, what happens when we sample across modalities? Is it still sampling if the outcome passes through a different format?

 - On ownership: Who has the right to claim work coming out of this system? The person who created the original audio? This system itself? The user?

 - On translation: Every conversion (audio→visual→audio) is interpretive, not neutral. The technical choices I make about how to read data determine what meaning survives the translation.

 - On machine learning as medium: AI/ML systems don't distinguish between audio and visual data. Both are multidimensional arrays. I exploited this interchangeability as a creative opening.

 - On multiplicity: Different conversion methods extract different "meanings" from identical data.

## Notes

- Each run is unique due to Stable Diffusion's stochastic sampling.

- All result files are deleted at the end of each session. Save your outputs before you close the Colab tab or navigate away to a different window. To download an output file, click the file icon in the left sidebar and select 'Download' from the dropdown menu for that file.

## Context
Created as a final project for Poetics and Protocols of Sampling (Autumn 2025) at the School of Poetic Computation. SFPC is an experimental online school hosting classes that blend art, code, and critical theory.

## Contact Info
Alex Lee
<br> GitHub: https://github.com/seohyeonlee2020
