```{contents}
```
## Data Formats

In generative AI, a data format is the representation in which a model receives its input or produces its output. Each modality (text, images, audio, and so on) has its own file formats, encodings, and preprocessing steps.

The sections below cover each modality in turn.

---

### **1. Text Data Formats**

Used by LLMs such as GPT, LLaMA, and Gemini.

**Common formats**

* **Raw text** (TXT)
* **Tokenized text** (integer token IDs)
* **JSON / structured text**
* **Markdown / HTML** when structure matters

**Processing**

* Normalization, tokenization, embedding.
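The first two processing steps can be sketched with the standard library alone. This is a toy word-level tokenizer, not the subword scheme (such as BPE) that production LLMs typically use; the function names `normalize`, `build_vocab`, and `encode`, and the `<unk>` token, are illustrative assumptions.

```python
import unicodedata

def normalize(text: str) -> str:
    """Apply Unicode NFKC normalization, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.lower().split())

def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign an integer ID to each distinct whitespace-separated word."""
    vocab = {"<unk>": 0}  # ID 0 is reserved for unknown words (assumed convention)
    for line in corpus:
        for word in normalize(line).split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Map text to token IDs, falling back to <unk> for unseen words."""
    return [vocab.get(word, vocab["<unk>"]) for word in normalize(text).split()]

corpus = ["Generative models  read tokens", "Tokens are integers"]
vocab = build_vocab(corpus)
token_ids = encode("Models read   integers quickly", vocab)
print(token_ids)  # "quickly" is not in the vocabulary, so it maps to ID 0
```

In a real model, each ID would then index a row of a learned embedding matrix, which is the third step in the list above.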

**Used for**

* Chatbots
* Summarization
* Code generation

---

### **2. Image Data Formats**

Used by models like Stable Diffusion, Midjourney, Imagen.

**Common formats**

* **PNG** (lossless, supports transparency)
* **JPEG/JPG** (lossy, smaller files, no transparency)
* **BMP**, **TIFF**

**Processing**

* Resize, normalize pixel values, RGB conversion.
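As a rough illustration of these three steps, the sketch below works on a tiny grayscale image stored as nested Python lists. Real pipelines would normally decode files and resize with an imaging library; the helper names `to_rgb`, `resize_nearest`, and `normalize` are assumptions made for this example, and nearest-neighbor is only one of several possible resampling methods.

```python
def to_rgb(gray):
    """Expand a grayscale image (rows of ints) to RGB by repeating each value."""
    return [[(v, v, v) for v in row] for row in gray]

def resize_nearest(image, new_h, new_w):
    """Nearest-neighbor resize of an image stored as a list of rows."""
    old_h, old_w = len(image), len(image[0])
    return [
        [image[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]

def normalize(image):
    """Scale 8-bit channel values from [0, 255] to floats in [0.0, 1.0]."""
    return [[tuple(ch / 255.0 for ch in px) for px in row] for row in image]

# A 4x4 grayscale test image with values in the 8-bit range.
gray = [
    [0, 64, 128, 255],
    [0, 64, 128, 255],
    [255, 128, 64, 0],
    [255, 128, 64, 0],
]

small = normalize(resize_nearest(to_rgb(gray), 2, 2))
print(small[0][0], small[1][0])
```

Many models additionally subtract a per-channel mean and divide by a standard deviation; those constants are model-specific and are left out here.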

**Used for**

* Image generation
* Image editing
* Vision–language models

---

### **3. Audio Data Formats**

Used by models like Whisper, AudioLM, VALL-E.

**Common formats**

* **WAV** (typically uncompressed PCM)
* **MP3** (lossy compressed)
* **FLAC** (lossless compressed)

**Processing**

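* Decode to a waveform, compute a spectrogram (often log-mel), then feed the features to the model or quantize them into discrete tokens, depending on the architecture.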
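The waveform and spectrogram stages can be sketched using only the standard library. The example synthesizes a short tone, writes it as in-memory WAV bytes, decodes it back to floats, and computes a magnitude spectrogram with a naive DFT. The sample rate, frame length, and hop size are arbitrary assumptions chosen to keep the example small, and the final tokenization stage is model-specific, so it is not shown.

```python
import cmath
import io
import math
import struct
import wave

SAMPLE_RATE = 8000  # Hz, an assumption chosen only to keep the example small

def make_wav_bytes(freq_hz: float, n_samples: int) -> bytes:
    """Synthesize a mono 16-bit PCM sine tone and return it as WAV bytes."""
    samples = [
        int(32767 * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE))
        for n in range(n_samples)
    ]
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(struct.pack(f"<{n_samples}h", *samples))
    return buf.getvalue()

def read_waveform(wav_bytes: bytes) -> list[float]:
    """Decode 16-bit mono WAV bytes into floats in roughly [-1, 1]."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        raw = wf.readframes(wf.getnframes())
    ints = struct.unpack(f"<{len(raw) // 2}h", raw)
    return [s / 32768.0 for s in ints]

def spectrogram(signal, frame_len=64, hop=32):
    """Magnitude spectrogram via a naive DFT per frame (no window function)."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        mags = [
            abs(sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n, x in enumerate(frame)))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(mags)
    return frames

waveform = read_waveform(make_wav_bytes(1000.0, 256))
spec = spectrogram(waveform)
# Index of the strongest frequency bin in the first frame.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), len(spec[0]), peak_bin)
```

With these settings each bin spans 8000 / 64 = 125 Hz, so the 1000 Hz tone should peak at bin 8. Practical code would use an FFT and a window function instead of the quadratic-time DFT shown here.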

**Used for**

* Speech-to-text
* Text-to-speech
* Audio generation

---

### **4. Video Data Formats**

Used by models like Sora, Runway, Pika.

**Common formats**

* **MP4**, **MOV** (container formats, typically holding compressed video and audio streams)
* **Frames** stored as image sequences

**Processing**

* Frame extraction, temporal modeling.
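Decoding a container into frames requires a video library, so the sketch below covers only the bookkeeping around frame extraction: choosing which frame indices to keep and grouping them into fixed-length clips for a temporal model. Uniform sampling is one common choice among several, and the function names `sample_frame_indices` and `make_clips` are assumptions for illustration.

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread evenly across the video."""
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    # Take the frame at the middle of each equal-width segment.
    return [int(step * i + step / 2) for i in range(num_samples)]

def make_clips(indices: list[int], clip_len: int) -> list[list[int]]:
    """Group sampled indices into fixed-length, non-overlapping clips."""
    return [
        indices[i:i + clip_len]
        for i in range(0, len(indices) - clip_len + 1, clip_len)
    ]

# A hypothetical 300-frame video, reduced to 8 frames in 2 clips of 4.
indices = sample_frame_indices(total_frames=300, num_samples=8)
clips = make_clips(indices, clip_len=4)
print(indices)
print(clips)
```

Each selected index would then be decoded to an image and preprocessed like any other image before the clip is stacked along a time axis.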

**Used for**

* Video generation
* Video prediction

---

### **5. Tabular Data Formats**

Used by generative models that work on structured, row-and-column data.

**Common formats**

* **CSV**
* **Parquet**
* **Excel**
* **SQL tables**

**Processing**

* Numerical scaling, categorical encoding.
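Both steps can be shown on a small in-memory CSV. This is a minimal sketch: the column names `age` and `plan` and the sample rows are invented for the example, and min-max scaling and one-hot encoding are just one common choice for each step.

```python
import csv
import io

# Hypothetical sample data; a real pipeline would read from a file.
CSV_TEXT = """age,plan
20,basic
30,pro
40,basic
"""

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

def min_max_scale(values):
    """Rescale numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    return [(v - lo) / span for v in values]

def one_hot(values):
    """Return the sorted category list and one indicator vector per value."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]

ages = min_max_scale([float(r["age"]) for r in rows])
categories, plans = one_hot([r["plan"] for r in rows])

# Concatenate the scaled number and the indicator vector into one feature row.
features = [[a, *p] for a, p in zip(ages, plans)]
print(categories)
print(features)
```

The scaling bounds and category list should be computed on training data and reused at generation time so that synthetic rows can be mapped back to the original units.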

**Used for**

* Synthetic data generation
* Forecasting models

---

### **6. Embedding Formats**

Represent semantic meaning numerically.

**Common formats**

* **Dense vectors** (e.g., NumPy arrays saved as `.npy`)
* **Float vectors** stored in a vector index or vector database (e.g., FAISS, Pinecone)

**Used for**

* RAG pipelines
* Search engines
* Recommendation systems
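To make the search use case concrete, here is a minimal brute-force similarity search over embeddings stored as plain float lists. The three-dimensional vectors and document IDs are made up for illustration (real embeddings usually have hundreds or thousands of dimensions), and a vector index would replace the linear scan at scale.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings keyed by document ID.
index = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.2, 0.1],
    "doc_tax": [0.0, 0.1, 0.9],
}

def search(query, index, top_k=2):
    """Return the IDs of the top_k vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:top_k]]

results = search([1.0, 0.0, 0.0], index)
print(results)
```

In a RAG pipeline, the query vector would come from embedding the user's question, and the returned IDs would select the passages placed in the prompt.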

---

### **7. Multimodal Formats**

For models that combine text, images, audio, and video.

**Formats**

* **JSON with mixed fields**
* **Tensor representations**
* **Base64 encoded media**

**Used for**

* Vision–language models (e.g., GPT-4o)
* Image captioning
* Text-to-image/video models
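A JSON message with mixed fields and Base64-encoded media might look like the sketch below. The field names (`text`, `image`, `mime_type`, `data`) are assumptions for illustration and do not follow any particular provider's API schema; the image bytes are just the 8-byte PNG signature used as a placeholder, not a valid image.

```python
import base64
import json

# Placeholder payload: the PNG file signature only, not a complete image.
image_bytes = bytes([137, 80, 78, 71, 13, 10, 26, 10])

# JSON cannot carry raw bytes, so binary media is encoded as Base64 text.
message = {
    "text": "Describe this image.",
    "image": {
        "mime_type": "image/png",
        "data": base64.b64encode(image_bytes).decode("ascii"),
    },
}
payload = json.dumps(message)

# The receiving side reverses both steps to recover the original bytes.
decoded = json.loads(payload)
restored = base64.b64decode(decoded["image"]["data"])
print(decoded["image"]["data"], restored == image_bytes)
```

Base64 inflates the media by roughly one third, which is why large files are often passed by reference (a URL or file ID) instead of inline.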

---

### **8. Code Formats**

Used by code models (Codex, StarCoder).

**Formats**

* **Source code files**: .py, .js, .java, .c, etc.
* **Tokenized code sequences**

**Used for**

* Code generation
* Fixing errors
* Auto-complete
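As a small illustration of turning source code into a token sequence, the sketch below uses Python's own lexer from the standard `tokenize` module and then assigns integer IDs. This is an assumption made for clarity: code models generally apply a learned subword tokenizer (such as BPE) to the raw source text instead of a language-specific lexer, so the exact tokens would differ.

```python
import io
import tokenize

SOURCE = "def add(a, b):\n    return a + b\n"

# Layout-only token types that carry no text of their own.
SKIP = {
    tokenize.NEWLINE,
    tokenize.NL,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.ENDMARKER,
}

def lex(source: str) -> list[str]:
    """Split Python source into lexical tokens, dropping layout tokens."""
    reader = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(reader)
        if tok.type not in SKIP
    ]

tokens = lex(SOURCE)

# Assign each distinct token the next free integer ID, in order of appearance.
vocab: dict[str, int] = {}
token_ids = [vocab.setdefault(t, len(vocab)) for t in tokens]

print(tokens)
print(token_ids)
```

Repeated identifiers such as `a` and `b` reuse their earlier IDs, which is the same property a learned tokenizer relies on to keep sequences compact.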

---

### **Summary**

| Modality   | Main Formats          | Used For                 |
| ---------- | --------------------- | ------------------------ |
| Text       | TXT, JSON, tokens     | LLMs                     |
| Images     | PNG, JPG              | Diffusion, vision models |
| Audio      | WAV, MP3              | Speech models            |
| Video      | MP4, MOV              | Video generation         |
| Tabular    | CSV, Parquet          | Synthetic data           |
| Embeddings | Vectors, NumPy arrays | RAG, search              |
| Multimodal | JSON + media          | Vision-language tasks    |
| Code       | .py, .js, .java       | Code generation          |
