# Avro File Format (Apache Avro)

**Apache Avro** is a row-oriented binary serialization format designed for efficient data exchange and schema evolution in distributed systems. It is widely used in Kafka, Hadoop, and streaming pipelines.

**Schema-Based Serialization**

* Data is written with a schema defined in JSON.
* The schema can be embedded in the file (common) or managed externally (e.g., Schema Registry).

In [22]:
# Example Avro schema; Avro schemas are written as JSON.
print("""
{
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "age", "type": ["null", "int"], "default": null}
    ]
}
""")


{
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "age", "type": ["null", "int"], "default": null}
    ]
}



**Compact & Fast**

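Uses a binary encoding that does not repeat field names in every record, so files are typically smaller than the equivalent JSON or CSV.
Optimized for fast sequential writes and reads, especially in streaming.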

**Schema Evolution (Key Strength)**

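Avro supports backward and forward compatibility:

* Fields can be added or removed, provided the affected fields have default values.
* Default values can be changed.
* Reader and writer schemas can differ; Avro resolves the differences at read time.

This is critical in event-driven systems, where producers and consumers are upgraded independently.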

**Row-Oriented Storage**

Stores data record by record.

**Ideal for:**

* Streaming
* Message queues
* Incremental ingestion

Not ideal for heavy analytical scans that read only a few columns; columnar formats such as Parquet or ORC are better there.

**Avro File Structure**

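An .avro file (an Avro object container file) contains:

**Header**

* Magic bytes (`Obj` followed by the byte `0x01`)
* Metadata (including the schema and the compression codec)
* A 16-byte sync marker, randomly generated per file

**Data Blocks**

* A record count, the block size in bytes, and the serialized (optionally compressed) records

**Sync Marker**

* Repeated after every data block
* Enables file splitting in distributed systems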

| Feature          | Avro                      | Parquet                            | ORC                     |
| ---------------- | ------------------------- | ---------------------------------- | ----------------------- |
| Storage          | Row-based                 | Columnar                           | Columnar                |
| Schema           | Embedded in header (JSON) | Embedded in file footer            | Embedded in file footer |
| Best For         | Streaming, Kafka          | Analytics, OLAP                    | Analytics               |
| Compression      | Yes (block-level)         | Yes (column-level, usually better) | Yes (column-level)      |
| Schema Evolution | Excellent                 | Limited                            | Limited                 |


In [23]:
import fastavro

In [24]:
data = [
    {"id": 1, "name": "Shravan", "age": 28},
    # 25.8 is a float in an "int" field: it is silently truncated to 25 on write
    # (see the output below); a string here would raise a TypeError instead.
    {"id": 2, "name": "Hanvika", "age": 25.8}
]

In [25]:
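# Note: despite its name, this is a plain schema dict, not a Schema Registry client.
# "age" is a required int here, unlike the nullable union in the earlier schema.
schema_registry = {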
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"}
    ]
}

In [26]:
with open("my_data.avro", "wb") as my_data:
    fastavro.writer(my_data, schema_registry, data)

In [27]:
with open("my_data.avro", "rb") as read_my_data:
    for r in fastavro.reader(read_my_data):
        print(r)

{'id': 1, 'name': 'Shravan', 'age': 28}
{'id': 2, 'name': 'Hanvika', 'age': 25}


In [28]:
import os

os.listdir()

['avro_file.ipynb',
 'config_file.ipynb',
 'csv_file.ipynb',
 'json_file.ipynb',
 'my_data.avro',
 'orc_file.ipynb',
 'parquet_file.ipynb',
 'toml_file.ipynb',
 'yaml_file.ipynb']

In [29]:
os.unlink("my_data.avro")