# **Day 1: The Physics of Data - Interactive Lab**

**User Profile:** Senior Data Engineer  
**Objective:** Stop treating the OS as a black box. In this notebook, we will use **Bash** (to inspect the Kernel) and **Python** (to implement specific I/O patterns) side-by-side.

### **Core Concepts:**
1. **Inodes:** The physical identity of a file.
2. **Kernel Metrics:** Reading `/proc` to see what the OS is thinking.
3. **System Limits:** Crashing the `ulimit`.
4. **I/O Physics:** 1 Byte vs 4KB vs Memory Mapping.

---

## **🐣 Level 1: Beginner (The Basics)**
**Goal:** Prove that filenames are just labels and the OS uses Integers (Inodes & FDs) to track data.

### **Example 1: The Inode Identity**
**Theory:** If you create a "Hard Link", you have two filenames pointing to the **same** physical Inode.

In [None]:
%%bash
echo "--- [Bash] Inspecting Inodes ---"
# 1. Create a file
echo "I am physical data on the disk." > /tmp/inode_demo.txt

# 2. Create a Hard Link (A second pointer to the same data)
ln /tmp/inode_demo.txt /tmp/inode_link.txt

# 3. Show Inode Numbers (-i flag)
# Notice the first column is IDENTICAL
ls -li /tmp/inode_demo.txt /tmp/inode_link.txt

In [None]:
import os

print("--- [Python] Verifying Inodes ---")
f1 = "/tmp/inode_demo.txt"
f2 = "/tmp/inode_link.txt"

# Get Inode integers
inode_1 = os.stat(f1).st_ino
inode_2 = os.stat(f2).st_ino

print(f"File 1 Inode: {inode_1}")
print(f"File 2 Inode: {inode_2}")

if inode_1 == inode_2:
    print("✅ PROOF: Different names, SAME physical file.")
else:
    print("❌ Failed.")
    
# Cleanup
if os.path.exists(f1): os.remove(f1)
if os.path.exists(f2): os.remove(f2)
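
A hard link also raises the inode's link count, and the data stays reachable until the last name is removed. The sketch below illustrates this under a few assumptions: the temporary directory and the file names `original.txt` and `alias.txt` are placeholders invented for the demo, and the filesystem is assumed to support hard links.

```python
import os
import tempfile

# Work in a throwaway directory so nothing else in /tmp is touched.
with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "original.txt")  # hypothetical name
    alias = os.path.join(workdir, "alias.txt")        # hypothetical name

    with open(original, "w") as f:
        f.write("payload")

    os.link(original, alias)                  # second name, same inode
    links_with_both = os.stat(original).st_nlink

    os.remove(original)                       # drop the first name only
    links_after_remove = os.stat(alias).st_nlink

    with open(alias) as f:                    # data is still reachable
        surviving_data = f.read()

print(f"Link count with two names:     {links_with_both}")
print(f"Link count after removing one: {links_after_remove}")
print(f"Data via the remaining name:   {surviving_data!r}")
```

Removing a name only decrements the link count; the kernel frees the data blocks once the count reaches zero and no process holds the file open.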

### **Example 2: The Kernel Spy (/proc)**
**Theory:** Tools like `htop` just read text files in `/proc`. Let's see the **Page Cache** (RAM used to cache disk).

In [None]:
%%bash
echo "--- [Bash] Reading Kernel Memory Info ---"
# The Kernel exposes memory stats here
grep -E "^(MemTotal|Cached|Dirty):" /proc/meminfo

In [None]:
print("--- [Python] Parsing Kernel Metrics ---")

def get_kernel_memory():
    stats = {}
    with open("/proc/meminfo", "r") as f:
        for line in f:
            parts = line.split(":")
            if len(parts) == 2:
                stats[parts[0].strip()] = parts[1].strip()
    return stats

mem = get_kernel_memory()
print(f"Total RAM:   {mem.get('MemTotal')}")
print(f"Page Cache:  {mem.get('Cached')} (Data sitting in RAM)")
print(f"Dirty Pages: {mem.get('Dirty')} (Data waiting to flush to disk)")
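
The values in `/proc/meminfo` are strings such as `16384256 kB`, so any arithmetic needs them as integers first. Here is a minimal sketch of that conversion. The helper name `parse_meminfo` and the `SAMPLE` text are made up for illustration so the block runs on any OS; on Linux you would pass the real file contents instead.

```python
def parse_meminfo(text):
    """Turn /proc/meminfo-style text into {field: integer value}."""
    stats = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        # Values look like "16384256 kB"; a few counters carry no unit.
        if fields and fields[0].isdigit():
            stats[name.strip()] = int(fields[0])
    return stats

# Made-up sample so the sketch runs anywhere; on Linux, pass
# open("/proc/meminfo").read() instead.
SAMPLE = """MemTotal:       16384256 kB
Cached:          4096000 kB
SwapCached:            0 kB
Dirty:               512 kB
"""

mem_kb = parse_meminfo(SAMPLE)
cache_share = mem_kb["Cached"] / mem_kb["MemTotal"]
print(f"Page cache share of RAM: {cache_share:.1%}")
```

Matching on the exact field name matters here: `Cached` and `SwapCached` are separate counters, and a substring match would mix them up.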

---

## **🔧 Level 2: Intermediate (Mechanics)**
**Goal:** Understand System Calls, File Descriptors, and Limits.

### **Example 3: The "Too Many Open Files" Crash**
**Theory:** Every process has a limit (`ulimit -n`). Leaking file handles crashes apps like Kafka or Spark.

In [None]:
%%bash
echo "--- [Bash] Checking Limits ---"
echo "Soft Limit (Process): $(ulimit -n)"
echo "Hard Limit (Max):     $(ulimit -Hn)"

In [None]:
import resource

print("--- [Python] Leaking FDs ---")
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"My Limit: {soft} files")

handles = []
try:
    # Attempt to open more files than allowed
    for i in range(soft + 10):
        f = open("/dev/null", "r")
        handles.append(f)
except OSError as e:
    print(f"\n💥 CRASHED at file #{len(handles)}!")
    print(f"Error: {e}")
finally:
    # Cleanup
    for f in handles: f.close()
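
On machines where the soft limit is very large, the loop above can take a while before it fails. One possible variation, sketched here, lowers the soft limit temporarily with `resource.setrlimit` and restores it afterwards. The value `64` is an arbitrary demo number, the block assumes a Unix-like OS, and inside a live notebook kernel the lowered limit briefly applies to the whole process, so treat it as an experiment rather than a recommended pattern.

```python
import errno
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
demo_limit = 64  # illustrative value, not a recommendation
if soft != resource.RLIM_INFINITY:
    demo_limit = min(demo_limit, soft)

handles = []
hit_errno = None
# Lower only the soft limit; the hard limit is left unchanged.
resource.setrlimit(resource.RLIMIT_NOFILE, (demo_limit, hard))
try:
    # stdin/stdout/stderr already use descriptors, so this must fail.
    for _ in range(demo_limit + 1):
        handles.append(open("/dev/null", "rb"))
except OSError as e:
    hit_errno = e.errno
finally:
    for h in handles:
        h.close()
    # Restore the original soft limit for the rest of the session.
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))

print(f"Opened {len(handles)} files before errno {hit_errno}")
print(f"Is EMFILE: {hit_errno == errno.EMFILE}")
```

`EMFILE` is the per-process "Too many open files" error; the system-wide equivalent is `ENFILE`.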

### **Example 4: The "Flush" Latency Trap**
**Theory:** Data sits in a User Space Buffer until "flushed". This reduces syscalls but adds latency.

In [None]:
import time
import os

filename = "/tmp/latency_test.txt"

print("--- [Python] Buffering Demo ---")
with open(filename, "w") as f:
    f.write("Hidden Data")
    size = os.stat(filename).st_size
    print(f"Written to buffer. Size on disk: {size} bytes (expected 0: the data is still in Python's buffer)")
    
    f.flush() # Push Python's buffer into the Kernel page cache
    os.fsync(f.fileno()) # Ask the Kernel to persist it to the storage device
    
    size = os.stat(filename).st_size
    print(f"After Flush+Fsync. Size on disk: {size} bytes")

os.remove(filename)
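
The user-space buffer also empties on its own once it fills up, without an explicit `flush()`. The sketch below shows this with a binary file and an explicit buffer size. The constant `BUF` and the variable names are assumptions made for the demo, and how many bytes reach the kernel after the large write is an implementation detail of the interpreter, so only the lower bound is checked.

```python
import os
import tempfile

BUF = 4096  # explicit buffer size chosen for the demo

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with open(path, "wb", buffering=BUF) as f:
        f.write(b"a" * 10)                   # fits in the buffer
        size_small = os.stat(path).st_size
        f.write(b"b" * (BUF * 2))            # cannot fit, forces a write()
        size_big = os.stat(path).st_size
    size_closed = os.stat(path).st_size      # close() flushes whatever is left
finally:
    os.remove(path)

print(f"After 10-byte write: {size_small} bytes visible")
print(f"After large write:   {size_big} bytes visible")
print(f"After close:         {size_closed} bytes visible")
```

This is why small log lines can appear late in a file while large batches show up almost immediately.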

---

## **🚀 Level 3: Advanced (Optimization)**
**Goal:** Cut syscall and copy overhead for high performance.

### **Example 5: The Block Size Benchmark**
**Theory:** Writing 1MB one Byte at a time = about 1 Million System Calls. Writing the same 1MB in 4KB chunks = about 250 System Calls.
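
The rough figures above come from simple division. As a worked example, assuming every `write()` call transfers one full chunk and taking 1MB as 1,000,000 bytes (the helper name `write_calls` is made up for illustration):

```python
import math

def write_calls(total_bytes, chunk_size):
    """Number of write() calls needed to emit total_bytes in fixed-size chunks."""
    return math.ceil(total_bytes / chunk_size)

ONE_MB = 1_000_000  # decimal megabyte, matching the "1 million" figure
calls_1b = write_calls(ONE_MB, 1)
calls_4k = write_calls(ONE_MB, 4096)

print(f"1-byte chunks: {calls_1b:,} calls")
print(f"4KB chunks:    {calls_4k:,} calls")
print(f"Ratio:         {calls_1b / calls_4k:,.0f}x fewer calls")
```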

In [None]:
%%bash
echo "--- [Bash] DD Benchmark ---"
# Write ~500KB in 1-byte chunks (Very Slow)
echo "1. Writing ~500KB in 1-byte chunks..."
time dd if=/dev/zero of=/tmp/test_1b.dat bs=1 count=500000 2>&1 | grep "records in"

# Write ~500KB in 4KB chunks (Instant)
echo "2. Writing ~500KB in 4KB chunks..."
time dd if=/dev/zero of=/tmp/test_4k.dat bs=4096 count=122 2>&1 | grep "records in"

rm /tmp/test_1b.dat /tmp/test_4k.dat

In [None]:
import os
import time

print("--- [Python] Syscall Benchmark ---")

def benchmark_write(chunk_size, total_bytes=500_000):
    filename = f"/tmp/bench_{chunk_size}.bin"
    data = b'X' * chunk_size
    
    start = time.perf_counter() # Monotonic clock, better suited to timing than time.time()
    with open(filename, "wb", buffering=0) as f: # buffering=0 forces one write() Syscall per call
        for _ in range(total_bytes // chunk_size):
            f.write(data)
    elapsed = time.perf_counter() - start
    os.remove(filename) # Cleanup the benchmark file
    return elapsed

t_1 = benchmark_write(1)
t_4k = benchmark_write(4096)

print(f"1 Byte Writes: {t_1:.4f}s")
print(f"4 KB Writes:   {t_4k:.4f}s")
print(f"🚀 Speedup: {t_1/t_4k:.1f}x Faster")

### **Example 6: Zero-Copy mmap**
**Theory:** Bypass the `read()` syscall entirely. Treat the disk file as an array in RAM.

In [None]:
import mmap
import os

print("--- [Python] Memory Mapping ---")
filename = "/tmp/mmap_test.dat"

# 1. Create a file
with open(filename, "wb") as f:
    f.write(b"Hello Kernel World " * 100)

# 2. Map it
with open(filename, "r+b") as f:
    # Map file to memory
    mm = mmap.mmap(f.fileno(), 0)
    
    # Read it like a bytes object (no read() syscall; the kernel pages data in on demand)
    print(f"First 20 bytes: {mm[:20]}")
    
    # Modify the file by changing memory (the kernel writes the dirty page back later)
    mm[0:5] = b"HELLO"
    mm.close()

os.remove(filename)
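
A mapped file also supports searching in place, which avoids loading the whole file into a Python `bytes` object first. The following is a small sketch using `mmap.find` on a read-only mapping; the marker `NEEDLE` and the file layout are invented for the demo.

```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    # Build a file with a marker buried in the middle.
    with os.fdopen(fd, "wb") as f:
        f.write(b"." * 10_000 + b"NEEDLE" + b"." * 10_000)

    with open(path, "rb") as f:
        # ACCESS_READ maps the file read-only; length 0 means "whole file".
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            offset = mm.find(b"NEEDLE")
            window = mm[offset:offset + 6]   # slicing copies only these bytes
finally:
    os.remove(path)

print(f"Marker found at byte offset {offset}: {window}")
```

Only the pages the search actually touches are brought into memory, so the same approach can work on files larger than available RAM, at the cost of page faults instead of `read()` calls.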

---

## **👑 Level 4: Senior/Expert (Architecture)**
**Goal:** Replicate Database internals (Pages, ACID).

### **Example 7: The "Database Page" Reader**
**Theory:** Databases don't read files line-by-line. They jump (`seek`) to specific fixed-size "Pages" (PostgreSQL defaults to 8KB; this demo uses 4KB).

In [None]:
import os

print("--- [Python] DB Page Simulation ---")
PAGE_SIZE = 4096 # 4KB Page
filename = "/tmp/db_data.dat"

# Create dummy DB file (3 Pages)
with open(filename, "wb") as f:
    f.write(b"Page0..." + b"\0" * (PAGE_SIZE - 8))
    f.write(b"Page1..." + b"\0" * (PAGE_SIZE - 8))
    f.write(b"Page2..." + b"\0" * (PAGE_SIZE - 8))

def read_page(page_id):
    with open(filename, "rb") as f:
        # The Senior Move: Seek directly to offset
        f.seek(page_id * PAGE_SIZE)
        return f.read(8) # Read header only

print(f"Reading Page 1: {read_page(1)}")
print(f"Reading Page 2: {read_page(2)}")

os.remove(filename)
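
The seek-then-read pair can be collapsed into a single positional read with `os.pread`, which takes the offset as an argument and leaves the descriptor's file position alone. A minimal sketch, assuming a Unix-like OS; the helper name `read_page_header` is made up, and the page layout mirrors the demo above.

```python
import os
import tempfile

PAGE_SIZE = 4096

def read_page_header(fd, page_id, header_len=8):
    """Read the first header_len bytes of a page without moving the file offset."""
    return os.pread(fd, header_len, page_id * PAGE_SIZE)

fd, path = tempfile.mkstemp()
try:
    # Three pages, each starting with an 8-byte header and padded with zeros.
    for i in range(3):
        header = f"Page{i}...".encode()
        os.write(fd, header + b"\0" * (PAGE_SIZE - len(header)))

    # Pages can be fetched in any order through the same descriptor.
    header_2 = read_page_header(fd, 2)
    header_0 = read_page_header(fd, 0)
finally:
    os.close(fd)
    os.remove(path)

print(f"Page 2 header: {header_2}")
print(f"Page 0 header: {header_0}")
```

Because no shared file position is involved, several threads can read different pages through one descriptor without coordinating seeks.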

### **Example 8: Atomic Swap (ACID Basics)**
**Theory:** How do you update a file without corrupting it if power fails? Write to temp, then `os.replace` (Atomic Inode Switch).

In [None]:
%%bash
echo "--- [Bash] Atomic Move ---"
echo "Old Data" > /tmp/production.db
echo "New Data" > /tmp/production.tmp

# 'mv' on the same filesystem is atomic
# It swaps the Inode pointer instantly
mv /tmp/production.tmp /tmp/production.db

cat /tmp/production.db
rm /tmp/production.db

In [None]:
import os

print("--- [Python] Atomic Replace ---")
db_file = "/tmp/app_config.json"
tmp_file = "/tmp/app_config.tmp"

# Initial state
with open(db_file, "w") as f: f.write('{"status": "running"}')

# 1. Write to Temp (Safe Zone)
with open(tmp_file, "w") as f:
    f.write('{"status": "maintenance"}')
    f.flush()
    os.fsync(f.fileno())

# 2. Atomic Switch
os.replace(tmp_file, db_file)

with open(db_file, "r") as f:
    print(f"Final Config: {f.read()}")

if os.path.exists(db_file): os.remove(db_file)
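
The write-then-replace steps can be wrapped in a reusable helper. The sketch below is one possible shape, not a definitive implementation: the name `atomic_write` and the `.tmp-` prefix are invented here, and it assumes a POSIX system where a directory can be opened and passed to `os.fsync` so the rename itself is persisted.

```python
import os
import tempfile

def atomic_write(path, data):
    """Replace path with data so readers see the old or the new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem as the target,
    # so create it in the target's own directory.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Do not leave a stray temp file behind on failure.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    # Persist the rename by syncing the directory entry (POSIX assumption).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

demo_dir = tempfile.mkdtemp()
target = os.path.join(demo_dir, "config.json")
atomic_write(target, b'{"status": "running"}')
atomic_write(target, b'{"status": "maintenance"}')
with open(target, "rb") as f:
    final_config = f.read()
leftovers = os.listdir(demo_dir)
os.remove(target)
os.rmdir(demo_dir)

print(f"Final config: {final_config}")
print(f"Files left in directory: {leftovers}")
```

Note that `tempfile.mkstemp` creates the file with owner-only permissions, so a real helper may need to copy the mode of the file it replaces.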

---

## **🛠️ External Lab Execution**
We also have standalone scripts in the `lab/` folder. You can run them directly from here.

In [None]:
%%bash
echo "Running External FD Leak Detector..."
# Note: Ensure you are in the correct directory or provide full path
# python3 ../lab/fd_leak_detector.py
echo "(Skipped for safety in notebook, run in terminal)"