# **Day 2 Lab: Advanced Shell & Stream Processing - The Physics of Pipes**

**Objective:** Move from the Kernel (Day 1) to the Shell (Day 2). We will explore the physics of pipes, efficient filtering, and parallel processing.

### **Core Concepts:**
1. **The Physics of Pipes:** RAM Buffers vs. Disk I/O.
2. **Tool Efficiency:** Why `grep` is usually faster than `awk` for plain string searches.
3. **Parallelism:** Saturating CPU cores with `xargs -P` or Python `multiprocessing`.
4. **Process Substitution:** How named pipes work under the hood.

In [None]:
# Setup: Ensure directory for Day 2 scripts exists
import os
import sys

os.makedirs("lab", exist_ok=True)
print("✅ 'lab' directory created.")

--- 
## **1. The Physics of Pipes (RAM vs Disk)**

**Concept:** 
* **Disk-Based:** `cmd1 > file; cmd2 < file` (Writes to disk, reads from disk. Slow. High I/O.)
* **Stream-Based:** `cmd1 | cmd2` (Writes to Kernel Buffer in RAM. Fast. Zero Disk I/O for intermediates.)
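
To make the stream-based case concrete, here is a minimal sketch of `cmd1 | cmd2` built from Python, assuming two throwaway child processes started with `sys.executable -c`. The function name `pipeline_count` and the generated `ERROR`/`INFO` lines are hypothetical, chosen only for illustration. The producer's stdout is connected directly to the consumer's stdin through a kernel pipe, so no intermediate file is created.

```python
import subprocess
import sys


def pipeline_count(n_lines=1000):
    """Run `producer | consumer` and return the consumer's count of ERROR lines."""
    # Hypothetical producer: prints one line per record, every 10th is an ERROR.
    producer_code = (
        f"for i in range({n_lines}): "
        "print('ERROR' if i % 10 == 0 else 'INFO', i)"
    )
    # Hypothetical consumer: counts lines on stdin that start with ERROR.
    consumer_code = (
        "import sys; "
        "print(sum(1 for line in sys.stdin if line.startswith('ERROR')))"
    )

    producer = subprocess.Popen(
        [sys.executable, "-c", producer_code],
        stdout=subprocess.PIPE,
    )
    consumer = subprocess.Popen(
        [sys.executable, "-c", consumer_code],
        stdin=producer.stdout,
        stdout=subprocess.PIPE,
        text=True,
    )
    # Close our copy of the read end so the producer sees a broken pipe
    # if the consumer exits early.
    producer.stdout.close()
    out, _ = consumer.communicate()
    producer.wait()
    return int(out.strip())


if __name__ == "__main__":
    print("ERROR lines seen by consumer:", pipeline_count())
```

Both processes run concurrently: the consumer starts counting while the producer is still printing, which is the behavior the `|` operator gives you in the shell.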

**Task:** We will simulate a "Pipe Race". We'll write a script that processes data using intermediate files vs. pipes and measures the time difference.
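
The lab script itself is not reproduced in this handout. The following is a hypothetical stand-in, not the actual `lab/pipe_race.py`, sketching one way such a race could be set up. It assumes a synthetic log file, a filter stage that keeps `ERROR` lines, and a counting stage. The disk-based variant writes the filtered lines to an intermediate file and reads them back; the stream-based variant passes them through a generator in memory. All function names are illustrative, and timings will vary with hardware and filesystem caching.

```python
import os
import tempfile
import time


def make_input(path, n_lines=200_000):
    """Write a synthetic log file (hypothetical data, for illustration only)."""
    with open(path, "w") as f:
        for i in range(n_lines):
            level = "ERROR" if i % 10 == 0 else "INFO"
            f.write(f"{i} {level} request handled\n")


def disk_based(src, tmp):
    """Stage 1 writes matches to an intermediate file; stage 2 re-reads it."""
    with open(src) as fin, open(tmp, "w") as fout:
        for line in fin:
            if "ERROR" in line:
                fout.write(line)
    with open(tmp) as fin:
        return sum(1 for _ in fin)


def stream_based(src):
    """Stage 1 is a generator; lines reach stage 2 in memory, like `cmd1 | cmd2`."""
    with open(src) as fin:
        errors = (line for line in fin if "ERROR" in line)
        return sum(1 for _ in errors)


def race(n_lines=200_000):
    """Time both variants on the same input and return counts and timings."""
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "input.log")
        tmp = os.path.join(d, "intermediate.log")
        make_input(src, n_lines)

        t0 = time.perf_counter()
        disk_count = disk_based(src, tmp)
        disk_time = time.perf_counter() - t0

        t0 = time.perf_counter()
        stream_count = stream_based(src)
        stream_time = time.perf_counter() - t0

    return disk_count, stream_count, disk_time, stream_time


if __name__ == "__main__":
    disk_count, stream_count, disk_time, stream_time = race()
    print(f"Disk-based:   {disk_count} matches in {disk_time:.4f}s")
    print(f"Stream-based: {stream_count} matches in {stream_time:.4f}s")
```

Both variants must report the same match count; only the route the intermediate data takes differs.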

In [9]:
!python lab/pipe_race.py

Input file ready.

[1] Disk-Based (Intermediate File)...
    Time: 0.6603s

[2] Stream-Based (Pipe logic)...
    Time: 0.1925s

 Speedup: 3.43x faster using Stream Logic


--- 
## **2. Tool Mechanics: grep vs awk**

**Concept:** 
* **`grep`:** Optimized C engine (often using the Boyer-Moore algorithm for fixed strings). It can skip bytes and doesn't split lines into fields. **Fast.**
* **`awk`:** Interpreted language. It is designed to split each line into fields (`$1`, `$2`...), whether or not you use them. **Slower overhead.**

**Task:** We will simulate a search for a specific string in a large dataset using both approaches to measure the parsing penalty.
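
As a rough illustration of the two strategies (a hypothetical sketch, not the actual `lab/tool_race.py`), the code below compares a plain substring scan with a split-then-compare approach on synthetic lines. The line layout, the function names, and the choice of the second field as the match target are all assumptions made for this example. It simulates the parsing penalty in Python; it does not measure the real `grep` or `awk` binaries.

```python
import time


def make_lines(n=200_000):
    """Generate synthetic records; every 50th one carries the ERROR level."""
    return [
        f"{i} {'ERROR' if i % 50 == 0 else 'INFO'} user{i} request"
        for i in range(n)
    ]


def grep_style(lines, needle):
    """Substring scan: never splits the line into fields."""
    return sum(1 for line in lines if needle in line)


def awk_style(lines, needle):
    """Field parsing: split every line, then compare one field (like `$2 == needle`)."""
    count = 0
    for line in lines:
        fields = line.split()
        if fields[1] == needle:
            count += 1
    return count


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    t0 = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - t0


if __name__ == "__main__":
    lines = make_lines()
    g_count, g_time = timed(grep_style, lines, "ERROR")
    a_count, a_time = timed(awk_style, lines, "ERROR")
    print(f"Grep-style: {g_count} matches in {g_time:.4f}s")
    print(f"Awk-style:  {a_count} matches in {a_time:.4f}s")
```

The two functions return the same count here because `ERROR` only ever appears in the second field; with real data, a substring scan can also match inside other fields.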

In [13]:
!python lab/tool_race.py

Generating data...
Running Grep Simulation (Substring Check)...
  Time: 0.4649s
Running Awk Simulation (Field Parsing)...
  Time: 1.0644s

>>> Result: Grep-style (Scanning) was 2.3x faster than Awk-style (Parsing).


--- 
## **3. Parallelism (xargs / Multiprocessing)**

**Concept:** Standard tools (`grep`, `gzip`) are single-threaded. To use all CPU cores on a modern server, we must parallelize. 
* **Shell:** `xargs -P`
* **Python:** `multiprocessing`

**Task:** Simulate a CPU-heavy task (e.g., calculating sum of squares) and run it sequentially vs. parallel to see the speedup.
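
A minimal sketch of such a benchmark follows, assuming a sum-of-squares workload and a pool of 4 worker processes. It is a hypothetical stand-in, not the actual `lab/parallel_bench.py`; the task size and function names are illustrative. `multiprocessing.Pool.map` distributes the tasks across worker processes, which is similar in spirit to `xargs -P 4` in the shell.

```python
import time
from multiprocessing import Pool


def sum_of_squares(n):
    """CPU-bound work: sum of i*i for i in range(n)."""
    return sum(i * i for i in range(n))


def run_sequential(tasks):
    """Process every task one after another in the current process."""
    return [sum_of_squares(n) for n in tasks]


def run_parallel(tasks, workers=4):
    """Fan tasks out to worker processes; results come back in input order."""
    with Pool(processes=workers) as pool:
        return pool.map(sum_of_squares, tasks)


if __name__ == "__main__":
    tasks = [500_000] * 8  # 8 identical CPU-heavy tasks (size is arbitrary)

    t0 = time.perf_counter()
    seq = run_sequential(tasks)
    seq_time = time.perf_counter() - t0

    t0 = time.perf_counter()
    par = run_parallel(tasks, workers=4)
    par_time = time.perf_counter() - t0

    assert seq == par  # same answers, different wall-clock time
    print(f"Sequential Time: {seq_time:.4f}s")
    print(f"Parallel (4 workers) Time: {par_time:.4f}s")
```

The speedup is typically below the worker count, because starting processes and moving arguments and results between them adds overhead, and the machine may have fewer free cores than workers.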

In [16]:
!python lab/parallel_bench.py

--- Processing 8 CPU-heavy tasks ---
Sequential Time: 2.3733s
Parallel (4 workers) Time: 1.6605s

>>> Speedup: 1.43x


## **🧹 Cleanup**
Removing temporary lab scripts.

In [None]:
import os
import shutil

# Remove the lab directory and everything inside it
if os.path.exists("lab"):
    shutil.rmtree("lab")
    print("Cleanup complete.")