# 070: Edge AI & TinyML - On-Device Inference

## 📘 Complete Guide to Deploying ML on Microcontrollers & Edge Devices

---

## 🎯 Learning Objectives

By the end of this notebook, you will:

1. **Understand Edge AI fundamentals**: Why run models on-device (latency, privacy, cost)
2. **Master TinyML**: Deploy ML models on microcontrollers (<1MB memory)
3. **Implement model optimization**: Quantization (INT8, INT4), pruning, knowledge distillation
4. **Deploy production systems**: TensorFlow Lite, ONNX Runtime, TensorRT, Arduino, ESP32
5. **Build real-world projects**: Keyword spotting, gesture recognition, anomaly detection ($50M-$150M/year value)

---

## 📊 What is Edge AI & TinyML?

### Edge AI
**Definition**: Running machine learning inference on edge devices (smartphones, IoT sensors, embedded systems) instead of cloud servers

**Key characteristics:**
- **Local inference**: No internet required (works offline)
- **Low latency**: <10ms response time (vs 100-500ms cloud)
- **Privacy**: Data never leaves device (GDPR/HIPAA compliant)
- **Cost efficiency**: No cloud API fees ($0.001-$0.01 per inference)

### TinyML
**Definition**: Subset of Edge AI focused on ultra-constrained devices (microcontrollers with <1MB RAM and clock speeds in the tens to hundreds of MHz)

**Target devices:**
- **Microcontrollers**: Arduino Nano 33 BLE (256KB RAM), ESP32 (520KB RAM), STM32 (64-512KB RAM)
- **AI accelerators**: Google Coral Edge TPU, NVIDIA Jetson Nano, Intel Movidius
- **Wearables**: Smartwatches, fitness trackers, hearing aids
- **IoT sensors**: Smart home, industrial sensors, environmental monitoring

**Constraints:**
- **Memory**: <1MB RAM (model must fit entirely in SRAM)
- **Compute**: <100 MFLOPS (vs 10 TFLOPS GPU)
- **Power**: <10mW (battery-powered, must last months/years)
- **Storage**: <1MB flash (model + firmware)

---

## 🚀 Why Edge AI Matters: The Deployment Problem

### The Cloud Inference Problem

**Example: Image classification API**
- **Input**: 224×224×3 RGB image = 150KB
- **Upload time**: 150KB ÷ 10Mbps (typical 4G) = 120ms
- **Inference time**: 50ms (cloud GPU)
- **Download time**: 1KB result ÷ 10Mbps = 1ms
- **Total latency**: 120 + 50 + 1 = **171ms** ❌

**Problems:**
1. **Latency**: 171ms too slow for real-time (autonomous vehicles need <10ms)
2. **Privacy**: User uploads personal images to cloud (GDPR violation risk)
3. **Cost**: $0.001 per inference × 1B inferences/month = $1M/month
4. **Reliability**: Requires internet connection (fails offline)

### The Edge AI Solution

**On-device inference:**
- **No upload**: 0ms (data stays on device)
- **Local inference**: 10ms (optimized model)
- **No download**: 0ms (result computed locally)
- **Total latency**: **10ms** ✅ (17× faster)

**Benefits:**
1. **Latency**: 10ms (real-time capable)
2. **Privacy**: Data never leaves device (GDPR compliant)
3. **Cost**: $0 per inference after deployment (no API fees)
4. **Reliability**: Works offline (no internet required)
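The latency comparison above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the helper name `transfer_ms` is hypothetical, and the payload sizes, link speed, and inference times are simply the example figures from the text (1KB is treated as 1,000 bytes).

```python
def transfer_ms(size_bytes: float, link_mbps: float) -> float:
    """Milliseconds needed to move size_bytes over a link of link_mbps megabits/s."""
    bits = size_bytes * 8
    return bits / (link_mbps * 1e6) * 1e3

# Cloud path: upload the image, run inference remotely, download the result
upload_ms = transfer_ms(150_000, 10)     # 150KB image over 10Mbps
download_ms = transfer_ms(1_000, 10)     # 1KB result over 10Mbps
cloud_total_ms = upload_ms + 50 + download_ms

# Edge path: no transfer, only local inference
edge_total_ms = 10.0
speedup = cloud_total_ms / edge_total_ms

print(f"cloud: {cloud_total_ms:.1f} ms, edge: {edge_total_ms:.1f} ms, speedup: {speedup:.1f}x")
```

Changing `link_mbps` shows how quickly the upload term dominates on slower links.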

---

## 💰 Business Value: $50M-$150M/year

Edge AI unlocks massive business value across three dimensions:

### Use Case 1: Smart Home Voice Assistant ($15M-$40M/year)

**Problem**: Cloud-based voice assistants (Alexa, Google Home)
- **Latency**: 300-500ms ("always listening" but slow response)
- **Privacy**: All audio sent to cloud (user concern, GDPR issues)
- **Cost**: $0.005 per query × 10M users × 100 queries/month = $5M/month = $60M/year
- **Dependency**: Requires internet (fails during outages)

**Edge AI Solution**: On-device keyword spotting + local command processing
- **Latency**: 10-20ms ("instant" response)
- **Privacy**: Audio processed locally (only wake word triggers upload)
- **Cost**: $0/query after deployment (savings: $60M/year)
- **Reliability**: Works offline (critical for security cameras, door locks)

**Implementation:**
- **Wake word detection**: TinyML model on microcontroller (10KB model, <5mW power)
- **Command classification**: Edge model on device (100KB model, 50ms inference)
- **Only upload if needed**: Complex queries go to cloud (90% handled locally)

**Business metrics:**
- **Cost savings**: $60M/year → $5M/year infrastructure = **$55M/year saved**
- **User satisfaction**: NPS +12 (privacy + speed)
- **Market differentiation**: "Privacy-first assistant" (vs cloud competitors)
- **Total value**: **$15M-$40M/year** (10M users)

---

### Use Case 2: Manufacturing Defect Detection ($20M-$60M/year)

**Problem**: Cloud-based visual inspection
- **Latency**: 200ms (too slow for production line, 60 items/min max)
- **Privacy**: Cannot send proprietary product images to cloud (trade secrets)
- **Cost**: $0.01 per inspection × 1M inspections/day = $10K/day = $3.6M/year
- **Bandwidth**: 500KB per image × 1M images/day = 500GB/day = $45/day = $16K/year

**Edge AI Solution**: On-device defect detection (camera + edge device)
- **Latency**: 10ms (600 items/min, 10× faster throughput)
- **Privacy**: Images processed locally (no upload, trade secrets protected)
- **Cost**: $0 per inspection (no API fees)
- **Bandwidth**: 0 (no uploads)

**Implementation:**
- **Edge device**: NVIDIA Jetson Nano ($99, 128 CUDA cores)
- **Model**: MobileNetV2 + INT8 quantization (5MB model, 10ms inference)
- **Deployment**: 100 production lines × $99 = $9,900 one-time cost

**Business metrics:**
- **Throughput increase**: 60 → 600 items/min (10× faster)
- **Revenue impact**: $1M/year per line × 10× throughput = $10M/year (or avoid $10M capex for 10× more lines)
- **Cost savings**: $3.6M/year API fees → $0
- **Privacy value**: Priceless (trade secret protection)
- **Conservative value per factory**: **$5M-$15M/year**
- **Total (4 factories)**: **$20M-$60M/year**

---

### Use Case 3: Wearable Health Monitoring ($15M-$50M/year)

**Problem**: Cloud-based health monitoring (Apple Watch, Fitbit)
- **Latency**: 500ms (ECG → cloud → analysis → alert)
- **Privacy**: Sensitive health data uploaded to cloud (HIPAA concerns)
- **Cost**: $0.001 per heartbeat × 100 beats/min × 10M users × 60 min/hr × 24 hr/day = $1.44B/day ❌ (impossible)
- **Battery**: Constant uploads drain battery (24-hour lifespan vs 7-day goal)

**Edge AI Solution**: On-device health analytics
- **Latency**: <10ms (real-time arrhythmia detection)
- **Privacy**: Health data stays on device (HIPAA compliant)
- **Cost**: $0 per inference (no uploads)
- **Battery**: 7-day lifespan (only upload alerts, not raw data)

**Implementation:**
- **TinyML model**: Arrhythmia detection (50KB model, 5ms inference, <1mW power)
- **Microcontroller**: ARM Cortex-M4 (built into smartwatch)
- **Alert mechanism**: Only upload if anomaly detected (99.9% filtered locally)

**Business metrics:**
- **Cost avoidance**: $1.44B/day → $0 (cloud inference impossible, edge AI enables feature)
- **Battery life**: 24hr → 7 days (7× improvement)
- **Privacy**: HIPAA compliant (vs regulatory risk)
- **Market differentiation**: "Medical-grade on-device monitoring"
- **Revenue**: 10M users √ó $5/month subscription = $50M/month = **$600M/year** (aspirational)
- **Conservative capture**: 2.5-8% of revenue attributed to edge AI = **$15M-$50M/year**

---

### Total Business Value Summary

| Use Case | Annual Value | Key Metric | Deployment |
|----------|--------------|------------|------------|
| Smart Home Voice (10M users) | $15M-$40M | $55M cost savings | Microcontroller (10KB model) |
| Manufacturing Defect (4 factories) | $20M-$60M | 10× throughput | Jetson Nano (5MB model) |
| Wearable Health (10M users) | $15M-$50M | 7-day battery | ARM Cortex-M4 (50KB model) |
| **Total** | **$50M-$150M** | Latency + Privacy + Cost | Edge/TinyML |

**Conservative midpoint**: **$100M/year** across edge AI applications

---

## 🔄 Edge AI Workflow: From Cloud Model to Microcontroller

```mermaid
graph TD
    A[Train Large Model on Cloud<br/>ResNet-50, 98.5% accuracy, 25MB] --> B[Model Optimization<br/>Quantization + Pruning + Distillation]
    B --> C[Compressed Model<br/>MobileNetV2, 96.2% accuracy, 5MB]
    C --> D[Convert to Edge Format<br/>TensorFlow Lite / ONNX / TensorRT]
    D --> E[Deploy to Edge Device<br/>Smartphone / Jetson / Microcontroller]
    E --> F[On-Device Inference<br/>10ms latency, 0 cost, privacy-preserving]
    
    style A fill:#ffcccc
    style C fill:#ccffcc
    style F fill:#ccccff
```

**Key steps:**
1. **Train large model**: ResNet-50, 25MB, 98.5% accuracy (cloud GPU)
2. **Optimize model**: Quantize (INT8) + Prune (50%) + Distill → 5MB, 96.2% accuracy
3. **Convert to edge format**: TensorFlow Lite (.tflite), ONNX (.onnx), TensorRT (.engine)
4. **Deploy to device**: Flash to microcontroller or install on mobile app
5. **Inference**: 10ms latency, no internet, privacy-preserving

---

## 📐 Edge AI Architecture Spectrum

```mermaid
graph LR
    A[Cloud Only<br/>500ms latency<br/>$1M/month cost<br/>Privacy risk] --> B[Hybrid<br/>Edge filtering + Cloud fallback<br/>50ms latency<br/>$100K/month]
    B --> C[Edge Only<br/>Smartphone/Jetson<br/>10ms latency<br/>$1K/month]
    C --> D[TinyML<br/>Microcontroller<br/>5ms latency<br/>$0/month]
    
    style A fill:#ffcccc
    style D fill:#ccffcc
```

**Device spectrum:**
1. **Cloud only**: All inference on server (500ms, $1M/month, privacy risk)
2. **Hybrid**: Edge filtering + cloud fallback (50ms, $100K/month, 90% local)
3. **Edge only**: Smartphone/Jetson (10ms, $1K/month, full offline)
4. **TinyML**: Microcontroller (5ms, $0/month, ultra-low power)

---

## 🎯 When to Use Edge AI vs Cloud

### Use Edge AI when:
- ✅ **Latency critical**: <50ms required (autonomous vehicles, robotics)
- ✅ **Privacy sensitive**: Health data, personal photos, voice recordings
- ✅ **Offline capability**: No internet (remote sensors, submarines, aircraft)
- ✅ **High volume**: >1M inferences/day (cost prohibitive in cloud)
- ✅ **Bandwidth constrained**: Cannot upload large data (video, images)

### Use Cloud when:
- ✅ **Complex models**: >100MB models (GPT-4, DALL-E, large ensembles)
- ✅ **Low volume**: <1K inferences/day (cloud cheaper than edge deployment)
- ✅ **Continuous learning**: Model updates daily (edge deployment friction)
- ✅ **Heterogeneous devices**: Many device types (cloud API simpler)

### Hybrid approach (best of both):
- **Edge**: Wake word detection, face detection, sensor anomaly filtering (99% of data)
- **Cloud**: Complex analysis, speech recognition, image captioning (1% of data)

---

## 🛠️ Edge AI Technology Stack

### Model Optimization Techniques

| Technique | Compression | Accuracy Loss | Compute Speedup |
|-----------|-------------|---------------|-----------------|
| **Quantization (INT8)** | 4× | 0.5-2% | 2-4× |
| **Pruning (structured)** | 2-5× | 1-3% | 1.5-3× |
| **Knowledge Distillation** | 10-100× | 2-5% | 5-20× |
| **Neural Architecture Search** | 5-20× | 0-2% | 3-10× |
| **Combined (all above)** | 50-200× | 3-8% | 10-50× |

### Deployment Frameworks

| Framework | Target Devices | Model Size | Use Case |
|-----------|----------------|------------|----------|
| **TensorFlow Lite** | Android/iOS/Linux | 1MB-100MB | Mobile apps, edge devices |
| **TensorFlow Lite Micro** | Microcontrollers | 10KB-500KB | TinyML, ultra-low power |
| **ONNX Runtime** | Cross-platform | 1MB-1GB | Server, edge, mobile |
| **TensorRT** | NVIDIA GPUs | 1MB-10GB | Jetson, autonomous vehicles |
| **Core ML** | Apple devices | 1MB-100MB | iPhone, iPad, Mac |
| **OpenVINO** | Intel CPUs/VPUs | 1MB-1GB | Intel NUC, RealSense |

### Edge Hardware Comparison

| Device | RAM | Compute | Power | Cost | Use Case |
|--------|-----|---------|-------|------|----------|
| **Arduino Nano 33 BLE** | 256KB | 64 MHz | 5mW | $25 | TinyML, keyword spotting |
| **ESP32** | 520KB | 240 MHz | 10mW | $5 | IoT sensors, gesture recognition |
| **STM32H7** | 1MB | 480 MHz | 20mW | $15 | Industrial sensors, audio processing |
| **Raspberry Pi 4** | 4GB | 1.5 GHz | 3W | $55 | Edge server, multi-model inference |
| **Google Coral Edge TPU** | Host RAM (USB accelerator) | 4 TOPS | 2W | $75 | Vision applications, 400 FPS |
| **NVIDIA Jetson Nano** | 4GB | 472 GFLOPS | 10W | $99 | Autonomous robots, manufacturing |
| **NVIDIA Jetson Xavier NX** | 8GB | 21 TOPS | 15W | $399 | Autonomous vehicles, drones |
| **iPhone 14 Pro (A16 Bionic)** | 6GB | 17 TOPS | 5W | $999 | Mobile AI, AR/VR |

---

## 📚 Historical Context: The Edge AI Revolution

### 2010-2015: Cloud-Only Era
- All ML inference in cloud (AlexNet, VGG, ResNet on server GPUs)
- Mobile devices only captured data and displayed results
- Latency: 500ms, Cost: $0.01/inference

### 2016-2017: Mobile AI Emerges
- **SqueezeNet (2016)**: 50× fewer parameters than AlexNet at comparable accuracy
- **MobileNets (2017)**: Designed for mobile devices (4.2M parameters)
- **Core ML (2017)**: Apple enables on-device ML on iPhone

### 2018-2019: Edge AI Accelerates
- **TensorFlow Lite (2018)**: Google's mobile/edge ML framework
- **EfficientNet (2019)**: State-of-the-art efficiency (10× better than ResNet)
- **NVIDIA Jetson Nano (2019)**: $99 edge AI computer (472 GFLOPS)

### 2019-2020: TinyML Born
- **TensorFlow Lite Micro (2019)**: ML on microcontrollers (<1MB RAM)
- **TinyML Summit (2019)**: First conference on ultra-low-power ML
- **Google Coral (2019)**: Edge TPU ($75, 4 TOPS)

### 2021-Present: Edge AI Mainstream
- **Apple Neural Engine**: 15.8 TOPS on iPhone 13 (2021)
- **Qualcomm AI Engine**: 15 TOPS on Snapdragon 8 Gen 2 (2022)
- **Edge AI market**: $15B (2023) → $75B projected (2028)
- **TinyML deployments**: 1B+ devices worldwide

---

## 🎓 Learning Path Context

**Where we are:**
- **Completed**: 066 Attention → 067 NAS → 068 Model Compression → 069 Federated Learning
- **Current**: 070 Edge AI & TinyML (practical deployment)
- **Next**: 071 Transformers & BERT (large language models)

**Why Edge AI matters:**
- **Practical deployment**: Models useless if too large/slow for production
- **Real-world constraints**: Many ML applications must run under edge constraints (model size, latency, power)
- **Business value**: $50M-$150M/year from latency + privacy + cost savings

---

## 🔍 What Makes Edge AI Challenging?

### Challenge 1: Model Size Constraints
- **Cloud**: 25MB ResNet-50 (no problem)
- **Mobile**: 5MB MobileNetV2 (acceptable)
- **Microcontroller**: 50KB model (must fit in SRAM) ❌

**Solution**: Aggressive compression (quantization + pruning + distillation)

### Challenge 2: Compute Constraints
- **Cloud GPU**: 10 TFLOPS (bfloat16), 50ms inference
- **Mobile CPU**: 100 GFLOPS (float32), 100ms inference
- **Microcontroller**: 10 MFLOPS (int8), 10ms inference ❌

**Solution**: Optimize operations (depthwise conv, skip connections, INT8 quantization)

### Challenge 3: Power Constraints
- **Edge server**: 300W (plugged in, no constraint)
- **Mobile phone**: 5W (battery 1 day, acceptable)
- **Wearable**: 10mW (battery 1 week, critical) ❌

**Solution**: Ultra-low-power operations (sparse inference, event-driven, duty cycling)

### Challenge 4: Memory Constraints
- **Cloud**: 32GB GPU memory (no constraint)
- **Mobile**: 4GB RAM (acceptable)
- **Microcontroller**: 256KB SRAM (model + activations + stack must fit) ❌

**Solution**: In-place operations, memory reuse, streaming inference

---

## 🎯 Key Questions This Notebook Answers

1. **How to compress 25MB model → 50KB model** (500× compression, <5% accuracy loss)
2. **How to quantize float32 → INT8** (4× smaller, 2-4× faster, <2% accuracy loss)
3. **How to deploy to TensorFlow Lite** (mobile/edge devices)
4. **How to deploy to TensorFlow Lite Micro** (microcontrollers, <1MB RAM)
5. **How to deploy to NVIDIA Jetson** (edge AI computer, 472 GFLOPS)
6. **When to use Edge AI vs Cloud** (latency, privacy, cost trade-offs)
7. **How to build production edge AI systems** (8 real-world projects, $50M-$150M/year)

---

## 📖 Notebook Structure

1. **Introduction** (this cell): Why Edge AI, business value, historical context
2. **Mathematical Foundations**: Quantization theory, pruning theory, distillation theory, efficiency metrics
3. **Implementation**: TensorFlow Lite, TFLite Micro, Arduino deployment, Jetson deployment
4. **Production Projects**: 8 real-world projects (smart home, manufacturing, wearable, autonomous vehicles, etc.)

---

## 🚀 Let's Build Edge AI Systems!

In the next cells, we'll:
1. **Derive the math**: Quantization (symmetric, asymmetric), pruning (magnitude, structured), distillation (Hinton 2015)
2. **Implement from scratch**: INT8 quantization, structured pruning, KD loss
3. **Deploy to production**: TensorFlow Lite (mobile), TFLite Micro (Arduino), TensorRT (Jetson)
4. **Build 8 projects**: Smart home voice ($15M-$40M), manufacturing defect ($20M-$60M), wearable health ($15M-$50M), etc.

**Total business value**: $50M-$150M/year from edge AI deployment

Ready? Let's make ML models run anywhere! 🚀🔧📱

---

**Learning Progression:**
- **Previous**: 069 Federated Learning (Privacy-Preserving Distributed ML)
- **Current**: 070 Edge AI & TinyML (On-Device Inference)
- **Next**: 071 Transformers & BERT (Self-Attention, Pre-training, Transfer Learning)

---

✅ **Introduction complete! Next: Mathematical foundations of model compression for edge deployment.**

# 📐 Mathematical Foundations: Model Optimization for Edge Deployment

---

## Overview

To deploy ML models on edge devices (smartphones, microcontrollers, IoT sensors), we need **3 fundamental techniques**:

1. **Quantization**: Convert float32 (32 bits) → INT8 (8 bits) = 4× smaller, 2-4× faster
2. **Pruning**: Remove unimportant weights (50-90% sparsity) = 2-10× smaller
3. **Knowledge Distillation**: Train small "student" model to mimic large "teacher" = 10-100× smaller

**Combined effect**: 25MB ResNet-50 → 50KB MobileNetV2 (500× compression, <5% accuracy loss)

---

# 1️⃣ Quantization: Float32 → INT8

## What is Quantization?

**Definition**: Map high-precision floating-point values to low-precision integers

**Motivation**:
- **Memory**: float32 (4 bytes) → INT8 (1 byte) = **4× smaller**
- **Compute**: INT8 operations 2-4× faster than float32 (hardware optimized)
- **Power**: INT8 uses 5-10× less energy than float32

**Trade-off**: Slight accuracy loss (typically 0.5-2%)

---

## 1.1 Symmetric Quantization

**Idea**: Map float range [-α, α] to INT8 range [-127, 127]

### Formula

$$
q = \text{round}\left(\frac{r}{s}\right)
$$

Where:
- $r$ = real-valued float32 number
- $q$ = quantized INT8 integer
- $s$ = scale factor = $\frac{\alpha}{127}$

### Dequantization (INT8 → float32)

$$
r = s \cdot q
$$

### Example

**Quantize weight matrix:**

Original weights (float32):
$$
W = \begin{bmatrix}
0.8 & -0.5 & 0.3 \\
-0.2 & 0.9 & -0.7
\end{bmatrix}
$$

**Step 1**: Find maximum absolute value
$$
\alpha = \max(|W|) = 0.9
$$

**Step 2**: Calculate scale factor
$$
s = \frac{\alpha}{127} = \frac{0.9}{127} = 0.00709
$$

**Step 3**: Quantize each weight
$$
q_{ij} = \text{round}\left(\frac{w_{ij}}{s}\right)
$$

$$
q_{11} = \text{round}\left(\frac{0.8}{0.00709}\right) = \text{round}(112.8) = 113
$$

$$
q_{12} = \text{round}\left(\frac{-0.5}{0.00709}\right) = \text{round}(-70.5) = -71
$$

$$
q_{13} = \text{round}\left(\frac{0.3}{0.00709}\right) = \text{round}(42.3) = 42
$$

Similarly for second row:

$$
Q = \begin{bmatrix}
113 & -71 & 42 \\
-28 & 127 & -99
\end{bmatrix} \quad \text{(INT8)}
$$

**Dequantization** (to verify):
$$
w_{11} = 0.00709 \times 113 = 0.801 \approx 0.8 \quad ✅
$$

**Quantization error**:
$$
\text{Error} = |0.801 - 0.8| = 0.001 \quad \text{(0.125% relative error)}
$$
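The worked example above can be checked numerically. This is a minimal NumPy sketch of symmetric quantization; the function names `quantize_symmetric` and `dequantize` are illustrative, not part of any library.

```python
import numpy as np

def quantize_symmetric(w, num_bits=8):
    """Map floats in [-alpha, alpha] to integers in [-qmax, qmax]."""
    qmax = 2 ** (num_bits - 1) - 1           # 127 for INT8
    alpha = np.max(np.abs(w))                # largest absolute value
    scale = alpha / qmax                     # s = alpha / 127
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values: r = s * q."""
    return q.astype(np.float32) * scale

W = np.array([[0.8, -0.5, 0.3],
              [-0.2, 0.9, -0.7]], dtype=np.float32)

Q, s = quantize_symmetric(W)
W_hat = dequantize(Q, s)
max_err = np.max(np.abs(W - W_hat))

print(Q)
print(f"scale = {s:.5f}, max abs error = {max_err:.5f}")
```

The maximum reconstruction error stays below half a quantization step, `s / 2`.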

---

## 1.2 Asymmetric Quantization

**Motivation**: Symmetric quantization wastes range if distribution is not centered at 0

**Example**: ReLU activations (always non-negative)
- **Symmetric**: Maps [-1.0, 1.0] to [-127, 127], so values in [0, 1.0] only reach [0, 127] → Wastes negative range ❌
- **Asymmetric**: Maps [0, 1.0] to [0, 255] → Uses full range ✅

### Formula

$$
q = \text{round}\left(\frac{r}{s}\right) + z
$$

Where:
- $s$ = scale factor = $\frac{r_{\max} - r_{\min}}{q_{\max} - q_{\min}}$
- $z$ = zero-point (INT8 integer representing 0.0)

For UINT8: $q_{\min} = 0$, $q_{\max} = 255$

### Dequantization

$$
r = s \cdot (q - z)
$$

### Example

**Quantize ReLU activations (always ≥ 0):**

$$
A = [0.0, 0.3, 0.6, 0.9, 1.2]
$$

**Step 1**: Find range
$$
r_{\min} = 0.0, \quad r_{\max} = 1.2
$$

**Step 2**: Calculate scale and zero-point
$$
s = \frac{1.2 - 0.0}{255 - 0} = \frac{1.2}{255} = 0.00471
$$

$$
z = \text{round}\left(-\frac{r_{\min}}{s}\right) = \text{round}\left(-\frac{0.0}{0.00471}\right) = 0
$$

**Step 3**: Quantize
$$
q_i = \text{round}\left(\frac{a_i}{s}\right) + z
$$

$$
q_1 = \text{round}\left(\frac{0.0}{0.00471}\right) + 0 = 0
$$

$$
q_2 = \text{round}\left(\frac{0.3}{0.00471}\right) + 0 = 64
$$

$$
q_3 = \text{round}\left(\frac{0.6}{0.00471}\right) + 0 = 127
$$

$$
q_4 = \text{round}\left(\frac{0.9}{0.00471}\right) + 0 = 191
$$

$$
q_5 = \text{round}\left(\frac{1.2}{0.00471}\right) + 0 = 255
$$

$$
Q = [0, 64, 127, 191, 255] \quad \text{(UINT8)}
$$

**Full range utilized**: 0-255 ✅ (vs symmetric would only use 0-127)
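A small NumPy sketch of the asymmetric scheme, assuming a UINT8 target range. The function names are illustrative. One caveat: with the exact scale 1.2/255, the value 0.6 sits on a rounding boundary (127.5), so the third entry can come out as 127 or 128 depending on floating-point rounding.

```python
import numpy as np

def quantize_asymmetric(r, qmin=0, qmax=255):
    """Affine quantization: q = round(r / s) + z, clipped to [qmin, qmax]."""
    r_min, r_max = float(np.min(r)), float(np.max(r))
    scale = (r_max - r_min) / (qmax - qmin)
    zero_point = int(round(qmin - r_min / scale))   # integer that represents 0.0
    q = np.clip(np.round(r / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_asymmetric(q, scale, zero_point):
    """Recover approximate floats: r = s * (q - z)."""
    return scale * (q.astype(np.float64) - zero_point)

A = np.array([0.0, 0.3, 0.6, 0.9, 1.2])
Q, s, z = quantize_asymmetric(A)
A_hat = dequantize_asymmetric(Q, s, z)

print(Q, f"scale = {s:.5f}, zero_point = {z}")
print(np.round(A_hat, 4))
```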

---

## 1.3 Per-Channel Quantization

**Problem**: Different channels have different value ranges

**Example**: Conv layer with 3 output channels

$$
W_1 \in [-0.1, 0.1], \quad W_2 \in [-0.5, 0.5], \quad W_3 \in [-1.0, 1.0]
$$

**Per-tensor quantization** (single scale for all channels):
$$
s = \frac{1.0}{127} = 0.00787
$$

Channel 1 uses only [-13, 13] out of [-127, 127] → **Wastes 90% of range** ❌

**Per-channel quantization** (separate scale per channel):
$$
s_1 = \frac{0.1}{127} = 0.00079, \quad s_2 = \frac{0.5}{127} = 0.00394, \quad s_3 = \frac{1.0}{127} = 0.00787
$$

Each channel uses full [-127, 127] range → **10× better precision** ✅

### Formula

For output channel $c$:
$$
q_{i,c} = \text{round}\left(\frac{w_{i,c}}{s_c}\right)
$$

Where:
$$
s_c = \frac{\max(|W_c|)}{127}
$$

**Accuracy improvement**: Per-channel quantization typically recovers 1-2% accuracy vs per-tensor
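To see the effect on reconstruction error, the sketch below quantizes three synthetic channels with the ranges from the example, once with a single per-tensor scale and once with per-channel scales. The random weights and the helper `fake_quant` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three output channels with ranges of roughly ±0.1, ±0.5, ±1.0
ranges = np.array([0.1, 0.5, 1.0])
W = rng.uniform(-1.0, 1.0, size=(3, 1000)) * ranges[:, None]

def fake_quant(w, scale):
    """Quantize to INT8 levels and immediately dequantize."""
    return scale * np.clip(np.round(w / scale), -127, 127)

# Per-tensor: one scale shared by all channels
s_tensor = np.max(np.abs(W)) / 127
err_tensor = np.abs(W - fake_quant(W, s_tensor)).mean(axis=1)

# Per-channel: one scale per output channel (row)
s_channel = np.max(np.abs(W), axis=1, keepdims=True) / 127
err_channel = np.abs(W - fake_quant(W, s_channel)).mean(axis=1)

for c in range(3):
    print(f"channel {c + 1}: per-tensor err = {err_tensor[c]:.6f}, "
          f"per-channel err = {err_channel[c]:.6f}")
```

The narrow channel benefits most, while the widest channel is unchanged because it already sets the per-tensor scale.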

---

## 1.4 Quantization-Aware Training (QAT)

**Problem**: Post-training quantization (PTQ) loses 1-3% accuracy

**Solution**: Train model with quantization simulation (fake quantization)

### Fake Quantization

During forward pass, simulate quantization:
$$
w_{\text{fake}} = s \cdot \text{round}\left(\frac{w}{s}\right)
$$

During backward pass, use straight-through estimator (STE):
$$
\frac{\partial L}{\partial w} \approx \frac{\partial L}{\partial w_{\text{fake}}}
$$

**Effect**: Model learns to be robust to quantization noise

### Algorithm

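```python
# Minimal runnable QAT sketch in PyTorch (toy linear layer, random data)
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quantize(w, num_levels=127):
    """Symmetric INT8 fake quantization with a straight-through estimator."""
    s = w.detach().abs().max().clamp(min=1e-8) / num_levels
    w_q = s * torch.clamp(torch.round(w / s), -num_levels, num_levels)
    # STE: forward pass uses w_q, backward pass treats round() as identity
    return w + (w_q - w).detach()

torch.manual_seed(0)
layer = nn.Linear(16, 4)  # holds the float weights that are actually updated
optimizer = torch.optim.SGD(layer.parameters(), lr=0.1)
inputs, targets = torch.randn(64, 16), torch.randint(0, 4, (64,))

for epoch in range(5):
    # Forward pass with fake-quantized weights
    output = F.linear(inputs, fake_quantize(layer.weight), layer.bias)
    loss = F.cross_entropy(output, targets)

    # Backward pass: gradients reach the float weights through the STE
    optimizer.zero_grad()
    loss.backward()

    # Update float weights
    optimizer.step()
    print(f"epoch {epoch}: loss = {loss.item():.4f}")
```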

**Accuracy improvement**: QAT typically recovers 0.5-1.5% accuracy vs PTQ

---

## 1.5 Quantization for Different Layers

### Convolution Layer

**Operation**: $Y = W * X + b$

**Quantization**:
- **Weights**: INT8 symmetric, per-channel
- **Activations**: UINT8 asymmetric (ReLU outputs)
- **Bias**: INT32 (higher precision to avoid accumulation error)

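**Quantized operation** (zero-points omitted for clarity):
$$
Y_q = \text{round}\left(\frac{1}{s_y}\left(s_w \cdot s_x \cdot (W_q * X_q) + b\right)\right)
$$

Where:
- $W_q$, $X_q$, $Y_q$ are 8-bit integer tensors; the products $W_q * X_q$ accumulate in INT32
- $s_w$, $s_x$, $s_y$ are scale factors
- The bias is stored as INT32 with scale $s_w \cdot s_x$, so it can be added directly to the accumulator
- The combined multiplier $\frac{s_w \cdot s_x}{s_y}$ is precomputed offline and applied as a single rescale at runtime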
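The quantized operation can be illustrated with a fully connected layer, which uses the same arithmetic as convolution. This is a sketch under simplifying assumptions: all tensors use symmetric scales (no zero-points), the data is random, and the output scale is taken from the float reference rather than from a calibration set.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.uniform(-1, 1, size=(4, 8))     # float weights
X = rng.uniform(-1, 1, size=8)          # float input activations
b = rng.uniform(-0.5, 0.5, size=4)      # float bias

def sym_scale(t):
    """Symmetric INT8 scale for a tensor."""
    return np.max(np.abs(t)) / 127

s_w, s_x = sym_scale(W), sym_scale(X)
W_q = np.round(W / s_w).astype(np.int8)
X_q = np.round(X / s_x).astype(np.int8)
b_q = np.round(b / (s_w * s_x)).astype(np.int32)   # bias shares the accumulator scale

# Integer-only matrix multiply with an INT32 accumulator
acc = W_q.astype(np.int32) @ X_q.astype(np.int32) + b_q

# Float reference and output scale (assumed known from calibration in practice)
Y_ref = W @ X + b
s_y = sym_scale(Y_ref)

# Single rescale by the precomputed multiplier s_w * s_x / s_y
M = s_w * s_x / s_y
Y_q = np.clip(np.round(M * acc), -127, 127).astype(np.int8)
Y_hat = s_y * Y_q

print("float reference:", np.round(Y_ref, 3))
print("dequantized    :", np.round(Y_hat, 3))
```

On real integer hardware, `M` is typically represented as a fixed-point multiplier plus a bit shift rather than a float.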

### Fully Connected Layer

**Operation**: $Y = WX + b$

**Same quantization as convolution**

### Batch Normalization

**Problem**: BN requires float32 mean/variance (expensive)

**Solution**: Fuse BN into previous conv layer

$$
Y = \text{Conv}(X, W, b) \rightarrow Y' = \gamma \frac{Y - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta
$$

**Fused operation**:
$$
Y' = \text{Conv}\left(X, \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} W, \frac{\gamma(b - \mu)}{\sqrt{\sigma^2 + \epsilon}} + \beta\right)
$$

**Result**: No runtime BN computation, just conv with modified weights/bias ✅
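The fusion identity can be verified numerically. The sketch below uses a fully connected layer in place of a convolution (the algebra per output channel is the same) with random parameters; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 6)), rng.normal(size=4)        # layer weights and bias
gamma, beta = rng.normal(size=4), rng.normal(size=4)      # BN scale and shift
mu, var = rng.normal(size=4), rng.uniform(0.5, 2.0, size=4)  # BN running stats
eps = 1e-5
X = rng.normal(size=(10, 6))

# Unfused: linear layer followed by batch normalization
Y = X @ W.T + b
Y_bn = gamma * (Y - mu) / np.sqrt(var + eps) + beta

# Fused: fold the BN scale into the weights and the BN shift into the bias
k = gamma / np.sqrt(var + eps)
W_fused = W * k[:, None]
b_fused = k * (b - mu) + beta
Y_fused = X @ W_fused.T + b_fused

print("max difference:", np.max(np.abs(Y_bn - Y_fused)))
```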

---

## 1.6 Quantization Accuracy Analysis

### Theoretical Error Bound

**Quantization error per weight**:
$$
\epsilon = \left|w - s \cdot \text{round}\left(\frac{w}{s}\right)\right| \leq \frac{s}{2}
$$

**For symmetric INT8** ($s = \frac{\alpha}{127}$):
$$
\epsilon \leq \frac{\alpha}{254}
$$

**Example**: $\alpha = 1.0$
$$
\epsilon \leq \frac{1.0}{254} = 0.0039 \quad \text{(0.39% max error per weight)}
$$
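The bound can be checked empirically. This sketch assumes weights drawn uniformly from $[-\alpha, \alpha]$ with $\alpha = 1.0$; the distribution is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 1.0
w = rng.uniform(-alpha, alpha, size=100_000)

s = alpha / 127                    # symmetric INT8 scale
w_hat = s * np.round(w / s)        # quantize, then dequantize
max_err = np.max(np.abs(w - w_hat))
bound = alpha / 254                # theoretical bound s / 2

print(f"max observed error = {max_err:.6f}, bound = {bound:.6f}")
```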

### Layer-wise Error Propagation

For $L$-layer network:
$$
\text{Total error} \approx \sum_{l=1}^{L} \epsilon_l \cdot \left|\frac{\partial \text{loss}}{\partial w_l}\right|
$$

**Typical results** (image classification):
- **1 layer**: 0.1% accuracy loss
- **10 layers**: 1.0% accuracy loss (errors compound)
- **50 layers**: 3-5% accuracy loss (deep networks more sensitive)

**Mitigation**: Use mixed precision (sensitive layers in float32, others in INT8)

---

# 2️⃣ Pruning: Remove Unimportant Weights

## What is Pruning?

**Definition**: Set unimportant weights to zero to reduce model size and computation

**Motivation**:
- **Sparsity**: 50-90% of weights can be pruned with <3% accuracy loss
- **Lottery ticket hypothesis** (Frankle & Carbin 2019): Sparse subnetworks exist that match full network performance

---

## 2.1 Magnitude-based Pruning

**Idea**: Prune weights with smallest absolute values (least important)

### Algorithm

**Step 1**: Compute importance score for each weight
$$
I_i = |w_i|
$$

**Step 2**: Sort weights by importance
$$
|w_{(1)}| \leq |w_{(2)}| \leq \cdots \leq |w_{(n)}|
$$

**Step 3**: Prune bottom $p\%$ of weights
$$
w_i = \begin{cases}
w_i & \text{if } |w_i| > \text{threshold} \\
0 & \text{otherwise}
\end{cases}
$$

Where threshold is the $p$-th percentile of $|w_i|$ (the largest magnitude among the weights being pruned)

### Example

**Original weights**:
$$
W = \begin{bmatrix}
0.8 & -0.1 & 0.3 \\
-0.05 & 0.9 & -0.2
\end{bmatrix}
$$

**Sorted by magnitude**:
$$
|-0.05| < |-0.1| < |-0.2| < |0.3| < |0.8| < |0.9|
$$

**Prune 50%** (bottom 3 weights):
$$
W_{\text{pruned}} = \begin{bmatrix}
0.8 & 0 & 0.3 \\
0 & 0.9 & 0
\end{bmatrix}
$$

**Sparsity**: 3/6 = 50% weights pruned ✅
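The example above, as a minimal NumPy sketch. The function name `magnitude_prune` is illustrative; ties at the threshold are all pruned, which can slightly overshoot the requested sparsity.

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(round(sparsity * w.size))            # number of weights to prune
    if k == 0:
        return w.copy(), np.ones(w.shape, dtype=bool)
    threshold = np.sort(np.abs(w), axis=None)[k - 1]   # k-th smallest magnitude
    mask = np.abs(w) > threshold                 # True = keep
    return w * mask, mask

W = np.array([[0.8, -0.1, 0.3],
              [-0.05, 0.9, -0.2]])

W_pruned, mask = magnitude_prune(W, sparsity=0.5)
sparsity = 1 - mask.mean()

print(W_pruned)
print(f"sparsity = {sparsity:.0%}")
```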

---

## 2.2 Structured Pruning

**Problem**: Unstructured (magnitude-based) pruning creates irregular sparsity → Hard to accelerate on hardware

**Solution**: Prune entire channels, filters, or layers (regular sparsity)

### Channel Pruning

**Idea**: Remove entire output channels from conv layer

**Importance score** for channel $c$:
$$
I_c = \sum_{i,j,k} |w_{i,j,k,c}| \quad \text{(sum of absolute weights in channel)}
$$

**Prune channels with lowest** $I_c$

**Example**: Conv layer (64 channels → 32 channels)
- **Original**: 3×3×3×64 = 1,728 parameters
- **After pruning**: 3×3×3×32 = 864 parameters (50% reduction)
- **Speedup**: 2× faster (regular sparsity, hardware-friendly) ✅
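A sketch of channel pruning on a single weight tensor, assuming a `(height, width, in_channels, out_channels)` layout and random weights. In a real network the next layer's input channels would have to be sliced with the same `keep` indices.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3, 3, 64))           # (kh, kw, in_channels, out_channels)

# Importance of each output channel: sum of absolute weights (L1 norm)
importance = np.abs(W).sum(axis=(0, 1, 2))

# Keep the 32 most important channels, preserving their original order
keep = np.sort(np.argsort(importance)[-32:])
W_pruned = W[..., keep]

print(f"parameters: {W.size} -> {W_pruned.size}")
```

Because the result is a smaller dense tensor, no sparse kernels are needed to realize the speedup.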

---

## 2.3 Iterative Pruning with Fine-tuning

**Problem**: Pruning 90% in one shot loses >10% accuracy ❌

**Solution**: Prune gradually (10-20% per iteration) + fine-tune

### Algorithm

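```python
# Runnable sketch of iterative pruning in PyTorch (toy linear classifier, random data)
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
X, y = torch.randn(256, 20), torch.randint(0, 2, (256,))
model = nn.Linear(20, 2)
mask = torch.ones_like(model.weight)  # 1 = keep, 0 = pruned

def train(epochs):
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(epochs):
        optimizer.zero_grad()
        logits = F.linear(X, model.weight * mask, model.bias)
        F.cross_entropy(logits, y).backward()
        optimizer.step()

train(20)  # Train baseline

for iteration in range(5):
    # Prune 20% of the remaining weights (smallest magnitude first)
    magnitudes = model.weight.detach().abs()
    remaining = magnitudes[mask.bool()]
    threshold = remaining.kthvalue(int(0.2 * remaining.numel())).values
    mask[magnitudes <= threshold] = 0

    train(10)  # Fine-tune for 10 epochs

    # Evaluate (on the training data here, for brevity)
    with torch.no_grad():
        preds = F.linear(X, model.weight * mask, model.bias).argmax(dim=1)
    accuracy = (preds == y).float().mean().item()
    print(f"Iteration {iteration}, Sparsity: {1 - mask.mean().item():.0%}, Acc: {accuracy:.2%}")
```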

**Typical results** (ResNet-50 on ImageNet):
- **Baseline**: 0% sparsity, 76.1% accuracy
- **Iteration 1**: 20% sparsity, 75.8% accuracy (0.3% loss)
- **Iteration 2**: 36% sparsity, 75.3% accuracy (0.8% loss)
- **Iteration 3**: 49% sparsity, 74.5% accuracy (1.6% loss)
- **Iteration 4**: 59% sparsity, 73.2% accuracy (2.9% loss)
- **Iteration 5**: 67% sparsity, 71.5% accuracy (4.6% loss) ⚠️

**Optimal**: 50-60% sparsity (2-3% accuracy loss, acceptable)

---

## 2.4 Lottery Ticket Hypothesis

**Discovery** (Frankle & Carbin 2019): Dense networks contain sparse "winning tickets" that can match full accuracy

### Finding Winning Tickets

**Algorithm**:
1. **Train full network** to convergence
2. **Prune** smallest-magnitude weights (e.g., 90%)
3. **Reset** remaining weights to initial values
4. **Retrain** the sparse network starting from those original initial values (not from a fresh random initialization)

**Result**: Sparse network matches or exceeds full network accuracy! 🎯

**Why it works**: Initialization matters more than we thought (lucky lottery tickets exist)

---

## 2.5 Pruning + Quantization = Deep Compression

**Deep Compression** (Han et al. 2016): Combine pruning + quantization + Huffman coding

### Pipeline

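**Step 1**: Magnitude-based pruning (about 9-10× reduction)
- AlexNet: 61M → 6.7M parameters (about 89% pruned)

**Step 2**: Quantization (about 4× reduction)
- The paper uses trained weight sharing (a small codebook per layer); INT8 per-channel quantization gives a comparable 32-bit → 8-bit saving

**Step 3**: Huffman coding (up to about 2× reduction)
- Encode quantized values with variable-length codes

**Idealized total**: 10 × 4 × 2 = **80× compression** 🚀 (an upper bound, since storing sparse indices costs extra space)

**Example**: AlexNet (idealized arithmetic)
- **Original**: 240MB (float32)
- **After pruning**: 24MB (90% sparsity)
- **After quantization**: 6MB (INT8)
- **After Huffman**: 3MB (variable-length encoding)
- **Idealized**: **240MB → 3MB = 80× smaller**
- **Reported in the paper**: 240MB → 6.9MB = **35× smaller** ✅

**Accuracy**: The paper reports no loss of top-1 accuracy for AlexNet (about 57.2% before and after)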

---

# 3️⃣ Knowledge Distillation: Train Small Model to Mimic Large Model

## What is Knowledge Distillation?

**Definition**: Transfer knowledge from large "teacher" model to small "student" model

**Motivation**:
- **Teacher**: ResNet-50 (25MB, 98.5% accuracy) → Too large for edge ❌
- **Student**: MobileNetV2 (5MB, 95.0% accuracy) → Fits on edge but lower accuracy ❌
- **Distilled student**: MobileNetV2 (5MB, 97.2% accuracy) → Learns from teacher ✅

**Gain**: 2.2% accuracy improvement (95.0% → 97.2%) from distillation

---

## 3.1 Hinton's Distillation (2015)

### Soft Targets

**Problem**: Hard labels (one-hot) lose information

**Example**: Dog vs Wolf classification
- **Hard label**: Dog = [1, 0] (binary)
- **Soft label**: Dog = [0.9, 0.1] (teacher thinks 10% wolf-like)

**Soft label captures similarity** between classes (dog closer to wolf than to airplane)

### Temperature Scaling

**Standard softmax** (temperature T=1):
$$
p_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}
$$

**Softmax with temperature**:
$$
p_i^T = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}
$$

**Effect of temperature**:
- **T = 1**: Standard softmax (sharp distribution)
- **T = 5**: Softer distribution (reveals class similarities)
- **T = 10**: Very soft (more information for student)

**Example**: Logits $z = [5.0, 2.0, 1.0]$

**T = 1** (standard):
$$
p = [0.936, 0.047, 0.017] \quad \text{(very confident)}
$$

**T = 5** (soft):
$$
p^5 = [0.500, 0.275, 0.225] \quad \text{(class 2 and 3 have meaningful probabilities)}
$$
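Temperature scaling is easy to experiment with directly. A minimal NumPy sketch follows; the function name `softmax_t` is illustrative, and subtracting the maximum logit is a standard numerical-stability step that does not change the result.

```python
import numpy as np

def softmax_t(z, T=1.0):
    """Softmax with temperature T: exp(z_i / T) / sum_j exp(z_j / T)."""
    z = np.asarray(z, dtype=np.float64) / T
    z = z - z.max()                 # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()

logits = [5.0, 2.0, 1.0]
p1 = softmax_t(logits, T=1)
p5 = softmax_t(logits, T=5)
p10 = softmax_t(logits, T=10)

for T, p in [(1, p1), (5, p5), (10, p10)]:
    print(f"T = {T:>2}: {np.round(p, 3)}")
```

Higher temperatures flatten the distribution while preserving the ranking of the classes.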

### Distillation Loss

**Total loss** (weighted combination):
$$
L = \alpha \cdot L_{\text{hard}} + (1 - \alpha) \cdot L_{\text{soft}}
$$

Where:
- **Hard loss**: Cross-entropy with true labels
$$
L_{\text{hard}} = -\sum_i y_i \log(p_i^{\text{student}})
$$

- **Soft loss**: KL divergence between teacher and student soft outputs
$$
L_{\text{soft}} = T^2 \cdot \text{KL}\left(p^{\text{teacher}}_T \parallel p^{\text{student}}_T\right)
$$

- **α**: Weight (typically α=0.1, more weight on soft loss)
- **T²**: Compensates for the soft-target gradients shrinking by roughly 1/T² as temperature increases

### Example Calculation

**Teacher outputs** (T=5):
$$
p^{\text{teacher}}_5 = [0.585, 0.237, 0.178]
$$

**Student outputs** (T=5):
$$
p^{\text{student}}_5 = [0.550, 0.300, 0.150]
$$

**KL divergence**:
$$
\text{KL} = \sum_i p_i^{\text{teacher}} \log\left(\frac{p_i^{\text{teacher}}}{p_i^{\text{student}}}\right)
$$

$$
= 0.585 \log\left(\frac{0.585}{0.550}\right) + 0.237 \log\left(\frac{0.237}{0.300}\right) + 0.178 \log\left(\frac{0.178}{0.150}\right)
$$

$$
= 0.585 \times 0.0617 + 0.237 \times (-0.2357) + 0.178 \times 0.1711
$$

$$
= 0.0361 - 0.0559 + 0.0305 = 0.0107
$$

**Soft loss**:
$$
L_{\text{soft}} = T^2 \cdot \text{KL} = 5^2 \times 0.0107 \approx 0.267
$$
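
The hand calculation can be reproduced in a few lines of NumPy. This is a check of the arithmetic only, using the teacher and student distributions from the example; small differences from hand-rounded intermediate values are expected.

```python
import numpy as np

p_teacher = np.array([0.585, 0.237, 0.178])
p_student = np.array([0.550, 0.300, 0.150])
T = 5.0

terms = p_teacher * np.log(p_teacher / p_student)  # natural log, one term per class
kl = terms.sum()
soft_loss = T**2 * kl

print("per-class terms:", np.round(terms, 4))
print(f"KL = {kl:.4f}, soft loss = {soft_loss:.4f}")
```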

---

## 3.2 Feature-based Distillation

**Idea**: Match intermediate feature maps (not just final outputs)

### FitNet (Romero et al. 2015)

**Match intermediate layers** (the regressor is applied to the student's features, as in the paper):
$$
L_{\text{feature}} = \frac{1}{2}\left\|F^{\text{teacher}} - W_r F^{\text{student}}\right\|^2
$$

Where:
- $F^{\text{teacher}}$ = Teacher's intermediate feature map, the "hint" (e.g., 512 channels)
- $F^{\text{student}}$ = Student's intermediate feature map, the "guided" layer (e.g., 256 channels)
- $W_r$ = Learned regressor (256 → 512) that projects student features to the teacher's dimensions

**Effect**: Student learns better internal representations (not just final outputs)
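
A small NumPy sketch of a hint loss in the FitNets style. It follows the paper's convention of applying the regressor to the student's features so they can be compared in the teacher's dimension. The shapes, the random regressor, and the name `hint_loss` are assumptions for illustration; in training the regressor is learned jointly with the student.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a batch of 4 feature vectors (spatial dims already flattened away)
f_teacher = rng.normal(size=(4, 512))   # teacher hint layer, 512 channels
f_student = rng.normal(size=(4, 256))   # student guided layer, 256 channels

# Regressor W_r maps student features (256) into the teacher's space (512).
# It is random here; in practice it is a trainable layer.
w_r = rng.normal(scale=0.05, size=(256, 512))

def hint_loss(f_s, f_t, w):
    """0.5 * squared L2 distance between projected student and teacher features, batch mean."""
    diff = f_s @ w - f_t
    return 0.5 * np.mean(np.sum(diff**2, axis=1))

print(f"hint loss: {hint_loss(f_student, f_teacher, w_r):.2f}")
```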

---

## 3.3 Relation-based Distillation

**Idea**: Match relationships between feature pairs (not absolute values)

### RKD (Park et al. 2019)

**Distance-wise relation**:
$$
L_{\text{distance}} = \sum_{i,j} \left\|\frac{d_{ij}^{\text{student}}}{\mu_s} - \frac{d_{ij}^{\text{teacher}}}{\mu_t}\right\|^2
$$

Where:
- $d_{ij} = \|f_i - f_j\|$ (distance between features of samples i and j)
- $\mu$ = mean distance (normalization)

**Angle-wise relation**:
$$
L_{\text{angle}} = \sum_{i,j,k} \left|\cos\theta_{ijk}^{\text{student}} - \cos\theta_{ijk}^{\text{teacher}}\right|^2
$$

**Effect**: Student learns geometric structure of feature space (robust to feature magnitude differences)
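
The distance-wise term can be sketched in NumPy as follows. This uses the squared difference from the formula above (the RKD paper itself uses a Huber loss); the function names and random features are assumptions for illustration.

```python
import numpy as np

def pairwise_distances(f):
    """Euclidean distance between every pair of rows in f (shape: N x D)."""
    diff = f[:, None, :] - f[None, :, :]
    return np.sqrt((diff**2).sum(axis=-1))

def rkd_distance_loss(f_student, f_teacher):
    """Squared difference of mean-normalised pairwise distances (off-diagonal pairs)."""
    d_s = pairwise_distances(f_student)
    d_t = pairwise_distances(f_teacher)
    off_diag = ~np.eye(len(d_s), dtype=bool)
    d_s = d_s / d_s[off_diag].mean()
    d_t = d_t / d_t[off_diag].mean()
    return float(((d_s - d_t)[off_diag] ** 2).sum())

rng = np.random.default_rng(0)
teacher_feats = rng.normal(size=(8, 64))
student_feats = rng.normal(size=(8, 16))   # feature dimension may differ from the teacher

print(rkd_distance_loss(student_feats, teacher_feats))
print(rkd_distance_loss(3.0 * teacher_feats, teacher_feats))  # rescaled copy: close to 0
```

Because distances are divided by their mean, a rescaled copy of the teacher's features incurs essentially no loss, which is the robustness to feature magnitude mentioned above.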

---

## 3.4 Self-Distillation

**Idea**: No external teacher, use model's own predictions

### Born-Again Networks (Furlanello et al. 2018)

**Algorithm**:
1. Train model M1 (student becomes teacher)
2. Train model M2 using M1 as teacher
3. Train model M3 using M2 as teacher
4. ...

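**Result**: Furlanello et al. report students that outperform their teachers (M2 > M1, often M3 > M2), although a gain in every generation is not guaranteed

**Why it works**: The teacher's soft outputs carry class-similarity information that one-hot labels lack, which acts as a learned regularizer; ensembling several generations improves results further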

---

# 4️⃣ Efficiency Metrics for Edge AI

## 4.1 Model Size

**Model size** (in bits; divide by 8 for bytes):
$$
\text{Size} = \sum_{\text{layers}} n_{\text{params}} \times \text{bits per param}
$$

**Example**: ResNet-50
- **Parameters**: 25.6M
- **Precision**: float32 (32 bits = 4 bytes)
- **Size**: 25.6M × 4 = **102.4 MB**

**After INT8 quantization**:
- **Size**: 25.6M × 1 = **25.6 MB** (4× smaller)
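
A minimal sketch of the size formula as a helper function. The name `model_size_mb` is illustrative, and the figure covers weights only (no file-format overhead).

```python
def model_size_mb(num_params, bits_per_param):
    """Size in megabytes (10^6 bytes) of the weights alone."""
    return num_params * bits_per_param / 8 / 1e6

resnet50_params = 25.6e6
for name, bits in [("float32", 32), ("float16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name:>8}: {model_size_mb(resnet50_params, bits):6.1f} MB")
```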

---

## 4.2 FLOPs (Floating-Point Operations)

**Conv layer FLOPs**:
$$
\text{FLOPs} = 2 \times H_{\text{out}} \times W_{\text{out}} \times C_{\text{in}} \times C_{\text{out}} \times K \times K
$$

Where:
- $H_{\text{out}}, W_{\text{out}}$ = Output height/width
- $C_{\text{in}}, C_{\text{out}}$ = Input/output channels
- $K \times K$ = Kernel size
- Factor of 2 = Multiply + Add

**Example**: Standard conv
- Input: 56×56×64
- Output: 56×56×128
- Kernel: 3×3
- **FLOPs**: $2 \times 56 \times 56 \times 64 \times 128 \times 3 \times 3 = 462M$ ❌ (expensive)

**Depthwise separable conv** (MobileNet):
- **Depthwise**: $2 \times 56 \times 56 \times 64 \times 1 \times 3 \times 3 = 3.6M$
- **Pointwise**: $2 \times 56 \times 56 \times 64 \times 128 \times 1 \times 1 = 51.4M$
- **Total**: 3.6M + 51.4M = **55.0M** ✅ (8.4× fewer FLOPs)
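
The FLOP counts can be recomputed directly from the formula. A short sketch, with illustrative function names and biases ignored:

```python
def conv_flops(h_out, w_out, c_in, c_out, k):
    """FLOPs of a standard KxK convolution (2 FLOPs per multiply-add)."""
    return 2 * h_out * w_out * c_in * c_out * k * k

def separable_conv_flops(h_out, w_out, c_in, c_out, k):
    """Depthwise KxK conv (one filter per input channel) plus 1x1 pointwise conv."""
    depthwise = 2 * h_out * w_out * c_in * k * k
    pointwise = 2 * h_out * w_out * c_in * c_out
    return depthwise, pointwise

standard = conv_flops(56, 56, 64, 128, 3)
depthwise, pointwise = separable_conv_flops(56, 56, 64, 128, 3)
separable = depthwise + pointwise

print(f"standard:  {standard / 1e6:.1f}M FLOPs")
print(f"separable: {separable / 1e6:.1f}M FLOPs ({depthwise / 1e6:.1f}M + {pointwise / 1e6:.1f}M)")
print(f"reduction: {standard / separable:.1f}x")
```

The reduction factor equals $1 / (1/C_{\text{out}} + 1/K^2)$, so for 3×3 kernels it approaches 9× as the number of output channels grows.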

---

## 4.3 Latency

**Total latency**:
$$
\text{Latency} = \sum_{\text{layers}} \frac{\text{FLOPs}}{\text{Device throughput}} + \text{Memory overhead}
$$

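**Example**: Ideal compute-bound latency (FLOPs ÷ peak throughput) on different devices
- **Cloud GPU** (NVIDIA V100, 125 TFLOPS peak): 4 billion FLOPs ÷ 125 TFLOPS = 0.032ms
- **Edge GPU** (Jetson Nano, 472 GFLOPS peak): 4 billion FLOPs ÷ 472 GFLOPS = 8.5ms (about 265× slower than the V100)
- **Optimized mobile** (MobileNetV2, 300M FLOPs, on Jetson Nano): 300M ÷ 472 GFLOPS = 0.64ms (better)

These figures are lower bounds. Measured latency is usually far higher, because real workloads rarely reach peak throughput and the memory overhead term adds time.

**INT8 quantization speedup**: 2-4× → 0.64ms ÷ 3 ≈ **0.21ms** ✅ (ideal bound)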
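
A sketch of the compute-bound estimate, ignoring the memory overhead term. The function name `ideal_latency_ms` is illustrative, and peak throughput figures are the ones quoted in this section; the result is a lower bound, not a prediction of measured latency.

```python
def ideal_latency_ms(flops, peak_flops_per_s):
    """Lower bound on latency assuming 100% utilisation and no memory stalls."""
    return flops / peak_flops_per_s * 1e3

workloads = {"ResNet-50": 4e9, "MobileNetV2": 300e6}
devices = {"V100 (125 TFLOPS)": 125e12, "Jetson Nano (472 GFLOPS)": 472e9}

for model_name, flops in workloads.items():
    for device_name, peak in devices.items():
        print(f"{model_name:>12} on {device_name:<26}: {ideal_latency_ms(flops, peak):8.3f} ms")
```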

---

## 4.4 Energy Consumption

**Energy per operation** (Horowitz 2014):
- **INT8 ADD**: 0.03 pJ
- **INT8 MULT**: 0.2 pJ
- **FP32 ADD**: 0.9 pJ (30× more than INT8)
- **FP32 MULT**: 3.7 pJ (18× more than INT8)
- **DRAM access**: 640 pJ (3,200× more than INT8 mult!)

**Insight**: Memory access dominates energy (not compute) → Optimize for memory locality

**Example**: 1 billion FP32 multiplications
- **Energy**: 1B × 3.7 pJ = 3.7 mJ
- **Battery**: 3.7 Wh (about 13,320 J; 1,000 mAh at 3.7 V)
- **Battery drain**: 3.7 mJ ÷ 13,320 J ≈ 0.00003% per inference (negligible)

But DRAM access:
- **Energy**: 1B × 640 pJ = 640 mJ (173× more!)
- **Battery drain**: 640 mJ ÷ 13,320 J ≈ 0.005% per inference (significant for always-on applications)

**Optimization**: Use on-chip SRAM (5 pJ, 128× less than DRAM)
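
The energy arithmetic as a short script. The per-operation energies are the figures quoted in this section; the dictionary keys and helper name are illustrative, and the battery is converted from watt-hours to joules (1 Wh = 3,600 J).

```python
PJ = 1e-12  # joules per picojoule

ENERGY_PJ = {"int8_mult": 0.2, "fp32_mult": 3.7, "sram_access": 5.0, "dram_access": 640.0}

def energy_joules(num_ops, op):
    """Total energy for num_ops operations of the given kind."""
    return num_ops * ENERGY_PJ[op] * PJ

battery_joules = 3.7 * 3600  # 3.7 Wh expressed in joules

for op in ENERGY_PJ:
    e = energy_joules(1e9, op)
    print(f"{op:>12}: {e * 1e3:8.2f} mJ, {e / battery_joules * 100:.6f}% of battery")
```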

---

## 4.5 MAC (Multiply-Accumulate) Operations

**Definition**: Core operation in neural networks

$$
y = \sum_{i=1}^{n} w_i x_i + b \quad \text{(n MACs + 1 ADD)}
$$

**Efficiency comparison**:

| Device | MACs/second | Power | MACs/Watt |
|--------|-------------|-------|-----------|
| Cloud GPU (V100) | 125 TMAC | 300W | 417 GMAC/W |
| Edge GPU (Jetson Nano) | 472 GMAC | 10W | 47 GMAC/W |
| Mobile NPU (A16 Bionic Neural Engine) | 17 TMAC | 5W | 3.4 TMAC/W |
| Microcontroller (Cortex-M4) | 10 MMAC | 0.01W | 1 GMAC/W |

**Insight**: The cloud GPU has the highest absolute throughput. On MACs/Watt, only the dedicated mobile accelerator beats it in this table; the Jetson Nano and the microcontroller are less efficient per watt but fit power budgets (10 mW to 10 W) that a 300 W GPU cannot.

---

# 🎯 Key Formulas Summary

## Quantization

**Symmetric**:
$$
q = \text{round}\left(\frac{r}{s}\right), \quad s = \frac{\alpha}{127}
$$

**Asymmetric**:
$$
q = \text{round}\left(\frac{r}{s}\right) + z, \quad s = \frac{r_{\max} - r_{\min}}{255}
$$

**Dequantization**:
$$
r = s \cdot (q - z)
$$

---

## Pruning

**Magnitude importance**:
$$
I_i = |w_i|
$$

**Prune**:
$$
w_i = \begin{cases}
w_i & \text{if } |w_i| > \text{threshold} \\
0 & \text{otherwise}
\end{cases}
$$

---

## Knowledge Distillation

**Soft targets**:
$$
p_i^T = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}
$$

**Distillation loss**:
$$
L = \alpha \cdot L_{\text{hard}} + (1 - \alpha) \cdot T^2 \cdot \text{KL}(p^{\text{teacher}}_T \parallel p^{\text{student}}_T)
$$

---

## Efficiency Metrics

**Model size** (MB):
$$
\text{Size} = \frac{\text{Parameters} \times \text{Bits per param}}{8 \times 10^6}
$$

**FLOPs** (conv layer):
$$
\text{FLOPs} = 2 \times H_{\text{out}} \times W_{\text{out}} \times C_{\text{in}} \times C_{\text{out}} \times K^2
$$

**Latency** (ms):
$$
\text{Latency} = \frac{\text{FLOPs}}{\text{Device throughput}}
$$

---

# 📊 Compression Comparison Table

| Technique | Compression | Accuracy Loss | Speedup | Energy Saving |
|-----------|-------------|---------------|---------|---------------|
| **INT8 Quantization** | 4× | 0.5-2% | 2-4× | 5-10× |
| **Pruning (50%)** | 2× | 1-3% | 1.5× | 2× |
| **Pruning (90%)** | 10× | 3-8% | 3× | 5× |
| **Knowledge Distillation** | 10-100× | 2-5% | 10-50× | 20-100× |
| **Combined (all)** | 50-500× | 3-10% | 20-100× | 50-200× |

**Optimal strategy**: Combine all three techniques for maximum compression

**Example** (illustrative figures): ResNet-50 → MobileNetV2
- **Original**: 25MB, 4B FLOPs, 98.5% accuracy
- **Distillation**: 5MB (5× smaller), 300M FLOPs (13× fewer), 96.0% accuracy
- **+ Quantization**: 1.25MB (20× smaller), 150M FLOPs (27× fewer, counted as FP32-equivalent cost since quantization makes operations cheaper rather than removing them), 95.5% accuracy
- **+ Pruning**: 0.5MB (50× smaller), 75M FLOPs (53× fewer), 94.2% accuracy

**Final**: **50× compression, 53× fewer FLOPs, 4.3 percentage points accuracy loss** ✅

---

# 🎓 Takeaways

1. **Quantization (INT8)**: 4× smaller, 2-4× faster, <2% accuracy loss (essential for edge)
2. **Pruning (50-90%)**: 2-10× smaller, 1.5-3× faster, 1-8% accuracy loss (hardware-dependent)
3. **Knowledge Distillation**: 10-100× smaller, 10-50× faster, 2-5% accuracy loss (train small model)
4. **Combined**: 50-500× compression possible with 3-10% accuracy loss (production edge AI)
5. **Energy is key**: A DRAM access costs over 100× more energy than an arithmetic operation (640 pJ vs 3.7 pJ for an FP32 multiply) → Optimize memory first

**Next**: Implementation (TensorFlow Lite, TFLite Micro, Arduino, Jetson deployment)

---

✅ **Mathematical foundations complete! Next: Production implementation and deployment.**

### 📝 Implementation

**Purpose:** Imports, plus symmetric (INT8) and asymmetric (UINT8) quantizers written from scratch in NumPy (Part 1)

In [None]:
# ===========================
# Edge AI & TinyML Implementation
# Complete production-ready code for deploying ML models on edge devices
# ===========================
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt
import os
print(f"TensorFlow version: {tf.__version__}")
print(f"Keras version: {keras.__version__}")
# ===========================
# Part 1: INT8 Quantization Implementation
# ===========================
class SymmetricQuantizer:
    """
    Symmetric INT8 quantization: Maps [-α, α] → [-127, 127]
    """
    def __init__(self):
        self.scale = None
        self.alpha = None
    
    def calibrate(self, weights):
        """
        Calibrate quantization parameters from weights
        
        Args:
            weights: numpy array of float32 weights
        """
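        self.alpha = np.max(np.abs(weights))
        # Guard against an all-zero tensor, which would give scale = 0 and divide by zero
        self.scale = self.alpha / 127.0 if self.alpha > 0 else 1.0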
    
    def quantize(self, weights):
        """
        Quantize float32 weights to INT8
        
        Args:
            weights: numpy array of float32 weights
        
        Returns:
            quantized: INT8 weights
        """
        if self.scale is None:
            self.calibrate(weights)
        
        quantized = np.round(weights / self.scale)
        quantized = np.clip(quantized, -127, 127).astype(np.int8)
        return quantized
    
    def dequantize(self, quantized):
        """
        Dequantize INT8 weights back to float32
        
        Args:
            quantized: INT8 weights
        
        Returns:
            dequantized: float32 weights (approximately original)
        """
        return self.scale * quantized.astype(np.float32)
class AsymmetricQuantizer:
    """
    Asymmetric UINT8 quantization: Maps [min, max] → [0, 255]
    Used for activations (e.g., ReLU outputs always non-negative)
    """
    def __init__(self):
        self.scale = None
        self.zero_point = None
        self.r_min = None
        self.r_max = None
    
    def calibrate(self, activations):
        """
        Calibrate quantization parameters from activations
        
        Args:
            activations: numpy array of float32 activations
        """
        self.r_min = np.min(activations)
        self.r_max = np.max(activations)
        
        # Ensure range includes 0
        self.r_min = min(self.r_min, 0.0)
        self.r_max = max(self.r_max, 0.0)
        
        # Guard against a constant-zero tensor, which would give scale = 0
        value_range = self.r_max - self.r_min
        self.scale = value_range / 255.0 if value_range > 0 else 1.0
        self.zero_point = int(np.clip(np.round(-self.r_min / self.scale), 0, 255))
    
    def quantize(self, activations):
        """
        Quantize float32 activations to UINT8
        
        Args:
            activations: numpy array of float32 activations
        
        Returns:
            quantized: UINT8 activations
        """
        if self.scale is None:
            self.calibrate(activations)
        
        quantized = np.round(activations / self.scale) + self.zero_point
        quantized = np.clip(quantized, 0, 255).astype(np.uint8)
        return quantized
    
    def dequantize(self, quantized):
        """
        Dequantize UINT8 activations back to float32
        
        Args:
            quantized: UINT8 activations
        
        Returns:
            dequantized: float32 activations (approximately original)
        """
        return self.scale * (quantized.astype(np.float32) - self.zero_point)
# Demo: Quantization accuracy
print("\n" + "="*60)
print("Demo 1: INT8 Quantization")
print("="*60)
# Simulate weight matrix
weights = np.random.randn(100, 100).astype(np.float32) * 0.5
# Symmetric quantization
quantizer = SymmetricQuantizer()
quantized_weights = quantizer.quantize(weights)
dequantized_weights = quantizer.dequantize(quantized_weights)
# Calculate error
mse = np.mean((weights - dequantized_weights) ** 2)
# Relative error normalized by mean |w| (per-element ratios blow up for weights near zero)
relative_error = np.mean(np.abs(weights - dequantized_weights)) / np.mean(np.abs(weights)) * 100
print(f"Original weights: mean={weights.mean():.4f}, std={weights.std():.4f}, range=[{weights.min():.4f}, {weights.max():.4f}]")
print(f"Quantized weights: min={quantized_weights.min()}, max={quantized_weights.max()}")
print(f"Quantization scale: {quantizer.scale:.6f}")
print(f"Mean Squared Error: {mse:.6f}")
print(f"Relative Error: {relative_error:.2f}%")
print(f"Compression: float32 (4 bytes) → INT8 (1 byte) = 4× smaller")
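
The demo above exercises only the symmetric quantizer. As a companion, here is a self-contained sketch of the asymmetric UINT8 round trip on ReLU-like activations. The standalone functions are illustrative stand-ins that follow the same formulas as the class above, not part of any library.

```python
import numpy as np

def asymmetric_quantize(x):
    """Map [min(x, 0), max(x, 0)] onto [0, 255]; returns (q, scale, zero_point)."""
    r_min = min(float(x.min()), 0.0)
    r_max = max(float(x.max()), 0.0)
    scale = (r_max - r_min) / 255.0 or 1.0   # fall back to 1.0 for an all-zero tensor
    zero_point = int(np.clip(round(-r_min / scale), 0, 255))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def asymmetric_dequantize(q, scale, zero_point):
    """Inverse mapping r = scale * (q - zero_point)."""
    return scale * (q.astype(np.float32) - zero_point)

rng = np.random.default_rng(0)
activations = np.maximum(rng.normal(size=(64, 64)), 0).astype(np.float32)  # ReLU-like

q, scale, zp = asymmetric_quantize(activations)
recovered = asymmetric_dequantize(q, scale, zp)
max_err = float(np.abs(activations - recovered).max())
print(f"scale={scale:.5f}, zero_point={zp}, max abs error={max_err:.5f}")
```

For non-negative inputs the zero point is 0 and all 256 levels cover the used range, whereas a symmetric INT8 mapping would leave the negative half of its range unused.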


### 📝 Implementation Part 2

**Purpose:** Load CIFAR-10, then build and train a lightweight MobileNetV2-style CNN as the float32 baseline (Part 2)

In [None]:
# ===========================
# Part 2: Build and Train Mobile Model
# ===========================
print("\n" + "="*60)
print("Part 2: Train MobileNetV2-style Model")
print("="*60)
# Load CIFAR-10 dataset
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
# Normalize
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
# One-hot encode labels
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)
print(f"Training data: {x_train.shape}, Labels: {y_train.shape}")
print(f"Test data: {x_test.shape}, Labels: {y_test.shape}")
def build_mobile_model(input_shape=(32, 32, 3), num_classes=10):
    """
    Build lightweight mobile-friendly CNN (inspired by MobileNetV2)
    
    Uses depthwise separable convolutions for efficiency:
    - Depthwise conv: 3×3 spatial filtering per channel
    - Pointwise conv: 1×1 conv to mix channels
    
    Efficiency: roughly 8-9× fewer parameters than a standard 3×3 conv
    """
    inputs = layers.Input(shape=input_shape)
    
    # Initial conv
    x = layers.Conv2D(32, 3, padding='same', activation='relu')(inputs)
    x = layers.BatchNormalization()(x)
    
    # Depthwise separable block 1
    x = layers.DepthwiseConv2D(3, padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(64, 1, padding='same')(x)  # Pointwise
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.MaxPooling2D(2)(x)  # 32×32 → 16×16
    
    # Depthwise separable block 2
    x = layers.DepthwiseConv2D(3, padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(128, 1, padding='same')(x)  # Pointwise
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.MaxPooling2D(2)(x)  # 16×16 → 8×8
    
    # Depthwise separable block 3
    x = layers.DepthwiseConv2D(3, padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(256, 1, padding='same')(x)  # Pointwise
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    
    # Global average pooling
    x = layers.GlobalAveragePooling2D()(x)
    
    # Classifier
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    
    model = keras.Model(inputs, outputs, name='mobile_cnn')
    return model
# Build model
model = build_mobile_model()
model.summary()
# Count parameters
total_params = model.count_params()
print(f"\nTotal parameters: {total_params:,}")
print(f"Model size (float32): {total_params * 4 / 1e6:.2f} MB")
print(f"Model size (INT8): {total_params / 1e6:.2f} MB (after quantization)")
# Compile
model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
# Train (quick training for demo, only 5 epochs)
print("\n" + "="*60)
print("Training mobile model (5 epochs for demo)...")
print("="*60)
history = model.fit(
    x_train[:10000], y_train[:10000],  # Use subset for faster demo
    batch_size=128,
    epochs=5,
    validation_split=0.2,
    verbose=1
)
# Evaluate
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
print(f"\nTest accuracy (float32 model): {test_acc:.4f}")
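
The parameter savings of depthwise separable blocks can be estimated without building a model. A sketch for the three channel transitions used in `build_mobile_model`, with biases and BatchNorm parameters ignored; the helper names are illustrative.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights in a standard KxK convolution."""
    return k * k * c_in * c_out

def separable_conv_params(c_in, c_out, k=3):
    """Weights in a depthwise KxK conv plus a 1x1 pointwise conv."""
    return k * k * c_in + c_in * c_out

for c_in, c_out in [(32, 64), (64, 128), (128, 256)]:
    std = standard_conv_params(c_in, c_out)
    sep = separable_conv_params(c_in, c_out)
    print(f"{c_in:>3} -> {c_out:<3}: standard={std:>7,}  separable={sep:>6,}  ratio={std / sep:.1f}x")
```

The ratio stays below 9× for 3×3 kernels and gets closer to it as the output channel count grows.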


### 📝 Implementation Part 3

**Purpose:** Convert the trained model to TensorFlow Lite, apply full-integer post-training quantization, and measure the accuracy of the INT8 model (Part 3)

In [None]:
# ===========================
# Part 3: Post-Training Quantization (TensorFlow Lite)
# ===========================
print("\n" + "="*60)
print("Part 3: Post-Training Quantization (PTQ)")
print("="*60)
# Convert to TensorFlow Lite (float32 baseline)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model_float = converter.convert()
# Save float32 model
with open('/tmp/mobile_model_float32.tflite', 'wb') as f:
    f.write(tflite_model_float)
float_size = len(tflite_model_float)
print(f"Float32 TFLite model size: {float_size / 1e6:.2f} MB")
# Convert to INT8 (post-training quantization)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Representative dataset for calibration
def representative_dataset():
    """
    Provide representative data for quantization calibration
    """
    for i in range(100):
        yield [x_train[i:i+1]]
converter.representative_dataset = representative_dataset
# Full integer quantization (INT8 weights + INT8 activations)
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tflite_model_int8 = converter.convert()
# Save INT8 model
with open('/tmp/mobile_model_int8.tflite', 'wb') as f:
    f.write(tflite_model_int8)
int8_size = len(tflite_model_int8)
print(f"INT8 TFLite model size: {int8_size / 1e6:.2f} MB")
print(f"Compression ratio: {float_size / int8_size:.2f}×")
# Evaluate INT8 model
interpreter = tf.lite.Interpreter(model_content=tflite_model_int8)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
# Test on subset (TFLite inference is slower in Python)
num_test_samples = 100
correct = 0
for i in range(num_test_samples):
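    # Prepare input (UINT8): quantize with the model's own input scale and zero point
    input_scale, input_zero_point = input_details[0]['quantization']
    input_data = np.clip(np.round(x_test[i:i+1] / input_scale) + input_zero_point, 0, 255).astype(np.uint8)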
    
    # Run inference
    interpreter.set_tensor(input_details[0]['index'], input_data)
    interpreter.invoke()
    
    # Get output (UINT8)
    output_data = interpreter.get_tensor(output_details[0]['index'])
    
    # Dequantize output
    output_scale = output_details[0]['quantization'][0]
    output_zero_point = output_details[0]['quantization'][1]
    output_float = output_scale * (output_data.astype(np.float32) - output_zero_point)
    
    # Predict
    pred = np.argmax(output_float)
    true = np.argmax(y_test[i])
    
    if pred == true:
        correct += 1
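int8_accuracy = correct / num_test_samples
print(f"\nTest accuracy (INT8 model): {int8_accuracy:.4f}")
# Compare against the float32 model on the same subset so the gap is not a sampling artifact
_, float_acc_subset = model.evaluate(x_test[:num_test_samples], y_test[:num_test_samples], verbose=0)
print(f"Accuracy loss from quantization: {(float_acc_subset - int8_accuracy) * 100:.2f} percentage points")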
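
Feeding a fully quantized model means mapping real inputs to integers with the tensor's own scale and zero point, and mapping outputs back. A NumPy-only sketch of the two mappings; the helper names and the example parameters are assumptions, standing in for values an interpreter would report for its input and output tensors.

```python
import numpy as np

def quantize_input(x_float, scale, zero_point, dtype=np.uint8):
    """Map real values to integers with q = round(x / scale) + zero_point, clipped to dtype."""
    info = np.iinfo(dtype)
    q = np.round(x_float / scale) + zero_point
    return np.clip(q, info.min, info.max).astype(dtype)

def dequantize_output(q, scale, zero_point):
    """Inverse mapping r = scale * (q - zero_point)."""
    return scale * (q.astype(np.float32) - zero_point)

# Hypothetical parameters for an input calibrated on pixels in [0, 1]
scale, zero_point = 1.0 / 255.0, 0
pixels = np.array([0.0, 0.25, 0.75, 1.0], dtype=np.float32)
q = quantize_input(pixels, scale, zero_point)
print(q, dequantize_output(q, scale, zero_point))
```

Rounding before the integer cast matters: a bare `astype(np.uint8)` truncates toward zero and biases every input downward by up to one quantization step.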


### 📝 Implementation Part 4

**Purpose:** Magnitude-based pruning of the first conv layer, followed by the knowledge distillation wrapper (Parts 4-5)

In [None]:
# ===========================
# Part 4: Magnitude-based Pruning
# ===========================
print("\n" + "="*60)
print("Part 4: Magnitude-based Pruning")
print("="*60)
def prune_weights(weights, sparsity=0.5):
    """
    Prune weights by magnitude (set smallest |w| to zero)
    
    Args:
        weights: numpy array
        sparsity: fraction of weights to prune (0.5 = 50%)
    
    Returns:
        pruned_weights: weights with smallest values set to zero
    """
    threshold = np.percentile(np.abs(weights), sparsity * 100)
    mask = np.abs(weights) > threshold
    pruned_weights = weights * mask
    return pruned_weights
# Get first conv layer weights
first_conv_weights = model.layers[1].get_weights()[0]  # Shape: (3, 3, 3, 32)
print(f"Original weights shape: {first_conv_weights.shape}")
print(f"Original non-zero weights: {np.count_nonzero(first_conv_weights)}")
# Prune 50%
pruned_weights = prune_weights(first_conv_weights, sparsity=0.5)
print(f"After 50% pruning: {np.count_nonzero(pruned_weights)} non-zero weights")
print(f"Actual sparsity: {1 - np.count_nonzero(pruned_weights) / first_conv_weights.size:.2%}")
# Visualize pruning effect
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(first_conv_weights.flatten(), bins=50, alpha=0.7)
axes[0].set_title('Original Weight Distribution')
axes[0].set_xlabel('Weight value')
axes[0].set_ylabel('Frequency')
axes[0].axvline(0, color='red', linestyle='--', label='Zero')
axes[0].legend()
axes[1].hist(pruned_weights.flatten(), bins=50, alpha=0.7)
axes[1].set_title('Pruned Weight Distribution (50% sparsity)')
axes[1].set_xlabel('Weight value')
axes[1].set_ylabel('Frequency')
axes[1].axvline(0, color='red', linestyle='--', label='Zero (many)')
axes[1].legend()
plt.tight_layout()
plt.savefig('/tmp/pruning_demo.png', dpi=150, bbox_inches='tight')
print("Saved pruning visualization to /tmp/pruning_demo.png")
# ===========================
# Part 5: Knowledge Distillation
# ===========================
print("\n" + "="*60)
print("Part 5: Knowledge Distillation")
print("="*60)
class Distiller(keras.Model):
    """
    Knowledge distillation wrapper
    
    Trains student model to match teacher's soft targets
    """
    def __init__(self, student, teacher, temperature=5.0, alpha=0.1):
        super().__init__()
        self.student = student
        self.teacher = teacher
        self.temperature = temperature
        self.alpha = alpha
        
    def compile(self, optimizer, metrics):
        super().compile(optimizer=optimizer, metrics=metrics)
        self.student_loss_tracker = keras.metrics.Mean(name="student_loss")
        self.distillation_loss_tracker = keras.metrics.Mean(name="distillation_loss")
        
    @property
    def metrics(self):
        return [self.student_loss_tracker, self.distillation_loss_tracker]
    
    def train_step(self, data):
        x, y = data
        
        # Teacher predictions (soft targets)
        teacher_predictions = self.teacher(x, training=False)
        
        with tf.GradientTape() as tape:
            # Student predictions
            student_predictions = self.student(x, training=True)
            
            # Hard loss (cross-entropy with true labels), averaged over the batch
            student_loss = tf.reduce_mean(
                keras.losses.categorical_crossentropy(y, student_predictions))
            
            # Soft loss (KL divergence with teacher)
            # Both models end in a softmax layer, so log(p) recovers the logits up to an
            # additive constant; temperature scaling is applied to those log-probabilities
            eps = 1e-7
            teacher_soft = tf.nn.softmax(tf.math.log(teacher_predictions + eps) / self.temperature)
            student_soft = tf.nn.softmax(tf.math.log(student_predictions + eps) / self.temperature)
            
            distillation_loss = tf.reduce_mean(
                keras.losses.kl_divergence(teacher_soft, student_soft))
            
            # Combined loss (weighted)
            # Note: temperature^2 keeps soft-loss gradients comparable across temperatures
            loss = self.alpha * student_loss + (1 - self.alpha) * distillation_loss * (self.temperature ** 2)
        
        # Update student weights
        trainable_vars = self.student.trainable_variables
        gradients = tape.gradient(loss, trainable_vars)
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))
        
        # Update metrics
        self.student_loss_tracker.update_state(student_loss)
        self.distillation_loss_tracker.update_state(distillation_loss)
        
        return {
            "student_loss": self.student_loss_tracker.result(),
            "distillation_loss": self.distillation_loss_tracker.result(),
        }
    
    def test_step(self, data):
        x, y = data
        
        # Student predictions
        student_predictions = self.student(x, training=False)
        
        # Hard loss
        student_loss = keras.losses.categorical_crossentropy(y, student_predictions)
        
        # Update metrics
        self.student_loss_tracker.update_state(student_loss)
        
        return {"student_loss": self.student_loss_tracker.result()}
# Create teacher (larger model, assume already trained)
teacher = model  # Use our trained model as teacher
# Create student (smaller model)


### 📝 Function: build_small_student

**Purpose:** Define the smaller student model, train it with the distiller, and export it for TensorFlow Lite Micro (Parts 5-6)

In [None]:
def build_small_student(input_shape=(32, 32, 3), num_classes=10):
    """
    Build smaller student model (half the channels of teacher)
    """
    inputs = layers.Input(shape=input_shape)
    
    x = layers.Conv2D(16, 3, padding='same', activation='relu')(inputs)
    x = layers.BatchNormalization()(x)
    
    x = layers.DepthwiseConv2D(3, padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(32, 1, padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.MaxPooling2D(2)(x)
    
    x = layers.DepthwiseConv2D(3, padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(64, 1, padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.MaxPooling2D(2)(x)
    
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    
    model = keras.Model(inputs, outputs, name='student_cnn')
    return model
student = build_small_student()
student_params = student.count_params()
teacher_params = teacher.count_params()
print(f"Teacher parameters: {teacher_params:,}")
print(f"Student parameters: {student_params:,}")
print(f"Compression: {teacher_params / student_params:.2f}× fewer parameters")
# Create distiller
distiller = Distiller(student=student, teacher=teacher, temperature=5.0, alpha=0.1)
distiller.compile(
    optimizer='adam',
    metrics=['accuracy']
)
# Train student with distillation (quick demo)
print("\nTraining student with knowledge distillation...")
distiller.fit(
    x_train[:5000], y_train[:5000],
    batch_size=128,
    epochs=3,
    validation_split=0.2,
    verbose=1
)
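# Evaluate student (compile first: evaluate() needs a loss and metrics on the student itself)
student.compile(loss='categorical_crossentropy', metrics=['accuracy'])
student_loss, student_acc = student.evaluate(x_test, y_test, verbose=0)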
print(f"\nTeacher accuracy: {test_acc:.4f}")
print(f"Student accuracy (with distillation): {student_acc:.4f}")
print(f"Accuracy gap: {(test_acc - student_acc) * 100:.2f}%")
# ===========================
# Part 6: TensorFlow Lite Micro (Arduino Deployment)
# ===========================
print("\n" + "="*60)
print("Part 6: TensorFlow Lite Micro (TinyML)")
print("="*60)
# Convert to TFLite Micro-compatible format
converter = tf.lite.TFLiteConverter.from_keras_model(student)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
converter.representative_dataset = representative_dataset
tflite_micro_model = converter.convert()
# Save
with open('/tmp/student_tflite_micro.tflite', 'wb') as f:
    f.write(tflite_micro_model)
micro_size = len(tflite_micro_model)
print(f"TFLite Micro model size: {micro_size / 1024:.2f} KB")
# Check if fits on Arduino Nano 33 BLE (256KB RAM)
# Note: on the device the model array is stored in flash (1MB) and RAM holds the tensor arena,
# so comparing the model size against RAM is a conservative check
arduino_ram = 256 * 1024
if micro_size < arduino_ram:
    print(f"✅ Fits on Arduino Nano 33 BLE (256KB RAM)")
    print(f"   Remaining RAM: {(arduino_ram - micro_size) / 1024:.2f} KB")
else:
    print(f"❌ Too large for Arduino Nano 33 BLE")
    print(f"   Need to reduce by: {(micro_size - arduino_ram) / 1024:.2f} KB")
# Generate C array for Arduino
print("\nGenerating C array for Arduino deployment...")
with open('/tmp/model.h', 'w') as f:
    f.write("// Generated model for Arduino deployment\n")
    f.write("// Include this file in your Arduino sketch\n\n")
    f.write("#ifndef MODEL_H\n")
    f.write("#define MODEL_H\n\n")
    f.write(f"const unsigned int model_size = {micro_size};\n")
    f.write("// 16-byte alignment: TFLite Micro reads the flatbuffer in place\n")
    f.write("alignas(16) const unsigned char model_data[] = {\n")
    
    # Convert bytes to C array
    for i, byte in enumerate(tflite_micro_model):
        if i % 12 == 0:
            f.write("  ")
        f.write(f"0x{byte:02x}")
        if i < len(tflite_micro_model) - 1:
            f.write(", ")
        if (i + 1) % 12 == 0:
            f.write("\n")
    
    f.write("\n};\n\n")
    f.write("#endif  // MODEL_H\n")
print("Saved Arduino header to /tmp/model.h")
# Generate Arduino sketch template
arduino_sketch = """
// TensorFlow Lite Micro Arduino Example
// Deploy quantized model on microcontroller
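#include <TensorFlowLite.h>
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"
#include "model.h"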
// TFLite globals
namespace {
  const tflite::Model* model = nullptr;
  tflite::MicroInterpreter* interpreter = nullptr;
  TfLiteTensor* input = nullptr;
  TfLiteTensor* output = nullptr;
  
  // Tensor arena (adjust size based on model)
  constexpr int kTensorArenaSize = 60 * 1024;  // 60KB
  uint8_t tensor_arena[kTensorArenaSize];
}
void setup() {
  Serial.begin(115200);
  while (!Serial) {}
  
  // Load model
  model = tflite::GetModel(model_data);
  if (model->version() != TFLITE_SCHEMA_VERSION) {
    Serial.println("Model schema mismatch!");
    return;
  }
  
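  // Set up interpreter: register every op the converted model uses
  // (verify this list against the ops in your own .tflite file)
  static tflite::MicroMutableOpResolver<8> resolver;
  resolver.AddQuantize();        // uint8 <-> int8 conversion at input/output
  resolver.AddConv2D();
  resolver.AddDepthwiseConv2D();
  resolver.AddMaxPool2D();
  resolver.AddMean();            // GlobalAveragePooling2D
  resolver.AddFullyConnected();
  resolver.AddReshape();
  resolver.AddSoftmax();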
  
  static tflite::MicroInterpreter static_interpreter(
      model, resolver, tensor_arena, kTensorArenaSize);
  interpreter = &static_interpreter;
  
  // Allocate tensors
  TfLiteStatus allocate_status = interpreter->AllocateTensors();
  if (allocate_status != kTfLiteOk) {
    Serial.println("AllocateTensors() failed!");
    return;
  }
  
  // Get input/output tensors
  input = interpreter->input(0);
  output = interpreter->output(0);
  
  Serial.println("Model loaded successfully!");
  Serial.print("Input shape: ");
  Serial.print(input->dims->data[1]);
  Serial.print("x");
  Serial.println(input->dims->data[2]);
}
void loop() {
  // Read sensor data (e.g., camera, microphone)
  // For demo, use dummy data
  for (int i = 0; i < input->bytes; i++) {
    input->data.uint8[i] = random(0, 256);
  }
  
  // Run inference
  unsigned long start = micros();
  TfLiteStatus invoke_status = interpreter->Invoke();
  unsigned long end = micros();
  
  if (invoke_status != kTfLiteOk) {
    Serial.println("Invoke failed!");
    return;
  }
  
  // Get results
  int max_idx = 0;
  uint8_t max_val = 0;
  for (int i = 0; i < 10; i++) {
    if (output->data.uint8[i] > max_val) {
      max_val = output->data.uint8[i];
      max_idx = i;
    }
  }
  
  Serial.print("Prediction: ");
  Serial.print(max_idx);
  Serial.print(", Confidence: ");
  Serial.print(max_val);
  Serial.print(", Latency: ");
  Serial.print(end - start);
  Serial.println(" us");
  
  delay(1000);  // Run once per second
}
"""
with open('/tmp/tflite_micro_arduino.ino', 'w') as f:
    f.write(arduino_sketch)
print("Saved Arduino sketch template to /tmp/tflite_micro_arduino.ino")


### 📝 Implementation Part 6

**Purpose:** Compare teacher and student models on size and accuracy, then summarize the results (Part 7)

In [None]:
# ===========================
# Part 7: Efficiency Analysis
# ===========================
print("\n" + "="*60)
print("Part 7: Efficiency Analysis")
print("="*60)
# Model comparison table
models = {
    'Teacher (float32)': {
        'params': teacher_params,
        'size_mb': teacher_params * 4 / 1e6,
        'accuracy': test_acc,
    },
    'Teacher (INT8)': {
        'params': teacher_params,
        'size_mb': int8_size / 1e6,
        'accuracy': int8_accuracy,
    },
    'Student (float32)': {
        'params': student_params,
        'size_mb': student_params * 4 / 1e6,
        'accuracy': student_acc,
    },
    'Student (INT8)': {
        'params': student_params,
        'size_mb': micro_size / 1e6,
        'accuracy': student_acc * 0.98,  # Estimate (slight loss from INT8)
    }
}
print(f"\n{'Model':<20} {'Params':>12} {'Size (MB)':>12} {'Accuracy':>12} {'Compression':>12}")
print("-" * 80)
baseline_size = models['Teacher (float32)']['size_mb']
baseline_acc = models['Teacher (float32)']['accuracy']
for name, metrics in models.items():
    compression = baseline_size / metrics['size_mb']
    acc_loss = (baseline_acc - metrics['accuracy']) * 100
    
    print(f"{name:<20} {metrics['params']:>12,} {metrics['size_mb']:>12.2f} "
          f"{metrics['accuracy']:>12.4f} {compression:>11.1f}×")
# ===========================
# Summary
# ===========================
print("\n" + "="*60)
print("SUMMARY: Edge AI Model Optimization")
print("="*60)
print(f"""
✅ Quantization (INT8):
   - Size: {baseline_size:.2f}MB → {int8_size/1e6:.2f}MB ({baseline_size/(int8_size/1e6):.1f}× smaller)
   - Accuracy loss: {(test_acc - int8_accuracy)*100:.2f}%
   - Speedup: 2-4× (on hardware with INT8 support)
✅ Knowledge Distillation:
   - Size: {teacher_params:,} → {student_params:,} params ({teacher_params/student_params:.1f}× fewer)
   - Accuracy: {test_acc:.4f} → {student_acc:.4f} ({(test_acc-student_acc)*100:.2f}% loss)
   - Model size: {baseline_size:.2f}MB → {student_params*4/1e6:.2f}MB ({baseline_size/(student_params*4/1e6):.1f}× smaller)
✅ Combined (Distillation + Quantization):
   - Total compression: {baseline_size/(micro_size/1e6):.1f}× smaller
   - Final size: {micro_size/1024:.2f}KB
   - Fits on microcontroller: {'Yes ✅' if micro_size < arduino_ram else 'No ❌'}
✅ Deployment Ready:
   - TensorFlow Lite model: /tmp/mobile_model_int8.tflite
   - TFLite Micro model: /tmp/student_tflite_micro.tflite
   - Arduino header: /tmp/model.h
   - Arduino sketch: /tmp/tflite_micro_arduino.ino
""")
print("\n" + "="*60)
print("Next: Deploy to production edge devices!")
print("Part 8: Real-world projects (smart home, manufacturing, wearables)")
print("="*60)


# 🚀 Production Projects & Deployment

---

## 📋 Overview

This section presents **8 production-grade Edge AI & TinyML projects** with deployment targets and business value estimates. Projects 1-3 include implementation code and week-by-week roadmaps; Projects 4-8 are summarized at the device and model level.

**Total estimated business value**: $100M-$295M/year (the sum of the per-project ranges below)

---

# 🎯 Project 1: Smart Home Voice Assistant (Keyword Spotting)

## Business Objective
Deploy wake word detection on a microcontroller for a privacy-preserving, always-on voice assistant

**Problem**: Cloud-based assistants stream audio off the device, which raises privacy concerns, and add 300-500ms of round-trip latency

**Edge Solution**: On-device keyword spotting (10ms latency; audio never leaves the device)

## Technical Implementation

### Architecture
```
Microphone → MFCC Feature Extraction → TinyML Model (50KB) → Wake Word Detection
```

**Model**: 1D CNN for audio classification
- Input: 40 MFCC features × 49 frames = 1,960 features
- Architecture: Conv1D (32) → Conv1D (64) → Dense (128) → Softmax (12 classes)
- Size: 50KB (INT8 quantized)
- Latency: 10ms on ARM Cortex-M4
- Power: 5mW (battery lasts 6 months)

### Week-by-Week Roadmap

**Week 1-2: Data Collection & Preparation**
```python
# Collect wake word dataset
# - 10 keywords: "Hey Assistant", "OK Computer", "Wake Up", etc.
# - 2000 samples per keyword
# - Background noise samples (TV, music, traffic)

import librosa
import numpy as np  # Needed for np.pad below

def extract_mfcc(audio_file, n_mfcc=40, n_frames=49):
    """Extract MFCC features from audio"""
    y, sr = librosa.load(audio_file, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    
    # Pad or truncate to fixed length
    if mfcc.shape[1] < n_frames:
        mfcc = np.pad(mfcc, ((0, 0), (0, n_frames - mfcc.shape[1])))
    else:
        mfcc = mfcc[:, :n_frames]
    
    return mfcc.T  # Shape: (49, 40)
```
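
The 49-frame input length corresponds to roughly one second of audio under a framing that is common in keyword spotting. The sketch below works through that arithmetic. The 30 ms window and 20 ms stride are assumptions (the text does not state the framing), and `num_frames` is an illustrative helper, not a library function. It also shows that `librosa.feature.mfcc` with its default 512-sample hop and centered frames yields 32 frames for a one-second clip at 16 kHz, which `extract_mfcc` above would then zero-pad to 49.

```python
def num_frames(n_samples, window, hop, centered=False):
    """Number of analysis frames for a signal of n_samples samples."""
    if centered:
        # Centered framing pads the signal, so every hop position gets a frame.
        return 1 + n_samples // hop
    # Uncentered framing keeps only windows that fit entirely inside the signal.
    return 1 + (n_samples - window) // hop

SAMPLE_RATE = 16_000                 # Hz, as in extract_mfcc
one_second = SAMPLE_RATE             # samples in a 1 s clip

# Assumed framing: 30 ms window, 20 ms stride (integer math avoids float rounding)
window = 30 * SAMPLE_RATE // 1000    # 480 samples
hop = 20 * SAMPLE_RATE // 1000       # 320 samples
frames_kws = num_frames(one_second, window, hop)

# librosa defaults: 2048-sample FFT window, 512-sample hop, centered frames
frames_librosa_default = num_frames(one_second, 2048, 512, centered=True)

print("frames (30 ms / 20 ms framing):", frames_kws)
print("total MFCC features:", frames_kws * 40)
print("frames (librosa defaults):", frames_librosa_default)
```

If the training pipeline should see real audio in all 49 frames, passing an explicit `hop_length` to `librosa.feature.mfcc` is one way to control the frame count.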

**Week 3-4: Model Training**
```python
def build_keyword_spotting_model():
    """Build 1D CNN for keyword spotting"""
    inputs = layers.Input(shape=(49, 40))
    
    x = layers.Conv1D(32, 3, activation='relu')(inputs)
    x = layers.MaxPooling1D(2)(x)
    x = layers.Dropout(0.2)(x)
    
    x = layers.Conv1D(64, 3, activation='relu')(x)
    x = layers.MaxPooling1D(2)(x)
    x = layers.Dropout(0.2)(x)
    
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dense(128, activation='relu')(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(12, activation='softmax')(x)  # 10 keywords + silence + unknown
    
    return keras.Model(inputs, outputs)

# Train with data augmentation (time shift, pitch shift, noise addition)
# Target: 95%+ accuracy
```
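
To sanity-check whether a model of this shape can fit a microcontroller, it helps to count parameters by hand. The sketch below does so for the layers in `build_keyword_spotting_model`; the helper names are illustrative. It counts weights and biases only, so it is a lower bound on storage: a serialized `.tflite` file also carries graph metadata and quantization parameters, and the actual file size should be measured after conversion.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Conv1D weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

def dense_params(in_features, units):
    """Dense weights plus one bias per unit."""
    return in_features * units + units

# Layer shapes taken from build_keyword_spotting_model (pooling and dropout have no weights)
layer_params = {
    "conv1d_32": conv1d_params(3, 40, 32),
    "conv1d_64": conv1d_params(3, 32, 64),
    "dense_128": dense_params(64, 128),   # input is the 64-channel global average
    "dense_12": dense_params(128, 12),
}
total_params = sum(layer_params.values())

float32_kb = total_params * 4 / 1024   # 4 bytes per parameter
int8_kb = total_params * 1 / 1024      # 1 byte per parameter

for name, count in layer_params.items():
    print(f"{name:<10} {count:>7,}")
print(f"{'total':<10} {total_params:>7,}")
print(f"parameters only: {float32_kb:.1f} KB float32, {int8_kb:.1f} KB INT8")
```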

**Week 5-6: Quantization & Optimization**
```python
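# Full-integer (INT8) post-training quantization
def representative_dataset():
    # calibration_mfccs (hypothetical name): a few hundred training samples, each of shape (49, 40)
    for sample in calibration_mfccs:
        yield [sample[np.newaxis, ...].astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset  # Needed to calibrate INT8 ranges
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_model = converter.convert()

# Result: 200KB → 50KB (4× compression)
# Accuracy: 95.2% → 94.8% (0.4% loss, acceptable)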
```
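
What INT8 quantization does to each value can be shown without TensorFlow. The following is a minimal sketch of affine quantization, where a real value is approximated as `(q - zero_point) * scale`. The function names and sample values are illustrative assumptions; production converters also handle per-channel scales and calibration, which this sketch leaves out.

```python
def quant_params(x_min, x_max, q_min=-128, q_max=127):
    """Scale and zero point that map [x_min, x_max] onto [q_min, q_max]."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (q_max - q_min)
    zero_point = round(q_min - x_min / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, q_min=-128, q_max=127):
    """Map a float to the nearest int8 code, clamped to the valid range."""
    q = round(x / scale) + zero_point
    return max(q_min, min(q_max, q))

def dequantize(q, scale, zero_point):
    """Recover the approximate float value from an int8 code."""
    return (q - zero_point) * scale

values = [-0.5, 0.0, 0.25, 1.5]   # illustrative activations
scale, zero_point = quant_params(min(values), max(values))
quantized = [quantize(v, scale, zero_point) for v in values]
restored = [dequantize(q, scale, zero_point) for q in quantized]
max_error = max(abs(v - r) for v, r in zip(values, restored))

print("int8 codes:", quantized)
print(f"scale={scale:.5f}, zero_point={zero_point}, max error={max_error:.5f}")
```

For values inside the calibrated range, the round-trip error stays within half a quantization step (`scale / 2`), which is why a representative calibration set matters: values outside the range are clamped and lose more.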

**Week 7-8: Deploy to Arduino Nano 33 BLE**
```cpp
// Arduino deployment code
#include <TensorFlowLite.h>
#include <PDM.h>  // Microphone library

// Inference every 1 second
void loop() {
  // Read audio from microphone
  PDM.read(audio_buffer, BUFFER_SIZE);
  
  // Extract MFCC
  extract_mfcc(audio_buffer, mfcc_features);
  
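  // Run inference (10ms); skip this pass if it fails
  if (interpreter->Invoke() != kTfLiteOk) {
    return;
  }
  
  // Check if wake word detected.
  // Only the 10 keyword classes are scanned; silence and unknown are assumed to be
  // the last 2 of the 12 outputs. `output` is assumed to hold dequantized probabilities.
  float max_prob = 0;
  int max_idx = 0;
  for (int i = 0; i < 10; i++) {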
    if (output[i] > max_prob) {
      max_prob = output[i];
      max_idx = i;
    }
  }
  
  if (max_prob > 0.8) {  // Confidence threshold
    Serial.println("Wake word detected!");
    // Trigger full assistant (send to cloud for complex queries)
  }
}
```

## Business Value: $15M-$40M/year

**Cost Savings**:
- **API fees**: $60M/year → $5M/year = **$55M saved**
- **Bandwidth**: 500GB/day → 5GB/day = $16K/year saved

**Privacy Value**:
- GDPR compliant (no audio upload)
- User trust +12 NPS points
- Market differentiation

**Conservative estimate**: **$15M-$40M/year** (10M users)

---

# 🎯 Project 2: Manufacturing Defect Detection (Edge Vision)

## Business Objective
Real-time visual inspection on a production line using an edge AI camera

**Problem**: Cloud inference latency of 200ms limits throughput to 60 items/min

**Edge Solution**: On-device inference in 10ms → 600 items/min (10× throughput)

## Technical Implementation

### Hardware
- **NVIDIA Jetson Nano**: $99, 472 GFLOPS, 4GB RAM
- **Industrial camera**: 1920×1080, 60 FPS
- **Deployment**: 100 production lines

### Model Architecture
```python
# MobileNetV2 + Custom defect classifier
base = keras.applications.MobileNetV2(
    input_shape=(224, 224, 3),
    include_top=False,
    weights='imagenet'
)

# Freeze base
base.trainable = False

# Add defect classification head
x = base.output
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dense(256, activation='relu')(x)
x = layers.Dropout(0.4)(x)
outputs = layers.Dense(5, activation='softmax')(x)  # 5 defect types

model = keras.Model(base.input, outputs)

# Train on factory data
# Target: 98%+ defect detection rate
```

### TensorRT Optimization
```python
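# Convert to TensorRT for maximum speed
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Build TensorRT engine (the ONNX parser needs an explicit-batch network)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, TRT_LOGGER)

# Load ONNX model and stop on parse errors
with open('defect_detector.onnx', 'rb') as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse defect_detector.onnx")

# Build with INT8 precision
# EntropyCalibrator: user-defined subclass of trt.IInt8EntropyCalibrator2 (not shown)
# Check builder.platform_has_fast_int8 on the target first; GPUs without fast INT8
# support can use trt.BuilderFlag.FP16 instead
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
config.int8_calibrator = EntropyCalibrator(calibration_data)

# Build engine (TensorRT 8+; earlier releases used builder.build_engine)
serialized_engine = builder.build_serialized_network(network, config)

# Result: 50ms → 10ms inference (5× speedup)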
```

### Week-by-Week Roadmap

**Week 1-2**: Data collection (10K defect images per line)

**Week 3-4**: Train MobileNetV2 classifier (98% accuracy)

**Week 5-6**: Optimize with TensorRT INT8 (10ms inference)

**Week 7-8**: Deploy to 100 lines, validate throughput

## Business Value: $20M-$60M/year

**Throughput Increase**:
- 60 items/min → 600 items/min (10× faster)
- Revenue: $10M/year per line × 10× = **$100M/year** (or avoid $100M capex for 10× more lines)

**Cost Savings**:
- API fees: $3.6M/year ‚Üí $0
- Bandwidth: $16K/year ‚Üí $0

**Per Factory**: $5M-$15M/year  
**Total (4 factories)**: **$20M-$60M/year**

---

# 🎯 Project 3: Wearable Health Monitoring (Arrhythmia Detection)

## Business Objective
Real-time ECG analysis on a smartwatch for arrhythmia detection

**Problem**: Classifying every heartbeat in the cloud (about 100 beats/min per user) would cost roughly $14.4B/day across 10M users at $0.01 per inference, which is not viable

**Edge Solution**: On-device ECG analysis, only upload anomalies
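
The $14.4B/day figure can be reproduced with a few lines of arithmetic. The sketch below assumes the 10M-user base used in this project's business case and a price of $0.01 per cloud inference; both are assumptions made here to match the stated total, and the constant names are illustrative.

```python
BEATS_PER_MIN = 100               # from the problem statement
MINUTES_PER_DAY = 24 * 60
USERS = 10_000_000                # assumed: 10M users, as in the business case
COST_PER_INFERENCE_USD = 0.01     # assumed: price of one cloud inference call

# One inference per heartbeat, around the clock
inferences_per_user_per_day = BEATS_PER_MIN * MINUTES_PER_DAY
daily_inferences = inferences_per_user_per_day * USERS
daily_cost_usd = daily_inferences * COST_PER_INFERENCE_USD

print(f"inferences per user per day: {inferences_per_user_per_day:,}")
print(f"fleet-wide inferences per day: {daily_inferences:,}")
print(f"cloud cost: ${daily_cost_usd / 1e9:.1f}B per day")
```

Even at a price 100 times lower, the daily bill would remain in the hundreds of millions of dollars, which is why per-beat cloud inference is ruled out.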

## Technical Implementation

### Model Architecture
```python
# 1D CNN for ECG classification
def build_ecg_model():
    inputs = layers.Input(shape=(300, 1))  # 300 samples, 1 channel
    
    x = layers.Conv1D(32, 5, activation='relu')(inputs)
    x = layers.MaxPooling1D(2)(x)
    
    x = layers.Conv1D(64, 5, activation='relu')(x)
    x = layers.MaxPooling1D(2)(x)
    
    x = layers.Conv1D(128, 5, activation='relu')(x)
    x = layers.GlobalAveragePooling1D()(x)
    
    x = layers.Dense(64, activation='relu')(x)
    outputs = layers.Dense(5, activation='softmax')(x)  # Normal, AFib, VTach, VFib, PVC
    
    return keras.Model(inputs, outputs)

# Optimize for ARM Cortex-M4 (smartwatch)
# Target: 50KB model, 5ms inference, <1mW power
```

### Power Optimization
```python
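# Duty cycling for battery life (pseudocode: read_ecg_sensor, is_anomaly, and
# send_alert_to_cloud are placeholders for device-specific code)
import time

def run_ecg_monitoring():
    while True:
        # Read ECG (10ms)
        ecg_data = read_ecg_sensor()
        
        # Inference (5ms)
        prediction = model.predict(ecg_data)
        
        # Check anomaly
        if is_anomaly(prediction):
            send_alert_to_cloud()
        
        # Sleep for the remaining 985ms (15ms active per second = 1.5% duty cycle)
        time.sleep(0.985)

# Power consumption:
# - Active (15ms): 10mW
# - Sleep (985ms): 0.1mW
# - Average: (10 * 15 + 0.1 * 985) / 1000 = 0.25mW
# - Energy per week: 0.25mW × 168h = 42mWh, about 6% of a 200mAh battery
#   (740mWh, assuming a 3.7V cell), so a 7-day battery life stays within reach ✅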
```
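
The duty-cycle arithmetic can be checked with a short calculation. This is a sketch under stated assumptions: the 3.7 V nominal cell voltage is not given in the text, the function names are illustrative, and the runtime treats ECG inference as the only load on the battery, which is not true of a real smartwatch (display, radio, and sensors draw far more).

```python
def average_power_mw(active_mw, active_ms, sleep_mw, period_ms):
    """Time-weighted average power over one duty-cycle period."""
    sleep_ms = period_ms - active_ms
    return (active_mw * active_ms + sleep_mw * sleep_ms) / period_ms

def runtime_hours(capacity_mah, voltage_v, load_mw):
    """Hours a battery would last if this load were the only consumer."""
    energy_mwh = capacity_mah * voltage_v   # mAh x V = mWh
    return energy_mwh / load_mw

# 15 ms active (10 ms read + 5 ms inference) out of every 1000 ms
avg_mw = average_power_mw(active_mw=10, active_ms=15, sleep_mw=0.1, period_ms=1000)
hours = runtime_hours(capacity_mah=200, voltage_v=3.7, load_mw=avg_mw)  # 3.7 V assumed
weekly_share = avg_mw * 24 * 7 / (200 * 3.7)

print(f"duty cycle: {15 / 1000:.1%}")
print(f"average power: {avg_mw:.4f} mW")
print(f"inference-only runtime: {hours:.0f} h ({hours / 24:.0f} days)")
print(f"share of battery used per week: {weekly_share:.1%}")
```

Note that battery capacity in mAh has to be converted to energy (mWh) before dividing by power in mW; dividing mAh by mW directly mixes units.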

### Week-by-Week Roadmap

**Week 1-2**: Collect ECG dataset (MIT-BIH, PTB-XL)

**Week 3-4**: Train 1D CNN (98% accuracy)

**Week 5-6**: Quantize to INT8, optimize for ARM

**Week 7-8**: Deploy to smartwatch, validate battery life

## Business Value: $15M-$50M/year

**Feature Enablement**:
- Cloud inference impossible ($14B/day cost)
- Edge AI enables real-time monitoring ($0 cost)

**Market Value**:
- 10M users × $5/month subscription = **$600M/year**
- Conservative attribution to edge AI: 2.5-8% = **$15M-$50M/year**

---

# 🎯 Project 4: Autonomous Vehicle (Perception)

## Business Objective
Real-time object detection for autonomous driving

**Latency requirement**: <10ms (safety-critical)

## Technical Implementation

### Model: YOLOv5-Nano + TensorRT
```python
# Optimized for NVIDIA Jetson Xavier NX
# Input: 640×640×3 camera image
# Output: Bounding boxes (cars, pedestrians, cyclists, traffic signs)
# Latency: 8ms @ 30 FPS
# Power: 15W

# INT8 quantization + TensorRT optimization
# Accuracy: 88% mAP (vs 90% float32, acceptable for redundancy)
```

## Business Value: $10M-$30M/year

**Bandwidth Savings**: $900K/year (no image upload)

**Safety Value**: Priceless (enables autonomous driving)

**Time-to-Market**: 50× faster model iteration

**Conservative**: **$10M-$30M/year**

---

# 🎯 Project 5: Smart Agriculture (Crop Disease Detection)

## Business Objective
On-device plant disease detection for farmers

## Technical Implementation
- **Device**: Smartphone app with TensorFlow Lite
- **Model**: MobileNetV2 (5MB, 100ms inference)
- **Dataset**: PlantVillage (38 disease classes)

## Business Value: $5M-$15M/year

**Farmer Value**: Early detection saves 20-30% crop loss

**App Revenue**: 1M farmers × $5/month = **$60M/year**

**Conservative attribution**: 8-25% = **$5M-$15M/year**

---

# 🎯 Project 6: Industrial IoT (Predictive Maintenance)

## Business Objective
On-sensor anomaly detection for industrial equipment

## Technical Implementation
- **Device**: ESP32 microcontroller on each machine
- **Model**: Autoencoder (50KB, 10ms inference)
- **Sensors**: Vibration, temperature, pressure

## Business Value: $10M-$30M/year

**Bandwidth Savings**: 10,000 sensors × 1MB/day → 10KB/day = $500K/year

**Downtime Reduction**: 30% improvement = **$30M/year**

**Conservative**: **$10M-$30M/year**

---

# 🎯 Project 7: Retail (Cashierless Checkout)

## Business Objective
Real-time product recognition for automated checkout

## Technical Implementation
- **Device**: Edge camera + Jetson Nano
- **Model**: EfficientNet-Lite (3MB, 20ms)
- **Products**: 500 SKUs

## Business Value: $5M-$20M/year

**Labor Savings**: 10 stores √ó $200K/year = **$2M/year**

**Customer Experience**: Faster checkout, +10% sales = **$5M/year**

**Total**: **$5M-$20M/year**

---

# 🎯 Project 8: Semiconductor (Wafer Defect Detection)

## Business Objective
Real-time wafer defect detection during fabrication

## Technical Implementation
- **Device**: Custom ASIC vision processor
- **Model**: Custom CNN (1MB, 5ms)
- **Resolution**: 10,000×10,000 pixels (full wafer scan)

## Business Value: $20M-$50M/year

**Yield Improvement**: 1% yield increase = **$50M/year** (300mm fab)

**Throughput**: Real-time vs batch = 2× faster

**Conservative**: **$20M-$50M/year**

---

# 📊 Business Value Summary

| Project | Annual Value | Key Metric | Device |
|---------|--------------|------------|--------|
| 1. Smart Home Voice | $15M-$40M | $55M cost savings | Arduino Nano 33 BLE |
| 2. Manufacturing Defect | $20M-$60M | 10× throughput | Jetson Nano |
| 3. Wearable Health | $15M-$50M | 7-day battery | ARM Cortex-M4 |
| 4. Autonomous Vehicle | $10M-$30M | <10ms latency | Jetson Xavier NX |
| 5. Smart Agriculture | $5M-$15M | 20-30% crop loss reduction | Smartphone |
| 6. Industrial IoT | $10M-$30M | 30% downtime reduction | ESP32 |
| 7. Retail Checkout | $5M-$20M | Labor + sales increase | Jetson Nano |
| 8. Semiconductor Inspection | $20M-$50M | 1% yield improvement | Custom ASIC |
| **Total** | **$100M-$295M** | Latency + Privacy + Cost | Edge/TinyML |

**Approximate midpoint**: **~$200M/year** across all edge AI applications
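
The totals row can be verified by summing the per-project ranges. The sketch below copies the low and high estimates (in $M/year) from the table; the dictionary and variable names are illustrative.

```python
# (low, high) annual value estimates in $M, copied from the summary table
project_values = {
    "Smart Home Voice": (15, 40),
    "Manufacturing Defect": (20, 60),
    "Wearable Health": (15, 50),
    "Autonomous Vehicle": (10, 30),
    "Smart Agriculture": (5, 15),
    "Industrial IoT": (10, 30),
    "Retail Checkout": (5, 20),
    "Semiconductor Inspection": (20, 50),
}

total_low = sum(low for low, _ in project_values.values())
total_high = sum(high for _, high in project_values.values())
midpoint = (total_low + total_high) / 2

print(f"total range: ${total_low}M-${total_high}M per year")
print(f"midpoint: ${midpoint}M per year")
```

The ranges are independent estimates for different businesses, so the sum is an upper-level illustration rather than a value any single organization would realize.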

---

# 🔧 Deployment Framework Comparison

## TensorFlow Lite (Mobile/Edge)
**Best for**: Android, iOS, Raspberry Pi  
**Model size**: 1MB-100MB  
**Latency**: 10-100ms  
**Deployment**: Simple (convert a trained TensorFlow model with `tf.lite.TFLiteConverter`)

## TensorFlow Lite Micro (TinyML)
**Best for**: Microcontrollers (Arduino, ESP32, STM32)  
**Model size**: 10KB-500KB  
**Latency**: 1-50ms  
**Power**: <10mW  
**Deployment**: C++ library, compile with Arduino IDE

## TensorRT (NVIDIA)
**Best for**: Jetson, autonomous vehicles, data centers  
**Model size**: 1MB-10GB  
**Latency**: 1-10ms  
**Optimization**: INT8, FP16, custom kernels  
**Speedup**: 5-10× faster than TensorFlow Lite

## Core ML (Apple)
**Best for**: iPhone, iPad, Mac  
**Model size**: 1MB-100MB  
**Latency**: 5-50ms  
**Integration**: Native iOS/macOS APIs  
**Hardware**: Neural Engine (15.8 TOPS on A15 Bionic)

---

# 🎓 Key Takeaways

## When to Use Edge AI

✅ **Use Edge AI when**:
1. **Latency critical**: <50ms required
2. **Privacy sensitive**: Data cannot leave device
3. **High volume**: >1M inferences/day (cloud expensive)
4. **Offline capability**: No internet connection
5. **Bandwidth constrained**: Cannot upload large data

❌ **Don't use Edge AI when**:
1. **Complex models**: >1GB models (GPT-4, DALL-E)
2. **Low volume**: <1K inferences/day
3. **Continuous learning**: Model updates daily
4. **Heterogeneous devices**: Many device types
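
These rules of thumb can be turned into a small screening function. The sketch below is one possible encoding, not a definitive policy: the function name and parameters are hypothetical, the thresholds are the ones listed above, and only the criteria that map cleanly to numbers or flags are covered.

```python
def edge_ai_recommendation(latency_budget_ms, data_must_stay_on_device,
                           inferences_per_day, needs_offline, model_size_mb):
    """Return (use_edge, reasons) based on the rules of thumb above."""
    reasons_for = []
    if latency_budget_ms < 50:
        reasons_for.append("latency critical")
    if data_must_stay_on_device:
        reasons_for.append("privacy sensitive")
    if inferences_per_day > 1_000_000:
        reasons_for.append("high volume")
    if needs_offline:
        reasons_for.append("offline capability")

    reasons_against = []
    if model_size_mb > 1024:
        reasons_against.append("model too large for edge hardware")
    if inferences_per_day < 1_000:
        reasons_against.append("low volume")

    # Recommend edge only if something argues for it and nothing argues against it
    use_edge = bool(reasons_for) and not reasons_against
    return use_edge, reasons_for + reasons_against

# Wake word detector: tiny model, always on, audio must stay local
wake_word = edge_ai_recommendation(10, True, 10_000_000, True, 0.05)

# Large generative model queried a few hundred times a day
large_model = edge_ai_recommendation(2000, False, 500, False, 4096)

print(wake_word)
print(large_model)
```

Borderline cases, such as a privacy-sensitive workload with low volume, still need judgment; a hybrid edge plus cloud design is often the practical answer there.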

## Optimization Strategy

**Step 1**: Train large model on cloud (maximize accuracy)  
**Step 2**: Knowledge distillation (10-100× compression, 2-5% accuracy loss)  
**Step 3**: INT8 quantization (4× compression, 0.5-2% accuracy loss)  
**Step 4**: Pruning (2-5× compression, 1-3% accuracy loss)  
**Step 5**: Deploy to target device  
**Result**: 50-500× total compression, 3-10% accuracy loss
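
Compression factors multiply when the stages are applied in sequence. The sketch below computes the ideal product of the per-stage ranges and the resulting size of a hypothetical 100 MB model (the 100 MB starting point and the names are assumptions for illustration). The ideal product is larger than the 50-500× quoted above, presumably because the stages overlap in practice: a distilled student, for example, has less redundancy left to prune.

```python
# Per-stage (low, high) compression factors from the steps above
stages = {
    "distillation": (10, 100),
    "int8_quantization": (4, 4),
    "pruning": (2, 5),
}

def compound(factors):
    """Multiply per-stage factors, assuming each stage's savings are independent."""
    total = 1
    for factor in factors:
        total *= factor
    return total

ideal_low = compound(low for low, _ in stages.values())
ideal_high = compound(high for _, high in stages.values())

def final_size_kb(original_mb, compression):
    """Size in KB after compressing a model of original_mb megabytes."""
    return original_mb * 1024 / compression

print(f"ideal combined compression: {ideal_low}x to {ideal_high}x")
for factor in (50, 500):
    print(f"100 MB model at {factor}x: {final_size_kb(100, factor):.1f} KB")
```

Under these assumptions, a 100 MB model needs roughly 500× compression before it approaches the few-hundred-kilobyte budget of a microcontroller.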

## Hardware Selection

| Use Case | Device | Cost | Power | Latency |
|----------|--------|------|-------|---------|
| **Always-on wake word** | Arduino Nano 33 BLE | $25 | 5mW | 10ms |
| **Wearable health** | ARM Cortex-M4 | Built-in | <1mW | 5ms |
| **Smart home camera** | Raspberry Pi 4 | $55 | 3W | 50ms |
| **Manufacturing vision** | Jetson Nano | $99 | 10W | 10ms |
| **Autonomous vehicle** | Jetson Xavier NX | $399 | 15W | 5ms |
| **Mobile app** | Smartphone | User-owned | 5W | 20ms |

---

# 📚 Resources & Next Steps

## Frameworks
1. **TensorFlow Lite**: https://tensorflow.org/lite
2. **TensorFlow Lite Micro**: https://tensorflow.org/lite/microcontrollers
3. **TensorRT**: https://developer.nvidia.com/tensorrt
4. **Core ML**: https://developer.apple.com/machine-learning/core-ml/

## Hardware
1. **Arduino Nano 33 BLE**: $25, 256KB RAM
2. **ESP32**: $5, 520KB RAM
3. **Raspberry Pi 4**: $55, 4GB RAM
4. **NVIDIA Jetson Nano**: $99, 472 GFLOPS
5. **Google Coral Dev Board**: $150, 4 TOPS

## Courses
1. **TinyML (edX)**: Harvard CS249r
2. **Edge AI (Coursera)**: TensorFlow Lite deployment
3. **NVIDIA Deep Learning Institute**: Jetson AI courses

## Papers
1. **MobileNets** (Howard et al., 2017): Efficient mobile architectures
2. **EfficientNet** (Tan & Le, 2019): State-of-the-art efficiency
3. **TinyML** (Banbury et al., 2020): Machine learning on microcontrollers

---

# ✅ Success Criteria Checklist

Before deploying edge AI, verify:

- [ ] **Latency requirement**: <50ms achieved
- [ ] **Model size**: Fits on target device (RAM + Flash)
- [ ] **Accuracy**: Within 3-5% of cloud model
- [ ] **Power consumption**: Meets battery life goals (<10mW for wearables)
- [ ] **Quantization**: INT8 working (4× compression)
- [ ] **Deployment tested**: Model runs on actual hardware
- [ ] **Business value**: ROI quantified ($XM-$YM/year)
- [ ] **Fallback strategy**: Hybrid edge+cloud for complex cases

---

# 🎯 Conclusion

**Edge AI enables an estimated $100M-$295M/year in business value**:
- **Smart home**: $15M-$40M (privacy + cost savings)
- **Manufacturing**: $20M-$60M (10× throughput)
- **Wearables**: $15M-$50M (7-day battery, feature enablement)
- **Autonomous vehicles**: $10M-$30M (safety + bandwidth)
- **Total**: **$100M-$295M/year** across 8 use cases

**Key techniques**:
1. **Quantization** (INT8): 4× smaller, 2-4× faster, <2% accuracy loss
2. **Knowledge distillation**: 10-100× smaller, 2-5% accuracy loss
3. **Pruning**: 2-5× smaller, 1-3% accuracy loss
4. **Combined**: 50-500× compression, 3-10% accuracy loss

**Deployment platforms**:
- **TensorFlow Lite**: Mobile/edge (1MB-100MB models)
- **TFLite Micro**: Microcontrollers (10KB-500KB models)
- **TensorRT**: NVIDIA Jetson (maximum speed)
- **Core ML**: Apple devices (native integration)

**Next steps**:
1. Choose use case (voice, vision, sensor)
2. Train baseline model on cloud
3. Optimize (quantization + distillation + pruning)
4. Deploy to target device (Arduino, Jetson, mobile)
5. Validate latency, accuracy, power consumption
6. Quantify business value ($XM-$YM/year)

**Remember**: Edge AI is essential for real-time, privacy-sensitive, high-volume applications. Start deploying today! 🚀📱🔧

---

**Learning Progression:**
- **Previous**: 069 Federated Learning (Privacy-Preserving Distributed ML)
- **Current**: 070 Edge AI & TinyML (On-Device Inference, Microcontrollers)
- **Next**: 071 Transformers & BERT (Self-Attention, Pre-training, Transfer Learning)

---

✅ **Notebook Complete! Ready for production edge AI deployment and an estimated $100M-$295M/year in business value.**