# Agentic AI


# Inference Time Compute
Inference time compute is the amount of computational resources (processing power, memory, and time) that a trained machine learning (ML) or deep learning model needs to produce predictions.

Breakdown of Inference Time Compute:
	1.	Inference – This is the process of using a trained model to make predictions on new data.
	2.	Compute – This includes the processing power (e.g., CPU, GPU, or specialized hardware like TPUs) needed to perform inference.

Key Factors Affecting Inference Time Compute:
	•	Model Size & Complexity – Larger models (e.g., deep neural networks) require more computations.
	•	Hardware Acceleration – GPUs and TPUs speed up inference compared to CPUs.
	•	Batch Size – Processing multiple inputs simultaneously can be more efficient.
	•	Quantization & Optimization – Techniques like model pruning and quantization reduce inference time.
	•	Latency vs. Throughput – Real-time applications optimize for low latency, bulk processing optimizes for high throughput, and the two goals often call for different batch sizes and hardware.

Example:

A large language model like GPT-4 requires significant inference time compute because of its very large parameter count (the exact figure has not been publicly disclosed), whereas a small decision tree needs only a handful of comparisons per prediction.
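
To make the contrast concrete, here is a rough back-of-the-envelope sketch. It assumes the common rule of thumb that a dense transformer spends about 2 FLOPs per parameter per token in a forward pass (attention cost is ignored), and that a decision tree costs one comparison per level. The function names and the 7-billion-parameter example are illustrative assumptions, not measurements of any specific model.

```python
def transformer_forward_flops(num_params: int, num_tokens: int) -> int:
    """Rule-of-thumb estimate: ~2 FLOPs per parameter per processed token."""
    return 2 * num_params * num_tokens


def decision_tree_comparisons(depth: int) -> int:
    """A single prediction walks one root-to-leaf path: one comparison per level."""
    return depth


# Hypothetical 7-billion-parameter model handling 200 tokens (prompt + output).
llm_cost = transformer_forward_flops(7_000_000_000, 200)
tree_cost = decision_tree_comparisons(10)
print(f"LLM: ~{llm_cost:.1e} FLOPs, tree: {tree_cost} comparisons")
```

The estimate is only an order-of-magnitude guide; real cost also depends on sequence length, caching, and hardware utilization.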

Optimizing inference time compute depends on the model type, hardware, and use case. Here are some key strategies to speed up inference while maintaining accuracy:

1. Model Optimization Techniques

a) Quantization
	•	Converts high-precision (e.g., 32-bit floating point) weights to lower precision (e.g., 8-bit integer).
	•	Reduces memory usage and speeds up computations.
	•	Works well for mobile and edge AI devices.
	•	Tools: TensorFlow Lite, ONNX Runtime, PyTorch quantization.
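
The idea behind quantization can be sketched in a few lines of plain Python. This is a minimal illustration of symmetric per-tensor INT8 quantization, not the scheme used by any of the tools listed above; the function names are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored value differs from the original by at most about scale / 2.
```

Each weight now fits in one byte instead of four, at the cost of a small rounding error bounded by half the scale.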

b) Pruning
	•	Removes redundant or insignificant weights in a neural network.
	•	Can speed up inference without significant accuracy loss, provided the hardware or runtime can exploit the resulting sparsity.
	•	Works well for compressed model deployment.
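
A minimal sketch of one common variant, unstructured magnitude pruning, is shown below. It assumes weights are held in a flat list and that "insignificant" means smallest absolute value; the function name and the `sparsity` parameter are illustrative.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitudes."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    num_to_drop = int(len(weights) * sparsity)
    # Indices sorted by absolute value; the first num_to_drop are pruned.
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(by_magnitude[:num_to_drop])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]


pruned = magnitude_prune([0.9, -0.05, 0.4, 0.01], sparsity=0.5)
print(pruned)  # [0.9, 0.0, 0.4, 0.0]
```

Zeroed weights only translate into faster inference when the runtime skips them (sparse kernels) or when whole channels or layers are removed (structured pruning).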

c) Knowledge Distillation
	•	Trains a smaller model (student) to mimic a larger one (teacher).
	•	Used in cases like transformer models (e.g., TinyBERT, DistilBERT).
	•	Reduces compute requirements with minimal performance drop.
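
The "mimic" step is usually expressed as a loss between the teacher's and the student's output distributions. The sketch below assumes the classic formulation (temperature-softened softmax, KL divergence scaled by the squared temperature); it is a simplified illustration rather than the exact recipe behind TinyBERT or DistilBERT.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature that softens the output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)
    return kl * temperature ** 2


teacher_logits = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]
far_student = [0.5, 4.0, 1.0]
# A student whose logits track the teacher's gets the lower loss.
print(distillation_loss(close_student, teacher_logits)
      < distillation_loss(far_student, teacher_logits))
```

In practice this term is typically combined with the ordinary supervised loss on the true labels.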

2. Hardware Optimization

a) Leverage Specialized Accelerators
	•	GPUs – Suitable for parallel computing and large deep learning models.
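	•	TPUs – Google's accelerators for ML training and inference (e.g., Google Cloud TPUs).
	•	FPGAs & ASICs – Reconfigurable or custom chips optimized for low-power inference (e.g., Google's Edge TPU, an ASIC). Embedded GPU modules such as Nvidia Jetson fill a similar role at the edge.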

b) Use Tensor Cores & Mixed Precision
	•	Nvidia Tensor Cores in RTX and A100 GPUs allow faster matrix multiplications.
	•	Mixed precision (e.g., FP16 or BF16 compute with FP32 accumulation) speeds up inference; INT8 is a quantization format that Tensor Cores can also accelerate.
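
Why mixed precision keeps a wider accumulator can be shown with a small simulation. The sketch below rounds values to half precision with the standard-library `struct` module (format code `e`) and compares a dot product accumulated in FP16 against one accumulated in a wider type. Python's built-in float stands in for FP32 here, which is an assumption of the demo, and the function names are illustrative.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def dot_fp16_accumulator(a, b):
    """Inputs and the running sum are all rounded to FP16."""
    total = 0.0
    for x, y in zip(a, b):
        total = to_fp16(total + to_fp16(x) * to_fp16(y))
    return total


def dot_wide_accumulator(a, b):
    """Inputs are FP16, but the running sum is kept in a wider type."""
    total = 0.0
    for x, y in zip(a, b):
        total += to_fp16(x) * to_fp16(y)
    return total


ones = [1.0] * 4096
narrow = dot_fp16_accumulator(ones, ones)
wide = dot_wide_accumulator(ones, ones)
print(narrow, wide)  # the FP16 sum stalls well below the exact answer of 4096
```

The narrow sum stalls because FP16 cannot represent every integer above 2048, so adding 1 eventually has no effect. Keeping the accumulator wide avoids that while the multiplications still run in low precision.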

3. Software & Algorithmic Improvements

a) Efficient Model Architectures
	•	Use lightweight models like MobileNet, EfficientNet, or YOLO for vision tasks.
	•	For NLP, ALBERT shares parameters across layers, which shrinks model size and memory use; distilled models such as DistilBERT reduce the compute per inference more directly.

b) Batch Processing & Parallelism
	•	Running inferences in batches (instead of one by one) improves throughput.
	•	Parallelize inference using multi-threading or distributed computing.
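
The throughput gain from batching, and its latency cost, can be seen in a simple cost model. The sketch assumes each inference call pays a fixed overhead (kernel launch, data transfer) plus a per-item cost; the numbers are hypothetical and real accelerators scale less linearly.

```python
def batch_stats(batch_size, fixed_overhead_ms, per_item_ms):
    """Return (latency in ms, throughput in items per second) for one batch."""
    latency_ms = fixed_overhead_ms + per_item_ms * batch_size
    throughput = batch_size / (latency_ms / 1000.0)
    return latency_ms, throughput


# Hypothetical costs: 10 ms fixed overhead, 2 ms per input.
for size in (1, 8, 32):
    latency, throughput = batch_stats(size, fixed_overhead_ms=10.0, per_item_ms=2.0)
    print(f"batch={size:>2}  latency={latency:.0f} ms  throughput={throughput:.0f}/s")
```

Larger batches amortize the fixed overhead, so throughput rises, but every request in the batch waits for the whole batch to finish.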

c) Use Optimized Inference Engines
	•	TensorRT (for Nvidia GPUs) speeds up model execution.
	•	ONNX Runtime (for cross-platform deployment) optimizes models across different hardware.
	•	TFLite (for mobile and edge devices) reduces inference time on limited hardware.

4. Deployment-Specific Optimizations

a) Edge vs. Cloud Inference
	•	Edge Inference: Low-latency, runs locally on devices (phones, cameras, IoT).
	•	Cloud Inference: More compute power but higher latency due to network overhead.

b) Asynchronous Processing
	•	For real-time applications, process requests in parallel or asynchronously.
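
A minimal sketch of asynchronous request handling with Python's `asyncio` follows. The `fake_inference` coroutine is a hypothetical stand-in for a call to a model server; it only sleeps to simulate waiting on I/O.

```python
import asyncio


async def fake_inference(request_id, delay_s=0.05):
    """Hypothetical stand-in for a model call; sleeps instead of computing."""
    await asyncio.sleep(delay_s)
    return f"result-{request_id}"


async def handle_requests(request_ids):
    """Run all requests concurrently; results come back in request order."""
    return await asyncio.gather(*(fake_inference(r) for r in request_ids))


results = asyncio.run(handle_requests([1, 2, 3]))
print(results)  # ['result-1', 'result-2', 'result-3']
```

Because the waits overlap, three requests take roughly as long as one. This helps when the bottleneck is waiting on a remote model or accelerator, not when the CPU itself is saturated.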

5. Case Studies:
	•	Google Assistant reportedly uses distilled BERT-style models for fast, low-latency language understanding.
	•	Tesla Autopilot optimizes deep learning inference with custom AI chips.
	•	Stable Diffusion reduces image generation cost by running the diffusion process in a compressed latent space instead of directly on pixels.

# Very Large Language Models
> 2 or ~50 trillion parameters

# Very Small Language Models
a few billion parameters

# Human-in-the-Loop Learning/Augmentation