Data ingestion topics:

1. Batch vs streaming ingestion
2. Source systems (databases, APIs, files, message queues)
3. Change Data Capture (CDC)
4. Data formats (CSV, JSON, Parquet, Avro, ORC)
5. Schema design and schema evolution
6. Data validation and quality checks
7. Deduplication and noise removal
8. Data transformation (ETL vs ELT)
9. Orchestration and workflow scheduling
10. Error handling and retry policies
11. Metadata and lineage tracking
12. Security and access control
13. Scalability and throughput optimization
14. Latency vs consistency trade-offs
15. Storage targets (data lake, warehouse, OLTP/OLAP)
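
As an illustration of item 10 (error handling and retry policies), here is a minimal sketch of a retry wrapper with exponential backoff and full jitter. The function name `retry_with_backoff` and all of its parameters are hypothetical, chosen only for this example; a real pipeline would usually restrict `retry_on` to errors known to be transient.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying on failure with exponential backoff and full jitter.

    All names and defaults here are illustrative assumptions, not a standard API.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # The delay ceiling doubles on every attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: wait a random time in [0, ceiling] to avoid retry storms.
            sleep(random.uniform(0, ceiling))
```

The `sleep` argument is injectable so the policy can be exercised in tests without actually waiting.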


Below is a **structured list of the major concepts involved in Generative AI**, covering **theory, systems, models, data pipelines, MLOps, RAG, optimization, safety, evaluation, deployment, and governance**.

Use it as a **checklist** of topics to understand when working with Generative AI end-to-end.

---

# 1. Core AI & ML Foundations

* Machine learning basics
* Deep learning basics
* Neural networks
* Backpropagation
* Loss functions (cross-entropy, KL divergence)
* Activation functions (ReLU, GELU, SiLU)
* Optimization (Adam, AdamW, SGD)
* Regularization (dropout, weight decay)
* Gradient clipping
* Learning rate schedulers
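
To make the two loss functions named above concrete, the following sketch computes cross-entropy and KL divergence for small discrete distributions using only the standard library. The function names are illustrative; it assumes `p` and `q` are probability lists of equal length and that `q` is nonzero wherever `p` is nonzero.

```python
import math


def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i), in nats. Terms with p_i == 0 contribute 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def entropy(p):
    """H(p) is the cross-entropy of p with itself."""
    return cross_entropy(p, p)


def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i) = H(p, q) - H(p)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


target = [0.5, 0.5]      # "true" distribution
predicted = [0.9, 0.1]   # model distribution
print(cross_entropy(target, predicted), kl_divergence(target, predicted))
```

The identity `KL(p || q) = H(p, q) - H(p)` is why minimizing cross-entropy against fixed targets also minimizes KL divergence.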

---

# 2. Generative Model Families

* Transformers
* Encoder–decoder models
* Decoder-only LLMs
* Diffusion models
* GANs
* VAEs (variational autoencoders)
* Autoregressive models
* Masked language modeling

---

# 3. Large Language Models (LLMs)

* Attention mechanism
* Multi-head attention
* Self-attention vs cross-attention
* Positional encodings
* Tokenization (BPE, SentencePiece, WordPiece)
* Context window & sliding windows
* KV cache
* Sampling strategies (top-k, top-p, temperature)
* Beam search
* Low-rank adaptation (LoRA)
* Quantization (8-bit, 4-bit, AWQ, GPTQ)
* Instruction tuning
* Supervised fine-tuning (SFT)
* Preference modeling
* Reinforcement learning from human feedback (RLHF)
* DPO (Direct Preference Optimization)
* PPO vs DPO differences
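
The sampling strategies listed above (temperature, top-k, top-p) can be sketched in a few lines of plain Python. This is a simplified illustration over a raw list of logits, not how any particular inference library implements it; `sample_token` and its parameters are hypothetical names.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample one token index using temperature, top-k, and top-p (nucleus) filtering.

    top_k=0 disables top-k filtering; top_p=1.0 disables nucleus filtering.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature scaling: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Token indices from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    # Nucleus filtering: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # random.choices renormalizes the remaining weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

With `top_k=1` the function reduces to greedy decoding, since only the most likely token survives the filter.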

---

# 4. Multimodal Models

* Vision Transformers (ViT)
* CLIP
* Vision–language models (VLM)
* Speech-to-text
* Text-to-speech
* Audio representations (spectrograms)
* Image embeddings
* Video embeddings
* OCR systems

---

# 5. Embeddings

* Text embeddings
* Sentence embeddings
* Image/audio embeddings
* Cosine similarity
* Vector normalization
* Dimensionality
* Embedding model drift
* Embedding versioning
* Approximate nearest neighbor (ANN) search
* FAISS / HNSW / ScaNN
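
As a small illustration of cosine similarity, vector normalization, and nearest-neighbor search, the sketch below does an exact brute-force search in plain Python. Libraries such as FAISS or HNSW-based indexes approximate this result at much larger scale; the function names here are illustrative only.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b, in [-1, 1]."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def top_k_neighbors(query, vectors, k=3):
    """Exact (brute-force) search: score every vector and return the k best.

    Returns a list of (index, similarity) pairs, best first.
    """
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)
    return [(i, score) for score, i in scored[:k]]
```

Brute force is linear in the number of stored vectors per query, which is the cost that ANN indexes trade a little accuracy to avoid.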

---

# 6. RAG (Retrieval Augmented Generation)

## Retrieval

* Chunking strategies (semantic, fixed-size, hybrid)
* Overlap strategies
* Semantic search
* Hybrid search (BM25 + embeddings)
* Ranking / re-ranking
* Context assembly
* Query rewriting
* Multi-step retrieval
* Routing / query classification
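
Fixed-size chunking with overlap, the simplest of the chunking strategies above, might look like the following sketch. It splits on whitespace-separated words rather than model tokens, which is an assumption made to keep the example dependency-free; `chunk_words` and its defaults are hypothetical.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into chunks of chunk_size words, each sharing `overlap` words
    with the previous chunk so that content near a boundary appears in both."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


print(chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1))
# → ['a b c d', 'd e f g', 'g h i j']
```

Semantic chunking would replace the fixed window with boundaries at sentences, headings, or topic shifts.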

## Storage

* Vector databases (Pinecone, Qdrant, Weaviate, Chroma)
* Metadata stores
* Indexes (HNSW, IVF, Flat)
* Sharding & replicas
* TTL and versioning

## Response synthesis

* Context windows
* Chain-of-thought (CoT)
* Tools & function calling
* Retrieval fusion
* Guardrails
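
One common way to implement retrieval fusion, for example when merging BM25 and embedding results in hybrid search, is reciprocal rank fusion (RRF). The sketch below is one possible implementation; the constant `k=60` is a commonly used default rather than something this checklist prescribes, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by document ID for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_results = ["a", "b", "c"]    # hypothetical keyword-search ranking
dense_results = ["b", "c", "a"]   # hypothetical embedding-search ranking
print(reciprocal_rank_fusion([bm25_results, dense_results]))
# → ['b', 'a', 'c']
```

RRF only uses rank positions, so it sidesteps the problem that BM25 scores and cosine similarities are on different scales.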

---

# 7. Data Engineering for GenAI

## Data ingestion

* Connectors (S3/GCS/SharePoint/API/DB)
* Scheduling (Airflow, Dagster)
* KubernetesPodOperator (Airflow tasks run as Kubernetes pods)
* ETL vs ELT
* Micro-batching
* Streaming (Kafka, Kinesis)

## Data preprocessing

* Parsing (PDF, HTML, DOCX)
* OCR
* Cleaning (boilerplate removal, noise removal)
* Normalization
* Deduplication (exact + near-dup)
* Tokenization & chunking
* Schema design & evolution
* Metadata extraction
* Lineage tracking
* PII detection
* Safety filters
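
The "exact + near-dup" deduplication step above can be sketched as follows: exact duplicates are caught by hashing normalized text, and near-duplicates by Jaccard similarity over word shingles. The threshold of 0.8, the shingle size, and every function name are assumptions for illustration. The pairwise comparison is quadratic, so large corpora typically use MinHash with LSH instead.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.strip().lower())


def shingles(text, n=3):
    """Set of overlapping n-word sequences from the normalized text."""
    words = normalize(text).split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs, threshold=0.8):
    """Keep the first occurrence of each document; drop exact and near duplicates."""
    seen_hashes = set()
    kept = []  # (document, shingle set) pairs for documents kept so far
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        doc_shingles = shingles(doc)
        if any(jaccard(doc_shingles, other) >= threshold for _, other in kept):
            continue  # near duplicate of a document already kept
        seen_hashes.add(digest)
        kept.append((doc, doc_shingles))
    return [doc for doc, _ in kept]
```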

## Data storage

* Data lakes (S3/GCS/ADLS)
* Data warehouses (Snowflake/BigQuery/Redshift)
* OLTP databases (Postgres/MySQL/DynamoDB)
* OLAP systems (ClickHouse, Druid)

## Lakehouse & table formats

* Bronze/Silver/Gold medallion architecture
* Lakehouse concepts
* Delta/Iceberg/Hudi

---

# 8. Training & Fine-Tuning

* Pretraining datasets
* Tokenization pipelines
* SFT (Supervised Fine-Tuning)
* RLHF (Reinforcement Learning from Human Feedback)
* Preference modeling
* DPO, ORPO
* Continual training
* Model drift
* Curriculum learning
* Training infrastructure (GPU/TPU clusters, distributed training)
* Checkpointing
* Mixed-precision training (FP16/BF16/FP8)
* Gradient accumulation
* Data packing

---

# 9. Evaluation

* Perplexity
* BLEU/ROUGE (older n-gram overlap metrics)
* Win-rate evaluation
* Human evaluation
* Model hallucination tests
* Groundedness checks
* RAG evaluation (Faithfulness, Recall, Relevance)
* Bias / toxicity eval
* Adversarial prompts testing

---

# 10. Scaling & Performance

* Latency vs consistency trade-offs
* Throughput optimization
* Batching & micro-batching
* Caching (KV cache, response cache)
* Parallel inference
* Speculative decoding
* GPU utilization
* Autoscaling
* Distributed serving
* Cost optimization

---

# 11. Safety, Security & Governance

* Access control & RBAC
* ABAC
* Secrets management
* Private networking / VPC
* Encryption (in-flight & at-rest)
* PII detection & redaction
* Policy filtering
* Guardrails
* Data quality rules
* Audit logs
* Compliance (SOC2, HIPAA, GDPR)

---

# 12. Monitoring & Observability

* Prompt logs
* Response latency
* Embedding drift detection
* Pipeline run metrics
* Retrieval quality metrics
* Model usage analytics
* Reliability SLOs

---

# 13. Deployment

* Model serving frameworks (Triton, vLLM, TensorRT)
* API gateways
* Load balancers
* Autoscaling GPU pods
* Canary + shadow deployments
* CI/CD for LLM pipelines
* Prompt router / model router

---

# 14. Applications & Workflows

* RAG chatbots
* QA search systems
* Document automation
* Code generation
* Multi-agent workflows
* Autonomous task execution
* Summarization, translation
* Fine-tuned domain assistants

---

# 15. Supporting Concepts

* Knowledge graphs
* Tool calling
* Agent architectures
* APIs, SDKs, LangChain, LlamaIndex
* System prompts & prompt engineering
* Self-embedding & long-context strategies
* Context compression
* Memory mechanisms

---

# Summary

These 15 categories cover the major concepts you will encounter when working with Generative AI, spanning **ML theory, data engineering, RAG, embeddings, training, MLOps, scaling, safety, evaluation, deployment, and governance**.



In [1]:
import boto3
from botocore.exceptions import ClientError

def check_iam_user(username):
    iam = boto3.client("iam")

    print("\n=== Checking IAM User ===")

    # Check if user exists
    try:
        user = iam.get_user(UserName=username)
        print("User Found:", user["User"]["UserName"])
        print("User ARN:", user["User"]["Arn"])
        print("Created On:", user["User"]["CreateDate"])
    except ClientError as e:
        print("Error:", e.response["Error"]["Message"])
        return

    print("\n=== Access Keys ===")

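    # List access keys; permission to call GetUser does not imply ListAccessKeys,
    # so handle the error here instead of letting the call raise.
    try:
        keys = iam.list_access_keys(UserName=username)["AccessKeyMetadata"]
    except ClientError as e:
        print("Could not list access keys:", e.response["Error"]["Message"])
        return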
    
    if not keys:
        print("No access keys found.")
    else:
        for key in keys:
            print("\nAccessKeyId:", key["AccessKeyId"])
            print("Status:", key["Status"])
            print("Created:", key["CreateDate"])

            # Check last used
            try:
                last_used = iam.get_access_key_last_used(AccessKeyId=key["AccessKeyId"])
                print("Last Used:", last_used["AccessKeyLastUsed"].get("LastUsedDate", "Never Used"))
                print("Service:", last_used["AccessKeyLastUsed"].get("ServiceName", "None"))
            except ClientError as e:
                print("Could not fetch last used:", e.response["Error"]["Message"])
    
    print("\n=== Login Profile (Console Password) ===")

    # Check password login profile (optional)
    try:
        profile = iam.get_login_profile(UserName=username)
        print("Password Last Set On:", profile["LoginProfile"]["CreateDate"])
    except ClientError as e:
        if e.response["Error"]["Code"] == "NoSuchEntity":
            print("User does NOT have an AWS Console password.")
        else:
            print("Error fetching login profile:", e.response["Error"]["Message"])

# Example
check_iam_user("myUser")



=== Checking IAM User ===
Error: The security token included in the request is invalid.


In [2]:
!pip install boto3

Collecting boto3
  Downloading boto3-1.40.76-py3-none-any.whl.metadata (6.8 kB)
Collecting botocore<1.41.0,>=1.40.76 (from boto3)
  Downloading botocore-1.40.76-py3-none-any.whl.metadata (5.9 kB)
Collecting jmespath<2.0.0,>=0.7.1 (from boto3)
  Downloading jmespath-1.0.1-py3-none-any.whl.metadata (7.6 kB)
Collecting s3transfer<0.15.0,>=0.14.0 (from boto3)
  Downloading s3transfer-0.14.0-py3-none-any.whl.metadata (1.7 kB)
Downloading boto3-1.40.76-py3-none-any.whl (139 kB)
Downloading botocore-1.40.76-py3-none-any.whl (14.2 MB)


