<center>
<img src="https://supportvectors.ai/logo-poster-transparent.png" width="400px" style="opacity:0.7">
</center>

In [1]:
%run supportvectors-common.ipynb


<div style="color:#aaa;font-size:8pt">
<hr/>
&copy; SupportVectors. All rights reserved. <blockquote>This notebook is the intellectual property of SupportVectors, and part of its training material. 
Currently, only participants in SupportVectors workshops are allowed to study these notebooks for educational purposes; copying or using them for any other purpose without written permission is prohibited.

<b>These notebooks are chapters and sections from the textbook on Data Science that Asif Qamar is writing, so we request that you not circulate the material to others.</b>
 </blockquote>
 <hr/>
</div>



# Session 1: Building Enterprise AI Systems

## Overview
- The bootcamp emphasizes transitioning from **proof-of-concept AI** to **fully operational enterprise systems**.
- Focus on **scaling AI** beyond experimentation into **real-world production environments**.

---

## Key Components of AI Systems

<img src='../images/ai_system.png' width=1000px>


### 1. Data Infrastructure
- **Data is foundational** — often referred to as the "food" for AI.
- **Data-related tasks are commonly estimated to consume ~80%** of AI development effort.
- Despite the "big data" era, **data scarcity and quality remain major challenges**.
- Emphasis on **data engineering** and **feature engineering** to ensure effective model training.

<img src='../images/AI_infra.png' width=1000px>


### 2. Technical Building Blocks
- **Models & Algorithms** form the mathematical backbone of AI.
- **Machine Learning platforms** provide the environment for model development and deployment.
- **Specialized AI hardware** (e.g., NVIDIA Blackwell, Huawei chips) is essential for performance and scalability.

### 3. Operational Aspects (MLOps)
- **MLOps** differs significantly from traditional DevOps/SysOps.
- Key responsibilities include:
  - Implementing **security guardrails**
  - Ensuring **system observability**
  - Conducting **performance monitoring**
  - Managing **scalability** and **latency**


### 4. Critical Considerations
- **AI Security**: Enterprise AI systems are vulnerable and must be protected against adversarial threats.
- **Interpretability**: Understanding how AI systems make decisions is crucial.
- **Iterative Development**: Includes building, training, fine-tuning, and benchmarking AI models.

---

# Session 2: From Keyword Search to Semantic Search with Transformers and Embeddings
## 🧭 1. Introduction

Search is fundamental to how we access knowledge — from finding files in a company knowledge base to retrieving answers from the web. Traditionally, search was based on keywords. But language is nuanced, and keyword-based systems often miss the mark.

This notebook walks through the evolution from traditional **keyword search** to modern **semantic search** powered by **transformers** and **contextual embeddings**.

## 📚 2. Traditional Search and the Inverted Index

### 🔍 What is an Inverted Index?

An inverted index is a mapping from **keywords** to the **documents** that contain them.

**Example:**
```
Doc1: "The cow jumped over the moon."
Doc2: "To go to the moon you need a big rocket."
```

**Inverted Index:**
```
cow     → [Doc1]
moon    → [Doc1, Doc2]
rocket  → [Doc2]
```
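The mapping above can be built in a few lines. The following is a minimal sketch, assuming a simple lowercase, letters-only tokenizer; the helper names `tokenize`, `build_inverted_index`, and `search` are illustrative, not part of any library. Unlike the hand-written index above, it also indexes common words such as "the", which a real engine would usually filter out as stop words.

```python
import re
from collections import defaultdict

docs = {
    "Doc1": "The cow jumped over the moon.",
    "Doc2": "To go to the moon you need a big rocket.",
}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(documents: dict[str, str]) -> dict[str, list[str]]:
    """Map each token to the sorted list of document ids that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


index = build_inverted_index(docs)


def search(query: str) -> list[str]:
    """Return ids of documents that contain every query token (boolean AND)."""
    postings = [set(index.get(token, [])) for token in tokenize(query)]
    return sorted(set.intersection(*postings)) if postings else []


print(index["moon"])       # ['Doc1', 'Doc2']
print(search("cow moon"))  # ['Doc1']
```

Note that `search("cattle")` returns nothing even though Doc1 is about a cow: the index only knows exact tokens, which is the limitation discussed next.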

### ❌ Limitations

- Ignores context (e.g., "bank" can mean a financial institution or a river bank)
- Hard to rank relevance effectively
- Can't handle paraphrased queries
- Keyword match ≠ semantic match

## 🔢 3. Vectors and Embeddings

### 🧮 What are Vectors?

- **Scalar:** Single number (0D)
- **Vector:** Array of numbers (1D)
- **Matrix:** 2D array
- **Tensor:** Generalization to n-dimensions
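As a quick illustration of these four terms, here is a small sketch using NumPy (an assumption; any array library would do). The values and shapes are arbitrary.

```python
import numpy as np

scalar = np.array(3.14)                 # 0D: a single number
vector = np.array([0.23, -0.11, 0.56])  # 1D: an array of numbers
matrix = np.ones((2, 3))                # 2D: rows x columns
tensor = np.zeros((4, 2, 3))            # 3D: e.g. a stack of four 2x3 matrices

# ndim is the number of dimensions; shape gives the size along each one.
for name, arr in [("scalar", scalar), ("vector", vector),
                  ("matrix", matrix), ("tensor", tensor)]:
    print(f"{name}: ndim={arr.ndim}, shape={arr.shape}")
```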

### 💡 Why Vectors?

All machine learning models (especially neural networks) operate on numbers.

> **"Vector in → Vector out"**  
> Text, images, and audio must all be transformed into vectors before models can use them.

### 📦 Word Embeddings

Early NLP models used static word embeddings like **Word2Vec** or **GloVe**.
```
Embedding("cow") → [0.23, -0.11, ..., 0.56]
Embedding("moon") → [-0.45, 0.89, ..., -0.12]
```
These embeddings are **static** — the same word always has the same vector.
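A static embedding behaves like a plain lookup table. The sketch below makes that concrete; the three-dimensional vectors are made up for illustration and are not real Word2Vec or GloVe values, and `embed_word` is a hypothetical helper.

```python
# Toy lookup table. The numbers are invented for illustration only.
static_embeddings: dict[str, list[float]] = {
    "river": [0.10, 0.80, -0.30],
    "money": [0.70, -0.20, 0.40],
    "bank": [0.45, 0.30, 0.05],
}


def embed_word(word: str) -> list[float]:
    """Look up a word's vector; the surrounding sentence plays no role."""
    return static_embeddings[word.lower()]


sentence_1 = "She sat by the river bank"
sentence_2 = "He took money to the bank"

# The last word of both sentences is "bank", used in two different senses.
bank_in_1 = embed_word(sentence_1.split()[-1])
bank_in_2 = embed_word(sentence_2.split()[-1])

print(bank_in_1 == bank_in_2)  # True
```

Both sentences yield the identical vector for "bank", because the lookup never sees the rest of the sentence.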

## 🧠 4. The Problem of Context

Take the word **"bank"**:

- "He went to the bank to deposit money."
- "She sat by the river bank."
- "I'm banking on you to help."

Using a single vector for **"bank"** is misleading. The meaning comes from **context**.

## 🎯 5. Attention and Transformers

### 🌟 Intuition for Attention

> Attention allows the model to **focus** on the relevant parts of the input based on the context.

**Analogy:**  
You're hiking, enjoying the stars. Suddenly you hear a growl in the bushes — your attention instantly shifts. Your sensory input hasn't changed, but **your context** has.

### 📐 Transformer = Attention is All You Need

Introduced in 2017, the Transformer architecture:
- Uses self-attention to relate words in a sentence
- Allows **contextual interpretation** of each word
- Foundation of today's **Large Language Models (LLMs)**
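To make self-attention concrete, here is a minimal single-head sketch of scaled dot-product attention in NumPy. It assumes random toy inputs and random projection matrices (`w_q`, `w_k`, `w_v`); a real Transformer learns these weights and adds multiple heads, positional information, and feed-forward layers, none of which are shown here.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token matrix x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v       # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (tokens, tokens) relevance scores
    weights = softmax(scores)                 # how much each token attends to each other token
    return weights @ v, weights               # context-mixed vectors, attention weights


rng = np.random.default_rng(0)
n_tokens, d_model = 4, 8                      # toy sizes, chosen arbitrarily
x = rng.normal(size=(n_tokens, d_model))      # stand-in for 4 input token vectors
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

contextual, weights = self_attention(x, w_q, w_k, w_v)
print(contextual.shape)      # (4, 8): one context-aware vector per token
print(weights.sum(axis=1))   # each row of attention weights sums to 1 (up to rounding)
```

Each output row is a weighted mix of all value vectors, which is how a token's representation comes to depend on the other words in the sentence.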

## 🧩 6. Contextual Word Embeddings

Transformer models generate **contextual embeddings** — where the meaning of each word is determined by the **context** it appears in.

Unlike static embeddings (e.g., Word2Vec), which assign a single vector to each word regardless of usage, contextual embeddings differ based on surrounding words.

**Examples:**
```
"The river overflowed the bank."  →  "bank" ≈ shoreline
"He deposited cash in the bank."  →  "bank" ≈ financial institution
```

This is made possible by the **attention mechanism**, which helps each word understand its role based on the full sentence context.

> 🔁 Every word asks: "Who am I?"  
> And answers it by attending to its neighbors.

## 🧮 7. Sentence Embeddings and Representation

To represent an entire sentence (not just the individual words), we need to **combine the contextual word embeddings** into one unified vector.

### Two common strategies:

#### 1. **Mean Pooling**
Take the average of all word vectors in the sentence.

> Sentence vector = mean(word1, word2, ..., wordN)

#### 2. **[CLS] Token**
Insert a special token `[CLS]` at the beginning of the sentence.  
The transformer learns to use this token to capture the **entire sentence meaning**.

- `[CLS]` pays attention to all words.
- It starts with no meaning but learns it **entirely from context**.
- Used by models like BERT for classification or semantic tasks.

> Resulting vector = semantic summary of the full sentence.

This **sentence embedding** becomes a powerful input for tasks like classification, clustering, or semantic search.
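Both strategies reduce a matrix of token vectors to a single vector. The sketch below assumes a toy matrix of random "contextual embeddings" with `[CLS]` in the first row and two padding positions at the end; the attention mask for padding is an added assumption that reflects how batched inputs are usually handled, and the function names are illustrative.

```python
import numpy as np

# Toy token embeddings for: [CLS] the cow jumped [PAD] [PAD]
rng = np.random.default_rng(42)
token_embeddings = rng.normal(size=(6, 4))      # 6 tokens, 4 dimensions each
attention_mask = np.array([1, 1, 1, 1, 0, 0])   # 1 = real token, 0 = padding


def mean_pool(embeddings: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average the token vectors, ignoring padded positions."""
    mask = mask[:, None].astype(embeddings.dtype)   # (tokens, 1) for broadcasting
    return (embeddings * mask).sum(axis=0) / mask.sum()


def cls_pool(embeddings: np.ndarray) -> np.ndarray:
    """Use the vector at position 0, where [CLS] was inserted."""
    return embeddings[0]


sentence_vec_mean = mean_pool(token_embeddings, attention_mask)
sentence_vec_cls = cls_pool(token_embeddings)
print(sentence_vec_mean.shape, sentence_vec_cls.shape)  # (4,) (4,)
```

Either way the result has the same dimensionality as a single token vector, so downstream code does not need to know which strategy produced it.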

## ⚖️ 8. Contrastive Learning

To teach a model semantic similarity:

- Sentence A: "The cow jumped over the moon"
- Sentence B: "The car leaped over the puddle"
- Sentence C: "NVIDIA stock jumped 10%"

A and B are semantically related: both describe something leaping over an object. A and C are not, even though they share the word "jumped".

### 🧠 Training Objective

**Contrastive Loss:**
- **Minimize** distance between A and B
- **Maximize** distance between A and C

This forces the model to place similar texts **closer in vector space**.
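One common way to write this objective down is a triplet margin loss over cosine distance; the notes do not say which formulation is intended, so treat this as one possible instance. The two-dimensional vectors below are hand-picked stand-ins for the embeddings of sentences A, B, and C, and the margin of 0.5 is arbitrary.

```python
import numpy as np


def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    """1 - cosine similarity: 0 for identical directions, 1 for orthogonal ones."""
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def triplet_loss(anchor, positive, negative, margin: float = 0.5) -> float:
    """Zero once the positive is closer to the anchor than the negative by at least `margin`."""
    return max(0.0, cosine_distance(anchor, positive)
               - cosine_distance(anchor, negative) + margin)


# Hand-picked toy embeddings (assumed, not produced by a real model).
a = np.array([1.0, 0.0])   # Sentence A
b = np.array([0.9, 0.1])   # Sentence B: points almost the same way as A
c = np.array([0.0, 1.0])   # Sentence C: orthogonal to A

print(triplet_loss(a, b, c))      # 0.0 -> this arrangement already satisfies the objective
print(triplet_loss(a, c, b) > 0)  # True -> swapping the roles is penalized
```

During training, the gradient of this loss moves the encoder's weights so that penalized arrangements become rarer.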

## 🧭 9. Semantic Search

Once all your documents are converted into **semantic vectors**, search becomes:

### 🧱 Workflow

1. User enters a query → Convert it to a vector (Q)
2. Compare Q to all document vectors (D1, D2, ..., Dn)
3. Return the **nearest neighbors** (e.g., by cosine similarity)

### 💡 Pseudo-code
```
Q = embed("How do cows behave?")
Docs = [D1, D2, ..., Dn]

For each Di:
    similarity[i] = cosine_similarity(Q, Di)

Return top-k documents sorted by similarity
```

This is **semantic search** — retrieving documents based on **meaning**, not just keyword overlap.
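The pseudo-code translates almost directly into NumPy. This is a brute-force sketch, assuming the query and document vectors have already been produced by some embedding model; the three-dimensional vectors here are made up, and `top_k_by_cosine` is an illustrative name.

```python
import numpy as np


def top_k_by_cosine(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 2):
    """Return (document index, similarity) pairs for the k most similar documents."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                      # cosine similarity of every document to the query
    top = np.argsort(-sims)[:k]       # indices sorted by descending similarity
    return [(int(i), float(sims[i])) for i in top]


# Made-up vectors standing in for embedded documents D1..D3 and query Q.
doc_vecs = np.array([
    [0.9, 0.1, 0.0],   # D1
    [0.1, 0.9, 0.1],   # D2
    [0.8, 0.3, 0.1],   # D3
])
query_vec = np.array([1.0, 0.2, 0.0])

results = top_k_by_cosine(query_vec, doc_vecs, k=2)
print([f"D{i + 1}" for i, _ in results])  # ['D1', 'D3']
```

Comparing the query against every document is fine for small collections; at larger scale, approximate nearest-neighbor indexes are typically used instead of this exhaustive scan.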

## 🏢 10. Industry Adoption

### Before 2018:
- Keyword search engines (Elasticsearch, Solr)
- Heuristics, BM25, inverted indices

### After 2018:
- Google and Bing use neural search
- Transformer-based retrieval models
- Enterprises are now adopting vector databases and similarity-search libraries (Pinecone, Weaviate, FAISS)

## 🔚 11. Conclusion

We’ve gone from:
- Searching **words** → Understanding **meaning**
- Static word embeddings → Contextualized sentence embeddings
- Rule-based ranking → Learned semantic spaces

This shift has transformed **search**, **recommendation**, and **language understanding** in modern AI systems.

## 📝 Key Terms Recap

- **Inverted Index** – Maps words to documents  
- **Embedding** – Vector representation of a word/sentence  
- **Attention** – Mechanism to weigh input relevance dynamically  
- **CLS Token** – Captures full sentence meaning  
- **Contrastive Loss** – Encourages meaningful vector spacing  
- **Semantic Search** – Retrieves based on meaning, not keywords  


# Session 3: UV Setup + PDF Downloader Project

## Environment Setup
- `uv` installation
- `ai_environment`
- `uv sync --upgrade`

## Project: PDF Downloader
 1. Problem
	 1. Use a prompt in **Cursor** to generate a Python script that automatically downloads all the PDFs from the given links.
 2. Improve the code
	 1. Object oriented
	 2. Type hinting
	 3. Reduce the cyclomatic complexity
	 4. Documentation
	 5. Logging
	 6. Pre-condition checks
	 7. Post-condition checks
	 8. Simplification
	 9. Exception nesting
	 10. Unit tests
	 11. Architecture and sequence diagrams in Mermaid
 3. Sort out any remaining issues with the code and documentation, for example using Claude.
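
As a rough picture of where the improvement checklist might lead, here is a small sketch that touches several items: an object-oriented design, type hints, docstrings, logging, a pre-condition check, a post-condition check, and wrapping low-level errors in a domain exception (one possible reading of "exception nesting"). It is not the workshop's reference solution. `PdfDownloader`, `PdfDownloadError`, and `fetch_bytes` are hypothetical names, and the fetcher is injectable so the class can be unit tested without network access.

```python
import logging
from pathlib import Path
from typing import Callable
from urllib.parse import urlparse
from urllib.request import urlopen

logger = logging.getLogger(__name__)


class PdfDownloadError(Exception):
    """Raised when a PDF cannot be fetched or fails validation."""


def fetch_bytes(url: str) -> bytes:
    """Default fetcher: read the raw bytes at `url` with the standard library."""
    with urlopen(url, timeout=30) as response:
        return response.read()


class PdfDownloader:
    """Downloads PDFs into a target directory (illustrative sketch)."""

    def __init__(self, target_dir: Path,
                 fetcher: Callable[[str], bytes] = fetch_bytes) -> None:
        self.target_dir = target_dir
        self.fetcher = fetcher

    def download(self, url: str) -> Path:
        """Download one PDF and return the path it was saved to."""
        # Pre-condition: the link must be http(s) and end in .pdf.
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.path.lower().endswith(".pdf"):
            raise ValueError(f"Not an http(s) PDF link: {url!r}")

        try:
            content = self.fetcher(url)
        except OSError as exc:
            # Wrap the low-level error, keeping the original as the cause.
            raise PdfDownloadError(f"Could not fetch {url}") from exc

        # Post-condition: PDF files begin with the bytes "%PDF-".
        if not content.startswith(b"%PDF-"):
            raise PdfDownloadError(f"Response from {url} is not a PDF")

        self.target_dir.mkdir(parents=True, exist_ok=True)
        destination = self.target_dir / Path(parsed.path).name
        destination.write_bytes(content)
        logger.info("Saved %s (%d bytes)", destination, len(content))
        return destination
```

In a unit test, passing `fetcher=lambda url: b"%PDF-1.7 ..."` exercises the happy path, and a fetcher that returns HTML exercises the post-condition failure.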