```{contents}
```
## Short-Term Memory 

### 1. Definition

**Short-Term Memory (STM)** in Generative AI refers to the **temporary information storage** that a model uses **within a single interaction or context window** to generate coherent, relevant, and context-aware outputs.

It is:

* **Session-local**
* **Non-persistent**
* **Token-bounded**
* **Stateless across sessions (by default)**

---

### 2. Why Short-Term Memory Exists

An LLM carries no hidden state from one call to the next.
It computes each output token using only:

$$
P(\text{next token} \mid \text{current context})
$$

The **context** acts as the model’s short-term memory.
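
Written out token by token, the same idea can be stated as follows. The symbols $x_t$ (the $t$-th token), $T$ (sequence length), and $n$ (context window size in tokens) are introduced here for illustration only; a model with parameters $\theta$ and a window of $n$ tokens effectively computes:

$$
P_\theta(x_1, \dots, x_T) \approx \prod_{t=1}^{T} P_\theta\left(x_t \mid x_{\max(1,\,t-n)}, \dots, x_{t-1}\right)
$$

Tokens older than $x_{t-n}$ appear in no conditioning term, which is what makes this memory short-term.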

| Challenge              | Role of STM                             |
| ---------------------- | --------------------------------------- |
| Coherence across turns | Remembers prior user messages           |
| Reference resolution   | Links pronouns, variables, names        |
| Task continuity        | Maintains goals and constraints         |
| Reasoning chains       | Preserves intermediate steps            |
| Instruction following  | Keeps system and developer rules active |

---

### 3. Where STM Lives in the Architecture

```
[ System Prompt ]
[ Developer Prompt ]
[ Conversation History ]  ← Short-Term Memory
[ Current User Message ]
--------------------------------------------
            Context Window
                    ↓
             Transformer
                    ↓
               Output
```

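Strictly speaking, STM = **all tokens currently inside the context window**: the prompts are part of it too, and the conversation history is simply the portion that grows with each turn.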
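
The layout above can be sketched in a few lines of Python. This is a minimal illustration, not any particular vendor's API: the `Message` class, the `build_context` helper, and the role names are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    role: str      # "system", "developer", "user", or "assistant" (assumed role names)
    content: str

def build_context(system_prompt, developer_prompt, history, user_message):
    """Assemble the context window in the order shown in the diagram."""
    return [
        Message("system", system_prompt),
        Message("developer", developer_prompt),
        *history,                          # earlier turns of this session
        Message("user", user_message),     # the newest message goes last
    ]

history = [
    Message("user", "My name is Ada."),
    Message("assistant", "Nice to meet you, Ada."),
]
window = build_context(
    "You are a helpful assistant.",
    "Answer in one sentence.",
    history,
    "What is my name?",
)
for message in window:
    print(f"[{message.role}] {message.content}")
```

Everything in `window` is visible to the model on this call; nothing outside it is.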

---

### 4. Characteristics of Short-Term Memory

| Property            | Description                                            |
| ------------------- | ------------------------------------------------------ |
| **Temporary**       | Gone once tokens leave the window or the session ends  |
| **Token-limited**   | Capped by the model context length                     |
| **Implicit**        | Not a separate module; realized through attention      |
| **Differentiable**  | Held as token embeddings and key/value activations     |
| **Attention-based** | Read via self-attention                                |

---

### 5. Mechanism: How STM Works

In Transformers, STM is implemented via **self-attention**.

For each token:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V
$$

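* **Q**: query vector of the token currently being processed
* **K, V**: key and value vectors of the tokens in the context
* **d**: dimension of the key vectors
* In decoder-only LLMs a causal mask lets each token attend to itself and every earlier token, so the newest token has access to the whole window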

This enables:

* Long-range dependency tracking
* Cross-turn reasoning
* Instruction persistence
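
To make the formula concrete, here is a minimal pure-Python sketch of single-head scaled dot-product attention with a causal mask, as used in decoder-only models. The function names and the toy vectors are made up for illustration; real implementations are batched, multi-headed, and use learned projections to produce Q, K, and V.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Scaled dot-product attention where token i sees only tokens 0..i."""
    d = len(K[0])  # key dimension, the d under the square root
    outputs, all_weights = [], []
    for i, q in enumerate(Q):
        # Scores against the current token and every earlier one.
        scores = [
            sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        out = [
            sum(w * V[j][c] for j, w in enumerate(weights))
            for c in range(len(V[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy tokens with 2-dimensional vectors (made-up numbers).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = causal_attention(Q, K, V)
for i, row in enumerate(weights):
    print(f"token {i} attends to {len(row)} token(s): {[round(w, 3) for w in row]}")
```

The last token's weight row spans all three tokens, which is the sense in which the window acts as memory: anything still in it can be read back.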

---

### 6. Types of Short-Term Memory in Practice

| Type                      | Example                         |
| ------------------------- | ------------------------------- |
| **Instruction memory**    | System + developer messages     |
| **Conversational memory** | Past user–assistant turns       |
| **Task memory**           | Goals, constraints, variables   |
| **Working memory**        | Intermediate reasoning steps    |
| **Tool memory**           | Tool outputs inside the context |

---

### 7. Workflow with STM

```
User Input
    ↓
Append to Context
    ↓
Trim if Exceeds Window
    ↓
Transformer Computes Attention Over All Tokens
    ↓
Generate Output Using STM
```

---

### 8. Demonstration with Code (Conceptual)

```python
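MAX_TOKENS = 4000  # STM budget; set this to the model's context length
context = []       # running list of conversation turns

def count_tokens(text):
    # Rough whitespace proxy; use the model's tokenizer for real counts.
    return len(text.split())

def trim(turns, budget=MAX_TOKENS):
    # Drop the oldest turns until the remaining ones fit the budget.
    while len(turns) > 1 and sum(count_tokens(t) for t in turns) > budget:
        turns = turns[1:]
    return turns

def chat(user_input, model):
    # `model` is any object exposing generate(prompt) -> str (placeholder API).
    context.append(f"User: {user_input}")
    prompt = "\n".join(trim(context))  # only what fits in the window
    output = model.generate(prompt)
    context.append(f"Assistant: {output}")
    return output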
```

STM is the portion of `context` that still fits inside the window; anything trimmed away is no longer visible to the model.

---

### 9. STM vs Long-Term Memory

| Feature     | Short-Term Memory | Long-Term Memory         |
| ----------- | ----------------- | ------------------------ |
| Persistence | Session only      | Across sessions          |
| Storage     | Context tokens    | External DB / embeddings |
| Access      | Attention         | Retrieval                |
| Capacity    | Limited           | Scalable                 |
| Learning    | No update         | Can accumulate knowledge |
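
The difference in access path (attention versus retrieval) can be sketched as below. This is a toy under stated assumptions: `LongTermStore` is a hypothetical in-memory class, and word overlap stands in for the embedding similarity a real system would use.

```python
def tokenize(text):
    """Lowercase, strip basic punctuation, and split into a set of words."""
    return set(text.lower().replace("?", "").replace(".", "").split())

class LongTermStore:
    """Toy persistent store; a real system would use a database or vector index."""

    def __init__(self):
        self.facts = []

    def save(self, fact):
        self.facts.append(fact)

    def retrieve(self, query, k=1):
        # Rank facts by word overlap with the query (stand-in for embedding similarity).
        q = tokenize(query)
        ranked = sorted(self.facts, key=lambda f: len(q & tokenize(f)), reverse=True)
        return ranked[:k]

store = LongTermStore()

# Session 1: facts are written to the store; the session context is then discarded.
store.save("The user's favorite language is Rust.")
store.save("The user lives in Lisbon.")

# Session 2: STM starts empty, so relevant facts are retrieved and placed in the context.
question = "Which language is my favorite?"
recalled = store.retrieve(question)
context = recalled + [f"User: {question}"]
print("\n".join(context))
```

Once a retrieved fact has been copied into the context it is ordinary STM: the model reads it through attention like any other token.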

---

### 10. Failure Modes of STM

| Issue          | Cause                   |
| -------------- | ----------------------- |
| Forgetting     | Context window overflow |
| Contradictions | Older info dropped      |
| Hallucination  | Missing facts in STM    |
| Topic drift    | Poor memory salience    |
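
The first failure mode is easy to reproduce. The sketch below uses a deliberately tiny window counted in turns rather than tokens (`WINDOW_TURNS` is a made-up limit) to show a fact falling out of STM before it is needed.

```python
from collections import deque

WINDOW_TURNS = 3  # hypothetical limit, counted in turns for simplicity
window = deque(maxlen=WINDOW_TURNS)

turns = [
    "User: My order number is 7731.",
    "Assistant: Noted.",
    "User: What is the weather like?",
    "Assistant: Sunny.",
    "User: What was my order number?",
]
for turn in turns:
    window.append(turn)  # a full deque silently discards its oldest entry

visible = "\n".join(window)
print(visible)
print("order number still visible:", "7731" in visible)
```

By the time the final question arrives, the turn that contained the answer is gone, so a model given only `visible` has nothing to ground its reply on.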

---

### 11. Design Patterns Using STM

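* **Conversation summarization**: replace older turns with a short summary
* **Memory compression**: condense context content into fewer tokens
* **Sliding window context**: keep only the most recent turns
* **Hierarchical memory**: combine recent verbatim turns, summaries of older ones, and an external store
* **Prompt caching**: reuse the computation for an unchanged prompt prefix across requests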
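
As one illustration, a sliding window can be combined with summarization. The sketch below is an assumption-laden toy: `summarize` is a placeholder that merely truncates text where a real system would call an LLM, and `compress` and `keep_recent` are hypothetical names.

```python
def summarize(turns):
    # Placeholder summarizer (hypothetical): a real system would ask an LLM here.
    return f"Summary of {len(turns)} earlier turns: " + " / ".join(t[:25] for t in turns)

def compress(history, keep_recent=2):
    """Keep the newest `keep_recent` turns verbatim; fold older ones into one line."""
    split = max(len(history) - keep_recent, 0)
    older, recent = history[:split], history[split:]
    if not older:
        return list(recent)
    return [summarize(older)] + recent

history = [
    "User: I need a Python script to rename files.",
    "Assistant: Sure, which naming scheme?",
    "User: Prefix each file with today's date.",
    "Assistant: Here is a first draft.",
    "User: Can it skip hidden files?",
]
compressed = compress(history)
for line in compressed:
    print(line)
```

The trade-off is lossy recall: the summary costs fewer tokens than the turns it replaces, but any detail it omits is no longer available to the model.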

---

### 12. Intuition Summary

Short-term memory in Generative AI is simply:

> **Everything the model can currently “see” inside its context window and attend to while generating tokens.**

It is not a database, a state variable, or persistent storage.
It is **attention over tokens**.

