# Lesson 18: Logical Fallacy Detection
## Part 1: Cherry-Picked Benchmarks

**Objective:** Understand how "Cherry-Picked Benchmarks" mislead AI evaluation and learn to identify them using data-driven methods.

### Learning Objectives
By the end of this notebook, you will be able to:
1. **Define** "Cherry-Picked Benchmarks" and explain why they represent a statistical Hasty Generalization.
2. **Analyze** real-world examples (like the Gemini demo) to understand the gap between demo and reality.
3. **Evaluate** a dispute resolution scenario where "99% Accuracy" is misleading.
4. **Apply** a "Red Flags" checklist to audit AI performance claims.

### 1. What are Cherry-Picked Benchmarks?

**Definition:** Cherry-Picking in AI benchmarks occurs when performance is reported on a specific subset of data that flatters the system, while ignoring edge cases, failures, or more representative datasets.

> "It works perfectly (on my machine)."

Cherry-picking is a statistical form of the **Hasty Generalization** fallacy. You take a small, non-representative sample (the "cherry") and generalize its properties to the entire population (the "tree").

In AI, this often manifests as:
- **Selection Bias:** Choosing test data that you know the model handles well.
- **Survivorship Bias:** Reporting only the successful runs or models.
- **Seed Hacking:** Running an experiment 100 times and reporting only the best result.
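
To make Seed Hacking concrete, here is a minimal sketch. It assumes a hypothetical `simulate_run` function that stands in for one full train-and-evaluate cycle. The "true" accuracy of 0.80 and the test set size of 200 are made-up values chosen only for illustration.

```python
import random
import statistics

def simulate_run(seed: int, true_accuracy: float = 0.80, n_test: int = 200) -> float:
    """Hypothetical stand-in for one training/evaluation run.

    Each of the n_test predictions is correct with probability
    true_accuracy, so the measured accuracy varies with the seed.
    """
    rng = random.Random(seed)
    correct = sum(rng.random() < true_accuracy for _ in range(n_test))
    return correct / n_test

# Run the same "experiment" 100 times with different seeds.
scores = [simulate_run(seed) for seed in range(100)]

best = max(scores)                 # what a seed hacker would report
mean = statistics.mean(scores)     # what an honest report would lead with
stdev = statistics.stdev(scores)   # run-to-run spread

print(f"best of 100:  {best:.3f}")
print(f"mean ± stdev: {mean:.3f} ± {stdev:.3f}")
```

The best run sits above the mean purely by chance. Reporting it alone overstates what a fresh run is likely to achieve.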

### 2. Real-World Example: The Google Gemini Demo

In December 2023, Google released a video showing its Gemini model apparently responding in real time to voice and video input. The AI seemed to react instantly to spoken and visual cues, with no visible latency.

**The Reality:**
It was later revealed that the video was **not real-time**: it was stitched together from still image frames and text prompts, and the latency was edited out.

**Why this is dangerous:**
This is a classic "Demo-to-Production" leap. The company "cherry-picked" the successful interactions and the presentation format to imply a capability (real-time video reasoning) that didn't actually exist in that form. If you built a product assuming the response times shown in the video, it would fail immediately.

### 3. Domain Scenario: The "99% Accuracy" Dispute Classifier

Let's look at a concrete example from our **Dispute Resolution Chatbot**.

**The Pitch:**
> "Our new `ReasonCodeClassifier` achieves **99% accuracy** on the golden test set! It's ready for production."

**The Hidden Data:**
The "golden test set" contained 100 records:
- **100%** were **Visa** transactions.
- **100%** were Reason Code **10.4**.

**The Production Reality:**
When deployed to a diverse environment (Amex, Mastercard, and a range of other Reason Codes), the model blindly classified everything as "Visa 10.4".

**The Result:**
Real-world accuracy dropped from **99%** to **~25%**.
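
The scenario above can be reproduced with a small sketch. Everything here is an illustrative assumption: `always_visa_10_4` is a hypothetical stand-in for the degenerate classifier, the non-Visa labels are placeholders, and the production mix is invented so that a quarter of the traffic is "Visa 10.4". A constant predictor scores 100% on the homogeneous set; the 99% in the pitch implies a single miss, which this sketch does not model.

```python
def always_visa_10_4(record: dict) -> str:
    """Hypothetical degenerate classifier: ignores its input entirely."""
    return "Visa 10.4"

def accuracy(records: list, predict) -> float:
    """Fraction of records whose predicted label matches the true label."""
    return sum(predict(r) == r["label"] for r in records) / len(records)

# Golden test set: 100 records, all Visa with Reason Code 10.4.
golden = [{"label": "Visa 10.4"} for _ in range(100)]

# Assumed production mix (placeholder labels, invented proportions).
production = (
    [{"label": "Visa 10.4"} for _ in range(25)]
    + [{"label": "Visa (other code)"} for _ in range(25)]
    + [{"label": "Mastercard (any code)"} for _ in range(25)]
    + [{"label": "Amex (any code)"} for _ in range(25)]
)

golden_acc = accuracy(golden, always_visa_10_4)
production_acc = accuracy(production, always_visa_10_4)

print(f"golden set accuracy:     {golden_acc:.0%}")
print(f"production mix accuracy: {production_acc:.0%}")
```

The headline number measures how well the test set matches the model's one answer. It says little about how well the model classifies disputes.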

### 4. 🚩 Red Flags Checklist

When evaluating AI claims, use this checklist to spot potential cherry-picking:

- [ ] **The "Single Metric" Flex:** Claims like "95% Accuracy" without defining the dataset distribution.
- [ ] **The "Internal" Test Set:** "Tested on our proprietary internal benchmark" (which no one else can audit).
- [ ] **"Best of N" Reporting:** Reporting the single best run instead of the mean and variance (Seed Hacking).
- [ ] **Perfect Round Numbers:** A "100% success rate" often suggests that the test was too easy or that test data leaked into training.
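
Part of this checklist can be automated. The sketch below covers only the dataset-distribution check and assumes you have access to the test set's labels. The helper names and the 0.9 threshold are hypothetical choices, not an established standard.

```python
from collections import Counter

def dominant_class_share(labels: list) -> tuple:
    """Return the most common label and the fraction of the set it covers."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

def flag_homogeneous(labels: list, threshold: float = 0.9) -> bool:
    """Flag a test set if one class makes up at least `threshold` of it."""
    _, share = dominant_class_share(labels)
    return share >= threshold

# A test set like the golden set from the dispute scenario.
golden_labels = ["Visa 10.4"] * 100
# A more varied set with placeholder labels.
mixed_labels = ["Visa 10.4"] * 40 + ["Mastercard"] * 30 + ["Amex"] * 30

print(dominant_class_share(golden_labels), flag_homogeneous(golden_labels))
print(dominant_class_share(mixed_labels), flag_homogeneous(mixed_labels))
```

A flagged set is not necessarily wrong, but any accuracy figure reported on it should come with the class distribution attached.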