Step-1: Setup

In [1]:
!pip install -q google-generativeai


In [2]:
import google.generativeai as genai
import os


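In [3]:
# Placeholder only: load the real key from your shell environment or a
# secrets manager instead of committing it to the notebook.
os.environ["GOOGLE_API_KEY"] = "###"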


In [4]:
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
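
A minimal sketch of one way to avoid hard-coding the key in a cell. The helper name `load_api_key` and its parameters are hypothetical (they are not part of `google.generativeai`); it reads the variable from the environment and only falls back to an interactive prompt when the variable is missing.

```python
import os
from getpass import getpass


def load_api_key(env=os.environ, var="GOOGLE_API_KEY", prompt_fn=getpass):
    """Return the API key from `env`, prompting only if it is missing.

    `load_api_key` is an illustrative helper, not a library function.
    """
    key = env.get(var, "").strip()
    if not key:
        # Fallback: ask interactively so the key never lands in the notebook.
        key = prompt_fn(f"Enter {var}: ").strip()
    if not key:
        raise RuntimeError(f"{var} is not set")
    return key
```

With this in place, the configure call would become `genai.configure(api_key=load_api_key())`.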


In [5]:
# genai and os are already imported, and genai is already configured, in the cells above.

models = genai.list_models()
for m in models:
    print(m.name, "— supports generateContent:",
          "generateContent" in m.supported_generation_methods)


models/embedding-gecko-001 — supports generateContent: False
models/gemini-2.5-flash — supports generateContent: True
models/gemini-2.5-pro — supports generateContent: True
models/gemini-2.0-flash-exp — supports generateContent: True
models/gemini-2.0-flash — supports generateContent: True
models/gemini-2.0-flash-001 — supports generateContent: True
models/gemini-2.0-flash-exp-image-generation — supports generateContent: True
models/gemini-2.0-flash-lite-001 — supports generateContent: True
models/gemini-2.0-flash-lite — supports generateContent: True
models/gemini-2.0-flash-lite-preview-02-05 — supports generateContent: True
models/gemini-2.0-flash-lite-preview — supports generateContent: True
models/gemini-exp-1206 — supports generateContent: True
models/gemini-2.5-flash-preview-tts — supports generateContent: True
models/gemini-2.5-pro-preview-tts — supports generateContent: True
models/gemma-3-1b-it — supports generateContent: True
models/gemma-3-4b-it — supports generateContent: True
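
Since only models that support `generateContent` are usable for the experiments below, the listing can be reduced to a filter. This is a sketch: `generation_models` is a hypothetical helper, and the demo objects are stand-ins that mimic only the two attributes used above (`name` and `supported_generation_methods`); the method name given for the embedding model is a placeholder.

```python
from types import SimpleNamespace


def generation_models(models):
    """Return the sorted names of models that support generateContent."""
    return sorted(
        m.name
        for m in models
        if "generateContent" in m.supported_generation_methods
    )


# Stand-in objects for illustration; real entries come from genai.list_models().
demo_models = [
    SimpleNamespace(name="models/embedding-gecko-001",
                    supported_generation_methods=["placeholderEmbedMethod"]),
    SimpleNamespace(name="models/gemma-3-1b-it",
                    supported_generation_methods=["generateContent"]),
    SimpleNamespace(name="models/gemini-2.0-flash",
                    supported_generation_methods=["generateContent"]),
]

print(generation_models(demo_models))
```

In the notebook itself the call would be `generation_models(genai.list_models())`.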

Step 2 — Fix the experiment constants

In [6]:
MODEL = "models/gemma-3-1b-it"


In [14]:
PROMPT = "List 10 startup ideas using LLMs."




Step 3 — Top-k sweep (core experiment)

In [15]:
import time

model_cache = {}

def run(topK, run_id, retries=3, delay=2):
    if topK not in model_cache:
        model_cache[topK] = genai.GenerativeModel(
            MODEL,
            generation_config={
                "temperature": 0.7,
                "top_p": 1.0,  # disable nucleus sampling for pure top-k test
                "top_k": topK,
                "max_output_tokens": 200,
            }
        )

    model = model_cache[topK]

    for attempt in range(retries):
        try:
            response = model.generate_content(PROMPT)
            text = response.text.strip()
            print(f"\n[topK={topK}] Run {run_id}")
            print(text)
            return text
        except Exception:
            # Re-raise on the final attempt; otherwise wait and retry.
            if attempt == retries - 1:
                raise
            time.sleep(delay)

# === 3 runs per topK ===
results = []

for k in [10, 50, 100]:
    for i in range(1, 4):   # runs 1, 2, 3
        output = run(k, i)
        results.append({
            "topK": k,
            "run": i,
            "output": output
        })
        time.sleep(1)  # throttle to avoid 429



[topK=10] Run 1
Okay, here are 10 startup ideas leveraging Large Language Models (LLMs), categorized by complexity and potential market:

1. **Personalized Content Curator & Summarizer (Simple):**
   * **Concept:** An app that uses an LLM to analyze a user's interests (from social media, browsing history, etc.) and then generates personalized summaries of articles, news, or research papers. It could also suggest related content.
   * **LLM Use:** Summarization, topic extraction, sentiment analysis, personalization.
   * **Monetization:** Freemium (basic features free, premium for advanced features like detailed summaries, priority access, etc.)

2. **Automated Legal Document Review (Medium):**
   * **Concept:** An LLM-powered tool that analyzes contracts, legal briefs, and other documents to identify key clauses, potential risks, and compliance issues.
   * **LLM Use:** Contract analysis, clause extraction, risk assessment

[topK=10] Run 2
Okay, here are 10 startup ideas leveraging La

In [16]:
import pandas as pd

df = pd.DataFrame(results)
df


|   | topK | run | output                                            |
| - | ---- | --- | ------------------------------------------------- |
| 0 | 10   | 1   | Okay, here are 10 startup ideas leveraging Lar... |
| 1 | 10   | 2   | Okay, here are 10 startup ideas leveraging Lar... |
| 2 | 10   | 3   | Okay, here are 10 startup ideas leveraging Lar... |
| 3 | 50   | 1   | Okay, here are 10 startup ideas leveraging Lar... |
| 4 | 50   | 2   | Okay, here are 10 startup ideas leveraging Lar... |
| 5 | 50   | 3   | Okay, here are 10 startup ideas leveraging Lar... |
| 6 | 100  | 1   | Okay, here are 10 startup ideas leveraging Lar... |
| 7 | 100  | 2   | Okay, here are 10 startup ideas leveraging Lar... |
| 8 | 100  | 3   | Okay, here are 10 startup ideas leveraging Lar... |
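
The sweep loop is the same for every parameter tested in this notebook, so it can be factored out. The sketch below assumes a hypothetical `sweep` helper that takes any callable producing text for a given parameter value; in the notebook that callable would wrap `model.generate_content`, but here it is left abstract so the loop logic can be checked without network access.

```python
import time


def sweep(generate, param_name, values, runs=3, pause=0.0):
    """Call `generate(value)` `runs` times per value and collect rows.

    `generate` is any callable returning the output text for one setting.
    """
    rows = []
    for value in values:
        for run_id in range(1, runs + 1):
            rows.append({
                param_name: value,
                "run": run_id,
                "output": generate(value),
            })
            if pause:
                time.sleep(pause)  # throttle between requests
    return rows


# Illustration with a fake generator instead of a real API call.
demo_rows = sweep(lambda k: f"output for top_k={k}", "topK", [10, 50, 100])
print(len(demo_rows))
```

The returned list has the same shape as `results` above, so `pd.DataFrame(rows)` works unchanged.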


| Top-k   | Idea Repetition | Depth vs Breadth                   | Generic Ideas | What Actually Happened        |
| ------- | --------------- | ---------------------------------- | ------------- | ----------------------------- |
| **10**  | Very high       | Narrow breadth, shallow depth      | Immediate     | Hard collapse at early tokens |
| **50**  | Very high       | Slightly wider wording, same ideas | Immediate     | Collapse unchanged            |
| **100** | Very high       | More adjectives, same structure    | Immediate     | Noise added, no diversity     |


Top-k did not control diversity here. In list-based ideation prompts with strong priors, early-token dominance causes a collapse that top-k cannot fix: the collapse occurs before top-k matters, because the opening tokens are already chosen with high confidence. Adaptive probability-mass control (top-p) is the natural next candidate for safe diversity.

For this prompt, variance never became instability: widening top-k (with temperature held at 0.7) changed phrasing but did not break structure or semantic correctness.
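
The "Idea Repetition" column is a judgment made by reading the outputs. One rough way to put a number on it is the mean pairwise text similarity within each setting. This is a sketch using the standard library's `difflib.SequenceMatcher`; the helper names are hypothetical, and character-level similarity is only a proxy for idea-level repetition.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def mean_pairwise_similarity(texts):
    """Average SequenceMatcher ratio over all pairs (1.0 means identical)."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        return 1.0  # a single output is trivially identical to itself
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


def similarity_by_setting(results, key):
    """Group result rows by `key` (e.g. "topK") and score each group."""
    groups = defaultdict(list)
    for row in results:
        groups[row[key]].append(row["output"])
    return {value: mean_pairwise_similarity(texts) for value, texts in groups.items()}
```

Calling `similarity_by_setting(results, "topK")` on the sweep results gives one score per top-k value; scores that stay close to 1.0 across settings would be consistent with the collapse described above.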

In [18]:
import time

model_cache = {}

def run(topP, run_id, retries=3, delay=2):
    # Cache per topP value
    if topP not in model_cache:
        model_cache[topP] = genai.GenerativeModel(
            MODEL,
            generation_config={
                "temperature": 0.7,
                "top_p": topP,
                "top_k": 50,              # fixed top-k
                "max_output_tokens": 200,
            }
        )

    model = model_cache[topP]

    for attempt in range(retries):
        try:
            response = model.generate_content(PROMPT)
            text = response.text.strip()
            print(f"\n[topP={topP}] Run {run_id}")
            print(text)
            return text
        except Exception:
            # Re-raise on the final attempt; otherwise wait and retry.
            if attempt == retries - 1:
                raise
            time.sleep(delay)

# === 3 runs per topP ===
results = []

for p in [0.9, 0.95, 0.99]:
    for i in range(1, 4):
        output = run(p, i)
        results.append({
            "topP": p,
            "run": i,
            "output": output
        })
        time.sleep(1)  # throttle to avoid 429



[topP=0.9] Run 1
Okay, here are 10 startup ideas leveraging Large Language Models (LLMs), categorized by complexity and potential market:

1. **Personalized Content Curator & Summarizer (Simple):**
   * **Concept:** An app that uses an LLM to analyze a user's interests (from social media, browsing history, etc.) and then generates personalized summaries of articles, news, or research papers. It could also suggest related content.
   * **LLM Use:** Summarization, topic extraction, sentiment analysis, personalization.
   * **Monetization:** Freemium (basic features free, premium for advanced summarization, ad-free experience).

2. **Automated Legal Document Review (Medium):**
   * **Concept:** An LLM-powered tool that analyzes contracts, legal briefs, and other documents to identify key clauses, potential risks, and compliance issues.
   * **LLM Use:** Contract analysis, legal jargon understanding, risk assessment,

[topP=0.9] Run 2
Okay, here are 10 startup ideas leveraging Large Langu

Observation:

- Zero structural divergence
- Same ideas
- Same ordering
- Same framing
- Same priors
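
"Same ideas" and "same ordering" can be checked mechanically by pulling the bold idea titles out of each run and comparing the sets. The sketch below assumes the title formats seen in the outputs in this notebook (`1. **Title:**` and `**1.  Title**`); the regular expression and helper names are illustrative and would need adjusting for other output styles.

```python
import re

# A line that is only an optionally numbered bold title.
TITLE_RE = re.compile(
    r"^[ \t]*(?:\d+\.[ \t]*)?\*\*(?:\d+\.[ \t]*)?(.+?)\*\*[ \t]*$",
    re.MULTILINE,
)


def idea_titles(text):
    """Return the set of normalized idea titles found in one output."""
    return {
        m.group(1).strip().rstrip(":").strip().lower()
        for m in TITLE_RE.finditer(text)
    }


def jaccard(a, b):
    """Jaccard overlap of two sets (1.0 means the same titles)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

A Jaccard score near 1.0 between two runs means they proposed the same titles; comparing `idea_titles` of run 1 and run 2 for each setting turns the observation list above into numbers.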

In [19]:
import pandas as pd

df = pd.DataFrame(results)
df


|   | topP | run | output                                            |
| - | ---- | --- | ------------------------------------------------- |
| 0 | 0.9  | 1   | Okay, here are 10 startup ideas leveraging Lar... |
| 1 | 0.9  | 2   | Okay, here are 10 startup ideas leveraging Lar... |
| 2 | 0.9  | 3   | Okay, here are 10 startup ideas leveraging Lar... |
| 3 | 0.95 | 1   | Okay, here are 10 startup ideas leveraging Lar... |
| 4 | 0.95 | 2   | Okay, here are 10 startup ideas leveraging Lar... |
| 5 | 0.95 | 3   | Okay, here are 10 startup ideas leveraging Lar... |
| 6 | 0.99 | 1   | Okay, here are 10 startup ideas leveraging Lar... |
| 7 | 0.99 | 2   | Okay, here are 10 startup ideas leveraging Lar... |
| 8 | 0.99 | 3   | Okay, here are 10 startup ideas leveraging Lar... |


| Top-p    | Idea Repetition | Depth vs Breadth              | Generic Ideas       | What Actually Changed                     |
| -------- | --------------- | ----------------------------- | ------------------- | ----------------------------------------- |
| **0.9**  | Extremely high  | Narrow breadth, shallow depth | Immediate (idea #1) | Conservative path locked instantly        |
| **0.95** | Extremely high  | Slight wording variation only | Immediate           | Marginal flexibility, same priors         |
| **0.99** | Extremely high  | Slight phrasing noise         | Immediate           | More tokens allowed, no structural change |


Top-p controls safe diversity only when multiple continuations already compete early.
When the model is highly confident, both top-k and top-p become irrelevant.
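
A toy illustration of why both filters become irrelevant under high confidence. The sketch applies top-k and then top-p to a hand-written next-token distribution and returns which token indices survive. The probabilities are made up for illustration, and the order of filtering (top-k first, then top-p) is an assumption, not a statement about how the API implements it.

```python
def surviving_tokens(probs, top_k, top_p):
    """Indices left after keeping the top_k tokens, then the smallest
    prefix of those whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked[:top_k]:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept


# Made-up distributions: one confident, one uncertain.
peaked = [0.995, 0.002, 0.001, 0.001, 0.001]
flat = [0.2, 0.2, 0.2, 0.2, 0.2]

for k in (10, 50, 100):
    for p in (0.9, 0.95, 0.99):
        # Always a single surviving token: there is nothing left to sample from.
        assert surviving_tokens(peaked, k, p) == [0]

print(len(surviving_tokens(flat, 50, 0.9)))  # all five tokens compete
print(len(surviving_tokens(flat, 2, 0.9)))   # top-k now binds: two tokens
```

With the peaked distribution every setting swept above keeps exactly one token, which matches the observation that the outputs did not change; the filters only bite once the distribution is flat enough for several tokens to compete.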

In [20]:
PROMPT = """
You are brainstorming under uncertainty.

Invent 6 startup ideas that use LLMs, but obey these rules:

- Each idea must belong to a DIFFERENT industry
- Avoid common categories (chatbots, summarization, legal, education, customer support)
- Each idea must describe:
  • a specific user
  • a concrete pain
  • a non-obvious LLM capability being exploited
- Do NOT use bullet points
- Do NOT number the ideas
- Start each idea with a short, unusual title (3–6 words)

Think step by step before writing each idea.
"""

In [21]:

model_cache = {}

def run(topP, run_id, retries=3, delay=2):
    if topP not in model_cache:
        model_cache[topP] = genai.GenerativeModel(
            MODEL,
            generation_config={
                "temperature": 0.7,
                "top_p": topP,
                "top_k": 50,              # fixed safety cap
                "max_output_tokens": 400,
            }
        )

    model = model_cache[topP]

    for attempt in range(retries):
        try:
            response = model.generate_content(PROMPT)
            text = response.text.strip()
            print(f"\n[topP={topP}] Run {run_id}")
            print(text)
            return text
        except Exception:
            # Re-raise on the final attempt; otherwise wait and retry.
            if attempt == retries - 1:
                raise
            time.sleep(delay)

# === 2 runs per topP (range(1, 3) yields runs 1 and 2) ===
results = []

for p in [0.9, 0.95, 0.99]:
    for i in range(1, 3):
        output = run(p, i)
        results.append({
            "topP": p,
            "run": i,
            "output": output
        })
        time.sleep(1)  # throttle to avoid 429



[topP=0.9] Run 1
Okay, let’s dive into this – generating 6 startup ideas leveraging LLMs, adhering to the specified constraints. Here we go:

**1.  Culinary Cartographer**

The user is a seasoned but increasingly frustrated home cook struggling with complex recipes and precise ingredient measurements. They want to recreate dishes from their grandmother’s cookbooks but often struggle with the nuances and subtle flavor combinations.

The LLM capability is “Flavor Synthesis & Contextualization.” It doesn’t just regurgitate recipes; it analyzes a recipe, identifies key ingredients and their relative importance, and then generates a detailed, step-by-step guide *including* suggested flavor pairings and potential substitutions based on regional variations and the cook’s personal preferences, all while considering the cook’s past cooking history.

**2.  Temporal Artisan**

This startup caters to collectors of vintage tools and equipment, particularly those interested in the history and craft

| Top-p    | Idea Repetition | Depth vs Breadth             | Coherence | Failure Mode Observed                                |
| -------- | --------------- | ---------------------------- | --------- | ---------------------------------------------------- |
| **0.9**  | Low–moderate    | High depth, narrower breadth | Very high | Occasional truncation, conservative exploration      |
| **0.95** | Low             | Best balance                 | High      | Minor repetition across runs, but structurally sound |
| **0.99** | Very low        | Breadth ↑, depth uneven      | Medium    | Drift, unfinished ideas, verbosity creep             |
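
Truncation and unfinished ideas show up in the table as failure modes, and with `max_output_tokens` capped they are easy to confuse with sampling effects. A crude heuristic, sketched below with a hypothetical `looks_truncated` helper, flags outputs that do not end on terminal punctuation. The API response also carries a finish reason on each candidate, which would be the more reliable signal; that field is not used here and its exact values should be checked against the library documentation.

```python
# Characters that plausibly end a finished answer (including closing markdown bold).
TERMINATORS = (".", "!", "?", ")", "*", '"', "'")


def looks_truncated(text):
    """Heuristic: non-empty text that does not end on terminal punctuation."""
    stripped = text.rstrip()
    return bool(stripped) and not stripped.endswith(TERMINATORS)


print(looks_truncated("particularly those interested in the history and craft"))
print(looks_truncated("all while considering the cook's past cooking history."))
```

Applying it to each row, for example `sum(looks_truncated(r["output"]) for r in results)`, gives a per-sweep count of cut-off outputs to set beside the qualitative table.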


In [22]:


PROMPT = """
You are brainstorming under uncertainty.

Invent 6 startup ideas that use LLMs, but obey these rules:

- Each idea must belong to a DIFFERENT industry
- Avoid common categories (chatbots, summarization, legal, education, customer support)
- Each idea must describe:
  • a specific user
  • a concrete pain
  • a non-obvious LLM capability being exploited
- Do NOT use bullet points
- Do NOT number the ideas
- Start each idea with a short, unusual title (3–6 words)
"""

PRESETS = {
    "A_baseline_safe": {
        "temperature": 0.7,
        "top_p": 0.95,
        "top_k": 50,
    },
    "B_collapse_probe": {
        "temperature": 0.3,
        "top_p": 0.9,
        "top_k": 20,
    },
    "C_drift_probe": {
        "temperature": 0.9,
        "top_p": 0.99,
        "top_k": 80,
    },
    "D_overconstrained": {
        "temperature": 0.2,
        "top_p": 0.8,
        "top_k": 10,
    },
}

results = {}

for name, cfg in PRESETS.items():
    model = genai.GenerativeModel(
        MODEL,
        generation_config={
            "temperature": cfg["temperature"],
            "top_p": cfg["top_p"],
            "top_k": cfg["top_k"],
            "max_output_tokens": 500,
        }
    )

    print(f"\n=== {name} ===")
    print(
        f"(temp={cfg['temperature']}, top_p={cfg['top_p']}, top_k={cfg['top_k']})"
    )

    response = model.generate_content(PROMPT)
    text = response.text.strip()
    print(text)

    results[name] = {
        "config": cfg,
        "output": text,
    }

    time.sleep(1)  # throttle



=== A_baseline_safe ===
(temp=0.7, top_p=0.95, top_k=50)
Here are six startup ideas leveraging LLMs, adhering to your specific constraints:

**1.  Culinary Cartographer**

The user is a seasoned but increasingly frustrated home cook struggling with complex recipes and ingredient substitutions. They want to create truly personalized meal plans but lack the time and knowledge to meticulously research and adapt. The LLM’s capability is its ability to generate detailed, step-by-step instructions *and* instantly translate ingredient substitutions into perfectly balanced flavor profiles, factoring in regional cuisines and dietary restrictions.

**2.  Shadow Sculptor**

This startup caters to aspiring digital artists and designers who lack the technical skill to realize their visions. The user is a conceptual artist struggling to translate intricate ideas into tangible 3D models. The LLM analyzes the user’s sketches and descriptions, then generates a series of variations – not just visual re

| Preset                  | What we saw                                                      | Core behavior             | Verdict                     |
| ----------------------- | ----------------------------------------------------------------- | ------------------------- | --------------------------- |
| **A — Baseline Safe**   | Diverse, coherent, detailed ideas; minor truncation at the end    | Controlled exploration    | Best default              |
| **B — Collapse Probe**  | Depth remained, but structure started repeating; output shortened | Early constraint pressure | Collapse visible          |
| **C — Drift Probe**     | High novelty, wider themes, verbosity, unfinished ideas           | Creative drift            | Drift visible             |
| **D — Overconstrained** | Safe, conservative, strongly patterned ideas                      | Deterministic bias        | Over-constraint confirmed |
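
The prompt forbids bullet points and numbering, yet the sample output above starts its ideas with `**1.  Culinary Cartographer**`. A small checker makes such rule violations visible per preset. This is a sketch: the two patterns cover only the bullet and numbering rules, and the helper name `rule_violations` is hypothetical.

```python
import re

# A line starting with a bullet marker followed by whitespace.
BULLET_RE = re.compile(r"^[ \t]*[-*•][ \t]+", re.MULTILINE)
# A line starting with "1." or "1)" (optionally inside markdown bold).
NUMBER_RE = re.compile(r"^[ \t]*(?:\*\*)?\d+[.)][ \t]", re.MULTILINE)


def rule_violations(text):
    """Return the list of formatting rules from the prompt that `text` breaks."""
    violations = []
    if BULLET_RE.search(text):
        violations.append("bullet points")
    if NUMBER_RE.search(text):
        violations.append("numbered ideas")
    return violations


sample = "**1.  Culinary Cartographer**\n\nThe user is a seasoned home cook."
print(rule_violations(sample))
```

Running `rule_violations(results[name]["output"])` for each preset adds an instruction-following check alongside the diversity and coherence verdicts in the table.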




**Core idea:** Sampling controls *expression under uncertainty*, not intelligence.

### What each knob really does
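- **Temperature** → rescales the token distribution before sampling (lower values sharpen it toward the top token, higher values flatten it)
- **Top-p** → keeps the smallest set of tokens whose cumulative probability reaches p (adaptive exploration that widens only when the model is uncertain)
- **Top-k** → hard cap on the number of candidate tokens (trims tail noise, but cannot prevent collapse)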
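
To make the temperature knob concrete, here is a minimal sketch of temperature-scaled softmax over a made-up set of logits. The logit values are illustrative only; the temperatures are the ones used in the presets above.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.2, 0.7, 0.9):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

The probability of the top token shrinks as the temperature rises, which is why the low-temperature presets lean deterministic and the high-temperature preset drifts.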

### Key learnings
- Early tokens dominate output; if the start collapses, sampling won’t help.
- Top-p is the real diversity dial *only when early uncertainty exists*.
- Top-k does not create creativity; set it once (~50) and forget it.
- Prompt design determines whether sampling parameters even matter.
- One run per deliberately chosen regime (baseline, collapse probe, drift probe, over-constrained) is more informative than a brute-force grid search.


