----------------------------------
#### **Top-p Sampling in OpenAI GPT**
----------------------------------

**Description:**
- Top-p sampling, also known as "nucleus sampling," is a decoding technique used during text generation.
- Instead of sampling from the entire vocabulary (unrestricted sampling) or always taking the single most likely token (greedy decoding), top-p sampling draws from the smallest set of highest-probability tokens whose cumulative probability reaches a specified threshold \( p \).
- This allows the model to dynamically adjust the number of tokens it considers at each step based on the probability distribution of the tokens.

- **top_p = 1**: This includes all tokens in the probability distribution (equivalent to no restriction).
- **top_p < 1**: The model samples from a smaller set of high-probability tokens. This trims the unlikely tail, which reduces randomness and helps coherence while still allowing variation among the plausible tokens.


**Example Illustration:**
- If `top_p = 0.9`, the model will consider the smallest subset of tokens whose cumulative probability is at least 90%. The remaining tokens are ignored, preventing low-probability (often irrelevant) tokens from being selected.
  
For instance, if a model predicts the following probabilities for the next word:
- "dog": 0.4
- "cat": 0.3
- "mouse": 0.2
- "elephant": 0.05
- "lion": 0.05

With **top_p = 0.9**, only the first three words ("dog", "cat", "mouse") will be considered, as their cumulative probability is 0.9. The tokens "elephant" and "lion" will be excluded from sampling.
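
The selection step in this example can be sketched in a few lines of plain Python. This is an illustrative toy, not how the OpenAI API is implemented: the names `nucleus` and `sample_top_p` and the tolerance `eps` are assumptions introduced here. The tolerance matters because summing 0.4, 0.3 and 0.2 in floating point can land a hair below 0.9.

```python
import random

def nucleus(probs, p, eps=1e-9):
    """Return the smallest set of top tokens whose cumulative probability
    reaches p, with the kept probabilities renormalized to sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        total += prob
        if total >= p - eps:  # eps guards against floating-point rounding
            break
    return {token: prob / total for token, prob in kept}

def sample_top_p(probs, p, rng=random):
    """Draw one token at random from the nucleus (hypothetical helper)."""
    pool = nucleus(probs, p)
    return rng.choices(list(pool), weights=list(pool.values()), k=1)[0]

# The toy distribution from the example above
probs = {"dog": 0.4, "cat": 0.3, "mouse": 0.2, "elephant": 0.05, "lion": 0.05}

pool = nucleus(probs, 0.9)
print(pool)                      # only "dog", "cat" and "mouse" remain
print(sample_top_p(probs, 0.9))  # one of those three, chosen at random
```

After filtering, the three surviving probabilities are rescaled so they sum to 1, which is why "dog" ends up slightly more likely inside the nucleus than it was in the full distribution.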

#### Greedy Decoding:
- **How it works:** Selects the token with the highest probability at each step.
- **Result:** Deterministic and often repetitive output.
- **Use case:** Best for short, factual responses where randomness is not desired.

#### Top-p Sampling:
- **How it works:** Selects a token from the smallest subset of tokens whose cumulative probability meets or exceeds the threshold `p`. Randomly samples from this subset.
- **Result:** More creative and varied output, with controlled randomness.
- **Use case:** Ideal for tasks that benefit from creativity or diverse responses, like story generation or dialogue systems.

#### Key Difference:
- Greedy decoding picks the single highest-probability token every time.
- Top-p sampling considers a range of tokens and samples from them, depending on the set defined by `p`, leading to more diverse outputs.
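
To make the contrast concrete, the sketch below runs both strategies 1,000 times on the toy distribution from the earlier example and counts the picks. The helper names `greedy_pick` and `top_p_pick` are hypothetical, and the fixed seed is only there to make the run repeatable.

```python
import random
from collections import Counter

probs = {"dog": 0.4, "cat": 0.3, "mouse": 0.2, "elephant": 0.05, "lion": 0.05}

def greedy_pick(probs):
    # Greedy decoding: always the single most probable token
    return max(probs, key=probs.get)

def top_p_pick(probs, p, rng):
    # Top-p: build the smallest high-probability pool reaching p, then sample
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    pool, total = [], 0.0
    for token, prob in ranked:
        pool.append((token, prob))
        total += prob
        if total >= p - 1e-9:  # small tolerance for floating-point rounding
            break
    tokens, weights = zip(*pool)
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded for repeatability
greedy_counts = Counter(greedy_pick(probs) for _ in range(1000))
top_p_counts = Counter(top_p_pick(probs, 0.9, rng) for _ in range(1000))

print(greedy_counts)  # every pick is "dog"
print(top_p_counts)   # a mix of "dog", "cat" and "mouse"; never "elephant" or "lion"
```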


In [1]:
import openai

In [2]:
from openai import OpenAI

In [3]:
client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    # api_key = openai_api_key
)

In [5]:
prompt="Once upon a time, in a distant land, there was a kingdom ruled by a wise and just king who"

In [6]:
# Near-deterministic decoding: temperature 0 combined with a very small top_p
response_greedy = client.chat.completions.create(
        model      = "gpt-3.5-turbo",  # Any chat-completions model can be substituted
        messages   = [
            {"role": "user", "content": prompt}
        ],
        max_tokens = 150,
        top_p      = 0.15,  # Keep only the tokens covering the top 15% of probability mass
        temperature= 0.0    # Temperature 0 makes the output close to greedy decoding
    )

print(response_greedy.choices[0].message.content)

 was loved by all his subjects. The kingdom was prosperous and peaceful, and the people lived happily under the king's rule.

One day, a dark cloud appeared on the horizon. A neighboring kingdom, ruled by a cruel and power-hungry queen, declared war on the peaceful kingdom. The queen's army was vast and powerful, and it seemed like the peaceful kingdom stood no chance against their might.

The king knew that he had to protect his people and his kingdom at all costs. He called upon his bravest knights and soldiers to defend the kingdom against the invading army. The people of the kingdom rallied behind their king, ready to fight for their home and their freedom.

The battle was fierce and bloody, but the king's army fought with


#### Key Differences in How `temperature` and `top_p` Influence Randomness

##### Temperature alone can introduce randomness:
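- **What it does**: It rescales the entire probability distribution by dividing the logits by `temperature` before the softmax. Lowering `temperature` sharpens the distribution and makes the model more deterministic (approaching greedy decoding as it nears 0), while raising it flattens the distribution and allows more exploration of less probable tokens.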
- **Effect**: It introduces **randomness globally** across all tokens.
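
A minimal sketch of temperature scaling, assuming the standard softmax formulation. The function name `softmax_with_temperature` and the logit values are made up for illustration; the API does not expose raw logits this way.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the low temperature almost all of the probability sits on the first token, while at the high temperature the four values move closer together. Every token's probability changes, which is the sense in which temperature acts globally.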

##### Top-p alone can influence randomness, but in a more controlled way:
- **What it does**: It restricts the model to the smallest subset of the most probable tokens whose cumulative probability reaches the threshold `p`, then samples randomly from that subset.
- **Effect**: It introduces **randomness locally** within the top probable tokens but prevents the model from choosing extremely low-probability tokens.

---

##### Why `top_p` Works Without Temperature:
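Even when `temperature` is left unset (the OpenAI API then uses its default of 1), **top-p sampling** (or nucleus sampling) still produces random output: it restricts the token pool to the smallest set of top tokens whose cumulative probability reaches `p`, then samples from that pool. Note that `p` is a share of probability mass, not a percentage of the vocabulary.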

##### Example:
- If you set `top_p = 0.9`, the model will sample tokens only from the smallest subset whose cumulative probability reaches 90%.
- This **limits randomness** to the most probable tokens but still samples randomly from this subset.
- **Temperature is not required** for top-p to work because it inherently involves randomness in selecting among the probable tokens.

---

##### When Does `top_p` Need Temperature?

- **top_p** restricts the token pool to the most probable tokens.
- **temperature** controls how sharply peaked the probabilities are, and therefore how random the choice among candidate tokens is.

When you combine both:
- `top_p` defines the **subset** of most probable tokens.
- `temperature` **reshapes the probabilities**: values above 1 flatten them so that more tokens have a realistic chance of being picked, while values below 1 sharpen them toward the top token, approaching deterministic selection.

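By setting both, you gain **finer control** over the randomness in the generated output, although OpenAI's API reference generally recommends altering `temperature` or `top_p` rather than both at once.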
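
The interaction of the two parameters can be sketched with a toy sampler. It applies temperature first and the top-p filter second, which is the order used by common open-source samplers; the order inside OpenAI's models is an assumption here, not something this notebook confirms. The function name `sample_next_token`, the logit values, and the treatment of temperature 0 as greedy are all choices made for this illustration.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Toy sampler: temperature scaling first, then nucleus (top-p) filtering."""
    if temperature <= 0:
        # Convention in this sketch: treat temperature 0 as greedy decoding
        return max(logits, key=logits.get)
    # 1. Temperature: rescale the logits, then apply a numerically stable softmax
    scaled = {tok: val / temperature for tok, val in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(val - peak) for tok, val in scaled.items()}
    norm = sum(exps.values())
    probs = {tok: e / norm for tok, e in exps.items()}
    # 2. Top-p: keep the smallest high-probability pool whose mass reaches top_p
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    pool, mass = [], 0.0
    for tok, prob in ranked:
        pool.append((tok, prob))
        mass += prob
        if mass >= top_p - 1e-9:  # tolerance for floating-point rounding
            break
    # 3. Sample from the pool (random.choices accepts unnormalized weights)
    toks, weights = zip(*pool)
    return rng.choices(toks, weights=weights, k=1)[0]

# Hypothetical logits for five candidate tokens
logits = {"dog": 2.0, "cat": 1.7, "mouse": 1.3, "elephant": -0.1, "lion": -0.1}
rng = random.Random(0)  # seeded for repeatability

narrow = {sample_next_token(logits, temperature=1.0, top_p=0.15, rng=rng) for _ in range(200)}
wide = {sample_next_token(logits, temperature=1.5, top_p=1.0, rng=rng) for _ in range(200)}
print(narrow)  # a single token: the top token alone already covers 15% of the mass
print(wide)    # several different tokens
```

With these logits, `top_p=0.15` collapses the pool to the single most likely token, so the result is deterministic whatever the temperature. This mirrors the API call earlier in the notebook, where a small `top_p` together with `temperature=0.0` leaves essentially no room for variation.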