------------
# **Topic: The "Dark Matter" Bridge – Volume Estimation & Probabilistic Attribution**

----------------

## **Coding Task 1 – The "Calibration" (Volume Estimation)**

**Objective:** Build a dataset of Intent Clusters and estimate their "Hidden" AI Volume.

### **Step 1: The "Dark Matter" Generation (AI Prompt Creation)**


## **Objective**
In this step, we generate a synthetic dataset of **AI-driven user prompts** related to *buying a Compact SUV in India*.  
These prompts represent the **“Dark Matter”** of customer journeys: conversations that happen inside AI tools and are invisible to traditional analytics.

---

## What We Will Do

- Define **intent clusters** that reflect real user decision-making while buying a compact SUV.
- Use a **Large Language Model (LLM)** to generate natural, human-like prompts for each intent.
- Ensure prompts are:
  - Contextual to the Indian market
  - A mix of simple factual queries and complex reasoning-based questions
  - Unique (no duplicates)
- Generate prompts **incrementally with checkpoints**, so the process can be safely resumed if interrupted.
- Build a structured dataset that will later be used for:
  - AI volume estimation
  - Probabilistic attribution to website sessions
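
The checkpoint-and-deduplicate idea can be sketched in a few lines. This is a minimal illustration using only the standard library, not the pandas implementation used in this notebook; the helper `append_unique`, the two-column CSV layout, and the sample prompts are assumptions made for the example.

```python
import csv
import hashlib
import os
import tempfile


def text_hash(text):
    """Hash a prompt after lowercasing and trimming, so near-identical copies collide."""
    return hashlib.md5(text.lower().strip().encode()).hexdigest()


def append_unique(path, candidates):
    """Append unseen prompts to a CSV checkpoint and return how many were added."""
    seen = set()
    is_new_file = not os.path.exists(path)
    if not is_new_file:
        # Resume: reload the hashes that were saved by earlier runs
        with open(path, newline="", encoding="utf-8") as f:
            seen = {row["prompt_hash"] for row in csv.DictReader(f)}

    added = 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["prompt_text", "prompt_hash"])
        if is_new_file:
            writer.writeheader()
        for text in candidates:
            h = text_hash(text)
            if h in seen:
                continue  # duplicate of something already stored
            writer.writerow({"prompt_text": text, "prompt_hash": h})
            seen.add(h)
            added += 1
    return added


# Hypothetical checkpoint file in a temporary directory
checkpoint = os.path.join(tempfile.mkdtemp(), "prompts_checkpoint.csv")
first_run = append_unique(
    checkpoint, ["Is the Nexon safe?", "is the nexon safe? ", "Creta or Seltos?"]
)
second_run = append_unique(checkpoint, ["Creta or Seltos?", "Best SUV under 12 lakhs"])
print(first_run, second_run)
```

Because the hashes are reloaded from disk on every call, an interrupted run can be restarted without re-adding prompts it has already stored.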

---

## Why This Matters

Traditional analytics can track *keywords and clicks*, but not *AI conversations*.  
By simulating realistic AI prompts at scale, we create a foundation to:

- Estimate hidden AI search demand
- Understand user intent beyond keywords
- Bridge the gap between AI discovery and website traffic

---

## Output of This Step

A dataframe containing approximately **1,000 unique AI prompts**, with:
- Intent cluster labels
- Prompt text
- Metadata for further modeling

This dataset becomes the input for **Volume Estimation (Step 2)**.

--------

In [3]:
#Import Libraries
import os
import time
import random
import hashlib
from datetime import datetime

# Data handling
import pandas as pd
import numpy as np

# Display
from IPython.display import display

In [7]:
!pip install -q groq


In [8]:
from groq import Groq

client = Groq(api_key="REMOVED_FOR_SECURITY")

print("✅ Groq client initialized")


✅ Groq client initialized


**⚠️ Note:**

API calls were used during development to generate AI prompts.

They have been disabled in the public version for security reasons.


In [9]:
#sanity check
resp = client.chat.completions.create(
    model="llama-3.1-8b-instant", # Changed to a currently supported model based on recent Groq updates
    messages=[
        {"role": "user", "content": "Generate 5 questions someone in India might ask before buying a compact SUV."}
    ],
    temperature=0.8
)

print(resp.choices[0].message.content)

Here are five potential questions someone in India might ask before buying a compact SUV:

1. 'What is the ground clearance of the vehicle, and can it handle off-road conditions such as unpaved roads and rough terrain in India?'

This question is relevant because India has diverse terrain, and the buyer would want to know if the compact SUV can handle rough roads and off-road conditions.

2. 'Which safety features are standard in this vehicle, such as airbags, ABS, and electronic stability control?'

This question is important for Indian buyers as safety features have become a priority in recent years, and the buyer would want to know if the compact SUV has the necessary safety features.

3. 'What is the fuel efficiency of the vehicle, and does it have a diesel or petrol option? Also, what is the claimed mileage under normal driving conditions?'

This question is relevant because fuel efficiency is a major concern for Indian buyers due to rising fuel costs, and they would want to know 

In [10]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [11]:
#Intent Registry
INTENT_CLUSTERS = [
    {"intent_id": 1, "intent_name": "Safety & NCAP", "target": 60},
    {"intent_id": 2, "intent_name": "Family & Comfort", "target": 55},
    {"intent_id": 3, "intent_name": "Price & Budget", "target": 60},
    {"intent_id": 4, "intent_name": "EMI & Finance", "target": 55},
    {"intent_id": 5, "intent_name": "Comparison & Elimination", "target": 65},
    {"intent_id": 6, "intent_name": "Variants & Features", "target": 55},
    {"intent_id": 7, "intent_name": "Mileage & Fuel Type", "target": 55},
    {"intent_id": 8, "intent_name": "Maintenance & Service", "target": 50},
    {"intent_id": 9, "intent_name": "Resale Value", "target": 50},
    {"intent_id": 10, "intent_name": "City vs Highway Usage", "target": 55},
    {"intent_id": 11, "intent_name": "ADAS & Technology", "target": 55},
    {"intent_id": 12, "intent_name": "Automatic vs Manual", "target": 50},
    {"intent_id": 13, "intent_name": "Boot Space & Practicality", "target": 50},
    {"intent_id": 14, "intent_name": "First-Time Buyer", "target": 55},
    {"intent_id": 15, "intent_name": "Test Drive Readiness", "target": 50},
    {"intent_id": 16, "intent_name": "Offers & Discounts", "target": 50},
    {"intent_id": 17, "intent_name": "Brand Trust & Perception", "target": 50},
    {"intent_id": 18, "intent_name": "Final Shortlisting", "target": 55},
]

In [12]:
#Output file + resume-safe load
OUTPUT_FILE = "/content/drive/MyDrive/ai_dark_matter_prompts.csv"

if os.path.exists(OUTPUT_FILE):
    prompts_df = pd.read_csv(OUTPUT_FILE)
    print(f"🔁 Resuming with {len(prompts_df)} prompts")
else:
    prompts_df = pd.DataFrame(columns=[
        "prompt_id",
        "intent_id",
        "intent_name",
        "prompt_text",
        "prompt_hash",
        "generated_at"
    ])
    print("🆕 Starting fresh dataset")

# Defined outside the if/else so it exists on both fresh and resumed runs
MODEL_NAME = "llama-3.1-8b-instant"

🔁 Resuming with 1000 prompts


In [13]:
def text_hash(text):
    return hashlib.md5(text.lower().strip().encode()).hexdigest()

existing_hashes = set(prompts_df["prompt_hash"].values)
PROMPT_ID_START = len(prompts_df) + 1

In [32]:
#Prompt chunk generator
def generate_prompt_chunk(intent_name, chunk_size=15):
    prompt = f"""
Generate {chunk_size} DISTINCT and REALISTIC user questions related to:
"{intent_name}" while buying a compact SUV in India.

Rules:
- Each question must explore a DIFFERENT angle. ALL unique questions.
- Mix simple, complex, emotional, comparison, and scenario-based queries
- Cover different models (Tata Nexon, Hyundai Creta, Maruti Grand Vitara, Kia Seltos)
- Indian context (traffic, family, budget, weather, resale)
- One question per line
- No numbering or bullet points
"""
    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9
    )

    text = response.choices[0].message.content
    return [line.strip() for line in text.split("\n") if line.strip()]

In [35]:
#main loop
from datetime import datetime, timezone

CHUNK_SIZE = 15

for intent in INTENT_CLUSTERS:
    intent_id = intent["intent_id"]
    intent_name = intent["intent_name"]
    target = intent["target"]

    current = len(prompts_df[prompts_df["intent_id"] == intent_id])
    remaining = target - current

    if remaining <= 0:
        print(f"\n✅ Intent '{intent_name}' already completed")
        continue

    print("\n" + "=" * 60)
    print(f"\n🚀 Intent: {intent_name}")
    print(f"Target: {target} | Existing: {current}")
    print("=" * 60)

    chunk_no = 1

    while remaining > 0:
        size = min(CHUNK_SIZE, remaining)
        print(f"  \n🔹 Chunk {chunk_no}: generating {size} prompts")

        prompts = generate_prompt_chunk(intent_name, size)

        new_rows = []
        for text in prompts:
            h = text_hash(text)
            if h in existing_hashes:
                continue

            new_rows.append({
                "prompt_id": PROMPT_ID_START,
                "intent_id": intent_id,
                "intent_name": intent_name,
                "prompt_text": text,
                "prompt_hash": h,
                "generated_at": datetime.now(timezone.utc).isoformat()
            })

            existing_hashes.add(h)
            PROMPT_ID_START += 1

            if len(new_rows) >= remaining:
                break

        if new_rows:
            prompts_df = pd.concat([prompts_df, pd.DataFrame(new_rows)], ignore_index=True)
            prompts_df.to_csv(OUTPUT_FILE, index=False)
            remaining -= len(new_rows)
            print(f"  ✅ Added {len(new_rows)} | Remaining: {remaining}")
        else:
            print(" \n ⚠️ No new unique prompts in this chunk")

        chunk_no += 1
        time.sleep(8 + random.uniform(2, 4))

    print(f"\n🎯 Completed intent: {intent_name}")
    print("\n⏸ Cooling down before next intent...\n")
    time.sleep(25 + random.uniform(5, 8))



🚀 Intent: Safety & NCAP
Target: 60 | Existing: 46

🔹 Chunk 1: generating 14 prompts
  ✅ Added 14 | Remaining: 0

🎯 Completed intent: Safety & NCAP

⏸ Cooling down before next intent...



🚀 Intent: Family & Comfort
Target: 55 | Existing: 0

🔹 Chunk 1: generating 15 prompts
  ✅ Added 15 | Remaining: 40

🔹 Chunk 2: generating 15 prompts
  ✅ Added 14 | Remaining: 26

🔹 Chunk 3: generating 15 prompts
  ✅ Added 15 | Remaining: 11

🔹 Chunk 4: generating 11 prompts
  ✅ Added 11 | Remaining: 0

🎯 Completed intent: Family & Comfort

⏸ Cooling down before next intent...



🚀 Intent: Price & Budget
Target: 60 | Existing: 0

🔹 Chunk 1: generating 15 prompts
  ✅ Added 15 | Remaining: 45

🔹 Chunk 2: generating 15 prompts
  ✅ Added 16 | Remaining: 29

🔹 Chunk 3: generating 15 prompts
  ✅ Added 15 | Remaining: 14

🔹 Chunk 4: generating 14 prompts
  ✅ Added 14 | Remaining: 0

🎯 Completed intent: Price & Budget

⏸ C

In [14]:
prompts_df.head()

Unnamed: 0,prompt_id,intent_id,intent_name,prompt_text,prompt_hash,generated_at
0,1,1,Safety & NCAP,What are the safety features offered in the Ta...,15c21f73bf4c7456c83330c8d740a033,2026-02-01T07:48:00.607946+00:00
1,2,1,Safety & NCAP,How does the Maruti Grand Vitara's 5-star Glob...,45539f34175539bc21b3ca50ebaf9660,2026-02-01T07:48:00.607975+00:00
2,3,1,Safety & NCAP,Can you explain the different types of airbags...,932886d6dc31bf8a77cf32ac99b53efe,2026-02-01T07:48:00.607984+00:00
3,4,1,Safety & NCAP,I'm planning to buy a compact SUV for my famil...,9c6b8907c51cc80f7477877c33f1d6fa,2026-02-01T07:48:00.607991+00:00
4,5,1,Safety & NCAP,Which compact SUV offers the best visibility a...,3a115d6c2e3dc83877d87817eeb45556,2026-02-01T07:48:00.607996+00:00


In [15]:
#Sanity check
print("Total prompts:", len(prompts_df))
prompts_df.groupby("intent_name").size()

Total prompts: 1000


intent_name,count
ADAS & Technology,55
Automatic vs Manual,50
Boot Space & Practicality,50
Brand Trust & Perception,50
City vs Highway Usage,55
Comparison & Elimination,70
EMI & Finance,55
Family & Comfort,60
Final Shortlisting,60
First-Time Buyer,55


In [16]:
#Extra 25 rows
manual_prompts = [
    # Comparison & Elimination
    ("Comparison & Elimination", 5, "Creta vs Seltos which is better for city driving"),
    ("Comparison & Elimination", 5, "Nexon or Brezza which is safer for family"),
    ("Comparison & Elimination", 5, "Grand Vitara vs Creta mileage and comfort comparison"),
    ("Comparison & Elimination", 5, "Best compact SUV in India for long term ownership"),
    ("Comparison & Elimination", 5, "Which compact SUV makes more sense in 2025"),

    # Price & Budget
    ("Price & Budget", 3, "Best compact SUV under 12 lakhs in India"),
    ("Price & Budget", 3, "Cheapest automatic compact SUV available right now"),
    ("Price & Budget", 3, "Is Creta worth the price compared to rivals"),
    ("Price & Budget", 3, "Budget friendly compact SUV for first time buyers"),
    ("Price & Budget", 3, "Compact SUV that offers most features for the price"),

    # Family & Comfort
    ("Family & Comfort", 2, "Best compact SUV for parents and kids"),
    ("Family & Comfort", 2, "Comfortable compact SUV for long highway trips"),
    ("Family & Comfort", 2, "Which SUV has best rear seat comfort"),
    ("Family & Comfort", 2, "Family friendly compact SUV with good boot space"),
    ("Family & Comfort", 2, "Compact SUV suitable for daily family use"),

    # Safety & NCAP
    ("Safety & NCAP", 1, "Safest compact SUV under 15 lakhs"),
    ("Safety & NCAP", 1, "Which compact SUV has highest NCAP rating"),
    ("Safety & NCAP", 1, "Is Tata Nexon safer than Hyundai Creta"),
    ("Safety & NCAP", 1, "Compact SUV with best safety features in India"),
    ("Safety & NCAP", 1, "Safety comparison of popular compact SUVs"),

    # Final Shortlisting
    ("Final Shortlisting", 18, "Which compact SUV should I finally buy"),
    ("Final Shortlisting", 18, "Best overall compact SUV in India right now"),
    ("Final Shortlisting", 18, "One compact SUV recommendation for mixed usage"),
    ("Final Shortlisting", 18, "Final verdict Creta vs Seltos vs Nexon"),
    ("Final Shortlisting", 18, "Best compact SUV for Indian roads overall"),
]

from datetime import datetime, timezone

for intent_name, intent_id, text in manual_prompts:
    h = text_hash(text)
    if h in existing_hashes:
        continue

    prompts_df.loc[len(prompts_df)] = [
        len(prompts_df) + 1,
        intent_id,
        intent_name,
        text,
        h,
        datetime.now(timezone.utc).isoformat()
    ]
    existing_hashes.add(h)

prompts_df.to_csv(OUTPUT_FILE, index=False)
print("✅ Added 25 manual prompts safely")

✅ Added 25 manual prompts safely


In [17]:
#Sanity check
prompts_df.head()

Unnamed: 0,prompt_id,intent_id,intent_name,prompt_text,prompt_hash,generated_at
0,1,1,Safety & NCAP,What are the safety features offered in the Ta...,15c21f73bf4c7456c83330c8d740a033,2026-02-01T07:48:00.607946+00:00
1,2,1,Safety & NCAP,How does the Maruti Grand Vitara's 5-star Glob...,45539f34175539bc21b3ca50ebaf9660,2026-02-01T07:48:00.607975+00:00
2,3,1,Safety & NCAP,Can you explain the different types of airbags...,932886d6dc31bf8a77cf32ac99b53efe,2026-02-01T07:48:00.607984+00:00
3,4,1,Safety & NCAP,I'm planning to buy a compact SUV for my famil...,9c6b8907c51cc80f7477877c33f1d6fa,2026-02-01T07:48:00.607991+00:00
4,5,1,Safety & NCAP,Which compact SUV offers the best visibility a...,3a115d6c2e3dc83877d87817eeb45556,2026-02-01T07:48:00.607996+00:00


In [18]:
print(f"Total number of prompts: {len(prompts_df)}")
print(f"Number of unique prompt hashes: {len(prompts_df['prompt_hash'].unique())}\n")

if len(prompts_df) == len(prompts_df['prompt_hash'].unique()):
    print(" All prompts are unique!")
else:
    print(" Duplicate prompts found!")

Total number of prompts: 1000
Number of unique prompt hashes: 1000

 All prompts are unique!


## **Dark Matter Prompt Generation – Summary**

To simulate AI-driven user discovery ("Dark Matter"), we generated a synthetic dataset of AI-style prompts related to buying a compact SUV in India.

--------------------
#### Approach
- We first defined 18 intent clusters representing real user decision-making stages such as safety, price, comparison, family comfort, and final shortlisting.
- We used a Large Language Model (LLM) via the Groq API to generate prompts in small, controlled chunks to ensure stability and diversity.
- Each generated prompt was deduplicated and saved incrementally to maintain data quality and allow safe resumption.

-----------------

#### Dataset Creation
- Using the LLM, we generated **975 unique prompts** distributed across all intent clusters.
- To reach a clean round number and improve coverage of high-impact intents, we manually added **25 curated prompts** that were:
  - Simple
  - Natural
  - Unique
  - Focused on high-volume intents such as comparison, price, safety, and final decision-making.

---------------
#### Final Dataset
- Total prompts: **1,000**
- All prompts are labeled by intent cluster.
- The dataset reflects realistic AI search behavior rather than traditional keyword queries.
- This dataset serves as the foundation for AI volume estimation and probabilistic attribution in subsequent steps.

-------------------

---
### **Step 2: The "Visible Light" Mapping (Google Search Simulation)**

#### Objective
The goal of this step is to simulate what **traditional analytics tools can see**:  
Google search keywords and their Monthly Search Volume (MSV).

While AI conversations happen in a “dark” space, Google search behavior represents the **visible layer of user intent**.  
This layer allows us to compare and later merge AI-driven demand with known, trackable search demand.

---

#### What We Do
- Map each AI-style user prompt to a **parent Google keyword**.
- Assign a **synthetic Monthly Search Volume (MSV)** to each keyword.
- Ensure the MSV distribution follows a **long-tail pattern**, where:
  - Most keywords have low search volume.
  - A few keywords have very high search volume.
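
The long-tail shape can be checked with a quick simulation. The sketch below uses only the standard library and borrows the log-normal parameters (7.5 and 1.2) and the clipping bounds (50 and 100,000) from the generation code later in this step; the sample size of 10,000 and the variable names are assumptions for illustration. For an unclipped log-normal the median is `exp(7.5)`, about 1,808, while the mean is pulled well above it by the few high-volume keywords.

```python
import random
import statistics

rng = random.Random(42)  # seeded so the sketch is reproducible

# Synthetic MSV values: log-normal (mu=7.5, sigma=1.2), clipped to [50, 100000]
msv = sorted(
    int(min(max(rng.lognormvariate(7.5, 1.2), 50), 100_000))
    for _ in range(10_000)
)

median_msv = statistics.median(msv)
mean_msv = statistics.fmean(msv)
# Share of total volume held by the top 10% of keywords
top_decile_share = sum(msv[-1000:]) / sum(msv)

print(f"median={median_msv:.0f}  mean={mean_msv:.0f}  top-10% share={top_decile_share:.0%}")
```

A mean well above the median, together with a large share of volume concentrated in the top decile, is the signature of the long-tail pattern described above.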

---

#### Why This Matters
Google search volumes act as the **baseline demand signal**.  
This “Visible Light” layer is later combined with AI-specific signals to estimate hidden AI-driven traffic.


In [26]:
#01.Create Parent Keywords
import re

def generate_parent_keyword(prompt_text):
    text = prompt_text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)

    stopwords = {
        "is","the","a","an","for","to","of","in","on","with",
        "which","what","should","i","my","can","you","im","am"
    }

    words = [w for w in text.split() if w not in stopwords]
    return " ".join(words[:4])

In [27]:
#02. Apply keyword mapping to dataset
prompts_df["google_keyword"] = prompts_df["prompt_text"].apply(generate_parent_keyword)

prompts_df[["prompt_text", "google_keyword"]].head()

Unnamed: 0,prompt_text,google_keyword
0,What are the safety features offered in the Ta...,are safety features offered
1,How does the Maruti Grand Vitara's 5-star Glob...,how does maruti grand
2,Can you explain the different types of airbags...,explain different types airbags
3,I'm planning to buy a compact SUV for my famil...,planning buy compact suv
4,Which compact SUV offers the best visibility a...,compact suv offers best


In [28]:
#03. Generate log-normal MSV values
import numpy as np

np.random.seed(42)  # reproducibility
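N = len(prompts_df) # Define N as the number of prompts
# Preliminary per-prompt draw; the next cell overwrites msv_values with one draw per unique keyword.
# It still advances the seeded RNG, so removing it changes the MSV values generated below.
msv_values = np.random.lognormal(mean=7.5, sigma=1.2, size=N)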


In [29]:
#04. Assign MSV per unique keyword
unique_keywords = prompts_df["google_keyword"].unique()

msv_values = np.random.lognormal(
    mean=7.5,
    sigma=1.2,
    size=len(unique_keywords)
)

# Clip to realistic bounds
msv_values = np.clip(msv_values, 50, 100000).astype(int)

keyword_msv_map = dict(zip(unique_keywords, msv_values))

In [30]:
#05. Map MSV back to prompts
prompts_df["google_msv"] = prompts_df["google_keyword"].map(keyword_msv_map)

prompts_df[["google_keyword", "google_msv"]].head()

Unnamed: 0,google_keyword,google_msv
0,are safety features offered,9693
1,how does maruti grand,5483
2,explain different types airbags,1942
3,planning buy compact suv,831
4,compact suv offers best,4179


In [31]:
#Sanity Check
prompts_df["google_msv"].describe(percentiles=[0.5, 0.75, 0.9, 0.95, 0.99])

Unnamed: 0,google_msv
count,1000.0
mean,3955.294
std,5997.185258
min,54.0
50%,1966.0
75%,4419.75
90%,8900.0
95%,13458.15
99%,29268.04
max,83427.0


### **Summary: Visible Light Mapping**

In this section, we created a realistic simulation of Google search behavior by mapping AI prompts to parent keywords and assigning synthetic search volumes.

**Key outcomes**:
- Each prompt is linked to a Google-style keyword.
- Monthly Search Volume (MSV) values are drawn from a **log-normal distribution** and clipped to the range 50 to 100,000, approximating the long-tail shape of real search demand.
- Most keywords fall in the low-volume long tail, while a small number carry high search demand.

This Visible Light layer now serves as a **grounded, interpretable baseline** that can be merged with AI-specific complexity signals to estimate hidden AI demand in the next step.

--------------

-------
### **Step 3: The Merging Logic (Core Estimation Step)**

#### Objective
The purpose of the merging step is to combine two layers of user demand:

- **Visible Light**: Google search demand represented by keywords and Monthly Search Volume (MSV).
- **Dark Matter**: AI-style conversational prompts that reflect deeper, multi-constraint user intent.

Google search volumes provide a baseline level of demand, but they do not fully capture how users interact with AI systems.

---

#### Key Idea
As user questions become more complex, users are more likely to rely on AI rather than traditional search.  
Therefore, AI-driven demand should be modeled as an **amplification of Google demand**, driven by query complexity and intent type.

---

#### Approach
Before estimating AI volume, we assign each prompt a **complexity score** that represents how difficult the query is to answer using traditional search alone.

This complexity score is later used to non-linearly amplify Google search demand and estimate hidden AI traffic.
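
Written out, the amplification takes the following form, where $w_{\text{intent}}$ is the intent weight, $c$ is the complexity score, and $\alpha$ is the amplification exponent. The symbol names are chosen here for illustration; in the code they correspond to `INTENT_WEIGHTS`, `complexity_score`, and `ALPHA`.

```latex
\[
\widehat{V}_{\text{AI}} \;=\; \text{MSV}_{\text{Google}} \times w_{\text{intent}} \times (1 + c)^{\alpha}
\]
```

With $\alpha = 2$, a prompt with $c = 0.5$ is amplified by $(1.5)^2 = 2.25$ before the intent weight is applied, while a prompt with $c = 0$ keeps its Google volume unchanged apart from the intent weight.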

----------------------

**COMPLEXITY SCORING**

In [32]:
#Define keyword sets
REASONING_KEYWORDS = {
    "compare", "better", "best", "vs", "versus",
    "should", "worth", "recommend", "final",
    "long term", "long-term", "ownership",
    "difference", "choose"
}

CONSTRAINT_KEYWORDS = {
    "family", "budget", "price", "mileage",
    "city", "highway", "safety", "comfort",
    "maintenance", "resale", "boot", "features",
    "automatic", "manual", "emi"
}

In [36]:
#Complexity scoring function
import numpy as np

def compute_complexity_score(prompt_text):
    text = prompt_text.lower()

    # 1. Length score (scaled)
    word_count = len(text.split())
    length_score = min(word_count / 25, 1.0)

    # 2. Reasoning keyword score
    reasoning_hits = sum(1 for kw in REASONING_KEYWORDS if kw in text)
    reasoning_score = min(reasoning_hits / 3, 1.0)

    # 3. Constraint score
    constraint_hits = sum(1 for kw in CONSTRAINT_KEYWORDS if kw in text)
    constraint_score = min(constraint_hits / 4, 1.0)

    # Weighted combination (the weights sum to 1.05, so the maximum possible score is 1.05, not 1.0)
    raw_score = (
        0.4 * length_score +
        0.4 * reasoning_score +
        0.25 * constraint_score
    )

    return round(raw_score, 3)

In [37]:
#Apply
prompts_df["complexity_score"] = prompts_df["prompt_text"].apply(compute_complexity_score)

prompts_df[["prompt_text", "complexity_score"]].head()

Unnamed: 0,prompt_text,complexity_score
0,What are the safety features offered in the Ta...,0.562
1,How does the Maruti Grand Vitara's 5-star Glob...,0.319
2,Can you explain the different types of airbags...,0.399
3,I'm planning to buy a compact SUV for my famil...,0.721
4,Which compact SUV offers the best visibility a...,0.562


In [38]:
#Sanity check the distribution
prompts_df["complexity_score"].describe(percentiles=[0.25, 0.5, 0.75, 0.9])

Unnamed: 0,complexity_score
count,1000.0
mean,0.529284
std,0.137135
min,0.096
25%,0.437
50%,0.525
75%,0.626
90%,0.721
max,0.925
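
One caveat about the scoring function: `kw in text` is a substring test, so "vs" also matches inside "suvs", "emi" inside "premium", and "city" inside "capacity". A hedged alternative is to match whole words only. The helper below is a sketch, not the method used to produce the scores above, and switching to it would change those scores; the name `count_keyword_hits` and the sample sentence are assumptions for illustration.

```python
import re


def count_keyword_hits(text, keywords):
    """Count keywords that appear as whole words or phrases, not as substrings."""
    text = text.lower()
    return sum(
        1 for kw in keywords
        if re.search(rf"\b{re.escape(kw)}\b", text)
    )


sample = "Which premium SUVs offer the most seating capacity"
keywords = {"vs", "emi", "city"}

# Substring test, as in compute_complexity_score
substring_hits = sum(1 for kw in keywords if kw in sample.lower())
# Whole-word test
whole_word_hits = count_keyword_hits(sample, keywords)

print(substring_hits, whole_word_hits)
```

The sample sentence contains none of the three keywords as actual words, yet the substring test counts all of them.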


**Finalizing the Merging Function**

In [39]:
#Intent Weights
INTENT_WEIGHTS = {
    "Comparison & Elimination": 1.30,
    "Final Shortlisting": 1.30,
    "Price & Budget": 1.20,
    "Safety & NCAP": 1.20,
    "Family & Comfort": 1.15,

    "First-Time Buyer": 1.15,
    "Mileage & Fuel Type": 1.10,
    "City vs Highway Usage": 1.10,

    "Variants & Features": 1.05,
    "ADAS & Technology": 1.05,

    "Maintenance & Service": 1.00,
    "Resale Value": 1.00,
    "Offers & Discounts": 0.95,
    "Test Drive Readiness": 0.95,

    "Automatic vs Manual": 0.90,
    "Boot Space & Practicality": 0.90,
    "Brand Trust & Perception": 0.85,
}

In [40]:
#Non Linear Amplification
ALPHA = 2.0

In [41]:
#function

def estimate_ai_volume(google_msv, complexity_score, intent_name):
    """
    Estimate AI-driven volume as a non-linear amplification
    of Google search demand.
    """

    # Intents missing from INTENT_WEIGHTS ("EMI & Finance" is not listed) fall back to a neutral 1.0
    intent_weight = INTENT_WEIGHTS.get(intent_name, 1.0)

    ai_volume = (
        google_msv
        * intent_weight
        * ((1 + complexity_score) ** ALPHA)
    )

    return round(ai_volume, 2)

In [44]:
#Apply the estimator to every prompt
prompts_df["estimated_ai_volume"] = prompts_df.apply(
    lambda row: estimate_ai_volume(
        row["google_msv"],
        row["complexity_score"],
        row["intent_name"]
    ),
    axis=1
)
prompts_df["estimated_ai_volume"].head()

Unnamed: 0,estimated_ai_volume
0,28379.29
1,11446.93
2,4561.06
3,2953.55
4,12235.33


In [43]:
prompts_df["estimated_ai_volume"].describe(percentiles=[0.5, 0.75, 0.9, 0.95])

Unnamed: 0,estimated_ai_volume
count,1000.0
mean,10090.88497
std,15757.754212
min,102.42
50%,4989.78
75%,11234.485
90%,24255.084
95%,34681.079
max,191255.93


In [45]:
#Compare Google MSV vs estimated AI volume
prompts_df[[
    "prompt_text",
    "google_msv",
    "complexity_score",
    "estimated_ai_volume"
]].head(10)

Unnamed: 0,prompt_text,google_msv,complexity_score,estimated_ai_volume
0,What are the safety features offered in the Ta...,9693,0.562,28379.29
1,How does the Maruti Grand Vitara's 5-star Glob...,5483,0.319,11446.93
2,Can you explain the different types of airbags...,1942,0.399,4561.06
3,I'm planning to buy a compact SUV for my famil...,831,0.721,2953.55
4,Which compact SUV offers the best visibility a...,4179,0.562,12235.33
5,Are the front and rear crumple zones of the Hy...,2899,0.4,6818.45
6,What role does the Electronic Stability Contro...,5293,0.336,11336.95
7,Can you compare the safety ratings of the Maru...,3874,0.548,11139.94
8,What is the process of replacing or repairing ...,6370,0.4,14982.24
9,Considering the hot and humid climate of India...,951,0.463,2442.59


#### **Total Estimated AI Volume per Intent Cluster**

In [47]:
intent_ai_volume = (
    prompts_df
    .groupby("intent_name", as_index=False)
    .agg(
        total_ai_volume=("estimated_ai_volume", "sum"),
        avg_ai_volume=("estimated_ai_volume", "mean"),
        prompt_count=("estimated_ai_volume", "count")
    )
    .sort_values(by="total_ai_volume", ascending=False)
)



total_ai = intent_ai_volume["total_ai_volume"].sum()

intent_ai_volume["ai_volume_share_pct"] = (
    intent_ai_volume["total_ai_volume"] / total_ai * 100
).round(2)

intent_ai_volume


Unnamed: 0,intent_name,total_ai_volume,avg_ai_volume,prompt_count,ai_volume_share_pct
5,Comparison & Elimination,880084.49,12572.635571,70,8.72
15,Safety & NCAP,860523.07,13238.816462,65,8.53
13,Price & Budget,779876.04,11998.092923,65,7.73
4,City vs Highway Usage,725073.91,13183.162,55,7.19
9,First-Time Buyer,717694.77,13048.995818,55,7.11
0,ADAS & Technology,666246.12,12113.565818,55,6.6
7,Family & Comfort,598508.13,9975.1355,60,5.93
17,Variants & Features,581919.0,10580.345455,55,5.77
14,Resale Value,516428.7,10328.574,50,5.12
2,Boot Space & Practicality,496615.23,9932.3046,50,4.92


#### **Conclusion**

We generated 1,000 AI-style user prompts across 18 intent clusters representing different stages of the compact SUV buying journey.

Using a non-linear merging model, we estimated AI-driven demand for each prompt by combining:
- Google search volume as a baseline,
- Prompt-level complexity,
- Intent-level behavioral weighting.

By aggregating estimated AI volume at the intent level, we identified which user intents generate the highest AI interaction.  
The results show that comparison, safety, and pricing intents carry the largest shares of estimated AI volume (roughly 7.7% to 8.7% each), followed by city vs highway usage and first-time buyer intents, highlighting where AI-driven discovery is most impactful.

This approach provides a structured and explainable framework to quantify hidden AI-driven traffic beyond traditional search analytics.

----------------


---------
## **Coding Task 2: The “Inference” Engine (Probabilistic Attribution)**
------

#### **Objective**
The objective of this step is to connect the estimated AI-driven intent volumes to real-world website behavior using probability.

While direct and AI-driven traffic does not expose explicit search keywords, user behavior on the website provides indirect signals about underlying intent. This task builds an inference engine that probabilistically attributes anonymous website sessions to intent clusters.

---

#### Key Challenge
Traditional analytics cannot directly identify the user’s intent for direct or AI-driven visits.  
Instead, we infer intent using observable behavioral signals such as:
- Landing page type
- Time spent on page
- Scroll depth
- Device type
- Follow-up actions (clicks, downloads, shares, bounces)

---

#### Approach
- Each website session is treated as an observation with behavioral features.
- Intent clusters act as hidden states.
- Using probabilistic scoring, we estimate how likely each intent is given the observed behavior.
- The output is a probability distribution over intent clusters for each session rather than a single hard label.

This allows us to bridge AI-driven “dark traffic” with measurable website interactions in a principled and explainable way.
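
One way to turn behavioral evidence into a probability distribution over intents is Bayes' rule: multiply a prior for each intent by the likelihood of the observed behavior under that intent, then normalize. The sketch below is an illustrative assumption and not necessarily the scoring used in this notebook; the function name, the three-intent subset, and all prior and likelihood values are hypothetical.

```python
def posterior_over_intents(priors, likelihoods):
    """Bayes' rule: P(intent | behavior) is proportional to P(intent) * P(behavior | intent)."""
    unnormalized = {
        intent: priors[intent] * likelihoods[intent] for intent in priors
    }
    total = sum(unnormalized.values())
    return {intent: value / total for intent, value in unnormalized.items()}


# Hypothetical priors (for example, derived from AI volume shares)
priors = {
    "Safety & NCAP": 0.30,
    "Price & Budget": 0.30,
    "Comparison & Elimination": 0.40,
}
# Hypothetical likelihoods of landing on an NCAP safety article under each intent
likelihoods = {
    "Safety & NCAP": 0.70,
    "Price & Budget": 0.10,
    "Comparison & Elimination": 0.20,
}

posterior = posterior_over_intents(priors, likelihoods)
for intent, prob in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"{intent}: {prob:.3f}")
```

The result is a soft assignment: the session is attributed mostly to the safety intent, while the other intents keep a non-zero share instead of being discarded by a hard label.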

-----

**Step 1: Load & Inspect Session Data**

In [61]:
import pandas as pd
import numpy as np

SESSION_FILE = "/content/direct_traffic_sessions.csv"
sessions_df = pd.read_csv(SESSION_FILE)

sessions_df.head()

Unnamed: 0,ID,Landing URL,Time on Page,Scroll Depth,Device,Next Action
0,S01,/blog/safest-cars-india-ncap-2025,5m 12s,90%,Mobile,"Click: ""Tata Nexon"""
1,S02,/model/hyundai-creta-sx,0m 30s,15%,Desktop,Bounce
2,S03,/finance/emi-calculator,2m 45s,100%,Desktop,Download Quote
3,S04,/compare/brezza-cng-vs-petrol,4m 10s,85%,Mobile,Share on WhatsApp
4,S05,/model/mahindra-thar-rox,0m 10s,5%,Mobile,Bounce


In [62]:
sessions_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   ID            25 non-null     object
 1   Landing URL   25 non-null     object
 2   Time on Page  25 non-null     object
 3   Scroll Depth  24 non-null     object
 4   Device        25 non-null     object
 5   Next Action   25 non-null     object
dtypes: object(6)
memory usage: 1.3+ KB


In [63]:
#rename columns
sessions_df.columns = [
    "session_id",
    "landing_page",
    "time_on_page_sec",
    "scroll_depth_pct",
    "device_type",
    "next_action"
]

In [64]:
# Clean and convert 'scroll_depth_pct'
sessions_df["scroll_depth_pct"] = (
    sessions_df["scroll_depth_pct"]
    .astype(str) # Ensure all values are strings
    .str.replace('%', '', regex=False) # Remove '%' character
    .replace('N/A', np.nan) # Replace 'N/A' with NaN to be handled by fillna
)
# Convert to numeric, coercing errors, then fill NaN and divide by 100
sessions_df["scroll_depth_pct"] = (
    pd.to_numeric(sessions_df["scroll_depth_pct"], errors='coerce')
    .fillna(0) # Fill any remaining NaN (from 'N/A' or coercion) with 0
    / 100
)


In [65]:
# Clean and convert 'time_on_page_sec'
def convert_time_to_seconds(time_str):
    if pd.isna(time_str) or not isinstance(time_str, str):
        return 0
    parts = time_str.replace(' ', '').split('m')
    minutes = int(parts[0]) if parts[0] else 0
    seconds = int(parts[1].replace('s', '')) if len(parts) > 1 and parts[1] else 0
    return minutes * 60 + seconds

sessions_df["time_on_page_sec"] = sessions_df["time_on_page_sec"].apply(convert_time_to_seconds)


In [67]:
sessions_df.dtypes

Unnamed: 0,0
session_id,object
landing_page,object
time_on_page_sec,int64
scroll_depth_pct,float64
device_type,object
next_action,object


In [69]:
sessions_df.head()

Unnamed: 0,session_id,landing_page,time_on_page_sec,scroll_depth_pct,device_type,next_action
0,S01,/blog/safest-cars-india-ncap-2025,312,0.9,Mobile,"Click: ""Tata Nexon"""
1,S02,/model/hyundai-creta-sx,30,0.15,Desktop,Bounce
2,S03,/finance/emi-calculator,165,1.0,Desktop,Download Quote
3,S04,/compare/brezza-cng-vs-petrol,250,0.85,Mobile,Share on WhatsApp
4,S05,/model/mahindra-thar-rox,10,0.05,Mobile,Bounce


#### **Step 2: Feature Normalization & Behavioral Signals**

**Objective of this step**

Convert raw behavioral signals into normalized evidence scores that can be used probabilistically.

We will create four interpretable signals:

- Time engagement

- Scroll engagement

- Action strength

- Device context

---------------------

In [70]:
# Log-normalize time on page
sessions_df["time_score"] = np.log1p(sessions_df["time_on_page_sec"])

# Scale to 0-1
sessions_df["time_score"] = (
    sessions_df["time_score"] / sessions_df["time_score"].max()
)
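
The log transform matters because time on page is heavily skewed: a 30-second bounce is about a tenth of a 5-minute read in raw seconds, but should not score a tenth as "engaged". The sketch below illustrates the same log-then-scale idea with the standard library only. The five durations are the `time_on_page_sec` values from the `head()` output; `log_scaled_scores` is a hypothetical helper, and because it scales by the maximum of these five rows rather than all 25 sessions, its numbers will not match the notebook's `time_score` column.

```python
import math

def log_scaled_scores(seconds):
    """Hypothetical helper: log1p-transform durations, then scale by the maximum."""
    logged = [math.log1p(s) for s in seconds]
    peak = max(logged)
    if peak == 0:
        # All durations are zero, so there is nothing to scale
        return [0.0 for _ in logged]
    return [round(value / peak, 3) for value in logged]

# time_on_page_sec for S01-S05, taken from the head() output above
times = [312, 30, 165, 250, 10]
scores = log_scaled_scores(times)

for t, s in zip(times, scores):
    # Compare the log-scaled score with a plain linear ratio
    print(f"{t:>4} s  log-scaled={s:.3f}  linear={t / max(times):.3f}")
```

The longest session scores 1.0 in both schemes, while short sessions keep a noticeably higher score under log scaling than under a linear ratio.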

In [71]:
# Scroll depth is already a 0-1 fraction, so it is used directly as the score
sessions_df["scroll_score"] = sessions_df["scroll_depth_pct"]

In [73]:
# Encode action strength
ACTION_WEIGHTS = {
    "Bounce": 0.1,
    "Click: \"Tata Nexon\"": 0.6,
    "Click": 0.6,
    "Download Quote": 0.8,
    "Share on WhatsApp": 0.9
}

def encode_action(action):
    # Substring match, so any "Click: ..." action picks up the "Click" weight
    for key, weight in ACTION_WEIGHTS.items():
        if key in action:
            return weight
    return 0.3  # neutral fallback for actions not listed above

sessions_df["action_score"] = sessions_df["next_action"].apply(encode_action)

In [74]:
# Encode device context
DEVICE_WEIGHTS = {
    "Mobile": 1.0,
    "Desktop": 0.8
}

sessions_df["device_score"] = sessions_df["device_type"].map(
    DEVICE_WEIGHTS
).fillna(0.8)


In [75]:
# Combine the four signals into one weighted engagement score (weights sum to 1)
sessions_df["engagement_score"] = (
    0.35 * sessions_df["time_score"] +
    0.30 * sessions_df["scroll_score"] +
    0.25 * sessions_df["action_score"] +
    0.10 * sessions_df["device_score"]
).round(3)
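
Written out, the scoring cells above amount to the following, where the symbols $t_i$ (`time_on_page_sec`), $s_i$ (`scroll_score`), $a_i$ (`action_score`), $d_i$ (`device_score`), and $E_i$ (`engagement_score`) are introduced here only for illustration:

$$
\text{time\_score}_i = \frac{\ln(1 + t_i)}{\max_j \ln(1 + t_j)}, \qquad
E_i = 0.35\,\text{time\_score}_i + 0.30\,s_i + 0.25\,a_i + 0.10\,d_i
$$

As a worked check against session S04 in the table that follows:

$$
E_{\text{S04}} = 0.35(0.891714) + 0.30(0.85) + 0.25(0.9) + 0.10(1.0) \approx 0.8921 \rightarrow 0.892
$$

Since the weights sum to 1 and each component lies between 0 and 1, the engagement score also stays between 0 and 1.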

In [76]:
#Check
sessions_df[[
    "session_id",
    "time_score",
    "scroll_score",
    "action_score",
    "device_score",
    "engagement_score"
]]

Unnamed: 0,session_id,time_score,scroll_score,action_score,device_score,engagement_score
0,S01,0.927339,0.9,0.6,0.8,0.825
1,S02,0.554187,0.15,0.1,0.8,0.344
2,S03,0.824987,1.0,0.8,0.8,0.869
3,S04,0.891714,0.85,0.9,1.0,0.892
4,S05,0.386979,0.05,0.1,0.8,0.255
5,S06,0.959066,0.95,0.6,0.8,0.851
6,S07,0.915577,0.8,0.3,0.8,0.715
7,S08,0.676139,0.0,0.3,0.8,0.392
8,S09,0.289159,0.0,0.1,0.8,0.206
9,S10,0.851797,0.7,0.6,0.8,0.738
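
In the check table above, S01 and S05 show a `device_score` of 0.8 even though `sessions_df.head()` lists both as Mobile, which suggests that some raw labels do not match the `DEVICE_WEIGHTS` keys exactly and fall through to the `fillna(0.8)` default. One possible cause is stray whitespace or inconsistent casing in the source data; that is an assumption, not something the notebook confirms. A minimal sketch of a more forgiving lookup, with `encode_device` as a hypothetical helper:

```python
DEVICE_WEIGHTS = {"Mobile": 1.0, "Desktop": 0.8}

def encode_device(label, default=0.8):
    """Hypothetical helper: look up a device weight after normalizing the label."""
    if not isinstance(label, str):
        return default  # missing or non-string values get the fallback weight
    normalized = label.strip().title()  # e.g. " mobile " -> "Mobile"
    return DEVICE_WEIGHTS.get(normalized, default)

# Possible usage in the notebook (not executed here):
# sessions_df["device_score"] = sessions_df["device_type"].apply(encode_device)

samples = ["Mobile", " mobile ", "DESKTOP", "Tablet", None]
scores = [encode_device(s) for s in samples]
print(scores)
```

If the labels were normalized this way, any Mobile session currently scored at 0.8 would gain 0.02 in `engagement_score` (the 0.10 device weight times the 0.2 difference).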


#### **Step 3: Intent Likelihood Mapping**

In [83]:
def infer_candidate_intents(landing_page):
    page = landing_page.lower()

    if "/compare" in page:
        return ["Comparison & Elimination"]
    if "/finance" in page or "emi" in page:
        return ["EMI & Finance", "Price & Budget"]
    if "/blog" in page and "safest" in page:
        return ["Safety & NCAP"]
    if "/model" in page:
        return ["Final Shortlisting", "Variants & Features"]
    if "/offers" in page:
        return ["Offers & Discounts"]

    return ["First-Time Buyer", "City vs Highway Usage"]

In [84]:
sessions_df["candidate_intents"] = sessions_df["landing_page"].apply(
    infer_candidate_intents
)

In [86]:
sessions_df[["session_id", "landing_page", "candidate_intents"]].head()

Unnamed: 0,session_id,landing_page,candidate_intents
0,S01,/blog/safest-cars-india-ncap-2025,[Safety & NCAP]
1,S02,/model/hyundai-creta-sx,"[Final Shortlisting, Variants & Features]"
2,S03,/finance/emi-calculator,"[EMI & Finance, Price & Budget]"
3,S04,/compare/brezza-cng-vs-petrol,[Comparison & Elimination]
4,S05,/model/mahindra-thar-rox,"[Final Shortlisting, Variants & Features]"


In [87]:
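def compute_intent_likelihoods(candidate_intents, engagement_score):
    """Split the session's engagement score evenly across its candidate intents."""
    # Every candidate gets the same likelihood; candidate_intents must be non-empty
    base_prob = engagement_score / len(candidate_intents)
    return {intent: round(base_prob, 3) for intent in candidate_intents}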

In [88]:
sessions_df["intent_likelihoods"] = sessions_df.apply(
    lambda row: compute_intent_likelihoods(
        row["candidate_intents"],
        row["engagement_score"]
    ),
    axis=1
)
sessions_df[[
    "session_id",
    "candidate_intents",
    "engagement_score",
    "intent_likelihoods"
]]

Unnamed: 0,session_id,candidate_intents,engagement_score,intent_likelihoods
0,S01,[Safety & NCAP],0.825,{'Safety & NCAP': 0.825}
1,S02,"[Final Shortlisting, Variants & Features]",0.344,"{'Final Shortlisting': 0.172, 'Variants & Feat..."
2,S03,"[EMI & Finance, Price & Budget]",0.869,"{'EMI & Finance': 0.434, 'Price & Budget': 0.434}"
3,S04,[Comparison & Elimination],0.892,{'Comparison & Elimination': 0.892}
4,S05,"[Final Shortlisting, Variants & Features]",0.255,"{'Final Shortlisting': 0.128, 'Variants & Feat..."
5,S06,"[First-Time Buyer, City vs Highway Usage]",0.851,"{'First-Time Buyer': 0.425, 'City vs Highway U..."
6,S07,"[Final Shortlisting, Variants & Features]",0.715,"{'Final Shortlisting': 0.357, 'Variants & Feat..."
7,S08,"[First-Time Buyer, City vs Highway Usage]",0.392,"{'First-Time Buyer': 0.196, 'City vs Highway U..."
8,S09,"[Final Shortlisting, Variants & Features]",0.206,"{'Final Shortlisting': 0.103, 'Variants & Feat..."
9,S10,"[First-Time Buyer, City vs Highway Usage]",0.738,"{'First-Time Buyer': 0.369, 'City vs Highway U..."


#### **Step 4: Bayesian Intent Attribution**

In [89]:
# Build prior from intent-level AI volume
intent_priors = (
    intent_ai_volume
    .set_index("intent_name")["total_ai_volume"]
)

# Normalize to probabilities
intent_priors = intent_priors / intent_priors.sum()

intent_priors

Unnamed: 0_level_0,total_ai_volume
intent_name,Unnamed: 1_level_1
Comparison & Elimination,0.087216
Safety & NCAP,0.085277
Price & Budget,0.077285
City vs Highway Usage,0.071854
First-Time Buyer,0.071123
ADAS & Technology,0.066025
Family & Comfort,0.059312
Variants & Features,0.057668
Resale Value,0.051178
Boot Space & Practicality,0.049214


In [90]:
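def bayesian_intent_attribution(intent_likelihoods, intent_priors):
    """
    Combine per-intent likelihoods with AI-volume priors and normalize
    so the posterior probabilities sum to 1.

    Intents missing from `intent_priors` get a prior of 0; if every
    candidate ends up with zero mass, an empty dict is returned.
    """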
    unnormalized = {}

    for intent, likelihood in intent_likelihoods.items():
        prior = intent_priors.get(intent, 0)
        unnormalized[intent] = likelihood * prior

    total = sum(unnormalized.values())

    if total == 0:
        return {}

    return {
        intent: round(score / total, 3)
        for intent, score in unnormalized.items()
    }
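
As a quick check of the function above, the following minimal sketch reproduces the S06 row using only the standard library. The two prior values are copied from the `intent_priors` output and the likelihoods from the Step 3 table; the plain-dict `priors` and the helper name `posterior` are illustrative stand-ins, not part of the notebook.

```python
# Priors copied from the intent_priors output above (only the two needed here)
priors = {
    "First-Time Buyer": 0.071123,
    "City vs Highway Usage": 0.071854,
}

# S06 likelihoods: engagement_score 0.851 split across two candidate intents
likelihoods = {
    "First-Time Buyer": 0.425,
    "City vs Highway Usage": 0.425,
}

def posterior(likelihoods, priors):
    """Hypothetical stand-in for bayesian_intent_attribution using plain dicts."""
    # Unnormalized posterior mass: likelihood times prior
    unnormalized = {k: v * priors.get(k, 0) for k, v in likelihoods.items()}
    total = sum(unnormalized.values())
    if total == 0:
        return {}
    # Normalize so the probabilities sum to 1
    return {k: round(v / total, 3) for k, v in unnormalized.items()}

result = posterior(likelihoods, priors)
print(result)  # {'First-Time Buyer': 0.497, 'City vs Highway Usage': 0.503}
```

The result agrees with the `posterior_intent_probs` shown for S06 in the final results table.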

In [91]:
sessions_df["posterior_intent_probs"] = sessions_df["intent_likelihoods"].apply(
    lambda x: bayesian_intent_attribution(x, intent_priors)
)

In [92]:
def get_top_intent(posterior_probs):
    if not posterior_probs:
        return None
    return max(posterior_probs, key=posterior_probs.get)

In [93]:
sessions_df["inferred_intent"] = sessions_df["posterior_intent_probs"].apply(
    get_top_intent
)

In [95]:
sessions_df["confidence_score"] = sessions_df["posterior_intent_probs"].apply(
    lambda x: max(x.values()) if x else 0
)

In [96]:
sessions_df[[
    "session_id",
    "candidate_intents",
    "engagement_score",
    "posterior_intent_probs",
    "inferred_intent",
    "confidence_score"
]]

Unnamed: 0,session_id,candidate_intents,engagement_score,posterior_intent_probs,inferred_intent,confidence_score
0,S01,[Safety & NCAP],0.825,{'Safety & NCAP': 1.0},Safety & NCAP,1.0
1,S02,"[Final Shortlisting, Variants & Features]",0.344,"{'Final Shortlisting': 0.444, 'Variants & Feat...",Variants & Features,0.556
2,S03,"[EMI & Finance, Price & Budget]",0.869,"{'EMI & Finance': 0.38, 'Price & Budget': 0.62}",Price & Budget,0.62
3,S04,[Comparison & Elimination],0.892,{'Comparison & Elimination': 1.0},Comparison & Elimination,1.0
4,S05,"[Final Shortlisting, Variants & Features]",0.255,"{'Final Shortlisting': 0.444, 'Variants & Feat...",Variants & Features,0.556
5,S06,"[First-Time Buyer, City vs Highway Usage]",0.851,"{'First-Time Buyer': 0.497, 'City vs Highway U...",City vs Highway Usage,0.503
6,S07,"[Final Shortlisting, Variants & Features]",0.715,"{'Final Shortlisting': 0.444, 'Variants & Feat...",Variants & Features,0.556
7,S08,"[First-Time Buyer, City vs Highway Usage]",0.392,"{'First-Time Buyer': 0.497, 'City vs Highway U...",City vs Highway Usage,0.503
8,S09,"[Final Shortlisting, Variants & Features]",0.206,"{'Final Shortlisting': 0.444, 'Variants & Feat...",Variants & Features,0.556
9,S10,"[First-Time Buyer, City vs Highway Usage]",0.738,"{'First-Time Buyer': 0.497, 'City vs Highway U...",City vs Highway Usage,0.503
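
The table shows identical posteriors for every session that shares the same candidate intents (for example S02, S05, S07, and S09), regardless of engagement. A short derivation explains why. The symbols are introduced here for illustration: $E$ is the session's engagement score, $C$ its set of $k$ candidate intents, and $\pi_i$ the prior for intent $i$. Because Step 3 assigns every candidate the same likelihood $E/k$:

$$
P(i \mid \text{session})
= \frac{\frac{E}{k}\,\pi_i}{\sum_{j \in C} \frac{E}{k}\,\pi_j}
= \frac{\pi_i}{\sum_{j \in C} \pi_j}
$$

The factor $E/k$ cancels, so the posterior depends only on the relative priors of the candidates. For the First-Time Buyer / City vs Highway Usage pair this gives $0.071123 / (0.071123 + 0.071854) \approx 0.497$, matching S06, S08, and S10. It also follows that a session with a single candidate always receives a posterior of 1.0, as seen for S01 and S04.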


### **Conclusion: Coding Task 2 - Inference Engine**

In this task, we built a probabilistic inference engine to connect estimated AI-driven intent volumes with real-world website behavior.

Using observable session signals such as time on page, scroll depth, device type, and user actions, we computed an engagement score for each session and used the landing page URL to shortlist candidate intent clusters. The engagement score was split evenly across a session's candidates to form behavioral likelihoods, which were then combined with intent-level AI volume estimates as prior probabilities.

By applying Bayesian attribution, we calculated the probability of each candidate intent given a session and assigned the most likely intent along with a confidence score. Because the likelihood is shared equally among a session's candidates, the ranking within a session is decided by the AI-volume priors, and single-candidate sessions always receive a confidence of 1.0. This approach offers a way to infer intent for anonymous or AI-driven traffic where traditional keyword attribution is not available, giving a first estimate of the hidden intent behind direct website visits.

---------------------------