# Instagram profile discovery via Google & Perplexity

This notebook demonstrates five ways to discover real Instagram profiles for a specific niche and location:

1) Google → Instagram search (Bright Data Google SERP dataset):  
   Triggers a SERP job with targeted `site:instagram.com ...` queries, waits for the snapshot, and downloads JSON results.

2) Google → Instagram search (Bright Data AI Mode Google dataset):  
   Uses a short natural-language prompt (instead of a strict keyword) to guide discovery, then fetches results as JSON.

3) Perplexity → Instagram search (Bright Data Web Scrapers Library):  
   Runs a Perplexity-powered search with a concise prompt and retrieves the JSON output for post-processing.

4) Direct call to Perplexity (OpenRouter):  
   Calls a Perplexity online model via OpenRouter and extracts only profile URLs from the plain-text response.

5) Hashtag → Perplexity chain (OpenRouter):  
   Asks Perplexity for niche hashtags first, then uses those hashtags in a second call to find creator profiles.


Prerequisites
- Environment variables:
  - `BRIGHTDATA_API_TOKEN` for Bright Data dataset calls
  - `OPENROUTER_API_KEY` for OpenRouter / Perplexity calls
- Python libs: `requests`, `python-dotenv`, `pydantic-ai` (with OpenRouter provider)
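
A quick way to confirm the prerequisites is to check that both variables are visible to Python after the environment (for example a `.env` file) has been loaded. The helper name `missing_env` below is hypothetical and only meant as a minimal sketch:

```python
import os

def missing_env(names, env=None):
    """Return the names that are unset or empty in the mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

required = ["BRIGHTDATA_API_TOKEN", "OPENROUTER_API_KEY"]
missing = missing_env(required)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Failing early here is usually easier to debug than an authorization error returned later by one of the APIs.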



In [None]:
from dotenv import load_dotenv

load_dotenv(override=True)


## 1. Google → Instagram search (Bright Data Google SERP dataset, targeted Google query)

This code sends a targeted Google query (e.g., `site:instagram.com "AI tools"`) to the Bright Data Google SERP dataset, waits until the snapshot is fully processed, and then downloads the results as JSON.  

`trigger_body` defines what exactly the Bright Data SERP dataset should search for.  
- `url` — the Google domain to query.  
- `keyword` — the full Google search string; here we limit results to Instagram profiles using `site:instagram.com` plus niche terms.  
- `language` — Google interface language.  
- `country` — the geographic region used for SERP results.  
- `start_page` / `end_page` — how many Google results pages to scrape.

This block tells Bright Data to run a targeted Google search and return result pages `start_page` through `end_page` (pages 1 and 2 in the code below).

Examples of keywords:
- `site:instagram.com "sustainable fashion" "Europe"`
- `site:instagram.com "indie maker" OR "solopreneur" "reels"`
- `site:instagram.com "AI tools" OR "data engineer" OR "#buildinpublic"`
- `site:instagram.com "nocode" "startup founder" "reels"`
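
Keyword strings like the ones above can also be assembled programmatically, which helps when you iterate over many niches or locations. The function `build_keyword` is a hypothetical helper, not part of any Bright Data SDK; it only concatenates quoted terms, OR-groups, and `-word` exclusions into one Google query string:

```python
def build_keyword(niche_terms, locations=(), exclude=()):
    """Compose a Google query restricted to instagram.com.

    Terms inside one group are OR-ed; separate groups are combined by juxtaposition.
    """
    if not niche_terms:
        raise ValueError("at least one niche term is required")

    def or_group(terms):
        quoted = [f'"{term}"' for term in terms]
        # A single term needs no parentheses
        return quoted[0] if len(quoted) == 1 else "(" + " OR ".join(quoted) + ")"

    parts = ["site:instagram.com", or_group(niche_terms)]
    if locations:
        parts.append(or_group(locations))
    parts.extend(f"-{word}" for word in exclude)
    return " ".join(parts)

keyword = build_keyword(
    ["sourdough", "bread baking"],
    locations=["NYC", "Brooklyn"],
    exclude=["restaurant", "shop"],
)
print(keyword)
```

The resulting string can be passed as the `keyword` field of `trigger_body`.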


In [None]:
import os
import requests
import time
import json, pathlib

API_KEY = os.getenv("BRIGHTDATA_API_TOKEN")
DATASET_ID = "gd_mfz5x93lmsjjjylob"
BASE_URL = "https://api.brightdata.com"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}


# Trigger the Google SERP dataset run
trigger_body = [
    {
        "url": "https://www.google.com/",
        "keyword": (
            'site:instagram.com ("sourdough" OR "sourdough bread" OR "starter") '
            '("NYC" OR "New York" OR "Brooklyn" OR "Manhattan" OR "Queens" OR "Bronx") '
            '("bio" OR "profile" OR "baker") -restaurant -shop -bakery -menu -delivery'
        ),
        "language": "en",
        "country": "US",
        "start_page": 1,
        "end_page": 2
    }
]
# Other example values for the keyword param:
# site:instagram.com "sourdough" "New York"
# site:instagram.com "bread baking" "NYC"
# site:instagram.com "artisan baker" "New York"
# site:instagram.com "home baker" "NYC" "sourdough"
# site:instagram.com "baker" "Brooklyn"
# site:instagram.com "sourdough bakery" "Manhattan"
# site:instagram.com "micro bakery" "New York"
# site:instagram.com "bagels" "NYC"

trigger_resp = requests.post(
    f"{BASE_URL}/datasets/v3/trigger",
    headers=headers,
    params={"dataset_id": DATASET_ID, "include_errors": "true"},
    json=trigger_body,
)
trigger_resp.raise_for_status()
print("Trigger response:", trigger_resp.status_code, trigger_resp.text)
snapshot_id = trigger_resp.json().get("snapshot_id")

# Poll progress until ready
progress_url = f"{BASE_URL}/datasets/v3/progress/{snapshot_id}"

while True:
    r = requests.get(progress_url, headers=headers)
    print("Progress raw:", r.text[:300])

    r.raise_for_status()
    j = r.json()
    status = j.get("status")

    if status in {"done", "completed", "ready"}:
        print("Snapshot ready!")
        break

    time.sleep(5)

# Download results as JSON and save to file
download_url = f"{BASE_URL}/datasets/v3/snapshot/{snapshot_id}"

resp = requests.get(
    download_url,
    headers=headers,
    params={"format": "json"},
)

resp.raise_for_status()

data = resp.json()

path = pathlib.Path("serp_results_1.json")
with path.open("w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)



## 2. Google → Instagram search (Bright Data AI Mode Google dataset, natural-language prompt)

This code triggers an AI-Mode Google dataset run with a natural-language prompt, polls progress and downloads results as JSON. Instead of passing a strict keyword string, you write a short instruction (prompt) describing what to find and Bright Data’s AI-Mode performs the Google-style discovery for you.

Prompt examples:
- Find Instagram profiles of European sustainable-fashion creators. Use the Google query: site:instagram.com "sustainable fashion" "Europe". Return profile URLs only.
- Find Instagram creators who are indie makers or solopreneurs posting reels. Use: site:instagram.com "indie maker" OR "solopreneur" "reels". Return profile URLs only.
- Find Instagram profiles related to AI tools or data engineering. Use: site:instagram.com "AI tools" OR "data engineer" OR "#buildinpublic". Return profile URLs only.
- Find Instagram profiles of startup founders who work with nocode and post reels. Use: site:instagram.com "nocode" "startup founder" "reels". Return profile URLs only.

Tip: keep the prompt concise, include `site:instagram.com` and your niche terms, and ask explicitly for profile URLs only.
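
The Bright Data runs in this notebook poll the progress endpoint in an open-ended `while True` loop. A bounded variant is sketched below; the function name `wait_for_snapshot`, the timeout defaults, and the `"failed"` status value are assumptions for illustration rather than documented behaviour. The status source is passed in as a callable so the logic can be tried without any network call:

```python
import time

def wait_for_snapshot(get_status, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Call get_status() until it reports readiness, failure, or the timeout elapses.

    get_status is any zero-argument callable returning a status string, e.g.
    lambda: requests.get(progress_url, headers=headers).json().get("status")
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in {"done", "completed", "ready"}:
            return status
        if status == "failed":  # assumed failure value, adjust to what the API returns
            raise RuntimeError("Snapshot failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Snapshot not ready after {timeout} s (last status: {status!r})")
        sleep(interval)

# Demo with a fake status source instead of a real API call
fake_statuses = iter(["running", "running", "ready"])
final_status = wait_for_snapshot(lambda: next(fake_statuses), interval=0, timeout=5)
print(final_status)
```

Bounding the wait keeps a stuck or failed job from blocking the notebook cell indefinitely.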


In [None]:
import os, time, json, pathlib, requests

API_KEY = os.getenv("BRIGHTDATA_API_TOKEN")  
DATASET_ID = "gd_mcswdt6z2elth3zqr2"        # AI Mode Google dataset
BASE_URL = "https://api.brightdata.com"

headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

# Trigger the dataset run
payload = [
    {
        "url": "https://google.com/aimode",
        "prompt": (
            "Act as an Instagram influencer discovery assistant. "
            "Use the following Google query to find candidates: "
            "site:instagram.com (\"pizza baker\" OR \"pizza blogger\" OR \"home pizza\" OR \"homemade pizza\") "
            "(\"Lisbon\" OR \"Lisboa\") -restaurant -pizzeria -shop -menu -delivery. "
            "From the results, identify Instagram creators based in Lisbon who consistently post homemade pizza content. "
            "Focus on individuals rather than restaurants or commercial accounts, and prefer creators who share their own dough experiments, "
            "baking techniques, and personal recipes. Exclude brands, shops, pizzerias, and SEO spam. "
            "Return ONLY Instagram profile URLs in the exact format https://www.instagram.com/<handle>, one per line, with no additional text."
        ),
        "country": "PT",
    }
]

r = requests.post(
    f"{BASE_URL}/datasets/v3/trigger",
    headers=headers,
    params={"dataset_id": DATASET_ID, "include_errors": "true"},
    json=payload,
)

if r.status_code == 400:
    r = requests.post(
        f"{BASE_URL}/datasets/v3/trigger",
        headers=headers,
        params={"dataset_id": DATASET_ID, "include_errors": "true"},
        json={"input": payload},
    )

r.raise_for_status()
# Print this cell's own response (r), not trigger_resp left over from section 1
print("Trigger response:", r.status_code, r.text)
snapshot_id = r.json().get("snapshot_id")

# Poll progress until ready
progress_url = f"{BASE_URL}/datasets/v3/progress/{snapshot_id}"

while True:
    p = requests.get(progress_url, headers=headers)
    print("Progress raw:", p.text[:300])
    p.raise_for_status()
    status = p.json().get("status")
    if status in {"done", "completed", "ready"}:
        break
    time.sleep(5)

# Download results as JSON and save to file
download_url = f"{BASE_URL}/datasets/v3/snapshot/{snapshot_id}"
resp = requests.get(download_url, headers=headers, params={"format": "json"})
resp.raise_for_status()

data = resp.json()
path = pathlib.Path("ai_google.json")
with path.open("w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)




## 3. Perplexity → Instagram search (Bright Data Web Scrapers Library)

This code triggers a Perplexity-powered run via Bright Data’s Web Scrapers Library, polls progress and downloads the JSON results.  
Instead of a strict Google query, you provide a short natural-language prompt. Perplexity does the discovery and the dataset returns the findings you can post-process (e.g., extract Instagram profile URLs).

Prompt examples:
- Find Instagram profiles of NYC sourdough bakers. Prefer individual bakers (not brands/agencies). Return profile URLs only.
- Find Instagram creators in Europe who post about sustainable fashion. Return profile URLs only.
- Find Instagram creators who are indie makers or solopreneurs and frequently post reels. Return profile URLs only.
- Find Instagram profiles focused on AI tools and data engineering with authentic, human-made content. Return profile URLs only.
- Find Instagram food bloggers in Tel Aviv who share step-by-step baking recipes. Prefer individuals, not shops. Return profile URLs only.
- Find Instagram photographers in Berlin who shoot street/documentary style. Return profile URLs only.
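
Post-processing the downloaded JSON could look like the sketch below. Because the exact schema of the dataset output is not shown here, the code walks the whole structure and scans every string it finds; the field names in `sample` (`answer_text`, `sources`) and the handle `example.baker` are made up for the demonstration:

```python
import re

PROFILE_RE = re.compile(r"https?://(?:www\.)?instagram\.com/([A-Za-z0-9._]+)")
# First path segments that point to content or features rather than profiles
NON_PROFILE = {"p", "reel", "reels", "explore", "stories", "tv"}

def iter_strings(node):
    """Yield every string found anywhere inside nested dicts and lists."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)

def extract_profile_urls(data):
    """Collect unique, normalized profile URLs in first-seen order."""
    seen = {}
    for text in iter_strings(data):
        for handle in PROFILE_RE.findall(text):
            handle = handle.rstrip(".")  # drop a sentence-ending dot
            key = handle.lower()         # compare handles case-insensitively
            if key and key not in NON_PROFILE:
                seen.setdefault(key, f"https://www.instagram.com/{handle}")
    return list(seen.values())

# Hypothetical sample standing in for the downloaded JSON
sample = [
    {"answer_text": "Try https://www.instagram.com/example.baker/ and https://instagram.com/p/Cxyz123/"},
    {"sources": [{"url": "https://www.instagram.com/Example.Baker"}, {"url": "https://example.com"}]},
]
print(extract_profile_urls(sample))
```

Scanning all strings is deliberately schema-agnostic; once the real field names are known, reading only the relevant fields would be more precise.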


In [None]:
import os, time, json, pathlib, requests

API_KEY = os.getenv("BRIGHTDATA_API_TOKEN")           
DATASET_ID = "gd_m7dhdot1vw9a7gc1n"                  
BASE_URL = "https://api.brightdata.com"
headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

# Trigger the dataset run
prompt = (
    'Find Instagram profiles of NYC sourdough bakers. '
    'Return 15 profile URLs only (one per line). '
    'Prefer individual bakers (not brands or agencies). '
    'NYC includes Manhattan, Brooklyn, Queens.'
)

payload = [{"url": "https://www.perplexity.ai", "prompt": prompt, "index": 1}]

r = requests.post(
    f"{BASE_URL}/datasets/v3/trigger",
    headers=headers,
    params={"dataset_id": DATASET_ID, "include_errors": "true"},
    json={"input": payload},      
)

r.raise_for_status()
# Print this cell's own response (r), not trigger_resp left over from section 1
print("Trigger response:", r.status_code, r.text)
snapshot_id = r.json().get("snapshot_id")

# Poll progress until ready
progress_url = f"{BASE_URL}/datasets/v3/progress/{snapshot_id}"
while True:
    p = requests.get(progress_url, headers=headers)
    print("Progress raw:", p.text[:300])
    p.raise_for_status()
    status = p.json().get("status")
    if status in {"done", "completed", "ready"}:
        break
    time.sleep(5)

# Download results as JSON and save to file
download_url = f"{BASE_URL}/datasets/v3/snapshot/{snapshot_id}"
resp = requests.get(download_url, headers=headers, params={"format": "json"})
resp.raise_for_status()

data = resp.json()
path = pathlib.Path("perplexity_bd.json")
with path.open("w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)


## 4. Direct call to Perplexity (OpenRouter)

This code calls a Perplexity online model through OpenRouter to discover Instagram profiles and prints the extracted profile URLs.

What it does:
- initializes an OpenRouter provider using the OPENROUTER_API_KEY environment variable
- selects a Perplexity model for web-enabled search
- sets a system prompt that enforces output format: one Instagram profile URL per line, prefer individuals over brands
- sends a user prompt: find NYC sourdough bakers on Instagram 
- runs the agent and captures plain-text output
- extracts instagram.com/{username} URLs with a regex and prints the count and a preview

Notes:
- set OPENROUTER_API_KEY in your environment before running
- adjust the model name or prompt to target different niches or locations
- the regex keeps only profile-like URLs and ignores posts/reels/explore links


In [None]:
import os, re
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openrouter import OpenRouterProvider

provider = OpenRouterProvider(api_key=os.getenv("OPENROUTER_API_KEY", ''))

model = OpenAIChatModel(
    "perplexity/sonar-pro-search",
    provider=provider,
)

SYSTEM = """
You are an Instagram influencer discovery assistant.

Output format rules (very important):
- Return ONLY Instagram profile URLs.
- One URL per line.
- No bullet points, no numbering, no headings, no explanations.
- Do NOT include @handles without URLs.
- Do NOT include any text before or after the URLs.
- Only include URLs for Instagram profiles confirmed by credible web sources.
- If no verified URL is found, return an empty response—do not guess or fabricate any profiles.

Valid output:
https://www.instagram.com/username1
https://www.instagram.com/username2

Invalid output examples (never do this):
1. @username - great creator
@username
Here are some creators: ...
"""

agent = Agent(
    model=model,
    system_prompt=SYSTEM,
)

PROMPT = (
    "Find Instagram creators based in New York City who consistently post sourdough bread baking content. "
    "Focus on individuals, not bakeries, restaurants, or commercial accounts. "
    "Prefer creators who share their own starter maintenance, dough experiments, baking techniques, and personal recipes. "
    "Exclude brands, shops, bakeries, and SEO spam. "
    "Only include accounts for which you have a verified direct URL to instagram.com from credible web sources. If you are not sure a handle exists, do not include it. "
    "Return ONLY Instagram profile URLs in the exact format 'https://www.instagram.com/<handle>' one per line, with no additional text."
)

result = await agent.run(PROMPT)
text = result.output

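# Keep only profile-like URLs: skip links whose first path segment is
# a post, reel, or explore page rather than a handle.
NON_PROFILE = {"p", "reel", "reels", "explore"}
urls = []
for line in text.splitlines():
    m = re.search(r"https?://(?:www\.)?instagram\.com/([A-Za-z0-9._]+)", line.strip())
    if m and m.group(1).lower() not in NON_PROFILE:
        urls.append(m.group(0))
print(len(urls), urls[:10])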


## 5. Hashtag → Perplexity chain for NYC sourdough bakers

This code runs a two-step Perplexity flow via OpenRouter to find sourdough bakers on Instagram in New York City.

What it does:
- initializes an OpenRouter provider using `OPENROUTER_API_KEY`
- configures the `perplexity/sonar-pro-search` model through `pydantic_ai`
- uses a strict system prompt so the model returns only real `instagram.com/<handle>` profile URLs, one per line, with no extra text
- first call: asks Perplexity for a JSON array of Instagram hashtags used by NYC sourdough bakers (with rationale and popularity notes)
- parses the JSON and extracts the `hashtag` field from each item
- second call: uses those hashtags to find individual creators in NYC who regularly post sourdough baking content, excluding brands, bakeries, and SEO spam
- extracts and deduplicates profile URLs with a regex, then prints how many were found and the final list

Notes:
- set `OPENROUTER_API_KEY` in your environment before running
- you can tweak the prompts to target other niches (pizza, pastry, coffee) or locations
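
Models sometimes wrap a JSON reply in a markdown code fence or add stray text around it, which makes a plain `json.loads` call fail. A more tolerant parser for the hashtag step is sketched below; `parse_hashtags` is a hypothetical helper, and it assumes the array spans from the first `[` to the last `]` in the reply:

```python
import json
import re

def parse_hashtags(raw):
    """Extract the "hashtag" values from a reply that should contain a JSON array."""
    # Assumption: the array, if present, runs from the first "[" to the last "]"
    match = re.search(r"\[.*\]", raw, flags=re.DOTALL)
    if not match:
        return []
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    tags = []
    for item in items:
        # Skip anything that is not an object with a string "hashtag" field
        if isinstance(item, dict) and isinstance(item.get("hashtag"), str):
            tag = item["hashtag"].strip()
            tags.append(tag if tag.startswith("#") else f"#{tag}")
    return tags

# Simulated model reply wrapped in a markdown code fence
fence = "`" * 3
reply = (
    fence + "json\n"
    '[{"hashtag": "#nycsourdough", "short_rationale": "...", "popularity_note": "..."},'
    ' {"hashtag": "sourdoughnyc"}]\n' + fence
)
print(parse_hashtags(reply))
```

Returning an empty list on any parse problem keeps the second call runnable, although in that case it would run without hashtag hints.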


In [None]:
import os
import re
import json
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openrouter import OpenRouterProvider

provider = OpenRouterProvider(api_key=os.getenv("OPENROUTER_API_KEY", ""))

model = OpenAIChatModel(
    "perplexity/sonar-pro-search",
    provider=provider,
)


HASHTAG_SYSTEM = """
You are an Instagram hashtag research assistant.

Return ONLY a pure JSON array.
Each item must include:
- "hashtag"
- "short_rationale"
- "popularity_note"

No explanations, no text outside JSON.
"""

hashtag_agent = Agent(
    model=model,
    system_prompt=HASHTAG_SYSTEM
)

HASHTAGS_PROMPT = (
    "Give me 10 Instagram hashtags used by sourdough bread bakers in New York City "
    "in 2024-2025. Provide the results as a pure JSON array. "
    "Each item must have fields: \"hashtag\", \"short_rationale\", \"popularity_note\". "
    "Return only the JSON array — nothing else."
)

hashtags_result = await hashtag_agent.run(HASHTAGS_PROMPT)
hashtags_json = hashtags_result.output

hashtags = []
try:
    hashtags = json.loads(hashtags_json)
except json.JSONDecodeError:
    print("JSON parsing failed. Raw output:")
    print(hashtags_json)

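# Guard against a reply that is not a list of objects
if not isinstance(hashtags, list):
    hashtags = []
hashtag_strings = [h["hashtag"] for h in hashtags if isinstance(h, dict) and "hashtag" in h]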

print("Extracted hashtags:", hashtag_strings)


INFLUENCER_SYSTEM = """
You are an Instagram influencer discovery assistant.

Output rules:
- Return ONLY Instagram profile URLs.
- One URL per line.
- No bullet points, no numbering.
- No explanations.
- No fake or unverified accounts.
- Do not guess or hallucinate handles.

Valid output example:
https://www.instagram.com/username
https://www.instagram.com/another.one
"""

influencer_agent = Agent(
    model=model,
    system_prompt=INFLUENCER_SYSTEM
)

INFLUENCER_PROMPT = (
    f"Using the following hashtags: {hashtag_strings}, "
    "Find Instagram creators based in New York City who consistently post sourdough bread baking content. "
    "Focus on individuals, not bakeries, restaurants, or commercial accounts. "
    "Prefer creators who share starter maintenance, dough experiments, baking techniques, and personal recipes. "
    "Exclude brands, shops, bakeries, and SEO spam. "
    "Return ONLY direct Instagram profile URLs in the format 'https://www.instagram.com/<handle>' "
    "one per line with no extra text. "
    "If unsure whether a handle exists, do not include it."
)

influencers_result = await influencer_agent.run(INFLUENCER_PROMPT)
text = influencers_result.output


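urls = []
for line in text.splitlines():
    # Capture only the first path segment (the handle), so "/handle" and "/handle/"
    # deduplicate together and post/reel/explore links are skipped.
    m = re.search(r"https?://(?:www\.)?instagram\.com/([A-Za-z0-9._]+)", line.strip())
    if m and m.group(1).lower() not in {"p", "reel", "reels", "explore"}:
        urls.append(m.group(0))

# Deduplicate while preserving order
urls = list(dict.fromkeys(urls))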

print("Found influencers:", len(urls))
print(urls)
