# Instagram profile discovery via Google & Perplexity

This notebook demonstrates four ways to discover real Instagram profiles for a specific niche and location:

1) Google → Instagram search (Bright Data Google SERP dataset):  
   Triggers a SERP job with targeted `site:instagram.com ...` queries, waits for the snapshot, and downloads JSON results.

2) Google → Instagram search (Bright Data AI Mode Google dataset):  
   Uses a short natural-language prompt (instead of a strict keyword) to guide discovery, then fetches results as JSON.

3) Perplexity → Instagram search (Bright Data Web Scrapers Library):  
   Runs a Perplexity-powered search with a concise prompt and retrieves the JSON output for post-processing.

4) Direct call to Perplexity (OpenRouter):  
   Calls a Perplexity online model via OpenRouter and extracts only profile URLs from the plain-text response.


Prerequisites
- Environment variables:
  - `BRIGHTDATA_API_TOKEN` for Bright Data dataset calls
  - `OPENROUTER_API_KEY` for OpenRouter / Perplexity calls
- Python libs: `requests`, `pydantic-ai` (with OpenRouter provider)



In [1]:
from dotenv import load_dotenv

load_dotenv(override=True)


True

## 1. Google → Instagram search (Bright Data Google SERP dataset, targeted Google query)

This code sends a targeted Google query (e.g., `site:instagram.com "AI tools"`) to the Bright Data Google SERP dataset, waits until the snapshot is fully processed, and then downloads the results as JSON.  

`trigger_body` defines what exactly the Bright Data SERP dataset should search for.  
- `url` — the Google domain to query.  
- `keyword` — the full Google search string; here we limit results to Instagram profiles using `site:instagram.com` plus niche terms.  
- `language` — Google interface language.  
- `country` — the geographic region used for SERP results.  
- `start_page` / `end_page` — how many Google results pages to scrape.

This block tells Bright Data to run a targeted Google search and return up to 5 pages of results.

Examples of keywords:
- site:instagram.com "sustainable fashion" "Europe"
- site:instagram.com "indie maker" OR "solopreneur" "reels"
- site:instagram.com "AI tools" OR "data engineer" OR "#buildinpublic"
- site:instagram.com "nocode" "startup founder" "reels"


In [None]:
import os
import requests
import time
import json, pathlib

API_KEY = os.getenv("BRIGHTDATA_API_TOKEN")
DATASET_ID = "gd_mfz5x93lmsjjjylob"
BASE_URL = "https://api.brightdata.com"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}


# Trigger the Google SERP dataset run
trigger_body = [
    {
        "url": "https://www.google.com/",
        "keyword": 'site:instagram.com "sourdough" "NYC baker"',
        "language": "en",
        "country": "US",
        "start_page": 1,
        "end_page": 5,
    }
]
# other exapple for keyword param:
# site:instagram.com "sourdough" "New York"
# site:instagram.com "bread baking" "NYC"
# site:instagram.com "artisan baker" "New York"
# site:instagram.com "home baker" "NYC" "sourdough"
# site:instagram.com "baker" "Brooklyn"
# site:instagram.com "sourdough bakery" "Manhattan"
# site:instagram.com "micro bakery" "New York"
# site:instagram.com "bagels" "NYC"

trigger_resp = requests.post(
    f"{BASE_URL}/datasets/v3/trigger",
    headers=headers,
    params={"dataset_id": DATASET_ID, "include_errors": "true"},
    json=trigger_body,
)
trigger_resp.raise_for_status()
print("Trigger raw text:", trigger_resp.text)
print("Trigger response:", trigger_resp.status_code, trigger_resp.text)
snapshot_id = trigger_resp.json().get("snapshot_id")

# Poll progress until ready
progress_url = f"{BASE_URL}/datasets/v3/progress/{snapshot_id}"

while True:
    r = requests.get(progress_url, headers=headers)
    print("Progress raw:", r.text[:300])

    r.raise_for_status()
    j = r.json()
    status = j.get("status")

    if status in {"done", "completed", "ready"}:
        print("Snapshot ready!")
        break

    time.sleep(5)

# Download results as JSON and save to file
download_url = f"{BASE_URL}/datasets/v3/snapshot/{snapshot_id}"

resp = requests.get(
    download_url,
    headers=headers,
    params={"format": "json"},
)

resp.raise_for_status()

data = resp.json()

path = pathlib.Path("serp_results_1.json")
with path.open("w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)



Trigger raw text: {"snapshot_id":"sd_mi04wc793e8mthspw"}
Trigger response: 200 {"snapshot_id":"sd_mi04wc793e8mthspw"}
sd_mi04wc793e8mthspw
Progress raw: {"status":"running","snapshot_id":"sd_mi04wc793e8mthspw","dataset_id":"gd_mfz5x93lmsjjjylob"}
Progress raw: {"status":"running","snapshot_id":"sd_mi04wc793e8mthspw","dataset_id":"gd_mfz5x93lmsjjjylob"}
Progress raw: {"status":"ready","snapshot_id":"sd_mi04wc793e8mthspw","dataset_id":"gd_mfz5x93lmsjjjylob","records":1,"errors":0,"collection_duration":9374}
Snapshot ready!


## 2. Google → Instagram search (Bright Data AI Mode Google dataset, human language prompt)

This code triggers an AI-Mode Google dataset run with a natural-language prompt, polls progress and downloads results as JSON. Instead of passing a strict keyword string, you write a short instruction (prompt) describing what to find and Bright Data’s AI-Mode performs the Google-style discovery for you.

Prompt examples:
- Find Instagram profiles of European sustainable-fashion creators. Use the Google query: site:instagram.com "sustainable fashion" "Europe". Return profile URLs only.
- Find Instagram creators who are indie makers or solopreneurs posting reels. Use: site:instagram.com "indie maker" OR "solopreneur" "reels". Return profile URLs only.
- Find Instagram profiles related to AI tools or data engineering. Use: site:instagram.com "AI tools" OR "data engineer" OR "#buildinpublic". Return profile URLs only.
- Find Instagram profiles of startup founders who work with nocode and post reels. Use: site:instagram.com "nocode" "startup founder" "reels". Return profile URLs only.

Tip: keep the prompt concise, include site:instagram.com and your niche terms and ask explicitly for profile URLs only.


In [32]:
import os, time, json, pathlib, requests

API_KEY = os.getenv("BRIGHTDATA_API_TOKEN")  
DATASET_ID = "gd_mcswdt6z2elth3zqr2"        # AI Mode Google dataset
BASE_URL = "https://api.brightdata.com"

headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

# Trigger the dataset run
payload = [
    {"url": "https://google.com/aimode", "prompt": "Find Instagram profiles of NYC sourdough bakers. Use the Google query: site:instagram.com 'sourdough' 'NYC baker'. Return profile URLs only.", "country": "US"}
]

r = requests.post(
    f"{BASE_URL}/datasets/v3/trigger",
    headers=headers,
    params={"dataset_id": DATASET_ID, "include_errors": "true"},
    json=payload,
)

if r.status_code == 400:
    r = requests.post(
        f"{BASE_URL}/datasets/v3/trigger",
        headers=headers,
        params={"dataset_id": DATASET_ID, "include_errors": "true"},
        json={"input": payload},
    )

r.raise_for_status()
print("Trigger raw text:", trigger_resp.text)
print("Trigger response:", trigger_resp.status_code, trigger_resp.text)
snapshot_id = r.json().get("snapshot_id")

# Poll progress until ready
progress_url = f"{BASE_URL}/datasets/v3/progress/{snapshot_id}"

while True:
    p = requests.get(progress_url, headers=headers)
    print("Progress raw:", p.text[:300])
    p.raise_for_status()
    status = p.json().get("status")
    if status in {"done", "completed", "ready"}:
        break
    time.sleep(5)

# Download results as JSON and save to file
download_url = f"{BASE_URL}/datasets/v3/snapshot/{snapshot_id}"
resp = requests.get(download_url, headers=headers, params={"format": "json"})
resp.raise_for_status()

data = resp.json()
path = pathlib.Path("ai_google.json")
with path.open("w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)




Trigger raw text: {"snapshot_id":"sd_mi04wc793e8mthspw"}
Trigger response: 200 {"snapshot_id":"sd_mi04wc793e8mthspw"}
Progress raw: {"status":"running","snapshot_id":"sd_mi070kmjuvkvt0bl6","dataset_id":"gd_mcswdt6z2elth3zqr2"}
Progress raw: {"status":"running","snapshot_id":"sd_mi070kmjuvkvt0bl6","dataset_id":"gd_mcswdt6z2elth3zqr2"}
Progress raw: {"status":"running","snapshot_id":"sd_mi070kmjuvkvt0bl6","dataset_id":"gd_mcswdt6z2elth3zqr2"}
Progress raw: {"status":"running","snapshot_id":"sd_mi070kmjuvkvt0bl6","dataset_id":"gd_mcswdt6z2elth3zqr2"}
Progress raw: {"status":"running","snapshot_id":"sd_mi070kmjuvkvt0bl6","dataset_id":"gd_mcswdt6z2elth3zqr2"}
Progress raw: {"status":"ready","snapshot_id":"sd_mi070kmjuvkvt0bl6","dataset_id":"gd_mcswdt6z2elth3zqr2","records":1,"errors":0,"collection_duration":24240}


## 3. Perplecity → Instagram search (Bright data Web Scrapers Library)

This code triggers a Perplexity-powered run via Bright Data’s Web Scrapers Library, polls progress and downloads the JSON results.  
Instead of a strict Google query, you provide a short natural-language prompt. Perplexity does the discovery and the dataset returns the findings you can post-process (e.g., extract Instagram profile URLs).

Prompt examples:
- Find Instagram profiles of NYC sourdough bakers. Prefer individual bakers (not brands/agencies). Return profile URLs only.
- Find Instagram creators in Europe who post about sustainable fashion. Return profile URLs only.
- Find Instagram creators who are indie makers or solopreneurs and frequently post reels. Return profile URLs only.
- Find Instagram profiles focused on AI tools and data engineering with authentic, human-made content. Return profile URLs only.
- Find Instagram food bloggers in Tel Aviv who share step-by-step baking recipes. Prefer individuals, not shops. Return profile URLs only.
- Find Instagram photographers in Berlin who shoot street/documentary style. Return profile URLs only.


In [33]:
import os, time, json, pathlib, requests

API_KEY = os.getenv("BRIGHTDATA_API_TOKEN")           
DATASET_ID = "gd_m7dhdot1vw9a7gc1n"                  
BASE_URL = "https://api.brightdata.com"
headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

# Trigger the dataset run
prompt = (
    'Find Instagram profiles of NYC sourdough bakers. '
    'Return 15 profile URLs only (one per line). '
    'Prefer individual bakers (not brands or agencies). '
    'NYC includes Manhattan, Brooklyn, Queens.'
)

payload = [{"url": "https://www.perplexity.ai", "prompt": prompt, "index": 1}]

r = requests.post(
    f"{BASE_URL}/datasets/v3/trigger",
    headers=headers,
    params={"dataset_id": DATASET_ID, "include_errors": "true"},
    json={"input": payload},      
)

r.raise_for_status()
print("Trigger raw text:", trigger_resp.text)
print("Trigger response:", trigger_resp.status_code, trigger_resp.text)
snapshot_id = r.json().get("snapshot_id")

# Poll progress until ready
progress_url = f"{BASE_URL}/datasets/v3/progress/{snapshot_id}"
while True:
    p = requests.get(progress_url, headers=headers)
    print("Progress raw:", p.text[:300])
    p.raise_for_status()
    status = p.json().get("status")
    if status in {"done", "completed", "ready"}:
        break
    time.sleep(5)

# Download results as JSON and save to file
download_url = f"{BASE_URL}/datasets/v3/snapshot/{snapshot_id}"
resp = requests.get(download_url, headers=headers, params={"format": "json"})
resp.raise_for_status()

data = resp.json()
path = pathlib.Path("perplexity_bd.json")
with path.open("w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)


Trigger raw text: {"snapshot_id":"sd_mi04wc793e8mthspw"}
Trigger response: 200 {"snapshot_id":"sd_mi04wc793e8mthspw"}
Progress raw: {"status":"running","snapshot_id":"sd_mi071d55i0r631b3q","dataset_id":"gd_m7dhdot1vw9a7gc1n"}
Progress raw: {"status":"running","snapshot_id":"sd_mi071d55i0r631b3q","dataset_id":"gd_m7dhdot1vw9a7gc1n"}
Progress raw: {"status":"running","snapshot_id":"sd_mi071d55i0r631b3q","dataset_id":"gd_m7dhdot1vw9a7gc1n"}
Progress raw: {"status":"running","snapshot_id":"sd_mi071d55i0r631b3q","dataset_id":"gd_m7dhdot1vw9a7gc1n"}
Progress raw: {"status":"running","snapshot_id":"sd_mi071d55i0r631b3q","dataset_id":"gd_m7dhdot1vw9a7gc1n"}
Progress raw: {"status":"running","snapshot_id":"sd_mi071d55i0r631b3q","dataset_id":"gd_m7dhdot1vw9a7gc1n"}
Progress raw: {"status":"running","snapshot_id":"sd_mi071d55i0r631b3q","dataset_id":"gd_m7dhdot1vw9a7gc1n"}
Progress raw: {"status":"running","snapshot_id":"sd_mi071d55i0r631b3q","dataset_id":"gd_m7dhdot1vw9a7gc1n"}
Progress raw: {"st

## 4. Direct call to perplexity (Openrouter)

This code calls a Perplexity online model through OpenRouter to discover Instagram profiles and prints the extracted profile URLs.

What it does:
- initializes an OpenRouter provider using the OPENROUTER_API_KEY environment variable
- selects a Perplexity model for web-enabled search
- sets a system prompt that enforces output format: one Instagram profile URL per line, prefer individuals over brands
- sends a user prompt: find NYC sourdough bakers on Instagram 
- runs the agent and captures plain-text output
- extracts instagram.com/{username} URLs with a regex and prints the count and a preview

Notes:
- set OPENROUTER_API_KEY in your environment before running
- adjust the model name or prompt to target different niches or locations
- the regex keeps only profile-like URLs and ignores posts/reels/explore links


In [None]:
import os, re
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openrouter import OpenRouterProvider

provider = OpenRouterProvider(api_key=os.getenv("OPENROUTER_API_KEY", ''))

model = OpenAIChatModel(
    "perplexity/sonar-pro-search",
    provider=provider,
)

SYSTEM = (
    "You are an Instagram influencer discovery assistant.\n"
    "Return ONLY Instagram profile URLs (one per line). No extra text.\n"
    "Prefer individual creators; exclude brands/agencies/shops/SEO farms.\n"
    "NYC includes Manhattan, Brooklyn, Queens, Bronx, Staten Island.\n"
)

agent = Agent(
    model=model,
    system_prompt=SYSTEM,
)

PROMPT = (
    "Find Instagram profiles of NYC sourdough bakers. "
    "Return 15 profile URLs only (one per line). Prefer individual bakers (not brands/agencies). "
    "NYC includes Manhattan, Brooklyn, Queens, Bronx, Staten Island."
)

result = await agent.run(PROMPT)
text = result.output

urls = []
for line in text.splitlines():
    m = re.search(r"(https?://(?:www\.)?instagram\.com/[A-Za-z0-9._]+)", line.strip())
    if m:
        urls.append(m.group(1))
print(len(urls), urls[:10])


15 ['https://instagram.com/theclevercarrot', 'https://instagram.com/sarah_c_owens', 'https://instagram.com/brooklynsourdough', 'https://instagram.com/maurizio', 'https://instagram.com/riseandloaf_sourdoughco', 'https://instagram.com/joseybakerbread', 'https://instagram.com/october_farms', 'https://instagram.com/crustycravingsbyhannah', 'https://instagram.com/amybakesbread', 'https://instagram.com/littlewildbread']
