samber/the-great-gpt-firewall

The Great GPT Firewall 📛

This is a curated list of websites that use their robots.txt file to restrict access by AI agents, AI crawlers, and GPTs.

It will be updated monthly.

We need a plan!

User agents & robots.txt

The robots.txt file lets website owners control and limit crawler access to parts of their site through rules and directives. The user agents listed here identify common AI crawlers, and the closing Disallow rule is meant to deny them the whole site.

# OpenAI’s web crawler: GPT3.5, GPT4, ChatGPT
# https://platform.openai.com/docs/gptbot
User-agent: GPTBot

# ChatGPT plugins
# https://platform.openai.com/docs/plugins/bot
User-agent: ChatGPT-User

# Google's web crawler: Bard, VertexAI, Gemini
# https://blog.google/technology/ai/an-update-on-web-publisher-controls/
User-agent: Google-Extended

# Claude
User-agent: anthropic-ai

# Common Crawl
# https://commoncrawl.org/ccbot
User-agent: CCBot

# Omgilibot: webz.io
# https://webz.io/blog/web-data/what-is-the-omgili-bot-and-why-is-it-crawling-your-website/
User-agent: Omgilibot
User-agent: Omgili

# Facebook: LLaMA2
# https://developers.facebook.com/docs/sharing/bot/
User-agent: FacebookBot

# ByteDance: Doubao
User-agent: Bytespider

# Censorship area
Disallow: /
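
As a rough illustration of how a scanner might test a site against the user agents listed above, the sketch below uses Python's standard `urllib.robotparser`. The `blocked_agents` helper and the `SAMPLE` file are made up for this example; this is not the project's `scrape.py`.

```python
from urllib.robotparser import RobotFileParser

# User agents taken from the robots.txt block above.
AI_USER_AGENTS = [
    "GPTBot", "ChatGPT-User", "Google-Extended", "anthropic-ai",
    "CCBot", "Omgilibot", "Omgili", "FacebookBot", "Bytespider",
]


def blocked_agents(robots_txt, agents=AI_USER_AGENTS, path="/"):
    """Return the agents that robots_txt forbids from fetching path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, path)]


# Hypothetical robots.txt: two AI crawlers blocked, everyone else allowed.
SAMPLE = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

print(blocked_agents(SAMPLE))  # ['GPTBot', 'CCBot']
```

To check a live site, `RobotFileParser.set_url()` followed by `read()` would replace the `parse()` call. CPython's parser treats a blank line as the end of a group, which is why the sample keeps its `User-agent` lines contiguous.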

Disclaimer

Please note that this blocklist is intended for informational purposes only. Despite the provocative project name, it's fine to disallow web crawling and protect content ownership.

2024-04 update

Category: Press

  • Scanned: 66
  • ✅ Passing: 36 %
  • 🔐 Blocked: 64 %
  • ❓ Unknown: 0 %
Name Country Status
The Times 🇬🇧 🔐
BBC 🇬🇧 🔐
The Guardian 🇬🇧 🔐
The Economist 🇬🇧 🔐
Financial Times 🇬🇧 🔐
The Independent 🇬🇧
The Telegraph 🇬🇧 🔐
Daily Mail 🇬🇧 🔐
The Sun 🇬🇧 🔐
Daily Mirror 🇬🇧
Daily Express 🇬🇧
Washington Post 🇺🇸 🔐
USA Today 🇺🇸
Fox News 🇺🇸
ABC News 🇺🇸 🔐
NBC News 🇺🇸 🔐
CBS News 🇺🇸 🔐
Los Angeles Times 🇺🇸 🔐
Chicago Tribune 🇺🇸 🔐
New York Post 🇺🇸 🔐
New York Daily News 🇺🇸 🔐
The New Yorker 🇺🇸 🔐
Vice 🇺🇸 🔐
New York Times 🇺🇸 🔐
Wall Street Journal 🇺🇸 🔐
CNN 🇺🇸 🔐
El País 🇪🇸
Süddeutsche Zeitung 🇩🇪 🔐
Der Spiegel 🇩🇪 🔐
Corriere della Sera 🇮🇹 🔐
La Repubblica 🇮🇹 🔐
Le Monde 🇫🇷 🔐
Libération 🇫🇷 🔐
Le Figaro 🇫🇷 🔐
20 Minutes 🇫🇷 🔐
Ouest France 🇫🇷 🔐
Le Parisien 🇫🇷 🔐
L'Equipe 🇫🇷 🔐
Le Point 🇫🇷 🔐
Marianne 🇫🇷 🔐
Le Nouvel Observateur 🇫🇷 🔐
L'Express 🇫🇷 🔐
France 24 🇫🇷 🔐
BFMTV 🇫🇷 🔐
CNews 🇫🇷
Le Monde Diplomatique 🇫🇷
Mediapart 🇫🇷 🔐
Courrier International 🇫🇷 🔐
Brut 🇫🇷
IMDB 🌍
Allocine 🇫🇷
Fakt 🇵🇱
Super Express 🇵🇱
Gazeta Wyborcza 🇵🇱 🔐
Rzeczpospolita 🇵🇱
Dziennik Gazeta Prawna 🇵🇱
Polityka 🇵🇱
Newsweek Polska 🇵🇱
Gość Niedzielny 🇵🇱
Sieci 🇵🇱
Do Rzeczy 🇵🇱
Twój Styl 🇵🇱
Zwierciadło 🇵🇱
Wysokie Obcasy Extra 🇵🇱 🔐
Pani 🇵🇱
Elle 🇵🇱

Category: Video on demand

  • Scanned: 9
  • ✅ Passing: 56 %
  • 🔐 Blocked: 44 %
  • ❓ Unknown: 0 %
Name Country Status
Prime Video 🌍
Netflix 🌍
Disney+ 🌍 🔐
Hulu 🇺🇸 🔐
HBO Max 🇺🇸
Canal+ 🇫🇷 🔐
FranceTV 🇫🇷
TF1 🇫🇷 🔐
6Play 🇫🇷

Category: Music

  • Scanned: 6
  • ✅ Passing: 67 %
  • 🔐 Blocked: 33 %
  • ❓ Unknown: 0 %
Name Country Status
Soundcloud 🌍 🔐
YouTube 🌍
Apple Music 🌍
Spotify 🌍 🔐
Deezer 🇫🇷
LastFM 🇬🇧

Category: Podcast

  • Scanned: 8
  • ✅ Passing: 75 %
  • 🔐 Blocked: 25 %
  • ❓ Unknown: 0 %
Name Country Status
Google Podcasts 🌍
Apple Podcasts 🌍
Spotify Podcaster 🌍 🔐
Buzzsprout 🌍
Podbean 🌍
Acast 🇬🇧
AudioMeans 🇫🇷
Radio France 🇫🇷 🔐

Category: X

  • Scanned: 6
  • ✅ Passing: 100 %
  • 🔐 Blocked: 0 %
  • ❓ Unknown: 0 %
Name Country Status
PornHub 🌍
YouPorn 🌍
Xnxx 🌍
Xvideos 🌍
Xhamster 🌍
OnlyFans 🌍

Category: Religion

  • Scanned: 5
  • ✅ Passing: 100 %
  • 🔐 Blocked: 0 %
  • ❓ Unknown: 0 %
Name Country Status
Bible 🇺🇸
Bible gateway 🇺🇸
Jehovah's Witnesses 🇺🇸
Vatican 🇻🇦
Islamweb 🌍

Category: Social media

  • Scanned: 13
  • ✅ Passing: 38 %
  • 🔐 Blocked: 54 %
  • ❓ Unknown: 8 %
Name Country Status
Facebook 🌍 🔐
Instagram 🌍 🔐
Reddit 🌍
Hacker News 🌍
Lobsters 🌍 🔐
Pinterest 🌍 🔐
TikTok 🌍
Twitter 🌍 🔐
LinkedIn 🌍
Quora 🌍 🔐
VK 🇷🇺
TripAdvisor 🌍
Yelp 🌍 🔐

Category: Artist

  • Scanned: 42
  • ✅ Passing: 71 %
  • 🔐 Blocked: 21 %
  • ❓ Unknown: 7 %
Name Country Status
Michael Jackson 🇺🇸
Madonna 🇺🇸
Taylor Swift 🇺🇸 🔐
Rihanna 🇺🇸
Bruno Mars 🇺🇸
Justin Bieber 🇺🇸 🔐
Beyoncé 🇺🇸
Katy Perry 🇺🇸 🔐
Lady Gaga 🇺🇸 🔐
Hardwell 🇺🇸
Dimitri Vegas & Like Mike 🇺🇸
Kanye West 🇺🇸
Black Eyed Peas 🇺🇸
Imagine Dragons 🇺🇸 🔐
Twenty One Pilots 🇺🇸
Maroon 5 🇺🇸 🔐
Selena Gomez 🇺🇸 🔐
Usher 🇺🇸 🔐
Stromae 🇧🇪
Aya Nakamura 🇫🇷
Soprano 🇫🇷
Johnny Hallyday 🇫🇷
Grand Corps Malade 🇫🇷
Zaho 🇫🇷
Jean Louis Aubert 🇫🇷
Camelia Jordana 🇫🇷
Indochine 🇫🇷
Tryo 🇫🇷
David Guetta 🇫🇷
Mc Solaar 🇫🇷
Zaz 🇫🇷
Christine and the Queens 🇫🇷
Boulevard des Airs 🇫🇷
Calogero 🇫🇷
Hoshi 🇫🇷
Avicii 🇸🇪
Adele 🇬🇧
Calvin Harris 🇬🇧
Ed Sheeran 🇬🇧
Arctic Monkeys 🇬🇧
Coldplay 🇬🇧
The Weeknd 🇨🇦 🔐

Category: Gov

  • Scanned: 3
  • ✅ Passing: 100 %
  • 🔐 Blocked: 0 %
  • ❓ Unknown: 0 %
Name Country Status
White House 🇺🇸
Elysée 🇫🇷
Europe 🇪🇺

Category: Science

  • Scanned: 28
  • ✅ Passing: 86 %
  • 🔐 Blocked: 14 %
  • ❓ Unknown: 0 %
Name Country Status
Google Scholar 🌍
Sci-Hub 🌍
PubPeer 🌍
Scopus 🇳🇱 🔐
Elsevier 🇳🇱 🔐
ScienceDirect 🇳🇱 🔐
MDPI 🇨🇭
Springer 🇩🇪
Wiley 🇺🇸
American Chemical Society 🇺🇸
PubMed 🇺🇸
Academia 🇺🇸
Science 🇺🇸 🔐
ArXiv 🇺🇸
American Physical Society 🇺🇸
Mendeley 🇬🇧
Nature 🇬🇧
Taylor & Francis 🇬🇧
Oxford University Press 🇬🇧
Cambridge University Press 🇬🇧
Royal Society of Chemistry 🇬🇧
ResearchGate 🇩🇪
BNF 🇫🇷
Cairn 🇫🇷
Persee 🇫🇷
Gallica 🇫🇷
HAL 🇫🇷
OpenEdition 🇫🇷

Category: Dev

  • Scanned: 3
  • ✅ Passing: 100 %
  • 🔐 Blocked: 0 %
  • ❓ Unknown: 0 %
Name Country Status
GitHub 🌍
GitLab 🌍
Stack Overflow 🌍

Category: Other content

  • Scanned: 19
  • ✅ Passing: 84 %
  • 🔐 Blocked: 16 %
  • ❓ Unknown: 0 %
Name Country Status
Wikipedia 🌍
Medium 🌍 🔐
Substack 🌍
Common Crawl 🌍
Internet Archive 🌍
Wayback Machine 🌍
Notion 🌍
Weather 🇺🇸 🔐
AccuWeather 🇺🇸
Météo France 🇫🇷
Getty Images 🇺🇸
Shutterstock 🇺🇸 🔐
Adobe Stock 🇺🇸
Unsplash 🇨🇦
Pexels 🇩🇪
Pixabay 🇩🇪
Flickr 🇺🇸
500px 🇨🇦
Giphy 🇺🇸

Category: Other

  • Scanned: 1
  • ✅ Passing: 100 %
  • 🔐 Blocked: 0 %
  • ❓ Unknown: 0 %
Name Country Status
Indeed 🇺🇸

WTF list

A.k.a: do they understand their business model? 💸

Name Status
Adobe Stock
Getty Images
Pexels
Pixabay
500px

Shame list

A.k.a: this is public interest. 🖕

Name Status
Medium 🔐
Quora 🔐
Elsevier 🔐
ScienceDirect 🔐
Scopus 🔐
Science 🔐

🤝 Contributing

Looking for contributions:

  • Enrich website database
  • Chinese websites
  • New categories

Please open issues!

Don't hesitate ;)

Build

pip3 install -r requirements.txt
python3 scrape.py
# then copy the latest output into the README
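
The per-category figures (Scanned, Passing, Blocked, Unknown) look like shares of the scanned sites rounded to whole percentages. Here is a minimal sketch of that arithmetic, assuming each site ends up with one of three hypothetical status labels. The `summarize` function is illustrative and not taken from `scrape.py`.

```python
from collections import Counter


def summarize(statuses):
    """Return the scanned count and the rounded percentage per status."""
    total = len(statuses)
    counts = Counter(statuses)
    summary = {"scanned": total}
    for label in ("passing", "blocked", "unknown"):
        # Guard against an empty category to avoid dividing by zero.
        summary[label] = round(100 * counts[label] / total) if total else 0
    return summary


# Video on demand: 9 sites scanned, 4 of them marked blocked in the table.
print(summarize(["passing"] * 5 + ["blocked"] * 4))
# {'scanned': 9, 'passing': 56, 'blocked': 44, 'unknown': 0}
```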

👤 Contributors

Contributors

💫 Show your support

Give a ⭐️ if this project helped you!

GitHub Sponsors

📝 License

Copyright © 2024 Samuel Berthe.

This project is MIT licensed.