Анализ данных и категоризация сайтов


В данном проекте я работал с файлом aggregated_visits.csv, содержащим данные о количестве запросов к сайтам. Для упрощения анализа я выделил 90% наиболее часто встречающихся сайтов, отсеяв менее популярные, чтобы сосредоточиться на значимых данных.

Параллельно был загружен файл обновленная_таблица_с_ключевыми_словами.xlsx, содержащий предварительно категоризированные сайты. Эта таблица была использована для оценки качества категоризации, выполненной с помощью ChatGPT, и данных из интернета. Примечательно, что автоматический код, созданный ChatGPT, смог классифицировать все сайты, оставив лишь 2.4% в категории "Others".

Для улучшения точности анализа мной был создан скрипт, который позволяет генерировать регулярные выражения для URL. В дальнейшем это станет основой для использования API и автоматической категоризации новых сайтов.

Основные этапы работы:

1.Загрузка и предобработка данных:
Сначала я загрузил файлы данных в pandas, отсортировал их по частоте запросов и выделил 90% наиболее популярных сайтов.

2.Категоризация сайтов:

1)Использовал таблицу ключевых слов для автоматической классификации сайтов.
2)Применил расширенные алгоритмы, включающие проверку на IP-адреса и поиск ключевых слов в URL.

3.Оценка точности:
Итоговая категоризация показала, что подавляющее большинство сайтов были успешно классифицированы, а доля категории "Others" значительно снизилась. Это подчеркивает эффективность предложенного подхода и открывает перспективы для дальнейшей доработки.

Следующим этапом станет интеграция API для динамической категоризации новых данных и уменьшение остаточной доли "Others" до менее 1%.

1. Загружаем библиотеки

In [1]:
import pandas as pd
import re  

2. Загружаем и предобрабатываем данные

Код загружает Excel-файл с ключевыми словами в DataFrame и отображает его первые строки.

In [2]:
output_file_path = 'обновленная_таблица_с_ключевыми_словами.xlsx'

# Загружаем файл Excel в DataFrame
top_keywords_2 = pd.read_excel(output_file_path)

top_keywords_2.head()

Unnamed: 0,Category,Keywords
0,Adult,"club, eu, onlyfans, porn, top, xxx, xyz"
1,Arts & Entertainment,"120years, 123movies4u, 123test, 16personalitie..."
2,Autos & Vehicles,"4x4utv, aerosus, arizonarvpartscenter, ase, au..."
3,Beauty & Fitness,"14daymanicure, 50-ml, absolutecosmetics, aesth..."
4,Books & Literature,"4m-net, abrasiverock, africanbookscollective, ..."


Этот код загружает данные о посещениях из CSV-файла, вычисляет порог для 90% всех запросов и отбирает сайты, которые составляют 90% от общего количества запросов.

In [3]:
file_path_visits = 'aggregated_visits.csv'  # Путь к файлу с агрегированными посещениями (CSV)
aggregated_visits_data = pd.read_csv(file_path_visits)  # Чтение CSV файла

# Шаг 1: Вычисляем 90% от общего количества запросов
sorted_visits_data = aggregated_visits_data.sort_values(by='count()', ascending=False)  # Сортировка по количеству запросов
total_visit_count = sorted_visits_data['count()'].sum()  # Общая сумма запросов
ninety_percent_threshold = total_visit_count * 0.9  # Порог для 90% запросов

# Шаг 2: Отбираем сайты, которые входят в 90% всех запросов
cumulative_visit_sum = sorted_visits_data['count()'].cumsum()  # Накопленная сумма запросов
top_90_percent_sites = sorted_visits_data[cumulative_visit_sum <= ninety_percent_threshold]  # Сайты, входящие в 90% запросов




3. Категоризация с помощью таблицы с ключевыми словами

Этот код реализует функцию категоризации сайтов на основе ключевых слов из таблицы df_top_keywords_2. Он проверяет, является ли сайт IP-адресом, и если это не так, пытается найти категорию, используя ключевые слова для каждого сайта. Если ключевые слова не совпадают, сайт попадает в категорию "Others". Затем функция применяется к каждому сайту в данных о топ-90% посещений, и для каждого сайта добавляется категория.

In [4]:
# Функция для категоризации сайтов на основе ключевых слов из df_top_keywords_2
def categorize_site(site, keyword_df):
    # Преобразуем сайт в строку и в нижний регистр для удобства поиска
    site = str(site).lower()
    
    # Проверка на IP-адрес (состоит только из цифр и точек)
    if all(c.isdigit() or c == '.' for c in site):
        return 'IP Address'
    
    # Пробуем найти категорию по ключевым словам из таблицы df_top_keywords_2
    for index, row in keyword_df.iterrows():
        keywords = row['Keywords'].split(', ')
        if any(keyword in site for keyword in keywords):
            return row['Category']
    
    # Если не подошли ключевые слова, возвращаем "Others"
    return 'Others'

# Применение функции к каждому сайту в df_top_90
top_90_percent_sites['Category'] = top_90_percent_sites['req_host'].apply(lambda x: categorize_site(x, top_keywords_2))


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  top_90_percent_sites['Category'] = top_90_percent_sites['req_host'].apply(lambda x: categorize_site(x, top_keywords_2))


4. Категоризация с помощью данных чата gpt

Этот код представляет собой функцию для категоризации сайтов по различным признакам. В частности, она проверяет, содержат ли URL определенные ключевые слова, соответствующие различным категориям, таким как поисковые системы, социальные сети, электронная коммерция, финансы и другие. Он также включает проверку на IP-адреса и технические ресурсы. Затем эта функция применяется к каждому сайту из списка, чтобы добавить к нему новую категорию. Далее код подсчитывает количество сайтов в категории "Other" и вычисляет процент таких сайтов от общего числа.

In [5]:
def categorize_sites_v3(site):
    if isinstance(site, str):
        # Проверка на IP-адрес
        if re.match(r"^\d{1,3}(\.\d{1,3}){3}$", site):
            return "IP Address"
        
        # Приведение строки к нижнему регистру
        site = site.lower()
    if "google" in site or "bing" in site or "yahoo" in site or "ya.ru" in site or "yandex" in site or "rambler.ru" in site:
        return "Search Engine"
    elif "instagram" in site or "facebook" in site or "tiktok" in site or "twitter" in site or "linkedin" in site or "vk" in site or "snapchat" in site or "fbsbx" in site or "t.co" in site or "tinder" in site or "twimg" in site or "x.com" in site:
        return "Social Media"
    elif "ozon" in site or "aliexpress" in site or "amazon" in site or "ebay" in site or "etsy" in site or "wb" in site or "jmart" in site:
        return "E-commerce"
    elif "bot" in site or "automation" in site or "crawler" in site or "zennolab.com" in site or "multiloginapp.com" in site:
        return "Bots/Automation"
    elif "news" in site or "bbc" in site or "cnn" in site or "forbes" in site or "nytimes" in site or "reuters" in site:
        return "News/Media"
    elif "bank" in site or "payment" in site or "finance" in site or "paypal" in site or "kaspi" in site:
        return "Finance"
    elif "gov" in site or "edu" in site or "org" in site or "university" in site or "gosuslugi" in site:
        return "Government Services"
    elif "adult" in site or "xxx" in site or "porn" in site:
        return "Adult"
    elif "vpn" in site or "proxy" in site or "ip" in site or "whoer" in site or "ifconfig" in site:
        return "VPN/Proxy"
    elif "gaming" in site or "game" in site or "xbox" in site or "playstation" in site or "betano" in site or "bet365" in site:
        return "Gaming"
    elif "blog" in site or "wordpress" in site or "medium" in site:
        return "Blog"
    elif "shop" in site or "store" in site or "market" in site:
        return "Retail"
    elif "health" in site or "medical" in site or "clinic" in site or "hospital" in site:
        return "Health/Medical"
    elif "cdn" in site  or "yastatic" in site or "mds.yandex.net" in site or "cmd328" in site or "dzeninfra" in site or "avatars.dzeninfra" in site or "fastly" in site or "vercel" in site:
        return "Content Delivery Network"
    elif "ad" in site or "ads" in site or "advertising" in site or "googleadservices.com" in site or "doubleclick.net" in site or "tribalfusion" in site or "rubiconproject" in site or "bidswitch" in site or "criteo" in site or "doubleverify" in site or "2mdn" in site or "demdex" in site or "gplinks" in site or "agkn" in site or "openx" in site or "buzzoola" in site or "casalemedia" in site or "contextweb" in site or "sharethrough" in site or "pubmatic" in site or "mediavine" in site:
        return "Advertising"
    elif "cloud" in site or "azure" in site or "aws" in site:
        return "Cloud Services"
    elif "yandexmetrica" in site or "mc.yandex" in site or "ymetrica" in site or "dzen" in site:
        return "Analytics/Tracking"
    elif "mail" in site or "email" in site or "smtp" in site:
        return "Email Service"
    elif "youtube" in site or "ytimg" in site:
        return "Video Streaming"
    elif "gibdd" in site:
        return "Government Services"
    elif "api" in site or "client-api" in site:
        return "API Service"
    elif "sync" in site or "id5-sync" in site:
        return "Analytics/Tracking"
    elif site.replace('.', '').isdigit():  # Identifying IP addresses
        return "IP Address"
    elif "gstatic" in site or "fonts.googleapis.com" in site or "akamai" in site or "gvt1" in site or "jquery" in site:  # Categorizing technical resources
        return "Technical Resource"
    elif "tracking" in site or "analytics" in site or "beacons" in site  or "google-analytics.com" in site or "matomo.org" in site or "metrics" in site or "dmp" in site or "quantserve" in site or "scorecardresearch" in site:
        return "Analytics/Tracking"
    elif "scbeasy" in site:
        return "Finance"
    elif "acuityplatform" in site:
        return "Analytics/Tracking"
    elif "2lb" in site:
        return "Content Delivery Network"
    elif "gcp" in site:
        return "Cloud Services"
    elif "myxiyou" in site:
        return "other"
    elif "bidr" in site:
        return "Advertising"
    elif "shikkharalo" in site:
        return "Other"
    elif "nr-data" in site:
        return "Analytics/Tracking"
    elif "365lpodds" in site:
        return "Gaming"
    elif "ufouxbwn" in site:
        return "Advertising"
    elif "onetag-sys" in site:
        return "Advertising"
    elif "power5555" in site:
        return "Content Delivery Network"
    elif "taxismaned" in site:
        return "Other"
    elif "unpkg" in site:
        return "Content Delivery Network"
    elif "likest" in site:
        return "Social Media"
    elif "clarity" in site:
        return "Analytics/Tracking"
    elif "growplow" in site:
        return "Advertising"
    elif "sape" in site:
        return "SEO/Link Building"
    elif "simpli" in site:
        return "Advertising"
    elif "media.net" in site:
        return "Advertising"
    elif "scbeasy" in site:
        return "Finance"
    elif "acuityplatform" in site:
        return "Analytics/Tracking"
    elif "2lb" in site:
        return "Content Delivery Network"
    elif "gcp" in site or "beacons" in site:
        return "Cloud Services"
    elif "myxiyou" in site:
        return "E-commerce"
    elif "shikkharalo" in site:
        return "Other"
    elif "nr-data" in site:
        return "Analytics/Tracking"
    elif "365lpodds" in site:
        return "Gaming"
    elif "ufouxbwn" in site or "onetag-sys" in site:
        return "Advertising"
    elif "power5555" in site:
        return "Content Delivery Network"
    elif "taxismaned" in site:
        return "Advertisements"
    elif "unpkg" in site:
        return "Content Delivery Network"
    elif "likest" in site:
        return "Social Media"
    elif "clarity" in site:
        return "Analytics/Tracking"
    elif "growplow" in site:
        return "Event Management/Marketing"
    elif "sape" in site:
        return "Advertising"
    elif "simpli" in site:
        return "Advertising"
    elif "media.net" in site:
        return "Advertising"
    elif "tns-counter" in site:
        return "Analytics/Tracking"
    elif "scbdevice" in site:
        return "Finance"
    elif "a-mo" in site or "prebid" in site:
        return "Advertising"
    elif "digitaltarget" in site:
        return "Advertising"
    elif "publickeyservice" in site:
        return "Cloud Services"
    elif "example" in site:
        return "Other"
    elif "appier" in site:
        return "Advertising"
    elif "redditstatic" in site:
        return "Social Media"
    elif "mediago" in site:
        return "Advertising"
    elif "typekit" in site:
        return "Technical Resource"
    elif "recaptcha" in site:
        return "Security Service"
    elif "w55c" in site:
        return "Advertising"
    elif "jivosite" in site:
        return "Analytics/Tracking"
    elif "p.tvpixel" in site:
        return "Advertising"
    elif "mts.ru" in site:
        return "Telecommunications"
    elif "media.net" in site:
        return "Advertising"
    elif "betonmobile" in site:
        return "Gaming"
    elif "arkoselabs" in site:
        return "Security Service"
    elif "33across" in site:
        return "Advertising"
    elif "acint" in site:
        return "Analytics/Tracking"
    elif "bitrix" in site:
        return "Technical Resource"
    elif "asia" in site or "wosbet" in site or "66bb" in site:
        return "Gaming"
    elif "ezoic" in site:
        return "Advertising"
    elif "bidster" in site:
        return "Advertising"
    elif "bet" in site or "88" in site or "198198" in site or "nv9891" in site or "m8nice" in site or "umbet" in site or "dg222play" in site:
        return "Gaming"
    elif "createjs" in site:
        return "Technical Resource"
    elif "w9ccc" in site or "month88" in site:
        return "Betting/Gambling"
    elif "kimberlite" in site:
        return "Analytics/Tracking"
    elif "ibb.co" in site:
        return "Image Hosting"
    elif "m8bet" in site or "168" in site or "bms222" in site or "nova88" in site or "umbet" in site or "88bb" in site or "sugar28" in site or "enjoy68" in site:
        return "Gaming"
    elif "owneriq" in site:
        return "Advertising"
    elif "pystatic" in site:
        return "Content Delivery Network"
    elif "ttwstatic" in site:
        return "Content Delivery Network"
    elif "mozilla" in site:
        return "Technical Resource"
    elif "jivox.com" in site or "ult.kz" in site:
        return "Business and Industry"
    elif "textnow" in site:
        return "Telecommunications"
    elif "gumgum" in site:
        return "Advertising"
    elif "kargo" in site:
        return "Advertising"
    elif "fxdx" in site:
        return "Other"
    elif "unity3d" in site or "applovin" in site:
        return "Gaming"
    elif "m8bets" in site or "m8online" in site:
        return "Gaming"
    elif "whatsapp" in site:
        return "Social Media"
    elif "flash" in site or "flashtalking" in site:
        return "Advertising"
    elif "33across" in site or "lkqd" in site or "appocean" in site or "lynx" in site:
        return "Advertising"
    elif "kraken" in site:
        return "Cryptocurrency/Finance"
    elif "intercom" in site:
        return "Customer Support"
    elif "apple.com" in site:
        return "Technology Services"
    elif "zeotap" in site or "processfinger" in site or "omnitagjs" in site or "crwdcntrl" in site:
        return "Analytics/Tracking"
    elif "ivi.ru" in site:
        return "Media Streaming"
    elif "ups-x-sign" in site:
        return "Other"
    elif "blacktowhite" in site or "mommyandlove" in site:
        return "Adult"
    elif "static.net" in site or "uuidksinc" in site:
        return "Technical Resource"
    elif "lkqd" in site:
        return "Advertising"
    elif "taboola" in site or "primis" in site or "bidvol" in site or "acint" in site:
        return "Advertising"
    elif "comagic" in site or "comagic.ru" in site:
        return "Analytics/Tracking"
    elif "media.net" in site or "weborama" in site or "odr.mookie1" in site:
        return "Advertising"
    elif "time.apple" in site or "rezync" in site:
        return "Technology Services"
    elif "kirpich" in site or "dmitralex" in site:
        return "Retail"
    elif "fontawesome" in site  or "inner-active.mobi" in site:
        return "Computers and Internet"
    elif "deliveryhero" in site:
        return "Logistics"
    elif "poczta" in site:
        return "Email Service"
    elif "st.top100.ru" in site or "trustarc" in site or "truste.com" in site:
        return "Analytics/Tracking"
    elif "momkidzone" in site:
        return "Parenting/Family"
    elif "heypiggy" in site:
        return "Gaming"
    elif "reddit" in site or "redd.it" in site or "redditmedia" in site:
        return "Social Media"
    elif "whatsapp" in site or "t.me" in site:
        return "Social Media"
    elif "fontawesome" in site or "walmartimages" in site or "typekit" in site:
        return "Technical Resource"
    elif "unity3d" in site:
        return "Gaming"
    elif "gvt2" in site or "dotomi" in site or "undertone" in site or "utraff" in site or "360yield" in site:
        return "Advertising"
    elif "media.net" in site:
        return "Advertising"
    elif "linkpop" in site:
        return "Link Management"
    elif "fxdx" in site:
        return "Other"
    elif "disqus" in site:
        return "Computers and Internet"
    elif "mxptint" in site:
        return "Advertisements"
    elif "jivosite" in site:
        return "Customer Support"
    elif "cosmetic" in site:
        return "Beauty/Cosmetics"
    elif "comagic" in site:
        return "Analytics/Tracking"
    elif "1x1" in site or "pragmaticplay" in site:
        return "Gaming"
    elif "indexww" in site or "weborama" in site or "seedtag" in site:
        return "Advertising"
    elif "bateriasalinstante" in site:
        return "E-commerce"
    elif "pystatic" in site:
        return "Technical Resource"
    elif "tg.socdm" in site or "mc.acint" in site or "acint" in site:
        return "Analytics/Tracking"
    elif "my.rtmark.net" in site or "go-updater" in site:
        return "Advertising/Tracking"
    elif "schulmanlaw" in site:
        return "Legal Services"
    elif "thrtle" in site:
        return "Content Delivery Network"
    elif "pixabay" in site:
        return "Image Hosting"
    elif "rpc.ankr" in site:
        return "Technology Services"
    elif "bs.serving-sys" in site or "blismedia" in site or "gumgum" in site or "clarity.ms" in site:
        return "Advertising"
    elif "otlp.bugsnag" in site:
        return "Analytics/Tracking"
    elif "ulogin.ru" in site:
        return "Authentication Services"
    elif "feed.pghub" in site:
        return "Content Delivery Network"
    elif "c.go-mpulse.net" in site:
        return "Cloud Services"
    elif "example.com" in site:
        return "Other"
    elif "kinotv" in site:
        return "Media Streaming"
    elif "s.yimg" in site:
        return "Advertising"
    elif "troyka-tc.ru" in site or "riviera-tc.ru" in site:
        return "Retail"
    elif "tzegilo.com" in site or "schelkovskiy.ru" in site:
        return "Other"
    elif "onlyfans" in site:
        return "Adult"
    elif "rambler" in site:
        return "Media Streaming"
    elif "rostelecom" in site:
        return "Telecommunications"
    elif "appiersig" in site or "acint.net" in site:
        return "Analytics/Tracking"
    elif "krushmedia" in site:
        return "Advertising"
    elif "mixvisor" in site:
        return "Analytics/Tracking"
    elif "infolinks" in site:
        return "Advertising"
    elif "vto.pe" in site:
        return "Other"
    elif "blockstock" in site:
        return "Other"
    elif "static.aruodas.lt" in site or "storage" in site:
        return "Content Delivery Network"
    elif "unrulymedia" in site or "mod.calltouch" in site:
        return "Advertising"
    elif "s.pinimg.com" in site:
        return "Social Media"
    elif "apple.com" in site:
        return "Technology Services"
    elif "inmobi" in site or "clarity.ms" in site or "newrelic" in site or "rtactivate" in site:
        return "Advertising"
    elif "scbdevice" in site:
        return "Finance"
    elif "rabby.io" in site:
        return "Blockchain Services"
    elif "grow.me" in site:
        return "Advertising"
    elif "blockstock" in site or "bpi.rtactivate" in site:
        return "Other"
    elif "medievalfayre" in site:
        return "Other"
    elif "et-eus" in site:
        return "Other"
    elif "pedidosya.com.uy" in site:
        return "E-commerce"
    elif "session.scbdevice" in site:
        return "Finance"
    elif "derelay" in site:
        return "Blockchain Services"
    elif "idpix" in site:
        return "Advertising"
    elif "launchdarkly" in site:
        return "Cloud Services"
    elif "gumgum" in site or "taboola" in site or "ftstatic" in site:
        return "Advertising"
    elif "unity3d" in site or "dofus" in site:
        return "Gaming"
    elif "appier" in site or "acint" in site:
        return "Analytics/Tracking"
    elif "byteoversea" in site:
        return "Cloud Services"
    elif "mobile.de" in site:
        return "E-commerce"
    elif "pghub.io" in site:
        return "Content Delivery Network"
    elif "my.rtmark" in site or "paker112" in site or "mynumber" in site:
        return "Other"
    elif "tc-moskovsky.ru" in site:
        return "Retail"
    elif "thrtle.com" in site or "newrotator" in site or "blockstock" in site or "tzegilo.com" in site:
        return "Other"
    elif "roulette.blockx.fun" in site:
        return "Gaming"
    elif "const.uno" in site or "yvision" in site:
        return "Social Media"
    elif "schulmanlaw" in site:
        return "Legal Services"
    elif "rexcaptcha" in site or "pki.goog" in site:
        return "Security Services"
    elif "tasks.vto.pe" in site:
        return "Other"
    elif "clarity.ms" in site or "weborama" in site or "yieldmo" in site:
        return "Advertising"
    elif "1rpc.io" in site:
        return "Technical Resource"
    elif "megogo.net" in site:
        return "Media Streaming"
    elif "flerap.com" in site or "crcldu.com" in site:
        return "Other"
    elif "tawk.to" in site:
        return "Customer Support"
    elif "iqzone" in site or "bidderstack" in site:
        return "Advertising"
    elif "soloway.ru" in site or "tasks.vto.pe" in site or "bigslide.ru" in site:
        return "Other"
    elif "inmobi" in site or "germanshepherdproblems.com" in site or "pix.bumlam.com" in site:
        return "Other"
    elif "appier" in site:
        return "Analytics/Tracking"
    elif "olx.ua" in site:
        return "E-commerce"
    elif "embed.tawk.to" in site:
        return "Customer Support"
    elif "bumlam" in site or "gurusoluciones.com" in site:
        return "Other"
    elif "mynamaz.ru" in site or "mweb.ck.inmobi.com" in site:
        return "Other"
    elif "login.live.com" in site:
        return "Email Service"
    elif "newrelic" in site or "clarity.ms" in site or "appier" in site:
        return "Analytics/Tracking"
    elif "applovin" in site:
        return "Advertising"
    elif "soloway" in site or "paker112" in site:
        return "Other"
    elif "inmobi" in site or "global.ib-ibi" in site or "simpli" in site or "crcldu.com" in site:
        return "Advertising"
    elif "tasks.vto.pe" in site or "blockstock" in site:
        return "Other"
    elif "360yield" in site or "ftstatic" in site or "tremorhub.com" in site:
        return "Advertising"
    elif "nullroute.io" in site:
        return "Technical Resource"
    elif "login.newrelic.com" in site:
        return "Analytics/Tracking"
    elif "medievalfayre.com" in site:
        return "Other"
    elif "schelkovskiy.ru" in site or "yvision" in site:
        return "Social Media"
    elif "elascenic.io" in site or "tags.soloway.ru" in site:
        return "Other"
    elif "c.appier.net" in site:
        return "Analytics/Tracking"
    elif "bumlam.com" in site:
        return "Other"
    elif "temu.com" in site:
        return "E-commerce"
    elif "picturesdown" in site or "rexcaptcha" in site:
        return "Other"
    elif "clarity.ms" in site:
        return "Analytics/Tracking"
    elif "dodopizza" in site:
        return "E-commerce"
    elif "turn.com" in site:
        return "Advertising"
    elif "unity3d" in site or "clickagy" in site or "media.net" in site:
        return "Advertising"
    elif "connect.ok.ru" in site:
        return "Social Media"
    elif "logx.optimizely.com" in site:
        return "Analytics/Tracking"
    elif "hcaptcha.com" in site:
        return "Security Services"
    elif "sabon-sm.ru" in site or "gnezdo.ru" in site:
        return "Retail"
    elif "scenum.io" in site or "pix.bumlam" in site:
        return "Other"
    elif "newrotator" in site or "mynumber" in site:
        return "Other"
    elif "algoparalasalud.es" in site:
        return "Health/Medical"
    elif "germanshepherdproblems.com" in site:
        return "Other"
    elif "mookie1.com" in site:
        return "Advertising"
    elif "clarity.ms" in site or "ezodn.com" in site:
        return "Advertising"
    elif "rambler" in site or "sharethis" in site or "disqus" in site:
        return "Social Media"
    elif "newrotator" in site or "blockstock" in site:
        return "Other"
    elif "tns-counter.ru" in site:
        return "Analytics/Tracking"
    elif "bugsnag" in site:
        return "Analytics/Tracking"
    elif "bateriascolombia" in site:
        return "Retail"
    elif "autotouch" in site:
        return "Other"
    elif "flerap" in site or "medievalfayre" in site:
        return "Other"
    elif "yonmewon" in site or "paker112" in site:
        return "Other"
    elif "sessions" in site:
        return "Customer Support"
    elif "narrative" in site or "elastic.scenum" in site:
        return "Other"
    elif "soloway" in site:
        return "Retail"
    elif "px.arcspire" in site:
        return "Other"
    elif "bateriascolombia" in site:
        return "Retail"
    elif "us.ck-ie" in site:
        return "Advertising"
    elif "fontawesome" in site or "services.mozilla" in site:
        return "Technical Resource"
    elif "gurusoluciones" in site or "my.rtmark" in site:
        return "Other"
    elif "rexcaptcha" in site or "pki.rt.ru" in site:
        return "Security Services"
    elif "medievalfayre" in site:
        return "Other"
    elif "bumlam" in site or "zero.kz" in site:
        return "Analytics/Tracking"
    elif "pizzlessclimb" in site:
        return "Other"
    elif "clarity" in site or "newrotator" in site:
        return "Advertising"
    elif "commentfreely" in site:
        return "Social Media"
    elif "mgid.com" in site:
        return "Advertising"
    elif "textnow" in site or "matching" in site:
        return "Social Media"
    elif "px.arcspire" in site or "flashtalking" in site:
        return "Advertising"
    elif "magazin-laminata" in site or "medievalfayre" in site:
        return "Retail"
    elif "syndicatedsearch.goog" in site:
        return "Search Engine"
    elif "tasks.vto.pe" in site or "blockstock" in site:
        return "Other"
    elif "sonar.semantiqo" in site:
        return "Analytics/Tracking"
    elif "cs.minutemedia" in site:
        return "Advertising"
    elif "company.rt" in site or "z.ru" in site:
        return "Retail"
    elif "identory" in site:
        return "Other"
    elif "hlmiq" in site or "web-static.textnow" in site:
        return "Telecommunications"
    elif "paker112" in site:
        return "Other"
    elif "bigslide" in site:
        return "Other"
    elif "clarity.ms" in site or "ezoic.com" in site:
        return "Advertising"
    elif "luxbrand.bg" in site or "zoon.ru" in site:
        return "Retail"
    elif "medievalfayre" in site:
        return "Other"
    elif "rambler" in site or "sharethis" in site:
        return "Social Media"
    elif "autotouch" in site or "blockstock" in site:
        return "Other"
    elif "newrotator" in site or "elastic.scenum" in site:
        return "Other"
    elif "launchdarkly.com" in site:
        return "Cloud Services"
    elif "picturesdown" in site or "flashtalking" in site:
        return "Advertising"
    elif "mindbox.ru" in site or "intercom.io" in site:
        return "Customer Support"
    elif "narrative.io" in site:
        return "Other"
    elif "germanshepherdproblems" in site or "coinbase.com" in site:
        return "Other"
    elif "fs.ml.com" in site:
        return "Other"
    elif "celtra.com" in site:
        return "Advertising"
    elif "rambler.ru" in site or "shikkharalo" in site:
        return "Other"
    elif "flerap" in site or "pizzlessclimb" in site:
        return "Other"
    elif "all-brick.ru" in site:
        return "Retail"
    elif "widgets.fbshare.me" in site or "track.celtra.com" in site:
        return "Advertising"
    elif "clarity.ms" in site or "bestssp.com" in site:
        return "Advertising"
    elif "coinbase.com" in site:
        return "Finance"
    elif "t-mobile" in site:
        return "Telecommunications"
    elif "secure-gl.imrworldwide.com" in site or "rtbhouse.com" in site:
        return "Advertising"
    elif "medievalfayre" in site:
        return "Other"
    elif "gurusoluciones" in site:
        return "Customer Support"
    elif "blockstock" in site:
        return "Other"
    elif "tasks.vto.pe" in site:
        return "Other"
    elif "scenum.io" in site or "gemius.pl" in site:
        return "Analytics/Tracking"
    elif "chat.gurusoluciones" in site:
        return "Customer Support"
    elif "shikkharalo" in site or "power5555" in site:
        return "Other"
    elif "rtbwave" in site or "match.qtarget.tech" in site:
        return "Advertising"
    elif "flashtalking" in site:
        return "Advertising"
    elif "storygize.net" in site or "bestssp.com" in site:
        return "Advertising"
    elif "narrative.io" in site or "elastic.scenum.io" in site:
        return "Analytics/Tracking"
    elif "autotouch.net" in site or "example.com" in site:
        return "Other"
    elif "coinbase.com" in site or "fsp.ml.com" in site:
        return "Finance"
    elif "clarity.ms" in site or "histats.com" in site:
        return "Analytics/Tracking"
    elif "assets.a-mo.net" in site:
        return "Advertising"
    elif "abercrombie.com" in site:
        return "Retail"
    elif "brave.com" in site:
        return "Advertising"
    elif "fs.ml.com" in site:
        return "Finance"
    elif "linkpop.com" in site:
        return "Other"
    elif "traffmag.com" in site:
        return "Other"
    elif "kit.fontawesome" in site:
        return "Technical Resource"
    elif "plista.com" in site:
        return "Advertising"
    elif "star-randsrv" in site:
        return "Advertising"
    elif "user.krxd.net" in site:
        return "Analytics/Tracking"
    elif "impression.appsflyer" in site:
        return "Advertising"
    elif "ytbsearch" in site:
        return "Search Engine"
    elif "germanshepherdproblems" in site:
        return "Other"
    elif "paker112" in site or "blockstock" in site:
        return "Other"
    elif "chevyideas" in site:
        return "Other"
    elif "flashtalking.com" in site:
        return "Advertising"
    elif "events-dca.bidder" in site:
        return "Advertising"
    elif "medievalfayre" in site:
        return "Other"
    elif "clarity.ms" in site or "linkpop" in site:
        return "Analytics/Tracking"
    elif "power5555" in site:
        return "Other"
    elif "example.com" in site:
        return "Other"
    elif "w9ccc.com" in site:
        return "Other"
    elif "nszn.ru" in site:
        return "Other"
    elif "bigslide.ru" in site:
        return "Other"
    elif "tzegilo.com" in site:
        return "Other"
    elif "momkidzone" in site or "yonmewon" in site:
        return "Other"
    elif "blockstock.ru" in site or "medievalfayre" in site:
        return "Other"
    elif "medievalfayre.com" in site:
        return "Other"
    elif "xn--80aqeejfmlki0b7ds0a.com.ua" in site:
        return "Retail"
    elif "flarap.com" in site or "pizzlessclimb" in site:
        return "Other"
    elif "paker112.com" in site:
        return "Other"
    elif "gracesolar.it" in site:
        return "Energy"
    elif "mid.rkdms.com" in site or "umatch.krxd.net" in site:
        return "Analytics/Tracking"
    elif "cardspro.com" in site:
        return "Finance"
    elif "clarity.ms" in site or "mxptint" in site:
        return "Analytics/Tracking"
    elif "flerap.com" in site or "blockstock" in site:
        return "Other"
    elif "skelbiu.lt" in site:
        return "Retail"
    elif "autotouch.net" in site or "identory" in site:
        return "Other"
    elif "tzegilo.com" in site or "thrtle.com" in site:
        return "Other"
    elif "pmp.mxptint.net" in site:
        return "Advertising"
    elif "endway.su" in site:
        return "Other"
    elif "identory.com" in site:
        return "Other"
    elif "tasks.vto.pe" in site or "yonmewon" in site:
        return "Other"
    elif "medievalfayre" in site or "paker112" in site:
        return "Other"
    elif "cmr.bidderstack.com" in site:
        return "Advertising"
    elif "b-mode.ru" in site or "blockstock" in site:
        return "Other"
    elif "wt.rqtrk.eu" in site or "krxd" in site:
        return "Analytics/Tracking"
    elif "lib1.biz" in site:
        return "Other"
    elif "gvt2.com" in site:
        return "Content Delivery Network"
    elif "rmp.rakuten.com" in site:
        return "Advertising"
    elif "static.skelbiu.lt" in site:
        return "Retail"
    elif "mpsnare" in site:
        return "Analytics/Tracking"
    elif "clicktale" in site:
        return "Analytics/Tracking"
    elif "flerap" in site or "blockstock" in site:
        return "Other"
    elif "chevyideas.com" in site:
        return "Retail"
    elif "decibelinsight" in site:
        return "Analytics/Tracking"
    elif "clarity.ms" in site or "mxptint" in site:
        return "Analytics/Tracking"
    elif "medievalfayre" in site or "blockstock" in site:
        return "Other"
    elif "skelbiu.lt" in site:
        return "Retail"
    elif "tasks.vto.pe" in site:
        return "Other"
    elif "clubexpert.su" in site or "oreman" in site:
        return "Other"
    elif "b-mode.ru" in site or "momkidzone" in site:
        return "Other"
    elif "turn.fpjs.io" in site:
        return "Analytics/Tracking"
    elif "bshr.ezodn.com" in site:
        return "Advertising"
    elif "congress-rusmammo.ru" in site:
        return "Other"
    elif "forexinvestor.ru" in site:
        return "Finance"
    elif "stroymasters-24.ru" in site:
        return "Retail"
    elif "germanshepherdproblems.com" in site or "bigslide.ru" in site:
        return "Other"
    elif "fxdx" in site or "i.fxdx" in site:
        return "Other"
    elif "lib1.biz" in site:
        return "Other"
    elif "us01.z.antigena.com" in site:
        return "Security Services"
    elif "rutube" in site:
        return "Media Streaming"
    elif "edge.safedk.com" in site or "fpjs.io" in site:
        return "Analytics/Tracking"
    elif "forexinvestor.ru" in site:
        return "Finance"
    elif "didiglobal" in site:
        return "Cloud Services"
    elif "microsoft" in site or "office" in site or "live.com" in site:
        return "Microsoft Services"
    elif "morelogin" in site or "octobrowser" in site:
        return "Browser Automation"
    elif "emex" in site or "avito" in site or "poshmark" in site:
        return "Retail/E-commerce"
    elif "informburo" in site or "rbc" in site:
        return "News/Media"
    elif "discord" in site:
        return "Social Communication"
    elif "pixel" in site or "clickcease" in site or "onaudience" in site:
        return "Tracking/Analytics"
    elif "bodyffitnes" in site:
        return "Health/Fitness"
    elif "megogo" in site or "bzserial" in site:
        return "Media Streaming"
    elif "hotjar" in site or "rfihub" in site or "eyeota" in site or "clinch.co" in site or "kampyle" in site or "bumlam" in site:
        return "Tracking/Analytics"
    elif "ticket" in site or "queue-it" in site:
        return "Ticketing Services"
    elif "alibaba" in site:
        return "E-commerce"
    elif "kolesa" in site:
        return "Automotive"
    elif "unimed" in site or "amil" in site:
        return "Health Services"
    elif "vinted" in site:
        return "Retail/E-commerce"
    elif "blockchain" in site or "testnet" in site or "binance" in site or "dex" in site:
        return "Cryptocurrency/Blockchain"
    elif "programmatic" in site or "bttrack" in site or "ohmy.bid" in site:
        return "Advertising"
    elif "tvigle" in site:
        return "Media Streaming"
    elif "trendyol" in site or "sda.fyi" in site:
        return "Ticketing Services"
    elif "netspend" in site:
        return "Financial Services"
    elif "gravatar" in site or "perimeterx" in site:
        return "Security Services"
    elif "datawarehouse" in site or "quantummetric" in site or "online-metrix" in site:
        return "Data/Analytics"
    elif "bumble" in site:
        return "Social/Dating"
    elif "rbk" in site:
        return "News/Media"
    elif "clarium" in site:
        return "Security Services"
    elif "idealmedia" in site or "alfasense" in site:
        return "Advertising/Marketing"
    elif "pusher" in site:
        return "Telecommunication Services"
    elif "emiratesislamic" in site:
        return "Financial Services"
    elif "play241" in site or "teemooge" in site or "vr3d" in site:
        return "Gaming/Entertainment"
    elif "pinimg" in site:
        return "Image Hosting"
    elif "targetimg" in site or "hb-bidder" in site:
        return "Advertising"
    elif "sq.com.ua" in site:
        return "News/Media"
    elif "appsflyer" in site:
        return "App Analytics"
    elif "samplicio" in site or "swagbucks" in site:
        return "Quiz/Survey Platforms"
    elif "scene7" in site or "prdg" in site or "classistatic" in site:
        return "Media/Content Hosting"
    elif "blockstock" in site:
        return "Blockchain Services"
    elif "justpremium" in site or "t13" in site:
        return "Ad Networks"
    elif "tweetmeme" in site:
        return "Social Sharing"
    elif "narrative" in site or "openwebmp" in site:
        return "Tracking/Analytics"
    elif "primeopinion" in site or "fivesurveys" in site:
        return "Survey/Feedback"
    elif "spotify" in site or "fwmrm" in site:
        return "Streaming Media"
    elif "slutwives" in site:
        return "Adult Content"
    elif "innovid" in site or "hs-banner" in site:
        return "Ad/Tracking Services"
    elif "mynamaz" in site:
        return "Community/Religious"
    elif "bizzopinion" in site:
        return "Survey/Feedback"
    elif "olx" in site:
        return "Classifieds/Real Estate"
    elif "rudub" in site:
        return "Blockchain Services"
    elif "reestr-pki" in site:
        return "PKI Services"
    elif "anextour" in site:
        return "Travel/Booking"
    elif "canva" in site:
        return "Design/Creative Services"
    elif "satu" in site:
        return "Online Marketplace"
    elif "hd-rezka" in site:
        return "Video Streaming"
    elif "medievalfayre" in site:
        return "Historical Events/Fantasy"
    elif "richaudience" in site:
        return "Advertising Networks"
    elif "lordfilm" in site or "instreamvideo" in site:
        return "Video Streaming"
    elif "cancanawards" in site:
        return "Events/Awards"
    elif "smi2" in site or "agency2" in site:
        return "News/Media"
    elif "lords-films" in site:
        return "Video Streaming"
    elif "pactsocial" in site:
        return "Social Media Tools"
    elif "footballbookprod" in site:
        return "Sports Platforms"
    elif "stats.wp" in site:
        return "Analytics/Tracking"
    elif "kfc" in site:
        return "Food Services"
    elif "onesignal" in site:
        return "Push Notification Services"
    elif "playrohan" in site:
        return "Gaming Platforms"
    elif "germanshepherdproblems" in site:
        return "Pet Care"
    elif "big-razmer" in site:
        return "Adult Content"
    elif "fullstory" in site or "sentry.io" in site:
        return "Analytics Services"
    elif "soloway" in site:
        return "Customer Support Services"
    elif "everesttech" in site:
        return "Advertising Platforms"
    elif "textfree" in site:
        return "Messaging Services"
    elif "locateplus" in site:
        return "Location/Geospatial Services"
    elif "audrte" in site or "nextmillmedia" in site or "xlgmedia" in site:
        return "Advertising Platforms"
    elif "tictuk" in site:
        return "Social Media Tools"
    elif "visualwebsiteoptimizer" in site:
        return "Website Optimization"
    elif "fitnes-girl" in site:
        return "Fitness/Health"
    elif "aliyuncs" in site:
        return "Cloud Services"
    elif "wixstatic" in site:
        return "Static Hosting"
    elif "secured-entry" in site:
        return "Security Services"
    elif "klaviyo" in site:
        return "Email Marketing"
    elif "deepl" in site:
        return "Translation Services"
    elif "remerge" in site:
        return "Advertising Platforms"
    elif "nalog" in site:
        return "Government Services"
    elif "showjet" in site:
        return "Video Streaming"
    elif "ispot" in site:
        return "Ad Tracking"
    elif "afm01" in site:
        return "Affiliate Marketing"
    elif "ruserialy" in site or "kinescope" in site:
        return "Video Streaming"
    elif "ok.ru" in site:
        return "Social Media"
    elif "sfat-ryazan" in site:
        return "Government Services"
    elif "centrum24" in site:
        return "Banking Services"
    elif "magazin-postelnoe-belye" in site:
        return "E-commerce"
    elif "tikiwiki" in site:
        return "Content Management Systems (CMS)"
    elif "fototeni" in site:
        return "Photography Services"
    elif "watchfeed" in site:
        return "Video Streaming"
    elif "clubexpert" in site:
        return "Social Clubs"
    elif "mempool" in site:
        return "Cryptocurrency Services"
    elif "arbachakov" in site:
        return "Cultural/Art Platforms"
    elif "fczi" in site:
        return "Sports Clubs"
    elif "satsis" in site:
        return "Satellite/Technology"
    elif "cougsfortomorrow" in site:
        return "Support/Advocacy"
    elif "mtgglobals" in site:
        return "Global Marketing"
    elif "rostelekom" in site:
        return "Telecommunication"
    elif "roerich-belogorie" in site:
        return "Cultural Heritage"
    elif "lotus-dsp" in site:
        return "Advertising Platforms"
    elif "ssisurveys" in site:
        return "Survey Platforms"
    elif "split.io" in site:
        return "Software Development"
    elif "wsdata" in site:
        return "Data Platforms"
    elif "lordserial" in site:
        return "Video Streaming"
    elif "skyviewobservatory" in site:
        return "Tourism/Observation"
    elif "evidon" in site:
        return "Privacy/Tracking Management"
    elif "mvideo-catalog" in site:
        return "Retail/E-commerce"
    elif "gotcha-server" in site:
        return "Server Hosting"
    elif "alfasrv" in site:
        return "Advertising Services"
    elif any(keyword in site for keyword in ['spectrumsurveys', 'primesurveys', 'cintnetworks', 'siteintercept.qualtrics']):
        return "Survey/Research"
    elif any(keyword in site for keyword in ['hind.ee', 'aritzia', 'fa181818.com', 'kk383838.com']):
        return "E-commerce"
    elif any(keyword in site for keyword in ['awsstatic', 'visualstudio', 'vungle.io', '2gis']):
        return "Cloud/Hosting"
    elif any(keyword in site for keyword in ['player.vimeo', 'film.ru', 'kinogo', 'moe.video', 'relap.io']):
        return "Media/Entertainment"
    elif any(keyword in site for keyword in ['2gis', 'maps']):
        return "Mapping Services"
    elif any(keyword in site for keyword in ['punchmedia', 'gmossp-sp', 'liftoff', 'shareaholic']):
        return "Advertising Platforms"
    elif any(keyword in site for keyword in ['toppress.kz', 'sinoptik.ua', 'baskino', 'gopro', 'static2.tinkoffjournal']):
        return "News/Media"
    elif any(keyword in site for keyword in ['luisaviaroma', 'auto.ru', 'eg33.net', 'microwaveturntableplate', 'camelia.lt']):
        return "E-commerce"
    elif any(keyword in site for keyword in ['chartbeat', 'geojs', 'clean.gg', 'flux.jp', 'pxl.iqm']):
        return "Data/Tracking"
    elif any(keyword in site for keyword in ['spankbang', 'detectiveonline', 'tvigl', 'jke3.jetkino']):
        return "Entertainment"
    elif any(keyword in site for keyword in ['wp.com', 'on.kz', 'webfiles.pro', 's5o.ru']):
        return "Content Hosting"
    elif any(keyword in site for keyword in ['satsis.info', 'secure.gopro', 'storm.gde-luchshe']):
        return "Technology Services"
    elif any(keyword in site for keyword in ['dsp.mpartner.digital', 'r.prmin.net', 'ssp.otm-r.com']):
        return "Advertising"
    elif any(keyword in site for keyword in ['mediatoday.ru', 'doctorhouse.tv', 'mymusicclass.com.au', 'attentionxyz']):
        return "Media"
    elif any(keyword in site for keyword in ['counter6.optistats.ovh', 'usage.trackjs', 'b.videoamp.com', 'bounceexchange']):
        return "Data/Tracking"
    elif any(keyword in site for keyword in ['sheinsz.ltwebstatic', 'knygos.lt', 'prolock.kz']):
        return "E-commerce"
    elif any(keyword in site for keyword in ['s.go-mpulse', 'imagedelivery']):
        return "Content Delivery"
    elif any(keyword in site for keyword in ['auth.imaq.kz', 'chat.chatra.io']):
        return "Authentication Services"
    elif any(keyword in site for keyword in ['forms-na1.hsforms', 'panorama.wixapps']):
        return "Website Tools"
    elif any(keyword in site for keyword in ['bes.media', 'images.rappi', 'img1.wsimg']):
        return "Content Hosting"
    elif any(keyword in site for keyword in ['surveyrouter', 'swfly744']):
        return "Survey/Research"
    elif any(keyword in site for keyword in ['tag.rutarget', 'rt.marphezis', 'pub-eu.p.otm-r']):
        return "Tracking/Advertising"
    elif any(keyword in site for keyword in ['geoedge.be', 'arcspire.io']):
        return "Technology Services"
    elif any(keyword in site for keyword in ['popt.in', 'impressions.onelink']):
        return "Marketing"
    elif any(keyword in site for keyword in ['pbstck.com', 'm.trafmag']):
        return "Affiliate Marketing"
    elif any(keyword in site for keyword in ['brevo.com', 'gravitec.net']):
        return "Email/Notification Services"
    elif any(keyword in site for keyword in ['bidder.skcrtxr', 'rpc.skcrtxr', 'id.rtb.mx', 'paleolifestyledoctor']):
        return "Advertising/Tracking"
    elif any(keyword in site for keyword in ['assets.squarespace', 'static.terratraf', 'c.pub.network']):
        return "Content Hosting"
    elif any(keyword in site for keyword in ['pro-online-classes']):
        return "Education"
    elif any(keyword in site for keyword in ['cheapestprice-online-cialis']):
        return "E-commerce"
    elif any(keyword in site for keyword in ['surfcars.net', 'opaloidridding']):
        return "Other"
    elif any(keyword in site for keyword in ['gidonline-hd', 'vop.sundaysky', 'appcloner']):
        return "Media"
    elif any(keyword in site for keyword in ['livejournal.net', 'picturesdown', 'assets.censor', 'zemanta']):
        return "Content Hosting"
    elif any(keyword in site for keyword in ['megafonpro', 'xfinity']):
        return "Telecommunication"
    elif any(keyword in site for keyword in ['elastic.scenum', 'smartlook', 'jelly.mdhv']):
        return "Technology Services"
    elif any(keyword in site for keyword in ['fourier.taobao', 'japandomesticmotor']):
        return "E-commerce"
    elif any(keyword in site for keyword in ['cackle.me', 'surfcars.net', 'form.meriden']):
        return "Marketing"
    elif any(keyword in site for keyword in ['voicefive', 'detectme.net', 'censor.net']):
        return "Advertising Platforms"
    elif any(keyword in site for keyword in ['d.socdm', 'vidoomy', 'setupcmp']):
        return "Tracking/Advertising"
    elif any(keyword in site for keyword in ['slottica14', 'fit-future', 'septiko-drenage']):
        return "Retail"
    elif any(keyword in site for keyword in ['rosiyane', 'brokgaus-slovar', 'tratatur']):
        return "Media/News"
    elif any(keyword in site for keyword in ['calendly', 'braze', 'industrysf']):
        return "Technology Services"
    elif any(keyword in site for keyword in ['usbrowserspeed', 'w9ccc', 'opaloidridding']):
        return "Other"
    elif any(keyword in site for keyword in ['profit.kz', 'mattressmanaustin', 'remsam', 'pobedavmeste']):
        return "E-commerce"
    elif any(keyword in site for keyword in ['tumblr', 'zdassets', 'bnbstatic', 'excite']):
        return "Content Hosting"
    elif any(keyword in site for keyword in ['serverbid', 'parsely', 'b-mode', 'sony']):
        return "Technology Services"
    elif any(keyword in site for keyword in ['usbrowserspeed', 'onelink', 'imperium']):
        return "Tracking/Advertising"
    elif any(keyword in site for keyword in ['celebshistory', 'momkidzone', 'rosiyane']):
        return "Media/News"
    elif any(keyword in site for keyword in ['spishy-dolgy', 'zagruzchikom', 'pix.pub']):
        return "Content Hosting"
    elif any(keyword in site for keyword in ['sojern', 'researchnow', 'kueezrtb']):
        return "Advertising/Tracking"
    elif any(keyword in site for keyword in ['profitnessdoma', 'pelaezhermanos']):
        return "E-commerce"
    elif any(keyword in site for keyword in ['cox.net', 'ccgateway']):
        return "Telecommunication Services"
    elif any(keyword in site for keyword in ['identory', 'crcldu', 'allstat-pp']):
        return "Other"
    elif any(keyword in site for keyword in ['github', 'bazaarvoice']):
        return "Technology Services"
    elif any(keyword in site for keyword in ['ctfassets', 'slatergordon', 'gammaplatform']):
        return "Content Hosting"
    elif any(keyword in site for keyword in ['sample-cube', 'surveygizmo']):
        return "Survey/Research"
    elif any(keyword in site for keyword in ['77vr', 'allstat-pp', 'mosauto-service']):
        return "E-commerce"
    elif any(keyword in site for keyword in ['natisbraslet', 'hornoselectricos']):
        return "Telecommunication Services"
    elif any(keyword in site for keyword in ['paker112', 'arhidic', 'balkan-tour']):
        return "Other"
    elif any(keyword in site for keyword in ['neuruslab', 'erotic', 'goldod']):
        return "Media"
    elif any(keyword in site for keyword in ['egoid', 'schelkovskiy', 'kuchasnega', 'blob.core']):
        return "Content Hosting"
    elif any(keyword in site for keyword in ['forms.hsforms', 'swotwizard', 'cnt.fout']):
        return "Survey/Research"
    elif any(keyword in site for keyword in ['udachnaya-tekhnika', 'balkan-tour', 'aruodas']):
        return "E-commerce"
    elif any(keyword in site for keyword in ['fxdx', 'paker112']):
        return "Telecommunication Services"
    elif any(keyword in site for keyword in ['neprostitutkikirov', 'maxhome', 'foodpanda']):
        return "Other"
    elif any(keyword in site for keyword in ['wal.co', 'joisuitesvenezia', 'rugraphics']):
        return "Advertising Platforms"
    elif any(keyword in site for keyword in ['hscollectedforms', 'srmdata', 'theflyingbirds']):
        return "Survey/Research"
    elif any(keyword in site for keyword in ['neprostitutkikirov', 'superdachnik', 'maxhome']):
        return "E-commerce"
    elif any(keyword in site for keyword in ['digitalwebhome', 'chevyideas', 'the-best-lawyers']):
        return "Content Hosting"
    elif any(keyword in site for keyword in ['krebsonsecurity', 'taxismaned']):
        return "Media/Entertainment"
    elif any(keyword in site for keyword in ['js.hscollectedforms', 'sam-master']):
        return "Other"
    elif any(keyword in site for keyword in ['whois-actor', 'womenpreneurme']):
        return "Media"
    elif any(keyword in site for keyword in ['identory', 'uncn.jp', 'suprion']):
        return "Survey/Research"
    elif any(keyword in site for keyword in ['ranking.kz', 'promore', 'brand-tovar']):
        return "E-commerce"
    elif any(keyword in site for keyword in ['ups-x-sign', 'so-gre8', 'bookkooq']):
        return "Content Hosting"
    elif any(keyword in site for keyword in ['a11ybar', 'affec.tv', 'rtb.com']):
        return "Advertising/Tracking"
    elif any(keyword in site for keyword in ['medplusneft', 'kubgosarhiv', 'moy-m-portal']):
        return "Other"
    elif any(keyword in site for keyword in ['whois-actor', 'lib1.biz', 'mulleyrewayle']):
        return "Media"
    elif any(keyword in site for keyword in ['instruments.tds.bid', 'newrotatormarch23', 'thrtle']):
        return "Survey/Research"
    elif any(keyword in site for keyword in ['example', 'crcldu', 'dstatlove']):
        return "Content Hosting"
    elif any(keyword in site for keyword in ['foodpanda.my', 'sms-activate', 'dental-implants-site']):
        return "E-commerce"
    elif any(keyword in site for keyword in ['a.a47b', 'skcrtxr', 'u-sjc03.e-planning']):
        return "Advertising/Tracking"
    elif any(keyword in site for keyword in ['w9ccc', 'browserleaks']):
        return "Other"
    elif any(keyword in site for keyword in ['server.cpmstar', 'reprohelp', 'player.aniview']):
        return "Media/Entertainment"
    elif any(keyword in site for keyword in ['power5555', 'i8my1', 'i-sphere', 'iron-samson']):
        return "Content Hosting"
    elif any(keyword in site for keyword in ['deliv12', 'pochta', 'akkoraru', 'rappi']):
        return "E-commerce"
    elif any(keyword in site for keyword in ['otm-r', 'uptolike', 'serving-sys', 'aarki', 'retargetly']):
        return "Advertising/Tracking"
    elif any(keyword in site for keyword in ['quora', 'toluna', 'hit.ua']):
        return "Survey/Research"
    elif any(keyword in site for keyword in ['nszn', 'browserleaks', 'gso.amocrm']):
        return "Other"
    elif any(keyword in site for keyword in ['almin', 'rbtwo', 'macys']):
        return "E-commerce"
    elif any(keyword in site for keyword in ['trkn', 'rtb.hhkld', 'audiencedata', '24log']):
        return "Advertising/Tracking"
    elif any(keyword in site for keyword in ['globalsign', 'smk-alhikmah']):
        return "Survey/Research"
    elif any(keyword in site for keyword in ['soflopxl', 'falabella', 'squarespace']):
        return "Content Hosting"
    elif any(keyword in site for keyword in ['secure.globalsign', 'browserleaks', 'gso.amocrm']):
        return "Other"
    elif any(keyword in site for keyword in ['bigslide', 'salt-and-pepper', 'instantink']):
        return "Content Hosting"
    elif any(keyword in site for keyword in ['windonpetr', 'moy-m-portal', 'lordy-films', 'pearl-plaza']):
        return "E-commerce"
    elif any(keyword in site for keyword in ['eskimi', 'activemetering', 'd.pub', 'whitesaas']):
        return "Advertising/Tracking"
    elif any(keyword in site for keyword in ['felizcredito', 'opaloidridding', 'qb66', 'sam-master']):
        return "Survey/Research"
    elif any(keyword in site for keyword in ['browserleaks', 'chronos60', 'squareup']):
        return "Other"
    elif any(keyword in site for keyword in ['veristrecked', 'ml314', 'oilwellsublot']):
        return "E-commerce"
    elif any(keyword in site for keyword in ['skwstat', 'themaredmedica', 'gmlinteractive']):
        return "Content Hosting"
    elif any(keyword in site for keyword in ['tvsquared', 'musicandtrumpets', 'hsforms']):
        return "Survey/Research"
    elif any(keyword in site for keyword in ['vak-sms', 'udsp', 'revidy', 'thebrighttag']):
        return "Advertising/Tracking"
    elif any(keyword in site for keyword in ['nszn', 'browserleaks', 'squareup']):
        return "Other"
    elif any(keyword in site for keyword in ['kubtaxi', 'yonmewon', 'chrono60s']):
        return "Media"
    elif any(keyword in site for keyword in ['bateriacarro', 'mebel-malachite', 'find-packing-services']):
        return "E-commerce"
    elif any(keyword in site for keyword in ['scenum.io', 'tourvisor', '24smi']):
        return "Content Hosting"
    elif any(keyword in site for keyword in ['unimrktresponse', 'technoratimedia', 'spot.im']):
        return "Survey/Research"
    elif any(keyword in site for keyword in ['atmequiz', '1rx', 'privacymanager']):
        return "Advertising/Tracking"
    elif any(keyword in site for keyword in ['browserleaks', 'gso.amocrm', 'squareup']):
        return "Other"
    elif any(keyword in site for keyword in ['dd686868', 'ab333333', 'deloros-lo']):
        return "E-commerce"
    elif any(keyword in site for keyword in ['w9ccc', 'eximg.jp', 'wikia-services', 'blentech']):
        return "Content Hosting"
    elif any(keyword in site for keyword in ['tapjoy', 'mistplay', 'sputnik', 'plausible']):
        return "Advertising/Tracking"
    elif any(keyword in site for keyword in ['mingle2', 'glaz.tv', 'top-gear', 'turkportal']):
        return "Media/Entertainment"
    elif any(keyword in site for keyword in ['browserleaks', 'gso.amocrm', 'squareup']):
        return "Other"
    elif any(keyword in site for keyword in ['event.cityline', 'realestate.com', 'hetapus']):
        return "Other"
    elif any(keyword in site for keyword in ['qoopler', 'cian.ru', 'puma', 'boeschungspflege']):
        return "E-commerce"
    elif any(keyword in site for keyword in ['nszn', 'comeonnigga', 'scbenc', 'backpack']):
        return "Content Hosting"
    elif any(keyword in site for keyword in ['medplusneft', 'meetup', 'termly', 'popin']):
        return "Survey/Research"
    elif any(keyword in site for keyword in ['trace.popin', 'arc.msn', 'appsha1.cointraffic', 'hetapus']):
        return "Advertising/Tracking"
    elif any(keyword in site for keyword in ['browserleaks', 'gso.amocrm', 'squareup']):
        return "Other"
    elif any(keyword in site for keyword in ['homecenter', 'vtexassets', 'toyota-cars', 'lexus', 'mczbf']):
        return "E-commerce"
    elif any(keyword in site for keyword in ['ctengine.io', 'wikia.nocookie', 'samtor', 'mbc-corp', 'malim.kz']):
        return "Content Hosting"
    elif any(keyword in site for keyword in ['prodegemr', 'marquiz-backend', 'disq', 'ugleconcepcion', 'mancing-emosi']):
        return "Survey/Research"
    elif any(keyword in site for keyword in ['serviceutility', 'a47b', 'atb.im-apps', 'disq', 'aggle']):
        return "Advertising/Tracking"
    elif any(keyword in site for keyword in ['browserleaks', 'gso.amocrm', 'squareup']):
        return "Other"
    elif any(keyword in site for keyword in ['crocs', 'sumome', 'sewingauthority']):
        return "E-commerce"
    elif any(keyword in site for keyword in ['uiscom', 'binotel', 'chatra']):
        return "Content Hosting"
    elif any(keyword in site for keyword in ['researchsurv', 'bp4pusat', 'kyiv1', 'mooterduarch']):
        return "Survey/Research"
    elif any(keyword in site for keyword in ['scroll', 'callibri', 'logly', 'tsssaver', 'p16-rc-captcha']):
        return "Advertising/Tracking"
    elif any(keyword in site for keyword in ['browserleaks', 'gso.amocrm', 'squareup']):
        return "Other"
    elif any(keyword in site for keyword in ['sodimac', 'urbanoutfitters', 'consultant']):
        return "E-commerce"
    elif any(keyword in site for keyword in ['gdelu', 'avalon', 'images.sodimac']):
        return "Content Hosting"
    elif any(keyword in site for keyword in ['opinion-edge', 'askmediagroup', 'panel.opinion']):
        return "Survey/Research"
    elif any(keyword in site for keyword in ['bouncex', 'e-planning', 'im-apps']):
        return "Advertising/Tracking"
    elif any(keyword in site for keyword in ['browserleaks', 'gso.amocrm', 'squareup']):
        return "Other"
    elif any(keyword in site for keyword in ['di7', 'akb-mir.kz', 'moscow-mfc', 'evropeysky-tc']):
        return "E-commerce"
    elif any(keyword in site for keyword in ['rdrom', 'presentprofi', 'suo.im', 'orientir-climb', 'inplayer']):
        return "Content Hosting"
    elif any(keyword in site for keyword in ['sasinator', 'linkpop', 'amplitude', 'aminery', 'ltwebstatic']):
        return "Survey/Research"
    elif any(keyword in site for keyword in ['birthbungo', 'chaturbate', '6sc.co', 'mrmserve']):
        return "Advertising/Tracking"
    elif any(keyword in site for keyword in ['browserleaks', 'gso.amocrm', 'squareup']):
        return "Other"
    elif any(keyword in site for keyword in ['kashirskayaplaza', 'druzbahotel', 'trc-gorod', 'mypoints', 'check24', 'beeline']):
        return "E-commerce"
    elif any(keyword in site for keyword in ['smotrim', 'alpstroy', 'vega-factory', 'chicoryapp', 'fout.jp', 'arhidic']):
        return "Content Hosting"
    elif any(keyword in site for keyword in ['browserleaks', 'rc-captcha', 'sport-express', 'coinbase', 'camera.clickz']):
        return "Advertising/Tracking"
    elif any(keyword in site for keyword in ['imp.rfp', 'tssaver', 'forum-macan', 'ugelconcepcion', 'pleaseq']):
        return "Survey/Research"
    elif any(keyword in site for keyword in ['squareup', 'gso.amocrm', 'long-jump']):
        return "Other"
    elif any(keyword in site for keyword in ['ryotsuke', 'ua-rating', 'expedia', 'foursquare', 'dwin1']):
        return "E-commerce"
    elif any(keyword in site for keyword in ['kubgosarhiv', 'neprostitutkiorenburg', 'sovietdigitalart', 'picuki', 'secureserver']):
        return "Content Hosting"
    elif any(keyword in site for keyword in ['wa.me', 'slack', 'i.hh', 'skype', 'qvdt3feo']):
        return "Survey/Research"
    elif any(keyword in site for keyword in ['cameraclickz', 'app.link', 'tzegilo', 'usemessages', 'waterfox']):
        return "Advertising/Tracking"
    elif any(keyword in site for keyword in ['squareup', 'gso.amocrm', 'long-jump']):
        return "Other"
    elif any(keyword in site for keyword in ['ooobss', 'home.cou', 'qtravelw', 'vak345']):
        return "E-commerce"
    elif any(keyword in site for keyword in ['echonverforrinho', 'gde-luchshe', 'ttgoals', '66aa']):
        return "Content Hosting"
    elif any(keyword in site for keyword in ['uma.media', 'mudiview', 'cityline', 'theserials']):
        return "Survey/Research"
    elif any(keyword in site for keyword in ['gso.amocrm', 'squareup', 'long-jump', 'fs.ml', 'sectigo']):
        return "Advertising/Tracking"
    elif any(keyword in site for keyword in ['id.uma', '2qpzj17s', 'sk-e']):
        return "Other"
    else:
        return "Other"

# Добавляем новую категорию к сайтам
top_90_percent_sites['new_category'] = top_90_percent_sites['req_host'].apply(lambda x: categorize_sites_v3(str(x)))

# Подсчет количества сайтов в категории "Other"
other_count_v3 = top_90_percent_sites[top_90_percent_sites['new_category'] == 'Other'].shape[0]

# Общее количество сайтов
total_count_v3 = top_90_percent_sites.shape[0]

# Вычисляем процент сайтов в категории "Other"
percentage_other_v3 = (other_count_v3 / total_count_v3) * 100

other_count_v3, percentage_other_v3




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  top_90_percent_sites['new_category'] = top_90_percent_sites['req_host'].apply(lambda x: categorize_sites_v3(str(x)))


(106, 2.392236515459264)

5. Предобрабатываем для категоризации с помощью API и для сравнения с другими источниками

Этот код используется для подготовки данных к категоризации с помощью API. Он очищает и стандартизирует домены, преобразуя их в формат с поддоменом www. и генерирует регулярные выражения для точного поиска доменов. Эти шаги необходимы для точного сопоставления с категориями через API, позволяя эффективно классифицировать сайты по их доменам.

In [6]:
# Функция для очистки и стандартизации домена
def clean_host(host):
    if re.match(r'^\d+\.\d+\.\d+\.\d+$', host):  # если это IP
        return host, host
    # Преобразуем поддомены к виду www. и убираем всё перед точками
    base_host = re.sub(r'^([a-zA-Z0-9-]+\.)*([a-zA-Z0-9-]+\.[a-z]{2,})$', r'www.\2', host)
    return host, base_host

# Функция для создания регулярного выражения
def generate_regex(base_host):
    # Создание регулярного выражения для домена
    escaped_host = re.escape(base_host.replace('www.', ''))
    regex = f"^([a-zA-Z0-9-]+\\.)*(www\\.)?{escaped_host}$"
    return regex

# Приводим значения в столбце req_host к строковому типу
top_90_percent_sites = top_90_percent_sites.copy()  # Создаем копию DataFrame
top_90_percent_sites['req_host'] = top_90_percent_sites['req_host'].astype(str)

# Создаем новые столбцы с обработанными доменами и регулярными выражениями
top_90_percent_sites[['req_host', 'req_host_base']] = top_90_percent_sites['req_host'].apply(lambda x: clean_host(x)).apply(pd.Series)

# Применяем loc для безопасного изменения фрагмента
top_90_percent_sites.loc[:, 'req_host_combined'] = top_90_percent_sites['req_host_base']  # Копируем значения req_host_base в req_host_combined
top_90_percent_sites.loc[:, 'url_regex'] = top_90_percent_sites['req_host_base'].apply(generate_regex)


In [7]:
top_90 = top_90_percent_sites[['req_host_base','Category','new_category']]

In [8]:
top_90.rename(columns={'req_host_base': 'url', 'Category': 'cats base', 'new_category': 'cats gpt'}, inplace=True)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  top_90.rename(columns={'req_host_base': 'url', 'Category': 'cats base', 'new_category': 'cats gpt'}, inplace=True)


In [9]:
output_path = r'E:\\прокси работа\\категоризация\\топ 90 за месяц.xlsx'
top_90_percent_sites.to_excel(output_path, index=False)

In [10]:
dataset_for_api = top_90[['url']]

6. Сохраняем файлы

In [11]:
dataset_for_api.to_csv('new_url_task_2.csv', index=False)

In [12]:
top_90.to_csv('category_result_gpt.csv', index=False)

In [13]:
top_90_percent_sites.to_csv('top_90_month.csv', index=False)