#### [Original paper](https://arxiv.org/pdf/2106.01782.pdf)
    
### Getting started

Let's look at the ready-made features and make a first submission.

1. [Data description](#Описание-данных)
2. [Feature description](#Описание-признаков)
3. [Our first model](#Наша-первая-модель)
4. [Submission](#Посылка)

### First steps on the path to data science

5. [Cross-validation](#Кросс-валидация)
6. [What is in the JSON files?](#Что-есть-в-json-файлах?)
7. [Feature engineering](#Feature-engineering)

### Imports

In [3]:
import os
import json
import pandas as pd
import datetime
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, ShuffleSplit, cross_val_score, GridSearchCV
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.metrics import make_scorer, roc_auc_score, accuracy_score

%matplotlib inline

In [2]:
SEED = 10801
sns.set_style(style="whitegrid")
plt.rcParams["figure.figsize"] = 12, 8
warnings.filterwarnings("ignore")

## <left>Data description

Files:

- `sample_submission.csv`: an example submission file
- `train_raw_data.jsonl`, `test_raw_data.jsonl`: the "raw" data
- `train_data.csv`, `test_data.csv`: features prepared by the authors
- `train_targets.csv`: results of the training games

## <left>Feature description

A set of simple features describing the individual players and the teams as a whole.

In [4]:
#PATH_TO_DATA = "/kaggle/input/copy-of-23-24-ml/"

df_train_features = pd.read_csv(os.path.join("../data/train_data.csv"),
                                    index_col="match_id_hash")
df_train_targets = pd.read_csv(os.path.join("../data/train_targets.csv"),
                                   index_col="match_id_hash")

In [4]:
df_train_features.shape

(31698, 245)

In [5]:
df_train_features.head()

match_id_hash,game_time,game_mode,lobby_type,objectives_len,chat_len,r1_hero_id,r1_kills,r1_deaths,r1_assists,r1_denies,...,d5_stuns,d5_creeps_stacked,d5_camps_stacked,d5_rune_pickups,d5_firstblood_claimed,d5_teamfight_participation,d5_towers_killed,d5_roshans_killed,d5_obs_placed,d5_sen_placed
b9c57c450ce74a2af79c9ce96fac144d,658,4,0,3,10,15,7,2,0,7,...,0.0,0,0,0,0,0.0,0,0,0,0
6db558535151ea18ca70a6892197db41,21,23,0,0,0,101,0,0,0,0,...,0.0,0,0,0,0,0.0,0,0,0,0
19c39fe2af2b547e48708ca005c6ae74,160,22,7,0,0,57,0,0,0,1,...,0.0,0,0,0,0,0.0,0,0,0,0
c96d629dc0c39f0c616d1949938a6ba6,1016,22,0,1,0,119,0,3,3,5,...,8.264696,0,0,3,0,0.25,0,0,3,0
156c88bff4e9c4668b0f53df3d870f1b,582,22,7,2,2,12,3,1,2,9,...,15.762911,3,1,0,1,0.5,0,0,3,0


We have about 32 thousand observations, each identified by a unique `match_id_hash` (the hashed match ID), and 245 features. `game_time` is the moment at which the snapshot was taken. In other words, it is not the duration of the match but, for example, its midpoint. As a result, we can build a model that predicts each team's probability of winning while the match is still in progress (well suited to bookmakers).

We are interested in the `radiant_win` field (Radiant is the name of one of the teams; the other is Dire). The remaining columns in the targets table essentially come from the "future" and exist only for the training data, so we can only look at them, not use them as features.

In [6]:
df_train_targets.head()

match_id_hash,game_time,radiant_win,duration,time_remaining,next_roshan_team
b9c57c450ce74a2af79c9ce96fac144d,658,True,1154,496,
6db558535151ea18ca70a6892197db41,21,True,1503,1482,Radiant
19c39fe2af2b547e48708ca005c6ae74,160,False,2063,1903,
c96d629dc0c39f0c616d1949938a6ba6,1016,True,2147,1131,Radiant
156c88bff4e9c4668b0f53df3d870f1b,582,False,1927,1345,Dire
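
In the five rows shown above, `time_remaining` equals `duration - game_time`, which illustrates why these columns count as information from the "future". A small check of that relationship, using only those five rows copied by hand (whether it holds for the whole file is an assumption that the same comparison on `df_train_targets` would confirm):

```python
import pandas as pd

# The five target rows displayed above, copied by hand.
targets_sample = pd.DataFrame(
    {
        "game_time": [658, 21, 160, 1016, 582],
        "duration": [1154, 1503, 2063, 2147, 1927],
        "time_remaining": [496, 1482, 1903, 1131, 1345],
    }
)

# time_remaining should be what is left of the match at snapshot time.
reconstructed = targets_sample["duration"] - targets_sample["game_time"]
matches = (reconstructed == targets_sample["time_remaining"]).all()
print(matches)
```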


## <left>CatBoost

I decided to lead with my strongest card and started straight away with CatBoost. First, a function for reading the matches.

In [7]:
try:
    import ujson as json
except ModuleNotFoundError:
    import json
    print("Consider installing ujson to work with JSON objects faster")

try:
    from tqdm.notebook import tqdm
except ModuleNotFoundError:
    # Fallback with the same name and call signature as tqdm
    tqdm = lambda x, **kwargs: x
    print("Consider installing tqdm to track progress")


def read_matches(matches_file, total_matches=31698, n_matches_to_read=None):
    """
    Argument
    --------
    matches_file: JSON Lines file with the raw data

    Result
    ------
    Yields the record for each match
    """

    if n_matches_to_read is None:
        n_matches_to_read = total_matches

    c = 0
    with open(matches_file) as fin:
        for line in tqdm(fin, total=total_matches):
            if c >= n_matches_to_read:
                break
            else:
                c += 1
                yield json.loads(line)
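
Because `read_matches` is a generator, a single record can be inspected without loading the whole file. The sketch below shows the same line-by-line pattern on a tiny temporary JSON Lines file. The helper `iter_jsonl` and the two fake records are illustrative stand-ins, not part of the competition data.

```python
import json
import os
import tempfile


def iter_jsonl(path):
    """Yield one parsed JSON object per line (hypothetical helper)."""
    with open(path) as fin:
        for line in fin:
            yield json.loads(line)


# Two made-up records standing in for real matches.
fake_matches = [
    {"match_id_hash": "aaa", "game_time": 10, "players": []},
    {"match_id_hash": "bbb", "game_time": 20, "players": []},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "fake_raw_data.jsonl")
    with open(path, "w") as fout:
        for match in fake_matches:
            fout.write(json.dumps(match) + "\n")

    # Only the first line is parsed; the second record is never read.
    matches_iter = iter_jsonl(path)
    first_match = next(matches_iter)
    matches_iter.close()  # closes the underlying file handle

print(sorted(first_match.keys()))
```

The same idea applies to the real file: calling `next(read_matches(...))` returns one match, whose keys show which raw fields are available for feature engineering.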

A function for adding new features to the dataset. I tried various features, but kept only the most successful set here.

In [8]:
def add_new_features_cb(df_features, matches_file):
    """
    Arguments
    ---------
    df_features: table with the features
    matches_file: JSON Lines file with the raw data

    Result
    ------
    Adds the new features to the table in place
    """

    for match in read_matches(matches_file):
        match_id_hash = match['match_id_hash']
    
        # Count the towers destroyed by each team
        radiant_tower_kills = 0
        dire_tower_kills = 0
        for objective in match["objectives"]:
            if objective["type"] == "CHAT_MESSAGE_TOWER_KILL":
                if objective["team"] == 2:
                    radiant_tower_kills += 1
                if objective["team"] == 3:
                    dire_tower_kills += 1
    
        df_features.loc[match_id_hash, "radiant_tower_kills"] = radiant_tower_kills
        df_features.loc[match_id_hash, "dire_tower_kills"] = dire_tower_kills
        df_features.loc[match_id_hash, "diff_tower_kills"] = radiant_tower_kills - dire_tower_kills
        
        radiant_kills = 0
        radiant_deaths = 0
        radiant_assists = 0
        radiant_gold = 0
        radiant_xp = 0
        radiant_hero_levels = []

        dire_kills = 0
        dire_deaths = 0
        dire_assists = 0
        dire_gold = 0
        dire_xp = 0
        dire_hero_levels = []

        for player in match['players']:
            if player['player_slot'] < 128:
                radiant_kills += player['kills']
                radiant_deaths += player['deaths']
                radiant_assists += player['assists']
                radiant_gold += player['gold']
                radiant_xp += player['xp']
                radiant_hero_levels.append(player['level'])
            else:
                dire_kills += player['kills']
                dire_deaths += player['deaths']
                dire_assists += player['assists']
                dire_gold += player['gold']
                dire_xp += player['xp']
                dire_hero_levels.append(player['level'])

        radiant_avg_gold = radiant_gold / 5
        radiant_avg_xp = radiant_xp / 5
        radiant_avg_level = sum(radiant_hero_levels) / len(radiant_hero_levels)

        dire_avg_gold = dire_gold / 5
        dire_avg_xp = dire_xp / 5
        dire_avg_level = sum(dire_hero_levels) / len(dire_hero_levels)

        df_features.loc[match_id_hash, "radiant_kills"] = radiant_kills
        df_features.loc[match_id_hash, "radiant_deaths"] = radiant_deaths
        df_features.loc[match_id_hash, "radiant_assists"] = radiant_assists
        df_features.loc[match_id_hash, "radiant_avg_gold"] = radiant_avg_gold
        df_features.loc[match_id_hash, "radiant_avg_xp"] = radiant_avg_xp
        df_features.loc[match_id_hash, "radiant_avg_level"] = radiant_avg_level

        df_features.loc[match_id_hash, "dire_kills"] = dire_kills
        df_features.loc[match_id_hash, "dire_deaths"] = dire_deaths
        df_features.loc[match_id_hash, "dire_assists"] = dire_assists
        df_features.loc[match_id_hash, "dire_avg_gold"] = dire_avg_gold
        df_features.loc[match_id_hash, "dire_avg_xp"] = dire_avg_xp
        df_features.loc[match_id_hash, "dire_avg_level"] = dire_avg_level

        df_features.loc[match_id_hash, "diff_kills"] = radiant_kills - dire_kills
        df_features.loc[match_id_hash, "diff_deaths"] = radiant_deaths - dire_deaths
        df_features.loc[match_id_hash, "diff_assists"] = radiant_assists - dire_assists
        df_features.loc[match_id_hash, "diff_avg_gold"] = radiant_avg_gold - dire_avg_gold
        df_features.loc[match_id_hash, "diff_avg_xp"] = radiant_avg_xp - dire_avg_xp
        df_features.loc[match_id_hash, "diff_avg_level"] = radiant_avg_level - dire_avg_level
        
        df_features.loc[match_id_hash, "radiant_max_health"] = max([player['max_health'] for player in match['players'] if player['player_slot'] < 128])
        df_features.loc[match_id_hash, "dire_max_health"] = max([player['max_health'] for player in match['players'] if player['player_slot'] >= 128])
        df_features.loc[match_id_hash, "diff_max_health"] = df_features.loc[match_id_hash, "radiant_max_health"] - df_features.loc[match_id_hash, "dire_max_health"]
        
        df_features.loc[match_id_hash, "radiant_max_mana"] = max([player['max_mana'] for player in match['players'] if player['player_slot'] < 128])
        df_features.loc[match_id_hash, "dire_max_mana"] = max([player['max_mana'] for player in match['players'] if player['player_slot'] >= 128])
        df_features.loc[match_id_hash, "diff_max_mana"] = df_features.loc[match_id_hash, "radiant_max_mana"] - df_features.loc[match_id_hash, "dire_max_mana"]
        
        df_features.loc[match_id_hash, "radiant_camps_stacked"] = sum([player['camps_stacked'] for player in match['players'] if player['player_slot'] < 128])
        df_features.loc[match_id_hash, "dire_camps_stacked"] = sum([player['camps_stacked'] for player in match['players'] if player['player_slot'] >= 128])
        df_features.loc[match_id_hash, "diff_camps_stacked"] = df_features.loc[match_id_hash, "radiant_camps_stacked"] - df_features.loc[match_id_hash, "dire_camps_stacked"]
        
        df_features.loc[match_id_hash, "radiant_rune_pickups"] = sum([player['rune_pickups'] for player in match['players'] if player['player_slot'] < 128])
        df_features.loc[match_id_hash, "dire_rune_pickups"] = sum([player['rune_pickups'] for player in match['players'] if player['player_slot'] >= 128])
        df_features.loc[match_id_hash, "diff_rune_pickups"] = df_features.loc[match_id_hash, "radiant_rune_pickups"] - df_features.loc[match_id_hash, "dire_rune_pickups"]
        
        df_features.loc[match_id_hash, "radiant_roshans_killed"] = sum([player['roshans_killed'] for player in match['players'] if player['player_slot'] < 128])
        df_features.loc[match_id_hash, "dire_roshans_killed"] = sum([player['roshans_killed'] for player in match['players'] if player['player_slot'] >= 128])
        df_features.loc[match_id_hash, "diff_roshans_killed"] = df_features.loc[match_id_hash, "radiant_roshans_killed"] - df_features.loc[match_id_hash, "dire_roshans_killed"]
    
        # ... (/¯◡ ‿ ◡)/¯☆*:・ﾟ add more features here ...
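
Writing one cell at a time with `df_features.loc[...] = value` is correct but slow on about 32 thousand matches. One possible alternative, sketched below, is to collect one dict per match and build the table in a single step. This assumes the raw records have the fields used above (`players`, `player_slot`, `kills`, `gold`). The function `team_features` and the fake match are hypothetical, and only team totals for two fields are reproduced, so the column names differ slightly from the ones in the function above.

```python
import pandas as pd


def team_features(match):
    """Return a flat dict of team-level features for one match (sketch)."""
    radiant = [p for p in match["players"] if p["player_slot"] < 128]
    dire = [p for p in match["players"] if p["player_slot"] >= 128]

    row = {"match_id_hash": match["match_id_hash"]}
    for key in ("kills", "gold"):
        r_total = sum(p[key] for p in radiant)
        d_total = sum(p[key] for p in dire)
        row[f"radiant_{key}"] = r_total
        row[f"dire_{key}"] = d_total
        row[f"diff_{key}"] = r_total - d_total
    return row


# A made-up match with one player per team, just to exercise the code.
fake_match = {
    "match_id_hash": "abc",
    "players": [
        {"player_slot": 0, "kills": 3, "gold": 500},
        {"player_slot": 128, "kills": 1, "gold": 700},
    ],
}

# Build all rows first, then create the DataFrame in a single step.
rows = [team_features(m) for m in [fake_match]]
new_features = pd.DataFrame(rows).set_index("match_id_hash")

# Joining on the index attaches the new columns to an existing table.
base = pd.DataFrame(
    {"game_time": [658]},
    index=pd.Index(["abc"], name="match_id_hash"),
)
extended = base.join(new_features)
print(extended)
```

A side benefit of this layout is that the same function can serve both the training and the test file, since nothing in it depends on the number of matches.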

In [9]:
# Copy the feature table
df_train_features_extended_cb = df_train_features.copy()

# Add the new features
add_new_features_cb(df_train_features_extended_cb,
                 os.path.join("../data/train_raw_data.jsonl"))


In [10]:
df_train_features_extended_cb.head()

match_id_hash,game_time,game_mode,lobby_type,objectives_len,chat_len,r1_hero_id,r1_kills,r1_deaths,r1_assists,r1_denies,...,diff_max_mana,radiant_camps_stacked,dire_camps_stacked,diff_camps_stacked,radiant_rune_pickups,dire_rune_pickups,diff_rune_pickups,radiant_roshans_killed,dire_roshans_killed,diff_roshans_killed
b9c57c450ce74a2af79c9ce96fac144d,658,4,0,3,10,15,7,2,0,7,...,-120.0,1.0,2.0,-1.0,13.0,14.0,-1.0,0.0,0.0,0.0
6db558535151ea18ca70a6892197db41,21,23,0,0,0,101,0,0,0,0,...,60.0,0.0,0.0,0.0,3.0,1.0,2.0,0.0,0.0,0.0
19c39fe2af2b547e48708ca005c6ae74,160,22,7,0,0,57,0,0,0,1,...,-24.0,0.0,0.0,0.0,7.0,0.0,7.0,0.0,0.0,0.0
c96d629dc0c39f0c616d1949938a6ba6,1016,22,0,1,0,119,0,3,3,5,...,-24.0,0.0,0.0,0.0,32.0,12.0,20.0,0.0,0.0,0.0
156c88bff4e9c4668b0f53df3d870f1b,582,22,7,2,2,12,3,1,2,9,...,-24.0,5.0,2.0,3.0,9.0,12.0,-3.0,0.0,0.0,0.0


Define the features and the target, then split them into training and hold-out parts.

In [13]:
X = df_train_features_extended_cb.values
y = df_train_targets["radiant_win"].values.astype("int8")

In [14]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                      test_size=0.2,
                                                      random_state=SEED)

Define the model and make predictions. I tried different parameters; these worked best.

In [18]:
cb_model = CatBoostClassifier(
    random_state=SEED,
    iterations=2000,
    depth=8,
    learning_rate=0.01,
    loss_function="Logloss",
    l2_leaf_reg=2,
    score_function="L2",
    verbose=False,
)

cb_model.fit(X_train, y_train)

cb_model.save_model("cb_model.cbm")

y_pred = cb_model.predict_proba(X_valid)[:, 1]
valid_score = roc_auc_score(y_valid, y_pred)
print("ROC-AUC score on the hold-out set:", valid_score)
valid_accuracy = accuracy_score(y_valid, y_pred > 0.5)
print("Accuracy score (p > 0.5) on the hold-out set:", valid_accuracy)

ROC-AUC score on the hold-out set: 0.8228216769089247
Accuracy score (p > 0.5) on the hold-out set: 0.7391167192429022


In [20]:
%%time
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=SEED)
cb_model = CatBoostClassifier(random_state=SEED, verbose = 0)
cv_scores_cb = cross_val_score(cb_model, X, y, cv=cv, scoring="roc_auc")
cv_scores_cb

CPU times: total: 1min 34s
Wall time: 2min 47s


array([0.81492698, 0.81814127, 0.81557672, 0.81856899, 0.81918275])

In [21]:
print(f"Mean ROC-AUC on cross-validation: {cv_scores_cb.mean()}")

Mean ROC-AUC on cross-validation: 0.8172793413421788
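
The mean alone hides how much the splits disagree. Using the five scores printed above, the spread can be summarized as well. This is only a small illustration with the numbers copied by hand.

```python
import numpy as np

# Scores copied from the cross-validation output above.
cv_scores = np.array([0.81492698, 0.81814127, 0.81557672, 0.81856899, 0.81918275])

mean_auc = cv_scores.mean()
std_auc = cv_scores.std(ddof=1)  # sample standard deviation across splits
print(f"ROC-AUC: {mean_auc:.4f} +/- {std_auc:.4f}")
```

Note that this cross-validation run uses a `CatBoostClassifier` with default parameters, not the tuned model fitted earlier, so the two scores are not directly comparable.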


A function for reading the test matches.

In [22]:
def read_matches_test(matches_file, total_matches=7977, n_matches_to_read=None):
    """
    Argument
    --------
    matches_file: JSON Lines file with the raw data

    Result
    ------
    Yields the record for each match
    """

    if n_matches_to_read is None:
        n_matches_to_read = total_matches

    c = 0
    with open(matches_file) as fin:
        for line in tqdm(fin, total=total_matches):
            if c >= n_matches_to_read:
                break
            else:
                c += 1
                yield json.loads(line)

A function for adding the new features to the test data.

In [23]:
def add_new_features_cb_test(df_features, matches_file):
    """
    Arguments
    ---------
    df_features: table with the features
    matches_file: JSON Lines file with the raw data

    Result
    ------
    Adds the new features to the table in place
    """

    for match in read_matches_test(matches_file):
        match_id_hash = match['match_id_hash']
    
        # Count the towers destroyed by each team
        radiant_tower_kills = 0
        dire_tower_kills = 0
        for objective in match["objectives"]:
            if objective["type"] == "CHAT_MESSAGE_TOWER_KILL":
                if objective["team"] == 2:
                    radiant_tower_kills += 1
                if objective["team"] == 3:
                    dire_tower_kills += 1
    
        df_features.loc[match_id_hash, "radiant_tower_kills"] = radiant_tower_kills
        df_features.loc[match_id_hash, "dire_tower_kills"] = dire_tower_kills
        df_features.loc[match_id_hash, "diff_tower_kills"] = radiant_tower_kills - dire_tower_kills
        
        radiant_kills = 0
        radiant_deaths = 0
        radiant_assists = 0
        radiant_gold = 0
        radiant_xp = 0
        radiant_hero_levels = []

        dire_kills = 0
        dire_deaths = 0
        dire_assists = 0
        dire_gold = 0
        dire_xp = 0
        dire_hero_levels = []

        for player in match['players']:
            if player['player_slot'] < 128:
                radiant_kills += player['kills']
                radiant_deaths += player['deaths']
                radiant_assists += player['assists']
                radiant_gold += player['gold']
                radiant_xp += player['xp']
                radiant_hero_levels.append(player['level'])
            else:
                dire_kills += player['kills']
                dire_deaths += player['deaths']
                dire_assists += player['assists']
                dire_gold += player['gold']
                dire_xp += player['xp']
                dire_hero_levels.append(player['level'])

        radiant_avg_gold = radiant_gold / 5
        radiant_avg_xp = radiant_xp / 5
        radiant_avg_level = sum(radiant_hero_levels) / len(radiant_hero_levels)

        dire_avg_gold = dire_gold / 5
        dire_avg_xp = dire_xp / 5
        dire_avg_level = sum(dire_hero_levels) / len(dire_hero_levels)

        df_features.loc[match_id_hash, "radiant_kills"] = radiant_kills
        df_features.loc[match_id_hash, "radiant_deaths"] = radiant_deaths
        df_features.loc[match_id_hash, "radiant_assists"] = radiant_assists
        df_features.loc[match_id_hash, "radiant_avg_gold"] = radiant_avg_gold
        df_features.loc[match_id_hash, "radiant_avg_xp"] = radiant_avg_xp
        df_features.loc[match_id_hash, "radiant_avg_level"] = radiant_avg_level

        df_features.loc[match_id_hash, "dire_kills"] = dire_kills
        df_features.loc[match_id_hash, "dire_deaths"] = dire_deaths
        df_features.loc[match_id_hash, "dire_assists"] = dire_assists
        df_features.loc[match_id_hash, "dire_avg_gold"] = dire_avg_gold
        df_features.loc[match_id_hash, "dire_avg_xp"] = dire_avg_xp
        df_features.loc[match_id_hash, "dire_avg_level"] = dire_avg_level

        df_features.loc[match_id_hash, "diff_kills"] = radiant_kills - dire_kills
        df_features.loc[match_id_hash, "diff_deaths"] = radiant_deaths - dire_deaths
        df_features.loc[match_id_hash, "diff_assists"] = radiant_assists - dire_assists
        df_features.loc[match_id_hash, "diff_avg_gold"] = radiant_avg_gold - dire_avg_gold
        df_features.loc[match_id_hash, "diff_avg_xp"] = radiant_avg_xp - dire_avg_xp
        df_features.loc[match_id_hash, "diff_avg_level"] = radiant_avg_level - dire_avg_level

        df_features.loc[match_id_hash, "radiant_max_health"] = max([player['max_health'] for player in match['players'] if player['player_slot'] < 128])
        df_features.loc[match_id_hash, "dire_max_health"] = max([player['max_health'] for player in match['players'] if player['player_slot'] >= 128])
        df_features.loc[match_id_hash, "diff_max_health"] = df_features.loc[match_id_hash, "radiant_max_health"] - df_features.loc[match_id_hash, "dire_max_health"]
        
        df_features.loc[match_id_hash, "radiant_max_mana"] = max([player['max_mana'] for player in match['players'] if player['player_slot'] < 128])
        df_features.loc[match_id_hash, "dire_max_mana"] = max([player['max_mana'] for player in match['players'] if player['player_slot'] >= 128])
        df_features.loc[match_id_hash, "diff_max_mana"] = df_features.loc[match_id_hash, "radiant_max_mana"] - df_features.loc[match_id_hash, "dire_max_mana"]
        
        df_features.loc[match_id_hash, "radiant_camps_stacked"] = sum([player['camps_stacked'] for player in match['players'] if player['player_slot'] < 128])
        df_features.loc[match_id_hash, "dire_camps_stacked"] = sum([player['camps_stacked'] for player in match['players'] if player['player_slot'] >= 128])
        df_features.loc[match_id_hash, "diff_camps_stacked"] = df_features.loc[match_id_hash, "radiant_camps_stacked"] - df_features.loc[match_id_hash, "dire_camps_stacked"]
        
        df_features.loc[match_id_hash, "radiant_rune_pickups"] = sum([player['rune_pickups'] for player in match['players'] if player['player_slot'] < 128])
        df_features.loc[match_id_hash, "dire_rune_pickups"] = sum([player['rune_pickups'] for player in match['players'] if player['player_slot'] >= 128])
        df_features.loc[match_id_hash, "diff_rune_pickups"] = df_features.loc[match_id_hash, "radiant_rune_pickups"] - df_features.loc[match_id_hash, "dire_rune_pickups"]
        
        df_features.loc[match_id_hash, "radiant_roshans_killed"] = sum([player['roshans_killed'] for player in match['players'] if player['player_slot'] < 128])
        df_features.loc[match_id_hash, "dire_roshans_killed"] = sum([player['roshans_killed'] for player in match['players'] if player['player_slot'] >= 128])
        df_features.loc[match_id_hash, "diff_roshans_killed"] = df_features.loc[match_id_hash, "radiant_roshans_killed"] - df_features.loc[match_id_hash, "dire_roshans_killed"]

        # ... (/¯◡ ‿ ◡)/¯☆*:・ﾟ add more features here ...

In [24]:
df_test_features = pd.read_csv(os.path.join("../data/test_data.csv"),
                                   index_col="match_id_hash")
df_test_features_extended_cb = df_test_features.copy()
add_new_features_cb_test(df_test_features_extended_cb, os.path.join("../data/test_raw_data.jsonl"))
df_test_features_extended_cb.head()


match_id_hash,game_time,game_mode,lobby_type,objectives_len,chat_len,r1_hero_id,r1_kills,r1_deaths,r1_assists,r1_denies,...,diff_max_mana,radiant_camps_stacked,dire_camps_stacked,diff_camps_stacked,radiant_rune_pickups,dire_rune_pickups,diff_rune_pickups,radiant_roshans_killed,dire_roshans_killed,diff_roshans_killed
a400b8f29dece5f4d266f49f1ae2e98a,155,22,7,1,11,11,0,0,0,0,...,-216.0,0.0,0.0,0.0,2.0,7.0,-5.0,0.0,0.0,0.0
46a0ddce8f7ed2a8d9bd5edcbb925682,576,22,7,1,4,14,1,0,3,1,...,384.0,0.0,2.0,-2.0,13.0,11.0,2.0,0.0,0.0,0.0
b1b35ff97723d9b7ade1c9c3cf48f770,453,22,7,1,3,42,0,1,1,0,...,310.0,0.0,1.0,-1.0,11.0,8.0,3.0,0.0,0.0,0.0
ab3cc6ccac661a1385e73a2e9f21313a,721,4,0,2,1,30,2,2,1,3,...,24.0,0.0,0.0,0.0,13.0,15.0,-2.0,0.0,0.0,0.0
54aaab1cb8cc5df3c253641618673266,752,22,7,1,0,8,2,0,2,8,...,262.001,0.0,2.0,-2.0,17.0,16.0,1.0,0.0,0.0,0.0


Predict the results for the test data.

In [25]:
X_test = df_test_features_extended_cb.values
cb_model.load_model("cb_model.cbm")
y_test_pred = cb_model.predict_proba(X_test)[:, 1]

df_submission = pd.DataFrame({"radiant_win_prob": y_test_pred},
                                 index=df_test_features.index)
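
Passing `.values` to the model drops the column names, so the prediction is only meaningful if the test table has the same columns in the same order as the training table. A minimal sketch of such a check follows, with two toy frames standing in for `df_train_features_extended_cb` and `df_test_features_extended_cb`. The toy values are made up.

```python
import pandas as pd

# Toy stand-ins for the extended train and test feature tables.
train_df = pd.DataFrame({"game_time": [658], "diff_kills": [2], "diff_avg_gold": [-40.0]})
test_df = pd.DataFrame({"diff_kills": [1], "game_time": [155], "diff_avg_gold": [12.5]})

same_order = list(train_df.columns) == list(test_df.columns)
same_set = set(train_df.columns) == set(test_df.columns)
print(same_order, same_set)

# If only the order differs, selecting by the training columns fixes it.
if same_set and not same_order:
    test_df = test_df[train_df.columns]

aligned = list(test_df.columns) == list(train_df.columns)
print(aligned)
```

In this notebook both tables go through the same sequence of column assignments, so the order should already agree; the check is cheap insurance if the two feature functions ever drift apart.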

In [26]:
submission_filename = "catboost_submission_{}.csv".format(
    datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S"))
df_submission.to_csv(submission_filename)
print("Submission file saved as: {}".format(submission_filename))

Submission file saved as: submission_2024-04-04_11-13-38.csv


In the end, this set of features and parameters gave me my best score.
Besides that, I also tried the two models shown below.

## XGBoost

In [39]:
%%time
xgb_model = XGBClassifier(random_state=SEED, verbosity=0)
xgb_model.fit(X_train, y_train)

xgb_model.save_model("xgb_model.json")

y_pred = xgb_model.predict_proba(X_valid)[:, 1]
valid_score = roc_auc_score(y_valid, y_pred)
print("ROC-AUC score on the hold-out set:", valid_score)
valid_accuracy = accuracy_score(y_valid, y_pred > 0.5)
print("Accuracy score (p > 0.5) on the hold-out set:", valid_accuracy)

ROC-AUC score on the hold-out set: 0.8040106106387462
Accuracy score (p > 0.5) on the hold-out set: 0.7089905362776026
CPU times: total: 6.97 s
Wall time: 1.52 s


In [47]:
%%time
from sklearn.metrics import make_scorer, roc_auc_score
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7, 10],
    'learning_rate': [0.1, 0.01, 0.001]
}

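# Note: make_scorer(roc_auc_score) scores the hard 0/1 labels from predict(),
# not probabilities, which typically gives a lower value than a
# probability-based ROC-AUC. scoring="roc_auc" would score probabilities.
scoring = {'AUC': make_scorer(roc_auc_score)}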

grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid, cv=3, scoring=scoring, refit='AUC', n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best parameters found:", grid_search.best_params_)
print("Best AUC score found:", grid_search.best_score_)

best_model = grid_search.best_estimator_

Best parameters found: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200}
Best AUC score found: 0.7273157924492414
CPU times: total: 9.47 s
Wall time: 5min 31s
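
The cross-validated score of about 0.73 printed above is lower than the hold-out ROC-AUC because `make_scorer(roc_auc_score)` passes the hard labels from `predict` to the metric, whereas the string scorer `"roc_auc"` uses predicted probabilities or decision scores. The sketch below shows the gap on synthetic data with a logistic regression. The dataset and the model are stand-ins, not the ones used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (stand-in for the match features).
X_demo, y_demo = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=5,
    n_clusters_per_class=1,
    random_state=0,
)
model = LogisticRegression(max_iter=1000)

# Scorer built from the metric alone: it scores the 0/1 output of predict().
auc_on_labels = cross_val_score(
    model, X_demo, y_demo, cv=3, scoring=make_scorer(roc_auc_score)
).mean()

# Built-in scorer: it scores probabilities or decision values.
auc_on_scores = cross_val_score(
    model, X_demo, y_demo, cv=3, scoring="roc_auc"
).mean()

print(f"AUC on hard labels:   {auc_on_labels:.3f}")
print(f"AUC on probabilities: {auc_on_scores:.3f}")
```

Since the competition metric is computed on probabilities, selecting hyperparameters with `scoring="roc_auc"` would likely be the closer match, though the best parameters it finds may differ from those reported above.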


In [48]:
y_pred = best_model.predict_proba(X_valid)[:, 1]
valid_score = roc_auc_score(y_valid, y_pred)
print("ROC-AUC score on the hold-out set:", valid_score)
valid_accuracy = accuracy_score(y_valid, y_pred > 0.5)
print("Accuracy score (p > 0.5) on the hold-out set:", valid_accuracy)

ROC-AUC score on the hold-out set: 0.8155208739244031
Accuracy score (p > 0.5) on the hold-out set: 0.7287066246056783


In [49]:
X_test = df_test_features_extended_cb.values
y_test_pred = best_model.predict_proba(X_test)[:, 1]

df_submission = pd.DataFrame({"radiant_win_prob": y_test_pred}, index=df_test_features.index)

In [50]:
submission_filename = "xgboost_submission_{}.csv".format(
    datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S"))
df_submission.to_csv(submission_filename)
print("Submission file saved as: {}".format(submission_filename))

Submission file saved as: submission_2024-04-03_13-37-44.csv


## LGBM

In [None]:
%%time
lgb_model = LGBMClassifier(random_state=SEED, verbose=0)

lgb_model.fit(X_train, y_train)
lgb_model.booster_.save_model("lgb_model.txt")

y_pred = lgb_model.predict_proba(X_valid)[:, 1]
valid_score = roc_auc_score(y_valid, y_pred)
print("ROC-AUC score on the hold-out set:", valid_score)
valid_accuracy = accuracy_score(y_valid, y_pred > 0.5)
print("Accuracy score (p > 0.5) on the hold-out set:", valid_accuracy)

In [None]:
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7, 10],
    'learning_rate': [0.1, 0.01, 0.001]
}

# Note: this scorer evaluates hard labels from predict(), not probabilities
scoring = {'AUC': make_scorer(roc_auc_score)}

grid_search = GridSearchCV(estimator=lgb_model, param_grid=param_grid, cv=3, scoring=scoring, refit='AUC', n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best parameters found:", grid_search.best_params_)
print("Best AUC score found:", grid_search.best_score_)

lgbm_best_model = grid_search.best_estimator_

y_pred = lgbm_best_model.predict_proba(X_valid)[:, 1]
valid_score = roc_auc_score(y_valid, y_pred)
print("ROC-AUC score on the hold-out set using the best model:", valid_score)
valid_accuracy = accuracy_score(y_valid, y_pred > 0.5)
print("Accuracy score (p > 0.5) on the hold-out set using the best model:", valid_accuracy)

In [None]:
X_test = df_test_features_extended_cb.values
# Use the tuned LightGBM model here (best_model is the XGBoost one)
y_test_pred = lgbm_best_model.predict_proba(X_test)[:, 1]

df_submission = pd.DataFrame({"radiant_win_prob": y_test_pred}, index=df_test_features.index)

In [None]:
submission_filename = "lgbm_submission_{}.csv".format(
    datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S"))
df_submission.to_csv(submission_filename)
print("Submission file saved as: {}".format(submission_filename))