<h1 align='center'> Applying a Machine Learning Model <br>to Predict the Outcome <br>of a Chess Match </h1>
<p align='left'> <strong>Project team:</strong>
<br> Парамонов Всеволод
<br> Сидоров Иван
<br> Чубов Артем

In [146]:
import requests
from bs4 import BeautifulSoup
import re
import string

from IPython.display import display, display_html, clear_output
import tqdm

import numpy as np
import pandas as pd
pd.options.display.max_columns=999
from datetime import datetime, timedelta

import chess.pgn
import zipfile

import seaborn as sns
import matplotlib.pyplot as plt

<h2> Downloading game data from lichess.org for March 2023</h2>

In [49]:
############################################
### Convert the PGN file to a DataFrame  ###
############################################

NUM_GAMES = 3*10**4

rows = []
game_counter = 0
# Raw string so the backslashes in the Windows path are not treated as escapes
with open(r"W:\lichess_db_standard_rated_2023-03.pgn\Игры.pgn") as pgn_file:
    for i in tqdm.tqdm(range(NUM_GAMES)):
        game = chess.pgn.read_game(pgn_file)
        if game is None:  # end of file reached before NUM_GAMES games were read
            break
        game_counter += 1
        if game_counter % 50 == 0:  # keep only every 50th game
            row = {}
            row['headers'] = game.headers.__dict__
            rows.append(row)
games = pd.DataFrame(rows)

100%|███████████████████████████████████████████████████████████████████████████| 30000/30000 [01:44<00:00, 286.79it/s]
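<p> Only the headers of each sampled game are used later in this notebook. As a hedged illustration of that idea, the sketch below scans a PGN text stream with a regular expression and keeps the tag pairs of every k-th game. It is a simplified stand-in, not the python-chess parser used above: <code>iter_headers</code> and the sample games are hypothetical, escaped quotes inside tag values are not handled, and every game is assumed to have movetext after its headers. </p>

```python
import io
import re

# Matches one PGN tag pair such as: [White "RRRai"]
TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]$')


def iter_headers(handle, every=1):
    """Yield the header dict of every `every`-th game in a PGN text stream."""
    headers, count, seen_moves = {}, 0, False
    for raw_line in handle:
        line = raw_line.strip()
        match = TAG_RE.match(line)
        if match:
            if seen_moves:  # a tag after movetext means a new game has started
                count += 1
                if count % every == 0:
                    yield headers
                headers, seen_moves = {}, False
            headers[match.group(1)] = match.group(2)
        elif line:
            seen_moves = True
    if headers:  # flush the last game in the stream
        count += 1
        if count % every == 0:
            yield headers


# Tiny made-up sample, only to show the call shape.
SAMPLE_PGN = """[Event "Rated Blitz game"]
[White "alice"]
[Black "bob"]
[Result "1-0"]
[TimeControl "180+0"]

1. e4 e5 1-0

[Event "Rated Bullet game"]
[White "carol"]
[Black "dave"]
[Result "0-1"]
[TimeControl "60+0"]

1. d4 d5 0-1
"""

sampled = list(iter_headers(io.StringIO(SAMPLE_PGN), every=2))
print(sampled[0]["White"], sampled[0]["TimeControl"])
```

<p> With <code>every=2</code> only the second sample game is kept. On the real dump the same generator could be fed an open file handle instead of <code>io.StringIO</code>. </p>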


In [65]:
games.time_control.unique()

array(['180+0', '180+2', '60+0', '300+0', '600+0', '300+3', '600+5',
       '15+0', '1800+0', '120+1', '300+2', '30+0', '900+15', '120+0',
       '420+0', '300+1', '300+5', '1200+0', '180+5', '-', '900+3', '60+1',
       '300+4', '600+9', '600+2', '600+3', '600+20', '420+2'],
      dtype=object)

<h2> Preparing the DataFrame for the subsequent parsing of opponent features </h2>

In [68]:
def str_to_date(time_str):
    # Parse an "HH:MM:SS" string into a datetime.time (avoids shadowing the built-in str)
    return datetime.strptime(time_str, "%H:%M:%S").time()

def date_to_day(date):
    return date.day

def time_to_sec(time):
    # Seconds elapsed since midnight
    return time.hour * 3600 + time.minute * 60 + time.second

###########################################
### Convert the date to DateTime format ###
###########################################

games['utc_date']=games['headers'].apply(lambda x: x.get("_others",{}).get("UTCDate","")).astype("datetime64[s]")
games['utc_date'] = games['utc_date'].apply(date_to_day)

games['utc_time'] = games['headers'].apply(lambda x: x.get("_others",{}).get("UTCTime",""))
games['utc_time'] = games['utc_time'].apply(str_to_date)
games['utc_time'] = games['utc_time'].apply(time_to_sec)


###############################################################
### Extract the players with the white and the black pieces ###
###############################################################

games['white']=games['headers'].apply(lambda x: x.get("_tag_roster",{}).get("White","")).astype(str)
games['black']=games['headers'].apply(lambda x: x.get("_tag_roster",{}).get("Black","")).astype(str)


#########################################################
### Extract the mode in which the game was played     ###
#########################################################

games['event'] = games['headers'].apply(lambda x: x.get("_tag_roster",{}).get("Event","").replace('Rated ', '').replace(" game", "")).astype(str)


########################################################################
### Extract the time control and split it into base time and increment ###
########################################################################

games['time_control']=games['headers'].apply(lambda x: x.get("_others",{}).get("TimeControl","")).astype(str)

def splitting(x):
    try:
        time, add = x.split('+')
        return [time, add]
    except ValueError:  # e.g. '-' (no clock): fall back to zeros
        return [0, 0]

games['time'] = [int(splitting(x)[0]) for x in games.time_control.values]
games['add'] = [int(splitting(x)[1]) for x in games.time_control.values]

#games.drop('time_control', inplace=True, axis=1)


#####################################################
### Extract the match result and convert to float ###
#####################################################

games['white_score']=games['headers'].apply(lambda x: x.get("_tag_roster",{}).get("Result","").split("-")[0].replace("1/2","0.5"))
# Convert to numbers so that later comparisons with 0.0 / 0.5 work; unfinished games ("*") become NaN
games['white_score'] = pd.to_numeric(games['white_score'], errors='coerce')
games_stats = games.drop(columns=["headers"]) ### Drop the original headers column


##########################################################
### Keep only the games played in Blitz mode           ###
##  and drop duplicates                                 ##
##########################################################

games_stats = games_stats[games_stats['event'] == "Blitz"]
games_stats = games_stats.drop_duplicates(subset=['white', 'black', 'utc_date', 'utc_time'], keep='first')


games_stats = games_stats.reset_index(drop=True)
games_stats.time_control.unique()

array(['180+0', '180+2', '300+0', '300+3', '300+2', '420+0', '300+1',
       '180+5', '300+4'], dtype=object)
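<p> The Blitz time controls listed above all have an estimated duration between 180 and 479 seconds when computed as base time plus 40 times the increment, which is, to my understanding, how Lichess assigns speed categories. The sketch below illustrates that rule. The helper names are hypothetical, and unlike <code>splitting</code> it returns <code>None</code> for <code>'-'</code> instead of zeros. </p>

```python
def parse_time_control(tc):
    """Split 'base+increment' (in seconds) into ints; None for '-' or malformed input."""
    base, sep, inc = tc.partition("+")
    if not sep or not base.isdigit() or not inc.isdigit():
        return None
    return int(base), int(inc)


def speed_category(base, inc):
    """Classify by estimated duration = base + 40 * increment (assumed Lichess rule)."""
    estimated = base + 40 * inc
    if estimated < 30:
        return "UltraBullet"
    if estimated < 180:
        return "Bullet"
    if estimated < 480:
        return "Blitz"
    if estimated < 1500:
        return "Rapid"
    return "Classical"


for tc in ["180+0", "300+4", "420+2", "-"]:
    parsed = parse_time_control(tc)
    print(tc, speed_category(*parsed) if parsed else "no clock")
```

<p> Such a check could serve as a sanity test that the <code>event</code> filter and the <code>time_control</code> column agree. </p>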

<h2> Script for parsing additional opponent features </h2>

In [69]:
from datetime import datetime, timedelta

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:101.0) Gecko/20100101 Firefox/101.0'
}


#####################################################
### Function for parsing additional features:     ###
#####################################################
####                                             ####
###  Stats for match 1 (before the current one)   ###
###                                               ###
###  Stats for match 2 (before the current one)   ###
###                                               ###
###  Stats for match 3 (before the current one)   ###
####                                             ####
#####################################################


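def additional_features (utc_date,utc_time, nickname):

    # The games come from the March 2023 Lichess dump
    DESIRED_MONTH = 3
    DESIRED_YEAR = 2023

    # utc_date is the day of the month and utc_time is the number of seconds
    # since midnight, so the time is added as a timedelta rather than passed as an hour
    desired_datetime_utc = datetime(
    DESIRED_YEAR,
    DESIRED_MONTH,
    int(utc_date)
    ) + timedelta(seconds=int(utc_time))
    

    # Format the time as "HH:MM:SS"
    time_str = desired_datetime_utc.strftime("%H:%M:%S")

    # Format the date as "YYYY-MM-DD"
    date_str = desired_datetime_utc.strftime("%Y-%m-%d")

    # Combine the date and the time into a single string
    formatted_datetime = date_str + "T" + time_str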


    ### PARSING THE MOST RECENT GAMES ###

    #############################
    ###   Not finished yet    ###
    #############################

    i = 1

    ### while
    url_3 = f'https://lichess.org/@/{nickname}/all?page={i}'

    response = requests.get(url_3, headers=headers)
    gett = BeautifulSoup(response.content, 'html.parser')

    games = gett.find_all('div', {'class': 'angle-content'})[0]
    games = games.find_all('div', {'class': 'search__result'})[0]
    games = games.find_all('article', {'class': 'game-row paginated'})[0]


    ### for
    gm = games.find_all('div', {'class': 'game-row__infos'})[0]
    vers = gm.find_all('div', {'class': 'versus'})[0]
    players = vers.find_all('a', {'class': 'user-link ulpt'})
    white_player, black_player = players

    color = ''

    # players holds bs4 Tag objects, so comparing a Tag with the nickname string is always False.
    # Assumption: the link href ends with the nickname, as in /@/<nickname>.
    white_name = white_player.get('href', '').rstrip('/').split('/')[-1]
    if white_name.lower() == nickname.lower():
        color = 'white'
    else:
        color = 'black'
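<p> The function above is unfinished, so the following is only a hedged sketch of the selection step it appears to be heading towards: given a player's game history, pick the three most recent games played strictly before the game in question. The record layout <code>(timestamp, score)</code>, the helper names and the sample history are assumptions made for illustration. </p>

```python
from datetime import datetime, timedelta


def game_timestamp(year, month, day, seconds_since_midnight):
    """Build a datetime from a calendar day and a seconds-since-midnight offset."""
    return datetime(year, month, day) + timedelta(seconds=seconds_since_midnight)


def last_games_before(history, cutoff, n=3):
    """Return up to n most recent (timestamp, score) records strictly before cutoff."""
    earlier = [record for record in history if record[0] < cutoff]
    earlier.sort(key=lambda record: record[0], reverse=True)
    return earlier[:n]


# Made-up history of one player: (time the game was played, score of that player)
history = [
    (datetime(2023, 3, 1, 0, 5), 1.0),
    (datetime(2023, 3, 1, 0, 12), 0.0),
    (datetime(2023, 3, 1, 0, 30), 0.5),
    (datetime(2023, 2, 28, 23, 50), 1.0),
]

cutoff = game_timestamp(2023, 3, 1, 1094)  # 1094 s after midnight is 00:18:14
recent = last_games_before(history, cutoff)
print([score for _, score in recent])  # [0.0, 1.0, 1.0]
```

<p> Filtering strictly before the cutoff matters here: features built from games played after the target game would leak the future into the model. </p>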



<h2> Script for parsing features from lichess.org </h2>

In [106]:
##############################################
### Function for parsing the main features ###
##############################################
####                                      ####
### Ratings across the different modes     ###
###                                        ###
### All-time results (wins, draws,         ###
### losses)                                ###
###                                        ###
### Current win and loss streaks           ###
###                                        ###
### Rating deviation (Glicko)              ###
####                                      ####
##############################################

def get_stats(d):
    pl, match = d

    url = 'https://lichess.org/@/' + pl + '/all'


    ratings = {'Username': pl,
               'Mode': match,
               'UltraBullet': 0,
               'Bullet': 0,
               'Blitz': 0,
               'Rapid': 0,
               'Classical': 0,
               'Correspondence': 0,
               'Crazyhouse': 0,
               'Chess960': 0,
               'King of the Hill': 0,
               'Three-check': 0,
               'Antichess': 0,
               'Atomic': 0,
               'Horde': 0,
               'Racing Kings': 0,
               'Puzzles': 0,
               'Puzzle Storm': 0,
               'Puzzle Racer': 0,
               'Puzzle Streak': 0}
    
    
    games = {'games': 0,
           'rated': 0,
           'wins': 0,
           'losses': 0,
           'draws': 0,
           'bookmarks':0,
           'imported games': 0}
 

    streaks = {'win_streak': 0,
               'lose_streak': 0,
               'deviation': 0
               }    
    

    ####################################################
    ### Parsing the sidebar of the player's page     ###
    ####################################################

    response = requests.get(url, headers=headers)
    tre = BeautifulSoup(response.content, 'html.parser')
    statistic = tre.find_all('div', {'class': 'side sub-ratings'})

    
    ### Check whether the player's profile is closed ###
    
    try:
        statistic = statistic[0].find_all('span')
    except IndexError:
        
        ### If the profile is closed, return a DataFrame filled with NaN values ###
        
        ratings = {x: np.nan for x in ratings}
        games = {x: np.nan for x in games}
        streaks = {x: np.nan for x in streaks}
        df1 = pd.DataFrame([list(ratings.values())], columns=list(ratings.keys()))
        df2 = pd.DataFrame([list(games.values())], columns=list(games.keys()))
        df3 = pd.DataFrame([list(streaks.values())], columns=list(streaks.keys()))

        df = pd.concat([df1, df2, df3], axis=1)
        return df
    
    ### Store the ratings and the puzzle scores in the dictionary ###

    for i in statistic:
        if 'h3' in str(i):
            if i.find_all('strong')[0].text != '?':
                val = i.find_all('strong')[0].text.translate(str.maketrans('', '', string.punctuation))
                ratings[i.find_all('h3')[0].text] = int(val)



    ##################################################
    ### Parsing the player's all-time game stats   ###
    ##          (wins, losses, draws)               ##
    ##################################################

    gms = tre.find_all('div', {'class': 'number-menu number-menu--tabs menu-box-pop'})
    gms = gms[0].find_all('a')
    
    for j in gms:
        res = j.text.translate(str.maketrans('', '', string.punctuation))
        res = re.split(r'(\d+)', res)[1:]
        points, rezult = res

        ### Handle singular vs. plural labels ###

        if rezult in ['win', 'game', 'loss', 'draw', 'bookmark']:
            if rezult == 'loss':
                rezult += 'es'
            else: rezult += 's'        
        games[rezult] = int(points)


    ###############################################
    ### Parsing the current win and loss streaks ###
    ###############################################

    match = match.lower()
    # The perf page key for this mode is camel-cased
    if match in ('ultra bullet', 'ultrabullet'):
        match = 'ultraBullet'

    url_2 = f'https://lichess.org/@/{pl}/perf/{match}'

    response = requests.get(url_2, headers=headers)
    dop = BeautifulSoup(response.content, 'html.parser')
    
    ### Check whether the player has any current streaks ###

    try:
        streak = dop.find_all('section', {'class': 'resultStreak split'})[0]
        streaks['win_streak'] = streak.find_all('strong')[1].text
        streaks['lose_streak'] = streak.find_all('strong')[2].text
    except IndexError:

        ### Otherwise keep 0 ###

        pass

    ################################################
    ### Parsing the rating deviation (Glicko)    ###
    ################################################
    
    try:
        dev = dop.find_all('section', {'class': 'glicko'})[0]
        a = dev.find_all('strong', {'title': 'Lower value means the rating is more stable. Above 110, the rating is considered provisional. To be included in the rankings, this value should be below 75 (standard chess) or 65 (variants).'})[0].text
        streaks['deviation'] = float(a)
    except (IndexError, ValueError):
        # No usable deviation on the page: store NaN
        streaks['deviation'] = np.nan

#    if type(dev) == 'bs4.element.Tag':
#        a = dev.find_all('strong')[1].text
#    else:
#        a = np.nan
    
#    streaks['deviation'] = float(a)

    ############################################################
    ### Applying the additional-feature parsing function     ###
    ############################################################

    ##### //////////////////////////////////////////////// #####
    #####----------------  In progress  -------------------#####
    ##### //////////////////////////////////////////////// #####


    ############################################
    ### Convert the dictionaries to a DataFrame ###
    ############################################

    df1 = pd.DataFrame([list(ratings.values())], columns=list(ratings.keys()))
    df2 = pd.DataFrame([list(games.values())], columns=list(games.keys()))
    df3 = pd.DataFrame([list(streaks.values())], columns=list(streaks.keys()))

    df = pd.concat([df1, df2, df3], axis=1)
    return df


In [108]:
get_stats(['SidorovIvan', 'Blitz'])

Unnamed: 0,Username,Mode,UltraBullet,Bullet,Blitz,Rapid,Classical,Correspondence,Crazyhouse,Chess960,King of the Hill,Three-check,Antichess,Atomic,Horde,Racing Kings,Puzzles,Puzzle Storm,Puzzle Racer,Puzzle Streak,games,rated,wins,losses,draws,bookmarks,imported games,win_streak,lose_streak,deviation
0,SidorovIvan,Blitz,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1569,0,0,10,6,0,4,1,1,0,0,0,0,500.0
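<p> The count-parsing step inside <code>get_stats</code> can be isolated into a pure function and checked without any network access. The sketch below is one hedged way to do that. It assumes labels of the form "number, then text" (for example <code>1,402 games</code>), and the function name is hypothetical. Unlike the list in <code>get_stats</code>, it also maps the singular <code>imported game</code> to its plural form, which would avoid a separate <code>imported game</code> column. </p>

```python
import re

# Singular labels mapped to the plural keys used in the `games` dictionary.
PLURALS = {
    "game": "games",
    "win": "wins",
    "loss": "losses",
    "draw": "draws",
    "bookmark": "bookmarks",
    "imported game": "imported games",
}

# A number (optionally with thousands separators) followed by a text label.
LABEL_RE = re.compile(r"^\s*(\d[\d,]*)\s*(\D.*?)\s*$")


def parse_count_label(text):
    """Turn a label such as '1,402 games' into ('games', 1402)."""
    match = LABEL_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognised label: {text!r}")
    count = int(match.group(1).replace(",", ""))
    label = match.group(2).lower()
    return PLURALS.get(label, label), count


for sample in ["1,402 games", "1 loss", "1 imported game", "33 bookmarks"]:
    print(parse_count_label(sample))
```

<p> Adopting this mapping in the notebook would also require adjusting the later <code>drop</code> call, which expects the <code>imported game</code> columns to exist. </p>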


<h2> Splitting the DataFrame into players with the white and the black pieces </h2>

In [72]:
games_stats_white = games_stats[['white', 'event']].copy()
games_stats_black = games_stats[['black', 'event']].copy()
space = "\xa0" * 30

df1_styler = games_stats_white.head(10).style.set_table_attributes("style='display:inline'").set_caption('games_stats_white')
df2_styler = games_stats_black.head(10).style.set_table_attributes("style='display:inline'").set_caption('games_stats_black')

display_html(df1_styler._repr_html_() + space + df2_styler._repr_html_(), raw=True)

Unnamed: 0,white,event
0,RRRai,Blitz
1,Anti_toxin,Blitz
2,JDaniel11,Blitz
3,changram,Blitz
4,meowmeow69,Blitz
5,RTS61,Blitz
6,kikyourass,Blitz
7,PXV,Blitz
8,Yeisonbuitragos,Blitz
9,awhitevan,Blitz

Unnamed: 0,black,event
0,Only1Royce,Blitz
1,rewindmike,Blitz
2,Rkar2EDS73,Blitz
3,ShellyCooper,Blitz
4,marxb8,Blitz
5,ShivD,Blitz
6,riccardo_schacchi,Blitz
7,normalplayerr,Blitz
8,MagzyOfNorway,Blitz
9,StriveForTheLight,Blitz


<h3> Applying the script to the players with the white pieces </h3>

In [109]:
sample = pd.DataFrame(columns=['Username','Mode','UltraBullet', 'Bullet', 'Blitz', 'Rapid', 'Classical',
       'Correspondence', 'Crazyhouse', 'Chess960', 'Antichess', 'Horde',
       'Puzzles', 'Puzzle Storm', 'Puzzle Racer', 'Puzzle Streak', 'games',
       'rated', 'wins', 'losses', 'draws', 'bookmarks', 'win_streak',
       'lose_streak'])

whites = sample.copy()

####################################################################################
### Apply the feature-gathering function to the game data for the white players  ###
####################################################################################

for i in tqdm.tqdm(games_stats_white.values):
    getting = get_stats(i)
    whites = pd.concat([whites, getting])

whites = whites.reset_index(drop=True)


100%|████████████████████████████████████████████████████████████████████████████████| 276/276 [02:10<00:00,  2.12it/s]


In [120]:
get_stats(['Anti_toxin', 'Blitz'])
whites

Unnamed: 0,Username,Mode,UltraBullet,Bullet,Blitz,Rapid,Classical,Correspondence,Crazyhouse,Chess960,Antichess,Horde,Puzzles,Puzzle Storm,Puzzle Racer,Puzzle Streak,games,rated,wins,losses,draws,bookmarks,win_streak,lose_streak,King of the Hill,Three-check,Atomic,Racing Kings,imported games,deviation,playing,imported game
0,RRRai,Blitz,0,0,0,0,0,0,0,0,0,0,0,0,0,0,154,144,60,87,7,0,0,0,0.0,0.0,0.0,0.0,23.0,,,
1,Anti_toxin,Blitz,0,0,1469,1796,0,0,0,0,0,0,2137,0,0,0,1402,1274,704,643,55,33,1,11,0.0,0.0,0.0,0.0,9.0,47.00,,
2,JDaniel11,Blitz,1390,2105,2158,2266,2040,0,0,0,0,0,2328,36,54,0,7624,7504,4461,2734,429,2,3,7,0.0,0.0,0.0,0.0,0.0,48.46,,
3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,meowmeow69,Blitz,0,1065,1175,1171,1413,0,0,0,0,0,1632,0,0,0,586,553,269,288,29,0,11,1,0.0,0.0,0.0,0.0,0.0,50.27,1.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
271,nadiagonaldovispo,Blitz,0,1698,1820,1797,1829,0,0,0,0,0,1883,0,0,0,10971,10837,5430,5232,309,8,0,0,0.0,0.0,0.0,0.0,0.0,,,
272,StockportGraphics,Blitz,0,1149,1134,1063,0,0,0,0,0,0,0,0,0,0,1390,1390,664,686,40,0,10,1,0.0,0.0,0.0,0.0,0.0,46.44,,
273,Simpey,Blitz,0,1683,1143,1677,0,0,0,0,0,0,1218,0,0,0,221,221,104,111,6,0,7,4,0.0,0.0,0.0,0.0,0.0,70.89,,
274,kishor007,Blitz,0,1600,1712,1271,0,0,0,0,0,0,0,0,0,0,22260,22246,10151,11293,816,0,14,0,0.0,0.0,0.0,0.0,0.0,45.02,,


<h3> Applying the script to the players with the black pieces </h3>

In [119]:
blacks = sample.copy()

####################################################################################
### Apply the feature-gathering function to the game data for the black players  ###
####################################################################################

for i in tqdm.tqdm(games_stats_black.values):
    getting = get_stats(i)
    blacks = pd.concat([blacks, getting])

blacks = blacks.reset_index(drop=True)
blacks

100%|████████████████████████████████████████████████████████████████████████████████| 276/276 [02:18<00:00,  1.99it/s]


Unnamed: 0,Username,Mode,UltraBullet,Bullet,Blitz,Rapid,Classical,Correspondence,Crazyhouse,Chess960,Antichess,Horde,Puzzles,Puzzle Storm,Puzzle Racer,Puzzle Streak,games,rated,wins,losses,draws,bookmarks,win_streak,lose_streak,King of the Hill,Three-check,Atomic,Racing Kings,imported games,deviation,playing,imported game
0,Only1Royce,Blitz,0,0,974,1030,0,0,0,0,0,0,1416,0,0,0,2676,2524,1197,1369,110,3,8,1,0.0,0.0,0.0,0.0,0.0,45.37,,
1,rewindmike,Blitz,0,0,1436,1790,0,0,0,0,0,0,1476,0,0,0,159,159,72,79,8,0,1,6,0.0,0.0,0.0,0.0,12.0,69.46,,
2,Rkar2EDS73,Blitz,0,0,1876,0,0,0,0,0,0,0,0,0,0,0,3625,3625,1683,1800,142,0,1,12,0.0,0.0,0.0,0.0,0.0,46.01,,
3,ShellyCooper,Blitz,0,1903,1920,0,0,0,0,0,0,0,0,0,0,0,51,51,30,21,0,0,5,5,0.0,0.0,0.0,0.0,0.0,106.37,,
4,marxb8,Blitz,0,817,978,1006,0,0,0,0,0,0,1346,0,0,0,1687,1680,768,896,23,0,13,1,0.0,0.0,0.0,0.0,19.0,45.13,1.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
271,FabbeB,Blitz,0,1611,1776,0,0,0,0,0,0,0,2119,0,0,0,15886,15873,7574,7578,734,10,12,1,0.0,0.0,0.0,0.0,0.0,45.02,,
272,TimotheeChalamet,Blitz,0,1290,1135,1192,0,0,0,0,0,0,1503,0,0,0,1142,1104,532,564,46,0,0,0,0.0,0.0,0.0,0.0,0.0,,,
273,Zupi44,Blitz,0,1121,1140,0,0,0,0,0,0,0,0,0,0,0,3349,3349,1558,1565,226,0,11,3,0.0,0.0,0.0,0.0,0.0,51.76,,
274,dondude,Blitz,0,1243,1548,1911,0,0,0,1343,0,0,2290,0,0,0,1191,1189,594,555,42,0,0,0,0.0,0.0,0.0,0.0,0.0,,,


<h3> Merging the resulting DataFrames into one </h3>

In [121]:
whites_blacks = pd.merge(whites, blacks, left_index=True, right_index=True)
whites_blacks

Unnamed: 0,Username_x,Mode_x,UltraBullet_x,Bullet_x,Blitz_x,Rapid_x,Classical_x,Correspondence_x,Crazyhouse_x,Chess960_x,Antichess_x,Horde_x,Puzzles_x,Puzzle Storm_x,Puzzle Racer_x,Puzzle Streak_x,games_x,rated_x,wins_x,losses_x,draws_x,bookmarks_x,win_streak_x,lose_streak_x,King of the Hill_x,Three-check_x,Atomic_x,Racing Kings_x,imported games_x,deviation_x,playing_x,imported game_x,Username_y,Mode_y,UltraBullet_y,Bullet_y,Blitz_y,Rapid_y,Classical_y,Correspondence_y,Crazyhouse_y,Chess960_y,Antichess_y,Horde_y,Puzzles_y,Puzzle Storm_y,Puzzle Racer_y,Puzzle Streak_y,games_y,rated_y,wins_y,losses_y,draws_y,bookmarks_y,win_streak_y,lose_streak_y,King of the Hill_y,Three-check_y,Atomic_y,Racing Kings_y,imported games_y,deviation_y,playing_y,imported game_y
0,RRRai,Blitz,0,0,0,0,0,0,0,0,0,0,0,0,0,0,154,144,60,87,7,0,0,0,0.0,0.0,0.0,0.0,23.0,,,,Only1Royce,Blitz,0,0,974,1030,0,0,0,0,0,0,1416,0,0,0,2676,2524,1197,1369,110,3,8,1,0.0,0.0,0.0,0.0,0.0,45.37,,
1,Anti_toxin,Blitz,0,0,1469,1796,0,0,0,0,0,0,2137,0,0,0,1402,1274,704,643,55,33,1,11,0.0,0.0,0.0,0.0,9.0,47.00,,,rewindmike,Blitz,0,0,1436,1790,0,0,0,0,0,0,1476,0,0,0,159,159,72,79,8,0,1,6,0.0,0.0,0.0,0.0,12.0,69.46,,
2,JDaniel11,Blitz,1390,2105,2158,2266,2040,0,0,0,0,0,2328,36,54,0,7624,7504,4461,2734,429,2,3,7,0.0,0.0,0.0,0.0,0.0,48.46,,,Rkar2EDS73,Blitz,0,0,1876,0,0,0,0,0,0,0,0,0,0,0,3625,3625,1683,1800,142,0,1,12,0.0,0.0,0.0,0.0,0.0,46.01,,
3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,ShellyCooper,Blitz,0,1903,1920,0,0,0,0,0,0,0,0,0,0,0,51,51,30,21,0,0,5,5,0.0,0.0,0.0,0.0,0.0,106.37,,
4,meowmeow69,Blitz,0,1065,1175,1171,1413,0,0,0,0,0,1632,0,0,0,586,553,269,288,29,0,11,1,0.0,0.0,0.0,0.0,0.0,50.27,1.0,,marxb8,Blitz,0,817,978,1006,0,0,0,0,0,0,1346,0,0,0,1687,1680,768,896,23,0,13,1,0.0,0.0,0.0,0.0,19.0,45.13,1.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
271,nadiagonaldovispo,Blitz,0,1698,1820,1797,1829,0,0,0,0,0,1883,0,0,0,10971,10837,5430,5232,309,8,0,0,0.0,0.0,0.0,0.0,0.0,,,,FabbeB,Blitz,0,1611,1776,0,0,0,0,0,0,0,2119,0,0,0,15886,15873,7574,7578,734,10,12,1,0.0,0.0,0.0,0.0,0.0,45.02,,
272,StockportGraphics,Blitz,0,1149,1134,1063,0,0,0,0,0,0,0,0,0,0,1390,1390,664,686,40,0,10,1,0.0,0.0,0.0,0.0,0.0,46.44,,,TimotheeChalamet,Blitz,0,1290,1135,1192,0,0,0,0,0,0,1503,0,0,0,1142,1104,532,564,46,0,0,0,0.0,0.0,0.0,0.0,0.0,,,
273,Simpey,Blitz,0,1683,1143,1677,0,0,0,0,0,0,1218,0,0,0,221,221,104,111,6,0,7,4,0.0,0.0,0.0,0.0,0.0,70.89,,,Zupi44,Blitz,0,1121,1140,0,0,0,0,0,0,0,0,0,0,0,3349,3349,1558,1565,226,0,11,3,0.0,0.0,0.0,0.0,0.0,51.76,,
274,kishor007,Blitz,0,1600,1712,1271,0,0,0,0,0,0,0,0,0,0,22260,22246,10151,11293,816,0,14,0,0.0,0.0,0.0,0.0,0.0,45.02,,,dondude,Blitz,0,1243,1548,1911,0,0,0,1343,0,0,2290,0,0,0,1191,1189,594,555,42,0,0,0,0.0,0.0,0.0,0.0,0.0,,,


<h3> Merging the resulting DataFrame with the original data </h3>

In [122]:
data = pd.merge(games_stats, whites_blacks, left_index=True, right_index=True)
data = data[~(data.Username_x.isna()) & ~(data.Username_y.isna())].reset_index(drop=True)
data

Unnamed: 0,utc_date,utc_time,white,black,event,time_control,white_score,time,add,Username_x,Mode_x,UltraBullet_x,Bullet_x,Blitz_x,Rapid_x,Classical_x,Correspondence_x,Crazyhouse_x,Chess960_x,Antichess_x,Horde_x,Puzzles_x,Puzzle Storm_x,Puzzle Racer_x,Puzzle Streak_x,games_x,rated_x,wins_x,losses_x,draws_x,bookmarks_x,win_streak_x,lose_streak_x,King of the Hill_x,Three-check_x,Atomic_x,Racing Kings_x,imported games_x,deviation_x,playing_x,imported game_x,Username_y,Mode_y,UltraBullet_y,Bullet_y,Blitz_y,Rapid_y,Classical_y,Correspondence_y,Crazyhouse_y,Chess960_y,Antichess_y,Horde_y,Puzzles_y,Puzzle Storm_y,Puzzle Racer_y,Puzzle Streak_y,games_y,rated_y,wins_y,losses_y,draws_y,bookmarks_y,win_streak_y,lose_streak_y,King of the Hill_y,Three-check_y,Atomic_y,Racing Kings_y,imported games_y,deviation_y,playing_y,imported game_y
0,1,18,RRRai,Only1Royce,Blitz,180+0,0,180,0,RRRai,Blitz,0,0,0,0,0,0,0,0,0,0,0,0,0,0,154,144,60,87,7,0,0,0,0.0,0.0,0.0,0.0,23.0,,,,Only1Royce,Blitz,0,0,974,1030,0,0,0,0,0,0,1416,0,0,0,2676,2524,1197,1369,110,3,8,1,0.0,0.0,0.0,0.0,0.0,45.37,,
1,1,19,Anti_toxin,rewindmike,Blitz,180+2,1,180,2,Anti_toxin,Blitz,0,0,1469,1796,0,0,0,0,0,0,2137,0,0,0,1402,1274,704,643,55,33,1,11,0.0,0.0,0.0,0.0,9.0,47.00,,,rewindmike,Blitz,0,0,1436,1790,0,0,0,0,0,0,1476,0,0,0,159,159,72,79,8,0,1,6,0.0,0.0,0.0,0.0,12.0,69.46,,
2,1,2,JDaniel11,Rkar2EDS73,Blitz,180+2,1,180,2,JDaniel11,Blitz,1390,2105,2158,2266,2040,0,0,0,0,0,2328,36,54,0,7624,7504,4461,2734,429,2,3,7,0.0,0.0,0.0,0.0,0.0,48.46,,,Rkar2EDS73,Blitz,0,0,1876,0,0,0,0,0,0,0,0,0,0,0,3625,3625,1683,1800,142,0,1,12,0.0,0.0,0.0,0.0,0.0,46.01,,
3,1,12,meowmeow69,marxb8,Blitz,300+0,0,300,0,meowmeow69,Blitz,0,1065,1175,1171,1413,0,0,0,0,0,1632,0,0,0,586,553,269,288,29,0,11,1,0.0,0.0,0.0,0.0,0.0,50.27,1.0,,marxb8,Blitz,0,817,978,1006,0,0,0,0,0,0,1346,0,0,0,1687,1680,768,896,23,0,13,1,0.0,0.0,0.0,0.0,19.0,45.13,1.0,
4,1,7,RTS61,ShivD,Blitz,300+0,1,300,0,RTS61,Blitz,0,0,1741,0,0,0,0,0,0,0,1942,0,0,0,2638,2638,1294,1260,84,0,2,14,0.0,0.0,0.0,0.0,0.0,45.99,,,ShivD,Blitz,0,1337,1720,1595,1612,0,0,0,0,0,1608,0,0,0,8282,8095,4003,3938,341,0,10,2,0.0,0.0,0.0,0.0,0.0,45.00,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
205,1,1094,nadiagonaldovispo,FabbeB,Blitz,180+0,0,180,0,nadiagonaldovispo,Blitz,0,1698,1820,1797,1829,0,0,0,0,0,1883,0,0,0,10971,10837,5430,5232,309,8,0,0,0.0,0.0,0.0,0.0,0.0,,,,FabbeB,Blitz,0,1611,1776,0,0,0,0,0,0,0,2119,0,0,0,15886,15873,7574,7578,734,10,12,1,0.0,0.0,0.0,0.0,0.0,45.02,,
206,1,1108,StockportGraphics,TimotheeChalamet,Blitz,300+0,1,300,0,StockportGraphics,Blitz,0,1149,1134,1063,0,0,0,0,0,0,0,0,0,0,1390,1390,664,686,40,0,10,1,0.0,0.0,0.0,0.0,0.0,46.44,,,TimotheeChalamet,Blitz,0,1290,1135,1192,0,0,0,0,0,0,1503,0,0,0,1142,1104,532,564,46,0,0,0,0.0,0.0,0.0,0.0,0.0,,,
207,1,1091,Simpey,Zupi44,Blitz,180+0,1,180,0,Simpey,Blitz,0,1683,1143,1677,0,0,0,0,0,0,1218,0,0,0,221,221,104,111,6,0,7,4,0.0,0.0,0.0,0.0,0.0,70.89,,,Zupi44,Blitz,0,1121,1140,0,0,0,0,0,0,0,0,0,0,0,3349,3349,1558,1565,226,0,11,3,0.0,0.0,0.0,0.0,0.0,51.76,,
208,1,1084,kishor007,dondude,Blitz,300+0,0,300,0,kishor007,Blitz,0,1600,1712,1271,0,0,0,0,0,0,0,0,0,0,22260,22246,10151,11293,816,0,14,0,0.0,0.0,0.0,0.0,0.0,45.02,,,dondude,Blitz,0,1243,1548,1911,0,0,0,1343,0,0,2290,0,0,0,1191,1189,594,555,42,0,0,0,0.0,0.0,0.0,0.0,0.0,,,


<h3 style='color:red'> ! Reading the CSV file ! </h3>
<p> Only one member of our team managed to unpack the PGN file, so we decided that he would run the cells above on his computer and share the resulting data as a CSV file, making it easier for the whole team to work with. </p>

In [59]:
# data = pd.read_csv('/Users/vsevolod/Downloads/Telegram Desktop/chess_org.csv', index_col=0)

<h3> Removing uninformative features </h3>

In [131]:
###################################
### Dropping unneeded features  ###
###################################

data.drop(['imported games_x', 'playing_x', 'imported game_x', 'imported games_y', 'playing_y', 'imported game_y'], axis=1, inplace=True)
data.drop(['white', 'black', 'Username_x', 'Username_y', 'Mode_x', 'Mode_y', 'event'], axis=1, inplace=True)
data.drop(columns = 'time_control', inplace = True)

In [133]:
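# Fill missing deviations with the column mean; assigning back avoids chained in-place modification
data['deviation_y'] = data['deviation_y'].fillna(data['deviation_y'].mean())
data['deviation_x'] = data['deviation_x'].fillna(data['deviation_x'].mean())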
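<p> The cell above fills missing deviations with the mean of the whole column. A possible variant, shown here only as a hedged sketch and not what the notebook does, is to compute the mean on the training part alone and reuse it for the test part, so that no test information enters the fill value. The helper names and the numbers are made up for illustration. </p>

```python
import math


def nan_mean(values):
    """Mean of the non-NaN entries; NaN if every entry is missing."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present) if present else math.nan


def fill_nan(values, fill_value):
    """Return a copy of values with every NaN replaced by fill_value."""
    return [fill_value if math.isnan(v) else v for v in values]


# Made-up deviation values for a training and a test split
train_dev = [45.0, math.nan, 55.0]
test_dev = [math.nan, 70.0]

fill = nan_mean(train_dev)  # computed from the training split only
print(fill_nan(train_dev, fill), fill_nan(test_dev, fill))
```

<p> With pandas the same idea would amount to calling <code>mean()</code> on the training rows and passing the result to <code>fillna</code> for both splits. </p>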

<h3> Applying machine learning models </h3>

In [159]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE


##################################################################
### Train-test split and label encoding of the target          ###
##################################################################

X = data.drop('white_score', axis=1)
Y = data['white_score']
Y = Y.apply(lambda x: 0 if x == 0.0 else 1 if x == 0.5 else 2)

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)


# smt = SMOTE()
# X_train, y_train = smt.fit_resample(X_train, y_train)


########################################
### Showing the most important features ###
########################################

#fs = SelectKBest(f_classif, k="all")
#fs.fit(X, Y)

#rate = pd.DataFrame(fs.scores_, X.columns, columns=["score"])
#rate.sort_values("score", axis=0, ascending=False, inplace=True)
#rate.head(10)
# - work on the model is still in progress!
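<p> The target encoding used above sends every value other than 0.0 and 0.5 to class 2, which would also cover strings or missing results. As a hedged alternative, the sketch below uses an explicit mapping that rejects anything unexpected. The names <code>RESULT_TO_LABEL</code> and <code>encode_result</code> are hypothetical. </p>

```python
# White's score mapped to a class label: loss -> 0, draw -> 1, win -> 2
RESULT_TO_LABEL = {0.0: 0, 0.5: 1, 1.0: 2}


def encode_result(score):
    """Encode White's score (a number or numeric string) as a class label."""
    key = float(score)  # raises ValueError for non-numeric input such as "*"
    try:
        return RESULT_TO_LABEL[key]
    except KeyError:
        raise ValueError(f"unexpected score: {score!r}") from None


print([encode_result(s) for s in ["0", 0.5, "1", 1.0]])  # [0, 1, 2, 2]
```

<p> Applied to the notebook, this could replace the lambda via <code>Y.apply(encode_result)</code>, at the cost of failing loudly on rows whose result is missing. </p>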
