# Predicting Grandmasters
## Volume 3 Math 402 Final Project

### Damian and Whitney Anderson,
### Nathan Christiansen, Reagan Howell

#### Date 11/16/2021

# Introduction
Have you ever started playing a game of chess online and wondered if your opponent was one of the top players on lichess.org?

We have ... well, not really, but we did wonder how accurately you could identify a chess opponent from the moves they make.
In the past, chess opening moves were studied to find the move order that would give the

To find out, we took the top ~30 players on lichess.org and downloaded their classical and rapid format games (>25 min and >10 min respectively).
With their games in hand, we asked what it would take to understand the patterns these masters follow.

### Importing, Parsing and Cleaning the Data

All games were downloaded from the lichess.org open database using links like this:

https://lichess.org/api/games/user/Al_shima?rated=true&analysed=true&tags=true&clocks=true&evals=false&opening=false&perfType=rapid

We then used regex to remove the unimportant information, stripping the .txt files down to each game's Portable Game Notation (PGN) headers and the moves that were made.
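To make the clock-stripping step concrete, here is a minimal sketch that uses the standard-library `re` module on a single made-up move line. The sample line and the `strip_clocks` helper are illustrative assumptions, not taken from the downloaded files; the two patterns follow the ones used in `get_games`, with the braces escaped.

```python
import re

# Matches a clock comment followed by the "N..." marker that introduces Black's move
white_clock = re.compile(r'\{ \[%\S* [0-9]*:[0-9]*:[0-9]*\] \} [0-9]+\.\.\.')
# Matches any remaining clock comment
black_clock = re.compile(r'\{ \[%\S* [0-9]+:[0-9]+:[0-9]+\] \}')

def strip_clocks(line):
    """Remove clock annotations from one line of PGN move text (hypothetical helper)."""
    line = white_clock.sub('', line)
    line = black_clock.sub('', line)
    # Collapse the double spaces left behind by the removals
    return ' '.join(line.split())

# Made-up sample line in the annotated style
sample = ("1. e4 { [%clk 0:10:00] } 1... e5 { [%clk 0:10:00] } "
          "2. Nf3 { [%clk 0:09:58] } 2... Nc6 { [%clk 0:09:57] }")
cleaned = strip_clocks(sample)
print(cleaned)
```

The whitespace cleanup at the end is an addition for readability; the notebook's own code leaves the extra spaces in place.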

In [3]:
import os
import regex as re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression,LogisticRegressionCV,Perceptron
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.naive_bayes import MultinomialNB,GaussianNB

from sklearn.neural_network import MLPClassifier

In [3]:
path = './games'
def files(path = './games'):
    """
    Grab all the txt file names and find the player name from it
    Returns the filenames and the players
    """

    file = os.listdir(path)
    players = [file[i][8:-15] for i in range(len(file))]

    return file, players

In [4]:
def get_games(file, player):
    """
    Some of the game files have chess clock times in the pgn text, so we use regex to remove the time stamps.
    Given the filename and the player name, we scan the file to find the color, the variant and the moves played.

    """
    # raw strings keep the regex backslashes from being read as (invalid) string escapes
    white_clock = re.compile(r'{ \[%\S* [0-9]*:[0-9]*:[0-9]*\] } [0-9]+\.\.\.')
    black_clock = re.compile(r'{ \[%\S* [0-9]+:[0-9]+:[0-9]+\] }')
    pre_df = []
    #open the file
    with open(file) as fin:
        lines = fin.readlines()

        for i in range(len(lines)):
            if player in lines[i]:
                color = lines[i][1:6] # get the color that the player is playing

                #Because each game is inconsistent with the rows that are included,
                #we need to check several different rows to find the variant being played.
                if color == "White":
                    if ('Variant' in lines[i+6]):
                        variant = lines[i+6][10:-3]

                    elif ('Variant' in lines[i+7]):
                        variant = lines[i+7][10:-3]

                    elif ('Variant' in lines[i+8]):
                        variant = lines[i+8][10:-3]

                    elif ('Variant' in lines[i + 9]):
                        variant = lines[i + 9][10:-3]

                    elif ('Variant' in lines[i + 10]):
                        variant = lines[i + 10][10:-3]

                    elif ('Variant' in lines[i + 11]):
                        variant = lines[i + 11][10:-3]

                    elif ('Variant' in lines[i + 12]):
                        variant = lines[i + 12][10:-3]

                    #each of these if blocks tries to find the starting moves and checks to make sure
                    #that each game being added to the dataframe is of at least a minimum length
                    if (len(lines[i+14]) > 60) and ('1. ' in lines[i + 14]):
                        line = lines[i + 14]
                        new_line = re.sub(white_clock, str(), line)
                        newer_line = re.sub(black_clock, str(), new_line)
                        pre_df.append([player, color, variant, newer_line])

                    elif (len(lines[i+15]) > 60) and ('1. ' in lines[i + 15]):
                        line = lines[i + 15]
                        new_line = re.sub(white_clock, str(), line)
                        newer_line = re.sub(black_clock, str(), new_line)
                        pre_df.append([player, color, variant, newer_line])

                    elif (len(lines[i+16]) > 60) and ('1. ' in lines[i + 16]):
                        line = lines[i + 16]
                        new_line = re.sub(white_clock, str(), line)
                        newer_line = re.sub(black_clock, str(), new_line)
                        pre_df.append([player, color, variant, newer_line])

                #see comments for the white code
                elif color == "Black":
                    if ('Variant' in lines[i + 6]):
                        variant = lines[i + 6][10:-3]

                    elif ('Variant' in lines[i + 7]):
                        variant = lines[i + 7][10:-3]

                    elif ('Variant' in lines[i + 8]):
                        variant = lines[i + 8][10:-3]

                    elif ('Variant' in lines[i + 9]):
                        variant = lines[i + 9][10:-3]

                    elif ('Variant' in lines[i + 10]):
                        variant = lines[i + 10][10:-3]

                    elif ('Variant' in lines[i + 11]):
                        variant = lines[i + 11][10:-3]

                    elif ('Variant' in lines[i + 12]):
                        variant = lines[i + 12][10:-3]

                    if (len(lines[i+14]) > 60) and ('1. ' in lines[i + 14]):
                        line = lines[i + 14]
                        new_line = re.sub(white_clock, str(), line)
                        newer_line = re.sub(black_clock, str(), new_line)
                        pre_df.append([player, color, variant, newer_line])

                    elif (len(lines[i+15]) > 60) and ('1. ' in lines[i + 15]):
                        line = lines[i + 15]
                        new_line = re.sub(white_clock, str(), line)
                        newer_line = re.sub(black_clock, str(), new_line)
                        pre_df.append([player, color, variant, newer_line])

    return pre_df

In [5]:
def create_database():
    """
    Calls the files and get_games functions to create a dataframe
    Returns a dataframe

    """
    file, players = files()
    df = []
    for i in range(len(players)):
        #create a list of all the files and their corresponding players and then input that list into the DataFrame
        df.extend(get_games(path + '/' + file[i], players[i]))
    df = pd.DataFrame(df, columns=['Name', 'Color', 'Variant', 'Moves'])
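    #reset the index after filtering so that positional (iloc) and label (loc) lookups agree later on
    df = df[df['Variant'] == 'Standard'].reset_index(drop=True)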
    return df

The main function uses the create_database() function to access the data files and build our DataFrame. We create a
DataFrame with columns for the player's name, the color they were playing, the variant, and the first 14 moves of
the game. We then export that DataFrame to chess_games.csv.

In [6]:
def main():
    """
    After creating the dataframe with the cleaned data, we need to change the one column of 14 moves into
    14 columns of 1 move each.
    Then we drop the na that slipped through
    Saves it to a csv file
    Returns None
    """
    df = create_database()
    moves = [f'{i}.' for i in range(1,16)]
    for move in moves:
        df[move] = np.nan
    #instead of having a list of 14 moves, we need columns for each of the individual 14 moves.
    for i in range(len(df)):
        move_order = df.iloc[i].Moves
        if i >830 and i < 835:
            pass
        else:
            # print(i)
            for j in range(len(moves)-1):
                first_ind = move_order.find(moves[j])
                second_ind = move_order.find(moves[j+1])
                if second_ind == -1:
                    break
                else:
                    df.loc[i, moves[j]] = move_order[first_ind:second_ind-1]


    df.drop(df.columns[-1],axis=1,inplace=True)
    df.dropna(inplace=True)
    for i in range(len(moves)-1):
        df.loc[:,moves[i]] = df.loc[:,moves[i]].str[3:]
    df = df.dropna()
    df.to_csv(r'chess_games.csv', index=True, header=True)
    return

main()
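As a compact illustration of turning one move string into one column per move, the sketch below splits on the move numbers with a regular expression rather than with repeated `str.find` calls. This is an alternative technique, not the notebook's method; the `split_moves` helper and the sample game are hypothetical.

```python
import re

import pandas as pd

def split_moves(move_text, n_moves=14):
    """Return up to n_moves full moves (e.g. 'e4 e5') from a PGN move string."""
    # Splitting on '1.', '2.', ... leaves an empty first piece, which is skipped
    parts = re.split(r'\d+\.\s*', move_text)
    return [p.strip() for p in parts[1:n_moves + 1]]

# Made-up game used only for illustration
game = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6"
first_three = split_moves(game, n_moves=3)

# One column per move, labelled the same way as the notebook ('1.', '2.', ...)
row = pd.DataFrame([first_three], columns=[f"{i}." for i in range(1, 4)])
print(row)
```

Games shorter than `n_moves` simply return fewer entries, so a real pipeline would still need to drop or pad those rows.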

### Creating the X-data and the y-targets

Here we load chess_games.csv and break it into the data and the targets. We hope to predict the name of the Grandmaster
playing the game, so we choose the name column as our targets. We are using the first 14 moves the player makes in order
to predict who is playing, so our data is the 14 columns of moves from each game. We create a train/test split of .75
training and .25 testing.

In [7]:
def load_():
    df = pd.read_csv("chess_games.csv")
    targets = df.Name
    data = df.drop(columns=['Name', 'Color', 'Variant', 'Moves'])
    return data, targets


def sets_():
    data, targets = load_()
    data = pd.get_dummies(data,columns=data.columns)
    xtrain,xtest,ytrain,ytest = train_test_split(data,targets)
    # params = {'n_neighbors': [2,3,4],
    #         'weights' :['uniform','distance'],
    #         'leaf_size' : [30,40,50,60],
    #
    # }
    return xtrain,xtest,ytrain,ytest

xtrain,xtest,ytrain,ytest = sets_()
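The effect of the one-hot encoding and the split is easier to see on a toy table. The sketch below uses made-up games and player names; `test_size` and `random_state` are spelled out here as assumptions so the split is explicit and repeatable, whereas the cell above relies on the `train_test_split` defaults.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: two players, two move columns
toy = pd.DataFrame({
    "Name": ["A", "A", "B", "B"],
    "1.": ["e4 e5", "e4 c5", "d4 d5", "d4 Nf6"],
    "2.": ["Nf3 Nc6", "Nf3 d6", "c4 e6", "c4 g6"],
})

features = toy.drop(columns=["Name"])
# Each distinct move string becomes its own 0/1 column
X = pd.get_dummies(features, columns=list(features.columns))
y = toy["Name"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X.shape, len(X_train), len(X_test))
```

Because every distinct move string gets its own column, the feature matrix grows quickly with the number of games, which is worth keeping in mind when comparing classifiers.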

Now that we have split the data, we use several classifiers with some adjustments to hyper-parameters to figure out
which combination gives us the most accurate prediction of the Grandmaster playing any given game. In order to make
sure that no classifier got a better split than the others, we (begrudgingly) made the xtrain, xtest, ytrain, and ytest 
variables global and accessible to any function or method.
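One way to guarantee that every classifier sees the same split without relying on globals is to score them all in a single loop. The following is a minimal sketch on synthetic binary features standing in for the one-hot move columns; the random data, `random_state` values, and `max_iter=500` are assumptions added for the example, so the printed scores say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the one-hot features and player labels (not the real data)
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 20))
y = rng.integers(0, 3, size=200)

# A single split shared by every model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "KNeighborsClassifier": KNeighborsClassifier(
        n_neighbors=4, weights="distance", algorithm="brute"
    ),
    "MultinomialNB": MultinomialNB(),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "MLPClassifier": MLPClassifier(
        hidden_layer_sizes=(50,), max_iter=500, random_state=0
    ),
}

# Fit each model on the shared training set and score it on the shared test set
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```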


## KNeighborsClassifier
After doing a lot of grid searching by "hand", we found that looking at the 4 nearest neighbors,
weighting them by distance, and using just the brute-force algorithm gave the fastest and highest-scoring model.

In [12]:
KNeighborsClassifier(n_neighbors=4,weights='distance',algorithm='brute').fit(xtrain,ytrain).score(xtest,ytest)

0.6574481458202388
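The manual search described above could also be automated with `GridSearchCV`, which is already imported at the top of the notebook. The sketch below runs on synthetic data and only covers the `n_neighbors` and `weights` values from the commented-out grid in `sets_()`; the data and the choice of `cv=3` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the one-hot features and player labels (not the real data)
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(120, 15))
y = rng.integers(0, 3, size=120)

param_grid = {
    "n_neighbors": [2, 3, 4],
    "weights": ["uniform", "distance"],
}

# 3-fold cross-validation over every combination in the grid
search = GridSearchCV(KNeighborsClassifier(algorithm="brute"), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the real data, `search.fit(xtrain, ytrain)` would take the place of the synthetic arrays, leaving the test set untouched until the final score.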

## MultinomialNB

In [9]:
MultinomialNB().fit(xtrain,ytrain).score(xtest,ytest)


0.6228786926461345

## Random Forests

In [10]:
RandomForestClassifier().fit(xtrain,ytrain).score(xtest,ytest)


0.6209930861093652

## MLPClassifier

In [13]:
MLPClassifier(hidden_layer_sizes=(50,)).fit(xtrain,ytrain).score(xtest,ytest)

0.6323067253299811