# Predicting Grandmasters
## Volume 3 Math 402 Final Project

### Damian and Whitney Anderson,
### Nathan Christiansen, Reagan Howell

#### Date 11/16/2021

# Introduction
Have you ever started playing a game of chess online and wondered if your opponent was one of the top players on lichess.org?

We have ... well, not really, but we did wonder how accurately you could predict your chess opponent based on the moves that they make.
In the past, chess opening moves were studied to find the move order that would give the

So we took the top ~30 players on lichess.org and downloaded their classical and rapid format games (>25 min and >10 min respectively).
With their games in hand, what would it take to help us understand the patterns that these masters are making?

### Importing, Parsing and Cleaning the Data

All games were downloaded from lichess.org using API links like this:

https://lichess.org/api/games/user/Al_shima?rated=true&analysed=true&tags=true&clocks=true&evals=false&opening=false&perfType=rapid

We then used regex to remove the unimportant information, stripping the .txt files down to each game's Portable Game Notation (PGN) tags and the moves that were made.
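As a rough illustration of that cleaning step, here is a minimal sketch that pulls the tag pairs and the bare move list out of one game. It is not the code we actually ran: the sample game, the opponent name, and the `parse_game` helper are made up for the example, it assumes the clock annotations arrive as `{ [%clk ...] }` comments, and it uses the standard-library `re` module rather than `regex`.

```python
import re

# Hypothetical sample game in the style of a PGN export with clock comments.
SAMPLE_PGN = '''[Event "Rated Rapid game"]
[White "Al_shima"]
[Black "hypothetical_opponent"]
[Result "1-0"]

1. e4 { [%clk 0:10:00] } 1... c5 { [%clk 0:10:00] } 2. Nf3 { [%clk 0:09:58] } 2... d6 { [%clk 0:09:55] } 1-0
'''

TAG_RE = re.compile(r'^\[(\w+) "(.*)"\]$', re.MULTILINE)  # [Name "value"] lines
COMMENT_RE = re.compile(r'\{[^}]*\}')                     # { ... } comments
RESULT_RE = re.compile(r'(1-0|0-1|1/2-1/2|\*)\s*$')       # trailing game result
MOVE_NUMBER_RE = re.compile(r'\d+\.+')                    # "1." and "1..."


def parse_game(pgn_text):
    """Return (tags, moves) for a single game given as PGN text."""
    tags = dict(TAG_RE.findall(pgn_text))
    movetext = TAG_RE.sub('', pgn_text)         # drop the tag lines
    movetext = COMMENT_RE.sub(' ', movetext)    # drop clock comments
    movetext = RESULT_RE.sub('', movetext)      # drop the final result
    movetext = MOVE_NUMBER_RE.sub(' ', movetext)  # drop move numbers
    return tags, movetext.split()


tags, moves = parse_game(SAMPLE_PGN)
print(tags["White"], moves)
```

What remains after the substitutions is just the moves in the order they were played, which is the part the later steps work with.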

In [None]:
import os, regex as re, numpy as np, pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression,LogisticRegressionCV,Perceptron
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.naive_bayes import MultinomialNB,GaussianNB

from sklearn.neural_network import MLPClassifier


The main function uses the create_database() function to access the data files and build our DataFrame. We create a
DataFrame with columns for the player's name, the color they were playing as, the variant, and the first 14 moves they make
in their game. We then export that DataFrame to chess_games.csv.
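To make the row layout concrete, here is a hedged sketch of how one game could be turned into one row. The helper `game_to_row`, the `Move1` ... `Move14` column names, the contents of the `Moves` column, and the "Standard" fallback for the variant are all assumptions made for illustration; create_database() itself is not shown here and may differ.

```python
N_MOVES = 14  # number of the player's own opening moves kept per game


def game_to_row(player, tags, moves, n_moves=N_MOVES):
    """Build one table row (a dict) for `player` from a parsed game.

    `tags` maps PGN tag names to values; `moves` lists every move in
    order (White's first move, Black's first move, and so on).
    """
    color = "white" if tags.get("White") == player else "black"
    # White makes the moves at indices 0, 2, 4, ...; Black at 1, 3, 5, ...
    start = 0 if color == "white" else 1
    own_moves = moves[start::2][:n_moves]
    # Pad short games so every row has the same number of move columns.
    own_moves += [""] * (n_moves - len(own_moves))
    row = {
        "Name": player,
        "Color": color,
        "Variant": tags.get("Variant", "Standard"),  # assumed fallback
        "Moves": " ".join(moves),                    # assumed: full move list
    }
    for i, move in enumerate(own_moves, start=1):
        row[f"Move{i}"] = move                       # hypothetical column names
    return row


example_tags = {"White": "Al_shima", "Black": "hypothetical_opponent"}
example_row = game_to_row("Al_shima", example_tags, ["e4", "c5", "Nf3", "d6"])
print(example_row["Color"], example_row["Move1"], example_row["Move2"])
```

A list of such rows can then be written out with `pd.DataFrame(rows).to_csv("chess_games.csv", index=False)`.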


### Creating the X-data and the y-targets

Here we load chess_games.csv and break it into the data and the targets. We hope to predict the name of the Grandmaster
playing the game, so we choose the name column as our targets. We are using the first 14 moves the player makes in order
to predict who is playing, so our data is the 14 columns of moves from each game. We create a train/test split of .75
training and .25 testing.

In [7]:
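def load_():
    # Read the table of games built in the previous step.
    df = pd.read_csv("chess_games.csv")
    # The player's name is the label we want to predict.
    targets = df.Name
    # Keep only the move columns as features.
    data = df.drop(columns=['Name', 'Color', 'Variant', 'Moves'])
    return data, targets


def sets_():
    data, targets = load_()
    # One-hot encode every move column so the classifiers get numeric input.
    data = pd.get_dummies(data,columns=data.columns)
    # Default split: 75% training, 25% testing.
    xtrain,xtest,ytrain,ytest = train_test_split(data,targets)
    return xtrain,xtest,ytrain,ytest

xtrain,xtest,ytrain,ytest = sets_()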
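Since the moves are strings like "e4" or "Nf3", the `pd.get_dummies` call is what turns them into numbers the classifiers can use. The toy example below shows the effect on a made-up three-game table; the `Move1` and `Move2` column names are placeholders, not necessarily the names in chess_games.csv.

```python
import pandas as pd

# Toy stand-in for the move columns (three games, two moves each).
toy = pd.DataFrame({
    "Move1": ["e4", "d4", "e4"],
    "Move2": ["Nf3", "c4", "Bc4"],
})

# Each (column, move) pair becomes its own indicator column,
# e.g. "Move1_e4" marks the games whose first move was e4.
encoded = pd.get_dummies(toy, columns=toy.columns)
print(encoded.shape)
print(sorted(encoded.columns))
```

Encoding the whole table before splitting, as sets_() does, keeps the training and test sets on the same set of indicator columns even when a move appears in only one of them.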

Now that we have split the data, we use several classifiers with some adjustments to hyper-parameters to figure out
which combination gives us the most accurate prediction of the Grandmaster playing any given game. In order to make
sure that no classifier got a better split than the others, we (begrudgingly) made the xtrain, xtest, ytrain, and ytest 
variables global and accessible to any function or method.


## KNeighborsClassifier
After doing a lot of grid searching by "hand", we found that looking at the 4 nearest neighbors,
weighting them by distance, and using just the brute force algorithm resulted in the fastest and highest scoring model.

In [12]:
KNeighborsClassifier(n_neighbors=4,weights='distance',algorithm='brute').fit(xtrain,ytrain).score(xtest,ytest)

0.6574481458202388
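For comparison, the same kind of search can be automated with `GridSearchCV`. The sketch below is only an illustration of that approach, not the search we ran: it uses random synthetic data in place of the one-hot move matrix, and the parameter grid and `cv=3` are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real data (assumption: random 0/1 features
# and three made-up player labels, so the scores mean nothing here).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 20))
y = rng.integers(0, 3, size=200)

# Candidate hyper-parameters to try in every combination.
param_grid = {
    "n_neighbors": [2, 4, 8],
    "weights": ["uniform", "distance"],
    "algorithm": ["brute"],
}

# 3-fold cross-validation over each combination in the grid.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

On the real data, `search.fit(xtrain, ytrain)` followed by `search.score(xtest, ytest)` would report the test accuracy of the best combination found.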

## MultinomialNB

In [9]:
MultinomialNB().fit(xtrain,ytrain).score(xtest,ytest)


0.6228786926461345

## Random Forests

In [10]:
RandomForestClassifier().fit(xtrain,ytrain).score(xtest,ytest)


0.6209930861093652

## MLPClassifier

In [13]:
MLPClassifier(hidden_layer_sizes=(50,)).fit(xtrain,ytrain).score(xtest,ytest)



0.6323067253299811