# The problem

This notebook uses a dataset of about 20,000 chess matches collected from Lichess (games played by users from the top ~100 teams on _lichess.org_ between 2013 and 2017), available at https://www.kaggle.com/datasnaek/chess. The objective is to create a model that predicts the result of a match from the parameters of the game, before it starts.

In [None]:
import pandas as pd
import numpy as np
%matplotlib inline
%pylab inline

pd.set_option('display.max_columns', 200)

Importing the dataset:

In [None]:
chess_games = pd.read_csv('../input/chess/games.csv', delimiter=',')

In [None]:
chess_games.head()

# Features

### Duration of the match

In [None]:
games_delay_in_sec = (chess_games['last_move_at'] - chess_games['created_at']) / 1000
chess_games['duration_in_seconds'] = games_delay_in_sec.copy()

### One-Hot-Encoding of victory status

In [None]:
from category_encoders import OneHotEncoder

In [None]:
ohe_victory_status = OneHotEncoder(cols=['victory_status'], use_cat_names=True, drop_invariant=True)
chess_games = ohe_victory_status.fit_transform(chess_games)

### Time control: Minutes and Seconds

In [None]:
minutes = chess_games['increment_code'].str.split('+').map(lambda time_control: time_control[0], na_action=None).astype(int)
incr_seconds = chess_games['increment_code'].str.split('+').map(lambda time_control: time_control[1], na_action=None).astype(int)

chess_games['minutes'] = minutes.copy()
chess_games['incr_seconds'] = incr_seconds.copy()

In [None]:
chess_games = chess_games.drop(columns=['increment_code'])
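
The parsing above relies on the `minutes+seconds` layout of `increment_code`. As a minimal sketch, assuming every code contains exactly one `+` (the helper name `parse_time_control` is hypothetical and not part of the notebook), the same split can be checked on a single value:

```python
def parse_time_control(increment_code):
    """Split a code such as '15+2' into (base minutes, increment seconds)."""
    minutes, incr_seconds = increment_code.split('+')
    return int(minutes), int(incr_seconds)


print(parse_time_control('15+2'))  # (15, 2)
print(parse_time_control('10+0'))  # (10, 0)
```

A code without a `+`, or with more than one, would raise a `ValueError` here, which makes malformed values easy to spot.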

### Date of Creation and Last Move At (as dates)

In [None]:
chess_games['created_at'] = pd.to_datetime(chess_games['created_at'], unit='ms')
chess_games['last_move_at'] = pd.to_datetime(chess_games['last_move_at'], unit='ms')

### Rating difference

Another variable that is likely to be predictive is the rating difference between the players (white rating minus black rating).

In [None]:
chess_games['rating_difference'] = chess_games['white_rating'] - chess_games['black_rating']

In [None]:
chess_games['rating_difference'].mean()

### Castle

In [None]:
def get_white_moves(moves):
    return moves[::2]

def get_black_moves(moves):
    return moves[1::2]

def castled(moves):
    return 'O-O' in moves or 'O-O-O' in moves

all_moves = chess_games['moves'].str.split()
white_moves = all_moves.apply(get_white_moves)
black_moves = all_moves.apply(get_black_moves)
chess_games['white_castled'] = white_moves.apply(castled).astype(int)
chess_games['black_castled'] = black_moves.apply(castled).astype(int)
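
To make the move-splitting logic concrete, here is a small sketch on a made-up game (the move string and the helper names `split_by_color`, `castled_exact` and `castled_with_suffix` are illustrative assumptions, not part of the notebook). It also shows one caveat of exact matching: in standard algebraic notation a castling move that gives check is written `O-O+`, which an exact comparison against `'O-O'` does not match. Whether the dataset contains such moves is not checked here.

```python
def split_by_color(moves_text):
    """Return (white moves, black moves) from a space-separated move string."""
    moves = moves_text.split()
    return moves[::2], moves[1::2]


def castled_exact(moves):
    # Same test as the notebook: exact match on the castling tokens
    return 'O-O' in moves or 'O-O-O' in moves


def castled_with_suffix(moves):
    # Hypothetical variant: also accepts 'O-O+', 'O-O-O#', etc.
    return any(mv.startswith('O-O') for mv in moves)


game = 'e4 e5 Nf3 Nc6 Bc4 Bc5 O-O Nf6'
white, black = split_by_color(game)
print(white)                                        # ['e4', 'Nf3', 'Bc4', 'O-O']
print(castled_exact(white), castled_exact(black))   # True False
print(castled_exact(['O-O+']), castled_with_suffix(['O-O+']))  # False True
```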

### Number of takes

In [None]:
def count_takes(moves):
    # A capture is written with an 'x' in algebraic notation (e.g. 'Nxe5')
    return sum('x' in mv for mv in moves)

chess_games['white_takes_count'] = white_moves.apply(count_takes)
chess_games['black_takes_count'] = black_moves.apply(count_takes)

In [None]:
chess_games.head()

# Clean Data

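We estimate an upper bound, in hours, for the duration of a match, from the largest base time, the largest increment and the mean number of turns: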

In [None]:
max_minutes = chess_games['minutes'].max()
max_incr_seconds = chess_games['incr_seconds'].max()
mean_moves = chess_games['turns'].mean()

duration_threshold_in_hours = (max_minutes + max_incr_seconds / 60 * mean_moves) / 60
duration_threshold_in_hours
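
As a worked example of the formula above, the sketch below plugs in round, illustrative values (180 minutes, 180 seconds of increment and 60 turns are assumptions chosen for easy arithmetic, not values read from the dataset). The function name `duration_threshold_hours` is hypothetical.

```python
def duration_threshold_hours(max_minutes, max_incr_seconds, mean_moves):
    """Base time plus the increment accumulated over mean_moves turns, in hours."""
    total_minutes = max_minutes + max_incr_seconds / 60 * mean_moves
    return total_minutes / 60


# 180 min base + (180 s / 60) min * 60 turns = 360 min = 6 hours
print(duration_threshold_hours(180, 180, 60))  # 6.0
```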

So we remove the matches with 3 turns or fewer (it is not possible to checkmate in 3 turns) and the ones whose duration reaches or exceeds this threshold (about 6 hours).

In [None]:
chess_games = chess_games[chess_games['turns'] > 3]
chess_games = chess_games[chess_games['duration_in_seconds'] < duration_threshold_in_hours * 3600]

In [None]:
chess_games.shape

We also have matches with a duration of 0 seconds.

In [None]:
duration0 = chess_games[chess_games['duration_in_seconds'] == 0]
duration0['winner'].value_counts()

Let's see the distribution of the matches' duration:

In [None]:
pyplot.hist(x=chess_games['duration_in_seconds'], bins=100)
pyplot.xlabel('Duration (in seconds)')
pyplot.ylabel('Frequency')
pyplot.title('Matches\' Durations')
pyplot.show()

Let's keep only the matches with duration_in_seconds > 0.

In [None]:
chess_games = chess_games[chess_games['duration_in_seconds'] > 0]

In [None]:
chess_games.shape

Replotting the durations, without the zeroes:

In [None]:
pyplot.hist(x=chess_games['duration_in_seconds'], bins=100)
pyplot.xlabel('Duration (in seconds)')
pyplot.ylabel('Frequency')
pyplot.title('Matches\' Durations')
pyplot.show()

The peak at ~10000 s = ~2.8 hours is a plausible match duration.

# Baseline

Let's take as the **baseline case** a prediction based only on the rating difference (diff = white rating - black rating):

* if -(mean difference) < diff < (mean difference) => draw;
* else, if diff is negative => black wins;
* else => white wins.

In [None]:
baseline = pd.DataFrame(index=chess_games.index)
baseline['rating_difference'] = chess_games['rating_difference']
baseline.shape

In [None]:
baseline['rating_difference'].mean()

In [None]:
# Compute the mean once instead of on every call
mean_rating_difference = baseline['rating_difference'].mean()

def get_base_winner(rating_diff):
    if -mean_rating_difference < rating_diff < mean_rating_difference:
        return 'draw'
    elif rating_diff < 0:
        return 'black'
    else:
        return 'white'

baseline['winner'] = baseline['rating_difference'].apply(get_base_winner)
baseline['winner'].value_counts()
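
One property of this rule is worth seeing on small numbers: the draw band `-t < diff < t` is only non-empty when the threshold `t` is positive. The sketch below takes the threshold as an explicit parameter (the function name `baseline_winner` and the values 8 and -8 are illustrative assumptions, not the dataset's mean).

```python
def baseline_winner(rating_diff, threshold):
    """Predict 'draw' inside the band (-threshold, threshold), else the higher-rated side."""
    if -threshold < rating_diff < threshold:
        return 'draw'
    return 'black' if rating_diff < 0 else 'white'


for diff in (-50, 3, 0, 120):
    print(diff, baseline_winner(diff, threshold=8))
# -50 black
# 3 draw
# 0 draw
# 120 white

# With a negative threshold the band is empty, so 'draw' is never predicted
print(baseline_winner(0, threshold=-8))  # white
```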

Let's use the **weighted avg precision** and **weighted avg recall** as our modeling metrics: each class's score is weighted by its support (its number of true examples), which accounts for the different number of examples for each result.

In [None]:
from sklearn.metrics import precision_recall_fscore_support, classification_report

In [None]:
print(classification_report(chess_games['winner'], baseline['winner'], digits=4))

p 0.5933, recall 0.5924 - Baseline

To compute the **weighted avg precision** directly, we use its definition: _weighted_avg_precision = weighted_avg(precision, support)_

In [None]:
results = precision_recall_fscore_support(chess_games['winner'], baseline['winner'])
np.average(results[0], weights=results[3])
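
To see what the support-weighted averages mean, here is a minimal pure-Python sketch on four made-up predictions (the labels and the helper name `weighted_precision_recall` are illustrative assumptions). It follows the definition above and treats a class that is never predicted as having precision 0, which is my understanding of scikit-learn's default behaviour. It also illustrates that the weighted recall coincides with plain accuracy, since each class's recall is multiplied back by its support.

```python
from collections import Counter


def weighted_precision_recall(y_true, y_pred):
    """Support-weighted average of per-class precision and recall."""
    support = Counter(y_true)      # true examples per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)

    precision = sum(
        (correct[c] / predicted[c] if predicted[c] else 0.0) * support[c]
        for c in support
    ) / total
    recall = sum(correct[c] / support[c] * support[c] for c in support) / total
    return precision, recall


y_true = ['white', 'white', 'black', 'draw']
y_pred = ['white', 'black', 'black', 'white']
precision, recall = weighted_precision_recall(y_true, y_pred)
print(precision)  # 0.375
print(recall)     # 0.5
```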

# Constructing the model

We import the Random Forest Classifier.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

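Let's use the following variables to build the model (including the castling flags, the capture counts and the rating difference). Note that the victory status, the castling flags and the capture counts are only known once a match has been played, so they are not available before it starts.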

In [None]:
variables = ['victory_status_outoftime', 'victory_status_resign', 'victory_status_mate', 'victory_status_draw', 
             'white_rating', 'black_rating', 'minutes', 'incr_seconds', 'rating_difference',
             'white_castled', 'black_castled', 'white_takes_count', 'black_takes_count']

As the dataset is small, let's split it 50% train / 50% validation.

In [None]:
X = chess_games[variables]
y = chess_games['winner']
# Fix the seed so the split (and the reported metrics) can be reproduced
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.5, random_state=0)

In [None]:
X_train.shape, X_valid.shape, y_train.shape, y_valid.shape

We train the model.

In [None]:
model = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
model.fit(X_train, y_train)

predicted = model.predict(X_valid)

results = precision_recall_fscore_support(y_valid, predicted)
support = results[3]
prec = np.average(results[0], weights=support)
recall = np.average(results[1], weights=support)
print("Precision: {}".format(prec))
print("Recall: {}".format(recall))
print()

Both metrics increase considerably when we use the variables **white_takes_count** and **black_takes_count**.

# Cross-validation

In order to have a better estimate of the metrics, let's use repeated K-fold cross-validation (RepeatedKFold with 2 folds, repeated 10 times).

In [None]:
from sklearn.model_selection import RepeatedKFold

In [None]:
avg_weighted_precisions = []
avg_weighted_recalls = []
kf = RepeatedKFold(n_splits=2, n_repeats=10, random_state=0)

X = chess_games[variables]
y = chess_games['winner']
i = 0

for lines_train, lines_valid in kf.split(chess_games):
    X_train, y_train = X.iloc[lines_train], y.iloc[lines_train]
    X_valid, y_valid = X.iloc[lines_valid], y.iloc[lines_valid]
    i = i + 1
    
    model = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
    model.fit(X_train, y_train)
    
    predicted = model.predict(X_valid)

    results = precision_recall_fscore_support(y_valid, predicted)
    support = results[3]
    prec = np.average(results[0], weights=support)
    recall = np.average(results[1], weights=support)
    print("Iteration  #{}".format(i))
    print("=====================")
    print("Precision: {}".format(prec))
    print("Recall: {}".format(recall))
    print()
    
    avg_weighted_precisions.append(prec)
    avg_weighted_recalls.append(recall)

print("Average Precision: {}".format(np.mean(avg_weighted_precisions)))
print("Average Recall: {}".format(np.mean(avg_weighted_recalls)))
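
The loop above runs 20 iterations because 2 folds repeated 10 times give 2 x 10 train/validation pairs. The sketch below reproduces that structure in plain Python to show where the count comes from. It is a simplified stand-in written for illustration (the function name `repeated_two_fold` is hypothetical), not scikit-learn's implementation, and its shuffling will not match the indices `RepeatedKFold` produces.

```python
import random


def repeated_two_fold(n_samples, n_repeats, seed=0):
    """Yield (train_indices, valid_indices) pairs: 2 per repeat."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)          # new random partition on each repeat
        half = n_samples // 2
        first, second = indices[:half], indices[half:]
        yield first, second           # train on first half, validate on second
        yield second, first           # then swap the roles


splits = list(repeated_two_fold(n_samples=10, n_repeats=10))
print(len(splits))  # 20

train, valid = splits[0]
print(sorted(train + valid) == list(range(10)))  # True
```

Within each repeat, every sample is used once for training and once for validation, which is why the averaged metrics are more stable than those of a single 50/50 split.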

We see that both the average precision and the average recall are close to 80%.

Let's plot the precision and recall distribution to detect possible outliers:

In [None]:
pylab.subplot(121)
pylab.xlabel('Precision')
pylab.ylabel('Frequency')
pylab.hist(avg_weighted_precisions, bins=20)

pylab.subplot(122)
pylab.xlabel('Recall')
pylab.hist(avg_weighted_recalls, bins=20)

The values of both metrics lie in the interval [0.80, 0.82], which is a narrow range, and we haven't detected outliers.

Here are some other configurations tested in order to achieve this first model with metrics around 80%:

* p 0.5933, recall 0.5924 - Baseline

* p 0.6477, recall 0.6473 - Random Forest (n=100)

* p 0.6512, recall 0.6508 - Random Forest (n=200) with 'castled' variables

* p 0.8081, recall 0.8077 - Random Forest (n=200) with 'castled' and 'takes_count' variables

As we can see, performance improves considerably when we add the 'takes_count' variables.