For the final model, metrics such as F1 and accuracy are extracted for evaluation.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.feature import VectorAssembler

from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder, CrossValidatorModel


import pandas as pd
import numpy as np

from fractions import Fraction as frac

In [2]:
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("fa21-ds5110-group10-rk") \
    .config("spark.driver.memory", "36g") \
    .getOrCreate()

In [3]:
spark.sparkContext

In [4]:
spark.sparkContext.cancelAllJobs()

First we load the cached dataset produced by our prior feature engineering.

In [5]:
df = spark.read.parquet("../../../data/processed/chess_games_moves_model.parquet")
trainData = spark.read.parquet("../../../data/processed/training.parquet")
testData = spark.read.parquet("../../../data/processed/testing.parquet")

In [6]:
df.cache()
trainData.cache()

DataFrame[event: string, white_result: string, first_two: array<string>, ECO: string, EloDiff: int, Opening: string, game_complexity: int, opening_class: string]

In [7]:
df.show(2)

+-----+------------+----------------+---+-------+--------------------+---------------+--------------+
|event|white_result|       first_two|ECO|EloDiff|             Opening|game_complexity| opening_class|
+-----+------------+----------------+---+-------+--------------------+---------------+--------------+
|Blitz|         win|[c4 c5, Nc3 Nf6]|A34|    222|English Opening: ...|              6|Flank openings|
|Blitz|         win| [d4 b6, c4 Bb7]|A40|    309|  English Defense #2|              5|Flank openings|
+-----+------------+----------------+---+-------+--------------------+---------------+--------------+
only showing top 2 rows



In [8]:
model = CrossValidatorModel.load("ds5110/project/semester-project/data/modeling/GBT")

Next we review the resulting data points of interest.
We notice that ECO and the first two sets of moves are distinct from one another, and both may influence the model's predictions.

In [9]:
model.bestModel

PipelineModel_dd791a802d6d

In [10]:
model.getEvaluator().metricName

Param(parent='BinaryClassificationEvaluator_811850a15b75', name='metricName', doc='metric name in evaluation (areaUnderROC|areaUnderPR)')

We will now begin to build a model, keying in on the opening move and the white_result columns.
Note that both of these are categorical values, so we need to encode them with StringIndexer before PySpark can fit and evaluate a model.

In [11]:
# features
opening_vectorizor = StringIndexer(inputCol="ECO", outputCol="opening_ohe")
gametype_vectorizer = StringIndexer(inputCol="event", outputCol="event_vector")
class_vectorizer = StringIndexer(inputCol="opening_class", outputCol="opening_class_vector")
# target
result_vectorizor = StringIndexer(inputCol="white_result", outputCol="white_result_vector")
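
To make the encoding concrete, here is a minimal pure-Python sketch of what StringIndexer does under its default frequencyDesc ordering: the most frequent string receives index 0.0, and ties are broken alphabetically. The helper string_index and the toy values are hypothetical and only illustrate the idea; this is not PySpark's implementation, and the toy counts are not taken from our dataset.

```python
from collections import Counter


def string_index(values):
    """Map each distinct string to a float index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


# Toy column standing in for white_result (values are made up).
toy_results = ["win", "loss", "win", "loss", "win"]
mapping = string_index(toy_results)
encoded = [mapping[v] for v in toy_results]
print(mapping)
print(encoded)
```

Because the indices depend on label frequency, which of win or loss maps to 0.0 in the real data should be checked on the fitted StringIndexerModel (its labels attribute) rather than assumed.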

Next we perform one-hot encoding on our opening type (ECO). This creates a new column that we will use as a feature for our gradient-boosted tree (GBT) model.

In [12]:
move_encoder = OneHotEncoder(inputCols=["opening_ohe"],
                        outputCols=["ECO_Type"])

In [13]:
class_encoder = OneHotEncoder(inputCols=["opening_class_vector"],
                        outputCols=["Class_Type"])
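
As a rough illustration of what the encoders above produce, the sketch below turns a category index into a one-hot vector. It assumes the OneHotEncoder default of dropLast=True, under which the last category is represented by the all-zeros vector. The function one_hot is a hypothetical helper that returns a dense list, whereas PySpark emits sparse vectors.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list for a category index."""
    index = int(index)  # StringIndexer emits floats such as 0.0
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    # With drop_last, the final category is encoded as the all-zeros vector.
    if index < size:
        vector[index] = 1.0
    return vector


# Three toy categories: indices 0.0, 1.0 and 2.0.
encoded_first = one_hot(0.0, 3)
encoded_last = one_hot(2.0, 3)
print(encoded_first, encoded_last)
```

Dropping the last category avoids a redundant column, since it can be inferred when all other entries are zero.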

Now that we have the OHE of our ECO, we can combine it with other features to build out the predictors for our gradient-boosted tree model.

In [14]:
features_assembler = VectorAssembler(inputCols=['ECO_Type', "Class_Type","EloDiff"], outputCol='features')
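
Conceptually, the assembler concatenates its input columns, in order, into one flat feature vector per row. The sketch below shows this with a hypothetical helper assemble and made-up one-hot widths; only the EloDiff value 222 comes from the sample row shown earlier.

```python
def assemble(*columns):
    """Concatenate vector and scalar columns into one flat feature list."""
    features = []
    for column in columns:
        if isinstance(column, (list, tuple)):
            features.extend(float(x) for x in column)
        else:
            features.append(float(column))
    return features


# Hypothetical widths: a 2-wide ECO_Type and a 3-wide Class_Type, then EloDiff.
row_features = assemble([1.0, 0.0], [0.0, 1.0, 0.0], 222)
print(row_features)
```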

Now that our data is model-ready, we will fit the preprocessing pipeline, transform the train and test sets, and evaluate the model's performance.
Note that maxDepth and maxIter are set explicitly below rather than left at their defaults; the hyperparameters of the model we evaluate were selected with PySpark cross-validation (the CrossValidatorModel loaded above).

In [15]:
gbt = GBTClassifier(maxDepth=8, maxIter=25, labelCol='white_result_vector', seed=1337, leafCol="leafId")

In [16]:
data_pipeline = Pipeline(stages=[opening_vectorizor,
                            gametype_vectorizer,
                            class_vectorizer,
                            result_vectorizor,
                            move_encoder,
                            class_encoder,
                            features_assembler])
ml_pipeline = Pipeline(stages=[gbt])

In [17]:
data_model = data_pipeline.fit(df)
data_model_train = data_model.transform(trainData)
data_model_test = data_model.transform(testData)

Cross-validation setup.
Note that stepSize is the learning rate.

In [18]:
result = model.transform(data_model_test)

In [19]:
result.show(2)

+-----+------------+----------------+---+-------+-------------+---------------+--------------+-----------+------------+--------------------+-------------------+---------------+-------------+--------------------+--------------------+--------------------+----------+--------------------+
|event|white_result|       first_two|ECO|EloDiff|      Opening|game_complexity| opening_class|opening_ohe|event_vector|opening_class_vector|white_result_vector|       ECO_Type|   Class_Type|            features|       rawPrediction|         probability|prediction|              leafId|
+-----+------------+----------------+---+-------+-------------+---------------+--------------+-----------+------------+--------------------+-------------------+---------------+-------------+--------------------+--------------------+--------------------+----------+--------------------+
|Blitz|        loss| [Na3 d5, c4 c6]|A00|   -392|Sodium Attack|              3|Flank openings|        0.0|         0.0|                 2.0|  

In [20]:
auc_evaluator = BinaryClassificationEvaluator(labelCol='white_result_vector')
model_auc = auc_evaluator.evaluate(result)

AUC: 68.79% (minInstancesPerNode = 25, maxDepth = 8)
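
For intuition about this number: the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way on made-up labels and scores. The function auc_roc is a hypothetical helper, not how BinaryClassificationEvaluator computes the metric internally (it works from the rawPrediction column over the whole dataset).

```python
def auc_roc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5  # ties count as half
    return correct / (len(positives) * len(negatives))


# Made-up toy data, not taken from the chess dataset.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc_roc(toy_labels, toy_scores)
print(toy_auc)
```

An AUC of 0.5 corresponds to random ranking, so a value near 0.69 indicates a moderate ability to rank the two outcomes.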

In [21]:
eval_multi = MulticlassClassificationEvaluator(labelCol='white_result_vector')

In [22]:
eval_multi.evaluate(result, {eval_multi.metricName: "accuracy"})

0.6308771911614864

In [23]:
eval_multi.evaluate(result, {eval_multi.metricName: "f1"})

0.6307975287351031

In [24]:
eval_multi.evaluate(result, {eval_multi.metricName: "recallByLabel"})

0.6449819685713402

In [25]:
eval_multi.evaluate(result, {eval_multi.metricName: "precisionByLabel"})

0.6301648693993191

In [26]:
print('AUC:', model_auc)
#print('AUC:', BinaryClassificationMetrics(gbpredictions['label','prediction'].rdd).areaUnderROC)
#print('PR:', BinaryClassificationMetrics(gbpredictions['label','prediction'].rdd).areaUnderPR)

AUC: 0.6879095070531936


In [27]:
metrics_rdd = MulticlassMetrics(result.select(['prediction','white_result_vector']).rdd.map(tuple))
confusionmatrix = pd.DataFrame(metrics_rdd.confusionMatrix().toArray())

label = 1.0
best_metric = pd.DataFrame([    ['Accuracy',  metrics_rdd.accuracy],
                                ['Precision', metrics_rdd.precision(label)],
                                ['Recall',    metrics_rdd.recall(label)],
                                ['F1 Score',  metrics_rdd.fMeasure(label)]],
                              columns=['Metric', 'Measure'])

In [28]:
confusionmatrix

          0         1
0  375226.0  206536.0
1  220215.0  354145.0


In [29]:
best_metric

      Metric   Measure
0   Accuracy  0.630877
1  Precision  0.631634
2     Recall  0.616591
3   F1 Score  0.624022
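
The metrics above can be cross-checked by hand from the confusion matrix, whose rows are true labels and whose columns are predictions. The sketch below uses plain Python with hypothetical helper names. It also shows why the precisionByLabel and recallByLabel values from MulticlassClassificationEvaluator differ from this table: the evaluator's metricLabel defaults to 0.0, whereas this table was computed for label 1.0.

```python
# Confusion matrix copied from the notebook output:
# rows are true labels, columns are predicted labels.
cm = [[375226.0, 206536.0],
      [220215.0, 354145.0]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total


def precision(label):
    """Correct predictions of `label` over all predictions of `label`."""
    predicted = cm[0][label] + cm[1][label]
    return cm[label][label] / predicted


def recall(label):
    """Correct predictions of `label` over all true instances of `label`."""
    return cm[label][label] / sum(cm[label])


def f1(label):
    p, r = precision(label), recall(label)
    return 2 * p * r / (p + r)


# Weighted F1: per-label F1 weighted by each label's share of true instances.
weighted_f1 = sum(f1(k) * sum(cm[k]) / total for k in (0, 1))

print("accuracy:", accuracy)
print("label 0 precision/recall:", precision(0), recall(0))
print("label 1 precision/recall/F1:", precision(1), recall(1), f1(1))
print("weighted F1:", weighted_f1)
```

The label 0 values should agree with the evaluator outputs, the label 1 values with the table, and the weighted F1 with the evaluator's f1 metric.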


In [30]:
list(zip(model.avgMetrics, model.getEstimatorParamMaps()))

[(0.6882407011502037,
  {Param(parent='GBTClassifier_b3b65340241d', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.'): 5,
   Param(parent='GBTClassifier_b3b65340241d', name='minInstancesPerNode', doc='Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1.'): 2}),
 (0.688292526238363,
  {Param(parent='GBTClassifier_b3b65340241d', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.'): 8,
   Param(parent='GBTClassifier_b3b65340241d', name='minInstancesPerNode', doc='Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1.'): 2}),
 (0.68824
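
Since the output above is truncated, here is a small sketch of how such (metric, parameter map) pairs can be reduced to the best-scoring combination. It uses only the two fully visible entries, with plain dicts standing in for the Param keys, so the result is the best of those two entries and not necessarily of the full grid.

```python
# Only the two fully visible entries from the truncated output above;
# plain dicts are a stand-in for the Param-keyed maps.
visible_results = [
    (0.6882407011502037, {"maxDepth": 5, "minInstancesPerNode": 2}),
    (0.688292526238363, {"maxDepth": 8, "minInstancesPerNode": 2}),
]

# For areaUnderROC, larger is better, so take the maximum average metric.
best_score, best_params = max(visible_results, key=lambda pair: pair[0])
print(best_score, best_params)
```

The difference between these two entries is in the fourth decimal place, so for this pair the tree depth has very little effect on the cross-validated AUC.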

Here we see that our model seems to perform well, with roughly 50% of games predicted as a white win.

Let's review our classifications and confusion matrix next to determine the overall performance.

In [31]:
#designer_matrix = rf_result.select(['prediction','white_result_vector'])
#metrics_rdd = MulticlassMetrics(designer_matrix.rdd.map(tuple))
#print(metrics_rdd.confusionMatrix().toArray())

Reviewing this model, we see that it performs poorly.
Of the three classes, precision for white loss is acceptable, but precision for both white win and tie evaluates to 0.
This suggests that our model is overfitting to the loss class and fails to correctly identify either wins or ties.

Further tuning will be required to better balance our predictions (either by adjusting the threshold or by tuning the tree).
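
As a sketch of the threshold option, the snippet below mimics the rule Spark classifiers apply when a thresholds parameter is set: each class probability p is divided by its threshold t, and the class with the largest p/t is predicted. The helper predict_with_thresholds and the probabilities are hypothetical, and all thresholds are assumed to be strictly positive.

```python
def predict_with_thresholds(probability, thresholds):
    """Return the class index with the largest probability/threshold ratio."""
    scaled = [p / t for p, t in zip(probability, thresholds)]
    best_index = max(range(len(scaled)), key=lambda i: scaled[i])
    return float(best_index)


# Made-up class probabilities for one game: [P(label 0), P(label 1)].
toy_probability = [0.55, 0.45]

default_prediction = predict_with_thresholds(toy_probability, [0.5, 0.5])
shifted_prediction = predict_with_thresholds(toy_probability, [0.6, 0.4])
print(default_prediction, shifted_prediction)
```

Lowering the threshold for a class makes the model predict it more readily, which trades precision for recall on that class.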