# Data Preparation
## Stage 1

In this notebook, we load the source chess game data from Lichess, applying an explicit schema as we read it, and keep only the games whose event is Classical or Blitz.
We then write the filtered dataset to a Parquet file for use in later notebooks.
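
The filter used in this notebook is an exact match on the event name after surrounding whitespace has been trimmed. As a rough illustration of what that implies, here is a minimal Spark-free sketch using only the standard library. The sample rows, the reduced column set, and the helper name `keep_event` are all made up for illustration; nothing here is taken from the real file.

```python
import csv
import io

# Hypothetical sample rows (not from the real dataset); the real file has more columns.
SAMPLE_CSV = """Event,White,Black
 Classical ,alice,bob
 Blitz ,carol,dave
 Bullet ,erin,frank
 Blitz tournament ,grace,heidi
"""

KEPT_EVENTS = {"Classical", "Blitz"}


def keep_event(event: str) -> bool:
    """Exact match on the event name after trimming surrounding whitespace."""
    return event.strip() in KEPT_EVENTS


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
kept = [row for row in rows if keep_event(row["Event"])]
print([row["White"] for row in kept])
```

Because the comparison is exact, a value such as the hypothetical `Blitz tournament` is dropped rather than kept. If variants like that should be retained, a prefix or substring match would be needed instead.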

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType, DoubleType
import pyspark.sql.functions as F

In [2]:
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("fa21-ds5110-group10") \
    .getOrCreate()

In [3]:
# Cancel any jobs still scheduled or running from an earlier run of this notebook.
spark.sparkContext.cancelAllJobs()

In [5]:
chess_schema = StructType([StructField('event', StringType(), False), 
                           StructField('white', StringType(), False),
                           StructField('black', StringType(), False),
                           StructField('result', StringType(), False),
                           StructField('UTCDate', DateType(), False),
                           StructField('UTCTime', StringType(), False),
                           StructField('WhiteElo', IntegerType(), False),
                           StructField('BlackElo', IntegerType(), False),
                           StructField('WhiteRatingDiff', DoubleType(), False),
                           StructField('BlackRatingDiff', DoubleType(), False),
                           StructField('ECO', StringType(), False),
                           StructField('Opening', StringType(), False),
                           StructField('TimeControl', StringType(), False),
                           StructField('Termination', StringType(), False),
                           StructField('AN', StringType(), False)])


df = spark.read.csv(path="../../data/raw/chess_games.csv",
                    schema=chess_schema,
                    header=True,
                    ignoreLeadingWhiteSpace=True,
                    ignoreTrailingWhiteSpace=True,
                    dateFormat='yyyy.MM.dd')  # 'MM' is month; lowercase 'mm' would mean minutes

print(f'Original Dataset Length: {df.count()}')

df = df.filter(F.col('event').isin('Classical', 'Blitz'))
print(f'Filtered Dataset Length: {df.count()}')

df.write.mode("overwrite").parquet("../../data/processed/chess_games_blitz_classic.parquet")

Original Dataset Length: 6256184
Filtered Dataset Length: 3850385
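
The two counts printed above show how much of the source data the filter retains. A quick arithmetic check, using only those two numbers (the variable names are illustrative):

```python
# Row counts copied from the cell output above.
original_count = 6_256_184
filtered_count = 3_850_385

removed_count = original_count - filtered_count
kept_fraction = filtered_count / original_count

print(f"Removed: {removed_count}")
print(f"Kept: {kept_fraction:.1%}")
```

The filter removes 2,405,799 games and keeps roughly 61.5% of the original rows.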
