Basic imports ahead of time

In [1]:
import pandas as pd
import numpy as np
import urllib.request
import bz2
import re
import threading
import gc


Here we define some functions that parse a PGN file for different features. Each function appends its result to a mutable list passed in as an argument, so we can run them in separate threads and still collect their outputs.


In [2]:
def parse_events(f, out):
    # Event tags look like [Event "Rated Blitz game"]; keep the second word (the game type).
    events = np.array(
        [
            x.split(b'"')[1].split(b" ")[1].decode("UTF-8")
            for x in re.findall(rb"\[Event.*\]", f)
        ]
    )
    out.append(events)


def parse_results(f, out):
    # Keep the quoted value of each Result tag: "1-0", "0-1" or "1/2-1/2".
    results = np.array(
        [x.split(b'"')[1].decode("UTF-8") for x in re.findall(rb"\[Result.*\]", f)]
    )
    out.append(results)


def parse_white_ELO(f, out):
    # Ratings given as "?" (unknown) are stored as 0.
    whiteELOs = np.array(
        [
            int(x.split(b'"')[1]) if x.split(b'"')[1] != b"?" else 0
            for x in re.findall(rb"\[WhiteElo.*\]", f)
        ]
    )
    out.append(whiteELOs)


def parse_black_ELO(f, out):
    # Ratings given as "?" (unknown) are stored as 0.
    blackELOs = np.array(
        [
            int(x.split(b'"')[1]) if x.split(b'"')[1] != b"?" else 0
            for x in re.findall(rb"\[BlackElo.*\]", f)
        ]
    )
    out.append(blackELOs)


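def parse_move_nums(f, out):
    # For each game, match from the end of the tag block through the result token,
    # then take the last "N." move number in that text (0 if there are no moves).
    moveNums = np.array(
        [
            (
                0
                if len(re.findall(rb"\d+\.", x[0])) == 0
                else int(re.findall(rb"\d+\.", x[0])[-1][:-1])
            )
            for x in re.findall(rb"(\]\n\n.*?(0-1|1-0|1/2-1/2))", f)
        ]
    )
    out.append(moveNums)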
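
To make the pattern above concrete, here is a minimal sketch that runs two simplified parsers on a tiny, made-up PGN sample. The sample text and the `_demo` function names are assumptions for illustration only; the regular expressions follow the same idea as the functions above.

```python
import re
import threading

import numpy as np

# Hypothetical two-game sample, laid out like a PGN export (tags, blank line, moves).
sample_pgn = (
    b'[Event "Rated Blitz game"]\n'
    b'[Result "1-0"]\n'
    b'[WhiteElo "1500"]\n'
    b'[BlackElo "?"]\n'
    b"\n"
    b"1. e4 e5 2. Qh5 Ke7 3. Qxe5# 1-0\n"
    b"\n"
    b'[Event "Rated Classical game"]\n'
    b'[Result "0-1"]\n'
    b'[WhiteElo "1286"]\n'
    b'[BlackElo "1580"]\n'
    b"\n"
    b"1. f3 e5 2. g4 Qh4# 0-1\n"
)


def parse_results_demo(f, out):
    # Take the text between the first pair of quotes in each Result tag.
    tags = re.findall(rb"\[Result.*\]", f)
    out.append(np.array([t.split(b'"')[1].decode("UTF-8") for t in tags]))


def parse_move_nums_demo(f, out):
    # Match from the closing "]" of the last tag through the result token,
    # then keep the last "N." move number found in that span.
    counts = []
    for span, _result in re.findall(rb"(\]\n\n.*?(0-1|1-0|1/2-1/2))", f):
        numbers = re.findall(rb"\d+\.", span)
        counts.append(int(numbers[-1][:-1]) if numbers else 0)
    out.append(np.array(counts))


# Each thread appends its array to its own list, which we read after join().
results_out, moves_out = [], []
threads = [
    threading.Thread(target=parse_results_demo, args=(sample_pgn, results_out)),
    threading.Thread(target=parse_move_nums_demo, args=(sample_pgn, moves_out)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results_out[0], moves_out[0])
```

Because a `Thread` does not hand back its target's return value, appending to a list that the caller owns is a simple way to get the result out.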


Here we'll grab the file directly from the Lichess database and decompress it into a byte string to use later. We also explicitly delete the decompression file object and trigger garbage collection, just in case Python decides not to collect it when we want it to.

In [3]:
with urllib.request.urlopen(
    "https://database.lichess.org/standard/lichess_db_standard_rated_2015-12.pgn.bz2"
) as f:
    decompressed = bz2.BZ2File(f, "r")
    bString = decompressed.read()
    del decompressed
    gc.collect()
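
If holding the whole decompressed file in memory ever becomes a problem, one alternative (not what this notebook does) is to iterate over the `BZ2File` line by line. The sketch below assumes a small in-memory archive built with `bz2.compress` as a stand-in for the real download, so it runs without network access.

```python
import bz2
import io

# Hypothetical stand-in for the downloaded archive: a tiny PGN fragment,
# repeated and compressed in memory.
raw = b'[Event "Rated Blitz game"]\n[Result "1-0"]\n\n1. e4 e5 1-0\n\n' * 1000
archive = io.BytesIO(bz2.compress(raw))

total_bytes = 0
event_tags = 0
with bz2.BZ2File(archive, "r") as stream:
    # BZ2File yields decompressed lines one at a time when iterated.
    for line in stream:
        total_bytes += len(line)
        if line.startswith(b"[Event "):
            event_tags += 1

print(total_bytes, event_tags)
```

The trade-off is that per-line processing needs parsers written for a stream, whereas the `re.findall` calls above rely on having the full byte string available.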


Once we have the byte string, we run all our parsing functions in separate threads and send the outputs to some basic mutable lists. Once that is done, we replace the categorical values with integer codes for nicer storage and compression.

In [4]:
events = []
results = []
whiteELOs = []
blackELOs = []
moveNums = []

events_thread = threading.Thread(
    target=parse_events,
    args=(
        bString,
        events,
    ),
)
results_thread = threading.Thread(
    target=parse_results,
    args=(
        bString,
        results,
    ),
)
whiteELOs_thread = threading.Thread(
    target=parse_white_ELO,
    args=(
        bString,
        whiteELOs,
    ),
)
blackELOs_thread = threading.Thread(
    target=parse_black_ELO,
    args=(
        bString,
        blackELOs,
    ),
)
moveNums_thread = threading.Thread(
    target=parse_move_nums,
    args=(
        bString,
        moveNums,
    ),
)

events_thread.start()
results_thread.start()
whiteELOs_thread.start()
blackELOs_thread.start()
moveNums_thread.start()

events_thread.join()
results_thread.join()
whiteELOs_thread.join()
blackELOs_thread.join()
moveNums_thread.join()


events = events[0]
events[events == 'Blitz'] = 0
events[events == 'Bullet'] = 1
events[events == 'Classical'] = 2
events[events == 'Correspondence'] = 3

results = results[0]
results[results == '1-0'] = 0
results[results == '0-1'] = 1
results[results == '1/2-1/2'] = 2

whiteELOs = whiteELOs[0]
blackELOs = blackELOs[0]
moveNums = moveNums[0]
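
One detail worth noting about the masked assignments above: `events` and `results` are string arrays, so the codes are stored as the strings `'0'`, `'1'`, and so on until they are cast to integers later. The sketch below shows that behavior on made-up data, along with a dictionary lookup as one possible alternative that yields an integer array directly. The names `events_demo` and `event_codes` are hypothetical.

```python
import numpy as np

# Hypothetical sample of parsed event names.
events_demo = np.array(["Blitz", "Classical", "Bullet", "Blitz", "Correspondence"])

# Masked assignment into a string array keeps the string dtype:
# the integer 0 is stored as the string "0".
as_strings = events_demo.copy()
as_strings[as_strings == "Blitz"] = 0

# Alternative sketch: look each label up in a dict and build a uint8 array.
event_codes = {"Blitz": 0, "Bullet": 1, "Classical": 2, "Correspondence": 3}
encoded = np.array([event_codes[e] for e in events_demo], dtype=np.uint8)

print(as_strings.dtype.kind, encoded.dtype, encoded.tolist())
```

With the dictionary approach, an unexpected event name raises a `KeyError` instead of silently staying in the array as text.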


Once we have all our data in arrays, we stack them together and build a dataframe from them. Data types were chosen to be as small as possible to keep the stored data compact. We also drop any games with 0 or 1 moves, as these won't assist our analysis later, and reset the index so it no longer carries the gaps left by the dropped rows. Lastly, we save the dataframe to a .pickle file, which stores Python objects directly and efficiently.

In [15]:
all_data = np.vstack((events, results, whiteELOs, blackELOs, moveNums)).T
df = pd.DataFrame(
    all_data,
    columns=[
        "Game Type",
        "Result",
        "White ELO",
        "Black ELO",
        "Moves",
    ],
)
df = df.astype(
    {
        "Game Type": "uint8",
        "Result": "uint8",
        "White ELO": "uint16",
        "Black ELO": "uint16",
        "Moves": "uint8",
    },
)
# drop 0 or 1 move games ahead of time, these won't help in any analysis
df.drop(df.loc[df["Moves"] <= 1].index, inplace=True)
df.reset_index(drop=True, inplace=True)
df.to_pickle("./Chess_Data.pickle")
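
As a small sanity check on the steps above, the sketch below builds a made-up three-game frame with the same columns and dtypes, filters out short games with a boolean mask (an equivalent alternative to `drop`), and round-trips the result through a pickle in a temporary directory. The data and variable names are hypothetical. It also prints the largest value each dtype can hold: `uint8` tops out at 255, so a game longer than 255 moves would not fit in the `Moves` column as typed here.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical three-game frame using the same columns and dtypes as above.
demo = pd.DataFrame(
    {
        "Game Type": [2, 0, 1],
        "Result": [1, 0, 2],
        "White ELO": [1286, 1500, 1710],
        "Black ELO": [1580, 0, 1695],
        "Moves": [35, 1, 0],
    }
).astype(
    {
        "Game Type": "uint8",
        "Result": "uint8",
        "White ELO": "uint16",
        "Black ELO": "uint16",
        "Moves": "uint8",
    }
)

# Largest values the chosen dtypes can hold.
max_uint8 = np.iinfo(np.uint8).max
max_uint16 = np.iinfo(np.uint16).max

# Boolean-mask filtering: keep only games with more than one move.
kept = demo[demo["Moves"] > 1].reset_index(drop=True)

# Round-trip through a pickle file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.pickle")
    kept.to_pickle(path)
    restored = pd.read_pickle(path)

print(max_uint8, max_uint16)
print(restored.dtypes)
```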


Let's check to make sure everything looks right, and see how much memory we use.

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4058901 entries, 0 to 4058900
Data columns (total 5 columns):
 #   Column     Dtype 
---  ------     ----- 
 0   Game Type  uint8 
 1   Result     uint8 
 2   White ELO  uint16
 3   Black ELO  uint16
 4   Moves      uint8 
dtypes: uint16(2), uint8(3)
memory usage: 27.1 MB


Over 4 million games in just 27.1 MB of memory?! That's a really compact representation, and it means we could probably handle even more data in a redo of this project.
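
The 27.1 MB figure can be checked by hand from the dtypes: three one-byte columns plus two two-byte columns give 7 bytes per game. The short calculation below assumes that pandas reports memory in units of 1024² bytes and ignores the small fixed cost of the index.

```python
rows = 4_058_901               # number of entries reported by df.info()
bytes_per_row = 3 * 1 + 2 * 2  # three uint8 columns + two uint16 columns
total_bytes = rows * bytes_per_row
mib = total_bytes / 1024**2

print(total_bytes, round(mib, 1))
```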

One last thing: let's make sure the values in our dataframe match what we see in the PGN file itself. The first two games were abandoned before any real moves were made, so we'll look at the third game in the PGN and compare it to the first game in our dataframe.

In [17]:
print(bString[757:1528].decode("UTF-8"))
print("-" * 130)
print(df.loc[0])


[Event "Rated Classical game"]
[Site "https://lichess.org/HTPZ0iUA"]
[White "peymit"]
[Black "fajolo"]
[Result "0-1"]
[UTCDate "2015.11.30"]
[UTCTime "23:00:01"]
[WhiteElo "1286"]
[BlackElo "1580"]
[WhiteRatingDiff "-5"]
[BlackRatingDiff "+4"]
[ECO "C20"]
[Opening "King's Pawn Game: Leonardis Variation"]
[TimeControl "600+0"]
[Termination "Normal"]

1. e4 e5 2. d3 b6 3. Nf3 Nc6 4. a3 Bb7 5. c3 Nf6 6. Be2 Be7 7. Nbd2 h6 8. Nc4 d6 9. a4 a5 10. O-O O-O 11. Nh4 Nxe4 12. dxe4 Bxh4 13. Ne3 Bg5 14. Nf5 Bxc1 15. Rxc1 Ne7 16. f3 Nxf5 17. exf5 Qg5 18. Bd3 h5 19. Qd2 Qxd2 20. Rcd1 Qg5 21. g3 h4 22. g4 Qf4 23. f6 gxf6 24. c4 Bxf3 25. Be2 Qxg4+ 26. Kf2 Qf4 27. Bxf3 e4 28. Rg1+ Kh8 29. Rg4 e3+ 30. Ke2 Qxh2+ 31. Kxe3 Rae8+ 32. Kd3 Rg8 33. Rh1 Qxh1 34. Bxh1 Rxg4 35. Be4 h3 0-1
----------------------------------------------------------------------------------------------------------------------------------
Game Type       2
Result          1
White ELO    1286
Black ELO    1580
Moves          35
Name: 0
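
To read the comparison above without looking up the codes by hand, the integer codes can be mapped back to their labels. The sketch below copies the printed row into a plain dictionary and inverts the encoding used earlier; the names `event_labels`, `result_labels`, and `row` are hypothetical helpers, not part of the notebook.

```python
# Inverse of the integer encoding applied to the events and results arrays.
event_labels = {0: "Blitz", 1: "Bullet", 2: "Classical", 3: "Correspondence"}
result_labels = {0: "1-0", 1: "0-1", 2: "1/2-1/2"}

# Values copied from the printed first row of the dataframe.
row = {"Game Type": 2, "Result": 1, "White ELO": 1286, "Black ELO": 1580, "Moves": 35}

decoded = dict(row)
decoded["Game Type"] = event_labels[row["Game Type"]]
decoded["Result"] = result_labels[row["Result"]]

print(decoded)
```

Decoded this way, the row reads as a Classical game won by Black (0-1) with ratings 1286 and 1580 and 35 moves, which matches the PGN text printed above.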