<a href="https://colab.research.google.com/github/robertrose85/WebMining/blob/main/Chess_Module1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Within the last few months, I have gotten back into chess using the [chess.com](https://www.chess.com) app on my phone. It's been quite a struggle to really understand the game. Sure, many people (myself included) know how the pieces move: the rook travels any distance along ranks and files, the bishop does the same along diagonals, and the knight makes its tricky L-shaped jump. What I was not aware of is that color, openings, tactics, and a general understanding of the strong points on the board all confer certain advantages.

While the chess.com application certainly provides its fair share of in-depth analysis, I thought it would be fun to work through their API and analyze the data myself. This has become a fun new challenge where I can sharpen my Python skills, get some real practice with Pandas, and maybe even learn a thing or two about my chess game.

Since this is my first crack at this API, I decided to keep it simple: which color do I seem to win more with? If you don't know, in chess, white moves first. From my understanding, this can provide a slight advantage in steering the direction of the game, so white can get the upper hand early.

So my first step was to figure out how to extract this data. Thankfully, chess.com provides a public API (https://www.chess.com/news/view/published-data-api) that includes a wealth of information for any player you choose. Naturally, I chose myself. To make things a bit simpler, I used the chessdotcom wrapper (https://chesscom.readthedocs.io/en/latest/).

In [5]:
!pip install chess.com

Collecting chess.com
  Downloading https://files.pythonhosted.org/packages/c9/3d/f4d8ed3cec66329af23e2d3e536c1f5d31500895b257d812295aedc546c3/chess.com-1.3.1-py3-none-any.whl
Collecting certifi==2020.4.5.1
  Downloading https://files.pythonhosted.org/packages/57/2b/26e37a4b034800c960a00c4e1b3d9ca5d7014e983e6e729e33ea2f36426c/certifi-2020.4.5.1-py2.py3-none-any.whl (157kB)
     |████████████████████████████████| 163kB 4.2MB/s 
Collecting urllib3==1.25.9
  Downloading https://files.pythonhosted.org/packages/e1/e5/df302e8017440f111c11cc41a6b432838672f5a70aa29227bf58149dc72f/urllib3-1.25.9-py2.py3-none-any.whl (126kB)
     |████████████████████████████████| 133kB 6.3MB/s 
ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.
Installing collected packages: certifi, urllib3, chess.com
  Found existing installation: certifi 2020.12.5
    Uninstalling certifi-2020.12.5:
      Successfully uninstalled 

To help me sift through the API outputs, I need to use a couple of libraries:

*   Pandas is extremely helpful for normalizing, cleansing, and displaying data in a dataframe.
*   Chessdotcom is the wrapper for the chess.com API



In [6]:
import pandas as pd
import numpy as np
import chessdotcom as cdc
#import requests
#import json

In [7]:
#Test
#leaders = cdc.caller.get_leaderboards()
#print(leaders.json)

For player selection, I decided to focus on one player: me. After all, I want to see which side of the board might be my weakest. For reusability, though, I put the name in a list, so that in the future I can swap out usernames or add several of them to analyze.

In [8]:
players = ['NJsix']

My first goal was to determine whether or not the API responded. Success.

In [9]:
for player in players:
  data = cdc.get_player_profile(player)
  print(data.json)


{'player_id': 68973280, '@id': 'https://api.chess.com/pub/player/njsix', 'url': 'https://www.chess.com/member/NJsix', 'username': 'njsix', 'followers': 3, 'country': 'https://api.chess.com/pub/country/US', 'last_online': 1612715587, 'joined': 1575669678, 'status': 'premium', 'is_streamer': False}
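
The `joined` and `last_online` values in this output look like Unix timestamps, meaning seconds since 1970 in UTC. That reading is an assumption based on their size, not something the output itself states. Under that assumption, a minimal sketch for turning them into readable dates (the helper name `to_utc` is made up for illustration):

```python
from datetime import datetime, timezone

def to_utc(ts):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)

# Values copied from the profile output above
profile = {'last_online': 1612715587, 'joined': 1575669678}

for field in ('joined', 'last_online'):
    print(field, to_utc(profile[field]).isoformat())
```

Passing `tz=timezone.utc` keeps the result independent of whatever timezone the notebook happens to run in.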


Now I wanted to poke around and see what was available to me. Looking through the documentation, my initial thought was, "Stats should have what I need!" To test this hypothesis, I used the get_player_stats() endpoint. While helpful for some high-level stats (rating, win/loss record), it wasn't what I needed. However, it did remind me that the game type I play the most is Rapid. Rapid is chess.com's label for its longer live games; in the ones I play, each side gets 10 minutes to make all of their moves. Run out of time first? You lose.

In [10]:
for player in players:
  stats = cdc.caller.get_player_stats(player)
  print(stats.json)

{'chess_daily': {'last': {'rating': 662, 'date': 1612631788, 'rd': 186}, 'best': {'rating': 1000, 'date': 1575669678, 'game': 'https://www.chess.com/game/daily/308439684'}, 'record': {'win': 1, 'loss': 4, 'draw': 0, 'time_per_move': 2165, 'timeout_percent': 0}}, 'chess_rapid': {'last': {'rating': 848, 'date': 1611366931, 'rd': 42}, 'best': {'rating': 927, 'date': 1607267463, 'game': 'https://www.chess.com/live/game/6317867810'}, 'record': {'win': 65, 'loss': 53, 'draw': 3}}, 'chess_blitz': {'last': {'rating': 731, 'date': 1609032345, 'rd': 103}, 'best': {'rating': 855, 'date': 1575671375, 'game': 'https://www.chess.com/live/game/4425290995'}, 'record': {'win': 151, 'loss': 143, 'draw': 14}}, 'fide': 0, 'tactics': {'highest': {'rating': 1208, 'date': 1609125817}, 'lowest': {'rating': 374, 'date': 1575856724}}, 'lessons': {}, 'puzzle_rush': {'best': {'total_attempts': 13, 'score': 11}}}
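
Even though stats isn't split by color, the `record` block is enough for an overall win rate. Here is a small sketch using the rapid numbers copied from the output above; `record_summary` is a hypothetical helper, not part of the wrapper.

```python
# Rapid record copied from the stats output above
rapid_record = {'win': 65, 'loss': 53, 'draw': 3}

def record_summary(record):
    """Return total games and the share of wins, losses, and draws."""
    total = record['win'] + record['loss'] + record['draw']
    return {
        'games': total,
        'win_pct': record['win'] / total,
        'loss_pct': record['loss'] / total,
        'draw_pct': record['draw'] / total,
    }

summary = record_summary(rapid_record)
print(summary['games'], round(summary['win_pct'], 3))
```

This covers every rapid game on the account, not just one month, so it won't necessarily match a per-month breakdown.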


Since stats didn't give me the detail I needed, I figured the game data certainly would. To view my game data I needed a specific set of parameters: name, year, and month. Because I wasn't as active in January, I decided to go back to December, when I played 90+ games. That should be a large enough data set to begin telling me what I need to hear. I added .json at the end of the call so I can read the output.

In [11]:
for player in players: 
  games = cdc.caller.get_player_games_by_month(player, '2020', '12').json
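
For reference, the `@id` values in the profile output suggest the API lives under `https://api.chess.com/pub/player/`, and as far as I can tell from the public API documentation, the monthly archive follows the pattern `.../{username}/games/{YYYY}/{MM}`. The sketch below only builds that URL string (it sends no request), and `monthly_archive_url` is a hypothetical helper name.

```python
BASE_URL = 'https://api.chess.com/pub/player'

def monthly_archive_url(username, year, month):
    """Build the monthly games URL, lower-casing the name and zero-padding the month."""
    return f'{BASE_URL}/{username.lower()}/games/{int(year):04d}/{int(month):02d}'

print(monthly_archive_url('NJsix', 2020, 12))
```

Accepting either strings or integers for the year and month means a call like `monthly_archive_url('NJsix', '2021', '1')` still produces a two-digit month.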

Awesome. So now I have the full JSON output of all of my games from December 2020. Looking at the structure, I get a lot of great data points. It's given me a few ideas for future experiments, data analysis or otherwise, but for now I wanted to stick to the plan. Scrolling through the structure, I see what I need: each game has a dictionary for each color that gives the rating, the result, and the username. This is what I need to start my analysis.

In [12]:
games

{'games': [{'black': {'@id': 'https://api.chess.com/pub/player/letmedraws',
    'rating': 1006,
    'result': 'win',
    'username': 'Letmedraws'},
   'end_time': 1607267138,
   'fen': 'r1b1k3/ppp2pp1/4pn2/6q1/8/2N3n1/PP2K1P1/4R3 w q -',
   'pgn': '[Event "Live Chess"]\n[Site "Chess.com"]\n[Date "2020.12.06"]\n[Round "-"]\n[White "NJsix"]\n[Black "Letmedraws"]\n[Result "0-1"]\n[CurrentPosition "r1b1k3/ppp2pp1/4pn2/6q1/8/2N3n1/PP2K1P1/4R3 w q -"]\n[Timezone "UTC"]\n[ECO "B07"]\n[ECOUrl "https://www.chess.com/openings/Pirc-Defense-2.d4"]\n[UTCDate "2020.12.06"]\n[UTCTime "14:54:49"]\n[WhiteElo "824"]\n[BlackElo "1006"]\n[TimeControl "600"]\n[Termination "Letmedraws won by resignation"]\n[StartTime "14:54:49"]\n[EndDate "2020.12.06"]\n[EndTime "15:05:38"]\n[Link "https://www.chess.com/live/game/5900936890"]\n\n1. e4 {[%clk 0:09:56.5]} 1... d6 {[%clk 0:09:59.9]} 2. d4 {[%clk 0:09:53.9]} 2... d5 {[%clk 0:09:56.2]} 3. f3 {[%clk 0:09:38.9]} 3... dxe4 {[%clk 0:09:51.2]} 4. fxe4 {[%clk 0:09:35.

In [13]:
#df = pd.DataFrame(games['games'])
#df

I figured a good place to start was to make the data a bit more readable, and putting the output in a dataframe would certainly do that. In my first pass, I was getting dataframes with nested elements in particular columns; in this case, I was primarily concerned with the per-color data. What I needed to do was normalize the JSON so that the nested fields get their own columns and I could continue to work with them.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html


In [18]:
df = pd.json_normalize(games['games'], max_level=2)
df

Unnamed: 0,url,pgn,time_control,end_time,rated,fen,time_class,rules,white.rating,white.result,white.@id,white.username,black.rating,black.result,black.@id,black.username
0,https://www.chess.com/live/game/5900936890,"[Event ""Live Chess""]\n[Site ""Chess.com""]\n[Dat...",600,1607267138,True,r1b1k3/ppp2pp1/4pn2/6q1/8/2N3n1/PP2K1P1/4R3 w q -,rapid,chess,824,resigned,https://api.chess.com/pub/player/njsix,NJsix,1006,win,https://api.chess.com/pub/player/letmedraws,Letmedraws
1,https://www.chess.com/live/game/5901009734,"[Event ""Live Chess""]\n[Site ""Chess.com""]\n[Dat...",600,1607267463,True,r3k2r/ppp2pp1/2q1bn2/4p1b1/4P1p1/2PP4/P1P4P/1K...,rapid,chess,776,resigned,https://api.chess.com/pub/player/taliabali,TaliaBali,927,win,https://api.chess.com/pub/player/njsix,NJsix
2,https://www.chess.com/live/game/5901075089,"[Event ""Live Chess""]\n[Site ""Chess.com""]\n[Dat...",600,1607268734,True,3rk2r/4Q1p1/3PR3/1N5p/2P2p2/8/P4PPP/R5K1 b - -,rapid,chess,971,win,https://api.chess.com/pub/player/saamsa535,saamsa535,848,checkmated,https://api.chess.com/pub/player/njsix,NJsix
3,https://www.chess.com/live/game/5901224399,"[Event ""Live Chess""]\n[Site ""Chess.com""]\n[Dat...",600,1607269977,True,r3r3/2pk2p1/1p5p/p3pQ2/2P1R2N/P7/6PP/5RK1 b - -,rapid,chess,842,win,https://api.chess.com/pub/player/asmf53,asmf53,775,timeout,https://api.chess.com/pub/player/njsix,NJsix
4,https://www.chess.com/live/game/5901334269,"[Event ""Live Chess""]\n[Site ""Chess.com""]\n[Dat...",600,1607270636,True,r1b2rk1/ppB2p1p/2n1p3/6p1/3P4/P1P1p3/2P1BR2/R2...,rapid,chess,717,checkmated,https://api.chess.com/pub/player/njsix,NJsix,783,win,https://api.chess.com/pub/player/hiptang,HipTang
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93,https://www.chess.com/live/game/6098699705,"[Event ""Live Chess""]\n[Site ""Chess.com""]\n[Dat...",600,1609346164,True,r5k1/p1p1Q2p/bp4p1/5p2/5N2/2P2P2/PP4PP/4R1K1 w...,rapid,chess,760,timeout,https://api.chess.com/pub/player/ajk170,ajk170,759,win,https://api.chess.com/pub/player/njsix,NJsix
94,https://www.chess.com/live/game/6098847335,"[Event ""Live Chess""]\n[Site ""Chess.com""]\n[Dat...",600,1609347650,True,7k/pp5b/3p1p2/1B1PpP2/4P3/2K5/PP5p/7N b - -,rapid,chess,766,win,https://api.chess.com/pub/player/njsix,NJsix,723,timeout,https://api.chess.com/pub/player/deftunk,Deftunk
95,https://www.chess.com/live/game/6099080306,"[Event ""Live Chess""]\n[Site ""Chess.com""]\n[Dat...",600,1609349002,True,3r1rk1/pp3pQp/3n1n2/4pN2/8/8/P5PP/5R1K b - -,rapid,chess,814,win,https://api.chess.com/pub/player/kerfuffl3,Kerfuffl3,759,checkmated,https://api.chess.com/pub/player/njsix,NJsix
96,https://www.chess.com/live/game/6107363118,"[Event ""Live Chess""]\n[Site ""Chess.com""]\n[Dat...",600,1609430876,True,Q3k2r/3n1ppp/4b3/p3p3/1B1pP3/1B6/PPP2PPP/R3K2R...,rapid,chess,768,win,https://api.chess.com/pub/player/njsix,NJsix,790,resigned,https://api.chess.com/pub/player/shanlirichez,shanlirichez
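
To make the difference concrete, here is a small self-contained sketch comparing a plain `pd.DataFrame` with `pd.json_normalize`. The two records are trimmed copies of games from the output above, with most fields left out.

```python
import pandas as pd

# Two trimmed records shaped like entries of games['games'] (most fields omitted)
sample_games = [
    {'url': 'https://www.chess.com/live/game/5900936890',
     'white': {'rating': 824, 'result': 'resigned', 'username': 'NJsix'},
     'black': {'rating': 1006, 'result': 'win', 'username': 'Letmedraws'}},
    {'url': 'https://www.chess.com/live/game/5901009734',
     'white': {'rating': 776, 'result': 'resigned', 'username': 'TaliaBali'},
     'black': {'rating': 927, 'result': 'win', 'username': 'NJsix'}},
]

nested = pd.DataFrame(sample_games)     # 'white' and 'black' columns hold dicts
flat = pd.json_normalize(sample_games)  # nested keys become 'white.rating', etc.

print(list(nested.columns))
print(list(flat.columns))
```

In the nested version each `white` cell is a whole dictionary, which is awkward to filter on. In the flattened version every nested key is an ordinary column.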


At this point I can see everything I need parsed out the way I need it. But it's too much. So I went ahead and dropped some of the unnecessary columns.

In [19]:
df = df.drop(['pgn', 'time_control', 'end_time', 'rated','white.@id','black.@id','rules'], axis=1)
print(df.columns)

Index(['url', 'fen', 'time_class', 'white.rating', 'white.result',
       'white.username', 'black.rating', 'black.result', 'black.username'],
      dtype='object')


In [20]:
df

Unnamed: 0,url,fen,time_class,white.rating,white.result,white.username,black.rating,black.result,black.username
0,https://www.chess.com/live/game/5900936890,r1b1k3/ppp2pp1/4pn2/6q1/8/2N3n1/PP2K1P1/4R3 w q -,rapid,824,resigned,NJsix,1006,win,Letmedraws
1,https://www.chess.com/live/game/5901009734,r3k2r/ppp2pp1/2q1bn2/4p1b1/4P1p1/2PP4/P1P4P/1K...,rapid,776,resigned,TaliaBali,927,win,NJsix
2,https://www.chess.com/live/game/5901075089,3rk2r/4Q1p1/3PR3/1N5p/2P2p2/8/P4PPP/R5K1 b - -,rapid,971,win,saamsa535,848,checkmated,NJsix
3,https://www.chess.com/live/game/5901224399,r3r3/2pk2p1/1p5p/p3pQ2/2P1R2N/P7/6PP/5RK1 b - -,rapid,842,win,asmf53,775,timeout,NJsix
4,https://www.chess.com/live/game/5901334269,r1b2rk1/ppB2p1p/2n1p3/6p1/3P4/P1P1p3/2P1BR2/R2...,rapid,717,checkmated,NJsix,783,win,HipTang
...,...,...,...,...,...,...,...,...,...
93,https://www.chess.com/live/game/6098699705,r5k1/p1p1Q2p/bp4p1/5p2/5N2/2P2P2/PP4PP/4R1K1 w...,rapid,760,timeout,ajk170,759,win,NJsix
94,https://www.chess.com/live/game/6098847335,7k/pp5b/3p1p2/1B1PpP2/4P3/2K5/PP5p/7N b - -,rapid,766,win,NJsix,723,timeout,Deftunk
95,https://www.chess.com/live/game/6099080306,3r1rk1/pp3pQp/3n1n2/4pN2/8/8/P5PP/5R1K b - -,rapid,814,win,Kerfuffl3,759,checkmated,NJsix
96,https://www.chess.com/live/game/6107363118,Q3k2r/3n1ppp/4b3/p3p3/1B1pP3/1B6/PPP2PPP/R3K2R...,rapid,768,win,NJsix,790,resigned,shanlirichez


But now I want to focus on each color. So I'm going to create two variables I can work with, one for white data and one for black data. 

In [34]:
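# `player` still holds the last name from the earlier loop over `players`
white = df.loc[df['white.username'] == player]
black = df.loc[df['black.username'] == player]

# Only the last expression in a cell is displayed, so just `black` shows below
white
black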

Unnamed: 0,url,fen,time_class,white.rating,white.result,white.username,black.rating,black.result,black.username
1,https://www.chess.com/live/game/5901009734,r3k2r/ppp2pp1/2q1bn2/4p1b1/4P1p1/2PP4/P1P4P/1K...,rapid,776,resigned,TaliaBali,927,win,NJsix
2,https://www.chess.com/live/game/5901075089,3rk2r/4Q1p1/3PR3/1N5p/2P2p2/8/P4PPP/R5K1 b - -,rapid,971,win,saamsa535,848,checkmated,NJsix
3,https://www.chess.com/live/game/5901224399,r3r3/2pk2p1/1p5p/p3pQ2/2P1R2N/P7/6PP/5RK1 b - -,rapid,842,win,asmf53,775,timeout,NJsix
5,https://www.chess.com/live/game/5901413247,2k1r3/2Q2ppp/8/1Nn5/8/2P5/PP3PPP/R5K1 b - -,rapid,749,win,Vvull,671,checkmated,NJsix
8,https://www.chess.com/live/game/5904469955,r1k1r3/ppp1Q1p1/n5q1/3P2B1/8/2P5/PP3PPP/RN2R1K...,rapid,766,win,Valcecc,717,resigned,NJsix
9,https://www.chess.com/live/game/5904674893,r3k2r/pbq2ppp/1p2pn2/8/2P1BP2/8/PP2QP1P/3RK2R ...,rapid,613,resigned,livrailton99,740,win,NJsix
10,https://www.chess.com/live/game/5905247069,8/5kN1/5P2/8/8/6P1/R7/6K1 b - -,rapid,878,win,Iscariot-J,721,resigned,NJsix
13,https://www.chess.com/live/game/5913504369,rnb1k2r/pp3ppp/4p3/2bP4/8/2N5/PPP2qPP/R2QKBNR ...,rapid,753,checkmated,Bcfeld216,740,win,NJsix
14,https://www.chess.com/live/game/5941515689,r6k/5p1p/p4N2/8/1PpP2rP/2P2K2/7P/R7 b - -,rapid,720,win,chessterlikescheetos,715,timeout,NJsix
16,https://www.chess.com/live/game/5941754849,r2qkbr1/p4Q2/1p1p3p/2p2P2/2B5/8/PPP3PP/R3K2R b...,rapid,740,win,viguss,717,checkmated,NJsix
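
An alternative to two separate variables is a single frame with one row per game and columns for my color and my result. The sketch below uses a few rows copied from the dataframe above and assumes every row involves the player. It compares usernames case-insensitively because the profile output spells the name `njsix` while the game data uses `NJsix`. The column names `my_color` and `my_result` are made up for illustration.

```python
import numpy as np
import pandas as pd

# A few rows copied from the dataframe above (only the columns used here)
df_sample = pd.DataFrame({
    'white.username': ['NJsix', 'TaliaBali', 'saamsa535'],
    'white.result':   ['resigned', 'resigned', 'win'],
    'black.username': ['Letmedraws', 'NJsix', 'NJsix'],
    'black.result':   ['win', 'win', 'checkmated'],
})

me = 'njsix'  # compared case-insensitively

# True where I had the white pieces; assumes I played in every row
played_white = df_sample['white.username'].str.lower() == me
mine = pd.DataFrame({
    'my_color':  np.where(played_white, 'white', 'black'),
    'my_result': np.where(played_white, df_sample['white.result'], df_sample['black.result']),
})
print(mine)
```

With this shape, per-color questions become a `groupby('my_color')` rather than two parallel code paths.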


Looking at the output, I noticed that the result is fairly simple on the winning side: a 'win' is a win. A loss, however, can fall into any number of categories (resigned, checkmated, timeout), and there are also draws (repetition, timevsinsufficient). I think this could be interesting, so I want to understand a little bit about what the categories are in case I want to dive deeper. I also want a count of how many games I've played with each color. Because I want to use this later, I'll save each count to a variable. It looks like I've played almost an equal number of white and black games, just a few games apart.

In [38]:
print(white['white.result'].unique())
print(black['black.result'].unique())

whiteGamesCount = len(white.index)
blackGamesCount = len(black.index)
print(whiteGamesCount)
print(blackGamesCount)

['resigned' 'checkmated' 'win' 'timeout' 'repetition']
['win' 'checkmated' 'timeout' 'resigned' 'timevsinsufficient']
47
51
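
If I do dive deeper, one way to tame these categories is to collapse them into win, loss, and draw. The mapping below is a sketch that covers only the codes printed above; the API may return other codes, and those fall through to `'unknown'` rather than being guessed at. `OUTCOME` and `to_outcome` are hypothetical names.

```python
import pandas as pd

# Only the result codes that appear in the output above
OUTCOME = {
    'win': 'win',
    'resigned': 'loss',
    'checkmated': 'loss',
    'timeout': 'loss',
    'repetition': 'draw',
    'timevsinsufficient': 'draw',
}

def to_outcome(results):
    """Map raw result codes to win / loss / draw; unmapped codes become 'unknown'."""
    return results.map(OUTCOME).fillna('unknown')

# 'not_in_mapping' is a placeholder for any code outside the mapping
codes = pd.Series(['win', 'resigned', 'timeout', 'repetition', 'not_in_mapping'])
print(to_outcome(codes).value_counts())
```

Applied to the real data, this would look like `to_outcome(white['white.result'])`.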


To do some top-level analysis, I wanted to know one thing: with which starting color do I win a greater percentage of my games? According to the output below, I win about 55% of my white games but only about 45% of my black games.

In [42]:
whiteWins = 0
blackWins = 0

# Count my wins as white, then print the share of white games won
for w in white['white.result']:
  if w == 'win':
    whiteWins+=1

print(whiteWins/whiteGamesCount)

# Same count for my games as black
for w in black['black.result']:
  if w == 'win':
    blackWins+=1

print(blackWins/blackGamesCount)


0.5531914893617021
0.45098039215686275
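
The same ratios can be computed without explicit loops, since the mean of a boolean Series is the share of `True` values. The sketch below uses stand-in data rather than the real frames: 26 wins in 47 white games and 23 wins in 51 black games are the only whole-number counts consistent with the ratios printed above.

```python
import pandas as pd

# Stand-in data reconstructed from the printed ratios and game counts.
# Non-win rows are lumped together as 'resigned' purely for illustration.
white_results = pd.Series(['win'] * 26 + ['resigned'] * 21)
black_results = pd.Series(['win'] * 23 + ['resigned'] * 28)

# Comparing to 'win' gives booleans; their mean is the share of wins
white_win_rate = (white_results == 'win').mean()
black_win_rate = (black_results == 'win').mean()

print(round(white_win_rate, 3), round(black_win_rate, 3))
```

On the real data the equivalent would be `(white['white.result'] == 'win').mean()` and `(black['black.result'] == 'win').mean()`.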
