## Data source

* https://database.lichess.org/#evals
* 342,059,879 chess positions evaluated with Stockfish.
* Search depth is often 30-50.
* Each position contains multiple evaluations (with different depth / number of MultiPV lines), so they can be cross-checked against each other to filter out noisy ones.
* Schema of a position:
```
{
  "fen":          // the position FEN only contains pieces, active color, castling rights, and en passant square.
  "evals": [      // a list of evaluations, ordered by number of PVs.
      "knodes":   // number of kilo-nodes searched by the engine
      "depth":    // depth reached by the engine
      "pvs": [    // list of principal variations
        "cp":     // centipawn evaluation. Omitted if mate is certain.
        "mate":   // mate evaluation. Omitted if mate is not certain.
        "line":   // principal variation, in UCI_Chess960 format.
}
```
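
The cross-check mentioned above could look roughly like the following. This is a minimal sketch, not the pipeline's actual logic: `best_eval`, `evals_agree`, the 100-centipawn tolerance, and the sample record are all hypothetical and made up for illustration. Only the field names come from the schema.

```python
import json

# Hypothetical record that follows the schema above (all values are made up).
SAMPLE_LINE = json.dumps({
    "fen": "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq -",
    "evals": [
        {"knodes": 200000, "depth": 36,
         "pvs": [{"cp": 18, "line": "e2e4 e7e5 g1f3"}]},
        {"knodes": 90000, "depth": 30,
         "pvs": [{"cp": 25, "line": "e2e4 e7e5 g1f3"},
                 {"cp": 20, "line": "d2d4 d7d5 c2c4"}]},
    ],
})


def best_eval(entry):
    """Return the evaluation that was searched to the greatest depth."""
    return max(entry["evals"], key=lambda ev: ev["depth"])


def evals_agree(entry, tolerance_cp=100):
    """Check that the top PV of every evaluation is close to the deepest one.

    Mate scores are compared by sign only, because the mate distance can
    legitimately differ between depths. tolerance_cp is an assumed threshold.
    """
    ref = best_eval(entry)["pvs"][0]
    for ev in entry["evals"]:
        top = ev["pvs"][0]
        if ("mate" in top) != ("mate" in ref):
            return False
        if "mate" in ref:
            if (top["mate"] > 0) != (ref["mate"] > 0):
                return False
        elif abs(top["cp"] - ref["cp"]) > tolerance_cp:
            return False
    return True


entry = json.loads(SAMPLE_LINE)
print(best_eval(entry)["depth"], evals_agree(entry))
```

Positions for which `evals_agree` returns `False` could be dropped or re-checked; the right tolerance would need to be tuned on real data.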


In [1]:
import json
import math
import chess
from generate_sft_data import parse_lichess_entry, format_reasoning_trace

data_path = "./data/lichess_db_eval_head100.jsonl"
max_candidates = 3
max_moves = 5
template_path = "./reasoning_template.txt"

valid_count = 0
skipped_count = 0

In [2]:
target = 0  # index of the entry to inspect

# Read the template once instead of reopening the file for every entry.
with open(template_path, 'r') as f_template:
    template_str = f_template.read()

with open(data_path, 'r') as f_in:
    for i, line in enumerate(f_in):
        if not line.strip():
            continue

        stockfish_info = parse_lichess_entry(line, max_candidates, max_moves)
        if target == i:
            print("=====Processed stockfish information=====")
            print(json.dumps(stockfish_info, indent=4))

        if stockfish_info:
            output_text = format_reasoning_trace(stockfish_info, template_str=template_str, shuffle_candidates=False)

            sft_entry = {
                "instruction": "You are a chess Grandmaster. Analyze the current board state and determine the best move.",
                "input": f"FEN: {stockfish_info['fen']}",
                "output": output_text
            }
            if target == i:
                print("=====Synthetic reasoning trace=====")
                print(output_text)

            valid_count += 1
            if target == i:
                break
        else:
            skipped_count += 1

=====Processed stockfish information=====
{
    "fen": "7r/1p3k2/p1bPR3/5p2/2B2P1p/8/PP4P1/3K4 b - -",
    "turn_color": "Black",
    "candidates": [
        {
            "san": "Kg7",
            "uci": "f7g7",
            "win_rate": 43.68,
            "cp": -69,
            "mate": null,
            "pv": "... Kg7\nRe2\n... Rd8\nRd2\n... b5"
        },
        {
            "san": "Rd8",
            "uci": "h8d8",
            "win_rate": 35.43,
            "cp": -163,
            "mate": null,
            "pv": "... Rd8\nKe1\n... a5\na3\n... Bd7"
        },
        {
            "san": "Ra8",
            "uci": "h8a8",
            "win_rate": 30.09,
            "cp": -229,
            "mate": null,
            "pv": "... Ra8\nKe1\n... a5\nRh6+\n... Kg7"
        }
    ],
    "best_move": "Kg7",
    "best_win_rate": 43.68,
    "best_pv": "... Kg7\nRe2\n... Rd8\nRd2\n... b5"
}
=====Synthetic reasoning trace=====
<think>
The current position is: 7r/1p3k2/p1bPR3/5p2/2B2P1p/8/PP4P1/3K4 b
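
The `win_rate` values in the processed output above are consistent with the logistic win-percentage formula that Lichess publishes. The sketch below reproduces them from the `cp` values. That `generate_sft_data` uses exactly this formula is an assumption, and `win_rate_from_cp` is a hypothetical name.

```python
import math


def win_rate_from_cp(cp):
    """Convert a centipawn score into a win percentage (0-100).

    Uses the logistic curve with the constant 0.00368208 from the Lichess
    win-percentage model. Assumption: cp is given from the point of view
    of the side whose win rate is wanted.
    """
    return 50 + 50 * (2 / (1 + math.exp(-0.00368208 * cp)) - 1)


# The three candidate moves from the processed output above.
for cp in (-69, -163, -229):
    print(cp, round(win_rate_from_cp(cp), 2))
```

A score of 0 maps to 50%, and the curve saturates towards 0% and 100% for large disadvantages and advantages.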

In [3]:
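# Assumed output location: matches the default file_name of read_output below.
out_path = "data/lichess_eval_sft.jsonl"

# Reset the counters so the entry counted in the previous cell is not counted twice.
valid_count = 0
skipped_count = 0

# Read the template once instead of reopening the file for every entry.
with open(template_path, 'r') as f_template:
    template_str = f_template.read()

with open(data_path, 'r') as f_in, open(out_path, 'w') as f_out:
    for line in f_in:
        if not line.strip():
            continue

        stockfish_info = parse_lichess_entry(line, max_candidates, max_moves)

        if stockfish_info:
            output_text = format_reasoning_trace(
                stockfish_info,
                template_str=template_str,
            )

            sft_entry = {
                "instruction": (
                    "You are a chess Grandmaster. Analyze the current board state and determine the best move."
                ),
                "input": f"FEN: {stockfish_info['fen']}",
                "output": output_text,
            }

            f_out.write(json.dumps(sft_entry) + "\n")
            valid_count += 1
        else:
            skipped_count += 1

print("Done.")
print(f"Valid Samples Kept: {valid_count}")
print(f"Samples Filtered Out: {skipped_count}")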

In [4]:
def read_output(target, file_name="data/lichess_eval_sft.jsonl"):
    """Print the reasoning trace of the target-th entry (1-indexed)."""
    with open(file_name, 'r') as f_in:
        for i, line in enumerate(f_in, start=1):
            if i == target:
                print(json.loads(line)["output"])
                break

## Some examples

1. Checkmate by the opponent
2. A capture

In [5]:
read_output(4)

<think>
The current position is: 6k1/6p1/6N1/4K3/4N3/8/8/8 b - -.
The next player is Black.

Let's analyze Black's possible next moves:
First, consider Kf7.
If played here, a natural continuation might be:
... Kf7
Kf5
... Ke8
Ke6
... Kd8
This line allows White to force checkmate in 23.
In this variation, Black's win rate is about 0.0%.

Another possible choice is Kh7.
If played this way, the possible continuation is:
... Kh7
Kf5
... Kh6
Nd6
... Kh7
This line allows White to force checkmate in 12.
In this variation, Black's win rate is about 0.0%.

Based on the above analysis, the best move is Kf7, with a 0.0% win rate.
Choosing this move, the possible continuation is:
... Kf7
Kf5
... Ke8
Ke6
... Kd8
</think>
<answer>
\boxed{Kf7}
</answer>
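
Since every trace ends with the move inside `\boxed{...}` between `<answer>` tags, the final answer can be pulled out with a regular expression, for example when scoring model outputs. A minimal sketch follows; `extract_answer` is a hypothetical helper, not part of `generate_sft_data`.

```python
import re

# Matches the SAN move inside \boxed{...} within the <answer> block.
BOXED_RE = re.compile(r"<answer>\s*\\boxed\{([^{}]+)\}\s*</answer>")


def extract_answer(trace):
    """Return the boxed move, or None if the trace has no complete answer block."""
    match = BOXED_RE.search(trace)
    return match.group(1).strip() if match else None


trace = "<think>\n...\n</think>\n<answer>\n\\boxed{Kf7}\n</answer>"
print(extract_answer(trace))
```

Returning `None` for traces without a complete answer block makes truncated generations easy to count separately from wrong moves.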


In [6]:
read_output(7)

<think>
The current position is: 8/3B4/8/p4p1k/5P1p/Pb6/1P4P1/6K1 w - -.
The next player is White.

Let's analyze White's possible next moves:
First, consider Bxf5.
If played here, a natural continuation might be:
Bxf5
... a4
Kh2
... Bd5
Bd7
It captures the pawn on f5.
Overall, White comes out with a winning advantage (about +6.52 pawns).
In this variation, White's win rate is about 91.69%.

Another possible choice is g3.
If played this way, the possible continuation is:
g3
... hxg3
Bxf5
... Kh4
Bc8
Overall, White comes out with a slight edge (about +0.20 pawns).
In this variation, White's win rate is about 51.84%.

Another possible choice is Kh2.
If played this way, the possible continuation is:
Kh2
... Kg4
Bc8
... a4
Bd7
Overall, the position stays about even.
In this variation, White's win rate is about 50.0%.

Based on the above analysis, the best move is Bxf5, with a 91.69% win rate.
Choosing this move, the possible continuation is:
Bxf5
... a4
Kh2
... Bd5
Bd7
</think>
<answer>
\b

## Advanced reasoning trace ideas

1. Get some candidate moves
    * top-k from Stockfish (current)
    * some good moves and some bad moves from the Lichess database
    * top-k from several categories of moves
2. For each move, the analysis info can come from:
    * Stockfish info (current)
    * richer descriptions derived from Stockfish output (capture, control, etc.)
    * tablebases
    * endgame
    * commentary on positions (instructional material for beginners, e.g. How to Reassess Your Chess)
    * critical position classifier
3. Besides candidate moves, we can incorporate bootstrap-phase tasks into the reasoning trace
    * legal moves
    * board state translation
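
As an illustration of the board state translation idea, the piece placement field of a FEN can be turned into plain sentences with the standard library alone. This is only a sketch of one possible wording: `describe_board` and the sentence format are hypothetical, and a legal-move listing would additionally need a move generator such as python-chess.

```python
PIECE_NAMES = {
    "p": "pawn", "n": "knight", "b": "bishop",
    "r": "rook", "q": "queen", "k": "king",
}


def describe_board(fen):
    """Translate a FEN into one sentence per piece plus the side to move."""
    placement, turn = fen.split()[:2]
    lines = []
    # FEN lists ranks from 8 down to 1, separated by "/".
    for rank_index, row in enumerate(placement.split("/")):
        rank = 8 - rank_index
        file_index = 0
        for ch in row:
            if ch.isdigit():
                # A digit encodes that many consecutive empty squares.
                file_index += int(ch)
                continue
            color = "White" if ch.isupper() else "Black"
            square = "abcdefgh"[file_index] + str(rank)
            lines.append(f"{color} {PIECE_NAMES[ch.lower()]} on {square}")
            file_index += 1
    lines.append(f"{'White' if turn == 'w' else 'Black'} to move")
    return lines


# FEN taken from the first example trace above.
print("\n".join(describe_board("6k1/6p1/6N1/4K3/4N3/8/8/8 b - -")))
```

The resulting sentences could be placed at the start of the reasoning trace, before the candidate-move analysis.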

* https://github.com/hebbarashwin/lichess-puzzler/tree/master/tagger
* https://chessnook.wordpress.com/wp-content/uploads/2010/12/how-to-reassess-your-chess-jeremy-silman.pdf