To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our docs [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save).


### News

**Read our [blog post](https://unsloth.ai/blog/r1-reasoning) for guidance on how to train reasoning models.**

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [1]:
%%capture
!pip install unsloth vllm
!pip install triton==3.1.0
!pip install -U pynvml

### Unsloth

Use `PatchFastRL` before all functions to patch GRPO and other RL algorithms!

In [2]:
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)

Unsloth: Patching Xformers to fix some performance issues.
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


Load up `Qwen 2.5 0.5B Instruct`, and set parameters

In [3]:
from unsloth import is_bfloat16_supported
import torch
max_seq_length = 8192 # Can increase for longer reasoning traces
lora_rank = 256 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-0.5B-Instruct", # Alternatives: "meta-llama/Llama-3.2-1B-Instruct", "Qwen/Qwen2.5-3B-Instruct"
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.5, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 1999,
)

INFO 02-26 11:02:21 __init__.py:207] Automatically detected platform cuda.
==((====))==  Unsloth 2025.2.15: Fast Qwen2 patching. Transformers: 4.49.0.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/qwen2.5-0.5b-instruct-unsloth-bnb-4bit with actual GPU utilization = 49.66%
Unsloth: Your GPU has CUDA compute capability 7.5 with VRAM = 14.74 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 8192. Num Sequences = 192.
Unsloth: vLLM's KV Cache can use up to 6.6 GB. Also swap space = 5 GB.
INFO 02-26 11:02:34 config.py:549] This model supports multiple tasks: {'reward', 'generate', 'embed', 'score', 'classify'}. Defau

tokenizer_config.json:   0%|          | 0.00/7.36k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/270 [00:00<?, ?B/s]

INFO 02-26 11:02:37 cuda.py:178] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 02-26 11:02:37 cuda.py:226] Using XFormers backend.
INFO 02-26 11:02:48 model_runner.py:1110] Starting to load model unsloth/qwen2.5-0.5b-instruct-unsloth-bnb-4bit...
INFO 02-26 11:02:48 loader.py:1089] Loading weights with BitsAndBytes quantization.  May take a while ...
INFO 02-26 11:02:48 weight_utils.py:254] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/538M [00:00<?, ?B/s]

INFO 02-26 11:02:50 weight_utils.py:270] Time spent downloading weights for unsloth/qwen2.5-0.5b-instruct-unsloth-bnb-4bit: 1.932830 seconds


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]




INFO 02-26 11:02:52 model_runner.py:1115] Loading model weights took 0.5090 GB
INFO 02-26 11:02:52 logger.py:57] Using PunicaWrapperGPU.
INFO 02-26 11:03:00 worker.py:267] Memory profiling takes 7.92 seconds
INFO 02-26 11:03:00 worker.py:267] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.50) = 7.32GiB
INFO 02-26 11:03:00 worker.py:267] model weights take 0.51GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 1.06GiB; the rest of the memory reserved for KV Cache is 5.71GiB.
INFO 02-26 11:03:01 executor_base.py:111] # cuda blocks: 31170, # CPU blocks: 27306
INFO 02-26 11:03:01 executor_base.py:116] Maximum concurrency for 8192 tokens per request: 60.88x
INFO 02-26 11:03:05 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs d

Capturing CUDA graph shapes: 100%|██████████| 27/27 [00:35<00:00,  1.31s/it]

INFO 02-26 11:03:40 model_runner.py:1562] Graph capturing finished in 35 secs, took 0.39 GiB
INFO 02-26 11:03:40 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 48.65 seconds





Unsloth 2025.2.15 patched 24 layers with 24 QKV layers, 24 O layers and 24 MLP layers.
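
To put the `lora_rank` setting above in perspective: each adapted weight of shape (d_out, d_in) gains `r * (d_in + d_out)` trainable parameters, so the adapter size grows linearly with the rank. The sketch below works through that arithmetic. The layer dimensions are assumptions taken from the published Qwen2.5-0.5B configuration (hidden size 896, MLP size 4864, 2 key/value heads of dimension 64) and are not stated in this notebook; only the 24 layers and the seven target modules come from the cell and log above.

```python
# Assumed Qwen2.5-0.5B dimensions (not stated in this notebook)
HIDDEN = 896        # model width
KV_DIM = 2 * 64     # 2 key/value heads x head dimension 64
MLP = 4864          # feed-forward width
NUM_LAYERS = 24     # matches "patched 24 layers" in the log

# (d_in, d_out) for each LoRA target module in one decoder layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_trainable_params(rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) per adapted weight."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS

for rank in (16, 64, 256):
    print(f"r={rank:>3}: {lora_trainable_params(rank) / 1e6:.1f}M trainable parameters")
```

Under these assumptions, rank 256 trains roughly 141M adapter parameters, which is a sizeable fraction of a 0.5B model. Dropping the rank is the first thing to try if training is slow or runs out of memory.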


### Data Prep
<a name="Data"></a>

We directly leverage [@willccbb](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb) for data prep and all reward functions. You are free to create your own!
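
If you want to write your own reward function, here is a minimal sketch. It assumes the same call convention as the reward functions in this notebook: `completions` is a list where each item is a list of chat messages, and the function returns one float per completion. The name `center_column_reward_func` and its scoring values are hypothetical and only serve as an illustration.

```python
import re
from typing import Dict, List

# Same move syntax as the notebook's prompt: a single column letter a-g inside <move> tags
MOVE_PATTERN = re.compile(r"<move>\s*([a-g])\s*</move>")

def center_column_reward_func(completions: List[List[Dict]], **kwargs) -> List[float]:
    """Hypothetical reward: 0.5 for the center column, 0.0 for any other
    valid move, -1.0 when no move can be parsed."""
    rewards = []
    for completion in completions:
        response = completion[0]["content"]
        match = MOVE_PATTERN.search(response)
        if match is None:
            rewards.append(-1.0)
        elif match.group(1) == "d":
            rewards.append(0.5)
        else:
            rewards.append(0.0)
    return rewards

sample = [
    [{"role": "assistant", "content": "<reasoning>\nTake the center.\n</reasoning>\n<move>\nd\n</move>"}],
    [{"role": "assistant", "content": "<move>a</move>"}],
    [{"role": "assistant", "content": "I play column h."}],
]
print(center_column_reward_func(sample))
```

Accepting `**kwargs` lets the function ignore extra dataset columns (such as `answer`) that other reward functions use.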

In [5]:
import re
from datasets import load_dataset, Dataset
import numpy as np
from typing import List, Dict, Tuple

In [6]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
HF_TOKEN = user_secrets.get_secret("HF_TOKEN")
!huggingface-cli login --token $HF_TOKEN

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
The token `sd` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `sd`
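
The data-prep cell that follows stores a board as a 6x7 list of lists with row index 0 at the bottom, and shows it to the model as a coordinate list such as `a1(O), a2(X)`. A dropped piece therefore lands in the lowest empty row of its column. The sketch below illustrates that convention on a tiny game. The helper names `landing_row` and `to_coordinate_list` are made up for this example and are not the notebook's own functions.

```python
from typing import List

ROWS, COLS = 6, 7

def landing_row(board: List[List[str]], col: int) -> int:
    """Lowest empty row in the column (row 0 is the bottom), or -1 if the column is full."""
    for row in range(ROWS):
        if board[row][col] == ".":
            return row
    return -1

def to_coordinate_list(board: List[List[str]]) -> str:
    """Formats occupied cells as <column><row>(<piece>), scanning bottom row first."""
    coords = [
        f"{chr(ord('a') + c)}{r + 1}({board[r][c]})"
        for r in range(ROWS)
        for c in range(COLS)
        if board[r][c] != "."
    ]
    return ", ".join(coords) if coords else "Empty Board"

board = [["."] * COLS for _ in range(ROWS)]
for i, column_letter in enumerate("ddde"):  # X and O alternate, X moves first
    col = ord(column_letter) - ord("a")
    board[landing_row(board, col)][col] = "X" if i % 2 == 0 else "O"

print(to_coordinate_list(board))  # d1(X), e1(O), d2(O), d3(X)
print(landing_row(board, 3))      # 3, i.e. the next piece in column d lands on d4
```
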


In [None]:
# --- System Prompt and XML Format ---
SYSTEM_PROMPT = """You are a master Connect Four strategist whose goal is to win while preventing your opponent from winning. The game is played on a 6x7 grid (columns a–g, rows 1–6 with 1 at the bottom) where pieces drop to the lowest available spot.

Board:
- Represented as a list of occupied cells in the format: <column><row>(<piece>), e.g., 'a1(O)'.
- For example: 'a1(O), a2(X), b1(O)' indicates that cell a1 has an O, a2 has an X, and b1 has an O.
- An empty board is shown as 'Empty Board'.
- Win by connecting 4 pieces in any direction (horizontal, vertical, or diagonal).

Strategy:
1. Identify taken positions, and empty positions.
2. Find and execute winning moves.
3. If there isn't a winning move, then block your opponent’s potential wins.
4. Control the center and set up future moves.

Respond in XML:
<reasoning>
Explain your thought process, focusing on your winning move, how you block your opponent, and your strategic plans.
</reasoning>
<move>
Specify the column letter (a–g) for your next move.
</move>
"""

def extract_xml_move(text: str) -> str:
    """
    Extracts the move (a single column letter a–g) from the first
    <move>...</move> tag pair in the text. Returns "" if no valid move is found.
    """
    match = re.search(r'<move>\s*([a-g])\s*</move>', text)
    if match:
        return match.group(1)
    return ""

def convert_moves_to_coordinate_list(moves_list: List[str]) -> str:
    """
    Converts a list of moves to a coordinate list representation.
    Each move is formatted as <column><row>(<piece>).
    Returns "Empty Board" if no moves are present.
    """
    # Create an empty 6x7 grid (row 1 is at index 0)
    grid = [['.' for _ in range(7)] for _ in range(6)]
    
    for i, move in enumerate(moves_list):
        if not move:
            continue
        col = ord(move[0]) - ord('a')
        # Find the lowest available row in this column:
        for row in range(6):
            if grid[row][col] == '.':
                grid[row][col] = 'X' if i % 2 == 0 else 'O'
                break
    
    # Build coordinate list: Only include cells with a piece.
    coords = []
    for row in range(6):
        for col in range(7):
            if grid[row][col] != '.':
                # Convert row index to board row number (row 0 -> 1, etc.)
                coords.append(f"{chr(col + ord('a'))}{row+1}({grid[row][col]})")
    
    return ", ".join(coords) if coords else "Empty Board"

def parse_coordinate_list(board_str: str) -> List[List[str]]:
    """
    Converts a coordinate list representation (e.g., "a1(O), a2(X), b1(O)")
    into a 6x7 grid (list of lists) with row index 0 as the bottom.
    """
    grid = [['.' for _ in range(7)] for _ in range(6)]
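    # An empty board is encoded as "Empty Board" (or an empty string)
    if not board_str.strip() or board_str.strip() == "Empty Board":
        return grid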
    coords = board_str.split(",")
    for coord in coords:
        coord = coord.strip()
        # Expecting format: a1(O)
        if len(coord) < 4:
            continue
        col_letter = coord[0]
        try:
            row_number = int(coord[1])
        except ValueError:
            continue
        piece = coord[3]  # The piece inside the parentheses
        col = ord(col_letter) - ord('a')
        row = row_number - 1
        if 0 <= row < 6 and 0 <= col < 7:
            grid[row][col] = piece
    return grid

def get_available_positions(board_moves: List[str]) -> str:
    """Returns all available positions for each column in a clear format,
    reconstructing the board from a list of move strings."""
    # Initialize empty grid ('.' means empty)
    grid = [['.' for _ in range(7)] for _ in range(6)]
    
    # Fill in taken positions from the moves using the move index for parity
    for i, move in enumerate(board_moves):
        if len(move) >= 2:
            col = ord(move[0]) - ord('a')
            row = int(move[1]) - 1
            if 0 <= row < 6 and 0 <= col < 7:
                grid[row][col] = 'X' if i % 2 == 0 else 'O'
    
    # Find all available positions in each column
    available = []
    for col in range(7):
        col_letter = chr(ord('a') + col)
        positions = []
        for row in range(6):
            if grid[row][col] == '.':
                positions.append(f"{col_letter}{row + 1}")
        
        if positions:
            available.append(f"Column {col_letter}: {', '.join(positions)}")
        else:
            available.append(f"Column {col_letter}: Full")
    
    return "\n  ".join(available)

def check_win(board: List[List[str]], piece: str) -> bool:
    """Returns True if `piece` has four in a row horizontally, vertically, or diagonally."""
    directions = [(0,1), (1,0), (1,1), (1,-1)]  # horizontal, vertical, diagonals
    rows, cols = len(board), len(board[0])
    
    for r in range(rows):
        for c in range(cols):
            if board[r][c] != piece:
                continue
            for dr, dc in directions:
                count = 1
                for i in range(1, 4):
                    nr, nc = r + dr*i, c + dc*i
                    if not (0 <= nr < rows and 0 <= nc < cols):
                        break
                    if board[nr][nc] != piece:
                        break
                    count += 1
                if count >= 4:
                    return True
    return False

def create_enhanced_training_example(game_text: str, outcome: str) -> List[Dict]:
    """Enhanced training example with better game state representation."""
    examples = []
    turns = game_text.strip().split(' ')
    board_moves = []
    x_moves = []
    o_moves = []
    
    for turn_idx in range(0, len(turns), 3):
        # Process first player's move (X)
        if turn_idx + 1 < len(turns):
            move = turns[turn_idx + 1]
            current_board = convert_moves_to_coordinate_list(board_moves)
            next_positions = get_available_positions(board_moves)
            
            prompt = [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": f"""Game State:
- You are playing as: X
- Your previous moves: {', '.join(x_moves)}
- Opponent's moves: {', '.join(o_moves)}
- Current board state: {current_board}
- Next available position per column:
  {next_positions}

Make your move."""}
            ]
            examples.append({
                "prompt": prompt,
                "answer": move[0],
                "board_state": current_board,
                "player": 1
            })
            board_moves.append(move)
            x_moves.append(move)
        
        # Process second player's move (O)
        if turn_idx + 2 < len(turns):
            move = turns[turn_idx + 2]
            current_board = convert_moves_to_coordinate_list(board_moves)
            next_positions = get_available_positions(board_moves)
            
            prompt = [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": f"""Game State:
- You are playing as: O
- Your previous moves: {', '.join(o_moves)}
- Opponent's moves: {', '.join(x_moves)}
- Current board state: {current_board}
- Next available position per column:
  {next_positions}

Make your move."""}
            ]
            examples.append({
                "prompt": prompt,
                "answer": move[0],
                "board_state": current_board,
                "player": 2
            })
            board_moves.append(move)
            o_moves.append(move)
    
    return examples

# Dataset creation
def connect_four_dataset(split="train") -> Dataset:
    """Enhanced dataset creation with coordinate list board representation."""
    data = load_dataset("Lyte/ConnectFour-T10", split=split)
    all_examples = []
    
    for item in data:
        game_text = item['text']
        examples = create_enhanced_training_example(game_text, item['outcome'])
        all_examples.extend(examples)

    # Filter by length
    filtered_examples = [ex for ex in all_examples 
                        if sum(len(tokenizer.encode(turn["content"])) 
                        for turn in ex['prompt']) + len(tokenizer.encode(ex['answer'])) <= max_seq_length]

    return Dataset.from_list(filtered_examples)

# train_dataset = connect_four_dataset("train")  # Rebuild the examples from the raw games
train_dataset = load_dataset("Lyte/ConnectFour-Training-Data_v3", split="train")
# Earlier prebuilt versions:
# train_dataset = load_dataset("Lyte/ConnectFour-Training-Data_v2", split="train")
# train_dataset = load_dataset("Lyte/ConnectFour-Training-Data", split="train")  # Output of connect_four_dataset("train"), uploaded to save time

def strategic_winning_reward_func(prompts, completions: List[List[Dict]], answer: List[str], **kwargs) -> List[float]:
    """Applies each predicted move to the board from the prompt and scores the resulting position.
    Unparseable moves, full columns, and evaluation errors are scored -1.0."""
    rewards = []
    
    for prompt, completion, ans in zip(prompts, completions, answer):
        try:
            predicted_move = extract_xml_move(completion[0]['content'])
            if not predicted_move or len(predicted_move) != 1 or not ('a' <= predicted_move[0] <= 'g'):
                rewards.append(-1.0)
                continue
            
            # Extract game state information
            content = prompt[1]['content']
            player = 'X' if 'You are playing as: X' in content else 'O'
            board_state = re.search(r'Current board state: (.*?)\n', content).group(1)
            current_board = parse_coordinate_list(board_state)
            
            # Extract next available position for the chosen column
            col = ord(predicted_move[0]) - ord('a')
            next_positions = re.search(r'Next available position per column:\n(.*?)(?=\n\nMake)', content, re.DOTALL).group(1)
            target_position = re.search(rf'{predicted_move}: {predicted_move}(\d+)', next_positions)
            
            if not target_position:
                rewards.append(-1.0)  # Column is full or invalid
                continue
                
            row = int(target_position.group(1)) - 1
            
            # Apply move
            current_board[row][col] = player
            
            # Strategic evaluation
            reward = evaluate_strategic_position(current_board, row, col, player)
            
            rewards.append(reward)
            
        except Exception:
            # Any parsing or evaluation failure is scored as an invalid move
            rewards.append(-1.0)
            
    return rewards

def evaluate_strategic_position(board: List[List[str]], row: int, col: int, player: str) -> float:
    """Enhanced position evaluation with stage-specific strategic concepts."""
    reward = 0.0
    opponent = 'O' if player == 'X' else 'X'
    
    # Calculate total pieces and board fill percentage for dynamic game stage determination
    total_pieces = sum(1 for r in board for c in r if c != '.')
    board_fill_percentage = total_pieces / 42  # 42 is total cells in a 6x7 board
    
    # Determine game stage dynamically
    # Early: <30% filled, Mid: 30-70% filled, Late: >70% filled
    game_stage = "early" if board_fill_percentage < 0.3 else "mid" if board_fill_percentage < 0.7 else "late"
    
    # Immediate win or block is always highest priority regardless of game stage
    if check_win(board, player):
        return 10.0  # Win is always the best move
    
    # Block opponent's win
    board[row][col] = opponent  # Temporarily place opponent's piece
    if check_win(board, opponent):
        reward += 5.0  # High value for blocking
    board[row][col] = player  # Restore player's piece
    
    # Apply stage-specific evaluation
    if game_stage == "early":
        reward += early_game_eval(board, row, col, player)
    elif game_stage == "mid":
        reward += mid_game_eval(board, row, col, player)
    else:  # late game
        reward += late_game_eval(board, row, col, player)
    
    return reward

def early_game_eval(board: List[List[str]], row: int, col: int, player: str) -> float:
    """Early game strategy focuses on center control, development, and avoiding premature threats."""
    score = 0.0
    
    # Moderate center column bonus
    if col == 3:  # Column d (center)
        score += 0.05
    elif col in [2, 4]:  # Columns c and e (near center)
        score += 0.03
    
    # Encourage base development - pieces in bottom rows provide foundation
    if row <= 1:  # Bottom two rows
        score += 0.02
    
    # Encourage move diversity to avoid predictability
    # Count existing pieces in this column
    pieces_in_column = sum(1 for r in range(6) if board[r][col] != '.')
    if pieces_in_column >= 2:
        score -= 0.03  # Slight penalty for stacking too many pieces in one column early
    
    # Early connection potential without overcommitting
    score += evaluate_early_connections(board, row, col, player)
    
    # Avoid moves that help opponent create threats
    score += evaluate_defensive_positioning(board, row, col, player) * 0.5  # Lower weight in early game
    
    return score

def mid_game_eval(board: List[List[str]], row: int, col: int, player: str) -> float:
    """Mid game strategy focuses on creating threats, blocking opponent threats, and building structures."""
    score = 0.0
    opponent = 'O' if player == 'X' else 'X'
    
    # Stronger emphasis on creating own threats
    score += evaluate_win_paths(board, row, col, player) * 1.5
    
    # Moderate emphasis on blocking opponent's developing threats
    score += evaluate_opponent_threats(board, row, col, player, opponent) * 1.2
    
    # Trap creation becomes valuable in mid-game
    score += evaluate_trap_potential(board, row, col, player) * 1.0
    
    # Building connected structures for future advantage
    score += evaluate_connected_structures(board, row, col, player) * 0.8
    
    # Control of key positions (height advantage)
    if row >= 2 and sum(1 for r in range(row) if board[r][col] != '.') == row:
        score += 0.3  # Reward building on solid foundation
    
    return score

def late_game_eval(board: List[List[str]], row: int, col: int, player: str) -> float:
    """Late game strategy focuses on forcing wins, preventing opponent wins, and tactical play."""
    score = 0.0
    opponent = 'O' if player == 'X' else 'X'
    
    # Heavy emphasis on creating immediate threats
    score += evaluate_win_paths(board, row, col, player) * 2.5
    
    # Creating multiple simultaneous threats is crucial
    double_threats = evaluate_double_threats(board, row, col, player)
    if double_threats > 0:
        score += double_threats * 3.0
    
    # Blocking opponent's winning paths
    score += evaluate_opponent_threats(board, row, col, player, opponent) * 2.0
    
    # Forced move sequences
    score += evaluate_forcing_sequences(board, row, col, player) * 2.0
    
    # Penalize moves that create opportunities for opponent
    score -= evaluate_opponent_opportunities(board, row, col, player, opponent) * 1.5
    
    return score

def evaluate_early_connections(board: List[List[str]], row: int, col: int, player: str) -> float:
    """Evaluates potential for early piece connections without overcommitting."""
    score = 0.0
    directions = [(0,1), (1,0), (1,1), (1,-1)]
    
    for dr, dc in directions:
        # Look one step in each direction
        for direction in [-1, 1]:
            r, c = row + dr * direction, col + dc * direction
            if 0 <= r < 6 and 0 <= c < 7:
                if board[r][c] == player:
                    score += 0.02  # Small bonus for connecting pieces
                elif board[r][c] == '.':
                    score += 0.01  # Smaller bonus for potential future connection
    
    return score

def evaluate_defensive_positioning(board: List[List[str]], row: int, col: int, player: str) -> float:
    """Evaluates if a move avoids creating easy threats for opponent."""
    score = 0.0
    opponent = 'O' if player == 'X' else 'X'
    
    # Check if placing here would allow opponent to place above and create a threat
    if row < 5:  # Not top row
        # Temporarily place our piece, remembering what was there
        original = board[row][col]
        board[row][col] = player
        
        # Check if opponent could place above
        board[row+1][col] = opponent
        
        # See if this creates a threat for opponent
        threat_value = evaluate_win_paths(board, row+1, col, opponent)
        if threat_value > 1.0:
            score -= 0.5  # Penalize moves that give opponent easy threats
        
        # Reset board to its state on entry
        board[row+1][col] = '.'
        board[row][col] = original
    
    return score

def evaluate_opponent_threats(board: List[List[str]], row: int, col: int, player: str, opponent: str) -> float:
    """Evaluates how well a move blocks opponent's developing threats."""
    score = 0.0
    
    # Save current state
    original = board[row][col]
    
    # For each of opponent's possible next moves
    for test_col in range(7):
        # Find where a piece would land: the lowest empty row (row 0 is the bottom)
        test_row = next((r for r in range(6) if board[r][test_col] == '.'), -1)
        
        if test_row >= 0:  # Valid move
            # Place opponent piece
            board[test_row][test_col] = opponent
            
            # Check for win or strong threat
            if check_win(board, opponent):
                score += 2.0  # Blocking an immediate threat
            else:
                # Check for developing threats
                threat_level = evaluate_win_paths(board, test_row, test_col, opponent)
                if threat_level > 1.0:
                    score += 0.5  # Blocking a developing threat
            
            # Restore board
            board[test_row][test_col] = '.'
    
    # Restore original state
    board[row][col] = original
    
    return score

def evaluate_connected_structures(board: List[List[str]], row: int, col: int, player: str) -> float:
    """Evaluates creation of connected piece structures (2 or 3 in a row with space to extend)."""
    score = 0.0
    directions = [(0,1), (1,0), (1,1), (1,-1)]
    
    for dr, dc in directions:
        # Check both directions
        for direction in [-1, 1]:
            connected = 0
            spaces = 0
            
            # Count connected pieces and spaces in this direction
            for i in range(1, 4):  # Look up to 3 steps away
                r = row + dr * direction * i
                c = col + dc * direction * i
                
                if 0 <= r < 6 and 0 <= c < 7:
                    if board[r][c] == player:
                        connected += 1
                    elif board[r][c] == '.':
                        spaces += 1
                        break
                    else:
                        break
            
            # Score based on structure
            if connected == 2 and spaces >= 1:
                score += 0.4  # Three in a row with space to extend
            elif connected == 1 and spaces >= 1:  # spaces is at most 1: the scan stops at the first empty cell
                score += 0.2  # Two in a row with space to extend
    
    return score

def evaluate_double_threats(board: List[List[str]], row: int, col: int, player: str) -> float:
    """Evaluates if a move creates multiple winning threats simultaneously."""
    threats = 0
    
    # Save current position
    original = board[row][col]
    
    # Place player's piece
    board[row][col] = player
    
    # Find all winning moves for next turn
    for test_col in range(7):
        # Find landing row: the lowest empty row (row 0 is the bottom)
        test_row = next((r for r in range(6) if board[r][test_col] == '.'), -1)
        
        if test_row >= 0:  # Valid move
            board[test_row][test_col] = player
            if check_win(board, player):
                threats += 1
            board[test_row][test_col] = '.'
    
    # Restore board
    board[row][col] = original
    
    return threats

def evaluate_forcing_sequences(board: List[List[str]], row: int, col: int, player: str) -> float:
    """Evaluates if a move forces opponent to play in a particular way, leading to advantage."""
    score = 0.0
    opponent = 'O' if player == 'X' else 'X'
    
    # Place our piece, remembering what was there so it can be restored
    original = board[row][col]
    board[row][col] = player
    
    # Count forced responses (moves opponent must make to prevent loss)
    forced_responses = 0
    forced_col = -1
    
    for test_col in range(7):
        # Find landing row: the lowest empty row (row 0 is the bottom)
        test_row = next((r for r in range(6) if board[r][test_col] == '.'), -1)
        
        if test_row >= 0:  # Valid move
            # Place our piece in next move
            board[test_row][test_col] = player
            
            # If this creates a winning position
            if check_win(board, player):
                forced_responses += 1
                forced_col = test_col
            
            # Reset
            board[test_row][test_col] = '.'
    
    # If there's exactly one forced response
    if forced_responses == 1 and forced_col >= 0:
        # The opponent's forced move lands in the lowest empty row of that column
        # (the column is known to have an empty cell from the scan above)
        force_row = next(r for r in range(6) if board[r][forced_col] == '.')
        
        # Place opponent's forced move
        board[force_row][forced_col] = opponent
        
        # See if we can create a follow-up threat
        for follow_col in range(7):
            follow_row = next((r for r in range(6) if board[r][follow_col] == '.'), -1)
            
            if follow_row >= 0:
                board[follow_row][follow_col] = player
                if check_win(board, player):
                    score += 2.0  # Significant bonus for forcing sequence
                board[follow_row][follow_col] = '.'
        
        # Reset forced move
        board[force_row][forced_col] = '.'
    
    # Restore original position
    board[row][col] = original
    
    return score

def evaluate_opponent_opportunities(board: List[List[str]], row: int, col: int, player: str, opponent: str) -> float:
    """Evaluates if a move creates opportunities for opponent."""
    opportunity_score = 0.0
    
    # Place our piece, remembering what was there so it can be restored
    original = board[row][col]
    board[row][col] = player
    
    # If this move allows opponent to place above
    if row < 5:  # Not top row
        # Opponent places above
        board[row+1][col] = opponent
        
        # Check if this creates threats for opponent
        threat_value = evaluate_win_paths(board, row+1, col, opponent)
        if threat_value > 1.5:
            opportunity_score += 1.0
        
        # Reset
        board[row+1][col] = '.'
    
    # Restore original
    board[row][col] = original
    
    return opportunity_score

def validate_xml_format(response: str) -> float:
    """
    Checks if <reasoning> and <move> tags exist correctly.
    - Rewards 1.0 for perfect format.
    - Penalizes missing or extra occurrences.
    """
    expected_counts = {"<reasoning>": 1, "</reasoning>": 1, "<move>": 1, "</move>": 1}
    actual_counts = {tag: response.count(tag) for tag in expected_counts}

    reward = 1.0
    for tag, expected in expected_counts.items():
        actual = actual_counts.get(tag, 0)
        if actual != expected:
            reward -= 0.25 * abs(actual - expected)

    return reward

def length_reward_func(completions: List[List[Dict]], **kwargs) -> List[float]:
    """Rewards responses between 200 and 512 tokens; shorter or longer responses are penalized in proportion to the gap."""
    rewards = []
    
    for completion in completions:
        response = completion[0]['content']
        num_tokens = len(tokenizer.encode(response))
        
        if 200 <= num_tokens <= 512:
            rewards.append(2.0)
        elif num_tokens < 200:
            penalty = (200 - num_tokens) / 200
            rewards.append(-penalty)
        else:  # num_tokens > 512
            penalty = ((num_tokens - 512) / 512) + 0.1
            rewards.append(-penalty)
    
    return rewards

'''def format_reward_func(completions: List[List[Dict]], **kwargs) -> List[float]:
    """Checks for <reasoning> and <move> tags."""
    rewards = []
    for completion in completions:
        response = completion[0]['content']
        if "<reasoning>" in response and "</reasoning>" in response and "<move>" in response and "</move>" in response:
            rewards.append(0.5)
        else:
            rewards.append(-0.5)
    return rewards'''

def strict_format_reward_func(completions, **kwargs) -> List[float]:
    """Strict format reward using regex."""
    rewards = []
    pattern = r"^<reasoning>\n(.*?)\n</reasoning>\n<move>\n(.*?)\n</move>$"
    for completion in completions:
        response = completion[0]["content"]
        match = re.match(pattern, response, re.DOTALL)
        if match:
            rewards.append(1.0)
        else:
            rewards.append(-0.5)
    return rewards

def strict_move_format_reward_func(completions: List[List[Dict]], **kwargs) -> List[float]:
    """
    Strictly enforces that the XML move is correctly formatted:
    - A single letter (a–g) enclosed within <move> and </move>
    - No extra non-whitespace content after the closing tag.
    
    Returns a reward of 2.0 for a perfect format; otherwise, a penalty.
    """
    rewards = []
    for completion in completions:
        response = completion[0]['content']
        move = extract_xml_move(response)
        # Check if a move was extracted and that there's no extra content after </move>
        closing_index = response.find("</move>")
        if move and closing_index != -1 and response[closing_index + len("</move>"):].strip() == "":
            rewards.append(2.0)
        else:
            rewards.append(-2.0)
    return rewards

def xml_count_reward_func(completions, **kwargs) -> List[float]:
    """Ensures the correct number of XML tags appear in the response."""
    rewards = []
    for completion in completions:
        response = completion[0]["content"]
        rewards.append(validate_xml_format(response))
    return rewards
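
To see how the two format rewards differ, here is a minimal standalone sketch that re-implements the tag-count check and the strict regex check and scores three hand-written responses. The sample responses and the name `strict_format_reward` are made up for illustration; the expected tag counts, the 0.25 penalty, and the pattern are copied from the functions above.

```python
import re

EXPECTED_COUNTS = {"<reasoning>": 1, "</reasoning>": 1, "<move>": 1, "</move>": 1}
STRICT_PATTERN = r"^<reasoning>\n(.*?)\n</reasoning>\n<move>\n(.*?)\n</move>$"

def validate_xml_format(response: str) -> float:
    """Start at 1.0 and subtract 0.25 for every missing or extra tag."""
    reward = 1.0
    for tag, expected in EXPECTED_COUNTS.items():
        reward -= 0.25 * abs(response.count(tag) - expected)
    return reward

def strict_format_reward(response: str) -> float:
    """1.0 only if the response matches the exact newline layout, else -0.5."""
    return 1.0 if re.match(STRICT_PATTERN, response, re.DOTALL) else -0.5

# Illustrative responses (not taken from the dataset).
samples = {
    "well_formed": "<reasoning>\nBlock the centre.\n</reasoning>\n<move>\nd\n</move>",
    "no_newlines": "<reasoning>Block the centre.</reasoning><move>d</move>",
    "missing_move": "<reasoning>\nBlock the centre.\n</reasoning>",
}

for name, response in samples.items():
    print(name, validate_xml_format(response), strict_format_reward(response))
```

The `no_newlines` sample keeps the full tag-count reward but fails the strict check, which is why the two rewards are useful together: one gives partial credit for having the right tags, the other only pays out for the exact layout.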

In [9]:
train_dataset = train_dataset.shuffle(seed=1999)
train_dataset

Dataset({
    features: ['prompt', 'answer', 'board_state', 'player'],
    num_rows: 690501
})

In [10]:
print(train_dataset['prompt'][0][1]['content'])

Game State:
- You are playing as: O
- Your previous moves: f1, f2, a1, e2
- Opponent's moves: c1, c2, e1, b1, e3
- Current board state: a1(O), b1(X), c1(X), e1(X), f1(O), c2(X), e2(O), f2(O), e3(X)
- Next available position per column:
  Column a: a2, a3, a4, a5, a6
  Column b: b2, b3, b4, b5, b6
  Column c: c3, c4, c5, c6
  Column d: d1, d2, d3, d4, d5, d6
  Column e: e4, e5, e6
  Column f: f3, f4, f5, f6
  Column g: g1, g2, g3, g4, g5, g6

Make your move.


In [12]:
print(train_dataset['prompt'][51][1]['content'])

Game State:
- You are playing as: X
- Your previous moves: e1, c2, g1
- Opponent's moves: c1, d1, b1
- Current board state: b1(O), c1(O), d1(O), e1(X), g1(X), c2(X)
- Next available position per column:
  Column a: a1, a2, a3, a4, a5, a6
  Column b: b2, b3, b4, b5, b6
  Column c: c3, c4, c5, c6
  Column d: d2, d3, d4, d5, d6
  Column e: e2, e3, e4, e5, e6
  Column f: f1, f2, f3, f4, f5, f6
  Column g: g2, g3, g4, g5, g6

Make your move.


In [14]:
print(train_dataset['prompt'][20][1]['content'])

Game State:
- You are playing as: X
- Your previous moves: e1, c1, b1, b2, f1, e2
- Opponent's moves: g1, c2, a1, b3, a2, e3
- Current board state: a1(O), b1(X), c1(X), e1(X), f1(X), g1(O), a2(O), b2(X), c2(O), e2(X), b3(O), e3(O)
- Next available position per column:
  Column a: a3, a4, a5, a6
  Column b: b4, b5, b6
  Column c: c3, c4, c5, c6
  Column d: d1, d2, d3, d4, d5, d6
  Column e: e4, e5, e6
  Column f: f2, f3, f4, f5, f6
  Column g: g2, g3, g4, g5, g6

Make your move.
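
The "Current board state" line and the "Next available position per column" list carry the same information. As a sanity check, the sketch below parses the board-state string from the prompt above and recomputes the next free cell in each column. The helper name `next_free_rows` is hypothetical, and the 7-column by 6-row board size is inferred from the prompt format.

```python
import re
from typing import Dict

COLUMNS = "abcdefg"
ROWS = 6  # inferred from the prompt, which lists cells up to row 6

def next_free_rows(board_state: str) -> Dict[str, int]:
    """Return the next free row (1-based) per column; ROWS + 1 means the column is full."""
    heights = {col: 0 for col in COLUMNS}
    # Each occupied cell looks like "b3(O)": column letter, row digit, piece.
    for col, row, _piece in re.findall(r"([a-g])([1-6])\(([XO])\)", board_state):
        heights[col] = max(heights[col], int(row))
    return {col: height + 1 for col, height in heights.items()}

board_state = (
    "a1(O), b1(X), c1(X), e1(X), f1(X), g1(O), a2(O), "
    "b2(X), c2(O), e2(X), b3(O), e3(O)"
)
next_rows = next_free_rows(board_state)
for col in COLUMNS:
    print(f"Column {col}: next free cell is {col}{next_rows[col]}")
```

The result agrees with the first entry of each "Column" line in the prompt (a3, b4, c3, d1, e4, f2, g2).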


In [15]:
#train_dataset.push_to_hub("Lyte/ConnectFour-Training-Data_v3", private=True)

<a name="Train"></a>
### Train the model

Now set up GRPO Trainer and all configurations!

In [17]:
model_path = "Lyte/QuadConnect2.5-0.5B-v0.0.9b"

In [18]:
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    use_vllm = True, # use vLLM for fast inference!
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 8,
    gradient_accumulation_steps = 8, # Increase for smoother training
    num_generations = 8, # Decrease if out of memory
    max_prompt_length = 1024,
    max_completion_length = 768,
    #num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 250,
    save_steps = 50,
    max_grad_norm = 0.1,
    report_to = "tensorboard", # Can use Weights & Biases
    output_dir = model_path,
    logging_dir=model_path
)
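
To make the configuration above more concrete, here is a rough sketch of the effective batch size and the learning-rate curve it implies. It assumes a single GPU and the usual "linear warmup, then cosine decay to zero" shape; the function `lr_at` and the rounding of the warmup step count are illustrative assumptions, and the trainer's own scheduler is authoritative.

```python
import math

# Values copied from the GRPOConfig above.
LEARNING_RATE = 5e-6
MAX_STEPS = 250
WARMUP_RATIO = 0.1
PER_DEVICE_BATCH = 8
GRAD_ACCUM = 8

def lr_at(step: int) -> float:
    """Approximate learning rate at a given optimizer step (illustrative only)."""
    warmup_steps = round(MAX_STEPS * WARMUP_RATIO)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return LEARNING_RATE * step / max(1, warmup_steps)
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, MAX_STEPS - warmup_steps)
    return LEARNING_RATE * 0.5 * (1.0 + math.cos(math.pi * progress))

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM  # samples per optimizer step on one GPU
print("effective batch size:", effective_batch)
for step in (0, 10, 25, 100, 250):
    print(f"step {step:3d}: lr = {lr_at(step):.2e}")
```

With these settings the peak learning rate is only reached at step 25, so the first tenth of the run uses a smaller step size than the configured `learning_rate`.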

In [None]:
# Update the trainer with new reward functions
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        #validity_reward_func,
        strategic_winning_reward_func,
        #format_reward_func,
        strict_format_reward_func,
        strict_move_format_reward_func,
        xml_count_reward_func,
        length_reward_func,
    ],
    args=training_args,
    train_dataset=train_dataset,
)
print("Training Started")
trainer.train()

Training Started


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 690,501 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 8 | Gradient Accumulation steps = 8
\        /    Total batch size = 64 | Total steps = 250
 "-____-"     Number of trainable parameters = 140,771,328


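| Step | Training Loss | reward | reward_std | completion_length | kl | rewards / strategic_winning_reward_func_v2 | rewards / strict_format_reward_func | rewards / strict_move_format_reward_func | rewards / xml_count_reward_func | rewards / length_reward_func |
|------|---------------|--------|------------|-------------------|----|--------------------------------------------|-------------------------------------|------------------------------------------|---------------------------------|------------------------------|
| 1 | 0.0 | -4.114453 | 0.649027 | 45.015625 | 0.0 | -1.0 | -0.476562 | -2.0 | 0.066406 | -0.704297 |
| 2 | 0.0 | -3.790782 | 1.019412 | 84.921875 | 0.0 | -1.0 | -0.453125 | -2.0 | 0.128906 | -0.466564 |
| 3 | 0.0 | -3.845703 | 0.996904 | 73.25 | 0.0 | -1.0 | -0.40625 | -2.0 | 0.160156 | -0.599609 |
| 4 | 0.0 | -3.646676 | 1.056979 | 113.8125 | 0.000101 | -1.0 | -0.429688 | -2.0 | 0.160156 | -0.377145 |
| 5 | 0.0 | -3.871103 | 0.87577 | 79.15625 | 0.00012 | -1.0 | -0.476562 | -2.0 | 0.089844 | -0.484385 |
| 6 | 0.0002 | -3.817954 | 1.028924 | 68.671875 | 0.004568 | -1.0 | -0.429688 | -2.0 | 0.1875 | -0.575767 |
| 7 | 0.0016 | -3.409688 | 1.322432 | 106.25 | 0.040064 | -1.0 | -0.359375 | -1.875 | 0.230469 | -0.405781 |
| 8 | 0.0006 | -3.176702 | 1.120627 | 174.828125 | 0.015392 | -1.0 | -0.335938 | -2.0 | 0.378906 | -0.21967 |
| 9 | 0.0005 | -2.44604 | 1.496247 | 169.375 | 0.013425 | -1.0 | -0.242188 | -1.9375 | 0.5625 | 0.171147 |
| 10 | 0.0008 | -2.298379 | 1.18918 | 190.875 | 0.020567 | -1.0 | -0.195312 | -2.0 | 0.570312 | 0.326621 |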


TrainOutput(global_step=250, training_loss=0.0024297890571985904, metrics={'train_runtime': 40277.389, 'train_samples_per_second': 0.397, 'train_steps_per_second': 0.006, 'total_flos': 0.0, 'train_loss': 0.0024297890571985904})
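
The "Number of trainable parameters = 140,771,328" figure in the training banner can be reproduced by hand. The sketch below assumes the base model is Qwen2.5-0.5B (as the repo name suggests), that LoRA is applied to all seven attention and MLP projections, and that the rank is the `lora_rank = 256` set near the top of the notebook. The layer dimensions come from the published Qwen2.5-0.5B config, not from this notebook, so treat them as assumptions.

```python
# Assumed Qwen2.5-0.5B dimensions (from its published config, not from this notebook).
HIDDEN = 896
INTERMEDIATE = 4864
NUM_LAYERS = 24
KV_DIM = 128      # 2 key/value heads x head_dim 64
LORA_RANK = 256   # lora_rank set near the top of the notebook

# (in_features, out_features) of each assumed LoRA target module in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params_per_layer(rank: int) -> int:
    """Each adapted matrix adds A (rank x in) and B (out x rank) parameters."""
    return sum(rank * (n_in + n_out) for n_in, n_out in TARGET_SHAPES.values())

per_layer = lora_params_per_layer(LORA_RANK)
total_trainable = per_layer * NUM_LAYERS
print(f"per layer: {per_layer:,}")
print(f"total:     {total_trainable:,}")
```

Under these assumptions the total matches the banner exactly, which also shows how quickly the adapter grows with rank: halving `lora_rank` would halve the trainable parameter count.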

And let's run the trainer! If you scroll up, you'll see a table of rewards. The goal is to see the `reward` column increase!

You might have to wait 150 to 200 steps for a clear trend. In the run above, the total reward starts at about -4.11 at step 1 and climbs to about -2.30 by step 10, so expect low or negative rewards early on. Please be patient!

| Step | Training Loss | reward    | reward_std | completion_length | kl       |
|------|---------------|-----------|------------|-------------------|----------|
| 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
| 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
| 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |
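
When reading the reward table from the training run, it helps to know how the `reward` column relates to the per-function columns. The sketch below takes the logged values for steps 1 and 10 and checks that `reward` equals the plain sum of the five `rewards / ...` columns. That the trainer sums them without weights is an inference from these logged numbers, not something the notebook states.

```python
COMPONENT_NAMES = [
    "strategic_winning_reward_func_v2",
    "strict_format_reward_func",
    "strict_move_format_reward_func",
    "xml_count_reward_func",
    "length_reward_func",
]

# step -> (logged total reward, logged per-function rewards), copied from the training log.
logged = {
    1: (-4.114453, [-1.0, -0.476562, -2.0, 0.066406, -0.704297]),
    10: (-2.298379, [-1.0, -0.195312, -2.0, 0.570312, 0.326621]),
}

recomputed = {step: sum(parts) for step, (_total, parts) in logged.items()}

for step, (total, parts) in logged.items():
    print(f"step {step}: logged {total:.6f}, sum of components {recomputed[step]:.6f}")
    for name, value in zip(COMPONENT_NAMES, parts):
        print(f"    {name}: {value:+.6f}")
```

Breaking the total down this way shows which reward function is driving a change: between steps 1 and 10 the gain comes mostly from the length and XML-count rewards, while the strategic and move-format rewards have barely moved.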


<a name="Inference"></a>
### Inference
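Now let's try the model we just trained! First, we generate a move for a sample position. The cell below loads a saved adapter named `grpo_saved_lora`; pass `lora_request = None` instead to test the model without any GRPO training: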

In [21]:
moves = []
prompt = f"Game State:\n- You are playing as: X\n- Your previous moves: \n- Opponent's moves: b1\n- Current board state: b1(O)\n- Next available position per column:  \nColumn a: a1, a2, a3, a4, a5, a6  \nColumn b: b2, b3, b4, b5, b6  \nColumn c: c1, c2, c3, c4, c5, c6  \nColumn d: d1, d2, d3, d4, d5, d6  \nColumn e: e1, e2, e3, e4, e5, e6  \nColumn f: f1, f2, f3, f4, f5, f6  \nColumn g: g1, g2, g3, g4, g5, g6\n\nMake your move."
text = tokenizer.apply_chat_template(
    [{"role" : "system", "content" : SYSTEM_PROMPT}, {"role" : "user", "content" : prompt}],
    tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

print(output)

Processed prompts: 100%|██████████| 1/1 [00:05<00:00,  5.29s/it, est. speed input: 88.04 toks/s, output: 70.66 toks/s]

<reasoning>
I will make the move 'c1'. This move will be a winning move for the X player, as it will result in the X player's move at column 'c' being blocked.

I will explain my thinking process:

1. I have the opportunity to win the game by making a move that leaves an X winning position for my opponent. I will choose column 'c' because it's a central position that can be easily blocked.

2. To ensure that my opponent's winning move is not easily blocked, I will make the move that leaves an X in the center column.

3. I will not move to column 'e', as it is the only column that offers a winning move, but I will move to the next available position on the same column, which is column 'f'. This is because I will be making the move that leaves an X winning position, and the opponent will be unable to block the move.

4. I will block the opponent's move by moving to column 'c' or 'f'. In this case, I will move to column 'c' (which has a 50% chance of being a winning position).

5. Since I
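
The completion shown above ends without a `<move>` tag, so no move can be read from it. Here is a minimal sketch of extracting a move from a completion and checking that the chosen column still has room. `parse_move` and `is_playable` are hypothetical helpers written for this example (they are not the notebook's `extract_xml_move`), and the single-letter move format follows the docstring of `strict_move_format_reward_func`.

```python
import re
from typing import Dict, Optional

MOVE_PATTERN = re.compile(r"<move>\s*([a-g])\s*</move>")
MAX_ROW = 6

def parse_move(completion: str) -> Optional[str]:
    """Return the column letter inside <move>...</move>, or None if absent or malformed."""
    match = MOVE_PATTERN.search(completion)
    return match.group(1) if match else None

def is_playable(move: Optional[str], next_free_row: Dict[str, int]) -> bool:
    """A move is playable if it was parsed and its column is not full."""
    return move is not None and next_free_row.get(move, MAX_ROW + 1) <= MAX_ROW

# Next free row per column for the prompt above: only b1 is occupied.
next_free_row = {"a": 1, "b": 2, "c": 1, "d": 1, "e": 1, "f": 1, "g": 1}

complete = "<reasoning>\nTake the centre.\n</reasoning>\n<move>\nd\n</move>"
truncated = "<reasoning>\nI will make the move 'c1'."

for text in (complete, truncated):
    move = parse_move(text)
    print(repr(move), is_playable(move, next_free_row))
```

Returning `None` for a missing tag keeps the "no move" case explicit, so a caller can retry generation or fall back to a default instead of playing a garbage move.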




And now for the LoRA we just trained with GRPO. We save the LoRA first!

In [23]:
model.save_lora("grpo_saved_lora_v0.0.8b")

Now we push the trained adapter to the Hugging Face Hub:

In [24]:
trainer.repo_id = model_path
trainer.push_to_hub(model_path)

  0%|          | 0/4 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/563M [00:00<?, ?B/s]

(…)ut.tfevents.1740567935.239d89ce82c0.18.0:   0%|          | 0.00/209k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/5.82k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Lyte/QuadConnect2.5-0.5B-v0.0.9b/commit/5d83b2e8dd3d0ef64cbf3c5552cf3a96506a05a2', commit_message='Lyte/QuadConnect2.5-0.5B-v0.0.9b', commit_description='', oid='5d83b2e8dd3d0ef64cbf3c5552cf3a96506a05a2', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Lyte/QuadConnect2.5-0.5B-v0.0.9b', endpoint='https://huggingface.co', repo_type='model', repo_id='Lyte/QuadConnect2.5-0.5B-v0.0.9b'), pr_revision=None, pr_num=None)

Our reasoning model is much better. It's not always correct, since we only trained it for 250 steps (about 11 hours in the run above). It'll be better if we extend the sequence length and train for longer!
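
For a concrete sense of training cost, the `TrainOutput` metrics printed after training can be turned into wall-clock figures. The sketch below uses only numbers from that output and from the training banner (250 steps, total batch size 64); the variable names are illustrative.

```python
# Values copied from the TrainOutput metrics and the training banner.
train_runtime_s = 40277.389
global_step = 250
total_batch_size = 64

hours = train_runtime_s / 3600
seconds_per_step = train_runtime_s / global_step
samples_seen = global_step * total_batch_size
samples_per_second = samples_seen / train_runtime_s

print(f"wall-clock time:    {hours:.2f} h")
print(f"time per step:      {seconds_per_step:.1f} s")
print(f"samples processed:  {samples_seen}")
print(f"samples per second: {samples_per_second:.3f}")
```

The recomputed throughput agrees with the reported `train_samples_per_second` of 0.397, and the per-step time gives a quick way to budget longer runs: doubling `max_steps` should roughly double the wall-clock time on the same hardware.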

<a name="Save"></a>
### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [26]:
# Merge to 16bit
if True: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if True: model.push_to_hub_merged(model_path, tokenizer, save_method = "merged_16bit", token = HF_TOKEN)

Unsloth: You have 2 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 553.5M


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 12.19 out of 31.35 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 24/24 [00:00<00:00, 63.66it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model/pytorch_model.bin...
Done.


Unsloth: You are pushing to hub in Kaggle environment.
To save memory, we shall move Lyte/QuadConnect2.5-0.5B-v0.0.9b to /tmp/QuadConnect2.5-0.5B-v0.0.9b


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 12.16 out of 31.35 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 24/24 [00:00<00:00, 90.40it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving /tmp/QuadConnect2.5-0.5B-v0.0.9b/pytorch_model.bin...


README.md:   0%|          | 0.00/2.12k [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

pytorch_model.bin:   0%|          | 0.00/988M [00:00<?, ?B/s]

Done.
Saved merged model to https://huggingface.co/Lyte/QuadConnect2.5-0.5B-v0.0.9b


### GGUF / llama.cpp Conversion
We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. Other methods such as `q4_k_m` are also allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [29]:
# Save to multiple GGUF options - much faster if you want multiple!
if True:
    model.push_to_hub_gguf(
        model_path, # Hub repo id; change it to your own username/repo!
        tokenizer,
        quantization_method = ["q8_0"],
        token = HF_TOKEN,
    )

Cloning into 'llama.cpp'...
Submodule 'kompute' (https://github.com/nomic-ai/kompute.git) registered for path 'ggml/src/ggml-kompute/kompute'
Cloning into '/kaggle/working/llama.cpp/ggml/src/ggml-kompute/kompute'...
Submodule path 'ggml/src/ggml-kompute/kompute': checked out '4565194ed7c32d1d2efa32ceab4d3c6cae006306'
make: Entering directory '/kaggle/working/llama.cpp'
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.34.1")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAV

100%|██████████| 24/24 [00:00<00:00, 93.03it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving Lyte/QuadConnect2.5-0.5B-v0.0.9b/pytorch_model.bin...
Done.


Unsloth: Converting qwen2 model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at Lyte/QuadConnect2.5-0.5B-v0.0.9b into q8_0 GGUF format.
The output location will be /kaggle/working/Lyte/QuadConnect2.5-0.5B-v0.0.9b/unsloth.Q8_0.gguf
This might take 3 minutes...
2025-02-26 22:20:59.085794: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-02-26 22:20:59.111416: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q8_0.gguf:   0%|          | 0.00/531M [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/Lyte/QuadConnect2.5-0.5B-v0.0.9b


Now, use the `unsloth.Q8_0.gguf` file in llama.cpp or a UI-based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui).
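
As a rough check on what `q8_0` buys you, the sketch below compares the upload sizes reported in the logs above (988M for the merged 16-bit `pytorch_model.bin`, 531M for `unsloth.Q8_0.gguf`) with the ratio expected from the quantization format. It assumes the llama.cpp `Q8_0` layout of 32 8-bit values plus one 16-bit scale per block; that layout is not stated in this notebook, and metadata or tensors kept at other precisions will shift the real ratio slightly.

```python
# Sizes in MB, copied from the upload progress lines in the logs above.
merged_16bit_mb = 988
q8_0_gguf_mb = 531

# Assumed Q8_0 block layout: 32 int8 weights + one 16-bit scale per block.
BLOCK_WEIGHTS = 32
BLOCK_BYTES = BLOCK_WEIGHTS * 1 + 2

bits_per_weight = BLOCK_BYTES * 8 / BLOCK_WEIGHTS
expected_ratio = bits_per_weight / 16  # relative to 16-bit weights
observed_ratio = q8_0_gguf_mb / merged_16bit_mb

print(f"bits per weight (Q8_0): {bits_per_weight}")
print(f"expected size ratio:    {expected_ratio:.3f}")
print(f"observed size ratio:    {observed_ratio:.3f}")
```

The observed ratio lands close to the expected one, so for this model `q8_0` roughly halves the file size compared with the merged 16-bit checkpoint.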

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs, want to keep up with the latest LLM news, need help, or want to join projects, feel free to join our Discord!

Some other links:
1. Llama 3.2 Conversational notebook. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

