# Power Predictor

Standard is a format of the trading card game Magic: the Gathering in which players are only allowed to use cards from the past few years of set releases. Because of the limited card pool, new additions to the format via new set releases often have a dramatic effect, but it can be hard to predict what cards will be powerful when they release. 

This project seeks to rate Magic: the Gathering cards for power level in the current Standard format on a scale from 1-5, with 1 being completely unplayable and irrelevant, and 5 being format warping and broken. In order to do this effectively, I used two methods, and compared them to find the best approach. 

The first was to call OpenAI's API with the gpt-4-turbo model. Given a card name, this module fetches the relevant card data from the Scryfall API, including Oracle text, mana cost, and types. The data is then passed to the model along with a carefully engineered prompt, and the model responds with a rating and a rationale for that rating.

The second approach was to download a pretrained LLM from Hugging Face and fine-tune it on a list of Standard-legal Magic: the Gathering cards, each paired with a power rating.

#### Scryfall fetch code

Function that fetches card data from Scryfall. The data is stored in a dictionary and returned.

In [13]:
import requests

def get_card_info(card_name):
        """
        Fetch Magic: The Gathering card information from Scryfall API
        
        Args:
            card_name: Name of the MTG card to search for
            
        Returns:
            Dictionary with card information or None if card not found
        """
        # Pass the card name as a query parameter so requests URL-encodes it
        url = "https://api.scryfall.com/cards/named"
        
        try:
            response = requests.get(url, params={"exact": card_name}, timeout=10)
            
            # Check if the request was successful (Scryfall answers with an error status for unknown names)
            if response.ok:
                card_data = response.json()
                
                # Extract the requested information
                card_info = {
                    'name': card_data.get('name'),
                    'mana_cost': card_data.get('mana_cost'),
                    'types': card_data.get('type_line'),
                    'oracle_text': card_data.get('oracle_text'),
                    'power': card_data.get('power'),
                    'toughness': card_data.get('toughness'),
                    'loyalty': card_data.get('loyalty')
                }
                
                return card_info
            else:
                print(f"Error: Could not find card named '{card_name}'")
                return None
                
        except Exception as e:
            print(f"An error occurred: {e}")
            return None

## API Call

This section contains the code and prompts for the API call portion of the project.

### Imports
* requests: easy get requests from Scryfall API
* openai: API library
* Markdown, display: Library for displaying markdown output from the API



In [12]:
import requests
import openai
from IPython.display import display, Markdown

#### Get card info
This function retrieves relevant information about a card from Scryfall, a database of Magic: the Gathering cards. It stores the information in a dictionary (card_info) that maps descriptive keys to the card's fields, and returns that dictionary.

In [3]:
def get_card_info(card_name):
        """
        Fetch Magic: The Gathering card information from Scryfall API
        
        Args:
            card_name: Name of the MTG card to search for
            
        Returns:
            Dictionary with card information or None if card not found
        """
        # Pass the card name as a query parameter so requests URL-encodes it
        url = "https://api.scryfall.com/cards/named"
        
        try:
            response = requests.get(url, params={"exact": card_name}, timeout=10)
            
            # Check if the request was successful (Scryfall answers with an error status for unknown names)
            if response.ok:
                card_data = response.json()
                
                # Extract the requested information
                card_info = {
                    'image_uris': card_data.get('image_uris'),
                    'name': card_data.get('name'),
                    'mana_cost': card_data.get('mana_cost'),
                    'types': card_data.get('type_line'),
                    'oracle_text': card_data.get('oracle_text'),
                    'power': card_data.get('power'),
                    'toughness': card_data.get('toughness'),
                    'loyalty': card_data.get('loyalty')
                }
                
                return card_info
            else:
                print(f"Error: Could not find card named '{card_name}'")
                return None
                
        except Exception as e:
            print(f"An error occurred: {e}")
            return None

#### System Prompt

This is the system prompt, which is given to the API before it is prompted with a card. First, it outlines the task, which is to rate a card on a scale from 1-5 in the Standard format, which is briefly described. Then each number on the scale is given a definition. The model is told what the analysis of the score should include, and then there are 5 examples, 1 for each tier of power. This lets the model learn in-context and provides a baseline against which it can compare the cards it is asked to rate. I attempted to choose a diverse set of cards as examples so that the model would have information about all types of cards.

In [4]:
system_prompt = """You are a Magic: The Gathering expert specializing in evaluating cards for Standard format play.
    The best Standard decks in the format are highly value centered and full of removal and powerful threats.
    Analyze the provided card based on its mana cost, types, oracle text, and power/toughness if applicable.
    Rate the card on a scale of 1-5 where:
    
    1: Unplayable in every situation, and outclassed by other cards
    2: Of only average strength. Playable in niche archetypes or for specific sideboard uses
    3: Strong cards that can be played in a variety of decks, but mainly as a supporting card
    4: Exceptionally strong cards that inspire deck archetypes and provide lots of value on their own
    5: Broken and format-warping card that defines the metagame
    
    Your analysis should include:
    1. Power Rating (1-5)
    2. Strengths of the card
    3. Weaknesses or limitations
    
    Here is an example of a 1: 
    Card Name: "Air Marshal"
    Mana Cost: 1U
    Types: Creature - Human Soldier
    Oracle Text: 3: Target Soldier gains flying until end of turn.
    Power: 2
    Toughness: 1
    
    This card provides a mediocre body for its cost, and the cost of 3 mana to give a creature flying is far too high for the effect.
    It is outclassed by many other cards in the format and does not provide enough value to be worth playing.
    
    Here is an example of a 2:
    Card Name: "Abrade"
    Mana Cost: 1R
    Types: Instant
    Oracle text: Choose one - Abrade deals 3 damage to target creature; or destroy target artifact.
    Power: N/A
    Toughness: N/A
    This card provides decent utility, but at 2 mana it doesn't provide enough value to be worth playing in most decks.
    
    Here is an example of a 3:
    Card Name: Amalia Benavides Aguirre
    Mana Cost: WB
    Types: Legendary Creature - Vampire Scout
    Oracle text: Ward - Pay 3 life. Whenever you gain life, Amalia Benavides Aguirre explores. Then destroy all other creatures if its power is exactly 20. (To have this creature explore, reveal the top card of your library. Put that card into your hand if it is a land. Otherwise, put a +1/+1 counter on this creature, then put the card back or put it into your graveyard.)
    Power: 2
    Toughness: 2
    This card provides excellent value for life gain decks via growing power and toughness and lots of card selection. Because of this, it is a powerful addition to decks that gain life incrementally.
    
    Here is an example of a 4: 
    Card Name: "Overlord of the Hauntwoods"
    Mana Cost: 3GG
    Types: Enchantment Creature - Avatar Horror
    Oracle Text: Impending 4—1GreenGreen (If you cast this spell for its impending cost, it enters with four time counters and isn't a creature until the last is removed. At the beginning of your end step, remove a time counter from it.) Whenever this permanent enters or attacks, create a tapped colorless land token named Everywhere that is every basic land type.
    This card provides immense value for ramp decks and domain decks via its ability to create a land token that is every basic land type. Furthermore, it is flexible in that it can be played for full price as a creature, or early for its impending cost.
    This card is very powerful, and finds its way into many decks.
    
    Here is an example of a 5:
    Card Name: "Up the Beanstalk"
    Mana Cost: 1G
    Types: Enchantment
    Oracle text: When Up the Beanstalk enters the battlefield and whenever you cast a spell with mana value 5 or greater, draw a card.
    Power: N/A
    Toughness: N/A
    
    This card provides immediate value upon entering via drawing a card, and provides powerful repeated value throughout the game.
    Furthermore, because of synergies with other strong cards in the format, it is very easy to trigger the card draw effect.
    This card is the reason many of the best decks exist, and is a format-defining card.
    """
    

#### OpenAI API key

Insert OpenAI API key here to use the model.

In [None]:
API_KEY = "KEY"
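
As an alternative to pasting the key into the notebook, it can be read from an environment variable. The following is a minimal sketch, not part of the original project: the helper name `load_api_key` is hypothetical, and `OPENAI_API_KEY` is assumed to be the variable the key was exported under.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable.

    `load_api_key` is a hypothetical helper; the variable name is an assumption.
    """
    key = os.environ.get(env_var)
    if not key:
        # Fail early with a clear message instead of sending an empty key
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return key


# Usage (uncomment once the variable is exported in the shell):
# API_KEY = load_api_key()
```

Keeping the key out of the notebook avoids committing it by accident when the notebook is shared.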

#### API Call

This function calls the LLM API. It is passed the dictionary of card information and inserts it into the user prompt to elicit a response. A limit of 1000 max tokens is chosen because the task can be quite complicated and sometimes requires extensive analysis. A low temperature of 0.1 keeps sampling close to the most likely tokens, so the model is less likely to produce an improbable rating and its answers stay consistent across runs.

In [6]:
def analyze_card(card_info):
    try:
        client = openai.OpenAI(api_key=API_KEY)
        response = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[
                {
                    "role": "system",
                    "content": system_prompt
                },
                {
                    "role": "user",
                    "content": f"Please analyze this Magic: The Gathering card:\n\nName: {card_info['name']}\nMana Cost: {card_info['mana_cost']}\nTypes: {card_info['types']}\nOracle Text: {card_info['oracle_text']}\nPower/Toughness: {card_info['power']}/{card_info['toughness']}\nLoyalty: {card_info['loyalty']}\n\n"
                }
            ],
            temperature=0.1,
            max_tokens=1000
        )
        return response.choices[0].message.content
    except Exception as e:
        return f"Error: {str(e)}"
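
Because `analyze_card` returns free-form Markdown, comparing the model against hand-assigned ratings requires pulling the number out of the text. The sketch below is one possible way to do that, assuming the response contains a line such as `Power Rating: 5` or `**Power Rating: 3**`, as in the outputs shown in this notebook. `extract_rating` is a hypothetical helper, not part of the original code.

```python
import re

# Matches "Power Rating: 4", "**Power Rating: 3**" and "**Power Rating:** 4"
RATING_PATTERN = re.compile(r"Power Rating:\**\s*([1-5])", re.IGNORECASE)


def extract_rating(response_text):
    """Return the 1-5 rating found in the response, or None if there is none."""
    match = RATING_PATTERN.search(response_text)
    return int(match.group(1)) if match else None


print(extract_rating("Power Rating: 5\n\nStrengths of the card:"))  # 5
print(extract_rating("**Power Rating: 3**"))                        # 3
print(extract_rating("Error: request failed"))                      # None
```

Returning `None` rather than raising keeps a failed API call (which `analyze_card` reports as an `Error: ...` string) from stopping a batch run.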

#### Driver code

This cell calls each function defined above and displays the LLM's response, which is returned in Markdown format.

In [7]:

# Fetch, rate, and display each card in turn
cards_to_analyze = ["Up the Beanstalk", "Abrade", "Abzan Monument", "Monastery Swiftspear"]
for card_to_analyze in cards_to_analyze:
    info = get_card_info(card_to_analyze)
    rating = analyze_card(info)
    print(card_to_analyze)
    display(Markdown(rating))


Up the Beanstalk


Power Rating: 5

Strengths of the card:
1. **Immediate Value**: "Up the Beanstalk" provides immediate value upon entering the battlefield by drawing a card. This helps mitigate the cost of playing the enchantment by replacing itself in your hand, ensuring that you do not lose card advantage.
2. **Recurring Value**: The ability to draw a card whenever you cast a spell with mana value 5 or greater can lead to significant card advantage over the course of a game. This is particularly strong in formats where higher-cost spells are prevalent and impactful.
3. **Deck Synergy**: This card fits well into decks that naturally want to play larger spells, such as ramp decks or control decks that stabilize and then play high-impact spells. It synergizes with the game plan of casting big, game-changing spells by rewarding you with additional card draw.
4. **Low Mana Cost**: At only 2 mana, this enchantment is very easy to incorporate into a variety of game plans without setting back your development. Its low cost also makes it a minimal risk investment in terms of tempo.

Weaknesses or limitations:
1. **Dependency on Deck Composition**: The card's effectiveness is highly dependent on having a sufficient number of spells with mana value 5 or greater in your deck. In decks without these higher-cost spells, its value diminishes significantly.
2. **No Immediate Impact on the Board**: While it provides card advantage, "Up the Beanstalk" does not affect the board state directly when played. This could be a disadvantage in very aggressive metagames where tempo and board presence are more critical than card advantage.

Overall, "Up the Beanstalk" is a powerful enchantment that can define the structure and strategy of the decks it's included in. Its ability to consistently generate card advantage in the right deck makes it a format-defining card, meriting a rating of 5. It encourages and rewards building around a specific type of deck, influencing the metagame significantly.

Abrade


Power Rating: 4

Strengths of the card:
1. **Flexibility**: Abrade's primary strength lies in its flexibility. The ability to choose between dealing 3 damage to a creature or destroying an artifact makes it highly versatile and valuable in a variety of game situations. This adaptability allows it to be effective in both the early and late game, handling threats or key utility artifacts.
2. **Cost Efficiency**: At a cost of only {1}{R}, Abrade is very mana-efficient. This low cost enables players to remain reactive and flexible with their mana, potentially casting other spells in the same turn.
3. **Instant Speed**: Being an instant significantly enhances Abrade's utility, allowing players to use it as a combat trick or in response to an opponent's actions during their turn. This can disrupt opponent strategies and provide surprise interactions that can swing the game.

Weaknesses or limitations:
1. **Damage Limitation**: The damage cap of 3 means that Abrade cannot deal with larger creatures that are common in many Standard formats. This limits its effectiveness against some of the format's bigger threats.
2. **No Player or Planeswalker Targeting**: Abrade is limited to targeting creatures and artifacts only. It cannot target players or planeswalkers, which restricts its utility compared to some other removal spells that offer broader targeting options.

Overall, Abrade's strengths in flexibility and cost efficiency make it an exceptionally strong card that can fit into multiple deck archetypes, especially in environments where artifacts are prevalent or where efficient creature removal is necessary. Its limitations are minor compared to the broad utility it offers, making it a staple in many Standard decks.

Abzan Monument


**Power Rating: 3**

**Strengths of the card:**
1. **Mana Efficiency and Ramp:** Abzan Monument has a low initial mana cost of {2}, making it an accessible early-game play. Its ability to fetch a basic Plains, Swamp, or Forest upon entering the battlefield aids in mana fixing and slight ramping, which is crucial in a three-color deck.
2. **Flexibility in Token Generation:** The second ability to create an X/X white Spirit creature token, where X is the greatest toughness among creatures you control, can be a significant late-game advantage. This ability allows for the creation of potentially large blockers or attackers, depending on the board state.
3. **Synergy with Creature-Based Strategies:** In decks that focus on creatures with high toughness, this artifact can consistently generate large tokens, making it a valuable asset in defensive or attrition strategies.

**Weaknesses or Limitations:**
1. **Sorcery Speed Restriction:** The activation of the token-generating ability only at sorcery speed limits its flexibility, particularly in reactive or instant-speed centered strategies. This restriction prevents it from being used as a surprise blocker or in response to removal, which could be a significant downside in some tactical situations.
2. **Dependency on Board State:** The value of the token generated depends heavily on the presence of creatures with high toughness. In scenarios where the board is wiped or the player is unable to establish a substantial creature presence, this artifact's utility diminishes greatly.
3. **One-Time Use of the Second Ability:** The requirement to sacrifice Abzan Monument to use its token-generating ability means it's a one-time effect. This can be a considerable limitation if the game extends and continuous value generation is necessary.

**Overall Evaluation:**
Abzan Monument is a strong card in decks that can utilize both its mana-fixing early game and its potential to create large creature tokens in the mid to late game. It fits well into Abzan-colored decks that focus on creature-based strategies, particularly those that can maintain a board state with creatures of high toughness. However, its limitations in flexibility and dependency on the board state prevent it from being a universally powerful card in all types of decks, thus earning it a rating of 3.

Monastery Swiftspear


Power Rating: 4

Strengths of the card:
1. **Low Mana Cost**: Monastery Swiftspear costs only a single red mana, making it extremely easy to deploy early in the game. This low cost also allows for multiple spells to be played in the same turn, maximizing its prowess ability.
2. **Haste**: The haste ability allows Monastery Swiftspear to attack immediately, providing immediate impact on the game by pressuring the opponent's life total from the very first turn it is played.
3. **Prowess**: This ability is particularly strong in decks that utilize a high number of noncreature spells (such as instants and sorceries). Each spell cast not only furthers the player's game plan but also temporarily boosts Monastery Swiftspear, potentially increasing the damage output significantly.
4. **Flexibility**: Its inclusion in a variety of aggressive and spell-heavy decks (like Burn, Red Deck Wins, or Izzet Spells) showcases its versatility and ability to fit into multiple strategies effectively.

Weaknesses or limitations:
1. **Fragility**: With a base toughness of 2, Monastery Swiftspear is vulnerable to many of the commonly played removal spells in the format, such as Shock or Stomp. This can sometimes lead to unfavorable trades or easy removal by the opponent.
2. **Dependence on Noncreature Spells**: To maximize the value of Monastery Swiftspear, a deck needs to be built with a sufficient number of noncreature spells. In decks not designed with this synergy in mind, its effectiveness can be significantly diminished.

Overall, Monastery Swiftspear is an exceptionally strong card in the right deck, capable of applying early pressure and scaling its threat level as the game progresses. Its ability to inspire aggressive red-based deck archetypes and its proven track record in competitive play across multiple formats solidify its high rating.

### Results

Overall, the model performs quite well with most cards. Its ratings are reasonable and mostly agree with my own, and it gives a quality explanation of why each card belongs at the rating it chose.

Problems: The model often rates cards higher than they deserve. For example, as seen above, it rates Abrade a 4 and Abzan Monument a 3. I would rate both a 2, since they are playable only in specific decks. The model tends to give higher scores to cards with flexible uses. Abrade has two modes to choose from, and Abzan Monument has a mode that can be activated in the late game to provide a powerful creature. The model cites this flexibility in its analysis as one of the reasons for its rating. This could potentially be addressed with further additions to the prompt.

In conclusion, this model performs adequately in most circumstances, especially on simpler cards. It tends to over-rate cards, giving them scores roughly 1 point higher than is accurate. However, the analysis it gives is very good, and it identifies the strengths and weaknesses of each card quite effectively. One problem with the model is its lack of awareness of other cards in the format. An important factor in whether a card is good in Standard is the power of the other cards that synergize with it. Since this approach cannot know what is in Standard at the current moment, it can't take this into account.
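
The over-rating tendency can be expressed as a mean signed error between the model's ratings and reference ratings. The sketch below is only an illustration on the three cards whose reference ratings are stated in this notebook (Up the Beanstalk is the prompt's example of a 5, while Abrade and Abzan Monument are both described above as 2s). The function name `mean_signed_error` is hypothetical, and three cards are far too few to be a real evaluation.

```python
def mean_signed_error(predicted, reference):
    """Average of (predicted - reference) over cards present in both dicts.

    A positive value means the predictions run high on average.
    """
    shared = predicted.keys() & reference.keys()
    if not shared:
        raise ValueError("No cards in common between the two rating sets")
    return sum(predicted[name] - reference[name] for name in shared) / len(shared)


# Ratings produced by the API in the outputs above
model_ratings = {"Up the Beanstalk": 5, "Abrade": 4, "Abzan Monument": 3}
# Reference ratings taken from the prompt examples and the discussion above
author_ratings = {"Up the Beanstalk": 5, "Abrade": 2, "Abzan Monument": 2}

print(mean_signed_error(model_ratings, author_ratings))  # 1.0
```

On this tiny sample the bias comes out to one point, in line with the observation above; a larger labeled set would be needed to confirm it.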

## Fine Tuned LLM

This section contains the code and output for the fine-tuned DistilBERT model.

#### Imports
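* pandas, datasets, scikit-learn: data tools (dataframes, Hugging Face datasets, train/validation split)
* predictor: this project's own module, which provides the Scryfall fetch code
* transformers, peft, torch: LLM and training tools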

In [14]:

import pandas as pd
import predictor
import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
)
from peft import LoraConfig, get_peft_model, TaskType
from sklearn.model_selection import train_test_split
from datasets import Dataset

#### Login to Huggingface

Provide a Hugging Face access token to load DistilBERT.

In [15]:
from huggingface_hub import notebook_login
notebook_login()


#### Model Hyperparameters

Relevant decisions:
* Model: distilbert-base-uncased chosen for size and classification prowess
* Learning rate: low learning rate chosen to help avoid overfitting on the small dataset
* Batch size: small batch size to help avoid overfitting
* Low epoch number for the same reason as above


In [16]:
print("configuring model...")
MODEL_NAME = "distilbert/distilbert-base-uncased"
OUTPUT_DIR = "./mtg_rating_model"
LORA_R = 16
LORA_ALPHA = 32
LORA_DROPOUT = 0.05
LEARNING_RATE = 1e-4
BATCH_SIZE = 4
EPOCHS = 5
MAX_LENGTH = 512

configuring model...


### Dataset

This dataset consists of 100 cards paired with ratings that I chose based on my experience with the game and on the MTGGoldfish metagame page (https://www.mtggoldfish.com/metagame/standard/full#paper). The cards were chosen semi-randomly from all the cards available in the format.

In [18]:
cards = [
        "Adventuring Gear",
        "Basilisk Collar",
        "Bloodthorn Flail",
        "Carnelian Orb of Dragonkind",
        "Carrot Cake",
        "Cori-Steel Cutter",
        "Gilded Lotus",
        "Golden Argosy",
        "Monument to Endurance",
        "Perilous Snare",
        "Racers' Scoreboard",
        "Rope",
        "Runaway Boulder",
        "Abhorrent Oculus",
        "Ajani's Pridemate",
        "Ash, Party Crasher",
        "Ball Lightning",
        "Beza, the Bounding Spring",
        "Bloodghast",
        "Boulderborn Dragon",
        "Brightblade Stoat",
        "Defiler of Vigor",
        "Diregraf Ghoul",
        "Edgewall Pack",
        "Elvish Archdruid",
        "Essence Channeler",
        "Evolved Sleeper",
        "Fang Guardian",
        "Fangkeeper's Familiar",
        "Friendly Teddy",
        "Fynn, the Fangbearer",
        "Greedy Freebooter",
        "Halo-Charged Skaab",
        "Haughty Djinn",
        "Heartfire Hero",
        "Hinterland Sanctifier",
        "Ingenious Leonin",
        "Iridescent Vinelasher",
        "Jolly Gerbils",
        "Kiora, the Rising Tide",
        "Knight-Errant of Eos",
        "Kraul Whipcracker",
        "Llanowar Elves",
        "Manifold Mouse",
        "Mintstrosity",
        "Nurturing Pixie",
        "Overlord of the Hauntwoods",
        "Overlord of the Boilerbilges",
        "Pride of the Road",
        "Rankle and Torbran",
        "Savage Ventmaw",
        "Savannah Lions",
        "Screaming Nemesis",
        "Severance Priest",
        "Sire of Seven Deaths",
        "Skirmish Rhino",
        "Spiteful Hexmage",
        "Tangled Colony",
        "Up the Beanstalk",
        "Caretaker's Talent",
        "Colossification",
        "Disturbing Mirth",
        "Leyline of Resonance",
        "Lost in the Maze",
        "Nahiri's Resolve",
        "Nowhere to Run",
        "Phyrexian Arena",
        "Tribute to the World Tree",
        "Monstrous Rage",
        "Abrade",
        "Aetherize",
        "Bite Down",
        "Flame Lash",
        "Get Out",
        "Get Lost",
        "This Town Ain't Big Enough",
        "Negate",
        "On the Job",
        "Opt",
        "Rat Out",
        "Refute",
        "Ride's End",
        "Slick Sequence",
        "Steer Clear",
        "Torch the Tower",
        "Abuelo's Awakening",
        "Boltwave",
        "Captain's Call",
        "Deathmark",
        "Excavation Explosion",
        "Exorcise",
        "Feed the Swarm",
        "Jailbreak Scheme",
        "Lunar Insight",
        "Maelstrom Pulse",
        "Pyroclasm",
        "Rankle's Prank",
        "Zombify",
        "Slime Against Humanity",
        "Sunfall"
    ]

ratings = [1, 2, 1, 2, 3, 4, 2, 2, 4, 3, 1, 2, 1, 4, 2, 2, 2, 4, 3, 1, 2, 3, 2, 2, 3, 3, 2, 1, 3, 1, 3, 3, 2, 3, 4, 3, 1, 3, 1, 3, 4, 2, 4, 3, 2, 3, 3, 4, 1, 2, 2, 2, 3, 2, 2, 3, 2, 2, 4, 1, 3, 4, 1, 1, 3, 3, 3, 5, 2, 3, 2, 1, 3, 3, 5, 3, 1, 3, 2, 2, 3, 2, 2, 3, 3, 2, 3, 2, 2, 1, 2, 2, 1, 2, 3, 3, 2, 2, 4, 3]
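
Since the two lists are aligned only by position, a quick sanity check before training can catch a missing or extra entry. The following is a sketch of such a check, run here on the first five cards and ratings from the lists above; `validate_labels` is a hypothetical helper and not part of the original notebook.

```python
from collections import Counter


def validate_labels(cards, ratings, low=1, high=5):
    """Check that cards and ratings line up, then return the rating distribution."""
    if len(cards) != len(ratings):
        raise ValueError(f"{len(cards)} cards but {len(ratings)} ratings")
    out_of_range = [(c, r) for c, r in zip(cards, ratings) if not low <= r <= high]
    if out_of_range:
        raise ValueError(f"Ratings outside {low}-{high}: {out_of_range}")
    duplicates = [name for name, count in Counter(cards).items() if count > 1]
    if duplicates:
        raise ValueError(f"Duplicate cards: {duplicates}")
    return Counter(ratings)


# First five entries of the lists above, used as a small sample
sample_cards = ["Adventuring Gear", "Basilisk Collar", "Bloodthorn Flail",
                "Carnelian Orb of Dragonkind", "Carrot Cake"]
sample_ratings = [1, 2, 1, 2, 3]

distribution = validate_labels(sample_cards, sample_ratings)
print(sorted(distribution.items()))  # [(1, 2), (2, 2), (3, 1)]
```

Calling `validate_labels(cards, ratings)` on the full lists would also show how unevenly the five classes are represented, which matters for a dataset this small.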


##### Scryfall fetch

The code below fetches card data from the Scryfall API, concatenates it into a string, and adds it to the cards_with_data list. The final data is stored in the dataset_df dataframe for tokenization later.


In [19]:

cards_with_data = []
for card in cards:
    card_info = predictor.card_utils.get_card_info(card)
    with_data = f"Name: {card_info['name']}\nMana Cost: {card_info['mana_cost']}\nTypes: {card_info['types']}\nOracle Text: {card_info['oracle_text']}\nPower/Toughness: {card_info['power']}/{card_info['toughness']}\nLoyalty: {card_info['loyalty']}\n\n."
    cards_with_data.append(with_data)
dataset_df = pd.DataFrame({"card_text": cards_with_data, "rating": ratings})

#### Tokenization

Tokenization function that, when passed a batch of examples and a tokenizer, tokenizes the inputs and returns them along with their labels. It uses the DistilBERT tokenizer from Hugging Face.

In [20]:

# Tokenize the data for sequence classification
def tokenize_function(examples, tokenizer):
    """Tokenize examples for sequence classification"""
    # Format the examples as input text
    texts = [
        f"Rate the Magic: The Gathering card '{card_text}' on a scale from 1 to 5 where 1 is irrelevant to the current standard format and 5 is format-warping."
        for card_text in examples["card_text"]
    ]
    
    # Tokenize with padding and truncation
    tokenized = tokenizer(
        texts, 
        padding="max_length",
        truncation=True,
        max_length=MAX_LENGTH,
        return_tensors="pt"
    )
    
    # Convert ratings to labels (subtract 1 to make labels 0-4 instead of 1-5)
    tokenized["labels"] = [label - 1 for label in examples["rating"]]
    
    return tokenized


### Driver Code

This is the main driver code that will run all of the functions defined above, and execute fine-tuning of the model

##### Split into training and test data

Using an 80/20 split, split the dataset into training and validation dataframes.

In [21]:
train_df, val_df = train_test_split(dataset_df, test_size=0.2, random_state=42)

train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)

##### Load model

Load the distilbert-base-uncased model from pretrained, load the tokenizer, and declare our model

In [22]:
print(f"Loading {MODEL_NAME}...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    
# Make sure the tokenizer has a pad token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    
# Load the model with num_labels=5 for the 1-5 rating scale
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=5,  # 5 classes (ratings 1-5)
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
    

Loading distilbert/distilbert-base-uncased...


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


##### Create peft config and apply to model

PEFT (parameter-efficient fine-tuning) is a family of techniques that adapt a pre-trained model to a new task while updating only a small number of parameters. Using PEFT with LoRA, I can easily train the model on my desktop GPU, and even a laptop CPU should be able to handle training the model with this method.


In [23]:
peft_config = LoraConfig(
    inference_mode=False,
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type=TaskType.SEQ_CLS,  # Sequence classification task
    target_modules=["q_lin", "k_lin", "v_lin", "o_lin"]  # Attention projections; DistilBERT names its output projection "out_lin", so "o_lin" likely matches no module
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

trainable params: 1,036,805 || all params: 67,994,122 || trainable%: 1.5248


### Training compute power
As can be seen above, because PEFT is being used, the number of trainable parameters is cut to roughly 1.5% of the total, less than a 60th of what it would otherwise be. Because of this optimization, training is trivial for my desktop GPU, an RTX 3070. Even without PEFT, the 3070's 8 GB of VRAM could quite easily hold all of the nearly 68,000,000 parameters. DistilBERT is a very small model, so training and running it is cheap.
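
The trainable-parameter count printed above can be reproduced by hand. The sketch below is my own reconstruction, not something the notebook states. It assumes DistilBERT base has 6 layers with hidden size 768, that each LoRA adapter on a linear layer adds `r * (in_features + out_features)` weights, that the classification head (`pre_classifier` and `classifier`) stays fully trainable for sequence classification, and that only `q_lin`, `k_lin`, and `v_lin` are matched by `target_modules`.

```python
def lora_params(in_features, out_features, rank):
    """Weights added by one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


HIDDEN = 768        # assumed hidden size of distilbert-base-uncased
LAYERS = 6          # assumed number of transformer layers
RANK = 16           # LORA_R from the configuration cell
NUM_LABELS = 5      # ratings 1-5
MATCHED_MODULES = ["q_lin", "k_lin", "v_lin"]  # assumed matches per layer

adapter_total = LAYERS * len(MATCHED_MODULES) * lora_params(HIDDEN, HIDDEN, RANK)

# Classification head: weights plus biases of two linear layers
pre_classifier = HIDDEN * HIDDEN + HIDDEN
classifier = HIDDEN * NUM_LABELS + NUM_LABELS
head_total = pre_classifier + classifier

trainable = adapter_total + head_total
print(adapter_total, head_total, trainable)  # 442368 594437 1036805
```

Under these assumptions the total matches the printed 1,036,805, and more than half of the trainable weights belong to the newly initialized classification head rather than to the LoRA adapters.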

##### Tokenize the dataset and process into train and validation

In [24]:
def tokenize_dataset(examples):
    return tokenize_function(examples, tokenizer)
    
# Process datasets with batched=True for efficiency
tokenized_train = train_dataset.map(
    tokenize_dataset,
    batched=True,
    remove_columns=train_dataset.column_names
)
    
tokenized_val = val_dataset.map(
    tokenize_dataset,
    batched=True,
    remove_columns=val_dataset.column_names
)

Map:   0%|          | 0/80 [00:00<?, ? examples/s]

Map:   0%|          | 0/20 [00:00<?, ? examples/s]

##### Define training arguments and trainer using variables declared above

In [25]:
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    overwrite_output_dir=True,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    learning_rate=LEARNING_RATE,
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    push_to_hub=False,
    fp16=False,
)
   
# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
)

No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


##### Train and save model

In [26]:
print("Starting fine-tuning...")
trainer.train()
    
# Save the model
print(f"Saving model to {OUTPUT_DIR}...")
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
        
print("Training complete!")

Starting fine-tuning...


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 1.4523 | 1.390625 |
| 2 | 1.4211 | 1.310938 |
| 3 | 1.4008 | 1.301563 |
| 4 | 1.3438 | 1.301563 |
| 5 | 1.2844 | 1.301563 |


Saving model to ./mtg_rating_model...
Training complete!


### Predict card rating

This function predicts the rating for a card when passed the card name, a model, and a tokenizer. It calls the Scryfall API for the card data, tokenizes it, passes it to the model, and returns the predicted rating along with the probability assigned to each rating.

In [27]:
def predict_card_rating(card_name, model, tokenizer, max_length=512):
    """
    Predict the rating for a specific Magic: The Gathering card
    
    Args:
        card_name: Name of the card to rate
        model: The fine-tuned model
        tokenizer: The tokenizer for the model
        max_length: Maximum sequence length
        
    Returns:
        rating: Predicted rating (1-5)
        confidence: Confidence scores for each class
    """
    # Get card information
    try:
        card_info = get_card_info(card_name)
        card_text = f"Name: {card_info['name']}\nMana Cost: {card_info['mana_cost']}\nTypes: {card_info['types']}\nOracle Text: {card_info['oracle_text']}\nPower/Toughness: {card_info['power']}/{card_info['toughness']}\nLoyalty: {card_info['loyalty']}\n\n."
    except Exception as e:
        print(f"Error fetching card info: {e}")
        return None, None
    
    # Format the prompt as it was during training
    prompt = f"Rate the Magic: The Gathering card '{card_text}' on a scale from 1 to 5 where 1 is irrelevant to the current standard format and 5 is format-warping."
    
    # Tokenize the input
    inputs = tokenizer(
        prompt, 
        padding="max_length", 
        truncation=True, 
        max_length=max_length, 
        return_tensors="pt"
    )
    
    # Move inputs to the same device as the model
    device = next(model.parameters()).device
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    # Get prediction
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Get predicted class and confidence scores
    logits = outputs.logits.to(torch.float32)
    probabilities = torch.nn.functional.softmax(logits, dim=1)[0]
    predicted_class = torch.argmax(logits, dim=1).item()
    
    
    # Convert back to 1-5 scale (since model was trained on 0-4 labels)
    rating = predicted_class + 1
    
    # Convert probabilities to a regular list
    confidence_scores = probabilities.cpu().numpy().tolist()
    
    return rating, confidence_scores
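The last few steps of the function (softmax over the logits, argmax to pick a class, then shifting the 0-4 class index back to a 1-5 rating) can be illustrated without torch. This is a minimal pure-Python sketch of the same idea. The helper names `softmax` and `logits_to_rating` and the example logits are made up for illustration and are not model output.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_rating(logits):
    """Return (rating on the 1-5 scale, per-class probabilities)."""
    probs = softmax(logits)
    predicted_class = max(range(len(probs)), key=probs.__getitem__)
    # Class indices run 0-4, ratings run 1-5.
    return predicted_class + 1, probs


example_logits = [-0.3, 0.8, 0.4, -0.5, -1.4]  # made-up values
rating, probs = logits_to_rating(example_logits)
print(f"Predicted Rating: {rating}/5")
print("Confidence:", [f"{p * 100:.1f}%" for p in probs])
```

Because softmax preserves the ordering of the logits, taking the argmax of the logits (as the function above does) and taking the argmax of the probabilities give the same class.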

### Run Fine-Tuned Model

Here I will pass some example cards to the fine-tuned model to see how it does. The code calls the function above and prints the results.

In [28]:
test_cards = ["Abyssal Gorestalker", "Adventuring Gear", "Monstrous Rage", "Phyrexian Arena", "Abrade"]

for card in test_cards:
    rating, confidence = predict_card_rating(card, model, tokenizer)
    if rating is not None:
        # Format confidence scores as percentages
        conf_percentages = [f"{conf*100:.1f}%" for conf in confidence]   
                 
        print(f"Card: {card}")
        print(f"Predicted Rating: {rating}/5")
        print(f"Confidence: {conf_percentages}")
        print("-----------------------")

Card: Abyssal Gorestalker
Predicted Rating: 2/5
Confidence: ['13.6%', '41.9%', '28.7%', '11.1%', '4.7%']
-----------------------
Card: Adventuring Gear
Predicted Rating: 2/5
Confidence: ['13.8%', '41.5%', '28.8%', '11.2%', '4.7%']
-----------------------
Card: Monstrous Rage
Predicted Rating: 2/5
Confidence: ['13.5%', '42.0%', '28.7%', '11.2%', '4.6%']
-----------------------
Card: Phyrexian Arena
Predicted Rating: 2/5
Confidence: ['13.5%', '42.2%', '28.4%', '11.2%', '4.7%']
-----------------------
Card: Abrade
Predicted Rating: 2/5
Confidence: ['13.7%', '41.9%', '28.5%', '11.1%', '4.8%']
-----------------------


### Overview:

This method failed miserably, for a couple of reasons, both having to do with my data. As we can see, the model has defaulted to predicting 2 for almost every card, with a rare 3, and it does so without much confidence in that answer. I believe the reason is that the data is skewed towards the lower scores. This is the distribution of the data per score: 1: 17, 2: 36, 3: 34, 4: 11, 5: 2. Looking at the predictions above, the probabilities given for each score roughly follow this distribution.

Despite many attempts at modifying hyperparameters such as epochs and learning rate to increase the amount of learning done, I could not get the model to make different predictions for different cards. The best learning rate that I found was 2e-4. Any lower and the model simply stopped learning anything and gave nearly 20 percent for each probability. The model seems to converge after 5 or so epochs each time, if the anti-learning that happens in a lot of iterations can be called converging. Currently, I firmly believe that the minuscule dataset is preventing the model from learning trends that help it outside of the training data.

As a consequence of the very tiny amount of data coupled with the skew, the model is learning to be less surprised that a result is a 2 or a 3 instead of actually learning something about classifying the cards. As a method of evaluating Magic cards, this fine-tuning attempt was a failure. However, as a learning experience about fine-tuning and a lesson in careful and extensive data collection, it was a success!
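One rough way to sanity-check the idea that the model only learned the label frequencies is to compute the cross-entropy of a "classifier" that ignores the card entirely and always outputs those frequencies, and compare it with the reported validation loss. The sketch below uses the per-score counts listed above. It assumes the training and validation splits follow roughly the same distribution as the full set of 100 cards, which the notebook does not confirm.

```python
import math

# Per-score counts from the dataset (rating -> number of cards).
label_counts = {1: 17, 2: 36, 3: 34, 4: 11, 5: 2}

total = sum(label_counts.values())
priors = {rating: count / total for rating, count in label_counts.items()}

# Expected cross-entropy (in nats) of always predicting the class frequencies.
prior_loss = -sum(p * math.log(p) for p in priors.values())

# Cross-entropy of always predicting 20% for every class.
uniform_loss = math.log(len(label_counts))

print(f"Always predict the class frequencies: {prior_loss:.3f}")
print(f"Always predict uniform probabilities: {uniform_loss:.3f}")
```

The frequency-only baseline comes out at about 1.36 nats and the uniform baseline at about 1.61. The validation loss plateau of about 1.30 sits close to the frequency-only baseline, which is consistent with the model having learned little beyond the label distribution, though on a 20-card validation set this comparison is only suggestive.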

# Conclusion

### General Results

In summary, in the current state of the project, the OpenAI API calling bot works far better than the fine-tuned DistilBERT model. It is able to rate cards somewhat accurately, even if it often scores them higher than they should be. This is much better than the fine-tuned model, which ended up rating every card either a 2 or a 3. For more detailed analysis of model performance, see each approach's overview cells.

### Next steps
For the API call, there is not much to change except for more prompt engineering. I would change the prompt to encourage more conservative scores, awarding 3s and 4s only to higher-powered cards. Something interesting to try would be to use RAG to let the model learn what is in Standard before making its ratings.

For the fine-tuned model, the best thing for me to do to improve performance would be to collect more data, and particularly to even out the number of cards per score so that the spread from 1-4 is even. Realistically, there are only ever two or three 5s in the format, and they are self-apparent to anyone who has touched the game. Collecting enough of them would be impossible unless they came from cards outside the Standard format, and it is not really necessary for the model to point out 5-powered cards anyway.

### What I learned
I learned two major lessons doing this project. First, prompt engineering and in-context learning are very powerful and can be used to get admirable results without having to train your own model. Second, when attempting to fine-tune a model, data collection is the most important part, and a lot of data is needed to train a model effectively. Additionally, I should have taken much more care in how I selected cards for the dataset to avoid the skew that I ended up with.

### Societal Implications
In terms of general society, this model is fairly irrelevant. In terms of Standard, though, I would like to run a thought experiment where everyone playing the game has access to my model and uses it (I'll also assume that the model gives good predictions). In this case, the model would predict the best cards coming out of a new set, and players would go and use them. The problem is that everyone, assuming that they want to win, would either be using the same cards or building their decks to answer these best, and now ubiquitous, cards. This could lead to a very stale metagame very quickly, and a lot of the fun of deckbuilding and trying out new cards and strategies would be lost. This is one eventuality that I could foresee coming to pass if the scenario that I proposed were true. I believe that if this project were a true success, and the models were quite powerful, Magic: the Gathering would suffer as a game because of it, and the Standard format would become a slave to the meta defined by the AI.