# Hello and welcome to my ML project. 
I was initially going to do an app (option A), but I switched gears and chose a data exploration project (option C) in the end. This is the notebook for the NLP model I built. My goal was to have the model guess a card's colour based on its name and card type. First we get the data.

In [None]:
import numpy as np 
import pandas as pd
import torch


all_cards_df = pd.read_csv('/kaggle/input/all-mtg-cards/all_mtg_cards.csv')
selected_columns = all_cards_df[['name']]
print(selected_columns)

As you can see, there are many duplicate cards in this data set. My first step of cleaning is removing these redundant entries.

In [None]:
all_cards_no_dups = all_cards_df.drop_duplicates(subset="name")

print(all_cards_no_dups["name"].count())
all_cards_no_dups["color_identity"].value_counts()

The rows went from 76k to 26k. Because some of these colour combinations have so few cards, I'm only going to have the model choose between the five mono colours, colorless, and multicolor. I'm going to create a new column 'overall_color' for this.

In [None]:
def str_to_list(cell):
    cell = ''.join(c for c in cell if c not in "'[]")  
    cell = cell.split(', ') 
    return cell

# Define an overall card color category, including Colorless and Multi
def color_to_category(x):
    try:
        size = len(x)
        if size == 0:
            return "Colorless"
        elif size == 1:
            return x[0]
        else:
            return "Multi"
    except TypeError:  # x has no usable length or index (e.g. a stray NaN float)
        return "None"


import ast

all_cards_no_dups = all_cards_no_dups.copy()  # Explicit copy so the assignments below don't write to a slice

# Turn the stringified lists (e.g. "['B', 'G']") into real lists; missing values become empty lists
all_cards_no_dups.loc[:, "overall_color"] = all_cards_no_dups["color_identity"].apply(str)
all_cards_no_dups.loc[:, "overall_color"] = all_cards_no_dups["overall_color"].apply(lambda x: "[]" if x == "nan" else x)
# ast.literal_eval only parses Python literals, so it is safer than eval on file contents
all_cards_no_dups.loc[:, "overall_color"] = all_cards_no_dups["overall_color"].apply(ast.literal_eval)
all_cards_no_dups.loc[:, "overall_color"] = all_cards_no_dups["overall_color"].apply(color_to_category)
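
To make the parsing step concrete, here is a small self-contained sketch of how a raw `color_identity` cell ends up as a category. The sample strings are made up for illustration (I'm assuming the column stores list-like strings such as `"['B', 'G']"`), and `parse_identity` is a hypothetical helper, not something from the dataset or pandas.

```python
import ast

def parse_identity(cell):
    """Turn a raw cell such as "['B', 'G']" or NaN into a Python list."""
    text = str(cell)
    if text == "nan":               # a missing identity is treated as no colours
        return []
    return ast.literal_eval(text)   # parses literals only, never runs code

def color_to_category(colors):
    """Collapse a list of colour codes into one overall category."""
    if len(colors) == 0:
        return "Colorless"
    if len(colors) == 1:
        return colors[0]
    return "Multi"

samples = ["['B']", "['G', 'W']", "[]", float("nan")]  # made-up examples
categories = [color_to_category(parse_identity(s)) for s in samples]
print(categories)
```

Single-colour cards keep their one-letter code here; the mapping to full names happens in the next cell.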

In [None]:
# Define a mapping of shorthand to full names
color_mapping = {
    'B': 'Black',
    'W': 'White',
    'U': 'Blue',
    'R': 'Red',
    'G': 'Green',
    'Multi': 'Multicolor'
}

# Replace the values in the 'overall_color' column using the mapping
all_cards_no_dups['overall_color'] = all_cards_no_dups['overall_color'].replace(color_mapping)

# Display the unique values in the updated column to verify
print(all_cards_no_dups['overall_color'].unique())

# Modeling

The color categories are looking nice. Now it's time to move on to setting up the model. I'm feeling hopeful as names and types should provide great insight into color. Angelic/holy language for white, zombies/undead for black, nature for green, goblins for red, equipment/artifact for colorless, etc.

In [None]:
card_color_subtypes = all_cards_no_dups.loc[:, ["name", "type", "overall_color"]]
card_color_subtypes = card_color_subtypes.explode("type")
card_color_subtypes.head()

In [None]:
all_cards_no_dups['input_text'] = all_cards_no_dups['name'] + " " + all_cards_no_dups['type']

I combined the name and type into a single text field so the model gets one input string per card. Regarding stopwords and capitalisation, the sources I looked up assured me that preprocessing my data to remove those is largely unnecessary when using BERT, so I skipped that step of cleaning.

Next, splitting the data into training and validation sets, with a fixed random state so I can compare runs accurately.

In [None]:
from sklearn.model_selection import train_test_split

train_texts, val_texts, train_labels, val_labels = train_test_split(
    all_cards_no_dups['input_text'], 
    all_cards_no_dups['overall_color'], 
    test_size=0.20, 
    random_state=41
)
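
One optional tweak that the cell above does not use: passing `stratify=` keeps each colour's share the same in both splits, which could matter for the smaller categories. A minimal sketch on made-up labels (the 80/20 class mix is just an assumption for the demo):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up, imbalanced toy data: 80 "Red" cards and 20 "Multicolor" cards
texts = [f"card {i}" for i in range(100)]
labels = ["Red"] * 80 + ["Multicolor"] * 20

train_x, val_x, train_y, val_y = train_test_split(
    texts, labels, test_size=0.20, random_state=41, stratify=labels
)

# Both splits keep the original 80/20 class ratio
print(Counter(train_y), Counter(val_y))
```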

The LabelEncoder is used to convert categorical labels into numeric format for both the training and validation datasets. The fit_transform method fits the encoder to the training labels and transforms them, while the transform method ensures the validation labels are encoded consistently using the same mapping.

In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
train_labels = label_encoder.fit_transform(train_labels)
val_labels = label_encoder.transform(val_labels)
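
As a small illustration of what the encoder does (with made-up labels, not the real dataset): `LabelEncoder` sorts the class names, assigns them integer ids in that order, and `inverse_transform` turns ids back into names.

```python
from sklearn.preprocessing import LabelEncoder

toy_train = ["Red", "Black", "White", "Black"]  # made-up training labels
toy_val = ["White", "Red"]                       # made-up validation labels

encoder = LabelEncoder()
train_ids = encoder.fit_transform(toy_train)  # learns the mapping, then encodes
val_ids = encoder.transform(toy_val)          # reuses the same mapping

print(list(encoder.classes_))                        # class names in id order
print(train_ids.tolist(), val_ids.tolist())          # integer ids
print(encoder.inverse_transform(val_ids).tolist())   # back to names
```

This is also why the ids line up alphabetically with the category names later on.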

In [None]:
# Bert tokenizer here to do my work for me.
# It processes both training and validation texts with truncation, padding, and a maximum sequence length of 128 to ensure uniform input size

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

train_encodings = tokenizer(list(train_texts), truncation=True, padding=True, max_length=128)
val_encodings = tokenizer(list(val_texts), truncation=True, padding=True, max_length=128)
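
To show roughly what `truncation`, `padding`, and `max_length` do, here is a toy stand-in in plain Python. It splits on whitespace instead of using BERT's WordPiece vocabulary, so the token ids are fake and `toy_encode` is a hypothetical helper; only the truncating, padding, and masking logic mirrors the idea.

```python
def toy_encode(texts, max_length):
    """Whitespace 'tokenizer' illustrating truncation, padding and masks."""
    vocab = {"[PAD]": 0}
    token_lists = []
    for text in texts:
        tokens = text.lower().split()[:max_length]   # truncation
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))        # assign ids on first sight
        token_lists.append([vocab[t] for t in tokens])

    longest = max(len(ids) for ids in token_lists)   # pad to the longest text
    input_ids, attention_mask = [], []
    for ids in token_lists:
        pad = longest - len(ids)
        input_ids.append(ids + [0] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = toy_encode(["Goblin Guide Creature", "Island Land"], max_length=128)
print(batch)
```

The attention mask is what lets the model ignore the padded positions.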

In [None]:
# preparing training and validation sets

class CardDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

train_dataset = CardDataset(train_encodings, train_labels)
val_dataset = CardDataset(val_encodings, val_labels)

In [None]:
from transformers import BertForSequenceClassification

# Loading the pre-trained BERT model with classification head
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(label_encoder.classes_))

In [None]:
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import get_scheduler

# Tried many batch sizes from 4 to 32
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)

# Set up device
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# Load pre-trained BERT model (this replaces the bert-base model loaded in the previous cell)
from transformers import BertForSequenceClassification

# Regularisation techniques
# Dropout to, again, help with overfitting. Tried values from .2 to .4
# It is passed at load time because the dropout layers are built when the model is created,
# so changing model.config.hidden_dropout_prob afterwards would not affect them
num_labels = len(set(train_labels))
model = BertForSequenceClassification.from_pretrained(
    'bert-large-uncased', num_labels=num_labels, hidden_dropout_prob=0.3
)

model.to(device)

# Here I tried freezing some layers to reduce training time and reduce overfit which was a consistent problem

# Freeze the whole BERT backbone first
for param in model.base_model.parameters():
    param.requires_grad = False

# Then unfreeze the last 6 encoder layers
for param in model.base_model.encoder.layer[-6:].parameters():
    param.requires_grad = True

# Keep the classification head trainable
for param in model.classifier.parameters():
    param.requires_grad = True

# Weight decay: L2-style regularisation to attempt to help with overfitting (0.01 here)
# Learning rates tried: from 1e-5 down to 5e-6
optimizer = AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)

# Define learning rate scheduler
# The scheduler and the training loop share num_epochs so the linear decay matches the actual run length
num_epochs = 10
num_training_steps = len(train_loader) * num_epochs
lr_scheduler = get_scheduler("linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)

# Early stopping to save time when the model dead ends.
best_val_loss = float('inf')
patience = 3  
epochs_no_improve = 0

# Training loop with early stopping and scheduler
for epoch in range(num_epochs):  
    model.train()  
    total_train_loss = 0

    for batch in train_loader:
        inputs = {key: val.to(device) for key, val in batch.items() if key != 'labels'}
        labels = batch['labels'].to(device)

        optimizer.zero_grad()
        outputs = model(**inputs, labels=labels)  
        loss = outputs.loss
        total_train_loss += loss.item()
        loss.backward()
        optimizer.step()
        lr_scheduler.step()

    avg_train_loss = total_train_loss / len(train_loader)
    print(f"Epoch {epoch + 1}, Training Loss: {avg_train_loss:.4f}")

    # Validation loop
    model.eval()  
    total_val_loss = 0
    correct_predictions = 0

    with torch.no_grad():
        for batch in val_loader:
            inputs = {key: val.to(device) for key, val in batch.items() if key != 'labels'}
            labels = batch['labels'].to(device)

            outputs = model(**inputs, labels=labels)  
            total_val_loss += outputs.loss.item()

            predictions = torch.argmax(outputs.logits, dim=-1)
            correct_predictions += (predictions == labels).sum().item()

    avg_val_loss = total_val_loss / len(val_loader)
    val_accuracy = correct_predictions / len(val_dataset)

    print(f"Epoch {epoch + 1}, Validation Loss: {avg_val_loss:.4f}, Accuracy: {val_accuracy:.4f}")

    # Early stopping check
    if avg_val_loss < best_val_loss:
        best_val_loss = avg_val_loss
        epochs_no_improve = 0
    else:
        epochs_no_improve += 1

    if epochs_no_improve >= patience:
        print("Early stopping triggered.")
        break
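
The early-stopping logic above could be pulled out into a small helper, which also makes it easy to remember which epoch was best. This is only a sketch: `EarlyStopper` is a hypothetical name, and the losses fed to it are the first six validation losses from the last logged run further down.

```python
class EarlyStopper:
    """Track the best validation loss and signal when to stop."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = None
        self.epochs_no_improve = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.epochs_no_improve = 0
        else:
            self.epochs_no_improve += 1
        return self.epochs_no_improve >= self.patience

val_losses = [1.1310, 1.0551, 1.0580, 1.0737, 1.1312, 1.1673]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(stopper.best_epoch, stopped_at)
```

In the real loop, the branch that updates `best_epoch` would also be a natural place to save the weights, for example with `torch.save(model.state_dict(), path)` for some `path` of your choosing, so the best epoch can be restored rather than the last one.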


1:
Epoch 1, Training Loss: 1.1506
Epoch 1, Validation Loss: 1.0580, Accuracy: 0.6295
Epoch 2, Training Loss: 0.8626
Epoch 2, Validation Loss: 1.0819, Accuracy: 0.6367
Epoch 3, Training Loss: 0.6016
Epoch 3, Validation Loss: 1.1341, Accuracy: 0.6330

Going for better stats. Setting dropout to .2 and changing the learning rate to 2e-5.

2:
Epoch 1, Training Loss: 1.2940
Epoch 1, Validation Loss: 1.0733, Accuracy: 0.6245
Epoch 2, Training Loss: 0.9318
Epoch 2, Validation Loss: 1.0418, Accuracy: 0.6488
Epoch 3, Training Loss: 0.7069
Epoch 3, Validation Loss: 1.0732, Accuracy: 0.6374
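
A note on the dropout changes between these runs: as far as I understand, the dropout layers in a Hugging Face BERT model are created when the model is loaded, so editing `model.config.hidden_dropout_prob` afterwards does not change layers that already exist. One way around that is to update the existing `nn.Dropout` modules directly. The sketch below uses a tiny stand-in model rather than BERT; `TinyClassifier` and `set_dropout` are hypothetical names.

```python
import torch.nn as nn

class TinyClassifier(nn.Module):
    """Toy stand-in: reads its dropout rate once, at construction time."""

    def __init__(self, dropout_prob):
        super().__init__()
        self.dropout = nn.Dropout(dropout_prob)
        self.linear = nn.Linear(8, 7)

    def forward(self, x):
        return self.linear(self.dropout(x))

def set_dropout(model, p):
    """Update the rate on every existing nn.Dropout layer in the model."""
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = p

toy_model = TinyClassifier(dropout_prob=0.1)
set_dropout(toy_model, 0.3)
print(toy_model.dropout.p)
```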

Still overfitting. Adjusting the learning rate to 1e-5 with weight decay .01, increasing batch size from 16 to 20, increasing dropout to .3. Will consider trying layer freezing and dataset balancing next. Increasing epochs to 5.

3:
Epoch 1, Training Loss: 0.4347
Epoch 1, Validation Loss: 1.1961, Accuracy: 0.6505
Epoch 2, Training Loss: 0.3103
Epoch 2, Validation Loss: 1.3321, Accuracy: 0.6388
Epoch 3, Training Loss: 0.2293
Epoch 3, Validation Loss: 1.4711, Accuracy: 0.6361

Stopped after the 3rd epoch as the losses were too much to bear. Surprisingly, the first epoch is my best yet though lol. I'm increasing the test data size to .25 from .2. Dropout is going to .4, learning rate to 5e-6. I'm also trying layer freezing now. Data was already balanced just fine, so I'm not worried about that.

4:
Epoch 1, Training Loss: 1.7682
Epoch 1, Validation Loss: 1.5257, Accuracy: 0.4651
Epoch 2, Training Loss: 1.4664
Epoch 2, Validation Loss: 1.3411, Accuracy: 0.5303
Epoch 3, Training Loss: 1.3275
Epoch 3, Validation Loss: 1.2543, Accuracy: 0.5584
Epoch 4, Training Loss: 1.2560
Epoch 4, Validation Loss: 1.2066, Accuracy: 0.5787
Epoch 5, Training Loss: 1.2064
Epoch 5, Validation Loss: 1.1840, Accuracy: 0.5847
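
Since this was the first run with layer freezing, a quick sanity check is to count how many parameters are still trainable. Here is a sketch on a tiny stand-in model; the real BERT encoder is far larger, and the layer sizes here are made up.

```python
import torch.nn as nn

# Toy stand-in for an encoder: 4 small "layers" plus a classifier head
toy = nn.ModuleDict({
    "layers": nn.ModuleList([nn.Linear(4, 4) for _ in range(4)]),
    "classifier": nn.Linear(4, 7),
})

# Freeze everything, then unfreeze the last 2 layers and the head
for param in toy.parameters():
    param.requires_grad = False
for param in toy["layers"][-2:].parameters():
    param.requires_grad = True
for param in toy["classifier"].parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in toy.parameters() if p.requires_grad)
total = sum(p.numel() for p in toy.parameters())
print(f"trainable: {trainable} / {total}")
```

The same two `sum(...)` lines work on the real model and give a feel for how much of it is actually being updated.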

Interesting development. At least both validation numbers are improving now. Increasing the learning rate to 2e-5, reducing dropout to .3, freezing fewer layers (4 to 6).

The latest attempt made it up to .63 but the numbers didn't save. Trying some changes.

Epoch 1, Training Loss: 1.4009
Epoch 1, Validation Loss: 1.1310, Accuracy: 0.6100
Epoch 2, Training Loss: 1.0546
Epoch 2, Validation Loss: 1.0551, Accuracy: 0.6302
Epoch 3, Training Loss: 0.8790
Epoch 3, Validation Loss: 1.0580, Accuracy: 0.6399
Epoch 4, Training Loss: 0.7424
Epoch 4, Validation Loss: 1.0737, Accuracy: 0.6432
Epoch 5, Training Loss: 0.6216
Epoch 5, Validation Loss: 1.1312, Accuracy: 0.6455
Epoch 6, Training Loss: 0.5326
Epoch 6, Validation Loss: 1.1673, Accuracy: 0.6466
Epoch 7, Training Loss: 0.4555
Epoch 7, Validation Loss: 1.2099, Accuracy: 0.6387
Epoch 8, Training Loss: 0.3984
Epoch 8, Validation Loss: 1.2552, Accuracy: 0.6404
Epoch 9, Training Loss: 0.3577
Epoch 9, Validation Loss: 1.2853, Accuracy: 0.6387

I stopped documenting here because I couldn't reliably get above .64 over many consecutive tries.

# Visualisations

In [None]:
import matplotlib.pyplot as plt

# Data for losses
training_losses = [1.5665, 1.2734, 1.1782, 1.1140, 1.0686, 1.0222, 0.9875, 0.9567, 0.9262, 0.9183]
validation_losses = [1.2992, 1.2034, 1.1500, 1.1222, 1.1128, 1.1074, 1.0960, 1.0888, 1.0897, 1.0881]
epochs = list(range(1, 11))

# Plot for training and validation losses
plt.figure(figsize=(10, 5))
plt.plot(epochs, training_losses, label='Training Loss', marker='o')
plt.plot(epochs, validation_losses, label='Validation Loss', marker='o')
plt.title('Training and Validation Loss over Epochs')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()


In [None]:
# Data for accuracy
accuracies = [0.5432, 0.5769, 0.5912, 0.6049, 0.6145, 0.6195, 0.6236, 0.6253, 0.6278, 0.6278]

# Plot for validation accuracy
plt.figure(figsize=(10, 5))
plt.plot(epochs, accuracies, label='Validation Accuracy', marker='o', color='green')
plt.title('Validation Accuracy over Epochs')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.show()


Here are a couple of graphs showing the loss and accuracy against epochs. These aren't from my best result, but this training run made for a nice graph, so I used it haha.
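
The numbers in the plotting cells were typed in by hand. An alternative, sketched here, is to parse them out of the printed log text with regular expressions. The sample lines are copied from one of the logged runs above, and the patterns assume the exact print format used in the training loop.

```python
import re

# Sample lines in the same format the training loop prints
log_text = """\
Epoch 1, Training Loss: 1.4009
Epoch 1, Validation Loss: 1.1310, Accuracy: 0.6100
Epoch 2, Training Loss: 1.0546
Epoch 2, Validation Loss: 1.0551, Accuracy: 0.6302
"""

train_pattern = re.compile(r"Epoch (\d+), Training Loss: ([\d.]+)")
val_pattern = re.compile(r"Epoch (\d+), Validation Loss: ([\d.]+), Accuracy: ([\d.]+)")

training_losses = [float(m.group(2)) for m in train_pattern.finditer(log_text)]
validation_losses = [float(m.group(2)) for m in val_pattern.finditer(log_text)]
accuracies = [float(m.group(3)) for m in val_pattern.finditer(log_text)]
epochs = list(range(1, len(training_losses) + 1))

print(epochs, training_losses, validation_losses, accuracies)
```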

# Confusion Matrix

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Initialize lists to store true and predicted labels
y_true = []
y_pred = []

# Validation loop 
model.eval()
with torch.no_grad():
    for batch in val_loader:
        inputs = {key: val.to(device) for key, val in batch.items() if key != 'labels'}
        labels = batch['labels'].to(device)

        outputs = model(**inputs, labels=labels)
        predictions = torch.argmax(outputs.logits, dim=-1)

        # Append true and predicted labels
        y_true.extend(labels.cpu().numpy())
        y_pred.extend(predictions.cpu().numpy())

# Convert lists to numpy arrays 
y_true = np.array(y_true)
y_pred = np.array(y_pred)

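# Take the category names from the fitted encoder so they always match the label ids
category_mapping = list(label_encoder.classes_)

# Compute the confusion matrix (passing labels keeps a row for every category)
cm = confusion_matrix(y_true, y_pred, labels=np.arange(len(category_mapping)))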

# Plot the confusion matrix using Seaborn with category labels
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=category_mapping, yticklabels=category_mapping)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()


Here's a confusion matrix to track how the model did on each category. It makes sense that multicolor was the most difficult to place correctly. It was consistently the worst of all the categories, with one standout exception: predicting white for blue slightly edged it out. This makes some amount of sense because in the game of MtG, white and blue share some thematic gameplay similarities.
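
Raw counts can be misleading when the categories have different sizes, so it can help to normalise each row of the matrix and read the per-class recall off the diagonal. A sketch with a made-up 3x3 matrix (these are not my actual results):

```python
import numpy as np

# Made-up 3-class confusion matrix (rows = true class, columns = predicted)
toy_cm = np.array([
    [50,  5,  5],
    [10, 30, 10],
    [ 4,  6, 20],
])
toy_labels = ["Black", "Blue", "Multicolor"]

# Divide each row by its total so every row sums to 1
row_normalised = toy_cm / toy_cm.sum(axis=1, keepdims=True)
per_class_recall = np.diag(row_normalised)

for name, recall in zip(toy_labels, per_class_recall):
    print(f"{name}: {recall:.2f}")
```

The normalised matrix can be passed to `sns.heatmap` the same way as the raw one, with `fmt='.2f'` instead of `fmt='d'`.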

In [None]:
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_squared_error,
    precision_score,
    recall_score,
)

# Overall share of correctly classified cards
accuracy = accuracy_score(y_true, y_pred)
print(f'Accuracy: {accuracy:.4f}')

# Weighted averages: each category counts in proportion to its number of cards
precision = precision_score(y_true, y_pred, average='weighted')
print(f'Precision: {precision:.4f}')

recall = recall_score(y_true, y_pred, average='weighted')
print(f'Recall: {recall:.4f}')

# Note: the label ids are arbitrary codes from the LabelEncoder, so RMSE over them
# has no real meaning for a classification task; it is kept only for completeness
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f'RMSE: {rmse:.4f}')

f1 = f1_score(y_true, y_pred, average='weighted')
print(f'F1 Score: {f1:.4f}')
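
The scores above use `average='weighted'`, which weights each class by how often it appears, so the larger categories dominate. `average='macro'` treats every class equally and can look quite different when a smaller class does badly. A toy comparison on made-up labels:

```python
from sklearn.metrics import f1_score

# Made-up labels: class 0 is common, class 1 is rare and half of it is missed
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

weighted = f1_score(y_true_toy, y_pred_toy, average="weighted")
macro = f1_score(y_true_toy, y_pred_toy, average="macro")
print(f"weighted F1: {weighted:.4f}, macro F1: {macro:.4f}")
```

Here the weighted score comes out higher than the macro score because the common class is predicted well and the rare one is not.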