In [1]:
from src.models import preprocess_df, decision_tree, neural_net_classifier
from src.mtg_json import load_atomic, prep_df

In [2]:
df = prep_df(load_atomic("ModernAtomic"), monocolor=True, creatures=True, legal_format='modern')
ml_df, train_features, train_labels, test_features, test_labels = preprocess_df(df)


  df[feature] = df["subtypes"].apply(lambda x: 1 if typ in x else 0)


In [3]:
tree_score, tree_model = decision_tree(train_features, train_labels, test_features, test_labels)

0.5278491859468724

In [None]:
nn_score, nn_model = neural_net_classifier(train_features, train_labels, test_features, test_labels)

MAIN GOAL: Determine a creature's color identity from the following features (feature count in parentheses):
- CMC (1)
- Power (1)
- Toughness (1)
- Artifact / Enchantment Supertype (2)
- Type (boolean cols for each of the top 200 tribes) (200)
- Keywords (see keywords.json and list of evergreen keywords on https://mtg.fandom.com/wiki/Evergreen) (20-200)
- Name? (Would need a way to break this down (https://web.stanford.edu/group/pdplab/pdphandbook/handbookch8.html))
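
A rough sketch of how the boolean tribe columns in the list above could be built. This is plain Python, not the code in `preprocess_df`; the helper names `top_subtypes` and `add_tribe_flags`, the `is_<tribe>` column naming, and the sample cards are all made up for illustration.

```python
from collections import Counter


def top_subtypes(cards, n):
    """Return the n most common subtypes across all cards."""
    counts = Counter(sub for card in cards for sub in card["subtypes"])
    return [sub for sub, _ in counts.most_common(n)]


def add_tribe_flags(cards, tribes):
    """Return one row per card with a 0/1 flag for each tracked tribe."""
    rows = []
    for card in cards:
        owned = set(card["subtypes"])
        row = {"name": card["name"]}
        for tribe in tribes:
            row[f"is_{tribe.lower()}"] = int(tribe in owned)
        rows.append(row)
    return rows


# Hypothetical sample rows, not taken from the real dataset.
sample_cards = [
    {"name": "Card A", "subtypes": ["Elf", "Druid"]},
    {"name": "Card B", "subtypes": ["Elf", "Warrior"]},
    {"name": "Card C", "subtypes": ["Goblin"]},
]
tribes = top_subtypes(sample_cards, 1)
flags = add_tribe_flags(sample_cards, tribes)
```

With n=200 this would give the 200 tribe columns; subtypes outside the top n simply get no column.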


# Issue 1: Multi-faced cards from the Atomic dataset.
The more robust Atomic dataset contains split entries for DFCs, fuse cards, etc. How do we count these cards?
A. Remove them from the dataset.
    # By far the easiest approach.
B. Look at just the front.
    # Cleanest, will cause some outliers, namely on meld/fuse/transform cards
C. Add them as an additional row.
    # More accurate, but will likely be outliers
D. Add extra columns
    # Most accurate, but will mess with any ML algo if not weighted properly.
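
Options A and B above could look roughly like the sketch below. It assumes the atomic data is a dict mapping each card name to a list of face dicts, and that faces of multi-faced cards carry a `side` key ("a" for the front); both are assumptions about the data layout, and `front_faces` / `drop_multifaced` are hypothetical helpers, not functions from `src.mtg_json`.

```python
def drop_multifaced(atomic_data):
    """Option A: keep only cards that have exactly one face."""
    return {
        name: faces[0]
        for name, faces in atomic_data.items()
        if len(faces) == 1
    }


def front_faces(atomic_data):
    """Option B: keep one entry per card, preferring the face marked side 'a'."""
    fronts = {}
    for name, faces in atomic_data.items():
        # Faces without a "side" key are treated as fronts (assumption).
        front = next((f for f in faces if f.get("side", "a") == "a"), faces[0])
        fronts[name] = front
    return fronts


# Hypothetical sample data, invented for illustration.
sample = {
    "Front // Back": [
        {"side": "a", "power": "2"},
        {"side": "b", "power": "4"},
    ],
    "Plain Creature": [{"power": "1"}],
}
only_single = drop_multifaced(sample)
only_fronts = front_faces(sample)
```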

# Issue 2: Keywords Overlap with Creature Names (Death's Shadow, Flying Men)
This is mostly fine. For one thing, many of these cards match their mechanic's color identity (Flying Men is blue, Death's Shadow is black, etc.).
We could look into telling names apart from keywords with regex or other pattern matching, but let's leave that for now.

# Issue 3: The word Counter
So, despite being able to name things whatever they want, and despite repeated Oracle changes to simplify and clarify wording, MTG still uses the word 'counter' to mean two different things: a keyword action meaning "remove this spell or ability from the stack", and a marker placed on permanents, e.g. +1/+1 counters or loyalty counters. Countering things (the first sense) is a blue-coded mechanic, while counters (the second sense) are fairly universal, maybe leaning white and green but with no real color identity. Again, we could look into differentiating these by string matching ("counter target" vs. "+1/+1 counter"), but since counters in the second sense don't really have an identity, we're changing "counter" in keyword abilities to "counter target". There are also far too many phrasings of the other kind of counter (counter vs. counters, etc.) to process simply. This does remove Baral, the bluest creature ever, from the counter keyword column, but whatever.
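
If the string-matching route were ever revisited, it might look something like the sketch below. The two patterns and the `counter_senses` helper are illustrative guesses, not what the notebook currently does, and they only cover a few phrasings ("counter target", "counter that", "counter it").

```python
import re

# The verb sense: "counter target spell", "counter that spell", "counter it".
COUNTER_VERB = re.compile(r"\bcounter (?:target|that|it)\b", re.IGNORECASE)
# The object sense: "counter"/"counters" NOT followed by one of the verb phrasings.
COUNTER_NOUN = re.compile(r"\bcounters?\b(?! (?:target|that|it)\b)", re.IGNORECASE)


def counter_senses(text):
    """Return which senses of 'counter' appear in a piece of rules text."""
    senses = set()
    if COUNTER_VERB.search(text):
        senses.add("spell")
    if COUNTER_NOUN.search(text):
        senses.add("object")
    return senses


verb_only = counter_senses("Counter target spell.")
noun_only = counter_senses("Put a +1/+1 counter on target creature.")
neither = counter_senses("Flying")
```

Text such as "spells that counter a spell" falls through both patterns as a verb, so a real version would need more phrasings.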

# Issue 4: Parsing Card Names
This might be an entirely different ML task. There are packages that compute semantic vectors for words; we could use each card's name (the sum of its word vectors) as a feature set. A card like Death's Shadow would be easy peasy. Brushwagg, on the other hand, maybe not. Proper names like Olivia Voldaren, or worse, Drizzt Do'Urden, would be all but impossible.
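
A toy sketch of the "sum of all word vectors" idea. The `name_vector` helper and the 2-d vectors are invented for illustration; a real version would load vectors from a word-embedding package instead of a hand-written dict.

```python
def name_vector(name, embeddings, dim):
    """Sum the vectors of every known word in a card name.

    Unknown words (made-up creature names, proper names) contribute nothing.
    """
    total = [0.0] * dim
    # Lowercase and drop possessive "'s" so "Death's" looks up as "death".
    for word in name.lower().replace("'s", "").split():
        vec = embeddings.get(word)
        if vec is None:
            continue
        total = [t + v for t, v in zip(total, vec)]
    return total


# Toy 2-d vectors, invented for illustration only.
toy_embeddings = {"death": [1.0, 0.0], "shadow": [0.5, 0.5]}

known = name_vector("Death's Shadow", toy_embeddings, 2)
unknown = name_vector("Brushwagg", toy_embeddings, 2)
```

A name made entirely of out-of-vocabulary words comes back as the zero vector, which is the Brushwagg problem in miniature.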

# Issue 5: Improving Performance
Current accuracy for the neural net on the testing dataset sits at about 64%: not great for a binary classifier, but pretty good for a classifier with 6 classes (chance rate of about 17%). A binary classifier on enemy-color cards (blue and green, probably the most disparate colors in terms of creatures) had an accuracy of 85%.
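
The 17% chance rate is just 1/6, which assumes the six classes are equally likely. If the classes are unbalanced, always guessing the most common class is a stricter baseline. A small sketch of both (the helper names and sample labels are hypothetical):

```python
from collections import Counter


def uniform_chance(n_classes):
    """Accuracy of guessing uniformly at random among n_classes."""
    return 1.0 / n_classes


def majority_baseline(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


six_class_chance = uniform_chance(6)
binary_chance = uniform_chance(2)
# Hypothetical labels, not the real class distribution.
baseline = majority_baseline(["W", "W", "U", "B"])
```

Running `majority_baseline(test_labels)` on the real labels would show how far 64% actually sits above a no-skill model.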

# Issue 6: PCA
As expected, PCA didn't improve the performance of the neural net (59.1% with PCA vs. 64% without). It also reduced performance on the decision tree (45% with vs. 54% without). This makes sense, since PCA discards information. It could still be useful for clustering algorithms!
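
One way to gauge how much information PCA throws away is to look at the share of variance each component carries. The sketch below uses NumPy's SVD directly rather than whatever PCA implementation the notebook used; `explained_variance_ratio`, `components_needed`, and the sample matrix are illustrative assumptions.

```python
import numpy as np


def explained_variance_ratio(features):
    """Fraction of total variance carried by each principal component."""
    centered = features - features.mean(axis=0)
    # Squared singular values of the centered matrix are proportional
    # to the variance along each principal component.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


def components_needed(features, target=0.95):
    """Smallest number of components whose cumulative variance reaches target."""
    cumulative = np.cumsum(explained_variance_ratio(features))
    return int(np.searchsorted(cumulative, target) + 1)


# Hypothetical feature matrix: column 2 is exactly twice column 1,
# so only two directions carry any variance.
sample = np.array(
    [[1, 2, 0], [2, 4, 1], [3, 6, 0], [4, 8, 1]], dtype=float
)
ratios = explained_variance_ratio(sample)
```

Sparse 0/1 tribe and keyword columns tend to spread variance thinly across many components, so a fixed component count can drop signal the classifiers were using.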