### A notebook using fastai/PyTorch to train a language model on historic MTG card data

In [1]:
from statistics import mean, median
import json
import pandas as pd
import numpy as np
!pip install fastai --upgrade
!pip install fastcore --upgrade
from fastai.text.all import *


## Load & format data

In [2]:
df = pd.read_csv('cards.csv')
df.columns


Index(['index', 'id', 'artist', 'asciiName', 'availability', 'borderColor',
       'cardKingdomFoilId', 'cardKingdomId', 'colorIdentity', 'colorIndicator',
       'colors', 'convertedManaCost', 'duelDeck', 'edhrecRank',
       'faceConvertedManaCost', 'faceName', 'flavorName', 'flavorText',
       'frameEffects', 'frameVersion', 'hand', 'hasAlternativeDeckLimit',
       'isFullArt', 'isOnlineOnly', 'isOversized', 'isPromo', 'isReprint',
       'isReserved', 'isStarter', 'isStorySpotlight', 'isTextless',
       'isTimeshifted', 'keywords', 'layout', 'leadershipSkills', 'life',
       'loyalty', 'manaCost', 'mcmId', 'mcmMetaId', 'mtgArenaId',
       'mtgjsonV4Id', 'mtgoFoilId', 'mtgoId', 'multiverseId', 'name', 'number',
       'originalReleaseDate', 'originalText', 'originalType', 'otherFaceIds',
       'power', 'printings', 'promoTypes', 'purchaseUrls', 'rarity',
       'scryfallId', 'scryfallIllustrationId', 'scryfallOracleId', 'setCode',
       'side', 'subtypes', 'supertypes', 'tcgp

In [3]:
df = df[['name', 'colorIdentity', 'colors',
         'convertedManaCost', 'manaCost', 'type',
         'types', 'power', 'toughness',
         'rarity', 'text', 'flavorText', 'uuid']]
df.shape

(54920, 13)

In [4]:
df.rename(columns={'convertedManaCost': 'cmc',
           'manaCost': 'mana_cost',
           'colorIdentity': 'colorID',
           'type': 'main_type',
           'types': 'all_types',
           'flavorText': 'flavor_text'},
          inplace=True)

Create some new columns that may be useful later on

In [5]:
df['contains_W'] = df['colors'].str.contains('W', case=True, na=False, regex=False)
df['contains_U'] = df['colors'].str.contains('U', case=True, na=False, regex=False)
df['contains_B'] = df['colors'].str.contains('B', case=True, na=False, regex=False)
df['contains_R'] = df['colors'].str.contains('R', case=True, na=False, regex=False)
df['contains_G'] = df['colors'].str.contains('G', case=True, na=False, regex=False)
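# Note: pd.read_csv parses empty fields as NaN by default, so this flag is only True
# if 'colors' holds literal empty strings; df['colors'].isna() may be the intended check
df['is_colorless'] = df['colors'].eq('')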
df['is_multicolor'] = (df['colorID'].str.len() > 1) & (df['main_type'] != 'Land')

In [6]:
df['is_creature'] = df['all_types'].str.contains('creature|summon', case=False, na=False, regex=True)
df['is_instant'] = df['all_types'].str.contains('instant', case=False, na=False, regex=False)
df['is_enchantment'] = df['all_types'].str.contains('enchantment', case=False, na=False, regex=False)
df['is_sorcery'] = df['all_types'].str.contains('sorcery', case=False, na=False, regex=False)
df['is_artifact'] = df['all_types'].str.contains('artifact', case=False, na=False, regex=False)
df['is_planeswalker'] = df['all_types'].str.contains('planeswalker', case=False, na=False, regex=False)
df['is_land'] = df['all_types'].str.contains('land', case=False, na=False, regex=False)

Drop rows with outlier features

In [7]:
df = df.groupby('main_type').filter(lambda x: len(x) > 4)
df = df.groupby('all_types').filter(lambda x: len(x) > 10)

In [8]:
drop_power = df.loc[(df['is_creature']) & (~df['power'].isin([str(i) for i in range(20)]))]
drop_toughness = df.loc[(df['is_creature']) & (~df['toughness'].isin([str(i) for i in range(15)]))]
drop_cmc = df.loc[df['cmc'].isin([1000000.0])]
drop_colorID = df.loc[df['colorID'].isin(['GRUW', 'BGRW', 'BGUW', 'BRUW', 'BGRU'])]

all_drops = pd.concat([drop_power, drop_toughness, drop_cmc, drop_colorID])
df.drop(all_drops.index, inplace=True)
df.shape

(52088, 27)

Drop cards with no ```text``` attribute & relabel all colorless cards with ```colorID``` = ```C```

There is a caveat with colorless labels, as they behave slightly differently from the other colors. Colorless mana appears in the cost of many cards but is not typically considered part of their identity. For example, a card with a cost of CU is a U card, not a CU card, whereas a card that costs 3C is a C card. This may add extra complexity to the multilabel effort, but for now we continue and see how often it causes problems in model evaluation.
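The caveat above can be illustrated with a small sketch. It assumes mana costs are written in brace notation such as `{C}{U}` (the contents of the `mana_cost` column are not shown in this notebook, so that format is an assumption), and `color_label` is a hypothetical helper, not part of the notebook or of any library. It only looks at the mana cost, whereas a card's actual color identity can also depend on its rules text.

```python
import re

COLORS = set("WUBRG")

def color_label(mana_cost: str) -> str:
    """Return the colored symbols in a mana cost, or 'C' if there are none."""
    found = set()
    for symbol in re.findall(r"\{([^}]+)\}", mana_cost):
        # Hybrid symbols such as W/U contribute every color they mention
        found.update(part for part in symbol.split("/") if part in COLORS)
    # Alphabetical order matches labels such as 'GRUW' used earlier
    return "".join(sorted(found)) or "C"

print(color_label("{C}{U}"))  # U: colorless mana next to blue does not change the label
print(color_label("{3}{C}"))  # C: only generic and colorless mana
```

A label built this way treats `C` as "no colored symbols at all", which is why it never appears alongside another color.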

In [9]:
txt_df = df[df['text'].notnull()].copy()  # copy so the assignment below edits txt_df, not a slice of df
txt_df.loc[txt_df['is_colorless'] & ~txt_df['is_land'], 'colorID'] = 'C'
txt_df = txt_df[txt_df['colorID'].notnull()]  # Drop any remaining NaNs
print(f"Rows dropped: {df.shape[0] - txt_df.shape[0]}")

Rows dropped: 6668


## Tokenize & numericalize data

In [10]:
txts = L(txt_df['text'].to_list())
len(txts)

45420

In [11]:
tkn = Tokenizer(WordTokenizer())
tokens = txts.map(tkn)
tokens[0]

(#60) ['xxbos','xxmaj','if','you','would','draw','a','card',',','you'...]

In [12]:
num = Numericalize(min_freq=5)
num.setup(tokens)  # tokens are already tokenized, so no second pass through tkn is needed
# coll_repr(num.vocab, 200)

In [13]:
# Apply numericalizer to all text and load into dataloader
numericalized = tokens.map(num)
dl = LMDataLoader(numericalized, bs=128, shuffle=True)

x, y = first(dl)
x.shape, y.shape

(torch.Size([128, 72]), torch.Size([128, 72]))

```fastai``` handles tokenization & numericalization inside the ```TextDataLoaders``` object, so the rest of the notebook uses that instead. Still, it's nice to see that the manual process works as intended here.
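To make the numericalization step more concrete, here is a rough pure-Python sketch of what a minimum-frequency vocabulary does. It is not fastai's implementation: `build_vocab` and `numericalize` are hypothetical helpers, the toy token lists are made up, and fastai's real vocab also holds other special tokens (such as the `xxbos` and `xxmaj` seen above) that are left out here.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=2, unk="xxunk"):
    """Keep tokens seen at least min_freq times; everything else maps to unk."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    kept = sorted(t for t, c in counts.items() if c >= min_freq and t != unk)
    vocab = [unk] + kept
    stoi = {tok: i for i, tok in enumerate(vocab)}
    return vocab, stoi

def numericalize(tokens, stoi, unk="xxunk"):
    """Replace each token with its vocab index, falling back to the unk index."""
    return [stoi.get(tok, stoi[unk]) for tok in tokens]

# Toy corpus (made up): 'draw' and 'discard' each appear only once
toy = [["draw", "a", "card"], ["discard", "a", "card"]]
vocab, stoi = build_vocab(toy, min_freq=2)
ids = numericalize(["draw", "a", "card"], stoi)
print(vocab)  # ['xxunk', 'a', 'card']
print(ids)    # [0, 1, 2]
```

Rare tokens collapsing into a single unknown index is the trade-off behind `min_freq=5`: a smaller vocab at the cost of losing very infrequent words.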

## Train language model

In [14]:
dls = TextDataLoaders.from_df(txt_df, text_col='text', is_lm=True)

In [15]:
# Using a high drop_mult value here. In practice, it doesn't seem to matter 
# much but erring on the side of caution given small dataset
model = language_model_learner(dls,
                               AWD_LSTM,
                               drop_mult=0.8,
                               metrics=[accuracy, Perplexity()]).to_fp16()

# model.lr_find() #2e-2

In [16]:
model.fit_one_cycle(1, 2e-2)
model.unfreeze()
model.fit_one_cycle(7, 2e-3)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,1.845841,1.366548,0.673028,3.921791,01:03


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,1.370216,1.051953,0.74406,2.863237,01:03
1,1.079453,0.861372,0.789143,2.366406,01:04
2,0.930761,0.761685,0.813554,2.141882,01:03
3,0.830779,0.699894,0.829717,2.013539,01:03
4,0.751595,0.660373,0.840631,1.935515,01:04
5,0.706569,0.641698,0.846318,1.899705,01:03
6,0.68927,0.639018,0.847027,1.89462,01:04


Pretty solid: after the 7 unfrozen epochs, validation accuracy is about 84.7% with a perplexity of about 1.89. This should be good to build upon. Let's try generating some card text.
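As a sanity check on the metrics, perplexity is the exponential of the average cross-entropy loss, so the `perplexity` column should equal `exp(valid_loss)`. The short check below uses two rows copied from the training output above. It is only a sketch to show the relationship and is not part of the original notebook.

```python
import math

# (valid_loss, perplexity) pairs copied from the training output above
reported = [
    (1.366548, 3.921791),  # single frozen epoch
    (0.639018, 1.894620),  # final unfrozen epoch
]

for valid_loss, perplexity in reported:
    recomputed = math.exp(valid_loss)
    print(f"exp({valid_loss}) = {recomputed:.4f} (reported {perplexity})")
```

A perplexity near 1.9 means the model is, on average, about as uncertain as if it were choosing between roughly two equally likely next tokens.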

In [17]:
test_txt = "Draw a card for"
n_words = 40   
n_sentences = 10
preds = [model.predict(test_txt, n_words) for _ in range(n_sentences)]
print("\n\n".join(preds))

Draw a card for each white creature you control . Flying , protection from black 
 At the beginning of your end step , if Whirling Dervish dealt damage to an opponent this turn , put a +1 /

Draw a card for each creature you control with a +1 / +1 counter on it . Pirate Ship ca n't attack unless defending player controls an Island . 
 { t } : Pirate Ship deals 1

Draw a card for each creature you control with power 4 or greater . Whenever you cast a Spirit or Arcane spell , you may untap Opportunistic Dragon . 
 For each t } : Gain

Draw a card for each attacking creature with a +1 / +1 counter on it . Destroy target creature or land . { t } : Draw a card , then discard a card . 
 { b } ,

Draw a card for each tapped creature target opponent controls . ( { t } : Add { g } . ) Whenever you cast an instant or sorcery spell , Guttersnipe deals 2 damage to each opponent .

Draw a card for each enchantment you control . 
 { g}{u } : You may put a land , creature , or land card from your han

They come out pretty nice! Save the model & the encoder for future use

In [18]:
model.save('MTG_language_model')
model.save_encoder('MTG_language_encoder')

In [19]:
df.to_csv('Language_card.csv', index=False)