In [None]:
# import modules
import pandas as pd
import numpy as np
import re
from bertopic import BERTopic
import random
import pickle
from sklearn.model_selection import train_test_split
from sklearn.metrics import multilabel_confusion_matrix
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsRegressor as KNN_reg
from sklearn.tree import DecisionTreeRegressor as DT_reg
from sklearn.ensemble import RandomForestRegressor as RF_reg
from sklearn import metrics
import matplotlib.pyplot as plt
from plotnine import *
import json

In [None]:
# load in data
(pd.read_feather('C:/Georgetown University/Courses/Spring Semester 2022/Text As Data/text-data-spr22/data/mtg.feather')# <-- will need to change for your notebook location
 .head(2)  
)

In [None]:
# store full data
df = (pd.read_feather('C:/Georgetown University/Courses/Spring Semester 2022/Text As Data/text-data-spr22/data/mtg.feather')  
)

# check shape
df.shape

### Part 1: Unsupervised Exploration

Investigate the BERTopic documentation (linked), and train a model using their library to create a topic model of the flavor_text data in the dataset above.

- In a topic_model.py, load the data and train a bertopic model. You will save the model in that script as a new trained model object
- add a "topic-model" stage to your dvc.yaml that has mtg.feather and topic_model.py as dependencies, and your trained model as an output
- load the trained bertopic model into your notebook and display
    - the topic_visualization interactive plot see docs
    - Use the plot to come up with working "names" for each major topic, adjusting the number of topics as necessary to make things more useful.
    - Once you have names, create a Dynamic Topic Model by following their documentation. Use the release_date column as timestamps.
    - Describe what you see, and any possible issues with the topic models BERTopic has created. This is the hardest part... interpreting!

In [None]:
# load trained BERTopic model
topic_model = BERTopic.load("flav_text_model")

In [None]:
# access frequent topics
topic_model.get_topic_info()

-1 refers to all outliers and should typically be ignored. Next, let's take a look at the most frequent topic that was generated, topic 0:

In [None]:
topic_model.get_topic(0)

In [None]:
# store topic frequency
freq_topics = topic_model.get_topic_info().iloc[1: , :] # remove row with outliers (where Topic = -1)

# view percentiles of Count/frequency
freq_topics.Count.quantile([0.25,0.5,0.75,0.99])

I will select the topics whose Count is above the 99th percentile.

#### Interactive plots

In [None]:
# visualize all topics 
topic_model.visualize_topics()

It's very hard to interpret 800+ topics, so I am going to select and visualize topics that have a frequency in the top percentile. Assumption: high frequency topics are representative of the main 'topic clusters'.

In [None]:
# view topics with freq in the top percentile
freq_topics.loc[freq_topics.Count > freq_topics.Count.quantile(0.99)] 

In [None]:
# view intertopic distance map
topic_model.visualize_topics(topics = [-1,0,1,2,3,4,5,6,7,8,9,10]) 

In order to name these topics, I will visualize them as bar charts that include the top 9 words in each topic. (I tried including the top 10 words, but then the chart labels only every other word, which makes it difficult to interpret.)

In [None]:
topic_model.visualize_barchart(topics = [0,1,2,3,4,5,6,7,8,9,10], n_words = 9) 

I'm not too familiar with these cards, but through Google searches of the top few words, I was able to come up with what I think are good topic names. I have added supporting links as well. 

- Topic 0 - Based on the top words (which show up in 'Phyrexia creature' cards in Google searches), this topic seems to capture the set 'New Phyrexia'.
- Topic 1 - Sword of Sinew and Steel (https://www.cardkingdom.com/mtg/modern-horizons/sword-of-sinew-and-steel)
- Topic 2 - Champions of Kamigawa (https://mtg.wtf/set/chk?page=7)
- Topic 3 - Beetleback Chief (https://gatherer.wizards.com/pages/card/Details.aspx?multiverseid=386305)
- Topic 4 - Noxious Dragon (https://gatherer.wizards.com/pages/card/details.aspx?multiverseid=391888)
- Topic 5 - Sarpadian Empires (https://mtg.fandom.com/wiki/Sarpadian_Empires)
- Topic 6 - Werewolf (https://mtg.fandom.com/wiki/Werewolf)
- Topic 7 - Vampire Lacerator (https://gatherer.wizards.com/pages/card/details.aspx?multiverseid=192225)
- Topic 8 - Squee (Squee was a **goblin cabin-hand** on the Skyship Weatherlight - https://mtg.fandom.com/wiki/Squee)
- Topic 9 - Necromancy (https://www.moxfield.com/decks/rlvIQMx1zUCT6smgX4GpOw)
- Topic 10 - Garruk Wildspeaker (https://gatherer.wizards.com/pages/card/details.aspx?multiverseid=140205)

In [None]:
# add topic name
freq_topics_11 = freq_topics.iloc[0:11, :].copy() # copy so the new column is not assigned to a slice of freq_topics

freq_topics_11['Topic Name'] = ['New Phyrexia',
                                'Sword of Sinew and Steel',
                                'Champions of Kamigawa',
                                'Beetleback Chief',
                                'Noxious Dragon',
                                'Sarpadian Empires',
                                'Werewolf',
                                'Vampire Lacerator',
                                'Squee',
                                'Necromancy',
                                'Garruk Wildspeaker']

# view
freq_topics_11

In [None]:
topic_model.visualize_heatmap(topics = [0,1,2,3,4,5,6,7,8,9,10]) 

A heatmap shows the similarity between topics (based on the cosine similarity matrix between topic embeddings). Looking at the heatmap above, we can see that topic 9 (Necromancy) is similar to topic 4 (Noxious Dragon).
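To make the similarity measure behind the heatmap concrete, here is a minimal sketch of a cosine similarity matrix. The three-dimensional `topic_embeddings` are made-up values for illustration only; the real topic embeddings come from the trained model and are much larger.

```python
import numpy as np

# Hypothetical topic embeddings, one row per topic (illustrative values only).
topic_embeddings = np.array([
    [1.0, 0.0, 0.0],  # topic A
    [0.8, 0.6, 0.0],  # topic B
    [0.0, 0.0, 1.0],  # topic C
])

# Scale every row to unit length, then take all pairwise dot products.
unit = topic_embeddings / np.linalg.norm(topic_embeddings, axis=1, keepdims=True)
similarity = unit @ unit.T

print(np.round(similarity, 2))
```

Entries close to 1 mark topics whose embeddings point in nearly the same direction, and entries close to 0 mark unrelated topics. The diagonal is always 1 because every topic is identical to itself.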

#### Once you have names, create a Dynamic Topic Model by following their documentation. Use the release_date column as timestamps.

In [None]:
df2 = df.dropna(how = 'any', subset = ['flavor_text'])

# check if dataframe has any missing values in the release_date column
df2.isnull().sum()

In [None]:
# store release_date column as list
timestamps = df2.release_date.to_list()

# check length
len(timestamps)

In [None]:
# store flavor_text data as list
flavor_text_list = df2.flavor_text.tolist()

# check length
len(flavor_text_list)

In [None]:
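# fit model again to get one topic per document
# (note: refitting retrains the model, so topic ids may no longer match the names assigned above)
topics, probs = topic_model.fit_transform(flavor_text_list)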

# check length of topics
len(topics)

In [None]:
# generate the topic representations at each timestamp for each topic 
topics_over_time = topic_model.topics_over_time(flavor_text_list, topics, timestamps)

In [None]:
topic_model.visualize_topics_over_time(topics_over_time, topics = [0,1,2,3,4,5,6,7,8,9,10])

`Champions of Kamigawa` was released in October 2004 (which explains the spike around 2005). 

## Part 2 Supervised Classification

Using only the text and flavor_text data, predict the color identity of cards:

Follow the sklearn documentation covered in class on text data and Pipelines to create a classifier that predicts which of the colors a card is identified as. You will need to preprocess the target color_identity labels depending on the task:

- Source code for pipelines
    - in multiclass.py, again load data and train a Pipeline that preprocesses the data and trains a multiclass classifier (LinearSVC), and saves the model pickle output once trained. Target labels with more than one color should be unlabeled!
    - in multilabel.py, do the same, but with a multilabel model (e.g. here). You should now use the original color_identity data as-is, with special attention to the multi-color cards.
- in dvc.yaml, add these as stages to take the data and scripts as input, with the trained/saved models as output.


- in your notebook:
    - **Describe: preprocessing steps (the tokenization done, the ngram_range, etc.), and why.**
    - **load both models and plot the confusion matrix for each model (see here for the multilabel-specific version)**
    - **Describe: what are the models succeeding at? Where are they struggling? How do you propose addressing these weaknesses next time?**

### Multiclass Classifier

In [None]:
# check missing values
df.isnull().sum()

`color_identity` and `text` don't have any missing values so only missing values from the `flavor_text` variable need to be removed.

In [None]:
# remove rows where target (color_identity) or predictors (flavor_text and text) have missing values
df2 = df.dropna(how = 'any',
                subset = ['flavor_text'])

# check
df2.isnull().sum()

#### For $x$, combine text and flavor text data

In [None]:
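# build the combined column from df2 itself; assign() returns a new dataframe and avoids SettingWithCopyWarning
df2 = df2.assign(combined_text = df2['text'] + ' ' + df2['flavor_text'])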

# view
df2.head(2)

#### For $y$, encode target variable (`color_identity`)

Target labels with more than one color should be unlabeled!

To "unlabel" data, I will replace the label with -1.<br>
Where `color_identity` is empty, I will set the label to 0.


In [None]:
# store color_identity values as a list
color_identity_values = list(df2.color_identity.values)

# create empty list to store results
color_identity_multiclass = []

# iterate through list, and unlabel target labels with more than one color
for i in color_identity_values:
    if len(i) == 1:
        color_identity_multiclass.append(i[0])
    elif len(i) < 1:
        color_identity_multiclass.append(0) # storing missing values as 0
    else:
        color_identity_multiclass.append(-1) # unlabeling target labels with more than one color

# check length
len(color_identity_multiclass)

In [None]:
# check target labels
set(color_identity_multiclass)

In [None]:
### encode target labels (I will do this manually instead of using LabelEncoder())

# map each single-color code to an integer
color_codes = {'W': 1, 'U': 2, 'R': 3, 'G': 4, 'B': 5}

# labels that are not color codes (0 for empty, -1 for unlabeled) are kept as they are
encoded_target_multiclass = [color_codes.get(i, i) for i in color_identity_multiclass]
        
# check length
len(encoded_target_multiclass)

In [None]:
# check labels
set(encoded_target_multiclass)

In [None]:
# add encoded labels to dataframe as a new column
df2['multiclass'] = encoded_target_multiclass

# view
df2.head(2)

#### Split data into training and test sets

In [None]:
# store target and predictor
y = df2[['multiclass']]
X = df2[['combined_text']]

# split data into training and test sets
train_X, test_X, train_y, test_y = train_test_split(X, y , test_size = .25, random_state = 123)

In [None]:
# check training and test data shapes
print(train_X.shape[0]/df2.shape[0])
print(test_X.shape[0]/df2.shape[0])

#### Training Data

In [None]:
# store training data as a list
training_X = train_X.combined_text.tolist()

# check length
len(training_X)

In [None]:
# check train_y length
len(train_y)

In [None]:
# store training target as numpy array
training_target = train_y.multiclass.values

# check length
len(training_target)

#### Test Data

In [None]:
# store test data as a list
test_x = test_X.combined_text.tolist()

# check length
len(test_x)

In [None]:
# check test_y length
len(test_y)

In [None]:
# store test target as numpy array
test_target = test_y.multiclass.values

# check length
len(test_target)

#### Preprocessing Steps:

Pre-processing text using CountVectorizer():
- removing English stop words in order to remove the 'low-level' information in the text and focus more on the important information.
- converting all words to lowercase - assumption is that the meaning and significance of a lowercase word is the same as when that word is in uppercase or capitalized. This will help remove noise.
- ngram_range set to 1,2 i.e. capturing both unigrams and bigrams since Magic Card texts often have names/terms that are bigrams e.g. Soul Warden and Beetleback Chief. 
- min_df set to 5 i.e. rare words that appear in fewer than 5 documents will be ignored.
- max_df set to 0.9 i.e. words that appear in more than 90% of the documents will be ignored since they are not adding much to a specific document.

Using TfidfTransformer():
- Term frequencies are calculated to overcome the discrepancies that come from using raw occurrence counts for documents of different lengths.
- Weights are downscaled for words that occur in many documents, since these carry less information than words that occur in a smaller share of the corpus (tf-idf).
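The preprocessing steps above can be sketched as a scikit-learn Pipeline. This is only an illustration under stated assumptions: the step names (`counts`, `tfidf`, `svc`) and the toy corpus are hypothetical, and the actual multiclass.py may be organised differently. Each toy text is repeated so that its terms survive `min_df=5`.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

# Toy corpus (hypothetical): two kinds of card text, each repeated six times.
texts = ["gain life soul warden"] * 6 + ["deal damage goblin chief"] * 6
labels = ["W"] * 6 + ["R"] * 6

pipeline = Pipeline([
    # stop-word removal, lowercasing, unigrams and bigrams, document-frequency cut-offs
    ("counts", CountVectorizer(stop_words="english", lowercase=True,
                               ngram_range=(1, 2), min_df=5, max_df=0.9)),
    # convert raw counts to tf-idf weights
    ("tfidf", TfidfTransformer()),
    # linear support vector classifier
    ("svc", LinearSVC()),
])

pipeline.fit(texts, labels)
prediction = pipeline.predict(["gain life soul warden"])
print(prediction)
```

Keeping the vectorizer inside the Pipeline means the pickled model accepts raw strings, which is why a plain list of texts can be passed straight to `predict`.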


In [None]:
# load multiclass model (the with-block closes the file automatically)
with open("multiclass_classifier.pickle", "rb") as file_to_read:
    multiclass_classifier = pickle.load(file_to_read)

# view
print(multiclass_classifier)

In [None]:
predicted = multiclass_classifier.predict(test_x)
np.mean(predicted == test_target)

We achieved 85% accuracy using Linear SVC.

In [None]:
# plot confusion matrix
multilabel_confusion_matrix(test_target, predicted, labels = [1,2,3,4,5])

This is how we can interpret the confusion matrix values, which scikit-learn lays out as [[TN, FP], [FN, TP]] for each label: 1007 of the observations with the label 1 (i.e. color White) were predicted correctly by the model (true positives), whereas 6023 observations that did not have the label 1 were correctly predicted as not having it (true negatives). 208 records that did not have the label 1 were wrongly predicted as having the label 1 (false positives), while 171 records that did have the label 1 were wrongly predicted as not having the label 1 (false negatives).
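As a check on how the four counts for label 1 should be read, the sketch below first shows the layout of one 2x2 block on made-up labels, then derives precision, recall and F1 from the counts quoted above. The toy `y_true` and `y_pred` lists are hypothetical.

```python
from sklearn.metrics import multilabel_confusion_matrix

# Toy labels (hypothetical) to show how each 2x2 block is laid out.
y_true = [1, 1, 2, 2, 3, 1]
y_pred = [1, 2, 2, 2, 3, 3]
mcm = multilabel_confusion_matrix(y_true, y_pred, labels=[1, 2, 3])
print(mcm[0])  # block for label 1: [[TN, FP], [FN, TP]]

# Counts reported in this notebook for label 1 (White).
tn, fp, fn, tp = 6023, 208, 171, 1007
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 5), round(recall, 5), round(f1, 5))
```

With 1007 as the true positives, the resulting precision (0.82881), recall (0.85484) and F1 (0.84162) agree with the workspace values for label 1 in the `dvc exp diff` table in Part 4.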

#### F1 Score

In [None]:
# open the JSON file and load it as a dictionary
# (the with-block closes the file automatically)
with open('metrics.json') as f:
    data = json.load(f)

# print
data

In [None]:
# store scores as a dataframe
# (named report_df so that it does not overwrite the imported sklearn `metrics` module)
report_df = pd.DataFrame(metrics.classification_report(test_target, predicted, output_dict = True))
print(report_df)

The macro-averaged F1-score is computed as a simple arithmetic mean of the per-class F1-scores.

When computing the macro-F1, we give equal weight to each class. We don’t have to do that: in the weighted-average F1-score, the F1-score of each class is weighted by the number of samples from that class.
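The difference between the two averages is easiest to see on a small imbalanced example. The labels below are made up for illustration; class 1 is three times as common as class 2.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy, imbalanced labels (hypothetical).
y_true = [1, 1, 1, 1, 1, 1, 2, 2]
y_pred = [1, 1, 1, 1, 1, 2, 2, 1]

# Per-class F1 scores and the number of true samples in each class.
per_class = f1_score(y_true, y_pred, average=None, labels=[1, 2])
support = np.bincount(y_true)[1:]

macro = per_class.mean()                            # every class counts equally
weighted = np.average(per_class, weights=support)   # classes count by their size
print(per_class, macro, weighted)
```

Here the frequent class scores better than the rare one, so the weighted average comes out higher than the macro average. When the two averages are close, as in this notebook's report, the classes are performing at a similar level.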

### Multilabel Classifier

In [None]:
# check missing values
df.isnull().sum()

`color_identity` and `text` don't have any missing values so only missing values from the `flavor_text` variable need to be removed.

In [None]:
# remove rows where target (color_identity) or predictors (flavor_text and text) have missing values
df2 = df.dropna(how = 'any',
                subset = ['flavor_text'])

# check
df2.isnull().sum()

#### For $x$, combine text and flavor text data

In [None]:
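# build the combined column from df2 itself; assign() returns a new dataframe and avoids SettingWithCopyWarning
df2 = df2.assign(combined_text = df2['text'] + ' ' + df2['flavor_text'])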

# view
df2.head(2)

#### For $y$, use the (`color_identity`) column as is

Guidance obtained from: https://scikit-learn.org/stable/modules/preprocessing_targets.html#preprocessing-targets
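As a small illustration of what the binarizer produces, the sketch below runs it on a few made-up `color_identity` values. The example lists are hypothetical; only `MultiLabelBinarizer` itself is taken from the notebook.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical color_identity values: one list of color codes per card.
color_identity_values = [["W"], ["B", "G"], [], ["U", "R", "W"]]

mlb = MultiLabelBinarizer()
indicators = mlb.fit_transform(color_identity_values)

print(mlb.classes_)  # column order of the indicator matrix
print(indicators)    # one row per card, one 0/1 column per color
```

Keeping a reference to the fitted binarizer (here `mlb`) makes it possible to map columns back to color codes later, for example with `mlb.inverse_transform`. A card with an empty list becomes a row of zeros.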

In [None]:
# store color_identity values as a list
color_identity_values = list(df2.color_identity.values)

# create label binary indicator array - target
color_identity_multilabels = MultiLabelBinarizer().fit_transform(color_identity_values)

In [None]:
# store target and predictor
y = color_identity_multilabels
X = df2[['combined_text']]

# split data into training and test sets
train_X, test_X, train_y, test_y = train_test_split(X, y , test_size = .25, random_state = 123)

In [None]:
# check training and test data shapes
print(train_X.shape[0]/df2.shape[0])
print(test_X.shape[0]/df2.shape[0])

#### Training Data

In [None]:
# store training data as a list
training_X = train_X.combined_text.tolist()

# check length
len(training_X)

In [None]:
# check train_y length
len(train_y)

In [None]:
# store training target as numpy array
training_target = train_y

# check length
len(training_target)

#### Test Data

In [None]:
# store test data as a list
test_x = test_X.combined_text.tolist()

# check length
len(test_x)

In [None]:
# check test_y length
len(test_y)

In [None]:
# store test target as numpy array
test_target = test_y

# check length
len(test_target)

#### Preprocessing Steps:

Pre-processing text using CountVectorizer():
- removing English stop words in order to remove the 'low-level' information in the text and focus more on the important information.
- converting all words to lowercase - assumption is that the meaning and significance of a lowercase word is the same as when that word is in uppercase or capitalized. This will help remove noise.
- ngram_range set to 1,2 i.e. capturing both unigrams and bigrams since Magic Card texts often have names/terms that are bigrams e.g. Soul Warden and Beetleback Chief. 
- min_df set to 5 i.e. rare words that appear in fewer than 5 documents will be ignored.
- max_df set to 0.9 i.e. words that appear in more than 90% of the documents will be ignored since they are not adding much to a specific document.

Using TfidfTransformer():
- Term frequencies are calculated to overcome the discrepancies that come from using raw occurrence counts for documents of different lengths.
- Weights are downscaled for words that occur in many documents, since these carry less information than words that occur in a smaller share of the corpus (tf-idf).


In [None]:
# load multilabel model (the with-block closes the file automatically)
with open("multilabel_classifier.pickle", "rb") as file_to_read:
    multilabel_classifier = pickle.load(file_to_read)

# view
print(multilabel_classifier)

In [None]:
predicted = multilabel_classifier.predict(test_x)
np.mean(predicted == test_target)

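We achieved 93% label-wise accuracy using OneVsRestClassifier. Because `predicted == test_target` compares every entry of the indicator matrix, this figure is the share of individual color indicators predicted correctly; it is more lenient than exact-match accuracy, which requires every color of a card to be predicted correctly.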
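For a multilabel target, `np.mean(predicted == test_target)` averages over every cell of the indicator matrix. The sketch below contrasts that with exact-match accuracy on made-up indicator arrays; the values are hypothetical.

```python
import numpy as np

# Hypothetical indicator matrices: 4 cards, 3 labels each.
y_true = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 1, 0],
                   [1, 0, 1]])

# Share of individual label indicators that are correct.
label_wise = (y_pred == y_true).mean()

# Share of cards whose whole label set is correct.
exact_match = (y_pred == y_true).all(axis=1).mean()

print(label_wise, exact_match)
```

Two of the four cards have one wrong indicator each, so 10 of 12 cells are correct while only half of the cards are fully correct. Reporting both numbers gives a fairer picture of a multilabel model.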

In [None]:
# plot confusion matrix
multilabel_confusion_matrix(test_target, predicted)

## Part 3

#### Part 3: Regression?

Can we predict the EDHREC "rank" of the card using the data we have available?

- Like above, add a script and dvc stage to create and train your model
- in the notebook, aside from your descriptions, plot the predicted vs. actual rank, with a 45-deg line showing what "perfect prediction" should look like.
- This is a freeform part, so think about the big picture and keep track of your decisions:
    - what model did you choose? Why?
    - What data did you use from the original dataset? How did you preprocess it?
    - Can we see the importance of those features? e.g. logistic weights?
- How did you do? What would you like to try if you had more time?

For this part, I wanted to try using some categorical variables that I thought could be important predictors - namely the block i.e. sets with "shared mechanics", and the rarity of cards. 

I ran a grid search using K-nearest neighbors, random forest and a decision tree regressor, and found KNN() with 5-nearest neighbors to be the best model.
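A grid search over the three model families could look roughly like the sketch below. Everything specific in it is an assumption for illustration: the synthetic data, the step name `model`, the candidate parameter values, and the scoring choice are hypothetical and are not taken from the original training script.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (hypothetical): 200 rows of 0/1 dummy features.
rng = np.random.default_rng(123)
X = rng.integers(0, 2, size=(200, 6))
y = X @ np.array([5.0, 3.0, 0.0, 1.0, 0.0, 2.0]) + rng.normal(scale=0.5, size=200)

# A one-step pipeline whose estimator is swapped out by the grid.
pipe = Pipeline([("model", KNeighborsRegressor())])
param_grid = [
    {"model": [KNeighborsRegressor()], "model__n_neighbors": [5, 10, 25]},
    {"model": [DecisionTreeRegressor(random_state=123)], "model__max_depth": [2, 4, 8]},
    {"model": [RandomForestRegressor(random_state=123)], "model__n_estimators": [50, 100]},
]

search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

# The refitted winner keeps its tuned parameters and can be pickled as-is.
best_mod = search.best_estimator_
print(search.best_params_)
```

Saving `search.best_estimator_` rather than a freshly constructed model keeps the tuned parameters attached to the object that gets pickled.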

In [None]:
# remove rows where target or predictors have missing values
df2 = df.dropna(how = 'any',
                subset = ['block',
                         'rarity',
                         'edhrec_rank'])


`block`

In [None]:
# get dummies
block_dummies = pd.get_dummies(df2.block)
block_dummies.columns = [c.lower().replace(" ","_") for c in block_dummies.columns]

block_dummies = block_dummies.drop(['alara'],axis=1) # Baseline
block_dummies.head(5)

In [None]:
df2 = pd.concat([df2.drop(['block'],axis=1),block_dummies],axis=1)
df2.head()

`rarity`

In [None]:
# get dummies
rarity_dummies = pd.get_dummies(df2.rarity)
rarity_dummies.columns = [c.lower().replace(" ","_") for c in rarity_dummies.columns]

rarity_dummies = rarity_dummies.drop(['common'],axis=1) # Baseline
rarity_dummies.head(5)

In [None]:
df2 = pd.concat([df2.drop(['rarity'],axis=1),rarity_dummies],axis=1)
df2.head()

In [None]:
# store target and predictor
y = df2[['edhrec_rank']]
X = df2[['amonkhet', 'arena_league',
       'battle_for_zendikar', 'commander', 'conspiracy', 'core_set',
       'friday_night_magic', 'guilds_of_ravnica', 'ice_age', 'innistrad',
       'innistrad:_double_feature', 'invasion', 'ixalan', 'judge_gift_cards',
       'kaladesh', 'kamigawa', 'khans_of_tarkir', 'lorwyn',
       'magic_player_rewards', 'masques', 'mirage', 'mirrodin', 'odyssey',
       'onslaught', 'ravnica', 'return_to_ravnica', 'scars_of_mirrodin',
       'shadowmoor', 'shadows_over_innistrad', 'tempest', 'theros',
       'time_spiral', 'urza', 'zendikar', 'rare', 'uncommon']]

# split data into training and test sets
train_X, test_X, train_y, test_y = train_test_split(X, y , test_size = .25, random_state = 123)

In [None]:
# load model (the with-block closes the file automatically)
with open("best_mod.pickle", "rb") as file_to_read:
    best_mod = pickle.load(file_to_read)

# view
print(best_mod)

Next, fit the loaded model on the training data:

In [None]:
best_mod.fit(train_X,train_y)

In [None]:
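predicted = best_mod.predict(test_X)
# exact equality is not meaningful for a continuous target, so report the root mean squared error instead
np.sqrt(np.mean((np.ravel(predicted) - test_y.values.ravel()) ** 2))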

In [None]:
# store test_y
df_plot = test_y.copy()

# flatten the (n, 1) prediction array and store it as a dataframe column
df_plot['predicted'] = np.ravel(predicted)

In [None]:
# plot

(ggplot(data = df_plot,
        mapping = aes(x = 'edhrec_rank', y = 'predicted')) +
 geom_point(color = 'slategray', alpha = 0.7) + 
 geom_abline(intercept = 0, slope = 1, size = 2, color = 'maroon') +
 theme_minimal() +
 labs(title = 'Predicted vs Actual Rank\n',
     y = 'Predicted\n',
     x = '\nActual')
)

This isn't a good fit: the points are scattered widely instead of lying close to the 45-degree line (i.e. the predictions are not close to the actual values).

KNN() with 5 nearest neighbors was identified as the best model when I did a grid search. However, when I loaded the model in the notebook, it did not have the number of neighbors specified and I was unsure how to add it or how to save the model in the .py script such that the number of neighbors also gets saved as a parameter of KNN. 

How did I do? Not too well; there is definitely a lot of room for improvement. With more time, I would like to select more predictors and set k=5 explicitly in the KNN regressor (update: the default number of neighbors is 5, so even though I didn't specify k, the model ran with k=5).
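On the question of the missing number of neighbors: as far as I know, recent scikit-learn versions print only the parameters that differ from their defaults, so a regressor with k=5 shows up as `KNeighborsRegressor()` even though the value is stored. The minimal sketch below checks this with an in-memory pickle round trip; it does not use the notebook's actual model file.

```python
import pickle
from sklearn.neighbors import KNeighborsRegressor

# Set the number of neighbors explicitly, even though 5 is also the default.
knn = KNeighborsRegressor(n_neighbors=5)

# Round-trip through pickle in memory; a file written with pickle.dump behaves the same way.
restored = pickle.loads(pickle.dumps(knn))

print(restored)                              # may omit n_neighbors because it equals the default
print(restored.get_params()["n_neighbors"])  # the stored value
```

`get_params()` is the reliable way to inspect what a loaded model was configured with, regardless of how it prints.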

## Part 4

### For multiclass, report average and F1
Done above where the multiclass model was run.

### Run a new experiment that changes one parameter:

#### Output of `dvc exp diff`, copied and pasted from the command line and formatted as a table:

| Path | Metric | exp-ddb8e | workspace | Change |
| --- | --- | --- | --- | --- |
| metrics.json | -1.f1-score | 0.78642 | 0.77876 | -0.0076563 |
| metrics.json | -1.precision | 0.81949 | 0.81064 | -0.0088423 |
| metrics.json | -1.recall | 0.75591 | 0.74929 | -0.0066225 |
| metrics.json | 0.f1-score | 0.8684 | 0.86119 | -0.0072075 |
| metrics.json | 0.precision | 0.90123 | 0.89736 | -0.0038784 |
| metrics.json | 0.recall | 0.83788 | 0.82783 | -0.010043 |
| metrics.json | 1.f1-score | 0.83958 | 0.84162 | 0.0020424 |
| metrics.json | 1.precision | 0.83292 | 0.82881 | -0.004109 |
| metrics.json | 1.recall | 0.84635 | 0.85484 | 0.008489 |
| metrics.json | 2.f1-score | 0.87825 | 0.87506 | -0.0031947 |
| metrics.json | 2.precision | 0.85212 | 0.84865 | -0.0034704 |
| metrics.json | 2.recall | 0.90604 | 0.90316 | -0.0028763 |
| metrics.json | 3.f1-score | 0.8846 | 0.88317 | -0.0014276 |
| metrics.json | 3.precision | 0.86964 | 0.8725 | 0.002863 |
| metrics.json | 3.recall | 0.90009 | 0.89411 | -0.0059778 |
| metrics.json | 4.f1-score | 0.85777 | 0.85627 | -0.0014944 |
| metrics.json | 4.precision | 0.85702 | 0.85404 | -0.0029783 |
| metrics.json | 5.f1-score | 0.85931 | 0.85253 | -0.0067797 |
| metrics.json | 5.precision | 0.85816 | 0.85445 | -0.0037149 |
| metrics.json | 5.recall | 0.86047 | 0.85063 | -0.009839 |
| metrics.json | accuracy | 0.85356 | 0.85018 | -0.0033743 |
| metrics.json | macro avg.f1-score | 0.85348 | 0.8498 | -0.0036739 |
| metrics.json | macro avg.precision | 0.8558 | 0.85235 | -0.0034472 |
| metrics.json | macro avg.recall | 0.85218 | 0.84834 | -0.0038385 |
| metrics.json | weighted avg.f1-score | 0.85305 | 0.84968 | -0.0033749 |
| metrics.json | weighted avg.precision | 0.85347 | 0.85013 | -0.0033366 |
| metrics.json | weighted avg.recall | 0.85356 | 0.85018 | -0.0033743 |

<br>


| Path | Param | exp-ddb8e | workspace | Change |
| --- | --- | --- | --- | --- |
| params.yaml | preprocessing.ngrams.largest | 3 | 2 | -1 |

Grabbing the weighted average scores from the output above:

| | Precision | Recall | F1-Score | 
| --- | --- | --- | --- |
| ngrams.largest = 2|0.85013|0.85018|0.84968|
| ngrams.largest = 3| 0.85347 | 0.85356 | 0.85305 |



Changing the ngram range from (1,2) to (1,3) produced only a very slight improvement in performance: the weighted-average precision, recall and F1-score each rose by about 0.003.
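To see what the changed parameter does to the feature set, the sketch below counts the n-grams extracted from one short, made-up card text under both settings. The example sentence is hypothetical, and the other vectorizer options are left at their defaults rather than matching the trained pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical card text.
text = ["beetleback chief summons goblins"]

# Unigrams and bigrams.
up_to_bigrams = CountVectorizer(ngram_range=(1, 2)).fit(text)
# Unigrams, bigrams and trigrams.
up_to_trigrams = CountVectorizer(ngram_range=(1, 3)).fit(text)

print(sorted(up_to_bigrams.vocabulary_))
print(sorted(up_to_trigrams.vocabulary_))
```

Going from (1,2) to (1,3) only adds the trigrams on top of the existing features. Most three-word sequences are rare, and many would also be filtered out by `min_df`, which is one plausible reason the scores barely moved.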