# Homework: Goals & Approaches

> The body grows stronger under stress. The mind does not.
> 
>  -- Magic the Gathering, _Fractured Sanity_

This homework deals with the goals you must define, along with the approaches you deem necessary to achieve those goals. 
Key to this will be a focus on your _workflows_: 

- are they reproducible? 
- are they maintainable? 
- are they well-justified and communicated? 

This is not a "machine-learning" course, but machine learning plays a large part in modern text analysis and NLP. 
Machine learning, in turn, makes it hard to track and solve issues in a collaborative, asynchronous, distributed manner. 

It's not inherently _wrong_ to use pre-configured models and libraries! 
In fact, you will likely be unable to create a set of ML algorithms that "beat" something others have spent hundreds of hours creating, optimizing, and validating. 
However, to answer the three questions above, we need a way to explicitly track our decisions to use others' work, and efficiently _swap out_ that work for new ideas and directions as the need arises. 

This homework is a "part 1" of sorts, where you will construct several inter-related pipelines in a way that will allow _much easier_ adjustment, experimentation, and measurement in "part 2".




## Setup

### Dependencies 
As before, ensure you have an up-to-date environment to isolate your work. 
Use the `environment.yml` file in the project root to create/update the `text-data-class` environment. 
> I expect any additional dependencies to be added here, which will show up on your pull-request. 

### Data
Once again, we have set things up to use DVC to import our data. 
If the data changes, things will automatically update! 
The data for this homework has been imported as `mtg.feather` under the `data/` directory at the top-level of this repository. 
To ensure your local copy of the repo has the actual data (instead of just the `mtg.feather.dvc` stub-file), you need to run `dvc pull`:

In [None]:
!dvc pull

Then you may load the data into your notebooks and scripts e.g. using pandas+pyarrow:

In [None]:
import pandas as pd
(pd.read_feather('../../data/mtg.feather')  # <-- will need to change for your notebook location
 .head()[['name', 'text', 'mana_cost', 'flavor_text', 'release_date', 'edhrec_rank']]
)

But that's not all --- at the end of this homework, we will be able to run a `dvc repro` command and all of our main models and results will be made available for your _notebook_ to open and display. 

### Submission Structure
You will need to submit a pull-request on DagsHub with the following additions: 

- your subfolder, e.g. named with your user id, inside the `homework/hw2-goals-approaches/` folder
    - your "lab notebook", as an **`.ipynb` or `.md`** (e.g. jupytext), that will be exported to PDF for Canvas submission. **This communicates your _goals_**, along with the results that will be compared to them. 
    - your **`dvc.yaml`** file that will define the inputs and outputs of your _approaches_. See [the DVC documentation](https://dvc.org/doc/user-guide/project-structure/pipelines-files) for information!
    - **source code** and **scripts** that define the preprocessing and prediction `Pipeline`s you wish to create. You may then _print_ the content of those scripts at the end of your notebook e.g. as appendices using 
- any updates to `environment.yml` to add the dependencies you want to use for this homework


## Part 1: Unsupervised Exploration

Investigate the [BERTopic](https://maartengr.github.io/BERTopic/index.html) documentation (linked), and train a model using their library to create a topic model of the `flavor_text` data in the dataset above. 

- In a `topic_model.py`, load the data and train a BERTopic model. You will `save` the model in that script as a new trained model object
- add a "topic-model" stage to your `dvc.yaml` that has `mtg.feather` and `topic_model.py` as dependencies, and your trained model as an output
- load the trained BERTopic model into your notebook and display
    1. the `visualize_topics` interactive plot [see docs](https://maartengr.github.io/BERTopic/api/plotting/topics.html)
    2. Use the plot to come up with working "names" for each major topic, adjusting the _number_ of topics as necessary to make things more useful. 
    3. Once you have names, create a _Dynamic Topic Model_ by following [their documentation](https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html). Use the `release_date` column as timestamps. 
    4. Describe what you see, and any possible issues with the topic models BERTopic has created. **This is the hardest part... interpreting!**

In [1]:
from bertopic import BERTopic

# load model and display 
topic_model = BERTopic.load("my_model")

# load topic model variables
from topic_model import mtg as text
from topic_model import topics, probs


In [None]:
# load data
mtg = pd.read_feather('../../../data/mtg.feather')

In [None]:
topic_model.visualize_barchart()

In [None]:
topic_model.visualize_heatmap()

In [None]:
topic_model.visualize_topics(list(range(1, 20)))

In [None]:
# hierarchical clustering


# what are some working names for each of the topics

In [None]:
# release dates for the cards that have flavor text, as a list of timestamps
release_date = mtg[['release_date', 'flavor_text']].dropna()['release_date'].tolist()

In [None]:
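# create a dynamic topic model using the release_date column as timestamps
# (keyword arguments avoid depending on the positional order, which differs between BERTopic versions)
topics_over_time = topic_model.topics_over_time(docs=text, topics=topics, timestamps=release_date)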

In [None]:
# Describe! All the topics increase in frequency over time, which is not helpful on its own
# next step: normalize frequency within each timestamp

In [None]:
# topics_over_time.groupby('Timestamp')['Frequency'].transform()
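One way to normalize is to turn each raw count into a share of all documents at the same timestamp. The sketch below is a minimal illustration, assuming the frame returned by `topics_over_time` has `Topic`, `Timestamp`, and `Frequency` columns; the toy frame `topics_over_time_demo`, its values, and the `Share` column name are made up for the example.

```python
import pandas as pd

# Toy stand-in for the topics-over-time frame; the counts are invented.
topics_over_time_demo = pd.DataFrame({
    "Topic": [0, 1, 0, 1],
    "Timestamp": pd.to_datetime(
        ["2010-01-01", "2010-01-01", "2020-01-01", "2020-01-01"]
    ),
    "Frequency": [10, 30, 100, 100],
})

# Total count per timestamp, broadcast back to every row of that timestamp.
totals = topics_over_time_demo.groupby("Timestamp")["Frequency"].transform("sum")

# Each topic's share of its own time period; shares within a period sum to 1.
topics_over_time_demo["Share"] = topics_over_time_demo["Frequency"] / totals
print(topics_over_time_demo)
```

With shares instead of counts, a topic only "grows" if it takes up a larger fraction of the flavor text released in that period, not just because more cards were printed.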

## Part 2: Supervised Classification

Using only the `text` and `flavor_text` data, predict the color identity of cards: 

Follow the sklearn documentation covered in class on text data and Pipelines to create a classifier that predicts which of the colors a card is identified as. 
You will need to preprocess the target _`color_identity`_ labels depending on the task: 

- Source code for pipelines
    - in `multiclass.py`, again load data and train a Pipeline that preprocesses the data and trains a multiclass classifier (`LinearSVC`), and saves the model pickle output once trained. Target labels with more than one color should be _unlabeled_! 
    - in `multilabel.py`, do the same, but with a multilabel model (e.g. [here](https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_multilabel.html#sphx-glr-auto-examples-miscellaneous-plot-multilabel-py)). You should now use the original `color_identity` data as-is, with special attention to the multi-color cards. 
- in `dvc.yaml`, add these as stages to take the data and scripts as input, with the trained/saved models as output. 

- in your notebook: 
    - Describe:  preprocessing steps (the tokenization done, the ngram_range, etc.), and why. 
    - load both models and plot the _confusion matrix_ for each model ([see here for the multilabel-specific version](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.multilabel_confusion_matrix.html))
    - Describe: what are the models succeeding at? Where are they struggling? How do you propose addressing these weaknesses next time?
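The two label treatments described above can be sketched as follows. This is only an illustration under stated assumptions: `color_identity` is assumed to be a list of color letters per card, "unlabeled" is interpreted here as leaving multi-color cards out of the multiclass training set, and the toy texts, the `make_pipeline` helper, and the `ngram_range=(1, 2)` setting are hypothetical choices rather than requirements of the assignment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy stand-ins for the card text and `color_identity`; real values come from mtg.feather.
texts = [
    "Flying. Gain 3 life.",
    "Counter target spell.",
    "Destroy target creature.",
    "Deal 3 damage to any target.",
    "Trample. Put a +1/+1 counter on target creature.",
    "Flying. Counter target spell unless its controller pays 2.",
]
color_identity = [["W"], ["U"], ["B"], ["R"], ["G"], ["W", "U"]]


def make_pipeline(classifier):
    """TF-IDF features over unigrams and bigrams, followed by the given classifier."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
        ("clf", classifier),
    ])


# Multiclass: keep only single-color cards; multi-color cards are treated as unlabeled.
single = [i for i, colors in enumerate(color_identity) if len(colors) == 1]
X_mc = [texts[i] for i in single]
y_mc = [color_identity[i][0] for i in single]
multiclass_model = make_pipeline(LinearSVC()).fit(X_mc, y_mc)

# Multilabel: one binary indicator column per color, one binary classifier per column.
mlb = MultiLabelBinarizer()
Y_ml = mlb.fit_transform(color_identity)
multilabel_model = make_pipeline(OneVsRestClassifier(LinearSVC())).fit(texts, Y_ml)

print(mlb.classes_)
print(multilabel_model.predict(texts).shape)
```

Sharing one `make_pipeline` helper keeps the preprocessing identical across both models, so differences in the results come from the label treatment rather than from the features.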




Describe the pre-processing steps:


In [None]:
# plot confusion matrices
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix, ConfusionMatrixDisplay

In [None]:
from multiclass import X_test as X_mc, y_test as y_mc
from multilabel import X_test as X_ml, y_test as y_ml

In [None]:
import pickle

# multiclass model (assumed to be the one saved as mtg_classifier.pkl)
with open("mtg_classifier.pkl", 'rb') as file:
    mc_classifier = pickle.load(file)

# multilabel model
with open("mtg_classifier_multilabel.pkl", 'rb') as file:
    ml_classifier = pickle.load(file)

In [None]:
# get predictions
y_mc_pred = mc_classifier.predict(X_mc)
y_ml_pred = ml_classifier.predict(X_ml)

In [None]:
# confusion matrices
print("Multiclass confusion matrix:")
print(confusion_matrix(y_mc, y_mc_pred))

print("Multilabel confusion matrix:")
print(multilabel_confusion_matrix(y_ml, y_ml_pred))
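Unlike `confusion_matrix`, `multilabel_confusion_matrix` returns one 2x2 matrix per label rather than a single square matrix. The small example below, with invented indicator arrays, shows that layout and how per-label precision and recall can be read off each slice.

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

# Invented indicator matrices: 4 samples (rows) x 2 labels (columns).
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_pred = np.array([[1, 0], [0, 0], [1, 1], [1, 0]])

# Shape (n_labels, 2, 2); each slice is laid out as [[tn, fp], [fn, tp]].
mcm = multilabel_confusion_matrix(y_true, y_pred)
print(mcm.shape)

for label_index, ((tn, fp), (fn, tp)) in enumerate(mcm):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(f"label {label_index}: precision={precision:.2f}, recall={recall:.2f}")
```

Because the result is three-dimensional, each label's 2x2 slice has to be inspected or plotted on its own.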


In [None]:
# plot multiclass confusion matrix
ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_mc, y_mc_pred)).plot()

In [None]:
# multilabel confusion matrix plot: the result holds one 2x2 matrix per label, so plot each separately
for label_cm in multilabel_confusion_matrix(y_ml, y_ml_pred):
    ConfusionMatrixDisplay(confusion_matrix=label_cm).plot()

In [None]:
# How are they doing? Let's look at scores!

# multilabel
from multilabel import f1, precision, recall
print(f1)
print(precision)
print(recall)

The multilabel model is really good at precision, so-so at recall, and not bad overall.

In [None]:
# multiclass
from multiclass import f1, precision, recall
print(f1)
print(precision)
print(recall)

The multiclass model is pretty good all around: balanced between precision and recall.

## Part 3: Regression?

> Can we predict the EDHREC "rank" of the card using the data we have available? 

- Like above, add a script and dvc stage to create and train your model
- in the notebook, aside from your descriptions, plot the `predicted` vs. `actual` rank, with a 45-deg line showing what "perfect prediction" should look like. 
- This is a freeform part, so think about the big picture and keep track of your decisions: 
    - what model did you choose? Why? 
    - What data did you use from the original dataset? How did you preprocess it? 
    - Can we see the importance of those features? e.g. logistic weights? 
    
How did you do? What would you like to try if you had more time? 


In [None]:
# load the trained regression model
import matplotlib.pyplot as plt
with open("regression.pkl", 'rb') as file:
    clf = pickle.load(file)


In [None]:
from regression import y_test, X_test 

#predict the test set
y_pred = clf.predict(X_test)

fig, ax = plt.subplots()
ax.scatter(y_test, y_pred, edgecolors=(0, 0, 0))
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], "k--", lw=4)
ax.set_xlabel("Measured")
ax.set_ylabel("Predicted")
plt.show()

This fit is garbage; maybe try lasso next.

#### what regression model did I choose?
Linear regression, in part because it's easily interpretable.

#### what data do we use and how did I process it? 

first selection based just off what sounds important: 
mtg = pd.read_feather('../../../data/mtg.feather')[['edhrec_rank', 'color_identity', 'converted_mana_cost', 'power', 'toughness', 'rarity', 'subtypes', \
'supertypes', 'types', 'text', 'flavor_text', 'life', 'block']]

- Lots of NAs in `power`, `toughness`, and `life`, so I got rid of those.
- I thought we could make categories for `supertypes` and `subtypes`, but there were so many that it didn't really make sense to do so, so I got rid of those too.
- Turned `rarity` into a bunch of dummies, as well as `block` and `types` (33 possible).
- 32 possible color combos, so made them dummies too.
- This is relying on a lot of dummies.

Combined `flavor_text` and `text` into one column, vectorized it into a TF-IDF document-term matrix, and appended that to the dataframe.

Train/test split of 80/20.

In [None]:
# feature importance!
# pair each coefficient with its feature name, then sort in descending order
import numpy as np
importance = pd.Series(np.ravel(clf.coef_), index=X_test.columns).sort_values(ascending=False)

In [None]:
X_test.columns

In [None]:
feature_importance = importance.rename_axis('feature').reset_index(name='coefficient')
print(feature_importance)

In [None]:
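# summarize feature importance: ten largest coefficients, paired with their feature names
coefs = pd.Series(np.ravel(clf.coef_), index=X_test.columns).sort_values(ascending=False)
for name, v in coefs.head(10).items():
    print('Feature: %s, Score: %.5f' % (name, v))
# plot feature importance
plt.bar(range(len(coefs)), coefs.values)
plt.xlabel('Features')
plt.show()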