In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# DnD 5e Monsters
The first time I played DnD was when 5e came out, and I have been hooked on role-playing games ever since. I have also been hooked on data science. There is something very satisfying about finding a data set and discovering **something** about what is there. Whether it's during EDA or while writing predictive models, data is just _fun_.

## Getting Started
One of the first things I like to do when building a relationship with a new data set is to just call `info`. We can glean quite a bit of information from this alone:

* we have missing data
* lots of columns and samples
* mix of categorical and quantitative

In [None]:
df = pd.read_csv('/kaggle/input/dnd-5e-monsters/dnd_monsters.csv')
df.info()
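
The three observations above can also be checked as numbers instead of being read off the `info` printout. This is a minimal sketch on a made-up three-row frame; the rows and the `name` column are assumptions for illustration, not values from the real file.

```python
import pandas as pd

# Hypothetical stand-in for the monsters table (not real data).
toy = pd.DataFrame({
    'name': ['goblin', 'lich', 'ooze'],
    'hp': [7, 135, None],
    'legendary': [None, 'Legendary', None],
})

# Missing values per column.
missing = toy.isna().sum()

# Rough split into quantitative and categorical columns by dtype.
kinds = toy.dtypes.map(
    lambda d: 'quantitative' if pd.api.types.is_numeric_dtype(d) else 'categorical'
)

print(missing)
print(kinds)
```

On the real data, the same two expressions applied to `df` give a per-column view of what `df.info()` summarizes.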

## Categorical
There are a few different places we could start, but for this data set I want to begin with some of the categorical features. First let's examine size, which describes how much space a creature occupies. Each size controls a square space with the following side length in feet.

* tiny - 2.5
* small - 5
* medium - 5
* large - 10
* huge - 15
* gargantuan - 20 (or larger)

From this it seems like most creatures are medium or large. This makes sense from a gameplay perspective, as the need for incredibly massive or minuscule creatures is not as common. In a machine learning problem we would probably be worried about a possible *class imbalance* problem if this feature were our output.

In [None]:
results = df.loc[:, 'size'].value_counts()
# Pass x and y by keyword; recent seaborn releases reject positional data arguments
sns.barplot(x=results.values, y=results.index, orient='h', color='#84a9ac')
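
To put a number on the class imbalance concern, the counts can be turned into shares. This is a sketch on a hypothetical list of sizes; the values and their capitalization are assumptions, not taken from the data set.

```python
import pandas as pd

# Hypothetical size labels (not real data).
sizes = pd.Series(['Medium', 'Medium', 'Large', 'Medium',
                   'Tiny', 'Large', 'Huge', 'Medium'])

# Fraction of samples in each class, largest first.
share = sizes.value_counts(normalize=True)

# Ratio between the most and least common class as a crude imbalance measure.
imbalance_ratio = share.max() / share.min()

print(share)
print(imbalance_ratio)
```

The same call on `df['size']` shows how lopsided the real classes are.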

Next up, let's examine the monster type, which describes a creature's fundamental nature. This is typically useful if you have a spell that affects a type. For example, you could have a spell that affects all *undead* creatures. I have a feeling this one will be quite large though, so let's check how many unique values we have first.

In [None]:
results = df.loc[:, 'type'].value_counts()
len(results)

Yikes, too much for a bar graph. That number seems very high to me though, so let's look at a random sample and see what the data looks like here.

In [None]:
results.sample(n=20)

Ah, it looks like some types are broken up into a large number of subgroups. This is where I would probably use a tool like OpenRefine to try and combine these, but let's try to do it on our own. Since all of the combo types have " (" in them, we can just split on that and take the first element returned. This should not have any effect on types that do not contain the token.

In [None]:
df['type'] = df['type'].apply(lambda x: x.split(' (')[0])
results = df.loc[:, 'type'].value_counts()
# Pass x and y by keyword; recent seaborn releases reject positional data arguments
sns.barplot(x=results.values, y=results.index, orient='h', color='#84a9ac')
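
One caveat with the `apply` plus `lambda` approach is that it raises on missing values, because a float `NaN` has no `split` method. The sketch below shows a variant using `Series.str.partition`, which always treats the separator as a literal string and passes missing values through. The type strings are hypothetical examples, not rows from the file.

```python
import pandas as pd

# Hypothetical type strings, including one missing value (not real data).
types = pd.Series(['humanoid (elf)', 'beast', 'fiend (demon)', None])

# partition splits at the first ' (' and returns three columns:
# text before, the separator, text after. Column 0 is the base type.
base = types.str.partition(' (')[0]

print(base)
```

Strings without the " (" token come back unchanged, which matches the behavior described above.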

So most creatures are medium or large humanoids, beasts, or fiends. This is anecdotal of course, but it feels like an accurate representation of what I come across in many games. The poor oozes are underrepresented here, which could be a good opportunity for an aspiring Dungeon Master to make a cool ooze-themed dungeon. What do we have in the ooze family?

In [None]:
df.loc[df['type']=='ooze', :]

## Quantitative
OK, let's take a look at some numerical features and see if there is anything interesting. One of the first things I like to examine is a correlation matrix. In machine learning problems this is helpful because it gives a good idea of whether any of the input features will be useful predictors or whether any of them are collinear. In this case I just want to see if any of the features have a strong relationship that we could dive into. So what do we see?

* Obviously the categorical features are missing, perhaps we should encode some of them?
* I thought hp and ac would be more correlated, this indicates that the HP and AC are not growing at the same rate (high HP, low AC creatures possibly?)
* Dexterity is poorly correlated with everything
* The highest correlation is intelligence and charisma (self examination finds this to be true)

In [None]:
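# Restrict to numeric columns; pandas 2.x raises if string columns are passed to corr()
sns.heatmap(df.select_dtypes('number').corr(), annot=True)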
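
Reading the strongest relationship off a heatmap gets harder as the number of columns grows, so it can help to rank the pairs directly. This is a sketch on a made-up frame; the column names `int` and `cha` and all of the values are assumptions for illustration only.

```python
from itertools import combinations

import pandas as pd

# Hypothetical stats (not real data); 'cha' is built to track 'int' exactly.
toy = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd', 'e'],
    'hp': [10, 20, 30, 40, 50],
    'ac': [12, 11, 15, 13, 16],
    'int': [3, 8, 6, 14, 18],
    'cha': [4, 9, 7, 15, 19],
})

# Correlation over numeric columns only; 'name' is dropped.
corr = toy.select_dtypes('number').corr()

# One entry per unordered column pair, sorted from strongest positive down.
pairs = pd.Series(
    {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
).sort_values(ascending=False)

print(pairs)
```

Sorting by `pairs.abs()` instead would surface strong negative relationships as well.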

OK, let's dig into the HP/AC relationship. There is another feature that might be connected: 'legendary'. Right now this feature is mostly null, and the only non-null value is a string marking the creature as legendary, so let's fill the remaining entries with 'Normal' so we can add it to our analysis. It makes sense that there would be a relationship here, since HP and AC should both increase as creatures get stronger. The interesting thing, however, is that the high-HP, low-AC region of the plot is almost empty, so it seems that there are very few creatures with large HP and low AC.

In [None]:
df['legendary'] = df['legendary'].fillna('Normal')
# Pass columns by keyword; recent seaborn releases reject positional data arguments
sns.scatterplot(x='hp', y='ac', hue='legendary', data=df)
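
The "few creatures with large HP and low AC" observation can be checked by counting rather than eyeballing the scatter plot. This is a sketch only: the rows are made up, and the cut-offs `hp_cut` and `ac_cut` are hypothetical thresholds chosen for illustration, not values from the game rules or the data.

```python
import pandas as pd

# Hypothetical creatures (not real data).
toy = pd.DataFrame({
    'hp': [7, 45, 200, 300, 250],
    'ac': [15, 13, 19, 22, 9],
    'legendary': [None, None, 'Legendary', 'Legendary', None],
})
toy['legendary'] = toy['legendary'].fillna('Normal')

# Assumed thresholds for "large HP" and "low AC".
hp_cut, ac_cut = 150, 12
tanky_soft = toy[(toy['hp'] >= hp_cut) & (toy['ac'] <= ac_cut)]

# Typical HP and AC for each group.
summary = toy.groupby('legendary')[['hp', 'ac']].median()

print(len(tanky_soft))
print(summary)
```

Running the same filter on `df` with thresholds that suit the real distributions gives an actual count for that region of the plot.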

I thought it would be interesting to see the count of creatures with respect to alignment and type. A good way to display this is a heatmap after pivoting the dataframe. This confirms some of what we have seen before, specifically that humanoids, beasts, fiends, and monstrosities are rather common. The new insight here is that creatures are typically evil, which makes sense since they are the antagonists of the players. Another interesting thing this graph uncovers is that beasts are typically unaligned.

In [None]:
results = pd.pivot_table(df, values='cr', index='align', columns='type', aggfunc='count', fill_value=0)
sns.heatmap(results)
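
An alternative way to build the same kind of count table is `pd.crosstab`, which counts rows directly instead of counting non-null values of a third column such as `cr`. The sketch below uses made-up alignment and type labels, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical alignment/type pairs (not real data).
toy = pd.DataFrame({
    'align': ['lawful evil', 'chaotic evil', 'unaligned',
              'unaligned', 'chaotic evil'],
    'type': ['fiend', 'fiend', 'beast', 'beast', 'humanoid'],
})

# Rows are alignments, columns are types, cells are row counts.
counts = pd.crosstab(toy['align'], toy['type'])

print(counts)
```

If `cr` has missing values, the two approaches can differ, because `aggfunc='count'` skips rows where the counted column is null.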

We've looked at the HP and AC relationship, so now let's see if either of these features has any interesting interactions with the categorical data. Box plots, violin plots, and raincloud plots are all pretty good for this type of analysis. It seems that HP does not vary much with the type of the creature, with the exception of dragons. From my domain experience of D&D I would also guess that there are few low-level dragons. The size of the creature appears to have a much larger effect on HP.

In [None]:
fig, ax = plt.subplots(1,2, figsize=(16,8))

# Pass x and y by keyword; recent seaborn releases reject positional data arguments
sns.boxplot(x='hp', y='type', data=df, ax=ax[0], color='#84a9ac')
sns.boxplot(x='hp', y='size', data=df, ax=ax[1], color='#84a9ac')

ax[0].axvline(x=df.hp.mean(), ymin=0, ymax=1, linestyle=':', color='black')
ax[1].axvline(x=df.hp.mean(), ymin=0, ymax=1, linestyle=':', color='black')

ax[0].text(x=df.hp.mean()+5, y=14, s='HP Mean')
ax[1].text(x=df.hp.mean()+5, y=5, s='HP Mean')

ax[0].set_title('Type')
ax[1].set_title('Size')
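
The size effect is easier to judge when the sizes are in their natural order rather than in order of appearance. One way to do that, sketched here on made-up rows, is an ordered categorical followed by a grouped median. The capitalized size labels are an assumption; the real column may spell them differently.

```python
import pandas as pd

# Hypothetical creatures (not real data).
toy = pd.DataFrame({
    'size': ['Tiny', 'Huge', 'Medium', 'Medium', 'Large', 'Huge'],
    'hp': [5, 150, 30, 40, 80, 170],
})

# Natural ordering of the size categories, smallest to largest.
order = ['Tiny', 'Small', 'Medium', 'Large', 'Huge', 'Gargantuan']
toy['size'] = pd.Categorical(toy['size'], categories=order, ordered=True)

# Median HP per size, listed in size order; unobserved sizes are skipped.
medians = toy.groupby('size', observed=True)['hp'].median()

print(medians)
```

The same `order` list can be passed to the `order` parameter of `sns.boxplot` so the boxes follow the size ordering.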

# OK
That was fun! Some of that hit my confirmation bias pretty hard, but it was interesting to see how some of these features were distributed or interacted with each other. Some interesting takeaways:

* Size seems to have a bigger effect on HP than the type of creature
* Most creatures are EVIL (except beasts who are just doing their own thing)
* Some characteristics are intuitively connected (strength and constitution, intelligence and charisma)
* Most creatures are medium sized humanoids
* Poor oozes need some love