# Exploratory Data Analysis with Hearthstone Standard Cards

Let's see what kind of fun information we can gather in order to practice our basic data manipulation and evaluation skills.

Let's start by importing the necessary modules:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None) # show all DF columns

Great! Now let's read in the card data and take a look at the first few rows:

In [None]:
cards = pd.read_csv('../input/hearthstone-standard-cards/hearthstone_standard_cards.csv')
cards.head()

Each row holds quite a bit of information about a single card!

Taking a first look at the data, there seem to be lots of different data types, as well as some NaN values.

### Let's now inspect the data as a whole to find out a bit more:

In [None]:
cards.info()

In [None]:
cards.describe()

In [None]:
cards.nunique()

With the information above, we now have a fuller picture of how the data looks as a whole. A couple of quick takeaways:

- The current total number of Hearthstone standard cards is 810.

- Columns regarding the card's text and image are almost entirely unique.
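Since the first look at the data showed some NaN values, it can help to count them per column. The snippet below is a minimal sketch on a small made-up DataFrame (the `toy` frame and its values are assumptions, not the real card data); the same two expressions should work on `cards`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the card data; the values are invented for illustration.
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'attack': [2.0, np.nan, 5.0, np.nan],
    'health': [3.0, np.nan, 5.0, 4.0],
    'text': ['Taunt', 'Deal 3 damage', None, 'Rush'],
})

# Number and share of missing values per column, largest first.
missing = pd.DataFrame({
    'n_missing': toy.isna().sum(),
    'pct_missing': toy.isna().mean() * 100,
}).sort_values('n_missing', ascending=False)

print(missing)
```

Columns such as `attack` and `health` would be expected to be empty for cards that are not minions, so a high share of missing values there is not necessarily a data quality problem.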

### Now let's take a look at how some of the numeric data correlates:

In [None]:
# show correlation between columns

fig, ax = plt.subplots(figsize=(10, 8))
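# numeric_only=True (pandas >= 1.5) restricts the correlation to numeric columns;
# recent pandas versions raise an error on text columns instead of dropping them
sns.heatmap(data=cards.corr(numeric_only=True), ax=ax, cmap='twilight_shifted', annot=True)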

# fix x ticks
ax.tick_params(top=True, bottom=False, labeltop=True, labelbottom=False)
plt.setp(ax.get_xticklabels(), rotation=-30, ha="right", rotation_mode="anchor")
# ax.get_ylabel() returns a plain string; use the label's Text object instead
plt.setp(ax.yaxis.get_label(), rotation=-90)

# set colorbar
cbar = ax.collections[0].colorbar
cbar.set_label('Correlation between columns in card data', rotation=-90, labelpad=20)

plt.show()
plt.close()

Some interesting correlations arise looking at the heatmap above.

- `manaCost` seems to be closely related to both `attack` and `health`.
- `durability` correlates most strongly with `attack`.
- `health` and `attack` are related.
- `armor` and `collectible` are completely white, meaning seaborn received no usable value for those cells. The most likely cause is that the correlation is undefined (NaN): this happens when a column never varies, or when it has too few non-null values overlapping with the other column. This does not seem to hurt our ability to analyze the data: `armor` seems to only have 6 total values, and we wouldn't expect the binary `collectible` column to correlate with any of the other columns anyway.
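One common reason for blank cells in a correlation heatmap is that the correlation itself is NaN. The sketch below uses a made-up DataFrame (column values are assumptions chosen to mimic a constant `collectible` column and a nearly empty `armor` column) to show how `DataFrame.corr` behaves in those two cases. Whether this is exactly what happens in the real card data would need to be checked against `cards`.

```python
import numpy as np
import pandas as pd

# Invented data: 'collectible' never varies, 'armor' has a single non-null value.
toy = pd.DataFrame({
    'attack': [1, 2, 3, 4],
    'health': [2, 3, 5, 4],
    'collectible': [1, 1, 1, 1],
    'armor': [np.nan, np.nan, np.nan, 5.0],
})

corr = toy.corr()
print(corr)

# Columns whose correlations are all undefined show up as blank heatmap cells.
undefined = corr.columns[corr.isna().all()].tolist()
print(undefined)
```

A zero-variance column makes the denominator of the Pearson correlation zero, so every entry in its row and column comes out as NaN, including the diagonal.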

### Now let's explore some of the distributions in the card data:

In [None]:
# distribution of card classes

class_names = pd.read_csv('../input/hearthstone-standard-cards/metadata/classes.csv')
classes = cards.groupby('classId').count()
# add in multiclass cards that weren't counted in classId
for row in cards['multiClassIds']:
    if len(row) > 2:
        row = row.strip('[]')
        row = row.replace(' ', '')
        row = row.split(',')
        classes.loc[int(row[1]), 'id'] += 1
classes = classes['id'].reset_index()
merged = pd.merge(class_names, classes.rename(columns={'id': 'count', 'classId': 'id'}))

palette = ['#034732ff', '#a68653', '#008148ff', '#4e71f2', '#ebc73b', '#c9e5f5', '#5c5558', '#27159e', '#812ab8', '#9e2a2bff', '#c7c7c7']
merged.plot.pie(y='count',
                figsize=(7.5, 7.5),
                labels=merged['name'],
                colors=palette,
                autopct='%0.2f%%',
                ylabel='',
                legend=None,
                title='Distribution of card classes')
plt.show()
plt.close()

In Hearthstone, there are 11 classes, each represented as a slice in the pie chart above. Neutral cards may be used by any class, and thus occupy the largest portion of the data.

- Demon Hunter has the most cards by more than a whole percentage point!
- Warlock has the fewest cards, but trails 4 other classes by only 0.12 percentage points.
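The class counts above depend on parsing the `multiClassIds` strings by hand. As a hedged alternative, the sketch below parses them with `ast.literal_eval` and adds the second listed class to the per-class totals, mirroring what the loop does. The `toy` frame, its class ids, and the helper `second_class` are made up for illustration, and the sketch assumes every `multiClassIds` value is a string such as `'[]'` or `'[2, 3]'`.

```python
import ast
import pandas as pd

# Invented rows; class ids here are placeholders, not real Hearthstone ids.
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'classId': [2, 2, 3, 12],
    'multiClassIds': ['[]', '[2, 3]', '[]', '[]'],
})

def second_class(raw):
    """Return the second class id in a string like '[2, 3]', or None."""
    ids = ast.literal_eval(raw)
    return ids[1] if len(ids) > 1 else None

# Cards counted under their primary class.
counts = toy['classId'].value_counts()

# Extra counts for the second class of each multiclass card.
extra = toy['multiClassIds'].map(second_class).dropna().astype(int).value_counts()

totals = counts.add(extra, fill_value=0).astype(int).sort_index()
print(totals)
```

Like the original loop, this assumes the first entry of `multiClassIds` duplicates `classId`, so only the second entry needs to be added.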

In [None]:
# distribution of card rarity

rarity_names = pd.read_csv('../input/hearthstone-standard-cards/metadata/rarities.csv')
rarities = cards.groupby('rarityId').count()
rarities = rarities['id'].reset_index()
merged = pd.merge(rarity_names, rarities.rename(columns={'id': 'count', 'rarityId': 'id'}))

palette = ['#c7c7c7', '#0033ff', '#8d32e3', '#f7a814']
merged.plot.pie(y='count',
                figsize=(7.5, 7.5),
                labels=merged['name'],
                colors=palette,
                autopct='%0.2f%%',
                ylabel='',
                legend=None,
                title='Distribution of card rarities')

plt.show()
plt.close()

A card's rarity in Hearthstone indicates its value. Cards may be deconstructed for "dust" relative to their rarity to use in the creation of other cards. Each deck (always consisting of 30 cards) may include up to 2 copies of any common, free, rare, or epic card, and only 1 copy of any legendary card.

- Common is the largest rarity group.
- There are almost equal numbers of epic and legendary cards.
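The percentages printed on the pie charts are simply each group's count divided by the total. If only the numbers are needed, `value_counts(normalize=True)` gives them directly. This is a minimal sketch on invented data; the `rarityId` values below are placeholders rather than the dataset's actual ids.

```python
import pandas as pd

# Invented rows: eight cards spread over four placeholder rarity ids.
toy = pd.DataFrame({
    'id': range(1, 9),
    'rarityId': [1, 1, 1, 1, 3, 3, 4, 5],
})

# Share of cards per rarity, in percent, ordered by rarity id.
share = (toy['rarityId']
         .value_counts(normalize=True)
         .mul(100)
         .round(2)
         .sort_index())
print(share)
```

The result could then be merged with the rarity names in the same way the counts are merged with `rarity_names` above.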

In [None]:
# distribution of minion types

miniontype_names = pd.read_csv('../input/hearthstone-standard-cards/metadata/minionTypes.csv')
miniontypes = cards.groupby('minionTypeId').count()
miniontypes = miniontypes['id'].reset_index()
merged = pd.merge(miniontype_names, miniontypes.rename(columns={'id':'count', 'minionTypeId': 'id'}))

palette = ['#21c437', '#68109e', '#9fa7b5', '#6610f2', '#1d6930', '#8c703b', '#c7d7f0', '#d94214']
merged.plot.pie(y='count',
                figsize=(7.5, 7.5),
                labels=merged['name'],
                autopct='%0.2f%%',
                ylabel='',
                colors=palette,
                legend=None,
                title='Distribution of minion types')

plt.show()
plt.close()

A minion is a playable card in Hearthstone with a mana cost, attack, and health. Sometimes, a minion can have a specific type, like Dragon or Murloc, as illustrated in the pie chart above. There are many minions that do not have a type. I decided to leave these out of this pie graph, because they clog up the diagram and do not add very much useful information.

- Minions with type All make up only 0.51% of all minions with a type. That's because there's only one: Circus Amalgam.
- Apart from All, Totems have the lowest representation by a wide margin: only 2.03% of all minions with a type!
- Pirates and Quilboar also have relatively low representation.
- Beasts and Demons tie for the most representation of minion types at 22.34% each.

In [None]:
# distribution of card health

health = cards.groupby('health').count()
health = health['id'].reset_index()
health = health.rename(columns={'id':'count'})
health = health.astype('int64')

fig, ax = plt.subplots(figsize=(10, 6))

sns.barplot(data=health, x='health', y='count', ax=ax, color='red')
ax.set_title('Distribution of card health')
plt.show()
plt.close()

In [None]:
# distribution of card attacks

attacks = cards.groupby('attack').count()
attacks = attacks['id'].reset_index()
attacks = attacks.rename(columns={'id':'count'})
attacks = attacks.astype('int64')

fig, ax = plt.subplots(figsize=(10, 6))

sns.barplot(data=attacks, x='attack', y='count', ax=ax, color='yellow')
ax.set_title('Distribution of card attacks')
plt.show()
plt.close()

In [None]:
# distribution of mana cost

mana = cards.groupby('manaCost').count()
mana = mana['id'].reset_index()
mana = mana.rename(columns={'id':'count'})

fig, ax = plt.subplots(figsize=(10, 6))

sns.barplot(data=mana, x='manaCost', y='count', ax=ax, color='blue')
ax.set_title('Distribution of card mana cost')
plt.show()
plt.close()

The three plots above share a common thread in that, as stated above, each minion has a mana cost, attack, and health. An experienced Hearthstone player might assume a priori that these three stats are directly related, which is loosely shown above.

- `health`, `attack`, and `manaCost` have most of their values from 1 to 5, tapering off as values increase.
- `health` has a few unusually high values, which is unique among the three graphs.

### Lastly, let's look at some bubble plot frequency distributions:

In [None]:
# let's compare health and attack
sns.set_style('darkgrid')
fig, ax = plt.subplots(figsize=(10, 6))

health_attack = cards.groupby(['health', 'attack'])['id'].count().reset_index()
health_attack = health_attack.astype('int64')

sns.scatterplot(data=health_attack, x='attack', y='health', size='id', sizes=(20, 4000), ax=ax, alpha=0.4, legend=False)
ax.set_title('Frequency distribution of cards for each attack/health combination')

plt.show()
plt.close()

As stated previously, one might assume that mana cost, attack, and health are directly related. In the bubble plot above, each bubble's size is proportional to the number of cards with that attack/health combination. The bigger the bubble, the more cards share those stats.

- There seems to be a linear relationship between attack and health.
- Mana cost is not encoded in this plot; its relationship with a card's stats is suggested by the correlation heatmap earlier, where `manaCost` tracks both `attack` and `health`.
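The scatterplot code above sizes each bubble by the number of cards (the count of `id`) per attack/health combination. To look at mana cost per combination instead, one option is to aggregate `manaCost` with a mean alongside the count. The sketch below does this on invented data; the `toy` frame and the output column names `count` and `mean_cost` are assumptions for illustration.

```python
import pandas as pd

# Invented minions: two cards for each attack/health combination.
toy = pd.DataFrame({
    'attack':   [1, 1, 2, 2, 5, 5],
    'health':   [1, 1, 3, 3, 5, 5],
    'manaCost': [1, 2, 2, 3, 5, 6],
})

# Count of cards and average mana cost for each attack/health pair.
stats = (toy.groupby(['attack', 'health'])['manaCost']
            .agg(count='count', mean_cost='mean')
            .reset_index())
print(stats)
```

Passing `size='mean_cost'` (or `hue='mean_cost'`) to `sns.scatterplot` on a frame like `stats` would then encode average mana cost rather than frequency.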

In [None]:
# mana cost distribution by class

sns.set_style('white')
fig, ax = plt.subplots(figsize=(10, 6))

class_mana_count = cards.groupby(['classId', 'manaCost'])['id'].count().reset_index()
merged = pd.merge(class_names, class_mana_count.rename(columns={'id':'count', 'classId':'id'}))

palette = ['#034732ff', '#a68653', '#008148ff', '#4e71f2', '#ebc73b', '#c9e5f5', '#5c5558', '#27159e', '#812ab8', '#9e2a2bff', '#c7c7c7']
sns.scatterplot(data=merged, x='name', y='manaCost', size='count', sizes=(20, 4000), alpha=0.6, legend=False, ax=ax, hue='name', palette=palette)
ax.set_yticks(range(11))
plt.xticks(rotation=45)
ax.set_xlabel('class')
ax.set_ylabel('mana cost')
ax.set_title('Frequency distribution of card mana cost per class')

plt.show()
plt.close()

It also is interesting to look at the frequency distribution of the mana cost for each class.

- Paladin appears to have the most even distribution of mana costs of any class.
- Neutral contains far more 10-cost cards than any class.
- Rogue and Druid have the most 0-cost cards.
- Rogue has no cards that cost 8 or more.
- Many of the class cards fall in the 2-to-4-cost range.
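Observations like the ones above are read off bubble sizes, which can be hard to judge by eye. A count table makes them easier to verify. The following is a minimal sketch using `pd.crosstab` on invented rows; the `toy` frame and its `class_name` column are assumptions standing in for the merged class names and mana costs.

```python
import pandas as pd

# Invented rows standing in for (class name, mana cost) pairs.
toy = pd.DataFrame({
    'class_name': ['Rogue', 'Rogue', 'Rogue', 'Druid', 'Druid',
                   'Neutral', 'Neutral', 'Neutral'],
    'manaCost':   [0, 2, 2, 0, 8, 10, 10, 3],
})

# Rows are classes, columns are mana costs, cells are card counts.
table = pd.crosstab(toy['class_name'], toy['manaCost'])
print(table)

# Highest mana cost present in each class.
max_cost = toy.groupby('class_name')['manaCost'].max()
print(max_cost)
```

On the real data, a zero in a class's row would confirm that the class has no cards at that cost.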

## Conclusion

In this notebook we've taken a first look at the Hearthstone Standard Cards dataset. We got an overall picture of the data, looked at correlations, and explored a number of distributions across categories.

I hope you've found this notebook useful. Please feel free to reference it; any feedback is welcome!