# Beginner: Superhero EDA

**Objective of this Notebook**

1. Gain insight into both datasets
2. Practice preparing data
3. Test out the Seaborn module
4. Practice alternative methods of producing graphs

# Interesting Findings

* Marvel Comics and DC Comics are the main contributors to the superhero dataset.
* It also appears that 70% of superheroes are male, while 27% are female.
* Regarding alignment: 65% of heroes are good, 29% are bad, and 4% are neutral.
* The hero with the most powers is Spectre, with 49.
* 55% of the hero roster has the super strength ability, so being strong doesn't seem that amazing anymore.
* 18% of the hero roster has the invulnerability ability, which seems really overpowered.
* Godzilla is considered a hero.

# Personal Takeaways

* I shouldn't have worked on a combined dataset unless I needed metrics that required both datasets. I lost some rows in the merge.
* I should organize the visualization code more consistently. I kept creating and overwriting dataframes, so whenever I made an error or wanted to backtrack I had to recreate a dataframe.
* I should put my quantitative data on a common scale, because the outliers made my boxplots hard to read. If I had not limited the y-axis range, the boxes would have been much smaller.
* Learning how to weight a variable would be valuable for future datasets, since some abilities are better than others.

# Data Preparation

**Importing Modules and Loading Data**

In [None]:

import math
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

powers = pd.read_csv('../input/superhero-set/super_hero_powers.csv')
heroes = pd.read_csv('../input/superhero-set/heroes_information.csv')

**Inspecting the Data**

Right away, I noticed that the 'Skin color' variable was missing some values, so after looking at the powers dataset I decided to inspect each variable.

In [None]:
heroes.head()

In [None]:
powers.head()

I dropped the 'Unnamed: 0' column because it was a redundant copy of the index.

In [None]:
heroes= heroes.drop('Unnamed: 0', axis=1)
heroes.info()

In [None]:
heroes['Alignment'].unique()

In [None]:
heroes['Publisher'].unique()


I relabeled every value in the 'Publisher' column as 'Other' if it was neither Marvel Comics nor DC Comics.
I assumed that there wouldn't be many heroes from these other publishers, and some of them are not major competitors in the comic book industry.
One example is South Park, a cartoon show that I watch.


In [None]:
heroes.loc[(heroes['Publisher'] != 'Marvel Comics') & (heroes['Publisher'] != 'DC Comics'),'Publisher'] = 'Other'
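
To see what this relabeling catches, here is a minimal sketch on a made-up column (the values are illustrative, not taken from the real dataset). It uses an `isin` mask, which should be equivalent to the two `!=` comparisons, and shows that missing values and the '-' placeholder also end up as 'Other'.

```python
import numpy as np
import pandas as pd

# Toy data standing in for heroes['Publisher']; the values are illustrative only.
toy = pd.DataFrame(
    {'Publisher': ['Marvel Comics', 'DC Comics', 'South Park', '-', np.nan]}
)

# True for the two publishers that keep their own label.
is_major = toy['Publisher'].isin(['Marvel Comics', 'DC Comics'])

# Everything else, including '-' and NaN, is relabeled as 'Other'.
toy.loc[~is_major, 'Publisher'] = 'Other'

publisher_counts = toy['Publisher'].value_counts()
print(publisher_counts)
```

If unknown publishers should stay separate from 'Other', they would need to be labeled before this step rather than after it.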

Here I check if any heroes did not have names.

In [None]:
heroes['name'].unique()


In [None]:
heroes['name'].isna().value_counts()

In [None]:
heroes.loc[heroes['name'] == '-']

In [None]:
heroes['Gender'].unique()

In [None]:
heroes['Eye color'].unique()

In [None]:
heroes['Race'].unique()

In [None]:
heroes['Skin color'].unique()

In [None]:
heroes['Skin color'].isnull().value_counts()

I noticed that some heroes had a height of -99.0, which is probably a placeholder for a missing value.
Because of this, I assumed that weight would have some -99.0 values as well.


In [None]:
heroes['Height'].value_counts()

In [None]:
heroes['Weight'].value_counts()

In [None]:
heroes['Hair color'].unique()

In [None]:
heroes.loc[heroes['Gender']=='-']
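
Instead of checking one column at a time, the three kinds of gaps seen above ('-' strings, NaN, and -99.0) can be counted for every column in one pass. This is a rough sketch on a toy frame with hypothetical hero names and values, not the real data.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kinds of gaps seen above; names and values are made up.
toy = pd.DataFrame({
    'name': ['Hero A', 'Hero B', 'Hero C'],
    'Gender': ['Male', '-', 'Male'],
    'Skin color': ['-', 'blue', np.nan],
    'Height': [203.0, -99.0, 185.0],
})

dash_counts = (toy == '-').sum()   # '-' placeholders per column
nan_counts = toy.isna().sum()      # true missing values per column
# The -99.0 sentinel only makes sense for numeric columns.
sentinel_counts = (toy.select_dtypes('number') == -99.0).sum()

# Columns without a sentinel count get 0 instead of NaN.
summary = pd.DataFrame(
    {'dash': dash_counts, 'nan': nan_counts, 'sentinel': sentinel_counts}
).fillna(0).astype(int)
print(summary)
```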

I decided to replace the placeholder values in most of the columns.

I made an error here by setting the -99.0 values to 'Unknown' for Height and Weight.
Mixing strings into these columns changes their type from float, so I have to filter the 'Unknown' values out again later on.

In [None]:
heroes.loc[heroes['Gender'] == '-','Gender'] = 'Unknown'
heroes.loc[heroes['Eye color'] == '-','Eye color'] = 'Unknown'
heroes.loc[heroes['Hair color'] == '-','Hair color'] = 'Unknown'
heroes.loc[heroes['Hair color'] == 'Brownn','Hair color'] = 'Brown'
heroes.loc[heroes['Hair color'] == 'black','Hair color'] = 'Black'
heroes.loc[heroes['Skin color'] == '-','Skin color'] = 'Unknown'
heroes.loc[heroes['Alignment'] == '-','Alignment'] = 'Unknown'
heroes.loc[heroes['Race'] == '-','Race'] = 'Unknown'
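# Note: the earlier relabeling already turned '-' and NaN publishers into 'Other', so this line matches nothing at this point.
heroes.loc[(heroes['Publisher'] == '-') | heroes['Publisher'].isna(),'Publisher'] = 'Unknown'
# Writing a string into a float column changes its dtype; cast to object first so the change is explicit.
heroes['Height'] = heroes['Height'].astype(object)
heroes['Weight'] = heroes['Weight'].astype(object)
heroes.loc[heroes['Height'] < 0,'Height'] = 'Unknown'
heroes.loc[heroes['Weight'] < 0,'Weight'] = 'Unknown'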
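
One way to avoid the dtype problem described above would be to mark the -99.0 placeholders as NaN instead of 'Unknown'. The sketch below uses a toy frame with made-up numbers and is only an alternative under that assumption, not what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy measurements; -99.0 plays the role of the missing-value placeholder.
toy = pd.DataFrame({
    'Height': [203.0, -99.0, 185.0],
    'Weight': [441.0, 65.0, -99.0],
})

for col in ['Height', 'Weight']:
    # Keep non-negative values and replace the rest with NaN; the column stays float64.
    toy[col] = toy[col].where(toy[col] >= 0, np.nan)

height_dtype = toy['Height'].dtype
mean_height = toy['Height'].mean()  # NaN values are skipped by default
print(height_dtype, mean_height)
```

With NaN in place, `dropna(subset=['Height'])` would take over the job of the later comparison against 'Unknown'.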


I renamed the heroes dataset's 'name' column to 'hero_names' in order to merge it with the powers dataset.

In [None]:
heroes = heroes.rename(columns={'name': 'hero_names'})

In [None]:
heroes.loc[heroes['Gender'] == 'Unknown']

I combined both datasets here.

I believe this may have been an error as well, because the merge only keeps heroes that appear in both datasets and I lost some rows of data.
I should have worked on both datasets independently.

In [None]:
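# pd.merge defaults to an inner join, so only heroes present in both datasets are kept.
combined = pd.merge(heroes, powers, on='hero_names')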

In [None]:
combined.head()
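
To find out which rows a merge like this drops, an outer join with `indicator=True` can be used as an audit. The sketch below runs on two toy frames with hypothetical hero names and a hypothetical 'Flight' column, so the counts say nothing about the real datasets.

```python
import pandas as pd

# Toy stand-ins for the heroes and powers tables; names and values are made up.
heroes_toy = pd.DataFrame({
    'hero_names': ['Hero A', 'Hero B', 'Hero C'],
    'Gender': ['Male', 'Female', 'Male'],
})
powers_toy = pd.DataFrame({
    'hero_names': ['Hero B', 'Hero C', 'Hero D'],
    'Flight': [True, False, True],
})

# Inner join (the default): only names found in both tables survive.
inner = pd.merge(heroes_toy, powers_toy, on='hero_names')

# Outer join with an indicator column that records where each row came from.
audit = pd.merge(heroes_toy, powers_toy, on='hero_names', how='outer', indicator=True)
lost = audit.loc[audit['_merge'] != 'both', ['hero_names', '_merge']]

print(len(inner), 'rows kept by the inner join')
print(lost)
```

Rows marked `left_only` exist only in the hero table, and rows marked `right_only` exist only in the powers table.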

# Data Visualizations

In [None]:
sns.set_palette("pastel")

The first visualization shows the publishers and the number of superheroes each has created, split by gender.
Each percentage is that bar's share of all superheroes in the dataset, not its share within the publisher.

Marvel has contributed the most to the dataset, with 51% of all superheroes.
Overall, 70% of superheroes are male, while 27% are female.


In [None]:
Publisher_df=combined[['Publisher','Gender']]
plt.figure(figsize=(8,4))
sns.countplot(x='Publisher',data=Publisher_df,hue='Gender')
plt.title('Publishers and the Amount of Superheroes', fontsize=14)

def roundup(x):
    return 50 + int(math.ceil(x / 100.0)) * 100 

total =float(len(Publisher_df))
ax = plt.gca()
y_max = combined['Publisher'].value_counts().max() 
ax.set_ylim([0, roundup(y_max)])

for patch in ax.patches:
    ax.text(patch.get_x() + patch.get_width()/2., patch.get_height(), '{:.0%}'.format(patch.get_height()/total), 
            fontsize=12, color='black', ha='center', va='bottom')
plt.show()
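
The percentages printed on the bars can be cross-checked without plotting. A minimal sketch, assuming a toy frame with made-up rows: `pd.crosstab` with `normalize='all'` divides every publisher and gender cell by the total row count, which is the same denominator the bar labels use.

```python
import pandas as pd

# Toy rows; the real notebook would pass combined['Publisher'] and combined['Gender'].
toy = pd.DataFrame({
    'Publisher': ['Marvel Comics', 'Marvel Comics', 'DC Comics', 'Other'],
    'Gender': ['Male', 'Female', 'Male', 'Male'],
})

# Each cell is the share of all rows with that publisher and gender.
shares = pd.crosstab(toy['Publisher'], toy['Gender'], normalize='all')

gender_share = shares.sum(axis=0)     # share of all heroes per gender
publisher_share = shares.sum(axis=1)  # share of all heroes per publisher

print(shares)
print(gender_share)
print(publisher_share)
```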

The second visualization shows the publishers and the alignment of their superheroes.
Again, each percentage is that bar's share of all superheroes in the dataset.

Overall, all 3 publisher groups mostly create 'good' aligned superheroes. 65% of the superheroes are 'good', 29% are 'bad', and around 4% are 'neutral'.
I was surprised that so few superheroes are neutral.


In [None]:
plt.figure(figsize=(8,4))
sns.countplot(x='Publisher',data=combined,hue='Alignment')
total =float(len(combined))
plt.title('Publishers and the Alignment of Their Superheroes', fontsize=14)


ax = plt.gca()
y_max = combined['Publisher'].value_counts().max() 
ax.set_ylim([0, roundup(y_max)])

for patch in ax.patches:
    ax.text(patch.get_x() + patch.get_width()/2., patch.get_height(), '{:.0%}'.format(patch.get_height()/total), 
            fontsize=12, color='black', ha='center', va='bottom')
plt.show()
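
The labeling loop is the same in every chart in this notebook, so it could live in one helper. The sketch below is one possible shape for it; `annotate_shares` is a hypothetical name, the counts are made up, and plain matplotlib bars stand in for the seaborn plots. It also skips bars with no height, which is an added assumption about what should be labeled.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt


def annotate_shares(ax, total):
    """Write each bar's height as a percentage of `total` above the bar."""
    labelled = 0
    for patch in ax.patches:
        height = patch.get_height()
        if not height > 0:  # skip empty bars and NaN heights
            continue
        ax.text(patch.get_x() + patch.get_width() / 2, height,
                '{:.0%}'.format(height / total),
                fontsize=12, color='black', ha='center', va='bottom')
        labelled += 1
    return labelled


fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(['Good', 'Bad', 'Neutral'], [13, 6, 1])  # toy counts, not the real data
n_labels = annotate_shares(ax, total=20)
labels = [text.get_text() for text in ax.texts]
print(n_labels, labels)
```

In the notebook, the call would be `annotate_shares(plt.gca(), len(combined))` after each seaborn plot.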

The next visualization shows the most common superhero races.
The percentages are again based on the total superhero count, including heroes whose race is unknown.

Excluding the unknown category, 'Human' is the most common race, at 30% of all superheroes.
One thing I am unable to figure out is how a superhero gets classified as human.
I see that 2% of superheroes are 'Human/Radiation'. Is there a specific share of 'human' traits required to count as 'human'?
Or should the 'Human/Radiation' variants also be classified as 'human'?

Disclaimer: There were a lot of unknown values, and I removed them from the visualizations; otherwise 'Unknown' would always be the top category.


In [None]:
df = combined[combined['Race'] != 'Unknown']  # keep only heroes with a known race


Race_df = df['Race'].value_counts().sort_values(ascending=False).head(10)

label = Race_df.index
value = Race_df.values

plt.figure(figsize=(15,4))
sns.barplot(x=label,y=value)
plt.xlabel('Races')
plt.ylabel('Count')
plt.title("Top 10 Superhero Races")

ax = plt.gca()
y_max = value.max()  # scale the y-axis to the tallest bar in this chart
ax.set_ylim([0, roundup(y_max)])

for patch in ax.patches:
    ax.text(patch.get_x() + patch.get_width()/2., patch.get_height(), '{:.0%}'.format(patch.get_height()/total), 
            fontsize=12, color='black', ha='center', va='bottom')

plt.show()
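
Because the 'Unknown' rows are removed from the chart but not from the denominator, the same bar can be read as two different shares. A small sketch on a made-up Series shows the difference between the share of all heroes and the share of heroes with a known race.

```python
import pandas as pd

# Toy race column: 3 Human, 1 Mutant, 6 Unknown (made-up proportions).
race = pd.Series(['Human'] * 3 + ['Mutant'] + ['Unknown'] * 6, name='Race')

known = race[race != 'Unknown']

# Denominator used by the bar labels: every hero, including unknown races.
share_of_all = known.value_counts() / len(race)

# Alternative denominator: only heroes whose race is known.
share_of_known = known.value_counts(normalize=True)

print(share_of_all)
print(share_of_known)
```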

The next visualization shows the most common eye colors for superheroes.
Once again, the percentages are based on the total superhero count.

Excluding the unknown category, blue is the most common eye color, at 32% of all superheroes.
I wonder whether this is related to the race of the superheroes or is just due to artist preference.



In [None]:
df = combined.drop(combined[combined['Eye color'] == 'Unknown'].index) 
Eyecolor_df = df['Eye color'].value_counts().sort_values(ascending=False).head(10)

label = Eyecolor_df.index
value = Eyecolor_df.values

plt.figure(figsize=(15,4))
sns.barplot(x=label,y=value)
plt.xlabel('Eye colors')
plt.ylabel('Count')
plt.title("Top 10 Superhero Eye Colors")

ax = plt.gca()
y_max = value.max()  # scale the y-axis to the tallest bar in this chart
ax.set_ylim([0, roundup(y_max)])

for patch in ax.patches:
    ax.text(patch.get_x() + patch.get_width()/2., patch.get_height(), '{:.0%}'.format(patch.get_height()/total), 
            fontsize=12, color='black', ha='center', va='bottom')

plt.show()

In [None]:
df = combined.drop(combined.columns[10:],axis=1)
df['Height'].unique()

In [None]:
Height_df=df.drop(df[df['Height']=='Unknown'].index)

Height_M=Height_df.drop(Height_df[Height_df['Gender']!= 'Male'].index)
Height_F=Height_df.drop(Height_df[Height_df['Gender']!= 'Female'].index)
Height_U=Height_df.drop(Height_df[Height_df['Gender']!= 'Unknown'].index)

Here is a visualization that shows the height distributions of superheroes by gender.

For this visualization and the following one, there are a lot of outliers. Some lie beyond a height of 300, but I cut them off with y-axis limits so the boxes stay readable.
Because each boxplot has its own y-axis range, the three panels are hard to compare. I think I should have put them on a common scale.

In [None]:
fig=plt.figure(figsize=(14,8))
fig.add_subplot(1,3,1)
plt.ylim(100,300)
sns.boxplot(x='Gender',y='Height',data=Height_M, width =0.75, color='red')
fig.add_subplot(1,3,2)
plt.ylim(100,375)
plt.title('Superheroes and Height Distributions')
sns.boxplot(x='Gender',y='Height',data=Height_F, width =0.75)
fig.add_subplot(1,3,3)
plt.ylim(100,250)
sns.boxplot(x='Gender',y='Height',data=Height_U, width =0.75, color ='green')
plt.show()
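
Before picking y-axis limits by eye, the quartiles and the usual 1.5 × IQR whisker bounds can be computed per gender. The sketch below uses made-up heights and assumes the default whisker rule of matplotlib and seaborn boxplots. It is a rough guide for choosing limits, not a replacement for the plots.

```python
import pandas as pd

# Toy heights with one extreme value in the 'Male' group (made-up numbers).
toy = pd.DataFrame({
    'Gender': ['Male'] * 5 + ['Female'] * 5,
    'Height': [170.0, 180.0, 185.0, 190.0, 900.0,
               160.0, 165.0, 170.0, 175.0, 180.0],
})

# First and third quartile of Height for each gender.
q = toy.groupby('Gender')['Height'].quantile([0.25, 0.75]).unstack()
q.columns = ['q1', 'q3']

# Points beyond 1.5 * IQR from the box are conventionally drawn as outliers.
iqr = q['q3'] - q['q1']
bounds = q.assign(lower=q['q1'] - 1.5 * iqr, upper=q['q3'] + 1.5 * iqr)

# Attach each row's own gender bounds and keep the rows that fall outside them.
merged = toy.join(bounds, on='Gender')
outliers = merged[(merged['Height'] < merged['lower']) | (merged['Height'] > merged['upper'])]

print(bounds)
print(outliers[['Gender', 'Height']])
```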

In [None]:
Weight_df=df.drop(df[df['Weight']=='Unknown'].index)
Weight_df=Weight_df.dropna()

Weight_M=Weight_df.drop(Weight_df[Weight_df['Gender']!= 'Male'].index)
Weight_F=Weight_df.drop(Weight_df[Weight_df['Gender']!= 'Female'].index)
Weight_U=Weight_df.drop(Weight_df[Weight_df['Gender']!= 'Unknown'].index)

Here is a visualization that shows the weight distributions of superheroes by gender.


In [None]:
fig=plt.figure(figsize=(14,8))
fig.add_subplot(1,3,1)

sns.boxplot(x='Gender',y='Weight',data=Weight_M, width =0.75,color='red')
fig.add_subplot(1,3,2)

plt.title('Superheroes and Weight Distributions')
sns.boxplot(x='Gender',y='Weight',data=Weight_F, width =0.75)
fig.add_subplot(1,3,3)

sns.boxplot(x='Gender',y='Weight',data=Weight_U, width =0.75, color='green')
plt.show()

In [None]:
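# The first 10 columns describe the hero; the remaining columns are True/False power flags.
hero_powers = pd.concat([combined.iloc[:, :10], combined.iloc[:, 10:].astype(int)], axis=1)
# Sum only the power columns so text columns such as 'Gender' are not included.
hero_powers['# of powers'] = hero_powers.iloc[:, 10:].sum(axis=1)
hero_powers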


The next visualization shows which heroes have the most powers.
Spectre is in first place with 49 powers, followed by Amazo.

In [None]:
df=hero_powers[['hero_names','# of powers']].sort_values('# of powers',ascending=False)
fig=plt.figure(figsize=(10,5))
fig.add_subplot(1,1,1)
sns.barplot(x='hero_names',y='# of powers',data=df.head(15),palette="viridis")
# Set the title after the axes exist so it is attached to the plotted axes.
plt.title('Superheroes With the Most Powers')
plt.xticks(rotation=60)


plt.show()

In [None]:
df

In [None]:
hp=hero_powers[['hero_names','# of powers','Gender']].sort_values('# of powers',ascending=False)
hp_M=hp.drop(hp[hp['Gender'] != 'Male'].index)
hp_U=hp.drop(hp[hp['Gender'] != 'Unknown'].index)
hp_F=hp.drop(hp[hp['Gender'] != 'Female'].index)

The visualization below shows the heroes with the most powers for each gender.
I'm surprised that Godzilla made it onto the list for the unknown gender.

In [None]:
fig=plt.figure(figsize=(20,5))
fig.add_subplot(1,3,1)
sns.barplot(x='hero_names',y='# of powers',data=hp_M.head(15),palette="plasma_r",hue='Gender')
plt.xticks(rotation=80)

fig.add_subplot(1,3,2)
plt.title("Superheroes With the Most Powers by Gender")
sns.barplot(x='hero_names',y='# of powers',data=hp_F.head(15),palette="Blues",hue='Gender')
plt.xticks(rotation=80)

fig.add_subplot(1,3,3)
sns.barplot(x='hero_names',y='# of powers',data=hp_U.head(15),palette="viridis",hue='Gender')
plt.xticks(rotation=80)
plt.show()



In [None]:
hero_powers=hero_powers.drop(hero_powers.columns[0:10],axis=1)
hero_powers

In [None]:
hero_powers=hero_powers.drop('# of powers', axis =1)


In [None]:
hero_powers_count =pd.DataFrame()

for i in hero_powers.columns:
    hero_powers_count[i] = hero_powers[i].value_counts()

In [None]:
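# Drop the row that counts heroes without each power, keeping only the row for value 1.
hero_powers_count=hero_powers_count.drop([0])
# Transpose so each power becomes a row, then name the columns.
hero_powers_count=hero_powers_count.T
hero_powers_count=hero_powers_count.reset_index()
hero_powers_count.rename(columns={'index': 'Hero Power',1:'Count'}, inplace=True)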

In [None]:
hero_powers_count=hero_powers_count.sort_values('Count',ascending=False)
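
The same table can be built more directly, because summing a column of True/False (or 1/0) flags counts the heroes who have that power. This is a sketch on a toy frame with hypothetical power names, offered as an alternative to the loop over `value_counts` rather than a change to the results.

```python
import pandas as pd

# Toy power flags; the column names are placeholders, not real dataset columns.
toy_powers = pd.DataFrame({
    'Power A': [True, True, False],
    'Power B': [False, True, False],
    'Power C': [False, False, True],
})

counts = (
    toy_powers.sum()                # True counts as 1, so the sum is the number of heroes
    .rename_axis('Hero Power')      # name the index so it becomes a labeled column
    .reset_index(name='Count')
    .sort_values('Count', ascending=False, ignore_index=True)
)
print(counts)
```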


Lastly, this is a bar chart that shows the most popular powers and the percentage of superheroes that have them.

In [None]:

plt.figure(figsize=(20,10))
plt.xticks(rotation=50)
sns.barplot(x='Hero Power',y='Count', data=hero_powers_count.head(15))

total =float(len(combined))
ax = plt.gca()
y_max = hero_powers_count['Count'].max()  # scale the y-axis to the tallest bar in this chart
ax.set_ylim([0, roundup(y_max)])

for patch in ax.patches:
    ax.text(patch.get_x() + patch.get_width()/2., patch.get_height(), '{:.0%}'.format(patch.get_height()/total), 
            fontsize=12, color='black', ha='center', va='bottom')

plt.title('Most Popular Hero Powers and the Proportion of Heroes That Have Them')
plt.show()