# Applying Advanced Transformations (Core)

## The Data

You will be working with a heavily modified version of the Superheroes dataset from Kaggle.

The dataset includes two CSV files:

- superhero_info.csv:
    - Contains Name, Publisher, Demographic Info, and Body measurements.

- superhero_powers.csv:
    - Contains Hero name and list of powers

## The Task

Your task is two-fold:

**I. Clean the files and combine them into one final DataFrame.**

- This dataframe should have the following columns:
    - Hero (Just the name of the Hero)
    - Publisher
    - Gender
    - Eye color
    - Race
    - Hair color
    - Height (numeric)
    - Skin color
    - Alignment
    - Weight (numeric)
    - Plus, one-hot-encoded columns for every power that appears in the dataset. E.g.:
        - Agility
        - Flight
        - Superspeed
        - etc.

Hint: There is a space in "100 kg" or "52.5 cm"


**II. Use your combined DataFrame to answer the following questions.**

1. Compare the average weight of superheroes who have Super Speed to those who do not.
2. What is the average height of heroes for each publisher?

In [220]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Importing json, os
import json, os

In [221]:
# Importing superhero information dataset
superhero_info = pd.read_csv(r'Data\superhero_info - superhero_info.csv')
# Displaying the first 5 rows of the dataset
superhero_info.head()

Unnamed: 0,Hero|Publisher,Gender,Race,Alignment,Hair color,Eye color,Skin color,Measurements
0,A-Bomb|Marvel Comics,Male,Human,good,No Hair,yellow,Unknown,"{'Height': '203.0 cm', 'Weight': '441.0 kg'}"
1,Abe Sapien|Dark Horse Comics,Male,Icthyo Sapien,good,No Hair,blue,blue,"{'Height': '191.0 cm', 'Weight': '65.0 kg'}"
2,Abin Sur|DC Comics,Male,Ungaran,good,No Hair,blue,red,"{'Height': '185.0 cm', 'Weight': '90.0 kg'}"
3,Abomination|Marvel Comics,Male,Human / Radiation,bad,No Hair,green,Unknown,"{'Height': '203.0 cm', 'Weight': '441.0 kg'}"
4,Absorbing Man|Marvel Comics,Male,Human,bad,No Hair,blue,Unknown,"{'Height': '193.0 cm', 'Weight': '122.0 kg'}"


In [222]:
# Importing superhero powers dataset
superhero_powers = pd.read_csv(r'Data\superhero_powers - superhero_powers.csv')
# Displaying the first 5 rows of the dataset
superhero_powers.head()

Unnamed: 0,hero_names,Powers
0,3-D Man,"Agility,Super Strength,Stamina,Super Speed"
1,A-Bomb,"Accelerated Healing,Durability,Longevity,Super..."
2,Abe Sapien,"Agility,Accelerated Healing,Cold Resistance,Du..."
3,Abin Sur,Lantern Power Ring
4,Abomination,"Accelerated Healing,Intelligence,Super Strengt..."


In [223]:
superhero_info[['Hero', 'Publisher']] = superhero_info['Hero|Publisher'].str.split('|', expand=True)
superhero_info.drop('Hero|Publisher', axis=1, inplace=True)
superhero_info.head()

Unnamed: 0,Gender,Race,Alignment,Hair color,Eye color,Skin color,Measurements,Hero,Publisher
0,Male,Human,good,No Hair,yellow,Unknown,"{'Height': '203.0 cm', 'Weight': '441.0 kg'}",A-Bomb,Marvel Comics
1,Male,Icthyo Sapien,good,No Hair,blue,blue,"{'Height': '191.0 cm', 'Weight': '65.0 kg'}",Abe Sapien,Dark Horse Comics
2,Male,Ungaran,good,No Hair,blue,red,"{'Height': '185.0 cm', 'Weight': '90.0 kg'}",Abin Sur,DC Comics
3,Male,Human / Radiation,bad,No Hair,green,Unknown,"{'Height': '203.0 cm', 'Weight': '441.0 kg'}",Abomination,Marvel Comics
4,Male,Human,bad,No Hair,blue,Unknown,"{'Height': '193.0 cm', 'Weight': '122.0 kg'}",Absorbing Man,Marvel Comics


In [224]:
superhero_powers.rename(columns={'hero_names': 'Hero'}, inplace=True)
superhero_powers.head()

Unnamed: 0,Hero,Powers
0,3-D Man,"Agility,Super Strength,Stamina,Super Speed"
1,A-Bomb,"Accelerated Healing,Durability,Longevity,Super..."
2,Abe Sapien,"Agility,Accelerated Healing,Cold Resistance,Du..."
3,Abin Sur,Lantern Power Ring
4,Abomination,"Accelerated Healing,Intelligence,Super Strengt..."


In [225]:
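# Merging the two datasets on the Hero column
# Note: the trailing .head() keeps only the first 5 merged rows, so every later step
# (including both answers at the end) is based on those 5 heroes; drop .head() to use the full data
super_heros = pd.merge(superhero_info, superhero_powers, on='Hero', how='inner').head()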
super_heros

Unnamed: 0,Gender,Race,Alignment,Hair color,Eye color,Skin color,Measurements,Hero,Publisher,Powers
0,Male,Human,good,No Hair,yellow,Unknown,"{'Height': '203.0 cm', 'Weight': '441.0 kg'}",A-Bomb,Marvel Comics,"Accelerated Healing,Durability,Longevity,Super..."
1,Male,Icthyo Sapien,good,No Hair,blue,blue,"{'Height': '191.0 cm', 'Weight': '65.0 kg'}",Abe Sapien,Dark Horse Comics,"Agility,Accelerated Healing,Cold Resistance,Du..."
2,Male,Ungaran,good,No Hair,blue,red,"{'Height': '185.0 cm', 'Weight': '90.0 kg'}",Abin Sur,DC Comics,Lantern Power Ring
3,Male,Human / Radiation,bad,No Hair,green,Unknown,"{'Height': '203.0 cm', 'Weight': '441.0 kg'}",Abomination,Marvel Comics,"Accelerated Healing,Intelligence,Super Strengt..."
4,Male,Human,bad,No Hair,blue,Unknown,"{'Height': '193.0 cm', 'Weight': '122.0 kg'}",Absorbing Man,Marvel Comics,"Cold Resistance,Durability,Energy Absorption,S..."
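An inner merge silently drops heroes that appear in only one file (3-D Man, for example, is in the powers preview but not in the merged result). The sketch below shows one way to count those rows with the `indicator` option of `pd.merge`. The two small frames and the unmatched row in `info` are made up for illustration and are not the real dataset.

```python
import pandas as pd

# Toy stand-ins for the two files (illustrative rows only)
info = pd.DataFrame({
    "Hero": ["A-Bomb", "Abe Sapien", "Example Hero"],
    "Publisher": ["Marvel Comics", "Dark Horse Comics", "DC Comics"],
})
powers = pd.DataFrame({
    "Hero": ["3-D Man", "A-Bomb", "Abe Sapien"],
    "Powers": ["Agility,Super Speed", "Durability", "Agility"],
})

# An outer merge with indicator=True adds a _merge column that says
# whether each row came from the left frame, the right frame, or both
merged = pd.merge(info, powers, on="Hero", how="outer", indicator=True)
merge_counts = merged["_merge"].value_counts()
print(merge_counts)

# Keeping only the matched rows reproduces the inner merge
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
print(matched)
```

Checking the counts first makes it easier to decide whether an inner merge is acceptable or whether hero names need cleaning before merging.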


In [226]:
measurements = super_heros.loc[0, 'Measurements']
print(type(measurements))
measurements

<class 'str'>


"{'Height': '203.0 cm', 'Weight': '441.0 kg'}"

In [227]:
## Using .str.replace to replace all single quotes with double quotes
super_heros['Measurements'] = super_heros['Measurements'].str.replace("'", '"')
## Applying json.loads to convert the string to a dictionary
super_heros['Measurements'] = super_heros['Measurements'].apply(json.loads)
super_heros['Measurements'].head()

0    {'Height': '203.0 cm', 'Weight': '441.0 kg'}
1     {'Height': '191.0 cm', 'Weight': '65.0 kg'}
2     {'Height': '185.0 cm', 'Weight': '90.0 kg'}
3    {'Height': '203.0 cm', 'Weight': '441.0 kg'}
4    {'Height': '193.0 cm', 'Weight': '122.0 kg'}
Name: Measurements, dtype: object
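Swapping quote characters works here because none of the values contain an apostrophe. As an alternative sketch, `ast.literal_eval` from the standard library parses Python-style dictionary strings directly, so no quote replacement is needed. The two strings below are copied from the output above; the variable names are illustrative.

```python
import ast

import pandas as pd

# Two Measurements strings in the same format as the dataset
raw = pd.Series([
    "{'Height': '203.0 cm', 'Weight': '441.0 kg'}",
    "{'Height': '191.0 cm', 'Weight': '65.0 kg'}",
])

# literal_eval safely evaluates Python literals (dicts, lists, strings, numbers)
parsed = raw.apply(ast.literal_eval)

print(type(parsed[0]))
print(parsed[0]["Height"])
```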

In [228]:
## Using .apply(pd.Series) to convert the dictionary to a Series
height_weight = super_heros['Measurements'].apply(pd.Series)
## Using pd.concat to concatenate the Series to the original DataFrame
super_heros = pd.concat((super_heros, height_weight), axis=1)
## Dropping the Measurements column
super_heros = super_heros.drop('Measurements', axis=1)
## Displaying the first 2 rows of the DataFrame
super_heros.head(2)

Unnamed: 0,Gender,Race,Alignment,Hair color,Eye color,Skin color,Hero,Publisher,Powers,Height,Weight
0,Male,Human,good,No Hair,yellow,Unknown,A-Bomb,Marvel Comics,"Accelerated Healing,Durability,Longevity,Super...",203.0 cm,441.0 kg
1,Male,Icthyo Sapien,good,No Hair,blue,blue,Abe Sapien,Dark Horse Comics,"Agility,Accelerated Healing,Cold Resistance,Du...",191.0 cm,65.0 kg
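`.apply(pd.Series)` builds one Series per row, which can get slow on larger frames. A possible alternative, sketched below on two illustrative rows, is to pass the list of dictionaries to the `pd.DataFrame` constructor and reuse the original index so the result still lines up for `pd.concat`.

```python
import pandas as pd

# A column of already-parsed dictionaries, as produced by the previous step
measurements = pd.Series(
    [
        {"Height": "203.0 cm", "Weight": "441.0 kg"},
        {"Height": "191.0 cm", "Weight": "65.0 kg"},
    ],
    name="Measurements",
)

# One DataFrame construction instead of one pd.Series per row;
# reusing the index keeps the rows aligned with the source frame
height_weight = pd.DataFrame(measurements.tolist(), index=measurements.index)
print(height_weight)
```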


In [229]:
# Using .info to display the DataFrame's information
super_heros.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Gender      5 non-null      object
 1   Race        5 non-null      object
 2   Alignment   5 non-null      object
 3   Hair color  5 non-null      object
 4   Eye color   5 non-null      object
 5   Skin color  5 non-null      object
 6   Hero        5 non-null      object
 7   Publisher   5 non-null      object
 8   Powers      5 non-null      object
 9   Height      5 non-null      object
 10  Weight      5 non-null      object
dtypes: object(11)
memory usage: 652.0+ bytes


In [230]:
# Renaming Height and Weight columns
super_heros.rename(columns={'Height': 'Height (in cm)', 'Weight': 'Weight (in kg)'}, inplace=True)
# Displaying the first 2 rows of the DataFrame
super_heros.head(2)

Unnamed: 0,Gender,Race,Alignment,Hair color,Eye color,Skin color,Hero,Publisher,Powers,Height (in cm),Weight (in kg)
0,Male,Human,good,No Hair,yellow,Unknown,A-Bomb,Marvel Comics,"Accelerated Healing,Durability,Longevity,Super...",203.0 cm,441.0 kg
1,Male,Icthyo Sapien,good,No Hair,blue,blue,Abe Sapien,Dark Horse Comics,"Agility,Accelerated Healing,Cold Resistance,Du...",191.0 cm,65.0 kg


In [231]:
# Using .str.replace to remove the ' cm' and ' kg' from the Height and Weight columns
super_heros['Height (in cm)'] = super_heros['Height (in cm)'].str.replace(' cm', '')
super_heros['Weight (in kg)'] = super_heros['Weight (in kg)'].str.replace(' kg', '')
# Changing the data type of the Height and Weight columns to float
super_heros['Height (in cm)'] = super_heros['Height (in cm)'].astype(float)
super_heros['Weight (in kg)'] = super_heros['Weight (in kg)'].astype(float)
# Displaying the DataFrame's information
super_heros.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Gender          5 non-null      object 
 1   Race            5 non-null      object 
 2   Alignment       5 non-null      object 
 3   Hair color      5 non-null      object 
 4   Eye color       5 non-null      object 
 5   Skin color      5 non-null      object 
 6   Hero            5 non-null      object 
 7   Publisher       5 non-null      object 
 8   Powers          5 non-null      object 
 9   Height (in cm)  5 non-null      float64
 10  Weight (in kg)  5 non-null      float64
dtypes: float64(2), object(9)
memory usage: 652.0+ bytes
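The hint in the task points at the space between the number and the unit. A minimal sketch of that idea follows: split on the space, keep the first piece, and convert with `pd.to_numeric`. The missing and non-numeric entries are assumptions added to show how `errors="coerce"` behaves; the five rows used in this notebook contain neither.

```python
import pandas as pd

# Example values; the None and "unknown" entries are hypothetical edge cases
heights = pd.Series(["203.0 cm", "191.0 cm", None, "unknown"])

# Split on the space and keep the part before it
number_part = heights.str.split(" ").str[0]

# errors="coerce" turns anything that is not a number into NaN instead of raising
height_cm = pd.to_numeric(number_part, errors="coerce")
print(height_cm)
```

Unlike `.astype(float)`, this does not fail on the first unparseable value, which may matter once the full dataset is processed.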


In [232]:
# Applying pd.Series to the Powers column
# (each value is still one comma-separated string, so the result has a single column)
powers = super_heros['Powers'].apply(pd.Series)
# Displaying the result
powers

Unnamed: 0,0
0,"Accelerated Healing,Durability,Longevity,Super..."
1,"Agility,Accelerated Healing,Cold Resistance,Du..."
2,Lantern Power Ring
3,"Accelerated Healing,Intelligence,Super Strengt..."
4,"Cold Resistance,Durability,Energy Absorption,S..."


In [233]:
# Printing the data type of the third row of the Powers column
print(type(super_heros.loc[2, 'Powers']))
# Printing the first row of the Powers column
super_heros.loc[0, 'Powers']

<class 'str'>


'Accelerated Healing,Durability,Longevity,Super Strength,Stamina,Camouflage,Self-Sustenance'

In [234]:
# Using .str.split to split each Powers string on the comma
super_heros['Powers_split'] = super_heros['Powers'].str.split(',')
# Casting the resulting lists back to strings (the next cells parse them again with json.loads)
super_heros['Powers_split'] = super_heros['Powers_split'].astype(str)


In [235]:
# Checking if the split was successful
super_heros.loc[2, 'Powers_split']

"['Lantern Power Ring']"

In [236]:
# Using .str.replace to replace all single quotes with double quotes
super_heros['Powers_split'] = super_heros['Powers_split'].str.replace("'", '"')

In [237]:
# Applying json.loads to convert the string to a list
super_heros['Powers_split'] = super_heros['Powers_split'].apply(json.loads)
# Displaying the first 2 rows of the DataFrame
super_heros['Powers_split'].head(2)

0    [Accelerated Healing, Durability, Longevity, S...
1    [Agility, Accelerated Healing, Cold Resistance...
Name: Powers_split, dtype: object
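The round trip through `astype(str)`, quote replacement, and `json.loads` ends with the same lists that `.str.split` already returns. The sketch below, using two power strings from the previews above, shows the shorter path: split once and explode the result directly.

```python
import pandas as pd

# Comma-separated power strings in the same format as the Powers column
powers = pd.Series([
    "Agility,Super Strength,Stamina,Super Speed",
    "Lantern Power Ring",
])

# .str.split already returns Python lists, so no string round trip is needed
powers_split = powers.str.split(",")
print(type(powers_split[0]))

# explode gives each list element its own row and repeats the original index
exploded_powers = powers_split.explode()
print(exploded_powers)
```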

In [238]:
# Checking if the conversion was successful
super_heros['Powers_split'].value_counts()

[Accelerated Healing, Durability, Longevity, Super Strength, Stamina, Camouflage, Self-Sustenance]                                                                                                                                  1
[Agility, Accelerated Healing, Cold Resistance, Durability, Underwater breathing, Marksmanship, Weapons Master, Longevity, Intelligence, Super Strength, Telepathy, Stamina, Immortality, Reflexes, Enhanced Sight, Sub-Mariner]    1
[Lantern Power Ring]                                                                                                                                                                                                                1
[Accelerated Healing, Intelligence, Super Strength, Stamina, Super Speed, Invulnerability, Animation, Super Breath]                                                                                                                 1
[Cold Resistance, Durability, Energy Absorption, Super Strength, Invulnerability

In [239]:
## Using .explode to give each power in the list its own row
exploded = super_heros.explode('Powers_split')
## Displaying the first 2 rows of the DataFrame
exploded[['Powers', 'Powers_split']].head(2)

Unnamed: 0,Powers,Powers_split
0,"Accelerated Healing,Durability,Longevity,Super...",Accelerated Healing
0,"Accelerated Healing,Durability,Longevity,Super...",Durability


In [240]:
# Creating a list of unique values in the Powers_split column
cols_to_make = exploded['Powers_split'].dropna().unique()
# Displaying the list
cols_to_make

array(['Accelerated Healing', 'Durability', 'Longevity', 'Super Strength',
       'Stamina', 'Camouflage', 'Self-Sustenance', 'Agility',
       'Cold Resistance', 'Underwater breathing', 'Marksmanship',
       'Weapons Master', 'Intelligence', 'Telepathy', 'Immortality',
       'Reflexes', 'Enhanced Sight', 'Sub-Mariner', 'Lantern Power Ring',
       'Super Speed', 'Invulnerability', 'Animation', 'Super Breath',
       'Energy Absorption', 'Elemental Transmogrification',
       'Fire Resistance', 'Natural Armor', 'Molecular Manipulation',
       'Heat Resistance', 'Matter Absorption'], dtype=object)

In [241]:
# Looping through the unique power names
for col in cols_to_make:
    # Creating a column that is True when the hero's Powers string contains that power name
    # (regex=False treats the name as plain text rather than a regular expression)
    super_heros[col] = super_heros['Powers'].str.contains(col, regex=False)
# Displaying the first 2 rows of the DataFrame
super_heros.head(2)

Unnamed: 0,Gender,Race,Alignment,Hair color,Eye color,Skin color,Hero,Publisher,Powers,Height (in cm),...,Invulnerability,Animation,Super Breath,Energy Absorption,Elemental Transmogrification,Fire Resistance,Natural Armor,Molecular Manipulation,Heat Resistance,Matter Absorption
0,Male,Human,good,No Hair,yellow,Unknown,A-Bomb,Marvel Comics,"Accelerated Healing,Durability,Longevity,Super...",203.0,...,False,False,False,False,False,False,False,False,False,False
1,Male,Icthyo Sapien,good,No Hair,blue,blue,Abe Sapien,Dark Horse Comics,"Agility,Accelerated Healing,Cold Resistance,Du...",191.0,...,False,False,False,False,False,False,False,False,False,False
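`str.contains` matches substrings, so a power whose name is contained in a longer power name would be flagged for both. One alternative, sketched here on two illustrative rows, is `Series.str.get_dummies`, which splits on a separator and matches whole items only. The frame and variable names are hypothetical.

```python
import pandas as pd

# Two heroes with comma-separated powers (rows taken from the previews above)
heroes = pd.DataFrame({
    "Hero": ["3-D Man", "Abin Sur"],
    "Powers": ["Agility,Super Strength,Stamina,Super Speed", "Lantern Power Ring"],
})

# get_dummies splits each string on the separator and returns one 0/1 column per item;
# astype(bool) converts the flags to True/False like the loop above
power_flags = heroes["Powers"].str.get_dummies(sep=",").astype(bool)

# Attaching the flags and dropping the raw Powers column
encoded = pd.concat([heroes.drop(columns="Powers"), power_flags], axis=1)
print(encoded)
```

This also removes the need for the explode step and the explicit loop, since the unique power names are discovered during the split.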


In [242]:
## Dropping the Powers and Powers_split columns
super_heros = super_heros.drop(columns=['Powers', 'Powers_split'])
## Saving the DataFrame as a csv file
super_heros.to_csv('super_heros_data.csv', index=False)

1. Compare the average weight of superheroes who have Super Speed to those who do not.

In [243]:
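# Boolean mask of heroes who have Super Speed (named to avoid shadowing the built-in filter)
has_super_speed = super_heros['Super Speed']
# Keeping only the heroes with Super Speed
filter_df = super_heros[has_super_speed]
print(f'Average weight of super heros with super speed: {filter_df["Weight (in kg)"].mean()} kg')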

Average weight of super heros with super speed: 441.0 kg
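The question asks for a comparison, so the average for heroes without Super Speed is needed as well. A minimal sketch, assuming the same column names as above, is to group by the boolean column and take the mean of each group. The weights below are made-up values, not results from the dataset.

```python
import pandas as pd

# Toy frame with the two relevant columns (weights are illustrative only)
heroes = pd.DataFrame({
    "Super Speed": [True, False, False, True],
    "Weight (in kg)": [441.0, 65.0, 90.0, 59.0],
})

# Grouping by the boolean column yields one mean per group (False and True)
avg_weight = heroes.groupby("Super Speed")["Weight (in kg)"].mean()
print(avg_weight)

# Difference between the two group means
means = avg_weight.to_dict()
difference = means[True] - means[False]
print(f"Difference (with minus without): {difference} kg")
```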


2. What is the average height of heroes for each publisher?

In [244]:
# Getting the average height of super heros by publisher
hero = super_heros.groupby('Publisher')['Height (in cm)'].mean()
# Displaying the results
print(f'Average height of super heros by publisher: {hero}')

Average height of super heros by publisher: Publisher
DC Comics            185.000000
Dark Horse Comics    191.000000
Marvel Comics        199.666667
Name: Height (in cm), dtype: float64
