# Data Science Wizards!

## Instructors: Shervin Manzuri, Levin Noronha, Morteza Alipour

# Introduction! 

This is going to be your first venture into **DATA SCIENCE**! Data science is the cool job of doing magic with piles of stuff most people don't understand!

A data scientist is basically a **wizard**! These wizards get certain ingredients in the form of **RAW Data** and turn them into an **easily digestible data potion**! Then, using **magic**, they predict the future! However, before you become magicians, you will first need to improve your **Cooking** skill!

## What potion are we making?

We want to look at **DC** and **Marvel** comic characters and do data science wizardry with them! So our ingredients will be their data.

What does a good cook usually do?

## 1. Fetching the Ingredients

As a data scientist wizard, you first need to gather your **necessary ingredients**. Wizards use herbs and water to make potions! But data scientist wizards use... wait for it... **DATA**. Just as a good potion needs quality herbs, good data scientists need to carefully choose their data. If they use bad data, they might create **poisonous potions**! We really don't want poison now, do we?

Now the question is...

### 1.1. How do we choose the data?

The first rule of thumb for wizards when picking quality herbs is that they must be **clean**, **fresh** and without **pests**!

Well you guessed it, it's exactly the same case for data scientist wizards. Because we *are* wizards.

Your first job as a data scientist wizard is to **pick and clean your ingredients**! LET'S GO!

#### Importing

The first step when writing Python magic is importing your tools! We will do this for you. The tools we will import are two friendly neighborhood sidekicks named *pandas* and *numpy*, plus their projector device called *matplotlib*. We will call upon *pandas* and *numpy* to help us later.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#### Fetching our Ingredients

Now we need Python, our sidekick, to fetch (read) the ingredients we want to clean! We need to tell it the address where it can find each ingredient! Both our ingredients are on our computer, so we need to give it the local address!


In [3]:
#Our First ingredient!
heroes_info = pd.read_csv("heroes_information.csv",index_col = 0)
#Our second ingredient!
superhero_powers = pd.read_csv("super_hero_powers.csv")
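Curious what `index_col = 0` does in the spell above? Here is a minimal sketch using a tiny made-up CSV held in memory (the names `csv_text`, `plain` and `indexed` are just for illustration and are not part of the real ingredients). It assumes the file's first column is an unnamed column of row numbers.

```python
import io

import pandas as pd

# A made-up stand-in for a CSV file (NOT the real heroes data).
# Note that the first column has no name in the header row.
csv_text = ",name,Publisher\n0,Hero A,Marvel Comics\n1,Hero B,DC Comics\n"

# Without index_col, pandas keeps the first column as ordinary data
# and invents the name "Unnamed: 0" for it.
plain = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row labels instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(plain.columns))    # ['Unnamed: 0', 'name', 'Publisher']
print(list(indexed.columns))  # ['name', 'Publisher']
```

So `index_col = 0` simply tells pandas "the first column is the row label, not an ingredient".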

#### Now that we have our ingredients...

We need to see how much of each we have. Our ingredients come in packages, and they can be of different types, shapes, lengths, *smells* or even tastes.

To explore our data ingredients we can ask our friend *pandas* to **describe** them for us. We have to ask him separately for each ingredient as **pandas are lazy**!
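Here is a small sketch of a few ways pandas can describe an ingredient. It uses a tiny made-up table called `sample` (an assumption for illustration, not the real data), so you can see what each request gives back before trying it on the real ingredients.

```python
import numpy as np
import pandas as pd

# A tiny made-up ingredient (NOT the real heroes data).
sample = pd.DataFrame({
    "name": ["Hero A", "Hero B", "Hero C"],
    "Publisher": ["Marvel Comics", "DC Comics", "Marvel Comics"],
    "Height": [180.0, 175.0, np.nan],
})

# How much do we have? shape is (number of rows, number of columns).
n_rows, n_cols = sample.shape
print(n_rows, n_cols)  # 3 3

# What type is each column?
print(sample.dtypes)

# Summary statistics for the numeric columns (missing values are skipped).
print(sample.describe())

# How many heroes does each publisher have?
print(sample["Publisher"].value_counts())
```

The same requests should work on `heroes_info` and `superhero_powers`, one ingredient at a time.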

#### Let's ask pandas to show us the first 5 pieces of our **superhero powers** ingredient:

In [4]:
superhero_powers.head()

| | hero_names | Agility | Accelerated Healing | Lantern Power Ring | Dimensional Awareness | Cold Resistance | Durability | Stealth | Energy Absorption | Flight | ... | Web Creation | Reality Warping | Odin Force | Symbiote Costume | Speed Force | Phoenix Force | Molecular Dissipation | Vision - Cryo | Omnipresent | Omniscient |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3-D Man | True | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | A-Bomb | False | True | False | False | False | True | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | Abe Sapien | True | True | False | False | True | True | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | Abin Sur | False | False | True | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | Abomination | False | True | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |


#### Let's ask pandas to show us the first 6 pieces of our **superhero info** ingredient:

In [5]:
heroes_info.head(6)

| | name | Gender | Eye color | Race | Hair color | Height | Publisher | Skin color | Alignment | Weight |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A-Bomb | Male | yellow | Human | No Hair | 203.0 | Marvel Comics | - | good | 441.0 |
| 1 | Abe Sapien | Male | blue | Icthyo Sapien | No Hair | 191.0 | Dark Horse Comics | blue | good | 65.0 |
| 2 | Abin Sur | Male | blue | Ungaran | No Hair | 185.0 | DC Comics | red | good | 90.0 |
| 3 | Abomination | Male | green | Human / Radiation | No Hair | 203.0 | Marvel Comics | - | bad | 441.0 |
| 4 | Abraxas | Male | blue | Cosmic Entity | Black | -99.0 | Marvel Comics | - | bad | -99.0 |
| 5 | Absorbing Man | Male | blue | Human | No Hair | 193.0 | Marvel Comics | - | bad | 122.0 |


## 2. Cleaning our ingredients

Now that we have seen what our ingredients look like, we need to clean them up! Now when we want to clean up food ingredients we look for dirty parts, stale parts or parts that don't make sense (we don't want to put nails in our soup!).

The same goes for **data cleaning**. When we want to clean our data ingredients, we will look for:

    1. Data rows that don't make sense.
    2. Data rows that look wrong, but that we can still use once we fix them.
    3. Data we don't think we will need.
    
Now, looking at the tables above, which ones do you think don't make sense? Which ones do we need, but only after changing how they look? And which ones won't we need?

### 2.1 Don't look ahead! We need your expertise first!

#### 2.1.1 You haven't been looking ahead now have you?






### 2.2 Now that you have some insight, let's see what needs to be cleaned!

We don't want problems of type 1 and 2 to affect our analysis.

    1. Do you think a height of -99 or weight of -99 makes sense?
    2. We need to let Python know when we don't have information about something. Otherwise it wouldn't know what to do! The way we do this is by turning cells with "-" into "NaN" ("not a number"), the marker pandas uses for missing values.
    
And we want to drop certain values!

    3. We want to compare Marvel to DC! So we need to drop all heroes that don't belong to them.
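Before casting these spells on the real ingredients, here is a minimal sketch of rules 1 and 2 on a tiny made-up table called `toy` (an assumption for illustration, not the real data). The thing to watch for: `replace` hands back a *new* table and leaves the old one untouched, so the result has to be stored.

```python
import numpy as np
import pandas as pd

# A tiny made-up ingredient (NOT the real heroes data).
toy = pd.DataFrame({
    "name": ["Hero A", "Hero B", "Hero C"],
    "Skin color": ["-", "green", "-"],
    "Height": [180.0, -99.0, 175.0],
})

# Rule 1: a negative height makes no sense, so mark it as missing.
toy.loc[toy["Height"] < 0, "Height"] = np.nan

# Rule 2: replace returns a new table. If we don't store it, nothing changes.
toy.replace("-", np.nan)
still_dashes = int((toy["Skin color"] == "-").sum())
print(still_dashes)  # 2

# Storing the result is what actually cleans the table.
toy = toy.replace("-", np.nan)

# Count the missing values in each column after cleaning.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

After cleaning, `Skin color` has 2 missing values and `Height` has 1.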

In [42]:
# Changing -99 height and weight to NaN
heroes_info.loc[heroes_info['Height'] < 0, 'Height'] = np.nan
heroes_info.loc[heroes_info['Weight'] < 0, 'Weight'] = np.nan

# Changing - to NaN (replace gives back a new table, so we store the result)
superhero_powers = superhero_powers.replace('-', np.nan)
heroes_info = heroes_info.replace('-', np.nan)

# Merging our two ingredients on the hero name
superhero_powers = superhero_powers.rename(columns={'hero_names': 'name'})
merged_data = pd.merge(heroes_info, superhero_powers, on='name')

# Dropping non Marvel and non DC: keep only rows whose publisher is one of the two
marvel_and_dc_data = merged_data[merged_data['Publisher'].isin(['Marvel Comics', 'DC Comics'])]

display(merged_data.loc[(merged_data['Publisher'] != 'DC Comics')])


| | name | Gender | Eye color | Race | Hair color | Height | Publisher | Skin color | Alignment | Weight | ... | Web Creation | Reality Warping | Odin Force | Symbiote Costume | Speed Force | Phoenix Force | Molecular Dissipation | Vision - Cryo | Omnipresent | Omniscient |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A-Bomb | Male | yellow | Human | No Hair | 203.0 | Marvel Comics | NaN | good | 441.0 | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | Abe Sapien | Male | blue | Icthyo Sapien | No Hair | 191.0 | Dark Horse Comics | blue | good | 65.0 | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | Abomination | Male | green | Human / Radiation | No Hair | 203.0 | Marvel Comics | NaN | bad | 441.0 | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | Abraxas | Male | blue | Cosmic Entity | Black | NaN | Marvel Comics | NaN | bad | NaN | ... | False | False | False | False | False | False | False | False | False | False |
| 5 | Absorbing Man | Male | blue | Human | No Hair | 193.0 | Marvel Comics | NaN | bad | 122.0 | ... | False | False | False | False | False | False | False | False | False | False |
| 6 | Adam Monroe | Male | blue | NaN | Blond | NaN | NBC - Heroes | NaN | good | NaN | ... | False | False | False | False | False | False | False | False | False | False |
| 8 | Agent Bob | Male | brown | Human | Brown | 178.0 | Marvel Comics | NaN | good | 81.0 | ... | False | False | False | False | False | False | False | False | False | False |
| 9 | Agent Zero | Male | NaN | NaN | NaN | 191.0 | Marvel Comics | NaN | good | 104.0 | ... | False | False | False | False | False | False | False | False | False | False |
| 10 | Air-Walker | Male | blue | NaN | White | 188.0 | Marvel Comics | NaN | bad | 108.0 | ... | False | False | False | False | False | False | False | False | False | False |
| 11 | Ajax | Male | brown | Cyborg | Black | 193.0 | Marvel Comics | NaN | bad | 90.0 | ... | False | False | False | False | False | False | False | False | False | False |
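Keeping only the Marvel and DC heroes has a sneaky trap in it, so here is a minimal sketch on a tiny made-up table called `toy` (an assumption for illustration, not the real data). Asking "is the publisher not Marvel **or** not DC?" is true for *every* hero, because nobody can belong to both at once. One way around the trap, sketched here with `isin`, is to ask for the rows we want to keep instead.

```python
import pandas as pd

# A tiny made-up ingredient (NOT the real heroes data).
toy = pd.DataFrame({
    "name": ["Hero A", "Hero B", "Hero C"],
    "Publisher": ["Marvel Comics", "DC Comics", "Dark Horse Comics"],
})

# The trap: every publisher differs from at least one of the two names,
# so this "or" test is True for every row and would drop all heroes.
not_marvel_or_not_dc = (toy["Publisher"] != "Marvel Comics") | (toy["Publisher"] != "DC Comics")
print(not_marvel_or_not_dc.all())  # True

# Safer: mark the rows whose publisher is one of the two we want, and keep them.
keep = toy["Publisher"].isin(["Marvel Comics", "DC Comics"])
marvel_and_dc = toy[keep]
print(marvel_and_dc["name"].tolist())  # ['Hero A', 'Hero B']
```

If you prefer to describe the rows to drop, the test needs "and" instead of "or": not Marvel **and** not DC.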
