### Data analysis with an example dataset - fun with Pokémon

In [None]:
import os
import pandas as pd
from IPython.display import Image

Pokémon is a media franchise managed by The Pokémon Company, a Japanese consortium between Nintendo, Game Freak, and Creatures. It now spans video games, trading card games, animated television shows and movies, comic books, and toys.

In [None]:
Image('http://cdn-static.denofgeek.com/sites/denofgeek/files/pokemon_4.jpg')

### 1. Read data into memory

Since the data is already on disk, we can read it into a pandas DataFrame, which is held in Python's memory. But are we in the same directory as the file? Let's check to be sure...

In [None]:
# prints current working directory full path
os.getcwd()

We are not in the same directory, so we can either change the working directory to the one that contains the dataset or read the file using a path to that directory. Let's try the latter.

In [None]:
# reads the csv file into a pandas DataFrame
poke = pd.read_csv('./Datasets/Pokemon.csv')
#poke=pd.read_csv('Pokemon.csv', sep=';')
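
As a minimal sketch of reading a file from another directory, the path can be built from the directory and file name and checked before reading. The temporary directory and the two-row CSV below are stand-ins invented for this illustration; they are not the real dataset.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the dataset directory, created only for this demo.
data_dir = tempfile.mkdtemp()
csv_path = os.path.join(data_dir, 'Pokemon.csv')

# Write a tiny two-row file so the example is self-contained.
with open(csv_path, 'w') as f:
    f.write('Name,Attack\nPikachu,55\nBulbasaur,49\n')

# Confirm the file exists, then read it using its full path.
assert os.path.exists(csv_path)
demo = pd.read_csv(csv_path)

print(demo.shape)
```

The alternative mentioned above, changing the working directory, would use `os.chdir(data_dir)` followed by `pd.read_csv('Pokemon.csv')`.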

In [None]:
# shows the first few rows of the table
poke.head()

**Q: What is a DataFrame in pandas?**

It's a collection of Series (columns) of the same length, each backed by a numpy array.

In [None]:
type(poke)

In [None]:
type(poke['Name'])

In [None]:
poke['Name'];

In [None]:
len(poke)

In [None]:
len(poke) == len(poke['Name'])
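
The same relationship can be checked on a small hand-made frame. This is only a sketch; the `toy` frame and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame invented for illustration (not the real dataset).
toy = pd.DataFrame({'Name': ['A', 'B', 'C'], 'Attack': [10, 20, 30]})

column = toy['Attack']        # selecting one column gives a Series
values = column.to_numpy()    # a Series exposes its data as a numpy array

print(isinstance(toy, pd.DataFrame))
print(isinstance(column, pd.Series))
print(isinstance(values, np.ndarray))
print(len(toy) == len(column))
```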

### 2. Explore dataset

What's in the dataset and what does it look like? In data exploration, we try to answer these questions.

In [None]:
# general information about columns
poke.info()

In [None]:
# summary statistics of numeric columns
poke.describe()

In [None]:
# How many null values are there?
poke.isnull().sum()

In [None]:
# How many legendary and common pokemon are there?
poke['Legendary'].sum()

Q: Can you tell how this number is calculated and whether it is correct?
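
One way to reason about the question, sketched on an invented boolean column: summing booleans counts the `True` values, so a plain `.sum()` gives only one of the two groups, while `value_counts()` reports both.

```python
import pandas as pd

# Toy boolean column invented for illustration.
legendary = pd.Series([False, True, False, False, True], name='Legendary')

n_legendary = int(legendary.sum())     # True counts as 1, False as 0
n_common = int((~legendary).sum())     # invert the flags to count the rest
counts = legendary.value_counts()      # both groups at once

print(n_legendary, n_common)
print(counts)
```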

#### Slicing and filtering - Is Pikachu in the dataset, and what kind of pokemon is he?

In [None]:
# boolean filtering
poke[poke['Name'] == 'Pikachu']

In [None]:
# subsetting with .loc and .isin()
poke[poke.loc[:,'Name'].isin(['Pikachu', 'Bulbasaur', 'Charmander', 'Squirtle'])]

In [None]:
# creating subset data frame
image_poke = poke[poke.loc[:,'Name'].isin(['Pikachu', 'Bulbasaur', 'Charmander', 'Squirtle'])]
image_poke
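
The cells above use `.loc` only. As a hedged sketch of how `.loc` (label-based) and `.iloc` (position-based) differ, here is a toy frame with a non-default index; all names and values are invented for illustration.

```python
import pandas as pd

# Toy frame invented for illustration, with row labels that differ from positions.
toy = pd.DataFrame(
    {'Name': ['A', 'B', 'C', 'D'], 'Attack': [10, 20, 30, 40]},
    index=[100, 101, 102, 103],
)

by_label = toy.loc[101, 'Attack']                 # row label 101, column label 'Attack'
by_position = toy.iloc[1, 1]                      # second row, second column
mask_rows = toy.loc[toy['Attack'] > 20, 'Name']   # boolean mask combined with a column label
first_two = toy.iloc[:2]                          # first two rows by position

print(by_label, by_position)
print(list(mask_rows))
```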

Let's make a simple plot.

In [None]:
# how to make plot with pandas and what arguments to pass?
image_poke.plot??
# two question marks show the docstring and, where available, the source code; the docstring is the same text as in the pandas online documentation.

In [None]:
%matplotlib inline 
# this draws it in a cell
# barplot that compares attack of chosen pokemon 
image_poke.plot.bar(x='Name', y='Attack', color=['green', 'red', 'blue', 'yellow'], title='Attack Comparison')

### 3. Clean data

Can we work with the dataset as it is, or do we need to make some adjustments? Filling or removing null values, creating new columns with calculated values, deleting redundant columns, removing incomplete rows, creating relevant subsets of the dataset, converting datatypes, renaming columns... These are all part of the data cleaning step that is required before we can analyze the data further.
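
A few of these steps, sketched on an invented frame. The column names (`Type 2`, `Attack`, `Unused`) and the choice of the string `'None'` as a fill value are assumptions made for the example, not a prescription for the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame invented for illustration.
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Type 2': ['Poison', np.nan, 'Flying'],   # contains a null value
    'Attack': ['49', '52', '45'],             # numbers stored as text
    'Unused': [0, 0, 0],                      # redundant column
})

cleaned = toy.copy()
cleaned['Type 2'] = cleaned['Type 2'].fillna('None')     # fill null values
cleaned['Attack'] = cleaned['Attack'].astype(int)        # convert the datatype
cleaned = cleaned.drop(columns=['Unused'])               # delete a redundant column
cleaned = cleaned.rename(columns={'Type 2': 'Type_2'})   # rename a column

print(cleaned)
```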

Renaming columns

In [None]:
poke.columns
poke = poke.rename(columns={'#':'Number'})

Subset containing only common (non-legendary) pokemon.

In [None]:
only_common = # fill in poke[poke[]]

Subset of pokemon that don't contain 'Mega' in their name.

In [None]:
# Finish subset of DataFrame using condition so that only pokemon that don't have Mega in their name are selected
no_mega = # finish the subset ['Name'].str.contains('Mega')
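
One possible pattern for both exercises is to negate a boolean mask with `~`. The sketch below uses an invented frame; the name format `'VenusaurMega Venusaur'` is an assumption about how Mega forms appear in the data.

```python
import pandas as pd

# Toy frame invented for illustration.
toy = pd.DataFrame({
    'Name': ['Venusaur', 'VenusaurMega Venusaur', 'Mewtwo', 'Meganium'],
    'Legendary': [False, False, True, False],
})

# Keep rows where the flag is False.
only_common_toy = toy[~toy['Legendary']]

# Keep rows whose name does not contain the substring 'Mega'.
no_mega_toy = toy[~toy['Name'].str.contains('Mega')]

print(list(only_common_toy['Name']))
print(list(no_mega_toy['Name']))
```

Note that a plain substring match also removes names such as `'Meganium'`, so the result is worth inspecting before relying on it.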

**Group by** operation to aggregate data.

In [None]:
# check whether there is only one pokemon for every number
# (the '#' column was renamed to 'Number' above)
poke['Number'].groupby(poke['Number']).count().sort_values(ascending=False);
# the trailing ';' suppresses the long output of this cell

Removing Mega pokemon wasn't enough. Let's treat all pokemon with the same number as duplicates, drop them, and keep only the first one.

In [None]:
# Finish the subset
# .drop_duplicates('Number', keep='first', inplace=False)
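
A small sketch of how counting per number and `drop_duplicates` behave, using an invented frame. The numbers and names are illustrative only.

```python
import pandas as pd

# Toy frame invented for illustration: several rows share a number.
toy = pd.DataFrame({
    'Number': [3, 3, 6, 6, 6, 25],
    'Name': ['Venusaur', 'VenusaurMega Venusaur', 'Charizard',
             'CharizardMega Charizard X', 'CharizardMega Charizard Y', 'Pikachu'],
})

# Count rows per number; anything above 1 is a duplicate.
per_number = toy.groupby('Number')['Name'].count().sort_values(ascending=False)

# Keep only the first row for each number; the original frame is not modified.
unique_toy = toy.drop_duplicates('Number', keep='first')

print(per_number)
print(list(unique_toy['Name']))
```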

Now that we have the desired, clean dataset, let's move on to the next step.

### 4. Clean Data Processing

In this step, we are ready to answer our questions with the dataset. If we don't have any specific questions, we are just doing exploratory data analysis: looking at what is inside the data.

So here are some questions:
- Which pokemon type is the most frequent?
- Which pokemon type is the strongest? (using total stats)
- Which pokemon type is the strongest on average?
- Which pokemon generation has the biggest total stats?
- How strong is Pikachu among pokemon of the same type?
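
The questions above mostly reduce to `value_counts` and `groupby` aggregations. Here is a hedged sketch on an invented frame; the column names `Type 1`, `Total` and `Generation` are assumptions about the dataset, and pokemon `'A'` stands in for Pikachu.

```python
import pandas as pd

# Toy frame invented for illustration; column names are assumptions.
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Type 1': ['Fire', 'Water', 'Fire', 'Grass', 'Fire', 'Water'],
    'Total': [300, 500, 400, 350, 300, 450],
    'Generation': [1, 1, 1, 2, 2, 2],
})

most_frequent_type = toy['Type 1'].value_counts().idxmax()       # most frequent type
total_by_type = toy.groupby('Type 1')['Total'].sum()             # summed stats per type
mean_by_type = toy.groupby('Type 1')['Total'].mean()             # average stats per type
total_by_generation = toy.groupby('Generation')['Total'].sum()   # summed stats per generation

# Rank one pokemon among those of the same type (1 = strongest).
same_type = toy[toy['Type 1'] == 'Fire']
rank_in_type = same_type['Total'].rank(ascending=False, method='min')
rank_of_a = int(rank_in_type[same_type['Name'] == 'A'].iloc[0])

print(most_frequent_type, total_by_type.idxmax(), mean_by_type.idxmax())
print(total_by_generation.idxmax(), rank_of_a)
```

In this toy data the type with the largest summed stats differs from the type with the largest average, which is why the two questions are worth asking separately.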

### 5. Results visualization

- Create a histogram of all common pokemon's total stats
- Create a boxplot of total stats by type
- Create a boxplot of total stats by generation
- Show Pikachu's total stats among other pokemon of the same type
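
A minimal plotting sketch for the four tasks, again on an invented frame. The column names are assumptions, and pokemon `'C'` is a hypothetical stand-in for Pikachu.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy frame invented for illustration; column names are assumptions.
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Type 1': ['Fire', 'Water', 'Fire', 'Grass', 'Fire', 'Water'],
    'Total': [300, 500, 400, 350, 300, 450],
    'Generation': [1, 1, 1, 2, 2, 2],
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Histogram of total stats.
toy['Total'].plot.hist(bins=5, ax=axes[0, 0], title='Total stats')

# Boxplots of total stats grouped by type and by generation.
toy.boxplot(column='Total', by='Type 1', ax=axes[0, 1])
toy.boxplot(column='Total', by='Generation', ax=axes[1, 0])

# Highlight one pokemon ('C') among those of the same type.
same_type = toy[toy['Type 1'] == 'Fire']
colors = ['red' if name == 'C' else 'grey' for name in same_type['Name']]
axes[1, 1].bar(same_type['Name'], same_type['Total'], color=colors)
axes[1, 1].set_title('Total stats within one type')

fig.tight_layout()
```

In a notebook with `%matplotlib inline` active, the figure is drawn below the cell.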