*Disclaimer*: This notebook was created as part of **Data Analysis with pandas**, which I taught in Prague in September 2017.
My solutions are in notebook version 3, but try working through it yourself first :)

# Data analysis with an example dataset - fun with Pokémon

In [None]:
import os
import numpy as np
import pandas as pd
from IPython.display import Image
from matplotlib import pyplot as plt

In [None]:
%matplotlib inline 
pd.options.mode.chained_assignment = None  # default='warn'

Pokémon is a media franchise managed by The Pokémon Company, a Japanese consortium between Nintendo, Game Freak, and Creatures. It now spans video games, trading card games, animated television shows and movies, comic books, and toys.

In [None]:
Image('http://cdn-static.denofgeek.com/sites/denofgeek/files/pokemon_4.jpg')

### 1. Read data into memory

Given that the data is ready on disk, we can read it into a pandas DataFrame, which is stored in Python's memory. But are we in the right directory? Let's check to be sure...

In [None]:
# prints current working directory full path
os.getcwd()
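
If you are unsure whether the relative path will resolve from the current working directory, a small check like the one below can help. It is only a sketch: it reuses the path from this notebook, and the printed messages are my own wording.

```python
import os

data_path = '../input/Pokemon.csv'  # relative path used in this notebook

# resolve the relative path against the current working directory
full_path = os.path.abspath(data_path)

# check before reading, so a wrong directory gives a clear message
if os.path.exists(full_path):
    print('Found data at', full_path)
else:
    print('No file at', full_path, '- compare it with os.getcwd()')
```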

In [None]:
# reads csv file into a DataFrame
poke = pd.read_csv('../input/Pokemon.csv')

In [None]:
# shows first few rows of table
poke.head()

**Q: What is a DataFrame in pandas?**

It's a collection of Series (columns) of the same length that share one index. The values of each Series are stored in a NumPy array.

In [None]:
# data frame
type(poke)

In [None]:
# series
type(poke['Name'])

In [None]:
# numpy array
type(poke['Name'].values)

In [None]:
# basic python/numpy data type - str, int, float...
type(poke['Name'].values[0])

In [None]:
poke['Name'].head()

In [None]:
len(poke)

In [None]:
len(poke) == len(poke['Name'])

### 2. Explore dataset

What's in the dataset and what does it look like? In data exploration, we try to answer these questions.

In [None]:
# general information about columns
poke.info()

In [None]:
# summary statistics of numeric columns
poke.describe()

In [None]:
# How many null values are there?
poke.isnull().sum()
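
To see what `isnull().sum()` does, here is a minimal sketch on a tiny made-up table. The `toy` frame and all of its values are invented for illustration and are not taken from the Pokémon dataset.

```python
import numpy as np
import pandas as pd

# toy data invented for illustration
toy = pd.DataFrame({
    'Name': ['Sparky', 'Leafy', 'Bubbles'],
    'Type 2': ['Flying', np.nan, np.nan],
    'Total': [320, 318, 314],
})

# True where a value is missing, False otherwise
missing_mask = toy.isnull()

# summing the boolean mask counts the missing values in each column
missing_per_column = missing_mask.sum()
print(missing_per_column)
```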

In [None]:
# How many legendary and common pokemon are there?
poke['Legendary'].sum()

Q: Can you tell how the number is calculated and whether it is correct?

#### Slicing and filtering - Is there a Pikachu and what kind of pokemon is he?

In [None]:
# boolean filtering
poke[poke['Name'] == 'Pikachu']

In [None]:
# subsetting with .loc and .isin
poke[poke.loc[:,'Name'].isin(['Pikachu', 'Bulbasaur', 'Charmander', 'Squirtle'])]

In [None]:
# creating subset data frame
image_poke = poke[poke.loc[:,'Name'].isin(['Pikachu', 'Bulbasaur', 'Charmander', 'Squirtle'])]
image_poke
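
The cells above use boolean filtering, `.loc` and `.isin`. The sketch below puts them side by side with `.iloc` on a small made-up table, so you can compare label-based and position-based selection. The `toy` frame, its names and its numbers are invented for illustration.

```python
import pandas as pd

# toy data invented for illustration; the index labels are deliberately not 0..n-1
toy = pd.DataFrame(
    {'Name': ['Sparky', 'Leafy', 'Bubbles', 'Rocky'],
     'Attack': [55, 49, 48, 80]},
    index=[10, 20, 30, 40],
)

# boolean filtering: keep rows where the condition is True
strong = toy[toy['Attack'] > 50]

# .loc selects by index label and column name
by_label = toy.loc[20, 'Name']

# .iloc selects by integer position (row 1 is the second row, column 0 is 'Name')
by_position = toy.iloc[1, 0]

# .isin builds a boolean mask from a list of allowed values
chosen = toy[toy['Name'].isin(['Leafy', 'Rocky'])]
```

Here `by_label` and `by_position` point at the same cell, once through the label `20` and once through the position `1`.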

Let's make a simple plot.

In [None]:
# how to make plot with pandas and what arguments to pass?
image_poke.plot??
# two question marks show the full docstring (the same text as on the pandas documentation webpage), plus the source code when available

In [None]:
# this draws it in a cell
# barplot that compares attack of chosen pokemon 
image_poke.plot.bar(x='Name', y='Attack', color=['green', 'red', 'blue', 'yellow'], title='Attack Comparison')

### 3. Clean data

Can we work with the dataset as it is, or do we need to make some adjustments? Filling/removing null values, creating new columns with calculated values, deleting redundant columns, removing incomplete rows, creating relevant subsets of the dataset, converting datatypes, renaming columns... These are all part of the data cleaning step that is required before we can analyze the data further.
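
To make these cleaning operations concrete, here is a minimal sketch that chains several of them on a tiny made-up table. The `raw` frame, its values and the `AttackPlusDefense` column are invented for illustration; the real dataset may need different steps.

```python
import numpy as np
import pandas as pd

# toy data invented for illustration
raw = pd.DataFrame({
    '#': ['1', '2', '3'],
    'Name': ['Sparky', 'Leafy', None],
    'Type 2': ['Flying', np.nan, 'Rock'],
    'Attack': [55, 49, 48],
    'Defense': [40, 49, 65],
})

clean = (
    raw.rename(columns={'#': 'Number'})   # rename a column
       .dropna(subset=['Name'])           # remove rows without a name
       .fillna({'Type 2': 'None'})        # fill the missing secondary type
       .astype({'Number': int})           # convert text to integers
)

# new column calculated from existing ones
clean['AttackPlusDefense'] = clean['Attack'] + clean['Defense']
```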

Renaming columns

In [None]:
poke = poke.rename(columns={'#':'Number'})
poke.columns

Subset of pokemon that are only common.

In [None]:
# Only common means non-legendary :) so we negate the boolean column
only_common = poke[~poke['Legendary']]
only_common

Subset of pokemon that don't contain 'Mega' in their name.

In [None]:
# Finish the subset of the DataFrame so that only pokemon without 'Mega' in their name are selected (hint: ~)
no_mega = only_common['Name'].str.contains('Mega ')
no_mega.head()

**Group by** operation to aggregate data.

In [None]:
# check if there is only 1 pokemon for every number
poke['Number'].groupby(poke['Number']).count().sort_values(ascending=False)
# the output is long; to hide the output of a cell, add ';' at the end of its last command

Removing mega wasn't enough. Let's consider all pokemon with the same number as duplicates, drop them and keep only the first one.

In [None]:
# Finish the subset
nodup_poke =  poke.drop_duplicates('Number', keep='first', inplace=False)
nodup_poke.head()

In [None]:
# Reindex the dataframe, so the first index is 0 and the last is n-1
nodup_poke.reset_index(inplace=True, drop=True)
nodup_poke.head()
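
Here is a minimal sketch of what `drop_duplicates` and `reset_index` do, shown on a tiny made-up table where one number appears twice. The `toy` frame and its names are invented for illustration.

```python
import pandas as pd

# toy data invented for illustration: number 3 appears twice
toy = pd.DataFrame({
    'Number': [1, 2, 3, 3, 4],
    'Name': ['Sparky', 'Leafy', 'Rocky', 'Mega Rocky', 'Bubbles'],
})

# mark every repeated 'Number' except its first occurrence
is_repeat = toy.duplicated('Number', keep='first')

# drop the repeats; the remaining rows keep their old index labels (0, 1, 2, 4)
deduped = toy.drop_duplicates('Number', keep='first')

# renumber the rows from 0 to n-1 and discard the old index
deduped = deduped.reset_index(drop=True)
```

Without `drop=True`, the old index would be kept as an extra column called `index`.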

Now that we have the desired, clean dataset, let's move on to the next step.

### 4. Clean Data Processing

In this step, we are ready to answer our questions with our dataset. If we don't have any specific questions, we are just doing exploratory data analysis - looking at what is inside the data.

So here are some questions:
- Which pokemon type is the most frequent?
- Which pokemon type is the strongest and which the weakest? (according to total stats)
- What are the 5 strongest pokemon among the common pokemon?
- Which pokemon generation has the biggest average total stats?
- How strong is Pikachu among pokemon of the same type?

In [None]:
# How many unique pokemon are there in the remaining dataset?
len(nodup_poke['Number'].unique())

In [None]:
# Which pokemon type is the strongest and which the weakest on average? (according to total stats)
strongest_type_avg = nodup_poke[['Type 1', 'Total']].groupby('Type 1').mean().sort_values(by='Total', ascending=False)
strongest_type_avg

What happened in the previous line of code? Let's look:
1. Select the subset of columns you need for the result - **nodup_poke[['Type 1', 'Total']]**
2. Choose the column to group the table by. Grouping puts all rows that share the same value in the selected column into one group. - **groupby('Type 1')**
3. Each group now holds many values, so we reduce them to a single value with an aggregation function. Typical examples are *count, mean, sum, max* - **mean()**
4. The last step sorts the result by value in descending order - **sort_values(by='Total', ascending=False)**
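
The four steps above can also be written one per line, which makes it easier to inspect each intermediate result. This is a sketch on a tiny made-up table: the `toy` frame and its values are invented for illustration, and here the grouping column is passed by name.

```python
import pandas as pd

# toy data invented for illustration
toy = pd.DataFrame({
    'Type 1': ['Fire', 'Water', 'Fire', 'Water', 'Grass'],
    'Total': [300, 520, 500, 320, 450],
})

selected = toy[['Type 1', 'Total']]     # 1. select the columns
grouped = selected.groupby('Type 1')    # 2. group rows by type
averaged = grouped.mean()               # 3. aggregate each group
ranked = averaged.sort_values(by='Total', ascending=False)  # 4. sort
print(ranked)
```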

Try the same logic in the upcoming examples:

In [None]:
# Find the strongest pokemon in each group based on Type 1 and order them alphabetically from A-Z.
# Notice, this one has different logic - we do sorting first and only then group by and take the first observation for every group.
strongest_type_max = nodup_poke[['Total', 'Name', 'Type 1']].sort_values(
                                by='Total', ascending=False).groupby('Type 1', as_index=False).first()
strongest_type_max 
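
The sort-then-group approach relies on `groupby` keeping the row order inside each group. As a comparison, the sketch below shows it next to an alternative based on `idxmax`, which looks up the row with the maximum directly. The `toy` frame and its values are invented for illustration, and the `idxmax` variant is my own addition rather than part of the course material.

```python
import pandas as pd

# toy data invented for illustration
toy = pd.DataFrame({
    'Name': ['Sparky', 'Blaze', 'Bubbles', 'Wave'],
    'Type 1': ['Fire', 'Fire', 'Water', 'Water'],
    'Total': [300, 500, 520, 320],
})

# approach from the cell above: sort first, then take the first row per group
by_sorting = (
    toy.sort_values(by='Total', ascending=False)
       .groupby('Type 1', as_index=False)
       .first()
)

# alternative: find the index label of the maximum per group, then look it up
best_rows = toy.groupby('Type 1')['Total'].idxmax()
by_idxmax = toy.loc[best_rows].sort_values(by='Type 1').reset_index(drop=True)
```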

In [None]:
# Which pokemon type is the most frequent?
type_frequency = nodup_poke['Type 1'].# fill in the rest, sort from largest to smallest
type_frequency

In [None]:
# What are the 5 strongest pokemon among the common pokemon?
top5_poke = nodup_poke[['Name', 'Total']].sort_values(by='Total', ascending=False)[0:5]
top5_poke

In [None]:
# Which pokemon generation has the biggest average total stats?
generation_comparison = # generation and total columns. groupby column. aggregation function. sort function by value
generation_comparison

In [None]:
# What type is pikachu?
pikachu_type = nodup_poke['Type 1'][nodup_poke['Name'] == 'Pikachu'].values[0]
pikachu_type

In [None]:
# How strong is Pikachu among pokemon of the same type? Hint: debugging
pikachu_rank = nodup_poke[nodup_poke['Type 1'] == pikachu_type].sort_values();
pikachu_rank;

In [None]:
# reset index of table to start from 0 to n-1
pikachu_rank.reset_index(inplace=True)
pikachu_rank

### 5. Results visualization

- Create a histogram of all common pokemon's total stats
- Create a boxplot of total stats by type
- Create a boxplot of total stats by generation
- Create a barplot of total stats by generation
- Show Pikachu's total stats among other pokemon of the same type together with generation of pokemon

In [None]:
# Create a histogram of all common pokemon's total stats
# fill in missing .plot.hist(bins=15)

In [None]:
import matplotlib.cm as cm  # these would normally be at the beginning of the notebook

# colormaps https://matplotlib.org/examples/color/colormaps_reference.html
type_colors = cm.spring(np.linspace(0.05,0.95,len(type_frequency)))
type_frequency.plot.bar(color=type_colors)
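
If it is unclear what `cm.spring(np.linspace(...))` returns, the sketch below runs it on its own. The number of bars is a made-up value for illustration.

```python
import numpy as np
from matplotlib import cm

n_bars = 4  # hypothetical number of bars

# evenly spaced positions inside the colormap's 0-1 range
positions = np.linspace(0.05, 0.95, n_bars)

# calling a colormap on an array returns one RGBA row per position
colors = cm.spring(positions)
print(colors.shape)
```

Each row holds red, green, blue and alpha values between 0 and 1, so the array can be passed straight to the `color` argument of a bar plot.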

In [None]:
# Create a boxplot of average total stats by type. Hint: use tables we created in previous step
nodup_poke[nodup_poke['Type 1'].isin(['Bug', 'Water', 'Grass', 'Poison'])
          ].boxplot(column='Total', by='Type 1')

In [None]:
# Create a boxplot of total stats by generation
# fill in missing .boxplot(column='Total', by='Generation')

In [None]:
# Create a barplot of total stats by generation
generation_comparison['Generation'] = generation_comparison['Generation'].astype('str')
gen_colors = cm.summer(np.linspace(0.05,0.95,len(generation_comparison)))
generation_comparison.plot.bar(x='Generation',color=gen_colors, title='Comparison of avg total stats between pokemon generations')

In [None]:
# Create a barplot of strongest pokemon in each Type 1. One color for all is fine.
# fill in all :) 

In [None]:
# Show Pikachu's total stats among other pokemon of the same type together with generation of pokemon
pikachu_rank['color'] = 'Grey'
pikachu_rank['size'] = 30
# assign with .loc instead of chained indexing, so the original table is modified
is_pikachu = pikachu_rank['Name'] == 'Pikachu'
pikachu_rank.loc[is_pikachu, 'color'] = 'Yellow'
pikachu_rank.loc[is_pikachu, 'size'] = 100
pika_plot = pikachu_rank.plot.scatter(x='Generation', y='Total', 
                                      c=pikachu_rank['color'], s=pikachu_rank['size'],
                          title='Pikachu vs other electric pokemon')

Save the plot as a PNG file

In [None]:
fig = pika_plot.get_figure()
fig.savefig('pika_plot.png')

## Summary
This notebook introduces basic operations used in the process of data analysis.
If you liked the format, have a look at my Kickstarter kernel too, which has more advanced material and more exercises :)