# Pandas
This is a short tutorial on basic and useful functionality of the [pandas library](https://pandas.pydata.org/docs/reference/index.html).
## What
Scraping a list of pokemon from the internet and manipulating the data.
## Why 
To understand pandas basics including the use of:
* [pandas.read_html](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html?highlight=read_html#pandas.read_html)

### Importing Dependencies
We are using the pandas library, so we will import it as pd.  It can also be imported as just pandas, but pd is the common convention and is shorter to type.

In [1]:
import pandas as pd

### Scraping our data
The [pandas.read_html](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html?highlight=read_html#pandas.read_html) function retrieves the HTML of a webpage and looks through it for HTML tables.  The documentation tells us that it returns a [list](https://www.w3schools.com/python/python_arrays.asp) of dataframes, one per table found, which we can choose from.  
All we need to do is pass it the URL of the page we want.

In [2]:
url = 'https://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_base_stats_(Generation_I)'
dfs = pd.read_html(url)

HTTPError: HTTP Error 403: Forbidden

### Okay...
That didn't go as expected.  But it did give us an error message to work with.  So, I'm just going to google something like [pandas read_html forbidden](https://www.google.com/search?q=pandas+read_html+forbidden&rlz=1C1SQJL_enUS861US861&oq=pandas+read_html+for&aqs=chrome.0.0j69i57j0i22i30l8.5423j0j1&sourceid=chrome&ie=UTF-8).  
As with most problems, we aren't the first to run into this one.  The [first link](https://stackoverflow.com/questions/43590153/http-error-403-forbidden-when-reading-html) looks like exactly what we are experiencing.  So I'll just follow the upvoted answer.

In [3]:
# import requests, a well-maintained and well-documented library for handling HTTP requests in Python
import requests

# create a dictionary called header.  It will be passed to the get function as the headers argument
# so that the request looks like it comes from a browser.
header = {
  'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36',
  'X-Requested-With': 'XMLHttpRequest'
}
# requests.get() returns a Response object whose details are available through attributes/properties and methods.
# You can tell a method call from an attribute/property because a method call ends with ()
# and an attribute/property doesn't.
r = requests.get(url, headers=header)

In [4]:
# just looking at r we see a response of 200.  This means our request was successful.
r

<Response [200]>
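In a script, nobody is there to eyeball the response, so it helps to check the status in code.  Here is a minimal sketch of that idea.  The two Response objects are built by hand purely for illustration (an assumption on my part so the example runs without a network connection); in the tutorial, `r` comes from `requests.get(url, headers=header)`.

```python
import requests

# Hand-built Response objects, for illustration only.
# Normally a Response comes back from requests.get(...).
ok_response = requests.Response()
ok_response.status_code = 200

blocked_response = requests.Response()
blocked_response.status_code = 403

# .ok is an attribute/property (no parentheses): True when the status is below 400
print(ok_response.ok)
print(blocked_response.ok)

# .raise_for_status() is a method (with parentheses): it raises on 4xx/5xx statuses
try:
    blocked_response.raise_for_status()
except requests.HTTPError as err:
    print('request failed:', err)
```

This also shows the attribute versus method difference mentioned in the comments above: `ok` has no parentheses, `raise_for_status()` does.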

In [5]:
# looking at the text property we get the entire html text of the web page.  This will work!
r.text

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of Pokémon by base stats (Generation I) - Bulbapedia, the community-driven Pokémon encyclopedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_Pokémon_by_base_stats_(Generation_I)","wgTitle":"List of Pokémon by base stats (Generation I)","wgCurRevisionId":3160901,"wgRevisionId":3160901,"wgArticleId":97802,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Lists of Pokémon","Lists"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaul

In [6]:
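# Wrap the HTML text in StringIO: recent pandas versions deprecate passing a raw HTML string to read_html
from io import StringIO
dfs = pd.read_html(StringIO(r.text))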
dfs

[       #  Unnamed: 1     Pokémon   HP  Attack  Defense  Speed  Special  Total  \
 0      1         NaN   Bulbasaur   45      49       49     45       65    253   
 1      2         NaN     Ivysaur   60      62       63     60       80    325   
 2      3         NaN    Venusaur   80      82       83     80      100    425   
 3      4         NaN  Charmander   39      52       43     65       50    249   
 4      5         NaN  Charmeleon   58      64       58     80       65    325   
 5      6         NaN   Charizard   78      84       78    100       85    425   
 6      7         NaN    Squirtle   44      48       65     43       50    250   
 7      8         NaN   Wartortle   59      63       80     58       65    325   
 8      9         NaN   Blastoise   79      83      100     78       85    425   
 9     10         NaN    Caterpie   45      30       35     45       20    175   
 10    11         NaN     Metapod   50      20       55     30       25    180   
 11    12       

### Back on Track
Like the documentation said, we got a list of dataframes back.  Scrolling through them, we can see that the first item in the list has the information we want.
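When the list is long, scrolling gets tedious.  A minimal sketch of summarising each table instead is below.  The two small frames are made-up stand-ins for whatever `pd.read_html` returned (the `tables` name and the second "navigation box" table are hypothetical), not the real Bulbapedia tables.

```python
import pandas as pd

# Hypothetical stand-in for the list returned by pd.read_html
tables = [
    pd.DataFrame({
        '#': [1, 2, 3],
        'Pokémon': ['Bulbasaur', 'Ivysaur', 'Venusaur'],
        'HP': [45, 60, 80],
    }),
    pd.DataFrame({'note': ['navigation box']}),
]

# Collect the position, shape and column names of every table
summary = [(i, df.shape, list(df.columns)) for i, df in enumerate(tables)]
for i, shape, columns in summary:
    print(i, shape, columns)
```

The table with the expected column names and the most rows is usually the one you are after.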

In [7]:
# assign the first element of the returned list to a variable
# and display the first 10 rows so we know what we are working with
pokemon = dfs[0]
pokemon.head(10)

    #  Unnamed: 1     Pokémon  HP  Attack  Defense  Speed  Special  Total  Average
0   1         NaN   Bulbasaur  45      49       49     45       65    253     50.6
1   2         NaN     Ivysaur  60      62       63     60       80    325     65.0
2   3         NaN    Venusaur  80      82       83     80      100    425     85.0
3   4         NaN  Charmander  39      52       43     65       50    249     49.8
4   5         NaN  Charmeleon  58      64       58     80       65    325     65.0
5   6         NaN   Charizard  78      84       78    100       85    425     85.0
6   7         NaN    Squirtle  44      48       65     43       50    250     50.0
7   8         NaN   Wartortle  59      63       80     58       65    325     65.0
8   9         NaN   Blastoise  79      83      100     78       85    425     85.0
9  10         NaN    Caterpie  45      30       35     45       20    175     35.0


In [8]:
# a useful property of dataframes is dtypes.  You can't add an integer to a string, so it's useful
# to know what we are working with beforehand.
# NOTICE: dtypes is an attribute, so unlike head() it is written without parentheses
pokemon.dtypes

#               int64
Unnamed: 1    float64
Pokémon        object
HP              int64
Attack          int64
Defense         int64
Speed           int64
Special         int64
Total           int64
Average       float64
dtype: object
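Our scrape came back with numeric columns already, but that isn't guaranteed.  Here is a small sketch of what to do when numbers arrive as text.  The `raw` frame is made-up sample data (hypothetical, not from the scrape), and `pd.to_numeric` is just one of several ways to convert.

```python
import pandas as pd

# Made-up sample where the HP numbers arrived as text (hypothetical data)
raw = pd.DataFrame({'Pokemon': ['Bulbasaur', 'Ivysaur'], 'HP': ['45', '60']})
print(raw.dtypes)  # HP is not reported as an integer column here

# Convert the text column to numbers before doing arithmetic
raw['HP'] = pd.to_numeric(raw['HP'])
print(raw.dtypes)

# Now arithmetic behaves like arithmetic
boosted = raw['HP'] + 10
print(boosted.tolist())
```

Checking dtypes first tells you whether a conversion step like this is needed at all.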

### Removing Bad Data
Scraping from the web works pretty well but isn't always perfect.  Let's get rid of the unnamed column (<code>Unnamed: 1</code>).  There are many ways you can do this, but I'm going to use [DataFrame.drop()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html).  
Since our table of pokemon is a dataframe, we can use dataframe methods on it.  The documentation tells us there is a keyword argument called "inplace" that defaults to <code>False</code> if not specified.  With the default, drop returns a modified copy and leaves the original dataframe untouched.  If it were <code>True</code>, it would mutate the original dataframe and return nothing.  
How to use this depends on what you are doing with your data.  Sometimes it may be nice to keep your old dataframe and assign the result to a new dataframe.  I'm going to pretend like I may want to go back to my old dataframe at some point and create a new one.  I'm also going to remove some of the other columns so I can rework them.
If I were dropping only one column it would look like this:  
<code>pokemon.drop('Unnamed: 1', axis=1)</code>  
The <code>axis=1</code> argument tells drop to look for the label among the columns; without it, drop looks for a row with that label.  
But because we are dropping multiple columns we need to pass a list of the column names:  
<code>pokemon.drop(['Unnamed: 1', 'Total', 'Average'], axis=1)</code>

In [None]:
poke_clean = pokemon.drop(['Unnamed: 1', 'Total', 'Average'], axis=1)
poke_clean.head()
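To see the copy-versus-original behaviour for yourself, here is a quick sketch on a made-up miniature of the scraped table (the `df` and `trimmed` names are just for this example).  It uses the `columns=` keyword, which is another way of saying "these labels, with axis=1".

```python
import pandas as pd

# Made-up miniature of the scraped table (hypothetical data)
df = pd.DataFrame({'#': [1, 2], 'Unnamed: 1': [None, None], 'HP': [45, 60]})

# columns= is equivalent to passing the labels together with axis=1
trimmed = df.drop(columns=['Unnamed: 1'])

print(list(trimmed.columns))  # the copy no longer has the column
print(list(df.columns))       # the original still has it
```

Because inplace was left at its default, `df` is unchanged and only `trimmed` lost the column.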

### Common DataFrame Functionality
We dropped the <code>Total</code> and <code>Average</code> columns above so that we could add them back ourselves!  
<i>Very Efficient.</i>  
Dataframe columns work a lot like SQL columns and we can create new columns easily.

In [None]:
# Add all of the stat columns together to get a total
poke_clean['Total'] = poke_clean['HP'] + poke_clean['Attack'] + poke_clean['Defense'] + poke_clean['Speed'] + poke_clean['Special']

# Divide the Total column by 5 to get the Average
poke_clean['Average'] = poke_clean['Total'] / 5

# Take a look to make sure it's all good!
poke_clean.head(10)
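Adding the columns by hand works fine for five stats, but it gets long quickly.  As an alternative sketch, you can select the stat columns with a list and let pandas sum or average across each row.  The `stats` frame and `stat_cols` list are names I made up for this example; the numbers are the Bulbasaur and Charmander rows from the table above.

```python
import pandas as pd

# Two rows copied from the scraped table, without Total and Average
stats = pd.DataFrame({
    'Pokémon': ['Bulbasaur', 'Charmander'],
    'HP': [45, 39],
    'Attack': [49, 52],
    'Defense': [49, 43],
    'Speed': [45, 65],
    'Special': [65, 50],
})

stat_cols = ['HP', 'Attack', 'Defense', 'Speed', 'Special']

# axis=1 means "work across the columns of each row"
stats['Total'] = stats[stat_cols].sum(axis=1)
stats['Average'] = stats[stat_cols].mean(axis=1)

print(stats[['Pokémon', 'Total', 'Average']])
```

The totals should match the Total column that came with the original scrape.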

Making new dataframes from other dataframes is simple.  
Just select the columns you want and assign them to a new variable!  
  
If you are copying more than one column, you need to pass a list of the column names.

In [None]:
# renaming the Pokémon column because the accented é is annoying to type
poke_clean = poke_clean.rename(columns={'Pokémon': 'Pokemon'})

# making a pokemon total dataframe
poke_total = poke_clean[['Pokemon', 'Total']]

# making a pokemon average dataframe
poke_avg = poke_clean[['Pokemon', 'Average']]

Just like in SQL, we can [join](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html) two dataframes together.  
There is also a [merge](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html) method that works in a similar way.  By default, join matches rows on the index while merge matches on columns.  Which one you use is up to you; just follow the documentation.

In [None]:
poke_total_avg = poke_total.merge(poke_avg, on='Pokemon')
poke_total_avg.head(10)
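For comparison, here is a rough sketch of the same result using both methods.  The `totals` and `averages` frames are tiny made-up stand-ins for the two dataframes built above, with the rows deliberately in a different order to show that the matching is done by name and not by position.

```python
import pandas as pd

# Tiny stand-ins for the total and average dataframes (illustrative only)
totals = pd.DataFrame({'Pokemon': ['Bulbasaur', 'Ivysaur'], 'Total': [253, 325]})
averages = pd.DataFrame({'Pokemon': ['Ivysaur', 'Bulbasaur'], 'Average': [65.0, 50.6]})

# merge matches rows on a shared column
merged = totals.merge(averages, on='Pokemon')

# join matches rows on the index, so move Pokemon into the index first
joined = (
    totals.set_index('Pokemon')
    .join(averages.set_index('Pokemon'))
    .reset_index()
)

print(merged)
print(joined)
```

Both approaches line up each Pokemon with its own average; merge is usually less typing when the key is an ordinary column.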

### Using Custom Made Functions on Dataframe Columns
When you need something more custom, you can write a function and then [apply](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html) it to each row of the dataframe (<code>axis=1</code>) to get the result.  
The Pokemon table didn't come with type information.  So I'm going to create my own "type", which will be either even or odd.  
The custom function takes the [modulo](https://realpython.com/python-modulo-operator/) of the Pokemon number and two.  If the remainder is 0, the number is even; if not, it's odd. 

In [None]:
def even_or_odd(row):
    if row['#'] % 2 == 0:
        return 'even'
    else:
        return 'odd'

poke_clean['type'] = poke_clean.apply(even_or_odd, axis=1)
poke_clean.head()
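A row-by-row apply is the most flexible option, but for something this simple you can also work on the whole column at once.  This is an alternative sketch, not a replacement for the approach above; the `numbers` frame is made up for the example.

```python
import pandas as pd

# Made-up frame holding only the Pokemon numbers
numbers = pd.DataFrame({'#': [1, 2, 3, 4]})

# Compute the remainder for the whole column, then map each remainder to a label
numbers['type'] = (numbers['#'] % 2).map({0: 'even', 1: 'odd'})

print(numbers['type'].tolist())
```

On large dataframes, column-wide operations like this tend to be faster than calling a Python function once per row.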

### Pandas Group By
[Grouping](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) by a column's values works much like Excel's SUMIF.  
I'm going to group by the type column that I just created and run aggregate functions on the Total column.

In [None]:
# Create a groupby object
poke_group = poke_clean.groupby('type')['Total']

# Apply sum aggregation
poke_sum = poke_group.sum()

# Apply count aggregation
poke_count = poke_group.count()

# Apply average aggregation
# (named poke_mean so it doesn't overwrite the poke_avg dataframe created earlier)
poke_mean = poke_group.mean()

poke_aggs = pd.DataFrame({'sum': poke_sum, 'count': poke_count, 'average': poke_mean})
poke_aggs

In [None]:
# type is now the index because each series returned by the groupby is indexed by type.  It can be reset like so:
poke_aggs = poke_aggs.reset_index()
poke_aggs
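The three separate aggregations can also be requested in a single call with `agg`.  Here is a minimal sketch using named aggregation, where each keyword becomes an output column.  The `sample` frame is a made-up four-row version of the data, using the totals of the first four Pokemon from the table above.

```python
import pandas as pd

# First four Pokemon: number parity as "type", plus their totals
sample = pd.DataFrame({
    'type': ['odd', 'even', 'odd', 'even'],
    'Total': [253, 325, 425, 249],
})

# keyword = name of the output column, value = name of the aggregation
aggs = (
    sample.groupby('type')['Total']
    .agg(sum='sum', count='count', average='mean')
    .reset_index()
)

print(aggs)
```

This produces the same kind of table as building the dataframe from three series, with one row per type.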