# Case Study - Counting pandas

## How to count all the functions, methods and attributes that pandas has to offer?
There are probably multiple intelligent ways to do this, but for this exercise we will start off by assuming the [API reference](http://pandas.pydata.org/pandas-docs/stable/api.html) in the pandas docs contains all the functionality of pandas. Full URL: http://pandas.pydata.org/pandas-docs/stable/api.html

Wow, that's an absurd amount of functionality for one library. Manually counting this might take some time. Let's use pandas to help us out.

In [1]:
import pandas as pd

### Finding pages with html tables

Often it is not obvious that a web page is built from HTML tables. For example, the pandas API reference page does not appear to contain what you would normally call a 'table'. However, all modern browsers let you view the HTML behind the current page. In Chrome, you can right-click and choose **Inspect** or **View page source**. If you choose Inspect, the browser navigates directly to the HTML for the element you clicked.

While inspecting the HTML, you can use the browser's search function to find HTML tables, which are always written with **`<table>`** elements.

Go ahead and inspect the api page and see if the underlying elements are indeed html tables.
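
The same check can be done in code instead of in the browser. The sketch below counts `<table>` elements with Python's built-in `html.parser`. The `TableCounter` class and the `sample_html` string are made up for illustration; they are not part of pandas or of the API page.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts opening <table> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.n_tables = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method
        if tag == "table":
            self.n_tables += 1


# Hypothetical page source, standing in for what "View page source" shows
sample_html = """
<html><body>
  <h1>API reference</h1>
  <table><tr><td>read_pickle</td><td>Load pickled pandas object</td></tr></table>
  <p>Some text between tables.</p>
  <table><tr><td>Series.dtype</td><td>Return the dtype</td></tr></table>
</body></html>
"""

counter = TableCounter()
counter.feed(sample_html)
print(counter.n_tables)
```

A count above zero suggests that `read_html` has something to pick up on that page.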

### `read_html` to scrape tables

Pandas has a handy-dandy function **`read_html`** which reads all the HTML tables from the given URL. It returns a list of pandas DataFrame objects, one for each table found. Let's use this now to grab every single table on that page.

In [2]:
# grab all html tables from api reference page
api_tables = pd.read_html('http://pandas.pydata.org/pandas-docs/stable/api.html')

In [3]:
# how many tables are there
len(api_tables)

157

In [4]:
# let's look at a few tables
api_tables[0]

| | 0 | 1 |
|---|---|---|
| 0 | read_pickle(path[, compression]) | Load pickled pandas object (or any object) fro... |


In [5]:
# take a look at another table
api_tables[44]

| | 0 | 1 |
|---|---|---|
| 0 | Categorical.dtype | The CategoricalDtype for this instance |
| 1 | Categorical.categories | The categories of this categorical. |
| 2 | Categorical.ordered | Whether the categories have an ordered relatio... |
| 3 | Categorical.codes | The category codes of this categorical. |


It looks like they are all two-column tables with the attribute in the first column and the description in the second. Everything looks good. Let's try counting.

In [6]:
count = 0
for table in api_tables:
    count += table.shape[0]
print("There are {} things pandas can do!".format(count))

There are 1331 things pandas can do!
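
The loop above can also be written as a single `sum` over a generator. This is a minimal sketch on two stand-in tables built from the rows displayed earlier; `toy_tables` is a made-up name, and the real `api_tables` list comes from `pd.read_html`.

```python
import pandas as pd

# Two small stand-in tables, copied from the rows displayed above
toy_tables = [
    pd.DataFrame({0: ["read_pickle(path[, compression])"],
                  1: ["Load pickled pandas object (or any object) fro..."]}),
    pd.DataFrame({0: ["Categorical.dtype", "Categorical.categories",
                      "Categorical.ordered", "Categorical.codes"],
                  1: ["The CategoricalDtype for this instance",
                      "The categories of this categorical.",
                      "Whether the categories have an ordered relatio...",
                      "The category codes of this categorical."]}),
]

# len(table) equals table.shape[0]: the number of rows
total = sum(len(table) for table in toy_tables)
print("There are {} things in the toy tables".format(total))
```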


## How much functionality does the pandas Series have?
As seen above, each entry is written in object-oriented notation: the pandas object followed by its method or attribute. If we want to count just the Series functionality, we need to search each table's first column for the word `Series`. pandas again helps us out: a Series comes equipped with plenty of [string processing methods](http://pandas.pydata.org/pandas-docs/stable/text.html).

To use these string processing methods, define a pandas Series, type `.str.` after it and press tab to see all the available methods.

In [7]:
# Lets use the first column from the above table
s = api_tables[44][0]
s.head()

0         Categorical.dtype
1    Categorical.categories
2       Categorical.ordered
3         Categorical.codes
Name: 0, dtype: object

In [8]:
# use the str.contains method to see if each item does in fact contain the word 'Series' in it
s.str.contains('Series')

0    False
1    False
2    False
3    False
Name: 0, dtype: bool

In [9]:
# OK, let's count the appearances of 'Series' across all tables
count_series = 0
for table in api_tables:
    count_series += table[0].str.contains('Series').sum()
print("There are {} things pandas Series can do!".format(count_series))

There are 332 things pandas Series can do!
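
An alternative to looping is to stack every first column into one long Series and search it once. The sketch below assumes two small stand-in tables (`toy_tables` is a made-up name, with rows taken from the tables shown in this notebook) rather than the real `api_tables`.

```python
import pandas as pd

toy_tables = [
    pd.DataFrame({0: ["Series.values", "Series.dtype"],
                  1: ["Return Series as ndarray or ndarray-like depen...",
                      "return the dtype object of the underlying data"]}),
    pd.DataFrame({0: ["Categorical.dtype", "Categorical.codes"],
                  1: ["The CategoricalDtype for this instance",
                      "The category codes of this categorical."]}),
]

# Stack every first column into one long Series, then search it once
names = pd.concat([table[0] for table in toy_tables], ignore_index=True)
count_series = int(names.str.contains("Series").sum())
print(count_series)
```

Keeping the combined `names` Series around also makes it easy to filter or inspect the matches later.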


# Exercises

## Problem 1
<span  style="color:green; font-size:16px">Writing a new for loop every time we want to count a new word in our dataset is cumbersome. Can you write a function that accepts the parameter **word** and returns the number of functions, methods and attributes in the pandas API whose names contain that word? Use it to count a few words, such as DataFrame or MultiIndex.</span>

In [None]:
# your code here
def count_functionality(word):
    pass  # replace with your implementation

## Problem 2
<span  style="color:green; font-size:16px">Define a new function by modifying the above function slightly to have it return a list of all the methods</span>

In [None]:
# your code here

## Problem 3
<span  style="color:green; font-size:16px">Select the first table from `api_tables` with more than 20 rows. Then assign the second column as a Series to variable `col`. Explore some **`str`** methods with this column.</span>

In [None]:
# your code here

## Problem 4
<span  style="color:green; font-size:16px">Let's get some 'live' data.</span>
1. Navigate to [Real Clear Politics Trump vs Clinton](http://www.realclearpolitics.com/epolls/2016/president/us/general_election_trump_vs_clinton-5491.html)
1. Use pandas **`read_html`** to read in the full table at the bottom of the page and display it here in the notebook
1. Use the header parameter to find the correct header instead of the default numbers
1. Inspect the info to make sure the Clinton and Trump data types are float64
1. Add a column that calculates the difference of Trump vs Clinton
1. Sort the DataFrame by this newly created column
1. What conclusions (if any) can you make?

# Solutions

In [11]:
import pandas as pd
api_tables = pd.read_html('http://pandas.pydata.org/pandas-docs/stable/api.html')

## Problem 1
<span  style="color:green; font-size:16px">Writing a new for loop every time we want to count a new word in our dataset is cumbersome. Can you write a function that accepts the parameter **word** and returns the number of functions, methods and attributes in the pandas API whose names contain that word? Use it to count a few words, such as DataFrame or MultiIndex.</span>

In [12]:
def count_functionality(word):
    return sum([table[0].str.contains(word).sum() for table in api_tables])

In [13]:
count_functionality('Series'), count_functionality('DataFrame'), count_functionality('MultiIndex')

(332, 257, 23)
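
`count_functionality` reads the global `api_tables`. A variant that takes the tables as an argument is easier to try out on small inputs. The sketch below is one possible version, not the notebook's solution: the name `count_functionality_in` and the toy tables are made up, and `na=False` is added on the assumption that a first column may contain missing values.

```python
import pandas as pd


def count_functionality_in(tables, word):
    """Count first-column entries that contain `word` as plain text."""
    return sum(
        # na=False counts missing entries as "no match"
        int(table[0].str.contains(word, regex=False, na=False).sum())
        for table in tables
    )


# Stand-in tables; the None mimics an empty cell in a scraped table
toy_tables = [
    pd.DataFrame({0: ["Series.values", "Series.dtype", None]}),
    pd.DataFrame({0: ["DataFrame.shape", "MultiIndex.from_arrays"]}),
]

print(count_functionality_in(toy_tables, "Series"))
print(count_functionality_in(toy_tables, "DataFrame"))
```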

## Problem 2
<span  style="color:green; font-size:16px">Define a new function by modifying the above function slightly to have it return a list of all the methods</span>

In [14]:
def list_functionality(word):
    return_list = []
    for table in api_tables:
        s = table[0] # get first column
        cur_list = s[s.str.contains(word)].tolist() # get only items with word in it and convert to list
        return_list.extend(cur_list)
    return return_list

In [15]:
# these methods should look very familiar from the builtin python str methods
str_series = list_functionality('Series.str')
str_series

['Series.strides',
 'Series.str.capitalize()',
 'Series.str.cat([others,\xa0sep,\xa0na_rep,\xa0join])',
 'Series.str.center(width[,\xa0fillchar])',
 'Series.str.contains(pat[,\xa0case,\xa0flags,\xa0na,\xa0…])',
 'Series.str.count(pat[,\xa0flags])',
 'Series.str.decode(encoding[,\xa0errors])',
 'Series.str.encode(encoding[,\xa0errors])',
 'Series.str.endswith(pat[,\xa0na])',
 'Series.str.extract(pat[,\xa0flags,\xa0expand])',
 'Series.str.extractall(pat[,\xa0flags])',
 'Series.str.find(sub[,\xa0start,\xa0end])',
 'Series.str.findall(pat[,\xa0flags])',
 'Series.str.get(i)',
 'Series.str.index(sub[,\xa0start,\xa0end])',
 'Series.str.join(sep)',
 'Series.str.len()',
 'Series.str.ljust(width[,\xa0fillchar])',
 'Series.str.lower()',
 'Series.str.lstrip([to_strip])',
 'Series.str.match(pat[,\xa0case,\xa0flags,\xa0na,\xa0…])',
 'Series.str.normalize(form)',
 'Series.str.pad(width[,\xa0side,\xa0fillchar])',
 'Series.str.partition([pat,\xa0expand])',
 'Series.str.repeat(repeats)',
 'Series.str.re
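
The first entry in the list above, `Series.strides`, is not a string method. It matched because it contains the text `Series.str`. Appending a dot to the search word is not enough on its own, because `str.contains` treats its pattern as a regular expression by default and `.` matches any character. A minimal sketch of the difference, using three names from the output above:

```python
import pandas as pd

names = pd.Series(["Series.strides", "Series.str.capitalize()", "Series.str.len()"])

# Default regex=True: "." matches any character, so "Series.stri" matches too
as_regex = names.str.contains("Series.str.")

# regex=False treats the pattern as plain text, so the trailing dot must be a real dot
as_literal = names.str.contains("Series.str.", regex=False)

print(as_regex.tolist())
print(as_literal.tolist())
```

With `regex=False`, only the two `Series.str.` accessor methods match.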

## Problem 3
<span  style="color:green; font-size:16px">Select the first table from `api_tables` with more than 20 rows. Then assign the second column as a Series to variable `col`. Explore some **`str`** methods with this column.</span>

In [52]:
for table in api_tables:
    if len(table) > 20:
        break
table

| | 0 | 1 |
|---|---|---|
| 0 | Series.values | Return Series as ndarray or ndarray-like depen... |
| 1 | Series.dtype | return the dtype object of the underlying data |
| 2 | Series.ftype | return if the data is sparse\|dense |
| 3 | Series.shape | return a tuple of the shape of the underlying ... |
| 4 | Series.nbytes | return the number of bytes in the underlying data |
| 5 | Series.ndim | return the number of dimensions of the underly... |
| 6 | Series.size | return the number of elements in the underlyin... |
| 7 | Series.strides | return the strides of the underlying data |
| 8 | Series.itemsize | return the size of the dtype of the item of th... |
| 9 | Series.base | return the base object if the memory of the un... |


In [53]:
col = table[1]
col.head()

0    Return Series as ndarray or ndarray-like depen...
1       return the dtype object of the underlying data
2                   return if the data is sparse|dense
3    return a tuple of the shape of the underlying ...
4    return the number of bytes in the underlying data
Name: 1, dtype: object

In [56]:
col.str.swapcase().head()

0    rETURN sERIES AS NDARRAY OR NDARRAY-LIKE DEPEN...
1       RETURN THE DTYPE OBJECT OF THE UNDERLYING DATA
2                   RETURN IF THE DATA IS SPARSE|DENSE
3    RETURN A TUPLE OF THE SHAPE OF THE UNDERLYING ...
4    RETURN THE NUMBER OF BYTES IN THE UNDERLYING DATA
Name: 1, dtype: object

In [58]:
col.str.split().head() # split each element by blank space

0    [Return, Series, as, ndarray, or, ndarray-like...
1    [return, the, dtype, object, of, the, underlyi...
2            [return, if, the, data, is, sparse|dense]
3    [return, a, tuple, of, the, shape, of, the, un...
4    [return, the, number, of, bytes, in, the, unde...
Name: 1, dtype: object
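
A few more `str` methods worth trying are sketched below. The `col` Series here is a small stand-in holding three of the descriptions shown above, not the full column.

```python
import pandas as pd

col = pd.Series([
    "return the dtype object of the underlying data",
    "return if the data is sparse|dense",
    "return the number of bytes in the underlying data",
])

# .str also works on the lists produced by split, so this counts words
n_words = col.str.split().str.len()

# Boolean Series: does each description start with "return"?
starts_with_return = col.str.startswith("return")

# Uppercase the first character of each description
capitalized = col.str.capitalize()

print(n_words.tolist())
print(starts_with_return.all())
print(capitalized[0])
```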

## Problem 4
<span  style="color:green; font-size:16px">Let's get some 'live' data.</span>
1. Navigate to [Real Clear Politics](http://www.realclearpolitics.com)
1. In the top left corner of the page, hover over the polls section and click on Clinton vs Trump
1. Use pandas read_html to read in the full table at the bottom of the page and display it here in the notebook
1. Use the header parameter to find the correct header instead of the default numbers
1. Inspect the info to make sure the Clinton and Trump data types are float64
1. Add a column that calculates the difference of Trump vs Clinton
1. Sort the DataFrame by this newly created column
1. Do you see anything suspicious about the polls where Trump is leading?

In [20]:
rcp_tables = pd.read_html('http://www.realclearpolitics.com/epolls/2016/president/us/general_election_trump_vs_clinton-5491.html', header=0)

In [21]:
len(rcp_tables)

3

In [22]:
rcp_final = rcp_tables[2]

In [23]:
rcp_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 261 entries, 0 to 260
Data columns (total 7 columns):
Poll           261 non-null object
Date           261 non-null object
Sample         261 non-null object
MoE            261 non-null object
Clinton (D)    261 non-null float64
Trump (R)      261 non-null float64
Spread         261 non-null object
dtypes: float64(2), object(5)
memory usage: 14.4+ KB
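
The `info()` output shows `MoE` stored as `object`, presumably because missing margins appear as `--` in the scraped table. If a numeric column is wanted, `pd.to_numeric` with `errors="coerce"` is one option. The sketch below uses a short stand-in Series with values of the kind seen in this table, not the full column.

```python
import pandas as pd

# Stand-in for rcp_final['MoE']: strings, with "--" marking missing values
moe = pd.Series(["--", "--", "3.5", "3.1"], name="MoE")

# Unparseable entries such as "--" become NaN instead of raising an error
moe_numeric = pd.to_numeric(moe, errors="coerce")

print(moe_numeric.dtype)
print(moe_numeric.isna().sum())
```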


In [24]:
# positive values mean Clinton leads, matching the Spread column
rcp_final['diff'] = rcp_final['Clinton (D)'] - rcp_final['Trump (R)']

In [25]:
rcp_final.head(20)

| | Poll | Date | Sample | MoE | Clinton (D) | Trump (R) | Spread | diff |
|---|---|---|---|---|---|---|---|---|
| 0 | Final Results | -- | -- | -- | 48.2 | 46.1 | Clinton +2.1 | 2.1 |
| 1 | RCP Average | 11/1 - 11/7 | -- | -- | 46.8 | 43.6 | Clinton +3.2 | 3.2 |
| 2 | BloombergBloomberg | 11/4 - 11/6 | 799 LV | 3.5 | 46.0 | 43.0 | Clinton +3 | 3.0 |
| 3 | IBD/TIPP TrackingIBD/TIPP Tracking | 11/4 - 11/7 | 1107 LV | 3.1 | 43.0 | 42.0 | Clinton +1 | 1.0 |
| 4 | Economist/YouGovEconomist | 11/4 - 11/7 | 3669 LV | -- | 49.0 | 45.0 | Clinton +4 | 4.0 |
| 5 | LA Times/USC TrackingLA Times | 11/1 - 11/7 | 2935 LV | 4.5 | 44.0 | 47.0 | Trump +3 | -3.0 |
| 6 | ABC/Wash Post TrackingABC/WP Tracking | 11/3 - 11/6 | 2220 LV | 2.5 | 49.0 | 46.0 | Clinton +3 | 3.0 |
| 7 | FOX NewsFOX News | 11/3 - 11/6 | 1295 LV | 2.5 | 48.0 | 44.0 | Clinton +4 | 4.0 |
| 8 | MonmouthMonmouth | 11/3 - 11/6 | 748 LV | 3.6 | 50.0 | 44.0 | Clinton +6 | 6.0 |
| 9 | NBC News/Wall St. JrnlNBC/WSJ | 11/3 - 11/5 | 1282 LV | 2.7 | 48.0 | 43.0 | Clinton +5 | 5.0 |
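
The remaining step, sorting by the new column, can be sketched with `sort_values`. The `polls` frame below holds four rows taken from the table above, with poll names shortened for readability. It is a stand-in for `rcp_final`, not the full data.

```python
import pandas as pd

polls = pd.DataFrame({
    "Poll": ["Bloomberg", "IBD/TIPP Tracking", "LA Times/USC Tracking", "Monmouth"],
    "Clinton (D)": [46.0, 43.0, 44.0, 50.0],
    "Trump (R)": [43.0, 42.0, 47.0, 44.0],
})

# Same definition as in the notebook: positive means Clinton leads
polls["diff"] = polls["Clinton (D)"] - polls["Trump (R)"]

# Ascending sort puts the polls most favorable to Trump first
polls_sorted = polls.sort_values("diff")
print(polls_sorted)
```

Passing `ascending=False` would list the largest Clinton leads first instead.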
