# Welcome to ....
![](https://dataaspirant.files.wordpress.com/2014/10/pandas.png)

### What is pandas?
pandas is possibly the best open source data exploration library currently available. It gives the user tremendous power to easily explore, manipulate, query, aggregate, visualize, `<insert cool sounding data word>`, etc. tabular (row, column) data.

### Why pandas and not xyz?
It's 2016, and by now there are dozens of competitors that can do as much as, if not more than, the pandas library. However, many aspects of pandas set it apart and continue to give it one of the fastest growing user bases.
1. It's a Python library, which makes it easy to read, easy to develop, and easily integrates with other popular Python libraries like matplotlib, numpy, and scikit-learn.
2. Like Python, it's easy to use and quick to get up and running.
3. It is nearly self-contained in that tremendous functionality is built in one package. This contrasts with R, where many packages are needed to obtain the functionality.
4. The community is amazing. On Stack Overflow, for example, there are [nearly 23,000](http://stackoverflow.com/questions/tagged/pandas) pandas questions. SAS, a multibillion-dollar analytics software maker, has only 6k questions. This is one huge benefit of open source in general. If you need help, you are nearly guaranteed to find it very quickly. After a while, most of your questions will be answered in the top 3 search results from Google.
5. Lightning fast development. New features are added all the time thanks to the huge community. This contrasts with proprietary software, which can never move as fast.
6. powerful, simple, amazing community!!!

### Why is it named after an east Asian bear?
pandas was built by a young kid named Wes McKinney beginning in 2008 at a hedge fund named AQR. In finance speak, tabular data is called 'panel data', which smashed together becomes pandas. If you are really interested in the history, you can hear it from the creator [himself](https://www.youtube.com/watch?v=kHdkFyGCxiY).

### Python already has data structures to handle data, why do we need another one?
Even though Python itself is a high-level language, its primary built-in data structures - lists and dicts - do not easily lend themselves to tabular data in ways that humans can easily visualize, nor to vectorized (no for loops) operations. Just summing up items in a list can be quite slow. Check the example below.

In [20]:
# import numpy and alias it to np as is convention
import numpy as np

In [21]:
# 1 million numbers
n = 1000000
my_list = list(range(n))

%%timeit is a cell magic command

In [22]:
%%timeit
sum1 = sum(my_list)

100 loops, best of 3: 11.5 ms per loop


In [23]:
array = np.arange(n)

In [24]:
%%timeit 
sum2 = np.sum(array)

1000 loops, best of 3: 775 µs per loop


### What just happened?
IPython notebooks come with handy dandy magic commands that give you some great extra functionality. The one I use the most is timeit, which times how long an operation takes. Precede it by % for a single-line magic and %% for an entire-cell magic. Using the builtin sum with a list took approximately 15 times longer than using numpy (11.5 ms versus 775 µs), and this was just a simple sum of a list of numbers. This difference tends to increase with the complexity of the operation performed on the data.
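Outside a notebook the `%timeit` magics are not available. A minimal sketch of the same comparison using the standard-library `timeit` module follows. The repeat count of 10 and the explicit `int64` dtype are choices made here for illustration, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

n = 1_000_000
my_list = list(range(n))
# int64 is requested explicitly so the sum cannot overflow on platforms
# where the default integer is 32 bits
array = np.arange(n, dtype=np.int64)

# total seconds for 10 runs of each approach
list_seconds = timeit.timeit(lambda: sum(my_list), number=10)
numpy_seconds = timeit.timeit(lambda: np.sum(array), number=10)

print("list : {:.4f} s".format(list_seconds))
print("numpy: {:.4f} s".format(numpy_seconds))
```

Both approaches compute the same total; only the time taken differs.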

### Why is numpy so fast?
First off, numpy is one of the most popular (and one of the older) numerical computing packages for Python. It provides an n-dimensional array (ndarray) data structure that can be of any size, with the ability to apply a large suite of mathematical functions to it. numpy ndarrays are homogeneous (all elements have the same type) and their operations are executed in pre-compiled C code, which makes for much faster execution times. A Python list, in contrast, must be iterated through at run-time and can hold any number of different types, so it is not well suited to large numerical computations.
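To make the point about homogeneous types concrete, here is a small sketch (the variable names are made up for illustration). A list happily mixes types, while numpy picks one dtype for the whole array and upcasts when needed.

```python
import numpy as np

mixed_list = [1, 2.5, "three"]          # a list can hold any mix of types
list_types = [type(item).__name__ for item in mixed_list]

int_array = np.array([1, 2, 3])         # every element shares one integer dtype
float_array = np.array([1, 2.5, 3])     # one float forces the whole array to float64

doubled = int_array * 2                 # runs in compiled code, no Python loop

print(list_types)
print(int_array.dtype, float_array.dtype)
print(doubled)
```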

### Why not numpy?
Though numpy is fast and can handle most of our data needs, it is still relatively low-level. For example, the ndarray is just a brick of numbers. pandas adds a simple layer on top of numpy and leverages its speed. The main building block in pandas, the ndframe, is built directly upon numpy's building block (the ndarray). But pandas allows much easier access to rows and columns, powerful statistical functionality, enhanced merging and grouping, `<insert many data exploration techniques>`. We will not delve into the specifics of numpy, but it is useful to remember that pandas building blocks have numpy building blocks as their base. More info on numpy can be found [in the docs](http://docs.scipy.org/doc/numpy-1.10.1/index.html). We will actually be using some numpy functionality, but it should be self-explanatory when it comes up.

In [25]:
# Want to view all the magical abilities?
# view all the magic commands
%lsmagic

Available line magics:
%alias  %alias_magic  %autocall  %automagic  %autosave  %bookmark  %cat  %cd  %clear  %colors  %config  %connect_info  %cp  %debug  %dhist  %dirs  %doctest_mode  %ed  %edit  %env  %gui  %hist  %history  %install_default_config  %install_ext  %install_profiles  %killbgscripts  %ldir  %less  %lf  %lk  %ll  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %lx  %macro  %magic  %man  %matplotlib  %mkdir  %more  %mv  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %popd  %pprint  %precision  %profile  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %rep  %rerun  %reset  %reset_selective  %rm  %rmdir  %run  %save  %sc  %set_env  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%capture  %%debug  %%file  %%html  %%javascript  %%latex  %%

### Pandas is simple: There's really only one type of data structure
Data representation is simple. It's just plopped in what most people would call a table. Rows, columns, that's it.

![Boring Table](http://www.homeandlearn.co.uk/powerpoint/images/charts/8Table9.gif) 

You've seen this every day of your life. So no more explaining.

Well, not exactly. There are numerous formats for data (XML, JSON, raw bytes, etc.), but for our purposes today we will only be examining what everyone thinks of when they think of data - a table.

pandas is built just for analyzing this tabular, rectangular, very deceptively normal concept of data. There are two primary objects that account for everything we will be covering today. 

**The Series and the DataFrame.**

A DataFrame is simply a sequence of Series and a Series is just a one column table with an index identifying the row.  The column is simply a numpy ndarray and it is this index that separates a pandas Series from a numpy ndarray. As we will come to see over and over again, this index is VERY important.
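As a minimal sketch of that relationship (the names and numbers below are made up for illustration), two Series that share an index can be combined into a DataFrame, and each column that comes back out is again a Series wrapping a numpy ndarray.

```python
import numpy as np
import pandas as pd

# two Series sharing the same index labels (made-up data)
height = pd.Series([1.6, 1.8], index=["ann", "bob"])
weight = pd.Series([55, 80], index=["ann", "bob"])

# each Series becomes one column of the DataFrame
df = pd.DataFrame({"height": height, "weight": weight})

column = df["height"]          # pulling a column back out gives a Series
values = column.to_numpy()     # the underlying data as a numpy ndarray
labels = list(column.index)    # the index is what the bare ndarray lacks

print(df)
print(type(column).__name__, type(values).__name__, labels)
```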

## Series: A manual build
We will start our first lines of pandas by building a Series, a single dimensional array with an index.

In [26]:
# import pandas and alias it to pd as is convention
import pandas as pd

In [27]:
# to construct a Series, simply pass it a python list
s = pd.Series([1,3,3,77])
s

0     1
1     3
2     3
3    77
dtype: int64

### What are these things that just got printed on my screen?
A 4 item series was created and stored in the variable s. But several more things were printed to the screen as well. To the left of the series values is the index, beginning at 0. This is the default index when none is given during the construction of the Series object. Also printed is the data type (dtype) of the series, `int64`, which stands for a 64-bit integer.

In [28]:
# As was done in the precourse, let's print out all the methods/attributes that are associated with this series object
print([method for method in dir(s) if method[0] != '_'])

['T', 'abs', 'add', 'add_prefix', 'add_suffix', 'align', 'all', 'any', 'append', 'apply', 'argmax', 'argmin', 'argsort', 'as_blocks', 'as_matrix', 'asfreq', 'asof', 'astype', 'at', 'at_time', 'autocorr', 'axes', 'base', 'between', 'between_time', 'bfill', 'blocks', 'bool', 'clip', 'clip_lower', 'clip_upper', 'combine', 'combine_first', 'compound', 'compress', 'consolidate', 'convert_objects', 'copy', 'corr', 'count', 'cov', 'cummax', 'cummin', 'cumprod', 'cumsum', 'data', 'describe', 'diff', 'div', 'divide', 'dot', 'drop', 'drop_duplicates', 'dropna', 'dtype', 'dtypes', 'duplicated', 'empty', 'eq', 'equals', 'ewm', 'expanding', 'factorize', 'ffill', 'fillna', 'filter', 'first', 'first_valid_index', 'flags', 'floordiv', 'from_array', 'from_csv', 'ftype', 'ftypes', 'ge', 'get', 'get_dtype_counts', 'get_ftype_counts', 'get_value', 'get_values', 'groupby', 'gt', 'hasnans', 'head', 'hist', 'iat', 'idxmax', 'idxmin', 'iget', 'iget_value', 'iloc', 'imag', 'index', 'interpolate', 'irow', 'is_c

### Help?
That is an unbelievable number of methods. It's not likely that you will have them all memorized, so be prepared to get help when using pandas, especially in the beginning. Luckily for us, help is quite easy to get. pandas has excellent [documentation](http://pandas.pydata.org/pandas-docs/stable/index.html) with plenty of examples that you should go through completely if you want to know just about every single feature the library has to offer. That link always points to the newest (stable) pandas version.

To get help inside the notebook, use the help function to have the help printed to the screen, or put a '?' at the end of the method to bring up a separate help window at the bottom. If you want to see even more, put '??' at the end to view the source code. To view the source code of the value_counts method at its source, navigate to /path/to/anaconda/lib/python3.5/site-packages/pandas/core/base.py

And one more excellent way to get help and my personal favorite - is to press shift + tab + tab once inside the parentheses of a method

In [29]:
s.value_counts??

In [30]:
s.value_counts()

3     2
77    1
1     1
dtype: int64

In [31]:
# return just the values to a 1 dimensional numpy array
s.values

array([ 1,  3,  3, 77])

In [32]:
# do a simple operation to all elements
s + 5

0     6
1     8
2     8
3    82
dtype: int64

### Perform simple vectorized operations
If you have only worked with Python lists and dictionaries for holding data, then the last statement should amaze you. Although it's obvious what has happened, this simple functionality does not come out of the box with a Python list.

In [33]:
my_list = list(range(4))
my_list + 5

TypeError: can only concatenate list (not "int") to list

Remembering from the precourse assignment, lists implement the `__add__` special method, which defines how lists concatenate to other lists. Since 5 is an int and not a list, an error is thrown. Python has no idea that you would like to add 5 to each of the elements. pandas natively understands that you would like to add 5 to every element. Concatenating to a Series is a bit more troublesome than concatenating to a Python list (more on this later).
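A short sketch of the difference: with a plain list, `+` only concatenates, so element-wise addition needs a comprehension (or a loop), while the Series does it directly.

```python
import pandas as pd

my_list = list(range(4))

concatenated = my_list + [5]            # list + list appends: [0, 1, 2, 3, 5]
added_list = [x + 5 for x in my_list]   # element-wise addition needs a comprehension

s = pd.Series(my_list)
added_series = s + 5                    # the Series adds 5 to every element

print(concatenated)
print(added_list)
print(added_series.tolist())
```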

### What are vectorized operations
pandas/numpy is filled with vectorized operations going on all the time. A vectorized operation is one where a sequence of numbers is operated on without the explicit writing of for loops. They are handled outside of python in pre-compiled C/fortran code that has been optimized a long time ago. Vectorized operations allow you to execute many operations with ease that normally take a very long time to write through normal iterative methods in python.

<h1><del>LOOPS</del></h1>
Because of vectorization, we can say goodbye to loops. If you are writing loops in pandas, you are probably doing it wrong.
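As a small illustration of what gets replaced, here is the same calculation written once with an explicit loop and once as a vectorized expression. The formula itself is arbitrary and chosen only for this sketch.

```python
import pandas as pd

s = pd.Series([1, 3, 3, 77])

# the loop version: visit each value and build the result by hand
looped = []
for value in s:
    looped.append(value ** 2 + 1)

# the vectorized version: one expression applied to the whole Series
vectorized = s ** 2 + 1

print(looped)
print(vectorized.tolist())
```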

In [34]:
# A slew of other mathematical operations can be performed on a pandas Series
# raise every element to a power and continue doing element-by-element math
s.pow(4) / 13 - 40

0   -3.992308e+01
1   -3.376923e+01
2   -3.376923e+01
3    2.704040e+06
dtype: float64

In [35]:
s.add(5) # add 5 using a method

0     6
1     8
2     8
3    82
dtype: int64

In [36]:
s.sum() #sum up all the items

84

In [37]:
# many methods have very valuable arguments that can be set to get a different result. 
# Here we sort from largest to smallest
s.sort_values(ascending=False)

3    77
2     3
1     3
0     1
dtype: int64

In [38]:
# more aggregations
s.std(), s.var(), s.min(), s.max(), s.mean(), s.median(), s.mode() 

(37.345236197762446, 1394.6666666666667, 1, 77, 21.0, 3.0, 0    3
 dtype: int64)

In [39]:
# execute operations that return the entire data set
# keep track of sum as you iterate down the column
s.cumsum()

0     1
1     4
2     7
3    84
dtype: int64

In [40]:
# keep track of product as you iterate down the column
s.cumprod()

0      1
1      3
2      9
3    693
dtype: int64

### Accessing individual elements of a Series
pandas gives you a ridiculous number of ways to access the elements of your series. Many of the methods seen below will give you a different way of doing the same thing. This is a common theme throughout pandas. You will feel completely overwhelmed (I still do) by the number of methods available to you. Take a look at the pandas [API reference](http://pandas.pydata.org/pandas-docs/stable/api.html) to see all things possible.

In [41]:
# How to get the first element of a Series
s[0]

1

### Advanced function below:

In [42]:
# define a function to attempt an index lookup in a Series
def try_lookup(a_series, idx):
    '''
    This function returns a printed statement of either the value at the index or an error message.
    
    a_series : A pandas series you want to attempt a lookup with 
    
    idx : A numerical or string index that you would like to lookup in the passed series
    '''
    try:
        return a_series[idx]
    except Exception as e:
        return 'We got an error! {}'.format(type(e))

In [43]:
try_lookup(s, 0), try_lookup(s, 2), try_lookup(s, -1)

(1, 3, "We got an error! <class 'KeyError'>")

In [44]:
s[0], try_lookup(s, 0)

(1, 1)

In [45]:
try_lookup(s, -1)

"We got an error! <class 'KeyError'>"

In [46]:
# s[-1] # we get a KeyError!

### Is getting an item in a Series equivalent to getting an item in a python list?
**NO!** The previous example should have clearly demonstrated this. So, how did the first example work? When using the [ ] (brackets) operator to look up an element in a Series, the index label is looked up and not the position in the Series. Since 0 is both the **position** and the **label** of the first element, this lookup worked.

In [47]:
# Create a series with strings as indices
s1 = pd.Series(data=[8,2,3], index=['a', 'b', 'c'])
s1

a    8
b    2
c    3
dtype: int64

In [48]:
# This should fail?!
s1[0], s1[1], s1[2]
# But it works???

(8, 2, 3)

### How did that work?
There are some esoteric rules when it comes to grabbing items using the [ ] operator.
1. When the index is all integers, lookup by label. Error if label is not in index
2. When index is all strings or a mix of strings and integers, search for label first and then fall back to position if label is not found

### Label vs Position for indexes
This is important terminology in the pandas indexing world  
**Label** : The actual value, a string or numeric at a particular index location

**Position** : The numerical positional order of the particular location in an index

When you read **label**, you should think of how keys are looked up in a python dictionary.

When you read **position**, you should think of how items are accessed in a python list.
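The dictionary and list analogy can be sketched directly. The Series below is a made-up example, and `Index.get_loc` is used to translate a label into its position.

```python
import pandas as pd

s = pd.Series([5, 10, 2], index=[2, 5, 9])

as_dict = dict(zip(s.index, s.values))   # label -> value, like dictionary keys
as_list = list(s.values)                 # values in positional order, like a list

position_of_9 = s.index.get_loc(9)       # the label 9 sits at position 2

print(as_dict[9], as_list[position_of_9])
```

Both lookups land on the same element: one goes through the label 9, the other through position 2.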

In [49]:
# What happens when there is a mix of integers and strings
s2 = pd.Series(data=[8,2,3,5],index=['a', 'b', 'c', 0])
s2['a'], s2['c'], s2[-1], s2[0]

(8, 3, 5, 5)

In [50]:
s3 = pd.Series(data=[1,2,3,4],index=[-1, 'b', 'c', 0])
s3['c'], s3[-1], s3[0]

(3, 1, 4)

### Lookups with .ix[ ]
.ix works nearly the same way as the plain [ ] operator, except that when the index is a mix of strings and integers it looks up by position first and falls back to the label only if the position is not found. .ix also uses the brackets notation.
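Note that `.ix` was deprecated in pandas 0.20 and removed in pandas 1.0, so the `.ix` cells in this section only run on older installations. As a sketch of how to ask for each behaviour explicitly on a current version, `.loc` (label) and `.iloc` (position), both covered later in this notebook, can be used instead:

```python
import pandas as pd

s3 = pd.Series([1, 2, 3, 4], index=[-1, "b", "c", 0])

by_label = s3.loc[-1]       # the value stored under the label -1
by_position = s3.iloc[-1]   # the value in the last position

print(by_label, by_position)
```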

In [51]:
# begin with mixed integer/string indexed series
s3

-1    1
b     2
c     3
0     4
dtype: int64

In [52]:
# lookup differences between [ ] and .ix
s3[-1], s3.ix[-1]

(1, 4)

In [53]:
s3[0], s3.ix[0]

(4, 1)

In [54]:
# For Series with only integers, ix will only lookup based on label and not fall back on position
s4 = pd.Series([5,10,2], index=[2,5,9])
s4

2     5
5    10
9     2
dtype: int64

In [55]:
# s4.ix[0] # KeyError
# s4.ix[-1] # KeyError
s4.ix[9]

2

### Lookups with slices in the [ ] and .ix[ ] operators
To add to the confusion, a slice inside [ ] is always treated as integer positions, regardless of the index type.
Slices in .ix[ ] are positional for mixed indexes and label-based for all-integer indexes.

In [56]:
# start with the mixed string/integer indexed series
s3

-1    1
b     2
c     3
0     4
dtype: int64

In [57]:
s3[1:3] #get positional slice. Do not grab last element

b    2
c    3
dtype: int64

In [58]:
# you can pass slices that are out of range
s3[2:100]

c    3
0    4
dtype: int64

In [59]:
#now for ix
s3.ix[1:] # using only position

b    2
c    3
0    4
dtype: int64

In [60]:
s3.ix[-1:] # start from the last

0    4
dtype: int64

In [62]:
# Let's look at an all integer indexed series
s4

2     5
5    10
9     2
dtype: int64

In [63]:
s4[:1] # positional slice: grabs only the first element

2    5
dtype: int64

In [64]:
s4[5:] # positional slice starting at position 5, which is past the end, so the result is empty

Series([], dtype: int64)

In [65]:
s4[:3] # gets up to the third element

2     5
5    10
9     2
dtype: int64

In [66]:
#now for ix - this is where the difference happens!
# the lookup is label-based when passing a slice to an all integer series index
s4.ix[:3] 

2    5
dtype: int64

In [67]:
# when using label based lookups the right label is included as opposed to positional sliced indexes
# where the right edge of the slice is not included
s4.ix[5:9] 

5    10
9     2
dtype: int64
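The same edge behaviour can be seen without `.ix`. A minimal sketch using `.iloc` (always positional) and `.loc` (always label-based), both covered later in this notebook, on the all-integer series:

```python
import pandas as pd

s4 = pd.Series([5, 10, 2], index=[2, 5, 9])

positional = s4.iloc[0:2]    # positions 0 and 1; the right edge is excluded
by_label = s4.loc[5:9]       # labels 5 through 9; the right edge is included
up_to_label_3 = s4.loc[:3]   # every label up to 3, here just the label 2

print(positional.tolist(), by_label.tolist(), up_to_label_3.tolist())
```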

### Passing lists of labels/positions to  [ ], .ix
Python lists can also be passed inside [ ] and .ix. This is useful to get non-adjacent values in a series. Here are some examples with the rules

In [68]:
# Again start with mixed label indexed series
s3

-1    1
b     2
c     3
0     4
dtype: int64

In [69]:
# lookup by label first then fall back to position
s3[[1, 0, -1, -2]] # -1 and 0 will use labels. 1 and -2 will use positional

b     2
-1    1
0     4
c     3
dtype: int64

In [70]:
# s3[[1, 0, -1, -2, 5]] # raise index error because 5 is not a label and then not a positional index

In [71]:
# use ix
s3.ix[[0]] # Lookup first by label. return NaN if label not in index 

0    4
dtype: int64

In [72]:
# s3.ix[[0, 1]] # IndexError. This is bizarre... possibly a bug?

In [73]:
# and for good measure - only integer index
s4

2     5
5    10
9     2
dtype: int64

In [74]:
# use only labels for [ ]
s4[[0, 1]] 

0   NaN
1   NaN
dtype: float64

In [75]:
s4[[2,9]]

2    5
9    2
dtype: int64

In [76]:
#now for ix
s4.ix[[0, 1]] # does the same thing as [ ]

0   NaN
1   NaN
dtype: float64

In [77]:
s4.ix[[2, 9]]

2    5
9    2
dtype: int64

<h1 style="text-align:center"> Summarizing [ ] and .ix[ ]</h1>
<table>
    <thead >
        <tr>
            <th style="text-align:center">Action when passed a ...</th>
            <th style="text-align:center">[ ]</th>
            <th style="text-align:center">.ix[ ]</th>
        </tr>
   </thead>
   <tbody>
       <tr>
            <td>single integer when index is all integers</td>
            <td>Lookup by label and raise KeyError when not in index. No positional lookup</td>
            <td>Lookup by label and raise KeyError when not in index. No positional lookup</td>
       </tr>
       <tr>
            <td>single integer when index is mixed strings and ints</td>
            <td>Use label for lookup then fall back to position if label not found</td>
            <td>Use position for lookup then fall back to label if position not found.</td>
       </tr>
       <tr>
            <td>slice (as in s[3:20:5]) for all integer series</td>
            <td>Lookup positions and do not raise an error if slice goes out of bounds</td>
            <td>Lookup by label and do not raise an error if slice goes out of bounds</td>
       </tr>
       <tr>
            <td>slice (as in s[3:20:5]) for a mixed index series</td>
            <td>Lookup positions and do not raise an error if slice goes out of bounds</td>
            <td>Lookup positions and do not raise an error if slice goes out of bounds</td>
       </tr>
       <tr>
            <td>list of integers when index is all integers</td>
            <td>Use label only. Return NaN if element in list not in index</td>
            <td>Use label only. Return NaN if element in list not in index</td>
       </tr>
       <tr>
            <td>list of integers when index is mixed strings and ints</td>
            <td>Lookup label, then position, before finally raising an IndexError</td>
            <td>Lookup by label then raise IndexError if not found.</td>
       </tr>
   </tbody>
</table>

# This is incredibly confusing!!! There must be a better way
These rules can get very confusing, but luckily for us there is quite a bit of good news here. First, looking up individual values in a series is somewhat of a rarity, as it's more common to operate on a series as a whole. Second, there are more direct, non-ambiguous methods that we will talk about next. Also, since .ix handles both label and positional locations, its performance is the worst of the lookup mechanisms.

### .loc[ ] and .iloc[ ] for much less confusion
To simplify matters, pandas provides the .loc and .iloc lookup methods. .loc lookups only use the **label** and raise an error if the label is not in the index. .iloc lookups only use the **integer position**, just like lists and numpy arrays. These methods are much preferred to .ix since they are non-ambiguous and faster.
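A quick sketch of how each one fails when asked for something that is not there. The label `"z"` and the position 10 are arbitrary values picked because they do not exist in this series.

```python
import pandas as pd

s3 = pd.Series([1, 2, 3, 4], index=[-1, "b", "c", 0])

try:
    s3.loc["z"]             # "z" is not a label in the index
except KeyError:
    loc_result = "KeyError"

try:
    s3.iloc[10]             # only positions 0 through 3 exist
except IndexError:
    iloc_result = "IndexError"

print(loc_result, iloc_result)
```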

In [78]:
s3

-1    1
b     2
c     3
0     4
dtype: int64

In [79]:
# .loc can take both integers and strings as long as they are a label in the index
s3.loc['b'], s3.loc[0]

(2, 4)

In [80]:
# can pass in a list to .loc
s3.loc[[-1, 5, 'c']]

-1    1.0
5     NaN
c     3.0
dtype: float64
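The NaN for the missing label 5 above reflects the pandas version used when this notebook was written. From pandas 1.0 onward, passing `.loc` a list that contains a missing label raises a KeyError instead. A sketch of how to get the NaN-filling behaviour explicitly with `reindex`:

```python
import pandas as pd

s3 = pd.Series([1, 2, 3, 4], index=[-1, "b", "c", 0])

# reindex returns the requested labels in order and fills missing ones with NaN
result = s3.reindex([-1, 5, "c"])

print(result)
```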

In [81]:
# you can slice from one label to another
s3.loc['b':]

b    2
c    3
0    4
dtype: int64

In [82]:
# .loc does not treat integers as positions: 1 is not a label in this mixed index
# s3.loc[1:] TypeError

In [83]:
# lets look at iloc - only takes integer positions. works just like a list. finally simple!
s3.iloc[3], s3.iloc[0], s3.iloc[-3]

(4, 1, 2)

In [84]:
# slice just like lists
s3.iloc[-2:]

c    3
0    4
dtype: int64

In [85]:
s3.iloc[:-1]

-1    1
b     2
c     3
dtype: int64

In [86]:
#get every other element
s3.iloc[::2]

-1    1
c     3
dtype: int64

In [87]:
# can slice out of range just like lists
s3.iloc[5:500]

Series([], dtype: int64)

In [88]:
# can pass a list as well - will IndexError here if out of range
s3.iloc[[1,2,-1]]

b    2
c    3
0    4
dtype: int64

In [89]:
# all integer index
s4

2     5
5    10
9     2
dtype: int64

In [90]:
# since all index values are ints, integer lookups and slicing work for .loc
s4.loc[2:5]

2     5
5    10
dtype: int64

### .at[ ] and .iat for scalar lookups
.at and .iat are analogous to .loc and .iloc except they only lookup single scalar values - no lists or slices

.at accepts a single label location
.iat accepts a single integer location

In [91]:
s3.iat[3], s3.at['b']

(4, 2)

### A use case for .ix
.ix is not all bad... though it should be avoided when possible. Since it does have the ability to handle both positional and label locations, it works well with multiindexed series and dataframes - those with more than one index level. 

# Why all the fuss over this index
Indexes in Series and DataFrames play a huge (and perhaps surprising) role in pandas. Let's begin with one of these surprising examples.

In [92]:
# no more strange mixed string/integer index series
# it's typical to fill a series with a numpy array
s1 = pd.Series(np.arange(10))
s2 = pd.Series(np.arange(10), index = np.arange(1, 11))

In [93]:
s1

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

In [94]:
s2

1     0
2     1
3     2
4     3
5     4
6     5
7     6
8     7
9     8
10    9
dtype: int64

In [95]:
# These series differ only in their index (shifted by 1) and have the same values
# first let's just add the values (which are just the original numpy arrays)
s1.values + s2.values

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

### That made sense. Let's see what happens when we add the series together

In [96]:
# wtf?
s1 + s2

0      NaN
1      1.0
2      3.0
3      5.0
4      7.0
5      9.0
6     11.0
7     13.0
8     15.0
9     17.0
10     NaN
dtype: float64

### What happened?
A couple of things look wrong. Not only are the values one off from the original numpy array addition, there are two NaNs (not a number, also known as missing values). What happened here is the key to many pandas operations. The series automatically aligned on their indexes (not by integer position). Index 1 in `s1` got aligned with index 1 in `s2`, which produced the operation 1 + 0. Only the indices 1 - 9 got paired up. Index 0 from s1 and index 10 from s2 had no partner, so each produced a NaN result. This is similar to a SQL outer join.

Let's look at one more example to drive the point home.

In [97]:
# The only index label in common was 'd'. The rest are missing
s1 = pd.Series(np.arange(4), index=list('abcd'))
s2 = pd.Series(np.arange(4), index=list('defg'))
s1 + s2

a    NaN
b    NaN
c    NaN
d    3.0
e    NaN
f    NaN
g    NaN
dtype: float64
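If the NaNs produced by unmatched labels are not wanted, the `add` method accepts a `fill_value` that stands in for the missing side. A minimal sketch using the two shifted integer-indexed series from earlier:

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.arange(10))
s2 = pd.Series(np.arange(10), index=np.arange(1, 11))

aligned = s1 + s2                   # labels 0 and 10 have no partner -> NaN
filled = s1.add(s2, fill_value=0)   # treat the missing partner as 0 instead

print(aligned.isna().sum())
print(filled.loc[0], filled.loc[10])
```

Whether 0 is a sensible stand-in depends on the data; it is used here only to show the mechanism.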

### More surprises with indexes
There is no enforcement of uniqueness on the index, so all the elements in your series can have the same index value. Let's look at some operations on series when there are duplicated index values.

In [98]:
s1 = pd.Series(np.arange(8), index=list('aaaabbbb'))
s2 = pd.Series(np.arange(5), index=list('aabbc'))

In [99]:
s1 + s2

a     0.0
a     1.0
a     1.0
a     2.0
a     2.0
a     3.0
a     3.0
a     4.0
b     6.0
b     7.0
b     7.0
b     8.0
b     8.0
b     9.0
b     9.0
b    10.0
c     NaN
dtype: float64

In [552]:
# A bit advanced, but it gives a clearer look at how the aligning happens
df = s1.to_frame(name='s1').join(s2.to_frame('s2'), how='outer')
df['sum'] = df['s1'] + df['s2']
df

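|   | s1 | s2 | sum |
|---|----|----|-----|
| a | 0.0 | 0 | 0.0 |
| a | 0.0 | 1 | 1.0 |
| a | 1.0 | 0 | 1.0 |
| a | 1.0 | 1 | 2.0 |
| a | 2.0 | 0 | 2.0 |
| a | 2.0 | 1 | 3.0 |
| a | 3.0 | 0 | 3.0 |
| a | 3.0 | 1 | 4.0 |
| b | 4.0 | 2 | 6.0 |
| b | 4.0 | 3 | 7.0 |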


### A Cartesian product aligns the values first and then the sum executes
A cartesian product is simply every element in the first series pairing up with every matching element in the second series. If there were 5 'a' indices in the first series and 7 'a' indices in the second series then there would be 35 total 'a' indices after summing both series (like we did above). This behavior is very different than a numpy array which simply aligns elements by their position first and then adds them. 

Also worth mentioning is that index aligning allows you to add series of different lengths while numpy arrays of different lengths cannot be added.
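The 5 by 7 case described above can be checked directly. This sketch builds two series whose labels are all 'a' (made-up data) and also shows numpy refusing to add arrays of different lengths.

```python
import numpy as np
import pandas as pd

left = pd.Series(np.arange(5), index=list("aaaaa"))
right = pd.Series(np.arange(7), index=list("aaaaaaa"))

total = left + right       # every 'a' on the left pairs with every 'a' on the right
n_rows = len(total)        # 5 * 7 pairs

# numpy has no index to align on, so arrays of different lengths cannot be added
try:
    np.arange(5) + np.arange(7)
except ValueError:
    numpy_result = "ValueError"

print(n_rows, numpy_result)
```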

### More differences with numpy
pandas gives you more leniency by ignoring missing data when doing aggregations on your data.

In [100]:
# store the series from above in a variable and store its values in a numpy array
s3 = s1 + s2
array = s3.values

In [101]:
# By default numpy returns nan when doing an aggregation on an array that contains missing values.
# numpy does have special functions to get around this behavior
array.mean(), array.sum(), array.max(), np.nanmean(array)

(nan, nan, nan, 5.0)

In [102]:
# pandas, on the other hand, skips missing data by default and returns a non-missing value
# you can set skipna=False to match numpy's default
s3.mean(), s3.sum(), s3.max(), s3.min(skipna=False)

(5.0, 80.0, 10.0, nan)

# Boolean Indexing
A more common method of extracting values from a series is to choose them based on certain criteria. Boolean indexing is done by passing a boolean (only true/false values) array or series to the [ ] operator. If a series is passed, every index label in the outer series must have a boolean value associated with it in the inner series. If an array is passed, it must be the same length as the series.

As usual, let's do some boolean indexing by hand.

In [103]:
# Create a series that maps bools to each index label from s3
keep = pd.Series([True, False, True], index=list('abc'))
keep

a     True
b    False
c     True
dtype: bool

In [104]:
# This will keep only a and c labels
s3[keep]

a    0.0
a    1.0
a    1.0
a    2.0
a    2.0
a    3.0
a    3.0
a    4.0
c    NaN
dtype: float64

In [105]:
# Let's get a series based on some criteria
criteria = s3 > 5
criteria

a    False
a    False
a    False
a    False
a    False
a    False
a    False
a    False
b     True
b     True
b     True
b     True
b     True
b     True
b     True
b     True
c    False
dtype: bool

In [106]:
# we can now pass this criteria to our original series to get only values above 5
s3[criteria]

b     6.0
b     7.0
b     7.0
b     8.0
b     8.0
b     9.0
b     9.0
b    10.0
dtype: float64

In [107]:
# you can do this all in one step without an intermediate variable
s3[s3 > 5]

b     6.0
b     7.0
b     7.0
b     8.0
b     8.0
b     9.0
b     9.0
b    10.0
dtype: float64

In [108]:
# Get just the index b
criteria = s3.index == 'b'
criteria

array([False, False, False, False, False, False, False, False,  True,
        True,  True,  True,  True,  True,  True,  True, False], dtype=bool)

In [109]:
# and now display the index b
s3[criteria]

b     6.0
b     7.0
b     7.0
b     8.0
b     8.0
b     9.0
b     9.0
b    10.0
dtype: float64

In [110]:
# Wait, wasn't that really repetitive?
# of course! We could have just done
s3.loc['b']

b     6.0
b     7.0
b     7.0
b     8.0
b     8.0
b     9.0
b     9.0
b    10.0
dtype: float64

In [111]:
%timeit s3.loc['b']

10000 loops, best of 3: 135 µs per loop


In [112]:
%timeit s3[s3.index == 'b']

1000 loops, best of 3: 479 µs per loop


In [113]:
# 3-4 times faster

### More complex boolean slicing
Any number of boolean conditions can be strung together to retrieve certain values, just as they can in Python. Instead of `and` use the `&` symbol, instead of `or` use the `|` symbol, and wrap each condition in parentheses.
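The parentheses matter because `&` and `|` bind more tightly than comparisons such as `==`, so an unwrapped condition would be grouped in a way you did not intend. A small sketch that writes the grouping out explicitly, checks it against plain Python `and`/`or`, and uses `~` as the counterpart of `not`:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(500))

# & binds tighter than |, but the extra parentheses make the grouping explicit
criteria = ((s % 2 == 0) & (s % 13 == 0)) | (s % 100 == 0)
selected = s[criteria].tolist()

# the same filter written with plain Python and/or for comparison
expected = [x for x in range(500) if (x % 2 == 0 and x % 13 == 0) or x % 100 == 0]

# ~ plays the role of `not`
not_selected = s[~criteria]

print(selected == expected, len(selected), len(not_selected))
```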

In [114]:
# let's start with a long series of numbers
s = pd.Series(np.arange(500))

In [115]:
# Let's get all the numbers that are divisible by both 2 and 13
# or divisible by 100
criteria = (s % 2 == 0) & (s % 13 == 0) | (s % 100 == 0)

In [116]:
# every number here must meet the above criteria
s[criteria]

0        0
26      26
52      52
78      78
100    100
104    104
130    130
156    156
182    182
200    200
208    208
234    234
260    260
286    286
300    300
312    312
338    338
364    364
390    390
400    400
416    416
442    442
468    468
494    494
dtype: int64

In [117]:
# we can also use aggregations in our boolean indexing
# generate 500 mean = 0, std = 1 normal random variables
# Multiply by 100 to get a larger spread, round to get whole numbers, and add 300 to shift the mean to 300
s = pd.Series(np.round(np.random.randn(500) * 100) + 300)

In [118]:
# Let's first get a quick look at the distribution of points
# use value_counts method to get the most common number
# since this is a normal distribution we would expect the numbers closest to its mean (300) to show up the most
# Since there are likely hundreds of unique points, use head to get the first n rows. n defaults to 5 if left blank
s.value_counts().head(10)

305.0    7
293.0    6
314.0    5
239.0    5
218.0    5
345.0    5
341.0    5
322.0    4
292.0    4
269.0    4
dtype: int64

### Use boolean selection to test assumptions of normality
Let's pretend we didn't know this data was generated from a normal distribution but wanted to test the assumption that about 2/3 (roughly 68%) of the data lie within 1 standard deviation of the mean. We will need both the mean and the standard deviation to get started.

In [119]:
# get mean and std
mean = s.mean()
std = s.std()
mean, std

(299.822, 97.52559047550129)

In [120]:
# get a True for every value less than 1 standard deviation away from the mean
criteria = abs(s - mean) / std < 1

In [121]:
# select only the true values
s1 = s[criteria]

In [122]:
# To see if 2/3 of the data remain we need the length of the series. There are many ways to get this
len(s1), s1.size, s1.shape[0]

(342, 342, 342)

In [123]:
# finally, divide the length of the filtered series by the length of the original series
s1.size / s.size

0.684
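The select-then-divide steps can also be collapsed into one: since True counts as 1 and False as 0, the mean of a boolean series is the fraction of True values. A sketch on freshly generated data (seeded arbitrarily, with made-up names, so the number will not match the 0.684 above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # arbitrary seed for repeatability
data = pd.Series(np.round(rng.standard_normal(500) * 100) + 300)

# True where a value is less than 1 standard deviation from the mean
within_one_std = abs(data - data.mean()) / data.std() < 1

# fraction computed by selecting and dividing lengths
by_lengths = data[within_one_std].size / data.size

# fraction computed directly as the mean of the boolean series
by_mean = within_one_std.mean()

print(by_lengths, by_mean)
```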

### How to check the entire series for truth
Let's say we wanted to ensure that all observations were within 4 standard deviations of the mean (falling farther out than that is a very improbable outcome for any one observation). We can use the `.all` or `.any` methods.

In [124]:
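# build the boolean series in one step: True where a value is less than 4 std from the mean
within_4_std = abs(s - mean) / std < 4

# Are all of them less than 4 std?
# call .all() on the boolean series itself; on the selected values it would only test that they are nonzero
within_4_std.all()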

True

In [125]:
# alternatively you could ask if any of them were more than 4 standard deviations away
beyond_4_std = abs(s - mean) / std > 4
beyond_4_std.any()

False
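One thing to keep in mind: `.all` and `.any` test the values of whatever series they are called on, so they answer a question about a condition only when called on the boolean series that the condition produces. A small sketch with made-up values shows the difference:

```python
import pandas as pd

# hypothetical values; the 0 is included on purpose
vals = pd.Series([0, 1, 2, 3])

# boolean series: one True/False per element
mask = vals < 10

# called on the mask: "does every element satisfy the condition?"
print(mask.all())        # True

# called on the selected values: "is every selected value nonzero?"
print(vals[mask].all())  # False, because 0 counts as False

# an empty selection contains nothing truthy, so any() is False
print(vals[vals > 10].any())  # False
```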

# Your Turn!
We have covered quite a bit of material and now it's your turn to practice. Since we are still getting our feet wet, these problems are repetitive by design and drill the basics of what we just covered.

## Problem 1
<span  style="color:green; font-size:16px">Create a 3 element pandas series by hand with characters as the index</span>

In [658]:
# your code here

## Problem 2
<span  style="color:green; font-size:16px">Another way to create a series is to pass the pandas series constructor a dictionary. Make the keys integers and the values strings.</span>

In [659]:
# your code here

## Problem 3
<span  style="color:green; font-size:16px">Create a series by passing the constructor a np.random.rand array. Make the array 1000 elements long and store it in a variable s (we will keep referring to this variable in the rest of the problems). Print all the series methods again using the `dir` command and explore the use of at least 5 methods not used in the lecture above. idxmax is a good one to get you started. To get help, use the help function, the ? command, or press shift + tab + tab after the opening parenthesis. No stackoverflow here :) </span>

In [660]:
# your code here

## Problem 4
<span  style="color:green; font-size:16px"> Sort the series by descending value and save it to another variable s1</span>

In [None]:
# your code here

## Problem 5
<span  style="color:green; font-size:16px">Now re-sort the series s1 in place (meaning it will return None) by index ascending</span>

In [None]:
# your code here

## Problem 6
<span  style="color:green; font-size:16px">Use the series method `all` and boolean logic to show that your series s equals s1</span>

In [None]:
# your code here

## Problem 7
<span  style="color:green; font-size:16px">Slicing! Using iloc, slice series s:<ol><li>Retrieve the first 6 elements</li><li>Retrieve every 18th element</li><li>Reverse the series</li><li>Write two ways to get every 4th element starting from the 993rd element to the 593rd</li><li>Chain your slicing multiple times by getting every other element, then every third element, then every 4th element, then every 5th element</li></ol>
</span>

In [None]:
# your code here (use keyboard shortcut ESC + B to make more cells below for each slice)

## Problem 8
<span  style="color:green; font-size:16px">What will happen when you create two series that have no indexes in common and you add them together? Test it out by making two small series and adding them together</span>

In [661]:
# your code here

## Problem 9
<span  style="color:green; font-size:16px"> Lookup the definition for interquartile range and smartly slice your way from series s until you are left with the IQR </span>

In [None]:
# your code here

## Problem 10
<span  style="color:green; font-size:16px">Find the integer position of the largest element in series s. Slice the series from the beginning to the largest element and find the cumulative product of the sliced series</span>

In [None]:
# your code here

## Problem 11
<span  style="color:green; font-size:16px">A quick way to count the number of true values in a series is simply to sum it up as True = 1 and False = 0 in python. Find the number of values in series s less than 1.</span>

In [None]:
# your code here

## Problem 12
<span  style="color:green; font-size:16px">Use boolean indexing to get all numbers in series s between -1 and 1 except for those between -.3 and -.2</span>

In [None]:
# your code here

## Problem 13
<span  style="color:green; font-size:16px">Assign series s to s1 `s1 = s` and then change the value of the 0th element of s1 to 100. What has happened to s? We didn't talk about series assignment, but does it work as you would expect?</span>

In [676]:
#your code here

## Problem 14
<span  style="color:green; font-size:16px">If you are getting a loooooooong output, think about using the `head` method to trim it down..... Continuing from the above problem, assign s to s1 but this time copy it over (research this) and then assign the 100 to the 0th element of s1. What happened to s this time? </span>

In [None]:
# your code here

## Problem 15
<span  style="color:green; font-size:16px">Assign every 2nd element of s1 the mean of s. What happened to s1? Is it the same size? Now assign elements 100 to 900 the variance of s. How much has s1 changed?</span>

In [682]:
# your code here

## Problem 16
<span  style="color:green; font-size:16px">Using the del statement, can you delete every 5th element of a series?</span>

In [683]:
# your code here