# $$\textit{pandas}$$
$$\text{Schwartz}$$



## $$\textbf{They're not on the endangered species list any more}$$
# $$\textbf{China saved the Giant Panda}$$

As of September 2016, the International Union for Conservation of Nature (IUCN) has taken the global icon off the endangered species list. The giant panda is now listed as "vulnerable" rather than "endangered". Thanks to Chinese conservation efforts, the giant panda population has increased to approximately 2,000 individuals, up from a low of about 1,000 in the 1970s, when the species was most at risk. This increase is the result of bamboo forest restoration projects and improved captive breeding and husbandry methods, initiatives driven by a better understanding of giant panda physiology and behaviour. Challenges remain, however. Climate change is predicted to wipe out more than one-third of the panda's bamboo habitat in the next 80 years, and reintroducing captive-bred pandas into the wild remains difficult.

<table align="center">
<tr><td>
<img src="stuff/pandas.png" width="600px" align="center"> 
</td></tr>
</table>

# Before we do pandas...
* Connect to a Postgres server 
    * in python 
        * using psycopg2
* Understand psycopg2's "cursors"
    * executes
    * commits
* Generate dynamic queries

# psycopg2
* A Python interface to PostgreSQL servers

In [1]:
# install homebrew: http://brew.sh

# brew cask install postgres -> double click -> applications
# brew cask install pgadmin3 -> double click -> applications, click plug
# brew tap homebrew/services
# brew services start postgresql

# https://github.com/zipfian/welcome/blob/master/notes/postgres_setup.md

# conda install psycopg2
import psycopg2

In [2]:
# Step 1: open a connection
conn = psycopg2.connect(dbname='my_postgresql_db',port=5432,password="",
                        user='schwarls37',host='localhost')

# host could be a remote database as well

<table align="center">
<tr>
<td><img src="stuff/whywouldyoudothat1.jpg" width="300px" align="center"></td> 
<td><img src="stuff/whywouldyoudothat2.jpg" width="415px" align="center"></td> 
</tr>
</table>
<table align="center">
<tr>
<td><img src="stuff/whywouldyoudothat3.jpg" width="400px" align="center"></td> 
<td><img src="stuff/whywouldyoudothat4.jpg" width="305px" align="center"></td> 
</tr>
</table>

## Allows us to combine data sources in one place
* Can use python to simultaneously pull data from other databases as well 
    * mysql-connector-python (MySql)
    * sqlite (SQLite)
    * pymongo (MongoDB)
    * sqlalchemy (all the things)
    * [psycopg2 (PostgreSQL), obviously]

## Allows us to bring other python tools to bear
* DataFrames and associated functionality, Machine Learning tools, etc.

## Allows for easy dynamic query generation
* And hence, automation
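A minimal sketch of dynamic query generation. To keep it runnable without a Postgres server, it assumes an in-memory SQLite database from the standard library `sqlite3` module as a stand-in; the `beds` table, its rows, and the `beds_in_cities` helper are all hypothetical. Note that `sqlite3` uses `?` placeholders where psycopg2 uses `%s`.

```python
import sqlite3

# Stand-in database: an in-memory SQLite table (hypothetical schema and rows).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE beds (facility TEXT, city TEXT, available INTEGER)")
cur.executemany(
    "INSERT INTO beds VALUES (?, ?, ?)",
    [("Alpha Home", "Cuba", 4), ("Beta Home", "Olean", 7), ("Gamma Home", "Cuba", 0)],
)
conn.commit()

def beds_in_cities(cursor, cities):
    """Size the IN (...) clause from len(cities), then bind the values as parameters."""
    placeholders = ", ".join("?" for _ in cities)  # only '?' marks are formatted in
    query = ("SELECT facility, available FROM beds "
             "WHERE city IN ({}) ORDER BY facility".format(placeholders))
    cursor.execute(query, list(cities))
    return cursor.fetchall()

cuba_rows = beds_in_cities(cur, ["Cuba"])
both_rows = beds_in_cities(cur, ["Cuba", "Olean"])
print(cuba_rows)
print(both_rows)
conn.close()
```

Only the shape of the query (the number of placeholders) is generated by Python; the values themselves are still passed separately, which keeps the automation safe.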

In [3]:
# Step 2: create a cursor object
#cur.close()
cur = conn.cursor()

# The cursor sends commands to the database and steps through the results 
# We don't have to worry about how it does it
# Query results can be iterated over only once (like a generator) 

In [6]:
# Step 3: execute some SQL queries
query = '''SELECT "Facility Name", "Available Residential Beds" 
           FROM Beds 
           WHERE "City" = 'Cuba' 
           LIMIT 10;'''

cur.execute(query)

In [7]:
# If you see this error 

print "InternalError: current transaction is aborted, commands ignored until end of transaction block"

# Then an earlier command in the current transaction failed, 
# and the transaction needs to be rolled back with: conn.rollback()

InternalError: current transaction is aborted, commands ignored until end of transaction block


In [8]:
for row in cur:
    print row

('Cuba Memorial Hospital Inc SNF                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ', 4)
('Cuba Memorial Hospital Inc SNF                                                                                                                                                                                                                                                                                                                                                                                                                                                                            

In [9]:
for row in cur:
    print row
    
# See, I told you: cur saves results as a (one pass) generator    

In [None]:
# Other options for iterating through the generator

# cur.fetchone(), or cur.next()
# cur.fetchmany(n)
# cur.fetchall()
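The fetch methods listed above can be sketched as follows. This again assumes an in-memory `sqlite3` database as a stand-in for Postgres (its cursors expose the same `fetchone`/`fetchmany`/`fetchall` methods); the table `t` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (n INTEGER)")
cur.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(6)])

cur.execute("SELECT n FROM t ORDER BY n")
first = cur.fetchone()        # one row (a tuple), or None when nothing is left
next_two = cur.fetchmany(2)   # a list of up to 2 rows
rest = cur.fetchall()         # everything that remains
leftover = cur.fetchall()     # results were already consumed, so this is empty
print(first, next_two, rest, leftover)
conn.close()
```

Each call picks up where the previous one stopped, which is the same one-pass behaviour seen when looping over the cursor twice.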

In [10]:
query = '''ALTER TABLE Beds 
           RENAME COLUMN "Available Residential Beds" 
           TO "Available_Residential_Beds"'''

cur.execute(query)

In [11]:
# Step 4: commit SQL actions 
# (to actually make the changes to the DB permanent)

conn.commit()
# conn.autocommit = True

# database level operations are also available

In [12]:
query = '''ALTER TABLE Beds 
           RENAME COLUMN "Available_Residential_Beds" 
           TO "Available Residential Beds"'''
cur.execute(query)
conn.commit()

In [None]:
my_name = "Scott"
unsafe_query = '''SELECT * FROM Users 
                  WHERE Name = ''' + my_name

# what if...
my_name = "Scott; DROP TABLE Users"

# This is called SQL Injection and it's obviously risky

In [None]:
# Instead 
my_name = "Scott; DROP TABLE Users"

# parameters must be passed as a sequence, hence the one-element tuple
cur.execute('''SELECT * FROM Users WHERE Name = %s''', (my_name,))

# will search for rows in Name *exactly* equal to 'Scott; DROP TABLE Users'
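A runnable sketch of the same idea, assuming an in-memory `sqlite3` database as a stand-in for Postgres (so the placeholder is `?` rather than psycopg2's `%s`); the `users` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (name TEXT)")
cur.executemany("INSERT INTO users VALUES (?)", [("Scott",), ("Ada",)])

my_name = "Scott; DROP TABLE users"

# The value is bound as data; it is never spliced into the SQL text.
cur.execute("SELECT * FROM users WHERE name = ?", (my_name,))
matches = cur.fetchall()

# The table is still there, with both rows intact.
cur.execute("SELECT COUNT(*) FROM users")
n_users = cur.fetchone()[0]
print(matches, n_users)
conn.close()
```

No row has that odd name, so the search comes back empty and nothing is dropped.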

In [None]:
# Step 5: close the connection

cur.close() # optional, automatically close with conn.close()
conn.close()


# Pandas are cute cuddly animals
* They are also the Flying Circus' answer to Excel and R Data Frames
* They are built on top of NumPy NdArrays

# Objectives
* Proficiency with Pandas Series
    * Familiarity with Pandas Time Series
* Proficiency with Pandas DataFrames
    * Using the DataFrame Index
    * Creating and destroying columns
* Proficiency applying functions to rows and columns
    * DataFrame grouping and aggregation
    * DataFrames sorting
* Proficiency in linking DataFrames
    * Concatenating/Appending DataFrames
    * Merging/Joining DataFrames
* Familiarity with matplotlib and Pandas Exploratory Data Analysis (EDA) functionality 

# Pandas is very functional

<table align="center">
<tr>
<td><img src="stuff/panda7.jpg" width="300px" align="center"></td> 
<td><img src="stuff/panda2.jpg" width="180px" align="center"></td> 
<td><img src="stuff/panda6.jpg" width="475px" align="center"></td> 
</tr>
<tr><td>Killer Panda</td><td>Red Handed Panda</td><td>Sexy Panda</td></tr>
</table>
<table align="center">
<tr>

<td><img src="stuff/panda4.jpg" width="230px" align="center"></td> 
<td><img src="stuff/panda3.jpg" width="300px" align="center"></td> 
<td><img src="stuff/panda1.jpg" width="205px" align="center"></td> 
<td><img src="stuff/panda8.jpg" width="210px" align="center"></td> 
</tr>
<tr><td>Sheriff Panda</td><td>Assisted Pushup Panda</td><td>Acrobat Panda (Advanced)</td><td>Acrobat Panda (Beginner)</td></tr>
</table>

# (Standard Library) Lists
* concatenate

In [13]:
[1,2,3] + [4,5,6]

[1, 2, 3, 4, 5, 6]

# Numpy NdArrays
* operate elementwise

In [14]:
import numpy as np

In [15]:
np.array([1,2,3]) + np.arange(3) + np.linspace(10,12,3)

array([ 11.,  14.,  17.])
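A side-by-side sketch of the difference, using small made-up values: the same `+` and `*` operators concatenate and repeat lists but do arithmetic on arrays.

```python
import numpy as np

py_list = [1, 2, 3]
np_array = np.array([1, 2, 3])

list_plus = py_list + py_list      # concatenation
list_times = py_list * 2           # repetition
array_plus = np_array + np_array   # elementwise addition
array_times = np_array * 2         # elementwise multiplication

print(list_plus, list_times)
print(array_plus, array_times)
```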

# Numpy NdArrays

* have types

In [16]:
ints = np.array(range(3))
chars = np.array(list('ABC'))
strings = np.array(['A','BC',"DEF"])

print ints.dtype, chars.dtype, strings.dtype

int64 |S1 |S3
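Because an array holds a single type, mixing values forces NumPy to pick one common dtype. A small sketch with made-up values (under Python 3 the string dtypes are reported as Unicode, `<U...`, rather than the `|S...` shown above):

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed_numeric = np.array([1, 2.5, 3])     # the ints are upcast to float
mixed_text = np.array([1, "two", 3.0])    # everything becomes a string
as_float = ints.astype(float)             # explicit conversion returns a new array

print(ints.dtype, mixed_numeric.dtype, mixed_text.dtype, as_float.dtype)
```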


# Speed

https://ipython.org/ipython-doc/3/interactive/magics.html

In [17]:
numpy_array = np.arange(0, 1000000)
python_list = range(1000000)

print "python list"
time = %timeit -r 1 -o sum(python_list)
print time.all_runs[0]/time.loops 

print "\n" + "numpy array"
time = %timeit -r 1 -o np.sum(numpy_array)
print time.all_runs[0]/time.loops

print "\n" + "numpy array -- standard library sum"
time = %timeit -r 1 -o sum(numpy_array)
print time.all_runs[0]/time.loops

python list
100 loops, best of 1: 10.8 ms per loop
0.0108169817924

numpy array
1000 loops, best of 1: 844 µs per loop
0.000843533992767

numpy array -- standard library sum
10 loops, best of 1: 103 ms per loop
0.103136014938


# Broadcasting

http://docs.scipy.org/doc/numpy-1.10.1/user/basics.broadcasting.html

In [18]:
a = np.array([[10], [-10]]) 
b = np.array([[1, 2], [-1, -2]]) 

print a.shape, b.shape 
print "\n"
print a + b

# elements will "duplicate, expand, and fill up" 
# to make the dimensions compatible for element-wise operations
# cool.

(2, 1) (2, 2)


[[ 11  12]
 [-11 -12]]
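One way to make the "duplicate, expand, and fill up" step visible is `np.broadcast_to`, which returns the stretched array that the addition uses conceptually. This is a sketch of the idea, not a description of NumPy's internals.

```python
import numpy as np

a = np.array([[10], [-10]])        # shape (2, 1)
b = np.array([[1, 2], [-1, -2]])   # shape (2, 2)

# Stretch the size-1 axis of `a` so that its shape matches `b`.
a_stretched = np.broadcast_to(a, b.shape)
total = a + b
print(a_stretched)
print(total)

# Axes are compatible only when they are equal or one of them is 1.
try:
    np.zeros((2, 3)) + np.zeros((2, 2))
    mismatch_ok = True
except ValueError:
    mismatch_ok = False
print(mismatch_ok)
```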


In [19]:
a = np.array([[10, 0, -10],[-10, 0, -10]]) 
b = np.array([[2,2],[-1,-1]]) 
print a.shape, b.shape 
print ""
print a + b

# it's not clear how it should fill up in this case... so it can't/doesn't

(2, 3) (2, 2)



ValueError: operands could not be broadcast together with shapes (2,3) (2,2) 

In [25]:
# dimension dementia -- don't be caught holding the bag!

a = 10
a = np.array(10)
a = np.array([10])
a = np.array([[[10]]])
a = np.array([[10],[10]])
b = np.array([[1,2],[-1,-2]])

print a.shape, b.shape
#print "\n"
#print a + b

(2, 1) (2, 2)


# Pandas Series
* are (one dimensional) np.ndarray vectors **with an index**


In [None]:
import pandas as pd

In [None]:
series = pd.Series([5775,373,7,42,np.nan,33])
print series
print "\n"
print series.shape

In [None]:
world_series = pd.Series(["cubs","royals","giants","sox","giants","cards","giants","...",None])
world_series
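What the index buys you: arithmetic between two Series lines up values by label, not by position. A small sketch with made-up labels:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Labels present in only one Series ("a" and "d") come out as NaN.
total = s1 + s2
print(total)
```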

# Pandas Date Series
* are fancy

In [None]:
bdays = pd.date_range(start='19821107', periods=34+1, freq=pd.DateOffset(years=1))
bdays

### After you learn Pandas you might care about using Date Series Types and the following could be useful:
* df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)
* df['Date'] = pd.to_datetime(df['Date'])
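A short sketch of the second hint, assuming a hypothetical `Date` column of ISO-formatted strings. Once the column is a datetime type, the `.dt` accessor exposes its parts.

```python
import pandas as pd

df = pd.DataFrame({"Date": ["2016-09-04", "2016-12-25"], "value": [1, 2]})

df["Date"] = pd.to_datetime(df["Date"])   # converts the whole column at once
years = df["Date"].dt.year.tolist()
months = df["Date"].dt.month.tolist()
print(df.dtypes)
print(years, months)
```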



# Pandas DataFrames
* are a set of Pandas Series **that share the same index** 
<br>
<br>

$$\huge \text{Pandas DataFrame} \supset \text{Pandas Series} \supset \text{NumPy Array}$$
<br>

In [None]:
mixedTypes_df = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3] * 4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","try"]),
                     'F' : 'foo' })

# mixedTypes_df.to_dict() 

print mixedTypes_df
print ""
print mixedTypes_df.shape
print ""
mixedTypes_df.dtypes

# Additional Numpy NdArray
* stuff that will likely be useful at some point

In [None]:
noise_ndarray = np.random.randn(35,6)

print noise_ndarray
print noise_ndarray.shape

#print noise_ndarray.flatten().shape
#print noise_ndarray.flatten() # copy

#print noise_ndarray.ravel().shape
#print noise_ndarray.ravel() # view
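The copy/view comments above can be checked directly. In this sketch the array is freshly created and contiguous, which is the case where `ravel` can return a view; for other memory layouts it may have to copy.

```python
import numpy as np

arr = np.zeros((2, 3))

flat_copy = arr.flatten()   # always a copy
flat_copy[0] = 1.0
unchanged = arr[0, 0]       # the original is untouched

flat_view = arr.ravel()     # a view here, because the layout allows it
flat_view[0] = 2.0
changed = arr[0, 0]         # the original sees the write

print(unchanged, changed)
print(np.shares_memory(arr, flat_copy), np.shares_memory(arr, flat_view))
```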

In [None]:
zeros_ndarray = np.zeros((3,4))        # Create a matrix of zeros with 3 rows and 4 columns. 
ones_ndarray = np.ones((10,20))        # Create a matrix of ones with 10 rows and 20 columns.
identity_ndarray = np.identity(10)     # Create an identity matrix with 10 rows and 10 columns. 
random_ndarray = np.random.rand(2, 2)  # Create a 2x2 array of random floats ranging from 0 to 1. 
range_ndarray = np.arange(0, 20, 0.5)  # Create a numpy array with arguments (start, end, step_size).
zeros_ndarray

In [None]:
# np bonus (+ pandas foreshadowing):
# applying functions by rows or columns

print noise_ndarray
print noise_ndarray.shape
print "\n" + "sum, axis=0"
print noise_ndarray.sum(axis=0)
print "\n" + "sum, axis=1"
print noise_ndarray.sum(axis=1)
print "\n" + "mean, axis=0"
print noise_ndarray.mean(axis=0)
print "\n" + "std, axis=0"
print noise_ndarray.std(axis=0)
print "\n" + "max, axis=0"
print noise_ndarray.max(axis=0)
print "\n" + "min, axis=0"
print noise_ndarray.min(axis=0)
print "\n" + "argmax, axis=0"
print noise_ndarray.argmax(axis=0)
print "\n" + "argmin, axis=0"
print noise_ndarray.argmin(axis=0)
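With random numbers it is hard to see which way `axis` runs, so here is the same idea on a tiny array with made-up values: `axis` names the dimension that gets collapsed.

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])      # shape (2, 3)

col_sums = arr.sum(axis=0)       # collapse the rows:    one value per column
row_sums = arr.sum(axis=1)       # collapse the columns: one value per row
col_argmax = arr.argmax(axis=0)  # row position of the maximum in each column

print(col_sums, row_sums, col_argmax)
```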

# Manipulating Pandas Indexes

In [None]:
noise_df = pd.DataFrame(noise_ndarray, index=bdays, columns=list('ABC123'))
#noise_df
#noise_df.values # as_matrix()
#noise_df.index
#noise_df.index.values #.dtype

#noise_df.index.tolist()
#noise_df.columns

#noise_df.reset_index(inplace=True)
#noise_df.rename(columns={'index': 'mybdayz'}, inplace=True)

#noise_df.set_index("A", inplace=True)
#noise_df.reset_index().set_index("A")

#noise_df.rename(columns={'A': 'a'}, inplace=True)

#noise_df

In [None]:
noise_df.T

# Pandas Sorting

In [None]:
noise_df
#noise_df.sort_index(axis=1, ascending=False)
#noise_df.sort_index(axis=0, ascending=False)
#noise_df.sort_values(by='B')

# Accessing Data in Pandas
* is kind of special

In [None]:
print noise_ndarray[:4,:4]
#is this a matrix?

print noise_df.head()
#noise_df.columns
#noise_df[:4,:4]

# ANSWER THESE
#how to get rows?
#noise_df[1]
###bdays[7]
#how to get columns?
#noise_df['B']
#.values
#.tolist()

In [None]:
watchOut_df = pd.DataFrame(np.random.randn(3,3), index=list('ABC'), columns=range(3))
print watchOut_df
print ""
print watchOut_df.columns
print ""

#watchOut_df[:1]
#watchOut_df[[0,1]]
#watchOut_df["1"]
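A sketch of what the three commented lines do, using fixed values instead of random ones so the results are easy to read. The point to watch: plain `[]` means rows for a slice but columns for a label or list of labels, and the integer label `1` is not the string `"1"`.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  index=list("ABC"), columns=range(3))

first_row = df[:1]        # a slice inside [] selects rows by position
two_cols = df[[0, 1]]     # a list inside [] selects columns by label
one_col = df[1]           # the integer label 1

try:
    df["1"]               # there is no column with the string label "1"
    string_label_found = True
except KeyError:
    string_label_found = False

print(first_row)
print(two_cols)
print(one_col.tolist(), string_label_found)
```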

# The *.loc*

In [None]:
print noise_df[:3]
print ""
print bdays[:3]
print ""
print noise_df.columns, noise_df.columns.values, noise_df.columns.tolist()

#noise_df.loc[bdays[7],["B","C"]]
#noise_df.loc[["B","C"]] # hint: ,

#noise_df.loc[:,[1,2]] # hint: "
#noise_df.loc[:3,["B","C"]] # hint: bdays[7]

#noise_df.loc[bdays[:3],["B","C"]]
#noise_df.loc[bdays[1]:bdays[3],["B","C"]]

# The *.iloc*
* as opposed to the *.loc*

In [None]:
print noise_df.iloc[bdays[0]:bdays[3],["B","C"]] # errors out: .iloc wants integer positions, not labels
#print noise_df.iloc[2:5,2:5]
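A compact comparison on a small made-up frame with a date index: `.loc` takes labels and includes the end of a slice, while `.iloc` takes integer positions and excludes it.

```python
import pandas as pd

dates = pd.date_range("2016-01-01", periods=5, freq="D")
df = pd.DataFrame({"B": range(5), "C": range(10, 15)}, index=dates)

by_label = df.loc[dates[1]:dates[3], ["B", "C"]]   # labels; end label included
by_position = df.iloc[1:3, 0:2]                    # positions; end excluded

print(by_label)
print(by_position)
```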

# The *.ix*
* as opposed(?) to the *.loc* and the *.iloc*
* deprecated in pandas 0.20 and removed in 1.0, so prefer *.loc* and *.iloc* in new code

In [None]:
print noise_df.ix[:4,1:3]
print ""
print noise_df.ix[:bdays[3],1:3]
print ""
print noise_df.ix[:4,['B','C']]
print ""
print noise_df.ix[:bdays[3],['B','C']]
print ""

# The *.at/.iat*?
* gets you a single scalar. fast.
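A minimal sketch on a made-up frame: `.at` is the label-based scalar accessor and `.iat` the position-based one.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]}, index=list("abc"))

by_label = df.at["b", "y"]      # row label, column label
by_position = df.iat[1, 1]      # row position, column position
df.at["c", "x"] = 30            # both accessors also support assignment

print(by_label, by_position)
print(df)
```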

# Boolean Indexing (i.e., row selecting)

In [None]:
# OH, BTW, LOOK! READING IN CSV'S... 
# Just like with RDBMS's you can basically read in any file flavor you want...

Schools_df = pd.read_csv("stuff/Schools.csv")
Players_df = pd.read_csv("stuff/SchoolsPlayers.csv")

# & | ~ == != VERSUS and or not equals

pd.DataFrame(Schools_df[Schools_df.schoolState.isin(["TX"])])
#reshape(3,13)
# schoolNick.unique()
# len()
# .reshape(13,3)
# pd.DataFrame()    
# [['schoolNick', 'schoolCity', 'schoolName']]
# sort_values(by=['schoolNick','schoolCity'])

#Schools_df[(Schools_df.schoolState.isin(["TX"]) & Schools_df.schoolNick.str.contains("Tigers")) 
#           | ((Schools_df.schoolCity.astype(str) == "Austin") & 
#              (Schools_df["schoolName"].astype(str) != "University of Texas at Austin")) ] 

In [None]:
kp = (Schools_df.schoolState.isin(["TX"]) & Schools_df.schoolNick.str.contains("Tigers")) | \
    ((Schools_df.schoolCity.astype(str) == "Austin") & ~(Schools_df.schoolName.astype(str) == "University of Texas at Austin"))
print Schools_df.ix[kp, ["schoolName","schoolNick"]]
print ""
print Schools_df.loc[kp, "schoolName":"schoolNick"]
print ""
print Schools_df.iloc[kp ,1:2]
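Why `& | ~` and not `and or not`: the symbols work element by element, while the keywords ask each Series for a single truth value and fail. A sketch on a small hypothetical frame that reuses the column names from above (the rows are made up):

```python
import pandas as pd

schools = pd.DataFrame({
    "schoolName": ["Alpha College", "Beta University", "Gamma State"],
    "schoolState": ["TX", "TX", "NY"],
    "schoolNick": ["Tigers", "Owls", "Tigers"],
})

in_tx = schools.schoolState == "TX"
tigers = schools.schoolNick.str.contains("Tigers")

tx_tigers = schools[in_tx & tigers]        # elementwise AND
tx_or_tigers = schools[in_tx | tigers]     # elementwise OR
not_tx = schools[~in_tx]                   # elementwise NOT

try:
    schools[in_tx and tigers]              # `and` needs one truth value per Series
    python_and_works = True
except ValueError:
    python_and_works = False

print(tx_tigers)
print(python_and_works)
```

When comparisons are written inline, wrap each one in parentheses, since `&` and `|` bind more tightly than `==`.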

In [None]:
noise_ndarray = np.random.randn(35,6)
noise_df = pd.DataFrame(noise_ndarray, index=bdays, columns=list('ABC123'))
print noise_df[:3]
print ""
noise_df[noise_df > 0]
#noise_df[noise_df > 0] = 0
#noise_df

#.as_matrix()
#mat[~np.isnan(mat)]

# *Copy* versus *View* 
* and not accidentally editing another variable's memory 

In [None]:
noise_ndarray = np.random.randn(35,6)
noise_df = pd.DataFrame(noise_ndarray, index=bdays, columns=list('ABC123'))
abs_noise_df = noise_df

print "noise_df[:3]" 
print noise_df[:3] 

abs_noise_df[abs_noise_df < 0] = -abs_noise_df[abs_noise_df < 0] 

print "\n"+"abs_noise_df[:3]"
print abs_noise_df[:3] 

print "\n"+"noise_df[:3]" 
print noise_df[:3] 
print "\n"+"noise_ndarray[:3]" 
print noise_ndarray[:3] 

# Do you have a problem with this?

In [None]:
noise_ndarray = np.random.randn(35,6)
noise_df = pd.DataFrame(noise_ndarray, index=bdays, columns=list('ABC123'))
abs_noise_df = noise_df.copy()

In [None]:
abs_noise_df[abs_noise_df < 0] = -abs_noise_df[abs_noise_df < 0] 

print "abs_noise_df[:3]"
print abs_noise_df[:3] 
print "\n"+"noise_df[:3]" 
print noise_df[:3] 
print "\n"+"noise_ndarray[:3]" 
print noise_ndarray[:3] 

# much better
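The same point stripped down to three made-up numbers: plain assignment only creates a second name for the same DataFrame, while `.copy()` creates an independent one.

```python
import pandas as pd

original = pd.DataFrame({"a": [-1.0, 2.0, -3.0]})

alias = original                 # a second name for the same object
independent = original.copy()    # a separate object with its own data

alias[alias < 0] = 0.0           # also changes `original`
independent.loc[0, "a"] = 99.0   # does not

print(alias is original, independent is original)
print(original["a"].tolist(), independent["a"].tolist())
```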

# Adding Columns

In [None]:
mixedTypes_df = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3] * 4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","try"]),
                     'F' : 'foo' })

mixedTypes_augmented_df = mixedTypes_df.copy()

mixedTypes_augmented_df['F'] = mixedTypes_augmented_df['F'] + " fighter"
mixedTypes_augmented_df['G'] = 'hommies'

print "mixedTypes_df"
print mixedTypes_df
print "\n"+"mixedTypes_augmented_df"
print mixedTypes_augmented_df

# columns can be removed with the del keyword (demonstrated later)

# Missing Values

In [None]:
mixedTypes_df[mixedTypes_df.E=="test"]=np.nan
print mixedTypes_df

In [None]:
print mixedTypes_df
print ""
print mixedTypes_df.dropna(how='any') #df.dropna(subset=['a']) # dropna also takes inplace=True
print ""
print mixedTypes_df

In [None]:
mixedTypes_df.F = mixedTypes_df['F'].fillna(value="I pity the")

print mixedTypes_df#.isnull()
print "\n"
print pd.isnull(mixedTypes_df)
print "\n"
print pd.notnull(mixedTypes_df)


mixedTypes_df.loc[pd.isnull(mixedTypes_df.C),'D'] = 20
mixedTypes_df

# Applying functions to Data
## A.k.a., transforming data, doing stuff to data, etc.
<br>
$$\huge \text{NumPy Array} \subset \text{Pandas Series} \subset \text{Pandas DataFrame}$$

* http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics

In [None]:
for i in range(6):
    noise_df.iloc[i,i] = np.nan

print noise_df[:8]
print "\n" + "mean, axis=0"
print noise_df.mean(axis=0) 
print "\n" + "mean, axis=1"
print noise_df.mean(axis=1) 

In [None]:
print Schools_df.head()
Schools_df.schoolState.value_counts()

In [None]:
popular_names = Schools_df.schoolNick.value_counts()
pd.set_option('display.max_rows',len(popular_names))
print popular_names
pd.set_option('display.max_rows',10)

# Discretization

In [None]:
print noise_df.A[1:].quantile([0, .25, .5, .75, 1])
print ""
print pd.qcut(noise_df.A,[0, .25, .5, .75, 1])[:5]
print ""
print pd.cut(noise_df.A,3)[:5]
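The difference between the two is easiest to see on skewed data. In this sketch (made-up values with one outlier), `qcut` puts the bin edges at quantiles so the bins hold equal counts, while `cut` splits the value range into equal-width bins.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

equal_count = pd.qcut(values, 4)   # bin edges at the quartiles
equal_width = pd.cut(values, 4)    # 4 bins of equal width across the range

qcut_counts = equal_count.value_counts(sort=False).tolist()
cut_counts = equal_width.value_counts(sort=False).tolist()

print(qcut_counts)
print(cut_counts)
```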

# Apply

In [None]:
print noise_df.apply(lambda x: x.max() - x.min()) #np.log(np.abs(x))

* [Guru God Level Extra Credit] Transform versus Apply -- what's the difference?
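One possible answer, sketched in the group-by setting with made-up data (an assumption about what the question is after): `apply` with a reducing function returns one value per group, while `transform` returns a result aligned to the original rows.

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "v": [1.0, 3.0, 10.0]})

# apply with a reducing function: one value per group, indexed by group key
per_group = df.groupby("g")["v"].apply(lambda s: s.mean())

# transform: same length and index as the input; each row gets its group's value
per_row = df.groupby("g")["v"].transform("mean")

df["v_centered"] = df["v"] - per_row
print(per_group)
print(df)
```

Because the `transform` result lines up with the original rows, it can be used directly in column arithmetic, as in the centering step.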

In [None]:
print noise_df[:3]
print ""
print noise_df.describe()
print ""
print mixedTypes_df
print ""
print mixedTypes_df.describe()

# Plotting
* for good, not evil

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
import pylab
pylab.rcParams['figure.figsize']=(6,5)

In [None]:
import matplotlib
matplotlib.style.use('classic')
matplotlib.style.use('ggplot')

print plt.style.available 

In [None]:
noise_nparray = np.random.randn(35,6)
noise_df = pd.DataFrame(noise_nparray, index=bdays, columns=list('ABC123'))
for i in range(6):
    noise_df.iloc[i,i] = np.nan
    
random_walk_df = noise_df.apply(np.cumsum)

In [None]:
print random_walk_df
random_walk_df = random_walk_df.reset_index()
print ""
print random_walk_df.columns
print ""
print random_walk_df
del random_walk_df['index'] # or df.drop('index', inplace=True, axis=1)
print ""
print random_walk_df.columns

random_walk_df.plot()

In [None]:
plt.figure()
random_walk_df.hist()
plt.tight_layout()

In [None]:
random_walk_df.plot(kind='box')

In [None]:
pylab.rcParams['figure.figsize']=(17,5)
random_walk_df.plot(kind='bar')
pylab.rcParams['figure.figsize']=(7,5)

In [None]:
# in pandas >= 0.20 use: from pandas.plotting import scatter_matrix
from pandas.tools.plotting import scatter_matrix
scatter_matrix(random_walk_df, alpha=0.9, figsize=(10, 10), diagonal='kde')
pylab.rcParams['figure.figsize']=(6,5)

* http://pandas.pydata.org/pandas-docs/stable/visualization.html
* http://matplotlib.org
* http://matplotlib.org/users/style_sheets.html
* https://stanford.edu/~mwaskom/software/seaborn/ 
* http://bokeh.pydata.org/en/latest/


# Elementwise Operations
* with broadcasting

In [None]:
pylab.rcParams['figure.figsize']=(7,5)
standardized_walk_df = (random_walk_df - random_walk_df.mean()) / random_walk_df.std()

plt.figure()
standardized_walk_df.plot()
pylab.rcParams['figure.figsize']=(6,5)

In [None]:
tmp = noise_df[["1","2","3"]] #.copy()? # tmp.iloc[1,1] = 1.0 # print noise_df.iloc[1,1]

print noise_df[:6]
print ""
print tmp[:6]

#tmp.columns = list('ABC')
noise_df[["A","B","C"]] + tmp
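Elementwise operations between DataFrames align on labels first, which is why adding frames with different column names yields only NaN. A sketch with small made-up frames, showing two ways around it:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame(np.ones((2, 2)), columns=["A", "B"])
right = pd.DataFrame(np.ones((2, 2)), columns=["1", "2"])

misaligned = left + right           # no shared labels: union of columns, all NaN

right_renamed = right.copy()
right_renamed.columns = ["A", "B"]  # give the columns matching labels
aligned = left + right_renamed

by_position = left + right.values   # or drop the labels and add the raw array

print(misaligned)
print(aligned)
```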

# Concatenating
* adding *rows*
* see also: df.append() (deprecated in pandas 1.4 and removed in 2.0; use pd.concat instead)
* http://pandas.pydata.org/pandas-docs/stable/merging.html

In [None]:
print noise_df[:3]
print ""
print noise_df[30:]
print ""
print pd.concat([noise_df[:3],noise_df[30:]])

In [None]:
pd.concat([noise_df[["1","2","3"]],noise_df[["A","B","C"]]])

In [None]:
A = noise_df[["1","2","3"]]
A = A.reset_index()
del A['index']
A.columns = list('ABC')

B = noise_df[["A","B","C"]]
B = B.reset_index()
del B['index']

C = pd.concat([A,B])
print C
print ""
print C.loc[1,:] # what's going on here with multiple results?
print ""
C = pd.concat([A,B],ignore_index=True)
print C
print ""
print C.loc[1,:] # did that fix it?
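A smaller version of the same experiment with made-up frames: `pd.concat` keeps each piece's original index, so labels can repeat unless `ignore_index=True` renumbers the rows.

```python
import pandas as pd

top = pd.DataFrame({"A": [1, 2]})
bottom = pd.DataFrame({"A": [3, 4]})

stacked = pd.concat([top, bottom])                       # index is 0, 1, 0, 1
renumbered = pd.concat([top, bottom], ignore_index=True) # index is 0, 1, 2, 3

dup_rows = stacked.loc[1]       # label 1 occurs twice: a DataFrame with two rows
single_row = renumbered.loc[1]  # label 1 is unique:    a Series

print(stacked.index.is_unique, renumbered.index.is_unique)
print(dup_rows)
print(single_row)
```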

# Merging
* adding *columns*
* see also: df.join
* http://pandas.pydata.org/pandas-docs/stable/merging.html

In [None]:
schools_df = pd.read_csv('stuff/Schools.csv')
print schools_df[:3] 
print schools_df.shape
players_df = pd.read_csv('stuff/SchoolsPlayers.csv')
print ""
print players_df[:3]
print players_df.shape

pd.merge(schools_df, players_df, on='schoolID')

# SQL Style Joining
* Left, right, inner, outer...

In [None]:
left = pd.DataFrame({'key': ['foo', 'foo', 'bar'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['foo', 'foo','post'], 'rval': ["A", "B", "C"]})

print "X"
print left
print "\n" + "Y"
print right
print "\n" + "X outer join Y"
print pd.merge(left, right, on='key', how='outer')
print "\n" + "X inner join Y"
print pd.merge(left, right, on='key', how='inner')
print "\n" + "X left join Y"
print pd.merge(left, right, on='key', how='left')
print "\n" + "X right join Y"
print pd.merge(left, right, on='key', how='right')
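To see at a glance how the four join types differ, this sketch counts the rows each one produces for the same two frames, and uses `indicator=True` to label where each row of the outer join came from.

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "foo", "bar"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["foo", "foo", "post"], "rval": ["A", "B", "C"]})

row_counts = {how: len(pd.merge(left, right, on="key", how=how))
              for how in ["inner", "left", "right", "outer"]}

# indicator=True adds a "_merge" column: both, left_only, or right_only
outer = pd.merge(left, right, on="key", how="outer", indicator=True)
origin = outer["_merge"].value_counts().to_dict()

print(row_counts)
print(origin)
```

The two `foo` rows on each side match each other in every combination, which is why the inner join has four rows rather than two.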

# Group By
* Aggregate, Apply
* http://pandas.pydata.org/pandas-docs/stable/groupby.html

In [None]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

print df
print ""
df.groupby('A').size()

In [None]:
print df.groupby(['A','B']).sum()
print ""
print df.groupby('A').sum()
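A sketch of the "Aggregate" half with fixed values, so the numbers can be checked by hand: `agg` computes several summaries per group in one call, and grouping by two keys gives a result indexed by both.

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                   "B": ["one", "one", "two", "two"],
                   "C": [1.0, 2.0, 3.0, 4.0]})

summary = df.groupby("A")["C"].agg(["sum", "mean", "count"])
by_two_keys = df.groupby(["A", "B"])["C"].sum()   # result carries a MultiIndex

print(summary)
print(by_two_keys)
```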

# Remember, Sorting is just done as a sort -- *not* a Group By
* You just sort by multiple columns

In [None]:
print df.sort_values(by = ['A','C'])

# Multi-Indexing
* group by structuring

In [None]:
index = pd.MultiIndex.from_tuples(names=['first', 'second'],
            tuples = list(zip(['bar', 'bar', 'baz', 'baz','foo', 'foo', 'qux', 'qux'],
                              ['one', 'two', 'one', 'two','one', 'two', 'one', 'two'])))

df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=['A', 'B', 'C'])
df

# Stacking
* and unstacking

In [None]:
stacked = df.stack()
stacked

In [None]:
stacked.unstack() #.unstack() 

In [None]:
print stacked
stacked.unstack(0)
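The same moves on a small frame with fixed values (a sketch; the labels are borrowed from the example above): `stack` pushes the columns into the innermost index level, and `unstack` pulls an index level back out into columns.

```python
import pandas as pd

index = pd.MultiIndex.from_product([["bar", "baz"], ["one", "two"]],
                                   names=["first", "second"])
df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], index=index, columns=["A", "B"])

stacked = df.stack()                 # columns become the innermost index level
round_trip = stacked.unstack()       # innermost level goes back to the columns
wide_by_first = stacked.unstack(0)   # move level 0 ("first") to the columns instead

print(stacked)
print(wide_by_first)
```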

# Pandas supports all sorts of data types
So we noted all the RDBMS support
* http://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html
* http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_query.html
* http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html


And we've seen CSV import capability 

Even .xlsx is supported
* http://pandas.pydata.org/pandas-docs/stable/io.html


So is Pickle

So is pretty much everything else

# And of course we can write out in all of these formats, as well
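A sketch of two round trips that need no files or servers: CSV through an in-memory text buffer, and SQL through an in-memory SQLite database (a stand-in for Postgres). The frame and the `scores` table name are made up.

```python
import io
import sqlite3
import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "score": [1.5, 2.5]})

# CSV round trip through an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
from_csv = pd.read_csv(buffer)

# SQL round trip through an in-memory SQLite database
conn = sqlite3.connect(":memory:")
df.to_sql("scores", conn, index=False)
from_sql = pd.read_sql_query("SELECT * FROM scores WHERE score > 2", conn)
conn.close()

print(from_csv)
print(from_sql)
```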

<br>





In [None]:
jes = pd.read_excel("stuff/Robert Distribution Environmental Data 2016.08.04.xlsx",
                    skiprows=1).iloc[:38,:]

jes.rename(columns = {'Evaluation Criteria':'Category', 'Unnamed: 1':'Evaluation Criteria'}, inplace = True)
jes['Route 2'] = pd.to_numeric(jes['Route 2'])
jes['Route 6'] = pd.to_numeric(jes['Route 6'])

jes

# Pivot Tables
* http://pandas.pydata.org/pandas-docs/version/0.15.2/reshaping.html

In [None]:
pd.pivot_table(jes, index = ['Category', 'Priority'],
               aggfunc = [len, max], values = ['Route 1', 'Route 7'])

In [None]:
m_pd = pd.merge(Schools_df, Players_df, on='schoolID')
pd.crosstab(m_pd.schoolState, m_pd.schoolNick)
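Since the spreadsheet and CSV files used above are not included here, this sketch shows the same two calls on a small made-up frame: `crosstab` counts how often each pair of values occurs, and `pivot_table` aggregates a value column over the same grid.

```python
import pandas as pd

df = pd.DataFrame({"state": ["TX", "TX", "NY", "NY", "TX"],
                   "nick": ["Tigers", "Owls", "Tigers", "Tigers", "Tigers"],
                   "players": [10, 20, 30, 40, 50]})

counts = pd.crosstab(df.state, df.nick)   # how often each pair occurs
totals = pd.pivot_table(df, index="state", columns="nick",
                        values="players", aggfunc="sum")

print(counts)
print(totals)   # pairs that never occur show up as NaN
```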


<br>

# When will I _ever_ EVEN have to use Pandas? 


<br>

<br>

<br>

<br>

<br>

E.g., "When you need a team of animals to pull a sled across the expansive frozen tundra?"


<br>

# Actual Answer: all the time.
<br>

<br>

So you might as well get good at it sooner rather than later... so...

<br>



# Some other "Intro To Pandas" notebooks that I like a lot
* https://github.com/zipfian/DSI_Lectures/blob/master/pandas/sallamander/numpy_notes.ipynb
* https://github.com/zipfian/DSI_Lectures/blob/master/pandas/numpy_pandas.ipynb  
* http://pandas.pydata.org/pandas-docs/stable/10min.html

# The Official Documentation
* http://pandas.pydata.org/pandas-docs/stable/index.html

# An In-House Cheat Sheet
* https://github.com/zipfian/precourse/tree/master/Chapter_4_Pandas#functions-i-use-all-the-time