# Pandas

## What is Pandas?
A Python library providing data structures and data analysis tools.


## Benefits

* Efficient storage and processing of data.
* Includes many built-in functions for data transformation, aggregation, and plotting.
* Great for exploratory work.

## Not so greats

* Does not scale well to datasets larger than memory, since a `DataFrame` is held entirely in RAM.

## Documentation:

* http://pandas.pydata.org/pandas-docs/stable/index.html

In [1]:
#By convention import pandas like:
import pandas as pd

#By convention import numpy like:
import numpy as np


#Make sure you have both lines when using matplotlib in Jupyter notebook
import matplotlib.pyplot as plt
%matplotlib inline


#For fake data.
from numpy.random import randn

# Pandas 
* Pandas objects are built on top of NumPy ndarrays.
* http://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html

## Objectives

* Create `Series` and `DataFrame`s from Python data types. 
* Create `DataFrame`s from on-disk data.
* Index and slice `pandas` objects.
* Aggregate data in `DataFrame`s.
* Join multiple `DataFrame`s.

# Pandas is built on NumPy
* NumPy is one of the fundamental packages for scientific computing in Python.


## NumPy Arrays
* Also called ndarrays (n-dimensional arrays).
* They are like Python lists, but they allow much faster computation because:
    1. The values are stored in one contiguous block of memory. A list stores only references, and the objects they point to can be scattered across memory.
    2. Every item in a NumPy array has the same data type (all integers, all floats, etc.), whereas a list can hold any mix of types. We call this idea homogeneity, as opposed to the possible heterogeneity of Python lists.
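
The two properties above can be checked directly. This is a small sketch with made-up values (`mixed_list` and `arr` are illustrative names, not part of the notebook):

```python
import numpy as np

mixed_list = [1, 2.5, 3]          # a list may freely mix ints and floats
arr = np.array(mixed_list)        # NumPy upcasts everything to one common dtype

print(arr.dtype)                  # float64
print(arr.itemsize, arr.nbytes)   # 8 bytes per item, 24 bytes in total
print(arr.flags['C_CONTIGUOUS'])  # True: one contiguous block of memory
```

Because every item has the same size, NumPy can loop over the block in compiled code instead of inspecting each Python object in turn.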


Just how much faster are they? Let's sum the integers from 0 up to (but not including) 1 million, timing it with both a list and a NumPy array.


In [2]:
numpy_array = np.arange(0, 1000000)
python_list = list(range(1000000))  # range() is lazy in Python 3, so build a real list

print ("python list")
time = %timeit -r 1 -o sum(python_list)
print (time.all_runs[0]/time.loops )

print ("\n" + "numpy array")
time = %timeit -r 1 -o np.sum(numpy_array)
print (time.all_runs[0]/time.loops)

print ("\n" + "numpy array -- standard library sum")
time = %timeit -r 1 -o sum(numpy_array)
print (time.all_runs[0]/time.loops)

python list
19.4 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
0.019449680800005354

numpy array
683 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1000 loops each)
0.0006832007929999691

numpy array -- standard library sum
103 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
0.10284348160000718

In the recorded run above, `np.sum` on the NumPy array (683 µs) was far faster than the built-in `sum` on the Python sequence (19.4 ms). Calling the built-in `sum` on the NumPy array was the slowest option (103 ms), because it iterates element by element and has to wrap every value in a Python object. Prefer NumPy's own functions when working with arrays.


# NumPy ndarrays

* Have a data type (`dtype`).
* Each array holds exactly one type.

In [None]:
ints = np.array(range(3))
chars = np.array(list('ABC'))
strings = np.array(['A','BC',"DEF"])

print (ints.dtype, chars.dtype, strings.dtype)

# Creating and using NdArrays

In [None]:
my_lst_ndarray = np.array([1, 2, 3, 4, 5])
my_tuple_ndarray = np.array((1, 2, 3, 4, 5), np.int32) 

In [None]:
print(my_lst_ndarray.dtype)
print(my_tuple_ndarray.dtype)

In [None]:
print(my_lst_ndarray.shape)
print(my_tuple_ndarray.shape)

# Multidimensional arrays

In [None]:
nd_arr = np.array([[1, 2, 3, 4, 5],[6, 7, 8, 9, 10],[11, 12, 13, 14, 15]])
nd_arr

# Access info in the array
* Individual data
* Slices of data

In [None]:
nd_arr[1,1]

In [None]:
nd_arr[0:2,0:2]

In [None]:
nd_arr.shape

In [None]:
nd_arr.sum()

In [None]:
nd_arr.sum(axis=1)

In [None]:
nd_arr.max()
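
The `axis` argument decides which dimension gets collapsed. A short sketch, using a fresh array named `grid` (an illustrative name) with the same values as `nd_arr`:

```python
import numpy as np

grid = np.array([[1, 2, 3, 4, 5],
                 [6, 7, 8, 9, 10],
                 [11, 12, 13, 14, 15]])

row_sums = grid.sum(axis=1)   # collapse the columns: one total per row
col_sums = grid.sum(axis=0)   # collapse the rows: one total per column
col_max = grid.max(axis=0)    # the same idea works for max, min, mean, ...

print(row_sums)  # [15 40 65]
print(col_sums)  # [18 21 24 27 30]
print(col_max)   # [11 12 13 14 15]
```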

# Broadcasting

In [None]:
a = np.array([[10], [-10]]) 
b = np.array([[1, 2], [-1, -2]]) 

print (a.shape, b.shape )
print ("\n")
print (a + b)

# elements will "duplicate, expand, and fill up" 
# to make the dimensions compatible for element-wise operations
# cool.

In [None]:
a = np.array([[10, 0, -10, 0],[-10, 0, -10, 0]]) 
b = np.array([[2,2],[-1,0]]) 
print (a.shape, b.shape )
print ("")
print (a + b)



In [None]:
# The cell above raises a ValueError: shapes (2, 4) and (2, 2) differ in the last
# dimension and neither size is 1, so it's not clear how to fill up and NumPy refuses.
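
As a rough rule, NumPy compares shapes from the last dimension backwards, and two sizes are compatible when they are equal or one of them is 1. The sketch below (variable names are illustrative) shows one compatible pair, one incompatible pair, and a 1-D row that broadcasts across every row of a 2-D array:

```python
import numpy as np

a = np.array([[10], [-10]])        # shape (2, 1)
b = np.array([[1, 2], [-1, -2]])   # shape (2, 2)
stretched = a + b                  # the size-1 axis of `a` is stretched to 2

c = np.zeros((2, 4))
d = np.zeros((2, 2))
try:
    c + d                          # last dimensions are 4 and 2: incompatible
    compatible = True
except ValueError:
    compatible = False

row = np.array([1, 2, 3, 4])       # shape (4,) lines up with the last axis of `c`
shifted = c + row                  # `row` is applied to each of the 2 rows

print(stretched)        # [[ 11  12] [-11 -12]]
print(compatible)       # False
print(shifted.shape)    # (2, 4)
```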

----------------------------------------------------

# Pandas 


## Pandas Series
* Are one-dimensional `np.ndarray` vectors **with an index**.
* They are built on ndarrays.

In [None]:
series = pd.Series([5775,373,7,42,np.nan,33])
print (series)
print ("\n")
print (series.shape)

In [None]:
world_series = pd.Series(["cubs","royals","giants","sox","giants","cards","giants","...",None])
world_series

## Pandas Series are very powerful when dealing with dates

In [None]:
#Datetime index
dt_index = pd.date_range('2015-1-1', 
                        '2015-11-1', 
                        freq='M')  # month-end frequency; pandas 2.2+ spells this 'ME'
dt_series = pd.Series(randn(10), 
                      index = dt_index)
dt_series
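
Part of what makes a datetime index powerful is that you can select by partial date strings and group by calendar fields. A minimal sketch, assuming a made-up daily series (`daily` is an illustrative name; the values are all 1 so the sums are easy to check):

```python
import pandas as pd

days = pd.date_range('2015-01-01', '2015-03-31', freq='D')
daily = pd.Series(1, index=days)              # one observation per day

february = daily.loc['2015-02']               # partial string: all of February
few_days = daily.loc['2015-02-10':'2015-02-12']  # label slices include both ends
per_month = daily.groupby(daily.index.month).sum()

print(len(february))       # 28
print(len(few_days))       # 3
print(per_month.tolist())  # [31, 28, 31]
```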

# Indexes
Notice how each series has an index (for the first two series above, a relatively meaningless default integer index).

Pandas can make great use of informative indexes. Indexes work similarly to a dictionary key, allowing fast lookups of the data associated with the index.

Indexes can also be exploited for fast group-bys, merges, time-series operations and lots more.

When you're really in the zone with pandas, you'll be thinking a lot about indexes.
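
To make the dictionary analogy concrete, here is a small sketch with made-up values (`scores` is an illustrative name):

```python
import pandas as pd

scores = pd.Series([1.5, -0.3, 0.8],
                   index=['California', 'Alabama', 'Indiana'])

one_value = scores.loc['Alabama']                 # lookup by label, like a dict key
several = scores.loc[['Indiana', 'California']]   # several labels, in the order asked
has_montana = 'Montana' in scores.index           # membership test against the index
position = scores.index.get_loc('Indiana')        # translate a label to its position

print(one_value)         # -0.3
print(several.tolist())  # [0.8, 1.5]
print(has_montana)       # False
print(position)          # 2
```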

In [None]:
indexed_series = pd.Series(randn(5), 
                           index = ['California', 'Alabama', 
                                    'Indiana', 'Montana', 
                                    'Kentucky'])
alt_indexed_series = pd.Series(randn(5),
                               index = ['Washington', 'Alabama', 
                                        'Montana', 'Indiana', 
                                        'New York'])
print (indexed_series)
print ('\n')
print (alt_indexed_series)

In [None]:
#Pandas uses the index by default to align series for arithmetic!
indexed_series + alt_indexed_series
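
Labels that appear in only one of the two series produce `NaN` in the result. A sketch with small made-up series (`s1` and `s2` are illustrative names) shows this, along with `Series.add` and its `fill_value` argument as one way to treat a missing side as 0:

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=['California', 'Alabama', 'Indiana'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['Alabama', 'Indiana', 'New York'])

total = s1 + s2                    # aligned on labels, not on position
filled = s1.add(s2, fill_value=0)  # a missing label counts as 0 instead of NaN

print(total['Alabama'])                 # 12.0
print(total.isna().sum())               # 2 (California and New York)
print(filled['California'], filled['New York'])  # 1.0 30.0
```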

# Pandas DataFrames
* are a set of Pandas Series **that share the same index** 

Think Excel spreadsheets!!!

In [None]:
pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]], 
    columns=['a', 'b', 'c'], 
    index=['foo', 'bar'])
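
A `DataFrame` can also be built from other Python data types. The sketch below builds the same small table from a dict of `Series` and from a list of dicts (the names `from_series` and `from_records` are illustrative):

```python
import pandas as pd

a = pd.Series([1, 4], index=['foo', 'bar'])
b = pd.Series([2, 5], index=['foo', 'bar'])

# Dict of Series: keys become column names, the shared index becomes the rows.
from_series = pd.DataFrame({'a': a, 'b': b})

# List of dicts: each dict is one row, keys become column names.
from_records = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 4, 'b': 5}],
                            index=['foo', 'bar'])

print(from_series)
print(from_series['a'].index.equals(from_series['b'].index))  # True
```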

In [None]:
df = pd.DataFrame(randn(10, 5), index=dt_index, columns=[x for x in 'abcde'])
df

In [None]:
#A DataFrame's columns are Series:
col = df.a
type(col)

In [None]:
#So are the rows.
row = df.loc['2015-01-31']
type(row)

In [None]:
#The columns all have the same index:
col.index   

In [None]:
#What's the index for the rows?
row.index

# Pandas DataFrame basics

In [None]:
#Slicing / access of data

# When one column is returned it is a Series (not a DataFrame)
print (type(df.a))
df.a  # only works if the name is a valid identifier that doesn't clash with a DataFrame attribute

In [None]:
df['a'] #works always
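
Here is a sketch of when attribute access fails, using a made-up frame whose column names are chosen to cause trouble (`frame` and its columns are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'count': [1, 2],
                      'total sulfur dioxide': [34.0, 67.0]})

# `frame.count` is the DataFrame.count method, not the column.
print(callable(frame.count))     # True

# Bracket access always refers to the column.
print(frame['count'].tolist())   # [1, 2]

# A name containing spaces cannot be written as an attribute at all.
print(frame['total sulfur dioxide'].tolist())  # [34.0, 67.0]
```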

In [None]:
#Return a subset / multiple columns
df[['a','b']]

# Advanced selection
## .loc / .iloc / .ix

In [None]:
df.loc['2015-05-31':'2015-08-31', 'c':'e'] #Ranges by label.

In [None]:
df.iloc[2:-3,2:5] #Ranges by number.

In [None]:
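df.ix[2:-3,2:5] #Guesses whether you mean labels or positions; raises AttributeError on pandas >= 1.0

In [None]:
df.ix['2015-05-31':'2015-08-31', 'c':'e']

# DO NOT USE .ix 
`.ix` was deprecated in pandas 0.20 and removed in pandas 1.0, so the two cells above fail on current versions. It is shown here because you may see it in older code and should know what it is. Use `.loc` for labels and `.iloc` for positions instead.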
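
One difference between `.loc` and `.iloc` is easy to miss: label slices include the end label, while position slices exclude the end position. A sketch on a small made-up frame (`frame` is an illustrative name):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(4, 3),
                     index=list('wxyz'), columns=list('abc'))

by_label = frame.loc['x':'y', 'b':'c']   # labels: both end points are included
by_position = frame.iloc[1:3, 1:3]       # positions: the end point is excluded

# Mixing positions for rows and labels for columns without .ix:
mixed = frame.loc[frame.index[1:3], ['b', 'c']]

print(by_label.values.tolist())          # [[4, 5], [7, 8]]
print(by_label.equals(by_position))      # True
print(by_label.equals(mixed))            # True
```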
  
  
--------------------------------------------------------------------------------------------     
        
      
      
# DataFrame Indexing

In [None]:
#Multi Index:
dt_index = pd.date_range('2015-1-1', 
                        '2017-7-1', 
                        freq='M')  # month-end frequency; pandas 2.2+ spells this 'ME'
df = pd.DataFrame(randn(30,5), index=dt_index)
df

In [None]:
# Adding new column
df['state'] = ['Alabama', 'Alaska' , 'Arizona'] * 10
df.head()

In [None]:
df = df.reset_index()
df = df.set_index(['state', 'index'])
df.head()

In [None]:
df.loc['Alabama'].head()

In [None]:
df.loc['2015-01-31'] #Doesn't work: .loc looks in the first level ('state') and raises a KeyError.

In [None]:
df.loc[('Alabama', '2015-01-31')] #Can do this.
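
To select on the inner level alone, one option is `DataFrame.xs` with its `level` argument. The sketch below uses a small made-up frame; the level name `date` and the column `value` are assumptions for illustration (in the notebook the inner level is called `index`):

```python
import pandas as pd

dates = pd.to_datetime(['2015-01-31', '2015-02-28'])
index = pd.MultiIndex.from_product([['Alabama', 'Alaska'], dates],
                                   names=['state', 'date'])
frame = pd.DataFrame({'value': [1, 2, 3, 4]}, index=index)

# Cross-section: every state on one date. The 'date' level is dropped from the result.
january = frame.xs(pd.Timestamp('2015-01-31'), level='date')

# A full (state, date) tuple plus a column label picks out a single cell.
one_cell = frame.loc[('Alabama', pd.Timestamp('2015-01-31')), 'value']

print(january['value'].tolist())  # [1, 3]
print(one_cell)                   # 1
```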

# Loading data from a file

In [None]:
df = pd.read_csv('data/winequality-red.csv', delimiter=';')

In [None]:
df.head()  #Display the first n rows (default is 5)

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.tail()

# Filtering (i.e., row selecting or boolean indexing)

In [None]:
trueFalse= df['chlorides'] <= 0.08 
trueFalse

In [None]:
df[trueFalse]

In [None]:
# To use a mask, we actually have to use it to index into the DataFrame (using square brackets). 
df[df['chlorides'] <= 0.08]

In [None]:
# Okay, this is cool. What if I wanted a slightly more complicated query...
df[(df['chlorides'] >= 0.04) & (df['chlorides'] < 0.08)]
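
Masks combine with `&` (and), `|` (or), and `~` (not), and each comparison needs its own parentheses. A sketch on a tiny made-up frame (the values are invented; only the column names echo the wine data), with `DataFrame.query` shown as an alternative spelling:

```python
import pandas as pd

frame = pd.DataFrame({'chlorides': [0.03, 0.05, 0.08, 0.10],
                      'quality': [5, 6, 5, 7]})

in_range = frame[(frame['chlorides'] >= 0.04) & (frame['chlorides'] < 0.08)]
either = frame[(frame['chlorides'] < 0.04) | (frame['quality'] == 7)]
not_five = frame[~(frame['quality'] == 5)]

# The same range filter written as a query string.
queried = frame.query('chlorides >= 0.04 and chlorides < 0.08')

print(in_range.index.tolist())   # [1]
print(either.index.tolist())     # [0, 3]
print(not_five.index.tolist())   # [1, 3]
print(queried.index.tolist())    # [1]
```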

In [None]:
df.groupby('quality') # Note that this returns a DataFrameGroupBy object. Nothing is computed
                      # until we perform some aggregation on it (e.g. .mean() or .count()).
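
A sketch of what an aggregation on a groupby object looks like, using a small made-up frame (values invented for easy checking):

```python
import pandas as pd

frame = pd.DataFrame({'quality': [5, 5, 6, 6, 6],
                      'alcohol': [9.0, 10.0, 11.0, 12.0, 13.0]})

grouped = frame.groupby('quality')             # nothing is computed yet

means = grouped['alcohol'].mean()              # one value per quality level
summary = grouped['alcohol'].agg(['mean', 'count'])  # several aggregates at once

print(means.to_dict())             # {5: 9.5, 6: 12.0}
print(summary.loc[6, 'count'])     # 3
```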

In [None]:
# Note we can also group by multiple columns by passing them in a list. It will group by 
# the first column first, and then by the second within each group of the first.
df2 = df.groupby(['pH', 'quality']).count()['chlorides']

df2

# Adding / removing columns

In [None]:
# add a computed column

df['pct_free_sulf'] = df['free sulfur dioxide'] / df['total sulfur dioxide']

In [None]:
df

In [None]:
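# Dropping a column

In [None]:
df.drop('pct_free_sulf')  # Raises KeyError: drop() looks for a row label by default (axis=0)

In [None]:
df.drop('pct_free_sulf', axis = 1)  # Returns a new DataFrame without the column; df itself is unchanged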

In [None]:
df.columns
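
Since `drop` returns a new object, the result has to be assigned to keep it. A sketch on a small made-up frame, using the `columns=` and `index=` keywords as a more explicit alternative to `axis`:

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]},
                     index=['foo', 'bar'])

without_c = frame.drop(columns='c')    # same as frame.drop('c', axis=1)
without_foo = frame.drop(index='foo')  # same as frame.drop('foo', axis=0)

print(list(without_c.columns))    # ['a', 'b']
print(list(without_foo.index))    # ['bar']
print(list(frame.columns))        # ['a', 'b', 'c'] -- the original is untouched
```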

# Managing Missing Values
* http://pandas.pydata.org/pandas-docs/stable/missing_data.html

In [None]:
miss_val_df = pd.DataFrame(
    [[1, 2, 3], [4, np.nan, 6]], 
    columns=['a', 'b', 'c'], 
    index=['foo', 'bar'])
miss_val_df

In [None]:
miss_val_df.fillna(0)

In [None]:
miss_val_df

In [None]:
# fillna returns a new DataFrame. To change the original, pass inplace=True (or reassign the result).
miss_val_df.fillna(0,inplace=True)
miss_val_df

In [None]:
# Drop rows that contain missing values

In [None]:
miss_val_df.loc['foo', 'b'] = np.nan  # .loc avoids chained assignment, which may silently fail

In [None]:
miss_val_df

In [None]:
miss_val_df.dropna()
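
Beyond filling with a constant, a few other common moves are counting the gaps, filling with a per-column statistic, and dropping only where specific columns are missing. A sketch on a small made-up frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, 4.0, 7.0],
                      'b': [2.0, np.nan, 8.0]})

missing_per_column = frame.isna().sum()     # how many NaNs in each column
mean_filled = frame.fillna(frame.mean())    # fill each column with its own mean
dropped = frame.dropna(subset=['b'])        # drop rows only where 'b' is missing

print(missing_per_column.to_dict())   # {'a': 0, 'b': 1}
print(mean_filled.loc[1, 'b'])        # 5.0
print(dropped.index.tolist())         # [0, 2]
```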

# Merge 
* http://pandas.pydata.org/pandas-docs/stable/merging.html

We can join DataFrames in much the same way that we join tables in SQL. In fact, left, right, outer, and inner joins work the same way here.

In [None]:
merge1 = pd.DataFrame(
    [[1, 2, 3], [4, 3, 6]], 
    columns=['a', 'b', 'c'])

merge2 = pd.DataFrame(
    [[1, 2, 3], [4, 3, 6]], 
    columns=['z', 'b', 'y'])

merge1

In [None]:
merged_df = merge1.merge(merge2, how='inner')

In [None]:
merged_df
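
When `on` is not given, `merge` joins on the column names the two frames share (here `b`). The sketch below names the key explicitly and compares join types on two small made-up frames (`left`, `right`, and the column names are illustrative):

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2, 3], 'left_val': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'right_val': ['x', 'y', 'z']})

inner = left.merge(right, on='key', how='inner')      # keys present in both
left_join = left.merge(right, on='key', how='left')   # every row of `left`
outer = left.merge(right, on='key', how='outer',
                   indicator=True)                    # adds a `_merge` column

print(inner['key'].tolist())                 # [2, 3]
print(left_join['right_val'].isna().sum())   # 1 (key 1 has no match)
print(sorted(outer['key'].tolist()))         # [1, 2, 3, 4]
```

The `_merge` column produced by `indicator=True` records whether each row came from the left frame only, the right frame only, or both, which is handy for checking a join.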

# Concatenating
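* adding *rows* (or columns, with `axis=1`)
* see also: `df.append()` in older code. It was deprecated in pandas 1.4 and removed in 2.0, so use `pd.concat` instead.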

In [None]:
df1 = pd.DataFrame(
    {'Col1': range(5), 'Col2': range(5), 'Col3': range(5)})
df2 = pd.DataFrame(
    {'Col1': range(5), 'Col2': range(5), 'Col4': range(5)},
    index=range(5, 10))

In [None]:
df1

In [None]:
df2

In [None]:
#Vertically
pd.concat([df1, df2], axis=0)

In [None]:
#Horizontally (rows are aligned on the index)
pd.concat([df1, df2], join='outer', axis=1)
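
Columns or index labels that exist in only one frame are filled with `NaN`. A sketch on two small made-up frames (`top` and `bottom` are illustrative names) also shows `join='inner'` and `ignore_index=True`:

```python
import pandas as pd

top = pd.DataFrame({'Col1': [0, 1], 'Col3': [0, 1]})
bottom = pd.DataFrame({'Col1': [2, 3], 'Col4': [2, 3]}, index=[5, 6])

stacked = pd.concat([top, bottom])                  # union of columns, NaN where missing
shared = pd.concat([top, bottom], join='inner',
                   ignore_index=True)               # shared columns only, fresh 0..n-1 index
side_by_side = pd.concat([top, bottom], axis=1)     # aligned on the union of index labels

print(stacked.shape)                   # (4, 3)
print(stacked['Col3'].isna().sum())    # 2
print(list(shared.columns), shared.index.tolist())  # ['Col1'] [0, 1, 2, 3]
print(side_by_side.shape)              # (4, 4)
```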

# Graphing DataFrames

In [None]:
df = pd.read_csv('data/playgolf.csv', delimiter=',' )
df.head()

In [None]:
df.hist(['Temperature','Humidity'],bins=5)

In [None]:
df[['Temperature','Humidity']].plot(kind='box')

In [None]:
df.plot('Temperature', 'Humidity', kind='scatter')

In [None]:
groups=df.groupby('Outlook')
for name, group in groups:
    print(name)

In [None]:
fig, ax = plt.subplots()

ax.margins(0.05)
for name, group in groups:
    ax.plot(group.Temperature, group.Humidity, marker='o', linestyle='', ms=12, label=name)
ax.legend(numpoints=1, loc='lower right')

plt.show()

In [None]:
df['Outlook'].value_counts()