# Pandas

## What is Pandas?
A Python library providing data structures and data analysis tools.


## Benefits

* Efficient storage and processing of data.
* Includes many built-in functions for data transformation, aggregation, and plotting.
* Great for exploratory work.

## Not so greats

* Does not scale well to datasets larger than available memory, because pandas holds its data in RAM.

## Documentation:

* http://pandas.pydata.org/pandas-docs/stable/index.html

In [1]:
#By convention import pandas like:
import pandas as pd

#By convention import numpy like:
import numpy as np


#Make sure you have both lines when using matplotlib in Jupyter notebook
import matplotlib.pyplot as plt
%matplotlib inline


#For fake data.
from numpy.random import randn

# Pandas 
* Pandas data structures are built on top of NumPy NdArrays
* http://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html

## Objectives

* Create `Series` and `DataFrame`s from Python data types. 
* Create `DataFrame`s from on-disk data.
* Index and Slice `pandas` objects.
* Aggregate data in `DataFrame`s.
* Join multiple `DataFrame`s.

# Pandas is built on Numpy
* Numpy is one of the fundamental packages for scientific computing in Python.


## Numpy Arrays
* Also called NdArrays (n-dimensional arrays)
* They are like Python lists, but they allow faster computation because:
    1. They are stored as one contiguous block of memory, rather than being spread out across multiple locations like a list. 
    2. Each item in a numpy array is of the same data type (i.e. all integers, all floats, etc.), rather than a conglomerate of any number of data types (as a list is). We call this idea homogeneity, as opposed to the possible heterogeneity of Python lists.
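
Both properties can be checked directly on an array. The following is a minimal sketch with made-up values; `mixed_list` and `arr` are illustrative names only.

```python
import numpy as np

# A Python list can hold mixed types; converting it forces one common dtype.
mixed_list = [1, 2.5, 3]
arr = np.array(mixed_list)

print(arr.dtype)                   # float64: the integers were upcast to floats
print(arr.flags['C_CONTIGUOUS'])   # True: stored as one contiguous block
print(arr.itemsize, arr.nbytes)    # 8 bytes per item, 24 bytes in total
```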


Just how much faster are they? Let's take the numbers from 0 to 1 million, and sum those numbers, timing it with both a list and a numpy array.


In [3]:
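numpy_array = np.arange(0, 1000000)
# Note: in Python 3, range() is a lazy sequence rather than a true list;
# use list(range(1000000)) to time an actual list.
python_list = range(1000000)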

print("python list")
time = %timeit -r 1 -o sum(python_list)
print (time.all_runs[0]/time.loops )

print("\n" + "numpy array")
time = %timeit -r 1 -o np.sum(numpy_array)
print (time.all_runs[0]/time.loops)

print("\n" + "numpy array -- standard library sum")
time = %timeit -r 1 -o sum(numpy_array)
print(time.all_runs[0]/time.loops)

python list
19.6 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 100 loops each)
0.019594595560000698

numpy array
705 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1000 loops each)
0.0007054206039999826

numpy array -- standard library sum
85.2 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
0.08522267759999522
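
`%timeit` is an IPython magic, so the timing cell only runs inside IPython or Jupyter. Outside a notebook, a rough equivalent can be sketched with the standard-library `timeit` module. The array size and repeat count here are arbitrary choices, and the timings will differ from machine to machine.

```python
import timeit
import numpy as np

n = 100_000
repeats = 20
numpy_array = np.arange(n, dtype=np.int64)
python_list = list(range(n))

# Both approaches must agree on the answer before the timings mean anything.
total = int(np.sum(numpy_array))
assert sum(python_list) == total

list_seconds = timeit.timeit(lambda: sum(python_list), number=repeats)
numpy_seconds = timeit.timeit(lambda: np.sum(numpy_array), number=repeats)

print(f"python list: {list_seconds / repeats:.6f} s per call")
print(f"numpy array: {numpy_seconds / repeats:.6f} s per call")
```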


# Numpy NdArrays

* have types
* Each array is of one type

In [4]:
ints = np.array(range(3))
chars = np.array(list('ABC'))
strings = np.array(['A','BC',"DEF"])

print(ints.dtype, chars.dtype, strings.dtype)

int64 <U1 <U3


# Creating and using NdArrays

In [5]:
my_lst_ndarray = np.array([1, 2, 3, 4, 5])
my_tuple_ndarray = np.array((1, 2, 3, 4, 5), np.int32) 

In [6]:
print(my_lst_ndarray.dtype)
print(my_tuple_ndarray.dtype)

int64
int32


In [7]:
print(my_lst_ndarray.shape)
print(my_tuple_ndarray.shape)

(5,)
(5,)


# 2D arrays

In [8]:
nd_arr = np.array([[1, 2, 3, 4, 5],[6, 7, 8, 9, 10],[11, 12, 13, 14, 15]])
nd_arr

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15]])

# Access info in the array
* Individual data
* Slices of data

In [9]:
nd_arr[1,1]

7

In [10]:
nd_arr[0:2,0:2]

array([[1, 2],
       [6, 7]])

In [11]:
nd_arr.shape

(3, 5)

In [12]:
nd_arr.sum()

120

In [13]:
nd_arr.sum(axis=1)

array([15, 40, 65])
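
The `axis` argument decides which dimension is collapsed. A short sketch using the same values as `nd_arr`:

```python
import numpy as np

nd_arr = np.array([[1, 2, 3, 4, 5],
                   [6, 7, 8, 9, 10],
                   [11, 12, 13, 14, 15]])

col_sums = nd_arr.sum(axis=0)  # collapse the rows: one total per column
row_sums = nd_arr.sum(axis=1)  # collapse the columns: one total per row

print(col_sums)  # [18 21 24 27 30]
print(row_sums)  # [15 40 65]
```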

In [14]:
nd_arr.max()

15

# Broadcasting

In [15]:
a = np.array([[10], [-10]]) 
b = np.array([[1, 2], [-1, -2]]) 

print(a.shape, b.shape )
print("\n")
print(a + b)

# elements will "duplicate, expand, and fill up" 
# to make the dimensions compatible for element-wise operations
# cool.

(2, 1) (2, 2)


[[ 11  12]
 [-11 -12]]


In [16]:
print(a)
print(a+4)
print(a*3)

[[ 10]
 [-10]]
[[14]
 [-6]]
[[ 30]
 [-30]]


In [17]:
a = np.array([[10, 0, -10, 0],[-10, 0, -10, 0]]) 
b = np.array([[2,2],[-1,0]]) 
print (a.shape, b.shape )
print ("")
print (a + b)



(2, 4) (2, 2)



ValueError: operands could not be broadcast together with shapes (2,4) (2,2) 

In [None]:
# it's not clear how it should fill up in this case... so it can't/doesn't
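
One way to see what broadcasting does is `np.broadcast_to`, which performs the stretching explicitly. This is a sketch for intuition only; ordinary arithmetic does not need this call.

```python
import numpy as np

a = np.array([[10], [-10]])        # shape (2, 1)
b = np.array([[1, 2], [-1, -2]])   # shape (2, 2)

# The length-1 axis of `a` is repeated to match the shape of `b`.
stretched = np.broadcast_to(a, b.shape)
print(stretched)

# Adding the stretched copy gives the same result as a + b.
same = np.array_equal(stretched + b, a + b)
print(same)  # True

# A (2, 2) array cannot be stretched to (2, 4): the last axis has
# sizes 2 and 4, and neither of them is 1.
try:
    np.broadcast_to(np.zeros((2, 2)), (2, 4))
    failed = False
except ValueError:
    failed = True
print(failed)  # True
```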

----------------------------------------------------

# Pandas 


## Pandas Series
* are (one dimensional) np.ndarray vectors **with an index**
* They are built upon NdArrays

In [18]:
series = pd.Series([5775,373,7,42,np.nan,33])
print(series)
print("\n")
print(series.shape)

0    5775.0
1     373.0
2       7.0
3      42.0
4       NaN
5      33.0
dtype: float64


(6,)


In [19]:
world_series = pd.Series(["cubs","royals","giants","sox","giants","cards","giants","...",None])
world_series

0      cubs
1    royals
2    giants
3       sox
4    giants
5     cards
6    giants
7       ...
8      None
dtype: object

## Pandas Series are very powerful when dealing with dates

In [20]:
#Datetime index (month-end frequency; pandas 2.2+ prefers the alias 'ME' over 'M'/'m')
dt_index = pd.date_range('2015-1-1', 
                        '2015-11-1', 
                        freq='m')
dt_series = pd.Series(randn(10), 
                      index = dt_index)
dt_series

2015-01-31    0.813130
2015-02-28   -0.565614
2015-03-31    0.576677
2015-04-30    0.597328
2015-05-31    0.057497
2015-06-30   -0.004164
2015-07-31    1.134371
2015-08-31    0.374322
2015-09-30    1.593793
2015-10-31   -0.830639
Freq: M, dtype: float64
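
One thing a datetime index allows is selection by partial date strings. A small sketch with made-up values follows; the dates are listed explicitly so the example does not depend on a frequency alias, and `sales` is a hypothetical name.

```python
import pandas as pd

dates = pd.to_datetime(['2015-01-31', '2015-02-28', '2015-03-31', '2015-04-30'])
sales = pd.Series([1.0, 2.0, 3.0, 4.0], index=dates)  # hypothetical data

# Slice by month; both endpoints are included.
feb_to_mar = sales.loc['2015-02':'2015-03']
print(feb_to_mar)

# Date components are available directly on the index.
months = sales.index.month.tolist()
print(months)  # [1, 2, 3, 4]
```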

# Index
Notice that every Series has an index. The first examples used the default integer index, which carries little meaning; the date-indexed Series uses a more informative one.

Pandas can make great use of informative indexes. Indexes work similarly to a dictionary key, allowing fast lookups of the data associated with the index.

Indexes can also be exploited for fast group-bys, merges, time-series operations and lots more.

When you're really in the zone with pandas, you'll be thinking a lot about indexes.
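
The dictionary-like lookup can be sketched as follows. The values and the name `scores` are made up for illustration.

```python
import pandas as pd

# Hypothetical values, indexed by state name.
scores = pd.Series([1.5, 2.5, 3.5], index=['California', 'Alabama', 'Indiana'])

one = scores['Alabama']                          # single label, like a dict key
several = scores.loc[['Indiana', 'California']]  # several labels at once
has_montana = 'Montana' in scores.index          # membership test on the index

print(one)          # 2.5
print(several)
print(has_montana)  # False
```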

In [21]:
indexed_series = pd.Series(randn(5), 
                           index = ['California', 'Alabama', 
                                    'Indiana', 'Montana', 
                                    'Kentucky'])
alt_indexed_series = pd.Series(randn(5),
                               index = ['Washington', 'Alabama', 
                                        'Montana', 'Indiana', 
                                        'New York'])
print(indexed_series)
print('\n')
print(alt_indexed_series)

California   -1.665557
Alabama       1.447049
Indiana       1.819369
Montana      -0.237382
Kentucky     -0.506190
dtype: float64


Washington   -0.702198
Alabama       1.163738
Montana       0.652911
Indiana       0.595966
New York     -0.467933
dtype: float64


In [22]:
#Pandas uses the index by default to align series for arithmetic!
indexed_series + alt_indexed_series

Alabama       2.610787
California         NaN
Indiana       2.415335
Kentucky           NaN
Montana       0.415529
New York           NaN
Washington         NaN
dtype: float64
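
When the `NaN` results are not wanted, `Series.add` accepts a `fill_value` that stands in for labels missing on one side. A sketch with made-up numbers:

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0], index=['Alabama', 'Indiana'])
s2 = pd.Series([10.0, 20.0], index=['Alabama', 'Montana'])

plain = s1 + s2                     # NaN wherever a label is missing on one side
filled = s1.add(s2, fill_value=0)   # a missing entry is treated as 0

print(plain)
print(filled)
```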

# Pandas DataFrames
* are a set of Pandas Series **that share the same index** 
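
This relationship can be made concrete by building a DataFrame from a dictionary of Series. A minimal sketch with illustrative names:

```python
import pandas as pd

a = pd.Series([1, 4], index=['foo', 'bar'])
b = pd.Series([2, 5], index=['foo', 'bar'])

# Each dictionary key becomes a column; the Series are aligned on their index.
frame = pd.DataFrame({'a': a, 'b': b})
print(frame)

shared = frame['a'].index.equals(frame['b'].index)
print(shared)  # True
```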


In [23]:
pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]], 
    columns=['a', 'b', 'c'], 
    index=['foo', 'bar'])

     a  b  c
foo  1  2  3
bar  4  5  6


In [24]:
df = pd.DataFrame(randn(10, 5), index=dt_index, columns=[x for x in 'abcde'])
df

                    a          b          c          d          e
2015-01-31   0.001802   0.330959  -0.117159  -1.053053   0.839262
2015-02-28   1.261846   2.146015   0.943208  -1.230948   1.028456
2015-03-31   0.016157   0.540495   1.158699   1.340901   2.279944
2015-04-30   1.286424   0.055554  -0.330005   1.535541   0.745758
2015-05-31  -0.041403   0.005423  -0.466109   0.221599    0.54173
2015-06-30   0.195242  -3.000245   -0.34173  -1.030697   0.113351
2015-07-31  -0.141635  -0.181972   0.837091   1.372244   1.268435
2015-08-31  -0.730999  -1.454007  -0.632989  -0.654115   -0.91795
2015-09-30   0.596635  -0.391445  -0.589018  -1.032677  -1.176668
2015-10-31  -0.043072  -0.384681  -0.828345   1.082411   0.608415


In [None]:
# To select just one column, use brackets
df['a']

In [27]:
# To select one row, use .loc[]
df.loc['2015-10-31']

a   -0.043072
b   -0.384681
c   -0.828345
d    1.082411
e    0.608415
Name: 2015-10-31 00:00:00, dtype: float64

In [28]:
# A column of a dataframe is a series:
col = df['d']
type(col)

pandas.core.series.Series

In [29]:
# So is a row.
row = df.loc['2015-01-31']
type(row)

pandas.core.series.Series

In [30]:
#The columns all have the same index:
col.index   

DatetimeIndex(['2015-01-31', '2015-02-28', '2015-03-31', '2015-04-30',
               '2015-05-31', '2015-06-30', '2015-07-31', '2015-08-31',
               '2015-09-30', '2015-10-31'],
              dtype='datetime64[ns]', freq='M')

In [31]:
#What's the index for the rows?
row.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

# Pandas DataFrame basics

In [32]:
df['a']

2015-01-31    0.001802
2015-02-28    1.261846
2015-03-31    0.016157
2015-04-30    1.286424
2015-05-31   -0.041403
2015-06-30    0.195242
2015-07-31   -0.141635
2015-08-31   -0.730999
2015-09-30    0.596635
2015-10-31   -0.043072
Freq: M, Name: a, dtype: float64

In [33]:
# Selecting multiple columns 
df[['a','b']]

                    a          b
2015-01-31   0.001802   0.330959
2015-02-28   1.261846   2.146015
2015-03-31   0.016157   0.540495
2015-04-30   1.286424   0.055554
2015-05-31  -0.041403   0.005423
2015-06-30   0.195242  -3.000245
2015-07-31  -0.141635  -0.181972
2015-08-31  -0.730999  -1.454007
2015-09-30   0.596635  -0.391445
2015-10-31  -0.043072  -0.384681


# Advanced selection


## .loc 
select by row label (index), and column label

In [34]:
df.loc['2015-05-31':'2015-08-31', 'c':'e'] #Ranges by label.

                    c          d          e
2015-05-31  -0.466109   0.221599    0.54173
2015-06-30   -0.34173  -1.030697   0.113351
2015-07-31   0.837091   1.372244   1.268435
2015-08-31  -0.632989  -0.654115   -0.91795


## .iloc
select by __positional__ index

In [35]:
df.iloc[2:-3,2:5] #Ranges by number.

                    c          d          e
2015-03-31   1.158699   1.340901   2.279944
2015-04-30  -0.330005   1.535541   0.745758
2015-05-31  -0.466109   0.221599    0.54173
2015-06-30   -0.34173  -1.030697   0.113351
2015-07-31   0.837091   1.372244   1.268435
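
A detail that is easy to miss: label slices with `.loc` include both endpoints, while positional slices with `.iloc` exclude the end, as Python lists do. A sketch on a toy frame (names are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'x': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

by_label = frame.loc['b':'c']    # includes both 'b' and 'c'
by_position = frame.iloc[1:2]    # includes position 1 only

print(list(by_label.index))      # ['b', 'c']
print(list(by_position.index))   # ['b']
```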


## .ix (deprecated)
select by either label or position index
(deprecated in pandas 0.20 because it led to too much ambiguity, and removed entirely in pandas 1.0)

In [36]:
df.ix[2:-3,2:5] # Figures out what you probably want

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated


                    c          d          e
2015-03-31   1.158699   1.340901   2.279944
2015-04-30  -0.330005   1.535541   0.745758
2015-05-31  -0.466109   0.221599    0.54173
2015-06-30   -0.34173  -1.030697   0.113351
2015-07-31   0.837091   1.372244   1.268435


In [37]:
df.ix['2015-05-31':'2015-08-31', 'c':'e']

                    c          d          e
2015-05-31  -0.466109   0.221599    0.54173
2015-06-30   -0.34173  -1.030697   0.113351
2015-07-31   0.837091   1.372244   1.268435
2015-08-31  -0.632989  -0.654115   -0.91795


# DO NOT USE .ix 
It is here so you can recognize it and scold others for using it.
  
  
--------------------------------------------------------------------------------------------     
        
      
      
# DataFrame Indexing

In [None]:
#Multi Index:
dt_index = pd.date_range('2015-1-1', 
                        '2017-7-1', 
                        freq='m')
df = pd.DataFrame(randn(30,5), index=dt_index)
df

In [None]:
# Adding new column
df['state'] = ['Alabama', 'Alaska' , 'Arizona'] * 10
df.head()

In [None]:
df = df.reset_index()
df = df.set_index(['state', 'index'])
df.head()

In [None]:
df.loc['Alabama'].head()

In [None]:
df.loc['2015-01-31'] # Raises KeyError: the first index level is 'state', not the date.

In [None]:
df.loc[('Alabama', '2015-01-31')] # Works: pass a tuple with one label per index level.
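
To select on the inner level of a MultiIndex without naming the outer one, `DataFrame.xs` takes a `level` argument. A minimal sketch on a small made-up frame; the level name `date` and the `value` column are illustrative and do not come from the cells above.

```python
import pandas as pd

dates = pd.to_datetime(['2015-01-31', '2015-02-28'])
idx = pd.MultiIndex.from_product([['Alabama', 'Alaska'], dates],
                                 names=['state', 'date'])
frame = pd.DataFrame({'value': [1.0, 2.0, 3.0, 4.0]}, index=idx)

# Cross-section: every state for one date. The 'date' level is dropped
# from the result, leaving 'state' as the index.
january = frame.xs(pd.Timestamp('2015-01-31'), level='date')
print(january)
```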

# Load Data from file for use

In [None]:
df = pd.read_csv('data/winequality-red.csv', delimiter=';')

In [None]:
df.head()  #Display the first n rows (default is 5)

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.tail()

# Filtering (i.e., row selecting or boolean indexing)

In [None]:
trueFalse= df['chlorides'] <= 0.08 
trueFalse

In [None]:
df[trueFalse]

In [None]:
# To use a mask, we actually have to use it to index into the DataFrame (using square brackets). 
df[df['chlorides'] <= 0.08]

In [None]:
# Okay, this is cool. What if I wanted a slightly more complicated query...
df[(df['chlorides'] >= 0.04) & (df['chlorides'] < 0.08)]
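
Masks combine with `&` (and), `|` (or), and `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `>=` or `<`. A sketch on a few made-up rows; `wine` is a hypothetical stand-in for the wine data.

```python
import pandas as pd

wine = pd.DataFrame({'chlorides': [0.03, 0.05, 0.076, 0.09],
                     'quality': [5, 6, 7, 5]})   # hypothetical rows

in_range = (wine['chlorides'] >= 0.04) & (wine['chlorides'] < 0.08)
low_or_good = (wine['chlorides'] < 0.04) | (wine['quality'] >= 7)

print(wine[in_range])      # rows 1 and 2
print(wine[low_or_good])   # rows 0 and 2
print(wine[~in_range])     # rows 0 and 3
```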

In [None]:
df.groupby('quality') # Note that this returns a GroupBy object. It is not useful on its own
                      # until we perform some aggregation on it.

In [None]:
# Note we can also group by multiple columns by passing them in a list. The data is grouped by
# the first column, and then by the second column within each group of the first.
df2 = df.groupby(['pH', 'quality']).count()['chlorides']

df2
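
A GroupBy object becomes useful once an aggregation is applied. A small sketch with made-up rows; the `alcohol` values are hypothetical.

```python
import pandas as pd

wine = pd.DataFrame({'quality': [5, 5, 6],
                     'alcohol': [9.0, 10.0, 11.0]})  # hypothetical rows

# One aggregation on one column.
mean_alcohol = wine.groupby('quality')['alcohol'].mean()

# Several aggregations at once; each becomes a column in the result.
summary = wine.groupby('quality')['alcohol'].agg(['count', 'min', 'max'])

print(mean_alcohol)
print(summary)
```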

# Adding / Removing columns

In [None]:
# add a computed column

df['pct_free_sulf'] = df['free sulfur dioxide'] / df['total sulfur dioxide']

In [None]:
df

In [None]:
# Dropping a column

In [None]:
# Raises KeyError: by default drop() looks for a row label (axis=0)
df.drop('pct_free_sulf')

In [None]:
# axis=1 drops the column; this returns a new DataFrame and leaves df unchanged
df.drop('pct_free_sulf', axis = 1)

In [None]:
df.columns
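
`drop` returns a new DataFrame rather than modifying the original, so `df.columns` still lists `pct_free_sulf` unless the result is assigned back. A sketch on a toy frame (names are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# columns=[...] is equivalent to passing the label with axis=1.
trimmed = frame.drop(columns=['b'])

print(list(frame.columns))    # ['a', 'b']: the original is untouched
print(list(trimmed.columns))  # ['a']
```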

# Managing Missing Values
* http://pandas.pydata.org/pandas-docs/stable/missing_data.html

In [None]:
miss_val_df = pd.DataFrame(
    [[1, 2, 3], [4, np.nan, 6]], 
    columns=['a', 'b', 'c'], 
    index=['foo', 'bar'])
miss_val_df

In [None]:
miss_val_df.fillna(0)

In [None]:
miss_val_df

In [None]:
# IF YOU WANT THE CHANGE TO HAPPEN INPLACE YOU MUST add that
miss_val_df.fillna(0,inplace=True)
miss_val_df

In [None]:
## DROP ROW

In [None]:
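# Use a single .loc call; chained indexing like df['b']['foo'] = ... may not modify the DataFrame
miss_val_df.loc['foo', 'b'] = np.nan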

In [None]:
miss_val_df

In [None]:
miss_val_df.dropna()
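
A few other common ways to inspect and handle missing values, sketched on a frame shaped like `miss_val_df` (the name `frame` and the choice of filling with the column mean are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, 4.0], 'b': [2.0, np.nan], 'c': [3.0, 6.0]},
                     index=['foo', 'bar'])

missing_per_column = frame.isna().sum()     # count NaNs in each column
without_nan_cols = frame.dropna(axis=1)     # drop columns that contain NaN
mean_filled = frame.fillna(frame.mean())    # fill NaN with that column's mean

print(missing_per_column)
print(without_nan_cols)
print(mean_filled)
```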

# Merge 
* http://pandas.pydata.org/pandas-docs/stable/merging.html

We can join DataFrames in much the same way that we join tables in SQL.  In fact, left, right, outer, and inner joins work the same way here.

In [None]:
merge1 = pd.DataFrame(
    [[1, 2, 3], [4, 3, 6]], 
    columns=['a', 'b', 'c'])

merge2 = pd.DataFrame(
    [[1, 2, 3], [4, 3, 6]], 
    columns=['z', 'b', 'y'])

merge1

In [None]:
# With no `on` argument, merge joins on the columns both frames share (here, 'b')
merged_df = merge1.merge(merge2, how='inner')

In [None]:
merged_df
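
Because `merge1` and `merge2` contain the same `b` values, every join type gives the same rows for them. The difference shows up when the keys only partly overlap, as in this sketch with made-up frames (`left`, `right`, and the `key` column are hypothetical):

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2, 3], 'left_val': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'right_val': ['x', 'y', 'z']})

inner = left.merge(right, on='key', how='inner')      # keys in both frames
left_join = left.merge(right, on='key', how='left')   # every key from `left`
outer = left.merge(right, on='key', how='outer')      # keys from either frame

print(inner)
print(left_join)
print(outer)
```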

# Concatenating
* adding *rows*
* see also: df.append() (deprecated in pandas 1.4 and removed in 2.0; use pd.concat instead)

In [None]:
df1 = pd.DataFrame(
    {'Col1': range(5), 'Col2': range(5), 'Col3': range(5)})
df2 = pd.DataFrame(
    {'Col1': range(5), 'Col2': range(5), 'Col4': range(5)},
    index=range(5, 10))

In [None]:
df1

In [None]:
df2

In [None]:
#Vertically
pd.concat([df1, df2], axis=0)

In [None]:
#Horizontally
pd.concat([df1, df2], join='outer', axis=1)
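
When the frames do not share all their columns, `concat` fills the gaps with `NaN` by default, and `join='inner'` keeps only the shared columns. A sketch with small made-up frames (`top` and `bottom` are illustrative names):

```python
import pandas as pd

top = pd.DataFrame({'Col1': [0, 1], 'Col3': [0, 1]})
bottom = pd.DataFrame({'Col1': [2, 3], 'Col4': [2, 3]})

# ignore_index=True renumbers the rows 0..n-1 instead of keeping the old labels.
stacked = pd.concat([top, bottom], axis=0, ignore_index=True)
common_only = pd.concat([top, bottom], axis=0, join='inner', ignore_index=True)

print(stacked)       # Col3 and Col4 contain NaN where a frame lacked the column
print(common_only)   # only Col1 survives
```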

# Graphing DataFrames

In [None]:
df = pd.read_csv('data/playgolf.csv', delimiter=',' )
df.head()

In [None]:
df.hist(['Temperature','Humidity'],bins=5)

In [None]:
df[['Temperature','Humidity']].plot(kind='box')

In [None]:
df.plot('Temperature', 'Humidity', kind='scatter')

In [None]:
groups = df.groupby('Outlook')
for name, group in groups:
    print(name)

In [None]:
fig, ax = plt.subplots()

ax.margins(0.05)
for name, group in groups:
    ax.plot(group.Temperature, group.Humidity, marker='o', linestyle='', ms=12, label=name)
ax.legend(numpoints=1, loc='lower right')

plt.show()

In [None]:
df['Outlook'].value_counts()
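
`value_counts` sorts from most to least frequent, and `normalize=True` returns proportions instead of counts. A sketch on a made-up column (the values are hypothetical, not taken from `playgolf.csv`):

```python
import pandas as pd

outlook = pd.Series(['sunny', 'rain', 'sunny', 'overcast', 'sunny', 'rain'])

counts = outlook.value_counts()                 # occurrences of each value
shares = outlook.value_counts(normalize=True)   # fraction of rows per value

print(counts)
print(shares)
```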