# pandas

pandas is the primary module of interest for this course, as it provides data processing and analysis functionality for data in many forms. It also offers two data structures, the **Series** and the **DataFrame**, that combine much of the most useful functionality of lists, dictionaries, and n-dimensional arrays.

Similar to our NumPy class, we will focus today on:

* Creating Series and DataFrame objects
* Getting familiar with Series and DataFrame methods
    - Indexing, slicing, and filtering
    - Mathematical operations
    - Sorting and ranking
    - Function application and mapping

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

## Creating Series and DataFrame objects

As with NumPy, we will most often import our data into **Series** and **DataFrame** objects from another source; for now, we will create them manually by casting from other appropriate data types.

### Series

**Series** objects are very similar to **ndarrays**...

* Indexing, slicing, filtering work in a similar way
* Easy and fast computation
* Concatenation

...with some additional features:

* An associated array of data labels, called an **index** object. You can access the elements of a Series by position (as in an array) or by index label (similar to a dictionary key)
* Database-style merging

In [2]:
# Cast from another sequence object - list, tuple, or array
ser = Series([1,-2,3,-4])
ser

0    1
1   -2
2    3
3   -4
dtype: int64

In [3]:
# Basic indexing
ser[1]

-2

In [4]:
# Specify indices
ser = Series([1,-2,3,-4], index=['a','b','c','d'])
ser

a    1
b   -2
c    3
d   -4
dtype: int64

In [5]:
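# Basic indexing - by label (attribute or key) or by position (.iloc)
# Plain ser[2] on a labeled Series is deprecated in recent pandas; .iloc is explicit
print(ser.c, ser['c'], ser.iloc[2])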

3 3 3


In [6]:
# Create from dictionary
D = {key:val for key,val in zip('abcd', [1,-2,3,-4])}
ser = Series(D)
ser

a    1
b   -2
c    3
d   -4
dtype: int64

In [7]:
# Drop values
ser.drop('c')

a    1
b   -2
d   -4
dtype: int64

In [8]:
# Print Series
ser

a    1
b   -2
c    3
d   -4
dtype: int64

In [9]:
# Deleting values
del ser['c']
ser

a    1
b   -2
d   -4
dtype: int64

Series objects have attributes that are similar to those of dictionaries:

In [10]:
# Index object - similar to dictionary keys, returns an index array
ser.index

Index(['a', 'b', 'd'], dtype='object')

In [12]:
# Values - similar to dictionary values, returns an array
ser.values

array([ 1, -2, -4])

In [12]:
# Check for membership
print('e' in ser.index)
print(-2 in ser.values)

False
True


### DataFrames

**DataFrames**, in their most basic form, are 2-dimensional data structures (rows and columns), and are similar in structure to a dictionary of Series objects (i.e., each Series is accessible via a column key). DataFrames can also represent higher dimensional data by leveraging *hierarchical indexing*.
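As a minimal sketch of hierarchical indexing, the example below builds a DataFrame whose rows are labeled by two levels. The data, the level names (`year`, `quarter`), and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical data: two measurements for each (year, quarter) pair
index = pd.MultiIndex.from_product([[2020, 2021], ['Q1', 'Q2']],
                                   names=['year', 'quarter'])
hdf = pd.DataFrame({'sales': [10, 12, 11, 15], 'costs': [7, 8, 9, 9]},
                   index=index)

# Select all rows for one outer-level label
rows_2021 = hdf.loc[2021]

# Select a single row with a tuple of (outer, inner) labels
q2_2020 = hdf.loc[(2020, 'Q2')]

# Move the inner level into the columns to get a "wide" view
wide = hdf['sales'].unstack('quarter')
print(wide)
```

Here the third dimension (quarter within year) lives in the row index rather than in a separate axis, which is how a 2-dimensional structure can hold higher dimensional data.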

Aside from loading data from a file directly into a DataFrame (later), the primary methods for creating a DataFrame are:

* From a dictionary of equal-length sequences or (any-length) Series objects
* From a 2-d array or sequence of equal-length sequences

In [13]:
# Dictionary of equal-length sequences
df = DataFrame({'a': [1,2,3], 'b': (4,5,6), 'c': np.array([7,8,9])}, index=[1,2,3])
df

   a  b  c
1  1  4  7
2  2  5  8
3  3  6  9


In [14]:
# Dictionary of Series objects
# Series.append was removed in pandas 2.0; pd.concat is the replacement
df = DataFrame({'Original': pd.concat([ser, Series({'e':5})]), 'Negated': -ser})
df

   Original  Negated
a         1     -1.0
b        -2      2.0
d        -4      4.0
e         5      NaN


In [15]:
# 2-d array
df = DataFrame(np.random.rand(3,3), columns=['a','b','c'], index=[1,2,3])
df

          a         b         c
1  0.153027  0.875821  0.406075
2  0.367660  0.434089  0.387083
3  0.773476  0.837289  0.271032


DataFrames have attributes similar to those of Series objects:

In [16]:
# DataFrame index
df.index

Int64Index([1, 2, 3], dtype='int64')

In [17]:
# DataFrame columns
df.columns

Index(['a', 'b', 'c'], dtype='object')

In [18]:
# DataFrame values
df.values

array([[0.15302652, 0.87582115, 0.40607538],
       [0.36765969, 0.43408876, 0.38708282],
       [0.77347577, 0.83728897, 0.27103155]])

Indexing DataFrame columns follows the general syntax:

```
df[column(s)]
```

In addition, the .loc and .iloc indexers allow you to access and update more specific information contained within the DataFrame. They are used with square brackets rather than called like methods. The .loc indexer accesses the [row(s), column(s)] data by index label and column name. The .iloc indexer accesses the [row(s), column(s)] data by positional row and column indices.
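One difference worth seeing in isolation is how slices behave: label-based slices include both endpoints, while position-based slices exclude the stop position, as in ordinary Python. The small frame below is a made-up example (the name `demo` is hypothetical).

```python
import pandas as pd

# Hypothetical frame with non-default integer row labels
demo = pd.DataFrame({'a': [10, 20, 30], 'b': [40, 50, 60]}, index=[1, 2, 3])

# .loc slices by label and includes BOTH endpoints
by_label = demo.loc[1:2, 'a']        # rows labeled 1 and 2

# .iloc slices by position and excludes the stop position
by_position = demo.iloc[1:2, 0]      # only the row at position 1 (label 2)

print(by_label.tolist())     # [10, 20]
print(by_position.tolist())  # [20]
```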

In [19]:
# Basic column indexing
df['b']
# Attribute access only works if the column name has no spaces and does not match a DataFrame method
df.b

1    0.875821
2    0.434089
3    0.837289
Name: b, dtype: float64

In [20]:
# Multiple columns
df[['a','c']]

          a         c
1  0.153027  0.406075
2  0.367660  0.387083
3  0.773476  0.271032


In [21]:
# Columns and rows
print(df[['a','b']].loc[[1,2]])
# .loc[rows, cols] selects by label; .iloc[rows, cols] selects by position
print(df.loc[[1,2],['a','b']])

          a         b
1  0.153027  0.875821
2  0.367660  0.434089
          a         b
1  0.153027  0.875821
2  0.367660  0.434089


In [22]:
# Updating values using .loc
df.loc[1,'a'] = np.random.rand()
df

          a         b         c
1  0.399283  0.875821  0.406075
2  0.367660  0.434089  0.387083
3  0.773476  0.837289  0.271032


In [23]:
# .iloc indexer - first two rows, columns from position 2 onward
df.iloc[:2, 2:]

          c
1  0.406075
2  0.387083


In [24]:
# Creating new columns - New value(s) will be broadcast (if applicable)
df['d'] = np.random.rand(3)
df

          a         b         c         d
1  0.399283  0.875821  0.406075  0.909433
2  0.367660  0.434089  0.387083  0.209875
3  0.773476  0.837289  0.271032  0.172064


In [25]:
# Creating new columns - Series (aligned on index; unmatched rows become NaN)
df['e'] = Series(np.random.rand(3), index=[1,2,4])
df

          a         b         c         d         e
1  0.399283  0.875821  0.406075  0.909433  0.452501
2  0.367660  0.434089  0.387083  0.209875  0.415721
3  0.773476  0.837289  0.271032  0.172064       NaN


In [26]:
# Dropping rows - returns a new DataFrame; df itself is unchanged
df.drop(2, axis=0)

          a         b         c         d         e
1  0.399283  0.875821  0.406075  0.909433  0.452501
3  0.773476  0.837289  0.271032  0.172064       NaN


In [27]:
# Dropping columns
df.drop('b', axis=1)

          a         c         d         e
1  0.399283  0.406075  0.909433  0.452501
2  0.367660  0.387083  0.209875  0.415721
3  0.773476  0.271032  0.172064       NaN


In [28]:
# Display df
df

          a         b         c         d         e
1  0.399283  0.875821  0.406075  0.909433  0.452501
2  0.367660  0.434089  0.387083  0.209875  0.415721
3  0.773476  0.837289  0.271032  0.172064       NaN


In [29]:
# Deleting columns - modifies df in place
del df['e']
df

          a         b         c         d
1  0.399283  0.875821  0.406075  0.909433
2  0.367660  0.434089  0.387083  0.209875
3  0.773476  0.837289  0.271032  0.172064


### Filtering with Series and DataFrames

Similar to arrays, one of the primary techniques for filtering your data involves boolean arrays, which you can store as an array (to use multiple times) or apply inline (to use once). The key is that your boolean array must broadcast appropriately to the filtered object.

You can apply a filter to a Series, a DataFrame column (which is also a Series), or a whole DataFrame, provided that the shape of the boolean array matches the object being filtered.

Comparisons for generating the boolean arrays can include:

* <, >, <=, >=, ==, !=
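* in (membership) - also combine with not (not in); returns a single boolean and, on a Series, tests the index labels rather than the values, so use the .isin method for an element-wise test
* is (identity) - also combine with not (is not); returns a single boolean rather than an array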
* .any(*axis*), .all(*axis*) where *axis* allows you to specify whether to aggregate across a row (1) or down a column (0)
* .isnull/.isna, .notnull/.notna methods (next)
* Vectorized string .is methods (later)
* Vectorized function applications (later)
* And more!
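The sketch below contrasts a few of these comparisons on a small, made-up Series (the names `s`, `gt_two`, and so on are hypothetical). It also shows that Python's `in` operator looks at the index labels of a Series, so `.isin` is the element-wise choice.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, 4.0], index=['a', 'b', 'c', 'd'])

gt_two = s > 2                 # element-wise comparison (NaN compares as False)
is_missing = s.isna()          # True where the value is NaN
in_set = s.isin([1.0, 4.0])    # element-wise membership test

# Python's `in` operator checks the index labels, not the values
label_present = 'a' in s       # True: 'a' is a label
value_as_label = 4.0 in s      # False: 4.0 is a value, not a label

filtered = s[in_set]           # keep entries where the mask is True
print(filtered)
```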

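You can build more complex comparisons by combining boolean arrays with logical and, or, xor, and not operations. The keywords and, or, and not work only on single boolean values; for arrays, Series, and DataFrames, use the element-wise operators or the NumPy functions:

* and (scalar); & or np.logical_and (element-wise)
* or (scalar); | or np.logical_or (element-wise)
* ^ (xor, scalar or element-wise); np.logical_xor (element-wise)
* not (scalar); ~ or np.logical_not (element-wise)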
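As an illustration of combining conditions element-wise, the sketch below uses a made-up DataFrame (the names `frame`, `both`, `either`, and so on are hypothetical). Each comparison is wrapped in parentheses because `&`, `|`, and `^` bind more tightly than comparison operators.

```python
import pandas as pd

frame = pd.DataFrame({'x': [1, 5, 8, 3],
                      'color': ['blue', 'green', 'blue', 'red']})

# Parentheses are required around each comparison
both = (frame['x'] > 2) & (frame['color'] == 'blue')
either = (frame['x'] > 6) | (frame['color'] == 'red')
exactly_one = (frame['x'] > 2) ^ (frame['color'] == 'blue')
negated = ~(frame['color'] == 'blue')

# Use a combined mask to filter rows
print(frame[both])
```

Using the keyword forms (`and`, `or`, `not`) on a Series raises a ValueError, because the truth value of a multi-element Series is ambiguous.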

You can also combine filtering with assignment, using the .loc indexer. Remember, the assigned object must broadcast appropriately to the filtered data structure. For example, a scalar will be broadcast to all selected values, the dimensions of a list or array must match those of the selection, and a Series object must have the same index labels as the selection (otherwise, the unmatched entries will be NaN).
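The three assignment cases can be sketched as follows. The frame `t` and its column names are made up, and the target columns are created as floats up front so that each assignment only changes values.

```python
import pandas as pd

# Hypothetical frame; the three target columns start at 0.0
t = pd.DataFrame({'score': [0.2, 0.9, 0.7, 0.4],
                  'from_scalar': 0.0,
                  'from_list': 0.0,
                  'from_series': 0.0},
                 index=['a', 'b', 'c', 'd'])
high = t['score'] > 0.5                  # True for rows 'b' and 'c'

# Scalar: broadcast to every selected cell
t.loc[high, 'from_scalar'] = 1.0

# List: length must equal the number of selected rows (two here)
t.loc[high, 'from_list'] = [10.0, 20.0]

# Series: aligned on index labels; 'c' has no match, so it becomes NaN
t.loc[high, 'from_series'] = pd.Series({'b': 5.0, 'z': 99.0})

print(t)
```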

In [30]:
# Column filter
df.loc[:, np.array([True,False,True,True])]

          a         c         d
1  0.399283  0.406075  0.909433
2  0.367660  0.387083  0.209875
3  0.773476  0.271032  0.172064


In [31]:
# Identify values greater than 0.5
mask = df > 0.5
mask

       a      b      c      d
1  False   True  False   True
2  False  False  False  False
3   True   True  False  False


In [32]:
# Apply mask to DataFrame
df[mask]
# df[df > 0.5]

          a         b   c         d
1       NaN  0.875821 NaN  0.909433
2       NaN       NaN NaN       NaN
3  0.773476  0.837289 NaN       NaN


In [34]:
# Apply any or all method to mask - down each column (axis=0), across each row (axis=1)
amask = mask.any(axis=0)
amask

a     True
b     True
c    False
d     True
dtype: bool

In [35]:
# Apply amask to DataFrame - must match dimensions
df.loc[:, amask]

          a         b         d
1  0.399283  0.875821  0.909433
2  0.367660  0.434089  0.209875
3  0.773476  0.837289  0.172064


In [36]:
# Apply filter inline
df[df > 0.5]

          a         b   c         d
1       NaN  0.875821 NaN  0.909433
2       NaN       NaN NaN       NaN
3  0.773476  0.837289 NaN       NaN


In [37]:
# Apply filter based on column values
df['color'] = ['blue', 'green', 'blue']
df[df['color'] == 'blue']

          a         b         c         d  color
1  0.399283  0.875821  0.406075  0.909433   blue
3  0.773476  0.837289  0.271032  0.172064   blue


In [38]:
# Apply filter to specific column
print(df['b'][df['color'] == 'green'])
print(df.loc[df['color'] == 'green', 'b'])

2    0.434089
Name: b, dtype: float64
2    0.434089
Name: b, dtype: float64


In [39]:
# Combine filtering with assignment
df.loc[df['color'] == 'green', 'b'] = 123
df

          a           b         c         d  color
1  0.399283    0.875821  0.406075  0.909433   blue
2  0.367660  123.000000  0.387083  0.209875  green
3  0.773476    0.837289  0.271032  0.172064   blue


### Descriptive Statistics

Similar to arrays, there are many built-in functions and methods for computing descriptive statistics on Series and DataFrame objects:

* .count
* .describe
* .min, .max
* .argmin, .argmax
* .idxmin, .idxmax
* .quantile
* .sum
* .mean
* .median
* .mad (mean absolute deviation; removed in pandas 2.0)
* .prod
* .var
* .std
* .skew
* .kurt
* .cumsum
* .cummin, .cummax
* .cumprod
* .diff
* .pct_change

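Some of these are aggregation functions, which reduce the data to a single value per column or row (e.g., .sum, .mean), and some are non-aggregation functions, which return an object of the same shape as the input (e.g., .cumsum, .diff). In addition, many functions have an *axis* input that allows you to compute the function down each column (*axis* = 0) or across each row (*axis* = 1).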
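A minimal sketch of the difference, using a made-up frame `g`: an aggregation collapses one axis, while a non-aggregation keeps the original shape.

```python
import pandas as pd

g = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})

col_totals = g.sum(axis=0)   # aggregation down each column: one value per column
row_totals = g.sum(axis=1)   # aggregation across each row: one value per row
running = g.cumsum(axis=0)   # non-aggregation: same shape as g

print(col_totals.tolist())   # [6, 60]
print(row_totals.tolist())   # [11, 22, 33]
print(running['y'].tolist()) # [10, 30, 60]
```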

In addition, there are correlation and covariance methods:

* ser.corr or df.corr
* ser.cov or df.cov
* df.corrwith
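The sketch below shows how these three methods differ in what they compare. The data are randomly generated with an arbitrarily chosen seed, and the column names (`same`, `flipped`, `noise`) are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)            # fixed seed, chosen arbitrarily
base = pd.Series(rng.normal(size=200))
c = pd.DataFrame({'same': base,
                  'flipped': -base,
                  'noise': rng.normal(size=200)})

pairwise = c.corr()                       # 3x3 matrix: every column vs. every column
single = c['same'].corr(c['flipped'])     # one coefficient between two Series
against = c.corrwith(base)                # each column vs. a single Series
covariances = c.cov()                     # 3x3 covariance matrix

print(against)
```

By construction, `same` correlates perfectly with `base` and `flipped` correlates perfectly negatively, which makes the output easy to sanity-check.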

And summary and comparison functions for categorical data:

* .unique (unique values)
* .value_counts (frequency summary)
* .isin (membership)

In [40]:
# Create series object
ser = Series(np.random.randn(100))

In [41]:
# .describe method
ser.describe()

count    100.000000
mean      -0.041009
std        0.985156
min       -2.235848
25%       -0.789133
50%       -0.059248
75%        0.585013
max        3.006144
dtype: float64

In [45]:
# Create DataFrame object
df = DataFrame(np.random.randn(100,3))

In [46]:
# Compute summary statistic across row (axis=1) or column (axis=0)
df.sum(axis=1).head()

0    0.798997
1    3.531292
2    1.666645
3   -1.085787
4    2.381748
dtype: float64

In [47]:
# Correlation matrix
df.corr()

          0         1         2
0  1.000000  0.128240  0.060966
1  0.128240  1.000000 -0.087488
2  0.060966 -0.087488  1.000000


In [48]:
# Create categorical series
cser = Series(['green'] * 20 + ['red'] * 25 + ['blue'] * 10)
cser.head()

0    green
1    green
2    green
3    green
4    green
dtype: object

In [49]:
# Unique values
cser.unique()

array(['green', 'red', 'blue'], dtype=object)

In [50]:
# Value counts
cser.value_counts()

red      25
green    20
blue     10
dtype: int64

In [51]:
# Membership
cser.isin(['green','red']).sum()

45

### Arithmetic Operations and Data Alignment

Similar to arrays, arithmetic operations (and comparisons) involving Series and DataFrame objects are flexible. pandas will attempt to broadcast scalars and sequences to the appropriate Series and DataFrame dimensions (if possible).

However, for arithmetic operations (and comparisons) strictly involving Series and/or DataFrame objects, pandas will match the indices of the objects and perform the operation between matching elements (i.e., those with the same index *and* column). This process is called *data alignment* and is an important aspect of working with pandas data structures. Operations involving entries (indices) that exist in one object and not in the other will result in NaN. To get around this issue, you can make sure beforehand that the indices match, or you can perform the operation on the underlying arrays instead.
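As a compact sketch of alignment and two ways around it, the example below uses made-up Series with partially overlapping labels (the names `left` and `right` are hypothetical).

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Default: align on labels; labels present on only one side give NaN
aligned = left + right

# Treat a missing entry on either side as 0 before adding
filled = left.add(right, fill_value=0)

# Ignore labels entirely and add element by element, by position
positional = left.to_numpy() + right.to_numpy()

print(aligned)
print(filled)
print(positional)   # [11 22 33]
```

Adding by position discards the labels, so it is only appropriate when both objects are known to be in the same order.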

In [52]:
# Create Series objects
ser1 = Series([1,2,3])
ser2 = Series([1,2,3], index=[1,2,3])
DataFrame({'1': ser1, '2': ser2})

     1    2
0  1.0  NaN
1  2.0  1.0
2  3.0  2.0
3  NaN  3.0


In [53]:
# Arithmetic between Series and scalar
ser1 + 5

0    6
1    7
2    8
dtype: int64

In [63]:
# Arithmetic between Series and sequence
ser1 + np.array([10,20,30]) # works for lists and tuples too

0    11
1    22
2    33
dtype: int64

In [54]:
# Arithmetic between two Series objects
ser1 + ser2

0    NaN
1    3.0
2    5.0
3    NaN
dtype: float64

In [55]:
# Arithmetic methods - .add, .sub, .div, .floordiv, .mul, .pow
ser1.add(ser2, fill_value=0)

0    1.0
1    3.0
2    5.0
3    3.0
dtype: float64

In [59]:
# Create two DataFrames
df1 = DataFrame(np.random.rand(3,3))
df1

          0         1         2
0  0.492351  0.017337  0.006704
1  0.446499  0.880864  0.066724
2  0.824962  0.843603  0.789564


In [60]:
df2 = DataFrame(np.random.rand(3,3), index=[1,2,3], columns=[1,2,3])
df2

          1         2         3
1  0.003113  0.598133  0.888918
2  0.225718  0.466116  0.173134
3  0.444441  0.222328  0.479399


In [61]:
# Perform operation between DataFrames
df1 + df2

    0         1         2   3
0 NaN       NaN       NaN NaN
1 NaN  0.883977  0.664856 NaN
2 NaN  1.069321  1.255680 NaN
3 NaN       NaN       NaN NaN


In [62]:
# Perform operation between DataFrame values - labels are ignored, elements are matched by position
DataFrame(df1.values + df2.values)

          0         1         2
0  0.495464  0.615469  0.895622
1  0.672218  1.346981  0.239857
2  1.269403  1.065931  1.268963


### Function Application and Mapping

Similar to arrays, Series and DataFrames support fast computation and comparison operations. NumPy universal functions (e.g., np.abs, np.square, np.log, np.exp, etc.) work as expected (for both Series and DataFrame objects with numerical data).

However, there are many operations that we would like to apply that do not have built-in functions or methods. This is where lambda functions come in handy, although we can use fully defined functions as well. There are three primary methods for applying custom functions to Series or DataFrame objects:

* Series .map method - Applies function to each element in Series
* DataFrame .apply method - Applies function column- or row-wise
* DataFrame .applymap method - Applies function to each element in DataFrame (renamed to DataFrame .map in pandas 2.1, where .applymap is deprecated)
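These methods also accept fully defined functions. The sketch below passes two made-up helper functions (`value_range` and `label_size` are hypothetical names, and the threshold of 5 is arbitrary) to .apply and .map.

```python
import pandas as pd

def value_range(values):
    """Return max minus min for a Series of numbers."""
    return values.max() - values.min()

def label_size(x):
    """Classify a single number (the threshold is arbitrary)."""
    return 'big' if x >= 5 else 'small'

m = pd.DataFrame({'p': [1, 4, 9], 'q': [2, 6, 3]})

per_column = m.apply(value_range)          # function receives each column (axis=0)
per_row = m.apply(value_range, axis=1)     # function receives each row
labels = m['p'].map(label_size)            # function receives each element

print(per_column.tolist())  # [8, 4]
print(per_row.tolist())     # [1, 2, 6]
print(labels.tolist())      # ['small', 'small', 'big']
```

Note that .apply hands the function a whole Series at a time, whereas .map hands it one element at a time, so the two helpers are written differently.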

In [63]:
# Apply NumPy universal function
df = DataFrame(np.random.rand(3,5))
print(df)
np.exp(df)

          0         1         2         3         4
0  0.551543  0.565579  0.924744  0.895497  0.473018
1  0.202348  0.672336  0.486154  0.880959  0.946229
2  0.741379  0.646607  0.351019  0.705163  0.505213


          0         1         2         3         4
0  1.735930  1.760466  2.521222  2.448554  1.604830
1  1.224274  1.958808  1.626050  2.413212  2.575978
2  2.098827  1.909053  1.420515  2.024177  1.657339


In [64]:
# Series map method - Numerical computation
ser = Series(np.random.rand(10))
print(ser)
ser.map(lambda x: x ** 2)

0    0.264053
1    0.501627
2    0.897975
3    0.722188
4    0.775064
5    0.209331
6    0.307279
7    0.121834
8    0.525011
9    0.599336
dtype: float64


0    0.069724
1    0.251630
2    0.806359
3    0.521555
4    0.600725
5    0.043820
6    0.094420
7    0.014843
8    0.275636
9    0.359203
dtype: float64

In [65]:
# Series .map method - String operation
Series({1: 'The Green Mile', 2: 'Black Panther', 3: 'The Bourne Identity'}).map(lambda s: len(s.split()))

1    3
2    2
3    3
dtype: int64

In [66]:
# DataFrame .apply method - down each column (axis=0), across each row (axis=1)
print(df)
print(df.apply(lambda arr: arr / arr.sum(), axis=1))

          0         1         2         3         4
0  0.551543  0.565579  0.924744  0.895497  0.473018
1  0.202348  0.672336  0.486154  0.880959  0.946229
2  0.741379  0.646607  0.351019  0.705163  0.505213
          0         1         2         3         4
0  0.161725  0.165840  0.271156  0.262580  0.138699
1  0.063471  0.210894  0.152494  0.276334  0.296807
2  0.251368  0.219235  0.119015  0.239088  0.171295


In [74]:
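# DataFrame .applymap method (use df.map in pandas 2.1 or later)
total = df.sum().sum()  # compute the grand total once, not once per element
df.applymap(lambda x: x / total)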

          0         1         2         3         4
0  0.024335  0.051568  0.103771  0.098875  0.086385
1  0.108367  0.077558  0.078503  0.033419  0.037186
2  0.008362  0.076176  0.070043  0.116236  0.029216


### Sorting and Ranking

Oftentimes, you will want to sort your Series or DataFrame objects according to some of the contained data. The primary method for sorting is:
```
ser.sort_values(ascending=True/False)
df.sort_values(by=column(s), ascending=True/False)
```

To restore the original ordering of your data structure (i.e., sorted by index), use the .sort_index method.

Alternatively, you can determine the rank of each value using the .rank method.
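The sketch below sorts a small made-up Series, restores it with .sort_index, and compares three of the tie-handling options of .rank (the name `r` and its values are hypothetical).

```python
import pandas as pd

r = pd.Series([30, 10, 20, 10], index=['w', 'x', 'y', 'z'])

by_value = r.sort_values(ascending=False)   # largest value first
restored = by_value.sort_index()            # back to label order w, x, y, z

avg_rank = r.rank()                   # ties share the average of their ranks
min_rank = r.rank(method='min')       # ties share the lowest of their ranks
dense_rank = r.rank(method='dense')   # like 'min', but ranks have no gaps

print(avg_rank.tolist())    # [4.0, 1.5, 3.0, 1.5]
print(min_rank.tolist())    # [4.0, 1.0, 3.0, 1.0]
print(dense_rank.tolist())  # [3.0, 1.0, 2.0, 1.0]
```

Ranks are returned in the original order of the data, so ranking does not reorder anything; it only labels each value with its position in the sorted sequence.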

In [67]:
# Sort series
ser.sort_values(ascending=True)

7    0.121834
5    0.209331
0    0.264053
6    0.307279
1    0.501627
8    0.525011
9    0.599336
3    0.722188
4    0.775064
2    0.897975
dtype: float64

In [68]:
# Sort DataFrame
df.sort_values(by=1, ascending=True)

          0         1         2         3         4
0  0.551543  0.565579  0.924744  0.895497  0.473018
2  0.741379  0.646607  0.351019  0.705163  0.505213
1  0.202348  0.672336  0.486154  0.880959  0.946229


In [69]:
# Rank method - by column (axis=0), by row (axis=1)
df.rank(axis=0, method='average', ascending=True) # method = 'average', 'min', 'max', 'first', 'dense'

     0    1    2    3    4
0  2.0  1.0  3.0  3.0  1.0
1  1.0  3.0  2.0  2.0  3.0
2  3.0  2.0  1.0  1.0  2.0


## Next Time: Data Import and Export