# Chapter 5: Getting started with pandas

In [1]:
# Setup
import pandas as pd
import numpy as np

## 5.1: Introduction to pandas Data Structures

### Series

The basic building block of a pandas DataFrame is the Series, a one-dimensional data structure that closely resembles a NumPy array.  The relationship is similar to R, where a data frame is built from a list of vectors.  A Series holds an array of data plus an associated array of labels, the index, used to identify individual elements.

In [2]:
# Build a simple Series
simple_series = pd.Series([4, 7, -5, 3])
simple_series

0    4
1    7
2   -5
3    3
dtype: int64

In [5]:
# Grab the values and indexes separately
print(f'Values in the simple Series:\n{simple_series.values}\n')
print(f'Indexes in the simple Series:\n{simple_series.index}')

Values in the simple Series:
[ 4  7 -5  3]

Indexes in the simple Series:
RangeIndex(start=0, stop=4, step=1)


In [7]:
# Modify the indexes-- default is to use numeric indexes that start at 0
obj = pd.Series([4, 7, -5, 3], index = ['d', 'b', 'a', 'c'])
print(f'Values in the new Series:\n{obj.values}\n')
print(f'Index in the new Series:\n{obj.index}')

Values in the new Series:
[ 4  7 -5  3]

Index in the new Series:
Index(['d', 'b', 'a', 'c'], dtype='object')


In [8]:
# Select a value by its index label
obj['a']

-5

In [9]:
# Filtering is similar to bracket notation used in R
obj[obj > 0]

d    4
b    7
c    3
dtype: int64

A Series can be treated like an ordered dictionary: its index labels act as keys mapped to the values in a one-dimensional array.  In fact, we can build a Series directly from a dictionary.

In [12]:
# Create a simple dictionary
state_data = {'ohio': 35000, 'texas': 71000, 'oregon': 16000, 'utah': 5000}

# Create a new Series object from the dictionary
state_series = pd.Series(state_data)
print(state_series)

ohio      35000
texas     71000
oregon    16000
utah       5000
dtype: int64


In [13]:
# Detect missing data
pd.isnull(state_series)

ohio      False
texas     False
oregon    False
utah      False
dtype: bool
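
The check above finds nothing because every state has a value. As a minimal sketch of what missing data looks like, the example below rebuilds the Series with an explicit index. The `labels` list (including 'california') is an assumption added for illustration and is not part of the original notebook.

```python
import pandas as pd

state_data = {'ohio': 35000, 'texas': 71000, 'oregon': 16000, 'utah': 5000}

# Hypothetical label list: 'california' has no entry in state_data,
# and 'utah' is deliberately left out
labels = ['california', 'ohio', 'oregon', 'texas']
partial_series = pd.Series(state_data, index=labels)

# Dictionary-style membership tests work on the index labels
has_ohio = 'ohio' in partial_series
has_utah = 'utah' in partial_series  # False: 'utah' is not in labels

# The label with no matching key is stored as NaN
missing_mask = pd.isnull(partial_series)
print(partial_series)
print(missing_mask)
```

Because NaN is a floating-point value, the Series is stored as `float64` even though the dictionary held integers.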

Keeping in line with the tabular style of pandas data structures, we can bind names to both the data and the indexes in the Series.

In [14]:
state_series.name = 'population'
state_series.index.name = 'state'
print(state_series)

state
ohio      35000
texas     71000
oregon    16000
utah       5000
Name: population, dtype: int64


### DataFrame

We can build a DataFrame from a dictionary: each key becomes a column name, and each value is an equal-length list or array holding that column's data.

In [15]:
# Build state data frame
state_data = {'state': ['ohio', 'ohio', 'ohio', 'nevada', 'nevada'], 
             'year': [2000, 2001, 2002, 2001, 2002], 
             'population': [1.5, 1.7, 3.6, 2.4, 2.9]}

state_df = pd.DataFrame(state_data)
print(state_df)

    state  year  population
0    ohio  2000         1.5
1    ohio  2001         1.7
2    ohio  2002         3.6
3  nevada  2001         2.4
4  nevada  2002         2.9


In [16]:
# A column listed in `columns` but missing from the input data is filled with NaN
state_df2 = pd.DataFrame(state_data, columns = ['state', 'year', 'population', 'debt'])
print(state_df2)

    state  year  population debt
0    ohio  2000         1.5  NaN
1    ohio  2001         1.7  NaN
2    ohio  2002         3.6  NaN
3  nevada  2001         2.4  NaN
4  nevada  2002         2.9  NaN


In [17]:
# Set a default column value
state_df2['debt'] = 16.5
print(state_df2)

    state  year  population  debt
0    ohio  2000         1.5  16.5
1    ohio  2001         1.7  16.5
2    ohio  2002         3.6  16.5
3  nevada  2001         2.4  16.5
4  nevada  2002         2.9  16.5


In [18]:
# Create new column-- similar to R
state_df2['eastern'] = state_df2['state'] == 'ohio'
print(state_df2)

    state  year  population  debt  eastern
0    ohio  2000         1.5  16.5     True
1    ohio  2001         1.7  16.5     True
2    ohio  2002         3.6  16.5     True
3  nevada  2001         2.4  16.5    False
4  nevada  2002         2.9  16.5    False


In [19]:
# Delete a column 
del state_df2['eastern']
print(state_df2)

    state  year  population  debt
0    ohio  2000         1.5  16.5
1    ohio  2001         1.7  16.5
2    ohio  2002         3.6  16.5
3  nevada  2001         2.4  16.5
4  nevada  2002         2.9  16.5


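Note that `del` modifies the DataFrame in place: the column is removed from the existing object rather than from a copy.  Relatedly, as with NumPy arrays, the column returned by indexing a DataFrame can be a view on the underlying data rather than a copy, so in-place changes to that Series may be reflected in the DataFrame.  Use the Series' `copy()` method when an independent copy is needed.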
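
To make the in-place behavior concrete, here is a small sketch that contrasts `del` with `drop()`. The two-row `df` is a made-up stand-in for the state data above.

```python
import pandas as pd

# Hypothetical two-row frame used only for this illustration
df = pd.DataFrame({'state': ['ohio', 'nevada'], 'debt': [16.5, 16.5]})

# drop() returns a new DataFrame and leaves df untouched
without_debt = df.drop('debt', axis=1)
cols_after_drop = list(df.columns)   # df still has both columns

# del removes the column from df itself
del df['debt']
cols_after_del = list(df.columns)    # only 'state' remains

print(cols_after_drop)
print(cols_after_del)
```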

### Index objects

Index objects in pandas are used to hold the axis labels and axis names.

In [24]:
# Get index object for state data frame
state_df2.index

RangeIndex(start=0, stop=5, step=1)

As we might expect, accidentally modifying index labels would be problematic, so Index objects are immutable: their individual elements cannot be reassigned.
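
A quick sketch of what immutability means in practice, using a throwaway index (`idx` and `s` are illustrative names, not from the notebook). Assigning to a single element fails, while replacing the whole index of a Series is still allowed.

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'c'])

# Element-wise assignment on an Index raises a TypeError
try:
    idx[1] = 'z'
    mutated = True
except TypeError as err:
    mutated = False
    print(f'Index assignment failed: {err}')

# Swapping in a completely new index is a different operation and works
s = pd.Series([1, 2, 3], index=idx)
s.index = ['x', 'y', 'z']
print(s)
```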

## 5.2: Essential functionality

In [25]:
# Create a new Series
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index = ['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [29]:
# Re-index-- conform the Series to a new index; labels not already present (here 'e') get fill_value
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64
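
Besides a constant `fill_value`, `reindex()` accepts a `method` argument that fills gaps from neighboring entries. The sketch below uses a made-up `colors` Series to show forward filling, and also shows the default behavior (NaN) when no fill is requested.

```python
import pandas as pd

# Hypothetical ordered data with gaps in its integer index
colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Forward-fill: each new label takes the last known value before it
filled = colors.reindex(range(6), method='ffill')
print(filled)

# Without fill_value or method, new labels are simply NaN
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
unfilled = obj.reindex(['a', 'e'])
print(unfilled)
```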

In [33]:
# Build another data frame using NumPy arrays
frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)), 
    index = ['a', 'c', 'd'], 
    columns = ['ohio', 'texas', 'california'])

frame

   ohio  texas  california
a     0      1           2
c     3      4           5
d     6      7           8


In [34]:
# Re-index-- by default, function reindexes the rows
frame.reindex(['a', 'b', 'c', 'd'])

   ohio  texas  california
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0


In [35]:
# Re-index the columns-- that is, select to get a specific order
states = ['texas', 'ohio', 'california']
frame.reindex(columns=states)

   texas  ohio  california
a      1     0           2
c      4     3           5
d      7     6           8


In [36]:
# Create a new series
obj = pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])
obj

a    0
b    1
c    2
d    3
e    4
dtype: int64

In [40]:
# Drop based on index-- note, this will return a new object
new_obj = obj.drop('c')
new_obj

a    0
b    1
d    3
e    4
dtype: int64

In [41]:
# Pass an array of indexes to drop
obj.drop(['a', 'd'])

b    1
c    2
e    4
dtype: int64

In [42]:
# Create a DataFrame
data = pd.DataFrame(np.arange(16).reshape((4, 4)), 
                   index=['ohio', 'colorado', 'utah', 'new york'], 
                   columns=['one', 'two', 'three', 'four'])

data

          one  two  three  four
ohio        0    1      2     3
colorado    4    5      6     7
utah        8    9     10    11
new york   12   13     14    15


In [43]:
# Drop colorado and ohio
data.drop(['colorado', 'ohio'])

          one  two  three  four
utah        8    9     10    11
new york   12   13     14    15


In [44]:
# Note the original DataFrame remains unchanged
data

          one  two  three  four
ohio        0    1      2     3
colorado    4    5      6     7
utah        8    9     10    11
new york   12   13     14    15


In [45]:
# You can drop index values from either axis (default, axis == 0, rows)
data.drop(['colorado', 'ohio'], axis=0)

          one  two  three  four
utah        8    9     10    11
new york   12   13     14    15


In [46]:
data.drop(['one'], axis=1)

          two  three  four
ohio        1      2     3
colorado    5      6     7
utah        9     10    11
new york   13     14    15


In [47]:
data.drop(['two', 'four'], axis=1)

          one  three
ohio        0      2
colorado    4      6
utah        8     10
new york   12     14
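
As an alternative to passing `axis`, `drop()` also accepts `index=` and `columns=` keyword arguments, which some readers find easier to scan. The sketch below rebuilds the same `data` frame so it runs on its own.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['ohio', 'colorado', 'utah', 'new york'],
                    columns=['one', 'two', 'three', 'four'])

# These two calls drop the same columns
by_axis = data.drop(['two', 'four'], axis=1)
by_keyword = data.drop(columns=['two', 'four'])

# index= is the keyword equivalent of axis=0
rows_dropped = data.drop(index=['colorado', 'ohio'])

print(by_keyword)
print(rows_dropped)
```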


Slicing with labels in pandas differs from ordinary Python slicing.  Python excludes the endpoint of a slice, but a pandas label-based slice includes it.

In [50]:
# Create a simple DataFrame and slice its rows by label
obj = pd.DataFrame(np.arange(4), index=['a', 'b', 'c', 'd'])
obj['b':'d']

   0
b  1
c  2
d  3
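
To see the endpoint rule side by side, the following sketch slices the same Series once by label and once by integer position. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])

# Label-based slice: the endpoint 'c' is included
by_label = s.loc['b':'c']

# Position-based slice: the endpoint (position 2) is excluded,
# just like slicing a Python list
by_position = s.iloc[1:2]

print(by_label)
print(by_position)
```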


In [58]:
# Subsetting with bracket notation is similar to using R
state_data = pd.DataFrame(np.arange(16).reshape((4, 4)), 
                         index=['ohio', 'colorado', 'utah', 'new york'],
                         columns=['one', 'two', 'three', 'four'])
state_data

          one  two  three  four
ohio        0    1      2     3
colorado    4    5      6     7
utah        8    9     10    11
new york   12   13     14    15


In [59]:
# Boolean DataFrame
state_data > 5

            one    two  three   four
ohio      False  False  False  False
colorado  False  False   True   True
utah       True   True   True   True
new york   True   True   True   True


In [60]:
# Instead of subsetting, assign 0 to every element less than 5
state_data[state_data < 5] = 0
state_data

          one  two  three  four
ohio        0    0      0     0
colorado    0    5      6     7
utah        8    9     10    11
new york   12   13     14    15


Note: the textbook uses the `ix` indexer, which was deprecated and has since been removed from pandas.  Use `loc` instead.  It is an indexer rather than a method, so it takes square brackets: `df.loc[rows, columns]`, where the row labels come first and the column labels are optional.  The general syntax is otherwise the same: to select multiple rows or columns, pass a list of labels or use slice notation.

In [67]:
state_data.loc['colorado', ['one', 'two']]

one    0
two    5
Name: colorado, dtype: int64
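
For comparison, the sketch below shows `loc` next to its integer-position counterpart `iloc`. It rebuilds `state_data` from scratch, so the values differ from the modified frame above, where entries below 5 were set to 0.

```python
import numpy as np
import pandas as pd

state_data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                          index=['ohio', 'colorado', 'utah', 'new york'],
                          columns=['one', 'two', 'three', 'four'])

# Label-based: row 'colorado', columns 'one' and 'two'
by_label = state_data.loc['colorado', ['one', 'two']]

# Position-based: the same cells, addressed by integer position
by_position = state_data.iloc[1, [0, 1]]

# Label slices work inside loc too, and include the endpoint 'utah'
label_slice = state_data.loc[:'utah', 'two']

print(by_label)
print(label_slice)
```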

### Arithmetic and data alignment

pandas performs arithmetic between two objects (Series or DataFrames) by aligning them on their index labels (another reason a stable, immutable index is useful).  Where a label appears in both objects, the values are combined.  The result's index is the union of the two indexes, and any label present in only one of them gets NaN.

In [69]:
# Create two Series
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

# Adding these two together-- the only matching indexes are a, c, and e
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

In [71]:
# If we want to fill by a specific value, like 0, where the locations don't overlap, use the add() method
s1.add(s2, fill_value=0)

a    5.2
c    1.1
d    3.4
e    0.0
f    4.0
g    3.1
dtype: float64
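
The same alignment applies to DataFrames, where it happens on both rows and columns. This is a minimal sketch with two made-up frames, `df1` and `df2`, that overlap in only one cell.

```python
import numpy as np
import pandas as pd

# Hypothetical frames sharing only the row 'ohio' and the column 'b'
df1 = pd.DataFrame(np.ones((2, 2)), index=['ohio', 'texas'], columns=['b', 'c'])
df2 = pd.DataFrame(np.ones((2, 2)), index=['ohio', 'utah'], columns=['b', 'd'])

# Plain addition: result covers the union of rows and columns,
# with NaN wherever either frame lacks the cell
total = df1 + df2

# add() with fill_value treats a cell missing from one frame as 0;
# cells missing from both frames stay NaN
filled = df1.add(df2, fill_value=0)

print(total)
print(filled)
```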

### Function application and mapping

In [78]:
# NumPy functions that operate element-wise will work with pandas DataFrames
data_frame = pd.DataFrame(np.random.randn(4, 3),
                       columns=['b', 'd', 'e'],
                       index=['utah', 'ohio', 'texas', 'oregon'])

data_frame.index.name = 'state'
data_frame

               b         d         e
state
utah   -0.074772 -0.219961 -1.078329
ohio   -1.429948 -1.703525  0.291113
texas   1.196553 -0.895521 -0.874538
oregon  1.485745 -0.236223 -0.417666


In [79]:
# Get absolute value
np.abs(data_frame)

               b         d         e
state
utah    0.074772  0.219961  1.078329
ohio    1.429948  1.703525  0.291113
texas   1.196553  0.895521  0.874538
oregon  1.485745  0.236223  0.417666


In [82]:
# Some functions are already DataFrame methods-- e.g., sum and mean
# Get the mean of each row (across columns)
print(f'Mean measurement for each state:\n{data_frame.mean(axis=1)}\n')

# Get mean of each column (across rows)
print(f'Mean measurement for each variable:\n{data_frame.mean(axis=0)}\n')

Mean measurement for each state:
state
utah     -0.457688
ohio     -0.947453
texas    -0.191169
oregon    0.277285
dtype: float64

Mean measurement for each variable:
b    0.294395
d   -0.763808
e   -0.519855
dtype: float64



In [83]:
# apply() runs a function along an axis (one column or row at a time);
# applymap() runs it on every element of a DataFrame, and map() does the same for a Series
# Note: pandas 2.1 deprecated applymap() in favor of DataFrame.map()
# Create a lambda function that formats each number as a string with 2 decimal places
format_float = lambda x: '%.2f' % x
data_frame.applymap(format_float)

            b      d      e
state
utah    -0.07  -0.22  -1.08
ohio    -1.43  -1.70   0.29
texas    1.20  -0.90  -0.87
oregon   1.49  -0.24  -0.42
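
The cell above only exercises the element-wise case. As a sketch of the axis-wise case, the example below applies a small helper (`spread`, a name chosen for this illustration) to each column and then to each row of a made-up frame with fixed values, and uses `Series.map()` for element-wise formatting of a single column.

```python
import pandas as pd

# Hypothetical fixed data so the results are reproducible
df = pd.DataFrame({'b': [1.0, 4.0, 2.5], 'd': [-2.0, 0.5, 3.0]},
                  index=['utah', 'ohio', 'texas'])

def spread(values):
    """Return the difference between the largest and smallest value."""
    return values.max() - values.min()

# Default axis=0: the function receives one column at a time
col_spread = df.apply(spread)

# axis=1: the function receives one row at a time
row_spread = df.apply(spread, axis=1)

# Element-wise formatting of a single column with Series.map()
formatted_b = df['b'].map(lambda x: f'{x:.2f}')

print(col_spread)
print(row_spread)
print(formatted_b)
```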


### Sorting and ranking

In [85]:
# Create new object
obj = pd.Series(range(4), index = ['d', 'a', 'd', 'c'])
obj

d    0
a    1
d    2
c    3
dtype: int64

In [86]:
# Sort index names
obj.sort_index()

a    1
c    3
d    0
d    2
dtype: int64

In [87]:
# The default is ascending order-- you can also reverse
obj.sort_index(ascending=False)

d    0
d    2
c    3
a    1
dtype: int64

In [92]:
# Note-- the (outdated) textbook sorts values with the order() method
# That method was deprecated and later removed from pandas
# Use the sort_values() method instead
print(f'Values sorted in ascending order:\n{obj.sort_values(ascending=True)}\n')
print(f'Values sorted in descending order:\n{obj.sort_values(ascending=False)}')

Values sorted in ascending order:
d    0
a    1
d    2
c    3
dtype: int64

Values sorted in descending order:
c    3
d    2
a    1
d    0
dtype: int64


In [95]:
# Similar to R, we can sort values in one or more columns
data_frame.sort_values(by = ['b', 'e'])

               b         d         e
state
ohio   -1.429948 -1.703525  0.291113
utah   -0.074772 -0.219961 -1.078329
texas   1.196553 -0.895521 -0.874538
oregon  1.485745 -0.236223 -0.417666


In [96]:
# rank() assigns each value its position in sorted order (starting at 1), using a rule to break ties
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj

0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64

In [98]:
# By default, tied values each receive the mean of the ranks they would occupy
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [99]:
# Break ties by order of appearance in the data instead
obj.rank(method='first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64
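
Two more tie-breaking rules are worth a quick look. This sketch reuses the same seven values and shows `method='min'` as well as a descending rank with `method='max'`.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# 'min': every member of a tied group gets the lowest rank in the group
min_rank = obj.rank(method='min')

# Rank from largest to smallest; 'max' gives tied values
# the highest rank in their group
desc_max_rank = obj.rank(ascending=False, method='max')

print(min_rank)
print(desc_max_rank)
```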

### Axis indexes with duplicate values

Interestingly, unlike row names in R data frames, index values in pandas are not required to be unique.  The boolean `is_unique` attribute of a Series' index reports whether its labels are unique.

In [102]:
obj.index.is_unique

True

In [112]:
# Create new Series with duplicate index names
obj = pd.Series([7, -5, 7, 4, 2, 0, 4], index=['a', 'a', 'b', 'c', 'd', 'e', 'f'])
obj

a    7
a   -5
b    7
c    4
d    2
e    0
f    4
dtype: int64

In [113]:
obj.index.is_unique

False

In [114]:
# Behavior for indexing is different with duplicates-- returns a Series instead of a scalar
obj['a']

a    7
a   -5
dtype: int64
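
Since the return type depends on whether a label is duplicated, code that indexes by label may need to handle both cases. The sketch below checks the two return types and shows one possible way, among several, to keep only the first entry for each label using `Index.duplicated()`.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4],
                index=['a', 'a', 'b', 'c', 'd', 'e', 'f'])

# A duplicated label returns a Series; a unique label returns a scalar
dup_result = obj['a']
single_result = obj['c']

# duplicated() marks repeat occurrences of a label; negating the mask
# keeps only the first entry for each label
first_only = obj[~obj.index.duplicated(keep='first')]

print(type(dup_result).__name__)
print(first_only)
```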

## 5.3: Summarizing and computing descriptive statistics