# Minimal Modern Pandas

The Pandas library gives a rich set of tools for representing and working with real-world data. 

Pandas is a truly valuable, but truly huge, API. The [essential basic functionality](https://pandas.pydata.org/pandas-docs/version/1.5/user_guide/basics.html#attributes-and-underlying-data) page of the Pandas docs would be 51 pages printed out. The core `DataFrame` object has 231 members in Pandas 1.5.2, many of them very complex methods. This doc therefore attempts to provide a friendly and opinionated introduction to the Pandas API, pointing out just enough to enable you to get the job done and avoid common pitfalls.

For a deeper dive, look at Jake Vanderplas's book [TODO link].  Some alternative points of view include the Kaggle tutorial [TODO link], Matt Harrison's [tutorials](https://github.com/ponder-org/professional-pandas/tree/main) and Tom Augspurger's [blog](https://tomaugspurger.net/posts/modern-1-intro/).

Version note: Pandas 2 was released recently, in spring 2023. As it is a major overhaul that includes a new optional backend (Apache Arrow as an alternative to NumPy), and I haven't yet migrated any existing code to this release, I'm sticking with 1.5.2 for this doc for now.

In [32]:
# It's conventional to import pandas under the shorter name pd.
import pandas as pd

In [33]:
print(f"There are {len(pd.DataFrame.__dict__)} members of the DataFrame class.")

There are 231 members of the DataFrame class.


## Series

Pandas dataframes are built from `Series`. A Series is *almost* a dictionary, and can be built from one, and used like one. It's better to think of a Series, however, as a little data table, just as Jupyter notebooks display them by default.  Here are two little Series, one with string keys, one with integer keys.

In [34]:
d = {'Heads': 60, 'Tails': 40}
s = pd.Series(d, name='100 coin flips')
s

Heads    60
Tails    40
Name: 100 coin flips, dtype: int64

In [35]:
r = pd.Series({1: 10, 2: 9, 3: 11, 4: 8, 5: 12, 6: 10}, name='60 die rolls')
r

1    10
2     9
3    11
4     8
5    12
6    10
Name: 60 die rolls, dtype: int64

Just like dicts, you can efficiently access individual values using indexing by keys.  Unlike dicts, you can also use the keys as attributes, if their names are valid Python identifiers.  

In [36]:
s['Heads']

60

In [37]:
s.Heads

60
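
One caveat with attribute access is worth a quick sketch. The Series `t` below is hypothetical, made up only so that one of its keys collides with an existing Series method. In that case the method wins, so bracket indexing is probably the safer habit.

```python
import pandas as pd

# Hypothetical Series; the key 'sum' collides with the Series.sum method.
t = pd.Series({'Heads': 60, 'sum': 100})

by_key = t['sum']   # bracket indexing looks up the key
by_attr = t.sum     # attribute access finds the method, not the key

print(by_key)
print(callable(by_attr))
```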

You can iterate over the keys and values of a Series just like a dictionary, using the `.items()` method.

In [38]:
for key, val in s.items():
    print(key, val)

Heads 60
Tails 40


We start to see the advantage of Series over dicts when we realize that we can operate on them as a whole. Arithmetic operations work just as you would expect on the whole table. The `sum()` method computes the sum of all the values, for example.  The `name` property of the Series will be very handy when we start collecting them into DataFrames. 

In [39]:
rate = s / s.sum()
rate.name = 'Rate of heads'
rate

Heads    0.6
Tails    0.4
Name: Rate of heads, dtype: float64
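
As a small illustration of whole-Series arithmetic, the sketch below adds a second, made-up batch of flips (`more` is a hypothetical name and its numbers are invented). Its keys are deliberately listed in a different order, to show that values are matched by key rather than by position.

```python
import pandas as pd

s = pd.Series({'Heads': 60, 'Tails': 40}, name='100 coin flips')
# Hypothetical second batch of flips, with the keys in a different order.
more = pd.Series({'Tails': 45, 'Heads': 55})

total = s + more    # values are matched up by key, not by position
doubled = s * 2     # a scalar is applied to every value

print(total)
print(doubled)
```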

Even more compellingly, we can conveniently select out subsets of the Series, and those subsets are themselves Series with the same advantages.

For instance, we can select a range of keys. The modern syntax for selecting by label uses `.loc[]` indexing. Note that `.loc` takes square brackets, not round ones; think of it as a property that returns an indexable object. Also note that the arguments are *labels*, not positions, and that a label slice includes both of its endpoints.

In [40]:
s.loc['Heads':'Tails']

Heads    60
Tails    40
Name: 100 coin flips, dtype: int64

In [41]:
r.loc[1:3]

1    10
2     9
3    11
Name: 60 die rolls, dtype: int64

The `r.loc[1:3]` selection is itself a Series that we can operate on with all the whole-Series operations. 

In [42]:
r.loc[1:3].sum()

30

Another very important subsetting operation, which confusingly is also done on the `.loc` indexer, is indexing with a boolean Series.  Let's make a Series with the same index as `r` whose values are bools.

In [43]:
evens = pd.Series({1: False, 2: True, 3: False, 4: True, 5: False, 6: True})
evens

1    False
2     True
3    False
4     True
5    False
6     True
dtype: bool

Now when we index `r` with this boolean Series, we get a Series of the counts for the even rolls.

In [44]:
r.loc[evens]

2     9
4     8
6    10
Name: 60 die rolls, dtype: int64
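
In practice a boolean mask is rarely typed out by hand. A minimal sketch, assuming the same `r` as above: comparisons on the labels or on the values produce a mask directly. The names `even_labels` and `at_least_ten` are just illustrative.

```python
import pandas as pd

r = pd.Series({1: 10, 2: 9, 3: 11, 4: 8, 5: 12, 6: 10}, name='60 die rolls')

even_labels = r.index % 2 == 0   # boolean array computed from the labels
at_least_ten = r >= 10           # boolean Series computed from the values

print(r.loc[even_labels])
print(r.loc[at_least_ten])
```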

Naturally we have boolean logic operations on boolean series. 

In [45]:
~evens

1     True
2    False
3     True
4    False
5     True
6    False
dtype: bool

In [62]:
print(f"Rate of odds: {r.loc[~evens].sum() / r.sum()}")

Rate of odds: 0.55
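
Besides `~` for "not", boolean Series can be combined with `&` ("and") and `|` ("or"). The sketch below reuses `r` and `evens`; the thresholds are arbitrary. The Python keywords `and` and `or` do not work element-wise on a Series, so the operators are needed here.

```python
import pandas as pd

r = pd.Series({1: 10, 2: 9, 3: 11, 4: 8, 5: 12, 6: 10}, name='60 die rolls')
evens = pd.Series({1: False, 2: True, 3: False, 4: True, 5: False, 6: True})

# Parentheses matter: & and | bind more tightly than comparisons like >.
even_and_big = evens & (r > 9)
even_or_big = evens | (r > 10)

print(r.loc[even_and_big])
print(r.loc[even_or_big])
```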


It's also reasonable to think of a Series as two parallel arrays, and you can construct them that way also. Indeed, this is more like the way they are actually stored, but internally Pandas prepares a [dict of label to location](https://pandas.pydata.org/pandas-docs/version/1.5/development/internals.html) to enable constant-time lookups (not scaling with the length of the Series).

In [47]:
pd.Series(index=['Heads', 'Tails'], data=[60, 40])

Heads    60
Tails    40
dtype: int64

The `keys` of a Series are actually an `Index`, but for now think of it as an array. More on Index objects later.

In [48]:
s.keys()

Index(['Heads', 'Tails'], dtype='object')

The `values` of the Series are another array, parallel to the keys. 

In [49]:
s.values

array([60, 40])

Continuing the theme, you can treat the data in the Series like an array, selecting by integer position with `.iloc` indexing. Just as before, this subset is a fully-paid-up Series.

In [50]:
r.iloc[:2]

1    10
2     9
Name: 60 die rolls, dtype: int64
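
Because the keys of `r` are themselves integers, it is easy to mix up the two indexers. A small side-by-side sketch: the same slice `1:2` means different things to `.loc` (labels, end included) and `.iloc` (positions, end excluded).

```python
import pandas as pd

r = pd.Series({1: 10, 2: 9, 3: 11, 4: 8, 5: 12, 6: 10}, name='60 die rolls')

by_label = r.loc[1:2]       # labels 1 and 2, end label included
by_position = r.iloc[1:2]   # position 1 only, end position excluded

print(by_label)
print(by_position)
```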

## Major differences between Series and dicts

While it generally works to think of a `Series` as a better `dict`, there are some significant differences.  Here are a few that you're likely to come across.

### Series have dtypes 

While Python dicts can hold values of any type, a Series has a single consistent `dtype` such as `float64` or `int64`. The dtype may be deduced from the values or set explicitly. When the dtype is NumPy-compatible, so is the Series: the values of a Series can be viewed, without copying, as a one-dimensional NumPy array.

In [51]:
s.array

<PandasArray>
[60, 40]
Length: 2, dtype: int64

In [52]:
s_rate = pd.Series({'Heads': .60, 'Tails': .40})
s_rate

Heads    0.6
Tails    0.4
dtype: float64

A Series with dtype `object` can hold any value, which makes it even more similar to a Python dict.

In [53]:
pd.Series({'Heads': [0, 0, 1, 1], 'Tails': None})

Heads    [0, 0, 1, 1]
Tails            None
dtype: object
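
To make the "deduced or set explicitly" point concrete, here is a small sketch with made-up values. The variable names are illustrative only, and the exact integer width printed may depend on your setup.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])                      # deduced as an integer dtype
floats = pd.Series([1, 2.5])                     # one float promotes the whole Series
forced = pd.Series([1, 2, 3], dtype='float64')   # dtype set explicitly
mixed = pd.Series([[0, 1], None])                # no common numeric type: object

for series in (ints, floats, forced, mixed):
    print(series.dtype)

as_numpy = ints.to_numpy()   # one-dimensional NumPy array of the values
print(as_numpy.shape)
```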

### Series can have repeated keys

While Python dicts can only have one value for a given key, a Series is allowed to have multiple values for the same key (as the parallel-arrays view of a Series suggests). This is sometimes an annoyance, sometimes essential functionality, depending on what data you are representing.

In [54]:
s2 = pd.Series(index=['Heads', 'Tails', 'Heads'], data=[60, 40, 50])
s2

Heads    60
Tails    40
Heads    50
dtype: int64

Indexing a Series by a non-repeated key gives you a single value, but indexing the same Series by a repeated key gives you back a smaller Series instead - watch out! This foreshadows other kinds of advanced indexing we'll discuss later.

In [55]:
s2['Heads']

Heads    60
Heads    50
dtype: int64

In [56]:
s2['Tails']

40
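
If repeated keys would be a surprise in your data, it may help to check for them up front. A minimal sketch using `s2` from above. Summing the repeats is just one possible way to collapse them, not necessarily the right one for your data.

```python
import pandas as pd

s2 = pd.Series(index=['Heads', 'Tails', 'Heads'], data=[60, 40, 50])

print(s2.index.is_unique)      # False when any key repeats
print(s2.index.duplicated())   # marks repeats after the first occurrence

# One possible way to collapse repeats: add up the values that share a key.
combined = s2.groupby(level=0).sum()
print(combined)
```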

### Lookups are slower

If you're working with big data, bear in mind that lookups in a Series are substantially slower than in a dictionary, by close to two orders of magnitude in my experiments.

In [57]:
%%timeit 
d['Heads']

15 ns ± 0.177 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)


In [58]:
%%timeit 
s['Heads']

1.11 µs ± 24.7 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
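
The practical consequence is to avoid key-by-key lookups in hot loops. The sketch below is a suggestion rather than a benchmark: prefer a whole-Series operation where one exists, and if per-key lookups are unavoidable, copy the Series into a plain dict first.

```python
import pandas as pd

s = pd.Series({'Heads': 60, 'Tails': 40}, name='100 coin flips')

# Slow pattern: one Series lookup per key inside a Python loop.
total_loop = 0
for key in s.index:
    total_loop += s[key]

# Usually preferable: a single whole-Series operation.
total_vectorized = s.sum()

# If per-key lookups are unavoidable, a plain dict copy restores dict speed.
lookup = s.to_dict()
print(total_loop, total_vectorized, lookup['Heads'])
```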


## DataFrames

A Pandas `DataFrame` is best thought of as a dictionary of Series, often of disparate types, that have one thing in common: they share an `Index`.  We can construct one using the parallel-arrays point of view, by giving a dictionary of arrays and one more array for the index. 

In [59]:
df = pd.DataFrame(index=['Heads', 'Tails'], data={'counts': [60, 40], 'rate': [.6, .4]})
df

|       | counts | rate |
|-------|--------|------|
| Heads | 60     | 0.6  |
| Tails | 40     | 0.4  |


Observe that the shared Index becomes the row labels, while the dictionary keys become the column labels, of the resulting dataframe.
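
The "dictionary of Series" view can be checked directly. In this sketch, indexing the DataFrame by a column label returns a Series that carries the shared index, and `.items()` iterates over the columns much as it would over a dict.

```python
import pandas as pd

df = pd.DataFrame(index=['Heads', 'Tails'],
                  data={'counts': [60, 40], 'rate': [.6, .4]})

counts = df['counts']   # a column is a Series named after its key
print(counts.name, list(counts.index))

for column_name, column in df.items():   # iterate like a dict of Series
    print(column_name, column.dtype)

print(df.loc['Heads', 'rate'])   # row label first, then column label
```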

You can also create a DataFrame using a dict of Series, as long as they have matching indexes.  (They don't have to be literally the same object, just have a consistent set of values.)

In [60]:
pd.DataFrame(data={'counts': s, 'rate': s_rate})

|       | counts | rate |
|-------|--------|------|
| Heads | 60     | 0.6  |
| Tails | 40     | 0.4  |


Actually the indices don't even have to be completely consistent: any missing values will be filled in with NaN, the exceptional floating-point value that means "Not a Number". 

In [61]:
s_rate_heads = pd.Series({'Heads': .6})
pd.DataFrame(data={'counts': s, 'rate': s_rate_heads})

|       | counts | rate |
|-------|--------|------|
| Heads | 60     | 0.6  |
| Tails | 40     | NaN  |
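
Once NaNs are in a DataFrame, you will usually want to find them or handle them. A hedged sketch of two common options, using the same data as above: filter out the incomplete rows, or substitute a value. Here `df_gap` is just an illustrative name and `0.0` an arbitrary placeholder.

```python
import pandas as pd

s = pd.Series({'Heads': 60, 'Tails': 40})
s_rate_heads = pd.Series({'Heads': .6})
df_gap = pd.DataFrame(data={'counts': s, 'rate': s_rate_heads})

missing = df_gap['rate'].isna()     # boolean Series, True where the value is NaN
print(missing)
print(df_gap.loc[~missing])         # keep only the rows that have a rate
print(df_gap['rate'].fillna(0.0))   # or substitute a placeholder value
```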
