# Minimal Modern Pandas

The Pandas library provides a rich set of tools for representing and working with real-world data.

Pandas is a truly valuable, but truly huge, API. The [essential basic functionality](https://pandas.pydata.org/pandas-docs/version/1.5/user_guide/basics.html#attributes-and-underlying-data) page of the Pandas docs would run to 51 pages if printed out. The core `DataFrame` class has 231 members in Pandas 1.5.2, many of them very complex methods. This doc therefore attempts to provide a friendly and opinionated introduction to the Pandas API, pointing out just enough to let you get the job done and avoid common pitfalls.

For deeper-dive Pandas introductions, look at Jake VanderPlas's book [TODO link], the Kaggle tutorial [TODO link], or Matt Harrison's [tutorials](https://github.com/ponder-org/professional-pandas/tree/main).

Version note: Pandas 2 was released recently, in spring 2023. It is a major release that, among other changes, adds an optional Apache Arrow backend alongside the default NumPy one. Since I haven't yet migrated any existing code to it, I'm sticking with 1.5.2 for this doc.

In [1]:
# It's conventional to import pandas under the shorter name pd.
import pandas as pd

In [2]:
print(f"There are {len(pd.DataFrame.__dict__)} members of the DataFrame class.")

There are 231 members of the DataFrame class.


## Series

Pandas DataFrames are built from `Series`. A Series is *almost* a dictionary: it can be built from one and used like one. It's better, however, to think of a Series as a little data table, which is how Jupyter notebooks display it by default.


In [3]:
d = {'Heads': 60, 'Tails': 40}
s = pd.Series(d, name = '100 coin flips')
s

Heads    60
Tails    40
Name: 100 coin flips, dtype: int64

Arithmetic operations work just as you would expect on the whole table. The `sum()` method computes the sum of all the values.

In [23]:
rate = s / s.sum()
rate.name = 'Rate of heads'
rate

Heads    0.6
Tails    0.4
Name: Rate of heads, dtype: float64
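
As a small sketch of what "the whole table" means here: arithmetic with a scalar applies to every value, and arithmetic between two Series matches values up by label rather than by position. The `other` Series below is made up purely for illustration.

```python
import pandas as pd

s = pd.Series({'Heads': 60, 'Tails': 40}, name='100 coin flips')

# A scalar operation is applied to every value in the Series.
doubled = s * 2

# Dividing by the total turns counts into proportions.
share = s / s.sum()

# Hypothetical second Series with the same labels in a different order.
other = pd.Series({'Tails': 1, 'Heads': 2})

# Values are paired by label, so 'Heads' is added to 'Heads'.
total = s + other

print(doubled)
print(share)
print(total)
```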

You can access individual values using dictionary-style indexing by keys, or by using the keys as attributes. Internally Pandas prepares a [dict of label to location](https://pandas.pydata.org/pandas-docs/version/1.5/development/internals.html) to enable constant-time lookups (not scaling with the length of the array). 

In [5]:
s['Heads']

60

In [6]:
s.Heads

60
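
One pitfall with attribute-style access is worth a quick sketch: if a key has the same name as an existing Series attribute or method, the attribute wins. The extra `'name'` key below is a contrived example, not part of the coin-flip data.

```python
import pandas as pd

# Contrived data: the key 'name' collides with the Series.name attribute.
s = pd.Series({'Heads': 60, 'Tails': 40, 'name': 7}, name='100 coin flips')

# Bracket indexing always looks the key up in the index.
by_key = s['name']

# Attribute access finds the real Series.name attribute first.
by_attr = s.name

print(by_key)
print(by_attr)
print(s.Heads)  # no collision, so this falls back to the index lookup
```

Bracket indexing is therefore the safer default in code that has to handle arbitrary keys, including keys that are not valid Python identifiers.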

It's also reasonable to think of a Series as two parallel arrays, and you can construct one that way too.

In [7]:
pd.Series(index=['Heads', 'Tails'], data=[60, 40])

Heads    60
Tails    40
dtype: int64

The keys of a Series are actually stored in an `Index` object, but for now think of it as an array. More on Index objects later.

In [8]:
s.keys()

Index(['Heads', 'Tails'], dtype='object')

The `values` of the Series are another array, parallel to the keys. 

In [9]:
s.values

array([60, 40])

You can iterate over the keys and values of a Series just like a dictionary.

In [10]:
for key, val in s.items():
    print(key, val)

Heads 60
Tails 40


A Pandas Series can have a `name`, which becomes very important when we start collecting multiple Series objects into DataFrames. You can pass the name to the constructor, as we did earlier, or set it afterwards.

In [11]:
s.name = '100 coin flips'
s

Heads    60
Tails    40
Name: 100 coin flips, dtype: int64

### Series have dtypes 

While Python dicts can hold values of any type, a Series has a single consistent `dtype` such as `float64` or `int64`. The dtype may be deduced from the values or set explicitly. When the dtype is NumPy-compatible, so is the Series: the values of a Series can be viewed, without copying, as a one-dimensional array.

In [12]:
s.array

<PandasArray>
[60, 40]
Length: 2, dtype: int64

In [13]:
s_rate = pd.Series({'Heads': .60, 'Tails': .40})
s_rate

Heads    0.6
Tails    0.4
dtype: float64
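
The following sketch illustrates the three claims above: dtype deduction, setting the dtype explicitly, and the no-copy NumPy view. The variable names are illustrative, and the memory-sharing check reflects the behavior I would expect for plain NumPy dtypes; other dtypes may need a copy.

```python
import numpy as np
import pandas as pd

# The dtype is deduced from the values...
ints = pd.Series([60, 40])
floats = pd.Series([0.6, 0.4])

# ...and mixing ints with floats upcasts everything to float64.
mixed = pd.Series([60, 0.4])

# The dtype can also be set explicitly.
forced = pd.Series([60, 40], dtype='float64')

print(ints.dtype, floats.dtype, mixed.dtype, forced.dtype)

# For a NumPy-compatible dtype, to_numpy() hands back a 1-D array
# over the same memory rather than a copy.
arr = ints.to_numpy()
print(arr.ndim, np.shares_memory(arr, ints.to_numpy()))
```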

A Series with dtype `object` can hold any value, which makes it even more similar to a Python dict.

In [14]:
pd.Series({'Heads': [0, 0, 1, 1], 'Tails': None})

Heads    [0, 0, 1, 1]
Tails            None
dtype: object

### Series can have repeated keys

While Python dicts can only have one value for a given key, a Series is allowed to have multiple values for the same key (as the parallel-arrays view of a Series suggests). This is sometimes an annoyance, sometimes essential functionality, depending on what data you are representing.

In [15]:
s2 = pd.Series(index=['Heads', 'Tails', 'Heads'], data=[60,40,50])
s2

Heads    60
Tails    40
Heads    50
dtype: int64

Indexing a Series by a non-repeated key gives you a single value, but indexing the same Series by a repeated key gives you back a smaller Series instead - watch out! This foreshadows other kinds of advanced indexing we'll discuss later.

In [16]:
s2['Heads']

Heads    60
Heads    50
dtype: int64

In [17]:
s2['Tails']

40
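
If the changing return type is a concern, one defensive approach (a sketch, not the only option) is to check the index for duplicates up front, or to index with a list so that the result is a Series either way.

```python
import pandas as pd

s2 = pd.Series(index=['Heads', 'Tails', 'Heads'], data=[60, 40, 50])

# is_unique reports whether any key is repeated.
print(s2.index.is_unique)

# duplicated() flags each repeat after the first occurrence of a key.
print(s2.index.duplicated())

# Indexing with a list of keys returns a Series even for a unique key.
heads = s2.loc[['Heads']]
tails = s2.loc[['Tails']]
print(type(heads).__name__, len(heads))
print(type(tails).__name__, len(tails))
```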

### Lookups are slower

If you're working with big data, bear in mind that lookups in a Series are substantially slower than in a dictionary, by close to two orders of magnitude in my experiments.

In [18]:
%%timeit 
d['Heads']

15.9 ns ± 0.232 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)


In [19]:
%%timeit 
s['Heads']

1.06 µs ± 8.36 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
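
Outside a notebook, the same comparison can be sketched with the standard-library `timeit` module. The absolute numbers depend on your machine and Pandas version, so treat them as indicative only. The `to_dict()` conversion at the end is one possible workaround when a hot loop needs many scalar lookups; it is my suggestion rather than something the measurements above prescribe.

```python
import timeit

import pandas as pd

d = {'Heads': 60, 'Tails': 40}
s = pd.Series(d)

n = 20_000  # number of repetitions; chosen arbitrarily for a quick run

# Time n lookups in the plain dict and in the Series.
t_dict = timeit.timeit(lambda: d['Heads'], number=n)
t_series = timeit.timeit(lambda: s['Heads'], number=n)

print(f"dict:   {t_dict / n * 1e9:.0f} ns per lookup")
print(f"Series: {t_series / n * 1e9:.0f} ns per lookup")

# Possible workaround: convert once, then do the lookups on the dict.
lookup = s.to_dict()
print(lookup['Heads'])
```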


## DataFrames

A Pandas `DataFrame` is best thought of as a dictionary of Series, often of disparate types, that have one thing in common: they share an `Index`.  We can construct one using the parallel-arrays point of view, by giving a dictionary of arrays and one more array for the index. 

In [20]:
df = pd.DataFrame(index=['Heads', 'Tails'], data={'counts': [60, 40], 'rate': [.6, .4]})
df

       counts  rate
Heads      60   0.6
Tails      40   0.4


Observe that the shared Index becomes the row labels, while the dictionary keys become the column labels, of the resulting dataframe.
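
To make the "dictionary of Series" picture concrete, here is a short sketch that pulls one column back out of the DataFrame and checks that it is a Series carrying the shared index, with its own dtype.

```python
import pandas as pd

df = pd.DataFrame(index=['Heads', 'Tails'],
                  data={'counts': [60, 40], 'rate': [.6, .4]})

# Dictionary-style indexing by column label returns a Series.
counts = df['counts']
print(type(counts).__name__)

# The column label becomes the name of the Series.
print(counts.name)

# Each column carries the same row labels as the DataFrame.
print(counts.index.equals(df.index))

# Columns keep their own dtypes.
print(df.dtypes)
```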

You can also create a DataFrame using a dict of Series, as long as they have matching indexes.  (They don't have to be literally the same object, just have a consistent set of values.)

In [21]:
pd.DataFrame(data={'counts': s, 'rate': s_rate})

       counts  rate
Heads      60   0.6
Tails      40   0.4
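
As a sketch of what "a consistent set of values" allows: the two Series below list the same keys in a different order, and the values are still lined up by label. The name `s_rate_reordered` is made up for this example.

```python
import pandas as pd

s = pd.Series({'Heads': 60, 'Tails': 40})

# Same keys as s, but listed in the opposite order.
s_rate_reordered = pd.Series({'Tails': .40, 'Heads': .60})

df = pd.DataFrame(data={'counts': s, 'rate': s_rate_reordered})

# Each rate ends up next to the count with the same label.
print(df.loc['Heads', 'rate'])
print(df.loc['Tails', 'rate'])
print(df)
```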


In fact, the indexes don't even have to be completely consistent: any missing values will be filled in with NaN, the exceptional floating-point value that means "Not a Number".

In [22]:
s_rate_heads = pd.Series({'Heads': .6})
pd.DataFrame(data={'counts': s, 'rate': s_rate_heads})

       counts  rate
Heads      60   0.6
Tails      40   NaN
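
A related pitfall, sketched below: because NaN is a floating-point value, an integer column that picks up a missing value is converted to `float64` (with the default NumPy-backed dtypes). The `isna()` method shows where the gaps are. The `s_counts_heads` Series is made up for this example.

```python
import pandas as pd

s = pd.Series({'Heads': 60, 'Tails': 40})
s_rate_heads = pd.Series({'Heads': .6})

df = pd.DataFrame(data={'counts': s, 'rate': s_rate_heads})

# isna() marks the cells that were filled in with NaN.
print(df['rate'].isna())

# Hypothetical reverse case: the integer counts are the incomplete column.
s_counts_heads = pd.Series({'Heads': 60})
s_rate = pd.Series({'Heads': .6, 'Tails': .4})
df2 = pd.DataFrame(data={'counts': s_counts_heads, 'rate': s_rate})

# The counts column can no longer be int64, since it has to hold NaN.
print(df2.dtypes)
```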
