# More than 10 Minutes to Pandas

## Pandas First Steps Part 3

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Using the Index

In [None]:
df = pd.DataFrame()
df['Name'] = ['John', 'Xavier', 'Alana', 'Ben']
df['Office'] = ['103A', '300B', '201A', '202B']
df

In [None]:
df.index

In [None]:
df1 = df.set_index('Name')
df1

In [None]:
df2 = df1.sort_index()
df2

In [None]:
df2['Floor'] = [2, 2, 1, 4]
df2['Exit'] = ['North', 'South', 'North', 'West']
df2

In [None]:
df2.loc['Alana']

In [None]:
df2.loc['John':'Xavier']

In [None]:
df2.loc['John':'Xavier', 'Office']
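One detail worth noting about the label slices above: unlike positional slices, `.loc` slices include both endpoints. The sketch below uses a hypothetical miniature of the office table (the `offices` name and its contents are made up for illustration) to compare the two.

```python
import pandas as pd

# Hypothetical miniature of the office table, indexed by name and sorted.
offices = pd.DataFrame(
    {"Office": ["201A", "202B", "103A", "300B"]},
    index=pd.Index(["Alana", "Ben", "John", "Xavier"], name="Name"),
)

# Label slice: both 'Ben' and 'John' are included.
by_label = offices.loc["Ben":"John"]

# Positional slice: the stop position (3) is excluded, as in plain Python.
by_position = offices.iloc[1:3]

print(by_label.index.tolist())     # ['Ben', 'John']
print(by_position.index.tolist())  # ['Ben', 'John']
```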

In [None]:
df3 = df2.reset_index()
df3

In [None]:
df4 = df3.set_index(['Floor', 'Name'])
df4

In [None]:
df4.loc[2]

In [None]:
df4.loc[(2, 'Ben')]

In [None]:
df4.loc[2, 'Exit']

## Reshaping

See the sections on [Hierarchical Indexing](http://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-hierarchical) and [Reshaping](http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-stacking).

### Stack

In [None]:
tuples = list(zip(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
                  ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']))

tuples

In [None]:
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

In [None]:
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

In [None]:
df2 = df[:4]

In [None]:
df2

The stack() method “compresses” a level in the DataFrame’s columns.

In [None]:
stacked = df2.stack()

In [None]:
stacked

In [None]:
stacked.index

With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the **last level**:

In [None]:
stacked.unstack(2) # arg -> which index level to convert to columns

In [None]:
stacked.unstack(1)

In [None]:
stacked.unstack(0)
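Because the frames above hold random numbers, the round trip is easier to check on fixed values. This is a minimal sketch on a hypothetical two-row frame (`small` is not part of the notebook's data), assuming no missing values.

```python
import pandas as pd

# Hypothetical two-row frame with fixed values, so the result is reproducible.
small = pd.DataFrame(
    {"A": [1, 2], "B": [3, 4]},
    index=pd.Index(["x", "y"], name="row"),
)

# stack() moves the column labels into a new innermost index level.
stacked_small = small.stack()
print(stacked_small.index.nlevels)  # 2

# unstack() with no argument pivots that innermost level back into columns.
round_trip = stacked_small.unstack()
print(round_trip.equals(small))  # True
```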

### Pivot Tables

See the section on [Pivot Tables](http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-pivot).

In [None]:
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                   'B': ['alpha', 'beta', 'gamma'] * 4,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D': np.random.randn(12),
                   'E': np.random.randn(12)})

In [None]:
df

We can produce pivot tables from this data very easily:

In [None]:
pivoted = pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
pivoted

In [None]:
pivoted.index

In [None]:
pivoted.columns
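It may help to see what `pivot_table` does when several rows land in the same cell. The sketch below uses hypothetical sales records (the `sales` frame and its column names are invented for illustration) to show that duplicates are averaged by default and that `aggfunc` changes the aggregation.

```python
import pandas as pd

# Hypothetical sales records; two rows share the same (region, product) pair.
sales = pd.DataFrame({
    "region": ["north", "north", "north", "south"],
    "product": ["tea", "tea", "coffee", "tea"],
    "units": [10, 20, 5, 7],
})

# Rows with the same (region, product) are aggregated; the default is the mean.
mean_units = pd.pivot_table(sales, values="units", index="region", columns="product")

# Pass aggfunc explicitly to aggregate differently, here with a sum.
total_units = pd.pivot_table(sales, values="units", index="region",
                             columns="product", aggfunc="sum")

print(mean_units.loc["north", "tea"])   # 15.0
print(total_units)

# Combinations that never occur (south, coffee) appear as NaN.
print(pd.isna(mean_units.loc["south", "coffee"]))  # True
```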

## Time Series

Pandas has simple, powerful, and efficient functionality for resampling during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications. See the [Time Series section](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries).

In [None]:
rng = pd.date_range('1/1/2012', periods=100, freq='s')  # lowercase 's' is the current alias for seconds
rng

Frequency cheatsheet: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases

In [None]:
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts

In [None]:
by_10_s = ts.resample('10s').sum()  # sum the values in each 10-second bin
by_10_s
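Since the series above is random, a fixed example makes the binning easier to verify by hand. This sketch uses six hypothetical one-second observations (`ticks` is invented for illustration) and groups them into 3-second bins.

```python
import pandas as pd

# Six one-second observations with fixed values (hypothetical data).
idx = pd.date_range("2012-01-01", periods=6, freq="s")
ticks = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Group into 3-second bins labelled by their left edge, then sum each bin.
by_3_s = ticks.resample("3s").sum()

print(by_3_s.tolist())  # [6, 15]
print(by_3_s.index[1])  # 2012-01-01 00:00:03
```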

### Rolling Aggregations

In [None]:
by_10_s.rolling(3).sum()

In [None]:
by_10_s.rolling('30s', min_periods=2).sum()

In [None]:
by_10_s.rolling('30s', min_periods=3).mean()
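The cells above mix two kinds of window: an integer counts observations, while an offset string such as `'30s'` covers a span of time. The following sketch, on a hypothetical four-point series, shows how the two differ at the start of the data.

```python
import pandas as pd

# Four observations spaced 10 seconds apart (hypothetical data).
idx = pd.date_range("2012-01-01", periods=4, freq="10s")
values = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# Fixed-count window: needs 3 observations, so the first two results are NaN.
by_count = values.rolling(3).sum()

# Time-based window: covers the 30 seconds ending at each timestamp.
# min_periods defaults to 1 here, so no result is NaN.
by_time = values.rolling("30s").sum()

# Requiring at least 2 observations turns only the first result into NaN.
by_time_min2 = values.rolling("30s", min_periods=2).sum()

print(by_count.tolist())      # [nan, nan, 6.0, 9.0]
print(by_time.tolist())       # [1.0, 3.0, 6.0, 9.0]
print(by_time_min2.tolist())  # [nan, 3.0, 6.0, 9.0]
```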

More complex (weighted) window types are listed in the SciPy documentation; using `win_type` requires SciPy to be installed:
* https://docs.scipy.org/doc/scipy/reference/signal.windows.html#module-scipy.signal.windows

In [None]:
by_10_s.rolling(3, win_type='gaussian').sum(std=3)

<img src='images/expand.jpeg'>
<sup>credit: Tony Yiu https://tonester524.medium.com/</sup>

In [None]:
by_10_s.expanding().mean()
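An expanding window starts at the first observation and grows by one at each step, so its mean is a running average. A small sketch on hypothetical fixed values, checked against a hand-computed cumulative sum divided by the count:

```python
import numpy as np
import pandas as pd

values = pd.Series([2.0, 4.0, 6.0, 8.0])  # hypothetical data

# Mean of everything seen so far at each position.
running_mean = values.expanding().mean()

# The same quantity by hand: cumulative sum divided by the number of points.
by_hand = values.cumsum() / np.arange(1, len(values) + 1)

print(running_mean.tolist())  # [2.0, 3.0, 4.0, 5.0]
print(np.allclose(running_mean, by_hand))  # True
```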

### Time zone representation

In [None]:
rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')

In [None]:
ts = pd.Series(np.random.randn(len(rng)), rng)

In [None]:
ts

In [None]:
ts_utc = ts.tz_localize('UTC')

In [None]:
ts_utc

Convert to another time zone:

In [None]:
ts_utc.tz_convert('US/Eastern')
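The difference between the two calls above is easy to miss: `tz_localize` attaches a zone to naive timestamps, while `tz_convert` re-expresses the same instant in another zone. A minimal sketch on a single timestamp, assuming the `US/Eastern` zone name is available in the local time zone database:

```python
import pandas as pd

naive = pd.Timestamp("2012-03-06 00:00")

# tz_localize attaches a zone; the wall-clock reading is unchanged.
utc = naive.tz_localize("UTC")

# tz_convert keeps the instant and shifts the wall-clock reading.
eastern = utc.tz_convert("US/Eastern")

print(utc)             # 2012-03-06 00:00:00+00:00
print(eastern)         # 2012-03-05 19:00:00-05:00
print(utc == eastern)  # True
```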

Converting between time span representations:

In [None]:
rng = pd.date_range('1/1/2012', periods=5, freq='M')  # month-end; pandas 2.2+ spells this alias 'ME'

In [None]:
ts = pd.Series(np.random.randn(len(rng)), index=rng)

In [None]:
ts

In [None]:
ts.index

In [None]:
ps = ts.to_period()

ps

In [None]:
ps.index

In [None]:
ps.to_timestamp()

Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following example, we convert a quarterly frequency with year ending in November to 9am on the first day of the month following the quarter end:

In [None]:
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')

In [None]:
ts = pd.Series(np.random.randn(len(prng)), prng)

In [None]:
ts

In [None]:
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('h', 's') + 9  # lowercase 'h' is the current alias for hours

In [None]:
ts.head()
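The one-line index arithmetic above is easier to follow when unrolled for a single period. The sketch below walks the first quarter through each step; the intermediate variable names are made up for illustration.

```python
import pandas as pd

# First quarter of fiscal year 1990, where the fiscal year ends in November.
quarter = pd.Period("1990Q1", freq="Q-NOV")

last_month = quarter.asfreq("M", how="e")     # last month of the quarter
next_month = last_month + 1                   # the month after the quarter ends
first_hour = next_month.asfreq("h", how="s")  # first hour of that month
nine_am = first_hour + 9                      # nine hours later

print(last_month)  # 1990-02
print(nine_am)     # 1990-03-01 09:00
```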

## Categoricals

Since version 0.15, pandas can include categorical data in a DataFrame. For full docs, see the [categorical introduction](http://pandas.pydata.org/pandas-docs/stable/categorical.html#categorical) and the [API documentation](http://pandas.pydata.org/pandas-docs/stable/api.html#api-categorical).

In [None]:
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})

Convert the raw grades to a categorical data type.

In [None]:
df["grade"] = df["raw_grade"].astype("category")

In [None]:
df["grade"]

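Rename the categories to more meaningful names. Methods under `Series.cat` return a new Series, so assign the result back (assigning directly to `Series.cat.categories` is no longer supported in pandas 2.0 and later).

In [None]:
df.grade.cat.categories

In [None]:
df.grade.cat.codes

In [None]:
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])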

Reorder the categories and simultaneously add the missing categories (methods under `Series.cat` return a new Series by default).

In [None]:
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])

In [None]:
df["grade"]

Sorting is per order in the categories, not lexical order.

In [None]:
df.sort_values(by="grade")

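Grouping by a categorical column also shows empty categories. Pass `observed=False` explicitly so that this holds regardless of the default in your pandas version.

In [None]:
df.groupby("grade", observed=False).size()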
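The categorical steps above can be condensed into one self-contained sketch. It uses the same grade data but a separate hypothetical frame (`grades_df`), so it can be run on its own.

```python
import pandas as pd

grades_df = pd.DataFrame({"raw_grade": ["a", "b", "b", "a", "a", "e"]})
grades_df["grade"] = grades_df["raw_grade"].astype("category")

# rename_categories returns a new Series; labels map positionally onto a, b, e.
grades_df["grade"] = grades_df["grade"].cat.rename_categories(
    ["very good", "good", "very bad"]
)

# set_categories fixes the category order and adds the unused categories.
grades_df["grade"] = grades_df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"]
)

# Sorting follows the category order, not alphabetical order.
ordered = grades_df.sort_values(by="grade")["grade"].tolist()
print(ordered)
# ['very bad', 'good', 'good', 'very good', 'very good', 'very good']

# observed=False keeps categories that never occur, with a count of 0.
counts = grades_df.groupby("grade", observed=False).size()
print(counts.to_dict())
# {'very bad': 1, 'bad': 0, 'medium': 0, 'good': 2, 'very good': 3}
```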

### Gotchas
Using a Series directly in a boolean context raises `ValueError: The truth value of a Series is ambiguous`, as in:

In [None]:
if pd.Series([False, True, False]):
    print("I was true")

See [Comparisons](http://pandas.pydata.org/pandas-docs/stable/basics.html#basics-compare) for an explanation and what to do.

See [Gotchas](http://pandas.pydata.org/pandas-docs/stable/gotchas.html#gotchas) as well.
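As a minimal sketch of the usual ways around this error, state explicitly which question is being asked of the Series: whether any element is true, whether all are, or whether it is empty.

```python
import pandas as pd

flags = pd.Series([False, True, False])

try:
    if flags:  # ambiguous, so pandas refuses to guess
        print("I was true")
except ValueError as err:
    print(type(err).__name__)  # ValueError

print(flags.any())  # True: at least one element is True
print(flags.all())  # False: not every element is True
print(flags.empty)  # False: the Series has elements
```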

## More Internals and Perf Info

https://tomaugspurger.github.io/modern-4-performance.html