In [1]:
import numpy as np
print("numpy version: {}".format(np.__version__))
import pandas as pd 
print("pandas version: {}".format(pd.__version__))
import matplotlib
import matplotlib.pyplot as plt
print("matplotlib version: {}".format(matplotlib.__version__))
import scipy as sp
print("scipy version: {}".format(sp.__version__))
import sklearn as sl
print("scikit-learn: {}".format(sl.__version__))
import seaborn as sns
print("seaborn: {}".format(sns.__version__))
import statsmodels as sm
print("statsmodels: {}".format(sm.__version__))

numpy version: 1.17.4
pandas version: 0.25.3
matplotlib version: 3.1.2
scipy version: 1.3.3
scikit-learn: 0.21.3
seaborn: 0.9.0
statsmodels: 0.10.2


## Categorical Data

### Background and Motivation

In [2]:
values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

In [3]:
values

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object

In [4]:
pd.unique(values)

array(['apple', 'orange'], dtype=object)

In [5]:
pd.value_counts(values)

apple     6
orange    2
dtype: int64

A column like this contains many repeated values. In data warehousing, a best practice is to use so-called *dimension tables* that contain the distinct values, and to store the primary observations as integer keys referencing the dimension table:

In [6]:
values = pd.Series([0, 1, 0, 0] * 2)

In [7]:
dim = pd.Series(['apple', 'orange'])

In [8]:
values

0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64

In [9]:
dim

0     apple
1    orange
dtype: object

We can use the ```take``` method to restore the original Series of strings:

In [10]:
dim.take(values)

0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object

The representation as integers is called the *categorical* or *dictionary-encoded* representation. The array of distinct values can be called the *categories*, *dictionary*, or *levels* of the data. In this book we will use the terms *categorical* and *categories*. The integer values that reference the categories are called the *category codes* or simply *codes*.

The categorical representation can yield significant performance improvements when you are doing analytics. You can also perform transformations on the categories while leaving the codes unmodified, for example:

- Renaming categories
- Appending a new category without changing the order or position of the existing categories
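
Both transformations can be sketched with the ```dim``` and ```values``` Series from above. This is only an illustration of the idea: the names ```dim_renamed``` and ```dim_extended``` and the extra label ```'pear'``` are made up for the example. In each case only the dimension table changes, and the integer codes are reused as they are.

```python
import pandas as pd

values = pd.Series([0, 1, 0, 0] * 2)      # integer codes, as above
dim = pd.Series(['apple', 'orange'])      # dimension table, as above

# Renaming: change the labels in the dimension table only.
dim_renamed = dim.str.upper()

# Appending: add a new label at the end, so positions 0 and 1 keep their meaning.
dim_extended = pd.concat([dim, pd.Series(['pear'])], ignore_index=True)

# The same codes decode against either table.
print(dim_renamed.take(values.values).tolist())
print(dim_extended.take(values.values).tolist())
```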

### Categorical Type in pandas

```pandas``` has a special ```Categorical``` type for holding data that uses the integer-based categorical representation or *encoding*.

In [12]:
fruits = ['apple', 'orange', 'apple', 'apple'] * 2

In [13]:
N = len(fruits)

In [14]:
df = pd.DataFrame({'fruit': fruits,
                   'basket_id': np.arange(N),
                   'count': np.random.randint(3, 15, size=N),
                   'weight': np.random.uniform(0, 4, size=N)},
                   columns=['basket_id', 'fruit', 'count', 'weight'])

In [15]:
df

   basket_id   fruit  count    weight
0          0   apple      4  1.304924
1          1  orange      3  3.256697
2          2   apple     13  3.697772
3          3   apple      5  0.102151
4          4   apple      8  3.527459
5          5  orange      4  0.789483
6          6   apple      5  0.342332
7          7   apple      6  2.140036


Here ```df['fruit']``` is an array of Python string objects. We can convert it to **categorical** by calling ```astype('category')```:

In [16]:
fruit_cat = df['fruit'].astype('category')

In [17]:
fruit_cat

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]

In [18]:
type(fruit_cat.values)

pandas.core.arrays.categorical.Categorical

In [19]:
fruit_cat.values.categories

Index(['apple', 'orange'], dtype='object')

In [20]:
fruit_cat.values.codes

array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)

Convert a DataFrame column to categorical by assigning the converted result:

In [21]:
df['fruit'] = df['fruit'].astype('category')

In [22]:
df.fruit

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]

In [23]:
my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])

In [24]:
my_categories

[foo, bar, baz, foo, bar]
Categories (3, object): [bar, baz, foo]

In [25]:
categories = ['foo', 'bar', 'baz']

In [26]:
codes = [0, 1, 2, 0, 0, 1]

In [27]:
my_cats_2 = pd.Categorical.from_codes(codes, categories)

In [28]:
my_cats_2

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo, bar, baz]

In [29]:
ordered_cat = pd.Categorical.from_codes(codes, categories, ordered=True)

In [30]:
ordered_cat

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]

In [31]:
my_cats_2.as_ordered()

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]

Categorical data need not be strings. A categorical array can consist of any immutable value type.
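
As a small illustration of a non-string categorical, the sketch below stores integers. The variable name ```ratings``` and its values are made up for the example.

```python
import pandas as pd

# Integer observations stored as a categorical.
ratings = pd.Categorical([3, 1, 2, 3, 3, 1])

# The categories are the distinct integers, sorted.
print(ratings.categories.tolist())   # [1, 2, 3]

# The codes are positions into the categories.
print(ratings.codes.tolist())        # [2, 0, 1, 2, 2, 0]
```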

### Computations with Categoricals

In [32]:
np.random.seed(12345)

In [33]:
draws = np.random.randn(1000)

In [34]:
draws[:5]

array([-0.20470766,  0.47894334, -0.51943872, -0.5557303 ,  1.96578057])

In [35]:
bins = pd.qcut(draws, 4)

In [36]:
bins

[(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.9499999999999997, -0.684], (-0.0101, 0.63], (0.63, 3.928]]
Length: 1000
Categories (4, interval[float64]): [(-2.9499999999999997, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] < (0.63, 3.928]]

In [37]:
bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

In [38]:
bins

[Q2, Q3, Q2, Q2, Q4, ..., Q3, Q2, Q1, Q3, Q4]
Length: 1000
Categories (4, object): [Q1 < Q2 < Q3 < Q4]

The labeled bins categorical does not contain information about the bin edges in the
data, so we can use ```groupby``` to extract some summary statistics:

In [39]:
bins = pd.Series(bins, name='quartile')

In [40]:
results = (pd.Series(draws)
          .groupby(bins)
          .agg(['count', 'min', 'max'])
          .reset_index())

In [41]:
results

  quartile  count       min       max
0       Q1    250 -2.949343 -0.685484
1       Q2    250 -0.683066 -0.010115
2       Q3    250 -0.010032  0.628894
3       Q4    250  0.634238  3.927528


In [44]:
pd.Series(draws).groupby(bins).agg(['count', 'min', 'max'])

          count       min       max
quartile
Q1          250 -2.949343 -0.685484
Q2          250 -0.683066 -0.010115
Q3          250 -0.010032  0.628894
Q4          250  0.634238  3.927528


In [45]:
results['quartile']

0    Q1
1    Q2
2    Q3
3    Q4
Name: quartile, dtype: category
Categories (4, object): [Q1 < Q2 < Q3 < Q4]

#### Better performance with categoricals

In [46]:
N = 10_000_000

In [47]:
draws = pd.Series(np.random.randn(N))

In [48]:
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))

In [49]:
categories = labels.astype('category')

In [50]:
labels.memory_usage()

80000128

In [51]:
categories.memory_usage()

10000320

In [52]:
%time _ = labels.astype('category')

CPU times: user 373 ms, sys: 67.7 ms, total: 440 ms
Wall time: 438 ms
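
The memory figures above are consistent with one 8-byte pointer per row for the string labels (not counting the strings themselves), against a single ```int8``` code per row plus a small categories array for the categorical. The sketch below repeats the comparison at a smaller size so it runs quickly. The size ```n``` and the variable names are chosen for the example, and ```deep=True``` is used so that the string objects are counted as well.

```python
import pandas as pd

n = 100_000  # much smaller than the 10 million rows above, to keep this quick
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (n // 4))
cats = labels.astype('category')

# deep=True also counts the string data that the labels column refers to.
label_bytes = labels.memory_usage(deep=True)
category_bytes = cats.memory_usage(deep=True)

print(cats.cat.codes.dtype)          # integer dtype used for the codes
print(label_bytes, category_bytes)   # exact numbers depend on the pandas version
```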


### Categorical Methods

Series containing categorical data have several special methods similar to the ```Series.str``` 
specialized string methods. This also provides convenient access to the categories and 
codes. Consider the Series:

In [53]:
s = pd.Series(['a', 'b', 'c', 'd'] * 2)

In [54]:
cat_s = s.astype('category')

In [55]:
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): [a, b, c, d]

The special attribute ```cat``` provides access to categorical methods:

In [56]:
cat_s.cat.codes

0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8

In [57]:
cat_s.cat.categories

Index(['a', 'b', 'c', 'd'], dtype='object')

To change the set of categories, use the ```set_categories``` method:

In [58]:
actual_categories = ['a', 'b', 'c', 'd', 'e']

In [59]:
cat_s2 = cat_s.cat.set_categories(actual_categories)

In [60]:
cat_s2

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): [a, b, c, d, e]

In [61]:
cat_s.value_counts()

d    2
c    2
b    2
a    2
dtype: int64

In [62]:
cat_s2.value_counts()

d    2
c    2
b    2
a    2
e    0
dtype: int64

In [63]:
cat_s3 = cat_s[cat_s.isin(['a', 'b'])]

In [64]:
cat_s3

0    a
1    b
4    a
5    b
dtype: category
Categories (4, object): [a, b, c, d]

In [65]:
cat_s3.cat.remove_unused_categories()

0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): [a, b]

#### Creating dummy variables for modeling

When you’re using statistics or machine learning tools, you’ll often transform categorical data into *dummy variables*, also known as **one-hot encoding**. This involves creating a DataFrame with a column for each distinct category; these columns contain 1s
for occurrences of a given category and 0 otherwise.

In [66]:
cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

In [68]:
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): [a, b, c, d]

In [67]:
pd.get_dummies(cat_s)

   a  b  c  d
0  1  0  0  0
1  0  1  0  0
2  0  0  1  0
3  0  0  0  1
4  1  0  0  0
5  0  1  0  0
6  0  0  1  0
7  0  0  0  1
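
In a modeling workflow the indicator columns are usually combined with the other features. The sketch below assumes a small made-up DataFrame and uses the ```prefix``` argument of ```get_dummies``` so the new column names show which column they came from. The dtype of the indicator columns depends on the pandas version.

```python
import pandas as pd

# Hypothetical data with one categorical and one numeric column.
df = pd.DataFrame({'fruit': pd.Categorical(['apple', 'orange', 'apple']),
                   'count': [4, 3, 13]})

# One indicator column per category, named fruit_<category>.
dummies = pd.get_dummies(df['fruit'], prefix='fruit')

# Join the indicators to the numeric feature on the shared index.
model_input = df[['count']].join(dummies)
print(model_input.columns.tolist())   # ['count', 'fruit_apple', 'fruit_orange']
```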


## Advanced GroupBy Use

### Group Transforms and "Unwrapped" GroupBys

There is another built-in method called ```transform``` , which is similar to ```apply``` but imposes more constraints on the kind of function you can use:

- It can produce a scalar value to be broadcast to the shape of the group
- It can produce an object of the same shape as the input group
- It must not mutate its input

In [69]:
df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4,
                    'value': np.arange(12.)})

In [70]:
df

  key  value
0   a    0.0
1   b    1.0
2   c    2.0
3   a    3.0
4   b    4.0
5   c    5.0
6   a    6.0
7   b    7.0
8   c    8.0
9   a    9.0


In [71]:
g = df.groupby('key').value

In [72]:
g.mean()

key
a    4.5
b    5.5
c    6.5
Name: value, dtype: float64

Suppose instead we wanted to produce a Series of the same shape as ```df['value']``` but
with values replaced by the average grouped by 'key'. We can pass the function
```lambda x: x.mean()``` to ```transform``` :

In [73]:
g.transform(lambda x: x.mean())

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

In [74]:
g.transform('mean')

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

Like ```apply```, ```transform``` works with functions that return ```Series```, but the result must be the same size as the input. For example, we can multiply each group by 2 using a ```lambda``` function:

In [75]:
g.transform(lambda x: x * 2)

0      0.0
1      2.0
2      4.0
3      6.0
4      8.0
5     10.0
6     12.0
7     14.0
8     16.0
9     18.0
10    20.0
11    22.0
Name: value, dtype: float64

In [76]:
g.transform(lambda x: x.rank(ascending=False))

0     4.0
1     4.0
2     4.0
3     3.0
4     3.0
5     3.0
6     2.0
7     2.0
8     2.0
9     1.0
10    1.0
11    1.0
Name: value, dtype: float64

In [77]:
def normalize(x):
    return (x - x.mean()) / x.std()

In [78]:
g.transform(normalize)

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64

In [79]:
g.apply(normalize)

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64

Built-in aggregate functions like ```'mean'``` or ```'sum'``` are often much faster than a general ```apply``` function. These also have a “fast path” when used with ```transform```. This allows us to perform a so-called *unwrapped* group operation:

In [80]:
g.transform('mean')

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

In [81]:
normalized = (df['value'] - g.transform('mean')) / g.transform('std')

In [82]:
normalized

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64
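
As a quick check that the unwrapped form computes the same thing as passing ```normalize``` to ```transform```, the two results can be compared directly. This sketch rebuilds the small example frame so that it runs on its own; the names ```via_function``` and ```unwrapped``` are chosen for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4,
                   'value': np.arange(12.)})
g = df.groupby('key')['value']

def normalize(x):
    return (x - x.mean()) / x.std()

# One Python function call per group.
via_function = g.transform(normalize)

# Two built-in group aggregations broadcast back to the original shape,
# followed by ordinary vectorized arithmetic.
unwrapped = (df['value'] - g.transform('mean')) / g.transform('std')

print(np.allclose(via_function, unwrapped))   # True
```

The unwrapped version performs several group aggregations, but each one takes the built-in path, which is why it can still come out ahead on large data.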

### Grouped Time Resampling

For time series, the ```resample``` method is semantically a group operation based on a time intervalization.

In [84]:
N = 15

In [85]:
times = pd.date_range('2017-05-20 00:00', freq='1min', periods=N)

In [87]:
df = pd.DataFrame({'time': times,
                   'value': np.arange(N)})

In [88]:
df

                 time  value
0 2017-05-20 00:00:00      0
1 2017-05-20 00:01:00      1
2 2017-05-20 00:02:00      2
3 2017-05-20 00:03:00      3
4 2017-05-20 00:04:00      4
5 2017-05-20 00:05:00      5
6 2017-05-20 00:06:00      6
7 2017-05-20 00:07:00      7
8 2017-05-20 00:08:00      8
9 2017-05-20 00:09:00      9


Here, we can index by ```'time'``` and then resample:

In [89]:
df.set_index('time').resample('5min').count()

                     value
time
2017-05-20 00:00:00      5
2017-05-20 00:05:00      5
2017-05-20 00:10:00      5


In [92]:
df.set_index('time').resample('30s').ffill()

                     value
time
2017-05-20 00:00:00      0
2017-05-20 00:00:30      0
2017-05-20 00:01:00      1
2017-05-20 00:01:30      1
2017-05-20 00:02:00      2
2017-05-20 00:02:30      2
2017-05-20 00:03:00      3
2017-05-20 00:03:30      3
2017-05-20 00:04:00      4
2017-05-20 00:04:30      4


Suppose that a ```DataFrame``` contains multiple time series, marked by an additional
group key column:

In [93]:
df2 = pd.DataFrame({'time': times.repeat(3),
                    'key': np.tile(['a', 'b', 'c'], N),
                    'value': np.arange(N * 3.)})

In [94]:
df2[:7]

                 time key  value
0 2017-05-20 00:00:00   a    0.0
1 2017-05-20 00:00:00   b    1.0
2 2017-05-20 00:00:00   c    2.0
3 2017-05-20 00:01:00   a    3.0
4 2017-05-20 00:01:00   b    4.0
5 2017-05-20 00:01:00   c    5.0
6 2017-05-20 00:02:00   a    6.0
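
One way to resample each key's series separately is to group by the key column together with a ```pd.Grouper``` object that describes the time bins. The following is a minimal sketch of that approach rather than the only option. It rebuilds ```df2``` so that it runs on its own, and the names ```time_key``` and ```resampled``` are chosen for the example.

```python
import numpy as np
import pandas as pd

N = 15
times = pd.date_range('2017-05-20 00:00', freq='1min', periods=N)
df2 = pd.DataFrame({'time': times.repeat(3),
                    'key': np.tile(['a', 'b', 'c'], N),
                    'value': np.arange(N * 3.)})

# pd.Grouper with freq bins on the index, so the time column is made the index first.
time_key = pd.Grouper(freq='5min')

resampled = (df2.set_index('time')
             .groupby(['key', time_key])
             .sum())
print(resampled)
```

The result has one row for each combination of key and 5-minute bin, indexed by ```key``` and ```time```.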


## Techniques for Method Chaining

When applying a sequence of transformations to a dataset, you may find yourself creating
numerous temporary variables that are never used in your analysis. Consider
this example, for instance:

```python
df = load_data()
df2 = df[df['col2'] < 0]
df2['col1_demeaned'] = df2['col1'] - df2['col1'].mean()
result = df2.groupby('key').col1_demeaned.std()
```

While we’re not using any real data here, this example highlights some new methods.
First, the ```DataFrame.assign``` method is a functional alternative to column assignments of the form ```df[k] = v```. Rather than modifying the object in-place, it returns a
new DataFrame with the indicated modifications. So these statements are equivalent:

```python
# Usual non-functional way
df2 = df.copy()
df2['k'] = v
# Functional assign way
df2 = df.assign(k=v)
```
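
A runnable version of the same comparison is sketched below. The frame ```df``` and the values ```v``` are made up for the example. It also shows that ```assign``` leaves the original object untouched.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})   # hypothetical data
v = [10, 20, 30]                      # hypothetical new column values

# In-place style: copy first so the original stays untouched.
df2 = df.copy()
df2['k'] = v

# Functional style: assign returns a new DataFrame.
df3 = df.assign(k=v)

print(df3.equals(df2))      # True: both routes give the same frame
print('k' in df.columns)    # False: the original has no new column
```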

Assigning in-place may execute faster than using ```assign```, but assign enables easier
method chaining:

```python
result = (df2.assign(col1_demeaned=df2.col1 - df2.col1.mean())
          .groupby('key')
          .col1_demeaned.std())
```

One thing to keep in mind when doing method chaining is that you may need to
refer to temporary objects. In the preceding example, we cannot refer to the result of
```load_data``` until it has been assigned to the temporary variable ```df```. To help with this,
```assign``` and many other pandas functions accept function-like arguments, also known
as *callables*.

To show callables in action, consider a fragment of the example from before:
```python
df = load_data()
df2 = df[df['col2'] < 0]
```
This can be rewritten as:
```python
df = (load_data()
      [lambda x: x['col2'] < 0])
```
Here, the result of load_data is not assigned to a variable, so the function passed into
[] is then bound to the object at that stage of the method chain.

We can continue, then, and write the entire sequence as a single chained expression:
```python
result = (load_data()
          [lambda x: x.col2 < 0]
          .assign(col1_demeaned=lambda x: x.col1 - x.col1.mean())
          .groupby('key')
          .col1_demeaned.std())
```
Whether you prefer to write code in this style is a matter of taste, and splitting up the
expression into multiple steps may make your code more readable.
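
To try both styles side by side, the sketch below supplies a hypothetical ```load_data``` that returns a small made-up frame, then checks that the step-by-step and chained versions agree. The data and the names ```stepwise``` and ```chained``` are assumptions for the example only.

```python
import pandas as pd

def load_data():
    # Hypothetical stand-in for the load_data used in the text.
    return pd.DataFrame({'key': ['a', 'a', 'b', 'b', 'a', 'b'],
                         'col1': [1.0, 2.0, 3.0, 5.0, 4.0, 9.0],
                         'col2': [-1, -2, -3, -4, 5, -6]})

# Step-by-step version with temporary variables.
df = load_data()
df2 = df[df['col2'] < 0].copy()   # copy so the column assignment is unambiguous
df2['col1_demeaned'] = df2['col1'] - df2['col1'].mean()
stepwise = df2.groupby('key')['col1_demeaned'].std()

# Chained version: each callable receives the frame as it exists at that stage.
chained = (load_data()
           [lambda x: x['col2'] < 0]
           .assign(col1_demeaned=lambda x: x['col1'] - x['col1'].mean())
           .groupby('key')['col1_demeaned']
           .std())

pd.testing.assert_series_equal(chained, stepwise)
print(chained)
```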

### The pipe Method

Consider a sequence of function calls:
```python
a = f(df, arg1=v1)
b = g(a, v2, arg3=v3)
c = h(b, arg4=v4)
```
When using functions that accept and return Series or DataFrame objects, you can
rewrite this using calls to ```pipe```:
```python
result = (df.pipe(f, arg1=v1)
            .pipe(g, v2, arg3=v3)
            .pipe(h, arg4=v4))
```
The statements ```f(df)``` and ```df.pipe(f)``` are equivalent, but ```pipe``` makes chained invocation easier.
A potentially useful pattern for ```pipe``` is to generalize sequences of operations into
reusable functions. As an example, let’s consider subtracting group means from a
column:
```python
g = df.groupby(['key1', 'key2'])
# Select the column before transforming so the result is a Series aligned with df['col1']
df['col1'] = df['col1'] - g['col1'].transform('mean')
```
Suppose that you wanted to be able to demean more than one column and easily
change the group keys. Additionally, you might want to perform this transformation
in a method chain. Here is an example implementation:
```python
def group_demean(df, by, cols):
    result = df.copy()
    g = df.groupby(by)
    for c in cols:
        result[c] = df[c] - g[c].transform('mean')
    return result
```
Then it is possible to write:
```python
result = (df[df.col1 < 0]
          .pipe(group_demean, ['key1', 'key2'], ['col1']))
```