# Categorical Data

Tools for achieving better performance and memory use in pandas operations.

Using categorical data in statistics and machine learning applications.

In [4]:
import numpy as np
import pandas as pd
values = pd.Series(["apple", "orange", "orange", "apple", "apple"] * 2)
values

0     apple
1    orange
2    orange
3     apple
4     apple
5     apple
6    orange
7    orange
8     apple
9     apple
dtype: object

In [5]:
pd.unique(values)

array(['apple', 'orange'], dtype=object)

In [6]:
values.value_counts()

apple     6
orange    4
dtype: int64

A more efficient way of storing this data is to use a "dimension table" containing the distinct values and to store the primary observations as integer keys referencing the dimension table:

In [7]:
values = pd.Series([0,1,0,0] * 2)

In [8]:
dim = pd.Series(["apple", "orange"]) #categories

In [9]:
values    #Codes

0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64

In [10]:
# we can use "take" method to restore the original Series of strings:
dim.take(values)

0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object

This representation as integers is called the categorical or dictionary-encoded representation.
The array of distinct values can be called the categories, dictionary, or levels of the data.

Here we use the terms categorical and categories.
The integer values that reference the categories are called category codes, or simply codes.
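
As a minimal sketch of the dictionary-encoded idea, `pd.factorize` computes the codes and the distinct values in one step, and `take` rebuilds the original strings. The variable names are illustrative only.

```python
import pandas as pd

values = pd.Series(["apple", "orange", "orange", "apple", "apple"] * 2)

# factorize returns integer codes plus the distinct values,
# in order of first appearance
codes, categories = pd.factorize(values)

# take() looks up each code in the categories, restoring the strings
restored = pd.Series(categories.take(codes))
```

Only the small `categories` array holds strings; the observations themselves are stored as integers.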


The categorical representation can yield significant performance improvements in analytics.
We can also transform the categories while leaving the codes unmodified.
Transformations that can be made at relatively low cost include:

- Renaming categories
- Appending a new category without changing the order or position of the existing categories
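
The two cheap transformations listed above can be sketched as follows. This is a small illustration on made-up data; it checks that the codes array is the same before and after each change.

```python
import pandas as pd

c = pd.Categorical(["apple", "orange", "apple", "apple"])

# Renaming touches only the categories, not the per-row codes
renamed = c.rename_categories({"apple": "APPLE", "orange": "ORANGE"})

# Appending a category at the end leaves existing positions unchanged
extended = c.add_categories(["pear"])

codes_unchanged = (
    (renamed.codes == c.codes).all() and (extended.codes == c.codes).all()
)
```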

In [11]:
fruits = ["apple", "orange", "apple", "apple"] * 2
N = len(fruits)

In [12]:
df = pd.DataFrame({"fruit" : fruits,
                    "basket_id" : np.arange(N),
                    "count" : np.random.randint(3,15,size = N),
                    "weight" : np.random.uniform(0,4,size = N)},
                    columns = ["basket_id","fruit","count","weight"])

In [13]:
df

Unnamed: 0,basket_id,fruit,count,weight
0,0,apple,7,1.103953
1,1,orange,12,3.704648
2,2,apple,14,0.025675
3,3,apple,8,3.328257
4,4,apple,14,3.712218
5,5,orange,6,3.601671
6,6,apple,13,3.932001
7,7,apple,10,0.250948


In [14]:
df["fruit"]

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: object

In [15]:
fruit_cat = df["fruit"].astype("category")

In [16]:
fruit_cat 

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]

# The values of fruit_cat are not a NumPy array, but an instance of pandas.Categorical:

In [17]:
c = fruit_cat.values

In [18]:
type(c)

pandas.core.arrays.categorical.Categorical

In [19]:
c.categories

Index(['apple', 'orange'], dtype='object')

In [20]:
c.codes

array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)

In [21]:
#a column can be converted to categorical by assigning the converted result
df["fruit"] = df["fruit"].astype("category")

In [22]:
df.fruit

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]

In [23]:
my_categories = pd.Categorical (["foo", "bar", "baz", "foo", "bar"])
my_categories

[foo, bar, baz, foo, bar]
Categories (3, object): [bar, baz, foo]

In [24]:
categories = ["foo", "bar", "baz"]
codes = [0,1,2,0,0,1]

my_cats_2 = pd.Categorical.from_codes(codes, categories)
my_cats_2

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo, bar, baz]

Unless explicitly specified, categorical conversions assume no specific ordering of the categories.
An ordering can be requested with ordered = True, or added afterwards with as_ordered:

In [25]:
ordered_cat = pd.Categorical.from_codes(codes, categories, ordered = True)
ordered_cat

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]

In [26]:
my_cats_2.as_ordered()


[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]
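
To illustrate what the ordering is for, here is a small sketch with a hypothetical `sizes` categorical. With `ordered=True`, operations such as `min`, `max` and comparisons follow the category order rather than the alphabetical order.

```python
import pandas as pd

sizes = pd.Categorical(
    ["small", "large", "medium", "small"],
    categories=["small", "medium", "large"],
    ordered=True,
)

smallest = sizes.min()             # follows the category order
largest = sizes.max()
bigger_than_small = sizes > "small"  # element-wise boolean array
```

An unordered categorical does not support these order-based comparisons.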

# Computations with Categorical

In [27]:
np.random.seed(12345)
draws = np.random.randn(1000)

In [28]:
draws[:5]

array([-0.20470766,  0.47894334, -0.51943872, -0.5557303 ,  1.96578057])

In [29]:
#Let's compute quartile binning of this data and extract some statistics
bins = pd.qcut(draws, 4)
bins

[(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.9499999999999997, -0.684], (-0.0101, 0.63], (0.63, 3.928]]
Length: 1000
Categories (4, interval[float64]): [(-2.9499999999999997, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] < (0.63, 3.928]]

In [30]:
bins = pd.qcut(draws, 4, labels = ["Q1", "Q2", "Q3", "Q4"])
bins

[Q2, Q3, Q2, Q2, Q4, ..., Q3, Q2, Q1, Q3, Q4]
Length: 1000
Categories (4, object): [Q1 < Q2 < Q3 < Q4]

In [31]:
bins.codes[:10]

array([1, 2, 1, 1, 3, 3, 2, 2, 3, 3], dtype=int8)

In [32]:
#The labeled bins categorical does not contain information about the bin edges in the data,
#so we can use groupby to extract summary statistics

bins = pd.Series(bins, name = "quartiles")

result = (pd.Series(draws)
          .groupby(bins)
          .agg(["count", "min", "max"])
          .reset_index())

In [33]:
result

Unnamed: 0,quartiles,count,min,max
0,Q1,250,-2.949343,-0.685484
1,Q2,250,-0.683066,-0.010115
2,Q3,250,-0.010032,0.628894
3,Q4,250,0.634238,3.927528


In [34]:
result["quartiles"]

0    Q1
1    Q2
2    Q3
3    Q4
Name: quartiles, dtype: category
Categories (4, object): [Q1 < Q2 < Q3 < Q4]

In [35]:
N = 10000000

draws = pd.Series(np.random.randn(N))

In [36]:
labels = pd.Series(["foo", "bar", "baz", "qux"] * (N//4))

In [37]:
# we convert labels to categorical, as labels uses significantly more memory than categories:
categories = labels.astype("category")

In [38]:
labels.memory_usage()

80000128

In [39]:
categories.memory_usage()

10000320

In [40]:
%time  _ = labels.astype("category")

Wall time: 1.04 s


GroupBy operations can be significantly faster with categoricals because the underlying algorithms use the integer-based codes array instead of an array of strings.
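
The sketch below uses a smaller N than the cells above and compares the string labels with their categorical version. It checks memory use and confirms that grouping by either gives the same means. Timings depend on the machine and pandas version, so none are shown; `%timeit` on the two `groupby` lines is one way to measure them.

```python
import numpy as np
import pandas as pd

N = 100_000
rng = np.random.default_rng(0)
draws = pd.Series(rng.standard_normal(N))
labels = pd.Series(["foo", "bar", "baz", "qux"] * (N // 4))
categories = labels.astype("category")

# deep=True counts the bytes used by the strings themselves
label_bytes = labels.memory_usage(deep=True)
category_bytes = categories.memory_usage(deep=True)

# Same grouping, once by strings and once by integer-coded categories
by_label = draws.groupby(labels).mean()
by_category = draws.groupby(categories, observed=True).mean()
```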

In [41]:
s = pd.Series(["a", "b", "c", "d"] * 2)

In [42]:
cat_s = s.astype("category")
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): [a, b, c, d]

In [43]:
cat_s.cat.codes

0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8

In [44]:
cat_s.cat.categories

Index(['a', 'b', 'c', 'd'], dtype='object')

In [45]:
actual_categories = ["a", "b", "c","d", "e"]

In [46]:
cat_s2 = cat_s.cat.set_categories(actual_categories)
cat_s2

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): [a, b, c, d, e]

In [47]:
cat_s.value_counts()

d    2
c    2
b    2
a    2
dtype: int64

In [48]:
cat_s2.value_counts()

d    2
c    2
b    2
a    2
e    0
dtype: int64

In [49]:
draws = np.random.randn(2000)
bins = pd.qcut(draws, 4, labels = ["Q1", "Q2", "Q3", "Q4"] )
bins.codes


array([1, 3, 1, ..., 3, 1, 3], dtype=int8)

In [50]:
bins = pd.Series(bins, name = "quartile")

results = (pd.Series(draws)
          .groupby(bins)
          .agg(["count", "min", "max"])
          .reset_index())
results

Unnamed: 0,quartile,count,min,max
0,Q1,500,-4.414263,-0.651922
1,Q2,500,-0.647185,-0.008258
2,Q3,500,-0.007074,0.654507
3,Q4,500,0.654999,3.498391


In [51]:
results["quartile"]

0    Q1
1    Q2
2    Q3
3    Q4
Name: quartile, dtype: category
Categories (4, object): [Q1 < Q2 < Q3 < Q4]

In [52]:
s = pd.Series(["a", "b", "c", "d"] * 2)
cat_s = s.astype("category")
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): [a, b, c, d]

In [53]:
cat_s.cat.codes

0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8

In [54]:
cat_s.cat.categories           # the "cat" accessor provides access to categorical methods

Index(['a', 'b', 'c', 'd'], dtype='object')

In [55]:
actual_categories = ["a", "b", "c", "d", "e"]

In [56]:
cat_s2 = cat_s2.cat.set_categories(actual_categories)
cat_s2

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): [a, b, c, d, e]

In [57]:
cat_s.value_counts()

d    2
c    2
b    2
a    2
dtype: int64

In [58]:
pd.value_counts(cat_s2)

d    2
c    2
b    2
a    2
e    0
dtype: int64

In [59]:
cat_s2.value_counts()

d    2
c    2
b    2
a    2
e    0
dtype: int64

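In large datasets, categoricals are often used as a convenient tool for memory savings and better performance. After filtering, some categories may no longer appear in the data, and remove_unused_categories trims them: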

In [60]:
cat_s3 = cat_s[cat_s.isin(["a", "b"])]
cat_s3

0    a
1    b
4    a
5    b
dtype: category
Categories (4, object): [a, b, c, d]

In [61]:
cat_s3.cat.remove_unused_categories()

0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): [a, b]

Methods:

1. add_categories

2. as_ordered

3. as_unordered

4. remove_categories

5. remove_unused_categories

6. rename_categories

7. reorder_categories

8. set_categories
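
A few of the methods listed above, sketched on a small made-up Series. All of them are reached through the `cat` accessor and return a new object.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# add_categories: append a category that does not occur in the data yet
added = s.cat.add_categories(["d"])

# remove_categories: values of a removed category become missing (NaN)
removed = s.cat.remove_categories(["c"])

# reorder_categories: same categories, new order, optionally ordered
reordered = s.cat.reorder_categories(["c", "b", "a"], ordered=True)
```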

# Creating dummy variables for modeling

When using statistics or machine learning tools, we often transform categorical data into dummy variables, also known as one-hot encoding.

This involves creating a DataFrame with a column for each distinct category; these columns contain 1s for occurrences of a given category and 0s otherwise.

In [62]:
cat_s = pd.Series(["a", "b", "c", "d"] * 2, dtype = "category")

In [63]:
pd.get_dummies(cat_s)   #one hot encoding of cat_s

Unnamed: 0,a,b,c,d
0,1,0,0,0
1,0,1,0,0
2,0,0,1,0
3,0,0,0,1
4,1,0,0,0
5,0,1,0,0
6,0,0,1,0
7,0,0,0,1
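
As a hedged sketch of how the dummy columns are typically attached to other data, the example below uses a made-up DataFrame with `key` and `data` columns. To my knowledge, pandas 2.0 and later return boolean columns from `get_dummies` by default, so `dtype=int` is passed here to get the 1/0 values shown above.

```python
import pandas as pd

df = pd.DataFrame({"key": ["b", "b", "a", "c"], "data": range(4)})

# prefix labels the new columns; dtype=int gives 1/0 instead of True/False
dummies = pd.get_dummies(df["key"], prefix="key", dtype=int)

# join the indicator columns back onto the remaining data
joined = df[["data"]].join(dummies)
```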


# Advanced GroupBy Use:
Group Transformers and "Unwrapped" GroupBy

We have looked at the "apply" method in GroupBy operations for performing transformations.
There is another built-in method called "transform".

"transform" is similar to "apply" but imposes more constraints on the kind of function that we can use:

(a) It can produce a scalar value to be broadcast to the shape of the group

(b) It can produce an object of the same shape as the input group

(c) It must not mutate its input

In [64]:
df = pd.DataFrame({"keys" : ["a", "b", "c"] * 4,
                 "values" : np.arange(12.)})

In [65]:
df

Unnamed: 0,keys,values
0,a,0.0
1,b,1.0
2,c,2.0
3,a,3.0
4,b,4.0
5,c,5.0
6,a,6.0
7,b,7.0
8,c,8.0
9,a,9.0


In [66]:
g = df.groupby("keys").values
g

<pandas.core.groupby.generic.SeriesGroupBy object at 0x0000013505FB4948>

In [67]:
g.mean()

keys
a    4.5
b    5.5
c    6.5
Name: values, dtype: float64

In [68]:
df["mean"]=g.transform(lambda x : x.mean()) # transform broadcasts each group's scalar mean to the shape of the group

In [69]:
df   # each group's mean has been broadcast to the rows of that group

Unnamed: 0,keys,values,mean
0,a,0.0,4.5
1,b,1.0,5.5
2,c,2.0,6.5
3,a,3.0,4.5
4,b,4.0,5.5
5,c,5.0,6.5
6,a,6.0,4.5
7,b,7.0,5.5
8,c,8.0,6.5
9,a,9.0,4.5


In [70]:
g.transform("median")

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: values, dtype: float64

In [71]:
g.transform(lambda x : x*2)

0      0.0
1      2.0
2      4.0
3      6.0
4      8.0
5     10.0
6     12.0
7     14.0
8     16.0
9     18.0
10    20.0
11    22.0
Name: values, dtype: float64

In [72]:
df.groupby("keys").values.transform(lambda x: x.rank(ascending = False))

0     4.0
1     4.0
2     4.0
3     3.0
4     3.0
5     3.0
6     2.0
7     2.0
8     2.0
9     1.0
10    1.0
11    1.0
Name: values, dtype: float64

In [73]:
def normalize(x):
    return (x-x.mean())/x.std()

In [74]:
g.transform(normalize)

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: values, dtype: float64

In [75]:
g.apply(normalize)

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: values, dtype: float64

Built-in aggregate functions like "mean" or "sum" are often faster than a general apply function.
They also have a "fast path" when used with transform. This allows us to perform a so-called unwrapped group operation:

In [76]:
g.transform("mean")

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: values, dtype: float64

In [77]:
normalized = (df["values"] - g.transform("mean"))/g.transform("std")

In [78]:
normalized

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: values, dtype: float64

While an unwrapped group operation may involve multiple group aggregations, the overall benefit of vectorized operations often outweighs this.

# ... Vectorized operation:
df["ratio"] = 100 * (df["x"] / df["y"])

# ... Non-vectorized operation:
def calc_ratio(row):
    return 100 * (row["x"] / row["y"])

df["ratio2"] = df.apply(calc_ratio, axis=1)


Vectorized:     0.0043 secs

Non-vectorized: 5.6435 secs
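
The comparison above can be reproduced with a self-contained sketch like the following. The columns `x` and `y` are random data made up for the illustration, and the measured times will differ from the figures quoted above depending on the machine and the number of rows.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.random(10_000), "y": rng.random(10_000) + 1.0})

# Vectorized: one operation over whole columns
start = time.perf_counter()
vectorized = 100 * (df["x"] / df["y"])
vectorized_secs = time.perf_counter() - start

# Non-vectorized: a Python function called once per row
def calc_ratio(row):
    return 100 * (row["x"] / row["y"])

start = time.perf_counter()
row_wise = df.apply(calc_ratio, axis=1)
row_wise_secs = time.perf_counter() - start

print(f"Vectorized:     {vectorized_secs:.4f} secs")
print(f"Non-vectorized: {row_wise_secs:.4f} secs")
```

Both approaches compute the same values; only the cost differs.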


# Grouped Time Resampling

In [79]:
N = 15
times = pd.date_range("2017-05-20 00:00", freq = "1min", periods = N)


In [80]:
df = pd.DataFrame({"time" : times,"value" : np.random.randn(N)})
df.set_index("time").resample("5min").mean()

Unnamed: 0_level_0,value
time,Unnamed: 1_level_1
2017-05-20 00:00:00,-0.137651
2017-05-20 00:05:00,-0.462134
2017-05-20 00:10:00,-0.240776


In [81]:
df2 = pd.DataFrame({"time" : times.repeat(3),
             "key" : np.tile(["a", "b", "c"], N),
             "value" : np.arange(N * 3.)})

#numpy.tile(A, reps)
#Construct an array by repeating A the number of times given by reps; here reps is N.
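
A short sketch of the difference between tiling and repeating, which is why `times.repeat(3)` and `np.tile([...], N)` line up row by row in the cell above:

```python
import numpy as np

# tile repeats the whole sequence
tiled = np.tile(["a", "b", "c"], 2)

# repeat repeats each element in place
repeated = np.repeat(["a", "b", "c"], 2)
```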

In [82]:
# pd.Grouper with a freq groups the time index into 5-minute buckets
time_key = pd.Grouper(freq = "5min")
df2.set_index('time').groupby(["key", time_key]).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,value
key,time,Unnamed: 2_level_1
a,2017-05-20 00:00:00,30.0
a,2017-05-20 00:05:00,105.0
a,2017-05-20 00:10:00,180.0
b,2017-05-20 00:00:00,35.0
b,2017-05-20 00:05:00,110.0
b,2017-05-20 00:10:00,185.0
c,2017-05-20 00:00:00,40.0
c,2017-05-20 00:05:00,115.0
c,2017-05-20 00:10:00,190.0


In [83]:
resampled = (df2.set_index("time")
            .groupby(["key",time_key])
            .sum())

resampled

Unnamed: 0_level_0,Unnamed: 1_level_0,value
key,time,Unnamed: 2_level_1
a,2017-05-20 00:00:00,30.0
a,2017-05-20 00:05:00,105.0
a,2017-05-20 00:10:00,180.0
b,2017-05-20 00:00:00,35.0
b,2017-05-20 00:05:00,110.0
b,2017-05-20 00:10:00,185.0
c,2017-05-20 00:00:00,40.0
c,2017-05-20 00:05:00,115.0
c,2017-05-20 00:10:00,190.0


In [84]:
resampled.reset_index()

Unnamed: 0,key,time,value
0,a,2017-05-20 00:00:00,30.0
1,a,2017-05-20 00:05:00,105.0
2,a,2017-05-20 00:10:00,180.0
3,b,2017-05-20 00:00:00,35.0
4,b,2017-05-20 00:05:00,110.0
5,b,2017-05-20 00:10:00,185.0
6,c,2017-05-20 00:00:00,40.0
7,c,2017-05-20 00:05:00,115.0
8,c,2017-05-20 00:10:00,190.0
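
As an alternative sketch, `pd.Grouper` also accepts a `key` argument naming the time column, so the same grouping can be written without `set_index` first. The data is rebuilt here so that the block runs on its own.

```python
import numpy as np
import pandas as pd

N = 15
times = pd.date_range("2017-05-20 00:00", freq="1min", periods=N)
df2 = pd.DataFrame({"time": times.repeat(3),
                    "key": np.tile(["a", "b", "c"], N),
                    "value": np.arange(N * 3.)})

# key="time" tells the Grouper which column holds the timestamps
resampled = (df2
             .groupby(["key", pd.Grouper(key="time", freq="5min")])
             .sum())
```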


# Techniques for method Chaining:

In [85]:
df2     #load data

Unnamed: 0,time,key,value
0,2017-05-20 00:00:00,a,0.0
1,2017-05-20 00:00:00,b,1.0
2,2017-05-20 00:00:00,c,2.0
3,2017-05-20 00:01:00,a,3.0
4,2017-05-20 00:01:00,b,4.0
5,2017-05-20 00:01:00,c,5.0
6,2017-05-20 00:02:00,a,6.0
7,2017-05-20 00:02:00,b,7.0
8,2017-05-20 00:02:00,c,8.0
9,2017-05-20 00:03:00,a,9.0


In [86]:
df2 =df2[df2["value"]<22]  # first statement


df2

Unnamed: 0,time,key,value
0,2017-05-20 00:00:00,a,0.0
1,2017-05-20 00:00:00,b,1.0
2,2017-05-20 00:00:00,c,2.0
3,2017-05-20 00:01:00,a,3.0
4,2017-05-20 00:01:00,b,4.0
5,2017-05-20 00:01:00,c,5.0
6,2017-05-20 00:02:00,a,6.0
7,2017-05-20 00:02:00,b,7.0
8,2017-05-20 00:02:00,c,8.0
9,2017-05-20 00:03:00,a,9.0


In [87]:
df2["col1_demeaned"] = df2["value"] - df2["value"].mean() # second statement
df2

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,time,key,value,col1_demeaned
0,2017-05-20 00:00:00,a,0.0,-10.5
1,2017-05-20 00:00:00,b,1.0,-9.5
2,2017-05-20 00:00:00,c,2.0,-8.5
3,2017-05-20 00:01:00,a,3.0,-7.5
4,2017-05-20 00:01:00,b,4.0,-6.5
5,2017-05-20 00:01:00,c,5.0,-5.5
6,2017-05-20 00:02:00,a,6.0,-4.5
7,2017-05-20 00:02:00,b,7.0,-3.5
8,2017-05-20 00:02:00,c,8.0,-2.5
9,2017-05-20 00:03:00,a,9.0,-1.5


In [88]:
result = df2.groupby("key").col1_demeaned.std()   # third statement

In [89]:
result        

key
a    7.348469
b    6.480741
c    6.480741
Name: col1_demeaned, dtype: float64

In [90]:
result = (df2.assign(col1_demeaned = df2.value - df2.value.mean())  # all three statements combined together into one
          .groupby("key")
          .col1_demeaned.std())

In [91]:
result  # same actions done in a single statement.

key
a    7.348469
b    6.480741
c    6.480741
Name: col1_demeaned, dtype: float64

# What are callables?

In [92]:
#Load Data

df2 = pd.DataFrame({"time" : times.repeat(3),
             "key" : np.tile(["a", "b", "c"], N),
             "value" : np.arange(N * 3.)})



# first statement

df2 = df2[df2["value"]<22]


Alternatively, we can write it with a callable as:

In [93]:
df = (df2
     [lambda x:x["value"]<22])

In [94]:
df

Unnamed: 0,time,key,value
0,2017-05-20 00:00:00,a,0.0
1,2017-05-20 00:00:00,b,1.0
2,2017-05-20 00:00:00,c,2.0
3,2017-05-20 00:01:00,a,3.0
4,2017-05-20 00:01:00,b,4.0
5,2017-05-20 00:01:00,c,5.0
6,2017-05-20 00:02:00,a,6.0
7,2017-05-20 00:02:00,b,7.0
8,2017-05-20 00:02:00,c,8.0
9,2017-05-20 00:03:00,a,9.0


In [95]:
#We write the entire sequence of three statements in single chained expression

result = (df2
         [lambda x: x.value<22]  
          #obtaining same results using "callables" which are function like arguments(the lambda fn)
         .assign(col1_demeaned = lambda x:x.value - x.value.mean())
         .groupby("key")
         .col1_demeaned.std())

In [96]:
result     

key
a    7.348469
b    6.480741
c    6.480741
Name: col1_demeaned, dtype: float64
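
The following sketch, on made-up data, shows why callables matter inside a chain. An expression written against the original variable still sees the unfiltered data, while a lambda receives the intermediate DataFrame at that point in the chain.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b"] * 3,
                   "value": [1.0, 2.0, 3.0, 4.0, 5.0, 60.0]})

# Refers to df by name: the mean still includes the filtered-out row (60.0)
stale = (df[df["value"] < 10]
         .assign(demeaned=df["value"] - df["value"].mean()))

# Callables receive the filtered frame, so the mean is taken after filtering
fresh = (df[lambda x: x["value"] < 10]
         .assign(demeaned=lambda x: x["value"] - x["value"].mean()))
```

In the examples above both forms give the same answer only because the filter is applied and stored in df2 before the chain starts.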

# Method chaining can make code less readable
# The output from each statement cannot be evaluated separately, so debugging becomes difficult

# Pipe Method

Sometimes we need to use our own functions, or functions from third-party libraries, inside a method chain. The pipe method makes this possible.
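
As a minimal sketch, `df.pipe(f, *args)` is the same as calling `f(df, *args)`, but it reads left to right instead of inside out. The functions `add_total` and `keep_large` are hypothetical helpers written for this illustration.

```python
import pandas as pd

def add_total(frame):
    # hypothetical helper: add a column with the row-wise sum
    return frame.assign(total=frame["a"] + frame["b"])

def keep_large(frame, threshold):
    # hypothetical helper: keep rows whose total exceeds the threshold
    return frame[frame["total"] > threshold]

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Nested function calls read inside out
nested = keep_large(add_total(df), threshold=15)

# The same steps as a chain
piped = df.pipe(add_total).pipe(keep_large, threshold=15)
```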

In [118]:
#Checkout the difference: 
np.random.seed(12345)
df = pd.DataFrame({"Name" : ["Wes"] * 3 + ["Bill"] * 3 + ["Hill"] * 3,
                    "subject" : ["Math", "Physisc", "Chem"] * 3,
                    "Marks" : np.random.randint(60, 90, 9 )})

df.head()

Unnamed: 0,Name,subject,Marks
0,Wes,Math,62
1,Wes,Physisc,65
2,Wes,Chem,89
3,Bill,Math,61
4,Bill,Physisc,64


df = pd.DataFrame({"Name" : ["Wes", "Bill", "Hill"] * 3,
                    "subject" : ["Math", "Physisc", "Chem"] * 3,
                    "Marks" : np.random.randint(60, 90, 9 )})
df.head()

In [120]:
#Returns Pandas DataFrame

def get_subject_rank(input_df): 
    
    input_df = input_df.copy()
    input_df["Subject_Rank"] = (input_df
                                .groupby("subject")["Marks"]
                                .rank(method = "dense", ascending = False))
    
    return input_df

df.pipe(get_subject_rank)

Unnamed: 0,Name,subject,Marks,Subject_Rank
0,Wes,Math,62,2.0
1,Wes,Physisc,65,1.0
2,Wes,Chem,89,1.0
3,Bill,Math,61,3.0
4,Bill,Physisc,64,2.0
5,Bill,Chem,69,2.0
6,Hill,Math,65,1.0
7,Hill,Physisc,62,3.0
8,Hill,Chem,89,1.0


Parameters for Rank:

    axis : {0 or ‘index’, 1 or ‘columns’}, default 0

        Index to direct ranking.

    method : {‘average’, ‘min’, ‘max’, ‘first’, ‘dense’}, default ‘average’

        How to rank the group of records that have the same value (i.e. ties):

            average: average rank of the group

            min: lowest rank in the group

            max: highest rank in the group

            first: ranks assigned in order they appear in the array

            dense: like ‘min’, but rank always increases by 1 between groups.

    numeric_only : bool, optional

        For DataFrame objects, rank only numeric columns if set to True.

    na_option : {‘keep’, ‘top’, ‘bottom’}, default ‘keep’

        How to rank NaN values:

            keep: assign NaN rank to NaN values

            top: assign lowest rank to NaN values

            bottom: assign highest rank to NaN values

    ascending : bool, default True

        Whether or not the elements should be ranked in ascending order.

    pct : bool, default False

        Whether or not to display the returned rankings in percentile form.
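
To make the tie-handling methods concrete, here is a small sketch on made-up marks in which two values tie for first place. Ranks are computed in descending order, as in the examples above.

```python
import pandas as pd

marks = pd.Series([89, 69, 89, 75])

# The two 89s occupy positions 1 and 2; each method resolves the tie differently
average = marks.rank(ascending=False)               # ties share the mean position
lowest = marks.rank(method="min", ascending=False)  # ties get the lowest position
highest = marks.rank(method="max", ascending=False) # ties get the highest position
dense = marks.rank(method="dense", ascending=False) # no gaps after a tie
```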

# Returning a pandas Series

A function called through pipe can return any kind of object. In the following example, the function returns a pandas Series when df_or_not = False. When the function takes more than one argument, the additional arguments are passed in the pipe call, as the example below shows.

In [125]:
def get_subject_rank(input_df, df_or_not = True):
    
    input_df = input_df.copy()
    if df_or_not:
        # add the rank as a new column and return the whole DataFrame
        input_df["Subject_score"] = (input_df
                                   .groupby("subject")["Marks"]
                                   .rank(ascending = False))
        return input_df
    else:
        # return only the ranks as a Series (highest mark gets rank 1)
        output_series = (input_df
                        .groupby("subject")["Marks"]
                        .rank(ascending = False))
        return output_series
    
df.pipe(get_subject_rank, df_or_not = False)

0    2.0
1    1.0
2    1.5
3    3.0
4    2.0
5    3.0
6    1.0
7    3.0
8    1.5
Name: Marks, dtype: float64

# Data is not the first argument

When a function is called through pipe, its first argument is by default the DataFrame or Series that pipe is applied to. Here is an example of a function that modifies scores, add_score. Its first argument, input_df, receives df, so there is no need to pass input_df explicitly in the pipe call.

In [130]:
def add_score(input_df, add_score):
    input_df = input_df.copy()
    input_df = input_df.assign(new_score = lambda x:x.Marks + add_score)
    return input_df

df.pipe(add_score,2)

Unnamed: 0,Name,subject,Marks,new_score
0,Wes,Math,62,64
1,Wes,Physisc,65,67
2,Wes,Chem,89,91
3,Bill,Math,61,63
4,Bill,Physisc,64,66
5,Bill,Chem,69,71
6,Hill,Math,65,67
7,Hill,Physisc,62,64
8,Hill,Chem,89,91


Now the two arguments of add_score are swapped, so the DataFrame is the second argument of the function. In this case, a tuple (function, “name of the data argument”) is passed to pipe to indicate which argument should receive the data.

In [131]:
def add_score(add_score, input_df):
    input_df = input_df.copy()
    input_df = input_df.assign(new_score = lambda x:x.Marks + add_score)
    return input_df

df.pipe((add_score, "input_df"), 2)

Unnamed: 0,Name,subject,Marks,new_score
0,Wes,Math,62,64
1,Wes,Physisc,65,67
2,Wes,Chem,89,91
3,Bill,Math,61,63
4,Bill,Physisc,64,66
5,Bill,Chem,69,71
6,Hill,Math,65,67
7,Hill,Physisc,62,64
8,Hill,Chem,89,91


# Debugging in method chaining

A common concern is that long chains are hard to debug because no intermediate results are returned. Decorators are a good way to tackle this problem. A decorator is a function that extends the behavior of the function it wraps without explicitly modifying it.

Let’s look at an actual example. By combining decorators with logging, any property of the DataFrame can be written to the log at each step of the chain. Here, the shape and the columns are logged by log_shape and log_columns.

Note: wraps is used so that the decorated function keeps the name, docstring, argument list, etc. of the original function.
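
The effect of wraps can be seen in a small sketch with two hypothetical decorators, one with wraps and one without:

```python
from functools import wraps

def plain(func):
    # decorator without wraps: the wrapper hides the original metadata
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

def preserving(func):
    # decorator with wraps: name and docstring are copied to the wrapper
    @wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

@plain
def first_step():
    """Docstring of first_step."""

@preserving
def second_step():
    """Docstring of second_step."""

names = (first_step.__name__, second_step.__name__)
```

This matters for the logging decorators in this section because they report `func.__name__`. When decorators are stacked, the outer one sees the inner wrapper, and wraps keeps the original function name visible to it.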



In [134]:
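from functools import wraps
import logging

# logging.info messages are hidden at the default WARNING level; show them
logging.basicConfig(level=logging.INFO)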

def log_shape(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        logging.info("%s,%s" % (func.__name__, result.shape))
        return result
    return wrapper

def log_columns(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        logging.info("%s,%s" % (func.__name__, result.columns))
        return result
    return wrapper

@log_columns
@log_shape
def get_subject_rank(input_df):
    input_df = input_df.copy()
    input_df['subject_rank'] = (input_df
                                .groupby(['subject'])['Marks']
                                .rank(ascending=False))
    return input_df

@log_columns
@log_shape
def add_score(input_df, added_score):
    input_df = input_df.copy()
    input_df = input_df.assign(new_score=lambda x: x.Marks+added_score)
    return input_df

(
    df.pipe(get_subject_rank)
      .pipe(add_score, 2)
)


Unnamed: 0,Name,subject,Marks,subject_rank,new_score
0,Wes,Math,62,2.0,64
1,Wes,Physisc,65,1.0,67
2,Wes,Chem,89,1.5,91
3,Bill,Math,61,3.0,63
4,Bill,Physisc,64,2.0,66
5,Bill,Chem,69,3.0,71
6,Hill,Math,65,1.0,67
7,Hill,Physisc,62,3.0,64
8,Hill,Chem,89,1.5,91


For further details, please see [here](https://tomaugspurger.github.io/method-chaining) and this [link](https://sinyi-chou.github.io/python-pandas-pipe/).

Pipe is a flexible way to fit custom functions into pandas operations. It is great that pandas implements many methods that enable method chaining during data manipulation. I enjoy exploring more possibilities and more efficient ways to work with data in pandas.

More information about why chaining (Python) and pipes (R) are useful for data scientists can be found in the article on method chaining by Tom Augspurger, one of the main contributors to pandas, and in the chapter on pipes in R for Data Science.