# Python Lecture 9: dataframes and split-apply-combine, regular expressions

Split-apply-combine is a common way to work with data:

1. we split a dataset into one or more groups
2. we apply functions ("in parallel") to each of the groups
3. we combine the groups into one dataset

Depending on the platform/programming language/library we are using, we might use different terminology, but the way of thinking is the same.

The 'Map/Reduce' paradigm might also be seen as an application of split-apply-combine.
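
Before handing the work to a library, the three steps can be spelled out by hand. The following is only a minimal sketch in plain Python; the `(category, quantity)` pairs are made up for illustration and are not taken from the sales file used below.

```python
from collections import defaultdict

# Hypothetical toy data: (category, quantity) pairs.
rows = [("Belt", 13), ("Shoes", 19), ("Shirt", 12), ("Shirt", 5), ("Shoes", 18)]

# 1. split: collect the quantities of each category in its own list
groups = defaultdict(list)
for category, quantity in rows:
    groups[category].append(quantity)

# 2. apply: run the same function (here: sum) on every group
totals_per_group = [(category, sum(quantities)) for category, quantities in groups.items()]

# 3. combine: put the per-group results back into one data structure
totals = dict(totals_per_group)
print(totals)
```

Libraries such as pandas hide steps 1 and 3, so that only the function to apply has to be chosen.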

In **pandas**, we can use either

* `groupby`, or
* `pivot_table` 

to solve this kind of split-apply-combine problem.

Let's see how this might be done in an example:

In [2]:
import numpy as np
import pandas as pd

df = pd.read_csv('sample-sales-20.csv')
df

,account number,name,sku,category,quantity,unit price,ext price,date
0,296809,Carroll PLC,QN-82852,Belt,13,44.48,578.24,2014-09-27 07:13:03
1,98022,Heidenreich-Bosco,MJ-21460,Shoes,19,53.62,1018.78,2014-07-29 02:10:44
2,563905,"Kerluke, Reilly and Bechtelar",AS-93055,Shirt,12,24.16,289.92,2014-03-01 10:51:24
3,93356,Waters-Walker,AS-93055,Shirt,5,82.68,413.40,2013-11-17 20:41:11
4,659366,Waelchi-Fahey,AS-93055,Shirt,18,99.64,1793.52,2014-01-03 08:14:27
5,563905,"Kerluke, Reilly and Bechtelar",AS-93055,Shirt,17,52.82,897.94,2013-12-04 02:07:05
6,995267,Cole-Eichmann,GS-86623,Shoes,18,15.28,275.04,2014-04-09 16:15:03
7,524021,Hegmann and Sons,LL-46261,Shoes,7,78.78,551.46,2014-06-18 19:25:10
8,929400,"Senger, Upton and Breitenberg",LW-86841,Shoes,17,38.19,649.23,2014-02-10 05:55:56
9,563905,"Kerluke, Reilly and Bechtelar",KV-99194,Shirt,12,26.98,323.76,2014-05-20 00:21:28


### What questions could we ask about this data?

In [33]:
#which product is the most popular?
#  concretely: how many pieces of each product have we sold?
#df[df['sku']  == 'AS-93055']
#df[df['sku']  == 'QN-82852']
df.groupby('sku').sum(numeric_only=True).sort_values('quantity', ascending=False)



sku,account number,quantity,unit price,ext price
AS-93055,57040659,1136,6095.63,65756.89
KV-99194,56283483,1119,6107.3,65972.27
QN-82852,58925385,1113,5742.91,57577.31
MJ-21460,53872090,1052,5591.66,58781.21
GS-86623,47895378,1046,5379.88,54528.92
LL-46261,51517799,1031,5501.5,56669.23
VG-32047,52174118,1026,5780.17,58157.14
FK-71853,50698086,973,5253.47,56888.83
LW-86841,59701073,925,6095.33,56574.86
WJ-02096,47100826,907,4631.78,48937.24
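
Calling `.sum()` on the whole grouped frame, as in the cell above, aggregates every column it can, including ones where a sum carries no meaning (`account number`, `unit price`). A possible narrower variant is to select the column of interest before aggregating. The sketch below uses a small made-up frame that only borrows the column names from the sales file.

```python
import pandas as pd

# Hypothetical mini dataset; only the column names mirror the sales file.
toy = pd.DataFrame({
    "sku": ["AS-93055", "QN-82852", "AS-93055", "MJ-21460"],
    "name": ["A", "B", "C", "A"],
    "quantity": [12, 13, 5, 19],
})

# Pick the column first, then aggregate: only 'quantity' is summed.
pieces_sold = toy.groupby("sku")["quantity"].sum().sort_values(ascending=False)
print(pieces_sold)
```

In recent pandas versions, `sum(numeric_only=True)` is another way to keep text columns out of the result.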


In [35]:
#another way to answer: rank by revenue ('ext price') instead of pieces sold
df.groupby('sku').sum(numeric_only=True).sort_values('ext price', ascending=False)



sku,account number,quantity,unit price,ext price
KV-99194,56283483,1119,6107.3,65972.27
AS-93055,57040659,1136,6095.63,65756.89
MJ-21460,53872090,1052,5591.66,58781.21
VG-32047,52174118,1026,5780.17,58157.14
QN-82852,58925385,1113,5742.91,57577.31
FK-71853,50698086,973,5253.47,56888.83
LL-46261,51517799,1031,5501.5,56669.23
LW-86841,59701073,925,6095.33,56574.86
GS-86623,47895378,1046,5379.88,54528.92
WJ-02096,47100826,907,4631.78,48937.24


In [37]:
#who is the 'best customer'?

df.groupby('name').sum(numeric_only=True).sort_values('ext price', ascending=False)

name,account number,quantity,unit price,ext price
"Kihn, McClure and Denesik",43971714,638,3487.45,38935.29
Waters-Walker,4667800,578,3303.95,36778.96
Carroll PLC,16918113,677,2947.25,35934.31
Hegmann and Sons,30393218,606,3413.61,35213.72
Kunze Inc,45907560,605,3025.18,34406.54
"Volkman, Goyette and Lemke",36022128,538,2832.62,32006.87
Cole-Eichmann,50758617,538,2833.0,30435.42
Berge LLC,38215944,522,3076.92,30064.87
"Senger, Upton and Breitenberg",54834600,566,3023.81,29577.46
Waelchi-Fahey,35605764,501,3017.41,28968.68


In [40]:
#what does the best customer buy?
kmd = df[df['name'] == 'Kihn, McClure and Denesik']
kmd.groupby('category').sum(numeric_only=True)

category,account number,quantity,unit price,ext price
Belt,9097596,164,724.29,10116.9
Shirt,20469591,288,1653.58,18956.35
Shoes,14404527,186,1109.58,9862.04


In [44]:
#the same for all customers at once: group by two keys
df.groupby(['name', 'category']).sum(numeric_only=True)

name,category,account number,quantity,unit price,ext price
Berge LLC,Belt,6614298,120,450.96,6033.53
Berge LLC,Shirt,14698440,166,1226.54,9670.24
Berge LLC,Shoes,16903206,236,1399.42,14361.1
Carroll PLC,Belt,4155326,184,733.03,9359.26
Carroll PLC,Shirt,6529798,257,1098.93,13717.61
Carroll PLC,Shoes,6232989,236,1115.29,12857.44
Cole-Eichmann,Belt,12938471,147,640.67,8112.7
Cole-Eichmann,Shirt,20900607,236,1226.75,14528.01
Cole-Eichmann,Shoes,16919539,155,965.58,7794.71
"Davis, Kshlerin and Reilly",Belt,1834818,33,309.37,1604.13


In [49]:
df.groupby(['name', 'category'])[['quantity', 'ext price']].sum().unstack()

,quantity,quantity,quantity,ext price,ext price,ext price
category,Belt,Shirt,Shoes,Belt,Shirt,Shoes
name,,,,,,
Berge LLC,120,166,236,6033.53,9670.24,14361.1
Carroll PLC,184,257,236,9359.26,13717.61,12857.44
Cole-Eichmann,147,236,155,8112.7,14528.01,7794.71
"Davis, Kshlerin and Reilly",33,161,185,1604.13,7533.03,9917.6
"Ernser, Cruickshank and Lind",97,262,96,5894.38,16944.19,5250.45
Gorczany-Hahn,103,237,100,3642.48,12576.83,5988.59
Hamill-Hackett,32,148,239,1609.74,8880.04,12944.0
Hegmann and Sons,88,278,240,4909.38,16774.47,13529.87
Heidenreich-Bosco,106,92,233,6262.94,5965.25,13200.1
Huel-Haag,43,200,115,2219.51,11944.01,6924.36


In [55]:
#new question: which brand is sold the most?
#   brand = first two letters of SKU
#      we need to put those into a new column

def get_first_two_letters(sku):
    return sku[:2]

#df['brand'] = df['sku'].apply(get_first_two_letters)
#df

#or just use a lambda:

df['brand'] = df['sku'].apply(lambda s: s[:2])

df.groupby('brand').sum(numeric_only=True).sort_values('ext price', ascending=False)

brand,account number,quantity,unit price,ext price
KV,56283483,1119,6107.3,65972.27
AS,57040659,1136,6095.63,65756.89
MJ,53872090,1052,5591.66,58781.21
VG,52174118,1026,5780.17,58157.14
QN,58925385,1113,5742.91,57577.31
FK,50698086,973,5253.47,56888.83
LL,51517799,1031,5501.5,56669.23
LW,59701073,925,6095.33,56574.86
GS,47895378,1046,5379.88,54528.92
WJ,47100826,907,4631.78,48937.24
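
Besides a named function or a lambda, pandas also offers the vectorized `.str` accessor for this kind of string slicing. A short sketch with three SKUs copied from the tables above; the Series itself is only an illustration.

```python
import pandas as pd

# A few SKUs in the same format as the sales file.
skus = pd.Series(["KV-99194", "AS-93055", "MJ-21460"])

# .str[:2] slices every string in the Series, like apply(lambda s: s[:2]).
brands = skus.str[:2]
print(brands.tolist())
```

Judging by the totals above, which repeat the per-SKU totals exactly, every SKU in this file seems to have its own two-letter prefix.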


In [58]:
#which customer buys which brand:
df.groupby(['name', 'brand'])['ext price'].sum().unstack()

brand,AS,FK,GS,KV,LL,LW,MJ,QN,VG,WJ
name,,,,,,,,,,
Berge LLC,2994.95,2205.93,1967.75,1973.93,2505.84,1778.3,8109.21,3737.73,2495.43,2295.8
Carroll PLC,4491.04,310.06,2501.28,1692.3,4887.11,1218.21,4250.84,1333.8,7224.21,8025.46
Cole-Eichmann,4257.94,2937.22,1656.65,5490.57,3313.13,1792.47,1032.46,6301.71,1842.28,1810.99
"Davis, Kshlerin and Reilly",2369.01,1115.8,2664.5,2546.21,4527.3,1363.64,1362.16,1160.57,1502.01,443.56
"Ernser, Cruickshank and Lind",2304.33,1559.4,117.22,7969.18,1548.56,1485.36,2099.31,4007.49,5111.28,1886.89
Gorczany-Hahn,2389.12,4665.77,606.08,3018.34,1430.75,605.68,3346.08,1546.21,2503.6,2096.27
Hamill-Hackett,3452.66,2764.63,3638.28,1174.27,3846.51,4273.35,1185.86,1277.16,1488.48,332.58
Hegmann and Sons,6275.38,5607.93,1113.88,2267.12,1624.43,5525.83,5265.73,2634.73,2624.04,2274.65
Heidenreich-Bosco,1131.59,1784.16,5912.27,521.96,2789.58,2711.27,1786.98,4132.8,2527.54,2130.14
Huel-Haag,3115.55,3650.04,3140.6,2649.21,1170.81,875.7,1737.25,737.49,2529.21,1482.02


In [62]:
#in which months do we sell how many shoes/belts/shirts?

df['month'] = df['date'].apply(lambda s: s[5:7])
df.groupby(['month', 'category'])['quantity'].sum().unstack()

category,Belt,Shirt,Shoes
month,,,
1,187,441,344
2,215,368,339
3,134,408,328
4,157,220,273
5,213,380,315
6,133,294,217
7,159,235,339
8,118,387,372
9,192,485,356
10,133,304,419
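
Slicing characters 5 to 7 works because every date string has the same layout. A sketch of an alternative that does not depend on character positions: parse the strings with `pd.to_datetime` and read the month through the `.dt` accessor. The three timestamps are copied from the first table; the Series is only an illustration.

```python
import pandas as pd

# Timestamps in the same 'YYYY-MM-DD hh:mm:ss' format as the 'date' column.
dates = pd.Series(["2014-09-27 07:13:03", "2014-07-29 02:10:44", "2013-11-17 20:41:11"])

# Parse the strings once, then read the month as an integer.
parsed = pd.to_datetime(dates)
months = parsed.dt.month
print(months.tolist())
```

Grouping by month alone lumps together the same month of different years (the sample contains dates from 2013 and 2014); `parsed.dt.to_period('M')` would keep them apart.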


## Regular expressions

See pdf!

In [64]:
import re
#check whether a string contains a digit:
re.search("[0-9]", "abc")

In [65]:
re.search("[0-9]", "acd4c")

<_sre.SRE_Match object; span=(3, 4), match='4'>

In [66]:
re.search("[0-9]", "acd66c")

<_sre.SRE_Match object; span=(3, 4), match='6'>

In [67]:
re.search("5", "acd643215c")

<_sre.SRE_Match object; span=(8, 9), match='5'>
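
The cells above rely on the fact that `re.search` returns `None` when nothing matches (which is why the first call shows no output) and a match object otherwise. A small sketch of how this can be turned into a yes/no test; the helper name `contains_digit` is made up for this example.

```python
import re

def contains_digit(text):
    # re.search returns a match object (truthy) or None (falsy).
    return re.search("[0-9]", text) is not None

print(contains_digit("abc"))
print(contains_digit("acd4c"))

# A match object reports where the first hit starts and what it matched.
match = re.search("[0-9]", "acd66c")
print(match.start(), match.group())
```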

In [68]:
#years look like YYYY, Y is a digit
year = "[0-9]{4}"
re.search(year, "Christoph was born in 1977")

<_sre.SRE_Match object; span=(22, 26), match='1977'>

In [70]:
re.findall(year, "Christoph was born in 1977, someone else was born in 1997, 2001, 9999, 12000")

['1977', '1997', '2001', '9999', '1200']

In [71]:
year = "[12][0-9]{3}"
re.findall(year, "Christoph was born in 1977, someone else was born in 1997, 2001, 9999, 12000")

['1977', '1997', '2001', '1200']

In [73]:
year = "((19|20)[0-9]{2})"
re.findall(year, "Christoph was born in 1977, someone else was born in 1997, 2001, 9999, 12000")

[('1977', '19'), ('1997', '19'), ('2001', '20'), ('2000', '20')]
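
Two things stand out in the last result: `findall` returns tuples as soon as the pattern contains capturing groups, and `2000` is still cut out of `12000`. One possible refinement, shown here as a sketch on a shortened test string, uses a non-capturing group `(?:...)` and the word boundary `\b`.

```python
import re

text = "born in 1977, 1997, 2001, 9999, 12000"

# (?:...) groups without capturing, so findall returns plain strings.
# \b demands a word boundary on both sides, so '12000' no longer yields '2000'.
year = r"\b(?:19|20)[0-9]{2}\b"
years = re.findall(year, text)
print(years)
```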

In [78]:
date = "([0-9]{4})-([0-9]{2})-([0-9]{2})"
match = re.search(date, "today is 2017-49-17")

In [80]:
print(match.group(0))
print(match.group(1))
print(match.group(2))
print(match.group(3))

2017-49-17
2017
49
17
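
The pattern only checks the shape of the string, so the impossible month 49 passes. A sketch of one way to combine the regular expression with a real calendar check, using named groups and `datetime.strptime`; the helper `parse_date` is made up for this example.

```python
import re
from datetime import datetime

# Same pattern as above, with named groups instead of numbered ones.
date = r"(?P<year>[0-9]{4})-(?P<month>[0-9]{2})-(?P<day>[0-9]{2})"

def parse_date(text):
    """Return a datetime.date for the first YYYY-MM-DD in text, or None."""
    match = re.search(date, text)
    if match is None:
        return None
    try:
        # strptime raises ValueError for impossible dates such as month 49.
        return datetime.strptime(match.group(0), "%Y-%m-%d").date()
    except ValueError:
        return None

match = re.search(date, "today is 2017-49-17")
print(match.group("month"))  # named access instead of group(2)

print(parse_date("today is 2017-49-17"))
print(parse_date("today is 2017-11-17"))
```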


In [8]:
report = df.pivot_table(values=['quantity'],
                        index=['name'],
                        columns=['category'],
                        aggfunc='sum')
report.head(n=10)

,quantity,quantity,quantity
category,Belt,Shirt,Shoes
name,,,
Berge LLC,120,166,236
Carroll PLC,184,257,236
Cole-Eichmann,147,236,155
"Davis, Kshlerin and Reilly",33,161,185
"Ernser, Cruickshank and Lind",97,262,96
Gorczany-Hahn,103,237,100
Hamill-Hackett,32,148,239
Hegmann and Sons,88,278,240
Heidenreich-Bosco,106,92,233
Huel-Haag,43,200,115


In [7]:
report = df.pivot_table(values=['ext price', 'quantity'],
                        index=['name'],
                        columns=['category'],
                        fill_value=0,
                        aggfunc='sum')
report.head(n=10)

,ext price,ext price,ext price,quantity,quantity,quantity
category,Belt,Shirt,Shoes,Belt,Shirt,Shoes
name,,,,,,
Berge LLC,6033.53,9670.24,14361.1,120,166,236
Carroll PLC,9359.26,13717.61,12857.44,184,257,236
Cole-Eichmann,8112.7,14528.01,7794.71,147,236,155
"Davis, Kshlerin and Reilly",1604.13,7533.03,9917.6,33,161,185
"Ernser, Cruickshank and Lind",5894.38,16944.19,5250.45,97,262,96
Gorczany-Hahn,3642.48,12576.83,5988.59,103,237,100
Hamill-Hackett,1609.74,8880.04,12944.0,32,148,239
Hegmann and Sons,4909.38,16774.47,13529.87,88,278,240
Heidenreich-Bosco,6262.94,5965.25,13200.1,106,92,233
Huel-Haag,2219.51,11944.01,6924.36,43,200,115
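
The numbers in these pivot tables are the same as in the earlier `groupby(...).unstack()` results. The following sketch compares the two routes on a small made-up frame (only the column names mirror the sales file); it is meant as an illustration, not as a statement about how pandas implements `pivot_table`.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the sales file.
toy = pd.DataFrame({
    "name": ["A", "A", "B", "B", "B"],
    "category": ["Belt", "Shoes", "Belt", "Belt", "Shirt"],
    "quantity": [1, 2, 3, 4, 5],
})

# Route 1: pivot_table names the rows, the columns and the aggregation in one call.
via_pivot = toy.pivot_table(values="quantity", index="name",
                            columns="category", aggfunc="sum", fill_value=0)

# Route 2: group by both keys, sum, then move 'category' into the columns.
via_groupby = toy.groupby(["name", "category"])["quantity"].sum().unstack(fill_value=0)

# Compare the cell values of both tables.
same = bool((via_pivot.to_numpy() == via_groupby.to_numpy()).all())
print(via_pivot)
print(same)
```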
