# Pandas in Practice

This article covers some commonly used `pandas` methods in practice.  
It will be updated gradually as I use the library more.

In [1]:
import numpy as np
import pandas as pd

## Common Data Aggregation Methods

Create a sample DataFrame

In [2]:
np.random.seed(1024)

df = pd.DataFrame({'user_id': np.random.randint(1, 15, size=30), 
                   'action': np.random.randint(40, size=30), 
                   'value': np.random.randn(30)}, 
                  columns=['user_id', 'action', 'value'])
df.head()

Unnamed: 0,user_id,action,value
0,12,13,-0.737976
1,2,30,0.835748
2,2,12,-0.599908
3,13,5,-1.491901
4,6,6,-1.612696


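- Aggregating the whole dataset

Compute the number of distinct users, the number of `action` rows, and the mean of `value`

In [3]:
# Use agg to apply a different aggregation to each column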
df.agg({'user_id': 'nunique', 
        'action': 'size', 
        'value': 'mean'}). \
        to_frame().T. \
        rename(columns={'user_id': 'user_count', 'action': 'action_count', 'value': 'avg_value'})

Unnamed: 0,user_count,action_count,avg_value
0,11.0,30.0,0.104938


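- Aggregating grouped data

Group by user, then compute each user's `action` count and the mean, sum, and range (maximum minus minimum) of `value`

In [4]:
# Group the data with groupby, then use agg to apply different aggregations to each column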
df_summary = df.groupby('user_id').agg({'action': 'size', 
                                        'value': ['mean', 'sum', lambda x: max(x) - min(x)]}).reset_index()
df_summary.columns = df_summary.columns.droplevel(0)
df_summary.columns = ['user_id', 'action_count', 'avg_value', 'sum_value', 'value_diff']
df_summary

Unnamed: 0,user_id,action_count,avg_value,sum_value,value_diff
0,1,5,0.03094,0.1547,2.022097
1,2,4,0.540606,2.162423,1.86467
2,3,2,0.692094,1.384189,2.013567
3,4,3,1.656883,4.970649,2.210926
4,6,4,-0.677293,-2.70917,1.702545
5,8,2,-0.657255,-1.31451,1.299752
6,9,1,0.583282,0.583282,0.0
7,10,3,0.087281,0.261843,1.112621
8,12,4,-0.367629,-1.470517,1.60642
9,13,1,-1.491901,-1.491901,0.0
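
As a possible alternative to flattening and renaming the columns by hand, pandas (0.25 and later) supports named aggregation, where each keyword argument to `agg` names an output column. The sketch below uses a small hypothetical `demo` frame rather than the random data above.

```python
import pandas as pd

# Hypothetical toy data, chosen so the results are easy to check by hand
demo = pd.DataFrame({'user_id': [1, 1, 2, 2, 2],
                     'action': [3, 7, 1, 4, 9],
                     'value': [0.5, 1.5, -1.0, 2.0, 0.0]})

# Each keyword is an output column: name=(input column, aggregation)
summary = demo.groupby('user_id').agg(
    action_count=('action', 'size'),
    avg_value=('value', 'mean'),
    sum_value=('value', 'sum'),
    value_diff=('value', lambda x: x.max() - x.min()),
).reset_index()
print(summary)
```

The result already has flat, descriptive column names, so no `droplevel` or manual reassignment of `columns` is needed.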


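- Aggregating with custom groups

Split `user_id` into two groups, `old` (less than or equal to 7) and `new` (greater than 7), then compute the `action` count and the mean of `value` for each group

In [5]:
# Use numpy.where to assign each user_id to a group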
df['user_category'] = np.where(df['user_id'] <= 7, 'old', 'new')
df.groupby('user_category').agg({'action': 'size', 'value': 'mean'}).reset_index(). \
    rename(columns={'action': 'action_count', 'value': 'avg_value'})

Unnamed: 0,user_category,action_count,avg_value
0,new,12,-0.234555
1,old,18,0.331266
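
`np.where` handles a two-way split. For three or more groups, one option is `np.select`, which takes a list of conditions and a matching list of labels. The thresholds and the `mid` label below are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

user_id = pd.Series([1, 5, 8, 12, 14])

# Conditions are checked in order; the first one that is True wins
conditions = [user_id <= 4, user_id <= 9]
labels = ['old', 'mid']

# Rows matching no condition get the default label
category = np.select(conditions, labels, default='new')
print(category)
```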


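Bin `action` into intervals of width 5, then count the distinct users and compute the mean of `value` in each bin

In [6]:
# Bin the data with pandas.cut (bins are right-closed by default, so an action of exactly 0 falls outside the first bin (0, 5])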
df['action_category'] = pd.cut(df['action'], bins=np.arange(0, 45, 5))
df.groupby('action_category').agg({'user_id': 'nunique', 'value': 'mean'}).reset_index(). \
    rename(columns={'user_id': 'user_count', 'value': 'avg_value'})

Unnamed: 0,action_category,user_count,avg_value
0,"(0, 5]",3,-0.396481
1,"(5, 10]",3,-0.132592
2,"(10, 15]",5,-0.252417
3,"(15, 20]",4,0.133855
4,"(20, 25]",2,1.292345
5,"(25, 30]",5,0.379669
6,"(30, 35]",1,0.66182
7,"(35, 40]",2,-0.107871
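
Because `pd.cut` builds right-closed intervals by default, a value equal to the lowest bin edge is not assigned to any bin. The following sketch, on a hypothetical four-value series, compares the default with `right=False`, which makes the intervals left-closed instead.

```python
import numpy as np
import pandas as pd

actions = pd.Series([0, 5, 6, 39])
bins = np.arange(0, 45, 5)

# Default: (0, 5], (5, 10], ... so the value 0 is left unbinned (NaN)
default_bins = pd.cut(actions, bins=bins)

# right=False: [0, 5), [5, 10), ... so 0 lands in the first bin
left_closed = pd.cut(actions, bins=bins, right=False)

print(default_bins.isna().tolist())
print(left_closed.tolist())
```

Which convention fits depends on the data: with `action` values from 0 to 39 and edges from 0 to 40, left-closed bins cover every value.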


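- Filling missing data with 0

Group by `action` and count the distinct users. `action` takes values from 0 to 39, so any `action` missing from the result is filled in with a count of 0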

In [7]:
action_summary = df.groupby('action').agg({'user_id': 'nunique'}).reset_index().rename(columns={'user_id': 'user_count'})
full_actions = pd.DataFrame({'action': np.arange(40)})
action_summary = pd.merge(action_summary, full_actions, how='outer')
# np.int was removed in NumPy 1.24; the built-in int works instead
action_summary['user_count'] = action_summary['user_count'].fillna(0).astype(int)
action_summary.sort_values('action').reset_index(drop=True).head(10)

Unnamed: 0,action,user_count
0,0,0
1,1,1
2,2,0
3,3,1
4,4,0
5,5,1
6,6,2
7,7,0
8,8,0
9,9,1
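
A shorter route to the same kind of result, sketched here on hypothetical toy data, is to `reindex` the grouped counts against the full range of expected keys with `fill_value=0`. This avoids the outer merge and the float-to-int round trip.

```python
import pandas as pd

# Hypothetical toy data: only actions 1 and 3 occur, but 0..4 are expected
demo = pd.DataFrame({'user_id': [1, 2, 2, 3],
                     'action': [1, 1, 3, 3]})

user_count = (demo.groupby('action')['user_id'].nunique()
                  .reindex(range(5), fill_value=0)  # add missing actions with count 0
                  .rename('user_count')
                  .rename_axis('action')
                  .reset_index())
print(user_count)
```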


- `apply` & `transform`

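Add two columns: one with the min-max normalized `value`, and one with each user's maximum `value`

In [8]:
# Use apply to perform the normalization
df['norm_value'] = df[['value']].apply(lambda x: (x - min(x)) / (max(x) - min(x)))
# Use transform to compute the maximum value for each user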
df['user_max_value'] = df.groupby('user_id')['value'].transform(lambda x: x.max())
df.sort_values('user_id').head(10)

Unnamed: 0,user_id,action,value,user_category,action_category,norm_value,user_max_value
29,1,36,0.060935,old,"(35, 40]",0.353711,1.251196
12,1,39,-0.306965,old,"(35, 40]",0.275958,1.251196
24,1,16,-0.079565,old,"(15, 20]",0.324017,1.251196
7,1,15,1.251196,old,"(10, 15]",0.605265,1.251196
9,1,26,-0.770901,old,"(25, 30]",0.177908,1.251196
6,2,33,0.66182,old,"(30, 35]",0.480704,1.264762
2,2,12,-0.599908,old,"(10, 15]",0.214046,1.264762
1,2,30,0.835748,old,"(25, 30]",0.517463,1.264762
19,2,26,1.264762,old,"(25, 30]",0.608132,1.264762
11,3,27,1.698878,old,"(25, 30]",0.69988,1.698878
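
To make the difference between aggregating and transforming concrete, the sketch below (hypothetical toy data) computes the per-user maximum both ways. `agg` returns one row per group, while `transform` returns one value per original row, which is why its result can be assigned straight back as a new column.

```python
import pandas as pd

demo = pd.DataFrame({'user_id': [1, 1, 2, 2, 2],
                     'value': [0.5, 1.5, -1.0, 2.0, 0.0]})

grouped = demo.groupby('user_id')['value']

per_user = grouped.agg('max')        # one row per user_id
per_row = grouped.transform('max')   # same length and index as demo

# transform also accepts a function, e.g. min-max normalization within each user
norm_in_group = grouped.transform(lambda x: (x - x.min()) / (x.max() - x.min()))

print(per_user.tolist())
print(per_row.tolist())
```

Passing the string `'max'` instead of `lambda x: x.max()` lets pandas use its built-in implementation, which is usually faster on large data.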


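## Combining DataFrames

This section reuses the DataFrame from the previous section (recreated here with the same seed) and creates a few new DataFrames to demonstrate how to combine them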

In [9]:
np.random.seed(1024)

df = pd.DataFrame({'user_id': np.random.randint(1, 15, size=30), 
                   'action': np.random.randint(40, size=30), 
                   'value': np.random.randn(30)}, 
                  columns=['user_id', 'action', 'value'])
df.head()

Unnamed: 0,user_id,action,value
0,12,13,-0.737976
1,2,30,0.835748
2,2,12,-0.599908
3,13,5,-1.491901
4,6,6,-1.612696


- Combining columns

Create a DataFrame with two columns, `rate` and `score`, and combine it with `df` column-wise

In [10]:
df2 = pd.DataFrame({'rate': np.round(np.random.random(30), 4), 
                    'score': np.random.randint(1, 11, size=30)})
df2.head()

Unnamed: 0,rate,score
0,0.277,5
1,0.7259,6
2,0.7777,6
3,0.0116,8
4,0.8471,2


In [11]:
df_col = pd.concat([df, df2], axis=1)
df_col.head()

Unnamed: 0,user_id,action,value,rate,score
0,12,13,-0.737976,0.277,5
1,2,30,0.835748,0.7259,6
2,2,12,-0.599908,0.7777,6
3,13,5,-1.491901,0.0116,8
4,6,6,-1.612696,0.8471,2
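
One thing to keep in mind is that `pd.concat` with `axis=1` aligns rows by index label, not by position. The example above works because both frames share the default index 0 to 29. The sketch below, with hypothetical frames, shows what happens when the indexes differ.

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2, 3]})                       # index 0, 1, 2
right = pd.DataFrame({'b': [10, 20, 30]}, index=[1, 2, 3])  # index 1, 2, 3

# Aligned on index labels: the union of both indexes, with NaN where a label is missing
aligned = pd.concat([left, right], axis=1)

# Resetting the index first gives a purely positional combination
positional = pd.concat([left, right.reset_index(drop=True)], axis=1)

print(aligned)
print(positional)
```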


- Combining rows

Create a DataFrame with the same fields as `df` and combine the two row-wise

In [12]:
df3 = pd.DataFrame({'user_id': np.random.randint(1, 15, size=10), 
                    'action': np.random.randint(40, size=10), 
                    'value': np.random.randn(10)}, 
                   columns=['user_id', 'action', 'value'])
df3.head()

Unnamed: 0,user_id,action,value
0,2,31,1.722941
1,14,15,-0.031106
2,14,7,0.460358
3,1,9,-0.905594
4,11,37,-0.615603


In [13]:
df_row = pd.concat([df, df3], axis=0, ignore_index=True)
df_row.tail(10)

Unnamed: 0,user_id,action,value
30,2,31,1.722941
31,14,15,-0.031106
32,14,7,0.460358
33,1,9,-0.905594
34,11,37,-0.615603
35,12,25,0.636349
36,3,25,1.953322
37,1,17,0.148141
38,14,32,0.293199
39,9,7,0.552424


- Merging on a key (`join`)

Create a DataFrame with the fields `user_id`, `gender`, and `age`, and merge it with `df` using `user_id` as the key

In [14]:
df4 = pd.DataFrame({'user_id': np.arange(1, 15), 
                    'gender': np.random.randint(2, size=14), 
                    'age': np.random.randint(20, 50, size=14)}, 
                   columns=['user_id', 'gender', 'age'])
df4.head()

Unnamed: 0,user_id,gender,age
0,1,0,28
1,2,0,30
2,3,0,23
3,4,1,20
4,5,1,37


In [15]:
df_merge = pd.merge(df, df4, on='user_id')
df_merge.head()

Unnamed: 0,user_id,action,value,gender,age
0,12,13,-0.737976,0,44
1,12,16,0.704062,0,44
2,12,13,-0.902358,0,44
3,12,25,-0.534245,0,44
4,2,30,0.835748,0,30
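
`pd.merge` performs an inner join by default, so rows whose key has no match in the other frame are dropped. The sketch below uses hypothetical `events` and `users` frames to compare this with `how='left'`, and uses `indicator=True` to show where each row came from.

```python
import pandas as pd

events = pd.DataFrame({'user_id': [1, 2, 2, 99], 'action': [10, 20, 30, 40]})
users = pd.DataFrame({'user_id': [1, 2, 3], 'age': [28, 30, 23]})

# Default how='inner': user 99 has no match in users and is dropped
inner = pd.merge(events, users, on='user_id')

# how='left' keeps every row of events; indicator adds a _merge column
left = pd.merge(events, users, on='user_id', how='left', indicator=True)

print(inner)
print(left)
```

In the article's example every `user_id` from 1 to 14 exists in `df4`, so the inner join loses no rows.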


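## Converting String Variables to Dummy Variables

Converting string variables to `dummy variable`s is a common technique in machine learning

For an ordinary string variable, the `get_dummies` method is enough to convert it to dummy variables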

In [16]:
df = pd.DataFrame({'user_id': [1, 2, 3, 4, 5], 
                   'area': ['Shanghai', 'Tokyo', 'Shanghai', 'New York', 'New York']})

In [17]:
area = pd.get_dummies(df['area'])
pd.concat((df['user_id'], area), axis=1)

Unnamed: 0,user_id,New York,Shanghai,Tokyo
0,1,0,1,0
1,2,0,0,1
2,3,0,1,0
3,4,1,0,0
4,5,1,0,0


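Sometimes, however, business requirements mean a variable is stored in a particular format, such as a string that can be split on `,` or a `json` string. String variables like these, which can be converted to a `list`, can also be turned into dummy variables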

In [18]:
df = pd.DataFrame({'user_id': [1, 2, 3, 4, 5, 6, 7, 8], 
                   'my_index': ['1,2', '1,3,5', '2,3', '1,4,5', '1,2,3', '4,5', '1,5', '2,4'],
                   'interest': ['["football", "f1", "tennis"]',
                                '["f1"]', 
                                '["football", "basketball", "snooker"]',
                                '["football", "tennis"]', 
                                '["football", "swimming"]', 
                                '["basketball", "f1"]', 
                                '["football", "tennis"]', 
                                '["football", "f1"]']}, 
                  columns=['user_id', 'my_index', 'interest'])
df

Unnamed: 0,user_id,my_index,interest
0,1,"1,2","[""football"", ""f1"", ""tennis""]"
1,2,"1,3,5","[""f1""]"
2,3,"2,3","[""football"", ""basketball"", ""snooker""]"
3,4,"1,4,5","[""football"", ""tennis""]"
4,5,"1,2,3","[""football"", ""swimming""]"
5,6,"4,5","[""basketball"", ""f1""]"
6,7,"1,5","[""football"", ""tennis""]"
7,8,"2,4","[""football"", ""f1""]"


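The conversion steps are:
- Split the string into a `list`
- Use `apply` with `pd.Series` to expand each list into columns
- Use `stack` to reshape the data into a single column
- Convert to `one-hot` form
- Sum the rows that share the same `level=0` index value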

In [19]:
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
my_index = pd.get_dummies(df['my_index'].map(lambda x: x.split(',')).apply(pd.Series).stack()).groupby(level=0).sum()
my_index

Unnamed: 0,1,2,3,4,5
0,1,1,0,0,0
1,1,0,1,0,1
2,0,1,1,0,0
3,1,0,0,1,1
4,1,1,1,0,0
5,0,0,0,1,1
6,1,0,0,0,1
7,0,1,0,1,0
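
For the simple delimiter-separated case, pandas also offers `Series.str.get_dummies`, which splits and one-hot encodes in one call. This is a minimal sketch on a hypothetical three-row frame, offered as an alternative and not as the method used above.

```python
import pandas as pd

demo = pd.DataFrame({'user_id': [1, 2, 3],
                     'my_index': ['1,2', '1,3,5', '2,3']})

# Split each string on ',' and build one indicator column per distinct token
dummies = demo['my_index'].str.get_dummies(sep=',')
print(dummies)
```

The resulting column names are the string tokens (`'1'`, `'2'`, and so on), one row per original row.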


For the `json` format, use `json.loads` to convert each string to a `list`; the remaining steps are the same as above

In [20]:
import json
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
interest = pd.get_dummies(df['interest'].map(lambda x: json.loads(x)).apply(pd.Series).stack()).groupby(level=0).sum()
interest

Unnamed: 0,basketball,f1,football,snooker,swimming,tennis
0,0,1,1,0,0,1
1,0,1,0,0,0,0
2,1,0,1,1,0,0
3,0,0,1,0,0,1
4,0,0,1,0,1,0
5,1,1,0,0,0,0
6,0,0,1,0,0,1
7,0,1,1,0,0,0
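
In pandas 0.25 and later, the `apply(pd.Series).stack()` step can likely be replaced by `Series.explode`, which turns each list element into its own row while repeating the original index. A sketch on hypothetical toy data:

```python
import json
import pandas as pd

demo = pd.DataFrame({'user_id': [1, 2, 3],
                     'interest': ['["football", "f1"]',
                                  '["f1"]',
                                  '["football", "tennis"]']})

# Parse the JSON strings into lists, then give each element its own row
exploded = demo['interest'].map(json.loads).explode()

# One-hot encode, then sum the rows that share an original index value
interest = pd.get_dummies(exploded).groupby(level=0).sum()
print(interest)
```

`explode` avoids building a wide intermediate DataFrame padded with `NaN`, which may matter when list lengths vary a lot.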


In [21]:
pd.concat((df['user_id'], my_index, interest), axis=1)

Unnamed: 0,user_id,1,2,3,4,5,basketball,f1,football,snooker,swimming,tennis
0,1,1,1,0,0,0,0,1,1,0,0,1
1,2,1,0,1,0,1,0,1,0,0,0,0
2,3,0,1,1,0,0,1,0,1,1,0,0
3,4,1,0,0,1,1,0,0,1,0,0,1
4,5,1,1,1,0,0,0,0,1,0,1,0
5,6,0,0,0,1,1,1,1,0,0,0,0
6,7,1,0,0,0,1,0,0,1,0,0,1
7,8,0,1,0,1,0,0,1,1,0,0,0
