This Jupyter Notebook walks through basic Pandas operations and covers the following topics:
* DataFrame
* File I/O
* Slice
* Function
* Sort
* Query
* Group By
* Apply

In [49]:
!pip install pandas



## DataFrame
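The DataFrame is the cornerstone of Pandas: nearly everything else in the library is built on it. You can think of a DataFrame as an Excel sheet, a two-dimensional structure with multiple rows and multiple columns. A DataFrame can be created with its constructor, although in practice data is usually loaded through Pandas File I/O.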

In [1]:
import pandas as pd

df = pd.DataFrame([['Hello', 1, 2.0], ['Hi', 3, 4.0]])

df

Unnamed: 0,0,1,2
0,Hello,1,2.0
1,Hi,3,4.0


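In the code block above we initialized the DataFrame from a nested list (a two-dimensional matrix). The argument can in fact be any array-like, such as a list of tuples or a NumPy ndarray. Also note that the last line of the cell is `df` rather than `print(df)`: Jupyter renders a DataFrame as a formatted table when it is the last expression in a cell, whereas `print()` converts the DataFrame to a plain string and bypasses that rendering.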
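As a minimal sketch of the array-like point, the same small table can be built from a list of tuples, a NumPy array, or a dict of columns. The column names `greeting`, `count`, and `value` are made up for illustration and do not come from any dataset in this notebook.

```python
import numpy as np
import pandas as pd

rows = [("Hello", 1, 2.0), ("Hi", 3, 4.0)]

# A list of tuples works the same way as a nested list.
df_tuples = pd.DataFrame(rows, columns=["greeting", "count", "value"])

# A 2-D NumPy array is also accepted; here every cell is numeric.
df_array = pd.DataFrame(np.array([[1, 2.0], [3, 4.0]]), columns=["count", "value"])

# A dict maps each column name to that column's values.
df_dict = pd.DataFrame({"greeting": ["Hello", "Hi"], "count": [1, 3], "value": [2.0, 4.0]})

# The tuple-based and dict-based tables hold the same data.
print(df_tuples.equals(df_dict))
```

Passing `columns=` gives the columns readable names instead of the default 0, 1, 2 seen in the output above.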

### Pandas File I/O
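In practice a DataFrame can be far too large to initialize this way. Such data is usually stored in one of the following formats:
* xlsx: the Microsoft Excel format, a traditional way to store data
* csv: comma-separated values, a plain-text format that Microsoft Excel can open
* json: JavaScript Object Notation, which supports nested structures and is common on the Web
* parquet: a modern columnar format maintained as an Apache project and widely used with distributed frameworks such as Spark; it compresses well and preserves column types

Pandas can read and write all of these formats, though some of that support relies on other packages from the Python community. If a file cannot be read, install the package named in the error message. CSV is the best-supported format, and all assignment datasets will be distributed as .csv files.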

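Here is a simple example using data from Kaggle's Titanic competition. We read test.csv, so first place test.csv in the same folder as this notebook. (In Jupyter Lab you should see the file in the file browser on the left.)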

In [2]:
import pandas as pd

df = pd.read_csv('test.csv')

df

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S


To save this DataFrame to another file, for example as test.parquet in Parquet format, we write:

In [3]:
df.to_parquet('test.parquet')

ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
A suitable version of pyarrow or fastparquet is required for parquet support.
Trying to import the above resulted in these errors:
 - Missing optional dependency 'pyarrow'. pyarrow is required for parquet support. Use pip or conda to install pyarrow.
 - Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet.

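The ImportError above is the situation mentioned earlier: Parquet support depends on an optional package, so install pyarrow or fastparquet as the message suggests and run the cell again.

Note that the reading functions belong to `pd`, the pandas package itself, while the writing methods belong to the DataFrame object. Make sure you keep the two apart. The API pattern is:
* pd.read_{file format}(): read_excel (for xlsx), read_csv, read_parquet, read_json
* df.to_{file format}(): to_excel (for xlsx), to_csv, to_parquet, to_json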
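The read/write split can be tried without touching the disk. The sketch below round-trips a tiny, made-up table through CSV text, with `io.StringIO` standing in for a file; the names `df_small` and `df_back` are only illustrative.

```python
import io
import pandas as pd

df_small = pd.DataFrame({"name": ["Kelly", "Wilkes"], "age": [34.5, 47.0]})

# Writing is a DataFrame method: to_csv returns a string when no path is given.
# index=False leaves the row labels out of the file.
csv_text = df_small.to_csv(index=False)

# Reading is a function on the pandas module; StringIO stands in for a file.
df_back = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(df_back.equals(df_small))
```

With a real file the calls look the same, only with a path such as `'test.csv'` in place of the in-memory buffer.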

## DataFrame APIs
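In the examples above you may have noticed that a large DataFrame is displayed with only its first 5 and last 5 rows. Pandas truncates the display so that the Jupyter output area is not flooded. Can we show only the rows we need, say the first 5 or the first 10?

The answer is the DataFrame.head() method.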

In [4]:
# head() returns the first 5 rows by default
df.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [5]:
# df.head() also accepts an argument n to return the first n rows
df.head(10)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
5,897,3,"Svensson, Mr. Johan Cervin",male,14.0,0,0,7538,9.225,,S
6,898,3,"Connolly, Miss. Kate",female,30.0,0,0,330972,7.6292,,Q
7,899,2,"Caldwell, Mr. Albert Francis",male,26.0,1,1,248738,29.0,,S
8,900,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18.0,0,0,2657,7.2292,,C
9,901,3,"Davies, Mr. John Samuel",male,21.0,2,0,A/4 48871,24.15,,S


To find out how many columns the table has and what they are called, use the DataFrame.columns attribute.

In [6]:
df.columns

Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

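Note that df.columns is a public attribute, not a method, so it takes no parentheses.

The tables above show one extra column, 0, 1, 2, 3, 4, 5, ..., which has no name and does not appear in df.columns. This is the DataFrame's index, that is, its row labels, and it is available through the public attribute df.index, for example: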

In [7]:
df.index

RangeIndex(start=0, stop=418, step=1)

To get the number of rows and columns, use the DataFrame.shape attribute, which is also not a method.

In [8]:
df.shape

(418, 11)

In [9]:
df.size

4598

The df.size attribute gives the total number of cells in the table; note that 418 x 11 = 4598.
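The attributes covered so far can be checked together on a toy table. This is only a sketch; the frame `toy` and its columns `a` and `b` are invented for the example.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# shape is an attribute holding (rows, columns), so no parentheses are used.
n_rows, n_cols = toy.shape

print(list(toy.columns))          # column labels
print(list(toy.index))            # row labels
print(n_rows, n_cols, toy.size)   # size equals rows times columns
print(len(toy))                   # len() gives the number of rows
```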

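For numeric columns we often want quick summary statistics. For Age in this table, what are the maximum, the minimum, and the mean? Use the DataFrame.describe() method, which by default summarizes every numeric column.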

In [10]:
df.describe()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare
count,418.0,418.0,332.0,418.0,418.0,417.0
mean,1100.5,2.26555,30.27259,0.447368,0.392344,35.627188
std,120.810458,0.841838,14.181209,0.89676,0.981429,55.907576
min,892.0,1.0,0.17,0.0,0.0,0.0
25%,996.25,1.0,21.0,0.0,0.0,7.8958
50%,1100.5,3.0,27.0,0.0,0.0,14.4542
75%,1204.75,3.0,39.0,1.0,0.0,31.5
max,1309.0,3.0,76.0,8.0,9.0,512.3292


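## Pandas Slicing
How do we work with one particular column or row, or just a small part of a DataFrame? For this we generally use Pandas slicing.

Basic slicing is simple: put square brackets after df with the name of the column you want, for example: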

In [12]:
df['Age']

0      34.5
1      47.0
2      62.0
3      27.0
4      22.0
       ... 
413     NaN
414    39.0
415    38.5
416     NaN
417     NaN
Name: Age, Length: 418, dtype: float64

To select multiple columns, pass a list containing all the column names you want.

In [13]:
df[['Age', 'Fare']]

Unnamed: 0,Age,Fare
0,34.5,7.8292
1,47.0,7.0000
2,62.0,9.6875
3,27.0,8.6625
4,22.0,12.2875
...,...,...
413,,8.0500
414,39.0,108.9000
415,38.5,7.2500
416,,8.0500


But what if I only want certain rows, say rows 1 to 5?

In [18]:
df[1:6]

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
5,897,3,"Svensson, Mr. Johan Cervin",male,14.0,0,0,7538,9.225,,S


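Experienced Python users will know that 1:6 means starting at position 1 and stopping before position 6, so 6 itself is excluded. To select rows 1 to 5 we therefore need 1:6.

For a mixed slice, rows 1 to 5 together with the 'Age' and 'Fare' columns, use the DataFrame.loc indexer.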

In [17]:
df.loc[1:5, ['Age', 'Fare']]

Unnamed: 0,Age,Fare
1,47.0,7.0
2,62.0,9.6875
3,27.0,8.6625
4,22.0,12.2875
5,14.0,9.225


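Inside loc, however, 1:5 really does mean rows 1 to 5. The reason is that loc selects by label rather than by position, and label slices include both endpoints. It is a confusing inconsistency at first 🤷‍♂️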
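The difference between the slicing styles is easier to see side by side. The sketch below uses a made-up five-row frame; `iloc`, the position-based counterpart of `loc`, is not used elsewhere in this notebook and is shown here only for contrast.

```python
import pandas as pd

toy = pd.DataFrame({"Age": [10, 20, 30, 40, 50]})

loc_rows = toy.loc[1:3]     # labels 1, 2 and 3: the end label is included
iloc_rows = toy.iloc[1:3]   # positions 1 and 2: the end position is excluded
plain_rows = toy[1:3]       # plain [] with an integer slice is also positional

# With string labels it is clearer that loc never counts positions.
relabeled = toy.set_index(pd.Index(list("abcde")))
by_label = relabeled.loc["b":"d", "Age"]

print(len(loc_rows), len(iloc_rows), len(plain_rows))
print(list(by_label))
```

When the index is the default 0, 1, 2, ... the labels happen to equal the positions, which is why the two conventions are easy to mix up.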

## Aggregate Function
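Aggregate functions are a family of functions commonly used to compute statistics over a column. Common ones include mean(), sum(), std(), max(), and min(). Each operates on an entire column and reduces it to a single value; missing values (NaN) are skipped by default.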

In [22]:
df['Age'].mean()

30.272590361445783

In [23]:
df['Age'].max()

76.0

In [24]:
df['Age'].min()

0.17

Poor baby 🥺

In [25]:
df['Age'].std()

14.181209235624422

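### Other Common Functions
Pandas also provides some fun and practical functions, such as cumsum() and cumprod().

These functions return a new column (a Series), which you can then add to the original DataFrame.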

In [27]:
df['Age_cum'] = df['Age'].cumsum()
df.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_cum
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,34.5
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,81.5
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,143.5
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,170.5
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,192.5


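cumsum(), short for cumulative sum, keeps a running total of a column: Age_cum(t) = Age_cum(t-1) + Age(t). In finance it can be used to accumulate arithmetic returns.

cumprod() computes a running product instead: Age_cum_prod(t) = Age_cum_prod(t-1) x Age(t).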

In [28]:
df['Age_cum_prod'] = df['Age'].cumprod()
df.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_cum,Age_cum_prod
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,34.5,34.5
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,81.5,1621.5
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,143.5,100533.0
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,170.5,2714391.0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,192.5,59716602.0
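The two recurrences can be verified by hand on a very short series. This sketch uses invented values, including one missing value, to show how the running totals behave around NaN under the default settings.

```python
import pandas as pd

s = pd.Series([2.0, 3.0, None, 4.0])

# Each entry is the previous running value combined with the current entry.
running_sum = s.cumsum()
running_prod = s.cumprod()

# By default a missing entry stays NaN in the result,
# and the running value carries on from the last valid entry.
print(running_sum.tolist())
print(running_prod.tolist())
```

A running product grows very quickly, so on a long column it can exceed what a float can hold and turn into `inf`, as some later outputs in this notebook show.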


Pandas also supports standard arithmetic operators directly between columns, such as + - * /.

In [43]:
df['temp'] = df['Age'] + df['Fare']
df.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_cum,Age_cum_prod,temp
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,34.5,34.5,42.3292
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,81.5,1621.5,54.0
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,143.5,100533.0,71.6875
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,170.5,2714391.0,35.6625
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,192.5,59716602.0,34.2875


In [44]:
df['temp'] = df['Age'] / df['Fare']
df.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_cum,Age_cum_prod,temp
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,34.5,34.5,4.40658
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,81.5,1621.5,6.714286
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,143.5,100533.0,6.4
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,170.5,2714391.0,3.116883
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,192.5,59716602.0,1.790437


## Sort
We often want to arrange a DataFrame in some order, that is, to sort it, either by df.index or by the values of a column. Pandas provides two APIs for this: DataFrame.sort_values() and DataFrame.sort_index().

In [55]:
df.sort_values('Age', ascending=True)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_cum,Age_cum_prod,temp,age_now,age_now_apply,age_log,age_exp
354,1246,3,"Dean, Miss. Elizabeth Gladys ""Millvina""",female,0.17,1,2,C.A. 2315,20.5750,,S,8475.50,inf,0.008262,109.17,109.17,-1.771957,1.185305
201,1093,3,"Danbom, Master. Gilbert Sigvard Emanuel",male,0.33,0,2,347080,14.4000,,S,4886.33,1.241582e+228,0.022917,109.33,109.33,-1.108663,1.390968
281,1173,3,"Peacock, Master. Alfred Edward",male,0.75,1,1,SOTON/O.Q. 3101315,13.7750,,S,6712.50,inf,0.054446,109.75,109.75,-0.287682,2.117000
307,1199,3,"Aks, Master. Philip Frank",male,0.83,0,1,392091,9.3500,,S,7182.33,inf,0.088770,109.83,109.83,-0.186330,2.293319
250,1142,2,"West, Miss. Barbara J",female,0.92,1,2,C.A. 34651,27.7500,,S,6170.75,3.724109e+284,0.033153,109.92,109.92,-0.083382,2.509290
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
408,1300,3,"Riordan, Miss. Johanna ""Hannah""",female,,0,0,334915,7.7208,,Q,,,,,,,
410,1302,3,"Naughton, Miss. Hannah",female,,0,0,365237,7.7500,,Q,,,,,,,
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S,,,,,,,
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S,,,,,,,


sort_values can also take several columns, for example 'Age' and 'Fare'.

In [56]:
df.sort_values(['Age', 'Fare'], ascending=False)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_cum,Age_cum_prod,temp,age_now,age_now_apply,age_log,age_exp
96,988,1,"Cavendish, Mrs. Tyrell William (Julia Florence...",female,76.0,1,0,19877,78.8500,C46,S,2509.50,2.589715e+114,0.963855,185.0,185.0,4.330733,1.014800e+33
81,973,1,"Straus, Mr. Isidor",male,67.0,1,0,PC 17483,221.7792,C55 C57,S,2238.50,1.927124e+102,0.302102,176.0,176.0,4.204693,1.252363e+29
179,1071,1,"Compton, Mrs. Alexander Taylor (Mary Eliza Ing...",female,64.0,0,2,PC 17756,83.1583,E45,C,4461.50,7.806418e+206,0.769617,173.0,173.0,4.158883,6.235149e+27
236,1128,1,"Warren, Mr. Frank Manley",male,64.0,1,0,110813,75.2500,D37,C,5804.83,1.914473e+269,0.850498,173.0,173.0,4.158883,6.235149e+27
305,1197,1,"Crosby, Mrs. Edward Gifford (Catherine Elizabe...",female,64.0,1,1,112901,26.5500,B26,S,7151.50,inf,2.410546,173.0,173.0,4.158883,6.235149e+27
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
211,1103,3,"Finoli, Mr. Luigi",male,,0,0,SOTON/O.Q. 3101308,7.0500,,S,,,,,,,
163,1055,3,"Pearce, Mr. Ernest",male,,0,0,343271,7.0000,,S,,,,,,,
116,1008,3,"Thomas, Mr. John",male,,0,0,2681,6.4375,,C,,,,,,,
133,1025,3,"Thomas, Mr. Charles P",male,,1,0,2621,6.4375,,C,,,,,,,


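Notice that after setting ascending to False the order becomes descending. This parameter switches between ascending and descending order.

The order of 'Age' and 'Fare' also matters: ['Age', 'Fare'] sorts by Age first and then uses Fare to order the rows whose Age is equal.

Notice that the index is now out of order. We can use sort_index() to restore the original order.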

In [59]:
df.sort_index()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_cum,Age_cum_prod,temp,age_now,age_now_apply,age_log,age_exp
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,34.5,34.5,4.406580,143.5,143.5,3.540959,9.619658e+14
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S,81.5,1621.5,6.714286,156.0,156.0,3.850148,2.581313e+20
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,143.5,100533.0,6.400000,171.0,171.0,4.127134,8.438357e+26
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,170.5,2714391.0,3.116883,136.0,136.0,3.295837,5.320482e+11
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,192.5,59716602.0,1.790437,131.0,131.0,3.091042,3.584913e+09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S,,,,,,,
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C,10012.0,inf,0.358127,148.0,148.0,3.663562,8.659340e+16
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S,10050.5,inf,5.310345,147.5,147.5,3.650658,5.252155e+16
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S,,,,,,,
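A small made-up table makes the tie-breaking rule visible. The sketch also passes a list to `ascending` so that each sort key gets its own direction, and uses `reset_index(drop=True)`, a method not covered above, as an alternative to `sort_index()` when the new order should be kept.

```python
import pandas as pd

toy = pd.DataFrame({"Age": [30, 20, 30, 20], "Fare": [5.0, 9.0, 7.0, 1.0]})

# Age ascending; among rows with equal Age, Fare descending.
by_both = toy.sort_values(["Age", "Fare"], ascending=[True, False])
print(list(by_both.index))

# sort_index() puts the rows back in their original order.
restored = by_both.sort_index()

# reset_index(drop=True) keeps the sorted order and relabels rows 0..n-1.
renumbered = by_both.reset_index(drop=True)
print(list(renumbered.index))
```

Both sort methods return a new DataFrame and leave the original untouched, so assign the result to a name if you want to keep it.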


## Query
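Sometimes we want to select rows not by position or name but by a condition, for example the passengers aged 60 or over. For this we use Pandas query operations, which come in two forms, though both can be a little finicky :(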

In [29]:
temp = df[df['Age']>=60]
temp

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_cum,Age_cum_prod
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,143.5,100533.0
13,905,2,"Howard, Mr. Benjamin",male,63.0,1,0,24065,26.0,,S,433.5,1.642993e+19
48,940,1,"Bucknell, Mrs. William Robert (Emma Eliza Ward)",female,60.0,0,0,11813,76.2917,D15,C,1387.5,1.408856e+61
69,961,1,"Fortune, Mrs. Mark (Mary McDougald)",female,60.0,1,4,19950,263.0,C23 C25 C27,S,1886.5,2.554839e+86
81,973,1,"Straus, Mr. Isidor",male,67.0,1,0,PC 17483,221.7792,C55 C57,S,2238.5,1.9271240000000002e+102
96,988,1,"Cavendish, Mrs. Tyrell William (Julia Florence...",female,76.0,1,0,19877,78.85,C46,S,2509.5,2.589715e+114
114,1006,1,"Straus, Mrs. Isidor (Rosalie Ida Blun)",female,63.0,1,0,PC 17483,221.7792,C55 C57,S,2929.5,4.4798880000000004e+134
142,1034,1,"Ryerson, Mr. Arthur Larned",male,61.0,1,3,PC 17608,262.375,B57 B59 B63 B66,C,3565.0,8.335583e+164
152,1044,3,"Storey, Mr. Thomas",male,60.5,0,0,3701,,,S,3801.5,2.7908219999999997e+175
179,1071,1,"Compton, Mrs. Alexander Taylor (Mary Eliza Ing...",female,64.0,0,2,PC 17756,83.1583,E45,C,4461.5,7.806418e+206


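The first form passes a boolean condition, such as `df['Age']>=60`, inside the ordinary square brackets. The condition is evaluated for every row: rows where it is True are selected and rows where it is False are left out. Many other conditions work the same way; for example, we can select the rows whose Sex is male.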

In [31]:
temp = df[df['Sex']=='male']
temp

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_cum,Age_cum_prod
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,34.5,3.450000e+01
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,143.5,1.005330e+05
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,170.5,2.714391e+06
5,897,3,"Svensson, Mr. Johan Cervin",male,14.0,0,0,7538,9.2250,,S,206.5,8.360324e+08
7,899,2,"Caldwell, Mr. Albert Francis",male,26.0,1,1,248738,29.0000,,S,262.5,6.521053e+11
...,...,...,...,...,...,...,...,...,...,...,...,...,...
407,1299,1,"Widener, Mr. George Dunton",male,50.0,1,1,113503,211.5000,C80,C,9905.0,inf
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S,,
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S,10050.5,inf
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S,,


The other form is the DataFrame.query() API. Both of the previous selections can be written this way:

In [32]:
temp = df.query("Age>=60")
temp

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_cum,Age_cum_prod
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,143.5,100533.0
13,905,2,"Howard, Mr. Benjamin",male,63.0,1,0,24065,26.0,,S,433.5,1.642993e+19
48,940,1,"Bucknell, Mrs. William Robert (Emma Eliza Ward)",female,60.0,0,0,11813,76.2917,D15,C,1387.5,1.408856e+61
69,961,1,"Fortune, Mrs. Mark (Mary McDougald)",female,60.0,1,4,19950,263.0,C23 C25 C27,S,1886.5,2.554839e+86
81,973,1,"Straus, Mr. Isidor",male,67.0,1,0,PC 17483,221.7792,C55 C57,S,2238.5,1.9271240000000002e+102
96,988,1,"Cavendish, Mrs. Tyrell William (Julia Florence...",female,76.0,1,0,19877,78.85,C46,S,2509.5,2.589715e+114
114,1006,1,"Straus, Mrs. Isidor (Rosalie Ida Blun)",female,63.0,1,0,PC 17483,221.7792,C55 C57,S,2929.5,4.4798880000000004e+134
142,1034,1,"Ryerson, Mr. Arthur Larned",male,61.0,1,3,PC 17608,262.375,B57 B59 B63 B66,C,3565.0,8.335583e+164
152,1044,3,"Storey, Mr. Thomas",male,60.5,0,0,3701,,,S,3801.5,2.7908219999999997e+175
179,1071,1,"Compton, Mrs. Alexander Taylor (Mary Eliza Ing...",female,64.0,0,2,PC 17756,83.1583,E45,C,4461.5,7.806418e+206


In [34]:
temp = df.query("Sex=='male'")
temp

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_cum,Age_cum_prod
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,34.5,3.450000e+01
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,143.5,1.005330e+05
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,170.5,2.714391e+06
5,897,3,"Svensson, Mr. Johan Cervin",male,14.0,0,0,7538,9.2250,,S,206.5,8.360324e+08
7,899,2,"Caldwell, Mr. Albert Francis",male,26.0,1,1,248738,29.0000,,S,262.5,6.521053e+11
...,...,...,...,...,...,...,...,...,...,...,...,...,...
407,1299,1,"Widener, Mr. George Dunton",male,50.0,1,1,113503,211.5000,C80,C,9905.0,inf
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S,,
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S,10050.5,inf
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S,,


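One thing to watch when using query(): wrap the whole expression in double quotes rather than single quotes, so that string literals inside it can use single quotes without ending the expression early. This is an easy trap to fall into.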
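Both forms also accept combined conditions. The sketch below filters an invented four-row table for women aged 60 or over, once with a boolean mask and once with `query()`; the variable `min_age` is a made-up name used to show how `@` pulls a Python variable into the query string.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Age": [62.0, 60.0, 27.0, 76.0],
})

# Boolean masks: wrap each condition in parentheses and join with & (and) or | (or).
mask_result = toy[(toy["Sex"] == "female") & (toy["Age"] >= 60)]

# The same filter as a query string: double quotes outside, single quotes inside.
min_age = 60
query_result = toy.query("Sex == 'female' and Age >= @min_age")

print(mask_result.equals(query_result))
```

The parentheses in the mask version matter, because `&` binds more tightly than `==` and `>=` in Python.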

## Group By
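Now suppose I want the mean age of men and of women separately. You may have thought of selecting twice, with `df.query("Sex=='male'")['Age'].mean()` and `df.query("Sex=='female'")['Age'].mean()`. But when there are many possible values, for example counting the passengers at each specific age, writing one query per value is impractical. Instead we use the DataFrame.groupby() API. Group by means literally that: split the rows into groups according to the values of a column, with equal values going into the same group. For example: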

In [36]:
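# Mean of each numeric column (including Age) for male and female passengers
# numeric_only=True skips text columns such as Name, which newer pandas versions refuse to average

df.groupby('Sex').mean(numeric_only=True)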

Sex,PassengerId,Pclass,Age,SibSp,Parch,Fare,Age_cum,Age_cum_prod
female,1096.789474,2.144737,30.272362,0.565789,0.598684,49.747699,5011.782047,inf
male,1102.620301,2.334586,30.272732,0.379699,0.274436,27.527877,5133.660537,inf


In [37]:
# Number of passengers at each age

df.groupby('Age').apply(len)

Age
0.17     1
0.33     1
0.75     1
0.83     1
0.92     1
        ..
62.00    1
63.00    2
64.00    3
67.00    1
76.00    1
Length: 79, dtype: int64

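This gives us the result for each distinct value. groupby() actually returns a GroupBy object that holds one sub-DataFrame per group; for more of its API see https://pandas.pydata.org/docs/reference/groupby.html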
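A few common group-by patterns, sketched on an invented four-row table: selecting one column before aggregating, counting rows per group with `size()`, and computing several named statistics at once with `agg()`. The output names `mean_age` and `max_fare` are made up for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Age": [20.0, 30.0, 40.0, 50.0],
    "Fare": [7.0, 8.0, 9.0, 10.0],
})

# Pick the column first, then aggregate: one number per group.
mean_age = toy.groupby("Sex")["Age"].mean()

# size() counts the rows in each group.
group_sizes = toy.groupby("Sex").size()

# agg() with keyword arguments: new_name=(column, function).
summary = toy.groupby("Sex").agg(mean_age=("Age", "mean"), max_fare=("Fare", "max"))

print(mean_age)
print(group_sizes)
print(summary)
```

The group keys ('female', 'male') become the index of each result, sorted alphabetically by default.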

## Apply
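You may have noticed that the previous example used a method we have not covered, .apply(), and passed it a strange argument, len. You probably know len() as the function that returns the length of a sequence, so why can it be passed as an argument to .apply()?

In fact, in modern programming languages functions are objects too, and they can be passed around as arguments. This idea comes from functional programming, though you are not required to know it here.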
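Passing a function as an argument needs nothing from Pandas. The sketch below defines a hypothetical helper, `apply_to_each`, that mimics in plain Python what an apply-style method does; both it and `add_one` are invented for illustration.

```python
def apply_to_each(func, values):
    """Call func on every element of values and collect the results."""
    return [func(v) for v in values]


def add_one(x):
    return x + 1


# The function is passed by name, without parentheses.
lengths = apply_to_each(len, ["Hello", "Hi"])
shifted = apply_to_each(add_one, [1, 2])
doubled = apply_to_each(lambda x: x * 2, [1, 2])

print(lengths, shifted, doubled)
```

Inside `apply_to_each` the name `func` simply refers to whichever function was handed in, and it is only called when the helper writes `func(v)`.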

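How should we read .apply()? You can think of it as a loop over the whole DataFrame, in other words a for loop.

Now imagine this scenario: we want to work out how old these passengers would be in 2021. How do we do that? (109 years have passed since the Titanic sank in 1912.)

The simplest way is: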

In [45]:
df['age_now'] = df['Age'] + 109
df.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_cum,Age_cum_prod,temp,age_now
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,34.5,34.5,4.40658,143.5
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,81.5,1621.5,6.714286,156.0
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,143.5,100533.0,6.4,171.0
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,170.5,2714391.0,3.116883,136.0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,192.5,59716602.0,1.790437,131.0


If we loop over the whole DataFrame instead:

In [46]:
for i in range(len(df)):
    df['age_now'].iloc[i] = df['Age'].iloc[i] + 109
    
df.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_cum,Age_cum_prod,temp,age_now
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,34.5,34.5,4.40658,143.5
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,81.5,1621.5,6.714286,156.0
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,143.5,100533.0,6.4,171.0
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,170.5,2714391.0,3.116883,136.0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,192.5,59716602.0,1.790437,131.0


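Looping over a DataFrame often triggers warnings like this and is also very slow, so avoid it whenever you can!
As an alternative, the .apply() method is strongly recommended.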

In [47]:
df['age_now_apply'] = df['Age'].apply(lambda x: x + 109)

df.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_cum,Age_cum_prod,temp,age_now,age_now_apply
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,34.5,34.5,4.40658,143.5,143.5
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,81.5,1621.5,6.714286,156.0,156.0
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,143.5,100533.0,6.4,171.0,171.0
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,170.5,2714391.0,3.116883,136.0,136.0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,192.5,59716602.0,1.790437,131.0,131.0


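.apply() takes a function and calls it on every element of the column (or, for a DataFrame, on every row or column), which gives the same effect as a loop.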
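To compare the approaches, the sketch below computes the same "age plus 109" column three ways on a tiny invented frame. The loop version writes each cell with a single `.loc[row, column]` call, which is one way to avoid the chained-indexing warning shown earlier; the column name `looped` is made up.

```python
import pandas as pd

toy = pd.DataFrame({"Age": [34.5, 47.0, 62.0]})

# Whole-column arithmetic: no explicit loop at all.
vectorized = toy["Age"] + 109

# apply: the lambda is called once per value.
applied = toy["Age"].apply(lambda x: x + 109)

# If a loop is unavoidable, assign through one .loc call per cell
# instead of chaining toy["col"].iloc[i] = ...
toy["looped"] = 0.0
for i in toy.index:
    toy.loc[i, "looped"] = toy.loc[i, "Age"] + 109

print(vectorized.tolist())
print(applied.tolist())
print(toy["looped"].tolist())
```

All three produce the same numbers; for simple arithmetic like this the whole-column form is usually the shortest to write.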

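You may have noticed the `lambda` in the call above. It is a special expression that creates a short function with no name. Because the function is anonymous it cannot be referred to again elsewhere, so it does not clutter the surrounding namespace, and a Python lambda writes its result directly without the return keyword.

So the code above is equivalent to: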

In [48]:
def add_109(x):
    return x + 109

df['age_now_apply'] = df['Age'].apply(add_109)

df.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_cum,Age_cum_prod,temp,age_now,age_now_apply
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,34.5,34.5,4.40658,143.5,143.5
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,81.5,1621.5,6.714286,156.0,156.0
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,143.5,100533.0,6.4,171.0,171.0
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,170.5,2714391.0,3.116883,136.0,136.0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,192.5,59716602.0,1.790437,131.0,131.0


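.apply() can also handle other very useful operations, such as taking logarithms or exponentials, and many more.

NumPy already has good implementations of the logarithm and exponential functions, so there is no need to write your own.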

In [50]:
!pip install numpy



In [51]:
import numpy as np

df['age_log'] = df['Age'].apply(np.log)

df.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_cum,Age_cum_prod,temp,age_now,age_now_apply,age_log
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,34.5,34.5,4.40658,143.5,143.5,3.540959
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,81.5,1621.5,6.714286,156.0,156.0,3.850148
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,143.5,100533.0,6.4,171.0,171.0,4.127134
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,170.5,2714391.0,3.116883,136.0,136.0,3.295837
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,192.5,59716602.0,1.790437,131.0,131.0,3.091042


In [52]:
df['age_exp'] = df['Age'].apply(np.exp)

df.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_cum,Age_cum_prod,temp,age_now,age_now_apply,age_log,age_exp
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,34.5,34.5,4.40658,143.5,143.5,3.540959,961965800000000.0
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,81.5,1621.5,6.714286,156.0,156.0,3.850148,2.581313e+20
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,143.5,100533.0,6.4,171.0,171.0,4.127134,8.438357e+26
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,170.5,2714391.0,3.116883,136.0,136.0,3.295837,532048200000.0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,192.5,59716602.0,1.790437,131.0,131.0,3.091042,3584913000.0


When passing a function as an argument, never add the parentheses () after its name.
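The sketch below shows what goes wrong with the parentheses, using a short invented series. It also shows that NumPy functions such as `np.log` can be called on a whole Series directly, which gives the same values as going through `.apply()`.

```python
import numpy as np
import pandas as pd

ages = pd.Series([1.0, np.e, 10.0])

via_apply = ages.apply(np.log)   # np.log is handed over as an object
direct = np.log(ages)            # NumPy functions also accept a whole Series

try:
    # np.log() is evaluated first, with no argument, before apply ever runs.
    ages.apply(np.log())
except TypeError as err:
    print("TypeError:", err)

print(np.allclose(via_apply, direct))
```

Writing `np.log()` asks Python to call the function immediately, so the error comes from NumPy itself rather than from Pandas.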

Congratulations! By this point you have covered most everyday Pandas operations. Keep practicing!