# Overview
## PySpark vs Pandas vs Numpy
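A pandas DataFrame is a two-dimensional labeled table. Its columns are stored in NumPy arrays, and because NumPy's core is written in C, vectorized operations on them run fast.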
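To make the NumPy connection concrete, here is a minimal sketch. The column names `x` and `y` are made up for illustration.

```python
import numpy as np
import pandas as pd

# A small all-float DataFrame; the column names are illustrative only.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]})

# to_numpy() exposes the values as a plain NumPy array.
arr = df.to_numpy()
print(type(arr), arr.shape)

# Column arithmetic is vectorized: no explicit Python loop is needed.
total = df["x"] * 2 + df["y"]
print(total.tolist())
```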

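Spark and pandas can both be combined with SQL, but they support different SQL dialects. To work consistently across the two, convert the data from one DataFrame type to the other.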

To convert a PySpark DataFrame to a pandas DataFrame, call its `toPandas()` method.

To convert a pandas DataFrame `pdf` to a PySpark DataFrame, use `sqlContext.createDataFrame(pdf)` (or `spark.createDataFrame(pdf)` when working with a `SparkSession`).
## See also
- [pandas documentation](https://pandas.pydata.org/docs/getting_started/index.html)

## Common pandas operations

### Importing libraries

In [2]:
import pandas as pd
import numpy as np

### Creating data
pandas can create single-column data (a Series) and multi-column data (a DataFrame).

In [7]:
ingredients = pd.Series(['4 cups', '1 cup', '2 large', '1 can'], ['Flour', 'Milk', 'Eggs', 'Spam'], name='Dinner')
ingredients

Flour     4 cups
Milk       1 cup
Eggs     2 large
Spam       1 can
Name: Dinner, dtype: object

In [4]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

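Create a DataFrame by passing a NumPy array, using `date_range` to build the index.

Note that the returned index carries a `freq` attribute, which is the frequency of the time series. `D` means calendar-day frequency. Common aliases include:
- B business day frequency
- C custom business day frequency
- D calendar day frequency
- W weekly frequency
- M month-end frequency
- SM semi-month-end frequency
- BM business month-end frequency

Recent pandas releases (2.2 and later) rename the month-end aliases to `ME`, `SME`, and `BME`.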

In [9]:
dates = pd.date_range("20130101", periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
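The frequency aliases listed above are passed to `date_range` through the `freq` argument. A small sketch using `B` and `W` only, since those two aliases behave the same across pandas versions:

```python
import pandas as pd

# Business-day frequency skips the weekend (2013-01-05 and 2013-01-06).
business_days = pd.date_range("2013-01-01", periods=5, freq="B")
print(business_days.strftime("%Y-%m-%d").tolist())

# Weekly frequency is anchored on Sunday by default.
weekly = pd.date_range("2013-01-01", periods=3, freq="W")
print(weekly.strftime("%Y-%m-%d").tolist())
```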

In [10]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list("ABCD"))
df

Unnamed: 0,A,B,C,D
2013-01-01,0.384673,-2.031751,-0.232746,-1.503841
2013-01-02,-0.110752,0.600712,-0.656642,1.680755
2013-01-03,0.790116,-1.594041,-1.526773,-0.867965
2013-01-04,-0.900156,-0.345626,-0.101756,0.090274
2013-01-05,0.054581,1.545837,-0.440527,0.590728
2013-01-06,1.134316,-0.605631,-0.810375,1.010176


Inspect the shape of the data (number of rows, number of columns)

In [11]:
df.shape

(6, 4)

In [7]:
a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])
df

Unnamed: 0,one,two,three
0,a,1.2,4.2
1,b,70.0,0.03
2,x,5.0,0.0


Create a DataFrame by passing a dict

In [8]:
people = pd.DataFrame({'Name':['a','b'],'Age':[18, 22]}, index=[0,1])
people

Unnamed: 0,Name,Age
0,a,18
1,b,22


In [9]:
loc = pd.DataFrame({'Location':['四川1','重庆']}, index=[0,1])
p2 = people.join(loc)
p2

Unnamed: 0,Name,Age,Location
0,a,18,四川1
1,b,22,重庆


In [10]:
melbourne_file_path = './data/melb_data.csv'
mel_df = pd.read_csv(melbourne_file_path)

In [11]:
mel_df.describe()

Unnamed: 0,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
count,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,65.0,63.0,99.0,99.0,99.0
mean,2.707071,1074929.0,7.114141,3086.383838,2.646465,1.494949,1.222222,306.616162,119.753846,1954.301587,-37.779628,144.939786,3638.0
std,0.860144,519046.1,5.293142,63.369939,0.860863,0.612351,0.985151,471.33795,49.113081,48.638441,0.049538,0.053325,310.013199
min,1.0,300000.0,2.5,3042.0,1.0,1.0,0.0,0.0,18.0,1880.0,-37.8481,144.8679,3280.0
25%,2.0,717500.0,2.5,3042.0,2.0,1.0,1.0,128.5,85.0,1900.0,-37.8088,144.87965,3464.0
50%,3.0,941000.0,3.3,3067.0,3.0,1.0,1.0,177.0,113.0,1965.0,-37.8016,144.9523,3464.0
75%,3.0,1326250.0,13.5,3067.0,3.0,2.0,2.0,298.0,142.0,2001.5,-37.72365,144.9957,4019.0
max,6.0,2850000.0,13.5,3206.0,6.0,3.0,6.0,4290.0,309.0,2016.0,-37.7164,145.0067,4019.0


In [12]:
mel_df.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,...,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


In [13]:
df2 = pd.DataFrame({
    "A": 1.0,
    "B": pd.Timestamp("20130102"),
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3]*4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",
})
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


In [14]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

In [15]:
df2.index

Int64Index([0, 1, 2, 3], dtype='int64')

In [16]:
df.describe()

Unnamed: 0,one,two,three
count,3,3.0,3.0
unique,3,3.0,3.0
top,a,1.2,4.2
freq,1,1.0,1.0


Transpose the data, swapping rows and columns

In [17]:
df.T

Unnamed: 0,0,1,2
one,a,b,x
two,1.2,70,5
three,4.2,0.03,0


Sort the columns by label

In [18]:
df.sort_index(axis=1, ascending=False)

Unnamed: 0,two,three,one
0,1.2,4.2,a
1,70.0,0.03,b
2,5.0,0.0,x


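Sort by values

In [19]:
df.sort_values("B")

KeyError: 'B'

The call fails because `df` was reassigned in an earlier cell and now has the columns `one`, `two`, and `three`. Pass an existing column name instead, for example `df.sort_values("two")`.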

## Selecting data

Selecting a single column returns a Series

In [None]:
df.A

2013-01-01   -0.957286
2013-01-02   -0.155476
2013-01-03   -1.169260
2013-01-04    0.192617
2013-01-05    0.735926
2013-01-06    0.058342
Freq: D, Name: A, dtype: float64

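### Matching with regular expressions
Use `str.contains` to keep the rows that match a pattern; put `~` in front of the mask to keep the rows that do not match.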

In [25]:
df[df.one.str.contains('^a',na=False)]

Unnamed: 0,one,two,three
0,a,1.2,4.2


In [24]:
df[~df.one.str.contains('^a',na=False)]

Unnamed: 0,one,two,three
1,b,70,0.03
2,x,5,0.0
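The same pattern extends to case-insensitive matching and to columns with missing values. A minimal sketch with made-up data:

```python
import pandas as pd

# Illustrative data, including one missing value.
df = pd.DataFrame({"one": ["apple", "Avocado", "banana", None]})

# case=False ignores letter case; na=False treats missing values as non-matches.
mask = df["one"].str.contains("^a", case=False, na=False)

starts_with_a = df[mask]
others = df[~mask]  # the row with the missing value lands here

print(starts_with_a["one"].tolist())
print(len(others))
```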


### Selecting rows by slice

In [76]:
df[0:3]

Unnamed: 0,A,B,C,D
2013-01-01,-0.957286,-2.079644,0.071533,1.184266
2013-01-02,-0.155476,-1.682659,-0.199683,-1.236227
2013-01-03,-1.16926,-0.403638,-0.012646,0.022892


In [77]:
df["20130102":"20130104"]

Unnamed: 0,A,B,C,D
2013-01-02,-0.155476,-1.682659,-0.199683,-1.236227
2013-01-03,-1.16926,-0.403638,-0.012646,0.022892
2013-01-04,0.192617,0.39655,-0.11204,-0.224726


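### Selecting by label
`DataFrame.loc` selects data by label.
It accepts the following three kinds of input:
* a single label, e.g. 5
* a list of labels, e.g. ['a','b','c']
* a slice object, e.g. 'a':'f' (both endpoints are included)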

In [80]:
df.loc[dates[0]]

A   -0.957286
B   -2.079644
C    0.071533
D    1.184266
Name: 2013-01-01 00:00:00, dtype: float64

In [8]:
df.loc[['20130103','20130104']]

Unnamed: 0,A,B,C,D
2013-01-03,-0.346333,-0.424171,0.353539,-0.061492
2013-01-04,-1.585203,-0.685209,0.076646,-0.893802


Select multiple columns

In [81]:
df.loc[:,["A","B"]]

Unnamed: 0,A,B
2013-01-01,-0.957286,-2.079644
2013-01-02,-0.155476,-1.682659
2013-01-03,-1.16926,-0.403638
2013-01-04,0.192617,0.39655
2013-01-05,0.735926,-1.158651
2013-01-06,0.058342,-0.436147


In [82]:
df.loc["20130102":"20130104", ["A","B"]]

Unnamed: 0,A,B
2013-01-02,-0.155476,-1.682659
2013-01-03,-1.16926,-0.403638
2013-01-04,0.192617,0.39655


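### Selecting by position
`df.iloc` selects data by integer row position and column position. Both arguments must be positions; you cannot mix a position with a label.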

In [83]:
df.iloc[3]

A    0.192617
B    0.396550
C   -0.112040
D   -0.224726
Name: 2013-01-04 00:00:00, dtype: float64

In [84]:
df.iloc[3:5, 0:2]

Unnamed: 0,A,B
2013-01-04,0.192617,0.39655
2013-01-05,0.735926,-1.158651


In [85]:
df.iloc[[1,2,4],[0,2]]

Unnamed: 0,A,C
2013-01-02,-0.155476,-0.199683
2013-01-03,-1.16926,-0.012646
2013-01-05,0.735926,2.095934


In [86]:
df.iloc[1:3, :]

Unnamed: 0,A,B,C,D
2013-01-02,-0.155476,-1.682659,-0.199683,-1.236227
2013-01-03,-1.16926,-0.403638,-0.012646,0.022892


In [87]:
df.iloc[:, 1:3]

Unnamed: 0,B,C
2013-01-01,-2.079644,0.071533
2013-01-02,-1.682659,-0.199683
2013-01-03,-0.403638,-0.012646
2013-01-04,0.39655,-0.11204
2013-01-05,-1.158651,2.095934
2013-01-06,-0.436147,-0.013942


In [88]:
df.iloc[1,1]

-1.6826593042512668

In [89]:
df.iat[1,1]

-1.6826593042512668
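One difference between `loc` and `iloc` is easy to miss: a label slice includes its end point, while a position slice excludes it. A small sketch on a frame filled with `np.arange` values (chosen only so the result is deterministic):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Label slice: both endpoints are included, so this yields 3 rows and 2 columns.
by_label = df.loc["20130102":"20130104", "A":"B"]

# Position slice: the stop position is excluded, so this yields 2 rows and 1 column.
by_position = df.iloc[1:3, 0:1]

print(by_label.shape, by_position.shape)
```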

## Boolean indexing

Select rows using the values of a single column

In [90]:
df[df["A"]>0]

Unnamed: 0,A,B,C,D
2013-01-04,0.192617,0.39655,-0.11204,-0.224726
2013-01-05,0.735926,-1.158651,2.095934,-0.883915
2013-01-06,0.058342,-0.436147,-0.013942,-1.017744


In [91]:
df[df>0]

Unnamed: 0,A,B,C,D
2013-01-01,,,0.071533,1.184266
2013-01-02,,,,
2013-01-03,,,,0.022892
2013-01-04,0.192617,0.39655,,
2013-01-05,0.735926,,2.095934,
2013-01-06,0.058342,,,
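Conditions on several columns can be combined with `&` (and) and `|` (or). A minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"A": [-1.0, 0.5, 2.0, 3.0], "B": [4.0, -2.0, 1.0, -3.0]})

# Each comparison needs its own parentheses because & and | bind tighter than > and <.
both_positive = df[(df["A"] > 0) & (df["B"] > 0)]
either_negative = df[(df["A"] < 0) | (df["B"] < 0)]

print(both_positive.index.tolist())
print(either_negative.index.tolist())
```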


Filter rows with the `isin()` method

In [92]:
df2 = df.copy()

In [94]:
df2["E"]=["one","one","two","three","four","three"]
df2

Unnamed: 0,A,B,C,D,E
2013-01-01,-0.957286,-2.079644,0.071533,1.184266,one
2013-01-02,-0.155476,-1.682659,-0.199683,-1.236227,one
2013-01-03,-1.16926,-0.403638,-0.012646,0.022892,two
2013-01-04,0.192617,0.39655,-0.11204,-0.224726,three
2013-01-05,0.735926,-1.158651,2.095934,-0.883915,four
2013-01-06,0.058342,-0.436147,-0.013942,-1.017744,three


In [95]:
df2[df2["E"].isin(["two","four"])]

Unnamed: 0,A,B,C,D,E
2013-01-03,-1.16926,-0.403638,-0.012646,0.022892,two
2013-01-05,0.735926,-1.158651,2.095934,-0.883915,four


In [97]:
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range("20130102", periods=6))

In [98]:
s1

2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

In [99]:
df["F"] = s1

In [107]:
df

Unnamed: 0,A,B,C,D,F,G
2013-01-01,-0.957286,-2.079644,0.071533,1.184266,,
2013-01-02,-0.155476,-1.682659,-0.199683,-1.236227,1.0,
2013-01-03,-1.16926,-0.403638,-0.012646,0.022892,2.0,
2013-01-04,0.192617,0.39655,-0.11204,-0.224726,3.0,
2013-01-05,0.735926,-1.158651,2.095934,-0.883915,4.0,
2013-01-06,0.058342,-0.436147,-0.013942,-1.017744,5.0,


In [101]:
s2 = pd.Series([11,12,13,14,15,16,17])
s2

0    11
1    12
2    13
3    14
4    15
5    16
6    17
dtype: int64

In [106]:
df.columns

Index(['A', 'B', 'C', 'D', 'F', 'G'], dtype='object')

In [108]:
df.dropna()

Unnamed: 0,A,B,C,D,F,G


In [109]:
df

Unnamed: 0,A,B,C,D,F,G
2013-01-01,-0.957286,-2.079644,0.071533,1.184266,,
2013-01-02,-0.155476,-1.682659,-0.199683,-1.236227,1.0,
2013-01-03,-1.16926,-0.403638,-0.012646,0.022892,2.0,
2013-01-04,0.192617,0.39655,-0.11204,-0.224726,3.0,
2013-01-05,0.735926,-1.158651,2.095934,-0.883915,4.0,
2013-01-06,0.058342,-0.436147,-0.013942,-1.017744,5.0,


Setting values

In [111]:
# Set a value by label
df.at[dates[0],"A"] = 0
df.loc[dates[0]]

A    0.000000
B   -2.079644
C    0.071533
D    1.184266
F         NaN
G         NaN
Name: 2013-01-01 00:00:00, dtype: float64

In [113]:
# Set a value by position
df.iat[0,1]=0
df.iloc[0,:]

A    0.000000
B    0.000000
C    0.071533
D    1.184266
F         NaN
G         NaN
Name: 2013-01-01 00:00:00, dtype: float64

In [114]:
# Assign a NumPy array to a whole column
df.loc[:,"D"] = np.array([5]*len(df))

In [115]:
df.D

2013-01-01    5
2013-01-02    5
2013-01-03    5
2013-01-04    5
2013-01-05    5
2013-01-06    5
Freq: D, Name: D, dtype: int64

In [116]:
# Conditional assignment (a "where" operation): negate every positive value
df2 = df.copy()
df2[df2>0] = -df2
df2

Unnamed: 0,A,B,C,D,F,G
2013-01-01,0.0,0.0,-0.071533,-5,,
2013-01-02,-0.155476,-1.682659,-0.199683,-5,-1.0,
2013-01-03,-1.16926,-0.403638,-0.012646,-5,-2.0,
2013-01-04,-0.192617,-0.39655,-0.11204,-5,-3.0,
2013-01-05,-0.735926,-1.158651,-2.095934,-5,-4.0,
2013-01-06,-0.058342,-0.436147,-0.013942,-5,-5.0,


In [117]:
df

Unnamed: 0,A,B,C,D,F,G
2013-01-01,0.0,0.0,0.071533,5,,
2013-01-02,-0.155476,-1.682659,-0.199683,5,1.0,
2013-01-03,-1.16926,-0.403638,-0.012646,5,2.0,
2013-01-04,0.192617,0.39655,-0.11204,5,3.0,
2013-01-05,0.735926,-1.158651,2.095934,5,4.0,
2013-01-06,0.058342,-0.436147,-0.013942,5,5.0,
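The conditional assignment above modifies `df2` in place. As an alternative sketch, `DataFrame.mask` expresses the same idea without mutation and returns a new frame; the data here is made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0.0, -1.5, 2.0], "B": [3.0, np.nan, -4.0]})

# mask() replaces values where the condition is True and returns a new DataFrame.
# NaN > 0 is False, so missing values pass through unchanged.
negated = df.mask(df > 0, -df)
print(negated)

# The original DataFrame is left untouched.
print(df)
```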


## Transforming data
Data can be transformed with a function or with a mapping.

In [26]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                    'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


In [28]:
lowercased = data.food.str.lower()
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

In [29]:
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}
data['animal'] = lowercased.map(meat_to_animal)
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon
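`Series.map` also accepts a function, which helps when some keys are absent from the mapping. A minimal sketch; the value `"tofu"` and the fallback `"unknown"` are made up for illustration:

```python
import pandas as pd

food = pd.Series(["bacon", "Pastrami", "tofu"])
meat_to_animal = {"bacon": "pig", "pastrami": "cow"}

# With a dict, keys that are not in the mapping ("tofu") become missing values.
by_dict = food.str.lower().map(meat_to_animal)

# With a function, the fallback can be chosen explicitly.
by_func = food.map(lambda name: meat_to_animal.get(name.lower(), "unknown"))

print(by_dict.tolist())
print(by_func.tolist())
```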


### Replacing values

In [30]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [31]:
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [32]:
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64
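As the last cell shows, `replace` returns a new Series and leaves `data` unchanged. It also accepts a list or a dict to handle several sentinel values in one call, as in this sketch:

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, -999.0, 2.0, -999.0, -1000.0, 3.0])

# A list replaces several sentinel values with the same replacement.
cleaned = data.replace([-999, -1000], np.nan)

# A dict gives each sentinel its own replacement.
remapped = data.replace({-999: np.nan, -1000: 0})

# Assign the result to keep it; data itself is not modified.
print(cleaned.isna().sum())
print(remapped.tolist())
```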

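## Missing data
pandas uses `np.nan` to represent missing data.

Handling missing data
There are three ways to deal with missing values:
1. Drop the columns that contain missing values
2. Fill the missing values with the mean
3. Fill the values and also add a column that flags which values were missing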
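The three approaches above can be sketched in plain pandas as follows. The flag column name `A_was_missing` is a made-up name for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, np.nan, 2, 3, 4], "B": [2, 3, 4, 5, 6]})

# 1. Drop every column that contains a missing value.
dropped = df.dropna(axis=1)

# 2. Fill missing values with the mean of their column.
filled = df.fillna(df.mean())

# 3. Fill, and also record which rows were originally missing.
flagged = df.copy()
flagged["A_was_missing"] = df["A"].isna()
flagged["A"] = flagged["A"].fillna(df["A"].mean())

print(dropped.columns.tolist())
print(filled["A"].tolist())
print(flagged["A_was_missing"].tolist())
```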

### Filling missing values with the mean

In [5]:
from sklearn.impute import SimpleImputer

df_si = pd.DataFrame({'A': [1, np.nan, 2, 3, 4], 'B': [2, 3, 4, 5, 6]})
my_imputer = SimpleImputer()

imputed_df_si = pd.DataFrame(my_imputer.fit_transform(df_si))
imputed_df_si


Unnamed: 0,0,1
0,1.0,2.0
1,2.5,3.0
2,2.0,4.0
3,3.0,5.0
4,4.0,6.0


In [6]:
imputed_df_si_2 = pd.DataFrame(my_imputer.transform(df_si))
imputed_df_si_2


Unnamed: 0,0,1
0,1.0,2.0
1,2.5,3.0
2,2.0,4.0
3,3.0,5.0
4,4.0,6.0


Reindexing changes, adds, or removes entries along the index or the columns, and returns a copy of the data.

In [119]:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns)+["E"])
df1.loc[dates[0]:dates[1], "E"] = 1
df1

Unnamed: 0,A,B,C,D,F,G,E
2013-01-01,0.0,0.0,0.071533,5,,,1.0
2013-01-02,-0.155476,-1.682659,-0.199683,5,1.0,,1.0
2013-01-03,-1.16926,-0.403638,-0.012646,5,2.0,,
2013-01-04,0.192617,0.39655,-0.11204,5,3.0,,


In [125]:
df1.at["20130102","G"]=2

In [126]:
df1

Unnamed: 0,A,B,C,D,F,G,E
2013-01-01,0.0,0.0,0.071533,5,,,1.0
2013-01-02,-0.155476,-1.682659,-0.199683,5,1.0,2.0,1.0
2013-01-03,-1.16926,-0.403638,-0.012646,5,2.0,,
2013-01-04,0.192617,0.39655,-0.11204,5,3.0,,


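`DataFrame.dropna()` drops every row that contains missing data (pass `axis=1` to drop columns instead). It returns the result and does not modify the original data.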

In [127]:
df1.dropna(how="any")

Unnamed: 0,A,B,C,D,F,G,E
2013-01-02,-0.155476,-1.682659,-0.199683,5,1.0,2.0,1.0


In [128]:
df1

Unnamed: 0,A,B,C,D,F,G,E
2013-01-01,0.0,0.0,0.071533,5,,,1.0
2013-01-02,-0.155476,-1.682659,-0.199683,5,1.0,2.0,1.0
2013-01-03,-1.16926,-0.403638,-0.012646,5,2.0,,
2013-01-04,0.192617,0.39655,-0.11204,5,3.0,,
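`dropna` has a few parameters that control what gets dropped. A minimal sketch with made-up data, where column `B` is entirely missing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan],
    "B": [np.nan, np.nan, np.nan],
    "C": [7.0, 8.0, 9.0],
})

# Default: drop rows containing any missing value (every row here, because of B).
rows_any = df.dropna()

# Drop only the columns that are entirely missing.
cols_all = df.dropna(axis=1, how="all")

# Look at column A alone when deciding which rows to drop.
rows_subset = df.dropna(subset=["A"])

print(len(rows_any), cols_all.columns.tolist(), rows_subset.index.tolist())
```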


`DataFrame.fillna` fills in missing data

In [129]:
df1.fillna(value=5)

Unnamed: 0,A,B,C,D,F,G,E
2013-01-01,0.0,0.0,0.071533,5,5.0,5.0,1.0
2013-01-02,-0.155476,-1.682659,-0.199683,5,1.0,2.0,1.0
2013-01-03,-1.16926,-0.403638,-0.012646,5,2.0,5.0,5.0
2013-01-04,0.192617,0.39655,-0.11204,5,3.0,5.0,5.0


`isna()` returns a boolean mask that is True where a value is NaN

In [130]:
pd.isna(df1)

Unnamed: 0,A,B,C,D,F,G,E
2013-01-01,False,False,False,False,True,True,False
2013-01-02,False,False,False,False,False,False,False
2013-01-03,False,False,False,False,False,True,True
2013-01-04,False,False,False,False,False,True,True


## Changing column types

In [25]:
df[['two', 'three']] = df[['two','three']].astype(float)
df.dtypes

one       object
two      float64
three    float64
dtype: object

In [31]:
s = pd.Series(["8", 6, "7.5", 3, "0.9"]) # mixed string and numeric values
s.dtypes

dtype('O')

In [32]:
pd.to_numeric(s) # returns a new float Series; s is unchanged unless the result is assigned back
s.dtypes

dtype('O')
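To keep the conversion, assign the result of `pd.to_numeric` to a variable. The sketch below also adds a made-up unparseable entry, `"n/a"`, to show `errors="coerce"`:

```python
import pandas as pd

s = pd.Series(["8", 6, "7.5", 3, "0.9", "n/a"])

# errors="coerce" turns unparseable entries into NaN instead of raising an error.
numeric = pd.to_numeric(s, errors="coerce")

print(numeric.dtype)
print(numeric.tolist())
```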

## Sorting

In [54]:
df = pd.DataFrame({
    'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    'col2': [2, 1, 9, 8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
    'col4': ['a', 'B', 'c', 'D', 'e', 'F']
})
df.sort_values(by=['col1','col2'], ascending=True)

Unnamed: 0,col1,col2,col3,col4
1,A,1,1,B
0,A,2,0,a
2,B,9,9,c
5,C,4,3,F
4,D,7,2,e
3,,8,4,D


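`sort_values` returns a new DataFrame, so `df` itself keeps its original row order, as the output below shows. `reset_index(drop=True)` renumbers the index from 0; to restore the original order of a sorted result, call `sort_index()` on it.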

In [53]:
df.reset_index(drop=True)

Unnamed: 0,col1,col2,col3,col4
0,A,2,0,a
1,A,1,1,B
2,B,9,9,c
3,,8,4,D
4,D,7,2,e
5,C,4,3,F
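Sort direction can be set per column, and the position of missing values is configurable. A short sketch on two of the columns used above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["A", "A", "B", np.nan, "D", "C"],
    "col2": [2, 1, 9, 8, 7, 4],
})

# Sort col1 ascending and col2 descending, with missing values placed first.
ordered = df.sort_values(by=["col1", "col2"], ascending=[True, False], na_position="first")
print(ordered.index.tolist())

# Sorting by the index brings the rows back to their original order.
restored = ordered.sort_index()
print(restored.equals(df))
```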


## Index

In [48]:
df = pd.DataFrame([('bird', 389.0),
                   ('bird', 24.0),
                   ('mammal', 80.5),
                   ('mammal', np.nan)],
                  index=['falcon', 'parrot', 'lion', 'monkey'],
                  columns=('class', 'max_speed'))
df

Unnamed: 0,class,max_speed
falcon,bird,389.0
parrot,bird,24.0
lion,mammal,80.5
monkey,mammal,


After `reset_index` is called, the old index is added to the data as a column

In [49]:
df.reset_index()

Unnamed: 0,index,class,max_speed
0,falcon,bird,389.0
1,parrot,bird,24.0
2,lion,mammal,80.5
3,monkey,mammal,


Pass `drop=True` to discard the old index instead

In [50]:
df.reset_index(drop=True)

Unnamed: 0,class,max_speed
0,bird,389.0
1,bird,24.0
2,mammal,80.5
3,mammal,
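The reverse operation is `set_index`, which moves a column back into the index. A minimal round-trip sketch; the axis name `animal` is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {"class": ["bird", "bird", "mammal"], "max_speed": [389.0, 24.0, 80.5]},
    index=["falcon", "parrot", "lion"],
)

# Naming the index first gives the new column a meaningful name instead of "index".
flat = df.rename_axis("animal").reset_index()
print(flat.columns.tolist())

# set_index() moves the column back into the index.
indexed = flat.set_index("animal")
print(indexed.index.tolist())
```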


## Operations

### Statistics

In [131]:
df.mean()

A   -0.056309
B   -0.547424
C    0.304859
D    5.000000
F    3.000000
G         NaN
dtype: float64

In [135]:
df.mean(1)

2013-01-01    1.267883
2013-01-02    0.792436
2013-01-03    1.082891
2013-01-04    1.695425
2013-01-05    2.134642
2013-01-06    1.921651
Freq: D, dtype: float64

### Apply
`DataFrame.apply()` applies a user-defined function to the data

In [137]:
df

Unnamed: 0,A,B,C,D,F,G
2013-01-01,0.0,0.0,0.071533,5,,
2013-01-02,-0.155476,-1.682659,-0.199683,5,1.0,
2013-01-03,-1.16926,-0.403638,-0.012646,5,2.0,
2013-01-04,0.192617,0.39655,-0.11204,5,3.0,
2013-01-05,0.735926,-1.158651,2.095934,5,4.0,
2013-01-06,0.058342,-0.436147,-0.013942,5,5.0,


In [136]:
df.apply(np.cumsum)

Unnamed: 0,A,B,C,D,F,G
2013-01-01,0.0,0.0,0.071533,5,,
2013-01-02,-0.155476,-1.682659,-0.12815,10,1.0,
2013-01-03,-1.324737,-2.086297,-0.140796,15,3.0,
2013-01-04,-1.13212,-1.689747,-0.252836,20,6.0,
2013-01-05,-0.396194,-2.848398,1.843098,25,10.0,
2013-01-06,-0.337851,-3.284545,1.829156,30,15.0,


In [138]:
df.apply(lambda x: x.max()-x.min())

A    1.905186
B    2.079209
C    2.295617
D    0.000000
F    4.000000
G         NaN
dtype: float64
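By default `apply` passes each column to the function; with `axis=1` it passes each row instead. A small sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 4.0], "B": [3.0, 2.0], "C": [5.0, 9.0]})

# Default axis=0: the function receives one column at a time.
col_range = df.apply(lambda col: col.max() - col.min())

# axis=1: the function receives one row at a time.
row_range = df.apply(lambda row: row.max() - row.min(), axis=1)

print(col_range.tolist())
print(row_range.tolist())
```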

In [149]:
!pip install openpyxl

Collecting openpyxl
  Downloading openpyxl-3.0.10-py2.py3-none-any.whl (242 kB)
Collecting et-xmlfile
  Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.0.10

## Reading and writing data

In [150]:
import openpyxl

df.to_excel("foo.xlsx", sheet_name="S1")

In [151]:
df2 = pd.read_excel("foo.xlsx","S1", index_col=None, na_values=["NA"])
df2

Unnamed: 0.1,Unnamed: 0,A,B,C,D
0,2000-01-01,0.633438,0.512440,0.304796,0.539556
1,2000-01-02,-0.415030,0.000162,0.973473,2.241457
2,2000-01-03,-1.163734,-0.579928,-0.355194,2.634230
3,2000-01-04,-1.913343,-0.680999,0.634416,2.917080
4,2000-01-05,-2.651016,-0.132421,0.589823,3.704411
...,...,...,...,...,...
995,2002-09-22,38.434651,3.747595,36.119291,-58.312262
996,2002-09-23,38.402417,3.297220,36.136038,-58.493243
997,2002-09-24,39.695945,2.151043,36.639095,-58.485476
998,2002-09-25,38.864487,3.854588,37.865119,-57.315924
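The `Unnamed: 0` column in the output above is the saved index, read back as an ordinary column because `index_col=None`. The sketch below shows the effect with CSV and an in-memory buffer, a substitution made here so it runs without `openpyxl`; `read_excel` accepts the same `index_col` argument.

```python
import io
import pandas as pd

df = pd.DataFrame({"A": [1.5, 2.5]}, index=pd.date_range("2013-01-01", periods=2))

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer)

# Without index_col, the saved index comes back as an "Unnamed: 0" column.
buffer.seek(0)
as_column = pd.read_csv(buffer)

# index_col=0 restores it as the index; parse_dates converts it back to timestamps.
buffer.seek(0)
as_index = pd.read_csv(buffer, index_col=0, parse_dates=True)

print(as_column.columns.tolist())
print(as_index.index.tolist())
```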


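## pandas functions

`DataFrame.corr(method='pearson', min_periods=1)` computes pairwise correlation coefficients.

Parameters:

method: one of {'pearson', 'kendall', 'spearman'}

- pearson: the Pearson coefficient measures how closely two variables follow a straight line. It is meant for linear relationships and understates nonlinear ones.
- kendall: Kendall's tau, a rank-based coefficient suited to ordinal (ordered categorical) variables and to data that is not normally distributed.
- spearman: Spearman's rank coefficient, suited to monotonic but possibly nonlinear relationships and to data that is not normally distributed.

min_periods: the minimum number of observations required for each pair of columns to produce a result

Return value: a DataFrame of correlation coefficients between every pair of columns.

Correlation between two sets of data falls into one of three cases:

1. Numeric data and categorical data
2. Numeric data and numeric data
3. Categorical data and categorical data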



### Correlation between numeric and numeric data

In [17]:
import pandas as pd
 
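# Column names: 化妆品费 = cosmetics spending, 置装费 = clothing spending
data = pd.DataFrame({'化妆品费': [30, 50, 120, 20, 70, 150, 50, 60, 80, 100],
                     '置装费': [70, 80, 250, 50, 120, 300, 100, 150, 20, 180]})
print(data.corr()) # pairwise correlation between all columns
print(data['化妆品费'].corr(data['置装费'])) # correlation between the two selected columns only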

          化妆品费       置装费
化妆品费  1.000000  0.850918
置装费   0.850918  1.000000
0.8509180035311159
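To see how the choice of `method` matters, the sketch below uses made-up data in which `y` rises monotonically with `x` but not linearly. It calls `DataFrame.corr`, whose Spearman path is implemented inside pandas.

```python
import pandas as pd

# y grows monotonically with x, but not along a straight line.
data = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
data["y"] = data["x"] ** 3

pearson = data.corr(method="pearson").loc["x", "y"]
spearman = data.corr(method="spearman").loc["x", "y"]

# Spearman compares ranks only, so a perfectly monotonic relationship scores 1.
print(round(pearson, 3), round(spearman, 3))
```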


### Correlation between numeric and categorical data

In [18]:
# Case 1: the category labels are numbers
data = pd.DataFrame({'id': [3, 2, 1, 1, 2, 3, 2, 3, 1, 1, 2, 3, 1, 2, 1],
                     'age': [27, 33, 16, 29, 32, 23, 25, 28, 22, 18, 26, 26, 15, 29, 26]})
print('pearson:', data['id'].corr(data['age']))
print('spearman', data['id'].corr(data['age'], method='spearman'))
 
# Case 2: the category labels are strings
data1 = pd.DataFrame({'id': ['c', 'b', 'a', 'a', 'b', 'c', 'b', 'c', 'a', 'a', 'b', 'c', 'a', 'b', 'a'],
                     'age': [27, 33, 16, 29, 32, 23, 25, 28, 22, 18, 26, 26, 15, 29, 26]})
print('spearman', data1['id'].corr(data1['age'], method='spearman'))
 
# Output
# pearson: 0.4465155114816965
# spearman 0.4016086046008866
# spearman 0.4016086046008866

pearson: 0.4465155114816965
spearman 0.4016086046008866
spearman 0.4016086046008866



The input array could not be properly checked for nan values. nan values will be ignored.
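Whether `corr` accepts string labels directly depends on the installed pandas and SciPy versions, and the warning above suggests the strings were handled only loosely. A more explicit sketch encodes the labels as ordered integer codes first. The ordering a < b < c is an assumption made here for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "id": ["c", "b", "a", "a", "b", "c", "b", "c", "a", "a", "b", "c", "a", "b", "a"],
    "age": [27, 33, 16, 29, 32, 23, 25, 28, 22, 18, 26, 26, 15, 29, 26],
})

# Assumed ordering a < b < c; the resulting codes are 0, 1, 2.
ordered = pd.Categorical(data["id"], categories=["a", "b", "c"], ordered=True)
encoded = pd.DataFrame({"id_code": ordered.codes, "age": data["age"]})

# method="spearman" correlates the ranks of the two columns.
spearman = encoded.corr(method="spearman").loc["id_code", "age"]
print(round(spearman, 4))
```

Because Spearman depends only on ranks, any encoding that preserves the assumed order gives the same coefficient.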

