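## Common pandas data types
+ `Series`: a one-dimensional labeled array
+ `DataFrame`: a two-dimensional table, a container of `Series`

### A Series is essentially built from two arrays
+ one array holds the keys (the index)
+ one array holds the values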

In [2]:
import numpy as np
import pandas as pd

In [3]:
import string
t = pd.Series(np.arange(10), index=list(string.ascii_uppercase[:10]))
t

A    0
B    1
C    2
D    3
E    4
F    5
G    6
H    7
I    8
J    9
dtype: int64

In [4]:
type(t)

pandas.core.series.Series

In [12]:
# Create a Series from a list

pd.Series([1, 23, 34, 45, 2], index=list('abcde'))

a     1
b    23
c    34
d    45
e     2
dtype: int64

In [3]:
# Create a Series from a dictionary

temp_dict = {"name" : "xiaohong", "age" : 30, "tel" : 10086}
t3 = pd.Series(temp_dict)
t3

name    xiaohong
age           30
tel        10086
dtype: object

In [16]:
# Access a single element (by label, or by position with .iloc)

t3['age'], t3['tel'], t3.iloc[1], t3.iloc[2]

(30, 10086, 30, 10086)

In [17]:
# Access several elements (by label, or by position with .iloc)

t3[["age", "tel"]], t3.iloc[[1, 2]]

(age       30
 tel    10086
 dtype: object, age       30
 tel    10086
 dtype: object)

In [18]:
# Get the index (iterable)

t3.index

Index(['name', 'age', 'tel'], dtype='object')

In [20]:
type(t3.index)

pandas.core.indexes.base.Index

In [19]:
# Get the values

t3.values

array(['xiaohong', 30, 10086], dtype=object)

In [21]:
type(t3.values)

numpy.ndarray

## A DataFrame has both a row index and a column index
+ Row index: identifies the rows; called `index`, axis 0, `axis=0`
+ Column index: identifies the columns; called `columns`, axis 1, `axis=1`
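The `axis` argument is easiest to see with a reduction. The sketch below rebuilds the same 3x4 frame used in these notes under the hypothetical name `df_axis` and sums along each axis: `axis=0` collapses the rows and leaves one value per column, while `axis=1` collapses the columns and leaves one value per row.

```python
import numpy as np
import pandas as pd

# Hypothetical name; same shape and labels as the frames in these notes.
df_axis = pd.DataFrame(np.arange(12).reshape(3, 4),
                       index=list('abc'), columns=list('WXYZ'))

# axis=0 runs down the rows, so the result is indexed by column label.
col_sums = df_axis.sum(axis=0)

# axis=1 runs across the columns, so the result is indexed by row label.
row_sums = df_axis.sum(axis=1)

print(col_sums)
print(row_sums)
```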

In [4]:
t1 = pd.DataFrame(np.arange(12).reshape(3,4), index=list('abc'), columns=list('WXYZ'))
t1

|   | W | X | Y | Z |
|---|---|---|---|---|
| a | 0 | 1 | 2 | 3 |
| b | 4 | 5 | 6 | 7 |
| c | 8 | 9 | 10 | 11 |


In [7]:
d1 = {"name" : ["xiaoming", "xiaogang"], "age" : [20, 32], "tel" : [10086, 10010]}
pd.DataFrame(d1)

|   | name | age | tel |
|---|---|---|---|
| 0 | xiaoming | 20 | 10086 |
| 1 | xiaogang | 32 | 10010 |


In [8]:
d2 = [{"name":"hong", "age":32, "tel":10010}, {"name":"gang", "tel":10000}, {"name":"wang", "age":22}, {}]
pd.DataFrame(d2)

|   | age | name | tel |
|---|---|---|---|
| 0 | 32.0 | hong | 10010.0 |
| 1 | NaN | gang | 10000.0 |
| 2 | 22.0 | wang | NaN |
| 3 | NaN | NaN | NaN |


In [10]:
t2 = pd.DataFrame(d2, columns=['name', 'age', 'tel'])
t2

|   | name | age | tel |
|---|---|---|---|
| 0 | hong | 32.0 | 10010.0 |
| 1 | gang | NaN | 10000.0 |
| 2 | wang | 22.0 | NaN |
| 3 | NaN | NaN | NaN |


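## Inspecting a DataFrame as a whole
+ `df.head(3)`    shows the first few rows, `5` by default
+ `df.tail(3)`    shows the last few rows, `5` by default
+ `df.info()`     prints a summary: row count, column count, column index, non-null count per column, column dtypes, memory usage
+ `df.describe()`   summary statistics (count, mean, std, min, quartiles, max) for the numeric columns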
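A minimal sketch of the four inspection calls, using a small made-up frame (`df_demo` and its columns are hypothetical, not data from these notes). Note that `info()` prints its report and returns `None`, while the other three return new objects.

```python
import numpy as np
import pandas as pd

# Hypothetical 10-row frame with one numeric and one string column.
df_demo = pd.DataFrame({"score": np.arange(10), "group": list("aabbccddee")})

first_three = df_demo.head(3)    # first 3 rows
last_two = df_demo.tail(2)       # last 2 rows
default_head = df_demo.head()    # no argument: 5 rows

df_demo.info()                   # prints the summary to stdout

# By default describe() covers only the numeric columns, here "score".
summary = df_demo.describe()
print(summary)
```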

## Basic DataFrame attributes
+ `df.shape`       number of rows and columns
+ `df.dtypes`      dtype of each column
+ `df.ndim`        number of dimensions
+ `df.index`       row index
+ `df.columns`     column index
+ `df.values`      the values as a two-dimensional ndarray

In [13]:
t2.shape

(4, 3)

In [14]:
t2.dtypes

name     object
age     float64
tel     float64
dtype: object

In [15]:
t2.ndim

2

In [11]:
t2.index

RangeIndex(start=0, stop=4, step=1)

In [12]:
t2.columns

Index(['name', 'age', 'tel'], dtype='object')

In [16]:
t2.values

array([['hong', 32.0, 10010.0],
       ['gang', nan, 10000.0],
       ['wang', 22.0, nan],
       [nan, nan, nan]], dtype=object)

## Pandas: loc and iloc
+ `df.loc` selects data by label
+ `df.iloc` selects data by integer position
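One difference between the two indexers is worth a quick sketch: a label slice with `loc` includes both endpoints, whereas a position slice with `iloc` excludes the end, like ordinary Python slicing. The frame name `df_sel` is hypothetical; its contents match the frames used in these notes.

```python
import numpy as np
import pandas as pd

df_sel = pd.DataFrame(np.arange(12).reshape(3, 4),
                      index=list('abc'), columns=list('WXYZ'))

# Label slices include the end label: rows a..b, columns W..Y.
by_label = df_sel.loc['a':'b', 'W':'Y']

# Position slices exclude the end position: row 0 only, columns 0..1.
by_pos = df_sel.iloc[0:1, 0:2]

print(by_label.shape, by_pos.shape)
```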

In [6]:
t3 = pd.DataFrame(np.arange(12).reshape(3,4), index=list('abc'), columns=list('WXYZ'))
t3

|   | W | X | Y | Z |
|---|---|---|---|---|
| a | 0 | 1 | 2 | 3 |
| b | 4 | 5 | 6 | 7 |
| c | 8 | 9 | 10 | 11 |


In [18]:
t3.loc["a", "Z"]

3

In [19]:
t3.loc['a']

W    0
X    1
Y    2
Z    3
Name: a, dtype: int64

In [20]:
t3.loc['a', :]

W    0
X    1
Y    2
Z    3
Name: a, dtype: int64

In [21]:
t3.loc[:, 'Y']

a     2
b     6
c    10
Name: Y, dtype: int64

In [22]:
t3.loc[['a', 'c']]

|   | W | X | Y | Z |
|---|---|---|---|---|
| a | 0 | 1 | 2 | 3 |
| c | 8 | 9 | 10 | 11 |


In [23]:
t3.loc[['a', 'c'], :]

|   | W | X | Y | Z |
|---|---|---|---|---|
| a | 0 | 1 | 2 | 3 |
| c | 8 | 9 | 10 | 11 |


In [24]:
t3.loc[:, ['W', 'Z']]

|   | W | Z |
|---|---|---|
| a | 0 | 3 |
| b | 4 | 7 |
| c | 8 | 11 |


In [25]:
t3.loc[['a', 'b'], ['W', 'Z']]

|   | W | Z |
|---|---|---|
| a | 0 | 3 |
| b | 4 | 7 |


In [26]:
t3.iloc[1]

W    4
X    5
Y    6
Z    7
Name: b, dtype: int64

In [27]:
t3.iloc[1, :]

W    4
X    5
Y    6
Z    7
Name: b, dtype: int64

In [28]:
t3.iloc[:, 2]

a     2
b     6
c    10
Name: Y, dtype: int64

In [29]:
t3.iloc[:, [2, 1]]

|   | Y | X |
|---|---|---|
| a | 2 | 1 |
| b | 6 | 5 |
| c | 10 | 9 |


In [30]:
t3.iloc[[0, 2], [2, 1]]

|   | Y | X |
|---|---|---|
| a | 2 | 1 |
| c | 10 | 9 |


In [31]:
t3.iloc[1:, :2]

|   | W | X |
|---|---|---|
| b | 4 | 5 |
| c | 8 | 9 |


## Pandas: boolean indexing

In [None]:
df[ df['Count_AnimalName'] > 800]

df[ (df['Row_Labels'].str.len() > 4) & (df['Count_AnimalName'] > 700) ]

# https://pandas.pydata.org/pandas-docs/stable/reference/series.html#string-handling
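The cell above relies on a `df` that is not defined in these notes. As a runnable sketch, the same two filters are applied below to a small made-up frame; the column names come from the cell above, but `df_names` and every value in it are invented for illustration. Each condition needs its own parentheses because `&` binds more tightly than `>`.

```python
import pandas as pd

# Hypothetical stand-in for the dataset used above; values are made up.
df_names = pd.DataFrame({
    "Row_Labels": ["BELLA", "MAX", "CHARLIE", "COCO", "ROCKY"],
    "Count_AnimalName": [900, 1200, 750, 650, 820],
})

# One condition: keep rows whose count exceeds 800.
popular = df_names[df_names["Count_AnimalName"] > 800]

# Two conditions combined with & (element-wise AND); | is element-wise OR.
long_and_popular = df_names[
    (df_names["Row_Labels"].str.len() > 4)
    & (df_names["Count_AnimalName"] > 700)
]

print(popular)
print(long_and_popular)
```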

## Pandas: handling missing data
> Test whether values are `NaN`: `pd.isnull(df)`, `pd.notnull(df)`

+ Option 1: drop the rows or columns that contain `NaN` with `dropna(axis=0, how='any', inplace=False)`
+ Option 2: fill in values, e.g. `t.fillna(t.mean())`, `t.fillna(t.median())`, `t.fillna(0)`
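A short sketch of how the `dropna` arguments change the result, on a hypothetical frame `df_na` with the same shape of gaps as the example that follows. With `how='any'` a row or column is dropped if it has at least one `NaN`; with `how='all'` only if every value is `NaN`. `fillna` also accepts a dict, so each column can get its own fill value.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "age" and "tel" each have one missing value.
df_na = pd.DataFrame({
    "name": ["hong", "gang", "wang"],
    "age": [32, np.nan, 22],
    "tel": [10010, 10000, np.nan],
})

any_dropped = df_na.dropna(axis=0, how="any")   # drop rows with any NaN
all_dropped = df_na.dropna(axis=0, how="all")   # drop rows that are all NaN
cols_dropped = df_na.dropna(axis=1, how="any")  # drop columns with any NaN

# Per-column fill values; inplace defaults to False, so df_na is unchanged.
filled = df_na.fillna({"age": df_na["age"].median(), "tel": 0})

print(filled)
```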

In [16]:
d4 = [{"name":"hong", "age":32, "tel":10010}, {"name":"gang", "tel":10000}, {"name":"wang", "age":22}]
t4 = pd.DataFrame(d4)
t4

|   | age | name | tel |
|---|---|---|---|
| 0 | 32.0 | hong | 10010.0 |
| 1 | NaN | gang | 10000.0 |
| 2 | 22.0 | wang | NaN |


In [9]:
pd.isnull(t4)

|   | age | name | tel |
|---|---|---|---|
| 0 | False | False | False |
| 1 | True | False | False |
| 2 | False | False | True |


In [10]:
pd.notnull(t4)

|   | age | name | tel |
|---|---|---|---|
| 0 | True | True | True |
| 1 | False | True | True |
| 2 | True | True | False |


In [12]:
t4.mean(numeric_only=True)

age       27.0
tel    10005.0
dtype: float64

In [17]:
t4.fillna(t4.mean(numeric_only=True))

|   | age | name | tel |
|---|---|---|---|
| 0 | 32.0 | hong | 10010.0 |
| 1 | 27.0 | gang | 10000.0 |
| 2 | 22.0 | wang | 10005.0 |


In [14]:
t4['age'] = t4['age'].fillna(t4['age'].mean())
t4['age']

0    32.0
1    27.0
2    22.0
Name: age, dtype: float64

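## Common pandas statistics methods
```python
# Mean movie rating
df['Rating'].mean()

# Number of distinct directors
len(df['Director'].unique())

# Number of distinct actors
temp_list = df['Actors'].str.split(",").tolist()   # nested list: one list of names per movie
len(set([i for j in temp_list for i in j]))     # flatten with a list comprehension, then deduplicate

# Longest and shortest runtime
df['Runtime'].max()
df['Runtime'].min()

# Positions of the longest and shortest runtime (idxmax()/idxmin() return the index labels instead)
df['Runtime'].argmax()
df['Runtime'].argmin()

# Median runtime
df['Runtime'].median()
```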
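The snippet above assumes a movie dataset `df` that is not included in these notes. The sketch below runs the same kind of statistics on a tiny made-up frame (`df_movies` and all its values are hypothetical). It swaps in `nunique()` and `explode()` as an alternative to the list comprehension, and strips whitespace so that `"Actor 2"` and `" Actor 2"` count as the same person.

```python
import pandas as pd

# Hypothetical data; column names follow the snippet above.
df_movies = pd.DataFrame({
    "Rating": [8.0, 7.0, 6.0],
    "Director": ["Director A", "Director B", "Director A"],
    "Actors": ["Actor 1, Actor 2", "Actor 2, Actor 3", "Actor 1"],
    "Runtime": [120, 95, 150],
})

mean_rating = df_movies["Rating"].mean()
n_directors = df_movies["Director"].nunique()

# Split each cell into a list, give every name its own row, trim spaces.
actors = df_movies["Actors"].str.split(",").explode().str.strip()
n_actors = actors.nunique()

# idxmax()/idxmin() return index labels, which loc can use directly.
longest = df_movies.loc[df_movies["Runtime"].idxmax()]
shortest = df_movies.loc[df_movies["Runtime"].idxmin()]

print(mean_rating, n_directors, n_actors)
```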

## Combining data: join
> `join`: by default, aligns the two frames on their row index and keeps every row of the calling frame

In [3]:
df1 = pd.DataFrame(np.ones((2, 4)), index=['A', 'B'], columns=list('abcd'))
df1

|   | a | b | c | d |
|---|---|---|---|---|
| A | 1.0 | 1.0 | 1.0 | 1.0 |
| B | 1.0 | 1.0 | 1.0 | 1.0 |


In [6]:
df2 = pd.DataFrame(np.zeros((3, 3)), index=['A', 'B', 'C'], columns=list('xyz'))
df2

|   | x | y | z |
|---|---|---|---|
| A | 0.0 | 0.0 | 0.0 |
| B | 0.0 | 0.0 | 0.0 |
| C | 0.0 | 0.0 | 0.0 |


In [7]:
df1.join(df2)

|   | a | b | c | d | x | y | z |
|---|---|---|---|---|---|---|---|
| A | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| B | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |


In [8]:
df2.join(df1)

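|   | x | y | z | a | b | c | d |
|---|---|---|---|---|---|---|---|
| A | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| B | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| C | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN |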
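The two results differ because `join` defaults to `how='left'`, so the calling frame decides which rows survive. A small sketch of the `how` parameter, using hypothetical names `df_left` and `df_right` for frames built like `df1` and `df2`:

```python
import numpy as np
import pandas as pd

df_left = pd.DataFrame(np.ones((2, 4)), index=['A', 'B'], columns=list('abcd'))
df_right = pd.DataFrame(np.zeros((3, 3)), index=['A', 'B', 'C'], columns=list('xyz'))

# Default how='left': only the caller's rows (A, B) are kept.
left_join = df_left.join(df_right)

# how='outer': union of both indexes; row C gets NaN in columns a..d.
outer_join = df_left.join(df_right, how='outer')

# how='inner': intersection of both indexes, whichever frame is the caller.
inner_join = df_right.join(df_left, how='inner')

print(outer_join)
```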


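## Combining data: merge
> `merge`: combines data on the specified column(s) using the chosen join method

+ The default is `inner`: the intersection of the keys
+ `how='outer'`: the union of the keys, with `NaN` filling the gaps
+ `how='left'`: keeps every key from the left frame, with `NaN` filling the gaps
+ `how='right'`: keeps every key from the right frame, with `NaN` filling the gaps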
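The four modes can be compared side by side on two small frames that share a `key` column. This is a minimal sketch; `left`, `right`, and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical frames: keys b and c appear in both, a and d in only one.
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

inner = left.merge(right, on="key")                  # keys in both: b, c
outer = left.merge(right, on="key", how="outer")     # all keys: a, b, c, d
left_m = left.merge(right, on="key", how="left")     # left keys: a, b, c
right_m = left.merge(right, on="key", how="right")   # right keys: b, c, d

print(outer)
```

When the key columns have different names in the two frames, `left_on` and `right_on` can be passed instead of `on`.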