### Python's basic data types (strings, lists, dictionaries, tuples and so on) all involve indexing in one form or another, and so do NumPy's multidimensional arrays.

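## Indexers
### Indexers for tables
#### Column indexing is the most common kind of indexing. Use [column name] to pick out the content you want; selecting a single column from a DataFrame returns a Series.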

In [2]:
import numpy as np
import pandas as pd


In [3]:
# Using one of my own CSV files here
data=pd.read_csv('train_k.csv')

In [4]:
data.head()

Unnamed: 0,PhoneService,StreamingTV,StreamingMovies,PaperlessBilling,MonthlyCharges,MultipleLines_Yes,InternetService_Fiber optic,PaymentMethod_Electronic check,tenure_2,Churn
0,0,0,0,1,29.85,0,0,1,0,0
1,1,0,0,0,56.95,0,0,0,0,0
2,1,0,0,1,53.85,0,0,0,0,1
3,0,0,0,0,42.3,0,0,0,0,0
4,1,0,0,1,70.7,0,1,1,0,1


### Selecting a single column

In [6]:
data['PhoneService'].head()

0    0
1    1
2    1
3    0
4    1
Name: PhoneService, dtype: int64

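### For a single column whose name is a valid Python identifier (no spaces, for instance) and does not clash with a DataFrame attribute, you can also write .column_name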

In [9]:
data.PhoneService.head()

0    0
1    1
2    1
3    0
4    1
Name: PhoneService, dtype: int64
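
A small sketch of where dot access stops working. The frame below is a made-up toy (the column names count and Monthly Charges are hypothetical, chosen only to trigger the two problem cases); bracket access works in every case.

```python
import pandas as pd

# Hypothetical toy frame; "count" clashes with DataFrame.count,
# and "Monthly Charges" contains a space
df = pd.DataFrame({
    "PhoneService": [0, 1],
    "count": [5, 6],
    "Monthly Charges": [29.85, 56.95],
})

dot_ok = df.PhoneService            # plain identifier: returns the column
clash = df.count                    # this is the DataFrame.count method, not the column
by_bracket = df["count"]            # brackets always return the column
with_space = df["Monthly Charges"]  # a name with a space needs brackets

print(callable(clash))              # True
print(by_bracket.tolist())          # [5, 6]
```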

### You can also select several columns at once, which gives a new DataFrame

In [8]:
data1=data[['PhoneService','StreamingTV']]
data1.head()

Unnamed: 0,PhoneService,StreamingTV
0,0,0
1,1,0
2,1,0
3,0,0
4,1,0
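
To make the Series versus DataFrame distinction concrete, here is a minimal sketch on a toy frame (the values are made up and only stand in for the CSV used above). A single label returns a Series, while a list of labels returns a DataFrame even when the list has one entry.

```python
import pandas as pd

# Toy stand-in for the CSV data; values are illustrative only
df = pd.DataFrame({
    "PhoneService": [0, 1, 1],
    "StreamingTV": [0, 0, 1],
    "MonthlyCharges": [29.85, 56.95, 53.85],
})

one_col = df["PhoneService"]                      # single label -> Series
one_col_frame = df[["PhoneService"]]              # list with one label -> DataFrame
two_cols = df[["PhoneService", "StreamingTV"]]    # list of labels -> DataFrame

print(type(one_col).__name__)        # Series
print(type(one_col_frame).__name__)  # DataFrame
print(two_cols.shape)                # (3, 2)
```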


### Row indexing on a Series
#### Create a Series

In [12]:
s=pd.Series(np.arange(10),index=["a","b","c","d","e","f","g","h","i","j"])

In [13]:
s["a"]

0

In [15]:
s["j"]

9

### To select the elements for several index labels, pass the labels as a list, which is why two pairs of brackets [[ ]] appear

In [16]:
s[["a","d","f"]]

a    0
d    3
f    5
dtype: int32

### To select everything between two index labels, use a slice; with labels, both endpoints are included

In [17]:
s["a":"h"]

a    0
b    1
c    2
d    3
e    4
f    5
g    6
h    7
dtype: int32

In [20]:
s["a":"d":4]

a    0
dtype: int32
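
The endpoint rule is easy to mix up with ordinary Python slicing, so here is a short sketch on the same kind of Series. It uses .loc and .iloc explicitly to keep label-based and position-based slicing apart.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10), index=list("abcdefghij"))

by_label = s.loc["a":"d"]      # labels a, b, c, d: the end label is included
by_position = s.iloc[0:3]      # positions 0, 1, 2: the end position is excluded
stepped = s.loc["a":"h":2]     # every second element between a and h

print(by_label.tolist())       # [0, 1, 2, 3]
print(by_position.tolist())    # [0, 1, 2]
print(stepped.tolist())        # [0, 2, 4, 6]
```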

### With an integer index in which several elements share the same label, selecting that label returns every matching element

In [22]:
s1=pd.Series(["a","b","d","g","e","w","u"],
            index=[1,2,3,4,2,6,1])

In [23]:
s1[1]

1    a
1    u
dtype: object

In [24]:
s1[[1,2]]

1    a
1    u
2    b
2    e
dtype: object
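
A quick way to check for this situation before indexing, sketched on the same Series. index.is_unique and index.duplicated are existing pandas attributes; the variable names are just for illustration.

```python
import pandas as pd

s1 = pd.Series(["a", "b", "d", "g", "e", "w", "u"],
               index=[1, 2, 3, 4, 2, 6, 1])

unique = s1.index.is_unique                # False, because 1 and 2 repeat
repeated = s1.index[s1.index.duplicated()] # labels seen a second time

dup = s1.loc[1]      # repeated label -> Series with both elements
single = s1.loc[3]   # unique label -> a single scalar

print(unique)               # False
print(repeated.tolist())    # [2, 1]
print(dup.tolist())         # ['a', 'u']
print(single)               # d
```

The return type therefore depends on the data, which is one reason to keep index labels unique where possible.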

### When choosing index values, prefer a clean, uniform data type; avoid floats or mixed types as the index

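### A DataFrame has a label-based loc indexer and a position-based iloc indexer
- The general form of the loc indexer is loc[*, *], where the first * selects rows and the second * selects columns. If the second position is omitted, as in loc[*], the * selects rows.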
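
The loc[*, *] form can be sketched as follows. The frame is a made-up toy (the Kind column and its values are hypothetical), indexed by name so that the row labels are easy to read.

```python
import pandas as pd

# Hypothetical toy frame with string row labels
df = pd.DataFrame(
    {"Count": [1195, 1153, 93], "Kind": ["dog", "dog", "cat"]},
    index=["BELLA", "MAX", "ANNIE"],
)

row = df.loc["MAX"]                              # rows only: one row as a Series
cell = df.loc["MAX", "Count"]                    # row label and column label: a scalar
block = df.loc[["BELLA", "ANNIE"], ["Count"]]    # lists on both sides: a DataFrame
column = df.loc[:, "Kind"]                       # ':' keeps every row

print(cell)            # 1153
print(block.shape)     # (2, 1)
print(column.tolist()) # ['dog', 'dog', 'cat']
```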

In [27]:
data_demo=pd.read_csv("dogNames2.csv")

In [28]:
data_demo.head()

Unnamed: 0,Row_Labels,Count_AnimalName
0,1,1
1,2,2
2,40804,1
3,90201,1
4,90203,1


In [29]:
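# Suppose Row_Labels is used as the index
data_demo.set_index("Row_Labels").head()
# set_index returns a new DataFrame, so the result has to be assigned to keep it.
# Using this column as the index is not really sensible here,
# so this is only a demo and the dataset itself is left unchanged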

Unnamed: 0_level_0,Count_AnimalName
Row_Labels,Unnamed: 1_level_1
1,1
2,2
40804,1
90201,1
90203,1


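### set_index returns a new DataFrame, so the result must be assigned to take effect; this index is not a sensible choice here, so it is only a demo and the dataset is left unchanged
#### The other operations work much like those on an ordinary index, so I won't try each of them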

In [30]:
data_demo.set_index("Row_Labels").loc['1']


Count_AnimalName    1
Name: 1, dtype: int64
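
A minimal sketch of the point about assignment, on a made-up three-row stand-in for the dog-name data: set_index leaves the original frame alone and hands back a new one.

```python
import pandas as pd

# Toy stand-in for dogNames2.csv; values are illustrative only
df = pd.DataFrame({"Row_Labels": ["1", "2", "BELLA"],
                   "Count_AnimalName": [1, 2, 1195]})

indexed = df.set_index("Row_Labels")    # new DataFrame; df itself is unchanged

print(df.index.tolist())                            # [0, 1, 2]
print(indexed.index.tolist())                       # ['1', '2', 'BELLA']
print(indexed.loc["BELLA", "Count_AnimalName"])     # 1195
```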

### The loc indexer also accepts a boolean mask

In [32]:
data_demo.loc[data_demo.Count_AnimalName>1000].head()

Unnamed: 0,Row_Labels,Count_AnimalName
1156,BELLA,1195
9140,MAX,1153


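### Compound conditions must be joined with &, | and ~; they cannot be chained the way a mathematical inequality is written. You can also put all the conditions into a function and pass that function to the indexer, which calls it with the data; an anonymous (lambda) function works too.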
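
The three ways of writing a compound condition can be sketched like this. The frame and the helper name is_mid are made up for illustration. Each comparison sits in its own parentheses because & and | bind more tightly than > and <.

```python
import pandas as pd

# Toy data; names and counts are illustrative only
df = pd.DataFrame({"name": ["BELLA", "MAX", "ANNIE", "ACE"],
                   "count": [1195, 1153, 93, 128]})

# Written directly: 100 < count < 1160 has to be split into two comparisons
mid = df.loc[(df["count"] > 100) & (df["count"] < 1160)]

# ~ negates a mask, | combines two masks with "or"
small_or_max = df.loc[~(df["count"] > 1000) | (df["name"] == "MAX")]

# Hypothetical helper: loc calls it with the DataFrame and uses the returned mask
def is_mid(frame):
    return (frame["count"] > 100) & (frame["count"] < 1160)

via_func = df.loc[is_mid]
via_lambda = df.loc[lambda x: (x["count"] > 100) & (x["count"] < 1160)]

print(mid["name"].tolist())           # ['MAX', 'ACE']
print(small_or_max["name"].tolist())  # ['MAX', 'ANNIE', 'ACE']
```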

In [34]:
df_chain = pd.DataFrame([[0,0],[1,0],[-1,0]], columns=list('AB'))
df_chain

Unnamed: 0,A,B
0,0,0
1,1,0
2,-1,0


In [35]:
import warnings

In [36]:
with warnings.catch_warnings():
    warnings.filterwarnings('error')
    try:
        df_chain[df_chain.A!=0].B = 1 # bracket selection followed by dot-style column assignment (chained assignment)
    except Warning as w:
        Warning_Msg = w

In [37]:
print(Warning_Msg)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [41]:
# Set column B to 1 in the rows where A is nonzero
df_chain.loc[df_chain.A!=0,'B'] = 1
df_chain

Unnamed: 0,A,B
0,0,0
1,1,1
2,-1,1


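### The iloc indexer
#### iloc is used just like loc, except that it selects by position. Each * position accepts five kinds of object: an integer, a list of integers, an integer slice, a boolean list, or a function. The function must return one of the first four kinds, and its input is again the DataFrame itself.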

In [43]:
data_demo.head()

Unnamed: 0,Row_Labels,Count_AnimalName
0,1,1
1,2,2
2,40804,1
3,90201,1
4,90203,1


In [42]:
data_demo.iloc[1,1]

2

In [44]:
data_demo.iloc[[0,1],[1,0]]

Unnamed: 0,Count_AnimalName,Row_Labels
0,1,1
1,2,2


In [47]:
data_demo.iloc[0:3,1:3]

Unnamed: 0,Count_AnimalName
0,1
1,2
2,1


In [49]:
data_demo.iloc[lambda x: slice(0, 2)]

Unnamed: 0,Row_Labels,Count_AnimalName
0,1,1
1,2,2


In [50]:
data_demo.iloc[(data_demo.Count_AnimalName>80).values].head()

Unnamed: 0,Row_Labels,Count_AnimalName
49,ABBY,148
77,ACE,128
421,ANGEL,210
472,ANNIE,93
502,APOLLO,95


In [51]:
# For Series data
data_demo.Count_AnimalName.iloc[20]

13

In [52]:
data_demo.Count_AnimalName.iloc[1:10:2]

1     2
3     1
5     1
7     2
9    14
Name: Count_AnimalName, dtype: int64
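
The difference between label and position only shows up when the index is not the default 0, 1, 2, ..., so here is a small sketch with a deliberately shuffled integer index (the values are made up).

```python
import pandas as pd

# Integer labels that do not match the positions
s = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = s.loc[0]       # the element whose label is 0
by_position = s.iloc[0]   # the element in the first position

first_two = s.iloc[lambda x: slice(0, 2)]   # a function returning a slice
mask = (s > 15).values                      # plain boolean array for iloc
filtered = s.iloc[mask]

print(by_label)             # 20
print(by_position)          # 10
print(first_two.tolist())   # [10, 20]
print(filtered.tolist())    # [20, 30]
```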

### The query method
#### pandas lets you pass a query expression as a string to the query method; the expression must evaluate to a boolean list.

In [54]:
data_demo.head()

Unnamed: 0,Row_Labels,Count_AnimalName
0,1,1
1,2,2
2,40804,1
3,90201,1
4,90203,1


In [56]:
data_demo.query("Count_AnimalName>Count_AnimalName.mean()").head()


Unnamed: 0,Row_Labels,Count_AnimalName
8,APRIL,51
9,AUGUST,14
11,SUNDAY,13
13,FRIDAY,19
17,JUNE,24
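
A short sketch of two query features that are handy in practice: referring to a Python variable with @ and joining conditions with and/or inside the string. The frame and the threshold variable are made up for illustration.

```python
import pandas as pd

# Toy data; values are illustrative only
df = pd.DataFrame({"Row_Labels": ["BELLA", "MAX", "ANNIE", "ACE"],
                   "Count_AnimalName": [1195, 1153, 93, 128]})

threshold = 100  # hypothetical cut-off, referenced inside the query with @

above = df.query("Count_AnimalName > @threshold")
combined = df.query("Count_AnimalName > @threshold and Row_Labels != 'MAX'")

# The same selection written as an ordinary boolean mask
same_as_mask = df[df["Count_AnimalName"] > threshold]

print(above["Row_Labels"].tolist())     # ['BELLA', 'MAX', 'ACE']
print(combined["Row_Labels"].tolist())  # ['BELLA', 'ACE']
```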


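### Random sampling
#### Sampling the data (rows or feature columns) comes up often in practice. The function used here is sample, whose main parameters are n, axis, frac, replace and weights. The first three are the number of items to draw, the sampling axis (0 for rows, 1 for columns) and the fraction to draw (0.3 draws 30% of the population); n and frac cannot be given together. replace controls whether sampling is done with replacement (replace = True means with replacement), and weights gives the relative sampling probability of each item.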

In [58]:
data2=pd.DataFrame({"id":list("qwert"),
                       "value":[1,2,36,89,120]})
data2

Unnamed: 0,id,value
0,q,1
1,w,2
2,e,36
3,r,89
4,t,120


In [62]:
data2.sample(3, replace = True, weights = data2.value)

Unnamed: 0,id,value
4,t,120
4,t,120
3,r,89


#### This draws 3 samples with replacement, weighted by value
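
The other parameters can be sketched on the same small table. random_state is passed only so that repeated runs draw the same rows; which rows come out is not shown here because it depends on the random generator.

```python
import pandas as pd

data2 = pd.DataFrame({"id": list("qwert"),
                      "value": [1, 2, 36, 89, 120]})

by_count = data2.sample(n=2, random_state=0)       # 2 rows, without replacement
by_frac = data2.sample(frac=0.4, random_state=0)   # 40% of 5 rows = 2 rows
with_repl = data2.sample(n=8, replace=True,        # more draws than rows is fine
                         weights=data2["value"],   # with replacement
                         random_state=0)
one_col = data2.sample(n=1, axis=1, random_state=0)  # sample a column instead of rows

print(len(by_count), len(by_frac), len(with_repl))   # 2 2 8
print(one_col.shape)                                 # (5, 1)
```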

### Multi-level indexes

In [66]:
data

Unnamed: 0,PhoneService,StreamingTV,StreamingMovies,PaperlessBilling,MonthlyCharges,MultipleLines_Yes,InternetService_Fiber optic,PaymentMethod_Electronic check,tenure_2,Churn
0,0,0,0,1,29.85,0,0,1,0,0
1,1,0,0,0,56.95,0,0,0,0,0
2,1,0,0,1,53.85,0,0,0,0,1
3,0,0,0,0,42.30,0,0,0,0,0
4,1,0,0,1,70.70,0,1,1,0,1
...,...,...,...,...,...,...,...,...,...,...
7016,1,1,1,1,84.80,1,0,0,1,0
7017,1,1,1,1,103.20,1,1,0,0,0
7018,0,0,0,1,29.60,0,0,1,0,0
7019,1,0,0,1,74.40,1,1,0,0,1


In [70]:
np.random.seed(0)
multi_index = pd.MultiIndex.from_product([list('ABCD'),
                                          data.PhoneService.unique()], names=('StreamingTV', 'StreamingMovies'))
multi_column = pd.MultiIndex.from_product([['MonthlyCharges', 'PaperlessBilling'],
                                           data.PhoneService.unique()], names=('StreamingTV', 'StreamingMovies'))
df_multi = pd.DataFrame(np.c_[(np.random.randn(8,2)*5 + 163).tolist(),
                              (np.random.randn(8,2)*5 + 65).tolist()],
                        index = multi_index,
                        columns = multi_column).round(1)

In [71]:
df_multi

Unnamed: 0_level_0,StreamingTV,MonthlyCharges,MonthlyCharges,PaperlessBilling,PaperlessBilling
Unnamed: 0_level_1,StreamingMovies,0,1,0,1
StreamingTV,StreamingMovies,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
A,0,171.8,165.0,72.5,64.0
A,1,167.9,174.2,66.6,60.7
B,0,172.3,158.1,52.2,68.3
B,1,167.8,162.2,69.3,61.3
C,0,162.5,165.1,76.3,57.7
C,1,163.7,170.3,65.2,64.1
D,0,166.8,163.6,72.7,72.3
D,1,165.2,164.7,65.8,66.9


In [72]:
df_multi.index

MultiIndex([('A', 0),
            ('A', 1),
            ('B', 0),
            ('B', 1),
            ('C', 0),
            ('C', 1),
            ('D', 0),
            ('D', 1)],
           names=['StreamingTV', 'StreamingMovies'])

In [73]:
df_multi.index.values

array([('A', 0), ('A', 1), ('B', 0), ('B', 1), ('C', 0), ('C', 1),
       ('D', 0), ('D', 1)], dtype=object)

In [74]:
df_multi.columns

MultiIndex([(  'MonthlyCharges', 0),
            (  'MonthlyCharges', 1),
            ('PaperlessBilling', 0),
            ('PaperlessBilling', 1)],
           names=['StreamingTV', 'StreamingMovies'])

In [75]:
df_multi.columns.names

FrozenList(['StreamingTV', 'StreamingMovies'])

In [76]:
df_multi.columns.values

array([('MonthlyCharges', 0), ('MonthlyCharges', 1),
       ('PaperlessBilling', 0), ('PaperlessBilling', 1)], dtype=object)

In [78]:
df_multi.index.get_level_values(0)

Index(['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'], dtype='object', name='StreamingTV')

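### Multi-level indexes are fiddly, and I don't fully understand them yet
##### Most of the examples above have no real-world meaning and only illustrate the form; I'll come back to this when I have time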

In [79]:
np.random.seed(0)

In [81]:
L1,L2 = ['A','B','C'],['a','b','c']
mul_index1 = pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))

In [82]:
L3,L4 = ['D','E','F'],['d','e','f']

In [83]:
mul_index2 = pd.MultiIndex.from_product([L3,L4],names=('Big', 'Small'))

In [85]:
df_ex = pd.DataFrame(np.random.randint(-9,10,(9,9)),
                     index=mul_index1,
                     columns=mul_index2)
df_ex

Unnamed: 0_level_0,Big,D,D,D,E,E,E,F,F,F
Unnamed: 0_level_1,Small,d,e,f,d,e,f,d,e,f
Upper,Lower,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
A,a,0,-6,-9,-4,-9,8,9,-5,-7
A,b,7,-6,-7,1,4,7,-2,0,-9
A,c,1,9,2,-7,-7,-6,-6,9,5
B,a,-6,8,9,5,0,-8,-5,1,2
B,b,-1,2,-7,7,-9,-9,-3,5,1
B,c,-1,4,-7,-6,-7,2,4,7,-1
C,a,-1,-1,-7,-6,3,5,-9,-5,-6
C,b,4,2,4,4,2,7,5,7,-8
C,c,-1,-9,-5,-3,4,-2,6,0,9


In [86]:
idx = pd.IndexSlice

In [87]:
df_ex.loc[idx['C':, ('D', 'f'):]]

Unnamed: 0_level_0,Big,D,E,E,E,F,F,F
Unnamed: 0_level_1,Small,f,d,e,f,d,e,f
Upper,Lower,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
C,a,-7,-6,3,5,-9,-5,-6
C,b,4,4,2,7,5,7,-8
C,c,-5,-3,4,-2,6,0,9


In [88]:
df_ex.loc[idx[:'A', lambda x:x.sum()>0]] # columns whose sum is greater than 0

Unnamed: 0_level_0,Big,D,D,E,F
Unnamed: 0_level_1,Small,d,e,f,e
Upper,Lower,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
A,a,0,-6,8,-5
A,b,7,-6,7,0
A,c,1,9,-6,9


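### Constructing a multi-level index
#### The structure and slicing of multi-level indexed tables were covered above. Besides set_index, how can you build a multi-level index yourself? Three methods are commonly used: from_tuples, from_arrays and from_product, all of them methods of pd.MultiIndex.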

In [89]:
my_tuple = [('a','cat'),('a','dog'),('b','cat'),('b','dog')]

In [90]:
pd.MultiIndex.from_tuples(my_tuple, names=['First','Second'])

MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['First', 'Second'])

In [91]:
my_array = [list('aabb'), ['cat', 'dog']*2]

In [92]:
pd.MultiIndex.from_arrays(my_array, names=['First','Second'])

MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['First', 'Second'])

In [93]:
my_list1 = ['a','b']

In [94]:
my_list2 = ['cat','dog']


In [95]:
pd.MultiIndex.from_product([my_list1,my_list2],names=['First','Second'])

MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['First', 'Second'])
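
As a quick check that the three constructors really produce the same index, and to show the set_index route mentioned above, here is a minimal sketch. The column n and its values are made up.

```python
import pandas as pd

names = ["First", "Second"]

from_tuples = pd.MultiIndex.from_tuples(
    [("a", "cat"), ("a", "dog"), ("b", "cat"), ("b", "dog")], names=names)
from_arrays = pd.MultiIndex.from_arrays(
    [list("aabb"), ["cat", "dog"] * 2], names=names)
from_product = pd.MultiIndex.from_product(
    [["a", "b"], ["cat", "dog"]], names=names)

print(from_tuples.equals(from_arrays))    # True
print(from_arrays.equals(from_product))   # True

# set_index with a list of columns also builds a MultiIndex
df = pd.DataFrame({"First": list("aabb"),
                   "Second": ["cat", "dog"] * 2,
                   "n": [1, 2, 3, 4]})
indexed = df.set_index(["First", "Second"])
print(indexed.index.equals(from_product)) # True
```

from_product is the shortest to write when every combination is wanted; from_tuples and from_arrays suit indexes where only some combinations occur.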