# What the Pandas index is for
 
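Data stored in an ordinary column can also be used for lookups, so what does an index add?

In summary, an index offers:
- 1. More convenient data lookup;
- 2. Better lookup performance;
- 3. Automatic data alignment;
- 4. Support for more, and more powerful, data structures.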

In [3]:
import pandas as pd

In [4]:
fpath = 'douban250_xinxi.csv'

In [5]:
df = pd.read_csv(fpath)

In [6]:
df.head()

,Unnamed: 0,ranking,rating,time1,number_of_comments
0,0,1,9.7,1994,3231859
1,1,2,9.6,1993,2385441
2,2,3,9.5,1997,2454845
3,3,4,9.5,1994,2395170
4,4,5,9.4,2001,2495307


In [7]:
df.count()

Unnamed: 0            250
ranking               250
rating                250
time1                 250
number_of_comments    250
dtype: int64

## 1. Looking up data with the index

In [10]:
# drop=False keeps the index column among the regular columns as well
df.set_index("ranking",inplace=True,drop=False)

In [11]:
df.head()

ranking (index),Unnamed: 0,ranking,rating,time1,number_of_comments
1,0,1,9.7,1994,3231859
2,1,2,9.6,1993,2385441
3,2,3,9.5,1997,2454845
4,3,4,9.5,1994,2395170
5,4,5,9.4,2001,2495307
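
The effect of `drop=False` is easier to see next to the default. The following is a minimal sketch on a hypothetical three-row stand-in for the movie table (the `movies` frame and its values are made up for illustration); it uses the non-inplace form of `set_index` so that both results can be compared.

```python
import pandas as pd

# Hypothetical stand-in for the movie table (illustrative values only)
movies = pd.DataFrame({
    "ranking": [1, 2, 3],
    "rating": [9.7, 9.6, 9.5],
})

# drop=False: "ranking" becomes the index and also stays as a column
kept = movies.set_index("ranking", drop=False)

# default drop=True: "ranking" becomes the index only
dropped = movies.set_index("ranking")

print(list(kept.columns))     # ['ranking', 'rating']
print(list(dropped.columns))  # ['rating']
print(kept.index.name)        # ranking
```

Keeping the column is convenient here because the later cells compare a column-based filter with an index-based lookup on the same frame.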


In [12]:
df.index

Index([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,
       ...
       241, 242, 243, 244, 245, 246, 247, 248, 249, 250],
      dtype='int64', name='ranking', length=250)

In [13]:
# Lookup with a boolean condition on a column
df.loc[df['ranking']==50].head()

ranking (index),Unnamed: 0,ranking,rating,time1,number_of_comments
50,49,50,9.1,2018,1149521


In [16]:
# Lookup through the index.
# df.loc[50] (a scalar label) returns a Series when the label matches a single row;
# df.loc[[50]] (a list of labels) returns a DataFrame.
df.loc[[50]].head()

ranking (index),Unnamed: 0,ranking,rating,time1,number_of_comments
50,49,50,9.1,2018,1149521
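
The difference between one and two pairs of brackets can be checked directly. This is a small sketch on made-up data (the `movies` and `dup` frames are hypothetical); it also shows that with a duplicated label a scalar lookup returns a DataFrame, because more than one row matches.

```python
import pandas as pd

# Hypothetical frame with a unique index named "ranking"
movies = pd.DataFrame(
    {"rating": [9.7, 9.6, 9.5], "time1": [1994, 1993, 1997]},
    index=pd.Index([1, 2, 3], name="ranking"),
)

row_as_series = movies.loc[2]    # scalar label on a unique index -> Series
row_as_frame = movies.loc[[2]]   # list of labels -> DataFrame

print(type(row_as_series).__name__)  # Series
print(type(row_as_frame).__name__)   # DataFrame
print(row_as_frame.shape)            # (1, 2)

# Hypothetical frame where label 2 appears twice
dup = pd.DataFrame({"rating": [9.7, 9.6, 9.5]}, index=[1, 2, 2])
print(type(dup.loc[2]).__name__)     # DataFrame
```

Passing a list is therefore the safer choice when later code expects a DataFrame regardless of how many rows match.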


## 2. Using the index improves lookup performance

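- If the index is unique, Pandas uses a hash table, so a lookup is O(1);
- If the index is not unique but is sorted, Pandas uses binary search, so a lookup is O(log N);
- If the index is neither unique nor sorted, every lookup has to scan the whole table, so a lookup is O(N).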
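
Which of the three cases applies can be read off two index properties, `is_unique` and `is_monotonic_increasing`. The sketch below builds one small, made-up index per case and prints both flags; the variable names are illustrative only.

```python
import pandas as pd

# One hypothetical index per case in the list above
cases = {
    "unique, unsorted": pd.Index([3, 1, 2]),
    "duplicates, sorted": pd.Index([1, 1, 2, 3]),
    "duplicates, unsorted": pd.Index([2, 1, 3, 1]),
}

# Collect (is_unique, is_monotonic_increasing) for each index
flags = {
    name: (idx.is_unique, idx.is_monotonic_increasing)
    for name, idx in cases.items()
}

for name, (unique, increasing) in flags.items():
    print(f"{name}: is_unique={unique}, is_monotonic_increasing={increasing}")
# unique, unsorted: is_unique=True, is_monotonic_increasing=False
# duplicates, sorted: is_unique=False, is_monotonic_increasing=True
# duplicates, unsorted: is_unique=False, is_monotonic_increasing=False
```

Note that `is_monotonic_increasing` allows repeated values, so a sorted index with duplicates still reports `True`.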

### Experiment 1: lookup on a fully shuffled index

In [17]:
# Shuffle the rows:
from sklearn.utils import shuffle
df_shuffle = shuffle(df)

In [18]:
df_shuffle.head()

ranking (index),Unnamed: 0,ranking,rating,time1,number_of_comments
8,7,8,9.4,1994,2519333
150,149,150,8.8,2011,602982
188,187,188,8.7,2000,662612
49,48,49,9.0,1995,1346045
177,176,177,9.5,1984,152238


In [19]:
# Check whether the index is monotonically increasing
df_shuffle.index.is_monotonic_increasing

False

In [20]:
df_shuffle.index.is_unique

True

In [21]:
# Time the lookup for ranking == 50
%timeit df_shuffle.loc[[50]]

94.8 μs ± 1.72 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


### Experiment 2: lookup after sorting the index

In [23]:
df_sorted = df_shuffle.sort_index()

In [24]:
df_sorted.head()

ranking (index),Unnamed: 0,ranking,rating,time1,number_of_comments
1,0,1,9.7,1994,3231859
2,1,2,9.6,1993,2385441
3,2,3,9.5,1997,2454845
4,3,4,9.5,1994,2395170
5,4,5,9.4,2001,2495307


In [25]:
# Check whether the index is monotonically increasing
df_sorted.index.is_monotonic_increasing

True

In [26]:
# Time the lookup for ranking == 50
%timeit df_sorted.loc[[50]]

99.4 μs ± 4.95 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
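
The two timings are close. That is consistent with the list at the start of this section: `is_unique` is `True` for both frames, so sorting should not change much, and with only 250 rows fixed overhead probably dominates. To see an effect from ordering, one would likely need a larger index with duplicates. The following is a sketch on synthetic data (the size `n`, the random labels and the `frame` names are all assumptions, not part of the original dataset); timings vary by machine and pandas version, so none are shown.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data: n rows drawn from n // 2 possible labels,
# so the index is guaranteed to contain duplicates
rng = np.random.default_rng(0)
n = 200_000
labels = rng.integers(0, n // 2, size=n)

frame = pd.DataFrame({"value": np.arange(n)}, index=labels)  # duplicates, unsorted
frame_sorted = frame.sort_index()                            # duplicates, sorted

key = int(labels[0])  # a label that is known to exist

# Time the same list-style lookup on both frames
t_unsorted = timeit.timeit(lambda: frame.loc[[key]], number=20)
t_sorted = timeit.timeit(lambda: frame_sorted.loc[[key]], number=20)

print(f"unsorted: {t_unsorted / 20 * 1e6:.1f} us per lookup")
print(f"sorted:   {t_sorted / 20 * 1e6:.1f} us per lookup")

# Both lookups must return the same rows, whatever the timing
same_rows = sorted(frame.loc[[key], "value"]) == sorted(frame_sorted.loc[[key], "value"])
print(same_rows)
```

In a notebook, `%timeit frame.loc[[key]]` gives the same kind of measurement with less code.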


## 3. Automatic data alignment with the index

This applies to both Series and DataFrame.

In [32]:
# The index is a, b, c
s1 = pd.Series([1,2,3],index=list('abc'))

In [33]:
s1

a    1
b    2
c    3
dtype: int64

In [34]:
s2 = pd.Series([2,3,4],index = list('bcd'))

In [35]:
s2

b    2
c    3
d    4
dtype: int64

In [37]:
s1+s2

a    NaN
b    4.0
c    6.0
d    NaN
dtype: float64
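
Labels that appear on only one side produce `NaN`. Two follow-ups may help here: `Series.add` with `fill_value` treats a missing side as a given value, and a DataFrame aligns on both rows and columns. This is a minimal sketch; `df1` and `df2` are made-up frames for illustration.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=list("abc"))
s2 = pd.Series([2, 3, 4], index=list("bcd"))

# Treat a label missing on one side as 0 instead of producing NaN
filled = s1.add(s2, fill_value=0)
print(filled.tolist())  # [1.0, 4.0, 6.0, 4.0]

# DataFrames align on the row index and on the column labels
df1 = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["a", "b"])
df2 = pd.DataFrame({"y": [10, 20], "z": [30, 40]}, index=["b", "c"])
total = df1 + df2

print(list(total.index))    # ['a', 'b', 'c']
print(list(total.columns))  # ['x', 'y', 'z']
print(total.loc["b", "y"])  # 14.0 (the only cell present in both frames)
```

Every other cell of `total` is `NaN`, because its row label or column label exists in only one of the two frames.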

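## 4. More powerful data structures built on the index
#### Several powerful index types
- CategoricalIndex: an index based on categorical data, which can improve performance;
- MultiIndex: a multi-level index, used for example for the result of a multi-key groupby aggregation;
- DatetimeIndex: a time-based index with strong support for date and time methods.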
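
To make the three index types concrete, here is a short sketch that builds one of each. All data is invented for illustration (the `sales` frame, the category labels and the date range are assumptions).

```python
import pandas as pd

# CategoricalIndex: labels drawn from a small set of categories
cat_idx = pd.CategoricalIndex(["low", "high", "low", "mid"])
print(sorted(cat_idx.categories))  # ['high', 'low', 'mid']

# MultiIndex: produced by grouping on more than one key
sales = pd.DataFrame({
    "year": [2020, 2020, 2021, 2021],
    "region": ["east", "west", "east", "west"],
    "amount": [10, 20, 30, 40],
})
by_year_region = sales.groupby(["year", "region"])["amount"].sum()
print(type(by_year_region.index).__name__)  # MultiIndex
print(by_year_region.loc[(2021, "west")])   # 40

# DatetimeIndex: supports selecting by a partial date string
daily = pd.Series(
    range(60),
    index=pd.date_range("2024-01-01", periods=60, freq="D"),
)
february = daily.loc["2024-02"]  # every row in February 2024
print(len(february))             # 29
```

Selecting with a tuple on the MultiIndex and with a month string on the DatetimeIndex are both index-based lookups, in the same spirit as `df.loc[[50]]` earlier.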