What a pandas index is for
Data stored in ordinary columns can also be used for queries, so what does using an index add?

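# Summary of what an index is for:
1. More convenient data queries
2. Better query performance
3. Automatic data alignment
4. Support for more, and more powerful, data structures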

In [3]:
import pandas as pd
df=pd.read_csv('Movie-Dataset-Latest.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,id,title,release_date,overview,popularity,vote_average,vote_count,video
0,0,19404,Dilwale Dulhania Le Jayenge,1995-10-20,"Raj is a rich, carefree, happy-go-lucky second...",25.884,8.7,3304,False
1,1,278,The Shawshank Redemption,1994-09-23,Framed in the 1940s for the double murder of h...,60.11,8.7,20369,False
2,2,238,The Godfather,1972-03-14,"Spanning the years 1945 to 1955, a chronicle o...",62.784,8.7,15219,False
3,3,724089,Gabriel's Inferno Part II,2020-07-31,Professor Gabriel Emerson finally learns the t...,28.316,8.6,1360,False
4,4,424,Schindler's List,1993-11-30,The true story of how businessman Oskar Schind...,38.661,8.6,12158,False


In [4]:
df.dtypes

Unnamed: 0        int64
id                int64
title            object
release_date     object
overview         object
popularity      float64
vote_average    float64
vote_count        int64
video              bool
dtype: object

1. Querying data with the index

In [5]:
# drop=False keeps the index column among the regular columns as well
df.set_index('id',inplace=True,drop=False)

In [6]:
df.head()

id,Unnamed: 0,id,title,release_date,overview,popularity,vote_average,vote_count,video
19404,0,19404,Dilwale Dulhania Le Jayenge,1995-10-20,"Raj is a rich, carefree, happy-go-lucky second...",25.884,8.7,3304,False
278,1,278,The Shawshank Redemption,1994-09-23,Framed in the 1940s for the double murder of h...,60.11,8.7,20369,False
238,2,238,The Godfather,1972-03-14,"Spanning the years 1945 to 1955, a chronicle o...",62.784,8.7,15219,False
724089,3,724089,Gabriel's Inferno Part II,2020-07-31,Professor Gabriel Emerson finally learns the t...,28.316,8.6,1360,False
424,4,424,Schindler's List,1993-11-30,The true story of how businessman Oskar Schind...,38.661,8.6,12158,False


In [7]:
df.count()

Unnamed: 0      9463
id              9463
title           9463
release_date    9463
overview        9449
popularity      9463
vote_average    9463
vote_count      9463
video           9463
dtype: int64

In [8]:
# Query by index: look up the row whose id is 500
df.loc[500]

Unnamed: 0                                                    187
id                                                            500
title                                              Reservoir Dogs
release_date                                           1992-09-02
overview        A botched robbery indicates a police informant...
popularity                                                 28.045
vote_average                                                  8.2
vote_count                                                  11351
video                                                       False
Name: 500, dtype: object

In [9]:
# Query the same row through the id column
df.loc[df['id']==500]

id,Unnamed: 0,id,title,release_date,overview,popularity,vote_average,vote_count,video
500,187,500,Reservoir Dogs,1992-09-02,A botched robbery indicates a police informant...,28.045,8.2,11351,False
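
The two query styles above return different shapes, which a small sketch can make visible. The `movies` frame below is a hypothetical three-row stand-in for the movie table, not the real dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the movie table
movies = pd.DataFrame(
    {
        "id": [19404, 278, 500],
        "title": [
            "Dilwale Dulhania Le Jayenge",
            "The Shawshank Redemption",
            "Reservoir Dogs",
        ],
    }
).set_index("id", drop=False)

row = movies.loc[500]                    # label lookup: one row, returned as a Series
frame = movies.loc[movies["id"] == 500]  # boolean mask: a one-row DataFrame
several = movies.loc[[278, 500]]         # list of labels: a DataFrame, in the requested order

print(type(row).__name__, type(frame).__name__, len(several))
```

A label lookup on a unique index yields a Series, while the boolean mask always yields a DataFrame, even when only one row matches.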


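2. Using the index improves query efficiency
- If the index is unique, pandas uses a hash table, so a lookup is O(1)
- If the index is not unique but is sorted, pandas uses binary search, so a lookup is O(log(N))
- If the index is in completely random order (neither unique nor sorted), every lookup scans the whole table, so a lookup is O(N)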
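
To try the three cases on synthetic data, a minimal sketch might look like the following. The index names, the size `n`, and the random seed are all assumptions made for illustration; the sketch only checks which case each index falls into and leaves the timing to `%timeit` in a notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Three hypothetical indexes, one per case in the list above
unique_idx = pd.Index(rng.permutation(n))                             # unique, unsorted
sorted_dup_idx = pd.Index(np.sort(rng.integers(0, n // 10, size=n)))  # duplicates, sorted
random_dup_idx = pd.Index(rng.integers(0, n // 10, size=n))           # duplicates, unsorted

def describe(idx):
    """Return the two properties that decide which lookup case applies."""
    return idx.is_unique, idx.is_monotonic_increasing

for name, idx in [
    ("unique", unique_idx),
    ("sorted with duplicates", sorted_dup_idx),
    ("random with duplicates", random_dup_idx),
]:
    print(name, describe(idx))

# In a notebook, each case can then be timed, for example:
#   s = pd.Series(np.arange(n), index=random_dup_idx)
#   %timeit s.loc[500]
```

Actual timings depend on the pandas version and the data size, so measure rather than assume a particular speedup.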

In [10]:
# Is the index unique?
df.index.is_unique

True

In [12]:
# Is the index monotonically increasing?
df.index.is_monotonic_increasing

False

Experiment 1: lookup with the index in completely random order

In [13]:
# Helper for shuffling rows (df's index is already unsorted, so df is timed as-is)
from sklearn.utils import shuffle
# Time the lookup of the row with id == 500
%timeit df.loc[500]

43.9 µs ± 438 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


Experiment 2: lookup after sorting the index (the index in this dataset is unique, so the lookup is O(1) whether or not it is sorted)

In [14]:
df_sort=df.sort_index()

In [15]:
df_sort.head()

id,Unnamed: 0,id,title,release_date,overview,popularity,vote_average,vote_count,video
5,8304,5,Four Rooms,1995-12-09,It's Ted the Bellhop's first night on the job....,13.058,5.7,2062,False
6,5087,6,Judgment Night,1993-10-15,"While racing to a boxing match, Frank, Mike, J...",9.324,6.5,218,False
11,145,11,Star Wars,1977-05-25,Princess Leia is captured and held hostage by ...,77.419,8.2,16428,False
12,577,12,Finding Nemo,2003-05-30,"Nemo, an adventurous young clownfish, is unexp...",85.511,7.8,15792,False
13,21,13,Forrest Gump,1994-07-06,A man with a low IQ has accomplished great thi...,49.031,8.5,21823,False


In [18]:
# Time the lookup of index 500 after sorting
%timeit df_sort.loc[500]

43.9 µs ± 145 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


3. The index aligns data automatically
This applies to both Series and DataFrame

In [19]:
s1=pd.Series([1,2,3],index=['a','b','c'])
s2=pd.Series([2,4,5],index=['b','c','d'])

In [20]:
s1+s2

a    NaN
b    4.0
c    7.0
d    NaN
dtype: float64
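
As a follow-up sketch, the same alignment applies to DataFrames, and `Series.add` with `fill_value` avoids the NaN for labels present on only one side. The frames `df1` and `df2` are hypothetical examples, not part of the movie data.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([2, 4, 5], index=["b", "c", "d"])

# Treat a label missing on one side as 0 instead of producing NaN
filled = s1.add(s2, fill_value=0)

# Hypothetical DataFrames: addition aligns on row labels and column names
df1 = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])
df2 = pd.DataFrame({"x": [10, 20]}, index=["b", "c"])
df_sum = df1 + df2  # only row "b" exists in both, so "a" and "c" become NaN

print(filled)
print(df_sum)
```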

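4. The index supports more, and more powerful, data structures
Several powerful index types are available:
- CategoricalIndex: an index based on categorical data, which improves performance
- MultiIndex: a multi-level index, used for example for the result of a groupby over several keys
- DatetimeIndex: a datetime index with rich support for date and time methods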
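
A brief sketch of the three index types, using small hypothetical data (the `ts`, `votes`, and `cat_idx` names and values are made up for illustration):

```python
import pandas as pd

# DatetimeIndex: select a whole month with a partial date string
ts = pd.Series(
    range(4),
    index=pd.to_datetime(["2020-01-15", "2020-01-31", "2020-02-01", "2020-03-10"]),
)
january = ts.loc["2020-01"]

# MultiIndex: grouping by two keys produces a two-level index
votes = pd.DataFrame(
    {
        "year": [2019, 2019, 2020, 2020],
        "genre": ["drama", "crime", "drama", "drama"],
        "votes": [10, 20, 30, 40],
    }
)
by_year_genre = votes.groupby(["year", "genre"])["votes"].sum()
drama_2020 = by_year_genre.loc[(2020, "drama")]

# CategoricalIndex: labels are stored as codes into a small set of categories
cat_idx = pd.CategoricalIndex(["drama", "crime", "drama"])

print(january)
print(by_year_genre)
print(cat_idx.categories)
```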