# INDEXING AND SELECTING DATA

The Indexing section of the pandas documentation opens as follows.
>The axis labeling information in pandas objects serves many purposes:
- Identifies data (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display. In short, it describes the **data structure**.
- Enables automatic and explicit data alignment. It plays a role similar to a **pivot table**.
- Allows intuitive getting and setting of subsets of the data set. It makes **slicing** convenient.

In this section, we will focus on the final point: namely, how to slice, dice, and generally get and set subsets of pandas objects. The primary focus will be on Series and DataFrame as they have received more development attention in this area.

## 1.Different choices for indexing :`.loc[]`, `.iloc[]`

Getting values from an object with multi-axes selection uses the following notation (using `.loc` as an example, but the following applies to `.iloc` as well). Any of the axes accessors may be the **null slice** `:`. Axes left out of the specification are assumed to be `:`, e.g. `df.loc['a']` is equivalent to `df.loc['a', :]`.

| Object type | Indexers |
| --- | --- |
| `Series` | `s.loc[indexer]` |
| `DataFrame` | `df.loc[row_indexer, column_indexer]` |
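
The null-slice rule can be checked directly. The following is a minimal sketch; the frame `frame` and its labels are made up for illustration and do not come from the notes above.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x2 frame used only for illustration.
frame = pd.DataFrame(np.arange(6).reshape(3, 2),
                     index=list('abc'), columns=['x', 'y'])

# Omitting the column indexer is the same as passing the null slice.
row_implicit = frame.loc['a']
row_explicit = frame.loc['a', :]
same_row = row_implicit.equals(row_explicit)

# The same holds for positional indexing.
same_pos = frame.iloc[0].equals(frame.iloc[0, :])
```

Spelling out the `:` is optional, but it makes it obvious which axis each indexer applies to.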

### Selection by label: 

`.loc` is primarily label based, but may also be used with a boolean array. .loc will raise KeyError when the items are not found. 

Allowed inputs are:

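- a single label, e.g. `5` or `'a'`
- a list or array of labels, e.g. `['a', 'b', 'c']`
- a slice object with labels, e.g. `'a':'f'` (both endpoints are included)
- a boolean array
- a `callable` function with one argument (the calling Series or DataFrame), e.g. a `lambda`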



### Selection by Position: 
`.iloc` is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.

.iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing. (this conforms with Python/NumPy slice semantics).

Allowed inputs are:

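- an integer, e.g. `5`
- a list or array of integers, e.g. `[4, 3, 0]`
- a slice object with ints, e.g. `1:7`
- a boolean array
- a `callable` function with one argument (the calling Series or DataFrame), e.g. a `lambda`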


.loc, .iloc, and also [] indexing can accept a callable as indexer. See more at Selection By Callable.

---
## 2.Basics : `__getitem__` **Lower-dimensional slices**

As mentioned when introducing the data structures in the last section, the primary function of indexing with `[]` (a.k.a. `__getitem__` for those familiar with implementing class behavior in Python) is selecting out **lower-dimensional** slices. The following table shows return type values when indexing pandas objects with []:

| Object | Selection | Return value type |
| --- | --- | --- |
| `Series` | `series[label]` | scalar value |
| `DataFrame` | `df[col_name]` | `Series` corresponding to `col_name` |

In [47]:
import numpy as np
import pandas as pd

In [48]:
dates = pd.date_range('1/1/2020', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), 
                  index=dates, columns=list('ABCD'))


In [49]:
df

Unnamed: 0,A,B,C,D
2020-01-01,-0.408547,1.123157,1.67815,0.424292
2020-01-02,1.423409,0.568891,-1.315006,1.384911
2020-01-03,1.699967,1.52824,0.695293,-1.168754
2020-01-04,0.26504,-0.681416,0.603564,-0.845449
2020-01-05,0.064487,2.195352,0.342475,0.303606
2020-01-06,-0.810517,-0.518233,0.517434,-0.020261
2020-01-07,-0.462684,-2.098923,0.710486,-0.737353
2020-01-08,0.400037,-0.469473,0.549603,0.256445


In [50]:
s = df['A']

In [51]:
s

2020-01-01   -0.408547
2020-01-02    1.423409
2020-01-03    1.699967
2020-01-04    0.265040
2020-01-05    0.064487
2020-01-06   -0.810517
2020-01-07   -0.462684
2020-01-08    0.400037
Freq: D, Name: A, dtype: float64

In [44]:
s[dates[5]]

1.358169525215837

You can pass **a list of columns** to [] to select columns in that order. 

If a column is not contained in the DataFrame, an exception will be raised. 
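
As a small sketch of both statements, assuming a made-up frame `frame` with columns `A` to `D` (the label `'Z'` is deliberately absent):

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only for illustration.
frame = pd.DataFrame(np.arange(8).reshape(2, 4), columns=list('ABCD'))

# Columns come back in the order requested, not the stored order.
reordered = frame[['C', 'A']]
ordered_cols = list(reordered.columns)

# A label that is not a column raises KeyError.
try:
    frame[['A', 'Z']]
    missing_raised = False
except KeyError:
    missing_raised = True
```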

Multiple columns can also be set in this manner:

In [52]:
df_swap= df.copy()
df_swap[  ['A', 'B']  ] = ['left', 'right']

In [53]:
df_swap

Unnamed: 0,A,B,C,D
2020-01-01,left,right,1.67815,0.424292
2020-01-02,left,right,-1.315006,1.384911
2020-01-03,left,right,0.695293,-1.168754
2020-01-04,left,right,0.603564,-0.845449
2020-01-05,left,right,0.342475,0.303606
2020-01-06,left,right,0.517434,-0.020261
2020-01-07,left,right,0.710486,-0.737353
2020-01-08,left,right,0.549603,0.256445


In [54]:
df_swap[   ['B', 'A']   ] = df_swap[  ['A', 'B']  ]

In [55]:
df_swap

Unnamed: 0,A,B,C,D
2020-01-01,right,left,1.67815,0.424292
2020-01-02,right,left,-1.315006,1.384911
2020-01-03,right,left,0.695293,-1.168754
2020-01-04,right,left,0.603564,-0.845449
2020-01-05,right,left,0.342475,0.303606
2020-01-06,right,left,0.517434,-0.020261
2020-01-07,right,left,0.710486,-0.737353
2020-01-08,right,left,0.549603,0.256445


>**Warning** pandas aligns all AXES when setting a Series or DataFrame through `.loc` and `.iloc`.
The assignment below does not modify `df_swap`, because **column alignment happens before value assignment.**

In [63]:
df_swap= df.copy()
df_swap[  ['A', 'B']  ] = ['left', 'right']

In [64]:
df_swap[  ['A', 'B']  ]

Unnamed: 0,A,B
2020-01-01,left,right
2020-01-02,left,right
2020-01-03,left,right
2020-01-04,left,right
2020-01-05,left,right
2020-01-06,left,right
2020-01-07,left,right
2020-01-08,left,right


In [66]:
df_swap.loc[:, ['B', 'A']] = df_swap[['A', 'B']]  # both sides are DataFrames, so columns are aligned by label ('A' with 'A') before the values are set
df_swap[  ['A',  'B']  ]

Unnamed: 0,A,B
2020-01-01,left,right
2020-01-02,left,right
2020-01-03,left,right
2020-01-04,left,right
2020-01-05,left,right
2020-01-06,left,right
2020-01-07,left,right
2020-01-08,left,right


The correct way to swap column values is by using **raw values**

In [67]:
df_swap.loc[:, ['B', 'A']] = df_swap[['A', 'B']].to_numpy()  # assign raw values rather than a DataFrame, so no alignment takes place
df_swap[ ['A', 'B']]

Unnamed: 0,A,B
2020-01-01,right,left
2020-01-02,right,left
2020-01-03,right,left
2020-01-04,right,left
2020-01-05,right,left
2020-01-06,right,left
2020-01-07,right,left
2020-01-08,right,left


---
## 3.Attribute access: `.colname`
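
A column can be read as an attribute when its name is a valid Python identifier and does not clash with an existing method. The following is a minimal sketch; the frame and the column names `A` and `min` are made up for illustration.

```python
import pandas as pd

# Hypothetical frame; 'min' is chosen because it collides with DataFrame.min.
frame = pd.DataFrame({'A': [1, 2, 3], 'min': [7, 8, 9]})

# Attribute access returns the same Series as bracket access.
same_column = frame.A.equals(frame['A'])

# 'min' clashes with the DataFrame.min method, so the attribute is the
# method and the column is only reachable with brackets.
attr_is_method = callable(frame.min)
col_values = frame['min'].tolist()
```

Bracket access always works, so it is the safer choice in code that does not control the column names.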

---
## 4.Slicing ranges with `[]` operator

The most robust and consistent way of slicing ranges along arbitrary axes is described in the Selection by Position section detailing the .iloc method. For now, we explain the semantics of slicing using the [] operator.

With Series, the syntax works exactly as with an ndarray, returning a slice of the values and the corresponding labels:

In [15]:
s[:5]

2020-01-01   -0.086274
2020-01-02    1.195122
2020-01-03    0.502920
2020-01-04   -1.002705
2020-01-05    0.862553
Freq: D, Name: A, dtype: float64

In [16]:
s[::2]

2020-01-01   -0.086274
2020-01-03    0.502920
2020-01-05    0.862553
2020-01-07   -0.415786
Freq: 2D, Name: A, dtype: float64

In [17]:
s[::-1]

2020-01-08    0.550129
2020-01-07   -0.415786
2020-01-06   -1.239679
2020-01-05    0.862553
2020-01-04   -1.002705
2020-01-03    0.502920
2020-01-02    1.195122
2020-01-01   -0.086274
Freq: -1D, Name: A, dtype: float64

Note that setting works as well:

In [19]:
s2 = s.copy()

In [20]:
s2[:5] = 0
s2

2020-01-01    0.000000
2020-01-02    0.000000
2020-01-03    0.000000
2020-01-04    0.000000
2020-01-05    0.000000
2020-01-06   -1.239679
2020-01-07   -0.415786
2020-01-08    0.550129
Freq: D, Name: A, dtype: float64

With DataFrame, slicing inside of `[]` **slices the rows**. This is provided largely as a convenience since it is such a common operation.

In [21]:
df[:3]

Unnamed: 0,A,B,C,D
2020-01-01,left,right,-0.675281,0.36322
2020-01-02,left,right,-0.733005,0.323143
2020-01-03,left,right,0.35808,1.61709


In [22]:
df[::-1]

Unnamed: 0,A,B,C,D
2020-01-08,left,right,1.551972,0.932489
2020-01-07,left,right,0.385273,-0.523831
2020-01-06,left,right,-1.918526,0.669132
2020-01-05,left,right,0.512151,-0.572492
2020-01-04,left,right,0.196877,0.085281
2020-01-03,left,right,0.35808,1.61709
2020-01-02,left,right,-0.733005,0.323143
2020-01-01,left,right,-0.675281,0.36322


---
## 5.Selection by label: `.loc`

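pandas provides a suite of methods for purely label-based indexing. This is a strict, inclusion-based protocol:
- Every label asked for must be in the index, or a `KeyError` is raised.
- When slicing, both the start bound and the stop bound are included.
- Integers are valid labels, but they refer to the label and not the position.

`.loc` lets you index in a way that resembles referencing cells in Excel. The following are valid inputs: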
 * a single label 'a'
 * a list or array of labels ['a','b','c']
 * a slice object with labels 'a':'f'
 * a boolean array
 * a callable
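
The three rules for label-based indexing can be seen on a Series whose integer labels differ from its positions. This is a sketch with made-up data; the name `ser` is not used elsewhere in these notes.

```python
import pandas as pd

# Integer labels chosen so that labels and positions differ.
ser = pd.Series(list('abc'), index=[10, 20, 30])

by_label = ser.loc[10:20].tolist()     # label slice: both endpoints included
by_position = ser.iloc[0:1].tolist()   # position slice: stop excluded

# 0 is a valid position but not a label in this index.
try:
    ser.loc[0]
    raised = False
except KeyError:
    raised = True
```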

In [69]:
s1 = pd.Series(np.random.randn(6), index=list('abcdef'))

In [70]:
s1

a    0.757619
b    1.966498
c   -1.968928
d    0.839758
e   -0.514362
f    0.314853
dtype: float64

In [5]:
s1.loc['b']

-0.6588407061581625

In [72]:
s1.loc['c':]

c   -1.968928
d    0.839758
e   -0.514362
f    0.314853
dtype: float64

Note that setting works as well:

In [6]:
s1.loc['c':] = 0
s1

a    0.627383
b   -0.658841
c    0.000000
d    0.000000
e    0.000000
f    0.000000
dtype: float64

With a DataFrame:

In [73]:
df1 = pd.DataFrame(np.random.randn(6,4),
                  index=list('abcdef'),
                  columns=list('ABCD'))

In [74]:
df1

Unnamed: 0,A,B,C,D
a,0.178794,-2.022135,-1.375135,-1.022757
b,-0.04596,-0.300867,-0.063852,-0.009528
c,0.411428,-1.884421,0.850557,0.022043
d,0.783467,-0.159341,-1.208755,-0.187216
e,-0.575892,-0.512767,1.491305,0.514291
f,-0.041696,-0.287906,-0.090305,-0.656934


In [75]:
df1.loc[['a','c','d']]

Unnamed: 0,A,B,C,D
a,0.178794,-2.022135,-1.375135,-1.022757
c,0.411428,-1.884421,0.850557,0.022043
d,0.783467,-0.159341,-1.208755,-0.187216


In [76]:
df1.loc[['a','b']]

Unnamed: 0,A,B,C,D
a,0.178794,-2.022135,-1.375135,-1.022757
b,-0.04596,-0.300867,-0.063852,-0.009528


In [77]:
df1.loc[['a','b'], :]  #explicit column is better

Unnamed: 0,A,B,C,D
a,0.178794,-2.022135,-1.375135,-1.022757
b,-0.04596,-0.300867,-0.063852,-0.009528


Accessing via label slices:

In [78]:
df1.loc['d':, 'A':'C']

Unnamed: 0,A,B,C
d,0.783467,-0.159341,-1.208755
e,-0.575892,-0.512767,1.491305
f,-0.041696,-0.287906,-0.090305


To get a cross section (which reduces the result to a Series) using a label:

In [79]:
df1.loc['a']

A    0.178794
B   -2.022135
C   -1.375135
D   -1.022757
Name: a, dtype: float64

For getting values with a boolean array:

In [80]:
df1.loc['a']>0

A     True
B    False
C    False
D    False
Name: a, dtype: bool

In [81]:
df1.loc[:, df1.loc['a'] > 0]

Unnamed: 0,A
a,0.178794
b,-0.04596
c,0.411428
d,0.783467
e,-0.575892
f,-0.041696


To get a single value explicitly (equivalent to the deprecated `df.get_value('a', 'A')`):

In [82]:
df1.loc['a','A']

0.1787938493538499

#### Slicing With Labels: `.loc[row_indexer, col_indexer]`
When using .loc with slices, if both the start and the stop labels are present in the index, then elements _located_ between the two (including them) are returned.

In [83]:
s = pd.Series(list('abcde'), index=[0,3,2,5,4])

In [84]:
s

0    a
3    b
2    c
5    d
4    e
dtype: object

In [85]:
s.loc[3:4]

3    b
2    c
5    d
4    e
dtype: object

If at least one of the two is absent, but the index is **sorted**, and can be compared against start and stop labels, then slicing will still work as expected, by selecting labels which _rank_ between the two:

In [86]:
s.sort_index()

0    a
2    c
3    b
4    e
5    d
dtype: object

In [87]:
s  # s itself is unchanged; sort_index() returns a new Series

0    a
3    b
2    c
5    d
4    e
dtype: object

In [88]:
s.sort_index().loc[1:6]

2    c
3    b
4    e
5    d
dtype: object

However, if at least one of the two is absent and the index is not sorted, an error will be raised (since doing otherwise would be computationally expensive, as well as potentially ambiguous for mixed type indexes). For instance, in the above example, s.loc[1:6] would raise KeyError.
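
The failure case can be reproduced with the same labels as the Series above. This is a sketch that uses a fresh name, `ser`, so that it runs on its own.

```python
import pandas as pd

ser = pd.Series(list('abcde'), index=[0, 3, 2, 5, 4])

# Labels 1 and 6 are absent and the index is unsorted: KeyError.
try:
    ser.loc[1:6]
    unsorted_raised = False
except KeyError:
    unsorted_raised = True

# After sorting, the same slice selects the labels ranked between 1 and 6.
sorted_labels = ser.sort_index().loc[1:6].index.tolist()
```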

For the rationale behind this behavior, see [Endpoints are inclusive](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#advanced-endpoints-are-inclusive).

---
## 6.Selection by position

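pandas also supports purely integer-based indexing (just as Excel offers two ways of referencing cells).
NumPy and Python slicing are position based. That is simple, but it is less convenient because the author of the code has to keep the layout of the data in mind and choose the axis explicitly.
Prefer the more intuitive label-based indexing whenever labels are available.
Also keep in mind that with integer-based indexing the start bound is included, while the upper bound is excluded.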


In [23]:
s1 = pd.Series(np.random.randn(5), index=list(range(0, 10, 2)))

In [24]:
s1

0   -0.984229
2   -0.636703
4   -1.147303
6    0.847961
8   -1.068470
dtype: float64

In [25]:
s1.iloc[:3]

0   -0.984229
2   -0.636703
4   -1.147303
dtype: float64

In [26]:
s1.iloc[3]

0.8479608300585351

Note that setting works as well:

In [27]:
s1.iloc[:3] = 0

In [28]:
s1

0    0.000000
2    0.000000
4    0.000000
6    0.847961
8   -1.068470
dtype: float64

With a DataFrame:

In [29]:
df1 = pd.DataFrame(np.random.randn(6,4),
                  index=list(range(0, 12, 2)),
                  columns=list(range(0, 8, 2)))


In [30]:
df1

Unnamed: 0,0,2,4,6
0,-1.757197,0.123202,1.139341,0.719753
2,0.16334,-2.717282,-1.773719,0.138733
4,-0.680407,0.518743,1.245755,-1.229042
6,0.374214,0.709128,-0.461879,0.525768
8,0.013141,-0.773591,-0.968731,0.080641
10,1.853453,0.957776,0.056136,0.271977


Select via integer slicing:

In [31]:
df1.iloc[:3]

Unnamed: 0,0,2,4,6
0,-1.757197,0.123202,1.139341,0.719753
2,0.16334,-2.717282,-1.773719,0.138733
4,-0.680407,0.518743,1.245755,-1.229042


In [32]:
df1.iloc[1:5, 2:4]

Unnamed: 0,4,6
2,-1.773719,0.138733
4,1.245755,-1.229042
6,-0.461879,0.525768
8,-0.968731,0.080641


Select via integer list:

In [33]:
df1.iloc[[1,3,5], [1,3]]

Unnamed: 0,2,6
2,-2.717282,0.138733
6,0.709128,0.525768
10,0.957776,0.271977


In [34]:
df1.iloc[1:3, :]  # two rows starting at position 1, all columns

Unnamed: 0,0,2,4,6
2,0.16334,-2.717282,-1.773719,0.138733
4,-0.680407,0.518743,1.245755,-1.229042


In [35]:
df1.iloc[:, 1:3]  # all rows, two columns starting at position 1

Unnamed: 0,2,4
0,0.123202,1.139341
2,-2.717282,-1.773719
4,0.518743,1.245755
6,0.709128,-0.461879
8,-0.773591,-0.968731
10,0.957776,0.056136


In [36]:
df1.iloc[1]  # the row at position 1

0    0.163340
2   -2.717282
4   -1.773719
6    0.138733
Name: 2, dtype: float64

Out-of-range slice indexes are handled gracefully, just as in Python/NumPy:

In [37]:
x = list('abcdef')

In [38]:
x

['a', 'b', 'c', 'd', 'e', 'f']

In [39]:
x[4:10]  # only two items exist from position 4 ('e') onward, so 4:6 would be exact; a larger stop simply ends at the last item

['e', 'f']

In [40]:
x[8:10]

[]

In [41]:
s=pd.Series(x)

In [42]:
s

0    a
1    b
2    c
3    d
4    e
5    f
dtype: object

In [43]:
s.iloc[4:10]

4    e
5    f
dtype: object

In [44]:
s.iloc[8:10]

Series([], dtype: object)

Note that using slices that go out of bounds can result in an empty axis (e.g. an empty DataFrame being returned).

In [45]:
dfl = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))

In [46]:
dfl

Unnamed: 0,A,B
0,0.312556,0.118931
1,0.931763,-0.487888
2,0.605933,0.964204
3,-1.300026,0.285345
4,-0.360477,1.617083


In [47]:
dfl.iloc[:, 2:3]  # all rows, one column starting at position 2; with only two columns (positions 0 and 1) the result has no columns

0
1
2
3
4


In [48]:
dfl.iloc[:, 1:3]

Unnamed: 0,B
0,0.118931
1,-0.487888
2,0.964204
3,0.285345
4,1.617083


In [49]:
dfl.iloc[4:6]  # two rows starting at position 4; only five rows exist, so a single row is returned

Unnamed: 0,A,B
4,-0.360477,1.617083


A single indexer that is out of bounds will raise an `IndexError`. A list of indexers where any element is out of bounds will also raise an `IndexError`.

In [50]:
dfl.iloc[5]

IndexError: single positional indexer is out-of-bounds

In [51]:
dfl.iloc[[4,5,6]]

IndexError: positional indexers are out-of-bounds

In [52]:
dfl.iloc[:, 4]

IndexError: single positional indexer is out-of-bounds

---
## 8. Selection by callable
`.loc`, `.iloc`, and also `[]` indexing can accept a **callable as indexer**. The callable must be a function with one argument (the calling Series or DataFrame) that returns valid output for indexing.

In [89]:
df1 = pd.DataFrame(np.random.randn(6,4),
                  index=list('abcdef'),
                  columns=list('ABCD'))
df1

Unnamed: 0,A,B,C,D
a,-0.080855,1.27751,-0.737546,-0.794271
b,0.921885,-1.012873,1.564373,-0.125116
c,0.503842,-0.192807,-0.54482,-0.291089
d,-0.608056,-0.251297,0.000414,-0.353568
e,-0.671749,-0.578578,-0.947718,-1.455978
f,-1.579022,0.016129,-1.279447,1.277409


In [90]:
df1.loc[lambda df: df.A > 0, :]  # rows: the callable receives df1 itself and keeps the rows where column A > 0; columns: all

Unnamed: 0,A,B,C,D
b,0.921885,-1.012873,1.564373,-0.125116
c,0.503842,-0.192807,-0.54482,-0.291089


In [93]:
df1.loc[df1.A>0, :] #equivalent Selection by boolean array

Unnamed: 0,A,B,C,D
b,0.921885,-1.012873,1.564373,-0.125116
c,0.503842,-0.192807,-0.54482,-0.291089


In [94]:
df1.loc[:, lambda df: ['A', 'B']]

Unnamed: 0,A,B
a,-0.080855,1.27751
b,0.921885,-1.012873
c,0.503842,-0.192807
d,-0.608056,-0.251297
e,-0.671749,-0.578578
f,-1.579022,0.016129


In [57]:
df1.iloc[:, lambda df: [0,1]]

Unnamed: 0,A,B
a,1.207571,-0.848926
b,1.136135,1.050766
c,0.102336,0.088907
d,-0.27507,1.073008
e,0.34575,-0.654756
f,-1.560028,-1.45241


In [96]:
df1[lambda df: df.columns[0]]  # the callable returns 'A'

a   -0.080855
b    0.921885
c    0.503842
d   -0.608056
e   -0.671749
f   -1.579022
Name: A, dtype: float64

You can use callable indexing in a Series:

In [97]:
df1.A.loc[lambda s: s> 0]

b    0.921885
c    0.503842
Name: A, dtype: float64

Using these methods, you can chain data selection operations without using a temporary variable.

In [98]:
bb= pd.read_csv('data/baseball.csv', index_col='id')

In [103]:
(bb.groupby(['year', 'team'])).sum().loc[lambda df: df.r > 100]

year,team,stint,g,ab,r,h,X2b,X3b,hr,rbi,sb,cs,bb,so,ibb,hbp,sh,sf,gidp
2007,CIN,6,379,745,101,203,35,2,36,125.0,10.0,1.0,105,127.0,14.0,1.0,1.0,15.0,18.0
2007,DET,5,301,1062,162,283,54,4,37,144.0,24.0,7.0,97,176.0,3.0,10.0,4.0,8.0,28.0
2007,HOU,4,311,926,109,218,47,6,14,77.0,10.0,4.0,60,212.0,3.0,9.0,16.0,6.0,17.0
2007,LAN,11,413,1021,153,293,61,3,36,154.0,7.0,5.0,114,141.0,8.0,9.0,3.0,8.0,29.0
2007,NYN,13,622,1854,240,509,101,3,61,243.0,22.0,4.0,174,310.0,24.0,23.0,18.0,15.0,48.0
2007,SFN,5,482,1305,198,337,67,6,40,171.0,26.0,7.0,235,188.0,51.0,8.0,16.0,6.0,41.0
2007,TEX,2,198,729,115,200,40,4,28,115.0,21.0,4.0,73,140.0,4.0,5.0,2.0,8.0,16.0
2007,TOR,4,459,1408,187,378,96,2,58,223.0,4.0,2.0,190,265.0,16.0,12.0,4.0,16.0,38.0
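
The chain above depends on a local CSV file. A self-contained sketch of the same pattern follows; the frame `games` and its numbers are invented for illustration, and only the column names `year`, `team` and `r` mirror the baseball data.

```python
import pandas as pd

# Made-up stand-in for the baseball data.
games = pd.DataFrame({
    'year': [2007, 2007, 2007, 2007],
    'team': ['CIN', 'CIN', 'DET', 'HOU'],
    'r': [60, 50, 90, 120],
})

# Group, aggregate, and filter in one expression; the lambda receives the
# aggregated frame, so no intermediate variable is needed.
high_scoring = (
    games.groupby(['year', 'team'])
         .sum()
         .loc[lambda df: df.r > 100]
)
teams = high_scoring.index.get_level_values('team').tolist()
```

The callable is what makes the filter possible here, since the aggregated frame has no name that a plain boolean mask could refer to.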


---
## 9. IX indexer is deprecated
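
`.ix` mixed label and position lookups and was removed in pandas 1.0; `.loc` and `.iloc` are the replacements. The following sketch shows two ways to express a lookup that selects rows by position and a column by label. The frame is made up for illustration.

```python
import pandas as pd

# Hypothetical frame used only for illustration.
frame = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=list('abc'))

# Rows by position, column by label: convert the positions to labels ...
via_loc = frame.loc[frame.index[[0, 2]], 'A']

# ... or convert the column label to a position.
via_iloc = frame.iloc[[0, 2], frame.columns.get_loc('A')]
```

Both forms return the same Series, so the choice mostly depends on which side of the lookup is already known as a label.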

---
## 10. Indexing with list with missing labels is deprecated

Indexing with list with missing labels is deprecated, in favor of `.reindex`

In [62]:
s=pd.Series([1,2,3])

In [63]:
s

0    1
1    2
2    3
dtype: int64

In [64]:
s.loc[[1,2]]

1    2
2    3
dtype: int64

In [65]:
s.loc[[1,2,3]]

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike


1    2.0
2    3.0
3    NaN
dtype: float64
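
The warning and NaN-filled result shown above come from an older pandas release. A sketch of the same call on a current release, assuming pandas 1.0 or later, where the deprecation has been enforced:

```python
import pandas as pd

ser = pd.Series([1, 2, 3])

# Label 3 is missing: on pandas 1.0+ this raises instead of inserting NaN.
try:
    ser.loc[[1, 2, 3]]
    raised = False
except KeyError:
    raised = True
```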

### Reindexing
`.reindex()` is the idiomatic way to select potentially not-found elements.

See also [reindexing_basic functionality](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-reindexing)

In [66]:
s.reindex([1,2,3])

1    2.0
2    3.0
3    NaN
dtype: float64

In [67]:
labels=[1,2,3]

Alternatively, you can select only the valid keys. This is guaranteed to preserve the dtype of the selection.

In [68]:
s.loc[s.index.intersection(labels)]

1    2
2    3
dtype: int64

In [69]:
s = pd.Series(np.arange(4), index=['a','a','b','c'])

In [70]:
s

a    0
a    1
b    2
c    3
dtype: int32

In [71]:
labels = ['c','d']

In [72]:
s.reindex(labels)

ValueError: cannot reindex from a duplicate axis

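When the axis has duplicates, first select the intersection of the desired labels with the current index, then call `.reindex`. This still raises if the resulting index contains duplicates, as the second example below shows.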

In [73]:
s.loc[s.index.intersection(labels)].reindex(labels)

c    3.0
d    NaN
dtype: float64

In [74]:
labels = ['a','d']

In [75]:
s.loc[s.index.intersection(labels)]

a    0
a    1
a    0
a    1
dtype: int32

In [76]:
s.loc[s.index.intersection(labels)].reindex(labels)

ValueError: cannot reindex from a duplicate axis

---
## 11. Selecting random samples

You can take a random selection of rows or columns from a Series or DataFrame with the `sample()` method. The method samples rows by default, and accepts a specific number of rows/columns to return, or a fraction of rows.

In [77]:
s=pd.Series([0,1,2,3,4,5])

In [78]:
s.sample()

5    5
dtype: int64

In [79]:
s.sample(n=3)

4    4
3    3
2    2
dtype: int64

In [80]:
s.sample(frac=0.5)

5    5
0    0
4    4
dtype: int64

By default, sample will return each row at most once, but one can also sample with replacement using the replace option:

In [81]:
s.sample(n=6, replace=True)

5    5
1    1
2    2
2    2
1    1
0    0
dtype: int64

By default, each row has an equal probability of being selected, but if you want rows to have different probabilities, you can pass the sample function sampling weights as weights. These weights can be a list, a NumPy array, or a Series, but they must be of the same length as the object you are sampling. Missing values will be treated as a weight of zero, and inf values are not allowed. If weights do not sum to 1, they will be re-normalized by dividing all weights by the sum of the weights. For example:


In [82]:
ex_weights = pd.Series([0, 0, 0.2, 0.2, 0.2, 0.4], index=s.index, name="prob")

In [83]:
ex_weights

0    0.0
1    0.0
2    0.2
3    0.2
4    0.2
5    0.4
Name: prob, dtype: float64

In [84]:
s.sample(n=3, weights=ex_weights)

5    5
2    2
3    3
dtype: int64
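
The re-normalization and the effect of zero weights can be checked with weights that do not sum to 1. This is a sketch; the names `raw_weights` and `drawn` and the seed are arbitrary choices for illustration.

```python
import pandas as pd

ser = pd.Series([0, 1, 2, 3, 4, 5])

# Weights sum to 8, not 1; sample() rescales them internally.
raw_weights = [0, 0, 1, 1, 2, 4]

drawn = ser.sample(n=200, replace=True, weights=raw_weights, random_state=0)

# Rows 0 and 1 have weight zero, so they can never be drawn.
drawn_values = set(drawn.unique().tolist())
```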

When applied to a DataFrame, you can use a column of the DataFrame as sampling weights (provided you are sampling rows and not columns) by simply passing the name of the column as a string.

In [85]:
df2 = pd.DataFrame({'col1': [9,8,7,6],
                    'weight_column': [0.5, 0.4, 0.1, 0]})

In [86]:
df2.sample(n=3, weights='weight_column')

Unnamed: 0,col1,weight_column
1,8,0.4
0,9,0.5
2,7,0.1


sample also allows users to sample columns instead of rows using the axis argument.

In [87]:
df2.sample(n=1, axis=1)

Unnamed: 0,col1
0,9
1,8
2,7
3,6


Finally, one can also set a seed for sample’s random number generator using the random_state argument, which will accept either an integer (as a seed) or a NumPy RandomState object.

In [104]:
df4 = pd.DataFrame({'col1': [1,2,3], 'col2': [2,3,4]})

In [105]:
df4.sample(2, random_state=2)

Unnamed: 0,col1,col2
2,3,4
1,2,3


In [106]:
df4.sample(2, random_state=2)

Unnamed: 0,col1,col2
2,3,4
1,2,3


---
## 12. Setting with enlargement

The .loc/[] operations can perform enlargement when setting a non-existent key for that axis.

In the Series case this is effectively an **appending operation.**

In [107]:
se = pd.Series([1,2,3])

In [108]:
se

0    1
1    2
2    3
dtype: int64

In [109]:
se[5]=5.

In [110]:
se

0    1.0
1    2.0
2    3.0
5    5.0
dtype: float64

A DataFrame can be **enlarged** on either **axis** via .loc.

In [111]:
dfi = pd.DataFrame(np.arange(6).reshape(3,2),
                  columns=['A','B'])

In [119]:
dfi.loc[:, 'C'] = dfi.loc[:, 'A']  # enlarge column

In [120]:
dfi

Unnamed: 0,A,B,C
0,0,1,0
1,2,3,2
2,4,5,4


In [121]:
dfi.loc[3] = 5  # appending row

In [122]:
dfi

Unnamed: 0,A,B,C
0,0,1,0
1,2,3,2
2,4,5,4
3,5,5,5


---
## 13. Fast scalar value getting and setting

Since indexing with `[]` must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you’re asking for. 

If you only want to access a scalar value, the **fastest** way is to use the `.at` and `.iat` methods, which are implemented on all of the data structures.

Similarly to `loc`, `.at` provides label-based **scalar lookups**, while `.iat` provides integer-based lookups analogously to `iloc`.

In [129]:
s = pd.Series(range(6))

In [131]:
s

0    0
1    1
2    2
3    3
4    4
5    5
dtype: int64

In [130]:
s.iat[5]

5
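
With the default index, labels and positions coincide, so `.at` and `.iat` return the same value. A short sketch with a deliberately reversed index (made up for illustration) shows the difference:

```python
import pandas as pd

# Labels deliberately differ from positions.
ser = pd.Series([10, 20, 30], index=[2, 1, 0])

by_label = ser.at[2]      # label 2 is the first element
by_position = ser.iat[2]  # position 2 is the last element
```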

You can also set using these same indexers.

In [157]:
dates = pd.date_range('2000-01-01', periods=8)

In [158]:
df = pd.DataFrame(np.random.randn(8,4),
                 index=dates, 
                 columns=list('ABCD'))

In [159]:
df.at[dates[5], 'A']

0.36593562502442434

In [160]:
df.loc[dates[5], 'A']

0.36593562502442434

In [161]:
df.iat[3,0]

-0.2920614981792712

In [162]:
df.at[dates[5], 'E']=7
df.iat[3,0] = 7

In [163]:
df

Unnamed: 0,A,B,C,D,E
2000-01-01,0.4062,-0.296639,2.235388,-0.79124,
2000-01-02,1.973969,1.333666,0.926866,-1.169578,
2000-01-03,-0.506551,0.738186,-0.746852,0.011261,
2000-01-04,7.0,2.806684,1.485592,-0.598927,
2000-01-05,0.117411,0.492342,-0.773951,0.833298,
2000-01-06,0.365936,0.162427,-1.31642,1.474309,7.0
2000-01-07,1.074966,-1.10859,-0.397353,-0.339021,
2000-01-08,0.882472,0.077286,1.182555,0.135684,


`at` may **enlarge** the object in-place as above if the indexer is missing.

In [166]:
df.at[dates[-1] + pd.Timedelta('1 day'), 0] = 7

In [167]:
df

Unnamed: 0,A,B,C,D,E,0
2000-01-01,0.4062,-0.296639,2.235388,-0.79124,,
2000-01-02,1.973969,1.333666,0.926866,-1.169578,,
2000-01-03,-0.506551,0.738186,-0.746852,0.011261,,
2000-01-04,7.0,2.806684,1.485592,-0.598927,,
2000-01-05,0.117411,0.492342,-0.773951,0.833298,,
2000-01-06,0.365936,0.162427,-1.31642,1.474309,7.0,
2000-01-07,1.074966,-1.10859,-0.397353,-0.339021,,
2000-01-08,0.882472,0.077286,1.182555,0.135684,,
2000-01-09,,,,,,7.0


---
## 14. Boolean indexing
Another common operation is the use of boolean vectors to filter the data. The operators are `|` for `or`, `&` for `and`, and `~` for `not`.
These __must__ be grouped by using parentheses, since by default Python will evaluate an expression such as `df.A > 2 & df.B < 3` as `df.A > (2 & df.B) < 3`, while the desired evaluation order is `(df.A > 2) & (df.B < 3)`.
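
The precedence problem can be reproduced on a small integer frame. This is a sketch with made-up data; with integer columns the unparenthesized form fails when Python tries to take the truth value of a Series in the chained comparison.

```python
import pandas as pd

# Hypothetical frame used only for illustration.
frame = pd.DataFrame({'A': [1, 3, 5], 'B': [1, 2, 4]})

# Parenthesized: each comparison is evaluated first, then combined.
mask = (frame.A > 2) & (frame.B < 3)
selected = frame[mask]

# Unparenthesized: parsed as frame.A > (2 & frame.B) < 3, a chained
# comparison that needs the truth value of a Series and raises ValueError.
try:
    frame.A > 2 & frame.B < 3
    raised = False
except ValueError:
    raised = True
```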

In [111]:
s = pd.Series(range(-3,4))

In [112]:
range(-3,4)

range(-3, 4)

In [113]:
s

0   -3
1   -2
2   -1
3    0
4    1
5    2
6    3
dtype: int64

In [114]:
s[s>0]

4    1
5    2
6    3
dtype: int64

In [115]:
s[(s<-1)|(s>0.5)]

0   -3
1   -2
4    1
5    2
6    3
dtype: int64

In [116]:
s[~(s<0)]

3    0
4    1
5    2
6    3
dtype: int64

In [117]:
df[df['A'] > 0]

Unnamed: 0,A,B,C,D,E,0
2000-01-03,0.01328,-0.119128,-0.450226,-0.085095,,
2000-01-04,0.195814,-1.864385,-1.318207,-1.500732,,
2000-01-05,1.123464,-0.703644,-1.237762,1.49568,,
2000-01-08,0.763204,1.451818,-1.058464,-0.468414,,


In [118]:
df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
                    'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
                    'c': np.random.randn(7)})

In [119]:
df2

Unnamed: 0,a,b,c
0,one,x,0.198419
1,one,y,-0.801844
2,two,y,0.213706
3,three,x,0.857568
4,two,y,0.153562
5,one,x,0.143885
6,six,x,-0.406033


List comprehensions and the `map` method of Series can also be used to produce more complex criteria:

In [120]:
criterion = df2['a'].map(lambda x: x.startswith('t'))

In [121]:
criterion

0    False
1    False
2     True
3     True
4     True
5    False
6    False
Name: a, dtype: bool

In [122]:
df2[criterion]

Unnamed: 0,a,b,c
2,two,y,0.213706
3,three,x,0.857568
4,two,y,0.153562


In [123]:
#equivalent but slower
df2[[x.startswith('t') for x in df2['a']]]

Unnamed: 0,a,b,c
2,two,y,0.213706
3,three,x,0.857568
4,two,y,0.153562


In [124]:
#Multiple criteria
df2[criterion & (df2['b'] == 'x')]

Unnamed: 0,a,b,c
3,three,x,0.857568


In [125]:
df2.loc[lambda df: df.a.map(lambda x: x.startswith('t'))]

Unnamed: 0,a,b,c
2,two,y,0.213706
3,three,x,0.857568
4,two,y,0.153562


In [126]:
df2.loc[criterion & (df2['b'] == 'x'), 'b':'c']

Unnamed: 0,b,c
3,x,0.857568


---
## 15. Indexing with isin
Consider the `isin()` method of Series, which returns a boolean vector that is true wherever the Series elements exist in the passed list. This allows you to select rows where one or more columns have the values you want.

In [127]:
s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')

In [128]:
s

4    0
3    1
2    2
1    3
0    4
dtype: int64

In [129]:
s.isin([2, 4, 6])

4    False
3    False
2     True
1    False
0     True
dtype: bool

In [130]:
s[s.isin([2, 4, 6])]

2    2
0    4
dtype: int64

In [131]:
s[s.index.isin([2, 4, 6])]

4    0
2    2
dtype: int64

In [132]:
s.reindex([2, 4, 6])

2    2.0
4    0.0
6    NaN
dtype: float64

In [133]:
s_mi = pd.Series(np.arange(6),
                index=pd.MultiIndex.from_product([[0,1],['a','b','c']]))

In [134]:
s_mi

0  a    0
   b    1
   c    2
1  a    3
   b    4
   c    5
dtype: int32

In [135]:
s_mi.iloc[s_mi.index.isin([(1,'a'), (2,'b'), (0,'c')])]

0  c    2
1  a    3
dtype: int32

In [136]:
s_mi.iloc[s_mi.index.isin(['a','c','e'], level=1)]

0  a    0
   c    2
1  a    3
   c    5
dtype: int32

In [137]:
df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'],
                   'ids2': ['a', 'n', 'c', 'n']})

In [138]:
df

Unnamed: 0,vals,ids,ids2
0,1,a,a
1,2,b,n
2,3,f,c
3,4,n,n


In [139]:
values=['a','b',1,3]

In [140]:
df.isin(values)

Unnamed: 0,vals,ids,ids2
0,True,True,True
1,False,True,False
2,True,False,False
3,False,False,False


In [141]:
values={'ids':['a','b'], 'vals':[1,3]}

In [142]:
df.isin(values)

Unnamed: 0,vals,ids,ids2
0,True,True,False
1,False,True,False
2,True,False,False
3,False,False,False


In [143]:
values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}

In [144]:
row_mask = df.isin(values).all(axis=1)

In [145]:
df.loc[row_mask]

Unnamed: 0,vals,ids,ids2
0,1,a,a


---
## 16. The `.where()` Method and Masking
Selecting values from a series with a boolean vector generally returns a subset of the data. To guarantee that selection output has the __same shape as the original data__, you can use the where() method in Series and DataFrame.

In [163]:
s

4    0
3    1
2    2
1    3
0    4
dtype: int64

In [146]:
s[s>0]

3    1
2    2
1    3
0    4
dtype: int64

In [147]:
s.where(s>0)

4    NaN
3    1.0
2    2.0
1    3.0
0    4.0
dtype: float64

Selecting values from a DataFrame with a boolean criterion also preserves the shape of the input data. `where()` is used under the hood as the implementation. The code below is equivalent to `df.where(df < 0)`.

In [148]:
dates = pd.date_range('20010101', periods=8)
df = pd.DataFrame(np.random.randn(8,3),
                 index=dates)

In [149]:
df

Unnamed: 0,0,1,2
2001-01-01,-2.540347,-0.007532,-1.194449
2001-01-02,-0.128392,1.047398,-0.157081
2001-01-03,-0.440818,-0.697807,1.157269
2001-01-04,-1.016749,1.398112,-0.336502
2001-01-05,-0.770575,1.294242,-0.205928
2001-01-06,0.670972,-0.176181,0.196063
2001-01-07,1.145125,0.938324,0.878154
2001-01-08,-0.25102,0.853689,0.493412


In [150]:
df[df< 0]

Unnamed: 0,0,1,2
2001-01-01,-2.540347,-0.007532,-1.194449
2001-01-02,-0.128392,,-0.157081
2001-01-03,-0.440818,-0.697807,
2001-01-04,-1.016749,,-0.336502
2001-01-05,-0.770575,,-0.205928
2001-01-06,,-0.176181,
2001-01-07,,,
2001-01-08,-0.25102,,


In [151]:
df.where(df<0)

Unnamed: 0,0,1,2
2001-01-01,-2.540347,-0.007532,-1.194449
2001-01-02,-0.128392,,-0.157081
2001-01-03,-0.440818,-0.697807,
2001-01-04,-1.016749,,-0.336502
2001-01-05,-0.770575,,-0.205928
2001-01-06,,-0.176181,
2001-01-07,,,
2001-01-08,-0.25102,,


In [152]:
df.where(df<0, other=-df)

Unnamed: 0,0,1,2
2001-01-01,-2.540347,-0.007532,-1.194449
2001-01-02,-0.128392,-1.047398,-0.157081
2001-01-03,-0.440818,-0.697807,-1.157269
2001-01-04,-1.016749,-1.398112,-0.336502
2001-01-05,-0.770575,-1.294242,-0.205928
2001-01-06,-0.670972,-0.176181,-0.196063
2001-01-07,-1.145125,-0.938324,-0.878154
2001-01-08,-0.25102,-0.853689,-0.493412


You may wish to set values based on some boolean criteria. This can be done intuitively like so:

In [153]:
s2=s.copy()

In [154]:
s2[s2<0] = 0

In [155]:
s2

4    0
3    1
2    2
1    3
0    4
dtype: int64

In [156]:
df2 = df.copy()

In [157]:
df2[df2 <0] = 0

In [158]:
df2

Unnamed: 0,0,1,2
2001-01-01,0.0,0.0,0.0
2001-01-02,0.0,1.047398,0.0
2001-01-03,0.0,0.0,1.157269
2001-01-04,0.0,1.398112,0.0
2001-01-05,0.0,1.294242,0.0
2001-01-06,0.670972,0.0,0.196063
2001-01-07,1.145125,0.938324,0.878154
2001-01-08,0.0,0.853689,0.493412


By default, where returns a modified copy of the data. There is an optional parameter inplace so that the original data can be modified without creating a copy:

In [159]:
df_orig = df.copy()

In [160]:
df_orig

Unnamed: 0,0,1,2
2001-01-01,-2.540347,-0.007532,-1.194449
2001-01-02,-0.128392,1.047398,-0.157081
2001-01-03,-0.440818,-0.697807,1.157269
2001-01-04,-1.016749,1.398112,-0.336502
2001-01-05,-0.770575,1.294242,-0.205928
2001-01-06,0.670972,-0.176181,0.196063
2001-01-07,1.145125,0.938324,0.878154
2001-01-08,-0.25102,0.853689,0.493412


In [161]:
df_orig.where(df > 0, other=-df, inplace=True)

In [162]:
df_orig

Unnamed: 0,0,1,2
2001-01-01,2.540347,0.007532,1.194449
2001-01-02,0.128392,1.047398,0.157081
2001-01-03,0.440818,0.697807,1.157269
2001-01-04,1.016749,1.398112,0.336502
2001-01-05,0.770575,1.294242,0.205928
2001-01-06,0.670972,0.176181,0.196063
2001-01-07,1.145125,0.938324,0.878154
2001-01-08,0.25102,0.853689,0.493412
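
The default copy behaviour can be checked on a small frame. The following is only a minimal sketch: the frame `demo` and its values are made up for illustration and are not part of the notebook above.

```python
import pandas as pd

# Hypothetical frame; the values are invented for this sketch.
demo = pd.DataFrame({"A": [-1.0, 2.0], "B": [3.0, -4.0]})
snapshot = demo.copy()

# Without inplace=True, where() returns a new object:
# positive entries are kept, the rest are taken from `other` (here -demo).
flipped = demo.where(demo > 0, other=-demo)

# The original frame is untouched, and the result matches demo.abs().
unchanged = demo.equals(snapshot)
same_as_abs = flipped.equals(demo.abs())
print(unchanged, same_as_abs)
```

Keeping the default and assigning the result to a new name is usually easier to follow than mutating in place, since the original stays available for comparison.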


### Alignment
Furthermore, `where` aligns the input boolean condition (ndarray or DataFrame), such that **partial selection** with setting is possible.
This works because pandas is label-oriented!

In [173]:
df2 = df.copy()

In [174]:
df2[1:4]  > 0 

Unnamed: 0,0,1,2
2001-01-02,False,True,False
2001-01-03,False,False,True
2001-01-04,False,True,False


In [175]:
df2[df2[1:4] > 0]

Unnamed: 0,0,1,2
2001-01-01,,,
2001-01-02,,1.047398,
2001-01-03,,,1.157269
2001-01-04,,1.398112,
2001-01-05,,,
2001-01-06,,,
2001-01-07,,,
2001-01-08,,,


In [178]:
df2[df2[1:4] > 0] = 'Fuck'

In [179]:
df2

Unnamed: 0,0,1,2
2001-01-01,-2.540347,-0.00753169,-1.19445
2001-01-02,-0.128392,Fuck,-0.157081
2001-01-03,-0.440818,-0.697807,Fuck
2001-01-04,-1.016749,Fuck,-0.336502
2001-01-05,-0.770575,1.29424,-0.205928
2001-01-06,0.670972,-0.176181,0.196063
2001-01-07,1.145125,0.938324,0.878154
2001-01-08,-0.25102,0.853689,0.493412
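
The alignment step can also be seen on a smaller, self-contained frame. This is a sketch under assumptions: the frame, its string index and the replacement value `0.0` are all hypothetical and chosen only to keep the result easy to read.

```python
import pandas as pd

# Hypothetical frame with a labelled index.
frame = pd.DataFrame(
    {"A": [-1.0, 2.0, 3.0, -4.0], "B": [5.0, -6.0, 7.0, 8.0]},
    index=list("wxyz"),
)

# The condition only covers rows "x" and "y".
partial_cond = frame.iloc[1:3] > 0

# Selection: rows missing from the condition are treated as False,
# so they come back as NaN.
selected = frame.where(partial_cond)

# Setting: only cells where the aligned condition is True are overwritten;
# rows missing from the condition keep their original values.
updated = frame.copy()
updated[partial_cond] = 0.0

print(selected)
print(updated)
```

Because the condition is matched by label rather than by position, it does not need to have the same shape as the frame it is applied to.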


In [186]:
df.columns = ['A','B','C']

In [187]:
df2 = df.copy()

In [188]:
cond = df2>0

In [194]:
df2.where(cond)

Unnamed: 0,A,B,C
2001-01-01,,,
2001-01-02,,1.047398,
2001-01-03,,,1.157269
2001-01-04,,1.398112,
2001-01-05,,1.294242,
2001-01-06,0.670972,,0.196063
2001-01-07,1.145125,0.938324,0.878154
2001-01-08,,0.853689,0.493412


In [195]:
df2.where(cond, other=df2['A'], axis='index')  # other=df2['A'] is a Series, so axis tells pandas whether to align it along the index or along the columns

Unnamed: 0,A,B,C
2001-01-01,-2.540347,-2.540347,-2.540347
2001-01-02,-0.128392,1.047398,-0.128392
2001-01-03,-0.440818,-0.440818,1.157269
2001-01-04,-1.016749,1.398112,-1.016749
2001-01-05,-0.770575,1.294242,-0.770575
2001-01-06,0.670972,0.670972,0.196063
2001-01-07,1.145125,0.938324,0.878154
2001-01-08,-0.25102,0.853689,0.493412


This is equivalent to (but faster than) the following:

In [196]:
df.apply(lambda x, y: x.where(x > 0, y), y=df['A'])

Unnamed: 0,A,B,C
2001-01-01,-2.540347,-2.540347,-2.540347
2001-01-02,-0.128392,1.047398,-0.128392
2001-01-03,-0.440818,-0.440818,1.157269
2001-01-04,-1.016749,1.398112,-1.016749
2001-01-05,-0.770575,1.294242,-0.770575
2001-01-06,0.670972,0.670972,0.196063
2001-01-07,1.145125,0.938324,0.878154
2001-01-08,-0.25102,0.853689,0.493412
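
The equivalence can be checked directly rather than by comparing the two outputs by eye. The sketch below uses a small hypothetical frame (not the `df` from the notebook); the parameter name `fallback` is also just an illustrative choice.

```python
import pandas as pd

# Hypothetical frame; the values are invented for this sketch.
frame = pd.DataFrame(
    {"A": [-1.0, 2.0, -3.0], "B": [4.0, -5.0, 6.0], "C": [-7.0, 8.0, 9.0]}
)

# One call: non-positive cells are filled from column "A", aligned on the index.
via_where = frame.where(frame > 0, other=frame["A"], axis="index")

# Column-by-column version of the same idea using apply().
via_apply = frame.apply(
    lambda col, fallback: col.where(col > 0, fallback), fallback=frame["A"]
)

print(via_where.equals(via_apply))
```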


### Mask
`mask` is the inverse boolean operation of `where`: it replaces values where the condition is True instead of where it is False.

In [197]:
s

4    0
3    1
2    2
1    3
0    4
dtype: int64

In [199]:
s.mask(s >= 3)

4    0.0
3    1.0
2    2.0
1    NaN
0    NaN
dtype: float64

In [200]:
s.where(s >= 3)

4    NaN
3    NaN
2    NaN
1    3.0
0    4.0
dtype: float64

In [201]:
df.mask(df >= 0)

Unnamed: 0,A,B,C
2001-01-01,-2.540347,-0.007532,-1.194449
2001-01-02,-0.128392,,-0.157081
2001-01-03,-0.440818,-0.697807,
2001-01-04,-1.016749,,-0.336502
2001-01-05,-0.770575,,-0.205928
2001-01-06,,-0.176181,
2001-01-07,,,
2001-01-08,-0.25102,,
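
One way to read "inverse" is that `mask(cond)` gives the same result as `where(~cond)`. The following minimal sketch checks that on a Series built to look like `s` above; the names `ser` and `cond` are hypothetical.

```python
import pandas as pd

# A Series shaped like `s` in the cells above (rebuilt here so the sketch is self-contained).
ser = pd.Series([0, 1, 2, 3, 4], index=[4, 3, 2, 1, 0])
cond = ser >= 3

masked = ser.mask(cond)           # hides values where cond is True
inverted_where = ser.where(~cond)  # keeps values where cond is False

# The two results are identical, and together with where(cond)
# they recover every original value.
recombined = masked.fillna(ser.where(cond))
print(masked.equals(inverted_where))
```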


---
## 17. The `.query()` Method

---
## 18. Duplicate data

---
## 19. Dictionary-like `.get()` method

---
## 20. The `.lookup()` Method

---
## 21. Index object

---
## 22. Set/reset Index

---
## 23. Returning a view versus a copy

When setting values in a pandas object, care must be taken to avoid what is called *chained indexing*. Here is an example:

In [203]:
dfmi = pd.DataFrame([list('abcd'),
                     list('efgh'),
                     list('ijkl'),
                     list('mnop')],
                   columns=pd.MultiIndex.from_product([['one','two'],
                                                       ['first','second']]))

In [204]:
dfmi

Unnamed: 0_level_0,one,one,two,two
Unnamed: 0_level_1,first,second,first,second
0,a,b,c,d
1,e,f,g,h
2,i,j,k,l
3,m,n,o,p


Compare these two access methods:

In [206]:
dfmi['one']['second']

0    b
1    f
2    j
3    n
Name: second, dtype: object

In [207]:
dfmi.loc[:, ('one', 'second')]

0    b
1    f
2    j
3    n
Name: (one, second), dtype: object

These both yield the same results, so which should you use? It is instructive to understand the order of operations on these and why method 2 (`.loc`) is much preferred over method 1 (chained `[]`).

`dfmi['one']` selects the first level of the columns and returns a DataFrame that is singly-indexed. Then another Python operation, `dfmi_with_one['second']`, selects the Series indexed by `'second'`. The intermediate result is written here as the variable `dfmi_with_one` because pandas sees these operations as separate events, i.e. separate calls to `__getitem__`, so it has to treat them as linear operations that happen one after another.

Contrast this with `dfmi.loc[:, ('one', 'second')]`, which passes a nested tuple of `(slice(None), ('one', 'second'))` to a single call to `__getitem__`. This allows pandas to deal with this as a single entity. Furthermore, this order of operations can be significantly faster, and it allows one to index both axes if so desired.

In [None]:
dfmi.loc[:, ('one', 'second')] = 