## Indexing and Selecting Data

Axis labels in pandas serve several purposes in machine learning and data science work.

**Major advantages of indexing and selecting data with pandas**

* Identifying specific data

* Enabling explicit and automatic data alignment

* Allowing intuitive getting and setting of data


### Different Choices available for Indexing

**1. Selection by Label (.loc)**

* It is primarily label-based indexing.

* It can also be used with a boolean array.

* It raises a KeyError if the items are not found.

* Allowed inputs for .loc:

    * A single label, e.g., 9 or 'c' (9 is interpreted as a label of the index, not as an integer position)
    
    * A list or array of labels, e.g., [1, 2, 3] or ['a', 'b', 'c']
    
    * A slice object with labels, e.g., 'a':'f' (both the start and the stop are included)
    
    * A boolean array
    
    * A callable function with one argument that returns valid output for indexing
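
The allowed inputs listed above can be sketched on a small example. The Series `s` and its labels are hypothetical and exist only for illustration; they are not part of the dataset used later in this notebook.

```python
import pandas as pd

# Hypothetical toy Series with string labels, used only for illustration
s = pd.Series([10, 20, 30, 40, 50, 60], index=list('abcdef'))

single = s.loc['c']                     # a single label returns a scalar
several = s.loc[['a', 'b', 'c']]        # a list of labels returns a Series
sliced = s.loc['b':'d']                 # a label slice includes both endpoints
masked = s.loc[s > 30]                  # a boolean array aligned with the index
called = s.loc[lambda x: x % 20 == 0]   # a callable that returns a valid indexer

# A label that is not present in the index raises KeyError
missing = False
try:
    s.loc['z']
except KeyError:
    missing = True
```

Note that the label slice `'b':'d'` returns three rows, unlike a positional slice, which would exclude the stop.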
    
 
**2. Selection by Position (.iloc)**

* It is primarily integer-position based (from 0 to length-1 of the axis).

* It can also be used with a boolean array.

* It raises an IndexError if a requested indexer is out of bounds.

* Slice indexers are the exception: out-of-bounds slices are allowed and are truncated to the available range, as in Python and NumPy slicing.

* Allowed inputs for .iloc:

    * An integer, e.g., 5
    
    * A list or array of integers, e.g., [3, 4, 5, 6]
    
    * A slice object with integers, e.g., 1:7
    
    * A boolean array
    
    * A callable function with one argument that returns valid output for indexing
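
A minimal sketch of the positional inputs above, again on a hypothetical toy Series. It also contrasts an out-of-bounds slice, which is silently truncated, with an out-of-bounds integer, which raises IndexError.

```python
import pandas as pd

# Hypothetical toy Series, used only for illustration
s = pd.Series([10, 20, 30, 40, 50], index=list('abcde'))

first = s.iloc[0]                        # a single integer position
picked = s.iloc[[1, 3]]                  # a list of integer positions
middle = s.iloc[1:4]                     # positions 1, 2, 3 (the stop is excluded)
masked = s.iloc[(s > 30).to_numpy()]     # a boolean NumPy array
ends = s.iloc[lambda x: [0, -1]]         # a callable returning positions

# An out-of-bounds slice is allowed and simply truncated ...
beyond = s.iloc[3:100]

# ... whereas an out-of-bounds integer raises IndexError
out_of_bounds = False
try:
    s.iloc[100]
except IndexError:
    out_of_bounds = True
```

The boolean mask is converted with `to_numpy()` because `.iloc` works on positions and does not align a boolean Series by its labels.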
    

**3. Selection by Callable (.loc, .iloc, [])**

* Values are retrieved from an object with multi-axes selection, using the notation below (shown for .loc; the same applies to .iloc).

* Any of the axis accessors may be the null slice (:).

* For Example &nbsp; &nbsp;          **Series :** &nbsp; s.loc[indexer]

* For Example &nbsp; &nbsp;        **DataFrame :** &nbsp; df.loc[row_indexer, column_indexer]
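
The multi-axes notation can be illustrated as follows. The frame `df_demo`, its row labels, and its column names are assumptions made up for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x4 frame holding the values 0..11, used only for illustration
df_demo = pd.DataFrame(np.arange(12).reshape(3, 4),
                       index=['r0', 'r1', 'r2'],
                       columns=list('ABCD'))

rows_only = df_demo.loc[['r0', 'r2']]        # column indexer omitted
rows_null = df_demo.loc[['r0', 'r2'], :]     # same selection with an explicit null slice
cols_only = df_demo.loc[:, ['A', 'C']]       # null slice on the row axis
cell = df_demo.loc['r1', 'B']                # one row label and one column label
block = df_demo.iloc[0:2, 1:3]               # the same notation with positions
```

Omitting the column indexer and passing the null slice `:` select the same data, which is why `rows_only` and `rows_null` are equal.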

### Constructing a Time Series Data

In [2]:
# importing the necessary packages

import pandas as pd
import numpy as np

In [3]:
# creating a time series using the date_range method from pandas
time_series = pd.date_range('1/1/1990', periods=10)

# creating a dataframe using DataFrame method from pandas 
# first we are specifying the values using random function
# then secondly we are specifying the index values as time series
# finally, we are providing names for the columns
df = pd.DataFrame(np.random.randn(10, 5),   # 10 refers to the number of rows
                                            # 5 refers to the number of columns
                  index=time_series,
                  columns=['Column 1', 'Column 2', 'Column 3', 'Column 4', 'Column 5'])

In [4]:
# lets check the dataframe that we have just created 

df

Unnamed: 0,Column 1,Column 2,Column 3,Column 4,Column 5
1990-01-01,-0.437065,-0.173568,-1.051065,-0.421785,1.085173
1990-01-02,-1.441536,0.885908,2.145456,-0.734285,0.178335
1990-01-03,0.885411,1.21625,-0.099258,0.917426,-1.152994
1990-01-04,-1.031629,-2.467834,-0.372809,-1.007372,-1.032365
1990-01-05,-0.106526,1.094225,-0.09156,-0.207691,1.243627
1990-01-06,-0.323139,-0.329337,-0.491954,0.21633,-0.448961
1990-01-07,0.662646,-1.614075,1.226858,1.436355,-0.082745
1990-01-08,1.01898,-0.096794,-0.040299,-1.289262,0.65368
1990-01-09,0.625459,1.333686,-0.992327,0.357914,1.04434
1990-01-10,0.3722,-2.097976,1.100345,1.144849,-0.136923


### Indexing

In [5]:
df['Column 1']

1990-01-01   -0.437065
1990-01-02   -1.441536
1990-01-03    0.885411
1990-01-04   -1.031629
1990-01-05   -0.106526
1990-01-06   -0.323139
1990-01-07    0.662646
1990-01-08    1.018980
1990-01-09    0.625459
1990-01-10    0.372200
Freq: D, Name: Column 1, dtype: float64

In [6]:
df['Column 5']

1990-01-01    1.085173
1990-01-02    0.178335
1990-01-03   -1.152994
1990-01-04   -1.032365
1990-01-05    1.243627
1990-01-06   -0.448961
1990-01-07   -0.082745
1990-01-08    0.653680
1990-01-09    1.044340
1990-01-10   -0.136923
Freq: D, Name: Column 5, dtype: float64

In [7]:
# timeseries specific indexing

col1 = df['Column 1']
col1[time_series[3]]

-1.0316291689596224

In [8]:
# selecting multiple columns by passing a list of column labels

df[['Column 1', 'Column 2', 'Column 3']]

Unnamed: 0,Column 1,Column 2,Column 3
1990-01-01,-0.437065,-0.173568,-1.051065
1990-01-02,-1.441536,0.885908,2.145456
1990-01-03,0.885411,1.21625,-0.099258
1990-01-04,-1.031629,-2.467834,-0.372809
1990-01-05,-0.106526,1.094225,-0.09156
1990-01-06,-0.323139,-0.329337,-0.491954
1990-01-07,0.662646,-1.614075,1.226858
1990-01-08,1.01898,-0.096794,-0.040299
1990-01-09,0.625459,1.333686,-0.992327
1990-01-10,0.3722,-2.097976,1.100345


### Creating a DataFrame from a Dictionary

In [9]:
# creating a dataframe using the dictionary structure
x = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5]})

# selection by label (.loc is label based)
# assigning new values to the row labelled 1
x.loc[1] = {'x': 9, 'y': 99}

# printing the resultant dataframe
print(x)

   x   y
0  1   3
1  9  99
2  3   5


### Slicing the Rows

In [18]:
# getting every second row of the dataframe (a slice with a step of 2)

df[::2]

Unnamed: 0,Column 1,Column 2,Column 3,Column 4,Column 5
1990-01-01,-0.437065,-0.173568,-1.051065,-0.421785,1.085173
1990-01-03,0.885411,1.21625,-0.099258,0.917426,-1.152994
1990-01-05,-0.106526,1.094225,-0.09156,-0.207691,1.243627
1990-01-07,0.662646,-1.614075,1.226858,1.436355,-0.082745
1990-01-09,0.625459,1.333686,-0.992327,0.357914,1.04434


In [31]:
# trying to get the last five rows of the dataset

df[5:]

Unnamed: 0,Column 1,Column 2,Column 3,Column 4,Column 5
1990-01-06,-0.211976,-1.677285,0.986574,1.22421,2.150838
1990-01-07,-0.034812,0.161018,-0.66123,-0.491391,1.15405
1990-01-08,-2.505332,1.018269,0.070691,0.818273,0.174375
1990-01-09,1.456221,0.464277,-1.431023,0.337144,-0.56323
1990-01-10,-0.507146,-1.354816,-0.708028,1.116605,1.127737


In [32]:
# getting the rows at positions 3 through 7 (the stop position 8 is excluded)

df[3:8]

Unnamed: 0,Column 1,Column 2,Column 3,Column 4,Column 5
1990-01-04,0.466653,-0.093075,2.057947,0.284638,-0.554037
1990-01-05,-0.908633,0.033583,0.084177,-1.41998,0.314567
1990-01-06,-0.211976,-1.677285,0.986574,1.22421,2.150838
1990-01-07,-0.034812,0.161018,-0.66123,-0.491391,1.15405
1990-01-08,-2.505332,1.018269,0.070691,0.818273,0.174375


In [33]:
# reversing the row order of the dataframe

df[::-1]

Unnamed: 0,Column 1,Column 2,Column 3,Column 4,Column 5
1990-01-10,-0.507146,-1.354816,-0.708028,1.116605,1.127737
1990-01-09,1.456221,0.464277,-1.431023,0.337144,-0.56323
1990-01-08,-2.505332,1.018269,0.070691,0.818273,0.174375
1990-01-07,-0.034812,0.161018,-0.66123,-0.491391,1.15405
1990-01-06,-0.211976,-1.677285,0.986574,1.22421,2.150838
1990-01-05,-0.908633,0.033583,0.084177,-1.41998,0.314567
1990-01-04,0.466653,-0.093075,2.057947,0.284638,-0.554037
1990-01-03,-1.394191,-0.903202,0.544889,-0.782389,-0.227692
1990-01-02,1.932264,-0.604755,0.875935,0.295972,0.319658
1990-01-01,-0.003666,-1.691704,-1.279483,-0.990529,0.235384
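
The row slices above are positional. As a hedged sketch on a hypothetical frame with the same kind of daily index, the snippet below shows that an integer slice with `[]` matches `.iloc`, while a label slice with `.loc` includes its stop label.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a daily DatetimeIndex, used only for illustration
dates = pd.date_range('1990-01-01', periods=10)
frame = pd.DataFrame({'value': np.arange(10)}, index=dates)

by_position = frame[3:8]                          # positions 3 through 7
same_with_iloc = frame.iloc[3:8]                  # explicit positional equivalent
by_label = frame.loc['1990-01-04':'1990-01-08']   # label slice, stop label included
```

All three selections return the same five rows here, because the label slice ends on the date stored at position 7.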


### Slicing the Columns

In [20]:
# trying to get the column 1 only with all the rows

df.loc[:, 'Column 1']

1990-01-01   -0.437065
1990-01-02   -1.441536
1990-01-03    0.885411
1990-01-04   -1.031629
1990-01-05   -0.106526
1990-01-06   -0.323139
1990-01-07    0.662646
1990-01-08    1.018980
1990-01-09    0.625459
1990-01-10    0.372200
Freq: D, Name: Column 1, dtype: float64

In [36]:
# trying to get the column1 and column2 with all the rows

df.loc[:, ['Column 1', 'Column 2']]

Unnamed: 0,Column 1,Column 2
1990-01-01,-0.003666,-1.691704
1990-01-02,1.932264,-0.604755
1990-01-03,-1.394191,-0.903202
1990-01-04,0.466653,-0.093075
1990-01-05,-0.908633,0.033583
1990-01-06,-0.211976,-1.677285
1990-01-07,-0.034812,0.161018
1990-01-08,-2.505332,1.018269
1990-01-09,1.456221,0.464277
1990-01-10,-0.507146,-1.354816


In [38]:
# trying to get the last column of the dataframe

df.iloc[:,-1]

1990-01-01    0.235384
1990-01-02    0.319658
1990-01-03   -0.227692
1990-01-04   -0.554037
1990-01-05    0.314567
1990-01-06    2.150838
1990-01-07    1.154050
1990-01-08    0.174375
1990-01-09   -0.563230
1990-01-10    1.127737
Freq: D, Name: Column 5, dtype: float64

In [40]:
# selecting all the rows of the dataframe from column 1 to column 4

df.loc[:, 'Column 1':'Column 4']

Unnamed: 0,Column 1,Column 2,Column 3,Column 4
1990-01-01,-0.003666,-1.691704,-1.279483,-0.990529
1990-01-02,1.932264,-0.604755,0.875935,0.295972
1990-01-03,-1.394191,-0.903202,0.544889,-0.782389
1990-01-04,0.466653,-0.093075,2.057947,0.284638
1990-01-05,-0.908633,0.033583,0.084177,-1.41998
1990-01-06,-0.211976,-1.677285,0.986574,1.22421
1990-01-07,-0.034812,0.161018,-0.66123,-0.491391
1990-01-08,-2.505332,1.018269,0.070691,0.818273
1990-01-09,1.456221,0.464277,-1.431023,0.337144
1990-01-10,-0.507146,-1.354816,-0.708028,1.116605


In [41]:
# comparing a column with a scalar returns a boolean Series

df['Column 1'] > 1

1990-01-01    False
1990-01-02     True
1990-01-03    False
1990-01-04    False
1990-01-05    False
1990-01-06    False
1990-01-07    False
1990-01-08    False
1990-01-09     True
1990-01-10    False
Freq: D, Name: Column 1, dtype: bool

In [43]:
# getting the data based on the condition on the dataframe

df[df['Column 1'] > 1]

Unnamed: 0,Column 1,Column 2,Column 3,Column 4,Column 5
1990-01-02,1.932264,-0.604755,0.875935,0.295972,0.319658
1990-01-09,1.456221,0.464277,-1.431023,0.337144,-0.56323
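
Boolean conditions can also be combined before they are used as an indexer. The sketch below uses a hypothetical frame named `scores`. Each condition is wrapped in parentheses because `&`, `|`, and `~` bind more tightly than comparison operators in Python.

```python
import pandas as pd

# Hypothetical data, used only for illustration
scores = pd.DataFrame({'a': [1, 5, 3, 8], 'b': [4, 2, 9, 7]})

both = scores[(scores['a'] > 2) & (scores['b'] > 5)]     # element-wise AND
either = scores[(scores['a'] > 6) | (scores['b'] < 3)]   # element-wise OR
negated = scores[~(scores['a'] > 2)]                     # element-wise NOT
```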


### Selection based on position

In [47]:
# indexing based on position

df.iloc[3:8, 2:5]

Unnamed: 0,Column 3,Column 4,Column 5
1990-01-04,2.057947,0.284638,-0.554037
1990-01-05,0.084177,-1.41998,0.314567
1990-01-06,0.986574,1.22421,2.150838
1990-01-07,-0.66123,-0.491391,1.15405
1990-01-08,0.070691,0.818273,0.174375


In [48]:
# indexing based on specific points

df.iloc[[1, 2, 8], [1, 4]]

Unnamed: 0,Column 2,Column 5
1990-01-02,-0.604755,0.319658
1990-01-03,-0.903202,-0.227692
1990-01-09,0.464277,-0.56323


In [49]:
# selecting values by passing a callable to .loc (label based, not position based)

df['Column 2'].loc[lambda x: x > 0]

1990-01-05    0.033583
1990-01-07    0.161018
1990-01-08    1.018269
1990-01-09    0.464277
Name: Column 2, dtype: float64

### Indexing with isin

In [52]:
# creating a series using the series method from pandas

s = pd.Series(np.arange(5),
              index=np.arange(5)[::-1],
              dtype='int64')

# using isin method to get the boolean values
s.isin([2, 4, 6])

4    False
3    False
2     True
1    False
0     True
dtype: bool

In [53]:
# getting the values instead

s[s.isin([1, 2, 3])]

3    1
2    2
1    3
dtype: int64

In [54]:
# operation to return only the selected rows

s[s>1]

2    2
1    3
0    4
dtype: int64

### Use of Where Method

In [56]:
# operation to return the same number of rows

s.where(s>1)

4    NaN
3    NaN
2    2.0
1    3.0
0    4.0
dtype: float64
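
By default, where fills the positions that fail the condition with NaN, which is why the result above is upcast to float64. A minimal sketch with a hypothetical Series shows the optional replacement value and the complementary mask method.

```python
import pandas as pd

# Hypothetical toy Series, used only for illustration
s_demo = pd.Series([0, 1, 2, 3, 4])

kept = s_demo.where(s_demo > 1)          # failing positions become NaN
filled = s_demo.where(s_demo > 1, -1)    # failing positions are replaced by -1
inverse = s_demo.mask(s_demo > 1, -1)    # mask replaces where the condition holds
```

Unlike boolean indexing, all three results keep the original length and index of `s_demo`.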

In [59]:
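# element-wise comparison of DataFrame.where and numpy.where
# the two conditions differ (df > 1 versus df < 0), so the results agree only in some cells
df.where(df > 1, -df) == np.where(df < 0, df, -df)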


Unnamed: 0,Column 1,Column 2,Column 3,Column 4,Column 5
1990-01-01,False,False,False,False,True
1990-01-02,False,False,True,True,True
1990-01-03,False,False,True,False,False
1990-01-04,True,False,False,True,False
1990-01-05,False,True,True,False,True
1990-01-06,False,False,True,False,False
1990-01-07,False,True,False,False,False
1990-01-08,False,False,True,True,True
1990-01-09,False,True,False,True,False
1990-01-10,False,False,False,False,False


### The Query Method

In [68]:
# using the query method

# creating a dataframe with 20 rows and columns with names a, b, c, d, and e.
df = pd.DataFrame(np.random.rand(20, 5), columns=list('abcde'))

# writing our query i.e., a<b and b<c
df.query('(a < b) & (b < c)')

Unnamed: 0,a,b,c,d,e
2,0.521978,0.847324,0.977072,0.333716,0.071769
5,0.05234,0.186719,0.371316,0.837859,0.708874
6,0.377804,0.780182,0.922601,0.207939,0.104135
7,0.004283,0.116739,0.747088,0.388032,0.2717
12,0.013863,0.329824,0.33263,0.156754,0.948943


In [69]:
# performing the same query without using the query method

df[(df.a < df.b) & (df.b < df.c)]

Unnamed: 0,a,b,c,d,e
2,0.521978,0.847324,0.977072,0.333716,0.071769
5,0.05234,0.186719,0.371316,0.837859,0.708874
6,0.377804,0.780182,0.922601,0.207939,0.104135
7,0.004283,0.116739,0.747088,0.388032,0.2717
12,0.013863,0.329824,0.33263,0.156754,0.948943
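
The equivalence of the two cells above can be checked programmatically. The sketch below uses a seeded random generator so that it is reproducible. The frame `data` and the variable `threshold` are assumptions for illustration, and the `@` prefix is how query refers to a local Python variable.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible; the data itself is arbitrary
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((20, 3)), columns=list('abc'))

via_query = data.query('(a < b) & (b < c)')
via_mask = data[(data['a'] < data['b']) & (data['b'] < data['c'])]

# Referring to a local variable inside the query string with the @ prefix
threshold = 0.5
above = data.query('a > @threshold')
```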


### Multi Index Query

In [70]:
# exploring the multi index query method

# making arrays of colors and foods using the random.choice function from numpy
colors = np.random.choice(['red', 'green'], size=20)
foods = np.random.choice(['eggs', 'ham'], size=20)

In [71]:
# lets check the colors array

colors

array(['green', 'red', 'red', 'green', 'red', 'green', 'green', 'green',
       'red', 'green', 'red', 'green', 'red', 'green', 'red', 'green',
       'green', 'red', 'red', 'green'], dtype='<U5')

In [72]:
# lets check the foods array

foods

array(['eggs', 'eggs', 'eggs', 'ham', 'eggs', 'eggs', 'eggs', 'eggs',
       'ham', 'eggs', 'ham', 'eggs', 'eggs', 'eggs', 'ham', 'ham', 'ham',
       'eggs', 'ham', 'eggs'], dtype='<U4')

In [74]:
# building a MultiIndex from the two arrays, with named levels

index = pd.MultiIndex.from_arrays([colors, foods], names=['color', 'food'])
index

MultiIndex([('green', 'eggs'),
            (  'red', 'eggs'),
            (  'red', 'eggs'),
            ('green',  'ham'),
            (  'red', 'eggs'),
            ('green', 'eggs'),
            ('green', 'eggs'),
            ('green', 'eggs'),
            (  'red',  'ham'),
            ('green', 'eggs'),
            (  'red',  'ham'),
            ('green', 'eggs'),
            (  'red', 'eggs'),
            ('green', 'eggs'),
            (  'red',  'ham'),
            ('green',  'ham'),
            ('green',  'ham'),
            (  'red', 'eggs'),
            (  'red',  'ham'),
            ('green', 'eggs')],
           names=['color', 'food'])

In [76]:
# adding some data to the dataframe

# one row per index entry (n was undefined in the original cell)
df = pd.DataFrame(np.random.randn(len(index), 2), index=index)
df

Unnamed: 0_level_0,Unnamed: 1_level_0,0,1
color,food,Unnamed: 2_level_1,Unnamed: 3_level_1
green,eggs,-0.408826,0.34843
red,eggs,-1.07741,1.960878
red,eggs,1.011858,1.407959
green,ham,-2.165852,0.766006
red,eggs,-0.527807,-0.58338
green,eggs,-1.683552,1.166294
green,eggs,0.924701,0.571976
green,eggs,-0.346824,0.442086
red,ham,-2.448306,-0.12808
green,eggs,1.851817,-0.895124


In [77]:
# let's query only when the color is red.

df.query('color == "red"')

Unnamed: 0_level_0,Unnamed: 1_level_0,0,1
color,food,Unnamed: 2_level_1,Unnamed: 3_level_1
red,eggs,-1.07741,1.960878
red,eggs,1.011858,1.407959
red,eggs,-0.527807,-0.58338
red,ham,-2.448306,-0.12808
red,ham,-0.200797,1.229525
red,eggs,-0.779116,-0.162368
red,ham,0.67016,0.175116
red,eggs,-0.497182,0.278269
red,ham,-0.515286,0.253234


In [79]:
# lets query only when the food is ham

df.query('food == "ham"')

Unnamed: 0_level_0,Unnamed: 1_level_0,0,1
color,food,Unnamed: 2_level_1,Unnamed: 3_level_1
green,ham,-2.165852,0.766006
red,ham,-2.448306,-0.12808
red,ham,-0.200797,1.229525
red,ham,0.67016,0.175116
green,ham,-0.669305,0.790545
green,ham,-0.435462,-0.592302
red,ham,-0.515286,0.253234
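
Because the random arrays above change on every run, here is a small deterministic sketch of the same idea. The frame `menu` and its `price` column are hypothetical. Conditions on several named index levels can be combined in a single query string.

```python
import pandas as pd

# Hypothetical, fixed MultiIndex with named levels, used only for illustration
idx = pd.MultiIndex.from_arrays(
    [['red', 'red', 'green', 'green'], ['eggs', 'ham', 'eggs', 'ham']],
    names=['color', 'food'])
menu = pd.DataFrame({'price': [1.0, 2.0, 3.0, 4.0]}, index=idx)

# Index level names can be used in the query string like column names
red_ham = menu.query('color == "red" and food == "ham"')
not_green = menu.query('color != "green"')
```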


### Use of in and not in Operators with Query method

In [87]:
# making a dataframe

df = pd.DataFrame({'Column_1': list('aabbccddeef'),
                   'Column_2': list('aaaabbbbccc'),
                   'Column_3': np.random.randint(5, size=11),
                   'Column_4': np.random.randint(9, size=11)
                  })

# printing the dataframe
print(df)

   Column_1 Column_2  Column_3  Column_4
0         a        a         0         2
1         a        a         1         4
2         b        a         2         7
3         b        a         2         1
4         c        b         4         8
5         c        b         3         4
6         d        b         1         1
7         d        b         2         8
8         e        c         4         3
9         e        c         4         0
10        f        c         4         5


In [88]:
# get all rows where columns "Column_1" and "Column_2" have overlapping values

df.query('Column_1 in Column_2')

Unnamed: 0,Column_1,Column_2,Column_3,Column_4
0,a,a,0,2
1,a,a,1,4
2,b,a,2,7
3,b,a,2,1
4,c,b,4,8
5,c,b,3,4


In [95]:
# get all rows where the value of "Column_1" does not appear anywhere in "Column_2"

df.query('Column_1 not in Column_2')

Unnamed: 0,Column_1,Column_2,Column_3,Column_4
6,d,b,1,1
7,d,b,2,8
8,e,c,4,3
9,e,c,4,0
10,f,c,4,5


In [98]:
# using "in" and "and" operators at the same time

df.query('Column_1 in Column_2 and Column_3 < Column_4')

Unnamed: 0,Column_1,Column_2,Column_3,Column_4
0,a,a,0,2
1,a,a,1,4
2,b,a,2,7
4,c,b,4,8
5,c,b,3,4
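
The in and not in operators in a query string correspond to isin outside of query. The sketch below rebuilds the two letter columns under the hypothetical names `left` and `right`, and compares both spellings.

```python
import pandas as pd

# Same letter columns as above, under hypothetical names
letters = pd.DataFrame({'left': list('aabbccddeef'),
                        'right': list('aaaabbbbccc')})

in_query = letters.query('left in right')
in_mask = letters[letters['left'].isin(letters['right'])]

not_in_query = letters.query('left not in right')
not_in_mask = letters[~letters['left'].isin(letters['right'])]
```

The test is membership in the whole `right` column, not a row-by-row comparison, which is why row 2 ('b' next to 'a') is still selected by the in query.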


### The Lookup Method

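* To extract a set of values given a sequence of row labels and a sequence of column labels, the lookup method can be used; it returns a NumPy array. Note that DataFrame.lookup was deprecated in pandas 1.2.0 and removed in pandas 2.0.0, so the cells below require an older pandas version. For instance: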

In [101]:
# exploring the lookup method

# creating a dataframe for the operation
df= pd.DataFrame(np.random.rand(20, 4),
                 columns = ['A', 'B', 'C', 'D'])

# let's check the dataframe created by us
print(df)

           A         B         C         D
0   0.570840  0.395620  0.199326  0.666648
1   0.653718  0.752156  0.397719  0.014774
2   0.135247  0.094875  0.270221  0.316469
3   0.345519  0.743585  0.212749  0.657107
4   0.870950  0.840104  0.787402  0.365160
5   0.976516  0.370821  0.803845  0.324834
6   0.941608  0.282396  0.700177  0.140920
7   0.661355  0.888109  0.925090  0.097509
8   0.187827  0.103927  0.683807  0.751939
9   0.986279  0.888407  0.134285  0.870976
10  0.873067  0.408993  0.326492  0.169234
11  0.205198  0.875111  0.725407  0.322938
12  0.651413  0.831462  0.762170  0.762231
13  0.123559  0.704337  0.515483  0.857105
14  0.209482  0.512466  0.742059  0.128738
15  0.039759  0.848629  0.592560  0.614235
16  0.518451  0.109659  0.143195  0.773832
17  0.129688  0.959038  0.962294  0.422354
18  0.271099  0.509175  0.265339  0.905698
19  0.816178  0.140315  0.835810  0.890499


In [104]:
# applying the lookup method

df.lookup(list(range(0, 8, 2)), 
          ['A', 'B', 'C', 'D'])

array([0.57084022, 0.09487485, 0.78740206, 0.14091985])
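
Since DataFrame.lookup is no longer available in pandas 2.0 and later, one possible replacement is to translate the labels into positions and use NumPy integer-array indexing. This is a sketch under that assumption, not the only way to do it, and the frame `table` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical 8x4 frame holding the values 0.0..31.0, used only for illustration
table = pd.DataFrame(np.arange(32, dtype=float).reshape(8, 4),
                     columns=list('ABCD'))

row_labels = [0, 2, 4, 6]
col_labels = ['A', 'B', 'C', 'D']

# Translate the labels into integer positions on each axis
row_pos = table.index.get_indexer(row_labels)
col_pos = table.columns.get_indexer(col_labels)

# Pair the positions up element-wise: (0, 'A'), (2, 'B'), (4, 'C'), (6, 'D')
picked = table.to_numpy()[row_pos, col_pos]
```

`get_indexer` returns -1 for labels it cannot find, so in real code it may be worth checking for negative positions before indexing.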