Indexing, Selecting & Assigning

In [3]:
import pandas as pd
pd.set_option('display.max_rows', 5)

Indexing and selection allow us to access specific parts of a dataset for focused analysis.

Native accessors

Columns in a DataFrame can be accessed using their labels or a list of labels.
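
As a minimal sketch of these access styles, the toy frame below is selected by attribute, by a single label, and by a list of labels. Its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from train_FD001.txt
toy = pd.DataFrame({"engine_id": [1, 1, 2], "cycle": [1, 2, 1], "op1": [0.1, 0.2, 0.3]})

by_attr = toy.engine_id                  # attribute access returns a Series
by_label = toy["engine_id"]              # a single label also returns a Series
by_list = toy[["engine_id", "cycle"]]    # a list of labels returns a DataFrame
```

A single label gives back a Series, while a list of labels (even a list with one entry) gives back a DataFrame.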

In [4]:
df = pd.read_csv("../dataset/train_FD001.txt", sep=r"\s+", header=None)
cols = (
    ["engine_id", "cycle", "op1", "op2", "op3"] +
    [f"sensor{i}" for i in range(1, 22)]
)

df.columns = cols

In [5]:
df.head()

Unnamed: 0,engine_id,cycle,op1,op2,op3,sensor1,sensor2,sensor3,sensor4,sensor5,...,sensor12,sensor13,sensor14,sensor15,sensor16,sensor17,sensor18,sensor19,sensor20,sensor21
0,1,1,-0.0007,-0.0004,100.0,518.67,641.82,1589.7,1400.6,14.62,...,521.66,2388.02,8138.62,8.4195,0.03,392,2388,100.0,39.06,23.419
1,1,2,0.0019,-0.0003,100.0,518.67,642.15,1591.82,1403.14,14.62,...,522.28,2388.07,8131.49,8.4318,0.03,392,2388,100.0,39.0,23.4236
2,1,3,-0.0043,0.0003,100.0,518.67,642.35,1587.99,1404.2,14.62,...,522.42,2388.03,8133.23,8.4178,0.03,390,2388,100.0,38.95,23.3442
3,1,4,0.0007,0.0,100.0,518.67,642.35,1582.79,1401.87,14.62,...,522.86,2388.08,8133.83,8.3682,0.03,392,2388,100.0,38.88,23.3739
4,1,5,-0.0019,-0.0002,100.0,518.67,642.37,1582.85,1406.22,14.62,...,522.19,2388.04,8133.8,8.4294,0.03,393,2388,100.0,38.9,23.4044


In [9]:
df.engine_id

0          1
1          1
        ... 
20629    100
20630    100
Name: engine_id, Length: 20631, dtype: int64

The indexing operator [] has the advantage that it can handle column names containing reserved characters, such as spaces, which attribute access cannot.

In [11]:
df['engine_id']

0          1
1          1
        ... 
20629    100
20630    100
Name: engine_id, Length: 20631, dtype: int64
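
The advantage of the bracket operator can be sketched on a small frame. The column names "engine id" and "count" are hypothetical, chosen only because one contains a space and the other collides with a DataFrame method.

```python
import pandas as pd

# Hypothetical column names chosen to show where attribute access falls short
toy = pd.DataFrame({"engine id": [1, 2], "count": [10, 20]})

ids = toy["engine id"]    # works: the label is just a string
counts = toy["count"]     # works: returns the column as a Series
method = toy.count        # attribute access finds the DataFrame.count method, not the column
```

Writing toy.engine id would be a syntax error, and toy.count silently refers to the method, so brackets are the safer choice for such names.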

In [15]:
df['engine_id'][3]

np.int64(1)

Indexing 

Index-Based Selection with `.iloc`
Pandas provides `.iloc` for selecting data by integer position rather than by label.
Slices follow Python's own conventions: the start position is included and the end position is excluded.


In [16]:
df.iloc[0]

engine_id     1.000
cycle         1.000
              ...  
sensor20     39.060
sensor21     23.419
Name: 0, Length: 26, dtype: float64

In [17]:
df.iloc[:, 0]

0          1
1          1
        ... 
20629    100
20630    100
Name: engine_id, Length: 20631, dtype: int64

In [18]:
df.iloc[:3, 0]

0    1
1    1
2    1
Name: engine_id, dtype: int64

In [19]:
df.iloc[1:3, 0]

1    1
2    1
Name: engine_id, dtype: int64

In [20]:
df.iloc[[0, 1, 2], 0]

0    1
1    1
2    1
Name: engine_id, dtype: int64

In [21]:
df.iloc[-5:]

Unnamed: 0,engine_id,cycle,op1,op2,op3,sensor1,sensor2,sensor3,sensor4,sensor5,...,sensor12,sensor13,sensor14,sensor15,sensor16,sensor17,sensor18,sensor19,sensor20,sensor21
20626,100,196,-0.0004,-0.0003,100.0,518.67,643.49,1597.98,1428.63,14.62,...,519.49,2388.26,8137.6,8.4956,0.03,397,2388,100.0,38.49,22.9735
20627,100,197,-0.0016,-0.0005,100.0,518.67,643.54,1604.5,1433.58,14.62,...,519.68,2388.22,8136.5,8.5139,0.03,395,2388,100.0,38.3,23.1594
20628,100,198,0.0004,0.0,100.0,518.67,643.42,1602.46,1428.18,14.62,...,520.01,2388.24,8141.05,8.5646,0.03,398,2388,100.0,38.44,22.9333
20629,100,199,-0.0011,0.0003,100.0,518.67,643.23,1605.26,1426.53,14.62,...,519.67,2388.23,8139.29,8.5389,0.03,395,2388,100.0,38.29,23.064
20630,100,200,-0.0032,-0.0005,100.0,518.67,643.85,1600.38,1432.14,14.62,...,519.3,2388.26,8137.33,8.5036,0.03,396,2388,100.0,38.37,23.0522
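
The positional behaviour of `.iloc` shown in the cells above can be sketched on a toy frame. The string index labels are an assumption added here to make it visible that `.iloc` ignores labels entirely.

```python
import pandas as pd

toy = pd.DataFrame(
    {"engine_id": [1, 1, 1, 2], "cycle": [1, 2, 3, 1]},
    index=["a", "b", "c", "d"],   # hypothetical labels; .iloc does not look at them
)

first_row = toy.iloc[0]       # the first row, returned as a Series
first_two = toy.iloc[:2, 0]   # positions 0 and 1 of the first column; position 2 is excluded
last_two = toy.iloc[-2:]      # negative positions count from the end
```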


Label-based selection

In [22]:
df.loc[0, 'engine_id']

np.int64(1)
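
One difference from `.iloc` is worth a small sketch: `.loc` slices by label and includes the end label. The toy frame below uses made-up values and the default integer index, so the same slice 0:2 returns a different number of rows under each accessor.

```python
import pandas as pd

toy = pd.DataFrame({"engine_id": [1, 1, 1, 2], "cycle": [1, 2, 3, 1]})  # hypothetical values

by_position = toy.iloc[0:2]                      # positions 0 and 1
by_label = toy.loc[0:2]                          # labels 0, 1 and 2: the end label is included
subset = toy.loc[0:1, ["engine_id", "cycle"]]    # rows by label, columns by name
```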

Manipulating the index

The set_index() method replaces the default integer index with the values of a column. Here is what happens when we call set_index on the cycle field.

In [23]:
df.set_index("cycle")

cycle,engine_id,op1,op2,op3,sensor1,sensor2,sensor3,sensor4,sensor5,sensor6,...,sensor12,sensor13,sensor14,sensor15,sensor16,sensor17,sensor18,sensor19,sensor20,sensor21
1,1,-0.0007,-0.0004,100.0,518.67,641.82,1589.70,1400.60,14.62,21.61,...,521.66,2388.02,8138.62,8.4195,0.03,392,2388,100.0,39.06,23.4190
2,1,0.0019,-0.0003,100.0,518.67,642.15,1591.82,1403.14,14.62,21.61,...,522.28,2388.07,8131.49,8.4318,0.03,392,2388,100.0,39.00,23.4236
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199,100,-0.0011,0.0003,100.0,518.67,643.23,1605.26,1426.53,14.62,21.61,...,519.67,2388.23,8139.29,8.5389,0.03,395,2388,100.0,38.29,23.0640
200,100,-0.0032,-0.0005,100.0,518.67,643.85,1600.38,1432.14,14.62,21.61,...,519.30,2388.26,8137.33,8.5036,0.03,396,2388,100.0,38.37,23.0522
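
Two properties of set_index() are easy to miss and can be sketched on a toy frame with made-up values: it returns a new DataFrame and leaves the original untouched, and the resulting index does not have to be unique. Cycle numbers repeat across engines, so a label lookup can return several rows.

```python
import pandas as pd

toy = pd.DataFrame({"engine_id": [1, 1, 2], "cycle": [1, 2, 1], "op1": [0.1, 0.2, 0.3]})  # hypothetical values

indexed = toy.set_index("cycle")   # returns a new DataFrame; toy keeps its cycle column
cycle_one = indexed.loc[1]         # every row whose index label is 1, one per engine here
```

To keep the result, assign it to a name as above; calling df.set_index("cycle") on its own only displays it.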


Conditional selection

In [24]:
df.engine_id == '1'

0        False
1        False
         ...  
20629    False
20630    False
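Name: engine_id, Length: 20631, dtype: bool

Every entry is False because engine_id is stored as int64, and an integer never equals the string '1'. Comparing against the integer instead (df.engine_id == 1) would mark the rows of engine 1 as True.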

In [25]:
df.loc[df.engine_id == '1']

Unnamed: 0,engine_id,cycle,op1,op2,op3,sensor1,sensor2,sensor3,sensor4,sensor5,...,sensor12,sensor13,sensor14,sensor15,sensor16,sensor17,sensor18,sensor19,sensor20,sensor21
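
A minimal sketch of the difference between the string and the integer comparison, on a toy frame with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({"engine_id": [1, 1, 2], "cycle": [1, 2, 1]})  # hypothetical values

as_string = toy.engine_id == "1"   # int64 compared with a str: no element matches
as_int = toy.engine_id == 1        # int64 compared with an int: matches engine 1
engine_one = toy.loc[as_int]       # the boolean mask keeps only the True rows
```

Checking df.dtypes before writing a condition is a cheap way to avoid this kind of silent mismatch.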


We can use the ampersand (&) to combine two conditions, keeping only the rows where both are True. The result here is empty because engine_id == '1' is never True.

In [28]:
df.loc[(df.engine_id == '1') & (df.cycle >= 150)]

Unnamed: 0,engine_id,cycle,op1,op2,op3,sensor1,sensor2,sensor3,sensor4,sensor5,...,sensor12,sensor13,sensor14,sensor15,sensor16,sensor17,sensor18,sensor19,sensor20,sensor21


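The pipe (|) expresses or: a row is kept when at least one condition is True. Because engine_id == '1' is never True here, only the rows with cycle >= 150 are returned.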

In [29]:
df.loc[(df.engine_id == '1') | (df.cycle >= 150)]

Unnamed: 0,engine_id,cycle,op1,op2,op3,sensor1,sensor2,sensor3,sensor4,sensor5,...,sensor12,sensor13,sensor14,sensor15,sensor16,sensor17,sensor18,sensor19,sensor20,sensor21
149,1,150,0.0010,-0.0003,100.0,518.67,643.06,1589.01,1409.22,14.62,...,520.50,2388.21,8122.99,8.4726,0.03,394,2388,100.0,38.68,23.2082
150,1,151,-0.0019,-0.0001,100.0,518.67,642.82,1592.39,1411.94,14.62,...,520.94,2388.22,8127.21,8.4612,0.03,394,2388,100.0,38.56,23.2277
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20629,100,199,-0.0011,0.0003,100.0,518.67,643.23,1605.26,1426.53,14.62,...,519.67,2388.23,8139.29,8.5389,0.03,395,2388,100.0,38.29,23.0640
20630,100,200,-0.0032,-0.0005,100.0,518.67,643.85,1600.38,1432.14,14.62,...,519.30,2388.26,8137.33,8.5036,0.03,396,2388,100.0,38.37,23.0522
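
How & and | combine masks can be sketched on a toy frame with made-up values, this time comparing engine_id against an integer. The parentheses are required because & and | bind more tightly than == and >= in Python.

```python
import pandas as pd

toy = pd.DataFrame({"engine_id": [1, 1, 2, 2], "cycle": [100, 160, 100, 160]})  # hypothetical values

both = toy.loc[(toy.engine_id == 1) & (toy.cycle >= 150)]     # engine 1 and a late cycle
either = toy.loc[(toy.engine_id == 1) | (toy.cycle >= 150)]   # engine 1 or a late cycle
```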


isin lets you select data whose value "is in" a list of values.

In [30]:
df.loc[df.engine_id.isin([1, 2])]

Unnamed: 0,engine_id,cycle,op1,op2,op3,sensor1,sensor2,sensor3,sensor4,sensor5,...,sensor12,sensor13,sensor14,sensor15,sensor16,sensor17,sensor18,sensor19,sensor20,sensor21
0,1,1,-0.0007,-0.0004,100.0,518.67,641.82,1589.70,1400.60,14.62,...,521.66,2388.02,8138.62,8.4195,0.03,392,2388,100.0,39.06,23.4190
1,1,2,0.0019,-0.0003,100.0,518.67,642.15,1591.82,1403.14,14.62,...,522.28,2388.07,8131.49,8.4318,0.03,392,2388,100.0,39.00,23.4236
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
477,2,286,-0.0010,-0.0003,100.0,518.67,643.44,1603.63,1429.57,14.62,...,519.51,2388.22,8169.97,8.4932,0.03,395,2388,100.0,38.33,23.0169
478,2,287,-0.0005,0.0006,100.0,518.67,643.85,1608.50,1430.84,14.62,...,519.81,2388.21,8175.57,8.5365,0.03,398,2388,100.0,38.43,23.0848
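
As a sketch on made-up values, isin can be read as shorthand for a chain of == comparisons joined with |, and prefixing the mask with ~ inverts it.

```python
import pandas as pd

toy = pd.DataFrame({"engine_id": [1, 2, 3, 2]})  # hypothetical values

selected = toy.loc[toy.engine_id.isin([1, 2])]
equivalent = toy.loc[(toy.engine_id == 1) | (toy.engine_id == 2)]   # same rows, written out by hand
excluded = toy.loc[~toy.engine_id.isin([1, 2])]                     # ~ keeps the rows not in the list
```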


Assigning data

In [31]:
df['engine_condition'] = 'operational'
df['engine_condition']


0        operational
1        operational
            ...     
20629    operational
20630    operational
Name: engine_condition, Length: 20631, dtype: str

In [32]:
df['high_cycle'] = df['cycle'] > 100
df['high_cycle']


0        False
1        False
         ...  
20629     True
20630     True
Name: high_cycle, Length: 20631, dtype: bool

In [33]:
df['remaining_steps'] = range(len(df), 0, -1)
df['remaining_steps']


0        20631
1        20630
         ...  
20629        2
20630        1
Name: remaining_steps, Length: 20631, dtype: int64
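
The three assignments above follow two rules that can be sketched on a toy frame with made-up values: a scalar is broadcast to every row, while an iterable must supply exactly one value per row. The too_short column name is hypothetical and exists only to trigger the length check.

```python
import pandas as pd

toy = pd.DataFrame({"engine_id": [1, 1, 2], "cycle": [1, 2, 1]})  # hypothetical values

toy["engine_condition"] = "operational"           # scalar: repeated for every row
toy["high_cycle"] = toy["cycle"] > 1              # boolean Series computed from another column
toy["remaining_steps"] = range(len(toy), 0, -1)   # iterable: one value per row

try:
    toy["too_short"] = [1, 2]                     # two values for three rows
    length_error = None
except ValueError as exc:
    length_error = exc                            # pandas rejects the mismatched length
```

Note that range(len(df), 0, -1) counts down over the whole frame, so the values do not restart for each engine.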