# Pandas Indexing

## Indexing using `.iloc`

`.iloc` → Integer-location based indexing  
Think: **“I” for Index number (position)**  
Used to access rows and columns **strictly by numeric positions**, similar to Python list indexing.

**Syntax:**  
`df.iloc[row_index, column_index]`

`.iloc` uses **zero-based indexing** and **excludes the stop value** (just like Python slicing).  
Only **integer positions** are allowed – labels will not work here.  
Useful when you need rows or columns by position regardless of the index type, for example when the index carries no meaningful labels.
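
A minimal sketch of position-based selection on a toy DataFrame (the data and the name `toy` are made up for illustration and are not part of the stock file used in this notebook). It shows the excluded stop position, single-cell access, and negative positions:

```python
import pandas as pd

# Hypothetical toy data with string labels, so positions and labels clearly differ
toy = pd.DataFrame(
    {"open": [10.0, 11.0, 12.0, 13.0], "close": [10.5, 10.8, 12.4, 12.9]},
    index=["a", "b", "c", "d"],
)

first_two = toy.iloc[:2]   # rows at positions 0 and 1; stop position 2 is excluded
cell = toy.iloc[1, 0]      # row position 1, column position 0
last_row = toy.iloc[-1]    # negative positions count from the end, as in a list

print(first_two)
print(cell)
print(last_row)
```

Note that the labels `"a"` to `"d"` play no role here; `.iloc` only looks at positions.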

In [1]:
import numpy as np
import pandas as pd

ksm = pd.read_csv('stock_detailed.csv')

In [2]:
ksm

,Symbol,Series,Date,Prev Close,Open Price,High Price,Low Price,Last Price,Close Price,Average Price,Total Traded Quantity,No. of Trades
0,KSM,EQ,02-Apr-18,1131.8,1141.0,1149.55,1121.3,1136.7,1137.15,1135.86,4036351,142078
1,KSM,EQ,03-Apr-18,1137.15,1134.7,1143.55,1128.1,1139.4,1140.45,1135.21,2038584,114034
2,KSM,EQ,04-Apr-18,1140.45,1144.0,1144.55,1120.0,1122.7,1124.2,1131.81,2406651,137029
3,KSM,EQ,05-Apr-18,1124.2,1139.55,1151.3,1129.1,1146.05,1147.55,1140.71,3881772,101745
4,KSM,EQ,06-Apr-18,1147.55,1143.0,1146.0,1122.1,1127.55,1127.0,1128.81,2968871,137277
5,KSM,EQ,09-Apr-18,1127.0,1125.0,1125.8,1106.55,1111.45,1111.25,1113.12,3601441,171118
6,KSM,EQ,10-Apr-18,1111.25,1112.0,1124.5,1105.4,1114.65,1113.4,1115.77,4463029,144468
7,KSM,EQ,11-Apr-18,1113.4,1118.0,1131.5,1116.5,1124.5,1124.25,1124.78,4512787,102586
8,KSM,EQ,12-Apr-18,1124.25,1129.45,1172.75,1125.0,1164.05,1162.6,1157.71,8522183,216130
9,KSM,EQ,13-Apr-18,1162.6,1174.0,1185.9,1150.25,1168.0,1171.45,1172.37,10613519,180965


In [3]:
# Using .iloc
# Select the first four rows of all the columns
# .iloc excludes the stop position of a slice, so [:4] returns positions 0 to 3
ksm.iloc[:4]

,Symbol,Series,Date,Prev Close,Open Price,High Price,Low Price,Last Price,Close Price,Average Price,Total Traded Quantity,No. of Trades
0,KSM,EQ,02-Apr-18,1131.8,1141.0,1149.55,1121.3,1136.7,1137.15,1135.86,4036351,142078
1,KSM,EQ,03-Apr-18,1137.15,1134.7,1143.55,1128.1,1139.4,1140.45,1135.21,2038584,114034
2,KSM,EQ,04-Apr-18,1140.45,1144.0,1144.55,1120.0,1122.7,1124.2,1131.81,2406651,137029
3,KSM,EQ,05-Apr-18,1124.2,1139.55,1151.3,1129.1,1146.05,1147.55,1140.71,3881772,101745


In [4]:
# Select the rows at positions 1 to 4 (4 rows in total) and the columns at positions 2 and 3 (2 columns)
# .iloc behaves like NumPy array indexing
# .iloc is especially useful when your data is not labelled and you need to refer to columns by their integer position
print(ksm.iloc[1:5, 2:4])

        Date  Prev Close
1  03-Apr-18     1137.15
2  04-Apr-18     1140.45
3  05-Apr-18     1124.20
4  06-Apr-18     1147.55


In [5]:
# Select specific rows and columns by passing lists of positions

print(ksm.iloc[[1, 3, 5, 7], [1, 3, 5, 7, 9]])

  Series  Prev Close  High Price  Last Price  Average Price
1     EQ     1137.15     1143.55     1139.40        1135.21
3     EQ     1124.20     1151.30     1146.05        1140.71
5     EQ     1127.00     1125.80     1111.45        1113.12
7     EQ     1113.40     1131.50     1124.50        1124.78


In [6]:
# Select the rows at positions 1 and 2 (the second and third rows) and all the columns

print(ksm.iloc[1:3, :])

  Symbol Series       Date  Prev Close  Open Price  High Price  Low Price  \
1    KSM     EQ  03-Apr-18     1137.15      1134.7     1143.55     1128.1   
2    KSM     EQ  04-Apr-18     1140.45      1144.0     1144.55     1120.0   

   Last Price  Close Price  Average Price  Total Traded Quantity  \
1      1139.4      1140.45        1135.21                2038584   
2      1122.7      1124.20        1131.81                2406651   

   No. of Trades  
1         114034  
2         137029  


In [7]:
# Select all rows and the columns at positions 1 and 2 (the second and third columns)
print(ksm.iloc[:, 1:3])

  Series       Date
0     EQ  02-Apr-18
1     EQ  03-Apr-18
2     EQ  04-Apr-18
3     EQ  05-Apr-18
4     EQ  06-Apr-18
5     EQ  09-Apr-18
6     EQ  10-Apr-18
7     EQ  11-Apr-18
8     EQ  12-Apr-18
9     EQ  13-Apr-18


## Indexing using `.loc`

It is a label-location based indexer for selecting data points.  
Think: **“L” for Label (name)**  
Used to access rows and columns **by their labels** – index names or column names.

**Syntax:**  
`df.loc[row_label, column_label]`

`.loc` **includes** both the start and the stop label of a slice (unlike `.iloc` and ordinary Python slicing).  
The `.loc` indexer takes **row arguments first** and **column arguments second**.
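
To make the inclusive stop concrete, here is a small sketch on a toy DataFrame (hypothetical data, not the price data downloaded later) that slices the same frame once by label and once by position:

```python
import pandas as pd

# Hypothetical toy data; the labels "a".."d" stand in for real index labels
toy = pd.DataFrame(
    {"open": [10.0, 11.0, 12.0, 13.0], "close": [10.5, 10.8, 12.4, 12.9]},
    index=["a", "b", "c", "d"],
)

by_label = toy.loc["a":"c", "open":"close"]  # label slice: "c" and "close" are included
by_position = toy.iloc[0:2, 0:2]             # position slice: stop position 2 is excluded

print(by_label.shape)
print(by_position.shape)
```

The label slice returns three rows while the position slice returns two, even though both slices look similar at first glance.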


In [8]:
import pandas as pd
import numpy as np
import yfinance as yf
import warnings
warnings.filterwarnings('ignore')

In [9]:
data = yf.download("HDFCBANK.NS", start="2020-01-01", end="2024-12-31")

[*********************100%***********************]  1 of 1 completed


In [10]:
data.head()

Price,Close,High,Low,Open,Volume
Ticker,HDFCBANK.NS,HDFCBANK.NS,HDFCBANK.NS,HDFCBANK.NS,HDFCBANK.NS
Date,,,,,
2020-01-01,605.529358,606.192392,601.740655,604.345388,3673698
2020-01-02,609.389099,609.981084,605.718794,605.718794,6137166
2020-01-03,600.698792,608.560338,598.425547,607.234269,10855550
2020-01-06,587.698792,597.573144,585.354555,596.720663,10890186
2020-01-07,597.004761,602.143177,593.050314,596.199685,14724494


In [11]:
data.shape

(1237, 5)

In [12]:
data.columns = ['close', 'high', 'low', 'open', 'volume']

In [13]:
data.head()

Date,close,high,low,open,volume
2020-01-01,605.529358,606.192392,601.740655,604.345388,3673698
2020-01-02,609.389099,609.981084,605.718794,605.718794,6137166
2020-01-03,600.698792,608.560338,598.425547,607.234269,10855550
2020-01-06,587.698792,597.573144,585.354555,596.720663,10890186
2020-01-07,597.004761,602.143177,593.050314,596.199685,14724494


In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1237 entries, 2020-01-01 to 2024-12-30
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   close   1237 non-null   float64
 1   high    1237 non-null   float64
 2   low     1237 non-null   float64
 3   open    1237 non-null   float64
 4   volume  1237 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 58.0 KB


In [15]:
# Select all rows for a specific column
print(data.loc[:, ['close']])

                 close
Date                  
2020-01-01  605.529358
2020-01-02  609.389099
2020-01-03  600.698792
2020-01-06  587.698792
2020-01-07  597.004761
...                ...
2024-12-23  888.490295
2024-12-24  887.059631
2024-12-26  883.433655
2024-12-27  887.133667
2024-12-30  877.094360

[1237 rows x 1 columns]


In [16]:
# Select all the rows of these specific columns
print(data.loc[:, ['close', 'volume']])

                 close    volume
Date                            
2020-01-01  605.529358   3673698
2020-01-02  609.389099   6137166
2020-01-03  600.698792  10855550
2020-01-06  587.698792  10890186
2020-01-07  597.004761  14724494
...                ...       ...
2024-12-23  888.490295  11044592
2024-12-24  887.059631  14485834
2024-12-26  883.433655  10481678
2024-12-27  887.133667   7259330
2024-12-30  877.094360  22222218

[1237 rows x 2 columns]


In [17]:
# Select the first six rows of the specific columns
data.loc[data.index[:6], ['close', 'open']]

Date,close,open
2020-01-01,605.529358,604.345388
2020-01-02,609.389099,605.718794
2020-01-03,600.698792,607.234269
2020-01-06,587.698792,596.720663
2020-01-07,597.004761,596.199685
2020-01-08,595.441956,590.540276


In [18]:
data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1237 entries, 2020-01-01 to 2024-12-30
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   close   1237 non-null   float64
 1   high    1237 non-null   float64
 2   low     1237 non-null   float64
 3   open    1237 non-null   float64
 4   volume  1237 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 90.3 KB


In [19]:
data.head()

Date,close,high,low,open,volume
2020-01-01,605.529358,606.192392,601.740655,604.345388,3673698
2020-01-02,609.389099,609.981084,605.718794,605.718794,6137166
2020-01-03,600.698792,608.560338,598.425547,607.234269,10855550
2020-01-06,587.698792,597.573144,585.354555,596.720663,10890186
2020-01-07,597.004761,602.143177,593.050314,596.199685,14724494


In [20]:
# Slice by date range (from the start up to and including 2020-01-06)
data.loc[:'2020-01-06', ['close', 'open']]

Date,close,open
2020-01-01,605.529358,604.345388
2020-01-02,609.389099,605.718794
2020-01-03,600.698792,607.234269
2020-01-06,587.698792,596.720663


In [21]:
#Slice a single year
data.loc['2021']

Date,close,high,low,open,volume
2021-01-01,674.886292,683.387169,672.778791,681.966406,8810938
2021-01-04,670.600342,681.019274,662.549349,681.019274,15740192
2021-01-05,675.667786,677.585840,667.285304,672.115877,14386824
2021-01-06,672.755188,681.966448,669.226924,679.598509,22134050
2021-01-07,670.718689,678.461838,668.966437,678.414490,19894842
...,...,...,...,...,...
2021-12-27,690.054932,691.576948,676.427905,679.638458,4705098
2021-12-28,694.811218,697.712595,691.291489,694.763608,5450678
2021-12-29,691.505615,694.906435,688.437776,692.552053,7668702
2021-12-30,695.144226,697.688875,687.296207,693.717314,7215918


In [22]:
#Slice a specific month
data.loc['2021-05']

Date,close,high,low,open,volume
2021-05-03,669.866272,673.394536,652.272531,659.707836,22473700
2021-05-04,657.505615,673.915444,655.114032,667.7351,21486328
2021-05-05,664.254272,667.569387,654.356275,663.496544,14421612
2021-05-06,663.449219,668.13775,660.655039,666.622234,11477044
2021-05-07,670.008301,674.838873,667.877156,669.15582,12048334
2021-05-10,672.423706,677.230634,669.084946,675.809871,11060050
2021-05-11,664.704285,674.483828,660.678788,661.128673,14519034
2021-05-12,662.786255,667.095893,657.742532,662.904652,13774926
2021-05-14,656.795288,662.502045,654.664143,660.347197,10604142
2021-05-17,682.0849,683.19782,654.166919,660.726099,15120692


In [23]:
pd.options.display.float_format = '{:.2f}'.format
# Select a specific day; a single date label returns a Series (a slice such as '2021-05-10':'2021-05-10' returns a one-row DataFrame)
data.loc['2021-05-10']

close         672.42
high          677.23
low           669.08
open          675.81
volume   11060050.00
Name: 2021-05-10 00:00:00, dtype: float64
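
The difference between a single date label and a date slice can be sketched on a tiny frame. The dates and prices below are made up for illustration, and `prices` is a hypothetical name:

```python
import pandas as pd

# Hypothetical daily data with a DatetimeIndex
idx = pd.to_datetime(["2021-05-07", "2021-05-10", "2021-05-11"])
prices = pd.DataFrame(
    {"close": [670.0, 672.4, 664.7], "open": [669.2, 675.8, 661.1]},
    index=idx,
)

as_series = prices.loc["2021-05-10"]              # single label -> Series (one row, columns become the index)
as_frame = prices.loc["2021-05-10":"2021-05-10"]  # slice -> DataFrame that happens to have one row

print(type(as_series).__name__)
print(type(as_frame).__name__)
```

The slice form is handy when later code expects a DataFrame, for example when concatenating results.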

In [24]:
#Slice a date range
data.loc['2021-01-01' : '2021-03-31']

Date,close,high,low,open,volume
2021-01-01,674.89,683.39,672.78,681.97,8810938
2021-01-04,670.60,681.02,662.55,681.02,15740192
2021-01-05,675.67,677.59,667.29,672.12,14386824
2021-01-06,672.76,681.97,669.23,679.60,22134050
2021-01-07,670.72,678.46,668.97,678.41,19894842
...,...,...,...,...,...
2021-03-24,700.34,713.44,696.65,706.07,16368900
2021-03-25,693.02,708.27,686.82,705.74,23964356
2021-03-26,706.26,709.91,698.07,707.54,12021258
2021-03-30,735.81,740.00,711.12,713.53,25607644


In [25]:
# Slice using only year and month (both end months are included in full)
data.loc['2020-07' : '2020-09']

Date,close,high,low,open,volume
2020-07-01,513.65,519.05,502.62,504.77,34846718
2020-07-02,515.93,526.27,513.94,516.35,36954496
2020-07-03,508.61,518.96,506.74,517.63,27597254
2020-07-06,522.37,530.37,520.95,524.71,35558216
2020-07-07,523.39,526.49,517.66,525.40,24349850
...,...,...,...,...,...
2020-09-24,487.98,495.85,485.43,491.58,19809510
2020-09-25,494.43,498.19,485.74,497.27,20321544
2020-09-28,499.26,501.53,493.79,496.77,16152422
2020-09-29,503.21,506.69,497.74,501.06,12644446


In [26]:
# Slice rows by date range and select specific columns at the same time
data.loc['2021-01-01' : '2021-12-31', ['open', 'close']]

Date,open,close
2021-01-01,681.97,674.89
2021-01-04,681.02,670.60
2021-01-05,672.12,675.67
2021-01-06,679.60,672.76
2021-01-07,678.41,670.72
...,...,...
2021-12-27,679.64,690.05
2021-12-28,694.76,694.81
2021-12-29,692.55,691.51
2021-12-30,693.72,695.14


In [27]:
data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1237 entries, 2020-01-01 to 2024-12-30
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   close   1237 non-null   float64
 1   high    1237 non-null   float64
 2   low     1237 non-null   float64
 3   open    1237 non-null   float64
 4   volume  1237 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 90.3 KB


In [28]:
#To check the start date
data.index.min()

Timestamp('2020-01-01 00:00:00')

In [29]:
#To check the end date
data.index.max()

Timestamp('2024-12-30 00:00:00')

In [30]:
data['sma200'] = data['close'].rolling(window=200).mean()

In [31]:
data.tail()

Date,close,high,low,open,volume,sma200
2024-12-23,888.49,890.96,878.77,879.12,11044592,800.76
2024-12-24,887.06,892.29,882.84,889.01,14485834,801.78
2024-12-26,883.43,893.92,878.5,887.28,10481678,802.72
2024-12-27,887.13,890.93,882.62,885.43,7259330,803.67
2024-12-30,877.09,895.4,873.69,884.15,22222218,804.55


In [32]:
data.head()

Date,close,high,low,open,volume,sma200
2020-01-01,605.53,606.19,601.74,604.35,3673698,
2020-01-02,609.39,609.98,605.72,605.72,6137166,
2020-01-03,600.7,608.56,598.43,607.23,10855550,
2020-01-06,587.7,597.57,585.35,596.72,10890186,
2020-01-07,597.0,602.14,593.05,596.2,14724494,


## Missing values

Missing values are data points that are absent from the dataframe. Real-world datasets are often large, and most of them contain at least some missing values. <br>
Hence, it is important to learn how to handle them.

In [33]:
ksm = pd.read_csv('stock_nan.csv')

In [34]:
ksm

,Symbol,Series,Date,Prev Close,Open Price,High Price,Low Price,Last Price,Close Price,Average Price,Total Traded Quantity,No. of Trades
0,KSM,EQ,02-Apr-18,1131.8,1141.0,1149.55,1121.3,1136.7,1137.15,1135.86,4036351.0,142078.0
1,KSM,EQ,03-Apr-18,1137.15,1134.7,1143.55,1128.1,1139.4,1140.45,1135.21,2038584.0,114034.0
2,KSM,EQ,04-Apr-18,1140.45,1144.0,,,,,,,137029.0
3,KSM,EQ,05-Apr-18,1124.2,,1151.3,1129.1,1146.05,1147.55,1140.71,3881772.0,
4,KSM,EQ,06-Apr-18,1147.55,1143.0,1146.0,1122.1,1127.55,,1128.81,2968871.0,137277.0
5,KSM,EQ,09-Apr-18,,,,1106.55,1111.45,1111.25,,3601441.0,171118.0
6,KSM,EQ,10-Apr-18,1111.25,1112.0,1124.5,,,,1115.77,,
7,KSM,EQ,11-Apr-18,1113.4,1118.0,1131.5,1116.5,1124.5,1124.25,1124.78,4512787.0,102586.0
8,KSM,EQ,12-Apr-18,,1129.45,,1125.0,1164.05,1162.6,1157.71,8522183.0,216130.0
9,KSM,EQ,13-Apr-18,1162.6,1174.0,1185.9,1150.25,1168.0,1171.45,1172.37,10613519.0,


## DataFrame.isnull()
This method returns a Boolean result with the same shape as the data it is called on.<br>
Each entry is 'True' if the data point has a 'NaN' (Not a Number) value and 'False' otherwise. Missing data is represented by a NaN value.
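
Because `True` counts as 1 when summed, the Boolean result is commonly turned into a count of missing values. A minimal sketch on made-up data (the frame `toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few gaps
toy = pd.DataFrame(
    {
        "close": [1137.15, np.nan, 1147.55, np.nan],
        "volume": [4036351.0, 2038584.0, np.nan, 3881772.0],
    }
)

missing_per_column = toy.isnull().sum()      # NaN count for each column
total_missing = int(missing_per_column.sum())
rows_with_gaps = toy.isnull().any(axis=1)    # True for rows with at least one NaN

print(missing_per_column)
print(total_missing)
print(rows_with_gaps)
```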

In [35]:
# Checking for 'NaN' values in the 'Close Price' column of the ksm dataframe

print(ksm['Close Price'].isnull())

0    False
1    False
2     True
3    False
4     True
5    False
6     True
7    False
8    False
9    False
Name: Close Price, dtype: bool


In [36]:
# Understanding the 'NaN' values of the entire dataframe

print(ksm.isnull())

   Symbol  Series   Date  Prev Close  Open Price  High Price  Low Price  \
0   False   False  False       False       False       False      False   
1   False   False  False       False       False       False      False   
2   False   False  False       False       False        True       True   
3   False   False  False       False        True       False      False   
4   False   False  False       False       False       False      False   
5   False   False  False        True        True        True      False   
6   False   False  False       False       False       False       True   
7   False   False  False       False       False       False      False   
8   False   False  False        True       False        True      False   
9   False   False  False       False       False       False      False   

   Last Price  Close Price  Average Price  Total Traded Quantity  \
0       False        False          False                  False   
1       False        False          Fa

## DataFrame.notnull()
This method is the inverse of isnull() and also returns a Boolean result.<br>
Each entry is 'True' if the data point is not a 'NaN' (Not a Number) value and 'False' otherwise. Missing data is represented by a NaN value.
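
One common use of this Boolean result is as a row filter. The sketch below keeps only the rows whose `close` value is present; the data and the name `toy` are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the middle row has no close price
toy = pd.DataFrame(
    {
        "date": ["02-Apr-18", "03-Apr-18", "04-Apr-18"],
        "close": [1137.15, np.nan, 1147.55],
    }
)

has_close = toy["close"].notnull()  # Boolean mask, True where a value exists
valid_rows = toy.loc[has_close]     # keep only the rows where the mask is True

print(valid_rows)
```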

In [37]:
print(ksm['Close Price'].notnull())

0     True
1     True
2    False
3     True
4    False
5     True
6    False
7     True
8     True
9     True
Name: Close Price, dtype: bool


## DataFrame.fillna()
The .fillna() method fills the 'NaN' values of the entire dataframe, or of the requested columns, with a scalar value of your choice. Scalar means a single fixed value, not an array, list, or series.
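
Beyond a single scalar for everything, `fillna()` also accepts a dictionary that maps column names to fill values. The sketch below uses made-up data and fill choices (column mean for `close`, zero for `volume`) purely as an illustration, not as a recommendation for real price data:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap per column
toy = pd.DataFrame(
    {
        "close": [1137.15, np.nan, 1147.55],
        "volume": [4036351.0, np.nan, 3881772.0],
    }
)

# One fill value per column; fillna returns a new DataFrame
filled = toy.fillna({"close": toy["close"].mean(), "volume": 0})

print(filled)
print(toy["close"].isnull().sum())  # the original frame still has its NaN
```

To keep the filled values, assign the result back, for example `toy = toy.fillna(...)`.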

In [38]:
# Replace NaN with a Scalar Value of 1100

print(ksm.fillna(1100))

  Symbol Series       Date  Prev Close  Open Price  High Price  Low Price  \
0    KSM     EQ  02-Apr-18     1131.80     1141.00     1149.55    1121.30   
1    KSM     EQ  03-Apr-18     1137.15     1134.70     1143.55    1128.10   
2    KSM     EQ  04-Apr-18     1140.45     1144.00     1100.00    1100.00   
3    KSM     EQ  05-Apr-18     1124.20     1100.00     1151.30    1129.10   
4    KSM     EQ  06-Apr-18     1147.55     1143.00     1146.00    1122.10   
5    KSM     EQ  09-Apr-18     1100.00     1100.00     1100.00    1106.55   
6    KSM     EQ  10-Apr-18     1111.25     1112.00     1124.50    1100.00   
7    KSM     EQ  11-Apr-18     1113.40     1118.00     1131.50    1116.50   
8    KSM     EQ  12-Apr-18     1100.00     1129.45     1100.00    1125.00   
9    KSM     EQ  13-Apr-18     1162.60     1174.00     1185.90    1150.25   

   Last Price  Close Price  Average Price  Total Traded Quantity  \
0     1136.70      1137.15        1135.86             4036351.00   
1     1139.40   

In [39]:
# Fill the NaN values of the 'Close Price' column with the scalar value 1000
# fillna() returns a new object; ksm itself is not modified unless you assign the result back
print(ksm['Close Price'].fillna(1000))

0   1137.15
1   1140.45
2   1000.00
3   1147.55
4   1000.00
5   1111.25
6   1000.00
7   1124.25
8   1162.60
9   1171.45
Name: Close Price, dtype: float64


In [40]:
# With a backfill, each NaN takes the value from the next valid row below it
# .bfill() is the current spelling; fillna(method='backfill') is deprecated in recent pandas versions

print(ksm['Close Price'])
print(ksm['Close Price'].bfill())

0   1137.15
1   1140.45
2       NaN
3   1147.55
4       NaN
5   1111.25
6       NaN
7   1124.25
8   1162.60
9   1171.45
Name: Close Price, dtype: float64
0   1137.15
1   1140.45
2   1147.55
3   1147.55
4   1111.25
5   1111.25
6   1124.25
7   1124.25
8   1162.60
9   1171.45
Name: Close Price, dtype: float64


In [41]:
# The same backfill can be applied to the entire dataframe
print(ksm.bfill())
# Older code spells this ksm.fillna(method='backfill') or method='bfill'; both are deprecated in recent pandas versions

  Symbol Series       Date  Prev Close  Open Price  High Price  Low Price  \
0    KSM     EQ  02-Apr-18     1131.80     1141.00     1149.55    1121.30   
1    KSM     EQ  03-Apr-18     1137.15     1134.70     1143.55    1128.10   
2    KSM     EQ  04-Apr-18     1140.45     1144.00     1151.30    1129.10   
3    KSM     EQ  05-Apr-18     1124.20     1143.00     1151.30    1129.10   
4    KSM     EQ  06-Apr-18     1147.55     1143.00     1146.00    1122.10   
5    KSM     EQ  09-Apr-18     1111.25     1112.00     1124.50    1106.55   
6    KSM     EQ  10-Apr-18     1111.25     1112.00     1124.50    1116.50   
7    KSM     EQ  11-Apr-18     1113.40     1118.00     1131.50    1116.50   
8    KSM     EQ  12-Apr-18     1162.60     1129.45     1185.90    1125.00   
9    KSM     EQ  13-Apr-18     1162.60     1174.00     1185.90    1150.25   

   Last Price  Close Price  Average Price  Total Traded Quantity  \
0     1136.70      1137.15        1135.86             4036351.00   
1     1139.40   

In [42]:
# With a forward fill, each NaN takes the value from the previous valid row above it

print(ksm['Close Price'])
print("---------------------------------------------------------")
print(ksm['Close Price'].ffill())
# Older code spells this fillna(method='ffill') or fillna(method='pad'); both are deprecated in recent pandas versions

0   1137.15
1   1140.45
2       NaN
3   1147.55
4       NaN
5   1111.25
6       NaN
7   1124.25
8   1162.60
9   1171.45
Name: Close Price, dtype: float64
---------------------------------------------------------
0   1137.15
1   1140.45
2   1140.45
3   1147.55
4   1147.55
5   1111.25
6   1111.25
7   1124.25
8   1162.60
9   1171.45
Name: Close Price, dtype: float64
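
A short sketch of forward and backward filling on a toy Series (made-up values), including the `limit` argument that caps how many consecutive NaN values are filled. It uses the `.ffill()` and `.bfill()` methods:

```python
import numpy as np
import pandas as pd

# Hypothetical series with a two-value gap in the middle and a gap at the end
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()        # each NaN copies the last valid value above it
backward = s.bfill()       # each NaN copies the next valid value below it
capped = s.ffill(limit=1)  # fill at most one consecutive NaN per gap

print(forward.tolist())
print(backward.tolist())
print(capped.tolist())
```

The trailing NaN survives the backward fill because there is no later value to copy from, which is worth checking for at the edges of real data.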


## DataFrame.dropna()

By default, this method drops every row that contains even a single 'NaN' value; the `axis` argument switches it to dropping columns instead.
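
Dropping on any single NaN can be too aggressive, so `dropna()` has a few arguments that relax the rule. A minimal sketch on made-up data (the frame `toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, rows 1 and 2 have one gap, row 3 is empty
toy = pd.DataFrame(
    {
        "open": [1.0, np.nan, 3.0, np.nan],
        "close": [1.5, 2.5, np.nan, np.nan],
    }
)

any_nan_dropped = toy.dropna()                 # default: drop rows with at least one NaN
all_nan_dropped = toy.dropna(how="all")        # drop only rows where every value is NaN
close_required = toy.dropna(subset=["close"])  # only look at the 'close' column

print(any_nan_dropped)
print(all_nan_dropped)
print(close_required)
```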

In [43]:
# By default, dropna() will exclude or drop all the rows which have even one NaN value in it

print(ksm.dropna())

  Symbol Series       Date  Prev Close  Open Price  High Price  Low Price  \
0    KSM     EQ  02-Apr-18     1131.80     1141.00     1149.55    1121.30   
1    KSM     EQ  03-Apr-18     1137.15     1134.70     1143.55    1128.10   
7    KSM     EQ  11-Apr-18     1113.40     1118.00     1131.50    1116.50   

   Last Price  Close Price  Average Price  Total Traded Quantity  \
0     1136.70      1137.15        1135.86             4036351.00   
1     1139.40      1140.45        1135.21             2038584.00   
7     1124.50      1124.25        1124.78             4512787.00   

   No. of Trades  
0      142078.00  
1      114034.00  
7      102586.00  


In [44]:
# With axis=1, dropna() drops all the columns that have even one NaN value

print(ksm.dropna(axis=1))

  Symbol Series       Date
0    KSM     EQ  02-Apr-18
1    KSM     EQ  03-Apr-18
2    KSM     EQ  04-Apr-18
3    KSM     EQ  05-Apr-18
4    KSM     EQ  06-Apr-18
5    KSM     EQ  09-Apr-18
6    KSM     EQ  10-Apr-18
7    KSM     EQ  11-Apr-18
8    KSM     EQ  12-Apr-18
9    KSM     EQ  13-Apr-18


## Replacing values

Replacing lets us find every occurrence of a given value in the dataframe and substitute it with a value of our choice.
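
A minimal sketch of `.replace()` on a toy Series. The values are made up, and `-1.0` is assumed here to be a placeholder that some data source used for "no data":

```python
import numpy as np
import pandas as pd

# Hypothetical series: -1.0 is an assumed sentinel for missing data
s = pd.Series([100.0, -1.0, 300.0, np.nan])

cleaned = s.replace(-1.0, np.nan)             # turn the sentinel into a proper NaN
zeroed = s.replace(np.nan, 0.0)               # NaN itself can be the value to find
mapped = s.replace({100.0: 1.0, 300.0: 3.0})  # several replacements in one call

print(cleaned.tolist())
print(zeroed.tolist())
print(mapped.tolist())
```

For filling NaN values specifically, `fillna()` is usually the more readable choice.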

In [45]:
import pandas as pd
import numpy as np

# We will create a dataframe using the 'pd.DataFrame' constructor

df = pd.DataFrame({'one': [100, 200, 300, 400, 500, 2000],
                   'two': [1000, 0, 30, 40, 50, 60]})

print(df)

    one   two
0   100  1000
1   200     0
2   300    30
3   400    40
4   500    50
5  2000    60


In [46]:
# .replace() finds each value you want to replace and substitutes the value you have given
# NaN can also be matched (for example, df.replace(np.nan, 0)), although fillna() is the more usual tool
# Example: below, 1000 is replaced with 10 and 2000 is replaced with 60

print(df.replace({1000: 10, 2000: 60}))

   one  two
0  100   10
1  200    0
2  300   30
3  400   40
4  500   50
5   60   60


In [47]:
print(ksm['Close Price'])

0   1137.15
1   1140.45
2       NaN
3   1147.55
4       NaN
5   1111.25
6       NaN
7   1124.25
8   1162.60
9   1171.45
Name: Close Price, dtype: float64


In [48]:
ksm

,Symbol,Series,Date,Prev Close,Open Price,High Price,Low Price,Last Price,Close Price,Average Price,Total Traded Quantity,No. of Trades
0,KSM,EQ,02-Apr-18,1131.8,1141.0,1149.55,1121.3,1136.7,1137.15,1135.86,4036351.0,142078.0
1,KSM,EQ,03-Apr-18,1137.15,1134.7,1143.55,1128.1,1139.4,1140.45,1135.21,2038584.0,114034.0
2,KSM,EQ,04-Apr-18,1140.45,1144.0,,,,,,,137029.0
3,KSM,EQ,05-Apr-18,1124.2,,1151.3,1129.1,1146.05,1147.55,1140.71,3881772.0,
4,KSM,EQ,06-Apr-18,1147.55,1143.0,1146.0,1122.1,1127.55,,1128.81,2968871.0,137277.0
5,KSM,EQ,09-Apr-18,,,,1106.55,1111.45,1111.25,,3601441.0,171118.0
6,KSM,EQ,10-Apr-18,1111.25,1112.0,1124.5,,,,1115.77,,
7,KSM,EQ,11-Apr-18,1113.4,1118.0,1131.5,1116.5,1124.5,1124.25,1124.78,4512787.0,102586.0
8,KSM,EQ,12-Apr-18,,1129.45,,1125.0,1164.05,1162.6,1157.71,8522183.0,216130.0
9,KSM,EQ,13-Apr-18,1162.6,1174.0,1185.9,1150.25,1168.0,1171.45,1172.37,10613519.0,


In [49]:
print(ksm['Close Price'].replace({1147.55: 3000}))

0   1137.15
1   1140.45
2       NaN
3   3000.00
4       NaN
5   1111.25
6       NaN
7   1124.25
8   1162.60
9   1171.45
Name: Close Price, dtype: float64


## Reindexing 

Reindexing changes the row labels and column labels of a dataframe.<br> 
To reindex means to conform the data to match a given set of labels along a particular axis; labels that are not present in the original data are filled with 'NaN'.
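
A small sketch of what happens when the requested labels do not all exist in the original data. The frame `toy` is made up for illustration:

```python
import pandas as pd

# Hypothetical data with row labels 0, 1, 2 and a single column
toy = pd.DataFrame({"close": [10.0, 11.0, 12.0]}, index=[0, 1, 2])

expanded = toy.reindex(index=[0, 2, 5])                   # label 5 does not exist -> NaN row
zero_filled = toy.reindex(index=[0, 2, 5], fill_value=0)  # choose the filler instead of NaN
with_new_col = toy.reindex(columns=["close", "open"])     # 'open' does not exist -> NaN column

print(expanded)
print(zero_filled)
print(with_new_col)
```

The original `toy` is left unchanged in every case, since `reindex()` returns a new object.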

In [50]:
ksm.head()

,Symbol,Series,Date,Prev Close,Open Price,High Price,Low Price,Last Price,Close Price,Average Price,Total Traded Quantity,No. of Trades
0,KSM,EQ,02-Apr-18,1131.8,1141.0,1149.55,1121.3,1136.7,1137.15,1135.86,4036351.0,142078.0
1,KSM,EQ,03-Apr-18,1137.15,1134.7,1143.55,1128.1,1139.4,1140.45,1135.21,2038584.0,114034.0
2,KSM,EQ,04-Apr-18,1140.45,1144.0,,,,,,,137029.0
3,KSM,EQ,05-Apr-18,1124.2,,1151.3,1129.1,1146.05,1147.55,1140.71,3881772.0,
4,KSM,EQ,06-Apr-18,1147.55,1143.0,1146.0,1122.1,1127.55,,1128.81,2968871.0,137277.0


In [51]:
# Reindexing keeps only the requested row labels and columns, in the given order, and returns a new DataFrame
ksm_reindexed = ksm.reindex(index=[0, 2, 4, 6, 8], columns=[
                              'Open Price', 'High Price', 'Low Price', 'Close Price'])

print(ksm_reindexed)

   Open Price  High Price  Low Price  Close Price
0     1141.00     1149.55    1121.30      1137.15
2     1144.00         NaN        NaN          NaN
4     1143.00     1146.00    1122.10          NaN
6     1112.00     1124.50        NaN          NaN
8     1129.45         NaN    1125.00      1162.60


In [52]:
data.head()

Date,close,high,low,open,volume,sma200
2020-01-01,605.53,606.19,601.74,604.35,3673698,
2020-01-02,609.39,609.98,605.72,605.72,6137166,
2020-01-03,600.7,608.56,598.43,607.23,10855550,
2020-01-06,587.7,597.57,585.35,596.72,10890186,
2020-01-07,597.0,602.14,593.05,596.2,14724494,


In [53]:
data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1237 entries, 2020-01-01 to 2024-12-30
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   close   1237 non-null   float64
 1   high    1237 non-null   float64
 2   low     1237 non-null   float64
 3   open    1237 non-null   float64
 4   volume  1237 non-null   int64  
 5   sma200  1038 non-null   float64
dtypes: float64(5), int64(1)
memory usage: 99.9 KB


### Class Exercises: Using `.loc` on a Price DataFrame

1. **Select all rows for the year 2021 using `.loc`.**

2. **Select close prices between `2022-01-01` and `2022-03-31`.**

3. **Select the rows where high > 500 using `.loc` with a condition.**

4. **Select both close and volume columns for the date range `2023-05-01` to `2023-05-31`.**

5. **Select the row for a single date (e.g., `2020-06-15`) and print all columns.**

6. **Select rows where volume is above its 95th percentile using `.loc`.**

7. **Select rows where close > open for the entire year 2024.**

8. **Create a new df using `.loc` that contains only rows from `2021` and columns `['close', 'open']`.**

9. **Select the last 10 dates using `.loc` with index slicing.**

10. **Select rows where (high – low) > 10 and display close, high, low columns only.**

11. **Extract all rows where close crossed above the 200-day MA.**

In [54]:
df2 = data.copy()

In [55]:
#Select all rows for the year 2021 using .loc.
df2.loc['2021']

Date,close,high,low,open,volume,sma200
2021-01-01,674.89,683.39,672.78,681.97,8810938,523.63
2021-01-04,670.60,681.02,662.55,681.02,15740192,524.68
2021-01-05,675.67,677.59,667.29,672.12,14386824,525.98
2021-01-06,672.76,681.97,669.23,679.60,22134050,527.22
2021-01-07,670.72,678.46,668.97,678.41,19894842,528.48
...,...,...,...,...,...,...
2021-12-27,690.05,691.58,676.43,679.64,4705098,718.17
2021-12-28,694.81,697.71,691.29,694.76,5450678,718.02
2021-12-29,691.51,694.91,688.44,692.55,7668702,717.88
2021-12-30,695.14,697.69,687.30,693.72,7215918,717.66


In [56]:
#Select close prices between 2022-01-01 and 2022-03-31.
df2.loc['2022-01-01' : '2022-03-31', 'close']

Date
2022-01-03   722.80
2022-01-04   727.04
2022-01-05   744.30
2022-01-06   732.36
2022-01-07   737.50
              ...  
2022-03-25   680.59
2022-03-28   681.49
2022-03-29   690.53
2022-03-30   702.49
2022-03-31   699.35
Name: close, Length: 61, dtype: float64

In [57]:
# Passing the column as a list (['close']) returns a DataFrame instead of a Series, because a list selection keeps the 2D structure
df2.loc['2022-01-01' : '2022-03-31', ['close']]

Date,close
2022-01-03,722.80
2022-01-04,727.04
2022-01-05,744.30
2022-01-06,732.36
2022-01-07,737.50
...,...
2022-03-25,680.59
2022-03-28,681.49
2022-03-29,690.53
2022-03-30,702.49


In [58]:
#Select the rows where high > 500 using .loc with a condition.
df2.loc[df2['high'] > 500]

Date,close,high,low,open,volume,sma200
2020-01-01,605.53,606.19,601.74,604.35,3673698,
2020-01-02,609.39,609.98,605.72,605.72,6137166,
2020-01-03,600.70,608.56,598.43,607.23,10855550,
2020-01-06,587.70,597.57,585.35,596.72,10890186,
2020-01-07,597.00,602.14,593.05,596.20,14724494,
...,...,...,...,...,...,...
2024-12-23,888.49,890.96,878.77,879.12,11044592,800.76
2024-12-24,887.06,892.29,882.84,889.01,14485834,801.78
2024-12-26,883.43,893.92,878.50,887.28,10481678,802.72
2024-12-27,887.13,890.93,882.62,885.43,7259330,803.67


In [59]:
#Select both close and volume columns for the date range 2023-05-01 to 2023-05-31.
df2.loc['2023-05-01' : '2023-05-31', ['close', 'volume']]

Date,close,volume
2023-05-02,811.85,32221184
2023-05-03,814.69,29407270
2023-05-04,831.36,56854638
2023-05-05,782.21,62770416
2023-05-08,791.26,37244524
2023-05-09,791.28,43688938
2023-05-10,794.94,46340550
2023-05-11,795.47,35734618
2023-05-12,802.49,22478760
2023-05-15,806.34,18512260


In [60]:
#Select the row for a single date (e.g., 2020-06-15) and print all columns.

df2.loc['2020-06-15']

close         449.84
high          461.27
low           446.59
open          458.43
volume   32009936.00
sma200           NaN
Name: 2020-06-15 00:00:00, dtype: float64

In [61]:
#Select rows where volume is above its 95th percentile using .loc
df2.loc[df2['volume'] > df2['volume'].quantile(0.95)]

Date,close,high,low,open,volume,sma200
2020-03-12,483.68,511.47,475.22,509.11,59000770,
2020-03-13,506.64,512.16,435.32,464.14,66342364,
2020-03-18,415.29,470.27,409.65,466.48,61181406,
2020-03-19,424.12,435.68,376.50,401.13,67220048,
2020-03-20,418.11,433.14,390.50,414.39,88637464,
...,...,...,...,...,...,...
2024-07-24,791.33,799.69,783.44,793.35,61728206,750.73
2024-08-30,807.53,819.92,799.76,816.96,445342100,759.52
2024-09-20,858.99,860.84,842.39,846.80,60623386,766.12
2024-10-07,798.11,818.44,795.74,815.08,94458296,769.51
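
The percentile filter is an ordinary boolean mask: `quantile(0.95)` returns a single threshold, and the comparison yields a `True`/`False` Series that `.loc` uses to keep rows. A small sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical volumes 1..100 so the effect of the threshold is easy to see.
toy = pd.DataFrame({"volume": range(1, 101)})

threshold = toy["volume"].quantile(0.95)   # a single float
is_heavy = toy["volume"] > threshold       # boolean Series aligned with toy
heavy = toy.loc[is_heavy]                  # keeps only the rows marked True

print(threshold)
print(heavy)
```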


In [62]:
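# Select rows where close > open within the year 2024.
# Build the mask from the 2024 subset so it has the same index as the rows being filtered.
df_2024 = df2.loc['2024']
df_2024.loc[df_2024['close'] > df_2024['open']]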

Date,close,high,low,open,volume,sma200
2024-01-02,826.93,828.73,821.87,826.78,29242092,780.04
2024-01-04,822.91,824.91,813.11,816.66,26734056,780.63
2024-01-10,805.93,807.87,798.85,799.62,16115824,781.83
2024-01-15,814.13,818.07,799.92,801.91,28320356,782.56
2024-01-16,817.22,819.41,806.97,814.23,25322500,782.88
...,...,...,...,...,...,...
2024-12-12,917.23,921.86,912.02,913.16,17293790,793.62
2024-12-13,923.39,925.00,902.43,918.83,19009002,794.70
2024-12-16,920.16,922.51,915.25,920.11,13107082,795.80
2024-12-23,888.49,890.96,878.77,879.12,11044592,800.76
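
A year filter and a row condition can also be combined into one boolean mask and passed to a single `.loc` call, which avoids indexing twice. The following is a sketch on a made-up frame (`toy` and its prices are hypothetical), not the only way to write this selection.

```python
import pandas as pd

# Hypothetical prices spanning the 2023/2024 boundary.
idx = pd.to_datetime(["2023-12-29", "2024-01-02", "2024-01-03", "2024-01-04"])
toy = pd.DataFrame(
    {"open": [10.0, 11.0, 12.0, 13.0], "close": [10.5, 11.5, 11.0, 13.5]},
    index=idx,
)

in_2024 = toy.index.year == 2024       # True for rows dated in 2024
up_day = toy["close"] > toy["open"]    # True where the close beat the open

# One mask, one .loc call; both conditions cover every row of toy.
up_days_2024 = toy.loc[in_2024 & up_day]

print(up_days_2024)
```

Because both masks are computed on the full frame, they always have the same length as the frame being filtered.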


In [63]:
# Create a new DataFrame with .loc that contains only the 2021 rows and the ['close', 'open'] columns.

df3 = df2.loc['2021', ['close', 'open']]

In [64]:
df3.tail()

Date,close,open
2021-12-27,690.05,679.64
2021-12-28,694.81,694.76
2021-12-29,691.51,692.55
2021-12-30,695.14,693.72
2021-12-31,703.66,695.14


In [65]:
# Select the last 10 dates using .loc with index slicing.
df2.loc[df2.index[-10:]]

Date,close,high,low,open,volume,sma200
2024-12-16,920.16,922.51,915.25,920.11,13107082,795.8
2024-12-17,904.4,918.76,901.07,915.8,21079532,796.86
2024-12-18,893.28,905.76,891.52,902.72,23422312,797.87
2024-12-19,884.79,888.54,877.64,887.8,25380770,798.84
2024-12-20,873.94,886.79,871.74,879.22,25692348,799.75
2024-12-23,888.49,890.96,878.77,879.12,11044592,800.76
2024-12-24,887.06,892.29,882.84,889.01,14485834,801.78
2024-12-26,883.43,893.92,878.5,887.28,10481678,802.72
2024-12-27,887.13,890.93,882.62,885.43,7259330,803.67
2024-12-30,877.09,895.4,873.69,884.15,22222218,804.55
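
Since `.loc` works with labels, the last ten labels are taken from `df.index` first and then looked up. When the index labels are unique, this gives the same rows as the positional `.iloc[-10:]`. A quick check on a made-up frame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical frame with 15 unique daily labels.
toy = pd.DataFrame({"close": range(15)}, index=pd.date_range("2024-12-01", periods=15))

by_label = toy.loc[toy.index[-10:]]   # look up the last 10 labels
by_position = toy.iloc[-10:]          # take the last 10 positions

print(by_label.equals(by_position))
```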


In [66]:
# Select rows where (high - low) > 10 and display the close, high, and low columns only.
sel = df2['high'] - df2['low'] > 10
df2.loc[sel, ['close', 'high', 'low']]

Date,close,high,low
2020-01-03,600.70,608.56,598.43
2020-01-06,587.70,597.57,585.35
2020-01-08,595.44,597.74,587.27
2020-01-20,594.31,617.96,593.17
2020-01-27,574.56,584.88,573.87
...,...,...,...
2024-12-19,884.79,888.54,877.64
2024-12-20,873.94,886.79,871.74
2024-12-23,888.49,890.96,878.77
2024-12-26,883.43,893.92,878.50


In [67]:
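# Extract all rows where the close price crossed above the 200-period moving average (the sma200 column).
# First, inspect the frame: sma200 is NaN in the earliest rows.
df2.head()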

Date,close,high,low,open,volume,sma200
2020-01-01,605.53,606.19,601.74,604.35,3673698,
2020-01-02,609.39,609.98,605.72,605.72,6137166,
2020-01-03,600.7,608.56,598.43,607.23,10855550,
2020-01-06,587.7,597.57,585.35,596.72,10890186,
2020-01-07,597.0,602.14,593.05,596.2,14724494,


In [68]:
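# The current row's close is above the moving average ...
condition1 = df2['close'] > df2['sma200']
# ... and the previous row's close was at or below the previous moving average.
condition2 = df2['close'].shift(1) <= df2['sma200'].shift(1)
# Rows where both hold are the upward crossovers.
df2.loc[condition1 & condition2]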

Date,close,high,low,open,volume,sma200
2021-08-04,696.95,701.33,684.92,685.39,22053948,685.1
2021-11-25,725.8,729.29,716.79,720.5,10251652,722.39
2021-12-02,725.7,727.15,713.46,715.6,11203110,721.39
2021-12-07,725.68,728.68,718.17,720.09,12427534,720.98
2022-01-03,722.8,724.4,704.18,706.32,9069184,717.43
2022-01-21,723.73,727.63,706.61,713.46,11537694,719.14
2022-02-02,728.3,730.1,716.12,719.14,13969290,720.4
2022-02-10,725.39,730.34,714.24,720.12,14314530,721.72
2022-02-21,723.97,728.06,711.34,715.36,7468066,723.15
2022-04-04,788.04,819.1,743.21,751.51,97450970,721.73
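
The crossover logic compares each row with the one before it: `shift(1)` moves every value down one row, so the previous value sits next to the current one. Comparisons against `NaN` evaluate to `False`, so rows without a moving average never count as crossovers. A minimal sketch on made-up numbers (the `close` and `ma` series are hypothetical, with a flat average for readability):

```python
import pandas as pd

# Hypothetical closes around a flat "moving average" of 10.0.
close = pd.Series(
    [9.0, 10.5, 11.0, 9.5, 10.2],
    index=pd.date_range("2024-01-01", periods=5),
)
ma = pd.Series(10.0, index=close.index)

above_now = close > ma                               # currently above the average
at_or_below_before = close.shift(1) <= ma.shift(1)   # previous row was not above

# True only on the rows where the close moves from at-or-below to above.
crossed_up = close[above_now & at_or_below_before]

print(crossed_up)
```

Here the close crosses upward on 2024-01-02 and again on 2024-01-05; 2024-01-03 is excluded because the previous close was already above the average.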
