# Hierarchical Indexing and Handling Missing Data
Hierarchical indexing is an important feature of pandas.
It lets a single axis carry two or more index levels.
Somewhat abstractly, it lets us work with higher-dimensional data in a lower-dimensional form (a Series or DataFrame).

In [2]:
import numpy as np
import pandas as pd

print('NumPy version: ', np.__version__)
print('Pandas version: ', pd.__version__)

NumPy version:  1.23.5
Pandas version:  1.5.2


In [5]:
index = [['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
        [1,2,3,1,2,3,1,2,1,2]]

# let's create a Series with a multi-level index
ser = pd.Series(np.random.randn(10), index=index)
ser

a  1    0.039157
   2   -3.014031
   3    0.427069
b  1    0.206950
   2   -0.212716
   3    0.061772
c  1   -0.212706
   2    0.349951
d  1    1.039830
   2   -0.006973
dtype: float64

In [6]:
# with a hierarchical index, partial indexing is possible
ser['a'] # returns every inner-level entry under 'a'

1    0.039157
2   -3.014031
3    0.427069
dtype: float64

In [7]:
# if we want a single value, we also need to index the second (inner) level
ser['a'][2]

-3.014030712847249
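
The chained lookup above can also be written as one lookup that passes both level labels together. The following is a minimal sketch; `ser_demo` and its fixed values are made up for illustration so that the result is reproducible.

```python
import pandas as pd

# Hypothetical fixed values (not the random data used above)
index = [['a', 'a', 'b', 'b'], [1, 2, 1, 2]]
ser_demo = pd.Series([10.0, 20.0, 30.0, 40.0], index=index)

# Chained lookup: first the outer level, then the inner level
chained = ser_demo['a'][2]

# Single lookup: both level labels at once, passed to .loc as a tuple
direct = ser_demo.loc[('a', 2)]

print(chained, direct)
```

Both forms return the same value; the tuple form does it in one step and avoids building the intermediate Series.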

Having said that, most of the time we are going to work with DataFrames.
With a DataFrame, either axis can have a hierarchical index.

In [15]:
# creating a dataframe with multi-level index for rows
df = pd.DataFrame(np.arange(12).reshape((4, 3)), index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                 columns=['AB', 'ON', 'BC'])
df

     AB  ON  BC
a 1   0   1   2
  2   3   4   5
b 1   6   7   8
  2   9  10  11
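
Since either axis can carry a hierarchical index, the same idea applies to columns. The sketch below is only an illustration: the outer column labels `'East'` and `'West'` and the name `df_cols` are hypothetical and not part of the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical outer labels grouping the columns; inner labels as before
columns = [['East', 'West', 'West'], ['ON', 'AB', 'BC']]
df_cols = pd.DataFrame(np.arange(12).reshape((4, 3)),
                       index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                       columns=columns)

# Partial indexing on the column axis: every column under 'West'
west = df_cols['West']
print(west)
```

Here `df_cols['West']` behaves like `ser['a']` did for the Series: it selects on the outer level and returns the inner-level columns beneath it.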


Now the question is: how do we index the above df?
On the column axis, we just use df[].
On the row axis, we use df.loc[].

In [16]:
df['AB']

a  1    0
   2    3
b  1    6
   2    9
Name: AB, dtype: int32

In [17]:
df.loc['a']

   AB  ON  BC
1   0   1   2
2   3   4   5


In [18]:
# we want to grab a single value; let's try grabbing 11
df.loc['b'].loc[2]['BC']

11
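
The same value can be reached with a single `.loc` call by giving the row label as a tuple and the column label alongside it. A minimal sketch, rebuilding the same frame under the hypothetical name `df_demo`:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(12).reshape((4, 3)),
                       index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                       columns=['AB', 'ON', 'BC'])

# Row label is the tuple ('b', 2); column label is 'BC'
value = df_demo.loc[('b', 2), 'BC']

# The tuple on its own selects the whole row as a Series
row = df_demo.loc[('b', 2)]

print(value)
print(row)
```

A single `.loc` call is usually preferable to chaining when the goal is to assign a value, because chained indexing may operate on a temporary copy.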

In [19]:
# the hierarchical levels can have names
# if so, these will show up in the console output
df.index.names

FrozenList([None, None])

In [20]:
# the levels don't have names yet, so let's assign them
df.index.names = ['L_1', 'L_2']
df

         AB  ON  BC
L_1 L_2
a   1     0   1   2
    2     3   4   5
b   1     6   7   8
    2     9  10  11
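
Level names can also be supplied when the index is built rather than assigned afterwards. This is a sketch using `pd.MultiIndex.from_arrays`; the names `row_index` and `df_named` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Build the row index explicitly so the level names are set up front
row_index = pd.MultiIndex.from_arrays(
    [['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    names=['L_1', 'L_2'])

df_named = pd.DataFrame(np.arange(12).reshape((4, 3)),
                        index=row_index,
                        columns=['AB', 'ON', 'BC'])

print(df_named.index.names)
```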


In [21]:
# xs() - a useful method for taking a cross-section of a multi-level index
df.xs('a')

     AB  ON  BC
L_2
1     0   1   2
2     3   4   5


If we want all the rows of df where the index level L_2 is 1, this is awkward with loc[] because L_2 is the inner level, but xs() handles it directly.

In [22]:
df.xs(1, level = 'L_2')

     AB  ON  BC
L_1
a     0   1   2
b     6   7   8
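
For comparison, the same rows can be selected with `loc[]` by using `pd.IndexSlice` to say "every label on L_1, label 1 on L_2". The sketch below rebuilds the frame as `df_demo` (a name chosen here) and shows one difference worth knowing: `xs()` drops the selected level by default, while the `loc[]` form keeps both levels.

```python
import numpy as np
import pandas as pd

row_index = pd.MultiIndex.from_arrays(
    [['a', 'a', 'b', 'b'], [1, 2, 1, 2]], names=['L_1', 'L_2'])
df_demo = pd.DataFrame(np.arange(12).reshape((4, 3)),
                       index=row_index, columns=['AB', 'ON', 'BC'])

# xs: rows where L_2 == 1; the L_2 level is dropped from the result
via_xs = df_demo.xs(1, level='L_2')

# loc: ":" means every L_1 label, 1 selects on L_2; both levels are kept
idx = pd.IndexSlice
via_loc = df_demo.loc[idx[:, 1], :]

print(via_xs)
print(via_loc)
```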


**3) Handling Missing Data**

Missing data (NA or NaN) is common in data science applications, and pandas has convenient methods for dealing with it.

In [23]:
# let's create a DataFrame with missing data
data_dict = {'A': [1, 2, np.nan, 4, np.nan],
            'B': [np.nan, np.nan, np.nan, np.nan, np.nan],
            'C': [11, 12, 13, 14, 15],
            'D': [16, np.nan, 18, 19, 20]}
df = pd.DataFrame(data_dict)
df

     A   B   C     D
0  1.0 NaN  11  16.0
1  2.0 NaN  12   NaN
2  NaN NaN  13  18.0
3  4.0 NaN  14  19.0
4  NaN NaN  15  20.0


In [24]:
# isnull() marks missing values with True; notnull() is its inverse
df.isnull()

       A     B      C      D
0  False  True  False  False
1  False  True  False   True
2   True  True  False  False
3  False  True  False  False
4   True  True  False  False


In [25]:
df.notnull()

       A      B     C      D
0   True  False  True   True
1   True  False  True  False
2  False  False  True   True
3   True  False  True   True
4  False  False  True   True
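
The boolean masks become more useful when aggregated. A short sketch, assuming the same data rebuilt under the hypothetical name `df_demo`: summing the mask counts the missing values per column, and taking its mean gives the missing fraction.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, np.nan, 4, np.nan],
                        'B': [np.nan] * 5,
                        'C': [11, 12, 13, 14, 15],
                        'D': [16, np.nan, 18, 19, 20]})

# True counts as 1, so summing the mask counts the NaN values per column
missing_per_column = df_demo.isnull().sum()

# The mean of the mask is the fraction of missing values per column
missing_fraction = df_demo.isnull().mean()

print(missing_per_column)
print(missing_fraction)
```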


In [26]:
# sum() - NaN is skipped by default (skipna=True), which for a sum has the same effect as treating it as 0
df['A'].sum()

7.0

In [27]:
# mean() - NaN is skipped too, so the result is 7 / 3 (three valid values), not 7 / 5
df['A'].mean()

2.3333333333333335
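
The skipping behaviour can be controlled. The sketch below uses two standalone Series, `a` and `b` (names chosen here, mirroring columns A and B), to show the `skipna` and `min_count` parameters of `sum()`.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 2, np.nan, 4, np.nan])
b = pd.Series([np.nan] * 5)

total_skip = a.sum()                # NaN skipped: 1 + 2 + 4
total_strict = a.sum(skipna=False)  # NaN propagates into the result
empty_total = b.sum()               # no valid values, so the sum is 0.0
empty_strict = b.sum(min_count=1)   # require at least one valid value

print(total_skip, total_strict, empty_total, empty_strict)
```

Note that an all-NaN column sums to 0.0 by default, which can hide the fact that no data was present; `min_count=1` returns NaN in that case instead.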

In [28]:
# dropna() - drops every row (axis=0 is the default) that contains at least one NaN
# column B is entirely NaN, so every row is dropped here
df.dropna()

Empty DataFrame
Columns: [A, B, C, D]
Index: []


In [29]:
# dropna(axis=1) - drops every column that contains at least one NaN
df.dropna(axis = 1)

    C
0  11
1  12
2  13
3  14
4  15


In [30]:
# dropna - thresh is an optional int giving the minimum number of non-NaN values required
# thresh=3 keeps only the rows/columns that have at least 3 non-NaN values
df.dropna(thresh=3, axis=1)

     A   C     D
0  1.0  11  16.0
1  2.0  12   NaN
2  NaN  13  18.0
3  4.0  14  19.0
4  NaN  15  20.0
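
A few other `dropna()` options are worth a quick look. This is a sketch on the same data, rebuilt as `df_demo` (a name chosen here), showing `thresh` applied to rows, `how='all'`, and `subset`.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, np.nan, 4, np.nan],
                        'B': [np.nan] * 5,
                        'C': [11, 12, 13, 14, 15],
                        'D': [16, np.nan, 18, 19, 20]})

# Keep rows with at least 3 non-NaN values (rows 0 and 3 qualify)
rows_kept = df_demo.dropna(thresh=3)

# Drop only the columns in which every value is NaN (column B)
cols_kept = df_demo.dropna(how='all', axis=1)

# Drop rows based on NaN in selected columns only
a_present = df_demo.dropna(subset=['A'])

print(rows_kept.index.tolist())
print(cols_kept.columns.tolist())
print(a_present.index.tolist())
```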


In [31]:
# fillna - returns a new object with the NaN values replaced; df itself is unchanged
# pass inplace=True to modify df directly
df.fillna('Filled')

        A       B   C       D
0     1.0  Filled  11    16.0
1     2.0  Filled  12  Filled
2  Filled  Filled  13    18.0
3     4.0  Filled  14    19.0
4  Filled  Filled  15    20.0
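
Two points about `fillna()` are easy to check in code: the original frame is left untouched unless the result is assigned back, and a dict can give each column its own fill value. A minimal sketch, where `df_demo` and the fill values `0` and `-1` are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, np.nan, 4, np.nan],
                        'B': [np.nan] * 5,
                        'C': [11, 12, 13, 14, 15],
                        'D': [16, np.nan, 18, 19, 20]})

# Per-column fill values; columns not listed (here D) are left as they are
filled = df_demo.fillna({'A': 0, 'B': -1})

# The original still contains its NaN values
still_missing = int(df_demo['A'].isnull().sum())

print(filled)
print(still_missing)
```

Assigning the result back (`df_demo = df_demo.fillna(...)`) is a common alternative to `inplace=True`.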


In [32]:
# let's fill in the missing values using the mean of the column
df['A'].fillna(value = df['A'].mean())

0    1.000000
1    2.000000
2    2.333333
3    4.000000
4    2.333333
Name: A, dtype: float64
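
The same idea extends to the whole frame: passing `df.mean()` to `fillna()` fills each column with its own mean. A sketch on the same data under the hypothetical name `df_demo`; note that column B has no valid values, so its mean is NaN and it stays unfilled.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, np.nan, 4, np.nan],
                        'B': [np.nan] * 5,
                        'C': [11, 12, 13, 14, 15],
                        'D': [16, np.nan, 18, 19, 20]})

# df_demo.mean() is a Series indexed by column name, so each column
# is filled with its own mean
filled_means = df_demo.fillna(df_demo.mean())

print(filled_means)
```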

In [33]:
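# fillna with method='ffill' (also called 'pad'): carry the last valid value forward
# column B has no valid value, so it stays NaN
# pandas 2.1 and later deprecate the method argument in favour of df.ffill()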
df.fillna(method='ffill')

     A   B   C     D
0  1.0 NaN  11  16.0
1  2.0 NaN  12  16.0
2  2.0 NaN  13  18.0
3  4.0 NaN  14  19.0
4  4.0 NaN  15  20.0


In [35]:
# fillna with method='bfill' (backfill): carry the next valid value backward
# the last value of A has nothing after it, so it stays NaN
df.fillna(method='bfill')

     A   B   C     D
0  1.0 NaN  11  16.0
1  2.0 NaN  12  18.0
2  4.0 NaN  13  18.0
3  4.0 NaN  14  19.0
4  NaN NaN  15  20.0
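
Forward and backward filling are also available as the dedicated methods `ffill()` and `bfill()`, which can be chained and accept a `limit` on how many consecutive NaN values to fill. A minimal sketch on standalone Series (`a` mirrors column A; `gap` is a made-up example):

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 2, np.nan, 4, np.nan])

forward = a.ffill()        # last valid value carried forward
backward = a.bfill()       # next valid value carried backward
both = a.ffill().bfill()   # forward first, then backward for any leading NaN

# limit caps how many consecutive NaN values are filled
gap = pd.Series([1, np.nan, np.nan, 4])
limited = gap.ffill(limit=1)

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
```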
