![alt text](pandas.png "Title")

In [0]:
import pandas as pd
import numpy as np

# DataFrame indexes

A DataFrame has a row index and a column index. Indexes hold the axis labels and other metadata (such as the axis name). As you learn pandas, you'll find that mastering indexes is important.

Advanced note: Indexes are immutable and can contain duplicate labels!
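
A minimal sketch of what the note above means in practice. The labels and the names `idx`, `s`, `mutated` and `dup_rows` are made up for illustration:

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'b'])

# Duplicate labels are allowed
print("Unique labels?", idx.is_unique)

# Modifying a label in place is not allowed: it raises a TypeError
try:
    idx[0] = 'z'
    mutated = True
except TypeError:
    mutated = False
print("Index was modified?", mutated)

# Selecting a duplicated label returns every matching row
s = pd.Series([1, 2, 3], index=idx)
dup_rows = s.loc['b']
print(dup_rows)
```

To change labels, build a new index (or use a method such as `rename()`) rather than assigning into the existing one.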

## Setting an index

In [0]:
# We can set an index at df creation:
data = {
    'gender': ['M', 'F', 'F','M'],
    'subjid': [10011, 10010, 10014, 10013],
    'age':    [20, 25, 23, 26] 
}

df = pd.DataFrame(
    data,
    index   = ['Study123-10011', 'Study123-10010', 'Study123-10014', 'Study123-10013'],
    columns = ['subjid', 'age', 'gender']
)

# The index is not a default range of integers but the list of values we passed
df

In [0]:
# set_index(): set an index using a Series (from the df or not)

df = pd.DataFrame(data, columns=['subjid', 'gender', 'age'])

# At this point, the index is the default one: a range of integers
print("Default index:", list(df.index) )
df

In [0]:
# Use subjid values as index
df.set_index(df.subjid, inplace=True)

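# Almost the same, but passing the column name instead of the Series.
# Note: with a column name, the column is moved into the index
# (dropped from the columns) unless drop=False is passed.
# df.set_index(['subjid'], inplace=True)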

# Let's look at it
print("New index: ", list(df.index) )

df
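
The two ways of calling `set_index()` differ in what happens to the source column. The sketch below uses a small made-up frame (`demo`) to compare them; the variable names are illustrative only:

```python
import pandas as pd

demo = pd.DataFrame({'subjid': [10011, 10010], 'age': [20, 25]})

# Passing the Series: subjid becomes the index AND stays as a column
by_series = demo.set_index(demo['subjid'])

# Passing the column name: subjid is moved into the index (drop=True by default)
by_label = demo.set_index('subjid')

# Passing the column name with drop=False keeps the column as well
by_label_keep = demo.set_index('subjid', drop=False)

print(list(by_series.columns))
print(list(by_label.columns))
print(list(by_label_keep.columns))
```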

## Reindexing

You can create a new df with the data __conformed__ to a new index.

In [0]:
data = {'gender': ['M', 'F', 'F','M'],
        'age':    [20, 25, 23, 26] }

df = pd.DataFrame(
    data,
    index   = [10011, 10010, 10014, 10013],
    columns = ['gender', 'age',]
)

df

In [0]:
new_index = [10010, 10011, 10012, 10013, 10014, 10015]
new = df.reindex(new_index)

new

In [0]:
# We can fill the missing values with whatever we want:
new = df.reindex(new_index, fill_value = 'missing')
new

In [0]:
# Replace the missing values with a value carried forward (ffill) or backward (bfill).
# Note that the index must be sorted (monotonic) first.
new = df.sort_index().reindex(new_index, method = 'ffill')
new

In [0]:
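# The difference between set_index() and reindex():
# set_index() relabels the existing rows, so it needs exactly one label per row.
# The following raises a ValueError: the df has 4 rows but we pass 6 labels.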

data = {
    'gender': ['M', 'F', 'F','M'],
    'subjid': [10011, 10010, 10014, 10013],
    'age':    [20, 25, 23, 26] 
}

df = pd.DataFrame(
    data,
    columns = ['subjid', 'age', 'gender']
)

df = df.set_index(pd.Series( [10010, 10011, 10012, 10013, 10014, 10015]))
df
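
To make the contrast concrete, here is a small sketch that catches the error from `set_index()` and then shows `reindex()` accepting the same labels. The frame `demo` and the other names are made up for illustration:

```python
import pandas as pd

demo = pd.DataFrame({'age': [20, 25, 23, 26]},
                    index=[10011, 10010, 10014, 10013])
new_labels = [10010, 10011, 10012, 10013, 10014, 10015]

# set_index() needs one label per existing row: 6 labels for 4 rows fails
try:
    demo.set_index(pd.Series(new_labels))
    error = None
except ValueError as exc:
    error = exc
print("set_index error:", error)

# reindex() aligns on labels instead: rows are reordered to follow new_labels,
# and labels that were not in the original index get NaN
conformed = demo.reindex(new_labels)
print(conformed)
```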

## Index sorting

Reorder the rows by sorting on the index labels. This is different from sorting on column values.
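
As a quick illustration of that difference, the sketch below sorts the same made-up frame (`demo`) once by its index and once by a column, using `sort_values()` for the latter:

```python
import pandas as pd

demo = pd.DataFrame({'age': [20, 25, 23, 26]},
                    index=[10011, 10010, 10014, 10013])

by_index = demo.sort_index()        # rows ordered by index label
by_value = demo.sort_values('age')  # rows ordered by the content of 'age'

print(list(by_index.index))
print(list(by_value.index))
```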

In [0]:
data = {'gender': ['M', 'F', 'F','M'],
        'age':    [20, 25, 23, 26] }

df = pd.DataFrame(
    data,
    index=  [10011, 10010, 10014, 10013],
    columns=['gender', 'age',]
)
df

In [0]:
df.index

In [0]:
# Sort by the row index. This returns a new df; it is NOT 'in place'
df.sort_index() 

In [0]:
# If we need to save the sorting: 

# option 1
df = df.sort_index() 

# option 2
df.sort_index(inplace=True) 

In [0]:
# Descending order:
df = df.sort_index(ascending=False)
df

In [0]:
df.index

In [0]:
# We can sort the column index too:
df = df.sort_index(axis=1)

# same:
# df = df.sort_index(axis='columns')
df

## Removing an index

In [0]:
patients = [10010, 10011, 10012]
data = {'gender': ['M', 'F', 'F'],
        'age':    [20, 25, 23],
       }

df = pd.DataFrame(data, index= patients, columns=['age', 'gender', 'race'])
df

In [0]:
# Let's get rid of this index
df.reset_index(inplace=True)

# or alternatively (default is inplace=False)
# df = df.reset_index()

# The old index is now a regular column (named 'index') and the index is back to the default: a range of integers
df
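
If the old index values are not needed at all, `reset_index()` also accepts `drop=True`, which discards them instead of turning them into a column. A minimal sketch on a made-up frame (`demo`):

```python
import pandas as pd

demo = pd.DataFrame({'age': [20, 25, 23]}, index=[10010, 10011, 10012])

kept = demo.reset_index()              # old index becomes a column named 'index'
dropped = demo.reset_index(drop=True)  # old index is discarded

print(list(kept.columns))
print(list(dropped.columns))
print(list(dropped.index))
```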

## Advanced: hierarchical indexes

Each axis of a dataframe can have a hierarchical index, i.e. multiple index levels.

In [0]:
df = pd.DataFrame(
    data    = np.arange(24).reshape(6, 4),
    index   = [[101, 101, 102, 102, 103, 103], ['Visit1', 'Visit2', 'Visit1', 'Visit2', 'Visit1', 'Visit2']],
    columns = [['Study_A', 'Study_A', 'Study_B', 'Study_B'],
               ['param1', 'param2', 'param1', 'param2']]
)
df.index.names= ['subjid','visit']
df

In [0]:
# Btw
np.arange(24)

In [0]:
np.arange(24).reshape(6, 4)

In [0]:
df.index

In [0]:
# Column selection still works: this returns every column under 'Study_A'
df['Study_A']

In [0]:
# We can filter on values from the index levels
df[df.index.get_level_values(0) == 101] # 0 here means first level of index (i.e. subjid)

In [0]:
# .loc is probably easier to read (the label slice includes both ends)
df.loc[102:103]

In [0]:
# You can swap the index levels
df.swaplevel('visit','subjid')
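
Two other common ways to select from a hierarchical row index are a tuple passed to `.loc` (one label per level) and `xs()` (a cross-section on a single level). The sketch below uses a smaller made-up frame (`demo`) with the same level names as above:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    np.arange(8).reshape(4, 2),
    index=pd.MultiIndex.from_product(
        [[101, 102], ['Visit1', 'Visit2']], names=['subjid', 'visit']
    ),
    columns=['param1', 'param2'],
)

# One row, addressed by a (subjid, visit) tuple: returns a Series
one_row = demo.loc[(101, 'Visit2')]

# All subjects at Visit1: returns a DataFrame indexed by subjid only
visit1 = demo.xs('Visit1', level='visit')

print(one_row)
print(visit1)
```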

__________________________________________________
Nicolas Dupuis, Methodology and Innovation (IDAR C&SP), 2020+