![alt text](pandas.png "Title")

In [0]:
import pandas as pd
import numpy as np
import random

# Dataframes: derive new columns

This is often the core of what we need to do: add new variables to a dataframe. Let's see several ways to do this.

## Test data

In [0]:
patients = [10010, 10011, 10012, 10013]
data = {'gender': ['M', 'F', 'F', 'M'],
        'age':    [20, 40, 20, None],
       }

df = pd.DataFrame(data, index= patients, columns=['age', 'gender'])
df

## Broadcasting

Broadcasting (a NumPy concept) describes how arrays of different shapes, including scalar values, are combined element-wise.

In [0]:
# Create a new df column with the same value in all df rows:
df['study'] = "Study_A123"

# Same, but use element-wise values of another column:
df['age(months)'] = df['age'] * 12

# Using methods on the element-wise values:
df['STUDY'] = df['study'].str.upper() # the str accessor exposes string methods

# Use a regex pattern to extract a value from each row:
df['cluster'] = df['study'].str.extract(r'_(.)', expand=False) # expand=False returns a Series

# map() applies a function to every element of a Series:
df['a-g'] = df['age'].map(str) + '-' + df['gender']

df
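
The three cases of broadcasting can be seen side by side in a minimal sketch. The dataframe `demo` and the Series `bonus` are hypothetical names used only for illustration. Note the last case: when the right-hand side is a Series, pandas aligns on index labels rather than on position.

```python
import numpy as np
import pandas as pd

# Hypothetical mini dataframe, mirroring the test data above
demo = pd.DataFrame({'age': [20, 40, 20]}, index=[10010, 10011, 10012])

# A scalar is stretched to every row
demo['study'] = 'Study_A123'

# A NumPy array of the same length is combined position by position
demo['age_plus'] = demo['age'] + np.array([1, 2, 3])

# A Series is aligned on its index labels, not on position:
# labels missing from the other Series produce NaN
bonus = pd.Series({10012: 100, 10010: 300})
demo['age_bonus'] = demo['age'] + bonus
```

Here patient 10011 has no entry in `bonus`, so its `age_bonus` is missing.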

## Using a Python iterable

You can unpack the values from an iterable (e.g. a list, a tuple or a dictionary). The iterable must contain as many items as the dataframe has rows.

In [0]:
# Example with a dictionary:
mydict = {'1': 10, '2': 20, '3': 30, '4': 40}
mydict.keys()

In [0]:
# list() turns the dictionary views into plain, ordered lists
df['dict_keys'] = list(mydict.keys())
df['dict_values'] = list(mydict.values())
df

In [0]:
# Example with lists & tuples
df['newvar']  = [letter for letter in 'abcd']
df['newvar2'] = tuple(letter for letter in 'wxyz')
df

# Bottom line: you can always build your new column outside of pandas and bring it in afterwards
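
The size rule and the difference between a plain list and a Series can be checked with a small sketch. All names here (`demo`, `labels`, `length_error`) are hypothetical and used only for illustration.

```python
import pandas as pd

demo = pd.DataFrame({'age': [20, 40, 20, 30]},
                    index=[10010, 10011, 10012, 10013])

# A list with the wrong number of items is rejected
try:
    demo['too_short'] = ['a', 'b', 'c']
    length_error = None
except ValueError as err:
    length_error = str(err)

# A plain list is assigned by position...
demo['by_position'] = ['a', 'b', 'c', 'd']

# ...whereas a Series is matched on index labels, whatever its order
labels = pd.Series(['d', 'c', 'b', 'a'], index=[10013, 10012, 10011, 10010])
demo['by_label'] = labels
```

Both new columns end up identical, because the Series was realigned on the patient numbers before being stored.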

## Conditional logic

In a SAS data step there is an implicit 'loop over every record' logic, so each variable can be used directly.

How do we do that in pandas? Let's create a flag in this df: True for males over 40 years, False otherwise.

In [0]:
# Option 1: apply() provides an implicit looping on every row.

# This function will get a Series which represents a df row
def create_newvar(row): 
    
    # for readability
    age    = row['age']
    gender = row['gender']

    # now it feels a bit more like SAS, doesn't it?
    if gender == 'M' and age > 40:
        return True
    else:
        return False

# apply() applies a function to each column (if axis = 0) or row (if axis = 1).
# The function returns values which we use to create a new df column

df['flag'] = df.apply(create_newvar, axis = 1) # or axis = 'columns'
df

# Notes: 1) this creates one new variable at a time, for now.
#        2) axis is confusing. Here's what the doc says:
#           Axis along which the function is applied:
#             0 or 'index': apply function to each column.
#             1 or 'columns': apply function to each row.
#           Most of the time, we'll use axis=1...

In [0]:
# Alternatively, with a lambda. It's probably the right way to do this small task.
df['flag'] = df.apply(
    lambda row: True if row['gender']=='M' and row['age']>40 else False,
    axis = 1
)
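
For a simple rule like this one, a vectorized boolean expression is a common alternative to `apply()`. The sketch below uses a hypothetical dataframe `demo` with one male over 40 so that the flag is not all False; it is an illustration, not a replacement for the approach above.

```python
import pandas as pd

demo = pd.DataFrame({'gender': ['M', 'F', 'M', 'M'],
                     'age':    [45, 50, 20, None]})

# Each comparison returns a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than == and >.
demo['flag'] = (demo['gender'] == 'M') & (demo['age'] > 40)

# Same rule written with apply(), for comparison
demo['flag_apply'] = demo.apply(
    lambda row: bool(row['gender'] == 'M' and row['age'] > 40),
    axis=1
)
```

A missing age compares as False, so the last patient is not flagged in either version.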

In [0]:
# Option 2: explicitly iterating over rows, which feels even more like SAS. Careful, this is not very efficient.

# iterrows() yields a tuple at every iteration: the index value and the row.
for index, row in df.iterrows():
    
    if row['gender'] == 'M' and row['age'] > 40:
        df.loc[index, 'flag'] = True
    
    else:
        df.loc[index, 'flag'] = False
df

In [0]:
# NumPy also provides an easy way: np.where(condition, value when True, value when False)
# Note: mixing numbers and text makes NumPy store every value as a string.
df['flag2'] = np.where(
    df['flag'],
    df['age'] * 12,
    'Ignored'
)

df
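
When there are more than two outcomes, `np.select()` generalizes `np.where()`. This is a minimal sketch on a hypothetical dataframe `demo`; the group labels are made up for illustration.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'gender': ['M', 'F', 'M', 'F'],
                     'age':    [50, 45, 20, 30]})

conditions = [
    (demo['gender'] == 'M') & (demo['age'] > 40),
    (demo['gender'] == 'F') & (demo['age'] > 40),
]
choices = ['older male', 'older female']

# The first condition that is True wins; rows matching none get the default
demo['group'] = np.select(conditions, choices, default='other')
```

The default is given as a string on purpose, so that all possible values share one type.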

## Create multiple variables in one go

In [0]:
# This is probably what we want...

def create_newvars(row): 
    
    if row['gender'] == 'M' and row['age'] > 40:
        
        # Return a dictionary: keys = names of the future df variables.
        # We could have passed a list instead but then you can't choose the var names and must rename afterwards
        return {'Flag1': True, 'Flag2': 'cat1'}
    else:
        return {'Flag1': False, 'Flag2': 'cat2'}

newvars = df.apply(create_newvars, axis='columns', result_type='expand')

# newvars df contains only the new variables, let's add these cols to the original df
df = pd.concat([df, newvars], axis='columns')
df
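
When the new variables can be expressed as vectorized expressions, `assign()` is another way to add several columns in one call. The sketch below assumes a hypothetical dataframe `demo` and reuses the column names `Flag1` and `Flag2` from above.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'gender': ['M', 'F', 'M'],
                     'age':    [50, 45, 20]})

is_older_male = (demo['gender'] == 'M') & (demo['age'] > 40)

# assign() returns a new dataframe with all the new columns added at once
demo = demo.assign(
    Flag1=is_older_male,
    Flag2=np.where(is_older_male, 'cat1', 'cat2'),
)
```

The `apply()` version stays the better fit when the logic per row is too involved to vectorize.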

In [0]:
# split() can also be useful to split a column into others
df['test'] = "firstname-lastname"
df[['Firstname','Lastname']] = df.test.str.split('-', expand=True)
df[['test', 'Firstname','Lastname']]

## Discretization: create categorical variables

In [0]:
# cut() uses a list of bins to create an interval variable:
bins = [10, 20, 30, 40, 100]
df['agegr'] = pd.cut(df['age'], bins, right=True)
df[['age','agegr']]

# ( and ) mark an exclusive edge, [ and ] an inclusive one. Pass right=False to close the intervals on the left instead.

In [0]:
# We could pass our own group names:
bins = [10, 20, 30, 40, 100]
groups = ['teens', 'young adults', 'adults', 'aging adults']
df['agegr'] = pd.cut(df['age'], bins, labels=groups)
df[['age','agegr']]
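
The effect of the interval edges is easiest to see on values that sit exactly on a bin boundary. This sketch uses a hypothetical Series `ages` with the same bins as above.

```python
import pandas as pd

ages = pd.Series([10, 20, 30, 100])
bins = [10, 20, 30, 40, 100]

# Default (right=True): intervals like (10, 20], so 10 itself falls outside
right_closed = pd.cut(ages, bins)

# right=False: intervals like [10, 20), so 100 itself falls outside
left_closed = pd.cut(ages, bins, right=False)

# include_lowest=True keeps the lowest edge in the first interval
lowest_kept = pd.cut(ages, bins, include_lowest=True)
```

Values that fall outside every interval become missing, which is worth checking after any call to `cut()`.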

In [0]:
# Somewhat related: we can simply compare values:
df['flag'] = df['age'].gt(30) # True if age is greater than 30, False otherwise. 
df[['age','flag']]

# Also available: eq(), ge(), lt() and le()

## Change from baseline

In [0]:
# Let's create a VS (vital signs) dataframe
def create_vs():
    patients = [10010, 10011, 10013]
    visits = [1, 2, 3]
    param = ['heart rate', 'systolic blood pressure']

    data = {'subjid': sorted(patients * len(visits)) * len(param),
            'visit' : visits * len(param) * len(patients),
            'param' : sorted(param * len(visits) * len(patients)),
            'result': [random.randint(50, 150)  for n in range(len(visits) * len(patients))] +
                      [random.randint(100, 180) for n in range(len(visits) * len(patients))] 
    }

    # drop=True discards the old index instead of keeping it as an extra 'index' column
    return (pd.DataFrame(data, columns=['subjid', 'visit', 'param', 'result'])
              .sort_values(['subjid', 'param', 'visit'])
              .reset_index(drop=True))
    
vs=create_vs()
vs.head()

In [0]:
# Let's add a baseline flag at visit 1:

# I'm using a lambda function because it's a small & unique task...
vs['bslfl'] = vs.apply(lambda row: True if row['visit']==1 else False, axis = 1)
vs.head()

In [0]:
# Change from previous visit.

# shift() is the equivalent of the lag function in SAS. It combines well with groupby().
vs['shift'] = vs.groupby(['subjid', 'param'])['result'].shift(periods=1) # periods = number of rows to shift
vs['chg_prev'] = vs['result'] - vs['shift'] # change from the previous visit, not from baseline
vs.head(8)

In [0]:
# In fact, diff() directly computes the difference between each value and the shifted one.
vs['diff'] = vs.groupby(['subjid', 'param'])['result'].diff() # diff(periods=1) by default
vs.head(5)

# As high level as it gets :-)
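
With fixed numbers instead of random ones, the equivalence between `shift()` followed by a subtraction and `diff()` can be verified by eye. The dataframe `demo` and its values are hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({
    'subjid': [1, 1, 1, 2, 2],
    'visit':  [1, 2, 3, 1, 2],
    'result': [100, 110, 105, 80, 95],
})

grouped = demo.groupby('subjid')['result']

demo['previous'] = grouped.shift(1)                       # value at the previous visit
demo['chg_manual'] = demo['result'] - demo['previous']    # two-step version
demo['chg_diff'] = grouped.diff()                         # same thing in one step
```

The first visit of each subject has no previous value, so its change is missing in both columns.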

In [0]:
# Let's now create a better baseline flag and a change from baseline.

# Let's add a few missing values. We want the baseline flag to be True for the earliest visit with a result, by subjid & param.
vs=create_vs()

vs.loc[0, 'result']= None
vs.loc[4, 'result']= None
vs.loc[9, 'result']= None
vs.head(10)

In [0]:
# Create the baseline flag
baselines = pd.DataFrame(vs[vs['result'].notnull()].groupby(['subjid', 'param'])['visit'].min()).reset_index()
baselines['baseline'] = True
baselines

In [0]:
# Merge flag to vs
vs = vs.merge(baselines, how='left', on=['subjid', 'param', 'visit'])
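# Assign the result back: inplace=True on a column selection may not update vs itself
vs['baseline'] = vs['baseline'].fillna(False).astype(bool)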
vs

In [0]:
# What was the result value at baseline?
baselines = vs.loc[vs['baseline'], ['subjid', 'param', 'result']].rename(columns={'result': 'bsl_value'})
baselines

In [0]:
vs = vs.merge(baselines, how='left', on=['subjid', 'param'])

In [0]:
# Finally calculate the change from baseline
vs['Chg_bsl'] = vs['result'] - vs['bsl_value']
vs
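
As a more compact variant of the merge-based steps, the baseline value can also be obtained with `groupby().transform('first')`, which returns the first non-missing value of each group. This is a sketch on a hypothetical dataframe `demo`, and it assumes the rows are sorted by visit within each subject.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'subjid': [1, 1, 1, 2, 2, 2],
    'visit':  [1, 2, 3, 1, 2, 3],
    'result': [np.nan, 110.0, 105.0, 80.0, np.nan, 95.0],
}).sort_values(['subjid', 'visit'])

# 'first' picks the first non-missing result of each group,
# and transform() broadcasts it back to every row of that group
demo['bsl_value'] = demo.groupby('subjid')['result'].transform('first')
demo['chg_bsl'] = demo['result'] - demo['bsl_value']
```

Subject 1 has no result at visit 1, so its baseline is taken from visit 2. The merge-based version has the advantage of also producing an explicit baseline flag.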

## Ranks

Ranking assigns ranks (numbers) from 1 to the number of data points. By default, tied values each receive the mean of the ranks they span (other tie-breaking methods are available).

In [0]:
data = {
    'subjid': ['S01', 'S02', 'S03', 'S04'],
    'age':    [20, 30, 30, 40],
}

df = pd.DataFrame(data, columns=['subjid', 'age'])
df

In [0]:
# Series 'age': 20 gets rank 1, 40 gets rank 4, and the two 30s share the mean of ranks 2 and 3 (2.5).
# rank() returns a new Series; df is not modified.
df['age'].rank()


In [0]:
# On a dataframe, every column is ranked. Again, rank() returns a new object; df is not modified.
df_rank = df.rank()
df_rank
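
The other options for ties are selected with the `method` parameter of `rank()`. The sketch below compares four of them on the same ages; the names `ages` and `ranks` are hypothetical.

```python
import pandas as pd

ages = pd.Series([20, 30, 30, 40])

ranks = pd.DataFrame({
    'average': ages.rank(),                  # default: ties share the mean rank
    'min':     ages.rank(method='min'),      # ties all get the lowest rank
    'dense':   ages.rank(method='dense'),    # like 'min', but without gaps afterwards
    'first':   ages.rank(method='first'),    # ties ranked in order of appearance
})
```

Only the two tied values and the value after them differ between methods.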

__________________________________________________
Nicolas Dupuis, Methodology and Innovation (IDAR C&SP), 2020+