# Advanced Pandas functionality 
## - DataFrame.apply()

## Introduction
* We now use pandas DataFrames to hold objects (here: NumPy arrays) instead of plain numbers
* We process all columns or rows with the `.apply()` and `.applymap()` methods

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Preparing test data

First we generate some objects, namely 100 numpy arrays containing 500 random values each:

In [2]:
curves = [np.random.randn(500) for i in range(100)]

Then we generate random, unique IDs for the curves (these could be tube IDs, for example):

In [3]:
ids = np.random.choice(range(10000, 99999), 100, replace=False)
ids

array([18357, 76964, 19250, 93057, 83236, 93509, 90130, 91088, 85293,
       88248, 19496, 10661, 15259, 11987, 45957, 94743, 63430, 87849,
       67998, 83674, 35721, 56530, 50343, 73938, 61845, 54186, 12525,
       22579, 30518, 48427, 70773, 85325, 59341, 59098, 53484, 22250,
       89846, 91291, 58360, 11589, 69700, 10614, 61021, 62444, 26057,
       12456, 83770, 59133, 88394, 40679, 91280, 99492, 63283, 42928,
       50447, 66279, 83303, 22544, 46636, 49433, 34218, 81231, 92724,
       40299, 79468, 41294, 21315, 49740, 53654, 49783, 87218, 97320,
       13227, 94428, 23952, 85741, 90032, 23779, 56768, 10930, 27678,
       94893, 99295, 72915, 74903, 72727, 76454, 50844, 40126, 77506,
       98308, 48019, 92759, 12975, 12171, 42717, 64838, 44341, 25374,
       63772])

... and put everything into a Series:

In [4]:
s1 = pd.Series(data=curves, 
               index=ids, 
               name='first_sensor')

Finally we make a DataFrame from it:

In [5]:
df1 = s1.to_frame()
df1.head(5)

| | first_sensor |
|---|---|
| 18357 | [-0.7390371179293477, -0.24291510833306965, 0.... |
| 76964 | [1.0730572340401001, 0.5279795464380411, -0.83... |
| 19250 | [-0.22260823611479102, 0.40813512383792894, -0... |
| 93057 | [-1.5031759223766656, 0.5489861195300714, 0.69... |
| 83236 | [-0.528825059017967, 0.7856483314081809, 1.806... |
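To make the "objects instead of numbers" point concrete, the following minimal sketch builds a much smaller frame the same way and inspects what is stored. The names `curves_demo`, `s_demo`, and `df_demo` and the IDs 101 to 103 are made up for illustration and are not part of the data above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature version of the test data: 3 curves with 5 values each
rng = np.random.default_rng(0)
curves_demo = [rng.standard_normal(5) for _ in range(3)]
s_demo = pd.Series(data=curves_demo, index=[101, 102, 103], name='first_sensor')
df_demo = s_demo.to_frame()

# The column has dtype "object": every cell holds a complete NumPy array
print(df_demo.dtypes)
first_cell = df_demo.loc[101, 'first_sensor']
print(type(first_cell), first_cell.shape)
```

Because each cell is an arbitrary Python object, pandas cannot vectorize operations on such a column, which is why `.apply()` is the tool of choice in the rest of this notebook.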


For demonstration purposes, we now add measurements from a second sensor:

In [6]:
curves_from_sensor_2 = [np.random.randn(500) for i in range(100)]
# pd.Int64Index was removed in pandas 2.0; pd.Index infers the integer dtype
s2 = pd.Series(data=curves_from_sensor_2,
               index=pd.Index(ids, name='ID'),
               name='second_sensor')
df2 = s2.to_frame()

In [7]:
df = df1.join(df2)
df.head(2)

| | first_sensor | second_sensor |
|---|---|---|
| 18357 | [-0.7390371179293477, -0.24291510833306965, 0.... | [1.0118496457408093, 1.4592287726873725, 1.101... |
| 76964 | [1.0730572340401001, 0.5279795464380411, -0.83... | [-0.5575252272650129, -0.14120645797188594, 0.... |
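The join above works because both frames share the same index values. As a small sketch with made-up numbers (the frames `left` and `right` are hypothetical), `.join()` aligns rows by index label, not by position:

```python
import pandas as pd

# Hypothetical frames whose rows are stored in a different order
left = pd.Series([1.0, 2.0, 3.0], index=[10, 20, 30], name='first_sensor').to_frame()
right = pd.Series([30.0, 10.0, 20.0], index=[30, 10, 20], name='second_sensor').to_frame()

# Default is a left join on the index: the row order of `left` is kept
joined = left.join(right)
print(joined)
```

Each row of `joined` pairs the values that carry the same index label, regardless of where they sat in `right`.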


# Applying functions

## 1. `DataFrame.apply()`
We now want to calculate some summary statistics for the curves. To do so, we use `.apply()` on the DataFrame. The function passed to `.apply()` receives the columns (`axis=0`) or the rows (`axis=1`) of the DataFrame one by one as input, each as a `pd.Series`.

In [8]:
def _calculate_mean_of_sensor(row, column='first_sensor'):
    single_curve = row[column]    
    return np.mean(single_curve)

# axis=1 applies the function row-wise
mean_of_first_sensor = df.apply(_calculate_mean_of_sensor, axis=1).rename('mean_of_first_sensor')
mean_of_first_sensor.head(2)

18357   -0.036690
76964   -0.007325
Name: mean_of_first_sensor, dtype: float64
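The call above uses `axis=1`. For contrast, here is a minimal sketch of both directions on a tiny stand-in frame. The frame `demo`, the IDs, and the helper names `column_summary` and `row_summary` are assumptions made for this illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: 3 IDs, two sensors, 5 values per curve
rng = np.random.default_rng(0)
ids_demo = [101, 102, 103]
demo = pd.DataFrame({
    'first_sensor': pd.Series([rng.standard_normal(5) for _ in ids_demo], index=ids_demo),
    'second_sensor': pd.Series([rng.standard_normal(5) for _ in ids_demo], index=ids_demo),
})

def column_summary(column):
    # axis=0: `column` is one whole column, i.e. a Series with one curve per ID
    return np.mean([curve.mean() for curve in column])

def row_summary(row):
    # axis=1: `row` is one row, i.e. a Series with one curve per sensor
    return np.mean(row['first_sensor'])

per_column = demo.apply(column_summary, axis=0)  # indexed by column name
per_row = demo.apply(row_summary, axis=1)        # indexed by ID
print(per_column)
print(per_row)
```

The index of the result shows which direction was used: column names for `axis=0`, row labels for `axis=1`.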

A function can use multiple columns in its calculation. Let's say we want to calculate the absolute difference between the means of sensor 1 and sensor 2:

In [9]:
def _get_mean_difference(row, first_sensor='first_sensor', second_sensor='second_sensor'):
    sensor_1_curve = row[first_sensor]
    sensor_2_curve = row[second_sensor]
    
    return np.abs(np.mean(sensor_1_curve) - np.mean(sensor_2_curve))

mean_difference = df.apply(_get_mean_difference, axis=1).rename('mean_difference')
mean_difference.head(2)

18357    0.020503
76964    0.056166
Name: mean_difference, dtype: float64
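The helper functions above take the column names as keyword arguments with default values. `DataFrame.apply()` forwards any additional keyword arguments to the applied function, so the same helper can be pointed at another column. A minimal sketch, assuming a small made-up frame `demo` and a helper `mean_of_sensor` modeled on the one above:

```python
import numpy as np
import pandas as pd

def mean_of_sensor(row, column='first_sensor'):
    # Same idea as the helper above: mean of the curve stored in one cell
    return np.mean(row[column])

# Hypothetical stand-in for df
rng = np.random.default_rng(1)
ids_demo = [101, 102, 103]
demo = pd.DataFrame({
    'first_sensor': pd.Series([rng.standard_normal(5) for _ in ids_demo], index=ids_demo),
    'second_sensor': pd.Series([rng.standard_normal(5) for _ in ids_demo], index=ids_demo),
})

# `column='second_sensor'` is not a parameter of apply(); it is passed on to mean_of_sensor
mean_of_second_sensor = demo.apply(mean_of_sensor, axis=1, column='second_sensor')
print(mean_of_second_sensor.rename('mean_of_second_sensor'))
```

This avoids writing one nearly identical function per sensor.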

Functions can also have multiple outputs. In this case we return a `pd.Series`, and each of its entries becomes a column of the result:

In [10]:
def _get_means(row, first_sensor='first_sensor', second_sensor='second_sensor'):
    sensor_1_curve = row[first_sensor]
    sensor_2_curve = row[second_sensor]
    mean_curve_1 = np.mean(sensor_1_curve)
    mean_curve_2 = np.mean(sensor_2_curve)

    # The keys of the returned Series become the column names of the result
    return pd.Series({'Mean_Curve_1': mean_curve_1, 'Mean_Curve_2': mean_curve_2})

means = df.apply(_get_means, axis=1)
means.head(2)

| | Mean_Curve_1 | Mean_Curve_2 |
|---|---|---|
| 18357 | -0.03669 | -0.016187 |
| 76964 | -0.007325 | -0.063491 |
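Returning a `pd.Series` is one way to get several output columns. As an alternative sketch, the function can return a plain list and `.apply()` can be asked to expand it with `result_type='expand'`. The frame `demo` and the helper `both_means` are hypothetical stand-ins. Note that the expanded columns are numbered 0, 1, ... and have to be named afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df
rng = np.random.default_rng(2)
ids_demo = [101, 102, 103]
demo = pd.DataFrame({
    'first_sensor': pd.Series([rng.standard_normal(5) for _ in ids_demo], index=ids_demo),
    'second_sensor': pd.Series([rng.standard_normal(5) for _ in ids_demo], index=ids_demo),
})

def both_means(row):
    # A plain list instead of a pd.Series
    return [np.mean(row['first_sensor']), np.mean(row['second_sensor'])]

# result_type='expand' turns each list-like result into columns
means_demo = demo.apply(both_means, axis=1, result_type='expand')
means_demo.columns = ['Mean_Curve_1', 'Mean_Curve_2']
print(means_demo)
```

Which variant reads better is a matter of taste; the `pd.Series` version keeps the column names next to the values they describe.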


## 2. `DataFrame.applymap()`

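If we want to apply the same function to every cell of the table, rather than row-wise or column-wise, we can use `.applymap()`. (Since pandas 2.1, `.applymap()` is deprecated in favor of the equivalent `DataFrame.map()`.) Here we calculate the length of each curve: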

In [11]:
lengths = df.applymap(len).add_prefix('length_')
lengths.head(2)

| | length_first_sensor | length_second_sensor |
|---|---|---|
| 18357 | 500 | 500 |
| 76964 | 500 | 500 |
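If the notebook has to run on both older and newer pandas versions, one possible approach is to pick the element-wise method at runtime. This is only a sketch on a made-up frame `demo`; the variable name `elementwise` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df
rng = np.random.default_rng(3)
ids_demo = [101, 102, 103]
demo = pd.DataFrame({
    'first_sensor': pd.Series([rng.standard_normal(5) for _ in ids_demo], index=ids_demo),
    'second_sensor': pd.Series([rng.standard_normal(5) for _ in ids_demo], index=ids_demo),
})

# DataFrame.map exists from pandas 2.1 on; fall back to applymap on older versions
elementwise = demo.map if hasattr(pd.DataFrame, 'map') else demo.applymap

lengths_demo = elementwise(len).add_prefix('length_')
print(lengths_demo)
```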


## 3. `Series.apply()`
`Series.apply()` simply applies the function to each element of the Series. This is very similar to `DataFrame.applymap()`:

In [12]:
s1.apply(len).head(2)

18357    500
76964    500
Name: first_sensor, dtype: int64
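When a calculation needs only one column, `Series.apply()` on that column is often the simpler route compared to a row-wise `DataFrame.apply()`. The following sketch, again on a hypothetical frame `demo`, computes the mean of the first sensor both ways and checks that the results agree:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df
rng = np.random.default_rng(4)
ids_demo = [101, 102, 103]
demo = pd.DataFrame({
    'first_sensor': pd.Series([rng.standard_normal(5) for _ in ids_demo], index=ids_demo),
    'second_sensor': pd.Series([rng.standard_normal(5) for _ in ids_demo], index=ids_demo),
})

# Series.apply: the function receives one curve (one cell) at a time
via_series = demo['first_sensor'].apply(lambda curve: curve.mean())

# DataFrame.apply with axis=1: the function receives one whole row at a time
via_frame = demo.apply(lambda row: row['first_sensor'].mean(), axis=1)

print(np.allclose(via_series, via_frame))
```

The row-wise version only becomes necessary once the function needs values from more than one column.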