# Advanced Pandas functionality
## DataFrame.apply()

## Introduction
* We now use Pandas DataFrames to hold objects (here: NumPy arrays) instead of plain numbers
* We process all columns or rows using the `.apply()` and `.applymap()` methods

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Preparing test data

First we generate some objects, namely 100 numpy arrays containing 500 random values each:

In [2]:
curves = [np.random.randn(500) for i in range(100)]

Then we generate random, unique IDs for the curves (these could be tube IDs, for example):

In [3]:
ids = np.random.choice(range(10000, 99999), 100, replace=False)
ids

array([93563, 22458, 75349, 28930, 57059, 48114, 13249, 55497, 14562,
       25324, 23289, 98390, 27936, 24347, 28367, 18411, 36689, 15152,
       56849, 57945, 56477, 86446, 31007, 17934, 20601, 13583, 24839,
       86655, 27471, 11182, 88883, 22753, 21043, 61047, 51750, 26818,
       30222, 69613, 54620, 11155, 16727, 31381, 57058, 25865, 18823,
       35712, 63752, 37585, 42704, 88786, 53601, 76173, 71348, 34119,
       20593, 46421, 56276, 98167, 72349, 13059, 82956, 21153, 70396,
       45327, 18490, 44584, 96570, 46667, 25960, 52693, 10497, 26414,
       36272, 95517, 21321, 64550, 12055, 33480, 26764, 15704, 56865,
       81970, 24535, 18206, 94292, 69097, 82515, 90101, 53048, 52202,
       99273, 43577, 50854, 25144, 23005, 68256, 89236, 21148, 36570,
       93422])
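Because the curves and IDs are random, the outputs shown here change on every run. A minimal sketch of a reproducible variant, assuming NumPy's `default_rng` generator is acceptable here (the seed value 42 is an arbitrary choice, not taken from the original notebook):

```python
import numpy as np

# Seeded generator: the same seed reproduces the same curves and IDs.
# The seed value is an arbitrary assumption for illustration.
rng = np.random.default_rng(seed=42)

# 100 curves with 500 standard-normal samples each.
curves = [rng.standard_normal(500) for _ in range(100)]

# 100 distinct five-digit IDs; replace=False guarantees uniqueness.
ids = rng.choice(np.arange(10000, 99999), size=100, replace=False)
```

The rest of the notebook works unchanged with `curves` and `ids` created this way.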

... and put everything into a Series:

In [4]:
s1 = pd.Series(data=curves, 
               index=ids, 
               name='first_sensor')

Finally we make a DataFrame from it:

In [5]:
df1 = s1.to_frame()
df1.head(5)

| | first_sensor |
|---|---|
| 93563 | [-0.6598735981198447, -0.7671201736835079, 1.1... |
| 22458 | [-0.9303387949314981, 0.38224683602931453, 0.1... |
| 75349 | [0.02845500051866786, 1.5454019483180568, 1.94... |
| 28930 | [1.3364052040665673, -1.7036977375995483, 1.03... |
| 57059 | [-0.7226818672934231, 0.6207632929940364, 0.70... |


For demonstration purposes we now add measurements from a second sensor:

In [6]:
curves_from_sensor_2 = [np.random.randn(500) for i in range(100)]
# pd.Int64Index was removed in pandas 2.0; pd.Index infers an integer dtype from the ids
s2 = pd.Series(data=curves_from_sensor_2,
               index=pd.Index(ids, name='ID'),
               name='second_sensor')
df2 = s2.to_frame()


In [7]:
df = df1.join(df2)
df.head(2)

| | first_sensor | second_sensor |
|---|---|---|
| 93563 | [-0.6598735981198447, -0.7671201736835079, 1.1... | [0.9551655381636115, -0.6139232678406525, -1.6... |
| 22458 | [-0.9303387949314981, 0.38224683602931453, 0.1... | [-1.382049580957035, 0.7038721760818228, 0.147... |


# Applying functions

## 1. `DataFrame.apply()`
We now want to calculate some summary statistics on the curves. To do so, we call `.apply()` on the DataFrame. The function passed to `.apply()` receives the columns (`axis=0`, the default) or the rows (`axis=1`) of the DataFrame one by one, each as a `pd.Series`.

In [8]:
def _calculate_mean_of_sensor(row, column='first_sensor'):
    single_curve = row[column]    
    return np.mean(single_curve)

# axis=1 applies the function row-wise (the default, axis=0, is column-wise)
mean_of_first_sensor = df.apply(_calculate_mean_of_sensor, axis=1).rename('mean_of_first_sensor')
mean_of_first_sensor.head(2)

93563    0.018476
22458   -0.015098
Name: mean_of_first_sensor, dtype: float64
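To make the difference between the two axes concrete, here is a small sketch on a toy frame. The names `toy` and `describe_input` are hypothetical and only serve this illustration; the function reports which labels the Series it receives is indexed by.

```python
import pandas as pd

# Hypothetical toy frame, not the sensor data from above.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]}, index=["x", "y", "z"])

def describe_input(series):
    # Each call receives one pd.Series; join its index labels into a string.
    return ",".join(str(label) for label in series.index)

# axis=0: one call per column, the Series is indexed by the row labels.
per_column = toy.apply(describe_input, axis=0)

# axis=1: one call per row, the Series is indexed by the column names.
per_row = toy.apply(describe_input, axis=1)

print(per_column)
print(per_row)
```

With `axis=1` the row's index consists of the column names, which is why `row[column]` works inside `_calculate_mean_of_sensor`.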

A function can use multiple columns in its calculation. Let's say we want to calculate the absolute difference between the means of sensor 1 and sensor 2:

In [9]:
def _get_mean_difference(row, first_sensor='first_sensor', second_sensor='second_sensor'):
    sensor_1_curve = row[first_sensor]
    sensor_2_curve = row[second_sensor]
    
    return np.abs(np.mean(sensor_1_curve) - np.mean(sensor_2_curve))

mean_difference = df.apply(_get_mean_difference, axis=1).rename('mean_difference')
mean_difference.head(2)

93563    0.060306
22458    0.081392
Name: mean_difference, dtype: float64
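The column names are keyword parameters with defaults, and `DataFrame.apply()` forwards additional keyword arguments to the applied function. A minimal sketch of overriding those defaults, assuming a hypothetical frame `demo` whose columns are named `sensor_a` and `sensor_b`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with differently named columns, small enough to check by hand.
demo = pd.DataFrame({
    "sensor_a": pd.Series([np.array([1.0, 3.0]), np.array([2.0, 4.0])]),
    "sensor_b": pd.Series([np.array([0.0, 2.0]), np.array([5.0, 7.0])]),
})

def mean_difference(row, first_sensor="first_sensor", second_sensor="second_sensor"):
    # Absolute difference between the means of the two curves in this row.
    return np.abs(np.mean(row[first_sensor]) - np.mean(row[second_sensor]))

# Extra keyword arguments are passed through to mean_difference on every call.
result = demo.apply(mean_difference, axis=1,
                    first_sensor="sensor_a", second_sensor="sensor_b")
print(result)
```

This keeps the function reusable for frames whose columns are named differently.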

Functions can also return multiple values. In this case we return a `pd.Series`, and each of its entries becomes a column of the result:

In [10]:
# Named _get_means so that it does not shadow _get_mean_difference from the previous cell
def _get_means(row, first_sensor='first_sensor', second_sensor='second_sensor'):
    sensor_1_curve = row[first_sensor]
    sensor_2_curve = row[second_sensor]
    mean_curve_1 = np.mean(sensor_1_curve)
    mean_curve_2 = np.mean(sensor_2_curve)

    return pd.Series({'Mean_Curve_1': mean_curve_1, 'Mean_Curve_2': mean_curve_2})

means = df.apply(_get_means, axis=1)
means.head(2)

| | Mean_Curve_1 | Mean_Curve_2 |
|---|---|---|
| 93563 | 0.018476 | -0.04183 |
| 22458 | -0.015098 | 0.066294 |
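An alternative to building a `pd.Series` in every call is to return a plain tuple and pass `result_type='expand'`, which turns list-like return values into columns when `axis=1`. The following is a sketch on a small hypothetical frame `demo`; the function name `means_as_tuple` is likewise made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame holding one short curve per cell.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "first_sensor": pd.Series([rng.standard_normal(5) for _ in range(3)]),
    "second_sensor": pd.Series([rng.standard_normal(5) for _ in range(3)]),
})

def means_as_tuple(row):
    # Return both means as a tuple instead of a pd.Series.
    return np.mean(row["first_sensor"]), np.mean(row["second_sensor"])

# result_type="expand" spreads the tuple over columns labelled 0 and 1.
means = demo.apply(means_as_tuple, axis=1, result_type="expand")

# The expanded columns carry no names, so we label them explicitly.
means.columns = ["Mean_Curve_1", "Mean_Curve_2"]
print(means)
```

The trade-off is that the column names have to be assigned afterwards, whereas the `pd.Series` variant carries them along.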


## 2. `DataFrame.applymap()`

If we want to apply the same function to every cell of the table, rather than row-wise or column-wise, we can use `.applymap()` (renamed to `DataFrame.map()` in pandas 2.1, where `applymap` is deprecated). Here we calculate the length of each curve:

In [11]:
lengths = df.applymap(len).add_prefix('length_')
lengths.head(2)

| | length_first_sensor | length_second_sensor |
|---|---|---|
| 93563 | 500 | 500 |
| 22458 | 500 | 500 |
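A sketch that picks whichever element-wise method the installed pandas offers (`DataFrame.map` on pandas 2.1 or later, `applymap` otherwise). The frame `demo` and the variable `elementwise` are hypothetical names for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with curves of different lengths, so the result is easy to check.
demo = pd.DataFrame({
    "first_sensor": pd.Series([np.zeros(4), np.zeros(6)]),
    "second_sensor": pd.Series([np.zeros(5), np.zeros(3)]),
})

# DataFrame.map exists from pandas 2.1 on; fall back to applymap on older versions.
elementwise = demo.map if hasattr(pd.DataFrame, "map") else demo.applymap

# len is called once per cell, exactly as with applymap above.
lengths = elementwise(len).add_prefix("length_")
print(lengths)
```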


## 3. `Series.apply()`
`Series.apply()` simply applies the function to each element of the Series. This is very similar to `DataFrame.applymap()`, restricted to a single column.

In [12]:
s1.apply(len).head(2)

93563    500
22458    500
Name: first_sensor, dtype: int64
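The similarity can be made explicit: calling `Series.apply()` on every column of a DataFrame gives an element-wise result for the whole table. A minimal sketch on a hypothetical frame `demo`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with curves of different lengths.
demo = pd.DataFrame({
    "first_sensor": pd.Series([np.zeros(4), np.zeros(6)]),
    "second_sensor": pd.Series([np.zeros(5), np.zeros(3)]),
})

# DataFrame.apply with the default axis=0 hands over one column (a Series) at a time;
# Series.apply then calls len on each element of that column.
per_column = demo.apply(lambda column: column.apply(len))
print(per_column)
```

The result has the same shape and labels as `demo`, with each curve replaced by its length.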