# Advanced Pandas functionality 
## - DataFrame.apply()

## Introduction
* We now use Pandas DataFrames to hold objects (here: NumPy arrays) instead of plain numbers
* We process all columns, rows, or cells using the `.apply()` and `.map()` methods

In [27]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Preparing test data

First we generate some objects, namely 100 NumPy arrays containing 500 standard-normal random values each:

In [28]:
curves = [np.random.randn(500) for i in range(100)]

Then we generate random, unique IDs for the curves (these could be tube IDs, for example):

In [29]:
ids = np.random.choice(range(10000, 99999), 100, replace=False)
ids

array([82590, 55844, 31985, 61340, 74576, 76616, 27787, 76249, 83860,
       19898, 27529, 41697, 52773, 53054, 26543, 21920, 45355, 25344,
       71838, 91207, 53333, 93331, 10537, 96239, 93528, 43127, 18427,
       75788, 70301, 10901, 54968, 95862, 82471, 99576, 74093, 59369,
       17133, 96882, 57381, 52192, 97838, 62893, 74781, 75079, 75000,
       56897, 25269, 91873, 76151, 19890, 63756, 33356, 79946, 23777,
       95785, 26768, 77871, 48135, 59609, 19166, 39150, 31924, 59096,
       91456, 10628, 17015, 66773, 34167, 80354, 71051, 29980, 53992,
       36534, 53577, 66095, 26716, 59311, 31984, 53482, 68444, 76486,
       85529, 85286, 91226, 76524, 55389, 13746, 94707, 28957, 67612,
       92276, 16596, 26332, 89707, 24263, 40422, 64275, 35380, 21453,
       23506])
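Because the curves and IDs above are drawn without a seed, every run produces different values. A minimal sketch of a reproducible variant, assuming NumPy's `Generator` API; the seed value and the names `rng`, `curves_seeded` and `ids_seeded` are hypothetical and not used elsewhere in this notebook:

```python
import numpy as np

# Hypothetical seed; any fixed integer makes the run repeatable.
rng = np.random.default_rng(seed=42)

# 100 curves with 500 standard-normal samples each
curves_seeded = [rng.standard_normal(500) for _ in range(100)]

# 100 unique IDs drawn without replacement from 10000..99998
ids_seeded = rng.choice(np.arange(10000, 99999), size=100, replace=False)
```
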

... and put everything into a Series:

In [30]:
s1 = pd.Series(data=curves, 
               index=ids, 
               name='first_sensor')

Finally we make a DataFrame from it:

In [31]:
df1 = s1.to_frame()
df1.head(5)

| | first_sensor |
|---|---|
| 82590 | [0.7001446564274595, -0.6202553402125156, 0.83... |
| 55844 | [-0.8794582998349463, 0.8375811217847537, -0.3... |
| 31985 | [-2.370463320766105, 0.6998033482535078, 1.077... |
| 61340 | [-2.201685429659071, -0.7434162727229623, -2.2... |
| 74576 | [1.4860213731146745, -0.6830079893991837, 0.62... |


For demonstration purposes, we now add measurements from a second sensor:

In [32]:
curves_from_sensor_2 = [np.random.randn(500) for _ in range(100)]
# Reuse the same ids so that both sensors can be joined on the index
s2 = pd.Series(data=curves_from_sensor_2,
               index=pd.Index(ids, dtype='int64', name='ID'),
               name='second_sensor')
df2 = s2.to_frame()

In [33]:
df = df1.join(df2)
df.head(2)

| | first_sensor | second_sensor |
|---|---|---|
| 82590 | [0.7001446564274595, -0.6202553402125156, 0.83... | [-0.678286071981691, 1.4435591485656292, -0.69... |
| 55844 | [-0.8794582998349463, 0.8375811217847537, -0.3... | [1.3725975495992109, 0.40835253873105787, 0.24... |
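`join()` matches the two frames on their index labels, not on row position, which is why both Series were built on the same ids. A minimal sketch with hypothetical toy data (the names `left`, `right` and `joined` are assumptions for illustration only):

```python
import pandas as pd

# Left frame: index labels in arbitrary order
left = pd.Series([1, 2, 3], index=[30, 10, 20], name="first").to_frame()

# Right frame: same labels, different order, named index
right = pd.Series([10, 20, 30],
                  index=pd.Index([10, 20, 30], name="ID"),
                  name="second").to_frame()

# Default left join: rows keep the order of `left`,
# values from `right` are looked up by label
joined = left.join(right)
```

Here the row labelled 30 receives the value 30 from `right`, even though it is the first row in `left` and the last in `right`.
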


# Applying functions

## 1. `DataFrame.apply()`
We now want to calculate some summary statistics on the curves. To do so, we use `.apply()` on the DataFrame. The function passed to `.apply()` receives the columns (`axis=0`, the default) or the rows (`axis=1`) of the DataFrame one at a time, each as a Series.

In [34]:
def _calculate_mean_of_sensor(row, column='first_sensor'):
    single_curve = row[column]    
    return np.mean(single_curve)

# axis=1 applies the function row-wise
mean_of_first_sensor = df.apply(_calculate_mean_of_sensor, axis=1).rename('mean_of_first_sensor')
mean_of_first_sensor.head(2)

82590   -0.055035
55844    0.020998
Name: mean_of_first_sensor, dtype: float64
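To make the difference between the two axes concrete, here is a minimal sketch on a small hypothetical frame (`demo` and its values are assumptions, not part of the test data above). With `axis=0` the function sees one whole column at a time, with `axis=1` one row at a time:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two rows, each cell holds a short curve
demo = pd.DataFrame({
    "first_sensor": pd.Series([np.array([1.0, 2.0]), np.array([3.0, 5.0])],
                              index=[101, 102]),
    "second_sensor": pd.Series([np.array([0.0, 2.0]), np.array([4.0, 4.0])],
                               index=[101, 102]),
})

# axis=0: called once per column; len() of a column is the number of rows
rows_per_column = demo.apply(len, axis=0)

# axis=1: called once per row; the row is a Series indexed by column name
mean_per_row = demo.apply(lambda row: np.mean(row["first_sensor"]), axis=1)
```

`rows_per_column` is indexed by the column names, while `mean_per_row` is indexed by the row labels.
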

A function can use multiple columns for its calculation. Let's say we want to calculate the absolute difference between the means of sensor 1 and sensor 2:

In [35]:
def _get_mean_difference(row, first_sensor='first_sensor', second_sensor='second_sensor'):
    sensor_1_curve = row[first_sensor]
    sensor_2_curve = row[second_sensor]
    
    return np.abs(np.mean(sensor_1_curve) - np.mean(sensor_2_curve))

mean_difference = df.apply(_get_mean_difference, axis=1).rename('mean_difference')
mean_difference.head(2)

82590    0.088417
55844    0.008528
Name: mean_difference, dtype: float64
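For comparison, the same quantity can also be computed column by column and then combined, since Series arithmetic aligns on the index. This is only a sketch on hypothetical toy data (`demo`, `column_wise` and `row_wise` are illustrative names), not a claim about which variant is preferable:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "first_sensor": pd.Series([np.array([1.0, 2.0]), np.array([3.0, 5.0])],
                              index=[101, 102]),
    "second_sensor": pd.Series([np.array([0.0, 2.0]), np.array([4.0, 4.0])],
                               index=[101, 102]),
})

# Column-wise: one mean per curve, then subtract the two Series
mean_1 = demo["first_sensor"].apply(lambda curve: np.mean(curve))
mean_2 = demo["second_sensor"].apply(lambda curve: np.mean(curve))
column_wise = (mean_1 - mean_2).abs().rename("mean_difference")

# Row-wise: the approach used above, for reference
row_wise = demo.apply(
    lambda row: np.abs(np.mean(row["first_sensor"]) - np.mean(row["second_sensor"])),
    axis=1,
).rename("mean_difference")
```
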

A function can also return multiple values. In that case it returns a `pd.Series`, and each key of that Series becomes a column of the result:

In [36]:
def _get_means(row, first_sensor='first_sensor', second_sensor='second_sensor'):
    sensor_1_curve = row[first_sensor]
    sensor_2_curve = row[second_sensor]
    mean_curve_1 = np.mean(sensor_1_curve)
    mean_curve_2 = np.mean(sensor_2_curve)

    # Each key of the returned Series becomes a column in the result
    return pd.Series({'Mean_Curve_1': mean_curve_1, 'Mean_Curve_2': mean_curve_2})

means = df.apply(_get_means, axis=1)
means.head(2)

| | Mean_Curve_1 | Mean_Curve_2 |
|---|---|---|
| 82590 | -0.055035 | 0.033382 |
| 55844 | 0.020998 | 0.01247 |
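As an alternative to building a `pd.Series` in every call, `DataFrame.apply()` also accepts `result_type='expand'`, which turns list-like return values into columns. A minimal sketch on hypothetical toy data; the helper `_means_as_list` and the frame `demo` are assumptions for illustration:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "first_sensor": pd.Series([np.array([1.0, 2.0]), np.array([3.0, 5.0])],
                              index=[101, 102]),
    "second_sensor": pd.Series([np.array([0.0, 2.0]), np.array([4.0, 4.0])],
                               index=[101, 102]),
})

def _means_as_list(row):
    # Hypothetical helper: returns both means as a plain list
    return [np.mean(row["first_sensor"]), np.mean(row["second_sensor"])]

# 'expand' spreads the list over columns, which are numbered 0, 1, ...
means_demo = demo.apply(_means_as_list, axis=1, result_type="expand")

# The column names have to be assigned afterwards
means_demo.columns = ["Mean_Curve_1", "Mean_Curve_2"]
```

The trade-off is that the column names are no longer defined inside the function, so they must be set in a separate step.
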


## 2. `DataFrame.map()`

To apply the same function to every cell of the table, rather than row-wise or column-wise, we can use `.map()`. Here we calculate the length of each curve:

In [37]:
lengths = df.map(len).add_prefix('length_')
lengths.head(2)

| | length_first_sensor | length_second_sensor |
|---|---|---|
| 82590 | 500 | 500 |
| 55844 | 500 | 500 |
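`DataFrame.map()` was introduced in pandas 2.1; earlier versions offer the same element-wise behaviour under the name `DataFrame.applymap()`. A minimal sketch of a call that works with either name, using hypothetical toy data (`demo` and `lengths_demo` are illustrative names):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "first_sensor": pd.Series([np.array([1.0, 2.0]), np.array([3.0, 5.0])],
                              index=[101, 102]),
    "second_sensor": pd.Series([np.array([0.0, 2.0]), np.array([4.0, 4.0])],
                               index=[101, 102]),
})

# Use DataFrame.map where available, otherwise fall back to applymap
if hasattr(pd.DataFrame, "map"):
    lengths_demo = demo.map(len)
else:
    lengths_demo = demo.applymap(len)

lengths_demo = lengths_demo.add_prefix("length_")
```
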


## 3. `Series.apply()`
`Series.apply()` simply applies the function to each element of the Series. This is very similar to `DataFrame.map()`:

In [38]:
s1.apply(len).head(2)

82590    500
55844    500
Name: first_sensor, dtype: int64
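The similarity can be checked directly: applying a function to a Series gives the same values as applying it element-wise to a one-column DataFrame and selecting that column. A minimal sketch on a hypothetical Series `s` with curves of different lengths:

```python
import numpy as np
import pandas as pd

# Hypothetical Series holding two curves of different length
s = pd.Series([np.array([1.0, 2.0]), np.array([3.0, 5.0, 7.0])],
              index=[101, 102], name="first_sensor")

# Element-wise on the Series
via_series = s.apply(len)

# Element-wise on a one-column DataFrame (applymap on pandas before 2.1)
frame = s.to_frame()
elementwise = frame.map if hasattr(pd.DataFrame, "map") else frame.applymap
via_frame = elementwise(len)["first_sensor"]
```
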