# Basics - Apply, Map and Vectorised Functions

In [1]:
import pandas as pd
import numpy as np

In [3]:
data = np.round(np.random.normal(size=(4,3)), 2)
df = pd.DataFrame(data, columns=['A','B','C'])
df

Unnamed: 0,A,B,C
0,0.24,-0.03,0.96
1,1.3,0.26,0.62
2,0.12,-0.04,-0.59
3,-0.51,0.28,-0.93


## Apply
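Used to run an arbitrary function against an entire DataFrame or a subsection of it. On a DataFrame, the function receives one column at a time (or one row at a time with axis=1); on a Series, it is normally called once per element. It is only truly vectorised if the function itself works on whole arrays.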

In [4]:
df.apply(lambda x: 1 + np.abs(x))

Unnamed: 0,A,B,C
0,1.24,1.03,1.96
1,2.3,1.26,1.62
2,1.12,1.04,1.59
3,1.51,1.28,1.93


In [5]:
df.A.apply(np.abs)

0    0.24
1    1.30
2    0.12
3    0.51
Name: A, dtype: float64
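The difference between column-wise and row-wise apply is easier to see with a function that reduces its input to one number. The sketch below is illustrative only: `df_demo` is a hypothetical name, and its values are copied from the output printed above so the result is reproducible.

```python
import numpy as np
import pandas as pd

# Values hard-coded from the df shown above (the original df is random).
df_demo = pd.DataFrame(
    [[0.24, -0.03, 0.96],
     [1.30, 0.26, 0.62],
     [0.12, -0.04, -0.59],
     [-0.51, 0.28, -0.93]],
    columns=['A', 'B', 'C'],
)

# axis=0 (the default): the function receives each column as a Series.
col_range = df_demo.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series.
row_range = df_demo.apply(lambda row: row.max() - row.min(), axis=1)

# The same row-wise result using built-in vectorised methods only.
row_range_builtin = df_demo.max(axis=1) - df_demo.min(axis=1)

print(col_range)
print(row_range)
```

When a built-in method such as max or min already does the job, it is usually the better choice; apply is most useful when no built-in equivalent exists.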

### Things to note here
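Be careful when using apply with raw=True: the function receives the underlying NumPy array rather than a Series, so modifying it in place can overwrite the original data. Copy the array before changing it.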

In [8]:
# example: double only the positive values
def double_if_positive(x):
    # copy first so the original data is left untouched
    x = x.copy()
    x[x > 0] *= 2  # with raw=True, x is a NumPy array holding a whole column, not a single value
    return x
    return x

df.apply(double_if_positive, raw=True)

Unnamed: 0,A,B,C
0,0.48,-0.03,1.92
1,2.6,0.52,1.24
2,0.24,-0.04,-0.59
3,-0.51,0.56,-0.93
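A quick way to confirm that the original data survives is to compare the DataFrame before and after the call. This is a minimal sketch under assumptions: `df_demo`, `double_if_positive_safe` and `double_if_positive_pure` are hypothetical names, and the second function uses np.where as one possible way to avoid in-place mutation altogether.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'A': [0.24, -0.51], 'B': [-0.03, 0.28]})
before = df_demo.copy()  # snapshot to compare against afterwards

def double_if_positive_safe(col):
    # With raw=True, col is a NumPy array; copy it before mutating.
    col = col.copy()
    col[col > 0] *= 2
    return col

def double_if_positive_pure(col):
    # np.where builds a new array, so nothing is modified in place.
    return np.where(col > 0, col * 2, col)

result_safe = df_demo.apply(double_if_positive_safe, raw=True)
result_pure = df_demo.apply(double_if_positive_pure, raw=True)

print(df_demo.equals(before))  # the input DataFrame is unchanged
print(result_safe)
```

Whether an in-place change actually reaches the original DataFrame depends on the pandas version and how the data is stored, which is a good reason not to rely on it either way.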


## Map
Similar to apply, but operates on a Series element by element. As well as a function, it accepts a dictionary (or another Series) as a lookup table; values with no matching key become NaN.

In [9]:
series = pd.Series(['Steve', 'Alex', 'Jess', 'Mark'])
series.map({'Steve': 'Stephen'})

0    Stephen
1        NaN
2        NaN
3        NaN
dtype: object

In [11]:
series.map(lambda x: f'I am {x}')

0    I am Steve
1     I am Alex
2     I am Jess
3     I am Mark
dtype: object
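In the dictionary example above, every name missing from the dictionary turned into NaN. If the intent is to rename some values and keep the rest, a few common patterns are sketched below; `names` and `mapping` are hypothetical names chosen for this illustration.

```python
import pandas as pd

names = pd.Series(['Steve', 'Alex', 'Jess', 'Mark'])
mapping = {'Steve': 'Stephen'}

# Option 1: map, then fill the gaps from the original Series.
filled = names.map(mapping).fillna(names)

# Option 2: look each value up, defaulting to the value itself.
looked_up = names.map(lambda name: mapping.get(name, name))

# Option 3: replace only touches values that appear in the mapping.
replaced = names.replace(mapping)

print(filled.tolist())
```

All three give the same result here; replace reads most directly when the goal is simply to substitute a few values.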

## Vectorised functions
Pandas and NumPy provide many of these. Here are some examples.

In [12]:
display(df, df.abs())

Unnamed: 0,A,B,C
0,0.24,-0.03,0.96
1,1.3,0.26,0.62
2,0.12,-0.04,-0.59
3,-0.51,0.28,-0.93


Unnamed: 0,A,B,C
0,0.24,0.03,0.96
1,1.3,0.26,0.62
2,0.12,0.04,0.59
3,0.51,0.28,0.93


In [13]:
series = pd.Series(['Obi-Wan Kenobi', 'Luke Skywalker', 'Han Solo', 'Leia Organa'])
series.str.split()

0    [Obi-Wan, Kenobi]
1    [Luke, Skywalker]
2          [Han, Solo]
3       [Leia, Organa]
dtype: object

In [14]:
# expand=True returns a DataFrame with one column per piece
series.str.split(expand=True)

Unnamed: 0,0,1
0,Obi-Wan,Kenobi
1,Luke,Skywalker
2,Han,Solo
3,Leia,Organa


In [15]:
# check whether each entry contains a substring
series.str.contains('Skywalker')

0    False
1     True
2    False
3    False
dtype: bool

In [16]:
series.str.upper()

0    OBI-WAN KENOBI
1    LUKE SKYWALKER
2          HAN SOLO
3       LEIA ORGANA
dtype: object
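The .str methods can be chained and combined with ordinary indexing. The following sketch reuses the same names as above under a hypothetical variable `names`; the column labels 'first' and 'last' are also chosen just for this illustration.

```python
import pandas as pd

names = pd.Series(['Obi-Wan Kenobi', 'Luke Skywalker', 'Han Solo', 'Leia Organa'])

# .str[0] indexes into each list produced by split.
first_names = names.str.split().str[0]

# Split into a DataFrame and give the columns readable labels.
parts = names.str.split(expand=True)
parts.columns = ['first', 'last']

# The boolean Series from contains can be used as a filter.
skywalkers = names[names.str.contains('Skywalker')]

print(first_names.tolist())
print(parts)
print(skywalkers.tolist())
```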

## User defined functions
Let's investigate a very simple example: finding the hypotenuse given x and y distances.

In [18]:
data2 = np.random.normal(10, 2, size=(100000, 2))
df2 = pd.DataFrame(data2, columns=['x', 'y'])

In [21]:
hypot = (df2.x **2 + df2.y **2)**0.5
hypot[0]

np.float64(12.39732233274286)

In [23]:
# the same calculation as a function called once per row
# looping with iterrows is much slower than the vectorised version
def hypot1(x,y):
    return np.sqrt(x**2 + y**2)

h1 = []
for index, (x,y) in df2.iterrows():
    h1.append(hypot1(x,y))
print(h1[0])

12.39732233274286


In [25]:
# compare with this version, which takes whole columns at once
def hypot2(xs, ys):
    return np.sqrt(xs**2 + ys**2)

h2 = hypot2(df2.x, df2.y)
print(h2[0])

12.39732233274286


In [27]:
data2.shape

(100000, 2)
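To put rough numbers on the difference between the row loop and the vectorised version, the standard-library timeit module can be used. This is only a sketch: it uses a smaller, hypothetical DataFrame (`df_small`, 2,000 rows) so the loop finishes quickly, the function names are made up for the comparison, and np.hypot is included as a ready-made NumPy alternative. Actual timings depend on the machine, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_small = pd.DataFrame(rng.normal(10, 2, size=(2000, 2)), columns=['x', 'y'])

def hypot_loop(frame):
    # One Python-level call per row: slow for large frames.
    return [np.sqrt(x**2 + y**2) for _, (x, y) in frame.iterrows()]

def hypot_vectorised(frame):
    # Whole-column arithmetic, evaluated inside NumPy.
    return np.sqrt(frame.x**2 + frame.y**2)

def hypot_builtin(frame):
    # NumPy already ships a hypotenuse function.
    return np.hypot(frame.x, frame.y)

loop_time = timeit.timeit(lambda: hypot_loop(df_small), number=3)
vec_time = timeit.timeit(lambda: hypot_vectorised(df_small), number=3)
builtin_time = timeit.timeit(lambda: hypot_builtin(df_small), number=3)

print(f'loop:       {loop_time:.4f} s')
print(f'vectorised: {vec_time:.4f} s')
print(f'np.hypot:   {builtin_time:.4f} s')
```

In a notebook, the %timeit magic gives the same kind of measurement with less typing.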

Vectorising everything you can is the key to speeding up your code. Once you've done that, use other tools to investigate where the remaining time goes. PyCharm Professional has a great optimisation tool built in, and Jupyter can use the %lprun (line profiler) magic command from the line_profiler package, which you can find here:
https://github.com/rkern/line_profiler

## Recap
- apply
- map
- .str & similar