# Basics - Apply, Map and Vectorised Functions

In [1]:
import pandas as pd
import numpy as np

data = np.round(np.random.normal(size=(4, 3)), 2)
df = pd.DataFrame(data, columns=["A", "B", "C"])
df.head()

Unnamed: 0,A,B,C
0,-1.2,0.24,0.08
1,0.77,0.24,0.11
2,3.7,0.29,-0.9
3,-0.52,-0.08,0.01


## Apply

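Used to execute an arbitrary function against an entire DataFrame, or a subsection of it. By default the function is called once per column (each passed as a Series), so it works best when the function itself is vectorised.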

In [2]:
df.apply(lambda x: 1 + np.abs(x))

Unnamed: 0,A,B,C
0,2.2,1.24,1.08
1,1.77,1.24,1.11
2,4.7,1.29,1.9
3,1.52,1.08,1.01


In [3]:
df.A.apply(np.abs)

0    1.20
1    0.77
2    3.70
3    0.52
Name: A, dtype: float64
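
As a rough sketch of what `apply` hands to the function: with the default `axis=0` each column arrives as a Series, and with `axis=1` each row does. The `toy` frame below is a hypothetical example, not the random data used above.

```python
import pandas as pd

# Hypothetical toy frame with fixed values so the results are easy to check.
toy = pd.DataFrame({"A": [1.0, -2.0], "B": [3.0, 4.0]})

# axis=0 (default): the function receives each column as a Series.
col_sums = toy.apply(lambda col: col.sum())

# axis=1: the function receives each row as a Series.
row_sums = toy.apply(lambda row: row.sum(), axis=1)

print(col_sums)
print(row_sums)
```

The result of the first call is indexed by column name, the second by row label.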

In [6]:
#def double_if_positive(x):
#    if x > 0:
#        return 2 * x
#    return x
#
#df.apply(double_if_positive)
# This raises a ValueError: apply passes each column as a Series, and
# `if x > 0` cannot reduce a Series of booleans to a single True/False.
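
To illustrate the failure, here is a minimal sketch on a hypothetical `toy` frame. It catches the error raised by the scalar-style function, then shows one way to run that same function element by element, by mapping it over each column. This is one option among several, and it is slower than a truly vectorised expression.

```python
import pandas as pd

# Hypothetical toy frame, used only for this illustration.
toy = pd.DataFrame({"A": [1.0, -2.0], "B": [3.0, -4.0]})

def double_if_positive(x):
    # Written for a single number, not for a whole Series.
    if x > 0:
        return 2 * x
    return x

try:
    toy.apply(double_if_positive)  # x is a whole column here
    failed = False
except ValueError as err:
    failed = True
    print(f"ValueError: {err}")

# Call the scalar function on each element by mapping it over every column.
elementwise = toy.apply(lambda col: col.map(double_if_positive))
print(elementwise)
```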

In [4]:
def double_if_positive(x):
    x[x > 0] *= 2
    return x

df.apply(double_if_positive)  # modifies df in place (see the next cell), which is usually unwanted

Unnamed: 0,A,B,C
0,-1.2,0.48,0.16
1,1.54,0.48,0.22
2,7.4,0.58,-0.9
3,-0.52,-0.08,0.02


In [5]:
df

Unnamed: 0,A,B,C
0,-1.2,0.48,0.16
1,1.54,0.48,0.22
2,7.4,0.58,-0.9
3,-0.52,-0.08,0.02


In [7]:
def double_if_positive(x):
    x = x.copy()
    x[x > 0] *= 2
    return x

df.apply(double_if_positive, raw=True)  # raw=True passes NumPy arrays; the copy leaves df untouched but costs extra memory on large data

Unnamed: 0,A,B,C
0,-1.2,0.96,0.32
1,3.08,0.96,0.44
2,14.8,1.16,-0.9
3,-0.52,-0.08,0.04
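
As an alternative sketch, the same "double the positive values" logic can be written without `apply` or an explicit copy by using `DataFrame.mask`, which returns a new frame. The `toy` frame is again a hypothetical stand-in for `df`.

```python
import pandas as pd

# Hypothetical toy frame, used only for this illustration.
toy = pd.DataFrame({"A": [1.0, -2.0], "B": [3.0, -4.0]})

# Where the condition is True, take the value from `toy * 2`;
# elsewhere keep the original value. `toy` itself is not modified.
doubled = toy.mask(toy > 0, toy * 2)

print(doubled)
print(toy)
```

Because `mask` builds a new DataFrame, re-running the cell does not keep doubling the data the way the in-place version does.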


## Map

Similar to apply, but operates on a Series element by element, and accepts a dictionary as a lookup table as well as a function. Values that are missing from the dictionary become NaN.


In [8]:
series = pd.Series(["Steve", "Alex", "Jess", "Mark"])

In [9]:
series.map({"Steve": "Stephen"})

0    Stephen
1        NaN
2        NaN
3        NaN
dtype: object
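
If the NaN values for unmatched names are not wanted, two common options are sketched below: fill the gaps from the original Series, or use `replace`, which only touches values present in the dictionary. The variable names `mapping`, `mapped` and `replaced` are made up for this example.

```python
import pandas as pd

series = pd.Series(["Steve", "Alex", "Jess", "Mark"])
mapping = {"Steve": "Stephen"}

# Option 1: map, then fall back to the original value where the key was missing.
mapped = series.map(mapping).fillna(series)

# Option 2: replace leaves any value that is not a key in the mapping unchanged.
replaced = series.replace(mapping)

print(mapped.tolist())
print(replaced.tolist())
```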

In [10]:
series.map(lambda d: f"I am {d}")  # returns a new Series; the original is unchanged

0    I am Steve
1     I am Alex
2     I am Jess
3     I am Mark
dtype: object

## Vectorised functions

Pandas and NumPy provide many of these. Here are some examples:

In [11]:
display(df, df.abs())

Unnamed: 0,A,B,C
0,-1.2,0.48,0.16
1,1.54,0.48,0.22
2,7.4,0.58,-0.9
3,-0.52,-0.08,0.02


Unnamed: 0,A,B,C
0,1.2,0.48,0.16
1,1.54,0.48,0.22
2,7.4,0.58,0.9
3,0.52,0.08,0.02


In [12]:
series = pd.Series(["Obi-Wan Kenobi", "Luke Skywalker", "Han Solo", "Leia Organa"])

In [13]:
"Luke Skywalker".split()

['Luke', 'Skywalker']

In [14]:
series.str.split(expand=True)  # expands the Series into a DataFrame

Unnamed: 0,0,1
0,Obi-Wan,Kenobi
1,Luke,Skywalker
2,Han,Solo
3,Leia,Organa


In [15]:
series.str.contains("Skywalker")

0    False
1     True
2    False
3    False
dtype: bool

In [16]:
series.str.upper().str.split()  # makes everything uppercase, then splits

0    [OBI-WAN, KENOBI]
1    [LUKE, SKYWALKER]
2          [HAN, SOLO]
3       [LEIA, ORGANA]
dtype: object
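
The `.str` accessor can also index into the lists produced by `split`, which gives a short way to pull out individual parts. A small sketch, where `names`, `first_names` and `surname_lengths` are names chosen for this example:

```python
import pandas as pd

names = pd.Series(["Obi-Wan Kenobi", "Luke Skywalker", "Han Solo", "Leia Organa"])

# .str[0] takes the first item of each split list.
first_names = names.str.split().str[0]

# Chain further: take the last item of each list, then measure its length.
surname_lengths = names.str.split().str[-1].str.len()

print(first_names.tolist())
print(surname_lengths.tolist())
```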

## User defined functions

Let's investigate a very simple example: finding the hypotenuse given x and y distances.


In [17]:
data2 = np.random.normal(10, 2, size=(100000, 2))
df2 = pd.DataFrame(data2, columns=["x", "y"])

In [18]:
hypot = (df2.x**2 + df2.y**2)**0.5
print(hypot[0])

13.28330580962249


In [19]:
def hypot1(x, y):
    return np.sqrt(x**2 + y**2)

h1 = []
for index, (x, y) in df2.iterrows():
    h1.append(hypot1(x, y))
print(h1[0])

13.28330580962249


In [20]:
def hypot2(row):
    return np.sqrt(row.x**2 + row.y**2)

h2 = df2.apply(hypot2, axis=1)
print(h2[0])

13.28330580962249


In [21]:
def hypot3(xs, ys):
    return np.sqrt(xs**2 + ys**2)
h3 = hypot3(df2.x, df2.y)
print(h3[0])

13.28330580962249


Vectorising everything you can is the key to speeding up your code. Once you've done that, use profiling tools to find the remaining bottlenecks. PyCharm Professional has a profiler built in, and Jupyter has the %lprun magic from the line_profiler package: https://github.com/rkern/line_profiler
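
To see the difference for yourself, the three approaches can be timed side by side. This is a minimal sketch using `time.perf_counter` on a smaller, hypothetical frame of 10,000 rows. The `timed` helper is made up for this example, and the actual timings will depend on your machine and pandas version.

```python
import time

import numpy as np
import pandas as pd

# Smaller hypothetical data set so the slow variants finish quickly.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.normal(10, 2, size=(10_000, 2)), columns=["x", "y"])

def timed(fn):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

# 1. Python-level loop over rows.
loop_result, loop_time = timed(
    lambda: [np.sqrt(x**2 + y**2) for _, (x, y) in frame.iterrows()]
)

# 2. Row-wise apply: still one Python function call per row.
apply_result, apply_time = timed(
    lambda: frame.apply(lambda row: np.sqrt(row.x**2 + row.y**2), axis=1)
)

# 3. Vectorised: one call on whole columns.
vector_result, vector_time = timed(lambda: np.sqrt(frame.x**2 + frame.y**2))

print(f"iterrows:   {loop_time:.4f} s")
print(f"apply:      {apply_time:.4f} s")
print(f"vectorised: {vector_time:.4f} s")
```

All three produce the same values; only the amount of per-row Python overhead differs.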

### Recap

* apply
* map
* .str & similar