# Modifying Series and DataFrames
In the previous notebooks we learned how to assign new values to rows and columns<br>
by first indexing our dataframe with either `loc[]` *(explicit index)*, `iloc[]` *(implicit index (0,1,2,...))* or just `[]` *(usually not preferred)*<br>
and then assigning new values, which can either be completely new or based on the existing data.

In [1]:
import pandas as pd
import numpy as np

#### toy dataframe

In [2]:
df = pd.DataFrame([0,1,2,3,4], columns=["a"])
df

Unnamed: 0,a
0,0
1,1
2,2
3,3
4,4


So for example we could add a new column that only contains the string `"HI"`.

In [3]:
df["b"] = "HI"
df

Unnamed: 0,a,b
0,0,HI
1,1,HI
2,2,HI
3,3,HI
4,4,HI


Or we could add a new column based on column `a` by multiplying it by `4`.

In [4]:
df["c"] = df["a"]*4
df

Unnamed: 0,a,b,c
0,0,HI,0
1,1,HI,4
2,2,HI,8
3,3,HI,12
4,4,HI,16
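As a reminder of the indexing-based assignment mentioned at the top, here is a minimal sketch that overwrites only some rows via `loc[]` and a boolean mask. The dataframe `toy`, the column `size` and the labels `"LOW"`/`"HIGH"` are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame([0, 1, 2, 3, 4], columns=["a"])

# start with a constant column, then overwrite only the rows where a >= 3
toy["size"] = "LOW"
toy.loc[toy["a"] >= 3, "size"] = "HIGH"
```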


In this notebook we will go through some more elaborate ways of modifying our Series and DataFrames.

## `apply`, `map`, `applymap` 
These are convenience functions that allow us to **apply** any function to **series** and **dataframes**.<br>
While they are **convenient**, they are [**not efficient**](https://stackoverflow.com/questions/54432583/when-should-i-not-want-to-use-pandas-apply-in-my-code) because the function is applied element-wise without making use of vectorization.<br>
We will briefly cover them nonetheless for cases where it may be too complicated to vectorize a given function or where we don't care about run time.<br>
At the end we will talk about how we can vectorize the functions that we want to apply to our series and dataframes.


<img src="https://i.stack.imgur.com/IZys3.png" alt="overview" style="width:700px"> 

### `apply`
`apply` takes a `function` and an `axis` as arguments.<br>
The `axis` determines whether the `function` is applied to each column (`axis=0`, the default) or to each row (`axis=1`).

In [5]:
df

Unnamed: 0,a,b,c
0,0,HI,0
1,1,HI,4
2,2,HI,8
3,3,HI,12
4,4,HI,16


In [6]:
df.apply(np.sum, axis=0)

a            10
b    HIHIHIHIHI
c            40
dtype: object

In [7]:
try:
    df.apply(np.sum, axis=1)
except TypeError:
    print(TypeError)

<class 'TypeError'>
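The row-wise sum fails because each row mixes integers with the string `"HI"`, and Python cannot add the two. One possible way around this, sketched here on a small stand-in dataframe, is to restrict the operation to the numeric columns first. The names `numeric`, `row_sums` and `col_sums` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2], "b": ["HI", "HI", "HI"], "c": [0, 4, 8]})

# keep only the numeric columns so that every row contains summable values
numeric = df.select_dtypes(include="number")

row_sums = numeric.apply(np.sum, axis=1)  # one value per row
col_sums = numeric.apply(np.sum, axis=0)  # one value per column
```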


If we want to combine integers and strings we can write a custom function.<br>
In this custom function we define what should happen to each row `x`.


In [8]:
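def add_all(x):
    # x is a single row (a Series); .iloc makes the positional access explicit,
    # since plain x[0] on a labelled Series is deprecated in recent pandas versions
    return f"{x.iloc[0]}{x.iloc[1]}{x.iloc[2]}"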

In [9]:
df.apply(add_all, axis=1)

0     0HI0
1     1HI4
2     2HI8
3    3HI12
4    4HI16
dtype: object

If we want to save the result we can assign it to a new column.

In [10]:
df["abc"] = df.apply(add_all, axis=1)
df

Unnamed: 0,a,b,c,abc
0,0,HI,0,0HI0
1,1,HI,4,1HI4
2,2,HI,8,2HI8
3,3,HI,12,3HI12
4,4,HI,16,4HI16


<br>

We can also combine `apply` with `lambda` functions.

In [11]:
df.apply(lambda x: f"{x.iloc[0]}{x.iloc[1]}{x.iloc[2]}", axis=1)

0     0HI0
1     1HI4
2     2HI8
3    3HI12
4    4HI16
dtype: object
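For this particular case the row-wise `apply` is not strictly necessary. As a rough sketch on a small stand-in dataframe, the same strings can be built column-wise by converting the numeric columns to strings and concatenating whole columns. The variable name `abc` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2], "b": ["HI", "HI", "HI"], "c": [0, 4, 8]})

# convert the integer columns to strings, then concatenate entire columns at once
abc = df["a"].astype(str) + df["b"] + df["c"].astype(str)
```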

<br>

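### `applymap`
`applymap` applies a function to every single element of a dataframe.<br>
In pandas 2.1 and later, `DataFrame.applymap` is deprecated in favor of `DataFrame.map`, which behaves the same way.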

In [12]:
df

Unnamed: 0,a,b,c,abc
0,0,HI,0,0HI0
1,1,HI,4,1HI4
2,2,HI,8,2HI8
3,3,HI,12,3HI12
4,4,HI,16,4HI16


In [13]:
df.applymap(str).applymap(lambda x: x*2)

Unnamed: 0,a,b,c,abc
0,0,HIHI,0,0HI00HI0
1,11,HIHI,44,1HI41HI4
2,22,HIHI,88,2HI82HI8
3,33,HIHI,1212,3HI123HI12
4,44,HIHI,1616,4HI164HI16
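Because the element-wise method is called `DataFrame.map` in newer pandas versions and `DataFrame.applymap` in older ones, code that should run on both could pick the method at run time. The following is only a sketch of that idea on a small stand-in dataframe. The helper name `elementwise` is made up.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1], "b": ["HI", "HI"]})

# use DataFrame.map where it exists (pandas >= 2.1), otherwise fall back to applymap
elementwise = df.map if hasattr(pd.DataFrame, "map") else df.applymap

# convert every element to a string and repeat it twice
doubled = elementwise(lambda x: str(x) * 2)
```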


<br>

### `map` *and `replace`* 
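`map` and `replace` can both be used to substitute values of a series.<br>
They differ in how they treat values that are missing from the mapping: `map` turns them into `NaN`, while `replace` leaves them unchanged.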

In [14]:
df

Unnamed: 0,a,b,c,abc
0,0,HI,0,0HI0
1,1,HI,4,1HI4
2,2,HI,8,2HI8
3,3,HI,12,3HI12
4,4,HI,16,4HI16


In [15]:
df["a"].map({0: "zero", 1: "one", 2: "two"})

0    zero
1     one
2     two
3     NaN
4     NaN
Name: a, dtype: object

In [16]:
df["a"].replace({0: "zero", 1: "one", 2: "two"})

0    zero
1     one
2     two
3       3
4       4
Name: a, dtype: object
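`map` also accepts a function instead of a dictionary. If the `NaN` values for unmapped entries are not wanted, one option is to wrap the dictionary lookup in a function that supplies a fallback. This is a small sketch, and the names `names` and `labels` as well as the fallback `"other"` are chosen only for illustration.

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4], name="a")
names = {0: "zero", 1: "one", 2: "two"}

# dict.get returns the fallback "other" for keys that are not in the mapping
labels = s.map(lambda v: names.get(v, "other"))
```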

<br>
<br>

## Vectorization
**Don't do loops, do vectorization** <br>


### Numeric Data

In [17]:
np.random.seed(42)

random_values = np.random.random((1_000_000, 2))

df = pd.DataFrame(random_values, columns=['a', 'b'])

In [18]:
df.head()

Unnamed: 0,a,b
0,0.37454,0.950714
1,0.731994,0.598658
2,0.156019,0.155995
3,0.058084,0.866176
4,0.601115,0.708073


No vectorization<br>
`apply` applies the function to every row one by one

In [19]:
%%timeit
df.apply(lambda x: x["a"]+x["b"], axis=1)

4.68 s ± 13.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Vectorization

In [20]:
%%timeit
df["a"] + df["b"]

613 µs ± 147 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


Vectorization without the pandas series overhead

In [21]:
%%timeit
df["a"].values + df["b"].values

555 µs ± 168 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
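The pandas documentation recommends `Series.to_numpy()` over `.values` for getting at the underlying NumPy array. The sketch below uses a much smaller random dataframe than the timing cells above, and the column name `total` is made up. It shows the same addition and how the resulting array can be stored as a new column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((1_000, 2)), columns=["a", "b"])

# add the raw NumPy arrays, skipping index alignment and other Series overhead
raw_sum = df["a"].to_numpy() + df["b"].to_numpy()

# a plain array of matching length can be assigned directly as a new column
df["total"] = raw_sum
```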


<br>

### Conditionals *(if else)*
Here we have a dataframe with temperature measurements in *Celsius* and *Fahrenheit*.<br>
Based on the *region* we want to decide which of the two to use.

In [31]:
region = np.random.choice(["USA", "EU"], size=100_000)
temp = np.random.randint(low=-10, high=40, size=100_000)

df =  pd.DataFrame({"region": region, "celsius": temp, "fahrenheit": (1.8 * temp) + 32})
df.head()

Unnamed: 0,region,celsius,fahrenheit
0,EU,17,62.6
1,USA,29,84.2
2,USA,9,48.2
3,EU,-4,24.8
4,USA,-9,15.8


No vectorization

In [32]:
def regional_temperature(x):
    if x["region"] == "EU":
        return x["celsius"]
    else:
        return x["fahrenheit"]

In [33]:
%%timeit
df.apply(regional_temperature, axis=1)

468 ms ± 1.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Vectorization with `np.where`

In [34]:
%%timeit
np.where(df["region"]=="EU", df["celsius"], df["fahrenheit"])

3.36 ms ± 5.99 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
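`np.where(condition, x, y)` picks the value from `x` where the condition is true and from `y` otherwise, and returns a plain NumPy array. A tiny sketch with hand-picked values makes the result easy to check and shows how to store it as a column. The column name `regional` is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "region": ["EU", "USA", "EU"],
    "celsius": [10, 20, 30],
    "fahrenheit": [50.0, 68.0, 86.0],
})

# Celsius for EU rows, Fahrenheit for all other rows
df["regional"] = np.where(df["region"] == "EU", df["celsius"], df["fahrenheit"])
```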


### elif
What if we have more than one condition?<br>
Let's say we want to match the numbers 0-4 to different colors.

In [44]:
values = np.random.randint(low=0, high=5, size=100_000)
df = pd.DataFrame(values, columns=["value"])
df.head()

Unnamed: 0,value
0,3
1,2
2,1
3,3
4,4


No vectorization

In [36]:
def num_to_color(x):
    if x["value"] == 0:
        return "red"
    elif x["value"] == 1:
        return "green"
    elif x["value"] == 2:
        return "blue"
    elif x["value"] == 3:
        return "yellow"
    else:
        return "purple"
        

In [37]:
%%timeit
df.apply(num_to_color, axis=1)

600 ms ± 4.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Vectorization with `np.select`

In [46]:
%%timeit

conditions = [
    df["value"] == 0,
    df["value"] == 1,
    df["value"] == 2,
    df["value"] == 3,
    df["value"] == 4,
]

choices = [
    "red",
    "green",
    "blue",
    "yellow",
    "purple"
]

np.select(conditions, choices)

33.1 ms ± 317 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


We could also do this with `map`, which is considerably faster here, but `np.select` is more flexible: its conditions can, for example, also depend on other columns.

In [45]:
%%timeit
df["value"].map({0: "red", 1: "green", 2: "blue", 3: "yellow", 4: "purple"})

882 µs ± 3.84 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
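To illustrate that flexibility, here is a small sketch in which one condition combines two columns. The column `flag` and the fallback `"unknown"` are made up for this example. `np.select` uses the first condition that matches and falls back to `default` when none does. Since the implicit `default` is `0`, which (depending on the NumPy version) may not combine with string choices, the sketch passes a string default explicitly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": [0, 1, 2, 7], "flag": [True, False, True, False]})

conditions = [
    (df["value"] == 0) & df["flag"],  # condition based on two columns
    df["value"] == 1,
    df["value"] == 2,
]
choices = ["red", "green", "blue"]

# rows that match no condition receive the explicit string default
df["color"] = np.select(conditions, choices, default="unknown")
```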


<br>

### Logical operators
Given a `dataframe` with two columns that have values of `0` or `1`, we want to *apply* the logical **and**.

In [39]:
values = np.random.randint(low=0, high=2, size=200_000).reshape(-1, 2)
df =  pd.DataFrame(values, columns=["a", "b"])
df.head()

Unnamed: 0,a,b
0,0,0
1,1,1
2,1,1
3,0,1
4,1,0


No vectorization

In [40]:
%%timeit
df.apply(lambda x: x["a"] and x["b"], axis=1)

408 ms ± 3.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Vectorization with a numpy function

In [41]:
%%timeit
np.logical_and(df["a"], df["b"])

262 µs ± 30.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


Vectorization with a numpy function and without the pandas overhead

In [None]:
%%timeit
np.logical_and(df["a"].values, df["b"].values)

59.7 µs ± 49.2 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
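Besides `np.logical_and`, pandas Series support the element-wise operators `&` (and), `|` (or) and `~` (not) on boolean data. The sketch below uses a small hand-written dataframe. The names `both` and `either` are only illustrative. The parentheses around each comparison are needed because `&` and `|` bind more tightly than `==`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 1, 0], "b": [0, 1, 0, 1]})

# turn the 0/1 columns into boolean masks and combine them element-wise
both = (df["a"] == 1) & (df["b"] == 1)
either = (df["a"] == 1) | (df["b"] == 1)
```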


<br>

### More on vectorization
Vectorizing can be hard. Here are some more resources that might help you out:<br>
[1000x faster data manipulation: vectorizing with Pandas and Numpy](https://www.youtube.com/watch?v=nxWginnBklU&list=PLO4WA4pVxaiVmEU8WYlZj4CbUBGNOvD9G)<br>
[Sofia Heisler: No More Sad Pandas: Optimizing Pandas Code for Speed and Efficiency (PyCon 2017)](https://www.youtube.com/watch?v=HN5d490_KKk&list=PLO4WA4pVxaiVmEU8WYlZj4CbUBGNOvD9G)<br>
[When should I (not) want to use pandas apply() in my code?](https://stackoverflow.com/questions/54432583/when-should-i-not-want-to-use-pandas-apply-in-my-code)<br>
[Are for-loops in pandas really bad? When should I care?](https://stackoverflow.com/questions/54028199/are-for-loops-in-pandas-really-bad-when-should-i-care)