# map filter reduce workshop

### Recipe for all!

1. a **function** object (important! We want an object, as in `my_function`, not an executed function, as in `my_function()` )
2. an **iterable** (what is that? Something you can iterate over: `'hello world'`, `range(10)`, `[1,2,3,4,5]`, etc.)
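To make the two ingredients concrete, here is a minimal sketch. The function `shout` and all variable names are made up for illustration only.

```python
from collections.abc import Iterable

def shout(text):
    """Hypothetical example function: upper-case a string and add '!'."""
    return text.upper() + "!"

# The function object itself: this is what map, filter and reduce expect.
function_object = shout
# The result of calling the function: a plain string, not a function.
call_result = shout("hello")

is_function = callable(function_object)
is_result_callable = callable(call_result)

# Strings, ranges and lists are iterables; a single int is not.
candidates = ("hello world", range(10), [1, 2, 3, 4, 5], 42)
iterable_checks = [isinstance(obj, Iterable) for obj in candidates]
```

Passing `shout("hello")` where a function is expected would hand over the string `'HELLO!'` instead of something that can be called.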

# 1. map

An N-to-N function: we pass an iterable of length N into `map` and get back an iterable of the same length.

In [1]:
map(str, [1,2,3,4,5])

<map at 0x7f86244c3400>

Whoopsie! That gives us a map object. Disappointing at first, but actually quite useful: Python stores only the instructions that still have to be run, like a construction plan for a house. The plan describes the house, but nothing gets built until I hand the plan to the construction workers. This is handy when handling large data objects (e.g. DataFrames), because the map object stores only "what we want to do with the data", not the actual result.
* `str` : my instructions what to do with the material
* `[1,2,3,4,5]`: the list of the actual building material
* `map(str, [1,2,3,4,5])`: the whole construction plan

So let's build it, with the `list()` command

In [13]:
list(map(str, [1,2,3,4,5,6,7]))

['1', '2', '3', '4', '5', '6', '7']

So we get a list of the same length in return, where the function `str()` has been applied to each element.
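The construction-plan idea can be observed directly: a map object produces its elements one at a time, and only when asked. A small sketch (variable names are just for illustration):

```python
numbers = [1, 2, 3, 4, 5]
plan = map(str, numbers)   # nothing has been converted yet

first = next(plan)         # builds only the first element
rest = list(plan)          # builds the remaining elements
leftover = list(plan)      # the plan is used up now, so nothing is left
```

Note that a map object can be consumed only once. If the result is needed several times, store it with `list(...)` first.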

In [30]:
### with our own function:
def square_it(x):
    return x**2

In [31]:
list(map(square_it, [1,2,3,4,5,6,7]))

[1, 4, 9, 16, 25, 36, 49]

In [32]:
### with multiple arguments!
def judge_groceries(x,y):
    return f"What's this? A {x}, {y}!"

In [129]:
list(map(judge_groceries, ['apple', 'kiwi', 'cherry', 'sprouts'], ['yummy', 'meh', 'yummy', 'yummy']))

["What's this? A apple, yummy!",
 "What's this? A kiwi, meh!",
 "What's this? A cherry, yummy!",
 "What's this? A sprouts, yummy!"]

But! In Python 3, list comprehensions are usually recommended over `map`, mainly for readability:

In [131]:
# the square it as a list comprehension:
[square_it(i) for i in [1,2,3,4,5,6,7]]

[1, 4, 9, 16, 25, 36, 49]
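The two-argument case translates to a comprehension as well, using `zip` to pair up the elements. The sketch below reuses the functions from above with shortened example lists. It also shows a generator expression, which is the lazy counterpart of a map object.

```python
def square_it(x):
    return x**2

def judge_groceries(x, y):
    return f"What's this? A {x}, {y}!"

numbers = [1, 2, 3, 4, 5, 6, 7]
items = ['apple', 'kiwi']
verdicts = ['yummy', 'meh']

# two-argument map written as a comprehension: zip pairs the elements up
judged = [judge_groceries(x, y) for x, y in zip(items, verdicts)]
judged_with_map = list(map(judge_groceries, items, verdicts))

# round brackets give a generator expression, which is lazy like a map object
lazy_squares = (square_it(i) for i in numbers)
squares = list(lazy_squares)
```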

In [134]:
# Which one is faster?
%timeit [square_it(i) for i in [1,2,3,4,5,6,7]]

1.55 µs ± 22.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [135]:
%timeit list(map(square_it, [1,2,3,4,5,6,7]))

1.49 µs ± 14.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


Basically hardly any difference in execution time!

# 2. filter

Similar to map, but now it's a **filter**: an `N` to `M` relationship, where $M \leq N$. We put in an iterable of length `N` and get back an iterable of smaller or equal length, depending on a **condition** function that we have to define.

In [22]:
filter(lambda x: x < 0, [-1, 4, -3, -4, 5, 10, 17])

<filter at 0x7f86244c35e0>

In [23]:
list(filter(lambda x: x < 0, [-1, 4, -3, -4, 5, 10, 17]))

[-1, -3, -4]

In [29]:
list(filter(lambda x: x in ['car', 'sky', 'hammer'], ['house', 'car', 'apple', 'hammer', 'bread']))

['car', 'hammer']
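As with map, a filter can be written as a list comprehension with an `if` clause. The sketch below compares both forms and also shows the special case of passing `None` instead of a function, which keeps only the truthy elements.

```python
numbers = [-1, 4, -3, -4, 5, 10, 17]

negatives_filter = list(filter(lambda x: x < 0, numbers))
negatives_comprehension = [x for x in numbers if x < 0]

# With None instead of a function, filter keeps the truthy elements
truthy = list(filter(None, [0, 1, '', 'a', None, [], [0]]))
```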

# Further reading

* https://book.pythontips.com/en/latest/map_filter.html

# 3. reduce

Needs to be imported:

In [36]:
from functools import reduce

Unlike `map` and `filter`, it doesn't return a "reduce object" but the result straight away:

In [39]:
reduce((lambda x, y: x * y), [1, 2, 3, 4])

24

So what happens here? `reduce` goes over the iterable one element at a time using your function, applying these steps:
1. apply your function to the first two elements of your iterable
2. use the result of **1.** as the first argument of the next call to your function, and the next element in your iterable as the second argument.
3. use the result of **2.** as the first argument of the next call to your function, and the next element in your iterable as the second argument.
4. etc...

Once all elements have been iterated over, the result is returned.
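The steps above can be sketched as a plain loop. `reduce_by_hand` is a hypothetical helper written for this illustration, not part of `functools`, and it ignores the empty-iterable case. `itertools.accumulate` is used to show the intermediate results that `reduce` normally hides.

```python
from functools import reduce
from itertools import accumulate

def reduce_by_hand(func, iterable):
    """Hypothetical helper mirroring the steps above (no start value)."""
    iterator = iter(iterable)
    result = next(iterator)            # start with the first element
    for element in iterator:
        result = func(result, element) # old result first, next element second
    return result

multiply = lambda x, y: x * y

by_hand = reduce_by_hand(multiply, [1, 2, 3, 4])     # ((1*2)*3)*4
built_in = reduce(multiply, [1, 2, 3, 4])
intermediate = list(accumulate([1, 2, 3, 4], multiply))
```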

In [48]:
reduce((lambda x,y: x*y), [1, 2, 3, 4])

24

In [54]:
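# careful: the misplaced closing parenthesis packs the lambda and the list
# into one tuple, so reduce receives a single argument and fails
reduce((lambda x,y: x + y, [1, 2, 3, 4]))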

TypeError: reduce expected at least 2 arguments, got 1

In [55]:
reduce(lambda a,d: 10*a+d, [1,2,3,4,5,6,7,8], 0)

12345678

In [61]:
reduce(lambda e,f: str(e)+str(f), [1, 2, 3, 4], 'world')

'world1234'
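The last two cells pass a third argument to `reduce`: a start value that is used as the first argument of the very first call. A minimal sketch of its effect, including what happens with an empty iterable (variable names are for illustration only):

```python
from functools import reduce

add = lambda x, y: x + y

with_start = reduce(add, [1, 2, 3, 4], 100)   # (((100+1)+2)+3)+4
empty_with_start = reduce(add, [], 0)         # the start value is returned as-is

try:
    reduce(add, [])                           # no elements and no start value
    empty_error = None
except TypeError as error:
    empty_error = type(error).__name__
```

Supplying a start value is therefore a simple way to make a `reduce` call safe for inputs that may be empty.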

### what is it good for in data analysis?

In [76]:
# Let's say we have this data here:
s = "The-QUICK-Brown-fox-JUMPS-Over-the-Lazy-Dog"

# And let's say we have these data cleaning functions:
color = lambda x: x.replace('brown', 'blue')
speed = lambda x: x.replace('quick', 'slow')
dashes = lambda x: x.replace('-', ' ')
work = lambda x: x.replace('lazy', 'industrious')

# and in addition, some built in string formatting functions:

# we can gather them all in a list:
fs = [str.lower,
      color,
      speed,
      dashes,
      work,
      str.title,
     ]

# and define a helper function
def call(s, func):
    return func(s)


reduce(call, # function which gets a string to be cleaned and a function from the function collection as input
       fs, # our function collection
       s   # starter string!
      )

'The Slow Blue Fox Jumps Over The Industrious Dog'
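To see what `reduce` does with the function collection, the same pipeline can be written as a loop that records each intermediate string. This is only a sketch for inspection; the lambdas are inlined copies of the cleaning functions above.

```python
from functools import reduce

s = "The-QUICK-Brown-fox-JUMPS-Over-the-Lazy-Dog"
fs = [
    str.lower,
    lambda x: x.replace('brown', 'blue'),
    lambda x: x.replace('quick', 'slow'),
    lambda x: x.replace('-', ' '),
    lambda x: x.replace('lazy', 'industrious'),
    str.title,
]

steps = []          # the string after each cleaning function
cleaned = s
for func in fs:
    cleaned = func(cleaned)
    steps.append(cleaned)

# the loop and reduce arrive at the same final string
same_as_reduce = cleaned == reduce(lambda text, func: func(text), fs, s)
```

The order of the functions matters here: the replacements look for lowercase words, so `str.lower` has to run before them.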

# 4. Apply

Here we enter the pandas realm: `apply` is a method of pandas Series and DataFrames, not a Python built-in.

In [108]:
# create an example dataframe with random strings in it
import random
import pandas as pd

str_lst = ["The","QUICK","Brown","fox","JUMPS","Over","the","Lazy","-","Dog"]

df = pd.DataFrame({'A':[random.choice(str_lst) for i in range(10)],
                  'B':[random.choice(str_lst) for i in range(10)],
                  'C':[random.choice(str_lst) for i in range(10)],
                  'D':[i for i in range(10)]})

In [109]:
df.head()

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | the | JUMPS | The | 0 |
| 1 | The | the | JUMPS | 1 |
| 2 | Dog | fox | Dog | 2 |
| 3 | Dog | QUICK | - | 3 |
| 4 | Dog | Over | Lazy | 4 |


In [110]:
df['D'].apply(lambda x: x**2)

0     0
1     1
2     4
3     9
4    16
5    25
6    36
7    49
8    64
9    81
Name: D, dtype: int64

In [111]:
# make it permanent:
df['D'] = df['D'].apply(lambda x: x**2)

You can also apply a function to the whole dataframe with `df.apply(...)`, but then make sure the function works for the datatypes of all your df's columns.
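A small sketch of what this looks like, assuming pandas is installed and using a tiny made-up dataframe. `DataFrame.apply` hands each column to the function as a whole Series, so the function has to cope with every column it receives.

```python
import pandas as pd

# tiny made-up dataframe with one text column and one integer column
small_df = pd.DataFrame({'A': ['The', 'QUICK', 'fox'], 'D': [0, 1, 2]})

# each column arrives in the function as a pandas Series
column_types = df_types = small_df.apply(lambda column: type(column).__name__)

# str works on every datatype, so it is safe to use on all columns
as_text = small_df.apply(lambda column: column.map(str))
```

A function like `lambda x: x**2` would fail on the text column here, which is why the single-column form `df['D'].apply(...)` is often the safer choice.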

# 5. applying it to 4.03 activity 2

In [119]:
import numpy as np

In [113]:
ls

'map filter reduce workshop.ipynb'   unit4_healthcare_for_all.csv


In [122]:
# We have here our healthcare data
df = pd.read_csv('unit4_healthcare_for_all.csv')

In [123]:
df.head()

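|   | STATE | PVASTATE | DOB | MDMAUD | RECP3 | GENDER | DOMAIN | INCOME | HOMEOWNR | HV1 | ... | VETERANS | NUMPROM | CARDPROM | CARDPM12 | NUMPRM12 | MAXADATE | RFA_2 | NGIFTALL | TIMELAG | AVGGIFT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | IL |  | 3712 | XXXX |  | F | T2 |  |  | 479 | ... |  | 74 | 27 | 6 | 14 | 9702 | L4E | 31 | 4.0 | 7.741935 |
| 1 | CA |  | 5202 | XXXX |  | M | S1 | 6.0 | H | 5468 | ... |  | 32 | 12 | 6 | 13 | 9702 | L2G | 3 | 18.0 | 15.666667 |
| 2 | NC |  | 0 | XXXX |  | M | R2 | 3.0 | U | 497 | ... |  | 63 | 26 | 6 | 14 | 9702 | L4E | 27 | 12.0 | 7.481481 |
| 3 | CA |  | 2801 | XXXX |  | F | R2 | 1.0 | U | 1000 | ... |  | 66 | 27 | 6 | 14 | 9702 | L4E | 16 | 9.0 | 6.8125 |
| 4 | FL |  | 2001 | XXXX | X | F | S2 | 3.0 | H | 576 | ... |  | 113 | 43 | 10 | 25 | 9702 | L2F | 37 | 14.0 | 6.864865 |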


In [126]:
df['DOMAIN'].unique()

array(['T2', 'S1', 'R2', 'S2', 'T1', 'R3', 'U1', 'C2', 'C1', 'U3', ' ',
       'R1', 'U2', 'C3', 'U4', 'S3', 'T3'], dtype=object)

We want to rewrite the whole column so that every domain code is replaced by the category its first letter stands for, e.g. "U1" or "U4" should turn into "Urban".

In [120]:
# define our dictionary according to which we should "translate"
domain_categories = {"U" : "Urban",
                     "C" : "City",
                     "S" : "Suburban",
                     "T" : "Town",
                     "R" : "Rural"}
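def clean_domain(x):
    # the first character of the code decides the category
    if x[0] in domain_categories:
        return domain_categories[x[0]]
    else:
        # np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0
        return np.nan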
    
df['DOMAIN'] = list(map(clean_domain, df["DOMAIN"]))
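A quick way to check the cleaning logic without the CSV file is to run it on a handful of the codes seen above. This is a standalone sketch: it uses `math.nan` in place of `np.nan` so that it runs without NumPy, and `dict.get` as a compact variant of the if/else lookup.

```python
import math

domain_categories = {"U": "Urban",
                     "C": "City",
                     "S": "Suburban",
                     "T": "Town",
                     "R": "Rural"}

def clean_domain(x):
    # look up the first character; anything unknown (like ' ') becomes NaN
    return domain_categories.get(x[0], math.nan)

sample_codes = ['T2', 'S1', 'R2', 'U4', ' ', 'C3']
cleaned = list(map(clean_domain, sample_codes))
```

One caveat of indexing with `x[0]`: an empty string would raise an `IndexError`, and a missing value stored as a float NaN would raise a `TypeError`, so such values would need to be handled before mapping.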
