## Optimizing the Python Code for Big Data 
Balancing Coding Complexity against Computational Complexity 
    
    AUTHOR: Dr. Roy Jafari 

# Chapter 4: Taking Advantage of Vectorization and Broadcasting (V&B) 

## Challenge 2: V&B or Boolean Masking?

In this challenge, we will experience the performance difference between V&B and Boolean masking. We will also get to practice what we learned in Challenge 1. So, let’s go! Answer the following questions and complete the following tasks.

1. The following piece of code creates `person_df`, a pandas DataFrame filled with random values of `Height` in centimeters and `Weight` in kilograms.

In [1]:
import pandas as pd
import numpy as np
n_rows = 10**7
person_df = pd.DataFrame(index=range(n_rows), 
                         columns=['Height','Weight'])
person_df.Height = np.random.normal(178,10,n_rows)
person_df.Weight = np.random.normal(83,7,n_rows)
print(person_df)

             Height     Weight
0        172.235367  91.573935
1        186.842833  89.136227
2        167.911872  69.420926
3        169.689645  89.714543
4        162.223785  96.863881
...             ...        ...
9999995  175.528332  81.675227
9999996  182.814207  79.911403
9999997  182.241537  88.957153
9999998  180.586609  82.469872
9999999  178.056192  77.360083

[10000000 rows x 2 columns]


2. We want to add a third column to `person_df`. The third column will be titled `BMI`, short for Body Mass Index, and is calculated from the `Height` and `Weight` of an individual using the following formula. In this formula, weight must be in kilograms and height must be in meters.

$BMI = \frac{Weight}{Height^2}$

The calculation needed to add BMI can be done using any of the four methods that we studied in Challenge 1. Which one should we use? Calculate the BMI and add it as the third column to `person_df`.


**Answer**: From what we learned in Challenge 1, V&B should lead to the best performance.

In [2]:
# Height is stored in centimeters, so divide by 100 to convert to meters.
person_df['BMI'] = (
    person_df.Weight /
    (person_df.Height/100)**2
)
print(person_df)

             Height     Weight        BMI
0        172.235367  91.573935  30.869330
1        186.842833  89.136227  25.532971
2        167.911872  69.420926  24.622242
3        169.689645  89.714543  31.156751
4        162.223785  96.863881  36.807202
...             ...        ...        ...
9999995  175.528332  81.675227  26.509156
9999996  182.814207  79.911403  23.910511
9999997  182.241537  88.957153  26.784660
9999998  180.586609  82.469872  25.288568
9999999  178.056192  77.360083  24.400727

[10000000 rows x 3 columns]
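
As a quick sanity check, the sketch below repeats the BMI calculation on two rows copied from the printed output above and compares the V&B result with an explicit Python loop. The names `sample_df`, `height_m`, and `loop_bmi` are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd

# Two rows copied from the printed output above.
sample_df = pd.DataFrame({
    'Height': [172.235367, 186.842833],  # centimeters
    'Weight': [91.573935, 89.136227],    # kilograms
})

# Broadcasting: the scalar 100 is applied to every element of Height.
height_m = sample_df.Height / 100

# Vectorization: the division runs element by element in a single call.
sample_df['BMI'] = sample_df.Weight / height_m**2

# The same calculation written as an explicit Python loop, for comparison.
loop_bmi = [
    w / (h / 100)**2
    for h, w in zip(sample_df.Height, sample_df.Weight)
]

print(sample_df)
print(np.allclose(sample_df.BMI, loop_bmi))
```

Both approaches give the same numbers; the difference that matters at 10 million rows is how long each takes to get there.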


3. The following code adds a fourth column, `Gender`, to `person_df`. Run it to get this done.

```
person_df['Gender'] = np.random.binomial(1,0.5,n_rows)
print(person_df)
```

In [3]:
person_df['Gender'] = np.random.binomial(1,0.5,n_rows)
print(person_df)

             Height     Weight        BMI  Gender
0        172.235367  91.573935  30.869330       1
1        186.842833  89.136227  25.532971       1
2        167.911872  69.420926  24.622242       1
3        169.689645  89.714543  31.156751       0
4        162.223785  96.863881  36.807202       0
...             ...        ...        ...     ...
9999995  175.528332  81.675227  26.509156       1
9999996  182.814207  79.911403  23.910511       0
9999997  182.241537  88.957153  26.784660       0
9999998  180.586609  82.469872  25.288568       0
9999999  178.056192  77.360083  24.400727       1

[10000000 rows x 4 columns]


4. The fifth column we will add to `person_df` is `Gender_Letter`. In this new column, we will have the character `F` or `M` when the value of `Gender` is `1` or `0`, respectively. This can be done in many different ways, but some of them are much more performant than others. The following four pieces of code do this task. Study them, and reason about which ones should be the fastest and why.

The following code uses the method of mapping a function.
```
person_df['Gender_Letter'] = (
    list(map(
        lambda v: 'M' if v==0 else 'F',
        person_df.Gender))
)
```

The following code uses Boolean Masking.
```
BM = person_df.Gender == 0
person_df['Gender_Letter'] = None
person_df.loc[person_df[BM].index,'Gender_Letter'] ='M'
person_df.loc[person_df[~BM].index,'Gender_Letter'] ='F'
```

The following code uses the `.replace()` function of the pandas Series `person_df.Gender`.
```
person_df['Gender_Letter'] = person_df.Gender.replace({0:'M',1:'F'})
```

The following code uses the `np.where()` function from the NumPy module.
```
person_df['Gender_Letter'] = np.where(person_df.Gender==0,'M','F')
```

**Answer**: 

One or both of the following two methods should be the fastest.
- using the `.replace()` function of the pandas Series `person_df.Gender`.
- using the `np.where()` function from the NumPy module.

The reason these two methods will be more performant is that they outsource the task to the pandas or NumPy module, which can run it in lower-level compiled code and also use V&B.
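
Before timing the four approaches, it can help to confirm on a tiny example that they produce the same result. The sketch below does this on a five-row DataFrame; `small_df` and the result variable names are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd

small_df = pd.DataFrame({'Gender': [1, 0, 0, 1, 1]})

# 1. Mapping a function
mapped = list(map(lambda v: 'M' if v == 0 else 'F', small_df.Gender))

# 2. Boolean masking
BM = small_df.Gender == 0
small_df['Gender_Letter'] = None
small_df.loc[small_df[BM].index, 'Gender_Letter'] = 'M'
small_df.loc[small_df[~BM].index, 'Gender_Letter'] = 'F'
masked = small_df['Gender_Letter'].tolist()

# 3. The .replace() function of the pandas Series
replaced = small_df.Gender.replace({0: 'M', 1: 'F'}).tolist()

# 4. np.where() from NumPy
where_result = np.where(small_df.Gender == 0, 'M', 'F').tolist()

print(mapped == masked == replaced == where_result)
```

Since all four agree on the output, any difference observed when timing them comes from how the work is done, not from what is computed.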


5.	Run the four pieces of code in Step 4 to see if you were right. Report your findings.

To record how long each piece of code takes to run, you may use the `time` module. Even better, if you are using Jupyter Notebook, you may just use its `%%time` magic command. All you need to do is add the magic command as the first line of the code cell. For instance, the following chunk of code shows how Jupyter Notebook timed the first piece of code from Step 4 for me.


In [4]:
%%time
person_df['Gender_Letter'] = (
    list(map(
        lambda v: 'M' if v==0 else 'F',
        person_df.Gender))
)

Wall time: 5.5 s
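
Outside Jupyter, the `time` module can do a similar job. The following is a minimal sketch that uses `time.perf_counter()` around one of the approaches; `demo_df` is a hypothetical, smaller stand-in for `person_df` so that the sketch runs quickly, and the elapsed time will differ from machine to machine.

```python
import time

import numpy as np
import pandas as pd

# A smaller, hypothetical stand-in for person_df.
demo_df = pd.DataFrame({'Gender': np.random.binomial(1, 0.5, 10**5)})

start = time.perf_counter()  # high-resolution clock reading before the work
demo_df['Gender_Letter'] = np.where(demo_df.Gender == 0, 'M', 'F')
elapsed = time.perf_counter() - start  # seconds spent on the line above

print(f"Elapsed: {elapsed:.4f} s")
```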


In [5]:
%%time
BM = person_df.Gender == 0
person_df['Gender_Letter'] = None
person_df.loc[person_df[BM].index,'Gender_Letter'] ='M'
person_df.loc[person_df[~BM].index,'Gender_Letter'] ='F'

Wall time: 4.56 s


In [6]:
%%time
person_df['Gender_Letter'] = person_df.Gender.replace({0:'M',1:'F'})

Wall time: 1.21 s


In [7]:
%%time
person_df['Gender_Letter'] = np.where(person_df.Gender==0,'M','F')

Wall time: 509 ms


**Answer**: As expected, `.replace()` and `np.where()` were the most performant methods. We also learned that `np.where()` was more than twice as fast as `.replace()` (509 ms versus 1.21 s). The reason is probably that pandas does not use the underlying NumPy methods for this function.

6.	We will add a sixth column to `person_df` in this step, and we will call it `Status`. This column will specify if a person is *Healthy*, *Overweight*, or *Underweight* based on their `BMI` and `Gender`. If a female’s `BMI` is lower than 19 they are underweight and if over 24 they are overweight, otherwise they are healthy. On the other hand, if a male’s `BMI` is between 20 and 25, they are healthy, if lower than 20 they are underweight, and if over 25 they are overweight. To save space, we will use `H` for *Healthy*, `O` for *Overweight*, and `U` for *Underweight* in the column `Status`.

You will get to create this new column with the following three different methods: mapping a function, Boolean Masking, and `np.where()`. Before actually implementing these methods, which one do you think will end up performing the fastest? Which one will be the slowest? What are your reasons?

**Answer**: `np.where()` will be the fastest and mapping a function will be the slowest based on what we experienced in step 5.

7. Implement the three methods in Step 6 and time how long they take to complete. Were you right about which one was the fastest and which one was the slowest?

In [8]:
%%time
def specify_status(g,bmi):
    if g=='F':
        if bmi<19:
            return 'U'
        elif bmi>24:
            return 'O'
        else:
            return 'H'
    else:
        if bmi<20:
            return 'U'
        elif bmi>25:
            return 'O'
        else:
            return 'H'

person_df['Status'] = (
    list(
        map(        
            specify_status,
            person_df.Gender_Letter,
            person_df.BMI
        )
    )
)

Wall time: 11 s


In [9]:
%%time
g = person_df.Gender_Letter
bmi = person_df.BMI

# Healthy: both bounds of the gender-specific range must hold, hence & and not |.
BM_H = (( (g=='F') & ((bmi>=19) & (bmi<=24)))
        |
         ((g=='M') & ((bmi>=20) & (bmi<=25)))
       )
BM_O = (((g=='F') & (bmi>24))
        |
        ((g=='M') & (bmi>25))
       )
BM_U = (((g=='F') & (bmi<19))
        |
        ((g=='M') & (bmi<20))
       )

person_df['Status'] = None
person_df.loc[person_df[BM_H].index,'Status'] ='H'
person_df.loc[person_df[BM_O].index,'Status'] ='O'
person_df.loc[person_df[BM_U].index,'Status'] ='U'

Wall time: 17.5 s


In [10]:
%%time
g = person_df.Gender_Letter
bmi = person_df.BMI
person_df['Status'] =(
    np.where(
        g=='F',
        np.where(bmi<19,'U',np.where(bmi<=24,'H','O')),
        np.where(bmi<20,'U',np.where(bmi<=25,'H','O'))
    )
) 

Wall time: 2.6 s


**Answer**: I was right about `np.where()` being the fastest, but mapping a function was not the slowest; Boolean masking was. I was mistaken because I did not pay attention to two things. First, calculating each Boolean mask takes 7 to 11 V&B operations (comparisons and logical combinations). Even though a single V&B operation is fast, many of them add up to a longer run time. Second, updating `person_df` using the Boolean masks also takes some time. The following two chunks of code show this.

In [11]:
%%time
g = person_df.Gender_Letter
bmi = person_df.BMI

# Healthy: both bounds of the gender-specific range must hold, hence & and not |.
BM_H = (( (g=='F') & ((bmi>=19) & (bmi<=24)))
        |
         ((g=='M') & ((bmi>=20) & (bmi<=25)))
       )
BM_O = (((g=='F') & (bmi>24))
        |
        ((g=='M') & (bmi>25))
       )
BM_U = (((g=='F') & (bmi<19))
        |
        ((g=='M') & (bmi<20))
       )

Wall time: 11 s


In [12]:
%%time
person_df['Status'] = None
person_df.loc[person_df[BM_H].index,'Status'] ='H'
person_df.loc[person_df[BM_O].index,'Status'] ='O'
person_df.loc[person_df[BM_U].index,'Status'] ='U'

Wall time: 6.23 s
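
One possible way to trim both costs is sketched below on a six-row toy DataFrame. It is an illustration under two assumptions rather than a measured result: that `Gender_Letter` only ever holds `F` or `M`, so the healthy mask can be taken as the complement of the other two, and that passing a mask directly to `.loc` is acceptable in place of `person_df[mask].index`. The name `toy_df` is hypothetical, and whether this is meaningfully faster on 10 million rows is worth timing yourself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one underweight, healthy, and overweight row per gender.
toy_df = pd.DataFrame({
    'Gender_Letter': ['F', 'F', 'F', 'M', 'M', 'M'],
    'BMI': [18.0, 22.0, 30.0, 19.5, 22.0, 26.0],
})
g = toy_df.Gender_Letter
bmi = toy_df.BMI

BM_O = ((g == 'F') & (bmi > 24)) | ((g == 'M') & (bmi > 25))
BM_U = ((g == 'F') & (bmi < 19)) | ((g == 'M') & (bmi < 20))
# Assumption: every row is 'F' or 'M', so healthy is whatever is left over.
BM_H = ~(BM_O | BM_U)

toy_df['Status'] = None
# The mask is passed to .loc directly, without building a filtered DataFrame first.
toy_df.loc[BM_H, 'Status'] = 'H'
toy_df.loc[BM_O, 'Status'] = 'O'
toy_df.loc[BM_U, 'Status'] = 'U'

print(toy_df)
```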


8. Among the methods we used in this challenge, which one should be regarded as V&B, and why?

**Answer**: `np.where()` is the fastest and is the one that outsources the task to the NumPy module; therefore, `np.where()` is V&B.
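
To make the two ingredients visible, the small sketch below separates them: the comparison is vectorized over the whole array, and the scalars `'M'` and `'F'` are broadcast to the shape of the condition. The array `gender` and the variable `explicit` are hypothetical names used only for illustration.

```python
import numpy as np

gender = np.array([1, 0, 0, 1])

# Vectorization: one call compares every element; the scalar 0 is broadcast.
condition = gender == 0

# Broadcasting: the scalars 'M' and 'F' are stretched to the shape of condition.
letters = np.where(condition, 'M', 'F')

# The same call with the broadcasting written out by hand.
explicit = np.where(
    condition,
    np.full(gender.shape, 'M'),
    np.full(gender.shape, 'F'),
)

print(letters)
print((letters == explicit).all())
```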

9. From your experience in this challenge, if you were to come up with a guideline for your future self to help in deciding between V&B and Boolean masking, what would that guideline be?

**Answer**: If you can find a NumPy function that will help you get the task done, it is best to use V&B; otherwise, you will have no choice but to use approaches such as Boolean masking or mapping a function.
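
As one illustration of this guideline, NumPy also offers `np.select()`, which this challenge did not use. The sketch below is an assumption-labeled alternative for the `Status` column, not one of the three methods from Step 6: it first picks the gender-specific bounds with `np.where()` and then chooses among `U`, `O`, and `H`. The names `toy_df`, `is_f`, `lower`, and `upper` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one underweight, healthy, and overweight row per gender.
toy_df = pd.DataFrame({
    'Gender_Letter': ['F', 'F', 'F', 'M', 'M', 'M'],
    'BMI': [18.0, 22.0, 30.0, 19.5, 22.0, 26.0],
})
is_f = (toy_df.Gender_Letter == 'F').to_numpy()
bmi = toy_df.BMI.to_numpy()

# Gender-specific bounds of the healthy range from Step 6.
lower = np.where(is_f, 19, 20)
upper = np.where(is_f, 24, 25)

# The first condition that is True wins; rows matching neither get the default.
toy_df['Status'] = np.select(
    [bmi < lower, bmi > upper],
    ['U', 'O'],
    default='H',
)

print(toy_df)
```

Whether this beats the nested `np.where()` version is not something the timings in this challenge cover, so it is worth timing on your own data before adopting it.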