## Optimizing the Python Code for Big Data 
Balancing Coding Complexity against Computational Complexity 
    
    AUTHOR: Dr. Roy Jafari 

# Chapter 4: Taking Advantage of Vectorization and Broadcasting (V&B) 

## Challenge 2: V&B or Boolean Masking?

In this challenge, we will experience the performance difference between V&B and Boolean masking. We will also get to practice what we learned in Challenge 1. So, let's go! Answer the following questions and complete the following tasks.

1.	The following piece of code creates `person_df`, a pandas DataFrame filled with random values of `Height` in centimeters and `Weight` in kilograms.

```
import pandas as pd
import numpy as np
n_rows = 10**7
person_df = pd.DataFrame(index=range(n_rows), 
                         columns=['Height','Weight'])
person_df.Height = np.random.normal(178,10,n_rows)
person_df.Weight = np.random.normal(83,7,n_rows)
print(person_df)
```

2. We want to add a third column to `person_df`. The third column will be titled `BMI`, short for Body Mass Index, and is calculated from the `Height` and `Weight` of an individual using the following formula. In this formula, weight must be in kilograms, and height must be in meters.

$\frac{Weight}{Height^2}$
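As a quick sanity check of the formula and its units, the following minimal sketch computes the BMI of a single hypothetical person whose height and weight equal the means used in Step 1. The variable names are illustrative only, and the sketch deliberately works on scalars rather than on `person_df`.

```python
# Means of the random Height and Weight columns from Step 1
height_cm = 178.0
weight_kg = 83.0

# The formula expects meters, so convert from centimeters first
height_m = height_cm / 100

bmi = weight_kg / height_m**2
print(round(bmi, 1))  # prints 26.2
```

Skipping the unit conversion would give a value 10,000 times too small, so it is worth checking a value or two by hand before trusting a whole column.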

The calculation needed to add BMI can be done using any of the four methods that we studied in Challenge 1. Which one should we use? Calculate the BMI and add it as the third column to `person_df`.


**Answer**: 

3.	The following code adds a fourth column to `person_df` and that is `Gender`. Run the following code to get this done. 

```
person_df['Gender'] = np.random.binomial(1,0.5,n_rows)
print(person_df)
```

**Answer**: 

4.	The fifth column we will add to `person_df` is `Gender_Letter`. In this new column, we will have the character `F` or `M` when the value of `Gender` is `1` or `0`, respectively. This can be done in many different ways, but some of them are much more performant than others. The following four pieces of code do this task. Study them, reason about which ones should be the fastest, and explain why.

The following code uses the method of mapping a function.
```
person_df['Gender_Letter'] = (
    list(map(
        lambda v: 'M' if v==0 else 'F',
        person_df.Gender))
)
```

The following code uses Boolean Masking.
```
BM = person_df.Gender == 0
person_df['Gender_Letter'] = None
person_df.loc[person_df[BM].index,'Gender_Letter'] ='M'
person_df.loc[person_df[~BM].index,'Gender_Letter'] ='F'
```

The following code uses the `.replace()` method of the pandas Series `person_df.Gender`.
```
person_df['Gender_Letter'] = person_df.Gender.replace({0:'M',1:'F'})
```

The following code uses the `np.where()` function from the numpy module.
```
person_df['Gender_Letter'] = np.where(person_df.Gender==0,'M','F')
```
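Before comparing speeds, it can be reassuring to confirm that the four approaches produce the same letters. The sketch below repeats them on a small toy DataFrame; `toy_df`, the `by_*` column names, and the seed are hypothetical choices made for illustration, not part of the challenge.

```python
import numpy as np
import pandas as pd

# Small toy frame so the check runs instantly (the seed is arbitrary)
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({'Gender': rng.binomial(1, 0.5, 1000)})

# 1. Mapping a function
toy_df['by_map'] = list(map(lambda v: 'M' if v == 0 else 'F', toy_df.Gender))

# 2. Boolean masking, written the same way as in the challenge
BM = toy_df.Gender == 0
toy_df['by_mask'] = None
toy_df.loc[toy_df[BM].index, 'by_mask'] = 'M'
toy_df.loc[toy_df[~BM].index, 'by_mask'] = 'F'

# 3. Series.replace()
toy_df['by_replace'] = toy_df.Gender.replace({0: 'M', 1: 'F'})

# 4. np.where()
toy_df['by_where'] = np.where(toy_df.Gender == 0, 'M', 'F')

# Compare every other column against the mapped one, value by value
reference = toy_df['by_map'].tolist()
agree = all(toy_df[c].tolist() == reference
            for c in ['by_mask', 'by_replace', 'by_where'])
print(agree)
```

If the four columns agree on the toy data, any difference seen on the full `person_df` is a difference in speed only, not in the result.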

**Answer**: 



5.	Run the four pieces of code in Step 4 to see if you were right. Report your findings.

To record how long each piece of code takes to run, you may use the `time` module. Even better, if you are using Jupyter Notebook, you may just use its `%%time` magic command. All you need to do is add the magic command as the first line of the code cell. For instance, putting `%%time` on the first line of the cell that holds the first piece of code from Step 4 makes Jupyter Notebook report how long that cell took to run.
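Outside a notebook, the `time` module route mentioned above could look like the following sketch. It is not Jupyter output: it times the mapping approach from Step 4 with `time.perf_counter()` on a hypothetical `toy_df` that is much smaller than `person_df`, so the printed duration only illustrates the pattern and says nothing about the full-size run.

```python
import time

import numpy as np
import pandas as pd

# A reduced row count so the sketch finishes quickly;
# the challenge itself uses 10**7 rows
n_rows = 10**5
toy_df = pd.DataFrame({'Gender': np.random.binomial(1, 0.5, n_rows)})

start = time.perf_counter()  # read the clock just before the work
toy_df['Gender_Letter'] = list(
    map(lambda v: 'M' if v == 0 else 'F', toy_df.Gender))
elapsed = time.perf_counter() - start  # seconds spent on the work

print(f'Mapping a function took {elapsed:.3f} seconds')
```

Wrapping each of the other pieces of code between the same two clock readings gives numbers that can be compared directly.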


**Answer**: 

6.	We will add a sixth column to `person_df` in this step, and we will call it `Status`. This column will specify if a person is *Healthy*, *Overweight*, or *Underweight* based on their `BMI` and `Gender`. If a female’s `BMI` is lower than 19 they are underweight and if over 24 they are overweight, otherwise they are healthy. On the other hand, if a male’s `BMI` is between 20 and 25, they are healthy, if lower than 20 they are underweight, and if over 25 they are overweight. To save space, we will use `H` for *Healthy*, `O` for *Overweight*, and `U` for *Underweight* in the column `Status`.
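To make the rules above concrete, here is a minimal sketch of them as a plain Python function for a single person. The function name `bmi_status` is hypothetical, and treating a `BMI` that lands exactly on a threshold as healthy is an assumption, since the text does not say which side the boundaries belong to.

```python
def bmi_status(bmi, gender):
    """Return 'U', 'H', or 'O' for one person.

    gender follows the coding from Step 4: 1 is female, 0 is male.
    Assumption: a BMI exactly on a threshold counts as healthy.
    """
    # Healthy range is 19 to 24 for females and 20 to 25 for males
    low, high = (19, 24) if gender == 1 else (20, 25)
    if bmi < low:
        return 'U'
    if bmi > high:
        return 'O'
    return 'H'


# The same BMI can fall into different groups depending on gender
print(bmi_status(19.5, 1))  # prints H
print(bmi_status(19.5, 0))  # prints U
print(bmi_status(24.5, 1))  # prints O
print(bmi_status(24.5, 0))  # prints H
```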

You will get to create this new column with the following three different methods: mapping a function, Boolean Masking, and `np.where()`. Before actually implementing these methods, which one do you think will end up performing the fastest? Which one will be the slowest? What are your reasons?

**Answer**: 


7.	Implement the three methods in Step 6 and time how long each takes to complete. Were you right about which one was the fastest and which was the slowest?

**Answer**: 

8.	Among the methods we used in this challenge, which one of them should be regarded as V&B, and how come?

**Answer**: 

9.	From your experience in this challenge, if you were to come up with a guideline to help your future self decide between V&B and Boolean masking, what would that guideline be?

**Answer**: