# Zero To Hero Big Data Preparation
Taking Advantage of Cloud Technologies to Create Big Data Solutions
    
    AUTHOR: Dr. Roy Jafari 

## Chapter 6: Effectively employing computational and memory resources 

### Challenge 1: Loop or Map?

In this challenge, we will use *person_df.csv*, a randomly generated dataset. The code that generated the data is given below. The data has 30,000 rows, each representing an imaginary individual with a Height, Weight, BMI, and Gender.

In [8]:
import pandas as pd
import numpy as np

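n = 30000
person_df = pd.DataFrame(index=range(n), columns=['Height','Weight'])
# Draw Height and Weight from normal distributions
person_df.Height = np.random.normal(178,10,n)
person_df.Weight = np.random.normal(83,7,n)
# BMI is Weight divided by the square of Height/100
person_df['BMI'] = person_df.Weight / ((person_df.Height/100)**2) 
# Draw 0 or 1 for each person, then relabel only the Gender column
person_df['Gender'] = np.random.binomial(1,0.4988,n)
person_df['Gender'] = person_df['Gender'].map({0:'M',1:'F'})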
person_df.to_csv('person_df.csv', index=False)

The task is to specify the health status of each individual. A male individual is considered Healthy if his `BMI` is between 20 and 25, Underweight if it is below 20, and Overweight if it is above 25. Likewise, a female individual is considered Healthy if her `BMI` is between 19 and 24, Underweight if it is below 19, and Overweight if it is above 24.
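The rule above can also be written as a small lookup table, which makes the thresholds easy to check at a glance. This is a minimal sketch, not the chapter's own code: `HEALTHY_RANGE` and `classify` are hypothetical names, and treating both bounds as healthy is an assumption taken from the `<` and `<=` comparisons in the two snippets that follow.

```python
# Hypothetical lookup table: gender -> (lower, upper) bound of the healthy BMI range.
HEALTHY_RANGE = {'M': (20, 25), 'F': (19, 24)}

def classify(gender, bmi):
    """Return 'Underweight', 'Healthy', or 'Overweight' for one person."""
    low, high = HEALTHY_RANGE[gender]
    if bmi < low:
        return 'Underweight'
    if bmi <= high:  # assumption: both bounds count as healthy
        return 'Healthy'
    return 'Overweight'

print(classify('M', 22.5))  # Healthy
print(classify('F', 24.5))  # Overweight
print(classify('M', 19.5))  # Underweight
```

Unlike the snippets in this challenge, which treat every value other than `'M'` as female, this sketch raises a `KeyError` for an unexpected gender label.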

Both of the following pieces of code add a fifth column, Status, to the data, indicating whether each person is healthy, underweight, or overweight.

Your challenge is to figure out, without running the code, which of the two is the more efficient way to get the task done.

The first piece of code is the following:
```
import pandas as pd
person_df = pd.read_csv('person_df.csv')
for i,row in person_df.iterrows():
    if(row.Gender == 'M'):
        if(row.BMI<20):
            person_df.loc[i,'Status'] = 'Underweight'
        elif(row.BMI<=25):
            person_df.loc[i,'Status'] = 'Healthy'
        else:
            person_df.loc[i,'Status'] = 'Overweight'
    else:
        if(row.BMI<19):
            person_df.loc[i,'Status'] = 'Underweight'
        elif(row.BMI<=24):
            person_df.loc[i,'Status'] = 'Healthy'
        else:
            person_df.loc[i,'Status'] = 'Overweight'
```

The second piece of code is the following:

```
import pandas as pd
person_df = pd.read_csv('person_df.csv')
def specifyStatus(gender,bmi):
    if(gender == 'M'):
        if(bmi<20):
            return 'Underweight'
        elif(bmi<=25):
            return 'Healthy'
        else:
            return 'Overweight'
    else:
        if(bmi<19):
            return 'Underweight'
        elif(bmi<=24):
            return 'Healthy'
        else:
            return 'Overweight'
person_df['Status'] = list(
    map(specifyStatus,person_df.Gender,person_df.BMI)
    )
```

**Answer**:

The first piece of code makes the CPU fetch a small piece of data from `person_df`, process it, and write the result back, once for every row: `iterrows` builds a new row object on each iteration, and each `.loc` assignment updates a single cell. That is not an efficient orchestration. The second piece of code hands the work to the CPU in the form of a function: `map` passes over the `Gender` and `BMI` columns once, and the whole `Status` column is assigned in a single step. This saves the CPU a lot of unnecessary data movement, so the second piece of code runs faster.
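One way to see the difference in per-row work is to look at what each approach hands to the Python code. The sketch below uses a hypothetical three-row stand-in (`demo_df`) rather than the real data, and `describe` is a made-up helper for illustration only.

```python
import pandas as pd

# Hypothetical three-row stand-in for person_df.
demo_df = pd.DataFrame({
    'Gender': ['M', 'F', 'M'],
    'BMI': [18.5, 22.0, 27.3],
})

# iterrows() packs every row into a new pandas Series before the loop body runs.
row_types = [type(row).__name__ for _, row in demo_df.iterrows()]
print(row_types)  # ['Series', 'Series', 'Series']

# map() over the two columns passes plain values to the function, one pair at a time.
def describe(gender, bmi):
    return (isinstance(gender, str), isinstance(bmi, float))

arg_types = list(map(describe, demo_df.Gender, demo_df.BMI))
print(arg_types)  # [(True, True), (True, True), (True, True)]
```

Building a Series for each of 30,000 rows, and then writing one cell at a time through `.loc`, is overhead that the `map` version never pays.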

The following code uses the Jupyter Notebook `%%time` cell magic to measure how long each approach takes.

In [6]:
%%time
import pandas as pd

person_df = pd.read_csv('person_df.csv')

for i,row in person_df.iterrows():

    if(row.Gender == 'M'):
        if(row.BMI<20):
            person_df.loc[i,'Status'] = 'Underweight'
        elif(row.BMI<=25):
            person_df.loc[i,'Status'] = 'Healthy'
        else:
            person_df.loc[i,'Status'] = 'Overweight'
            
    else:
        if(row.BMI<19):
            person_df.loc[i,'Status'] = 'Underweight'
        elif(row.BMI<=24):
            person_df.loc[i,'Status'] = 'Healthy'
        else:
            person_df.loc[i,'Status'] = 'Overweight'

Wall time: 3.68 s


In [7]:
%%time
import pandas as pd

person_df = pd.read_csv('person_df.csv')

def specifyStatus(gender,bmi):
    if(gender == 'M'):
        if(bmi<20):
            return 'Underweight'
        elif(bmi<=25):
            return 'Healthy'
        else:
            return 'Overweight'
    else:
        if(bmi<19):
            return 'Underweight'
        elif(bmi<=24):
            return 'Healthy'
        else:
            return 'Overweight'

person_df['Status'] = list(
    map(specifyStatus,person_df.Gender,person_df.BMI)
    )

Wall time: 34.4 ms
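The two timings above (3.68 s versus 34.4 ms) differ by roughly a factor of 100, although both cells also include the time spent reading the CSV file. For readers working outside Jupyter, the sketch below times only the labeling step with `time.perf_counter` and checks that both approaches produce the same column. It is an illustration under stated assumptions: `specify_status`, `sample_df`, the sample size of 2,000, and the BMI distribution are all hypothetical choices, and the measured times will vary by machine.

```python
import time

import numpy as np
import pandas as pd

def specify_status(gender, bmi):
    """Same rule as specifyStatus in the snippets above."""
    low, high = (20, 25) if gender == 'M' else (19, 24)
    if bmi < low:
        return 'Underweight'
    if bmi <= high:
        return 'Healthy'
    return 'Overweight'

# Hypothetical smaller sample so the comparison finishes quickly.
rng = np.random.default_rng(0)
n = 2000
sample_df = pd.DataFrame({
    'Gender': rng.choice(['M', 'F'], size=n),
    'BMI': rng.normal(26, 3, size=n),
})

# Approach 1: row-by-row loop with one .loc write per row.
loop_df = sample_df.copy()
start = time.perf_counter()
for i, row in loop_df.iterrows():
    loop_df.loc[i, 'Status'] = specify_status(row.Gender, row.BMI)
loop_seconds = time.perf_counter() - start

# Approach 2: map the function over the two columns and assign once.
map_df = sample_df.copy()
start = time.perf_counter()
map_df['Status'] = list(map(specify_status, map_df.Gender, map_df.BMI))
map_seconds = time.perf_counter() - start

print(f'loop: {loop_seconds:.4f} s, map: {map_seconds:.4f} s')
print('same result:', loop_df['Status'].tolist() == map_df['Status'].tolist())
```

Checking that the two `Status` columns match confirms that the comparison is between two ways of doing the same work, not two different computations.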
