# Optimizing the Python Code for Big Data 
Balancing Coding Complexity against Computational Complexity 

    
    AUTHOR: Dr. Roy Jafari 

# Chapter 4: Taking Advantage of Vectorization and Broadcasting (V&B) 

## Challenge 1: V&B, iterating, applying, or mapping?

In this challenge, we will see how much the performance of four methods for doing array operations can differ. The four methods are the following.

- Iteration
- Pandas `.apply()` function
- Mapping function
- V&B

Answer the following questions and do the following tasks.

1.	In this challenge, we will be using the `time`, `numpy`, `pandas`, and `matplotlib` modules. Go ahead and import them as follows.


In [None]:
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

2.	The following code defines the function `one_experiment()`, which creates a random DataFrame, `random_df`, with two randomly generated columns, `C1` and `C2`. The number of rows in `random_df`, `n_rows`, is an input to the function `one_experiment()`. After `random_df` is generated, the function performs the same task of multiplying the values of `C1` by the values of `C2` with four methods, named `iterate`, `apply`, `map`, and `v&b`, and records the time each method takes to get the task done.
Study the code to understand exactly what the function `one_experiment()` does, and run the code to define the function on your computer.

In [None]:
def one_experiment(n_rows):
    output, keep={}, []
    random_df = pd.DataFrame(
        {'C1': np.random.random(n_rows),
         'C2': np.random.random(n_rows)}
    )
    # Method 1: iterate over the rows one at a time
    t0= time.time()
    for i,row in random_df.iterrows():
        keep.append(row.C1*row.C2)
    random_df['C3_iterate'] = keep
    output['iterate'] = time.time()-t0
    # Method 2: apply a function to every row
    t0= time.time()
    random_df['C3_apply'] = random_df.apply(
        lambda r:r.C1*r.C2,
        axis=1
    )
    output['apply'] = time.time()-t0
    # Method 3: map a function over the two columns
    t0= time.time()
    random_df['C3_map'] = list(
        map(
            lambda x,y:x*y,
            random_df.C1,
            random_df.C2
        )
    )
    output['map'] = time.time()-t0
    # Method 4: vectorization and broadcasting (V&B)
    t0= time.time()
    random_df['C3_v&b'] =(
        random_df.C1 * random_df.C2
    ) 
    output['v&b'] = time.time()-t0
    # output maps each method name to its elapsed time in seconds
    return output
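
Before comparing speeds, it can help to confirm that the four methods really compute the same thing. The following is a minimal sketch on a tiny hand-made DataFrame; the variable names (`df`, `iterated`, `applied`, `mapped`, `vectorized`, `all_equal`) are illustrative and not part of the challenge code.

```python
import numpy as np
import pandas as pd

# A tiny DataFrame with known values, so the results are easy to check by hand.
df = pd.DataFrame({"C1": [1.0, 2.0, 3.0], "C2": [4.0, 5.0, 6.0]})

# The same multiplication, written in the four styles used by one_experiment().
iterated = [row.C1 * row.C2 for _, row in df.iterrows()]
applied = df.apply(lambda r: r.C1 * r.C2, axis=1)
mapped = list(map(lambda x, y: x * y, df.C1, df.C2))
vectorized = df.C1 * df.C2

print(vectorized.tolist())  # [4.0, 10.0, 18.0]

# All four results should agree element by element.
all_equal = (
    np.allclose(iterated, vectorized)
    and np.allclose(applied, vectorized)
    and np.allclose(mapped, vectorized)
)
print(all_equal)  # True
```

Since the results agree, any difference in the recorded times comes from how each method does the work, not from what it computes.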

3.	Now that we have the function `one_experiment()` defined, go ahead and run it with a few different values of `n_rows`. Study the outputs of the function and describe your observations.

**Answer**: 

4.	The following code defines the function `experiments()`, which expands the capability of the function `one_experiment()`. While `one_experiment()` only takes in `n_rows`, `experiments()` also takes in `n_repeat`. The input `n_repeat` is the number of times that `one_experiment()` is repeated; the average time each method takes to complete the task is recorded and returned. We added `n_repeat` because comparing the methods based on a single run is not as reliable. Study the code to understand exactly what it does, and then run it.

In [None]:
method_list = ['iterate','apply','map','v&b']
def experiments(n_rows,n_repeat):
    output = {m:0 for m in method_list}
    for _ in range(n_repeat):
        result = one_experiment(n_rows)
        output = {m:result[m]+output[m] 
                  for m in method_list}
    return {m:round(output[m]/n_repeat,5)
            for m in method_list}
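
As a point of comparison, Python's standard library offers `timeit.repeat()`, which follows the same idea of repeating a measurement to reduce noise. The sketch below is not the chapter's method; it times only the V&B multiplication, and the names `raw`, `per_call`, `mean_time`, and `best_time`, as well as the sizes chosen, are assumptions made for illustration.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({"C1": np.random.random(1000),
                   "C2": np.random.random(1000)})

# repeat=5 gives five separate measurements;
# number=100 runs the multiplication 100 times inside each measurement.
raw = timeit.repeat(lambda: df.C1 * df.C2, repeat=5, number=100)

# Convert each measurement to seconds per single multiplication.
per_call = [t / 100 for t in raw]
mean_time = sum(per_call) / len(per_call)
best_time = min(per_call)

print(f"mean: {mean_time:.6f} s, best: {best_time:.6f} s")
```

The mean matches what `experiments()` reports, while the minimum is often quoted as well because it is less affected by other processes running on the machine.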

5.	Now that we have the function `experiments()` defined, go ahead and run it with a few different values of `n_rows` and `n_repeat`. For instance, you might want to run `experiments(100,5)` or `experiments(1000,10)`. Study the outputs of the function and describe your observations.

**Answer**:

6.	Now we want to set up a systematic experiment to compare the four methods we are studying. The following code creates `result_df`, which we will later use to record and study the results of our experiments. The index of `result_df` holds the `n_rows` values we will be passing to `experiments()`. Run the code and study its printout.

In [None]:
exp_options = [10**i for i in range(2,6)]
result_df = pd.DataFrame(index = exp_options,
                         columns = method_list)
print(result_df)
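
The structure of the empty `result_df` can also be inspected programmatically. The following sketch rebuilds it in isolation so the block runs on its own; it only reports the shape, the index values, and whether every cell is still missing.

```python
import pandas as pd

method_list = ["iterate", "apply", "map", "v&b"]
exp_options = [10**i for i in range(2, 6)]
result_df = pd.DataFrame(index=exp_options, columns=method_list)

print(result_df.shape)               # (4, 4)
print(list(result_df.index))         # [100, 1000, 10000, 100000]
print(result_df.isna().all().all())  # True: nothing has been recorded yet
print(result_df.dtypes)              # column types before any timings are stored
```

Each row corresponds to one value of `n_rows` and each column to one method, so one call to `experiments()` fills exactly one row.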

7.	The following code simply runs the function `experiments()` for each row of `result_df` and records its output in `result_df`. Note that we are also passing 5 as `n_repeat` to `experiments()`. Run the code and study its output. Be aware that the code might take a few minutes to complete.

In [None]:
for o in exp_options:
    result_df.loc[o] = experiments(o,5)
print(result_df)

8.	Now we can use the wonderful matplotlib module to visualize our experiment. Run the following code, study the line plot it creates and describe your observations.

In [None]:
import os
os.makedirs('images', exist_ok=True)  # make sure the output folder exists
for m in method_list:
    result_df[m].plot(logx=True)
plt.xlabel('n_rows')
plt.ylabel('seconds')
plt.legend()
plt.savefig('images/challenge1_8.png', dpi=500)
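
On a linear y-axis, the slowest method can dominate the scale, so the faster methods may appear to overlap. One possible way to separate them, offered here only as a sketch, is to time the two fastest methods on their own and plot with both axes on a log scale. The helper `time_fast_methods()` and the names `sizes` and `fast_df` are hypothetical and not part of the challenge code; the sizes are kept small so the block runs quickly.

```python
import time

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd


def time_fast_methods(n_rows):
    """Time only map and V&B once (hypothetical helper for illustration)."""
    df = pd.DataFrame({"C1": np.random.random(n_rows),
                       "C2": np.random.random(n_rows)})
    t0 = time.perf_counter()
    list(map(lambda x, y: x * y, df.C1, df.C2))
    t_map = time.perf_counter() - t0
    t0 = time.perf_counter()
    df.C1 * df.C2
    t_vb = time.perf_counter() - t0
    return {"map": t_map, "v&b": t_vb}


sizes = [10**i for i in range(2, 5)]
fast_df = pd.DataFrame([time_fast_methods(n) for n in sizes], index=sizes)

# loglog=True puts both axes on a log scale, so small timings stay visible.
ax = fast_df.plot(loglog=True)
ax.set_xlabel("n_rows")
ax.set_ylabel("seconds (log scale)")
```

A single run per size is noisy, so a fuller experiment would average several repeats, as `experiments()` does.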

**Answer**:

9.	Based on the plot that you created in step 8, answer the following questions: 1) At what `n_rows` do we start seeing a significant difference between the methods `iterate` and `apply`? 2) At what `n_rows` do we start seeing a significant difference between the methods `apply` and `map`? 3) At what `n_rows` do we start seeing a significant difference between the methods `map` and `v&b`? Note that the correct answer to some of the questions might be that the visual cannot help you answer the question.

**Answer**:


10.	For the questions that you were not able to answer in step 9, design, code, and perform experiments, and visualize their results so that those questions can be answered. Answer the question(s) after completing the described experiments.

**Answer**: