# Optimizing the Python Code for Big Data 
Balancing Coding Complexity against Computational Complexity 

    
    AUTHOR: Dr. Roy Jafari 

# Chapter 5: Picking up the right tool 

## Challenge 2: Restructuring and Reformulating Data 

In this challenge, we will use a large dataset, `US_Shops_simulated.csv`, which contains about 5 million rows. Due to the size of this data, selecting the wrong tool can make simple restructuring and reformulating tasks take significantly longer. Use the following steps to complete this task.

1. The following code uses `pd.read_csv()` to read `US_Shops_simulated.csv` into `order_df`. Run the code and study the dataset. Identify which columns act as an index (they identify what each row describes) and which columns hold values.

```
import pandas as pd
order_df = pd.read_csv('US_Shops_simulated.csv')
order_df
```

**Answer**: 
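One way to approach this question is to test which combination of columns never repeats: those columns jointly identify a row, and the remaining ones carry the measurements. The sketch below does this on a small made-up table (`toy_df` and its values are hypothetical; only the column names come from this chapter), so it illustrates the check rather than stating the answer for the real file.

```python
import pandas as pd

# Hypothetical stand-in for order_df: 3 shops x 2 dates.
toy_df = pd.DataFrame({
    'State': ['CA', 'CA', 'CA', 'CA', 'NY', 'NY'],
    'Location': ['Shop A', 'Shop A', 'Shop B', 'Shop B', 'Shop A', 'Shop A'],
    'Date': ['2024-01-01', '2024-01-02'] * 3,
    'revenue': [100.0, 120.0, 80.0, 90.0, 200.0, 210.0],
    'profit': [10.0, 12.0, 8.0, 9.0, 20.0, 21.0],
    'number_of_customer_visits': [5, 6, 4, 4, 9, 10],
})

key_cols = ['State', 'Location', 'Date']

# A set of columns works as an index if no combination of its values repeats.
full_key_is_unique = not toy_df.duplicated(subset=key_cols).any()
# Dropping Date from the key makes rows collide: each shop appears once per date.
shop_only_is_unique = not toy_df.duplicated(subset=['State', 'Location']).any()

# Everything that is not part of the key is a value column.
value_cols = [c for c in toy_df.columns if c not in key_cols]

print(full_key_is_unique, shop_only_is_unique)
print(value_cols)
```

The same two `duplicated()` checks can be run on `order_df` itself to confirm which columns identify a row there.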

2. The following code uses the `.groupby()` and `.size()` methods to collect the index of every shop in this dataset, that is, every unique (`State`, `Location`) pair. Run the code and study its output.

```
shop_index = order_df.groupby(['State', 'Location']).size().index
print(shop_index)
```

3. Similar to step 2, the following code uses `.groupby()` and `.size()` to collect the index of every date in this dataset. Run the code and study its output.

```
date_index = order_df.groupby(['Date']).size().index
print(date_index)
```


4. The following code uses `shop_index` and `date_index` to create a new, empty table, `rev_shop_df`, that will hold the daily revenue of every shop for every date in `order_df`. In this table, each row represents a shop and each column represents a date.

```
rev_shop_df = pd.DataFrame(index=shop_index, columns=date_index)
rev_shop_df
```

5. The following code uses a `for` loop and the `.query()` method to fill `rev_shop_df` from `order_df`, one shop at a time. Note that `%%time` records how long the cell takes to run. Run the code and note its output.

```
%%time
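for state,location in shop_index:
    # Filter order_df down to the rows of a single shop.
    wdf = (order_df
           .query(f'State == "{state}"')
           .query(f'Location == "{location}"')
           .copy()
    )
    # Write that shop's daily revenue into its row, aligned on the date columns.
    rev_shop_df.loc[state,location] = (
        wdf.set_index('Date').revenue)
rev_shop_df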
```

**Answer:**  

6. The following code performs an even bigger task than the one in steps 4 and 5, but with more appropriate tools. While steps 4 and 5 wrangled only the `revenue` data, this step also wrangles the `profit` and `number_of_customer_visits` data. The code chains `.drop()`, `.set_index()`, and `.unstack()`; the method that actually performs the restructuring is `.unstack()`.

```
%%time
shop_df = (
    order_df
    .drop(columns=['Month', 'Year'])
    .set_index(['State', 'Location', 'Date'])
    .unstack()
)
shop_df
```
Run `shop_df['revenue']` to see what we created in steps 4 and 5.

**Answer:** 
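The effect of `.unstack()` is easier to follow on a table small enough to read in full. The sketch below repeats the step 6 chain on a hypothetical `toy_df` (made-up values, no `Month` or `Year` columns, so the `.drop()` call is omitted).

```python
import pandas as pd

# Hypothetical stand-in for order_df: 3 shops x 2 dates.
toy_df = pd.DataFrame({
    'State': ['CA', 'CA', 'CA', 'CA', 'NY', 'NY'],
    'Location': ['Shop A', 'Shop A', 'Shop B', 'Shop B', 'Shop A', 'Shop A'],
    'Date': ['2024-01-01', '2024-01-02'] * 3,
    'revenue': [100.0, 120.0, 80.0, 90.0, 200.0, 210.0],
    'profit': [10.0, 12.0, 8.0, 9.0, 20.0, 21.0],
    'number_of_customer_visits': [5, 6, 4, 4, 9, 10],
})

toy_shop_df = (
    toy_df
    .set_index(['State', 'Location', 'Date'])  # three-level row index
    .unstack()  # moves the innermost row level (Date) into the columns
)

# Columns now have two levels: (value column, Date).
print(toy_shop_df.shape)

# Selecting one top-level column gives a shops-by-dates table.
toy_rev_df = toy_shop_df['revenue']
print(toy_rev_df)
```

Six long rows with three value columns become three wide rows with 3 x 2 columns, and `toy_shop_df['revenue']` has the same shops-by-dates layout that steps 4 and 5 build by hand.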

7. Calculate roughly how many times faster `.unstack()` was than the `for` loop with `.query()`. Why is `.unstack()` so much faster?

**Answer:** 
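Outside a notebook, where `%%time` is unavailable, the comparison can be sketched with `time.perf_counter()`. The example below uses a small synthetic table (50 hypothetical shops, 60 dates, random revenue), so the measured ratio will differ from the one on the real file. It only illustrates the pattern: the loop filters the whole table once per shop in Python, while `.unstack()` reshapes everything in a single pass.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data (hypothetical): 50 shops x 60 dates with random revenue.
rng = np.random.default_rng(0)
shops = [(f'S{i // 10}', f'Shop {i % 10}') for i in range(50)]
dates = pd.date_range('2024-01-01', periods=60).strftime('%Y-%m-%d')
toy_df = pd.DataFrame(
    [(s, l, d) for s, l in shops for d in dates],
    columns=['State', 'Location', 'Date'],
)
toy_df['revenue'] = rng.random(len(toy_df))

# Approach 1: the loop from steps 4 and 5, one filter of the table per shop.
start = time.perf_counter()
shop_index = toy_df.groupby(['State', 'Location']).size().index
date_index = toy_df.groupby(['Date']).size().index
loop_df = pd.DataFrame(index=shop_index, columns=date_index)
for state, location in shop_index:
    wdf = toy_df.query(f'State == "{state}" and Location == "{location}"')
    loop_df.loc[state, location] = wdf.set_index('Date').revenue
loop_seconds = time.perf_counter() - start

# Approach 2: a single reshape of the whole table.
start = time.perf_counter()
unstack_df = (
    toy_df.set_index(['State', 'Location', 'Date'])['revenue'].unstack()
)
unstack_seconds = time.perf_counter() - start

print(f'loop: {loop_seconds:.4f}s, unstack: {unstack_seconds:.4f}s')
print(f'ratio: {loop_seconds / unstack_seconds:.1f}x')
```

Both approaches produce the same shops-by-dates table, so the ratio compares equal work. On the full dataset the gap would be expected to widen, since the loop's cost grows with the number of shops times the number of rows.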

8. The function `.unstack()` has a peer function, `.stack()`. For practice, use `.stack()` to revert `shop_df` back to `order_df`.

**Answer:**
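One possible approach, sketched on a hypothetical `toy_df` rather than the real file: `.stack()` moves the innermost column level (`Date`) back into the row index, and `.reset_index()` turns the index levels into ordinary columns again. Note that on the real data the result would lack `Month` and `Year`, because step 6 dropped them.

```python
import pandas as pd

# Hypothetical stand-in for order_df (without Month and Year).
toy_df = pd.DataFrame({
    'State': ['CA', 'CA', 'CA', 'CA', 'NY', 'NY'],
    'Location': ['Shop A', 'Shop A', 'Shop B', 'Shop B', 'Shop A', 'Shop A'],
    'Date': ['2024-01-01', '2024-01-02'] * 3,
    'revenue': [100.0, 120.0, 80.0, 90.0, 200.0, 210.0],
    'profit': [10.0, 12.0, 8.0, 9.0, 20.0, 21.0],
    'number_of_customer_visits': [5, 6, 4, 4, 9, 10],
})
key_cols = ['State', 'Location', 'Date']

# Wide form, built as in step 6.
toy_shop_df = toy_df.set_index(key_cols).unstack()

restored_df = (
    toy_shop_df
    .stack()                         # Date goes from the columns back to the rows
    .reset_index()                   # index levels become regular columns
    .loc[:, list(toy_df.columns)]    # restore the original column order
    .sort_values(key_cols)
    .reset_index(drop=True)
)
print(restored_df)
```

The explicit column reordering and sorting are there because the column and row order after `.stack()` can differ between pandas versions. Some versions also emit a `FutureWarning` about a new `.stack()` implementation, which does not affect this small example.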

9. The `.stack()` and `.unstack()` pair is powerful. However, the same tasks can be done with `.melt()` and `.pivot()`. To practice, first create `shop_df` using one of these two functions, then revert it back to `order_df` using the other. Pay attention: doing this with `.melt()` can be tricky because this function does not handle multi-level columns well.

**Answer:**
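A sketch of one way this could go, again on a hypothetical `toy_df`: `.pivot()` builds the wide table, and the multi-level columns are sidestepped on the way back by melting one value column at a time and joining the pieces. This is only one of several workable strategies.

```python
import pandas as pd

# Hypothetical stand-in for order_df (without Month and Year).
toy_df = pd.DataFrame({
    'State': ['CA', 'CA', 'CA', 'CA', 'NY', 'NY'],
    'Location': ['Shop A', 'Shop A', 'Shop B', 'Shop B', 'Shop A', 'Shop A'],
    'Date': ['2024-01-01', '2024-01-02'] * 3,
    'revenue': [100.0, 120.0, 80.0, 90.0, 200.0, 210.0],
    'profit': [10.0, 12.0, 8.0, 9.0, 20.0, 21.0],
    'number_of_customer_visits': [5, 6, 4, 4, 9, 10],
})
shop_cols = ['State', 'Location']
value_cols = ['revenue', 'profit', 'number_of_customer_visits']

# Long -> wide: columns become (value column, Date) pairs.
wide_df = toy_df.pivot(index=shop_cols, columns='Date', values=value_cols)

# Wide -> long: each wide_df[name] has single-level Date columns,
# which .melt() handles directly.
long_parts = [
    wide_df[name]
    .reset_index()
    .melt(id_vars=shop_cols, var_name='Date', value_name=name)
    .set_index(shop_cols + ['Date'])
    for name in value_cols
]
restored_df = (
    pd.concat(long_parts, axis=1)    # line the value columns up side by side
    .reset_index()
    .sort_values(shop_cols + ['Date'])
    .reset_index(drop=True)
)
print(restored_df)
```

One side effect worth noticing: passing several value columns to `.pivot()` can upcast integer columns such as `number_of_customer_visits` to float, so the restored values match the originals numerically but not necessarily in dtype.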

10. Compare the `.stack()`/`.unstack()` pair with the `.melt()`/`.pivot()` pair from two perspectives: runtime and ease of use.

**Answer:** 
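For the runtime half of the comparison, a small harness built on the standard-library `timeit` module is one option. The sketch below times the long-to-wide direction only, on synthetic data (50 hypothetical shops, 60 dates), so its numbers say nothing definitive about the 5-million-row file; the helper names `with_unstack` and `with_pivot` are made up for this example.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data (hypothetical): 50 shops x 60 dates with random revenue.
rng = np.random.default_rng(0)
shops = [(f'S{i // 10}', f'Shop {i % 10}') for i in range(50)]
dates = pd.date_range('2024-01-01', periods=60).strftime('%Y-%m-%d')
toy_df = pd.DataFrame(
    [(s, l, d) for s, l in shops for d in dates],
    columns=['State', 'Location', 'Date'],
)
toy_df['revenue'] = rng.random(len(toy_df))


def with_unstack():
    """Long -> wide using set_index() followed by unstack()."""
    return toy_df.set_index(['State', 'Location', 'Date'])['revenue'].unstack()


def with_pivot():
    """Long -> wide using pivot()."""
    return toy_df.pivot(
        index=['State', 'Location'], columns='Date', values='revenue')


# Total time for 20 repetitions of each approach.
unstack_seconds = timeit.timeit(with_unstack, number=20)
pivot_seconds = timeit.timeit(with_pivot, number=20)

print(f'unstack: {unstack_seconds:.4f}s, pivot: {pivot_seconds:.4f}s')
```

The same pattern extends to the wide-to-long direction by wrapping the `.stack()` and `.melt()` versions in two more functions. Ease of use is a judgment the timings cannot settle, but the number of lines each version needs is a reasonable starting point.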