# Optimizing the Python Code for Big Data 
Balancing Coding Complexity against Computational Complexity 

    
    AUTHOR: Dr. Roy Jafari 

# Chapter 5: Picking the right tool 

## Challenge 1: Image pre-processing 

An image is essentially a dataset that contains a set of numbers for each pixel. While the dataset of an image is organized in a rectangular pixel structure, most data science algorithms are designed to work with datasets organized as tables of rows and columns. In this challenge, we will explore two approaches to transforming an image from its rectangular pixel structure into a tabular data structure. Naturally, one of these approaches demonstrates a failure to select the right tool. Let’s dive in, shall we?

1. In this challenge, we will build on an example from the first book in this series, *Optimizing the Big Data Problem Statement*, specifically from Chapter 4, *Example of Data Wrangling – Computer Vision Case Study*. If you don’t have the first book, no worries: you’ll find everything you need to understand and complete this challenge right here. First, review the example; then use the following code to load `train_d`, `train_l`, `test_d`, and `test_l` into your local Python environment.

```
from tensorflow.keras.datasets import mnist
(train_d, train_l), (test_d, test_l) = mnist.load_data()
```

2. Run the following code and study its output to understand the shape of these datasets. With the provided context, explain the data structures of these four arrays.

```
print(train_d.shape)
print(train_l.shape)
print(test_d.shape)
print(test_l.shape)
```

**Answer:** 

3. If you are having trouble answering the question in the previous step, run the following code for a hint.

```
print(train_d[0, :, :])
```

4. If you are still having trouble answering the question in step 2, try running the following code and study its output. This code uses the `plt.imshow()` function to visualize what we saw in step 3.

```
import matplotlib.pyplot as plt

plt.imshow(train_d[0, :, :], cmap='gray')
plt.xticks([])
plt.yticks([])
plt.show()
```

5. The following code creates an empty pandas DataFrame, `feature_df`, with one row per training image. We want to restructure `train_d` and `train_l` into this table: the values in columns `P0` to `P783` will come from `train_d`, and the `lable` column will come from `train_l`.

```
import pandas as pd
columns = [f'P{i}' for i in range(28*28)]
columns.append('lable')

feature_df = pd.DataFrame(index=range(60000), columns=columns)
feature_df
```
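To make the column naming concrete, here is a minimal sketch of how a column index relates to a pixel position. It assumes the pixels are flattened row by row (row-major order), which is the order both approaches in the following steps use. The helper names `column_to_pixel()` and `pixel_to_column()` are hypothetical and only serve this illustration.

```python
# Assumption: pixels are flattened row by row (row-major order).
IMAGE_SIDE = 28  # each image is 28 x 28 pixels

def column_to_pixel(k):
    """Return the (row, col) position of the pixel stored in column P{k}."""
    return divmod(k, IMAGE_SIDE)

def pixel_to_column(row, col):
    """Return the name of the column that holds the pixel at (row, col)."""
    return f'P{IMAGE_SIDE * row + col}'

print(column_to_pixel(0))     # (0, 0)
print(column_to_pixel(783))   # (27, 27)
print(pixel_to_column(1, 0))  # P28
```

In other words, `P0` to `P27` hold the first image row, `P28` to `P55` the second, and so on.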

6. The following code defines `reshape_image_ourselve()`, which takes one 28x28-pixel image and flattens its pixel grid into a single row of 784 numbers, one image row at a time. Run the following code and then test it with `train_d[0, :, :]`.

```
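def reshape_image_ourselve(image):
    # One slot per pixel; the explicit dtype avoids relying on the
    # version-dependent default dtype of an empty Series.
    output_sr = pd.Series(index=range(28*28), dtype='float64')
    for i in range(28):
        # Copy image row i into slots 28*i .. 28*i + 27.
        output_sr.iloc[28*i:28*(i+1)] = image[i, :]
    return output_sr.values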
```
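If you want to check the function before running it on the real data, the following sketch tests it on a synthetic image. The random image is only a stand-in for `train_d[0, :, :]` and is an assumption of this example. The function is restated here, with an explicit `dtype`, so that the block runs on its own.

```python
import numpy as np
import pandas as pd

def reshape_image_ourselve(image):
    # Restated from step 6 so this block is self-contained.
    output_sr = pd.Series(index=range(28*28), dtype='float64')
    for i in range(28):
        output_sr.iloc[28*i:28*(i+1)] = image[i, :]
    return output_sr.values

# Synthetic stand-in for one MNIST image: 28 x 28 pixel intensities in 0..255.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)

flat = reshape_image_ourselve(image)
print(flat.shape)  # (784,)

# Every pixel (r, c) should land at position 28*r + c of the flat output.
all_match = all(flat[28*r + c] == image[r, c]
                for r in range(28) for c in range(28))
print(all_match)   # True
```

With the real data loaded, the same check applies after replacing `image` with `train_d[0, :, :]`.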

7. The following code uses the function defined in the previous step, `reshape_image_ourselve()`, to fill `feature_df` one row at a time. Note that `%%time`, a Jupyter cell magic that must be the first line of the cell, reports how long the cell takes to run on your computer. Run the code and note how long it took.

```
%%time
feature_df = pd.DataFrame(index=range(60000), columns=[f'P{i}' for i in range(28*28)])

for i in range(60000):
    feature_df.loc[i] = reshape_image_ourselve(train_d[i, :, :])

feature_df['lable'] = train_l
feature_df
```

**Answer:** 


8. The following code accomplishes the same task as the previous step but uses a much better tool, the NumPy array method `reshape()`. Run the following code and note how long it takes to execute.

```
%%time
feature_df = pd.DataFrame(train_d.reshape(60000, -1), columns=[f'P{i}' for i in range(28*28)])
feature_df['lable'] = train_l
feature_df
```

**Answer:** 
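A note on the `-1` passed to `reshape()` in step 8: it asks NumPy to infer that dimension from the array's total size. The toy array below is an assumption made for illustration; it stands in for three tiny 2x2 "images".

```python
import numpy as np

# Three 2 x 2 "images" holding the numbers 0..11.
toy_images = np.arange(12).reshape(3, 2, 2)

# Keep 3 rows (one per image) and let NumPy infer the row length: 2 * 2 = 4.
toy_table = toy_images.reshape(3, -1)

print(toy_table.shape)  # (3, 4)
print(toy_table[1])     # [4 5 6 7]
```

Applied to `train_d`, the same call keeps 60000 rows and infers 28 * 28 = 784 columns per row.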


9. Compare the runtimes you recorded on your computer in step 7 and step 8. Do you see a large difference? What do you think is the reason?

**Answer:** 
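If you are working outside Jupyter, where `%%time` is not available, the comparison can be sketched with `time.perf_counter()`. This is a rough sketch under stated assumptions: it uses 100 synthetic images instead of the 60000 MNIST images so that it finishes quickly, and the variable names are hypothetical. The measured times will vary from machine to machine.

```python
import time
import numpy as np
import pandas as pd

# Assumption: a small synthetic stand-in for train_d (100 images of 28 x 28).
n_images = 100
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(n_images, 28, 28), dtype=np.uint8)
columns = [f'P{i}' for i in range(28 * 28)]

# Approach from step 7: fill an empty DataFrame one image at a time.
start = time.perf_counter()
loop_df = pd.DataFrame(index=range(n_images), columns=columns)
for i in range(n_images):
    row = pd.Series(index=range(28 * 28), dtype='float64')
    for r in range(28):
        row.iloc[28*r:28*(r+1)] = images[i, r, :]
    loop_df.loc[i] = row.values
loop_seconds = time.perf_counter() - start

# Approach from step 8: reshape the whole array in one call.
start = time.perf_counter()
reshape_df = pd.DataFrame(images.reshape(n_images, -1), columns=columns)
reshape_seconds = time.perf_counter() - start

print(f'row-by-row loop: {loop_seconds:.4f} s')
print(f'reshape:         {reshape_seconds:.4f} s')

# Both approaches should produce the same table of pixel values.
same_values = bool(
    (loop_df.astype('int64').to_numpy() == reshape_df.to_numpy()).all()
)
print(same_values)
```

Increasing `n_images` shows how each approach scales as the dataset grows.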