# Optimizing the Python Code for Big Data 
Balancing Coding Complexity against Computational Complexity 
    
    AUTHOR: Dr. Roy Jafari 


# Chapter 2: Choosing the right data types 

## Challenge 2: int8, int16, int32, or int64?

By the end of this challenge, we will have gained experience in deciding when and where each of the four integer types int8, int16, int32, and int64 should be used. 
Use the following prompts to complete this challenge.
1.	The following code creates four pandas DataFrames from a single DataFrame, `person_df`. The `person_df` DataFrame has two columns, `Height` and `Weight`, and 100 million rows. It is filled with random values, and the function `np.random.normal()` ensures that these values fall within the range we would reasonably expect for weight and height. The units of measurement for weight and height are kilograms and centimeters. Once `person_df` has been generated, the four DataFrames are created from it and stored in the dictionary `dfs`. The only difference between these DataFrames is their data type: *int8*, *int16*, *int32*, and *int64*, respectively. Run the code, and go to the next step. Note that the code might take a few seconds to complete.


```
import pandas as pd
import numpy as np

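n = 10**8  # number of rows (100 million)

# Build the DataFrame directly from the randomly generated columns.
# Height is in centimeters and Weight is in kilograms.
person_df = pd.DataFrame({
    'Height': np.random.normal(178, 10, n),
    'Weight': np.random.normal(83, 7, n),
})

# Create one copy of person_df per integer type and keep them in dfs.
int_bits = ['8', '16', '32', '64']
dfs = {}
for int_bit in int_bits:
    type_name = f'int{int_bit}'
    dfs[type_name] = person_df.astype(type_name)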
```
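
If 100 million rows are too much for the machine at hand, a smaller and reproducible variant of the data generation can be useful for a first look at the values. The following is only a sketch: the seed `0`, the row count `small_n`, and the name `small_df` are assumptions made for illustration and are not part of the challenge. Keep in mind that the timing experiments in the later steps are meant to be run on the full-size data.

```python
import numpy as np
import pandas as pd

# Assumptions (not part of the original challenge): a fixed seed of 0
# and a reduced row count, chosen only to keep the example quick.
rng = np.random.default_rng(0)
small_n = 10**5

small_df = pd.DataFrame({
    'Height': rng.normal(178, 10, small_n),  # centimeters
    'Weight': rng.normal(83, 7, small_n),    # kilograms
})

# Summary statistics show whether the values fall in a plausible range.
print(small_df.describe())
```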


2.	The following code accesses each of the four DataFrames in `dfs` and calls its `.info()` method. Run the code and study its output. Is there a significant difference in the amount of memory these four DataFrames use? If so, what accounts for the difference? 

```
for df in dfs.values():
    df.info()
    print('\r\n')
```

**Answer**: 


3.	The values in one of the DataFrames are corrupted. Use the following code to print out the values of all the DataFrames, and figure out which one is corrupted. Investigate to figure out what caused the corruption.

```
for key,df in dfs.items():
    print(key)
    print(df)
    print('\r\n')
```
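
As a possible starting point for the investigation, NumPy can report the smallest and largest value each integer type is able to represent. The sketch below only prints those limits; the dictionary name `ranges` is an illustrative choice. Comparing the limits with the values we expect for height and weight is left as part of the challenge.

```python
import numpy as np

# Collect the representable range of each integer type.
ranges = {}
for type_name in ['int8', 'int16', 'int32', 'int64']:
    info = np.iinfo(np.dtype(type_name))
    ranges[type_name] = (info.min, info.max)
    print(f'{type_name}: min={info.min}, max={info.max}')
```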

**Answer**: 


4.	The following code performs a computational experiment to see whether the runtime of big data manipulations changes when the data is encoded as *int8*, *int16*, *int32*, or *int64*. The code calculates the **Body Mass Index (BMI)**, a health metric derived from an individual’s height and weight. The formula for BMI is weight in kilograms divided by the square of height in meters. Run the following code and study its outcome.

```
import time
exp1_df = pd.DataFrame(
    index = dfs.keys(),
    columns=['RunTime']
)
for key in dfs.keys():
    wdf = dfs[key]
    t0 = time.time()
    wdf['BMI'] = wdf.Weight/((wdf.Height/100)**2)
    exp1_df.at[key,'RunTime'] = time.time()-t0
print(exp1_df)
```
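
To make the BMI formula concrete, here is the same calculation for a single person. The weight and height below are made-up values chosen for illustration; the division by 100 converts centimeters to meters, just as in the experiment above.

```python
weight_kg = 83    # illustrative value, in kilograms
height_cm = 178   # illustrative value, in centimeters

# Convert the height to meters, then apply BMI = weight / height^2.
height_m = height_cm / 100
bmi = weight_kg / height_m**2
print(round(bmi, 1))  # → 26.2
```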

5.	Weren’t you expecting to see `RunTime` increase as we move from *int8* to *int64*? However, that’s not what happens when you run the code. What do you think is the reason? If you are having a hard time coming up with an answer, study and run the following code; it will give you a hint. 

```
for df in dfs.values():
    df.info()
    print('\r\n--DIVIDE--\r\n')
```

**Answer**:

6.	The following code runs another experiment. Once you run it, you will see that `RunTime` increases as we move from *int8* to *int64*. Study the code, and figure out what is different about it that makes the result match our expectations.

```
exp2_df = pd.DataFrame(
    index = dfs.keys(),
    columns=['RunTime']
)
for key in dfs.keys():
    wdf = dfs[key]
    t0 = time.time()
    wdf['Some Calculation'] = wdf.Weight+(wdf.Height**2)
    exp2_df.at[key,'RunTime'] = time.time()-t0
print(exp2_df)
```

**Answer**: 

7.	If you are having trouble answering the question in the preceding step, running the following code will give you an important hint.

```
for df in dfs.values():
    df.info()
    print('\r\n--DIVIDE--\r\n')
```

**Answer**: 


8.	From what you experienced, formulate an answer to the following question: if we aim to use the lightest possible integer data type among *int8*, *int16*, *int32*, and *int64*, how can we make sure that our data will not get corrupted the way it did in step 3?

**Answer**: 
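
One possible way to explore this question in code is sketched below; it is not the only approach. The helper `fits_in` and the sample values are hypothetical and introduced only for illustration, while `np.iinfo` and the `downcast` parameter of `pd.to_numeric` are existing NumPy and pandas features.

```python
import numpy as np
import pandas as pd

def fits_in(values, type_name):
    """Hypothetical helper: True if every value is representable in type_name."""
    info = np.iinfo(np.dtype(type_name))
    return info.min <= values.min() and values.max() <= info.max

# Made-up heights in centimeters, stored as int64 to begin with.
heights = pd.Series([150, 178, 201], dtype='int64')

for type_name in ['int8', 'int16', 'int32', 'int64']:
    print(type_name, fits_in(heights, type_name))

# pandas can also pick the smallest signed integer type that holds the data.
smallest = pd.to_numeric(heights, downcast='integer')
print(smallest.dtype)
```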

9.	From what you experienced, formulate an answer to the following question: in what situations will choosing the lightest possible integer data type be sure to lead to less CPU usage?

**Answer**: 