# Optimizing the Python Code for Big Data 
Balancing Coding Complexity against Computational Complexity 
    
    AUTHOR: Dr. Roy Jafari 


# Chapter 2: Choosing the right data types 

## Challenge 2: int8, int16, int32, or int64?

By the end of this challenge, we will have gained experience in deciding which of the four integer types, *int8*, *int16*, *int32*, or *int64*, to use, and when. 
Use the following prompts to complete this challenge.
1.	The following code creates four pandas DataFrames from a single DataFrame, `person_df`. The `person_df` DataFrame has two columns, `Height` and `Weight`, and 100 million rows. It is filled with random values; the function `np.random.normal()` keeps those values within the range we would reasonably expect for height and weight. The units of measurement are centimeters for height and kilograms for weight. Once `person_df` has been generated, four DataFrames are created from it and stored in the dictionary `dfs`. These DataFrames differ only in their data type: *int8*, *int16*, *int32*, and *int64*, respectively. Run the code and go to the next step. Note that the code might take a few seconds to complete.


In [1]:
import pandas as pd
import numpy as np

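n = 10**8  # 100 million rows

# Create the DataFrame, then fill each column with normally distributed values
person_df = pd.DataFrame(
    index=range(n),
    columns=['Height','Weight']
)
person_df.Height = np.random.normal(178,10,n)  # height in centimeters
person_df.Weight = np.random.normal(83,7,n)    # weight in kilograms

# Cast person_df to each integer type and keep the results in dfs
int_bits = ['8','16','32','64']
dfs = {}
for int_bit in int_bits:
    type_name = f'int{int_bit}'
    dfs[type_name] = person_df.astype(type_name)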

2.	The following code accesses each of the four DataFrames in `dfs` and calls its `.info()` method. Run the code and study its output. Is there a significant difference in the amount of memory these four DataFrames use? If yes, what accounts for it? 

In [2]:
for df in dfs.values():
    df.info()
    print('\r\n')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000000 entries, 0 to 99999999
Data columns (total 2 columns):
 #   Column  Dtype
---  ------  -----
 0   Height  int8 
 1   Weight  int8 
dtypes: int8(2)
memory usage: 190.7 MB


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000000 entries, 0 to 99999999
Data columns (total 2 columns):
 #   Column  Dtype
---  ------  -----
 0   Height  int16
 1   Weight  int16
dtypes: int16(2)
memory usage: 381.5 MB


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000000 entries, 0 to 99999999
Data columns (total 2 columns):
 #   Column  Dtype
---  ------  -----
 0   Height  int32
 1   Weight  int32
dtypes: int32(2)
memory usage: 762.9 MB


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000000 entries, 0 to 99999999
Data columns (total 2 columns):
 #   Column  Dtype
---  ------  -----
 0   Height  int64
 1   Weight  int64
dtypes: int64(2)
memory usage: 1.5 GB




**Answer**: Yes, there is a significant difference between them in terms of memory usage. Their memory usage is as follows:

- *int8*: 190.7 MB
- *int16*: 381.5 MB
- *int32*: 762.9 MB
- *int64*: 1.5 GB

Notice that the memory usage doubles as we move from *int8* to *int16*, from *int16* to *int32*, and from *int32* to *int64*. 

The reason is the amount of memory pandas sets aside for each value of a given integer type: 1, 2, 4, and 8 bytes for *int8*, *int16*, *int32*, and *int64*, respectively.
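The figures reported by `.info()` can be checked by hand, because every value takes a fixed number of bytes. The sketch below is a rough check, not part of the original notebook: it multiplies rows by columns by bytes per value and converts the result to MB, assuming 1 MB = 2^20 bytes (which is consistent with the numbers reported above). The names `n_rows`, `n_cols`, and `expected_mb` are illustrative, and the small overhead of the index is ignored.

```python
import numpy as np

n_rows = 10**8   # rows in person_df
n_cols = 2       # Height and Weight

expected_mb = {}
for type_name in ['int8', 'int16', 'int32', 'int64']:
    # itemsize is the number of bytes one value of this type occupies
    bytes_per_value = np.dtype(type_name).itemsize
    total_bytes = n_rows * n_cols * bytes_per_value
    expected_mb[type_name] = total_bytes / 2**20
    print(f'{type_name}: {expected_mb[type_name]:.1f} MB')
```

The last value comes out at roughly 1525.9 MB, which is about 1.5 GB.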


3.	The values in one of the DataFrames are corrupted. Use the following code to print out the values of all the DataFrames, and figure out which one is corrupted. Investigate to figure out what caused the corruption.

In [3]:
for key,df in dfs.items():
    print(key)
    print(df)
    print('\r\n')

int8
          Height  Weight
0            -95      88
1            -81      75
2            -51      70
3            -75      83
4            -79      80
...          ...     ...
99999995     -66      86
99999996     -85      81
99999997     -74      80
99999998     -87      80
99999999     -82      68

[100000000 rows x 2 columns]


int16
          Height  Weight
0            161      88
1            175      75
2            205      70
3            181      83
4            177      80
...          ...     ...
99999995     190      86
99999996     171      81
99999997     182      80
99999998     169      80
99999999     174      68

[100000000 rows x 2 columns]


int32
          Height  Weight
0            161      88
1            175      75
2            205      70
3            181      83
4            177      80
...          ...     ...
99999995     190      86
99999996     171      81
99999997     182      80
99999998     169      80
99999999     174      68

[100000000 rows 

**Answer**: The Height values in the first DataFrame (the one encoded using *int8*) are corrupted. 

To investigate, we will check the range of values *int8* can encode. The following code uses the function `np.iinfo()` to do that. 

In [4]:
np.iinfo('int8')

iinfo(min=-128, max=127, dtype=int8)

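As we can see, *int8* cannot hold numbers larger than 127, and that is why we cannot use it to encode height in centimeters. Heights above 127 wrapped around into the negative range; for example, the height 161 in the first row was stored as 161 - 256 = -95.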
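The same check can be run for all four types, and the corrupted values can be reproduced with plain arithmetic. The following is a minimal sketch: `wrap_to_int8` is a hypothetical helper written for illustration only, and it assumes the wrap-around behavior visible in the printed output above rather than describing how NumPy performs the cast internally.

```python
import numpy as np

# Print the representable range of each signed integer type.
for type_name in ['int8', 'int16', 'int32', 'int64']:
    info = np.iinfo(type_name)
    print(f'{type_name}: {info.min} to {info.max}')

def wrap_to_int8(value):
    """Illustrative helper: map an integer onto the int8 range by wrap-around."""
    return (value + 128) % 256 - 128

# The first three heights as they appear in the int16 DataFrame above.
heights = [161, 175, 205]
wrapped = [wrap_to_int8(h) for h in heights]
print(wrapped)
```

The wrapped values match the first three `Height` entries printed for the *int8* DataFrame.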

4.	The following code performs a computational experiment to see if the runtime of performing big data manipulations changes when our data is encoded in *int8*, *int16*, *int32*, or *int64*. The code attempts to calculate the **Body Mass Index (BMI)** which is a health metric calculated from the height and weight of an individual. The formula for BMI is weight in Kilograms divided by the square of the height in Meters. Run the following code and study its outcome.

In [5]:
import time
exp1_df = pd.DataFrame(
    index = dfs.keys(),
    columns=['RunTime']
)
for key in dfs.keys():
    wdf = dfs[key]
    t0 = time.time()
    wdf['BMI'] = wdf.Weight/((wdf.Height/100)**2)
    exp1_df.at[key,'RunTime'] = time.time()-t0
print(exp1_df)


        RunTime
int8   1.960551
int16  2.471621
int32  1.801935
int64   3.56097


5.	Were you expecting the `RunTime` to increase as we move from *int8* to *int64*? That is not what happens when you run the code. What do you think is the reason? If you are having a hard time coming up with an answer, study and run the following code; it will give you a hint. 

Yes, that is the natural guess: the bulkier the encoding of the data, the harder the CPU has to work to process it. 

Let's see the hint first before answering why this happens.

In [6]:
for df in dfs.values():
    df.info()
    print('\r\n--DIVIDE--\r\n')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000000 entries, 0 to 99999999
Data columns (total 3 columns):
 #   Column  Dtype  
---  ------  -----  
 0   Height  int8   
 1   Weight  int8   
 2   BMI     float64
dtypes: float64(1), int8(2)
memory usage: 953.7 MB

--DIVIDE--

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000000 entries, 0 to 99999999
Data columns (total 3 columns):
 #   Column  Dtype  
---  ------  -----  
 0   Height  int16  
 1   Weight  int16  
 2   BMI     float64
dtypes: float64(1), int16(2)
memory usage: 1.1 GB

--DIVIDE--

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000000 entries, 0 to 99999999
Data columns (total 3 columns):
 #   Column  Dtype  
---  ------  -----  
 0   Height  int32  
 1   Weight  int32  
 2   BMI     float64
dtypes: float64(1), int32(2)
memory usage: 1.5 GB

--DIVIDE--

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000000 entries, 0 to 99999999
Data columns (total 3 columns):
 #   Column  Dtype  
---  ---

In every DataFrame, the `BMI` column is listed as *float64*. The unexpected `RunTime` values in the previous step come from the time it takes to convert the integer data types to *float64*. In this run, *int32* was converted to *float64* the fastest, which is why the DataFrame with *int32* was the quickest to manipulate.
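The type change can be observed on a tiny example. The sketch below only illustrates where the conversion to float happens, not how long it takes; `tiny_df`, `scaled`, and `bmi` are illustrative names, and the two rows are taken from the values printed earlier.

```python
import pandas as pd

tiny_df = pd.DataFrame({'Height': [161, 175], 'Weight': [88, 75]}).astype('int16')

# True division (/) of integers produces floats, so the result is no longer int16.
scaled = tiny_df.Height / 100
bmi = tiny_df.Weight / scaled**2

print(tiny_df.dtypes.tolist())
print(scaled.dtype)
print(bmi.dtype)
```

Both `scaled` and `bmi` come out as *float64* even though the inputs are *int16*.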

6.	The following code runs another experiment. Once you run the experiment, you will see that the `RunTime` for *int64* is several times higher than for the smaller types. Study the code and figure out what is different in this code that makes the result match our expectations.

In [7]:
exp2_df = pd.DataFrame(
    index = dfs.keys(),
    columns=['RunTime']
)
for key in dfs.keys():
    wdf = dfs[key]
    t0 = time.time()
    wdf['Some Calculation'] = wdf.Weight+(wdf.Height**2)
    exp2_df.at[key,'RunTime'] = time.time()-t0
print(exp2_df)


        RunTime
int8   0.846579
int16  0.499681
int32  0.757181
int64  4.391319


**Answer**: The difference is that the computation that creates the new column `Some Calculation` does not require converting the data from integer to float. That is why this experiment confirms what we would expect: the heavier the data type, the harder the CPU has to work.

7.	If you are having trouble answering the question in the preceding step, running the following code will give you an important hint.

In [8]:
for df in dfs.values():
    df.info()
    print('\r\n--DIVIDE--\r\n')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000000 entries, 0 to 99999999
Data columns (total 4 columns):
 #   Column            Dtype  
---  ------            -----  
 0   Height            int8   
 1   Weight            int8   
 2   BMI               float64
 3   Some Calculation  int8   
dtypes: float64(1), int8(3)
memory usage: 1.0 GB

--DIVIDE--

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000000 entries, 0 to 99999999
Data columns (total 4 columns):
 #   Column            Dtype  
---  ------            -----  
 0   Height            int16  
 1   Weight            int16  
 2   BMI               float64
 3   Some Calculation  int16  
dtypes: float64(1), int16(3)
memory usage: 1.3 GB

--DIVIDE--

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000000 entries, 0 to 99999999
Data columns (total 4 columns):
 #   Column            Dtype  
---  ------            -----  
 0   Height            int32  
 1   Weight            int32  
 2   BMI               float64

**Answer**: We can see that `Some Calculation` has the same data type as the `Height` and `Weight` columns of its DataFrame. No conversion to float took place, and that is why the result in the previous step matches our expectation.
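One caveat worth checking: because the result keeps the input type, it also has to fit in that type's range. The sketch below is an illustration under the assumption that integer overflow wraps around silently, as it did for the *int8* heights; `tiny_df` and `result` are illustrative names, and the two rows are taken from the values printed earlier.

```python
import pandas as pd

tiny_df = pd.DataFrame({'Height': [161, 205], 'Weight': [88, 70]}).astype('int16')

# Integer-only arithmetic: the result keeps the int16 type of its inputs.
result = tiny_df.Weight + tiny_df.Height**2

print(result.dtype)
print(result.tolist())
```

For the first row, 88 + 161² = 26009 fits in *int16*. For the second row, 70 + 205² = 42095 exceeds the *int16* maximum of 32767, so the stored value cannot be the true one.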

8.	From what you experienced, formulate an answer to the following question: if we aim to use the lightest-weight-possible integer data type among *int8*, *int16*, *int32*, and *int64*, how can we make sure that our data will not get corrupted the way it did in step 3?

**Answer**: We have to check the range that the integer type can accommodate. If we expect all of the values to fall within that range, we can use that type; otherwise, we have to move to the next larger data type.
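This range check can be automated. The sketch below is one possible approach, not a definitive implementation: `smallest_int_type` is a hypothetical helper that compares the minimum and maximum of the data against `np.iinfo()` for each type in turn. It only considers the values currently in the data, so values expected in the future still call for judgment. pandas also provides `pd.to_numeric()` with `downcast='integer'`, which picks the smallest signed integer type for the data it is given.

```python
import numpy as np
import pandas as pd

def smallest_int_type(values):
    """Hypothetical helper: first of int8..int64 whose range covers all values."""
    lo, hi = int(values.min()), int(values.max())
    for type_name in ['int8', 'int16', 'int32', 'int64']:
        info = np.iinfo(type_name)
        if info.min <= lo and hi <= info.max:
            return type_name
    raise ValueError('values do not fit in int64')

# Sample values taken from the DataFrames printed earlier.
heights = pd.Series([161, 175, 205])
weights = pd.Series([88, 75, 70])

print(smallest_int_type(heights))
print(smallest_int_type(weights))

# Built-in alternative: let pandas choose the smallest signed integer type.
downcast_heights = pd.to_numeric(heights, downcast='integer')
print(downcast_heights.dtype)
```

On these samples the helper selects *int16* for the heights and *int8* for the weights, which agrees with what the challenge found.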


9.	From what you experienced, formulate an answer to the following question: in what situations will choosing the lightest-weight-possible integer data type reliably lead to less CPU usage?

**Answer**: Only if the data does not have to be converted to another type during the manipulation. In that case, using the lightest-weight-possible integer data type will also lead to less CPU processing.